[doc] Little explanation on alignall_iterative() (closes #116943)

author: Simon Chabot <simon.chabot@logilab.fr>
changeset: e5f1e678e654
branch: default
phase: public
hidden: no
parent revision: #33cc52731e55 [aligner] Speed up the alignset reduction (closes #116942)
child revision: #6d80b4e863f3 [Aligner] `normalize_set` handles tuples. (closes #117136)
files modified by this revision: doc.rst
# HG changeset patch
# User Simon Chabot <simon.chabot@logilab.fr>
# Date 1359555263 -3600
# Wed Jan 30 15:14:23 2013 +0100
# Node ID e5f1e678e6546c0d2a2bfc9a50e21f15504f4092
# Parent 33cc52731e552f1cac65f4266ed0a59ca5954c54
[doc] Little explanation on alignall_iterative() (closes #116943)

diff --git a/doc.rst b/doc.rst
@@ -348,17 +348,18 @@
1                             mode='kdtree',  #The mode to use
2                             uniq=True) #Return only the best results
3 
4 
5  Let's explain the code. We have two files, containing a list of cities we want
6 -to align, the first column is the identifier, and the second is the name of the city
7 -and the last one is location of the city (longitude and latitude), gathered into
8 -a single tuple.
9 +to align. The first column is the identifier, the second is the name of the
10 +city, and the last one is the location of the city (longitude and latitude),
11 +gathered into a single tuple.
12 
13 -In this example, we want to build a *kdtree* on the couple (latitude, longitude) to
14 -divide our data in few candidates. This clustering is coarse, and is only used to reduce
15 -the potential candidats without loosing any more refined possible matchs.
16 +In this example, we want to build a *kdtree* on the pair (latitude, longitude)
17 +to divide our data into a few candidates. This clustering is coarse, and is only
18 +used to reduce the number of potential candidates without losing any possible
19 +match that a more refined comparison would find.
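
To make the idea of a coarse candidate search concrete, here is a minimal pure-Python sketch of a 2-d tree with a radius query. It is not Nazca's implementation: the names ``build_kdtree`` and ``candidates`` are hypothetical, the coordinates are approximate, and plain Euclidean distance on degrees is a simplifying assumption.

```python
import math


def build_kdtree(records, depth=0):
    """Build a 2-d tree from ``(identifier, (latitude, longitude))`` records."""
    if not records:
        return None
    axis = depth % 2  # alternate between latitude (0) and longitude (1)
    records = sorted(records, key=lambda rec: rec[1][axis])
    median = len(records) // 2
    return {
        'record': records[median],
        'axis': axis,
        'left': build_kdtree(records[:median], depth + 1),
        'right': build_kdtree(records[median + 1:], depth + 1),
    }


def candidates(node, center, radius, found=None):
    """Collect identifiers of records lying within ``radius`` of ``center``."""
    if found is None:
        found = []
    if node is None:
        return found
    identifier, coords = node['record']
    if math.hypot(coords[0] - center[0], coords[1] - center[1]) <= radius:
        found.append(identifier)
    axis = node['axis']
    # Only descend into a side of the split that the search disc can reach.
    if center[axis] - radius <= coords[axis]:
        candidates(node['left'], center, radius, found)
    if center[axis] + radius >= coords[axis]:
        candidates(node['right'], center, radius, found)
    return found


# Hypothetical target set: (identifier, (latitude, longitude)).
targetset = [
    ('paris', (48.8566, 2.3522)),
    ('versailles', (48.8049, 2.1204)),
    ('marseille', (43.2965, 5.3698)),
    ('lyon', (45.7640, 4.8357)),
]
tree = build_kdtree(targetset)
near_paris = sorted(candidates(tree, (48.85, 2.35), 0.5))
```

Only the identifiers returned by the query would then be passed to the more expensive string comparison; far-away cities are never compared.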
20 
21  So, in the next step, we define the treatments to apply.
22  It is the same as before, but we ask for a non-normalized matrix
23  (i.e.: the real output of the levenshtein distance).
24  Thus, we call the ``alignall`` function. ``indexes`` is a tuple saying the
@@ -376,10 +377,75 @@
25  may take some time before yielding the first tuples, because all the computation
26  must be done…)
27 
28  .. _kdtree: http://en.wikipedia.org/wiki/K-d_tree
29 
30 +The `alignall_iterative` and `cache` usage
31 +==========================================
32 +
33 +Even with methods such as ``kdtree``, ``minhashing`` or ``clustering``,
34 +the alignment process can take a long time. That is why we provide a function
35 +called ``alignall_iterative``, which works directly on your files. The idea
36 +behind this function is simple: it splits your files (the ``alignfile`` and
37 +the ``targetfile``) into smaller ones and tries to align each item of each
38 +subset. During processing, if an alignment is estimated to be almost perfect,
39 +the aligned item is *removed* from the ``alignset`` to speed up the process,
40 +so Nazca does not try to align it again.
41 +
42 +Moreover, this function uses a cache. When an alignment is found, it is
43 +stored in the cache, and if a *better* alignment is found later, the cache
44 +is updated. At the end, you get only the best alignment found for each item.
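
The cache rule can be sketched in a few lines. This is an illustration only, not Nazca's code: ``update_cache`` is a hypothetical helper, and it assumes that the cache maps an aligned identifier to a ``(target identifier, distance)`` pair and that "better" means a smaller distance.

```python
def update_cache(cache, aligned_id, target_id, distance):
    """Keep only the closest target seen so far for ``aligned_id``."""
    best = cache.get(aligned_id)
    if best is None or distance < best[1]:
        cache[aligned_id] = (target_id, distance)
    return cache


cache = {}
update_cache(cache, 'a1', 't7', 0.18)  # first alignment found: stored
update_cache(cache, 'a1', 't3', 0.04)  # smaller distance: replaces the entry
update_cache(cache, 'a1', 't9', 0.12)  # larger distance: ignored
```

After these three calls the cache holds a single entry for ``a1``, pointing at ``t3``.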
45 +
46 +.. code-block:: python
47 +
48 +    alignformat = {'indexes': [0, 3, 2],
49 +                   'formatopt': {0: lambda x:x.decode('utf-8'),
50 +                                 1: lambda x:x.decode('utf-8'),
51 +                                 2: lambda x:x.decode('utf-8'),
52 +                                },
53 +                  }
54 +
55 +    targetformat = {'indexes': [0, 3, 2],
56 +                    'formatopt': {0: lambda x:x.decode('utf-8'),
57 +                                  1: lambda x:x.decode('utf-8'),
58 +                                  2: lambda x:x.decode('utf-8'),
59 +                                 },
60 +                   }
61 +
62 +    tr_name = {'normalization': [aln.simplify],
63 +               'metric': approxMatch,
64 +               'matrix_normalized': False,
65 +              }
66 +    tr_info = {'normalization': [aln.simplify],
67 +               'metric': approxMatch,
68 +               'matrix_normalized': False,
69 +               'weighting': 0.3,
70 +              }
71 +
72 +    alignments = ala.alignall_iterative('align_csvfile', 'target_csvfile',
73 +                                        alignformat, targetformat, 0.20,
74 +                                        treatments={1:tr_name,
75 +                                                    2:tr_info,
76 +                                                   },
77 +                                        equality_threshold=0.05,
78 +                                        size=25000,
79 +                                        mode='minhashing',
80 +                                        indexes=(1,1),
81 +                                        neighbours_threshold=0.2,
82 +                                       )
83 +
84 +    with open('results.csv', 'w') as fobj:
85 +        for aligned, (targeted, distance) in alignments.iteritems():
86 +            fobj.write('%s\t%s\t%s\n' % (aligned, targeted, distance))
87 +
88 +Roughly, this function expects the same arguments as the previously shown
89 +``alignall`` function, plus ``equality_threshold`` and ``size``:
90 +
91 + - ``size`` is the number of items to put in each subset
92 + - ``equality_threshold`` is the distance threshold below which two items
93 +   are considered equal.
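
To illustrate how ``size`` and ``equality_threshold`` interact, here is a self-contained sketch of the splitting strategy. It is a simplified model and not the actual ``alignall_iterative`` code: ``chunks``, ``toy_distance`` and ``iterative_align`` are hypothetical names, the metric is a toy one, and items are assumed to be ``(identifier, name)`` pairs.

```python
def chunks(items, size):
    """Yield successive slices of ``items`` holding at most ``size`` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def toy_distance(a, b):
    """Hypothetical metric: share of differing positions, between 0 and 1."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - float(same) / length


def iterative_align(alignset, targetset, threshold, size, equality_threshold):
    """Align subset by subset, dropping items once they match almost perfectly."""
    best = {}                  # aligned id -> (target id, distance)
    remaining = list(alignset)
    for target_chunk in chunks(targetset, size):
        for align_chunk in chunks(remaining, size):
            for a_id, a_name in align_chunk:
                for t_id, t_name in target_chunk:
                    distance = toy_distance(a_name, t_name)
                    if distance > threshold:
                        continue  # too far apart to count as an alignment
                    if a_id not in best or distance < best[a_id][1]:
                        best[a_id] = (t_id, distance)
        # Items already aligned below equality_threshold are not retried.
        remaining = [(a_id, a_name) for a_id, a_name in remaining
                     if not (a_id in best and best[a_id][1] < equality_threshold)]
    return best


alignset = [('a1', 'paris'), ('a2', 'lyon'), ('a3', 'nantes')]
targetset = [('t1', 'paris'), ('t2', 'lyons'), ('t3', 'pariz'), ('t4', 'nantes')]
result = iterative_align(alignset, targetset, threshold=0.25, size=2,
                         equality_threshold=0.05)
```

In this run ``a1`` matches ``t1`` exactly in the first target subset, so it is dropped and never compared with ``t3`` or ``t4``; ``a2`` keeps its approximate match with ``t2``, and ``a3`` is aligned with ``t4`` in the second subset.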
94 +
95  `Try <http://demo.cubicweb.org/nazca/view?vid=nazca>`_ it online !
96  ==================================================================
97 
98  We have also made a little application of Nazca, using `CubicWeb
99  <http://www.cubicweb.org/>`_. This application provides a user interface for