[doc] Use sphinx roles and update the sample code (closes #119623)

author: Simon Chabot <simon.chabot@logilab.fr>
changeset: db9a8b3f6f16
branch: default
phase: public
hidden: no
parent revision: #6d80b4e863f3 [Aligner] `normalize_set` handles tuples. (closes #117136)
child revision: #59a15b188628 preparing 0.2.0
files modified by this revision
doc.rst
# HG changeset patch
# User Simon Chabot <simon.chabot@logilab.fr>
# Date 1365092255 -7200
# Thu Apr 04 18:17:35 2013 +0200
# Node ID db9a8b3f6f168bf960723fdb895a9fae7b421ec4
# Parent 6d80b4e863f34bc7e889bbf5dad928d0a3f0b878
[doc] Use sphinx roles and update the sample code (closes #119623)

diff --git a/doc.rst b/doc.rst
@@ -17,31 +17,31 @@
1  ============
2 
3  The alignment process is divided into three main steps:
4 
5  1. Gather and format the data we want to align.
6 -   In this step, we define two sets called the ``alignset`` and the
7 -   ``targetset``. The ``alignset`` contains our data, and the
8 -   ``targetset`` contains the data on which we would like to make the links.
9 +   In this step, we define two sets called the `alignset` and the
10 +   `targetset`. The `alignset` contains our data, and the
11 +   `targetset` contains the data on which we would like to make the links.
12  2. Compute the similarity between the items gathered.  We compute a distance
13     matrix between the two sets according to a given distance.
14  3. Find the items having a high similarity thanks to the distance matrix.
15 
16  Simple case
17  -----------
18 
19 -1. Let's define ``alignset`` and ``targetset`` as simple python lists.
20 +1. Let's define `alignset` and `targetset` as simple python lists.
21 
22 -.. code-block:: python
23 +.. sourcecode:: python
24 
25      alignset = ['Victor Hugo', 'Albert Camus']
26      targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
27 
28  2. Now, we have to compute the similarity between each item. For that purpose, the
29     `Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>`_
30     [#]_, which is well suited to comparing short strings, is used.
31 -   Such a function is provided in the ``nazca.distance`` module.
32 +   Such a function is provided in the `nazca.distance` module.
33 
34     The next step is to compute the distance matrix according to the Levenshtein
35     distance. The result is given in the following tables.
36 
37 
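The distance computation itself is nothing exotic; as an illustration, here is a plain dynamic-programming Levenshtein distance applied to the two sets above (a stand-in sketch, not the implementation shipped in `nazca.distances`):

```python
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]

# one row per alignset item, one column per targetset item
matrix = [[levenshtein(a, t) for t in targetset] for a in alignset]
```

Identical strings get a distance of 0, so the smallest value in each row points at the most likely match.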
@@ -68,11 +68,11 @@
38  name), but it is common to have several *attributes* to align, such as the name
39  and the birth date and the birth city. The steps remain the same, except that
40  three distance matrices will be computed, and *items* will be represented as
41  nested lists. See the following example:
42 
43 -.. code-block:: python
44 +.. sourcecode:: python
45 
46      alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
47                  ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
48                  ['Michel Edouard', '18-04-1881', 'Nantes']]
49      targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
@@ -82,17 +82,17 @@
50 
51 
52  In such a case, two distance functions are used, the Levenshtein one for the
53  name and the city and a temporal one for the birth date [#]_.
54 
55 -.. [#] Provided in the ``nazca.distances`` module.
56 +.. [#] Provided in the `nazca.distances` module.
57 
58 
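The temporal distance mentioned above is essentially the gap in days between two dates. A hedged sketch of the idea (not Nazca's actual implementation; it assumes the day-month-year format used in the sample data and tolerates both `-` and `/` separators):

```python
from datetime import datetime

def temporal(a, b, fmt='%d-%m-%Y'):
    """Absolute difference in days between two date strings."""
    # the targetset uses '/' as separator; normalize it first
    da = datetime.strptime(a.replace('/', '-'), fmt)
    db = datetime.strptime(b.replace('/', '-'), fmt)
    return abs((da - db).days)

temporal('14-08-1991', '06-01-1999')  # 2702, the value found in the matrix below
temporal('14-08-1991', '14/08/1991')  # 0
```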
59 -The ``cdist`` function of ``nazca.distances`` enables us to compute those
60 +The :func:`cdist` function of `nazca.distances` enables us to compute those
61  matrices:
62 
63 -.. code-block:: python
64 +.. sourcecode:: python
65 
66      >>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
67      >>>                    'levenshtein', matrix_normalized=False)
68      array([[ 1.,  6.,  5.,  0.],
69             [ 5.,  6.,  0.,  5.],
@@ -106,11 +106,11 @@
70  | Jacques Dupuis | 5           | 6              | 0              | 5           |
71  +----------------+-------------+----------------+----------------+-------------+
72  | Edouard Michel | 6           | 0              | 6              | 6           |
73  +----------------+-------------+----------------+----------------+-------------+
74 
75 -.. code-block:: python
76 +.. sourcecode:: python
77 
78      >>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
79      >>>                    'temporal', matrix_normalized=False)
80      array([[     0.,  40294.,   2702.,   7780.],
81             [  2702.,  42996.,      0.,   5078.],
@@ -124,11 +124,11 @@
82  | 06-01-1999 | 2702       | 42996      | 0          | 5078       |
83  +------------+------------+------------+------------+------------+
84  | 18-04-1881 | 40294      | 0          | 42996      | 48074      |
85  +------------+------------+------------+------------+------------+
86 
87 -.. code-block:: python
88 +.. sourcecode:: python
89 
90      >>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
91      >>>                    'levenshtein', matrix_normalized=False)
92      array([[ 0.,  4.,  8.,  0.],
93             [ 8.,  9.,  0.,  8.],
@@ -158,16 +158,16 @@
94  | 2 | 40304 | 0     | 43011 | 48084 |
95  +---+-------+-------+-------+-------+
96 
97  Allowing some misspelling mistakes (for example *Dupont* and *Dupond* are very
98  close), the matching threshold can be set to 1 or 2. Thus we can see that
99 -item 0 in our ``alignset`` is the same that the item 0 in the ``targetset``, the
100 -1 in the ``alignset`` and the 2 of the ``targetset`` too : the links can be
101 +item 0 in our `alignset` is the same as item 0 in the `targetset`, and that
102 +item 1 in the `alignset` matches item 2 of the `targetset`: the links can be
103  done!
104 
105 -It's important to notice that even if the item 0 of the ``alignset`` and the 3
106 -of the ``targetset`` have the same name and the same birthplace they are
107 +It's important to notice that even if item 0 of the `alignset` and item 3
108 +of the `targetset` have the same name and the same birthplace, they are
109  unlikely to be identical because of their very different birth dates.
110 
111 
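The global matrix used above is just the element-wise sum of the per-attribute matrices, and matching then means keeping the pairs whose summed distance stays below the threshold. A toy sketch of that combination step (Nazca performs this for you; the date and city values here are made up to keep the numbers small):

```python
def combine(matrices):
    """Element-wise sum of several distance matrices of the same shape."""
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*matrices)]

def matches(matrix, threshold):
    """Yield (row, column) pairs whose combined distance is within the threshold."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value <= threshold:
                yield i, j

names = [[1, 6, 5, 0], [5, 6, 0, 5]]   # the name matrix shown earlier
dates = [[0, 3, 2, 7], [2, 4, 0, 5]]   # made-up small values for illustration
cities = [[0, 4, 8, 0], [8, 9, 0, 8]]

combined = combine([names, dates, cities])
list(matches(combined, 2))  # [(0, 0), (1, 2)]
```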
112  You may have noticed that working with matrices by hand, as in this example, is
113  rather tedious. The good news is that this project does all this work for you. You
@@ -178,11 +178,11 @@
114  Real applications
115  =================
116 
117  Just before we start, we will assume the following imports have been done:
118 
119 -.. code-block:: python
120 +.. sourcecode:: python
121 
122      from nazca import dataio as aldio #Functions for input and output data
123      from nazca import distances as ald #Functions to compute the distances
124      from nazca import normalize as aln #Functions to normalize data
125      from nazca import aligner as ala  #Functions to align data
@@ -197,11 +197,11 @@
126 
127  .. [#] Let's imagine the *Goncourt prize winners* category does not exist in
128         dbpedia
129 
130  We simply copy/paste the winners list from Wikipedia into a file and replace all
131 -the separators (``-`` and ``,``) by ``#``. So, the beginning of our file is :
132 +the separators (`-` and `,`) with `#`. So, the beginning of our file is:
133 
134  ..
135 
136      | 1903#John-Antoine Nau#Force ennemie (Plume)
137      | 1904#Léon Frapié#La Maternelle (Albin Michel)
@@ -212,28 +212,28 @@
138  least two elements: an *identifier* (the name, or the URI) and the *attribute* to
139  compare. With the previous file, we will use the name (column number 1)
140  as both the *identifier* (we don't have a *URI* here) and the *attribute* to align.
141  This is expressed in Python with the following code:
142 
143 -.. code-block:: python
144 +.. sourcecode:: python
145 
146      alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
147 
148 -So, the beginning of our ``alignset`` is:
149 +So, the beginning of our `alignset` is:
150 
151 -.. code-block:: python
152 +.. sourcecode:: python
153 
154      >>> alignset[:3]
155      [[u'John-Antoine Nau', u'John-Antoine Nau'],
156       [u'Léon Frapié', u'Léon, Frapié'],
157       [u'Claude Farrère', u'Claude Farrère']]
158 
159 
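The behaviour of `parsefile` on such a `#`-separated file can be pictured with a simplified stand-in (this is not Nazca's actual implementation, which reads from a path and handles more options):

```python
def parse_lines(lines, indexes, delimiter='#'):
    """Split each line on the delimiter and keep only the requested columns."""
    rows = []
    for line in lines:
        fields = line.rstrip('\n').split(delimiter)
        rows.append([fields[i] for i in indexes])
    return rows

sample = ['1903#John-Antoine Nau#Force ennemie (Plume)',
          '1904#Léon Frapié#La Maternelle (Albin Michel)']
parse_lines(sample, indexes=[1, 1])
# [['John-Antoine Nau', 'John-Antoine Nau'], ['Léon Frapié', 'Léon Frapié']]
```

Asking twice for column 1 (`indexes=[1, 1]`) is what makes the name serve as both the identifier and the attribute to align.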
160 -Now, let's build the ``targetset`` thanks to a *sparql query* and the dbpedia
161 +Now, let's build the `targetset` thanks to a *sparql query* and the dbpedia
162  end-point:
163 
164 -.. code-block:: python
165 +.. sourcecode:: python
166 
167     query = """
168          SELECT ?writer, ?name WHERE {
169            ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
170            ?writer rdfs:label ?name.
@@ -245,44 +245,44 @@
171  Both functions return nested lists as presented before. Now, we have to define
172  the distance function to be used for the alignment. This is done thanks to a
173  python dictionary where the keys are the columns to work on, and the values are
174  the treatments to apply.
175 
176 -.. code-block:: python
177 +.. sourcecode:: python
178 
179      treatments = {1: {'metric': ald.levenshtein}} #Use a levenshtein on the name
180 
181 -Finally, the last thing we have to do, is to call the ``align`` function:
182 +Finally, the last thing we have to do is to call the :func:`alignall` function:
183 
184 -.. code-block:: python
185 +.. sourcecode:: python
186 
187      alignments = ala.alignall(alignset, targetset,
188                             0.4, #This is the matching threshold
189                             treatments,
190                            mode=None, #We'll discuss that later
191                             uniq=True #Get the best results only
192                            )
193 
194  This function returns an iterator over the alignments that were carried out.
195 
196 -.. code-block:: python
197 +.. sourcecode:: python
198 
199      for a, t in alignments:
200          print '%s has been aligned onto %s' % (a, t)
201 
202  It may be important to apply some pre-treatment to the data to align. For
203  instance, names can be written in lowercase or uppercase, with extra
204  characters such as punctuation, or with unwanted information in parentheses. That
205 -is why we provide some functions to `normalize` your data. The most useful may
206 -be the `simplify()` function (see the docstring for more information). So the
207 +is why we provide some functions to ``normalize`` your data. The most useful may
208 +be the :func:`simplify` function (see the docstring for more information). So the
209  treatments list can be given as follows:
210 
211 
212 -.. code-block:: python
213 +.. sourcecode:: python
214 
215      def remove_after(string, sub):
216 -        """ Remove the text after ``sub`` in ``string``
217 +        """ Remove the text after `sub` in `string`
218              >>> remove_after('I like cats and dogs', 'and')
219              'I like cats'
220              >>> remove_after('I like cats and dogs', '(')
221              'I like cats and dogs'
222          """
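The body of `remove_after` is elided by the diff; a minimal implementation consistent with its docstring examples could be:

```python
def remove_after(string, sub):
    """Remove the text from `sub` onward in `string` (including `sub` itself).

    If `sub` does not occur, the string is returned unchanged.
    """
    index = string.find(sub)
    if index == -1:
        return string
    return string[:index].rstrip()

remove_after('I like cats and dogs', 'and')  # 'I like cats'
remove_after('I like cats and dogs', '(')    # 'I like cats and dogs'
```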
@@ -300,11 +300,11 @@
223 
224 
225  Cities alignment
226  ----------------
227 
228 -The previous case with the `Goncourt prize winners` was pretty simply because
229 +The previous case with the ``Goncourt prize winners`` was pretty simple because
230  the number of items was small and the computation fast. But in a more realistic use
231  case, the number of items to align may be huge (thousands or millions…). In
232  such a case it's unthinkable to build the global alignment matrix, because it
233  would be too big and it would take (at least…) a few days to complete the computation.
234  So the idea is to make small groups of possible similar data to compute smaller
@@ -313,11 +313,11 @@
235  functions to group text and numerical data.
236 
237 
238  This is done by the following python code:
239 
240 -.. code-block:: python
241 +.. sourcecode:: python
242 
243      targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
244                                 'Any U, N, LONG, LAT WHERE X is Location, X name'
245                                 ' N, X country C, C name "France", X longitude'
246                                 ' LONG, X latitude LAT, X population > 1000, X'
@@ -360,56 +360,64 @@
247  possible matches.
248 
249  So, in the next step, we define the treatments to apply.
250  It is the same as before, but we ask for a non-normalized matrix
251  (i.e.: the real output of the levenshtein distance).
252 -Thus, we call the ``alignall`` function. ``indexes`` is a tuple saying the
253 -position of the point on which the kdtree_ must be built, ``mode`` is the mode
254 +Thus, we call the :func:`alignall` function. `indexes` is a tuple giving the
255 +position of the point on which the kdtree_ must be built, and `mode` is the mode
256  used to find neighbours [#]_.
257 
258 -Finally, ``uniq`` ask to the function to return the best
259 +Finally, `uniq` asks the function to return only the best
260  candidate (i.e. the one having the shortest distance below the given threshold).
261 
262 -.. [#] The available modes are ``kdtree``, ``kmeans`` and ``minibatch`` for
263 -       numerical data and ``minhashing`` for text one.
264 +.. [#] The available modes are `kdtree`, `kmeans` and `minibatch` for
265 +       numerical data and `minhashing` for textual data.
266 
267  The function outputs a generator yielding tuples where the first element is the
268 -identifier of the ``alignset`` item and the second is the ``targetset`` one (It
269 +identifier of the `alignset` item and the second is the `targetset` one (it
270  may take some time before yielding the first tuples, because all the computation
271  must be done…).
272 
273  .. _kdtree: http://en.wikipedia.org/wiki/K-d_tree
274 
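The intuition behind these neighbour-finding modes can be shown with a much cruder stand-in: bucketing points on a coordinate grid so that only items sharing a bucket are ever compared (this is merely an illustration of the grouping idea, not the kdtree_ mode's actual algorithm):

```python
def grid_blocks(points, cell=0.5):
    """Group point indices by the grid cell their (longitude, latitude) falls in."""
    groups = {}
    for index, (lon, lat) in enumerate(points):
        key = (int(lon // cell), int(lat // cell))
        groups.setdefault(key, []).append(index)
    return groups

points = [(2.35, 48.85),   # Paris
          (2.36, 48.86),   # a nearby point: same bucket, so compared
          (5.37, 43.30)]   # Marseille: different bucket, never compared
grid_blocks(points)  # two buckets: the Paris pair together, Marseille alone
```

Only the pairs inside each group go through the expensive distance computation, which is what keeps the matrices small.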
275 -The `alignall_iterative` and `cache` usage
276 +The :func:`alignall_iterative` and `cache` usage
277 ================================================
278 
279 -Even when using methods such as ``kdtree`` or ``minhashing`` or ``clustering``,
280 +Even when using methods such as `kdtree`, `minhashing` or `clustering`,
281  the alignment process might be long. That's why we provide a function,
282 -called `alignall_iterative` which works directly with your files. The idea
283 +called :func:`alignall_iterative`, which works directly with your files. The idea
284  behind this function is simple: it splits your files (the `alignfile` and the
285  `targetfile`) into smaller ones and tries to align each item of each subset.
286  When processing, if an alignment is estimated almost perfect,
287  then the aligned item is *removed* from the `alignset` to speed up the process − so
288  Nazca doesn't retry to align it.
289 
290  Moreover, this function uses a cache system. When an alignment is done, it is
291  stored in the cache, and if a *better* alignment is found later, the
292  cache is updated. At the end, you get only the best alignment found.
293 
294 -.. code-block:: python
295 +.. sourcecode:: python
296 +
297 +    from difflib import SequenceMatcher
298 +
299 +    from nazca import normalize as aln #Functions to normalize data
300 +    from nazca import aligner as ala  #Functions to align data
301 +
302 +    def approxMatch(x, y):
303 +        return 1.0 - SequenceMatcher(None, x, y).ratio()
304 
305      alignformat = {'indexes': [0, 3, 2],
306                     'formatopt': {0: lambda x:x.decode('utf-8'),
307                                   1: lambda x:x.decode('utf-8'),
308                                   2: lambda x:x.decode('utf-8'),
309                                  },
310                    }
311 
312 -    targetformat = {'indexes': [0, 3, 2],
313 +    targetformat = {'indexes': [0, 1, 3],
314                     'formatopt': {0: lambda x:x.decode('utf-8'),
315                                   1: lambda x:x.decode('utf-8'),
316 -                                 2: lambda x:x.decode('utf-8'),
317 +                                 3: lambda x:x.decode('utf-8'),
318                                  },
319                    }
320 
321      tr_name = {'normalization': [aln.simplify],
322                 'metric': approxMatch,
@@ -436,11 +444,11 @@
323      with open('results.csv', 'w') as fobj:
324          for aligned, (targeted, distance) in alignments.iteritems():
325              fobj.write('%s\t%s\t%s\n' % (aligned, targeted, distance))
326 
327  Roughly, this function expects the same arguments as the previously shown
328 -`alignall` function, excepting the `equality_threshold` and the `size`.
329 +:func:`alignall` function, except for `equality_threshold` and `size`.
330 
331   - `size` is the number of items in each subset
332   - `equality_threshold` is the threshold below which two items are considered
333     equal.
334