# [doc] Typo + a little text about the online demo. (closes #116939)

author Simon Chabot 97900fe196c9 default public no #3365a267b308 [aligner] Enables the user to give formatting options (closes #116930) #6afc3891e633 [dataio] Implements split_file() (closes #116931)
files modified by this revision
doc.rst
# HG changeset patch
# User Simon Chabot <simon.chabot@logilab.fr>
# Date 1358939215 -3600
# Wed Jan 23 12:06:55 2013 +0100
# Node ID 97900fe196c948f4962abfb7bcf52ae07977689b
[doc] Typo + a little text about the online demo. (closes #116939)

diff --git a/doc.rst b/doc.rst
@@ -1,68 +1,70 @@
```1 +==================
2  Alignment project
3  ==================
4
5  What is it for ?
6 -----------------
7 +================
8
9  This python library aims to help you to *align data*. For instance, you have a
10  list of cities, described by their name and their country and you would like to
12  the latitude for example. If you have two or three cities, it can be done with
13  bare hands, but not if there are hundreds or thousands of cities.
14  This library provides all the tools you need to do it.
15
16
17  Introduction
18 -------------
19 +============
20
21  The alignment process is divided into three main steps:
22
23  1. Gather and format the data we want to align.
24     In this step, we define two sets called the ``alignset`` and the
25     ``targetset``. The ``alignset`` contains our data, and the
26     ``targetset`` contains the data on which we would like to make the links.
27 -2. Compute the similarity between the items gathered.
28 -   We compute a distance matrix between the two sets according a given distance.
29 +2. Compute the similarity between the items gathered.  We compute a distance
30 +   matrix between the two sets according to a given distance.
31  3. Find the items having a high similarity thanks to the distance matrix.
32
33  Simple case
34 -^^^^^^^^^^^
35 +-----------
36
37 -Let's defining ``alignset`` and ``targetset`` as simple python lists.
38 +1. Let's define ``alignset`` and ``targetset`` as simple python lists.
39
40  .. code-block:: python
41
42      alignset = ['Victor Hugo', 'Albert Camus']
43      targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
44
45 -Now, we have to compute the similarity between each items. For that purpose, the
46 -`Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>`_
47 -[#]_, which is well accurate to compute the distance between few words, is used.
48 -Such a function is provided in the ``nazca.distance`` module.
49 +2. Now, we have to compute the similarity between each pair of items. For that purpose, the
50 +   `Levenshtein distance <http://en.wikipedia.org/wiki/Levenshtein_distance>`_
51 +   [#]_, which is well suited to computing the distance between a few words, is used.
52 +   Such a function is provided in the ``nazca.distances`` module.
53 +
54 +   The next step is to compute the distance matrix according to the Levenshtein
55 +   distance. The result is given in the following table.
56 +
57 +
58 +   +--------------+--------------+-----------------------+-------------+
59 +   |              | Albert Camus | Guillaume Apollinaire | Victor Hugo |
60 +   +==============+==============+=======================+=============+
61 +   | Victor Hugo  | 6            | 9                     | 0           |
62 +   +--------------+--------------+-----------------------+-------------+
63 +   | Albert Camus | 0            | 8                     | 6           |
64 +   +--------------+--------------+-----------------------+-------------+
65
66  .. [#] Also called the *edit distance*, because the distance between two words
67         is equal to the number of single-character edits required to change one
68         word into the other.
69
70 -The next step is to compute the distance matrix according to the Levenshtein
71 -distance. The result is given in the following tables.
72 -
73
74 -+--------------+--------------+-----------------------+-------------+
75 -|              | Albert Camus | Guillaume Apollinaire | Victor Hugo |
76 -+==============+==============+=======================+=============+
77 -| Victor Hugo  | 6            | 9                     | 0           |
78 -+--------------+--------------+-----------------------+-------------+
79 -| Albert Camus | 0            | 8                     | 6           |
80 -+--------------+--------------+-----------------------+-------------+
81 -
82 -The alignment process is ended by reading the matrix and saying items having a
83 -value inferior to a given threshold are identical.
84 +3. The alignment process ends by reading the matrix and declaring that items whose
85 +   distance is below a given threshold are identical.
86
87  A more complex one
88 -^^^^^^^^^^^^^^^^^^
89 +------------------
90
91  The previous case was simple, because we had only one *attribute* to align (the
92  name), but it is frequent to have a lot of *attributes* to align, such as the name
93  and the birth date and the birth city. The steps remain the same, except that
94  three distance matrices will be computed, and *items* will be represented as
```
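As an illustration of the three steps of the simple case in the hunk above, here is a minimal sketch in plain Python. It is not nazca's implementation: `levenshtein`, `distance_matrix` and `matches` are hypothetical helpers written for this example, and the edit distance is the textbook character-based one. The tables in this document give a distance of 0 between 'Paul Dupont' and 'Dupont Paul', which suggests the library's metric treats word order differently, so the raw values computed here may not match them.

```python
def levenshtein(a, b):
    """Textbook edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(alignset, targetset, distance):
    """Step 2: one row per alignset item, one column per targetset item."""
    return [[distance(a, t) for t in targetset] for a in alignset]


def matches(matrix, threshold):
    """Step 3: yield (row, column) pairs whose distance is below the threshold."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value < threshold:
                yield i, j


# Step 1: the two sets, as in the document.
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']

matrix = distance_matrix(alignset, targetset, levenshtein)
pairs = list(matches(matrix, threshold=1))  # hypothetical threshold
```

With a threshold of 1, only exact matches are kept; a larger threshold would tolerate small spelling differences.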
@@ -71,54 +73,78 @@
```95  .. code-block:: python
96
97      alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
98                  ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
99                  ['Michel Edouard', '18-04-1881', 'Nantes']]
100 -    targets = [['Dupond Paul', '14/08/1991', 'Paris'],
101 -                ['Edouard Michel', '18/04/1881', 'Nantes'],
102 -                ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
103 -                ['Dupont Paul', '01-12-2012', 'Paris']]
104 +    targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
105 +                 ['Edouard Michel', '18/04/1881', 'Nantes'],
106 +                 ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
107 +                 ['Dupont Paul', '01-12-2012', 'Paris']]
108
109
110  In such a case, two distance functions are used, the Levenshtein one for the
111  name and the city and a temporal one for the birth date [#]_.
112
113 -.. [#] Provided in the ``nazca.distance`` module.
114 +.. [#] Provided in the ``nazca.distances`` module.
115
116
117 -We obtain the three following matrices:
118 +The ``cdist`` function of ``nazca.matrix`` enables us to compute those
119 +matrices:
120
121 -For the name
122 -    +----------------+-------------+----------------+----------------+-------------+
123 -    |                | Dupond Paul | Edouard Michel | Dupuis Jacques | Dupont Paul |
124 -    +================+=============+================+================+=============+
125 -    | Paul Dupont    | 1           | 6              | 5              | 0           |
126 -    +----------------+-------------+----------------+----------------+-------------+
127 -    | Jacques Dupuis | 5           | 6              | 0              | 5           |
128 -    +----------------+-------------+----------------+----------------+-------------+
129 -    | Edouard Michel | 6           | 0              | 6              | 6           |
130 -    +----------------+-------------+----------------+----------------+-------------+
131 -For the birth date
132 -    +------------+------------+------------+------------+------------+
133 -    |            | 14/08/1991 | 18/04/1881 | 06/01/1999 | 01-12-2012 |
134 -    +============+============+============+============+============+
135 -    | 14-08-1991 | 0          | 40294      | 2702       | 7780       |
136 -    +------------+------------+------------+------------+------------+
137 -    | 06-01-1999 | 2702       | 42996      | 0          | 5078       |
138 -    +------------+------------+------------+------------+------------+
139 -    | 18-04-1881 | 40294      | 0          | 42996      | 48074      |
140 -    +------------+------------+------------+------------+------------+
141 -For the city
142 -    +-----------+-------+--------+-----------+-------+
143 -    |           | Paris | Nantes | Bressuire | Paris |
144 -    +===========+=======+========+===========+=======+
145 -    | Paris     | 0     | 4      | 8         | 0     |
146 -    +-----------+-------+--------+-----------+-------+
147 -    | Bressuire | 8     | 9      | 0         | 8     |
148 -    +-----------+-------+--------+-----------+-------+
149 -    | Nantes    | 4     | 0      | 9         | 4     |
150 -    +-----------+-------+--------+-----------+-------+
151 +.. code-block:: python
152 +
153 +    >>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
154 +    ...                    'levenshtein', matrix_normalized=False)
155 +    array([[ 1.,  6.,  5.,  0.],
156 +           [ 5.,  6.,  0.,  5.],
157 +           [ 6.,  0.,  6.,  6.]], dtype=float32)
158 +
159 ++----------------+-------------+----------------+----------------+-------------+
160 +|                | Dupond Paul | Edouard Michel | Dupuis Jacques | Dupont Paul |
161 ++================+=============+================+================+=============+
162 +| Paul Dupont    | 1           | 6              | 5              | 0           |
163 ++----------------+-------------+----------------+----------------+-------------+
164 +| Jacques Dupuis | 5           | 6              | 0              | 5           |
165 ++----------------+-------------+----------------+----------------+-------------+
166 +| Edouard Michel | 6           | 0              | 6              | 6           |
167 ++----------------+-------------+----------------+----------------+-------------+
168 +
169 +.. code-block:: python
170 +
171 +    >>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
172 +    ...                    'temporal', matrix_normalized=False)
173 +    array([[     0.,  40294.,   2702.,   7780.],
174 +           [  2702.,  42996.,      0.,   5078.],
175 +           [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
176 +
177 ++------------+------------+------------+------------+------------+
178 +|            | 14/08/1991 | 18/04/1881 | 06/01/1999 | 01-12-2012 |
179 ++============+============+============+============+============+
180 +| 14-08-1991 | 0          | 40294      | 2702       | 7780       |
181 ++------------+------------+------------+------------+------------+
182 +| 06-01-1999 | 2702       | 42996      | 0          | 5078       |
183 ++------------+------------+------------+------------+------------+
184 +| 18-04-1881 | 40294      | 0          | 42996      | 48074      |
185 ++------------+------------+------------+------------+------------+
186 +
187 +.. code-block:: python
188 +
189 +    >>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
190 +    ...                    'levenshtein', matrix_normalized=False)
191 +    array([[ 0.,  4.,  8.,  0.],
192 +           [ 8.,  9.,  0.,  8.],
193 +           [ 4.,  0.,  9.,  4.]], dtype=float32)
194 +
195 ++-----------+-------+--------+-----------+-------+
196 +|           | Paris | Nantes | Bressuire | Paris |
197 ++===========+=======+========+===========+=======+
198 +| Paris     | 0     | 4      | 8         | 0     |
199 ++-----------+-------+--------+-----------+-------+
200 +| Bressuire | 8     | 9      | 0         | 8     |
201 ++-----------+-------+--------+-----------+-------+
202 +| Nantes    | 4     | 0      | 9         | 4     |
203 ++-----------+-------+--------+-----------+-------+
204
205
206  The next step is gathering those three matrices into a global one, called the
207  `global alignment matrix`. Thus we have :
208
```
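The birth-date matrix in the hunk above is consistent with a distance counted in days between day-first dates. The following is a minimal sketch under that assumption, using only the standard library. `parse_date` and `days_between` are hypothetical helpers written for this example, not functions of nazca, and the two accepted formats are simply the ones that appear in the example data.

```python
from datetime import datetime

# Day-first formats seen in the example data (an assumption of this sketch).
DATE_FORMATS = ('%d-%m-%Y', '%d/%m/%Y')


def parse_date(text):
    """Parse a day-first date written with '-' or '/' separators."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unsupported date format: %r' % text)


def days_between(first, second):
    """Absolute number of days separating two dates."""
    return abs((parse_date(first) - parse_date(second)).days)


align_dates = ['14-08-1991', '06-01-1999', '18-04-1881']
target_dates = ['14/08/1991', '18/04/1881', '06/01/1999', '01-12-2012']

matrix = [[days_between(a, t) for t in target_dates] for a in align_dates]
```

Computed this way, the first row is `[0, 40294, 2702, 7780]`, which matches the table for the birth date.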
@@ -148,23 +174,23 @@
209  just have to give the sets and distance functions and that's all. Another piece of
210  good news is that the project comes with the functions needed to build the sets!
211
212
213  Real applications
214 ------------------
215 +=================
216
217  Just before we start, we will assume the following imports have been done:
218
219  .. code-block:: python
220
221      from nazca import dataio as aldio #Functions for input and output data
222 -    from nazca import distance as ald #Functions to compute the distances
223 +    from nazca import distances as ald #Functions to compute the distances
224      from nazca import normalize as aln#Functions to normalize data
225      from nazca import aligner as ala  #Functions to align data
226
227  The Goncourt prize
228 -^^^^^^^^^^^^^^^^^^
229 +------------------
230
231  On wikipedia, we can find the `Goncourt prize winners
232  <https://fr.wikipedia.org/wiki/Prix_Goncourt#Liste_des_laur.C3.A9ats>`_, and we
233  would like to establish a link between the winners and their URI on dbpedia
234  [#]_.
```
@@ -188,11 +214,11 @@
235  as *identifier* (we don't have a *URI* here as identifier) and *attribute* to align.
236  This is expressed in Python with the following code:
237
238  .. code-block:: python
239
240 -    alignset = adio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
241 +    alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
242
243  So, the beginning of our ``alignset`` is:
244
245  .. code-block:: python
246
```
@@ -212,36 +238,38 @@
```247            ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
248            ?writer rdfs:label ?name.
249            FILTER(lang(?name) = 'fr')
250         }
251      """
252 -    targetset = adio.sparqlquery('http://dbpedia.org/sparql', query)
253 +    targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)
254
255  Both functions return nested lists as presented before. Now, we have to define
256  the distance function to be used for the alignment. This is done thanks to a
257  python dictionary where the keys are the columns to work on, and the values are
258  the treatments to apply.
259
260  .. code-block:: python
261
262 -    treatments = {1: {'metric': ald.levenshtein}}
263 +    treatments = {1: {'metric': ald.levenshtein}} #Use a levenshtein on the name
264
265  Finally, the last thing we have to do, is to call the ``align`` function:
266
267  .. code-block:: python
268
269 -    global_matrix, hasmatched = ala.align(alignset,
270 -                                          targset,
271 -                                          0.4,   #This is the matching threshold
272 -                                          treatments,
273 -                                          'goncourtprize_alignment')
274 +    alignments = ala.alignall(alignset, targetset,
275 +                           0.4, #This is the matching threshold
276 +                           treatments,
277 +                           mode=None, #We'll discuss that later
279 +                          )
280
281 -The alignment results will be written into the `goncourtprize_alignment` file
282 -(note that this is optional, we could have work directly with the global matrix
283 -without writting the results).
284 -The `align` function returns the global alignment matrix and a boolean set to
285 -``True`` if at least one matching has been done, ``False`` otherwise.
286 +This function returns an iterator over the alignments that were carried out.
287 +
288 +.. code-block:: python
289 +
290 +    for a, t in alignments:
291 +        print '%s has been aligned onto %s' % (a, t)
292
293  It may be important to apply some pre-treatment on the data to align. For
294  instance, names can be written with lower or upper characters, with extra
295  characters as punctuation or unwanted information in parenthesis and so on. That
296  is why we provide some functions to `normalize` your data. The most useful may
```
@@ -270,11 +298,11 @@
```297                       }
298                   }
299
300
301  Cities alignment
302 -^^^^^^^^^^^^^^^^
303 +----------------
304
305  The previous case with the `Goncourt prize winners` was pretty simple because
306  the number of items was small, and the computation fast. But in a more realistic use
307  case, the number of items to align may be huge (some thousands or millions…). In
308  such a case it's unthinkable to build the global alignment matrix because it
```
@@ -287,16 +315,33 @@
```309
310  This is done by the following python code:
311
312  .. code-block:: python
313
314 -    targetset = aldio.parsefile('FR.txt', indexes=[0, 1, (4, 5)])
315 -    alignset = aldio.parsefile('frenchbnf', indexes=[0, 2, (14, 12)])
316 +    targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
317 +                               'Any U, N, LONG, LAT WHERE X is Location, X name'
318 +                               ' N, X country C, C name "France", X longitude'
319 +                               ' LONG, X latitude LAT, X population > 1000, X'
320 +                               ' feature_class "P", X cwuri U',
321 +                               indexes=[0, 1, (2, 3)])
322 +    alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
323 +                                 'prefix db-owl: <http://dbpedia.org/ontology/>'
324 +                                 'prefix db-prop: <http://fr.dbpedia.org/property/>'
325 +                                 'select ?ville, ?name, ?long, ?lat where {'
326 +                                 ' ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .'
327 +                                 ' ?ville rdf:type db-owl:PopulatedPlace .'
328 +                                 ' ?ville db-owl:populationTotal ?population .'
329 +                                 ' ?ville foaf:name ?name .'
330 +                                 ' ?ville db-prop:longitude ?long .'
331 +                                 ' ?ville db-prop:latitude ?lat .'
332 +                                 ' FILTER (?population > 1000)'
333 +                                 '}',
334 +                                 indexes=[0, 1, (2, 3)])
335
336
337      treatments = {1: {'normalization': [aln.simply],
338 -                      'metric': ald.levenshtein
339 +                      'metric': ald.levenshtein,
340                        'matrix_normalized': False
341                       }
342                   }
343      results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
344                             indexes=(2, 2), #On which data build the kdtree
```
@@ -313,22 +358,37 @@
345  divide our data into a few candidates. This clustering is coarse, and is only used to reduce
346  the potential candidates without losing any finer possible matches.
347
348  So, in the next step, we define the treatments to apply.
349  It is the same as before, but we ask for a non-normalized matrix
350 -(ie: the real output of the levenshtein distance).
351 +(i.e.: the real output of the levenshtein distance).
352  Then, we call the ``alignall`` function. ``indexes`` is a tuple giving the
353  position of the point on which the kdtree_ must be built, ``mode`` is the mode
354  used to find neighbours [#]_.
355
356  Finally, ``uniq`` asks the function to return the best
357 -candidate (ie: the one having the shortest distance above the given threshold)
358 +candidate (i.e.: the one having the shortest distance below the given threshold)
359
360  .. [#] The available modes are ``kdtree``, ``kmeans`` and ``minibatch`` for
361         numerical data and ``minhashing`` for textual data.
362
363  The function outputs a generator yielding tuples where the first element is the
364  identifier of the ``alignset`` item and the second is the ``targetset`` one (It
365  may take some time before yielding the first tuples, because all the computation
366  must be done…)
367
368  .. _kdtree: http://en.wikipedia.org/wiki/K-d_tree
369 +
370 +`Try <http://demo.cubicweb.org/nazca/view?vid=nazca>`_ it online !
371 +==================================================================
372 +
373 +We have also made a little application of Nazca, using `CubicWeb
374 +<http://www.cubicweb.org/>`_. This application provides a user interface for
375 +Nazca, helping you to choose what you want to align. You can use sparql or rql
376 +queries, as in the previous example, or import your own csv file [#]_. Once you
377 +have chosen what you want to align, you can click the *Next step* button to
378 +customize the treatments you want to apply, just as you did before in python !
379 +Once done, clicking *Next step* again starts the alignment process. Wait a
380 +little bit, and you can either download the results in a *csv* or *rdf* file, or
381 +directly see the results online choosing the *html* output.
382 +
383 +.. [#] Your csv file must be tab-separated for the moment…
```
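The cities alignment section explains that a coarse neighbour search on coordinates is used to cut down the candidates before the finer name comparison. The sketch below illustrates that idea only. It does not use a k-d tree: a uniform grid stands in for it to keep the example dependency-free. The item layout `[identifier, name, (longitude, latitude)]`, the helper names, the cell size and the sample coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_key(point, cell_size):
    """Map a (longitude, latitude) point to the grid cell that contains it."""
    return (int(point[0] // cell_size), int(point[1] // cell_size))


def build_grid(items, cell_size):
    """Index items of the form [identifier, name, (long, lat)] by grid cell."""
    grid = defaultdict(list)
    for item in items:
        grid[grid_key(item[2], cell_size)].append(item)
    return grid


def candidates(grid, point, cell_size):
    """Yield the items stored in the cell of `point` and its 8 neighbours."""
    cx, cy = grid_key(point, cell_size)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for item in grid.get((cx + dx, cy + dy), ()):
                yield item


# Made-up illustrative data, not taken from geonames or dbpedia.
targetset = [['t1', 'Paris', (2.35, 48.85)],
             ['t2', 'Nantes', (-1.55, 47.22)],
             ['t3', 'Paris', (-95.55, 33.66)]]
alignset = [['a1', 'Paris', (2.34, 48.86)]]

CELL_SIZE = 1.0  # in degrees, hypothetical
grid = build_grid(targetset, CELL_SIZE)
nearby = [item[0] for item in candidates(grid, alignset[0][2], CELL_SIZE)]
```

Only the items in `nearby` would then go through the name comparison, so the homonym located far away is never compared.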