The previous case was simple because we had only one attribute to align (the
name), but it is common to have several attributes to align, such as the name,
the birth date and the birth city. The steps remain the same, except that
three distance matrices are computed and each item is represented as a
nested list. See the following example:
alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
            ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
            ['Michel Edouard', '18-04-1881', 'Nantes']]
targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
             ['Edouard Michel', '18/04/1881', 'Nantes'],
             ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
             ['Dupont Paul', '01-12-2012', 'Paris']]
In such a case, two distance functions are used: the Levenshtein one for the
name and the city, and a temporal one for the birth date.
The cdist function computes those matrices (the calls below reach it as
nazca.matrix.cdist):
>>> import nazca.matrix
>>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
...                    'levenshtein', matrix_normalized=False)
array([[ 1.,  6.,  5.,  0.],
       [ 5.,  6.,  0.,  5.],
       [ 6.,  0.,  6.,  6.]], dtype=float32)
               | Dupond Paul | Edouard Michel | Dupuis Jacques | Dupont Paul
Paul Dupont    | 1           | 6              | 5              | 0
Jacques Dupuis | 5           | 6              | 0              | 5
Michel Edouard | 6           | 0              | 6              | 6
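The name matrix above gives a distance of 1 between 'Paul Dupont' and 'Dupond Paul', although the two words are swapped. That is what you get if names are compared word by word rather than character by character over the whole string. The sketch below is only an illustration of that idea, not Nazca's own code: levenshtein is a plain textbook edit distance, and tokenwise_levenshtein is a hypothetical helper (the name and the "worst best-match" rule are assumptions) that assumes both strings contain at least one word.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def tokenwise_levenshtein(a, b):
    """Hypothetical helper (an assumption, not Nazca's API).

    Every word of one string is paired with its closest word in the other
    string, and the worst of those best matches is returned.
    """
    tokens_a, tokens_b = a.split(), b.split()
    table = [[levenshtein(ta, tb) for tb in tokens_b] for ta in tokens_a]
    best_per_row = [min(row) for row in table]
    best_per_col = [min(col) for col in zip(*table)]
    return max(best_per_row + best_per_col)


print(tokenwise_levenshtein('Paul Dupont', 'Dupond Paul'))  # 1
print(tokenwise_levenshtein('Paul Dupont', 'Dupont Paul'))  # 0
```

With the plain character-level distance, the same two pairs would score well above the matching threshold simply because the words are in a different order.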
>>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
...                    'temporal', matrix_normalized=False)
array([[     0.,  40294.,   2702.,   7780.],
       [  2702.,  42996.,      0.,   5078.],
       [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
           | 14/08/1991 | 18/04/1881 | 06/01/1999 | 01-12-2012
14-08-1991 | 0          | 40294      | 2702       | 7780
06-01-1999 | 2702       | 42996      | 0          | 5078
18-04-1881 | 40294      | 0          | 42996      | 48074
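The values in the date matrix are gaps in days between day-first dates, whatever separator is used. As a rough check, they can be reproduced with the standard library alone. The helper names parse_date and temporal_distance below are made up for this sketch and are not part of Nazca; the sketch also assumes every date is written day, month, year with '-' or '/'.

```python
from datetime import datetime


def parse_date(text):
    """Parse a day-first date written with '-' or '/' separators."""
    return datetime.strptime(text.replace('/', '-'), '%d-%m-%Y')


def temporal_distance(a, b):
    """Absolute gap between two dates, in days."""
    return abs((parse_date(a) - parse_date(b)).days)


align_dates = ['14-08-1991', '06-01-1999', '18-04-1881']
target_dates = ['14/08/1991', '18/04/1881', '06/01/1999', '01-12-2012']

date_matrix = [[temporal_distance(a, t) for t in target_dates]
               for a in align_dates]
for row in date_matrix:
    print(row)
# [0, 40294, 2702, 7780]
# [2702, 42996, 0, 5078]
# [40294, 0, 42996, 48074]
```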
>>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
...                    'levenshtein', matrix_normalized=False)
array([[ 0.,  4.,  8.,  0.],
       [ 8.,  9.,  0.,  8.],
       [ 4.,  0.,  9.,  4.]], dtype=float32)
          | Paris | Nantes | Bressuire | Paris
Paris     | 0     | 4      | 8         | 0
Bressuire | 8     | 9      | 0         | 8
Nantes    | 4     | 0      | 9         | 4
The next step is to gather those three matrices into a single one, called the
global alignment matrix. Here it is their element-wise sum:
  | 0     | 1     | 2     | 3
0 | 1     | 40304 | 2715  | 7780
1 | 2715  | 43011 | 0     | 5091
2 | 40304 | 0     | 43011 | 48084
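Each cell of the global matrix above is the sum of the matching cells of the name, date and city matrices. A minimal sketch of that step, using plain Python lists instead of the float32 arrays returned by cdist (combine is a name chosen for this example, not a Nazca function):

```python
names = [[1, 6, 5, 0],
         [5, 6, 0, 5],
         [6, 0, 6, 6]]
dates = [[0, 40294, 2702, 7780],
         [2702, 42996, 0, 5078],
         [40294, 0, 42996, 48074]]
cities = [[0, 4, 8, 0],
          [8, 9, 0, 8],
          [4, 0, 9, 4]]


def combine(*matrices):
    """Element-wise sum of matrices that all have the same shape."""
    return [[sum(cells) for cells in zip(*rows)]
            for rows in zip(*matrices)]


global_matrix = combine(names, dates, cities)
for row in global_matrix:
    print(row)
# [1, 40304, 2715, 7780]
# [2715, 43011, 0, 5091]
# [40304, 0, 43011, 48084]
```

Because the date distances are counted in days, they dominate the sum here; that is a property of this example's unnormalized matrices (matrix_normalized=False).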
To allow for a few spelling mistakes (for example, Dupont and Dupond are very
close), the matching threshold can be set to 1 or 2. We can then see that
item 0 of the alignset is the same as item 0 of the targetset, and that
item 1 of the alignset is the same as item 2 of the targetset: the links can
be made!
It's important to notice that even though item 0 of the alignset and item 3
of the targetset have the same name and the same birthplace, they are
unlikely to be the same person because their birth dates are very different.
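Reading the links off the global matrix amounts to keeping every cell at or below the threshold. The sketch below does this by hand on the matrix from the example; links_under is a hypothetical helper written for this illustration, and the threshold of 2 is the value suggested above.

```python
global_matrix = [[1, 40304, 2715, 7780],
                 [2715, 43011, 0, 5091],
                 [40304, 0, 43011, 48084]]


def links_under(matrix, threshold):
    """Return (alignset index, targetset index) pairs within the threshold."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, distance in enumerate(row)
            if distance <= threshold]


print(links_under(global_matrix, 2))
# [(0, 0), (1, 2), (2, 1)]

# Same name and same city, but the birth dates are 7780 days apart,
# so the pair (0, 3) stays far above the threshold.
print(global_matrix[0][3])
# 7780
```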
You may have noticed that working with matrices by hand, as I did for this
example, is a little tedious. The good news is that Nazca does all this work
for you. You just have to give it the sets and the distance functions, and
that's all. More good news: the project comes with the functions needed to
build the sets!