<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Nazca is out ! (Logilab.org) RSS Feed</title>
    <description></description>
    <link>http://www.logilab.org/blogentry/115136</link>
<item>
<guid isPermaLink="true">http://www.logilab.org/blogentry/115136</guid>
  <title>Nazca is out !</title>
  <link>http://www.logilab.org/blogentry/115136</link>
  <description>&lt;div class=&quot;section&quot; id=&quot;what-is-it-for&quot;&gt;
&lt;h3&gt;&lt;a&gt;What is it for ?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a class=&quot;reference&quot; href=&quot;https://www.logilab.org/project/Nazca&quot;&gt;Nazca&lt;/a&gt; is a Python library that
helps you &lt;em&gt;align data&lt;/em&gt;. But what does “align data”&amp;nbsp;mean? Suppose
you have a list of cities, each described by its name and country, and you
would like to find their URIs on DBpedia to get more information about them, such as
their longitude and latitude. With two or three cities, this can be done
by hand, but not with hundreds or thousands of them.
Nazca provides the tools needed to do it.&lt;/p&gt;
&lt;p&gt;This blog post explains how the library works and how it can be used.
Once you have understood its main concepts, don&#39;t hesitate
to try Nazca &lt;a class=&quot;reference&quot; href=&quot;http://demo.cubicweb.org/nazca/view?vid=nazca&quot;&gt;online&lt;/a&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;introduction&quot;&gt;
&lt;h3&gt;&lt;a&gt;Introduction&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The alignment process is divided into three main steps:&lt;/p&gt;
&lt;ol class=&quot;arabic simple&quot;&gt;
&lt;li&gt;Gather and format the data we want to align.
In this step, we define two sets called the &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; and the
&lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt;. The &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; contains our data, and the
&lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt; contains the data we would like to link it to.&lt;/li&gt;
&lt;li&gt;Compute the similarity between the gathered items. We compute a distance
matrix between the two sets using a given distance function.&lt;/li&gt;
&lt;li&gt;Use the distance matrix to find the items that are highly similar.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;section&quot; id=&quot;simple-case&quot;&gt;
&lt;h4&gt;&lt;a&gt;Simple case&lt;/a&gt;&lt;/h4&gt;
&lt;ol class=&quot;arabic simple&quot;&gt;
&lt;li&gt;Let&#39;s define &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt; as simple Python lists.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Victor Hugo&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Albert Camus&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Albert Camus&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Guillaume Apollinaire&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Victor Hugo&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;ol class=&quot;arabic&quot; start=&quot;2&quot;&gt;
&lt;li&gt;&lt;p class=&quot;first&quot;&gt;Now we have to compute the similarity between each pair of items. For that purpose, we use the
&lt;a class=&quot;reference&quot; href=&quot;http://en.wikipedia.org/wiki/Levenshtein_distance&quot;&gt;Levenshtein distance&lt;/a&gt;
&lt;a class=&quot;footnote-reference&quot; href=&quot;#id2&quot; id=&quot;id1&quot;&gt;[1]&lt;/a&gt;, which is well suited to measuring the distance between short strings of a few words.
Such a function is provided in the &lt;tt class=&quot;docutils literal&quot;&gt;nazca.distances&lt;/tt&gt; module.&lt;/p&gt;
&lt;p&gt;The next step is to compute the distance matrix according to the Levenshtein
distance. The result is given in the following table.&lt;/p&gt;
&lt;table border=&quot;1&quot; class=&quot;docutils&quot;&gt;
&lt;colgroup&gt;
&lt;col width=&quot;22%&quot; /&gt;
&lt;col width=&quot;22%&quot; /&gt;
&lt;col width=&quot;36%&quot; /&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead valign=&quot;bottom&quot;&gt;
&lt;tr&gt;&lt;th class=&quot;head&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;&lt;p class=&quot;first last&quot;&gt;Albert Camus&lt;/p&gt;
&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;&lt;p class=&quot;first last&quot;&gt;Guillaume Apollinaire&lt;/p&gt;
&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;&lt;p class=&quot;first last&quot;&gt;Victor Hugo&lt;/p&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;Victor Hugo&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;6&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;9&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;0&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;Albert Camus&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;0&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;8&lt;/p&gt;
&lt;/td&gt;
&lt;td&gt;&lt;p class=&quot;first last&quot;&gt;6&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p class=&quot;first&quot;&gt;The alignment process ends by reading the matrix and declaring that items whose
distance is below a given threshold are identical.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;table class=&quot;docutils footnote&quot; frame=&quot;void&quot; id=&quot;id2&quot; rules=&quot;none&quot;&gt;
&lt;colgroup&gt;&lt;col class=&quot;label&quot; /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td class=&quot;label&quot;&gt;&lt;a class=&quot;fn-backref&quot; href=&quot;#id1&quot;&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Also called the &lt;em&gt;edit distance&lt;/em&gt;, because the distance between two words
is equal to the minimum number of single-character edits required to change one
word into the other.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
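&lt;p&gt;The three steps above can be sketched in plain Python. This is only an illustration: the &lt;tt class=&quot;docutils literal&quot;&gt;levenshtein&lt;/tt&gt; helper below is a hand-written, character-level edit distance, not the function shipped with Nazca, so its values may differ from those in the table above (Nazca may treat the words of a name differently). The threshold of 2 is likewise an assumption chosen for the example.&lt;/p&gt;

```python
def levenshtein(a, b):
    """Plain character-level edit distance (hypothetical helper, not Nazca's)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Step 1: the two sets to align.
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']

# Step 2: one row per alignset item, one column per targetset item.
matrix = [[levenshtein(a, t) for t in targetset] for a in alignset]

# Step 3: keep the pairs whose distance does not exceed the threshold.
threshold = 2  # assumed value, for illustration only
matches = [(i, j)
           for i, row in enumerate(matrix)
           for j, distance in enumerate(row)
           if threshold >= distance]

for i, j in matches:
    print(alignset[i], 'matches', targetset[j])
```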
&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;a-more-complex-one&quot;&gt;
&lt;h4&gt;&lt;a&gt;A more complex one&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The previous case was simple because we had only one &lt;em&gt;attribute&lt;/em&gt; to align (the
name), but there are often several &lt;em&gt;attributes&lt;/em&gt; to align, such as the name,
the birth date and the birth city. The steps remain the same, except that
three distance matrices will be computed, and &lt;em&gt;items&lt;/em&gt; will be represented as
nested lists. See the following example:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Paul Dupont&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;14-08-1991&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Paris&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Jacques Dupuis&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;06-01-1999&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Bressuire&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Michel Edouard&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;18-04-1881&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Nantes&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Dupond Paul&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;14/08/1991&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Paris&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
             &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Edouard Michel&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;18/04/1881&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Nantes&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
             &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Dupuis Jacques &amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;06/01/1999&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Bressuire&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
             &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Dupont Paul&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;01-12-2012&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Paris&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In such a case, two distance functions are used, the Levenshtein one for the
name and the city and a temporal one for the birth date &lt;a class=&quot;footnote-reference&quot; href=&quot;#id5&quot; id=&quot;id3&quot;&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;tt class=&quot;docutils literal&quot;&gt;cdist&lt;/tt&gt; function of &lt;tt class=&quot;docutils literal&quot;&gt;nazca.distances&lt;/tt&gt; enables us to compute those
matrices:&lt;/p&gt;
&lt;ul class=&quot;simple&quot;&gt;
&lt;li&gt;For the names:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nazca&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;                    &lt;span class=&quot;s&quot;&gt;&amp;#39;levenshtein&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix_normalized&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;6.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;5.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;5.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;6.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;5.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;6.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;6.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;6.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;table border=&quot;1&quot; class=&quot;docutils&quot;&gt;
&lt;colgroup&gt;
&lt;col width=&quot;22%&quot; /&gt;
&lt;col width=&quot;18%&quot; /&gt;
&lt;col width=&quot;22%&quot; /&gt;
&lt;col width=&quot;22%&quot; /&gt;
&lt;col width=&quot;18%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead valign=&quot;bottom&quot;&gt;
&lt;tr&gt;&lt;th class=&quot;head&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Dupond Paul&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Edouard Michel&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Dupuis Jacques&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Dupont Paul&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td&gt;Paul Dupont&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Jacques Dupuis&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Michel Edouard&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class=&quot;simple&quot;&gt;
&lt;li&gt;For the birthdates:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nazca&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;                    &lt;span class=&quot;s&quot;&gt;&amp;#39;temporal&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix_normalized&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt;     &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;40294.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mf&quot;&gt;2702.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mf&quot;&gt;7780.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;2702.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;42996.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mf&quot;&gt;5078.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;40294.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;      &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;42996.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;48074.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;table border=&quot;1&quot; class=&quot;docutils&quot;&gt;
&lt;colgroup&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;col width=&quot;20%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead valign=&quot;bottom&quot;&gt;
&lt;tr&gt;&lt;th class=&quot;head&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;14/08/1991&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;18/04/1881&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;06/01/1999&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;01-12-2012&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td&gt;14-08-1991&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;40294&lt;/td&gt;
&lt;td&gt;2702&lt;/td&gt;
&lt;td&gt;7780&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;06-01-1999&lt;/td&gt;
&lt;td&gt;2702&lt;/td&gt;
&lt;td&gt;42996&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;5078&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;18-04-1881&lt;/td&gt;
&lt;td&gt;40294&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;42996&lt;/td&gt;
&lt;td&gt;48074&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ul class=&quot;simple&quot;&gt;
&lt;li&gt;For the birthplaces:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nazca&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cdist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;                    &lt;span class=&quot;s&quot;&gt;&amp;#39;levenshtein&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;matrix_normalized&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;4.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;8.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;8.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;9.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;8.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
       &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;4.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;0.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;9.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;4.&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;table border=&quot;1&quot; class=&quot;docutils&quot;&gt;
&lt;colgroup&gt;
&lt;col width=&quot;25%&quot; /&gt;
&lt;col width=&quot;16%&quot; /&gt;
&lt;col width=&quot;18%&quot; /&gt;
&lt;col width=&quot;25%&quot; /&gt;
&lt;col width=&quot;16%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead valign=&quot;bottom&quot;&gt;
&lt;tr&gt;&lt;th class=&quot;head&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Paris&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Nantes&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Bressuire&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;Paris&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td&gt;Paris&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Bressuire&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Nantes&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The next step is to gather those three matrices into a single one, called the
&lt;cite&gt;global alignment matrix&lt;/cite&gt;. The values below are the element-wise sums of the three matrices:&lt;/p&gt;
&lt;table border=&quot;1&quot; class=&quot;docutils&quot;&gt;
&lt;colgroup&gt;
&lt;col width=&quot;10%&quot; /&gt;
&lt;col width=&quot;23%&quot; /&gt;
&lt;col width=&quot;23%&quot; /&gt;
&lt;col width=&quot;23%&quot; /&gt;
&lt;col width=&quot;23%&quot; /&gt;
&lt;/colgroup&gt;
&lt;thead valign=&quot;bottom&quot;&gt;
&lt;tr&gt;&lt;th class=&quot;head&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;0&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;1&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;2&lt;/th&gt;
&lt;th class=&quot;head&quot;&gt;3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;40304&lt;/td&gt;
&lt;td&gt;2715&lt;/td&gt;
&lt;td&gt;7780&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2715&lt;/td&gt;
&lt;td&gt;43011&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;5091&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;40304&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;43011&lt;/td&gt;
&lt;td&gt;48084&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Allowing for some spelling mistakes (for example, &lt;em&gt;Dupont&lt;/em&gt; and &lt;em&gt;Dupond&lt;/em&gt; are very
close), the matching threshold can be set to 1 or 2. We can thus see that
item 0 in our &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; is the same as item 0 in the &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt;, and that
item 1 in the &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; matches item 2 of the &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt;: the links can be
made!&lt;/p&gt;
&lt;p&gt;It&#39;s important to notice that even though item 0 of the &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; and item 3
of the &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt; have the same name and the same birthplace, they are
unlikely to be identical because their birth dates are very different.&lt;/p&gt;
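&lt;p&gt;As a rough sketch, the global alignment matrix and the resulting links can be reproduced from the three matrices printed above. The post does not state how Nazca itself combines the matrices; the code below assumes a plain element-wise sum, which is consistent with the figures in the table, and uses a threshold of 2. The variable names are chosen for this example only.&lt;/p&gt;

```python
# Distance matrices copied from the three tables above
# (rows: alignset items, columns: targetset items).
names = [[1, 6, 5, 0],
         [5, 6, 0, 5],
         [6, 0, 6, 6]]
dates = [[0, 40294, 2702, 7780],
         [2702, 42996, 0, 5078],
         [40294, 0, 42996, 48074]]
places = [[0, 4, 8, 0],
          [8, 9, 0, 8],
          [4, 0, 9, 4]]

# Assumption: the global matrix is the element-wise sum of the three.
global_matrix = [[sum(cells) for cells in zip(*rows)]
                 for rows in zip(names, dates, places)]

# Keep the (alignset index, targetset index) pairs within the threshold.
threshold = 2
links = [(i, j)
         for i, row in enumerate(global_matrix)
         for j, distance in enumerate(row)
         if threshold >= distance]

print(global_matrix[0])
print(links)
```

&lt;p&gt;Note that the pair (0, 3) is not linked: the names and birthplaces agree, but the birth date distance alone pushes the total far above the threshold.&lt;/p&gt;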
&lt;p&gt;You may have noticed that working with matrices by hand, as I did for the example, is
rather tedious. The good news is that &lt;a class=&quot;reference&quot; href=&quot;https://www.logilab.org/project/Nazca&quot;&gt;Nazca&lt;/a&gt; does all this work for you. You just
have to provide the sets and the distance functions, and that&#39;s all. More good news:
the project comes with the functions needed to build the sets!&lt;/p&gt;
&lt;table class=&quot;docutils footnote&quot; frame=&quot;void&quot; id=&quot;id5&quot; rules=&quot;none&quot;&gt;
&lt;colgroup&gt;&lt;col class=&quot;label&quot; /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td class=&quot;label&quot;&gt;&lt;a class=&quot;fn-backref&quot; href=&quot;#id3&quot;&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Provided in the &lt;tt class=&quot;docutils literal&quot;&gt;nazca.distances&lt;/tt&gt; module.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;real-applications&quot;&gt;
&lt;h3&gt;&lt;a&gt;Real applications&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before we start, we assume the following imports have been made:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nazca&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dataio&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aldio&lt;/span&gt;   &lt;span class=&quot;c&quot;&gt;#Functions for input and output data&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nazca&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;distances&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ald&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;#Functions to compute the distances&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nazca&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;normalize&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aln&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;#Functions to normalize data&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;nazca&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aligner&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ala&lt;/span&gt;    &lt;span class=&quot;c&quot;&gt;#Functions to align data&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;the-goncourt-prize&quot;&gt;
&lt;h4&gt;&lt;a&gt;The Goncourt prize&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;On Wikipedia, we can find the &lt;a class=&quot;reference&quot; href=&quot;https://fr.wikipedia.org/wiki/Prix_Goncourt#Liste_des_laur.C3.A9ats&quot;&gt;Goncourt prize winners&lt;/a&gt;, and we
would like to establish a link between the winners and their URIs on DBpedia
(let&#39;s imagine the &lt;em&gt;Goncourt prize winners&lt;/em&gt; category does not exist in DBpedia).&lt;/p&gt;
&lt;p&gt;We simply copy and paste the list of winners from Wikipedia into a file and replace all
the separators (&lt;tt class=&quot;docutils literal&quot;&gt;-&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;,&lt;/tt&gt;) with &lt;tt class=&quot;docutils literal&quot;&gt;#&lt;/tt&gt;. The beginning of our file is then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div class=&quot;line-block&quot;&gt;
&lt;div class=&quot;line&quot;&gt;1903#John-Antoine Nau#Force ennemie (Plume)&lt;/div&gt;
&lt;div class=&quot;line&quot;&gt;1904#Léon Frapié#La Maternelle (Albin Michel)&lt;/div&gt;
&lt;div class=&quot;line&quot;&gt;1905#Claude Farrère#Les Civilisés (Paul Ollendorff)&lt;/div&gt;
&lt;div class=&quot;line&quot;&gt;1906#Jérôme et Jean Tharaud#Dingley, l&#39;illustre écrivain (Cahiers de la Quinzaine)&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;p&gt;When using the high-level functions of this library, each item must have at
least two elements: an &lt;em&gt;identifier&lt;/em&gt; (the name, or the URI) and the &lt;em&gt;attribute&lt;/em&gt; to
compare. With the previous file, we will use the name (column number 1)
both as &lt;em&gt;identifier&lt;/em&gt; (we don&#39;t have a &lt;em&gt;URI&lt;/em&gt; to use as identifier here) and as &lt;em&gt;attribute&lt;/em&gt; to align.
This is expressed in Python with the following code:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aldio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parsefile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;prixgoncourt&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indexes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delimiter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So, the beginning of our &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; is:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;u&amp;#39;John-Antoine Nau&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;u&amp;#39;John-Antoine Nau&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;u&amp;#39;Léon Frapié&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;u&amp;#39;Léon Frapié&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
 &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;u&amp;#39;Claude Farrère&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;u&amp;#39;Claude Farrère&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
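&lt;p&gt;To make the role of &lt;tt class=&quot;docutils literal&quot;&gt;indexes&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;delimiter&lt;/tt&gt; more concrete, here is a minimal pure-Python sketch of the column selection. The helper &lt;tt class=&quot;docutils literal&quot;&gt;select_columns&lt;/tt&gt; is hypothetical and is not part of Nazca; it only mimics the shape of the result shown above, assuming each line is split on the delimiter and the listed columns are kept in order.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;
```python
def select_columns(lines, indexes, delimiter):
    """Hypothetical stand-in: keep the columns listed in ``indexes`` for each line."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append([fields[i] for i in indexes])
    return rows


sample = [
    "1903#John-Antoine Nau#Force ennemie (Plume)",
    "1904#Léon Frapié#La Maternelle (Albin Michel)",
]
# Column 1 (the name) is used both as identifier and as attribute to compare.
rows = select_columns(sample, indexes=[1, 1], delimiter="#")
print(rows)
```
&lt;/pre&gt;&lt;/div&gt;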
&lt;p&gt;Now, let&#39;s build the &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt; using a &lt;em&gt;SPARQL query&lt;/em&gt; against the DBpedia
endpoint. We ask for the list of French novelists, described by their URI
and their name in French:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;     SELECT ?writer, ?name WHERE {&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;       ?writer  &amp;lt;http://purl.org/dc/terms/subject&amp;gt; &amp;lt;http://dbpedia.org/resource/Category:French_novelists&amp;gt;.&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;       ?writer rdfs:label ?name.&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;       FILTER(lang(?name) = &amp;#39;fr&amp;#39;)&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;    }&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt; &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aldio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparqlquery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://dbpedia.org/sparql&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both functions return nested lists like the one shown above. Now, we have to define
the distance function to be used for the alignment. This is done with a
Python dictionary whose keys are the columns to work on, and whose values are
the treatments to apply.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;metric&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ald&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levenshtein&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Use a levenshtein on the name&lt;/span&gt;
                                              &lt;span class=&quot;c&quot;&gt;# (column 1)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
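&lt;p&gt;For readers unfamiliar with the metric, the sketch below shows the textbook dynamic-programming definition of the Levenshtein (edit) distance. It is not the library&#39;s implementation: &lt;tt class=&quot;docutils literal&quot;&gt;ald.levenshtein&lt;/tt&gt; may differ in details, so take this only as an illustration of what is being measured.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;
```python
def levenshtein(a, b):
    """Edit distance: minimal number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("Victor Hugo", "Victor Hugo"))
print(levenshtein("kitten", "sitting"))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Identical strings are at distance 0, and the distance grows by one for each character that has to be inserted, deleted or replaced.&lt;/p&gt;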
&lt;p&gt;Finally, the last thing we have to do is call the &lt;tt class=&quot;docutils literal&quot;&gt;alignall&lt;/tt&gt; function:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;alignments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ala&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alignall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                       &lt;span class=&quot;mf&quot;&gt;0.4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#This is the matching threshold&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#We&amp;#39;ll discuss this later&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;uniq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#Get the best results only&lt;/span&gt;
                      &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This function returns an iterator over the alignments found. You can
display the results with the following code:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alignments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; has been aligned onto &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
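&lt;p&gt;The effect of the threshold and of &lt;tt class=&quot;docutils literal&quot;&gt;uniq=True&lt;/tt&gt; can be pictured with the following sketch. Both &lt;tt class=&quot;docutils literal&quot;&gt;best_matches&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;mismatch_ratio&lt;/tt&gt; are hypothetical helpers written for this illustration, the identifiers are made up, and the toy distance is only a stand-in for a normalized metric; Nazca&#39;s own matching logic may work differently.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;
```python
def best_matches(alignset, targetset, distance, threshold):
    """Hypothetical helper: for each item, yield the closest target under the threshold."""
    for a_id, a_value in alignset:
        best = None
        for t_id, t_value in targetset:
            d = distance(a_value, t_value)
            if d > threshold:
                continue  # too far away to be considered a match
            if best is None or best[0] > d:
                best = (d, t_id)
        if best is not None:
            yield a_id, best[1]


def mismatch_ratio(a, b):
    """Toy distance in [0, 1]: share of positions where the strings differ."""
    length = max(len(a), len(b)) or 1
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1 - same / length


# Made-up identifiers, for illustration only.
alignset = [["Victor Hugo", "Victor Hugo"], ["Nobody", "Nobody"]]
targetset = [["uri:hugo", "Victor Hugo"], ["uri:zola", "Émile Zola"]]
pairs = list(best_matches(alignset, targetset, mismatch_ratio, 0.4))
print(pairs)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Items with no target under the threshold simply produce no pair, which is why the second item does not appear in the output.&lt;/p&gt;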
&lt;p&gt;It may be important to pre-process the data before aligning it. For
instance, names can be written in lower or upper case, with extra
characters such as punctuation, or with unwanted information in parentheses, and so on. That
is why we provide some functions to &lt;cite&gt;normalize&lt;/cite&gt; your data. The most useful may
be the &lt;cite&gt;simplify()&lt;/cite&gt; function (see the docstring for more information). So the
treatments can be given as follows:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;remove_after&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot; Remove the text after ``sub`` in ``string``&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;        &amp;gt;&amp;gt;&amp;gt; remove_after(&amp;#39;I like cats and dogs&amp;#39;, &amp;#39;and&amp;#39;)&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;        &amp;#39;I like cats&amp;#39;&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;        &amp;gt;&amp;gt;&amp;gt; remove_after(&amp;#39;I like cats and dogs&amp;#39;, &amp;#39;(&amp;#39;)&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;        &amp;#39;I like cats and dogs&amp;#39;&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sub&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;ne&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;normalization&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove_after&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;(&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                                    &lt;span class=&quot;n&quot;&gt;aln&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;&amp;#39;metric&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ald&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levenshtein&lt;/span&gt;
                 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
             &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
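&lt;p&gt;As an illustration of how such a list of normalization functions could be chained, here is a small self-contained sketch. It assumes the functions are applied in list order, which the text above does not state explicitly. &lt;tt class=&quot;docutils literal&quot;&gt;basic_cleanup&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;normalize&lt;/tt&gt; are hypothetical helpers; the former is only a rough stand-in for a real normalizer and does not reproduce what &lt;cite&gt;simplify()&lt;/cite&gt; does.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;
```python
import string


def remove_after(text, sub):
    """Same idea as the function above: drop everything from ``sub`` onwards."""
    try:
        return text[:text.lower().index(sub.lower())].strip()
    except ValueError:
        return text


def basic_cleanup(text):
    """Hypothetical stand-in for a normalizer: lower-case and drop ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def normalize(value, normalizers):
    """Apply each normalization function in order (assumed behaviour)."""
    for func in normalizers:
        value = func(value)
    return value


normalizers = [lambda x: remove_after(x, "("), basic_cleanup]
print(normalize("Force ennemie (Plume)", normalizers))
```
&lt;/pre&gt;&lt;/div&gt;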
&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;cities-alignment&quot;&gt;
&lt;h4&gt;&lt;a&gt;Cities alignment&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The previous case with the &lt;cite&gt;Goncourt prize winners&lt;/cite&gt; was pretty simple because
the number of items was small, and the computation fast. But in a more realistic use
case, the number of items to align may be huge (thousands or millions…). In
such a case it&#39;s unthinkable to build the global alignment matrix because it
would be too big and the computation would take at least a few days.
So the idea is to make small groups of possibly similar data and to compute smaller
matrices (i.e. a &lt;em&gt;divide and conquer&lt;/em&gt; approach).
For this purpose, we provide some functions to group/cluster data, both
for text and for numerical data.&lt;/p&gt;
&lt;p&gt;Here is the code used; we explain it below:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aldio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rqlquery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://demo.cubicweb.org/geonames&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;Any U, N, LONG, LAT WHERE X is Location, X name&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              N, X country C, C name &amp;quot;France&amp;quot;, X longitude&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              LONG, X latitude LAT, X population &amp;gt; 1000, X&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              feature_class &amp;quot;P&amp;quot;, X cwuri U&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;n&quot;&gt;indexes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aldio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparqlquery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://dbpedia.inria.fr/sparql&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                             &lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;prefix db-owl: &amp;lt;http://dbpedia.org/ontology/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                             prefix db-prop: &amp;lt;http://fr.dbpedia.org/property/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                             select ?ville, ?name, ?long, ?lat where {&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville db-owl:country &amp;lt;http://fr.dbpedia.org/resource/France&amp;gt; .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville rdf:type db-owl:PopulatedPlace .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville db-owl:populationTotal ?population .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville foaf:name ?name .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville db-prop:longitude ?long .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              ?ville db-prop:latitude ?lat .&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                              FILTER (?population &amp;gt; 1000)&lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;                             }&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                             &lt;span class=&quot;n&quot;&gt;indexes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)])&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;normalization&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aln&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;simplify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;&amp;#39;metric&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ald&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;levenshtein&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                  &lt;span class=&quot;s&quot;&gt;&amp;#39;matrix_normalized&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
                 &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
             &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ala&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alignall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alignset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targetset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;treatments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#As before&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;indexes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#Which data to build the kdtree on&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;kdtree&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;#The mode to use&lt;/span&gt;
                       &lt;span class=&quot;n&quot;&gt;uniq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#Return only the best results&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let&#39;s explain the code. We have two data sets, each containing a list of cities we want
to align. In each item, the first column is the identifier, the second is the name of the city,
and the last one is the location of the city (longitude and latitude), gathered into
a single tuple.&lt;/p&gt;
&lt;p&gt;In this example, we want to build a &lt;a class=&quot;reference&quot; href=&quot;http://en.wikipedia.org/wiki/K-d_tree&quot;&gt;kdtree&lt;/a&gt; on the pair (longitude, latitude)
to split our data into small groups of candidates. This clustering is coarse, and is only
used to reduce the number of potential candidates without losing any possible match
that a finer comparison could find.&lt;/p&gt;
&lt;p&gt;In the next step, we define the treatments to apply.
They are the same as before, but we ask for a non-normalized matrix
(i.e. the raw output of the Levenshtein distance).
Then we call the &lt;tt class=&quot;docutils literal&quot;&gt;alignall&lt;/tt&gt; function. &lt;tt class=&quot;docutils literal&quot;&gt;indexes&lt;/tt&gt; is a tuple giving the
position of the point on which the &lt;a class=&quot;reference&quot; href=&quot;http://en.wikipedia.org/wiki/K-d_tree&quot;&gt;kdtree&lt;/a&gt; must be built, and &lt;tt class=&quot;docutils literal&quot;&gt;mode&lt;/tt&gt; is the mode
used to find neighbours &lt;a class=&quot;footnote-reference&quot; href=&quot;#id7&quot; id=&quot;id6&quot;&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, &lt;tt class=&quot;docutils literal&quot;&gt;uniq&lt;/tt&gt; asks the function to return only the best
candidate (i.e. the one with the shortest distance below the given threshold).&lt;/p&gt;
&lt;p&gt;The function outputs a generator yielding tuples where the first element is the
identifier of the &lt;tt class=&quot;docutils literal&quot;&gt;alignset&lt;/tt&gt; item and the second is the identifier of the &lt;tt class=&quot;docutils literal&quot;&gt;targetset&lt;/tt&gt; one. It
may take some time before the first tuples are yielded, because all the computation
must be done first.&lt;/p&gt;
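&lt;p&gt;To give a feel for the divide-and-conquer idea described above, the following sketch groups cities by geographic proximity before any name comparison. Note that it deliberately swaps the kd-tree for a much simpler technique, a fixed-size grid of cells, so it is not how Nazca&#39;s &lt;tt class=&quot;docutils literal&quot;&gt;kdtree&lt;/tt&gt; mode works. The helpers &lt;tt class=&quot;docutils literal&quot;&gt;grid_key&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;candidate_pairs&lt;/tt&gt;, the cell size and the sample identifiers are all assumptions made for the illustration.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;
```python
from collections import defaultdict
from math import floor


def grid_key(point, cell_size):
    """Map a (longitude, latitude) point to the grid cell that contains it."""
    lon, lat = point
    return (floor(lon / cell_size), floor(lat / cell_size))


def candidate_pairs(alignset, targetset, cell_size=0.5):
    """Yield (align item, target item) pairs lying in the same or adjacent cells."""
    cells = defaultdict(list)
    for item in targetset:
        cells[grid_key(item[2], cell_size)].append(item)
    for item in alignset:
        cx, cy = grid_key(item[2], cell_size)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for target in cells.get((cx + dx, cy + dy), []):
                    yield item, target


# Made-up identifiers and approximate coordinates, for illustration only.
alignset = [["a:paris", "Paris", (2.35, 48.85)]]
targetset = [["t:paris", "Paris", (2.34, 48.86)],
             ["t:marseille", "Marseille", (5.37, 43.30)]]
pairs = [(a[0], t[0]) for a, t in candidate_pairs(alignset, targetset)]
print(pairs)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Only the pairs produced by this coarse step would then be compared with the string metric, which keeps each distance matrix small.&lt;/p&gt;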
&lt;table class=&quot;docutils footnote&quot; frame=&quot;void&quot; id=&quot;id7&quot; rules=&quot;none&quot;&gt;
&lt;colgroup&gt;&lt;col class=&quot;label&quot; /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td class=&quot;label&quot;&gt;&lt;a class=&quot;fn-backref&quot; href=&quot;#id6&quot;&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The available modes are &lt;tt class=&quot;docutils literal&quot;&gt;kdtree&lt;/tt&gt;, &lt;tt class=&quot;docutils literal&quot;&gt;kmeans&lt;/tt&gt; and &lt;tt class=&quot;docutils literal&quot;&gt;minibatch&lt;/tt&gt; for
numerical data and &lt;tt class=&quot;docutils literal&quot;&gt;minhashing&lt;/tt&gt; for text data.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;section&quot; id=&quot;try-it-online&quot;&gt;
&lt;h3&gt;&lt;a&gt;Try it online!&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We have also built this &lt;a class=&quot;reference&quot; href=&quot;http://demo.cubicweb.org/nazca/view?vid=nazca&quot;&gt;little application&lt;/a&gt; on top of Nazca, using &lt;a class=&quot;reference&quot; href=&quot;http://www.cubicweb.org/&quot;&gt;Cubicweb&lt;/a&gt;. This application provides a user interface for
Nazca, helping you to choose what you want to align. You can use SPARQL or RQL
queries, as in the previous example, or import your own csv file &lt;a class=&quot;footnote-reference&quot; href=&quot;#id9&quot; id=&quot;id8&quot;&gt;[4]&lt;/a&gt;. Once you
have chosen what you want to align, you can click the &lt;em&gt;Next step&lt;/em&gt; button to
customize the treatments you want to apply, just as you did before in Python.
Once done, clicking &lt;em&gt;Next step&lt;/em&gt; again starts the alignment process. Wait a
little bit, and you can either download the results as a &lt;em&gt;csv&lt;/em&gt; or &lt;em&gt;rdf&lt;/em&gt; file, or
see the results directly online by choosing the &lt;em&gt;html&lt;/em&gt; output.&lt;/p&gt;
&lt;table class=&quot;docutils footnote&quot; frame=&quot;void&quot; id=&quot;id9&quot; rules=&quot;none&quot;&gt;
&lt;colgroup&gt;&lt;col class=&quot;label&quot; /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign=&quot;top&quot;&gt;
&lt;tr&gt;&lt;td class=&quot;label&quot;&gt;&lt;a class=&quot;fn-backref&quot; href=&quot;#id8&quot;&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Your csv file must be tab-separated for the moment…&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</description>
  <dc:date>2012-12-21T19:01-01:00</dc:date>
  <dc:creator>Simon Chabot</dc:creator>
</item>
  </channel>
</rss>