blog entries created by Simon Chabot

What is it for ?

Nazca is a python library aiming to help you to align data. But, what does “align data” mean? For instance, you have a list of cities, described by their name and their country and you would like to find their URI on dbpedia to have more information about them, as the longitude and the latitude. If you have two or three cities, it can be done with bare hands, but it could not if there are hundreds or thousands cities. Nazca provides you all the stuff we need to do it.

This blog post aims to introduce you how this library works and can be used. Once you have understood the main concepts behind this library, don't hesitate to try Nazca online !

Introduction

The alignment process is divided into three main steps:

  1. Gather and format the data we want to align. In this step, we define two sets called the alignset and the targetset. The alignset contains our data, and the targetset contains the data on which we would like to make the links.
  2. Compute the similarity between the items gathered. We compute a distance matrix between the two sets according to a given distance.
  3. Find the items having a high similarity thanks to the distance matrix.

Simple case

  1. Let's define alignset and targetset as simple python lists.
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
  1. Now, we have to compute the similarity between each items. For that purpose, the Levenshtein distance [1], which is well accurate to compute the distance between few words, is used. Such a function is provided in the nazca.distance module.

    The next step is to compute the distance matrix according to the Levenshtein distance. The result is given in the following table.

     

    Albert Camus

    Guillaume Apollinaire

    Victor Hugo

    Victor Hugo

    6

    9

    0

    Albert Camus

    0

    8

    6

  2. The alignment process is ended by reading the matrix and saying items having a value inferior to a given threshold are identical.

[1]Also called the edit distance, because the distance between two words is equal to the number of single-character edits required to change one word into the other.

A more complex one

The previous case was simple, because we had only one attribute to align (the name), but it is frequent to have a lot of attributes to align, such as the name and the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and items will be represented as nested lists. See the following example:

alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
            ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
            ['Michel Edouard', '18-04-1881', 'Nantes']]
targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
             ['Edouard Michel', '18/04/1881', 'Nantes'],
             ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
             ['Dupont Paul', '01-12-2012', 'Paris']]

In such a case, two distance functions are used, the Levenshtein one for the name and the city and a temporal one for the birth date [2].

The cdist function of nazca.distances enables us to compute those matrices :

  • For the names:
>>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
array([[ 1.,  6.,  5.,  0.],
       [ 5.,  6.,  0.,  5.],
       [ 6.,  0.,  6.,  6.]], dtype=float32)
  Dupond Paul Edouard Michel Dupuis Jacques Dupont Paul
Paul Dupont 1 6 5 0
Jacques Dupuis 5 6 0 5
Edouard Michel 6 0 6 6
  • For the birthdates:
>>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
>>>                    'temporal', matrix_normalized=False)
array([[     0.,  40294.,   2702.,   7780.],
       [  2702.,  42996.,      0.,   5078.],
       [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
  14/08/1991 18/04/1881 06/01/1999 01-12-2012
14-08-1991 0 40294 2702 7780
06-01-1999 2702 42996 0 5078
18-04-1881 40294 0 42996 48074
  • For the birthplaces:
>>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
array([[ 0.,  4.,  8.,  0.],
       [ 8.,  9.,  0.,  8.],
       [ 4.,  0.,  9.,  4.]], dtype=float32)
  Paris Nantes Bressuire Paris
Paris 0 4 8 0
Bressuire 8 9 0 8
Nantes 4 0 9 4

The next step is gathering those three matrices into a global one, called the global alignment matrix. Thus we have :

  0 1 2 3
0 1 40304 2715 7780
1 2715 43011 0 5091
2 40304 0 43011 48084

Allowing some misspelling mistakes (for example Dupont and Dupond are very closed), the matching threshold can be set to 1 or 2. Thus we can see that the item 0 in our alignset is the same that the item 0 in the targetset, the 1 in the alignset and the 2 of the targetset too : the links can be done !

It's important to notice that even if the item 0 of the alignset and the 3 of the targetset have the same name and the same birthplace they are unlikely identical because of their very different birth date.

You may have noticed that working with matrices as I did for the example is a little bit boring. The good news is that Nazca makes all this job for you. You just have to give the sets and distance functions and that's all. An other good news is the project comes with the needed functions to build the sets !

[2]Provided in the nazca.distances module.

Real applications

Just before we start, we will assume the following imports have been done:

from nazca import dataio as aldio   #Functions for input and output data
from nazca import distances as ald  #Functions to compute the distances
from nazca import normalize as aln  #Functions to normalize data
from nazca import aligner as ala    #Functions to align data

The Goncourt prize

On wikipedia, we can find the Goncourt prize winners, and we would like to establish a link between the winners and their URI on dbpedia (Let's imagine the Goncourt prize winners category does not exist in dbpedia)

We simply copy/paste the winners list of wikipedia into a file and replace all the separators (- and ,) by #. So, the beginning of our file is :

1903#John-Antoine Nau#Force ennemie (Plume)
1904#Léon Frapié#La Maternelle (Albin Michel)
1905#Claude Farrère#Les Civilisés (Paul Ollendorff)
1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)

When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (so the column number 1) as identifier (we don't have an URI here as identifier) and attribute to align. This is told to python thanks to the following code:

alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')

So, the beginning of our alignset is:

>>> alignset[:3]
[[u'John-Antoine Nau', u'John-Antoine Nau'],
 [u'Léon Frapié', u'Léon, Frapié'],
 [u'Claude Farrère', u'Claude Farrère']]

Now, let's build the targetset thanks to a sparql query and the dbpedia end-point. We ask for the list of the French novelists, described by their URI and their name in French:

query = """
     SELECT ?writer, ?name WHERE {
       ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
       ?writer rdfs:label ?name.
       FILTER(lang(?name) = 'fr')
    }
 """
 targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)

Both functions return nested lists as presented before. Now, we have to define the distance function to be used for the alignment. This is done thanks to a python dictionary where the keys are the columns to work on, and the values are the treatments to apply.

treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
                                              # (column 1)

Finally, the last thing we have to do, is to call the alignall function:

alignments = ala.alignall(alignset, targetset,
                       0.4, #This is the matching threshold
                       treatments,
                       mode=None,#We'll discuss about that later
                       uniq=True #Get the best results only
                      )

This function returns an iterator over the different alignments done. You can see the results thanks to the following code :

for a, t in alignments:
    print '%s has been aligned onto %s' % (a, t)

It may be important to apply some pre-treatment on the data to align. For instance, names can be written with lower or upper characters, with extra characters as punctuation or unwanted information in parenthesis and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see the docstring for more information). So the treatments list can be given as follow:

def remove_after(string, sub):
    """ Remove the text after ``sub`` in ``string``
        >>> remove_after('I like cats and dogs', 'and')
        'I like cats'
        >>> remove_after('I like cats and dogs', '(')
        'I like cats and dogs'
    """
    try:
        return string[:string.lower().index(sub.lower())].strip()
    except ValueError:
        return string


treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
                                    aln.simply],
                  'metric': ald.levenshtein
                 }
             }

Cities alignment

The previous case with the Goncourt prize winners was pretty simply because the number of items was small, and the computation fast. But in a more real use case, the number of items to align may be huge (some thousands or millions…). In such a case it's unthinkable to build the global alignment matrix because it would be too big and it would take (at least...) fews days to achieve the computation. So the idea is to make small groups of possible similar data to compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data. We have functions to group text and numerical data.

This is the code used, we will explain it:

targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
                           """Any U, N, LONG, LAT WHERE X is Location, X name
                              N, X country C, C name "France", X longitude
                              LONG, X latitude LAT, X population > 1000, X
                              feature_class "P", X cwuri U""",
                           indexes=[0, 1, (2, 3)])
alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
                             """prefix db-owl: <http://dbpedia.org/ontology/>
                             prefix db-prop: <http://fr.dbpedia.org/property/>
                             select ?ville, ?name, ?long, ?lat where {
                              ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
                              ?ville rdf:type db-owl:PopulatedPlace .
                              ?ville db-owl:populationTotal ?population .
                              ?ville foaf:name ?name .
                              ?ville db-prop:longitude ?long .
                              ?ville db-prop:latitude ?lat .
                              FILTER (?population > 1000)
                             }""",
                             indexes=[0, 1, (2, 3)])


treatments = {1: {'normalization': [aln.simply],
                  'metric': ald.levenshtein,
                  'matrix_normalized': False
                 }
             }
results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
                       indexes=(2, 2), #On which data build the kdtree
                       mode='kdtree',  #The mode to use
                       uniq=True) #Return only the best results

Let's explain the code. We have two files, containing a list of cities we want to align, the first column is the identifier, and the second is the name of the city and the last one is location of the city (longitude and latitude), gathered into a single tuple.

In this example, we want to build a kdtree on the couple (longitude, latitude) to divide our data in few candidates. This clustering is coarse, and is only used to reduce the potential candidats without loosing any more refined possible matchs.

So, in the next step, we define the treatments to apply. It is the same as before, but we ask for a non-normalized matrix (ie: the real output of the levenshtein distance). Thus, we call the alignall function. indexes is a tuple saying the position of the point on which the kdtree must be built, mode is the mode used to find neighbours [3].

Finally, uniq ask to the function to return the best candidate (ie: the one having the shortest distance below the given threshold)

The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second is the targetset one (It may take some time before yielding the first tuples, because all the computation must be done…)

[3]The available modes are kdtree, kmeans and minibatch for numerical data and minhashing for text one.

Try it online !

We have also made this little application of Nazca, using Cubicweb. This application provides a user interface for Nazca, helping you to choose what you want to align. You can use sparql or rql queries, as in the previous example, or import your own cvs file [4]. Once you have choosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in python ! Once done, by clicking the Next step, you start the alignment process. Wait a little bit, and you can either download the results in a csv or rdf file, or directly see the results online choosing the html output.

[4]Your csv file must be tab-separated for the moment…

blog entry of

Logilab.org - en