What is it for ?Nazca is a python library aiming to help you to align data. But, what does “align data” mean? For instance, you have a list of cities, described by their name and their country and you would like to find their URI on dbpedia to have more information about them, as the longitude and the latitude. If you have two or three cities, it can be done with bare hands, but it could not if there are hundreds or thousands cities. Nazca provides you all the stuff we need to do it. This blog post aims to introduce you how this library works and can be used. Once you have understood the main concepts behind this library, don't hesitate to try Nazca online ! IntroductionThe alignment process is divided into three main steps:
Simple case
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
A more complex oneThe previous case was simple, because we had only one attribute to align (the name), but it is frequent to have a lot of attributes to align, such as the name and the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and items will be represented as nested lists. See the following example: alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
['Jacques Dupuis', '06-01-1999', 'Bressuire'],
['Michel Edouard', '18-04-1881', 'Nantes']]
targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
['Edouard Michel', '18/04/1881', 'Nantes'],
['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
['Dupont Paul', '01-12-2012', 'Paris']]
In such a case, two distance functions are used, the Levenshtein one for the name and the city and a temporal one for the birth date [2]. The cdist function of nazca.distances enables us to compute those matrices :
>>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
>>> 'levenshtein', matrix_normalized=False)
array([[ 1., 6., 5., 0.],
[ 5., 6., 0., 5.],
[ 6., 0., 6., 6.]], dtype=float32)
>>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
>>> 'temporal', matrix_normalized=False)
array([[ 0., 40294., 2702., 7780.],
[ 2702., 42996., 0., 5078.],
[ 40294., 0., 42996., 48074.]], dtype=float32)
>>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
>>> 'levenshtein', matrix_normalized=False)
array([[ 0., 4., 8., 0.],
[ 8., 9., 0., 8.],
[ 4., 0., 9., 4.]], dtype=float32)
The next step is gathering those three matrices into a global one, called the global alignment matrix. Thus we have :
Allowing some misspelling mistakes (for example Dupont and Dupond are very closed), the matching threshold can be set to 1 or 2. Thus we can see that the item 0 in our alignset is the same that the item 0 in the targetset, the 1 in the alignset and the 2 of the targetset too : the links can be done ! It's important to notice that even if the item 0 of the alignset and the 3 of the targetset have the same name and the same birthplace they are unlikely identical because of their very different birth date. You may have noticed that working with matrices as I did for the example is a little bit boring. The good news is that Nazca makes all this job for you. You just have to give the sets and distance functions and that's all. An other good news is the project comes with the needed functions to build the sets !
Real applicationsJust before we start, we will assume the following imports have been done: from nazca import dataio as aldio #Functions for input and output data
from nazca import distances as ald #Functions to compute the distances
from nazca import normalize as aln #Functions to normalize data
from nazca import aligner as ala #Functions to align data
The Goncourt prizeOn wikipedia, we can find the Goncourt prize winners, and we would like to establish a link between the winners and their URI on dbpedia (Let's imagine the Goncourt prize winners category does not exist in dbpedia) We simply copy/paste the winners list of wikipedia into a file and replace all the separators (- and ,) by #. So, the beginning of our file is :
When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (so the column number 1) as identifier (we don't have an URI here as identifier) and attribute to align. This is told to python thanks to the following code: alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
So, the beginning of our alignset is: >>> alignset[:3]
[[u'John-Antoine Nau', u'John-Antoine Nau'],
[u'Léon Frapié', u'Léon, Frapié'],
[u'Claude Farrère', u'Claude Farrère']]
Now, let's build the targetset thanks to a sparql query and the dbpedia end-point. We ask for the list of the French novelists, described by their URI and their name in French: query = """
SELECT ?writer, ?name WHERE {
?writer <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
?writer rdfs:label ?name.
FILTER(lang(?name) = 'fr')
}
"""
targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)
Both functions return nested lists as presented before. Now, we have to define the distance function to be used for the alignment. This is done thanks to a python dictionary where the keys are the columns to work on, and the values are the treatments to apply. treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
# (column 1)
Finally, the last thing we have to do, is to call the alignall function: alignments = ala.alignall(alignset, targetset,
0.4, #This is the matching threshold
treatments,
mode=None,#We'll discuss about that later
uniq=True #Get the best results only
)
This function returns an iterator over the different alignments done. You can see the results thanks to the following code : for a, t in alignments:
print '%s has been aligned onto %s' % (a, t)
It may be important to apply some pre-treatment on the data to align. For instance, names can be written with lower or upper characters, with extra characters as punctuation or unwanted information in parenthesis and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see the docstring for more information). So the treatments list can be given as follow: def remove_after(string, sub):
""" Remove the text after ``sub`` in ``string``
>>> remove_after('I like cats and dogs', 'and')
'I like cats'
>>> remove_after('I like cats and dogs', '(')
'I like cats and dogs'
"""
try:
return string[:string.lower().index(sub.lower())].strip()
except ValueError:
return string
treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
aln.simply],
'metric': ald.levenshtein
}
}
Cities alignmentThe previous case with the Goncourt prize winners was pretty simply because the number of items was small, and the computation fast. But in a more real use case, the number of items to align may be huge (some thousands or millions…). In such a case it's unthinkable to build the global alignment matrix because it would be too big and it would take (at least...) fews days to achieve the computation. So the idea is to make small groups of possible similar data to compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data. We have functions to group text and numerical data. This is the code used, we will explain it: targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
"""Any U, N, LONG, LAT WHERE X is Location, X name
N, X country C, C name "France", X longitude
LONG, X latitude LAT, X population > 1000, X
feature_class "P", X cwuri U""",
indexes=[0, 1, (2, 3)])
alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
"""prefix db-owl: <http://dbpedia.org/ontology/>
prefix db-prop: <http://fr.dbpedia.org/property/>
select ?ville, ?name, ?long, ?lat where {
?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
?ville rdf:type db-owl:PopulatedPlace .
?ville db-owl:populationTotal ?population .
?ville foaf:name ?name .
?ville db-prop:longitude ?long .
?ville db-prop:latitude ?lat .
FILTER (?population > 1000)
}""",
indexes=[0, 1, (2, 3)])
treatments = {1: {'normalization': [aln.simply],
'metric': ald.levenshtein,
'matrix_normalized': False
}
}
results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
indexes=(2, 2), #On which data build the kdtree
mode='kdtree', #The mode to use
uniq=True) #Return only the best results
Let's explain the code. We have two files, containing a list of cities we want to align, the first column is the identifier, and the second is the name of the city and the last one is location of the city (longitude and latitude), gathered into a single tuple. In this example, we want to build a kdtree on the couple (longitude, latitude) to divide our data in few candidates. This clustering is coarse, and is only used to reduce the potential candidats without loosing any more refined possible matchs. So, in the next step, we define the treatments to apply. It is the same as before, but we ask for a non-normalized matrix (ie: the real output of the levenshtein distance). Thus, we call the alignall function. indexes is a tuple saying the position of the point on which the kdtree must be built, mode is the mode used to find neighbours [3]. Finally, uniq ask to the function to return the best candidate (ie: the one having the shortest distance below the given threshold) The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second is the targetset one (It may take some time before yielding the first tuples, because all the computation must be done…)
Try it online !We have also made this little application of Nazca, using Cubicweb. This application provides a user interface for Nazca, helping you to choose what you want to align. You can use sparql or rql queries, as in the previous example, or import your own cvs file [4]. Once you have choosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in python ! Once done, by clicking the Next step, you start the alignment process. Wait a little bit, and you can either download the results in a csv or rdf file, or directly see the results online choosing the html output.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


Openstack, Wheezy and ZFS on Linux
