We have just published the following ipython notebooks explaining how to perform record linkage and entities matching with Nazca:
During Euroscipy, the Logilab Team presented an original approach for querying news using semantic information: "Rss feeds aggregator based on Scikits.learn and CubicWeb" by Vincent Michel This work is based on two major pieces of software:
- CubicWeb, the pythonic semantic web framework, is used to store and query Dbpedia information. CubicWeb is able to reconstruct links from rdf/nt files, and can easily execute complex queries in a database with more than 8 millions entities and 75 millions links when using a PostgreSQL backend.
- Scikit.learn is a cutting-edge python toolbox for machine learning. It provides algorithms that are simple and easy to use.
Based on these tools, we built a pure Python application to query the news:
- Named Entities are extracted from RSS articles of a few mainstream English newspapers (New York Times, Reuteurs, BBC News, etc.), for each group of words in an article, we check if a Dbpedia entry has the same label. If so, we create a semantic link between the article and the Dbpedia entry.
- An occurrence matrix of "RSS Articles" times "Named Entities" is constructed and may be used against several machine learning algorithms (MeanShift algorithm, Hierachical Clustering) in order to provide original and informative views of recent events.
Moreover, queries may be used jointly with semantic information from Dbpedia:
All musical artists in the news:
DISTINCT Any E, R WHERE E appears_in_rss R, E has_type T, T label "musical artist"
All living office holder persons in the news:
DISTINCT Any E WHERE E appears_in_rss R, E has_type T, T label "office holder", E has_subject C, C label "Living people"
All news that talk about Barack Obama and any scientist:
DISTINCT Any R WHERE E1 label "Barack Obama", E1 appears_in_rss R, E2 appears_in_rss R, E2 has_type T, T label "scientist"
All news that talk about a drug:
Any X, R WHERE X appears_in_rss R, X has_type T, T label "drug"
We are planning a one day coding sprint on scikits.learn the 1st April.
Venues, or remote participation on IRC are more than welcome !
More information can be found on the wiki: