Blog entries

  • DBpedia 3.2 released

    2008/11/19 by Nicolas Chauvat
    http://wiki.dbpedia.org/images/dbpedia_logo.png

    For those interested in the Semantic Web as much as we are at Logilab, the announcement of the new DBpedia release is very good news. Version 3.2 is extracted from the October 2008 Wikipedia dumps and provides three major improvements: the DBpedia Schema, which is a restricted vocabulary extracted from the Wikipedia infoboxes; RDF links from DBpedia to Freebase, the open-license database providing about a million things from various domains; cleaner abstracts without the traces of Wikipedia markup that made them difficult to reuse.

    DBpedia can be downloaded, queried with SPARQL or linked to via the Linked Data interface. See the about page for details.
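
    For instance, the endpoint can be queried from Python with the SPARQLWrapper library. Here is a minimal sketch; the property dbo:abstract and the public endpoint URL are the ones used by later DBpedia releases, so adapt them as needed:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask DBpedia for the English abstract of the resource describing Paris.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/Paris> dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["abstract"]["value"])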

    It is important to note that ontologies are usually more of a common language for data exchange, meant for broad re-use, which means that they cannot enforce too many restrictions. By contrast, database schemas are more restrictive and allow for more interesting inferences. For example, a database schema may enforce that the Publisher of a Document is a Person, whereas a more general ontology will have to allow for Publisher to be a Person or a Company.

    DBpedia provides its schema and moves forward by adding a mapping from that schema to actual ontologies like UMBEL, OpenCyc and Yago. This enables DBpedia users to infer from facts fetched from different databases, like DBpedia + Freebase + OpenCyc. Moreover, 'checking' DBpedia's data against ontologies will help detect mistakes or weirdnesses in Wikipedia's pages. For example, if data extracted from Wikipedia's infoboxes states that "Paris was_born_in New_York", reasoning and consistency checking tools will be able to point out that a person may be born in a city, but a city cannot, hence the above fact is probably an error and should be reviewed.

    With CubicWeb, one can easily define a schema specific to one's domain, quickly set up a web application and publish the content of its database as RDF for a known ontology. In other words, CubicWeb makes almost no difference between a web application and a database accessible through the web.
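
    For example, here is a minimal sketch of such a domain schema written with CubicWeb's yams schema language, reusing the Publisher example above (the entity and relation names are made up for the illustration):

    from yams.buildobjs import EntityType, String, SubjectRelation

    class Person(EntityType):
        name = String(required=True)

    class Document(EntityType):
        title = String(required=True)
        # the schema enforces that the publisher of a Document is a Person
        publisher = SubjectRelation('Person', cardinality='?*')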


  • Semantic web technology conference 2009

    2009/06/17 by Sandrine Ribeau
    The semantic web technology conference takes place every year in San Jose, California. It is meant to be the world's symposium on the business of semantic technologies. The discussions essentially revolve around semantic search, how to improve access to data and how to make sense of structured, but mostly unstructured, content. Some exhibitors were more NLP-oriented, focusing on concept extraction (such as SemanticV), others were more focused on providing scalable storage (essentially RDF storage). Most solutions include a data aggregator/unifier to combine multi-source data into a single store from which ontologies can be defined, on top of which sits the enhanced search engine. They concentrate on internal data within the enterprise rather than on using the Web as a resource. Those who built a web application on top of the data chose Flex as their framework (Metatomix).
    Of all the exhibitors, the ones that caught my attention were the Anzo suite (an open source project), ORDI and the AllegroGraph RDF store.
    Developed in Java by Cambridge Semantics, the Anzo suite, especially Anzo on the Web and the Anzo collaboration server, is the closest tool to CubicWeb: it provides a multi-source data server and an AJAX/HTML interface to develop semantic web applications and customize views of the data using a templating language. It is available as open source. The feature I found most interesting is an assistant that loads data into the application and then helps users define the data model based on that data. The internal representation of the content is totally transparent to the user; types are inferred by the application, as well as relations.
    I did not get a demo of ORDI; it was just mentioned to me as an open source equivalent to CubicWeb, which I am not too sure about after looking at their web site. It does data integration into RDF.
    The AllegroGraph RDF store is a potential candidate for another source type in CubicWeb. It is already supported by the Jena and Sesame frameworks. They developed a Python client API to attract Pythonistas into the Java world.
    They all agreed on one thing: SPARQL should be the standard query language. I briefly heard about the Knowledge Interchange Format (KIF), which seems to be an interesting representation of knowledge used for multilingual applications. If there was one buzzword to remember from the conference, I would choose ontology :)

  • The Web is reaching version 3

    2009/06/05 by Nicolas Chauvat
    http://www.logilab.org/image/9295?vid=download

    I presented CubicWeb at several conferences recently and I used the following as an introduction.

    Web version numbers:

    • version 0 = the internet links computers
    • version 1 = the web links documents
    • version 2 = web applications
    • version 3 = the semantic web links data [we are here!]
    • version 4 = more personalization and fixes for privacy and security problems
    • ... reach into physical world, bits of AI, etc.

    In his blog at MIT, Tim Berners-Lee calls version 0 the International Information Infrastructure, version 1 the World Wide Web and version 3 the Giant Global Graph. Read the details about the Giant Global Graph on his blog.


  • Nazca is out!

    2012/12/21 by Simon Chabot

    What is it for?

    Nazca is a Python library that aims to help you align data. But what does “align data” mean? For instance, you have a list of cities, described by their name and their country, and you would like to find their URIs on DBpedia to get more information about them, such as their longitude and latitude. If you have two or three cities, it can be done by hand, but it cannot if there are hundreds or thousands of cities. Nazca provides all the tools you need to do it.

    This blog post introduces how this library works and how it can be used. Once you have understood the main concepts behind it, don't hesitate to try Nazca online!

    Introduction

    The alignment process is divided into three main steps:

    1. Gather and format the data we want to align. In this step, we define two sets called the alignset and the targetset. The alignset contains our data, and the targetset contains the data to which we would like to link.
    2. Compute the similarity between the items gathered. We compute a distance matrix between the two sets according to a given distance.
    3. Find the items having a high similarity thanks to the distance matrix.

    Simple case

    1. Let's define alignset and targetset as simple Python lists.
    alignset = ['Victor Hugo', 'Albert Camus']
    targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
    
    2. Now, we have to compute the similarity between items. For that purpose, we use the Levenshtein distance [1], which is well suited to computing the distance between a few words. Such a function is provided in the nazca.distances module.

      The next step is to compute the distance matrix according to the Levenshtein distance. The result is given in the following table.

                     Albert Camus   Guillaume Apollinaire   Victor Hugo
      Victor Hugo    6              9                       0
      Albert Camus   0              8                       6
    3. The alignment process ends by reading the matrix and deciding that items with a value below a given threshold are identical.

    [1] Also called the edit distance, because the distance between two words is equal to the number of single-character edits required to change one word into the other.
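
    To make the edit-distance idea concrete, here is a plain character-level implementation. Note that Nazca's own levenshtein function appears to be token-aware (in the matrices of the next section, 'Paul Dupont' vs 'Dupond Paul' gives 1), so its values differ from this naive sketch:

    def char_levenshtein(s1, s2):
        """Plain character-level edit distance, computed by dynamic programming."""
        previous = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            current = [i]
            for j, c2 in enumerate(s2, 1):
                current.append(min(previous[j] + 1,                # deletion
                                   current[j - 1] + 1,             # insertion
                                   previous[j - 1] + (c1 != c2)))  # substitution
            previous = current
        return previous[-1]

    print(char_levenshtein('kitten', 'sitting'))  # prints 3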

    A more complex one

    The previous case was simple because we had only one attribute to align (the name), but it is frequent to have several attributes to align, such as the name, the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and items will be represented as nested lists. See the following example:

    alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
                ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
                ['Michel Edouard', '18-04-1881', 'Nantes']]
    targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
                 ['Edouard Michel', '18/04/1881', 'Nantes'],
                 ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
                 ['Dupont Paul', '01-12-2012', 'Paris']]
    

    In such a case, two distance functions are used: the Levenshtein one for the name and the city, and a temporal one for the birth date [2].

    The cdist function of nazca.distances enables us to compute those matrices:

    • For the names:
    >>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
    ...                    'levenshtein', matrix_normalized=False)
    array([[ 1.,  6.,  5.,  0.],
           [ 5.,  6.,  0.,  5.],
           [ 6.,  0.,  6.,  6.]], dtype=float32)
    
                     Dupond Paul   Edouard Michel   Dupuis Jacques   Dupont Paul
    Paul Dupont      1             6                5                0
    Jacques Dupuis   5             6                0                5
    Edouard Michel   6             0                6                6
    • For the birthdates:
    >>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
    ...                    'temporal', matrix_normalized=False)
    array([[     0.,  40294.,   2702.,   7780.],
           [  2702.,  42996.,      0.,   5078.],
           [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
    
                 14/08/1991   18/04/1881   06/01/1999   01-12-2012
    14-08-1991   0            40294        2702         7780
    06-01-1999   2702         42996        0            5078
    18-04-1881   40294        0            42996        48074
    • For the birthplaces:
    >>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
    ...                    'levenshtein', matrix_normalized=False)
    array([[ 0.,  4.,  8.,  0.],
           [ 8.,  9.,  0.,  8.],
           [ 4.,  0.,  9.,  4.]], dtype=float32)
    
                Paris   Nantes   Bressuire   Paris
    Paris       0       4        8           0
    Bressuire   8       9        0           8
    Nantes      4       0        9           4

    The next step is to gather those three matrices into a global one, called the global alignment matrix. Thus we have:

         0        1        2        3
    0    1        40304    2715     7780
    1    2715     43011    0        5091
    2    40304    0        43011    48084
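
    In this simple example the global matrix is just the element-wise sum of the three matrices above (Nazca may also let you weight each attribute, but a plain sum reproduces the table). A quick check with numpy, which Nazca already uses for its matrices:

    import numpy as np

    # The three distance matrices computed above (names, birth dates, birth places).
    names = np.array([[1., 6., 5., 0.],
                      [5., 6., 0., 5.],
                      [6., 0., 6., 6.]])
    dates = np.array([[0., 40294., 2702., 7780.],
                      [2702., 42996., 0., 5078.],
                      [40294., 0., 42996., 48074.]])
    places = np.array([[0., 4., 8., 0.],
                       [8., 9., 0., 8.],
                       [4., 0., 9., 4.]])

    # The element-wise sum gives the global alignment matrix shown above.
    print(names + dates + places)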

    Allowing for some misspellings (for example Dupont and Dupond are very close), the matching threshold can be set to 1 or 2. Thus we can see that item 0 in our alignset is the same as item 0 in the targetset, and likewise item 1 in the alignset and item 2 in the targetset: the links can be made!

    It's important to notice that even if item 0 of the alignset and item 3 of the targetset have the same name and the same birthplace, they are unlikely to be identical because of their very different birth dates.

    You may have noticed that working with matrices by hand, as I did for this example, is a little bit tedious. The good news is that Nazca does all this work for you: you just have to provide the sets and the distance functions. Another piece of good news is that the project comes with the functions needed to build the sets!

    [2] Provided in the nazca.distances module.

    Real applications

    Just before we start, we will assume the following imports have been done:

    from nazca import dataio as aldio   #Functions for input and output data
    from nazca import distances as ald  #Functions to compute the distances
    from nazca import normalize as aln  #Functions to normalize data
    from nazca import aligner as ala    #Functions to align data
    

    The Goncourt prize

    On Wikipedia, we can find the Goncourt prize winners, and we would like to establish a link between the winners and their URIs on DBpedia (let's imagine the Goncourt prize winners category does not exist on DBpedia).

    We simply copy/paste the winners list from Wikipedia into a file and replace all the separators (- and ,) with #. So the beginning of our file is:

    1903#John-Antoine Nau#Force ennemie (Plume)
    1904#Léon Frapié#La Maternelle (Albin Michel)
    1905#Claude Farrère#Les Civilisés (Paul Ollendorff)
    1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)

    When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (column number 1) both as identifier (we don't have a URI to use as identifier here) and as the attribute to align. This is expressed in Python with the following code:

    alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
    

    So, the beginning of our alignset is:

    >>> alignset[:3]
    [[u'John-Antoine Nau', u'John-Antoine Nau'],
     [u'Léon Frapié', u'Léon, Frapié'],
     [u'Claude Farrère', u'Claude Farrère']]
    

    Now, let's build the targetset with a SPARQL query against the DBpedia endpoint. We ask for the list of French novelists, described by their URI and their name in French:

    query = """
         SELECT ?writer, ?name WHERE {
           ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
           ?writer rdfs:label ?name.
           FILTER(lang(?name) = 'fr')
        }
     """
     targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)
    

    Both functions return nested lists as presented before. Now, we have to define the distance function to be used for the alignment. This is done with a Python dictionary where the keys are the columns to work on, and the values are the treatments to apply.

    treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
                                                  # (column 1)
    

    Finally, the last thing we have to do is call the alignall function:

    alignments = ala.alignall(alignset, targetset,
                           0.4, #This is the matching threshold
                           treatments,
                           mode=None, # We'll discuss that later
                           uniq=True #Get the best results only
                          )
    

    This function returns an iterator over the alignments found. You can see the results with the following code:

    for a, t in alignments:
        print '%s has been aligned onto %s' % (a, t)
    

    It may be important to apply some pre-processing to the data to align. For instance, names can be written in lower or upper case, with extra characters such as punctuation or unwanted information in parentheses, and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see the docstring for more information). So the treatments dictionary can be given as follows:

    def remove_after(string, sub):
        """ Remove the text after ``sub`` in ``string``
            >>> remove_after('I like cats and dogs', 'and')
            'I like cats'
            >>> remove_after('I like cats and dogs', '(')
            'I like cats and dogs'
        """
        try:
            return string[:string.lower().index(sub.lower())].strip()
        except ValueError:
            return string
    
    
    treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
                                        aln.simplify],
                      'metric': ald.levenshtein
                     }
                 }
    

    Cities alignment

    The previous case with the Goncourt prize winners was pretty simple because the number of items was small and the computation fast. But in a more realistic use case, the number of items to align may be huge (thousands or millions…). In such a case it's unthinkable to build the global alignment matrix, because it would be too big and it would take (at least...) a few days to complete the computation. So the idea is to make small groups of possibly similar data and compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data, both textual and numerical.
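
    To illustrate the grouping idea, here is a small sketch of the kind of neighbour search a kd-tree provides, written with scipy; it only shows the principle, and Nazca's own implementation and API may differ:

    from scipy.spatial import KDTree

    # Hypothetical (longitude, latitude) couples for a few alignset and targetset items.
    align_coords = [(2.35, 48.85), (-0.49, 46.84), (-1.55, 47.22)]
    target_coords = [(2.35, 48.86), (-1.55, 47.21), (-0.48, 46.84)]

    tree = KDTree(target_coords)
    for i, point in enumerate(align_coords):
        # Only the targetset items within a small radius are kept as candidates,
        # so the expensive string distances are computed on small groups only.
        candidates = tree.query_ball_point(point, r=0.1)
        print("alignset item %d -> candidate targetset items %s" % (i, candidates))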

    This is the code used, we will explain it:

    targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
                               """Any U, N, LONG, LAT WHERE X is Location, X name
                                  N, X country C, C name "France", X longitude
                                  LONG, X latitude LAT, X population > 1000, X
                                  feature_class "P", X cwuri U""",
                               indexes=[0, 1, (2, 3)])
    alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
                                 """prefix db-owl: <http://dbpedia.org/ontology/>
                                 prefix db-prop: <http://fr.dbpedia.org/property/>
                                 select ?ville, ?name, ?long, ?lat where {
                                  ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
                                  ?ville rdf:type db-owl:PopulatedPlace .
                                  ?ville db-owl:populationTotal ?population .
                                  ?ville foaf:name ?name .
                                  ?ville db-prop:longitude ?long .
                                  ?ville db-prop:latitude ?lat .
                                  FILTER (?population > 1000)
                                 }""",
                                 indexes=[0, 1, (2, 3)])
    
    
    treatments = {1: {'normalization': [aln.simplify],
                      'metric': ald.levenshtein,
                      'matrix_normalized': False
                     }
                 }
    results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
                           indexes=(2, 2), #On which data build the kdtree
                           mode='kdtree',  #The mode to use
                           uniq=True) #Return only the best results
    

    Let's explain the code. We have two data sets containing the cities we want to align: the first column is the identifier, the second is the name of the city, and the last one is the location of the city (longitude and latitude), gathered into a single tuple.

    In this example, we want to build a kd-tree on the (longitude, latitude) couple to divide our data into a few groups of candidates. This clustering is coarse, and is only used to reduce the number of potential candidates without losing any possible refined matches.

    So, in the next step, we define the treatments to apply. They are the same as before, but we ask for a non-normalized matrix (i.e. the raw output of the Levenshtein distance). Then we call the alignall function. indexes is a tuple giving the position of the point on which the kd-tree must be built, and mode is the mode used to find neighbours [3].

    Finally, uniq asks the function to return only the best candidate (i.e. the one having the shortest distance below the given threshold).

    The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second is the targetset one (it may take some time before the first tuples are yielded, because all the computation must be done first…).
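
    For instance, assuming results is the generator returned above, the pairs can be dumped to a CSV file (the file name is only an example):

    import csv

    # Write each (alignset identifier, targetset identifier) pair to a CSV file.
    with open('aligned_cities.csv', 'w') as fobj:
        writer = csv.writer(fobj)
        for align_uri, target_uri in results:
            writer.writerow([align_uri, target_uri])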

    [3] The available modes are kdtree, kmeans and minibatch for numerical data, and minhashing for textual data.

    Try it online!

    We have also built a little application on top of Nazca, using CubicWeb. This application provides a user interface for Nazca, helping you choose what you want to align. You can use SPARQL or RQL queries, as in the previous example, or import your own csv file [4]. Once you have chosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in Python! Once done, clicking Next step starts the alignment process. Wait a little bit, and you can either download the results as a csv or rdf file, or see the results directly online by choosing the html output.

    [4] Your csv file must be tab-separated for the moment…

  • Semantic Roundup - news gathered at NextMediaCamp

    2008/11/12 by Arthur Lutz
    http://barcamp.org/f/NMC.PNG

    For lack of time, here is the raw information gathered last Thursday evening at the NextMediaBarCamp:

    A BarCamp is quite fun, a few too many young dynamic executives in ties for my taste, but still. Among the three mini-conferences I attended, one was about the semantic web. It was partly led by Fabrice Epelboin, who writes for the French version of ReadWriteWeb, and I learned a few things. To sum up what I understood, there are two approaches to the semantic web:

    1. either take existing content and transform it into machine-readable content with killer algorithms, as OpenCalais does (top-down),
    2. or write semantic web content ourselves with microformats, so that machines will gradually read more and more of the web (bottom-up).
    http://microformats.org/wordpress/wp-content/themes/microformats/img/logo.gif

    In the second case, the difficulty is to make web editors' work easier so that they can easily publish semantic web content. To that end these tools were mentioned: Zemanta and Glue (from the AdaptiveBlue company).

    All this made me think that although CubicWeb already publishes microformats, it lacks an editing interface up to the challenge. For example, while typing an article, if the application has the contact details of a person mentioned, it would automagically create a relation to that element. Worth digging into...

    As for the other talks and the future of the media, according to the people there, it is rather grim: advertising-driven media, custom-made for the happiness of the individual, where money goes into content aggregators rather than into field journalists. For those interested, I also discovered at this BarCamp a "funny" little film that deals with this worrying subject.


  • EuroPython 2009

    2009/07/06 by Nicolas Chauvat
    http://www.logilab.org/file/9580/raw/europython_logo.png

    Once again Logilab sponsored the EuroPython conference. We would like to thank the organization team (especially John Pinner and Laura Creighton) for their hard work. The Conservatoire is a very central location in Birmingham, and walking around the city center and along the canals was nice. The website was helpful when preparing the trip and made it easy to find places to eat and stay. The conference program was full of talks about interesting topics.

    I presented CubicWeb and spent a large part of my talk explaining what the semantic web is and what features we need in the tools we will use to be part of that web of data. I insisted on the fact that CubicWeb is made of two parts, the web engine and the data repository, and that the repository can be used without the web engine. I demonstrated this with a TurboGears application that used the CubicWeb repository as its persistence layer. RQL in TurboGears! See my slides and Reinout Van Rees' write-up.
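
    To give an idea of what that looks like, here is a rough sketch of querying a CubicWeb repository with RQL from plain Python code; the dbapi module and the connect() arguments are recalled from memory and may not match your CubicWeb version, so treat them as assumptions:

    from cubicweb import dbapi

    # Connect to a (hypothetical) CubicWeb instance and run an RQL query,
    # exactly as the web engine would, but from any Python program.
    cnx = dbapi.connect(database='myinstance', login='admin', password='secret')
    cursor = cnx.cursor()
    rset = cursor.execute('Any B, T WHERE B is Book, B title T')
    for eid, title in rset:
        print(title)
    cnx.close()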

    Christian Tismer took over the development of Psyco a few months ago. He said he recently removed some bugs that were show stoppers, including one that was generating way too many recompilations. His new version looks very promising. Performance improved, long numbers are supported, 64bit support may become possible, generators work... and Stackless is about to be rebuilt on top of Psyco! Psyco 2.0 should be out today.

    I had a nice chat with Cosmin Basca about the Semantic Web. He suggested using Mako as a templating language for CubicWeb. Cosmin is doing his PhD at DERI and develops SurfRDF which is an Object-RDF mapper that wraps a SPARQL endpoint to provide "discoverable" objects. See his slides and Reinout Van Rees' summary of his talk.

    I saw a lightning talk about the Nagare framework, which refuses to use templating languages for the same reason we do not use them in CubicWeb. Is their h.something the right way of doing things? The example reminds me of the C++ concatenation operator. I am not really convinced by the continuation idea, since I have been a happy user for years of the reactor model implemented in frameworks like Twisted. Read the blog and documentation for more information.

    I had a chat with Jasper Op de Coul about Infrae's OAI Server and the work he did to manage RDF data in Subversion and a relational database before publishing it within a web app based on YUI. We commented on code that handles books and library catalogs. Part of my CubicWeb demo was about books in DBpedia and cubicweb-book. He gave me a nice link to the WorldCat API.

    Souheil Chelfouh showed me his work on Dolmen and Menhir. For several design problems and framework architecture issues, we compared the solutions offered by the Zope Toolkit library with the ones found by CubicWeb. I will have to read more about Martian and Grok to make sure I understand the details of that component architecture.

    I had a chat with Martijn Faassen about packaging Python modules. A one sentence summary would be that the Python community should agree on a meta-data format that describes packages and their dependencies, then let everyone use the tool he likes most to manage the installation and removal of software on his system. I hope the work done during the last PyConUS and led by Tarek Ziadé arrived at the same conclusion. Read David Cournapeau's blog entry about Python Packaging for a detailed explanation of why the meta-data format is the way to go. By the way, Martijn is the lead developer of Grok and Martian.

    Godefroid Chapelle and I talked a lot about the Zope Toolkit (ZTK) and CubicWeb. We compared the way the two frameworks deal with pluggable components. ZTK has adapters and a registry. CubicWeb does not use adapters as ZTK does, but has a view selection mechanism that requires a registry with more features than the one used in ZTK. The ZTK registry only has to match a tuple (Interface, Class) when looking for an adapter, whereas CubicWeb's registry has to find the views that can be applied to a result set by checking various properties (a sketch follows the list below):

    • interfaces: all items of first column implement the Calendar Interface,
    • dimensions: more than one line, more than two columns,
    • types: items of first column are numbers or dates,
    • form: form contains key XYZ that has a value lower than 10,
    • session: user is authenticated,
    • etc.
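
    To illustrate, here is a sketch of a CubicWeb view combining several such predicates. The predicate names (is_instance, authenticated_user) come from later CubicWeb releases and the entity type is invented, so this shows the idea rather than the exact API of the version discussed here:

    from cubicweb.view import EntityView
    from cubicweb.predicates import is_instance, authenticated_user

    class CalendarItemView(EntityView):
        __regid__ = 'calendaritem'
        # Selected only when the result set contains Event entities
        # and the request comes from an authenticated user.
        __select__ = EntityView.__select__ & is_instance('Event') & authenticated_user()

        def cell_call(self, row, col):
            entity = self.cw_rset.get_entity(row, col)
            self.w(entity.dc_title())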

    As for Grok and Martian, I will have to look into the details to make sure nothing evil is hiding there. I should also find time to compare zope.schema and yams and write about it on this blog.

    And if you want more information about the conference:


  • Simile-Widgets

    2008/08/07 by Nicolas Chauvat
    http://simile.mit.edu/images/logo.png

    While working on knowledge management and semantic web technologies, I came across the Simile project at MIT a few years back. I even had a demo of the Exhibit widget fetching then displaying data from our semantic web application framework back in 2006 at the Web2 track of Solutions Linux in Paris.

    Now that we are using these widgets when implementing web apps for clients, I was happy to see that the projects got a life of their own outside of MIT and became full-fledged free-software projects hosted on Google Code. See Simile-Widgets for more details and expect us to provide a Debian package soon unless someone does it first.

    Speaking of Debian, here is a nice demo of the Timeline widget presenting Debian history.

    http://beta.thumbalizr.com/app/thumbs/?src=/thumbs/onl/source/d2/d280583f143793f040bdacf44a39b0d5.png&w=320&q=0&enc=

  • Introduction to the Semantic Web introductory tutorial - SemWeb.Pro 2015

    2015/11/03 by Nicolas Chauvat

    Here is a short introductory text to read before attending the tutorial that will open the SemWeb.Pro 2015 conference day.

    1940 - Vannevar Bush

    Vannevar Bush is one of the pioneers of the Internet, notably through his article As We May Think, published in 1945 in the Atlantic Monthly magazine, in which he predicts the invention of hypertext, along the principles stated by Paul Otlet in his Traité de documentation. In this article, he describes a system called Memex, a sort of extension of human memory. This text lays the foundations of the computer and of computer networks. He envisions being able to store books, personal notes and ideas in it, and to associate them with one another in order to retrieve them easily. He already mentions the notions of links and trails, taking as a model the associative working of the human brain.

    1960 - Ted Nelson

    Having imagined a machine that would store data and make it available to everyone, everywhere, Ted Nelson started the Xanadu project in 1960 and tried, with more or less success, to put into practice what he calls "the original hypertext project".

    The principle of hypertext was taken up by many computing pioneers, such as Douglas Engelbart to design a human-machine interface in the 1960s, Bill Atkinson at Apple to develop HyperCard, and Tim Berners-Lee in 1989 to define the foundations of the World Wide Web.

    1970 - Internet

    The Internet is the global computer network accessible to the public. It is a network of networks, without any nerve center, made up of millions of networks, public as well as private, academic, commercial and governmental, themselves grouped, in 2014, into 47,000 autonomous systems. Information is transmitted over the Internet thanks to a standardized set of data transfer protocols, which allows the development of varied applications and services such as e-mail, instant messaging, peer-to-peer and the World Wide Web.

    Since the Internet was popularized by the advent of the World Wide Web (WWW), the two are sometimes confused by the lay public. The World Wide Web is, however, only one of the applications of the Internet.

    1990 - Tim Berners-Lee

    In 1989, Tim Berners-Lee was working at the European Organization for Nuclear Research (CERN), where he was connected to the research center's network and to the Internet. He then proposed to his management a project for a system to share electronic documents, which he had the idea of building by combining the principle of hypertext with the use of the Internet. He later declared on the subject: "I just took the principle of hypertext and linked it to the principles of TCP and DNS and then, boom!, it was the World Wide Web!". A few years later, in 1994, at the first edition of the WWW conference, he presented his ideas about how to expose the subjects and information that documents published on the web deal with, what he called the semantics of the web.

    2010 - The web of linked data

    It is from 2010 onwards that the number of datasets published on the web exploded. Read the page about Victor Hugo at the BnF, then try to interpret the equivalent data and compare it with DBpedia's.

    Sources: Wikimedia Foundation, W3C, Lod-Cloud, Xanadu Project


  • Another step towards the semantic web

    2007/02/06 by Nicolas Chauvat

    I co-organized the Web2.0 conference track that was held at Solutions Linux 2007 in Paris last week. While researching the talk I gave, I came across microformats and GRDDL. Both try to add semantics on top of (X)HTML.

    Microformats use the class attribute and the "invisibility" of `div` and `span` to insert semantic information, as in:
      <li class="vevent">
        <a class="url" href="http://www.solutionslinux.fr/">
         <span class="summary">Solutions Linux Web 2.0 Conference</span>: 
         <abbr class="dtstart" title="20070201T143000Z">February 1st 2:30pm</abbr>-
         <abbr class="dtend" title="20070201T18000Z">6pm</abbr>, at the 
         <span class="location">CNIT, La Défense</span>
        </a>
      </li>
    

    GRDDL information is added to the `head` of the XHTML page and points to an XSL that can extract the information from the page and output it as RDF.

    Another option is to add a `link` to the `head` of the page, pointing to an alternate representation such as an RDF-formatted one.

    Firefox has add-ons that help you spot semantic-enabled web pages: Tails detects microformats and Semantic Radar detects RDF. Operator is an option I found too invasive.

    As for my talk, it involved demonstrating CubicWeb, the engine behind logilab.org, and querying the data stored at logilab.org to reuse it with Exhibit.


  • Logilab at OSCON 2009

    2009/07/27 by Sandrine Ribeau
    http://assets.en.oreilly.com/1/event/27/oscon2009_oscon_11_years.gif

    OSCON, the Open Source CONvention, takes place every year and promotes open source technology. It is one of the meeting hubs of the growing open source community. This was the occasion for us to learn about new projects and to present CubicWeb during a BAYPIGgies meeting hosted by OSCON.

    http://www.openlina.com/templates/rhuk_milkyway/images/header_red_left.png

    I had the chance to talk with some of the folks working at OpenLina, where they presented LINA. LINA is a thin virtual layer that enables developers to write and compile code using ordinary Linux tools, then package that code into a single executable that runs on a variety of operating systems. LINA runs invisibly in the background, enabling the user to install and run LINAfied Linux applications as if they were native to that user's operating system. They were curious about CubicWeb and took it as a challenge to package it with LINA... maybe soon on LINA's applications list.

    Two open source projects caught my attention as potential semantic data publishers. The first one is FamilySearch, which provides a tool to search family history and genealogy. They are also working with Open Library to define a standard format to exchange citations. Democracy Lab provides an application to collect votes and build geographic statistics based on political interests. They will at some point publish data semantically so that their application data can be consumed.

    It was also the occasion for us to introduce CubicWeb to the BayPIGgies folks, with the same presentation as the one given at EuroPython 2009. I'd like to take the opportunity to answer a question I did not manage to answer at that time: how different is CubicWeb from Freebase Parallax in terms of interface and view filters? Before answering this question, let's detail what Freebase Parallax is.

    Freebase Parallax provides a new way to browse and explore data in Freebase. It allows browsing from a set of data to a related set of data. The interface enables aggregate visualizations. For instance, given the set of US presidents, different types of views can be applied, such as a timeline view, where the user can choose which start and end dates to use to draw the timeline. So generic views (which apply to any data) are customizable by the user.

    http://res.freebase.com/s/f64a2f0cc4534b2b17140fd169cee825a7ed7ddcefe0bf81570301c72a83c0a8/resources/images/freebase-logo.png

    The search powered by Parallax is very similar to CubicWeb's faceted search, except that Parallax provides the user with a list of suggested filters to add on top of the default one, and the user can even remove a filter. That is something we could think about for CubicWeb: provide a generated faceted search so that the user can decide which filters to use.

    Parallax also suggests topics related to the current data set, which eases navigation between sets of data. The main difference I could see between the view filters offered by Parallax and CubicWeb is that Parallax provides the same views for any type of data, whereas CubicWeb has specific views depending on the data type plus generic views that apply to any type of data. This is a nice web interface to browse data and it could be a good source of inspiration for CubicWeb.

    http://www.zgeek.com/forum/gallery/files/6/3/2/img_228_96x96.jpg

    During this talk, I mentioned that CubicWeb now understands SPARQL queries thanks to the fyzz parser.


  • SemWeb.Pro - first French Semantic Web conference, Jan 17/18 2011

    2010/09/20 by Nicolas Chauvat

    SemWeb.Pro, the first French conference dedicated to the Semantic Web, will take place in Paris on January 17/18, 2011.

    One day of talks, one day of tutorials.

    Want to grok the Web 3.0? Be there.

    Something you want to share? Call for papers ends on October 15, 2010.

    http://www.semweb.pro/semwebpro.png

  • One way to convert Eurovoc into plain SKOS

    2016/06/27 by Yann Voté

    This is the second part of an article where I show how to import the Eurovoc thesaurus from the European Union into an application using a plain SKOS data model. I've recently faced the problem of importing Eurovoc into CubicWeb using the SKOS cube, and the solution I've chosen is discussed here.

    The first part was an introduction to thesauri and SKOS.

    The whole article assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

    Difficulties with Eurovoc and SKOS

    Eurovoc

    Eurovoc is the main thesaurus covering European Union business domains. It is published and maintained by the EU commission. It is quite complex and big, structured as a tree of keywords.

    You can see Eurovoc keywords and browse the tree from the Eurovoc homepage using the link Browse the subject-oriented version.

    For example, when publishing statistics about education in the EU, you can tag the published data with the broadest keyword Education and communications. Or you can be more precise and use the following narrower keywords, in increasing order of preference: Education, Education policy, Education statistics.

    Problem: hierarchy of thesauri

    The EU commission uses SKOS to publish its Eurovoc thesaurus, so it should be straightforward to import Eurovoc into our own application. But things are not that simple...

    For some reason, Eurovoc uses a hierarchy of concept schemes. For example, Education and communications is a sub-concept scheme of Eurovoc (it is called a domain), and Education is a sub-concept scheme of Education and communications (it is called a micro-thesaurus). Education policy is (a label of) the first concept in this hierarchy.

    But with SKOS this is not possible: a concept scheme cannot be contained in another concept scheme.

    Possible solutions

    So, to import Eurovoc into our SKOS application without losing data, one solution is to turn sub-concept schemes into concepts. We have two strategies:

    • keep only one concept scheme (Eurovoc) and turn domains and micro-thesauri into concepts,
    • keep domains as concept schemes, drop Eurovoc concept scheme, and only turn micro-thesauri into concepts.

    Here we will discuss the latter solution.

    Let's get to work

    Eurovoc thesaurus can be downloaded at the following URL: http://publications.europa.eu/mdr/resource/thesaurus/eurovoc/skos/eurovoc_skos.zip

    The ZIP archive contains only one XML file named eurovoc_skos.rdf. Put it somewhere where you can find it easily.

    To read this file easily, we will use the RDFLib Python library. This library makes it really convenient to work with RDF data. It has only one drawback: it is very slow. Reading the whole Eurovoc thesaurus with it takes a very long time. Making the process faster is the first thing to consider for later improvements.

    Reading the Eurovoc thesaurus is as simple as creating an empty RDF Graph and parsing the file. As said above, this takes a long long time (from half an hour to two hours).

    import rdflib
    
    eurovoc_graph = rdflib.Graph()
    eurovoc_graph.parse('<path/to/eurovoc_skos.rdf>', format='xml')
    
    <Graph identifier=N52834ca3766d4e71b5e08d50788c5a13 (<class 'rdflib.graph.Graph'>)>
    

    We can see that Eurovoc contains more than 2 million triples.

    len(eurovoc_graph)
    
    2828910
    

    Now, before actually converting Eurovoc to plain SKOS, let's introduce some helper functions:

    • the first one, uriref(), will allow us to build RDFLib URIRef objects from simple prefixed URIs like skos:prefLabel or dcterms:title,
    • the second one, capitalized_eurovoc_domain(), is used to convert Eurovoc domain names, which are all uppercase (e.g. 32 EDUCATION ET COMMUNICATION), to a string where only the first letter is uppercase (e.g. 32 Education et communication).
    import re
    
    from rdflib import Literal, Namespace, RDF, URIRef
    from rdflib.namespace import DCTERMS, SKOS
    
    eu_ns = Namespace('http://eurovoc.europa.eu/schema#')
    thes_ns = Namespace('http://purl.org/iso25964/skos-thes#')
    
    prefixes = {
        'dcterms': DCTERMS,
        'skos': SKOS,
        'eu': eu_ns,
        'thes': thes_ns,
    }
    
    def uriref(prefixed_uri):
        prefix, value = prefixed_uri.split(':', 1)
        ns = prefixes[prefix]
        return ns[value]
    
    def capitalized_eurovoc_domain(domain):
        """Return the given Eurovoc domain name with only the first letter uppercase."""
        return re.sub(r'^(\d+\s)(.)(.+)$',
                      lambda m: u'{0}{1}{2}'.format(m.group(1), m.group(2).upper(), m.group(3).lower()),
                      domain, flags=re.UNICODE)
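
    A quick check of what these helpers return (output shown for illustration):

    >>> uriref('skos:prefLabel')
    rdflib.term.URIRef(u'http://www.w3.org/2004/02/skos/core#prefLabel')
    >>> capitalized_eurovoc_domain(u'32 EDUCATION ET COMMUNICATION')
    u'32 Education et communication'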
    

    Now the actual work. After defining variables to reference URIs, the loop will parse each triple in the original graph and:

    • discard it if it contains deprecated data,
    • if triple is like (<uri>, rdf:type, eu:Domain), replace it with (<uri>, rdf:type, skos:ConceptScheme),
    • if triple is like (<uri>, rdf:type, eu:MicroThesaurus), replace it with (<uri>, rdf:type, skos:Concept) and add triple (<uri>, skos:inScheme, <domain_uri>),
    • if triple is like (<uri>, rdf:type, eu:ThesaurusConcept), replace it with (<uri>, rdf:type, skos:Concept),
    • if triple is like (<uri>, skos:topConceptOf, <microthes_uri>), replace it with (<uri>, skos:broader, <microthes_uri>),
    • if triple is like (<uri>, skos:inScheme, <microthes_uri>), replace it with (<uri>, skos:inScheme, <domain_uri>),
    • keep triples like (<uri>, skos:prefLabel, <label_uri>), (<uri>, skos:altLabel, <label_uri>), and (<uri>, skos:broader, <concept_uri>),
    • discard all other non-deprecated triples.

    Note that, to replace a micro thesaurus with a domain, we have to build a mapping between each micro thesaurus and its containing domain (microthes2domain dict).

    This loop is also quite long.

    eurovoc_ref = URIRef(u'http://eurovoc.europa.eu/100141')
    deprecated_ref = URIRef(u'http://publications.europa.eu/resource/authority/status/deprecated')
    title_ref = uriref('dcterms:title')
    status_ref = uriref('thes:status')
    class_domain_ref = uriref('eu:Domain')
    rel_domain_ref = uriref('eu:domain')
    microthes_ref = uriref('eu:MicroThesaurus')
    thesconcept_ref = uriref('eu:ThesaurusConcept')
    concept_scheme_ref = uriref('skos:ConceptScheme')
    concept_ref = uriref('skos:Concept')
    pref_label_ref = uriref('skos:prefLabel')
    alt_label_ref = uriref('skos:altLabel')
    in_scheme_ref = uriref('skos:inScheme')
    broader_ref = uriref('skos:broader')
    top_concept_ref = uriref('skos:topConceptOf')
    
    microthes2domain = dict((mt, next(eurovoc_graph.objects(mt, uriref('eu:domain'))))
                            for mt in eurovoc_graph.subjects(RDF.type, uriref('eu:MicroThesaurus')))
    
    new_graph = rdflib.ConjunctiveGraph()
    for subj_ref, pred_ref, obj_ref in eurovoc_graph:
        if deprecated_ref in list(eurovoc_graph.objects(subj_ref, status_ref)):
            continue
        # Convert eu:Domain into a skos:ConceptScheme
        if obj_ref == class_domain_ref:
            new_graph.add((subj_ref, RDF.type, concept_scheme_ref))
            for title in eurovoc_graph.objects(subj_ref, pref_label_ref):
                if title.language == u'en':
                    new_graph.add((subj_ref, title_ref,
                                   Literal(capitalized_eurovoc_domain(title))))
                    break
        # Convert eu:MicroThesaurus into a skos:Concept
        elif obj_ref == microthes_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
            scheme_ref = next(eurovoc_graph.objects(subj_ref, rel_domain_ref))
            new_graph.add((subj_ref, in_scheme_ref, scheme_ref))
        # Convert eu:ThesaurusConcept into a skos:Concept
        elif obj_ref == thesconcept_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
        # Replace <concept> topConceptOf <MicroThesaurus> by <concept> broader <MicroThesaurus>
        elif pred_ref == top_concept_ref:
            new_graph.add((subj_ref, broader_ref, obj_ref))
        # Replace <concept> skos:inScheme <MicroThes> by <concept> skos:inScheme <Domain>
        elif pred_ref == in_scheme_ref and obj_ref in microthes2domain:
            new_graph.add((subj_ref, in_scheme_ref, microthes2domain[obj_ref]))
        # Keep label triples
        elif (subj_ref != eurovoc_ref and obj_ref != eurovoc_ref
              and pred_ref in (pref_label_ref, alt_label_ref)):
            new_graph.add((subj_ref, pred_ref, obj_ref))
        # Keep existing skos:broader relations and existing concepts
        elif pred_ref == broader_ref or obj_ref == concept_ref:
            new_graph.add((subj_ref, pred_ref, obj_ref))
    

    We can check that we now have far fewer triples than before.

    len(new_graph)
    
    388582
    

    Now we dump this new graph to disk. We choose the Turtle format as it is far more readable than RDF/XML for humans, and slightly faster to parse for machines. This file will contain plain SKOS data that can be directly imported into any application able to read SKOS.

    with open('eurovoc.n3', 'w') as f:
        new_graph.serialize(f, format='n3')
    

    With CubicWeb and the SKOS cube, it is a one-command step:

    cubicweb-ctl skos-import --cw-store=massive <instance_name> eurovoc.n3