This is the second part of an article where I show how to import the Eurovoc thesaurus from the European Union into an application using a plain SKOS data model. I recently faced the problem of importing Eurovoc into CubicWeb using the SKOS cube, and the solution I have chosen is discussed here.

The first part was an introduction to thesauri and SKOS.

The whole article assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

Difficulties with Eurovoc and SKOS


Eurovoc is the main thesaurus covering European Union business domains. It is published and maintained by the EU commission. It is large and fairly complex, and is structured as a tree of keywords.

You can see Eurovoc keywords and browse the tree from the Eurovoc homepage using the link Browse the subject-oriented version.

For example, when publishing statistics about education in the EU, you can tag the published data with the broadest keyword Education and communications. Or you can be more precise and use the following narrower keywords, in increasing order of preference: Education, Education policy, Education statistics.

Problem: hierarchy of thesauri

The EU commission uses SKOS to publish its Eurovoc thesaurus, so it should be straightforward to import Eurovoc into our own application. But things are not that simple...

For some reason, Eurovoc uses a hierarchy of concept schemes. For example, Education and communications is a sub-concept scheme of Eurovoc (it is called a domain), and Education is a sub-concept scheme of Education and communications (it is called a micro-thesaurus). Education policy is (a label of) the first concept in this hierarchy.

But with SKOS this is not possible: a concept scheme cannot be contained in another concept scheme.

Possible solutions

So to import Eurovoc into our SKOS application without losing data, one solution is to turn sub-concept schemes into concepts. We have two strategies:

  • keep only one concept scheme (Eurovoc) and turn domains and micro-thesauri into concepts,
  • keep domains as concept schemes, drop Eurovoc concept scheme, and only turn micro-thesauri into concepts.

Here we will discuss the latter solution.

Let's get to work

Eurovoc thesaurus can be downloaded at the following URL:

The ZIP archive contains only one XML file named eurovoc_skos.rdf. Put it somewhere where you can find it easily.

To read this file easily, we will use the RDFLib Python library. This library makes it really convenient to work with RDF data. It has only one drawback: it is very slow. Reading the whole Eurovoc thesaurus with it takes a very long time. Making the process faster is the first thing to consider for later improvements.

Reading the Eurovoc thesaurus is as simple as creating an empty RDF Graph and parsing the file. As said above, this takes a long long time (from half an hour to two hours).

import rdflib

eurovoc_graph = rdflib.Graph()
eurovoc_graph.parse('<path/to/eurovoc_skos.rdf>', format='xml')
<Graph identifier=N52834ca3766d4e71b5e08d50788c5a13 (<class 'rdflib.graph.Graph'>)>

We can see that Eurovoc contains more than 2 million triples.


Now, before actually converting Eurovoc to plain SKOS, let's introduce some helper functions:

  • the first one, uriref(), will allow us to build RDFLib URIRef objects from simple prefixed URIs like skos:prefLabel or dcterms:title,
  • the second one, capitalized_eurovoc_domain(), is used to convert Eurovoc domain names, which are all uppercase (e.g. 32 EDUCATION AND COMMUNICATION), to a string where only the first letter is uppercase (e.g. 32 Education and communication).

import re

from rdflib import Literal, Namespace, RDF, URIRef
from rdflib.namespace import DCTERMS, SKOS

eu_ns = Namespace('')
thes_ns = Namespace('')

prefixes = {
    'dcterms': DCTERMS,
    'skos': SKOS,
    'eu': eu_ns,
    'thes': thes_ns,
}

def uriref(prefixed_uri):
    prefix, value = prefixed_uri.split(':', 1)
    ns = prefixes[prefix]
    return ns[value]

def capitalized_eurovoc_domain(domain):
    """Return the given Eurovoc domain name with only the first letter uppercase."""
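    # Keep the numeric prefix, uppercase the first letter, lowercase the rest.
    # flags must be passed by keyword: the fourth positional argument is count.
    return re.sub(r'^(\d+\s)(.)(.+)$',
                  lambda m: u'{0}{1}{2}'.format(m.group(1),
                                                m.group(2).upper(),
                                                m.group(3).lower()),
                  domain, flags=re.UNICODE)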
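
As a quick check of the capitalization helper, here is a small standalone sketch. The function is restated so that the snippet runs on its own, and the sample domain name is an assumption made up for illustration, not a value read from the Eurovoc file. Note that `flags` is passed by keyword, because the fourth positional argument of `re.sub()` is `count`.

```python
import re

def capitalized_eurovoc_domain(domain):
    """Return the domain name with only the first letter uppercase."""
    return re.sub(r'^(\d+\s)(.)(.+)$',
                  lambda m: '{0}{1}{2}'.format(m.group(1),
                                               m.group(2).upper(),
                                               m.group(3).lower()),
                  domain, flags=re.UNICODE)

# Hypothetical input, shaped like "<number> <UPPERCASE NAME>"
sample = '32 EDUCATION AND COMMUNICATIONS'
result = capitalized_eurovoc_domain(sample)
print(result)

# A string without the numeric prefix does not match and is returned unchanged
unchanged = capitalized_eurovoc_domain('EDUCATION')
```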

Now the actual work. After using variables to reference URIs, the loop will parse each triple in original graph and:

  • discard it if it contains deprecated data,
  • if triple is like (<uri>, rdf:type, eu:Domain), replace it with (<uri>, rdf:type, skos:ConceptScheme),
  • if triple is like (<uri>, rdf:type, eu:MicroThesaurus), replace it with (<uri>, rdf:type, skos:Concept) and add triple (<uri>, skos:inScheme, <domain_uri>),
  • if triple is like (<uri>, rdf:type, eu:ThesaurusConcept), replace it with (<uri>, rdf:type, skos:Concept),
  • if triple is like (<uri>, skos:topConceptOf, <microthes_uri>), replace it with (<uri>, skos:broader, <microthes_uri>),
  • if triple is like (<uri>, skos:inScheme, <microthes_uri>), replace it with (<uri>, skos:inScheme, <domain_uri>),
  • keep triples like (<uri>, skos:prefLabel, <label_uri>), (<uri>, skos:altLabel, <label_uri>), and (<uri>, skos:broader, <concept_uri>),
  • discard all other non-deprecated triples.

Note that, to replace a micro thesaurus with a domain, we have to build a mapping between each micro thesaurus and its containing domain (microthes2domain dict).
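
To make the rewriting rules concrete before running them on the real graph, here is a minimal sketch that applies them to a handful of toy triples written as plain tuples of strings. All identifiers (`d:32`, `mt:3206`, `c:1`, `c:2`) are made up for illustration, and the handling of deprecated data is left out to keep it short.

```python
# Toy triples as (subject, predicate, object) tuples; identifiers are made up
triples = [
    ('d:32', 'rdf:type', 'eu:Domain'),
    ('mt:3206', 'rdf:type', 'eu:MicroThesaurus'),
    ('mt:3206', 'eu:domain', 'd:32'),
    ('c:1', 'rdf:type', 'eu:ThesaurusConcept'),
    ('c:1', 'skos:topConceptOf', 'mt:3206'),
    ('c:1', 'skos:inScheme', 'mt:3206'),
    ('c:2', 'rdf:type', 'eu:ThesaurusConcept'),
    ('c:2', 'skos:broader', 'c:1'),
    ('c:2', 'skos:inScheme', 'mt:3206'),
]

# Map each micro-thesaurus to the domain that contains it
micro_thesauri = {s for s, p, o in triples
                  if (p, o) == ('rdf:type', 'eu:MicroThesaurus')}
microthes2domain = {s: o for s, p, o in triples
                    if p == 'eu:domain' and s in micro_thesauri}

def rewrite(triple):
    """Return the list of plain SKOS triples replacing the given triple."""
    s, p, o = triple
    if p == 'rdf:type' and o == 'eu:Domain':
        return [(s, 'rdf:type', 'skos:ConceptScheme')]
    if p == 'rdf:type' and o == 'eu:MicroThesaurus':
        return [(s, 'rdf:type', 'skos:Concept'),
                (s, 'skos:inScheme', microthes2domain[s])]
    if p == 'rdf:type' and o == 'eu:ThesaurusConcept':
        return [(s, 'rdf:type', 'skos:Concept')]
    if p == 'skos:topConceptOf':
        return [(s, 'skos:broader', o)]
    if p == 'skos:inScheme' and o in microthes2domain:
        return [(s, 'skos:inScheme', microthes2domain[o])]
    if p in ('skos:prefLabel', 'skos:altLabel', 'skos:broader'):
        return [triple]
    return []  # every other triple is discarded

new_triples = {t for triple in triples for t in rewrite(triple)}
for t in sorted(new_triples):
    print(t)
```

In the result, the micro-thesaurus `mt:3206` has become an ordinary concept in the scheme `d:32`, and the former top concept `c:1` now has it as its broader concept.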

This loop also takes quite a long time to run.

eurovoc_ref = URIRef(u'')
deprecated_ref = URIRef(u'')
title_ref = uriref('dcterms:title')
status_ref = uriref('thes:status')
class_domain_ref = uriref('eu:Domain')
rel_domain_ref = uriref('eu:domain')
microthes_ref = uriref('eu:MicroThesaurus')
thesconcept_ref = uriref('eu:ThesaurusConcept')
concept_scheme_ref = uriref('skos:ConceptScheme')
concept_ref = uriref('skos:Concept')
pref_label_ref = uriref('skos:prefLabel')
alt_label_ref = uriref('skos:altLabel')
in_scheme_ref = uriref('skos:inScheme')
broader_ref = uriref('skos:broader')
top_concept_ref = uriref('skos:topConceptOf')

microthes2domain = dict((mt, next(eurovoc_graph.objects(mt, uriref('eu:domain'))))
                        for mt in eurovoc_graph.subjects(RDF.type, uriref('eu:MicroThesaurus')))

new_graph = rdflib.ConjunctiveGraph()
for subj_ref, pred_ref, obj_ref in eurovoc_graph:
    # Skip every triple whose subject is marked as deprecated
    if deprecated_ref in list(eurovoc_graph.objects(subj_ref, status_ref)):
        continue
    # Convert eu:Domain into a skos:ConceptScheme
    if obj_ref == class_domain_ref:
        new_graph.add((subj_ref, RDF.type, concept_scheme_ref))
        for title in eurovoc_graph.objects(subj_ref, pref_label_ref):
            if title.language == u'en':
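                # Assumed completion of a truncated line: use the capitalized
                # English label as the title of the concept scheme
                new_graph.add((subj_ref, title_ref,
                               Literal(capitalized_eurovoc_domain(title))))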
    # Convert eu:MicroThesaurus into a skos:Concept
    elif obj_ref == microthes_ref:
        new_graph.add((subj_ref, RDF.type, concept_ref))
        scheme_ref = next(eurovoc_graph.objects(subj_ref, rel_domain_ref))
        new_graph.add((subj_ref, in_scheme_ref, scheme_ref))
    # Convert eu:ThesaurusConcept into a skos:Concept
    elif obj_ref == thesconcept_ref:
        new_graph.add((subj_ref, RDF.type, concept_ref))
    # Replace <concept> topConceptOf <MicroThesaurus> by <concept> broader <MicroThesaurus>
    elif pred_ref == top_concept_ref:
        new_graph.add((subj_ref, broader_ref, obj_ref))
    # Replace <concept> skos:inScheme <MicroThes> by <concept> skos:inScheme <Domain>
    elif pred_ref == in_scheme_ref and obj_ref in microthes2domain:
        new_graph.add((subj_ref, in_scheme_ref, microthes2domain[obj_ref]))
    # Keep label triples
    elif (subj_ref != eurovoc_ref and obj_ref != eurovoc_ref
          and pred_ref in (pref_label_ref, alt_label_ref)):
        new_graph.add((subj_ref, pred_ref, obj_ref))
    # Keep existing skos:broader relations and existing concepts
    elif pred_ref == broader_ref or obj_ref == concept_ref:
        new_graph.add((subj_ref, pred_ref, obj_ref))

We can check that we now have far fewer triples than before.


Now we dump this new graph to disk. We choose the Turtle format as it is far more readable than RDF/XML for humans, and slightly faster to parse for machines. This file will contain plain SKOS data that can be directly imported into any application able to read SKOS.

# Open in binary mode: RDFLib serializers write encoded bytes to the stream
with open('eurovoc.n3', 'wb') as f:
    new_graph.serialize(f, format='n3')

With CubicWeb using the SKOS cube, it is a one command step:

cubicweb-ctl skos-import --cw-store=massive <instance_name> eurovoc.n3