Recently, I've faced the problem to import the European Union thesaurus, Eurovoc, into cubicweb using the SKOS cube. Eurovoc doesn't follow the SKOS data model and I'll show here how I managed to adapt Eurovoc to fit in SKOS.
This article is in two parts:
- this is the first part where I introduce what a thesaurus is and what SKOS is,
- the second part will show how to convert Eurovoc to plain SKOS.
The whole text assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.
A common need in our digital lives is to attach keywords to documents, web pages, pictures, and so on, so that search is easier. For example, you may want to add two keywords:
in a picture's metadata about this flower. If you have a large collection of flower pictures, this will make your life easier when you want to search for a particular species later on.
In this example, keywords are free: you can choose whatever keyword you want, very general or very specific. For example you may just use the keyword:
if you don't care about species. You are also free to use lowercase or uppercase letters, and to make typos...
On the other side, sometimes you have to select keywords from a list. Such a constrained list is called a controlled vocabulary. For instance, a very simple controlled vocabulary with only two keywords is the one about a person's gender:
- male (or man),
- female (or woman).
But there are more complex examples: think about how a library organizes books by themes: there are very general themes (eg. Science), then more and more specific ones (eg. Computer science -> Software -> Operating systems). There may also be synonyms (eg. Computing for Computer science) or referrals (eg. there may be a "see also" link between keywords Algebra and Geometry). Such a controlled vocabulary where keywords are organized in a tree structure, and with relations like synonym and referral, is called a thesaurus.
For the sake of simplicity, in the following we will call thesaurus any controlled vocabulary, even a simple one with two keywords like male/female.
SKOS, from the World Wide Web Consortium (W3C), is an ontology for the semantic web describing thesauri. To make it simple, it is a common data model for thesauri that can be used on the web. If you have a thesaurus and publish it on the web using SKOS, then anyone can understand how your thesaurus is organized.
SKOS is very versatile. You can use it to produce very simple thesauri (like male/female) and very complex ones, with a tree of keywords, even in multiple languages.
To cope with this complexity, SKOS data model splits each keyword into two entities: a concept and its labels. For example, the concept of a male person have multiple labels: male and man in English, homme and masculin in French. The concept of a lily flower also has multiple labels: lily in English, lilium in Latin, lys in French.
Among all labels for a given concept, some can be preferred, while others are alternative. There may be only one preferred label per language. In the person's gender example, man may be the preferred label in English and male an alternative one, while in French homme would be the preferred label and masculin and alternative one. In the flower example, lily (resp. lys) is the preferred label in English (resp. French), and lilium is an alternative label in Latin (no preferred label in Latin).
And of course, in SKOS, it is possible to say that a concept is broader than another one (just like topic Science is broader than topic Computer science).
So to summarize, in SKOS, a thesaurus is a tree of concepts, and each concept have one or more labels, preferred or alternative. A thesaurus is also called a concept scheme in SKOS.
Also, please note that SKOS data model is slightly more complicated than what we've shown here, but this will be sufficient for our purpose.
In order to publish a thesaurus in RDF using SKOS ontology, SKOS introduces the "skos:" namespace associated to the following URI: http://www.w3.org/2004/02/skos/core#.
Within that namespace, SKOS defines some classes and predicates corresponding to what has been described above. For example:
- the triple (<uri>, rdf:type, skos:ConceptScheme) says that <uri> belongs to class skos:ConceptScheme (that is, is a concept scheme),
- the triple (<uri>, rdf:type, skos:Concept) says that <uri> belongs to class skos:Concept (that is, is a concept),
- the triple (<uri>, skos:prefLabel, <literal>) says that <literal> is a preferred label for concept <uri>,
- the triple (<uri>, skos:altLabel, <literal>) says that <literal> is an alternative label for concept <uri>,
- the triple (<uri1>, skos:broader, <uri2>) says that concept <uri2> is a broder concept of <uri1>.