Blog entries by Yann Voté [4]
  • A non-numeric Pandas example with Geonames

    2017/02/20 by Yann Voté

    The aim of this 2-parts blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numeric oriented) tutorials.

    We will illustrate these techniques with Geonames data : extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statictics. We will prove that pandas can be used in a wide range of cases beyond numerical analysis.

    While the first part was an introduction to Geonames data, this second part contains the real work with Pandas. Please read part 1 if you are not familiar with Geonames data.

    The project

    The goal is to read Geonames data, from allCountries.txt and alternateNames.txt files, combine them, and produce a new CSV file with the following columns:

    • gid, the Geonames id,
    • frname, French name when available,
    • gname, Geonames main name,
    • fclass, feature class,
    • fcode, feature code,
    • parent_gid, Geonames id of the parent location,
    • lat, latitude,
    • lng, longitude.

    Also we don't want to use too much RAM during the process. 1 GB is a maximum.

    You may think that this new CSV file is a low-level objective, not very interesting. But it can be a step into a larger process. For example one can build a local Geonames database that can then be used by tools like Elastic Search or Solr to provide easy searching and auto-completion in French.

    Another interesting feature is provided by the parent_gid column. One can use this column to build a tree of locations, or a SKOS thesaurus of concepts.

    The method

    Before diving into technical details, let us first look at the big picture, the overall process to achieve the aforementioned goal.

    Considering the wanted columns, we can see the following differences with allCountries.txt:

    • we are not interested in some original columns (population, elevation, ...),
    • the column frname is new and will come from alternateNames.txt, thanks to the Geonames id,
    • the column parent_gid is new too. We must derive this information from the adminX_code (X=1, 2, 3, 4) columns in allCountries.txt.

    This column deserves an example. Look at Arrondissement de toulouse.

    feature_code: ADM3, country_code: FR, admin1_code: 76, admin2_code: 31, admin3_code: 313
    

    To find its parent, we must look at a place with the following properties,

    feature_code: ADM2, country_code: FR, admin1_code: 76, admin2_code: 31
    

    and there is only one place with such properties, Geonames id 3,013,767, namely Département de la Haute-Garonne. Thus we must find a way to derive the Geonames id from the feature_code, country_code and adminX_code columns. Pandas will make this easy for us.


    Let's get to work. And of course, we must first import the Pandas library to make it available. We also import the csv module because we will need it to perform basic operations on CSV files.

    >>> import pandas as pd
    >>> import csv
    

    The French names step

    Loading file

    We begin by loading data from the alternateNames.txt file.

    And indeed, that's a big file. To save memory we won't load the whole file. Recall from previous part, that the alternateNames.txt file provides the following columns in order.

    alternateNameId   : the id of this alternate name, int
    geonameid         : geonameId referring to id in table 'geoname', int
    isolanguage       : iso 639 language code 2- or 3-characters; (...)
    alternate name    : alternate name or name variant, varchar(400)
    isPreferredName   : '1', if this alternate name is an official/preferred name
    isShortName       : '1', if this is a short name like 'California' for 'State of California'
    isColloquial      : '1', if this alternate name is a colloquial or slang term
    isHistoric        : '1', if this alternate name is historic and was used in the pastq
    

    For our purpose we are only interested in columns geonameid (so that we can find corresponding place in allCountries.txt file), isolanguage (so that we can keep only French names), alternate name (of course), and isPreferredName (because we want to keep preferred names when possible).

    Another way to save memory is to filter the file before loading it. Indeed, it's a better practice to load a smaller dataset (filter before loading) than to load a big one and then filter it after loading. It's important to keep those things in mind when you are working with large datasets. So in our case, it is cleaner (but slower) to prepare the CSV file before, keeping only French names.

    >>> # After restarting Python
    >>> with open('alternateNames.txt') as f:
    ...     reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    ...     with open('frAltNames.txt', 'w') as g:
    ...         writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
    ...                             escapechar='\\')
    ...         for row in reader:
    ...             if row[2] == u'fr':
    ...                 writer.writerow(row)
    

    To load a CSV file, Pandas provides the read_csv function. It returns a dataframe populated with data from a CSV file.

    >>> with open('frAltNames.txt') as altf:
    ...     frnames = pd.read_csv(
    ...         altf, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(1, 3, 4), index_col=0,
    ...         names=('gid', 'frname', 'pref'),
    ...         dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
    ...         true_values=['1'], false_values=[u''], na_filter=False,
    ...         skipinitialspace=True, error_bad_lines=False
    ...     )
    ...
    

    The read_csv function is quite complex and it takes some time to use it rightly. In our cases, the first few parameters are self-explanatory: encoding for the file encoding, sep for the CSV separator, quoting for the quoting protocol (here there is none), and header for lines to be considered as label lines (here there is none).

    usecols tells pandas to load only the specified columns. Beware that indices start at 0, so column 1 is the second column in the file (geonameid in this case).

    index_col says to use one of the columns as a row index (instead of creating a new index from scratch). Note that the number for index_col is relative to usecols. In other words, 0 means the first column of usecols, not the first column of the file.

    names gives labels for columns (instead of using integers from 0). Hence we can extract the last column with frnames['pref'] (instead of frnames[3]). Please note, that this parameter is not compatible with headers=0 for example (in that case, the first line is used to label columns).

    The dtype parameter is interesting. It allows you to specify one type per column. When possible, prefer to use NumPy types to save memory (eg np.uint32, np.bool_).

    Since we want boolean values for the pref column, we can tell Pandas to convert '1' strings to True and empty strings ('') to False. That is the point of using parameters true_values and false_values.

    But, by default, Pandas detect empty strings and affect them np.nan (NaN means not a number). To prevent this behavior, setting na_filter to False will leave empty strings as empty strings. Thus, empty strings in the pref column will be converted to False (thanks to the false_values parameter and to the 'bool_' data type).

    Finally, skipinitialspace tells Pandas to left-strip strings to remove leading spaces. Without it, a line like a, b (commas-separated) would give values 'a' and ' b' (note leading space before b).

    And lastly, error_bad_lines make pandas ignore lines he cannot understand (eg. wrong number of columns). The default behavior is to raise an exception.

    There are many more parameters to this function, which is much more powerful than the simple reader object from the csv module. Please, refer to the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for a full list of options. For example, the header parameter can accept a list of integers, like [1, 3, 4], saying that lines at position 1, 3 and 4 are label lines.

    >>> frnames.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 55684 entries, 18918 to 11441793
    Data columns (total 2 columns):
    frname     55684 non-null object
    pref       55684 non-null bool
    dtypes: bool(1), object(1)
    memory usage: 6.6 MB
    

    We can see here that our dataframe contains 55,684 entries and that it takes about 6.6 MB in memory.

    Let's look at entry number 3,013,767 for the Haute-Garonne department.

    >>> frnames.loc[3013767]
                                     frname    pref
    gid
    3013767  Département de la Haute-Garonne  False
    3013767                    Haute-Garonne   True
    

    Removing duplicated indices

    There is one last thing to do with the frnames dataframe: make its row indices uniques. Indeed, we can see from the previous example that there are two entries for the Haute-Garonne department. We need to keep only one, and, when possible, the preferred form (when the pref column is True).

    This is not always possible: we can see at the beginning of the dataframe (use the head method) that there is no preferred French name for Rwanda.

    >>> frnames.head()
                         frname   pref
    gid
    18918              Protaras   True
    49518  République du Rwanda  False
    49518                Rwanda  False
    50360       Woqooyi Galbeed  False
    51230              Togdheer  False
    

    In such a case, we will take one of the two lines at random. Maybe a clever rule to decide which one to keep would be useful (like involving other columns), but for the purpose of this tutorial it does not matter which one is kept.

    Back to our problem, how to make indices uniques ? Pandas provides the duplicated method on Index objects. This method returns a boolean NumPy 1d-array (a vector), the size of which is the number of entries. So, since our dataframe has 55,684 entries, the length of the returned vector is 55,684.

    >>> dups_idx = frnames.index.duplicated()
    >>> dups_idx
    array([False, False,  True, ..., False, False, False], dtype=bool)
    >>> len(dups_idx)
    55684
    

    The meaning is simple: when you encounter True, it means that the index at this position is a duplicate of a previously encountered index. For example, we can see that the third value in dups_idx is True. And indeed, the third line of frnames has an index (49,518) which is a duplicate of the second line.

    So duplicated is meant to mark as True duplicated indices and to keep only the first one (there is an optional parameter to change this: read the doc). How do we make sure that the first entry is the preferred entry ? Sorting the dataframe of course! We can sort a table by a column (or by a list of columns), using the sort_values method. We give ascending=False because True > False (that's a philosophical question!), and inplace=True to sort the dataframe in place, thus we do not create a copy (still thinking about memory usage).

    >>> frnames.sort_values('pref', ascending=False, inplace=True)
    >>> frnames.head()
                                   frname  pref
    gid
    18918                        Protaras  True
    8029850  Hôtel de ville de Copenhague  True
    8127417                        Kalmar  True
    8127361                     Sundsvall  True
    8127291                        Örebro  True
    >>> frnames.loc[3013767]
                                      frname   pref
    gid
    3013767                    Haute-Garonne   True
    3013767  Département de la Haute-Garonne  False
    

    Great! All preferred names are now first in the dataframe. We can then use the duplicated index method to filter out duplicated entries.

    >>> frnames = frnames[~(frnames.index.duplicated())]
    >>> frnames.loc[3013767]
    frname     Haute-Garonne
    pref                True
    Name: 3013767, dtype: object
    >>> frnames.loc[49518]
    frname     Rwanda
    pref        False
    Name: 49518, dtype: object
    >>> frnames.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 49047 entries, 18918 to 11441793
    Data columns (total 2 columns):
    frname     49047 non-null object
    pref       49047 non-null bool
    dtypes: bool(1), object(1)
    memory usage: 5.7 MB
    

    We end up with 49,047 French names. Notice the ~ in the filter expression ? That's because we want to keep the first entry for each duplicated index, and the first entry is False in the vector returned by duplicated.

    Summary for the french names step

    One last thing to do is to remove the pref column which won't be used anymore. We already know how to do it.

    >>> frnames.drop('pref', axis=1, inplace=True)
    

    There has been a lot of talking until now. But if we summarize, very few commands where needed to obtain this table with only two columns (gid and frname):

    1. Prepare a smaller file so there is less data to load (keep only french names).

      with open('alternateNames.txt') as f:
          reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
          with open('frAltNames.txt', 'w') as g:
              writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
                                  escapechar='\\')
              for row in reader:
                  if row[2] == u'fr':
                      writer.writerow(row)
      
    2. Load the file into Pandas.

      with open('frAltNames.txt') as altf:
          frnames = pd.read_csv(
              altf, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(1, 3, 4), index_col=0,
              names=('gid', 'frname', 'pref'),
              dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
              true_values=['1'], false_values=[u''], na_filter=False,
              skipinitialspace=True, error_bad_lines=False
          )
      
    3. Sort on the pref column and remove duplicated indices.

      frnames.sort_values('pref', ascending=False, inplace=True)
      frnames = frnames[~(frnames.index.duplicated())]
      
    4. Remove the pref column.

      frnames.drop('pref', axis=1, inplace=True)
      

    Simple, isn't it ? We'll keep this dataframe for later use. Pandas will make it a breeze to merge it with the main dataframe coming from allCountries.txt to create the new frname column.

    The administrative tree step

    But for now, let's look at the second problem: how to derive a gid from a feature_code, a country_code, and a bunch of adminX_code (X=1, 2, 3, 4).

    Loading file

    First we need the administrative part of the file allCountries.txt, that is all places with the A feature class.

    Of course we can load the whole file into Pandas and then filter to keep only A-class entries, but now you know that this is memory intensive (and this file is much bigger than alternateNames.txt). So we'll be more clever and first prepare a smaller file.

    >>> with open('allCountries.txt') as f:
    ...     reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    ...     with open('adm_geonames.txt', 'w') as g:
    ...         writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\')
    ...         for row in reader:
    ...             if row[6] == 'A':
    ...                 writer.writerow(row)
    

    Now we load this file into pandas. Remembering our goal, the resulting dataframe will only be used to compute gid from the code columns. So all columns we need are geonameid, feature_code, country_code, admin1_code, admin2_code, admin3_code, and admin4_code.

    What about data types. All these columns are strings except for geonameid which is an integer. Since it will be painful to type <colname>: str for the dtype parameter dictionary, let's use the fromkeys constructor instead.

    >>> d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
    ...                          'code4'], str)
    >>> d_types['gid'] = 'uint32'  # Shorter geonameid in gid
    

    We can now load the file.

    >>> with open('adm_geonames.txt') as admf:
    ...     admgids = pd.read_csv(
    ...         admf, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
    ...         names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
    ...         dtype=d_types, na_values='', keep_default_na=False,
    ...         error_bad_lines=False
    ...     )
    

    We recognize most parameters in this instruction. Notice that we didn't use index_col=0: gid is now a normal column, not an index, and Pandas will generate automatically a row index with integers starting at 0.

    Two new parameters are na_values and keep_default_na. The first one gives Pandas additional strings to be considerer as NaN (Not a Number). The astute reader would say that empty strings ('') are already considered by Pandas as NaN, and he would be right.

    But here comes the second parameter which, if set to False, tells Pandas to forget about its default list of strings recognized as NaN. The default list contains a bunch of strings like 'N/A' or '#NA' or, this is interesting, simply 'NA'. But 'NA' is used in allCountries.txt as country code for Namibia. If we keep the default list, this whole country will be ignored. So the combination of these two parameters tells Pandas to:

    • reset its default list of NaN strings,
    • use '' as a NaN string, which, consequently, will be the only one.

    That's it for NaN. What can Pandas tells us about our dataframe ?

    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 369277 entries, 0 to 369276
    Data columns (total 7 columns):
    gid      369277 non-null uint32
    fcode    369273 non-null object
    code0    369272 non-null object
    code1    369197 non-null object
    code2    288494 non-null object
    code3    215908 non-null object
    code4    164899 non-null object
    dtypes: object(6), uint32(1)
    memory usage: 128.8 MB
    

    Note the RangeIndex which Pandas has created for us. The dataframe takes about 130 MB of memory because it has a lots of object columns. Apart from that, everything looks good.

    Replacing values

    One thing we want to do before going on, is to replace all 'PCL<X>' values in the fcode column with just PCL. This will make our life easier later when searching in this dataframe.

    >>> pd.unique(admgids.fcode)
    array([u'ADMD', u'ADM1', u'PCLI', u'ADM2', u'ADM3', u'ADM2H', u'PCLD',
           u'ADM4', u'ZN', u'ADM1H', u'PCLH', u'TERR', u'ADM3H', u'PRSH',
           u'PCLIX', u'ADM5', u'ADMDH', nan, u'PCLS', u'LTER', u'ADM4H',
           u'ZNB', u'PCLF', u'PCL'], dtype=object)
    

    Pandas provides the replace method for that.

    >>> admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True, inplace=True)
    >>> pd.unique(admgids.fcode)
    array([u'ADMD', u'ADM1', u'PCL', u'ADM2', u'ADM3', u'ADM2H', u'ADM4',
           u'ZN', u'ADM1H', u'TERR', u'ADM3H', u'PRSH', u'ADM5', u'ADMDH', nan,
           u'LTER', u'ADM4H', u'ZNB'], dtype=object)
    

    The replace method has lot of different signatures, refer to the Pandas documentation for a comprehensive description. Here, with the dictionary, we hare saying to look only in column fcode, and in this column, replace strings matching the regular expression with the given value. Since we are using regular expressions, the regex parameter must be set to True.

    And as usual, the inplace parameter avoids creation of a copy.

    Multi-indexing

    Remember the goal: we want to be able to get the gid from the others columns. Well, dear reader, you'll be happy to know that Pandas allows an index to be composite, to be composed of multiple columns, what Pandas calls a MultiIndex.

    To put it simply, a multi-index is useful when you have hierarchical indices Consider for example the following table.

    lvl1 lvl2 N S
    A AA 11 1x1
    A AB 12 1x2
    B BA 21 2x1
    B BB 22 2x2
    B BC 33 3x3

    If we were to load such a table in a Pandas dataframe df (exercise: do it), we would be able to use it as follows.

    >>> df.loc['A']
         N    S
    AA  11  1x1
    AB  12  1x2
    >>> df.loc['B']
         N    S
    BA  21  2x1
    BB  22  2x2
    BC  23  2x3
    >>> df.loc[('A',), 'S']
    AA    1x1
    AB    1x2
    Name: S, dtype: object
    >>> df.loc[('A', 'AB'), 'S']
    '1x2'
    >>> df.loc['B', 'BB']
    N     22
    S    2x2
    Name: (B, BB), dtype: object
    >>> df.loc[('B', 'BB')]
    N     22
    S    2x2
    Name: (B, BB), dtype: object
    

    So, basically, we can query the multi-index using tuples (and we can omit tuples if column indexing is not involved). But must importantly we can query a multi-index partially: df.loc['A'] returns a sub-dataframe with one level of index gone.

    Back to our subject, we clearly have hierarchical information in our code columns: the country code, then the admin1 code, then the admin2 code, and so on. Moreover, we can put the feature code at the top. But how to do that ?

    It can't be more simpler. The main issue is to find the correct method: set_index.

    >>> admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
    ...                    inplace=True)
    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 369277 entries, (ADMD, AD, 00, nan, nan, nan) to (PCLH, nan, nan, nan, nan, nan)
    Data columns (total 1 columns):
    gid      369277 non-null uint32
    dtypes: uint32(1)
    memory usage: 21.9 MB
    >>> admgids.head()
                                             gid
    fcode code0 code1 code2 code3 code4
    ADMD  AD    00    NaN   NaN   NaN    3038817
                                  NaN    3039039
    ADM1  AD    06    NaN   NaN   NaN    3039162
                05    NaN   NaN   NaN    3039676
                04    NaN   NaN   NaN    3040131
    

    That's not really it, because Pandas has kept the original dataframe order, so multi-index is all messed up. No problem, there is the sort_index method.

    >>> admgids.sort_index(inplace=True)
    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 369277 entries, (nan, CA, 10, 02, 94235, nan) to (ZNB, SY, 00, nan, nan, nan)
    Data columns (total 1 columns):
    gid      369277 non-null uint32
    dtypes: uint32(1)
    memory usage: 21.9 MB
    >>> admgids.head()
                                             gid
    fcode code0 code1 code2 code3 code4
    NaN   CA    10    02    94235 NaN    6544163
          CN    NaN   NaN   NaN   NaN    6255000
          CY    04    NaN   NaN   NaN    6640324
          MX    16    NaN   NaN   NaN    6618819
    ADM1  AD    02    NaN   NaN   NaN    3041203
    

    Much better. We can even see that there are three entries without a feature code.

    Let's see if all this effort was worth it. Can we fulfill our goal ? Can we get a gid from a bunch of codes ? Can we get the Haute-Garonne gid ?

    >>> admgids.loc['ADM2', 'FR', '76', '31']
                     gid
    code3 code4
    NaN   NaN    3013767
    

    And for the Toulouse ADM4 ?

    >>> admgids.loc['ADM4', 'FR', '76', '31', '313', '31555']
                                             gid
    fcode code0 code1 code2 code3 code4
    ADM4  FR    76    31    313   31555  6453974
    

    Cheers! You've deserved a glass of wine!

    Summary for the administrative data step

    Before moving on four our grand finale, let's summarize what have been done to prepare administrative data.

    1. We've prepared a smaller file with only administrative data.

      with open('allCountries.txt') as f:
          reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
          with open('adm_geonames.txt', 'w') as g:
              writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
                                  escapechar='\\')
              for row in reader:
                  if row[6] == 'A':
                      writer.writerow(row)
      
    2. We've loaded the file into Pandas.

      d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
                               'code4'], str)
      d_types['gid'] = 'uint32'
      with open('adm_geonames.txt') as admf:
          admgids = pd.read_csv(
              admf, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
              names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
              dtype=d_types, na_values='', keep_default_na=False,
              error_bad_lines=False
          )
      
    3. We've replaced all 'PCL<XX>' values with just 'PCL' in the fcode column.

      admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True,
                      inplace=True)
      
    4. Then we've created a multi-index with columns fcode, code0, code1, code2, code3, code4.

      admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
                         inplace=True)
      admgids.sort_index(inplace=True)
      

    Putting it all together

    Time has finally come to load the main file now: allCountries.txt. On one hand, we will then be able to use the frnames dataframe to get French name for each entry and populate the frname column, and on the other hand we will use the admgids dataframe to compute parent gid for each line too.

    Loading file

    On my computer, loading the whole allCountries.txt at once takes 5.5 GB of memory, clearly too much! And in this case, there is no trick to reduce the size of the file first: we want all the data.

    Pandas can help us with the chunk_size parameter to the read_csv function. It allows us to read the file chunk by chunk (it returns an iterator). The idea is to first create an empty CSV file for our final data, then read each chunk, perform data manipulation (that is add frname and parent_gid) of this chunk, and append the data to the file.

    So we load data into Pandas the usual way. The only difference is that we add a new parameter chunksize with value 1,000,000. You can choose a smaller or a larger number of rows depending of your memory limit.

    >>> d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
    ...                          'code2', 'code3', 'code4'], str)
    >>> d_types['gid'] = 'uint32'
    >>> d_types['lat'] = 'float16'
    >>> d_types['lng'] = 'float16'
    >>> with open('allCountries.txt') as geof:
    ...     reader = pd.read_csv(
    ...         geof, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
    ...         names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
    ...                'code0', 'code1', 'code2', 'code3', 'code4'),
    ...         dtype=d_types, index_col=0,
    ...         na_values='', keep_default_na=False,
    ...         chunksize=1000000, error_bad_lines=False
    ...     )
    ...     for chunk in reader:
    ...         pass  # We will put here code to work on each chunk
    

    Joining tables

    Pandas can perform a JOIN on two dataframes, much like in SQL. The function to do so is merge.

    ...     for chunk in reader:
    ...         chunk = pd.merge(chunk, frnames, how='left',
    ...                          left_index=True, right_index=True)
    

    merge expects first the two dataframes to be joined. The how parameter tells what type of JOIN to perform (it can be left, right, inner, ...). Here we wants to keep all lines in chunk, which is the first parameter, so it is 'left' (if chunk was the second parameter, it would have been 'right').

    left_index=True and right_index=True tell Pandas that the pivot column is the index on each table. Indeed, in our case the gid index will be use in both tables to compute the merge. If in one table, for example the right one, the pivot column is not the index, one can set right_index=False and add parameter right_on='<column_name>' (same parameter exists for left table).

    Additionally, If there are name clashes (same column name in both tables), one can also use the suffixes parameter. For example suffixes=('_first', '_second').

    And that's all we need to add the frname column. Simple isn't it ?

    Computing parent gid

    The parent_gid column is trickier. We'll delegate computation of the gid from administrative codes to a separate function. But first let's define two reverse dictionaries linking the fcode column to a level number

    >>> level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
    >>> fcode_level = dict((v, k) for k, v in level_fcode.items())
    

    And here is the function computing the Geonames id from administrative codes.

    >>> def geonameid_from_codes(level, **codes):
    ...     """Return the Geoname id of the *administrative* place with the
    ...     given information.
    ...
    ...     Return ``None`` if there is no match.
    ...
    ...     ``level`` is an integer from 0 to 4. 0 means we are looking for a
    ...     political entity (``PCL<X>`` feature code), 1 for a first-level
    ...     administrative division (``ADM1``), and so on until fourth-level
    ...     (``ADM4``).
    ...
    ...     Then user must provide at least one of the ``code0``, ...,
    ...     ``code4`` keyword parameters, depending on ``level``.
    ...
    ...     Examples::
    ...
    ...         >>> geonameid_from_codes(level=0, code0='RE')
    ...         935317
    ...         >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
    ...         ...                      code2='974', code3='9742')
    ...         935213
    ...         >>> geonameid_from_codes(level=0, code0='AB')  # None
    ...
    ...     """
    ...     try:
    ...         idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
    ...     except KeyError:
    ...         raise ValueError('Not enough codeX parameters for level {0}'.format(level))
    ...     idx = (level_fcode[level],) + idx
    ...     try:
    ...         return admgids.loc[idx, 'gid'].values[0]
    ...     except (KeyError, TypeError):
    ...         return None
    ...
    

    A few comments about this function: the first part builds a tuple idx with the given codes. Then this idx is used as an index value in the admgids dataframe to find the matching gid.

    We also need another function which will compute the parent gid for a Pandas row.

    >>> from six import string_types
    >>> def parent_geonameid(row):
    ...     """Return the Geoname id of the parent of the given Pandas row.
    ...
    ...     Return ``None`` if we can't find the parent's gid.
    ...     """
    ...     # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
    ...     level = fcode_level.get(row.fcode)
    ...     if (level is None and isinstance(fcode, string_types)
    ...             and len(fcode) >= 3):
    ...         fcode_level.get(row.fcode[:3], 5)
    ...     level = level or 5
    ...     level -= 1
    ...     if level < 0:  # We were on a country, no parent
    ...         return None
    ...     # Compute available codes
    ...     l = list(range(5))
    ...     while l and pd.isnull(row['code{0}'.format(l[-1])]):
    ...         l.pop()  # Remove NaN values backwards from code4
    ...     codes = {}
    ...     for i in l:
    ...         if i > level:
    ...             break
    ...         code_label = 'code{0}'.format(i)
    ...         codes[code_label] = row[code_label]
    ...     try:
    ...         return geonameid_from_codes(level, **codes)
    ...     except ValueError:
    ...         return None
    ...
    

    In this function, we first look at the row fcode to get the row administrative level. If the fcode is ADMX we get the level directly. If it is PCL<X>, we get level 0 from PCL. Else we set it to level 5 to say that it is below level 4. So the parent's level is the found level minus one. And if it is -1, we know that we were on a country and there is no parent.

    Then we compute all available administrative codes, removing codes with NaN values from the end.

    With the level and the codes, we can search for the parent's gid using the previous function.

    Now, how do we use this function. No need for a for loop, Pandas gives us the apply method.

    ...     for chunk in reader:
    ...         # (...) pd.merge as before
    ...         parent_gids = chunk.apply(parent_geonameid, axis=1)
    

    This will apply the parent_geonameid to each row and return a new Pandas series whose head looks like this.

    gid
    2986043     nan
    2993838     nan
    2994701     nan
    3007683     nan
    3017832     nan
    ...
    3039162     3041565.0
    

    None values have been converted to NaN. Thus, integer values have been converted to float (you cannot have NaN within an integer column), and this is not what we want. As a compromise, we are going to convert this into str and suppress the decimal part.

    ...         parent_gids = parent_gid.astype(str)  # no inplace=True here
    ...         parent_gids.replace(r'\.0', '', regex=True, inplace=True)
    

    We also add a label to the column. That's the name the column in our future dataframe.

    ...         parent_gids = parent_gids.rename('parent_gid')
    

    And we can now append this new column to our chunk dataframe.

    ...         chunk = pd.concat([chunk, parent_gids], axis=1)
    

    We're almost there. Before we can save the chunk in a CSV file, we must reorganize its columns to the expected order. For now, the frname and parent_gid columns have been appended at the end of the dataframe.

    ...         chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
    ...                                        'fcode', 'parent_gid', 'lat',
    ...                                        'lng'])
    

    At last, we save the chunk to the file opened in append mode.

    ...         chunk.to_csv('final.txt', mode='a', encoding='utf-8',
    ...                      quoting=csv.QUOTE_NONE, sep='\t', header=None)
    

    Caching to speed up

    Currently, creating the new CSV file from Geonames takes hours, and this is not acceptable. There are multiple ways to make thing go faster. One of the most significant change is to cache results of the parent_geonameid function. Indeed, many places in Geonames have the same parent ; computing the parent gid once and caching it sounds like a good idea.

    If you are using Python3, you can simply use the @functools.lru_cache decorator on the parent_geonameid function. But let us try to define our own custom cache.

    >>> from six import string_types
    >>> gid_cache = {}
    >>> def parent_geonameid(row):
    ...     """Return the Geoname id of the parent of the given Pandas row.
    ...
    ...     Return ``None`` if we can't find the parent's gid.
    ...     """
    ...     # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
    ...     level = fcode_level.get(row.fcode)
    ...     if (level is None and isinstance(fcode, string_types)
    ...             and len(fcode) >= 3):
    ...         fcode_level.get(row.fcode[:3], 5)
    ...     level = level or 5
    ...     level -= 1
    ...     if level < 0:  # We were on a country, no parent
    ...         return None
    ...     # Compute available codes
    ...     l = list(range(5))
    ...     while l and pd.isnull(row['code{0}'.format(l[-1])]):
    ...         l.pop()  # Remove NaN values backwards from code4
    ...     codes = {}
    ...     code_tuple = [level]
    ...     for i in l:
    ...         if i > level:
    ...             break
    ...         code_label = 'code{0}'.format(i)
    ...         code = row[code_label]
    ...         codes[code_label] = code
    ...         code_tuple.append(code)
    ...     code_tuple = tuple(code_tuple)
    ...     try:
    ...         parent_gid = (gid_cache.get(code_tuple)
    ...                       or geonameid_from_codes(level, **codes))
    ...     except ValueError:
    ...         parent_gid = None
    ...     # Put value in cache if not already to speed up future lookup
    ...     if code_tuple not in gid_cache:
    ...         gid_cache[code_tuple] = parent_gid
    ...     return parent_gid
    

    The only difference with the previous version is the use of a gid_cache dictionary. Keys for this dictionary are tuples (<level>, <code0>, [[<code1>], ..., <code4>]) (stored in the code_tuple variable), and the corresponding value is the parent gid for this combination of level and codes. Then the returned parent_gid is first looked in this dictionary for a previous cached result, else is computed from the geonameid_from_codes function like before, and the result is cached.

    Summary for the final step

    Let's review what we have done.

    1. We have defined three useful dictionaries. Two to get a level number from a feature code and conversely, and one to cache computation results.

      level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
      fcode_level = dict((v, k) for k, v in level_fcode.items())
      gid_cache = {}
      
    2. We have defined a function computing a Geoname id from administrative codes.

      def geonameid_from_codes(level, **codes):
          """Return the Geoname id of the *administrative* place with the
          given information.
      
          Return ``None`` if there is no match.
      
          ``level`` is an integer from 0 to 4. 0 means we are looking for a
          political entity (``PCL<X>`` feature code), 1 for a first-level
          administrative division (``ADM1``), and so on until fourth-level
          (``ADM4``).
      
          Then user must provide at least one of the ``code0``, ...,
          ``code4`` keyword parameters, depending on ``level``.
      
          Examples::
      
              >>> geonameid_from_codes(level=0, code0='RE')
              935317
              >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
              ...                      code2='974', code3='9742')
              935213
              >>> geonameid_from_codes(level=0, code0='AB')  # None
              >>>
      
          """
          try:
              idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
          except KeyError:
              raise ValueError('Not enough codeX parameters for level {0}'.format(level))
          idx = (level_fcode[level],) + idx
          try:
              return admgids.loc[idx, 'gid'].values[0]
          except (KeyError, TypeError):
              return None
      
    3. We have defined a function computing the parent's gid of a Pandas row.

      def parent_geonameid(row):
          """Return the Geoname id of the parent of the given Pandas row.
      
          Return ``None`` if we can't find the parent's gid.
          """
          # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
          level = fcode_level.get(row.fcode)
          if (level is None and isinstance(fcode, string_types)
                  and len(fcode) >= 3):
              fcode_level.get(row.fcode[:3], 5)
          level = level or 5
          level -= 1
          if level < 0:  # We were on a country, no parent
              return None
          # Compute available codes
          l = list(range(5))
          while l and pd.isnull(row['code{0}'.format(l[-1])]):
              l.pop()  # Remove NaN values backwards from code4
          codes = {}
          code_tuple = [level]
          for i in l:
              if i > level:
                  break
              code_label = 'code{0}'.format(i)
              code = row[code_label]
              codes[code_label] = code
              code_tuple.append(code)
          code_tuple = tuple(code_tuple)
          try:
              parent_gid = (gid_cache.get(code_tuple)
                            or geonameid_from_codes(level, **codes))
          except ValueError:
              parent_gid = None
          # Put value in cache if not already to speed up future lookup
          if code_tuple not in gid_cache:
              gid_cache[code_tuple] = parent_gid
          return parent_gid
      
    4. And finally we have loaded the file allCountries.txt into Pandas using chunks of 1,000,000 rows to save memory. For each chunk, we have merged it with the frnames table to add the frname column, and we applied the parent_geonameid function to add the parent_gid column. We then reordered the columns and append the chunk to the final CSV file.

      d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
                               'code2', 'code3', 'code4'], str)
      d_types['gid'] = 'uint32'
      d_types['lat'] = 'float16'
      d_types['lng'] = 'float16'
      with open('allCountries.txt') as geof:
          reader = pd.read_csv(
              geof, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
              names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
                     'code0', 'code1', 'code2', 'code3', 'code4'),
              dtype=d_types, index_col=0,
              na_values='', keep_default_na=False,
              chunksize=1000000, error_bad_lines=False
          )
          for chunk in reader:
              chunk = pd.merge(chunk, frnames, how='left',
                               left_index=True, right_index=True)
              parent_gids = chunk.apply(parent_geonameid, axis=1)
              parent_gids = parent_gids.astype(str)  # no inplace=True here
              parent_gids.replace(r'\.0', '', regex=True, inplace=True)
              parent_gids = parent_gids.rename('parent_gid')
              chunk = pd.concat([chunk, parent_gids], axis=1)
              chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
                                             'fcode', 'parent_gid', 'lat',
                                             'lng'])
              chunk.to_csv('final.txt', mode='a', encoding='utf-8',
                           quoting=csv.QUOTE_NONE, sep='\t', header=None)
      

    This final part is the longest, because the parent_geonameid function takes some time on each chunk to compute all parents gids. But at the end of the process we'll proudly see a final.txt file with data the way we want it, and without using too much memory... High five!

    What can be improved

    Congratulations! This ends our journey into the Pandas world.

    Regarding Geonames, to be honest, we've only scratch the surface of its complexity. There's so much to be done.

    If you look at the file we've just produced, you'll see plenty of empty values in the parent_gid column. May be our method to get the Geonames id of the parent needs to be improved. May be all those orphan places should be moved inside their countries.

    Another problem lies within Geonames data. France has overseas territories, for example Reunion Island, Geoname id 935,317. This place has feature code PCLD, which means "dependent political entity". And indeed, Reunion Island is not a country and should not appear at the top level of the tree, at the same level as France. So some work should be done here to have Reunion Island linked to France in some way, may be using the until now ignored cc2 column (for "alternate country codes").

    Still another improvement, easier this one, is to have yet another parent level for continents. For this, one can use the file countryInfo.txt, downloadable from the same page.

    Considering speed this time, there is also room for improvements. First, the code itself might be better designed to avoid some tests and for loops. Another possibility is tu use multiprocessing, since each chunk in allCountries.txt isindependent. Processes can put their finished chunk on a queue that a writer process will read to write data in the output file. Another way to go is Cython (see: http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html).


  • Understanding Geonames dump

    2017/02/20 by Yann Voté

    The aim of this 2-parts blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numeric oriented) tutorials.

    We will illustrate these techniques with Geonames data : extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statictics. We will prove that pandas can be used in a wide range of cases beyond numerical analysis.

    This first part is an introduction to Geonames data. The real work with Pandas will be shown on the second part You can skip this part and go directly to part 2 if you are already familiar with Geonames.

    Main data

    Geonames data can be downloaded from http://download.geonames.org/export/dump/ The main file to download is allCountries.zip. Once extracted, you'll get a CSV file named allCountries.txt which contains nearly all Geonames data. In this file, CSV data are separated by tabulation.

    A sample Geonames entry

    Let's look in Geonames data for the city of Toulouse in France.

    $ grep -P '\tToulouse\t' allCountries.txt
    

    You'll find multiple results (including one in the United States of America), and among them the following line.

    2972315     Toulouse        Toulouse        Gorad Tuluza,Lapangan Terbang Blagnac,TLS,Tolosa,(...)  43.60426        1.44367 P       PPLA    FR              76      31      313     31555   433055          150     Europe/Paris    2016-02-18
    

    What is the meaning of each column ? There is no header line at the top of the file... Go back to the web page from where you downloaded the file. Below download links, you'll find some documentation. In particular, consider the following excerpt.

    The main 'geoname' table has the following fields :
    ---------------------------------------------------
    geonameid         : integer id of record in geonames database
    name              : name of geographical point (utf8) varchar(200)
    asciiname         : name of geographical point in plain ascii characters, varchar(200)
    alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
    latitude          : latitude in decimal degrees (wgs84)
    longitude         : longitude in decimal degrees (wgs84)
    feature class     : see http://www.geonames.org/export/codes.html, char(1)
    feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
    country code      : ISO-3166 2-letter country code, 2 characters
    cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
    admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
    admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
    admin3 code       : code for third level administrative division, varchar(20)
    admin4 code       : code for fourth level administrative division, varchar(20)
    population        : bigint (8 byte int)
    elevation         : in meters, integer
    dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
    timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
    modification date : date of last modification in yyyy-MM-dd format
    

    So we see that entry 2,972,315 represents a place named Toulouse, for which latitude is 43.60 and longitude is 1.44. The place is located in France (FR country code), its population is estimated to 433,055, elevation is 150 m, and timezone is the same than Paris.

    To understand the meaning of P and PPLA, as feature class and feature code respectively, the indicated web page at http://www.geonames.org/export/codes.html provides us with the following information.

    P city, village,...
    PPLA        seat of a first-order administrative division
    

    Thus line 2,972,315 is about a city which is the seat of a first-order administrative division in France. And indeed, Toulouse is the seat of the Occitanie region in France.

    Geonames administrative divisions

    So let's look for Occitanie.

    $ grep -P '\tOccitanie\t' allCountries.txt
    

    You'll end up with the following line.

    11071623    Occitanie       Occitanie       Languedoc-Roussillon-Midi-Pyrenees,(...)        44.02722        1.63559 A       ADM1    FR              76              5626858         188     Europe/Paris    2016-11-16
    

    This is entry number 11,071,623, and we have latitude, longitude, population, elevation and timezone as usual. The place's class is A and its code is ADM1. Information about these codes can be found in the same previous page.

    A country, state, region,...
    ADM1        first-order administrative division
    

    This confirms that Occitanie is a first-order administrative division of France.

    Now look at the adminX_code part of the Occitanie line (X=1, 2, 3, 4).

    76
    

    Only column admin1_code is filled with value 76. Compare this with the same part from the Toulouse line.

    76  31      313     31555
    

    Here all columns admin1_code, admin2_code, admin3_code and admin4_code are given a value, respectively 76, 31 313, and 31555.

    Furthermore, we see that column admin1_code matches for Occitanie and Toulouse (value 76). This let us deduce that Toulouse is actually a city located in Occitanie.

    So, following the same logic, we can infer that Toulouse is also located in a second-order administrative division of France (feature code ADM2) with admin1_code of 76 and admin2_code equals to 31. Let's look for such a line in allCountries.txt.

    $ grep -P '\tADM2\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t'
    

    We get the following line.

    3013767     Département de la Haute-Garonne Departement de la Haute-Garonne Alta Garonna,Alto Garona,(...)  43.41667        1.5     A       ADM2    FR              76      31                      1254347         181     Europe/Paris    2016-02-18
    

    Success! Toulouse is actually in the Haute-Garonne department, which is a department in Occitanie region.

    Is this going to work for third-level administrative division ? Let's look for an ADM3 line, with admin1_code=76, admin2_code=31, and admin3_code=313.

    $ grep -P '\tADM3\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t' | grep -P '\t313\t'
    

    Here is the result.

    2972314     Arrondissement de Toulouse      Arrondissement de Toulouse      Arrondissement de Toulouse      43.58333        1.5     A       ADM3    FR              76      31      313             972098          139     Europe/Paris    2016-12-05
    

    Still works! Finally, let's find out the fourth-order administrative division which contains the city of Toulouse.

    $ grep -P '\tADM4\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t' | grep -P '\t313\t' | grep -P '\t31555\t'
    

    We get a place also named Toulouse.

    6453974     Toulouse        Toulouse        Toulouse        43.60444        1.44194 A       ADM4    FR              76      31      313     31555   440204          153     Europe/Paris    2016-02-18
    

    That's a surprise. That's because in France, an arrondissement is the smallest administrative division above a city. So for Geonames, city of Toulouse is both an administrative place (feature class A with feature code ADM4) and a populated place (feature class P with feature code PPLA), so there is two different entries with the same name (beware that this may be different for other countries).

    And that's it! We have found the whole hierarchy of administrative divisions above city of Toulouse, thanks to the adminX_code columns: Occitanie, Département de la Haute-Garonne, Arrondissement de Toulouse, Toulouse (ADM4), and Toulouse (PPL).

    That's it, really ? What about the top-most administrative level, the country ? This may not be intuitive: feature codes for countries start with PCL (for political entity).

    $ grep -P '\tPCL' allCountries.txt | grep -P '\tFR\t'
    

    Among the result, there's only one PCLI, which means independant political entity.

    3017382     Republic of France      Republic of France      An Fhrainc,An Fhraing,(...)     46      2       A       PCLI    FR              00      64768389                543     Europe/Paris    2015-01-08
    

    Now we have the whole hierarchy, summarized in the following table.

    level name id fclass fcode ccode adm1_code adm2_code adm3_code adm4_code
    country Republic of France 3017382 A PCLI FR        
    level 1 Occitanie 11071623 A ADM1 FR 76      
    level 2 Département de la Haute-Garonne 3013767 A ADM2 FR 76 31    
    level 3 Arrondissement de Toulouse 2972314 A ADM3 FR 76 31 313  
    level 4 Toulouse 6453974 A ADM4 FR 76 31 313 31555
    city Toulouse 2972315 P PPL FR 76 31 313 31555

    Alternate names

    In the previous example, "Republic of France" is the main name for Geonames. But this is not the name in French ("République française"), and even the name in French is not the most commonly used name which is "France".

    In the same way, "Département de la Haute-Garonne" is not the most commonly used name for the department, which is just "Haute-Garonne".

    The fourth column in allCountries.txt provides a comma-separated list of alternate names for a place, in other languages and in other forms. But this is not very useful because we can't decide which form in the list is in which language.

    For this, the Geonames project provides another file to download: alternateNames.zip. Go back to the download page, download it, and extract it. You'll get another tabulation-separeted CSV file named alternateNames.txt.

    Let's look at alternate names for the Haute-Garonne department.

    $ grep -P '\t3013767\t' alternateNames.txt
    

    Multiple names are printed.

    2080345     3013767 fr      Département de la Haute-Garonne
    2080346     3013767 es      Alto Garona     1       1
    2187116     3013767 fr      Haute-Garonne   1       1
    2431178     3013767 it      Alta Garonna    1       1
    2703076     3013767 en      Upper Garonne   1       1
    2703077     3013767 de      Haute-Garonne   1       1
    3047130     3013767 link    http://en.wikipedia.org/wiki/Haute-Garonne
    3074362     3013767 en      Haute Garonne
    4288095     3013767         Département de la Haute-Garonne
    10273497    3013767 link    http://id.loc.gov/authorities/names/n80009763
    

    To understand these columns, let's look again at the documentation.

    The table 'alternate names' :
    -----------------------------
    alternateNameId   : the id of this alternate name, int
    geonameid         : geonameId referring to id in table 'geoname', int
    isolanguage       : iso 639 language code 2- or 3-characters; (...)
    alternate name    : alternate name or name variant, varchar(400)
    isPreferredName   : '1', if this alternate name is an official/preferred name
    isShortName       : '1', if this is a short name like 'California' for 'State of California'
    isColloquial      : '1', if this alternate name is a colloquial or slang term
    isHistoric        : '1', if this alternate name is historic and was used in the pastq
    

    We can see that "Haute-Garonne" is a short version of "Département de la Haute-Garonne" and is the preferred form in French. As an exercise, the reader can confirm in the same way that "France" is the preferred shorter form for "Republic of France" in French.


    And that's it for our introductory journey into Geonames. You are now familiar enough with this data to begin working with it using Pandas in Python. In fact, what we have done until now, namely working with grep commands, is not very useful... See you in part 2!


  • Introduction to thesauri and SKOS

    2016/06/27 by Yann Voté

    Recently, I've faced the problem to import the European Union thesaurus, Eurovoc, into cubicweb using the SKOS cube. Eurovoc doesn't follow the SKOS data model and I'll show here how I managed to adapt Eurovoc to fit in SKOS.

    This article is in two parts:

    • this is the first part where I introduce what a thesaurus is and what SKOS is,
    • the second part will show how to convert Eurovoc to plain SKOS.

    The whole text assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

    What is a thesaurus ?

    A common need in our digital lives is to attach keywords to documents, web pages, pictures, and so on, so that search is easier. For example, you may want to add two keywords:

    • lily,
    • lilium

    in a picture's metadata about this flower. If you have a large collection of flower pictures, this will make your life easier when you want to search for a particular species later on.

    free-text keywords on a picture

    In this example, keywords are free: you can choose whatever keyword you want, very general or very specific. For example you may just use the keyword:

    • flower

    if you don't care about species. You are also free to use lowercase or uppercase letters, and to make typos...

    free-text keyword on a picture

    On the other side, sometimes you have to select keywords from a list. Such a constrained list is called a controlled vocabulary. For instance, a very simple controlled vocabulary with only two keywords is the one about a person's gender:

    • male (or man),
    • female (or woman).
    a simple controlled vocabulary

    But there are more complex examples: think about how a library organizes books by themes: there are very general themes (eg. Science), then more and more specific ones (eg. Computer science -> Software -> Operating systems). There may also be synonyms (eg. Computing for Computer science) or referrals (eg. there may be a "see also" link between keywords Algebra and Geometry). Such a controlled vocabulary where keywords are organized in a tree structure, and with relations like synonym and referral, is called a thesaurus.

    an example thesaurus with a tree of keywords

    For the sake of simplicity, in the following we will call thesaurus any controlled vocabulary, even a simple one with two keywords like male/female.

    SKOS

    SKOS, from the World Wide Web Consortium (W3C), is an ontology for the semantic web describing thesauri. To make it simple, it is a common data model for thesauri that can be used on the web. If you have a thesaurus and publish it on the web using SKOS, then anyone can understand how your thesaurus is organized.

    SKOS is very versatile. You can use it to produce very simple thesauri (like male/female) and very complex ones, with a tree of keywords, even in multiple languages.

    To cope with this complexity, SKOS data model splits each keyword into two entities: a concept and its labels. For example, the concept of a male person have multiple labels: male and man in English, homme and masculin in French. The concept of a lily flower also has multiple labels: lily in English, lilium in Latin, lys in French.

    Among all labels for a given concept, some can be preferred, while others are alternative. There may be only one preferred label per language. In the person's gender example, man may be the preferred label in English and male an alternative one, while in French homme would be the preferred label and masculin and alternative one. In the flower example, lily (resp. lys) is the preferred label in English (resp. French), and lilium is an alternative label in Latin (no preferred label in Latin).

    SKOS concepts and labels

    And of course, in SKOS, it is possible to say that a concept is broader than another one (just like topic Science is broader than topic Computer science).

    So to summarize, in SKOS, a thesaurus is a tree of concepts, and each concept have one or more labels, preferred or alternative. A thesaurus is also called a concept scheme in SKOS.

    Also, please note that SKOS data model is slightly more complicated than what we've shown here, but this will be sufficient for our purpose.

    RDF URIs defined by SKOS

    In order to publish a thesaurus in RDF using SKOS ontology, SKOS introduces the "skos:" namespace associated to the following URI: http://www.w3.org/2004/02/skos/core#.

    Within that namespace, SKOS defines some classes and predicates corresponding to what has been described above. For example:

    • the triple (<uri>, rdf:type, skos:ConceptScheme) says that <uri> belongs to class skos:ConceptScheme (that is, is a concept scheme),
    • the triple (<uri>, rdf:type, skos:Concept) says that <uri> belongs to class skos:Concept (that is, is a concept),
    • the triple (<uri>, skos:prefLabel, <literal>) says that <literal> is a preferred label for concept <uri>,
    • the triple (<uri>, skos:altLabel, <literal>) says that <literal> is an alternative label for concept <uri>,
    • the triple (<uri1>, skos:broader, <uri2>) says that concept <uri2> is a broder concept of <uri1>.

  • One way to convert Eurovoc into plain SKOS

    2016/06/27 by Yann Voté

    This is the second part of an article where I show how to import the Eurovoc thesaurus from the European Union into an application using a plain SKOS data model. I've recently faced the problem of importing Eurovoc into CubicWeb using the SKOS cube, and the solution I've chose is discussed here.

    The first part was an introduction to thesauri and SKOS.

    The whole article assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

    Difficulties with Eurovoc and SKOS

    Eurovoc

    Eurovoc is the main thesaurus covering European Union business domains. It is published and maintained by the EU commission. It is quite complex and big, structured as a tree of keywords.

    You can see Eurovoc keywords and browse the tree from the Eurovoc homepage using the link Browse the subject-oriented version.

    For example, when publishing statistics about education in the EU, you can tag the published data with the broadest keyword Education and communications. Or you can be more precise and use the following narrower keywords, in increasing order of preference: Education, Education policy, Education statistics.

    Problem: hierarchy of thesauri

    The EU commission uses SKOS to publish its Eurovoc thesaurus, so it should be straightforward to import Eurovoc into our own application. But things are not that simple...

    For some reasons, Eurovoc uses a hierarchy of concept schemes. For example, Education and communications is a sub-concept scheme of Eurovoc (it is called a domain), and Education is a sub-concept scheme of Education and communications (it is called a micro-thesaurus). Education policy is (a label of) the first concept in this hierarchy.

    But with SKOS this is not possible: a concept scheme cannot be contained into another concept scheme.

    Possible solutions

    So to import Eurovoc into our SKOS application, and not loose data, one solution is to turn sub-concept schemes into concepts. We have two strategies:

    • keep only one concept scheme (Eurovoc) and turn domains and micro-thesauri into concepts,
    • keep domains as concept schemes, drop Eurovoc concept scheme, and only turn micro-thesauri into concepts.

    Here we will discuss the latter solution.

    Lets get to work

    Eurovoc thesaurus can be downloaded at the following URL: http://publications.europa.eu/mdr/resource/thesaurus/eurovoc/skos/eurovoc_skos.zip

    The ZIP archive contains only one XML file named eurovoc_skos.rdf. Put it somewhere where you can find it easily.

    To read this file easily, we will use the RDFLib Python library. This library makes it really convenient to work with RDF data. It has only one drawback: it is very slow. Reading the whole Eurovoc thesaurus with it takes a very long time. Make the process faster is the first thing to consider for later improvements.

    Reading the Eurovoc thesaurus is as simple as creating an empty RDF Graph and parsing the file. As said above, this takes a long long time (from half an hour to two hours).

    import rdflib
    
    eurovoc_graph = rdflib.Graph()
    eurovoc_graph.parse('<path/to/eurovoc_skos.rdf>', format='xml')
    
    <Graph identifier=N52834ca3766d4e71b5e08d50788c5a13 (<class 'rdflib.graph.Graph'>)>
    

    We can see that Eurovoc contains more than 2 million triples.

    len(eurovoc_graph)
    
    2828910
    

    Now, before actually converting Eurovoc to plain SKOS, lets introduce some helper functions:

    • the first one, uriref(), will allow us to build RDFLib URIRef objects from simple prefixed URIs like skos:prefLabel or dcterms:title,
    • the second one, capitalized_eurovoc_domains(), is used to convert Eurovoc domain names, all uppercase (eg. 32 EDUCATION ET COMMUNICATION) to a string where only first letter is uppercase (eg. 32 Education and communication)
    import re
    
    from rdflib import Literal, Namespace, RDF, URIRef
    from rdflib.namespace import DCTERMS, SKOS
    
    eu_ns = Namespace('http://eurovoc.europa.eu/schema#')
    thes_ns = Namespace('http://purl.org/iso25964/skos-thes#')
    
    prefixes = {
        'dcterms': DCTERMS,
        'skos': SKOS,
        'eu': eu_ns,
        'thes': thes_ns,
    }
    
    def uriref(prefixed_uri):
        prefix, value = prefixed_uri.split(':', 1)
        ns = prefixes[prefix]
        return ns[value]
    
    def capitalized_eurovoc_domain(domain):
        """Return the given Eurovoc domain name with only the first letter uppercase."""
        return re.sub(r'^(\d+\s)(.)(.+)$',
                      lambda m: u'{0}{1}{2}'.format(m.group(1), m.group(2).upper(), m.group(3).lower()),
                      domain, re.UNICODE)
    

    Now the actual work. After using variables to reference URIs, the loop will parse each triple in original graph and:

    • discard it if it contains deprecated data,
    • if triple is like (<uri>, rdf:type, eu:Domain), replace it with (<uri>, rdf:type, skos:ConceptScheme),
    • if triple is like (<uri>, rdf:type, eu:MicroThesaurus), replace it with (<uri>, rdf:type, skos:Concept) and add triple (<uri>, skos:inScheme, <domain_uri>),
    • if triple is like (<uri>, rdf:type, eu:ThesaurusConcept), replace it with (<uri>, rdf:type, skos:Concept),
    • if triple is like (<uri>, skos:topConceptOf, <microthes_uri>), replace it with (<uri>, skos:broader, <microthes_uri>),
    • if triple is like (<uri>, skos:inScheme, <microthes_uri>), replace it with (<uri>, skos:inScheme, <domain_uri>),
    • keep triples like (<uri>, skos:prefLabel, <label_uri>), (<uri>, skos:altLabel, <label_uri>), and (<uri>, skos:broader, <concept_uri>),
    • discard all other non-deprecated triples.

    Note that, to replace a micro thesaurus with a domain, we have to build a mapping between each micro thesaurus and its containing domain (microthes2domain dict).

    This loop is also quite long.

    eurovoc_ref = URIRef(u'http://eurovoc.europa.eu/100141')
    deprecated_ref = URIRef(u'http://publications.europa.eu/resource/authority/status/deprecated')
    title_ref = uriref('dcterms:title')
    status_ref = uriref('thes:status')
    class_domain_ref = uriref('eu:Domain')
    rel_domain_ref = uriref('eu:domain')
    microthes_ref = uriref('eu:MicroThesaurus')
    thesconcept_ref = uriref('eu:ThesaurusConcept')
    concept_scheme_ref = uriref('skos:ConceptScheme')
    concept_ref = uriref('skos:Concept')
    pref_label_ref = uriref('skos:prefLabel')
    alt_label_ref = uriref('skos:altLabel')
    in_scheme_ref = uriref('skos:inScheme')
    broader_ref = uriref('skos:broader')
    top_concept_ref = uriref('skos:topConceptOf')
    
    microthes2domain = dict((mt, next(eurovoc_graph.objects(mt, uriref('eu:domain'))))
                            for mt in eurovoc_graph.subjects(RDF.type, uriref('eu:MicroThesaurus')))
    
    new_graph = rdflib.ConjunctiveGraph()
    for subj_ref, pred_ref, obj_ref in eurovoc_graph:
        if deprecated_ref in list(eurovoc_graph.objects(subj_ref, status_ref)):
            continue
        # Convert eu:Domain into a skos:ConceptScheme
        if obj_ref == class_domain_ref:
            new_graph.add((subj_ref, RDF.type, concept_scheme_ref))
            for title in eurovoc_graph.objects(subj_ref, pref_label_ref):
                if title.language == u'en':
                    new_graph.add((subj_ref, title_ref,
                                   Literal(capitalized_eurovoc_domain(title))))
                    break
        # Convert eu:MicroThesaurus into a skos:Concept
        elif obj_ref == microthes_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
            scheme_ref = next(eurovoc_graph.objects(subj_ref, rel_domain_ref))
            new_graph.add((subj_ref, in_scheme_ref, scheme_ref))
        # Convert eu:ThesaurusConcept into a skos:Concept
        elif obj_ref == thesconcept_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
        # Replace <concept> topConceptOf <MicroThesaurus> by <concept> broader <MicroThesaurus>
        elif pred_ref == top_concept_ref:
            new_graph.add((subj_ref, broader_ref, obj_ref))
        # Replace <concept> skos:inScheme <MicroThes> by <concept> skos:inScheme <Domain>
        elif pred_ref == in_scheme_ref and obj_ref in microthes2domain:
            new_graph.add((subj_ref, in_scheme_ref, microthes2domain[obj_ref]))
        # Keep label triples
        elif (subj_ref != eurovoc_ref and obj_ref != eurovoc_ref
              and pred_ref in (pref_label_ref, alt_label_ref)):
            new_graph.add((subj_ref, pred_ref, obj_ref))
        # Keep existing skos:broader relations and existing concepts
        elif pred_ref == broader_ref or obj_ref == concept_ref:
            new_graph.add((subj_ref, pred_ref, obj_ref))
    

    We can check that we now have far less triples than before.

    len(new_graph)
    
    388582
    

    Now we dump this new graph to disk. We choose the Turtle format as it is far more readable than RDF/XML for humans, and slightly faster to parse for machines. This file will contain plain SKOS data that can be directly imported into any application able to read SKOS.

    with open('eurovoc.n3', 'w') as f:
        new_graph.serialize(f, format='n3')
    

    With CubicWeb using the SKOS cube, it is a one command step:

    cubicweb-ctl skos-import --cw-store=massive <instance_name> eurovoc.n3