The aim of this 2-parts blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numeric oriented) tutorials.

We will illustrate these techniques with Geonames data : extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statictics. We will prove that pandas can be used in a wide range of cases beyond numerical analysis.

This first part is an introduction to Geonames data. The real work with Pandas will be shown on the second part You can skip this part and go directly to part 2 if you are already familiar with Geonames.

Main data

Geonames data can be downloaded from http://download.geonames.org/export/dump/ The main file to download is allCountries.zip. Once extracted, you'll get a CSV file named allCountries.txt which contains nearly all Geonames data. In this file, CSV data are separated by tabulation.

A sample Geonames entry

Let's look in Geonames data for the city of Toulouse in France.

$ grep -P '\tToulouse\t' allCountries.txt

You'll find multiple results (including one in the United States of America), and among them the following line.

2972315     Toulouse        Toulouse        Gorad Tuluza,Lapangan Terbang Blagnac,TLS,Tolosa,(...)  43.60426        1.44367 P       PPLA    FR              76      31      313     31555   433055          150     Europe/Paris    2016-02-18

What is the meaning of each column ? There is no header line at the top of the file... Go back to the web page from where you downloaded the file. Below download links, you'll find some documentation. In particular, consider the following excerpt.

The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int)
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format

So we see that entry 2,972,315 represents a place named Toulouse, for which latitude is 43.60 and longitude is 1.44. The place is located in France (FR country code), its population is estimated to 433,055, elevation is 150 m, and timezone is the same than Paris.

To understand the meaning of P and PPLA, as feature class and feature code respectively, the indicated web page at http://www.geonames.org/export/codes.html provides us with the following information.

P city, village,...
PPLA        seat of a first-order administrative division

Thus line 2,972,315 is about a city which is the seat of a first-order administrative division in France. And indeed, Toulouse is the seat of the Occitanie region in France.

Geonames administrative divisions

So let's look for Occitanie.

$ grep -P '\tOccitanie\t' allCountries.txt

You'll end up with the following line.

11071623    Occitanie       Occitanie       Languedoc-Roussillon-Midi-Pyrenees,(...)        44.02722        1.63559 A       ADM1    FR              76              5626858         188     Europe/Paris    2016-11-16

This is entry number 11,071,623, and we have latitude, longitude, population, elevation and timezone as usual. The place's class is A and its code is ADM1. Information about these codes can be found in the same previous page.

A country, state, region,...
ADM1        first-order administrative division

This confirms that Occitanie is a first-order administrative division of France.

Now look at the adminX_code part of the Occitanie line (X=1, 2, 3, 4).

76

Only column admin1_code is filled with value 76. Compare this with the same part from the Toulouse line.

76  31      313     31555

Here all columns admin1_code, admin2_code, admin3_code and admin4_code are given a value, respectively 76, 31 313, and 31555.

Furthermore, we see that column admin1_code matches for Occitanie and Toulouse (value 76). This let us deduce that Toulouse is actually a city located in Occitanie.

So, following the same logic, we can infer that Toulouse is also located in a second-order administrative division of France (feature code ADM2) with admin1_code of 76 and admin2_code equals to 31. Let's look for such a line in allCountries.txt.

$ grep -P '\tADM2\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
> | grep -P '\t31\t'

We get the following line.

3013767     Département de la Haute-Garonne Departement de la Haute-Garonne Alta Garonna,Alto Garona,(...)  43.41667        1.5     A       ADM2    FR              76      31                      1254347         181     Europe/Paris    2016-02-18

Success! Toulouse is actually in the Haute-Garonne department, which is a department in Occitanie region.

Is this going to work for third-level administrative division ? Let's look for an ADM3 line, with admin1_code=76, admin2_code=31, and admin3_code=313.

$ grep -P '\tADM3\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
> | grep -P '\t31\t' | grep -P '\t313\t'

Here is the result.

2972314     Arrondissement de Toulouse      Arrondissement de Toulouse      Arrondissement de Toulouse      43.58333        1.5     A       ADM3    FR              76      31      313             972098          139     Europe/Paris    2016-12-05

Still works! Finally, let's find out the fourth-order administrative division which contains the city of Toulouse.

$ grep -P '\tADM4\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
> | grep -P '\t31\t' | grep -P '\t313\t' | grep -P '\t31555\t'

We get a place also named Toulouse.

6453974     Toulouse        Toulouse        Toulouse        43.60444        1.44194 A       ADM4    FR              76      31      313     31555   440204          153     Europe/Paris    2016-02-18

That's a surprise. That's because in France, an arrondissement is the smallest administrative division above a city. So for Geonames, city of Toulouse is both an administrative place (feature class A with feature code ADM4) and a populated place (feature class P with feature code PPLA), so there is two different entries with the same name (beware that this may be different for other countries).

And that's it! We have found the whole hierarchy of administrative divisions above city of Toulouse, thanks to the adminX_code columns: Occitanie, Département de la Haute-Garonne, Arrondissement de Toulouse, Toulouse (ADM4), and Toulouse (PPL).

That's it, really ? What about the top-most administrative level, the country ? This may not be intuitive: feature codes for countries start with PCL (for political entity).

$ grep -P '\tPCL' allCountries.txt | grep -P '\tFR\t'

Among the result, there's only one PCLI, which means independant political entity.

3017382     Republic of France      Republic of France      An Fhrainc,An Fhraing,(...)     46      2       A       PCLI    FR              00      64768389                543     Europe/Paris    2015-01-08

Now we have the whole hierarchy, summarized in the following table.

level name id fclass fcode ccode adm1_code adm2_code adm3_code adm4_code
country Republic of France 3017382 A PCLI FR        
level 1 Occitanie 11071623 A ADM1 FR 76      
level 2 Département de la Haute-Garonne 3013767 A ADM2 FR 76 31    
level 3 Arrondissement de Toulouse 2972314 A ADM3 FR 76 31 313  
level 4 Toulouse 6453974 A ADM4 FR 76 31 313 31555
city Toulouse 2972315 P PPL FR 76 31 313 31555

Alternate names

In the previous example, "Republic of France" is the main name for Geonames. But this is not the name in French ("République française"), and even the name in French is not the most commonly used name which is "France".

In the same way, "Département de la Haute-Garonne" is not the most commonly used name for the department, which is just "Haute-Garonne".

The fourth column in allCountries.txt provides a comma-separated list of alternate names for a place, in other languages and in other forms. But this is not very useful because we can't decide which form in the list is in which language.

For this, the Geonames project provides another file to download: alternateNames.zip. Go back to the download page, download it, and extract it. You'll get another tabulation-separeted CSV file named alternateNames.txt.

Let's look at alternate names for the Haute-Garonne department.

$ grep -P '\t3013767\t' alternateNames.txt

Multiple names are printed.

2080345     3013767 fr      Département de la Haute-Garonne
2080346     3013767 es      Alto Garona     1       1
2187116     3013767 fr      Haute-Garonne   1       1
2431178     3013767 it      Alta Garonna    1       1
2703076     3013767 en      Upper Garonne   1       1
2703077     3013767 de      Haute-Garonne   1       1
3047130     3013767 link    http://en.wikipedia.org/wiki/Haute-Garonne
3074362     3013767 en      Haute Garonne
4288095     3013767         Département de la Haute-Garonne
10273497    3013767 link    http://id.loc.gov/authorities/names/n80009763

To understand these columns, let's look again at the documentation.

The table 'alternate names' :
-----------------------------
alternateNameId   : the id of this alternate name, int
geonameid         : geonameId referring to id in table 'geoname', int
isolanguage       : iso 639 language code 2- or 3-characters; (...)
alternate name    : alternate name or name variant, varchar(400)
isPreferredName   : '1', if this alternate name is an official/preferred name
isShortName       : '1', if this is a short name like 'California' for 'State of California'
isColloquial      : '1', if this alternate name is a colloquial or slang term
isHistoric        : '1', if this alternate name is historic and was used in the pastq

We can see that "Haute-Garonne" is a short version of "Département de la Haute-Garonne" and is the preferred form in French. As an exercise, the reader can confirm in the same way that "France" is the preferred shorter form for "Republic of France" in French.


And that's it for our introductory journey into Geonames. You are now familiar enough with this data to begin working with it using Pandas in Python. In fact, what we have done until now, namely working with grep commands, is not very useful... See you in part 2!

blog entry of