The aim of this 2-parts blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numeric oriented) tutorials.
We will illustrate these techniques with Geonames data : extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statictics. We will prove that pandas can be used in a wide range of cases beyond numerical analysis.
This first part is an introduction to Geonames data. The real work with Pandas will be shown on the second part You can skip this part and go directly to part 2 if you are already familiar with Geonames.
Main data
Geonames data can be downloaded from http://download.geonames.org/export/dump/ The main file to download is allCountries.zip. Once extracted, you'll get a CSV file named allCountries.txt which contains nearly all Geonames data. In this file, CSV data are separated by tabulation.
A sample Geonames entry
Let's look in Geonames data for the city of Toulouse in France.
$ grep -P '\tToulouse\t' allCountries.txt
You'll find multiple results (including one in the United States of America), and among them the following line.
2972315 Toulouse Toulouse Gorad Tuluza,Lapangan Terbang Blagnac,TLS,Tolosa,(...) 43.60426 1.44367 P PPLA FR 76 31 313 31555 433055 150 Europe/Paris 2016-02-18
What is the meaning of each column ? There is no header line at the top of the file... Go back to the web page from where you downloaded the file. Below download links, you'll find some documentation. In particular, consider the following excerpt.
The main 'geoname' table has the following fields : --------------------------------------------------- geonameid : integer id of record in geonames database name : name of geographical point (utf8) varchar(200) asciiname : name of geographical point in plain ascii characters, varchar(200) alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) latitude : latitude in decimal degrees (wgs84) longitude : longitude in decimal degrees (wgs84) feature class : see http://www.geonames.org/export/codes.html, char(1) feature code : see http://www.geonames.org/export/codes.html, varchar(10) country code : ISO-3166 2-letter country code, 2 characters cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20) admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) admin3 code : code for third level administrative division, varchar(20) admin4 code : code for fourth level administrative division, varchar(20) population : bigint (8 byte int) elevation : in meters, integer dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat. timezone : the iana timezone id (see file timeZone.txt) varchar(40) modification date : date of last modification in yyyy-MM-dd format
So we see that entry 2,972,315 represents a place named Toulouse, for which latitude is 43.60 and longitude is 1.44. The place is located in France (FR country code), its population is estimated to 433,055, elevation is 150 m, and timezone is the same than Paris.
To understand the meaning of P and PPLA, as feature class and feature code respectively, the indicated web page at http://www.geonames.org/export/codes.html provides us with the following information.
P city, village,... PPLA seat of a first-order administrative division
Thus line 2,972,315 is about a city which is the seat of a first-order administrative division in France. And indeed, Toulouse is the seat of the Occitanie region in France.
Geonames administrative divisions
So let's look for Occitanie.
$ grep -P '\tOccitanie\t' allCountries.txt
You'll end up with the following line.
11071623 Occitanie Occitanie Languedoc-Roussillon-Midi-Pyrenees,(...) 44.02722 1.63559 A ADM1 FR 76 5626858 188 Europe/Paris 2016-11-16
This is entry number 11,071,623, and we have latitude, longitude, population, elevation and timezone as usual. The place's class is A and its code is ADM1. Information about these codes can be found in the same previous page.
A country, state, region,... ADM1 first-order administrative division
This confirms that Occitanie is a first-order administrative division of France.
Now look at the adminX_code part of the Occitanie line (X=1, 2, 3, 4).
76
Only column admin1_code is filled with value 76. Compare this with the same part from the Toulouse line.
76 31 313 31555
Here all columns admin1_code, admin2_code, admin3_code and admin4_code are given a value, respectively 76, 31 313, and 31555.
Furthermore, we see that column admin1_code matches for Occitanie and Toulouse (value 76). This let us deduce that Toulouse is actually a city located in Occitanie.
So, following the same logic, we can infer that Toulouse is also located in a second-order administrative division of France (feature code ADM2) with admin1_code of 76 and admin2_code equals to 31. Let's look for such a line in allCountries.txt.
$ grep -P '\tADM2\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \ > | grep -P '\t31\t'
We get the following line.
3013767 Département de la Haute-Garonne Departement de la Haute-Garonne Alta Garonna,Alto Garona,(...) 43.41667 1.5 A ADM2 FR 76 31 1254347 181 Europe/Paris 2016-02-18
Success! Toulouse is actually in the Haute-Garonne department, which is a department in Occitanie region.
Is this going to work for third-level administrative division ? Let's look for an ADM3 line, with admin1_code=76, admin2_code=31, and admin3_code=313.
$ grep -P '\tADM3\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \ > | grep -P '\t31\t' | grep -P '\t313\t'
Here is the result.
2972314 Arrondissement de Toulouse Arrondissement de Toulouse Arrondissement de Toulouse 43.58333 1.5 A ADM3 FR 76 31 313 972098 139 Europe/Paris 2016-12-05
Still works! Finally, let's find out the fourth-order administrative division which contains the city of Toulouse.
$ grep -P '\tADM4\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \ > | grep -P '\t31\t' | grep -P '\t313\t' | grep -P '\t31555\t'
We get a place also named Toulouse.
6453974 Toulouse Toulouse Toulouse 43.60444 1.44194 A ADM4 FR 76 31 313 31555 440204 153 Europe/Paris 2016-02-18
That's a surprise. That's because in France, an arrondissement is the smallest administrative division above a city. So for Geonames, city of Toulouse is both an administrative place (feature class A with feature code ADM4) and a populated place (feature class P with feature code PPLA), so there is two different entries with the same name (beware that this may be different for other countries).
And that's it! We have found the whole hierarchy of administrative divisions above city of Toulouse, thanks to the adminX_code columns: Occitanie, Département de la Haute-Garonne, Arrondissement de Toulouse, Toulouse (ADM4), and Toulouse (PPL).
That's it, really ? What about the top-most administrative level, the country ? This may not be intuitive: feature codes for countries start with PCL (for political entity).
$ grep -P '\tPCL' allCountries.txt | grep -P '\tFR\t'
Among the result, there's only one PCLI, which means independant political entity.
3017382 Republic of France Republic of France An Fhrainc,An Fhraing,(...) 46 2 A PCLI FR 00 64768389 543 Europe/Paris 2015-01-08
Now we have the whole hierarchy, summarized in the following table.
level name id fclass fcode ccode adm1_code adm2_code adm3_code adm4_code country Republic of France 3017382 A PCLI FR level 1 Occitanie 11071623 A ADM1 FR 76 level 2 Département de la Haute-Garonne 3013767 A ADM2 FR 76 31 level 3 Arrondissement de Toulouse 2972314 A ADM3 FR 76 31 313 level 4 Toulouse 6453974 A ADM4 FR 76 31 313 31555 city Toulouse 2972315 P PPL FR 76 31 313 31555
Alternate names
In the previous example, "Republic of France" is the main name for Geonames. But this is not the name in French ("République française"), and even the name in French is not the most commonly used name which is "France".
In the same way, "Département de la Haute-Garonne" is not the most commonly used name for the department, which is just "Haute-Garonne".
The fourth column in allCountries.txt provides a comma-separated list of alternate names for a place, in other languages and in other forms. But this is not very useful because we can't decide which form in the list is in which language.
For this, the Geonames project provides another file to download: alternateNames.zip. Go back to the download page, download it, and extract it. You'll get another tabulation-separeted CSV file named alternateNames.txt.
Let's look at alternate names for the Haute-Garonne department.
$ grep -P '\t3013767\t' alternateNames.txt
Multiple names are printed.
2080345 3013767 fr Département de la Haute-Garonne 2080346 3013767 es Alto Garona 1 1 2187116 3013767 fr Haute-Garonne 1 1 2431178 3013767 it Alta Garonna 1 1 2703076 3013767 en Upper Garonne 1 1 2703077 3013767 de Haute-Garonne 1 1 3047130 3013767 link http://en.wikipedia.org/wiki/Haute-Garonne 3074362 3013767 en Haute Garonne 4288095 3013767 Département de la Haute-Garonne 10273497 3013767 link http://id.loc.gov/authorities/names/n80009763
To understand these columns, let's look again at the documentation.
The table 'alternate names' : ----------------------------- alternateNameId : the id of this alternate name, int geonameid : geonameId referring to id in table 'geoname', int isolanguage : iso 639 language code 2- or 3-characters; (...) alternate name : alternate name or name variant, varchar(400) isPreferredName : '1', if this alternate name is an official/preferred name isShortName : '1', if this is a short name like 'California' for 'State of California' isColloquial : '1', if this alternate name is a colloquial or slang term isHistoric : '1', if this alternate name is historic and was used in the pastq
We can see that "Haute-Garonne" is a short version of "Département de la Haute-Garonne" and is the preferred form in French. As an exercise, the reader can confirm in the same way that "France" is the preferred shorter form for "Republic of France" in French.
And that's it for our introductory journey into Geonames. You are now familiar enough with this data to begin working with it using Pandas in Python. In fact, what we have done until now, namely working with grep commands, is not very useful... See you in part 2!