The aim of this two-part blog post is to show some useful, yet not very
complicated, features of the Pandas Python library that are not found in most
(numeric-oriented) tutorials.
We will illustrate these techniques with Geonames data: extract useful data
from the Geonames dump, transform it, and load it into another file. There is
no numeric computation involved here, nor statistics. We will show that Pandas
can be used in a wide range of cases beyond numerical analysis.
Before diving into technical details, let us first look at the big picture, the
overall process to achieve the aforementioned goal.
Considering the wanted columns, we can see the following differences with
allCountries.txt:
- we are not interested in some original columns (population,
elevation, ...),
- the column frname is new and will come from alternateNames.txt,
thanks to the Geonames id,
- the column parent_gid is new too. We must derive this information from
the adminX_code (X=1, 2, 3, 4) columns in allCountries.txt.
This column deserves an example. Look at Arrondissement de Toulouse.
feature_code: ADM3, country_code: FR, admin1_code: 76, admin2_code: 31, admin3_code: 313
To find its parent, we must look at a place with the following properties,
feature_code: ADM2, country_code: FR, admin1_code: 76, admin2_code: 31
and there is only one place with such properties, Geonames id 3,013,767, namely
Département de la Haute-Garonne. Thus we must find a way to derive the
Geonames id from the feature_code, country_code and adminX_code
columns. Pandas will make this easy for us.
Let's get to work. First, of course, we must import the Pandas library to
make it available. We also import the csv module because we will need it to
perform basic operations on CSV files.
>>> import pandas as pd
>>> import csv
We begin by loading data from the alternateNames.txt file.
And indeed, that's a big file. To save memory we won't load the whole thing.
Recall from the previous part that the alternateNames.txt file provides the
following columns, in order.
alternateNameId : the id of this alternate name, int
geonameid : geonameId referring to id in table 'geoname', int
isolanguage : iso 639 language code 2- or 3-characters; (...)
alternate name : alternate name or name variant, varchar(400)
isPreferredName : '1', if this alternate name is an official/preferred name
isShortName : '1', if this is a short name like 'California' for 'State of California'
isColloquial : '1', if this alternate name is a colloquial or slang term
isHistoric : '1', if this alternate name is historic and was used in the past
For our purpose we are only interested in the columns geonameid (so that we
can find the corresponding place in the allCountries.txt file), isolanguage
(so that we can keep only French names), alternate name (of course), and
isPreferredName (because we want to keep preferred names when possible).
Another way to save memory is to filter the file before loading it: it is
better practice to load a pre-filtered, smaller dataset than to load a big
one and filter it afterwards, something worth keeping in mind whenever you
work with large datasets. So in our case, it is cleaner (but slower) to
prepare the CSV file beforehand, keeping only French names.
>>> with open('alternateNames.txt') as f:
... reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
... with open('frAltNames.txt', 'w') as g:
... writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
... escapechar='\\')
... for row in reader:
... if row[2] == u'fr':
... writer.writerow(row)
To load a CSV file, Pandas provides the read_csv function. It returns a
dataframe populated with data from a CSV file.
>>> with open('frAltNames.txt') as altf:
... frnames = pd.read_csv(
... altf, encoding='utf-8',
... sep='\t', quoting=csv.QUOTE_NONE,
... header=None, usecols=(1, 3, 4), index_col=0,
... names=('gid', 'frname', 'pref'),
... dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
... true_values=['1'], false_values=[u''], na_filter=False,
... skipinitialspace=True, error_bad_lines=False
... )
...
The read_csv function is quite complex and it takes some time to use it
correctly. In our case, the first few parameters are self-explanatory:
encoding for the file encoding, sep for the CSV separator, quoting
for the quoting protocol (here there is none), and header for lines to be
considered as label lines (here there is none).
usecols tells pandas to load only the specified columns. Beware that
indices start at 0, so column 1 is the second column in the file (geonameid
in this case).
index_col says to use one of the columns as a row index (instead of
creating a new index from scratch). Note that the number for index_col is
relative to usecols. In other words, 0 means the first column of
usecols, not the first column of the file.
names gives labels for columns (instead of using integers from 0). Hence we
can extract the last column with frnames['pref'] (instead of
frnames[3]). Note that this parameter is not compatible with header=0, for
example (in that case, the first line is used to label the columns).
The dtype parameter is interesting. It allows you to specify one type per
column. When possible, prefer NumPy types to save memory (e.g. np.uint32,
np.bool_).
Since we want boolean values for the pref column, we can tell Pandas to
convert '1' strings to True and empty strings ('') to False.
That is the point of using parameters true_values and false_values.
But by default, Pandas detects empty strings and assigns them np.nan
(NaN means "not a number"). To prevent this behavior, setting na_filter
to False leaves empty strings as empty strings. Thus, empty strings in the
pref column will be converted to False (thanks to the false_values
parameter and to the 'bool_' data type).
Finally, skipinitialspace tells Pandas to left-strip strings to remove
leading spaces. Without it, a comma-separated line like a, b would give
values 'a' and ' b' (note the leading space before b).
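To see the effect concretely, here is a quick illustration (a sketch using
io.StringIO to fake a tiny comma-separated file):
>>> import io
>>> pd.read_csv(io.StringIO(u'a, b'), header=None).values.tolist()
[['a', ' b']]
>>> pd.read_csv(io.StringIO(u'a, b'), header=None,
...             skipinitialspace=True).values.tolist()
[['a', 'b']]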
Lastly, setting error_bad_lines to False makes Pandas ignore lines it cannot
parse (e.g. with a wrong number of columns); the default behavior is to raise
an exception.
There are many more parameters to this function, which is much more powerful
than the simple reader object from the csv module. Please refer to the
documentation at
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for
a full list of options. For example, the header parameter can accept a list
of integers, like [1, 3, 4], saying that lines at position 1, 3 and 4 are
label lines.
>>> frnames.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 55684 entries, 18918 to 11441793
Data columns (total 2 columns):
frname 55684 non-null object
pref 55684 non-null bool
dtypes: bool(1), object(1)
memory usage: 6.6 MB
We can see here that our dataframe contains 55,684 entries and that it
takes about 6.6 MB in memory.
Let's look at entry number 3,013,767 for the Haute-Garonne department.
>>> frnames.loc[3013767]
frname pref
gid
3013767 Département de la Haute-Garonne False
3013767 Haute-Garonne True
There is one last thing to do with the frnames dataframe: make its row
indices unique. Indeed, we can see from the previous example that there are
two entries for the Haute-Garonne department. We need to keep only one and,
when possible, the preferred form (the one whose pref column is True).
This is not always possible: we can see at the beginning of the dataframe (use
the head method) that there is no preferred French name for Rwanda.
>>> frnames.head()
frname pref
gid
18918 Protaras True
49518 République du Rwanda False
49518 Rwanda False
50360 Woqooyi Galbeed False
51230 Togdheer False
In such a case, we will take one of the two lines at random. Maybe a clever
rule to decide which one to keep would be useful (like involving other
columns), but for the purpose of this tutorial it does not matter which one is
kept.
Back to our problem: how do we make indices unique? Pandas provides the
duplicated method on Index objects. This method returns a boolean NumPy
1d-array (a vector), the size of which is the number of entries. So, since our
dataframe has 55,684 entries, the length of the returned vector is 55,684.
>>> dups_idx = frnames.index.duplicated()
>>> dups_idx
array([False, False, True, ..., False, False, False], dtype=bool)
>>> len(dups_idx)
55684
The meaning is simple: when you encounter True, it means that the index at
this position is a duplicate of a previously encountered index. For example, we
can see that the third value in dups_idx is True. And indeed, the third
line of frnames has an index (49,518) which is a duplicate of the second
line.
So duplicated marks duplicated indices as True and keeps only the first
occurrence (there is an optional parameter to change this: read the doc). How
do we make sure that the first entry is the preferred one? By sorting the
dataframe, of course! We can sort a table by a column (or by a list of
columns) using the sort_values method. We give ascending=False because
True > False (that's a philosophical question!), and inplace=True to
sort the dataframe in place, so we do not create a copy (still thinking about
memory usage).
>>> frnames.sort_values('pref', ascending=False, inplace=True)
>>> frnames.head()
frname pref
gid
18918 Protaras True
8029850 Hôtel de ville de Copenhague True
8127417 Kalmar True
8127361 Sundsvall True
8127291 Örebro True
>>> frnames.loc[3013767]
frname pref
gid
3013767 Haute-Garonne True
3013767 Département de la Haute-Garonne False
Great! All preferred names are now first in the dataframe. We can then use the
duplicated index method to filter out duplicated entries.
>>> frnames = frnames[~(frnames.index.duplicated())]
>>> frnames.loc[3013767]
frname Haute-Garonne
pref True
Name: 3013767, dtype: object
>>> frnames.loc[49518]
frname Rwanda
pref False
Name: 49518, dtype: object
>>> frnames.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49047 entries, 18918 to 11441793
Data columns (total 2 columns):
frname 49047 non-null object
pref 49047 non-null bool
dtypes: bool(1), object(1)
memory usage: 5.7 MB
We end up with 49,047 French names. Notice the ~ in the filter expression?
duplicated marks every repeat of an index as True, so we negate the vector to
keep the rows marked False, that is, the first entry for each index.
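For instance, negating the dups_idx vector computed earlier simply flips each
boolean, so indexing with it keeps exactly the rows where duplicated returned
False:
>>> ~dups_idx
array([ True,  True, False, ..., True,  True,  True], dtype=bool)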
One last thing to do is to remove the pref column which won't be used
anymore. We already know how to do it.
>>> frnames.drop('pref', axis=1, inplace=True)
There has been a lot of talking until now. But if we summarize, very few
commands were needed to obtain this table with only two columns (gid and
frname):
Prepare a smaller file so there is less data to load (keep only French
names).
with open('alternateNames.txt') as f:
reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
with open('frAltNames.txt', 'w') as g:
writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
escapechar='\\')
for row in reader:
if row[2] == u'fr':
writer.writerow(row)
Load the file into Pandas.
with open('frAltNames.txt') as altf:
frnames = pd.read_csv(
altf, encoding='utf-8',
sep='\t', quoting=csv.QUOTE_NONE,
header=None, usecols=(1, 3, 4), index_col=0,
names=('gid', 'frname', 'pref'),
dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
true_values=['1'], false_values=[u''], na_filter=False,
skipinitialspace=True, error_bad_lines=False
)
Sort on the pref column and remove duplicated indices.
frnames.sort_values('pref', ascending=False, inplace=True)
frnames = frnames[~(frnames.index.duplicated())]
Remove the pref column.
frnames.drop('pref', axis=1, inplace=True)
Simple, isn't it? We'll keep this dataframe for later use. Pandas will make it
a breeze to merge it with the main dataframe coming from allCountries.txt
to create the new frname column.
But for now, let's look at the second problem: how to derive a gid from a
feature_code, a country_code, and a bunch of adminX_code (X=1, 2,
3, 4).
First we need the administrative part of the file allCountries.txt, that is
all places with the A feature class.
Of course we can load the whole file into Pandas and then filter to keep only
A-class entries, but now you know that this is memory intensive (and this
file is much bigger than alternateNames.txt). So we'll be more clever and
first prepare a smaller file.
>>> with open('allCountries.txt') as f:
... reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
... with open('adm_geonames.txt', 'w') as g:
... writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\')
... for row in reader:
... if row[6] == 'A':
... writer.writerow(row)
Now we load this file into pandas. Remembering our goal, the resulting
dataframe will only be used to compute gid from the code columns. So all
columns we need are geonameid, feature_code, country_code,
admin1_code, admin2_code, admin3_code, and admin4_code.
What about data types? All these columns are strings except for geonameid,
which is an integer. Since it would be painful to type <colname>: str for
each key of the dtype dictionary, let's use the dict.fromkeys constructor
instead.
>>> d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
... 'code4'], str)
>>> d_types['gid'] = 'uint32'
We can now load the file.
>>> with open('adm_geonames.txt') as admf:
... admgids = pd.read_csv(
... admf, encoding='utf-8',
... sep='\t', quoting=csv.QUOTE_NONE,
... header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
... names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
... dtype=d_types, na_values='', keep_default_na=False,
... error_bad_lines=False
... )
We recognize most parameters in this instruction. Notice that we didn't use
index_col=0: gid is now a normal column, not an index, and Pandas will
automatically generate a row index with integers starting at 0.
Two new parameters are na_values and keep_default_na. The first one
gives Pandas additional strings to be considered as NaN (Not a Number). The
astute reader may object that empty strings ('') are already considered by
Pandas as NaN, and would be right.
But here comes the second parameter which, if set to False, tells Pandas to
forget its default list of strings recognized as NaN. That default list
contains a bunch of strings like 'N/A' or '#NA' or, and this is
interesting, simply 'NA'. But 'NA' is used in allCountries.txt as the
country code for Namibia. If we kept the default list, this whole country
would be ignored. So the combination of these two parameters tells Pandas to:
- reset its default list of NaN strings,
- use '' as a NaN string, which, consequently, will be the only one.
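To make the pitfall with Namibia concrete, here is a quick sketch using
io.StringIO as a stand-in for a real file:
>>> import io
>>> pd.read_csv(io.StringIO(u'NA'), header=None).iloc[0, 0]
nan
>>> pd.read_csv(io.StringIO(u'NA'), header=None,
...             na_values='', keep_default_na=False).iloc[0, 0]
'NA'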
That's it for NaN. What can Pandas tell us about our dataframe?
>>> admgids.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369277 entries, 0 to 369276
Data columns (total 7 columns):
gid 369277 non-null uint32
fcode 369273 non-null object
code0 369272 non-null object
code1 369197 non-null object
code2 288494 non-null object
code3 215908 non-null object
code4 164899 non-null object
dtypes: object(6), uint32(1)
memory usage: 128.8 MB
Note the RangeIndex which Pandas has created for us. The dataframe takes
about 130 MB of memory because it has a lot of object columns. Apart from
that, everything looks good.
One thing we want to do before going on is to replace all 'PCL<X>' values
in the fcode column with just 'PCL'. This will make our life easier later
when searching in this dataframe.
>>> pd.unique(admgids.fcode)
array([u'ADMD', u'ADM1', u'PCLI', u'ADM2', u'ADM3', u'ADM2H', u'PCLD',
u'ADM4', u'ZN', u'ADM1H', u'PCLH', u'TERR', u'ADM3H', u'PRSH',
u'PCLIX', u'ADM5', u'ADMDH', nan, u'PCLS', u'LTER', u'ADM4H',
u'ZNB', u'PCLF', u'PCL'], dtype=object)
Pandas provides the replace method for that.
>>> admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True, inplace=True)
>>> pd.unique(admgids.fcode)
array([u'ADMD', u'ADM1', u'PCL', u'ADM2', u'ADM3', u'ADM2H', u'ADM4',
u'ZN', u'ADM1H', u'TERR', u'ADM3H', u'PRSH', u'ADM5', u'ADMDH', nan,
u'LTER', u'ADM4H', u'ZNB'], dtype=object)
The replace method has a lot of different signatures; refer to the Pandas
documentation for a comprehensive description. Here, with the dictionary, we
are saying to look only in the column fcode, and, in this column, to replace
strings matching the regular expression with the given value. Since we are
using regular expressions, the regex parameter must be set to True.
And as usual, the inplace parameter avoids the creation of a copy.
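As a quick taste of another signature, replace also accepts plain scalar
values, with no dictionary or regular expression involved (a throwaway
example):
>>> pd.Series(['PCLI', 'ADM1']).replace('PCLI', 'PCL')
0     PCL
1    ADM1
dtype: object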
Remember the goal: we want to be able to get the gid from the other
columns. Well, dear reader, you'll be happy to know that Pandas allows an
index to be composite, to be composed of multiple columns; this is what
Pandas calls a MultiIndex.
To put it simply, a multi-index is useful when you have hierarchical indices.
Consider for example the following table.
lvl1   lvl2    N     S
A      AA      11    1x1
A      AB      12    1x2
B      BA      21    2x1
B      BB      22    2x2
B      BC      23    2x3
If we were to load such a table in a Pandas dataframe df (exercise: do it),
we would be able to use it as follows.
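One possible solution to the exercise, as a sketch, is to build the dataframe
directly with MultiIndex.from_tuples:
>>> idx = pd.MultiIndex.from_tuples(
...     [('A', 'AA'), ('A', 'AB'), ('B', 'BA'), ('B', 'BB'), ('B', 'BC')],
...     names=('lvl1', 'lvl2'))
>>> df = pd.DataFrame({'N': [11, 12, 21, 22, 23],
...                    'S': ['1x1', '1x2', '2x1', '2x2', '2x3']},
...                   index=idx)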
>>> df.loc['A']
N S
AA 11 1x1
AB 12 1x2
>>> df.loc['B']
N S
BA 21 2x1
BB 22 2x2
BC 23 2x3
>>> df.loc[('A',), 'S']
AA 1x1
AB 1x2
Name: S, dtype: object
>>> df.loc[('A', 'AB'), 'S']
'1x2'
>>> df.loc['B', 'BB']
N 22
S 2x2
Name: (B, BB), dtype: object
>>> df.loc[('B', 'BB')]
N 22
S 2x2
Name: (B, BB), dtype: object
So, basically, we can query the multi-index using tuples (and we can omit
tuples if column indexing is not involved). But most importantly, we can
query a multi-index partially: df.loc['A'] returns a sub-dataframe with one
level of index gone.
Back to our subject, we clearly have hierarchical information in our code
columns: the country code, then the admin1 code, then the admin2 code, and so
on. Moreover, we can put the feature code at the top. But how do we do that?
It couldn't be simpler. The main issue is to find the correct method:
set_index.
>>> admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
... inplace=True)
>>> admgids.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 369277 entries, (ADMD, AD, 00, nan, nan, nan) to (PCLH, nan, nan, nan, nan, nan)
Data columns (total 1 columns):
gid 369277 non-null uint32
dtypes: uint32(1)
memory usage: 21.9 MB
>>> admgids.head()
gid
fcode code0 code1 code2 code3 code4
ADMD AD 00 NaN NaN NaN 3038817
NaN 3039039
ADM1 AD 06 NaN NaN NaN 3039162
05 NaN NaN NaN 3039676
04 NaN NaN NaN 3040131
That's not quite it yet: Pandas has kept the original dataframe order, so the
multi-index is all messed up. No problem, there is the sort_index method.
>>> admgids.sort_index(inplace=True)
>>> admgids.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 369277 entries, (nan, CA, 10, 02, 94235, nan) to (ZNB, SY, 00, nan, nan, nan)
Data columns (total 1 columns):
gid 369277 non-null uint32
dtypes: uint32(1)
memory usage: 21.9 MB
>>> admgids.head()
gid
fcode code0 code1 code2 code3 code4
NaN CA 10 02 94235 NaN 6544163
CN NaN NaN NaN NaN 6255000
CY 04 NaN NaN NaN 6640324
MX 16 NaN NaN NaN 6618819
ADM1 AD 02 NaN NaN NaN 3041203
Much better. We can even see that there are four entries without a feature
code.
Let's see if all this effort was worth it. Can we fulfill our goal? Can we
get a gid from a bunch of codes? Can we get the Haute-Garonne gid?
>>> admgids.loc['ADM2', 'FR', '76', '31']
gid
code3 code4
NaN NaN 3013767
And for the Toulouse ADM4?
>>> admgids.loc['ADM4', 'FR', '76', '31', '313', '31555']
gid
fcode code0 code1 code2 code3 code4
ADM4 FR 76 31 313 31555 6453974
Cheers! You've earned a glass of wine!
Before moving on to our grand finale, let's summarize what has been done to
prepare the administrative data.
We've prepared a smaller file with only administrative data.
with open('allCountries.txt') as f:
reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
with open('adm_geonames.txt', 'w') as g:
writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
escapechar='\\')
for row in reader:
if row[6] == 'A':
writer.writerow(row)
We've loaded the file into Pandas.
d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
'code4'], str)
d_types['gid'] = 'uint32'
with open('adm_geonames.txt') as admf:
admgids = pd.read_csv(
admf, encoding='utf-8',
sep='\t', quoting=csv.QUOTE_NONE,
header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
dtype=d_types, na_values='', keep_default_na=False,
error_bad_lines=False
)
We've replaced all 'PCL<X>' values with just 'PCL' in the fcode
column.
admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True,
inplace=True)
Then we've created a multi-index with columns fcode, code0,
code1, code2, code3, code4.
admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
inplace=True)
admgids.sort_index(inplace=True)
The time has finally come to load the main file: allCountries.txt. On the
one hand, we will use the frnames dataframe to get the French name for each
entry and populate the frname column; on the other hand, we will use the
admgids dataframe to compute the parent gid for each line.
On my computer, loading the whole allCountries.txt at once takes 5.5 GB of
memory, clearly too much! And in this case, there is no trick to reduce the
size of the file first: we want all the data.
Pandas can help us with the chunksize parameter to the read_csv
function. It allows us to read the file chunk by chunk (it returns an
iterator). The idea is to first create an empty CSV file for our final data,
then read each chunk, perform the data manipulation (that is, add frname and
parent_gid) on this chunk, and append the data to the file.
So we load data into Pandas the usual way. The only difference is that we add
a new parameter, chunksize, with value 1,000,000. You can choose a smaller or
a larger number of rows depending on your memory limit.
>>> d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
... 'code2', 'code3', 'code4'], str)
>>> d_types['gid'] = 'uint32'
>>> d_types['lat'] = 'float16'
>>> d_types['lng'] = 'float16'
>>> with open('allCountries.txt') as geof:
... reader = pd.read_csv(
... geof, encoding='utf-8',
... sep='\t', quoting=csv.QUOTE_NONE,
... header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
... names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
... 'code0', 'code1', 'code2', 'code3', 'code4'),
... dtype=d_types, index_col=0,
... na_values='', keep_default_na=False,
... chunksize=1000000, error_bad_lines=False
... )
... for chunk in reader:
... pass
Pandas can perform a JOIN on two dataframes, much like in SQL. The function
to do so is merge.
... for chunk in reader:
... chunk = pd.merge(chunk, frnames, how='left',
... left_index=True, right_index=True)
merge expects first the two dataframes to be joined. The how parameter
tells what type of JOIN to perform (it can be left, right,
inner, ...). Here we want to keep all lines in chunk, which is the first
parameter, so it is 'left' (if chunk were the second parameter, it would
have been 'right').
left_index=True and right_index=True tell Pandas that the pivot column
is the index on each table. Indeed, in our case the gid index will be used
in both tables to compute the merge. If in one table, for example the right
one, the pivot column is not the index, one can set right_index=False and
add the parameter right_on='<column_name>' (the same parameter exists for
the left table).
Additionally, if there are name clashes (same column name in both tables),
one can also use the suffixes parameter, for example suffixes=('_first',
'_second').
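For instance, with two hypothetical one-row tables that both hold a name
column, the merge result would look like this:
>>> left = pd.DataFrame({'name': ['foo']}, index=[1])
>>> right = pd.DataFrame({'name': ['bar']}, index=[1])
>>> pd.merge(left, right, left_index=True, right_index=True,
...          suffixes=('_first', '_second'))
  name_first name_second
1        foo         bar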
And that's all we need to add the frname column. Simple, isn't it?
The parent_gid column is trickier. We'll delegate computation of the gid
from administrative codes to a separate function. But first let's define two
mutually inverse dictionaries, linking level numbers to feature codes and
back.
>>> level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
>>> fcode_level = dict((v, k) for k, v in level_fcode.items())
And here is the function computing the Geonames id from administrative codes.
>>> def geonameid_from_codes(level, **codes):
... """Return the Geoname id of the *administrative* place with the
... given information.
...
... Return ``None`` if there is no match.
...
... ``level`` is an integer from 0 to 4. 0 means we are looking for a
... political entity (``PCL<X>`` feature code), 1 for a first-level
... administrative division (``ADM1``), and so on until fourth-level
... (``ADM4``).
...
... Then user must provide at least one of the ``code0``, ...,
... ``code4`` keyword parameters, depending on ``level``.
...
... Examples::
...
... >>> geonameid_from_codes(level=0, code0='RE')
... 935317
... >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
... ... code2='974', code3='9742')
... 935213
... >>> geonameid_from_codes(level=0, code0='AB') # None
...
... """
... try:
... idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
... except KeyError:
... raise ValueError('Not enough codeX parameters for level {0}'.format(level))
... idx = (level_fcode[level],) + idx
... try:
... return admgids.loc[idx, 'gid'].values[0]
... except (KeyError, TypeError):
... return None
...
A few comments about this function: the first part builds a tuple idx with
the given codes. Then this idx is used as an index value in the admgids
dataframe to find the matching gid.
We also need another function which will compute the parent gid for a Pandas
row.
>>> from six import string_types
>>> def parent_geonameid(row):
... """Return the Geoname id of the parent of the given Pandas row.
...
... Return ``None`` if we can't find the parent's gid.
... """
...
...     level = fcode_level.get(row.fcode)
...     if (level is None and isinstance(row.fcode, string_types)
...             and len(row.fcode) >= 3):
...         level = fcode_level.get(row.fcode[:3])
...     if level is None:
...         level = 5
... level -= 1
... if level < 0:
... return None
...
... l = list(range(5))
... while l and pd.isnull(row['code{0}'.format(l[-1])]):
... l.pop()
... codes = {}
... for i in l:
... if i > level:
... break
... code_label = 'code{0}'.format(i)
... codes[code_label] = row[code_label]
... try:
... return geonameid_from_codes(level, **codes)
... except ValueError:
... return None
...
In this function, we first look at the row's fcode to get the row's
administrative level. If the fcode is ADM<X>, we get the level directly. If
it is PCL<X>, its PCL prefix gives level 0. Otherwise we set the level to 5,
to say that the place is below level 4. The parent's level is then the found
level minus one. And if that is -1, we know that we were on a country and
there is no parent.
Then we compute all available administrative codes, removing codes with NaN
values from the end.
With the level and the codes, we can search for the parent's gid using the
previous function.
Now, how do we use this function? No need for a for loop: Pandas gives us
the apply method.
... for chunk in reader:
...
... parent_gids = chunk.apply(parent_geonameid, axis=1)
This will apply the parent_geonameid function to each row and return a new
Pandas series whose head looks like this.
gid
2986043 nan
2993838 nan
2994701 nan
3007683 nan
3017832 nan
...
3039162 3041565.0
None values have been converted to NaN. As a consequence, integer values have
been converted to float (you cannot have NaN in an integer column), and this
is not what we want. As a compromise, we are going to convert the series to
str and strip the decimal part.
...     parent_gids = parent_gids.astype(str)
... parent_gids.replace(r'\.0', '', regex=True, inplace=True)
We also add a label to the column. That will be the name of the column in our
final dataframe.
... parent_gids = parent_gids.rename('parent_gid')
And we can now append this new column to our chunk dataframe.
... chunk = pd.concat([chunk, parent_gids], axis=1)
We're almost there. Before we can save the chunk in a CSV file, we must
reorganize its columns to the expected order. For now, the frname and
parent_gid columns have been appended at the end of the dataframe.
... chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
... 'fcode', 'parent_gid', 'lat',
... 'lng'])
At last, we save the chunk to the file opened in append mode.
... chunk.to_csv('final.txt', mode='a', encoding='utf-8',
... quoting=csv.QUOTE_NONE, sep='\t', header=None)
Currently, creating the new CSV file from Geonames takes hours, and this is
not acceptable. There are multiple ways to make things go faster. One of the
most significant changes is to cache the results of the parent_geonameid
function. Indeed, many places in Geonames share the same parent; computing
the parent gid once and caching it sounds like a good idea.
If you are using Python 3, you can simply use the @functools.lru_cache
decorator (on a helper taking only hashable arguments). But let us try to
define our own custom cache.
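For the record, here is a minimal sketch of the lru_cache route. Since
lru_cache requires hashable arguments (a Pandas row is not hashable), we
would wrap the lookup in a helper; the name cached_gid is ours:
>>> import functools
>>> @functools.lru_cache(maxsize=None)
... def cached_gid(level, *codes):
...     kwargs = {'code{0}'.format(i): c for i, c in enumerate(codes)}
...     return geonameid_from_codes(level, **kwargs)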
>>> from six import string_types
>>> gid_cache = {}
>>> def parent_geonameid(row):
... """Return the Geoname id of the parent of the given Pandas row.
...
... Return ``None`` if we can't find the parent's gid.
... """
...
...     level = fcode_level.get(row.fcode)
...     if (level is None and isinstance(row.fcode, string_types)
...             and len(row.fcode) >= 3):
...         level = fcode_level.get(row.fcode[:3])
...     if level is None:
...         level = 5
... level -= 1
... if level < 0:
... return None
...
... l = list(range(5))
... while l and pd.isnull(row['code{0}'.format(l[-1])]):
... l.pop()
... codes = {}
... code_tuple = [level]
... for i in l:
... if i > level:
... break
... code_label = 'code{0}'.format(i)
... code = row[code_label]
... codes[code_label] = code
... code_tuple.append(code)
... code_tuple = tuple(code_tuple)
... try:
... parent_gid = (gid_cache.get(code_tuple)
... or geonameid_from_codes(level, **codes))
... except ValueError:
... parent_gid = None
...
... if code_tuple not in gid_cache:
... gid_cache[code_tuple] = parent_gid
... return parent_gid
The only difference from the previous version is the use of a gid_cache
dictionary. Keys for this dictionary are tuples (<level>, <code0>,
[<code1>, ..., <code4>]) (stored in the code_tuple variable), and the
corresponding value is the parent gid for this combination of level and
codes. The parent gid is first looked up in this dictionary; if absent, it is
computed with the geonameid_from_codes function as before, and the result is
cached.
Let's review what we have done.
We have defined three useful dictionaries: two to convert between feature
codes and level numbers, and one to cache computation results.
level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
fcode_level = dict((v, k) for k, v in level_fcode.items())
gid_cache = {}
We have defined a function computing a Geoname id from administrative codes.
def geonameid_from_codes(level, **codes):
"""Return the Geoname id of the *administrative* place with the
given information.
Return ``None`` if there is no match.
``level`` is an integer from 0 to 4. 0 means we are looking for a
political entity (``PCL<X>`` feature code), 1 for a first-level
administrative division (``ADM1``), and so on until fourth-level
(``ADM4``).
Then user must provide at least one of the ``code0``, ...,
``code4`` keyword parameters, depending on ``level``.
Examples::
>>> geonameid_from_codes(level=0, code0='RE')
935317
>>> geonameid_from_codes(level=3, code0='RE', code1='RE',
... code2='974', code3='9742')
935213
>>> geonameid_from_codes(level=0, code0='AB') # None
>>>
"""
try:
idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
except KeyError:
raise ValueError('Not enough codeX parameters for level {0}'.format(level))
idx = (level_fcode[level],) + idx
try:
return admgids.loc[idx, 'gid'].values[0]
except (KeyError, TypeError):
return None
We have defined a function computing the parent's gid of a Pandas row.
def parent_geonameid(row):
"""Return the Geoname id of the parent of the given Pandas row.
Return ``None`` if we can't find the parent's gid.
"""
    level = fcode_level.get(row.fcode)
    if (level is None and isinstance(row.fcode, string_types)
            and len(row.fcode) >= 3):
        level = fcode_level.get(row.fcode[:3])
    if level is None:
        level = 5
level -= 1
if level < 0:
return None
l = list(range(5))
while l and pd.isnull(row['code{0}'.format(l[-1])]):
l.pop()
codes = {}
code_tuple = [level]
for i in l:
if i > level:
break
code_label = 'code{0}'.format(i)
code = row[code_label]
codes[code_label] = code
code_tuple.append(code)
code_tuple = tuple(code_tuple)
try:
parent_gid = (gid_cache.get(code_tuple)
or geonameid_from_codes(level, **codes))
except ValueError:
parent_gid = None
if code_tuple not in gid_cache:
gid_cache[code_tuple] = parent_gid
return parent_gid
And finally we have loaded the file allCountries.txt into Pandas using
chunks of 1,000,000 rows to save memory. For each chunk, we have merged it
with the frnames table to add the frname column, and we have applied the
parent_geonameid function to add the parent_gid column. We have then
reordered the columns and appended the chunk to the final CSV file.
d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
'code2', 'code3', 'code4'], str)
d_types['gid'] = 'uint32'
d_types['lat'] = 'float16'
d_types['lng'] = 'float16'
with open('allCountries.txt') as geof:
reader = pd.read_csv(
geof, encoding='utf-8',
sep='\t', quoting=csv.QUOTE_NONE,
header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
'code0', 'code1', 'code2', 'code3', 'code4'),
dtype=d_types, index_col=0,
na_values='', keep_default_na=False,
chunksize=1000000, error_bad_lines=False
)
for chunk in reader:
chunk = pd.merge(chunk, frnames, how='left',
left_index=True, right_index=True)
parent_gids = chunk.apply(parent_geonameid, axis=1)
parent_gids = parent_gids.astype(str)
parent_gids.replace(r'\.0', '', regex=True, inplace=True)
parent_gids = parent_gids.rename('parent_gid')
chunk = pd.concat([chunk, parent_gids], axis=1)
chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
'fcode', 'parent_gid', 'lat',
'lng'])
chunk.to_csv('final.txt', mode='a', encoding='utf-8',
quoting=csv.QUOTE_NONE, sep='\t', header=None)
This final part is the longest, because the parent_geonameid function takes
some time on each chunk to compute all the parent gids. But at the end of the
process we'll proudly see a final.txt file with data the way we want it,
and without using too much memory... High five!
Congratulations! This ends our journey into the Pandas world.
Regarding Geonames, to be honest, we've only scratched the surface of its
complexity. There's so much more to be done.
If you look at the file we've just produced, you'll see plenty of empty
values in the parent_gid column. Maybe our method for getting the Geonames
id of the parent needs to be improved. Maybe all those orphan places should
be moved inside their countries.
Another problem lies within the Geonames data itself. France has overseas
territories, for example Reunion Island, Geonames id 935,317. This place has
feature code PCLD, which means "dependent political entity". And indeed,
Reunion Island is not a country and should not appear at the top level of the
tree, at the same level as France. So some work should be done here to link
Reunion Island to France in some way, maybe using the hitherto ignored cc2
column (for "alternate country codes").
Still another improvement, easier this one, is to have yet another parent level
for continents. For this, one can use the file countryInfo.txt, downloadable
from the same page.
Considering speed this time, there is also room for improvement. First, the
code itself might be better designed to avoid some tests and for loops. One
way to go is Cython (see:
http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html). Another
possibility is to use multiprocessing, since each chunk of allCountries.txt
is independent: worker processes can put their finished chunks on a queue
that a single writer process reads to write the data to the output file, as
sketched below.
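Here is a rough sketch of that design. The function process_chunk stands for
the per-chunk transformation shown above, worker and writer are our own
hypothetical names, and the queues are multiprocessing.Queue objects created
by the parent process:

import csv
import multiprocessing as mp

def worker(in_queue, out_queue):
    # Consume chunks until the None sentinel, emit transformed CSV text.
    # process_chunk() is the merge/apply/reindex pipeline from above.
    for chunk in iter(in_queue.get, None):
        out_queue.put(process_chunk(chunk).to_csv(
            sep='\t', quoting=csv.QUOTE_NONE, header=False))

def writer(out_queue, path):
    # A single writer process, so there are no concurrent writes to the file.
    with open(path, 'a') as f:
        for text in iter(out_queue.get, None):
            f.write(text)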