This post deals with the Pandas Python library, the open and free access of
timeseries datasets thanks to the Quandl website and how you can handle
datasets with pandas efficiently.
There has been a long time that I want to play a little with pandas. Not an
adorable black and white teddy bear but the well-known Python Data library based
on Numpy. I would like to show how you can easely retrieve some numerical
datasets from the Quandl website and its API, and handle these datasets
with pandas efficiently trought its main object: the DataFrame.
Note that this blog post comes with a IPython Notebook which can be found at http://nbviewer.ipython.org/url/www.logilab.org/file/187482/raw/quandl-data-with-pandas.ipynb
You also can get it at http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with HG.
hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/
and get the IPython Notebook, the HTML conversion of this Notebook and some related
At work or at home, I use Debian. A quick and dumb apt-get install
python-pandas is enough. Nevertheless, (1) I'm keen on having a fresh and
bloody upstream sources to get the lastest features and (2) I'm trying to
contribute a little to the project --- tiny bugs, writing some docs. So I prefer
to install it from source. Thus, I pull, I do sudo python setup.py develop
and a few Cython compiling seconds later, I can do:
import pandas as pd
For the other ways to get the library, see the download page on the official website or see the
dedicated Pypi page.
Let's build 10 brownian motions and plotting them with matplotlib.
import numpy as np
I don't very like the default font and color of the matplotlib figures and
curves. I know that pandas defines a "mpl style". Just after the import, you
pd.options.display.mpl_style = 'default'
Maybe I'm wrong, but I think that it's sometimes a quite difficult to retrieve
some workable numerial datasets in the huge amount of available data over the
Web. Free Data, Open Data and so on. OK folks, where are they ? I don't want to
spent my time to through an Open Data website, find some interesting issues,
parse an Excel file, get some specific data, mangling them to get a 2D arrays of
floats with labels. Note that pandas fits with these kinds of problem very
well. See the IO part of the pandas documentation --- CSV, Excel, JSON,
HDF5 reading/writing functions. I just want workable numerical data without making
A few days ago, a colleague of mine talked me about Quandl, a website dedicated
to find and use numerical datasets with timeseries on the Internet. A perfect
source to retrieve some data and play with pandas. Note that you can access
some data about economics, health, population, education etc. thanks to a clever
API. Get some datasets in CSV/XML/JSON formats between this date and this date,
aggregate them, compute the difference, etc.
Moreover, you can access Quandl's datasets through any programming languages,
like R, Julia, Clojure or Python (also available plugins or
modules for some softwares such as Excel, Stata, etc.). The Quandl's Python
package depends on Numpy and
pandas. Perfect ! I can use the module Quandl.py available on GitHub and query some
datasets directly in a DataFrame.
Here we are, huge amount of data are teasing me. Next question: which data to
I've already imported the pandas library. Let's query some datasets thanks to
the Quandl Python module. An example inspired by the README from the Quandl's
GitHub page project.
data = Quandl.get('GOOG/NYSE_IBM')
and you get:
Open High Low Close Volume
2013-10-11 185.25 186.23 184.12 186.16 3232828
2013-10-14 185.41 186.99 184.42 186.97 2663207
2013-10-15 185.74 185.94 184.22 184.66 3367275
2013-10-16 185.42 186.73 184.99 186.73 6717979
2013-10-17 173.84 177.00 172.57 174.83 22368939
OK, I'm not very familiar with this kind of data. Take a look at the Quandl
website. After a dozen of minutes on the Quandl website, I found this OECD
murder rates. This page
shows current and historical murder rates (assault deaths per 100 000 people)
for 33 countries from the OECD. Take a country and type:
uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR')
It's a DataFrame with a single column 'Value'. The index of the DataFrame is
a timeserie. You can easily plot these data thanks to a:
See the other pieces of code and using examples in the dedicated IPython
Notebook. I also get data about unemployment in OECD for the quite same
countries with more dates. Then, as I would like to compare these data, I must
select similar countries, time-resample my data to have the same frequency and
so on. Take a look. Any comment is welcomed.
So, the remaining content of this blog post is just a summary of a few
interesting and useful pandas features used in the IPython notebook.
- Using the timeseries as Index of my DataFrames
- pd.concat to concatenate several DataFrames along a given axis. This
function can deal with missing values if the Index of each DataFrame are
not similar (this is my case)
- DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV
files. There are different arguments for the read_csv which deals with
dates, mising value, header & footer, etc.
- DateOffset pandas object to deal with different time frequencies. Quite
useful if you handle some data with calendar or business day, month end or
begin, quarter end or begin, etc.
- Resampling some data with the method resample. I use it to make frequency
conversion of some data with timeseries.
- Merging/joining DataFrames. Quite similar to the "SQL" feature. See
pd.merge function or the DataFrame.join method. I used this feature to
align my two DataFrames along its Index.
- Some Matplotlib plotting functions such as DataFrame.plot() and
I showed a few useful pandas features in the IPython Notebooks: concatenation,
plotting, data computation, data alignement. I think I can show more but this
could be occurred in a further blog post. Any comments, suggestions or questions
The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.
The pictures come from: