Blog entries

  • Nazca notebooks

    2014/07/04 by Vincent Michel

    We have just published the following ipython notebooks explaining how to perform record linkage and entities matching with Nazca:

  • Retrieve Quandl's Data and Play with a Pandas

    2013/10/31 by Damien Garaud

    This post deals with the Pandas Python library, the open and free access of timeseries datasets thanks to the Quandl website and how you can handle datasets with pandas efficiently.

    Why this post?

    There has been a long time that I want to play a little with pandas. Not an adorable black and white teddy bear but the well-known Python Data library based on Numpy. I would like to show how you can easely retrieve some numerical datasets from the Quandl website and its API, and handle these datasets with pandas efficiently trought its main object: the DataFrame.

    Note that this blog post comes with a IPython Notebook which can be found at

    You also can get it at with HG.

    Just do:

    hg clone

    and get the IPython Notebook, the HTML conversion of this Notebook and some related CSV files.

    First Step: Get the Code

    At work or at home, I use Debian. A quick and dumb apt-get install python-pandas is enough. Nevertheless, (1) I'm keen on having a fresh and bloody upstream sources to get the lastest features and (2) I'm trying to contribute a little to the project --- tiny bugs, writing some docs. So I prefer to install it from source. Thus, I pull, I do sudo python develop and a few Cython compiling seconds later, I can do:

    import pandas as pd

    For the other ways to get the library, see the download page on the official website or see the dedicated Pypi page.

    Let's build 10 brownian motions and plotting them with matplotlib.

    import numpy as np
    pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot()

    I don't very like the default font and color of the matplotlib figures and curves. I know that pandas defines a "mpl style". Just after the import, you can write:

    pd.options.display.mpl_style = 'default'

    Second Step: Have You Got Some Data Please ?

    Maybe I'm wrong, but I think that it's sometimes a quite difficult to retrieve some workable numerial datasets in the huge amount of available data over the Web. Free Data, Open Data and so on. OK folks, where are they ? I don't want to spent my time to through an Open Data website, find some interesting issues, parse an Excel file, get some specific data, mangling them to get a 2D arrays of floats with labels. Note that pandas fits with these kinds of problem very well. See the IO part of the pandas documentation --- CSV, Excel, JSON, HDF5 reading/writing functions. I just want workable numerical data without making effort.

    A few days ago, a colleague of mine talked me about Quandl, a website dedicated to find and use numerical datasets with timeseries on the Internet. A perfect source to retrieve some data and play with pandas. Note that you can access some data about economics, health, population, education etc. thanks to a clever API. Get some datasets in CSV/XML/JSON formats between this date and this date, aggregate them, compute the difference, etc.

    Moreover, you can access Quandl's datasets through any programming languages, like R, Julia, Clojure or Python (also available plugins or modules for some softwares such as Excel, Stata, etc.). The Quandl's Python package depends on Numpy and pandas. Perfect ! I can use the module available on GitHub and query some datasets directly in a DataFrame.

    Here we are, huge amount of data are teasing me. Next question: which data to play with?

    Third Step: Give some Food to Pandas

    I've already imported the pandas library. Let's query some datasets thanks to the Quandl Python module. An example inspired by the README from the Quandl's GitHub page project.

    import Quandl
    data = Quandl.get('GOOG/NYSE_IBM')

    and you get:

                  Open    High     Low   Close    Volume
    2013-10-11  185.25  186.23  184.12  186.16   3232828
    2013-10-14  185.41  186.99  184.42  186.97   2663207
    2013-10-15  185.74  185.94  184.22  184.66   3367275
    2013-10-16  185.42  186.73  184.99  186.73   6717979
    2013-10-17  173.84  177.00  172.57  174.83  22368939

    OK, I'm not very familiar with this kind of data. Take a look at the Quandl website. After a dozen of minutes on the Quandl website, I found this OECD murder rates. This page shows current and historical murder rates (assault deaths per 100 000 people) for 33 countries from the OECD. Take a country and type:


    It's a DataFrame with a single column 'Value'. The index of the DataFrame is a timeserie. You can easily plot these data thanks to a:


    See the other pieces of code and using examples in the dedicated IPython Notebook. I also get data about unemployment in OECD for the quite same countries with more dates. Then, as I would like to compare these data, I must select similar countries, time-resample my data to have the same frequency and so on. Take a look. Any comment is welcomed.

    So, the remaining content of this blog post is just a summary of a few interesting and useful pandas features used in the IPython notebook.

    • Using the timeseries as Index of my DataFrames
    • pd.concat to concatenate several DataFrames along a given axis. This function can deal with missing values if the Index of each DataFrame are not similar (this is my case)
    • DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV files. There are different arguments for the read_csv which deals with dates, mising value, header & footer, etc.
    • DateOffset pandas object to deal with different time frequencies. Quite useful if you handle some data with calendar or business day, month end or begin, quarter end or begin, etc.
    • Resampling some data with the method resample. I use it to make frequency conversion of some data with timeseries.
    • Merging/joining DataFrames. Quite similar to the "SQL" feature. See pd.merge function or the DataFrame.join method. I used this feature to align my two DataFrames along its Index.
    • Some Matplotlib plotting functions such as DataFrame.plot() and plot(kind='bar').


    I showed a few useful pandas features in the IPython Notebooks: concatenation, plotting, data computation, data alignement. I think I can show more but this could be occurred in a further blog post. Any comments, suggestions or questions are welcomed.

    The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.

    The pictures come from: