We have just published the following ipython notebooks explaining how to perform record linkage and entities matching with Nazca:
There has been a long time that I want to play a little with pandas. Not an adorable black and white teddy bear but the well-known Python Data library based on Numpy. I would like to show how you can easely retrieve some numerical datasets from the Quandl website and its API, and handle these datasets with pandas efficiently trought its main object: the DataFrame.
Note that this blog post comes with a IPython Notebook which can be found at http://nbviewer.ipython.org/url/www.logilab.org/file/187482/raw/quandl-data-with-pandas.ipynb
You also can get it at http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with HG.
hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/
and get the IPython Notebook, the HTML conversion of this Notebook and some related CSV files.
At work or at home, I use Debian. A quick and dumb apt-get install python-pandas is enough. Nevertheless, (1) I'm keen on having a fresh and bloody upstream sources to get the lastest features and (2) I'm trying to contribute a little to the project --- tiny bugs, writing some docs. So I prefer to install it from source. Thus, I pull, I do sudo python setup.py develop and a few Cython compiling seconds later, I can do:
import pandas as pd
Let's build 10 brownian motions and plotting them with matplotlib.
import numpy as np pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot()
I don't very like the default font and color of the matplotlib figures and curves. I know that pandas defines a "mpl style". Just after the import, you can write:
pd.options.display.mpl_style = 'default'
Maybe I'm wrong, but I think that it's sometimes a quite difficult to retrieve some workable numerial datasets in the huge amount of available data over the Web. Free Data, Open Data and so on. OK folks, where are they ? I don't want to spent my time to through an Open Data website, find some interesting issues, parse an Excel file, get some specific data, mangling them to get a 2D arrays of floats with labels. Note that pandas fits with these kinds of problem very well. See the IO part of the pandas documentation --- CSV, Excel, JSON, HDF5 reading/writing functions. I just want workable numerical data without making effort.
A few days ago, a colleague of mine talked me about Quandl, a website dedicated to find and use numerical datasets with timeseries on the Internet. A perfect source to retrieve some data and play with pandas. Note that you can access some data about economics, health, population, education etc. thanks to a clever API. Get some datasets in CSV/XML/JSON formats between this date and this date, aggregate them, compute the difference, etc.
Moreover, you can access Quandl's datasets through any programming languages, like R, Julia, Clojure or Python (also available plugins or modules for some softwares such as Excel, Stata, etc.). The Quandl's Python package depends on Numpy and pandas. Perfect ! I can use the module Quandl.py available on GitHub and query some datasets directly in a DataFrame.
Here we are, huge amount of data are teasing me. Next question: which data to play with?
I've already imported the pandas library. Let's query some datasets thanks to the Quandl Python module. An example inspired by the README from the Quandl's GitHub page project.
import Quandl data = Quandl.get('GOOG/NYSE_IBM') data.tail()
and you get:
Open High Low Close Volume Date 2013-10-11 185.25 186.23 184.12 186.16 3232828 2013-10-14 185.41 186.99 184.42 186.97 2663207 2013-10-15 185.74 185.94 184.22 184.66 3367275 2013-10-16 185.42 186.73 184.99 186.73 6717979 2013-10-17 173.84 177.00 172.57 174.83 22368939
OK, I'm not very familiar with this kind of data. Take a look at the Quandl website. After a dozen of minutes on the Quandl website, I found this OECD murder rates. This page shows current and historical murder rates (assault deaths per 100 000 people) for 33 countries from the OECD. Take a country and type:
uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR')
It's a DataFrame with a single column 'Value'. The index of the DataFrame is a timeserie. You can easily plot these data thanks to a:
See the other pieces of code and using examples in the dedicated IPython Notebook. I also get data about unemployment in OECD for the quite same countries with more dates. Then, as I would like to compare these data, I must select similar countries, time-resample my data to have the same frequency and so on. Take a look. Any comment is welcomed.
So, the remaining content of this blog post is just a summary of a few interesting and useful pandas features used in the IPython notebook.
- Using the timeseries as Index of my DataFrames
- pd.concat to concatenate several DataFrames along a given axis. This function can deal with missing values if the Index of each DataFrame are not similar (this is my case)
- DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV files. There are different arguments for the read_csv which deals with dates, mising value, header & footer, etc.
- DateOffset pandas object to deal with different time frequencies. Quite useful if you handle some data with calendar or business day, month end or begin, quarter end or begin, etc.
- Resampling some data with the method resample. I use it to make frequency conversion of some data with timeseries.
- Merging/joining DataFrames. Quite similar to the "SQL" feature. See pd.merge function or the DataFrame.join method. I used this feature to align my two DataFrames along its Index.
- Some Matplotlib plotting functions such as DataFrame.plot() and plot(kind='bar').
I showed a few useful pandas features in the IPython Notebooks: concatenation, plotting, data computation, data alignement. I think I can show more but this could be occurred in a further blog post. Any comments, suggestions or questions are welcomed.
The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.
The pictures come from: