Blog entries, October 2013
This post deals with the pandas Python library, the free and open access to
timeseries datasets provided by the Quandl website, and how you can handle
such datasets efficiently with pandas.
For a long time I have wanted to play a little with pandas. Not an
adorable black and white teddy bear, but the well-known Python data library based
on NumPy. I would like to show how you can easily retrieve some numerical
datasets from the Quandl website through its API, and handle these datasets
efficiently with pandas' main object: the DataFrame.
Note that this blog post comes with an IPython Notebook, which can be found at http://nbviewer.ipython.org/url/www.logilab.org/file/187482/raw/quandl-data-with-pandas.ipynb
You can also get it at http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with Mercurial:
hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/
and get the IPython Notebook, the HTML conversion of this Notebook and some related files.
At work or at home, I use Debian. A quick and dumb apt-get install
python-pandas is enough. Nevertheless, (1) I'm keen on having fresh
upstream sources to get the latest features, and (2) I'm trying to
contribute a little to the project (tiny bug fixes, writing some docs). So I prefer
to install it from source. Thus, I pull, I run sudo python setup.py develop,
and a few seconds of Cython compilation later, pandas is ready to import.
For other ways to get the library, see the download page on the official website or the
dedicated PyPI page.
Let's build 10 Brownian motions and plot them with matplotlib.
import numpy as np
I don't really like the default font and colors of matplotlib figures and
curves. Pandas defines an "mpl style"; just after the import, you can set:
pd.options.display.mpl_style = 'default'
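The notebook code for the Brownian motions did not survive here, so here is a minimal sketch of what it could look like (the number of steps and the column names are my own choices, not taken from the original notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

# 10 Brownian motions: cumulative sums of independent Gaussian steps,
# one motion per column of the DataFrame.
steps = np.random.normal(size=(500, 10))
motions = pd.DataFrame(steps.cumsum(axis=0),
                       columns=['motion-%d' % i for i in range(10)])
motions.plot(legend=False)
```

Calling plot() on the DataFrame draws one curve per column, which is exactly the kind of one-liner that makes pandas pleasant for quick exploration.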
Maybe I'm wrong, but I find it sometimes quite difficult to retrieve
workable numerical datasets from the huge amount of data available on the
Web. Free Data, Open Data and so on. OK folks, where are they? I don't want to
spend my time digging through an Open Data website, finding some interesting issues,
parsing an Excel file, extracting some specific data and mangling them into a 2D array of
floats with labels. Note that pandas handles these kinds of problems very
well. See the IO part of the pandas documentation (CSV, Excel, JSON and
HDF5 reading/writing functions). I just want workable numerical data without too much effort.
A few days ago, a colleague of mine told me about Quandl, a website dedicated
to finding and using numerical timeseries datasets on the Internet. A perfect
source for retrieving some data to play with pandas. You can access
data about economics, health, population, education, etc. thanks to a clever
API: get datasets in CSV/XML/JSON formats between two given dates,
aggregate them, compute the difference, etc.
Moreover, you can access Quandl's datasets from several programming languages,
like R, Julia, Clojure or Python (plugins or modules are also available
for some software such as Excel, Stata, etc.). Quandl's Python
package depends on NumPy and
pandas. Perfect! I can use the Quandl.py module available on GitHub and query some
datasets directly into a DataFrame.
Here we are: a huge amount of data is teasing me. Next question: which data to play with?
I've already imported the pandas library. Let's query some datasets with
the Quandl Python module. Here is an example inspired by the README of Quandl's
GitHub project:
import Quandl
data = Quandl.get('GOOG/NYSE_IBM')
and you get:
Open High Low Close Volume
2013-10-11 185.25 186.23 184.12 186.16 3232828
2013-10-14 185.41 186.99 184.42 186.97 2663207
2013-10-15 185.74 185.94 184.22 184.66 3367275
2013-10-16 185.42 186.73 184.99 186.73 6717979
2013-10-17 173.84 177.00 172.57 174.83 22368939
OK, I'm not very familiar with this kind of data. Take a look at the Quandl
website. After a dozen minutes there, I found these OECD
murder rates. This page
shows current and historical murder rates (assault deaths per 100 000 people)
for 33 countries from the OECD. Take a country and type:
uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR')
The result is a DataFrame with a single column, 'Value', indexed by a
timeseries. You can easily plot these data with a simple uk_df.plot().
See the other pieces of code and usage examples in the dedicated IPython
Notebook. I also retrieved OECD unemployment data for roughly the same
countries, over more dates. Then, as I would like to compare these data, I must
select matching countries, resample my data to a common frequency and
so on. Take a look. Any comment is welcome.
So, the remaining content of this blog post is just a summary of a few
interesting and useful pandas features used in the IPython notebook.
- Using the timeseries as the Index of my DataFrames
- pd.concat to concatenate several DataFrames along a given axis. This
  function can deal with missing values when the Indexes of the DataFrames are
  not the same (which is my case)
- DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV
  files. read_csv takes various arguments for dealing with
  dates, missing values, headers & footers, etc.
- the DateOffset pandas object to deal with different time frequencies. Quite
  useful if you handle data with calendar or business days, month end or
  beginning, quarter end or beginning, etc.
- resampling data with the resample method. I use it for frequency
  conversion of timeseries data
- merging/joining DataFrames, quite similar to the SQL feature. See the
  pd.merge function or the DataFrame.join method. I used this feature to
  align my two DataFrames along their Indexes
- some matplotlib plotting functions such as DataFrame.plot()
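A minimal, self-contained sketch of the concat/resample machinery listed above (the two series are illustrative stand-ins I made up, not the actual Quandl datasets):

```python
import numpy as np
import pandas as pd

# Two toy timeseries with different frequencies and different spans,
# like the murder-rate and unemployment data discussed above.
daily = pd.Series(np.random.randn(120).cumsum(),
                  index=pd.date_range('2000-01-01', periods=120, freq='D'))
monthly = pd.Series([2.0, 2.5, 3.0],
                    index=pd.date_range('2000-02-01', periods=3, freq='MS'))

# Frequency conversion: resample the daily series to monthly means.
daily_as_monthly = daily.resample('MS').mean()

# pd.concat aligns both series on the union of their Indexes and fills
# the holes with NaN (here, January is missing from `monthly`).
df = pd.concat({'daily_mean': daily_as_monthly, 'monthly': monthly}, axis=1)
```

Once aligned in a single DataFrame, the usual arithmetic, plotting and to_csv/read_csv round-trips work column-wise with no extra bookkeeping.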
I showed a few useful pandas features in the IPython Notebook: concatenation,
plotting, data computation, data alignment. I think I could show more, but that
will be for a further blog post. Any comments, suggestions or questions are welcome.
The next pandas release, 0.13, should be coming soon. I'll write a short blog post about it in a few days.
Last week, on the first day of OpenWorldForum 2013, we met up with
Thomas Hatch of SaltStack to have a talk about salt. He was in Paris
to give two talks the following day (1 & 2), and it was a
great opportunity to meet him and physically meet part of the French
Salt community. Since Logilab hosted the Great Salt Sprint in Paris, we offered to
co-organise the meetup at OpenWorldForum.
About 15 people gathered in Montrouge (near Paris) and we all took
turns to present ourselves and how or why we used salt. Some people
wanted to migrate from BCFG2 to salt. Some people told
the story of working for a month with CFEngine and then achieving the same
functionality in two days with salt, and so decided to go with salt
instead. Some like salt because they can hack its Python code. Some
use salt to provision pre-defined AMI images for the clouds
(salt-ami-cloud-builder). Some chose
salt over Ansible. Some want to
use salt to pilot temporary computation clusters in the cloud (sort of
like what StarCluster does
with boto and ssh).
When Paul from Logilab introduced salt-ami-cloud-builder, Thomas
Hatch said that some work is being done to go all the way: build an
image from scratch from a state definition. On the question of Debian
packaging, some effort could be made to get salt into
wheezy-backports. Julien Cristau from Logilab, who is a
Debian developer, might help with that.
Some untold stories were shared: some companies replaced
Puppet with salt, some use salt
to control an HPC cluster, and some use salt to pilot their
existing Puppet system.
We had some discussions around salt-cloud, which will probably
be merged into salt at some point. One idea for salt-cloud was raised:
have a way of defining a "minimum" type of configuration which translates
into profiles according to which provider is used (an issue should
be added shortly). The expression "pushing states" was often used; it
is probably a good way of looking at the combination of using
salt-cloud and the masterless mode available with salt-ssh. salt-cloud
controls an existing cloud, but Thomas Hatch pointed out that
with salt-virt, salt is becoming a cloud controller itself.
More on that soon.
Mixing 'public' and 'private' pillar definitions
can be tricky. Some solutions exist with multiple gitfs (or
mercurial) external pillar definitions, but more use cases will drive
more flexible functionality in the future.
For those in the audience who were not (yet) users of salt, Thomas
went back to explaining a few basics about it. Salt should be seen as
a "toolkit to solve problems in an infrastructure", says Thomas
Hatch. Why is it fast? It is completely asynchronous and event driven.
He gave a quick presentation about the new salt-ssh which was
introduced in 0.17, which
allows the application of salt recipes to machines that don't have a
minion connected to the master.
Peer communication can be used to condition a state on the presence of a
service on a different minion.
While doing demos or even hacking on salt, one can use
salt/test/minionswarm.py, which creates fake minions; not everyone has
hundreds of servers at their fingertips.
Modules are loaded dynamically and intelligently: for example, the git module
gets loaded if a state installs git and then, in the same highstate,
uses the git module.
Thomas explained the difference between grains and pillars: grains are
data about a minion that live on the minion, pillar is data about the
minion that lives on the master. When handling grains,
grains.setval can be useful (it writes to /etc/salt/grains as YAML,
so you can also edit it separately). If a minion is not reachable, one can
obtain its cached grains information by replacing test=True with cache=True.
Thomas briefly presented saltstack-formulas: people want to "program"
their states, and formulas answer this need, even though some of the Jinja2
gets overly complicated in making them flexible and programmable.
While talking about the unified package commands (a salt command often
has various backends according to what system runs the minion), for
example salt-call --local pkg.install vim, Thomas told this funny
story: ironically, salt was nominated for "best package manager" at
some Linux magazine competition (so you don't have to learn how to
use FreeBSD packaging tools).
While hacking salt, one can take a look at the Event Bus (see
test/eventlisten.py); many applications are possible when using the
data on this bus. Thomas talked about a future IOflow Python module
where complex logic can be implemented in the reactor with rules and
a state machine. One example use of this would be: if the load is high
on X servers and the number of connections on these servers is Y,
then launch extra machines.
To finish on a buzzword, someone asked: "what is the overlap of salt
and docker?" The answer is not simple, but Thomas thinks that in the
long run there will be a lot of overlap; one can check out the
existing lxc modules and states.
To wrap up, Thomas announced a salt conference
planned for January 2014 in Salt Lake City.
Logilab proposes to bootstrap the French community around salt. As the
group suggested, this could take the form of a mailing list, an IRC
channel, a meetup group, some sprints, or a combination of all of the
above. On that note, the next international sprint will probably take
place in January 2014 around the salt conference.
One nice way of having a reproducible development or test environment is to "program" a virtual machine to do the job. If you have a powerful machine at hand, you might use Vagrant in combination with VirtualBox. But if you have an OpenStack setup at hand (which is our case), you might want to set up and destroy your virtual machines on such a private cloud (or public cloud if you want or can). Sure, Vagrant has some plugins that should add OpenStack as a provider, but here at Logilab we have a clear preference for Python over Ruby. So this is where cloudenvy comes into play.
Cloudenvy is written in Python and, with some simple YAML configuration, can help you set up and provision virtual machines that contain your tests or your development environment.
Set up your authentication in ~/.cloudenvy.yml:
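The file content itself is missing here; a sketch of what it could look like, modeled on the cloudenvy README of the time (the cloud name, credentials and Keystone URL are placeholders to adapt to your OpenStack setup):

```yaml
cloudenvy:
  clouds:
    cloud01:
      os_username: username
      os_password: password
      os_tenant_name: tenant_name
      os_auth_url: http://keystone.example.com:5000/v2.0/
```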
Then create an Envyfile.yml at the root of your project:
# files copied from your host to the VM
#local_file : destination
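The Envyfile body is also missing; a sketch, again modeled on the cloudenvy README (the project name, image, flavor and script paths are illustrative placeholders, and key names may differ slightly depending on your cloudenvy version):

```yaml
project_config:
  name: my-project
  image: Ubuntu 12.04 cloudimg amd64
  remote_user: ubuntu
  flavor_name: m1.small
  files:
    # copy a provisioning script from the host into the VM
    provision.sh: /tmp/provision.sh
  provision_scripts:
    - /tmp/provision.sh
```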
Now simply type envy up. Cloudenvy does the rest. It "simply" creates your machine, copies the files, runs your provision script and gives you its IP address. You can then run envy ssh if you don't want to be bothered with IP addresses and such nonsense (forget about copying and pasting from the OpenStack web interface, or your nova show commands).
Little added bonus: if you know your machine will run a web server on port 8080 at some point, set it up in your environment by defining your access rules in the same Envyfile.yml:
'tcp, 22, 22, 0.0.0.0/0',
'tcp, 80, 80, 0.0.0.0/0',
'tcp, 8080, 8080, 0.0.0.0/0',
As you might know (or I'll just recommend it), you should be able to scratch and restart your environment without losing anything, so once in a while you'll just run envy destroy to do so. If you want multiple VMs with the same specs, go for envy up -n second-machine.
Only downside right now: cloudenvy isn't packaged for Debian (which is usually a prerequisite for the tools we use), but let's hope it gets some packaging soon (or maybe we'll end up doing it ourselves).
Don't forget to include this configuration in your project's version control so that a colleague starting on the project can just type envy up and have a working setup.
In the same vein, we've been trying out salt-cloud <https://github.com/saltstack/salt-cloud>, since provisioning machines with SaltStack is the way forward. A blog post about this is coming next.