blog entries created by Damien Garaud

A Salt Configuration for C++ Development

2014/01/24 by Damien Garaud

At Logilab, we've been using Salt for one year to manage our own infrastructure. I wanted to use it to manage a specific configuration: C++ development. When I instantiate a Virtual Machine with a Debian image, I don't want to spend time to install and configure a system which fits my needs as a C++ developer:

This article is a very simple recipe to get a C++ development environment, ready to use, ready to hack.

Give Me an Editor and a DVCS

Quite simple: I use the YAML file format used by Salt to describe what I want. To install these two editors, I just need to write:



For Mercurial, you'll guess:


You can write these lines in the same init.sls file, but you can also decide to split your configuration into different subdirectories: one place for each thing. I decided to create a dev and editor directories at the root of my salt config with two init.sls inside.

That's all for the editors. Next step: specific C++ development packages.

Install Several "C++" Packages

In a cpp folder, I write a file init.sls with this content:









The choice of these packages is arbitrary. You add or remove some as you need. There is not a unique right solution. But I want more. I want some LLVM packages. In a cpp/llvm.sls, I write:




{% if not grains['oscodename'] == 'wheezy' %}
{% endif %}

The last line specifies that you install the lldb package if your Debian release is not the stable one, i.e. jessie/testing or sid in my case. Now, just include this file in the init.sls one:

# ...
# at the end of 'cpp/init.sls'
  - .llvm

Organize your sls files according to your needs. That's all for packages installation. You Salt configuration now looks like this:

|-- cpp
|   |-- init.sls
|   `-- llvm.sls
|-- dev
|   `-- init.sls
|-- edit
|   `-- init.sls
`-- top.sls

Launching Salt

Start your VM and install a masterless Salt on it (e.g. apt-get install salt-minion). For launching Salt locally on your naked VM, you need to copy your configuration (through scp or a DVCS) into /srv/salt/ directory and to write the file top.sls:

    - dev
    - edit
    - cpp

Then just launch:

> salt-call --local state.highstate

as root.

And What About Configuration Files?

You're right. At the beginning of the post, I talked about a "ready to use" Mercurial with some HG extensions. So I use and copy the default /etc/mercurial/hgrc.d/hgext.rc file into the dev directory of my Salt configuration. Then, I edit it to set some extensions such as color, rebase, pager. As I also need Evolve, I have to clone the source code from With Salt, I can tell "clone this repo and copy this file" to specific places.

So, I add some lines to dev/init.sls.
      - rev: tip
      - target: /opt/local/mutable-history
      - require:
         - pkg: mercurial

      - source: salt://dev/hgext.rc
      - user: root
      - group: root
      - mode: 644

The require keyword means "install (if necessary) this target before cloning". The other lines are quite self-explanatory.

In the end, you have just six files with a few lines. Your configuration now looks like:

|-- cpp
|   |-- init.sls
|   `-- llvm.sls
|-- dev
|   |-- hgext.rc
|   `-- init.sls
|-- edit
|   `-- init.sls
`-- top.sls

You can customize it and share it with your teammates. A step further would be to add some configuration files for your favorite editor. You can also imagine to install extra packages that your library depends on. Quite simply add a subdirectory amazing_lib and write your own init.sls. I know I often need Boost libraries for example. When your Salt configuration has changed, just type: salt-call --local state.highstate.

As you can see, setting up your environment on a fresh system will take you only a couple commands at the shell before you are ready to compile your C++ library, debug it, fix it and commit your modifications to your repository.

What's New in Pandas 0.13?

2014/01/20 by Damien Garaud

Do you know pandas, a Python library for data analysis? Version 0.13 came out on January the 16th and this post describes a few new features and improvements that I think are important.

Each release has its list of bug fixes and API changes. You may read the full release note if you want all the details, but I will just focus on a few things.

You may be interested in one of my previous blog post that showed a few useful Pandas features with datasets from the Quandl website and came with an IPython Notebook for reproducing the results.

Let's talk about some new and improved Pandas features. I suppose that you have some knowledge of Pandas features and main objects such as Series and DataFrame. If not, I suggest you watch the tutorial video by Wes McKinney on the main page of the project or to read 10 Minutes to Pandas in the documentation.


I welcome the refactoring effort: the Series type, subclassed from ndarray, has now the same base class as DataFrame and Panel, i.e. NDFrame. This work unifies methods and behaviors for these classes. Be aware that you can hit two potential incompatibilities with versions less that 0.13. See internal refactoring for more details.



Function pd.to_timedelta to convert a string, scalar or array of strings to a Numpy timedelta type (np.timedelta64 in nanoseconds). It requires a Numpy version >= 1.7. You can handle an array of timedeltas, divide it by an other timedelta to carry out a frequency conversion.

from datetime import timedelta
import numpy as np
import pandas as pd

# Create a Series of timedelta from two DatetimeIndex.
dr1 = pd.date_range('2013/06/23', periods=5)
dr2 = pd.date_range('2013/07/17', periods=5)
td = pd.Series(dr2) - pd.Series(dr1)

# Set some Na{N,T} values.
td[2] -= np.timedelta64(timedelta(minutes=10, seconds=7))
td[3] = np.nan
td[4] += np.timedelta64(timedelta(hours=14, minutes=33))
0   24 days, 00:00:00
1   24 days, 00:00:00
2   23 days, 23:49:53
3                 NaT
4   24 days, 14:33:00
dtype: timedelta64[ns]

Note the NaT type (instead of the well-known NaN). For day conversion:

td / np.timedelta64(1, 'D')
0    24.000000
1    24.000000
2    23.992975
3          NaN
4    24.606250
dtype: float64

You can also use the DateOffSet as:

td + pd.offsets.Minute(10) - pd.offsets.Second(7) + pd.offsets.Milli(102)

Nanosecond Time

Support for nanosecond time as an offset. See pd.offsets.Nano. You can use N of this offset in the pd.date_range function as the value of the argument freq.

Daylight Savings

The tz_localize method can now infer a fall daylight savings transition based on the structure of the unlocalized data. This method, as the tz_convert method is available for any DatetimeIndex, Series and DataFrame with a DatetimeIndex. You can use it to localize your datasets thanks to the pytz module or convert your timeseries to a different time zone. See the related documentation about time zone handling. To use the daylight savings inference in the method tz_localize, set the infer_dst argument to True.

DataFrame Features

New Method isin()

New DataFrame method isin which is used for boolean indexing. The argument to this method can be an other DataFrame, a Series, or a dictionary of a list of values. Comparing two DataFrame with isin is equivalent to do df1 == df2. But you can also check if values from a list occur in any column or check if some values for a few specific columns occur in the DataFrame (i.e. using a dict instead of a list as argument):

df = pd.DataFrame({'A': [3, 4, 2, 5],
                   'Q': ['f', 'e', 'd', 'c'],
                   'X': [1.2, 3.4, -5.4, 3.0]})
   A  Q    X
0  3  f  1.2
1  4  e  3.4
2  2  d -5.4
3  5  c  3.0

and then:

df.isin(['f', 1.2, 3.0, 5, 2, 'd'])
       A      Q      X
0   True   True   True
1  False  False  False
2   True   True  False
3   True  False   True

Of course, you can use the previous result as a mask for the current DataFrame.

mask = _
      A  Q    X
   0  3  f  1.2
   2  2  d -5.4
   3  5  c  3.0

When you pass a dictionary to the ``isin`` method, you can specify the column
labels for each values.
mask = df.isin({'A': [2, 3, 5], 'Q': ['d', 'c', 'e'], 'X': [1.2, -5.4]})
    A    Q    X
0   3  NaN  1.2
1 NaN    e  NaN
2   2    d -5.4
3   5    c  NaN

See the related documentation for more details or different examples.

New Method str.extract

The new vectorized extract method from the StringMethods object, available with the suffix str on Series or DataFrame. Thus, it is possible to extract some data thanks to regular expressions as followed:

s = pd.Series(['', '', 'wrong.mail', '', ''])
# Extract usernames.


0       doe
1    nobody
2       NaN
3    pandas
4       NaN
dtype: object

Note that the result is a Series with the re match objects. You can also add more groups as:

# Extract usernames and domain.
        0           1
0     doe
1  nobody
2     NaN         NaN
3  pandas
4     NaN         NaN

Elements that do no math return NaN. You can use named groups. More useful if you want a more explicit column names (without NaN values in the following example):

# Extract usernames and domain with named groups.
     user          at
0     doe
1  nobody
3  pandas

Thanks to this part of the documentation, I also found out other useful strings methods such as split, strip, replace, etc. when you handle a Series of str for instance. Note that the most of them have already been available in 0.8.1. Take a look at the string handling API doc (recently added) and some basics about vectorized strings methods.

Interpolation Methods

DataFrame has a new interpolate method, similar to Series. It was possible to interpolate missing data in a DataFrame before, but it did not take into account the dates if you had index timeseries. Now, it is possible to pass a specific interpolation method to the method function argument. You can use scipy interpolation functions such as slinear, quadratic, polynomial, and others. The time method is used to take your index timeseries into account.

from datetime import date
# Arbitrary timeseries
ts = pd.DatetimeIndex([date(2006,5,2), date(2006,12,23), date(2007,4,13),
                       date(2007,6,14), date(2008,8,31)])
df = pd.DataFrame(np.random.randn(5, 2), index=ts, columns=['X', 'Z'])
# Fill the DataFrame with missing values.
df['X'].iloc[[1, -1]] = np.nan
df['Z'].iloc[3] = np.nan
                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23       NaN -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980       NaN
2008-08-31       NaN  0.566205

Without any optional argument, you have:

                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23 -0.823514 -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980  0.554975
2008-08-31  1.210980  0.566205

With the time method, you obtain:

                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23 -1.156217 -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980  0.546496
2008-08-31  1.210980  0.566205

I suggest you to read more examples in the missing data doc part and the scipy documentation about the module interpolate.


Convert a Series to a single-column DataFrame with its method to_frame.

Misc & Experimental Features

Retrieve R Datasets

Not a killing feature but very pleasant: the possibility to load into a DataFrame all R datasets listed at

import pandas.rpy.common as com
titanic = com.load_data('Titanic')
  Survived    Age     Sex Class value
0       No  Child    Male   1st   0.0
1       No  Child    Male   2nd   0.0
2       No  Child    Male   3rd  35.0
3       No  Child    Male  Crew   0.0
4       No  Child  Female   1st   0.0

for the datasets about survival of passengers on the Titanic. You can find several and different datasets about New York air quality measurements, body temperature series of two beavers, plant growth results or the violent crime rates by US state for instance. Very useful if you would like to show pandas to a friend, a colleague or your Grandma and you do not have a dataset with you.

And then three great experimental features.

Eval and Query Experimental Features

The eval and query methods which use numexpr which can fastly evaluate array expressions as x - 0.5 * y. For numexpr, x and y are Numpy arrays. You can use this powerfull feature in pandas to evaluate different DataFrame columns. By the way, we have already talked about numexpr a few years ago in EuroScipy 09: Need for Speed.

df = pd.DataFrame(np.random.randn(10, 3), columns=['x', 'y', 'z'])
          x         y         z
0 -0.617131  0.460250 -0.202790
1 -1.943937  0.682401 -0.335515
2  1.139353  0.461892  1.055904
3 -1.441968  0.477755  0.076249
4 -0.375609 -1.338211 -0.852466
df.eval('x + 0.5 * y - z').head()
0   -0.184217
1   -1.267222
2    0.314395
3   -1.279340
4   -0.192248
dtype: float64

About the query method, you can select elements using a very simple query syntax.

df.query('x >= y > z')
          x         y         z
9  2.560888 -0.827737 -1.326839

msgpack Serialization

New reading and writing functions to serialize your data with the great and well-known msgpack library. Note this experimental feature does not have a stable storage format. You can imagine to use zmq to transfer msgpack serialized pandas objects over TCP, IPC or SSH for instance.

Google BigQuery

A recent module which provides a way to load into and extract datasets from the Google BigQuery Web service. I've not installed the requirements for this feature now. The example of the release note shows how you can select the average monthly temperature in the year 2000 across the USA. You can also read the related pandas documentation. Nevertheless, you will need a BigQuery account as the other Google's products.

Take Your Keyboard

Give it a try, play with some data, mangle and plot them, compute some stats, retrieve some patterns or whatever. I'm convinced that pandas will be more and more used and not only for data scientists or quantitative analysts. Open an IPython Notebook, pick up some data and let yourself be tempted by pandas.

I think I will use more the vectorized strings methods that I found out about when writing this post. I'm glad to learn more about timeseries because I know that I'll use these features. I'm looking forward to the two experimental features such as eval/query and msgpack serialization.

You can follow me on Twitter (@jazzydag). See also Logilab (@logilab_org).

Lecture de données tabulées avec Numpy -- Retour d'expérience sur dtype

2013/12/17 by Damien Garaud

Ce billet est un petit retour d'expérience sur l'utilisation de Numpy pour lire et extraire des données tabulées depuis des fichiers texte.

Chaque section, hormis les objectifs ou la conclusion, correspond soit à une difficulté rencontrée, une remarque technique, des explications et références vers la documentation officielle sur un point précis qui m'a fait patauger quelques temps. Il y a de forte chance, pour certains d'entre vous, que les points décrits ici vous paraissent évidents, que vous vous disiez "mais qui ne sait pas ça ?!". J'étais moi-même le premier étonné, depuis que je connais Numpy, de ne pas savoir ce genre de choses. Je l'étais moins quand autour de moi, mes camarades ne semblaient pas non plus connaître les petites histoires numpysiennes que je vais vous conter.


Le Pourquoi et le Où on va au fait ?

J'avais sous la main des fichiers aux données tabulées, type CSV, où les types de données par colonne étaient clairement identifiés. Je ne souhaitais pas passer du temps avec le module csv de la bibliothèque standard à convertir chaque élément en type de base, str, flottants ou dates. Numpy étant déjà une dépendance du projet, et connaissant la fonction np.genfromtxt, j'ai évidemment souhaité l'utiliser.

Il était nécessaire de ne lire que certaines colonnes. Je souhaitais associer un nom à chaque colonne. L'objectif était ensuite d'itérer sur ces données ligne par ligne et les traiter dans des générateurs Python. Je n'utilise pas ici Numpy pour faire des opérations mathématiques sur ces tableaux à deux dimensions avec des types hétérogènes. Et je ne pense d'ailleurs pas qu'il soit pertinent d'utiliser ce type de tableau pour faire ces opérations.

dtypes différents, str et extraction de chaînes vides

On utilise ici l'argument dtype des fonctions telles que np.genfromtxt pour lire des fichiers tabulés dont les colonnes sont de types différents.

Attention au dtype à passer à np.genfromtxt ou np.recfromtxt quand on parse des données tabulée (file ou stream). Pour parser une colonne de chaînes de caratères, lui passer [('colname', str)] renvoie des chaînes vides si les autres dtypes sont de types différents.

Il faut préciser la taille :

dtype=[('colname', str, 10)]
# or
dtype=[('colname', 'S10')]

Ou alors prendre un "vrai" objet str Python :

dtype=[('colname', object)]

aussi équivalent à:

dtype=[('colname', 'object')]

Et oui, je suis littéralement tombé sur l'évidence, les "types Numpy", c'est du type C. Et les chaînes, c'est du char * et il y a donc besoin de la taille. La solitude s'est fait moindre quand j'ai su que je n'étais pas le seul à être tombé sur des données tronquées voire vides.

dtype et tableau à zéro dimension

Attention au tableau Numpy 0D quand le contenu tabulé à parser n'a qu'une seule ligne (cas d'un np.[rec]array avec plusieurs dtypes). Impossible d'itérer sur les éléments puisque dimension nulle.

Supposons que vous ayez un fichier tabulé d'une seule ligne :


J'utilise np.genfromtxt en précisant le type des colonnes que je souhaite récupérer (je ne prends pas en compte ici la première ligne).

data = np.genfromtxt(filename, delimiter=',',
                     dtype=[('name', 'S12'), ('play', object), ('age', int)],

Voici la représentation de mon array :

array(('Coltrane', 'Saxo', 28),
    dtype=[('name', 'S12'), ('play', 'O'), ('age', '<i8')])

Si dans votre code, vous avez eu la bonne idée de parcourir vos données avec :

for name, instrument, age in data:
    # ...

vous pourrez obenir un malheureux TypeError: 'numpy.int64' object is not iterable par exemple. Vous n'avez pas eu de chance, votre tableau Numpy est à zéro dimension et une shape nulle (i.e. un tuple vide). Effectivement, itérer sur un objet de dimension nulle n'est pas chose aisée. Ce que je veux, c'est un tableau à une dimension avec un seul élément (ici un tuple avec mes trois champs) sur lequel il est possible d'itérer.

Pour cela, faire:

>>> print data
array(('Coltrane', 'Saxo', 28),
      dtype=[('name', 'S12'), ('play', 'O'), ('age', '<i8')])array(('babar', 42.), dytpe=[('f0', 'S5'), ('f1', '<f8')])
>>> print data.shape, data.ndim
(), 0
>>> data = data[np.newaxis]
>>> print data.shape, data.ndim
(1,), 1

dtype et str : chararray ou ndarray de strings ?

Pour les chararray, lire help(np.chararray) ou En particulier:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

On fera donc la distinction entre:

# ndarray of str
na = np.array(['babar', 'celeste'], dtype=np.str_)
# chararray
ca = np.chararray(2)
ca[0], ca[1] = 'babar', 'celeste'

Le type de tableau est ici différent : np.ndarray pour le premier et np.chararray pour le second. Malheureusement pour np.recfromtxt et en particulier pour np.recarray, si on transpose le label de la colonne en tant qu'attribut, np.recarray il est transformé en chararray avec le bon type Numpy --- |S7 dans notre cas --- au lieu de conserver un np.ndarray de type |S7.

Exemple :

from StringIO import StringIO
rawtxt = 'babar,36\nceleste,12'
a = np.recfromtxt(StringIO(rawtxt), delimiter=',', dtype=[('name', 'S7'), ('age', int)])

Le print rend bien un objet de type chararray. Alors que :

a = np.genfromtxt(StringIO(rawtxt), delimiter=',', dtype=[('name', 'S7'), ('age', int)])

affiche ndarray. J'aimerais que tout puisse être du même type, peu importe la fonction utilisée. Au vue de la documentation et de l'aspect déprécié du type charray, on souhaiterait avoir que du ndarray de type np.str. J'ai par ailleurs ouvert le ticket Github 3993 qui n'a malheureusement que peu de succès :-(

Tableau de chaînes : quel dtype ?

Si certains se demandent quoi mettre pour représenter le type "une chaîne de caractères" dans un tableau numpy, ils ont le choix :

np.array(['coltrane', 'hancock'], dtype=np.str)
np.array(['coltrane', 'hancock'], dtype=np.str_)
np.array(['coltrane', 'hancock'], dtype=np.string_)
np.array(['coltrane', 'hancock'], dtype='S')
np.array(['coltrane', 'hancock'], dtype='S10')
np.array(['coltrane', 'hancock'], dtype='|S10')

Les questions peuvent être multiples : est-ce la même chose ? pourquoi tant de choses différentes ? Pourquoi tant de haine quand on lit la doc Numpy et que l'info ne saute pas aux yeux ? À savoir que le tableau construit sera identique dans chacun des cas. Il existe peut-être même d'autres moyens de construire un tableau de type identique en lui passant encore un n-ième argument dtype différent.

  • np.str représente le type str de Python. Il est converti automatiquement en type chaines de caractère Numpy dont la longueur correspond à la longueur maximale rencontrée.
  • np.str_ et np.string_ sont équivalents et représentent le type "chaîne de caractères" pour Numpy (longueur = longueur max.).
  • Les trois autres utilisent la représentation sous forme de chaîne de caractères du type np.string_.
    • S ne précise pas la taille de la chaîne (Numpy prend donc la chaîne la plus longue)
    • S10 précise la taille de la chaîne (données tronquées si la taille est plus petite que la chaîne la plus longue)
    • |S10 est strictement identique au précédent. Il faut savoir qu'il existe cette notation <f8 ou >f8 pour représenter un flottant. Les chevrons signifient little endian ou big endian respectivement. Le | sert à dire "pas pertinent". Source: la section typestr sur la page

À noter que j'ai particulièrement apprécié l'utilisation d'un symbole pour spécifier une information non pertinente --- depuis le temps que je me demandais ce que voulait bien pouvoir dire ce pipe devant le 'S'.

Conclusion (et pourquoi pas pandas ?)

Pandas, bibliothèque Python d'analyse de données basée sur Numpy, propose, via sa fonction read_csv, le même genre de fonctionnalités. Il a l'avantage de convertir les types par colonne sans lui donner d'information de type, qu'on lise toutes les colonnes ou seulement quelques unes. Pour les colonnes de type "chaîne de caractères", il prend un dtype=object et n'essaie pas de deviner la longueur maximale pour le type np.str_. Vous ne rencontrerez donc pas "le coup des chaînes vides/tronquées" comme avec dtype='S'.

Je ne m'étalerai pas sur tout le bien que je pense de cette bibliothèque. Je vous invite par ailleurs à lire/ parcourir un billet de novembre qui expose un certain nombre de fonctionnalités croustillantes et accompagné d'un IPython Notebook.

Et pourquoi pas Pandas ? Il ne me semble pas pertinent de dépendre d'une nouvelle bibliothèque, aussi bien soit-elle, pour une fonction, aussi utile soit-elle. Pandas est un projet intéressant, mais jeune, qui ne se distribue pas aussi bien que Numpy pour l'instant. De plus, le projet sur lequel je travaillais utilisait déjà Numpy. Je n'avais besoin de rien d'autre pour réaliser mon travail, et dépendre de Pandas ne me semblait pas très pertinent. Je me suis donc contenté des fonctions np.{gen,rec}fromtxt qui font très bien le boulot, avec un dtype comme il faut, tout en retenant les boulettes que j'ai faites.

Retrieve Quandl's Data and Play with a Pandas

2013/10/31 by Damien Garaud

This post deals with the Pandas Python library, the open and free access of timeseries datasets thanks to the Quandl website and how you can handle datasets with pandas efficiently.

Why this post?

There has been a long time that I want to play a little with pandas. Not an adorable black and white teddy bear but the well-known Python Data library based on Numpy. I would like to show how you can easely retrieve some numerical datasets from the Quandl website and its API, and handle these datasets with pandas efficiently trought its main object: the DataFrame.

Note that this blog post comes with a IPython Notebook which can be found at

You also can get it at with HG.

Just do:

hg clone

and get the IPython Notebook, the HTML conversion of this Notebook and some related CSV files.

First Step: Get the Code

At work or at home, I use Debian. A quick and dumb apt-get install python-pandas is enough. Nevertheless, (1) I'm keen on having a fresh and bloody upstream sources to get the lastest features and (2) I'm trying to contribute a little to the project --- tiny bugs, writing some docs. So I prefer to install it from source. Thus, I pull, I do sudo python develop and a few Cython compiling seconds later, I can do:

import pandas as pd

For the other ways to get the library, see the download page on the official website or see the dedicated Pypi page.

Let's build 10 brownian motions and plotting them with matplotlib.

import numpy as np
pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot()

I don't very like the default font and color of the matplotlib figures and curves. I know that pandas defines a "mpl style". Just after the import, you can write:

pd.options.display.mpl_style = 'default'

Second Step: Have You Got Some Data Please ?

Maybe I'm wrong, but I think that it's sometimes a quite difficult to retrieve some workable numerial datasets in the huge amount of available data over the Web. Free Data, Open Data and so on. OK folks, where are they ? I don't want to spent my time to through an Open Data website, find some interesting issues, parse an Excel file, get some specific data, mangling them to get a 2D arrays of floats with labels. Note that pandas fits with these kinds of problem very well. See the IO part of the pandas documentation --- CSV, Excel, JSON, HDF5 reading/writing functions. I just want workable numerical data without making effort.

A few days ago, a colleague of mine talked me about Quandl, a website dedicated to find and use numerical datasets with timeseries on the Internet. A perfect source to retrieve some data and play with pandas. Note that you can access some data about economics, health, population, education etc. thanks to a clever API. Get some datasets in CSV/XML/JSON formats between this date and this date, aggregate them, compute the difference, etc.

Moreover, you can access Quandl's datasets through any programming languages, like R, Julia, Clojure or Python (also available plugins or modules for some softwares such as Excel, Stata, etc.). The Quandl's Python package depends on Numpy and pandas. Perfect ! I can use the module available on GitHub and query some datasets directly in a DataFrame.

Here we are, huge amount of data are teasing me. Next question: which data to play with?

Third Step: Give some Food to Pandas

I've already imported the pandas library. Let's query some datasets thanks to the Quandl Python module. An example inspired by the README from the Quandl's GitHub page project.

import Quandl
data = Quandl.get('GOOG/NYSE_IBM')

and you get:

              Open    High     Low   Close    Volume
2013-10-11  185.25  186.23  184.12  186.16   3232828
2013-10-14  185.41  186.99  184.42  186.97   2663207
2013-10-15  185.74  185.94  184.22  184.66   3367275
2013-10-16  185.42  186.73  184.99  186.73   6717979
2013-10-17  173.84  177.00  172.57  174.83  22368939

OK, I'm not very familiar with this kind of data. Take a look at the Quandl website. After a dozen of minutes on the Quandl website, I found this OECD murder rates. This page shows current and historical murder rates (assault deaths per 100 000 people) for 33 countries from the OECD. Take a country and type:


It's a DataFrame with a single column 'Value'. The index of the DataFrame is a timeserie. You can easily plot these data thanks to a:


See the other pieces of code and using examples in the dedicated IPython Notebook. I also get data about unemployment in OECD for the quite same countries with more dates. Then, as I would like to compare these data, I must select similar countries, time-resample my data to have the same frequency and so on. Take a look. Any comment is welcomed.

So, the remaining content of this blog post is just a summary of a few interesting and useful pandas features used in the IPython notebook.

  • Using the timeseries as Index of my DataFrames
  • pd.concat to concatenate several DataFrames along a given axis. This function can deal with missing values if the Index of each DataFrame are not similar (this is my case)
  • DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV files. There are different arguments for the read_csv which deals with dates, mising value, header & footer, etc.
  • DateOffset pandas object to deal with different time frequencies. Quite useful if you handle some data with calendar or business day, month end or begin, quarter end or begin, etc.
  • Resampling some data with the method resample. I use it to make frequency conversion of some data with timeseries.
  • Merging/joining DataFrames. Quite similar to the "SQL" feature. See pd.merge function or the DataFrame.join method. I used this feature to align my two DataFrames along its Index.
  • Some Matplotlib plotting functions such as DataFrame.plot() and plot(kind='bar').


I showed a few useful pandas features in the IPython Notebooks: concatenation, plotting, data computation, data alignement. I think I can show more but this could be occurred in a further blog post. Any comments, suggestions or questions are welcomed.

The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.

The pictures come from:

Rencontre LLVM à l'IRILL

2012/09/27 by Damien Garaud

L'IRILL , Initiative de Recherche et Innovation sur le Logiciel Libre, organisait une rencontre LLVM mardi 25 septembre à Paris dans les locaux de l'INRIA Place d'Italie dans le 13e arrondissement. Les organisateurs, Arnaud de Grandmaison, Duncan Sands, Sylvestre Ledru et Tobias Grosser ont chaleureusement accueilli environ 8 personnes dont 3 de Logilab.

L'annonce de l'IRILL indiquait une rencontre informelle avec une potentielle Bug Squashing Party sur Clang. Nous n'avons pas écrasé d'insectes mais avons plutôt discuté: discussions techniques, petits trolls, échanges de connaissances, d'outils et d'expériences accompagnés de bières, sodas, gâteaux et pizzas (et une salade solitaire et orpheline qui n'a finie dans le ventre de personne contrairement aux pizzas mexicaines ou quatre fromages).

LLVM (Low Level Virtual Machine) est un projet développé depuis une dizaine d'années qui propose une collection d'outils et de modules facilement réutilisables pour construire des logiciels orientés langages : interpréteurs, compilateurs, optimiseurs, convertisseurs source-to-source...

Depuis l'étage d'entrée du code dans le compilateur (ou frontend), en passant par l'optimiseur indépendant de la plate-forme, jusqu'à la génération de code machine (backend) et ce, pour plusieurs architectures (X86, PowerPC, ARM, etc.), le choix de son design en fait un outil tout à fait intéressant. Parmi ces frontends, il y a le fameux Clang (C/C++, Objective-C), GHC (Haskell). Et le projet dragonegg permet de l'utiliser comme plug-in à GCC et donc de bénéficier des différents frontends GCC (Ada, Fortran, etc.).

De nombreux outils se sont construits autour du framework LLVM : LLDB un débugger, vmkit une implémentation de la machine virtuelle Java et .NET ou encore polly, un projet de recherche dont Tobias est l'un des deux co-fondateurs, qui a pour objectif d'optimiser un programme indépendamment du langage via des polyhedral optimizations.

Les principaux contributeurs à ce jour sont Apple, Google, des contributeurs "individuels" dont Duncan Sands, suivis par des laboratoires de recherche et autres. Voir l'article de Sylvestre sur "Who is in control of LLVM/Clang ?".

La clef de voûte de LLVM est sa représentation intermédiaire (IR pour Intermediate Representation). C'est une structure de données qui représente sous forme SSA (Single Static Assignment) les flux de données et de contrôle. Cette forme est plus "haut-niveau" et plus lisible que du code assembleur, elle comporte certaines informations de type par exemple. Elle constitue le pivot de l'infrastructure LLVM : c'est ce que produisent les frontends comme Clang, qui est ensuite passée aux optimiseurs de LLVM et est ensuite consommée par les backends dont le rôle est de les transformer en code machine natif.

En voici un exemple en C:

int add(int a)
    return a + 42;

La représentation intermédiaire est donnée via Clang :

$> clang -S -emit-llvm add_function.c -o add_function.ll

L'extension *.ll désigne des "fichiers IR" et le résultat donne:

define i32 @add(i32 %a) nounwind uwtable {
  %1 = alloca i32, align 4
  store i32 %a, i32* %1, align 4
  %2 = load i32* %1, align 4
  %3 = add nsw i32 %2, 42
  ret i32 %3

Cette représentation peut ensuite être bitcodée, assemblée ou compilée. Voir les différentes commandes LLVM pour assembler, désassembler, optimiser, linker, exécuter, etc.

Une autre utilisation intéressante de cette infrastructure est l'utilisation de la bibliothèque Clang pour parser du code C/C++ afin de parcourir l'Abstract Syntax Tree (AST). Ceci offre notamment de belles possibilités d'introspection. Google a d'ailleurs rédigé un tutoriel sur les plugins Clang et ses possibilités via l'API de Clang, notamment la classe ASTConsumer. Il est possible de parcourir l'ensemble des constructeurs, de connaître la propriété d'une classe (abstraite ou non), parcourir les membres d'une classe, etc. Avant de partir, nous parlions de la possiblité de propager un changement d'API à travers toutes les bibliothèques qui en dépendent, soit la notion de patch sémantique.

Nous avons aussi profité pour parler un peu du langage Julia ou de Numba.

Pythonistes convaincus, nous connaissions déjà un peu Numba, projet de Travis Oliphant, contributeur NumPy et papa de SciPy. Ce projet utilise l'infrastructure LLVM pour compiler du byte-code Python en code machine (Just In Time compilation), en particulier pour NumPy et SciPy. Les exemples montrent comment passer d'une fonction Python qui traite un tableau NumPy à une fonction compilée Just In Time. Extrait issu d'un des exemples fournis par le projet Numba :

from numba import d
from numba.decorators import jit as jit

def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

csum2d = jit(ret_type=d, arg_types=[d[:,:]])(sum2d)

Oui, la fonction sum2d n'est pas optimale et très "naïve". Néanmoins, la fonction compilée va près de 250 fois plus vite ! Numba utilise les bindings LLVM pour Python via llvm-py afin de passer du code Python, à l'Intermediate Representation pour ensuite utiliser les fonctionnalités JIT d'LLVM.

Entre programmeurs de langages à typage statique, nous avons aussi parlé de C et d'Ocaml (dont il existe des bindings pour LLVM) et mentionné de beaux projets tels que PIPS et Coccinelle.

Pour finir, nous savons maintenant prononcer Clang. On dit "klang" et non "cilangue". Nous avons appris que gdb avait son propre parser et n'utilisait pas le parser de GCC. Rappelons-le, l'un des grands avantages de LLVM & Clang face à GCC est sa modularité et la possibilité d'utiliser une des pierres qui forme l'édifice pour construire sa propre maison.

Nous finissons ce billet par une citation. David Beazley disait lors de sa présentation à Eursoscipy 2012 :

The life is too short to write C++.

Certes. Mais qu'on le veuille ou qu'on le doive, autant se servir d'outils biens pensés pour nous faciliter la vie.

Encore merci aux organisateurs pour cette rencontre et à la prochaine.

Quelques références

A Python dev day at La Cantine. Would like to have more PyCon?

2012/06/01 by Damien Garaud

We were at La Cantine on May 21th 2012 in Paris for the " Replay session".

La Cantine is a coworking space where hackers, artists, students and so on can meet and work. It also organises some meetings and conferences about digital culture, computer science, ...

On May 21th 2012, it was a dev day about Python. "Would you like to have more PyCon?" is a french wordplay where PyCon sounds like Picon, a french "apéritif" which traditionally accompanies beer. A good thing because the meeting began at 6:30 PM! Presentations and demonstrations were about some Python projects presented at PyCon 2012 in Santa Clara (California) last March. The original pycon presentations are accessible on

PDB Introduction

By Gael Pasgrimaud (@gawel_).

pdb is the well-known Python debugger. Gael showed us how to easily use this almost-mandatory tool when you develop in Python. As with the gdb debugger, you can stop the execution at a breakpoint, walk up the stack, print the value of local variables or modify temporarily some local variables.

The best way to define a breakpoint in your source code, it's to write:

import pdb; pdb.set_trace()

Insert that where you would like pdb to stop. Then, you can step trough the code with s, c or n commands. See help for more information. Following, the help command in pdb command-line interpreter:

(Pdb) help

Documented commands (type help <topic>):
EOF    bt         cont      enable  jump  pp       run      unt
a      c          continue  exit    l     q        s        until
alias  cl         d         h       list  quit     step     up
args   clear      debug     help    n     r        tbreak   w
b      commands   disable   ignore  next  restart  u        whatis
break  condition  down      j       p     return   unalias  where

Miscellaneous help topics:
exec  pdb

It is also possible to invoke the module pdb when you run a Python script such as:

$> python -m pdb


By Alexis Metereau (@ametaireau).

Pyramid is an open source Python web framework from Pylons Project. It concentrates on providing fast, high-quality solutions to the fundamental problems of creating a web application:

  • the mapping of URLs to code ;
  • templating ;
  • security and serving static assets.

The framework allows to choose different approaches according the simplicity//feature tradeoff that the programmer need. Alexis, from the French team of Services Mozilla, is working with it on a daily basis and seemed happy to use it. He told us that he uses Pyramid more as web Python library than a web framework.


By Benoit Chesneau (@benoitc).

Circus is a process watcher and runner. Python scripts, via an API, or command-line interface can be used to manage and monitor multiple processes.

A very useful web application, called circushttpd, provides a way to monitor and manage Circus through the web. Circus uses zeromq, a well-known tool used at Logilab.

matplotlib demo

This session was a well prepared and funny live demonstration by Julien Tayon of matplotlib, the Python 2D plotting library . He showed us some quick and easy stuff.

For instance, how to plot a sinus with a few code lines with matplotlib and NumPy:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

# A simple sinus.
ax.plot(np.sin(np.arange(-10., 10., 0.05)))

which gives:

You can make some fancier plots such as:

# A sinus and a fancy Cardioid.
a = np.arange(-5., 5., 0.1)
ax_sin = fig.add_subplot(211)
ax_sin.plot(np.sin(a), '^-r', lw=1.5)
ax_sin.set_title("A sinus")

# Cardioid.
ax_cardio = fig.add_subplot(212)
x = 0.5 * (2. * np.cos(a) - np.cos(2 * a))
y = 0.5 * (2. * np.sin(a) - np.sin(2 * a))
ax_cardio.plot(x, y, '-og')
ax_cardio.set_xlabel(r"$\frac{1}{2} (2 \cos{t} - \cos{2t})$", fontsize=16)

where you can type some LaTeX equations as X label for instance.

The force of this plotting library is the gallery of several examples with piece of code. See the matplotlib gallery.

Using Python for robotics

Dimitri Merejkowsky reviewed how Python can be used to control and program Aldebaran's humanoid robot NAO.

Wrap up

Unfortunately, Olivier Grisel who was supposed to make three interesting presentations was not there. He was supposed to present :

  • A demo about injecting arbitrary code and monitoring Python process with Pyrasite.
  • Another demo about Interactive Data analysis with Pandas and the new IPython NoteBook.
  • Wrap up : Distributed computation on cluster related project: IPython.parallel, picloud and Storm + Umbrella

Thanks to La Cantine and the different organisers for this friendly dev day.

Python in Finance (and Derivative Analytics)

2011/10/25 by Damien Garaud

The Logilab team attended (and co-organized) EuroScipy 2011, at the end of August in Paris.

We saw some interesting posters and a presentation dealing with Python in finance and derivative analytics [1].

In order to debunk the idea that "all computation libraries dedicated to financial applications must be written in C/C++ or some other compiled programming language", I would like to introduce a more Pythonic way.

You may know that financial applications such as risk management have in most cases high computational needs. For instance, it can be necessary to quickly perform a large number of Monte Carlo simulations to evaluate an American option in a few seconds.

The Python community provides several reliable and efficient libraries and packages dedicated to numerical computations:
  • the well-known SciPy and NumPy libraries. They provide a complete set of tools to work with matrix, linear algebra operations, singular values decompositions, multi-variate regression models, ...
  • scikits is a set of add-on toolkits for SciPy. For instance there are statistical models in statsmodels packages, a toolkit dedicated to timeseries manipulation and another one dedicated to numerical optimization;
  • pandas is a recent Python package which provides "fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.". pandas uses Cython to improve its performance. Moreover, pandas has been used extensively in production in financial applications;
  • Cython is a way to write C extensions for the Python language. Since you write Cython code in the same way as you write Python code, it's easy to use it. Is it fast? Yes ! I compared a simple example from Cython's official documentation with a full Python code -- a piece of code which computes the first kth prime numbers. The Cython code is almost thirty times faster than the full-Python code (non-official). Furthermore, you can use NumPy in Cython code !

I believe that thanks to several useful tools and libraries, Python can be used in numerical computation, even in Finance (both research and production). It is easy-to-maintain without sacrificing performances.

Note that you can find some other references on Visixion webpages: