Blog entries January 2014 [2]
  • A Salt Configuration for C++ Development

    2014/01/24 by Damien Garaud
    http://www.logilab.org/file/204916/raw/SaltStack-Logo.png

    At Logilab, we've been using Salt for one year to manage our own infrastructure. I wanted to use it to manage a specific configuration: C++ development. When I instantiate a Virtual Machine with a Debian image, I don't want to spend time installing and configuring a system to fit my needs as a C++ developer.

    This article is a very simple recipe to get a C++ development environment, ready to use, ready to hack.

    Give Me an Editor and a DVCS

    Quite simple: I use the YAML file format that Salt relies on to describe what I want. To install my two editors, Vim and Emacs, I just need to write:

    vim-nox:
      pkg.installed
    
    emacs23-nox:
      pkg.installed
    

    For Mercurial, as you can guess:

    mercurial:
      pkg.installed
    

    You can write these lines in the same init.sls file, but you can also decide to split your configuration into different subdirectories: one place for each thing. I decided to create two directories, dev and edit, at the root of my Salt configuration, each with its own init.sls.

    That's all for the editors. Next step: specific C++ development packages.

    Install Several "C++" Packages

    In a cpp folder, I write an init.sls file with this content:

    gcc:
      pkg.installed

    g++:
      pkg.installed

    gdb:
      pkg.installed

    cmake:
      pkg.installed

    automake:
      pkg.installed

    libtool:
      pkg.installed

    pkg-config:
      pkg.installed

    colorgcc:
      pkg.installed
    

    The choice of these packages is arbitrary: add or remove them as you need, there is no single right solution. But I want more. I want some LLVM packages. In a cpp/llvm.sls file, I write:

    llvm:
      pkg.installed

    clang:
      pkg.installed

    libclang-dev:
      pkg.installed

    {% if not grains['oscodename'] == 'wheezy' %}
    lldb-3.3:
      pkg.installed
    {% endif %}
    

    The last lines specify that the lldb-3.3 package is only installed if your Debian release is not the stable one (wheezy), i.e. jessie/testing or sid in my case. Now, just include this file in the init.sls one:

    # ...
    # at the end of 'cpp/init.sls'
    include:
      - .llvm
    

    Organize your sls files according to your needs. That's all for package installation. Your Salt configuration now looks like this:

    .
    |-- cpp
    |   |-- init.sls
    |   `-- llvm.sls
    |-- dev
    |   `-- init.sls
    |-- edit
    |   `-- init.sls
    `-- top.sls
    

    Launching Salt

    Start your VM and install a masterless Salt on it (e.g. apt-get install salt-minion). To launch Salt locally on your naked VM, you need to copy your configuration (through scp or a DVCS) into the /srv/salt/ directory and write the top.sls file:

    base:
      '*':
        - dev
        - edit
        - cpp
    

    Then, as root, just launch:

    > salt-call --local state.highstate
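
    If you first want to see what would change without applying anything, Salt also supports a dry run:

    > salt-call --local state.highstate test=True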

    And What About Configuration Files?

    You're right. At the beginning of the post, I talked about a "ready to use" Mercurial with some HG extensions. So I copy the default /etc/mercurial/hgrc.d/hgext.rc file into the dev directory of my Salt configuration, then edit it to set some extensions such as color, rebase and pager. As I also need Evolve, I have to clone the source code from https://bitbucket.org/marmoute/mutable-history. With Salt, I can say "clone this repo and copy this file" to specific places.

    So, I add some lines to dev/init.sls:

    https://bitbucket.org/marmoute/mutable-history:
      hg.latest:
        - rev: tip
        - target: /opt/local/mutable-history
        - require:
          - pkg: mercurial

    /etc/mercurial/hgrc.d/hgext.rc:
      file.managed:
        - source: salt://dev/hgext.rc
        - user: root
        - group: root
        - mode: 644
    

    The require keyword means "install the mercurial package (if necessary) before cloning". The other lines are quite self-explanatory.

    In the end, you have just six files with a few lines. Your configuration now looks like:

    .
    |-- cpp
    |   |-- init.sls
    |   `-- llvm.sls
    |-- dev
    |   |-- hgext.rc
    |   `-- init.sls
    |-- edit
    |   `-- init.sls
    `-- top.sls
    

    You can customize it and share it with your teammates. A step further would be to add some configuration files for your favorite editor. You can also imagine installing extra packages that your library depends on: simply add a subdirectory such as amazing_lib and write your own init.sls. I know I often need the Boost libraries, for example. When your Salt configuration has changed, just type salt-call --local state.highstate again.
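
    As an illustration, a minimal amazing_lib/init.sls pulling in Boost could look like the following sketch (the Debian package names are my assumption; adapt them to your distribution):

    # Hypothetical amazing_lib/init.sls
    libboost-dev:
      pkg.installed

    libboost-program-options-dev:
      pkg.installed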

    As you can see, setting up your environment on a fresh system takes only a couple of commands at the shell before you are ready to compile your C++ library, debug it, fix it and commit your modifications to your repository.


  • What's New in Pandas 0.13?

    2014/01/20 by Damien Garaud
    http://www.logilab.org/file/203841/raw/pandas_logo.png

    Do you know pandas, a Python library for data analysis? Version 0.13 came out on January 16th and this post describes a few new features and improvements that I think are important.

    Each release has its list of bug fixes and API changes. You may read the full release notes if you want all the details, but I will just focus on a few things.

    You may be interested in one of my previous blog posts that showed a few useful pandas features with datasets from the Quandl website and came with an IPython Notebook for reproducing the results.

    Let's talk about some new and improved pandas features. I suppose that you have some knowledge of pandas features and its main objects such as Series and DataFrame. If not, I suggest you watch the tutorial video by Wes McKinney on the main page of the project or read 10 Minutes to Pandas in the documentation.

    Refactoring

    I welcome the refactoring effort: the Series type, formerly subclassed from ndarray, now has the same base class as DataFrame and Panel, i.e. NDFrame. This work unifies methods and behaviors for these classes. Be aware that you may hit two potential incompatibilities with versions less than 0.13. See internal refactoring for more details.

    Timeseries

    to_timedelta()

    The new function pd.to_timedelta converts a string, scalar or array of strings to a Numpy timedelta type (np.timedelta64, in nanoseconds). It requires Numpy version >= 1.7. You can handle an array of timedeltas and divide it by another timedelta to carry out a frequency conversion.
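
    For instance, a quick sketch (the string formats are taken from the 0.13 release notes):

    import pandas as pd

    # Convert a single string or a list of strings to timedelta64 values.
    pd.to_timedelta('1 days 06:05:01.00003')
    pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])

    The longer example below builds a Series of timedeltas by subtracting two DatetimeIndex objects: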

    from datetime import timedelta
    import numpy as np
    import pandas as pd
    
    # Create a Series of timedelta from two DatetimeIndex.
    dr1 = pd.date_range('2013/06/23', periods=5)
    dr2 = pd.date_range('2013/07/17', periods=5)
    td = pd.Series(dr2) - pd.Series(dr1)
    
    # Set some Na{N,T} values.
    td[2] -= np.timedelta64(timedelta(minutes=10, seconds=7))
    td[3] = np.nan
    td[4] += np.timedelta64(timedelta(hours=14, minutes=33))
    td
    
    0   24 days, 00:00:00
    1   24 days, 00:00:00
    2   23 days, 23:49:53
    3                 NaT
    4   24 days, 14:33:00
    dtype: timedelta64[ns]
    

    Note the NaT type (instead of the well-known NaN). For day conversion:

    td / np.timedelta64(1, 'D')
    
    0    24.000000
    1    24.000000
    2    23.992975
    3          NaN
    4    24.606250
    dtype: float64
    

    You can also use a DateOffset, as in:

    td + pd.offsets.Minute(10) - pd.offsets.Second(7) + pd.offsets.Milli(102)
    

    Nanosecond Time

    There is now support for nanosecond times as an offset; see pd.offsets.Nano. You can pass 'N' as the freq argument of pd.date_range to use this offset.
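
    For example, a tiny sketch (the start date is arbitrary):

    import pandas as pd

    # Five timestamps spaced one nanosecond apart.
    pd.date_range('2014-01-20', periods=5, freq='N')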

    Daylight Savings

    The tz_localize method can now infer a fall daylight savings transition based on the structure of the unlocalized data. This method, like tz_convert, is available for any DatetimeIndex, Series or DataFrame with a DatetimeIndex. You can use it to localize your datasets thanks to the pytz module, or convert your timeseries to a different time zone. See the related documentation about time zone handling. To use the daylight savings inference in tz_localize, set the infer_dst argument to True.
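
    Here is a small sketch of the inference, adapted from the pandas time zone documentation; the duplicated 01:00 wall time comes from the fall-back transition:

    import pandas as pd

    # Hourly data recorded across the US/Eastern fall transition: 01:00 occurs twice.
    rng = pd.DatetimeIndex(['11/06/2011 00:00', '11/06/2011 01:00',
                            '11/06/2011 01:00', '11/06/2011 02:00',
                            '11/06/2011 03:00'])
    # infer_dst=True lets tz_localize disambiguate the repeated hour.
    rng.tz_localize('US/Eastern', infer_dst=True)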

    DataFrame Features

    New Method isin()

    The new DataFrame method isin is used for boolean indexing. The argument to this method can be another DataFrame, a Series, or a dictionary mapping column labels to lists of values. Comparing two DataFrames with isin is equivalent to df1 == df2. But you can also check whether values from a list occur in any column, or whether some values occur in a few specific columns (i.e. using a dict instead of a list as argument):

    df = pd.DataFrame({'A': [3, 4, 2, 5],
                       'Q': ['f', 'e', 'd', 'c'],
                       'X': [1.2, 3.4, -5.4, 3.0]})
    
       A  Q    X
    0  3  f  1.2
    1  4  e  3.4
    2  2  d -5.4
    3  5  c  3.0
    

    and then:

    df.isin(['f', 1.2, 3.0, 5, 2, 'd'])
    
           A      Q      X
    0   True   True   True
    1  False  False  False
    2   True   True  False
    3   True  False   True
    

    Of course, you can use the previous result as a mask for the current DataFrame.

    mask = _  # in IPython, '_' holds the previous result
    df[mask.any(1)]
    
       A  Q    X
    0  3  f  1.2
    2  2  d -5.4
    3  5  c  3.0
    
    When you pass a dictionary to the isin method, you can specify the column labels for each value.
    
    mask = df.isin({'A': [2, 3, 5], 'Q': ['d', 'c', 'e'], 'X': [1.2, -5.4]})
    df[mask]
    
        A    Q    X
    0   3  NaN  1.2
    1 NaN    e  NaN
    2   2    d -5.4
    3   5    c  NaN
    

    See the related documentation for more details or different examples.

    New Method str.extract

    The new vectorized extract method comes from the StringMethods object, available through the str attribute of a Series. It makes it possible to extract some data with regular expressions, as follows:

    s = pd.Series(['doe@umail.com', 'nobody@post.org', 'wrong.mail', 'pandas@pydata.org', ''])
    # Extract usernames.
    s.str.extract(r'(\w+)@\w+\.\w+')
    

    returns:

    0       doe
    1    nobody
    2       NaN
    3    pandas
    4       NaN
    dtype: object
    

    Note that the result is a Series holding the matched groups (with NaN when there is no match), not the raw re match objects. You can also add more groups:

    # Extract usernames and domain.
    s.str.extract(r'(\w+)@(\w+\.\w+)')
    
            0           1
    0     doe   umail.com
    1  nobody    post.org
    2     NaN         NaN
    3  pandas  pydata.org
    4     NaN         NaN
    

    Elements that do not match return NaN. You can also use named groups, which is useful if you want more explicit column names (the following example also drops the rows with NaN values):

    # Extract usernames and domain with named groups.
    s.str.extract(r'(?P<user>\w+)@(?P<at>\w+\.\w+)').dropna()
    
         user          at
    0     doe   umail.com
    1  nobody    post.org
    3  pandas  pydata.org
    

    Thanks to this part of the documentation, I also found out about other useful string methods such as split, strip, replace, etc. for handling a Series of str. Note that most of them have been available since 0.8.1. Take a look at the string handling API doc (recently added) and some basics about vectorized string methods.
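
    A quick sketch with made-up values:

    s = pd.Series([' foo ', 'bar-baz', 'qux-quux'])
    s.str.strip()            # remove surrounding whitespace
    s.str.replace('-', '_')  # element-wise replacement
    s.str.split('-')         # split each element into a list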

    Interpolation Methods

    DataFrame has a new interpolate method, similar to Series. It was possible to interpolate missing data in a DataFrame before, but it did not take the dates into account if you had a timeseries index. Now, you can pass a specific interpolation method to the method argument. You can use scipy interpolation functions such as slinear, quadratic, polynomial, and others. The time method is used to take your timeseries index into account.

    from datetime import date
    # Arbitrary timeseries
    ts = pd.DatetimeIndex([date(2006,5,2), date(2006,12,23), date(2007,4,13),
                           date(2007,6,14), date(2008,8,31)])
    df = pd.DataFrame(np.random.randn(5, 2), index=ts, columns=['X', 'Z'])
    # Fill the DataFrame with missing values.
    df['X'].iloc[[1, -1]] = np.nan
    df['Z'].iloc[3] = np.nan
    df
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23       NaN -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980       NaN
    2008-08-31       NaN  0.566205
    

    Without any optional argument, you have:

    df.interpolate()
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23 -0.823514 -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980  0.554975
    2008-08-31  1.210980  0.566205
    

    With the time method, you obtain:

    df.interpolate(method='time')
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23 -1.156217 -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980  0.546496
    2008-08-31  1.210980  0.566205
    

    I suggest you read more examples in the missing data part of the documentation and the scipy documentation about the interpolate module.

    Misc

    You can convert a Series to a single-column DataFrame with its new to_frame method.
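
    A one-line sketch:

    # The resulting column is named after the Series.
    s = pd.Series([1, 2, 3], name='value')
    s.to_frame()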

    Misc & Experimental Features

    Retrieve R Datasets

    Not a killer feature but very pleasant: the possibility to load into a DataFrame all the R datasets listed at http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

    import pandas.rpy.common as com
    titanic = com.load_data('Titanic')
    titanic.head()
    
      Survived    Age     Sex Class value
    0       No  Child    Male   1st   0.0
    1       No  Child    Male   2nd   0.0
    2       No  Child    Male   3rd  35.0
    3       No  Child    Male  Crew   0.0
    4       No  Child  Female   1st   0.0
    

    for the dataset about survival of passengers on the Titanic. You can find several other datasets about New York air quality measurements, body temperature series of two beavers, plant growth results or violent crime rates by US state, for instance. Very useful if you would like to show pandas to a friend, a colleague or your grandma and you do not have a dataset with you.

    And then three great experimental features.

    Eval and Query Experimental Features

    The eval and query methods use numexpr, which can quickly evaluate array expressions such as x - 0.5 * y. For numexpr, x and y are Numpy arrays. You can use this powerful feature in pandas to evaluate expressions over different DataFrame columns. By the way, we already talked about numexpr a few years ago in EuroScipy 09: Need for Speed.

    df = pd.DataFrame(np.random.randn(10, 3), columns=['x', 'y', 'z'])
    df.head()
    
              x         y         z
    0 -0.617131  0.460250 -0.202790
    1 -1.943937  0.682401 -0.335515
    2  1.139353  0.461892  1.055904
    3 -1.441968  0.477755  0.076249
    4 -0.375609 -1.338211 -0.852466
    
    df.eval('x + 0.5 * y - z').head()
    
    0   -0.184217
    1   -1.267222
    2    0.314395
    3   -1.279340
    4   -0.192248
    dtype: float64
    

    With the query method, you can select elements using a very simple query syntax:

    df.query('x >= y > z')
    
              x         y         z
    9  2.560888 -0.827737 -1.326839
    

    msgpack Serialization

    There are new reading and writing functions to serialize your data with the great and well-known msgpack library. Note that this experimental feature does not have a stable storage format. You can imagine using zmq to transfer msgpack-serialized pandas objects over TCP, IPC or SSH, for instance.
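
    A minimal sketch of the round trip (the file name is arbitrary):

    # Write a DataFrame to a msgpack file and read it back.
    df = pd.DataFrame(np.random.randn(3, 2), columns=['a', 'b'])
    df.to_msgpack('frame.msg')
    pd.read_msgpack('frame.msg')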

    Google BigQuery

    The recent module pandas.io.gbq provides a way to load datasets into, and extract them from, the Google BigQuery Web service. I've not installed the requirements for this feature yet. The example in the release notes shows how you can select the average monthly temperature in the year 2000 across the USA. You can also read the related pandas documentation. Nevertheless, you will need a BigQuery account, as with the other Google products.

    Take Your Keyboard

    Give it a try, play with some data, mangle and plot them, compute some stats, retrieve some patterns or whatever. I'm convinced that pandas will be used more and more, and not only by data scientists or quantitative analysts. Open an IPython Notebook, pick up some data and let yourself be tempted by pandas.

    I think I will make more use of the vectorized string methods that I found out about while writing this post. I'm glad to have learned more about timeseries, because I know that I'll use these features. And I'm looking forward to trying the experimental features such as eval/query and msgpack serialization.

    You can follow me on Twitter (@jazzydag). See also Logilab (@logilab_org).