Blog entries by Alain Leufroy [5]

Going to EuroScipy2013

2013/09/04 by Alain Leufroy

The EuroScipy2013 conference was held in Brussels at the Université libre de Bruxelles.

http://www.logilab.org/file/175984/raw/logo-807286783.png

As usual the first two days were dedicated to tutorials while the last two were dedicated to scientific presentations and general python-related talks. The meeting was extended by one more day of sprint sessions, during which enthusiasts could help free software projects, namely sage, vispy and scipy.

Jérôme and I had the great opportunity to represent Logilab during the scientific tracks and the sprint day. We enjoyed many talks about scientific applications using python. We will not describe the whole conference here: visit the conference website if you want the complete list of talks. In this article we focus on the talks we found most interesting.

First of all, the keynote by Cameron Neylon about Network ready research was very interesting. He presented some graphs about the impact of group work on solving complex problems. They revealed that there is a critical network size at which the effectiveness of solving a problem drastically increases. He pointed out that the "friction" of accessing source code limits the "getting help" variable. Open sourcing software could be the best way to reduce this friction, while unit testing and continuous integration are facilitators. More generally, process reproducibility is very important, and not only in computational research: retrieving experimental settings, metadata and the process environment is vital. We agree with this, as we experience it every day in our work. That is why we encourage open source licenses and develop Simulagora (in French), a collaborative platform that provides traceability and reproducibility for distributed simulations.

Ian Ozsvald's talk dealt with key points and tips from his own experience of growing a business based on open source and python, as well as mistakes to avoid (e.g. not checking beforehand that there are paying customers interested in what you want to develop). His talk was comprehensive and covered a wide range of situations.

http://vispy.org/_static/img/logo.png

We got a very nice presentation of a young but interesting visualization tool: Vispy. It is six months old and its first public release came out in early August. It is the result of the merge of four separate libraries, oriented toward interactive visualisation (vs. static figure generation for Matplotlib) and using OpenGL on GPUs to avoid CPU overload. A demonstration with large datasets showed vispy displaying millions of points in real time at 40 frames per second. During the talk we got interesting information about OpenGL features, such as anti-grain rendering done on the GPU, as compared to Matplotlib's Agg backend which does it on the CPU.

We also got to learn about cartopy, an open source Python library originally written for weather and climate science. It provides a useful and simple API to manipulate cartographic mappings.
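
As an illustration of how simple that API is, here is a minimal sketch (our own example, not from the talk) that draws coastlines on a world map with matplotlib:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

ax = plt.axes(projection=ccrs.PlateCarree())  # choose a cartographic projection
ax.coastlines()                               # draw the coastline data shipped with cartopy
plt.show()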

Distributed computing was a hot topic, and many talks related to this theme.

https://www.openstack.org/themes/openstack/images/openstack-logo-preview-full-color.png

Gaël Varoquaux reminded us of the key problems with "biggish data" and the key points for processing it successfully. Some of his recommendations are useful in general: "choose simple solutions", "fail gracefully", "make it easy to debug". For big data processing where I/O is the limiting constraint, first try to split the problem into random fractions of the data, then run the algorithms and aggregate the results to circumvent this limit. He also presented mini-batch processing, which takes a bunch of observations at a time (a trade-off between memory usage and vectorization), and joblib.parallel, which makes I/O faster using compression (CPUs are faster than disk access).
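
To illustrate the split-then-aggregate idea, here is a minimal sketch using joblib (the numpy-based chunking scheme is our own choice, not from the talk):

import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(1000000)
chunks = np.array_split(data, 8)        # split the problem into fractions of the data
partial = Parallel(n_jobs=4)(delayed(np.sum)(c) for c in chunks)
print(sum(partial))                     # aggregate the partial results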

Benoit Da Mota talked about shared memory in parallel computing, and Antonio Messina gave us a quick overview of how to build a computing cluster with Elasticluster, using OpenStack, Slurm and ansible. He demonstrated starting and stopping a cluster on OpenStack: once all the VMs are started, ansible configures them as hosts of the cluster, and new VMs can be created and added to the cluster on the fly thanks to a command line interface.
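
The demonstration boiled down to a handful of commands, roughly as follows (command names as in the Elasticluster documentation; mycluster is a hypothetical cluster template defined in its configuration file):

$ elasticluster start mycluster        # boot the VMs, ansible then configures them
$ elasticluster list-nodes mycluster   # inspect the running nodes
$ elasticluster stop mycluster         # tear the cluster down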

We also got a keynote by Peter Wang (from Continuum Analytics) about the future of data analysis with Python. As a PhD in physics I loved his metaphor of giving mass to data. He tried to explain the pain that scientists have when using databases.

https://scikits.appspot.com/static/images/scipyshiny_small.png

After the conference we participated in the numpy/scipy sprint, organized by Ralf Gommers and Pauli Virtanen. Eighteen people worked on closing issues of various difficulty levels, after a quick tutorial showing how easy it is to contribute. The easiest way is to fork the project from its github page onto your own github account (you can create one for free), so that your patch submission later becomes a simple "Pull Request" (PR). Clone your scipy fork locally and create a new branch (git checkout -b <newbranch>) to tackle one specific issue. Once your patch is ready, commit it locally, push it to your github repository and, from the github interface, open a Pull Request. You will be able to polish your message before the PR is sent and reviewed by the project's lead developers. For example, using "gh-XXXX" in your commit message will automatically add a link to issue no. XXXX. Here is the list of open issues for scipy; you can filter them, e.g. displaying only the ones considered easy to fix :D
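
For reference, the whole workflow described above might look like this in the shell (<you> and <newbranch> are placeholders):

$ git clone https://github.com/<you>/scipy.git   # clone your fork locally
$ cd scipy
$ git checkout -b <newbranch>                    # one branch per issue
$ # hack, run the tests, then:
$ git commit -am "BUG: fix ... (see gh-XXXX)"    # gh-XXXX links to the issue
$ git push origin <newbranch>                    # then open the Pull Request on github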

For more information: Contributing to SciPy.


Profiling tools

2012/09/07 by Alain Leufroy

Python

Run time profiling with cProfile

Python is distributed with profiling modules. They describe the run time behaviour of a pure python program, providing a variety of statistics.

The cProfile module is the recommended one. To execute your program under the control of the cProfile module, the simplest form is:

$ python -m cProfile -s cumulative mypythonscript.py

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      16    0.055    0.003   15.801    0.988 __init__.py:1(<module>)
       1    0.000    0.000   11.113   11.113 __init__.py:35(extract)
     135    7.351    0.054   11.078    0.082 __init__.py:25(iter_extract)
10350736    3.628    0.000    3.628    0.000 {method 'startswith' of 'str' objects}
       1    0.000    0.000    2.422    2.422 pyplot.py:123(show)
       1    0.000    0.000    2.422    2.422 backend_bases.py:69(__call__)
       ...

Each row gives timing information about one function: number of calls, total time spent in the function itself, cumulative time including sub-calls, etc. The -s cumulative option sorts the results by descending cumulative time.

Note:

You can profile a particular python function, such as main():

>>> import profile
>>> profile.run('main()')
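
The cProfile module offers the same interface; its run() function additionally accepts an output file name and a sort key, as in this minimal sketch:

>>> import cProfile
>>> cProfile.run('main()', sort='cumulative')  # print results by descending cumulative time
>>> cProfile.run('main()', 'main.pstats')      # or save them to a file for later analysis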

Graphical tools to show profiling results

Even though reporting tools are included in cProfile, it can be interesting to use graphical tools. Most of them work with a stats file that cProfile can generate with the -o filepath option.
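
Such a file can also be inspected without any graphical tool, using the standard pstats module; for example:

>>> import pstats
>>> stats = pstats.Stats('output.pstats')
>>> stats.strip_dirs().sort_stats('cumulative').print_stats(10)  # the 10 biggest entries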

Below are some of the available graphical tools that we tested.

Gprof2Dot

is a python based tool that turns profiling output into a picture of the call tree graph (using graphviz). A typical profiling session with python looks like this:

$ python -m cProfile -o output.pstats mypythonscript.py
$ gprof2dot.py -f pstats output.pstats | dot -Tpng -o profiling_results.png
http://wiki.jrfonseca.googlecode.com/git/gprof2dot.png

Each node of the output graph represents a function and has the following layout:

+----------------------------------+
|   function name : module name    |
| total time including sub-calls % |  total time including sub-calls %
|    (self execution time %)       |------------------------------------>
|  total number of self calls      |
+----------------------------------+

Nodes and edges are colored according to the "total time" spent in the functions.

Note: the following small patch makes the node color correspond to the execution time and the edge color to the "total time":
diff -r da2b31597c5f gprof2dot.py
--- a/gprof2dot.py      Fri Aug 31 16:38:37 2012 +0200
+++ b/gprof2dot.py      Fri Aug 31 16:40:56 2012 +0200
@@ -2628,6 +2628,7 @@
                 weight = function.weight
             else:
                 weight = 0.0
+            weight = function[TIME_RATIO]

             label = '\n'.join(labels)
             self.node(function.id,

PyProf2CallTree

is a script that helps visualize profiling data with the KCacheGrind graphical calltree analyzer. It is a more interactive solution than Gprof2Dot, but it requires KCacheGrind to be installed. Typical usage:

$ python -m cProfile -o stat.prof mypythonscript.py
$ python pyprof2calltree.py -i stat.prof -k

The profiling data file is converted by the pyprof2calltree module; its -k switch automatically opens the result in KCacheGrind.

http://kcachegrind.sourceforge.net/html/pics/KcgShot3Large.gif

There are other tools that are worth testing:

  • RunSnakeRun is an interactive GUI tool that visualizes a profile file using square maps:

    $ python -m cProfile -o stat.prof mypythonscript.py
    $ runsnake stat.prof
    
  • pycallgraph generates PNG images of a call tree with the total number of calls:

    $ pycallgraph mypythonscript.py
    
  • lsprofcalltree also uses KCacheGrind to display profiling data:

    $ python lsprofcalltree.py -o output.log yourprogram.py
    $ kcachegrind output.log
    

C/C++ extension profiling

For optimization purposes one may write python extensions in C/C++. For such modules, cProfile will not dig into the corresponding call tree. Dedicated tools, which for the most part are not specific to Python, must be used to profile a C++ extension from python.

Yep

is a python module dedicated to profiling compiled python extensions. It uses the google CPU profiler:

$ python -m yep --callgrind mypythonscript.py

Memory Profiler

You may want to control the amount of memory used by a python program. There is an interesting module that fits this need: memory_profiler

You can fetch the memory consumption of a program over time using:

>>> from memory_profiler import memory_usage
>>> memory_usage((main, (), {}))

memory_profiler can also spot the lines of a program that consume the most memory, and it integrates with pdb and IPython.
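
For a line-by-line report, decorate the functions you want to inspect with memory_profiler's profile decorator and run the script as usual; a minimal sketch:

from memory_profiler import profile

@profile
def main():
    big = [0] * 10 ** 6   # the memory increment of each line is reported
    del big

if __name__ == '__main__':
    main()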

General purpose Profiling

The Linux perf tool gives access to a wide variety of performance counter subsystems. Using perf, any execution configuration (pure python programs, compiled extensions, subprocess, etc.) may be profiled.

Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.

You can get information about execution times with:

$ perf stat -e cpu-cycles,cpu-clock,task-clock python mypythonscript.py

You can get RAM access information using:

$ perf stat -e cache-misses python mypythonscript.py

Be careful: perf gives the raw values of the hardware counters, so you need to know exactly what you are looking for and how to interpret these values in the context of your program.
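
For instance, a raw cache-miss count means little on its own; measuring cache-references alongside it yields a miss rate that is easier to interpret:

$ perf stat -e cache-references,cache-misses python mypythonscript.py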

Note that you can use Gprof2Dot to get a more user-friendly output:

$ perf record -g python mypythonscript.py
$ perf script | gprof2dot.py -f perf | dot -Tpng -o output.png

Text mode makes it into hgview 1.4.0

2011/10/06 by Alain Leufroy

At last, here is the release of version 1.4.0 of hgview.

http://www.logilab.org/image/77974?vid=download

Small description

Besides the classic bugfixes, this release introduces a new text-based user interface, thanks to the urwid library.

Running hgview in a shell, in a terminal, or over an ssh session is now possible! If you are trying not to use X (or to use it less), or have a geeky mouse-killer window manager such as wmii/dwm/ion/awesome/..., this is for you!

This TUI (Text User Interface!) adopts the principal features of the Qt4-based GUI, although only the main view has been implemented for now.

In a nutshell, this interface includes the following features:

  • display the revision graph (with the working directory as a node, and basic support for the mq extension),
  • display the files affected by a selected changeset (with basic support for the bfiles extension),
  • display diffs (with syntax highlighting thanks to pygments),
  • automatically refresh the displayed revision graph when the repository is modified (requires pyinotify),
  • easy key-based navigation in a repository's revision history (same as the GUI),
  • a command system for special actions (see help).

Installation

There are packages for debian and ubuntu in Logilab's debian repository.

Note: you have to install the hgview-curses package to get the text based interface.

Or you can simply clone our Mercurial repository:

hg clone http://hg.logilab.org/hgview

(more on the hgview home page)

Running the text based interface

A new --interface option is now available to choose the interface:

hgview --interface curses

Or you can set it in the [hgview] section of your ~/.hgrc:

[hgview]
interface = curses # or qt or raw

Then run:

hgview

What's next

We'll be working on including other features from the Qt4 interface and making it fully configurable.

We'll also work on bugfixes and new features, so stay tuned! And feel free to file bugs and feature requests.


EuroSciPy'11 - Annual European Conference for Scientists using Python.

2011/08/24 by Alain Leufroy
http://www.logilab.org/image/9852?vid=download

The EuroScipy2011 conference will be held in Paris at the Ecole Normale Supérieure from August 25th to 28th and is co-organized and sponsored by INRIA, Logilab and other companies.

The conference is a cross-disciplinary gathering focused on the use and development of the Python language in scientific research.

August 25th and 26th are dedicated to tutorial tracks -- basic and advanced tutorials. August 27th and 28th are dedicated to talks, posters and demo sessions.

Damien Garaud, Vincent Michel and Alain Leufroy (and others) from Logilab will be there. We will talk about an RSS feed aggregator based on Scikits.learn and CubicWeb, and we will have a poster about LibAster (a python library for thermomechanical simulation based on Code_Aster).


Distutils2 Sprint at Logilab (first day)

2011/01/28 by Alain Leufroy

We're very happy to host the Distutils2 sprint this week in Paris.

The sprint started yesterday with some of Logilab's developers and other contributors. We'll sprint for four days, trying to push forward the new python package manager.

Let's summarize this first day:

  • Boris Feld and Pierre-Yves David worked on the new system for detecting and dispatching data-files.
  • Julien Miotte worked on
    • moving qGitFilterBranch from setuptools to distutils2
    • testing distutils2 installation and register (see the tutorial)
    • backward compatibility with distutils in setup.py, using setup.cfg to fill in the arguments of setup() and help users switch to distutils2 (a rough sketch of such a setup.cfg follows this list).
  • André Espaze and Alain Leufroy worked on the python script that helps developers build a setup.cfg by recycling their existing setup.py (track).
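
As a rough idea of the target format, a minimal setup.cfg might look like the sketch below (field names as we recall them from the distutils2 documentation of the time; mypackage is a placeholder):

[metadata]
name = mypackage
version = 0.1
summary = A one-line description of the package
description-file = README.txt

[files]
packages = mypackage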

Join us on IRC at #distutils on irc.freenode.net!


Virtualenv - Play safely with a Python

2010/03/26 by Alain Leufroy
http://farm5.static.flickr.com/4031/4255910934_80090f65d7.jpg

virtualenv, pip and Distribute are three tools that help developers and packagers. In this short presentation we will see some of virtualenv's capabilities.

Please keep in mind that everything below was done using Debian Lenny, python 2.5 and virtualenv 1.4.5.

Abstract

virtualenv builds python sandboxes where it is possible to do whatever you want as a simple user, without putting your global environment in jeopardy.

virtualenv allows you to safely:

  • install any python package
  • add debug lines everywhere (not only in your scripts)
  • switch between python versions
  • try your code as if you were the final user
  • and so on ...

Install and usage

Install

Preferred way

Just download the virtualenv python script at http://bitbucket.org/ianb/virtualenv/raw/tip/virtualenv.py and run it with python (e.g. python virtualenv.py).

For convenience, we will refer to this script as virtualenv.

Other ways

For Debian (and ubuntu) addicts, just do:

$ sudo aptitude install python-virtualenv

Fedora users would do:

$ sudo yum install python-virtualenv

And others can install from PyPI (as superuser):

$ pip install virtualenv

or

$ easy_install pip && pip install virtualenv

You could also get the source here.

Quick Guide

To work in a python sandbox, proceed as follows:

$ virtualenv my_py_env
$ source my_py_env/bin/activate
(my_py_env)$ python

"That's all Folks !"

Once you have finished just do:

(my_py_env)$ deactivate

or quit the tty.

What does virtualenv actually do?

At creation time

Let's start again ... more slowly. Consider the following environment:

$ pwd
/home/you/some/where
$ ls

Now create a sandbox called my-sandbox:

$ virtualenv my-sandbox
New python executable in "my-sandbox/bin/python"
Installing setuptools............done.

The output says that you have a new python executable and specific install tools. Your current directory now looks like this:

$ ls -Cl
my-sandbox/ README
$ tree -L 3 my-sandbox
my-sandbox/
|-- bin
|   |-- activate
|   |-- activate_this.py
|   |-- easy_install
|   |-- easy_install-2.5
|   |-- pip
|   `-- python
|-- include
|   `-- python2.5 -> /usr/include/python2.5
`-- lib
    `-- python2.5
        |-- ...
        |-- orig-prefix.txt
        |-- os.py -> /usr/lib/python2.5/os.py
        |-- re.py -> /usr/lib/python2.5/re.py
        |-- ...
        |-- site-packages
        |   |-- easy-install.pth
        |   |-- pip-0.6.3-py2.5.egg
        |   |-- setuptools-0.6c11-py2.5.egg
        |   `-- setuptools.pth
        |-- ...

In addition to the new python executable and the install tools, you get a whole new python environment containing libraries, a site-packages/ directory (where your packages will be installed), a bin directory, ...

Note:
virtualenv does not create every file needed for a whole new python environment. Instead, it links to the files of the global environment in order to save disk space and speed up sandbox creation. Therefore, a working python environment must already be installed on your system.

At activation time

At this point you have to activate the sandbox in order to use your custom python. Once it is activated, python still has access to the global environment but will look in your sandbox first for python modules:

$ source my-sandbox/bin/activate
(my-sandbox)$ which python
/home/you/some/where/my-sandbox/bin/python
(my-sandbox)$ echo $PATH
/home/you/some/where/my-sandbox/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
(my-sandbox)$ python -c 'import sys;print sys.prefix;'
/home/you/some/where/my-sandbox
(my-sandbox)$ python -c 'import sys;print "\n".join(sys.path)'
/home/you/some/where/my-sandbox/lib/python2.5/site-packages/setuptools-0.6c8-py2.5.egg
[...]
/home/you/some/where/my-sandbox
/home/you/personal/PYTHONPATH
/home/you/some/where/my-sandbox/lib/python2.5/
[...]
/usr/lib/python2.5
[...]
/home/you/some/where/my-sandbox/lib/python2.5/site-packages
[...]
/usr/local/lib/python2.5/site-packages
/usr/lib/python2.5/site-packages
[...]

First of all, a (my-sandbox) prefix is automatically added to your prompt, making it clear that you're using a python sandbox environment.

Secondly, my-sandbox/bin/ is prepended to your PATH, so running python calls the specific python executable located in my-sandbox/bin.

Note:
It is possible to improve the sandbox isolation by ignoring the global paths and your PYTHONPATH (see the Improve isolation section).

Installing packages

It is possible to install any package in the sandbox without any superuser privilege. For instance, we will install the development revision of pylint in the sandbox.

Suppose that you have the pylint stable version already installed in your global environment:

(my-sandbox)$ deactivate
$ python -c 'from pylint.__pkginfo__ import version;print version'
0.18.0

Once your sandbox is activated, install the development revision of pylint as an update:

$ source /home/you/some/where/my-sandbox/bin/activate
(my-sandbox)$ pip install -U hg+http://www.logilab.org/hg/pylint#egg=pylint-0.19

The new package and its dependencies are only installed in the sandbox:

(my-sandbox)$ python -c 'import pylint.__pkginfo__ as p;print p.version, p.__file__'
0.19.0 /home/you/some/where/my-sandbox/lib/python2.6/site-packages/pylint/__pkginfo__.pyc
(my-sandbox)$ deactivate
$ python -c 'import pylint.__pkginfo__ as p;print p.version, p.__file__'
0.18.0 /usr/lib/pymodules/python2.6/pylint/__pkginfo__.pyc

You can safely make any change to the new pylint code or to other sandboxed packages, because your global environment is left unchanged.

Useful options

Improve isolation

As said before, your sandboxed python's sys.path still references the global system paths. You can however hide them by:

  • either using the --no-site-packages option, which denies the sandbox access to the global site-packages directory,
  • or changing your PYTHONPATH in my-sandbox/bin/activate in the same way as for PATH (see tips):
$ virtualenv --no-site-packages closedPy
$ sed -i '9i PYTHONPATH="$_OLD_PYTHON_PATH"
      9i export PYTHONPATH
      9i unset _OLD_PYTHON_PATH
      40i _OLD_PYTHON_PATH="$PYTHONPATH"
      40i PYTHONPATH="."
      40i export PYTHONPATH' closedPy/bin/activate
$ source closedPy/bin/activate
(closedPy)$ python -c 'import sys; print "\n".join(sys.path)'
/home/you/some/where/closedPy/lib/python2.5/site-packages/setuptools-0.6c8-py2.5.egg
/home/you/some/where/closedPy
/home/you/some/where/closedPy/lib/python2.5
/home/you/some/where/closedPy/lib/python2.5/plat-linux2
/home/you/some/where/closedPy/lib/python2.5/lib-tk
/home/you/some/where/closedPy/lib/python2.5/lib-dynload
/usr/lib/python2.5
/usr/lib64/python2.5
/usr/lib/python2.5/lib-tk
/home/you/some/where/closedPy/lib/python2.5/site-packages
$ deactivate

This way, you'll get an even more isolated sandbox, just as with a brand new python environment.

Work with different versions of Python

It is possible to dedicate a sandbox to a particular version of python by using the --python=PYTHON_EXE option, which specifies the interpreter to use in the sandbox (the default is the interpreter virtualenv was installed with, e.g. /usr/bin/python):

$ virtualenv --python=python2.4 pyver24
$ source pyver24/bin/activate
(pyver24)$ python -V
Python 2.4.6
$ deactivate
$ virtualenv --python=python2.5 pyver25
$ source pyver25/bin/activate
(pyver25)$ python -V
Python 2.5.2
$ deactivate

Distribute a sandbox

To distribute your sandbox, use the --relocatable option, which makes an existing sandbox relocatable: it fixes up the scripts and makes all .pth files relative. This option should be run just before you distribute the sandbox (and again each time you change something in it).
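
Concretely, rerun virtualenv on the existing sandbox with this flag just before shipping it:

$ virtualenv --relocatable my-sandbox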

An important point is that the host system should be similar to your own.

Tips

Speed up sandbox manipulation

Add these functions to your .bashrc in order to help you use virtualenv and to automate the creation and activation processes.

rel2abs() {
#from http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2005-01/0206.html
  [ "$#" -eq 1 ] || return 1
  ls -Ld -- "$1" > /dev/null || return
  dir=$(dirname -- "$1" && echo .) || return
  dir=$(cd -P -- "${dir%??}" && pwd -P && echo .) || return
  dir=${dir%??}
  file=$(basename -- "$1" && echo .) || return
  file=${file%??}
  case $dir in
    /) printf '%s\n' "/$file";;
    /*) printf '%s\n' "$dir/$file";;
    *) return 1;;
  esac
  return 0
}
function activate(){
    if [[ "$1" == "--help" ]]; then
        echo -e "usage: activate PATH\n"
        echo -e "Activate the sandbox where PATH points inside of.\n"
        return
    fi
    if [[ "$1" == '' ]]; then
        local target=$(pwd)
    else
        local target=$(rel2abs "$1")
    fi
    until  [[ "$target" == '/' ]]; do
        if test -e "$target/bin/activate"; then
            source "$target/bin/activate"
            echo "$target sandbox activated"
            return
        fi
        target=$(dirname "$target")
    done
    echo 'no sandbox found'
}
function mksandbox(){
    if [[ "$1" == "--help" ]]; then
        echo -e "usage: mksandbox NAME\n"
        echo -e "Create and activate a highly isaolated sandbox named NAME.\n"
        return
    fi
    local name='sandbox'
    if [[ "$1" != "" ]]; then
        name="$1"
    fi
    if [[ -e "$1/bin/activate" ]]; then
        echo "$1 is already a sandbox"
        return
    fi
    virtualenv --no-site-packages --clear --distribute "$name"
    sed -i '9i PYTHONPATH="$_OLD_PYTHON_PATH"
            9i export PYTHONPATH
            9i unset _OLD_PYTHON_PATH
           40i _OLD_PYTHON_PATH="$PYTHONPATH"
           40i PYTHONPATH="."
           40i export PYTHONPATH' "$name/bin/activate"
    activate "$name"
}
Note:
The virtualenv-commands and virtualenvwrapper projects add some very interesting features to virtualenv, so keep an eye on them for more advanced features than the ones above.

Conclusion

I find virtualenv irreplaceable for testing new configurations or working on projects with different dependencies. Moreover, I use it to learn about other python projects, to see how exactly my project interacts with its dependencies (while debugging), or to test the final user experience.

All of this can be done without virtualenv, but not in such an easy and secure way.

I will continue this series by introducing other useful projects that enhance your productivity: pip and Distribute. See you soon.