We have been using many different tools for doing statistical analysis with Python, including R, SciPy, specific C++ code, etc. It looks like the growing SciPy audience is now moving toward dedicated add-on modules for SciPy (let's call them SciKits). See this thread on the SciPy-user mailing list.
The EuroSciPy 2009 conference was held in Leipzig at the end of July and was sponsored by Logilab and other companies. It started with three talks about speed.
In his keynote, Francesc Alted talked about starving CPUs. Thirty years ago, memory and CPU frequencies were about the same. Memory speed kept up with the evolution of CPU speed for about ten years before falling behind. Nowadays, memory is about a hundred times slower than the cache, which is itself about twenty times slower than the CPU. The direct consequence is that CPUs are starving and spend many clock cycles waiting for data to process.
In order to improve the performance of programs, it is now necessary to know about the multiple layers of computer memory, from disk storage to the CPU. The common architecture will soon count six levels: mechanical disk, solid-state disk, RAM, level 3 cache, level 2 cache and level 1 cache.
Using optimized array operations, taking striding into account, processing data in blocks of the right size and using compression to reduce the amount of data transferred from one layer to the next are four techniques that go a long way on the road to high performance. Compression algorithms like Blosc increase throughput because they strike the right balance between speed and compression ratio. Blosc compression will soon be available in PyTables.
Francesc also mentioned the numexpr extension to numpy, and its combination with PyTables named tables.Expr, which nicely and easily accelerates the computation of some expressions involving numpy arrays. In his list of references, Francesc cites Ulrich Drepper's article What every programmer should know about memory.
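To give an idea of what this looks like in practice, here is a minimal sketch (the arrays and the expression are my own illustration, not taken from the talk):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Plain numpy: each intermediate operation allocates a temporary array.
result_np = 2 * a + 3 * b ** 2

# numexpr compiles the whole expression and evaluates it block by block,
# in cache-friendly chunks, without large temporaries.
result_ne = ne.evaluate("2 * a + 3 * b ** 2")

assert np.allclose(result_np, result_ne)
```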
Maciej Fijalkowski started his talk with a general presentation of the PyPy framework. One uses PyPy to describe an interpreter in RPython, then generates the actual interpreter code and its JIT.
Since PyPy has become more of a framework for writing interpreters than a reimplementation of Python in Python, I suggested changing its misleading name to something like GcGc, the Generic Compiler for Generating Compilers. Maciej answered that there are discussions on the mailing list about splitting the project in two and making the implementation of the Python interpreter distinct from the GcGc framework.
Maciej then focused his talk on his recent effort to rewrite in RPython the part of numpy that exposes the underlying C library to Python. He said the benefits of using PyPy's JIT to speed up that wrapping layer are already visible. Details are available on the PyPy blog. Gaël Varoquaux added that David Cournapeau has started working on making the C/Python split in numpy cleaner, which would further ease the job of rewriting it in RPython.
The SciPy2008 conference had at least two papers about speeding up Python: Converting Python Functions to Dynamically Compiled C and unPython: Converting Python Numerical Programs into C.
David Beazley gave a very interesting talk in 2009 at a Chicago Python Users group meeting about the effects of the GIL on multicore machines.
I will continue my report on the conference with the second part titled "Applications And Open Questions".
I spent some time this week evaluating Pupynere, the PUre PYthon NEtcdf REader written by Roberto De Almeida. I see several advantages in pupynere.
First it's a pure Python module with no external dependency. It doesn't even depend on the NetCDF lib and it is therefore very easy to deploy.
Second, it offers the same interface as Scientific Python's NetCDF bindings, which makes transitioning from one module to the other very easy (a short reading sketch follows this list of advantages).
Third, pupynere is being integrated into SciPy as the scipy.io.netcdf module. Once integrated, this could ensure wide adoption by the Python community.
Finally, it is easy to dig into this clear and small code base of about 600 lines. I have just sent several fixes and bug reports to the author.
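To give an idea of that interface, here is a minimal reading sketch; the file and variable names are hypothetical, and I assume the NetCDFFile entry point that mirrors Scientific Python's bindings:

```python
from pupynere import NetCDFFile

# Open an existing NetCDF file; names below are made up, for illustration only.
f = NetCDFFile("measurements.nc")

print(f.dimensions)                  # e.g. {'time': 24, 'station': 10}
temp = f.variables["temperature"]    # pick a variable by name
print(temp.units)                    # NetCDF attributes appear as Python attributes
data = temp[:]                       # read the values as a numpy array
f.close()
```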
However, pupynere isn't mature yet. It seems to have been used only for simple cases so far, and many common cases are broken. Moreover, there is no support for newer NetCDF formats such as long-NetCDF and NetCDF4, and important features such as file update are still missing. In addition, the lack of a test suite is a serious issue. In my opinion, various bugs could already have been detected and fixed with simple unit tests, and contributing would be much more comfortable with the safety net a test suite offers. I am not certain that the fixes and improvements I made this week did not introduce regressions.
To conclude, pupynere seems too young for production use. But I invite people to try it and provide feedback and fixes to the author. I'm looking forward to using this project in production in the future.
The EuroSciPy 2010 conference will be held in Paris from July 8th to 11th at the Ecole Normale Supérieure. Two days of tutorials, two days of conference, two interesting keynotes, a lightning talk session, an open space for collaboration and sprinting, thirty quality talks in the schedule and already 100 registered delegates.
If you are doing science and using Python, you want to be there!
The EuroSciPy 2010 conference was held in Paris at the Ecole Normale Supérieure from July 8th to 11th and was organized and sponsored by Logilab and other companies.
The first two days were dedicated to tutorials and I had the chance to talk about SciPy with André Espaze, Gaël Varoquaux and Emanuelle Gouillart in the introductory track. This was nice but it was a bit tricky to present SciPy in such a short time while trying to illustrate the material with real and interesting examples. One very nice thing for the introductory track is that all the material was contributed by different speakers and is freely available in a github repository (licensed under CC BY).
The next two days were dedicated to scientific presentations and to discussing why Python is such a great tool for developing scientific software and carrying out research.
I had a great time listening to the presentations, starting with the two very nice keynotes given by Hans Petter Langtangen and Konrad Hinsen. The latter gave us a very nice summary of what happened in the scientific Python world during the past 15 years, what is happening now, and of course what could happen during the next 15 years. Using a crystal ball and a very humorous tone, he made it very clear that the challenge of the coming years will be how to use our hundreds, thousands or even more cores in a bug-free and efficient way. Functional programming may be a very good answer to this challenge, as it provides a deterministic way of parallelizing programs. Konrad also provided some hints about future versions of Python that could offer deeper and more efficient support for functional programming, and maybe the addition of an 'async' keyword to handle the computation of a function on another core.
In fact, PEP 3148, entitled "Futures - execute computations asynchronously", was accepted just two days ago. This PEP describes a new package called "futures" designed to facilitate the evaluation of callables using threads and processes in future versions of Python. A full implementation is already available.
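As a minimal sketch of the programming model (using the package as it now ships in the standard library as concurrent.futures; the simulate function below is just a made-up stand-in for real work):

```python
from concurrent import futures

def simulate(seed):
    # Stand-in for an expensive, independent computation.
    return sum(i * seed for i in range(100000))

if __name__ == "__main__":
    # Evaluate the callable on several inputs in a pool of worker
    # processes; map() returns the results in submission order.
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(simulate, range(8)))
    print(results)
```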
Parallelization was indeed a very popular issue across presentations, and several solutions for embarrassingly parallel problems were presented.
Distributing the computation of a function over two machines with playdoh is as simple as:
import playdoh

result1, result2 = playdoh.map(fun, [arg1, arg2],
                               _machines=['machine1.network.com',
                                          'machine2.network.com'])
Theano: allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. In particular, it can use the GPU transparently and generate optimized C code (see the Theano presentation).
joblib: provides, among other things, helpers for embarrassingly parallel problems. It is built on top of the multiprocessing package introduced in Python 2.6 and brings more readable code and easier debugging, as the sketch below illustrates.
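A minimal sketch of the joblib style (essentially the canonical example from its documentation, not code shown at the conference):

```python
from math import sqrt
from joblib import Parallel, delayed

# Run sqrt over ten inputs with two worker processes; the code stays close
# to the sequential list comprehension it replaces.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)   # [0.0, 1.0, 2.0, ..., 9.0]
```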
Concerning speed, Francesc Alted showed us interesting tools for memory optimization currently used successfully in PyTables 2.2. You can read more details on this kind of optimization in EuroSciPy'09 (part 1/2): The Need For Speed.
The Logilab team attended (and co-organized) EuroSciPy 2011, at the end of August in Paris.
In order to debunk the idea that "all computation libraries dedicated to financial applications must be written in C/C++ or some other compiled programming language", I would like to introduce a more Pythonic way.
You may know that financial applications such as risk management have, in most cases, high computational needs. For instance, it can be necessary to run a large number of Monte Carlo simulations to evaluate an American option within a few seconds.
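As a toy illustration of this kind of workload, and a hint of why Python with NumPy (one of the libraries discussed below) can already go a long way, here is a vectorized Monte Carlo sketch; note that it prices a simpler European call rather than an American option, and all parameter values are made up:

```python
import numpy as np

# Toy vectorized Monte Carlo pricing of a *European* call under
# Black-Scholes dynamics; an American option needs extra machinery
# (e.g. Longstaff-Schwartz). All parameter values are made up.
s0, strike, rate, sigma, maturity = 100.0, 105.0, 0.05, 0.2, 1.0
n_paths = 1000000

np.random.seed(0)
z = np.random.standard_normal(n_paths)
s_t = s0 * np.exp((rate - 0.5 * sigma ** 2) * maturity
                  + sigma * np.sqrt(maturity) * z)
payoff = np.maximum(s_t - strike, 0.0)
price = np.exp(-rate * maturity) * payoff.mean()
print(price)   # around 8 with these (made-up) parameters
```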
The Python community provides several reliable and efficient libraries and packages dedicated to numerical computations:
- the well-known SciPy and NumPy libraries. They provide a complete set of tools to work with matrices: linear algebra operations, singular value decompositions, multivariate regression models, ...;
- scikits, a set of add-on toolkits for SciPy. For instance, there are statistical models in the statsmodels package, a toolkit dedicated to time series manipulation and another one dedicated to numerical optimization;
- pandas, a recent Python package which provides "fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive". pandas uses Cython to improve its performance. Moreover, pandas has been used extensively in production in financial applications;
- Cython, a way to write C extensions for the Python language. Since you write Cython code much as you write Python code, it is easy to use. Is it fast? Yes! I compared a simple example from Cython's official documentation with a pure-Python version -- a piece of code which computes the first k prime numbers. The Cython code is almost thirty times faster than the pure-Python code (an unofficial benchmark). Furthermore, you can use NumPy in Cython code!
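For reference, here is a pure-Python reconstruction of that kind of prime-computing function (my own sketch, not the exact benchmarked code); the Cython variant is essentially the same loop with C type declarations (cdef int ...) added:

```python
def primes(kmax):
    # Return the first kmax prime numbers (pure-Python version).
    found = []
    n = 2
    while len(found) < kmax:
        # n is prime if no previously found prime divides it.
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

print(primes(10))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```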
I believe that, thanks to several useful tools and libraries, Python can be used for numerical computation, even in finance (both research and production). It is easy to maintain without sacrificing performance.
Note that you can find some other references on the Visixion webpages.
As usual, the first two days were dedicated to tutorials while the last two were dedicated to scientific presentations and general Python-related talks. The meeting was extended by one more day for sprint sessions, during which enthusiasts were able to help free software projects, namely sage, vispy and scipy.
Jérôme and I had the great opportunity to represent Logilab during the scientific tracks and the sprint day. We enjoyed many talks about scientific applications using Python. We're not going to describe the whole conference; visit the conference website if you want the complete list of talks. In this article we will try to focus on the ones we found the most interesting.
First of all, the keynote by Cameron Neylon about Network ready research was very interesting. He presented some graphs about the impact of group work on solving complex problems. They revealed that there is a critical network size beyond which the effectiveness at solving a problem drastically increases. He pointed out that the "friction" of getting access to source code limits the "getting help" variable; open-sourcing software could be the best way to reduce this friction, while unit testing and continuous integration are facilitators. In general, process reproducibility is very important, not only in computing research: retrieving experimental settings, metadata and the process environment is vital. We agree with this, as we experience it every day in our work. That is why we encourage open source licenses and develop Simulagora, a collaborative platform for traceable and reproducible distributed simulation (in French).
Ian Ozsvald's talk presented key points and tips from his own experience of growing a business based on open source and Python, as well as mistakes to avoid (e.g. not checking beforehand that there are paying customers interested in what you want to develop). His talk was comprehensive and covered a wide range of situations.
We got a very nice presentation of a young but interesting visualization tool: Vispy. It is six months old and its first public release was in early August. It is the result of the merge of four separate libraries, oriented toward interactive visualisation (vs. static figure generation in Matplotlib) and using OpenGL on GPUs to avoid overloading the CPU. A demonstration with large datasets showed Vispy displaying millions of points in real time at 40 frames per second. During the talk we also got interesting information about OpenGL rendering features, compared to Matplotlib's Agg (anti-grain) backend which runs on the CPU.
We also got to learn about cartopy, an open source Python library originally written for weather and climate science. It provides a useful and simple API to manipulate cartographic projections and draw maps.
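A minimal sketch of that API, based on cartopy's introductory examples rather than on the talk itself:

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Draw a world map with coastlines in the Plate Carree projection.
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
plt.show()
```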
Distributed computing was a hot topic, and many talks were related to this theme.
Gaël Varoquaux reminded us of the key problems with "biggish data" and the key points to process it successfully. I think some of his recommendations are useful in general, like "choose simple solutions", "fail gracefully" and "make it easy to debug". For big data processing, when I/O is the limiting constraint, first try to split the problem into random fractions of the data, then run the algorithms and aggregate the results to circumvent this limit. He also presented mini-batch processing, which works on a bunch of observations at a time (a trade-off between memory usage and vectorization), and joblib.Parallel, which makes I/O faster by using compression (CPUs are faster than disk access).
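As a minimal sketch of the mini-batch idea (my own choice of illustration using scikit-learn's MiniBatchKMeans, not necessarily the speaker's example):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Cluster data that would not fit in memory by feeding it to the
# estimator one chunk (mini-batch) at a time.
model = MiniBatchKMeans(n_clusters=3, random_state=0)

rng = np.random.RandomState(0)
for _ in range(100):               # pretend each chunk is read from disk
    chunk = rng.randn(1000, 5)     # 1000 observations, 5 features
    model.partial_fit(chunk)

print(model.cluster_centers_.shape)   # (3, 5)
```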
Benoit Da Mota talked about shared memory in parallel computing, and Antonio Messina gave us a quick overview of how to build a computing cluster with Elasticluster, using OpenStack, Slurm and Ansible. He demonstrated starting and stopping a cluster on OpenStack: once all the VMs are started, Ansible configures them as hosts of the cluster, and new VMs can be created and added to the cluster on the fly thanks to a command line interface.
We also got a keynote by Peter Wang (from Continuum Analytics) about the future of data analysis with Python. As a PhD in physics, I loved his metaphor of giving mass to data. He tried to explain the pain that scientists have when using databases.
After the conference we participated in the numpy/scipy sprint, organized by Ralf Gommers and Pauli Virtanen. There were 18 people trying to close issues of various difficulty levels, and we got a quick tutorial on how easy it is to contribute: the easiest way is to fork the project from its github page onto your own github account (you can create one for free), so that your patch submission will later be a simple "Pull Request" (PR). Clone your scipy fork locally and create a new branch (git checkout -b <newbranch>) to tackle one specific issue. Once your patch is ready, commit it locally, push it to your github repository and, from the github interface, open a "Pull Request". You will be able to add something to your commit message before your PR is sent and reviewed by the project's lead developers. For example, using "gh-XXXX" in your commit message will automatically add a link to issue no. XXXX. Here is the list of open issues for scipy; you can filter them, e.g. displaying only the ones considered easy to fix :D
For more information: Contributing to SciPy.