The EuroSciPy 2013
conference was held in Brussels at the Université libre de Bruxelles.
As usual, the first two days were dedicated to tutorials and the last two
to scientific presentations and general Python-related talks. The
meeting was extended by one more day of sprint sessions, during which
enthusiasts could help free software projects, namely Sage, Vispy and
SciPy.
Jérôme and I had the great opportunity to represent Logilab during the scientific tracks and the sprint day. We
enjoyed many talks about scientific applications of Python. We will not
describe the whole conference here; visit the conference website for the complete list of talks. In this
article we focus on the ones we found most interesting.
First of all, the keynote by Cameron Neylon about
Network ready research was very
interesting. He presented graphs about the impact of group work on
solving complex problems. They revealed that there is a critical network size
beyond which the effectiveness of problem solving increases drastically. He
pointed out that the "friction" of accessing source code limits the "getting help"
variable. Open sourcing software may be the best way to reduce this "friction",
while unit testing and continuous integration act as facilitators. More generally,
process reproducibility is very important, and not only in computational
research: being able to retrieve experimental settings, metadata and the process environment is
vital. We agree with this, as we experience it every day in our work. That
is why we encourage open source licenses and develop
Simulagora (in French), a collaborative platform for distributed simulation with traceability and reproducibility.
Ian Ozsvald's talk
dealt with key points and tips from his own experience of growing a business based
on open source and Python, as well as mistakes to avoid (e.g. not checking
beforehand that there are paying customers interested in what you want to
develop). His talk was comprehensive and covered a wide range of situations.
We got a very nice presentation of a young but interesting visualization tool:
Vispy. It is 6 months old and
its first public release came out in early August. It is the result of a merge of 4
separate libraries, oriented toward interactive visualisation (as opposed to static
figure generation with Matplotlib) and using OpenGL on the GPU to avoid overloading the
CPU. A demonstration with large datasets showed Vispy displaying millions
of points in real time at 40 frames per second. During the talk we also got
interesting information about OpenGL features such as GPU-based anti-grain rendering, compared to
Matplotlib's CPU-based Agg backend.
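As a rough idea of what that kind of GPU scatter rendering looks like, here is a minimal sketch using Vispy's scene API (the names are assumptions based on today's Vispy, not the code from the demo, since the library was still very young at the time):

    import numpy as np
    from vispy import app, scene

    # An OpenGL canvas with an interactive pan/zoom camera.
    canvas = scene.SceneCanvas(keys='interactive', show=True)
    view = canvas.central_widget.add_view()
    view.camera = 'panzoom'

    # One million random 2D points, drawn as a single Markers visual on the GPU.
    pos = np.random.normal(size=(1000000, 2)).astype(np.float32)
    scatter = scene.visuals.Markers()
    scatter.set_data(pos, size=1, edge_width=0)
    view.add(scatter)

    if __name__ == '__main__':
        app.run()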
We also got to learn about cartopy, an open source Python library originally written for weather and climate
science. It provides a simple and useful API for cartographic mapping.
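As a small illustration of that API (a typical minimal example, not code from the talk; the coordinates are purely illustrative):

    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs

    # Create map axes in the Plate Carrée projection and draw coastlines
    # from the bundled Natural Earth data.
    ax = plt.axes(projection=ccrs.PlateCarree())
    ax.coastlines()

    # Plot a path between two (lon, lat) points; the Geodetic transform
    # tells cartopy the coordinates are geographic.
    ax.plot([-0.12, 4.35], [51.51, 50.85], transform=ccrs.Geodetic())
    plt.show()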
Distributed computing was a hot topic, and many talks were related to this
theme.
Gael Varoquaux reminded us of the key problems with "biggish data" and
the key points for processing it successfully. Some of his
recommendations are generally useful, such as "choose simple solutions", "fail
gracefully" and "make it easy to debug". When I/O is the limiting constraint of a
big data computation, one way to work around it is to split the problem into random fractions of the
data, run the algorithms on each fraction and aggregate the results. He
also presented mini-batch processing, which works on a bunch of observations at a time (a trade-off between memory
usage and vectorization), and joblib.Parallel, together with joblib's compressed persistence, which makes I/O faster
(CPUs are faster than disk access).
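Below is a minimal sketch of that "split into random fractions, run in parallel, aggregate" pattern using joblib (a toy example of ours, not code from the talk; the per-fraction statistic is just a mean and the file name is illustrative):

    import numpy as np
    from joblib import Parallel, delayed, dump

    def partial_mean(data, indices):
        # Compute the statistic of interest on one random fraction of the data.
        return data[indices].mean(axis=0)

    data = np.random.rand(1000000, 10)      # stand-in for a "biggish" dataset
    rng = np.random.RandomState(0)
    fractions = [rng.choice(len(data), size=100000, replace=False)
                 for _ in range(8)]

    # Run each fraction on a separate worker, then aggregate the partial results.
    partial_results = Parallel(n_jobs=4)(
        delayed(partial_mean)(data, idx) for idx in fractions)
    estimate = np.mean(partial_results, axis=0)

    # Persist intermediate results with compression: trading CPU time for disk I/O.
    dump(partial_results, 'partial_results.joblib', compress=3)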
Benoit Da Mota talked about shared memory in parallel computing, and Antonio
Messina gave us a quick overview of how to build a computing cluster with
Elasticluster, using
OpenStack/Slurm/Ansible. He
demonstrated starting and stopping a cluster on OpenStack: once all the VMs are
started, Ansible configures them as hosts of the cluster, and new VMs can be
created and added to the cluster on the fly thanks to a command line interface.
We also got a keynote by Peter Wang (from Continuum Analytics) about the future of data analysis with Python. As
a PhD in physics, I loved his metaphor of giving mass to data. He tried to
explain the pain that scientists experience when using databases.
After the conference we participated in the NumPy/SciPy sprint, which was
organized by Ralf Gommers and Pauli Virtanen. There were 18 people trying to close
issues of various difficulty levels, and we got a quick tutorial on how easy it
is to contribute: the easiest way is to fork the project from its GitHub page onto your own GitHub account (you can create
one for free), so that your patch submission later becomes a simple "Pull
Request" (PR). Clone your SciPy fork locally and make a new branch
(git checkout -b <newbranch>) to tackle one specific issue. Once your patch
is ready, commit it locally, push it to your GitHub repository and, from the
GitHub interface, choose "Pull request". You will be able to add a description to
your pull request before it is sent and reviewed by the project's lead
developers. For example, using "gh-XXXX" in your commit message will
automatically add a link to issue no. XXXX. Here is the list
of open issues for SciPy; you can filter them, e.g. to display only the ones
considered easy to fix :D
For more information: Contributing to SciPy.