
Blog entries

EP14 Pylint sprint Day 1 report

2014/07/24 by Sylvain Thenault
https://ep2014.europython.eu/static_media/assets/images/logo.png

We've had a fairly enjoyable and productive first day in our little hidden room at EuroPython in Berlin! Below are some notable things we worked on and discussed.

First, we discussed and agreed that while we should at some point cut the cord to the logilab.common package, it will take some time, notably because of the usage of logilab.common.configuration, which would be somewhat costly to replace (and is working pretty well). There are some small steps we should take, but basically we should move the pylint/astroid-specific things from logilab.common back into astroid or pylint. This should be partly done during the sprint, and the remaining work will go into tickets in the tracker.

We also discussed release management. The point is that we should release more often, so every pylint maintainer should be able to do that easily. Sylvain will write a document about the release procedure and ensure access is granted to the pylint and astroid projects on PyPI. We shall release pylint 1.3 / astroid 1.2 soon, and those release branches will be the last ones supporting Python < 2.7.

During this first day, we also had the opportunity to meet Carl Crowder, the guy behind http://landscape.io, as well as David Halter, who is building the Jedi completion library (https://github.com/davidhalter/jedi). Landscape.io runs pylint on thousands of projects, and it would be nice if we could test beta releases on part of this panel. On the other hand, there is probably a lot of code to share with the Jedi library, like the parser and AST generation, as well as a static inference engine. That deserves a sprint of its own though, so we agreed that a nice first step would be to build a common library for import resolution that does not rely on the Python interpreter, while handling most of Python's dark import features like zip/egg imports, .pth files and so on. Indeed, those may be two nice future collaborations!

Last but not least, we got some actual work done:

  • Michal Novikowsky from Intel in Poland joined us to work on the ability to run pylint in different processes, which may drastically improve performance on multi-core boxes.
  • Torsten continued work on various improvements to the functional test framework.
  • Sylvain merged the logilab.common.modutils module into astroid, as it's mostly driven by astroid and pylint needs, and fixed the annoying namespace package crash.
  • Claudiu kept up the good work he does daily at improving and fixing pylint :)

Open Legislative Data Conference 2014

2014/06/10 by Nicolas Chauvat

I was at the Open Legislative Data Conference on May 28th, 2014 in Paris, to present a simple demo I have been working on since the same event two years ago.

The demo was called "Law is Code Rebooted with CubicWeb". It featured the use of the cubicweb-vcreview component to display the amendments of the hospital law ("loi hospitalière") gathered into a version control system (namely Mercurial).

The basic idea is to compare writing code and writing law, for both are collaborative and distributed writing processes. Could we reuse for the second one the tools developed for the first?

Here are the slides and a few screenshots.

http://www.logilab.org/file/253394/raw/lawiscode1.png

Statistics with queries embedded in report page.

http://www.logilab.org/file/253400/raw/lawiscode2.png

List of amendments.

http://www.logilab.org/file/253396/raw/lawiscode3.png

User comment on an amendment.

While attending the conference, I enjoyed several interesting talks and chats with other participants, including:

  1. the study of co-sponsorship of proposals in the French parliament
  2. data.senat.fr announcing their use of PostgreSQL and JSON.
  3. and last but not least, the great work done by RegardsCitoyens and SciencesPo MediaLab on visualizing the law making process.

Thanks to the organisation team and the other speakers. Hope to see you again!


SaltStack Meetup with Thomas Hatch in Paris France

2014/05/22 by Arthur Lutz

This Monday (19th of May 2014), Thomas Hatch was in Paris for dotScale 2014. After presenting SaltStack there (videos will be published at some point), he spent the evening with members of the French SaltStack community during a meetup set up by Logilab at IRILL.

http://www.logilab.org/file/248338/raw/thomas-hatch.png

Here is a list of what we talked about:

  • Since Salt seems to have pushed ZMQ to its limits, SaltStack has been working on RAET (Reliable Asynchronous Event Transport Protocol), a transport layer based on UDP and elliptic curve cryptography (Dan Bernstein's Curve25519) that works more like a stack than a socket and has reliability built in. RAET will be released as an optional beta feature in the next Salt release.
  • Folks from Dailymotion bumped into a bug that seems related to high latency networks and the auth_timeout. Updating to the very latest release should fix the issue.
  • Thomas told us about how a dedicated team at SaltStack handles pull requests and another team works on triaging github issues to input them into their internal SCRUM process. There are a lot of duplicate issues and old inactive issues that need attention and clutter the issue tracker. Help will be welcome.
http://www.logilab.org/file/248336/raw/Salt-Logo.png
  • Continuous integration is based on Jenkins and spins up VMs to test pull requests. There is work in progress to test multiple clouds, various latencies and loads.
  • For the Docker integration, salt now keeps track of forwarded ports and relevant information about the containers.
  • salt-virt bumped into problems with chroots and timeouts due to ZMQ.
  • Multi-master: the problem lies with the synchronisation of the data that is sent to minions, but also of the data that is sent to the masters. Possible solutions to explore: use gitfs; for keys there is no built-in solution (salt-key has to be run on all masters); mine.send should send the data to both masters; for the jobs cache, one could use an external returner.
  • Thomas talked briefly about ioflo which should bring queuing, data hierarchy and data pub-sub to Salt.
http://www.logilab.org/file/248335/raw/ioflo.png
  • About the rolling release question: versions in Salt are definitely not git snapshots; things get backported into previous versions. There is no clear definition yet of the length of LTS versions.
  • salt-cloud and libcloud: in the next release, libcloud will not be a hard dependency. Some clouds didn't work in libcloud (for example AWS), so these providers got implemented directly in salt-cloud or by using third-party libraries (e.g. python-boto).
  • Documentation: a sprint is planned next week. Reference documentation will not be completely revamped, but tutorial content will be added.

Boris Feld showed a demo of vagrant images orchestrated by salt and a web UI to monitor a salt install.

http://www.vagrantup.com/images/logo_vagrant-81478652.png

Thanks again to Thomas Hatch for coming and meeting up with (part of) the community here in France.


Salt April Meetup in Paris (France)

2014/05/14 by Arthur Lutz

On the 15th of April, in Paris (France), we took part in yet another Salt meetup. The community is now meeting up once every two months.

We had two presentations:

  • Arthur Lutz made an introduction to returners and the scheduler, using the SalMon monitoring system as an example. Salt is not only about configuration management, indeed!
  • The folks from Is Cool Entertainment did a presentation about how they are using salt-cloud to deploy and orchestrate clusters of EC2 machines (islands in their jargon) to reproduce parts of their production environment for testing and development.

More discussions about various salty subjects followed and were pursued in an Italian restaurant (photos here).

In case it is not already in your diary: Thomas Hatch is coming to Paris next week, on Monday the 19th of May, and will be speaking at dotScale during the day and at a Salt meetup in the evening. The Salt meetup will take place at IRILL (like the previous meetups, thanks again to them) and should start at 19h. The meetup is free and open to the public, but registering on this framadate would be appreciated.


Pylint 1.2 released!

2014/04/22 by Sylvain Thenault

Once again, a lot of work has been achieved since the latest 1.1 release. Claudiu, who joined the maintainer team (Torsten and me), did great work in the past few months. Also, Torsten has lately backported a lot of things from their internal G[oogle]Pylint. Last but not least, various people contributed by reporting issues and proposing pull requests. So thanks to everybody!

Notice that Pylint 1.2 depends on astroid 1.1, which has been released at the same time. Currently, the code is available on PyPI, and Debian/Ubuntu packages should be ready shortly in Logilab's acceptance repositories.

Below is a summary of the changes; check the changelog for more info.

New and improved checks:

  • New message 'eval-used', emitted when the builtin function eval is used.
  • New message 'bad-reversed-sequence', checking that the reversed builtin receives a sequence (i.e. something that implements __getitem__ and __len__ without being a dict or a dict subclass) or an instance which implements __reversed__.
  • New message 'bad-exception-context', checking that raise ... from ... uses a proper exception context (None or an exception).
  • New message 'abstract-class-instantiated', warning when abstract classes created with the abc module and with abstract methods are instantiated.
  • New messages checking for proper class __slots__: 'invalid-slots-object' and 'invalid-slots'.
  • New message 'undefined-all-variable' if a package's __all__ variable contains a missing submodule (#126).
  • New option logging-modules giving the list of module names that can be checked for 'logging-not-lazy'.
  • New option include-naming-hint to show a naming hint for invalid names (#138).
  • Mark file as a bad function when using Python 2 (#8).
  • Add support for enforcing multiple, but consistent name styles for different name types inside a single module.
  • Warn about empty docstrings on overridden methods.
  • Inspect arguments given to constructor calls, and emit relevant warnings.
  • Extend the number of cases in which logging calls are detected (#182).
  • Enhance the check for 'used-before-assignment' to look for nonlocal uses.
  • Improve cyclic import detection in the case of packages.
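
As an illustration, here is a small snippet of my own (not from the changelog) that should trigger two of the new messages:

result = eval("2 + 2")                   # emits 'eval-used'

for key in reversed({'a': 1, 'b': 2}):   # emits 'bad-reversed-sequence'
    print(key)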

Bug fixes:

  • Do not warn about 'return-arg-in-generator' in Python 3.3+.
  • Do not warn about 'abstract-method' when the abstract method is implemented through assignment (#155).
  • Do not register most of the 'newstyle' checker warnings with python >= 3.
  • Fix 'unused-import' false positive with augmented assignment (#78).
  • Fix 'access-member-before-definition' false negative with augmented assignment (#164).
  • Do not crash when looking for 'used-before-assignment' in context manager assignments (#128).
  • Do not attempt to analyze non-Python files, e.g. '.so' files (#122).
  • Pass the current python path to pylint process when invoked via epylint (#133).

Command line:

  • Add -i / --include-ids and -s / --symbols back as completely ignored options (#180).
  • Ensure init-hooks is evaluated before other options, notably load-plugins (#166).

Other:

  • Improve pragma handling to not detect 'pylint:*' strings in non-comments (#79).
  • Do not crash with UnknownMessage if an unknown message identifier/name appears in disable or enable in the configuration (#170).
  • Search for the rc file in ~/.config/pylintrc if ~/.pylintrc doesn't exist (#121).
  • Python 2.5 support restored (#50 and #62).

Astroid:

  • Python 3.4 support
  • Enhanced support for metaclasses
  • Enhanced namedtuple support

Nice easter egg, no?


Code_Aster back in Debian unstable

2014/03/31 by Denis Laxalde

Last week, a new release of Code_Aster entered Debian unstable. Code_Aster is a finite element solver for partial differential equations in mechanics, mainly developed by EDF R&D (Électricité de France). It is arguably one of the most feature-complete free software packages available in this domain.

Aster has been in Debian since 2012 thanks to the work of the debian-science team. Yet it has always been a somewhat problematic package, with a couple of persistent Release Critical (RC) bugs (FTBFS, installability issues), and it actually never entered a stable release of Debian.

Logilab has committed to improving Code_Aster for a long time in various areas, notably through the LibAster friendly fork, which aims at turning the monolithic Aster into a library, usable from Python.

Recently, the EDF R&D team in charge of the development of Code_Aster took several major decisions, including:

  • the move to Bitbucket forge as a sign of community opening (following the path opened by LibAster that imported the code of Code_Aster into a Mercurial repository) and,
  • the change of build system from a custom makefile-style architecture to a fine-grained Waf system (taken from that of LibAster).

The latter obviously led to significant changes on the Debian packaging side, most of which go in a sane direction: the debian/rules file slimmed down from 239 lines to 51, and a bunch of tricky install-step manipulations were dropped, leading to something much simpler and closer to upstream (see #731211 for details). From the upstream perspective, this re-packaging effort based on the new build system may be the opportunity to update the installation scheme (in particular by declaring the Python library as private).

Clearly, there's still room for improvement on both sides (like building with the new metis library, or shipping several versions of Aster: stable/testing, MPI/serial). All in all, this is good for both Debian users and upstream developers. At Logilab, we hope that this effort will consolidate our collaboration with EDF R&D.


Second Salt Meetup builds the french community

2014/03/04 by Arthur Lutz

On the 6th of February, the Salt community in France met in Paris to discuss Salt and choose the tools to federate itself. The meetup was kindly hosted by IRILL.

There were two formal presentations:

  • Logilab did a short introduction of Salt,
  • Majerti presented a feedback of their experience with Salt in various professional contexts.

The presentation space was then opened to other participants and Boris Feld did a short presentation of how Salt was used at NovaPost.

http://www.logilab.org/file/226420/raw/saltstack_meetup.jpeg

We then had a short break to share some pizza (sponsored by Logilab).

After the break, we had an open discussion about various subjects, including "best practices" in Salt and some specific use cases. Regis Leroy talked about the states that Makina Corpus has been publishing on github. The idea of reconciling the documentation and the monitoring of an infrastructure was brought up by Logilab, which calls it "Test Driven Infrastructure".

The tools we collectively chose to form the community were the following:

  • a mailing-list kindly hosted by the AFPY (a pythonic French organization)
  • a dedicated #salt-fr IRC channel on freenode

We decided that the meetup would take place every two months, hence the third one will be in April. There is already some discussion about organizing events to tell as many people as possible about Salt. It will probably start with an event at NUMA in March.

After the meetup was officially over, a few people went on to have some drinks nearby. Thank you all for coming and your participation.



FOSDEM PGDay 2014

2014/02/11 by Rémi Cardona

I attended PGDay on January 31st, in Brussels. This event was held just before FOSDEM, which I also attended (expect another blog post). Here are some of the notes I took during the conference.

https://fosdem.org/2014/support/promote/wide.png

Statistics in PostgreSQL, Heikki Linnakangas

Due to transit delays, I only caught the last half of that talk.

The main goal of this talk was to explain some of Postgres' per-column statistics. In a nutshell, Postgres needs to have some idea about tables' content in order to choose an appropriate query plan.

Heikki explained which sorts of statistics Postgres gathers, such as most common values and histograms. Another interesting stat is the correlation between physical pages and data ordering (see CLUSTER).

Column statistics are gathered when running ANALYZE and stored in the pg_statistic system catalog. The pg_stats view provides a human-readable version of these stats.

Heikki also explained how to determine whether performance issues are due to out-of-date statistics or not. As it turns out, EXPLAIN ANALYZE shows, for each step of the query plan, how many rows the planner expected to process and how many it actually processed. The rule of thumb is that similar values (no more than an order of magnitude apart) mean that column statistics are doing their job. A wider margin between expected and actual rows means that statistics are possibly preventing the query planner from picking a more optimized plan.
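
As a minimal sketch of that check (using psycopg2; the database and query are made up):

import psycopg2

# EXPLAIN ANALYZE runs the query and reports, for each plan node, the
# planner's estimated rows ("rows=...") next to the actual row count.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42")
for (line,) in cur.fetchall():
    print(line)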

It was noted though that statistics-related performance issues often happen on tables with very frequent modifications. Running ANALYZE manually or increasing the frequency of the automatic ANALYZE may help in those situations.

Advanced Extension Use Cases, Dimitri Fontaine

Dimitri explained with very simple cases the use of some of Postgres' lesser-known extensions and the overall extension mechanism.

Here's a grocery-list of the extensions and types he introduced:

  • intarray extension, which adds operators and functions to the standard ARRAY type, specifically tailored for arrays of integers,
  • the standard POINT type which provides basic 2D flat-earth geometry,
  • the cube extension that can represent N-dimensional points and volumes,
  • the earthdistance extension that builds on cube to provide distance functions on a sphere-shaped Earth (a close-enough approximation for many uses),
  • the pg_trgm extension which provides text similarity functions based on trigram matching (a much simpler thus faster alternative to Levenshtein distances), especially useful for "typo-resistant" auto-completion suggestions,
  • the hstore extension which provides a simple but efficient key/value store that has everyone talking in the Postgres world (it's touted as the NoSQL killer),
  • the hll extension which implements the HyperLogLog algorithm, which seems very well suited to storing and counting unique visitors on a web site, for example.

An all-around great talk with simple but meaningful examples.

http://tapoueh.org/images/fosdem_2014.jpg

Integrated cache invalidation for better hit ratios, Magnus Hagander

What Magnus presented almost amounted to a tutorial on caching strategies for busy web sites. He went through simple examples, using the ubiquitous Django framework for the web view part and Varnish for the HTTP caching part.

The whole talk revolved around adding private (X-prefixed) HTTP headers in replies containing one or more "entity IDs" so that Varnish's cache can be purged whenever said entities change. The hard problem lies in how and when to call PURGE on Varnish.

The obvious solution is to override Django's save() method on Model-derived objects. One can then use httplib (or better yet, requests) to purge the cache. This solution can be slightly improved by using Django's signal mechanism instead, which sounds an awful lot like CubicWeb's hooks.
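
A minimal sketch of the signal-based variant (the Article model and the Varnish URL are made up; requests supports arbitrary HTTP methods such as PURGE):

import requests
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Article  # hypothetical model

@receiver(post_save, sender=Article)
def purge_varnish(sender, instance, **kwargs):
    # Ask Varnish to drop any cached page for the entity that just changed.
    requests.request('PURGE',
                     'http://varnish.example.com/articles/%d/' % instance.pk)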

The problem with the above solution is that any DB modification not going through Django (and they will happen) will not invalidate the cached pages. So Magnus then presented how to write the same cache-invalidating code in PL/Python in triggers.

While this does solve that last issue, it introduces synchronous HTTP calls in the DB, hurting write performance (or killing it completely if the HTTP calls fail). The fix to those problems, at the cost of a little added latency, is to use SkyTools' PgQ, a simple message queue based on Postgres. Moving the HTTP calls outside of the main database and into a Consumer (a class provided by PgQ's python bindings) makes the cache-invalidating trigger asynchronous, reducing write overhead.
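
Roughly, the consumer side might look like the sketch below (the queue name, connection name and event payload are all made up, and the exact API should be checked against SkyTools' documentation):

import sys

import pgq
import requests

class CacheInvalidator(pgq.Consumer):
    # Each event's payload is assumed to carry the URL path to purge.
    def process_batch(self, db, batch_id, ev_list):
        for ev in ev_list:
            requests.request('PURGE',
                             'http://varnish.example.com' + ev.data)

if __name__ == '__main__':
    CacheInvalidator('cache_invalidator', 'invalidation_db', sys.argv[1:]).start()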

http://www.logilab.org/file/210615/raw/varnish_django_postgresql.png

A clear, concise and useful talk for any developer in charge of high-traffic web sites or applications.

The Worst Day of Your Life, Christophe Pettus

Christophe humorously went back to that dreadful day in the collective Postgres memory: the release of 9.3.1 and the streaming replication chaos.

My overall impression of the talk: Thank $DEITY I'm not a DBA!

But Christophe also gave some valuable advice, even for non-DBAs:

  • Provision 3 times the necessary disk space, in case you need to pg_dump or otherwise do a snapshot of your currently running database,
  • Do backups and test them:
    • give them to developers,
    • use them for analytics,
    • test the restore, make it foolproof, try to automate it,
  • basic Postgres hygiene:
    • fsync = on (on by default, DON'T TURN IT OFF, there are better ways)
    • full_page_writes = on (on by default, don't turn it off)
    • deploy minor versions as soon as possible,
    • plan upgrade strategies before EOL,
    • 9.3+ checksums (createdb option, performance cost is minimal),
    • application-level consistency checks (don't wait for auto vacuum to "discover" consistency errors).

Materialised views now and in the future, Thom Brown

Thom presented one of the new features of Postgres 9.3: materialized views.

In a nutshell, materialized views (MV) are read-only snapshots of queried data that's stored on disk, mostly for performance reasons. An interesting feature of materialized views is that they can have indexes, just like regular tables.

The REFRESH MATERIALIZED VIEW command can be used to update an MV: it will simply run the original query again and store the new results.

There are a number of caveats with the current implementation of MVs:

  • pg_dump never saves the data, only the query used to build it,
  • REFRESH requires an exclusive lock,
  • due to implementation details (frozen rows or pages IIRC), MVs may exhibit non-concurrent behavior with other running transactions.

Looking towards 9.4 and beyond, here are some of the upcoming MV features:

  • 9.4 adds the CONCURRENTLY keyword:
    • + no longer needs an exclusive lock, doesn't block reads
    • - requires a unique index
    • - may require VACUUM
  • roadmap (no guarantees):
    • unlogged (disables the WAL),
    • incremental refresh,
    • lazy automatic refresh,
    • planner awareness of MVs (would use MVs as cache/index).

Indexes: The neglected performance all-rounder, Markus Winand

http://use-the-index-luke.com/img/alchemie.png

Markus' goal with this talk was to show that very few people in the SQL world actually know - let alone really care - about indexes. According to his own experience and that of others (even with competing RDBMS), poorly written SQL is still a leading cause of production downtime (he puts the number at around 50% of downtime, though others he quoted put that number higher). SQL queries can indeed put such stress on DB systems as to cause them to fail.

One major issue, he argues, is poorly designed indexes. He went back in time to explain possible reasons for the lack of knowledge about indexes with both SQL developers and DBAs. One such reason may be that indexes are not part of the SQL standard and are left as implementation-specific details. Thus many books about SQL barely cover indexes, if at all.

He then took us through a simple quiz he wrote on the topic, with only 5 questions. The questions and explanations were very insightful and I must admit my knowledge of indexes was not up to par. I think everyone in the room got his message loud and clear: indexes are part of the schema, devs should care about them too.

Try out the test: http://use-the-index-luke.com/3-minute-test

PostgreSQL - Community meets Business, Michael Meskes

For the last talk of the day, Michael went back over the history of the Postgres project and its community. Unlike other IT domains such as email, HTTP servers or even operating systems, RDBMS are still largely dominated by proprietary vendors such as Oracle, IBM and Microsoft. He argues that the reasons are not technical: from a developer standpoint, Postgres has all the features of the leading RDBMS (and many more), and the few missing administrative features related to scalability are being addressed.

Instead, he argues decision makers inside companies don't yet fully trust Postgres due to its (perceived) lack of corporate backers.

He went on to suggest ways to overcome those perceptions, for example with an "official" Postgres certification program.

A motivational talk for the Postgres community.

http://fosdem2014.pgconf.eu/files/img/frontrotate/slonik.jpg

A Salt Configuration for C++ Development

2014/01/24 by Damien Garaud
http://www.logilab.org/file/204916/raw/SaltStack-Logo.png

At Logilab, we've been using Salt for one year to manage our own infrastructure. I wanted to use it to manage a specific configuration: C++ development. When I instantiate a virtual machine with a Debian image, I don't want to spend time installing and configuring a system to fit my needs as a C++ developer.

This article is a very simple recipe to get a C++ development environment, ready to use, ready to hack.

Give Me an Editor and a DVCS

Quite simple: I use the YAML file format used by Salt to describe what I want. To install these two editors, I just need to write:

vim-nox:
  pkg.installed

emacs23-nox:
  pkg.installed

For Mercurial, you'll guess:

mercurial:
 pkg.installed

You can write these lines in the same init.sls file, but you can also decide to split your configuration into different subdirectories: one place for each thing. I decided to create dev and edit directories at the root of my salt config, with an init.sls inside each.

That's all for the editors. Next step: specific C++ development packages.

Install Several "C++" Packages

In a cpp folder, I write a file init.sls with this content:

gcc:
    pkg.installed

g++:
    pkg.installed

gdb:
    pkg.installed

cmake:
    pkg.installed

automake:
    pkg.installed

libtool:
    pkg.installed

pkg-config:
    pkg.installed

colorgcc:
    pkg.installed

The choice of these packages is arbitrary; add or remove packages as you need. There is no single right solution. But I want more: I want some LLVM packages. In cpp/llvm.sls, I write:

llvm:
 pkg.installed

clang:
    pkg.installed

libclang-dev:
    pkg.installed

{% if not grains['oscodename'] == 'wheezy' %}
lldb-3.3:
    pkg.installed
{% endif %}

These last lines specify that the lldb package is installed only if your Debian release is not the stable one (wheezy), i.e. jessie/testing or sid in my case. Now, just include this file in the init.sls one:

# ...
# at the end of 'cpp/init.sls'
include:
  - .llvm

Organize your sls files according to your needs. That's all for package installation. Your Salt configuration now looks like this:

.
|-- cpp
|   |-- init.sls
|   `-- llvm.sls
|-- dev
|   `-- init.sls
|-- edit
|   `-- init.sls
`-- top.sls

Launching Salt

Start your VM and install a masterless Salt on it (e.g. apt-get install salt-minion). To launch Salt locally on your naked VM, you need to copy your configuration (through scp or a DVCS) into the /srv/salt/ directory and write the top.sls file:

base:
  '*':
    - dev
    - edit
    - cpp

Then just launch:

> salt-call --local state.highstate

as root.

And What About Configuration Files?

You're right. At the beginning of the post, I talked about a "ready to use" Mercurial with some HG extensions. So I use and copy the default /etc/mercurial/hgrc.d/hgext.rc file into the dev directory of my Salt configuration. Then I edit it to enable some extensions such as color, rebase and pager. As I also need Evolve, I have to clone the source code from https://bitbucket.org/marmoute/mutable-history. With Salt, I can say "clone this repo and copy this file" to specific places.

So, I add some lines to dev/init.sls.

https://bitbucket.org/marmoute/mutable-history:
    hg.latest:
      - rev: tip
      - target: /opt/local/mutable-history
      - require:
         - pkg: mercurial

/etc/mercurial/hgrc.d/hgext.rc:
    file.managed:
      - source: salt://dev/hgext.rc
      - user: root
      - group: root
      - mode: 644

The require keyword means "install (if necessary) this target before cloning". The other lines are quite self-explanatory.

In the end, you have just six files with a few lines. Your configuration now looks like:

.
|-- cpp
|   |-- init.sls
|   `-- llvm.sls
|-- dev
|   |-- hgext.rc
|   `-- init.sls
|-- edit
|   `-- init.sls
`-- top.sls

You can customize it and share it with your teammates. A step further would be to add some configuration files for your favorite editor. You can also imagine installing extra packages that your library depends on: simply add a subdirectory such as amazing_lib and write your own init.sls, as shown below. I know I often need the Boost libraries, for example. When your Salt configuration has changed, just type salt-call --local state.highstate again.
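
For instance, a hypothetical amazing_lib/init.sls pulling in Boost could be as small as this (the Debian package name is from memory, so double-check it for your release):

libboost-all-dev:
    pkg.installed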

As you can see, setting up your environment on a fresh system will take only a couple of commands at the shell before you are ready to compile your C++ library, debug it, fix it and commit your modifications to your repository.


What's New in Pandas 0.13?

2014/01/20 by Damien Garaud
http://www.logilab.org/file/203841/raw/pandas_logo.png

Do you know pandas, a Python library for data analysis? Version 0.13 came out on January the 16th, and this post describes a few new features and improvements that I think are important.

Each release has its list of bug fixes and API changes. You may read the full release notes if you want all the details, but I will just focus on a few things.

You may be interested in one of my previous blog posts, which showed a few useful pandas features with datasets from the Quandl website and came with an IPython Notebook for reproducing the results.

Let's talk about some new and improved Pandas features. I suppose that you have some knowledge of Pandas features and its main objects such as Series and DataFrame. If not, I suggest you watch the tutorial video by Wes McKinney on the main page of the project or read 10 Minutes to Pandas in the documentation.

Refactoring

I welcome the refactoring effort: the Series type, subclassed from ndarray, now has the same base class as DataFrame and Panel, i.e. NDFrame. This work unifies methods and behaviors for these classes. Be aware that you can hit a couple of potential incompatibilities with versions less than 0.13. See internal refactoring for more details.

Timeseries

to_timedelta()

The new function pd.to_timedelta converts a string, scalar or array of strings to a Numpy timedelta type (np.timedelta64, in nanoseconds). It requires Numpy >= 1.7. You can handle an array of timedeltas and divide it by another timedelta to carry out a frequency conversion.
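
For instance, a quick sketch based on the release notes:

import pandas as pd

# Parse a single string into a np.timedelta64 (nanosecond resolution).
pd.to_timedelta('1 days 06:05:01.00003')

# Parse several strings at once.
pd.to_timedelta(['1 days', '2 days 00:00:10'])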

from datetime import timedelta
import numpy as np
import pandas as pd

# Create a Series of timedelta from two DatetimeIndex.
dr1 = pd.date_range('2013/06/23', periods=5)
dr2 = pd.date_range('2013/07/17', periods=5)
td = pd.Series(dr2) - pd.Series(dr1)

# Set some Na{N,T} values.
td[2] -= np.timedelta64(timedelta(minutes=10, seconds=7))
td[3] = np.nan
td[4] += np.timedelta64(timedelta(hours=14, minutes=33))
td
0   24 days, 00:00:00
1   24 days, 00:00:00
2   23 days, 23:49:53
3                 NaT
4   24 days, 14:33:00
dtype: timedelta64[ns]

Note the NaT type (instead of the well-known NaN). For day conversion:

td / np.timedelta64(1, 'D')
0    24.000000
1    24.000000
2    23.992975
3          NaN
4    24.606250
dtype: float64

You can also use a DateOffset, as in:

td + pd.offsets.Minute(10) - pd.offsets.Second(7) + pd.offsets.Milli(102)

Nanosecond Time

Support for nanosecond times as an offset. See pd.offsets.Nano. You can use the alias 'N' for this offset as the value of the freq argument of the pd.date_range function.
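
A one-line sketch (the date is made up):

import pandas as pd

# A DatetimeIndex with 5-nanosecond spacing.
pd.date_range('2014-01-01', periods=3, freq='5N')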

Daylight Savings

The tz_localize method can now infer a fall daylight savings transition based on the structure of the unlocalized data. This method, like the tz_convert method, is available for any DatetimeIndex, Series or DataFrame with a DatetimeIndex. You can use it to localize your datasets thanks to the pytz module, or convert your timeseries to a different time zone. See the related documentation about time zone handling. To use daylight savings inference in tz_localize, set the infer_dst argument to True.
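
For example, here is a sketch around the fall transition in Paris (the dates are made up):

import pandas as pd

# On 2013-10-27 in Europe/Paris, wall-clock times between 02:00 and 03:00
# occur twice. Each ambiguous time appears twice below; infer_dst uses this
# structure to assign the first occurrence to DST and the second to
# standard time.
naive = pd.DatetimeIndex(['2013-10-27 01:30', '2013-10-27 02:00',
                          '2013-10-27 02:30', '2013-10-27 02:00',
                          '2013-10-27 02:30', '2013-10-27 03:00'])
naive.tz_localize('Europe/Paris', infer_dst=True)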

DataFrame Features

New Method isin()

New DataFrame method isin, which is used for boolean indexing. The argument to this method can be another DataFrame, a Series, or a dictionary mapping column names to lists of values. Comparing two DataFrames with isin is equivalent to doing df1 == df2. But you can also check whether values from a list occur in any column, or whether some values occur in a few specific columns of the DataFrame (i.e. using a dict instead of a list as argument):

df = pd.DataFrame({'A': [3, 4, 2, 5],
                   'Q': ['f', 'e', 'd', 'c'],
                   'X': [1.2, 3.4, -5.4, 3.0]})
   A  Q    X
0  3  f  1.2
1  4  e  3.4
2  2  d -5.4
3  5  c  3.0

and then:

df.isin(['f', 1.2, 3.0, 5, 2, 'd'])
       A      Q      X
0   True   True   True
1  False  False  False
2   True   True  False
3   True  False   True

Of course, you can use the previous result as a mask for the current DataFrame.

mask = _
df[mask.any(1)]
   A  Q    X
0  3  f  1.2
2  2  d -5.4
3  5  c  3.0

When you pass a dictionary to the isin method, you can specify the column labels for each value:

mask = df.isin({'A': [2, 3, 5], 'Q': ['d', 'c', 'e'], 'X': [1.2, -5.4]})
df[mask]
    A    Q    X
0   3  NaN  1.2
1 NaN    e  NaN
2   2    d -5.4
3   5    c  NaN

See the related documentation for more details or different examples.

New Method str.extract

The new vectorized extract method from the StringMethods object, available through the str attribute on Series, makes it possible to extract data using regular expressions, as follows:

s = pd.Series(['doe@umail.com', 'nobody@post.org', 'wrong.mail', 'pandas@pydata.org', ''])
# Extract usernames.
s.str.extract(r'(\w+)@\w+\.\w+')

returns:

0       doe
1    nobody
2       NaN
3    pandas
4       NaN
dtype: object

Note that the result is a Series holding the extracted strings, with NaN where nothing matched. You can also add more groups:

# Extract usernames and domain.
s.str.extract(r'(\w+)@(\w+\.\w+)')
        0           1
0     doe   umail.com
1  nobody    post.org
2     NaN         NaN
3  pandas  pydata.org
4     NaN         NaN

Elements that do not match return NaN. You can also use named groups, which are more useful if you want explicit column names (the NaN rows are dropped in the following example):

# Extract usernames and domain with named groups.
s.str.extract(r'(?P<user>\w+)@(?P<at>\w+\.\w+)').dropna()
     user          at
0     doe   umail.com
1  nobody    post.org
3  pandas  pydata.org

Thanks to this part of the documentation, I also found out about other useful string methods such as split, strip, replace, etc. for when you handle a Series of str, for instance. Note that most of them have been available since 0.8.1. Take a look at the string handling API doc (recently added) and some basics about vectorized string methods.

Interpolation Methods

DataFrame has a new interpolate method, similar to that of Series. It was possible to interpolate missing data in a DataFrame before, but it did not take dates into account if you had a timeseries index. Now, it is possible to pass a specific interpolation method to the method argument. You can use scipy interpolation functions such as slinear, quadratic, polynomial, and others. The time method is used to take your timeseries index into account.

from datetime import date
# Arbitrary timeseries
ts = pd.DatetimeIndex([date(2006,5,2), date(2006,12,23), date(2007,4,13),
                       date(2007,6,14), date(2008,8,31)])
df = pd.DataFrame(np.random.randn(5, 2), index=ts, columns=['X', 'Z'])
# Fill the DataFrame with missing values.
df['X'].iloc[[1, -1]] = np.nan
df['Z'].iloc[3] = np.nan
df
                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23       NaN -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980       NaN
2008-08-31       NaN  0.566205

Without any optional argument, you have:

df.interpolate()
                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23 -0.823514 -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980  0.554975
2008-08-31  1.210980  0.566205

With the time method, you obtain:

df.interpolate(method='time')
                   X         Z
2006-05-02  0.104836 -0.078031
2006-12-23 -1.156217 -0.589680
2007-04-13 -1.751863  0.543744
2007-06-14  1.210980  0.546496
2008-08-31  1.210980  0.566205

I suggest you read more examples in the missing data part of the docs and the scipy documentation about the interpolate module.

Misc

Convert a Series to a single-column DataFrame with its to_frame method.
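
For instance:

import pandas as pd

# A named Series becomes a one-column DataFrame; the name is the column label.
pd.Series([1, 2, 3], name='x').to_frame()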

Misc & Experimental Features

Retrieve R Datasets

Not a killer feature, but very pleasant: the possibility to load into a DataFrame all the R datasets listed at http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

import pandas.rpy.common as com
titanic = com.load_data('Titanic')
titanic.head()
  Survived    Age     Sex Class value
0       No  Child    Male   1st   0.0
1       No  Child    Male   2nd   0.0
2       No  Child    Male   3rd  35.0
3       No  Child    Male  Crew   0.0
4       No  Child  Female   1st   0.0

for the dataset about survival of passengers on the Titanic. You can find several other datasets about New York air quality measurements, body temperature series of two beavers, plant growth results or violent crime rates by US state, for instance. Very useful if you would like to show pandas to a friend, a colleague or your Grandma and you do not have a dataset with you.

And then three great experimental features.

Eval and Query Experimental Features

The eval and query methods use numexpr, which can quickly evaluate array expressions such as x - 0.5 * y. For numexpr, x and y are Numpy arrays. You can use this powerful feature in pandas to evaluate expressions over different DataFrame columns. By the way, we already talked about numexpr a few years ago in EuroScipy 09: Need for Speed.

df = pd.DataFrame(np.random.randn(10, 3), columns=['x', 'y', 'z'])
df.head()
          x         y         z
0 -0.617131  0.460250 -0.202790
1 -1.943937  0.682401 -0.335515
2  1.139353  0.461892  1.055904
3 -1.441968  0.477755  0.076249
4 -0.375609 -1.338211 -0.852466
df.eval('x + 0.5 * y - z').head()
0   -0.184217
1   -1.267222
2    0.314395
3   -1.279340
4   -0.192248
dtype: float64

As for the query method, it lets you select elements using a very simple query syntax.

df.query('x >= y > z')
          x         y         z
9  2.560888 -0.827737 -1.326839

msgpack Serialization

New reading and writing functions serialize your data with the great and well-known msgpack library. Note that this experimental feature does not have a stable storage format. You could imagine using zmq to transfer msgpack-serialized pandas objects over TCP, IPC or SSH, for instance.
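
A minimal round-trip sketch (the file name is made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_msgpack('frame.msg')    # serialize to disk
pd.read_msgpack('frame.msg')  # ... and read it back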

Google BigQuery

A recent module, pandas.io.gbq, provides a way to load datasets into, and extract them from, the Google BigQuery web service. I have not installed the requirements for this feature yet. The example in the release notes shows how you can select the average monthly temperature in the year 2000 across the USA. You can also read the related pandas documentation. Note that you will need a BigQuery account, as with other Google products.

Take Your Keyboard

Give it a try, play with some data, mangle and plot it, compute some stats, retrieve some patterns or whatever. I'm convinced that pandas will be used more and more, and not only by data scientists or quantitative analysts. Open an IPython Notebook, pick up some data and let yourself be tempted by pandas.

I think I will make more use of the vectorized string methods that I found out about while writing this post. I'm glad to have learned more about timeseries, because I know that I'll use these features. And I'm looking forward to using the experimental features: eval/query and msgpack serialization.

You can follow me on Twitter (@jazzydag). See also Logilab (@logilab_org).


Pylint 1.1 christmas release

2013/12/24 by Sylvain Thenault

Pylint 1.1 finally got released on PyPI!

A lot of work has been achieved since the latest 1.0 release. Various people have contributed several new checks as well as various bug fixes and other enhancements.

Here is a summary of the changes; check the changelog for more info.

New checks:

  • 'deprecated-pragma', for use of the deprecated pragma directives "pylint:disable-msg" or "pylint:enable-msg" (previously emitted as a regular warning).
  • 'superfluous-parens' for unnecessary parentheses after certain keywords.
  • 'bad-context-manager', checking that the '__exit__' special method accepts the right number of arguments.
  • 'raising-non-exception' / 'catching-non-exception', when raising/catching a class that does not inherit from BaseException.
  • 'non-iterator-returned', for non-iterators returned by '__iter__'.
  • 'unpacking-non-sequence' for unpacking non-sequences in assignments, and 'unbalanced-tuple-unpacking' when the left-hand-side size doesn't match the right-hand side.
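
As an illustration, a small snippet of my own that should trigger two of these new messages:

if (True):  # emits 'superfluous-parens'
    # emits 'unbalanced-tuple-unpacking': two names but three values
    a, b = 1, 2, 3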

Command line:

  • New option for the multi-statement warning to allow single-line if statements.
  • Allow running pylint as a python module: 'python -m pylint' (anatoly techtonik).
  • Various fixes to epylint.

Bug fixes:

  • Avoid a false 'used-before-assignment' for an identifier defined in an except handler and used on the same line (#111).
  • 'useless-else-on-loop' is not emitted if there is a break in the else clause of the inner loop (#117).
  • Drop 'badly-implemented-container', which caused several problems in its current implementation.
  • Don't mark input as a bad function when using python3 (#110).
  • Use the attribute regexp for properties in python3, as in python2.
  • Fix false-positive 'trailing-whitespace' on Windows (#55).

Other:

  • Replaced the regexp-based format checker with a more powerful (and nit-picky) parser, combining 'no-space-after-operator', 'no-space-after-comma' and 'no-space-before-operator' into a new warning, 'bad-whitespace'.
  • Create the PYLINTHOME directory when needed; otherwise this might fail and lead to spurious warnings on import of pylint.config.
  • Fix setup.py so that pylint properly installs on Windows when using python3.
  • Various documentation fixes and enhancements.

Packages will be available in Logilab's Debian and Ubuntu repository in the next few weeks.

Happy Christmas!


SaltStack Paris Meetup on Feb 6th, 2014 - (S01E02)

2013/12/20 by Nicolas Chauvat

Logilab has set up the second meetup for salt users in Paris on Feb 6th, 2014 at IRILL, near Place d'Italie, starting at 18:00. The address is 23 avenue d'Italie, 75013 Paris.

Here is the announcement in French: http://www.logilab.fr/blogentry/1981

Please forward it to whoever may be interested, underlining that pizzas will be offered to refuel the chatters ;)

Conveniently scheduled a week after the Salt Conference, topics will include anything related to salt and its uses: demos, new ideas, exchanges of salt formulas, comments on the talks/videos of the saltconf, etc.

If you are interested in Salt, Python and Devops and will be in Paris at that time, we hope to see you there!


A quick take on continuous integration services for Bitbucket

2013/12/19 by Sylvain Thenault

Some time ago, we moved Pylint from this forge to Bitbucket (more on this here).

https://bitbucket-assetroot.s3.amazonaws.com/c/photos/2012/Oct/11/master-logo-2562750429-5_avatar.png

Since then, I have somewhat continued to use the continuous integration (CI) service we provide on logilab.org to run tests on new commits and to do the release job (publish a tarball on PyPI and on our web site, build Debian and Ubuntu packages, etc.). This is fine, but not really handy, since logilab.org's CI service is not designed to be used for projects hosted elsewhere. Also, I wanted to see what others have to offer, so I decided to find a public CI service to host at least Pylint and Astroid automatic tests.

Here are the results of my first swing at it. If you have other suggestions, some configuration proposal or whatever, please comment.

First, here are the ones I didn't test along with why:

The first one I actually tested, and also the first one to show up when looking for "bitbucket continuous integration" on Google, is https://drone.io. The UI is really simple, and I was able to set up tests for Pylint in a matter of minutes: https://drone.io/bitbucket.org/logilab/pylint. Tests are automatically launched when a new commit is pushed to Pylint's Bitbucket repository, and that setup was done automatically.

Trying to push Drone.io further, one missing feature is the ability to have different settings for my project, e.g. to launch tests on all the Python flavors officially supported by Pylint (2.5, 2.6, 2.7, 3.2, 3.3, pypy, jython, etc.). Last but not least, the missing killer feature I want is the ability to launch tests on top of pull requests, which travis-ci supports.

Then I gave http://wercker.com a shot, but got stuck at the Bitbucket repository selection screen: none were displayed. Maybe because I don't own Pylint's repository and am only part of the admin/dev team? Anyway, wercker seems appealing too, though its yaml-based configuration looks a bit more complicated than drone.io's; but as I was not able to test it further, there's not much else to say.

http://wercker.com/images/logo_header.png

So for now the winner is https://drone.io, but the first one allowing me to test on several Python versions and to launch tests on pull requests will be the definitive winner! Bonus points for automating the release process and checking test coverage on pull requests as well.

https://drone.io/drone3000/images/alien-zap-header.png

A retrospective of 10 years animating the pylint free software project

2013/11/25 by Sylvain Thenault

was the topic of the talk I gave last Saturday at Capitole du Libre in Toulouse.

Here are the slides (pdf) for those interested (in French). A video of the talk should be available soon on the Capitole du Libre web site. The slides are mirrored on slideshare (see below):


Retrieve Quandl's Data and Play with a Pandas

2013/10/31 by Damien Garaud

This post deals with the pandas Python library, the open and free access to timeseries datasets thanks to the Quandl website, and how you can handle datasets with pandas efficiently.

http://www.logilab.org/file/186707/raw/scrabble_data.jpg
http://www.logilab.org/file/186708/raw/pandas_peluche.jpg

Why this post?

For a long time, I have wanted to play a little with pandas. Not an adorable black and white teddy bear, but the well-known Python data library based on Numpy. I would like to show how you can easily retrieve some numerical datasets from the Quandl website and its API, and handle these datasets with pandas efficiently through its main object: the DataFrame.

Note that this blog post comes with an IPython Notebook, which can be found at http://nbviewer.ipython.org/url/www.logilab.org/file/187482/raw/quandl-data-with-pandas.ipynb

You can also get it at http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with Mercurial.

Just do:

hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/

and get the IPython Notebook, the HTML conversion of this Notebook and some related CSV files.

First Step: Get the Code

At work or at home, I use Debian. A quick and dumb apt-get install python-pandas is enough. Nevertheless, (1) I'm keen on having fresh, bleeding-edge upstream sources to get the latest features and (2) I'm trying to contribute a little to the project --- tiny bugs, writing some docs. So I prefer to install it from source. Thus, I pull, I run sudo python setup.py develop and, a few seconds of Cython compiling later, I can do:

import pandas as pd

For the other ways to get the library, see the download page on the official website or see the dedicated Pypi page.

Let's build 10 Brownian motions and plot them with matplotlib.

import numpy as np
pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot()

I don't much like the default fonts and colors of matplotlib figures and curves. I know that pandas defines an "mpl style". Just after the import, you can write:

pd.options.display.mpl_style = 'default'
http://www.logilab.org/file/186714/raw/Ten-Brownian-Motions.png

Second Step: Have You Got Some Data, Please?

Maybe I'm wrong, but I think that it's sometimes quite difficult to retrieve workable numerical datasets from the huge amount of data available on the Web. Free Data, Open Data and so on. OK folks, where are they? I don't want to spend my time trawling through an Open Data website, finding some interesting issues, parsing an Excel file, getting some specific data and mangling it into a 2D array of floats with labels. Note that pandas deals with these kinds of problems very well --- see the IO part of the pandas documentation: CSV, Excel, JSON, HDF5 reading/writing functions. I just want workable numerical data without effort.

A few days ago, a colleague of mine told me about Quandl, a website dedicated to finding and using numerical timeseries datasets on the Internet. A perfect source to retrieve some data and play with pandas. Note that you can access data about economics, health, population, education, etc. thanks to a clever API: get some datasets in CSV/XML/JSON formats between this date and that date, aggregate them, compute the difference, etc.

Moreover, you can access Quandl's datasets through many programming languages, like R, Julia, Clojure or Python (plugins or modules are also available for some software such as Excel, Stata, etc.). The Quandl Python package depends on Numpy and pandas. Perfect! I can use the Quandl.py module available on GitHub and query some datasets directly into a DataFrame.

Here we are; huge amounts of data are teasing me. Next question: which data to play with?

Third Step: Give some Food to Pandas

I've already imported the pandas library. Let's query some datasets thanks to the Quandl Python module. Here is an example inspired by the README from Quandl's GitHub project page.

import Quandl
data = Quandl.get('GOOG/NYSE_IBM')
data.tail()

and you get:

              Open    High     Low   Close    Volume
Date
2013-10-11  185.25  186.23  184.12  186.16   3232828
2013-10-14  185.41  186.99  184.42  186.97   2663207
2013-10-15  185.74  185.94  184.22  184.66   3367275
2013-10-16  185.42  186.73  184.99  186.73   6717979
2013-10-17  173.84  177.00  172.57  174.83  22368939

OK, I'm not very familiar with this kind of data, so take a look at the Quandl website. After a dozen minutes there, I found this OECD murder rates dataset. This page shows current and historical murder rates (assault deaths per 100,000 people) for 33 countries from the OECD. Take a country and type:

uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR')

It's a DataFrame with a single column, 'Value'. The index of the DataFrame is a timeseries. You can easily plot these data with:

uk_df.plot()
http://www.logilab.org/file/186711/raw/GBR-oecd-murder-rates.png

See the other pieces of code and usage examples in the dedicated IPython Notebook. I also got data about unemployment in the OECD for roughly the same countries, with more dates. Then, as I wanted to compare these data, I had to select similar countries, time-resample my data to have the same frequency, and so on. Take a look. Any comment is welcome.

So, the remaining content of this blog post is just a summary of a few interesting and useful pandas features used in the IPython notebook.

  • Using timeseries as the Index of my DataFrames
  • pd.concat to concatenate several DataFrames along a given axis. This function can deal with missing values if the indexes of the DataFrames are not identical (which is my case)
  • DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV files. read_csv has various arguments to deal with dates, missing values, headers & footers, etc.
  • the DateOffset pandas object to deal with different time frequencies. Quite useful if you handle data with calendar or business days, month end or begin, quarter end or begin, etc.
  • Resampling data with the resample method. I use it to carry out frequency conversion of data with timeseries.
  • Merging/joining DataFrames, quite similar to the SQL feature. See the pd.merge function or the DataFrame.join method. I used this feature to align my two DataFrames along their indexes.
  • Some Matplotlib plotting functions such as DataFrame.plot() and plot(kind='bar').

Conclusion

I showed a few useful pandas features in the IPython Notebook: concatenation, plotting, data computation, data alignment. I think I could show more, but that could be the subject of a further blog post. Any comments, suggestions or questions are welcome.

The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.

The pictures come from:


SaltStack Paris Meetup - some of what was said

2013/10/09 by Arthur Lutz

Last week, on the first day of OpenWorldForum 2013, we met up with Thomas Hatch of SaltStack to have a talk about salt. He was in Paris to give two talks the following day (1 & 2), and it was a great opportunity to meet him and physically meet part of the French Salt community. Since Logilab hosted the Great Salt Sprint in Paris, we offered to co-organise the meetup at OpenWorldForum.

http://saltstack.com/images/SaltStack-Logo.png
http://openworldforum.org/static/pictures/Calque1.png

Introduction

About 15 people gathered in Montrouge (near Paris), and we all took turns presenting ourselves and how or why we use salt. Some people wanted to migrate from BCFG2 to salt. Some told the story of working a month with CFEngine and then getting the same functionality in two days with salt, and so decided to go with it instead. Some like salt because they can hack its Python code. Some use salt to provision pre-defined AMI images for the clouds (salt-ami-cloud-builder). Some chose salt over Ansible. Some want to use salt to pilot temporary computation clusters in the cloud (sort of like what StarCluster does with boto and ssh).

When Paul from Logilab introduced salt-ami-cloud-builder, Thomas Hatch said that some work is being done to go all the way: build an image from scratch from a state definition. On the question of Debian packaging, some effort could be made to get salt into wheezy-backports. Julien Cristau from Logilab, who is a Debian developer, might help with that.

Some untold stories were shared: some companies have replaced puppet with salt, some use salt to control an HPC cluster, and some use salt to pilot their existing puppet system.

We had some discussions around salt-cloud, which will probably be merged into salt at some point. One idea for salt-cloud was raised: have a way of defining a "minimum" type of configuration which translates into the profiles according to which provider is used (an issue should be added shortly). The expression "pushing states" was often used; it is probably a good way of looking at the combination of salt-cloud and the masterless mode available with salt-ssh. salt-cloud controls an existing cloud, but Thomas Hatch points to the fact that with salt-virt, salt is becoming a cloud controller itself. More on that soon.

Mixing 'public' and 'private' pillar definitions can be tricky. Some solutions exist using multiple gitfs (or mercurial) external pillar definitions, but more use cases will drive more flexible functionality in the future.

http://openworldforum.org/en/speakers/112/photo?s=220

Presentation and live demo

For those in the audience who were not (yet) users of salt, Thomas went back to explaining a few basics about it. Salt should be seen as a "toolkit to solve problems in an infrastructure", says Thomas Hatch. Why is it fast? It is completely asynchronous and event driven.

He gave a quick presentation about the new salt-ssh which was introduced in 0.17, which allows the application of salt recipes to machines that don't have a minion connected to the master.

The peer communication can be used to add a condition for a state on the presence of a service on a different minion.

While doing demos or even hacking on salt, one can use salt/test/minionswarm.py, which spawns fake minions; not everyone has hundreds of servers at their fingertips.

Modules are loaded dynamically and smartly: for example, the git module gets loaded if a state installs git and then, in the same highstate, uses the git module.

Thomas explained the difference between grains and pillars: grains are data about a minion that live on the minion, pillar is data about the minion that lives on the master. When handling grains, grains.setval can be useful (it writes to /etc/salt/grains as YAML, so you can edit it separately). If a minion is not reachable, one can obtain its grains information by replacing test=True with cache=True.
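
As a hedged sketch (the minion id and grain values are made up), the same things can be done from Python on the master through salt's client API:

import salt.client

client = salt.client.LocalClient()

# read all grains from every minion (equivalent to: salt '*' grains.items)
print(client.cmd('*', 'grains.items'))

# persist a custom grain on one minion; it ends up in /etc/salt/grains
client.cmd('minion1', 'grains.setval', ['role', 'webserver'])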

Thomas briefly presented saltstack-formulas: people want to "program" their states, and formulas answer this need, even if some of the jinja2 gets overly complicated in making them flexible and programmable.

While talking about the unified package commands (a salt command often has various backends according to what system runs the minion), for example salt-call --local pkg.install vim, Thomas told a funny story: ironically, salt was nominated for "best package manager" in some Linux magazine competition (so you don't have to learn how to use FreeBSD packaging tools).

While hacking salt, one can take a look at the event bus (see test/eventlisten.py); many applications are possible when using the data on this bus. Thomas talked about a future IOflow Python module where complex logic could be implemented in the reactor with rules and a state machine. One example use would be: if the load is high on X servers and the number of connections on them reaches Y, then launch extra machines.
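
As a minimal sketch (assuming a master running locally and salt's Python API), listening on the bus boils down to something like what test/eventlisten.py does:

import salt.utils.event

# connect to the master's event socket (default sock_dir)
event_bus = salt.utils.event.MasterEvent('/var/run/salt/master')

while True:
    data = event_bus.get_event(full=True)  # blocks until an event or a timeout
    if data:
        print(data)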

To finish on a buzzword, someone asked "what is the overlap of salt and docker?". The answer is not simple, but Thomas thinks that in the long run there will be a lot of overlap; one can check out the existing lxc modules and states.

Wrap up

To wrap up, Thomas announced a salt conference planned for January 2014 in Salt Lake City.

Logilab proposes to bootstrap the French community around salt. As the group suggested, this could take the form of a mailing list, an IRC channel, a meetup group, some sprints, or a combination of all the above. On that note, the next international sprint will probably take place in January 2014 around the salt conference.


Setup your project with cloudenvy and OpenStack

2013/10/03 by Arthur Lutz

One nice way of having a reproducible development or test environment is to "program" a virtual machine to do the job. If you have a powerful machine at hand you might use Vagrant in combination with VirtualBox. But if you have an OpenStack setup at hand (which is our case), you might want to setup and destroy your virtual machines on such a private cloud (or public cloud if you want or can). Sure, Vagrant has some plugins that should add OpenStack as a provider, but, here at Logilab, we have a clear preference for python over ruby. So this is where cloudenvy comes into play.

http://www.openstack.org/themes/openstack/images/open-stack-cloud-computing-logo-2.png

Cloudenvy is written in Python and, with some simple YAML configuration, can help you set up and provision virtual machines that contain your tests or your development environment.

http://www.python.org/images/python-logo.gif

Setup your authentication in ~/.cloudenvy.yml:

cloudenvy:
  clouds:
    cloud01:
      os_username: username
      os_password: password
      os_tenant_name: tenant_name
      os_auth_url: http://keystone.example.com:5000/v2.0/

Then create an Envyfile.yml at the root of your project:

project_config:
  name: foo
  image: debian-wheezy-x64

  # Optional
  #remote_user: ec2-user
  #flavor_name: m1.small
  #auto_provision: False
  #provision_scripts:
    #- provision_script.sh
  #files:
    # files copied from your host to the VM
    #local_file : destination

Now simply type envy up. Cloudenvy does the rest. It "simply" creates your machine, copies the files, runs your provision script and gives you its IP address. You can then run envy ssh if you don't want to be bothered with IP addresses and such nonsense (forget about copy and paste from the OpenStack web interface, or your nova show commands).

Little added bonus: you know your machine will run a web server on port 8080 at some point, so set it up in your environment by defining your access rules in the same Envyfile.yml:

sec_groups: [
    'tcp, 22, 22, 0.0.0.0/0',
    'tcp, 80, 80, 0.0.0.0/0',
    'tcp, 8080, 8080, 0.0.0.0/0',
  ]

As you might know (or I'll just recommend it), you should be able to scratch and restart your environment without losing anything, so once in a while you'll just run envy destroy to do so. You might want to have multiple VMs with the same specs; then go for envy up -n second-machine.

Only downside right now: cloudenvy isn't packaged for Debian (which is usually a prerequisite for the tools we use), but let's hope it gets some packaging soon (or maybe we'll end up doing it).

Don't forget to include this configuration in your project's version control so that a colleague starting on the project can just type envy up and have a working setup.

In the same order of ideas, we've been trying out salt-cloud <https://github.com/saltstack/salt-cloud>, since provisioning machines with SaltStack is the way forward. A blog post about this is coming next.


DebConf13 report

2013/09/25 by Julien Cristau

As announced before, I spent a week last month in Vaumarcus, Switzerland, attending the 14th Debian conference (DebConf13).

It was great to be at DebConf again, with lots of people I hadn't seen since New York City three years ago, and lots of new faces. Kudos to the organizers for pulling this off. These events are always a great boost for motivation, even if the amount of free time after coming back home is not quite as copious as I might like.

One thing that struck me this year was the number of upstream people, not directly involved in Debian, who showed up. From systemd's Lennart and Kay, to MariaDB's Monty, and people from upstart, dracut, phpmyadmin or munin. That was a rather pleasant surprise for me.

Here's a report on the talks and BoF sessions I attended. It's a bit long, but hey, the conference lasted a week. In addition to those I had quite a few chats with various people, including fellow members of the Debian release team.

http://debconf13.debconf.org/images/logo.png

Day 1 (Aug 11)

Linux kernel: Ben Hutchings made a summary of the features added between 3.2 in wheezy and the current 3.10, and their status in Debian (some still need userspace work).

SPI status: Bdale Garbee and Jimmy Kaplowitz explained what steps SPI is taking to deal with its growth, including recently getting help from a bookkeeper to relieve the pressure on the (volunteer) treasurer.

Hardware support in Debian stable: If you buy new hardware today, it's almost certainly not supported by the Debian stable release. Ideas to improve this:

  • backport whole subsystems: probably not feasible, risk of regressions would be too high
  • ship compat-drivers, and have the installer automatically install newer drivers based on PCI ids, seems possible.
  • mesa: have the GL loader pick a different driver based on the hardware, and ship newer DRI drivers for the new hardware, without touching the old ones. Issue: need to update libGL and libglapi too when adding new drivers.
  • X drivers, drm: ? (it's complicated)

Meeting between the release team and the DPL to figure out the next steps for jessie. Decided to schedule a BoF later in the week.

Day 2 (Aug 12)

Munin project lead on new features in 2.0 (shipped in wheezy) and roadmap for 2.2. Improvements on the scalability front (both in terms of number of nodes and number of plugins on a node). Future work includes improving the UI to make it less 1990 and moving some metadata to sql.

jeb on AWS and Debian: Amazon Web Services (AWS) includes compute (ec2), storage (s3), network (virtual private cloud, load balancing, ...) and other services. Used by Debian for package rebuilds. http://cloudfront.debian.net is a CDN frontend for archive mirrors. Official Debian images are on ec2, including on the AWS marketplace front page. The build-debian-cloud tool from Anders Ingeman et al. was presented.

openstack in Debian: Packaging work is focused on making things easy for newcomers, basic config with debconf. Advanced users are going to use puppet or similar anyway. Essex is in wheezy, but end-of-life upstream. Grizzly is available in sid and in a separate archive for wheezy. This work is sponsored by enovance.

Patents: http://patents.stackexchange.com; it looks like the USPTO has used comments made there when rejecting patent applications based on prior art. Patent applications are public, and it's a lot easier to get a patent application rejected than to invalidate a patent later on. Should we use that site? Help build momentum around it? Would other patent offices use that kind of research? Issues: looking at patent applications (and publicly commenting) might mean you're liable for treble damages if the patent is eventually granted? Can you comment anonymously?

Why systemd?: Lennart and Kay. Popcorn, upstart trolling, nothing really new.

Day 3 (Aug 13)

dracut: presented by Harald Hoyer, its main developer. Seems worth investigating as a replacement for initramfs-tools, sharing the maintenance load. Different hooks though, so we'll need to coordinate this with various packages.

upstart: More Debian-focused than the systemd talk. Not helped by Canonical's CLA...

dh_busfactor: debhelper has essentially been a one-man show from the beginning, though various packages/people maintain different dh_* tools either in the debhelper package itself or elsewhere. Joey is thinking about creating a debhelper team including those people. Concerns over increased breakage while people get up to speed (joeyh has 10 years of experience and still occasionally breaks stuff).

dri3000: Keith is trying to fix dri2 issues. While dri2 fixed a number of things that were wrong with dri1, it still has some problems. One of the goals is to improve presentation: we need a way to sync between app and compositor (to avoid displaying incompletely drawn frames), avoid tearing, and let the app choose immediate page flip instead of waiting for the next vblank if it missed its target (stutter in games is painful). He described this work on his blog.

security team BoF: explain the workflow, try to improve documentation of the process and what people can do to help. http://security.debian.org/

Day 4 (Aug 14)

day trip, and conference dinner on a boat from Neuchatel to Vaumarcus

Day 5 (Aug 15)

git-dpm: Spent half an hour explaining git, then was rushed to show git-dpm itself. Still, needs looking at. Lets you work with git and export changes as a quilt series to build a source package.

Ubuntu daily QA: The goal was to make it possible for Canonical devs (not necessarily people working on the distro) to use ubuntu+1 (the dev release). They tried syncing from testing for a while, but noticed bug fixes being delayed: not good. In the previous workflow the dev release was unusable/uninstallable for the first few months. Multiarch made things even more problematic because it requires amd64/i386 to be in sync.

  • 12.04: a bunch of manpower thrown at ubuntu+1 to keep backlog of technical debt under control.
  • 12.10: prepare infrastructure (mostly launchpad), add APIs, to make non-canonical people able to do stuff that previously required shell access on central machines.
  • 13.04: proposed migration. britney is used to migrate packages from devel-proposed to devel. A few teething problems at first, but good reaction.
  • 13.10 and beyond: autopkgtest runs triggered after upload/build, also for rdeps. Phased updates for stable releases (rolled out to a subset of users and then gradually generalized). Hook into errors.ubuntu.com to match new crashes with package uploads. Generally more continuous integration. Better dashboard. (Some of that is still to be done.)

Lessons learned from debian:

  • unstable's backlog can get bad → proposed is only used for builds and automated tests, no delay
  • transitions can take weeks at best
  • to avoid dividing human attention, devs are focused on devel, not devel-proposed

Lessons debian could learn:

  • keeping testing current is a collective duty/win
  • splitting users between testing and unstable has important costs
  • hooking automated testing into britney is really powerful; there's a small but growing number of automated tests

Ideas:

  • cut migration delay in half
  • encourage writing autopkgtests
  • end goal: make sid to testing migration entirely based on automated tests

Debian tests using Jenkins http://jenkins.debian.net

  • https://github.com/h01ger/jenkins-job-builder
  • Only running amd64 right now.
  • Uses jenkins plugins: git, svn, log parser, html publisher, ...
  • Has existing jobs for installer, chroot installs, others
  • Tries to make it easy to reproduce jobs, to allow debugging
  • {c,sh}ould add autopkgtests

Day 6 (Aug 16)

X Strike Force BoF: Too many bugs we can't do anything about: {mass,auto}-close them, asking people to report upstream. Reduce distraction by moving the non-X stuff to separate teams (compiz removed instead, wayland to discuss...). We should keep drivers as close to upstream as possible. A couple of people in the room volunteered to handle the intel, ati and input drivers.

reclass BoF

I had missed the talk about reclass, and Martin kindly offered to give a followup BoF to show what reclass can do.

Reclass provides adaptors for puppet(?), salt, ansible. A yaml file describes each host:

  • can declare applications and parameters
  • host is a leaf in a dag/tree of classes

Lets you put the data in reclass instead of the config management tool, keeping generic templates in ansible/salt.

I'm definitely going to try this and see if it makes it easier to organize data we're currently putting directly in salt states.

release BoF: Notes are on http://gobby.debian.org. Basic summary: "Releasing in general is hard. Releasing something as big/diverse/distributed as Debian is even harder." Who knew?

freedombox: status update from Bdale

Keith Packard showed off the free software he uses in his and Bdale's rocket adventures.

This was followed by a birthday party in the evening, as Debian turned 20 years old.

Day 7 (Aug 17)

x2go: Notes are on http://gobby.debian.org. To be solved: issues with nx libs (gpl fork of old x). Seems like a good thing to try as an alternative to LTSP, which we use at Logilab.

lightning talks

  • coquelicot (lunar) - one-click secure(ish) file upload web app
  • notmuch (bremner) - need to try that again now that I have slightly more disk space
  • fedmsg (laarmen) - GSoC, message passing inside the debian infrastructure

Debconf15 bids:

  • Mechelen/Belgium - Wouter
  • Germany (no city yet) - Marga

Debconf14 presentation: Will be in Portland (Portland State University) next August. Presentation by vorlon, harmoney, keithp. Looking forward to it!

Closing ceremony

The videos of most of the talks can be downloaded, thanks to the awesome work of the video team. And if you want to check what I didn't see or talk about, check the complete schedule.


JDEV2013 - Software development conference of CNRS

2013/09/14 by Nicolas Chauvat

I had the pleasure to be invited to lead a tutorial at JDEV2013 titled Learning TDD and Python in Dojo mode.

http://www.logilab.org/file/177427/raw/logo_JDEV2013.png

I quickly introduced the keywords with a single slide to keep it simple:

http://Python.org
+ Test Driven Development (Test, Code, Refactor)
+ Dojo (house of training: Kata / Randori)
= Calculators
  - Reverse Polish Notation
  - Formulas with Roman Numbers
  - Formulas with Numbers in letters

As you can see, I had three types of calculators, hence at least three Kata to practice, but as usual with beginners, it took us the whole tutorial to get done with the first one.

The room was a classroom that we set up as our coding dojo, with the coder and his copilot working on a laptop, facing the rest of the participants, with the large screen at their back. The pair-programmers could freely discuss with the people facing them, who were following the typing on the large screen.

We switched every ten minutes: the copilot became coder, the coder went back to his seat in the class, and someone else stood up to become the copilot.

The session was allocated 3 hours split over two slots of 1h30. It took me less than 10 minutes to open the session with the above slide, 10 minutes as first coder and 10 minutes to close it. Over a time span of 3 hours, that left 150 minutes for coding, hence 15 people. Luckily, the whole group was about that size and almost everyone got a chance to type.

I completely skipped explaining Python, its syntax and the unittest framework, and we jumped right into writing our first tests with if and print statements. Since they knew other programming languages, they picked up the Python language on the way.

After more than an hour of slowly discovering Python and TDD, someone in the room realized they had been focusing more on handling exception cases and failures than on implementing the parsing and computation of the formulas, because the specifications were not clearly understood. He then asked me the right question by trying to define Reverse Polish Notation in one sentence and checking that he got it right.

Different algorithms to parse and compute RPN formulas were devised at the blackboard during the pause, while part of the group went for a coffee break.

The implementation took about another hour to get right, with me making sure they would not wander too far from the actual goal. Once the stack-based solution was found and implemented, I asked them to delete the files, switch coder and start again. They had forgotten about the Kata definition and were surprised, but quickly enjoyed it when they realized that progress was much faster on the second attempt.

Since it is always better to show that you can walk the talk, I closed the session by practicing the RPN calculator kata myself in a bit less than 10 minutes. The order in which to write the tests is the tricky part, because it can easily appear far-fetched for such a small problem when you already know an algorithm that solves it.

Here it is:

import operator

OPERATORS = {'+': operator.add,
             '*': operator.mul,
             '/': operator.div,   # Python 2; use operator.truediv on Python 3
             '-': operator.sub,
             }

def compute(args):
    items = args.split()
    stack = []
    for item in items:
        if item in OPERATORS:
            # pop the two topmost operands; note the order: b was pushed last
            b, a = stack.pop(), stack.pop()
            stack.append(OPERATORS[item](a, b))
        else:
            stack.append(int(item))
    return stack[0]

with the accompanying tests:

import unittest
from npi import compute

class TestTC(unittest.TestCase):

    def test_unit(self):
        self.assertEqual(compute('1'), 1)

    def test_dual(self):
        self.assertEqual(compute('1 2 +'), 3)

    def test_tri(self):
        self.assertEqual(compute('1 2 3 + +'), 6)
        self.assertEqual(compute('1 2 + 3 +'), 6)

    def test_precedence(self):
        self.assertEqual(compute('1 2 + 3 *'), 9)
        self.assertEqual(compute('1 2 * 3 +'), 5)

    def test_zerodiv(self):
        self.assertRaises(ZeroDivisionError, compute, '10 0 /')

unittest.main()

Apparently it did not go too badly, since I had positive comments at the end from people who enjoyed discovering, in a single session, Python, Test Driven Development and the Dojo mode of learning.

I had fun doing this tutorial and thank the organizers of the conference!


Going to EuroScipy2013

2013/09/04 by Alain Leufroy

The EuroScipy2013 conference was held in Brussels at the Université libre de Bruxelles.

http://www.logilab.org/file/175984/raw/logo-807286783.png

As usual, the first two days were dedicated to tutorials while the last two were dedicated to scientific presentations and general Python-related talks. The meeting was extended by one more day for sprint sessions, during which enthusiasts were able to help free software projects, namely sage, vispy and scipy.

Jérôme and I had the great opportunity to represent Logilab during the scientific tracks and the sprint day. We enjoyed many talks about scientific applications using python. We're not going to describe the whole conference. Visit the conference website if you want the complete list of talks. In this article we will try to focus on the ones we found the most interesting.

First of all, the keynote by Cameron Neylon about Network ready research was very interesting. He presented some graphs about the impact of group work on solving complex problems. They revealed that there is a critical network size at which the effectiveness at solving a problem drastically increases. He pointed out that source code accessibility "friction" limits the "getting help" variable. Open sourcing software could be the best way to reduce this "friction", while unit testing and continuous integration are facilitators. And, in general, process reproducibility is very important, not only in computing research. Retrieving experimental settings, metadata, and process environment is vital. We agree with this, as we experience it every day in our work. That is why we encourage open source licenses and develop Simulagora (in French), a collaborative platform for distributed simulation traceability and reproducibility.

Ian Ozsvald's talk dealt with key points and tips from his own experience of growing a business based on open source and Python, as well as mistakes to avoid (e.g. not checking beforehand that there are paying customers interested in what you want to develop). His talk was comprehensive and covered a wide panel of situations.

http://vispy.org/_static/img/logo.png

We got a very nice presentation of a young but interesting visualization tool: Vispy. It is 6 months old and the first public release came out early August. It is the result of the merge of 4 separate libraries, oriented toward interactive visualisation (vs. static figure generation for Matplotlib) and using OpenGL on GPUs to avoid CPU overload. A demonstration with large datasets showed vispy displaying millions of points in real time at 40 frames per second. During the talk we got interesting information about OpenGL features like anti-grain, compared to Matplotlib's Agg backend which uses the CPU.

We also got to learn about cartopy, an open source Python library originally written for weather and climate science. It provides a useful and simple API to manipulate cartographic mappings.
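
For the record, here is a minimal hedged example of what the cartopy API looks like (adapted from its documentation, not from the talk):

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# draw the world's coastlines in a Plate Carree projection
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
plt.show()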

Distributed computing systems were a hot topic, and many talks were related to this theme.

https://www.openstack.org/themes/openstack/images/openstack-logo-preview-full-color.png

Gael Varoquaux reminded us of the key problems with "biggish data" and the key points for processing it successfully. I think some of his recommendations are generally useful, like "choose simple solutions", "fail gracefully", "make it easy to debug". For big data processing, when I/O is the limiting constraint, first try to split the problem into random fractions of the data, then run the algorithms and aggregate the results to circumvent this limit. He also presented mini-batch, which takes a bunch of observations (a trade-off between memory usage and vectorization), and joblib.parallel, which makes I/O faster using compression (CPUs are faster than disk access).
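
As a hedged illustration of the split-then-aggregate pattern with joblib (toy data, not from the talk):

from joblib import Parallel, delayed

data = range(1000000)
# split the problem into fractions of the data...
chunks = [data[i::4] for i in range(4)]

def partial_sum(chunk):
    return sum(chunk)

# ...process them in parallel, then aggregate the results
results = Parallel(n_jobs=4)(delayed(partial_sum)(c) for c in chunks)
print(sum(results))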

Benoit Da Mota talked about shared memory in parallel computing, and Antonio Messina gave us a quick overview of how to build a computing cluster with Elasticluster, using OpenStack/Slurm/ansible. He demonstrated starting and stopping a cluster on OpenStack: once all VMs are started, ansible configures them as hosts of the cluster, and new VMs can be created and added to the cluster on the fly thanks to a command line interface.

We also got a keynote by Peter Wang (from Continuum Analytics) about the future of data analysis with Python. As a PhD in physics I loved his metaphor of giving mass to data. He tried to explain the pain that scientists have when using databases.

https://scikits.appspot.com/static/images/scipyshiny_small.png

After the conference we participated in the numpy/scipy sprint, organized by Ralf Gommers and Pauli Virtanen. There were 18 people trying to close issues of various difficulty levels, and we got a quick tutorial on how easy it is to contribute: the easiest way is to fork the project from its github page to your own github account (you can create one for free), so that later your patch submission will be a simple "Pull Request" (PR). Clone your scipy fork locally, and make a new branch (git checkout -b <newbranch>) to tackle one specific issue. Once your patch is ready, commit it locally, push it to your github repository, and from the github interface open a "Pull Request". You will be able to add something to your commit message before your PR is sent and looked at by the project's lead developers. For example, using "gh-XXXX" in your commit message will automatically add a link to issue no. XXXX. Here is the list of open issues for scipy; you can filter them, e.g. displaying only the ones considered easy to fix :D

For more information: Contributing to SciPy.


Emacs turned into a IDE with CEDET

2013/08/29 by Anthony Truchet

Abstract

In this post you will find one way, namely thanks to CEDET, of turning your Emacs into an IDE offering features for semantic browsing and refactoring assistance similar to what you can find in major IDEs like Visual Studio or Eclipse.

Introduction

Emacs is a tool of choice for the developer: it is very powerful, highly configurable and has a wealth of so-called modes to improve many aspects of daily work, especially when editing code.

The point, as you might have realised if you have already worked with an IDE like Eclipse or Visual Studio, is that Emacs' (code) browsing abilities are quite rudimentary... at least out of the box!

In this post I will walk through one way to configure Emacs + CEDET which works for me. This is by far not the only way to get there, but finding this path required several days of wandering between inconsistent resources, distribution pitfalls and the like.

I will try to convey the relevant parts of what I have learnt on the way, to warn about some pitfalls, and also to indicate some interesting directions I haven't followed (be it by choice or necessity) and encourage you to try them. Should you push this adventure further, your experience will be very much appreciated... and in any case your feedback on this post is also very welcome.

The first part gives some background deemed useful to understand what's going on. If you want to go straight to the how-to, please jump directly to the second part.

Sketch map of the jungle

This all started because I needed a development environment to do work remotely on a big, legacy C++ code base from quite a lightweight machine and a weak network connection.

My former habit of using Eclipse CDT and compiling locally was not an option any longer, but I couldn't stick to a bare text editor plus remote compilation either, because of the complexity of the code base. So I googled "emacs IDE code browser" and started this journey to set up CEDET + ECB...

I quickly got lost in a jungle of seemingly inconsistent options, and I reckon some background facts are welcome at this point to explain why.

As of this date (Sept. 2013), most of the world is in-between two major releases of Emacs. Whereas Emacs 23.x is still packaged in many stable Linux distributions, the latest release is Emacs 24.3. In this post we will use Emacs 24.x, which brings lots of improvements, two of which are really relevant to us:

  • the introduction of a package manager, which is great and (but) changes initialisation
  • the partial integration of some version of CEDET into Emacs since version 23.2

Emacs 24 initialisation

Very basically, Emacs used to read the user's Emacs config (~/.emacs or ~/.emacs.d/init.el) which was responsible for adapting the load-path and issuing the right (require 'stuff) commands and configuring each library in some appropriate sequence.

Emacs 24 introduces ELPA, a new package system and official package repository. It can be extended with other package repositories such as Marmalade or MELPA.

By default in Emacs 24, the initialisation order is a bit more complex due to package loading: the user's config is still read but should NOT require the libraries installed through the package system: those are loaded automatically (the former load-path adjustment and (require 'stuff) steps) after ~/.emacs or ~/.emacs.d/init.el has finished. This makes configuring the loaded libraries much more error-prone, especially for libraries designed to be configured the old way (as of today most libraries, notably CEDET).

Here is a good analysis of the situation and the possible options. And for those interested in the details of the new initialisation process, see the corresponding sections of the Emacs manual.

I first tried to stick to the new way, setting up hooks in ~/.emacs.d/init.el to be called after loading the various libraries, each library having its own configuration hook, and praying for the interaction between the package manager load order and my hooks to be OK... in vain. So I ended up forcing the initialisation back to the old way (see Emacs 24 below).

What is CEDET?

CEDET is a Collection of Emacs Development Environment Tools. The key word here is collection: do not expect it to be an integrated environment. The main components of (or coupled with) CEDET are:

Semantic
Extracts a common semantic representation from source code in different languages
(e)ctags / GNU global
Traditional (exuberant) CTags or GNU global can be used as a source of information for Semantic
SemanticDB
SemanticDB provides for caching the outcome of semantic analysis in some database to reduce analysis overhead across several editing sessions
Emacs Code Browser
This component uses information provided by Semantic to offer a browsing GUI with windows for traversing files, classes, dependencies and the like
EDE
This provides a notion of project analogous to most IDEs. Even if the features related to building projects are very Emacs/Linux/Autotools-centric (and thus not necessarily very helpful depending on your project setup), the main point of EDE is providing scoping of source code for Semantic to analyse, and include-path customisation at the project level.
AutoComplete
This is not part of CEDET but Semantic can be configured as a source of completions for auto-complete to propose to the user.
and more...
Senator, SRecode, Cogre, Speedbar, EIEIO and EAssist are other components of CEDET I've not looked at yet.

To add some more complexity, CEDET itself is also undergoing heavy changes and is in-between major versions. The last standalone release is 1.1, but it has the old source layout and activation method. The current head of development says it is version 2.0, has the new layout and activation method, plus some more features, but is not released yet.

Integration of CEDET into Emacs

Since Emacs 23.2, CEDET is built into Emacs. More exactly, parts of some version of new CEDET are built into Emacs, but of course this built-in version is older than the current head of new CEDET... As for the notable parts not built into Emacs, ECB is the most prominent! But it is packaged in Marmalade in a recent version following the head of development closely, which mitigates the inconvenience.

My first choice was using the built-in CEDET with ECB installed from the package repository: the installation was perfectly smooth, but I was not able to configure the whole cleanly enough to get proper operation. Although I tried hard, I could not get Semantic to take into account the include paths I had configured using my EDE project, for example.

I would strongly encourage you to try this way first, as it is supposed to require much less setup effort and less maintenance. Should you succeed, I would greatly appreciate some feedback on your experience!

As for me, I got down to installing the latest version from the source repositories, following Alex Ott's advice as closely as possible and using his own fork of ECB to make it compliant with the most recent CEDET.

How to set up CEDET + ECB in Emacs 24

Emacs 24

Install Emacs 24 as you wish; I will not cover the various options here but simply summarise the local install from sources that I chose.

  1. Get the source archive from http://ftpmirror.gnu.org/emacs/
  2. Extract it somewhere and run the usual steps (or see the INSTALL file): configure --prefix=~/local, make, make install

Create your personal Emacs directory and configuration file, ~/.emacs.d/site-lisp/ and ~/.emacs.d/init.el, and put this inside the latter:

;; this is intended for manually installed libraries
(add-to-list 'load-path "~/.emacs.d/site-lisp/")

;; load the package system and add some repositories
(require 'package)
(add-to-list 'package-archives
             '("marmalade" . "http://marmalade-repo.org/packages/"))
(add-to-list 'package-archives
             '("melpa" . "http://melpa.milkbox.net/packages/") t)

;; Install a hook running post-init.el *after* initialization took place
(add-hook 'after-init-hook (lambda () (load "post-init.el")))

;; Do here basic initialization, (require) non-ELPA packages, etc.

;; disable automatic loading of packages after init.el is done
(setq package-enable-at-startup nil)
;; and force it to happen now
(package-initialize)
;; NOW you can (require) your ELPA packages and configure them as normal

Useful Emacs packages

Using the Emacs commands M-x package-list-packages interactively, or M-x package-install <package name>, you can install many packages easily.

Choose your own! I just recommend against installing ECB or other CEDET packages, since we are going to install those from source.

You can also insert or load your usual Emacs configuration here; simply beware of configuring ELPA, Marmalade et al. packages after (package-initialize).

CEDET

  • Get the source and put it under ~/.emacs.d/site-lisp/cedet-bzr. You can either download a snapshot from http://www.randomsample.de/cedet-snapshots/ or check it out of the bazaar repository with:

    ~/.emacs.d/site-lisp$ bzr checkout --lightweight \
    bzr://cedet.bzr.sourceforge.net/bzrroot/cedet/code/trunk cedet-bzr
    
  • Run make (and optionally make install-info) in cedet-bzr, or see the INSTALL file for more details.

  • Get Alex Ott's minimal CEDET configuration file and save it to ~/.emacs.d/config/cedet.el, for example

  • Adapt it to your system by editing the first lines as follows

    (setq cedet-root-path
        (file-name-as-directory (expand-file-name
            "~/.emacs.d/site-lisp/cedet-bzr/")))
    (add-to-list 'Info-directory-list
            "~/projects/cedet-bzr/doc/info")
    
  • Don't forget to load it from your ~/.emacs.d/init.el:

    ;; this is intended for configuration snippets
    (add-to-list 'load-path "~/.emacs.d/")
    ...
    (load "config/cedet.el")
    
  • restart your emacs to check everything is OK; the --debug-init option is of great help for that purpose.

ECB

  • Get Alex Ott's ECB fork into ~/.emacs.d/site-lisp/ecb-alexott:

    ~/.emacs.d/site-lisp$ git clone --depth 1  https://github.com/alexott/ecb/
    
  • Run make in ecb-alexott and see the README file for more details.

  • Don't forget to load it from your ~/.emacs.d/init.el:

    (add-to-list 'load-path (expand-file-name
          "~/.emacs.d/site-lisp/ecb-alexott/"))
    (require 'ecb)
    ;(require 'ecb-autoloads)
    

    Note

    You can theoretically use (require 'ecb-autoloads) instead of (require 'ecb) in order to load ECB on demand. I encountered various misbehaviours trying this option and finally dropped it, but I encourage you to try it and comment on your experience.

  • restart your emacs to check everything is OK (you probably want to use the --debug-init option).

  • Create a hello.cpp with your CEDET-enabled Emacs and type M-x ecb-activate to check that ECB is actually installed.

Tune your configuration

Now it is time to tune your configuration. There is no good recipe from here onward... but I'll try to propose some snippets below. Some of them are adapted from Alex Ott's personal configuration.

More Semantic options

You can use the following lines just before (semantic-mode 1) to add to the activated features list:

(add-to-list 'semantic-default-submodes 'global-semantic-decoration-mode)
(add-to-list 'semantic-default-submodes 'global-semantic-idle-local-symbol-highlight-mode)
(add-to-list 'semantic-default-submodes 'global-semantic-idle-scheduler-mode)
(add-to-list 'semantic-default-submodes 'global-semantic-idle-completions-mode)

You can also load additional capabilities with those lines after (semantic-mode 1):

(require 'semantic/ia)
(require 'semantic/bovine/gcc) ; or depending on you compiler
; (require 'semantic/bovine/clang)

Auto-completion

If you want to use auto-complete, you can tell it to interface with Semantic by configuring it as follows (where AAAAMMDD.rrrr is the date.revision suffix of the version of auto-complete installed by your package manager):

;; Autocomplete
(require 'auto-complete-config)
(add-to-list 'ac-dictionary-directories (expand-file-name
             "~/.emacs.d/elpa/auto-complete-AAAAMMDD.rrrr/dict"))
(setq ac-comphist-file (expand-file-name
             "~/.emacs.d/ac-comphist.dat"))
(ac-config-default)

and activate it in your cedet hook, for example:

...
;; customisation of modes
(defun alexott/cedet-hook ()
...
    (add-to-list 'ac-sources 'ac-source-semantic)
) ; defun alexott/cedet-hook ()

Support for GNU global and/or (e)ctags

;; if you want to enable support for gnu global
(when (cedet-gnu-global-version-check t)
  (semanticdb-enable-gnu-global-databases 'c-mode)
  (semanticdb-enable-gnu-global-databases 'c++-mode))

;; enable ctags for some languages:
;;  Unix Shell, Perl, Pascal, Tcl, Fortran, Asm
(when (cedet-ectag-version-check)
  (semantic-load-enable-primary-exuberent-ctags-support))

Using CEDET for development

Once CEDET + ECB + EDE is up, you can start using it for actual development. How to actually use it is beyond the scope of this already too long post; I can only invite you to have a look at the CEDET and ECB documentation.

Conclusion

CEDET provides an impressive set of features, both to allow your Emacs environment to "understand" your code and to provide powerful interfaces to this "understanding". It is probably one of the very few solutions for working with a complex C++ code base if you can't or don't want to use a heavyweight IDE like Eclipse CDT.

But being highly configurable also means, at least for now, some lack of integration, or at least a pretty complex configuration. I hope this post will help you take your first steps with CEDET and find your way to setting it up and configuring it to your own taste.


Pylint 1.0 released!

2013/08/06 by Sylvain Thenault

Hi there,

I'm very pleased to announce, after 10 years of existence, the 1.0 release of Pylint.

This release has one hell of a long ChangeLog, thanks to many contributions and to the 10th anniversary sprint we hosted during June. More details about the changes below.

Chances are high that your Pylint score will go down with this new release, which includes a lot of new checks :) Also, there are a lot of improvements on the Python 3 side (notably 3.3 support, which was somewhat broken).

You may download and install it from PyPI or from Logilab's Debian repositories. Notice that Pylint has been updated to use the new Astroid library (formerly known as logilab-astng) and that the logilab-common 0.60 library includes some fixes necessary for using Pylint with Python 3, as well as long-awaited support for namespace packages.

For those interested, below is a comprehensive list of what changed:

Command line and output formatting

  • A new --msg-template option to control output, deprecating the "msvc" and "parseable" output formats as well as killing the --include-ids and --symbols options.
  • Fix spelling of max-branchs option, now max-branches.
  • Start promoting the usage of symbolic names instead of numerical ids.

New checks

  • "missing-final-newline" (C0304) for files missing the final newline.
  • "invalid-encoded-data" (W0512) for files that contain data that cannot be decoded with the specified or default encoding.
  • "bad-open-mode" (W1501) for calls to open (or file) that specify invalid open modes (Original implementation by Sasha Issayev).
  • "old-style-class" (C1001) for classes that do not have any base class.
  • "trailing-whitespace" (C0303) that warns about trailing whitespace.
  • "unpacking-in-except" (W0712) about unpacking exceptions in handlers, which is unsupported in Python 3.
  • "old-raise-syntax" (W0121) for the deprecated syntax raise Exception, args.
  • "unbalanced-tuple-unpacking" (W0632) for unbalanced unpacking in assignments (bitbucket #37).

Enhanced behaviours

  • Do not emit [fixme] for every line if the config value 'notes' is empty
  • Emit warnings about lines exceeding the column limit when those lines are inside multiline docstrings.
  • Name check enhancement:
    • simplified message,
    • don't double-check parameter names with the regex for parameters and inline variables,
    • don't check names of derived instance class members,
    • methods that are decorated as properties are now treated as attributes,
    • names in global statements are now checked against the regular expression for constants,
    • for toplevel name assignments, the class name regex will be used if pylint can detect that the value on the right-hand side is a class (like collections.namedtuple()),
    • add new name type 'class_attribute' for attributes defined in class scope. By default, allow both const and variable names.
  • Add a configuration option for missing-docstring to optionally exempt short functions/methods/classes from the check.
  • Add the type of the offending node to missing-docstring and empty-docstring.
  • Do not warn about redefinitions of variables that match the dummy regex.
  • Do not treat all variables starting with "_" as dummy variables, only "_" itself.
  • Make the line-too-long warning configurable by adding a regex for lines for which the length limit should not be enforced.
  • Do not warn about a long line if a pylint disable option brings it above the length limit.
  • Do not flag names in nested with statements as undefined.
  • Remove string module from the default list of deprecated modules (bitbucket #3).
  • Fix incomplete-protocol false positive for read-only containers like tuple (bitbucket #25).

Other changes

  • Support for pkgutil.extend_path and setuptools pkg_resources (logilab-common #8796).
  • New utility classes for per-checker unittests in testutils.py
  • Added a new base class and interface for checkers that work on the tokens rather than the syntax, and only tokenize the input file once.
  • epylint shouldn't hang anymore when there is a large output on pylint's stderr (bitbucket #15).
  • Put back documentation in source distribution (bitbucket #6).

Astroid

  • New API to make it smarter by allowing transformation functions on any node, providing a register_transform function on the manager instead of register_transformer, to make it more flexible wrt node selection.
  • Use this new transformation API to provide support for namedtuple (actually in pylint-brain, logilab-astng #8766)
  • Better description of hashlib
  • Properly recognize methods annotated with abc.abstract{property,method} as abstract.
  • Added the test_utils module for building ASTs and extracting deeply nested nodes for easier testing.
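
As a hedged sketch of what the new transformation API enables (the 'Point' predicate and the docstring tweak are made-up examples, not astroid code):

from astroid import MANAGER, nodes

def add_note(cls):
    """Hypothetical transform applied to class nodes named 'Point'."""
    cls.doc = (cls.doc or '') + ' (seen by our transform)'

# register_transform replaces register_transformer and may target any
# node type, optionally filtered by a predicate
MANAGER.register_transform(nodes.Class, add_note,
                           lambda node: node.name == 'Point')
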

Astroid 1.0 released!

2013/08/02 by Sylvain Thenault

Astroid is the new name of the former logilab-astng library. It's an AST library, used as the basis of Pylint, featuring a Python 2.5 -> 3.3 compatible tree representation, static type inference and other features useful for advanced Python code analysis, such as an API to provide extra information when static inference can't overcome Python's dynamic nature (see the pylint-brain project for instance).

It has been renamed and moved to Bitbucket to make clear that this is not a Logilab-only project but a community project that could benefit anyone manipulating Python code (static analysis tools, IDEs, browsers, etc).

The documentation is a bit rough but should quickly improve. A dedicated web site is also now online: visit www.astroid.org (or https://bitbucket.org/logilab/astroid for development).

You may download and install it from PyPI or from Logilab's Debian repositories.


Going to DebConf13

2013/08/01 by Julien Cristau

The 14th Debian developers conference (DebConf13) will take place between August 11th and August 18th in Vaumarcus, Switzerland.

Logilab is a DebConf13 sponsor, and I'll attend the conference. There are quite a lot of cloud-related events on the schedule this year, plus the usual impromptu discussions and hallway track. Looking forward to meeting the usual suspects there!

https://www.logilab.org/file/158611/raw/dc13-btn0-going-bg.png

We hosted the Salt Sprint in Paris

2013/07/30 by Arthur Lutz

Last Friday, we hosted the French event for the international Great Salt Sprint. Here is a report on what was done and discussed on this occasion.

http://saltstack.com/images/SaltStack-Logo.png

We started off by discussing various points that were of interest to the participants:

  • automatically write documentation from salt sls files (for Sphinx)
  • salt-mine: add a security layer with restricted access (bugs #5467 and #6437)
  • test compatibility of salt-cloud with openstack
  • module bridge bug correction: traceback on KeyError
  • setting up the network in Debian (equivalent of rh_ip)
  • configure an existing monitoring solution through salt (add machines, add checks, etc.) on various backends with a common syntax

We then split up into pairs to tackle issues in small groups, with some general discussions from time to time.

6 people participated: 5 from Logilab, 1 from nbs-system. We were expecting more participants, but some couldn't make it at the last minute, or thought the sprint was taking place at some other time.

Unfortunately we had a major electricity blackout all afternoon; some of us switched to battery and 3G tethering to carry on, but that couldn't last all afternoon. We ended up talking about design and use cases. ERDF (the French electricity distribution company) ended up bringing generator trucks for the neighborhood!

Arthur & Benoit: monitoring, adding machines or checks

http://www.logilab.org/file/157971/raw/salt-centreon-shinken.png

Some unfinished draft code for supervision backends was written and pushed to github. We explored how a common "interface" could be done in salt (using a combination of states and __virtual__). The official documentation was often very useful, and reading code was also always a good resource (and the code is really readable).

While we were fixing stuff because of the power blackout, Benoit submitted a bug fix.

David & Alain: generating documentation from salt states & the salt master

The idea is to couple the SLS description and the current state of the salt master to generate documentation about one's infrastructure using Sphinx. This was transmitted to the mailing-list.

http://www.logilab.org/file/157976/raw/salt-sphinx.png

Design was done around which information should be extracted and displayed, and how to configure access control to the salt-master; taking a further look at external_auth and salt-api will probably be the way forward.

General discussions

We had general discussions around concepts of access control to a salt master, and how to define this access. One of the things we believe is missing (but haven't checked thoroughly) is the ability to separate the "read-only" operations from the "read-write" operations in states and modules; if this were done (through decorators? see the sketch below), we could easily tell salt-api to only give access to data collection. Complex access scenarios were discussed. Having a configuration or external_auth based on ssh public keys (similar to mercurial-server) would be nice, and would provide a "limited" shell, the way mercurial-server does for a mercurial server.
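
To make the decorator idea concrete, here is a purely hypothetical sketch (this is not salt code) of tagging module functions as read-only:

def read_only(func):
    """Mark an execution-module function as safe for data collection."""
    func.read_only = True
    return func

@read_only
def status():
    return {'load': 0.1}

def restart(service):  # unmarked: a "read-write" operation
    pass

# an API layer could then expose only the functions carrying the marker
exposed = [f for f in (status, restart) if getattr(f, 'read_only', False)]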

Conclusion

The power blackout didn't help us get things done, but nevertheless some sharing happened around our use cases for SaltStack and the features we'd want to get out of it (or from third-party applications). We hope to convert all the discussions into bug reports or further discussion on the mailing-lists, and (obviously) into code and pull requests. Check out the scoreboard for an overview of how the other cities contributed.



The Great Salt Sprint Paris Location is Logilab

2013/07/12 by Nicolas Chauvat
http://farm1.static.flickr.com/183/419945378_4ead41a76d_m.jpg

We're happy to be part of the second Great Salt Sprint that will be held at the end of July 2013. We will be hosting the French sprinters on Friday 26th in our offices in the center of Paris.

The focus of our Logilab team will probably be Test-Driven System Administration with Salt, but the more participants and topics, the merrier the event.

Please register if you plan on joining us. We will be happy to meet with fellow hackers.

photo by Sebastian Mary under creative commons licence.


PyLint 10th anniversary 1.0 sprint: day 3 - Sprint summary

2013/06/20 by Sylvain Thenault

Yesterday was the third and last day of the 10th anniversary Pylint sprint in Logilab's Toulouse office.

Design

To get started, we took advantage of this last day to have a few discussions about:

  • A "mode" feature gpylint has. It turns out that behind perhaps a few implementation details, this is something we definitly want into pylint (mode are specific configurations defined in the pylintrc and easilly recallable, they may even be specified per file).

  • How to avoid conflicts in the ChangeLog by using specific instructions in the commit message. We decided that a commit message should look like

    [my checker] do this and that. Closes #1234
    
    bla bla bla
    
    :release note: this will be a new item in the ChangeLog
    
    as well as anything until the end of the message
    

    Now someone has to write the ChangeLog generation script so we can use this for post-1.0 releases.

  • The roadmap. More on this later in this post.

Code

When we were not discussing, we were coding!

  • Anthony worked on having a template for the text reporter. His patch is available on Bitbucket but not yet integrated.
  • Julien and David pushed a bunch of patches on logilab-common, astroid and pylint for Python 3.3 support. Not all tests are green on the pylint side, but much progress was made.
  • A couple of other things were fixed, like a better "invalid name" message, no longer complaining about the string module being deprecated, etc.
  • A lot of patches have been integrated, from gpylint and others (e.g. Python 3 related).

All in all, an impressive amount of work was achieved during this sprint:

  • A lot of new checks or enhanced behaviours backported from gpylint (take a look at Pylint's ChangeLog for more details; the list is impressively long).
  • The transformation API of astroid now allows customizing the tree structure as well as the inference process, hence making pylint smarter than ever.
  • Better Python 3 support.
  • A few bugs fixed and some enhancements added.
  • The templating stuff should land with the CLI cleanup (some output-formats will be removed, as well as the --include-ids and --symbols options).
  • A lot of discussions, especially regarding the future community development of pylint/astroid on Bitbucket. Short summary: more contributors and integrators are welcome! We should write a note somewhere describing how we use Bitbucket's pull requests and tracker.

Plan

Now here is the 1.0 roadmap, which is expected by the beginning of July:

  • Green tests under Python 3, including specification of Python version in message description (Julien).
  • Finish template for text reporters (Anthony).
  • Update web site (David).

And for later releases:

  • Backport mode from gpylint (Torsten).
  • Write ChangeLog update script (Sylvain).

So many thanks to everyone for this very successful sprint. I'm excited about this forthcoming 1.0 release!


PyLint 10th anniversary 1.0 sprint: day 2

2013/06/18 by Sylvain Thenault

Today was the second day of the 10th anniversary Pylint sprint in Logilab's Toulouse office.

This morning, we started with a presentation by myself about how the inference engine works in astroid (former astng). Then we started thinking all together about how we should change its API to be able to plug in more information during the inference process. The first use case we wanted to address was namedtuple, as explained in http://www.logilab.org/ticket/8796.

We ended up addressing it by:

  • enhancing the existing transformation feature so one may register a transformation function on any node rather than on a module node only;
  • being able to specify, on a node instance, a custom inference function to use instead of the default (class) implementation.

We would then be able to customize both the tree structure and the inference process and so to resolve the cases we were targeting.

Once this was sufficiently sketched out, everyone got his own tasks to do. Here is a quick summary of what has been achieved today:

  • Anthony resumed the check_messages work and finished it for the simple cases, then started on a template for text reporters,
  • Julien and David made a lot of progress on Python 3.3 compatibility, though not enough to get a fully green test suite,
  • Torsten continued backporting stuff from gpylint, all of it integrated by the end of the day,
  • Sylvain implemented the new transformation API and got the namedtuple proof of concept working, and even wrote some documentation! Now this has to be tested on more real-world uses.

So things are going really well. See you tomorrow for even more improvements to pylint!


PyLint 10th anniversary 1.0 sprint: day 1

2013/06/17 by Sylvain Thenault

Today was the first day of the Pylint sprint we organized using Pylint's 10th anniversary as an excuse.

So I (Sylvain) welcomed my fellow Logilab friends David, Anthony and Julien, as well as Torsten from Google, to Logilab's new Toulouse office.

After a bit of presentation and talk about Pylint development, we decided to keep discussions for lunch and dinner and to set up priorities. We ended up with the following tasks (picked from the pad at http://piratepad.net/oAvsUoGCAC):

  • rename astng to move it outside the logilab package,
  • review of Torsten's gpylint (Google Pylint) patches, as many as possible (but not all of them, starting with a review of the numerous internal checks Google has, seeing one by one which should be backported upstream),
  • setuptools namespace package support (https://www.logilab.org/8796),
  • python 3.3 support,
  • enhance the astroid (former astng) API to allow more ad-hoc customization, for a better grasp of the magic occurring in e.g. web frameworks (protocol buffers or SQLAlchemy may also be an application of this).

Regarding the astng renaming, we decided to go with astroid, as indicated by the survey on StellarSurvey.com.

In the afternoon, David and Julien tackled this, while Torsten was extracting patches from Google code and sending them to bitbucket as pull requests, Sylvain was embracing setuptools namespace packages, and Anthony was discovering the code to spread the @check_messages decorator usage.

By the end of the day:

  • David and Julien submitted patches to rename logilab.astng; they were quickly integrated, and now https://bitbucket.org/logilab/astroid should be used instead of https://bitbucket.org/logilab/astng
  • Torsten submitted 5 pull requests with code extracted from gpylint; we reviewed them together, then Torsten used evolve to properly insert them into the pylint history once review comments were integrated
  • Sylvain submitted 2 patches on logilab-common to support both setuptools namespace packages and pkgutil.extend_path (but not bare __path__ manipulation)
  • Anthony discovered various checkers and started adding proper @check_messages decorators on visit methods

After doing some review all together, we even had some time to take a look at Python 3.3 support while writing this summary.

Hopefully, our work on the forthcoming days will be as efficient as on this first day!


About salt-ami-cloud-builder

2013/06/07 by Paul Tonelli

What

At Logilab we are big fans of SaltStack; we use it quite extensively to centralize, configure and automate deployments.

http://www.logilab.org/file/145398/raw/SaltStack-Logo.png

We've talked on this blog about how to build a Debian AMI "by hand" and we wanted to automate this fully. Hence salt seemed to be the obvious way to go.

So we wrote salt-ami-cloud-builder. It is mainly glue between existing pieces of software that we use and like. If you already have some definition of a type of host that you provision using salt-stack, salt-ami-cloud-builder should be able to generate the corresponding AMI.

http://www.logilab.org/file/145397/raw/open-stack-cloud-computing-logo-2.png

Why

Building a Debian-based OpenStack private cloud using salt made us realize that we needed a way to generate various flavours of AMIs, for the following reasons:

  • Some of our OpenStack users need "preconfigured" AMIs (for example a Debian system with Postgres 9.1 and the appropriate Python bindings) without doing the modifications by hand or waiting for an automated script to do the job at AMI boot time.
  • Some cloud use cases require booting many (hundreds, for instance) machines with the same configuration. While tools like salt automate the job, waiting while the same download and install takes place hundreds of times is a waste of resources. If the modifications have already been integrated into a specialized AMI, you save a lot of computing time. And especially on Amazon (or other pay-per-use cloud infrastructures), these resources are not free.
  • Sometimes one needs to repeat a computation on an instance with the very same packages and input files, possibly years after the first run. Freezing packages and files in one preconfigured AMI helps a lot with this. When relying only on a salt configuration, the installed packages may not be (exactly) the same from one run to the other.

Relation to other projects

While multiple tools like build-debian-cloud exist, their objective is to build a vanilla AMI from scratch. The salt-ami-cloud-builder starts from such vanilla AMIs to create variations. Other tools like salt-cloud focus instead on the boot phase of the deployment of (multiple) machines.

Chef and Puppet do the same job as Salt; however, since Salt is already extensively deployed at Logilab, we continue to build on it.

Get it now !

Grab the code here: http://hg.logilab.org/master/salt-ami-cloud-builder

The project page is http://www.logilab.org/project/salt-ami-cloud-builder

The docs can be read here: http://docs.logilab.org/salt-ami-cloud-builder

We hope you find it useful. Bug reports and contributions are welcome.

The logilab-salt-ami-cloud-builder team :)


Pylint 10th anniversary sprint from June 17 to 19 in Toulouse

2013/04/18 by Sylvain Thenault

After a quick survey, we're officially scheduling the Pylint 10th anniversary sprint from Monday, June 17 to Wednesday, June 19 in Logilab's Toulouse office.

There is still some room available if more people want to come; drop me a note (sylvain dot thenault at logilab dot fr).


Pylint development moving to BitBucket

2013/04/12 by Sylvain Thenault

Hi everyone,

After 10 years of hosting Pylint on our own forge at logilab.org, we've decided to publish version 1.0 and move Pylint and astng development to BitBucket. There have been repository mirrors there for some time, but we now intend to use all of BitBucket's features, notably pull requests, to handle various development tasks.

There are several reasons behind this. First, using both BitBucket and our own forge is rather cumbersome, at least for integrators. This is mainly because BitBucket doesn't support Mercurial's changeset evolution feature, while our forge relies on it. Second, our forge has several usability drawbacks that make it hard to use for newcomers, and we lack the time to be responsive on this. Finally, we think that our quality-control process, as exposed by our forge, is a bit heavy for such community projects and may keep potential contributors away.

All in all, we hope this will help us reach a wider contributor audience as well as more regular maintainers / integrators who are not Logilab employees, and so bring the best Pylint possible to the Python community!

Logilab.org web pages will be updated to mention this, but kept, as there is still valuable information there (e.g. tickets). We may also keep automatic tests and package building services there.

So please use https://bitbucket.org/logilab/pylint as the main web site for pylint development. Bug reports, feature requests and contributions should be made there. The same move will be done for Pylint's underlying library, logilab-astng (https://bitbucket.org/logilab/astng). In this process, we also wish to move it out of the 'logilab' python package. It may be a good time to give it another name; if you have any ideas, don't hesitate to express yourself.

Last but not least, remember that Pylint's home page may be edited using Mercurial, and that the new http://docs.pylint.org is generated from the content found in Pylint's source doc subdirectory.

Pylint turning 10 and moving out of its parents' home is probably a good time to thank Logilab for paying me and some colleagues to create and maintain this project!

https://bitbucket-assetroot.s3.amazonaws.com/c/photos/2013/Apr/05/pylint-logo-1661676867-0_avatar.png

PyLint 10th anniversary, 1.0 sprint

2013/03/29 by Sylvain Thenault

In a few weeks, pylint will be 10 years old (0.1 was released on May 19, 2003!). For this occasion, I would like to release a 1.0. Well, not exactly on that date, but not too long after would be great. Also, I think it would be a good time to have a few days' sprint to work a bit on this 1.0, but also to all meet and talk about pylint's status and future, as more and more contributions come from outside Logilab (actually mostly from Google, which employs Torsten and Martin, the most active contributors recently).

The first thing to do is to decide on a date and place. Having discussed this a bit with Torsten, it seems reasonable to target a sprint during June or July. Due to personal constraints, I would like to host this sprint in Logilab's Toulouse office.

So, who would like to jump in and sprint to make pylint even better? I've created a doodle so everyone interested can state their preferences: http://doodle.com/4uhk26zryis5x7as

Regarding the location, is everybody ok with Toulouse? Other ideas are Paris, or Florence around EuroPython, or... <add your proposition here>.

We'll talk about the sprint topics later, but there are plenty of exciting ideas around there.

Please, answer quickly so we can move on. And I hope to see you all there!


LMGC90 Sprint at Logilab in March 2013

2013/03/28 by Vladimir Popescu

LMGC90 Sprint at Logilab

At the end of March 2013, Logilab hosted a sprint on the LMGC90 simulation code in Paris.

LMGC90 is an open-source software developed at the LMGC ("Laboratoire de Mécanique et Génie Civil" -- "Mechanics and Civil Engineering Laboratory") of the CNRS, in Montpellier, France. LMGC90 is devoted to contact mechanics and is thus able to model large collections of deformable or undeformable physical objects of various shapes, with numerous interaction laws. LMGC90 also allows for multiphysics coupling.

Sprint Participants

https://www.logilab.org/file/143585/raw/logo_LMGC.jpghttps://www.logilab.org/file/143749/raw/logo_SNCF.jpghttps://www.logilab.org/file/143750/raw/logo_LaMSID.jpghttps://www.logilab.org/file/143751/raw/logo_LOGILAB.jpg

More than ten hackers joined in from:

  • the LMGC, which leads LMGC90 development and aims at constantly improving its architecture and usability;
  • the Innovation and Research Department of the SNCF (the French state-owned railway company), which uses LMGC90 to study railway mechanics, and more specifically, the ballast;
  • the LaMSID ("Laboratoire de Mécanique des Structures Industrielles Durables", "Laboratory for the Mechanics of Ageing Industrial Structures") laboratory of EDF / CNRS / CEA, which has strong expertise in Code_ASTER and LMGC90;
  • Logilab, as the developer, for the SNCF, of a CubicWeb-based platform dedicated to simulation data and knowledge management.

After a great introduction to LMGC90 by Frédéric Dubois and some preliminary discussions, teams were quickly constituted around the common areas of interest.

Enhancing LMGC90's Python API to build core objects

As of the sprint date, LMGC90 is mainly developed in Fortran, but also contains Python code for two purposes:

  • Exposing the Fortran functions and subroutines in the LMGC90 core to Python; this is achieved using Fortran 2003's ISO_C_BINDING module and Swig. These Python bindings are grouped in a module called ChiPy.
  • Making it easy to generate input data (so called "DATBOX" files) using Python. This is done through a module called Pre_LMGC.

The main drawback of this approach is the double modelling of data that this architecture implies: once in the core and once in Pre_LMGC.

It was decided to build a unique user-level Python layer on top of ChiPy, that would be able to build the computational problem description and write the DATBOX input files (currently achieved by using Pre_LMGC), as well as to drive the simulation and read the OUTBOX result files (currently by using direct ChiPy calls).

This task has been met with success, since, in the short time span available (half a day, basically), the team managed to build some object types using ChiPy calls and save them into a DATBOX.

Using the Python API to feed a computation data store

This topic involved importing LMGC90 DATBOX data into the numerical platform developed by Logilab for the SNCF.

This was achieved using ChiPy as a Python API to the Fortran core to get:

  • the bodies involved in the computation, along with their materials, behaviour laws (with their associated parameters), geometries (expressed in terms of zones);
  • the interactions between these bodies, along with their interaction laws (and associated parameters, e.g. friction coefficient) and body pair (each interaction is defined between two bodies);
  • the interaction groups, which contain interactions that have the same interaction law.

There is still a lot of work to be done (notably regarding the charges applied to the bodies), but this is already a great achievement. This could only have occurred in a sprint, where all the needed expertise is available:

  • the SNCF experts were there to clarify the import needs and check the overall direction;

  • Logilab implemented a data model based on CubicWeb, and imported the data using the ChiPy bindings developed on demand by the LMGC core developer team, using their usual ISO_C_BINDING / Swig Fortran wrapping dance.

    https://www.logilab.org/file/143753/raw/logo_CubicWeb.jpg
  • Logilab undertook the data import; to this end, it asked the LMGC team how the relevant information from LMGC90 could be exposed to Python via the ChiPy API.

Using HDF5 as a data storage backend for LMGC90

The main point of this topic was to replace the in-house DATBOX/OUTBOX textual format used by LMGC90 to store input and output data, with an open, standard and efficient format.

Several formats have been considered, like HDF5, MED and NetCDF4.

MED has been ruled out for the moment, because it lacks support for storing body contact information. HDF5 was finally chosen because of the quality of its Python libraries, h5py and pytables, and the ease of use that tools like h5fs provide.

https://www.logilab.org/file/143754/raw/logo_HDF.jpg

Alain Leufroy from Logilab quickly presented h5py and h5fs usage, and the team started its work, measuring the performance impact of the storage pattern of LMGC90 data. This was quickly achieved, as the LMGC experts made it easy to set up tests of various sizes, and the Logilab developers managed to understand the concepts and implement the required code in a fast and agile way.
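
As an illustration, the kind of storage test set up that day looks roughly like the following sketch (the file name and dataset layout are made up for the example):

import numpy as np
import h5py

# Write a body's node coordinates as a compressed dataset...
with h5py.File('lmgc90_bench.h5', 'w') as out:
    coords = np.random.rand(100000, 3)
    out.create_group('bodies').create_dataset(
        'coordinates', data=coords, compression='gzip')

# ... then read it back, as a stand-in for the real access patterns measured.
with h5py.File('lmgc90_bench.h5', 'r') as src:
    print(src['bodies/coordinates'][:10])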

Debian / Ubuntu Packaging of LMGC90

This topic turned out to be more difficult than initially assessed, mainly because LMGC90 has dependencies on non-packaged external libraries, which thus had to be packaged first:

  • the Matlib linear algebra library, written in C,
  • the Lapack95 library, which is a Fortran95 interface to the Lapack library.

Logilab kept working on this after the sprint and produced packages that are currently being tested by the LMGC team. Some changes are expected (for instance, Python modules should be prefixed with a proper namespace) before the packages can be submitted for inclusion in Debian. Logilab's expertise in Debian packaging was of great help for this task. This will hopefully help spread the use of LMGC90.

https://www.logilab.org/file/143755/raw/logo_Debian.jpg

Distributed Version Control System for LMGC90

As you may know, Logilab is really fond of Mercurial as a DVCS. Our company invested a lot into the development of the great evolve extension, which makes Mercurial a very powerful tool to efficiently manage the team development of software in a clean fashion.

This is why Logilab presented Mercurial's features and advantages over the current VCS used to manage LMGC90 sources, namely svn, to the other participants of the sprint. This was appreciated and will hopefully benefit LMGC90's ease of development and its spread in the open source community.

https://www.logilab.org/file/143756/raw/logo_HG.jpg

Conclusions

All in all, this two-day sprint on LMGC90, involving participants from several industrial and academic institutions, has been a great success. A lot of code has been written but, more importantly, several stepping stones have been laid, such as:

  • the general LMGC90 data access architecture, with the Python layer on top of the LMGC90 core;
  • the data storage format, namely HDF5.

Somewhat collaterally, several other results have also been achieved:

  • partial LMGC90 data import into the SNCF CubicWeb-based numerical platform,
  • Debian / Ubuntu packaging of LMGC90 and dependencies.

On a final note, one would say that we greatly appreciated the cooperation between the participants, which we found pleasant and efficient. We look forward to finding more occasions to work together.


Release of PyLint 0.27 / logilab-astng 0.24.2

2013/02/28 by Sylvain Thenault

Hi there,

I'm very pleased to announce the release of pylint 0.27 and logilab-astng 0.24.2. There have been a lot of enhancements and bug fixes since the last release, so you're strongly encouraged to upgrade. Here is a detailed list of changes:

  • #20693: replace pylint.el by Ian Eure version (patch by J.Kotta)
  • #105327: add support for --disable=all option and deprecate the 'disable-all' inline directive in favour of 'skip-file' (patch by A.Fayolle)
  • #110840: add messages I0020 and I0021 for reporting of suppressed messages and useless suppression pragmas. (patch by Torsten Marek)
  • #112728: add warning E0604 for non-string objects in __all__ (patch by Torsten Marek)
  • #120657: add warning W0110/deprecated-lambda when a map/filter of a lambda could be a comprehension (patch by Martin Pool)
  • #113231: logging checker now looks at instances of Logger classes in addition to the base logging module. (patch by Mike Bryant)
  • #111799: don't warn about octal escape sequences, but warn about \o, which is not octal in Python (patch by Martin Pool)
  • #110839: bind <F5> to Run button in pylint-gui
  • #115580: fix erroneous W0212 (access to protected member) on super call (patch by Martin Pool)
  • #110853: fix a crash when an __init__ method in a base class has been created by assignment rather than direct function definition (patch by Torsten Marek)
  • #110838: fix pylint-gui crash when include-ids is activated (patch by Omega Weapon)
  • #112667: fix emission of reimport warnings for mixed imports and extend the testcase (patch by Torsten Marek)
  • #112698: fix crash related to non-inferable __all__ attributes and invalid __all__ contents (patch by Torsten Marek)
  • Python 3 related fixes:
    • #110213: fix import of checkers broken with python 3.3, causing "No such message id W0704" breakage
    • #120635: redefine cmp function used in pylint.reporters
  • Include full warning id for I0020 and I0021 and make sure to flush warnings after each module, not at the end of the pylint run. (patch by Torsten Marek)
  • Changed the regular expression for inline options so that it must be preceded by a # (patch by Torsten Marek)
  • Make dot output for import graph predictable and not depend on ordering of strings in hashes. (patch by Torsten Marek)
  • Add hooks for import path setup and move pylint's sys.path modifications into them. (patch by Torsten Marek)
  • pylint-brain: more subprocess.Popen faking (see #46273)
  • #109562 [jython]: java modules have no __doc__, causing crash
  • #120646 [py3]: fix for python3.3 _ast changes which may cause crash
  • #109988 [py3]: test fixes
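
To illustrate the two pragma-related items above (#105327 and the inline-option regular expression change), here is a small, hedged example; the function is made up for the purpose:

# The deprecated inline directive 'disable-all' becomes 'skip-file':
#     # pylint: skip-file
# (placed in a comment of its own, it makes pylint skip the whole module).

def peek(obj):
    # Inline options must now appear in a '#' comment; here message W0212
    # (access to a protected member) is disabled for this line only:
    return obj._cache  # pylint: disable=W0212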

Many thanks to all the people who contributed to this release!

Enjoy!


FOSDEM 2013

2013/02/12 by Pierre-Yves David

I was in Brussels for FOSDEM 2013. As with previous FOSDEMs, there were too many interesting talks and people to see. Here is a summary of what I saw:

In the Mozilla room:

  1. The html5 pdf viewer pdfjs is impressive. The PDF specification is really scary, but this full-featured "native" viewer is able to render most of it with very good performance. Have a look at the pdfjs demo!
  2. Firefox debug tools overview, with a specific focus on the Firefox OS emulator in your browser.
  3. Introduction to webl10n: an internationalization format and library used in Firefox OS. A successful mix that results in a format that is idiot-proof enough for a duck to use, that relies on Unicode specifications to handle complex pluralization rules, and that allows cascading translation definitions.
typical webl10n user
  4. Status of html5 video and audio support in Firefox. The topic looks like a real headache, but the team seems to be doing really well. Special mention for the reverse demo effect: the speaker expected some formats to still be unsupported, but someone else apparently implemented them overnight.
  5. Last but not least, I gave a talk about the changeset evolution concept that I'm putting in Mercurial. Thanks go to Feth for asking his not-scripted-at-all questions during this talk. (slides)
http://www.selenic.com/hg-logo/logo-droplets-150.png

In the postgresql room:

  1. Insightful talk about the new event triggers in the PostgreSQL engine and how they may become the perfect way to break your system.
  2. Full update on the capabilities of PostGIS 2.0. The PostGIS suite was already impressive for storing and querying 2D data, but it now has impressive capabilities regarding 3D data.
http://upload.wikimedia.org/wikipedia/en/6/60/PostGIS_logo.png

On Python-related topics:

http://www.python.org/community/logos/python-logo-master-v3-TM-flattened.png
  • Victor Stinner has started interesting projects to improve CPython performance. The first one, astoptimizer, breaks some of the language semantics to apply optimisations when compiling to bytecode (lookup caching, constant folding, …). The other, registervm, is a full redefinition of how the interpreter handles references in bytecode.

After the FOSDEM, I crossed the channel to attend a Mercurial sprint in London. Expect more on this topic soon.


February 2013: Mercurial channel "tour"

2013/01/22 by Pierre-Yves David

The release candidate of Mercurial 2.5 was released last Sunday.

http://mercurial.selenic.com/images/mercurial-logo.png

This new version makes a major change in the way "hidden" changesets are handled. In 2.4, only hg log (and a few other commands) effectively supported hiding "hidden" changesets. Now all hg commands are transparently compatible with the hidden revision concept. This is a considerable step towards changeset evolution, the next-generation collaboration technology that I'm developing for Mercurial.

https://fosdem.org/2013/assets/flyer-thumb-0505d19dbf3cf6139bc7490525310f8e253e60448a29ed4313801b723d5b2ef1.png

The 2.5 cycle is almost over, but there is no time to rest yet: on Saturday the 2nd of February, I will give a talk about the changeset evolution concept at FOSDEM in the Mozilla room. This talk is an updated version of the one I gave at OSDC.fr 2012 (video in French).

The week after, I'm crossing the channel to attend the Mercurial 2.6 Sprint hosted by Facebook London. I expect a lot of discussion about the user interface and network access of changeset evolution.

The HG 2.3 sprint

Building Debian images for an OpenStack (private) cloud

2012/12/23 by David Douard

Now that I have a working OpenStack cloud at Logilab, I want to provide my fellow colleagues with a bunch of ready-made images to create instances from.

Strangely, there are no really usable ready-made UEC Debian images available out there. There have been recent efforts to provide Debian images on the Amazon Marketplace, and the tool suite used to build these is available as a collection of bash shell scripts in a github repository. There are also some images for Eucalyptus, but I have not been able to make them boot properly on my kvm-based OpenStack install.

So I tried to build my own set of Debian images to upload to my glance shop.

Vocabulary

A bit of vocabulary may be useful for those not very accustomed to OpenStack or AWS jargon.

When you want to create an instance of an image, i.e. boot a virtual machine in a cloud, you generally choose from a set of ready-made system images, then you choose a virtual machine flavor (i.e. a combination of a number of virtual CPUs, an amount of RAM, and a hard drive size used as the root device). Generally, you have to choose between tiny (1 CPU, 512MB, no disk), small (1 CPU, 2G of RAM, 20G of disk), etc.

In the cloud world, an instance is not meant to be sustainable. What is sustainable is a volume that can be attached to a running instance.

If you want your instance to be sustainable, there are 2 choices:

  • you can snapshot a running instance and upload it as a new image; so it is not really a sustainable instance, rather it's the ability to configure an instance that then serves as the base for booting other instances,
  • or you can boot an instance from a volume (which is the sustainable part of a virtual machine in a cloud).

In the Amazon world, a "standard" image (the one that is instantiated when creating a new instance) is called an instance store-backed AMI image, also called a UEC image, and a volume image is called an EBS-backed AMI image (EBS stands for Elastic Block Store). So an AMI image stored in a volume cannot be instantiated; it can be booted, but only once at a time. It is, however, sustainable. Different usage.

A UEC or AMI image consists of a triplet: a kernel, an init ramdisk and a root file system image. An EBS-backed image is just the raw disk image to be booted on a virtualization host (a kvm raw or qcow2 image, etc.).

Images in OpenStack

In OpenStack, when you create an instance from a given image, what happens depends on the kind of image.

In fact, in OpenStack, one can upload traditional UEC AMI images (you need to upload the 3 files: the kernel, the initial ramdisk and the root filesystem as a raw image). But one can also upload bare images. This kind of image is booted directly by the virtualization host, so it is some kind of hybrid between a boot from volume (an EBS-backed boot in the Amazon world) and the traditional instantiation from a UEC image.

Instantiating an AMI image

When one creates an instance from an AMI image in an OpenStack cloud:

  • the kernel is copied to the virtualization host,
  • the initial ramdisk is copied to the virtualization host,
  • the root FS image is copied to the virtualization host,
  • then, the root FS image is:
    • duplicated (instantiated),
    • resized (the file is enlarged if needed) to the size of the requested instance flavor,
    • the file system is resized to the new size of the file,
    • the contained filesystem is mounted (using qemu-nbd) and the configured SSH access key is added to /root/.ssh/authorized_keys,
    • the nbd volume is then unmounted,
  • a libvirt domain is created, configured to boot from the given kernel and init ramdisk, using the resized and modified disk image as root filesystem,
  • the libvirt domain is then booted.

Instantiating a BARE image

When one creates an instance from a BARE image in an OpenStack cloud:

  • the VM image file is copied to the virtualization host,
  • the VM image file is duplicated (instantiated),
  • a libvirt domain is created, configured to boot from this copied image disk as root filesystem,
  • the libvirt domain is then booted.

Differences between the 2 instantiation methods

Instantiating a BARE image:
  • Involves a much simpler process.
  • Allows booting a non-Linux system (this depends on the virtualization system; it is especially true when using kvm virtualization).
  • Is slower to boot and consumes more resources, since the virtual machine image must be the size of the required virtual machine (but it can remain minimal when using the qcow2 image format). If you use a 10G raw image, then 10G of data will be copied from the image provider to the virtualization host, and this big file will be duplicated each time you instantiate the image.
  • The root filesystem size corresponding to the flavor of the instance is not honoured; the filesystem size is that of the BARE image.
Instantiating an AMI image:
  • Honours the flavor.
  • Generally allows a quicker instance creation process.
  • Consumes fewer resources.
  • Can only boot Linux guests.

If one wants to boot a Windows guest in OpenStack, the only solution (as far as I know) is to use a BARE image of an installed Windows system. It works (I have succeeded in doing so), but a minimal Windows 7 install is several GB in size, so instantiating such a BARE image is very slow, because the image needs to be uploaded to the virtualization host.

Building a Debian AMI image

So I wanted to provide a minimal Debian image in my cloud, and to provide it as an AMI image so that the flavor is honoured and the standard cloud injection mechanisms (like setting up the SSH key used to access the VM) work without having to tweak the rc.local script or use cloud-init in my guest.

Here is what I did.

1. Install a Debian system in a standard libvirt/kvm guest.

david@host:~$ virt-install  --connect qemu+tcp://virthost/system   \
                 -n openstack-squeeze-amd64 -r 512 \
                 -l http://ftp2.fr.debian.org/pub/debian/dists/stable/main/installer-amd64/ \
                 --disk pool=default,bus=virtio,type=qcow2,size=5 \
                 --network bridge=vm7,model=virtio  --nographics  \
                 --extra-args='console=tty0 console=ttyS0,115200'

This creates a new virtual machine, launches the Debian installer directly downloaded from a Debian mirror, and starts the usual Debian installation in a virtual serial console (I don't like VNC very much).

I then followed the installation procedure. When asked about partitioning, I chose to create only one primary partition (i.e. with no swap partition; it won't be necessary here). I also chose only "Default system" and "SSH server" to be installed.

2. Configure the system

After the installation process, the VM is rebooted and I log into it (by SSH or via the console) so I can configure the system a bit.

david@host:~$ ssh root@openstack-squeeze-amd64.vm.logilab.fr
Linux openstack-squeeze-amd64 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun Dec 23 20:14:24 2012 from 192.168.1.34
root@openstack-squeeze-amd64:~# apt-get update
root@openstack-squeeze-amd64:~# apt-get install vim curl parted # install some must have packages
[...]
root@openstack-squeeze-amd64:~# dpkg-reconfigure locales # I like to have fr_FR and en_US in my locales
[...]
root@openstack-squeeze-amd64:~# echo virtio_balloon >> /etc/modules
root@openstack-squeeze-amd64:~# echo acpiphp >> /etc/modules
root@openstack-squeeze-amd64:~# update-initramfs -u
root@openstack-squeeze-amd64:~# apt-get clean
root@openstack-squeeze-amd64:~# rm /etc/udev/rules.d/70-persistent-net.rules
root@openstack-squeeze-amd64:~# rm .bash_history
root@openstack-squeeze-amd64:~# poweroff

What we do here is install some packages and do a bit of configuration. The important part is adding the acpiphp module, so that volume attachment will work in our instances. We also clean a few things up before shutting the VM down.

3. Convert the image into an AMI image

Since I created the VM image as a qcow2 image, I needed to convert it back to a raw image:

david@host:~$ scp root@virthost:/var/lib/libvirt/images/openstack-squeeze-amd64.img .
david@host:~$ qemu-img convert -O raw openstack-squeeze-amd64.img openstack-squeeze-amd64.raw

Then, as I want a minimal-sized disk image, the filesystem must be resized to its minimum. I did this as described below, but I think there are simpler methods to do so.

david@host:~$ fdisk -l openstack-squeeze-amd64.raw  # display the partition location in the disk

Disk openstack-squeeze-amd64.raw: 5368 MB, 5368709120 bytes
149 heads, 8 sectors/track, 8796 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0001fab7

                   Device Boot      Start         End      Blocks   Id  System
debian-squeeze-amd64.raw1            2048    10483711     5240832   83  Linux
david@host:~$ # extract the filesystem from the image
david@host:~$ dd if=openstack-squeeze-amd64.raw of=openstack-squeeze-amd64.ami bs=1024 skip=1024 count=5240832
david@host:~$ losetup /dev/loop1 openstack-squeeze-amd64.ami
david@host:~$ mkdir /tmp/img
david@host:~$ mount /dev/loop1 /tmp/img
david@host:~$ cp /tmp/img/boot/vmlinuz-2.6.32-5-amd64 .
david@host:~$ cp /tmp/img/boot/initrd.img-2.6.32-5-amd64 .
david@host:~$ umount /tmp/img
david@host:~$ e2fsck -f /dev/loop1 # required before a resize

e2fsck 1.42.5 (29-Jul-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/loop1: 26218/327680 files (0.2% non-contiguous), 201812/1310208 blocks
david@host:~$ resize2fs -M /dev/loop1 # minimize the filesystem

resize2fs 1.42.5 (29-Jul-2012)
Resizing the filesystem on /dev/loop1 to 191461 (4k) blocks.
The filesystem on /dev/loop1 is now 191461 blocks long.
david@host:~$ # note the new size ^^^^ and the block size above (4k)
david@host:~$ losetup -d /dev/loop1 # detach the lo device
david@host:~$ dd if=debian-squeeze-amd64.ami of=debian-squeeze-amd64-reduced.ami bs=4096 count=191461

4. Upload in OpenStack

After all this, you have a kernel image, an init ramdisk file and a minimized root filesystem image file. So you just have to upload them to your OpenStack image provider (glance):

david@host:~$ glance add disk_format=aki container_format=aki name="debian-squeeze-uec-x86_64-kernel" \
                 < vmlinuz-2.6.32-5-amd64
Uploading image 'debian-squeeze-uec-x86_64-kernel'
==================================================================================[100%] 24.1M/s, ETA  0h  0m  0s
Added new image with ID: 644e59b8-1503-403f-a4fe-746d4dac2ff8
david@host:~$ glance add disk_format=ari container_format=ari name="debian-squeeze-uec-x86_64-initrd" \
                 < initrd.img-2.6.32-5-amd64
Uploading image 'debian-squeeze-uec-x86_64-initrd'
==================================================================================[100%] 26.7M/s, ETA  0h  0m  0s
Added new image with ID: 6f75f1c9-1e27-4cb0-bbe0-d30defa8285c
david@host:~$ glance add disk_format=ami container_format=ami name="debian-squeeze-uec-x86_64" \
                 kernel_id=644e59b8-1503-403f-a4fe-746d4dac2ff8 ramdisk_id=6f75f1c9-1e27-4cb0-bbe0-d30defa8285c \
                 < debian-squeeze-amd64-reduced.ami
Uploading image 'debian-squeeze-uec-x86_64'
==================================================================================[100%] 42.1M/s, ETA  0h  0m  0s
Added new image with ID: 4abc09ae-ea34-44c5-8d54-504948e8d1f7
http://www.logilab.org/file/115220?vid=download

And that's it! I now have a Debian squeeze image in my cloud that works fine:

http://www.logilab.org/file/115221?vid=download

Nazca is out !

2012/12/21 by Simon Chabot

What is it for ?

Nazca is a Python library aiming to help you align data. But what does “align data” mean? For instance, suppose you have a list of cities, described by their name and their country, and you would like to find their URIs on dbpedia to get more information about them, such as their longitude and latitude. If you have two or three cities, this can be done by hand, but it cannot if there are hundreds or thousands of cities. Nazca provides all the tools you need to do it.

This blog post aims to show you how this library works and how it can be used. Once you have understood the main concepts behind it, don't hesitate to try Nazca online!

Introduction

The alignment process is divided into three main steps:

  1. Gather and format the data we want to align. In this step, we define two sets called the alignset and the targetset. The alignset contains our data, and the targetset contains the data on which we would like to make the links.
  2. Compute the similarity between the items gathered. We compute a distance matrix between the two sets according to a given distance.
  3. Find the items having a high similarity thanks to the distance matrix.

Simple case

  1. Let's define alignset and targetset as simple Python lists.
alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
  2. Now, we have to compute the similarity between each item. For that purpose, the Levenshtein distance [1], which is well suited to computing the distance between a few words, is used. Such a function is provided in the nazca.distances module.

    The next step is to compute the distance matrix according to the Levenshtein distance. The result is given in the following table.

                      Albert Camus   Guillaume Apollinaire   Victor Hugo
    Victor Hugo            6                    9                  0
    Albert Camus           0                    8                  6

  3. The alignment process ends by reading the matrix and considering items with a value below a given threshold to be identical.

[1]Also called the edit distance, because the distance between two words is equal to the minimum number of single-character edits required to change one word into the other.
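
As a sketch of how this looks in code, assuming the cdist function lives in nazca.distances as the prose below describes, the matrix above could be computed with:

from nazca.distances import cdist

alignset = ['Victor Hugo', 'Albert Camus']
targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
# One row per alignset item, one column per targetset item
# (cf. the table above).
matrix = cdist(alignset, targetset, 'levenshtein', matrix_normalized=False)
print(matrix)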

A more complex one

The previous case was simple, because we had only one attribute to align (the name), but it is common to have many attributes to align, such as the name, the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and items will be represented as nested lists. See the following example:

alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
            ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
            ['Michel Edouard', '18-04-1881', 'Nantes']]
targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
             ['Edouard Michel', '18/04/1881', 'Nantes'],
             ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
             ['Dupont Paul', '01-12-2012', 'Paris']]

In such a case, two distance functions are used: the Levenshtein one for the name and the city, and a temporal one for the birth date [2].

The cdist function of nazca.distances enables us to compute those matrices:

  • For the names:
>>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
array([[ 1.,  6.,  5.,  0.],
       [ 5.,  6.,  0.,  5.],
       [ 6.,  0.,  6.,  6.]], dtype=float32)
  Dupond Paul Edouard Michel Dupuis Jacques Dupont Paul
Paul Dupont 1 6 5 0
Jacques Dupuis 5 6 0 5
Edouard Michel 6 0 6 6
  • For the birthdates:
>>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
>>>                    'temporal', matrix_normalized=False)
array([[     0.,  40294.,   2702.,   7780.],
       [  2702.,  42996.,      0.,   5078.],
       [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
  14/08/1991 18/04/1881 06/01/1999 01-12-2012
14-08-1991 0 40294 2702 7780
06-01-1999 2702 42996 0 5078
18-04-1881 40294 0 42996 48074
  • For the birthplaces:
>>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
>>>                    'levenshtein', matrix_normalized=False)
array([[ 0.,  4.,  8.,  0.],
       [ 8.,  9.,  0.,  8.],
       [ 4.,  0.,  9.,  4.]], dtype=float32)
  Paris Nantes Bressuire Paris
Paris 0 4 8 0
Bressuire 8 9 0 8
Nantes 4 0 9 4

The next step is gathering those three matrices into a global one, called the global alignment matrix. Thus we have:

  0 1 2 3
0 1 40304 2715 7780
1 2715 43011 0 5091
2 40304 0 43011 48084

Allowing for some misspellings (for example Dupont and Dupond are very close), the matching threshold can be set to 1 or 2. We can then see that item 0 in our alignset matches item 0 in the targetset, and that item 1 of the alignset matches item 2 of the targetset: the links can be made!

It's important to notice that even though item 0 of the alignset and item 3 of the targetset have the same name and the same birthplace, they are unlikely to be identical because of their very different birth dates.
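
If you want to check the numbers, the global matrix and the resulting links can be reproduced with plain numpy (the arrays below are copied from the three matrices printed above; Nazca normally does this summation for you):

import numpy as np

names = np.array([[1., 6., 5., 0.],
                  [5., 6., 0., 5.],
                  [6., 0., 6., 6.]])
dates = np.array([[0., 40294., 2702., 7780.],
                  [2702., 42996., 0., 5078.],
                  [40294., 0., 42996., 48074.]])
places = np.array([[0., 4., 8., 0.],
                   [8., 9., 0., 8.],
                   [4., 0., 9., 4.]])

# The global alignment matrix is the element-wise sum of the three.
global_matrix = names + dates + places
# Pairs whose global distance falls below the threshold are our links.
print(np.argwhere(global_matrix <= 2))   # -> [[0 0] [1 2]]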

You may have noticed that working with matrices by hand, as I did in this example, is a little bit tedious. The good news is that Nazca does all this work for you: you just have to provide the sets and the distance functions, and that's all. Another piece of good news is that the project comes with the functions needed to build the sets!

[2]Provided in the nazca.distances module.

Real applications

Just before we start, we will assume the following imports have been done:

from nazca import dataio as aldio   #Functions for input and output data
from nazca import distances as ald  #Functions to compute the distances
from nazca import normalize as aln  #Functions to normalize data
from nazca import aligner as ala    #Functions to align data

The Goncourt prize

On Wikipedia, we can find the list of Goncourt prize winners, and we would like to establish a link between the winners and their URIs on dbpedia (let's imagine the Goncourt prize winners category does not exist in dbpedia).

We simply copy/paste the winners list from Wikipedia into a file and replace all the separators (- and ,) with #. So the beginning of our file is:

1903#John-Antoine Nau#Force ennemie (Plume)
1904#Léon Frapié#La Maternelle (Albin Michel)
1905#Claude Farrère#Les Civilisés (Paul Ollendorff)
1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)

When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (column number 1) both as the identifier (we don't have a URI to use here) and as the attribute to align. This is told to Python with the following code:

alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')

So, the beginning of our alignset is:

>>> alignset[:3]
[[u'John-Antoine Nau', u'John-Antoine Nau'],
 [u'Léon Frapié', u'Léon, Frapié'],
 [u'Claude Farrère', u'Claude Farrère']]

Now, let's build the targetset thanks to a SPARQL query and the dbpedia endpoint. We ask for the list of French novelists, described by their URI and their name in French:

query = """
     SELECT ?writer, ?name WHERE {
       ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
       ?writer rdfs:label ?name.
       FILTER(lang(?name) = 'fr')
    }
 """
 targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)

Both functions return nested lists, as presented before. Now we have to define the distance function to be used for the alignment. This is done with a Python dictionary where the keys are the columns to work on and the values are the treatments to apply.

treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
                                              # (column 1)

Finally, the last thing we have to do is call the alignall function:

alignments = ala.alignall(alignset, targetset,
                       0.4, #This is the matching threshold
                       treatments,
                       mode=None, #We'll discuss that later
                       uniq=True #Get the best results only
                      )

This function returns an iterator over the different alignments found. You can see the results with the following code:

for a, t in alignments:
    print '%s has been aligned onto %s' % (a, t)

It may be important to apply some pre-treatment to the data before aligning it. For instance, names can be written in lower or upper case, contain extra characters such as punctuation, or carry unwanted information in parentheses, and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see the docstring for more information). So the treatments can be given as follows:

def remove_after(string, sub):
    """ Remove the text after ``sub`` in ``string``
        >>> remove_after('I like cats and dogs', 'and')
        'I like cats'
        >>> remove_after('I like cats and dogs', '(')
        'I like cats and dogs'
    """
    try:
        return string[:string.lower().index(sub.lower())].strip()
    except ValueError:
        return string


treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
                                    aln.simplify],
                  'metric': ald.levenshtein
                 }
             }

Cities alignment

The previous case with the Goncourt prize winners was pretty simple, because the number of items was small and the computation fast. But in a more realistic use case, the number of items to align may be huge (thousands or millions…). In such a case it's unthinkable to build the global alignment matrix, because it would be too big and it would take (at least...) a few days to compute. So the idea is to make small groups of possibly similar data and compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data, for both textual and numerical data.

This is the code used, we will explain it:

targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
                           """Any U, N, LONG, LAT WHERE X is Location, X name
                              N, X country C, C name "France", X longitude
                              LONG, X latitude LAT, X population > 1000, X
                              feature_class "P", X cwuri U""",
                           indexes=[0, 1, (2, 3)])
alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
                             """prefix db-owl: <http://dbpedia.org/ontology/>
                             prefix db-prop: <http://fr.dbpedia.org/property/>
                             select ?ville, ?name, ?long, ?lat where {
                              ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
                              ?ville rdf:type db-owl:PopulatedPlace .
                              ?ville db-owl:populationTotal ?population .
                              ?ville foaf:name ?name .
                              ?ville db-prop:longitude ?long .
                              ?ville db-prop:latitude ?lat .
                              FILTER (?population > 1000)
                             }""",
                             indexes=[0, 1, (2, 3)])


treatments = {1: {'normalization': [aln.simplify],
                  'metric': ald.levenshtein,
                  'matrix_normalized': False
                 }
             }
results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
                       indexes=(2, 2), #On which data build the kdtree
                       mode='kdtree',  #The mode to use
                       uniq=True) #Return only the best results

Let's explain the code. We have two data sources, each containing a list of cities we want to align; the first column is the identifier, the second is the name of the city, and the last one is the location of the city (longitude and latitude), gathered into a single tuple.

In this example, we want to build a kdtree on the (longitude, latitude) couple to divide our data into a few groups of candidates. This clustering is coarse, and is only used to reduce the set of potential candidates without losing any more refined possible matches.

So, in the next step, we define the treatments to apply. They are the same as before, but we ask for a non-normalized matrix (i.e. the raw output of the Levenshtein distance). Then we call the alignall function: indexes is a tuple giving the position of the point on which the kdtree must be built, and mode is the mode used to find neighbours [3].

Finally, uniq asks the function to return only the best candidate (i.e. the one having the shortest distance below the given threshold).

The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second is the targetset one (it may take some time before the first tuples are yielded, because all the computation must be done first…).

[3]The available modes are kdtree, kmeans and minibatch for numerical data and minhashing for text one.

Try it online !

We have also built a little web application on top of Nazca, using CubicWeb. This application provides a user interface for Nazca, helping you choose what you want to align. You can use SPARQL or RQL queries, as in the previous example, or import your own csv file [4]. Once you have chosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in Python! Once done, by clicking Next step again, you start the alignment process. Wait a little bit, and you can either download the results as a csv or rdf file, or see them directly online by choosing the html output.

[4]Your csv file must be tab-separated for the moment…