
Logilab.org

News from Logilab and our Free Software projects, as well as on topics dear to our hearts (Python, Debian, Linux, the semantic web, scientific computing...)

  • Mercurial conference in Paris: May 28th, 2019

    2019/05/03 by Marla Da Silva

    The Mercurial Paris conference will take place on May 28th at Mozilla's headquarters in Paris.

    Mercurial is a free distributed Source Control Management system. It offers an intuitive interface to efficiently handle projects of any size. With its powerful extension system, Mercurial can easily adapt to any environment.

    This first edition targets organizations that are currently using Mercurial or considering switching from another Version Control System, such as Subversion.

    Attending the conference will allow users to share ideas and version control experiences across different industries and at different scales. It is a great opportunity to connect with Mercurial core developers and get updates about modern workflows and features.

    You are welcome to register here and be part of this first edition with us!

    https://www.logilab.org/file/10131926/raw

    The Mercurial conference is co-organized by Logilab, Octobus and RhodeCode.


  • Mercurial mini-sprint: from April 4th to 7th in Paris

    2019/03/19 by Marla Da Silva

    Logilab is co-organizing with Octobus a Mercurial mini-sprint, to be held from Thursday April 4th to Sunday April 7th in Paris.

    https://www.logilab.org/file/10131359/raw

    Logilab will host the Mercurial mini-sprint at its Paris premises on Thursday April 4th and Friday April 5th. Octobus will announce very soon the venue chosen for the weekend sessions.

    To participate in the Mercurial mini-sprint, please complete the survey, indicating your name and the days you will be joining us.

    Some of the developers working on Mercurial or its associated tooling plan to focus on improving the workflows, tools and documentation used for online collaboration through Mercurial (Kallithea, RhodeCode, Heptapod, Phabricator, sh.rt, etc.). You can also fill in the pad below to indicate the themes you want to tackle during this sprint: https://mensuel.framapad.org/p/mini-sprint-hg

    Let's code together!


  • Logilab trip report for FOSDEM 2019

    2019/02/13 by Nicolas Chauvat
    https://fosdem.org/2019/support/promote/wide.png

    A very large conference

    This year I attended FOSDEM in Brussels for the first time. I have been doing free software for more than 20 years but, for some reason, I had never been to FOSDEM. I was pleasantly surprised to see that it was much larger than I thought and that it gathered thousands of people. This is by far the largest free software event I have been to. My congratulations to the organizers and volunteers, since this must be a huge effort to pull off.

    My presentation about CubicWeb

    I went to FOSDEM to present Logilab's latest project, a reboot of CubicWeb to turn it into a web extension to browse the web of data. The recording of the talk, the slides and the video of the demo are online; I hope you enjoy them, and get in touch with me if you would like to comment or contribute.

    My highlights

    As usual, the "hallway track" was the most useful for me, and I ended up reading more sets of slides than attending talks.

    I met with Bram, the author of redbaron and we had a long discussion about programming in different languages.

    I also met with Octobus. We discussed Heptapod, a project to add Mercurial support to GitLab. Logilab would back such a project with money if it were to become usable and (please) move towards federation (with ActivityPub?) and user queries (with GraphQL?). We also discussed the so-called oxidation of Mercurial, which consists of rewriting some parts in Rust. After a quick search I saw that tools like PyO3 can help write Python extensions in Rust.

    Some of the talks that interested me included:

    • Memex, which reuses the name of the very first hypertext system described in the literature, and tries to implement a decentralized annotation system for the web. It reminded me of the Hypothesis web annotation project and the W3C Web Annotation recommendation, with which they say they will be compatible.
    • GraphQL was presented in both GraphQL with Python and Testing GraphQL with Javascript. I have been following GraphQL for about two years because it compares to the RQL language of CubicWeb. We have been testing it at Logilab with cubicweb-graphql.
    • Web Components are one of the options to develop user interfaces in the browser. I had a look at the Future of Web Components, which I relate to the work we are doing with the CubicWeb browser (see above) and the work the Virtual Assembly has been doing to implement Transiscope.
    • Pyodide, the scientific Python stack compiled to WebAssembly, which I am trying to compare with using Jupyter notebooks.
    • Chat-over-IMAP, another attempt at a chat protocol to rule them all. It is true that everyone has more than one email address, that email addresses are increasingly used as logins on many web sites, and that using these email addresses as instant-messaging / chat addresses would be nice. We will see if it takes off!

  • Experiment to safely refactor your code in Python

    2018/05/05 by Nicolas Chauvat

    "Will my refactoring break my code ?" is the question the developer asks himself because he is not sure the tests cover all the cases. He should wonder, because tests that cover all the cases would be costly to write, run and maintain. Hence, most of the time, small decisions are made day after day to test this and not that. After some time, you could consider that in a sense, the implementation has become the specification and the rest of the code expects it not to change.

    Enter Scientist, by GitHub, which inspired a Python port named Laboratory.

    Let us assume you want to add a cache to a function that reads data from a database. The function would be named read_from_db; it would take an int parameter item_id and return a dict with the item's attributes and their values.

    You could experiment with the new version of this function like so:

    import laboratory

    def read_from_db_with_cache(item_id):
        data = {}
        # some implementation with a cache
        return data

    @laboratory.Experiment.decorator(candidate=read_from_db_with_cache)
    def read_from_db(item_id):
        data = {}
        # fetch data from db
        return data
    

    When you run the above code, calling read_from_db returns its result as usual, but, thanks to laboratory, a call to read_from_db_with_cache is also made and its execution time and result are compared with those of the original. These measurements are logged to a file or sent to your metrics solution for you to compare and study.

    In other words, things continue to work as usual as you keep the original function, but at the same time you experiment with its candidate replacement to make sure switching will not break or slow things down.
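
    To actually look at those measurements, Laboratory lets you subclass Experiment and publish the results yourself. Below is a minimal sketch assuming the publish(result) hook and the duration attribute on observations described in Laboratory's documentation; where the results are sent is up to you (here we just print them):

    import laboratory

    class CacheExperiment(laboratory.Experiment):
        """Experiment that reports timings instead of only recording them."""

        def publish(self, result):
            # result.control is the observation of read_from_db,
            # result.candidates[0] the one of read_from_db_with_cache.
            print('control duration:  ', result.control.duration)
            print('candidate duration:', result.candidates[0].duration)

    @CacheExperiment.decorator(candidate=read_from_db_with_cache)
    def read_from_db(item_id):
        data = {}
        # fetch data from db
        return data

    Calling read_from_db then behaves exactly as before, but each call also prints the timing comparison.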

    I like the idea! Thanks for Scientist and Laboratory, which are both available under the MIT license.

    https://www.logilab.org/file/10128331/raw/chemistry.png

  • Reduce docker image size with multi-stage builds

    2018/03/21 by Philippe Pepiot

    At Logilab we use docker more and more, both for test and production purposes.

    For a CubicWeb application I had to write a Dockerfile that does a pip install and compiles javascript code using npm.

    The first version of the Dockerfile was:

    FROM debian:stretch
    RUN apt-get update && apt-get -y install \
        wget gnupg ca-certificates apt-transport-https \
        python-pip python-all-dev libgecode-dev g++
    RUN echo "deb https://deb.nodesource.com/node_9.x stretch main" > /etc/apt/sources.list.d/nodesource.list
    RUN wget https://deb.nodesource.com/gpgkey/nodesource.gpg.key -O - | apt-key add -
    RUN apt-get update && apt-get -y install nodejs
    COPY . /sources
    RUN pip install /sources
    RUN cd /sources/frontend && npm install && npm run build && \
        mv /sources/frontend/bundles /app/bundles/
    # ...
    

    The resulting image size was about 1.3GB, which caused issues while uploading it to registries and with the required disk space on production servers.

    So I looked at how to reduce this image size. What is important to know about Dockerfiles is that each operation results in a new docker layer, so removing useless files at the end will not reduce the image size.

    https://www.logilab.org/file/10128049/raw/docker-multi-stage.png

    The first change was to use debian:stretch-slim as the base image instead of debian:stretch; this reduced the image size by 18MB.

    Also, by default apt-get pulls in a lot of extra optional packages which are "Suggests" or "Recommends". We can simply disable this globally using:

    RUN echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf && \
        echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf
    

    This reduced the image size by 166MB.

    Then I looked at the content of the image and saw a lot of space used in /root/.pip/cache. By default pip builds and caches python packages (as wheels); this can be disabled by adding --no-cache to pip install calls. This reduced the image size by 26MB.

    In the image we also have a full nodejs build toolchain which is useless after the bundles are generated. The old workaround is to install nodejs, build the files, remove useless build artifacts (node_modules) and uninstall nodejs in a single RUN operation, but this results in an ugly Dockerfile and is not an optimal use of the layer cache. Instead we can set up a multi-stage build.

    The idea behind multi-stage builds is to be able to build multiple images within a single Dockerfile (only the last one is tagged) and to copy files from one image to another within the same build using COPY --from= (we can also use a previous stage as the base image in a FROM clause).

    Let's extract the javascript build into a separate stage:

    FROM debian:stretch-slim as base
    RUN echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf && \
        echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf
    
    FROM base as node-builder
    RUN apt-get update && apt-get -y install \
        wget gnupg ca-certificates apt-transport-https
    RUN echo "deb https://deb.nodesource.com/node_9.x stretch main" > /etc/apt/sources.list.d/nodesource.list
    RUN wget https://deb.nodesource.com/gpgkey/nodesource.gpg.key -O - | apt-key add -
    RUN apt-get update && apt-get -y install nodejs
    COPY . /sources
    RUN cd /sources/frontend && npm install && npm run build
    
    FROM base
    RUN apt-get update && apt-get -y install python-pip python-all-dev libgecode-dev g++
    COPY . /sources
    RUN pip install --no-cache /sources
    COPY --from=node-builder /sources/frontend/bundles /app/bundles
    

    This reduced the image size by 252MB.

    The Gecode build toolchain is required to build rql with the gecode extension (which is a lot faster than the native python version). My next idea was to build rql as a wheel inside a staging image, so the resulting image would only need the gecode library:

    FROM debian:stretch-slim as base
    RUN echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf && \
        echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf
    
    FROM base as node-builder
    RUN apt-get update && apt-get -y install \
        wget gnupg ca-certificates apt-transport-https
    RUN echo "deb https://deb.nodesource.com/node_9.x stretch main" > /etc/apt/sources.list.d/nodesource.list
    RUN wget https://deb.nodesource.com/gpgkey/nodesource.gpg.key -O - | apt-key add -
    RUN apt-get update && apt-get -y install nodejs
    COPY . /sources
    RUN cd /sources/frontend && npm install && npm run build
    
    FROM base as wheel-builder
    RUN apt-get update && apt-get -y install python-pip python-all-dev libgecode-dev g++ python-wheel
    RUN pip wheel --no-cache-dir --no-deps --wheel-dir /wheels rql
    
    FROM base
    RUN apt-get update && apt-get -y install python-pip
    COPY --from=wheel-builder /wheels /wheels
    COPY . /sources
    RUN pip install --no-cache --find-links /wheels /sources
    # a test to be sure that rql is built with gecode extension
    RUN python -c 'import rql.rql_solve'
    COPY --from=node-builder /sources/frontend/bundles /app/bundles
    

    This reduced the image size by 297MB.

    So the final image size went from 1.3GB to 300MB, which is more suitable for use in a production environment.

    Unfortunately, I didn't find a way to copy packages from a staging image, install them and remove the package files in a single docker layer; a possible workaround is to use an intermediate http server.

    The next step could be to build a Debian package inside a staging image, so the build process would be separated from the Dockerfile and we could provide Debian packages along with the docker image.

    Another approach could be to use Alpine as base image instead of Debian.


  • Mercurial Sprint 4.4

    2017/10/10 by Denis Laxalde

    In late September, I participated on behalf of Logilab in the Mercurial 4.4 sprint held at Facebook Dublin. It was the opportunity to meet developers of the project and to follow active topics and forthcoming plans for Mercurial. Here, I'm essentially summarizing the main points that caught my attention.

    Of the three days of sprint, the first two were mostly dedicated to discussions and the last one to writing code. Overall, the organization was pretty smooth thanks to the Facebook folks doing a fair amount of preparation. Among the 25 community members attending, some companies had a significant presence: notably, Facebook had 10 employees and Google 5, the rest consisting of either unaffiliated people or single persons representing their affiliation. No women, unfortunately, and a vast majority of people with English as their native or dominant language.

    The sprint started with short talks presenting the state of the Mercurial project. Notably, in the state of the community talk, it was recalled that project governance now rests with the steering committee, while some things are still not completely formalized (in particular, the project does not have a code of conduct yet). Quite surprisingly, the committee made no explicit mention of the recurring tensions in the project which recently led to the banishment of a major contributor.

    Facebook, Google, Mozilla and Unity then presented in turn the state of Mercurial usage in their organizations. Both Facebook and Google have a significant set of tools around hg, most of them aiming either at smoothing the user experience with respect to performance problems coming from their monorepos or at providing GUI tools. Other than that, it's interesting to note that most of these corporate users have now integrated evolve on the user side, either as is or with a convenience wrapper layer.

    After that followed several "vision statement" presentations combined with breakout sessions (I'm only presenting a partial selection here).

    The first statement was about streamlining the style of the code base: that one was fairly consensual, as most people agreed that something had to be done in this respect; it was finally decided (on the second day) to adopt a PEP-like process. Let's see how things evolve!

    Second, Boris Feld and I talked about the development of the evolve extension and the ongoing task of moving it into Mercurial core (slides). We talked about new usages and workflows around the changeset evolution concept and the topics concept. Despite the recent tensions on this technical topic in the community, we tried to stay positive and reaffirmed that the whole community has an interest in moving evolve into core. On the same track, in another vision statement, Facebook presented their restack workflow (a sort of evolve for "simple" cases) and suggested pushing this into core: this is encouraging, as it means that evolve-like workflows tend to become mainstream.

    Rust

    Another interesting vision statement was about using Rust in Mercurial. Most people agreed that Mercurial would benefit from porting its native C code to Rust, essentially for security reasons and hopefully to gain a bit of performance and maintainability. More radical ideas were also proposed, such as making the hg executable a Rust program (thus embedding a Python interpreter along with its standard library) or reimplementing commands in Rust (which would pose problems with respect to the Python extension system). Later on, Facebook presented mononoke, a promising Mercurial server implemented in Rust that is expected to scale better with respect to high commit rates.

    Back to a community subject, we discussed code review and related tooling. It was first recalled that the project would benefit from more reviewers, including people without committer status. Then the discussion essentially focused on the Phabricator experiment that started in July. The introduction of a web-based review tool in Mercurial was arguably a surprise for the community at large, since many reviewers and long-time contributors have always expressed a clear preference for email-based review. This experiment is apparently meant to lower the contribution barrier, so it's nice to see the project moving forward on this topic and attempting to favor diversity among contributors. On the other hand, the choice of Phabricator was quite controversial. From the beginning (see replies to the announcement email linked above), several people expressed concerns (about notifications notably) and some reviewers also complained about the increased review load and loss of efficiency induced by the new system. A survey recently sent to the whole community apparently confirms (there is no official report yet at the time of this writing) that most employees from Facebook or Google seem pretty happy with the experiment while other community members generally dislike it. Despite that, it was decided to keep the "experiment" going while trying to improve the notification system (better threading support, more diff context, subscription-based notification, etc.). That may work, but there's a risk of community split as non-corporate members might feel left aside. Overall, adopting a consensus-based decision model on such important aspects would be beneficial.

    Yet another interesting discussion took place around branching models. Notably, Facebook and Google people presented their recommended (internal) workflows, respectively based on remotenames and on local bookmarks, while Pulkit Goyal presented the current state of the topics extension and associated workflows. Bridging the gap between all these approaches would be nice, and it seems that a first step would be to agree upon the definition of a stack of changesets to describe a current line of work. In particular, both the topics extension and the show command have their own definition, which should be aligned.

    (Comments on reddit.)


  • Better code archaeology with Mercurial

    2017/09/21 by Denis Laxalde

    For about a year, Logilab has been involved in Mercurial development in the framework of a Mozilla Open Source Support (MOSS) program. Mercurial is a foundational technology for Mozilla, as the VCS used for the development of Firefox. As the main protagonist of the program, I'd first like to mention this has been a very nice experience for me, both from the social and technical perspectives. I've learned a lot and hope to continue working on this nice piece of software.

    The general objective of the program was to improve "code archaeology" workflows, especially when using hgweb (the web UI for Mercurial). The outcomes of the program are spread across versions 4.1 to 4.4 of Mercurial, and I'm going to present the main new features in this post.

    Better navigation in "blame/annotate" view (hgweb)

    The first feature concerns the "annotate" view in hgweb; the idea was to improve navigation along "blamed" revisions, a process that is often tedious and involves a lot of clicks in the web UI (and is not easier from the CLI, for that matter). Basically, we added a hover box on the left side of the file content displaying more context on the blamed revision, along with a couple of links to make navigation easier (links to parent changesets, diff and changeset views). See below for an example about the mercurial.error module (try it at https://www.mercurial-scm.org/repo/hg/annotate/4.3/mercurial/error.py).

    The hover box in annotate view in hgweb.

    Followlines

    While this wasn't exactly in the initial scope of the program, the most interesting result of my work is arguably the introduction of the "followlines" feature set. The idea is to build upon the log command instead of annotate to make it easier to follow changes across revisions; the new thing is to make this possible by filtering changes affecting only a block of lines in a particular file.

    The first component introduced is a revset named followlines, which accepts at least two arguments: a file path and a line range written as fromline:toline (following Python slice syntax). For instance, say we are interested in the history of the LookupError class in the mercurial/error.py module which, at tag 4.3, lives between lines 43 and 59; we'll use the revset as follows:

    $ hg log -r 'followlines(mercurial/error.py, 43:59)'
    changeset:   7633:08cabecfa8a8
    user:        Matt Mackall <mpm@selenic.com>
    date:        Sun Jan 11 22:48:28 2009 -0600
    summary:     errors: move revlog errors
    
    changeset:   24038:10d02cd18604
    user:        Martin von Zweigbergk <martinvonz@google.com>
    date:        Wed Feb 04 13:57:35 2015 -0800
    summary:     error: store filename and message on LookupError for later
    
    changeset:   24137:dcfdfd63bde4
    user:        Siddharth Agarwal <sid0@fb.com>
    date:        Wed Feb 18 16:45:16 2015 -0800
    summary:     error.LookupError: rename 'message' property to something
    else
    
    changeset:   25945:147bd9e238a1
    user:        Gregory Szorc <gregory.szorc@gmail.com>
    date:        Sat Aug 08 19:09:09 2015 -0700
    summary:     error: use absolute_import
    
    changeset:   34016:6df193b5c437
    user:        Yuya Nishihara <yuya@tcha.org>
    date:        Thu Jun 01 22:43:24 2017 +0900
    summary:     py3: implement __bytes__() on most of our exception classes
    

    This only yielded changesets touching this class.

    This is not an exact science (because the algorithm works on diff hunks), but in many situations it gives interesting results way faster than an iterative "annotate" process (in which one has to step from revision to revision and run annotate every time).

    The followlines() predicate accepts other arguments, in particular descend which, in combination with startrev, lets you walk the history of a block of lines in the descending direction of history. Below is the detailed help (hg help revsets.followlines) of this revset:

    $ hg help revsets.followlines
        "followlines(file, fromline:toline[, startrev=., descend=False])"
          Changesets modifying 'file' in line range ('fromline', 'toline').
    
          Line range corresponds to 'file' content at 'startrev' and should hence
          be consistent with file size. If startrev is not specified, working
          directory's parent is used.
    
          By default, ancestors of 'startrev' are returned. If 'descend' is True,
          descendants of 'startrev' are returned though renames are (currently)
          not followed in this direction.
    

    In hgweb

    The second milestone of the program was to port this followlines filtering feature into hgweb. This has been implemented as a line selection mechanism (using mouse) reachable from both the file and annotate views. There, you'll see a small green ± icon close to the line number. By clicking on this icon, you start the line selection process which can be completed by clicking on a similar icon on another line of the file.

    Starting a line selection for followlines in hgweb.

    After that, you'll see a box inviting you to follow the history of lines <selected range>, either in the ascending (older) or descending (newer) direction.

    Line selection completed for followlines in hgweb.

    Here clicking on the "newer" link, we get:

    Followlines result in hgweb.

    As you can see, this gives a similar result to the command line, but it also displays the patch of each changeset. In these patch blocks, only the diff hunks affecting the selected line range are displayed (sometimes with some extra context).

    What's next?

    At this point, the "followlines" feature is probably complete as far as hgweb is concerned. In the remainder of the MOSS program, I'd like to focus on a command-line interface producing output similar to the one above in hgweb, i.e. filtering patches to only show followed lines. That would take the form of a --followlines/-L option to the hg log command, e.g.:

    $ hg log -L mercurial/error.py,43:59 -p
    

    That's something I'd like to tackle at the upcoming Mercurial 4.4 sprint in Dublin!


  • A non-numeric Pandas example with Geonames

    2017/02/20 by Yann Voté

    The aim of this two-part blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numeric-oriented) tutorials.

    We will illustrate these techniques with Geonames data: extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statistics. We will show that Pandas can be used in a wide range of cases beyond numerical analysis.

    While the first part was an introduction to Geonames data, this second part contains the real work with Pandas. Please read part 1 if you are not familiar with Geonames data.

    The project

    The goal is to read Geonames data, from allCountries.txt and alternateNames.txt files, combine them, and produce a new CSV file with the following columns:

    • gid, the Geonames id,
    • frname, French name when available,
    • gname, Geonames main name,
    • fclass, feature class,
    • fcode, feature code,
    • parent_gid, Geonames id of the parent location,
    • lat, latitude,
    • lng, longitude.

    Also, we don't want to use too much RAM during the process: 1 GB is the maximum.

    You may think that this new CSV file is a low-level objective, not very interesting. But it can be a step in a larger process. For example, one can build a local Geonames database that can then be used by tools like Elasticsearch or Solr to provide easy searching and auto-completion in French.

    Another interesting feature is provided by the parent_gid column. One can use this column to build a tree of locations, or a SKOS thesaurus of concepts.

    The method

    Before diving into technical details, let us first look at the big picture, the overall process to achieve the aforementioned goal.

    Considering the wanted columns, we can see the following differences with allCountries.txt:

    • we are not interested in some original columns (population, elevation, ...),
    • the column frname is new and will come from alternateNames.txt, thanks to the Geonames id,
    • the column parent_gid is new too. We must derive this information from the adminX_code (X=1, 2, 3, 4) columns in allCountries.txt.

    This column deserves an example. Look at Arrondissement de Toulouse:

    feature_code: ADM3, country_code: FR, admin1_code: 76, admin2_code: 31, admin3_code: 313
    

    To find its parent, we must look for a place with the following properties:

    feature_code: ADM2, country_code: FR, admin1_code: 76, admin2_code: 31
    

    and there is only one place with such properties, Geonames id 3,013,767, namely Département de la Haute-Garonne. Thus we must find a way to derive the Geonames id from the feature_code, country_code and adminX_code columns. Pandas will make this easy for us.


    Let's get to work. And of course, we must first import the Pandas library to make it available. We also import the csv module because we will need it to perform basic operations on CSV files.

    >>> import pandas as pd
    >>> import csv
    

    The French names step

    Loading file

    We begin by loading data from the alternateNames.txt file.

    And indeed, that's a big file. To save memory we won't load the whole file. Recall from the previous part that the alternateNames.txt file provides the following columns in order.

    alternateNameId   : the id of this alternate name, int
    geonameid         : geonameId referring to id in table 'geoname', int
    isolanguage       : iso 639 language code 2- or 3-characters; (...)
    alternate name    : alternate name or name variant, varchar(400)
    isPreferredName   : '1', if this alternate name is an official/preferred name
    isShortName       : '1', if this is a short name like 'California' for 'State of California'
    isColloquial      : '1', if this alternate name is a colloquial or slang term
    isHistoric        : '1', if this alternate name is historic and was used in the past
    

    For our purpose we are only interested in the columns geonameid (so that we can find the corresponding place in the allCountries.txt file), isolanguage (so that we can keep only French names), alternate name (of course), and isPreferredName (because we want to keep preferred names when possible).

    Another way to save memory is to filter the file before loading it. Indeed, it's better practice to load a smaller dataset (filter before loading) than to load a big one and then filter it after loading. It's important to keep those things in mind when you are working with large datasets. So in our case, it is cleaner (but slower) to prepare the CSV file beforehand, keeping only French names.

    >>> # After restarting Python
    >>> with open('alternateNames.txt') as f:
    ...     reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    ...     with open('frAltNames.txt', 'w') as g:
    ...         writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
    ...                             escapechar='\\')
    ...         for row in reader:
    ...             if row[2] == u'fr':
    ...                 writer.writerow(row)
    

    To load a CSV file, Pandas provides the read_csv function. It returns a dataframe populated with data from a CSV file.

    >>> with open('frAltNames.txt') as altf:
    ...     frnames = pd.read_csv(
    ...         altf, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(1, 3, 4), index_col=0,
    ...         names=('gid', 'frname', 'pref'),
    ...         dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
    ...         true_values=['1'], false_values=[u''], na_filter=False,
    ...         skipinitialspace=True, error_bad_lines=False
    ...     )
    ...
    

    The read_csv function is quite complex and it takes some time to use it correctly. In our case, the first few parameters are self-explanatory: encoding for the file encoding, sep for the CSV separator, quoting for the quoting protocol (here there is none), and header for lines to be considered as label lines (here there are none).

    usecols tells pandas to load only the specified columns. Beware that indices start at 0, so column 1 is the second column in the file (geonameid in this case).

    index_col says to use one of the columns as a row index (instead of creating a new index from scratch). Note that the number for index_col is relative to usecols. In other words, 0 means the first column of usecols, not the first column of the file.

    names gives labels for columns (instead of using integers from 0). Hence we can extract the last column with frnames['pref'] (instead of frnames[3]). Please note that this parameter is not compatible with header=0, for example (in that case, the first line is used to label columns).

    The dtype parameter is interesting. It allows you to specify one type per column. When possible, prefer to use NumPy types to save memory (eg np.uint32, np.bool_).
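
    As a quick aside (not part of the loading code), the gain is easy to see on a toy series:

    >>> pd.Series([1, 2, 3], dtype='int64').memory_usage(index=False)
    24
    >>> pd.Series([1, 2, 3], dtype='uint32').memory_usage(index=False)
    12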

    Since we want boolean values for the pref column, we can tell Pandas to convert '1' strings to True and empty strings ('') to False. That is the point of using parameters true_values and false_values.

    But, by default, Pandas detects empty strings and assigns them np.nan (NaN means Not a Number). To prevent this behavior, setting na_filter to False will leave empty strings as empty strings. Thus, empty strings in the pref column will be converted to False (thanks to the false_values parameter and to the 'bool_' data type).

    Finally, skipinitialspace tells Pandas to left-strip strings to remove leading spaces. Without it, a comma-separated line like a, b would give the values 'a' and ' b' (note the leading space before b).

    And lastly, error_bad_lines makes Pandas ignore lines it cannot parse (eg. wrong number of columns). The default behavior is to raise an exception.

    There are many more parameters to this function, which is much more powerful than the simple reader object from the csv module. Please, refer to the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for a full list of options. For example, the header parameter can accept a list of integers, like [1, 3, 4], saying that lines at position 1, 3 and 4 are label lines.

    >>> frnames.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 55684 entries, 18918 to 11441793
    Data columns (total 2 columns):
    frname     55684 non-null object
    pref       55684 non-null bool
    dtypes: bool(1), object(1)
    memory usage: 6.6 MB
    

    We can see here that our dataframe contains 55,684 entries and that it takes about 6.6 MB in memory.

    Let's look at entry number 3,013,767 for the Haute-Garonne department.

    >>> frnames.loc[3013767]
                                     frname    pref
    gid
    3013767  Département de la Haute-Garonne  False
    3013767                    Haute-Garonne   True
    

    Removing duplicated indices

    There is one last thing to do with the frnames dataframe: make its row indices unique. Indeed, we can see from the previous example that there are two entries for the Haute-Garonne department. We need to keep only one, and, when possible, the preferred form (when the pref column is True).

    This is not always possible: we can see at the beginning of the dataframe (use the head method) that there is no preferred French name for Rwanda.

    >>> frnames.head()
                         frname   pref
    gid
    18918              Protaras   True
    49518  République du Rwanda  False
    49518                Rwanda  False
    50360       Woqooyi Galbeed  False
    51230              Togdheer  False
    

    In such a case, we will take one of the two lines at random. Maybe a clever rule to decide which one to keep would be useful (like involving other columns), but for the purpose of this tutorial it does not matter which one is kept.

    Back to our problem: how do we make indices unique? Pandas provides the duplicated method on Index objects. This method returns a boolean NumPy 1d-array (a vector), the size of which is the number of entries. So, since our dataframe has 55,684 entries, the length of the returned vector is 55,684.

    >>> dups_idx = frnames.index.duplicated()
    >>> dups_idx
    array([False, False,  True, ..., False, False, False], dtype=bool)
    >>> len(dups_idx)
    55684
    

    The meaning is simple: when you encounter True, it means that the index at this position is a duplicate of a previously encountered index. For example, we can see that the third value in dups_idx is True. And indeed, the third line of frnames has an index (49,518) which is a duplicate of the second line.

    So duplicated is meant to mark duplicated indices as True while keeping the first occurrence marked as False (there is an optional parameter to change this: read the doc). How do we make sure that the first entry is the preferred one? By sorting the dataframe, of course! We can sort a table by a column (or by a list of columns) using the sort_values method. We give ascending=False because True > False (that's a philosophical question!), and inplace=True to sort the dataframe in place, so that we do not create a copy (still thinking about memory usage).

    >>> frnames.sort_values('pref', ascending=False, inplace=True)
    >>> frnames.head()
                                   frname  pref
    gid
    18918                        Protaras  True
    8029850  Hôtel de ville de Copenhague  True
    8127417                        Kalmar  True
    8127361                     Sundsvall  True
    8127291                        Örebro  True
    >>> frnames.loc[3013767]
                                      frname   pref
    gid
    3013767                    Haute-Garonne   True
    3013767  Département de la Haute-Garonne  False
    

    Great! All preferred names are now first in the dataframe. We can then use the duplicated index method to filter out duplicated entries.

    >>> frnames = frnames[~(frnames.index.duplicated())]
    >>> frnames.loc[3013767]
    frname     Haute-Garonne
    pref                True
    Name: 3013767, dtype: object
    >>> frnames.loc[49518]
    frname     Rwanda
    pref        False
    Name: 49518, dtype: object
    >>> frnames.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 49047 entries, 18918 to 11441793
    Data columns (total 2 columns):
    frname     49047 non-null object
    pref       49047 non-null bool
    dtypes: bool(1), object(1)
    memory usage: 5.7 MB
    

    We end up with 49,047 French names. Notice the ~ in the filter expression? That's because we want to keep the first entry for each duplicated index, and the first entry is marked False in the vector returned by duplicated.

    Summary for the French names step

    One last thing to do is to remove the pref column which won't be used anymore. We already know how to do it.

    >>> frnames.drop('pref', axis=1, inplace=True)
    

    There has been a lot of talking until now. But to summarize, very few commands were needed to obtain this table with only two columns (gid and frname):

    1. Prepare a smaller file so there is less data to load (keep only French names).

      with open('alternateNames.txt') as f:
          reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
          with open('frAltNames.txt', 'w') as g:
              writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
                                  escapechar='\\')
              for row in reader:
                  if row[2] == u'fr':
                      writer.writerow(row)
      
    2. Load the file into Pandas.

      with open('frAltNames.txt') as altf:
          frnames = pd.read_csv(
              altf, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(1, 3, 4), index_col=0,
              names=('gid', 'frname', 'pref'),
              dtype={'gid': 'uint32', 'frname': str, 'pref': 'bool_'},
              true_values=['1'], false_values=[u''], na_filter=False,
              skipinitialspace=True, error_bad_lines=False
          )
      
    3. Sort on the pref column and remove duplicated indices.

      frnames.sort_values('pref', ascending=False, inplace=True)
      frnames = frnames[~(frnames.index.duplicated())]
      
    4. Remove the pref column.

      frnames.drop('pref', axis=1, inplace=True)
      

    Simple, isn't it? We'll keep this dataframe for later use. Pandas will make it a breeze to merge it with the main dataframe coming from allCountries.txt to create the new frname column.

    The administrative tree step

    But for now, let's look at the second problem: how to derive a gid from a feature_code, a country_code, and a bunch of adminX_code (X=1, 2, 3, 4).

    Loading file

    First we need the administrative part of the file allCountries.txt, that is all places with the A feature class.

    Of course we could load the whole file into Pandas and then filter to keep only A-class entries, but now you know that this is memory-intensive (and this file is much bigger than alternateNames.txt). So we'll be more clever and first prepare a smaller file.

    >>> with open('allCountries.txt') as f:
    ...     reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    ...     with open('adm_geonames.txt', 'w') as g:
    ...         writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\')
    ...         for row in reader:
    ...             if row[6] == 'A':
    ...                 writer.writerow(row)
    

    Now we load this file into Pandas. Remembering our goal, the resulting dataframe will only be used to compute the gid from the code columns. So the only columns we need are geonameid, feature_code, country_code, admin1_code, admin2_code, admin3_code, and admin4_code.

    What about data types? All these columns are strings except for geonameid, which is an integer. Since it would be painful to type <colname>: str for each key of the dtype parameter dictionary, let's use the fromkeys constructor instead.

    >>> d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
    ...                          'code4'], str)
    >>> d_types['gid'] = 'uint32'  # Shorter geonameid in gid
    

    We can now load the file.

    >>> with open('adm_geonames.txt') as admf:
    ...     admgids = pd.read_csv(
    ...         admf, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
    ...         names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
    ...         dtype=d_types, na_values='', keep_default_na=False,
    ...         error_bad_lines=False
    ...     )
    

    We recognize most parameters in this instruction. Notice that we didn't use index_col=0: gid is now a normal column, not an index, and Pandas will automatically generate a row index with integers starting at 0.

    Two new parameters are na_values and keep_default_na. The first one gives Pandas additional strings to be considered as NaN (Not a Number). The astute reader would say that empty strings ('') are already considered by Pandas as NaN, and would be right.

    But here comes the second parameter which, if set to False, tells Pandas to forget about its default list of strings recognized as NaN. The default list contains a bunch of strings like 'N/A' or '#NA' or, and this is interesting, simply 'NA'. But 'NA' is used in allCountries.txt as the country code for Namibia. If we kept the default list, this whole country would be ignored. So the combination of these two parameters (illustrated right after the list below) tells Pandas to:

    • reset its default list of NaN strings,
    • use '' as a NaN string, which, consequently, will be the only one.
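
    Here is a small, self-contained illustration of the difference; the two-column data below is made up for the example, only the 'NA' country code is the real point:

    >>> import io
    >>> sample = "gid\tcode0\n42\tNA\n"  # 'NA' is Namibia's country code
    >>> pd.read_csv(io.StringIO(sample), sep='\t').code0.tolist()
    [nan]
    >>> pd.read_csv(io.StringIO(sample), sep='\t',
    ...             na_values='', keep_default_na=False).code0.tolist()
    ['NA']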

    That's it for NaN. What can Pandas tell us about our dataframe?

    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 369277 entries, 0 to 369276
    Data columns (total 7 columns):
    gid      369277 non-null uint32
    fcode    369273 non-null object
    code0    369272 non-null object
    code1    369197 non-null object
    code2    288494 non-null object
    code3    215908 non-null object
    code4    164899 non-null object
    dtypes: object(6), uint32(1)
    memory usage: 128.8 MB
    

    Note the RangeIndex which Pandas has created for us. The dataframe takes about 130 MB of memory because it has a lot of object columns. Apart from that, everything looks good.

    Replacing values

    One thing we want to do before going on is to replace all 'PCL<X>' values in the fcode column with just 'PCL'. This will make our life easier later when searching in this dataframe.

    >>> pd.unique(admgids.fcode)
    array([u'ADMD', u'ADM1', u'PCLI', u'ADM2', u'ADM3', u'ADM2H', u'PCLD',
           u'ADM4', u'ZN', u'ADM1H', u'PCLH', u'TERR', u'ADM3H', u'PRSH',
           u'PCLIX', u'ADM5', u'ADMDH', nan, u'PCLS', u'LTER', u'ADM4H',
           u'ZNB', u'PCLF', u'PCL'], dtype=object)
    

    Pandas provides the replace method for that.

    >>> admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True, inplace=True)
    >>> pd.unique(admgids.fcode)
    array([u'ADMD', u'ADM1', u'PCL', u'ADM2', u'ADM3', u'ADM2H', u'ADM4',
           u'ZN', u'ADM1H', u'TERR', u'ADM3H', u'PRSH', u'ADM5', u'ADMDH', nan,
           u'LTER', u'ADM4H', u'ZNB'], dtype=object)
    

    The replace method has a lot of different signatures; refer to the Pandas documentation for a comprehensive description. Here, with the dictionary, we are saying to look only in the fcode column and, within this column, to replace strings matching the regular expression with the given value. Since we are using regular expressions, the regex parameter must be set to True.

    And as usual, the inplace parameter avoids creation of a copy.

    Multi-indexing

    Remember the goal: we want to be able to get the gid from the other columns. Well, dear reader, you'll be happy to know that Pandas allows an index to be composite, that is composed of multiple columns: what Pandas calls a MultiIndex.

    To put it simply, a multi-index is useful when you have hierarchical indices. Consider for example the following table.

    lvl1  lvl2    N    S
    A     AA     11  1x1
    A     AB     12  1x2
    B     BA     21  2x1
    B     BB     22  2x2
    B     BC     23  2x3

    If we were to load such a table into a Pandas dataframe df (exercise: do it; one possible construction is sketched after the example below), we would be able to use it as follows.

    >>> df.loc['A']
         N    S
    AA  11  1x1
    AB  12  1x2
    >>> df.loc['B']
         N    S
    BA  21  2x1
    BB  22  2x2
    BC  23  2x3
    >>> df.loc[('A',), 'S']
    AA    1x1
    AB    1x2
    Name: S, dtype: object
    >>> df.loc[('A', 'AB'), 'S']
    '1x2'
    >>> df.loc['B', 'BB']
    N     22
    S    2x2
    Name: (B, BB), dtype: object
    >>> df.loc[('B', 'BB')]
    N     22
    S    2x2
    Name: (B, BB), dtype: object
    

    So, basically, we can query the multi-index using tuples (and we can omit the tuples if column indexing is not involved). But most importantly, we can query a multi-index partially: df.loc['A'] returns a sub-dataframe with one level of index gone.
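
    As a side note, here is one possible way to build that small example dataframe (the exercise above), using MultiIndex.from_tuples:

    >>> df = pd.DataFrame(
    ...     {'N': [11, 12, 21, 22, 23],
    ...      'S': ['1x1', '1x2', '2x1', '2x2', '2x3']},
    ...     index=pd.MultiIndex.from_tuples(
    ...         [('A', 'AA'), ('A', 'AB'), ('B', 'BA'), ('B', 'BB'), ('B', 'BC')],
    ...         names=('lvl1', 'lvl2')),
    ... )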

    Back to our subject: we clearly have hierarchical information in our code columns: the country code, then the admin1 code, then the admin2 code, and so on. Moreover, we can put the feature code at the top. But how do we do that?

    It couldn't be simpler. The main issue is to find the correct method: set_index.

    >>> admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
    ...                    inplace=True)
    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 369277 entries, (ADMD, AD, 00, nan, nan, nan) to (PCLH, nan, nan, nan, nan, nan)
    Data columns (total 1 columns):
    gid      369277 non-null uint32
    dtypes: uint32(1)
    memory usage: 21.9 MB
    >>> admgids.head()
                                             gid
    fcode code0 code1 code2 code3 code4
    ADMD  AD    00    NaN   NaN   NaN    3038817
                                  NaN    3039039
    ADM1  AD    06    NaN   NaN   NaN    3039162
                05    NaN   NaN   NaN    3039676
                04    NaN   NaN   NaN    3040131
    

    That's not really it, because Pandas has kept the original dataframe order, so the multi-index is all messed up. No problem, there is the sort_index method.

    >>> admgids.sort_index(inplace=True)
    >>> admgids.info(memory_usage='deep')
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 369277 entries, (nan, CA, 10, 02, 94235, nan) to (ZNB, SY, 00, nan, nan, nan)
    Data columns (total 1 columns):
    gid      369277 non-null uint32
    dtypes: uint32(1)
    memory usage: 21.9 MB
    >>> admgids.head()
                                             gid
    fcode code0 code1 code2 code3 code4
    NaN   CA    10    02    94235 NaN    6544163
          CN    NaN   NaN   NaN   NaN    6255000
          CY    04    NaN   NaN   NaN    6640324
          MX    16    NaN   NaN   NaN    6618819
    ADM1  AD    02    NaN   NaN   NaN    3041203
    

    Much better. We can even see that there are a few entries without a feature code.

    Let's see if all this effort was worth it. Can we fulfill our goal? Can we get a gid from a bunch of codes? Can we get the Haute-Garonne gid?

    >>> admgids.loc['ADM2', 'FR', '76', '31']
                     gid
    code3 code4
    NaN   NaN    3013767
    

    And for the Toulouse ADM4?

    >>> admgids.loc['ADM4', 'FR', '76', '31', '313', '31555']
                                             gid
    fcode code0 code1 code2 code3 code4
    ADM4  FR    76    31    313   31555  6453974
    

    Cheers! You've earned a glass of wine!

    Summary for the administrative data step

    Before moving on to our grand finale, let's summarize what has been done to prepare the administrative data.

    1. We've prepared a smaller file with only administrative data.

      with open('allCountries.txt') as f:
          reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
          with open('adm_geonames.txt', 'w') as g:
              writer = csv.writer(g, delimiter='\t', quoting=csv.QUOTE_NONE,
                                  escapechar='\\')
              for row in reader:
                  if row[6] == 'A':
                      writer.writerow(row)
      
    2. We've loaded the file into Pandas.

      d_types = dict.fromkeys(['fcode', 'code0', 'code1', 'code2', 'code3',
                               'code4'], str)
      d_types['gid'] = 'uint32'
      with open('adm_geonames.txt') as admf:
          admgids = pd.read_csv(
              admf, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(0, 7, 8, 10, 11, 12, 13),
              names=('gid', 'fcode', 'code0', 'code1', 'code2', 'code3', 'code4'),
              dtype=d_types, na_values='', keep_default_na=False,
              error_bad_lines=False
          )
      
    3. We've replaced all 'PCL<XX>' values with just 'PCL' in the fcode column.

      admgids.replace({'fcode': {r'PCL[A-Z]{1,2}': 'PCL'}}, regex=True,
                      inplace=True)
      
    4. Then we've created a multi-index with columns fcode, code0, code1, code2, code3, code4.

      admgids.set_index(['fcode', 'code0', 'code1', 'code2', 'code3', 'code4'],
                         inplace=True)
      admgids.sort_index(inplace=True)
      

    Putting it all together

    The time has finally come to load the main file: allCountries.txt. On one hand, we will be able to use the frnames dataframe to get the French name for each entry and populate the frname column; on the other hand, we will use the admgids dataframe to compute the parent gid for each line.

    Loading file

    On my computer, loading the whole allCountries.txt at once takes 5.5 GB of memory, clearly too much! And in this case, there is no trick to reduce the size of the file first: we want all the data.

    Pandas can help us with the chunksize parameter of the read_csv function. It allows us to read the file chunk by chunk (it returns an iterator). The idea is to first create an empty CSV file for our final data, then read each chunk, perform data manipulation (that is, add frname and parent_gid) on this chunk, and append the data to the file.

    So we load data into Pandas the usual way. The only difference is that we add a new parameter, chunksize, with value 1,000,000. You can choose a smaller or larger number of rows depending on your memory limit.

    >>> d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
    ...                          'code2', 'code3', 'code4'], str)
    >>> d_types['gid'] = 'uint32'
    >>> d_types['lat'] = 'float16'
    >>> d_types['lng'] = 'float16'
    >>> with open('allCountries.txt') as geof:
    ...     reader = pd.read_csv(
    ...         geof, encoding='utf-8',
    ...         sep='\t', quoting=csv.QUOTE_NONE,
    ...         header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
    ...         names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
    ...                'code0', 'code1', 'code2', 'code3', 'code4'),
    ...         dtype=d_types, index_col=0,
    ...         na_values='', keep_default_na=False,
    ...         chunksize=1000000, error_bad_lines=False
    ...     )
    ...     for chunk in reader:
    ...         pass  # We will put here code to work on each chunk
    

    Joining tables

    Pandas can perform a JOIN on two dataframes, much like in SQL. The function to do so is merge.

    ...     for chunk in reader:
    ...         chunk = pd.merge(chunk, frnames, how='left',
    ...                          left_index=True, right_index=True)
    

    merge expects first the two dataframes to be joined. The how parameter tells what type of JOIN to perform (it can be left, right, inner, ...). Here we want to keep all lines in chunk, which is the first parameter, so it is 'left' (if chunk were the second parameter, it would have been 'right').

    left_index=True and right_index=True tell Pandas that the pivot column is the index in each table. Indeed, in our case the gid index will be used in both tables to compute the merge. If in one table, for example the right one, the pivot column is not the index, one can set right_index=False and add the parameter right_on='<column_name>' (the same parameter exists for the left table).

    Additionally, if there are name clashes (same column name in both tables), one can also use the suffixes parameter, for example suffixes=('_first', '_second').
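
    For the record, here is how the same join would be spelled if, hypothetically, frnames kept gid as a regular column instead of as its index (just a sketch, not part of our actual pipeline):

    ...         chunk = pd.merge(chunk, frnames, how='left',
    ...                          left_index=True, right_on='gid',
    ...                          suffixes=('', '_alt'))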

    And that's all we need to add the frname column. Simple, isn't it?

    Computing parent gid

    The parent_gid column is trickier. We'll delegate the computation of the gid from administrative codes to a separate function. But first, let's define two dictionaries (one the reverse of the other) linking fcode values and level numbers.

    >>> level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
    >>> fcode_level = dict((v, k) for k, v in level_fcode.items())
    

    And here is the function computing the Geonames id from administrative codes.

    >>> def geonameid_from_codes(level, **codes):
    ...     """Return the Geoname id of the *administrative* place with the
    ...     given information.
    ...
    ...     Return ``None`` if there is no match.
    ...
    ...     ``level`` is an integer from 0 to 4. 0 means we are looking for a
    ...     political entity (``PCL<X>`` feature code), 1 for a first-level
    ...     administrative division (``ADM1``), and so on until fourth-level
    ...     (``ADM4``).
    ...
    ...     Then user must provide at least one of the ``code0``, ...,
    ...     ``code4`` keyword parameters, depending on ``level``.
    ...
    ...     Examples::
    ...
    ...         >>> geonameid_from_codes(level=0, code0='RE')
    ...         935317
    ...         >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
    ...         ...                      code2='974', code3='9742')
    ...         935213
    ...         >>> geonameid_from_codes(level=0, code0='AB')  # None
    ...
    ...     """
    ...     try:
    ...         idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
    ...     except KeyError:
    ...         raise ValueError('Not enough codeX parameters for level {0}'.format(level))
    ...     idx = (level_fcode[level],) + idx
    ...     try:
    ...         return admgids.loc[idx, 'gid'].values[0]
    ...     except (KeyError, TypeError):
    ...         return None
    ...
    

    A few comments about this function: the first part builds a tuple idx with the given codes. Then this idx is used as an index value in the admgids dataframe to find the matching gid.

    We also need another function which will compute the parent gid for a Pandas row.

    >>> from six import string_types
    >>> def parent_geonameid(row):
    ...     """Return the Geoname id of the parent of the given Pandas row.
    ...
    ...     Return ``None`` if we can't find the parent's gid.
    ...     """
    ...     # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
    ...     level = fcode_level.get(row.fcode)
    ...     if (level is None and isinstance(row.fcode, string_types)
    ...             and len(row.fcode) >= 3):
    ...         level = fcode_level.get(row.fcode[:3], 5)
    ...     level = level or 5
    ...     level -= 1
    ...     if level < 0:  # We were on a country, no parent
    ...         return None
    ...     # Compute available codes
    ...     l = list(range(5))
    ...     while l and pd.isnull(row['code{0}'.format(l[-1])]):
    ...         l.pop()  # Remove NaN values backwards from code4
    ...     codes = {}
    ...     for i in l:
    ...         if i > level:
    ...             break
    ...         code_label = 'code{0}'.format(i)
    ...         codes[code_label] = row[code_label]
    ...     try:
    ...         return geonameid_from_codes(level, **codes)
    ...     except ValueError:
    ...         return None
    ...
    

    In this function, we first look at the row's fcode to get the row's administrative level. If the fcode is ADM1 to ADM4 we get the level directly. If it is PCL<X>, we get level 0 from the PCL prefix. Otherwise we set the level to 5 to say that the place is below level 4. The parent's level is then the found level minus one, and if it is -1, we know that we were on a country and there is no parent.

    Then we compute all available administrative codes, removing codes with NaN values from the end.

    With the level and the codes, we can search for the parent's gid using the previous function.

    Now, how do we use this function? No need for a for loop: Pandas gives us the apply method.

    ...     for chunk in reader:
    ...         # (...) pd.merge as before
    ...         parent_gids = chunk.apply(parent_geonameid, axis=1)
    

    This will apply parent_geonameid to each row and return a new Pandas series whose head looks like this.

    gid
    2986043     nan
    2993838     nan
    2994701     nan
    3007683     nan
    3017832     nan
    ...
    3039162     3041565.0
    

    None values have been converted to NaN. As a consequence, integer values have been converted to floats (you cannot have NaN within an integer column), and this is not what we want. As a compromise, we are going to convert the series into str and strip the decimal part.

    ...         parent_gids = parent_gids.astype(str)  # no inplace=True here
    ...         parent_gids.replace(r'\.0', '', regex=True, inplace=True)
    

    We also add a label to the column. That's the name of the column in our future dataframe.

    ...         parent_gids = parent_gids.rename('parent_gid')
    

    And we can now append this new column to our chunk dataframe.

    ...         chunk = pd.concat([chunk, parent_gids], axis=1)
    

    We're almost there. Before we can save the chunk in a CSV file, we must reorganize its columns to the expected order. For now, the frname and parent_gid columns have been appended at the end of the dataframe.

    ...         chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
    ...                                        'fcode', 'parent_gid', 'lat',
    ...                                        'lng'])
    

    At last, we save the chunk to the file opened in append mode.

    ...         chunk.to_csv('final.txt', mode='a', encoding='utf-8',
    ...                      quoting=csv.QUOTE_NONE, sep='\t', header=None)
    

    Caching to speed up

    Currently, creating the new CSV file from Geonames takes hours, and this is not acceptable. There are multiple ways to make things go faster. One of the most significant changes is to cache the results of the parent_geonameid function. Indeed, many places in Geonames share the same parent; computing the parent gid once and caching it sounds like a good idea.

    If you are using Python 3, you can simply put the @functools.lru_cache decorator on the geonameid_from_codes function (parent_geonameid itself cannot be memoized this way, because the Pandas row it receives is not hashable). But let us try to define our own custom cache.
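
    For reference, here is what the lru_cache approach could look like (a minimal sketch, Python 3 only); parent_geonameid would then call this wrapper instead of calling geonameid_from_codes directly.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def cached_geonameid_from_codes(level, **codes):
        # The level and the administrative codes are plain hashable values,
        # so lru_cache can key on them directly.
        return geonameid_from_codes(level, **codes)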

    >>> from six import string_types
    >>> gid_cache = {}
    >>> def parent_geonameid(row):
    ...     """Return the Geoname id of the parent of the given Pandas row.
    ...
    ...     Return ``None`` if we can't find the parent's gid.
    ...     """
    ...     # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
    ...     level = fcode_level.get(row.fcode)
    ...     if (level is None and isinstance(row.fcode, string_types)
    ...             and len(row.fcode) >= 3):
    ...         level = fcode_level.get(row.fcode[:3])
    ...     if level is None:
    ...         level = 5  # Below ADM4
    ...     level -= 1
    ...     if level < 0:  # We were on a country, no parent
    ...         return None
    ...     # Compute available codes
    ...     l = list(range(5))
    ...     while l and pd.isnull(row['code{0}'.format(l[-1])]):
    ...         l.pop()  # Remove NaN values backwards from code4
    ...     codes = {}
    ...     code_tuple = [level]
    ...     for i in l:
    ...         if i > level:
    ...             break
    ...         code_label = 'code{0}'.format(i)
    ...         code = row[code_label]
    ...         codes[code_label] = code
    ...         code_tuple.append(code)
    ...     code_tuple = tuple(code_tuple)
    ...     try:
    ...         parent_gid = (gid_cache.get(code_tuple)
    ...                       or geonameid_from_codes(level, **codes))
    ...     except ValueError:
    ...         parent_gid = None
    ...     # Put value in cache if not already to speed up future lookup
    ...     if code_tuple not in gid_cache:
    ...         gid_cache[code_tuple] = parent_gid
    ...     return parent_gid
    

    The only difference with the previous version is the use of a gid_cache dictionary. Keys for this dictionary are tuples (<level>, <code0>[, <code1>, ..., <code4>]) (stored in the code_tuple variable), and the corresponding value is the parent gid for this combination of level and codes. The parent_gid is first looked up in this dictionary for a previously cached result, otherwise it is computed with the geonameid_from_codes function as before, and the result is then cached.

    Summary for the final step

    Let's review what we have done.

    1. We have defined three useful dictionaries: two to map between a feature code and a level number (one in each direction), and one to cache computation results.

      level_fcode = {0: 'PCL', 1: 'ADM1', 2: 'ADM2', 3: 'ADM3', 4: 'ADM4'}
      fcode_level = dict((v, k) for k, v in level_fcode.items())
      gid_cache = {}
      
    2. We have defined a function computing a Geoname id from administrative codes.

      def geonameid_from_codes(level, **codes):
          """Return the Geoname id of the *administrative* place with the
          given information.
      
          Return ``None`` if there is no match.
      
          ``level`` is an integer from 0 to 4. 0 means we are looking for a
          political entity (``PCL<X>`` feature code), 1 for a first-level
          administrative division (``ADM1``), and so on until fourth-level
          (``ADM4``).
      
          Then user must provide at least one of the ``code0``, ...,
          ``code4`` keyword parameters, depending on ``level``.
      
          Examples::
      
              >>> geonameid_from_codes(level=0, code0='RE')
              935317
              >>> geonameid_from_codes(level=3, code0='RE', code1='RE',
              ...                      code2='974', code3='9742')
              935213
              >>> geonameid_from_codes(level=0, code0='AB')  # None
              >>>
      
          """
          try:
              idx = tuple(codes['code{0}'.format(i)] for i in range(level+1))
          except KeyError:
              raise ValueError('Not enough codeX parameters for level {0}'.format(level))
          idx = (level_fcode[level],) + idx
          try:
              return admgids.loc[idx, 'gid'].values[0]
          except (KeyError, TypeError):
              return None
      
    3. We have defined a function computing the parent's gid of a Pandas row.

      def parent_geonameid(row):
          """Return the Geoname id of the parent of the given Pandas row.
      
          Return ``None`` if we can't find the parent's gid.
          """
          # Get the parent's administrative level (PCL or ADM1, ..., ADM4)
          level = fcode_level.get(row.fcode)
          if (level is None and isinstance(row.fcode, string_types)
                  and len(row.fcode) >= 3):
              level = fcode_level.get(row.fcode[:3])
          if level is None:
              level = 5  # Below ADM4
          level -= 1
          if level < 0:  # We were on a country, no parent
              return None
          # Compute available codes
          l = list(range(5))
          while l and pd.isnull(row['code{0}'.format(l[-1])]):
              l.pop()  # Remove NaN values backwards from code4
          codes = {}
          code_tuple = [level]
          for i in l:
              if i > level:
                  break
              code_label = 'code{0}'.format(i)
              code = row[code_label]
              codes[code_label] = code
              code_tuple.append(code)
          code_tuple = tuple(code_tuple)
          try:
              parent_gid = (gid_cache.get(code_tuple)
                            or geonameid_from_codes(level, **codes))
          except ValueError:
              parent_gid = None
          # Put value in cache if not already to speed up future lookup
          if code_tuple not in gid_cache:
              gid_cache[code_tuple] = parent_gid
          return parent_gid
      
    4. And finally we have loaded the file allCountries.txt into Pandas using chunks of 1,000,000 rows to save memory. For each chunk, we have merged it with the frnames table to add the frname column, and applied the parent_geonameid function to add the parent_gid column. We then reordered the columns and appended the chunk to the final CSV file.

      d_types = dict.fromkeys(['fclass', 'fcode', 'code0', 'code1',
                               'code2', 'code3', 'code4'], str)
      d_types['gid'] = 'uint32'
      d_types['lat'] = 'float16'
      d_types['lng'] = 'float16'
      with open('allCountries.txt') as geof:
          reader = pd.read_csv(
              geof, encoding='utf-8',
              sep='\t', quoting=csv.QUOTE_NONE,
              header=None, usecols=(0, 1, 4, 5, 6, 7, 8, 10, 11, 12, 13),
              names=('gid', 'name', 'lat', 'lng', 'fclass', 'fcode',
                     'code0', 'code1', 'code2', 'code3', 'code4'),
              dtype=d_types, index_col=0,
              na_values='', keep_default_na=False,
              chunksize=1000000, error_bad_lines=False
          )
          for chunk in reader:
              chunk = pd.merge(chunk, frnames, how='left',
                               left_index=True, right_index=True)
              parent_gids = chunk.apply(parent_geonameid, axis=1)
              parent_gids = parent_gids.astype(str)  # no inplace=True here
              parent_gids.replace(r'\.0', '', regex=True, inplace=True)
              parent_gids = parent_gids.rename('parent_gid')
              chunk = pd.concat([chunk, parent_gids], axis=1)
              chunk = chunk.reindex(columns=['frname', 'name', 'fclass',
                                             'fcode', 'parent_gid', 'lat',
                                             'lng'])
              chunk.to_csv('final.txt', mode='a', encoding='utf-8',
                           quoting=csv.QUOTE_NONE, sep='\t', header=None)
      

    This final part is the longest, because the parent_geonameid function takes some time on each chunk to compute all the parent gids. But at the end of the process we'll proudly see a final.txt file with the data the way we want it, and without using too much memory... High five!

    What can be improved

    Congratulations! This ends our journey into the Pandas world.

    Regarding Geonames, to be honest, we've only scratched the surface of its complexity. There's so much more to be done.

    If you look at the file we've just produced, you'll see plenty of empty values in the parent_gid column. Maybe our method to get the Geonames id of the parent needs to be improved. Maybe all those orphan places should be moved inside their countries.
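
    For instance, one simple fallback (a sketch building on the functions defined above, not tested against the full dump) would be to attach orphan places directly to their country when no closer parent is found.

    def parent_or_country_gid(row):
        """Like parent_geonameid, but fall back to the country's gid."""
        gid = parent_geonameid(row)
        if gid is None and not pd.isnull(row['code0']):
            # Level 0 is the political entity (PCL<X>), i.e. the country.
            gid = geonameid_from_codes(0, code0=row['code0'])
        return gid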

    Another problem lies within the Geonames data itself. France has overseas territories, for example Reunion Island, Geonames id 935,317. This place has feature code PCLD, which means "dependent political entity". And indeed, Reunion Island is not a country and should not appear at the top level of the tree, at the same level as France. So some work should be done here to link Reunion Island to France in some way, maybe using the until now ignored cc2 column (for "alternate country codes").

    Still another improvement, an easier one, is to add yet another parent level for continents. For this, one can use the file countryInfo.txt, downloadable from the same page.

    Considering speed this time, there is also room for improvement. First, the code itself might be better designed to avoid some tests and for loops. Another possibility is to use multiprocessing, since each chunk of allCountries.txt is independent: worker processes can put their finished chunks on a queue that a writer process reads to write the data to the output file, as sketched below. Another way to go is Cython (see: http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html).
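
    Here is a rough sketch of the multiprocessing idea (Python 3). It assumes a hypothetical process_chunk() helper doing the merge and parent-gid work shown above and returning the chunk as CSV text, and it reuses the chunked reader from the previous section; the queue sizes and number of workers are illustrative only.

    import multiprocessing as mp

    def worker(in_queue, out_queue):
        # Transform chunks until the None sentinel is received.
        for chunk in iter(in_queue.get, None):
            out_queue.put(process_chunk(chunk))  # hypothetical helper: merge + parent gids -> CSV text

    def writer(out_queue):
        # A single process appends the results, so writes never interleave.
        with open('final.txt', 'a', encoding='utf-8') as out:
            for csv_text in iter(out_queue.get, None):
                out.write(csv_text)

    if __name__ == '__main__':
        in_queue, out_queue = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
        writer_proc = mp.Process(target=writer, args=(out_queue,))
        writer_proc.start()
        workers = [mp.Process(target=worker, args=(in_queue, out_queue))
                   for _ in range(4)]
        for w in workers:
            w.start()
        for chunk in reader:       # the chunked pd.read_csv iterator from above
            in_queue.put(chunk)
        for _ in workers:
            in_queue.put(None)     # one sentinel per worker
        for w in workers:
            w.join()
        out_queue.put(None)        # stop the writer once all workers are done
        writer_proc.join()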


  • Understanding Geonames dump

    2017/02/20 by Yann Voté

    The aim of this two-part blog post is to show some useful, yet not very complicated, features of the Pandas Python library that are not found in most (numerically oriented) tutorials.

    We will illustrate these techniques with Geonames data: extract useful data from the Geonames dump, transform it, and load it into another file. There is no numeric computation involved here, nor statistics. We will show that Pandas can be used in a wide range of cases beyond numerical analysis.

    This first part is an introduction to Geonames data. The real work with Pandas will be shown in the second part. You can skip this part and go directly to part 2 if you are already familiar with Geonames.

    Main data

    Geonames data can be downloaded from http://download.geonames.org/export/dump/. The main file to download is allCountries.zip. Once extracted, you'll get a CSV file named allCountries.txt which contains nearly all Geonames data. In this file, fields are separated by tabs.

    A sample Geonames entry

    Let's look in Geonames data for the city of Toulouse in France.

    $ grep -P '\tToulouse\t' allCountries.txt
    

    You'll find multiple results (including one in the United States of America), and among them the following line.

    2972315     Toulouse        Toulouse        Gorad Tuluza,Lapangan Terbang Blagnac,TLS,Tolosa,(...)  43.60426        1.44367 P       PPLA    FR              76      31      313     31555   433055          150     Europe/Paris    2016-02-18
    

    What is the meaning of each column? There is no header line at the top of the file... Go back to the web page from which you downloaded the file. Below the download links, you'll find some documentation. In particular, consider the following excerpt.

    The main 'geoname' table has the following fields :
    ---------------------------------------------------
    geonameid         : integer id of record in geonames database
    name              : name of geographical point (utf8) varchar(200)
    asciiname         : name of geographical point in plain ascii characters, varchar(200)
    alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
    latitude          : latitude in decimal degrees (wgs84)
    longitude         : longitude in decimal degrees (wgs84)
    feature class     : see http://www.geonames.org/export/codes.html, char(1)
    feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
    country code      : ISO-3166 2-letter country code, 2 characters
    cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
    admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
    admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
    admin3 code       : code for third level administrative division, varchar(20)
    admin4 code       : code for fourth level administrative division, varchar(20)
    population        : bigint (8 byte int)
    elevation         : in meters, integer
    dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
    timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
    modification date : date of last modification in yyyy-MM-dd format
    

    So we see that entry 2,972,315 represents a place named Toulouse, whose latitude is 43.60 and longitude is 1.44. The place is located in France (FR country code), its population is estimated at 433,055, its elevation is 150 m, and its timezone is the same as Paris.

    To understand the meaning of P and PPLA, as feature class and feature code respectively, the indicated web page at http://www.geonames.org/export/codes.html provides us with the following information.

    P city, village,...
    PPLA        seat of a first-order administrative division
    

    Thus line 2,972,315 is about a city which is the seat of a first-order administrative division in France. And indeed, Toulouse is the seat of the Occitanie region in France.

    Geonames administrative divisions

    So let's look for Occitanie.

    $ grep -P '\tOccitanie\t' allCountries.txt
    

    You'll end up with the following line.

    11071623    Occitanie       Occitanie       Languedoc-Roussillon-Midi-Pyrenees,(...)        44.02722        1.63559 A       ADM1    FR              76              5626858         188     Europe/Paris    2016-11-16
    

    This is entry number 11,071,623, and we have latitude, longitude, population, elevation and timezone as usual. The place's class is A and its code is ADM1. Information about these codes can be found on the same page as before.

    A country, state, region,...
    ADM1        first-order administrative division
    

    This confirms that Occitanie is a first-order administrative division of France.

    Now look at the adminX_code part of the Occitanie line (X=1, 2, 3, 4).

    76
    

    Only column admin1_code is filled with value 76. Compare this with the same part from the Toulouse line.

    76  31      313     31555
    

    Here all columns admin1_code, admin2_code, admin3_code and admin4_code are given a value, respectively 76, 31, 313, and 31555.

    Furthermore, we see that column admin1_code matches for Occitanie and Toulouse (value 76). This lets us deduce that Toulouse is actually a city located in Occitanie.

    So, following the same logic, we can infer that Toulouse is also located in a second-order administrative division of France (feature code ADM2) with admin1_code of 76 and admin2_code equal to 31. Let's look for such a line in allCountries.txt.

    $ grep -P '\tADM2\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t'
    

    We get the following line.

    3013767     Département de la Haute-Garonne Departement de la Haute-Garonne Alta Garonna,Alto Garona,(...)  43.41667        1.5     A       ADM2    FR              76      31                      1254347         181     Europe/Paris    2016-02-18
    

    Success! Toulouse is actually in the Haute-Garonne department, which is a department in Occitanie region.

    Is this going to work for third-level administrative divisions? Let's look for an ADM3 line, with admin1_code=76, admin2_code=31, and admin3_code=313.

    $ grep -P '\tADM3\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t' | grep -P '\t313\t'
    

    Here is the result.

    2972314     Arrondissement de Toulouse      Arrondissement de Toulouse      Arrondissement de Toulouse      43.58333        1.5     A       ADM3    FR              76      31      313             972098          139     Europe/Paris    2016-12-05
    

    Still works! Finally, let's find out the fourth-order administrative division which contains the city of Toulouse.

    $ grep -P '\tADM4\t' allCountries.txt | grep -P '\tFR\t' | grep -P '\t76\t' \
    > | grep -P '\t31\t' | grep -P '\t313\t' | grep -P '\t31555\t'
    

    We get a place also named Toulouse.

    6453974     Toulouse        Toulouse        Toulouse        43.60444        1.44194 A       ADM4    FR              76      31      313     31555   440204          153     Europe/Paris    2016-02-18
    

    That's a surprise. That's because in France, an arrondissement is the smallest administrative division above a city. So for Geonames, the city of Toulouse is both an administrative place (feature class A with feature code ADM4) and a populated place (feature class P with feature code PPLA), so there are two different entries with the same name (beware that this may be different for other countries).

    And that's it! We have found the whole hierarchy of administrative divisions above the city of Toulouse, thanks to the adminX_code columns: Occitanie, Département de la Haute-Garonne, Arrondissement de Toulouse, Toulouse (ADM4), and Toulouse (PPLA).

    That's it, really? What about the top-most administrative level, the country? This may not be intuitive: feature codes for countries start with PCL (for political entity).

    $ grep -P '\tPCL' allCountries.txt | grep -P '\tFR\t'
    

    Among the results, there's only one PCLI, which means independent political entity.

    3017382     Republic of France      Republic of France      An Fhrainc,An Fhraing,(...)     46      2       A       PCLI    FR              00      64768389                543     Europe/Paris    2015-01-08
    

    Now we have the whole hierarchy, summarized in the following table.

    level   | name                            | id       | fclass | fcode | ccode | adm1_code | adm2_code | adm3_code | adm4_code
    country | Republic of France              | 3017382  | A      | PCLI  | FR    |           |           |           |
    level 1 | Occitanie                       | 11071623 | A      | ADM1  | FR    | 76        |           |           |
    level 2 | Département de la Haute-Garonne | 3013767  | A      | ADM2  | FR    | 76        | 31        |           |
    level 3 | Arrondissement de Toulouse      | 2972314  | A      | ADM3  | FR    | 76        | 31        | 313       |
    level 4 | Toulouse                        | 6453974  | A      | ADM4  | FR    | 76        | 31        | 313       | 31555
    city    | Toulouse                        | 2972315  | P      | PPLA  | FR    | 76        | 31        | 313       | 31555

    Alternate names

    In the previous example, "Republic of France" is the main name for Geonames. But this is not the name in French ("République française"), and even the name in French is not the most commonly used name which is "France".

    In the same way, "Département de la Haute-Garonne" is not the most commonly used name for the department, which is just "Haute-Garonne".

    The fourth column in allCountries.txt provides a comma-separated list of alternate names for a place, in other languages and in other forms. But this is not very useful because we can't decide which form in the list is in which language.

    For this, the Geonames project provides another file to download: alternateNames.zip. Go back to the download page, download it, and extract it. You'll get another tab-separated CSV file named alternateNames.txt.

    Let's look at alternate names for the Haute-Garonne department.

    $ grep -P '\t3013767\t' alternateNames.txt
    

    Multiple names are printed.

    2080345     3013767 fr      Département de la Haute-Garonne
    2080346     3013767 es      Alto Garona     1       1
    2187116     3013767 fr      Haute-Garonne   1       1
    2431178     3013767 it      Alta Garonna    1       1
    2703076     3013767 en      Upper Garonne   1       1
    2703077     3013767 de      Haute-Garonne   1       1
    3047130     3013767 link    http://en.wikipedia.org/wiki/Haute-Garonne
    3074362     3013767 en      Haute Garonne
    4288095     3013767         Département de la Haute-Garonne
    10273497    3013767 link    http://id.loc.gov/authorities/names/n80009763
    

    To understand these columns, let's look again at the documentation.

    The table 'alternate names' :
    -----------------------------
    alternateNameId   : the id of this alternate name, int
    geonameid         : geonameId referring to id in table 'geoname', int
    isolanguage       : iso 639 language code 2- or 3-characters; (...)
    alternate name    : alternate name or name variant, varchar(400)
    isPreferredName   : '1', if this alternate name is an official/preferred name
    isShortName       : '1', if this is a short name like 'California' for 'State of California'
    isColloquial      : '1', if this alternate name is a colloquial or slang term
    isHistoric        : '1', if this alternate name is historic and was used in the past
    

    We can see that "Haute-Garonne" is a short version of "Département de la Haute-Garonne" and is the preferred form in French. As an exercise, the reader can confirm in the same way that "France" is the preferred shorter form for "Republic of France" in French.


    And that's it for our introductory journey into Geonames. You are now familiar enough with this data to begin working with it using Pandas in Python. In fact, what we have done until now, namely working with grep commands, is not very useful... See you in part 2!


  • SciviJS

    2016/10/10 by Martin Renou

    Introduction

    The goal of my work at Logilab is to create tools to visualize scientific 3D volumetric-mesh-based data (mechanical data, electromagnetic...) in a standard web browser. It is part of the European OpenDreamKit project. Franck Wang worked on this subject last year; I based my work on his results and tried to improve them.

    Our goal is to create widgets to be used in Jupyter Notebook (formerly IPython) for easy 3D visualization and analysis. We also want to create a graphical user interface in order to enable users to intuitively compute multiple effects on their meshes.

    As Franck Wang worked with X3DOM, which is an open source JavaScript framework that makes it possible to display 3D scenes using HTML nodes, we first thought it was a good idea to keep on working with this framework. But X3DOM is not very well maintained these days, as can be seen on their GitHub repository.

    As a consequence, we decided to take a look at another 3D framework. Our best candidates were:

    • ThreeJS
    • BabylonJS

    ThreeJS and BabylonJS are two well-known open source frameworks for 3D web visualization. They have been well maintained by hundreds of contributors for several years. Even if BabylonJS was initially designed for video games, both engines are interesting for our project, but ThreeJS had some decisive advantages for us.

    Finally, the choice of using ThreeJS was quite obvious because of its Nodes feature, contributed by Sunag Entertainment. It allows users to compose multiple effects like isocolor, threshold, clip plane, etc. As ThreeJS is an Open Source framework, it is quite easy to propose new features and contributors are very helpful.

    ThreeJS

    As we want to compose multiple effects like isocolor and threshold (the pixel color corresponds to a pressure, but if this pressure is under a certain threshold we don't want to display it), it seems a good idea to compose shaders instead of creating one big shader with all the features we want to implement. The problem is that WebGL is still limited (as of the 1.x version) and it is not possible for shaders to exchange data with other shaders. Only the vertex shader can send data to the fragment shader through varyings.

    So it's not really possible to compose shaders, but the good news is we can use the new node system of ThreeJS to easily compute and compose a complex material for a mesh.


    This node system gives a graphical view of what you can do in your code, and you can see that it's really simple to implement effects in order to visualize your data.

    SciviJS

    With these great tools as a solid basis, I designed a first version of a JavaScript library, SciviJS, that aims at loading, displaying and analyzing mesh data in a standard web browser (i.e. without any plugin).

    You can define your visualization in a .yml file containing URLs to your mesh and data and a hierarchy of effects (called block structures).

    See https://demo.logilab.fr/SciviJS/ for an online demo.

    The block structure looks like the following:

    https://www.logilab.org/file/8719790/raw

    Data blocks are instantiated to load the mesh and define basic parameters like color, position etc. Blocks are connected together to form a tree that helps building a visual analysis of your mesh data. Each block receives data (like mesh variables, color and position) from its parent and can modify them independently.

    The following parameters must be set on dataBlocks:

    • coordURL: URL to the binary file containing coordinate values of vertices.
    • facesURL: URL to the binary file containing indices of faces defining the skin of the mesh.
    • tetrasURL: URL to the binary file containing indices of tetrahedrons. Default is ''.
    • dataURL: URL to the binary file containing the data that you want to visualize for each vertex.

    The following parameters can be set on dataBlocks or plugInBlocks:

    • type: type of the block, which is dataBlock or the name of the plugInBlock that you want.
    • colored: define whether or not the 3D object is colored. Default is false, object is rendered gray.
    • colorMap: color map used for coloration, available values are rainbow and gray. Default is rainbow.
    • colorMapMin and colorMapMax: bounds for coloration scaled in [0, 1]. Default is (0, 1).
    • visualizedData: data used as input for coloration. If data are 3D vectors available values are magnitude, X, Y, Z, and default is magnitude. If data are scalar values you don't need to set this parameter.
    • position, rotation, scale: 3D vectors representing position, rotation and scale of the object. Default are [0., 0., 0.], [0., 0., 0.] and [1., 1., 1.].
    • visible: define whether or not the object is visible. Default is true if there's no childrenBlock, false otherwise.
    • childrenBlocks: array of children blocks. Default is empty.

    As of today, there are 6 types of plug-in blocks:

    • Threshold: hide areas of your mesh based on a variable's value and bound parameters

      • lowerBound: lower bound used for threshold. Default is 0 (representing dataMin). If inputData is under lowerBound, then it's not displayed.
      • upperBound: upper bound used for threshold. Default is 1 (representing dataMax). If inputData is above upperBound, then it's not displayed.
      • inputData: data used for threshold effect. Default is visualizedData, but you can set it to magnitude, X, Y or Z.
    • ClipPlane: hide a part of the mesh by cutting it with a plane

      • planeNormal: 3D array representing the normal of the plane used for section. Default is [1., 0., 0.].
      • planePosition: position of the plane for the section. It's a scalar scaled between -1 and 1. Default is 0.
    • Slice: make a slice of your mesh

      • sliceNormal
      • slicePosition
    • Warp: deform the mesh along the direction of an input vector data

      • warpFactor: deformation factor. Default is 1, can be negative.
      • inputData: vector data used for warp effect. Default is data, but you can set it to X, Y or Z to use only one vector component.
    • VectorField: represent the input vector data with arrow glyphs

      • lengthFactor: factor of length of vectors. Default is 1, can be negative.
      • inputData
      • nbVectors: max number of vectors. Default is the number of vertices of the mesh (which is the maximum value).
      • mode: mode of distribution. Default is volume, you can set it to surface.
      • distribution: type of distribution. Default is regular, you can set it to random.
    • Points: represent the data with points

      • pointsSize: size of points in pixels. Default is 3.
      • nbPoints
      • mode
      • distribution

    Using those blocks you can easily render interesting 3D scenes like this:

    https://www.logilab.org/file/8571787/raw https://www.logilab.org/file/8572007/raw

    Future works

    • Integration to Jupyter Notebook
    • As of today you can only define the tree of blocks in a .yml file; we plan to develop a graphical user interface to let users build this tree interactively with drag and drop
    • Support of most file types (for now it only supports binary files)

  • ngReact: getting angular and react to work together

    2016/08/03 by Nicolas Chauvat

    ngReact is an Angular module that allows React components to be used in AngularJS applications.

    I had to work on enhancing an Angular-based application and wanted to provide the additional functionality as an isolated component that I could develop and test without messing with a large Angular controller that several other people were working on.

    Here is my Angular+React "Hello World", with a couple of gotchas that were not underlined in the documentation and took me some time to figure out.

    To set things up, just run:

    $ mkdir angulareacthello && cd angulareacthello
    $ npm init && npm install --save angular ngreact react react-dom
    

    Then write into index.html:

    <!doctype html>
    <html>
         <head>
                 <title>my angular react demo</title>
         </head>
         <body ng-app="app" ng-controller="helloController">
                 <div>
                         <label>Name:</label>
                         <input type="text" ng-model="person.name" placeholder="Enter a name here">
                         <hr>
                         <h1><react-component name="HelloComponent" props="person" /></h1>
                 </div>
         </body>
         <script src="node_modules/angular/angular.js"></script>
         <script src="node_modules/react/dist/react.js"></script>
         <script src="node_modules/react-dom/dist/react-dom.js"></script>
         <script src="node_modules/ngreact/ngReact.js"></script>
         <script>
         // include the ngReact module as a dependency for this Angular app
         var app = angular.module('app', ['react']);
    
         // define a controller that has the name attribute
         app.controller('helloController', function($scope) {
                 $scope.person = { name: 'you' };
         });
    
         // define a React component that displays "Hello {name}"
         var HelloComponent = React.createClass({
                 render: function() {
                         return React.DOM.span(null, "Hello "+this.props.name);
                 }
         });
    
         // tell Angular about this React component
         app.value('HelloComponent', HelloComponent);
    
         </script>
    </html>
    

    It took me some time to get a couple of things clear in my mind.

    <react-component> is not a React component, but an Angular directive that delegates to a React component. Therefore, you should not expect the interface of this tag to be the same as the one of a React component. More precisely, you can only use the props attribute and can not set your react properties by adding more attributes to this tag. If you want to be able to write something like <react-component firstname="person.firstname" lastname="person.lastname"> you will have to use reactDirective to create a specific Angular directive.

    You have to set an object as the props attribute of the react-component tag, because it will be used as the value of this.props in the code of your React class. For example, if you set the props attribute to a string (person.name instead of person in the above example), you will have trouble using it on the React side, because you will get an object built from the enumeration of the string. Therefore, the above example cannot be made simpler. If we had written $scope.name = 'you' we could not have passed it correctly to the React component.

    The above was tested with angular@1.5.8, ngreact@0.3.0, react@15.3.0 and react-dom@15.3.0.

    All in all, it worked well. Thank you to all the developers and contributors of these projects.


  • Testing salt formulas with testinfra

    2016/07/21 by Philippe Pepiot

    In a previous post we talked about an environment to develop salt formulas. To add some spicy requirements, the formula must now handle multiple target OS (Debian and Centos), have tests and a continuous integration (CI) server setup.

    http://testinfra.readthedocs.io/en/latest/_static/logo.png

    A year ago I started writing a framework for this purpose; it's called testinfra and is used to execute commands on remote systems and make assertions on the state and the behavior of the system. The modules API provides a pythonic way to inspect the system. It has a smooth integration with pytest that adds some useful features out of the box, like parametrization to run tests against multiple systems.

    Writing useful tests is not an easy task; my advice is to test code that triggers implicit actions, code that has caused issues in the past, or simply to check that the application is working correctly, as you would do in a shell.

    For instance, this is one of the tests I wrote for the saemref formula:

    def test_saemref_running(Process, Service, Socket, Command):
        assert Service("supervisord").is_enabled
    
        supervisord = Process.get(comm="supervisord")
        # Supervisor run as root
        assert supervisord.user == "root"
        assert supervisord.group == "root"
    
        cubicweb = Process.get(ppid=supervisord.pid)
        # Cubicweb should run as saemref user
        assert cubicweb.user == "saemref"
        assert cubicweb.group == "saemref"
        assert cubicweb.comm == "uwsgi"
        # Should have 2 worker processes with 8 threads each and 1 http process with one thread
        child_threads = sorted([c.nlwp for c in Process.filter(ppid=cubicweb.pid)])
        assert child_threads == [1, 8, 8]
    
        # uwsgi should bind on all ipv4 addresses
        assert Socket("tcp://0.0.0.0:8080").is_listening
    
        html = Command.check_output("curl http://localhost:8080")
        assert "<title>accueil (Référentiel SAEM)</title>" in html
    

    Now we can run tests against a running container by giving its name or docker id to testinfra:

    % testinfra --hosts=docker://1a8ddedf8164 test_saemref.py
    [...]
    test/test_saemref.py::test_saemref_running[docker:/1a8ddedf8164] PASSED
    

    The immediate advantage of writing such a test is that you can reuse it for monitoring purposes; testinfra can behave like a nagios plugin:

    % testinfra -qq --nagios --hosts=ssh://prod test_saemref.py
    TESTINFRA OK - 1 passed, 0 failed, 0 skipped in 2.31 seconds
    .
    

    We can now integrate the test suite in our run-tests.py by adding some code to build and run a provisioned docker image and add a test command that runs testinfra tests against it.

    provision_option = click.option('--provision', is_flag=True, help="Provision the container")
    
    @cli.command(help="Build an image")
    @image_choice
    @provision_option
    def build(image, provision=False):
        dockerfile = "test/{0}.Dockerfile".format(image)
        tag = "{0}-formula:{1}".format(formula, image)
        if provision:
            dockerfile_content = open(dockerfile).read()
            dockerfile_content += "\n" + "\n".join([
                "ADD test/minion.conf /etc/salt/minion.d/minion.conf",
                "ADD {0} /srv/formula/{0}".format(formula),
                "RUN salt-call --retcode-passthrough state.sls {0}".format(formula),
            ]) + "\n"
            dockerfile = "test/{0}_provisioned.Dockerfile".format(image)
            with open(dockerfile, "w") as f:
                f.write(dockerfile_content)
            tag += "-provisioned"
        subprocess.check_call(["docker", "build", "-t", tag, "-f", dockerfile, "."])
        return tag
    
    
    @cli.command(help="Spawn an interactive shell in a new container")
    @image_choice
    @provision_option
    @click.pass_context
    def dev(ctx, image, provision=False):
        tag = ctx.invoke(build, image=image, provision=provision)
        subprocess.call([
            "docker", "run", "-i", "-t", "--rm", "--hostname", image,
            "-v", "{0}/test/minion.conf:/etc/salt/minion.d/minion.conf".format(BASEDIR),
            "-v", "{0}/{1}:/srv/formula/{1}".format(BASEDIR, formula),
            tag, "/bin/bash",
        ])
    
    
    @cli.command(help="Run tests against a provisioned container",
                 context_settings={"allow_extra_args": True})
    @click.pass_context
    @image_choice
    def test(ctx, image):
        import pytest
        tag = ctx.invoke(build, image=image, provision=True)
        docker_id = subprocess.check_output([
            "docker", "run", "-d", "--hostname", image,
            "-v", "{0}/test/minion.conf:/etc/salt/minion.d/minion.conf".format(BASEDIR),
            "-v", "{0}/{1}:/srv/formula/{1}".format(BASEDIR, formula),
            tag, "tail", "-f", "/dev/null",
        ]).strip()
        try:
            ctx.exit(pytest.main(["--hosts=docker://" + docker_id] + ctx.args))
        finally:
            subprocess.check_call(["docker", "rm", "-f", docker_id])
    

    Tests can be run on a local CI server or on Travis; they "just" require a docker server. Here is an example .travis.yml:

    sudo: required
    services:
      - docker
    language: python
    python:
      - "2.7"
    env:
      matrix:
        - IMAGE=centos7
        - IMAGE=jessie
    install:
      - pip install testinfra
    script:
      - python run-tests.py test $IMAGE -- -v
    

    I wrote a dummy formula with the above code, feel free to use it as a template for your own formula or open pull requests and break some tests.

    There is a highly enhanced version of this code in the saemref formula repository, including:

    • Building a provisioned docker image with custom pillars, we use it to run an online demo
    • Destructive tests where each test is run in a dedicated "fresh" container
    • Run Systemd in the containers to get a system close to the production one (this enables the use of Salt service module)
    • Run a postgresql container linked to the tested container for specific tests like upgrading a Cubicweb instance.

    Destructive tests rely on advanced pytest features that may produce weird bugs when mixed together; too much magic is involved here. Also, handling Systemd in docker is really painful and adds a lot of complexity: for instance, some systemctl commands require a running systemd as PID 1, and this is not the case during the docker build phase. So the trade-off between complexity and these features may not be worth it.

    There are also a lot of fairly new tools to develop and test infrastructure code that you could include in your stack, like test-kitchen, serverspec, and goss. Choose your weapon and go test your infrastructure code.


  • Developing salt formulas with docker

    2016/07/21 by Philippe Pepiot
    https://www.logilab.org/file/248336/raw/Salt-Logo.png

    While developing salt formulas I was looking for a simple and reproducible environment to allow faster development, fewer bugs and more fun. The formula must handle multiple target OS (Debian and Centos).

    The first barrier is the master/minion installation of Salt, but fortunately Salt has a masterless mode. The idea is quite simple: bring up a virtual machine, install a Salt minion on it, expose the code inside the VM and call Salt states.

    https://www.logilab.org/file/7159870/raw/docker.png

    At Logilab we like to work with docker, a lightweight OS-level virtualization solution. One of the key features is docker volumes to share local files inside the container. So I started to write a simple Python script to build a container with a Salt minion installed and run it with formula states and a few config files shared inside the VM.

    The formula I was working on is used to deploy the saemref project, which is a Cubicweb based application:

    % cat test/centos7.Dockerfile
    FROM centos:7
    RUN yum -y install epel-release && \
        yum -y install https://repo.saltstack.com/yum/redhat/salt-repo-latest-1.el7.noarch.rpm && \
        yum clean expire-cache && \
        yum -y install salt-minion
    
    % cat test/jessie.Dockerfile
    FROM debian:jessie
    RUN apt-get update && apt-get -y install wget
    RUN wget -O - https://repo.saltstack.com/apt/debian/8/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add -
    RUN echo "deb http://repo.saltstack.com/apt/debian/8/amd64/latest jessie main" > /etc/apt/sources.list.d/saltstack.list
    RUN apt-get update && apt-get -y install salt-minion
    
    % cat test/minion.conf
    file_client: local
    file_roots:
      base:
        - /srv/salt
        - /srv/formula
    

    And finally the run-tests.py file, using the beautiful click module

    #!/usr/bin/env python
    import os
    import subprocess
    
    import click
    
    @click.group()
    def cli():
        pass
    
    formula = "saemref"
    BASEDIR = os.path.abspath(os.path.dirname(__file__))
    
    image_choice = click.argument("image", type=click.Choice(["centos7", "jessie"]))
    
    
    @cli.command(help="Build an image")
    @image_choice
    def build(image):
        dockerfile = "test/{0}.Dockerfile".format(image)
        tag = "{0}-formula:{1}".format(formula, image)
        subprocess.check_call(["docker", "build", "-t", tag, "-f", dockerfile, "."])
        return tag
    
    
    @cli.command(help="Spawn an interactive shell in a new container")
    @image_choice
    @click.pass_context
    def dev(ctx, image):
        tag = ctx.invoke(build, image=image)
        subprocess.call([
            "docker", "run", "-i", "-t", "--rm", "--hostname", image,
            "-v", "{0}/test/minion.conf:/etc/salt/minion.d/minion.conf".format(BASEDIR),
            "-v", "{0}/{1}:/srv/formula/{1}".format(BASEDIR, formula),
            tag, "/bin/bash",
        ])
    
    
    if __name__ == "__main__":
        cli()
    

    Now I can quickly run multiple containers and test my Salt states inside them while editing the code locally:

    % ./run-tests.py dev centos7
    [root@centos7 /]# salt-call state.sls saemref
    
    [ ... ]
    
    [root@centos7 /]# ^D
    % # The container is destroyed when it exits
    

    Notice that we could add some custom pillars and state files simply by adding specific docker shared volumes.

    With a few lines we created a lightweight Vagrant-like environment, but faster, with docker instead of VirtualBox, and it remains fully customizable for future needs.


  • Introduction to thesauri and SKOS

    2016/06/27 by Yann Voté

    Recently, I faced the problem of importing the European Union thesaurus, Eurovoc, into CubicWeb using the SKOS cube. Eurovoc doesn't follow the SKOS data model and I'll show here how I managed to adapt Eurovoc to fit into SKOS.

    This article is in two parts:

    • this is the first part where I introduce what a thesaurus is and what SKOS is,
    • the second part will show how to convert Eurovoc to plain SKOS.

    The whole text assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

    What is a thesaurus?

    A common need in our digital lives is to attach keywords to documents, web pages, pictures, and so on, so that search is easier. For example, you may want to add two keywords:

    • lily,
    • lilium

    in a picture's metadata about this flower. If you have a large collection of flower pictures, this will make your life easier when you want to search for a particular species later on.

    free-text keywords on a picture

    In this example, keywords are free: you can choose whatever keyword you want, very general or very specific. For example you may just use the keyword:

    • flower

    if you don't care about species. You are also free to use lowercase or uppercase letters, and to make typos...

    free-text keyword on a picture

    On the other side, sometimes you have to select keywords from a list. Such a constrained list is called a controlled vocabulary. For instance, a very simple controlled vocabulary with only two keywords is the one about a person's gender:

    • male (or man),
    • female (or woman).
    a simple controlled vocabulary

    But there are more complex examples: think about how a library organizes books by themes: there are very general themes (eg. Science), then more and more specific ones (eg. Computer science -> Software -> Operating systems). There may also be synonyms (eg. Computing for Computer science) or referrals (eg. there may be a "see also" link between keywords Algebra and Geometry). Such a controlled vocabulary where keywords are organized in a tree structure, and with relations like synonym and referral, is called a thesaurus.

    an example thesaurus with a tree of keywords

    For the sake of simplicity, in the following we will call thesaurus any controlled vocabulary, even a simple one with two keywords like male/female.

    SKOS

    SKOS, from the World Wide Web Consortium (W3C), is an ontology for the semantic web describing thesauri. To make it simple, it is a common data model for thesauri that can be used on the web. If you have a thesaurus and publish it on the web using SKOS, then anyone can understand how your thesaurus is organized.

    SKOS is very versatile. You can use it to produce very simple thesauri (like male/female) and very complex ones, with a tree of keywords, even in multiple languages.

    To cope with this complexity, the SKOS data model splits each keyword into two entities: a concept and its labels. For example, the concept of a male person has multiple labels: male and man in English, homme and masculin in French. The concept of a lily flower also has multiple labels: lily in English, lilium in Latin, lys in French.

    Among all labels for a given concept, some can be preferred, while others are alternative. There may be only one preferred label per language. In the person's gender example, man may be the preferred label in English and male an alternative one, while in French homme would be the preferred label and masculin an alternative one. In the flower example, lily (resp. lys) is the preferred label in English (resp. French), and lilium is an alternative label in Latin (there is no preferred label in Latin).

    SKOS concepts and labels

    And of course, in SKOS, it is possible to say that a concept is broader than another one (just like topic Science is broader than topic Computer science).

    So to summarize, in SKOS, a thesaurus is a tree of concepts, and each concept has one or more labels, preferred or alternative. A thesaurus is also called a concept scheme in SKOS.

    Also, please note that SKOS data model is slightly more complicated than what we've shown here, but this will be sufficient for our purpose.

    RDF URIs defined by SKOS

    In order to publish a thesaurus in RDF using SKOS ontology, SKOS introduces the "skos:" namespace associated to the following URI: http://www.w3.org/2004/02/skos/core#.

    Within that namespace, SKOS defines some classes and predicates corresponding to what has been described above. For example:

    • the triple (<uri>, rdf:type, skos:ConceptScheme) says that <uri> belongs to class skos:ConceptScheme (that is, is a concept scheme),
    • the triple (<uri>, rdf:type, skos:Concept) says that <uri> belongs to class skos:Concept (that is, is a concept),
    • the triple (<uri>, skos:prefLabel, <literal>) says that <literal> is a preferred label for concept <uri>,
    • the triple (<uri>, skos:altLabel, <literal>) says that <literal> is an alternative label for concept <uri>,
    • the triple (<uri1>, skos:broader, <uri2>) says that concept <uri2> is a broader concept of <uri1>.
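
    As a small illustration, such triples can be produced programmatically with the RDFLib Python library (the same library we will use to read Eurovoc in the second part); this is a minimal sketch and the URIs below are hypothetical.

    import rdflib
    from rdflib import Literal, URIRef
    from rdflib.namespace import RDF, SKOS

    graph = rdflib.Graph()
    scheme = URIRef('http://example.org/thesaurus/gender')      # hypothetical concept scheme
    male = URIRef('http://example.org/thesaurus/gender/male')   # hypothetical concept

    graph.add((scheme, RDF.type, SKOS.ConceptScheme))
    graph.add((male, RDF.type, SKOS.Concept))
    graph.add((male, SKOS.inScheme, scheme))
    graph.add((male, SKOS.prefLabel, Literal('man', lang='en')))
    graph.add((male, SKOS.altLabel, Literal('male', lang='en')))
    graph.add((male, SKOS.prefLabel, Literal('homme', lang='fr')))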

  • One way to convert Eurovoc into plain SKOS

    2016/06/27 by Yann Voté

    This is the second part of an article where I show how to import the Eurovoc thesaurus from the European Union into an application using a plain SKOS data model. I recently faced the problem of importing Eurovoc into CubicWeb using the SKOS cube, and the solution I chose is discussed here.

    The first part was an introduction to thesauri and SKOS.

    The whole article assumes familiarity with RDF, as describing RDF would require more than a blog entry and is out of scope.

    Difficulties with Eurovoc and SKOS

    Eurovoc

    Eurovoc is the main thesaurus covering European Union business domains. It is published and maintained by the EU commission. It is quite complex and big, structured as a tree of keywords.

    You can see Eurovoc keywords and browse the tree from the Eurovoc homepage using the link Browse the subject-oriented version.

    For example, when publishing statistics about education in the EU, you can tag the published data with the broadest keyword Education and communications. Or you can be more precise and use the following narrower keywords, in increasing order of preference: Education, Education policy, Education statistics.

    Problem: hierarchy of thesauri

    The EU commission uses SKOS to publish its Eurovoc thesaurus, so it should be straightforward to import Eurovoc into our own application. But things are not that simple...

    For some reason, Eurovoc uses a hierarchy of concept schemes. For example, Education and communications is a sub-concept scheme of Eurovoc (it is called a domain), and Education is a sub-concept scheme of Education and communications (it is called a micro-thesaurus). Education policy is (a label of) the first concept in this hierarchy.

    But with SKOS this is not possible: a concept scheme cannot be contained into another concept scheme.

    Possible solutions

    So to import Eurovoc into our SKOS application without losing data, one solution is to turn sub-concept schemes into concepts. We have two strategies:

    • keep only one concept scheme (Eurovoc) and turn domains and micro-thesauri into concepts,
    • keep domains as concept schemes, drop Eurovoc concept scheme, and only turn micro-thesauri into concepts.

    Here we will discuss the latter solution.

    Let's get to work

    Eurovoc thesaurus can be downloaded at the following URL: http://publications.europa.eu/mdr/resource/thesaurus/eurovoc/skos/eurovoc_skos.zip

    The ZIP archive contains only one XML file named eurovoc_skos.rdf. Put it somewhere where you can find it easily.

    To read this file easily, we will use the RDFLib Python library. This library makes it really convenient to work with RDF data. It has only one drawback: it is very slow. Reading the whole Eurovoc thesaurus with it takes a very long time. Making the process faster is the first thing to consider for later improvements.

    Reading the Eurovoc thesaurus is as simple as creating an empty RDF Graph and parsing the file. As said above, this takes a long long time (from half an hour to two hours).

    import rdflib
    
    eurovoc_graph = rdflib.Graph()
    eurovoc_graph.parse('<path/to/eurovoc_skos.rdf>', format='xml')
    
    <Graph identifier=N52834ca3766d4e71b5e08d50788c5a13 (<class 'rdflib.graph.Graph'>)>
    

    We can see that Eurovoc contains nearly 3 million triples.

    len(eurovoc_graph)
    
    2828910
    

    Now, before actually converting Eurovoc to plain SKOS, let's introduce some helper functions:

    • the first one, uriref(), will allow us to build RDFLib URIRef objects from simple prefixed URIs like skos:prefLabel or dcterms:title,
    • the second one, capitalized_eurovoc_domain(), is used to convert Eurovoc domain names, which are all uppercase (eg. 32 EDUCATION ET COMMUNICATION), to a string where only the first letter is uppercase (eg. 32 Education et communication)
    import re
    
    from rdflib import Literal, Namespace, RDF, URIRef
    from rdflib.namespace import DCTERMS, SKOS
    
    eu_ns = Namespace('http://eurovoc.europa.eu/schema#')
    thes_ns = Namespace('http://purl.org/iso25964/skos-thes#')
    
    prefixes = {
        'dcterms': DCTERMS,
        'skos': SKOS,
        'eu': eu_ns,
        'thes': thes_ns,
    }
    
    def uriref(prefixed_uri):
        prefix, value = prefixed_uri.split(':', 1)
        ns = prefixes[prefix]
        return ns[value]
    
    def capitalized_eurovoc_domain(domain):
        """Return the given Eurovoc domain name with only the first letter uppercase."""
        return re.sub(r'^(\d+\s)(.)(.+)$',
                      lambda m: u'{0}{1}{2}'.format(m.group(1), m.group(2).upper(), m.group(3).lower()),
                      domain, flags=re.UNICODE)
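
    For a quick sanity check of these helpers (an illustrative snippet, not part of the original post), we can print a resolved URI and a converted domain name:

    print(uriref('skos:prefLabel'))
    # -> http://www.w3.org/2004/02/skos/core#prefLabel
    print(capitalized_eurovoc_domain(u'32 EDUCATION ET COMMUNICATION'))
    # -> 32 Education et communication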
    

    Now the actual work. After using variables to reference URIs, the loop will parse each triple in the original graph and:

    • discard it if it contains deprecated data,
    • if triple is like (<uri>, rdf:type, eu:Domain), replace it with (<uri>, rdf:type, skos:ConceptScheme),
    • if triple is like (<uri>, rdf:type, eu:MicroThesaurus), replace it with (<uri>, rdf:type, skos:Concept) and add triple (<uri>, skos:inScheme, <domain_uri>),
    • if triple is like (<uri>, rdf:type, eu:ThesaurusConcept), replace it with (<uri>, rdf:type, skos:Concept),
    • if triple is like (<uri>, skos:topConceptOf, <microthes_uri>), replace it with (<uri>, skos:broader, <microthes_uri>),
    • if triple is like (<uri>, skos:inScheme, <microthes_uri>), replace it with (<uri>, skos:inScheme, <domain_uri>),
    • keep triples like (<uri>, skos:prefLabel, <label_uri>), (<uri>, skos:altLabel, <label_uri>), and (<uri>, skos:broader, <concept_uri>),
    • discard all other non-deprecated triples.

    Note that, to replace a micro-thesaurus with a domain, we have to build a mapping between each micro-thesaurus and its containing domain (the microthes2domain dict).

    This loop is also quite long.

    eurovoc_ref = URIRef(u'http://eurovoc.europa.eu/100141')
    deprecated_ref = URIRef(u'http://publications.europa.eu/resource/authority/status/deprecated')
    title_ref = uriref('dcterms:title')
    status_ref = uriref('thes:status')
    class_domain_ref = uriref('eu:Domain')
    rel_domain_ref = uriref('eu:domain')
    microthes_ref = uriref('eu:MicroThesaurus')
    thesconcept_ref = uriref('eu:ThesaurusConcept')
    concept_scheme_ref = uriref('skos:ConceptScheme')
    concept_ref = uriref('skos:Concept')
    pref_label_ref = uriref('skos:prefLabel')
    alt_label_ref = uriref('skos:altLabel')
    in_scheme_ref = uriref('skos:inScheme')
    broader_ref = uriref('skos:broader')
    top_concept_ref = uriref('skos:topConceptOf')
    
    microthes2domain = dict((mt, next(eurovoc_graph.objects(mt, rel_domain_ref)))
                            for mt in eurovoc_graph.subjects(RDF.type, microthes_ref))
    
    new_graph = rdflib.ConjunctiveGraph()
    for subj_ref, pred_ref, obj_ref in eurovoc_graph:
        if deprecated_ref in list(eurovoc_graph.objects(subj_ref, status_ref)):
            continue
        # Convert eu:Domain into a skos:ConceptScheme
        if obj_ref == class_domain_ref:
            new_graph.add((subj_ref, RDF.type, concept_scheme_ref))
            for title in eurovoc_graph.objects(subj_ref, pref_label_ref):
                if title.language == u'en':
                    new_graph.add((subj_ref, title_ref,
                                   Literal(capitalized_eurovoc_domain(title))))
                    break
        # Convert eu:MicroThesaurus into a skos:Concept
        elif obj_ref == microthes_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
            scheme_ref = next(eurovoc_graph.objects(subj_ref, rel_domain_ref))
            new_graph.add((subj_ref, in_scheme_ref, scheme_ref))
        # Convert eu:ThesaurusConcept into a skos:Concept
        elif obj_ref == thesconcept_ref:
            new_graph.add((subj_ref, RDF.type, concept_ref))
        # Replace <concept> topConceptOf <MicroThesaurus> by <concept> broader <MicroThesaurus>
        elif pred_ref == top_concept_ref:
            new_graph.add((subj_ref, broader_ref, obj_ref))
        # Replace <concept> skos:inScheme <MicroThes> by <concept> skos:inScheme <Domain>
        elif pred_ref == in_scheme_ref and obj_ref in microthes2domain:
            new_graph.add((subj_ref, in_scheme_ref, microthes2domain[obj_ref]))
        # Keep label triples
        elif (subj_ref != eurovoc_ref and obj_ref != eurovoc_ref
              and pred_ref in (pref_label_ref, alt_label_ref)):
            new_graph.add((subj_ref, pred_ref, obj_ref))
        # Keep existing skos:broader relations and existing concepts
        elif pred_ref == broader_ref or obj_ref == concept_ref:
            new_graph.add((subj_ref, pred_ref, obj_ref))
    

    We can check that we now have far fewer triples than before.

    len(new_graph)
    
    388582
    

    Now we dump this new graph to disk. We choose the Turtle format as it is far more readable than RDF/XML for humans, and slightly faster to parse for machines. This file will contain plain SKOS data that can be directly imported into any application able to read SKOS.

    with open('eurovoc.n3', 'w') as f:
        new_graph.serialize(f, format='n3')
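
    As a quick sanity check (not part of the original post), the generated file can be reloaded with RDFLib to verify that it now only describes concept schemes, concepts and their labels:

    check_graph = rdflib.Graph()
    check_graph.parse('eurovoc.n3', format='n3')
    # Domains kept as concept schemes, and micro-thesauri plus concepts.
    print(len(set(check_graph.subjects(RDF.type, concept_scheme_ref))))
    print(len(set(check_graph.subjects(RDF.type, concept_ref))))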
    

    With CubicWeb using the SKOS cube, it is a one-command step:

    cubicweb-ctl skos-import --cw-store=massive <instance_name> eurovoc.n3
    

  • Installing Debian Jessie on a "pure UEFI" system

    2016/06/13 by David Douard

    At the core of the Logilab infrastructure is a highly-available pair of small machines dedicated to our main directory and authentication services: LDAP, DNS, DHCP, Kerberos and Radius.

    The machines are small fanless boxes powered by a 1GHz Via Eden processor, 512MB of RAM and 2GB of storage on a CompactFlash module.

    They have served us well for many years, but now is the time for an improvement. We've bought a pair of Lanner FW-7543B boxes that have the same form factor. They are not fanless, but are much more powerful. They are pretty nice, but have one major drawback: their firmware only boots in UEFI mode, so it will not boot a device set up for legacy BIOS booting. Another difficulty is that they do not have a video connector (there is a VGA output on the motherboard, but the connector is optional), so everything must be done via the serial console.

    https://www.logilab.org/file/6679313/raw/FW-7543_front.jpg

    I knew the Debian Jessie installer would provide everything that is required to handle a UEFI-based system, but it took me a few tries to get it to boot.

    First, I tried the standard netboot image, but the firmware did not want to boot from a USB stick, probably because the image requires an MBR-based bootloader.

    Then I tried to boot from the rEFInd bootable image and it worked! At least I had proof that this little beast could boot in UEFI mode. But, although it is probably possible, I could not figure out how to tweak the rEFInd config file to make it properly boot the Debian installer kernel and initrd.

    https://www.logilab.org/file/6679257/raw/uefi_lanner_nope.png

    Finally, I tried something I know much better: Grub. Here is what I did to get a working UEFI Debian installer on a USB key.

    Partitioning

    First, in the UEFI world, you need a GPT partition table with a FAT partition typed "EFI System":

    david@laptop:~$ sudo fdisk /dev/sdb
    Welcome to fdisk (util-linux 2.25.2).
    Changes will remain in memory only, until you decide to write them.
    Be careful before using the write command.
    
    Command (m for help): g
    Created a new GPT disklabel (GUID: 52FFD2F9-45D6-40A5-8E00-B35B28D6C33D).
    
    Command (m for help): n
    Partition number (1-128, default 1): 1
    First sector (2048-3915742, default 2048): 2048
    Last sector, +sectors or +size{K,M,G,T,P} (2048-3915742, default 3915742):  +100M
    
    Created a new partition 1 of type 'Linux filesystem' and of size 100 MiB.
    
    Command (m for help): t
    Selected partition 1
    Partition type (type L to list all types): 1
    Changed type of partition 'Linux filesystem' to 'EFI System'.
    
    Command (m for help): p
    Disk /dev/sdb: 1.9 GiB, 2004877312 bytes, 3915776 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: gpt
    Disk identifier: 52FFD2F9-45D6-40A5-8E00-B35B28D6C33D
    
    Device     Start    End Sectors  Size Type
    /dev/sdb1   2048 206847  204800  100M EFI System
    
    Command (m for help): w
    

    Install Grub

    Now we need to install a grub-efi bootloader in this partition:

    david@laptop:~$ pmount sdb1
    david@laptop:~$ sudo grub-install --target x86_64-efi --efi-directory /media/sdb1/ --removable --boot-directory=/media/sdb1/boot
    Installing for x86_64-efi platform.
    Installation finished. No error reported.
    

    Copy the Debian Installer

    Our next step is to copy Debian's netboot kernel and initrd onto the USB key:

    david@laptop:~$ mkdir /media/sdb1/EFI/debian
    david@laptop:~$ wget -O /media/sdb1/EFI/debian/linux http://ftp.fr.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/debian-installer/amd64/linux
    --2016-06-13 18:40:02--  http://ftp.fr.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/debian-installer/amd64/linux
    Resolving ftp.fr.debian.org (ftp.fr.debian.org)... 212.27.32.66, 2a01:e0c:1:1598::2
    Connecting to ftp.fr.debian.org (ftp.fr.debian.org)|212.27.32.66|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3120416 (3.0M) [text/plain]
    Saving to: ‘/media/sdb1/EFI/debian/linux’
    
    /media/sdb1/EFI/debian/linux      100%[========================================================>]   2.98M      464KB/s   in 6.6s
    
    2016-06-13 18:40:09 (459 KB/s) - ‘/media/sdb1/EFI/debian/linux’ saved [3120416/3120416]
    
    david@laptop:~$ wget -O /media/sdb1/EFI/debian/initrd.gz http://ftp.fr.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/debian-installer/amd64/initrd.gz
    --2016-06-13 18:41:30--  http://ftp.fr.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/debian-installer/amd64/initrd.gz
    Resolving ftp.fr.debian.org (ftp.fr.debian.org)... 212.27.32.66, 2a01:e0c:1:1598::2
    Connecting to ftp.fr.debian.org (ftp.fr.debian.org)|212.27.32.66|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 15119287 (14M) [application/x-gzip]
    Saving to: ‘/media/sdb1/EFI/debian/initrd.gz’
    
    /media/sdb1/EFI/debian/initrd.g    100%[========================================================>]  14.42M    484KB/s   in 31s
    
    2016-06-13 18:42:02 (471 KB/s) - ‘/media/sdb1/EFI/debian/initrd.gz’ saved [15119287/15119287]
    

    Configure Grub

    Then, we must write a decent grub.cfg file to load these:

    david@laptop:~$ cat >/media/sdb1/boot/grub/grub.cfg <<EOF
    menuentry "Jessie Installer" {
      insmod part_msdos
      insmod ext2
      insmod part_gpt
      insmod fat
      insmod gzio
      echo  'Loading Linux kernel'
      linux /EFI/debian/linux --- console=ttyS0,115200
      echo 'Loading InitRD'
      initrd /EFI/debian/initrd.gz
    }
    EOF
    

    Et voilà, piece of cake!


  • Our work for the OpenDreamKit project during the 77th Sage days

    2016/04/18 by Florent Cayré

    Logilab is part of OpenDreamKit, a Horizon 2020 European Research Infrastructure project that will run until 2019 and provides substantial funding to the open source computational mathematics ecosystem.

    https://www.logilab.org/file/5545539/raw

    One of the goals of this project is to improve the packaging and documentation of SageMath, the open source alternative to Maple and Mathematica.

    The core developers of SageMath organised the 77th Sage days last week and Logilab took part, with David Douard, Julien Cristau and me, Florent Cayré.

    David and Julien have been working on packaging SageMath for Debian. This is a huge task (several man-months of work), split into two sub-tasks for now:

    • building SageMath with Debian-packaged versions of its dependencies, if available;
    • packaging some of the missing dependencies, starting with the most expected ones, like the latest releases of Jupyter and IPython.
    http://ipython.org/_static/IPy_header.png http://jupyter.org/assets/nav_logo.svg https://www.debian.org/Pics/hotlink/swirl-debian.png

    As a first result, the following packages have been pushed into Debian experimental:

    There is still a lot of work to be done, and packaging the notebook is the next task on the list.

    One hiccup along the way was a Python crash involving multiple inheritance from Cython extension classes. Having people nearby who knew the SageMath codebase well (or even wrote the relevant parts) was invaluable for debugging, and allowed us to blame a recent CPython change.

    Julien also gave a hand to Florent Hivert and Robert Lehmann who were trying to understand why building SageMath's documentation needed this much memory.

    As far as I am concerned, I made a prototype of structured HTML documentation produced with Sphinx, containing executable Python code run on https://tmpnb.org/ thanks to the Thebe JavaScript library, which interfaces statically delivered HTML pages with a Jupyter notebook server.

    The Sage days have been an excellent opportunity to work efficiently on the technical tasks with skillful and enthusiastic people. We would like to thank the OpenDreamKit core team for the organization and their hard work. We look forward to the next workshop.


  • 3D Visualization of simulation data with x3dom

    2016/02/16 by Yuanxiang Wang

    X3DOM Plugins

    As part of the Open Dream Kit project, we are working at Logilab on the creation of tools for mesh data visualization and analysis in a web application. Our goal was to create widgets to use in Jupyter notebook (formerly IPython) for 3D visualization and analysis.

    We found two interesting technologies for 3D rendering: ThreeJS and X3DOM. ThreeJS is a large JavaScript 3D library and X3DOM an HTML5 framework for 3D. After working with both, we chose to use X3DOM because of its high-level architecture. With X3DOM the 3D model is defined in the DOM in HTML5, so the parameters of the nodes can be changed easily with the setAttribute DOM function. This makes the creation of user interfaces and widgets much easier.

    We worked to create new DOM nodes that integrate nicely in a standard X3DOM tree, namely IsoColor, Threshold and ClipPlane.

    We had two goals in mind:

    1. create an X3DOM plugin API that allows one to create new DOM nodes which extend X3DOM functionality;
    2. keep a simple X3DOM-like interface for the final users.

    Example of the plugins Threshold and IsoColor:

    image0

    The Threshold and IsoColor nodes work like any X3DOM node and react to attribute changes performed with the setAttribute method. This makes it easy to use HTML widgets like sliders / buttons to drive the plugin's parameters.

    X3Dom API

    The goal is to create custom nodes that affect the rendering based on data (positions, pressure, temperature...). The idea is to manipulate the shaders, since that gives low-level control over the 3D rendering. Shaders give more freedom and efficiency than reusing other X3DOM nodes. (Reminder: shaders are GLSL programs that run on the GPU.)

    X3DOM has a native node that allows users to write shaders: the ComposedShader node. The problem with this node is that it overwrites the shaders generated by X3DOM. For example, nodes like ClipPlane are disabled when a ComposedShader node is in the DOM. Another example is image texturing: the computation of the color from the texture coordinates has to be written within the ComposedShader.

    In order to add shader parts to the generated shader without overwriting it, I created a new node: CustomAttributeNode. This is a generic node to add uniforms, varyings and shader parts into X3DOM. The data of the geometry (attributes) are set using the X3DOM node named FloatVertexAttribute.

    Example of CustomAttributeNode to create a threshold node:

    image2

    The CustomAttributeNode is the entry point in X3DOM for the JavaScript API.

    JavaScript API

    The idea of the API is to create a new node inherited from CustomAttributeNode. We wrote some functions to make the implementation of the node easier.

    Ideas for future improvement

    There are still some points that need improvement:

    • Create a tree widget using the grouping nodes in X3DOM
    • Add high level functions to X3DGeometricPropertyNode to set the values. For instance the IsoColor node is only a node that sets the values of the TextureCoordinate node from the FloatVertexAttribute node.
    • Add a high level function to return the variable name needed to overwrite basic attributes like positions in a Geometry. With my API, the IsoColor node uses a varying defined in X3DOM to overwrite the values of the texture coordinates. Because there is no documentation, it is hard for users to find the varying names. On the other hand, there is no specification of the varying names, so this might need to be maintained.
    • Maybe the CustomAttributeNode should be a X3DChildNode instead of a X3DGeometricPropertyNode.

    image4

    This structure might allow the "use" attribute in X3DOM. That way, X3DOM avoids data duplication and writing too much HTML. The following code illustrates what I expect.

    image5


  • We went to cfgmgmtcamp 2016 (after FOSDEM)

    2016/02/09 by Arthur Lutz

    Following a day at FOSDEM (see the other post about it), we spent two days at cfgmgmtcamp in Gent. At cfgmgmtcamp, we obviously spent some time in the Salt track since it's our tool of choice, as you might have noticed. But checking out how some of the other tools and communities are finding solutions to similar problems is also great.

    cfgmgmtcamp logo

    I presented Roll out active Supervision with Salt, Graphite and Grafana (mirrored on slideshare), you can find the code on bitbucket.

    http://image.slidesharecdn.com/cfgmgmtcamp2016activesupervisionwithsalt-160203131954/95/cfgmgmtcamp-2016-roll-out-active-supervision-with-salt-graphite-and-grafana-1-638.jpg?cb=1454505737

    We saw:

    Day 1

    • Mark Shuttleworth from Canonical presenting Juju, its ecosystem and software modelling. MAAS (Metal as a Service) was demoed on the nice "OrangeBox". It promises to spin up an OpenStack infrastructure in 15 minutes. One of the interesting things with charms and bundles of charms is the interfaces that need to be established between different service bricks. In the salt community we have salt-formulas but they lack maturity in the sense that there's no possibility to plug in multiple formulas that interact with each other... yet.
    juju deploy of openstack
    • Mitch Michell from HashiCorp presented Vault. Vault stores your secrets (certificates, passwords, etc.) and we will probably be trying it out in the near future. A lot of concepts in Vault are really well thought out and resonate with some of the things we want to do and automate in our infrastructure. The use of the Shamir Secret Sharing technique (also used by the Debian infrastructure team) for the N-man challenge to unvault the secrets is quite nice. David is already looking into automating it with Salt and having GSSAPI (kerberos) authentication.
    https://www.vaultproject.io/assets/images/hero-95b4a434.png bikes!

    Day 2

    • Gareth Rushgrove from PuppetLabs talked about the importance of metadata in Docker images and containers. He explained how metadata greatly benefits tools like dpkg and rpm, and argued that the container community should take inspiration from the skills and experience built up by these package management communities (think of all the language-specific package managers that each reinvent the wheel one after the other).
    • Testing Immutable Infrastructure: we found some inspiration in test-kitchen and in running the tests inside a Docker container instead of a Vagrant virtual machine. We'll have to take a look at the SaltStack provisioner for test-kitchen. We already do some of that stuff in Docker and OpenStack using salt-cloud. But maybe we can take it further with such tools (or with testinfra, whose author will be joining Logilab next month).
    coreos, rkt, kubernetes
    • How CoreOS is built, modified, and updated: From repo sync to Omaha by Brian "RedBeard" Harrington. Interesting presentation of the CoreOS system. Brian also revealed that CoreOS is now capable of using the TPM to enforce a signed OS, but also signed containers. Official CoreOS images shipped through Omaha are now signed with a root key that can be installed in the TPM of the host (i.e. they didn't use a pre-installed Microsoft key), along with a modified TPM-aware version of GRUB. For now, the Omaha platform is not open source, so it may not be that easy to build one's own CoreOS images signed with a personal root key, but it is theoretically possible. Brian also said that he expects their Omaha server implementation to become open source some day.
    • The use of Salt in Foreman was presented and demoed by Stephen Benjamin. We'll have to retry using that tool with the newest features of the smart proxy.
    • Jonathan Boulle from CoreOS presented "rkt and Kubernetes: What's new with Container Runtimes and Orchestration". In this last talk, Jonathan gave a tour of the rkt project and how it is used to build, together with Kubernetes, a comprehensive, secure container running infrastructure (which uses saltstack!). He named the result "rktnetes". The idea is to use rkt as the container runtime of the kubelet (the primary node agent) in a Kubernetes cluster powered by CoreOS. Along with the new CoreOS support for a TPM-based trust chain, it makes it possible to ensure completely secured execution, from the bootloader to the container! The possibility to run fully secured containers is one of the reasons why CoreOS developed the rkt project.
    coffee!

    We would like to thank the cfgmgmtcamp organisation team: it was a great conference and we highly recommend it. Thanks for the speakers' event the night before the conference and for the social event on Monday evening (and thanks for the chocolate!).


  • We went to FOSDEM 2016 (and cfgmgmtcamp)

    2016/02/09 by Arthur Lutz

    David & I went to FOSDEM and cfgmgmtcamp this year to attend some talks, give two presentations, and discuss with members of the open source communities we contribute to.

    https://www.logilab.org/file/4253021/raw/16312670359_565eec1e3d_k.jpg

    At FOSDEM, we started early by giving a presentation at 9.00 am in the "Configuration Management devroom", which, to our surprise, was a large room that was almost full. The presentation was streamed over the Internet and should be available to view shortly.

    I presented "Once you've configured your infrastructure using salt, monitor it by re-using that definition". (mirrored on slideshare. The main part was a demo, the code being published on bitbucket.

    http://image.slidesharecdn.com/fosdem2016describeitmonitorit-160203131836/95/fosdem-2016-after-describing-your-infrastructure-as-code-reuse-that-to-monitor-it-1-638.jpg?cb=1454505792

    The presentation was streamed live (I came across someone who watched it on the Internet to "sleep in"), and should be available to watch once it gets encoded on http://video.fosdem.org/.

    FOSDEM video box

    We then saw the following talks:

    • Unified Framework for Big Data Foreign Data Wrappers (FDW) by Shivram Mani in the Postgresql Track
    • Mainflux Open Source IoT Cloud
    • EzBench, a tool to help you benchmark and bisect the Graphics Stack's performance
    • The RTC components in the debian infrastructure
    • CoreOS: A Linux distribution designed for application containers that scale
    • Using PostgreSQL for Bibliographic Data (since we've worked on http://data.bnf.fr/ with http://cubicweb.org/ and PostgreSQL)
    • The FOSDEM infrastructure review

    Congratulations to all the FOSDEM organisers, volunteers and speakers. We will hopefully be back for more.

    We then took the train to Gent where we spent two days learning and sharing about Configuration Management Systems and all the ecosystem around it (orchestration, containers, clouds, testing, etc.).

    More on our cfgmgmtcamp experience in another blog post.

    Photos under creative commons CC-BY, by Ludovic Hirlimann and Deborah Bryant here and here


  • DebConf15 wrap-up

    2015/08/25 by Julien Cristau
    //www.logilab.org/file/856155/raw/heidelberg-panorama-2.jpg

    I just came back from two weeks in Heidelberg for DebCamp15 and DebConf15.

    In the first week, besides helping out DebConf's infrastructure team with network setup, I tried to make some progress on the library transitions triggered by libstdc++6's C++11 changes. At first, I spent many hours going through header files for a bunch of libraries trying to figure out if the public API involved std::string or std::list. It turns out that is time-consuming, error-prone, and pretty efficient at making me lose the will to live. So I ended up stealing a script from Steve Langasek to automatically rename library packages for this transition. This ended in 29 non-maintainer uploads to the NEW queue, quickly processed by the FTP team. Sadly the transition is not quite there yet, as making progress with the initial set of packages reveals more libraries that need renaming.

    Building on some earlier work from Laurent Bigonville, I've also moved the setuid root Xorg wrapper from the xserver-xorg package to xserver-xorg-legacy, which is now in experimental. Hopefully that will make its way to sid and stretch soon (need to figure out what to do with non-KMS drivers first).

    Finally, with the help of the security team, the security tracker was moved to a new VM that will hopefully not eat its root filesystem every week as the old one was doing the last few months. Of course, the evening we chose to do this was the night DebConf15's network was being overhauled, which made things more interesting.

    DebConf itself was the opportunity to meet a lot of people. I was particularly happy to meet Andreas Boll, who has been a member of pkg-xorg for two years now, working on our mesa package, among other things. I didn't get to see a lot of talks (too many other things going on), but did enjoy Enrico's stand-up comedy, the CitizenFour screening, and Jacob Appelbaum's keynote. Thankfully, for the rest the video team has done a great job as usual.

    Note

    Above picture is by Aigars Mahinovs, licensed under CC-BY 2.0


  • Going to DebConf15

    2015/08/11 by Julien Cristau

    On Sunday I travelled to Heidelberg, Germany, to attend the 16th annual Debian developer's conference, DebConf15.

    The conference itself is not until next week, but this week is DebCamp, a hacking session. I've already met a few of my DSA colleagues, who've been working on setting up the network infrastructure. My other plans for this week involve helping the Big Transition of 2015 along, and trying to remove the setuid bit from /usr/bin/X in the default Debian install (bug #748203 in particular).

    As for next week, there's a rich schedule in which I'll need to pick a few things to go see.

    //www.logilab.org/file/524206/raw/Dc15going1.png

  • Experiments on building a Jenkins CI service with Salt

    2015/06/17 by Denis Laxalde

    In this blog post, I'll talk about my recent experiments on building a continuous integration service with Jenkins that is, as much as possible, managed through Salt. We've been relying on a Jenkins platform for quite some time at Logilab (Tolosa team). The service was mostly managed by me, with sporadic help from other team-mates, but I've never been entirely satisfied with the way it was managed: it involved a lot of boilerplate configuration through the Jenkins user interface, which neither scales well nor makes long-term maintenance easy.

    So recently I took the plunge and decided to move to a Salt-based configuration and management of our Jenkins CI platform. There are actually two aspects here. The first concerns the setup of Jenkins itself (this includes installation, security configuration and plugins management, amongst other things). The second concerns the management of client projects (or jobs in Jenkins jargon). For this second aspect, one of the design goals was to enable easy configuration of jobs by users not necessarily familiar with Jenkins setup and to make collaborative maintenance easy. To tackle these two aspects I've essentially been using (or developing) two distinct Salt formulas, which I'll detail hereafter.

    Jenkins jobs salt

    Core setup: the jenkins formula

    The core setup of Jenkins is based on an existing Salt formula, the jenkins-formula which I extended a bit to support map.jinja and which was further improved to support installation of plugins by Yann and Laura (see 3b524d4).

    With that, deploying a Jenkins server is as simple as adding the following to your states and pillars top.sls files:

    base:
      "jenkins":
        - jenkins
        - jenkins.plugins
    

    Base pillar configuration is used to declare anything that differs from the default Jenkins settings in a jenkins section, e.g.:

    jenkins:
      lookup:
        - home: /opt/jenkins
    

    Plugins configuration is declared in plugins subsection as follows:

    jenkins:
      lookup:
        plugins:
          scm-api:
            url: 'http://updates.jenkins-ci.org/download/plugins/scm-api/0.2/scm-api.hpi'
            hash: 'md5=9574c07bf6bfd02a57b451145c870f0e'
          mercurial:
            url: 'http://updates.jenkins-ci.org/download/plugins/mercurial/1.54/mercurial.hpi'
            hash: 'md5=1b46e2732be31b078001bcc548149fe5'
    

    (Note that plugin dependencies are not handled by Jenkins when installing from the command line, nor by this formula. So in the preceding example, just having an entry for the Mercurial plugin would not have been enough, because this plugin depends on scm-api.)

    Other aspects (such as security setup) are not handled yet (neither by the original formula, nor by our extension), but I tend to believe that it is acceptable to manage this "by hand" for now.

    Jobs management: the jenkins_jobs formula

    For this task, I leveraged the excellent jenkins-job-builder tool, which makes it possible to configure jobs using a declarative YAML syntax. The tool takes care of installing the job and also handles any housekeeping tasks such as checking configuration validity or deleting old configurations. With this tool, my goal was to let end-users of the Jenkins service add their own project by providing, at a minimum, a YAML job description file. So for instance, a simple job description for a CubicWeb job could be:

    - scm:
        name: cubicweb
        scm:
          - hg:
             url: http://hg.logilab.org/review/cubicweb
             clean: true
    
    - job:
        name: cubicweb
        display-name: CubicWeb
        scm:
          - cubicweb
        builders:
          - shell: "find . -name 'tmpdb*' -delete"
          - shell: "tox --hashseed noset"
        publishers:
          - email:
              recipients: cubicweb@lists.cubicweb.org
    

    It consists of two parts:

    • the scm section declares, well, SCM information, here the location of the review Mercurial repository, and,

    • a job section which consists of some metadata (project name), a reference of the SCM section declared above, some builders (here simple shell builders) and a publisher part to send results by email.

    Pretty simple. (Note that most of the test running configuration is declared within the source repository, via tox (another story), so that the CI bot holds minimal knowledge and fetches information from the source repository directly.)

    To automate the deployment of this kind of configurations, I made a jenkins_jobs-formula which takes care of:

    1. installing jenkins-job-builder,
    2. deploying YAML configurations,
    3. running jenkins-jobs update to push jobs into the Jenkins instance.

    In addition to installing the YAML file and triggering a jenkins-jobs update run upon changes of job files, the formula allows a job to list the distribution packages it requires for building.

    Wrapping things up, a pillar declaration of a Jenkins job looks like:

    jenkins_jobs:
      lookup:
        jobs:
          cubicweb:
            file: <path to local cubicweb.yaml>
            pkgs:
              - mercurial
              - python-dev
              - libgecode-dev
    

    where the file section indicates the source of the YAML file to install and pkgs lists build dependencies that are not managed by the job itself (typically non-Python packages in our case).

    So, as an end user, all you need to provide is the YAML file and a pillar snippet similar to the above.

    Outlook

    This initial setup appears to be enough to greatly reduce the burden of managing a Jenkins server and to allow individual users to contribute jobs for their project based on simple contribution to a Salt configuration.

    Later on, there are a few things I'd like to extend on the jenkins_jobs-formula side, most notably the handling of remote sources for the YAML configuration file (as well as maybe the packages list). I'd also like to experiment with configuring slaves for the Jenkins server, possibly relying on Docker (taking advantage of another of my experiments...).


  • Running a local salt-master to orchestrate docker containers

    2015/05/20 by David Douard

    In a recent blog post, Denis explained how to build Docker containers using Salt.

    What's missing there is how to have a running salt-master dedicated to Docker containers.

    There is no need for the salt-master to run as root for this. A test config of mine looks like:

    david@perseus:~$ mkdir -p salt/etc/salt
    david@perseus:~$ cd salt
    david@perseus:~salt/$ cat << EOF >etc/salt/master
    interface: 192.168.127.1
    user: david
    
    root_dir: /home/david/salt/
    pidfile: var/run/salt-master.pid
    pki_dir: etc/salt/pki/master
    cachedir: var/cache/salt/master
    sock_dir: var/run/salt/master
    
    file_roots:
      base:
        - /home/david/salt/states
        - /home/david/salt/formulas/cubicweb
    
    pillar_roots:
      base:
        - /home/david/salt/pillar
    EOF
    

    Here, 192.168.127.1 is the IP of my docker0 bridge. Also note that paths in the file_roots and pillar_roots configs must be absolute (they are not relative to root_dir, see the salt-master configuration documentation).

    Now we can start a salt-master that will be accessible to Docker containers:

    david@perseus:~salt/$ /usr/bin/salt-master -c etc/salt
    

    Warning

    With salt 2015.5.0, salt-master really wants to execute dmidecode, so add /usr/sbin to the $PATH variable before running the salt-master as a non-root user.

    From there, you can talk to your test salt master by adding the -c ~/salt/etc/salt option to all salt commands. Fortunately, you can also set the SALT_CONFIG_DIR environment variable:

    david@perseus:~salt/$ export SALT_CONFIG_DIR=~/salt/etc/salt
    david@perseus:~salt/$ salt-key
    Accepted Keys:
    Denied Keys:
    Unaccepted Keys:
    Rejected Keys:
    

    Now, you need a Docker image with salt-minion already installed, as explained in Denis' blog post. (I prefer using supervisord as PID 1 in my containers, but that's not important here.)

    david@perseus:~salt/ docker run -d --add-host salt:192.168.127.1  logilab/salted_debian:wheezy
    53bf7d8db53001557e9ae25f5141cd9f2caf7ad6bcb7c2e3442fcdbb1caf5144
    david@perseus:~salt/ docker run -d --name jessie1 --hostname jessie1 --add-host salt:192.168.127.1  logilab/salted_debian:jessie
    3da874e58028ff6dcaf3999b29e2563e1bc4d6b1b7f2f0b166f9a8faffc8aa47
    david@perseus:~salt/ salt-key
    Accepted Keys:
    Denied Keys:
    Unaccepted Keys:
    53bf7d8db530
    jessie1
    Rejected Keys:
    david@perseus:~/salt$ salt-key -y -a 53bf7d8db530
    The following keys are going to be accepted:
    Unaccepted Keys:
    53bf7d8db530
    Key for minion 53bf7d8db530 accepted.
    david@perseus:~/salt$ salt-key -y -a jessie1
    The following keys are going to be accepted:
    Unaccepted Keys:
    jessie1
    Key for minion jessie1 accepted.
    david@perseus:~/salt$ salt '*' test.ping
    jessie1:
        True
    53bf7d8db530:
        True
    

    You can now build Docker images as explained by Denis, or test your sls config files in containers.


  • Mini-Debconf Lyon 2015

    2015/04/29 by Julien Cristau
    //www.logilab.org/file/291628/raw/debian-france.png

    A couple of weeks ago I attended the mini-DebConf organized by Debian France in Lyon.

    It was a really nice week-end, and the first time a French mini-DebConf wasn't in Paris :)

    Among the highlights, Juliette Belin reported on her experience as a new contributor to Debian: she authored the awesome "Lines" theme which was selected as the default theme for Debian 8.

    //www.logilab.org/file/291626/raw/juliette.jpg

    As a non-developer and newcomer to the free software community, she had quite interesting insights and ideas about areas where development processes need to improve.

    And Raphael Geissert reported on the new httpredir.debian.org service (previously http.debian.net), an http redirector to automagically pick the closest Debian archive mirror. So long, manual sources.list updates on laptops whenever travelling!

    //www.logilab.org/file/291627/raw/raphael.jpg

    Finally the mini-DebConf was a nice opportunity to celebrate the release of Debian 8, two weeks in advance.

    Now it's time to go and upgrade all our infrastructure to jessie.


  • Building Docker containers using Salt

    2015/04/07 by Denis Laxalde

    In this blog post, I'll talk about a way to use Salt to automate the build and configuration of Docker containers. I will not consider the deployment of Docker containers with Salt as this subject is already covered elsewhere (here for instance). The emphasis here is really on building (or configuring) a container for future deployment.

    Motivation

    Salt is a remote execution framework that can be used for configuration management. It's already widely used at Logilab to manage our infrastructure as well as on a semi-daily basis during our application development activities.

    Docker is a tool that helps automating the deployment of applications within Linux containers. It essentially provides a convenient abstraction and a set of utilities for system level virtualization on Linux. Amongst other things, Docker provides container build helpers around the concept of dockerfile.

    So, the first question is: why would you use Salt to build Docker containers when you already have this Dockerfile building tool? My first motivation is to overcome the limitations of the declarations one can insert in a Dockerfile. First limitation: you can only execute instructions sequentially in a Dockerfile; there is no possibility of declaring dependencies between instructions or even of making an instruction conditional (apart from using the underlying shell conditional machinery, of course). Then, you have only limited possibilities for specializing a Dockerfile. Finally, it's not so easy to apply a configuration step by step, for instance while developing said configuration.

    That's enough for an introduction to lay down the underlying motivation of this post. Let's move on to more practical things!

    A Dockerfile for the base image

    Before jumping into the usage of Salt for the configuration of a Docker image, the first thing you need to do is to turn a Docker container into a proper Salt minion.

    Assuming we're building on top of a base image of Debian flavour, subsequently referred to as <debian> (I won't tell you where it comes from, since you ought to build your own base image -- or find some friend you trust to provide you with one!), the following Dockerfile can be used to initialize a working image which will serve as the starting point for further configuration with Salt:

    FROM <debian>
    RUN apt-get update
    RUN apt-get install -y salt-minion
    

    Then, run docker build -t docker_salt/debian_salt_minion . and you're done.

    Plug the minion container into the Salt master

    The next thing to do with our fresh Debian+salt-minion image is to turn it into a container running salt-minion, waiting for the Salt master to instruct it.

    docker run --add-host=salt:10.1.1.1 --hostname docker_minion \
        --name minion_container \
        docker_salt/debian_salt_minion salt-minion
    

    Here:

    • --hostname is used to specify the network name of the container, for easier query by the Salt master,
    • 10.1.1.1 is usually the IP address of the host, which in our example will serve as the Salt master,
    • --name is just used for easier book-keeping.

    Finally,

    salt-key -a docker_minion
    

    will register the new minion's key into the master's keyring.

    If all went well, the following command should succeed:

    salt 'docker_minion' test.ping
    

    Configuring the container with a Salt formula

    Once the minion answers the ping, configuring the container boils down to applying your formulas (or the full highstate) from the master:

    salt 'docker_minion' state.sls some_formula
    salt 'docker_minion' state.highstate
    

    Final steps: save the configured image and build a runnable image

    (Optional step: clean up the salt-minion installation.)

    Make a snapshot image of your configured container.

    docker stop minion_container
    docker commit -m 'Install something with Salt' \
        minion_container me/something
    

    Try out your new image:

    docker run -p 8080:80 me/something <entry point>
    

    where <entry point> will be the main program driving the service provided by the container (typically defined through the Salt formula).

    Make a fully configured image for your service:

    FROM me/something
    [...anything else you need, such as EXPOSE, etc...]
    CMD <entry point>
    

  • Monitoring our websites before we deploy them using Salt

    2015/03/11 by Arthur Lutz

    As you might have noticed, we're quite big fans of Salt. One of the things that Salt enables us to do is to apply what we're used to doing with code to our infrastructure. Let's look at TDD (Test Driven Development).

    Write the test first, make it fail, implement the code, test goes green, you're done.

    Apply the same thing to infrastructure and you get TDI (Test Driven Infrastructure).

    So before you deploy a service, you make sure that your supervision (Shinken, Nagios, Icinga, Salt-based monitoring, etc.) is doing the correct test, you deploy, and then your supervision goes green.

    Let's take a look at website supervision. At Logilab we weren't too satisfied with how our shinken/http_check setup was working, so we started using uptime (nodejs + mongodb). Uptime has a simple REST API to get and add checks, so we wrote a salt execution module and a states module for it.
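
    To give an idea of what such an execution module looks like, here is a heavily stripped-down sketch (the API root, endpoint and payload fields are assumptions for the example, not the actual module we published):

    # uptime.py -- minimal sketch of a Salt execution module for the uptime REST API
    import requests

    API_URL = 'http://localhost:8082/api/checks'  # hypothetical uptime API root

    def list_checks():
        '''Return the checks currently known to the uptime server.'''
        return requests.get(API_URL).json()

    def monitored(name):
        '''Ensure an HTTP check exists for the given URL.'''
        if any(check.get('url') == name for check in list_checks()):
            return 'already monitored'
        requests.post(API_URL, json={'url': name})
        return 'added'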

    https://www.logilab.org/file/288174/raw/68747470733a2f2f7261772e6769746875622e636f6d2f667a616e696e6f74746f2f757074696d652f646f776e6c6f6164732f636865636b5f64657461696c732e706e67.png

    For the sites that use the apache-formula, we simply loop over the domains declared in the pillars to add checks:

    {% for domain in salt['pillar.get']('apache:sites').keys() %}
    uptime {{ domain }} (http):
      uptime.monitored:
        - name : http://{{ domain }}
    {% endfor %}
    

    For other URLs (specific URLs such as sitemaps) we can list them in pillars and do:

    {% for url in salt['pillar.get']('uptime:urls') %}
    uptime {{ url }}:
      uptime.monitored:
        - name : {{ url }}
    {% endfor %}
    

    That's it. Monitoring comes before deployment.

    We've also contributed a formula for deploying uptime.

    Follow us if you are interested in Test Driven Infrastructure for we intend to write regular reports as we make progress exploring this new domain.


  • A report on the Salt Sprint 2015 in Paris

    2015/03/05 by Arthur Lutz

    On Wednesday the 4th of March 2015, Logilab hosted a sprint on salt on the same day as the sprint at SaltConf15. Seven people joined in and hacked on salt for a few hours. We collaboratively chose some subjects on a pad which is still available.

    //www.logilab.org/file/248336/raw/Salt-Logo.png

    We started off by familiarising those who had never used them with the tests in salt. Some of us tried to run the tests via tox, which didn't work any more; a fix was found and will be submitted to the project.

    We organised ourselves into teams.

    Boris & Julien looked at the authorisation code and wrote a few issues (minion enumeration, acl documentation). On saltpad (client side) they modified the targeting to adapt to the permissions that the salt-api sends back.

    We discussed the salt permission model (external_auth): where should the filter happen? On the master? Should the minion receive information about authorisation and not execute what is being asked for? Boris will summarise some of the discussion about authorisations in a new issue.

    //www.logilab.org/file/288010/raw/IMG_3034.JPG

    Sofian worked on some unification of the execution modules (refresh_db, which will be ignored by the modules that don't understand it). He will submit a pull request in the next few days.

    Georges & Paul added some tests to hg_pillar: the test creates a Mercurial repository, adds a top.sls and a file, and checks that they are visible. Here is the diff. They had some problems while debugging the tests.

    David & Arthur implemented an execution module for managing PostgreSQL clusters (create, list, exists, remove) in Debian. A pull request was submitted by the end of the day. A state module should follow shortly. Along the way we removed some dead code in the postgres module.

    All in all, we had some interesting discussions about salt and its architecture, shared tips about developing and using it, and managed to get some code done. Thanks to all for participating and hopefully we'll sprint again soon...


  • Generate stats from your SaltStack infrastructure

    2014/12/15 by Arthur Lutz

    As presented at the November French meetup of SaltStack users, we've published code to generate some statistics about a SaltStack infrastructure. We're using it, for the moment, to identify which parts of our infrastructure need attention. One of the tools we're using to monitor this distance is munin.
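
    The general idea (a simplified sketch, not the actual salt-highstate-stats code) is to ask every minion for its highstate through Salt's Python API and count the managed states:

    # Simplified sketch: count managed states per minion with Salt's Python client.
    import salt.client

    client = salt.client.LocalClient()
    highstates = client.cmd('*', 'state.show_highstate', timeout=60)
    for minion, states in sorted(highstates.items()):
        print('{0}: {1} managed states'.format(minion, len(states)))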

    You can grab the code at bitbucket salt-highstate-stats, fork it, post issues, discuss it on the mailing lists.

    If you're french speaking, you can also read the slides of the above presentation (mirrored on slideshare).

    Hope you find it useful.


  • Using Saltstack to limit impact of Poodle SSLv3 vulnerability

    2014/10/15 by Arthur Lutz

    Here at Logilab, we're big fans of SaltStack automation. As seen with Heartbleed, controlling your infrastructure and being able to fix your servers with a few commands is invaluable, as documented in this blog post. The same applied more recently to Shellshock, as described in this blog post.

    Yesterday we got the news that a big SSL vulnerability was going to be disclosed. Code name: Poodle. This morning we got the details and started working on a fix through salt.

    So far, we've handled configuration changes and service restarts for apache, nginx and postfix, and user configuration for iceweasel (Debian's Firefox) and chromium (adapting to Firefox and Chrome should be a breeze). Some credit goes to mtpettyp for his answer on askubuntu.

    http://www.logilab.org/file/267853/raw/saltstack_poodlebleed.jpg
    {% if salt['pkg.version']('apache2') %}
    poodle apache server restart:
        service.running:
            - name: apache2
      {% for foundfile in salt['cmd.run']('rgrep -m 1 SSLProtocol /etc/apache*').split('\n') %}
        {% if 'No such file' not in foundfile and 'bak' not in foundfile and foundfile.strip() != ''%}
    poodle {{ foundfile.split(':')[0] }}:
        file.replace:
            - name : {{ foundfile.split(':')[0] }}
            - pattern: "SSLProtocol all -SSLv2[ ]*$"
            - repl: "SSLProtocol all -SSLv2 -SSLv3"
            - backup: False
            - show_changes: True
            - watch_in:
                service: apache2
        {% endif %}
      {% endfor %}
    {% endif %}
    
    {% if salt['pkg.version']('nginx') %}
    poodle nginx server restart:
        service.running:
            - name: nginx
      {% for foundfile in salt['cmd.run']('rgrep -m 1 ssl_protocols /etc/nginx/*').split('\n') %}
        {% if 'No such file' not in foundfile and 'bak' not in foundfile and foundfile.strip() != ''%}
    poodle {{ foundfile.split(':')[0] }}:
        file.replace:
            - name : {{ foundfile.split(':')[0] }}
            - pattern: "ssl_protocols .*$"
            - repl: "ssl_protocols TLSv1 TLSv1.1 TLSv1.2;"
            - show_changes: True
            - watch_in:
                service: nginx
        {% endif %}
      {% endfor %}
    {% endif %}
    
    {% if salt['pkg.version']('postfix') %}
    poodle postfix server restart:
        service.running:
            - name: postfix
    poodle /etc/postfix/main.cf:
    {% if 'main.cf' in salt['cmd.run']('grep smtpd_tls_mandatory_protocols /etc/postfix/main.cf') %}
        file.replace:
            - pattern: "smtpd_tls_mandatory_protocols=.*"
            - repl: "smtpd_tls_mandatory_protocols=!SSLv2,!SSLv3"
    {% else %}
        file.append:
            - text: |
                # poodle fix
                smtpd_tls_mandatory_protocols=!SSLv2,!SSLv3
    {% endif %}
            - name: /etc/postfix/main.cf
            - watch_in:
                service: postfix
    {% endif %}
    
    {% if salt['pkg.version']('chromium') %}
    /usr/share/applications/chromium.desktop:
        file.replace:
            - pattern: Exec=/usr/bin/chromium %U
            - repl: Exec=/usr/bin/chromium --ssl-version-min=tls1 %U
    {% endif %}
    
    {% if salt['pkg.version']('iceweasel') %}
    /etc/iceweasel/pref/poodle.js:
        file.managed:
            - text : pref("security.tls.version.min", "1")
    {% endif %}
    

    The code is also published as a gist on github. Feel free to comment and fork the gist. There is room for improvement, and don't forget that by disabling SSLv3 you might prevent some users with "legacy" browsers from accessing your services.


  • Report from DebConf14

    2014/09/05 by Julien Cristau

    Last week I attended DebConf14 in Portland, Oregon. As usual the conference was a blur, with lots of talks, lots of new people, and lots of old friends. The organizers tried to do something different this year, with a longer conference (9 days instead of a week) and some dedicated hack time, instead of a pre-DebConf "DebCamp" week. That worked quite well for me, as it meant the schedule was not quite so full with talks, and even though I didn't really get any hacking done, it felt a bit more relaxed and allowed some more hallway track discussions.

    http://www.logilab.org/file/264666/raw/Screenshot%20from%202014-09-05%2015%3A09%3A38.png

    On the talks side, the keynotes from Zack and Biella provided some interesting thoughts. Some nice progress was made on making package builds reproducible.

    I gave two talks: an introduction to salt (odp),

    http://www.logilab.org/file/264663/raw/slide2.jpg

    and a report on the Debian jessie release progress (pdf).

    http://www.logilab.org/file/264665/raw/slide3.jpg

    And as usual all talks were streamed live and recorded, and many are already available thanks to the awesome DebConf video team. Also for a change, and because I'm a sucker for punishment, I came back with more stuff to do.


  • Logilab at Debconf 2014 - Debian annual conference

    2014/08/21 by Arthur Lutz

    Logilab is proud to contribute to the annual Debian conference, which will take place in Portland (USA) from the 23rd to the 31st of August.

    Julien Cristau (debian page) will be giving two talks at the conference:

    http://www.logilab.org/file/263602/raw/debconf2014.png

    Logilab is also contributing to the conference as a sponsor for the event.

    Here is what we previously blogged about salt and the previous DebConf. Stay tuned for a blog post about what we saw and heard at the conference.

    https://www.debian.org/logos/openlogo-100.png

  • Pylint 1.3 / Astroid 1.2 released

    2014/07/28 by Sylvain Thenault

    The EP14 Pylint sprint team (more on this here and there) is proud to announce they just released Pylint 1.3 together with its companion Astroid 1.2. As usual, this includes several new features as well as bug fixes. You'll find below a structured list of the changes.

    Packages have been uploaded to pypi; debian/ubuntu packages should soon be provided by Logilab, until they get into the standard packaging system of your favorite distribution.

    Please note that Pylint 1.3 will be the last release branch to support Python 2.5 and 2.6. Starting from 1.4, we will only support Python 2.7 or greater. This will be the occasion to do some great cleanup in the code base. Note that this only concerns Pylint's runtime: you should still be able to run Pylint on your Python 2.5 code, though Pylint itself will need at least Python 2.7 to run.

    New checks

    • Add multiple checks for PEP 3101 advanced string formatting: 'bad-format-string', 'missing-format-argument-key', 'unused-format-string-argument', 'format-combined-specification', 'missing-format-attribute' and 'invalid-format-index' (some of these are illustrated in the snippet after this list)
    • New 'invalid-slice-index' and 'invalid-sequence-index' for invalid sequence and slice indices
    • New 'assigning-non-slot' warning, which detects assignments to attributes not defined in slots
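
    For illustration, here is a made-up snippet (not taken from the Pylint test suite) with code that the new format-string checks would flag; the function is never called, so the snippet itself runs fine:

    def bad_formatting_examples():
        "{0} {}".format(1, 2)        # format-combined-specification
        "{foo}".format(bar=42)       # missing-format-argument-key
        "{a}".format(a=1, b=2)       # unused-format-string-argument
        "{0.missing}".format(42)     # missing-format-attribute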

    Improved checkers

    • Fixed 'fixme' false positive (#149)
    • Fixed 'unbalanced-iterable-unpacking' false positive when encountering starred nodes (#273)
    • Fixed 'bad-format-character' false positive when encountering the 'a' format on Python 3
    • Fixed 'unused-variable' false positive when the variable is assigned through an import (#196)
    • Fixed 'unused-variable' false positive when assigning to a nonlocal (#275)
    • Fixed 'pointless-string-statement' false positive for attribute docstrings (#193)
    • Emit 'undefined-variable' when using the Python 3 metaclass= argument. Also fix an 'unused-import' false positive for that construction (#143)
    • Emit 'broad-except' and 'bare-except' even if the number of except handlers is different than 1. Fixes issue (#113)
    • Emit 'attribute-defined-outside-init' for all statements in the same module as the offended class, not just for the last assignment (#262, as well as a long standing output mangling problem in some edge cases)
    • Emit 'not-callable' when calling properties (#268)
    • Don't let ImportError propagate from the imports checker, leading to crash in some namespace package related cases (#203)
    • Don't emit 'no-name-in-module' for ignored modules (#223)
    • Don't emit 'unnecessary-lambda' if the body of the lambda call contains call chaining (#243)
    • Definition order is considered for classes, function arguments and annotations (#257)
    • Only emit 'attribute-defined-outside-init' for definitions within the same module as the offended class, avoiding mangling the output in some cases
    • Don't emit 'hidden-method' message when the attribute has been monkey-patched, you're on your own when you do that.

    Others changes

    • Checkers are now properly ordered to respect priority (#229)
    • Use the proper mode for pickle when opening and writing the stats file (#148)

    Astroid changes

    • Function nodes can detect decorator call chain and see if they are decorated with builtin descriptors (classmethod and staticmethod).
    • infer_call_result called on a subtype of the builtin type will now return a new Class rather than an Instance.
    • Class.metaclass() now handles module-level __metaclass__ declaration on python 2, and no longer looks at the __metaclass__ class attribute on python 3.
    • Add slots method to Class nodes, for retrieving the list of valid slots it defines.
    • Expose function annotation to astroid: Arguments node exposes 'varargannotation', 'kwargannotation' and 'annotations' attributes, while Function node has the 'returns' attribute.
    • Backported most of the logilab.common.modutils module there, as most things there are for pylint/astroid only and we want to be able to fix them without requiring a new logilab.common release
    • Fix names grabbed using wildcard import in "absolute import mode" (i.e. with absolute_import activated from the __future__ or with python 3) (pylint issue #58)
    • Add support in brain for understanding enum classes.

  • EP14 Pylint sprint Day 2 and 3 reports

    2014/07/28 by Sylvain Thenault
    https://ep2014.europython.eu/static_media/assets/images/logo.png

    Here is the list of things we managed to achieve during those last two days at EuroPython.

    After several attempts, Michal managed to have pylint running analysis on several files in parallel. This is still in a pull request (https://bitbucket.org/logilab/pylint/pull-request/82/added-support-for-checking-files-in) because of some limitations, so we decided it won't be part of the 1.3 release.

    Claudiu killed maybe 10 bugs or so and did some heavy issue cleanup in the trackers. He also demonstrated some experimental support of Python 3 style annotations to drive better inference. Pretty exciting! Torsten also killed several bugs, restored Python 2.5 compatibility (though that will need a logilab-common release as well), and introduced a new functional test framework that will replace the old one once all the existing tests have been backported. On Wednesday, he showed us a near-future feature they already have at Google: some kind of confidence level associated with messages so that you can filter based on it. Sylvain fixed a couple of bugs (including https://bitbucket.org/logilab/pylint/issue/58/ which was annoying the whole numpy community), started some refactoring of the PyLinter class so it does a little bit fewer things (still way too many though) and attempted to improve the pylint score of both pylint and astroid, which went down recently "thanks" to new checks like 'bad-continuation'.

    Also, we merged the pylint-brain project into astroid to simplify things, so you should now submit your brain plugins directly to the astroid project. Hopefully you'll be redirected there when attempting to use the old (removed) pylint-brain project on Bitbucket.

    And, the good news is that now both Torsten and Claudiu have new powers: they should be able to do some releases of pylint and astroid. To celebrate that and the end of the sprint, we published Pylint 1.3 together with Astroid 1.2. More on this here.


  • EP14 Pylint sprint Day 1 report

    2014/07/24 by Sylvain Thenault
    https://ep2014.europython.eu/static_media/assets/images/logo.png

    We've had a fairly enjoyable and productive first day in our little hidden room at EuroPython in Berlin! Below are some notable things we worked on and discussed.

    First, we discussed and agreed that while we should at some point cut the cord to the logilab.common package, it will take some time, notably because of the usage of logilab.common.configuration, which would be somewhat costly to replace (and is working pretty well). There are some small steps we can take, but basically we should move the pylint/astroid-specific things from logilab.common back into astroid or pylint. This should be partly done during the sprint, and the remaining work will go into tickets in the tracker.

    We also discussed release management. The point is that we should release more often, so every pylint maintainer should be able to do that easily. Sylvain will write a document about the release procedure and ensure access is granted to the pylint and astroid projects on PyPI. We shall release pylint 1.3 / astroid 1.2 soon, and those release branches will be the last ones supporting Python < 2.7.

    During this first day, we also had the opportunity to meet Carl Crowder, the guy behind http://landscape.io, as well as David Halter, who is building the Jedi completion library (https://github.com/davidhalter/jedi). Landscape.io runs pylint on thousands of projects, and it would be nice if we could test beta releases on part of this panel. On the other hand, there is probably a lot of code to share with the Jedi library, like the parser and AST generation, as well as a static inference engine. That deserves a sprint of its own though, so we agreed that a nice first step would be to build a common library for import resolution without relying on the Python interpreter, while handling most of Python's dark import features like zip/egg imports, .pth files and so on. Indeed those may be two nice future collaborations!

    Last but not least, we got some actual work done:

    • Michal Nowikowski from Intel in Poland joined us to work on the ability to run pylint in several processes, which may drastically improve performance on multi-core boxes.
    • Torsten continued work on various improvements to the functional test framework.
    • Sylvain merged the logilab.common.modutils module into astroid, as it's mostly driven by astroid and pylint needs, and also fixed the annoying namespace package crash.
    • Claudiu kept up the good work he does daily at improving and fixing pylint :)

  • Nazca notebooks

    2014/07/04 by Vincent Michel

    We have just published the following IPython notebooks explaining how to perform record linkage and entity matching with Nazca:


  • Open Legislative Data Conference 2014

    2014/06/10 by Nicolas Chauvat

    I was at the Open Legislative Data Conference on May 28th, 2014 in Paris, to present a simple demo I had been working on since the same event two years ago.

    The demo was called "Law is Code Rebooted with CubicWeb". It featured the use of the cubicweb-vcreview component to display the amendments of the hospital law ("loi hospitalière") gathered into a version control system (namely Mercurial).

    The basic idea is to compare writing code and writing law, for both are collaborative and distributed writing processes. Could we reuse for the second one the tools developed for the first?

    Here are the slides and a few screenshots.

    http://www.logilab.org/file/253394/raw/lawiscode1.png

    Statistics with queries embedded in report page.

    http://www.logilab.org/file/253400/raw/lawiscode2.png

    List of amendments.

    http://www.logilab.org/file/253396/raw/lawiscode3.png

    User comment on an amendment.

    While attending the conference, I enjoyed several interesting talks and chats with other participants, including:

    1. the study of co-sponsorship of proposals in the French parliament
    2. data.senat.fr announcing their use of PostgreSQL and JSON.
    3. and last but not least, the great work done by RegardsCitoyens and SciencesPo MediaLab on visualizing the law making process.

    Thanks to the organisation team and the other speakers. Hope to see you again!


  • SaltStack Meetup with Thomas Hatch in Paris France

    2014/05/22 by Arthur Lutz

    This Monday (19th of May 2014), Thomas Hatch was in Paris for dotScale 2014. After presenting SaltStack there (videos will be published at some point), he spent the evening with members of the French SaltStack community during a meetup set up by Logilab at IRILL.

    http://www.logilab.org/file/248338/raw/thomas-hatch.png

    Here is a list of what we talked about:

    • Since Salt seems to have pushed ZMQ to its limits, SaltStack has been working on RAET (Reliable Asynchronous Event Transport protocol), a transport layer based on UDP and elliptic curve cryptography (Dan Bernstein's Curve25519) that works more like a stack than a socket and has reliability built in. RAET will be released as an optional beta feature in the next Salt release.
    • Folks from Dailymotion bumped into a bug that seems related to high latency networks and the auth_timeout. Updating to the very latest release should fix the issue.
    • Thomas told us about how a dedicated team at SaltStack handles pull requests and another team works on triaging github issues to input them into their internal SCRUM process. There are a lot of duplicate issues and old inactive issues that need attention and clutter the issue tracker. Help will be welcome.
    http://www.logilab.org/file/248336/raw/Salt-Logo.png
    • Continuous integration is based on Jenkins and spins up VMs to test pull requests. There is work in progress to test multiple clouds, various latencies and loads.
    • For the Docker integration, salt now keeps track of forwarded ports and relevant information about the containers.
    • salt-virt bumped into problems with chroots and timeouts due to ZMQ.
    • Multi-master: the problem lies with synchronisation of the data sent to the minions, but also of the data sent to the masters. Possible solutions to be explored are: the use of gitfs; for keys, there is no built-in solution (salt-key has to be run on all masters); mine.send should send the data to both masters; for the jobs cache, one could use an external returner.
    • Thomas talked briefly about ioflo which should bring queuing, data hierarchy and data pub-sub to Salt.
    http://www.logilab.org/file/248335/raw/ioflo.png
    • About the rolling release question: Salt releases are definitely not git snapshots; things get backported into previous versions. There is no clear definition yet of how long LTS versions will be supported.
    • salt-cloud and libcloud: in the next release, libcloud will not be a hard dependency. Some clouds didn't work in libcloud (for example AWS), so these providers got implemented directly in salt-cloud or by using third-party libraries (e.g. python-boto).
    • Documentation: a sprint is planned next week. Reference documentation will not be completely revamped, but tutorial content will be added.

    Boris Feld showed a demo of vagrant images orchestrated by salt and a web UI to monitor a salt install.

    http://www.vagrantup.com/images/logo_vagrant-81478652.png

    Thanks again to Thomas Hatch for coming and meeting up with (part of) the community here in France.


  • Salt April Meetup in Paris (France)

    2014/05/14 by Arthur Lutz

    On the 15th of April, in Paris (France), we took part in yet another Salt meetup. The community is now meeting up once every two months.

    We had two presentations:

    • Arthur Lutz made an introduction to returners and the scheduler using the SalMon monitoring system as an example. Salt is not only about configuration management, indeed!
    • The folks from Is Cool Entertainment did a presentation about how they are using salt-cloud to deploy and orchestrate clusters of EC2 machines (islands in their jargon) to reproduce parts of their production environment for testing and development.

    More discussions about various salty subjects followed and were pursued in an Italian restaurant (photos here).

    In case it is not already in your diary: Thomas Hatch is coming to Paris next week, on Monday the 19th of May, and will be speaking at dotScale during the day and at a Salt meetup in the evening. The Salt meetup will take place at IRILL (like the previous meetups, thanks again to them) and should start at 19:00. The meetup is free and open to the public, but registering on this framadate would be appreciated.


  • Pylint 1.2 released!

    2014/04/22 by Sylvain Thenault

    Once again, a lot of work has been achieved since the latest 1.1 release. Claudiu, who joined the maintainer team (Torsten and me), did great work in the past few months. Also, lately Torsten has backported a lot of things from their internal G[oogle]Pylint. Last but not least, various people contributed by reporting issues and proposing pull requests. So thanks to everybody!

    Notice that Pylint 1.2 depends on astroid 1.1, which has been released at the same time. Currently, the code is available on PyPI, and Debian/Ubuntu packages should be ready shortly in Logilab's acceptance repositories.

    Below is the changes summary, check the changelog for more info.

    New and improved checks:

    • New message 'eval-used' checking that the builtin function eval was used.
    • New message 'bad-reversed-sequence' checking that the reversed builtin receives a sequence (i.e. something that implements __getitem__ and __len__ without being a dict or a dict subclass) or an instance which implements __reversed__ (see the short illustration after this list).
    • New message 'bad-exception-context' checking that raise ... from ... uses a proper exception context (None or an exception).
    • New message 'abstract-class-instantiated' warning when abstract classes created with the abc module and with abstract methods are instantiated.
    • New messages checking for proper class __slots__: 'invalid-slots-object' and 'invalid-slots'.
    • New message 'undefined-all-variable' if a package's __all__ variable contains a missing submodule (#126).
    • New option logging-modules giving the list of module names that can be checked for 'logging-not-lazy'.
    • New option include-naming-hint to show a naming hint for invalid name (#138).
    • Mark file as a bad function when using python2 (#8).
    • Add support for enforcing multiple, but consistent name styles for different name types inside a single module.
    • Warn about empty docstrings on overridden methods.
    • Inspect arguments given to constructor calls, and emit relevant warnings.
    • Extend the number of cases in which logging calls are detected (#182).
    • Enhance the check for 'used-before-assignment' to look for nonlocal uses.
    • Improve cyclic import detection in the case of packages.
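
    As a hedged illustration (my own snippet, not from the release notes), here is the kind of code a couple of the new checks should flag:

    # The function is never called, so the module runs fine while pylint
    # analyses it statically.
    def questionable():
        settings = {"debug": True}
        result = eval("1 + 1")           # eval-used
        for key in reversed(settings):   # bad-reversed-sequence: a dict is not a sequence
            print(key, result)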

    Bug fixes:

    • Do not warn about 'return-arg-in-generator' in Python 3.3+.
    • Do not warn about 'abstract-method' when the abstract method is implemented through assignment (#155).
    • Do not register most of the 'newstyle' checker warnings with python >= 3.
    • Fix 'unused-import' false positive with augmented assignment (#78).
    • Fix 'access-member-before-definition' false negative with augmented assignment (#164).
    • Do not crash when looking for 'used-before-assignment' in context manager assignments (#128).
    • Do not attempt to analyze non-Python files, e.g. '.so' files (#122).
    • Pass the current python path to pylint process when invoked via epylint (#133).

    Command line:

    • Add -i / --include-ids and -s / --symbols back as completely ignored options (#180).
    • Ensure init-hooks is evaluated before other options, notably load-plugins (#166).

    Other:

    • Improve pragma handling to not detect 'pylint:*' strings in non-comments (#79).
    • Do not crash with UnknownMessage if an unknown message identifier/name appears in disable or enable in the configuration (#170).
    • Search for the rc file in ~/.config/pylintrc if ~/.pylintrc doesn't exist (#121).
    • Python 2.5 support restored (#50 and #62).

    Astroid:

    • Python 3.4 support
    • Enhanced support for metaclass
    • Enhanced namedtuple support

    Nice easter egg, no?


  • Code_Aster back in Debian unstable

    2014/03/31 by Denis Laxalde

    Last week, a new release of Code_Aster entered Debian unstable. Code_Aster is a finite element solver for partial differential equations in mechanics, mainly developed by EDF R&D (Électricité de France). It is arguably one of the most feature-complete pieces of free software available in this domain.

    Aster has been in Debian since 2012 thanks to the work of the debian-science team. Yet it has always been somewhat of a problematic package, with a couple of persistent Release Critical (RC) bugs (FTBFS, installability issues), and it never actually entered a stable release of Debian.

    Logilab has been committed to improving Code_Aster for a long time in various areas, notably through the LibAster friendly fork, which aims at turning the monolithic Aster into a library, usable from Python.

    Recently, the EDF R&D team in charge of the development of Code_Aster took several major decisions, including:

    • the move to Bitbucket forge as a sign of community opening (following the path opened by LibAster that imported the code of Code_Aster into a Mercurial repository) and,
    • the change of build system from a custom makefile-style architecture to a fine-grained Waf system (taken from that of LibAster).

    The latter obviously led to significant changes on the Debian packaging side, most of which go in a sane direction: the debian/rules file slimmed down from 239 lines to 51, and a bunch of tricky install-step manipulations were dropped, leading to something much simpler and closer to upstream (see #731211 for details). From the upstream perspective, this re-packaging effort based on the new build system may be an opportunity to update the installation scheme (in particular by declaring the Python library as private).

    Clearly, there's still room for improvement on both sides (like building with the new metis library, or shipping several versions of Aster: stable/testing, MPI/serial). All in all, this is good for both Debian users and upstream developers. At Logilab, we hope that this effort will consolidate our collaboration with EDF R&D.


  • Second Salt Meetup builds the french community

    2014/03/04 by Arthur Lutz

    On the 6th of February, the Salt community in France met in Paris to discuss Salt and choose the tools to federate itself. The meetup was kindly hosted by IRILL.

    There were two formal presentations:

    • Logilab did a short introduction to Salt,
    • Majerti presented feedback on their experience with Salt in various professional contexts.

    The presentation space was then opened to other participants and Boris Feld did a short presentation of how Salt was used at NovaPost.

    http://www.logilab.org/file/226420/raw/saltstack_meetup.jpeg

    We then had a short break to share some pizza (sponsored by Logilab).

    After the break, we had some open discussion about various subjects, including "best practices" in Salt and some specific use cases. Regis Leroy talked about the states that Makina Corpus has been publishing on GitHub. The idea of reconciling the documentation and the monitoring of an infrastructure was brought up by Logilab, which calls it "Test Driven Infrastructure".

    The tools we collectively chose to form the community were the following:

    • a mailing list kindly hosted by the AFPY (a Pythonic French organization)
    • a dedicated #salt-fr IRC channel on freenode

    We decided that the meetup would take place every two months, hence the third one will be in April. There is already some discussion about organizing events to tell as many people as possible about Salt. It will probably start with an event at NUMA in March.

    After the meetup was officially over, a few people went on to have some drinks nearby. Thank you all for coming and your participation.



  • FOSDEM PGDay 2014

    2014/02/11 by Rémi Cardona

    I attended PGDay on January 31st, in Brussels. This event was held just before FOSDEM, which I also attended (expect another blog post). Here are some of the notes I took during the conference.

    https://fosdem.org/2014/support/promote/wide.png

    Statistics in PostgreSQL, Heikki Linnakangas

    Due to transit delays, I only caught the last half of that talk.

    The main goal of this talk was to explain some of Postgres' per-column statistics. In a nutshell, Postgres needs to have some idea about tables' content in order to choose an appropriate query plan.

    Heikki explained which sorts of statistics Postgres gathers, such as most common values and histograms. Another interesting stat is the correlation between physical pages and data ordering (see CLUSTER).

    Column statistics are gathered when running ANALYZE and stored in the pg_statistic system catalog. The pg_stats view provides a human-readable version of these stats.

    Heikki also explained how to determine whether performance issues are due to out-of-date statistics or not. As it turns out, EXPLAIN ANALYZE shows for each step of the query plan how many rows the planner expected to process and how many it actually processed. The rule of thumb is that similar values (no more than an order of magnitude apart) mean that column statistics are doing their job. A wider margin between expected and actual rows means that statistics are possibly preventing the query planner from picking a more optimized plan.

    It was noted though that statistics-related performance issues often happen on tables with very frequent modifications. Running ANALYZE manually or increasing the frequency of the automatic ANALYZE may help in those situations.

    Advanced Extension Use Cases, Dimitri Fontaine

    Dimitri explained with very simple cases the use of some of Postgres' lesser-known extensions and the overall extension mechanism.

    Here's a grocery-list of the extensions and types he introduced:

    • intarray extension, which adds operators and functions to the standard ARRAY type, specifically tailored for arrays of integers,
    • the standard POINT type which provides basic 2D flat-earth geometry,
    • the cube extension that can represent N-dimensional points and volumes,
    • the earthdistance extension that builds on cube to provide distance functions on a sphere-shaped Earth (a close-enough approximation for many uses),
    • the pg_trgm extension which provides text similarity functions based on trigram matching (a much simpler thus faster alternative to Levenshtein distances), especially useful for "typo-resistant" auto-completion suggestions,
    • the hstore extension which provides a simple-but-efficient key value store that has everyone talking in the Postgres world (it's touted as the NoSQL killer),
    • the hll extension, which implements the HyperLogLog algorithm and seems very well suited to storing and counting unique visitors on a web site, for example.

    An all-around great talk with simple but meaningful examples.

    http://tapoueh.org/images/fosdem_2014.jpg

    Integrated cache invalidation for better hit ratios, Magnus Hagander

    What Magnus presented almost amounted to a tutorial on caching strategies for busy web sites. He went through simple examples, using the ubiquitous Django framework for the web view part and Varnish for the HTTP caching part.

    The whole talk revolved around adding private (X-prefixed) HTTP headers in replies containing one or more "entity IDs" so that Varnish's cache can be purged whenever said entities change. The hard problem lies in how and when to call PURGE on Varnish.

    The obvious solution is to override Django's save() method on Model-derived objects. One can then use httplib (or better yet, requests) to purge the cache. This solution can be slightly improved by using Django's signal mechanism instead, which sounds an awful lot like CubicWeb's hooks.
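
    As an illustration (my own sketch, not code shown in the talk), such a save() override inside an existing Django app could look like the following; the model, its field, the Varnish host and the purge URL scheme are all hypothetical:

    import requests
    from django.db import models

    class Article(models.Model):
        title = models.CharField(max_length=200)

        def save(self, *args, **kwargs):
            super(Article, self).save(*args, **kwargs)
            # Purge the cached page for this article (hypothetical purge URL).
            requests.request('PURGE', 'http://varnish.example.com/articles/%d/' % self.pk)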

    The problem with the above solution is that any DB modification not going through Django (and they will happen) will not invalidate the cached pages. So Magnus then presented how to write the same cache-invalidating code in PL/Python in triggers.

    While this does solve that last issue, it introduces synchronous HTTP calls in the DB, hurting write performance (or killing it completely if the HTTP calls fail). So the fix for those problems, at the price of a little added latency, is to use SkyTools' PgQ, a simple message queue built on Postgres. Moving the HTTP calls outside of the main database and into a Consumer (a class provided by PgQ's Python bindings) makes the cache-invalidating trigger asynchronous, reducing write overhead.

    http://www.logilab.org/file/210615/raw/varnish_django_postgresql.png

    A clear, concise and useful talk for any developer in charge of high-traffic web sites or applications.

    The Worst Day of Your Life, Christophe Pettus

    Christophe humorously went back to that dreadful day in the collective Postgres memory: the release of 9.3.1 and the streaming replication chaos.

    My overall impression of the talk: Thank $DEITY I'm not a DBA!

    But Christophe also gave some valuable advice, even for non-DBAs:

    • Provision 3 times the necessary disk space, in case you need to pg_dump or otherwise do a snapshot of your currently running database,
    • Do backups and test them:
      • give them to developers,
      • use them for analytics,
      • test the restore, make it foolproof, try to automate it,
    • basic Postgres hygiene:
      • fsync = on (on by default, DON'T TURN IT OFF, there are better ways)
      • full_page_writes = on (on by default, don't turn it off)
      • deploy minor versions as soon as possible,
      • plan upgrade strategies before EOL,
      • 9.3+ checksums (createdb option, performance cost is minimal),
      • application-level consistency checks (don't wait for auto vacuum to "discover" consistency errors).

    Materialised views now and in the future, Thom Brown

    Thom presented one of the new features of Postgres 9.3, materialized views.

    In a nutshell, materialized views (MV) are read-only snapshots of queried data that's stored on disk, mostly for performance reasons. An interesting feature of materialized views is that they can have indexes, just like regular tables.

    The REFRESH MATERIALIZED VIEW command can be used to update an MV: it will simply run the original query again and store the new results.

    There are a number of caveats with the current implementation of MVs:

    • pg_dump never saves the data, only the query used to build it,
    • REFRESH requires an exclusive lock,
    • due to implementation details (frozen rows or pages IIRC), MVs may exhibit non-concurrent behavior with other running transactions.

    Looking towards 9.4 and beyond, here are some of the upcoming MV features:

    • 9.4 adds the CONCURRENTLY keyword:
      • + no longer needs an exclusive lock, doesn't block reads
      • - requires a unique index
      • - may require VACUUM
    • roadmap (no guarantees):
      • unlogged (disables the WAL),
      • incremental refresh,
      • lazy automatic refresh,
      • planner awareness of MVs (would use MVs as cache/index).

    Indexes: The neglected performance all-rounder, Markus Winand

    http://use-the-index-luke.com/img/alchemie.png

    Markus' goal with this talk was to show that very few people in the SQL world actually know - let alone really care - about indexes. According to his own experience and that of others (even with competing RDBMS), poorly written SQL is still a leading cause of production downtime (he puts the number at around 50% of downtime, though others he quoted put that number higher). SQL queries can indeed put such stress on DB systems that they cause them to fail.

    One major issue, he argues, is poorly designed indexes. He went back in time to explain possible reasons for the lack of knowledge about indexes with both SQL developers and DBAs. One such reason may be that indexes are not part of the SQL standard and are left as implementation-specific details. Thus many books about SQL barely cover indexes, if at all.

    He then took us through a simple quiz he wrote on the topic, with only 5 questions. The questions and explanations were very insightful and I must admit my knowledge of indexes was not up to par. I think everyone in the room got his message loud and clear: indexes are part of the schema, devs should care about them too.

    Try out the test : http://use-the-index-luke.com/3-minute-test

    PostgreSQL - Community meets Business, Michael Meskes

    For the last talk of the day, Michael went back to the history of the Postgres project and its community. Unlike other IT domains such as email, HTTP servers or even operating systems, RDBMS are still largely dominated by proprietary vendors such as Oracle, IBM and Microsoft. He argues that the reasons are not technical: from a developer's standpoint, Postgres has all the features of the leading RDBMS (and many more), and the few missing administrative features related to scalability are being addressed.

    Instead, he argues decision makers inside companies don't yet fully trust Postgres due to its (perceived) lack of corporate backers.

    He went on to suggest ways to overcome those perceptions, for example with an "official" Postgres certification program.

    A motivational talk for the Postgres community.

    http://fosdem2014.pgconf.eu/files/img/frontrotate/slonik.jpg

  • A Salt Configuration for C++ Development

    2014/01/24 by Damien Garaud
    http://www.logilab.org/file/204916/raw/SaltStack-Logo.png

    At Logilab, we've been using Salt for one year to manage our own infrastructure. I wanted to use it to manage a specific configuration: C++ development. When I instantiate a Virtual Machine with a Debian image, I don't want to spend time installing and configuring a system so that it fits my needs as a C++ developer.

    This article is a very simple recipe to get a C++ development environment, ready to use, ready to hack.

    Give Me an Editor and a DVCS

    Quite simple: I use the YAML file format used by Salt to describe what I want. To install my two editors, vim and emacs, I just need to write:

    vim-nox:
      pkg.installed
    
    emacs23-nox:
      pkg.installed
    

    For Mercurial, you'll guess:

    mercurial:
     pkg.installed
    

    You can write these lines in the same init.sls file, but you can also decide to split your configuration into different subdirectories: one place for each thing. I decided to create dev and edit directories at the root of my Salt config, each with an init.sls inside.

    That's all for the editors. Next step: specific C++ development packages.

    Install Several "C++" Packages

    In a cpp folder, I write a file init.sls with this content:

    gcc:
        pkg.installed
    
    g++:
        pkg.installed
    
    gdb:
        pkg.installed
    
    cmake:
        pkg.installed
    
    automake:
        pkg.installed
    
    libtool:
        pkg.installed
    
    pkg-config:
        pkg.installed
    
    colorgcc:
        pkg.installed
    

    The choice of these packages is arbitrary: add or remove some as you need; there is no single right solution. But I want more: I want some LLVM packages. In cpp/llvm.sls, I write:

    llvm:
     pkg.installed
    
    clang:
        pkg.installed
    
    libclang-dev:
        pkg.installed
    
    {% if not grains['oscodename'] == 'wheezy' %}
    lldb-3.3:
        pkg.installed
    {% endif %}
    

    These last lines specify that the lldb package is installed only if your Debian release is not the stable one (wheezy), i.e. jessie/testing or sid in my case. Now, just include this file in the init.sls one:

    # ...
    # at the end of 'cpp/init.sls'
    include:
      - .llvm
    

    Organize your sls files according to your needs. That's all for package installation. Your Salt configuration now looks like this:

    .
    |-- cpp
    |   |-- init.sls
    |   `-- llvm.sls
    |-- dev
    |   `-- init.sls
    |-- edit
    |   `-- init.sls
    `-- top.sls
    

    Launching Salt

    Start your VM and install a masterless Salt on it (e.g. apt-get install salt-minion). To launch Salt locally on your naked VM, you need to copy your configuration (through scp or a DVCS) into the /srv/salt/ directory and to write the file top.sls:

    base:
      '*':
        - dev
        - edit
        - cpp
    

    Then just launch:

    > salt-call --local state.highstate
    

    as root.

    And What About Configuration Files?

    You're right. At the beginning of the post, I talked about a "ready to use" Mercurial with some HG extensions. So I copy the default /etc/mercurial/hgrc.d/hgext.rc file into the dev directory of my Salt configuration, then edit it to enable some extensions such as color, rebase and pager. As I also need Evolve, I have to clone the source code from https://bitbucket.org/marmoute/mutable-history. With Salt, I can say "clone this repo and copy this file" to specific places.

    So, I add some lines to dev/init.sls.

    https://bitbucket.org/marmoute/mutable-history:
        hg.latest:
          - rev: tip
          - target: /opt/local/mutable-history
          - require:
             - pkg: mercurial
    
    /etc/mercurial/hgrc.d/hgext.rc:
        file.managed:
          - source: salt://dev/hgext.rc
          - user: root
          - group: root
          - mode: 644
    

    The require keyword means "install (if necessary) this target before cloning". The other lines are quite self-explanatory.

    In the end, you have just six files with a few lines. Your configuration now looks like:

    .
    |-- cpp
    |   |-- init.sls
    |   `-- llvm.sls
    |-- dev
    |   |-- hgext.rc
    |   `-- init.sls
    |-- edit
    |   `-- init.sls
    `-- top.sls
    

    You can customize it and share it with your teammates. A step further would be to add some configuration files for your favorite editor. You could also install extra packages that your library depends on: simply add a subdirectory amazing_lib and write your own init.sls. I know I often need the Boost libraries, for example. When your Salt configuration has changed, just run salt-call --local state.highstate again.

    As you can see, setting up your environment on a fresh system will take you only a couple of commands at the shell before you are ready to compile your C++ library, debug it, fix it and commit your modifications to your repository.


  • What's New in Pandas 0.13?

    2014/01/19 by Damien Garaud
    http://www.logilab.org/file/203841/raw/pandas_logo.png

    Do you know pandas, a Python library for data analysis? Version 0.13 came out on January 16th, and this post describes a few new features and improvements that I think are important.

    Each release has its list of bug fixes and API changes. You may read the full release note if you want all the details, but I will just focus on a few things.

    You may be interested in one of my previous blog posts, which showed a few useful Pandas features with datasets from the Quandl website and came with an IPython Notebook for reproducing the results.

    Let's talk about some new and improved Pandas features. I suppose that you have some knowledge of Pandas features and main objects such as Series and DataFrame. If not, I suggest you watch the tutorial video by Wes McKinney on the main page of the project or read 10 Minutes to Pandas in the documentation.

    Refactoring

    I welcome the refactoring effort: the Series type, which used to be subclassed from ndarray, now has the same base class as DataFrame and Panel, i.e. NDFrame. This work unifies methods and behaviors for these classes. Be aware that you can hit two potential incompatibilities with versions earlier than 0.13. See internal refactoring for more details.

    Timeseries

    to_timedelta()

    The function pd.to_timedelta converts a string, scalar or array of strings to a NumPy timedelta type (np.timedelta64, in nanoseconds). It requires NumPy >= 1.7. You can handle an array of timedeltas and divide it by another timedelta to carry out a frequency conversion.

    from datetime import timedelta
    import numpy as np
    import pandas as pd
    
    # Create a Series of timedelta from two DatetimeIndex.
    dr1 = pd.date_range('2013/06/23', periods=5)
    dr2 = pd.date_range('2013/07/17', periods=5)
    td = pd.Series(dr2) - pd.Series(dr1)
    
    # Set some Na{N,T} values.
    td[2] -= np.timedelta64(timedelta(minutes=10, seconds=7))
    td[3] = np.nan
    td[4] += np.timedelta64(timedelta(hours=14, minutes=33))
    td
    
    0   24 days, 00:00:00
    1   24 days, 00:00:00
    2   23 days, 23:49:53
    3                 NaT
    4   24 days, 14:33:00
    dtype: timedelta64[ns]
    

    Note the NaT type (instead of the well-known NaN). For day conversion:

    td / np.timedelta64(1, 'D')
    
    0    24.000000
    1    24.000000
    2    23.992975
    3          NaN
    4    24.606250
    dtype: float64
    

    You can also use a DateOffset as:

    td + pd.offsets.Minute(10) - pd.offsets.Second(7) + pd.offsets.Milli(102)
    

    Nanosecond Time

    There is now support for nanosecond times as an offset; see pd.offsets.Nano. You can use the N alias of this offset as the value of the freq argument of pd.date_range.
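
    For instance (a minimal sketch, assuming pandas 0.13 and NumPy >= 1.7):

    import pandas as pd

    # Five timestamps spaced 100 nanoseconds apart, using the N alias of the
    # new pd.offsets.Nano offset.
    rng = pd.date_range('2014-01-01', periods=5, freq='100N')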

    Daylight Savings

    The tz_localize method can now infer a fall daylight savings transition based on the structure of the unlocalized data. This method, like the tz_convert method, is available for any DatetimeIndex, Series or DataFrame with a DatetimeIndex. You can use it to localize your datasets thanks to the pytz module, or to convert your timeseries to a different time zone. See the related documentation about time zone handling. To use daylight savings inference in tz_localize, set the infer_dst argument to True.
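
    Here is a minimal sketch of the fall-transition inference (the dates are my own, not from the release notes; pytz must be installed):

    import pandas as pd

    # Hourly wall-clock times recorded across the US/Eastern fall-back
    # transition of November 3rd, 2013: the 01:00 hour appears twice.
    naive = pd.DatetimeIndex(['2013-11-03 00:00', '2013-11-03 01:00',
                              '2013-11-03 01:00', '2013-11-03 02:00'])

    # infer_dst=True lets tz_localize resolve the ambiguous hour from the
    # ordering of the data instead of raising an error.
    localized = naive.tz_localize('US/Eastern', infer_dst=True)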

    DataFrame Features

    New Method isin()

    The new DataFrame method isin is used for boolean indexing. The argument to this method can be another DataFrame, a Series, or a dictionary mapping columns to lists of values. Comparing two DataFrames with isin is equivalent to doing df1 == df2. But you can also check if values from a list occur in any column, or check if some values occur in a few specific columns of the DataFrame (i.e. using a dict instead of a list as argument):

    df = pd.DataFrame({'A': [3, 4, 2, 5],
                       'Q': ['f', 'e', 'd', 'c'],
                       'X': [1.2, 3.4, -5.4, 3.0]})
    
       A  Q    X
    0  3  f  1.2
    1  4  e  3.4
    2  2  d -5.4
    3  5  c  3.0
    

    and then:

    df.isin(['f', 1.2, 3.0, 5, 2, 'd'])
    
           A      Q      X
    0   True   True   True
    1  False  False  False
    2   True   True  False
    3   True  False   True
    

    Of course, you can use the previous result as a mask for the current DataFrame.

    mask = _
    df[mask.any(1)]
    
          A  Q    X
       0  3  f  1.2
       2  2  d -5.4
       3  5  c  3.0
    
    When you pass a dictionary to the isin method, you can specify the column labels for each value.
    
    mask = df.isin({'A': [2, 3, 5], 'Q': ['d', 'c', 'e'], 'X': [1.2, -5.4]})
    df[mask]
    
        A    Q    X
    0   3  NaN  1.2
    1 NaN    e  NaN
    2   2    d -5.4
    3   5    c  NaN
    

    See the related documentation for more details or different examples.

    New Method str.extract

    The new vectorized extract method from the StringMethods object is available through the str suffix on Series or DataFrame. Thus, it is possible to extract some data with regular expressions, as follows:

    s = pd.Series(['doe@umail.com', 'nobody@post.org', 'wrong.mail', 'pandas@pydata.org', ''])
    # Extract usernames.
    s.str.extract(r'(\w+)@\w+\.\w+')
    

    returns:

    0       doe
    1    nobody
    2       NaN
    3    pandas
    4       NaN
    dtype: object
    

    Note that the result is a Series holding the extracted group (NaN when there is no match). You can also add more groups:

    # Extract usernames and domain.
    s.str.extract(r'(\w+)@(\w+\.\w+)')
    
            0           1
    0     doe   umail.com
    1  nobody    post.org
    2     NaN         NaN
    3  pandas  pydata.org
    4     NaN         NaN
    

    Elements that do not match return NaN. You can also use named groups, which is useful if you want more explicit column names (the NaN rows are dropped in the following example):

    # Extract usernames and domain with named groups.
    s.str.extract(r'(?P<user>\w+)@(?P<at>\w+\.\w+)').dropna()
    
         user          at
    0     doe   umail.com
    1  nobody    post.org
    3  pandas  pydata.org
    

    Thanks to this part of the documentation, I also found out about other useful string methods such as split, strip, replace, etc., handy when you handle a Series of str for instance. Note that most of them have been available since 0.8.1. Take a look at the string handling API doc (recently added) and some basics about vectorized string methods.

    Interpolation Methods

    DataFrame has a new interpolate method, similar to the one on Series. It was possible to interpolate missing data in a DataFrame before, but it did not take the dates into account if you had a timeseries index. Now it is possible to pass a specific interpolation method via the method argument. You can use scipy interpolation functions such as slinear, quadratic, polynomial, and others. The time method is used to take your timeseries index into account.

    from datetime import date
    # Arbitrary timeseries
    ts = pd.DatetimeIndex([date(2006,5,2), date(2006,12,23), date(2007,4,13),
                           date(2007,6,14), date(2008,8,31)])
    df = pd.DataFrame(np.random.randn(5, 2), index=ts, columns=['X', 'Z'])
    # Fill the DataFrame with missing values.
    df['X'].iloc[[1, -1]] = np.nan
    df['Z'].iloc[3] = np.nan
    df
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23       NaN -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980       NaN
    2008-08-31       NaN  0.566205
    

    Without any optional argument, you have:

    df.interpolate()
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23 -0.823514 -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980  0.554975
    2008-08-31  1.210980  0.566205
    

    With the time method, you obtain:

    df.interpolate(method='time')
    
                       X         Z
    2006-05-02  0.104836 -0.078031
    2006-12-23 -1.156217 -0.589680
    2007-04-13 -1.751863  0.543744
    2007-06-14  1.210980  0.546496
    2008-08-31  1.210980  0.566205
    

    I suggest you read more examples in the missing data part of the docs and in the scipy documentation about the interpolate module.

    Misc

    You can convert a Series to a single-column DataFrame with its new to_frame method, as in the sketch below.
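
    For example (a trivial sketch):

    import pandas as pd

    s = pd.Series([1, 2, 3], name='values')
    df = s.to_frame()   # a one-column DataFrame whose column is labelled 'values'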

    Misc & Experimental Features

    Retrieve R Datasets

    Not a killer feature but very pleasant: the possibility to load into a DataFrame any of the R datasets listed at http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

    import pandas.rpy.common as com
    titanic = com.load_data('Titanic')
    titanic.head()
    
      Survived    Age     Sex Class value
    0       No  Child    Male   1st   0.0
    1       No  Child    Male   2nd   0.0
    2       No  Child    Male   3rd  35.0
    3       No  Child    Male  Crew   0.0
    4       No  Child  Female   1st   0.0
    

    for the dataset about survival of passengers on the Titanic. You can find several different datasets about New York air quality measurements, body temperature series of two beavers, plant growth results or violent crime rates by US state, for instance. Very useful if you would like to show pandas to a friend, a colleague or your Grandma and you do not have a dataset with you.

    And then three great experimental features.

    Eval and Query Experimental Features

    The eval and query methods use numexpr, which can quickly evaluate array expressions such as x - 0.5 * y. For numexpr, x and y are NumPy arrays. You can use this powerful feature in pandas to evaluate expressions over different DataFrame columns. By the way, we already talked about numexpr a few years ago in EuroScipy 09: Need for Speed.

    df = pd.DataFrame(np.random.randn(10, 3), columns=['x', 'y', 'z'])
    df.head()
    
              x         y         z
    0 -0.617131  0.460250 -0.202790
    1 -1.943937  0.682401 -0.335515
    2  1.139353  0.461892  1.055904
    3 -1.441968  0.477755  0.076249
    4 -0.375609 -1.338211 -0.852466
    
    df.eval('x + 0.5 * y - z').head()
    
    0   -0.184217
    1   -1.267222
    2    0.314395
    3   -1.279340
    4   -0.192248
    dtype: float64
    

    About the query method, you can select elements using a very simple query syntax.

    df.query('x >= y > z')
    
              x         y         z
    9  2.560888 -0.827737 -1.326839
    

    msgpack Serialization

    There are new reading and writing functions to serialize your data with the great and well-known msgpack library. Note that this experimental feature does not have a stable storage format yet. You could imagine using zmq to transfer msgpack-serialized pandas objects over TCP, IPC or SSH, for instance.
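
    A minimal sketch of the round trip (the file name is arbitrary):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
    df.to_msgpack('frame.msg')             # experimental: the storage format may change
    restored = pd.read_msgpack('frame.msg')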

    Google BigQuery

    The recent module pandas.io.gbq provides a way to load datasets into and extract them from the Google BigQuery Web service. I have not installed the requirements for this feature yet. The example in the release notes shows how you can select the average monthly temperature in the year 2000 across the USA. You can also read the related pandas documentation. Nevertheless, you will need a BigQuery account, as with other Google products.

    Take Your Keyboard

    Give it a try, play with some data, mangle and plot it, compute some stats, retrieve some patterns or whatever. I'm convinced that pandas will be used more and more, and not only by data scientists or quantitative analysts. Open an IPython Notebook, pick up some data and let yourself be tempted by pandas.

    I think I will make more use of the vectorized string methods that I found out about when writing this post. I'm glad to have learned more about timeseries, because I know that I'll use these features. I'm also looking forward to using the experimental features such as eval/query and msgpack serialization.

    You can follow me on Twitter (@jazzydag). See also Logilab (@logilab_org).


  • Pylint 1.1 christmas release

    2013/12/24 by Sylvain Thenault

    Pylint 1.1 eventually got released on pypi!

    A lot of work has been achieved since the latest 1.0 release. Various people have contributed several new checks as well as various bug fixes and other enhancements.

    Here is the changes summary, check the changelog for more info.

    New checks:

    • 'deprecated-pragma', for use of the deprecated pragma directives "pylint:disable-msg" or "pylint:enable-msg" (previously emitted as a regular warning).
    • 'superfluous-parens' for unnecessary parentheses after certain keywords.
    • 'bad-context-manager' checking that '__exit__' special method accepts the right number of arguments.
    • 'raising-non-exception' / 'catching-non-exception' when raising/catching a class not inheriting from BaseException.
    • 'non-iterator-returned' for non-iterators returned by '__iter__'.
    • 'unpacking-non-sequence' for unpacking non-sequences in assignments and 'unbalanced-tuple-unpacking' when the left-hand-side size doesn't match the right-hand-side one (a short illustration follows this list).
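
    As a hedged illustration (my own snippet, not from the changelog), here is the kind of code a couple of these checks should flag:

    # The function is never called, so the module itself runs fine while
    # pylint analyses it statically.
    def suspicious():
        point = 5
        if (point):          # superfluous-parens
            x, y = point     # unpacking-non-sequence
            print(x, y)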

    Command line:

    • New option for the multi-statement warning to allow single-line if statements.
    • Allow running pylint as a Python module: 'python -m pylint' (anatoly techtonik).
    • Various fixes to epylint

    Bug fixes:

    • Avoid a false 'used-before-assignment' for an identifier defined in an except handler and used on the same line (#111).
    • 'useless-else-on-loop' is not emitted if there is a break in the else clause of an inner loop (#117).
    • Drop 'badly-implemented-container' which caused several problems in its current implementation.
    • Don't mark input as a bad function when using python3 (#110).
    • Use attribute regexp for properties in python3, as in python2
    • Fix false-positive 'trailing-whitespace' on Windows (#55)

    Other:

    • Replaced regexp based format checker by a more powerful (and nit-picky) parser, combining 'no-space-after-operator', 'no-space-after-comma' and 'no-space-before-operator' into a new warning 'bad-whitespace'.
    • Create the PYLINTHOME directory when needed; it might fail and lead to spurious warnings on import of pylint.config.
    • Fix setup.py so that pylint installs properly on Windows when using Python 3.
    • Various documentation fixes and enhancements

    Packages will be available in Logilab's Debian and Ubuntu repository in the next few weeks.

    Happy christmas!


  • SaltStack Paris Meetup on Feb 6th, 2014 - (S01E02)

    2013/12/20 by Nicolas Chauvat

    Logilab has set up the second meetup for salt users in Paris on Feb 6th, 2014 at IRILL, near Place d'Italie, starting at 18:00. The address is 23 avenue d'Italie, 75013 Paris.

    Here is the announce in french http://www.logilab.fr/blogentry/1981

    Please forward it to whom may be interested, underlining that pizzas will be offered to refuel the chatters ;)

    Conveniently placed a week after the Salt Conference, topics will include anything related to salt and its uses, demos, new ideas, exchange of salt formulas, commenting the talks/videos of the saltconf, etc.

    If you are interested in Salt, Python and Devops and will be in Paris at that time, we hope to see you there !


  • A quick take on continuous integration services for Bitbucket

    2013/12/19 by Sylvain Thenault

    Some time ago, we moved Pylint from this forge to Bitbucket (more on this here).

    https://bitbucket-assetroot.s3.amazonaws.com/c/photos/2012/Oct/11/master-logo-2562750429-5_avatar.png

    Since then, I have somewhat continued to use the continuous integration (CI) service we provide on logilab.org to run tests on new commits, and to do the release job (publish a tarball on PyPI and on our web site, build Debian and Ubuntu packages, etc.). This is fine, but not really handy, since logilab.org's CI service is not designed to be used for projects hosted elsewhere. Also, I wanted to see what others have to offer, so I decided to find a public CI service to host Pylint and Astroid automatic tests at least.

    Here are the results of my first swing at it. If you have others suggestions, some configuration proposal or whatever, please comment.

    First, here are the ones I didn't test along with why:

    The first one I actually tested, also the first one to show up when looking for "bitbucket continuous integration" on Google, is https://drone.io. The UI is really simple and I was able to set up tests for Pylint in a matter of minutes: https://drone.io/bitbucket.org/logilab/pylint. Tests are automatically launched when a new commit is pushed to Pylint's Bitbucket repository, and that setup was done automatically.

    Trying to push Drone.io further, one missing feature is the ability to have different settings for my project, e.g. to launch tests on all the Python flavors officially supported by Pylint (2.5, 2.6, 2.7, 3.2, 3.3, pypy, jython, etc.). Last but not least, the missing killer feature I want is the ability to launch tests on top of pull requests, which travis-ci supports.

    Then I gave http://wercker.com a shot, but got stuck at the Bitbucket repository selection screen: none were displayed. Maybe because I don't own Pylint's repository and am only part of the admin/dev team? Anyway, wercker seems appealing too, though its yaml-based configuration looks a bit more complicated than drone.io's. As I was not able to test it further, there's not much else to say.

    https://www.logilab.org/file/4758432/raw/wercker.png

    So for now the winner is https://drone.io, but the first one allowing me to test on several Python versions and to launch tests on pull requests will be the definitive winner! Bonus points for automating the release process and checking test coverage on pull requests as well.

    https://drone.io/drone3000/images/alien-zap-header.png

  • A retrospective of 10 years animating the pylint free software project

    2013/11/25 by Sylvain Thenault

    was the topic of the talk I gave last Saturday at Capitole du Libre in Toulouse.

    Here are the slides (PDF) for those interested (in French). A video of the talk should be available soon on the Capitole du Libre web site. The slides are mirrored on Slideshare (see below):


  • Retrieve Quandl's Data and Play with a Pandas

    2013/10/31 by Damien Garaud

    This post deals with the Pandas Python library, the open and free access of timeseries datasets thanks to the Quandl website and how you can handle datasets with pandas efficiently.

    http://www.logilab.org/file/186707/raw/scrabble_data.jpg http://www.logilab.org/file/186708/raw/pandas_peluche.jpg

    Why this post?

    I have wanted to play a little with pandas for a long time: not the adorable black and white teddy bear, but the well-known Python data library based on NumPy. I would like to show how you can easily retrieve some numerical datasets from the Quandl website and its API, and handle these datasets efficiently with pandas through its main object: the DataFrame.

    Note that this blog post comes with an IPython Notebook which can be found at http://nbviewer.ipython.org/url/www.logilab.org/file/187482/raw/quandl-data-with-pandas.ipynb

    You can also get it at http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with hg.

    Just do:

    hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/
    

    and get the IPython Notebook, the HTML conversion of this Notebook and some related CSV files.

    First Step: Get the Code

    At work or at home, I use Debian. A quick and dumb apt-get install python-pandas is enough. Nevertheless, (1) I'm keen on having fresh, bleeding-edge upstream sources to get the latest features and (2) I'm trying to contribute a little to the project (tiny bugs, writing some docs), so I prefer to install it from source. Thus, I pull, I run sudo python setup.py develop, and a few seconds of Cython compilation later, I can do:

    import pandas as pd
    

    For the other ways to get the library, see the download page on the official website or see the dedicated Pypi page.

    Let's build 10 Brownian motions and plot them with matplotlib.

    import numpy as np
    pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot()
    

    I don't much like the default font and colors of the matplotlib figures and curves. I know that pandas defines an "mpl style". Just after the import, you can write:

    pd.options.display.mpl_style = 'default'
    
    http://www.logilab.org/file/186714/raw/Ten-Brownian-Motions.png

    Second Step: Have You Got Some Data Please ?

    Maybe I'm wrong, but I think it's sometimes quite difficult to retrieve workable numerical datasets in the huge amount of data available on the Web. Free Data, Open Data and so on. OK folks, where are they? I don't want to spend my time going through an Open Data website, finding some interesting issues, parsing an Excel file, getting some specific data and mangling it to get a 2D array of floats with labels. Note that pandas fits these kinds of problems very well; see the IO part of the pandas documentation (CSV, Excel, JSON, HDF5 reading/writing functions). I just want workable numerical data without effort.

    A few days ago, a colleague of mine told me about Quandl, a website dedicated to finding and using numerical timeseries datasets on the Internet. A perfect source to retrieve some data and play with pandas. Note that you can access data about economics, health, population, education, etc. thanks to a clever API: get datasets in CSV/XML/JSON formats between this date and that date, aggregate them, compute the difference, etc.

    Moreover, you can access Quandl's datasets through many programming languages, like R, Julia, Clojure or Python (there are also plugins or modules for software such as Excel, Stata, etc.). Quandl's Python package depends on NumPy and pandas. Perfect! I can use the Quandl.py module available on GitHub and query some datasets directly into a DataFrame.

    Here we are: huge amounts of data are teasing me. Next question: which data to play with?

    Third Step: Give some Food to Pandas

    I've already imported the pandas library. Let's query some datasets thanks to the Quandl Python module, with an example inspired by the README of Quandl's GitHub project.

    import Quandl
    data = Quandl.get('GOOG/NYSE_IBM')
    data.tail()
    

    and you get:

                  Open    High     Low   Close    Volume
    Date
    2013-10-11  185.25  186.23  184.12  186.16   3232828
    2013-10-14  185.41  186.99  184.42  186.97   2663207
    2013-10-15  185.74  185.94  184.22  184.66   3367275
    2013-10-16  185.42  186.73  184.99  186.73   6717979
    2013-10-17  173.84  177.00  172.57  174.83  22368939
    

    OK, I'm not very familiar with this kind of data, so let's take a look at the Quandl website. After a dozen minutes on it, I found this OECD murder rates dataset. The page shows current and historical murder rates (assault deaths per 100 000 people) for 33 OECD countries. Take a country and type:

    uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR')
    

    It's a DataFrame with a single column, 'Value'. The index of the DataFrame is a timeseries. You can easily plot this data with:

    uk_df.plot()
    
    http://www.logilab.org/file/186711/raw/GBR-oecd-murder-rates.png

    See the other pieces of code and usage examples in the dedicated IPython Notebook. I also got data about unemployment in the OECD for roughly the same countries, with more dates. Then, as I would like to compare these data, I must select similar countries, resample my data to the same frequency and so on. Take a look. Any comment is welcome.

    So, the remaining content of this blog post is just a summary of a few interesting and useful pandas features used in the IPython notebook.

    • Using time series as the Index of my DataFrames
    • pd.concat to concatenate several DataFrames along a given axis. This function can deal with missing values when the Indexes of the DataFrames are not identical (which is my case)
    • DataFrame.to_csv and pd.read_csv to dump/load your data to/from CSV files. read_csv takes various arguments to deal with dates, missing values, headers & footers, etc.
    • The DateOffset pandas object to deal with different time frequencies. Quite useful if you handle data with calendar or business days, month end or begin, quarter end or begin, etc.
    • Resampling data with the resample method. I use it for frequency conversion of time-indexed data.
    • Merging/joining DataFrames, quite similar to the SQL feature. See the pd.merge function or the DataFrame.join method. I used this feature to align my two DataFrames along their Index.
    • Some Matplotlib plotting functions such as DataFrame.plot() and plot(kind='bar').
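    Put together, here is a minimal, self-contained sketch of how these pieces combine (made-up numbers instead of the Quandl data, a hypothetical file name, and the resample keyword syntax of the pandas version of that time):

    import pandas as pd

    # a yearly series and a quarterly one, with different time indexes
    yearly = pd.Series([2.2, 2.1, 1.9],
                       index=pd.date_range('1995-12-31', periods=3, freq='A'))
    quarterly = pd.Series([8.6, 8.1, 7.4, 7.0],
                          index=pd.date_range('1995-03-31', periods=4, freq='Q'))

    # frequency conversion: bring the quarterly data down to a yearly frequency
    quarterly_as_yearly = quarterly.resample('A', how='mean')  # newer pandas: .resample('A').mean()

    # concatenation aligns the indexes; missing values become NaN
    df = pd.concat({'rate': yearly, 'unemployment': quarterly_as_yearly}, axis=1)

    # dump/reload through CSV (read_csv has options for dates, missing values, ...)
    df.to_csv('oecd_sketch.csv')
    df = pd.read_csv('oecd_sketch.csv', index_col=0, parse_dates=True)

    df.plot(kind='bar')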

    Conclusion

    I showed a few useful pandas features in the IPython Notebooks: concatenation, plotting, data computation, data alignment. I think I could show more, but that will be for a further blog post. Any comments, suggestions or questions are welcome.

    The next 0.13 pandas release should be coming soon. I'll write a short blog post about it in a few days.

    The pictures come from:


  • SaltStack Paris Meetup - some of what was said

    2013/10/09 by Arthur Lutz

    Last week, on the first day of OpenWorldForum 2013, we met up with Thomas Hatch of SaltStack to have a talk about salt. He was in Paris to give two talks the following day (1 & 2), and it was a great opportunity to meet him and physically meet part of the French Salt community. Since Logilab hosted the Great Salt Sprint in Paris, we offered to co-organise the meetup at OpenWorldForum.

    http://saltstack.com/images/SaltStack-Logo.png http://openworldforum.org/static/pictures/Calque1.png

    Introduction

    About 15 people gathered in Montrouge (near Paris) and we all took turns presenting ourselves and how or why we use salt. Some people wanted to migrate from BCFG2 to salt. Some told the story of working for a month with CFEngine and then achieving the same functionality in two days with salt, so they decided to go for it instead. Some like salt because they can hack its Python code. Some use salt to provision pre-defined AMI images for the clouds (salt-ami-cloud-builder). Some chose salt over Ansible. Some want to use salt to pilot temporary computation clusters in the cloud (a bit like what StarCluster does with boto and ssh).

    When Paul from Logilab introduced salt-ami-cloud-builder, Thomas Hatch said that some work is being done to go all the way: build an image from scratch from a state definition. On the question of Debian packaging, some effort could be put into getting salt into wheezy-backports. Julien Cristau from Logilab, who is a Debian developer, might help with that.

    Some untold stories were shared: some companies have replaced puppet with salt, some use salt to control an HPC cluster, some use salt to pilot their existing puppet system.

    We had some discussions around salt-cloud, which will probably be merged into salt at some point. One idea for salt-cloud was raised: have a way of defining a "minimum" type of configuration which translates into the profiles according to which provider is used (an issue should be added shortly). The expression "pushing states" was often used; it is probably a good way of looking at the combination of salt-cloud and the masterless mode available with salt-ssh. salt-cloud controls an existing cloud, but Thomas Hatch pointed out that with salt-virt, salt is becoming a cloud controller itself, more on that soon.

    Mixing pillar definition between 'public' and 'private' definitions can be tricky. Some solutions exist with multiple gitfs (or mercurial) external pillar definitions, but more use cases will drive more flexible functionalities in the future.

    Presentation and live demo

    For those in the audience who were not (yet) users of salt, Thomas went back to explaining a few basics about it. Salt should be seen as a "toolkit to solve problems in an infrastructure", says Thomas Hatch. Why is it fast? It is completely asynchronous and event driven.

    He gave a quick presentation about the new salt-ssh which was introduced in 0.17, which allows the application of salt recipes to machines that don't have a minion connected to the master.

    The peer communication can be used to add a condition for a state on the presence of a service on a different minion.

    While doing demos or even hacking on salt, one can use salt/test/minionswarm.py, which spawns fake minions -- not everyone has hundreds of servers at their fingertips.

    Modules are loaded dynamically and smartly: for example, the git module gets loaded if a state installs git and then, in the same highstate, uses the git module.

    Thomas explained the difference between grains and pillars: grains are data about a minion that live on the minion, pillar is data about the minion that lives on the master. When handling grains, grains.setval can be useful (it writes to /etc/salt/grains as YAML, so you can also edit it separately). If a minion is not reachable, one can obtain its grains information by replacing test=True with cache=True.
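    As a side note for Python users, grains and pillar data can also be queried programmatically from the master through salt's LocalClient API. A minimal sketch (the minion id 'web01' is hypothetical, and the exact module function names are best checked against the documentation of your salt version):

    import salt.client

    client = salt.client.LocalClient()
    # grains live on the minion but can be fetched remotely
    print client.cmd('*', 'grains.item', ['os'])
    # pillar data is compiled on the master for each minion
    print client.cmd('*', 'pillar.items')
    # grains.setval writes to /etc/salt/grains on the targeted minion, as noted above
    print client.cmd('web01', 'grains.setval', ['role', 'webserver'])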

    Thomas briefly presented saltstack-formulas: people want to "program" their states, and formulas answer this need, even though some of the jinja2 required to make them flexible and programmable is overly complicated.

    While talking about the unified package commands (a salt command often has various backends according to which system runs the minion), for example salt-call --local pkg.install vim, Thomas told this funny story: ironically, salt was nominated for "best package manager" in some Linux magazine competition (so you don't have to learn how to use the FreeBSD packaging tools).

    While hacking salt, one can take a look at the Event Bus (see test/eventlisten.py); many applications are possible using the data on this bus. Thomas talked about a future IOflow Python module where complex logic could be implemented in the reactor with rules and a state machine. One example use would be: if the load is high on X servers and the number of connections on these servers reaches Y, then launch extra machines.

    To finish on a buzzword, someone asked "what is the overlap between salt and docker?". The answer is not simple, but Thomas thinks that in the long run there will be a lot of overlap; one can check out the existing lxc modules and states.

    Wrap up

    To wrap up, Thomas announced a salt conference planned for January 2014 in Salt Lake City.

    Logilab proposes to bootstrap the French community around salt. As the group suggested, this could take the form of a mailing list, an IRC channel, a meetup group, some sprints, or a combination of all of the above. On that note, the next international sprint will probably take place in January 2014 around the salt conference.


  • Setup your project with cloudenvy and OpenStack

    2013/10/03 by Arthur Lutz

    One nice way of having a reproducible development or test environment is to "program" a virtual machine to do the job. If you have a powerful machine at hand you might use Vagrant in combination with VirtualBox. But if you have an OpenStack setup at hand (which is our case), you might want to set up and destroy your virtual machines on such a private cloud (or a public cloud if you want or can). Sure, Vagrant has some plugins that should add OpenStack as a provider, but, here at Logilab, we have a clear preference for Python over Ruby. So this is where cloudenvy comes into play.

    http://www.openstack.org/themes/openstack/images/open-stack-cloud-computing-logo-2.png

    Cloudenvy is written in python and with some simple YAML configuration can help you setup and provision some virtual machines that contain your tests or your development environment.

    http://www.python.org/images/python-logo.gif

    Setup your authentication in ~/.cloudenvy.yml :

    cloudenvy:
      clouds:
        cloud01:
          os_username: username
          os_password: password
          os_tenant_name: tenant_name
          os_auth_url: http://keystone.example.com:5000/v2.0/
    

    Then create an Envyfile.yml at the root of your project

    project_config:
      name: foo
      image: debian-wheezy-x64
    
      # Optional
      #remote_user: ec2-user
      #flavor_name: m1.small
      #auto_provision: False
      #provision_scripts:
        #- provision_script.sh
      #files:
        # files copied from your host to the VM
        #local_file : destination
    

    Now simply type envy up. Cloudenvy does the rest. It "simply" creates your machine, copies the files, runs your provision script and gives you its IP address. You can then run envy ssh if you don't want to be bothered with IP addresses and such nonsense (forget about copy-pasting from the OpenStack web interface, or your nova show commands).

    Little added bonus: you know your machine will run a web server on port 8080 at some point, so set it up in your environment by defining your access rules in the same Envyfile.yml:

    sec_groups: [
        'tcp, 22, 22, 0.0.0.0/0',
        'tcp, 80, 80, 0.0.0.0/0',
        'tcp, 8080, 8080, 0.0.0.0/0',
      ]
    

    As you might know (or I'll just recommend it), you should be able to scratch and restart your environment without losing anything, so once in a while you'll just run envy destroy to do so. If you want multiple VMs with the same specs, go for envy up -n second-machine.

    Only downside right now: cloudenvy isn't packaged for Debian (which is usually a prerequisite for the tools we use), but let's hope it gets some packaging soon (or maybe we'll end up doing it).

    Don't forget to include this configuration in your project's version control so that a colleague starting on the project can just type envy up and have a working setup.

    In the same vein, we've been trying out salt-cloud <https://github.com/saltstack/salt-cloud>, because provisioning machines with SaltStack is the way forward. A blog post about this is coming next.


  • DebConf13 report

    2013/09/25 by Julien Cristau

    As announced before, I spent a week last month in Vaumarcus, Switzerland, attending the 14th Debian conference (DebConf13).

    It was great to be at DebConf again, with lots of people I hadn't seen since New York City three years ago, and lots of new faces. Kudos to the organizers for pulling this off. These events are always a great boost for motivation, even if the amount of free time after coming back home is not quite as copious as I might like.

    One thing that struck me this year was the number of upstream people, not directly involved in Debian, who showed up. From systemd's Lennart and Kay, to MariaDB's Monty, and people from upstart, dracut, phpmyadmin or munin. That was a rather pleasant surprise for me.

    Here's a report on the talks and BoF sessions I attended. It's a bit long, but hey, the conference lasted a week. In addition to those I had quite a few chats with various people, including fellow members of the Debian release team.

    http://debconf13.debconf.org/images/logo.png

    Day 1 (Aug 11)

    Linux kernel : Ben Hutchings made a summary of the features added between 3.2 in wheezy and the current 3.10, and their status in Debian (some still need userspace work).

    SPI status : Bdale Garbee and Jimmy Kaplowitz explained what steps SPI is making to deal with its growth, including getting help from a bookkeeper recently to relieve the pressure on the (volunteer) treasurer.

    Hardware support in Debian stable : If you buy new hardware today, it's almost certainly not supported by the Debian stable release. Ideas to improve this :

    • backport whole subsystems: probably not feasible, risk of regressions would be too high
    • ship compat-drivers, and have the installer automatically install newer drivers based on PCI ids, seems possible.
    • mesa: have the GL loader pick a different driver based on the hardware, and ship newer DRI drivers for the new hardware, without touching the old ones. Issue: need to update libGL and libglapi too when adding new drivers.
    • X drivers, drm: ? (it's complicated)

    Meeting between release team and DPL to figure out next steps for jessie. Decided to schedule a BoF later in the week.

    Day 2 (Aug 12)

    Munin project lead on new features in 2.0 (shipped in wheezy) and roadmap for 2.2. Improvements on the scalability front (both in terms of number of nodes and number of plugins on a node). Future work includes improving the UI to make it less 1990 and moving some metadata to sql.

    jeb on AWS and Debian : Amazon Web Services (AWS) includes compute (ec2), storage (s3), network (virtual private cloud, load balancing, ..) and other services. Used by Debian for package rebuilds. http://cloudfront.debian.net is a CDN frontend for archive mirrors. Official Debian images are on ec2, including on the AWS marketplace front page. build-debian-cloud tool from Anders Ingeman et al. was presented.

    openstack in Debian : Packaging work is focused on making things easy for newcomers, basic config with debconf. Advanced users are going to use puppet or similar anyway. Essex is in wheezy, but end-of-life upstream. Grizzly available in sid and in a separate archive for wheezy. This work is sponsored by enovance.

    Patents : http://patents.stackexchange.com, looks like the USPTO has used comments made there when rejecting patent applications based on prior art. Patent applications are public, and it's a lot easier to get a patent application rejected than invalidate a patent later on. Should we use that site? Help build momentum around it? Would other patent offices use that kind of research? Issues: looking at patent applications (and publicly commenting) might mean you're liable for treble damages if the patent is eventually granted? Can you comment anonymously?

    Why systemd? : Lennart and Kay. Pop corn, upstart trolling, nothing really new.

    Day 3 (Aug 13)

    dracut : dracut presented by Harald Hoyer, its main developer. Seems worth investigating replacing initramfs-tools and sharing the maintenance load. Different hooks though, so we'll need to coordinate this with various packages.

    upstart : More Debian-focused than the systemd talk. Not helped by Canonical's CLA...

    dh_busfactor : debhelper is essentially a one-man show from the beginning. Though various packages/people maintain different dh_* tools either in the debhelper package itself or elsewhere. Joey is thinking about creating a debhelper team including those people. Concerns over increased breakage while people get up to speed (joeyh has 10 years of experience and still occasionally breaks stuff).

    dri3000 : Keith is trying to fix dri2 issues. While dri2 fixed a number of things that were wrong with dri1, it still has some problems. One of the goals is to improve presentation: we need a way to sync between app and compositor (to avoid displaying incompletely drawn frames), avoid tearing, and let the app choose immediate page flip instead of waiting for next vblank if it missed its target (stutter in games is painful). He described this work on his blog.

    security team BoF : explain the workflow, try to improve documentation of the process and what people can do to help. http://security.debian.org/

    Day 4 (Aug 14)

    day trip, and conference dinner on a boat from Neuchatel to Vaumarcus

    Day 5 (Aug 15)

    git-dpm : Spent half an hour explaining git, then was rushed to show git-dpm itself. Still, needs looking at. Lets you work with git and export changes as quilt series to build a source package.

    Ubuntu daily QA : The goal was to make it possible for canonical devs (not necessarily people working on the distro) to use ubuntu+1 (dev release). They tried syncing from testing for a while, but noticed bug fixes being delayed: not good. In the previous workflow the dev release was unusable/uninstallable for the first few months. Multiarch made things even more problematic because it requires amd64/i386 being in sync.

    • 12.04: a bunch of manpower thrown at ubuntu+1 to keep backlog of technical debt under control.
    • 12.10: prepare infrastructure (mostly launchpad), add APIs, to make non-canonical people able to do stuff that previously required shell access on central machines.
    • 13.04: proposed migration. britney is used to migrate packages from devel-proposed to devel. A few teething problems at first, but good reaction.
    • 13.10 and beyond: autopkgtest runs triggered after upload/build, also for rdeps. Phased updates for stable releases (rolled out to a subset of users and then gradually generalized). Hook into errors.ubuntu.com to match new crashes with package uploads. Generally more continuous integration. Better dashboard. (Some of that is still to be done.)

    Lessons learned from debian:

    • unstable's backlog can get bad → proposed is only used for builds and automated tests, no delay
    • transitions can take weeks at best
    • to avoid dividing human attention, devs are focused on devel, not devel-proposed

    Lessons debian could learn:

    • keeping testing current is a collective duty/win
    • splitting users between testing and unstable has important costs
    • hooking automated testing into britney is really powerful; there's a small but growing number of automated tests

    Ideas:

    • cut migration delay in half
    • encourage writing autopkgtests
    • end goal: make sid to testing migration entirely based on automated tests

    Debian tests using Jenkins http://jenkins.debian.net

    • https://github.com/h01ger/jenkins-job-builder
    • Only running amd64 right now.
    • Uses jenkins plugins: git, svn, log parser, html publisher, ...
    • Has existing jobs for installer, chroot installs, others
    • Tries to make it easy to reproduce jobs, to allow debugging
    • {c,sh}ould add autopkgtests

    Day 6 (Aug 16)

    X Strike Force BoF : Too many bugs we can't do anything about: {mass,auto}-close them, asking people to report upstream. Reduce distraction by moving the non-X stuff to separate teams (compiz removed instead, wayland to discuss...). We should keep drivers as close to upstream as possible. A couple of people in the room volunteered to handle the intel, ati and input drivers.

    reclass BoF

    I had missed the talk about reclass, and Martin kindly offered to give a followup BoF to show what reclass can do.

    Reclass provides adaptors for puppet(?), salt, ansible. A yaml file describes each host:

    • can declare applications and parameters
    • host is leaf in a dag/tree of classes

    Lets you put the data in reclass instead of the config management tool, keeping generic templates in ansible/salt.

    I'm definitely going to try this and see if it makes it easier to organize data we're currently putting directly in salt states.

    release BoF : Notes are on http://gobby.debian.org. Basic summary: "Releasing in general is hard. Releasing something as big/diverse/distributed as Debian is even harder." Who knew?

    freedombox : status update from Bdale

    Keith Packard showed off the free software he uses in his and Bdale's rockets adventures.

    This was followed by a birthday party in the evening, as Debian turned 20 years old.

    Day 7 (Aug 17)

    x2go : Notes are on http://gobby.debian.org. To be solved: issues with nx libs (gpl fork of old x). Seems like a good thing to try as alternative to LTSP which we use at Logilab.

    lightning talks

    • coquelicot (lunar) - one-click secure(ish) file upload web app
    • notmuch (bremner) - need to try that again now that I have slightly more disk space
    • fedmsg (laarmen) - GSoC, message passing inside the debian infrastructure

    Debconf15 bids :

    • Mechelen/Belgium - Wouter
    • Germany (no city yet) - Marga

    Debconf14 presentation : Will be in Portland (Portland State University) next August. Presentation by vorlon, harmoney, keithp. Looking forward to it!

    • Closing ceremony

    The videos of most of the talks can be downloaded, thanks to the awesome work of the video team. And if you want to check what I didn't see or talk about, check the complete schedule.


  • JDEV2013 - Software development conference of CNRS

    2013/09/13 by Nicolas Chauvat

    I had the pleasure to be invited to lead a tutorial at JDEV2013 titled Learning TDD and Python in Dojo mode.

    http://www.logilab.org/file/177427/raw/logo_JDEV2013.png

    I quickly introduced the keywords with a single slide to keep it simple:

    http://Python.org
    + Test Driven Development (Test, Code, Refactor)
    + Dojo (house of training: Kata / Randori)
    = Calculators
      - Reverse Polish Notation
      - Formulas with Roman Numbers
      - Formulas with Numbers in letters
    

    As you can see, I had three types of calculators, hence at least three Kata to practice, but as usual with beginners, it took us the whole tutorial to get done with the first one.

    The room was a classroom that we set up as our coding dojo, with the coder and his copilot working on a laptop, facing the rest of the participants, with the large screen at their back. The pair programmers could freely discuss with the people facing them, who were following the typing on the large screen.

    We switched every ten minutes: the copilot became the coder, the coder went back to his seat in the class and someone else stood up to become the copilot.

    The session was allocated 3 hours split over two slots of 1h30. It took me less than 10 minutes to open the session with the above slide, 10 minutes as first coder and 10 minutes to close it. Over a time span of 3 hours, that left 150 minutes for coding, hence 15 people. Luckily, the whole group was about that size and almost everyone got a chance to type.

    I completely skipped explaining Python, its syntax and the unittest framework, and we jumped right into writing our first tests with if and print statements. Since they knew other programming languages, they picked up the Python language on the way.

    After more than an hour of slowly discovering Python and TDD, someone in the room realized they had been focusing more on handling exception cases and failures than on implementing the parsing and computation of the formulas, because the specifications were not clearly understood. He then asked me the right question by trying to define Reverse Polish Notation in one sentence and checking that he got it right.

    Different algorithms to parse and compute RPN formulas were devised at the blackboard during the pause, while part of the group went for a coffee break.

    The implementation took about another hour to get right, with me making sure they would not wander too far from the actual goal. Once the stack-based solution was found and implemented, I asked them to delete the files, switch coder and start again. They had forgotten about the Kata definition and were surprised, but quickly enjoyed it when they realized that progress was much faster on the second attempt.

    Since it is always better to show that you can walk the talk, I closed the session by practicing the RPN calculator kata myself in a bit less than 10 minutes. The order in which to write the tests is the tricky part, because it can easily appear far-fetched for such a small problem when you already know an algorithm that solves it.

    Here it is:

    import operator
    
    OPERATORS = {'+': operator.add,
                 '*': operator.mul,
                 '/': operator.div,
                 '-': operator.sub,
                 }
    
    def compute(args):
        items = args.split()
        stack = []
        for item in items:
            if item in OPERATORS:
                b,a = stack.pop(), stack.pop()
                stack.append(OPERATORS[item](a,b))
            else:
                stack.append(int(item))
        return stack[0]
    

    with the accompanying tests:

    import unittest
    from npi import compute
    
    class TestTC(unittest.TestCase):
    
        def test_unit(self):
            self.assertEqual(compute('1'), 1)
    
        def test_dual(self):
            self.assertEqual(compute('1 2 +'), 3)
    
        def test_tri(self):
            self.assertEqual(compute('1 2 3 + +'), 6)
            self.assertEqual(compute('1 2 + 3 +'), 6)
    
        def test_precedence(self):
            self.assertEqual(compute('1 2 + 3 *'), 9)
            self.assertEqual(compute('1 2 * 3 +'), 5)
    
        def test_zerodiv(self):
            self.assertRaises(ZeroDivisionError, compute, '10 0 /')
    
    unittest.main()
    

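    As a quick sanity check of the calculator above, here is a hypothetical interactive session (assuming the implementation is saved as npi.py, matching the import in the tests):

    >>> from npi import compute
    >>> compute('1 2 + 3 *')
    9
    >>> compute('10 2 /')
    5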
    Apparently, it did not go too bad, for I had positive comments at the end from people that enjoyed discovering in a single session Python, Test Driven Development and the Dojo mode of learning.

    I had fun doing this tutorial and thank the organizers of this conference!


  • Going to EuroScipy2013

    2013/09/04 by Alain Leufroy

    The EuroScipy2013 conference was held in Brussels, at the Université libre de Bruxelles.

    http://www.logilab.org/file/175984/raw/logo-807286783.png

    As usual, the first two days were dedicated to tutorials while the last two were dedicated to scientific presentations and general Python-related talks. The meeting was extended by one more day of sprint sessions, during which enthusiasts were able to help free software projects, namely sage, vispy and scipy.

    Jérôme and I had the great opportunity to represent Logilab during the scientific tracks and the sprint day. We enjoyed many talks about scientific applications using python. We're not going to describe the whole conference. Visit the conference website if you want the complete list of talks. In this article we will try to focus on the ones we found the most interesting.

    First of all, the keynote by Cameron Neylon about Network ready research was very interesting. He presented some graphs about the impact of group work on solving complex problems. They revealed that there is a critical network size at which the effectiveness of solving a problem drastically increases. He pointed out that the "friction" of source code accessibility limits the "getting help" variable. Open sourcing software could be the best way to reduce this "friction", while unit testing and continuous integration are facilitators. And, in general, process reproducibility is very important, not only in computing research. Retrieving experimental settings, metadata and the process environment is vital. We agree with this, as we experience it every day in our work. That is why we encourage open source licenses and develop Simulagora, a collaborative platform for distributed simulation traceability and reproducibility (in French).

    Ian Ozsvald's talk dealt with key points and tips from his own experience of growing a business based on open source and Python, as well as mistakes to avoid (e.g. not checking beforehand that there are paying customers interested in what you want to develop). His talk was comprehensive and covered a wide range of situations.

    http://vispy.org/_static/img/logo.png

    We got a very nice presentation of a young but interesting visualization tool: Vispy. It is 6 months old and its first public release was in early August. It is the result of the merge of 4 separate libraries, oriented toward interactive visualisation (vs. static figure generation with Matplotlib) and using OpenGL on GPUs to avoid CPU overload. A demonstration with large datasets showed vispy displaying millions of points in real time at 40 frames per second. During the talk we got interesting information about OpenGL features like anti-grain rendering, compared to Matplotlib's Agg backend which uses the CPU.

    We also got to learn about cartopy, an open source Python library originally written for weather and climate science. It provides a useful and simple API for cartographic mapping.

    Distributed computing systems were a hot topic and many talks were related to this theme.

    https://www.openstack.org/themes/openstack/images/openstack-logo-preview-full-color.png

    Gael Varoquaux reminded us what the key problems with "biggish data" are and the key points to process it successfully. I think that some of his recommendations are generally useful, like "choose simple solutions", "fail gracefully" and "make it easy to debug". When I/O is the limiting constraint in big data processing, first try to split the problem into random fractions of the data, then run the algorithms and aggregate the results to circumvent this limit. He also presented mini-batches, which take a bunch of observations (a trade-off between memory usage and vectorization), and joblib.parallel, which makes I/O faster using compression (CPUs are faster than disk access).

    Benoit Da Mota talked about shared memory in parallel computing, and Antonio Messina gave us a quick overview of how to build a computing cluster with Elasticluster, using OpenStack/Slurm/ansible. He demonstrated starting and stopping a cluster on OpenStack: once all VMs are started, ansible configures them as hosts of the cluster, and new VMs can be created and added to the cluster on the fly thanks to a command line interface.

    We also got a keynote by Peter Wang (from Continuum Analytics) about the future of data analysis with Python. As a PhD in physics I loved his metaphor of giving mass to data. He tried to explain the pain that scientists have when using databases.

    https://scikits.appspot.com/static/images/scipyshiny_small.png

    After the conference we participated in the numpy/scipy sprint, organized by Ralph Gommers and Pauli Virtanen. There were 18 people trying to close issues of various difficulty levels, and we got a quick tutorial on how easy it is to contribute: the easiest way is to fork the project from its github page to your own github account (you can create one for free), so that later your patch submission will be a simple "Pull Request" (PR). Clone your scipy fork locally and make a new branch (git checkout -b <newbranch>) to tackle one specific issue. Once your patch is ready, commit it locally, push it to your github repository and, from the github interface, choose "Pull Request". You will be able to add something to your commit message before your PR is sent and looked at by the project's lead developers. For example, using "gh-XXXX" in your commit message will automatically add a link to issue no. XXXX. Here is the list of open issues for scipy; you can filter them, e.g. displaying only the ones considered easy to fix :D

    For more information: Contributing to SciPy.


  • Emacs turned into a IDE with CEDET

    2013/08/29 by Anthony Truchet

    Abstract

    In this post you will find one way, namely thanks to CEDET, of turning your Emacs into an IDE offering semantic browsing and refactoring assistance similar to what you can find in major IDEs like Visual Studio or Eclipse.

    Introduction

    Emacs is a tool of choice for the developer: it is very powerful, highly configurable and has a wealth of so called modes to improve many aspects of daily work, especially when editing code.

    The point, as you might have realised if you have already worked with an IDE like Eclipse or Visual Studio, is that Emacs' (code) browsing abilities are quite rudimentary... at least out of the box!

    In this post I will walk through one way to configure Emacs + CEDET which works for me. This is far from the only way to get there, but finding this path required several days of wandering between inconsistent resources, distribution pitfalls and the like.

    I will try to convey the relevant parts of what I have learnt on the way, to warn about some pitfalls, and also to indicate some interesting directions I haven't followed (be it by choice or necessity) and encourage you to try them. Should you push this adventure further, your experience will be very much appreciated... and in any case your feedback on this post is also very welcome.

    The first part gives some background deemed useful to understand what's going on. If you want to go straight to the how-to, please jump directly to the second part.

    Sketch map of the jungle

    This all started because I needed a development environment to do work remotely on a big, legacy C++ code base from quite a lightweight machine and a weak network connection.

    My former habit of using Eclipse CDT and compiling locally was not an option any longer, but I couldn't stick to a bare text editor plus remote compilation either, because of the complexity of the code base. So I googled emacs IDE code browser and started this journey to set up CEDET + ECB...

    I quickly got lost in a jungle of seemingly inconsistent options and I reckon that some background facts are welcome at this point as to why.

    As of this date - Sept. 2013 - most of the world is in-between two major releases of Emacs. Whereas Emacs 23.x is still packaged in many stable Linux distributions, the latest release is Emacs 24.3. In this post we will use Emacs 24.x, which brings lots of improvements, two of which are really relevant to us:

    • the introduction of a package manager, which is great and (but) changes initialisation
    • the partial integration of some version of CEDET into Emacs since version 23.2

    Emacs 24 initialisation

    Very basically, Emacs used to read the user's Emacs config (~/.emacs or ~/.emacs.d/init.el) which was responsible for adapting the load-path and issuing the right (require 'stuff) commands and configuring each library in some appropriate sequence.

    Emacs 24 introduces ELPA, a new package system and official package repository. It can be extended with other package repositories such as Marmalade or MELPA.

    By default in Emacs 24, the initialisation order is a bit more complex due to packages loading: the user's config is still read but should NOT require the libraries installed through the package system: those are automatically loaded (the former load-path adjustment and (require 'stuff) steps) after the ~/.emacs or ~/.emacs.d/init.el has finished. This makes configuring the loaded libraries much more error-prone, especially for libraries designed to be configured the old way (as of today most libraries, notably CEDET).

    Here is a good analysis of the situation and possible options. And for those interested in the details of the new initialisation process, see following sections of the manual:

    I first tried to stick to the new way, setting up hooks in ~/.emacs.d/init.el to be called after loading the various libraries, each library having its own configuration hook, and praying for the interaction between the package manager's load order and my hooks to be OK... in vain. So I ended up forcing the initialisation back to the old way (see Emacs 24 below).

    What is CEDET ?

    CEDET is a Collection of Emacs Development Environment Tools. The major word here is collection, do not expect it to be an integrated environment. The main components of (or coupled with) CEDET are:

    Semantic
    Extract a common semantic from source code in different languages
    (e)ctags / GNU global
    Traditional (exuberant) CTags or GNU global can be used as a source of information for Semantic
    SemanticDB
    SemanticDB provides for caching the outcome of semantic analysis in some database to reduce analysis overhead across several editing sessions
    Emacs Code Browser
    This component uses information provided by Semantic to offer a browsing GUI with windows for traversing files, classes, dependencies and the like
    EDE
    This provides a notion of project analogous to most IDEs. Even if the features related to building projects are very Emacs/Linux/Autotools-centric (and thus not necessarily very helpful depending on your project setup), the main point of EDE is to provide scoping of the source code for Semantic to analyse, and include-path customisation at the project level.
    AutoComplete
    This is not part of CEDET but Semantic can be configured as a source of completions for auto-complete to propose to the user.
    and more...
    Senator, SRecode, Cogre, Speedbar, EIEIO, EAssist are other components of CEDET I've not looked at yet.

    To add some more complexity, CEDET itself is also undergoing heavy changes and is in-between major versions. The last standalone release is 1.1, but it has the old source layout and activation method. The current head of development says it is version 2.0, has the new layout and activation method, plus some more features, but is not released yet.

    Integration of CEDET into Emacs

    Since Emacs 23.2, CEDET is built into Emacs. More exactly, parts of some version of the new CEDET are built into Emacs, but of course this built-in version is older than the current head of the new CEDET... As for the notable parts not built into Emacs, ECB is the most prominent! But it is packaged in Marmalade in a recent version that follows the head of development closely, which mitigates the inconvenience.

    My first choice was using the built-in CEDET with ECB installed from the package repository: the installation was perfectly smooth, but I was not able to configure the whole thing cleanly enough to get it working properly. Although I tried hard, I could not get Semantic to take into account the include paths I had configured using my EDE project, for example.

    I would strongly encourage you to try this way, as it is supposed to require much less effort to set up and less maintenance. Should you succeed, I would greatly appreciate some feedback on your experience!

    As for me, I got down to installing the latest version from the source repositories, following Alex Ott's advice as closely as possible and using his own fork of ECB to make it compliant with the most recent CEDET:

    How to set up CEDET + ECB in Emacs 24

    Emacs 24

    Install Emacs 24 as you wish; I will not cover the various options here but simply summarise the local install from sources that I chose.

    1. Get the source archive from http://ftpmirror.gnu.org/emacs/
    2. Extract it somewhere and run the usual commands (or see the INSTALL file): configure --prefix=~/local, make, make install

    Create your Emacs personal directory and configuration file, ~/.emacs.d/site-lisp/ and ~/.emacs.d/init.el, and put the following inside the latter:

    ;; this is intended for manually installed libraries
    (add-to-list 'load-path "~/.emacs.d/site-lisp/")
    
    ;; load the package system and add some repositories
    (require 'package)
    (add-to-list 'package-archives
                 '("marmalade" . "http://marmalade-repo.org/packages/"))
    (add-to-list 'package-archives
                 '("melpa" . "http://melpa.milkbox.net/packages/") t)
    
    ;; Install a hook running post-init.el *after* initialization took place
    (add-hook 'after-init-hook (lambda () (load "post-init.el")))
    
    ;; Do here basic initialization, (require) non-ELPA packages, etc.
    
    ;; disable automatic loading of packages after init.el is done
    (setq package-enable-at-startup nil)
    ;; and force it to happen now
    (package-initialize)
    ;; NOW you can (require) your ELPA packages and configure them as normal
    

    Useful Emacs packages

    Using the Emacs commands M-x package-list-packages (interactively) or M-x package-install <package name>, you can easily install many packages. For example I installed:

    Choose your own! I just recommend against installing ECB or other CEDET since we are going to install those from source.

    You can also insert or load your usual Emacs configuration here, simply beware of configuring ELPA, Marmalade et al. packages after (package-initialize).

    CEDET

    • Get the source and put it under ~/.emacs.d/site-lisp/cedet-bzr. You can either download a snapshot from http://www.randomsample.de/cedet-snapshots/ or check it out of the bazaar repository with:

      ~/.emacs.d/site-lisp$ bzr checkout --lightweight \
      bzr://cedet.bzr.sourceforge.net/bzrroot/cedet/code/trunk cedet-bzr
      
    • Run make (and optionally make install-info) in cedet-bzr, or see the INSTALL file for more details.

    • Get Alex Ott's minimal CEDET configuration file to ~/.emacs.d/config/cedet.el for example

    • Adapt it to your system by editing the first lines as follows

      (setq cedet-root-path
          (file-name-as-directory (expand-file-name
              "~/.emacs.d/site-lisp/cedet-bzr/")))
      (add-to-list 'Info-directory-list
              "~/projects/cedet-bzr/doc/info")
      
    • Don't forget to load it from your ~/.emacs.d/init.el:

      ;; this is intended for configuration snippets
      (add-to-list 'load-path "~/.emacs.d/")
      ...
      (load "config/cedet.el")
      
    • restart your emacs to check everything is OK; the --debug-init option is of great help for that purpose.

    ECB

    • Get Alex Ott's ECB fork into ~/.emacs.d/site-lisp/ecb-alexott:

      ~/.emacs.d/site-lisp$ git clone --depth 1  https://github.com/alexott/ecb/
      
    • Run make in ecb-alexott and see the README file for more details.

    • Don't forget to load it from your ~/.emacs.d/init.el:

      (add-to-list 'load-path (expand-file-name
            "~/.emacs.d/site-lisp/ecb-alexott/"))
      (require 'ecb)
      ;(require 'ecb-autoloads)
      

      Note

      You can theoretically use (require 'ecb-autoloads) instead of (require 'ecb) in order to load ECB by need. I encountered various misbehaviours trying this option and finally dropped it, but I encourage you to try it and comment on your experience.

    • restart your emacs to check everything is OK (you probably want to use the --debug-init option).

    • Create a hello.cpp with your CEDET-enabled Emacs and run M-x ecb-activate to check that ECB is actually installed.

    Tune your configuration

    Now, it is time to tune your configuration. There is no good recipe from here onward... but I'll try to propose some snippets below. Some of them are adapted from Alex Ott's personal configuration.

    More Semantic options

    You can use the following lines just before (semantic-mode 1) to add to the activated features list:

    (add-to-list 'semantic-default-submodes 'global-semantic-decoration-mode)
    (add-to-list 'semantic-default-submodes 'global-semantic-idle-local-symbol-highlight-mode)
    (add-to-list 'semantic-default-submodes 'global-semantic-idle-scheduler-mode)
    (add-to-list 'semantic-default-submodes 'global-semantic-idle-completions-mode)
    

    You can also load additional capabilities with those lines after (semantic-mode 1):

    (require 'semantic/ia)
    (require 'semantic/bovine/gcc) ; or, depending on your compiler:
    ; (require 'semantic/bovine/clang)
    
    Auto-completion

    If you want to use auto-complete, you can tell it to interface with Semantic by configuring it as follows (where AAAAMMDD.rrrr is the date.revision suffix of the version of auto-complete installed by your package manager):

    ;; Autocomplete
    (require 'auto-complete-config)
    (add-to-list 'ac-dictionary-directories (expand-file-name
                 "~/.emacs.d/elpa/auto-complete-AAAAMMDD.rrrr/dict"))
    (setq ac-comphist-file (expand-file-name
                 "~/.emacs.d/ac-comphist.dat"))
    (ac-config-default)
    

    and activating it in your cedet hook, for example:

    ...
    ;; customisation of modes
    (defun alexott/cedet-hook ()
    ...
        (add-to-list 'ac-sources 'ac-source-semantic)
    ) ; defun alexott/cedet-hook ()
    
    Support for GNU global and/or (e)ctags

    ;; if you want to enable support for gnu global
    (when (cedet-gnu-global-version-check t)
      (semanticdb-enable-gnu-global-databases 'c-mode)
      (semanticdb-enable-gnu-global-databases 'c++-mode))
    
    ;; enable ctags for some languages:
    ;;  Unix Shell, Perl, Pascal, Tcl, Fortran, Asm
    (when (cedet-ectag-version-check)
      (semantic-load-enable-primary-exuberent-ctags-support))
    

    Using CEDET for development

    Once CEDET + ECB + EDE is up you can start using it for actual development. How to actually use it is beyond the scope of this already too long post. I can only invite you to have a look at:

    Conclusion

    CEDET provides an impressive set of features, both to allow your Emacs environment to "understand" your code and to provide powerful interfaces to this "understanding". It is probably one of the very few solutions for working with a complex C++ code base in case you can't or don't want to use a heavyweight IDE like Eclipse CDT.

    But being highly configurable also means, at least for now, some lack of integration, or at least a pretty complex configuration. I hope this post will help you take your first steps with CEDET and find your way to set it up and configure it to your own taste.


  • Pylint 1.0 released!

    2013/08/06 by Sylvain Thenault

    Hi there,

    I'm very pleased to announce, after 10 years of existence, the 1.0 release of Pylint.

    This release has a hell of a long ChangeLog, thanks to many contributions and to the 10th anniversary sprint we hosted during June. More details about the changes below.

    Chances are high that your Pylint score will go down with this new release, which includes a lot of new checks :) Also, there are a lot of improvements on the Python 3 side (notably 3.3 support, which was somewhat broken).

    You may download and install it from Pypi or from Logilab's Debian repositories. Notice that Pylint has been updated to use the new Astroid library (formerly known as logilab-astng) and that the logilab-common 0.60 library includes some fixes necessary for using Pylint with Python 3, as well as long-awaited support for namespace packages.

    For those interested, below is a comprehensive list of what changed:

    Command line and output formatting

    • A new --msg-template option to control output, deprecating "msvc" and "parseable" output formats as well as killing --include-ids and --symbols options.
    • Fix spelling of max-branchs option, now max-branches.
    • Start promoting the usage of symbolic names instead of numerical ids.

    New checks

    • "missing-final-newline" (C0304) for files missing the final newline.
    • "invalid-encoded-data" (W0512) for files that contain data that cannot be decoded with the specified or default encoding.
    • "bad-open-mode" (W1501) for calls to open (or file) that specify invalid open modes (Original implementation by Sasha Issayev).
    • "old-style-class" (C1001) for classes that do not have any base class.
    • "trailing-whitespace" (C0303) that warns about trailing whitespace.
    • "unpacking-in-except" (W0712) about unpacking exceptions in handlers, which is unsupported in Python 3.
    • "old-raise-syntax" (W0121) for the deprecated syntax raise Exception, args.
    • "unbalanced-tuple-unpacking" (W0632) for unbalanced unpacking in assignments (bitbucket #37).
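    To give an idea of what some of these new checks catch, here is a small, purely illustrative Python 2 snippet (hypothetical names) annotated with the messages pylint 1.0 would typically emit on it:

    class Legacy:                           # old-style-class (C1001): no base class
        """A class without any base class."""

    def read_it(path):
        """Trip a few of the new checks listed above."""
        stream = open(path, 'z')            # bad-open-mode (W1501): 'z' is not a valid mode
        try:
            return stream.read()
        except IOError, (errno, strerror):  # unpacking-in-except (W0712)
            raise RuntimeError, strerror    # old-raise-syntax (W0121)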

    Enhanced behaviours

    • Do not emit [fixme] for every line if the config value 'notes' is empty
    • Emit warnings about lines exceeding the column limit when those lines are inside multiline docstrings.
    • Name check enhancement:
      • simplified message,
      • don't double-check parameter names with the regex for parameters and inline variables,
      • don't check names of derived instance class members,
      • methods that are decorated as properties are now treated as attributes,
      • names in global statements are now checked against the regular expression for constants,
      • for toplevel name assignment, the class name regex will be used if pylint can detect that value on the right-hand side is a class (like collections.namedtuple()),
      • add new name type 'class_attribute' for attributes defined in class scope. By default, allow both const and variable names.
    • Add a configuration option for missing-docstring to optionally exempt short functions/methods/classes from the check.
    • Add the type of the offending node to missing-docstring and empty-docstring.
    • Do not warn about redefinitions of variables that match the dummy regex.
    • Do not treat all variables starting with "_" as dummy variables, only "_" itself.
    • Make the line-too-long warning configurable by adding a regex for lines for which the length limit should not be enforced.
    • Do not warn about a long line if a pylint disable option brings it above the length limit.
    • Do not flag names in nested with statements as undefined.
    • Remove string module from the default list of deprecated modules (bitbucket #3).
    • Fix incomplete-protocol false positive for read-only containers like tuple (bitbucket #25).

    Other changes

    • Support for pkgutil.extend_path and setuptools pkg_resources (logilab-common #8796).
    • New utility classes for per-checker unittests in testutils.py
    • Added a new base class and interface for checkers that work on the tokens rather than the syntax, and only tokenize the input file once.
    • epylint shouldn't hang anymore when there is a large output on pylint's stderr (bitbucket #15).
    • Put back documentation in source distribution (bitbucket #6).

    Astroid

    • New API to make it smarter by allowing transformation functions on any node, providing a register_transform function on the manager instead of the register_transformer to make it more flexible wrt node selection
    • Use this new transformation API to provide support for namedtuple (actually in pylint-brain, logilab-astng #8766)
    • Better description of hashlib
    • Properly recognize methods annotated with abc.abstract{property,method} as abstract.
    • Added the test_utils module for building ASTs and extracting deeply nested nodes for easier testing.

  • Astroid 1.0 released!

    2013/08/02 by Sylvain Thenault

    Astroid is the new name of the former logilab-astng library. It's an AST library, used as the basis of Pylint, which includes a Python 2.5 -> 3.3 compatible tree representation, static type inference and other features useful for advanced Python code analysis, such as an API to provide extra information when static inference can't overcome Python's dynamic nature (see the pylint-brain project for instance).

    It has been renamed and is now hosted on bitbucket to make clear that this is not a Logilab-only project but a community project that could benefit anyone manipulating Python code (static analysis tools, IDEs, code browsers, etc).

    Documentation is a bit rough but should quickly improve. Also a dedicated web-site is now online, visit www.astroid.org (or https://bitbucket.org/logilab/astroid for development).

    You may download and install it from Pypi or from Logilab's debian repositories.


  • Going to DebConf13

    2013/08/01 by Julien Cristau

    The 14th Debian developers conference (DebConf13) will take place between August 11th and August 18th in Vaumarcus, Switzerland.

    Logilab is a DebConf13 sponsor, and I'll attend the conference. There are quite a lot of cloud-related events on the schedule this year, plus the usual impromptu discussions and hallway track. Looking forward to meeting the usual suspects there!

    https://www.logilab.org/file/158611/raw/dc13-btn0-going-bg.png

  • We hosted the Salt Sprint in Paris

    2013/07/30 by Arthur Lutz

    Last Friday, we hosted the French event for the international Great Salt Sprint. Here is a report on what was done and discussed on this occasion.

    http://www.logilab.org/file/228931/raw/saltstack_logo.jpg

    We started off by discussing various points that were of interest to the participants :

    • automatically write documentation from salt sls files (for Sphinx)
    • salt-mine add security layer with restricted access (bug #5467 and #6437)
    • test compatibility of salt-cloud with openstack
    • module bridge bug correction : traceback on KeyError
    • setting up the network in debian (equivalent of rh_ip)
    • configure existing monitoring solution through salt (add machines, add checks, etc) on various backends with a common syntax

    We then split up into pairs to tackle issues in small groups, with some general discussions from time to time.

    6 people participated: 5 from Logilab, 1 from nbs-system. We were expecting more participants, but some couldn't make it at the last minute, or thought the sprint was taking place at some other time.

    Unfortunately we had a major electricity blackout all afternoon; some of us switched to battery and 3G tethering to carry on, but that couldn't last all afternoon. We ended up talking about design and use cases. ERDF (the French electricity distribution company) ended up bringing generator trucks to the neighborhood!

    Arthur & Benoit : monitoring adding machines or checks

    http://www.logilab.org/file/157971/raw/salt-centreon-shinken.png

    Some unfinished draft code for supervision backends was written and pushed on github. We explored how a common "interface" could be done in salt (using a combination of states and __virtual__). The official documentation was often very useful, and reading the code was always a good resource too (and the code is really readable).

    While we were fixing things because of the power blackout, Benoit submitted a bug fix.

    David & Alain : generate documentation from salt state & salt master

    The idea is to couple the SLS description and the current state of the salt master to generate documentation about one's infrastructure using Sphinx. This was transmitted to the mailing-list.

    http://www.logilab.org/file/157976/raw/salt-sphinx.png

    The design focused on which information should be extracted and displayed, and on how to configure access control to the salt-master; taking a further look at external_auth and salt-api will probably be the way forward.

    General discussions

    We had general discussions around the concepts of access control to a salt master and how to define this access. One of the things we believe to be missing (but haven't checked thoroughly) is the ability to separate the "read-only" operations from the "read-write" operations in states and modules; if this were done (through decorators?), we could easily tell salt-api to only give access to data collection. Complex access scenarios were discussed. Having a configuration or external_auth based on ssh public keys (similar to mercurial-server) would be nice, and would provide a "limited" shell to a mercurial server.

    Conclusion

    The power blackout didn't help us get things done, but nevertheless, some sharing was done around our use cases for SaltStack and the features that we'd want to get out of it (or from third party applications). We hope to convert all the discussions into bug reports or further discussion on the mailing lists and (obviously) into code and pull requests. Check out the scoreboard for an overview of how the other cities contributed.



  • The Great Salt Sprint Paris Location is Logilab

    2013/07/12 by Nicolas Chauvat
    http://farm1.static.flickr.com/183/419945378_4ead41a76d_m.jpg

    We're happy to be part of the second Great Salt Sprint that will be held at the end of July 2013. We will be hosting the french sprinters on friday 26th in our offices in the center of Paris.

    The focus of our Logilab team will probably be Test-Driven System Administration with Salt, but the more participants and topics, the merrier the event.

    Please register if you plan on joining us. We will be happy to meet with fellow hackers.

    photo by Sebastian Mary under creative commons licence.


  • PyLint 10th anniversary 1.0 sprint: day 3 - Sprint summary

    2013/06/20 by Sylvain Thenault

    Yesterday was the third and last day of the 10th anniversary Pylint sprint in Logilab's Toulouse office.

    Design

    To get started, we took advantage of this last day to have a few discussions about:

    • A "mode" feature gpylint has. It turns out that, behind perhaps a few implementation details, this is something we definitely want in pylint (modes are specific configurations defined in the pylintrc and easily recallable; they may even be specified per file).

    • How to avoid conflicts in the ChangeLog by using specific instructions in the commit message. We decided that a commit message should look like this:

      [my checker] do this and that. Closes #1234
      
      bla bla bla
      
      :release note: this will be a new item in the ChangeLog
      
      as well as anything until the end of the message
      

      now someone has to write the ChangeLog generation script so we may use this for post-1.0 releases (a hypothetical sketch follows this list)

    • The roadmap. More on this later in this post.
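
    A hypothetical sketch of such a generation script (the real one was still to be written; the revision range passed to hg log is an assumption), relying only on hg log --template:

    import subprocess

    SEP = "\n====COMMIT====\n"

    def release_notes(rev_range="1.0::"):   # assumed: changesets since the 1.0 tag
        """Yield the :release note: part of each commit message in the range."""
        out = subprocess.check_output(
            ["hg", "log", "-r", rev_range, "--template", "{desc}" + SEP])
        for desc in out.decode("utf-8").split(SEP):
            if ":release note:" in desc:
                # keep everything from the marker until the end of the message
                yield desc.split(":release note:", 1)[1].strip()

    if __name__ == "__main__":
        for note in release_notes():
            print("* %s" % note)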

    Code

    When we were not discussing, we were coding!

    • Anthony worked on having a template for the text reporter. His patch is available on Bitbucket but not yet integrated.
    • Julien and David pushed a bunch of patches on logilab-common, astroid and pylint for the Python 3.3 support. Not all tests are green on the pylint side, but much progress was done.
    • A couple of other things were fixed, like a better "invalid name" message and no longer complaining about the string module being deprecated, etc.
    • A lot of patches have been integrated, from gpylint and others (e.g. Python 3 related)

    All in all, an impressive amount of work was achieved during this sprint:

    • A lot of new checks or enhanced behaviour backported from gpylint (take a look at Pylint's ChangeLog for more details; the list is impressively long).
    • The transformation API of astroid now makes it possible to customize the tree structure as well as the inference process, hence to make pylint smarter than ever.
    • Better python 3 support.
    • A few bugs fixed and some enhancements added.
    • The templating stuff should land with the CLI cleanup (some output-formats will be removed as well as the --include-ids and --symbols options).
    • A lot of discussions, especially regarding the future community development of pylint/astroid on Bitbucket. The short summary being: more contributors and integrators are welcome! We should write a note somewhere describing how we use Bitbucket's pull requests and tracker.

    Plan

    Now here is the 1.0 roadmap, which is expected by the beginning of July:

    • Green tests under Python 3, including specification of Python version in message description (Julien).
    • Finish template for text reporters (Anthony).
    • Update web site (David).

    And for later releases:

    • Backport mode from gpylint (Torsten).
    • Write ChangeLog update script (Sylvain).

    So many thanks to everyone for this very successful sprint. I'm excited about this forthcoming 1.0 release!


  • PyLint 10th anniversary 1.0 sprint: day 2

    2013/06/18 by Sylvain Thenault

    Today was the second day of the 10th anniversary Pylint sprint in Logilab's Toulouse office.

    This morning, we started with a presentation by myself about how the inference engine works in astroid (formerly astng). Then we started thinking all together about how we should change its API to be able to plug more information into the inference process. The first use case we wanted to address was namedtuple, as explained in http://www.logilab.org/ticket/8796.

    We ended up addressing it by:

    • enhancing the existing transformation feature so one may register a transformation function on any node rather than on a module node only;
    • being able to specify, on a node instance, a custom inference function to use instead of the default (class) implementation.

    We would then be able to customize both the tree structure and the inference process, and so resolve the cases we were targeting.
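
    To give an idea of what this enables, here is a minimal sketch using names from a later astroid release (MANAGER.register_transform, astroid.ClassDef, astroid.extract_node); the code actually written during the sprint may have differed:

    import astroid
    from astroid import MANAGER

    def is_settings_class(node):
        # predicate: only rewrite classes named 'Settings' (purely illustrative)
        return node.name == 'Settings'

    def add_magic_attribute(node):
        # pretend some framework magic injects a DEBUG attribute at runtime,
        # and teach the inference engine about it so checkers stop complaining
        node.locals['DEBUG'] = [astroid.extract_node('True')]

    MANAGER.register_transform(astroid.ClassDef, add_magic_attribute, is_settings_class)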

    Once this was sufficiently sketched out, everyone got his own tasks to do. Here is a quick summary of what has been achieved today:

    • Anthony resumed the check_messages work and finished it for the simple cases, then he started on having a template for the text reporter,
    • Julien and David made a lot of progress on the Python 3.3 compatibility, though not enough to get a fully green test suite,
    • Torsten continued backporting stuff from gpylint, all of which had been integrated by the end of the day,
    • Sylvain implemented the new transformation API and got the namedtuple proof of concept working, and even wrote some documentation! Now this has to be tested against more real-world uses.

    So things are going really well, and see you tomorrow for even more improvements to pylint!


  • PyLint 10th anniversary 1.0 sprint: day 1

    2013/06/17 by Sylvain Thenault

    Today was the first day of the Pylint sprint we organized using Pylint's 10th anniversary as an excuse.

    So I (Sylvain) welcomed my fellow Logilab friends David, Anthony and Julien as well as Torsten from Google into Logilab's new Toulouse office.

    After a bit of presentation and talk about Pylint development, we decided to keep discussions for lunch and dinner and to set up priorities. We ended up with the following tasks (picked from the pad at http://piratepad.net/oAvsUoGCAC):

    • rename astng to move it outside the logilab package,
    • review of Torsten's gpylint (Google Pylint) patches, as much as possible (but not all of them, starting with a review of the numerous internal checks Google has, seeing one by one which ones should be backported upstream),
    • setuptools namespace package support (https://www.logilab.org/8796),
    • python 3.3 support,
    • enhance the astroid (formerly astng) API to allow more ad-hoc customization for a better grasp of the magic occurring in e.g. web frameworks (protocol buffers or SQLAlchemy may also be an application of this).

    Regarding the astng renaming, we decided to move on with astroid, as pointed out by the survey on StellarSurvey.com.

    In the afternoon, David and Julien tackled this, while Torsten was extracting patches from Google's code and sending them to Bitbucket as pull requests, Sylvain was embracing setuptools namespace packages and Anthony was discovering the code to spread the @check_messages decorator usage.

    By the end of the day:

    • David and Julien submitted patches to rename logilab.astng which were quickly integrated and now https://bitbucket.org/logilab/astroid should be used instead of https://bitbucket.org/logilab/astng
    • Torsten submitted 5 pull requests with code extracted from gpylint; we reviewed them together and then Torsten used evolve to properly insert those into the pylint history once the review comments were integrated
    • Sylvain submitted 2 patches on logilab-common to support both setuptools namespace packages and pkgutil.extend_path (but not bare __path__ manipulation)
    • Anthony discovered various checkers and started adding proper @check_messages on visit methods (see the sketch below)
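
    For the record, here is what using the decorator looks like in a checker, written with present-day pylint names (BaseChecker, visit_call); the 2013 method names differed slightly, so treat this as a sketch rather than as the code written that day:

    from pylint.checkers import BaseChecker
    from pylint.checkers.utils import check_messages

    class EvalChecker(BaseChecker):
        name = 'eval-usage'
        msgs = {'W9901': ('eval() used', 'custom-eval-used', 'eval is dangerous.')}

        # the decorator lets pylint skip this visit method entirely when the
        # listed messages are disabled, avoiding useless work on the AST
        @check_messages('custom-eval-used')
        def visit_call(self, node):
            if getattr(node.func, 'name', None) == 'eval':
                self.add_message('custom-eval-used', node=node)

    def register(linter):
        linter.register_checker(EvalChecker(linter))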

    After doing some review all together, we even had some time to take a look at Python 3.3 support while writing this summary.

    Hopefully, our work on the forthcoming days will be as efficient as on this first day!


  • About salt-ami-cloud-builder

    2013/06/07 by Paul Tonelli

    What

    At Logilab we are big fans of SaltStack; we use it quite extensively to centralize, configure and automate deployments.

    http://www.logilab.org/file/145398/raw/SaltStack-Logo.png

    We've talked on this blog about how to build a Debian AMI "by hand" and we wanted to automate this fully. Hence the salt way seemed to be the obvious way to go.

    So we wrote salt-ami-cloud-builder. It is mainly glue between existing pieces of software that we use and like. If you already have some definition of a type of host that you provision using salt-stack, salt-ami-cloud-builder should be able to generate the corresponding AMI.

    http://www.logilab.org/file/145397/raw/open-stack-cloud-computing-logo-2.png

    Why

    Building a Debian based OpenStack private cloud using salt made us realize that we needed a way to generate various flavours of AMIs for the following reasons:

    • Some of our openstack users need "preconfigured" AMIs (for example a Debian system with Postgres 9.1 and the appropriate Python bindings) without doing the modifications by hand or waiting for an automated script to do the job at AMI boot time.
    • Some cloud use cases require that you boot many (hundreds for instance) machines with the same configuration. While tools like salt automate the job, waiting while the same download and install takes place hundreds of times is a waste of resources. If the modifications have already been integrated into a specialized AMI, you save a lot of computing time. And especially on Amazon (or other pay-per-use cloud infrastructures), these resources are not free.
    • Sometimes one needs to repeat a computation on an instance with the very same packages and input files, possibly years after the first run. Freezing packages and files in one preconfigured AMI helps this a lot. When relying only on a salt configuration the installed packages may not be (exactly) the same from one run to the other.

    Relation to other projects

    While multiple tools like build-debian-cloud exist, their objective is to build a vanilla AMI from scratch. The salt-ami-cloud-builder starts from such vanilla AMIs to create variations. Other tools like salt-cloud focus instead on the boot phase of the deployment of (multiple) machines.

    Chef & Puppet do the same job as Salt; however, since Salt is already extensively deployed at Logilab, we continue to build on it.

    Get it now !

    Grab the code here: http://hg.logilab.org/master/salt-ami-cloud-builder

    The project page is http://www.logilab.org/project/salt-ami-cloud-builder

    The docs can be read here: http://docs.logilab.org/salt-ami-cloud-builder

    We hope you find it useful. Bug reports and contributions are welcome.

    The logilab-salt-ami-cloud-builder team :)


  • Pylint 10th anniversary from June 17 to 19 in Toulouse

    2013/04/18 by Sylvain Thenault

    After a quick survey, we're officially scheduling the Pylint 10th anniversary sprint from Monday, June 17 to Wednesday, June 19 in Logilab's Toulouse office.

    There is still some room available if more people want to come, drop me a note (sylvain dot thenault at logilab dot fr).


  • Pylint development moving to BitBucket

    2013/04/12 by Sylvain Thenault

    Hi everyone,

    After 10 years of hosting Pylint on our own forge at logilab.org, we've decided to publish version 1.0 and move Pylint and astng development to BitBucket. There have been repository mirrors there for some time, but we now intend to use all of BitBucket's features, notably Pull Requests, to handle various development tasks.

    There are several reasons behind this. First, using both BitBucket and our own forge is rather cumbersome, for integrators at least. This is mainly because BitBucket doesn't provide support for Mercurial's changeset evolution feature while our forge relies on it. Second, our forge has several usability drawbacks that make it hard to use for newcomers, and we lack the time to be responsive on this. Finally, we think that our quality-control process, as exposed by our forge, is a bit heavy for such community projects and may keep potential contributors away.

    All in all, we hope this will help to attract a wider contributor audience as well as more regular maintainers / integrators who are not Logilab employees. And so, bring the best possible Pylint to the Python community!

    Logilab.org web pages will be updated to mention this, but kept as there is still valuable information there (eg tickets). We may also keep automatic tests and package building services there.

    So, please use https://bitbucket.org/logilab/pylint as the main web site regarding pylint development. Bug reports, feature requests as well as contributions should be done there. The same move will be done for Pylint's underlying library, logilab-astng (https://bitbucket.org/logilab/astng). In this process we also wish to move it out of the 'logilab' python package. It may be a good time to give it another name; if you have any idea, don't hesitate to express yourself.

    Last but not least, remember that Pylint's home page may be edited using Mercurial, and that the new http://docs.pylint.org is generated from the content found in Pylint's source doc subdirectory.

    Pylint turning 10 and moving out of its parents is probably a good time to thank Logilab for paying me and some colleagues to create and maintain this project!

    https://bitbucket-assetroot.s3.amazonaws.com/c/photos/2013/Apr/05/pylint-logo-1661676867-0_avatar.png

  • PyLint 10th anniversary, 1.0 sprint

    2013/03/29 by Sylvain Thenault

    In a few weeks, pylint will be 10 years old (0.1 was released on May 19, 2003!). On this occasion, I would like to release a 1.0. Well, not exactly on that date, but not too long after would be great. Also, I think it would be a good time to have a few days of sprint to work a bit on this 1.0, but also to meet all together and talk about pylint's status and future, as more and more contributions come from outside Logilab (actually mostly from Google, which employs Torsten and Martin, the most active contributors recently).

    The first thing to do is to decide on a date and place. Having discussed this a bit with Torsten, it seems reasonable to target a sprint during June or July. Due to personal constraints, I would like to host this sprint in Logilab's Toulouse office.

    So, who would like to jump in and sprint to make pylint even better? I've created a doodle so everyone interested may state their preferences: http://doodle.com/4uhk26zryis5x7as

    Regarding the location, is everybody ok with Toulouse? Other ideas are Paris, or Florence around EuroPython, or... <add your proposition here>.

    We'll talk about the sprint topics later, but there are plenty of exciting ideas around there.

    Please, answer quickly so we can move on. And I hope to see you all there!


  • LMGC90 Sprint at Logilab in March 2013

    2013/03/28 by Vladimir Popescu

    LMGC90 Sprint at Logilab

    At the end of March 2013, Logilab hosted a sprint on the LMGC90 simulation code in Paris.

    LMGC90 is an open-source software developed at the LMGC ("Laboratoire de Mécanique et Génie Civil" -- "Mechanics and Civil Engineering Laboratory") of the CNRS, in Montpellier, France. LMGC90 is devoted to contact mechanics and is, thus, able to model large collections of deformable or undeformable physical objects of various shapes, with numerous interaction laws. LMGC90 also allows for multiphysics coupling.

    Sprint Participants

    https://www.logilab.org/file/143585/raw/logo_LMGC.jpg https://www.logilab.org/file/143749/raw/logo_SNCF.jpg https://www.logilab.org/file/143750/raw/logo_LaMSID.jpg https://www.logilab.org/file/143751/raw/logo_LOGILAB.jpg

    More than ten hackers joined in from:

    • the LMGC, which leads LMGC90 development and aims at constantly improving its architecture and usability;
    • the Innovation and Research Department of the SNCF (the French state-owned railway company), which uses LMGC90 to study railway mechanics, and more specifically, the ballast;
    • the LaMSID ("Laboratoire de Mécanique des Structures Industrielles Durables", "Laboratory for the Mechanics of Ageing Industrial Structures") laboratory of the EDF / CNRS / CEA, which has a strong expertise on Code_ASTER and LMGC90;
    • Logilab, as the developer, for the SNCF, of a CubicWeb-based platform dedicated to the simulation data and knowledge management.

    After a great introduction to LMGC90 by Frédéric Dubois and some preliminary discussions, teams were quickly constituted around the common areas of interest.

    Enhancing LMGC90's Python API to build core objects

    As of the sprint date, LMGC90 is mainly developed in Fortran, but also contains Python code for two purposes:

    • Exposing the Fortran functions and subroutines in the LMGC90 core to Python; this is achieved using Fortran 2003's ISO_C_BINDING module and Swig. These Python bindings are grouped in a module called ChiPy.
    • Making it easy to generate input data (so called "DATBOX" files) using Python. This is done through a module called Pre_LMGC.

    The main drawback of this approach is the double modelling of data that this architecture implies: once in the core and once in Pre_LMGC.

    It was decided to build a unique user-level Python layer on top of ChiPy, that would be able to build the computational problem description and write the DATBOX input files (currently achieved by using Pre_LMGC), as well as to drive the simulation and read the OUTBOX result files (currently by using direct ChiPy calls).

    This task has been met with success, since, in the short time span available (half a day, basically), the team managed to build some object types using ChiPy calls and save them into a DATBOX.

    Using the Python API to feed a computation data store

    This topic involved importing LMGC90 DATBOX data into the numerical platform developed by Logilab for the SNCF.

    This was achieved using ChiPy as a Python API to the Fortran core to get:

    • the bodies involved in the computation, along with their materials, behaviour laws (with their associated parameters), geometries (expressed in terms of zones);
    • the interactions between these bodies, along with their interaction laws (and associated parameters, e.g. friction coefficient) and body pair (each interaction is defined between two bodies);
    • the interaction groups, which contain interactions that have the same interaction law.

    There is still a lot of work to be done (notably regarding the charges applied to the bodies), but this is already a great achievement. This could only have occurred in a sprint, where every needed expertise is available:

    • the SNCF experts were there to clarify the import needs and check the overall direction;

    • Logilab implemented a data model based on CubicWeb, and imported the data using the ChiPy bindings developed on-demand by the LMGC core developer team, using the usual-for-them ISO_C_BINDING/ Swig Fortran wrapping dance.

      https://www.logilab.org/file/143753/raw/logo_CubicWeb.jpg
    • Logilab undertook the data import; to this end, it asked the LMGC how the relevant information from LMGC90 can be exposed to Python via the ChiPy API.

    Using HDF5 as a data storage backend for LMGC90

    The main point of this topic was to replace the in-house DATBOX/OUTBOX textual format used by LMGC90 to store input and output data, with an open, standard and efficient format.

    Several formats have been considered, like HDF5, MED and NetCDF4.

    MED has been ruled out for the moment, because it lacks support for storing body contact information. HDF5 was finally chosen because of the quality of its Python libraries, h5py and pytables, and the ease of use that tools like h5fs provide.

    https://www.logilab.org/file/143754/raw/logo_HDF.jpg

    Alain Leufroy from Logilab quickly presented h5py and h5fs usage, and the team started its work, measuring the performance impact of the storage pattern of LMGC90 data. This was quickly achieved, as the LMGC experts made it easy to setup tests of various sizes, and as the Logilab developers managed to understand the concepts and implement the required code in a fast and agile way.
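
    For readers unfamiliar with h5py, here is a tiny, self-contained sketch of the kind of storage pattern that was benchmarked (dataset names and shapes are made up for the illustration; this is not the sprint code):

    import numpy as np
    import h5py

    with h5py.File('lmgc90_run.h5', 'w') as f:
        bodies = f.create_group('bodies')
        bodies.create_dataset('positions', data=np.random.rand(1000, 3))
        bodies.create_dataset('velocities', data=np.zeros((1000, 3)))
        inter = f.create_group('interactions')
        # one row per contact: (body_a, body_b, friction coefficient)
        inter.create_dataset('contacts', data=np.array([[0, 1, 0.3], [2, 5, 0.3]]))

    with h5py.File('lmgc90_run.h5', 'r') as f:
        print(f['bodies/positions'].shape)   # (1000, 3)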

    Debian / Ubuntu Packaging of LMGC90

    This topic turned out to be more difficult than initially assessed, mainly because LMGC90 has dependencies on non-packaged external libraries, which thus had to be packaged first:

    • the Matlib linear algebra library, written in C,
    • the Lapack95 library, which is a Fortran95 interface to the Lapack library.

    Logilab kept working on this after the sprint and produced packages that are currently being tested by the LMGC team. Some changes are expected (for instance, Python modules should be prefixed with a proper namespace) before the packages can be submitted for inclusion into Debian. The expertise of Logilab regarding Debian packaging was of great help for this task. This will hopefully help to spread the use of LMGC90.

    https://www.logilab.org/file/143755/raw/logo_Debian.jpg

    Distributed Version Control System for LMGC90

    As you may know, Logilab is really fond of Mercurial as a DVCS. Our company invested a lot into the development of the great evolve extension, which makes Mercurial a very powerful tool to efficiently manage the team development of software in a clean fashion.

    This is why Logilab presented Mercurial's features and advantages over the current VCS used to manage LMGC90 sources, namely svn, to the other participants of the sprint. This was appreciated and will hopefully benefit LMGC90's ease of development and its spread among the Open Source community.

    https://www.logilab.org/file/143756/raw/logo_HG.jpg

    Conclusions

    All in all, this two-day sprint on LMGC90, involving participants from several industrial and academic institutions has been a great success. A lot of code has been written but, more importantly, several stepping stones have been laid, such as:

    • the general LMGC90 data access architecture, with the Python layer on top of the LMGC90 core;
    • the data storage format, namely HDF5.

    Somewhat collaterally, several other results have also been achieved:

    • partial LMGC90 data import into the SNCF CubicWeb-based numerical platform,
    • Debian / Ubuntu packaging of LMGC90 and dependencies.

    On a final note, one would say that we greatly appreciated the cooperation between the participants, which we found pleasant and efficient. We look forward to finding more occasions to work together.


  • Release of PyLint 0.27 / logilab-astng 0.24.2

    2013/02/28 by Sylvain Thenault

    Hi there,

    I'm very pleased to announce the release of pylint 0.27 and logilab-astng 0.24.2. There have been a lot of enhancements and bug fixes since the last release, so you're strongly encouraged to upgrade. Here is a detailed list of changes:

    • #20693: replace pylint.el by Ian Eure version (patch by J.Kotta)
    • #105327: add support for --disable=all option and deprecate the 'disable-all' inline directive in favour of 'skip-file' (patch by A.Fayolle)
    • #110840: add messages I0020 and I0021 for reporting of suppressed messages and useless suppression pragmas. (patch by Torsten Marek)
    • #112728: add warning E0604 for non-string objects in __all__ (patch by Torsten Marek)
    • #120657: add warning W0110/deprecated-lambda when a map/filter of a lambda could be a comprehension (patch by Martin Pool)
    • #113231: logging checker now looks at instances of Logger classes in addition to the base logging module. (patch by Mike Bryant)
    • #111799: don't warn about octal escape sequences, but warn about \o which is not octal in Python (patch by Martin Pool)
    • #110839: bind <F5> to Run button in pylint-gui
    • #115580: fix erroneous W0212 (access to protected member) on super call (patch by Martin Pool)
    • #110853: fix a crash when an __init__ method in a base class has been created by assignment rather than direct function definition (patch by Torsten Marek)
    • #110838: fix pylint-gui crash when include-ids is activated (patch by Omega Weapon)
    • #112667: fix emission of reimport warnings for mixed imports and extend the testcase (patch by Torsten Marek)
    • #112698: fix crash related to non-inferable __all__ attributes and invalid __all__ contents (patch by Torsten Marek)
    • Python 3 related fixes:
      • #110213: fix import of checkers broken with python 3.3, causing "No such message id W0704" breakage
      • #120635: redefine cmp function used in pylint.reporters
    • Include full warning id for I0020 and I0021 and make sure to flush warnings after each module, not at the end of the pylint run. (patch by Torsten Marek)
    • Changed the regular expression for inline options so that it must be preceded by a # (patch by Torsten Marek)
    • Make dot output for import graph predictable and not depend on ordering of strings in hashes. (patch by Torsten Marek)
    • Add hooks for import path setup and move pylint's sys.path modifications into them. (patch by Torsten Marek)
    • pylint-brain: more subprocess.Popen faking (see #46273)
    • #109562 [jython]: java modules have no __doc__, causing crash
    • #120646 [py3]: fix for python3.3 _ast changes which may cause crash
    • #109988 [py3]: test fixes

    Many thanks to all the people who contributed to this release!

    Enjoy!


  • FOSDEM 2013

    2013/02/12 by Pierre-Yves David

    I was in Brussels for FOSDEM 2013. As with previous FOSDEMs, there were too many interesting talks and people to see. Here is a summary of what I saw:

    In the Mozilla's room:

    1. The html5 pdf viewer pdfjs is impressive. The PDF specification is really scary, but this full-featured "native" viewer is able to render most of it with very good performance. Have a look at the pdfjs demo!
    2. Firefox debug tools overview, with a specific focus on the Firefox OS emulator in your browser.
    3. Introduction to webl10n: an internationalization format and library used in Firefox OS. A successful mix that results in a format that is idiot-proof enough for a duck to use, that relies on Unicode specifications to handle complex pluralization rules and that allows cascading translation definitions.
    (typical webl10n user)
    4. Status of html5 video and audio support in Firefox. The topic looks like a real headache but the team seems to be doing really well. Special mention for the reverse demo effect: the speaker expected some formats to be still unsupported, but someone else apparently implemented them overnight.
    5. Last but not least, I gave a talk about the changeset evolution concept that I'm putting in Mercurial. Thanks go to Feth for asking me his not-scripted-at-all questions during this talk. (slides)
    http://www.selenic.com/hg-logo/logo-droplets-150.png

    In the postgresql room:

    1. Insightful talk about more event triggers in the postgresql engine and how this may become the perfect way to break your system.
    2. Full update on the capabilities of postgis 2.0. The postgis suite was already impressive for storing and querying 2D data, but it now has impressive capabilities regarding 3D data.
    http://upload.wikimedia.org/wikipedia/en/6/60/PostGIS_logo.png

    On python related topic:

    http://www.python.org/community/logos/python-logo-master-v3-TM-flattened.png
    • Victor Stinner has started some interesting projects to improve CPython performance. The first one, astoptimizer, breaks some of the language semantics to apply optimisations when compiling to byte code (lookup caching, constant folding,…). The other, registervm, is a full redefinition of how the interpreter handles references in byte code.

    After the FOSDEM, I crossed the channel to attend a Mercurial sprint in London. Expect more on this topic soon.


  • February 2013: Mercurial channel "tour"

    2013/01/22 by Pierre-Yves David

    The release candidate of Mercurial 2.5 was published last Sunday.

    http://mercurial.selenic.com/images/mercurial-logo.png

    This new version makes a major change in the way "hidden" changesets are handled. In 2.4 only hg log (and a few others) would support effectively hiding "hidden" changesets. Now all hg commands are transparently compatible with the hidden revision concept. This is a considerable step towards changeset evolution, the next-generation collaboration technology that I'm developing for Mercurial.

    https://fosdem.org/2013/assets/flyer-thumb-0505d19dbf3cf6139bc7490525310f8e253e60448a29ed4313801b723d5b2ef1.png

    The 2.5 cycle is almost over, but there is no time to rest yet: on Saturday the 2nd of February, I will give a talk about the changeset evolution concept at FOSDEM in the Mozilla Room. This talk is an updated version of the one I gave at OSDC.fr 2012 (video in French).

    The week after, I'm crossing the channel to attend the Mercurial 2.6 Sprint hosted by Facebook London. I expect a lot of discussion about the user interface and network access of changeset evolution.

    The HG 2.3 sprint

  • Building Debian images for an OpenStack (private) cloud

    2012/12/23 by David Douard

    Now that I have a working OpenStack cloud at Logilab, I want to provide my fellow colleagues with a bunch of ready-made images to create instances.

    Strangely, there are no really usable ready-made UEC Debian images available out there. There have been recent efforts made to provide Debian images on Amazon Market Place, and the toolsuite used to build these is available as a collection of bash shell scripts from a github repository. There are also some images for Eucalyptus, but I have not been able to make them boot properly on my kvm-based OpenStack install.

    So I have tried to build my own set of Debian images to upload in my glance shop.

    Vocabulary

    A bit of vocabulary may be useful for those not very familiar with OpenStack or AWS jargon.

    When you want to create an instance of an image, i.e. boot a virtual machine in a cloud, you generally choose from a set of ready-made system images, then you choose a virtual machine flavor (i.e. a combination of a number of virtual CPUs, an amount of RAM, and a hard drive size used as the root device). Generally, you have to choose between tiny (1 CPU, 512MB, no disk), small (1 CPU, 2G of RAM, 20G of disk), etc.

    In the cloud world, an instance is not meant to be sustainable. What is sustainable is a volume that can be attached to a running instance.

    If you want your instance to be sustainable, there are 2 choices:

    • you can snapshot a running instance and upload it as a new image ; so it is not really a sustainable instance, instead, it's the ability to configure an instance that is then the base for booting other instances,
    • or you can boot an instance from a volume (which is the sustainable part of a virtual machine in a cloud).

    In the Amazon world, a "standard" image (the one that is instantiated when creating a new instance) is called an instance store-backed AMI image, also called a UEC image, and a volume image is called an EBS-backed AMI image (EBS stands for Elastic Block Storage). So an AMI image stored in a volume cannot be instantiated; it can be booted, once and only once at a time. But it is sustainable. Different usage.

    A UEC or AMI image consists of a triplet: a kernel, an init ramdisk and a root file system image. An EBS-backed image is just the raw disk image to be booted on a virtualization host (a kvm raw or qcow2 image, etc.)

    Images in OpenStack

    In OpenStack, when you create an instance from a given image, what happens depends on the kind of image.

    In fact, in OpenStack, one can upload traditional UEC AMI images (you need to upload the 3 files: the kernel, the initial ramdisk and the root filesystem as a raw image). But one can also upload bare images. This kind of image is booted directly by the virtualization host. So it is some kind of hybrid between a boot from volume (an EBS-backed boot in the Amazon world) and the traditional instantiation from a UEC image.

    Instantiating an AMI image

    When one creates an instance from an AMI image in an OpenStack cloud:

    • the kernel is copied to the virtualization host,
    • the initial ramdisk is copied to the virtualization host,
    • the root FS image is copied to the virtualization host,
    • then, the root FS image is:
      • duplicated (instantiated),
      • resized (the file is increased if needed) to the size of the asked instance flavor,
      • the file system is resized to the new size of the file,
      • the contained filesystem is mounted (using qemu-nbd) and the configured SSH access key is added to /root/.ssh/authorized_keys
      • the nbd volume is then unmounted
    • a libvirt domain is created, configured to boot from the given kernel and init ramdisk, using the resized and modified image disk as root filesystem,
    • the libvirt domain is then booted.

    Instantiating a BARE image

    When one creates an instance from a BARE image in an OpenStack cloud:

    • the VM image file is copied on the virtualization host,
    • the VM image file is duplicated (instantiated),
    • a libvirt domain is created, configured to boot from this copied image disk as root filesystem,
    • the libvirt domain is then booted.

    Differences between the 2 instantiation methods

    Instantiating a BARE image:
    • Involves a much simpler process.
    • Allows booting a non-Linux system (this depends on the virtualization system, and is especially true when using kvm virtualization).
    • Is slower to boot and consumes more resources, since the virtual machine image must be the size of the required/wanted virtual machine (but can remain minimal if using a qcow2 image format). If you use a 10G raw image, then 10G of data will be copied from the image provider to the virtualization host, and this big file will be duplicated each time you instantiate this image.
    • The root filesystem size corresponding to the flavor of the instance is not honored; the filesystem size is the one of the BARE images.
    Instantiating an AMI image:
    • Honours the flavor.
    • Generally allows quicker instance creation process.
    • Less resource consumption.
    • Can only boot Linux guests.

    If one wants to boot a Windows guest in OpenStack, the only solution (as far as I know) is to use a BARE image of an installed Windows system. It works (I have succeeded in doing so), but a minimal Windows 7 install is several GB, so instantiating such a BARE image is very slow, because the image needs to be uploaded on the virtualization host.

    Building a Debian AMI image

    So I wanted to provide a minimal Debian image in my cloud, and to provide it as an AMI image so the flavor is honoured, and so the standard cloud injection mechanisms (like setting up the ssh key to access the VM) work without having to tweak the rc.local script or use cloud-init in my guest.

    Here is what I did.

    1. Install a Debian system in a standard libvirt/kvm guest.

    david@host:~$ virt-install  --connect qemu+tcp://virthost/system   \
                     -n openstack-squeeze-amd64 -r 512 \
                     -l http://ftp2.fr.debian.org/pub/debian/dists/stable/main/installer-amd64/ \
                     --disk pool=default,bus=virtio,type=qcow2,size=5 \
                     --network bridge=vm7,model=virtio  --nographics  \
                     --extra-args='console=tty0 console=ttyS0,115200'
    

    This creates a new virtual machine, downloads the Debian installer directly from a Debian mirror, and starts the usual Debian installation in a virtual serial console (I don't like VNC very much).

    I then followed the installation procedure. When asked about partitioning and so on, I chose to create only one primary partition (i.e. with no swap partition; it won't be necessary here). I also chose only "Default system" and "SSH server" to be installed.

    2. Configure the system

    After the installation process, the VM is rebooted and I log into it (by SSH or via the console), so I can configure the system a bit.

    david@host:~$ ssh root@openstack-squeeze-amd64.vm.logilab.fr
    Linux openstack-squeeze-amd64 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64
    
    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    
    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    Last login: Sun Dec 23 20:14:24 2012 from 192.168.1.34
    root@openstack-squeeze-amd64:~# apt-get update
    root@openstack-squeeze-amd64:~# apt-get install vim curl parted # install some must have packages
    [...]
    root@openstack-squeeze-amd64:~# dpkg-reconfigure locales # I like to have fr_FR and en_US in my locales
    [...]
    root@openstack-squeeze-amd64:~# echo virtio_balloon >> /etc/modules
    root@openstack-squeeze-amd64:~# echo acpiphp >> /etc/modules
    root@openstack-squeeze-amd64:~# update-initramfs -u
    root@openstack-squeeze-amd64:~# apt-get clean
    root@openstack-squeeze-amd64:~# rm /etc/udev/rules.d/70-persistent-net.rules
    root@openstack-squeeze-amd64:~# rm .bash_history
    root@openstack-squeeze-amd64:~# poweroff
    

    What we do here is install some packages and do some configuration. The important part is adding the acpiphp module so that volume attachment will work in our instances. We also clean some things up before shutting the VM down.

    3. Convert the image into an AMI image

    Since I created the VM image as a qcow2 image, I needed to convert it back to a raw image:

    david@host:~$ scp root@virthost:/var/lib/libvirt/images/openstack-squeeze-amd64.img .
    david@host:~$ qemu-img convert -O raw openstack-squeeze-amd64.img openstack-squeeze-amd64.raw
    

    Then, as I want a minimal-sized disk image, the filesystem must be resized to its minimum. I did this as described below, but I think there are simpler methods to do so.

    david@host:~$ fdisk -l openstack-squeeze-amd64.raw  # display the partition location in the disk
    
    Disk openstack-squeeze-amd64.raw: 5368 MB, 5368709120 bytes
    149 heads, 8 sectors/track, 8796 cylinders, total 10485760 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x0001fab7
    
                       Device Boot      Start         End      Blocks   Id  System
    debian-squeeze-amd64.raw1            2048    10483711     5240832   83  Linux
    david@host:~$ # extract the filesystem from the image
    david@host:~$ dd if=openstack-squeeze-amd64.raw of=openstack-squeeze-amd64.ami bs=1024 skip=1024 count=5240832
    david@host:~$ losetup /dev/loop1 openstack-squeeze-amd64.ami
    david@host:~$ mkdir /tmp/img
    david@host:~$ mount /dev/loop1 /tmp/img
    david@host:~$ cp /tmp/img/boot/vmlinuz-2.6.32-5-amd64 .
    david@host:~$ cp /tmp/img/boot/initrd.img-2.6.32-5-amd64 .
    david@host:~$ umount /tmp/img
    david@host:~$ e2fsck -f /dev/loop1 # required before a resize
    
    e2fsck 1.42.5 (29-Jul-2012)
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    /dev/loop1: 26218/327680 files (0.2% non-contiguous), 201812/1310208 blocks
    david@host:~$ resize2fs -M /dev/loop1 # minimize the filesystem
    
    resize2fs 1.42.5 (29-Jul-2012)
    Resizing the filesystem on /dev/loop1 to 191461 (4k) blocks.
    The filesystem on /dev/loop1 is now 191461 blocks long.
    david@host:~$ # note the new size ^^^^ and the block size above (4k)
    david@host:~$ losetup -d /dev/loop1 # detach the lo device
    david@host:~$ dd if=openstack-squeeze-amd64.ami of=debian-squeeze-amd64-reduced.ami bs=4096 count=191461
    

    4. Upload in OpenStack

    After all this, you have a kernel image, an init ramdisk file and a minimized root filesystem image file. So you just have to upload them to your OpenStack image provider (glance):

    david@host:~$ glance add disk_format=aki container_format=aki name="debian-squeeze-uec-x86_64-kernel" \
                     < vmlinuz-2.6.32-5-amd64
    Uploading image 'debian-squeeze-uec-x86_64-kernel'
    ==================================================================================[100%] 24.1M/s, ETA  0h  0m  0s
    Added new image with ID: 644e59b8-1503-403f-a4fe-746d4dac2ff8
    david@host:~$ glance add disk_format=ari container_format=ari name="debian-squeeze-uec-x86_64-initrd" \
                     < initrd.img-2.6.32-5-amd64
    Uploading image 'debian-squeeze-uec-x86_64-initrd'
    ==================================================================================[100%] 26.7M/s, ETA  0h  0m  0s
    Added new image with ID: 6f75f1c9-1e27-4cb0-bbe0-d30defa8285c
    david@host:~$ glance add disk_format=ami container_format=ami name="debian-squeeze-uec-x86_64" \
                     kernel_id=644e59b8-1503-403f-a4fe-746d4dac2ff8 ramdisk_id=6f75f1c9-1e27-4cb0-bbe0-d30defa8285c \
                     < debian-squeeze-amd64-reduced.ami
    Uploading image 'debian-squeeze-uec-x86_64'
    ==================================================================================[100%] 42.1M/s, ETA  0h  0m  0s
    Added new image with ID: 4abc09ae-ea34-44c5-8d54-504948e8d1f7
    
    http://www.logilab.org/file/115220?vid=download

    And that's it (!). I now have a working Debian squeeze image in my cloud that works fine:

    http://www.logilab.org/file/115221?vid=download

  • Nazca is out !

    2012/12/21 by Simon Chabot

    What is it for ?

    Nazca is a python library aiming to help you align data. But what does “align data” mean? For instance, say you have a list of cities, described by their name and their country, and you would like to find their URI on dbpedia to get more information about them, such as the longitude and the latitude. If you have two or three cities, it can be done by hand, but it cannot if there are hundreds or thousands of cities. Nazca provides you with all the tools needed to do it.

    This blog post aims to show you how this library works and how it can be used. Once you have understood the main concepts behind this library, don't hesitate to try Nazca online !

    Introduction

    The alignment process is divided into three main steps:

    1. Gather and format the data we want to align. In this step, we define two sets called the alignset and the targetset. The alignset contains our data, and the targetset contains the data on which we would like to make the links.
    2. Compute the similarity between the items gathered. We compute a distance matrix between the two sets according to a given distance.
    3. Find the items having a high similarity thanks to the distance matrix.

    Simple case

    1. Let's define alignset and targetset as simple python lists.
    alignset = ['Victor Hugo', 'Albert Camus']
    targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
    
    2. Now, we have to compute the similarity between each pair of items. For that purpose the Levenshtein distance [1], which is quite accurate for computing the distance between a few words, is used. Such a function is provided in the nazca.distances module.

      The next step is to compute the distance matrix according to the Levenshtein distance. The result is given in the following table.

                        Albert Camus   Guillaume Apollinaire   Victor Hugo
      Victor Hugo                  6                       9             0
      Albert Camus                 0                       8             6
    3. The alignment process ends by reading the matrix and declaring that items having a value below a given threshold are identical.

    [1]Also called the edit distance, because the distance between two words is equal to the number of single-character edits required to change one word into the other.
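
    To make the footnote concrete, here is a small pure-Python implementation of the edit distance (for the intuition only; Nazca ships its own implementation, which works on whole names and may therefore return different values than a naive character-level version):

    def levenshtein(a, b):
        """Number of single-character edits turning ``a`` into ``b``."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    print(levenshtein('Dupont', 'Dupond'))   # 1: a single substitution
    print(levenshtein('cat', 'cats'))        # 1: a single insertion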

    A more complex one

    The previous case was simple, because we had only one attribute to align (the name), but it is frequent to have a lot of attributes to align, such as the name, the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and items will be represented as nested lists. See the following example:

    alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
                ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
                ['Michel Edouard', '18-04-1881', 'Nantes']]
    targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
                 ['Edouard Michel', '18/04/1881', 'Nantes'],
                 ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
                 ['Dupont Paul', '01-12-2012', 'Paris']]
    

    In such a case, two distance functions are used, the Levenshtein one for the name and the city and a temporal one for the birth date [2].

    The cdist function of nazca.distances enables us to compute those matrices:

    • For the names:
    >>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
    >>>                    'levenshtein', matrix_normalized=False)
    array([[ 1.,  6.,  5.,  0.],
           [ 5.,  6.,  0.,  5.],
           [ 6.,  0.,  6.,  6.]], dtype=float32)
    
                        Dupond Paul   Edouard Michel   Dupuis Jacques   Dupont Paul
    Paul Dupont                   1                6                5             0
    Jacques Dupuis                5                6                0             5
    Edouard Michel                6                0                6             6
    • For the birthdates:
    >>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
    >>>                    'temporal', matrix_normalized=False)
    array([[     0.,  40294.,   2702.,   7780.],
           [  2702.,  42996.,      0.,   5078.],
           [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
    
                     14/08/1991   18/04/1881   06/01/1999   01-12-2012
    14-08-1991                0        40294         2702         7780
    06-01-1999             2702        42996            0         5078
    18-04-1881            40294            0        42996        48074
    • For the birthplaces:
    >>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
    >>>                    'levenshtein', matrix_normalized=False)
    array([[ 0.,  4.,  8.,  0.],
           [ 8.,  9.,  0.,  8.],
           [ 4.,  0.,  9.,  4.]], dtype=float32)
    
                    Paris   Nantes   Bressuire   Paris
    Paris               0        4           8       0
    Bressuire           8        9           0       8
    Nantes              4        0           9       4

    The next step is gathering those three matrices into a global one, called the global alignment matrix. Thus we have:

              0       1       2       3
    0         1   40304    2715    7780
    1      2715   43011       0    5091
    2     40304       0   43011   48084
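
    The numbers in this table are exactly the element-wise sum of the three matrices above, so the combination step can be sketched in a few lines of numpy (the thresholding at the end anticipates the next paragraph):

    import numpy as np

    names  = np.array([[1., 6., 5., 0.], [5., 6., 0., 5.], [6., 0., 6., 6.]])
    dates  = np.array([[0., 40294., 2702., 7780.],
                       [2702., 42996., 0., 5078.],
                       [40294., 0., 42996., 48074.]])
    places = np.array([[0., 4., 8., 0.], [8., 9., 0., 8.], [4., 0., 9., 4.]])

    global_matrix = names + dates + places
    print(global_matrix)                      # the table above
    print(np.argwhere(global_matrix <= 2))    # pairs below a threshold of 2:
                                              # [[0 0] [1 2] [2 1]]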

    Allowing for some misspellings (for example Dupont and Dupond are very close), the matching threshold can be set to 1 or 2. Thus we can see that item 0 in our alignset is the same as item 0 in the targetset, and that item 1 in the alignset matches item 2 of the targetset: the links can be made!

    It's important to notice that even if item 0 of the alignset and item 3 of the targetset have the same name and the same birthplace, they are unlikely to be identical because of their very different birth dates.

    You may have noticed that working with matrices as I did in this example is a little bit tedious. The good news is that Nazca does all this work for you. You just have to give it the sets and the distance functions and that's all. Another piece of good news is that the project comes with the functions needed to build the sets!

    [2]Provided in the nazca.distances module.

    Real applications

    Just before we start, we will assume the following imports have been done:

    from nazca import dataio as aldio   #Functions for input and output data
    from nazca import distances as ald  #Functions to compute the distances
    from nazca import normalize as aln  #Functions to normalize data
    from nazca import aligner as ala    #Functions to align data
    

    The Goncourt prize

    On wikipedia, we can find the Goncourt prize winners, and we would like to establish a link between the winners and their URI on dbpedia (let's imagine the Goncourt prize winners category does not exist in dbpedia).

    We simply copy/paste the winners list from wikipedia into a file and replace all the separators (- and ,) by #. So, the beginning of our file is:

    1903#John-Antoine Nau#Force ennemie (Plume)
    1904#Léon Frapié#La Maternelle (Albin Michel)
    1905#Claude Farrère#Les Civilisés (Paul Ollendorff)
    1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)

    When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (so column number 1) both as identifier (we don't have a URI here) and as the attribute to align. This is told to python thanks to the following code:

    alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
    

    So, the beginning of our alignset is:

    >>> alignset[:3]
    [[u'John-Antoine Nau', u'John-Antoine Nau'],
     [u'Léon Frapié', u'Léon, Frapié'],
     [u'Claude Farrère', u'Claude Farrère']]
    

    Now, let's build the targetset thanks to a sparql query and the dbpedia end-point. We ask for the list of the French novelists, described by their URI and their name in French:

    query = """
         SELECT ?writer, ?name WHERE {
           ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
           ?writer rdfs:label ?name.
           FILTER(lang(?name) = 'fr')
        }
     """
     targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)
    

    Both functions return nested lists as presented before. Now, we have to define the distance function to be used for the alignment. This is done thanks to a python dictionary where the keys are the columns to work on, and the values are the treatments to apply.

    treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
                                                  # (column 1)
    

    Finally, the last thing we have to do, is to call the alignall function:

    alignments = ala.alignall(alignset, targetset,
                           0.4, #This is the matching threshold
                           treatments,
                           mode=None,#We'll discuss about that later
                           uniq=True #Get the best results only
                          )
    

    This function returns an iterator over the different alignments done. You can see the results thanks to the following code :

    for a, t in alignments:
        print '%s has been aligned onto %s' % (a, t)
    

    It may be important to apply some pre-treatment to the data to align. For instance, names can be written with lower or upper case characters, with extra characters such as punctuation, or with unwanted information in parentheses, and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see the docstring for more information). So the treatments list can be given as follows:

    def remove_after(string, sub):
        """ Remove the text after ``sub`` in ``string``
            >>> remove_after('I like cats and dogs', 'and')
            'I like cats'
            >>> remove_after('I like cats and dogs', '(')
            'I like cats and dogs'
        """
        try:
            return string[:string.lower().index(sub.lower())].strip()
        except ValueError:
            return string
    
    
    treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
                                        aln.simplify],
                      'metric': ald.levenshtein
                     }
                 }
    

    Cities alignment

    The previous case with the Goncourt prize winners was pretty simple, because the number of items was small and the computation fast. But in a more realistic use case, the number of items to align may be huge (thousands or millions…). In such a case it's unthinkable to build the global alignment matrix, because it would be too big and it would take (at least...) a few days to complete the computation. So the idea is to make small groups of possibly similar data to compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data. We have functions to group text and numerical data.

    This is the code used, we will explain it:

    targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
                               """Any U, N, LONG, LAT WHERE X is Location, X name
                                  N, X country C, C name "France", X longitude
                                  LONG, X latitude LAT, X population > 1000, X
                                  feature_class "P", X cwuri U""",
                               indexes=[0, 1, (2, 3)])
    alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
                                 """prefix db-owl: <http://dbpedia.org/ontology/>
                                 prefix db-prop: <http://fr.dbpedia.org/property/>
                                 select ?ville, ?name, ?long, ?lat where {
                                  ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
                                  ?ville rdf:type db-owl:PopulatedPlace .
                                  ?ville db-owl:populationTotal ?population .
                                  ?ville foaf:name ?name .
                                  ?ville db-prop:longitude ?long .
                                  ?ville db-prop:latitude ?lat .
                                  FILTER (?population > 1000)
                                 }""",
                                 indexes=[0, 1, (2, 3)])
    
    
    treatments = {1: {'normalization': [aln.simplify],
                      'metric': ald.levenshtein,
                      'matrix_normalized': False
                     }
                 }
    results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
                           indexes=(2, 2), #On which data build the kdtree
                           mode='kdtree',  #The mode to use
                           uniq=True) #Return only the best results
    

    Let's explain the code. We have two data sources containing the lists of cities we want to align: the first column is the identifier, the second is the name of the city and the last one is the location of the city (longitude and latitude), gathered into a single tuple.

    In this example, we want to build a kdtree on the couple (longitude, latitude) to divide our data into a few candidate groups. This clustering is coarse, and is only used to reduce the number of potential candidates without losing any more refined possible matches.

    So, in the next step, we define the treatments to apply. It is the same as before, but we ask for a non-normalized matrix (i.e. the real output of the levenshtein distance). Then we call the alignall function. indexes is a tuple giving the position of the point on which the kdtree must be built, mode is the mode used to find neighbours [3].

    Finally, uniq asks the function to return only the best candidate (i.e. the one having the shortest distance below the given threshold).

    The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second is the targetset one (it may take some time before the first tuples are yielded, because all the computation must be done first…).

    [3]The available modes are kdtree, kmeans and minibatch for numerical data, and minhashing for textual data.

    Try it online !

    We have also made this little application of Nazca, using CubicWeb. This application provides a user interface for Nazca, helping you choose what you want to align. You can use sparql or rql queries, as in the previous examples, or import your own csv file [4]. Once you have chosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in python! Once done, by clicking Next step, you start the alignment process. Wait a little bit, and you can either download the results as a csv or rdf file, or directly see the results online by choosing the html output.

    [4]Your csv file must be tab-separated for the moment…

  • Openstack, Wheezy and ZFS on Linux

    2012/12/19 by David Douard

    Openstack, Wheezy and ZFS on Linux

    A while ago, I started the install of an OpenStack cluster at Logilab, so our developers can play easily with any kind of environment. We are planning to improve our Apycot automatic testing platform so it can use "elastic power". And so on.

    http://www.openstack.org/themes/openstack/images/open-stack-cloud-computing-logo-2.png

    I first tried an Ubuntu Precise based setup, since at that time Debian packages were not really usable. The setup never reached a point where it could be released as production ready, due to the fact that I tried a too complex and bleeding edge configuration (involving Quantum, openvswitch, sheepdog)...

    Meanwhile, we ran really short of storage capacity. For now, it mainly consists of hard drives distributed in our 19" Dell racks (generally with hardware RAID controllers). So I recently purchased a low-cost storage bay (SuperMicro SC937 with a 6Gb/s JBOD-only HBA) with 18 spinning hard drives and 4 SSDs. This storage bay is driven by ZFS on Linux (tip: the SSD-stored ZIL is a requirement to get decent performance). This storage setup is still under test for now.

    http://zfsonlinux.org/images/zfs-linux.png

    I also went to the last Mini-DebConf in Paris, where Loic Dachary presented the status of the OpenStack packaging effort in Debian. This gave me the will to give OpenStack a new try using Wheezy and a somewhat simpler setup. But I could not consider not using my new ZFS-based storage as a nova volume provider. It is not available in OpenStack for now (there is a backend for Solaris, but not for ZFS on Linux). However, this is Python, and in fact the current ISCSIDriver backend needs very little to make it work with zfs instead of lvm as the "elastic" block-volume provider and manager.

    So, I wrote a custom nova volume driver to handle this. As I don't want the nova-volume daemon to run on my ZFS SAN, I wrote this backend mixing the SanISCSIDriver (which manages the storage system via SSH) and the standard ISCSIDriver (which uses standard Linux iscsi target tools). I'm not very fond of the API of the VolumeDriver (especially the fact that the ISCSIDriver is responsible for two roles: managing block-level volumes and exporting block-level volumes). This small design flaw (IMHO) is the reason I had to duplicate some code (not much but...) to implement my ZFSonLinuxISCSIDriver...

    So here is the setup I made:

    Infrastructure

    My OpenStack Essex "cluster" consists for now of:

    • one control node, running in a "normal" libvirt-controlled virtual machine; it is a Wheezy that runs:
      • nova-api
      • nova-cert
      • nova-network
      • nova-scheduler
      • nova-volume
      • glance
      • postgresql
      • OpenStack dashboard
    • one computing node (Dell R310, Xeon X3480, 32G, Wheezy), which runs:
      • nova-api
      • nova-network
      • nova-compute
    • ZFS-on-Linux SAN (3x raidz1 pools made of 6 1T drives, 2x (mirrored) 32G SLC SSDs, 2x 120G MLC SSDs for cache); for now, the storage is exported via one 1G Ethernet link.

    OpenStack Essex setup

    I mainly followed the Debian HOWTO to set up my private cloud, tuning the network settings to match my environment (and the fact my control node lives in a VM, with VLAN stuff handled by the host).

    I easily got a working setup (I must admit that I think my previous experiment with OpenStack helped a lot when dealing with custom configurations... and vocabulary; I'm not sure I would have succeeded "easily" following the HOWTO, but hey, it is a functional HOWTO, meaning if you do not follow the instructions because you want special tunings, don't blame the HOWTO).

    Compared to the HOWTO, here is what my nova.conf looks like (as of today):

    [DEFAULT]
    logdir=/var/log/nova
    state_path=/var/lib/nova
    lock_path=/var/lock/nova
    root_helper=sudo nova-rootwrap
    auth_strategy=keystone
    dhcpbridge_flagfile=/etc/nova/nova.conf
    dhcpbridge=/usr/bin/nova-dhcpbridge
    sql_connection=postgresql://novacommon:XXX@control.openstack.logilab.fr/nova
    
    ##  Network config
    # A nova-network on each compute node
    multi_host=true
    # VLAN manager
    network_manager=nova.network.manager.VlanManager
    vlan_interface=eth1
    # My ip
    my-ip=172.17.10.2
    public_interface=eth0
    # Dmz & metadata things
    dmz_cidr=169.254.169.254/32
    ec2_dmz_host=169.254.169.254
    metadata_host=169.254.169.254
    
    ## More general things
    # The RabbitMQ host
    rabbit_host=control.openstack.logilab.fr
    
    ## Glance
    image_service=nova.image.glance.GlanceImageService
    glance_api_servers=control.openstack.logilab.fr:9292
    use-syslog=true
    ec2_host=control.openstack.logilab.fr
    
    novncproxy_base_url=http://control.openstack.logilab.fr:6080/vnc_auto.html
    vncserver_listen=0.0.0.0
    vncserver_proxyclient_address=127.0.0.1
    

    Volume

    I had a bit more work to do to make nova-volume work. First, I got hit by this nasty bug #695791 which is trivial to fix... when you know how to fix it (I noticed the bug report after I fixed it by myself).

    Then, as I wanted the volumes to be stored and exported by my shiny new ZFS-on-Linux setup, I had to write my own volume driver, which was quite easy, since it is Python, and the logic to implement was already provided by the ISCSIDriver class on the one hand, and by the SanISCSIDriver on the other hand. So I ended up with this first implementation. This file should be copied to the nova volume package directory (nova/volume/zol.py):

    # vim: tabstop=4 shiftwidth=4 softtabstop=4
    
    # Copyright 2010 United States Government as represented by the
    # Administrator of the National Aeronautics and Space Administration.
    # Copyright 2011 Justin Santa Barbara
    # Copyright 2012 David DOUARD, LOGILAB S.A.
    # All Rights Reserved.
    #
    #    Licensed under the Apache License, Version 2.0 (the "License"); you may
    #    not use this file except in compliance with the License. You may obtain
    #    a copy of the License at
    #
    #         http://www.apache.org/licenses/LICENSE-2.0
    #
    #    Unless required by applicable law or agreed to in writing, software
    #    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    #    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    #    License for the specific language governing permissions and limitations
    #    under the License.
    """
    Driver for ZFS-on-Linux-stored volumes.
    
    This is mainly a custom version of the ISCSIDriver that uses ZFS as
    volume provider, generally accessed over SSH.
    """
    
    import os
    
    from nova import exception
    from nova import flags
    from nova import utils
    from nova import log as logging
    from nova.openstack.common import cfg
    from nova.volume.driver import _iscsi_location
    from nova.volume import iscsi
    from nova.volume.san import SanISCSIDriver
    
    
    LOG = logging.getLogger(__name__)
    
    san_opts = [
        cfg.StrOpt('san_zfs_command',
                   default='/sbin/zfs',
                   help='The ZFS command.'),
        ]
    
    FLAGS = flags.FLAGS
    FLAGS.register_opts(san_opts)
    
    
    class ZFSonLinuxISCSIDriver(SanISCSIDriver):
        """Executes commands relating to ZFS-on-Linux-hosted ISCSI volumes.
    
        Basic setup for a ZoL iSCSI server:
    
        XXX
    
        Note that current implementation of ZFS on Linux does not handle:
    
          zfs allow/unallow
    
        For now, needs to have root access to the ZFS host. The best is to
        use a ssh key with ssh authorized_keys restriction mechanisms to
        limit root access.
    
        Make sure you can login using san_login & san_password/san_private_key
        """
        ZFSCMD = FLAGS.san_zfs_command
    
        _local_execute = utils.execute
    
        def _getrl(self):
            return self._runlocal
        def _setrl(self, v):
            if isinstance(v, basestring):
                v = v.lower() in ('true', 't', '1', 'y', 'yes')
            self._runlocal = v
        run_local = property(_getrl, _setrl)
    
        def __init__(self):
            super(ZFSonLinuxISCSIDriver, self).__init__()
            self.tgtadm.set_execute(self._execute)
            LOG.info("run local = %s (%s)" % (self.run_local, FLAGS.san_is_local))
    
        def set_execute(self, execute):
            LOG.debug("override local execute cmd with %s (%s)" %
                      (repr(execute), execute.__module__))
            self._local_execute = execute
    
        def _execute(self, *cmd, **kwargs):
            if self.run_local:
                LOG.debug("LOCAL execute cmd %s (%s)" % (cmd, kwargs))
                return self._local_execute(*cmd, **kwargs)
            else:
                LOG.debug("SSH execute cmd %s (%s)" % (cmd, kwargs))
                check_exit_code = kwargs.pop('check_exit_code', None)
                command = ' '.join(cmd)
                return self._run_ssh(command, check_exit_code)
    
        def _create_volume(self, volume_name, sizestr):
            zfs_poolname = self._build_zfs_poolname(volume_name)
    
            # Create a zfs volume
            cmd = [self.ZFSCMD, 'create']
            if FLAGS.san_thin_provision:
                cmd.append('-s')
            cmd.extend(['-V', sizestr])
            cmd.append(zfs_poolname)
            self._execute(*cmd)
    
        def _volume_not_present(self, volume_name):
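            # 'zfs list -H <name>' exits with an error when the dataset does
            # not exist, so any exception below is treated as "not present".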
            zfs_poolname = self._build_zfs_poolname(volume_name)
            try:
                out, err = self._execute(self.ZFSCMD, 'list', '-H', zfs_poolname)
                if out.startswith(zfs_poolname):
                    return False
            except Exception as e:
                # If the volume isn't present
                return True
            return False
    
        def create_volume_from_snapshot(self, volume, snapshot):
            """Creates a volume from a snapshot."""
            zfs_snap = self._build_zfs_poolname(snapshot['name'])
            zfs_vol = self._build_zfs_poolname(volume['name'])
            self._execute(self.ZFSCMD, 'clone', zfs_snap, zfs_vol)
            self._execute(self.ZFSCMD, 'promote', zfs_vol)
    
        def delete_volume(self, volume):
            """Deletes a volume."""
            if self._volume_not_present(volume['name']):
                # If the volume isn't present, then don't attempt to delete
                return True
            zfs_poolname = self._build_zfs_poolname(volume['name'])
            self._execute(self.ZFSCMD, 'destroy', zfs_poolname)
    
        def create_export(self, context, volume):
            """Creates an export for a logical volume."""
            self._ensure_iscsi_targets(context, volume['host'])
            iscsi_target = self.db.volume_allocate_iscsi_target(context,
                                                                volume['id'],
                                                                volume['host'])
            iscsi_name = "%s%s" % (FLAGS.iscsi_target_prefix, volume['name'])
            volume_path = self.local_path(volume)
    
            # XXX (ddouard) this code is not robust: does not check for
            # existing iscsi targets on the host (ie. not created by
            # nova), but fixing it require a deep refactoring of the iscsi
            # handling code (which is what have been done in cinder)
            self.tgtadm.new_target(iscsi_name, iscsi_target)
            self.tgtadm.new_logicalunit(iscsi_target, 0, volume_path)
    
            if FLAGS.iscsi_helper == 'tgtadm':
                lun = 1
            else:
                lun = 0
            if self.run_local:
                iscsi_ip_address = FLAGS.iscsi_ip_address
            else:
                iscsi_ip_address = FLAGS.san_ip
            return {'provider_location': _iscsi_location(
                    iscsi_ip_address, iscsi_target, iscsi_name, lun)}
    
        def remove_export(self, context, volume):
            """Removes an export for a logical volume."""
            try:
                iscsi_target = self.db.volume_get_iscsi_target_num(context,
                                                                   volume['id'])
            except exception.NotFound:
                LOG.info(_("Skipping remove_export. No iscsi_target " +
                           "provisioned for volume: %d"), volume['id'])
                return
    
            try:
                # ietadm show will exit with an error
                # this export has already been removed
                self.tgtadm.show_target(iscsi_target)
            except Exception as e:
                LOG.info(_("Skipping remove_export. No iscsi_target " +
                           "is presently exported for volume: %d"), volume['id'])
                return
    
            self.tgtadm.delete_logicalunit(iscsi_target, 0)
            self.tgtadm.delete_target(iscsi_target)
    
        def check_for_export(self, context, volume_id):
            """Make sure volume is exported."""
            tid = self.db.volume_get_iscsi_target_num(context, volume_id)
            try:
                self.tgtadm.show_target(tid)
            except exception.ProcessExecutionError as e:
                # Instances remount read-only in this case.
                # /etc/init.d/iscsitarget restart and rebooting nova-volume
                # is better since ensure_export() works at boot time.
                LOG.error(_("Cannot confirm exported volume "
                            "id:%(volume_id)s.") % locals())
                raise
    
        def local_path(self, volume):
            zfs_poolname = self._build_zfs_poolname(volume['name'])
            zvoldev = '/dev/zvol/%s' % zfs_poolname
            return zvoldev
    
        def _build_zfs_poolname(self, volume_name):
            zfs_poolname = '%s%s' % (FLAGS.san_zfs_volume_base, volume_name)
            return zfs_poolname
    

    To configure my nova-volume instance (which runs on the control node, since it's only a manager), I added these to my nova.conf file:

    # nova-volume config
    volume_driver=nova.volume.zol.ZFSonLinuxISCSIDriver
    iscsi_ip_address=172.17.1.7
    iscsi_helper=tgtadm
    san_thin_provision=false
    san_ip=172.17.1.7
    san_private_key=/etc/nova/sankey
    san_login=root
    san_zfs_volume_base=data/openstack/volume/
    san_is_local=false
    verbose=true
    

    Note that the private key (/etc/nova/sankey here) is stored in the clear and that it must be readable by the nova user.

    Since this key is stored in the clear and gives root access to my ZFS host, I limited this root access somewhat by using a custom command wrapper in the .ssh/authorized_keys file.

    Something like (naive implementation):

    [root@zfshost ~]$ cat /root/zfswrapper
    #!/bin/sh
    CMD=`echo $SSH_ORIGINAL_COMMAND | awk '{print $1}'`
    if [ "$CMD" != "/sbin/zfs" && "$CMD" != "tgtadm" ]; then
      echo "Can do only zfs/tgtadm stuff here"
      exit 1
    fi
    
    echo "[`date`] $SSH_ORIGINAL_COMMAND" >> .zfsopenstack.log
    exec $SSH_ORIGINAL_COMMAND
    

    Using this in root's .ssh/authorized_keys file:

    [root@zfshost ~]$ cat /root/.ssh/authorized_keys | grep control
    from="control.openstack.logilab.fr",no-pty,no-port-forwarding,no-X11-forwarding, \
          no-agent-forwarding,command="/root/zfswrapper" ssh-rsa AAAA[...] root@control
    

    I had to set the iscsi_ip_address (the IP address of the ZFS host), but I think this is a result of something I implemented incorrectly in my ZFSonLinux driver.

    Using this config, I can boot an image, create a volume on my ZFS storage, and attach it to the running image.

    I still have to test things like snapshots, (live?) migration and so on. This is a very first draft implementation which needs to be refined, improved and tested.

    What's next

    Besides the fact that it needs more tests, I plan to use salt for my OpenStack deployment (first to add more compute nodes to my cluster), and on the other hand, I'd like to try salt-cloud so I have a bunch of Debian images that "just work" (without needing to port the cloud-init Ubuntu package).

    As for my zol driver, I need to port it to Cinder, but I do not have a Folsom install to test it...


  • Announcing pylint.org

    2012/12/04 by Arthur Lutz

    Pylint - the world-renowned Python static code checker - now has a landing page: http://www.pylint.org

    http://www.python.org/images/python-logo.gif

    We've tried to summarize all the things a newcomer should know about pylint. We hope it reflects the diversity of uses and support channels for pylint.

    Open and decentralized Web

    Note that pylint is not hosted on github or another well-known forge, since we firmly believe in a decentralized architecture for the web.

    This applies especially to open source software development. Pylint's development is self-hosted on a forge and its code is version-controlled with mercurial, a distributed version control system (DVCS). Both tools are free software written in python.

    http://www.zjulian.com/wp-content/uploads/2012/05/Centralized-Decentralized-And-Distributed-System.jpg

    We know centralized (and closed source) platforms for managing software projects can make things easier for contributors. We have enabled a mirror on bitbucket (and pylint-brain) so as to ease forks and pull requests. Pull requests can be made there and even from a self-hosted mercurial (with a quick email on the mailing-list).

    Feel free to add your comments or feedback below.


  • Mini-DebConf Paris 2012

    2012/11/29 by Julien Cristau

    Last week-end, I attended the mini-DebConf organized at EPITA (near Paris) by the French Debian association and sponsored by Logilab.

    http://www.logilab.org/file/112649?vid=download

    The event was a great success, with a rather large number of attendees, including people coming from abroad such as Debian kernel maintainers Ben Hutchings and Maximilian Attems, who talked about their work with Linux.

    Among the other speakers were Loïc Dachary about OpenStack and its packaging in Debian, and Josselin Mouette about his work deploying Debian/GNOME desktops in a large enterprise environment at EDF R&D.

    On my part I gave a talk on Saturday about Debian's release team, and the current state of the wheezy (to-be Debian 7.0) release.

    On Sunday I presented together with Vladimir Daric the work we did to migrate a computation cluster from Red Hat to Debian. Attendees had quite a few questions about our use of ZFS on Linux for storage, and salt for configuration management and deployment.

    Slides for the talks are available on the mini-DebConf web page (wheezy state, migration to debian cluster also viewable on slideshare), and videos will soon be on http://video.debian.net/.

    Now looking forward to next summer's DebConf13 in Switzerland, and hopefully next year's edition of the Paris event.


  • PyLint 0.26 is out

    2012/10/08 by Sylvain Thenault

    I'm very pleased to announce new releases of Pylint and the underlying ASTNG library, respectively 0.26 and 0.24.1. The great news is that both bring a lot of new features and some bug fixes, mostly provided by the community effort.

    We're still trying to make it easier to contribute to our free software projects at Logilab, so I hope this will continue and we'll get even more contributions in the near future, and an even smarter/faster/whatever pylint!

    For more details, see ChangeLog files or http://www.logilab.org/project/pylint/0.26.0 and http://www.logilab.org/project/logilab-astng/0.24.1

    So many thanks to all those who made that release, and enjoy!


  • Profiling tools

    2012/09/07 by Alain Leufroy

    Python

    Run time profiling with cProfile

    Python is distributed with profiling modules. They describe the run time operation of a pure python program, providing a variety of statistics.

    cProfile is the recommended module. To execute your program under its control, a simple form is

    $ python -m cProfile -s cumulative mypythonscript.py
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          16    0.055    0.003   15.801    0.988 __init__.py:1(<module>)
           1    0.000    0.000   11.113   11.113 __init__.py:35(extract)
         135    7.351    0.054   11.078    0.082 __init__.py:25(iter_extract)
    10350736    3.628    0.000    3.628    0.000 {method 'startswith' of 'str' objects}
           1    0.000    0.000    2.422    2.422 pyplot.py:123(show)
           1    0.000    0.000    2.422    2.422 backend_bases.py:69(__call__)
           ...
    

    Each column provides information about the execution time of every function call. -s cumulative orders the results by descending cumulative time.

    Note:

    You can profile a particular python function such as main()

    >>> import profile
    >>> profile.run('main()')
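    >>> # the recommended cProfile module offers the same interface, with an
    >>> # optional sort argument (here by cumulative time); this assumes main()
    >>> # is defined in the current session
    >>> import cProfile
    >>> cProfile.run('main()', sort='cumulative')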
    

    Graphical tools to show profiling results

    Even though reporting tools are included with the cProfile profiler, it can be interesting to use graphical tools. Most of them work with a stats file that can be generated by cProfile using the -o filepath option.
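
    If you prefer to stay inside Python, the standard pstats module can load such a stats file and print the same reports programmatically; a minimal sketch, assuming the file was produced with -o output.pstats:

    >>> import pstats
    >>> stats = pstats.Stats('output.pstats')
    >>> # shorten file paths, sort by cumulative time, print the top 10 entries
    >>> stats.strip_dirs().sort_stats('cumulative').print_stats(10)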

    Below are some of the available graphical tools that we tested.

    Gprof2Dot

    is a Python-based tool that transforms profiling output into a picture of the call tree graph (using graphviz). A typical profiling session with Python looks like this:

    $ python -m cProfile -o output.pstats mypythonscript.py
    $ gprof2dot.py -f pstats output.pstats | dot -Tpng -o profiling_results.png
    
    http://wiki.jrfonseca.googlecode.com/git/gprof2dot.png

    Each node of the output graph represents a function and has the following layout:

    +----------------------------------+
    |   function name : module name    |
    | total time including sub-calls % |  total time including sub-calls %
    |    (self execution time %)       |------------------------------------>
    |  total number of self calls      |
    +----------------------------------+
    

    Nodes and edges are colored according to the "total time" spent in the functions.

    Note: The following small patch lets the node color correspond to the execution time and the edge color to the "total time":
    diff -r da2b31597c5f gprof2dot.py
    --- a/gprof2dot.py      Fri Aug 31 16:38:37 2012 +0200
    +++ b/gprof2dot.py      Fri Aug 31 16:40:56 2012 +0200
    @@ -2628,6 +2628,7 @@
                     weight = function.weight
                 else:
                     weight = 0.0
    +            weight = function[TIME_RATIO]
    
                 label = '\n'.join(labels)
                 self.node(function.id,
    
    PyProf2CallTree

    is a script that helps visualize profiling data with the KCacheGrind graphical calltree analyzer. This is a more interactive solution than Gprof2Dot, but it requires installing KCacheGrind. Typical usage:

    $ python -m cProfile -o stat.prof mypythonscript.py
    $ python pyprof2calltree.py -i stat.prof -k
    

    The profiling data file is opened in KCacheGrind via the pyprof2calltree module; its -k switch launches KCacheGrind automatically.

    http://kcachegrind.sourceforge.net/html/pics/KcgShot3Large.gif

    There are other tools that are worth testing:

    • RunSnakeRun is an interactive GUI tool which visualizes a profile file using square maps:

      $ python -m cProfile -o stat.prof mypythonscript.py
      $ runsnake stat.prof
      
    • pycallgraph generates PNG images of a call tree with the total number of calls:

      $ pycallgraph mypythonscript.py
      
    • lsprofcalltree also uses KCacheGrind to display profiling data:

      $ python lsprofcalltree.py -o output.log yourprogram.py
      $ kcachegrind output.log
      

    C/C++ extension profiling

    For optimization purposes one may have Python extensions written in C/C++. For such modules, cProfile will not dig into the corresponding call tree. Dedicated tools (for the most part outside of Python itself) must be used to profile a C/C++ extension from Python.

    Yep

    is a Python module dedicated to profiling compiled Python extensions. It uses the Google CPU profiler:

    $ python -m yep --callgrind mypythonscript.py
    

    Memory Profiler

    You may want to control the amount of memory used by a python program. There is an interesting module that fits this need: memory_profiler

    You can fetch memory consumption of a program over time using

    >>> from memory_profiler import memory_usage
    >>> memory_usage(main, (), {})
    

    memory_profiler can also spot the lines that consume the most memory, using pdb or IPython.
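
    For line-by-line reports, memory_profiler also provides a profile decorator: running a script whose functions are decorated with it prints the memory used by each line. A minimal sketch (the process function is just an example):

    from memory_profiler import profile

    @profile
    def process():
        # allocate then release a large list so the report shows something
        data = [x ** 2 for x in range(10 ** 6)]
        del data
        return True

    if __name__ == '__main__':
        process()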

    General purpose Profiling

    The Linux perf tool gives access to a wide variety of performance counter subsystems. Using perf, any execution configuration (pure python programs, compiled extensions, subprocess, etc.) may be profiled.

    Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.

    You can have information about execution times with:

    $ perf stat -e cpu-cycles,cpu-clock,task-clock python mypythonscript.py
    

    You can have RAM access information using:

    $ perf stat -e cache-misses python mypythonscript.py
    

    Be aware that perf gives the raw values of the hardware counters. So, you need to know exactly what you are looking for and how to interpret these values in the context of your program.

    Note that you can use Gprof2Dot to get a more user-friendly output:

    $ perf record -g python mypythonscript.py
    $ perf script | gprof2dot.py -f perf | dot -Tpng -o output.png
    

  • PyLint 0.25.2 and related projects released

    2012/07/18 by Sylvain Thenault

    I'm pleased to announce the new release of Pylint and related projects (i.e. logilab-astng and logilab-common)!

    By installing PyLint 0.25.2, ASTNG 0.24 and logilab-common 0.58.1, you'll get a bunch of bug fixes and a few new features. Among the hot stuff:

    • PyLint should now work with alternative python implementations such as Jython, and at least go further with PyPy and IronPython (but those have not really been tested, please try it and provide feedback so we can improve their support)
    • the new ASTNG includes a description of dynamic code it is not able to understand. This is handled by a bitbucket hosted project described in another post.

    Many thanks to everyone who contributed to these releases, Torsten Marek / Boris Feld in particular (both sponsored by Google by the way, Torsten as an employee and Boris as a GSoC student).

    Enjoy!


  • Introducing the pylint-brain project

    2012/07/18 by Sylvain Thenault

    Huum, along with the new PyLint release, it's time to introduce the PyLint-Brain project I've recently started.

    Despite its name, PyLint-Brain is actually a collection of extensions for ASTNG, with the goal of making ASTNG smarter (and this directly benefits PyLint) by describing stuff that is too dynamic to be understood automatically (such as functions in the hashlib module, defaultdict, etc.).

    The PyLint-Brain collection of extensions is developed outside of ASTNG itself and hosted on a bitbucket project to ease community involvement and to allow distinct development cycles. Basically, ASTNG will include the PyLint-Brain extensions, but you may use earlier/custom versions by tweaking your PYTHONPATH.

    Take a look at the code, it's fairly easy to contribute new descriptions, and help us make pylint smarter!


  • Debian science sprint and workshop at ESRF

    2012/06/22 by Julien Cristau

    esrf debian

    From June 24th to June 26th, the European Synchrotron organises a workshop centered around Debian. On Monday, a number of talks about the use of Debian in scientific facilities will be featured. On Sunday and Tuesday, members of the Debian Science group will meet for a sprint focusing on the upcoming Debian 7.0 release.

    Among the speakers will be Stefano Zacchiroli, the current Debian project leader. Logilab will be present with Nicolas Chauvat at Monday's conference, and Julien Cristau at both the sprint and the conference.

    At the sprint we'll be discussing packaging of scientific libraries such as blas or MPI implementations, and working on polishing other scientific packages, such as python-related ones (including Salome on which we are currently working).


  • A Python dev day at La Cantine. Would like to have more PyCon?

    2012/06/01 by Damien Garaud
    http://www.logilab.org/file/98313?vid=download http://www.logilab.org/file/98312?vid=download

    We were at La Cantine on May 21st, 2012 in Paris for the "PyCon.us Replay session".

    La Cantine is a coworking space where hackers, artists, students and so on can meet and work. It also organises some meetings and conferences about digital culture, computer science, ...

    On May 21st 2012, it was a dev day about Python. "Would you like to have more PyCon?" is a French wordplay where PyCon sounds like Picon, a French "apéritif" which traditionally accompanies beer. A good thing, because the meeting began at 6:30 PM! Presentations and demonstrations were about some Python projects presented at PyCon 2012 in Santa Clara (California) last March. The original PyCon presentations are accessible on pyvideo.org.

    PDB Introduction

    By Gael Pasgrimaud (@gawel_).

    pdb is the well-known Python debugger. Gael showed us how to easily use this almost-mandatory tool when you develop in Python. As with the gdb debugger, you can stop the execution at a breakpoint, walk up the stack, print the value of local variables or modify temporarily some local variables.

    The best way to define a breakpoint in your source code is to write:

    import pdb; pdb.set_trace()
    

    Insert that where you would like pdb to stop. Then, you can step through the code with the s, c or n commands. See help for more information. Below is the output of the help command in the pdb command-line interpreter:

    (Pdb) help
    
    Documented commands (type help <topic>):
    ========================================
    EOF    bt         cont      enable  jump  pp       run      unt
    a      c          continue  exit    l     q        s        until
    alias  cl         d         h       list  quit     step     up
    args   clear      debug     help    n     r        tbreak   w
    b      commands   disable   ignore  next  restart  u        whatis
    break  condition  down      j       p     return   unalias  where
    
    Miscellaneous help topics:
    ==========================
    exec  pdb
    

    It is also possible to invoke the pdb module when you run a Python script, such as:

    $> python -m pdb my_script.py
    

    Pyramid

    http://www.logilab.org/file/98311?vid=download

    By Alexis Metereau (@ametaireau).

    Pyramid is an open source Python web framework from Pylons Project. It concentrates on providing fast, high-quality solutions to the fundamental problems of creating a web application:

    • the mapping of URLs to code;
    • templating;
    • security and serving static assets.

    The framework allows you to choose different approaches according to the simplicity/feature tradeoff the programmer needs. Alexis, from the French team of Mozilla Services, is working with it on a daily basis and seemed happy to use it. He told us that he uses Pyramid more as a Python web library than as a web framework.
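
    To give an idea of how these pieces fit together, here is roughly the minimal "hello world" style application promoted by the Pyramid documentation (a sketch, not code shown at the talk):

    from wsgiref.simple_server import make_server
    from pyramid.config import Configurator
    from pyramid.response import Response

    def hello_world(request):
        # the route below maps /hello/{name} onto this view
        return Response('Hello %s!' % request.matchdict['name'])

    if __name__ == '__main__':
        config = Configurator()
        config.add_route('hello', '/hello/{name}')        # URL -> code mapping
        config.add_view(hello_world, route_name='hello')
        app = config.make_wsgi_app()                      # WSGI application
        make_server('0.0.0.0', 8080, app).serve_forever()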

    Circus

    http://www.logilab.org/file/98316?vid=download

    By Benoit Chesneau (@benoitc).

    Circus is a process watcher and runner. Multiple processes can be managed and monitored from Python scripts (via an API) or from the command-line interface.

    A very useful web application, called circushttpd, provides a way to monitor and manage Circus through the web. Circus uses zeromq, a well-known tool used at Logilab.
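
    As an illustration of the Python API, the Circus documentation shows code along these lines (a sketch only; the exact signature of get_arbiter has varied across Circus versions, and "myprogram" is a placeholder command):

    from circus import get_arbiter

    # watch three processes running the placeholder "myprogram" command
    arbiter = get_arbiter([{"cmd": "myprogram", "numprocesses": 3}])
    try:
        arbiter.start()
    finally:
        arbiter.stop()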

    matplotlib demo

    This session was a well-prepared and funny live demonstration by Julien Tayon of matplotlib, the Python 2D plotting library. He showed us some quick and easy stuff.

    For instance, how to plot a sine with a few lines of code using matplotlib and NumPy:

    import numpy as np
    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    # A simple sinus.
    ax.plot(np.sin(np.arange(-10., 10., 0.05)))
    fig.show()
    

    which gives:

    http://www.logilab.org/file/98315?vid=download

    You can make some fancier plots such as:

    # A sinus and a fancy Cardioid.
    a = np.arange(-5., 5., 0.1)
    ax_sin = fig.add_subplot(211)
    ax_sin.plot(np.sin(a), '^-r', lw=1.5)
    ax_sin.set_title("A sinus")
    
    # Cardioid.
    ax_cardio = fig.add_subplot(212)
    x = 0.5 * (2. * np.cos(a) - np.cos(2 * a))
    y = 0.5 * (2. * np.sin(a) - np.sin(2 * a))
    ax_cardio.plot(x, y, '-og')
    ax_cardio.grid()
    ax_cardio.set_xlabel(r"$\frac{1}{2} (2 \cos{t} - \cos{2t})$", fontsize=16)
    fig.show()
    

    where you can use LaTeX equations as the X label, for instance.

    http://www.logilab.org/file/98314?vid=download

    The strength of this plotting library is its gallery of examples, each with its piece of code. See the matplotlib gallery.

    Using Python for robotics

    Dimitri Merejkowsky reviewed how Python can be used to control and program Aldebaran's humanoid robot NAO.

    Wrap up

    Unfortunately, Olivier Grisel, who was supposed to give three interesting presentations, was not there. He was supposed to present:

    • A demo about injecting arbitrary code and monitoring Python process with Pyrasite.
    • Another demo about Interactive Data analysis with Pandas and the new IPython NoteBook.
    • Wrap up: Distributed computation on cluster-related projects: IPython.parallel, picloud and Storm + Umbrella

    Thanks to La Cantine and the different organisers for this friendly dev day.


  • Mercurial 2.3 sprint, Day 1-2-3

    2012/05/15 by Pierre-Yves David

    I'm now back from Copenhagen where I attended the mercurial 2.3 sprint with twenty other people. A huge amount of work was done in a very friendly atmosphere.

    Regarding mercurial's core:

    • Bookmark behaviour was improved to get closer to named branch's behaviour.
    • Several performance improvements regarding branches and heads caches. The heads cache refactoring improves rebase performance on huge repositories (thanks to Facebook and Atlassian).
    • The concept I'm working on, Obsolete markers, was a highly discussed subject and is expected to get partly into the core in the near future. Thanks to my employer Logilab for paying me to work on this topic.
    • General code cleanup and lock validation.
    http://www.logilab.org/file/92956?vid=download

    Regarding the bundled extensions:

    • Some fixes were made to progress, which is now closer to getting into mercurial's core.
    • Histedit and keyring extensions are scheduled to be shipped with mercurial.
    • Some old and unmaintained extensions (children, hgtk) are now deprecated.
    • The LargeFile extension got some new features (thanks to the folks from Unity3D)
    • Rebase will use the --detach flag by default in the next release.
    http://www.logilab.org/file/92958?vid=download

    Regarding the project itself:

    http://www.logilab.org/file/92955?vid=download

    Regarding other extensions:

    http://www.logilab.org/file/92959?vid=download

    And I'm probably forgetting some stuff. Special thanks to Unity3D for hosting the sprint and providing power, network and food during these 3 days.


  • Mercurial 2.3 day 0

    2012/05/10 by Pierre-Yves David

    I'm now in Copenhagen to attend the mercurial "2.3" sprint.

    About twenty people are attending, including staff from Atlassian, Facebook, Google and Mozilla.

    I expect code and discussion about various topics, among them:

    • the development process of mercurial itself,
    • performance improvements on huge repositories,
    • integration of Obsolete Markers into mercurial core,
    • improvements on various aspects (merge diff, moving some extensions into core, ...)

    I'm of course very interested in the Obsolete Markers topic. I've been working on an experimental implementation for several months. A handful of people at Logilab have been using them for two months and the feedback is very promising.


  • Debian bug squashing party in Paris

    2012/02/16 by Julien Cristau

    Logilab will be present at the upcoming Debian BSP in Paris this week-end. This event will focus on fixing as many "release critical" bugs as possible, to help with the preparation of the upcoming Debian 7.0 "wheezy" release. It will also provide an opportunity to introduce newcomers to the processes of Debian development and bug fixing, and for contributors in various areas of the project to interact "in real life".

    http://www.logilab.org/file/88881?vid=download

    The current stable release, Debian 6.0 "squeeze", came out in February 2011. The development of "wheezy" is scheduled to freeze in June 2012, for an eventual release later this year.

    Among the things we hope to work on during this BSP: the latest HDF5 release (1.8.8) includes API and packaging changes that require some changes in dependent packages. With the number of scientific packages relying on HDF5, this is a pretty big change, as tracked in this Debian bug.


  • Introduction To Mercurial Phases (Part III)

    2012/02/03 by Pierre-Yves David

    This is the final part of a series of posts about the new phases feature we implemented for mercurial 2.1. The first part talks about how phases will help mercurial users, the second part explains how to control them. This one explains what people should take care of when upgrading.

    Important upgrade note and backward compatibility

    Phases do not require any conversion of your repos. Phase information is not stored in changesets. Everybody using a new client will take advantage of phases on any repository they touch.

    However there are some points you need to be aware of regarding the interaction between the old world without phases and the new world with phases:

    Talking over the wire to a phaseless server using a phased client

    As ever, the Mercurial wire protocol (used to communicate through http and ssh) is fully backward compatible [1]. But as old Mercurial versions are not aware of phases, old servers will always be treated as publishing.

    Direct file system access to a phaseless repository using a phased client

    A new client has no way to determine which parts of the history should be immutable and which parts should not. In order to fail safely, a new repo will mark everything as public when no data is available. For example, in the scenario described in part I, if an old version of mercurial were used to clone and commit, a new version of mercurial will see them as public and refuse to rebase them.

    Note

    Some extensions (like mq) may provide smarter logic to set some changesets to the draft or even secret phases.

    The phased client will write phase data to the old repo on its first write operation.

    Direct file system access to a phased repository using a phaseless client

    Everything works fine except that the old client is unable to see or manipulate phases:

    • Changesets added to the repo inherit the phase of their parents, whatever the parents' phase. This could result in new commits being seen as public or pulled content seen as draft or even secret when a newer client uses the repo again!
    • Changesets pushed to a publishing server won't be set public.
    • Secret changesets are exchanged.
    • Old clients are willing to rewrite immutable changesets (as they don't know that they shouldn't).

    So, if you actively rewrite your history or use secret changesets, you should ensure that only new clients touch those repositories where the phase matters.

    Fixing phases error

    Several situations can result in bad phases in a repository:

    • When upgrading from phaseless to phased Mercurial, the default phases picked may be too restrictive.
    • When you let an old client touch your repository.
    • When you push to a publishing server that should not actually be publishing.

    The easiest way to restore a consistent state is to use the phase command. In most cases, changesets marked as public but absent from your real public server should be moved to draft:

    hg phase --force --draft 'public() and outgoing()'
    

    If you have multiple public servers, you can pull from the others to retrieve their phase data too.

    Conclusion

    Mercurial's phases are a simple concept that adds always-on and transparent safety for most users while not preventing advanced ones from doing whatever they want.

    Behind this safety-enabling and useful feature, phases introduce in Mercurial code the concept of sharing mutable parts of history. The introduction of this feature paves the way for advanced history rewriting solutions while allowing safe and easy sharing of mutable parts of history. I'll post about those future features shortly.


    [1]You can expect the 0.9.0 version of Mercurial to interoperate cleanly with one released 5 years later.

    [Images by Crystian Cruz (cc-nd) and C.J. Peters (cc-by-sa)]


  • Introduction To Mercurial Phases (Part II)

    2012/02/02 by Pierre-Yves David

    This is the second part of a series of posts about the new phases feature we implemented for mercurial 2.1. The first part talks about how phases will help mercurial users, this second part explains how to control them.

    Controlling automatic phase movement

    Sometimes it may be desirable to push and pull changesets in the draft phase to share unfinished work. Below are some cases:

    • pushing to continuous integration,
    • pushing changesets for review,
    • user has multiple machines,
    • branch clone.

    You can disable publishing behavior in a repository configuration file [1]:

    [phases]
       publish=False
       

    When a repository is set to non-publishing, people push changesets without altering their phase. draft changesets are pushed as draft and public changesets are pushed as public:

    celeste@Chessy ~/palace $ hg showconfig phases
       phases.publish=False
       
    babar@Chessy ~/palace $ hg log --graph
       @  [draft] add a carpet (2afbcfd2af83)
       |
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       babar@Chessy ~/palace $ hg outgoing ~celeste/palace/
       [public] Add wall color (0d1feb1bca54)
       [public] Add a table in the kichen (139ead8a540f)
       [draft] add a carpet (3c1b19d5d3f5)
       babar@Chessy ~/palace $ hg push ~celeste/palace/
       pushing to ~celeste/palace/
       searching for changes
       adding changesets
       adding manifests
       adding file changes
       added 3 changesets with 3 changes to 2 files
       babar@Chessy ~/palace $ hg log --graph
       @  [draft] add a carpet (2afbcfd2af83)
       |
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       
    celeste@Chessy ~/palace $ hg log --graph
       o  [draft] add a carpet (2afbcfd2af83)
       |
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       

    And pulling gives the phase as in the remote repository:

    celeste@Chessy ~/palace $ hg up 139ead8a540f
       celeste@Chessy ~/palace $ echo The wall will be decorated with portraits >> bedroom
       celeste@Chessy ~/palace $ hg ci -m 'Decorate the wall.'
       created new head
       celeste@Chessy ~/palace $ hg log --graph
       @  [draft] Decorate the wall. (3389164e92a1)
       |
       | o  [draft] add a carpet (3c1b19d5d3f5)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       ---
       babar@Chessy ~/palace $ hg pull ~celeste/palace/
       pulling from ~celeste/palace/
       searching for changes
       adding changesets
       adding manifests
       adding file changes
       added 1 changesets with 1 changes to 1 files (+1 heads)
       babar@Chessy ~/palace $ hg log --graph
       @  [draft] Decorate the wall. (3389164e92a1)
       |
       | o  [draft] add a carpet (3c1b19d5d3f5)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       

    Phase information is exchanged during pull and push operations. When a changeset exists on both sides but within different phases, its phase is unified to the lowest [2] phase. For instance, if a changeset is draft locally but public remotely, it is set public:

    celeste@Chessy ~/palace $ hg push -r 3389164e92a1
       pushing to http://hg.celesteville.com/palace
       searching for changes
       adding changesets
       adding manifests
       adding file changes
       added 1 changesets with 1 changes to 1 files
       celeste@Chessy ~/palace $ hg log --graph
       @  [public] Decorate the wall. (3389164e92a1)
       |
       | o  [draft] add a carpet (3c1b19d5d3f5)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       ---
       babar@Chessy ~/palace $ hg pull ~celeste/palace/
       pulling from ~celeste/palace/
       searching for changes
       no changes found
       babar@Chessy ~/palace $ hg log --graph
       @  [public] Decorate the wall. (3389164e92a1)
       |
       | o  [draft] add a carpet (3c1b19d5d3f5)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       o  [public] Add wall color (0d1feb1bca54)
       |
       
       

    Note

    pull is a read-only operation and does not alter phases in remote repositories.

    You can also control the phase in which a new changeset is committed. If you don't want new changesets to be pushed without explicit consent, update your configuration with:

    [phases]
       new-commit=secret
       

    You will need to use manual phase movement before you can push them. See the next section for details:

    Note

    With what has been done so far for 2.1, the "most practical way to make a new commit secret" is to use:

       hg commit --config phases.new-commit=secret
       
    [1]You can use this setting in your user hgrc too.
    [2]Phases are ordered as follows: public < draft < secret

    Manual phase movement

    Most phase movements should be automatic and transparent. However it is still possible to move phase manually using the hg phase command:

    babar@Chessy ~/palace $ hg log --graph
       @    [draft] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [draft] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       babar@Chessy ~/palace $ hg phase --public 3c1b19d5d3f5
       babar@Chessy ~/palace $ hg log --graph
       @    [draft] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [public] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       

    Changesets only move to lower [2] phases during normal operation. By default, the phase command enforces this rule:

    babar@Chessy ~/palace $ hg phase --draft 3c1b19d5d3f5
       no phases changed
       babar@Chessy ~/palace $ hg log --graph
       @    [draft] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [public] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       

    If you are confident in what you are doing, you can still use the --force switch to override this behavior:

    Warning

    Phases are designed to avoid forcing people to use hg phase --force. If you need to use --force on a regular basis, you are probably doing something wrong. Read the previous section again to see how to configure your environment for automatic phase movement suitable to your needs.

    babar@Chessy ~/palace $ hg phase --verbose --force --draft 3c1b19d5d3f5
       phase change for 1 changesets
       babar@Chessy ~/palace $ hg log --graph
       @    [draft] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [draft] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       

    Note that a phase defines a consistent set of revisions in your history graph. This means that to have a public (immutable) changeset, all its ancestors need to be immutable too. Once you have a secret (not exchanged) changeset, all its descendants will be secret too.

    This means that changing the phase of a changeset may result in phase movement for other changesets:

    babar@Chessy ~/palace $ hg phase -v --public f728ef4eba9f # merge with Celeste works
       phase change for 2 changesets
       babar@Chessy ~/palace $ hg log --graph
       @    [public] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [public] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       babar@Chessy ~/palace $ hg phase -vf --draft 3c1b19d5d3f5 # add a carpet
       phase change for 2 changesets
       babar@Chessy ~/palace $ hg log --graph
       @    [draft] merge with Celeste works (f728ef4eba9f)
       |\
       o |  [draft] add a carpet (3c1b19d5d3f5)
       | |
       | o  [public] Decorate the wall. (3389164e92a1)
       |/
       o  [public] Add a table in the kichen (139ead8a540f)
       |
       
       

    The next and final post will explain how older mercurial versions interact with newer versions that support phases.

    [Images by Jimmy Smith (cc-by-nd) and Cory Doctorow (cc-by-sa)]


  • Introduction To Mercurial Phases (Part I)

    2012/02/02 by Pierre-Yves David
    credit: redshirtjosh, http://www.flickr.com/photos/43273828@N06/4111258568/

    On behalf of Logilab I put a lot of effort into including a new core feature named phases in Mercurial 2.1. Phases are a system for tracking which changesets have been or should be shared. This helps to prevent common mistakes when modifying history (for instance, with the mq or rebase extensions). It will transparently benefit all users. This concept is the first step towards simple, safe and powerful history-rewriting mechanisms in mercurial.

    This series of three blog entries will explain:

    1. how phases will help mercurial users,
    2. how one can control them,
    3. how older mercurial versions interact with newer versions that support phases.

    Preventing erroneous history rewriting

    credit: anita.priks, http://www.flickr.com/photos/46785534@N06/6358218623/

    History rewriting is a common practice in DVCS. However when done the wrong way the most common error results in duplicated history. The phase concept aims to make rewriting history safer. For this purpose Mercurial 2.1 introduces a distinction between the "past" part of your history (that is expected to stay there forever) and the "present" part of the history (that you are currently evolving). The old and immutable part is called public and the mutable part of your history is called draft.

    Let's see how this happens using a simple scenario.


    A new Mercurial user clones a repository:

    babar@Chessy ~ $ hg clone http://hg.celesteville.com/palace
    requesting all changes
    adding changesets
    adding manifests
    adding file changes
    added 2 changesets with 2 changes to 2 files
    updating to branch default
    2 files updated, 0 files merged, 0 files removed, 0 files unresolved
    babar@Chessy ~/palace $ cd palace
    babar@Chessy ~/palace $ hg log --graph
    @  changeset:   1:2afbcfd2af83
    |  tag:         tip
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:41:56 2012 +0100
    |  summary:     We need a kitchen too.
    |
    o  changeset:   0:898889b143fb
       user:        Celeste the Queen <Celeste@celesteville.com>
       date:        Wed Jan 25 16:39:07 2012 +0100
       summary:     First description of the throne room
    

    The repository already contains some changesets. Our user makes some improvements and commits them:

    babar@Chessy ~/palace $ echo The wall shall be Blue >> throne-room
    babar@Chessy ~/palace $ hg ci -m 'Add wall color'
    babar@Chessy ~/palace $ echo In the middle stands a three meters round table >> kitchen
    babar@Chessy ~/palace $ hg ci -m 'Add a table in the kichen'
    

    But when he tries to push new changesets, he discovers that someone else already pushed one:

    babar@Chessy ~/palace $ hg push
    pushing to http://hg.celesteville.com/palace
    searching for changes
    abort: push creates new remote head bcd4d53319ec!
    (you should pull and merge or use push -f to force)
    babar@Chessy ~/palace $ hg pull
    pulling from http://hg.celesteville.com/palace
    searching for changes
    adding changesets
    adding manifests
    adding file changes
    added 1 changesets with 1 changes to 1 files (+1 heads)
    (run 'hg heads' to see heads, 'hg merge' to merge)
    babar@Chessy ~/palace $ hg log --graph
    o  changeset:   4:0a5b3d7e4e5f
    |  tag:         tip
    |  parent:      1:2afbcfd2af83
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:58:23 2012 +0100
    |  summary:     Some bedroom description.
    |
    | @  changeset:   3:bcd4d53319ec
    | |  user:        Babar the King <babar@celesteville.com>
    | |  date:        Wed Jan 25 16:52:02 2012 +0100
    | |  summary:     Add a table in the kichen
    | |
    | o  changeset:   2:f9f14815935d
    |/   user:        Babar the King <babar@celesteville.com>
    |    date:        Wed Jan 25 16:51:51 2012 +0100
    |    summary:     Add wall color
    |
    o  changeset:   1:2afbcfd2af83
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:41:56 2012 +0100
    |  summary:     We need a kitchen too.
    |
    o  changeset:   0:898889b143fb
       user:        Celeste the Queen <Celeste@celesteville.com>
       date:        Wed Jan 25 16:39:07 2012 +0100
       summary:     First description of the throne room
    

    Note

    From here on this scenario becomes very unlikely. Mercurial is simple enough for a new user not to be that confused by such a trivial situation. But we keep the example simple to focus on phases.

    Recently, our new user read some hype blog about "rebase" and the benefit of linear history. So, he decides to rewrite his history instead of merging.

    Despite reading the wonderful rebase help, our new user makes the wrong decision when it comes to using it. He decides to rebase the remote changeset 0a5b3d7e4e5f:"Some bedroom description." on top of his local changeset.

    With previous versions of mercurial, this mistake was allowed and would result in a duplication of the changeset 0a5b3d7e4e5f:"Some bedroom description."

    babar@Chessy ~/palace $ hg rebase -s 4 -d 3
    babar@Chessy ~/palace $ hg push
    pushing to http://hg.celesteville.com/palace
    searching for changes
    abort: push creates new remote head bcd4d53319ec!
    (you should pull and merge or use push -f to force)
    babar@Chessy ~/palace $ hg pull
    pulling from http://hg.celesteville.com/palace
    searching for changes
    adding changesets
    adding manifests
    adding file changes
    added 1 changesets with 1 changes to 1 files (+1 heads)
    (run 'hg heads' to see heads, 'hg merge' to merge)
    babar@Chessy ~/palace $ hg log --graph
    @  changeset:   5:55d9bae1e1cb
    |  tag:         tip
    |  parent:      3:bcd4d53319ec
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:58:23 2012 +0100
    |  summary:     Some bedroom description.
    |
    | o  changeset:   4:0a5b3d7e4e5f
    | |  parent:      1:2afbcfd2af83
    | |  user:        Celeste the Queen <Celeste@celesteville.com>
    | |  date:        Wed Jan 25 16:58:23 2012 +0100
    | |  summary:     Some bedroom description.
    | |
    o |  changeset:   3:bcd4d53319ec
    | |  user:        Babar the King <babar@celesteville.com>
    | |  date:        Wed Jan 25 16:52:02 2012 +0100
    | |  summary:     Add a table in the kichen
    | |
    o |  changeset:   2:f9f14815935d
    |/   user:        Babar the King <babar@celesteville.com>
    |    date:        Wed Jan 25 16:51:51 2012 +0100
    |    summary:     Add wall color
    |
    o  changeset:   1:2afbcfd2af83
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:41:56 2012 +0100
    |  summary:     We need a kitchen too.
    |
    o  changeset:   0:898889b143fb
       user:        Celeste the Queen <Celeste@celesteville.com>
       date:        Wed Jan 25 16:39:07 2012 +0100
       summary:     First description of the throne room
    

    In more complicated setups it's a fairly common mistake, even in big and successful projects and with other DVCSs.

    In the new Mercurial version the user won't be able to make this mistake anymore. Trying to rebase the wrong way will result in:

    babar@Chessy ~/palace $ hg rebase -s 4 -d 3
    abort: can't rebase immutable changeset 0a5b3d7e4e5f
    (see hg help phases for details)
    

    The correct rebase still works as expected:

    babar@Chessy ~/palace $ hg rebase -s 2 -d 4
    babar@Chessy ~/palace $ hg log --graph
    @  changeset:   4:139ead8a540f
    |  tag:         tip
    |  user:        Babar the King <babar@celesteville.com>
    |  date:        Wed Jan 25 16:52:02 2012 +0100
    |  summary:     Add a table in the kichen
    |
    o  changeset:   3:0d1feb1bca54
    |  user:        Babar the King <babar@celesteville.com>
    |  date:        Wed Jan 25 16:51:51 2012 +0100
    |  summary:     Add wall color
    |
    o  changeset:   2:0a5b3d7e4e5f
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:58:23 2012 +0100
    |  summary:     Some bedroom description.
    |
    o  changeset:   1:2afbcfd2af83
    |  user:        Celeste the Queen <Celeste@celesteville.com>
    |  date:        Wed Jan 25 16:41:56 2012 +0100
    |  summary:     We need a kitchen too.
    |
    o  changeset:   0:898889b143fb
       user:        Celeste the Queen <Celeste@celesteville.com>
       date:        Wed Jan 25 16:39:07 2012 +0100
       summary:     First description of the throne room
    

    What is happening here:

    • Changeset 0a5b3d7e4e5f from Celeste was set to the public phase because it was pulled from the outside. The public phase is immutable.
    • Changesets f9f14815935d and bcd4d53319ec (rebased as 0d1feb1bca54 and 139ead8a540f) have been committed locally and haven't been transmitted from this repository to another. As such, they are still in the draft phase. Unlike the public phase, the draft phase is mutable.

    Let's watch the whole action in slow motion, paying attention to phases:

    babar@Chessy ~ $ cat >> ~/.hgrc << EOF
    [ui]
    username=Babar the King <babar@celesteville.com>
    logtemplate='[{phase}] {desc} ({node|short})\\n'
    EOF
    

    First, changesets cloned from a public server are public:

    babar@Chessy ~ $ hg clone --quiet http://hg.celesteville.com/palace
    babar@Chessy ~/palace $ cd palace
    babar@Chessy ~/palace $ hg log --graph
    @  [public] We need a kitchen too. (2afbcfd2af83)
    |
    o  [public] First description of the throne room (898889b143fb)
    

    Second, new changesets committed locally are in the draft phase:

    babar@Chessy ~/palace $ echo The wall shall be Blue >> throne-room
    babar@Chessy ~/palace $ hg ci -m 'Add wall color'
    babar@Chessy ~/palace $ echo In the middle stand a three meters round table >> kitchen
    babar@Chessy ~/palace $ hg ci -m 'Add a table in the kichen'
    babar@Chessy ~/palace $ hg log --graph
    @  [draft] Add a table in the kichen (bcd4d53319ec)
    |
    o  [draft] Add wall color (f9f14815935d)
    |
    o  [public] We need a kitchen too. (2afbcfd2af83)
    |
    o  [public] First description of the throne room (898889b143fb)
    

    Third, changesets pulled from a public server are public:

    babar@Chessy ~/palace $ hg pull --quiet
    babar@Chessy ~/palace $ hg log --graph
    o  [public] Some bedroom description. (0a5b3d7e4e5f)
    |
    | @  [draft] Add a table in the kichen (bcd4d53319ec)
    | |
    | o  [draft] Add wall color (f9f14815935d)
    |/
    o  [public] We need a kitchen too. (2afbcfd2af83)
    |
    o  [public] First description of the throne room (898889b143fb)
    

    Note

    rebase preserves the phase of rebased changesets

    babar@Chessy ~/palace $ hg rebase -s 2 -d 4
    babar@Chessy ~/palace $ hg log --graph
    @  [draft] Add a table in the kichen (139ead8a540f)
    |
    o  [draft] Add wall color (0d1feb1bca54)
    |
    o  [public] Some bedroom description. (0a5b3d7e4e5f)
    |
    o  [public] We need a kitchen too. (2afbcfd2af83)
    |
    o  [public] First description of the throne room (898889b143fb)
    

    Finally, once pushed to the public server, changesets are set to the public (immutable) phase

    babar@Chessy ~/palace $ hg push
    pushing to http://hg.celesteville.com/palace
    searching for changes
    adding changesets
    adding manifests
    adding file changes
    added 2 changesets with 2 changes to 2 files
    babar@Chessy ~/palace $ hg log --graph
    
    @  [public] Add a table in the kichen (139ead8a540f)
    |
    o  [public] Add wall color (0d1feb1bca54)
    |
    o  [public] Some bedroom description. (0a5b3d7e4e5f)
    |
    o  [public] We need a kitchen too. (2afbcfd2af83)
    |
    o  [public] First description of the throne room (898889b143fb)
    

    To summarize:

    • Changesets exchanged with the outside are public and immutable.
    • Changesets committed locally are draft until exchanged with the outside.
    • As a user, you should not worry about phases. Phases move transparently.
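
    As an aside (not covered in this walkthrough), the Mercurial release that introduces phases also ships an hg phase command to inspect the phase of a changeset, and to move it by hand when you really know what you are doing. The transcript below is only indicative and the output format may differ slightly:

    babar@Chessy ~/palace $ hg phase -r tip
    4: public
    babar@Chessy ~/palace $ hg phase --draft --force -r tip  # force it back to draft, use with care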

    Preventing premature exchange of history

    credit: Richard Elzey, http://www.flickr.com/photos/elzey/3516256055/

    The public phase prevents users from accidentally rewriting public history. That is a good step forward, but phases can go further: they can prevent you from accidentally making history public in the first place.

    For this purpose, a third phase is available, the secret phase. To explain it, I'll use the mq extension which is nicely integrated with this secret phase:

    Our fellow user enables the mq extension:

    babar@Chessy ~/palace $ vim ~/.hgrc
    babar@Chessy ~/palace $ cat ~/.hgrc
    [ui]
    username=Babar the King <babar@celesteville.com>
    [extensions]
    # enable the mq extension included with Mercurial
    hgext.mq=
    [mq]
    # Enable secret phase integration.
    # This integration is off by default for backward compatibility.
    secret=true
    

    New patches (as opposed to regular commits) are now created in the secret phase:

    babar@Chessy ~/palace $ echo A red carpet on the floor. >> throne-room
    babar@Chessy ~/palace $ hg qnew -m 'add a carpet' carpet.diff
    babar@Chessy ~/palace $ hg log --graph
    
    @  [secret] add a carpet (3c1b19d5d3f5)
    |
    @  [public] Add a table in the kichen (139ead8a540f)
    |
    o  [public] Add wall color (0d1feb1bca54)
    |
    
    

    This secret changeset is excluded from outgoing and push:

    babar@Chessy ~/palace $ hg outgoing
    comparing with http://hg.celesteville.com/palace
    searching for changes
    no changes found (ignored 1 secret changesets)
    babar@Chessy ~/palace $ hg push
    pushing to http://hg.celesteville.com/palace
    searching for changes
    no changes found (ignored 1 secret changesets)
    

    And other users do not see it:

    celeste@Chessy ~/palace $ hg incoming ~babar/palace/
    comparing with ~babar/palace
    searching for changes
    [public] Add wall color (0d1feb1bca54)
    [public] Add a table in the kichen (139ead8a540f)
    

    The mq integration takes care of phase movement for the user. Changesets are made draft by qfinish:

    babar@Chessy ~/palace $ hg qfinish .
    babar@Chessy ~/palace $ hg log --graph
    @  [draft] add a carpet (2afbcfd2af83)
    |
    o  [public] Add a table in the kichen (139ead8a540f)
    |
    o  [public] Add wall color (0d1feb1bca54)
    |
    
    

    And changesets are made secret again by qimport:

    babar@Chessy ~/palace $ hg qimport -r 2afbcfd2af83
    babar@Chessy ~/palace $ hg log --graph
    @  [secret] add a carpet (2afbcfd2af83)
    |
    o  [public] Add a table in the kichen (139ead8a540f)
    |
    o  [public] Add wall color (0d1feb1bca54)
    |
    
    

    As expected, mq refuses to qimport public changesets:

    babar@Chessy ~/palace $ hg qimport -r 139ead8a540f
    abort: revision 4 is not mutable
    

    In the next part, I'll detail how to control phase movement.


  • Generating a user interface from a Yams model

    2012/01/09 by Nicolas Chauvat

    Yams is a pythonic way to describe an entity-relationship model. It is used at the core of the CubicWeb semantic web framework in order to automate lots of things, including the generation and validation of forms. Although we have been using the MVC design pattern to write user interfaces with Qt and Gtk before we started CubicWeb, we never got to reuse Yams. I am on my way to fix this.

    Here is the simplest possible example that generates a user interface (using dialog and python-dialog) to input data described by a Yams data model.

    First, let's write a function that builds the data model:

    def mk_datamodel():
        from yams.buildobjs import EntityType, RelationDefinition, Int, String
        from yams.reader import build_schema_from_namespace
    
        class Question(EntityType):
            number = Int()
            text = String()
    
        class Form(EntityType):
            title = String()
    
        class in_form(RelationDefinition):
            subject = 'Question'
            object = 'Form'
            cardinality = '*1'
    
        return build_schema_from_namespace(vars().items())
    

    Here is what you get when displaying the schema of that data model with graphviz or xdot:

    import os
    from yams import schema2dot
    
    datamodel = mk_datamodel()
    schema2dot.schema2dot(datamodel, '/tmp/toto.dot')
    os.system('xdot /tmp/toto.dot')
    
    http://www.logilab.org/file/87002?vid=download

    To make a step in the direction of genericity, let's add a class that abstracts the dialog API:

    import sys  # needed below for sys.stdin.encoding in input_string()
    
    class InterfaceDialog:
        """Dialog-based Interface"""
        def __init__(self, dlg):
            self.dlg = dlg
    
        def input_list(self, invite, options) :
            assert len(options) != 0, str(invite)
            choice = self.dlg.radiolist(invite, list=options, selected=1)
            if choice is not None:
                return choice.lower()
            else:
                raise Exception('operation cancelled')
    
        def input_string(self, invite, default):
            return self.dlg.inputbox(invite, init=default).decode(sys.stdin.encoding)
    

    And now let's put everything together:

    datamodel = mk_datamodel()
    
    import dialog
    ui = InterfaceDialog(dialog.Dialog())
    ui.dlg.setBackgroundTitle('Dialog Interface with Yams')
    
    objs = []
    for entitydef in datamodel.entities():
        if entitydef.final:
            continue
        obj = {}
        for attr in entitydef.attribute_definitions():
            if attr[1].type in ('String','Int'):
                obj[str(attr[0])] = ui.input_string('%s.%s' % (entitydef,attr[0]), '')
        try:
            entitydef.check(obj)
        except Exception, exc:
            ui.dlg.scrollbox(str(exc))
        objs.append(obj)  # collect the entity we just filled in
    
    print objs
    
    http://www.logilab.org/file/87001?vid=download

    The result is a program that will prompt the user for the title of a form and the text/number of a question, then enforce the type constraints and display the inconsistencies.

    The above is very simple and does very little, but if you read the documentation of Yams and if you think about generating the UI with Gtk or Qt instead of dialog, or if you have used the form mechanism of CubicWeb, you'll understand that this proof of concept opens a door to a lot of possibilities.

    I will come back to this topic in a later article and give an example of integrating the above with pigg, a simple MVC library for Gtk, to make the programming of user-interfaces even more declarative and bug-free.


  • Interesting things seen at the Afpy Computer Camp

    2011/11/28 by Pierre-Yves David

    This summer I spent three days in Burgundy at the Afpy Computer Camps. This yearly meeting gathers French-speaking Python developers for talking and coding. The main points of this 2011 edition were:

    http://www.afpy.org/_public/images/logo_afpy.png

    The new IPython 0.11 was shown by Olivier Grisel. This new version contains lots of impressive features, like inline figures, asynchronous execution, exportable sessions, and a web-browser based client. IPython was also presented by its main author Fernando Perez during his keynote talk at EuroSciPy. Since then, Logilab got involved with IPython: we contributed to the Debian packaging of IPython dependencies and we joined the discussion about reStructuredText formatting for notebooks.

    http://ipython.org/ipython-doc/rel-0.11/_static/logo.png

    Tarek Ziade bootstrapped his new Red Barrel project, a small framework to build modern web services with multiple back-ends, including the new socket.io protocol.

    Alexis Métaireau and Feth Arezki discovered their common interest in account-tracking applications. The discussion resulted in a first release of I hate money a few months later.

    For my part, I spent most of my time working with Boris Feld on the Python Testing Infrastructure, a continuous integration tool to test Python distributions available on PyPI.

    http://master.pyti.org/data/pyti.ico.png

    The yearly Afpy Computer Camps is an event intended for Python developers, but the Afpy also organizes events for non-Python developers. The next one is tonight in Paris at La Cantine: "Vous reprendrez bien un peu de python ?". See you tonight?


  • Python in Finance (and Derivative Analytics)

    2011/10/25 by Damien Garaud

    The Logilab team attended (and co-organized) EuroScipy 2011, at the end of August in Paris.

    We saw some interesting posters and a presentation dealing with Python in finance and derivative analytics [1].

    In order to debunk the idea that "all computation libraries dedicated to financial applications must be written in C/C++ or some other compiled programming language", I would like to introduce a more Pythonic way.

    You may know that financial applications such as risk management in most cases have high computational needs. For instance, it can be necessary to perform a large number of Monte Carlo simulations in order to evaluate an American option within a few seconds.

    The Python community provides several reliable and efficient libraries and packages dedicated to numerical computations:

    http://numpy.scipy.org/_static/numpy_logo.png https://scikits.appspot.com/static/images/scipyshiny_small.png
    • the well-known SciPy and NumPy libraries. They provide a complete set of tools to work with matrices, linear algebra operations, singular value decompositions, multivariate regression models, ...
    • scikits is a set of add-on toolkits for SciPy. For instance, there are statistical models in the statsmodels package, a toolkit dedicated to timeseries manipulation and another one dedicated to numerical optimization;
    • pandas is a recent Python package which provides "fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.". pandas uses Cython to improve its performance. Moreover, pandas has been used extensively in production in financial applications;
    http://docs.cython.org/_static/cython-logo-light.png
    • Cython is a way to write C extensions for the Python language. Since you write Cython code in much the same way as you write Python code, it is easy to use. Is it fast? Yes! I compared a simple example from Cython's official documentation with the equivalent full-Python code -- a piece of code which computes the first k prime numbers. The Cython code is almost thirty times faster than the full-Python code (non-official benchmark). Furthermore, you can use NumPy in Cython code!
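
    To give a concrete flavour of the NumPy-based approach, here is a minimal sketch (not taken from the cited talks; all figures and parameter names are made up for illustration) that prices a European call option by Monte Carlo simulation under Black-Scholes dynamics:

    import numpy as np
    
    def mc_european_call(s0, strike, rate, sigma, maturity, n_paths=1000000):
        """Estimate the price of a European call from simulated terminal prices."""
        z = np.random.standard_normal(n_paths)
        # terminal asset price under the risk-neutral measure
        st = s0 * np.exp((rate - 0.5 * sigma ** 2) * maturity
                         + sigma * np.sqrt(maturity) * z)
        payoff = np.maximum(st - strike, 0.0)
        # discounted expected payoff
        return np.exp(-rate * maturity) * payoff.mean()
    
    print(mc_european_call(s0=100.0, strike=105.0, rate=0.02, sigma=0.2, maturity=1.0))

    Vectorizing the simulation this way keeps the whole computation inside compiled NumPy code, which is often enough before reaching for Cython.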

    I believe that, thanks to these useful tools and libraries, Python can be used for numerical computation, even in finance (both research and production). It is easy to maintain without sacrificing performance.

    Note that you can find some other references on the Visixion web pages.


  • Rss feeds aggregator based on Scikits.learn and CubicWeb

    2011/10/17 by Vincent Michel

    During EuroScipy, the Logilab team presented an original approach for querying news using semantic information: "Rss feeds aggregator based on Scikits.learn and CubicWeb" by Vincent Michel. This work is based on two major pieces of software:

    http://www.cubicweb.org/data/index-cubicweb.png
    • CubicWeb, the pythonic semantic web framework, is used to store and query Dbpedia information. CubicWeb is able to reconstruct links from rdf/nt files, and can easily execute complex queries in a database with more than 8 million entities and 75 million links when using a PostgreSQL backend.
    http://scipy-lectures.github.com/_images/scikit-learn-logo.png
    • Scikits.learn is a cutting-edge Python toolbox for machine learning. It provides algorithms that are simple and easy to use.
    http://www.pfeifermachinery.com/img/rss.png

    Based on these tools, we built a pure Python application to query the news:

    • Named Entities are extracted from the RSS articles of a few mainstream English newspapers (New York Times, Reuters, BBC News, etc.): for each group of words in an article, we check whether a Dbpedia entry has the same label. If so, we create a semantic link between the article and the Dbpedia entry.
    • An occurrence matrix of "RSS Articles" times "Named Entities" is constructed and may be used with several machine learning algorithms (MeanShift algorithm, Hierarchical Clustering) in order to provide original and informative views of recent events; see the clustering sketch below.
    http://wiki.dbpedia.org/images/dbpedia_logo.png
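
    As a rough sketch of that clustering step (a toy example with made-up articles and entities, using the modern scikit-learn import path rather than the scikits.learn one of 2011):

    import numpy as np
    from sklearn.cluster import MeanShift
    
    entities = ['Obama', 'NASA', 'Mars', 'Senate']
    articles = {
        'article-1': {'Obama', 'Senate'},
        'article-2': {'NASA', 'Mars'},
        'article-3': {'Obama'},
        'article-4': {'Mars', 'NASA'},
    }
    
    # binary occurrence matrix: one row per article, one column per named entity
    X = np.array([[1.0 if entity in found else 0.0 for entity in entities]
                  for found in articles.values()])
    
    labels = MeanShift(bandwidth=1.0).fit(X).labels_
    for name, label in zip(articles, labels):
        print(name, '-> cluster', label)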

    Moreover, queries may be used jointly with semantic information from Dbpedia:

    • All musical artists in the news:

      DISTINCT Any E, R WHERE E appears_in_rss R, E has_type T, T label "musical artist"
      
    • All living office holder persons in the news:

      DISTINCT Any E WHERE E appears_in_rss R, E has_type T, T label "office holder", E has_subject C, C label "Living people"
      
    • All news that talk about Barack Obama and any scientist:

      DISTINCT Any R WHERE E1 label "Barack Obama", E1 appears_in_rss R, E2 appears_in_rss R, E2 has_type T, T label "scientist"
      
    • All news that talk about a drug:

      Any X, R WHERE X appears_in_rss R, X has_type T, T label "drug"
      

    Such a tool may be used for informetrics and news analysis. Feel free to download the complete slides of the presentation.


  • Helping pylint to understand things it doesn't

    2011/10/10 by Sylvain Thenault

    The latest release of logilab-astng (0.23), the underlying source code representation library used by PyLint, provides a new API that may change pylint users' life in the near future...

    It aims to allow registration of functions that will be called after a module has been parsed. While this sounds dumb, it gives a chance to fix/enhance the understanding PyLint has about your code.

    I see this as a major step towards greatly enhanced code analysis, improving the situation where PyLint users know that when running it against code using their favorite framework (who said CubicWeb? :p ), they should expect a bunch of false positives because of black magic in the ORM or in decorators or whatever else. There are also places in the Python standard library where dynamic code can cause false positives in PyLint.

    The problem

    Let's take a simple example, and see how we can improve things using the new API. The following code:

    import hashlib
    
    def hexmd5(value):
        """"return md5 checksum hexadecimal digest of the given value"""
        return hashlib.md5(value).hexdigest()
    
    def hexsha1(value):
        """"return sha1 checksum hexadecimal digest of the given value"""
        return hashlib.sha1(value).hexdigest()
    

    gives the following output when analyzed through pylint:

    [syt@somewhere ~]$ pylint -E example.py
    No config file found, using default configuration
    ************* Module smarter_astng
    E:  5,11:hexmd5: Module 'hashlib' has no 'md5' member
    E:  9,11:hexsha1: Module 'hashlib' has no 'sha1' member
    

    However:

    [syt@somewhere ~]$ python
    Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
    [GCC 4.5.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import smarter_astng
    >>> smarter_astng.hexmd5('hop')
    '5f67b2845b51a17a7751f0d7fd460e70'
    >>> smarter_astng.hexsha1('hop')
    'cffb6b20e0eef296772f6c1457cdde0049bdfb56'
    

    The code runs fine... Why does pylint bother me then? If we take a look at the hashlib module, we see that there are no sha1 or md5 defined in there. They are defined dynamically according to Openssl library availability in order to use the fastest available implementation, using code like:

    for __func_name in __always_supported:
        # try them all, some may not work due to the OpenSSL
        # version not supporting that algorithm.
        try:
            globals()[__func_name] = __get_hash(__func_name)
        except ValueError:
            import logging
            logging.exception('code for hash %s was not found.', __func_name)
    

    Honestly, I don't blame PyLint for not understanding this kind of magic. The situation in this particular case could be improved, but that's tedious work, and there will always be "similar but different" cases that won't be understood.

    The solution

    The good news is that thanks to the new astng callback, I can help it be smarter! See the code below:

    from logilab.astng import MANAGER, scoped_nodes
    
    def hashlib_transform(module):
        if module.name == 'hashlib':
            for hashfunc in ('sha1', 'md5'):
                module.locals[hashfunc] = [scoped_nodes.Class(hashfunc, None)]
    
    def register(linter):
        """called when loaded by pylint --load-plugins, register our tranformation
        function here
        """
        MANAGER.register_transformer(hashlib_transform)
    

    What's in there?

    • A function that will be called with each astng module built during a pylint execution, i.e. not only the ones that you analyse, but also those accessed for type inference.
    • This transformation function is fairly simple: if the module is the 'hashlib' module, it will insert into its locals dictionary a fake class node for each desired name.
    • It is registered using the register_transformer method of astng's MANAGER (the central access point to built syntax trees). This is done in the register callback function of the pylint plugin API (called when the module is loaded with 'pylint --load-plugins').

    Now let's try it! Supposing I stored the above code in an 'astng_hashlib.py' module on my PYTHONPATH, I can now run pylint with the plugin activated:

    [syt@somewhere ~]$ pylint -E --load-plugins astng_hashlib example.py
    No config file found, using default configuration
    ************* Module smarter_astng
    E:  5,11:hexmd5: Instance of 'md5' has no 'hexdigest' member
    E:  9,11:hexsha1: Instance of 'sha1' has no 'hexdigest' member
    

    Hum. We now have a different error :( Pylint grasps that there are md5 and sha1 classes, but it complains that they don't have a hexdigest method. Indeed, we didn't give it a clue about that.

    We could go on and on to give it a full representation of the hashlib public API using the astng nodes API. But that would be painful, trust me. Or we could do something clever using a higher-level astng API:

    from logilab.astng import MANAGER
    from logilab.astng.builder import ASTNGBuilder
    
    def hashlib_transform(module):
        if module.name == 'hashlib':
            fake = ASTNGBuilder(MANAGER).string_build('''
    
    class md5(object):
      def __init__(self, value): pass
      def hexdigest(self):
        return u''
    
    class sha1(object):
      def __init__(self, value): pass
      def hexdigest(self):
        return u''
    
    ''')
            for hashfunc in ('sha1', 'md5'):
                module.locals[hashfunc] = fake.locals[hashfunc]
    
    def register(linter):
        """called when loaded by pylint --load-plugins, register our tranformation
        function here
        """
        MANAGER.register_transformer(hashlib_transform)
    

    The idea is to write a fake python implementation only documenting the prototype of the desired class, and to get an astng from it, using the string_build method of the astng builder. This method will return a Module node containing the astng for the given string. It's then easy to replace or insert additional information into the original module, as you can see in the above example.

    Now if I run pylint using the updated plugin:

    [syt@somewhere ~]$ pylint -E --load-plugins astng_hashlib example.py
    No config file found, using default configuration
    

    No error anymore, great!

    What's next?

    This fairly simple change could quickly provide great enhancements. We should probably improve the astng manipulation API now that it's exposed like that. But we can also easily imagine a code base of such pylint plugins maintained by each community around a Python library or framework. One could then use a stack of plugins matching the libraries used by their software, and have a greatly enhanced experience of using pylint.

    For a start, it would be great if pylint could be shipped with a plugin that explains all the magic found in the standard library, wouldn't it? Left as an exercise to the reader!


  • Text mode makes it into hgview 1.4.0

    2011/10/06 by Alain Leufroy

    Here is at last the release of the version 1.4.0 of hgview.

    http://www.logilab.org/image/77974?vid=download

    Small description

    Besides the classic bugfixes this release introduces a new text based user interface thanks to the urwid library.

    Running hgview in a shell, in a terminal, over an ssh session is now possible! If you are trying not to use X (or to use it less), or have a geeky mouse-killer window manager such as wmii/dwm/ion/awesome/..., this is for you!

    This TUI (Text User Interface!) adopts the principal features of the Qt4-based GUI, although only the main view has been implemented for now.

    In a nutshell, this interface includes the following features:

    • display the revision graph (with working directory as a node, and basic support for the mq extension),
    • display the files affected by a selected changeset (with basic support for the bfiles extension)
    • display diffs (with syntax highlighting thanks to pygments),
    • automatically refresh the displayed revision graph when the repository is being modified (requires pyinotify),
    • easy key-based navigation in revisions' history of a repo (same as the GUI),
    • a command system for special actions (see help)

    Installation

    There are packages for Debian and Ubuntu in Logilab's Debian repository.

    Note: you have to install the hgview-curses package to get the text-based interface.

    Or you can simply clone our Mercurial repository:

    hg clone http://hg.logilab.org/hgview

    (more on the hgview home page)

    Running the text based interface

    A new --interface option is now available to choose the interface:

    hgview --interface curses

    Or you can set it in the [hgview] section of your ~/.hgrc:

    [hgview]
    interface = curses # or qt or raw
    

    Then run:

    hgview

    What's next

    We'll be working on including other features from the Qt4 interface and making it fully configurable.

    We'll also work on bugfixes and new features, so stay tuned! And feel free to file bugs and feature requests.


  • Drawing UML diagrams with Python

    2011/09/26 by Nicolas Chauvat
    http://www.umlgraph.org/doc/seq-eg.gif?vid=download

    It started with a desire to draw diagrams of hierarchical systems with Python. Since this is similar to what we do in CubicWeb with schemas of the data model, I read the code and realized we had that graph submodule in the logilab.common library. This module uses dot from graphviz as a backend to draw the diagrams.

    Reading about UML diagrams drawn with GraphViz, I learned about UMLGraph, which uses GNU Pic to draw sequence diagrams. Pic is a language based on groff, and the pic2plot tool is part of plotutils (apt-get install plotutils). Here is a tutorial. I found some Python code wrapping pic2plot, available as a plugin for wikipad. It is worth noting that TeX seems to have a nice package for UML sequence diagrams called pgf-umlsd.

    Since nowadays everything is moving into the web browser, I looked for a JavaScript library that does what graphviz does, and I found canviz, which looks nice.

    If (only) I had time, I would extend pyreverse to draw sequence diagrams and not only class diagrams...


  • EuroSciPy'11 - Annual European Conference for Scientists using Python.

    2011/08/24 by Alain Leufroy
    http://www.logilab.org/image/9852?vid=download

    The EuroScipy2011 conference will be held in Paris at the Ecole Normale Supérieure from August 25th to 28th and is co-organized and sponsored by INRIA, Logilab and other companies.

    The conference is a cross-disciplinary gathering focused on the use and development of the Python language in scientific research.

    August 25th and 26th are dedicated to tutorial tracks -- basic and advanced tutorials. August 27th and 28th are dedicated to talks, posters and demo sessions.

    Damien Garaud, Vincent Michel and Alain Leufroy (and others) from Logilab will be there. We will talk about an RSS feeds aggregator based on Scikits.learn and CubicWeb, and we have a poster about LibAster (a Python library for thermomechanical simulation based on Code_Aster).


  • Pylint 0.24 / logilab-astng 0.22

    2011/07/21 by Sylvain Thenault

    Hi there!

    I'm pleased to announce new releases of pylint and its underlying library logilab-astng. See http://www.logilab.org/project/pylint/0.24.0 and http://www.logilab.org/project/logilab-astng/0.22.0 for more info.

    Those releases mostly include fixes and a few enhancements. Python 2.6 relative/absolute imports should now work fine and Python 3 support has been enhanced. There are still two remaining failures in the astng test suite when using Python 3, but we are unfortunately missing the resources to fix them yet.

    Many thanks to everyone who contributed to this release by submitting patches or by participating in the latest bug day.


  • pylint bug day #3 on july 8, 2011

    2011/07/04 by Sylvain Thenault

    Hey guys,

    we'll hold the next pylint bug day on July 8th, 2011 (Friday). If some of you want to come and work with us in our Paris office, you'll be welcome.

    You can also join us on jabber / irc:

    I know the announcement is a bit late, but I hope some of you will be able to come or be online anyway!

    Regarding the program, the goal is to decrease the number of tickets in the tracker. I'll try to do some triage earlier this week so you'll get a chance to talk about your super-important ticket that has not been selected. Of course, if you intend to work on it, there is a bigger chance of it being fixed next week-end ;)


  • Setting up my Microsoft Natural Keyboard under Debian Squeeze

    2011/06/08 by Nicolas Chauvat

    I upgraded to Debian Squeeze over the week-end and it broke my custom Xmodmap. While I was fixing it, I realized that the special keys of my Microsoft Natural keyboard that were not working under Lenny were now functional. The only piece missing was the "zoom" key. Here is how I got it to work.

    I found on the askubuntu forum a solution to the same problem, but it was missing the following details.

    To find which keysym to map, I listed input devices:

    $ ls /dev/input/by-id/
    usb-Logitech_USB-PS.2_Optical_Mouse-mouse        usb-Logitech_USB-PS_2_Optical_Mouse-mouse
    usb-Logitech_USB-PS_2_Optical_Mouse-event-mouse  usb-Microsoft_Natural??_Ergonomic_Keyboard_4000-event-kbd
    

    then used evtest to find the keysym:

    $ evtest /dev/input/by-id/usb-Microsoft*
    

    then used udevadm to find the identifiers:

    $ udevadm info --export-db | less
    

    then edited /lib/udev/rules.d/95-keymap.rules to add:

    ENV{ID_VENDOR}=="Microsoft", ENV{ID_MODEL_ID}=="00db", RUN+="keymap $name microsoft-natural-keyboard-4000"
    

    in the section keyboard_usbcheck

    and created the keymap file:

    $ cat /lib/udev/keymaps/microsoft-natural-keyboard-4000
    0xc022d pageup
    0xc022e pagedown
    

    then loaded the keymap:

    $ /lib/udev/keymap /dev/input/by-id/usb-Microsoft_Natural®_Ergonomic_Keyboard_4000-event-kbd /lib/udev/keymaps/microsoft-natural-keyboard-4000
    

    then used evtest again to check it was working.

    Of course, you do not have to map the events to pageup and pagedown, but I found it convenient to use that key to scroll up and down pages.

    Hope this helps :)


  • Coding sprint scikits.learn

    2011/03/22 by Vincent Michel

    We are planning a one-day coding sprint on scikits.learn on April 1st.
    Coming in person or participating remotely on IRC is more than welcome!

    More information can be found on the wiki:
    https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events


  • Distutils2 Sprint at Logilab (first day)

    2011/01/28 by Alain Leufroy

    We're very happy to host the Distutils2 sprint this week in Paris.

    The sprint started yesterday with some of Logilab's developers and other contributors. We'll sprint for four days, trying to push forward the new Python package manager.

    Let's summarize this first day:

    • Boris Feld and Pierre-Yves David worked on the new system for detecting and dispatching data-files.
    • Julien Miotte worked on
      • moving qGitFilterBranch from setuptools to distutils2
      • testing distutils2 installation and register (see the tutorial)
      • the backward compatibility with distutils in setup.py, using setup.cfg to fill the arguments of setup() and help users switch to distutils2.
    • André Espaze and Alain Leufroy worked on the Python script that helps developers build a setup.cfg by recycling their existing setup.py (track).

    Join us on IRC at #distutils on irc.freenode.net !


  • The Python Package Index is not a "Software Distribution"

    2011/01/26 by Pierre-Yves David

    Recent discussions on the #distutils irc channel and with my Logilab co-workers led me to the following conclusions:

    • The Python Package Index is not a software distribution
    • There is more than one way to distribute python software
    • Distribution packagers are power users and need super cow-powers
    • Users want it to "just work"
    • The Python Package Index is used by many as a software distribution
    • Pypi has a lot of contributions because requirements are low.

    The Python Package Index is not a software distribution

    I would define a software distribution as:

    • Organised group of people
    • Who apply a Unified Quality process
    • To a finite set of software
    • Which includes all its dependencies
    • With a consistent set of versions that work together
    • For a finite set of platforms
    • Managed and installed by dedicated tools.

    Pypi is a public index where:

    • Any python developer
    • Can upload any tarball containing something related
    • To any python package
    • Which might have external dependencies (outside Pypi)
    • The latest version of something is always available, regardless of its compatibility with other packages.
    • Binary packages can be provided for any platform but are usually not.
    • There are several tools to install and manage python packages from pypi.

    Pypi is not a software distribution, it is a software index.

    Card File by Mr. Ducke / Matt

    There is more than one way to distribute python software

    There is a long way from the pure source used by the developer to the software installed on the system of the end user.

    First, the source must be extracted from a (D)VCS to make a version tarball, while executing several release-specific actions (e.g. changelog generation from a tracker). Second, the version tarball is used to generate a platform-independent build, while executing several build steps (e.g. Cython compilation into C files or documentation generation). Third, the platform-independent build is used to generate a platform-dependent build, while executing several platform-dependent build steps (e.g. compilation of C extensions). Finally, the platform-dependent build is installed and each file gets dispatched to its proper location during the installation process.

    Pieces of software can be distributed as development snapshots taken from the (D)VCS, version tarballs, source packages, platform-independent packages or platform-dependent packages.

    package! by Beck Gusler

    Distribution packagers are power users and need super cow-powers

    Distribution packagers usually have the necessary infrastructure and skills to build packages from version tarballs. Moreover they might have specific needs that require as much control as possible over the various build steps. For example:

    • A specific help system requiring a custom version of Sphinx.
    • Specific security or platform constraints that require a specific version of Cython.
    Cheese Factory by James Yu

    Users want it to "just work"

    Standard users want it to "just work". They prefer simple and quick ways to install stuff. Build steps done on their machine increase the duration of the installation, add potential new dependencies and may trigger an error. Standard users are very disappointed when an install fails because an error occurred while building the documentation. Users give up when they have to download extra dependencies and set up a complicated compilation environment.

    Users want as many build steps as possible to be done by someone else. That's why many users usually choose a distribution that does the job for them (e.g. Ubuntu, Red Hat, Python(x,y)).

    The Python Package Index is used by many as a software distribution

    But there are several situations where users can't rely on their distribution to install Python software:

    • There is no distribution available for the platform (Windows, Mac OS X)
    • They want to install a python package outside of their distribution system (to test or because they do not have the credentials to install it system-wide)
    • The software or version they need is not included in the finite set of software included in their distribution.

    When this happens, the user will use Pypi to fetch python packages. To help them, Pypi accepts binary packages of python modules and people have developed dedicated tools that ease installation of packages and their dependencies: pip, easy_install.

    Pip + Pypi provides the tools of a distribution without its consistency. This is better than nothing.

    Pypi has a lot of contributions because requirements are low

    Pypi should contain version tarballs of all known Python modules; that is the first purpose of an index. Version tarballs should let distributions and power users perform as many build steps as possible. Pypi will continue to be used as a distribution by people without a better option. Packages provided to these users should require as little as possible to be installed, meaning they either have no build step to perform or have only platform-dependent build steps (that could not be executed by the developer).

    Thomas Fisher Rare Book Library by bookchen

    If the incoming distutils2 provides a way to differentiate platform-dependent build steps from platform-independent ones, Python developers will be able to upload three different kinds of packages to Pypi.

    sdist: pure source version released by upstream, targeted at packagers and power users.
    idist: platform-independent package with the platform-independent build steps done (Cython, docs). If there is no such build step, the package is the same as sdist.
    bdist: platform-dependent package with all build steps performed. For packages with no platform-dependent build step, this package is the same as idist.

    (Image under creative commons Card File by-nc-nd by Mr. Ducke / Matt, Thomas Fisher Rare Book Library by bookchen, package! by Beck Gusler, Cheese Factory by James Yu)


  • Fresh release of lutin77, Logilab Unit Test IN fortran 77

    2011/01/11 by Andre Espaze

    I am pleased to announce the 0.2 release of lutin77 for running Fortran 77 tests by using a C compiler as the only dependency. Moreover, this very light framework of 97 lines of C code makes a very good demo of Fortran and C interfacing. The next level could be to write it in GAS (GNU Assembler).

    For the over-excited maintainers of legacy code, here comes a screenshot:

    $ cat test_error.f
       subroutine success
       end
    
       subroutine error
       integer fid
       open(fid, status="old", file="nofile.txt")
       write(fid, *) "Ola"
       end
    
       subroutine checke
       call check(.true.)
       call check(.false.)
       call abort
       end
    
       program run
       call runtest("error")
       call runtest("success")
       call runtest("absent")
       call runtest("checke")
       call resume
       end
    

    Then you can build the framework by:

    $ gcc -Wall -pedantic -c lutin77.c
    

    And now run your tests:

    $ gfortran -o test_error test_error.f lutin77.o -ldl -rdynamic
    $ ./test_error
      At line 6 of file test_error.f
      Fortran runtime error: File 'nofile.txt' does not exist
      Error with status 512 for the test "error".
    
      "absent" test not found.
    
      Failure at check statement number 2.
      Error for the test "checke".
    
      4 tests run (1 PASSED, 0 FAILED, 3 ERRORS)
    

    See also the list of test frameworks for Fortran.


  • Distutils2 January Sprint in Paris

    2011/01/07 by Pierre-Yves David

    At Logilab, we have the pleasure to host a distutils2 sprint in January. Sprinters are welcome in our Paris office from 9h on the 27th of January to 19h the 30th of January. This sprint will focus on polishing distutils2 for the next alpha release and on the install/remove scripts.

    Distutils2 is an important project for Python. Every contribution will help to improve the current state of packaging in Python. See the wiki page on python.org for details about participation. If you can't attend or join us in Paris, you can participate on the #distutils channel of the freenode irc network

    http://guide.python-distribute.org/_images/state_of_packaging.jpg

    For additional details, see Tarek Ziadé's original announce, read the wiki page on python.org or contact us


  • Accessing data on a virtual machine without network

    2010/12/02 by Andre Espaze

    At Logilab, we work a lot with virtual machines for testing and developing code on customers' architectures. We access virtual machines through the network and copy data with the scp command. However, in case of a network failure, there is still a way to access your data: mounting a rescue disk on the virtual machine. The following commands use qemu, but the idea could certainly be adapted to other emulators.

    Creating and mounting the rescue disk

    To be able to mount the rescue disk on your system later, it is necessary to use the raw image format (the default with qemu):

    $ qemu-img create data-rescue.img 10M
    

    Then run your virtual machine with 'data-rescue.img' attached (you need to add a disk storage on virtmanager). Once in your virtual system, you will have to partition and format your new hard disk. As an example with Linux (win32 users will prefer right clicks):

    $ fdisk /dev/sdb
    $ mke2fs -j /dev/sdb1
    

    Then the new disk can be mounted and used:

    $ mount /dev/sdb1 /media/usb
    $ cp /home/dede/important-customer-code.tar.bz2 /media/usb
    $ umount /media/usb
    

    You can then stop your virtual machine.

    Getting back data from the rescue disk

    You will then have to carry your 'data-rescue.img' to a system where you can mount a file with the 'loop' option. But first we need to find where our partition starts:

    $ fdisk -ul data-rescue.img
    You must set cylinders.
    You can do this from the extra functions menu.
    
    Disk data-rescue.img: 0 MB, 0 bytes
    255 heads, 63 sectors/track, 0 cylinders, total 0 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Disk identifier: 0x499b18da
    
    Device Boot             Start         End      Blocks   Id  System
    data-rescue.img1           63       16064        8001   83  Linux
    

    Now we can mount the partition and get back our code:

    $ mkdir /media/rescue
    $ mount -o loop,offset=$((63 * 512)) data-rescue.img /media/rescue/
    $ ls /media/rescue/
    important-customer-code.tar.bz2
    

  • Thoughts on the python3 conversion workflow

    2010/11/30 by Emile Anclin

    Python3

    The 2to3 script is a very useful tool. We can just run it over the whole code base and end up with python3-compatible code whilst keeping a python2 code base. To make our code python3 compatible, we do (or did) two things:

    • small python2 compatible modifications of our source code
    • run 2to3 over our code base to generate a python3 compatible version

    However, we not only want to have one python3-compatible version, but also to keep developing our software. Hence, we want to be able to easily test it with both python2 and python3. Furthermore, if we use patches to get nice commits, this starts to be quite messy. Let's consider this in the case of Pylint. Indeed, the workflow described above proved to be unsatisfying.

    • I have two repositories, one for python2, one for python3. On the python3 side, I run 2to3 and store the modifications in a patch or a commit.

    • Whenever I implement a fix or a functionality on either side, I have to test if it still works on the other side; but as the 2to3 modifications are often quite heavy, directly creating patches on one side and applying them on the other side won't work most of the time.

    • Now say, I implement something in my python2 base and hold it in a patch or commit it. I can then pull it to my python3 repo:

      • running 2to3 on all Pylint is quite slow: around 30 sec for Pylint without the tests, and around 2 min with the tests. (I'd rather not imagine how long it would take for say CubicWeb).

      • even if I have all my 2to3 modifications on a patch, it takes 5-6 sec to "qpush" or "qpop" them all. Committing the 2to3 changes instead and using:

        hg pull -u --rebase
        

        is not much faster. If I don't use --rebase, I will have merges on each pull up. Furthermore, we often have either a patch application failure, merge conflict or end up with something which is not python3 compatible (like a newly introduced "except Error, exc").

    • So quite often, I will have to fix it with:

      hg revert -r REV <broken_files>
      2to3 -nw <broken_files>
      hg qref # or hg resolve -m; hg rebase -c
      
    • Suppose that 2to3 transition worked fine, or that we fixed it. I run my tests with python3 and see it does not work; so I modify the patch: it all starts again; and the new patch or the patch modification will create a new head in my python3 repo...

    2to3 Fixers

    Considering all that, let's investigate 2to3: it comes with a lot of fixers that can be activated or deactivated. Now, a lot of them fix only very rare use cases or stuff that has been deprecated for years. On the other hand, the 2to3 fixers work with regular expressions, so the more we remove, the faster 2to3 should be. Depending on the project, most cases will just not appear, and for the others, we should be able to find other means of disabling them. The lists proposed hereafter are just suggestions; which fixers can actually be disabled, and how, will depend on the source base and other overall considerations.
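
    For instance (a hypothetical invocation; the package path and chosen fixer names are made up and should be adjusted to your code base), 2to3 can be told to run only selected fixers with -f, or to skip some with -x:

    $ 2to3 -f has_key -f ws_comma -w mypackage/   # run only these two fixers, write changes in place
    $ 2to3 -x print -x dict -w mypackage/         # run all default fixers except these two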

    python2 compatible

    Following fixers are 2.x compatible and should be run once and for all (and can then be disabled on daily conversion usage):

    • apply
    • execfile (?)
    • exitfunc
    • getcwdu
    • has_key
    • idioms
    • ne
    • nonzero
    • paren
    • repr
    • standarderror
    • sys_exec
    • tuple_params
    • ws_comma

    compat

    This can be fixed using imports from a "compat" module like the logilab.common.compat module which holds convenient compatible objects.

    • callable
    • exec
    • filter (Wraps filter() usage in a list call)
    • input
    • intern
    • itertools_imports
    • itertools
    • map (Wraps map() in a list call)
    • raw_input
    • reduce
    • zip (Wraps zip() usage in a list call)

    strings and bytes

    Maybe they could also be handled by compat:

    • basestring
    • unicode
    • print

    For print, for example, we could think of a once-and-for-all custom fixer that would replace it with a convenient echo function (or whatever name you like) defined in compat.
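
    A minimal sketch of what such a helper could look like (echo is a made-up name here, not an existing logilab.common.compat function):

    # compat.py -- hypothetical replacement for the print statement,
    # usable unchanged under both python2 and python3
    import sys
    
    def echo(*args):
        sys.stdout.write(' '.join(str(arg) for arg in args) + '\n')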

    manually

    Following issues could probably be fixed manually:

    • dict (it fixes dict iterator methods; it should be possible to have code where we can disable this fixer)
    • import (Detects sibling imports; we could convert them to absolute import)
    • imports, imports2 (renamed modules)

    necessary

    These changes seem to be necessary:

    • except
    • long
    • funcattrs
    • future
    • isinstance (Fixes duplicate types in the second argument of isinstance(). For example, isinstance(x, (int, int)) is converted to isinstance(x, (int)))
    • metaclass
    • methodattrs
    • numliterals
    • next
    • raise

    Consider however that a lot of them might never be used in some projects, like long, funcattrs, methodattrs and numliterals or even metaclass. Also, isinstance is probably motivated by long to int and unicode to str conversions and hence might also be somehow avoided.

    don't know

    Can we fix these one also with compat ?

    • renames
    • throw
    • types
    • urllib
    • xrange
    • xreadlines

    2to3 and Pylint

    Pylint is a special case since its test suite has a lot of bad and deprecated code which should stay there. However, in order to have a reasonable workflow, it seems that something must be done to reduce the 1:30 minutes 2to3 spends parsing the tests. Probably nothing can be gained from the above considerations, since most cases should be in the tests, and actually are. Realise that we can expect to be supporting python2 and python3 in parallel for several years.

    After a quick look, we see that 90% of the refactorings of test/input files only concern print statements; moreover, most of them have nothing to do with the tested functionality. Hence, a solution might be to avoid running 2to3 on the test/input directory, since we already have a mechanism to select, depending on the python version, whether a test file should be run or not. To some extent, astng is a similar case, but its test suite and the whole project are much smaller.


  • Notes on making "logilab-common" Py3k-compatible

    2010/09/28 by Emile Anclin

    Version 3 of Python is incompatible with the 2.x series. In order to make pylint usable with Python3, I did some work on making the logilab-common library Python3-compatible, since pylint depends on it.

    The strategy is to have one source code version, and to use the 2to3 tool for publishing a Python3 compatible version.

    Pytest vs. Unittest

    The first problem was that we use the pytest runner, that depends on logilab.common.testlib which extends the unittest module.

    Without major modifications, we could use unittest2 instead of unittest in Python2.6. I thought that the unittest2 module was equivalent to unittest in Python3, but then realized I was wrong:

    • Python3.1/unittest is some strange "forward port" of unittest. Both are a single file, but they must be quite different since 3.1 has 1623 lines compared to 875 from 2.6...
    • Python2.x/unittest2 is a python package, backported from the alpha-release of Python3.2/unittest.

    I did not investigate if there are other unittest and unittest2 versions corresponding.

    What we can see is that the 3.1 version of unittest is different from everything else; whereas the 2.6-unittest2 is equivalent to 3.2-unittest. So, after trying to run pytest on Python3.1 and since there is a backport of unittest2 for Python3.1, it became clear that the best is to ignore py3.1-unittest and work on Python3.2 and unittest2 directly.

    Meanwhile, some work was being done on logilab-common to switch from unittest to unittest2. This was included in logilab.common-0.52.

    'python2.6 -3' and 2to3

    The -3 option of python2.6 warns about Python3 incompatible stuff.

    Since I already knew that pytest would work with unittest2, I wanted to know as fast as possible if pytest would run on Python3.x. So I ran all logilab.common tests with "python2.6 -3 bin/pytest" and found a couple of problems that I quick-fixed or discarded, waiting to know the real solution.

    The 2to3 script (from the 2to3 library) does its best to transform Python2.x code into Python3 compatible code, but manual work is often needed to handle some cases. For example file is not considered a deprecated base class, calls to raw_input(...) are handled but not using raw_input as an instance attribute, etc. At times, 2to3 can be overzealous, and for example do modifications such as:

    -                for name, local_node in node.items():
    +                for name, local_node in list(node.items()):
    

    Procedure

    After a while, I found that the best solution was to adopt the following working procedure:

    • run the tests with python2.6 -3 and solve the appearing issues.
    • run 2to3 on all that has to be transformed:
    2to3-2.6 -n -w *py test/*py ureports/*py
    

    Since we are in a mercurial repository we don't need backups (-n) and we can write the modifications to the files directly (-w).

    • create a 223.diff patch that will be applied and removed repeatedly.

      Now, we will push and pop this patch (which is much faster than running 2to3), and only regenerate it from time to time to make sure it still works:

    • run "python3.2 bin/pytest -x", to find problems and solutions for crashes and tests that do not work. Note that after some quick fixes on logilab.common.testlib, pytest works quite well, and that we can use the "-x" option. Using Python's Whatsnew_3.0 documentation for hints is quite useful.

    • hg qpop 223.diff

    • write the solution into the 2.x code, convert it into a patch or a commit, and run the tests: some trivial things might not work or not be 2.4 compatible.

    • hg qpush 223.diff

    • repeat the procedure

    I used two repositories when working on logilab.common, one for Python2 and one for Python3, because other tools, like astng and pylint, depend on that library. Setting the PYTHONPATH was enough to get astng and pylint to use the right version.

    Concrete examples

    • We had to replace "os.path.walk" with "os.walk".

    • The renaming of raw_input to input, __builtin__ to builtins and StringIO to io could easily be resolved by using the improved logilab.common.compat technique: write a Python-version-dependent definition of a variable, function, or class in logilab.common.compat and import it from there.

      For builtins, it is even easier: since 2to3 recognizes direct imports, we can write in compat.py:

    import __builtin__ as builtins # 2to3 will transform '__builtin__' to 'builtins'
    

    The most difficult point is the replacement of str/unicode by bytes/str.

    In Python3.x, we only use unicode strings called just str (the u'' syntax and unicode disappear), but everything written on disk will have to be converted to bytes, with some explicit encoding. In Python3.x, file descriptors have a defined encoding, and will automatically transform the strings to bytes.

    I wrote two functions in logilab.common.compat. One converts str to bytes and the other simply ignores the encoding in case of 3.x where it was expected in 2.x. But there might be a need to write additional tests to make sure the modifications work as expected.
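
    Here is a rough sketch of the idea (names and signatures are illustrative, not the actual logilab.common.compat API):

    import sys
    
    if sys.version_info[0] >= 3:
        def str_to_bytes(value, encoding='utf-8'):
            """Python3: text (str) must be encoded explicitly before hitting the disk."""
            return value.encode(encoding)
        def ignore_encoding(value, encoding=None):
            """Python3: the file object handles the encoding, so pass the string through."""
            return value
    else:
        def str_to_bytes(value, encoding='utf-8'):
            """Python2: str is already a byte string."""
            return value
        def ignore_encoding(value, encoding='utf-8'):
            """Python2: apply the encoding that the caller expects."""
            return value.encode(encoding)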

    Conclusion

    • After less than a week of work, most of the logilab.common tests pass. The biggest remaining problem is the tests for testlib.py. But we can already start working on the Python3 compatibility for astng and finally pylint.
    • Looking at the lib2to3 library, one can see that 2to3 works with regular expressions which reproduce the Python grammar. Hence, it cannot do much code investigation or static inference like astng. I think that using astng, we could improve 2to3 without too much effort.
    • for astng the difficulties are quite different: syntax changes become semantic changes, we will have to add new types of astng nodes.
    • For testing astng and pylint we will probably have to check the different test examples, a lot of them being code snippets which 2to3 will not parse; they will have to be corrected by hand.

    As a general conclusion, I found no need to use sa2to3, although it might be a very good tool. I would instead suggest having a small compat module and keeping only one version of the code, as far as possible, the code base being either on 2.x or on 3.x and using the (possibly customized) 2to3 or 3to2 scripts to publish two different versions.


  • SemWeb.Pro - first french Semantic Web conference, Jan 17/18 2011

    2010/09/20 by Nicolas Chauvat

    SemWeb.Pro, the first French conference dedicated to the Semantic Web, will take place in Paris on January 17/18 2011.

    One day of talks, one day of tutorials.

    Want to grok the Web 3.0? Be there.

    Something you want to share? Call for papers ends on October 15, 2010.

    http://www.semweb.pro/semwebpro.png

  • Discovering logilab-common Part 1 - deprecation module

    2010/09/02 by Stéphanie Marcu

    The logilab-common library contains a lot of utilities which often go unnoticed. I will write a series of blog entries to explore the nice features of this library.

    We will begin with the logilab.common.deprecation module which contains utilities to warn users when:

    • a function or a method is deprecated
    • a class has been moved into another module
    • a class has been renamed
    • a callable has been moved to a new module

    deprecated

    When a function or a method is deprecated, you can use the deprecated decorator. It will print a message to warn the user that the function is deprecated.

    The decorator takes two optional arguments:

    • reason: the deprecation message. A good practice is to specify at the beginning of the message, between brackets, the version number from which the function is deprecated. The default message is 'The function "[function name]" is deprecated'.
    • stacklevel: This is the option of the warnings.warn function which is used by the decorator. The default value is 2.

    We have a Person class defined in a file person.py. The get_surname method is deprecated; we must use the get_lastname method instead. For that, we apply the deprecated decorator to the get_surname method.

    from logilab.common.deprecation import deprecated
    
    class Person(object):
    
        def __init__(self, firstname, lastname):
            self._firstname = firstname
            self._lastname = lastname
    
        def get_firstname(self):
            return self._firstname
    
        def get_lastname(self):
            return self._lastname
    
        @deprecated('[1.2] use get_lastname instead')
        def get_surname(self):
            return self.get_lastname()
    
    def create_user(firstname, lastname):
        return Person(firstname, lastname)
    
    if __name__ == '__main__':
        person = create_user('Paul', 'Smith')
        surname = person.get_surname()
    

    When running person.py we have the message below:

    person.py:22: DeprecationWarning: [1.2] use get_lastname instead
    surname = person.get_surname()

    class_moved

    Now we move the Person class into a new_person.py file. In person.py, we indicate that the class has been moved:

    from logilab.common.deprecation import class_moved
    import new_person
    Person = class_moved(new_person.Person)
    
    if __name__ == '__main__':
        person = Person('Paul', 'Smith')
    

    When we run the person.py file, we have the following message:

    person.py:6: DeprecationWarning: class Person is now available as new_person.Person
    person = Person('Paul', 'Smith')

    The class_moved function takes one mandatory argument and two optional ones:

    • new_class: this mandatory argument is the new class
    • old_name: this optional argument specifies the old class name. By default it is the same as the new class name. It is used in the default printed message.
    • message: with this optional argument, you can specify a custom message

    class_renamed

    The class_renamed function automatically creates a class which fires a DeprecationWarning when instantiated.

    The function takes two mandatory arguments and one optional:

    • old_name: a string which contains the old class name
    • new_class: the new class
    • message: an optional message. The default one is '[old class name] is deprecated, use [new class name]'

    We now rename the Person class to User in the new_person.py file. Here is the new person.py file:

    from logilab.common.deprecation import class_renamed
    from new_person import User
    
    Person = class_renamed('Person', User)
    
    if __name__ == '__main__':
        person = Person('Paul', 'Smith')
    

    When running person.py, we have the following message:

    person.py:5: DeprecationWarning: Person is deprecated, use User
    person = Person('Paul', 'Smith')

    moved

    The moved function is used to signal that a callable has been moved to a new module. It returns a callable wrapper; when the wrapper is called, a warning is printed telling where the object can now be found, then the import is done (and not before) and the actual object is called.

    Note

    The usage is somewhat limited on classes since it will fail if the wrapper is used in a class ancestors list: use the class_moved function instead (which has no lazy import feature though).

    The moved function takes two mandatory parameters:

    • modpath: a string representing the path to the new module
    • objname: the name of the new callable

    In person.py, we will use the create_user function, which is now defined in the new_person.py file:

    from logilab.common.deprecation import moved
    
    create_user = moved('new_person', 'create_user')
    
    if __name__ == '__main__':
        person = create_user('Paul', 'Smith')
    

    When running person.py, we have the following message:

    person.py:4: DeprecationWarning: object create_user has been moved to module new_person
    person = create_user('Paul', 'Smith')
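
    To make the lazy-import behaviour described above more concrete, here is a rough sketch of how such a wrapper could be written (an illustration of the idea, not the actual logilab.common.deprecation code):

    from warnings import warn

    def moved_sketch(modpath, objname):
        def callnew(*args, **kwargs):
            # warn and import only when the wrapper is actually called
            warn('object %s has been moved to module %s' % (objname, modpath),
                 DeprecationWarning, stacklevel=2)
            module = __import__(modpath, None, None, [objname])
            return getattr(module, objname)(*args, **kwargs)
        return callnew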

  • pdb.set_trace no longer working: problem solved

    2010/08/12

    I had a bad case of bug hunting today which took me > 5 hours to track down (with the help of Adrien in the end).

    I was trying to start a CubicWeb instance on my computer, and was encountering some strange pyro error at startup. So I edited some source file to add a pdb.set_trace() statement and restarted the instance, waiting for Python's debugger to kick in. But that did not happen. I was baffled. I first checked for standard problems:

    • no pdb.py or pdb.pyc was lying around in my Python sys.path
    • the pdb.set_trace function had not been silently redefined
    • no other thread was bugging me
    • the standard input and output were what they were supposed to be
    • I was not able to reproduce the issue on other machines

    After triple checking everything, grepping everywhere, I asked a question on StackOverflow before taking a lunch break (if you go there, you'll see the answer). After lunch, no useful answer had come in, so I asked Adrien for help, because two pairs of eyes are better than one in some cases. We dutifully traced down the pdb module's code to the underlying bdb and cmd modules and learned some interesting things on the way down there. Finally, we found out that the Python code frames which should have been identical were not. This discovery caused further bafflement. We looked at the frames, and saw that one of those frames' class was a psyco-generated wrapper.

    It turned out that CubicWeb can use two implementations of the RQL module: one which uses gecode (a C++ library for constraint based programming) and one which uses logilab.constraint (a pure python library for constraint solving). The former is the default, but it would not load on my computer, because the gecode library had been replaced by a more recent version during an upgrade. The pure python implementation tries to use psyco to speed things up. Installing the correct version of libgecode solved the issue. End of story.

    When I checked out StackOverflow, Ned Batchelder had provided an answer. I didn't get the satisfaction of answering the question myself...

    Once this was figured out, solving the initial pyro issue took 2 minutes...


  • EuroSciPy'10

    2010/07/13 by Adrien Chauve
    http://www.logilab.org/image/9852?vid=download

    The EuroSciPy2010 conference was held in Paris at the Ecole Normale Supérieure from July 8th to 11th and was organized and sponsored by Logilab and other companies.

    July, 8-9: Tutorials

    The first two days were dedicated to tutorials and I had the chance to talk about SciPy with André Espaze, Gaël Varoquaux and Emanuelle Gouillart in the introductory track. This was nice but it was a bit tricky to present SciPy in such a short time while trying to illustrate the material with real and interesting examples. One very nice thing for the introductory track is that all the material was contributed by different speakers and is freely available in a github repository (licensed under CC BY).

    July, 10-11: Scientific track

    The next two days were dedicated to scientific presentations, showing why Python is such a great tool for developing scientific software and carrying out research.

    Keynotes

    I had a great time listening to the presentations, starting with the two very nice keynotes given by Hans Petter Langtangen and Konrad Hinsen. The latter gave us a nice summary of what happened in the scientific Python world during the past 15 years, what is happening now and of course what could happen during the next 15 years. Using a crystal ball and a very humorous tone, he made it clear that the challenge of the coming years will be how to use our hundreds, thousands or even more cores in a bug-free and efficient way. Functional programming may be a very good answer to this challenge, as it provides a deterministic way of parallelizing our programs. Konrad also hinted at future versions of Python that could provide deeper and more efficient support for functional programming, and maybe the addition of an 'async' keyword to run a function on another core.

    In fact, PEP 3148, entitled "Futures - execute computations asynchronously", was accepted just two days ago. This PEP describes a new package called "futures", designed to facilitate the evaluation of callables using threads and processes in future versions of Python. A full implementation is already available.
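
    For the curious, here is a small sketch of what the API standardized by PEP 3148 looks like (using the concurrent.futures names it eventually shipped under in Python 3.2; a futures backport exists for Python 2):

    from concurrent.futures import ProcessPoolExecutor

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    if __name__ == '__main__':
        # distribute the calls over two worker processes
        with ProcessPoolExecutor(max_workers=2) as executor:
            for n, result in zip(range(25, 30), executor.map(fib, range(25, 30))):
                print(n, result)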

    Parallelization

    Parallelization was indeed a very popular topic across presentations, and several solutions for embarrassingly parallel problems were presented.

    • Playdoh: Distributes computations over computers connected to a secure network (see playdoh presentation).

      Distributing the computation of a function over two machines is as simple as:

      import playdoh
      result1, result2 = playdoh.map(fun, [arg1, arg2], _machines = ['machine1.network.com', 'machine2.network.com'])
      
    • Theano: Allows defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently. In particular it can use the GPU transparently and generate optimized C code (see theano presentation).

    • joblib: Provides, among other things, helpers for embarrassingly parallel problems. It is built on top of the multiprocessing package introduced in Python 2.6 and brings more readable code and easier debugging (see the sketch just below).
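
      A minimal sketch of the joblib approach (my own illustration, not from the talk) looks like this:

      from math import sqrt
      from joblib import Parallel, delayed

      # run the square roots in two worker processes
      results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
      print(results)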

    Speed

    Concerning speed, Francesc Alted showed us interesting tools for memory optimization that are already used successfully in PyTables 2.2. You can read more details on this kind of optimization in EuroSciPy'09 (part 1/2): The Need For Speed.

    SCons

    Last but not least, I talked with Christophe Pradal, one of the core developers of OpenAlea. He convinced me that SCons is worth using once you have a nice extension for it: SConsX. I'm looking forward to testing it.


  • HOWTO install lodgeit pastebin under Debian/Ubuntu

    2010/06/24 by Arthur Lutz

    LodgeIt is a simple open source pastebin... and it's written in Python!

    The installation under debian/ubuntu goes as follows:

    sudo apt-get update
    sudo apt-get -uVf install python-imaging python-sqlalchemy python-jinja2 python-pybabel python-werkzeug python-simplejson
    cd local
    hg clone http://dev.pocoo.org/hg/lodgeit-main
    cd lodgeit-main
    vim manage.py
    

    For debian squeeze you have to downgrade python-werkzeug, so get the old version of python-werkzeug from snapshot.debian.org at http://snapshot.debian.org/package/python-werkzeug/0.5.1-1/

    wget http://snapshot.debian.org/archive/debian/20090808T041155Z/pool/main/p/python-werkzeug/python-werkzeug_0.5.1-1_all.deb
    

    Modify the dburi and the SECRET_KEY, then launch the application:

    python manage.py runserver
    

    Then off you go to configure your apache or lighttpd.

    An easy (and dirty) way of running it at startup is to add the following command to the www-data crontab

    @reboot cd /tmp/; nohup /usr/bin/python /usr/local/lodgeit-main/manage.py runserver &
    

    This should of course be done in an init script.

    http://rn0.ru/static/help/advanced_features.png

    Hopefully we'll find some time to package this nice webapp for debian/ubuntu.


  • EuroSciPy 2010 schedule is out !

    2010/06/06 by Nicolas Chauvat
    https://www.euroscipy.org/data/logo.png

    The EuroSciPy 2010 conference will be held in Paris from July 8th to 11th at the Ecole Normale Supérieure. Two days of tutorials, two days of conference, two interesting keynotes, a lightning talk session, an open space for collaboration and sprinting, thirty quality talks in the schedule and already 100 delegates registered.

    If you are doing science and using Python, you want to be there!


  • Salomé accepted into Debian unstable

    2010/06/03 by Andre Espaze

    Salomé is a platform for pre- and post-processing of numerical simulations, available at http://salome-platform.org/. It is now available as a Debian package http://packages.debian.org/source/sid/salome and should soon appear in Ubuntu https://launchpad.net/ubuntu/+source/salome as well.

    http://salome-platform.org/salome_screens.png/image_preview

    A difficult packaging work

    A first package of Salomé 3 was made by the courageous Debian developer Adam C. Powell, IV in January 2008. Such packaging is very resource intensive because many modules have to be built. But the most difficult part was to bring Salomé to an environment it had not been ported to. Even today, Salomé 5 binaries are only provided by upstream as a stand-alone piece of software ready to unpack on a Debian Sarge/Etch or a Mandriva 2006/2008. This is the first reason why several patches were required to adapt the code to new versions of the dependencies. Version 3 of Salomé was so difficult and time consuming to package that Adam decided to stop for two years.

    The packaging of Salomé resumed with version 5.1.3 in January 2010. Thanks to Logilab and the OpenHPC project, I could join him for 14 weeks of work adapting every module to Debian unstable. Porting to the new versions of the dependencies was a first step, but we also had to adapt the code to the Debian packaging philosophy, with binaries, libraries and data shipped to dedicated directories.

    A promising future

    Salomé being accepted into Debian unstable means that porting it to Ubuntu should follow in the near future. Moreover, the work done to adapt Salomé to a GNU/Linux distribution may help developers on other platforms as well.

    That is excellent news for everyone involved in numerical simulation, because they are going to have access to Salomé through their package management tools. It will help spread Salomé to any fresh install and keep it up to date.

    Join the fun

    For mechanical engineers, a derived product called Salomé-Méca has recently been published. The goal is to bring the functionalities of the Code Aster finite element solver to Salomé in order to ease simulation workflows. If you are also interested in Debian packages for those tools, you are invited to join us and share the fun.

    I have submitted a proposal to talk about Salomé at EuroSciPy 2010. I look forward to meeting other interested parties during this conference, which will take place in Paris on July 8th-11th.


  • Enable and disable encrypted swap - Ubuntu

    2010/05/18 by Arthur Lutz
    http://ubuntu-party.org/wp-content/themes/ubuntu-party/scripts/timthumb.php?src=//wp-content/uploads/2010/04/evl-pochette21.png&w=210&h=192&zc=1&q=100

    With the release of Ubuntu Lucid Lynx, the use of an encrypted /home is becoming a pretty common and simple thing to set up. This is good news for privacy reasons obviously. The next step, which a lot of users are reluctant to take, is the use of an encrypted swap. One of the most obvious reasons is that in most cases it breaks the suspend and hibernate functions.

    Here is a little HOWTO on how to switch from normal swap to encrypted swap and back. That way, when you need a secure laptop (a trip to a conference, or a situation with a risk of theft) you can activate it, and then deactivate it when you're back home, for example.

    Turn it on

    That is pretty simple

    sudo ecryptfs-setup-swap
    

    Turn it off

    https://launchpadlibrarian.net/17699584/ecryptfs_64.png

    The idea is to turn off swap, remove the ecryptfs layer, reformat your partition as normal swap and enable it. We use sda5 as an example for the swap partition; please use your own (fdisk -l will tell you which swap partition you are using, or look in /etc/crypttab).

    sudo swapoff -a
    sudo cryptsetup remove /dev/mapper/cryptswap1
    sudo vim /etc/crypttab
    *remove the /dev/sda5 line*
    sudo /sbin/mkswap /dev/sda5
    sudo swapon /dev/sda5
    sudo vim /etc/fstab
    *replace /dev/mapper/cryptswap1 with /dev/sda5*
    

    If this is useful, you can probably stick it in a script to turn it on and off... maybe we could get an ecryptfs-unsetup-swap into ecryptfs.


  • The DEBSIGN_KEYID trick

    2010/05/12 by Nicolas Chauvat

    I have been wondering for some time why debsign would not use the DEBSIGN_KEYID environment variable that I exported from my bashrc. Debian bug 444641 explains the trick: debsign ignores environment variables and sources ~/.devscripts instead. A simple export DEBSIGN_KEYID=ABCDEFG in ~/.devscripts is enough to get rid of the -k argument once and for good.


  • pylint bug days #2 report

    2010/04/19 by Sylvain Thenault

    First of all, I have to say that the pylint bugs day wasn't that successful as a 'community event': I've been sprinting almost alone. My Logilab fellows were tied to customer projects, and no outside people showed up on jabber. Fortunately Tarek Ziade came to visit us, and that was a nice opportunity to talk about pylint, distribute, etc. Thank you Tarek, you saved my day ;)

    As I felt a bit lonely, I decided to work on something more fun than bug fixing: refactoring!

    First, I've greatly simplified the command line: enable-msg/enable-msg-cat/enable-checker/enable-report and their disable-* counterparts were all merged into single --enable/--disable options.

    I've also simplified the "pylint --help" output, providing a --long-help option to get what we had before. The generic support lives in logilab.common.configuration, of course.

    And last but not least, I refactored pylint so we can have multiple checkers with the same name. The idea behind this is that we can split checkers into smaller chunks, each basically responsible for one or a few related messages. When pylint runs, it only uses the checkers necessary for the activated messages and reports. Once all checkers are split, this should improve the performance of "pylint --errors-only".

    So, I can say I'm finally happy with the results of that pylint bugs day! And hopefully we will be more people for the next edition...


  • Virtualenv - Play safely with a Python

    2010/03/26 by Alain Leufroy
    http://farm5.static.flickr.com/4031/4255910934_80090f65d7.jpg

    virtualenv, pip and Distribute are three tools that help developers and packagers. In this short presentation we will see some of virtualenv's capabilities.

    Please keep in mind that everything below was done using Debian Lenny, Python 2.5 and virtualenv 1.4.5.

    Abstract

    virtualenv builds python sandboxes where it is possible to do whatever you want as a regular user without jeopardizing your global environment.

    virtualenv allows you to safely:

    • install any python packages
    • add debug lines everywhere (not only in your scripts)
    • switch between python versions
    • try your code as if you were an end user
    • and so on ...

    Install and usage

    Install

    Prefered way

    Just download the virtualenv python script at http://bitbucket.org/ianb/virtualenv/raw/tip/virtualenv.py and call it using python (e.g. python virtualenv.py).

    For convenience, we will refer to this script as virtualenv.

    Other ways

    For Debian (and Ubuntu) addicts, just do:

    $ sudo aptitude install python-virtualenv
    

    Fedora users would do:

    $ sudo yum install python-virtualenv
    

    And others can install from PyPI (as superuser):

    $ pip install virtualenv
    

    or

    $ easy_install pip && pip install virtualenv
    

    You could also get the source here.

    Quick Guide

    To work in a python sandbox, do as follows:

    $ virtualenv my_py_env
    $ source my_py_env/bin/activate
    (my_py_env)$ python
    

    "That's all Folks !"

    Once you have finished just do:

    (my_py_env)$ deactivate
    

    or quit the tty.

    What does virtualenv actually do ?

    At creation time

    Let's start again ... more slowly. Consider the following environment:

    $ pwd
    /home/you/some/where
    $ ls
    

    Now create a sandbox called my-sandbox:

    $ virtualenv my-sandbox
    New python executable in "my-sandbox/bin/python"
    Installing setuptools............done.
    

    The output said that you have a new python executable and specific install tools. Your current directory now looks like:

    $ ls -Cl
    my-sandbox/ README
    $ tree -L 3 my-sandbox
    my-sandbox/
    |-- bin
    |   |-- activate
    |   |-- activate_this.py
    |   |-- easy_install
    |   |-- easy_install-2.5
    |   |-- pip
    |   `-- python
    |-- include
    |   `-- python2.5 -> /usr/include/python2.5
    `-- lib
        `-- python2.5
            |-- ...
            |-- orig-prefix.txt
            |-- os.py -> /usr/lib/python2.5/os.py
            |-- re.py -> /usr/lib/python2.5/re.py
            |-- ...
            |-- site-packages
            |   |-- easy-install.pth
            |   |-- pip-0.6.3-py2.5.egg
            |   |-- setuptools-0.6c11-py2.5.egg
            |   `-- setuptools.pth
            |-- ...
    

    In addition to the new python executable and the install tools, you have a whole new python environment containing libraries, a site-packages/ directory (where your packages will be installed), a bin directory, ...

    Note:
    virtualenv does not create every file needed to get a whole new python environment. It uses links to global environment files instead, in order to save disk space and speed up sandbox creation. Therefore, a working python environment must already be installed on your system.

    At activation time

    At this point you have to activate the sandbox in order to use your custom python. Once activated, python still has access to the global environment but will look in your sandbox first for python modules:

    $ source my-sandbox/bin/activate
    (my-sandbox)$ which python
    /home/you/some/where/my-sandbox/bin/python
    $ echo $PATH
    /home/you/some/where/my-sandbox/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    (pyver)$ python -c 'import sys;print sys.prefix;'
    /home/you/some/where/my-sandbox
    (pyver)$ python -c 'import sys;print "\n".join(sys.path)'
    /home/you/some/where/my-sandbox/lib/python2.5/site-packages/setuptools-0.6c8-py2.5.egg
    [...]
    /home/you/some/where/my-sandbox
    /home/you/personal/PYTHONPATH
    /home/you/some/where/my-sandbox/lib/python2.5/
    [...]
    /usr/lib/python2.5
    [...]
    /home/you/some/where/my-sandbox/lib/python2.5/site-packages
    [...]
    /usr/local/lib/python2.5/site-packages
    /usr/lib/python2.5/site-packages
    [...]
    

    First of all, a (my-sandbox) message is automatically added to your prompt in order to make it clear that you're using a python sandbox environment.

    Secondly, my-sandbox/bin/ is added to your PATH. So, running python calls the specific python executable placed in my-sandbox/bin.

    Note
    It is possible to improve the sandbox isolation by ignoring the global paths and your PYTHONPATH (see Improve isolation section).
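
    Besides sourcing the activate script from a shell, the sandbox also ships an activate_this.py script (visible in the tree above). As a hedged aside, it can be used to activate the sandbox from within an already running python process, roughly like this:

    activate_this = '/home/you/some/where/my-sandbox/bin/activate_this.py'
    # executing the script patches sys.path so the sandboxed packages win
    execfile(activate_this, dict(__file__=activate_this))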

    Installing package

    It is possible to install any package in the sandbox without superuser privileges. For instance, we will install the pylint development revision in the sandbox.

    Suppose that you have the pylint stable version already installed in your global environment:

    (my-sandbox)$ deactivate
    $ python -c 'from pylint.__pkginfo__ import version;print version'
    0.18.0
    

    Once your sandbox activated, install the development revision of pylint as an update:

    $ source /home/you/some/where/my-sandbox/bin/activate
    (my-sandbox)$ pip install -U hg+http://www.logilab.org/hg/pylint#egg=pylint-0.19
    

    The new package and its dependencies are only installed in the sandbox:

    (my-sandbox)$ python -c 'import pylint.__pkginfo__ as p;print p.version, p.__file__'
    0.19.0 /home/you/some/where/my-sandbox/lib/python2.6/site-packages/pylint/__pkginfo__.pyc
    (my-sandbox)$ deactivate
    $ python -c 'import pylint.__pkginfo__ as p;print p.version, p.__file__'
    0.18.0 /usr/lib/pymodules/python2.6/pylint/__pkginfo__.pyc
    

    You can safely make any change to the new pylint code or to other sandboxed packages, because your global environment remains unchanged.

    Useful options

    Improve isolation

    As said before, your sandboxed python's sys.path still references the global system paths. You can however hide them by:

    • either using the --no-site-packages option, which does not give the sandbox access to the global site-packages directory
    • or changing your PYTHONPATH in my-sandbox/bin/activate in the same way as for PATH (see tips)
    $ virtualenv --no-site-packages closedPy
    $ sed -i '9i PYTHONPATH="$_OLD_PYTHON_PATH"
          9i export PYTHONPATH
          9i unset _OLD_PYTHON_PATH
          40i _OLD_PYTHON_PATH="$PYTHONPATH"
          40i PYTHONPATH="."
          40i export PYTHONPATH' closedPy/bin/activate
    $ source closedPy/bin/activate
    (closedPy)$ python -c 'import sys; print "\n".join(sys.path)'
    /home/you/some/where/closedPy/lib/python2.5/site-packages/setuptools-0.6c8-py2.5.egg
    /home/you/some/where/closedPy
    /home/you/some/where/closedPy/lib/python2.5
    /home/you/some/where/closedPy/lib/python2.5/plat-linux2
    /home/you/some/where/closedPy/lib/python2.5/lib-tk
    /home/you/some/where/closedPy/lib/python2.5/lib-dynload
    /usr/lib/python2.5
    /usr/lib64/python2.5
    /usr/lib/python2.5/lib-tk
    /home/you/some/where/closedPy/lib/python2.5/site-packages
    $ deactivate
    

    This way, you'll get an even more isolated sandbox, just as with a brand new python environment.

    Work with different versions of Python

    It is possible to dedicate a sandbox to a particular version of python by using the --python=PYTHON_EXE option, which specifies the interpreter to use for the sandbox (by default, the interpreter virtualenv was installed with, e.g. /usr/bin/python):

    $ virtualenv --python=python2.4 pyver24
    $ source pyver24/bin/activate
    (pyver24)$ python -V
    Python 2.4.6
    $ deactivate
    $ virtualenv --python=python2.5 pyver25
    $ source pyver25/bin/activate
    (pyver25)$ python -V
    Python 2.5.2
    $ deactivate
    

    Distribute a sandbox

    To distribute your sandbox, you must use the --relocatable option, which makes an existing sandbox relocatable. This fixes up scripts and makes all .pth files relative. This option should be used just before you distribute the sandbox (each time you have changed something in it).

    An important point is that the host system should be similar to your own.

    Tips

    Speed up sandbox manipulation

    Add these functions to your .bashrc to help you use virtualenv and automate the creation and activation processes.

    rel2abs() {
    #from http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2005-01/0206.html
      [ "$#" -eq 1 ] || return 1
      ls -Ld -- "$1" > /dev/null || return
      dir=$(dirname -- "$1" && echo .) || return
      dir=$(cd -P -- "${dir%??}" && pwd -P && echo .) || return
      dir=${dir%??}
      file=$(basename -- "$1" && echo .) || return
      file=${file%??}
      case $dir in
        /) printf '%s\n' "/$file";;
        /*) printf '%s\n' "$dir/$file";;
        *) return 1;;
      esac
      return 0
    }
    function activate(){
        if [[ "$1" == "--help" ]]; then
            echo -e "usage: activate PATH\n"
            echo -e "Activate the sandbox where PATH points inside of.\n"
            return
        fi
        if [[ "$1" == '' ]]; then
            local target=$(pwd)
        else
            local target=$(rel2abs "$1")
        fi
        until  [[ "$target" == '/' ]]; do
            if test -e "$target/bin/activate"; then
                source "$target/bin/activate"
                echo "$target sandbox activated"
                return
            fi
            target=$(dirname "$target")
        done
        echo 'no sandbox found'
    }
    function mksandbox(){
        if [[ "$1" == "--help" ]]; then
            echo -e "usage: mksandbox NAME\n"
            echo -e "Create and activate a highly isaolated sandbox named NAME.\n"
            return
        fi
        local name='sandbox'
        if [[ "$1" != "" ]]; then
            name="$1"
        fi
        if [[ -e "$1/bin/activate" ]]; then
            echo "$1 is already a sandbox"
            return
        fi
        virtualenv --no-site-packages --clear --distribute "$name"
        sed -i '9i PYTHONPATH="$_OLD_PYTHON_PATH"
                9i export PYTHONPATH
                9i unset _OLD_PYTHON_PATH
               40i _OLD_PYTHON_PATH="$PYTHONPATH"
               40i PYTHONPATH="."
               40i export PYTHONPATH' "$name/bin/activate"
        activate "$name"
    }
    
    Note:
    The virtualenv-commands and virtualenvwrapper projects add some very interesting features to virtualenv. So keep an eye on them if you need more advanced features than the ones above.

    Conclusion

    I find virtualenv irreplaceable for testing new configurations or working on projects with different dependencies. Moreover, I use it to learn about other python projects, to see exactly how my project interacts with its dependencies (during debugging), or to test the end user experience.

    All of this stuff can be done without virtualenv but not in such an easy and secure way.

    I will continue the series by introducing other useful projects to enhance your productivity: pip and Distribute. See you soon.


  • Astng 0.20.0 and Pylint 0.20.0 releases

    2010/03/24 by Emile Anclin

    We are happy to announce the Astng 0.20.0 and Pylint 0.20.0 releases.

    Pylint is a static code checker based on Astng, both depending on logilab-common 0.49.

    Astng

    Astng 0.20.0 is a major refactoring: instead of parsing and modifying the syntax tree generated from python's _ast or compiler.ast modules, the syntax tree is rebuilt. Thus the code becomes much clearer, and all monkey patching will eventually disappear from this module.

    Speed improvements were achieved by caching parsed modules earlier to avoid double parsing and by avoiding some repeated inferences, all while fixing a lot of important bugs.

    Pylint

    Pylint 0.20.0 uses the new Astng, and fixes a lot of bugs too, adding some new functionality:

    • parameters with leading "_" shouldn't count as "local" variables
    • warn on assert( a, b )
    • warning if return or break inside a finally
    • specific message for NotImplemented exception

    We would like to thank Chmouel Boudjnah, Johnson Fletcher, Daniel Harding, Jonathan Hartley, Colin Moris, Winfried Plapper, Edward K. Ream and Pierre Rouleau for their contributions, and all other people helping the project to progress.


  • pylint bugs day #2 on april 16, 2010

    2010/03/22 by Sylvain Thenault

    Hey guys,

    we'll hold the next pylint bugs day on April 16th 2010 (Friday). If some of you want to come and work with us in our Paris office, you'll be most welcome.

    Else you can still join us on jabber / irc:

    See you then!


  • PostgreSQL on windows : plpythonu and "specified module could not be found" error

    2010/03/22

    I recently had to (remotely) debug an issue on windows involving PostgreSQL and PL/Python. Basically, two very similar computers, with Python 2.5 installed via python(x,y) and PostgreSQL 8.3.8 installed via the binary installer. On the first machine create language plpythonu; worked like a charm, and on the other one, it failed with C:\Program Files\Postgresql\8.3\plpython.dll: specified module could not be found. This is caused by the dynamic linker not finding some DLL. Using Depends.exe showed that plpython.dll looks for python25.dll (the one it was built against in the 8.3.8 installer), and that the DLL was indeed present on the machine.

    I'll spare you the various things we tried and jump directly to the solution. After much head scratching, it turned out that the first computer had TortoiseHg installed. This caused C:\Program Files\TortoiseHg to be included in the System PATH environment variable, and that directory contains python25.dll. On the other hand, C:\Python25 was in the user's PATH environment variable on both computers. As the database Windows service runs under a dedicated local account (typically with login postgres), it would not have C:\Python25 in its PATH, but if TortoiseHg was there, it would find the DLL in that other directory. So the solution was to add C:\Python25 to the system PATH.


  • Now publishing blog entries under creative commons

    2010/03/15 by Arthur Lutz

    Logilab is proud to announce that the blog entries published on the blogs of http://www.logilab.org and http://www.cubicweb.org are now licensed under a Creative Commons Attribution-Share Alike 2.0 License (check out the footer).

    http://creativecommons.org/images/deed/seal.png

    We often use creative commons licensed photographs to illustrate this blog, and felt that being developers of open source software it was quite logical that some of our content should be published under a similar license. Some of the documentation that we release also uses this license, for example the "Building Salome" documentation. This license footer has been integrated to the cubicweb-blog package that is used to publish our sites (as part of cubicweb-forge).


  • Launching Python scripts via Condor

    2010/02/17
    http://farm2.static.flickr.com/1362/1402963775_0185d2e62f.jpg

    As part of an ongoing customer project, I've been learning about the Condor queue management system (actually it is more than just a batch queue management system, tackling the high-throughput computing problem, but in my current project we're not using the full possibilities of Condor, and the choice was dictated by other considerations outside the scope of this note). The documentation is excellent, and the features of the product are really amazing (pity the project runs on Windows, and we cannot use 90% of these...).

    To launch a job on a computer participating in the Condor farm, you just have to write a job file which looks like this:

    Universe=vanilla
    Executable=$path_to_executabe
    Arguments=$arguments_to_the_executable
    InitialDir=$working_directory
    Log=$local_logfile_name
    Output=$local_file_for_job_stdout
    Error=$local_file_for_job_stderr
    Queue
    

    and then run condor_submit my_job_file and use condor_q to monitor the status of your job (queued, running...).
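
    My own code generates such job files from Python; a simplified, hypothetical sketch of that generation step (file names, arguments and the helper are placeholders of mine) could look like:

    import subprocess

    def write_job_file(path, executable, arguments, workdir):
        # build a Condor job description like the template shown above
        lines = ['Universe=vanilla',
                 'Executable=%s' % executable,
                 'Arguments=%s' % arguments,
                 'InitialDir=%s' % workdir,
                 'Log=job.log',
                 'Output=job.out',
                 'Error=job.err',
                 'Queue']
        job = open(path, 'w')
        job.write('\n'.join(lines) + '\n')
        job.close()

    write_job_file('my_job_file', 'run_script.bat', 'my_script.py', r'C:\work')
    # hand the description over to Condor
    subprocess.check_call(['condor_submit', 'my_job_file'])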

    My program generates Condor job files and submits them, and I spent hours yesterday trying to understand why they were all failing: the stderr file contained a message from Python complaining that it could not import site before exiting.

    A point which was not clear in the documentation I read (but I probably overlooked it) is that the executable mentioned in the job file is supposed to be a local file on the submission host, which is copied to the computer running the job. In the jobs generated by my code, I was using sys.executable for the Executable field, and a path to the python script I wanted to run in the Arguments field. This resulted in the Python interpreter being copied onto the execution host and not being able to run because it was not able to find the standard files it needs at startup.

    Once I figured this out, the fix was easy: I made my program write a batch script which launched the Python script and changed the job to run that script.

    UPDATE: I'm told there is a Transfer_executable=False line I could have put in the job file to achieve the same thing.

    (photo by gudi&cris licensed under CC-BY-ND)


  • Adding Mercurial build identification to Python

    2010/02/15 by Andre Espaze

    This work is part of the build identification task found in PEP 385, Migrating from svn to Mercurial: http://www.python.org/dev/peps/pep-0385/. It was done during the Mercurial sprint hosted at Logilab. If you would like to see the result, just follow these steps:

    hg clone http://hg.xavamedia.nl/cpython/pymigr/
    cd pymigr/build-identification
    

    Setting up the environment

    The current Python development branch is checked out first:

    svn co http://svn.python.org/projects/python/trunk
    

    A patch is then applied to add the 'sys.mercurial' attribute and modify the build information:

    cp add-hg-build-id.diff trunk/
    cd trunk
    svn up -r 78019
    patch -p0 < add-hg-build-id.diff
    

    The changes made to 'configure.in' then need to be propagated to the configure script:

    autoconf
    

    The configuration is then done by:

    ./configure --enable-shared --prefix=/dev/null
    

    You should now see changes propagated to the Makefile for finding the revision, the tag and the branch:

    grep MERCURIAL Makefile
    

    Finally, Python can be built:

    make
    

    The sys.mercurial attribute should already be present:

    LD_LIBRARY_PATH=. ./python
    >>> import sys
    >>> sys.mercurial
    ('CPython', '', '')
    

    No tag or revision has been found, as there was no mercurial repository. A test of Py_GetBuildInfo() in the C API will also be built:

    gcc -o show-build-info -I. -IInclude -L. -lpython2.7 ../show-build-info.c
    

    You can test its result by:

    LD_LIBRARY_PATH=. ./show-build-info
    -> default, Feb  7 2010, 15:07:46
    

    Manual test

    First a fake mercurial tree is built:

    hg init
    hg add README
    hg ci -m "Initial repo"
    hg id
    -> 84a6de74e48f tip
    

    Now Python needs to be built with the given mercurial information:

    rm Modules/getbuildinfo.o
    make
    

    You should then see the current revision number:

    LD_LIBRARY_PATH=. ./python
    >>> import sys
    >>> sys.mercurial
    ('CPython', 'default', '84a6de74e48f')
    

    and the C API can be tested by:

    LD_LIBRARY_PATH=. ./show-build-info
    -> default:84a6de74e48f, Feb  7 2010, 15:10:13
    

    The fake mercurial repository can now be cleaned:

    rm -rf .hg
    

    Automatic tests

    Automatic tests checking the behavior in every case will build Python and clean up afterwards. Those tests only work when run from Python's svn trunk directory:

    python ../test_build_identification.py
    

    Further work

    The current work is only a first attempt to add mercurial build identification to Python; it still needs to be checked in production cases. Moreover, build identification on Windows has not been started yet; it will need to be integrated into the Microsoft Visual Studio build process.


  • Why you should get rid of os.system, os.popen, etc. in your code

    2010/02/12

    I regularly come across code such as:

    output = os.popen('diff -u %s %s' % (appl_file, ref_file), 'r')
    

    Code like this might well work on your machine, but it is buggy and will fail (preferably during the demo or once shipped).

    Where is the bug?

    It is in the use of %s, which can inject into your command any string you want, and also strings you don't want. The problem is that you probably did not check appl_file and ref_file for weird things (spaces, quotes, semicolons...). Putting quotes around the %s in the string will not solve the issue.

    So what should you do? The answer is "use the subprocess module": subprocess.Popen takes a list of arguments as first parameter, which are passed as-is to the new process creation system call of your platform, and not interpreted by the shell:

    pipe = subprocess.Popen(['diff', '-u', appl_file, ref_file], stdout=subprocess.PIPE)
    output = pipe.stdout
    

    By now, you should have guessed that the shell=True parameter of subprocess.Popen should not be used unless you really really need it (and even then, I encourage you to question that need).
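
    As a small addition of mine (not in the original post), communicate() is usually preferable to reading pipe.stdout directly, since it waits for the process to finish and gives access to the return code:

    import subprocess

    pipe = subprocess.Popen(['diff', '-u', appl_file, ref_file],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, errors = pipe.communicate()
    # diff exits with 0 (no difference), 1 (differences found) or 2 (trouble)
    if pipe.returncode == 2:
        raise RuntimeError('diff failed: %s' % errors)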


  • Apycot for Mercurial

    2010/02/11 by Pierre-Yves David
    http://www.logilab.org/image/20439?vid=download

    What is apycot

    apycot is a highly extensible test automation tool used for Continuous Integration. It can:

    • download the project from a version controlled repository (like SVN or Hg);
    • install it from scratch with all dependencies;
    • run various checkers;
    • store the results in a CubicWeb database;
    • post-process the results;
    • display the results in various formats (html, xml, pdf, mail, RSS...);
    • repeat the whole procedure with various configurations;
    • get triggered by new changesets or run periodically.

    For an example, take a look at the "test reports" tab of the logilab-common project.

    Setting up an apycot for Mercurial

    During the mercurial sprint, we set up a proof-of-concept environment running six different checkers:

    • Check syntax of all python files.
    • Check syntax of all documentation files.
    • Run pylint on the mercurial source code with the mercurial pylintrc.
    • Run the check-code.py script included in mercurial, which checks style and python errors.
    • Run the Mercurial's test suite.
    • Run Mercurial's benchmark on a reference repository.

    The first three checkers, shipped with apycot, were set up quickly. The last three are mercurial specific and required a few additional tweaks to be integrated into apycot.

    The bot was set up to run on all public mercurial repositories. Five checkers immediately proved useful as they pointed out some errors or warnings (on some rarely used contrib files they even found a syntax error).

    Prospectives

    A public instance is being set up. It will provide features that the community is looking forward to:

    • testing all python versions;
    • running pure python or the C variant;
    • code coverage of the test suite;
    • performance history.

    Conclusion

    apycot proved to be highly flexible and could quickly be adapted to Mercurial's test suite, even by people new to apycot. The advantage of continuously running different long-running tests is obvious, so apycot seems to be a very valuable tool for improving the software development process.


  • SCons presentation in 5 minutes

    2010/02/09 by Andre Espaze
    http://www.scons.org/scons-logo-transparent.png

    Building software with SCons requires having Python and SCons installed.

    As SCons is only made of Python modules, the sources may be shipped with your project if your clients cannot install dependencies. All the following examples can be downloaded at the end of this blog post.

    A building tool for every file extension

    First, a Fortran 77 program made of two files will be built:

    $ cd fortran-project
    $ scons -Q
    gfortran -o cfib.o -c cfib.f
    gfortran -o fib.o -c fib.f
    gfortran -o compute-fib cfib.o fib.o
    $ ./compute-fib
     First 10 Fibonacci numbers:
      0.  1.  1.  2.  3.  5.  8. 13. 21. 34.
    

    The '-Q' option tells SCons to be less verbose. To clean the project, add the '-c' option:

    $ scons -Qc
    Removed cfib.o
    Removed fib.o
    Removed compute-fib
    

    From this first example, it can be seen that SCons finds the 'gfortran' tool from the file extension. Have a look at the user's manual if you want to set a particular tool.
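
    For reference, the SConstruct file driving such a Fortran build could be as small as the following sketch (file names taken from the output above; the rest is my guess at the example's content):

    # SConstruct - build the Fortran 77 example with the default environment
    Program('compute-fib', ['cfib.f', 'fib.f'])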

    Describing the construction with Python objects

    A second example, a C program, will run its test directly from the SCons file thanks to an added test command:

    $ cd c-project
    $ scons -Q run-test
    gcc -o test.o -c test.c
    gcc -o fact.o -c fact.c
    ar rc libfact.a fact.o
    ranlib libfact.a
    gcc -o test-fact test.o libfact.a
    run_test(["run-test"], ["test-fact"])
    OK
    

    However running scons alone builds only the main program:

    $ scons -Q
    gcc -o main.o -c main.c
    gcc -o compute-fact main.o libfact.a
    $ ./compute-fact
    Computing factorial for: 5
    Result: 120
    

    This second example shows that construction dependencies are described by passing Python objects around. An interesting point is the possibility of adding your own Python functions to the build process.
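
    The run_test action seen in the output above could, for instance, be hooked up with a custom Python function along these lines (a sketch of the idea, not the exact SConstruct shipped with the downloadable example):

    import subprocess

    def run_test(target, source, env):
        # execute the freshly built test binary; a non-zero return fails the build
        if subprocess.call([source[0].abspath]) != 0:
            return 1
        print 'OK'
        return 0

    env = Environment()
    test_prog = env.Program('test-fact', ['test.c', 'fact.c'])
    # 'run-test' is a pseudo-target: the function runs each time it is requested
    env.Command('run-test', test_prog, run_test)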

    Hierarchical build with environment

    A third C++ program will create a shared library used for two different programs: the main application and a test suite. The main application can be built by:

    $ cd cxx-project
    $ scons -Q
    g++ -o main.o -c -Imbdyn-src main.cxx
    g++ -o mbdyn-src/nodes.os -c -fPIC -Imbdyn-src mbdyn-src/nodes.cxx
    g++ -o mbdyn-src/solver.os -c -fPIC -Imbdyn-src mbdyn-src/solver.cxx
    g++ -o mbdyn-src/libmbdyn.so -shared mbdyn-src/nodes.os mbdyn-src/solver.os
    g++ -o mbdyn main.o -Lmbdyn-src -lmbdyn
    

    It shows that SCons handles for us the compilation flags needed to create a shared library, according to the tool (-fPIC). Moreover, extra construction variables have been given (CPPPATH, LIBPATH, LIBS), which are all translated for the chosen tool. All those variables can be found in the user's manual or in the man page. The test suite is built and run by giving an extra variable:

    $ TEST_CMD="LD_LIBRARY_PATH=mbdyn-src ./%s" scons -Q run-tests
    g++ -o tests/run_all_tests.o -c -Imbdyn-src tests/run_all_tests.cxx
    g++ -o tests/test_solver.o -c -Imbdyn-src tests/test_solver.cxx
    g++ -o tests/all-tests tests/run_all_tests.o tests/test_solver.o -Lmbdyn-src -lmbdyn
    run_test(["tests/run-tests"], ["tests/all-tests"])
    OK
    

    Conclusion

    It is rather convenient to build software by manipulating Python objects, and custom actions can be added to the process. SCons also has a configuration mechanism working like autotools macros, which can be discovered in the user's manual.


  • Extended 256 colors in bash prompt

    2010/02/07 by Nicolas Chauvat

    The Mercurial 1.5 sprint is taking place in our offices this weekend and pair-programming with Steve made me want a better looking terminal. Have you seen his extravagant zsh prompt? I used to have only 8 colors to decorate my shell prompt, but thanks to some time spent playing around, I now have 256.

    Here is what I used to have in my bashrc for 8 colors:

    NO_COLOUR="\[\033[0m\]"
    LIGHT_WHITE="\[\033[1;37m\]"
    WHITE="\[\033[0;37m\]"
    GRAY="\[\033[1;30m\]"
    BLACK="\[\033[0;30m\]"
    
    RED="\[\033[0;31m\]"
    LIGHT_RED="\[\033[1;31m\]"
    GREEN="\[\033[0;32m\]"
    LIGHT_GREEN="\[\033[1;32m\]"
    YELLOW="\[\033[0;33m\]"
    LIGHT_YELLOW="\[\033[1;33m\]"
    BLUE="\[\033[0;34m\]"
    LIGHT_BLUE="\[\033[1;34m\]"
    MAGENTA="\[\033[0;35m\]"
    LIGHT_MAGENTA="\[\033[1;35m\]"
    CYAN="\[\033[0;36m\]"
    LIGHT_CYAN="\[\033[1;36m\]"
    
    # set a fancy prompt
    export PS1="${RED}[\u@\h \W]\$${NO_COLOUR} "
    

    Just put the following lines in your bashrc to get the 256 colors:

    function EXT_COLOR () { echo -ne "\[\033[38;5;$1m\]"; }
    
    # set a fancy prompt
    export PS1="`EXT_COLOR 172`[\u@\h \W]\$${NO_COLOUR} "
    

    Yay, I now have an orange prompt! My next step is to write a script that will display useful information depending on the context. Displaying the status of the mercurial repository I am in might be a good start.
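
    To pick your own number, a quick throwaway loop (my addition, not part of the original post) prints the whole 256-colour palette with its codes:

    import sys

    # print the 256 xterm colours with their codes, 16 per line
    for code in range(256):
        sys.stdout.write('\033[38;5;%dm%3d\033[0m ' % (code, code))
        if code % 16 == 15:
            sys.stdout.write('\n')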


  • We're happy to host the mercurial Sprint

    2010/02/02 by Arthur Lutz
    http://farm1.static.flickr.com/183/419945378_4ead41a76d_m.jpg

    We're very happy to be hosting the next mercurial sprint in our brand new offices in central Paris. It is quite an honor to be chosen when the other contender was Google.

    So a bunch of mercurial developers are heading out to our offices this coming Friday to sprint for three days on mercurial. We use mercurial a lot here at Logilab and we also contribute a tool to visualize and manipulate a mercurial repository: hgview.

    To check out the things that we will be working on with the mercurial crew, check out the program of the sprint on their wiki.

    What is a sprint? "A sprint (sometimes called a Code Jam or hack-a-thon) is a short time period (three to five days) during which software developers work on a particular chunk of functionality. "The whole idea is to have a focused group of people make progress by the end of the week," explains Jeff Whatcott" [source]. For geographically distributed open source communities, it is also a way of physically meeting and working in the same room for a period of time.

    Sprinting is a practice that we encourage at Logilab: with CubicWeb we organize open sprints as often as possible, which is an opportunity for users and developers to come and code with us. We even use the sprint format for some internal work.

    photo by Sebastian Mary under creative commons licence.


  • hgview 1.2.0 released

    2010/01/21 by David Douard

    Here, at last, is the release of version 1.2.0 of hgview.

    http://www.logilab.org/image/19894?vid=download

    In a nutshell, this release includes:

    • a basic support for mq extension,
    • a basic support for hg-bfiles extension,
    • working directory is now displayed as a node of the graph (if there are local modifications of course),
    • it's now possible to display only the subtree from a given revision (a bit like hg log -f)
    • it's also possible to activate an annotate view (which makes navigation slower, however),
    • several improvements in the graph filling and rendering mechanisms,
    • I also added toolbar icons for the search and goto "quickbars" so they are no longer "hidden" from those reluctant to read user manuals,
    • it's now possible to go directly to the common ancestor of 2 revisions,
    • when on a merge node, it's now possible to choose the parent the diff is computed against,
    • make search also search in commit messages (it used to search only in diff contents),
    • and several bugfixes of course.
    Notes:
    there are packages for debian lenny, squeeze and sid, and for ubuntu hardy, intrepid, jaunty and karmic. However, for lenny and hardy, the provided packages won't work on stock installations since hgview 1.2 depends on mercurial 1.1. Thus for these 2 distributions, packages will only work if you have installed backported mercurial packages.

  • New supported repositories for Debian and Ubuntu

    2010/01/21 by Arthur Lutz

    With the release of hgview 1.2.0 in our Karmic Ubuntu repository, we would like to announce that we are now going to generate packages for the following distributions:

    • Debian Lenny (because it's stable)
    • Debian Sid (because it's the dev branch)
    • Ubuntu Hardy (because it has Long Term Support)
    • Ubuntu Karmic (because it's the current stable)
    • Ubuntu Lucid (because it's the next stable) - no repo yet, but soon...
    http://img.generation-nt.com/ubuntulogo_0080000000420571.png

    The old packages for the previously supported distributions are still accessible (etch, jaunty, intrepid), but new versions will not be generated for these repositories. Packages will be coming in as versions get released; if you need a package before then, give us a shout and we'll see what we can do.

    For instructions on how to use the repositories for Ubuntu or Debian, go to the following page : http://www.logilab.org/card/LogilabDebianRepository


  • Open Source/Design Hardware

    2009/12/13 by Nicolas Chauvat
    http://www.logilab.org/image/19338?vid=download

    I have been doing free software since I discovered it existed. I bought an OpenMoko some time ago, since I am interested in anything that is open, including artwork like books, music, movies and... hardware.

    I just learned about two lists, one at Wikipedia and another one at MakeOnline, but Google has more. Explore and enjoy!


  • Solution to a common Mercurial task

    2009/12/10 by David Douard

    An interesting question has just been sent by Greg Ward on the Mercurial devel mailing-list (by a funny coincidence, I had to solve this very problem a few days ago).

    Let me quote his message:

    here's my problem: imagine a customer is running software built from
    changeset A, and we want to upgrade them to a new version, built from
    changeset B.  So I need to know what bugs are fixed in B that were not
    fixed in A.  I have already implemented a changeset/bug mapping, so I
    can trivially lookup the bugs fixed by any changeset.  (It even handles
    "ongoing" and "reverted" bugs in addition to "fixed".)
    

    And he gives an example of situation where a tricky case may be found:

                    +--- 75 -- 78 -- 79 ------------+
                   /                                 \
                  /     +-- 77 -- 80 ---------- 84 -- 85
                 /     /                        /
    0 -- ... -- 74 -- 76                       /
                       \                      /
                        +-- 81 -- 82 -- 83 --+
    

    So what is the problem?

    Imagine the latest distributed stable release is built on rev 81. Now, I need to publish a new bugfix release based on this latest stable version, including every changeset that is a bugfix but has not yet been applied at revision 81.

    So the first problem we need to solve is answering: which revisions are ancestors of revision 85 but not ancestors of revision 81?

    Command line solution

    Using hg commands, the solution is proposed by Steve Losh:

    hg log --template '{rev}\n' --rev 85:0 --follow --prune 81
    

    or better, as suggested by Matt:

    hg log -q --template '{rev}\n' --rev 85:0 --follow --prune 81
    

    The second is better since it only reads the index, and thus is much faster. But on big repositories, this command remains quite slow (in Greg's situation, a repo of more than 100000 revisions, the command takes more than 2 minutes).

    Python solution

    Using Python, one may think of using revlog.nodesbetween(), but it won't work as wanted here: it does not list revisions 75, 78 and 79.

    On the mailing list, Matt gave the simplest and most efficient solution:

    cl = repo.changelog
    a = set(cl.ancestors(81))
    b = set(cl.ancestors(85))
    revs = b - a
    

    Idea for a new extension

    Using this simple python code, it should be easy to write a nice Mercurial extension (which could be named missingrevisions) to do this job.
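
    A rough sketch of what such an extension could look like, against the Mercurial API of that time (command registration details and option handling are illustrative only, not tested code):

    # missingrevisions.py - hypothetical extension sketch
    def missingrevs(ui, repo, rev, **opts):
        """show revisions that are ancestors of HEADREV but not of REV"""
        head = opts.get('head') or 'tip'
        cl = repo.changelog
        known = set(cl.ancestors(repo[rev].rev()))
        wanted = set(cl.ancestors(repo[head].rev()))
        for r in sorted(wanted - known):
            ui.write('%d:%s\n' % (r, repo[r]))

    cmdtable = {
        'missingrevs': (missingrevs,
                        [('H', 'head', '', 'head revision (defaults to tip)')],
                        'hg missingrevs REV [-H HEADREV]'),
    }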

    It would also be interesting to implement some filtering feature. For example, if simple conventions are used in commit messages, e.g. something like "[fix #1245]" or "[close #1245]" when the changeset fixes a bug listed in the bugtracker, then we may type commands like:

    hg missingrevs REV -f bugfix
    

    or:

    hg missingrevs REV -h HEADREV -f bugfix
    

    to find bugfix revisions ancestors of HEADREV that are not ancestors of REV.

    The filters (bugfix here) could be made configurable in hgrc using regexps.


  • pylint bug day report

    2009/12/04 by Pierre-Yves David
    http://farm1.static.flickr.com/85/243306920_6a12bb48c7.jpg

    The first pylint bug day took place on Wednesday 25th. Four members of the Logilab crew and two other people spent the day working on pylint.

    Several patches submitted before the bug day were processed and some tickets were closed.

    Charles Hébert added James Lingard's patches for string formatting and is working on several improvements. Vincent Férotin submitted a patch for simple message listings. Sylvain Thenault fixed significant inference bugs in astng (an underlying module of pylint managing the syntax tree). Émile Anclin began a major astng refactoring to take advantage of new python 2.6 functionality. For my part, I made several improvements to the test suite. I applied James Lingard's patches for the ++ operator and generalised them to -- too. I also added a new checker for function call arguments, submitted by James Lingard once again. Finally I improved the message filtering of the --errors-only option.

    We thank Maarten ter Huurne and Vincent Férotin for their participation, and of course James Lingard for submitting numerous patches.

    Another pylint bug day will be held in a few months.

    image under creative commons by smccann


  • Resume of the first Coccinelle users day

    2009/11/30 by Andre Espaze

    A matching and transformation tool for systems code

    Coccinelle's goal is to ease code maintenance, first by revealing code smells based on design patterns, and second by easing an API (Application Programming Interface) change for a heavily used library. Coccinelle can thus be seen as two tools in one: the first matches patterns, the second applies transformations. Facing such a big problem, the project needed to define boundaries in order to increase its chances of success. The driving motivation was thus to target the Linux kernel. This choice implied a tool working on the C programming language before the preprocessor step. Moreover, the Linux code base adds interesting constraints: it is huge, contains many possible configurations depending on C macros, may contain many bugs and evolves a lot. What was Coccinelle's solution for easing kernel maintenance?

    http://farm1.static.flickr.com/151/398536506_57df539ccf_m.jpg

    Generating diff files from the semantic patch language

    The Linux community reads lots of diff files to follow the kernel's evolution. As a consequence, the diff file syntax is widespread and commonly understood. However, this syntax describes a particular change between two files; it does not allow matching a generic pattern.

    Coccinelle's solution is to build its own language for declaring rules that describe a code pattern and a possible transformation. This language is the Semantic Patch Language (SmPL), based on the declarative approach of the diff file syntax. It allows propagating a change rule to many files by generating diff files. These results can then be applied directly using the patch command, but most of the time they will be reviewed and may be slightly adapted to the programmer's needs.

    A Coccinelle rule is made of two parts: a metavariable declaration, and a code pattern match followed by a possible transformation. A metavariable stands for a program element in the control flow; its possible names inside the program do not matter. The code pattern then describes a particular control flow in the program, using the C and SmPL syntaxes to manipulate the metavariables. As a result, Coccinelle manages to generate diff files because it works on the C program's control flow.

    A complete SmPL description will not be given here, as it can be found in Coccinelle's documentation. However, here is a brief introduction to declaring a rule. The metavariable part looks like this:

    @@
    expression E;
    constant C;
    @@
    

    'expression' means a variable or the result of a function call, while 'constant' means a C constant. Then, to negate the result of an AND operation between an expression and a constant instead of negating the expression first, the transformation part will be:

    - !E & C
    + !(E & C)
    

    A file containing several rules like this is called a semantic patch. It is applied using the Coccinelle 'spatch' command, which generates a change written in the diff file syntax each time the above pattern is matched. The next section illustrates this way of working.

    http://www.simplehelp.net/wp-images/icons/topic_linux.jpg

    A working example on the Linux kernel 2.6.30

    If you want to run the following example, you can download and install the Coccinelle 'spatch' command from its website: http://coccinelle.lip6.fr/. Let's first consider the following structure with accessors in the header 'device.h':

    struct device {
        void *driver_data;
    };
    
    static inline void *dev_get_drvdata(const struct device *dev)
    {
        return dev->driver_data;
    }
    
    static inline void dev_set_drvdata(struct device *dev, void* data)
    {
        dev->driver_data = data;
    }
    

    It imitates the 2.6.30 kernel header 'include/linux/device.h'. Let's now consider the following client code that does not use the accessors:

    #include <stdlib.h>
    #include <assert.h>
    
    #include "device.h"
    
    int main()
    {
        struct device devs[2], *dev_ptr;
        int data[2] = {3, 7};
        void *a = NULL, *b = NULL;
    
        devs[0].driver_data = (void*)(&data[0]);
        a = devs[0].driver_data;
    
        dev_ptr = &devs[1];
        dev_ptr->driver_data = (void*)(&data[1]);
        b = dev_ptr->driver_data;
    
        assert(*((int*)a) == 3);
        assert(*((int*)b) == 7);
        return 0;
    }
    

    Once this code is saved in the file 'fake_device.c', we can check that it compiles and runs:

    $ gcc fake_device.c && ./a.out
    

    We will now create a semantic patch 'device_data.cocci' that tries to add the getter accessor, with this first rule:

    @@
    struct device dev;
    @@
    - dev.driver_data
    + dev_get_drvdata(&dev)
    

    The 'spatch' command is then run by:

    $ spatch -sp_file device_data.cocci fake_device.c
    

    producing the following change in a diff file:

    -    devs[0].driver_data = (void*)(&data[0]);
    -    a = devs[0].driver_data;
    +    dev_get_drvdata(&devs[0]) = (void*)(&data[0]);
    +    a = dev_get_drvdata(&devs[0]);
    

    which illustrates how well Coccinelle works on the program's control flow. However, the transformation has also matched code where the setter accessor should be used. We will thus add a rule above the previous one; the semantic patch becomes:

    @@
    struct device dev;
    expression data;
    @@
    - dev.driver_data = data
    + dev_set_drvdata(&dev, data)
    
    @@
    struct device dev;
    @@
    - dev.driver_data
    + dev_get_drvdata(&dev)
    

    Running the command again will produce the desired output:

    $ spatch -sp_file device_data.cocci fake_device.c
    -    devs[0].driver_data = (void*)(&data[0]);
    -    a = devs[0].driver_data;
    +    dev_set_drvdata(&devs[0], (void *)(&data[0]));
    +    a = dev_get_drvdata(&devs[0]);
    

    It is important to write the setter rule before the getter rule, otherwise the getter rule would be applied first to the whole file.

    At this point our semantic patch is still incomplete because it does not work on 'device' structure pointers. By using the same logic, let's add it to the 'device_data.cocci' semantic patch:

    @@
    struct device dev;
    expression data;
    @@
    - dev.driver_data = data
    + dev_set_drvdata(&dev, data)
    
    @@
    struct device * dev;
    expression data;
    @@
    - dev->driver_data = data
    + dev_set_drvdata(dev, data)
    
    @@
    struct device dev;
    @@
    - dev.driver_data
    + dev_get_drvdata(&dev)
    
    @@
    struct device * dev;
    @@
    - dev->driver_data
    + dev_get_drvdata(dev)
    

    Running Coccinelle again:

    $ spatch -sp_file device_data.cocci fake_device.c
    

    will add the remaining transformations for the 'fake_device.c' file:

    -    dev_ptr->driver_data = (void*)(&data[1]);
    -    b = dev_ptr->driver_data;
    +    dev_set_drvdata(dev_ptr, (void *)(&data[1]));
    +    b = dev_get_drvdata(dev_ptr);
    

    but a new problem appears: the 'device.h' header is also modified. Here we meet an important point of the Coccinelle philosophy described in the first section: 'spatch' is a tool that eases code maintenance by propagating a code pattern change to many files, but the resulting diff files are meant to be reviewed, and in our case the unwanted modification should be removed. Note that it would be possible to avoid modifying the 'device.h' header by using SmPL syntax, but the explanation would be too much for a starting tutorial. Instead, we will simply cut out the unwanted part:

    $ spatch -sp_file device_data.cocci fake_device.c | cut -d $'\n' -f 16-34
    

    We now store this result in a diff file, additionally asking 'spatch' to produce the patch for the current working directory:

    $ spatch -sp_file device_data.cocci -patch "" fake_device.c | \
    cut -d $'\n' -f 16-34 > device_data.patch
    

    It is now time to apply the change to get working C code that uses the accessors:

    $ patch -p1 < device_data.patch
    

    The final result for 'fake_device.c' should be:

    #include <stdlib.h>
    #include <assert.h>
    
    #include "device.h"
    
    int main()
    {
        struct device devs[2], *dev_ptr;
        int data[2] = {3, 7};
        void *a = NULL, *b = NULL;
    
        dev_set_drvdata(&devs[0], (void *)(&data[0]));
        a = dev_get_drvdata(&devs[0]);
    
        dev_ptr = &devs[1];
        dev_set_drvdata(dev_ptr, (void *)(&data[1]));
        b = dev_get_drvdata(dev_ptr);
    
        assert(*((int*)a) == 3);
        assert(*((int*)b) == 7);
        return 0;
    }
    

    Finally, we can test that the code compiles and runs:

    $ gcc fake_device.c && ./a.out
    

    The semantic patch is now ready to be used on the Linux 2.6.30 kernel:

    $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.30.tar.bz2
    $ tar xjf linux-2.6.30.tar.bz2
    $ spatch -sp_file device_data.cocci -dir linux-2.6.30/drivers/net/ \
      > device_drivers_net.patch
    $ wc -l device_drivers_net.patch
    642
    

    You may also try the 'drivers/ieee1394' directory.

    http://coccinelle.lip6.fr/img/lip6.jpg

    Conclusion

    Coccinelle is made of around 60 thousand lines of Objective Caml. As illustrated by the above example on the Linux kernel, the 'spatch' command succeeds in easing code maintenance. For the Coccinelle team working on the kernel code base, a semantic patch is usually around 100 lines and may generate diff files touching sometimes hundreds of files. Moreover, the processing is rather fast: the average time per file is said to be 0.7s.

    Two tools using the 'spatch' engine have already been built: 'spdiff' and 'herodotos'. With the first one you could almost avoid learning the SmPL language, because the idea is to generate a semantic patch by looking at transformations between pairs of files. The second allows correlating defects over software versions once the corresponding code smells have been described in SmPL.

    One of Coccinelle's problems is that it is not easily extendable to other languages, as the engine was designed for analyzing control flow in C programs. The C++ language may be added, but that would obviously require a lot of work. It would be great to also have such a tool for dynamic languages like Python.

    image under creative commons by Rémi Vannier


  • pylint bug day next wednesday!

    2009/11/23 by Sylvain Thenault

    Remember that the first pylint bug day will be held on Wednesday, November 25, from around 8am to 8pm in the Paris (France) time zone.

    We'll be a few people at Logilab and hopefully a lot of other guys all around the world, trying to make pylint better.

    Join us in the #public conference room on conference.jabber.logilab.org, or, if you prefer using an IRC client, join #public on irc.logilab.org, which is a gateway to the jabber forum. And if you're in Paris, come work with us in our office.

    People willing to help but without knowledge of pylint internals are welcome: it's the perfect occasion to learn a lot about it, and to be able to hack on pylint in the future!


  • First contact with pupynere

    2009/11/06 by Pierre-Yves David

    I spent some time this week evaluating Pupynere, the PUre PYthon NEtcdf REader written by Roberto De Almeida. I see several advantages in pupynere.

    First, it's a pure Python module with no external dependency. It doesn't even depend on the NetCDF library, and it is therefore very easy to deploy.

    http://www.unidata.ucar.edu/software/netcdf/netcdf1_sm.png

    Second, it offers the same interface as Scientific Python's NetCDF bindings, which makes transitioning from one module to another very easy.

    Third, pupynere is being integrated into SciPy as the scipy.io.netcdf module. Once integrated, this could ensure wide adoption by the Python community.

    Finally, it's easy to dig into this clear and small code base of about 600 lines. I have just sent several fixes and bug reports to the author.
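
    As a rough illustration, here is a hedged sketch of what reading a file with pupynere looks like, assuming it exposes the Scientific-Python-compatible NetCDFFile entry point; the file name and the variable name are made up for the example:

    # minimal sketch; 'data.nc' and 'temperature' are hypothetical names
    from pupynere import NetCDFFile

    f = NetCDFFile('data.nc', 'r')      # open an existing NetCDF file read-only
    temp = f.variables['temperature']   # variables are exposed as array-like objects
    print temp.shape                    # dimensions of the variable
    print temp[0, :]                    # slicing reads the corresponding data
    f.close()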

    http://docs.scipy.org/doc/_static/scipyshiny_small.png

    However, pupynere isn't mature yet. First, it seems pupynere has only been used for simple cases so far; many common cases are broken. Moreover, there is no support for newer NetCDF formats such as long-NetCDF and NetCDF4, and important features such as file update are still missing. In addition, the lack of a test suite is a serious issue. In my opinion, various bugs could already have been detected and fixed with simple unit tests. Contributions would be much more comfortable with the safety net offered by a test suite. I am not certain that the fixes and improvements I made this week did not introduce regressions.

    To conclude, pupynere seems too young for production use. But I invite people to try it and provide feedback and fixes to the author. I'm looking forward to using this project in production in the future.


  • First Pylint Bug Day on Nov 25th, 2009 !

    2009/10/21 by Sylvain Thenault
    http://www.logilab.org/image/18785?vid=download

    Since we never stop being overloaded here at Logilab, and we've got some encouraging feedback after the "Pylint needs you" post, we decided to take some time to introduce more "community" into pylint.

    And the easiest thing to do, sooner rather than later, is an IRC/Jabber-synchronized bug day, which will be held on Wednesday, November 25. We're based in France, so the main developers will be there between around 8am and 7pm UTC+1. If a few of you are around Paris at that time and wish to come to Logilab to sprint with us, contact us and we'll try to make this possible.

    The focus for this bug killing day could be:

    • using the logilab.org tracker: getting an account, submitting tickets, triaging existing tickets...
    • using mercurial to develop pylint / astng
    • guiding people through the code so they're able to fix simple bugs

    We will of course also try to kill a hella-lotta bugs, but the main idea is to help whoever wants to contribute to pylint... and to plan for the next bug-killing day!

    As we are in the process of moving to another place, we can't organize a sprint yet, but we should have some room available for the next time, so stay tuned :)


  • Projman 0.14.0 includes a Graphical User Interface

    2009/10/19 by Emile Anclin

    Introduction

    Projman is a project manager. With projman 0.14.0, the first sketch of a GUI has been updated and important functionalities added. You can now easily see and edit task dependencies and test the resulting scheduling. Furthermore, a begin-after-end-previous constraint has been added, which should really simplify editing the schedule.

    The GUI can be used in the two following ways:

    $ projman-gui
    $ projman-gui <path/to/project.xml>
    

    The file <path/to/project.xml> is the well-known main file of a projman project. Whether you start projman-gui without specifying a project.xml or already have a project open, you can open an existing project simply with "File->Open". (For now, you can't create a new project with projman-gui.) You can edit the tasks and then save the modifications to the task file with "File->Save".

    http://www.logilab.org/image/18731?vid=download

    The Project tab

    The Project tab simply shows the four files a projman project needs: resources, activities, tasks and schedule.

    Resources

    The Resources tab presents the different resources:

    • human resources
    • resource roles describing the different roles that resources can play
    • different calendars for different resources with their "offdays"

    Activities

    For now, the Activities tab is not implemented. It should show the planning of the activities for each resource and the progress of the project.

    Tasks

    The Tasks tab is for now the most important one; it shows a tree view of the task hierarchy, and for each task:

    • the title of the task,
    • the role for that task,
    • the load (time in days),
    • the scheduling type,
    • the list of the constraints for the scheduling,
    • and the description of the task,

    each of which can be edited. You can easily drag and drop tasks inside the task tree, and add and delete tasks and constraints.

    See the attached screenshot of the projman-gui task panel.

    Scheduling

    In the Scheduling tab you can simply test your scheduling by clicking "START". If you expect the scheduling to take a long time, you can increase the maximum time allowed to search for a solution.

    Known bugs

    • The begin-after-end-previous constraint does not work for a task that has subtasks.
    • Deleting a task doesn't check for tasks that depend on it, so scheduling will no longer work.

  • hgview 1.1.0 released

    2009/09/25 by David Douard

    I am pleased to announce the latest release of hgview 1.1.0.

    What is it?

    For the ones at the back of the classroom near the radiator, let me remind you that hgview is a very helpful tool for daily work using the excellent DVCS Mercurial (which we use heavily at Logilab). It allows you to easily and visually navigate your hg repository's revision graph. It is written in Python and PyQt.

    http://www.logilab.org/image/18210?vid=download

    What's new

    • users can now configure the colors used in the diff area (which now default to white on black)
    • indicate current working directory position by a square node
    • add many other configuration options (listed when typing hg help hgview)
    • removed 'hg hgview-options' command in favor of 'hg help hgview'
    • add ability to choose which parent to diff with for merge nodes
    • dramatically improved UI behaviour (shortcuts)
    • improved the help and made it accessible from the GUI
    • make it possible not to display the diffstat column of the file list (which can dramatically improve performance on big repositories)
    • standalone application: improved command line options
    • indicate working directory position in the graph
    • add auto-reload feature (when the repo is modified due to a pull, a commit, etc., hgview detects it, reloads the repo and updates the graph)
    • fixed many bugs; in particular, the file log navigator should now display the whole graph

    Download and installation

    The source code is available as a tarball, or using our public hg repository of course.

    To use it from the sources, you just have to add a line in your .hgrc file, in the [extensions] section:

    hgext.hgview=/path/to/hgview/hgext/hgview.py

    Debian and Ubuntu users can also easily install hgview (and Logilab's other free software tools) using our deb package repositories.


  • Using tempfile.mkstemp correctly

    2009/09/10

    The mkstemp function in the tempfile module returns a tuple of 2 values:

    • an OS-level handle to an open file (as would be returned by os.open())
    • the absolute pathname of that file.

    I often see code using mkstemp only to get the filename of the temporary file, following a pattern such as:

    from tempfile import mkstemp
    import os
    
    def need_temp_storage():
        _, temp_path = mkstemp()
        os.system('some_command --output %s' % temp_path)
        file = open(temp_path, 'r')
        data = file.read()
        file.close()
        os.remove(temp_path)
        return data
    

    This seems to work fine, but there is a bug hiding in there: we have leaked a file descriptor. The bug will show up on Linux if you call this function many times in a long-running process, and on the first call on Windows.

    The first element of the tuple returned by mkstemp is typically an integer used to refer to a file by the OS. In Python, not closing a file is usually no big deal because the garbage collector will ultimately close the file for you, but here we are not dealing with file objects, but with OS-level handles. The interpreter sees an integer and has no way of knowing that the integer is connected to a file. On Linux, calling the above function repeatedly will eventually exhaust the available file descriptors. The program will stop with:

    IOError: [Errno 24] Too many open files: '/tmp/tmpJ6g4Ke'
    

    On Windows, it is not possible to remove a file which is still opened by another process, and you will get:

    Windows Error [Error 32]
    

    Fixing the above function requires closing the file descriptor using os.close():

    from tempfile import mkstemp
    import os
    
    def need_temp_storage():
        fd, temp_path = mkstemp()
        os.system('some_command --output %s' % temp_path)
        file = open(temp_path, 'r')
        data = file.read()
        file.close()
        os.close(fd)
        os.remove(temp_path)
        return data
    

    If you need your process to write directly to the temporary file, you don't need to call os.write(fd, data). The function os.fdopen(fd) will return a Python file object using the same file descriptor; closing that file object will close the OS-level file descriptor.
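
    For instance, here is a minimal sketch (the function name is made up) of writing to the temporary file through os.fdopen instead of reopening it by path:

    from tempfile import mkstemp
    import os

    def write_temp_data(data):
        # mkstemp returns an OS-level descriptor and the path of the file
        fd, temp_path = mkstemp()
        # fdopen wraps the descriptor in a regular Python file object
        temp_file = os.fdopen(fd, 'w')
        temp_file.write(data)
        # closing the file object also closes the underlying OS-level descriptor
        temp_file.close()
        return temp_path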


  • You can now register on our sites

    2009/09/03 by Arthur Lutz

    With the new version of CubicWeb deployed on our "public" sites, we would like to welcome a new (much awaited) functionality: you can now register directly on our websites. Getting an account will give you access to a bunch of functionalities:

    http://farm1.static.flickr.com/53/148921611_eadce4f5f5_m.jpg
    • registering to a project's activity will get you automated email reports of what is happening on that project
    • you can directly add tickets on projects instead of talking about them on the mailing lists
    • you can bookmark content
    • tag stuff
    • and much more...

    This is also a way of testing out the CubicWeb framework (in this case the forge cube), which you can take home and host yourself (Debian recommended). Just click on the "register" link at the top right, or here.

    Photo by wa7son under creative commons.


  • New pylint/astng release, but... pylint needs you !

    2009/08/27 by Sylvain Thenault

    After several months with no time to fix or enhance pylint besides answering email and filing tickets, I finally tackled some tasks yesterday evening to publish bug fix releases ([1] and [2]).

    The problem is that we don't have enough free time at Logilab to lower the number of tickets in the pylint tracker page. If you take a look at the ticket tab, you'll see a lot of pending bugs and must-have features (well, and some other less necessary ones...). You can already easily contribute thanks to the great Mercurial DVCS, and some of you do, either by providing patches or by reporting bugs (more tickets, iiirk! ;). Thank you all, btw!

    Now I was wondering what could be done to move pylint further, and the first ideas that came to my mind were:

    • do a ~3-day sprint
    • do some 'ticket killing' days, as done in some popular OSS projects

    But for this to be useful, we need your support, so here are some questions for you:

    • would you come to a sprint at Logilab (in Paris, France), so you can meet us, learn a lot about pylint, and work on tickets you wish to have in pylint?
    • if France is too far away for most people, would you have another location to propose?
    • would you be on Jabber for a ticket-killing day, provided it fits your agenda? If so, what's your knowledge of pylint/astng internals?

    You may answer by adding a comment to this blog (please register first by using the link at the top right of this page) or by mail to sylvain.thenault@logilab.fr. If we get enough positive answers, we'll take the time to organize such an event.


  • Looking for a Windows Package Manager

    2009/07/31 by Nicolas Chauvat
    http://www.logilab.org/image/9862?vid=download

    As said in a previous article, I am convinced that part of the motivation for making package sub-systems like the Python one (which includes distutils, setuptools, etc.) is that Windows users and Mac users never had the chance to use a tool that properly manages the configuration of their computer system. They just do not know what it would be like if they had at least a good package management system, and they do not miss it in their daily work.

    I looked for Windows package managers that claim to provide features similar to Debian's dpkg+apt-get and here is what I found in alphabetical order.

    AppSnap

    AppSnap is written in Python and uses wxPython, PyCurl and PyYAML. It is packaged using Py2Exe, compressed with UPX and installed using NSIS.

    It has not seen activity in the svn or on its blog since the end of 2008.

    Appupdater

    Appupdater provides functionality similar to apt-get or yum. It automates the process of installing and maintaining up to date versions of programs. It claims to be fully customizable and is licensed under the GPL.

    It seems under active development at SourceForge.

    QWinApt

    QWinApt is a Synaptic clone written in C# that has not evolved since September 2007.

    WinAptic

    WinAptic is another Synaptic clone written this time in Pascal that has not evolved since the end of 2007.

    Win-Get

    Win-get is an automated install system and software repository for Microsoft Windows. It is similar to apt-get: it connects to a link repository, finds an application and downloads it before performing the installation routine (silent or standard) and deleting the install file.

    It is written in Pascal and is set up as a SourceForge project, but not much has been done lately.

    WinLibre

    WinLibre is a Windows free software distribution that provides a repository of packages and a tool to automate and simplify their installation.

    WinLibre was selected for Google Summer of Code 2009.

    ZeroInstall

    ZeroInstall started as a "non-admin" package manager for Linux distributions and is now extending its reach to work on Windows.

    Conclusion

    I have not used any of these tools, the above is just the result of some time spent searching the web.

    A more limited approach is to notify the user of the newer versions:

    • App-Get will show you a list of your installed applications. When an update is available for one of them, it will be highlighted and you will be able to update the specific application in seconds.
    • GetIt is not an application-getter/installer. When you want to install a program, you can look it up in GetIt to choose which program to install from a master list of all programs made available by the various apt-get clones.

    The appupdater project also compares itself to the programs automating the installation of software on Windows.

    Some columnists expect the creation of application stores replicating the iPhone one.

    I once read about a project to get the Windows kernel into the Debian distribution, but I cannot find any trace of it... Remember that Debian is not limited to the Linux kernel, so why not think about a very improbable apt-get install windows-vista?


  • The Configuration Management Problem

    2009/07/31 by Nicolas Chauvat
    http://www.logilab.org/image/9863?vid=download

    Today I felt like summing up my opinion on a topic that was discussed this year on the Python mailing lists, at PyCon-FR, at EuroPython and EuroSciPy... packaging software! Let us discuss the two main use cases.

    The first use case is maintaining computer systems in production. A trait of production systems is that they cannot afford failures and are often deployed on a large scale. This leaves little room for manually fixing problems: either the installation process works or the system fails. Reaching that level of quality takes a lot of work.

    The second use case is to facilitate the life of software developers and computer users by making it easy for them to give a try to new pieces of software without much work.

    The first use case has to be addressed as a configuration management problem. There is no way around it. The best way I know of managing the configuration of a computer system is called Debian. Its package format and its tool chain provide a very extensive and efficient set of features for system development and maintenance. Of course it is not perfect, and there are missing bits and open issues that could be tackled, like the dependencies between hardware and software. For example, nothing will prevent you from installing on your Debian system a version of a driver that conflicts with the version of the chip found in your hardware. That problem could be solved, but I do not think the Debian project is there yet, and I do not count it as a reason to reject Debian since I have not seen any other competitor at the same level as Debian.

    The second use case is kind of a trap, for it concerns most computer users and most of those users are either convinced the first use case has nothing in common with their problem or convinced that the solution is easy and requires little work.

    The situation is made more complicated by the fact that most of those users never had the chance to use a system with proper package management tools. They simply do not know the difference and do not feel like they are missing anything when using their system-that-comes-with-a-windowing-system-included.

    Since many software developers have never had to maintain computer systems in production (often considered lowly sysadmin work) and never developed packages for computer systems that are maintained in production, they tend to think that the operating system and their software are perfectly decoupled. They have no problem trying to create a new layer on top of existing operating systems, transforming an operating system issue (managing software installation) into a programming language issue (see CPAN, Python eggs and so many others).

    Creating a sub-system specific to a language and hosting it on an operating system works well as long as the language boundary is not crossed and there is no competition between the sub-system and the system itself. In the Python world, distutils, setuptools, eggs and the like more or less work with pure Python code. They create a square wheel that was made round years ago by dpkg+apt-get and others, but they help a lot of their users do something they would not know how to do another way.

    A wall is quickly hit though, as the approach becomes overly complex as soon as they try to depend on things that do not belong to their Python sub-system. What if your application needs a database? What if your application needs to link to libraries? What if your application needs to reuse data from or provide data to other applications? What if your application needs to work on different architectures?

    The software developers who never had to maintain computer systems in production wish these tasks were easy. Unfortunately they are not easy and cannot be. As I said, there is no way around configuration management for the one who wants a stable system. Configuration management requires both project management work and software development work. One can have a system where packaging software is less work, but that comes at the price of stability and of reduced functionality and ease of maintenance.

    Since neither of the two use cases will disappear any time soon, the only solution to the problem is to share as much data as possible between the different tools and let everyone decide how to install software on their computer system.

    Some links to continue reading on the same topic:


  • EuroSciPy'09 (part 1/2): The Need For Speed

    2009/07/29 by Nicolas Chauvat
    http://www.logilab.org/image/9852?vid=download

    The EuroSciPy2009 conference was held in Leipzig at the end of July and was sponsored by Logilab and other companies. It started with three talks about speed.

    Starving CPUs

    In his keynote, Francesc Alted talked about starving CPUs. Thirty years back, memory and CPU frequencies were about the same. Memory speed kept up with the evolution of CPU speed for about ten years before falling behind. Nowadays, memory is about a hundred times slower than the cache, which is itself about twenty times slower than the CPU. The direct consequence is that CPUs are starving and spend many clock cycles waiting for data to process.

    In order to improve the performance of programs, it is now necessary to know about the multiple layers of computer memory, from disk storage to the CPU. The common architecture will soon count six levels: mechanical disk, solid state disk, RAM, level 3 cache, level 2 cache and level 1 cache.

    Using optimized array operations, taking striding into account, processing data blocks of the right size and using compression to diminish the amount of data that is transfered from one layer to the next are four techniques that go a long way on the road to high performance. Compression algorithms like Blosc increase throughput for they strike the right balance between being fast and providing good compression ratios. Blosc compression will soon be available in PyTables.

    Francesc also mentioned the numexpr extension to numpy, and its combination with PyTables named tables.Expr, which nicely and easily accelerates the computation of some expressions involving numpy arrays. In his list of references, Francesc cites Ulrich Drepper's article What Every Programmer Should Know About Memory.
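
    As a small illustration (not taken from the talk), here is how numexpr is typically used: it compiles the expression and processes the arrays in cache-friendly blocks instead of creating full-size temporaries:

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    result_np = 2 * a + 3 * b             # plain numpy: several temporary arrays
    result_ne = ne.evaluate("2*a + 3*b")  # numexpr: a single pass over the data

    assert np.allclose(result_np, result_ne)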

    Using PyPy's JIT for science

    Maciej Fijalkowski started his talk with a general presentation of the PyPy framework. One uses PyPy to describe an interpreter in RPython, then generate the actual interpreter code and its JIT.

    Since PyPy has become more of a framework for writing interpreters than a reimplementation of Python in Python, I suggested changing its misleading name to something like GcGc, the Generic Compiler for Generating Compilers. Maciej answered that there are discussions on the mailing list about splitting the project in two and making the implementation of the Python interpreter distinct from the GcGc framework.

    Maciej then focused his talk on his recent effort to rewrite in RPython the part of numpy that exposes the underlying C library to Python. He says the benefits of using PyPy's JIT to speedup that wrapping layer are already visible. He has details on the PyPy blog. Gaël Varoquaux added that David Cournapeau has started working on making the C/Python split in numpy cleaner, which would further ease the job of rewriting it in RPython.

    CrossTwine Linker

    Damien Diederen talked about his work on CrossTwine Linker and compared it with the many projects that are actively attacking the speed problem that dynamic and interpreted languages have been dragging along for years. Parrot tries to be the über virtual machine. Psyco offers very nice acceleration, but currently only on 32-bit systems. PyPy might be what he calls the Right Approach, but still needs a lot of work. Jython and IronPython modify the language a bit but benefit from the qualities of the JVM or the CLR. Unladen Swallow is probably the one most similar to CrossTwine.

    CrossTwine considers CPython as a library and uses a set of C++ classes to generate efficient interpreters that make calls to CPython's internals. CrossTwine is a tool that helps improve performance by hand-replacing some code paths with very efficient code that does the same operations but bypasses the interpreter and its overhead. An interpreter built with CrossTwine can be viewed as a JIT'ed branch of the official Python interpreter that should be feature-compatible (and bug-compatible) with CPython. Damien calls this approach "punching holes in the C substrate to get more speed" and says it could probably be combined with Psyco for even better results.

    CrossTwine works on 64-bit systems, but it is not (yet?) free software. It focuses on some use cases to greatly improve speed and is not to be considered a general purpose interpreter able to make any Python code faster.

    More readings

    Cython is a language that makes writing C extensions for the Python language as easy as Python itself. It replaces the older Pyrex.

    The SciPy2008 conference had at least two papers talking about speeding Python: Converting Python Functions to Dynamically Compiled C and unPython: Converting Python Numerical Programs into C.

    David Beazley gave a very interesting talk in 2009 at a Chicago Python Users group meeting about the effects of the GIL on multicore machines.

    I will continue my report on the conference with the second part titled "Applications And Open Questions".


  • Logilab at OSCON 2009

    2009/07/27 by Sandrine Ribeau
    http://assets.en.oreilly.com/1/event/27/oscon2009_oscon_11_years.gif

    OSCON, Open Source CONvention, takes place every year and promotes Open Source for technology. It is one of the meeting hubs for the growing open source community. This was the occasion for us to learn about new projects and to present CubicWeb during a BAYPIGgies meeting hosted by OSCON.

    http://www.openlina.com/templates/rhuk_milkyway/images/header_red_left.png

    I had the chance to talk with some of the folks working at OpenLina, where they presented LINA. LINA is a thin virtual layer that enables developers to write and compile code using ordinary Linux tools, then package that code into a single executable that runs on a variety of operating systems. LINA runs invisibly in the background, enabling the user to install and run LINAfied Linux applications as if they were native to that user's operating system. They were curious about CubicWeb and took it as a challenge to package it with LINA... maybe soon on LINA's applications list.

    Two open source projects caught my attention as potential semantic data publishers. The first one is Family Search, which provides a tool to search for family history and genealogy; they are also working to define a standard format to exchange citations with Open Library. Democracy Lab provides an application to collect votes and build geographic statistics based on political interests. They will at some point publish data semantically so that their application data can be consumed.

    It was also the occasion for us to introduce CubicWeb to the BayPIGgies folks, with the same presentation as the one given at EuroPython 2009. I'd like to take the opportunity to answer a question I did not manage to answer at that time. The question was: how different is CubicWeb from Freebase Parallax in terms of interface and view filters? Before answering this question, let's detail what Freebase Parallax is.

    Freebase Parallax provides a new way to browse and explore data in Freebase. It allows browsing from one set of data to a related set of data, and its interface enables aggregate visualizations. For instance, given the set of US presidents, different types of views can be applied, such as a timeline view, where the user can set which start and end dates to use to draw the timeline. So generic views (which apply to any data) are customizable by the user.

    http://res.freebase.com/s/f64a2f0cc4534b2b17140fd169cee825a7ed7ddcefe0bf81570301c72a83c0a8/resources/images/freebase-logo.png

    The search powered by Parallax is very similar to CubicWeb's faceted search, except that Parallax provides the user with a list of suggested filters to add in addition to the default ones, and the user can even remove a filter. That is something we could think about for CubicWeb: provide a generated faceted search so that the user can decide which filters to use.

    Parallax also provides topics related to the current data set, which eases navigation between sets of data. The main difference I could see between the view filters offered by Parallax and CubicWeb is that Parallax provides the same views for any type of data, whereas CubicWeb has specific views depending on the data type as well as generic views that apply to any type of data. This is a nice web interface to browse data and it could be a good source of inspiration for CubicWeb.

    http://www.zgeek.com/forum/gallery/files/6/3/2/img_228_96x96.jpg

    During this talk, I mentioned that CubicWeb now understands SPARQL queries thanks to the fyzz parser.


  • Quizz WolframAlpha

    2009/07/10 by Nicolas Chauvat
    http://www.logilab.org/image/9609?vid=download

    Wolfram Alpha is a web front-end to a huge database of information covering very different topics, ranging from mathematical functions to genetics, geography, astronomy, etc.

    When you search for a word, it will try to match it with one of the objects it has in its database and display all the information it has concerning that object. For example, it can tell you a lot about Halley's Comet, including where it is at the moment you ask the query. This is the main difference with, say, Wikipedia, which knows a lot about that comet in general but is not meant to compute its location in the sky at the moment you enter your query.

    Searches are not limited to words. One can key in commands like weather in Paris in june 2009 or x^2+sin(x) and get results for those precise queries. The processing of the input query is far from bad, since it returns results to questions like what are the cities of France, but I would not call it state-of-the-art natural language processing, since that query returns the largest cities instead of just the cities it knows about, and the question what are the smallest cities of France will not return any result. Natural language processing is a very difficult problem, though, especially when done in the open world, as is the case here with an engine available to the general public on the internet.

    For more examples, visit the WolframAlpha website, where you will also be able to post feature requests or, if you are a developer, get documentation about the WolframAlpha API and maybe use it as a web service in your application when you need to answer certain types of questions.


  • EuroPython 2009

    2009/07/06 by Nicolas Chauvat
    http://www.logilab.org/file/9580/raw/europython_logo.png

    Once again Logilab sponsored the EuroPython conference. We would like to thank the organization team (especially John Pinner and Laura Creighton) for their hard work. The Conservatoire is a very central location in Birmingham and walking around the city center and along the canals