Blog entries by Nicolas Chauvat [30]

SaltStack Paris Meetup on Feb 6th, 2014 - (S01E02)

2013/12/20 by Nicolas Chauvat

Logilab has set up the second meetup for salt users in Paris on Feb 6th, 2014 at IRILL, near Place d'Italie, starting at 18:00. The address is 23 avenue d'Italie, 75013 Paris.

Here is the announcement in French: http://www.logilab.fr/blogentry/1981

Please forward it to anyone who may be interested, underlining that pizzas will be offered to refuel the chatters ;)

Conveniently placed a week after the Salt Conference, topics will include anything related to salt and its uses, demos, new ideas, exchange of salt formulas, commenting the talks/videos of the saltconf, etc.

If you are interested in Salt, Python and Devops and will be in Paris at that time, we hope to see you there!


JDEV2013 - Software development conference of CNRS

2013/09/14 by Nicolas Chauvat

I had the pleasure of being invited to lead a tutorial at JDEV2013 titled Learning TDD and Python in Dojo mode.

http://www.logilab.org/file/177427/raw/logo_JDEV2013.png

I quickly introduced the keywords with a single slide to keep it simple:

http://Python.org
+ Test Driven Development (Test, Code, Refactor)
+ Dojo (house of training: Kata / Randori)
= Calculators
  - Reverse Polish Notation
  - Formulas with Roman Numbers
  - Formulas with Numbers in letters

As you can see, I had three types of calculators, hence at least three Kata to practice, but as usual with beginners, it took us the whole tutorial to get done with the first one.

The room was a classroom that we set up as our coding dojo with the coder and his copilot working on a laptop, facing the rest of the participants, with the large screen at their back. The pair-programmers could freely discuss with the people facing them, who were following the typing on the large screen.

We switched every ten minutes: the copilot became coder, the coder went back to his seat in the class and someone else stood up to become the copilot.

The session was allocated 3 hours split over two slots of 1h30. It took me less than 10 minutes to open the session with the above slide, 10 minutes as first coder and 10 minutes to close it. Over a time span of 3 hours, that left 150 minutes for coding, hence 15 people. Luckily, the whole group was about that size and almost everyone got a chance to type.

I completely skipped explaining Python, its syntax and the unittest framework and we jumped right into writing our first tests with if and print statements. Since they knew about other programming languages, they picked up the Python language on the way.

After more than an hour of slowly discovering Python and TDD, someone in the room realized they had been focusing more on handling exception cases and failures than implementing the parsing and computation of the formulas because the specifications were not clearly understood. He then asked me the right question by trying to define Reverse Polish Notation in one sentence and checking that he got it right.

Different algorithms to parse and compute RPN formulas were devised at the blackboard over the pause while part of the group went for a coffee break.

The implementation took about another hour to get right, with me making sure they would not wander too far from the actual goal. Once the stack-based solution was found and implemented, I asked them to delete the files, switch coder and start again. They had forgotten about the Kata definition and were surprised, but quickly enjoyed it when they realized that progress was much faster on the second attempt.

Since it is always better to show that you can walk the talk, I closed the session by practicing the RPN calculator kata myself in a bit less than 10 minutes. The order in which to write the tests is the tricky part, because it can easily appear far-fetched for such a small problem when you already know an algorithm that solves it.

Here it is:

import operator

OPERATORS = {'+': operator.add,
             '*': operator.mul,
             '/': operator.div,
             '-': operator.sub,
             }

def compute(args):
    items = args.split()
    stack = []
    for item in items:
        if item in OPERATORS:
            b,a = stack.pop(), stack.pop()
            stack.append(OPERATORS[item](a,b))
        else:
            stack.append(int(item))
    return stack[0]

with the accompanying tests:

import unittest
from npi import compute

class TestTC(unittest.TestCase):

    def test_unit(self):
        self.assertEqual(compute('1'), 1)

    def test_dual(self):
        self.assertEqual(compute('1 2 +'), 3)

    def test_tri(self):
        self.assertEqual(compute('1 2 3 + +'), 6)
        self.assertEqual(compute('1 2 + 3 +'), 6)

    def test_precedence(self):
        self.assertEqual(compute('1 2 + 3 *'), 9)
        self.assertEqual(compute('1 2 * 3 +'), 5)

    def test_zerodiv(self):
        self.assertRaises(ZeroDivisionError, compute, '10 0 /')

unittest.main()
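
The listing above targets Python 2, where operator.div exists. Here is a minimal sketch of the same calculator for Python 3. It assumes that integer (floor) division is what is wanted, since that is what operator.div did for integers under Python 2; the choice of operator.floordiv is an assumption and was not part of the session:

```python
import operator

# operator.div no longer exists in Python 3; floordiv reproduces what
# Python 2 did for integers (assumption: integer division is wanted).
OPERATORS = {'+': operator.add,
             '*': operator.mul,
             '/': operator.floordiv,
             '-': operator.sub,
             }

def compute(args):
    """Evaluate a space-separated Reverse Polish Notation formula."""
    stack = []
    for item in args.split():
        if item in OPERATORS:
            # the right-hand operand is on top of the stack
            b, a = stack.pop(), stack.pop()
            stack.append(OPERATORS[item](a, b))
        else:
            stack.append(int(item))
    return stack[0]
```

The tests shown above should run unchanged against this version, provided it is saved as npi.py.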

Apparently, it did not go too badly, for I got positive comments at the end from people who enjoyed discovering Python, Test Driven Development and the Dojo mode of learning in a single session.

I had fun doing this tutorial and thank the organizers for this conference!


The Great Salt Sprint Paris Location is Logilab

2013/07/12 by Nicolas Chauvat
http://farm1.static.flickr.com/183/419945378_4ead41a76d_m.jpg

We're happy to be part of the second Great Salt Sprint that will be held at the end of July 2013. We will be hosting the French sprinters on Friday 26th in our offices in the center of Paris.

The focus of our Logilab team will probably be Test-Driven System Administration with Salt, but the more participants and topics, the merrier the event.

Please register if you plan on joining us. We will be happy to meet with fellow hackers.

Photo by Sebastian Mary under a Creative Commons licence.


Generating a user interface from a Yams model

2012/01/09 by Nicolas Chauvat

Yams is a pythonic way to describe an entity-relationship model. It is used at the core of the CubicWeb semantic web framework in order to automate lots of things, including the generation and validation of forms. Although we have been using the MVC design pattern to write user interfaces with Qt and Gtk before we started CubicWeb, we never got to reuse Yams. I am on my way to fix this.

Here is the simplest possible example that generates a user interface (using dialog and python-dialog) to input data described by a Yams data model.

First, let's write a function that builds the data model:

def mk_datamodel():
    from yams.buildobjs import EntityType, RelationDefinition, Int, String
    from yams.reader import build_schema_from_namespace

    class Question(EntityType):
        number = Int()
        text = String()

    class Form(EntityType):
        title = String()

    class in_form(RelationDefinition):
        subject = 'Question'
        object = 'Form'
        cardinality = '*1'

    return build_schema_from_namespace(vars().items())

Here is what you get using graphviz or xdot to display the schema of that data model with:

import os
from yams import schema2dot

datamodel = mk_datamodel()
schema2dot.schema2dot(datamodel, '/tmp/toto.dot')
os.system('xdot /tmp/toto.dot')
http://www.logilab.org/file/87002?vid=download

To make a step in the direction of genericity, let's add a class that abstracts the dialog API:

import sys

class InterfaceDialog:
    """Dialog-based Interface"""
    def __init__(self, dlg):
        self.dlg = dlg

    def input_list(self, invite, options):
        assert len(options) != 0, str(invite)
        choice = self.dlg.radiolist(invite, list=options, selected=1)
        if choice is not None:
            return choice.lower()
        else:
            raise Exception('operation cancelled')

    def input_string(self, invite, default):
        return self.dlg.inputbox(invite, init=default).decode(sys.stdin.encoding)

And now let's put everything together:

datamodel = mk_datamodel()

import dialog
ui = InterfaceDialog(dialog.Dialog())
ui.dlg.setBackgroundTitle('Dialog Interface with Yams')

objs = []
for entitydef in datamodel.entities():
    if entitydef.final:
        continue
    obj = {}
    for attr in entitydef.attribute_definitions():
        if attr[1].type in ('String','Int'):
            obj[str(attr[0])] = ui.input_string('%s.%s' % (entitydef,attr[0]), '')
    try:
        entitydef.check(obj)
    except Exception as exc:
        ui.dlg.scrollbox(str(exc))
    # keep the collected values so that they are printed below
    objs.append(obj)

print(objs)
http://www.logilab.org/file/87001?vid=download

The result is a program that will prompt the user for the title of a form and the text/number of a question, then enforce the type constraints and display the inconsistencies.
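
The same idea can be sketched without Yams or dialog. In the sketch below, the model is a plain dictionary and the ask callable stands in for the dialog box. Both are assumptions made for illustration and none of the names (MODEL, fill, ask) come from Yams:

```python
# hypothetical declarative model: entity name -> {attribute: type}
MODEL = {
    'Form': {'title': str},
    'Question': {'number': int, 'text': str},
}

def fill(model, ask):
    """Prompt for every attribute and return (objects, errors)."""
    objects, errors = [], []
    for entity in sorted(model):
        obj = {}
        for attr, attr_type in sorted(model[entity].items()):
            answer = ask('%s.%s' % (entity, attr))
            try:
                # enforce the type constraint declared in the model
                obj[attr] = attr_type(answer)
            except ValueError as exc:
                errors.append('%s.%s: %s' % (entity, attr, exc))
        objects.append((entity, obj))
    return objects, errors


# canned answers replace the interactive prompt in this example
answers = {'Form.title': 'Survey', 'Question.number': 'one',
           'Question.text': 'How are you?'}
objects, errors = fill(MODEL, answers.get)
print(objects)
print(errors)
```

Here the answer 'one' cannot be converted to an integer, so Question.number is reported as an inconsistency instead of being stored.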

The above is very simple and does very little, but if you read the documentation of Yams and if you think about generating the UI with Gtk or Qt instead of dialog, or if you have used the form mechanism of CubicWeb, you'll understand that this proof of concept opens a door to a lot of possibilities.

I will come back to this topic in a later article and give an example of integrating the above with pigg, a simple MVC library for Gtk, to make the programming of user-interfaces even more declarative and bug-free.


Drawing UML diagrams with Python

2011/09/26 by Nicolas Chauvat
http://www.umlgraph.org/doc/seq-eg.gif?vid=download

It started with a desire to draw diagrams of hierarchical systems with Python. Since this is similar to what we do in CubicWeb with schemas of the data model, I read the code and realized we had that graph submodule in the logilab.common library. This module uses dot from graphviz as a backend to draw the diagrams.
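
To give an idea of how little is needed to feed dot, here is a minimal sketch that turns a nested dictionary into dot source. It does not use the logilab.common graph submodule; the function name hierarchy_to_dot and the example system are hypothetical and only meant as an illustration:

```python
def hierarchy_to_dot(tree, name='hierarchy'):
    """Return dot source for a nested dict {node: {child: {...}}}."""
    lines = ['digraph %s {' % name, '  node [shape=box];']

    def walk(parent, children):
        # one edge per parent/child pair, depth first
        for child, grandchildren in sorted(children.items()):
            lines.append('  "%s" -> "%s";' % (parent, child))
            walk(child, grandchildren)

    for root, children in sorted(tree.items()):
        lines.append('  "%s";' % root)
        walk(root, children)
    lines.append('}')
    return '\n'.join(lines)


# hypothetical example system, for illustration only
system = {'car': {'engine': {'piston': {}, 'valve': {}},
                  'body': {'door': {}}}}
print(hierarchy_to_dot(system))
```

The output can be saved to a file and rendered with dot -Tpng, or displayed with xdot.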

Reading about UML diagrams drawn with GraphViz, I learned about UMLGraph, which uses GNU Pic to draw sequence diagrams. Pic is a language based on groff and the pic2plot tool is part of plotutils (apt-get install plotutils). Here is a tutorial. I have found some Python code wrapping pic2plot available as a plugin to wikipad. It is worth noticing that TeX seems to have a nice package for UML sequence diagrams called pgf-umlsd.

Since nowadays everything is moving into the web browser, I looked for a javascript library that does what graphviz does and I found canviz which looks nice.

If (only) I had time, I would extend pyreverse to draw sequence diagrams and not only class diagrams...


Setting up my Microsoft Natural Keyboard under Debian Squeeze

2011/06/08 by Nicolas Chauvat

I upgraded to Debian Squeeze over the weekend and it broke my custom Xmodmap. While I was fixing it, I realized that the special keys of my Microsoft Natural keyboard that were not working under Lenny were now functional. The only piece missing was the "zoom" key. Here is how I got it to work.

I found on the askubuntu forum a solution to the same problem, but it is missing the following details.

To find which keysym to map, I listed input devices:

$ ls /dev/input/by-id/
usb-Logitech_USB-PS.2_Optical_Mouse-mouse        usb-Logitech_USB-PS_2_Optical_Mouse-mouse
usb-Logitech_USB-PS_2_Optical_Mouse-event-mouse  usb-Microsoft_Natural®_Ergonomic_Keyboard_4000-event-kbd

then used evtest to find the keysym:

$ evtest /dev/input/by-id/usb-Microsoft*

then used udevadm to find the identifiers:

$ udevadm info --export-db | less

then edited /lib/udev/rules.d/95-keymap.rules to add:

ENV{ID_VENDOR}=="Microsoft", ENV{ID_MODEL_ID}=="00db", RUN+="keymap $name microsoft-natural-keyboard-4000"

in the section keyboard_usbcheck

and created the keymap file:

$ cat /lib/udev/keymaps/microsoft-natural-keyboard-4000
0xc022d pageup
0xc022e pagedown

then loaded the keymap:

$ /lib/udev/keymap /dev/input/by-id/usb-Microsoft_Natural®_Ergonomic_Keyboard_4000-event-kbd /lib/udev/keymaps/microsoft-natural-keyboard-4000

then used evtest again to check it was working.

Of course, you do not have to map the events to pageup and pagedown, but I found it convenient to use that key to scroll up and down pages.

Hope this helps :)


SemWeb.Pro - first french Semantic Web conference, Jan 17/18 2011

2010/09/21 by Nicolas Chauvat

SemWeb.Pro, the first French conference dedicated to the Semantic Web, will take place in Paris on January 17/18 2011.

One day of talks, one day of tutorials.

Want to grok the Web 3.0? Be there.

Something you want to share? Call for papers ends on October 15, 2010.

http://www.semweb.pro/semwebpro.png


EuroSciPy 2010 schedule is out!

2010/06/06 by Nicolas Chauvat
https://www.euroscipy.org/data/logo.png

The EuroSciPy 2010 conference will be held in Paris from July 8th to 11th at Ecole Normale Supérieure. Two days of tutorials, two days of conference, two interesting keynotes, a lightning talk session, an open space for collaboration and sprinting, thirty quality talks in the schedule and already 100 delegates registered.

If you are doing science and using Python, you want to be there!


The DEBSIGN_KEYID trick

2010/05/12 by Nicolas Chauvat

I have been wondering for some time why debsign would not use the DEBSIGN_KEYID environment variable that I exported from my bashrc. Debian bug 444641 explains the trick: debsign ignores environment variables and sources ~/.devscripts instead. A simple export DEBSIGN_KEYID=ABCDEFG in ~/.devscripts is enough to get rid of the -k argument once and for all.


Extended 256 colors in bash prompt

2010/02/07 by Nicolas Chauvat

The Mercurial 1.5 sprint is taking place in our offices this weekend and pair-programming with Steve made me want a better looking terminal. Have you seen his extravagant zsh prompt? I used to have only 8 colors to decorate my shell prompt, but thanks to some time spent playing around, I now have 256.

Here is what I used to have in my bashrc for 8 colors:

NO_COLOUR="\[\033[0m\]"
LIGHT_WHITE="\[\033[1;37m\]"
WHITE="\[\033[0;37m\]"
GRAY="\[\033[1;30m\]"
BLACK="\[\033[0;30m\]"

RED="\[\033[0;31m\]"
LIGHT_RED="\[\033[1;31m\]"
GREEN="\[\033[0;32m\]"
LIGHT_GREEN="\[\033[1;32m\]"
YELLOW="\[\033[0;33m\]"
LIGHT_YELLOW="\[\033[1;33m\]"
BLUE="\[\033[0;34m\]"
LIGHT_BLUE="\[\033[1;34m\]"
MAGENTA="\[\033[0;35m\]"
LIGHT_MAGENTA="\[\033[1;35m\]"
CYAN="\[\033[0;36m\]"
LIGHT_CYAN="\[\033[1;36m\]"

# set a fancy prompt
export PS1="${RED}[\u@\h \W]\$${NO_COLOUR} "

Just put the following lines in your bashrc to get the 256 colors:

function EXT_COLOR () { echo -ne "\[\033[38;5;$1m\]"; }

# set a fancy prompt
export PS1="`EXT_COLOR 172`[\u@\h \W]\$${NO_COLOUR} "

Yay, I now have an orange prompt! I now need to write a script that will display useful information depending on the context. Displaying the status of the mercurial repository I am in might be my next step.
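
To pick a color number, it helps to see the whole palette. Here is a small Python sketch that builds the same 38;5;N escape sequence as the EXT_COLOR function above and prints the 256 codes, each in its own color. The names ext_color and palette are made up for this example, and the \[ \] wrappers are left out because they only make sense inside a bash prompt:

```python
RESET = '\033[0m'

def ext_color(code):
    """Return the escape sequence selecting foreground colour `code` (0-255)."""
    if not 0 <= code <= 255:
        raise ValueError('colour code must be in 0..255')
    return '\033[38;5;%dm' % code

def palette(columns=16):
    """Return the 256 colour codes, each rendered in its own colour."""
    rows = []
    for start in range(0, 256, columns):
        cells = ['%s%3d%s' % (ext_color(n), n, RESET)
                 for n in range(start, start + columns)]
        rows.append(' '.join(cells))
    return '\n'.join(rows)

print(palette())
```

Run it in the terminal you want to configure: if the terminal does not support 256 colors, the numbers will not be displayed in the expected colors.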


Open Source/Design Hardware

2009/12/13 by Nicolas Chauvat
http://www.logilab.org/image/19338?vid=download

I have been doing free software since I discovered it existed. I bought an OpenMoko some time ago, since I am interested in anything that is open, including artwork like books, music, movies and... hardware.

I just learned about two lists, one at Wikipedia and another one at MakeOnline, but Google has more. Explore and enjoy!


Looking for a Windows Package Manager

2009/07/31 by Nicolas Chauvat
http://www.logilab.org/image/9862?vid=download

As said in a previous article, I am convinced that part of the motivation for making package sub-systems like the Python one, which includes distutils, setuptools, etc, is that Windows users and Mac users never had the chance to use a tool that properly manages the configuration of their computer system. They just do not know what it would be like if they had at least a good package management system and do not miss it in their daily work.

I looked for Windows package managers that claim to provide features similar to Debian's dpkg+apt-get and here is what I found in alphabetical order.

AppSnap

AppSnap is written in Python and uses wxPython, PyCurl and PyYAML. It is packaged using Py2Exe, compressed with UPX and installed using NSIS.

It has not seen activity in the svn or on its blog since the end of 2008.

Appupdater

Appupdater provides functionality similar to apt-get or yum. It automates the process of installing and maintaining up to date versions of programs. It claims to be fully customizable and is licensed under the GPL.

It seems under active development at SourceForge.

QWinApt

QWinApt is a Synaptic clone written in C# that has not evolved since September 2007.

WinAptic

WinAptic is another Synaptic clone written this time in Pascal that has not evolved since the end of 2007.

Win-Get

Win-get is an automated install system and software repository for Microsoft Windows. It is similar to apt-get: it connects to a link repository, finds an application and downloads it before performing the installation routine (silent or standard) and deleting the install file.

It is written in Pascal and is set up as a SourceForge project, but not much has been done lately.

WinLibre

WinLibre is a Windows free software distribution that provides a repository of packages and a tool to automate and simplify their installation.

WinLibre was selected for Google Summer of Code 2009.

ZeroInstall

ZeroInstall started as a "non-admin" package manager for Linux distributions and is now extending its reach to work on Windows.

Conclusion

I have not used any of these tools; the above is just the result of some time spent searching the web.

A more limited approach is to notify the user of the newer versions:

  • App-Get will show you a list of your installed applications. When an update is available for one of them, it will be highlighted and you will be able to update the specific application in seconds.
  • GetIt is not an application-getter/installer. When you want to install a program, you can look it up in GetIt to choose which program to install from a master list of all programs made available by the various apt-get clones.

The appupdater project also compares itself to the programs automating the installation of software on Windows.

Some columnists expect the creation of application stores replicating the iPhone one.

I once read about a project to get the Windows kernel into the Debian distribution, but cannot find any trace of it... Remember that Debian is not limited to the Linux kernel, so why not think about a very improbable apt-get install windows-vista?


The Configuration Management Problem

2009/07/31 by Nicolas Chauvat
http://www.logilab.org/image/9863?vid=download

Today I felt like summing up my opinion on a topic that was discussed this year on the Python mailing lists, at PyCon-FR, at EuroPython and EuroSciPy... packaging software! Let us discuss the two main use cases.

The first use case is to maintain computer systems in production. A trait of production systems is that they cannot afford failures and are often deployed on a large scale. It leaves little room for manually fixing problems. Either the installation process works or the system fails. Reaching that level of quality takes a lot of work.

The second use case is to facilitate the life of software developers and computer users by making it easy for them to give a try to new pieces of software without much work.

The first use case has to be addressed as a configuration management problem. There is no way around it. The best way I know of managing the configuration of a computer system is called Debian. Its package format and its tool chain provide a very extensive and efficient set of features for system development and maintenance. Of course it is not perfect and there are missing bits and open issues that could be tackled, like the dependencies between hardware and software. For example, nothing will prevent you from installing on your Debian system a version of a driver that conflicts with the version of the chip found in your hardware. That problem could be solved, but I do not think the Debian project is there yet and I do not count it as a reason to reject Debian since I have not seen any other competitor at the same level as Debian.

The second use case is kind of a trap, for it concerns most computer users and most of those users are either convinced the first use case has nothing in common with their problem or convinced that the solution is easy and requires little work.

The situation is made more complicated by the fact that most of those users never had the chance to use a system with proper package management tools. They simply do not know the difference and do not feel like they are missing anything when using their system-that-comes-with-a-windowing-system-included.

Since many software developers have never had to maintain computer systems in production (often considered a lower sysadmin job) and never developed packages for computer systems that are maintained in production, they tend to think that the operating system and their software are perfectly decoupled. They have no problem trying to create a new layer on top of existing operating systems and transforming an operating system issue (managing software installation) into a programming language issue (see CPAN, Python eggs and so many others).

Creating a sub-system specific to a language and hosting it on an operating system works well as long as the language boundary is not crossed and there is no competition between the sub-system and the system itself. In the Python world, distutils, setuptools, eggs and the like more or less work with pure Python code. They create a square wheel that was made round years ago by dpkg+apt-get and others, but they help a lot of their users do something they would not know how to do another way.

A wall is quickly hit though, as the approach becomes overly complex as soon as they try to depend on things that do not belong to their Python sub-system. What if your application needs a database? What if your application needs to link to libraries? What if your application needs to reuse data from or provide data to other applications? What if your application needs to work on different architectures?

The software developers who never had to maintain computer systems in production wish these tasks were easy. Unfortunately they are not easy and cannot be. As I said, there is no way around configuration management for the one who wants a stable system. Configuration management requires both project management work and software development work. One can have a system where packaging software is less work, but that comes at the price of reduced stability, functionality and ease of maintenance.

Since neither of the two use cases will disappear any time soon, the only solution to the problem is to share as much data as possible between the different tools and let each one decide how to install software on his computer system.

Some links to continue your readings on the same topic:


EuroSciPy'09 (part 1/2): The Need For Speed

2009/07/29 by Nicolas Chauvat
http://www.logilab.org/image/9852?vid=download

The EuroSciPy2009 conference was held in Leipzig at the end of July and was sponsored by Logilab and other companies. It started with three talks about speed.

Starving CPUs

In his keynote, Francesc Alted talked about starving CPUs. Thirty years back, memory and CPU frequencies were about the same. Memory speed kept up for about ten years with the evolution of CPU speed before falling behind. Nowadays, memory is about a hundred times slower than the cache which is itself about twenty times slower than the CPU. The direct consequence is that CPUs are starving and spend many clock cycles waiting for data to process.

In order to improve the performance of programs, it is now required to know about the multiple layers of computer memory, from disk storage to CPU. The common architecture will soon count six levels: mechanical disk, solid state disk, RAM, cache level 3, cache level 2, cache level 1.

Using optimized array operations, taking striding into account, processing data blocks of the right size and using compression to diminish the amount of data that is transferred from one layer to the next are four techniques that go a long way on the road to high performance. Compression algorithms like Blosc increase throughput for they strike the right balance between being fast and providing good compression ratios. Blosc compression will soon be available in PyTables.

Francesc also mentioned the numexpr extension to numpy, and its combination with PyTables named tables.Expr, that nicely and easily accelerates the computation of some expressions involving numpy arrays. In his list of references, Francesc cites Ulrich Drepper's article What every programmer should know about memory.
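
The compression argument can be illustrated with a small sketch. It uses zlib from the Python standard library as a stand-in for Blosc, an assumption made to keep the example self-contained: zlib is not tuned for this use and the numbers it prints are only indicative. The point is that regular data shrinks a lot, so fewer bytes have to cross from one memory layer to the next:

```python
import array
import zlib

# one million small integers; regular data such as this compresses well
values = array.array('i', (n % 100 for n in range(1000000)))
raw = values.tobytes()

# level 1 favours speed over compression ratio
packed = zlib.compress(raw, 1)
ratio = len(raw) / float(len(packed))

# decompressing gives back exactly the same values
restored = array.array('i')
restored.frombytes(zlib.decompress(packed))
assert restored == values

print('raw: %d bytes, compressed: %d bytes, ratio: %.1f'
      % (len(raw), len(packed), ratio))
```

Whether compressing pays off in practice depends on how fast the codec is compared to the memory layer it relieves, which is the balance Blosc is designed to strike.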

Using PyPy's JIT for science

Maciej Fijalkowski started his talk with a general presentation of the PyPy framework. One uses PyPy to describe an interpreter in RPython, then generate the actual interpreter code and its JIT.

Since PyPy has become more of a framework to write interpreters than a reimplementation of Python in Python, I suggested changing its misleading name to something like gcgc, the Generic Compiler for Generating Compilers. Maciej answered that there are discussions on the mailing list to split the project in two and make the implementation of the Python interpreter distinct from the GcGc framework.

Maciej then focused his talk on his recent effort to rewrite in RPython the part of numpy that exposes the underlying C library to Python. He says the benefits of using PyPy's JIT to speed up that wrapping layer are already visible. He has details on the PyPy blog. Gaël Varoquaux added that David Cournapeau has started working on making the C/Python split in numpy cleaner, which would further ease the job of rewriting it in RPython.

CrossTwine Linker

Damien Diederen talked about his work on CrossTwine Linker and compared it with the many projects that are actively attacking the problem of speed that dynamic and interpreted languages have been dragging along for years. Parrot tries to be the über virtual machine. Psyco offers very nice acceleration, but currently only on 32-bit systems. PyPy might be what he calls the Right Approach, but still needs a lot of work. Jython and IronPython modify the language a bit but benefit from the qualities of the JVM or the CLR. Unladen Swallow is probably the one that's most similar to CrossTwine.

CrossTwine considers CPython as a library and uses a set of C++ classes to generate efficient interpreters that make calls to CPython's internals. CrossTwine is a tool that helps improve performance by hand-replacing some code paths with very efficient code that does the same operations but bypasses the interpreter and its overhead. An interpreter built with CrossTwine can be viewed as a JIT'ed branch of the official Python interpreter that should be feature-compatible (and bug-compatible) with CPython. Damien calls his approach "punching holes in C substrate to get more speed" and says it could probably be combined with Psyco for even better results.

CrossTwine works on 64bit systems, but it is not (yet?) free software. It focuses on some use cases to greatly improve speed and is not to be considered a general purpose interpreter able to make any Python code faster.

More readings

Cython is a language that makes writing C extensions for the Python language as easy as Python itself. It replaces the older Pyrex.

The SciPy2008 conference had at least two papers talking about speeding up Python: Converting Python Functions to Dynamically Compiled C and unPython: Converting Python Numerical Programs into C.

David Beazley gave a very interesting talk in 2009 at a Chicago Python Users group meeting about the effects of the GIL on multicore machines.

I will continue my report on the conference with the second part titled "Applications And Open Questions".


Quizz WolframAlpha

2009/07/10 by Nicolas Chauvat
http://www.logilab.org/image/9609?vid=download

Wolfram Alpha is a web front-end to a huge database of information covering very different topics ranging from mathematical functions to genetics, geography, astronomy, etc.

When you search for a word, it will try to match it with one of the objects it has in its database and display all the information it has concerning that object. For example it can tell you a lot about the Halley Comet, including where it is at the moment you ask the query. This is the main difference with, say, Wikipedia, which will know a lot about that comet in general, but is not meant to compute its location in the sky at the moment you enter your query.

Searches are not limited to words. One can key in commands like weather in Paris in june 2009 or x^2+sin(x) and get results for those precise queries. The processing of the input query is far from bad, since it returns results to questions like what are the cities of France, but I would not call it state of the art natural language processing since that query returns the largest cities instead of just the cities it knows about and the question what are the smallest cities of France will not return any result. Natural language processing is a very difficult problem, though, especially when done in the open world, as is the case here with an engine available to the general public on the internet.

For more examples, visit the WolframAlpha website, where you will also be able to post feature requests or, if you are a developer, get documentation about the WolframAlpha API and maybe use it as a web service in your application when you need to answer certain types of questions.


EuroPython 2009

2009/07/06 by Nicolas Chauvat
http://www.logilab.org/image/9580?vid=download

Once again Logilab sponsored the EuroPython conference. We would like to thank the organization team (especially John Pinner and Laura Creighton) for their hard work. The Conservatoire is a very central location in Birmingham and walking around the city center and along the canals was nice. The website was helpful when preparing the trip and made it easy to find places where to eat and stay. The conference program was full of talks about interesting topics.

I presented CubicWeb and spent a large part of my talk explaining what the semantic web is and what features we need in the tools we will use to be part of that web of data. I insisted on the fact that CubicWeb is made of two parts, the web engine and the data repository, and that the repository can be used without the web engine. I demonstrated this with a TurboGears application that used the CubicWeb repository as its persistence layer. RQL in TurboGears! See my slides and Reinout Van Rees' write-up.

Christian Tismer took over the development of Psyco a few months ago. He said he recently removed some bugs that were show stoppers, including one that was generating way too many recompilations. His new version looks very promising. Performance improved, long numbers are supported, 64bit support may become possible, generators work... and Stackless is about to be rebuilt on top of Psyco! Psyco 2.0 should be out today.

I had a nice chat with Cosmin Basca about the Semantic Web. He suggested using Mako as a templating language for CubicWeb. Cosmin is doing his PhD at DERI and develops SurfRDF which is an Object-RDF mapper that wraps a SPARQL endpoint to provide "discoverable" objects. See his slides and Reinout Van Rees' summary of his talk.

I saw a lightning talk about the Nagare framework which refuses to use templating languages, for the same reason we do not use them in CubicWeb. Is their h.something the right way of doing things? The example reminds me of the C++ concatenation operator. I am not really convinced by the continuation idea since I have been for years a happy user of the reactor model that's implemented in frameworks like Twisted. Read the blog and documentation for more information.

I had a chat with Jasper Op de Coul about Infrae's OAI Server and the work he did to manage RDF data in Subversion and a relational database before publishing it within a web app based on YUI. We commented code that handles books and library catalogs. Part of my CubicWeb demo was about books in DBpedia and cubicweb-book. He gave me a nice link to the WorldCat API.

Souheil Chelfouh showed me his work on Dolmen and Menhir. For several design problems and framework architecture issues, we compared the solutions offered by the Zope Toolkit library with the ones found by CubicWeb. I will have to read more about Martian and Grok to make sure I understand the details of that component architecture.

I had a chat with Martijn Faassen about packaging Python modules. A one sentence summary would be that the Python community should agree on a meta-data format that describes packages and their dependencies, then let everyone use the tool he likes most to manage the installation and removal of software on his system. I hope the work done during the last PyConUS and led by Tarek Ziadé arrived at the same conclusion. Read David Cournapeau's blog entry about Python Packaging for a detailed explanation of why the meta-data format is the way to go. By the way, Martijn is the lead developer of Grok and Martian.

Godefroid Chapelle and I talked a lot about Zope Toolkit (ZTK) and CubicWeb. We compared the way the two frameworks deal with pluggable components. ZTK has adapters and a registry. CubicWeb does not use adapters as ZTK does, but has a view selection mechanism that required a registry with more features than the one used in ZTK. The ZTK registry only has to match a tuple (Interface, Class) when looking for an adapter, whereas CubicWeb's registry has to find the views that can be applied to a result set by checking various properties:

  • interfaces: all items of first column implement the Calendar Interface,
  • dimensions: more than one line, more than two columns,
  • types: items of first column are numbers or dates,
  • form: form contains key XYZ that has a value lower than 10,
  • session: user is authenticated,
  • etc.
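
The selection mechanism described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each view declares a list of predicates that score a result set and its context. All names (ResultSet, ViewRegistry, the predicate functions) are made up for the example and are not CubicWeb's actual API.

```python
# Hypothetical sketch of a predicate-based view registry.
# None of these names come from CubicWeb itself.

class ResultSet:
    """A list of rows, each row being a tuple of values."""

    def __init__(self, rows):
        self.rows = rows

    @property
    def first_column(self):
        return [row[0] for row in self.rows]


def more_than_one_line(rset, context):
    return 1 if len(rset.rows) > 1 else 0


def first_column_is_number(rset, context):
    ok = rset.rows and all(isinstance(v, (int, float)) for v in rset.first_column)
    return 1 if ok else 0


def authenticated(rset, context):
    return 1 if context.get("user") else 0


class ViewRegistry:
    def __init__(self):
        self._views = []

    def register(self, name, *predicates):
        self._views.append((name, predicates))

    def select(self, rset, context):
        """Return the names of applicable views, most specific first."""
        scored = []
        for name, predicates in self._views:
            scores = [predicate(rset, context) for predicate in predicates]
            if all(scores):  # a single failing predicate rules the view out
                scored.append((sum(scores), name))
        return [name for _, name in sorted(scored, reverse=True)]


registry = ViewRegistry()
registry.register("table", more_than_one_line)
registry.register("plot", more_than_one_line, first_column_is_number)
registry.register("private-plot", more_than_one_line,
                  first_column_is_number, authenticated)

rset = ResultSet([(1, "a"), (2, "b")])
anonymous_views = registry.select(rset, {"user": None})
logged_in_views = registry.select(rset, {"user": "alice"})
```

A view that satisfies more predicates gets a higher score, which is one simple way to prefer the most specific view. A lookup keyed on an (Interface, Class) tuple cannot express this kind of ranking.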

As for Grok and Martian, I will have to look into the details to make sure nothing evil is hiding there. I should also find time to compare zope.schema and yams and write about it on this blog.

And if you want more information about the conference:


The Web is reaching version 3

2009/06/05 by Nicolas Chauvat
http://www.logilab.org/image/9295?vid=download

I presented CubicWeb at several conferences recently and I used the following as an introduction.

Web version numbers:

  • version 0 = the internet links computers
  • version 1 = the web links documents
  • version 2 = web applications
  • version 3 = the semantic web links data [we are here!]
  • version 4 = more personalization and fixes for privacy and security problems
  • ... reach into physical world, bits of AI, etc.

In his blog at MIT, Tim Berners-Lee calls version 0 the International Information Infrastructure, version 1 the World Wide Web and version 3 the Giant Global Graph. Read the details about the Giant Global Graph on his blog.


Fetching book descriptions and covers

2009/05/11 by Nicolas Chauvat
http://www.logilab.org/image/9139?vid=download

We recently added the book cube to our intranet in order for books available in our library to show up in the search results. Entering a couple of books using the default HTML form, even with the help of copy/paste from Google Book or Amazon, is boring enough to make one seek out other options.

As a Python and Debian user, I put the python-gdata package on my list of options, but quickly realized that the version in Debian is not current and that the books service is not yet accessible with the python gdata client. Both problems could be easily overcome since I could update Debian's version from 1.1.1 to the latest 1.3.1 and patch it with the book search support that will be included in the next release, but I went on exploring other options.

Amazon is the first answer that comes to mind when speaking of books on the net and pyAWS looks like a nice wrapper around the Amazon Web Service. The quickstart example on the home page does almost exactly what I was looking for. Trying to find a Debian package of pyAWS, I only came across boto which appears to be general purpose.

Registering with Amazon and Google to get a key and use their web services is doable, but one wonders why something as common as books and public libraries would have to be accessed through private companies. It turns out Wikipedia knows of many book catalogs on the net, but I was looking for a site publishing data as RDF or part of the Linked Open Data initiative. I ended up with almost exactly what I needed.

The Open Library features millions of books and covers, publicly accessible as JSON using its API. There is even a dump of the database. End of search, be happy.

The next step is to use this service to enhance the cubicweb-book cube by allowing a user to add a new book to their collection by simply entering an ISBN. All data about the book can be fetched from the OpenLibrary, including the cover and information about the author. You can expect such a new version soon... and we will probably get a new demo of CubicWeb online in the process, since all that data available as a dump is screaming for reuse as others have already found out by making it available as RDF on AppEngine!
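
Fetching a book by ISBN could look like the sketch below. It assumes the Open Library Books API and covers service as they are documented today (the `bibkeys`, `format` and `jscmd` parameters, and the `/b/isbn/` covers path), which may differ from what the site offered when this entry was written. The function names are made up for the example and this is not the cubicweb-book implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoints as documented by Open Library today (an assumption for this sketch).
API_ROOT = "https://openlibrary.org/api/books"
COVERS_ROOT = "https://covers.openlibrary.org/b/isbn"


def book_url(isbn):
    """Build the URL that returns the JSON description of a book."""
    query = urlencode({"bibkeys": "ISBN:" + isbn, "format": "json", "jscmd": "data"})
    return API_ROOT + "?" + query


def cover_url(isbn, size="M"):
    """Build the URL of the cover image; size is S, M or L."""
    return "%s/%s-%s.jpg" % (COVERS_ROOT, isbn, size)


def fetch_book(isbn):
    """Return the description of the book as a dict, or None if it is unknown."""
    with urlopen(book_url(isbn)) as response:
        records = json.load(response)
    return records.get("ISBN:" + isbn)
```

Calling `fetch_book` with an ISBN needs network access. The two URL builders can be used on their own, for instance to put a cover image in a page without downloading it first.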


Release of CubicWeb 3.0

2009/01/05 by Nicolas Chauvat
http://www.cubicweb.org/index-cubicweb.png

As some readers of this blog may be aware, Logilab has been developing its own framework since 2001. It evolved over the years trying to reach the main goal (managing and publishing data with style) and to incorporate the good ideas seen in other Python frameworks Logilab developers had used. Now, companies other than Logilab have started providing services for this framework and it is stable enough for the core team to be confident in recommending it to third parties willing to build on it without suffering from the tasmanian devil syndrome.

CubicWeb version 3.0 was released on the last day of 2008. That's 7 years of research and development and (at least) three rewrites that were needed to get this in shape. Enjoy it at http://www.cubicweb.org/ !


DBpedia 3.2 released

2008/11/19 by Nicolas Chauvat
http://wiki.dbpedia.org/images/dbpedia_logo.png

For those interested in the Semantic Web as much as we are at Logilab, the announcement of the new DBpedia release is very good news. Version 3.2 is extracted from the October 2008 Wikipedia dumps and provides three major improvements: the DBpedia Schema, which is a restricted vocabulary extracted from the Wikipedia infoboxes; RDF links from DBpedia to Freebase, the open-license database providing about a million things from various domains; cleaner abstracts without the traces of Wikipedia markup that made them difficult to reuse.

DBpedia can be downloaded, queried with SPARQL or linked to via the Linked Data interface. See the about page for details.

It is important to note that ontologies are usually more of a common language for data exchange, meant for broad re-use, which means that they cannot enforce too many restrictions. Conversely, database schemas are more restrictive and allow for more interesting inferences. For example, a database schema may enforce that the Publisher of a Document is a Person, whereas a more general ontology will have to allow for Publisher to be a Person or a Company.

DBpedia provides its schema and moves forward by adding a mapping from that schema to actual ontologies like UMBEL, OpenCyc and Yago. This enables DBpedia users to infer from facts fetched from different databases, like DBpedia + Freebase + OpenCyc. Moreover 'checking' DBpedia's data against ontologies will help detect mistakes or weirdnesses in Wikipedia's pages. For example, if data extracted from Wikipedia's infoboxes states that "Paris was_born_in New_York", reasoning and consistency checking tools will be able to point out that a person may be born in a city, but not a city, hence the above fact is probably an error and should be reviewed.
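
The kind of consistency check mentioned above can be illustrated with a toy example. The schema, the entity types and the function name below are assumptions made for this sketch. Real tools work on RDF data and ontologies, not on Python dictionaries.

```python
# Toy schema: property -> (expected type of the subject, expected type of the object).
SCHEMA = {
    "was_born_in": ("Person", "City"),
}

# Toy type assertions, as they could be derived from infoboxes.
ENTITY_TYPES = {
    "Paris": "City",
    "New_York": "City",
    "Besancon": "City",
    "Victor_Hugo": "Person",
}


def find_suspicious_facts(facts, schema=SCHEMA, entity_types=ENTITY_TYPES):
    """Return the facts whose subject or object has an unexpected type."""
    suspicious = []
    for subject, prop, obj in facts:
        if prop not in schema:
            continue  # nothing is known about this property, so nothing to check
        subject_type, object_type = schema[prop]
        if (entity_types.get(subject) != subject_type
                or entity_types.get(obj) != object_type):
            suspicious.append((subject, prop, obj))
    return suspicious


FACTS = [
    ("Victor_Hugo", "was_born_in", "Besancon"),
    ("Paris", "was_born_in", "New_York"),
]
to_review = find_suspicious_facts(FACTS)
```

Only the second fact is flagged, because its subject is typed as a City where the schema expects a Person.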

With CubicWeb, one can easily define a schema specific to his domain, then quickly set up a web application and easily publish the content of its database as RDF for a known ontology. In other words, CubicWeb makes almost no difference between a web application and a database accessible through the web.


Command-line graphical user interfaces

2008/09/01 by Nicolas Chauvat
http://azarask.in/gfx/ubiquity_side.png

Graphical user interfaces help command discovery, while command-line interfaces help command efficiency. This article tries to explain why. I reached it when reading the list of references from the introduction to Ubiquity, which is the best extension to Firefox I have seen so far. I expect to start writing Ubiquity commands soon, since I have already been making extensive use of the 'keyword shortcut' functionality of Firefox's bookmarks and we have already done work in the area of 'language interaction', as they call it at Mozilla Labs, when working with Narval. Our Logilab Simple Desktop project, aka simpled, also goes in the same direction since it tries to unify different applications into a coherent work environment by defining basic commands and shortcuts that can be applied everywhere and accessing the rest of the functionalities via a command-line interface.


Is the Openmoko freerunner a computer or a phone ?

2008/08/27 by Nicolas Chauvat
http://wiki.openmoko.org/images/thumb/b/b9/Freerunner02.gif/150px-Freerunner02.gif

The Openmoko Freerunner is a computer with embedded GSM, accelerometer and GPS. I got mine last week after waiting for a month for the batch to get from Taiwan to the French company I bought it from. The first thing I had to admit was that some time will pass before it gets comfortable to use as a phone. The current version of the system has many weird things in its user interface and the phone works, but the other end of the call suffers a very unpleasant echo.

I will try to install Debian, Qtopia and Om2008.8 to compare them. I also want to quickly get Python scripts to run on it and get back to Narval hacking. I had an agent running on a bulky Palm+GPS+radionetwork back in 1999 and I look forward to running on this device the same kind of funny things I was doing in AI research ten years ago.


simpled - Simple Desktop project started !

2008/08/11 by Nicolas Chauvat

I bought last week a new laptop computer that can drive a 24" LCD monitor, which means I do not need my desktop computer any more. In the process of setting up that new laptop, I did what I have been wanting to do for years without finding the time: spending time on my ion3 config to make it more generic and create a small python setup utility that can regenerate it from a template file and a keyboard layout.

The simpled project was born!

If you take a look at the list of pending tickets, you will guess that I am using a limited number of pieces of software during my work day and tried to configure them so that they share common action/shortcuts. This is what simpled is about: given a keyboard layout generate the config files for the common tools so that action/shortcuts are always on the same key.

I use ion3, xterm+bash, emacs, mutt, firefox, gajim. Common actions are: open, save, close, move up/down/left/right, new frame or tab, close frame or tab, move to previous or next tab, etc.
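
To give an idea of the approach, here is a minimal sketch that generates a config file from a template and a keyboard layout. The template syntax, the layout names and the key choices are all invented for the example. They are not the actual simpled code, nor the real syntax of any of the tools listed above.

```python
from string import Template

# Hypothetical template for one tool: each action is bound to a placeholder key.
TOOL_TEMPLATE = Template(
    "# generated file, do not edit\n"
    "key $open = open\n"
    "key $save = save\n"
    "key $close = close\n"
)

# Hypothetical keyboard layouts: action name -> key carrying that action.
LAYOUTS = {
    "layout_a": {"open": "o", "save": "s", "close": "w"},
    "layout_b": {"open": "e", "save": "u", "close": "q"},
}


def render_config(layout_name):
    """Return the config text for the given layout.

    substitute() raises KeyError when the layout lacks an action used
    by the template, so an incomplete layout is detected early.
    """
    return TOOL_TEMPLATE.substitute(LAYOUTS[layout_name])


config_a = render_config("layout_a")
config_b = render_config("layout_b")
```

With one template per tool and one dictionary per layout, switching keyboards only requires regenerating the files.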

I will give news in this blog from time to time and announce it on mailing lists when version 0.1 is out. If you want to give it a try, get the code from the mercurial repository.


Simile-Widgets

2008/08/07 by Nicolas Chauvat
http://simile.mit.edu/images/logo.png

While working on knowledge management and semantic web technologies, I came across the Simile project at MIT a few years back. I even had a demo of the Exhibit widget fetching then displaying data from our semantic web application framework back in 2006 at the Web2 track of Solutions Linux in Paris.

Now that we are using these widgets when implementing web apps for clients, I was happy to see that the projects got a life of their own outside of MIT and became full-fledged free-software projects hosted on Google Code. See Simile-Widgets for more details and expect us to provide a Debian package soon unless someone does it first.

Speaking of Debian, here is a nice demo of the Timeline widget presenting the Debian history.

http://beta.thumbalizr.com/app/thumbs/?src=/thumbs/onl/source/d2/d280583f143793f040bdacf44a39b0d5.png&w=320&q=0&enc=

SciPy and TimeSeries

2008/08/04 by Nicolas Chauvat
http://www.enthought.com/img/scipy-sm.png

We have been using many different tools for doing statistical analysis with Python, including R, SciPy, specific C++ code, etc. It looks like the growing audience of SciPy is now moving toward having dedicated modules in SciPy (let's call them SciKits). See this thread in SciPy-user mailing-list.


Python for applied Mathematics

2008/07/29 by Nicolas Chauvat
http://www.ams.org/images/siam2008-brain.jpg

The presentation of Python as a tool for applied mathematics got highlighted at the 2008 annual meeting of the American Society for Industrial and Applied Mathematics (SIAM). For more information, read this blogpost and the slides.


Implementing scalable applications with AppEngine

2008/06/11 by Nicolas Chauvat
http://code.google.com/events/images/io_logo_lg.png

At Google IO, a large part of the Tools track was dedicated to AppEngine. Brett Slatkin gave a talk titled Building scalable Web Applications with Google AppEngine which focused on optimizing the server part of web apps. As other presenters demonstrated, like Steve Souders in his talk Even Faster Websites, optimizing the browser part of webapps is not to be neglected either.

Webscale applications require man-made optimisation

First of all, I must confess I am used to repeating that "early optimisation is the root of all evil" and "delay commitment until the last responsible time". But reading about AppEngine and listening to the Google IO talks, it appears that the tools we have today ask for human intervention to reach web-scale performance, even when "we" stands for "Google".

In order for web-scale applications to handle the kind of load they are facing, they must be designed and implemented carefully. As carefully as any application was designed before the exponential growth of PC computation power let us move away from low-level implementation details and made some inefficiencies acceptable as long as the time spent developing was short enough.

It all depends on the parameters of your cost function, but for web-scale applications, it seems like we do not have enough computer-time and cannot trade it for human-time.

Writes are more expensive than reads

To get a better idea of the work constraints, one should know that a disk seek is about 10ms, which means there will be a maximum of 100 accesses per second. On the other hand, if we need consistent data as opposed to transactional data (the latter implying that data is fetched each time it is asked for), data can be read from disk once then cached. Subsequent reads are done from memory at a rate of about 4GB/sec, which means 4000 accesses per second if entities are around 1MB in size. The result of this back-of-the-envelope approximation is that 40 reads cost as much as one write.

It follows that, although the actual time depends on the size and shape of data, writes are very expensive compared to reads and both are better done in batches to optimise disk access.
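
The back-of-the-envelope computation above can be written down explicitly. The three input figures come from the text. The variable names are made up for the example and decimal units are assumed (1GB = 1000MB).

```python
# Input figures, taken from the text.
SEEK_TIME_MS = 10                 # one disk seek
MEMORY_BANDWIDTH_MB_PER_S = 4000  # about 4GB/sec
ENTITY_SIZE_MB = 1                # size of one entity

# A disk that needs 10ms per seek serves at most 100 accesses per second.
disk_accesses_per_second = 1000 // SEEK_TIME_MS

# Reading 1MB entities from memory at 4GB/sec gives 4000 reads per second.
memory_reads_per_second = MEMORY_BANDWIDTH_MB_PER_S // ENTITY_SIZE_MB

# Number of cached reads that fit in the time of a single disk access.
reads_per_disk_access = memory_reads_per_second // disk_accesses_per_second

print(disk_accesses_per_second, memory_reads_per_second, reads_per_disk_access)
```

Changing the entity size shows how much the ratio depends on the shape of the data: with smaller entities, cached reads get cheaper while the cost of a seek stays the same.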

Entity groups in AppEngine

http://code.google.com/appengine/images/noassembly.gif

The AppEngine Datastore was designed with these constraints in mind. Entities are sets of property name/value pairs. Each entity may have a parent. An entity without a parent is the root of a hierarchy called an entity group.

Entities of the same group are stored on disk close to each other, but two distinct entity groups may be stored on different computers. Read access to entities of the same group is thus faster than read access to entities of different groups.

Write access is serialized per entity group. As opposed to a traditional RDBMS that provides row locking, the datastore only provides entity group locking. Writes to a single entity group will always happen in sequence, even though changes concern different entities.

There is no limit to the number of entity groups or to the number of entities per group, but because of the locking strategy, large entity groups will cause high contention and a lot of failed transactions. Since writes are expensive, not thinking about write throughput is a very bad idea when designing an AppEngine application if one wants it to scale.

On the other hand, the parallel nature of the datastore makes it scale wide and there is no limit to the number of entity groups that can be written to in parallel, nor to the number of reads that can be done in parallel.

To understand this design in detail, you will have to read about GFS, BigTable and other technologies developed by Google to implement large-scale clustering.

Example of counters

http://code.google.com/apis/gears/resources/database.gif

Counters are a good example to address when discussing write throughput, because the datastore locking strategy makes writing to global data very expensive.

Let us assume that we want to display on the main page of a wiki application the total number of comments posted.

A global counter would serialize all its updates. If 100 users were to add comments at the same time, some of them would have to wait several seconds for their action to complete: one write for the comment, one write for the counter, at most 100 writes per second for the counter and a lot of time lost due to failed transactions that need to be restarted.

The solution to make the counter scale is to partition it among all entity groups then sum these partial counters when the global value is needed.

Since chances are low that a given user will write more than one comment at a time, comment entities for a user can be grouped together and a partial counter can be added to the same entity group. Creating a new comment and increasing the partial counter will be done in the same batch.

When a new request for the main page is received, the counter total is looked up in the cache. If it is not found, all partial counters are fetched and summed up, then the cache is refreshed with a short timeout, for example one minute.

During the next minute, the counter will be "consistent" (read: not too far off) and served extremely fast from the cache.
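
The partitioned counter can be simulated in plain Python to make the idea concrete. This sketch deliberately does not use the AppEngine datastore or memcache APIs: a dictionary stands in for the per-user partial counters and two attributes stand in for the cache. The class and method names are made up for the example.

```python
import time


class PartitionedCounter:
    """In-memory simulation of a counter split into per-user partial counters."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._partials = {}        # user -> number of comments (one per entity group)
        self._ttl = ttl            # how long a cached total stays valid, in seconds
        self._clock = clock
        self._cached_total = None
        self._cached_at = None

    def add_comment(self, user):
        # Only the partial counter of this user is written to, so writes
        # made by different users never contend with each other.
        self._partials[user] = self._partials.get(user, 0) + 1

    def total(self):
        now = self._clock()
        expired = (self._cached_total is None
                   or now - self._cached_at >= self._ttl)
        if expired:
            # Cache miss: fetch and sum all partial counters, then refresh the cache.
            self._cached_total = sum(self._partials.values())
            self._cached_at = now
        return self._cached_total
```

Passing a fake clock makes the behaviour easy to observe: a comment added just after the total was cached only shows up once the timeout has expired.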

Prevent repeated or unneeded work

http://code.google.com/apis/gears/resources/localserver.gif

To sum things up, when implementing applications on top of AppEngine with web-scale usage as a goal, everything that can be done to save time should be considered. Including the following:

  • importing python modules as late as possible will minimize the python runtime overhead
  • retrieving data that is not going to be used is a waste
  • repeated queries and queries returning large result sets must be avoided
  • when Get() is sufficient, do not spend time on Query()
  • landing pages are traffic intensive and would better use the same query for everyone
  • entity groups have to be designed to match the load and aim at low write contention
  • caching must be used aggressively (it is no surprise that memcache was the first improvement that followed within a month of the AppEngine public release)
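
As an illustration of the first item, here is what a late import looks like. The handler names are hypothetical and the modules used are from the standard library. The point is only that the import cost is paid by the requests that need the module, not by every request.

```python
def landing_page():
    # The most requested handler imports nothing it does not need.
    return "Welcome"


def export_csv(rows):
    # Imported here rather than at module level: only requests
    # that actually export data pay for loading these modules.
    import csv
    import io

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Python caches imported modules, so after the first call to `export_csv` in a given process the import statements cost little more than a dictionary lookup.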

Conclusion

As a conclusion, the interface AppEngine is exhibiting today requires optimizing early, but I would bet that in the years to come, new languages and domain-specific compilers or database engines will take part of that burden off the hands of the developers.

Did not Yahoo and Google start developing PigLatin and Sawzall to make it easier to write parallel data-processing programs ? The same could happen here: describe a data model in a high-level language and get a tool that optimizes it for write contention and web-scale applications.

See Also

http://www.logilab.fr/images/lax.png

LAX (Logilab App engine eXtension) is a full-featured web application framework running on Google AppEngine developed by Logilab.


Google App Engine future directions

2008/06/09 by Nicolas Chauvat
http://code.google.com/appengine/images/appengine_lowres.jpg

Several of us went to San Francisco last week to attend Google IO. As usual with conferences, meeting people was more interesting than listening to most talks. The AppEngine Fireside Chat was a Q&A session that lasted about an hour. Here is what I learned from this session and various chats with AppEngineers.

  1. Google has decided to provide its scalable datastore architecture as a service. At this point, the datastore is the product and the goal is to make it as widely accessible as possible.
  2. The google.appengine.api.datastore API alone would not have made for a very sexy launch. In order to attract more people and lower the bar the beginners would have to jump over, they looked for a higher level programming interface.
  3. Since some people working at Google have been using Django and know it, they reimplemented part of its interface for defining data models. Late in the project, they added GQL because Django-like queries were a bit too difficult. In both cases, the goal was to make it easier for external developers to get started.
  4. But Google is not in the business of providing web application frameworks and AppEngineers made explicit that they would not be officially supporting a specific framework or a specific version of a given framework (not even Django 0.96, although there is a django-appengine-helper project on code.google.com). They expect frameworks to be provided by communities of developers.

My conclusion is twofold:

  • They will be focusing on supporting other languages in AppEngine (I would bet on Java being the next one available) rather than extending Python frameworks support.
  • Anyone is free to join with his own framework and provide support for it, the One True Interface being the one defined by google.appengine.api.datastore, not the one defined by db.model and GQL.

This is why Logilab published its own framework running on App Engine as free software and is providing support for it: Logilab Appengine eXtension.


Another step towards the semantic web

2007/02/06 by Nicolas Chauvat

I co-organized the Web2.0 conference track that was held at Solutions Linux 2007 in Paris last week. Researching to prepare the talk I gave, I came across microformats and GRDDL. Both try to add semantics on top of (X)HTML.

Microformats uses the class attribute and the "invisibility" of `div` and `span` to insert semantic information, as in ::
  <li class="vevent">
    <a class="url" href="http://www.solutionslinux.fr/">
     <span class="summary">Solutions Linux Web 2.0 Conference</span>: 
     <abbr class="dtstart" title="20070201T143000Z">February 1st 2:30pm</abbr>-
     <abbr class="dtend" title="20070201T180000Z">6pm</abbr>, at the 
     <span class="location">CNIT, La Défense</span>
    </a>
  </li>
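
To show how a program can read this information back, here is a minimal sketch using the HTMLParser class of the Python standard library on a trimmed copy of the snippet above. The VEventParser class is made up for the example. It only handles a flat structure like this one and is not a complete microformats parser.

```python
from html.parser import HTMLParser

SNIPPET = """
<li class="vevent">
  <a class="url" href="http://www.solutionslinux.fr/">
    <span class="summary">Solutions Linux Web 2.0 Conference</span>:
    <abbr class="dtstart" title="20070201T143000Z">February 1st 2:30pm</abbr>, at the
    <span class="location">CNIT, La Défense</span>
  </a>
</li>
"""


class VEventParser(HTMLParser):
    """Collect a few hCalendar properties found in class attributes."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self._current = None  # name of the text property being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self._current = None
        if tag == "a" and "url" in classes:
            self.event["url"] = attrs.get("href")
        # Dates are machine-readable in the title attribute, not in the text.
        if "dtstart" in classes and "title" in attrs:
            self.event["dtstart"] = attrs["title"]
        for name in ("summary", "location"):
            if name in classes:
                self._current = name

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.event[self._current] = self.event.get(self._current, "") + data


parser = VEventParser()
parser.feed(SNIPPET)
event = parser.event
```

The human reader sees "February 1st 2:30pm" while the program gets the timestamp from the title attribute, which is the whole point of the microformat.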

GRDDL information is added to the `head` of the XHTML page and points to an XSL that can extract the information from the page and output it as RDF.

Another option is to add a `link` to the `head` of the page, pointing to an alternate representation such as an RDF-formatted one.
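
Detecting such a link takes only a few lines. The sketch below assumes the alternate representation is advertised with rel="alternate" and the application/rdf+xml media type. The page content and the example.org URL are invented for the illustration.

```python
from html.parser import HTMLParser

# Hypothetical page advertising an RDF version of itself.
PAGE = """
<html>
  <head>
    <title>Some project</title>
    <link rel="stylesheet" type="text/css" href="/style.css" />
    <link rel="alternate" type="application/rdf+xml" href="http://example.org/project.rdf" />
  </head>
  <body><p>Hello</p></body>
</html>
"""


class RDFLinkFinder(HTMLParser):
    """Collect the href of every link that points to an RDF alternate."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if (tag == "link" and "alternate" in rels
                and attrs.get("type") == "application/rdf+xml"):
            self.alternates.append(attrs.get("href"))


finder = RDFLinkFinder()
finder.feed(PAGE)
rdf_urls = finder.alternates
```

This is roughly what a browser add-on has to do to tell the user that a page has an RDF counterpart.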

Firefox has add-ons that help you spot semantic enabled web pages: Tails detects microformats and the semantic radar detects RDF. Operator is an option I found too invasive.

As for my talk, it involved demonstrating CubicWeb, the engine behind logilab.org, and querying the data stored at logilab.org to reuse it with Exhibit.


Install Python Cartographic Library

2006/12/07 by Nicolas Chauvat

This is how we set up PCL <http://trac.gispython.org/projects/PCL> under Debian. Of course having a package would be better...

Install dependencies

apt-get install mapserver-bin python-mapscript mapserver-cgi \
    gdal-bin python-gdal python-dev libgeos-dev libgdal-dev \
    libgd2-xpm-dev python-setuptools zope-interface \
    python-elementtree

Now link libgdal.a to libgdal1.3.2.a

Install PCL from sources

svn co http://svn.gispython.org/svn/gispy/PCL/trunk PCL

cd PCL/

cd externals/OWSLib
python setup.py install --prefix=my_python_library

cd ../QuadTree
python setup.py install --prefix=my_python_library

cd ../../PCL-Core
python setup.py install --prefix=my_python_library

cd ../PCL-GDAL
python setup.py install --prefix=my_python_library

cd ../PCL-MapServer

Put in a directory the source code of mapserver corresponding to the package version installed above. Get it from http://mapserver.gis.umn.edu/download or with dpkg-src.

Edit setup.py to set ms_home to the path of the above mapserver sources.

python setup.py install --prefix=my_python_library

my_python_library now contains Python eggs and directories with Python and C libraries.