Blog entries july 2009 [4]

EuroSciPy'09 (part 1/2): The Need For Speed

2009/07/29 by Nicolas Chauvat
http://www.logilab.org/image/9852?vid=download

The EuroSciPy2009 conference was held in Leipzig at the end of July and was sponsored by Logilab and other companies. It started with three talks about speed.

Starving CPUs

In his keynote, Francesc Alted talked about starving CPUs. Thirty years ago, memory and CPU frequencies were about the same. Memory speed kept up with the evolution of CPU speed for about ten years before falling behind. Nowadays, memory is about a hundred times slower than the cache, which is itself about twenty times slower than the CPU. The direct consequence is that CPUs are starving and spend many clock cycles waiting for data to process.

To improve the performance of programs, one now needs to know about the multiple layers of computer memory, from disk storage to CPU. The common architecture will soon count six levels: mechanical disk, solid state disk, RAM, cache level 3, cache level 2, cache level 1.

Using optimized array operations, taking striding into account, processing data blocks of the right size and using compression to diminish the amount of data that is transferred from one layer to the next are four techniques that go a long way on the road to high performance. Compression algorithms like Blosc increase throughput because they strike the right balance between being fast and providing good compression ratios. Blosc compression will soon be available in PyTables.
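To make two of these techniques concrete, here is a minimal sketch that processes data block by block and keeps the blocks compressed until they are needed. It is only an illustration under stated assumptions: zlib from the standard library stands in for Blosc, the block size is an arbitrary value, and the function names are made up for this example.

```python
import zlib
from array import array

BLOCK_ITEMS = 4096  # hypothetical block size; in practice, tune it to the cache


def compressed_blocks(values, block_items=BLOCK_ITEMS):
    """Yield zlib-compressed blocks of a typed array of doubles."""
    for start in range(0, len(values), block_items):
        block = values[start:start + block_items]
        # Level 1 favours speed over compression ratio (zlib is a stand-in for Blosc).
        yield zlib.compress(block.tobytes(), 1)


def blockwise_sum(blocks):
    """Decompress one block at a time and accumulate a running sum."""
    total = 0.0
    for packed in blocks:
        block = array("d")
        block.frombytes(zlib.decompress(packed))
        total += sum(block)
    return total


# Highly repetitive sample data, so that compression has something to gain.
data = array("d", (float(i % 16) for i in range(100_000)))
packed = list(compressed_blocks(data))

raw_bytes = len(data) * data.itemsize
packed_bytes = sum(len(p) for p in packed)
result = blockwise_sum(packed)
```

Only one decompressed block is held at a time, and fewer bytes travel between storage layers. Whether this wins in practice depends on how fast the compressor is compared to the memory it bypasses, which is the balance a library like Blosc aims for.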

Francesc also mentioned the numexpr extension to numpy, and its combination with PyTables named tables.Expr, which nicely and easily accelerates the computation of some expressions involving numpy arrays. In his list of references, Francesc cites Ulrich Drepper's article What every programmer should know about memory.

Using PyPy's JIT for science

Maciej Fijalkowski started his talk with a general presentation of the PyPy framework. One uses PyPy to describe an interpreter in RPython, then generates the actual interpreter code and its JIT.

Since PyPy has become more of a framework to write interpreters than a reimplementation of Python in Python, I suggested changing its misleading name to something like GcGc, the Generic Compiler for Generating Compilers. Maciej answered that there are discussions on the mailing list to split the project in two and make the implementation of the Python interpreter distinct from the GcGc framework.

Maciej then focused his talk on his recent effort to rewrite in RPython the part of numpy that exposes the underlying C library to Python. He says the benefits of using PyPy's JIT to speed up that wrapping layer are already visible. He has details on the PyPy blog. Gaël Varoquaux added that David Cournapeau has started working on making the C/Python split in numpy cleaner, which would further ease the job of rewriting it in RPython.

CrossTwine Linker

Damien Diederen talked about his work on CrossTwine Linker and compared it with the many projects that are actively attacking the speed problem that dynamic and interpreted languages have been dragging along for years. Parrot tries to be the über virtual machine. Psyco offers very nice acceleration, but currently only on 32-bit systems. PyPy might be what he calls the Right Approach, but still needs a lot of work. Jython and IronPython modify the language a bit but benefit from the qualities of the JVM or the CLR. Unladen Swallow is probably the one that's most similar to CrossTwine.

CrossTwine considers CPython as a library and uses a set of C++ classes to generate efficient interpreters that make calls to CPython's internals. CrossTwine is a tool that helps improve performance by hand-replacing some code paths with very efficient code that does the same operations but bypasses the interpreter and its overhead. An interpreter built with CrossTwine can be viewed as a JIT'ed branch of the official Python interpreter that should be feature-compatible (and bug-compatible) with CPython. Damien calls his approach "punching holes in C substrate to get more speed" and says it could probably be combined with Psyco for even better results.

CrossTwine works on 64-bit systems, but it is not (yet?) free software. It focuses on some use cases where it greatly improves speed, and is not to be considered a general purpose interpreter able to make any Python code faster.

More readings

Cython is a language that makes writing C extensions for the Python language as easy as Python itself. It replaces the older Pyrex.

The SciPy2008 conference had at least two papers about speeding up Python: Converting Python Functions to Dynamically Compiled C and unPython: Converting Python Numerical Programs into C.

David Beazley gave a very interesting talk in 2009 at a Chicago Python Users group meeting about the effects of the GIL on multicore machines.

I will continue my report on the conference with the second part titled "Applications And Open Questions".


Logilab at OSCON 2009

2009/07/28 by Sandrine Ribeau
http://assets.en.oreilly.com/1/event/27/oscon2009_oscon_11_years.gif

OSCON, the Open Source CONvention, takes place every year and promotes Open Source for technology. It is one of the meeting hubs for the growing open source community. This was the occasion for us to learn about new projects and to present CubicWeb during a BayPIGgies meeting hosted by OSCON.

http://www.openlina.com/templates/rhuk_milkyway/images/header_red_left.png

I had the chance to talk with some of the folks working at OpenLina, where they presented LINA. LINA is a thin virtual layer that enables developers to write and compile code using ordinary Linux tools, then package that code into a single executable that runs on a variety of operating systems. LINA runs invisibly in the background, enabling the user to install and run LINAfied Linux applications as if they were native to that user's operating system. They were curious about CubicWeb and took it as a challenge to package it with LINA... maybe soon on LINA's applications list.

Two open source projects caught my attention as potential semantic data publishers. The first one is FamilySearch, which provides a tool to search for family history and genealogy. They are also working with Open Library to define a standard format to exchange citations. The second one is Democracy Lab, which provides an application to collect votes and build geographic statistics based on political interests. They will at some point publish data semantically so that their application data can be consumed by others.

It was also the occasion for us to introduce CubicWeb to the BayPIGgies folks, with the same presentation as the one given at EuroPython 2009. I'd like to take the opportunity to answer a question I did not manage to answer at that time. The question was: how different is CubicWeb from Freebase Parallax in terms of interface and view filters? Before answering this question, let's detail what Freebase Parallax is.

Freebase Parallax provides a new way to browse and explore data in Freebase. It lets the user browse from one set of data to a related set of data, and its interface supports aggregate visualizations. For instance, given the set of US presidents, different types of views can be applied, such as a timeline view, where the user chooses which start and end dates to use to draw the timeline. So generic views (which apply to any data) are customizable by the user.

http://res.freebase.com/s/f64a2f0cc4534b2b17140fd169cee825a7ed7ddcefe0bf81570301c72a83c0a8/resources/images/freebase-logo.png

The search powered by Parallax is very similar to CubicWeb's faceted search, except that Parallax provides the user with a list of suggested filters to add in addition to the default one. The user can even remove a filter. That is something we could think about for CubicWeb: provide a generated faceted search so that the user could decide which filters to use.

Parallax also provides topics related to the current data set, which eases navigation between sets of data. The main difference I could see between the view filters offered by Parallax and CubicWeb is that Parallax provides the same views for any type of data, whereas CubicWeb has specific views that depend on the data type and generic views that apply to any type of data. Parallax is a nice Web interface to browse data and it could be a good source of inspiration for CubicWeb.

http://www.zgeek.com/forum/gallery/files/6/3/2/img_228_96x96.jpg

During this talk, I mentioned that CubicWeb now understands SPARQL queries thanks to the fyzz parser.


Quizz WolframAlpha

2009/07/10 by Nicolas Chauvat
http://www.logilab.org/image/9609?vid=download

Wolfram Alpha is a web front-end to a huge database of information covering very different topics ranging from mathematical functions to genetics, geography, astronomy, etc.

When you search for a word, it tries to match it with one of the objects it has in its database and displays all the information it has concerning that object. For example, it can tell you a lot about Halley's Comet, including where it is at the moment you run the query. This is the main difference with, say, Wikipedia, which knows a lot about that comet in general but is not meant to compute its location in the sky at the moment you enter your query.

Searches are not limited to words. One can key in commands like weather in Paris in june 2009 or x^2+sin(x) and get results for those precise queries. The processing of the input query is far from bad, since it returns results for questions like what are the cities of France, but I would not call it state-of-the-art natural language processing: that query returns the largest cities instead of just the cities it knows about, and the question what are the smallest cities of France does not return any result. Natural language processing is a very difficult problem, though, especially when done in the open world, as is the case here with an engine available to the general public on the internet.

For more examples, visit the WolframAlpha website, where you will also be able to post feature requests or, if you are a developer, get documentation about the WolframAlpha API and maybe use it as a web service in your application when you need to answer certain types of questions.


EuroPython 2009

2009/07/06 by Nicolas Chauvat
http://www.logilab.org/image/9580?vid=download

Once again Logilab sponsored the EuroPython conference. We would like to thank the organization team (especially John Pinner and Laura Creighton) for their hard work. The Conservatoire is a very central location in Birmingham and walking around the city center and along the canals was nice. The website was helpful when preparing the trip and made it easy to find places to eat and stay. The conference program was full of talks about interesting topics.

I presented CubicWeb and spent a large part of my talk explaining what the semantic web is and what features we need in the tools we will use to be part of that web of data. I insisted on the fact that CubicWeb is made of two parts, the web engine and the data repository, and that the repository can be used without the web engine. I demonstrated this with a TurboGears application that used the CubicWeb repository as its persistence layer. RQL in TurboGears! See my slides and Reinout Van Rees' write-up.

Christian Tismer took over the development of Psyco a few months ago. He said he recently fixed some bugs that were show stoppers, including one that was generating way too many recompilations. His new version looks very promising. Performance improved, long numbers are supported, 64-bit support may become possible, generators work... and Stackless is about to be rebuilt on top of Psyco! Psyco 2.0 should be out today.

I had a nice chat with Cosmin Basca about the Semantic Web. He suggested using Mako as a templating language for CubicWeb. Cosmin is doing his PhD at DERI and develops SurfRDF which is an Object-RDF mapper that wraps a SPARQL endpoint to provide "discoverable" objects. See his slides and Reinout Van Rees' summary of his talk.

I saw a lightning talk about the Nagare framework, which refuses to use templating languages for the same reason we do not use them in CubicWeb. Is their h.something the right way of doing things? The example reminds me of the C++ concatenation operator. I am not really convinced by the continuation idea, since I have been for years a happy user of the reactor model that's implemented in frameworks like Twisted. Read the blog and documentation for more information.

I had a chat with Jasper Op de Coul about Infrae's OAI Server and the work he did to manage RDF data in Subversion and a relational database before publishing it within a web app based on YUI. We discussed code that handles books and library catalogs. Part of my CubicWeb demo was about books in DBpedia and cubicweb-book. He gave me a nice link to the WorldCat API.

Souheil Chelfouh showed me his work on Dolmen and Menhir. For several design problems and framework architecture issues, we compared the solutions offered by the Zope Toolkit library with the ones found by CubicWeb. I will have to read more about Martian and Grok to make sure I understand the details of that component architecture.

I had a chat with Martijn Faassen about packaging Python modules. A one sentence summary would be that the Python community should agree on a meta-data format that describes packages and their dependencies, then let everyone use the tool he likes most to manage the installation and removal of software on his system. I hope the work done during the last PyConUS and led by Tarek Ziadé arrived at the same conclusion. Read David Cournapeau's blog entry about Python Packaging for a detailed explanation of why the meta-data format is the way to go. By the way, Martijn is the lead developer of Grok and Martian.

Godefroid Chapelle and I talked a lot about Zope Toolkit (ZTK) and CubicWeb. We compared the way the two frameworks deal with pluggable components. ZTK has adapters and a registry. CubicWeb does not use adapters as ZTK does, but has a view selection mechanism that required a registry with more features than the one used in ZTK. The ZTK registry only has to match a tuple (Interface, Class) when looking for an adapter, whereas CubicWeb's registry has to find the views that can be applied to a result set by checking various properties:

  • interfaces: all items of first column implement the Calendar Interface,
  • dimensions: more than one line, more than two columns,
  • types: items of first column are numbers or dates,
  • form: form contains key XYZ that has a value lower than 10,
  • session: user is authenticated,
  • etc.
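
The selection mechanism described above can be sketched as follows. This is not CubicWeb's actual API: the class and function names are hypothetical, and the scoring rule (sum the selector scores, best score first) is an assumption made for the sake of the illustration.

```python
class ResultSet:
    """Hypothetical stand-in for a query result: rows of values plus context."""

    def __init__(self, rows, authenticated=False, form=None):
        self.rows = rows
        self.authenticated = authenticated
        self.form = form or {}


# A selector returns a score: 0 means "does not apply", a positive
# number means "applies" (higher meaning more specific).
def multiple_rows(rset):
    return 1 if len(rset.rows) > 1 else 0


def first_column_numeric(rset):
    ok = bool(rset.rows) and all(
        isinstance(row[0], (int, float)) for row in rset.rows
    )
    return 1 if ok else 0


def authenticated_user(rset):
    return 1 if rset.authenticated else 0


class ViewRegistry:
    """Hypothetical registry that matches views against result set properties."""

    def __init__(self):
        self._views = []

    def register(self, name, *selectors):
        self._views.append((name, selectors))

    def applicable(self, rset):
        """Return the names of views whose selectors all accept rset, best first."""
        scored = []
        for name, selectors in self._views:
            scores = [selector(rset) for selector in selectors]
            if all(scores):  # a view without selectors is generic and always applies
                scored.append((sum(scores), name))
        return [name for _, name in sorted(scored, key=lambda item: -item[0])]


registry = ViewRegistry()
registry.register("list")  # generic view, no condition
registry.register("plot", multiple_rows, first_column_numeric)
registry.register("admin", authenticated_user)

anonymous_views = registry.applicable(ResultSet([(1, "a"), (2, "b")]))
logged_in_views = registry.applicable(
    ResultSet([(1, "a"), (2, "b")], authenticated=True)
)
```

A lookup keyed on an (Interface, Class) tuple only needs a dictionary, whereas here every candidate view has to be evaluated against the result set, which is why the registry needs more features.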

As for Grok and Martian, I will have to look into the details to make sure nothing evil is hiding there. I should also find time to compare zope.schema and yams and write about it on this blog.

And if you want more information about the conference: