A matching and transformation tool for systems code

The Coccinelle's goal is to ease code maintenance by first revealing code smells based on design patterns and second easing an API (Application Programming Interface) change for a heavily used library. Coccinelle can thus be seen as two tools inside one. The first one matches patterns, the second applies transformations. However facing such a big problem, the project needed to define boundaries in order to increase chances of success. The building motivation was thus to target the Linux kernel. This choice has implied a tool working on the C programming language before the preprocessor step. Moreover the Linux code base adds interesing constraints as it is huge, contains many possible configurations depending on C macros, may contain many bugs and evolves a lot. What was the Coccinelle solution for easing the kernel maintenance?

http://farm1.static.flickr.com/151/398536506_57df539ccf_m.jpg

Generating diff files from the semantic patch langage

The Linux community reads lot of diff files for following the kernel evolution. As a consequence the diff file syntax is widely spread and commonly understood. However this syntax concerns a particular change between two files, its does not allow to match a generic pattern.

The Coccinelle's solution is to build its own langage allowing to declare rules describing a code pattern and a possible transformation. This langage is the Semantic Patch Langage (SmPL), based on the declarative approach of the diff file syntax. It allows to propagate a change rule to many files by generating diff files. Then those results can be directly applied by using the patch command but most of the time they will be reviewed and may be slightly adapted to the programmer's need.

A Coccinelle's rule is made of two parts: metavariable declaration and a code pattern match followed by a possible transformation. A metavariable means a control flow variable, its possibles names inside the program do not matter. Then the code pattern will describe a particular control flow in the program by using the C and SmPL syntaxes manipulating the metavariables. As a result, Coccinelle succeeds to generate diff files because it works on the C program control flow.

A complete SmPL description will not be given here because it can be found in the Coccinelle's documentation. However a brief introduction will be made on a rule declaration. The metavariable part will look like this:

@@
expression E;
constant C;
@@

'expression' means a variable or the result of a function. However 'constant' means a C constant. Then for negating the result of an and operation between an expression and a constant instead of negating the expression first, the transformation part will be:

- !E & C
+ !(E & C)

A file containing several rules like that will be called a semantic patch. It will be applied by using the Coccinelle 'spatch' command that will generate a change written in the diff file syntax each time the above pattern is matched. The next section will illustrate this way of work.

http://www.simplehelp.net/wp-images/icons/topic_linux.jpg

A working example on the Linux kernel 2.6.30

You can download and install Coccinelle 'spatch' command from its website: http://coccinelle.lip6.fr/ if you want to execute the following example. Let's first consider the following structure with accessors in the header 'device.h':

struct device {
    void *driver_data;
};

static inline void *dev_get_drvdata(const struct device *dev)
{
    return dev->driver_data;
}

static inline void dev_set_drvdata(struct device *dev, void* data)
{
    dev->driver_data = data;
}

it imitates the 2.6.30 kernel header 'include/linux/device.h'. Let's now consider the following client code that does not make use of the accessors:

#include <stdlib.h>
#include <assert.h>

#include "device.h"

int main()
{
    struct device devs[2], *dev_ptr;
    int data[2] = {3, 7};
    void *a = NULL, *b = NULL;

    devs[0].driver_data = (void*)(&data[0]);
    a = devs[0].driver_data;

    dev_ptr = &devs[1];
    dev_ptr->driver_data = (void*)(&data[1]);
    b = dev_ptr->driver_data;

    assert(*((int*)a) == 3);
    assert(*((int*)b) == 7);
    return 0;
}

Once this code saved in the file 'fake_device.c', we can check that the code compiles and runs by:

$ gcc fake_device.c && ./a.out

We will now create a semantic patch 'device_data.cocci' trying to add the getter accessor with this first rule:

@@
struct device dev;
@@
- dev.driver_data
+ dev_get_drvdata(&dev)

The 'spatch' command is then run by:

$ spatch -sp_file device_data.cocci fake_device.c

producing the following change in a diff file:

-    devs[0].driver_data = (void*)(&data[0]);
-    a = devs[0].driver_data;
+    dev_get_drvdata(&devs[0]) = (void*)(&data[0]);
+    a = dev_get_drvdata(&devs[0]);

which illustrates the great Coccinelle's way of work on program flow control. However the transformation has also matched code where the setter accessor should be used. We will thus add a rule above the previous one, the semantic patch becomes:

@@
struct device dev;
expression data;
@@
- dev.driver_data = data
+ dev_set_drvdata(&dev, data)

@@
struct device dev;
@@
- dev.driver_data
+ dev_get_drvdata(&dev)

Running the command again will produce the wanted output:

$ spatch -sp_file device_data.cocci fake_device.c
-    devs[0].driver_data = (void*)(&data[0]);
-    a = devs[0].driver_data;
+    dev_set_drvdata(&devs[0], (void *)(&data[0]));
+    a = dev_get_drvdata(&devs[0]);

It is important to write the setter rule before the getter rule else the getter rule will be applied first to the whole file.

At this point our semantic patch is still incomplete because it does not work on 'device' structure pointers. By using the same logic, let's add it to the 'device_data.cocci' semantic patch:

@@
struct device dev;
expression data;
@@
- dev.driver_data = data
+ dev_set_drvdata(&dev, data)

@@
struct device * dev;
expression data;
@@
- dev->driver_data = data
+ dev_set_drvdata(dev, data)

@@
struct device dev;
@@
- dev.driver_data
+ dev_get_drvdata(&dev)

@@
struct device * dev;
@@
- dev->driver_data
+ dev_get_drvdata(dev)

Running Coccinelle again:

$ spatch -sp_file device_data.cocci fake_device.c

will add the remaining transformations for the 'fake_device.c' file:

-    dev_ptr->driver_data = (void*)(&data[1]);
-    b = dev_ptr->driver_data;
+    dev_set_drvdata(dev_ptr, (void *)(&data[1]));
+    b = dev_get_drvdata(dev_ptr);

but a new problem appears: the 'device.h' header is also modified. We meet here an important point of the Coccinelle's philosophy described in the first section. 'spatch' is a tool to ease code maintenance by propagating a code pattern change to many files. However the resulting diff files are supposed to be reviewed and in our case the unwanted modification should be removed. Note that it would be possible to avoid the 'device.h' header modification by using SmPL syntax but the explanation would be too much for a starting tutorial. Instead, we will simply cut the unwanted part:

$ spatch -sp_file device_data.cocci fake_device.c | cut -d $'\n' -f 16-34

This result will now be kept in a diff file by moreover asking 'spatch' to produce it for the current working directory:

$ spatch -sp_file device_data.cocci -patch "" fake_device.c | \
cut -d $'\n' -f 16-34 > device_data.patch

It is now time to apply the change for getting a working C code using accessors:

$ patch -p1 < device_data.patch

The final result for 'fake_device.c' should be:

#include <stdlib.h>
#include <assert.h>

#include "device.h"

int main()
{
    struct device devs[2], *dev_ptr;
    int data[2] = {3, 7};
    void *a = NULL, *b = NULL;

    dev_set_drvdata(&devs[0], (void *)(&data[0]));
    a = dev_get_drvdata(&devs[0]);

    dev_ptr = &devs[1];
    dev_set_drvdata(dev_ptr, (void *)(&data[1]));
    b = dev_get_drvdata(dev_ptr);

    assert(*((int*)a) == 3);
    assert(*((int*)b) == 7);
    return 0;
}

Finally, we can test that the code compiles and runs:

.. sourcecode:: sh
$ gcc fake_device.c && ./a.out

The semantic patch is now ready to be used on the Linux's 2.6.30 kernel:

$ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.30.tar.bz2
$ tar xjf linux-2.6.30.tar.bz2
$ spatch -sp_file device_data.cocci -dir linux-2.6.30/drivers/net/ \
  > device_drivers_net.patch
$ wc -l device_drivers_net.patch
642

You may also try the 'drivers/ieee1394' directory.

http://coccinelle.lip6.fr/img/lip6.jpg

Conclusion

Coccinelle is made of around 60 thousands lines of Objective Caml. As illustrated by the above example on the linux kernel, the 'spatch' command succeeds to ease code maintenance. For the Coccinelle's team working on the kernel code base, a semantic patch is usually around 100 lines and will generated diff files to sometimes hundred of files. Moreover the processing is rather fast, the average time per file is said to be 0.7s.

Two tools using the 'spatch' engine have already been built: 'spdiff' and 'herodotos'. With the first one you could almost avoid to learn the SmPL language because the idea is to generate a semantic patch by looking to transformations between files pairs. The second allows to correlate defects over software versions once the corresponding code smells have been described in SmPL.

One of the Coccinelle's problem is to not being easily extendable to another language as the engine was designed for analyzing control flows on C programs. The C++ langage may be added but required obviously lot of work. It would be great to also have such a tool on dynamic languages like Python.

image under creative commons by Rémi Vannier

blog entry of