Logilab.org - en

News from Logilab and our Free Software projects, as well as on topics dear to our hearts (Python, Debian, Linux, the semantic web, scientific computing...)

  • Nazca is out!

    2012/12/21 by Simon Chabot

    What is it for?

    Nazca is a Python library that helps you align data. But what does “align data” mean? Suppose you have a list of cities, described by their name and their country, and you would like to find their URI on dbpedia to get more information about them, such as their longitude and latitude. With two or three cities this can be done by hand, but it cannot when there are hundreds or thousands of cities. Nazca provides everything you need to do it.

    This blog post introduces how the library works and how it can be used. Once you have understood the main concepts behind it, don't hesitate to try Nazca online!

    Introduction

    The alignment process is divided into three main steps:

    1. Gather and format the data we want to align. In this step, we define two sets called the alignset and the targetset. The alignset contains our data, and the targetset contains the data to which we would like to link it.
    2. Compute the similarity between the items gathered. We compute a distance matrix between the two sets according to a given distance.
    3. Find the items with a high similarity, thanks to the distance matrix.

    Simple case

    1. Let's define alignset and targetset as simple Python lists.
    alignset = ['Victor Hugo', 'Albert Camus']
    targetset = ['Albert Camus', 'Guillaume Apollinaire', 'Victor Hugo']
    
    2. Now we have to compute the similarity between each pair of items. For that purpose we use the Levenshtein distance [1], which is well suited to comparing short strings (a minimal pure-Python version is sketched below). Such a function is provided in the nazca.distances module.

      The next step is to compute the distance matrix according to the Levenshtein distance. The result is given in the following table.

       

                          Albert Camus   Guillaume Apollinaire   Victor Hugo
      Victor Hugo              6                   9                  0
      Albert Camus             0                   8                  6

    3. The alignment process ends by reading the matrix and declaring identical the items whose distance is below a given threshold.

    [1]Also called the edit distance, because the distance between two words is equal to the number of single-character edits required to change one word into the other.
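
    For intuition, here is a minimal pure-Python version of this distance (Nazca ships its own implementation in nazca.distances; this sketch only illustrates the character-level idea):

    def levenshtein(word1, word2):
        """Number of single-character edits (insertions, deletions,
        substitutions) needed to change word1 into word2."""
        previous = range(len(word2) + 1)
        for i, char1 in enumerate(word1, 1):
            current = [i]
            for j, char2 in enumerate(word2, 1):
                current.append(min(previous[j] + 1,       # deletion
                                   current[j - 1] + 1,    # insertion
                                   previous[j - 1] + (char1 != char2)))  # substitution
            previous = current
        return previous[-1]

    print(levenshtein('Dupont', 'Dupond'))   # 1: a single substitution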

    A more complex one

    The previous case was simple, because we had only one attribute to align (the name), but it is common to have several attributes to align, such as the name, the birth date and the birth city. The steps remain the same, except that three distance matrices will be computed, and the items will be represented as nested lists. See the following example:

    alignset = [['Paul Dupont', '14-08-1991', 'Paris'],
                ['Jacques Dupuis', '06-01-1999', 'Bressuire'],
                ['Michel Edouard', '18-04-1881', 'Nantes']]
    targetset = [['Dupond Paul', '14/08/1991', 'Paris'],
                 ['Edouard Michel', '18/04/1881', 'Nantes'],
                 ['Dupuis Jacques ', '06/01/1999', 'Bressuire'],
                 ['Dupont Paul', '01-12-2012', 'Paris']]
    

    In such a case, two distance functions are used: the Levenshtein one for the name and the city, and a temporal one for the birth date [2].
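
    Conceptually, the temporal distance boils down to counting the days between two parsed dates. Here is a minimal sketch (not Nazca's actual implementation, but it reproduces the values of the birthdate matrix below):

    from datetime import datetime

    def temporal_days(date1, date2, formats=('%d-%m-%Y', '%d/%m/%Y')):
        """Absolute number of days between two date strings,
        trying each accepted format in turn."""
        def parse(text):
            for fmt in formats:
                try:
                    return datetime.strptime(text, fmt)
                except ValueError:
                    continue
            raise ValueError('unrecognized date: %r' % text)
        return abs((parse(date1) - parse(date2)).days)

    print(temporal_days('14-08-1991', '01-12-2012'))   # 7780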

    The cdist function of nazca.distances enables us to compute those matrices:

    • For the names:
    >>> nazca.matrix.cdist([a[0] for a in alignset], [t[0] for t in targetset],
    ...                    'levenshtein', matrix_normalized=False)
    array([[ 1.,  6.,  5.,  0.],
           [ 5.,  6.,  0.,  5.],
           [ 6.,  0.,  6.,  6.]], dtype=float32)
    
                      Dupond Paul   Edouard Michel   Dupuis Jacques   Dupont Paul
    Paul Dupont            1              6                5               0
    Jacques Dupuis         5              6                0               5
    Edouard Michel         6              0                6               6
    • For the birthdates:
    >>> nazca.matrix.cdist([a[1] for a in alignset], [t[1] for t in targetset],
    ...                    'temporal', matrix_normalized=False)
    array([[     0.,  40294.,   2702.,   7780.],
           [  2702.,  42996.,      0.,   5078.],
           [ 40294.,      0.,  42996.,  48074.]], dtype=float32)
    
                 14/08/1991   18/04/1881   06/01/1999   01-12-2012
    14-08-1991        0          40294        2702         7780
    06-01-1999     2702          42996           0         5078
    18-04-1881    40294              0       42996        48074
    • For the birthplaces:
    >>> nazca.matrix.cdist([a[2] for a in alignset], [t[2] for t in targetset],
    ...                    'levenshtein', matrix_normalized=False)
    array([[ 0.,  4.,  8.,  0.],
           [ 8.,  9.,  0.,  8.],
           [ 4.,  0.,  9.,  4.]], dtype=float32)
    
                Paris   Nantes   Bressuire   Paris
    Paris         0        4         8         0
    Bressuire     8        9         0         8
    Nantes        4        0         9         4

    The next step is to gather those three matrices into a global one, called the global alignment matrix; here it is simply the element-wise sum of the three matrices. Thus we have:

             0        1        2        3
    0        1    40304     2715     7780
    1     2715    43011        0     5091
    2    40304        0    43011    48084
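
    Reproducing this combination with numpy from the three matrices above is straightforward (one could equally apply weights at this point to give more importance to one attribute):

    import numpy as np

    names = np.array([[1., 6., 5., 0.],
                      [5., 6., 0., 5.],
                      [6., 0., 6., 6.]])
    birthdates = np.array([[0., 40294., 2702., 7780.],
                           [2702., 42996., 0., 5078.],
                           [40294., 0., 42996., 48074.]])
    birthplaces = np.array([[0., 4., 8., 0.],
                            [8., 9., 0., 8.],
                            [4., 0., 9., 4.]])

    # the global alignment matrix is the element-wise sum
    global_matrix = names + birthdates + birthplaces
    print(global_matrix[0])   # first row: 1, 40304, 2715, 7780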

    Allowing for some misspellings (for example Dupont and Dupond are very close), the matching threshold can be set to 1 or 2. We can then see that item 0 of our alignset is the same as item 0 of the targetset, and that item 1 of the alignset matches item 2 of the targetset: the links can be made!

    It is important to notice that even though item 0 of the alignset and item 3 of the targetset have the same name and the same birthplace, they are unlikely to be identical because their birth dates are very different.

    You may have noticed that working with matrices by hand, as in this example, is a little tedious. The good news is that Nazca does all this work for you: you just provide the sets and the distance functions. Another piece of good news is that the project comes with the functions needed to build the sets!

    [2]Provided in the nazca.distances module.

    Real applications

    Just before we start, we will assume the following imports have been done:

    from nazca import dataio as aldio   #Functions for input and output data
    from nazca import distances as ald  #Functions to compute the distances
    from nazca import normalize as aln  #Functions to normalize data
    from nazca import aligner as ala    #Functions to align data
    

    The Goncourt prize

    On wikipedia, we can find the list of Goncourt prize winners, and we would like to establish a link between the winners and their URI on dbpedia (let's imagine the Goncourt prize winners category does not exist in dbpedia).

    We simply copy/paste the winners list from wikipedia into a file and replace all the separators (- and ,) with #. The beginning of our file thus looks like this:

    1903#John-Antoine Nau#Force ennemie (Plume)
    1904#Léon Frapié#La Maternelle (Albin Michel)
    1905#Claude Farrère#Les Civilisés (Paul Ollendorff)
    1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)

    When using the high-level functions of this library, each item must have at least two elements: an identifier (the name, or the URI) and the attribute to compare. With the previous file, we will use the name (column number 1) both as the identifier (we have no URI to use here) and as the attribute to align. This is expressed in Python with the following code:

    alignset = aldio.parsefile('prixgoncourt', indexes=[1, 1], delimiter='#')
    

    So, the beginning of our alignset is:

    >>> alignset[:3]
    [[u'John-Antoine Nau', u'John-Antoine Nau'],
     [u'Léon Frapié', u'Léon Frapié'],
     [u'Claude Farrère', u'Claude Farrère']]
    

    Now, let's build the targetset thanks to a sparql query and the dbpedia endpoint. We ask for the list of French novelists, described by their URI and their name in French:

    query = """
         SELECT ?writer, ?name WHERE {
           ?writer  <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:French_novelists>.
           ?writer rdfs:label ?name.
           FILTER(lang(?name) = 'fr')
        }
     """
    targetset = aldio.sparqlquery('http://dbpedia.org/sparql', query)
    

    Both functions return nested lists, as presented before. Now we have to define the distance function to be used for the alignment. This is done with a Python dictionary where the keys are the columns to work on, and the values are the treatments to apply.

    treatments = {1: {'metric': ald.levenshtein}} # Use a levenshtein on the name
                                                  # (column 1)
    

    Finally, the last thing we have to do, is to call the alignall function:

    alignments = ala.alignall(alignset, targetset,
                              0.4,        # this is the matching threshold
                              treatments,
                              mode=None,  # we'll discuss this mode later
                              uniq=True)  # get the best results only
    

    This function returns an iterator over the alignments made. You can see the results with the following code:

    for a, t in alignments:
        print '%s has been aligned onto %s' % (a, t)
    

    It may be important to apply some pre-processing to the data to align. For instance, names can be written in lower or upper case, with extra characters such as punctuation, or with unwanted information in parentheses, and so on. That is why we provide some functions to normalize your data. The most useful may be the simplify() function (see its docstring for more information). The treatments can then be given as follows:

    def remove_after(string, sub):
        """ Remove the text after ``sub`` in ``string``
            >>> remove_after('I like cats and dogs', 'and')
            'I like cats'
            >>> remove_after('I like cats and dogs', '(')
            'I like cats and dogs'
        """
        try:
            return string[:string.lower().index(sub.lower())].strip()
        except ValueError:
            return string
    
    
    treatments = {1: {'normalization': [lambda x:remove_after(x, '('),
                                        aln.simplify],
                      'metric': ald.levenshtein
                     }
                 }
    

    Cities alignment

    The previous case with the Goncourt prize winners was pretty simple, because the number of items was small and the computation fast. But in a more realistic use case, the number of items to align may be huge (thousands or millions…). In such a case it is unthinkable to build the global alignment matrix, because it would be too big and the computation would take (at least...) a few days. So the idea is to make small groups of possibly similar data and to compute smaller matrices (i.e. a divide and conquer approach). For this purpose, we provide some functions to group/cluster data; there are functions for both textual and numerical data.

    This is the code used; we will explain it below:

    targetset = aldio.rqlquery('http://demo.cubicweb.org/geonames',
                               """Any U, N, LONG, LAT WHERE X is Location, X name
                                  N, X country C, C name "France", X longitude
                                  LONG, X latitude LAT, X population > 1000, X
                                  feature_class "P", X cwuri U""",
                               indexes=[0, 1, (2, 3)])
    alignset = aldio.sparqlquery('http://dbpedia.inria.fr/sparql',
                                 """prefix db-owl: <http://dbpedia.org/ontology/>
                                 prefix db-prop: <http://fr.dbpedia.org/property/>
                                 select ?ville, ?name, ?long, ?lat where {
                                  ?ville db-owl:country <http://fr.dbpedia.org/resource/France> .
                                  ?ville rdf:type db-owl:PopulatedPlace .
                                  ?ville db-owl:populationTotal ?population .
                                  ?ville foaf:name ?name .
                                  ?ville db-prop:longitude ?long .
                                  ?ville db-prop:latitude ?lat .
                                  FILTER (?population > 1000)
                                 }""",
                                 indexes=[0, 1, (2, 3)])
    
    
    treatments = {1: {'normalization': [aln.simply],
                      'metric': ald.levenshtein,
                      'matrix_normalized': False
                     }
                 }
    results = ala.alignall(alignset, targetset, 3, treatments=treatments, #As before
                           indexes=(2, 2), #On which data build the kdtree
                           mode='kdtree',  #The mode to use
                           uniq=True) #Return only the best results
    

    Let's explain the code. We have two data sets containing the lists of cities we want to align; the first column is the identifier, the second one is the name of the city, and the last one is the location of the city (longitude and latitude), gathered into a single tuple.

    In this example, we want to build a kd-tree on the (longitude, latitude) couple to divide our data into a few groups of candidates. This clustering is coarse, and is only used to reduce the number of potential candidates without losing any refined possible match.

    So, in the next step, we define the treatments to apply. They are the same as before, but we ask for a non-normalized matrix (i.e. the raw output of the Levenshtein distance). Then we call the alignall function: indexes is a tuple giving the position of the point on which the kd-tree must be built, and mode is the mode used to find neighbours [3].

    Finally, uniq asks the function to return only the best candidate (i.e. the one having the shortest distance below the given threshold).

    The function outputs a generator yielding tuples where the first element is the identifier of the alignset item and the second one is the targetset one (it may take some time before the first tuples are yielded, because all the computation must be done first…).

    [3]The available modes are kdtree, kmeans and minibatch for numerical data and minhashing for text one.
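
    To give an intuition of what the kdtree mode does (this is a toy sketch using scipy, not Nazca's actual code, and the coordinates are made up), the idea is to retrieve, for each alignset city, only the targetset cities that are geographically close, and to run the expensive string comparison on those pairs only:

    import numpy as np
    from scipy.spatial import KDTree

    # hypothetical (longitude, latitude) positions for both sets
    align_pos = np.array([[2.35, 48.85],     # Paris (alignset)
                          [-0.49, 46.84]])   # Bressuire (alignset)
    target_pos = np.array([[2.35, 48.86],    # Paris (targetset)
                           [-1.55, 47.22]])  # Nantes (targetset)

    tree = KDTree(target_pos)
    # keep only the targetset cities closer than ~0.1 degree
    for i, matches in enumerate(tree.query_ball_point(align_pos, r=0.1)):
        print(i, matches)   # 0 [0], then 1 []: no candidate for Bressuire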

    Try it online!

    We have also made a little web application out of Nazca, using Cubicweb. This application provides a user interface for Nazca, helping you to choose what you want to align. You can use sparql or rql queries, as in the previous examples, or import your own csv file [4]. Once you have chosen what you want to align, you can click the Next step button to customize the treatments you want to apply, just as you did before in Python! Once done, clicking Next step again starts the alignment process. Wait a little bit, and you can either download the results as a csv or rdf file, or see them directly online by choosing the html output.

    [4]Your csv file must be tab-separated for the moment…

  • Openstack, Wheezy and ZFS on Linux

    2012/12/19 by David Douard

    A while ago, I started the install of an OpenStack cluster at Logilab, so that our developers can easily play with any kind of environment. We are planning to improve our Apycot automatic testing platform so it can use this "elastic power". And so on.

    http://www.openstack.org/themes/openstack/images/open-stack-cloud-computing-logo-2.png

    I first tried an Ubuntu Precise based setup since, at that time, the Debian packages were not really usable. That setup never reached a point where it could be released as production-ready, because I had tried too complex and bleeding-edge a configuration (involving Quantum, openvswitch, sheepdog)...

    Meanwhile, we ran really short of storage capacity. For now, it mainly consists of hard drives distributed in our 19" Dell racks (generally behind hardware RAID controllers). So I recently purchased a low-cost storage bay (a SuperMicro SC937 with a 6Gb/s JBOD-only HBA) with 18 spinning hard drives and 4 SSDs, the whole driven by ZFS on Linux (tip: an SSD-stored ZIL is a requirement to get decent performance). This storage setup is still under test for now.

    http://zfsonlinux.org/images/zfs-linux.png

    I also went to the last Mini-DebConf in Paris, where Loic Dachary presented the status of the OpenStack packaging effort in Debian. This gave me the will to give OpenStack a new try using Wheezy and a somewhat simpler setup. But I could not consider not using my new ZFS-based storage as a nova volume provider. This is not available in OpenStack for now (there is a backend for Solaris, but not for ZFS on Linux). However, this is Python, and in fact the current ISCSIDriver backend needs very little to work with zfs instead of lvm as an "elastic" block-volume provider and manager.

    So I wrote a custom nova volume driver to handle this. As I don't want the nova-volume daemon to run on my ZFS SAN, I wrote this backend mixing the SanISCSIDriver (which manages the storage system via SSH) and the standard ISCSIDriver (which uses the standard Linux iSCSI target tools). I'm not very fond of the API of the VolumeDriver (especially the fact that the ISCSIDriver is responsible for two roles: managing block-level volumes and exporting block-level volumes). This small design flaw (IMHO) is the reason I had to duplicate some code (not much, but still...) to implement my ZFSonLinuxISCSIDriver...

    So here is the setup I made:

    Infrastructure

    My OpenStack Essex "cluster" consists for now of:

    • one control node, running in a "normal" libvirt-controlled virtual machine; it is a Wheezy that runs:
      • nova-api
      • nova-cert
      • nova-network
      • nova-scheduler
      • nova-volume
      • glance
      • postgresql
      • OpenStack dashboard
    • one computing node (Dell R310, Xeon X3480, 32G, Wheezy), which runs:
      • nova-api
      • nova-network
      • nova-compute
    • a ZFS-on-Linux SAN (3x raidz1 pools made of 6 1T drives, 2x (mirrored) 32G SLC SSDs, 2x 120G MLC SSDs for cache); for now, the storage is exported by the SAN via one 1G ethernet link.

    OpenStack Essex setup

    I mainly followed the Debian HOWTO to set up my private cloud. I mostly tuned the network settings to match my environment (and the fact that my control node lives in a VM, with the VLAN stuff handled by the host).

    I easily got a working setup (I must admit that my previous experiment with OpenStack helped a lot when dealing with custom configurations... and vocabulary; I'm not sure I would have succeeded "easily" just following the HOWTO, but hey, it is a functional HOWTO, meaning that if you do not follow the instructions because you want special tunings, don't blame the HOWTO).

    Compared to the HOWTO, my nova.conf looks like (as of today):

    [DEFAULT]
    logdir=/var/log/nova
    state_path=/var/lib/nova
    lock_path=/var/lock/nova
    root_helper=sudo nova-rootwrap
    auth_strategy=keystone
    dhcpbridge_flagfile=/etc/nova/nova.conf
    dhcpbridge=/usr/bin/nova-dhcpbridge
    sql_connection=postgresql://novacommon:XXX@control.openstack.logilab.fr/nova
    
    ##  Network config
    # A nova-network on each compute node
    multi_host=true
    # VLAN manager
    network_manager=nova.network.manager.VlanManager
    vlan_interface=eth1
    # My ip
    my-ip=172.17.10.2
    public_interface=eth0
    # Dmz & metadata things
    dmz_cidr=169.254.169.254/32
    ec2_dmz_host=169.254.169.254
    metadata_host=169.254.169.254
    
    ## More general things
    # The RabbitMQ host
    rabbit_host=control.openstack.logilab.fr
    
    ## Glance
    image_service=nova.image.glance.GlanceImageService
    glance_api_servers=control.openstack.logilab.fr:9292
    use-syslog=true
    ec2_host=control.openstack.logilab.fr
    
    novncproxy_base_url=http://control.openstack.logilab.fr:6080/vnc_auto.html
    vncserver_listen=0.0.0.0
    vncserver_proxyclient_address=127.0.0.1
    

    Volume

    I had a bit more work to do to make nova-volume work. First, I got hit by this nasty bug #695791, which is trivial to fix... when you know how to fix it (I noticed the bug report after I had fixed it by myself).

    Then, as I wanted the volumes to be stored and exported by my shiny new ZFS-on-Linux setup, I had to write my own volume driver, which was quite easy, since it is Python, and the logic to implement was already provided by the ISCSIDriver class on the one hand and by the SanISCSIDriver on the other hand. So I ended up with this first implementation. The file should be copied into nova's volume package directory (nova/volume/zol.py):

    # vim: tabstop=4 shiftwidth=4 softtabstop=4
    
    # Copyright 2010 United States Government as represented by the
    # Administrator of the National Aeronautics and Space Administration.
    # Copyright 2011 Justin Santa Barbara
    # Copyright 2012 David DOUARD, LOGILAB S.A.
    # All Rights Reserved.
    #
    #    Licensed under the Apache License, Version 2.0 (the "License"); you may
    #    not use this file except in compliance with the License. You may obtain
    #    a copy of the License at
    #
    #         http://www.apache.org/licenses/LICENSE-2.0
    #
    #    Unless required by applicable law or agreed to in writing, software
    #    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    #    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    #    License for the specific language governing permissions and limitations
    #    under the License.
    """
    Driver for ZFS-on-Linux-stored volumes.
    
    This is mainly a custom version of the ISCSIDriver that uses ZFS as
    volume provider, generally accessed over SSH.
    """
    
    import os
    
    from nova import exception
    from nova import flags
    from nova import utils
    from nova import log as logging
    from nova.openstack.common import cfg
    from nova.volume.driver import _iscsi_location
    from nova.volume import iscsi
    from nova.volume.san import SanISCSIDriver
    
    
    LOG = logging.getLogger(__name__)
    
    san_opts = [
        cfg.StrOpt('san_zfs_command',
                   default='/sbin/zfs',
                   help='The ZFS command.'),
        ]
    
    FLAGS = flags.FLAGS
    FLAGS.register_opts(san_opts)
    
    
    class ZFSonLinuxISCSIDriver(SanISCSIDriver):
        """Executes commands relating to ZFS-on-Linux-hosted ISCSI volumes.
    
        Basic setup for a ZoL iSCSI server:
    
        XXX
    
        Note that current implementation of ZFS on Linux does not handle:
    
          zfs allow/unallow
    
        For now, needs to have root access to the ZFS host. The best is to
        use a ssh key with ssh authorized_keys restriction mechanisms to
        limit root access.
    
        Make sure you can login using san_login & san_password/san_private_key
        """
        ZFSCMD = FLAGS.san_zfs_command
    
        _local_execute = utils.execute
    
        def _getrl(self):
            return self._runlocal
        def _setrl(self, v):
            if isinstance(v, basestring):
                v = v.lower() in ('true', 't', '1', 'y', 'yes')
            self._runlocal = v
        run_local = property(_getrl, _setrl)
    
        def __init__(self):
            super(ZFSonLinuxISCSIDriver, self).__init__()
            self.tgtadm.set_execute(self._execute)
            LOG.info("run local = %s (%s)" % (self.run_local, FLAGS.san_is_local))
    
        def set_execute(self, execute):
            LOG.debug("override local execute cmd with %s (%s)" %
                      (repr(execute), execute.__module__))
            self._local_execute = execute
    
        def _execute(self, *cmd, **kwargs):
            if self.run_local:
                LOG.debug("LOCAL execute cmd %s (%s)" % (cmd, kwargs))
                return self._local_execute(*cmd, **kwargs)
            else:
                LOG.debug("SSH execute cmd %s (%s)" % (cmd, kwargs))
                check_exit_code = kwargs.pop('check_exit_code', None)
                command = ' '.join(cmd)
                return self._run_ssh(command, check_exit_code)
    
        def _create_volume(self, volume_name, sizestr):
            zfs_poolname = self._build_zfs_poolname(volume_name)
    
            # Create a zfs volume
            cmd = [self.ZFSCMD, 'create']
            if FLAGS.san_thin_provision:
                cmd.append('-s')
            cmd.extend(['-V', sizestr])
            cmd.append(zfs_poolname)
            self._execute(*cmd)
    
        def _volume_not_present(self, volume_name):
            zfs_poolname = self._build_zfs_poolname(volume_name)
            try:
                out, err = self._execute(self.ZFSCMD, 'list', '-H', zfs_poolname)
                if out.startswith(zfs_poolname):
                    return False
            except Exception:
                # the zfs list command failed: the volume isn't present
                return True
            return False
    
        def create_volume_from_snapshot(self, volume, snapshot):
            """Creates a volume from a snapshot."""
            zfs_snap = self._build_zfs_poolname(snapshot['name'])
            zfs_vol = self._build_zfs_poolname(volume['name'])
            self._execute(self.ZFSCMD, 'clone', zfs_snap, zfs_vol)
            self._execute(self.ZFSCMD, 'promote', zfs_vol)
    
        def delete_volume(self, volume):
            """Deletes a volume."""
            if self._volume_not_present(volume['name']):
                # If the volume isn't present, then don't attempt to delete
                return True
            zfs_poolname = self._build_zfs_poolname(volume['name'])
            self._execute(self.ZFSCMD, 'destroy', zfs_poolname)
    
        def create_export(self, context, volume):
            """Creates an export for a logical volume."""
            self._ensure_iscsi_targets(context, volume['host'])
            iscsi_target = self.db.volume_allocate_iscsi_target(context,
                                                                volume['id'],
                                                                volume['host'])
            iscsi_name = "%s%s" % (FLAGS.iscsi_target_prefix, volume['name'])
            volume_path = self.local_path(volume)
    
            # XXX (ddouard) this code is not robust: does not check for
            # existing iscsi targets on the host (ie. not created by
            # nova), but fixing it require a deep refactoring of the iscsi
            # handling code (which is what have been done in cinder)
            self.tgtadm.new_target(iscsi_name, iscsi_target)
            self.tgtadm.new_logicalunit(iscsi_target, 0, volume_path)
    
            if FLAGS.iscsi_helper == 'tgtadm':
                lun = 1
            else:
                lun = 0
            if self.run_local:
                iscsi_ip_address = FLAGS.iscsi_ip_address
            else:
                iscsi_ip_address = FLAGS.san_ip
            return {'provider_location': _iscsi_location(
                    iscsi_ip_address, iscsi_target, iscsi_name, lun)}
    
        def remove_export(self, context, volume):
            """Removes an export for a logical volume."""
            try:
                iscsi_target = self.db.volume_get_iscsi_target_num(context,
                                                                   volume['id'])
            except exception.NotFound:
                LOG.info(_("Skipping remove_export. No iscsi_target " +
                           "provisioned for volume: %d"), volume['id'])
                return
    
            try:
                # ietadm show will exit with an error
                # this export has already been removed
                self.tgtadm.show_target(iscsi_target)
            except Exception as e:
                LOG.info(_("Skipping remove_export. No iscsi_target " +
                           "is presently exported for volume: %d"), volume['id'])
                return
    
            self.tgtadm.delete_logicalunit(iscsi_target, 0)
            self.tgtadm.delete_target(iscsi_target)
    
        def check_for_export(self, context, volume_id):
            """Make sure volume is exported."""
            tid = self.db.volume_get_iscsi_target_num(context, volume_id)
            try:
                self.tgtadm.show_target(tid)
            except exception.ProcessExecutionError, e:
                # Instances remount read-only in this case.
                # /etc/init.d/iscsitarget restart and rebooting nova-volume
                # is better since ensure_export() works at boot time.
                LOG.error(_("Cannot confirm exported volume "
                            "id:%(volume_id)s.") % locals())
                raise
    
        def local_path(self, volume):
            zfs_poolname = self._build_zfs_poolname(volume['name'])
            zvoldev = '/dev/zvol/%s' % zfs_poolname
            return zvoldev
    
        def _build_zfs_poolname(self, volume_name):
            zfs_poolname = '%s%s' % (FLAGS.san_zfs_volume_base, volume_name)
            return zfs_poolname
    

    To configure my nova-volume instance (which runs on the control node, since it is only a manager), I added these lines to my nova.conf file:

    # nova-volume config
    volume_driver=nova.volume.zol.ZFSonLinuxISCSIDriver
    iscsi_ip_address=172.17.1.7
    iscsi_helper=tgtadm
    san_thin_provision=false
    san_ip=172.17.1.7
    san_private_key=/etc/nova/sankey
    san_login=root
    san_zfs_volume_base=data/openstack/volume/
    san_is_local=false
    verbose=true
    

    Note that the private key (/etc/nova/sankey here) is stored in the clear and that it must be readable by the nova user.

    This key being stored in the clear and giving root access to my ZFS host, I have limited this root access a bit by using a custom command wrapper in the .ssh/authorized_keys file.

    Something like this (naive implementation):

    [root@zfshost ~]$ cat /root/zfswrapper
    #!/bin/sh
    CMD=`echo $SSH_ORIGINAL_COMMAND | awk '{print $1}'`
    if [ "$CMD" != "/sbin/zfs" && "$CMD" != "tgtadm" ]; then
      echo "Can do only zfs/tgtadm stuff here"
      exit 1
    fi
    
    echo "[`date`] $SSH_ORIGINAL_COMMAND" >> .zfsopenstack.log
    exec $SSH_ORIGINAL_COMMAND
    

    Using this in root's .ssh/authorized_keys file:

    [root@zfshost ~]$ cat /root/.ssh/authorized_keys | grep control
    from="control.openstack.logilab.fr",no-pty,no-port-forwarding,no-X11-forwarding, \
          no-agent-forwarding,command="/root/zfswrapper" ssh-rsa AAAA[...] root@control
    

    I had to set iscsi_ip_address (the IP address of the ZFS host), but I think this is the consequence of something mistakenly implemented in my ZFSonLinux driver.

    Using this config, I can boot an image, create a volume on my ZFS storage, and attach it to the running image.

    I still have to test things like snapshots and (live?) migration. This is a very first draft implementation, which needs to be refined, improved and tested.

    What's next

    Besides the fact that it needs more tests, I plan to use salt for my OpenStack deployment (first to add more compute nodes to my cluster), and on the other side I'd like to try salt-cloud so that I get a bunch of Debian images that "just work" (without having to port the cloud-init Ubuntu package).

    On the side of my zol driver, I need to port it to Cinder, but I do not have a Folsom install to test it on...


  • Announcing pylint.org

    2012/12/04 by Arthur Lutz

    Pylint - the world renowned Python code static checker - now has a landing page: http://www.pylint.org

    http://www.python.org/images/python-logo.gif

    We've tried to summarize all the things a newcomer should know about pylint. We hope it reflects the diversity of uses and support channels for pylint.

    Open and decentralized Web

    Note that pylint is not hosted on github or another well-known forge, since we firmly believe in a decentralized architecture for the web.

    This applies especially to open source software development. Pylint's development is self-hosted on a forge and its code is version-controlled with mercurial, a distributed version control system (DVCS). Both tools are free software written in python.

    http://www.zjulian.com/wp-content/uploads/2012/05/Centralized-Decentralized-And-Distributed-System.jpg

    We know that centralized (and closed source) platforms for managing software projects can make things easier for contributors. We have enabled a mirror on bitbucket (and one for pylint-brain) so as to ease forks and pull requests. Pull requests can be made there, and even from a self-hosted mercurial repository (with a quick email to the mailing-list).

    Feel free to add your comments or feedback below.


  • Mini-DebConf Paris 2012

    2012/11/29 by Julien Cristau

    Last weekend, I attended the mini-DebConf organized at EPITA (near Paris) by the French Debian association and sponsored by Logilab.

    http://www.logilab.org/file/112649?vid=download

    The event was a great success, with a rather large number of attendees, including people coming from abroad such as Debian kernel maintainers Ben Hutchings and Maximilian Attems, who talked about their work with Linux.

    Among the other speakers were Loïc Dachary about OpenStack and its packaging in Debian, and Josselin Mouette about his work deploying Debian/GNOME desktops in a large enterprise environment at EDF R&D.

    On my part I gave a talk on Saturday about Debian's release team, and the current state of the wheezy (to-be Debian 7.0) release.

    On Sunday I presented together with Vladimir Daric the work we did to migrate a computation cluster from Red Hat to Debian. Attendees had quite a few questions about our use of ZFS on Linux for storage, and salt for configuration management and deployment.

    Slides for the talks are available on the mini-DebConf web page (wheezy state, migration to debian cluster also viewable on slideshare), and videos will soon be on http://video.debian.net/.

    Now looking forward to next summer's DebConf13 in Switzerland, and hopefully next year's edition of the Paris event.


  • PyLint 0.26 is out

    2012/10/08 by Sylvain Thenault

    I'm very pleased to announce new releases of Pylint and the underlying ASTNG library, respectively 0.26 and 0.24.1. The great news is that both bring a lot of new features and some bug fixes, mostly provided by community effort.

    We're still trying to make it easier to contribute to our free software projects at Logilab, so I hope this will continue and we'll get even more contributions in the near future, and an even smarter/faster/whatever pylint!

    For more details, see ChangeLog files or http://www.logilab.org/project/pylint/0.26.0 and http://www.logilab.org/project/logilab-astng/0.24.1

    So many thanks to all those who made that release, and enjoy!


  • Profiling tools

    2012/09/07 by Alain Leufroy

    Python

    Run time profiling with cProfile

    Python is distributed with profiling modules. They describe the run time operation of a pure python program, providing a variety of statistics.

    The cProfile module is the recommended one. To execute your program under the control of cProfile, a simple invocation is:

    $ python -m cProfile -s cumulative mypythonscript.py
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          16    0.055    0.003   15.801    0.988 __init__.py:1(<module>)
           1    0.000    0.000   11.113   11.113 __init__.py:35(extract)
         135    7.351    0.054   11.078    0.082 __init__.py:25(iter_extract)
    10350736    3.628    0.000    3.628    0.000 {method 'startswith' of 'str' objects}
           1    0.000    0.000    2.422    2.422 pyplot.py:123(show)
           1    0.000    0.000    2.422    2.422 backend_bases.py:69(__call__)
           ...
    

    Each column provides information about the execution time of every function call. -s cumulative sorts the results by descending cumulative time.

    Note:

    You can profile a particular python function such as main()

    >>> import profile
    >>> profile.run('main()')
    

    Graphical tools to show profiling results

    Even though reporting tools are included in the cProfile profiler, it can be interesting to use graphical tools. Most of them work with a stats file that can be generated by cProfile using the -o filepath option.
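
    Such a stats file can also be explored directly from Python with the standard pstats module, independently of any graphical tool:

    import pstats

    # load the file produced by: python -m cProfile -o output.pstats myscript.py
    stats = pstats.Stats('output.pstats')
    stats.strip_dirs().sort_stats('cumulative').print_stats(10)   # 10 costliest entries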

    Below are some of the available graphical tools that we tested.

    Gprof2Dot

    is a python-based tool that transforms the profiling output into a picture containing the call tree graph (using graphviz). A typical profiling session with python looks like this:

    $ python -m cProfile -o output.pstats mypythonscript.py
    $ gprof2dot.py -f pstats output.pstats | dot -Tpng -o profiling_results.png
    
    http://wiki.jrfonseca.googlecode.com/git/gprof2dot.png

    Each node of the output graph represents a function and has the following layout:

    +----------------------------------+
    |   function name : module name    |
    | total time including sub-calls % |  total time including sub-calls %
    |    (self execution time %)       |------------------------------------>
    |  total number of self calls      |
    +----------------------------------+
    

    Nodes and edges are colored according to the "total time" spent in the functions.

    Note: The following small patch makes the node color correspond to the execution time and the edge color to the "total time":
    diff -r da2b31597c5f gprof2dot.py
    --- a/gprof2dot.py      Fri Aug 31 16:38:37 2012 +0200
    +++ b/gprof2dot.py      Fri Aug 31 16:40:56 2012 +0200
    @@ -2628,6 +2628,7 @@
                     weight = function.weight
                 else:
                     weight = 0.0
    +            weight = function[TIME_RATIO]
    
                 label = '\n'.join(labels)
                 self.node(function.id,
    
    PyProf2CallTree

    is a script that helps visualize profiling data with the KCacheGrind graphical calltree analyzer. This is a more interactive solution than Gprof2Dot, but it requires installing KCacheGrind. Typical usage:

    $ python -m cProfile -o stat.prof mypythonscript.py
    $ python pyprof2calltree.py -i stat.prof -k
    

    The profiling data file is opened in KCacheGrind by the pyprof2calltree module, whose -k switch automatically opens KCacheGrind.

    http://kcachegrind.sourceforge.net/html/pics/KcgShot3Large.gif

    There are other tools that are worth testing:

    • RunSnakeRun is an interactive GUI tool which visualizes profile file using square maps:

      $ python -m cProfile -o stat.prof mypythonscript.py
      $ runsnake stat.prof
      
    • pycallgraph generates PNG images of a call tree with the total number of calls:

      $ pycallgraph mypythonscript.py
      
    • lsprofcalltree also uses KCacheGrind to display profiling data:

      $ python lsprofcalltree.py -o output.log yourprogram.py
      $ kcachegrind output.log
      

    C/C++ extension profiling

    For optimization purposes one may have python extensions written in C/C++. For such modules, cProfile will not dig into the corresponding call tree. Dedicated tools, mostly external to Python, must be used to profile a C/C++ extension called from python.

    Yep

    is a python module dedicated to profiling compiled python extensions. It uses the google CPU profiler:

    $ python -m yep --callgrind mypythonscript.py
    

    Memory Profiler

    You may want to keep track of the amount of memory used by a python program. There is an interesting module that fits this need: memory_profiler.

    You can fetch the memory consumption of a program over time using:

    >>> from memory_profiler import memory_usage
    >>> memory_usage(main, (), {})
    

    memory_profiler can also spot the lines that consume the most memory, using pdb or IPython.
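
    For line-by-line figures, a common pattern (assuming memory_profiler is installed) is to decorate the function of interest with its profile decorator; running the script then prints the memory increment of each line:

    from memory_profiler import profile

    @profile
    def process():
        big = [0] * (10 ** 7)    # the allocation shows up as a large increment
        small = [0] * (10 ** 5)
        del big                  # freed memory shows up as a decrement
        return small

    if __name__ == '__main__':
        process()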

    General purpose Profiling

    The Linux perf tool gives access to a wide variety of performance counter subsystems. Using perf, any execution configuration (pure python programs, compiled extensions, subprocess, etc.) may be profiled.

    Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.

    You can get information about execution times with:

    $ perf stat -e cpu-cycles,cpu-clock,task-clock python mypythonscript.py
    

    You can get RAM access information using:

    $ perf stat -e cache-misses python mypythonscript.py
    

    Be aware that perf gives the raw values of the hardware counters. You need to know exactly what you are looking for and how to interpret these values in the context of your program.

    Note that you can use Gprof2Dot to get a more user-friendly output:

    $ perf record -g python mypythonscript.py
    $ perf script | gprof2dot.py -f perf | dot -Tpng -o output.png
    

  • PyLint 0.25.2 and related projects released

    2012/07/18 by Sylvain Thenault

    I'm pleased to announce the new release of Pylint and related projects (i.e. logilab-astng and logilab-common)!

    By installing PyLint 0.25.2, ASTNG 0.24 and logilab-common 0.58.1, you'll get a bunch of bug fixes and a few new features. Among the hot stuff:

    • PyLint should now work with alternative python implementations such as Jython, and at least goes further with PyPy and IronPython (but those have not really been tested; please try it and provide feedback so we can improve their support)
    • the new ASTNG includes a description of dynamic code it is not able to understand. This is handled by a bitbucket hosted project described in another post.

    Many thanks to everyone who contributed to these releases, Torsten Marek / Boris Feld in particular (both sponsored by Google, by the way; Torsten as an employee and Boris as a GSoC student).

    Enjoy!


  • Introducing the pylint-brain project

    2012/07/18 by Sylvain Thenault

    Huum, along with the new PyLint release, it's time to introduce the PyLint-Brain project I've recently started.

    Despite its name, PyLint-Brain is actually a collection of extensions for ASTNG, with the goal of making ASTNG smarter (and this directly benefits PyLint) by describing stuff that is too dynamic to be understood automatically (such as functions in the hashlib module, defaultdict, etc.).

    The PyLint-Brain collection of extensions is developed outside of ASTNG itself and hosted on a bitbucket project to ease community involvement and to allow distinct development cycles. Basically, ASTNG will include the PyLint-Brain extensions, but you may use earlier/custom versions by tweaking your PYTHONPATH.

    Take a look at the code, it's fairly easy to contribute new descriptions, and help us make pylint smarter!


  • Debian science sprint and workshop at ESRF

    2012/06/22 by Julien Cristau

    From June 24th to June 26th, the European Synchrotron organises a workshop centered around Debian. On Monday, a number of talks about the use of Debian in scientific facilities will be featured. On Sunday and Tuesday, members of the Debian Science group will meet for a sprint focusing on the upcoming Debian 7.0 release.

    Among the speakers will be Stefano Zacchiroli, the current Debian project leader. Logilab will be present with Nicolas Chauvat at Monday's conference, and Julien Cristau at both the sprint and the conference.

    At the sprint we'll be discussing packaging of scientific libraries such as blas or MPI implementations, and working on polishing other scientific packages, such as python-related ones (including Salome on which we are currently working).


  • A Python dev day at La Cantine. Would like to have more PyCon?

    2012/06/01 by Damien Garaud
    http://www.logilab.org/file/98313?vid=download
    http://www.logilab.org/file/98312?vid=download

    We were at La Cantine on May 21st 2012 in Paris for the "PyCon.us Replay session".

    La Cantine is a coworking space where hackers, artists, students and so on can meet and work. It also organises some meetings and conferences about digital culture, computer science, ...

    On May 21st 2012, it was a dev day about Python. "Would you like to have more PyCon?" is a French wordplay: PyCon sounds like Picon, a French apéritif which traditionally accompanies beer. A good thing, because the meeting began at 6:30 PM! Presentations and demonstrations were about some Python projects presented at PyCon 2012 in Santa Clara (California) last March. The original pycon presentations are accessible on pyvideo.org.

    PDB Introduction

    By Gael Pasgrimaud (@gawel_).

    pdb is the well-known Python debugger. Gael showed us how to easily use this almost-mandatory tool when you develop in Python. As with the gdb debugger, you can stop the execution at a breakpoint, walk up the stack, print the values of local variables or temporarily modify some of them.

    The best way to define a breakpoint in your source code is to write:

    import pdb; pdb.set_trace()
    

    Insert that line where you would like pdb to stop. Then you can step through the code with the s, c or n commands. See help for more information. Here is the output of the help command in the pdb command-line interpreter:

    (Pdb) help
    
    Documented commands (type help <topic>):
    ========================================
    EOF    bt         cont      enable  jump  pp       run      unt
    a      c          continue  exit    l     q        s        until
    alias  cl         d         h       list  quit     step     up
    args   clear      debug     help    n     r        tbreak   w
    b      commands   disable   ignore  next  restart  u        whatis
    break  condition  down      j       p     return   unalias  where
    
    Miscellaneous help topics:
    ==========================
    exec  pdb
    

    It is also possible to invoke the module pdb when you run a Python script such as:

    $ python -m pdb my_script.py
    

    Pyramid

    http://www.logilab.org/file/98311?vid=download

    By Alexis Metereau (@ametaireau).

    Pyramid is an open source Python web framework from Pylons Project. It concentrates on providing fast, high-quality solutions to the fundamental problems of creating a web application:

    • the mapping of URLs to code;
    • templating;
    • security and serving static assets.
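
    As a flavour of that URL-to-code mapping, a minimal Pyramid application (a sketch along the lines of the project's quickstart) fits in a few lines:

    from wsgiref.simple_server import make_server
    from pyramid.config import Configurator
    from pyramid.response import Response

    def hello(request):
        # the matched route parameters are available in request.matchdict
        return Response('Hello %s!' % request.matchdict['name'])

    if __name__ == '__main__':
        config = Configurator()
        config.add_route('hello', '/hello/{name}')   # map a URL pattern...
        config.add_view(hello, route_name='hello')   # ...to code
        app = config.make_wsgi_app()
        make_server('0.0.0.0', 8080, app).serve_forever()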

    The framework lets you choose among different approaches according to the simplicity/feature tradeoff the programmer needs. Alexis, from the French team of Mozilla Services, works with it on a daily basis and seemed happy to use it. He told us that he uses Pyramid more as a Python web library than as a web framework.

    Circus

    http://www.logilab.org/file/98316?vid=download

    By Benoit Chesneau (@benoitc).

    Circus is a process watcher and runner. You can manage and monitor multiple processes using Python scripts (via an API) or a command-line interface.

    A very useful web application, called circushttpd, provides a way to monitor and manage Circus through the web. Circus uses zeromq, a well-known tool used at Logilab.

    matplotlib demo

    This session was a well-prepared and funny live demonstration by Julien Tayon of matplotlib, the Python 2D plotting library. He showed us some quick and easy stuff.

    For instance, how to plot a sine curve in a few lines of code with matplotlib and NumPy:

    import numpy as np
    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    # A simple sinus.
    ax.plot(np.sin(np.arange(-10., 10., 0.05)))
    fig.show()
    

    which gives:

    http://www.logilab.org/file/98315?vid=download

    You can make some fancier plots such as:

    # A sinus and a fancy Cardioid.
    a = np.arange(-5., 5., 0.1)
    ax_sin = fig.add_subplot(211)
    ax_sin.plot(np.sin(a), '^-r', lw=1.5)
    ax_sin.set_title("A sinus")
    
    # Cardioid.
    ax_cardio = fig.add_subplot(212)
    x = 0.5 * (2. * np.cos(a) - np.cos(2 * a))
    y = 0.5 * (2. * np.sin(a) - np.sin(2 * a))
    ax_cardio.plot(x, y, '-og')
    ax_cardio.grid()
    ax_cardio.set_xlabel(r"$\frac{1}{2} (2 \cos{t} - \cos{2t})$", fontsize=16)
    fig.show()
    

    where you can use LaTeX equations, as in the X label for instance.

    http://www.logilab.org/file/98314?vid=download

    The strength of this plotting library is its gallery of numerous examples, each with its piece of code. See the matplotlib gallery.

    Using Python for robotics

    Dimitri Merejkowsky reviewed how Python can be used to control and program Aldebaran's humanoid robot NAO.

    Wrap up

    Unfortunately, Olivier Grisel, who was supposed to give three interesting presentations, was not there. He was supposed to present:

    • A demo about injecting arbitrary code into, and monitoring, Python processes with Pyrasite.
    • Another demo about interactive data analysis with Pandas and the new IPython Notebook.
    • A wrap-up about distributed computation on clusters and related projects: IPython.parallel, picloud and Storm + Umbrella.

    Thanks to La Cantine and the different organisers for this friendly dev day.

