http://wiki.services.openoffice.org/w/images/6/69/Py-uno_128.png

The Task

I recently received from a customer a fairly large amount of data, organized in dozens of xls documents, each having dozens of sheets. I need to process this, and in order to ease the manipulation of the documents, I'd rather use standard text files in CSV (Comma Separated Values) format. Of course I didn't want to spend hours manually converting each sheet of each file to CSV, so I thought this would be a good time to get my hands in pyUno.

So I gazed over the documentation, found the Calc page on the OpenOffice.org wiki, read some sample code and got started.

The easy bit

The first few lines I wrote were (all imports are here, though some were actually added later).

import logging
import sys
import os.path as osp
import os
import time

import uno

def convert_spreadsheet(filename):
    pass

def run():
    for filename in sys.argv[1:]:
        convert_spreadsheet(filename)

def configure_log():
    logger = logging.getLogger('')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    logger.addHandler(handler)
    format = "%(asctime)s %(levelname)-7s [%(name)s] %(message)s"
    handler.setFormatter(logging.Formatter(format))

if __name__ == '__main__':
    configure_log()
    run()

That was the easy part. In order to write the convert_spreadsheet function, I needed to open the document. And to do that, I need to start OpenOffice.org.

Starting OOo

http://www.squaregoldfish.co.uk/software/e17icons/oocalc.png

I started by copy-pasting some code I found in another project, which expected OpenOffice.org to be already started with the -accept option. I changed that code a bit, so that the function would launch soffice with the correct options if it could not contact an existing instance:

def _uno_init(_try_start=True):
    """init python-uno bridge infrastructure"""
    try:
        # Get the uno component context from the PyUNO runtime
        local_context = uno.getComponentContext()
        # Get the local Service Manager
        local_service_manager = local_context.ServiceManager
        # Create the UnoUrlResolver on the Python side.
        local_resolver = local_service_manager.createInstanceWithContext(
            "com.sun.star.bridge.UnoUrlResolver", local_context)
        # Connect to the running OpenOffice.org and get its context.
        # XXX make host/port configurable
        context = local_resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
        # Get the ServiceManager object
        service_manager = context.ServiceManager
        # Create the Desktop instance
        desktop = service_manager.createInstance("com.sun.star.frame.Desktop")
        return service_manager, desktop
    except Exception, exc:
        if exc.__class__.__name__.endswith('NoConnectException') and _try_start:
            logging.info('Trying to start UNO server')
            status = os.system('soffice -invisible -accept="socket,host=localhost,port=2002;urp;"')
            time.sleep(2)
            logging.info('status = %d', status)
            return _uno_init(False)
        else:
            logging.exception("UNO server not started, you should fix that now. "
                              "`soffice \"-accept=socket,host=localhost,port=2002;urp;\"` "
                              "or maybe `unoconv -l` might suffice")
            raise

Spreadsheet conversion

Now the easy (sort of, once you start understanding the OOo API): to load a document, use desktop.loadComponentFromURL(). To get the sheets of a Calc document, use document.getSheets() (that one was easy...). To iterate over the sheets, I used a sample from the SpreadsheetCommon page on the OpenOffice.org wiki.

Exporting the CSV was a bit more tricky. The function to use is document.storeToURL(). There are two gotchas, however. The first one, is that we need to specify a filter, and to parameterize it correctly. The second one is that the CSV export filter is only able to export the active sheet, so we need to change the active sheet as we iterate over the sheets.

Parametrizing the export filter

The parameters are passed in a tuple of PropertyValue uno structures, as the second argument to the storeToURL method. I wrote a helper function which accepts any named arguments and convert them to such a tuple:

def make_property_array(**kwargs):
    """convert the keyword arguments to a tuple of PropertyValue uno
    structures"""
    array = []
    for name, value in kwargs.iteritems():
        prop = uno.createUnoStruct("com.sun.star.beans.PropertyValue")
        prop.Name = name
        prop.Value = value
        array.append(prop)
    return tuple(array)

Now, what do we put in that array? The answer is in the FilterOptions page of the wiki : The FilterName property is "Text - txt - csv (StarCalc)". We also need to configure the filter by using the FilterOptions property. This is a string of comma separated values

  • ASCII code of field separator
  • ASCII code of text delimiter
  • character set, use 0 for "system character set", 76 seems to be UTF-8
  • number of first line (1-based)
  • Cell format codes for the different columns (optional)

I used the value "59,34,76,1", meaning I wanted semicolons for separators, and double quotes for text delimiters.

Here's the code:

def convert_spreadsheet(filename):
    """load a spreadsheet document, and convert all sheets to
    individual CSV files"""
    logging.info('processing %s', filename)
    url = "file://%s" % osp.abspath(filename)
    export_mask = make_export_mask(url)
    # initialize Uno, get a Desktop object
    service_manager, desktop = _uno_init()
    try:
        # load the Document
        document = desktop.loadComponentFromURL(url, "_blank", 0, ())
        controller = document.getCurrentController()
        sheets = document.getSheets()
        logging.info('found %d sheets', sheets.getCount())

        # iterate on all the spreadsheets in the document
        enumeration = sheets.createEnumeration()
        while enumeration.hasMoreElements():
            sheet = enumeration.nextElement()
            name = sheet.getName()
            logging.info('current sheet name is %s', name)
            controller.setActiveSheet(sheet)
            outfilename = export_mask % name.replace(' ', '_')
            document.storeToURL(outfilename,
                                make_property_array(FilterName="Text - txt - csv (StarCalc)",
                                                    FilterOptions="59,34,76,1" ))
    finally:
        document.close(True)

def make_export_mask(url):
    """convert the url of the input document to a mask for the written
    CSV file, with a substitution for the sheet name

    >>> make_export_mask('file:///home/foobar/somedoc.xls')
    'file:///home/foobar/somedoc$%s.csv'
    """

    components = url.split('.')
    components[-2] += '$%s'
    components[-1] = 'csv'
    return '.'.join(components)

blog entry of

Logilab.org - en