I recently received from a customer a fairly large amount of data, organized in dozens of xls documents, each having dozens of sheets. I need to process this, and in order to ease the manipulation of the documents, I'd rather use standard text files in CSV (Comma Separated Values) format. Of course I didn't want to spend hours manually converting each sheet of each file to CSV, so I thought this would be a good time to get my hands in pyUno.
So I gazed over the documentation, found the Calc page on the OpenOffice.org wiki, read some sample code and got started.
The first few lines I wrote were (all imports are here, though some were actually added later).
import logging import sys import os.path as osp import os import time import uno def convert_spreadsheet(filename): pass def run(): for filename in sys.argv[1:]: convert_spreadsheet(filename) def configure_log(): logger = logging.getLogger('') logger.setLevel(logging.DEBUG) handler = logging.StreamHandler(sys.stdout) logger.addHandler(handler) format = "%(asctime)s %(levelname)-7s [%(name)s] %(message)s" handler.setFormatter(logging.Formatter(format)) if __name__ == '__main__': configure_log() run()
That was the easy part. In order to write the convert_spreadsheet function, I needed to open the document. And to do that, I need to start OpenOffice.org.
I started by copy-pasting some code I found in another project, which expected OpenOffice.org to be already started with the -accept option. I changed that code a bit, so that the function would launch soffice with the correct options if it could not contact an existing instance:
def _uno_init(_try_start=True): """init python-uno bridge infrastructure""" try: # Get the uno component context from the PyUNO runtime local_context = uno.getComponentContext() # Get the local Service Manager local_service_manager = local_context.ServiceManager # Create the UnoUrlResolver on the Python side. local_resolver = local_service_manager.createInstanceWithContext( "com.sun.star.bridge.UnoUrlResolver", local_context) # Connect to the running OpenOffice.org and get its context. # XXX make host/port configurable context = local_resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext") # Get the ServiceManager object service_manager = context.ServiceManager # Create the Desktop instance desktop = service_manager.createInstance("com.sun.star.frame.Desktop") return service_manager, desktop except Exception, exc: if exc.__class__.__name__.endswith('NoConnectException') and _try_start: logging.info('Trying to start UNO server') status = os.system('soffice -invisible -accept="socket,host=localhost,port=2002;urp;"') time.sleep(2) logging.info('status = %d', status) return _uno_init(False) else: logging.exception("UNO server not started, you should fix that now. " "`soffice \"-accept=socket,host=localhost,port=2002;urp;\"` " "or maybe `unoconv -l` might suffice") raise
Now the easy (sort of, once you start understanding the OOo API): to load a document, use desktop.loadComponentFromURL(). To get the sheets of a Calc document, use document.getSheets() (that one was easy...). To iterate over the sheets, I used a sample from the SpreadsheetCommon page on the OpenOffice.org wiki.
Exporting the CSV was a bit more tricky. The function to use is document.storeToURL(). There are two gotchas, however. The first one, is that we need to specify a filter, and to parameterize it correctly. The second one is that the CSV export filter is only able to export the active sheet, so we need to change the active sheet as we iterate over the sheets.
The parameters are passed in a tuple of PropertyValue uno structures, as the second argument to the storeToURL method. I wrote a helper function which accepts any named arguments and convert them to such a tuple:
def make_property_array(**kwargs): """convert the keyword arguments to a tuple of PropertyValue uno structures""" array =  for name, value in kwargs.iteritems(): prop = uno.createUnoStruct("com.sun.star.beans.PropertyValue") prop.Name = name prop.Value = value array.append(prop) return tuple(array)
Now, what do we put in that array? The answer is in the FilterOptions page of the wiki : The FilterName property is "Text - txt - csv (StarCalc)". We also need to configure the filter by using the FilterOptions property. This is a string of comma separated values
- ASCII code of field separator
- ASCII code of text delimiter
- character set, use 0 for "system character set", 76 seems to be UTF-8
- number of first line (1-based)
- Cell format codes for the different columns (optional)
I used the value "59,34,76,1", meaning I wanted semicolons for separators, and double quotes for text delimiters.
Here's the code:
def convert_spreadsheet(filename): """load a spreadsheet document, and convert all sheets to individual CSV files""" logging.info('processing %s', filename) url = "file://%s" % osp.abspath(filename) export_mask = make_export_mask(url) # initialize Uno, get a Desktop object service_manager, desktop = _uno_init() try: # load the Document document = desktop.loadComponentFromURL(url, "_blank", 0, ()) controller = document.getCurrentController() sheets = document.getSheets() logging.info('found %d sheets', sheets.getCount()) # iterate on all the spreadsheets in the document enumeration = sheets.createEnumeration() while enumeration.hasMoreElements(): sheet = enumeration.nextElement() name = sheet.getName() logging.info('current sheet name is %s', name) controller.setActiveSheet(sheet) outfilename = export_mask % name.replace(' ', '_') document.storeToURL(outfilename, make_property_array(FilterName="Text - txt - csv (StarCalc)", FilterOptions="59,34,76,1" )) finally: document.close(True) def make_export_mask(url): """convert the url of the input document to a mask for the written CSV file, with a substitution for the sheet name >>> make_export_mask('file:///home/foobar/somedoc.xls') 'file:///home/foobar/somedoc$%s.csv' """ components = url.split('.') components[-2] += '$%s' components[-1] = 'csv' return '.'.join(components)