As part of an ongoing customer project, I've been learning about the Condor queue management system. (It is actually more than a batch queue manager, since it tackles the high-throughput computing problem, but my current project does not use Condor's full possibilities, and the choice was dictated by considerations outside the scope of this note.) The documentation is excellent, and the features of the product are really amazing (pity the project runs on Windows, and we cannot use 90% of these...).

To launch a job on a computer participating in the Condor farm, you just have to write a job file describing the executable, its arguments and the output files, then run condor_submit my_job_file and use condor_q to monitor the status of your job (queued, running...).
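To make this concrete, here is a minimal sketch in Python of how a program might generate such a job file and hand it to condor_submit. The function names, the file names (run_job.bat, input.txt, job.out...) and the choice of the vanilla universe are my own placeholders for illustration, not what my actual program uses.

```python
import subprocess


def render_job_file(executable, arguments, basename="job"):
    """Return the text of a minimal Condor job (submit description) file.

    All file names here are illustrative placeholders.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"output = {basename}.out",  # stdout of the job
        f"error = {basename}.err",   # stderr of the job
        f"log = {basename}.log",     # Condor's own event log
        "queue",
    ]
    return "\n".join(lines) + "\n"


def submit(job_file):
    """Run condor_submit on job_file (needs Condor installed and on the PATH)."""
    return subprocess.run(["condor_submit", str(job_file)], check=True)


# Example: build the text of a job file; writing it to disk and calling
# submit() is left to the caller.
example_job = render_job_file("run_job.bat", "input.txt")
```

Keeping the rendering separate from the submission makes it easy to inspect the generated file before anything is sent to the queue.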

My program generates Condor job files and submits them, and I spent hours yesterday trying to understand why they were all failing: the stderr file contained a message from Python complaining that it could not import site, after which the interpreter exited.

A point which was not clear in the documentation I read (but I probably overlooked it) is that the executable mentioned in the job file is supposed to be a local file on the submission host, which is copied to the computer running the job. In the jobs generated by my code, I was using sys.executable for the Executable field, and a path to the Python script I wanted to run in the Arguments field. This resulted in the Python interpreter alone being copied to the execution host, where it could not run because it was unable to find the standard files it needs at startup.

Once I figured this out, the fix was easy: I made my program write a batch script which launched the Python script and changed the job to run that script.
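A rough sketch of that workaround, under some assumptions: the wrapper is a Windows batch file, the Python interpreter is installed at the same path on the execution host, and the script path is reachable from there. The helper names and the run_job.bat file name are made up for this example.

```python
import sys
from pathlib import Path


def render_wrapper(python_exe, script_path):
    """Return the text of a batch file that runs script_path with python_exe.

    %* forwards whatever arguments Condor passes to the wrapper.
    """
    return f'@echo off\r\n"{python_exe}" "{script_path}" %*\r\n'


def write_wrapper(directory, script_path, python_exe=sys.executable):
    """Write the wrapper as run_job.bat (hypothetical name) and return its path."""
    wrapper = Path(directory) / "run_job.bat"
    # newline="" keeps the explicit \r\n line endings untouched
    with open(wrapper, "w", newline="") as handle:
        handle.write(render_wrapper(python_exe, script_path))
    return wrapper
```

The job file then names the batch file in its Executable field, so the small wrapper is what gets copied to the execution host, and the interpreter it launches is the one installed there.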

UPDATE: I'm told there is a Transfer_executable=False line I could have put in the job file to achieve the same thing.
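I have not tried it myself, but a job file using that setting might be generated along these lines. This is only a sketch: it assumes the interpreter exists at the same path on the execution host, that the paths contain no spaces (which would need quoting in the job file), and the function name and output file names are placeholders.

```python
import sys


def render_in_place_job(script_path, python_exe=sys.executable):
    """Return a job file that runs python_exe where it is already installed.

    transfer_executable = False asks Condor not to copy the executable
    to the execution host.
    """
    lines = [
        "universe = vanilla",
        f"executable = {python_exe}",
        f"arguments = {script_path}",
        "transfer_executable = False",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"
```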

(photo by gudi&cris licensed under CC-BY-ND)
