Wednesday, June 11, 2008

MOTD: pyprocessing

I have thought about trying multi-processing in Python before, but the heavy lifting of building the infrastructure put me off, until I found the pyprocessing library.

Neat. Off to play with it.
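pyprocessing was later adopted into the standard library as multiprocessing (PEP 371), which keeps essentially the same API; here is a minimal sketch of the process-pool style it offers, written against the stdlib module so it runs today:

```python
# Minimal sketch of the pyprocessing/multiprocessing process-pool API,
# using the stdlib `multiprocessing` (the descendant of pyprocessing).
from multiprocessing import Pool

def square(n):
    # Work done in a worker process; must be a top-level function
    # so it can be pickled and sent to the workers.
    return n * n

def parallel_squares(numbers, workers=4):
    # Spread the calls across a small pool of worker processes.
    with Pool(workers) as pool:
        return pool.map(square, numbers)

if __name__ == "__main__":
    print(parallel_squares(range(5)))  # [0, 1, 4, 9, 16]
```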

Tuesday, June 10, 2008

Lxml and a Python Optimization Anecdote

One of my co-workers wrote a Python script to check the contents of an XML file written for an OVAL application and remarked that it took nearly 48 hours to finish.

"Too slow", says I, "you must be doing something wrong".

"Make it faster then", he said.

My final version completed the same work in under 5 seconds, which is an improvement of around 40,000 times, or more than four orders of magnitude. I was thinking about putting together a talk on optimization for the next DFW Pythoneer meeting, but after looking around for information, it seems there is plenty on the web written by people wiser than me... so instead, here is a blog post.

Rules of optimization (applicable to all code, not just Python):
  1. Premature optimization is the root of all evil (D. Knuth)
  2. Make it right, then make it fast: there is not much point in getting a wrong answer quickly.
  3. Measure, don't guess. You might think you know what is fast and what is not, and you may be surprised.
  4. Rewrite the inner loop (Guido van Rossum). If you don't have any loops, you may not have much opportunity for improvement; otherwise, most of your CPU time is spent in the innermost loop, so fix that first.
  5. Rethink the algorithm. Sometimes the profiled and optimized code can still be made faster, but only by accepting that you need to rethink your approach. As an example, you could check whether a number is prime by dividing it by every smaller number, but that would be silly.
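The prime example from rule 5 can be sketched as follows: trial division up to the square root gives the same answers as the naive approach with far fewer divisions, because any factor above sqrt(n) pairs with a co-factor below it.

```python
import math

def is_prime_naive(n):
    # Divides by every smaller number: O(n) divisions.
    if n < 2:
        return False
    return all(n % d for d in range(2, n))

def is_prime_better(n):
    # Rethought algorithm: a factor larger than sqrt(n) implies a
    # co-factor smaller than sqrt(n), so test only up to sqrt(n).
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))
```

For n around a million, the naive version does roughly a million divisions per call while the rethought one does about a thousand; no amount of micro-tuning the first loop closes that gap.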
Measuring Python:
  1. timeit - a great script that ships with Python (look in the lib/python directory), useful for measuring small snippets.
  2. profile - (cProfile or hotshot) collects information about the number of function calls, the time per call, and the accumulated time.
Using timeit, I (re-)discovered that inserting items at the front of a list is more expensive than an append, and that inserting numeric items with a numeric key into a dictionary is cheaper than appending items to a list.
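The front-insert versus append measurement can be reproduced with a sketch like this (counts are arbitrary): front insertion shifts every existing element, so it is O(n) per call against the amortized O(1) of append.

```python
import timeit

# Compare inserting at the front of a list (O(n) per insert, since
# every existing element must shift) with appending at the end
# (amortized O(1)).
front = timeit.timeit("lst.insert(0, 1)", setup="lst = []", number=10000)
back = timeit.timeit("lst.append(1)", setup="lst = []", number=10000)

print("insert-at-front: %.4fs  append: %.4fs" % (front, back))
```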

When I used profile, I discovered that xpath queries in lxml are expensive, so I rewrote the lookups to take advantage of knowledge of the structure of the document.
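A minimal sketch of gathering a profiler report with cProfile and pstats (the function names here are placeholders, not the original script's):

```python
import cProfile
import io
import pstats

def slow_part():
    # Placeholder for the expensive work the profiler should expose.
    total = 0
    for i in range(100000):
        total += i * i
    return total

def run():
    return sum(slow_part() for _ in range(10))

# Collect call counts plus per-call and cumulative times, then print
# the hottest functions sorted by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report makes it obvious which function dominates; that is where the rewrite effort should go.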

Rewrite the inner loop (or move things out of loops):
  1. Prefer generators/iterators/map/reduce/filter over hand-written loops and take advantage of the C-level speed-up (and the reduced chance of error).
  2. Use opportunities for caching (need to look something up? Store the result) or pooling (re-use threads or database connections and amortize start-up costs).
  3. Prefer iterators over lookups: if we know we need to process most of the contents of a list, go ahead and iterate over the entire list rather than perform multiple lookups in random order (as was the case in the original problem).
The file that was being processed had several sections containing items that reference items in other sections - multiple many-to-one and one-to-many relationships that were initially discovered and resolved in the inner loop. Knowing that we would need almost every lookup, I created a cache and pre-populated it with one side of the lookup. This made a huge difference.
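The pre-populated cache can be sketched like this; the section and field names are hypothetical stand-ins, not taken from the original OVAL file.

```python
# Hypothetical data standing in for two cross-referencing sections.
definitions = [
    {"id": "def-1", "test_ref": "tst-1"},
    {"id": "def-2", "test_ref": "tst-2"},
    {"id": "def-3", "test_ref": "tst-1"},
]
tests = [
    {"id": "tst-1", "check": "file exists"},
    {"id": "tst-2", "check": "registry key"},
]

# Before: each definition triggered a fresh search through `tests`
# inside the inner loop. After: since almost every entry is needed,
# build the lookup table once, up front.
tests_by_id = {t["id"]: t for t in tests}

resolved = [(d["id"], tests_by_id[d["test_ref"]]["check"])
            for d in definitions]
print(resolved)
```

Each inner-loop resolution drops from a scan of the whole section to a single dictionary lookup.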

XML entities were discovered, and then a string ID was passed to the inner loop, where a query looked up the entity again and found its child elements. All of this was replaced with lxml iterators, resulting in a huge speed-up.
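The shape of that rewrite can be sketched with the stdlib's xml.etree.ElementTree, whose API lxml mirrors (the document here is a made-up stand-in): repeated per-ID queries become one pass with an iterator.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <item id="a"><child>1</child></item>
  <item id="b"><child>2</child></item>
</catalog>
""")

# Slow pattern: for each known ID, run a fresh query over the tree.
slow = [doc.find(".//item[@id='%s']/child" % i).text for i in ("a", "b")]

# Faster pattern: walk the document once with an iterator and pick up
# the children as we pass them.
fast = {item.get("id"): item.find("child").text
        for item in doc.iter("item")}

print(slow, fast)
```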

Go Psyco!
Psyco is a great optimizer, but it doesn't work for every problem. If most of the work is done by pure Python code, you should see a great improvement, but if you are using C extensions it may make no difference at all. I was using lxml, which is mostly written in C, and using Psyco slowed things down a little. Of course, you can pick which functions to speed up by looking at your profiler report. You do have a profiler report, don't you?
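Binding only the hot function, rather than calling psyco.full(), looks roughly like this; the import is guarded since Psyco is Python-2-only and may well be absent, in which case the code simply runs unaccelerated.

```python
def hot_loop(n):
    # Pure-Python arithmetic: the kind of code Psyco can speed up.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Compile just the function the profiler flagged, not the whole
# program. Psyco is optional here, so guard the import.
try:
    import psyco
    psyco.bind(hot_loop)
except ImportError:
    pass

print(hot_loop(1000))
```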

Prove It

Don't guess, test! Don't test with just one set of data; try scaling up. If you are expecting linear behaviour and you don't get it, it may be time to check things. Again, the output from profile will help you.
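A sketch of such a scaling check (the duplicate-finding functions are illustrative, not from the original script): double the input a couple of times and see whether the run time doubles too.

```python
import timeit

def has_duplicates_quadratic(items):
    # O(n^2): membership test against a growing list slice.
    for i, x in enumerate(items):
        if x in items[:i]:
            return True
    return False

def has_duplicates_linear(items):
    # O(n): membership test against a set.
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

# Scale the input and watch how run time grows; a super-linear jump
# when you expected linear behaviour is the cue to profile again.
for size in (500, 1000, 2000):
    data = list(range(size))  # worst case: no duplicates
    q = timeit.timeit(lambda: has_duplicates_quadratic(data), number=3)
    l = timeit.timeit(lambda: has_duplicates_linear(data), number=3)
    print("n=%5d  quadratic=%.4fs  linear=%.4fs" % (size, q, l))
```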

Links to Other Stuff:
Guido's essay
EffBot's example - discusses profiling
David Goodger's presentation on idiomatic Python