Monday, August 11, 2008

Faster! Faster! And Smaller!

I have been hacking some Python code in a (vain) attempt to get a working algorithm for the Netflix prize and I have something that looks promising but is slow so I have been investigating optimizing it.

I got a dramatic speed-up after I noticed that I was searching a list using index when the list was ordered and so amenable to a binary search.

Using the bisect module for this looks like:
import bisect
pos = bisect.bisect_left(sortedList, val)

Binary searches run in logarithmic which is so much better than the index method of a list which is linear, starting at the left and looking until it either falls off the other end or find a match.

Timings using timeit:

First, using index..
>>> timeit.Timer('a.index(10)', setup='a=range(0,100)').timeit()

>>> timeit.Timer('a.index(99)', setup='a=range(0,100)').timeit()

Now, using bisect:
>>> timeit.Timer('bisect.bisect(a,10)', setup='import bisect; a=range(0,100)').timeit()
#Uh-oh, slower than than index by a little bit

>>> timeit.Timer('bisect.bisect(a,90)', setup='import bisect; a=range(0,100)').timeit()
# Faster

The array module allows the efficient storage of basic values. I have a list of scores for each Netflix subscriber that is probably internally represented as an int or even a long int (at least 4 ytes, maybe 8) but the range of values is 1-5. When I change from list to array('B'), the memory footprint falls from 2.2Gb to 900M.. now I can run this thing on a smaller machine!