I have been hacking some Python code in a (vain) attempt to get a working algorithm for the Netflix prize and I have something that looks promising but is slow so I have been investigating optimizing it.
I got a dramatic speed-up after I noticed that I was searching a list using index when the list was ordered and so amenable to a binary search.
Using the bisect module for this looks like:
import bisect
pos = bisect.bisect_left(sortedList, val)
Binary searches run in logarithmic which is so much better than the index method of a list which is linear, starting at the left and looking until it either falls off the other end or find a match.
Timings using timeit:
First, using index..
>>> timeit.Timer('a.index(10)', setup='a=range(0,100)').timeit()
0.50195479393005371
>>> timeit.Timer('a.index(99)', setup='a=range(0,100)').timeit()
2.3854200839996338
Now, using bisect:
>>> timeit.Timer('bisect.bisect(a,10)', setup='import bisect; a=range(0,100)').timeit()
0.63758587837219238
#Uh-oh, slower than than index by a little bit
>>> timeit.Timer('bisect.bisect(a,90)', setup='import bisect; a=range(0,100)').timeit()
0.54854512214660645
# Faster
Smaller!
The array module allows the efficient storage of basic values. I have a list of scores for each Netflix subscriber that is probably internally represented as an int or even a long int (at least 4 ytes, maybe 8) but the range of values is 1-5. When I change from list to array('B'), the memory footprint falls from 2.2Gb to 900M.. now I can run this thing on a smaller machine!
No comments:
Post a Comment