Sunday, December 21, 2008

Best Mac Downloads of 2008

Brad added this link for the best Mac downloads of 2008 to his del.icio.us account and it has several interesting entries, including Time Machine backups over the network, Dropbox (file synchronization over the net) and XBMC (a very nice looking cross-platform media center).

Writable NTFS on Mac

I use an external HD to carry large files between home and office, but the Mac (or at least Leopard) doesn't support NTFS as a writable filesystem, so the arrangement is only useful when all the new data is at the office and nothing needs to be changed at home. Obviously this is not ideal.

MacFUSE allows the addition of new filesystems to the Mac, including NTFS-3g which originated on Linux (I believe). Installation is quick and simple (install MacFUSE, then install NTFS-3g) and after a reboot, NTFS partitions are writable without any configuration required.

A quick search turns up a number of other filesystems that could also be used through MacFUSE, including two that will encrypt a partition and one that will allow the use of a Gmail account as storage.

Wednesday, December 10, 2008

Illegal Use of Linux in the Classroom

A teacher in Austin, TX did not like what she saw when she found a student in her class experimenting with a liveCD from HeliOS.. so she confiscated the CD, reprimanded the student and then wrote a letter to the HeliOS project explaining that he was potentially liable to a civil suit and that the kids should be using Microsoft products, which they (MS) would be happy to supply for free..

Wow.

Breathlessly awaiting the followup after the teacher and the parent meet in the superintendent's office.

Wednesday, December 3, 2008

More Uses For An OLPC

Video player for toddler.

Meghan likes watching DVDs in the minivan and she likes banging on the keys of the OLPC, so I combined the two. I used Handbrake to rip one of her favorite DVDs, then loaded mplayer onto the OLPC and played back the movie from a thumb drive.

VLC and mplayer had similar problems with freezing before announcing the file was no longer available. I tried turning off power management with no improvement. It wasn't until I disabled the radio that playback was successful .. then I found the defect mentioned earlier.

Perl is Dead! Long Live Perl!

This somewhat risible survey reveals that Perl has almost fallen out of the top 10 programming languages in use. Given the many different ways that statistics can be mashed together and the loose method of collecting the numbers, I think it is unlikely that they can claim 3 decimal places of accuracy for any of this.

On the other hand, if you use activity as a barometer of interest in a language, and perhaps as a guide to what to brush up on to stay/get employed, take a look at the stats on Ohloh. You can pick a basket of languages, compare check-ins and TLOC, and decide that ActionScript is not about to overtake C++ any time soon.

This page lets you look at a range of stats for any language from their long list, while this one will let you compare a wide range of languages with a pretty chart: take a look at this comparison showing that the volume of Python check-ins is growing faster than that of Perl and Ruby together. Or maybe it is showing the rate of Python projects being added to Ohloh? Who can say?

Tuesday, December 2, 2008

My First Defect (for the OLPC)

I noticed that turning the radio off and then back on from the OLPC control panel resulted in being endlessly prompted for my WEP password.. so I investigated a little more and then raised this defect .. now I will have to see if there is something similar in Fedora.

Sunday, November 30, 2008

Other Uses for an OLPC

Baby webcam monitor..
Downloaded Motion and set it up, put the laptop on some furniture and left my daughter sleeping while I watched her in a browser window from another room. It worked ok.

Wednesday, November 12, 2008

New OLPC Spin

Funny thing, coincidences.. I was talking to a co-worker about the OLPC and suggesting he avoid VMware images or ISOs and instead use a Fedora system and add Sugar from RPM, because the ISOs that I had seen had become out-of-date or unavailable.. and then we found that Sebastian Dziallas reports there is a new spin of Sugar available from Fedora. This is a customized version of Fedora that can be run as a live CD or even from a USB stick.

Downloading now..

Thursday, November 6, 2008

Ruby Rocks!

About two weeks ago, I was at Nerdbooks (check them out at http://www.nerdbooks.com) and I bought a couple of great books: The Art of Debugging and Programming Ruby.

The Art of Debugging is by Matloff and Salzman from No Starch Press. I picked this up off the shelf and found a couple of things I didn't know about using gdb inside a minute, so I went ahead and got it. The plan is to have it around for those late-night sessions when using gdb on some weird version of Unix (like, say, AIX) is keeping me at the office. It's also proven a good book to share with the team..

The other book is Programming Ruby by Dave Thomas from Pragmatic Bookshelf and it is gathering dust. I only got this because I may have to work with some Ruby programmers at the office and I want to find out more about the language. It seemed like a good idea at the time.. After all, Ruby rocks! Right?

Friday, October 17, 2008

Nerd Score

I just tried the Nerd Quiz and thought I was doing so-so, kind of average (missed some of the questions) so I was surprised to get this:


I am nerdier than 96% of all people.

Monday, August 11, 2008

Faster! Faster! And Smaller!

I have been hacking some Python code in a (vain) attempt to get a working algorithm for the Netflix prize and I have something that looks promising but is slow, so I have been investigating optimizing it.

I got a dramatic speed-up after I noticed that I was searching a list using index when the list was ordered and so amenable to a binary search.

Using the bisect module for this looks like:
import bisect
pos = bisect.bisect_left(sortedList, val)  # leftmost position where val could be inserted (its index, if present)

Binary searches run in logarithmic time, which is so much better than the index method of a list, which is linear: it starts at the left and looks until it either falls off the other end or finds a match.

Timings using timeit:

First, using index..
>>> timeit.Timer('a.index(10)', setup='a=range(0,100)').timeit()
0.50195479393005371

>>> timeit.Timer('a.index(99)', setup='a=range(0,100)').timeit()
2.3854200839996338

Now, using bisect:
>>> timeit.Timer('bisect.bisect(a,10)', setup='import bisect; a=range(0,100)').timeit()
0.63758587837219238
# Uh-oh, slower than index by a little bit

>>> timeit.Timer('bisect.bisect(a,90)', setup='import bisect; a=range(0,100)').timeit()
0.54854512214660645
# Faster



Smaller!
The array module allows the efficient storage of basic values. I have a list of scores for each Netflix subscriber that is probably internally represented as an int or even a long int (at least 4 bytes, maybe 8) but the range of values is 1-5. When I change from a list to array('B'), the memory footprint falls from 2.2 GB to 900 MB.. now I can run this thing on a smaller machine!
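
The change itself is tiny (the names below are illustrative, not my actual code):

from array import array

scores_list = [5, 3, 4, 1, 5]          # a plain list stores a full object per rating
scores = array('B', scores_list)       # array('B') packs each 1-5 rating into one unsigned byte

scores.append(2)                       # same append/extend interface as a list
print scores[0], len(scores)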

Wednesday, June 11, 2008

MOTD: pyprocessing

I have thought about trying multi-processing in Python before but the heavy lifting of building the infrastructure always put me off.. until I found this:

http://pyprocessing.berlios.de/

Neat. Off to play with it.
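
The first thing I will try is something small like this (a sketch from memory — the package installs its module as processing, if I have that right):

from processing import Process, Queue

def worker(q):
    # do some work in a separate process and hand the result back
    q.put(sum(xrange(1000000)))

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print q.get()
    p.join()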

Tuesday, June 10, 2008

Lxml and a Python Optimization Anecdote

One of my co-workers wrote a Python script to check the contents of an XML file written for an OVAL application and remarked that it took nearly 48 hours to finish.

"Too slow", says I, "you must be doing something wrong".

"Make it faster then", he said.

My final version completed the same work in under 5 seconds, which is an improvement of roughly 35,000 times, or more than 4 orders of magnitude. I was thinking about putting together a talk on optimization for the next DFW Pythoneers meeting but, after looking around for information, it seems there is plenty on the web written by wiser people than me.. so instead here is a blog post.

Rules of optimization (applicable to all code, not just Python):
  1. Premature optimization is the root of all evil (D. Knuth)
  2. Make it right then make it fast: not much point in getting a wrong answer quickly.
  3. Measure, don't guess. You might think you know what is fast and what is not .. and you may be surprised.
  4. Rewrite the inner loop (Guido van Rossum). If you don't have any loops then you may not have any opportunity for improvement but otherwise most of your cpu time is spent in the innermost loop so fix that first.
  5. Rethink the algorithm. Sometimes, the profiled and optimized code can still be made faster, but only by accepting that you need to rethink your approach. As an example, you could check if a number is prime by dividing it by every number that is smaller than it, but that would be silly (see the sketch just after this list).
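
For the prime-checking example, the difference looks something like this (a toy sketch, not code from the project):

import math

def is_prime_naive(n):
    # divide by every smaller number: O(n) trial divisions
    return n > 1 and all(n % d for d in xrange(2, n))

def is_prime_sqrt(n):
    # any divisor larger than sqrt(n) implies one smaller, so stop there: O(sqrt(n))
    return n > 1 and all(n % d for d in xrange(2, int(math.sqrt(n)) + 1))
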
Measuring Python:
  1. timeit.py - a great script that ships with Python (look in the lib/python directory); useful for measuring small snippets.
  2. profile (or cProfile or hotshot) - collects information about the number of function calls, the time per call and the accumulated time (a typical run is sketched after this list).
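
A typical run of the profiler looks something like this (main here is a stand-in for whatever your entry point is):

import cProfile
import pstats

def main():
    # stand-in for the real work being profiled
    return sum(i * i for i in xrange(100000))

cProfile.run('main()', 'profile.out')            # run it under the profiler and save the stats
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)   # show the ten most expensive calls
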
Using timeit, I (re-)discovered that inserting items into the front of a list is more expensive than an append and that inserting numeric items with a numeric key into a dictionary is cheaper than appending items to a list.
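
The front-insert versus append comparison can be reproduced with something like this (the iteration count is arbitrary):

import timeit

# inserting at the front shifts every existing element, so it is O(n) per call;
# append is amortized O(1)
print timeit.Timer('a.insert(0, 1)', setup='a = []').timeit(10000)
print timeit.Timer('a.append(1)', setup='a = []').timeit(10000)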

When I used profile, I discovered that the xpath queries in lxml are expensive .. so I rewrote the lookups to take advantage of knowledge of the structure of the document.
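
The flavour of the change was roughly this (the element names here are invented for illustration, not taken from the real OVAL document):

from lxml import etree

# a toy document standing in for the real file
root = etree.XML('<root><tests><test id="t1"/><test id="t2"/></tests></root>')

# slow: an xpath query per lookup
# match = root.xpath('//test[@id="%s"]' % wanted_id)

# faster: walk the known structure once and index the elements by id
tests_by_id = {}
for test in root.find('tests'):
    tests_by_id[test.get('id')] = test

print tests_by_id['t2'].tag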

Rewrite the inner loop (or move things out of loops):
  1. Prefer generators/iterators/map/reduce/filter over hand-written loops and take advantage of the C-level speed-up (and reduced chance of error); see the sketch after this list.
  2. Use opportunities for caching (need to look something up? Store the result) or pooling (re-use threads or database connections and amortize start-up costs).
  3. Prefer iterators over lookups: if we know we need to process most of the contents of a list then go ahead and iterate over the entire list rather than perform multiple lookups in random order (as was the case in the original problem).
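
A toy illustration of the first point:

data = range(1000)

# hand-written loop
squares = []
for x in data:
    squares.append(x * x)

# the same work pushed down to C: a list comprehension plus sum()
squares = [x * x for x in data]
total = sum(x * x for x in data)
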
The file that was being processed had several sections containing items that reference items in other sections - multiple many-to-one-to-many relationships that were initially discovered and resolved in the inner loop. Knowing that we would need almost every lookup, I created a cache and pre-populated it with one side of the lookup. This made a huge difference.
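
In outline, the cache amounted to something like this (the names and data are invented):

# toy stand-ins for the two sides of the relationship
definitions = [{'id': 'd1', 'title': 'first'}, {'id': 'd2', 'title': 'second'}]
references = [{'definition_id': 'd2'}, {'definition_id': 'd1'}, {'definition_id': 'd2'}]

# build the lookup table once, before the inner loop ...
definitions_by_id = dict((d['id'], d) for d in definitions)

# ... so each resolution inside the loop is a cheap dictionary hit
for ref in references:
    definition = definitions_by_id[ref['definition_id']]
    print definition['title']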

XML entities were discovered and then a string ID was passed to the inner loop where a query looked again for the entity and found child elements. All this was replaced with lxml iterators resulting in a huge speed-up.

Go Psyco!
Psyco is a great optimizer but doesn't work for every problem. If most of the work is done by pure Python code then you should get a great improvement, but if you are using C extensions then it may not make any difference. I was using lxml, which is mostly written in C, and using Psyco slowed things down a little. Of course, you can pick which functions to speed up by looking at your profiler report. You do have a profiler report, don't you?
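
Binding just the hot spots looks something like this (hot_loop is a made-up example function):

import psyco

def hot_loop(n):
    total = 0
    for i in xrange(n):
        total += i * i
    return total

psyco.bind(hot_loop)      # compile only this function rather than psyco.full()
print hot_loop(1000000)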

Prove It

Don't guess, test! Don't test with one set of data, try scaling up. If you are expecting linear behaviour and you don't get it then it may be time to check things. Again the output from profile will help you.

Links to Other Stuff:
Guido's essay
EffBot's example - discusses profiling
David Goodger's presentation on idiomatic Python

Tuesday, May 20, 2008

Netflix Submission in 2 lines of awk

Netflix are offering a prize if you can develop an algorithm that improves the accuracy of their suggestion system by 10%.. http://www.netflixprize.com

Some people have developed elaborate schemes based around cross-correlation and clustering etc. I submitted my first results using two lines of awk:

NF==2 {print $0}
NF==1 {print 3.4}

which just scores all movies at 3.4 and gives an RMSE of 1.16.

Using the average for each movie instead of a single number gives an RMSE of 1.0533.
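
The per-movie averages only need a few lines of Python (the file name and layout here are simplified stand-ins for the real training data):

from collections import defaultdict

totals = defaultdict(lambda: [0, 0])            # movie_id -> [sum of ratings, count]
for line in open('ratings.csv'):                # stand-in: movie_id,user_id,rating per line
    movie_id, user_id, rating = line.strip().split(',')[:3]
    totals[movie_id][0] += int(rating)
    totals[movie_id][1] += 1

averages = dict((m, float(s) / c) for m, (s, c) in totals.iteritems())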

Of course, this has all been done.. and I could have just surfed and found it.

There is also a Python library called pyflix which lets you get past building infrastructure and on to the fun of the algorithm.

Thursday, January 17, 2008

Nokia Battery

I have a Nokia 6682 phone and it's great, not just because it can run Python, but all of a sudden the battery seems to last not very long at all. So I need a new battery, a BL-5C, and that battery is available on-line for 5 dollars, but is that because they are past their shelf-life? Never mind that this particular model was recalled for a slight problem with catching fire.

Who can say?