"Gestalt is a library released by MIX Online Labs that allows you to write Ruby, Python & XAML code in your (X)HTML pages. It enables you to build richer and more powerful web applications by marrying the benefits of expressive languages, modern compilers, AJAX & RIAs with the write » save » refresh development model of the web."
Python on a web page! In Silverlight .. which means it is time to learn Mono and .Net and all that stuff.
Update: I did nothing to learn any of that stuff then one day I walked into the office and they said "oh, we need you to work on a C# project.. on a Windows box"
"When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck."
-- Alex Martelli
Tuesday, July 21, 2009
Sunday, June 7, 2009
Unicode Gotcha
What will this give you?
>>> astring = None
>>> print unicode(astring)
If you said None, you will be surprised to find that unicode(None) is actually the string u'None', which is not at all the same thing. Really messed up my day..
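A minimal guard for this (an illustration of what I should have done, not code from the original script) is to only convert when there is actually something to convert:
>>> astring = None
>>> converted = unicode(astring) if astring is not None else None
>>> print repr(converted)
None
>>> print repr(unicode(astring))
u'None'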
Sunday, April 12, 2009
Regex Exploration
The Question
I was asked a question on the way out of the office at the end of the week: what is the difference between a regex match and a regex search? It didn't seem like a difficult question but it stumped me for a little while. Both a search and a match should use a regular expression and evaluate it against a text. Perhaps a regex search traverses the input string and returns one result at a time and a regex match does.. hm.. or maybe a regex match is used for validation and a regex search is more exploratory?
The individuals asking the question were looking at the Python regex documentation (or rather AMK's very excellent regex howto) which says "match() function only checks if the RE matches at the beginning of the string while search() will scan forward through the string for a match", which is interesting because it means that the Python regex match effectively inserts a leading anchor before attempting evaluation. Both functions return a MatchObject.
Wait a minute, you may say, you have the documentation to answer the original question! Not exactly, because this question was not inspired by an abstract search for knowledge but by a performance problem. The other interesting part was that the Python documentation was being studied but the code is in C++, with no Python involved.
If you need a regex library for C++ then a good place to look is at the Boost home page which is “...one of the most highly regarded and expertly designed C++ library projects in the world” and has libraries for all sorts of things like graph theory, linear algebra or interprocess communication. So what does the Boost Regex library say about match and search? How about this: "the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search". Of note is the fact that the return type of match is boolean.
Did you get that? The Python regex match/search answer is a matter of the start position of the match and the Boost C++ answer is whether the input sequence is consumed.
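To make the Python half of that concrete, here is a quick interpreter session (my own illustration, not from the original conversation) showing the implicit leading anchor of match() and, for Boost-style whole-string matching, an explicit end anchor:
>>> import re
>>> text = 'abc yellow blue'
>>> print re.match('yellow', text)           # match only tries the start of the string
None
>>> print re.search('yellow', text).group()  # search scans forward through the string
yellow
>>> print re.match('abc', text).group()      # equivalent to re.search('^abc', text)
abc
>>> print re.match('abc\Z', text)            # anchor the end too to require the whole input, Boost regex_match style
None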
Performance Evaluation
Another alternative for a regex library for the C++ developer is the Perl Compatible Regular Expression library (see the website) which offers a single execute function instead of a search or match function, with a flag parameter modifying behaviour. One suggestion I heard was that the PCRE library performs faster than the Boost library because "maybe the Boost library sucks?" but as Carl Sagan said "Extraordinary claims require extraordinary evidence" or, in other words, how do you figure?
I wrote a small app that generates multiple random strings by appending together words randomly picked from a small dictionary and then evaluates a number of regular expressions against them. Here are the results (slightly re-arranged):
Pattern: yellowgreensapphireruby.*bluered
- Boost regex search: 12.50 s
- Boost regex search (continuous flag): 0.07 s
- Boost regex search (any flag): 12.51 s
- Boost regex match: 0.05 s
- PCRE regex: 1.47 s
Pattern: ^yellowgreensapphireruby.*bluered
- Boost regex search: 6.59 s
- Boost regex search (continuous flag): 0.07 s
- Boost regex search (any flag): 6.57 s
- Boost regex match: 0.05 s
- PCRE regex: 0.02 s
Pattern: yellowgreensapphireruby.*bluered$
- Boost regex search: 14.16 s
- Boost regex search (continuous flag): 0.07 s
- Boost regex search (any flag): 14.15 s
- Boost regex match: 0.05 s
- PCRE regex: 2.01 s
Pattern: ^yellowgreensapphireruby.*bluered$
- Boost regex search: 6.59 s
- Boost regex search (continuous flag): 0.07 s
- Boost regex search (any flag): 6.59 s
- Boost regex match: 0.05 s
- PCRE regex: 0.02 s
Saturday, April 4, 2009
Web2py rocks more than the alternatives?
I need to find some spare time, perhaps just some of that discretionary sleeping time, to take a look at web2py. I have a small project in mind - updating my friends website to have a simple calendar and appointment system - but I have been getting slammed at the office porting an application to AIX (yes, it's still in use) and busy at home with the twins.
Wednesday, December 3, 2008
Perl is Dead! Long Live Perl!
This somewhat risible survey reveals that Perl has almost fallen out of the top 10 programming languages in use. Given the many different ways that statistics can be mashed together and the loose method of collecting the numbers, I think it is unlikely that they can claim 3 decimal places of accuracy for any of this.
On the other hand, using activity as a barometer of interest in a language and perhaps of what to brush up on to stay/get employed, take a look at the stats on Ohloh. You can pick a basket of languages and compare check-ins and TLOC and decide that ActionScript is not about to overtake C++ any time soon.
This page lets you look at a range of stats for a language off their long list while this one will let you compare a wide range of languages with a pretty chart: take a look at this comparison showing that the volume of Python check-ins is growing faster than Perl and Ruby together. Or maybe it is showing the rate of Python projects being added to Ohloh? Who can say?
Monday, August 11, 2008
Faster! Faster! And Smaller!
I have been hacking some Python code in a (vain) attempt to get a working algorithm for the Netflix prize and I have something that looks promising but is slow so I have been investigating optimizing it.
I got a dramatic speed-up after I noticed that I was searching a list using index when the list was ordered and so amenable to a binary search.
Using the bisect module for this looks like:
import bisect
# bisect_left returns the position where val would be inserted to keep
# sortedList ordered; if sortedList[pos] == val then val is present
pos = bisect.bisect_left(sortedList, val)
Binary searches run in logarithmic time, which is so much better than the index method of a list, which is linear: it starts at the left and looks until it either falls off the other end or finds a match.
Timings using timeit:
First, using index..
>>> timeit.Timer('a.index(10)', setup='a=range(0,100)').timeit()
0.50195479393005371
>>> timeit.Timer('a.index(99)', setup='a=range(0,100)').timeit()
2.3854200839996338
Now, using bisect:
>>> timeit.Timer('bisect.bisect(a,10)', setup='import bisect; a=range(0,100)').timeit()
0.63758587837219238
# Uh-oh, slower than index by a little bit
>>> timeit.Timer('bisect.bisect(a,90)', setup='import bisect; a=range(0,100)').timeit()
0.54854512214660645
# Faster
Smaller!
The array module allows the efficient storage of basic values. I have a list of scores for each Netflix subscriber that is probably internally represented as an int or even a long int (at least 4 bytes, maybe 8) but the range of values is 1-5. When I change from list to array('B'), the memory footprint falls from 2.2 GB to 900 MB.. now I can run this thing on a smaller machine!
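A quick interactive check of the idea (an illustration, not the Netflix code itself) shows how tightly array('B') packs the values:
>>> from array import array
>>> scores = [3] * 1000000              # a million ratings held as Python int objects
>>> packed = array('B', scores)         # 'B' = unsigned byte, plenty for values 1-5
>>> packed.itemsize
1
>>> packed.buffer_info()[1] * packed.itemsize   # total payload: one byte per score
1000000
>>> packed[0]
3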
Wednesday, June 11, 2008
MOTD: pyprocessing
I have thought about trying multi-processing in Python before but the heavy lifting of building the infrastructure put me off, until I found this:
http://pyprocessing.berlios.de/
Neat. Off to play with it.
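I have not written anything with it yet, but the style of API involved looks roughly like the sketch below. It uses the same API that was adopted into the standard library as the multiprocessing module, so treat the import line as an approximation of pyprocessing's own name:
from multiprocessing import Pool

def square(x):
    # any CPU-bound function; each call may run in a separate worker process
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)            # four worker processes
    print pool.map(square, range(10))   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]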
Tuesday, June 10, 2008
Lxml and a Python Optimization Anecdote
One of my co-workers wrote a Python script to check the contents of an xml file written for an OVAL application and remarked it took nearly 48 hours to finish.
"Too slow", says I, "you must be doing something wrong".
"Make it faster then", he said.
My final version completed the same work in under 5 seconds which is an improvement of around 40000 times or 4 orders of magnitude. I was thinking about putting together a talk on optimization for the next DFW Pythoneer meeting but after looking around for information, it seems there is plenty on the web written by wiser people than me.. so instead here is a blog post.
Rules of optimization (applicable to all code, not just Python):
When I used profile, I discovered that xpath queries of lxml are expensive .. so I rewrote lookups to take advantage of knowledge of the structure of the document.
Rewrite the inner loop (or move things out of loops):
XML entities were discovered and then a string ID was passed to the inner loop where a query looked again for the entity and found child elements. All this was replaced with lxml iterators resulting in a huge speed-up.
Go Psyco!
Psyco is a great optimizer but doesn't work for every problem. If most of the work is done by pure Python code then you should get a great improvement but if you are using C extensions then it may not make any difference. I was using lxml which is mostly written in C and using Psyco slowed things down a little. Of course, you can pick which functions to speed up by looking at your profiler report. You do have a profiler report, don't you?
Prove It
Don't guess, test! Don't test with one set of data, try scaling up. If you are expecting linear behaviour and you don't get it then it may be time to check things. Again the output from profile will help you.
Links to Other Stuff:
Guido's essay
EffBot's example - discusses profiling
David Goodger's presentation on idomatic python
"Too slow", says I, "you must be doing something wrong".
"Make it faster then", he said.
My final version completed the same work in under 5 seconds, an improvement of roughly 35,000 times, or more than 4 orders of magnitude. I was thinking about putting together a talk on optimization for the next DFW Pythoneers meeting but after looking around for information, it seems there is plenty on the web written by wiser people than me.. so instead here is a blog post.
Rules of optimization (applicable to all code, not just Python):
- Premature optimization is the root of all evil (D. Knuth)
- Make it right then make it fast: not much point in getting a wrong answer quickly.
- Measure, don't guess. You might think you know what is fast and what is not .. and you may be surprised.
- Rewrite the inner loop (Guido van Rossum). If you don't have any loops then you may not have any opportunity for improvement but otherwise most of your cpu time is spent in the innermost loop so fix that first.
- Rethink the algorithm. Sometimes, the profiled and optimized code can still be made faster but only by accepting that you need to rethink your approach. As an example, you could check if a number is prime by dividing it by every number that is smaller than it but that would be silly.
- timeit.py - a great script that ships with python, look in the lib/python directory, useful to measure small snippets.
- profile - (cProfile or hotshot) collects information about the number of function calls and the time per call and accumulated time.
When I used profile, I discovered that xpath queries of lxml are expensive .. so I rewrote lookups to take advantage of knowledge of the structure of the document.
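The profiling step itself is only a couple of lines; here is the general shape of it (a generic sketch rather than the actual OVAL checker, and check_file is a stand-in for the real entry point):
import cProfile
import pstats

# profile the top-level call and write the raw stats to a file
cProfile.run('check_file("oval-definitions.xml")', 'check.prof')

# then print the ten most expensive functions by cumulative time
stats = pstats.Stats('check.prof')
stats.sort_stats('cumulative').print_stats(10)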
Rewrite the inner loop (or move things out of loops):
- Prefer generators/iterators/map/reduce/filter over hand-written loops and take advantage of the C language level speed-up (and reduced chances for error)
- Use opportunities for caching (need to look something up? Store the result) or pooling (re-use threads or database connections and amortize start-up costs).
- Prefer iterators over lookups: if we know we need to process most of the contents of a list then go ahead and iterate over the entire list rather than perform multiple lookups in random order (as was the case in the original problem)
XML entities were discovered and then a string ID was passed to the inner loop where a query looked again for the entity and found child elements. All this was replaced with lxml iterators resulting in a huge speed-up.
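The shape of that change was something like the following sketch (a reconstruction of the idea rather than the real code; the file and tag names are made up):
from lxml import etree

root = etree.parse('oval-definitions.xml').getroot()

# Before: one xpath query per entity ID, each of which re-scans the whole tree
# for entity_id in ids:
#     entity = root.xpath('//entity[@id="%s"]' % entity_id)[0]
#     children = list(entity)

# After: a single pass with the iterator, visiting each entity exactly once
for entity in root.iter('entity'):
    for child in entity:
        pass  # process the child elements here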
Go Psyco!
Psyco is a great optimizer but doesn't work for every problem. If most of the work is done by pure Python code then you should get a great improvement but if you are using C extensions then it may not make any difference. I was using lxml which is mostly written in C and using Psyco slowed things down a little. Of course, you can pick which functions to speed up by looking at your profiler report. You do have a profiler report, don't you?
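Trying it only takes a couple of lines; a typical pattern is to bind just the hot pure-Python functions that the profiler pointed at rather than the whole program (the function names below are placeholders):
try:
    import psyco
    # compile only the pure-Python hot spots identified by the profiler
    psyco.bind(check_entities)   # placeholder name
    psyco.bind(parse_results)    # placeholder name
except ImportError:
    pass  # psyco is optional; fall back to plain CPython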
Prove It
Don't guess, test! Don't test with one set of data, try scaling up. If you are expecting linear behaviour and you don't get it then it may be time to check things. Again the output from profile will help you.
Links to Other Stuff:
Guido's essay
EffBot's example - discusses profiling
David Goodger's presentation on idiomatic Python
Tuesday, May 20, 2008
Netflix Submission in 2 lines of awk
Netflix are offering a prize if you can develop an algorithm that improves the accuracy of their suggestion system by 10%.. http://www.netflixprize.com
Some people have developed elaborate schemes based around cross-correlation and clustering etc. I submitted my first results using two lines of awk:
NF==2 {print $0}
NF==1 {print 3.4}
which just scores all movies at 3.4 and gives an RMSE of 1.16.
Using the average for each movie instead of a single number gives an RMSE of 1.0533.
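The per-movie average pass is simple enough to sketch in a few lines of Python (an illustration of the idea, not the exact script; it assumes the training data has already been read into a dict mapping movie id to a list of ratings):
# movie_ratings is assumed to map movie_id -> list of ratings (1-5)
movie_averages = {}
for movie_id, ratings in movie_ratings.iteritems():
    movie_averages[movie_id] = float(sum(ratings)) / len(ratings)

def predict(movie_id):
    # fall back to the global constant when a movie has no ratings
    return movie_averages.get(movie_id, 3.4)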
Of course, this has all been done.. and I could have just surfed and found it.
There is also a python library called pyflix which lets you get past building infrastructure and onto the fun of the algorithm.
Saturday, July 28, 2007
Rectangle classes in python
This will be useful for mapping applications.. assuming it works well. I keep hoping to find time to build something with Google maps and Python people data. Maybe this will help.
Sunday, July 15, 2007
If it looks like a Duck, it may be a Parrot..
Last Saturday (July 15), the DFW Pythoneers gathered together to hear wise words from Patrick Michaud. "Lo", he said, "it is not an ex-parrot (though it may be pining for the fjords)" and we listened in solemnity.
PM gave us another great talk and explained why Perl 6 is important/interesting/intriguing to Python people and how we can get pizza paid for by the Perl foundation while looking at Python.
Parrot is the VM for the upcoming Perl 6. It also happens to be capable of 'running' Python (and a bunch of other languages, some of which you will wish you had never heard of if I told you about them so I won't. You're welcome) and Patrick gave us a run-down on how the Python source file in all of its beauty gets transformed into bytecode for Parrot. It's less than simple so you can go look it up yourself at parrotcode.org.
After the meeting, I checked out the latest Parrot code and tried the test suite. Yay, it worked (good start) and then I tried the command line prompt for the Python part of Parrot, pynie. It also worked but things got a little bumpy after that:
- longs appear to be broken but aren't
- floats are broken, along with imaginary numbers
- lists, arrays and dictionaries are less than working
Thursday, June 7, 2007
Unit testing.. why wouldn't you?
Unit testing is a good thing, right? Especially if it is simple to set up, simple to run and simple to read the results. How do we do it in Python? Simple!
Here's a quick example using some pieces from the unit test file I checked in along with the python bitset:
import unittest
from pybitset import Bitset  # import the class or module under test

class bitset_testcase(unittest.TestCase):
    def setUp(self):  # testcase setup
        self.bits = Bitset(10)

    def testSize(self):  # some method to test
        # use assert to compare the expected and
        # actual results
        assert len(self.bits.bitstring) == self.bits.size(), \
            'Incorrect Size'

    def testRepr(self):  # some other method
        assert self.bits.size() == len(self.bits.__repr__()), \
            'Repr size incorrect'

    # .. if you need to test exceptions then write some code to
    # provoke an exception and catch the exception and add an else
    # clause to deal with exception failure
    def testIndex(self):
        try:
            self.bits[40] = 1
        except IndexError:
            pass
        else:
            self.fail("Out of range index expected exception")

# add the check for main so you can run from the command line
if __name__ == "__main__":
    unittest.main()
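With that __main__ check in place the file runs directly, and unittest's -v flag prints a line per test (the filename here is just a guess at what the checked-in file is called):
python test_pybitset.py -v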
Presto! Try it out, it's easy.
If you don't like my notes, go and read this
Python bitset checked in (along with unittests)
I have just checked in the python version of bitset (and unittests!) into the boost_python project directory. Now I need to spend a little time making the API for the python version agree with the C++ version. Some discrepancies around init/constructor, nothing major.
I think I have enough functionality to start on benchmarks.
The python bitset uses a list internally to store bits as, can you guess, zero or one. The first crack used the string representation of a bit rather than a numeric which was not a big problem until I noticed that all operations except init and repr needed to convert the value somewhere.
One gotcha to note: the zero-th bit is rightmost, as it would be if you were writing out a number rather than the string layout where the zeroth bit is on the left.
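A toy version of the idea, just to make the bit-order gotcha concrete (this is an illustration, not the checked-in code):
class ToyBitset(object):
    def __init__(self, bitstring):
        # store numeric 0/1 values rather than '0'/'1' characters
        self.bits = [int(c) for c in bitstring]

    def __getitem__(self, index):
        # bit 0 is the rightmost bit, as when writing out a number
        return self.bits[len(self.bits) - 1 - index]

    def __repr__(self):
        return ''.join(str(b) for b in self.bits)

bits = ToyBitset('111000')
print bits[0]   # 0 -- the rightmost character
print bits[5]   # 1 -- the leftmost character
print bits      # 111000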
Tuesday, May 29, 2007
Dynamic bitset for python
I started trying the Python part of Boost C++ libraries recently and found it surprisingly easy to use. In search of a mini-project, I started coding a Python extension module to expose the Boost dynamic bitset for use in Python. It started as a novelty for a quick talk at the DFW Python group but I got stumped by adding operators.. until last night.
Here's a recap:
The constructor can accept a string ('111000') or a number of bits or a number of bits and an initial value. You can have a very large number of bits - in the millions without a problem.
The logical operators (&|^) work as expected and even throw polite exceptions when the operands are of different sizes.
There is a count method to see how many bits are on, a test method that returns true if the bit at a given index is on, a flip (or toggle) method..
There are even docstrings!
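A hypothetical session with the module, just to show the API described above (the module and method spellings are from memory, so treat them as approximate):
>>> from bitset import Bitset      # module and class names are approximate
>>> a = Bitset('111000')           # construct from a string
>>> b = Bitset(6)                  # or from a number of bits, all off
>>> a.count()                      # how many bits are on
3
>>> a.test(3)                      # is the bit at index 3 on?
True
>>> c = a & b                      # & | ^ work; mismatched sizes raise an exception
>>> c.count()
0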
Available at:
https://python.taupro.com/repo/Projects/boost_python
It may be interesting to build a pure Python version and then use both to implement Conway's Game of Life to see what the difference in speed is.
Saturday, May 26, 2007
DFW Pythoneers Meeting
We are winding up the current meeting and I am trying to capture what we talked about so that when newcomers ask, I don't have to scratch my head and say "um, stuff..".
As always it was a lively session, mainly driven by Jeff Rush, with presentations and ad-hoc chats about topics like:
- the Forrester Wave work that Jeff did (and I contributed to) about the use of Python for enterprise web applications, including a simple mashup
- a presentation of named tuples aka nuples (something like an immutable dictionary)
- a recap on the game that John Zurawski developed for a 48-hour competition
- a quick look at Gizmo(QP), a python framework (yet another!)
- an RSS feed reader
- a quick peek at Pyjamas
Saturday, May 12, 2007
TuxDroid and D-Bus
D-Bus is a local IPC protocol that allows applications on a single machine to signal each other and request or consume each other's services. Examples include the little notification pop-up that tells you your battery is charged.
Tuxdroid is a robot version of the linux mascot, available from www.kysoh.com. Developer pages here: www.tuxisalive.com/
Here's a tuxdroid service that lets other applications speak with the mighty voice of Tux:
#!/usr/bin/python
# Borrowed from http://webcvs.freedesktop.org/dbus/dbus/python/examples/example-service.py
import dbus
import dbus.service
import dbus.glib
import gobject
import sys
sys.path.append('/opt/tuxdroid/api/python')
from tux import *
class TuxObject(dbus.service.Object):
    def __init__(self, bus_name, object_path="/Tux"):
        dbus.service.Object.__init__(self, bus_name, object_path)

    @dbus.service.method("org.dfwpython.TuxInterface")
    def Speak(self, hello_message, voice='Male'):
        print (str(hello_message))
        if voice == 'Male':
            speaker = SPK_US_MALE
        else:
            speaker = SPK_US_FEMALE
        pitch = 100
        tux.tts.select_voice(speaker, pitch)
        tux.cmd.mouth_open()
        tux.tts.speak(str(hello_message))
        tux.cmd.mouth_close()

session_bus = dbus.SessionBus()
name = dbus.service.BusName("org.dfwpython.TuxService", bus=session_bus)
object = TuxObject(name)

mainloop = gobject.MainLoop()
mainloop.run()
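And here is the sort of client snippet that would drive it from another process (my own sketch using standard dbus-python calls, with the bus name and interface taken from the service above):
#!/usr/bin/python
import dbus

bus = dbus.SessionBus()
# look up the object registered by the service script above
tux_proxy = bus.get_object("org.dfwpython.TuxService", "/Tux")
tux_iface = dbus.Interface(tux_proxy, "org.dfwpython.TuxInterface")
tux_iface.Speak("Hello from the DFW Pythoneers", "Female")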
Thursday, May 10, 2007
Snakeskin Pyjamas?
I was browsing around and I came across this project.. which promises that I can "build AJAX apps in Python (like Google did for Java)". A bold claim.. I'll have to try it out and see if it really is that easy.
It's not an encouraging sign that the webpage is borked (most links appear to point to something that went away) but follow the links to download and things improve.. Google Code is lurking back there.
Wednesday, May 2, 2007
Running Win2K in Qemu
I need to build a Python package for installation on Windows using distutils but I run Linux.. so I thought I'd try Qemu.
I built a disk image for Qemu using qemu-img, made an ISO from a Win2K CD and then installed Win2K in a window on my main Linux machine. All went well and the whole thing was installed in a couple of hours. Then I tried to run Windows Update. Boom.. gone. Hm.. maybe I'll try XP.
Monday, April 30, 2007
A quick Python recipe for a validating XML parser
ElementTree is now included in the Python standard library as of version 2.5 but, as good as it is, it has no support for XML Schema validation and limited support for XPath. For that you need lxml, which builds on the foundation of ElementTree.
Here's a few lines of Python to validate an XML document against a schema document using lxml:
from lxml import etree
# Parse the schema document
xsd = etree.ElementTree(file = 'schema.xsd')
# Build an XMLSchema object from the parsed document
xsv = etree.XMLSchema(xsd)
# Validate the document using the schema
doc = etree.ElementTree(file = 'doc.xml')
xsv.validate(doc)
And that's it!
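If the document does not validate, it helps to know why: validate() just returns True or False, but the schema object keeps an error log around (and assertValid() will raise with the details):
# continuing from above
if not xsv.validate(doc):
    # each entry carries the line number and a message from the underlying libxml2
    for error in xsv.error_log:
        print error.line, error.message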
If you also want to perform Xpath operations then here's a few examples:
# continuing from above
# Find all nodes with the tag name amount
nodes = doc.xpath('//amount')
# Find all amount nodes with an attribute named value equal to 7
nodes = doc.xpath('//amount[@value=7]')
# Need a namespace? Supply a prefix-to-URI dictionary
nodes = doc.xpath('//cdf:amount', namespaces={'cdf': 'http://uri.namespace.org/1.0'})
Later: lxml uses libxml2 under the hood to do its magic. Apparently, there are some bugs. When trying to validate XCCDF documents, errors are generated. This forced me into actually using C++ to build a schema validator, which was kind of useful, seeing as that's what I was supposed to be doing in the first place.
Sunday, April 29, 2007
First, pass the post
I thought I'd start a blog about my use of Python and related things.. so here I am. Getting started proves to be difficult but I figure if I start typing and then redact, something might come out..
On an occasional basis (free time is more scarce since the arrival of baby Meghan), I attend the Dallas-Fort Worth Pythoneers Saturday sprints at Nerdbooks in Richardson.
Yesterday's meeting covered a range of topics: a programming challenge; extending Python via ctypes, Pyrex, SWIG and Boost; population simulations in Python; creating S5 presentations using reStructuredText; and the normal variety of odd conversation topics.