Sunday, April 12, 2009

Regex Exploration

The Question

I was asked a question on the way out of the office at the end of the week: what is the difference between a regex match and a regex search?

It didn't seem like a difficult question but it stumped me for a little while. Both a search and a match should use a regular expression and evaluate it against a text. Perhaps a regex search traverses the input string and returns one result at a time and a regex match does.. hm.. or maybe a regex match is used for validation and a regex search is more exploratory?

The individuals asking the question were looking at the Python regex documentation (or rather AMK's very excellent regex howto ) which says "match() function only checks if the RE matches at the beginning of the string while search() will scan forward through the string for a match" which is interesting because it means that the Python regex match effectively inserts a leading anchor before attempting evaluation. Both functions return a MatchObject.

Wait a minute, you may say, you have the documentation to answer the original question! Not exactly because this question was not inspired by an abstract search for knowledge but a performance problem. The other interesting part was that Python documentation was being studied but the code is in C++ with no Python involved.

If you need a regex library for C++ then a good place to look is at the Boost home page which is “...one of the most highly regarded and expertly designed C++ library projects in the world” and has libraries for all sorts of things like graph theory, linear algebra or interprocess communication. So what does the Boost Regex library say about match and search? How about this: "the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search". Of note is the fact that the return type of match is boolean.

Did you get that? The Python regex match/search answer is a matter of the start position of the match and the Boost C++ answer is whether the input sequence is consumed.

Performance Evaluation


Another alternative for a regex library for the C++ developer is the Perl Compatible Regular Expression library (see the website) which offers a single execute function instead of a search or match function, with a flag parameter modifying behaviour. One suggestion I heard was that the PCRE library performs faster than the Boost library because "maybe the Boost library sucks?" but as Carl Sagan said "Extraordinary claims require extraordinary evidence" or, in other words, how do you figure?

I wrote a small app that generates multiple random strings by appending together words randomly picked from a small dictionary and then evaluates a number of regular expressions. Here's the results (slightly re-arranged):

Boost regex search, pattern: yellowgreensapphireruby.*bluered
12.50 s

Boost regex search with continuous flag, pattern: yellowgreensapphireruby.*bluered
0.07 s

Boost regex search with any flag, pattern: yellowgreensapphireruby.*bluered
12.51 s

Boost regex match, pattern: yellowgreensapphireruby.*bluered
0.05 s

PCRE regex, pattern: yellowgreensapphireruby.*bluered
1.47 s

Boost regex search, pattern: ^yellowgreensapphireruby.*bluered
6.59 s

Boost regex search with continuous flag, pattern: ^yellowgreensapphireruby.*bluered
0.07 s

Boost regex search with any flag, pattern: ^yellowgreensapphireruby.*bluered
6.57 s

Boost regex match, pattern: ^yellowgreensapphireruby.*bluered
0.05 s

PCRE regex, pattern: ^yellowgreensapphireruby.*bluered
0.02 s

Boost regex search, pattern: yellowgreensapphireruby.*bluered$
14.16 s

Boost regex search with continuous flag, pattern: yellowgreensapphireruby.*bluered$
0.07 s

Boost regex search with any flag, pattern: yellowgreensapphireruby.*bluered$
14.15 s

Boost regex match, pattern: yellowgreensapphireruby.*bluered$
0.05 s

PCRE regex, pattern: yellowgreensapphireruby.*bluered$
2.01 s

Boost regex search, pattern: ^yellowgreensapphireruby.*bluered$
6.59 s

Boost regex search with continuous flag, pattern: ^yellowgreensapphireruby.*bluered$
0.07 s

Boost regex search with any flag, pattern: ^yellowgreensapphireruby.*bluered$
6.59 s

Boost regex match, pattern: ^yellowgreensapphireruby.*bluered$
0.05 s

PCRE regex, pattern: ^yellowgreensapphireruby.*bluered$
0.02 s

Saturday, April 11, 2009

Eric's Law

If you think the compiler is wrong, you are 99.999% certain to be the problem.

*This number may vary according to the platform and the compiler.

Saturday, April 4, 2009

Web2py rocks more than the alternatives?

I need to find some spare time, perhaps just some of that discretionary sleeping time, to take a look at web2py. I have a small project in mind - updating my friends website to have a simple calendar and appointment system - but I have been getting slammed at the office porting an application to AIX (yes, it's still in use) and busy at home with the twins.