[Haskell-cafe] Re: Fastest regex package?

Thu Feb 5 13:59:12 EST 2009

Eugene Kirpichov wrote:
>> All in all, my question remains: what is the fastest way to do this
>> kind of parsing on a lazy bytestring?
>>

Your example regular expression works the same in both Posix and Perl-ish 
semantics.   Do you know the difference?  Posix libraries look for the longest 
match of all possible matches.  Perl-ish is left-bias and looks at the left 
branch first and only looks at the right branch when the left fails and it has 
to backtrack.

So the "++" operator is a hack to try and control the backtracking of Perl 
regular expressions.  Such a things has no meaning in Posix where the 
implementation details are totally different.

You might try this variant of the example pattern:
  /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)

The [^0-9] can be used when you know that there is at least one junk character 
before the barid, which I suspect will always occur in a URL.

I expect regex-posix to be slower than regex-pcre. I have not used the new 
pcre-light.  I wrote regex-tdfa — it is pure haskell and not a C library 
wrapper.  There are patterns where regex-pcre will backtrack and take 
exponentially more time than regex-tdfa's automaton (which is not yet ideal and 
may get faster).

So what is the lazy bytestring with its multiple buffers doing for you when 
using PCRE, PCRE-light, or regex-posix? Absolutely nothing.  To run against 
these C libraries the target text is converted to a single buffer, i.e. a 
CStringLen in Haskell.  Thus it is morally converted into a strict bytestring. 
This may involve copying the logfile into a NEW strict bytestring EVERY TIME you 
run a match.  Please Please Please convert to a strict bytestring and then run 
regex-pcre or pcre-light (or any of the others).

regex-tdfa does not convert it into a strict bytestring, but is otherwise much 
slower than pcre for your simple pattern.

As for regex-pcre's interface....you should use the API in regex-base to get a 
pure interface.   The RegexLike functions are the pure interface for this, and 
the RegexContext class offers a slew of instances with useful variants.  But if 
you have been getting to the "low level IO API" in regex-pcre then you probably 
do not need or want the RegexContext transformations.

And BoyerMoore (which I think I helped optimize): this may be faster because it 
does not copy your whole Lazy bytestring into a Strict ByteString for each 
search.  But you may wish to test it with a Strict ByteString as input anyway.

-- 
Chris