[Haskell-cafe] Efficient parallel regular expressions

Martijn van Steenbergen martijn at van.steenbergen.nl
Wed Nov 5 08:56:53 EST 2008


Hello everyone,

Thank you all for your comments! Those are some very useful ideas.

I think I'll try roger's (private) and ChrisK's suggestion first: using 
the match groups. I'm not sure if the match groups inside the individual 
  regexes will cause much trouble, but we'll see. I imagine I'll have to 
count parentheses, except when it's followed by a \, except when that \ 
follows another \, etc. There's probably other situations where a () 
doesn't count as a group, perhaps when it's followed by a * or +. I'll 
look into that.

If that doesn't work out I'll go for Neil's (from an algorithmic POV 
beautiful) suggestion.

While I understand that some of you suggest I use parsec (or some other 
mature parser library) I'm pretty sure that's not what I want here. The 
patterns will almost always be very simple and regular expressions offer 
an extremely concise way of expressing when a hook should fire. Forcing 
the user to use full parsers would cause the programs to become much 
more verbose. Still, Yogurt is flexible enough to allow the user to use 
parsec if he or she so chooses.

Thanks again,

Martijn.



Mitchell, Neil wrote:
> Hi Martijn,
> 
> It's not that tricky if you do a regular expression state machine
> yourself, but that's probably a bit too much work. One way to get a
> speed up might be to take the regular expressions a,b,c,d and
> generate a regex a+b+c+d, and one a+b. You can then check any string
> s against a+b+c+d, if that matches check a+b, if that matches check
> a. At each stage you eliminate half the regular expressions, which
> means a match will take log n, where n is the number of regular
> expressions.
> 
> This assumes the underlying regular expression engine constructs a
> finite state machine, making it O(m) where m is the length of the
> string to match.
> 
> Thanks
> 
> Neil


More information about the Haskell-Cafe mailing list