[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

Wed Jul 28 07:33:41 EDT 2010

On 26/07/2010 16:23, Sebastian Fischer wrote:
> this year's ICFP features A Play on Regular Expressions where two
> Haskell programmers and an automata theory guru develop an efficient
> purely functional algorithm for matching regular expressions.

That is wonderfully clean way to go straight to a DFA!

Have you thought about supporting anchors like (^) and ($) ?  It would
mean the match would have to add some prev & next parameters (or a
Reader monad?) to the shift call.

In the Play you happen to compare with the native POSIX regex library on
Mac OS X.  This library happens to be buggy and sometimes fails to find
the leftmost longest match! (email footnote [1])

You also wrote
> I have some ideas for extending the matcher. For example /a{2,5}/ is  
> currently translated into /aa(a(a(a)?)?)?/ but it may be possible to  
> handle it without such blowup. I also want to add substring matching,  
> i.e., the possibility to find out against which strings parenthesized  
> parts of the regexp were matched.

Getting a good specification for the substring rules is non-trivial.
When writing regex-tdfa (on hackage) I consulted Glenn Fowler's
description at

> http://www2.research.att.com/~gsf/testregex/re-interpretation.html

which are well-defined rules for POSIX substring capture.  No
implementation of POSIX that I have tested against fully follows the
specification, though Fowler's AT&T code comes very very close (the bugs
quickcheck found are also on
http://www.haskell.org/haskellwiki/Regex_Posix ).

In the spirit of your DFA I am guessing the "mark" type for substring
capture would actually be an annotated copy of the whole regular
expression tree.  My regex-tdfa does not look like your DFA; my code
followed Ville Laurikari's master's thesis ( http://laurikari.net/tre/
but his libtre is quite buggy) which uses an array of indexes as the
"mark" on each NFA state.

But regex-tdfa is quite old and over-complicated and while it is linear
in complexity it is not especially fast.  I am currently rewriting the
algorithm in OCaml (to learn OCaml) and I am trying to avoid the
/a{2,5}/ blowup that regex-tdfa has.

Cheers,
  Dr. Chris Kuklewicz

[1] At http://www.haskell.org/haskellwiki/Regex_Posix there is a
discussion of the regex-posix-unittest (on hackage) results for OS
X/FreeBSD.  Some critical failures for leftmost longest that I (actually
quickcheck) found is:

############################# Unexpected Fail # 1
Searched text: "ab"
Regex pattern: "(()|.)(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 2
Searched text: "ab"
Regex pattern: "(()|[ab])(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 3
Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(3,4)(3,3)(3,3)"

The first (i,j) values are the indexes for the whole match (staring and
one-past-the-end).  Rational people can argue about the index values in
the other substring captures, but the first one for the whole match is
simply a bug.