Personal tools

Regex Posix

From HaskellWiki

(Difference between revisions)
Jump to: navigation, search
(First draft)
m (Fix some typos.)
 
(9 intermediate revisions by one user not shown)
Line 1: Line 1:
 
= regex-posix Bugs =
 
= regex-posix Bugs =
  +
  +
Executive summary: If you want a bug-free and/or portable POSIX extended regular expression library to use from Haskell, then regex-posix will not help you. You should use the regex-tdfa package instead.
   
 
The regular expressions provided by the GHC bundle (up to 6.10.1) are
 
The regular expressions provided by the GHC bundle (up to 6.10.1) are
Line 24: Line 26:
 
* higher-level subpatterns have leftmost-longest priority over their component subpatterns
 
* higher-level subpatterns have leftmost-longest priority over their component subpatterns
 
* REs have right associative concatenation which can be changed with parenthesis
 
* REs have right associative concatenation which can be changed with parenthesis
* parenthesized subexpressions return the the match from their last usage
+
* parenthesized subexpressions return the match from their last usage
* text of component subexpressions much be contained in the text of the higher-level subexpressions
+
* text of component subexpressions must be contained in the text of the higher-level subexpressions
 
* if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
 
* if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
 
* if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string
 
* if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string
Line 37: Line 39:
 
, the rest from the author of regex-tdfa ([[ChrisKuklewicz]]). Note
 
, the rest from the author of regex-tdfa ([[ChrisKuklewicz]]). Note
 
that all of these tests are passed by regex-tdfa-0.97.1 (but not below
 
that all of these tests are passed by regex-tdfa-0.97.1 (but not below
this).
+
this). Since then additional bugs have been found and fixed, so regex-tdfa is now
  +
on at least version 0.97.4, and a future regex-posix-unittest may include the
  +
additional test cases.
   
 
These problems are documented here, but examples of the bugs are
 
These problems are documented here, but examples of the bugs are
Line 95: Line 97:
   
 
There are four classes of failures:
 
There are four classes of failures:
# Critical failures that find the wrong whole match (XBSD as of January 2009)
+
# Critical failures that find the wrong whole match (XBSD as of January 2009, Solaris)
 
# Serious failures that find impossible subexpressions (XBSD)
 
# Serious failures that find impossible subexpressions (XBSD)
 
# Serious failures that violate well-formed subexpressions (GLIBC)
 
# Serious failures that violate well-formed subexpressions (GLIBC)
# Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)
+
# Failure to choose Posix captures when the answer is ambiguous (GLIBC, XBSD, Solaris)
   
 
Example of an ambiguous situation: Matching "a" with the the pattern
 
Example of an ambiguous situation: Matching "a" with the the pattern
Line 332: Line 334:
 
associativity.
 
associativity.
   
== GLIBC ==
+
== OpenBSD ==
  +
  +
There is no report from OpenBSD. One would be quite welcome!
  +
  +
== GNU C library (aka GLIBC) ==
   
 
The system being tested is a current Ubuntu 8.10 distribution:
 
The system being tested is a current Ubuntu 8.10 distribution:
Line 478: Line 480:
 
repetitions of (^). Changing "*" to "+" causes GLIBC to correctly
 
repetitions of (^). Changing "*" to "+" causes GLIBC to correctly
 
return the capture at (0,0).
 
return the capture at (0,0).
  +
  +
== Solaris C library ==
  +
  +
Christian Maeder ran regex-posix-unittest and sent in Solaris 10 results with the message:
  +
<pre>
  +
Same results for:
  +
SunOS 5.10 Generic_138889-03 i86pc i386 i86pc
  +
SunOS 5.10 Generic_138888-02 sun4u sparc SUNW,Sun-Fire-280R
  +
</pre>
  +
  +
The summary is
  +
<pre>
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[4,65,67,71,76])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,13,14,15,16,19,20,21,22,27])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[2,18,26,38,40,46])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/totest.txt",[3,5,24,26,27,42,43,44,45,110,204,211,212])
  +
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[])
  +
</pre>
  +
  +
The above is similar to but not the same as the GLIBC bugs.
  +
  +
The "basic3.txt" #4 is a syntax error reported for the "^$" pattern,
  +
which seems allowed to me: Error message: (ReturnCode 22,"unknown regex error")
  +
  +
The "basic3.txt" #65 is a new bug:
  +
<pre>
  +
Searched text: "-"
  +
Regex pattern: "(a*)*"
  +
Expected output: "(0,0)(0,0)"
  +
Actual result : "(0,0)(-1,-1)"
  +
</pre>
  +
  +
The "a*" can accept 0 characters so (a*)* should match once at (0,0). There are several other examples of this in "basic3.txt" and "nullsub3.txt"
  +
  +
The too-greedy behavior of GLIBC and the lack of forced association is present in the Solaris bugs.
  +
  +
But in "forced-assoc.txt" #13 I see something far more critical! A critical bug indeed:
  +
<pre>
  +
Searched text: "abc"
  +
Regex pattern: "(a*)(b|abc)"
  +
Expected output: "(0,3)(0,0)(0,3)"
  +
Actual result : "(0,2)(0,1)(1,2)"
  +
</pre>
  +
Note that the whole match should be "abc" but Solaris matches only "ab". (a*) is far far too greedy!
  +
  +
This is also caught in "totest.txt" #3, #43, #44, #45, showing #3 below:
  +
<pre>
  +
Searched text: "ab"
  +
Regex pattern: "(a?)((ab)?)"
  +
Expected output: "(0,2)(0,0)(0,2)(0,2)"
  +
Actual result : "(0,1)(0,1)(1,1)(-1,-1)"
  +
</pre>
  +
  +
== AT&T Software Technology (aka AST or libast) ==
  +
  +
AT&T Research has an [http://www.research.att.com/sw/download/ open source suite] of libraries and binaries. This includes a [http://www.research.att.com/~gsf/testregex/re-interpretation.html POSIX regular expression implementation] in the "libast" library. Most of the unit tests used here are from their [http://www.research.att.com/~gsf/testregex/ "testregex"] program.
  +
  +
I, [[ChrisKuklewicz]], have managed to compile a "regex-ast" Haskell wrapper to this POSIX engine to allow for testing. The difficulties with the AT&T headers prevent me from releasing this to hackage. The AST engine passes the whole set of unit tests.
  +
  +
But randomized testing has found a few (probably 2, maybe 3) bugs in the AST engine as of 2009-02-24. In return, testing against AST has lead to the regex-tdfa-0.97.4 bug fix release. As regex-tdfa and AST improve they will hopefully reach a state where no differences can be found; this will likely be a fully correct state since their designs and implementations are completely different.
  +
  +
<pre>
  +
Bug in AST engine
  +
Pattern: "((.?)?)*."
  +
Text to search: "x"
  +
Expected: (0,1)(0,0)(0,0)
  +
Returned: (0,1)(0,0)(-1,-1)
  +
  +
For the next group the Text to search is "x" and
  +
Expected: (0,0)(0,0)(0,0)
  +
Pattern and Returned:
  +
"((.?)?){0,}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){1,}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){2,}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){3,}." (0,0)(0,0)(-1,-1)
  +
  +
"((.?)?){0,1}." (0,0)(0,0)(0,0)
  +
  +
"((.?)?){0,2}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){1,2}." (0,0)(0,0)(0,0)
  +
  +
"((.?)?){0,3}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){1,3}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){2,3}." (0,0)(0,0)(0,0)
  +
  +
"((.?)?){0,4}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){1,4}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){2,4}." (0,0)(0,0)(-1,-1)
  +
"((.?)?){3,4}." (0,0)(0,0)(0,0)
  +
</pre>
  +
  +
If someone else is interested in "regex-ast" then write to me at "haskell _at_ mightyreason _dot_ com".
  +
  +
= Other implementations =
  +
  +
== Boost (C++ library) ==
  +
  +
There is a [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html Boost Regex] library, which includes [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/ref/posix.html "POSIX Compatible C API's"]. But their definition of the [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html "The Leftmost Longest Rule"] is a different interpretation than the POSIX standard described above.
  +
  +
In the POSIX standard, the rule applies to subpatterns, regardless of capturing. Boost only applies it to captured subexpressions. Their [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/background_information/faq.html FAQ] explains:
  +
  +
; Q. Why does using parenthesis in a POSIX regular expression change the result of a match?
  +
  +
: A. For POSIX (extended and basic) regular expressions, but not for perl regexes, parentheses don't only mark; they determine what the best match is as well. When the expression is compiled as a POSIX basic or extended regex then Boost.Regex follows the POSIX standard leftmost longest rule for determining what matched. So if there is more than one possible match after considering the whole expression, it looks next at the first sub-expression and then the second sub-expression and so on. So...
  +
  +
: "(0*)([0-9]*)" against "00123" would produce $1 = "00" $2 = "123"
  +
  +
: where as
  +
  +
: "0*([0-9])*" against "00123" would produce $1 = "00123"
  +
  +
:If you think about it, had $1 only matched the "123", this would be "less good" than the match "00123" which is both further to the left and longer. If you want $1 to match only the "123" part, then you need to use something like:
  +
  +
: "0*([1-9][0-9]*)"
  +
  +
The POSIX subpattern rules would always have the "0*" match "00" and the "[0-9]*" match "123", regardless of whether either is parenthesized.
  +
  +
== HSRex / Postgresql / TCL ==
  +
  +
This is from [http://www.arglist.com/regex/] which is named after Henry Spencer. This has been used in TCL 8.2 (at least) and postgresql 7.4 (at least) and wxWidgets. I tested the "Walter Waldo's port" listed on that page (hsrex.tar.gz). It has a different spectrum of non-Posix bugs. An interesting new one is:
  +
  +
<pre>
  +
test '(aba|ab|a)*' 'ababa'
  +
(0,5)(4,5)
  +
</pre>
  +
  +
Where the expected answer is (0,5)(2,5). It has picked a possible match, but has not maximized the second capture, using (ab) instead of (aba). It finds the right answer for explicit repetitions because it does find the longest match:
  +
  +
<pre>
  +
test '(aba|ab|a)(aba|ab|a)' 'ababa'
  +
(0,5)(0,2)(2,5)
  +
</pre>
  +
  +
<pre>
  +
test '(aba|ab|a)(aba|ab|a)(aba|ab|a)' 'ababa'
  +
(0,5)(0,2)(2,4)(4,5)
  +
</pre>

Latest revision as of 09:40, 13 July 2011

Contents

[edit] 1 regex-posix Bugs

Executive summary: If you want a bug-free and/or portable POSIX extended regular expression library to use from Haskell, then regex-posix will not help you. You should use the regex-tdfa package instead.

The regular expressions provided by the GHC bundle (up to 6.10.1) are through a wrapping of the operating system's native C library. The C library calls in "regex.h" have been imported via FFI and wrapped by regex-posix , either the older 0.72.* version or the newer version (> 0.94.*). Note that the matchAll semantics changed in regex-posix with version 0.94.0 (see mailing list announcement).

Unfortunately the native platforms provide Posix regular expressions support that contains bugs and/or violates the specification. This is especially true for the GNU C library (GLIBC) used by Linux distributions and the BSD C library used by OS X and FreeBSD and NetBSD (XBSD). More discussion of the standard is available from Glenn Fowler at AT&T.

A capsule summary of the Posix rules:

  • regular expressions (REs) take the leftmost starting match, and the longest match starting there
  • earlier subpatterns have leftmost-longest priority over later subpatterns
  • higher-level subpatterns have leftmost-longest priority over their component subpatterns
  • REs have right associative concatenation which can be changed with parenthesis
  • parenthesized subexpressions return the match from their last usage
  • text of component subexpressions must be contained in the text of the higher-level subexpressions
  • if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
  • if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string

All of the above rules will be violated by one or another bug described below.

[edit] 2 regex-posix-unittest

Many of these unittests come from testregex by AT&T , the rest from the author of regex-tdfa (ChrisKuklewicz). Note that all of these tests are passed by regex-tdfa-0.97.1 (but not below this). Since then additional bugs have been found and fixed, so regex-tdfa is now on at least version 0.97.4, and a future regex-posix-unittest may include the additional test cases.

These problems are documented here, but examples of the bugs are testable on your system by way of the regex-posix-unittest package on Hackage. This package can be installed either as "--user" or "--global", and provides three things:

  1. The "regex-posix-unittest" binary executable
  2. The "test-manifest.txt" file which lists files to load test from
  3. A set of ".txt" files containing one unit test per line

Running the regex-posix-unittest program produces output for the passing or failing of each test in each file listed in "test-manifest.txt", which is shipped listing all the test files. This is followed a summary list of failing test id numbers.

Each unit test line has four fields, separated by white-space (one tab in the above):

  1. The test identification Int, negative if this is expected to fail
  2. The regular expression pattern (extended regular expression)
  3. The text to search (no white-space) # The expected result
    1. NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
      • Each pair of numbers denotes a substring of the text to search as two 0-based indexes
      • The length of the substring is (m-n)
      • The first pair is the whole match, further pairs are for parenthesized subexpression captures
      • Subexpressions that do not match are lists as "(?,?)" instead of with numbers

You can add and delete but please do not change existing tests.

If your platform is not mentioned in this wiki page (see below) then please either add it and your summary results or email the results to "TestRegexLazy -at- mightyreason -dot- com".

One can also pass two strings to "regex-posix-unittest" to run your own test:

user@ubuntu810desktop:~/test/regex-posix-unittest$ ~/.cabal/bin/regex-posix-unittest "abcd" "([^c])+"
Test for [Char], Version {versionBranch = [1,0], versionTags = []}
"abcd"
"([^c])+"
Posix
("","ab","cd",["b"])
POSIX count = (2,2)
array (0,1) [(0,(0,2)),(1,(1,1))]
array (0,1) [(0,(3,1)),(1,(3,1))]

This actually is a slightly return format. The strings are echoed and the ("","ab","cd",["b"]) is the "" before the first match, the "ab" of the whole first match, and the "cd" after the first match, which the captures in the list ["b"]. The arrays lines are the results of consecutive non-overlapping matches, in (capture #, (start index, length)) format. So (1,(1,1)) is the 1st parenthesis and starts after the 1st character and is 1 character long: "b". The (1,(3,1)) is "d".

[edit] 3 Types of failure

There are four classes of failures:

  1. Critical failures that find the wrong whole match (XBSD as of January 2009, Solaris)
  2. Serious failures that find impossible subexpressions (XBSD)
  3. Serious failures that violate well-formed subexpressions (GLIBC)
  4. Failure to choose Posix captures when the answer is ambiguous (GLIBC, XBSD, Solaris)

Example of an ambiguous situation: Matching "a" with the the pattern "(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either "(.)" or "(a)" to match the "a', but not both; thus only one of the two subexpressions must succeed and the other must fail. In this case the right answer is that "(.)" succeeds in both cases, so the full answer is (0,1)(0,1)(?,?). The first pattern uses the right associativity rule for concatenation and the second pattern uses the leftmost subpattern bias.

[edit] 4 Results and Bugs

[edit] 4.1 OS X, FreeBSD, NetBSD

On a G4 version of OS X 10.5.6 I see:

("/Users/chrisk/local/share/regex-posix-unittest-1.0/basic3.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/class.txt",[3,9])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/left-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/right-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/forced-assoc.txt",[5,6,7,8,15,16,21,22])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/nullsub3.txt",[50,51])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/repetition2.txt",[101,102,103,104,105,106,107,
110,111,112,113,114,115,116,117,260,261,262,263,266,267,268,270,271])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/totest.txt",[5,30,33,209])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[1,-1,2,-2,3,-3,14,-14])

The "osx-bsd-critical.txt" unexpected failures are:

############################# Unexpected Fail # 1 #############################

Searched text: "ab"
Regex pattern: "(()|.)(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 2 #############################

Searched text: "ab"
Regex pattern: "(()|[ab])(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 3 #############################

Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(3,4)(3,3)(3,3)"

############################# Unexpected Fail # 14 #############################

Searched text: "aaab"
Regex pattern: "([ab]|())+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(3,3)(3,3)"

Tests 1 and 2 should match the whole (0,2) or "ab" but only match the (1,2) of "b". Test 3 should match the (0,3) of "aabb" but only matches the (3,4) of "b". These are critical failures to match the whole text properly. This problem was found by QuickCheck and reported to OS X and FreeBSD and NetBSD (the latter two checked over IRC). OpenBSD reportedly did not show this bug (anecdote on IRC).

This also infects sed, and can be seen as

prompt$ echo "ab" | sed -En 's/(()|.)(b)/[&][\1]/p'
a[b][]

instead of the correct "[ab][a]" output.

The failure in Test 14 is different. Instead of the whole match being wrong the pattern being repeated by the "+" operator is matched 3 times against "a" and "a" and "a", followed by a fourth empty match at (3,3) before the final "b". Once the repeated pattern "[ab]|()" has matched a non-empty string it should not be used to match an empty-string.

OS X also gets this wrong in "totext.txt" #209:

Searched text: "yyyyyy"
Regex pattern: "(yyy|(x?)){2,4}"
Expected output: "(0,6)(3,6)(-1,-1)"
Actual result  : "(0,6)(6,6)(6,6)"

where the (x?) matches the empty string at (6,6) instead of stopping after 2 repetitions of "yyy".

But OS X is getting this right on numerous tests in "nullsub3.txt" such as

Expected Pass #8
text and pattern: "a"
Regex pattern: "(a*)*"
Outputs agree: "(0,1)(0,1)"

Expected Pass #47
text and pattern: "ax"
Regex pattern: "(a*)*(x)"
Outputs agree: "(0,2)(0,1)(1,2)"

Where (a*) matches "a" and is reported as (0,1) and the no additional repetition matches the empty string at (1,1).

The OS X failure in "totest.txt" #33 illustrates the "impossible" capture:

Searched text: "ababa"
Regex pattern: "(aba|ab|a)*"
Expected output: "(0,5)(2,5)"
Actual result  : "(0,5)(0,3)"

The whole match is (0,5) and the correct repetition has "ab" then "aba" matching, so the correct last repetition is the "aba" between indexes (2,5). The OS X code recognizes the whole match then does a second pass of greedy repetitions, so the first repetition is "aba" between indexes (0,3). The OS X code then fails to match the "ba" in this second pass but not care. It is logically impossible that the last iteration of the above subexpression ends before the whole match, but OS X reports the last repetition ended at index 3 even though the whole match ends at index 5.

This is also shown in numerous failed tests in "repetition2.txt" such as #260:

Searched text: "ababcd"
Regex pattern: "(a|ab|c|bcd){0,}(d*)"
Expected output: "(0,6)(3,6)(6,6)"
Actual result  : "(0,6)(4,5)(6,6)"

The correct repetitions are "ab" then "a" then "bcd", reported as (3,6). The incorrect greedy pass uses by OS X matches "ab" then "c" as (4,5) and then gives up even though (5,6) is still not matched in the second pass. As a sed test:

prompt$ echo "ababcd" | sed -En 's/(a|ab|c|bcd)*/[&][\1]/p'
[ababcd][c]

The above output should be impossible; the correct output is [ababcd][bcd].

The "forced-assoc.txt" failures show that one cannot use parenthesis to force the naturally right associative concatenation to be left associative. This is a violation of the Posix standard, and of the "man 7 re_format" page form OS X 10.5.6: "Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions". Take #15 as an example:

Searched text: "abc"
Regex pattern: "((a*)(b|abc))(c*)"
Expected output: "(0,3)(0,3)(0,0)(0,3)(3,3)"
Actual result  : "(0,3)(0,2)(0,1)(1,2)(2,3)"

Here ((a*)(b|abc))(c*) should be parsed as ((a*)(b|abc)) followed by (c*). As ((a*)(b|abc)) started earlier it should be as long as possible, at the expense of (c*). The longest match for this is the correct (0,3) not the incorrect (0,2) returned by OS X. And ((a*)(b|abc)) is a higher-level than its lower-level component (a*), so (a*) should match with the side constraint that ((a*)(b|abc)) matches (0,3), which means (a*) should match (0,0) instead of the longer (0,1). OS X gets this associativity this wrong.

In "repetitions2.txt" there is a series:

Expected Pass #100
text and pattern: "X1234567Y"
Regex pattern: "X(.?){0,}Y"
Outputs agree: "(0,9)(7,8)"

############################# Unexpected Fail # 101 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){1,}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"
...
############################# Unexpected Fail # 107 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){7,}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

Expected Pass #108
text and pattern: "X1234567Y"
Regex pattern: "X(.?){8,}Y"
Outputs agree: "(0,9)(8,8)"

############################# Unexpected Fail # 110 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){0,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

############################# Unexpected Fail # 111 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){1,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"
...

############################# Unexpected Fail # 117 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){7,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

Expected Pass #118
text and pattern: "X1234567Y"
Regex pattern: "X(.?){8,8}Y"
Outputs agree: "(0,9)(8,8)"

The good news is that #108 and #118 are correct, as the last repetition is forced to match the empty string after the "7" and before the "Y". The {1,} through {7,} and {1,8} through {7,8} are all wrong: they all should match the "7" at (7,8) but XBSD returns the null at (8,8). The {0,} answer is right, but {0,8} and {1,} are both wrong. So {0,} seems to handled specially, probably because {0,} is precisely the "*" operator, though the wrong {1,} is precisely the "+" operator.

The failures of XBSD in "repetitions2.txt" in the 260 to 271 range show a different ambiguity where {0,} is not matched correctly. It gets these wrong by being greedy in the expanded form of the {n,m} operators. The "a|ab|c|bcd" should choose "ab" then "a" then "bcd" at (3,6) to match "ababcd", then (d*) should match nothing at (6,6). But XBSD is matching "ab" then "ab" then "c" at (4,5) and then uses (d*) to match the last "d" at (5,6). Changing the pattern to "((a|ab|c|bcd){3,10})(d*)" fails to force the grouping and associativity.

[edit] 4.2 OpenBSD

There is no report from OpenBSD. One would be quite welcome!

[edit] 4.3 GNU C library (aka GLIBC)

The system being tested is a current Ubuntu 8.10 distribution:

user@ubuntu810desktop:~$ uname -a
Linux ubuntu810desktop 2.6.27-11-generic #1 SMP Wed Jan 28 00:02:01 UTC 2009 i686 GNU/Linux

The failing tests are

("/home/user/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[71])
("/home/user/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5,10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,15,16,21,22,27])
("/home/user/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[41])
("/home/user/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[26,28,34,41,42,108,110,111,112,113,114,115,116])
("/home/user/.cabal/share/regex-posix-unittest-1.0/totest.txt",[5,24,26,27,36,42,83,84,110,206,207,208,209,215,250,251])
("/home/user/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[3])

The GLIBC engine has an API for its own GNU regular expression standard which differs from Posix. It also exposes a Posix compatible "regex.h" API, and GNU sed has a "--posix" switch. But the library is not really trying to get the Posix answer.

First is nested subexpressions. In Posix sets of nested parenthesis that return captures must return nested captures: the inner subexpressions return substrings of the outer substrings. Consider "totest.txt" #215:

Searched text: "ab"
Regex pattern: "((a)|(b)){2,}"
Expected output: "(0,2)(1,2)(-1,-1)(1,2)"
Actual result  : "(0,2)(1,2)(0,1)(1,2)"

Where "(a)" matches the "a" at (0,1), and the next repetion matches "(b)" matches the "b" at (1,2). Under Posix rules the "(a)" did not match in the last use of the parent subexpression ((a)|(b)), so it should be (-1,-1). Under GLIBC rules it should return (0,1). The above could largely be fixed by a filter that unsets child patterns that are not properly nested, though I worry about edge cases and empty matches.

In GLIBC this is not the case, take "osx-bsd-critical" #3 and #14 as an example:

############################# Unexpected Fail # 3 #############################
Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(2,3)(0,0)"

Expected Pass #14
text and pattern: "aaab"
Regex pattern: "([ab]|())+b"
Outputs agree: "(0,4)(2,3)(-1,-1)"

In #3, the "()" does match at the initial position, so it gets its capture reported, even though (0,0) is not a substring of (2,3). This is contradicted by #14 where "()" does not match at the initial position. This is an example of where "p|q" should be the same as "q|p" but returns a different answer.

The failure of some of the "right-assoc.txt" tests shows that GLIBC uses neither right nor left associative concatenation:

############################# Unexpected Fail # 1 #############################

Searched text: "abcd"
Regex pattern: "(a|ab)(c|bcd)(d*)"
Expected output: "(0,4)(0,2)(2,3)(3,4)"
Actual result  : "(0,4)(0,1)(1,4)(4,4)"

############################# Unexpected Fail # 2 #############################

Searched text: "abcd"
Regex pattern: "(a|ab)(bcd|c)(d*)"
Expected output: "(0,4)(0,2)(2,3)(3,4)"
Actual result  : "(0,4)(0,1)(1,4)(4,4)"

Expected Pass #3
text and pattern: "abcd"
Regex pattern: "(ab|a)(c|bcd)(d*)"
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"

Expected Pass #4
text and pattern: "abcd"
Regex pattern: "(ab|a)(bcd|c)(d*)"
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"

Test #1 through #4 check all four orderings of "a" versus "ab" and "c" versus "bcd". And GLIBC seems to be applying a greedy strategy to the "a" or "ab" subpattern. Consider these in "forced-assoc.txt":

Expected Pass #5
text and pattern: "abcd"
Regex pattern: "((a|ab)(c|bcd))(d*)"
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"

Expected Pass #6
text and pattern: "abcd"
Regex pattern: "((a|ab)(bcd|c))(d*)"
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"

############################# Unexpected Fail # 7 #############################

Searched text: "abcd"
Regex pattern: "((ab|a)(c|bcd))(d*)"
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"
Actual result  : "(0,4)(0,3)(0,2)(2,3)(3,4)"

############################# Unexpected Fail # 8 #############################

Searched text: "abcd"
Regex pattern: "((ab|a)(bcd|c))(d*)"
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"
Actual result  : "(0,4)(0,3)(0,2)(2,3)(3,4)"

In the above the extra parenthesis around the first two subexpression ought to have forced the pair to match the longest possible string "abcd" from (0,4). But #7 and #8 failed to match this because "ab" was greedily chosen.

This greedy approach is not entirely consistent, see "repetition2.txt" #108:

Searched text: "X1234567Y"
Regex pattern: "X(.?){8,}Y"
Expected output: "(0,9)(8,8)"
Actual result  : "(0,9)(7,8)"

The first seven repetitions should match 1234567 and the 8th repetition would then match an empty string at (8,8) after the "7" and before the "Y". But GLIBC does something I do not understand and ensures that the last repetition is not empty.

Another not-greedy-enough example is "basic3.txt" #71:

Searched text: "-"
Regex pattern: "(^)*"
Expected output: "(0,0)(0,0)"
Actual result  : "(0,0)(-1,-1)"

The ^ should match and capture at (0,0), but GLIBC uses zero repetitions of (^). Changing "*" to "+" causes GLIBC to correctly return the capture at (0,0).

[edit] 4.4 Solaris C library

Christian Maeder ran regex-posix-unittest and sent in Solaris 10 results with the message:

Same results for:
SunOS 5.10 Generic_138889-03 i86pc i386 i86pc
SunOS 5.10 Generic_138888-02 sun4u sparc SUNW,Sun-Fire-280R

The summary is

("/home/maeder/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[4,65,67,71,76])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,13,14,15,16,19,20,21,22,27])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[2,18,26,38,40,46])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/totest.txt",[3,5,24,26,27,42,43,44,45,110,204,211,212])
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[])

The above is similar to but not the same as the GLIBC bugs.

The "basic3.txt" #4 is a syntax error reported for the "^$" pattern, which seems allowed to me: Error message: (ReturnCode 22,"unknown regex error")

The "basic3.txt" #65 is a new bug:

Searched text: "-"
Regex pattern: "(a*)*"
Expected output: "(0,0)(0,0)"
Actual result  : "(0,0)(-1,-1)"

The "a*" can accept 0 characters so (a*)* should match once at (0,0). There are several other examples of this in "basic3.txt" and "nullsub3.txt"

The too-greedy behavior of GLIBC and the lack of forced association is present in the Solaris bugs.

But in "forced-assoc.txt" #13 I see something far more critical! A critical bug indeed:

Searched text: "abc"
Regex pattern: "(a*)(b|abc)"
Expected output: "(0,3)(0,0)(0,3)"
Actual result  : "(0,2)(0,1)(1,2)"

Note that the whole match should be "abc" but Solaris matches only "ab". (a*) is far far too greedy!

This is also caught in "totest.txt" #3, #43, #44, #45, showing #3 below:

Searched text: "ab"
Regex pattern: "(a?)((ab)?)"
Expected output: "(0,2)(0,0)(0,2)(0,2)"
Actual result  : "(0,1)(0,1)(1,1)(-1,-1)"

[edit] 4.5 AT&T Software Technology (aka AST or libast)

AT&T Research has an open source suite of libraries and binaries. This includes a POSIX regular expression implementation in the "libast" library. Most of the unit tests used here are from their "testregex" program.

I, ChrisKuklewicz, have managed to compile a "regex-ast" Haskell wrapper to this POSIX engine to allow for testing. The difficulties with the AT&T headers prevent me from releasing this to hackage. The AST engine passes the whole set of unit tests.

But randomized testing has found a few (probably 2, maybe 3) bugs in the AST engine as of 2009-02-24. In return, testing against AST has lead to the regex-tdfa-0.97.4 bug fix release. As regex-tdfa and AST improve they will hopefully reach a state where no differences can be found; this will likely be a fully correct state since their designs and implementations are completely different.

Bug in AST engine
Pattern: "((.?)?)*."
Text to search: "x"
Expected: (0,1)(0,0)(0,0)
Returned: (0,1)(0,0)(-1,-1)

For the next group the Text to search is "x" and
Expected: (0,0)(0,0)(0,0)
Pattern and Returned:
"((.?)?){0,}."  (0,0)(0,0)(-1,-1)
"((.?)?){1,}."  (0,0)(0,0)(-1,-1)
"((.?)?){2,}."  (0,0)(0,0)(-1,-1)
"((.?)?){3,}."  (0,0)(0,0)(-1,-1)

"((.?)?){0,1}."  (0,0)(0,0)(0,0)

"((.?)?){0,2}."  (0,0)(0,0)(-1,-1)
"((.?)?){1,2}."  (0,0)(0,0)(0,0)

"((.?)?){0,3}."  (0,0)(0,0)(-1,-1)
"((.?)?){1,3}."  (0,0)(0,0)(-1,-1)
"((.?)?){2,3}."  (0,0)(0,0)(0,0)

"((.?)?){0,4}."  (0,0)(0,0)(-1,-1)
"((.?)?){1,4}."  (0,0)(0,0)(-1,-1)
"((.?)?){2,4}."  (0,0)(0,0)(-1,-1)
"((.?)?){3,4}."  (0,0)(0,0)(0,0) 

If someone else is interested in "regex-ast" then write to me at "haskell _at_ mightyreason _dot_ com".

[edit] 5 Other implementations

[edit] 5.1 Boost (C++ library)

There is a Boost Regex library, which includes "POSIX Compatible C API's". But their definition of the "The Leftmost Longest Rule" is a different interpretation than the POSIX standard described above.

In the POSIX standard, the rule applies to subpatterns, regardless of capturing. Boost only applies it to captured subexpressions. Their FAQ explains:

Q. Why does using parenthesis in a POSIX regular expression change the result of a match?
A. For POSIX (extended and basic) regular expressions, but not for perl regexes, parentheses don't only mark; they determine what the best match is as well. When the expression is compiled as a POSIX basic or extended regex then Boost.Regex follows the POSIX standard leftmost longest rule for determining what matched. So if there is more than one possible match after considering the whole expression, it looks next at the first sub-expression and then the second sub-expression and so on. So...
"(0*)([0-9]*)" against "00123" would produce $1 = "00" $2 = "123"
where as
"0*([0-9])*" against "00123" would produce $1 = "00123"
If you think about it, had $1 only matched the "123", this would be "less good" than the match "00123" which is both further to the left and longer. If you want $1 to match only the "123" part, then you need to use something like:
"0*([1-9][0-9]*)"

The POSIX subpattern rules would always have the "0*" match "00" and the "[0-9]*" match "123", regardless of whether either is parenthesized.

[edit] 5.2 HSRex / Postgresql / TCL

This is from [1] which is named after Henry Spencer. This has been used in TCL 8.2 (at least) and postgresql 7.4 (at least) and wxWidgets. I tested the "Walter Waldo's port" listed on that page (hsrex.tar.gz). It has a different spectrum of non-Posix bugs. An interesting new one is:

test '(aba|ab|a)*' 'ababa'
(0,5)(4,5)

Where the expected answer is (0,5)(2,5). It has picked a possible match, but has not maximized the second capture, using (ab) instead of (aba). It finds the right answer for explicit repetitions because it does find the longest match:

test '(aba|ab|a)(aba|ab|a)' 'ababa'
(0,5)(0,2)(2,5)
test '(aba|ab|a)(aba|ab|a)(aba|ab|a)' 'ababa'
(0,5)(0,2)(2,4)(4,5)