Personal tools

Regex Posix

From HaskellWiki

Revision as of 20:42, 2 February 2009 by ChrisKuklewicz (Talk | contribs)

Jump to: navigation, search

Contents

1 regex-posix Bugs

The regular expressions provided by the GHC bundle (up to 6.10.1) are through a wrapping of the operating system's native C library. The C library calls in "regex.h" have been imported via FFI and wrapped by regex-posix , either the older 0.72.* version or the newer version (> 0.94.*). Note that the matchAll semantics changed in regex-posix with version 0.94.0 (see mailing list announcement).

Unfortunately the native platforms provide Posix regular expressions support that contains bugs and/or violates the specification. This is especially true for the GNU C library (GLIBC) used by Linux distributions and the BSD C library used by OS X and FreeBSD and NetBSD (XBSD). More discussion of the standard is available from Glenn Fowler at AT&T.

A capsule summary of the Posix rules:

  • regular expressions (REs) take the leftmost starting match, and the longest match starting there
  • earlier subpatterns have leftmost-longest priority over later subpatterns
  • higher-level subpatterns have leftmost-longest priority over their component subpatterns
  • REs have right associative concatenation which can be changed with parenthesis
  • parenthesized subexpressions return the the match from their last usage
  • text of component subexpressions much be contained in the text of the higher-level subexpressions
  • if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
  • if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string

All of the above rules will be violated by one or another bug described below.

2 regex-posix-unittest

Many of these unittests come from testregex by AT&T , the rest from the author of regex-tdfa (ChrisKuklewicz). Note that all of these tests are passed by regex-tdfa-0.97.1 (but not below this).

These problems are documented here, but examples of the bugs are testable on your system by way of the regex-posix-unittest package on Hackage. This package can be installed either as "--user" or "--global", and provides three things:

  1. The "regex-posix-unittest" binary executable
  2. The "test-manifest.txt" file which lists files to load test from
  3. A set of ".txt" files containing one unit test per line

Running the regex-posix-unittest program produces output for the passing or failing of each test in each file listed in "test-manifest.txt", which is shipped listing all the test files. This is followed a summary list of failing test id numbers.

Each unit test line has four fields, separated by white-space (one tab in the above):

  1. The test identification Int, negative if this is expected to fail
  2. The regular expression pattern (extended regular expression)
  3. The text to search (no white-space) # The expected result
    1. NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
      • Each pair of numbers denotes a substring of the text to search as two 0-based indexes
      • The length of the substring is (m-n)
      • The first pair is the whole match, further pairs are for parenthesized subexpression captures
      • Subexpressions that do not match are lists as "(?,?)" instead of with numbers

You can add and delete but please do not change existing tests.

If your platform is not mentioned in this wiki page (see below) then please either add it and your summary results or email the results to "TestRegexLazy -at- mightyreason -dot- com".

One can also pass two strings to "regex-posix-unittest" to run your own test:

user@ubuntu810desktop:~/test/regex-posix-unittest$ ~/.cabal/bin/regex-posix-unittest "abcd" "([^c])+"
Test for [Char], Version {versionBranch = [1,0], versionTags = []}
"abcd"
"([^c])+"
Posix
("","ab","cd",["b"])
POSIX count = (2,2)
array (0,1) [(0,(0,2)),(1,(1,1))]
array (0,1) [(0,(3,1)),(1,(3,1))]

This actually is a slightly return format. The strings are echoed and the ("","ab","cd",["b"]) is the "" before the first match, the "ab" of the whole first match, and the "cd" after the first match, which the captures in the list ["b"]. The arrays lines are the results of consecutive non-overlapping matches, in (capture #, (start index, length)) format. So (1,(1,1)) is the 1st parenthesis and starts after the 1st character and is 1 character long: "b". The (1,(3,1)) is "d".

3 Types of failure

There are four classes of failures:

  1. Critical failures that find the wrong whole match (XBSD as of January 2009)
  2. Serious failures that find impossible subexpressions (XBSD)
  3. Serious failures that violate well-formed subexpressions (GLIBC)
  4. Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)

Example of an ambiguous situation: Matching "a" with the the pattern "(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either "(.)" or "(a)" to match the "a', but not both; thus only one of the two subexpressions must succeed and the other must fail. In this case the right answer is that "(.)" succeeds in both cases, so the full answer is (0,1)(0,1)(?,?). The first pattern uses the right associativity rule for concatenation and the second pattern uses the leftmost subpattern bias.

4 Results and Bugs

4.1 OS X, FreeBSD, NetBSD

On a G4 version of OS X 10.5.6 I see:

("/Users/chrisk/local/share/regex-posix-unittest-1.0/basic3.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/class.txt",[3,9])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/left-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/right-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/forced-assoc.txt",[5,6,7,8,15,16,21,22])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/nullsub3.txt",[50,51])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/repetition2.txt",[101,102,103,104,105,106,107,
110,111,112,113,114,115,116,117,260,261,262,263,266,267,268,270,271])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/totest.txt",[5,30,33,209])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[1,-1,2,-2,3,-3,14,-14])

The "osx-bsd-critical.txt" unexpected failures are:

############################# Unexpected Fail # 1 #############################

Searched text: "ab"
Regex pattern: "(()|.)(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 2 #############################

Searched text: "ab"
Regex pattern: "(()|[ab])(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 3 #############################

Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(3,4)(3,3)(3,3)"

############################# Unexpected Fail # 14 #############################

Searched text: "aaab"
Regex pattern: "([ab]|())+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(3,3)(3,3)"

Tests 1 and 2 should match the whole (0,2) or "ab" but only match the (1,2) of "b". Test 3 should match the (0,3) of "aabb" but only matches the (3,4) of "b". These are critical failures to match the whole text properly. This problem was found by QuickCheck and reported to OS X and FreeBSD and NetBSD (the latter two checked over IRC). OpenBSD reportedly did not show this bug (anecdote on IRC).

This also infects sed, and can be seen as

prompt$ echo "ab" | sed -En 's/(()|.)(b)/[&][\1]/p'
a[b][]

instead of the correct "[ab][a]" output.

The failure in Test 14 is different. Instead of the whole match being wrong the pattern being repeated by the "+" operator is matched 3 times against "a" and "a" and "a", followed by a fourth empty match at (3,3) before the final "b". Once the repeated pattern "[ab]|()" has matched a non-empty string it should not be used to match an empty-string.

OS X also gets this wrong in "totext.txt" #209:

Searched text: "yyyyyy"
Regex pattern: "(yyy|(x?)){2,4}"
Expected output: "(0,6)(3,6)(-1,-1)"
Actual result  : "(0,6)(6,6)(6,6)"

where the (x?) matches the empty string at (6,6) instead of stopping after 2 repetitions of "yyy".

But OS X is getting this right on numerous tests in "nullsub3.txt" such as

Expected Pass #8
text and pattern: "a"
Regex pattern: "(a*)*"
Outputs agree: "(0,1)(0,1)"

Expected Pass #47
text and pattern: "ax"
Regex pattern: "(a*)*(x)"
Outputs agree: "(0,2)(0,1)(1,2)"

Where (a*) matches "a" and is reported as (0,1) and the no additional repetition matches the empty string at (1,1).

The OS X failure in "totest.txt" #33 illustrates the "impossible" capture:

Searched text: "ababa"
Regex pattern: "(aba|ab|a)*"
Expected output: "(0,5)(2,5)"
Actual result  : "(0,5)(0,3)"

The whole match is (0,5) and the correct repetition has "ab" then "aba" matching, so the correct last repetition is the "aba" between indexes (2,5). The OS X code recognizes the whole match then does a second pass of greedy repetitions, so the first repetition is "aba" between indexes (0,3). The OS X code then fails to match the "ba" in this second pass but not care. It is logically impossible that the last iteration of the above subexpression ends before the whole match, but OS X reports the last repetition ended at index 3 even though the whole match ends at index 5.

This is also shown in numerous failed tests in "repetition2.txt" such as #260:

Searched text: "ababcd"
Regex pattern: "(a|ab|c|bcd){0,}(d*)"
Expected output: "(0,6)(3,6)(6,6)"
Actual result  : "(0,6)(4,5)(6,6)"

The correct repetitions are "ab" then "a" then "bcd", reported as (3,6). The incorrect greedy pass uses by OS X matches "ab" then "c" as (4,5) and then gives up even though (5,6) is still not matched in the second pass. As a sed test:

prompt$ echo "ababcd" | sed -En 's/(a|ab|c|bcd)*/[&][\1]/p'
[ababcd][c]

The above output should be impossible; the correct output is [ababcd][bcd].

The "forced-assoc.txt" failures show that one cannot use parenthesis to force the naturally right associative concatenation to be left associative. This is a violation of the Posix standard, and of the "man 7 re_format" page form OS X 10.5.6: "Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions". Take #15 as an example:

Searched text: "abc"
Regex pattern: "((a*)(b|abc))(c*)"
Expected output: "(0,3)(0,3)(0,0)(0,3)(3,3)"
Actual result  : "(0,3)(0,2)(0,1)(1,2)(2,3)"

Here ((a*)(b|abc))(c*) should be parsed as ((a*)(b|abc)) followed by (c*). As ((a*)(b|abc)) started earlier it should be as long as possible, at the expense of (c*). The longest match for this is the correct (0,3) not the incorrect (0,2) returned by OS X. And ((a*)(b|abc)) is a higher-level than its lower-level component (a*), so (a*) should match with the side constraint that ((a*)(b|abc)) matches (0,3), which means (a*) should match (0,0) instead of the longer (0,1). OS X gets this associativity this wrong.

In "repetitions2.txt" there is a series:

Expected Pass #100
text and pattern: "X1234567Y"
Regex pattern: "X(.?){0,}Y"
Outputs agree: "(0,9)(7,8)"

############################# Unexpected Fail # 101 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){1,}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"
...
############################# Unexpected Fail # 107 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){7,}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

Expected Pass #108
text and pattern: "X1234567Y"
Regex pattern: "X(.?){8,}Y"
Outputs agree: "(0,9)(8,8)"

############################# Unexpected Fail # 110 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){0,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

############################# Unexpected Fail # 111 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){1,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"
...

############################# Unexpected Fail # 117 #############################

Searched text: "X1234567Y"
Regex pattern: "X(.?){7,8}Y"
Expected output: "(0,9)(7,8)"
Actual result  : "(0,9)(8,8)"

Expected Pass #118
text and pattern: "X1234567Y"
Regex pattern: "X(.?){8,8}Y"
Outputs agree: "(0,9)(8,8)"

The good news is that #108 and #118 are correct, as the last repetition is forced to match the empty string after the "7" and before the "Y". The {1,} through {7,} and {1,8} through {7,8} are all wrong: they all should match the "7" at (7,8) but XBSD returns the null at (8,8). The {0,} answer is right, but {0,8} and {1,} are both wrong. So {0,} seems to handled specially, probably because {0,} is precisely the "*" operator, though the wrong {1,} is precisely the "+" operator.

The failures of XBSD in "repetitions2.txt" in the 260 to 271 range show a different ambiguity where {0,} is not matched correctly. It gets these wrong by being greedy in the expanded form of the {n,m} operators. The "a|ab|c|bcd" should choose "ab" then "a" then "bcd" at (3,6) to match "ababcd", then (d*) should match nothing at (6,6). But XBSD is matching "ab" then "ab" then "c" at (4,5) and then uses (d*) to match the last "d" at (5,6). Changing the pattern to "((a|ab|c|bcd){3,10})(d*)" fails to force the grouping and associativity.

4.2 GLIBC

The system being tested is a current Ubuntu 8.10 distribution:

user@ubuntu810desktop:~$ uname -a
Linux ubuntu810desktop 2.6.27-11-generic #1 SMP Wed Jan 28 00:02:01 UTC 2009 i686 GNU/Linux

The failing tests are

("/home/user/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[71])
("/home/user/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5,10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])
("/home/user/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,15,16,21,22,27])
("/home/user/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[41])
("/home/user/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[26,28,34,41,42,108,110,111,112,113,114,115,116])
("/home/user/.cabal/share/regex-posix-unittest-1.0/totest.txt",[5,24,26,27,36,42,83,84,110,206,207,208,209,215,250,251])
("/home/user/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[3])

The GLIBC engine has an API for its own GNU regular expression standard which differs from Posix. It also exposes a Posix compatible "regex.h" API, and GNU sed has a "--posix" switch. But the library is not really trying to get the Posix answer.

First is nested subexpressions. In Posix sets of nested parenthesis that return captures must return nested captures: the inner subexpressions return substrings of the outer substrings. Consider "totest.txt" #215:

Searched text: "ab"
Regex pattern: "((a)|(b)){2,}"
Expected output: "(0,2)(1,2)(-1,-1)(1,2)"
Actual result  : "(0,2)(1,2)(0,1)(1,2)"

Where "(a)" matches the "a" at (0,1), and the next repetion matches "(b)" matches the "b" at (1,2). Under Posix rules the "(a)" did not match in the last use of the parent subexpression ((a)|(b)), so it should be (-1,-1). Under GLIBC rules it should return (0,1). The above could largely be fixed by a filter that unsets child patterns that are not properly nested, though I worry about edge cases and empty matches.

In GLIBC this is not the case, take "osx-bsd-critical" #3 and #14 as an example:

############################# Unexpected Fail # 3 #############################
Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(2,3)(0,0)"

Expected Pass #14
text and pattern: "aaab"
Regex pattern: "([ab]|())+b"
Outputs agree: "(0,4)(2,3)(-1,-1)"

In #3, the "()" does match at the initial position, so it gets its capture reported, even though (0,0) is not a substring of (2,3). This is contradicted by #14 where "()" does not match at the initial position. This is an example of where "p|q" should be the same as "q|p" but returns a different answer.

The failure of some of the "right-assoc.txt" tests shows that GLIBC uses neither right nor left associative concatenation:

############################# Unexpected Fail # 1 #############################

Searched text: "abcd"
Regex pattern: "(a|ab)(c|bcd)(d*)"
Expected output: "(0,4)(0,2)(2,3)(3,4)"
Actual result  : "(0,4)(0,1)(1,4)(4,4)"

############################# Unexpected Fail # 2 #############################

Searched text: "abcd"
Regex pattern: "(a|ab)(bcd|c)(d*)"
Expected output: "(0,4)(0,2)(2,3)(3,4)"
Actual result  : "(0,4)(0,1)(1,4)(4,4)"

Expected Pass #3
text and pattern: "abcd"
Regex pattern: "(ab|a)(c|bcd)(d*)"
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"

Expected Pass #4
text and pattern: "abcd"
Regex pattern: "(ab|a)(bcd|c)(d*)"
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"

Test #1 through #4 check all four orderings of "a" versus "ab" and "c" versus "bcd". And GLIBC seems to be applying a greedy strategy to the "a" or "ab" subpattern. Consider these in "forced-assoc.txt":

Expected Pass #5
text and pattern: "abcd"
Regex pattern: "((a|ab)(c|bcd))(d*)"
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"

Expected Pass #6
text and pattern: "abcd"
Regex pattern: "((a|ab)(bcd|c))(d*)"
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"

############################# Unexpected Fail # 7 #############################

Searched text: "abcd"
Regex pattern: "((ab|a)(c|bcd))(d*)"
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"
Actual result  : "(0,4)(0,3)(0,2)(2,3)(3,4)"

############################# Unexpected Fail # 8 #############################

Searched text: "abcd"
Regex pattern: "((ab|a)(bcd|c))(d*)"
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"
Actual result  : "(0,4)(0,3)(0,2)(2,3)(3,4)"

In the above the extra parenthesis around the first two subexpression ought to have forced the pair to match the longest possible string "abcd" from (0,4). But #7 and #8 failed to match this because "ab" was greedily chosen.

This greedy approach is not entirely consistent, see "repetition2.txt" #108:

Searched text: "X1234567Y"
Regex pattern: "X(.?){8,}Y"
Expected output: "(0,9)(8,8)"
Actual result  : "(0,9)(7,8)"

The first seven repetitions should match 1234567 and the 8th repetition would then match an empty string at (8,8) after the "7" and before the "Y". But GLIBC does something I do not understand and ensures that the last repetition is not empty.

Another not-greedy-enough example is "basic3.txt" #71:

Searched text: "-"
Regex pattern: "(^)*"
Expected output: "(0,0)(0,0)"
Actual result  : "(0,0)(-1,-1)"

The ^ should match and capture at (0,0), but GLIBC uses zero repetitions of (^). Changing "*" to "+" causes GLIBC to correctly return the capture at (0,0).