Personal tools

Regex Posix

From HaskellWiki

(Difference between revisions)
Jump to: navigation, search
m
m
Line 1: Line 1:
 
= regex-posix bugs =
 
= regex-posix bugs =
   
The regular expessions provided by the GHC bundle (up to 6.10.1) are
+
The regular expressions provided by the GHC bundle (up to 6.10.1) are
 
through a wrapping of the operating system's native C library. The C
 
through a wrapping of the operating system's native C library. The C
 
library calls in
 
library calls in
Line 8: Line 8:
 
0.96.*).
 
0.96.*).
   
Unfortunately the native platforms provide Posix regular experssions
+
Unfortunately the native platforms provide Posix regular expressions
 
support that contains bugs and/or violates the
 
support that contains bugs and/or violates the
[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html specification]. This is expecially true for the GNU C library (GLIBC)
+
[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html specification]. This is especially true for the GNU C library (GLIBC)
 
used by Linux distributions and the BSD C library used by OS X and
 
used by Linux distributions and the BSD C library used by OS X and
 
FreeBSD and NetBSD (XBSD). More discussion of the standard is
 
FreeBSD and NetBSD (XBSD). More discussion of the standard is
Line 35: Line 35:
 
summary list of failing test id numbers.
 
summary list of failing test id numbers.
   
Each unit test line has four fields, sepatated by whitespace (one tab
+
Each unit test line has four fields, separated by white-space (one tab
 
in the above):
 
in the above):
   
 
# The test identification Int, negative if this is expected to fail
 
# The test identification Int, negative if this is expected to fail
 
# The regular expression pattern (extended regular expression)
 
# The regular expression pattern (extended regular expression)
# The text to search (no whitespace) # The expected result
+
# The text to search (no white-space) # The expected result
 
## NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
 
## NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
##* Each pair of numbers denotes a substring of the text to search as two 0-based indices
+
##* Each pair of numbers denotes a substring of the text to search as two 0-based indexes
 
##* The length of the substring is (m-n)
 
##* The length of the substring is (m-n)
##* The first pair is the whole match, futher pairs are for parenthesized subexpression captures
+
##* The first pair is the whole match, further pairs are for parenthesized subexpression captures
 
##* Subexpressions that do not match are lists as "(?,?)" instead of with numbers
 
##* Subexpressions that do not match are lists as "(?,?)" instead of with numbers
   
Line 61: Line 61:
 
# Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)
 
# Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)
   
Example of an ambigious situation: Matching "a" with the the pattern
+
Example of an ambiguous situation: Matching "a" with the the pattern
 
"(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either
 
"(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either
 
"(.)" or "(a)" to match the "a', but not both; thus only one of the
 
"(.)" or "(a)" to match the "a', but not both; thus only one of the
Line 175: Line 175:
 
"aba" matching, so the correct last repetition is the "aba" between
 
"aba" matching, so the correct last repetition is the "aba" between
 
indexes (2,5). The OS X code recognizes the whole match then does a
 
indexes (2,5). The OS X code recognizes the whole match then does a
second pass of greedy repetitons, so the first repetition is "aba"
+
second pass of greedy repetitions, so the first repetition is "aba"
 
between indexes (0,3). The OS X code then fails to match the "ba" in
 
between indexes (0,3). The OS X code then fails to match the "ba" in
 
this second pass but not care. It is logically impossible that the
 
this second pass but not care. It is logically impossible that the
last iteration of the above subexpession ends before the whole match,
+
last iteration of the above subexpression ends before the whole match,
 
but OS X reports the last repetition ended at index 3 even though the
 
but OS X reports the last repetition ended at index 3 even though the
 
whole match ends at index 5.
 
whole match ends at index 5.
Line 196: Line 196:
 
The above output should be impossible; the correct output is [ababcd][bcd].
 
The above output should be impossible; the correct output is [ababcd][bcd].
   
The "forced-assoc.txt" failures show that one cannout use parenthesis
+
The "forced-assoc.txt" failures show that one cannot use parenthesis
 
to force the naturally right associative concatenation to be left
 
to force the naturally right associative concatenation to be left
 
associative. This is a violation of the Posix standard, and of the
 
associative. This is a violation of the Posix standard, and of the

Revision as of 15:56, 2 February 2009

Contents

1 regex-posix bugs

The regular expressions provided by the GHC bundle (up to 6.10.1) are through a wrapping of the operating system's native C library. The C library calls in "regex.h" have been imported via FFI and wrapped by regex-posix, either the older 0.72.* version or the newer version (> 0.96.*).

Unfortunately the native platforms provide Posix regular expressions support that contains bugs and/or violates the specification. This is especially true for the GNU C library (GLIBC) used by Linux distributions and the BSD C library used by OS X and FreeBSD and NetBSD (XBSD). More discussion of the standard is available from Glenn Fowler at AT&T.

2 regex-posix-unittest

Many of these unittests come from testregex by AT&T, the rest from the author of regex-tdfa (ChrisKuklewicz).

These problems are documented here, but examples of the bugs are testable on your system by way of the regex-posix-unittest package on Hackage. This package can be installed either as "--user" or "--global", and provides three things:

  1. The "regex-posix-unittest" binary executable
  2. The "manifest.txt" file which lists files to load test from
  3. A set of ".txt" files containing one unit test per line

Running the regex-posix-unittest program produces output for the passing or failing of each test in each file listed in "manifest.txt", which is shipped listing all the test files. This is followed a summary list of failing test id numbers.

Each unit test line has four fields, separated by white-space (one tab in the above):

  1. The test identification Int, negative if this is expected to fail
  2. The regular expression pattern (extended regular expression)
  3. The text to search (no white-space) # The expected result
    1. NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds
      • Each pair of numbers denotes a substring of the text to search as two 0-based indexes
      • The length of the substring is (m-n)
      • The first pair is the whole match, further pairs are for parenthesized subexpression captures
      • Subexpressions that do not match are lists as "(?,?)" instead of with numbers

You can add and delete but please do not change existing tests.

If your platform is not mentioned in this wiki page (see below) then please either add it and your summary results or email the results to "TestRegexLazy -at- mightyreason -dot- com".

3 Types of failure

There are four classes of failures:

  1. Critical failures that find the wrong whole match (XBSD as of January 2009)
  2. Serious failures that find impossible subexpressions (XBSD)
  3. Serious failures that violate well-formed subexpressions (GLIBC)
  4. Failure to choose Posix captures when the answer is ambiguous (GLIBC and XBSD)

Example of an ambiguous situation: Matching "a" with the the pattern "(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either "(.)" or "(a)" to match the "a', but not both; thus only one of the two subexpressions must succeed and the other must fail. In this case the right answer is that "(.)" succeeds in both cases, so the full answer is (0,1)(0,1)(?,?). The first pattern uses the right associativity rule for concatenation and the second pattern uses the leftmost subpattern bias.

4 Results and Bugs

4.1 OS X, FreeBSD, NetBSD

On a G4 version of OS X 10.5.6 I see:

("/Users/chrisk/local/share/regex-posix-unittest-1.0/basic3.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/class.txt",[3,9])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/left-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/right-assoc.txt",[])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/forced-assoc.txt",[5,6,7,8,15,16,21,22])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/nullsub3.txt",[50,51])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/repetition2.txt",[101,102,103,104,105,106,107,
110,111,112,113,114,115,116,117,260,261,262,263,266,267,268,270,271])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/totest.txt",[5,30,33,209])
("/Users/chrisk/local/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[1,-1,2,-2,3,-3,14,-14])

The "osx-bsd-critical.txt" unexpected failures are:

############################# Unexpected Fail # 1 #############################

Searched text: "ab"
Regex pattern: "(()|.)(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 2 #############################

Searched text: "ab"
Regex pattern: "(()|[ab])(b)"
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"
Actual result  : "(1,2)(1,1)(1,1)(1,2)"

############################# Unexpected Fail # 3 #############################

Searched text: "aaab"
Regex pattern: "(()|[ab])+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(3,4)(3,3)(3,3)"

############################# Unexpected Fail # 14 #############################

Searched text: "aaab"
Regex pattern: "([ab]|())+b"
Expected output: "(0,4)(2,3)(-1,-1)"
Actual result  : "(0,4)(3,3)(3,3)"

Tests 1 and 2 should match the whole (0,2) or "ab" but only match the (1,2) of "b". Test 3 should match the (0,3) of "aabb" but only matches the (3,4) of "b". These are critical failures to match the whole text properly. This problem was found by QuickCheck and reported to OS X and FreeBSD and NetBSD (the latter two checked over IRC). OpenBSD reportedly did not show this bug (anecdote on IRC).

This also infects sed, and can be seen as

prompt$ echo "ab" | sed -En 's/(()|.)(b)/[&][\1]/p'
a[b][]

instead of the correct "[ab][a]" output.

The failure in Test 14 is different. Instead of the whole match being wrong the pattern being repeated by the "+" operator is matched 3 times against "a" and "a" and "a", followed by a fourth empty match at (3,3) before the final "b". Once the repeated pattern "[ab]|()" has matched a non-empty string it should not be used to match an empty-string.

OS X also gets this wrong in "totext.txt" #209:

Searched text: "yyyyyy"
Regex pattern: "(yyy|(x?)){2,4}"
Expected output: "(0,6)(3,6)(-1,-1)"
Actual result  : "(0,6)(6,6)(6,6)"

where the (x?) matches the empty string at (6,6) instead of stopping after 2 repetitions of "yyy".

But OS X is getting this right on numerous tests in "nullsub3.txt" such as

Expected Pass #8
text and pattern: "a"
Regex pattern: "(a*)*"
Outputs agree: "(0,1)(0,1)"

Expected Pass #47
text and pattern: "ax"
Regex pattern: "(a*)*(x)"
Outputs agree: "(0,2)(0,1)(1,2)"

Where (a*) matches "a" and is reported as (0,1) and the no additional repetition matches the empty string at (1,1).

The OS X failure in "totest.txt" #33 illustrates the "impossible" capture:

Searched text: "ababa"
Regex pattern: "(aba|ab|a)*"
Expected output: "(0,5)(2,5)"
Actual result  : "(0,5)(0,3)"

The whole match is (0,5) and the correct repetition has "ab" then "aba" matching, so the correct last repetition is the "aba" between indexes (2,5). The OS X code recognizes the whole match then does a second pass of greedy repetitions, so the first repetition is "aba" between indexes (0,3). The OS X code then fails to match the "ba" in this second pass but not care. It is logically impossible that the last iteration of the above subexpression ends before the whole match, but OS X reports the last repetition ended at index 3 even though the whole match ends at index 5.

This is also shown in numerous failed tests in "repetition2.txt" such as #260:

Searched text: "ababcd"
Regex pattern: "(a|ab|c|bcd){0,}(d*)"
Expected output: "(0,6)(3,6)(6,6)"
Actual result  : "(0,6)(4,5)(6,6)"

The correct repetitions are "ab" then "a" then "bcd", reported as (3,6). The incorrect greedy pass uses by OS X matches "ab" then "c" as (4,5) and then gives up even though (5,6) is still not matched in the second pass. As a sed test:

prompt$ echo "ababcd" | sed -En 's/(a|ab|c|bcd)*/[&][\1]/p'
[ababcd][c]

The above output should be impossible; the correct output is [ababcd][bcd].

The "forced-assoc.txt" failures show that one cannot use parenthesis to force the naturally right associative concatenation to be left associative. This is a violation of the Posix standard, and of the "man 7 re_format" page form OS X 10.5.6: "Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions". Take #15 as an example:

Searched text: "abc"
Regex pattern: "((a*)(b|abc))(c*)"
Expected output: "(0,3)(0,3)(0,0)(0,3)(3,3)"
Actual result  : "(0,3)(0,2)(0,1)(1,2)(2,3)"

Here ((a*)(b|abc))(c*) should be parsed as ((a*)(b|abc)) followed by (c*). As ((a*)(b|abc)) started earlier it should be as long as possible, at the expense of (c*). The longest match for this is the correct (0,3) not the incorrect (0,2) returned by OS X. And ((a*)(b|abc)) is a higher-level than its lower-level component (a*), so (a*) should match with the side constraint that ((a*)(b|abc)) matches (0,3), which means (a*) should match (0,0) instead of the longer (0,1). OS X gets this wrong.

4.2 GLIBC

to be posted