[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0

Wed Jul 28 18:07:48 EDT 2010

> Maybe I underestimated the utility of ^ and $. The definition seems
> intricate. I thought about adding a combinator for matching newline but
> now think that would lead to wrong start and end positions. For example
> the start position of the matching substring for  ^a  in  "a\na"  should
> be 2 not 1, right? Or is it 0 although there is no newline at the
> beginning?

The first "a" would match with indexes (0,1) and the second "a" would
match with indexes (1,2).

>
> Is there a page with examples that show how ^ and $ should behave exactly?
>

Without REG_NEWLINE the meanings are:

. matches any single character (though note that handling of a zero byte
is impossible for C style strings for a different reason).

^ is an assertion that instead of being AlwaysTrue (eps)or AlwaysFalse
(noMatch) is true before any characters have been accepted and false
afterward.

$ is an assertion that is true only when there are no more characters to
match and false before this.

With REG_NEWLINE the meanings are:

. matches any single character EXCEPT '\n' newline (ASCII 10, I think).

^ is true before any characters have been matched and true right after a
newline has been matched, else false.

^ is true when there are no more characters to match and true if the
next character to match is a newline, else false.

Let 'a' and 'b' and 'c' be some complicated regular expressions that
cannot accept a newline with REG_NEWLINE enabled:

^$ finds blank lines, the indexes between newlines or between a newline
and the start or end of the text.

^a$ requires 'a' to exactly fill a line and the captured string has no
newlines.

A more complicated use, perhaps as part of a crazy parser:
"(a(\n)?)(^|b)(c|$)" has 'a' much some text and perhaps the newline.  If
the newline was there then the ^ matches and b might be skipped,
otherwise b must be used.  The match ends with '(c|$)' is thus either
starting the new line or trailing b.  And (c|$) can avoid matching 'c'
if the next character is a newline.

Note that the regular expression "(^|[aA])" has a non-trivial
"can_accept_empty" property: it can sometimes accept empty.

And if you are recording parenthetical captures then "(^)?" is subtle.
When ^ is true the (^) succeeds like () and when it is false it does
not.  This inserts a test into the pattern that can be checked later.

And "((^$)|(^)|($))" is worse: it does not always succeed and which
sub-pattern gets captured depends on the presence of one or two newlines.

In "((^)|(^$))" it is impossible for (^$) to be used since the first (^)
will always be favored by the POSIX rules.

Similarly "(()|(^))" will never use (^).

A small chunk of regex-tdfa sifts through the possible ways to accept 0
characters for each node in the parse-tree and keeps an ordered list of
sets of assertions to check, and cleans outs those that are logically
excluded.

Slightly more useful anchors are added in Perl/PCRE:
> ANCHORS AND SIMPLE ASSERTIONS
>          \b          word boundary
>          \B          not a word boundary
>          ^           start of subject
>                       also after internal newline in multiline mode
>          \A          start of subject
>          $           end of subject
>                       also before newline at end of subject
>                       also before internal newline in multiline mode
>          \Z          end of subject
>                       also before newline at end of subject
>          \z          end of subject
>          \G          first matching position in subject

I added \b \B as above, and added \` \' to be like \A and \Z above, and
added  \< and \> to be beginning and end of word assertions.

With enough assertions and negated assertions one could level up to
using a binary decision diagram to express when a sub-pattern can accept
0 characters.

Ville's libtre gets this wrong:
> Searched text: "searchme"
> Regex pattern: "((s^)|(s)|(^)|($)|(^.))*"
> Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)"
> Actual result  : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)"
And sometimes very wrong:
> Searched text: "searchme"
> Regex pattern: "s(^|())e"
> Expected output: "(0,2)(1,1)(1,1)"
> Actual result  : "NOMATCH"

Cheers,
  Dr. Chris Kuklewicz