[Haskell-cafe] attoparsec and backtracking

Sat Mar 16 00:26:36 CET 2013

On 3/15/13 3:29 PM, Evan Laforge wrote:
> However, which error msg shows up depends on the order of the (<|>)
> alternatives, and in general the global structure of the entire
> parser, because I think it just backtracks and then picks the last
> failing backtrack.  Even after carefully rearranging all the parsers
> it seems impossible to get this particular error to bubble up to the
> top.  The thing is, as soon as I see an unexpected suffix I know I can
> fail entirely right there, with exactly that error msg, but since
> there's no way to turn off backtracking I think there's no way to do
> that.

I had some similar issues recently. The trick is figuring out how to
convince attoparsec to commit to a particular alternative. For example,
consider the grammar: A (B A)* C; where if the B succeeds then we want to
commit to parsing an A (and if it fails then return A's error, not C's).

To simplify things, let's drop the leading A since it's not part of the
problem. And let's try to parse an invalid string like "BX" (or "BABX").
The key point is that,

    bad = (pB *> pure (:) <*> pA <*> bad) <|> (pC *> pure [])

is different than,

    good = do
        e <- eitherP pB pC -- (Left <$> pB) <|> (Right <$> pC)
        case e of
            Left  _ -> (:) <$> pA <*> good
            Right _ -> pure []

In particular, the first one is bad (for our purposes) because due to
hoisting the choice up high, after parsing the B we fail to commit, so
when parsing A fails we'll backtrack over the B and try C instead.
Assuming C doesn't overlap with B, we'll then report C's error. Whereas
the latter is good because due to pushing the choice down, once we've
parsed B (or C) we're committed to that choice; so when A fails, we'll
report A's error (or backtrack to the lowest choice that dominates the
call to good).

Attoparsec does indeed just report the failure generated by the final
parse, so you'll have to refactor things to recognize which sort of token
you're looking for (e.g., p_num vs p_identifier or whatever), and then
commit to that choice before actually parsing the token. It's not very
modular that way, but I think that's the only option right now. It
shouldn't be too hard to design a combinator for doing a hard commit (by
discarding the backtrack continuation); but that has modularity issues of
its own...

Another option, of course, is to drop down to performing lexing on the
ByteString itself (e.g., [1]) and then wrap those individual lexers to
work as attoparsec Parsers. Even if using attoparsec for the heavy
lifting, this is a good approach for maximizing performance of the lexing
step.

[1] <http://hackage.haskell.org/package/bytestring-lexing>

-- 
Live well,
~wren