[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Malcolm Wallace Malcolm.Wallace at cs.york.ac.uk
Mon May 14 10:18:47 EDT 2007


Henning Thielemann <lemming at henning-thielemann.de> wrote:

> > > *Text.ParserCombinators.PolyLazy>
> > >       runParser (exactly 4 (satisfy Char.isAlpha))
> > >       ("abc104"++undefined)
> > > ("*** Exception: Parse.satisfy: failed
> 
> How can I rewrite the above example that it returns
>   ("abc*** Exception: Parse.satisfy: failed

The problem in your example is that the 'exactly' combinator forces
parsing of 'n' items before returning any of them.  If you roll your
own, then you can return partial results:

    > let one = return (:) `apply` satisfy (Char.isAlpha)
      in runParser (one `apply` (one `apply`
                   (one `apply` (one `apply` return []))))
             ("abc104"++undefined)
    ("abc*** Exception: Parse.satisfy: failed

Equivalently:

    > let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f
      in runParser (one (one (one (one (return []))))) ("abc104"++undefined)
    ("abc*** Exception: Parse.satisfy: failed

Perhaps I should just rewrite the 'exactly' combinator to have the
behaviour you desire?  Its current definition is:

    exactly 0 p = return []
    exactly n p = do x <- p
                     xs <- exactly (n-1) p
                     return (x:xs)

and a lazier definition would be:

    exactly 0 p = return []
    exactly n p = return (:) `apply` p `apply` exactly (n-1) p

> How can I tell the parser that everything it parsed so
> far will not be invalidated by further input?

Essentially, you need to return a constructor as soon as you know that
the initial portion of parsed data is correct.  Often the only sensible
way to do that is to use the 'apply' combinator (as shown in the
examples above), returning a constructor _function_ which is lazily
applied to the remainder of the parsing task.
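
To see why that works, here is a minimal self-contained sketch. It is a toy parser, not polyparse's actual implementation, and the names 'unit', 'andThen', 'exactlyStrict', 'exactlyLazy' and 'alpha' are made up for illustration. I am assuming only that a failed parse shows up as an embedded 'error', as in the GHCi sessions above. The difference between the two 'exactly' variants comes down to a strict 'case' versus a lazy 'let':

```haskell
import Data.Char (isAlpha)

-- Toy lazy parser (NOT polyparse's real type): a failed parse is an
-- embedded 'error' rather than a value, so a result can be consumed
-- before the rest of the input has been checked.
newtype Parser a = P { runP :: String -> (a, String) }

unit :: a -> Parser a
unit x = P $ \s -> (x, s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c:cs) | ok c -> (c, cs)
  _             -> error "satisfy: failed"

-- Strict sequencing: the 'case' forces the first parser to finish
-- before the continuation is run.
andThen :: Parser a -> (a -> Parser b) -> Parser b
andThen p k = P $ \s -> case runP p s of
  (x, rest) -> runP (k x) rest

-- Lazy application: 'let' patterns are irrefutable, so the pair
-- (f x, rest') is returned before either sub-parser has run.
apply :: Parser (a -> b) -> Parser a -> Parser b
apply pf px = P $ \s ->
  let (f, rest)  = runP pf s
      (x, rest') = runP px rest
  in  (f x, rest')

exactlyStrict, exactlyLazy :: Int -> Parser a -> Parser [a]

-- Mirrors the do-block definition: nothing is returned until all
-- n items have been parsed.
exactlyStrict 0 _ = unit []
exactlyStrict n p =
  p `andThen` \x ->
  exactlyStrict (n - 1) p `andThen` \xs ->
  unit (x : xs)

-- Mirrors the 'apply' definition: each (:) cell is available as
-- soon as it is demanded.
exactlyLazy 0 _ = unit []
exactlyLazy n p = unit (:) `apply` p `apply` exactlyLazy (n - 1) p

alpha :: Parser Char
alpha = satisfy isAlpha
```

With these definitions, take 3 (fst (runP (exactlyLazy 4 alpha) ("abc104" ++ undefined))) evaluates to "abc", because the fourth item is never demanded. The same expression with exactlyStrict raises the 'satisfy' error before yielding any character.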

> I wondered whether 'commit' helps, but it didn't. (I thought it would
> convert a global 'fail' to a local 'error'.)

The 'commit' combinator is intended to abort a parse attempt early, once
it is known that the attempt can no longer succeed.  That's the opposite
of what you want.  By contrast, the 'apply' combinator causes a parse
attempt to succeed early, even though it may turn out to fail later.
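
As a rough illustration of the 'commit' side of that contrast, here is a separate toy sketch. Again this is not polyparse's real code: the 'Result' type with its 'Soft' and 'Hard' failures, and the helpers 'char', 'next', 'unitParens' and 'anyOpen', are hypothetical names chosen for the example. The idea shown is only that committing turns a recoverable failure into one that alternatives cannot rescue:

```haskell
-- Toy strict parser with two kinds of failure (hypothetical types,
-- not polyparse's actual ones).
data Result a
  = Ok a String   -- success: value and remaining input
  | Soft String   -- recoverable failure: alternatives may be tried
  | Hard String   -- committed failure: the whole attempt is abandoned
  deriving (Eq, Show)

newtype Parser a = P { runP :: String -> Result a }

char :: Char -> Parser Char
char c = P $ \s -> case s of
  (x:xs) | x == c -> Ok c xs
  _               -> Soft ("expected " ++ show c)

-- Run p, discard its result, then run q on the remaining input.
next :: Parser a -> Parser b -> Parser b
next p q = P $ \s -> case runP p s of
  Ok _ rest -> runP q rest
  Soft e    -> Soft e
  Hard e    -> Hard e

-- Try q only if p failed softly; a hard failure is passed through.
onFail :: Parser a -> Parser a -> Parser a
onFail p q = P $ \s -> case runP p s of
  Soft _ -> runP q s
  r      -> r

-- Upgrade any soft failure inside p to a hard one.
commit :: Parser a -> Parser a
commit p = P $ \s -> case runP p s of
  Soft e -> Hard e
  r      -> r

-- After seeing '(' we insist on ')', so the second step is committed.
unitParens, anyOpen :: Parser Char
unitParens = char '(' `next` commit (char ')')
anyOpen    = char '('
```

On the input "(x", runP (unitParens `onFail` anyOpen) gives Hard "expected ')'", and the alternative is never tried. Without the commit, (char '(' `next` char ')') `onFail` anyOpen falls back to anyOpen and gives Ok '(' "x". Neither version hands back partial results early, which is why 'commit' does not help with laziness.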

Regards,
    Malcolm

