[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Malcolm Wallace Malcolm.Wallace at cs.york.ac.uk
Fri May 11 10:57:15 EDT 2007


Henning Thielemann <lemming at henning-thielemann.de> wrote:

> HaXml has a so called lazy parser, but it is not what I consider lazy:

Lazy parsing is rather subtle, and it is easy to write a too-strict
parser when one intended to be more lazy.  Equally, it can be easy to
imagine that the parser is too strict, when in fact it is the usage
context that is wrong.

You have indeed found some bugs in HaXml's lazy HTML parser, but you
have also partly misunderstood what lazy parsing means.

>  Text.XML.HaXml.Pretty.document $ htmlParse "text" $
>      "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined

The problem here is not that the parser is too strict, but that the
pretty-printer is.  The pretty-printer is demanding an undefined portion
of the value before it produces any output.

> If it would be lazy, it would return some HTML code before the error.

It can do, but only if you consume the part without errors first!  For
instance, it would be safe to extract just the <head> tag, because that
is complete:

    import Text.XML.HaXml
    import Text.XML.HaXml.Posn
    import Text.XML.HaXml.Html.ParseLazy as Lazy
    import Text.XML.HaXml.Pretty as PP
    import Text.PrettyPrint.HughesPJ

    main = putStrLn $ render $ fsep PP.content $
           -- the following line extracts just the first child tag of <html>
           (\(Document _ _ e _)-> (position 0 children) (CElem e nopos)) $
           Lazy.htmlParse "text" $
           "<html><head></head><body>"++undefined++"</body></html>"

Unfortunately, this program currently does throw an "undefined"
exception, even though it should not.  The lazy HTML parser contains a
couple of tricky corners that _probably_ stop it from being lazy.
  (1) The element parser does not immediately return an element after
      seeing its start tag, because it also has to return a stack of
      improperly terminated elements inside this one (so they can be
      repaired).
  (2) After parsing, we simplify the tree structure, which of course
      traverses it, and may again force too much evaluation.
In any case, I will need to investigate further, and hopefully soon push
a patch to fix the problem.

> HaXML uses the Polyparse package for parsing which contains a so
> called lazy parser. However it has return type (Either String a). That
> is, for the decision whether the parse was successful, the document
> has to be parsed completely.

Not true.  PolyLazy.runParser has the signature
    runParser :: Parser t a -> [t] -> (a, [t])
that is, it returns the partially parsed value (which may contain
bottoms), and the remaining unparsed token-stream.  (Examining either of
these return values may cause sufficient evaluation to be forced to lead
to a runtime exception.)  There is no 'Either' type at the user level.
(Although an Either is used internally, see below, it does not do what
you think).

> *Text.ParserCombinators.PolyLazy>
>       runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
> ("*** Exception: Parse.satisfy: failed

This output is exactly correct.  You asked for the first four characters
provided that they were alphabetic, but in fact only the first three
were alphabetic.  Hence, 'satisfy' failed and threw an exception.  If
you ask for only the first three characters, then the parse succeeds:

  > fst $ runParser (exactly 3 (satisfy Char.isAlpha))
                    ("abc104"++undefined)
  "abc"

The purpose of the internal Either type that you mentioned, is to permit
backtracking within the parse, not to force complete evaluation.  Thus,
you can equally ask for the first four characters provided they are
alphanumeric, where alphanumeric is decided by a combination of
alternate parsers:

  > fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail`
                                    satisfy Char.isDigit))
                    ("abc104"++undefined)
  "abc1"

This example illustrates that a parse failure is still recoverable when
parsing lazily (but only by another parser, not once the failure has
escaped the parsing world).

Regards,
    Malcolm


More information about the Haskell-Cafe mailing list