Parsing HTML

andrew cooke andrew at acooke.org
Wed Dec 10 15:22:52 EST 2003


It appears that the Haskell XML Toolbox may be what I want -
http://www.fh-wedel.de/~si/HXmlToolbox/ - but any other suggestions would
be appreciated.  Apologies for relying on Haskell.org rather than
Googling (I'll mail the web page maintainers).

Cheers,
Andrew

andrew cooke said:
>
> What are the options for parsing/lexing (X)HTML?  As far as I can see...
>
> - the HTML library in GHC (or from Andy Gill) is for creating documents,
> not parsing them
>
> - HaXml looks like it might do what I want, but (1) it seems tricky to
> install (it needs "make", which isn't that cool on Windows); (2) it has a
> load of fancy-schmancy combinator stuff, when all I want is a stream of
> tokens (something like the Java SAX interface); (3) it doesn't seem that
> solid on the basics: it doesn't seem to handle namespaces (maybe they
> appear as part of the attribute name?), and I haven't yet worked out what
> it does about other "esoteric" things like character entities, XML
> declarations, CDATA, comments, etc.  (No offense implied - it's a cool
> piece of work, it just doesn't seem to be what I'm looking for; this is
> all from reading the docs and API rather than looking at code, so I may
> be mistaken.)
>
> - nothing else on the haskell.org page appears to do parsing.
>
> I'd write it myself, but (X)HTML is deceptively complex, in my experience.
> You start off thinking it's going to be trivial (S-expressions), then you
> realise that HTML isn't XML, that there are character entities, weird
> CDATA things and namespaces, and that what you have isn't robust enough
> to parse the typical malformed pages that browsers accept (unescaped "<"
> in text, unescaped data such as "&" in URLs inside links, etc).
>
> Maybe that's why there doesn't seem to be anything?!
>
> (I'm writing a simple tool that generates web pages from templates; the
> tool data appears in attributes with a namespace (this is the standard
> trick for mixing code generation with HTML in a way that web authoring
> tools can parse).  Hence the mix of requirements for HTML and XML.)
>
> Cheers,
> Andrew
>
> --
> personal web site: http://www.acooke.org/andrew
> personal mail list: http://www.acooke.org/andrew/compute.html
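For concreteness, the "stream of tokens" asked for in the quoted message
(something like SAX) could be sketched roughly as below.  This is only an
illustration under stated assumptions: the Token type and the tokenize,
attributes and trim functions are hypothetical names, not part of HaXml or
the Haskell XML Toolbox.  The lexer deliberately ignores character entities,
CDATA, comments, self-closing tags, and ">" inside attribute values, which
are exactly the hard parts described above.

```haskell
module Main where

import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Hypothetical token type: a flat, SAX-like event stream, not a tree.
-- A namespace prefix is simply left in the name (e.g. "t:id").
data Token
  = Open String [(String, String)]  -- element name and its attributes
  | Close String                    -- element name
  | Text String                     -- character data between tags
  deriving (Eq, Show)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Naive lexer: assumes every '<' starts a tag and the next '>' ends it.
tokenize :: String -> [Token]
tokenize [] = []
tokenize ('<' : '/' : rest) =
  let (name, after) = break (== '>') rest
  in Close (trim name) : tokenize (drop 1 after)
tokenize ('<' : rest) =
  let (inside, after) = break (== '>') rest
      (name, attrText) = break isSpace inside
  in Open name (attributes attrText) : tokenize (drop 1 after)
tokenize input =
  let (txt, rest) = break (== '<') input
  in Text txt : tokenize rest

-- Parse attributes of the form name="value" (double quotes only).
attributes :: String -> [(String, String)]
attributes s =
  case dropWhile isSpace s of
    [] -> []
    s' ->
      let (name, rest) = break (== '=') s'
      in case rest of
           '=' : '"' : valueAndMore ->
             let (value, more) = break (== '"') valueAndMore
             in (trim name, value) : attributes (drop 1 more)
           _ -> []  -- unquoted or valueless attributes: stop parsing here

main :: IO ()
main = mapM_ print (tokenize "<p t:id=\"x\">hi</p>")
```

A template tool of the kind described could then pick out its own data by
filtering Open tokens for attribute names that start with its namespace
prefix ("t:" here is made up), passing every other token through unchanged.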

