[Haskell-cafe] Incremental XML parsing with namespaces?

Krzysztof Skrzętnicki gtener at gmail.com
Tue Jun 9 04:33:53 EDT 2009


And just to provide an example of working program:

---
module Main where

import Text.XML.Expat.Qualified
import Text.XML.Expat.Namespaced
import Text.XML.Expat.Tree

import qualified Data.ByteString.Lazy as BSL

main = do
	f <- BSL.readFile "doc1.xml"
	let (tree,error) = parseTree Nothing f
	case error of
		Nothing -> putStrLn "Here you are: " >> (print . toNamespaced .
toQualified $ (tree :: Node String String))
		Just err -> putStrLn "Error!" >> print err
---


$ ./hexpat-test
Here you are:
Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
nnLocalPart = "doc"}, eAttrs = [(NName {nnNamespace = Just
"http://www.w3.org/2000/xmlns/", nnLocalPart =
"x"},"org:myproject:otherns"),(NName {nnNamespace = Just
"org:myproject:mainns", nnLocalPart =
"xmlns"},"org:myproject:mainns")], eChildren = [Text "\n",Text "
",Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
nnLocalPart = "title"}, eAttrs = [], eChildren = [Text "Doc
title"]},Text "\n",Text "    ",Element {eName = NName {nnNamespace =
Just "org:myproject:otherns", nnLocalPart = "ref"}, eAttrs = [],
eChildren = [Text "abc1234"]},Text "\n",Text "    ",Element {eName =
NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart
= "html"}, eAttrs = [(NName {nnNamespace = Just
"http://www.w3.org/1999/xhtml", nnLocalPart =
"xmlns"},"http://www.w3.org/1999/xhtml")], eChildren = [Element {eName
= NName {nnNamespace = Just "http://www.w3.org/1999/xhtml",
nnLocalPart = "body"}, eAttrs = [], eChildren = [Text "Hello
world!"]}]},Text "\n"]}
# we mess with doc1.xml and exchange </doc> for </do>
$ ./hexpat-test
Error!
XMLParseError "mismatched tag" (XMLParseLocation {xmlLineNumber = 5,
xmlColumnNumber = 2, xmlByteIndex = 205, xmlByteCount = 0})


Best regards

Krzysztof Skrzętnicki

2009/6/9 Krzysztof Skrzętnicki <gtener at gmail.com>:
> On Mon, Jun 8, 2009 at 20:39, John Millikin<jmillikin at gmail.com> wrote:
>> I'm trying to convert an XML document, incrementally, into a sequence
>> of XML events. A simple example XML document:
>>
>> <doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns">
>>    <title>Doc title</title>
>>    <x:ref>abc1234</x:ref>
>>    <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html>
>> </doc>
>>
>> The document can be very large, and arrives in chunks over a socket,
>> so I need to be able to "feed" the text data into a parser and receive
>> a list of XML events per chunk. Chunks can be separated in time by
>> intervals of several minutes to an hour, so pausing processing for the
>> arrival of the entire document is not an option. The type signatures
>> would be something like:
>>
>> type Namespace = String
>> type LocalName = String
>>
>> data Attribute = Attribute Namespace LocalName String
>>
>> data XMLEvent =
>>    EventElementBegin Namespace LocalName [Attribute] |
>>    EventElementEnd Namespace LocalName |
>>    EventContent String |
>>   EventError String
>>
>> parse :: Parser -> String -> (Parser, [XMLEvent])
>>
>> I've looked at HaXml, HXT, and hexpat, and unless I'm missing
>> something, none of them can achieve this:
>>
>> + HaXml and hexpat seem to disregard namespaces entirely -- that is,
>> the root element is parsed to "doc" instead of
>> ("org:myproject:mainns", "doc"), and the second child is "x:ref"
>> instead of ("org:myproject:otherns", "ref"). Obviously, this makes
>> parsing mixed-namespace documents effectively impossible. I found an
>> email from 2004[1] that mentions a "filter" for namespace support in
>> HaXml, but no further information and no working code.
>
> I would recommend hexpat to do the job. Contrary to what you are
> saying, hexpat does offer namespace handling:
> http://hackage.haskell.org/packages/archive/hexpat/0.8/doc/html/Text-XML-Expat-Namespaced.html
> Perhaps you need more than that?
>
> Personally I found hexpat to be fast, space efficient and easy to use.
>
> Here is the representation I got for your example. Please note the
> namespaces in right places.
> * > (toNamespaced ( toQualified t'))
> Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
> nnLocalPart = "doc"}, eAttrs = [(NName {nnNamespace = Just
> "http://www.w3.org/2000/xmlns/", nnLocalPart =
> "x"},"org:myproject:otherns"),(NName {nnNamespace = Just
> "org:myproject:mainns", nnLocalPart =
> "xmlns"},"org:myproject:mainns")], eChildren = [Text "\n",Text "
> ",Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
> nnLocalPart = "title"}, eAttrs = [], eChildren = [Text "Doc
> title"]},Text "\n",Text "    ",Element {eName = NName {nnNamespace =
> Just "org:myproject:otherns", nnLocalPart = "ref"}, eAttrs = [],
> eChildren = [Text "abc1234"]},Text "\n",Text "    ",Element {eName =
> NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart
> = "html"}, eAttrs = [(NName {nnNamespace = Just
> "http://www.w3.org/1999/xhtml", nnLocalPart =
> "xmlns"},"http://www.w3.org/1999/xhtml")], eChildren = [Element {eName
> = NName {nnNamespace = Just "http://www.w3.org/1999/xhtml",
> nnLocalPart = "body"}, eAttrs = [], eChildren = [Text "Hello
> world!"]}]},Text "\n"]}
>
> Best regards
>
> Krzysztof Skrzętnicki
>


More information about the Haskell-Cafe mailing list