[Haskell-cafe] Incremental XML parsing with namespaces?

John Millikin jmillikin at gmail.com
Mon Jun 8 23:00:03 EDT 2009


On Mon, Jun 8, 2009 at 3:39 PM, Henning Thielemann
<lemming at henning-thielemann.de> wrote:
> I think you could use the parser as it is and do the name parsing later.
> Due to lazy evaluation both parsers would run in an interleaved way.
>
I've been trying to figure out how to get this to work with lazy
evaluation, but haven't made much headway. Any tips? The only way I can
think of to get incremental parsing working is to maintain explicit
parser state between inputs, but I can't figure out how to achieve
that with any of the parsers I've tested (HaXml, HXT, hexpat).
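
For what it's worth, here is how I read the "do the name parsing later"
suggestion, sketched in Python with generators standing in for lazy
lists. This is only a rough sketch under assumptions: the event tuples
(("BEGIN", qname, attrs), ("END", qname), ("TEXT", text)) and the names
resolve_namespaces and expand_name are made up for illustration and do
not come from any of the libraries above. A first stage would yield raw
events with unresolved names; this second stage keeps a stack of
prefix-to-URI scopes and resolves names as each event is pulled:

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"

def expand_name(qname, scope, use_default):
    """Split 'prefix:local' and look the prefix up in the given scope."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # Unprefixed: elements take the default namespace, attributes do not.
        return (scope.get(None) if use_default else None, qname)
    return (scope[prefix], local)  # KeyError here means an undeclared prefix

def resolve_namespaces(raw_events):
    """Hypothetical second stage: lazily turn raw names into (uri, local)."""
    scopes = [{"xml": XML_NS}]  # stack of prefix -> URI maps, one per open element
    for event in raw_events:
        if event[0] == "BEGIN":
            _, qname, attrs = event
            scope = dict(scopes[-1])  # inherit the parent's declarations
            plain = {}
            for key, value in attrs.items():
                if key == "xmlns":
                    scope[None] = value or None  # xmlns="" undeclares the default
                elif key.startswith("xmlns:"):
                    scope[key[len("xmlns:"):]] = value
                else:
                    plain[key] = value
            scopes.append(scope)
            yield ("BEGIN", expand_name(qname, scope, True),
                   {expand_name(k, scope, False): v for k, v in plain.items()})
        elif event[0] == "END":
            scope = scopes.pop()  # the scope that was active for this element
            yield ("END", expand_name(event[1], scope, True))
        else:
            yield event  # text passes through unchanged
```

Because the generator only pulls a raw event when the consumer asks for
the next resolved one, the two stages run interleaved. The state is
still explicit, though: it lives in the scopes stack, so this does not
answer how to get a Haskell parser to accept input in pieces.
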

Here's a working example of what I'm trying to do, in Python 2. It reads
XML from stdin one line at a time, prints the events parsed from each
line, and terminates once the root element has been closed:

##########################

from xml.sax import handler, expatreader

class ContentHandler (handler.ContentHandler):
	"""Collects SAX events and tracks the element nesting depth."""
	def __init__ (self):
		self.events = []
		self.level = 0

	# With namespace processing on, `name` is a (namespace URI, local name)
	# pair and `qname` is the raw qualified name (it may be None).
	def startElementNS (self, name, qname, attrs):
		self.events.append (("BEGIN", name, qname, dict (attrs)))
		self.level += 1

	def endElementNS (self, name, qname):
		self.events.append (("END", name, qname))
		self.level -= 1

	def characters (self, content):
		self.events.append (("TEXT", content))

def main ():
	parser = expatreader.ExpatParser ()
	content = ContentHandler ()
	parser.setFeature (handler.feature_namespaces, True)
	parser.setContentHandler (content)
	got_events = False
	# Keep reading until the root element has been closed (level back to 0).
	while content.level > 0 or (not got_events):
		text = raw_input ("Enter XML:\n")
		# feed() parses only this chunk; parser state carries over.
		parser.feed (text)
		print content.events
		content.events = []
		got_events = True

if __name__ == "__main__": main()

###############################

$ python incremental.py
Enter XML:
<test xmlns="urn:test"><test2><test3>
[('BEGIN', (u'urn:test', u'test'), u'test', {}), ('BEGIN',
(u'urn:test', u'test2'), u'test2', {}), ('BEGIN', (u'urn:test',
u'test3'), u'test3', {})]
Enter XML:
</test3></test2><test2 a="b"/>text content goes here
[('END', (u'urn:test', u'test3'), None), ('END', (u'urn:test',
u'test2'), None), ('BEGIN', (u'urn:test', u'test2'), u'test2', {(None,
u'a'): u'b'}), ('END', (u'urn:test', u'test2'), None), ('TEXT', u'text
content goes here')]
Enter XML:
</test>
[('END', (u'urn:test', u'test'), None)]

#############################

As demonstrated, the parser retains state (namespaces, nesting)
between text inputs. Are there any XML parsers for Haskell that
support this incremental behavior?

