[Haskell-cafe] memory needed for SAX parsing XML

John Lato jwlato at gmail.com
Thu Apr 22 11:48:02 EDT 2010


> Message: 8
> Date: Tue, 20 Apr 2010 12:08:36 +0400
> From: Daniil Elovkov <daniil.elovkov at googlemail.com>
> Subject: Re: [Haskell-cafe] memory needed for SAX parsing XML
> To: Haskell-Cafe <haskell-cafe at haskell.org>
>
> Jason Dagit wrote:
>>
>>
>> On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov
>> <daniil.elovkov at googlemail.com>
>> wrote:
>
>> I think iteratees are slowly catching on as an alternative to lazy io.
>> Basically, the iteratee approach uses a left fold style to stream the
>> data and process it in chunks, including some exception handling.
>> Unfortunately, I think it may also require a special sax parser that is
>> specifically geared towards iteratee use.  Having an iteratee based sax
>> parser would make processing large xml streams very convenient in
>> haskell.  Hint, hint, if you want to write a library :)  (Or, maybe it
>> exists, I admit that I haven't checked.)
>
>
> Iteratees seem like a natural thing if we want to completely avoid
> unsafeInterleaveIO. But the presence of the latter is so good for
> modularity.
>
> We can glue IO code and pure code of the type String -> a so seamlessly.
> In case of iteratees, as I understand, pure functions of the type String
> -> a would no longer be usable with IO String result. The signature (and
> the code itself) would have to be changed to be left fold.

To some extent, yes, although the amount of work required can vary.

The general rule of thumb is that a function can be used directly with
iteratees if it can work strictly.  Functions that rely on laziness
need to be adapted first, and how much depends on the function.
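
As a rough illustration of the rule of thumb, here is a minimal sketch
of a strict function driven chunk by chunk.  It does not use the
iteratee package: Iter, foldIter, runChunks and countNewlines are
hypothetical names for a hand-rolled, simplified model.

```haskell
import Data.List (foldl')

-- | Simplified, hand-rolled iteratee (NOT the iteratee package's type):
-- either finished, or waiting for a chunk (Nothing = end of input).
data Iter a = Done a | Cont (Maybe String -> Iter a)

-- | Wrap a strict left-fold step as an iteratee.
foldIter :: (b -> Char -> b) -> b -> Iter b
foldIter step = go
  where
    -- Force the accumulator before waiting for more input, so no
    -- chain of thunks builds up between chunks.
    go acc = acc `seq` Cont feed
      where
        feed Nothing      = Done acc
        feed (Just chunk) = go (foldl' step acc chunk)

-- | Feed chunks one at a time, then signal end of input.
runChunks :: [String] -> Iter a -> Maybe a
runChunks _        (Done a) = Just a
runChunks (c : cs) (Cont k) = runChunks cs (k (Just c))
runChunks []       (Cont k) = case k Nothing of
  Done a -> Just a
  Cont _ -> Nothing  -- iteratee refused to finish at end of input

-- | A function that works strictly: it only ever needs the current
-- chunk and a small accumulator.
countNewlines :: Iter Int
countNewlines = foldIter (\n c -> if c == '\n' then n + 1 else n) 0
```

For example, runChunks ["ab\ncd", "\nef\n"] countNewlines processes the
two chunks in turn and never needs both in memory at once.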

If you've built your string handling functions out of a parser
combinator library, e.g. parsec-3 or attoparsec, you can just lift the
parser into an iteratee and use all your existing functions.

Since you're using HaXml, this should work.  The only missing part is
a polyparse-iteratee converter.  I haven't used polyparse, but it
looks like the converter would be similar to the one used for
attoparsec in the attoparsec-iteratee package.

That said, it's not how I would do it.  SAX is a stream processor, so
an iteratee-based SAX implementation is a natural fit.  I would write a
lexer and a parser (using a parser combinator library) and then use
those with Data.Iteratee.convStream.  HaXml already includes a suitable
lexer and parser, but unfortunately they're not exposed.
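
To make the lexer-plus-stream-conversion idea concrete, here is a small
sketch in plain Haskell.  It is an assumption-laden toy, not HaXml's
lexer and not the real convStream: SaxEvent, LexState, step and
foldEvents are hypothetical names, and the "XML" handled has no
attributes, entities or comments.  The point is only that the lexer
state is carried across chunk boundaries while a left fold consumes the
events.

```haskell
import Data.List (foldl')

-- | Hypothetical, deliberately tiny event type.
data SaxEvent = StartTag String | EndTag String | Text String
  deriving (Eq, Show)

-- | Lexer state carried between chunks; buffers are kept reversed.
data LexState = InText String | InTag String

-- | Consume one chunk: return the events completed inside it, in
-- order, plus the state to resume from on the next chunk.
step :: LexState -> String -> ([SaxEvent], LexState)
step st0 chunk = (reverse evs, st)
  where
    (evs, st) = foldl' go ([], st0) chunk
    go (acc, InText buf) '<' = (emitText buf acc, InTag "")
    go (acc, InText buf) c   = (acc, InText (c : buf))
    go (acc, InTag buf)  '>' = (tagEvent (reverse buf) : acc, InText "")
    go (acc, InTag buf)  c   = (acc, InTag (c : buf))

-- | Emit a Text event unless the buffer is empty.
emitText :: String -> [SaxEvent] -> [SaxEvent]
emitText ""  acc = acc
emitText buf acc = Text (reverse buf) : acc

tagEvent :: String -> SaxEvent
tagEvent ('/' : name) = EndTag name
tagEvent name         = StartTag name

-- | Left-fold a consumer over the events of a chunked document.
foldEvents :: (b -> SaxEvent -> b) -> b -> [String] -> b
foldEvents f z0 chunks = finish (foldl' feed (z0, InText "") chunks)
  where
    feed (z, st) chunk =
      let (evs, st') = step st chunk
          z'         = foldl' f z evs
      in  z' `seq` (z', st')
    -- Flush trailing text at end of input; an unterminated tag is dropped.
    finish (z, InText buf) = foldl' f z (emitText buf [])
    finish (z, InTag _)    = z
```

A tag split across chunks, as in ["<a>he", "llo</", "a><b>"], is still
reported as a single event because the partial tag lives in LexState
rather than in the chunk.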

>
> Another (additional) approach would be to encapsulate unsafeInterleaveIO
> within some routine and not let it go out into the wild.
>
> lazilyDoWithIO :: IO a -> (a -> b) -> IO b
>
> It would use unsafeInterleave internally but catch all IO errors within
> itself.
>
> I wonder if this is a reasonable idea? Maybe it's already done?
> So the topic is shifting...

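import Control.DeepSeq (NFData, deepseq)
-- Fully evaluate b inside IO, before the result is returned.
doWithIO :: NFData b => IO a -> (a -> b) -> IO b
doWithIO m f = m >>= \a -> let b = f a in b `deepseq` return b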

It works (just stick it in a "try" block for error handling), but you
need to write a lot of NFData instances.  You also need to be careful
that b is some sort of reduced structure, or you can end up forcing
the whole file (or other data) into memory.  It also doesn't help with
other IO effects, e.g. writing output.  Handling those effects as part
of the same stream is, I think, one of the nicest features of
iteratee-based processing.
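
A minimal sketch of the forcing wrapper combined with the suggested
"try", assuming the deepseq package is available.  The names
strictDoWithIO, safeDoWithIO and countLinesIn are hypothetical, and the
line count is just one example of a suitably reduced result.

```haskell
import Control.DeepSeq (NFData, deepseq)
import Control.Exception (SomeException, try)

-- | Run the action, apply the pure function, and fully evaluate the
-- result before returning, so that errors hidden in unevaluated data
-- surface here rather than at some later use site.
strictDoWithIO :: NFData b => IO a -> (a -> b) -> IO b
strictDoWithIO m f = do
  a <- m
  let b = f a
  b `deepseq` return b

-- | The same, with any exception captured as a value.
safeDoWithIO :: NFData b => IO a -> (a -> b) -> IO (Either SomeException b)
safeDoWithIO m f = try (strictDoWithIO m f)

-- | A reduced result: a single Int rather than the file contents, so
-- forcing it does not require keeping the whole file in memory.
countLinesIn :: FilePath -> IO (Either SomeException Int)
countLinesIn path = safeDoWithIO (readFile path) (length . lines)
```

If the pure function instead returned, say, the full list of lines,
deepseq would pull the entire input into memory, which is the hazard
described above.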

John

