Space faults in HaXml

Joe English [email protected]
Sun, 04 Nov 2001 18:36:06 -0800


Last week, Dmitry Astapov posted a message to haskell-cafe
describing a HaXml program that was running out of memory
on large inputs.  I've done some investigation on this;
here's what I've discovered so far:

    + The HaXml parser is overly eager; it doesn't produce
      a value until the entire input has been read.
      The heap profile for Dmitry's program shows heap
      usage climbing steadily up to a large peak as the
      document is being parsed, then a sharp drop as
      output begins, then climbs again as the output
      is produced.  I suspect that this behaviour is typical
      of most HaXml programs.

      I replaced the parser with HXML (recently announced here);
      this flattens out the first peak, but the second one persists.

      (HXML uses an incremental parser -- instead of parsing a Document,
      it parses individual Events (start-tag, character data, end-tag,
      et cetera) and assembles them into a tree with a separate
      function.)

    + nhc98's heap profiler is fantastic.

    + Under nhc98 (v1.10), 'putStr' doesn't seem to like overly-long
      strings.  This leads to trouble since the HaXml main driver uses
      a single call to 'hPutStrLn' to write the entire output document.

      Replacing this with 'mapM_ putChar' (or using GHC 5.02), fixes
      this problem, but the program still leaks.

    + The nhc98 heap profiler is amazingly useful.

    + HaXml uses the Hughes & Peyton Jones Pretty printer to format
      the output.  This appears to have a slow leak.  (Later tests
      show that the peak consists of "void" heap cells from the
      Pretty module, mostly suspended calls to Prelude.Int-
      and Prelude.Int+.)

      Figuring that it's overkill anyway I replaced it with a simpler
      serializer.

    + After this change there's *still* a space fault, which
      with the help of nhc's profiler (which, BTW, is great), I
      tracked down to the HaXml 'cat' combinator:

      	cat :: [a -> [b]] -> (a -> [b])
	cat fs = \e -> concat [f e | f <- fs]

Earlier I had reported that rewriting Dmitry's test case
to use the HXML internal representation directly instead
of converting to HaXml's representation fixed the space leak.
As it turns out, I didn't implement the 'cat' combinator,
but instead used a binary concatenation operator:

    (+++) :: (a -> [b]) -> (a -> [b]) -> (a -> [b])
    f +++ g = \x -> f x ++ g x

and my implementation of mkElem had the signature

    mkElem :: String -> CFilter -> CFilter

instead of HaXml's

    mkElem :: String -> [CFilter] -> CFilter.

which uses 'cat' to process the second argument.

Backporting this to HaXml:

    mkElem' name cf = \x -> [CElem (Elem h [] (cf x))]

and modifying Dmitry's test case to use mkElem' and +++
instead of mkElem and cat finally fixes the problem.
I'm still not sure *why* this does the trick --
hopefully somebody smarter can explain that --
but the modified program runs in bounded space,
no matter how large the input file is.

(It even works in Hugs, which I found surprising, since
the HXML tree builder has a known problem when run with
Hugs' garbage collector.)


--Joe English

  [email protected]