[Haskell-cafe] Hexpat: Lazy I/O problem with huge input files

Daniel Fischer daniel.is.fischer at web.de
Wed Oct 13 17:23:19 EDT 2010


On Wednesday 13 October 2010 23:06:04, Aleksandar Dimitrov wrote:
> Hello Haskell Cafe,
>
> I really hope this is the right list for this sort of question. I've
> bugged the folks in #haskell, they say go here, so I'm turning to you.
>
> I want to use Hexpat to read in some humongous XML files (linguistic
> corpora), since it's the only Haskell XML library (I could find) that
> takes ByteStrings as input. I stumbled on a problem when using one of
> the examples from the docs of Text.XML.Expat.Tree. The "cookbook
> recipe" there suggests *first* processing the data, and only then
> looking into the parser error to see if there has been an error. I
> understand this should prevent the parse tree from being fully
> evaluated before use. Unfortunately, that is not what happens on my
> system (ghc 6.12.1, if that's of importance.)
>
> This is the code from the docs, that I modified to read files:
> > import Text.XML.Expat.Tree
> > import System.Environment (getArgs)
> > import Control.Monad (liftM)
> > import qualified Data.ByteString.Lazy as C
> >
> > -- This is the recommended way to handle errors in lazy parses
> > main = do
> >     f <- liftM head getArgs >>= C.readFile
> >     let (tree, mError) = parse defaultParseOptions f
> >     print (tree :: UNode String)
> >
> >     -- Note: We check the error _after_ we have finished our processing
> >     -- on the tree.
> >     case mError of
> >         Just err -> putStrLn $ "It failed : "++show err
> >         Nothing -> putStrLn "Success!"
>
> Given a 42M test file, an invocation like this:
>
> % ghc --make -O2 Hexpat.hs
> % ./Hexpat input.xml > dump.xml
>
> will gobble up some 2Gigs of RAM (at least. I usually kill it before
> it starts thrashing the swap space, since that almost crashes my
> entire machine.)

I don't know Hexpat at all, so I can only guess.

Perhaps due to the laziness of let-bindings, mError keeps a reference to 
the entire tuple, thus preventing tree from being garbage collected as it 
is consumed by print.

Try

main = do
    f <- liftM head getArgs >>= C.readFile
    case parse defaultParseOptions f of
      (tree, mError) -> do
        print (tree :: UNode String)
        case mError of
          Just err -> putStrLn $ "It failed: " ++ show err
          Nothing -> putStrLn "Success!"

it may fix the leak, change nothing or make it worse.
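
To make the guess concrete, here is a minimal sketch of the two binding styles without Hexpat. fakeParse is a hypothetical stand-in for the parser (a lazily produced list instead of a tree), not anything from the library. Whether the let version really retains memory depends on the compiler and the garbage collector, so treat it as an illustration of the difference, not a measurement.

```haskell
module Main where

-- Hypothetical stand-in for a lazy parser: a lazily produced "tree"
-- (here just a list) paired with an optional error message.
fakeParse :: Int -> ([Int], Maybe String)
fakeParse n = ([1 .. n], Nothing)

-- Lazy pattern binding. Per the Haskell Report this behaves roughly like
--   let p = fakeParse n; tree = fst p; mError = snd p
-- so the still-unevaluated mError can keep the pair p, and with it the
-- start of tree, reachable while tree is being consumed.
viaLet :: Int -> (Int, Maybe String)
viaLet n =
    let (tree, mError) = fakeParse n
    in (sum tree, mError)

-- Matching with case takes the pair apart before tree is consumed, so
-- mError refers to the second component directly, not to the pair.
viaCase :: Int -> (Int, Maybe String)
viaCase n =
    case fakeParse n of
      (tree, mError) -> (sum tree, mError)

main :: IO ()
main = do
    print (viaLet 1000)
    print (viaCase 1000)
```

Both functions compute the same result; only what stays reachable during the traversal can differ, which is why the case version might help with the real parser.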

> If I remove the last 3 lines:
> > import Text.XML.Expat.Tree
> > import System.Environment (getArgs)
> > import Control.Monad (liftM)
> > import qualified Data.ByteString.Lazy as C
> >
> > main = do
> >     f <- liftM head getArgs >>= C.readFile
> >     let (tree, mError) = parse defaultParseOptions f
> >     print (tree :: UNode String)
>
> the same invocation and input file barely uses a megabyte or two of
> RAM and finishes really quickly.
>
> Why is that? Is this a mistake in the Hexpat docs, or am I doing
> something wrong? Lazy IO has always been a little bit of a mystery to
> me, and just when I thought I had it...
>
> Thanks for any help on the matter!
> Aleks


