[Haskell-cafe] Profiling makes memory leak go away? Is Haskell a practical language?

Brandon Michael Moore brandon at heave.ugcs.caltech.edu
Tue Apr 10 19:01:01 EDT 2007


On Tue, Apr 10, 2007 at 11:03:32AM -0700, Oren Ben-Kiki wrote:
> On Tue, 2007-04-10 at 12:14 +0200, apfelmus wrote:
> >Oren Ben-Kiki wrote:
> >> The code is in http://hpaste.org/1314#a1 if anyone at all is willing
> >> to take pity on me and explain what I'm doing wrong.
> >
> >There is an important point to note about streaming, namely that it
> >conflicts with knowing whether the input is syntactically correct or
> >not.
> 
> True, this is  the core issue. Think of a turing machine processing an
> infinite tape. It is writing output and producing a final result; it is
> possible to examine the output tape before knowing the final result.
> Haskell parsers insist on having the "output tape" live in memory until
> the final result is known, and then give it to you as one huge object
> when done (or discard it and just give you the error message). NOT a
> good idea if your input tape is a 200 MB file!

It's nothing to do with Haskell or memory mangagement, you just can't decide
whether the whole input is well-formed until you're done parsing, just like
you can't in general decide if a Turing machine is going to terminate until
it has.

You have to accept not knowing whether the input is well-formed until you
get to the end. There are two ways to do this that make it easy to get
streaming right. One is to have a data structure that explicitly contains
the possiblity of errors, like

data ErrList a = Another a (ErrList a) | Done | Failed err

Another is to return an ordinary structure containing values that
will raise an error when examined, and wrap a catch around the code
processing the streaming results. You might return for example a result

[1,2,3,error "parse error at 10:3 blah blah blah"]

You chose the most difficult way, returning immediately a structure
that has a field that when examined blocks until the input is done
and tells you if everything is valid.

That's tricky becuase it's very easy to make that field be some
unevaluated code that hangs onto the complete list of tokens and so
on, something like (a thunk of)
"first_line_valid && second_line_valid && ..."

GHC doesn't just go out and evaluate thunks onces their dependencies
arrive, because sometimes that's a bad idea, most obviously it it's something
like an unevaluated infinite list, say [1..], which has no free parameters.

It's the same problem you see in

--argument to break sharing
input () = 'a' : input ()

main = let text = input() in putStr (text ++ [last text])

As the infinite list is unfolded the thunk for "last text" is still
hanging onto the beginning, so it can't be garbage collected.

It happens that you can incrementally compute length as the list is
unfolded, but it's somewhat beyond the compiler to figure that out
for itself. But, you can fix it by writing a function that does
both operations together:

list_followed_by_length l = rec l 0 where
  rec (x:xs) len = len `seq` (x:rec xs (len + 1))
  rec [] = show len

Another option, if you're determined to be fancy, is to use the one
case where GHC actually does decide to evaluate something a little
bit during garbage collection. It's called a "selector thunk" -
if a piece of unevaluated code is *exactly* (after optimization)
case x of (_, .. , projected, ... _) -> projected, or an equivalent
pattern match on another data type with just a single constructor
it will be replaced by a direct reference to x as evaluation
proceeds.

If you want to go this way, add the -ddump-simpl flag to GHC and
inspect the output, and see what adding -O or -O2 does to it.

Brandon


More information about the Haskell-Cafe mailing list