[Haskell-cafe] Data.Enumerator.Text.utf8 not constant memory?

Skirmantas Kligys skirmantas.kligys at gmail.com
Tue Apr 26 04:36:25 CEST 2011


Yay, much better!

Now iterLinesWc works within 4 kilobytes, and iterLinesMine within 22
kilobytes, both nicely bounded.

Thanks a lot for your help, Felipe!


On Mon, Apr 25, 2011 at 7:03 PM, Felipe Almeida Lessa
<felipe.lessa at gmail.com> wrote:
> [CC'ing John Millikin, enumerator's maintainer]
>
> On Mon, Apr 25, 2011 at 7:10 PM, Skirmantas Kligys
> <skirmantas.kligys at gmail.com> wrote:
>> I expected to be able to do what SAX does in Java, i.e. to avoid loading the
>> whole 2 gigabytes into memory.  For warm-up, I wrote an iteratee to count lines
>> in the file, and it does load the whole file into memory!  After profiling, I
>> see that the problem was Data.Enumerator.Text.utf8, it allocates up to 60
>> megabytes when run on a 40 megabyte test file.
>
> It seems to me that this is a bug in enumerator's "strict" fold not
> being strict at all =).  The current version 0.4.9.1 of
> Data.Enumerator.List.fold is
>
> -- | Consume the entire input stream with a strict left fold, one element
> -- at a time.
> --
> -- Since: 0.4.8
> fold :: Monad m => (b -> a -> b) -> b
>       -> Iteratee a m b
> fold step = continue . loop where
>        f = L.foldl' step
>        loop acc stream = case stream of
>                Chunks [] -> continue (loop acc)
>                Chunks xs -> continue (loop (f acc xs))
>                EOF -> yield acc EOF
>
> Note that the list fold is strict (f = Data.List.foldl' step),
> *however* the acc parameter of loop isn't strict at all!  It just
> creates a big, fat thunk with references to all of you input =(.
>
> But the fix is extremely easy, just change the 'Chunks xs' line to
>
>                Chunks xs -> continue (loop $! f acc xs)
>
> Using only your iterLinesWc test with a 105 MiB file (a movie I had
> lying around), with enumerator's definition it takes 220 MiB of memory
> and 1.3~1.5 seconds according to +RTS -s.  By doing only this very
> change above, it takes 2 MiB of memory (100x improvement :P) and
> 0.8~0.9 seconds.
>
> John Millikin, could you please apply the attached patch? =)
>
> Cheers,
>
> --
> Felipe.
>



More information about the Haskell-Cafe mailing list