[Haskell-cafe] How on Earth Do You Reason about Space?

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Jun 1 01:55:16 CEST 2011


On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
> I can't reproduce the space leak here.  I tried Aleksander's original code,
> my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
> bytestring version.

Unfortunately, I can't post the actual corpus here, because it's copyrighted.
But there are plenty of ways to retrieve large amounts of data from the
Internet. See below.


> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
>      I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty

Folding the specialised function right into the enumeratee is a nifty trick!
One can tell that my experience with iteratee/enumerator amounts to only a
day's worth so far :-\
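
For a comparison point that doesn't involve iteratee at all, something along
the following lines should be roughly equivalent (only a sketch; Data.Map
stands in for whatever my table type T actually is, and I haven't benchmarked
this exact snippet). It assumes one word per line, since the scrape.sh script
below already splits on non-alphabetic characters:

import           Data.List                  (foldl')
import qualified Data.ByteString.Char8      as S
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Map                   as M

wordCounts :: L.ByteString -> M.Map S.ByteString Int
wordCounts = foldl' step M.empty . filter (not . L.null) . L.lines
  where
    -- Copy each word into a fresh strict ByteString so the map never
    -- retains a whole input chunk through a small slice, and use
    -- containers' strict insertWith' so the counts don't pile up as
    -- (+) thunks.
    step m w = M.insertWith' (+) (S.copy (S.concat (L.toChunks w))) 1 m

main :: IO ()
main = L.readFile "all_text.txt" >>= mapM_ print . M.toList . wordCounts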

> None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> Performance was pretty comparable for every version, although Aleksander's
> original did seem to have a very small edge.

How big were your input corpora?

Here's an absolutely evil shell script that is going to make me donate money to
Project Gutenberg. It gathers a corpus that is still very small, but at least
realistic in its distributional properties.

+++ scrape.sh

#!/bin/sh

textfile=all_text.txt

touch $textfile

text=0
size=0
# Gutenberg e-texts are numbered; not every number in the range exists,
# hence the existence check on the zip file.
for i in `seq 10 300`; do
	wget -q "http://www.gutenberg.org/files/$i/$i.zip"
	if [ -f $i.zip ]; then
		unzip -qq $i.zip
		# squeeze every run of non-alphabetic characters into a single
		# newline, i.e. one word per line
		tr -sc '[:alpha:]' '\n' < $i.txt >> $textfile
		text=`dc -e "$text 1 + p"`
		size=`du -h $textfile | cut -f 1`
		rm -f $i.zip $i.txt
	fi
	# printf instead of echo -n: \r handling in echo varies between shells
	printf "\rFound %s of %s texts, total %s." "$text" "$i" "$size"
done
printf "\rFound %s texts, total %s\n" "$text" "$size"

+++

It'll take a while to run, and the resulting corpus is just a fraction of what
I'm using, but it serves well to illustrate the problem. If you want, you can
increase the amount of data gathered by simply widening the numerical range
given to seq. (If you make it gobble up more bandwidth, it might be polite to
put a sleep somewhere in the inner loop. I'd host the resulting file myself,
but I don't know whether there are any conditions on redistributing it.)

If you did not tweak the script, it should've gathered some 100MB of data.

Running my unmodified original program on this corpus, htop records 320MB of
RAM usage towards the end (*without* profiling enabled).

So it seems that I can't get the memory footprint below around 3x the input
file size. Luckily, the dependency seems to be linear. Here's some profiling:

<<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s system 99% cpu 52.064 total

ghc itself reports 38MB avg (I can live with that) and 140MB max (too much).

Redirecting the program's output to a file will yield a mere 2.2M for the data
gathered by the above script. Since those 2.2M of data are all I care about, why
do I need so much more RAM to compute them?
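
My own back-of-envelope guess at where the memory goes, with assumed (not
measured) per-entry sizes, so take it with a grain of salt:

-- Rough 64-bit heap costs per table entry (my assumptions, not measured):
--   strict ByteString key  ~ 9 words of wrapper + the payload bytes
--   boxed Int count        ~ 2 words
--   interior map node      ~ 6 words
bytesPerEntry :: Int -> Int
bytesPerEntry payload = (9 + 2 + 6) * 8 + payload

-- If the 2.2M of output corresponds to roughly 200,000 distinct words of
-- ~8 bytes each (a guess), the live table alone is
--   200000 * bytesPerEntry 8  ~  29MB
-- which is already the order of magnitude of the average residency in the
-- stats line above. On top of that, the copying collector needs room for a
-- second copy of the live data during a major GC, so "in use" being 2-3x
-- the max residency (322M against ~140M) is expected rather than proof of
-- a leak by itself.

Whether the 140MB peak residency itself is avoidable is exactly what I'd like
to find out.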

Are my demands unreasonable?

> I'd be happy to help track down a space leak in iteratee, but for now I'm
> not seeing one.

Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space
leak. But I really do think that the program is eating too much RAM.

Regards,
Aleks

