[Haskell-cafe] How on Earth Do You Reason about Space?

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Jun 1 00:15:47 CEST 2011


On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
> I can't reproduce heap usage growing with the
> size of the input file.
> 
> I made a word list from Project Gutenberg's
> copy of "War and Peace" by
> 
> tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
> 
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist
> shows about 100MB of address space used according
> to top, and no more than 5MB or so of haskell heap
> used according to the memory profile, with a flat
> memory profile.

This setup leads to very little variance in your data. The effect I'm experiencing
is so small in this case that it's barely observable (but it is observable; see below.)

> Is your memory usage growing with the size of the input
> file, or the size of the histogram?

While the histogram naturally grows with the size of the input file, memory usage
seems to be proportional mainly to the histogram. Due to the long tail of word
frequencies, the histogram will keep growing in a real setting, as opposed to one
that merely replicates the same data. In your test case, the histogram is *not*
growing with the size of the input file.

Memory usage being proportional to the histogram, which in a realistic setting is
itself roughly proportional to the corpus size, is not the problem. The problem is
that, compared to the size of the histogram, the memory consumption is
*disproportionately* high. Here's some more data, using your Tolstoy example:

du file.txt
344M	file.txt

<<ghc: 86596536784 bytes, 163747 GCs, 8798605/11210760 avg/max bytes residency
(2348 samples), 30M in use, 0.00 INIT (0.00 elapsed), 59.20 MUT (60.44 elapsed),
60.28 GC (60.20 elapsed) :ghc>>

As you can see, memory residency is roughly 8 MB on average and 10.7 MB at the
maximum. This is the War and Peace file replicated 100 times. Now let's look at
the result for the file *without* the 100-fold replication:

du words.txt
3.0M	words.txt

<<ghc: 805440768 bytes, 1519 GCs, 5106482/9552824 avg/max bytes residency (35
samples), 25M in use, 0.00 INIT (0.00 elapsed), 0.56 MUT (0.59 elapsed), 0.50 GC
(0.53 elapsed) :ghc>>

4.8 MB average, 9.1 MB maximum. It seems the input file size *does* have an effect
of sorts, but it's negligible. What is more interesting is this: the input file is
all of 3 MB. How on earth does the program consume almost 5 MB *on average*? This
is *not* constant memory usage. It is memory usage trivial enough not to be worth
a fuss for small inputs, but unfortunately it grows as soon as you increase the
file size in a realistic fashion, i.e. one where the unigram count increases along
with it.
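
(For reference, those <<ghc: ... :ghc>> lines are the one-line accounting summary
that GHC's runtime prints when you pass it -t, roughly like the invocation below;
the binary name is made up, and with GHC 7 the program may need to have been built
with -rtsopts:)

./unigrams words.txt +RTS -t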

> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.

Ouch, that sounds like something worth fixing.
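
If key sharing ever does turn out to be the culprit, the usual fix is to copy each
key before it goes into the map, so that a short word cannot keep an entire 64K
input chunk alive. A minimal sketch of what I mean, assuming strict ByteString keys
and unordered-containers' Data.HashMap.Strict (both of which are assumptions on my
part):

import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as M

-- Copy the key so the map entry owns its own small buffer instead of
-- a slice into the chunk it was read from.
bump :: M.HashMap B.ByteString Int -> B.ByteString -> M.HashMap B.ByteString Int
bump m w = M.insertWith (+) (B.copy w) 1 m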

> The other thing that occurs to me is that the total size of
> your keys would also be approximately the size of the input
> file if you were using plain text without each word split onto
> a separate line.

Well, I am not. The corpus is already one word per line; I read a word per line and
add it to my hash map. That should never result in a data structure anywhere near
the size of the original corpus.
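
To be concrete, the core of the program is essentially the following. This is a
stripped-down stand-in, not the actual iteratee-based code, and the strict readFile
plus the hard-coded file name are only there to keep the sketch self-contained for
a small test input:

import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as M
import Data.List (foldl')

-- One token per line, counted into a strict hash map.
countUnigrams :: B.ByteString -> M.HashMap B.ByteString Int
countUnigrams = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty . B.lines

main :: IO ()
main = do
  -- The real program streams the corpus instead of reading it whole.
  contents <- B.readFile "words.txt"
  print (M.size (countUnigrams contents))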

It could, in a very unrealistic worst-case scenario, but even a 30 GB corpus of Poe
and Heidegger isn't going to make that happen, and mine is no such scenario anyway.
As I said, reducing the corpus to its set of words (i.e. a set of unigrams) turns a
1.4 GB corpus into a 40 MB file. Why does producing that 40 MB file from a 1.4 GB
corpus make my trivial little program need somewhere north of 6-8 GB of RAM?

In this trivial War and Peace example, why does building the unigram table, which
ends up a mere 201 KB on disk, chomp through 5 MB on average and nearly 10 MB at
peak? That's at least 25 times more than we *should* actually need (yes, I know the
RTS is in there somewhere too, but I think it's fine to ignore that for now.)

Regards,
Aleks
