[Haskell-cafe] help optimizing memory usage for a program

Don Stewart dons at galois.com
Wed Mar 4 13:00:10 EST 2009


manlio_perillo:
> Hi.
>
> After some work I have managed to implement two simple programs that  
> parse the Netflix Prize data set.
>
> For details about the Netflix Prize, there was a post by Kenneth Hoste  
> some time ago.
>
> I have cabalized the program, and made it available here:
> http://haskell.mperillo.ath.cx/netflix-0.0.1.tar.gz
>
> The process-data-1 program parses the training data set, grouping ratings
> by movie.
> The process-data-2 program instead groups ratings by user.
>
>
> The structure of the two programs is similar: I create singleton IntMaps
> and then merge them together with foldl + union.
>
>
> The data structure used is:
>
> type Rating = Word32 :*: Word8 -- id, rating
> type MovieRatings = IntMap (UArr Rating)
>
> UArr is from the uvector package.
>
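> A simplified sketch of the construction (parsing left out; I am writing
> the uvector names singletonU and appendU from memory, and repeating the
> types so the sketch stands alone):
>
> import Data.Word (Word32, Word8)
> import Data.IntMap (IntMap)
> import qualified Data.IntMap as IM
> import Data.Array.Vector        -- uvector: UArr, (:*:), singletonU, appendU
>
> type Rating = Word32 :*: Word8  -- id, rating
> type MovieRatings = IntMap (UArr Rating)
>
> -- One singleton map per rating, merged in with a left fold.
> -- unionWith appendU copies both arrays whenever two ratings for the
> -- same movie meet, and a lazy foldl also builds a chain of
> -- unevaluated unions until the final map is forced.
> insertRating :: MovieRatings -> (Int, Rating) -> MovieRatings
> insertRating m (movie, r) =
>     IM.unionWith appendU m (IM.singleton movie (singletonU r))
>
> buildMap :: [(Int, Rating)] -> MovieRatings
> buildMap = foldl insertRating IM.empty
>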
>
> The training data set contains about 100 million ratings, so, in theory,
> the amount of memory required to store the (id, rating) pairs is about
> 480 MB.
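> (100 million ratings x (4 bytes per Word32 id + 1 byte per Word8
> rating) = 500 million bytes, about 477 MiB, assuming the pairs are
> stored unboxed with no padding.)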
>
>
> The process-data-1 program parses the entire data set using about 1.4 GB
> of memory (a 3x increase).
>
> This is strange: the memory used is proportional to the number of
> ratings.
> IntMap may be the culprit, or the garbage collector, which may not
> release memory back to the operating system (or, worse, may not
> deallocate all of the temporary memory it uses).
>
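> (Running the program with +RTS -s should help distinguish these cases:
> the maximum residency it reports is the peak live heap as seen by the
> garbage collector, while the process size reported by the operating
> system also includes memory the RTS has allocated but not returned.)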
>
> The process-data-2 program has many more problems.
> When I try to parse *only* 500 movie ratings, memory usage is 471 MB,
> rising to 780 MB once all the data has been fully evaluated (the array
> concatenations).
>
>
> I'm using GHC 6.8.2 on Debian Etch.
> I would like to run some tests with more recent GHC versions, but that
> may be a problem.
>

Seems like a useful benchmark of a number of things. Any chance you'll
upload it to hackage?
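
If the repeated array concatenations turn out to be the problem, one
alternative worth trying (an untested sketch, with the uvector names
written from memory) is to accumulate plain lists per key and convert
each one to a UArr only once at the end:

import Data.Word (Word32, Word8)
import Data.IntMap (IntMap)
import qualified Data.IntMap as IM
import Data.List (foldl')
import Data.Array.Vector        -- uvector: UArr, (:*:), toU

type Rating = Word32 :*: Word8  -- id, rating

-- Cheap O(1) prepends while reading; each per-key list ends up in
-- reverse insertion order.
buildLists :: [(Int, Rating)] -> IntMap [Rating]
buildLists = foldl' step IM.empty
  where step m (k, r) = IM.insertWith (++) k [r] m

-- Convert every list to an unboxed array exactly once; this trades the
-- repeated copying of appendU for the boxing overhead of the
-- intermediate lists.
buildRatings :: [(Int, Rating)] -> IntMap (UArr Rating)
buildRatings = IM.map toU . buildLists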

