Daniel, thank you for all your advice.<br><br>An additional bang pattern (!) in convertIntToDouble fixed the issue, and using a foldl'<br>did the trick as well.<br><br>Now the program runs as it should, with a constant amount of memory and in a very small amount of time.<br>
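For anyone hitting the same problem: here is a minimal sketch of the kind of fix involved. convertIntToDouble is the function from my program, but the body below is illustrative, not the real code:<br>

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical sketch: the bang pattern forces the Int argument
-- before the conversion, so a chain of unevaluated thunks cannot
-- build up when this is called inside a tight loop.
convertIntToDouble :: Int -> Double
convertIntToDouble !n = fromIntegral n

main :: IO ()
main = print (sum (map convertIntToDouble [1 .. 1000]))
```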
<br>I believe these problems are one of the major sources of frustration for Haskell newbies. Things that would work easily in language X suddenly become problems in Haskell. Once you overcome these issues, you feel happy again that you chose Haskell as the main programming language of your research project. <br>
<br>Is there any guide that explains more about the "bad consumption pattern"? Are there any general rules for avoiding these issues? Re-reading the chapter on profiling in the Real World Haskell book helped me to roughly understand the problem. Is there a more detailed treatment of the problem than in RWH?<br>
<br>Regards,<br><br>Arnoldo<br><br><div class="gmail_quote">On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer <span dir="ltr"><<a href="mailto:daniel.is.fischer@web.de">daniel.is.fischer@web.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:<br>
<div class="im">> Hello Daniel:<br>
><br>
> My % GC time is : 75.0% (81.4% elapsed) and I am compiling with -O2.<br>
<br>
</div>Very bad. Can I see the code?<br>
<div class="im"><br>
> Thank you for clarifying about the pointers.<br>
<br>
</div>Not to forget the Ints for counting.<br>
<div class="im"><br>
><br>
> Slowly my memory usage grows and eventually it explodes. I would expect<br>
> that the list comprehension is lazily evaluated and therefore at any<br>
> given time I am only executing one hamming distance. The result of the<br>
> hamming distance is stored into a small statistics datatype I built<br>
> (only stores sums and sum of squares and the counts). This datatype is<br>
> updated using a foldr.<br>
<br>
</div>That might very well be the problem: if you update it with a foldr, you<br>
must construct the entire list of 2000^2 hamming thunks before the work can<br>
begin.<br>
It's probably better to use foldl' (and make the type strict) so you can<br>
start the work immediately.<br>
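A hedged sketch of that suggestion (the Stats type here is invented for illustration; the real one stores sums, sums of squares and counts):<br>

```haskell
import Data.List (foldl')

-- Illustrative strict accumulator: the strict fields keep the running
-- totals evaluated, and foldl' forces the accumulator at every step,
-- so the fold can run in constant space.
data Stats = Stats { count :: !Int, total :: !Int, sumSq :: !Int }

addSample :: Stats -> Int -> Stats
addSample (Stats c s sq) x = Stats (c + 1) (s + x) (sq + x * x)

statsOf :: [Int] -> Stats
statsOf = foldl' addSample (Stats 0 0 0)

main :: IO ()
main = let st = statsOf [1 .. 100]
       in print (count st, total st, sumSq st)
```

With a plain foldr the whole list of thunks is built before any addition happens; with foldl' and strict fields each sample is folded in as soon as it is produced.<br>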
<div class="im"><br>
><br>
> I have no idea where the leak is. What do you see in a .prof file to<br>
> find a leak (hamming distance has the largest amount of time and %alloc)<br>
<br>
</div>For finding leaks, heap profiling (-h*) gives more info than -p. The .prof<br>
file says more about where you spend your time than about what hangs on to memory.<br>
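For concreteness, a typical heap-profiling run looks something like this (flag names as in GHC's profiling documentation; Prog/prog are placeholders for your module and binary):<br>

```sh
$ ghc -O2 -prof -auto-all -o prog Prog.hs
$ ./prog +RTS -hc -p -RTS     # -hc: break the heap profile down by cost centre
$ hp2ps -c prog.hp            # render the heap profile to PostScript
```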
<div class="im"><br>
> ? From my .prof file where would you start looking at?<br>
<br>
</div>- use hamming instead of hamming2<br>
- convertIntToDouble looks suspicious<br>
- calculating a few million Hamming distances takes some time, but what<br>
about getMyStats, should that really take 25%?<br>
<div class="im"><br>
<br>
> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-<br>
> > > exampl, ys <- exampl]<br>
> > ><br>
<br>
</div>filter (/= 0) [hamming xs ys | xs <- example, ys <- example]<br>
<br>
And of course, you can trivially avoid half of the work.<br>
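One way to sketch that (distinctPairs is a made-up helper, not from the original code): since the Hamming distance is symmetric, it suffices to visit each unordered pair once:<br>

```haskell
import Data.List (tails)

-- Each unordered pair {x, y}, with x occurring before y in the list,
-- is visited exactly once, halving the number of distance computations.
distinctPairs :: [a] -> [(a, a)]
distinctPairs zs = [(x, y) | (x : rest) <- tails zs, y <- rest]

hamming :: String -> String -> Int
hamming (x : xs) (y : ys)
  | x == y = hamming xs ys
  | otherwise = 1 + hamming xs ys
hamming _ _ = 0

main :: IO ()
main = print [hamming xs ys | (xs, ys) <- distinctPairs ["abc", "abd", "xbd"]]
```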
<div><div></div><div class="h5"><br>
><br>
><br>
> Best Regards,<br>
><br>
> Arnoldo Muller<br>
><br>
> On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer<br>
<<a href="mailto:daniel.is.fischer@web.de">daniel.is.fischer@web.de</a>>wrote:<br>
> > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:<br>
> > > Hello all:<br>
> > ><br>
> > > I want to generate some hamming distance statistics about a set of<br>
> > > strings. As explained in another e-mail in this list, I used the<br>
> > > following code to call the<br>
> > > functions:<br>
> > > (exampl holds the list of strings of size w)<br>
> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-<br>
> > > exampl, ys <- exampl]<br>
> > ><br>
> > > I have two hamming functions:<br>
> > > -- hamming distance for variable length strings<br>
> > > hamming :: String -> String -> Int<br>
> > > hamming x y = hamming' x y 0<br>
> > >   where<br>
> > >     hamming' [] _ !c = c<br>
> > >     hamming' _ [] !c = c<br>
> > >     hamming' (x:xs) (y:ys) !c<br>
> > >       | x == y = hamming' xs ys c<br>
> > >       | otherwise = hamming' xs ys (c + 1)<br>
> > ><br>
> > > -- function posted in this mailing list<br>
> > > hamming2 :: String -> String -> Int<br>
> > > hamming2 xs ys = length (filter not (zipWith (==) xs ys))<br>
> > ><br>
> > > I am executing these functions millions of times and the bottleneck<br>
> > > of my program is in them as explained by running in profiling mode<br>
> > > with +RTS -K400M -p -RTS<br>
> > ><br>
> > > The costlier function is the hamming distance<br>
> > > COST CENTRE MODULE %time %alloc<br>
> > ><br>
> > > hamming Distances 66.6 41.9<br>
> > ><br>
> > > It says that it is performing 41% of the allocations. In the case of<br>
> > > hamming2 the allocations go as far as 52%.<br>
> ><br>
> > Allocations are cheap, so that's not necessarily a problem. More<br>
> > important is, what's the maximum residency and how much is copied<br>
> > during GC? Are you compiling with -O2 ?<br>
> ><br>
> > > I could understand that<br>
> > > there are allocations in "hamming2" because we are creating pairs,<br>
> > > but in the case of "hamming" there should be no allocation.<br>
> ><br>
> > Why not? I don't know how GHC counts allocations, but every time you go<br>
> > from (x:xs) to xs, you need a new pointer to the tail. If that counts<br>
> > as allocation, hamming must allocate a lot, too.<br>
> ><br>
> > > How can I execute my hamming functions without allocating memory?<br>
> > ><br>
> > > Best regards,<br>
> > ><br>
> > > Arnoldo Muller<br>
<br>
</div></div></blockquote></div><br>