[Haskell-cafe] sort and lazyness (?)

Fri Dec 19 09:53:23 EST 2008

On Fri, 2008-12-19 at 14:40 +0100, Daniel Kraft wrote:
> Hi,
> 
> I'm just a beginner trying to learn a little about Haskell, and as such 
> write some toy programs (e.g. for projecteuler.net) in Haskell.
> 
> Currently, I'm experiencing what I would call "strange behaviour":
> 
> I've got a data-type
> 
> data Fraction = Fraction Int Int
> 
> to hold rational numbers (maybe there's already some built-in type for 
> this in Haskell, much like for instance Scheme has a rational type?), 
> and then I compute a list of pairs of those numbers, that is
> [(Fraction, Fraction)].  Fraction is declared an instance of Ord.
> 
> This list has up to 3 million elements.  If I do
> 
> main = print $ length $ points
> 
> then the program prints out the length fine and takes around 6s to 
> finish (compiled with GHC -O3).  Ok, but I acknowledge that length isn't 
> quite an expensive function, as I can imagine that Haskell does not 
> compute and hold the entire list for this but instead each element at 
> once and discards it afterwards.

Right.

> Doing
> 
> main = print $ length $ map (\(x, _) -> x == Fraction 1 2) points
> 
> instead, gives a slightly longer runtime (6.6s), but in this case I'm 
> sure that at least each element is computed; right?

Nope. Nothing looks at the result of the comparison so it is not done
and so it does not force any of the fractions to be computed.

> main = print $ length $ reverse points
> 
> gives 11.9s, and here I guess (?) that for this to work, the entire list 
> is computed and hold in memory.

Yes, the list is held in memory but the elements are not computed.

> However, trying to do
> 
> import List
> main = print $ length $ sort points

Now it really does have to compute the elements because comparing the
elements involves inspecting them.

> makes memory usage go up and the program does not finish in 15m, also 
> spending most time waiting for swapped out memory.

Right. Much more memory used this time and that is pushing your machine
into swapping which then goes very very slowly.

> What am I doing wrong, why is sort this expensive in this case?

Sort itself is not so expensive, the expensive thing was calculating all
the elements of your list and sort was the first function that had to do
it. If you used something like sum (rather than length) then it'd hit
the same issue as it would need to evaluate each element.

> I would assume that computing and holding the whole list does not take
> too much memory, given its size and data type; 

That does not seem to be the case.

data Fraction = Fraction Int Int

This takes 7 words per Fraction. There's 3 words for the Fraction
constructor and it's two fields. Each Int also takes 2 words. On a 32bit
machine that is 28 bytes, or 56 on a 64bit machine.

We can reduce the memory requirements for Fraction by making the Int's
strict and unpacking them into the Fraction constructor:

data Fraction = Fraction {-# UNPACK #-} !Int {-# UNPACK #-} !Int

Now it takes 3 words per Fraction.

However it is no longer so lazy. Previously you could have (Fraction 3
(error "not used")) and as long as you never looked at the second Int it
would work. Now that we made both fields strict (the ! on the fields)
that does not work any more, (Fraction 3 (error "not used")) will
produce the error, even if we only look at the first Int.

> doing the very same calculation in C should be straight forward. 

In an eager language like C the first one would have failed too because
it would have had to compute all the elements eagerly and holding them
all in memory at once seems to require more memory than your machine
has.

On the other hand C would use something more like the strict variant I
mentioned above so the initial memory requirements would probably be
lower.

> And sort should be O(n * log n) for time and also not much more
> expensive in memory, right?

Sort does take a bit more memory than just reversing because it has to
construct some intermediate lists while merging.

> Am I running into a problem with lazyness?

A problem in your understanding of lazyness.

> What can I do to avoid it? 

Look at this again:

length $ map (\(x, _) -> x == Fraction 1 2) points

Try something like:

length $ map (\_ -> error "look! never evaluated!") points

you'll find it gives the same answer. That's because length never looks
at the elements of the list.

> As far as I see it though, the reverse or map call above should do 
> nearly the same as sort, except maybe that the list needs to be stored 
> in memory as a whole and sort has an additional *log n factor, but 
> neither of those should matter.  What's the problem here?

reverse never looks at the elements of the list, sort does.

Duncan