Deforestration harmful?

Shin-Cheng Mu scm@comlab.ox.ac.uk
Fri, 3 May 2002 21:58:51 +0100


Hi all,

We encountered a case where elimination of intermediate data
structure seeems to have a bad impact on the overall performance
and we were wondering why.

Relationally, our program would be as simple as:

     f . g

where g :: Seed -> Tree1 and f :: Tree1 -> Tree2. Basically g
generate some kind of tree via an unfold and f map a Tree1 to
some (usually 3 or 4) Tree2. In reality we have to implement
relations as set (or list) valued functions, so it becomes

    concat . map f. g

where g looks like

    g seed = [ z | (s1, s2) <- split seed,
                    x <- g s1, y <- g s2, z <- join x y ]

The relation f was almost as simple as a replacing each
constructor in Tree1 with one or more coresponding constructors in
Tree2. They two can be fused together and becomes

   fg seed = [ z | (s1, s2) <- split seed,
                    x <- fg s1, y <- fg s2, z <- join' x y ]

The intermediate datatype Tree1 was thus eliminated.

The problem was: the fused program ran slower than the
original one! At least if you measure the total running
time (either via the UNIX time command or use the CPUTime
module). The time reported in the GHC time profile indeed
showed that the fused program is faster, though, if GC time
is excluded.

We were quite curious why and my guess was: since f maps a
Tree1 to more than one Tree2, in the two list comprehensions,
the list produced by (fg s2) is much longer than (g s2).
The fused program thus has to keep a longer list in memory.
Worse heap residency results in more GCs. (Luckily GHC did not
attempt to automatically perform the fusion for us! :) )

An obvious thing to try was to increase the heap size.
Again we were surprised. Indeed the GC time decreased
but the running time in the profile (excluding the GC time)
increased a bit. Our guess was that it might have something
to do with the way GHC allocates new memory cells such that
larger heap results in longer allocation time. It was quite
tricky to find the best heap size at which the program
runs at maximal speed.

Are the above guesses are plausible? Is it a known phenomena?

Thank you very much.

sincerely,
Shin-Cheng Mu