# [Haskell-cafe] idea for avoiding temporaries

Jan-Willem Maessen Janwillem.Maessen at sun.com
Fri Mar 9 10:58:43 EST 2007

```On Mar 8, 2007, at 12:00 PM, David Roundy wrote:

> Hi all,
>
> I was just teaching a class on minimization methods, with a focus on
> conjugate gradient minimization in particular, and one of my main
> points
> was that with conjugate gradient minimization we only need three or
> four
> arrays of size N (depending on whether we use the Fletcher-Reeves or
> Polak-Ribiere variant), ...  This got me thinking about one of the
> largest problems
> with writing serious numerical code in Haskell, which is that of
> memory
> consumption and avoiding temporary variables.

I've been following this discussion with interest, as I've been
looking in some detail at conjugate gradient algorithms as part of my
day job, and I've spent a good deal of time thinking about exactly
the problems you raise.  For those following along at home, here's a
sample somewhat-imperative CG algorithm (this is the simplified
stripped-down version):

for j <- seq(1#cgit) do
q = A p
alpha = rho / (p DOT q)
z += alpha p
rho0 = rho
r -= alpha q
rho := r DOT r
beta = rho / rho0
p := r + beta p
end

Here p,q,r, and z are vectors, A is the derivative of our function
(in this case a sparse symmetric positive-definite matrix, but we can
really think of it as a higher-order function of type Vector->Vector)
and the greek letters are scalars.  The "answer" is z.  In practice
we'd not run a fixed number of iterations, but instead do a
convergence test.  All the hard work is in the line "q = A p", but
the storage consumption is mostly in the surrounding code.  On a
parallel machine (where these sorts of programs are often run) this
part of the algorithm has almost no parallelism---all those dot
products and normalizations preclude it---but the A p step and the
dot products themselves are of course quite parallelizable.

Sadly, many of the suggestions, while generally sound, just don't
apply well to this particular use case.

* As other have noted, burying the updatable state in a monad just
begs the question.  The resulting code looks nothing at all like the
mathematics, either, which is a big problem if you're trying to
understand and maintain it (The above code is virtually isomorphic to
the problem specification).  I'm sure David is seeking a more-
declarative version of the code than the spec from which we were
working.  Note that rather than embedding all the state in a special
monad, we might as well be using update-in-place techniques (such as
the += and -= operations you see above) with the monads we've already
got; the result will at least be readable, but it will be too
imperative.

* There are opportunities for loop fusion in CG algorithms---we can
compute multiple dot products on the same array in a single loop---
but these have the flavor of fusing multiple consumers of a data
structure into a single consumer, which I believe is still an
unsolved problem in equational frameworks like foldr/build or
streams.  It's a bit like fusing:

n = foldl' (+) 0 . map (const 1) \$ xs
sum_xs = foldl' (+) 0 \$ xs
sum_sq = foldl' (+) 0 . map (\x->x*x) \$ xs

into:

(n,sum_xs,sum_sq) =
foldl' (\(a0,b0,c0) (an,bn,cn)->(a0+an, b0+bn, c0+cn)) (0,0,0) .
map (\x->(const 1 x, x, x*x)) \$ xs

which we understand how to do, but not equationally (or at least we
didn't last I looked, though Andy Gill and I had both fantasized

None of these fusion opportunities actually save space, which makes
the problem tricker still.

* The algorithm as written already tries to express minimal storage.
The only question is, do +=, -=, and := update their left-hand side
in place, or do we think of them (in the Haskell model of the
universe) as fresh arrays, and the previous values as newly-created
garbage?  My challenge to fellow Haskell hackers: find a way to
express this such that it doesn't look so imperative.

* Even if we do a good job with += and -=, which is what David seems
to be looking for, we still have those irritating := assignments---
we'd like to throw out the old p and reuse the space in the last
line.  And we'd like to have one piece of storage to hold the q on
each successive iteration.  So even if we solve David's problem, we
still haven't matched the space requirements of the imperative code.

* DiffArrays are too expensive to be acceptable here.  It's not even
a question of unboxing.  Let's say we keep the current array in fast,
unboxed storage; this lets us read its elements using a single load.
Each update still needs to retrieve the old data from the array and
add it to the old-versions lookup table; together these operations
are much more expensive than the actual update to the current version
of the table.  And we need to do this even though no old versions
exist!  We should be able to avoid this work entirely.  (And, if old
versions do exist, a DiffArray is the pessimal representation for
them given that we're updating the whole array).

* Linear or Uniqueness types are almost what we want.  I think Josef
Svenningson was the one who captured this the best: Uniqueness type
*require* that the *caller* of these routines make sure that it is
not sharing the data.  So we need two copies of our linear algebra
library---one which takes unique arrays as arguments and updates in
place, and one which takes non-unique arrays and allocates.  And we
have to pick the right one based on context.  What we want, it seems
to me, is one library and a compiler which can make informed judgments.

* We could imagine tracking uniqueness dynamically at run time, using
something like reference counting for all our arrays.  But we need to
do the reference counting precisely---this is pretty much the most
inefficient way possible of tracking the storage, and doesn't play
well at all with using efficient GC elsewhere in our programs.  The
inefficiency might be worth it for arrays, but Haskell is polymorphic
and in many cases we need to treat all our data the same way.

* Finally, I'll observe that we often want to use slightly different
algorithms depending upon whether we're updating in place or
computing into fresh storage.  Often copying the data and then
updating it in place is not actually a good idea.

I'd love to hear if anyone has insights / pointers to related work on
any of the issues above; I'm especially keen to learn if there's work
I didn't know about on fusion of multiple traversals.  In my day job
with Fortress we are looking at RULES-like approaches, but they
founder quickly because the kind of problems David is trying to solve
are 90% of what our programmers want to do.

-Jan-Willem Maessen

```