Loop unrolling + fusion ?

Sun Mar 1 11:42:27 EST 2009

> So now, since we've gone to such effort to produce a tiny loop like, this,
> can't we unroll it just a little? 
> it is worth unrolling this guy, so we get the win of both aggressive high level
> fusion, and aggressive low level loop optimisations?

It might be useful to point out that the interaction goes both ways.
Not only are fused loops candidates for unrolling, but unrolling
can also enable fusion, giving one example of why Core-level
unrolling (in addition to backend-level loop restructuring) would
be useful. Consider this silly example (with Apply as before, in
the rewrite rules thread, just syntactically unrolling the loop, and
loop as before, but generalised to arbitrary accumulators, see below):

--------------------------------------------------------
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE BangPatterns #-}

import Data.Array.Vector
import Data.Bits
import Apply
import GHC.Prim
import GHC.Base

main = print $ loop 1 10000000 body (toU [1,2,3,4,5::Int])

body i arr = mapU (42+) arr
--------------------------------------------------------

Here, the refusal to partially unfold recursive definitions means
there are no opportunities for fusion, whereas unrolling enables
fusion (which wouldn't work if unrolling was done only in the
backend, after fusion).

--------------------------------------------------------
{-# INLINE loop #-}
loop :: Int -> Int -> (Int -> acc -> acc) -> acc -> acc
loop i max body acc = loopW i acc
  where
#ifdef N
  loopW !i !acc | i+N<=max  = loopW (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc)
#endif
  loopW !i !acc | i<=max    = loopW (i+1) (body i acc)
                | otherwise = acc
--------------------------------------------------------

Compare the versions without and with unrolling, not just for
time, but for allocation (+RTS -s).

As usual, we'd like to reassociate the sums to enable constant
folding, but this rule

{-# RULES
-- "reassoc" forall a# b# c. ((I# a#) +# ((I# b#) +# c)) = ((I# a#) +# (I# b#)) +# c
  #-}

 is rejected.

Claus