Fun with GHC's optimiser

Thu, 02 Nov 2000 21:27:32 +1100

----Next_Part(Thu_Nov__2_21:27:30_2000_507)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Questions over questions arise in my quest to understand
GHC's code generator. In this nice little piece of STG

$w$snewPArray =
    NO_CCS[] \r[ww w]
	case newIntArray# [ww realWorld#] of wild {
	  (#,#) s2# mba# ->
	      case -# [ww 1] of y {
		DEFAULT ->
		    let { wild1 = NO_CCS I#! [ww];
		    } in 
		      case ># [0 y] of wild2 {
			True -> PArray [wild1 mba#];
			False ->
			    let {
			      $wgo2 =
				  NO_CCS[] \r[y1 y2]
				      let { a = NO_CCS PArray! [wild1 mba#];
				      } in 
					case w of wild3 {
					  I# e# ->
					      case writeIntArray# [mba# y1 e# y2] of s2#1 {
						DEFAULT ->
						    case ==# [y1 y] of wild11 {
						      True -> (#,#) [s2#1 a];
						      False ->
							  case +# [y1 1] of stg_c2aL {
							    DEFAULT -> $wgo2 stg_c2aL s2#1
							  };
						    }
					      };
					};
			    } in  case $wgo2 0 s2# of wild3 { (#,#) ds r -> r; };
		      }
	      };
	};

I am wondering whether there is a particular reason why the
optimiser doesn't pull the

  (1)  a = NO_CCS PArray! [wild1 mba#];

and the

  (2)  case w of wild3 {
         I# e# ->

out off the $wgo2 loop - or at least push (1) down into the
True branch of `case ==# [y1 y] of'.  I would say that (1)
in this form pointlessly allocates heap like mad, but is
only using one of the many copies of `a', namely the one
allocated in the final iteration and returned to the caller.
As for (2), the loop would be nice and straight if that
unboxing where outside of the loop - as it is, we break the
pipeline once per iteration it seems (if the branch
prediction isn't very clever).

Or do I misunderstand something here, or is there maybe some
magic after STG that gets rid of the stuff?

I attach the Haskell source, which I compiled with

  ghc -O -c PArrays.hs -fglasgow-exts -ddump-stg

(This is 4.08.1 of course.)

Also if somebody is looking at the attached source, I was
wondering why, when I use the commented out code in
`newPArray', I get a lot worse code (the STG code is in a
comment at the end of the file).  In particular, the lambda
abstraction is not inlined, whereas `fill' gets inlined into
the code of which the dump is above.  Is it because the
compiler has a lot harder time with explicit recursion than
with fold/build?  If so, the right RULES magic should allow
me to do the same for my own recursively defined
combinators, shouldn't it?

Cheers,
Manuel

PS: Otherwise, it is quite impressive what the RULES and
    inliner do to the foldr that produced the above code.

----Next_Part(Thu_Nov__2_21:27:30_2000_507)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="PArrays.hs"

--  Parallel Arrays: Sequential prototype
--
--  Authors: Manuel M. T. Chakravarty
--	     Gabriele Keller
--  Created: 26 October 2000
--
--  Version $Revision$ from $Date$
--
--  Copyright (c) 2000 Chakravarty & Keller
--
--  This file is free software; you can redistribute it and/or modify
--  it under the terms of the GNU General Public License as published by
--  the Free Software Foundation; either version 2 of the License, or
--  (at your option) any later version.
--
--  This file is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
--  GNU General Public License for more details.
--
--- DESCRIPTION ---------------------------------------------------------------
--
--  This module provides unboxed arrays of primitive types as a sequential
--  prototype for the parallel arrays of Gabi's SCL.  They come with all the
--  necessary operations to support flattened Haskell generated from lambdaPA.
--
--- DOCU ----------------------------------------------------------------------
--
--  language: Haskell 98 + GHC extensions (unboxed types and byte arrays)
--
--  We export the immutable `PArray' only; however, we internally also use a
--  mutable variant `MPArray', which allows us an efficient incremental
--  definition of the collective structure.
--
--  The class `PAE' is quite peculiar.  It essentially serves to overload
--  array access (which depends on the size of the unboxed value stored in the
--  parallel array); however, the actual operations in the class are only to
--  be used internally in this module - and therefore, the class `PAE' is
--  exported *abstractly*.  In fact, most operations work on mutable parallel
--  arrays for efficiency.
--
--  We shamelessly steal coding tricks from GHC's `ArrayBase', `IArray' &
--  friends, but without most of that overloading mess.
--
--- TODO ----------------------------------------------------------------------
--

module PArrays (
  PAE, PArray, newPArray, (!|), loop, gen
) where

-- infixl 9 !|

import PrelGHC
import PrelBase
import PrelST

import ST

-- data structures
-- ---------------

-- * although all `PArray's are represented by the same structure, the
--   functions operating on them differ with the element type; hence, we have
--   to overload these functions

-- integer indexed array (EXPORTED ABSTRACTLY)
--
data PArray e = PArray Int ByteArray#

-- mutable integer indexed array
--
data MPArray s e = MPArray Int (MutableByteArray# s)

-- the class of "Parallel Array Element"s (EXPORTED ABSTRACTLY)
--
class PAE a where
  newMPArray   :: Int                     -> ST s (MPArray s a)
  readMPArray  :: MPArray s a -> Int      -> ST s a
  writeMPArray :: MPArray s a -> Int -> a -> ST s ()
  indexPArray  ::  PArray a   -> Int      -> a

-- exported functions
-- ------------------

-- create new parallel array, where all elements are initialised to the given
-- values (EXPORTED)
--
newPArray     :: PAE e => Int -> e -> PArray e
newPArray n e  = runST (do
  mpa <- newMPArray n
--  simpleGen n $ 
--    \i -> writeMPArray mpa i e
--  return $ unsafeFreezeMPArray mpa)
  foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1])  -- a la ArrayBase

{-# SPECIALIZE newPArray :: Int -> Int -> PArray Int #-}

fill mpa e i next = writeMPArray mpa i e >> next

-- indexing of a parallel array (EXPORTED)
--
(!|) :: PAE e => PArray e -> Int -> e
(!|)  = indexPArray

-- Gabi's loop (EXPORTED)
--
loop :: () -- ??
loop  = error "loop?"

-- Gabi's gen (EXPORTED)
--
gen :: () -- ??
gen  = error "gen??"

-- instances of `PAE'
-- ------------------

instance PAE Int where
  newMPArray   = newMPArrayInt
  readMPArray  = readMPArrayInt
  writeMPArray = writeMPArrayInt
  indexPArray  = indexPArrayInt

newMPArrayInt           :: Int -> ST s (MPArray s Int)
newMPArrayInt n@(I# n#)  = ST $ \s# ->
  case (newIntArray# n# s#)	     of {(# s2#, mba# #) ->
  (# s2#, MPArray n mba# #)}

readMPArrayInt                         :: MPArray s Int -> Int -> ST s Int
{-# INLINE readMPArrayInt #-}
readMPArrayInt (MPArray _ mba#) (I# i#)  = ST $ \s# ->
  case readIntArray# mba# i# s#      of {(# s2#, r# #) ->
  (# s2#, I# r# #)}

writeMPArrayInt :: MPArray s Int -> Int -> Int -> ST s ()
{-# INLINE writeMPArrayInt #-}
writeMPArrayInt (MPArray _ mba#) (I# i#) (I# e#) = ST $ \s# ->
  case writeIntArray# mba# i# e# s#  of {s2#   ->
  (# s2#, () #)}

indexPArrayInt                        :: PArray Int -> Int -> Int
{-# INLINE indexPArrayInt #-}
indexPArrayInt (PArray _ ba#) (I# i#)  =
  case indexIntArray# ba# i# 	     of {r# ->
  (I# r#)}

-- auxilliary functions
-- --------------------

-- unsafely convert a mutable into an immutable array
--
unsafeFreezeMPArray                  :: MPArray s e -> PArray e
unsafeFreezeMPArray (MPArray n mba#)  = PArray n (unsafeCoerce# mba#)

-- simple generator abstraction
--
simpleGen     :: Monad m => Int -> (Int -> m ()) -> m ()
simpleGen 0 p  = return ()
simpleGen n p  = p (n - 1) >> simpleGen (n - 1) p

{-

-- That's what we get for a newPArray specialised for Int and using the
-- explicitly recursive simpleGen.

$wsimpleGen =
    NO_CCS srt: (0,2)[] \r[ww w]
	case ww of ds {
	  0 -> $wlvl1;
	  DEFAULT ->
	      let {
		stg_c1Qg =
		    NO_CCS srt: (0,1)[] \r[s1]
			case -# [ds 1] of a {
			  DEFAULT ->
			      let { stg_c1Lg = NO_CCS I#! [a];
			      } in 
				case w stg_c1Lg s1 of wild {
				  (#,#) new_s r -> $wsimpleGen a w new_s;
				}
			};
	      } in  stg_c1Qg
	};
SRT: [$wsimpleGen, $wlvl1]

$w$snewPArray =
    NO_CCS srt: (0,1)[] \r[ww w]
	case newIntArray# [ww realWorld#] of wild {
	  (#,#) s2# mba# ->
	      let {
		stg_c1M1 =
		    NO_CCS[] \r[i]
			case i of wild1 {
			  I# i# ->
			      case w of wild2 {
				I# e# ->
				    let {
				      stg_c1Q7 =
					  NO_CCS[] \r[s#]
					      case writeIntArray# [mba# i# e# s#] of s2#1 {
						DEFAULT -> (#,#) [s2#1 ()]
					      };
				    } in  stg_c1Q7;
			      };
			};
	      } in 
		case $wsimpleGen ww stg_c1M1 s2# of wild1 {
		  (#,#) new_s r -> let { a = NO_CCS I#! [ww]; } in  (#,#) [a mba#];
		};
	};
SRT: [$wsimpleGen]

-- the `case w of' could be pulled out of the loop
-- stg_c1Q7 builds a superfluous closure (which is immediately entered)
-- could we somehow get a specialised version of simpleGen?

-}
----Next_Part(Thu_Nov__2_21:27:30_2000_507)----