[Long, probably not-beginners anymore] Parallel folds and folds as arrows (was: Re: [Haskell-beginners] Re: When, if ever, does Haskell "calculate once"?)

Daniel Fischer daniel.is.fischer at web.de
Fri May 7 10:15:41 EDT 2010


On Friday 07 May 2010 03:15:19, Maciej Piechotka wrote:
> On Thu, 2010-05-06 at 23:46 +0200, Daniel Fischer wrote:
> > Share.share :: GHC.Types.Int
> > GblId
> > [Str: DmdType]
> > Share.share =
> >   case GHC.List.$wlen @ GHC.Integer.Type.Integer Share.share_a 0
> >   of ww_amc { __DEFAULT ->
> >   GHC.Types.I# (GHC.Prim.+# ww_amc ww_amc)
> >   }
>
> Hmm. What's the name of this form and how to get it?
>

It's GHC's intermediate language, named Core. It's sort of a slimmed down 
Haskell.
You can get it by
a) compiling with the -ddump-simpl flag
(e.g. ghc -O2 -ddump-simpl --make prog > prog.core
If you don't redirect, it's spat out to stdout, better to have it in a file 
for reading. Also, it's easier to read with syntax-highlighting, plain 
Haskell-highlighting already goes a long way.)
b) using Don Stewart's ghc-core (http://hackage.haskell.org/package/ghc-
core), e.g. ghc.core -f html -- -O2 Source.hs > Source.html

Looking at the core, you can see what GHC really makes from your code.

> > No.
> >
> > cFoldl' f g (b0,c0) xs0 = lgo b0 c0 xs0
> >     where
> >       lgo b c [] = (b,c)
> >       lgo !b !c (x:xs) = lgo (f b x) (g c x) xs
>
> Ok. Fixed (I tried fast rewrite from foldr')
>

Doesn't matter when compiled with optimisations (produces nearly identical 
core), but without optimisations the original constructs a new pair in each 
step.

However, I inadvertently swapped the clauses of lgo (could also be fixed by 
putting the bang patterns in the first clause). That didn't matter as long 
as both accumulating functions were lengthFold, but it makes a big 
difference when one is (+).

> > 64-bit system? I get
>
> 64 bit, GHC 6.12.2.
> % ghc -V
> The Glorious Glasgow Haskell Compilation System, version 6.12.2
> % file a.out
> a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically
> linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
>
> > And that is strange, because I get the same figures for that one as
> > for the first (times differ by a few hundredths of a second).
>
> Fixed or non-fixed version?
>
> > Is that a difference between 32-bit code generator and 64-bit or
> > between GHC versions (6.12.2 here, but 6.10.3 gives roughly the same
> > results)?
>
> Hmm. Compiler and platform matches. Unless you use some other 64-bit
> platform - not x86-64 ;)

No, I have a 32-bit system (I got almost exactly half the allocation 
figures you got, so I suspected your Ints [and small Integers] were twice 
as large as mine).

>
> > > > main =
> > > >   print $! uncurry (+) (cFoldl' lengthFold lengthFold
> > > >                                 (0, 0) [1..size])
> >
> > And that gives the same figures as the other two (plus/minus 0.05s).
> >
> > > All are compiled with optimizations.
> >
> > All compiled with -O2.
>
> Hmm. Difference between -O1 and -O2
>

No, they produce identical core for these.

> Fixed versions and with sum (and -O2):
> > main = let a = [1..size]
> >            l = length a + sum a
> >        in print $! l
>
> Lot's of memory(Over 3 GiB). I voluntarily killed process
>

Yes, of course.

> > main = print $! length [1..size] + sum [1..size]
>
> Lot's of memory(Over 3 GiB). I voluntarily killed process.
>
> So far as being inplace.
>

Yes. Actually, with optimisations turned on, GHC *does* share the list 
[1 .. size]. Oops.
It's not shared with -O0 unless you give it a name.

> > main = print $! uncurry (+) (cFoldl' lengthFold (+) (0, 0) [1..size])
>
> 5000000150000000
>   16,889,753,976 bytes allocated in the heap
>        3,356,480 bytes copied during GC
>            1,976 bytes maximum residency (1 sample(s))
>           28,200 bytes maximum slop
>                1 MB total memory in use (0 MB lost due to fragmentation)
>
>   Generation 0: 32216 collections,     0 parallel,  0.29s,  0.38s
> elapsed
>   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s
> elapsed
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time   14.89s  ( 15.12s elapsed)
>   GC    time    0.29s  (  0.38s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time   15.18s  ( 15.50s elapsed)
>
>   %GC time       1.9%  (2.4% elapsed)
>
>   Alloc rate    1,134,094,110 bytes per MUT second
>
>   Productivity  98.1% of total user, 96.1% of total elapsed
>
> ./a.out +RTS -s  15.18s user 0.11s system 98% cpu 15.503 total

With the really fixed cFoldl' (overflows of course):

./lsCFoldl +RTS -s                                             
1087459712                                                     
   4,015,621,900 bytes allocated in the heap                   
         126,564 bytes copied during GC                        
           1,460 bytes maximum residency (1 sample(s))         
          29,892 bytes maximum slop                            
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  7659 collections,     0 parallel,  0.07s,  0.05s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.50s  (  1.54s elapsed)
  GC    time    0.07s  (  0.05s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.56s  (  1.59s elapsed)

  %GC time       4.3%  (3.1% elapsed)

  Alloc rate    2,684,070,586 bytes per MUT second

  Productivity  95.7% of total user, 94.2% of total elapsed

Misfixed:

./lsTFoldl +RTS -s                                             
1087459712                                                     
   6,400,058,168 bytes allocated in the heap                   
         638,784 bytes copied during GC                        
           1,496 bytes maximum residency (1 sample(s))         
          29,892 bytes maximum slop                            
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 12207 collections,     0 parallel,  0.04s,  0.08s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.64s  (  2.64s elapsed)
  GC    time    0.04s  (  0.08s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.68s  (  2.72s elapsed)

  %GC time       1.3%  (3.1% elapsed)

  Alloc rate    2,420,446,752 bytes per MUT second

  Productivity  98.7% of total user, 97.1% of total elapsed



> -----------------------------------------------------------------------
>
> On the other hand it seems to form an arrow[1].
> First the result of test:
> 5000000150000000
>   12,800,063,440 bytes allocated in the heap
>        2,545,048 bytes copied during GC
>            1,968 bytes maximum residency (1 sample(s))
>           28,216 bytes maximum slop
>                1 MB total memory in use (0 MB lost due to fragmentation)
>
>   Generation 0: 24414 collections,     0 parallel,  0.24s,  0.29s
> elapsed
>   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s
> elapsed
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time    7.18s  (  7.35s elapsed)
>   GC    time    0.24s  (  0.29s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time    7.42s  (  7.64s elapsed)
>
>   %GC time       3.3%  (3.8% elapsed)
>
>   Alloc rate    1,782,759,994 bytes per MUT second
>
>   Productivity  96.7% of total user, 93.9% of total elapsed
>
> ./a.out +RTS -s  7.42s user 0.09s system 98% cpu 7.648 total
>
> (Yes - lower then the code above - result reproducable).
>
> Code:
> > {-# LANGUAGE BangPatterns #-}
> > import Control.Arrow
<snip>
> > main =
> >     print $! uncurry (+) $ doFoldl (lengthA *** sumA) (0, 0) [1..size]

Hmm. For me, that gives identical performance (and identical core for main) 
as the misfixed cFoldl', the really fixed cFoldl' does significantly 
better.
It becomes equal (for this task at least) if I make FoldlArrows also strict 
in the first argument (replace all "\a" with "\ !a").

>
> Regards
>
> PS. As it is probably out of scope and topic of beginners mailing list
> I'm CC'ing cafe (possibly beginners should be dropped).
>
> [1] It can be extended to work on other arrows as well - not only (->).



More information about the Beginners mailing list