Personal tools

Benchmarks Game/Parallel/BinaryTreesDPH

From HaskellWiki

< Benchmarks Game | Parallel(Difference between revisions)
Jump to: navigation, search
 
Line 7: Line 7:
 
* Not submitted, requires GHC 6.10 beta or above
 
* Not submitted, requires GHC 6.10 beta or above
   
Hardware:
+
Info:
   
 
* 2.2gHz core2duo, Macbook Pro
 
* 2.2gHz core2duo, Macbook Pro
  +
* The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921
   
 
Build and run:
 
Build and run:
Line 20: Line 21:
 
<b>Note, these times have a 350mb heap (+RTS -A350M)</b>
 
<b>Note, these times have a 350mb heap (+RTS -A350M)</b>
   
Single core vs. dual core
+
DPH - Single core vs. dual core
   
 
* Single core (no -N3): 18.945
 
* Single core (no -N3): 18.945
 
* Dual core (with -N3): 14.338 (see below)
 
* Dual core (with -N3): 14.338 (see below)
   
DPH vs. parallel strategies
 
   
* Parallel strategies dual core: 11.471 (see below)
 
* DPH Dual core: 14.338 (see below)
 
   
Comparison with parallel strategies version of parallel binary trees:
+
====Parallel Strategies vs. Data Parallel Haskell====
   
====Data Parallel Haskell====
+
<b>DPH</b>:
   
 
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
 
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
Line 96: Line 94:
 
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total
 
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total
   
====Parallel Strategies====
+
  +
<b>Parallel Strategies</b>:
   
 
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
 
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
Line 149: Line 147:
   
 
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 18.62s user 2.15s system 181% cpu 11.471 total
 
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 18.62s user 2.15s system 181% cpu 11.471 total
  +
  +
===Code===
  +
<haskell>
  +
{-# LANGUAGE BangPatterns, TypeOperators #-}
  +
{-# OPTIONS -funbox-strict-fields #-}
  +
--
  +
-- The Computer Language Shootout
  +
-- http://shootout.alioth.debian.org/
  +
--
  +
-- Contributed by Don Stewart and Thomas Davie, modified for DPH
  +
-- usage by Austin Seipp
  +
--
  +
-- This uses the Data Parallel Haskell beta available in ghc 6.10, to
  +
-- exploit multi-core machines.
  +
--
  +
  +
import System
  +
import Data.Bits
  +
import Text.Printf
  +
import Data.Array.Parallel.Base
  +
import Data.Array.Parallel.Unlifted as U
  +
import Data.Array.Parallel.Unlifted.Parallel as UP
  +
--
  +
-- an artificially strict tree.
  +
--
  +
-- normally you would ensure the branches are lazy, but this benchmark
  +
-- requires strict allocation.
  +
--
  +
data Tree = Nil | Node !Int !Tree !Tree
  +
  +
minN = 4
  +
  +
io s n t = printf "%s of depth %d\t check: %d\n" s n t
  +
  +
main = do
  +
n <- getArgs >>= readIO . head
  +
let maxN = max (minN + 2) n
  +
stretchN = maxN + 1
  +
  +
-- stretch memory tree
  +
let c = check' (make 0 stretchN)
  +
io "stretch tree" stretchN c
  +
  +
-- allocate a long lived tree
  +
let !long = make 0 maxN
  +
  +
-- allocate, walk, and deallocate many bottom-up binary trees
  +
let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN
  +
mapM_ (\((m :*:d :*: i)) ->
  +
io (show m ++ "\t trees") d i) (U.toList vs)
  +
  +
-- confirm the the long-lived binary tree still exists
  +
io "long lived tree" maxN (check' long)
  +
  +
-- generate many trees
  +
depth' :: Int -> Int -> (Int :*: Int :*: Int)
  +
depth' m d =
  +
(2*n) :*: d :*: (sumT d n 0)
  +
where
  +
n = 1 `shiftL` (m - d + minN)
  +
  +
-- allocate and check lots of trees
  +
sumT :: Int -> Int -> Int -> Int
  +
sumT d 0 t = t
  +
sumT d i t = sumT d (i-1) (t + a + b)
  +
where a = check' (make i d)
  +
b = check' (make (-i) d)
  +
  +
-- traverse the tree, counting up the nodes
  +
check' :: Tree -> Int
  +
check' Nil = 0
  +
check' (Node i l r) = i + check' l - check' r
  +
  +
-- build a tree
  +
make :: Int -> Int -> Tree
  +
make i 0 = Node i Nil Nil
  +
make i d = Node i (make (i2-1) d2) (make i2 d2)
  +
where i2 = 2*i; d2 = d-1
  +
</haskell>
  +
  +
===Notes===
  +
  +
I'm taking a guess (core is tremendous) that an issue potentially lies here:
  +
<haskell> mapM_ (\((m :*:d :*: i)) ->
  +
io (show m ++ "\t trees") d i) (U.toList vs)</haskell>
  +
Not sure if U.toList fuses properly - but how can we traverse it otherwise?

Revision as of 20:48, 7 October 2008

Contents

1 Binary Trees

1.1 Data Parallel Haskell

  • Not submitted, requires GHC 6.10 beta or above

Info:

  • 2.2gHz core2duo, Macbook Pro
  • The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921

Build and run:

   $ ~/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs
   $ ./binarytrees 20 +RTS -N3 -sstderr -A350M

1.1.1 Times

Note, these times have a 350mb heap (+RTS -A350M)

DPH - Single core vs. dual core

  • Single core (no -N3): 18.945
  • Dual core (with -N3): 14.338 (see below)


1.1.2 Parallel Strategies vs. Data Parallel Haskell

DPH:

time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M 
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
 14,584,449,064 bytes allocated in the heap
    215,093,024 bytes copied during GC
     34,674,344 bytes maximum residency (2 sample(s))
         51,688 bytes maximum slop
           1083 MB total memory in use (9 MB lost due to fragmentation)
 Generation 0:    16 collections,    16 parallel,  1.97s,  1.13s elapsed
 Generation 1:     2 collections,     2 parallel,  0.83s,  0.47s elapsed
 Parallel GC work balance: 2.34 (53773189 / 22969253, ideal 3)
 Task  0 (worker) :  MUT time:  23.18s  ( 12.54s elapsed)
                     GC  time:   0.37s  (  0.21s elapsed)
 Task  1 (worker) :  MUT time:  21.46s  ( 12.54s elapsed)
                     GC  time:   2.09s  (  1.19s elapsed)
 Task  2 (worker) :  MUT time:  23.53s  ( 12.54s elapsed)
                     GC  time:   0.02s  (  0.02s elapsed)
 Task  3 (worker) :  MUT time:  23.56s  ( 12.57s elapsed)
                     GC  time:   0.00s  (  0.00s elapsed)
 Task  4 (worker) :  MUT time:  23.24s  ( 12.57s elapsed)
                     GC  time:   0.32s  (  0.18s elapsed)
 INIT  time    0.02s  (  0.03s elapsed)
 MUT   time   20.75s  ( 12.54s elapsed)
 GC    time    2.80s  (  1.60s elapsed)
 EXIT  time    0.00s  (  0.02s elapsed)
 Total time   23.57s  ( 14.17s elapsed)
 %GC time      11.9%  (11.3% elapsed)
 Alloc rate    702,222,726 bytes per MUT second
 Productivity  88.1% of total user, 146.4% of total elapsed
recordMutableGen_sync: 0
gc_alloc_block_sync: 14318885
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 354
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 10112
gen[1].steps[0].sync_large_objects: 0
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS  23.57s user 2.13s system 179% cpu 14.338 total


Parallel Strategies:

time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M 
stretch tree of depth 21	 check: -1
2097152	 trees of depth 4	 check: -2097152
524288	 trees of depth 6	 check: -524288
131072	 trees of depth 8	 check: -131072
32768	 trees of depth 10	 check: -32768
8192	 trees of depth 12	 check: -8192
2048	 trees of depth 14	 check: -2048
512	 trees of depth 16	 check: -512
128	 trees of depth 18	 check: -128
32	 trees of depth 20	 check: -32
long lived tree of depth 20	 check: -1
9,719,681,300 bytes allocated in the heap
164,038,148 bytes copied during GC (scavenged)
       160 bytes copied during GC (not scavenged)
33,718,272 bytes maximum residency (2 sample(s))
        11 collections in generation 0 (  0.33s)
         2 collections in generation 1 (  0.20s)
      1094 Mb total memory in use
 Task  0 (worker) :  MUT time:  18.43s  ( 10.65s elapsed)
                     GC  time:   0.17s  (  0.19s elapsed)
 Task  1 (worker) :  MUT time:  18.54s  ( 10.65s elapsed)
                     GC  time:   0.06s  (  0.06s elapsed)
 Task  2 (worker) :  MUT time:  18.32s  ( 10.65s elapsed)
                     GC  time:   0.29s  (  0.36s elapsed)
 Task  3 (worker) :  MUT time:  18.60s  ( 10.67s elapsed)
                     GC  time:   0.01s  (  0.01s elapsed)
 Task  4 (worker) :  MUT time:  18.61s  ( 10.68s elapsed)
                     GC  time:   0.00s  (  0.00s elapsed)
 INIT  time    0.01s  (  0.03s elapsed)
 MUT   time   18.08s  ( 10.65s elapsed)
 GC    time    0.53s  (  0.63s elapsed)
 EXIT  time    0.00s  (  0.01s elapsed)
 Total time   18.62s  ( 11.31s elapsed)
 %GC time       2.8%  (5.6% elapsed)
 Alloc rate    537,291,716 bytes per MUT second
 Productivity  97.1% of total user, 159.8% of total elapsed
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS  18.62s user 2.15s system 181% cpu 11.471 total

1.2 Code

{-# LANGUAGE BangPatterns, TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart and Thomas Davie, modified for DPH
-- usage by Austin Seipp
--
-- This uses the Data Parallel Haskell beta available in ghc 6.10, to
-- exploit multi-core machines.
-- 
 
import System
import Data.Bits
import Text.Printf
import Data.Array.Parallel.Base 
import Data.Array.Parallel.Unlifted as U
import Data.Array.Parallel.Unlifted.Parallel as UP
--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree
 
minN = 4
 
io s n t = printf "%s of depth %d\t check: %d\n" s n t
 
main = do
    n <- getArgs >>= readIO . head
    let maxN     = max (minN + 2) n
        stretchN = maxN + 1
 
    -- stretch memory tree
    let c = check' (make 0 stretchN)
    io "stretch tree" stretchN c
 
    -- allocate a long lived tree
    let !long    = make 0 maxN
 
    -- allocate, walk, and deallocate many bottom-up binary trees
    let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN
    mapM_ (\((m :*:d :*: i)) -> 
               io (show m ++ "\t trees") d i) (U.toList vs)
 
    -- confirm the the long-lived binary tree still exists
    io "long lived tree" maxN (check' long)
 
-- generate many trees
depth' :: Int -> Int -> (Int :*: Int :*: Int)
depth' m d =
  (2*n) :*: d :*: (sumT d n 0)
  where
    n = 1 `shiftL` (m - d + minN)
 
-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT  d i t = sumT d (i-1) (t + a + b)
  where a = check' (make i    d)
        b = check' (make (-i) d)
 
-- traverse the tree, counting up the nodes
check' :: Tree -> Int
check' Nil          = 0
check' (Node i l r) = i + check' l - check' r
 
-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i; d2 = d-1

1.3 Notes

I'm taking a guess (core is tremendous) that an issue potentially lies here:

    mapM_ (\((m :*:d :*: i)) -> 
               io (show m ++ "\t trees") d i) (U.toList vs)

Not sure if U.toList fuses properly - but how can we traverse it otherwise?