Benchmarks Game/Parallel/BinaryTreesDPH
From HaskellWiki
Contents |
1 Binary Trees
1.1 Data Parallel Haskell
- Not submitted, requires GHC 6.10 beta or above
Info:
- 2.2gHz core2duo, Macbook Pro
- The Glorious Glasgow Haskell Compilation System, version 6.10.0.20080921
Build and run:
$ ~/ghc-6.10/bin/ghc --make -fcpr-off -threaded -fdph-par -package dph-base -Odph binarytrees.hs $ ./binarytrees 20 +RTS -N3 -sstderr -A350M
1.1.1 Times
Note, these times have a 350mb heap (+RTS -A350M)
DPH - Single core vs. dual core
- Single core (no -N3): 18.945
- Dual core (with -N3): 14.338 (see below)
1.1.2 Parallel Strategies vs. Data Parallel Haskell
DPH:
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M
stretch tree of depth 21 check: -1
2097152 trees of depth 4 check: -2097152
524288 trees of depth 6 check: -524288
131072 trees of depth 8 check: -131072
32768 trees of depth 10 check: -32768
8192 trees of depth 12 check: -8192
2048 trees of depth 14 check: -2048
512 trees of depth 16 check: -512
128 trees of depth 18 check: -128
32 trees of depth 20 check: -32
long lived tree of depth 20 check: -1
14,584,449,064 bytes allocated in the heap
215,093,024 bytes copied during GC
34,674,344 bytes maximum residency (2 sample(s))
51,688 bytes maximum slop
1083 MB total memory in use (9 MB lost due to fragmentation)
Generation 0: 16 collections, 16 parallel, 1.97s, 1.13s elapsed Generation 1: 2 collections, 2 parallel, 0.83s, 0.47s elapsed
Parallel GC work balance: 2.34 (53773189 / 22969253, ideal 3)
Task 0 (worker) : MUT time: 23.18s ( 12.54s elapsed)
GC time: 0.37s ( 0.21s elapsed)
Task 1 (worker) : MUT time: 21.46s ( 12.54s elapsed)
GC time: 2.09s ( 1.19s elapsed)
Task 2 (worker) : MUT time: 23.53s ( 12.54s elapsed)
GC time: 0.02s ( 0.02s elapsed)
Task 3 (worker) : MUT time: 23.56s ( 12.57s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 23.24s ( 12.57s elapsed)
GC time: 0.32s ( 0.18s elapsed)
INIT time 0.02s ( 0.03s elapsed) MUT time 20.75s ( 12.54s elapsed) GC time 2.80s ( 1.60s elapsed) EXIT time 0.00s ( 0.02s elapsed) Total time 23.57s ( 14.17s elapsed)
%GC time 11.9% (11.3% elapsed)
Alloc rate 702,222,726 bytes per MUT second
Productivity 88.1% of total user, 146.4% of total elapsed
recordMutableGen_sync: 0 gc_alloc_block_sync: 14318885 whitehole_spin: 0 gen[0].steps[0].sync_todo: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_todo: 354 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_todo: 10112 gen[1].steps[0].sync_large_objects: 0 ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 23.57s user 2.13s system 179% cpu 14.338 total
Parallel Strategies:
time ./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS
./binarytrees 20 +RTS -N3 -sstderr -A350M
stretch tree of depth 21 check: -1
2097152 trees of depth 4 check: -2097152
524288 trees of depth 6 check: -524288
131072 trees of depth 8 check: -131072
32768 trees of depth 10 check: -32768
8192 trees of depth 12 check: -8192
2048 trees of depth 14 check: -2048
512 trees of depth 16 check: -512
128 trees of depth 18 check: -128
32 trees of depth 20 check: -32
long lived tree of depth 20 check: -1
9,719,681,300 bytes allocated in the heap
164,038,148 bytes copied during GC (scavenged)
160 bytes copied during GC (not scavenged)
33,718,272 bytes maximum residency (2 sample(s))
11 collections in generation 0 ( 0.33s)
2 collections in generation 1 ( 0.20s)
1094 Mb total memory in use
Task 0 (worker) : MUT time: 18.43s ( 10.65s elapsed)
GC time: 0.17s ( 0.19s elapsed)
Task 1 (worker) : MUT time: 18.54s ( 10.65s elapsed)
GC time: 0.06s ( 0.06s elapsed)
Task 2 (worker) : MUT time: 18.32s ( 10.65s elapsed)
GC time: 0.29s ( 0.36s elapsed)
Task 3 (worker) : MUT time: 18.60s ( 10.67s elapsed)
GC time: 0.01s ( 0.01s elapsed)
Task 4 (worker) : MUT time: 18.61s ( 10.68s elapsed)
GC time: 0.00s ( 0.00s elapsed)
INIT time 0.01s ( 0.03s elapsed) MUT time 18.08s ( 10.65s elapsed) GC time 0.53s ( 0.63s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 18.62s ( 11.31s elapsed)
%GC time 2.8% (5.6% elapsed)
Alloc rate 537,291,716 bytes per MUT second
Productivity 97.1% of total user, 159.8% of total elapsed
./binarytrees 20 +RTS -N3 -sstderr -A350M -RTS 18.62s user 2.15s system 181% cpu 11.471 total
1.2 Code
{-# LANGUAGE BangPatterns, TypeOperators #-} {-# OPTIONS -funbox-strict-fields #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- Contributed by Don Stewart and Thomas Davie, modified for DPH -- usage by Austin Seipp -- -- This uses the Data Parallel Haskell beta available in ghc 6.10, to -- exploit multi-core machines. -- import System import Data.Bits import Text.Printf import Data.Array.Parallel.Base import Data.Array.Parallel.Unlifted as U import Data.Array.Parallel.Unlifted.Parallel as UP -- -- an artificially strict tree. -- -- normally you would ensure the branches are lazy, but this benchmark -- requires strict allocation. -- data Tree = Nil | Node !Int !Tree !Tree minN = 4 io s n t = printf "%s of depth %d\t check: %d\n" s n t main = do n <- getArgs >>= readIO . head let maxN = max (minN + 2) n stretchN = maxN + 1 -- stretch memory tree let c = check' (make 0 stretchN) io "stretch tree" stretchN c -- allocate a long lived tree let !long = make 0 maxN -- allocate, walk, and deallocate many bottom-up binary trees let vs = UP.mapUP (depth' maxN) $ UP.enumFromThenToUP minN (minN+2) maxN mapM_ (\((m :*:d :*: i)) -> io (show m ++ "\t trees") d i) (U.toList vs) -- confirm the the long-lived binary tree still exists io "long lived tree" maxN (check' long) -- generate many trees depth' :: Int -> Int -> (Int :*: Int :*: Int) depth' m d = (2*n) :*: d :*: (sumT d n 0) where n = 1 `shiftL` (m - d + minN) -- allocate and check lots of trees sumT :: Int -> Int -> Int -> Int sumT d 0 t = t sumT d i t = sumT d (i-1) (t + a + b) where a = check' (make i d) b = check' (make (-i) d) -- traverse the tree, counting up the nodes check' :: Tree -> Int check' Nil = 0 check' (Node i l r) = i + check' l - check' r -- build a tree make :: Int -> Int -> Tree make i 0 = Node i Nil Nil make i d = Node i (make (i2-1) d2) (make i2 d2) where i2 = 2*i; d2 = d-1
1.3 Notes
I'm taking a guess (core is tremendous) that an issue potentially lies here:
mapM_ (\((m :*:d :*: i)) -> io (show m ++ "\t trees") d i) (U.toList vs)
Not sure if U.toList fuses properly - but how can we traverse it otherwise?
