[Haskell-cafe] Why is boxed mutable array so slow?

Branimir Maksimovic bmaxa at hotmail.com
Sat Dec 1 19:04:03 CET 2012


Wow that sped it up 5 times.I see that boxed Vector is 25% faster than IOArray.What is the difference and when to use Vector,when IOArray?Thanks!
bmaxa at maxa:~/examples$ time ./Cumul +RTS -A1600Mboxed arraylast 262486571 seconds 1.196unboxed arraylast 262486571 seconds 0.748boxed vectorlast 262486571 seconds 0.908unboxed vectorlast 262486571 seconds 0.720
real    0m3.805suser    0m3.428ssys     0m0.372s

> Date: Sat, 1 Dec 2012 12:20:37 -0500
> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
> From: dons00 at gmail.com
> To: bmaxa at hotmail.com
> CC: haskell-cafe at haskell.org
> 
> The obvious difference between boxed and unboxed arrays is that the
> boxed arrays are full of pointers to heap allocated objects. This
> means you pay indirection to access the values, much more time in GC
> spent chasing pointers (though card marking helps), and generally do
> more allocation.
> 
> Compare the GC stats below, for
> 
> * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s
> * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s
> 
> So there's your main answer. The increased data density of unboxed
> arrays also helps a too.
> 
> Now, you can help out  the GC signifcantly by hinting at how much
> you're going to allocated in the youngest generation (see the
> ghc-gc-tune app for a methodical approach to this, though it needs
> updating to ghc 7 --
> http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/
>  and http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection
> ).
> 
> Use the +RTS -A flag to set an initial youngest generation heap size
> to the size of your array, and watch the GC cost disappear. For our
> boxed vector, we'd use +RTS -A50M, resulting in:
> 
> * Boxed vector: 8k copied, 1% of time in GC, 0.157s
> 
> So not bad. 3x speedup through a RTS flag. -A is very useful if you
> are working with boxed, mutable arrays.
> 
> For reference, there's a generic version below that specializes based
> on the vector type parameter.
> 
> ---------------------------------
> 
> {-# LANGUAGE BangPatterns #-}
> 
> import System.CPUTime
> import Text.Printf
> import Data.Int
> import Control.DeepSeq
> import System.Mem
> 
> import qualified Data.Vector.Mutable as V
> import qualified Data.Vector.Unboxed.Mutable as U
> import qualified Data.Vector.Generic.Mutable as G
> 
> main :: IO()
> main = do
> 
> --   (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector"
> --   performGC
>    (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector"
>    performGC
> 
> test' s a = do
>     putStrLn s
>     begin <- getCPUTime
>     init'' a
>     partial_sum' a
>     end <- getCPUTime
>     let diff = (fromIntegral (end - begin)) / (10**12)
>     last <- G.read a (n'-1)
>     printf "last %d seconds %.3f\n" last (diff::Double)
> 
> n' :: Int
> n' = 1000 * 1000
> 
> init'' !a = init 0 (n'-1)
>   where
>     init :: Int -> Int -> IO ()
>     init !k !n
>         | k > n     = return ()
>         | otherwise = do
>             let !x = fromIntegral $ k + k `div` 3
>             G.write a k x
>             init (k+1) n
> 
> 
> 
> partial_sum' !a = do
>     k <- G.read a 0
>     ps 1 (n'-1) k
>   where
>     ps :: Int -> Int -> Int32 -> IO ()
>     ps i n s
>         | i > n     = return ()
>         | otherwise = do
>             k <- G.read a i
>             let !l = fromIntegral $ s + k
>             G.write a i l
>             ps (i+1) n l
> 
> 
> ---------------------------------
> 
> $ time ./A +RTS -s
> boxed vector
> last 945735787 seconds 0.420
>       40,121,448 bytes allocated in the heap
>       88,355,272 bytes copied during GC
>       24,036,456 bytes maximum residency (6 sample(s))
>          380,632 bytes maximum slop
>               54 MB total memory in use (0 MB lost due to fragmentation)
> 
>   %GC     time      75.2%  (75.9% elapsed)
> 
>   Alloc rate    359,655,602 bytes per MUT second
> 
> ./A +RTS -s  0.40s user 0.07s system 98% cpu 0.475 total
> 
> 
> $ time ./A +RTS -s
> unboxed vector
> last 945735787 seconds 0.080
>        4,113,568 bytes allocated in the heap
>           11,288 bytes copied during GC
>        4,003,256 bytes maximum residency (3 sample(s))
>          182,856 bytes maximum slop
>                5 MB total memory in use (0 MB lost due to fragmentation)
> 
>   %GC     time       1.3%  (1.3% elapsed)
> 
>   Alloc rate    51,416,660 bytes per MUT second
> 
> ./A +RTS -s  0.08s user 0.01s system 98% cpu 0.088 total
> 
> 
> $ time ./A +RTS -A50M -s
> boxed vector
> last 945735787 seconds 0.127
>       40,121,504 bytes allocated in the heap
>            8,032 bytes copied during GC
>           44,704 bytes maximum residency (2 sample(s))
>           20,832 bytes maximum slop
>               59 MB total memory in use (0 MB lost due to fragmentation)
> 
>   %GC     time       1.0%  (1.0% elapsed)
> 
>   Productivity  97.4% of total user, 99.6% of total elapsed
> 
> ./A +RTS -A50M -s  0.10s user 0.05s system 97% cpu 0.157 total
> 
> 
> 
> ---------------------------------
> 
> 
> On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <bmaxa at hotmail.com> wrote:
> > I have made benchmark test inspired by
> > http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/
> >
> > What surprised me is that unboxed array is much faster than boxed array.
> > Actually boxed array performance is on par with standard Haskell list
> > which is very slow.
> > All in all even unboxed array is about 10 times slower than Java version.
> > I don't understand why is even unboxed array so slow.
> > But! unboxed array consumes least amount of RAM.
> > (warning, program consumes more than 3gb of ram)
> >
> >  bmaxa at maxa:~/examples$ time ./Cumul
> > boxed array
> > last 262486571 seconds 4.972
> > unboxed array
> > last 262486571 seconds 0.776
> > list
> > last 262486571 seconds 6.812
> >
> > real    0m13.086s
> > user    0m11.996s
> > sys     0m1.080s
> >
> > -------------------------------------------------------------------------
> > {-# LANGUAGE CPP, BangPatterns #-}
> > import System.CPUTime
> > import Text.Printf
> > import Data.Array.IO
> > import Data.Array.Base
> > import Data.Int
> > import Control.DeepSeq
> > import System.Mem
> >
> > main :: IO()
> > main = do
> > (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array"
> > performGC
> > (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array"
> > performGC
> > begin <- getCPUTime
> > printf "list\nlast %d" $ last $ force $ take n' $ sum' data'
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > printf " seconds %.3f\n" (diff::Double)
> >
> > test s a = do
> > putStrLn s
> > begin <- getCPUTime
> > init' a
> > partial_sum a
> > end <- getCPUTime
> > let diff = (fromIntegral (end - begin)) / (10^12)
> > last <- readArray a (n'-1)
> > printf "last %d seconds %.3f\n" last (diff::Double)
> >
> > n' :: Int
> > n' = 50 * 1000 * 1000
> >
> > type A = IOArray Int Int32
> > type B = IOUArray Int Int32
> >
> > init' a = do
> > (_,n) <- getBounds a
> > init a 0 n
> > where
> > init a k n
> > | k > n = return ()
> > | otherwise = do
> > let  !x = fromIntegral $ k + k `div` 3
> > unsafeWrite a k x
> > init a (k+1) n
> >
> > partial_sum a = do
> > (_,n) <- getBounds a
> > k <- unsafeRead a 0
> > ps a 1 n k
> > where
> > ps a i n s
> > | i > n = return ()
> > | otherwise = do
> > k <- unsafeRead a i
> > let !l = fromIntegral $ s + k
> > unsafeWrite a i l
> > ps a (i+1) n l
> >
> > data' :: [Int32]
> > data' = [k + k `div` 3 | k <- [0..] ]
> >
> > sum' = scanl1 (+)
> >
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >
 		 	   		  
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20121201/6a364bb0/attachment.htm>


More information about the Haskell-Cafe mailing list