<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Tahoma
}
--></style></head>
<body class='hmmessage'><div dir='ltr'>
Wow, now performance is on par with Java ;)<div>So slow division was main problem, that and GC .</div><div><br></div><div>Thanks!</div><div><br><br><div><div id="SkyDrivePlaceholder"></div>> From: daniel.is.fischer@googlemail.com<br>> To: haskell-cafe@haskell.org<br>> CC: bmaxa@hotmail.com<br>> Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?<br>> Date: Sat, 1 Dec 2012 21:12:29 +0100<br>> <br>> On Samstag, 1. Dezember 2012, 16:09:05, Branimir Maksimovic wrote:<br>> > All in all even unboxed array is about 10 times slower than Java version.<br>> > I don't understand why is even unboxed array so slow.<br>> <br>> It's not the unboxed arrays that are slow.<br>> <br>> Your code has a couple of weak spots, and GHC's native code generator has a <br>> weakness that bites here.<br>> <br>> On my box, I don't quite have a 10× difference to my translation to Java, it's <br>> a bit less than 7× (0.82s vs 0.12s - I don't want to bring my box to its knees <br>> by running something that takes 3GB+ of RAM, so I run the unboxed array part <br>> only) with the LLVM backend and 8× (0.93s) with the native code generator. <br>> That's in the same ballpark, though.<br>> <br>> So what's the deal?<br>> <br>> Main.main_$s$wa1 [Occ=LoopBreaker]<br>> :: GHC.Prim.Int#<br>> -> GHC.Prim.Int#<br>> -> GHC.Prim.State# GHC.Prim.RealWorld<br>> -> GHC.Types.Int<br>> -> GHC.Types.Int<br>> -> GHC.Types.Int<br>> -> ...<br>> <br>> Your loops carry boxed Ints around, that's always a bad sign. In this case it <br>> doesn't hurt too much, however, since these values are neither read nor <br>> substituted during the loop (they're first and last index of the array and <br>> number of elements). Additionally, they carry an IOUArray constructor around. <br>> That is unnecessary. Eliminating a couple of dead parameters<br>> <br>> <br>> init' a = do<br>> (_,n) <- getBounds a<br>> let init k<br>> | k > n = return ()<br>> | otherwise = do<br>> let x = fromIntegral $ k + k `div` 3<br>> unsafeWrite a k x<br>> init (k+1)<br>> init 0<br>> <br>> partial_sum a = do<br>> (_,n) <- getBounds a<br>> let ps i s<br>> | i > n = return ()<br>> | otherwise = do<br>> k <- unsafeRead a i<br>> let l = s + k<br>> unsafeWrite a i l<br>> ps (i+1) l<br>> k <- unsafeRead a 0<br>> ps 1 k<br>> <br>> brings the time for the native code generator down to 0.82s, and for the LLVM <br>> backend the time remains the same.<br>> <br>> Next problem, you're using `div` for the division.<br>> <br>> `div` does some checking and potentially fixup (not here, since everything is <br>> non-negative) after the machine division because `div` is specified to satisfy<br>> <br>> a = (a `div` b) * b + (a `mod` b)<br>> <br>> with 0 <= a `mod` b < abs b.<br>> <br>> That is in itself slower than the pure machine division you get with quot.<br>> <br>> So let's see what we get with `quot`.<br>> <br>> 0.65s with the native code generator, and 0.13 with the LLVM backend.<br>> <br>> Whoops, what's that?<br>> <br>> The problem is, as can be seen by manually replacing k `quot` 3 with<br>> <br>> (k *2863311531) `shiftR` 33<br>> <br>> (requires 64-bit Ints; equivalent in Java: k*28..1L >> 33), when the native <br>> backend, the LLVM backend and Java (as well as C) all take more or less the <br>> same time [well, the NCG is a bit slower than the other two, 0.11s, 0.11s, <br>> 0.14s], that division is a **very** slow operation.<br>> <br>> Java and LLVM know how to replace the division by the constant 3 with a <br>> mulitplication, a couple of shifts and an addition (since we never have <br>> negative numbers here, just one multiplication and shift suffice, but neither <br>> Java nor LLVM can do that on their own because it's not guaranteed by the <br>> type). The native code generator doesn't - not yet.<br>> <br>> So the programme spends the majority of the time dividing. The array reads and <br>> writes are on par with Java's (and, for that matter, C's).<br>> <br>> If you make the divisor a parameter instead of a compile time constant, the <br>> NCG is not affected at all, the LLVM backend gives you equal performance (it <br>> can't optimise a division by a divisor it doesn't know). Java is at an <br>> advantage there, after a while the JIT sees that it might be a good idea to <br>> optimise the division and so its time only trebles.<br></div></div>                                            </div></body>
</html>