(Forwarding to haskell-cafe)<br><br><div class="gmail_quote">Hi,<br><br>I have a program that computes a matrix of Floats with m rows and n columns. Computing each Float is relatively expensive. Each row is completely independent of the others, so I thought I'd try some simple SMP parallelism on this code:<br>
<br><span style="font-family: courier new,monospace;">myFun :: FilePath -> IO ()</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">myFun fp = </span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> do fs <- readDataDir fp</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> let process f = readFile' f >>= parse</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> printLine = putStrLn . foldr (\a b -> show a ++ "\t" ++ b) ""</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> runDiff l = [ [ diff x y | y <- l ]</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> | (x,i) <- zip l (map getId fs), myFilter i ]</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> ps <- mapM process fs</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> sequence_ [ printLine x | x <- runDiff ps <u>`using` parList rdeepseq</u> ]</span><br><br>So I'm using parList to evaluate the rows in parallel, and rdeepseq to fully evaluate each row. Here are the timings on a dual quad-core AMD 2378 @ 2.4 GHz, with ghc-6.12.3 and parallel-2.2.0.1:<br>
<br><span style="font-family: courier new,monospace;">-N time (minutes and seconds)</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">none 1m50</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">2 1m33</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">3 1m35</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">4 1m22</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">5 1m11</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">6 1m06</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">7 1m45</span><br><br>The slowdown at -N7 is explained by two other processes that were running on the machine at the time. I can't explain the small increase at -N3, but that doesn't matter too much. The problem is that I am not getting the gains I expected (half the time at -N2, a third at -N3, and so on): even at -N6 the run only goes from 1m50 to 1m06, roughly a 1.7x speedup. Is this the best one can achieve with this implicit parallelism, or am I doing something wrong? In particular, is the way I'm printing the results at the end destroying potential parallel gains?<br>
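For anyone who wants to try this without my data, here is a minimal self-contained sketch of the same pattern. It is not my actual program: `diff` is replaced by a made-up CPU-bound stand-in, the inputs are plain Ints rather than parsed files, the filtering step is left out, and `runDiffPar` and `formatLine` are names invented for the sketch.

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Hypothetical stand-in for the expensive `diff`; it only burns some CPU.
diff :: Int -> Int -> Float
diff x y = fromIntegral (sum [ (x * k) `mod` (y + 1) | k <- [1 .. 20000] ])

-- One row per element of the input, as in the original runDiff
-- (without the getId / myFilter step).
runDiff :: [Int] -> [[Float]]
runDiff l = [ [ diff x y | y <- l ] | x <- l ]

-- Same matrix, but each row is sparked and fully evaluated in parallel.
runDiffPar :: [Int] -> [[Float]]
runDiffPar l = runDiff l `using` parList rdeepseq

-- Tab-separated rendering of one row (keeps the trailing tab of printLine).
formatLine :: [Float] -> String
formatLine = foldr (\a b -> show a ++ "\t" ++ b) ""

main :: IO ()
main = mapM_ (putStrLn . formatLine) (runDiffPar [1 .. 200])
```

Assuming the `parallel` package is installed, this has to be built with `-threaded` (for example `ghc -O2 -threaded Sketch.hs`; with GHC 7 and later, `-rtsopts` is also needed) and run with something like `+RTS -N4 -s`. The `-s` summary should report how many sparks were created and converted, which would help tell whether the rows really are being evaluated in parallel before they get printed.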
<br>Any insights on this are appreciated.<br><br><br>Thanks,<br><font color="#888888">Pedro<br>
</font></div><br>