[Haskell-cafe] Odd parallel haskell observations

Sat Aug 7 16:47:45 EDT 2010

Hello.

I am exploring haskell features for parallel and cocurrent programming
and see something difficult to explain.

In brief - asking RTS to use more threads results in awfull drop of
performance. And according to 'top'  test programm consumes up to N CPUs
power.

Am I doing something wrong? I attached the code, but I am just issuing
thousands of HTTP GET requests in 1-4 forkIO threads. And since it looks
like local apache is faster than haskell program (which is a pity) I
expected that using more OS threads should improve performance.

Just in case:
ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1

-------------- next part --------------
import Data.List
import System.IO
import qualified System.IO.UTF8
import System.Environment (getArgs)
import Network.HTTP
import Network.URI
import System.Time
import System.IO.Unsafe
import Control.Monad
import Control.Exception
import Control.Concurrent
import Control.Concurrent.MVar

secDiff :: ClockTime -> ClockTime -> Float
secDiff (TOD secs1 psecs1) (TOD secs2 psecs2) =
        fromInteger (psecs2 - psecs1) / 1e12 + fromInteger (secs2 - secs1)

-- single get
get :: Int -> IO(String)
get id = do
  res <- simpleHTTP $ getRequest "http://127.0.0.1"
  case res of
    Left err -> return(show err)
    Right rsp -> return(show $ rspCode rsp)

-- perform GET per each list element using c threads
doList :: [Int] -> Int -> IO()
doList ids 0 =
    return()

doList [] c =
    return()

doList ids c = do
    forkChild $ forM_ todo get
    doList later (c-1)
    where (todo, later) = splitAt (length ids `div` c) ids

{-
Copied from
http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Control-Concurrent.html#11
Terminating the program
-}
children :: MVar [MVar ()]
children = unsafePerformIO (newMVar [])

waitForChildren :: IO ()
waitForChildren = do
  cs <- takeMVar children
  case cs of
    []   -> return ()
    m:ms -> do
            putMVar children ms
            takeMVar m
            waitForChildren

forkChild :: IO () -> IO ThreadId
forkChild io = do
  mvar <- newEmptyMVar
  childs <- takeMVar children
  putMVar children (mvar:childs)
  forkIO (io `finally` putMVar mvar ())
-- end of copied code

main = do
  [c', n'] <- getArgs
  let 
      c = read c' :: Int
      n = read n' :: Int
  start <- getClockTime
  doList [1..n] c
  waitForChildren
  end <- getClockTime
  putStrLn $ show(c) ++ " " ++ show(secDiff start end) ++ "s"

-------------- next part --------------

20:31 sacha at loft4633:/tmp 21> ghc --make -threaded get.hs
[1 of 1] Compiling Main             ( get.hs, get.o )
Linking get ...
20:31 sacha at loft4633:/tmp 22> ./get 1 10000
1 3.242352s
20:31 sacha at loft4633:/tmp 23> ./get 2 10000
2 3.08306s
20:31 sacha at loft4633:/tmp 24> ./get 2 10000 +RTS -N2
2 6.898871s
20:32 sacha at loft4633:/tmp 25> ./get 3 10000
3 2.950677s
20:32 sacha at loft4633:/tmp 26> ./get 3 10000 +RTS -N2
3 7.381678s
20:32 sacha at loft4633:/tmp 27> ./get 3 10000 +RTS -N3
3 14.683548s
20:32 sacha at loft4633:/tmp 28> ./get 4 10000
4 3.332165s
20:33 sacha at loft4633:/tmp 29> ./get 4 10000 +RTS -N4 -s
./get 4 10000 +RTS -N4 -s
4 57.17923s
   2,147,969,912 bytes allocated in the heap
      49,059,288 bytes copied during GC
         736,656 bytes maximum residency (98 sample(s))
         486,744 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   949 collections,   948 parallel, 76.73s, 25.67s elapsed
  Generation 1:    98 collections,    98 parallel,  7.70s,  2.56s elapsed

  Parallel GC work balance: 2.17 (6115428 / 2822692, ideal 4)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    1.43s    ( 27.76s)       6.31s    (  2.12s)
  Task  1 (worker) :    0.00s    ( 28.13s)      10.62s    (  3.56s)
  Task  2 (worker) :    0.37s    ( 28.63s)      11.06s    (  3.69s)
  Task  3 (worker) :    0.00s    ( 28.95s)       6.29s    (  2.10s)
  Task  4 (worker) :   20.73s    ( 28.95s)       9.68s    (  3.24s)
  Task  5 (worker) :    0.00s    ( 28.95s)       0.60s    (  0.20s)
  Task  6 (worker) :   21.81s    ( 28.95s)      11.91s    (  3.97s)
  Task  7 (worker) :   18.59s    ( 28.95s)      13.04s    (  4.36s)
  Task  8 (worker) :   17.24s    ( 28.96s)      14.92s    (  4.99s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   79.23s  ( 28.95s elapsed)
  GC    time   84.43s  ( 28.23s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time  162.49s  ( 57.19s elapsed)

  %GC time      52.0%  (49.4% elapsed)

  Alloc rate    27,513,782 bytes per MUT second

  Productivity  48.0% of total user, 136.5% of total elapsed

gc_alloc_block_sync: 15006
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 7617
gen[0].steps[1].sync_large_objects: 35
gen[1].steps[0].sync_large_objects: 1400