The parallel GC currently doesn&#39;t behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is a known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.<div>

<br></div><div>Are you saying that you see two different classes of undesirable performance, one with -qg and one without? How do the threads in your real program communicate with each other? We&#39;ve seen problems when there&#39;s a lot of contention for shared state, e.g. IORefs, among thousands of threads.<br>
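To make the IORef contention concrete, here is a minimal sketch. It is not taken from your program: sharedCounter and localCounters are made-up names, and the thread and step counts are arbitrary. The first function has every thread update one shared IORef on each step; the second lets each thread work on local state and report once. It assumes a base library that provides atomicModifyIORef' (older versions only have the lazy atomicModifyIORef).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, replicateM_)
import Data.IORef (newIORef, atomicModifyIORef', readIORef)
import Data.List (foldl')

-- Contended version (hypothetical): every thread bumps the same IORef
-- on every step, so all threads compete for one memory location.
sharedCounter :: Int -> Int -> IO Int
sharedCounter nThreads nSteps = do
  ref <- newIORef 0
  dones <- forM [1 .. nThreads] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      replicateM_ nSteps (atomicModifyIORef' ref (\n -> (n + 1, ())))
      putMVar done ()
    return done
  mapM_ takeMVar dones          -- wait for all workers to finish
  readIORef ref

-- Less contended version (hypothetical): each thread counts locally
-- and hands over a single result through its own MVar.
localCounters :: Int -> Int -> IO Int
localCounters nThreads nSteps = do
  results <- forM [1 .. nThreads] $ \_ -> do
    result <- newEmptyMVar
    _ <- forkIO $ do
      let total = foldl' (\acc _ -> acc + 1) 0 [1 .. nSteps]
      total `seq` putMVar result total   -- force the sum in the worker
    return result
  sum `fmap` mapM takeMVar results

main :: IO ()
main = do
  a <- sharedCounter 1000 100
  b <- localCounters 1000 100
  print (a, b)
```

Both functions compute the same total; only the amount of shared mutable state differs. If your real program looks more like the first one, that may be worth ruling out before blaming the GC alone.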

<br><div class="gmail_quote">On Mon, Mar 1, 2010 at 7:59 AM, Michael Lesniak <span dir="ltr">&lt;<a href="mailto:mlesniak@uni-kassel.de">mlesniak@uni-kassel.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hello haskell-cafe,<br>

<br>

Sorry for this long post, but I can&#39;t think of a way to describe and explain<br>

the problem in a shorter way.<br>

<br>

I&#39;m (again) seeing very strange behaviour with the parallel GC and would be glad<br>

if someone could either reproduce (and explain) it or provide a solution. A<br>

similar but unrelated problem has been described in [1].<br>

<br>

<br>

EXAMPLE CODE<br>

The following demonstration program, a much smaller and<br>

single-threaded version of my real program, shows the same behaviour.<br>

It does some number crunching by calculating pi to a definable precision:<br>

<br>

&gt; -- File Pi.hs<br>

&gt; -- you need the numbers package from hackage.<br>

&gt; module Main where<br>

&gt; import Data.Number.CReal<br>

&gt; import System.Environment<br>

&gt; import GHC.Conc<br>

&gt;<br>

&gt; main = do<br>

&gt;     digits &lt;- (read . head) `fmap` getArgs :: IO Int<br>

&gt;     calcPi digits<br>

&gt;<br>

&gt; calcPi digits = showCReal (fromEnum digits) pi `pseq` return ()<br>

<br>

Compile it with<br>

<br>

  ghc --make -threaded -O2 Pi.hs -o pi<br>

<br>

<br>

BENCHMARKS<br>

On my two-core machine I get the following quite strange and<br>

unpredictable results:<br>

<br>

* Using one thread:<br>

<br>

    $ for i in `seq 1 5`;do time pi 5000 +RTS -N1;done<br>

<br>

    real        0m1.441s<br>

    user        0m1.390s<br>

    sys 0m0.020s<br>

<br>

    real        0m1.449s<br>

    user        0m1.390s<br>

    sys 0m0.000s<br>

<br>

    real        0m1.399s<br>

    user        0m1.370s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.401s<br>

    user        0m1.380s<br>

    sys 0m0.000s<br>

<br>

    real        0m1.404s<br>

    user        0m1.380s<br>

    sys 0m0.000s<br>

<br>

<br>

* Using two threads, so the parallel GC is used:<br>

<br>

    $ for i in `seq 1 5`;do time pi 5000 +RTS -N2;done<br>

<br>

    real        0m2.540s<br>

    user        0m2.490s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.527s<br>

    user        0m1.530s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.966s<br>

    user        0m1.900s<br>

    sys 0m0.010s<br>

<br>

    real        0m5.670s<br>

    user        0m5.620s<br>

    sys 0m0.010s<br>

<br>

    real        0m2.966s<br>

    user        0m2.910s<br>

    sys 0m0.020s<br>

<br>

<br>

* Using two threads, but disabling the parallel GC:<br>

<br>

    $ for i in `seq 1 5`;do time pi 5000 +RTS -N2 -qg;done<br>

<br>

    real        0m1.383s<br>

    user        0m1.380s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.420s<br>

    user        0m1.360s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.406s<br>

    user        0m1.360s<br>

    sys 0m0.010s<br>

<br>

    real        0m1.421s<br>

    user        0m1.380s<br>

    sys 0m0.000s<br>

<br>

    real        0m1.360s<br>

    user        0m1.360s<br>

    sys 0m0.000s<br>

<br>

<br>

THREADSCOPE<br>

I&#39;ve additionally attached the threadscope profile of a really bad run,<br>

started with<br>

<br>

    $ time pi 5000 +RTS -N2 -ls<br>

<br>

    real        0m15.594s<br>

    user        0m15.490s<br>

    sys 0m0.010s<br>

<br>

as file pi.pdf<br>

<br>

<br>

FURTHER INFORMATION/QUESTION<br>

Just disabling the parallel GC leads to very bad performance in my original<br>

code, which forks threads with forkIO and does a lot of communication. Hence,<br>

using -qg is not a real option for me.<br>

<br>

Have I overlooked some crucial aspect of this problem? If you&#39;ve<br>

read this far, thank you for reading ... this far ;-)<br>

<br>

Cheers,<br>

  Michael<br>

<br>

<br>

<br>

[1] <a href="http://osdir.com/ml/haskell-cafe@haskell.org/2010-02/msg00850.html" target="_blank">http://osdir.com/ml/haskell-cafe@haskell.org/2010-02/msg00850.html</a><br>

<br>

<br>

--<br>

Dipl.-Inf. Michael C. Lesniak<br>

University of Kassel<br>

Programming Languages / Methodologies Research Group<br>

Department of Computer Science and Electrical Engineering<br>

<br>

Wilhelmshöher Allee 73<br>

34121 Kassel<br>

<br>

Phone: +49-(0)561-804-6269<br>


<br></blockquote></div><br></div>