GCC, Mac OS X & the future

Simon Marlow marlowsd at gmail.com
Sat Jul 2 21:42:43 CEST 2011


On 02/07/11 19:34, David Peixotto wrote:

> I'm glad you caught my benchmarking error because the new results
> look quite different! Running the benchmarks with the -threaded
> runtime shows that the actual slowdown is close to 30% for GC-intense
> programs.
>
> In the fibon results, the average execution time was 12% slower for
> llvm-gcc, and the average GC slowdown was 42%. In the nofib gc
> benchmark results, the average execution time for llvm-gcc was 30%
> longer.

OK, that's bad.  I'm not a Mac user, but I wouldn't put up with more
than a 5% slowdown (and I'd be very unhappy about that).

> While the results are disappointing, they seem reasonable after
> taking a look at the code generated for the access of the `gct`
> variable in the GC. I had hoped using pthread_getspecific would just
> require a few inline assembly instructions, but it looks like the
> overhead is much higher. When accessing  the `gct` variable in the GC
> it calls `getThreadLocalVar` which is the GHC wrapper for
> pthread_getspecific. Then the actual call to pthread_getspecific goes
> through the dynamic linker so we take an extra hit there. The actual
> code for pthread_getspecific is just a mov followed by a return.
>
> The best we could hope for would be for an access of `gct` to turn
> into something like this in the GC:
>
>      movq    (%rdi),%rdi #deref the key which is an index into the tls memory
>      movq    %gs:0x00000060(,%rdi,8),%rax # read the value indexed by the key
>
> but it looks like we are getting something like this:
>
>      call getThreadLocalVar
>      movq    (%rdi),%rdi #deref the key which is an index into the tls memory
>      jmp     <dynamic_linker_stub>
>      movq    %gs:0x00000060(,%rdi,8),%rax #pthread_getspecific body
>      ret

You don't need to go through getThreadLocalVar, right?  Just call
pthread_getspecific directly.  I don't know why it's going through the
dynamic linker stub; I thought it was supposed to be #defined to the
inline assembly.
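
To make "call pthread_getspecific directly" concrete, here is a minimal
sketch. It is not the GHC code: `gc_thread`, `gct_key`, `set_gct` and
`demo_two_threads` are hypothetical names used only for illustration, and
whether the C library turns the call into inline assembly depends on the
platform headers, which this sketch does not assume.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread GC state; not GHC's real struct. */
typedef struct gc_thread { int thread_index; } gc_thread;

static pthread_key_t gct_key;                 /* hypothetical key name */
static pthread_once_t gct_key_once = PTHREAD_ONCE_INIT;

/* Create the key exactly once, whichever thread gets here first. */
static void make_gct_key(void) {
    int rc = pthread_key_create(&gct_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Bind the calling thread's GC state to the key. */
static void set_gct(gc_thread *t) {
    pthread_once(&gct_key_once, make_gct_key);
    pthread_setspecific(gct_key, t);
}

/* Read the slot with a direct call, with no wrapper function in between. */
#define gct ((gc_thread *)pthread_getspecific(gct_key))

static void *worker(void *arg) {
    set_gct((gc_thread *)arg);
    return gct;                               /* this thread's own pointer */
}

/* Returns 1 if two threads each read back the pointer they stored. */
int demo_two_threads(void) {
    gc_thread a = {1}, b = {2};
    pthread_t ta, tb;
    void *ra = NULL, *rb = NULL;

    if (pthread_create(&ta, NULL, worker, &a) != 0) return 0;
    if (pthread_create(&tb, NULL, worker, &b) != 0) {
        pthread_join(ta, NULL);
        return 0;
    }
    pthread_join(ta, &ra);
    pthread_join(tb, &rb);
    return ra == (void *)&a && rb == (void *)&b;
}
```

Calling `demo_two_threads()` from a `main` (built with pthread support)
exercises the direct lookup from two threads.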

Anyway, the last resort will be to pass gct as a parameter to the 
critical functions in the GC - scavenge_block() and everything it calls 
transitively, including evacuate().  This is likely to give quite good 
performance, but not as good as a register variable, so unfortunately 
we'll need some #ifdefery or macros which will be quite ugly (hence why 
I say this is a last resort).
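
Roughly what I mean by the #ifdefery, as a sketch only. The macro names
(`GCT_IN_GLOBAL`, `DECLARE_GCT`, `PASS_GCT`), the toy `gc_thread` struct and
the simplified signatures are all assumptions for illustration; the real
scavenge_block() and evacuate() take different arguments, and the global in
the #else branch is just a stand-in for the register variable.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the per-thread GC state; not GHC's real struct. */
typedef struct gc_thread { size_t objects_copied; } gc_thread;

#ifndef GCT_IN_GLOBAL
/* Last-resort scheme: pass gct explicitly through every call. */
#define DECLARE_GCT gc_thread *gct,
#define PASS_GCT    gct,
#else
/* Stand-in for the register-variable scheme: gct is ambient state. */
static gc_thread *gct;
#define DECLARE_GCT
#define PASS_GCT
#endif

/* Simplified evacuate(): count every non-zero word as a copied object. */
static void evacuate(DECLARE_GCT const int *obj) {
    if (*obj != 0)
        gct->objects_copied++;
}

/* Simplified scavenge_block(): hands gct on to everything it calls. */
static void scavenge_block(DECLARE_GCT const int *block, size_t n) {
    for (size_t i = 0; i < n; i++)
        evacuate(PASS_GCT &block[i]);
}

/* Usage in the default (parameter-passing) configuration. */
#ifndef GCT_IN_GLOBAL
void demo_scavenge(void) {
    gc_thread t = {0};
    const int block[4] = {1, 0, 2, 3};
    scavenge_block(&t, block, 4);
    assert(t.objects_copied == 3);
}
#endif
```

The function bodies are identical in both configurations; only the two
macros change, which is where the ugliness would be concentrated.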

Cheers,
	Simon


> The call to getThreadLocalVar may be getting inlined in some places, but not at the site I examined. I've included the detailed benchmark results below.
>
> For the fibon results, a negative number indicates that llvm-gcc is slower. Efficiency is the percentage of the total execution time spent in the mutator, that is, outside the GC (SpectralNorm, which records no GC time, is at 100%).
>
> Fibon Results
> -----------------------------------------------------------------
>                  MutCPUTime    GCCPUTime TotalCPUTime   Efficiency
> -----------------------------------------------------------------
> Agum                +0.17%      -53.48%      -11.49%       80.32%
> BinaryTrees         +0.13%      -60.70%      -22.98%       68.96%
> Blur                -0.26%      -15.27%       -0.43%       98.58%
> Bzlib               -2.31%       -8.57%       -2.32%       99.89%
> Chameneos          -15.65%      -37.53%      -15.74%       99.53%
> Cpsa                -0.02%      -58.13%       -5.25%       91.32%
> Crypto              -0.59%      -47.97%      -27.08%       52.18%
> FFT2d               +2.09%      -33.66%       +0.30%       94.64%
> FFT3d               -1.58%      -12.19%       -1.88%       96.58%
> Fannkuch            -0.84%      -26.99%       -2.59%       92.64%
> Fgl                 -0.40%      -50.78%      -21.16%       63.74%
> Fst                 +0.32%      -66.43%      -13.21%       81.93%
> Funsat              -1.36%      -44.08%      -18.94%       65.30%
> Gf                  -5.38%      -44.43%      -17.56%       77.11%
> HaLeX               +3.77%      -66.52%       +1.13%       96.30%
> Happy               -0.98%      -59.06%      -25.67%       64.51%
> Hgalib              -2.45%      -44.33%       -5.96%       91.67%
> Laplace             +0.43%      -23.42%       -0.63%       95.09%
> MMult               +1.04%      -13.62%       +0.48%       95.34%
> Mandelbrot          +0.06%      -17.29%       +0.03%       99.78%
> Nbody               -0.69%      -18.18%       -0.82%       98.99%
> Palindromes         -2.83%      -82.54%      -52.78%       57.72%
> Pappy               +0.17%      -44.32%      -38.64%       34.84%
> Pidigits            +0.17%      -57.56%      -11.34%       81.62%
> QuickCheck          +0.36%      -50.14%       -6.52%       87.62%
> Regex               -1.14%      -35.26%       -2.78%       94.79%
> Simgi               +1.39%      -41.70%      -10.15%       74.64%
> SpectralNorm        +0.06%         ----       +0.06%      100.00%
> TernaryTrees        +1.59%      -48.39%      -23.62%       58.03%
> Xsact               -0.72%      -61.65%      -28.25%       63.44%
> -----------------------------------------------------------------
> Min                -15.65%      -82.54%      -52.78%       34.84%
> Mean                -0.85%      -42.21%      -12.19%       81.90%
> Max                 +3.77%       -8.57%       +1.13%      100.00%
>
>
> For the nofib results, a positive number means the llvm-gcc version was slower.
>
> NoFib Results
> ------------------------------------------------------------------------------
>          Program           Size    Allocs   Runtime   Elapsed  TotalMem
> ------------------------------------------------------------------------------
>          circsim          +0.0%     +0.0%    +22.5%    +21.2%     -0.2%
>      constraints          +0.0%     +0.0%    +39.4%    +38.3%     +0.0%
>           fulsom          +0.0%     +0.0%    +23.7%    +22.2%     +7.1%
>         gc_bench          +0.1%     +0.0%    +68.7%    +67.8%     +0.3%
>            happy          +0.1%     +0.0%    +14.8%    +14.4%     +0.0%
>             lcss          +0.1%     +0.0%    +34.3%    +31.6%     +0.0%
>        mutstore1          +0.0%     +0.0%    +41.3%    +35.6%     +0.0%
>        mutstore2          +0.0%     +0.0%    +24.3%    +23.4%     +0.0%
>            power          +0.0%     +0.0%    +34.6%    +35.1%     +0.0%
>       spellcheck          +0.1%     +0.0%    +11.8%    +11.9%     +0.0%
> ------------------------------------------------------------------------------
>              Min          +0.0%     +0.0%    +11.8%    +11.9%     -0.2%
>              Max          +0.1%     +0.0%    +68.7%    +67.8%     +7.1%
>   Geometric Mean          +0.0%     +0.0%    +30.7%    +29.3%     +0.7%
>
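
As a cross-check on the "Geometric Mean" row, the sketch below recomputes
it for the Runtime column. It assumes the row is the geometric mean of the
per-program ratios (so +22.5% counts as 1.225); the function and array
names are hypothetical. The n-th root is found by bisection so that the
example does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* x raised to a small non-negative integer power, by repeated multiplication. */
static double ipow(double x, size_t n) {
    double r = 1.0;
    for (size_t i = 0; i < n; i++)
        r *= x;
    return r;
}

/* Geometric mean of n percentage changes, returned as a percentage.
   Each +p% becomes the ratio 1 + p/100; the result is the n-th root of
   the product of those ratios, converted back to a percentage. */
double geometric_mean_percent(const double *pct, size_t n) {
    assert(n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        double ratio = 1.0 + pct[i] / 100.0;
        assert(ratio > 0.0);
        product *= ratio;
    }
    /* The root lies in [0, max(1, product)]; bisect on mid^n < product. */
    double lo = 0.0;
    double hi = product > 1.0 ? product : 1.0;
    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        if (ipow(mid, n) < product)
            lo = mid;
        else
            hi = mid;
    }
    return (0.5 * (lo + hi) - 1.0) * 100.0;
}

/* Runtime column of the NoFib table, in row order. */
const double nofib_runtime_pct[10] = {
    22.5, 39.4, 23.7, 68.7, 14.8, 34.3, 41.3, 24.3, 34.6, 11.8
};
```

Calling `geometric_mean_percent(nofib_runtime_pct, 10)` gives roughly 30.7,
which agrees with the +30.7% reported in the table.
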
> On Jul 1, 2011, at 2:45 PM, David Peixotto wrote:
>
>>
>> On Jul 1, 2011, at 2:05 PM, Simon Marlow wrote:
>>
>>> On 30/06/11 17:43, David Peixotto wrote:
>>>> I have made the changes necessary to compile GHC with llvm-gcc. The
>>>> major change was to use the pthread API for thread-local storage to
>>>> access the gct variable during garbage collection. My measurements
>>>> indicate this causes an average slowdown of about 5% for GC-heavy
>>>> programs. The changes are available from the `clang` branch on my
>>>> github fork.
>>>
>>> Sounds good.  One question: did you measure the GC performance with -threaded?  Because the thread-specific variable in the GC is only used with -threaded.
>>>
>>
>> Oops, I totally forgot about that :\ Those numbers were actually for the non-threaded runtime, so they don't measure the changes to the GC, just the difference in compiling with llvm-gcc. I'll rerun the benchmarks with -threaded. Sorry about that!
>>
>>
>> _______________________________________________
>> Cvs-ghc mailing list
>> Cvs-ghc at haskell.org
>> http://www.haskell.org/mailman/listinfo/cvs-ghc
>>
>
