GCC, Mac OS X & the future
David Peixotto
dmp at rice.edu
Wed Jul 6 19:08:11 CEST 2011
I tried a few alternative implementations and found that passing the gct variable as a function parameter in the garbage collector performed best, with an average execution time increase of 6% for gc-intense programs compared to a gcc-compiled version.
The patches for passing the gct variable as a parameter are available in the clang-param-pass branch here:
git://github.com/dmpots/ghc.git clang-param-pass
Because passing gct as a parameter is an invasive change, I tried a few other techniques first.
First, I tried changing the gct definition to call pthread_getspecific directly (instead of going through getThreadLocalVar), but that only improved performance by a few percent. The call to pthread_getspecific was still incurring the dynamic linking overhead.
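For illustration, a minimal sketch of what "call pthread_getspecific directly" could look like. This is not the code from the patch: the gc_thread layout, the helper names init_gct_key, set_gct and current_thread_index, and the way gctKey is created are all assumptions made up for the example. Only pthread_key_create, pthread_setspecific and pthread_getspecific are real POSIX calls.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for GHC's gc_thread structure. */
typedef struct gc_thread_ {
    int thread_index;
} gc_thread;

/* Assumed key; it must be created once before any GC thread uses gct. */
static pthread_key_t gctKey;

/* Every use of `gct` expands to a direct pthread_getspecific call,
 * with no intermediate wrapper function. */
#define gct ((gc_thread *)pthread_getspecific(gctKey))

/* Create the key (hypothetical helper); returns 0 on success. */
static int init_gct_key(void) {
    return pthread_key_create(&gctKey, NULL);
}

/* Bind a gc_thread to the calling thread (hypothetical helper);
 * returns 0 on success. */
static int set_gct(gc_thread *t) {
    return pthread_setspecific(gctKey, t);
}

/* Example of GC code reading through the macro. */
static int current_thread_index(void) {
    return gct->thread_index;
}
```

Each use of gct here is still an out-of-line call into libc, which is consistent with the small improvement reported above.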
I then tried creating my "own" pthread_getspecific function that just contains the inline assembly to read the value as done in the pthread function. With this change I see the overhead drop to around 9% over the non-llvm-gcc version. The code for accessing the gct looked something like this:
static inline gc_thread* __gct(void) {
    gc_thread *gct_tls;
    __asm__("movq %%gs:0x60(,%[key],8),%[gct_tls]"
            : [gct_tls] "=r" (gct_tls)
            : [key] "r" (gctKey));
    return gct_tls;
}
#define gct (__gct())
#define DECLARE_GCT /* nothing */
In all the cases I saw, the __gct function was getting inlined correctly. Because I'm directly inlining the assembly from the pthread function, I'm not sure how portable this is across Mac OS X versions, and I'm pretty sure it wouldn't work on Linux.
Finally, I tried changing the GC to pass the gct variable as a parameter which further reduces the performance difference so that the llvm-gcc version is about 6% slower than the gcc version on the nofib gc benchmarks and 3.5% slower on the fibon benchmarks. My initial (accidental) measurements with the non-threaded runtime showed similar numbers, so part of this overhead is just the difference in code generation between llvm and gcc.
To support both passing gct as a parameter and accessing it as a global variable, I added some macros that can be used with GC functions that access (or call functions that access) the gct. These macros add the gct as an extra parameter to the function if the PASS_GCT_AS_PARAM macro is defined. They are used like this:
// declaration
void someGcFunc(DECLARE_GCT_PARAM(orig_param_list))
// call site
someGcFunc(GCT_PARAM(orig_params))
It's a bit ugly to look at, but I couldn't think of a nice way to support both ways of accessing the gct.
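As a rough sketch of how such macros could be defined (an assumption for illustration, not necessarily how the branch defines them), variadic macros can prepend gct to the original parameter list. The gc_thread layout and the functions copyWords, someGcFunc and runGc below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for GHC's gc_thread structure. */
typedef struct gc_thread_ {
    int copied;
} gc_thread;

/* Comment this out to fall back to the global-variable form. */
#define PASS_GCT_AS_PARAM 1

#ifdef PASS_GCT_AS_PARAM
/* Prepend gct to the declared parameters and to the call arguments. */
#define DECLARE_GCT_PARAM(...) gc_thread *gct, __VA_ARGS__
#define GCT_PARAM(...)         gct, __VA_ARGS__
#else
/* Stand-in for the global (register or thread-local) gct. */
static gc_thread *gct;
#define DECLARE_GCT_PARAM(...) __VA_ARGS__
#define GCT_PARAM(...)         __VA_ARGS__
#endif

/* Leaf function that reads gct. */
static void copyWords(DECLARE_GCT_PARAM(int nwords)) {
    gct->copied += nwords;
}

/* Does not use gct itself, but must forward it to its callee. */
static void someGcFunc(DECLARE_GCT_PARAM(int nblocks, int words_per_block)) {
    for (int i = 0; i < nblocks; i++)
        copyWords(GCT_PARAM(words_per_block));
}

/* Entry point: the only place that knows where gct comes from. */
static int runGc(gc_thread *t, int nblocks, int words_per_block) {
#ifdef PASS_GCT_AS_PARAM
    gc_thread *gct = t;   /* local that the macros pass down */
#else
    gct = t;              /* set the global once */
#endif
    someGcFunc(GCT_PARAM(nblocks, words_per_block));
    return t->copied;
}
```

With this shape, a function taking no other parameters would need a separate macro variant, since the expansion would otherwise leave a trailing comma.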
On Jul 3, 2011, at 11:23 AM, David Peixotto wrote:
>
> On Jul 2, 2011, at 2:42 PM, Simon Marlow wrote:
>> On 02/07/11 19:34, David Peixotto wrote:
>>
>>> The best we could hope for would be for an access of `gct` to turn
>>> into something like this in the GC:
>>>
>>> movq (%rdi),%rdi #deref the key which is an index into the tls memory
>>> movq %gs:0x00000060(,%rdi,8),%rax # read the value indexed by the key
>>>
>>> but it looks like we are getting something like this:
>>>
>>> call getThreadLocalVar
>>> movq (%rdi),%rdi #deref the key which is an index into the tls memory
>>> jmp <dynamic_linker_stub>
>>> movq %gs:0x00000060(,%rdi,8),%rax #pthread_getspecific body
>>> ret
>>
>> you don't need to go through getThreadLocalVar, right? Just call pthread_getspecific directly.
>
> Yeah, I can change it to be a direct call to pthread_getspecific. I was just trying to reuse the existing GHC api for thread local storage, and I thought the call would be inlined away.
>
>> I don't know why it's going through the dynamic linker stub, I thought it was supposed to be #defined to the inline assembly.
>
> I can't see any obvious definition in the header files on my machine. The definition I found is an assembly file that is part of Apple's libc implementation:
>
> http://www.opensource.apple.com/source/Libc/Libc-594.9.5/x86_64/pthreads/pthread_getspecific.s
>
> This definition seems to match what I see when I debug an executable in gdb.
>
>> Anyway, the last resort will be to pass gct as a parameter to the critical functions in the GC - scavenge_block() and everything it calls transitively, including evacuate(). This is likely to give quite good performance, but not as good as a register variable, so unfortunately we'll need some #ifdefery or macros which will be quite ugly (hence why I say this is a last resort).
>
> Ok, hopefully we won't have to resort to that, but I'm not too optimistic at this point. If we are actually stuck dealing with the dynamic linker for pthread_getspecific then the overhead is going to probably be too high.
>