[GHC] #7602: Threaded RTS performing badly on recent OS X (10.8?)

GHC cvs-ghc at haskell.org
Sun Feb 10 04:07:34 CET 2013


#7602: Threaded RTS performing badly on recent OS X (10.8?)
---------------------------------+------------------------------------------
    Reporter:  simonmar          |       Owner:                  
        Type:  bug               |      Status:  new             
    Priority:  normal            |   Milestone:  _|_             
   Component:  Runtime System    |     Version:  7.6.1           
    Keywords:                    |          Os:  Unknown/Multiple
Architecture:  Unknown/Multiple  |     Failure:  None/Unknown    
  Difficulty:  Unknown           |    Testcase:                  
   Blockedby:                    |    Blocking:                  
     Related:                    |  
---------------------------------+------------------------------------------

Comment(by thoughtpolice):

 Alright, I think my patch is almost working, but in the meantime I've
 verified with a small snippet the behavior I think we want. Simon, can you
 please tell me if this approach would be OK?

 Essentially, the OS X C library reserves a small set of predefined TLS
 keys for various Apple-internal uses. There are about 100 of these special
 keys. With them, it's possible to use special inline variants of
 ```pthread_getspecific``` and ```pthread_setspecific``` that read and
 write directly at a fixed offset from the ```%gs``` segment base.
 Performance-wise, this should be very close to Linux's implementation.

 One of these Apple-internal users on modern OS X and its libc is WebKit.
 pthread has a specific range of keys (5 to be exact) dedicated to WebKit.
 These are used in JavaScriptCore's FastMalloc allocator for
 performance-critical sections - likely for their GC! But only a single key
 is used by WebKit at all, and I can find no references to it elsewhere on
 the internet.

 You can see this here:

 http://www.opensource.apple.com/source/Libc/Libc-825.25/pthreads/pthread_machdep.h

 This defines the inline get/set routines for special TLS keys. If you
 scroll down a little you can see the ```JavaScriptCore``` keys (keys 90-94
 to be exact.)

 Now, look here:

 http://code.google.com/codesearch#mcaWan7Aaio/trunk/WebKit-r115846/Source/WTF/wtf/FastMalloc.cpp&q=__PTK_FRAMEWORK_JAVASCRIPTCORE_KEY0&type=cs&l=453

 And you can see there's a special stubbed out ```pthread_getspecific```
 and ```pthread_setspecific``` routine for this exact purpose.

 Therefore, I propose we steal one of the high TLS keys that are dedicated
 to WebKit's JS engine for the GC. Unfortunately, ```pthread_machdep.h``` is
 not installed by default in modern variants of Xcode, so we must inline
 the definitions ourselves for the necessary architectures.

 The following example demonstrates the use of these special keys:

 {{{
 #include <stdio.h>
 #include <stdlib.h>

 #include <pthread.h>

 /** Snipped from pthread_machdep.h */
 #define __PTK_FRAMEWORK_JAVASCRIPTCORE_KEY4 94


 /* static, so C99 inline semantics cannot leave an undefined external symbol */
 __inline__ static void *
 _pthread_getspecific_direct(unsigned long slot) {
   void* ret;
 #if defined(__i386__) || defined(__x86_64__)
   __asm__("mov %%gs:%1, %0" : "=r" (ret) : "m" (*(void **)(slot * sizeof(void *))));
 #else
 #error "No definition of pthread_getspecific_direct!"
 #endif
   return ret;
 }


 /* To be used with static constant keys only */
 __inline__ static int
 _pthread_setspecific_direct(unsigned long slot, void * val)
 {
 #if defined(__x86_64__)
   /* PIC is free and cannot be disabled, even with: gcc -mdynamic-no-pic ... */
   __asm__("movq %1,%%gs:%0" : "=m" (*(void **)(slot * sizeof(void *))) : "rn" (val));
 #else
 #error "No definition of pthread_setspecific_direct!"
 #endif
   return 0;
 }

 /** End snippets */

 static const pthread_key_t fooKey =
   __PTK_FRAMEWORK_JAVASCRIPTCORE_KEY4;

 /* Cast through long so the pointer <-> int conversions are explicit */
 #define GET_FOO() ((int)(long)(_pthread_getspecific_direct(fooKey)))
 #define SET_FOO(to) (_pthread_setspecific_direct(fooKey, to))

 int main(int ac, char* av[]) {
   if (ac < 2) SET_FOO((void*)10);
   else SET_FOO((void*)(long)atoi(av[1]));

   printf("foo = %d\n", GET_FOO());

   return 0;
 }
 }}}

 This is pretty close to what the GC does now. Compiling it and
 disassembling ```main``` gives:

 {{{
 $ clang -O3 tls2.c
 $ lldb ./a.out
 Current executable set to './a.out' (x86_64).
 (lldb) disassemble -m -n main
 a.out`main
 a.out[0x100000ef0]:  pushq  %rbp
 a.out[0x100000ef1]:  movq   %rsp, %rbp
 a.out[0x100000ef4]:  cmpl   $1, %edi
 a.out[0x100000ef7]:  jg     0x100000f08               ; main + 24
 a.out[0x100000ef9]:  movq   $10, %gs:752
 a.out[0x100000f06]:  jmp    0x100000f1d               ; main + 45
 a.out[0x100000f08]:  movq   8(%rsi), %rdi
 a.out[0x100000f0c]:  callq  0x100000f38               ; symbol stub for: atoi
 a.out[0x100000f11]:  movslq %eax, %rax
 a.out[0x100000f14]:  movq   %rax, %gs:752
 a.out[0x100000f1d]:  movq   %gs:752, %rsi
 a.out[0x100000f26]:  leaq   59(%rip), %rdi            ; "foo = %d\n"
 a.out[0x100000f2d]:  xorb   %al, %al
 a.out[0x100000f2f]:  callq  0x100000f3e               ; symbol stub for: printf
 a.out[0x100000f34]:  xorl   %eax, %eax
 a.out[0x100000f36]:  popq   %rbp
 a.out[0x100000f37]:  ret
 (lldb) r
 Process 67488 launched: './a.out' (x86_64)
 foo = 10
 Process 67488 exited with status = 0 (0x00000000)
 (lldb) ^D
 $
 }}}
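
 As a sanity check on the disassembly above: the ```%gs:752``` operand is
 just the key number scaled by the pointer size, i.e. 94 * 8 on x86_64. A
 minimal sketch of that arithmetic follows; the names
 ```JSC_KEY4_SLOT```, ```tsd_slot_offset``` and
 ```check_jsc_key4_offset``` are made up for illustration and do not come
 from ```pthread_machdep.h```.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as __PTK_FRAMEWORK_JAVASCRIPTCORE_KEY4 in the snippet above.
 * The macro name here is hypothetical. */
#define JSC_KEY4_SLOT 94UL

/* Byte offset from the %gs base for a TLS slot, mirroring the
 * (slot * sizeof(void *)) expression in the inline accessors. */
static size_t tsd_slot_offset(unsigned long slot) {
    return (size_t)slot * sizeof(void *);
}

/* On a 64-bit target, slot 94 should land at the offset seen in the
 * disassembly (%gs:752). The check is skipped on other pointer sizes. */
static void check_jsc_key4_offset(void) {
    if (sizeof(void *) == 8)
        assert(tsd_slot_offset(JSC_KEY4_SLOT) == 752);
}
```

 Calling ```check_jsc_key4_offset()``` once at startup would catch a
 mismatch between the key number and the expected offset.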

 This will probably only work on modern versions of Xcode and OS X (10.8
 etc.). Older libcs have very different implementations of
 ```_pthread_setspecific_direct```, so this could be very wrong on older
 machines. I'm not sure how much older, so it would be awesome if any 10.7
 users could try this. The build system will need modifications to check
 for that and fall back to the much slower routines otherwise, I suppose.
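
 To make the fallback idea concrete, here is a rough sketch of how the
 selection could look at compile time. This is not the actual patch:
 ```USE_DARWIN_DIRECT_TLS``` is a hypothetical macro that the build system
 would define only after verifying the libc, and ```tls_init```,
 ```tls_get``` and ```tls_set``` are placeholder names. The slow path uses
 only the standard pthread key API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#if defined(USE_DARWIN_DIRECT_TLS) && defined(__APPLE__) && defined(__x86_64__)
/* Fast path: direct %gs-relative access to a reserved slot.
 * USE_DARWIN_DIRECT_TLS is a hypothetical build-system flag. */
#define FAST_TLS_SLOT 94UL

static inline int tls_init(void) { return 0; }  /* slot is preallocated */

static inline void *tls_get(void) {
    void *ret;
    __asm__("movq %%gs:%1, %0"
            : "=r"(ret)
            : "m"(*(void **)(FAST_TLS_SLOT * sizeof(void *))));
    return ret;
}

static inline void tls_set(void *val) {
    __asm__("movq %1, %%gs:%0"
            : "=m"(*(void **)(FAST_TLS_SLOT * sizeof(void *)))
            : "r"(val));
}
#else
/* Slow path: ordinary dynamically allocated pthread key. */
static pthread_key_t slow_key;

static inline int tls_init(void) {
    return pthread_key_create(&slow_key, NULL);  /* 0 on success */
}

static inline void *tls_get(void) {
    return pthread_getspecific(slow_key);
}

static inline void tls_set(void *val) {
    int rc = pthread_setspecific(slow_key, val);
    assert(rc == 0);
    (void)rc;  /* silence unused warning when NDEBUG is defined */
}
#endif
```

 With this shape, callers only see ```tls_get```/```tls_set```, so the
 choice between the two paths stays a build-time decision.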

 Simon, does this approach sound OK? I think it will recover the
 performance loss here and we can just go ahead and use Clang, which is the
 easiest for everybody I think.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7602#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler


