[GHC] #7602: Threaded RTS performing badly on recent OS X (10.8?)

GHC cvs-ghc at haskell.org
Thu Jan 17 22:17:24 CET 2013


#7602: Threaded RTS performing badly on recent OS X (10.8?)
---------------------------------+------------------------------------------
    Reporter:  simonmar          |       Owner:                  
        Type:  bug               |      Status:  new             
    Priority:  normal            |   Milestone:  _|_             
   Component:  Runtime System    |     Version:  7.6.1           
    Keywords:                    |          Os:  Unknown/Multiple
Architecture:  Unknown/Multiple  |     Failure:  None/Unknown    
  Difficulty:  Unknown           |    Testcase:                  
   Blockedby:                    |    Blocking:                  
     Related:                    |  
---------------------------------+------------------------------------------

Comment(by thoughtpolice):

 I don't think so, or at least it doesn't in my trivial case:

 {{{
 #include <stdio.h>
 #include <stdlib.h>

 __thread int foo;

 int main(int ac, char* av[]) {
   if (ac < 2) foo = 10;
   else foo = atoi(av[1]);

   printf("foo = %d\n", foo);

   return 0;
 }
 }}}

 On Mac OS X 10.8, with Clang 3.2, I can compile this with no special
 options. Disassembling, we see:

 {{{
 $ lldb ./a.out
 (lldb) disassemble -m -n main
 a.out`main at tls.c:6
    5
    6    int main(int ac, char* av[]) {
    7      if (ac < 2) foo = 10;
 a.out[0x100000eb0]:  pushq  %rbp
 a.out[0x100000eb1]:  movq   %rsp, %rbp
 a.out[0x100000eb4]:  subq   $48, %rsp
 a.out[0x100000eb8]:  movl   $0, -4(%rbp)
 a.out[0x100000ebf]:  movl   %edi, -8(%rbp)
 a.out[0x100000ec2]:  movq   %rsi, -16(%rbp)
 a.out`main + 22 at tls.c:7
    6    int main(int ac, char* av[]) {
    7      if (ac < 2) foo = 10;
    8      else foo = atoi(av[1]);
 a.out[0x100000ec6]:  cmpl   $2, -8(%rbp)
 a.out[0x100000ecd]:  jge    0x100000ee7               ; main + 55 at tls.c:8
 a.out[0x100000ed3]:  leaq   326(%rip), %rdi           ; foo
 a.out[0x100000eda]:  callq  *(%rdi)
 a.out[0x100000edc]:  movl   $10, (%rax)
 a.out[0x100000ee2]:  jmpq   0x100000f05               ; main + 85 at tls.c:8
 a.out`main + 55 at tls.c:8
    7      if (ac < 2) foo = 10;
    8      else foo = atoi(av[1]);
    9
 a.out[0x100000ee7]:  movq   -16(%rbp), %rax
 a.out[0x100000eeb]:  movq   8(%rax), %rdi
 a.out[0x100000eef]:  callq  0x100000f36               ; symbol stub for: atoi
 a.out[0x100000ef4]:  leaq   293(%rip), %rdi           ; foo
 a.out[0x100000efb]:  movl   %eax, -20(%rbp)
 a.out[0x100000efe]:  callq  *(%rdi)
 a.out[0x100000f00]:  movl   -20(%rbp), %ecx
 a.out[0x100000f03]:  movl   %ecx, (%rax)
 a.out[0x100000f05]:  leaq   92(%rip), %rdi            ; "foo = %d\n"
 a.out`main + 92 at tls.c:10
    9
    10     printf("foo = %d\n", foo);
    11
 a.out[0x100000f0c]:  movq   %rdi, -32(%rbp)
 a.out[0x100000f10]:  leaq   265(%rip), %rdi           ; foo
 a.out[0x100000f17]:  callq  *(%rdi)
 a.out[0x100000f19]:  movl   (%rax), %esi
 a.out[0x100000f1b]:  movq   -32(%rbp), %rdi
 a.out[0x100000f1f]:  movb   $0, %al
 a.out[0x100000f21]:  callq  0x100000f3c               ; symbol stub for: printf
 a.out[0x100000f26]:  movl   $0, %esi
 a.out`main + 123 at tls.c:12
    11
    12     return 0;
    13   }
 a.out[0x100000f2b]:  movl   %eax, -36(%rbp)
 a.out[0x100000f2e]:  movl   %esi, %eax
 a.out[0x100000f30]:  addq   $48, %rsp
 a.out[0x100000f34]:  popq   %rbp
 a.out[0x100000f35]:  ret

 (lldb) ^D
 }}}

 In the original post, David says that we basically get code like:

 {{{
     call getThreadLocalVar
     movq    (%rdi),%rdi #deref the key which is an index into the tls memory
     jmp <dynamic_linker_stub>
     movq    %gs:0x00000060(,%rdi,8),%rax #pthread_getspecific body
     ret
 }}}

 where the biggest penalty is the jump into dyld to do linking for the
 stub. This code does still exist in the latest implementation of Apple's
 libc:

 http://www.opensource.apple.com/source/Libc/Libc-594.9.1/pthreads/pthread_machdep.h

 (Look at the __OPTIMIZE__ implementation.)

 However, Clang on OS X seems to directly avoid this? I'm not sure why the
 offsets of ```leaq``` for ```foo``` seem to decrease for every access...

 I attempted to look through the LLVM source code for specific notes about
 this, but the new TLS support is of course deeply ingrained in the new
 release, so it's hard to point out any one thing about this behavioral
 change.

 I'll investigate this more over the next few days and look at disassembly
 outputs. We should be able to see pretty quickly whether this buys us
 anything at all.

 We don't use TLS for x86, only register variables, correct? If so, then
 this still leaves 32-bit OS X users up a creek a bit, but Apple and the
 community seem to be largely moving away from 32-bit anyway.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7602#comment:4>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
