cvs commit: fptools/ghc/rts Capability.c (resend)

Peter Tanski peter_tanski at cox.net
Wed Oct 18 10:28:42 EDT 2006


[NOTE: this was sent to Simon Marlow and Simon Peyton Jones, and
erroneously to cvs-ghc-request at haskell.org]

Great job!  I had worked on this problem for some time, but I did not
have enough experience with the source code, or with backtracing from
assembler through Stg, to find the exact problem.  Would you have time
to answer a few questions?

(1) How do I obtain the latest 6.4.3 release?  It is no longer in
CVS, and the darcs branch, http://darcs.haskell.org/ghc.ghc-6.4, seems
to have been last updated in January.  I had been working from the
6.4.2 source snapshot at
http://www.haskell.org/ghc/dist/6.4.2/ghc-6.4.2-src.tar.bz2.
Is 6.4.3 a darcs tag?

(2) To get a debug build of ghc-6.4.2 working I had to modify the file
ghc/compiler/nativeGen/RegisterAlloc.hs by adding a deriving
declaration:

ghc/compiler/nativeGen/RegisterAlloc.hs:158
 > data FreeRegs = FreeRegs !Word32 !Word32
+				deriving (Show)

This fix was in the 6.6 branch.  Is it also now in the 6.4.3 branch?

(3) I cheated and modified the ghc script that invokes the
executable ..lib/ghc-6.4.2/ghc-6.4.2, inserting a gdb invocation
directly after exec (I was compiling Crypto with the original Cabal
setup and didn't want to resort to makefiles):

>  # Mini-driver for GHC
>  exec gdb --args $GHCBIN  $TOPDIROPT ${1+"$@"}

Is there a better way to go about this?

(4) Would you please elaborate on the problem and the fix?  The  
problems consistently showed up in ghc/rts/GC.c:threadSqueezeStack,  
in the variable frame (note: comments *follow* code):

(gdb) disas threadSqueezeStack
...
0x00c24678 <threadSqueezeStack+16>:     mr      r24,r3		
									; tso, tso
0x00c2467c <threadSqueezeStack+20>:     addi    r0,r3,56	
									; r0 = &tso->stack[0]
0x00c24680 <threadSqueezeStack+24>:     lwz     r2,44(r3)	
									; <variable>.stack_size
0x00c24684 <threadSqueezeStack+28>:     rlwinm  r2,r2,2,0,29
									; stack_size << 2 (words to bytes)
0x00c24688 <threadSqueezeStack+32>:     add     r25,r0,r2	
									; bottom, tso->stack[0], stack_size
0x00c2468c <threadSqueezeStack+36>:     lwz     r31,52(r3)	
									; <variable>.sp (tso->sp)
0x00c24690 <threadSqueezeStack+40>:     cmplw   cr7,r25,r31	
									; assert(frame < stack)
0x00c24694 <threadSqueezeStack+44>:     bgt+    cr7,0xc246a8
									; <threadSqueezeStack+64>
0x00c24698 <threadSqueezeStack+48>:     lis     r3,211
0x00c2469c <threadSqueezeStack+52>:     addi    r3,r3,304
0x00c246a0 <threadSqueezeStack+56>:     li      r4,4356
0x00c246a4 <threadSqueezeStack+60>:     bl      0xcb9758
									; <_assertFail>
0x00c246a8 <threadSqueezeStack+64>:     addi    r29,r31,-8	
									; gap, <variable>.sp,
0x00c246ac <threadSqueezeStack+68>:     li      r23,0		
									; updatee,
0x00c246b0 <threadSqueezeStack+72>:     li      r27,0		
									; prev_was_update_frame,
0x00c246b4 <threadSqueezeStack+76>:     li      r28,0		
									; current_gap_size,
0x00c246b8 <threadSqueezeStack+80>:     mr      r11,r31
0x00c246bc <threadSqueezeStack+84>:     lwz     r2,0(r31)	
									; <variable>.header.info, D.xxxx
0x00c246c0 <threadSqueezeStack+88>:     addi    r9,r2,-12	
									; info, <variable>.header.info
0x00c246c4 <threadSqueezeStack+92>:     lhz     r2,8(r9)	
; crash point: <variable>.i.type, r9=0xfffffffc, sometimes r9=0x70000100
; 8(r9) overflows to 0x00000004
; NOTE: 8 in 8(r9) derived from:
;	sizeof((StgInt)srt_offset) + sizeof((StgClosureInfo)layout)

Sorry if my comments seem pedantic; I only started seriously learning
assembler in August, and I relied partly on gcc -S -fverbose-asm for
help.  Over many runs of the Crypto build, the assert (frame < stack)
failed a few times, so I also followed the registers (r31, r2 and r9)
through the functions used in other threads.  (The problem was in r9.)
After that it was a matter of tracing back.

(5) One other avenue I was exploring was the use of Zero Length  
Arrays (ZLA's) and potential gcc bugs (a few of this sort have been  
noticed in gcc-3.3 through 4.0).  Why do you use ZLA's in the code?   
The reasons not to are:

	a. ZLA's are largely supported by GNU extensions.
	As noted in the GCC manual, Section 5.12, at
http://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html : "A structure
containing a flexible array member, or a union containing such a
structure (possibly recursively), may not be a member of a structure
or an element of an array. (However, these uses are permitted by GCC
as extensions.)"
	You are therefore forced to include structures containing ZLA's as
pointers-to-structures, for example:

> # 280 "ghc/includes/InfoTables.h"
> typedef struct _StgInfoTable {
>     StgClosureInfo layout;
>     StgHalfWord type;
>     StgHalfWord srt_bitmap;
>     StgCode code[];
> } StgInfoTable;
> # 44 "ghc/includes/Closures.h"
> typedef struct {
>  const struct _StgInfoTable* info;
> } StgHeader;
>

	b. the C sizeof() operator does not count the ZLA when reporting the
size of a structure that contains one, so sizeof(StgInfoTable) reports
8, not 12, although the gcc compiler correctly produces the assembler
for manipulating such a structure:

0x00c246c0 <threadSqueezeStack+88>:     addi    r9,r2,-12	
	; (((StgRetInfoTable *)(((StgClosure *)frame)->header.info) - 1))
	; the -12 is the size of StgRetInfoTable

So macros such as ghc/includes/TSO.h:TSO_STRUCT_SIZE can't simply be  
defined as


> #define TSO_STRUCT_SIZE sizeof(StgTSO)

and gdb has trouble accessing the members of the structure:

(gdb) p *tso
$30 = {
   header = {
     info = 0xaf5ff0
   },
   link = 0xded718,
   mut_link = 0xded71c,
   global_link = 0x2bce000,
   what_next = 1,
   why_blocked = 11,
   block_info = {
     closure = 0x0,
     tso = 0x0,
     fd = 0,
     target = 0
   },
   blocked_exceptions = 0xded718,
   id = 2,
   saved_errno = 25,
   main = 0x0,
   trec = 0xded714,
   stack_size = 242,
   max_stack_size = 2080754,
   sp = 0x2bddc78
}
// note the lack of member tso.stack
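
Points (a) and (b) can be reproduced outside the RTS with a small
sketch.  The two structures below are hypothetical stand-ins (fixed
width integer types replace the Stg typedefs), and thread_header_size
and thread_state_new are illustrative names, not RTS functions.  The
sketch shows that the trailing array contributes nothing to sizeof,
that offsetof gives the header size without any tail padding, and that
such an object is allocated on the heap and held by pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for StgInfoTable on a 32-bit target. */
struct info_table {
    uint32_t layout;
    uint16_t type;
    uint16_t srt_bitmap;
    unsigned char code[];   /* trailing array: adds nothing to sizeof */
};

/* Hypothetical TSO-like object: fixed header, then a variable stack. */
struct thread_state {
    uint64_t id;
    uint32_t stack_size;    /* number of stack words */
    uint32_t stack[];
};

/* Header size measured up to the array, excluding any tail padding
 * that sizeof(struct thread_state) may include. */
size_t thread_header_size(void) {
    return offsetof(struct thread_state, stack);
}

/* Allocate a zeroed header plus n stack words.  The object cannot be
 * embedded in another struct, so callers keep a pointer to it. */
struct thread_state *thread_state_new(uint32_t n) {
    struct thread_state *t =
        calloc(1, sizeof *t + (size_t)n * sizeof t->stack[0]);
    if (t != NULL) {
        t->stack_size = n;
        assert(thread_header_size() <= sizeof *t);
    }
    return t;
}
```

On targets that align uint64_t to 8 bytes, sizeof(struct thread_state)
is larger than thread_header_size() because of tail padding; that gap
is one reason a header-size macro might be computed from the member
offset rather than from sizeof.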

Finally, if there are alignment issues, wouldn't that be better  
controlled explicitly through pragmas?

Please don't think I am being critical here: I just don't know enough  
to understand your reasons.

-Pete

On 2006/10/16 06:50:02 PDT Simon Marlow wrote:


>  Modified files:        (Branch: ghc-6-4-branch)
>     ghc/rts              Capability.c
>   Log:
>   Fix crash in the threaded RTS caused by spurious wakeups of
>   pthread_cond_wait().  This is certainly affecting the threaded RTS in
>   6.4.x on Solaris, and possibly other platforms too.  I'm currently
>   testing to see whether there are any further problems on Solaris, but
>   with luck this may be the final fix for the threaded RTS problems in
>   the 6.4.x branch.
>
>   Does not affect 6.6; the corresponding code in 6.6 is already
>   spurious-wakeup-safe.
>
>   Revision  Changes    Path
>   1.31.6.2  +32 -7     fptools/ghc/rts/Capability.c
>
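
For context on the failure mode named in the log: POSIX allows
pthread_cond_wait() to return even when no thread has signalled the
condition, so a waiter has to re-check its predicate in a loop.  The
sketch below shows that general pattern only.  It is not the change
made to Capability.c, and struct gate and its functions are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical example type; not a structure from the RTS. */
struct gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;  /* the predicate the waiter cares about */
};

void gate_init(struct gate *g) {
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
}

/* Set the predicate under the lock, then wake one waiter. */
void gate_open(struct gate *g) {
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

/* Block until the gate is open.  The while loop (rather than an if)
 * is what makes this safe against spurious wakeups: a return from
 * pthread_cond_wait() is treated only as a hint to re-check. */
void gate_wait(struct gate *g) {
    pthread_mutex_lock(&g->lock);
    while (!g->ready) {
        pthread_cond_wait(&g->cond, &g->lock);
    }
    assert(g->ready);
    pthread_mutex_unlock(&g->lock);
}
```

A waiter written with "if (!g->ready)" instead of the loop would carry
on after a spurious return with the predicate still false, which is the
kind of crash the log describes.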

