cvs commit: fptools/ghc/rts Main.c
Simon Marlow
simonmar@microsoft.com
Mon, 6 Aug 2001 10:57:15 +0100
> I have questions for all you seasoned GHC developers:
>
> On 2001-07-26T02:26:34-0700, Julian Seward (Intl Vendor) wrote:
> > Sounds like a potential memory management/corruption bug.
> > Do you have an example program which causes this to happen
> > on Alpha, so we can see if it can be repro'd on other plats?
>
> (See additional quoted message for original commit log.)
>
> The original problem was simply that the conc004 test
> (ghc/tests/concurrent/should_run/conc004.hs) fails on alpha-dec-osf3
> (my homegrown version with patches on top of ghc-5.00.2), but not
> i686-pc-linux-gnu (the distributed version ghc-5.00.2). I thought I
> had magically fixed it by adding the initialization call to tzset()
> in my commit. It did allow conc004 to complete successfully on
> alpha-dec-osf3.
>
> As it recently turned out, the fix doesn't seem to have held. Once in a
> while, when using the abovementioned version of ghc-5.00.2 on
> alpha-dec-osf3 to compile GHC itself, I still get a segmentation
> fault. It happens in Schedule.c: Right after the
>
> switch (cap->rCurrentTSO->what_next) {
> ...
> }
>
> there's the innocuous line
>
> t = cap->rCurrentTSO;
>
> but cap is null by then, not &MainRegTable anymore! It appears that
> gcc put the cap variable in register $s3 (same as the register used
> for Hp), but StgRun is supposed to have saved $s3 on the stack and
> restored it on return.
Ok, this looks like a classic case of the stack being clobbered
somewhere in STG land. The slot in which $s3 was saved is being
overwritten with a NULL somehow. I've taken a quick look at the code in
StgCRun for the Alpha, and it looks reasonable, so probably the best way
to track this down is to find a repeatable example and use a watchpoint
in GDB to find who's stomping on the stack.
> I'm not sure what's going on here.
>
> I then discovered sanity checking, so I started running ghc-inplace
> with "+RTS -D128 -RTS". It would always seg fault:
>
> puffin:~/u/glasgow/puffin2/ghc/lib/std$ gdb ../../compiler/ghc-5.01 core
> GNU gdb 4.18
> Copyright 1998 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "alphaev56-dec-osf4.0e"...
> Core was generated by `ghc-5.01'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
> 220 ASSERT(!closure_STATIC(p));
> (gdb) bt
> #0 0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
> #1 0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
> #2 0x120d611b4 in checkSanity () at Storage.c:736
> #3 0x120d6b850 in GarbageCollect (get_roots=0x120d5e470 <GetRoots>,
> force_major_gc=304) at GC.c:923
> #4 0x120d5db40 in schedule () at Schedule.c:1222
> #5 0x120d5e3cc in waitThread (tso=0x1800bc000, ret=0x0) at Schedule.c:1956
> #6 0x120d69850 in rts_evalIO (p=0x140098620, ret=0x0) at RtsAPI.c:421
> #7 0x120d58ae0 in main (argc=18, argv=0x11fffc018) at Main.c:120
>
> (Note: HEAP_BASE is 0x180000000.) I don't know what we mean by "slop"
> in the GC/Sanity code, but this seems to be slop at my first glance:
>
> (gdb) p p
> $1 = (StgClosure *) 0x183bae1c8
> (gdb) p *p
> $2 = {header = {info = 0x42000001}, payload = 0x183bae1d0}
> (gdb) p *((StgSelector*)p)
> $3 = {header = {info = 0x42000001}, selectee = 0x120849770}
> (gdb) up
> #1 0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
> 472 nat size = checkClosure((StgClosure *)p);
> (gdb) p *bd
> $4 = {start = 0x183bae000, free = 0x183baefe8, link = 0x183b02940, u = {
> back = 0x0, bitmap = 0x0}, gen_no = 1, step = 0x140200000, blocks = 1,
> flags = 0, _padding = {0, 0}}
>
> Is this a real sanity problem, or a problem with the sanity check
> code? Is there another strategy that anyone would suggest for
> tracking down the original segfault-in-scheduler bug?
It's hard to tell what the problem is, without seeing the contents of
the memory around p. I've attached my .gdbinit file which has a few
useful macros in it - in particular I like to use p4 & p8 which print
out the 4 or 8 words starting from the given address as addresses (i.e.
looking up symbols). So you would do 'p8 p-4' to print the 8 words
around p, for example. And 'pinfo' dumps an info table, so 'pinfo *p'
gives the info table of the closure pointed to by p.
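To make "print the N words starting from the given address" concrete, here is a rough C sketch of the idea. It is only an illustration: words_from is a made-up name, it is not the macro itself, and it copies the words into a caller-supplied buffer rather than printing them with symbol lookup the way GDB does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the RTS or of the .gdbinit file): copy
 * `count` machine words starting at `start` into `out`. Using uintptr_t
 * keeps the stride equal to the pointer size on 32-bit and 64-bit targets. */
static void words_from(const uintptr_t *start, size_t count, uintptr_t *out)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = start[i];
    }
}
```

With p pointing into an array of words, words_from(p - 4, 8, buf) fills buf with the four words before p and the four words from p onwards, which is the same window as 'p8 p-4'.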
The sanity check code is working fine on x86, BTW. It's probably an
incorrect 32-bit assumption somewhere.
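For what it's worth, this is the kind of thing I mean by a 32-bit assumption. It is a generic sketch, not a diagnosis of this particular bug, and both helper names are invented for the example. The address used in the usage note is the p from the backtrace quoted above.

```c
#include <stdint.h>

/* Hypothetical helper: the value an address has after being stored in a
 * 32-bit integer, as code written for x86 might do without noticing. */
static uint64_t via_32bit(uint64_t addr)
{
    return (uint64_t)(uint32_t)addr;
}

/* Hypothetical helper: address of the i-th element when stepping from
 * `base` in units of `elem_size` bytes. */
static uint64_t nth_element(uint64_t base, uint64_t i, uint64_t elem_size)
{
    return base + i * elem_size;
}
```

via_32bit(0x183bae1c8) gives 0x83bae1c8, which is below the HEAP_BASE of 0x180000000, so the truncated value no longer points into the heap. Likewise nth_element(0x183bae1c8, 1, 4) is 0x183bae1cc, the middle of the first 8-byte word, whereas a stride of 8 gives 0x183bae1d0, the payload address GDB printed.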
Cheers,
Simon
Attachment: gdb.init

define pregs
print *(StgRegTable *)$ebx
end

define ptso
print *((StgRegTable *)$ebx)->rCurrentTSO
end

define pR1
print (((StgRegTable)MainRegTable).rR1)
end
define pR2
print (((StgRegTable)MainRegTable).rR2)
end
define pR3
print (((StgRegTable)MainRegTable).rR3)
end
define pR4
print (((StgRegTable)MainRegTable).rR4)
end
define pR5
print (((StgRegTable)MainRegTable).rR5)
end
define pR6
print (((StgRegTable)MainRegTable).rR6)
end
define pR7
print (((StgRegTable)MainRegTable).rR7)
end
define pR8
print (((StgRegTable)MainRegTable).rR8)
end
define pFlt1
print (StgFloat) (((StgRegTable)MainRegTable).rFlt1)
end
define pDbl1
print (StgDouble) (((StgRegTable)MainRegTable).rDbl1)
end

define pSp
print (((StgRegTable)MainRegTable).rSp)
end
define pSu
print (((StgRegTable)MainRegTable).rSu)
end
define pSpLim
print (((StgRegTable)MainRegTable).rSpLim)
end

define pHp
print (((StgRegTable)MainRegTable).rHp)
end
define pHpLim
print (((StgRegTable)MainRegTable).rHpLim)
end

define pstk
pmem $ebp 16
end

define pstk_gc
pmem MainTSO->sp 16
end

define pmem
set $i = $arg1
while $i >= 0
x/1a (((int *)$arg0) +$i)
set $i = $i - 1
end
end

define p4
pmem $arg0 4
end

define p8
pmem $arg0 8
end

define p16
pmem $arg0 16
end

define pmem_forwards
set $i = 0
while $i < $arg1
x/1a (((int *)$arg0) + $i)
set $i = $i + 1
end
end

define pheap
pmem $edi-16 16
end

define dsi
display /i $pc
si
end

define pinfo
p *((StgInfoTable *)$arg0-1)
end

define pbdescr
p * ((bdescr *)((($arg0 & 0xfff00000) | (($arg0 & 0xff000) >> 7)) & 0xffffffe0))
end

define pgen
p generations[((bdescr *)((($arg0 & 0xfff00000) | (($arg0 & 0xff000) >> 7)) & 0xffffffe0))->gen_no]
p * ((bdescr *)((($arg0 & 0xfff00000) | (($arg0 & 0xff000) >> 7)) & 0xffffffe0))->step
end

define getmark
set $bd = (bdescr *)((($arg0 & 0xfff00000) | (($arg0 & 0xff000) >> 7)) & 0xffffffe0)
set $offset = (StgPtr)$arg0 - $bd->start
set $bitmap_word = $bd->u.bitmap + ($offset / 32)
set $mask = 1 << ($offset & 31)
p (*$bitmap_word & $mask) != 0
end

# ignore SIGPIPEs
handle SIGPIPE nostop noprint ignore