cvs commit: fptools/ghc/rts Main.c
Ken Shan
ken@digitas.harvard.edu
Sat, 4 Aug 2001 01:14:53 -0400
I have questions for all you seasoned GHC developers:
On 2001-07-26T02:26:34-0700, Julian Seward (Intl Vendor) wrote:
> Sounds like a potential memory management/corruption bug.
> Do you have an example program which causes this to happen
> on Alpha, so we can see if it can be repro'd on other plats?
(See additional quoted message for original commit log.)
The original problem was simply that the conc004 test
(ghc/tests/concurrent/should_run/conc004.hs) fails on alpha-dec-osf3
(my homegrown version with patches on top of ghc-5.00.2), but not on
i686-pc-linux-gnu (the distributed version ghc-5.00.2). I thought I
had magically fixed it by adding the initialization call to tzset() in
my commit. It did allow conc004 to complete successfully on
alpha-dec-osf3.
As it recently turned out, it doesn't seem to have fixed the bug. Once in a
while, when using the abovementioned version of ghc-5.00.2 on
alpha-dec-osf3 to compile GHC itself, I still get a segmentation
fault. It happens in Schedule.c: Right after the
switch (cap->rCurrentTSO->what_next) {
...
}
there's the innocuous line
t = cap->rCurrentTSO;
but cap is null by then, not &MainRegTable anymore! It appears that
gcc put the cap variable in register $s3 (same as the register used
for Hp), but StgRun is supposed to have saved $s3 on the stack and
restored it on return.
I'm not sure what's going on here.
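One way to narrow this down might be to keep a second copy of cap that the compiler cannot hold in a register, and compare the two right after the call that runs Haskell code. The following is only a minimal sketch of that idea, not the real RTS code: the struct layouts, run_thread() and check_cap_preserved() are hypothetical stand-ins (run_thread() plays the role of the call into StgRun).

```c
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real RTS definitions. */
typedef struct { int what_next; } TSO;
typedef struct { TSO *rCurrentTSO; } Capability;

static Capability MainRegTable;

/* Stand-in for the call that runs Haskell code (StgRun in the real RTS). */
static void run_thread(Capability *cap)
{
    cap->rCurrentTSO->what_next = 1;
}

/* Returns 1 if cap still has its original value after the call, 0 otherwise. */
int check_cap_preserved(void)
{
    static TSO tso;
    Capability *cap = &MainRegTable;
    /* volatile makes the compiler reload this copy from memory on each
     * access instead of keeping it in a (callee-saved) register. */
    Capability * volatile cap_saved = cap;

    cap->rCurrentTSO = &tso;
    run_thread(cap);

    if (cap != cap_saved) {
        fprintf(stderr, "cap clobbered: %p, expected %p\n",
                (void *)cap, (void *)cap_saved);
        return 0;
    }
    return 1;
}
```

If the check fires, the register holding cap was changed across the call; if it never fires but the crash persists, the corruption presumably happens somewhere else.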
I then discovered sanity checking, so I started running ghc-inplace
with "+RTS -D128 -RTS". It would always seg fault:
puffin:~/u/glasgow/puffin2/ghc/lib/std$ gdb ../../compiler/ghc-5.01 core
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "alphaev56-dec-osf4.0e"...
Core was generated by `ghc-5.01'.
Program terminated with signal 11, Segmentation fault.
#0 0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
220 ASSERT(!closure_STATIC(p));
(gdb) bt
#0 0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
#1 0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
#2 0x120d611b4 in checkSanity () at Storage.c:736
#3 0x120d6b850 in GarbageCollect (get_roots=0x120d5e470 <GetRoots>,
force_major_gc=304) at GC.c:923
#4 0x120d5db40 in schedule () at Schedule.c:1222
#5 0x120d5e3cc in waitThread (tso=0x1800bc000, ret=0x0) at Schedule.c:1956
#6 0x120d69850 in rts_evalIO (p=0x140098620, ret=0x0) at RtsAPI.c:421
#7 0x120d58ae0 in main (argc=18, argv=0x11fffc018) at Main.c:120
(Note: HEAP_BASE is 0x180000000.) I don't know what we mean by "slop"
in the GC/Sanity code, but at first glance this looks like slop:
(gdb) p p
$1 = (StgClosure *) 0x183bae1c8
(gdb) p *p
$2 = {header = {info = 0x42000001}, payload = 0x183bae1d0}
(gdb) p *((StgSelector*)p)
$3 = {header = {info = 0x42000001}, selectee = 0x120849770}
(gdb) up
#1 0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
472 nat size = checkClosure((StgClosure *)p);
(gdb) p *bd
$4 = {start = 0x183bae000, free = 0x183baefe8, link = 0x183b02940, u = {
back = 0x0, bitmap = 0x0}, gen_no = 1, step = 0x140200000, blocks = 1,
flags = 0, _padding = {0, 0}}
Is this a real sanity problem, or a problem with the sanity check
code? Is there another strategy that anyone would suggest for
tracking down the original segfault-in-scheduler bug?
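For what it's worth, the info field above (0x42000001) is an odd number, so it is not word-aligned. A cheap plausibility test on the header word before dereferencing it might help tell "garbage in the heap" apart from "bug in the checker". Below is a minimal sketch under stated assumptions: AddrRange and plausible_info_ptr() are hypothetical names, the 8-byte alignment assumes a 64-bit target, and the address range in the usage note is made up for illustration rather than taken from the RTS.

```c
#include <stdint.h>

/* Hypothetical bounds of the region where info tables may live.
 * In practice these would have to come from the linker or the RTS. */
typedef struct {
    uint64_t lo;  /* inclusive lower bound */
    uint64_t hi;  /* exclusive upper bound */
} AddrRange;

/* Returns 1 if w could plausibly be an info pointer, i.e. it is
 * 8-byte aligned (assumption: 64-bit target) and lies inside range;
 * returns 0 otherwise. It does not prove that w is valid. */
int plausible_info_ptr(uint64_t w, AddrRange range)
{
    if (w % 8 != 0) {
        return 0;  /* misaligned, cannot be a word-aligned table address */
    }
    return w >= range.lo && w < range.hi;
}
```

With an illustrative range of 0x120000000 to 0x121000000, plausible_info_ptr(0x42000001, range) returns 0 on the alignment test alone, whereas an aligned address inside the range, such as 0x120849770, passes.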
The original commit:
> | ken 2001/07/25 20:20:52 PDT
> |
> | Modified files:
> | ghc/rts Main.c
> | Log:
> | Added to main():
> |
> | /*
> | * Believe it or not, calling tzset() at startup seems to get rid of
> | * a scheduler-related Heisenbug on alpha-dec-osf3. The symptom of
> | * the bug is that, when the load on the machine is high or when
> | * there are many threads, the variable "Capability *cap" in the
> | * function "schedule" in the file "Schedule.c" magically becomes
> | * null before the line "t = cap->rCurrentTSO;". Why, and why does
> | * calling tzset() here seem to fix it? Excellent questions!
> | */
> | tzset();
> |
> | Revision Changes Path
> | 1.28 +16 -1 fptools/ghc/rts/Main.c
-- 
Edit this signature at http://www.digitas.harvard.edu/cgi-bin/ken/sig
"To what shall the character of utility be ascribed , if not to that which
is a source of pleasure?" - Jeremey Bentham