cvs commit: fptools/ghc/rts Main.c

Ken Shan ken@digitas.harvard.edu
Sat, 4 Aug 2001 01:14:53 -0400


I have questions for all you seasoned GHC developers:

On 2001-07-26T02:26:34-0700, Julian Seward (Intl Vendor) wrote:
> Sounds like a potential memory management/corruption bug.
> Do you have an example program which causes this to happen
> on Alpha, so we can see if it can be repro'd on other plats?

(See additional quoted message for original commit log.)

The original problem was simply that the conc004 test
(ghc/tests/concurrent/should_run/conc004.hs) fails on alpha-dec-osf3
(my homegrown version with patches on top of ghc-5.00.2), but not on
i686-pc-linux-gnu (the distributed version ghc-5.00.2).  I thought I
had magically fixed it by adding the initialization call to tzset() in
my commit.  It did allow conc004 to complete successfully on
alpha-dec-osf3.

As it recently turned out, the fix doesn't seem to have been complete.
Once in a while, when using the above-mentioned version of ghc-5.00.2
on alpha-dec-osf3 to compile GHC itself, I still get a segmentation
fault.  It happens in Schedule.c: Right after the

    switch (cap->rCurrentTSO->what_next) {
        ...
    }

there's the innocuous line

    t = cap->rCurrentTSO;

but cap is null by then, not &MainRegTable anymore!  It appears that
gcc put the cap variable in register $s3 (same as the register used
for Hp), but StgRun is supposed to have saved $s3 on the stack and
restored it on return.

I'm not sure what's going on here.
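
One way to narrow this down, sketched here under assumptions: keep a
second copy of the pointer in a volatile stack slot, which the compiler
has to reload from memory, and compare it with the ordinary (possibly
register-allocated) copy after the suspect call.  The names Cap,
thread_runner, noop_runner and check_cap_after_call are hypothetical
stand-ins for illustration, not the RTS's own definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the RTS Capability structure. */
typedef struct { void *rCurrentTSO; } Cap;

/* Hypothetical stand-in for a call such as StgRun.  A well-behaved
 * callee restores all callee-saved registers before returning. */
typedef void (*thread_runner)(Cap *);

/* A trivially well-behaved callee, used only to exercise the check. */
void noop_runner(Cap *cap)
{
    (void)cap;
}

/* Returns 1 if the local copy of cap survived the call, 0 if it changed. */
int check_cap_after_call(Cap *cap, thread_runner run)
{
    Cap *volatile shadow = cap;  /* volatile: must be re-read from memory */
    run(cap);
    return cap == shadow;        /* differs only if cap was clobbered */
}
```

If a check like this ever returned 0 around the real call, it would
point at the register save/restore path rather than at memory
corruption in the heap.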

I then discovered sanity checking, so I started running ghc-inplace
with "+RTS -D128 -RTS".  It would always segfault:

    puffin:~/u/glasgow/puffin2/ghc/lib/std$ gdb ../../compiler/ghc-5.01 core
    GNU gdb 4.18
    Copyright 1998 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "alphaev56-dec-osf4.0e"...
    Core was generated by `ghc-5.01'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
    220             ASSERT(!closure_STATIC(p));
    (gdb) bt
    #0  0x120d6fe98 in checkClosure (p=0x183bae1c8) at Sanity.c:220
    #1  0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
    #2  0x120d611b4 in checkSanity () at Storage.c:736
    #3  0x120d6b850 in GarbageCollect (get_roots=0x120d5e470 <GetRoots>,
        force_major_gc=304) at GC.c:923
    #4  0x120d5db40 in schedule () at Schedule.c:1222
    #5  0x120d5e3cc in waitThread (tso=0x1800bc000, ret=0x0) at Schedule.c:1956
    #6  0x120d69850 in rts_evalIO (p=0x140098620, ret=0x0) at RtsAPI.c:421
    #7  0x120d58ae0 in main (argc=18, argv=0x11fffc018) at Main.c:120

(Note: HEAP_BASE is 0x180000000.)  I don't know what we mean by "slop"
in the GC/Sanity code, but at first glance this looks like slop to me:

    (gdb) p p
    $1 = (StgClosure *) 0x183bae1c8
    (gdb) p *p
    $2 = {header = {info = 0x42000001}, payload = 0x183bae1d0}
    (gdb) p *((StgSelector*)p)
    $3 = {header = {info = 0x42000001}, selectee = 0x120849770}
    (gdb) up
    #1  0x120d70c78 in checkHeap (bd=0x183b02b80) at Sanity.c:472
    472                 nat size = checkClosure((StgClosure *)p);
    (gdb) p *bd
    $4 = {start = 0x183bae000, free = 0x183baefe8, link = 0x183b02940, u = {
        back = 0x0, bitmap = 0x0}, gen_no = 1, step = 0x140200000, blocks = 1,
      flags = 0, _padding = {0, 0}}
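
As a cross-check of the numbers in this dump, here is a minimal sketch
of the arithmetic.  It suggests that p lies inside the allocated part
of the block (456 bytes past start), while the value in the info field,
0x42000001, is odd and so is not a word-aligned address.  BlockRange
and the helper names are hypothetical, and treating 8-byte alignment as
a requirement for an info pointer on this platform is an assumption.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a block descriptor: only the two
 * fields needed for the bounds check, held as plain addresses. */
typedef struct {
    uint64_t start;  /* first byte of the block */
    uint64_t free;   /* first unused byte of the block */
} BlockRange;

/* 1 if p lies in the allocated part of the block, [start, free). */
int in_allocated_part(BlockRange b, uint64_t p)
{
    return p >= b.start && p < b.free;
}

/* Byte offset of p from the start of the block. */
uint64_t offset_in_block(BlockRange b, uint64_t p)
{
    return p - b.start;
}

/* 1 if addr is a multiple of 8, the word size assumed here for Alpha. */
int is_word_aligned(uint64_t addr)
{
    return (addr & 7u) == 0;
}
```

With start = 0x183bae000, free = 0x183baefe8 and p = 0x183bae1c8, the
closure address itself passes both the bounds check and the alignment
check; only the info word fails the alignment check.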

Is this a real sanity problem, or a problem with the sanity check
code?  Is there another strategy that anyone would suggest for
tracking down the original segfault-in-scheduler bug?

The original commit:

> | ken         2001/07/25 20:20:52 PDT
> |
> |   Modified files:
> |     ghc/rts              Main.c
> |   Log:
> |   Added to main():
> |
> |      /*
> |       * Believe it or not, calling tzset() at startup seems to get rid of
> |       * a scheduler-related Heisenbug on alpha-dec-osf3.  The symptom of
> |       * the bug is that, when the load on the machine is high or when
> |       * there are many threads, the variable "Capability *cap" in the
> |       * function "schedule" in the file "Schedule.c" magically becomes
> |       * null before the line "t = cap->rCurrentTSO;".  Why, and why does
> |       * calling tzset() here seem to fix it?  Excellent questions!
> |       */
> |      tzset();
> |
> |   Revision  Changes    Path
> |   1.28      +16 -1     fptools/ghc/rts/Main.c

--
Edit this signature at http://www.digitas.harvard.edu/cgi-bin/ken/sig
"To what shall the character of utility be ascribed, if not to that which
is a source of pleasure?" - Jeremy Bentham
