GHC core representation

Alastair David Reid reid@cs.utah.edu
04 Jun 2001 22:23:10 -0600


[This is in reply to Andrew Tolmach's mail to the Haskell list.  He
asked that replies should be sent to ghc-users.]

> Many months after the topic was first raised, there is finally a draft
> document describing a formal external syntax for GHC's "Core"
> intermediate language.

Wow!  This is very cool - it answers so many questions that I
previously had to examine the source code to figure out!

Two meta-comments before I go into detailed comments.

1) GHC documentation tends to have a very short half-life.
   What can be done to minimize this problem?
   For example can the tables of primops in section 4 be automatically
   generated?

2) I haven't seriously hacked on GHC or the the GHC frontend for 
   several years so some of the following comments will be a little
   off.  I'm hoping that people who are more up to date will correct
   me and that their corrections will be useful to anyone like me
   who used to know GHC but have lost track of the details.

OK, onto the detailed comments.

1) GHC identifiers have SrcLocs (filename/linenumber) attached to them.
   It might be useful to be able to pass those into your parser.
   (But what are the SrcLoc's used for in Core code?  Would they be useful?)

2) The names you use for operations all use the somewhat unreadable 
   Z encoding.  I think you should Z-decode these names because:

   1) They're a lot more readable.  e.g., decode "ZLZmZgZR" = "(->)"

   2) It's possible that GHC will get confused if you fail to Z-encode
      a name.  For example, it isn't possible to have the name "Z"
      in Core because GHC encodes the Haskell identifier "Z" as "ZZ".

   From memory, the Z encoding is something like this:

     ZL -> (
     ZR -> )
     Zm -> -
     Zg -> >
     Zl -> <
     Ze -> =
     Zc -> :
     Zp -> .
     Zxxx -> chr(xxx)   where xxx is a sequence of digits

3) The restriction that Main.main have type IO a is unfortunate and, I
   think, unnecessary.  It shouldn't be that hard to change it so that
   its type is more like:

     State# RealWorld -> (# State# RealWorld, Void# #)

4) While explaining namespaces, it'd be convenient to point out that
   you put a % in front of all keywords.

   Hmmm, I guess this is one reason to use Z-encoded names: % is a
   legal Haskell identifier.

5) The @a in datatype declarations threw me for a while.
   Later I saw you using it like BigLambda in an explicitly typed
   lambda calculus and understood what you were doing.  It'd be
   worth making the connection explicit.

6) An alternative way of defining data constructors would be like this:

   %data BinTree :: * -> * = {
      Fork :: %forall a . BinTree a -> BinTree a -> BinTree a;
      Leaf :: %forall a . a -> BinTree a
   }

   That is, specify the actual type of data constructors instead of
   using Haskell datatype declaration syntax.  This makes the language
   a little easier to specify.

   Notice that this syntax is a little more liberal than standard Haskell
   syntax because you could write types like:

   %data Expr :: * -> * = {
     Int :: Int -> Expr Int;
     App :: %forall a, b . Expr (a -> b) -> Expr a -> Expr b;
     Lam :: %forall a, b . Var a -> Expr b -> Expr (a -> b)
   }

   which cannot be expressed using normal Haskell syntax because
   the result types of each constructor are different.  I saw 
   Lennart mention this idea in a talk about 8 years ago and I've 
   always wanted to play with it. :-)

   On the other hand, you might avoid this generalisation in case
   GHC does something weird with it.

7) In section 3.6, you say:

    "Value applications may be of user-defined functions, data constructors
     or primitives.  Application of the latter two sorts need _not_ be
     saturated."

   I think you mean "none of these applications need to be saturated
   although both previously published descriptions of Core required
   that the latter two be saturated."

8) You say that the list of case alternatives "need not be exhaustive,
   even if no default is given; it is a disastrous run-time error if a
   needed case arm is missing."

   1) Just how disastrous?  Is an exception raised or does the RTS crash?
   2) I feel a little uneasy with this design decision.

9) You don't mention the _scc_ operation used for profiling.

10) There's a move on to define %ccall in a more generic way
    that would apply to calling Java functions, .net functions, etc.
    This is bound to result in a change of the "ccall" name. 
    It'd be worth adding a Working note to that effect.

11) The discussion of the terms "unboxed", "heap allocated" and 
    "unlifted" (page 9) doesn't seem quite right.  I believe it is:

    Lifted types must be heap allocated.  Unlifted types may be heap
    allocated (e.g., Array#s) or unboxed (not heap allocated) (e.g., Int#).

12) The discussion of operations on MVar# (page 10) says that they
    take an initial type argument of the form (State# t) for some
    thread "t".  Is this true?  Does "t" not have to be RealWorld#?
    What does it mean if it is not RealWorld#?  

13) Are the CCallable and CReturnable classes still there?  Blech!
    [I'm the one who implemented them - I'd hoped they'd gone by now.]

14) If strings are represented by the "address of a C-format string"
    (section 4.2), how do we represent strings with embedded \0 characters
    in them?

15) dataToTag#, tagToEnum# and getTag# (section 4.4.1) might be used
    to implement the to/fromEnum operations but they may also be
    remnants of GHCi version 0.0 - a metacircular interpreter that
    was in GHC version 0.18 (or thereabouts).  If the latter, someone
    ought to give them a decent burial.

16) You document unsafeCoerce# but not 

       reallyUnsafePtrEq# :: a -> a -> Bool

    did someone finally kill that?

17) decodeDouble# and decodeFloat# are used to extract the exponent
    and mantissa of a floating point number.  Ignoring unboxity,
    it returns an Int (the exponent) and an Integer (the mantissa).
    Representing that as unboxed types, you get (# Int#, Int#, ByteArray# #)
   
18) Section 4.4.8 asks what the relationships are between quotient,
    remainder, div and mod.  This ought to be the same as that specified
    in the Haskell report for their boxed equivalents.

19) Your question about what indexArray# does and what alignment 
    restrictions apply reminds me of a subtlety:

      When indexing into a ByteArr# or a Addr#, is the index
      scaled by the size of the object being read/written (as in C)
      or not?

    I am pretty sure that the index on an Addr# is not scaled (and the
    value of the (Addr# + Int#) combination is subject to exactly the
    same alignment restrictions as imposed by the hardware architecture.

    I don't recall the story for ByteArr# but a ByteArr# object is
    always aligned to the nearest 4 (maybe 8) byte boundary.

20) Section 4.4.15 says that takeMVar# and putMVar# "die" if the MVar is
    empty (respectively, full).

    1) What does "die" mean?
       More generally, how is failure implemented in the primops?
       Do they raise exceptions?  Do they return error codes?
       Do they call exit?  Do they crash the RTS?

    2) Are you sure these ops "die"?  Their unboxed Haskell brethren
       will block the thread under the same circumstances.

21) You repeatedly mention the sequential implementation of the RTS.
    Last time I looked, it wasn't possible to build the RTS with 
    concurrency turned off.  

    If one were to build the GHC RTS with concurrency turned off, I
    think the concurrency primitives should not be available rather
    than providing versions that always/mostly fail.  Or maybe one could 
    implement non-preemptive "concurrency" in much the same way as we
    did in Hugs?  (Not sure I recommend this route - the interactions
    between threads and exception handlers in Hugs are somewhat tricky.)



-- 
Alastair Reid        reid@cs.utah.edu        http://www.cs.utah.edu/~reid/