Some questions about GHC source code and wiki
marlowsd at gmail.com
Thu May 27 04:38:16 EDT 2010
On 27/05/2010 00:08, Marco Túlio Gontijo e Silva wrote:
> I've made some patches to GHC to correct these issues. I don't know if you're
> willing to recieve these kind of small patches, but I'll send them anyway.
I didn't see any patches - did you forget to attach them?
> I thought there would be a central point (function) for general object
> allocation. The garbage collector just allocates new objects when it's
> copying from one generation to the another one, right?
In the garbage collector, space for copying objects is allocated using
StgPtr alloc_for_copy (nat size, generation *gen);
>> The mutator allocates objects inline, by bumping the heap pointer (Hp).
>> Allocation of objects is such a common operation it needs to be
>> incredibly cheap.
> But what happens when the heap pointer hits the end of a Block? In this case
> it should not just bump the pointer.
Right - before allocating, the mutator does a "heap check", comparing Hp
with HpLim. If the heap check fails, then the mutator has to arrange to
save its current state and return to the runtime system requesting a GC.
There are quite a lot of details in how this happens, but the general
idea is that one of the functions in rts/HeapStackCheck.cmm gets called,
which returns to the RTS.
The nursery is a linked list of blocks, so when a heap check fails we
might be able to just start allocating into the next block rather than
requesting a GC: see the code in HeapStackCheck.cmm.
> Ok. There's also the corner case where the object starts in the beginning of
> the Line, in which we can consider the next Line to be empty, but probably it's
> not an issue to ignore this one.
That's a worthwhile optimisation, I think.
We might consider "medium" objects to be any object that is larger than,
say 75% of a line. The point is we want to avoid having objects that
are nearly a line size being allocated in the same way as small objects,
because one of these objects will almost always cause allocation in the
current line to fail, which might waste up to a line's worth of memory
in the current line. That's what I menat by fragmentation.
That fraction is something we could experiment with, maybe even 50% is
> We'll need to mark differently Lines that are not empty because they have an
> objects and Lines that are not empty because the previous Line had an object.
> I'm saying this because I got confused in the first time I read "a Line is only
> empty if the Line preceding it is also empty", thinking of an infinite
I'm proposing that we don't have explicit line marks at all, we just use
the object mark bitmap (we might have to expand this to a bytemap for
> Anyway, this is proposed in the Immix paper. In their terms, they mark only
> the one Line for small objects and they don't mark the last Line of medium
> objects, and ignore the first Line of each role in allocation.
I suggest we allocate medium objects in a separate space entirely, and
maybe just copy them as normal, rather than using mark/sweep on them.
In GHC's GC we can choose on a block-by-block basis whether the block is
copied during GC or marked+swept; this is how we can implement
"opportunistic defragmentation". Take a look at the current sweeping
algorithm in rts/sm/Sweep.c to see what I mean.
>> Then we need a representation for the free list of Lines, probably just
>> link them together in a list within a block. The blocks in the old
>> generation are already on a list attached to the generation, so we just
>> need a pointer to the first free Line from the generation.
> So you're proposing that the last free Line of a Block points to the first free
> Line of the next one?
That's one option, yes. I'm not sure if its the easiest, maybe you can
explore the alternatives here.
>> That covers (1). For (2), you'll need to change the way that to-space
>> memory is allocated, but only for the old generation. If there is a
>> free Line then use that. We'll also need to be able to allocate space
>> for objects larger than a Line (or maybe larger than a certain fraction
>> of a Line, to avoid too much fragmentation).
> I didn't got the part in the parenthesis.
Hopefully I explained that above...
>> Identifying objects that are larger than a Line can be done manually, but it
>> will probably be better in the long run if these objects have a special flag
>> in the info table.
> Yes. Immix authors propose that the differentiation between objects smaller
> than a Line (Small) and larger than a Line but much smaller than a Block
> (Medium) is done in allocation time, to save time of the GC. They propose that
> the Medium objects are objects smaller than a quarter of the Block size. If
> the object is bigger than this, it'll be allocated differently. How is this
> currently done in GHC?
See Thomas's answer on this one.
>> Ok, that's a sketch of what I had in mind, there are quite a few gaps
>> still to be filled in, and I'm sure you'll encounter some things I
>> haven't considered.
> Thanks for that, things are much more clear for me now. There're so many
> new things too look at that I'm feeling a bit lost in the RTS code. More focus
> is welcome. So, initially I must look at the code of the major garbage
> collection, and think of a representation of the list of free blocks of Lines.
If you're a bit lost, one way to get acquainted is to start adding print
statements to the code so you can get a better idea of what's going on
at each point. e.g., you might want to try printing out each object as
it is evacuated. I find this kind of hands-on approach much more
illuminating than staring at the code.
> Maybe the start of the block can hold the size of the block (in number of
> Lines) and a pointer to the next block?
This is an alternative to the linked list of free lines I described
> I'll study more how the sweep works currently. Is there something on this
> subject, which could complement the source code? Maybe an article on how this
> is done...
The Jones/Lins book is the best way to learn about GC basics, or if you
can't get hold of that easily try Paul Wilson's survey paper:
More information about the Cvs-ghc