LLVM backend and spilling
Simon Marlow
marlowsd at gmail.com
Tue Jan 10 10:52:33 CET 2012
On 10/01/2012 09:15, David Terei wrote:
> Hi Simon,
>
> This is finally fixed!
>
> LLVM now creates this code for Updates.cmm:
>
> stg_upd_frame_info: # @stg_upd_frame_info
> # BB#0: # %cL
> subq $24, %rsp
> movq 8(%rbp), %rdx
> movq %rbx, 8(%rdx)
> movq $stg_BLACKHOLE_info, (%rdx)
> movq %rdx, %rax
> shrq $6, %rax
> andq $16320, %rax # imm = 0x3FC0
> movq %rdx, %rcx
> andq $-1048576, %rcx # imm = 0xFFFFFFFFFFF00000
> orq %rax, %rcx
> movswq 40(%rcx), %rax
> addq $16, %rbp testq %rax, %rax
> movq %r15, %rcx
> je .LBB0_4
>
> compared to the NCG which creates:
>
> stg_upd_frame_info:
> .LcL:
> movq 8(%rbp),%rax
> movq %rax,64(%rsp)
> addq $16,%rbp
> movq %rbx,8(%rax)
> movq $stg_BLACKHOLE_info,0(%rax)
> movq %rax,%rcx
> andq $-1048576,%rcx
> andq $1044480,%rax
> shrq $6,%rax
> orq %rcx,%rax
> cmpw $0,40(%rax)
> jne .LcH
> jmp *0(%rbp).LcH:
>
> See ticket:> http://hackage.haskell.org/trac/ghc/ticket/4308 for
> respective commits.
It's great that you're now tracking the liveness of GlobalRegs, but the
LLVM code here still doesn't look as good. Any idea why? (this bit of
code is executed a *lot*, saving a few instructions here is worthwhile)
Cheers,
Simon
> Cheers,
> David
>
> On 13 September 2010 18:49, David Terei<davidterei at gmail.com> wrote:
>> OK, I've created a ticket for this issue here
>> http://hackage.haskell.org/trac/ghc/ticket/4308
>>
>> On 13 September 2010 18:29, Simon Marlow<marlowsd at gmail.com> wrote:
>>> On 13/09/2010 09:03, David Terei wrote:
>>>>
>>>> Hi Simon,
>>>>
>>>> Hmm, I've been trying to find a nasty performance regression when ghc
>>>> is bootstrapped with llvm, this problem must be a large part of this
>>>> so thanks.
>>>
>>> Right, I expect the mfence on every update is costing quite a lot (one of
>>> those can be hundreds of cycles).
>>>
>>> Thanks for looking at this!
>>>
>>> Cheers,
>>> ` Simon
>>>
>>>
>>>>> - All the spilling. In the slow path there's a foreign call (to
>>>>> allocBlocks_lock)
>>>>> but that is annotated with the live registers, in this case [R1], so I
>>>>> don't
>>>>> understand why LLVM should be spilling everything. Any ideas?
>>>>
>>>> I fixed a problem like this ox x86 a month or two back. The issue is
>>>> this:
>>>>
>>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>>> // do some stuff
>>>> call foreign g() [R1];
>>>> // do some stuff
>>>> tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
>>>> }
>>>>
>>>> Because of the use of the calling convention to pass the stg registers
>>>> around llvm thinks they are all live across the call in the code
>>>> above, since well they are. The fix is to explicitly kill the
>>>> registers that aren't live across the call, which is easy as llvm
>>>> provides a nice symbolic 'undef' value that can do just this. So above
>>>> code needs to become:
>>>>
>>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>>> // do some stuff
>>>> call foreign g() [R1];
>>>> // do some stuff
>>>> tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef,
>>>> undef);
>>>> }
>>>>
>>>> What I actually do is slightly different then this but the concept is
>>>> the same. Hope that makes sense. As I said I thought I had made this
>>>> fix a while back but hand written Cmm generally tests very different
>>>> code paths than compiler generated Cmm, so I must have missed a case.
>>>> This should be an easy fix as well.
>>>>
>>>> I played around a little with compiling Updates.cmm on x86-32, will
>>>> try soon on x64. The ncg gives this assembly (annotated with the
>>>> corresponding cmm for my benefit):
>>>>
>>>> _stg_upd_frame_info:
>>>> .Lcz:
>>>> movl 4(%ebp),%eax # bits32 updatee = b32[Sp +
>>>> SIZEOF_StgHeader]
>>>> movl %eax,64(%esp) # spill updatee
>>>> addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4)
>>>> movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] =
>>>> R1
>>>> # prim %write_barrier() [];
>>>> movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info
>>>>
>>>> movl %eax,%ecx # copy updatee
>>>> andl $-1048576,%ecx # x = updatee& ((1<< 20) - 1)&
>>>> ~((1<< 12) - 1)
>>>> andl $1044480,%eax # y = updatee& ~((1<< 20) - 1)
>>>> shrl $7,%eax # x>> (12 - 5)
>>>> orl %ecx,%eax # bd = x& y
>>>>
>>>> cmpw $0,28(%eax) #
>>>> jne .LcA # if ( b16[bd + 28] != 0): goto .LcA
>>>>
>>>> jmp *0(%ebp) # else: jump %ENTRY_CODE( bits32[Sp])
>>>>
>>>>
>>>> If I compile it with llvm (manually fixing the write barrier issue) I
>>>> get the following llvm code:
>>>>
>>>> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
>>>> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
>>>> %R1_Arg) nounwind section ".text; .text 2#" align 4 {
>>>> c2K:
>>>> %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
>>>> %ln2Q = load i32* %ln2M
>>>> %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
>>>> %ln2Y = add i32 %ln2Q, 4
>>>> %ln30 = inttoptr i32 %ln2Y to i32*
>>>> store i32 %R1_Arg, i32* %ln30
>>>> %ln34 = inttoptr i32 %ln2Q to i32*
>>>> store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
>>>> %ln3c = lshr i32 %ln2Q, 7
>>>> %ln3e = and i32 %ln3c, 8160
>>>> %ln3j = and i32 %ln2Q, -1048576
>>>> %ln3k = or i32 %ln3e, %ln3j
>>>> %ln3m1 = or i32 %ln3k, 28
>>>> %ln3n = inttoptr i32 %ln3m1 to i16*
>>>> %ln3o = load i16* %ln3n, align 4
>>>> %ln3p = icmp eq i16 %ln3o, 0
>>>> br i1 %ln3p, label %n3r, label %c3q
>>>>
>>>> n3r: ; preds = %c2K
>>>> %ln3x = load i32* %ln2V
>>>> %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
>>>> tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
>>>> i32 %R1_Arg) nounwind
>>>> ret void
>>>>
>>>> Which seems quite good to me. That compiles to the assembly:
>>>>
>>>> _stg_upd_frame_info:
>>>> subl $20, %esp
>>>> movl %edi, 16(%esp) # 4-byte Spill
>>>> movl 4(%ebp), %edi
>>>> addl $8, %ebp
>>>> movl %esi, 8(%esp) # 4-byte Spill
>>>> movl %edi, %ecx
>>>> movl %edi, %eax
>>>> shrl $7, %ecx
>>>> andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000
>>>> andl $8160, %ecx # imm = 0x1FE0
>>>> addl %eax, %ecx
>>>> movl %esi, 4(%edi)
>>>> movl $_stg_BLACKHOLE_info, (%edi)
>>>> movswl 28(%ecx), %eax
>>>> movl %eax, 12(%esp) # 4-byte Spill
>>>> testl %eax, %eax
>>>> je LBB1_4
>>>> [...]
>>>> LBB1_4: # %n3r
>>>> movl (%ebp), %eax
>>>> movl 16(%esp), %edi # 4-byte Reload
>>>> addl $20, %esp
>>>> jmpl *%eax # TAILCALL
>>>>
>>>> Which is OK but still worse than the ncg. Improving this will probably
>>>> require talking to the llvm guys, I think the llvm register allocator
>>>> may have some asssumptions/design decisions that interact badly with
>>>> our calling convention.
>>>>
>>>> Roman and I were a while ago investigating an issue where llvm wasn't
>>>> doing a very good job for some dph code, quite a few unnecessary
>>>> spills. We thought it was an aliasing issue but in the end seemed to
>>>> be an llvm problem with the instruction selector/scheduler creating a
>>>> lot of register pressure.
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>>>
>>>>> I don't expect us to do better than the NCG here, because the NCG code is
>>>>> just about optimal, but I would like to use -fllvm on other parts of the RTS
>>>>> code so it would be good if we could generate code that is at least as good
>>>>> as the NCG here.
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Cvs-ghc mailing list
>>>> Cvs-ghc at haskell.org
>>>> http://www.haskell.org/mailman/listinfo/cvs-ghc
>>>
>>>
More information about the Cvs-ghc
mailing list