LLVM backend and spilling
David Terei
davidterei at gmail.com
Tue Jan 10 10:15:39 CET 2012
Hi Simon,
This is finally fixed!
LLVM now creates this code for Updates.cmm:
stg_upd_frame_info: # @stg_upd_frame_info
# BB#0: # %cL
subq $24, %rsp
movq 8(%rbp), %rdx
movq %rbx, 8(%rdx)
movq $stg_BLACKHOLE_info, (%rdx)
movq %rdx, %rax
shrq $6, %rax
andq $16320, %rax # imm = 0x3FC0
movq %rdx, %rcx
andq $-1048576, %rcx # imm = 0xFFFFFFFFFFF00000
orq %rax, %rcx
movswq 40(%rcx), %rax
addq $16, %rbp testq %rax, %rax
movq %r15, %rcx
je .LBB0_4
compared to the NCG which creates:
stg_upd_frame_info:
.LcL:
movq 8(%rbp),%rax
movq %rax,64(%rsp)
addq $16,%rbp
movq %rbx,8(%rax)
movq $stg_BLACKHOLE_info,0(%rax)
movq %rax,%rcx
andq $-1048576,%rcx
andq $1044480,%rax
shrq $6,%rax
orq %rcx,%rax
cmpw $0,40(%rax)
jne .LcH
jmp *0(%rbp).LcH:
See ticket: > http://hackage.haskell.org/trac/ghc/ticket/4308 for
respective commits.
Cheers,
David
On 13 September 2010 18:49, David Terei <davidterei at gmail.com> wrote:
> OK, I've created a ticket for this issue here
> http://hackage.haskell.org/trac/ghc/ticket/4308
>
> On 13 September 2010 18:29, Simon Marlow <marlowsd at gmail.com> wrote:
>> On 13/09/2010 09:03, David Terei wrote:
>>>
>>> Hi Simon,
>>>
>>> Hmm, I've been trying to find a nasty performance regression when ghc
>>> is bootstrapped with llvm, this problem must be a large part of this
>>> so thanks.
>>
>> Right, I expect the mfence on every update is costing quite a lot (one of
>> those can be hundreds of cycles).
>>
>> Thanks for looking at this!
>>
>> Cheers,
>> ` Simon
>>
>>
>>>> - All the spilling. In the slow path there's a foreign call (to
>>>> allocBlocks_lock)
>>>> but that is annotated with the live registers, in this case [R1], so I
>>>> don't
>>>> understand why LLVM should be spilling everything. Any ideas?
>>>
>>> I fixed a problem like this ox x86 a month or two back. The issue is
>>> this:
>>>
>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>> // do some stuff
>>> call foreign g() [R1];
>>> // do some stuff
>>> tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
>>> }
>>>
>>> Because of the use of the calling convention to pass the stg registers
>>> around llvm thinks they are all live across the call in the code
>>> above, since well they are. The fix is to explicitly kill the
>>> registers that aren't live across the call, which is easy as llvm
>>> provides a nice symbolic 'undef' value that can do just this. So above
>>> code needs to become:
>>>
>>> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
>>> // do some stuff
>>> call foreign g() [R1];
>>> // do some stuff
>>> tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef,
>>> undef);
>>> }
>>>
>>> What I actually do is slightly different then this but the concept is
>>> the same. Hope that makes sense. As I said I thought I had made this
>>> fix a while back but hand written Cmm generally tests very different
>>> code paths than compiler generated Cmm, so I must have missed a case.
>>> This should be an easy fix as well.
>>>
>>> I played around a little with compiling Updates.cmm on x86-32, will
>>> try soon on x64. The ncg gives this assembly (annotated with the
>>> corresponding cmm for my benefit):
>>>
>>> _stg_upd_frame_info:
>>> .Lcz:
>>> movl 4(%ebp),%eax # bits32 updatee = b32[Sp +
>>> SIZEOF_StgHeader]
>>> movl %eax,64(%esp) # spill updatee
>>> addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4)
>>> movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] =
>>> R1
>>> # prim %write_barrier() [];
>>> movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info
>>>
>>> movl %eax,%ecx # copy updatee
>>> andl $-1048576,%ecx # x = updatee& ((1<< 20) - 1)&
>>> ~((1<< 12) - 1)
>>> andl $1044480,%eax # y = updatee& ~((1<< 20) - 1)
>>> shrl $7,%eax # x>> (12 - 5)
>>> orl %ecx,%eax # bd = x& y
>>>
>>> cmpw $0,28(%eax) #
>>> jne .LcA # if ( b16[bd + 28] != 0): goto .LcA
>>>
>>> jmp *0(%ebp) # else: jump %ENTRY_CODE( bits32[Sp])
>>>
>>>
>>> If I compile it with llvm (manually fixing the write barrier issue) I
>>> get the following llvm code:
>>>
>>> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
>>> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
>>> %R1_Arg) nounwind section ".text; .text 2#" align 4 {
>>> c2K:
>>> %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
>>> %ln2Q = load i32* %ln2M
>>> %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
>>> %ln2Y = add i32 %ln2Q, 4
>>> %ln30 = inttoptr i32 %ln2Y to i32*
>>> store i32 %R1_Arg, i32* %ln30
>>> %ln34 = inttoptr i32 %ln2Q to i32*
>>> store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
>>> %ln3c = lshr i32 %ln2Q, 7
>>> %ln3e = and i32 %ln3c, 8160
>>> %ln3j = and i32 %ln2Q, -1048576
>>> %ln3k = or i32 %ln3e, %ln3j
>>> %ln3m1 = or i32 %ln3k, 28
>>> %ln3n = inttoptr i32 %ln3m1 to i16*
>>> %ln3o = load i16* %ln3n, align 4
>>> %ln3p = icmp eq i16 %ln3o, 0
>>> br i1 %ln3p, label %n3r, label %c3q
>>>
>>> n3r: ; preds = %c2K
>>> %ln3x = load i32* %ln2V
>>> %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
>>> tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
>>> i32 %R1_Arg) nounwind
>>> ret void
>>>
>>> Which seems quite good to me. That compiles to the assembly:
>>>
>>> _stg_upd_frame_info:
>>> subl $20, %esp
>>> movl %edi, 16(%esp) # 4-byte Spill
>>> movl 4(%ebp), %edi
>>> addl $8, %ebp
>>> movl %esi, 8(%esp) # 4-byte Spill
>>> movl %edi, %ecx
>>> movl %edi, %eax
>>> shrl $7, %ecx
>>> andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000
>>> andl $8160, %ecx # imm = 0x1FE0
>>> addl %eax, %ecx
>>> movl %esi, 4(%edi)
>>> movl $_stg_BLACKHOLE_info, (%edi)
>>> movswl 28(%ecx), %eax
>>> movl %eax, 12(%esp) # 4-byte Spill
>>> testl %eax, %eax
>>> je LBB1_4
>>> [...]
>>> LBB1_4: # %n3r
>>> movl (%ebp), %eax
>>> movl 16(%esp), %edi # 4-byte Reload
>>> addl $20, %esp
>>> jmpl *%eax # TAILCALL
>>>
>>> Which is OK but still worse than the ncg. Improving this will probably
>>> require talking to the llvm guys, I think the llvm register allocator
>>> may have some asssumptions/design decisions that interact badly with
>>> our calling convention.
>>>
>>> Roman and I were a while ago investigating an issue where llvm wasn't
>>> doing a very good job for some dph code, quite a few unnecessary
>>> spills. We thought it was an aliasing issue but in the end seemed to
>>> be an llvm problem with the instruction selector/scheduler creating a
>>> lot of register pressure.
>>>
>>> Cheers,
>>> David
>>>
>>>>
>>>> I don't expect us to do better than the NCG here, because the NCG code is
>>>> just about optimal, but I would like to use -fllvm on other parts of the RTS
>>>> code so it would be good if we could generate code that is at least as good
>>>> as the NCG here.
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Cvs-ghc mailing list
>>> Cvs-ghc at haskell.org
>>> http://www.haskell.org/mailman/listinfo/cvs-ghc
>>
>>
More information about the Cvs-ghc
mailing list