LLVM backend and spilling
Simon Marlow
marlowsd at gmail.com
Mon Sep 13 04:29:29 EDT 2010
On 13/09/2010 09:03, David Terei wrote:
> Hi Simon,
>
> Hmm, I've been trying to track down a nasty performance regression when
> ghc is bootstrapped with llvm. This problem must be a large part of it,
> so thanks.
Right, I expect the mfence on every update is costing quite a lot (one
of those can be hundreds of cycles).
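As a rough illustration of why the fence choice matters, here is a minimal C11 sketch of an update that publishes the payload before the header word. The Closure layout and the function names are hypothetical and are not the RTS definitions. The assumption is the usual x86 behaviour: a release fence normally needs no instruction there, because stores are not reordered with other stores, while a sequentially consistent fence is normally compiled to mfence or a locked instruction.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word closure: a header word plus one payload word. */
typedef struct {
    _Atomic uintptr_t info;   /* header, written last */
    uintptr_t indirectee;     /* payload, written first */
} Closure;

/* Cheap variant: a release fence keeps the payload store ahead of the
   header store. On x86 this usually only constrains the compiler. */
static void update_closure(Closure *updatee, uintptr_t new_info, uintptr_t value) {
    updatee->indirectee = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&updatee->info, new_info, memory_order_relaxed);
}

/* Expensive variant: a full fence. On x86 this usually becomes mfence,
   which is the per-update cost discussed above. */
static void update_closure_full_fence(Closure *updatee, uintptr_t new_info, uintptr_t value) {
    updatee->indirectee = value;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&updatee->info, new_info, memory_order_relaxed);
}
```

Both variants store the same values; they differ only in the fence strength, so comparing their generated assembly shows whether an mfence is emitted.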
Thanks for looking at this!
Cheers,
Simon
>> - All the spilling. In the slow path there's a foreign call (to allocBlocks_lock)
>> but that is annotated with the live registers, in this case [R1], so I don't
>> understand why LLVM should be spilling everything. Any ideas?
>
> I fixed a problem like this on x86 a month or two back. The issue is this:
>
> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
> // do some stuff
> call foreign g() [R1];
> // do some stuff
> tail call f_next (Base, Hp', Sp', R1', R2, R3, R4, R5, R6);
> }
>
> Because we use the calling convention to pass the stg registers
> around, llvm thinks they are all live across the call in the code
> above, since, well, they are. The fix is to explicitly kill the
> registers that aren't live across the call, which is easy as llvm
> provides a nice symbolic 'undef' value that can do just this. So above
> code needs to become:
>
> fun f (Base, Hp, Sp, R1, R2, R3, R4, R5, R6) {
> // do some stuff
> call foreign g() [R1];
> // do some stuff
> tail call f_next (Base, Hp', Sp', R1', undef, undef, undef, undef, undef);
> }
>
> What I actually do is slightly different from this, but the concept is
> the same. Hope that makes sense. As I said, I thought I had made this
> fix a while back, but hand-written Cmm generally tests very different
> code paths than compiler-generated Cmm, so I must have missed a case.
> This should be an easy fix as well.
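The idea of killing dead registers can be sketched in a few lines of C. This is only an illustration of the concept, not what the backend actually does: the register table, the bitmask encoding of liveness and the function name are all assumptions made for the example. Given the set of registers that are still live, it builds the argument list of the tail call and substitutes undef for everything else.

```c
#include <stdio.h>
#include <string.h>

enum { NUM_REGS = 9 };

/* Hypothetical register order, following the pseudo-code above. */
static const char *const reg_names[NUM_REGS] = {
    "Base", "Hp", "Sp", "R1", "R2", "R3", "R4", "R5", "R6"
};

/* Write the tail-call argument list into buf. Bit i of live_mask says
   whether reg_names[i] is live; dead registers are passed as "undef".
   Returns 0 on success and -1 if buf is too small. */
static int format_tail_args(unsigned live_mask, char *buf, size_t size) {
    size_t used = 0;
    if (size == 0) return -1;
    buf[0] = '\0';
    for (int i = 0; i < NUM_REGS; i++) {
        const char *arg = (live_mask & (1u << i)) ? reg_names[i] : "undef";
        int n = snprintf(buf + used, size - used, "%s%s", i ? ", " : "", arg);
        if (n < 0 || (size_t)n >= size - used) return -1;
        used += (size_t)n;
    }
    return 0;
}
```

With Base, Hp, Sp and R1 marked live (mask 0xF), the result has the same shape as the second version of f above, except that the sketch prints plain register names rather than the updated values Hp', Sp' and R1'.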
>
> I played around a little with compiling Updates.cmm on x86-32, will
> try soon on x64. The ncg gives this assembly (annotated with the
> corresponding cmm for my benefit):
>
> _stg_upd_frame_info:
> .Lcz:
> movl 4(%ebp),%eax # bits32 updatee = b32[Sp + SIZEOF_StgHeader]
> movl %eax,64(%esp) # spill updatee
> addl $8,%ebp # Sp = Sp + (SIZEOF_StgHeader + 4)
> movl %esi,4(%eax) # gcptr[updatee + SIZEOF_StgHeader] = R1
> # prim %write_barrier() [];
> movl $_stg_BLACKHOLE_info,0(%eax) # b32[updatee] = stg_BLACKHOLE_info
>
> movl %eax,%ecx # copy updatee
> andl $-1048576,%ecx # y = updatee & ~((1 << 20) - 1)
> andl $1044480,%eax # x = updatee & ((1 << 20) - 1) & ~((1 << 12) - 1)
> shrl $7,%eax # x = x >> (12 - 5)
> orl %ecx,%eax # bd = x | y
>
> cmpw $0,28(%eax) #
> jne .LcA # if ( b16[bd + 28] != 0): goto .LcA
>
> jmp *0(%ebp) # else: jump %ENTRY_CODE( bits32[Sp])
>
>
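The address arithmetic in the middle of that listing can be written out in C as a cross-check of the constants. This is a sketch under assumptions: the shift amounts 12, 20 and 5 are read off the annotations, while the constant and function names are made up for the example and the code is not copied from the RTS headers. The second function shifts before masking, which is the form that produces the constants 7 and 8160 in the llvm output quoted further down.

```c
#include <stdint.h>

/* Shift amounts taken from the annotations; the names are hypothetical. */
enum {
    BLOCK_SHIFT  = 12,
    MBLOCK_SHIFT = 20,
    BDESCR_SHIFT = 5
};

#define BLOCK_MASK  ((UINT32_C(1) << BLOCK_SHIFT) - 1)
#define MBLOCK_MASK ((UINT32_C(1) << MBLOCK_SHIFT) - 1)

/* Mask first, then shift: `andl $1044480` followed by `shrl $7`. */
static uint32_t bdescr_mask_then_shift(uint32_t p) {
    uint32_t x = (p & MBLOCK_MASK & ~BLOCK_MASK) >> (BLOCK_SHIFT - BDESCR_SHIFT);
    uint32_t y = p & ~MBLOCK_MASK;   /* same bits as `andl $-1048576` */
    return x | y;
}

/* Shift first, then mask: the shifted mask is 1044480 >> 7 == 8160. */
static uint32_t bdescr_shift_then_mask(uint32_t p) {
    uint32_t shift = BLOCK_SHIFT - BDESCR_SHIFT;
    uint32_t x = (p >> shift) & ((MBLOCK_MASK & ~BLOCK_MASK) >> shift);
    uint32_t y = p & ~MBLOCK_MASK;
    return x | y;
}
```

The two functions return the same value for every 32-bit input, since shifting the value and the mask by the same amount selects the same bits.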
> If I compile it with llvm (manually fixing the write barrier issue) I
> get the following llvm code:
>
> define cc10 void @stg_upd_frame_info(i32* noalias nocapture %Base_Arg,
> i32* noalias nocapture %Sp_Arg, i32* noalias nocapture %Hp_Arg, i32
> %R1_Arg) nounwind section ".text; .text 2#" align 4 {
> c2K:
> %ln2M = getelementptr inbounds i32* %Sp_Arg, i32 1
> %ln2Q = load i32* %ln2M
> %ln2V = getelementptr inbounds i32* %Sp_Arg, i32 2
> %ln2Y = add i32 %ln2Q, 4
> %ln30 = inttoptr i32 %ln2Y to i32*
> store i32 %R1_Arg, i32* %ln30
> %ln34 = inttoptr i32 %ln2Q to i32*
> store i32 ptrtoint ([0 x i32]* @stg_BLACKHOLE_info to i32), i32* %ln34
> %ln3c = lshr i32 %ln2Q, 7
> %ln3e = and i32 %ln3c, 8160
> %ln3j = and i32 %ln2Q, -1048576
> %ln3k = or i32 %ln3e, %ln3j
> %ln3m1 = or i32 %ln3k, 28
> %ln3n = inttoptr i32 %ln3m1 to i16*
> %ln3o = load i16* %ln3n, align 4
> %ln3p = icmp eq i16 %ln3o, 0
> br i1 %ln3p, label %n3r, label %c3q
>
> n3r: ; preds = %c2K
> %ln3x = load i32* %ln2V
> %ln3y = inttoptr i32 %ln3x to void (i32*, i32*, i32*, i32)*
> tail call cc10 void %ln3y(i32* %Base_Arg, i32* %ln2V, i32* %Hp_Arg,
> i32 %R1_Arg) nounwind
> ret void
>
> Which seems quite good to me. That compiles to the assembly:
>
> _stg_upd_frame_info:
> subl $20, %esp
> movl %edi, 16(%esp) # 4-byte Spill
> movl 4(%ebp), %edi
> addl $8, %ebp
> movl %esi, 8(%esp) # 4-byte Spill
> movl %edi, %ecx
> movl %edi, %eax
> shrl $7, %ecx
> andl $-1048576, %eax # imm = 0xFFFFFFFFFFF00000
> andl $8160, %ecx # imm = 0x1FE0
> addl %eax, %ecx
> movl %esi, 4(%edi)
> movl $_stg_BLACKHOLE_info, (%edi)
> movswl 28(%ecx), %eax
> movl %eax, 12(%esp) # 4-byte Spill
> testl %eax, %eax
> je LBB1_4
> [...]
> LBB1_4: # %n3r
> movl (%ebp), %eax
> movl 16(%esp), %edi # 4-byte Reload
> addl $20, %esp
> jmpl *%eax # TAILCALL
>
> Which is OK but still worse than the ncg. Improving this will probably
> require talking to the llvm guys; I think the llvm register allocator
> may have some assumptions/design decisions that interact badly with
> our calling convention.
>
> A while ago Roman and I were investigating an issue where llvm wasn't
> doing a very good job for some dph code: quite a few unnecessary
> spills. We thought it was an aliasing issue, but in the end it seemed to
> be an llvm problem with the instruction selector/scheduler creating a
> lot of register pressure.
>
> Cheers,
> David
>
>>
>> I don't expect us to do better than the NCG here, because the NCG code is just about optimal, but I would like to use -fllvm on other parts of the RTS code so it would be good if we could generate code that is at least as good as the NCG here.
>>
>> Cheers,
>> Simon
>>
>>
>