[Haskell-cafe] speed: ghc vs gcc

Fri Feb 20 11:41:33 EST 2009

bulat.ziganshin:
> Hello haskell-cafe,
> 
> since there are no objective tests comparing ghc to gcc, i made my own
> one. these are 3 programs, calculating sum in c++ and haskell:

Wonderful. Thank you!

> main = print $ sum[1..10^9::Int]

This won't be comparable to your loop below, as 'sum' is a left fold
(which doesn't fuse under build/foldr).

You should use the list implementation from the stream-fusion package (or
uvector) if you're expecting it to fuse to the following loop:

> main = print $ sum0 (10^9) 0
> 
> sum0 :: Int -> Int -> Int
> sum0 0  !acc = acc
> sum0 !x !acc = sum0 (x-1) (acc+x)

Note the bang patterns aren't required here. It compiles to the
following core:

    $wsum0 :: Int# -> Int# -> Int#
    $wsum0 =
      \ (ww_sON :: Int#) (ww1_sOR :: Int#) ->
          case ww_sON of ds_XD0 {
                _ -> $wsum0 (-# ds_XD0 1) (+# ww1_sOR ds_XD0);
                0 -> ww1_sOR

which is perfect.

    Main_zdwsum0_info:
      testq       %rsi, %rsi
      movq        %rsi, %rax
      jne .L2
      movq        %rdi, %rbx
      jmp *(%rbp)
    .L2:
      leaq        -1(%rsi), %rsi
      addq        %rax, %rdi
      jmp Main_zdwsum0_info

Which seems ... OK.

    $ ghc-core A.hs -fvia-C -optc-O3
    $ time ./A
    500000000500000000
    ./A  1.12s user 0.00s system 99% cpu 1.127 total

Works for me. That's on linux x86_64, gcc 4.4

Trying -fasm:

    Main_zdwsum0_info:
    .LcQs:
      movq %rsi,%rax
      testq %rax,%rax
      jne .LcQw
      movq %rdi,%rbx
      jmp *(%rbp)
    .LcQw:
      movq %rdi,%rcx
      addq %rax,%rcx
      leaq -1(%rax),%rsi
      movq %rcx,%rdi
      jmp Main_zdwsum0_info

    $ time ./A
    500000000500000000
    ./A  1.65s user 0.00s system 98% cpu 1.677 total

Is  a bit slower.

> main()
> {
>   int sum=0;
>   //for(int j=0; j<100;j++)
>     for(int i=0; i<1000*1000*1000;i++)
>       sum += i;
>   return sum;
> }

Well, that's a bit different. It doesn't print the result, and it returns a different
results on 64 bit....

    $ gcc -O0 t.c
    $ time ./a.out 
    -1243309312
    ./a.out  3.99s user 0.00s system 88% cpu 4.500 total

    $ gcc -O1 t.c
    $ time ./a.out
    -1243309312
    ./a.out  0.88s user 0.00s system 99% cpu 0.892 total

    $ gcc -O3 -funroll-loops t.c 
    $ time ./a.out
    -1243309312
    ./a.out  0.31s user 0.00s system 97% cpu 0.318 total

I don't get anything near the 0.062s which is interesting.
The print statement slows things down, I guess...

So we have:

    ghc -fvia-C -O2             1.127
    ghc -fasm                   1.677
    gcc -O0                     4.500
    gcc -O3 -funroll-loops      0.318

So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as
bad as it used to be).

That's actually a worse margin than any current shootout program, where we are no 
worse than 2.9 slower on larger things:

    http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=ghc&lang2=gcc&box=1

> 
> execution times:
>  sum:
>    ghc 6.6.1 -O2               : 12.433 secs
>    ghc 6.10.1 -O2              : 12.792 secs
>  sum-fast:
>    ghc 6.6.1 -O2               :  1.919 secs
>    ghc 6.10.1 -O2              :  1.856 secs
>    ghc 6.10.1 -O2 -fvia-C      :  1.966 secs
>  C++:
>    gcc 3.4.5 -O3 -funroll-loops:  0.062 secs
> 

I couldn't reproduce your final number. 

Now, given GHC gets most of the way there -- I think this might make a good bug
report against GHC head, so we can see if the new register allocator helps any.

    http://hackage.haskell.org/trac/ghc/newticket?type=bug

Thanks for the report, Bulat!

-- Don