[Haskell-beginners] Speed performance problem on Windows?

MAN elviotoccalino at gmail.com
Sun Mar 7 22:19:47 EST 2010


To answer your question, I'm running gcc-4.4.1 (the default with Ubuntu 9.10). I
took your advice and ran a few more tests, after reordering the parameters in
both the recursive and the stream-fusion versions.
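For reference, the reordered recursive version is essentially the loop Daniel
posted further down, wrapped in a small main. This is a sketch from memory
rather than the exact file, so the module boilerplate and the printf
formatting are approximations:

import System.Environment (getArgs)
import Text.Printf (printf)

-- Mean of [n, n+1 .. m], with the Int count as the *first* argument of
-- the worker loop (the reordered version).
mean :: Double -> Double -> Double
mean n m = go 0 n 0
    where
        go :: Int -> Double -> Double -> Double
        go l x s
            | x > m     = s / fromIntegral l
            | otherwise = go (l+1) (x+1) (s+x)

main :: IO ()
main = do
    [arg] <- getArgs
    printf "%f\n" (mean 1 (read arg))   -- e.g. ./bm1-reordered 10e8

With that in place, I compiled and tested as follows: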


FOR THE NCG, WITH excess precision ON:

~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bm1-reordered
bigmean1-reordered.hs
~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bm2-reordered
bigmean2-reordered.hs

~$ time ./bm1-reordered 10e8
500000000.067109

real 0m13.330s	user 0m13.285s	sys 0m0.004s

~$ time ./bm2-reordered 10e8
500000000.067109

real 0m23.473s	user 0m23.433s	sys 0m0.008s

[[
Recall that the previous times were:
~$ time ./bigmean1-precis 1e9
500000000.067109
real 0m16.521s  user 0m16.413s  sys 0m0.008s

~$ time ./bigmean2-precis 1e9
500000000.067109
real 0m27.381s  user 0m27.190s  sys 0m0.016s
]]


TURNING excess precision OFF, TO SEE ITS IMPACT ON THE IMPROVEMENT:

~$ ghc --make -fforce-recomp -O2 -o bm1-reordered-noEP
bigmean1-reordered.hs
~$ ghc --make -fforce-recomp -O2 -o bm2-reordered-noEP
bigmean2-reordered.hs

~$ time ./bm1-reordered-noEP 10e8
500000000.067109
real 0m13.306s	user 0m13.277s	sys 0m0.004s

~$ time ./bm2-reordered-noEP 10e8
500000000.067109
real 0m23.523s	user 0m23.441s	sys 0m0.000s


Which is great! This way of compiling is much more convenient, too. It
is still odd that reordering the arguments has such an impact on the
performance of both... Any ideas?


I then tried the same code, compiling with '-fvia-C -optc-O3
-fexcess-precision' and obtained the following (smoking hot) results:

~$ time ./bm1-reord-C 10e8
500000000.067109
real 0m9.630s	user 0m9.617s	sys 0m0.000s

~$ time ./bm2-reord-C 10e8
500000000.067109
real 0m17.837s	user 0m17.769s	sys 0m0.028s

[[
Recall that the previous times for this same set of flags were:
~$ time ./bigmean1-precis 1e9
500000000.067109
real 0m11.937s  user 0m11.841s  sys 0m0.012s

~$ time ./bigmean2-precis 1e9
500000000.067109
real 0m17.105s  user 0m17.081s  sys 0m0.004s
]]


So the improvement isn't as evident here for the fusion code, but
the recursive implementation is notably faster.

It seems every time Daniel suggests some little change, the times drop a
couple of seconds... so... any more ideas? :D

Seriously now, why does argument order matter so much? More
importantly: is this common and predictable? Should I start putting
all my Int params at the front of the type signature?
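
To be concrete about what I mean: the arithmetic is identical in the two
variants; only the position of the Int accumulator in the worker loop
changes. A sketch (the "slow" ordering is my guess at the "Int as the last
parameter" variant Daniel mentions below, not code from the thread, and
'limit' is hardcoded just to keep the fragment self-contained):

limit :: Double
limit = 1e9          -- upper bound, fixed here only for the sketch

-- Fast ordering (as in Daniel's reordered loop below): count first.
goFast :: Int -> Double -> Double -> Double
goFast l x s
    | x > limit = s / fromIntegral l
    | otherwise = goFast (l+1) (x+1) (s+x)

-- Slow ordering: same arithmetic, but the Int count moved to the last position.
goSlow :: Double -> Double -> Int -> Double
goSlow x s l
    | x > limit = s / fromIntegral l
    | otherwise = goSlow (x+1) (s+x) (l+1)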

Thanks for the tips, btw; I've learned a couple of very important
things re-reading this thread.

Elvio.

On Sat, 06-03-2010 at 23:25 +0100, Daniel Fischer wrote:
> On Saturday 06 March 2010 19:50:46, MAN wrote:
> > For the record, I'm adding my numbers to the pool:
> >
> > Calling "bigmean1.hs" to the first piece of code (the recursive version)
> > and "bigmean2.hs" to the second (the one using 'foldU'), I compiled four
> > versions of the two and timed them while they computed the mean of
> > [1..1e9]. Here are the results:
> >
> >
> > MY SYSTEM (512 MB RAM, Mobile AMD Sempron(tm) 3400+ processor [1 core]) (your
> > run-of-the-mill Ubuntu laptop):
> > ~$ uname -a
> > Linux dy-book 2.6.31-19-generic #56-Ubuntu SMP Thu Jan 28 01:26:53 UTC
> > 2010 i686 GNU/Linux
> > ~$ ghc -V
> > The Glorious Glasgow Haskell Compilation System, version 6.12.1
> >
> > RUN 1 - C generator, without excess-precision
> >
> > ~$ ghc -o bigmean1 --make -fforce-recomp -O2 -fvia-C -optc-O3
> > bigmean1.hs
> > ~$ ghc -o bigmean2 --make -fforce-recomp -O2 -fvia-C -optc-O3
> > bigmean2.hs
> >
> > ~$ time ./bigmean1 1e9
> > 500000000.067109
> >
> > real 0m47.685s	user 0m47.655s	sys 0m0.000s
> >
> > ~$ time ./bigmean2 1e9
> > 500000000.067109
> >
> > real 1m4.696s	user 1m4.324s	sys 0m0.028s
> >
> >
> > RUN 2 - default generator, no excess-precision
> >
> > ~$ ghc --make -O2 -fforce-recomp -o bigmean2-noC bigmean2.hs
> > ~$ ghc --make -O2 -fforce-recomp -o bigmean1-noC bigmean1.hs
> >
> > ~$ time ./bigmean1-noC 1e9
> > 500000000.067109
> >
> > real 0m16.571s	user 0m16.493s	sys 0m0.012s
> 
> That's pretty good (not in comparison to Don's times, but in comparison to 
> the other timings).
> 
> >
> > ~$ time ./bigmean2-noC 1e9
> > 500000000.067109
> >
> > real 0m27.146s	user 0m27.086s	sys 0m0.004s
> >
> 
> That's roughly the time I get with -O2 and the NCG, 27.3s for the explicit 
> recursion, 25.9s for the stream-fusion. However, I can bring the explicit 
> recursion down to 24.8s by reordering the parameters,
> 
> mean :: Double -> Double -> Double
> mean n m = go 0 n 0
>     where
>         go :: Int -> Double -> Double -> Double
>         go l x s
>             | x > m     = s / fromIntegral l
>             | otherwise = go (l+1) (x+1) (s+x)
> 
> (or up to 40.8s by making the Int the last parameter).
> 
> I had no idea the ordering of the parameters could have such a big impact 
> even in simple cases like this.
> 
> Anyway, the difference between NCG and via-C (without excess-precision) on 
> your system is astonishingly large. Which version of GCC do you have (mine 
> is 4.3.2)?
> 
> >
> > RUN 3 - C generator, with excess-precision.
> >
> > ~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o
> > bigmean1-precis bigmean1.hs
> > ~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o
> > bigmean2-precis bigmean2.hs
> >
> > ~$ time ./bigmean1-precis 1e9
> > 500000000.067109
> >
> > real 0m11.937s	user 0m11.841s	sys 0m0.012s
> 
> Roughly the same time here, both, explicit recursion and stream-fusion.
> 
> >
> > ~$
> > time ./bigmean2-precis 1e9
> > 500000000.067109
> >
> > real 0m17.105s	user 0m17.081s	sys 0m0.004s
> >
> >
> > RUN 4 - default generator, with excess-precision
> >
> > ~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean1-precis
> > bigmean1.hs
> > ~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean2-precis
> > bigmean2.hs
> >
> > ~$ time ./bigmean1-precis 1e9
> > 500000000.067109
> >
> > real 0m16.521s	user 0m16.413s	sys 0m0.008s
> >
> > ~$ time ./bigmean2-precis 1e9
> >
> > 500000000.067109
> >
> > real 0m27.381s	user 0m27.190s	sys 0m0.016s
> >
> 
> 
> NCG, -O2:
> Fusion:
> 25.86user 0.05system 0:25.91elapsed 100%CPU
> Explicit:
> 27.34user 0.02system 0:27.48elapsed 99%CPU
> Explicit reordered:
> 24.84user 0.00system 0:24.91elapsed 99%CPU
> 
> NCG, -O2 -fexcess-precision:
> Fusion:
> 25.84user 0.00system 0:25.86elapsed 99%CPU
> Explicit:
> 27.32user 0.02system 0:27.41elapsed 99%CPU
> Explicit reordered:
> 24.86user 0.00system 0:24.86elapsed 100%CPU
> 
> -O2 -fvia-C -optc-O3: [1]
> Fusion:
> 38.44user 0.01system 0:38.45elapsed 99%CPU
> 24.92user 0.00system 0:24.92elapsed 100%CPU
> Explicit:
> 37.50user 0.02system 0:37.53elapsed 99%CPU
> 26.61user 0.00system 0:26.61elapsed 99%CPU
> Explicit reordered:
> 38.13user 0.00system 0:38.14elapsed 100%CPU
> 24.94user 0.02system 0:24.96elapsed 100%CPU
> 
> 
> -O2 -fexcess-precision -fvia-C -optc-O3:
> Fusion:
> 11.90user 0.01system 0:11.92elapsed 99%CPU
> Explicit:
> 11.80user 0.00system 0:11.86elapsed 99%CPU
> Explicit reordered:
> 11.81user 0.00system 0:11.81elapsed 100%CPU
> 
> >
> > CONCLUSIONS:
> > · Big difference between the two versions (recursive and
> > fusion-oriented).
> 
> Odd. There shouldn't be a big difference, and here there isn't. Both should 
> compile to almost the same machine code. [However, the ordering of the 
> parameters matters; you might try shuffling them around a bit and see what 
> that gives. (If I swap the Int and the Double in the strict pair of the 
> fusion code, I get a drastic performance penalty; perhaps you'll gain 
> performance that way.)]
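[[
Interjecting for anyone reading along: the accumulator in the fusion code is
a strict pair of a running count and a running sum, so "swapping the Int and
the Double" just means flipping the field order. A hand-rolled sketch of the
two layouts; the real code uses the uvector library's strict pair, the type
and function names here are mine, and which layout the article uses
(count-first, I believe) is from memory:

data CountSum = CountSum !Int !Double   -- count first (the layout I recall)
data SumCount = SumCount !Double !Int   -- the swapped layout

-- One step of the fold, written once per layout; the arithmetic is identical.
stepCS :: CountSum -> Double -> CountSum
stepCS (CountSum n s) x = CountSum (n+1) (s+x)

stepSC :: SumCount -> Double -> SumCount
stepSC (SumCount s n) x = SumCount (s+x) (n+1)
]]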
> 
> > I checked by compiling with -ddump-simpl-stats, and the
> > rule mentioned in Don's article IS being fired (streamU/unstreamU) once.
> > The recursive expression of the algorithm is quite a bit faster.
> > · Big gain from adding the excess-precision flag to the compile step, even
> > when not using the C code generator.
> 
> I think you looked at the wrong numbers there; for the native code 
> generator, the times with and without -fexcess-precision are very close, 
> both for explicit recursion and for fusion.
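[[
The rule-firing check mentioned in the first bullet above is just a matter of
dumping the simplifier statistics and looking for the rule by name; something
like this (the file name and grep pattern are only illustrative):

~$ ghc --make -fforce-recomp -O2 -ddump-simpl-stats bigmean2.hs 2>&1 | grep -i stream

If fusion kicked in, the rule-firing counts in that output should include one
firing of streamU/unstreamU.
]]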
> 
> > · The best time is achieved compiling through the C generator with the
> > excess-precision flag; second best (5 seconds behind in execution time) is adding
> 
> Yes. If you are doing lots of floating-point operations and compiling via C, 
> you'd better tell the C compiler that it shouldn't truncate every single 
> intermediate result to a 64-bit double; that takes time.
> There are two ways to do that: you can tell GHC that you don't want to 
> truncate (-fexcess-precision), and GHC then tells the C compiler [gcc], or 
> you can tell gcc directly [well, that's via GHC's command line too :) ] by 
> using -optc-fno-float-store.
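[[
Spelled out on the command line, the two alternatives look like this (Foo.hs
is just a placeholder for the module being compiled):

~$ ghc --make -O2 -fvia-C -optc-O3 -fexcess-precision Foo.hs
~$ ghc --make -O2 -fvia-C -optc-O3 -optc-fno-float-store Foo.hs
]]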
> 
> For the NCG, -fexcess-precision doesn't seem to make a difference (at least 
> with Doubles; it may make a big difference with Floats).
> 
> > the same flag to the default generator.
> >
> > I didn't know about -fexcess-precision. It really makes a BIG
> > difference for number-crunching modules :D
> >
> 
> Via C.
> 
> 
> 
> [1] This is really irritating. These timings come from the very same 
> binaries, and I haven't noticed such behaviour from any of my other 
> programmes. Normally, these programmes take ~38s, but every now and then, 
> there's a run taking ~25/26s. The times for the slower runs are pretty 
> stable, and the times for the fast runs are pretty stable (a few hundredths 
> of a second difference). Of course, the running time of a programme (for 
> the same input) depends on processor load, how many other processes want 
> how many of the registers and such, but I would expect much less regular 
> timings from those effects.
> Baffling.



