[Haskell-cafe] vector stream fusion, inlining and compilation time

Roman Leshchinskiy rl at cse.unsw.edu.au
Thu Mar 4 21:05:31 EST 2010

On 05/03/2010, at 04:34, stefan kersten wrote:

> i've been hunting down some performance problems in DSP code using vector and
> the single most important transformation seems to be throwing in INLINE pragmas
> for any function that uses vector combinators and is to be called from
> higher-level code. failing to do so seems to prevent vector operations from
> being fused and results in big performance hits (the good news is that the
> optimized code is quite competitive). does anybody have some more info about the
> do's and don'ts when programming with vector?

This is a general problem when working with RULES-based optimisations. Here is an example of what happens: suppose we have

foo :: Vector Int -> Vector Int
foo xs = map (+1) xs

Now, GHC will generate a nice tight loop for this but if in a different module, we have something like this:

bar xs = foo (foo xs)

then this won't fuse because (a) foo won't be inlined and (b) even if GHC did inline here, it would inline the nice tight loop which can't possibly fuse instead of the original map which can. By slapping an INLINE pragma on foo, you're telling GHC to (almost) always inline the function and to use the original definition for inlining, thus giving it a chance to fuse.

GHC could be a bit cleverer here (perhaps by noticing that the original definition is small enough to inline and keeping it) but in general, you have to add INLINE pragmas in such cases if you want to be sure your code fuses. A general-purpose mechanism for handling situations like this automatically would be great but we haven't found a good one so far.

> the downside after adding the INLINE pragmas is that now some of my modules take
> _really_ long to compile (up to a couple of minutes); any ideas where i can
> start looking to bring the compilation times down again?

Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this:

foo ... = ... map f xs ...
    f x = ...

you could tell GHC not to inline f until fairly late in the game by adding

  {-# INLINE [0] f #-}

to the where clause. This helps sometimes.

> i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4
> on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph doesn't
> make a noticable difference in compilation time.

If you're *really* interested in performance, I would suggest using GHC head. It really is much better for this kind of code (although not necessarily faster wrt to compilation times).

This is what -Odph does:

-- -Odph is equivalent to
--    -O2                               optimise as much as possible
--    -fno-method-sharing               sharing specialisation defeats fusion
--                                      sometimes
--    -fdicts-cheap                     always inline dictionaries
--    -fmax-simplifier-iterations20     this is necessary sometimes
--    -fsimplifier-phases=3             we use an additional simplifier phase
--                                      for fusion
--    -fno-spec-constr-threshold        run SpecConstr even for big loops
--    -fno-spec-constr-count            SpecConstr as much as possible

I'm surprised -Odph doesn't produce faster code than -O2. In any case, you could try turning these flags on individually (esp. -fno-method-sharing and the spec-constr flags) to see how they affect performance and compilation times.


More information about the Haskell-Cafe mailing list