[Haskell-cafe] Some quick experiments with GHC 7.0.2 in Intel's Manycore Testing Lab (32 cores)

Mon Mar 14 14:52:32 CET 2011

Hi José,

On 11/03/2011 14:16, José Pedro Magalhães wrote:
>
> I've played a bit with Intel's Manycore Testing Lab
> (http://software.intel.com/en-us/articles/intel-many-core-testing-lab/).
> Part of the agreement to use it requires that you report back your
> experiences, which I did in an Intel forum post
> (http://software.intel.com/en-us/forums/showthread.php?t=81396). I
> thought this could be interesting to the Haskell community in general as
> well, so I'm reposting here, and pasting the text below for convenience.
> I've replaced the images with links.

Is it possible for you to make the code for your benchmarks available? 
I'd be interested in analysing the results further.

In our testing I've been able to achieve speedups over 20 on 24 cores 
with GHC 7.0.2, so there should be no reason in principle that you 
couldn't achieve similar results for obviously parallel problems, which 
yours seem to be.  Some tweaking of GC parameters might be necessary: 
e.g. I've found that +RTS -A1m helps if your L2 caches are large enough.
A good starting point for profiling is ThreadScope, which will tell
you if the program is really trying to use all the cores or not.
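
For illustration, here is a minimal sketch of how those settings might be applied to a small parallel program. The file name Main.hs, the `work` function and the input sizes are all made up for the example; only -threaded, -rtsopts, -eventlog, -N, -A1m and the event-log flag (-ls as of GHC 7.0) are real GHC options, and the best value for -A depends on the machine.

```haskell
-- Main.hs (hypothetical file name, not the benchmark from this thread)
--
-- Build:  ghc -O2 -threaded -rtsopts -eventlog Main.hs
-- Run:    ./Main +RTS -N -A1m -ls -RTS
--   -N    use all available cores
--   -A1m  1 MB allocation area per core, as suggested above
--   -ls   write Main.eventlog, which can then be opened in ThreadScope
import Control.Parallel.Strategies (parMap, rdeepseq)
import GHC.Conc (numCapabilities)

-- Made-up CPU-bound work item, standing in for a real computation.
work :: Int -> Int
work n = sum [k * k `mod` 7 | k <- [1 .. n]]

main :: IO ()
main = do
  -- Report how many capabilities the RTS was actually started with.
  putStrLn ("capabilities: " ++ show numCapabilities)
  -- Evaluate 500 independent work items in parallel.
  let results = parMap rdeepseq work (replicate 500 200000)
  print (sum results)
```

If the capabilities line prints 1, the program was either not linked with -threaded or not started with -N, and no RTS tuning will help.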

Cheers,
	Simon

>
> Cheers,
> Pedro
>
>     As per the agreement with Intel, I am reporting my experiences with
>     the Intel Manycore Testing Lab (Linux). This was my first time in
>     the lab, and I wanted to test GHC's [1] SMP parallelism [2] features.
>
>     The first challenge was to actually get GHC to work in the lab.
>     There was a working version of ghc under /opt/ghc6.13/bin/ghc, but I
>     really needed GHC 7. So first I built GHC 7.0.2-rc2, which worked
>     without much trouble.
>
>     Next step was to get all the necessary libraries in place. Since the
>     lab has no direct internet access, cabal-install [3] wouldn't be of
>     much use. Instead, I downloaded a snapshot of hackage [4] with the
>     latest version of every package and manually installed the packages
>     I needed. A bit boring, but doable.
>
>     Finally I was ready to compile my programs and test. First thing I
>     tried was an existing algorithm I had which, at some point, takes a
>     list of about 500 trees and, for each tree, computes a measure which
>     is expressed as a floating point number. This is basically a map
>     over a list transforming each tree into a float. Each operation is
>     independent of the others, and all require the same input, so it
>     seems ideal for parallelisation. A quick benchmark revealed the
>     following running times:
>
>     http://dreixel.net/images/perm/ParList.png
>
>     (Note the non-linear number of cores at the end of the x-axis.)
>     Apparently there are performance gains with up to 6 cores; adding
>     more cores after this makes the total running time worse.
>
>     While this might sound bad, do note that all that was necessary to
>     parallelise this algorithm was a one line change: basically, at the
>     point where the list of floats @l@ is generated, it is replaced with
>     @l `using` parList rdeepseq@. This change, together with
>     recompilation using -threaded, is all that is necessary to
>     parallelise this program.
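
The one-line change described above can be sketched as follows. The actual tree measure is not shown in the post, so `measure` here is a hypothetical stand-in working on plain Ints; `using`, `parList` and `rdeepseq` are the real names from Control.Parallel.Strategies in the parallel package.

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Hypothetical stand-in for the per-tree measure (the real one is not shown).
measure :: Int -> Double
measure n = sum [sqrt (fromIntegral k) | k <- [1 .. n]]

-- Sequential version: a plain map over the list.
measuresSeq :: [Int] -> [Double]
measuresSeq = map measure

-- Parallel version: the same list, with each element fully evaluated
-- in its own spark. This is the only line that differs.
measuresPar :: [Int] -> [Double]
measuresPar xs = map measure xs `using` parList rdeepseq

main :: IO ()
main = print (sum (measuresPar (replicate 500 20000)))
```

Both versions compute the same list; the strategy only affects how it is evaluated. Without -threaded and +RTS -N the sparks are simply run sequentially.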
>
>     Later I performed a more accurate benchmark, this time using the
>     equality function (take two elements and compare them for equality).
>     The first step was to parallelise the equality function, which,
>     again, is a very simple task:
>
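>     import Control.Parallel (par, pseq)
>
>     -- Tree datatype
>     data Tree a = Leaf | Bin a (Tree a) (Tree a)
>
>     -- Parallel equality: spark the left comparison with `par`, evaluate the
>     -- right one on the current thread with `pseq`, then combine the results
>     eqTreePar :: Tree Int -> Tree Int -> Bool
>     eqTreePar Leaf Leaf = True
>     eqTreePar (Bin x1 l1 r1) (Bin x2 l2 r2) =
>       x1 == x2 && par l (pseq r (l && r))
>       where l = eqTreePar l1 l2
>             r = eqTreePar r1 r2
>     eqTreePar _ _ = False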
>
>     `par` and `pseq` are the two primitives for parallelisation in GHC
>     [5]. The performance graph follows:
>
>     http://dreixel.net/images/perm/ParEq.png
>
>     (This time I ran the benchmark several times; the error bars on the
>     graph are the standard deviations.) Again we get performance
>     improvements with up to 6 cores, and after that performance
>     decreases. What I find really nice is the improvement with two
>     cores, which is almost a 50% decrease in running time. The ratios
>     for 2, 3, and 4 cores relative to the running time with 1 core are
>     0.52, 0.39, and 0.35, respectively. This is really good for such a simple change
>     in the source code, and most people only have up to 4 cores anyway.
>     In any case, the results of this (very preliminary) experiment seem
>     to indicate that GHC's SMP parallelism is not particularly optimized
>     for a high number of cores (yet).
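
To relate these ratios to the speedup figures mentioned in the reply, they can be converted as speedup = 1 / ratio and parallel efficiency = speedup / cores. The sketch below does only that arithmetic on the three reported ratios, assuming they correspond to 2, 3 and 4 cores in that order; it gives speedups of roughly 1.9, 2.6 and 2.9.

```haskell
-- (cores, running time relative to 1 core), as reported above.
ratios :: [(Int, Double)]
ratios = [(2, 0.52), (3, 0.39), (4, 0.35)]

-- Speedup over the single-core run.
speedup :: Double -> Double
speedup r = 1 / r

-- Fraction of the ideal (linear) speedup achieved on n cores.
efficiency :: Int -> Double -> Double
efficiency n r = speedup r / fromIntegral n

main :: IO ()
main = mapM_ report ratios
  where
    report (n, r) =
      putStrLn (show n ++ " cores: speedup " ++ show (speedup r)
                ++ ", efficiency " ++ show (efficiency n r))
```

Efficiency drops from about 0.96 on 2 cores to about 0.71 on 4, which is consistent with the gains levelling off as more cores are added.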
>
>     I'm planning to explore this line of research further, and I'm
>     hoping to be able to conduct more experiments in the near future.
>     Feel free to contact me if you want more information on what I've done.
>
>
>     Cheers,
>     Pedro
>
>     [1] http://www.haskell.org/ghc/
>     [2]
>     http://www.haskell.org/ghc/docs/latest/html/users_guide/using-smp.html
>     [3] http://hackage.haskell.org/package/cabal-install
>     [4] http://hackage.haskell.org
>     [5]
>     http://hackage.haskell.org/packages/archive/parallel/latest/doc/html/Control-Parallel.html