GHC supports running Haskell programs in parallel on an SMP (symmetric multiprocessor).
There's a fine distinction between concurrency and parallelism: parallelism is all about making your program run faster by making use of multiple processors simultaneously. Concurrency, on the other hand, is a means of abstraction: it is a convenient way to structure a program that must respond to multiple asynchronous events.
However, the two terms are certainly related. By making use of multiple CPUs it is possible to run concurrent threads in parallel, and this is exactly what GHC's SMP parallelism support does. But it is also possible to obtain performance improvements with parallelism on programs that do not use concurrency. This section describes how to use GHC to compile and run parallel programs; Section 7.18, “Concurrent and Parallel Haskell”, describes the language features that affect parallelism.
In order to make use of multiple CPUs, your program must be
linked with the -threaded option (see Section 4.10.7, “Options affecting linking”). Additionally, the following
compiler options affect parallelism:
-feager-blackholing
Blackholing is the act of marking a thunk (lazy
computation) as being under evaluation. It is useful for
three reasons: firstly it lets us detect certain kinds of
infinite loop (the NonTermination
exception), secondly it avoids certain kinds of space
leak, and thirdly it avoids repeating a computation in a
parallel program, because we can tell when a computation
is already in progress.
The option -feager-blackholing causes
each thunk to be blackholed as soon as evaluation begins.
The default is "lazy blackholing", whereby thunks are only
marked as being under evaluation when a thread is paused
for some reason. Lazy blackholing is typically more
efficient (by 1-2% or so), because most thunks don't
need to be blackholed. However, eager blackholing can
avoid more repeated computation in a parallel program, and
this often turns out to be important for parallelism.
We recommend compiling any code that is intended to be run
in parallel with the -feager-blackholing
flag.
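To illustrate the kind of program these options are aimed at, here is a minimal sketch of a pure computation that uses parallelism without any concurrency. The file name ParSum.hs and the functions fib and parPair are hypothetical names chosen for this example; par and pseq are imported from GHC.Conc. The build line in the comment uses only the options described above, and the remark about -rtsopts applies to recent GHC versions.

```haskell
-- ParSum.hs (hypothetical file name): pure parallelism without concurrency.
-- A possible build line, using the options described above:
--   ghc -threaded -feager-blackholing ParSum.hs
-- Recent GHC versions also need -rtsopts at link time to accept +RTS flags.

import GHC.Conc (par, pseq)

-- A deliberately expensive pure function.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Offer x for evaluation on another CPU (if one is idle) while the
-- current thread evaluates y, then combine the two results.
parPair :: Int -> Int -> Integer
parPair a b = x `par` (y `pseq` (x + y))
  where
    x = fib a
    y = fib b

main :: IO ()
main = print (parPair 27 28)
```

Both x and y are ordinary thunks, so this is exactly the situation in which blackholing decides whether a second thread repeats work that is already in progress.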
To run a program on multiple CPUs, use the
RTS -N option:
-N[x]
Use x simultaneous threads when
running the program. Normally x
should be chosen to match the number of CPU cores on the
machine[9]. For example,
on a dual-core machine we would probably use
+RTS -N2 -RTS.
Omitting x,
i.e. +RTS -N -RTS, lets the runtime
choose the value of x itself
based on how many processors are in your machine.
Be careful when using all the processors in your machine: if some of your processors are in use by other programs, this can actually harm performance rather than improve it.
Setting -N also has the effect of
enabling the parallel garbage collector (see
Section 4.14.3, “RTS options to control the garbage collector”).
There is no means (currently) by which this value may vary after the program has started.
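A program can check which value of -N it was actually given. The following sketch assumes a reasonably recent base library, in which getNumCapabilities is exported from Control.Concurrent and getNumProcessors from GHC.Conc; the helper describe is a hypothetical name used only here.

```haskell
import Control.Concurrent (getNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Compare the number of capabilities (the -N value) with the number of
-- processors the runtime can see. Hypothetical helper for this example.
describe :: Int -> Int -> String
describe caps procs
  | caps > procs  = "more capabilities than processors"
  | caps == procs = "using every processor"
  | otherwise     = "leaving some processors free"

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- the value chosen with +RTS -N
  procs <- getNumProcessors    -- processors visible to the runtime
  putStrLn ("capabilities: " ++ show caps)
  putStrLn ("processors:   " ++ show procs)
  putStrLn (describe caps procs)
```

Running the same executable with different -N values should change only the first line of output, which makes this a quick way to confirm that the RTS option was accepted.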
The following options affect the way the runtime schedules threads on CPUs:
-qm
Disable automatic migration for load balancing.
Normally the runtime will automatically try to schedule
threads across the available CPUs to make use of idle
CPUs; this option disables that behaviour. It is probably
only of use if you are explicitly scheduling threads onto
CPUs with GHC.Conc.forkOnIO.
-qw
Migrate a thread to the current CPU when it is woken up. Normally when a thread is woken up after being blocked it will be scheduled on the CPU it was running on last; this option allows the thread to immediately migrate to the CPU that unblocked it.
The rationale for allowing this eager migration is that it tends to move threads that are communicating with each other onto the same CPU; however, there are pathological situations where it turns out to be a poor strategy. Depending on the communication pattern in your program, it may or may not be a good idea.
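The explicit scheduling mentioned under -qm can be sketched as follows. This is an illustration under assumptions rather than a recommended design: it uses forkOn, the name under which recent versions of base export this operation from Control.Concurrent (the text above refers to it as GHC.Conc.forkOnIO), and the round-robin helper capabilityFor is a hypothetical name.

```haskell
import Control.Concurrent
  (forkOn, getNumCapabilities, myThreadId, threadCapability)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Choose a capability for worker i in round-robin fashion.
-- Hypothetical helper for this example.
capabilityFor :: Int -> Int -> Int
capabilityFor caps i = i `mod` caps

main :: IO ()
main = do
  caps  <- getNumCapabilities
  dones <- mapM (spawn caps) [0 .. 3]
  -- Wait for every worker and print (worker, capability, pinned?).
  results <- mapM takeMVar dones
  mapM_ print results
  where
    spawn caps i = do
      done <- newEmptyMVar
      _ <- forkOn (capabilityFor caps i) $ do
        -- Ask the runtime where this thread is running and whether
        -- it is pinned to that capability.
        (cap, pinned) <- threadCapability =<< myThreadId
        putMVar done (i, cap, pinned)
      return done
```

Threads created this way stay on the capability they were given, so combining them with -qm leaves the placement of every thread in the program's hands.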
Add the -s RTS option when
running the program to see timing stats, which will help to tell you
whether your program got faster by using more CPUs or not. If the user
time is greater than
the elapsed time, then the program used more than one CPU. You should
also run the program without -N for comparison.
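The same comparison can be made from inside a program, which is sometimes convenient when timing one phase rather than the whole run. The sketch below is only an approximation of what -s reports: it assumes a recent base library providing GHC.Clock.getMonotonicTime, it measures the process CPU time returned by getCPUTime rather than the "user" figure alone, and work and cpuRatio are hypothetical names.

```haskell
import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)
import GHC.Conc (par, pseq)
import System.CPUTime (getCPUTime)

-- CPU seconds divided by elapsed seconds. A ratio above 1 suggests that
-- more than one CPU was busy. Hypothetical helper for this example.
cpuRatio :: Double -> Double -> Double
cpuRatio cpuSecs elapsedSecs = cpuSecs / elapsedSecs

-- Some pure work to time: the sum of the squares of 1..n.
work :: Int -> Integer
work n = sum [toInteger i * toInteger i | i <- [1 .. n]]

main :: IO ()
main = do
  wall0 <- getMonotonicTime            -- seconds, as a Double
  cpu0  <- getCPUTime                  -- picoseconds, as an Integer
  let a = work 2000000
      b = work 2000001
  total <- evaluate (a `par` (b `pseq` (a + b)))
  cpu1  <- getCPUTime
  wall1 <- getMonotonicTime
  let cpuSecs     = fromIntegral (cpu1 - cpu0) / 1e12 :: Double
      elapsedSecs = wall1 - wall0
  putStrLn ("result:      " ++ show total)
  putStrLn ("cpu time:    " ++ show cpuSecs ++ "s")
  putStrLn ("elapsed:     " ++ show elapsedSecs ++ "s")
  putStrLn ("cpu/elapsed: " ++ show (cpuRatio cpuSecs elapsedSecs))
```

As with -s, the figures are only meaningful when compared against a run of the same program without -N.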
GHC's parallelism support is new and experimental. It may make your program go faster, or it might slow it down - either way, we'd be interested to hear from you.
One significant limitation of the current implementation is that all execution must stop when garbage collection takes place, even when the parallel garbage collector is in use. This can be a significant bottleneck in a parallel program, especially if your program does a lot of GC. If this happens to you, then try reducing the cost of GC by tweaking the GC settings (Section 4.14.3, “RTS options to control the garbage collector”): enlarging the heap or the allocation area size is a good start.
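One way to judge whether GC is the bottleneck is to ask the runtime how much of the elapsed time it spent collecting. The sketch below assumes a recent base library that provides GHC.Stats (getRTSStats, getRTSStatsEnabled and the RTSStats record), and it assumes the program is run with +RTS -T -RTS so that statistics are collected; gcFraction is a hypothetical helper.

```haskell
import Control.Exception (evaluate)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Fraction of the elapsed time spent in the garbage collector.
-- Hypothetical helper for this example.
gcFraction :: Integer -> Integer -> Double
gcFraction gcNs totalNs
  | totalNs <= 0 = 0
  | otherwise    = fromIntegral gcNs / fromIntegral totalNs

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "run with +RTS -T -RTS to collect statistics"
    else do
      -- Some allocation-heavy work, so that the collector has to run.
      _ <- evaluate (length (show (product [1 .. 3000 :: Integer])))
      stats <- getRTSStats
      let frac = gcFraction (fromIntegral (gc_elapsed_ns stats))
                            (fromIntegral (elapsed_ns stats))
      putStrLn ("collections: " ++ show (gcs stats))
      putStrLn ("GC fraction: " ++ show frac)
```

If the reported fraction is large, rerunning with a bigger allocation area (the -A RTS option) or a larger heap (-H) and comparing the figures is a reasonable first experiment.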
[9] Whether hyperthreading cores should be counted or not is an open question; please feel free to experiment and let us know what results you find.