memory fragmentation with ghc-7.6.1

Ben Gamari bgamari.foss at gmail.com
Sat Sep 29 08:47:44 CEST 2012


Simon Marlow <marlowsd at gmail.com> writes:

> On 28/09/12 17:36, Ben Gamari wrote:
>> Unfortunately, after poking around I found a few obvious problems with
>> both the code and my testing configuration which explained the
>> performance drop. Things seem to be back to normal now. Sorry for the
>> noise! Great job on the new codegen.
>
> That's good to hear, thanks for letting me know!
>
Of course!

That being said, I have run into a bit of a performance issue which
could be related to the runtime system. In particular, as I scale up
the thread count (from 6 up to 48, the core count of the machine) in my
program[1] (test data available), I'm seeing the total runtime
increase, along with a corresponding increase in CPU-seconds used, even
though the RTS claims consistently high (~94%) productivity. Meanwhile,
Threadscope shows that nearly all of my threads are busily working,
with very few STM retries and no idle time. This is in an application
which should scale reasonably well (or so I believe). Attached below
you will find a crude listing of various runtime statistics over a
range of thread counts (extending into what should probably be regarded
as the absurdly large).

The application is a parallel Gibbs sampler for learning probabilistic
graphical models. It involves a set of worker threads (updateWorkers)
pulling work units off a common TQueue. After grabbing a work unit, a
worker reads a reference to the current global state from an IORef. It
then begins a long-running calculation, producing a small value (forced
to normal form with deepseq), which it communicates back to a global
update thread (diffWorker) in the form of a lambda sent over another
TQueue. The global update thread then maps the global state (the same
state that was read from the IORef earlier) through this lambda with
atomicModifyIORef'. This is all implemented in [2].

I do understand that I'm asking a lot of the language, and I have been
quite impressed by how well Haskell and GHC have stood up to the
challenge thus far. That being said, the behavior I'm seeing seems a
bit strange. If synchronization overhead were the culprit, I'd expect
to observe STM retries or thread blocking, neither of which I see: by
eye, STM retries occur on the order of 5 per second, and the worker
threads otherwise appear to run uninterrupted except for GC (a GHC
event log from a 16-thread run is available here[3]). If GC were the
problem, I would expect it to show up in the productivity, which it
clearly does not. Do you have any idea what else might be causing such
extreme performance degradation at higher thread counts? I would
appreciate any input you have to offer.
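
For what it's worth, here is the back-of-the-envelope arithmetic (my
own, taken from the -s figures below) behind that last point:
productivity only relates mutator CPU time to total CPU time, so it can
sit near 94% even while the mutator itself burns nearly ten times as
much CPU at -N24 as at -N2:

-- Figures taken from the -N2 and -N24 -s outputs quoted below.
mutatorBlowup :: Double
mutatorBlowup = 4606.75 / 474.28       -- MUT seconds at -N24 vs -N2, ~9.7x

productivityAtN2, productivityAtN24 :: Double
productivityAtN2  = 474.28  / 506.49   -- ~0.94, cf. the reported 93.7%
productivityAtN24 = 4606.75 / 4888.61  -- ~0.94, cf. the reported 94.3%

main :: IO ()
main = mapM_ print [mutatorBlowup, productivityAtN2, productivityAtN24]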

Thanks for all of your work!

Cheers,

- Ben


[1] https://github.com/bgamari/bayes-stack/v2
[2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
[3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog



Performance of Citation Influence model on lda-handcraft data set
1115 arcs, 702 nodes, 50 items per node average
100 sweeps in blocks of 10, 200 topics
Running with +RTS -A1G
ghc-7.7 9c15249e082642f9c4c0113133afd78f07f1ade2 

Cores      User time (s)  Walltime (s)   CPU %       Productivity
======     =============  =============  ======      =============
2          488.66         269.41         188%        93.7%
3          533.43         195.28         281%        94.1%
4          603.92         166.94         374%        94.3%
5          622.40         138.16         466%        93.8%
6          663.73         123.00         558%        94.2%
7          713.96         114.17         647%        94.0%
8          724.66         101.98         736%        93.7%
9          802.75         100.59         826%        .
10         865.05         97.69          917%        .
11         966.97         99.09          1010%       .
12         1238.42        114.28         1117%    
13         1242.43        106.53         1206%    
14         1428.59        112.48         1310%    
15         1299.40        97.52          1387%    
16         1559.99        108.86         1481%    
17         1972.02        126.49         1604%    
18         2157.03        130.91         1696%    
19         1966.24        115.29         1770%       .
20         2693.64        146.76         1887%       .
21         3051.16        158.48         1990%       .
22         4100.88        199.18         2109%       93.7%
23         4156.94        193.38         2201%       93.5%   
24         4780.46        212.13         2303%       94.3%
25         5733.64        242.78         2407%       .
26         7806.47        313.92         2526%       .
27         6368.32        249.65         2596%       .
28         8563.18        320.26         2717% 



           -sstderr output from 2 cores:
       ======================================

   214,358,463,960 bytes allocated in the heap
   2,261,367,592 bytes copied during GC
      71,053,384 bytes maximum residency (7 sample(s))
       2,077,816 bytes maximum slop
            2299 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       100 colls,   100 par   28.58s   14.27s     0.1427s    0.3345s
  Gen  1         7 colls,     6 par    3.36s    1.74s     0.2486s    0.3241s

  Parallel GC work balance: 25.06% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.03s  (  0.03s elapsed)
  MUT     time  474.28s  (252.98s elapsed)
  GC      time   31.94s  ( 16.01s elapsed)
  EXIT    time    0.12s  (  0.12s elapsed)
  Total   time  506.49s  (269.26s elapsed)

  Alloc rate    451,969,681 bytes per MUT second

  Productivity  93.7% of total user, 176.2% of total elapsed

gc_alloc_block_sync: 108391
whitehole_spin: 0
gen[0].sync: 7392
gen[1].sync: 97126



           -sstderr output from 24 cores:
       =======================================

 262,335,150,000 bytes allocated in the heap
   1,323,818,912 bytes copied during GC
      67,648,048 bytes maximum residency (7 sample(s))
       2,927,176 bytes maximum slop
           25172 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0        29 colls,    29 par   162.40s    6.77s     0.2335s    0.3469s
  Gen  1         7 colls,     6 par   118.34s    5.32s     0.7600s    1.2448s

  Parallel GC work balance: 5.61% (serial 0%, perfect 100%)

  TASKS: 27 (1 bound, 26 peak workers (26 total), using -N24)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.34s  (  0.34s elapsed)
  MUT     time  4606.75s  (197.59s elapsed)
  GC      time  280.74s  ( 12.09s elapsed)
  EXIT    time    0.40s  (  0.40s elapsed)
  Total   time  4888.61s  (210.80s elapsed)

  Alloc rate    56,945,837 bytes per MUT second

  Productivity  94.3% of total user, 2185.8% of total elapsed

gc_alloc_block_sync: 5397994
whitehole_spin: 0
gen[0].sync: 32282
gen[1].sync: 5072


