Eventful GHC
Alp Mestanogullari - 2019-09-24
What can we do when it takes GHC a lot of time to compile a given module? Where is it spending its time? Where can we start to get a 10,000 feet view of what GHC is doing? This blog post covers one possible answer, using the eventlog mechanism.
(This post has also been published on the Well-Typed blog.)
Eventlog ?
GHC has a mechanism that allows us to record all sorts of “events” to an
eventlog (a simple binary file), during a program’s execution, attaching a
timestamp to it to later allow tools to reconstruct as much of a program’s
execution as the events allow. This includes RTS events (garbage collection,
HEC/thread activity) but also user defined events, where “user” designates the
author of a Haskell library or program;
Debug.Trace
in base provides functions that anyone can use to emit events, in addition to
the ones that the RTS itself will emit.
Those functions are implemented in terms of primitive operations that are backed by GHC’s runtime system. Its design and implementation is discussed, along with other topics, in Parallel Performance Tuning for Haskell by Don Jones Jr., Simon Marlow and Satnam Singh.
After the program’s execution, one can then use libraries like ghc-events or tools like ghc-events-analyze, or threadscope to consume the eventlog in order to gain some insights into where time was spent.
While profiling lets us gather more detailed information about where time is
spent and where allocations are made, it requires rebuilding our program’s code
and using a dedicated RTS flavour. The code generated by GHC for a profiled
program is quite different, adding lots of instrumentation to support profiling
the execution of Haskell code. The extra code generated to support profiling
can get in the way of some optimisations and can therefore drastically affect
the performance of a program. On the other hand, generating the eventlog for a
program only requires re-linking it (against a flavour of the RTS that supports
tracing) with -eventlog and running it with +RTS -l. The eventlog mechanism
also has much lower impact on runtime performance, since emitting events is
merely about putting a few values in a buffer, that the RTS will then regularly
flush to the eventlog file or whatever the destination is. The aforementionned
paper has some precise numbers on the overhead of the eventlog mechanism, but
it’s of course quite low.
This can therefore be an interesting solution when you want to get a big picture without having to rebuild your whole program and some of its dependencies. All you have to do is set up some events that cover the fragments of the program’s execution that you’re particularly interested in.
Events are in general emitted using one of the following 4 functions from
Debug.Trace:
- traceEvent :: String -> a -> a
- traceEventIO :: String -> IO ()
- traceMarker :: String -> a -> a
- traceMarkerIO :: String -> IO ()
The traceEvent[IO] functions should be used for all sorts of events,
particularly the ones that are likely to happen a lot during the execution of
your program, while traceMarker[IO] are generally used to mark certain points
or phases in the execution and see that visually in the profile. This is
particularly helpful with tools like eventlog2html (see the last section of
this post) that allow you to visualize heap profiles, drawing those
“marker events” on top so that users can get a sense of when some
particularly allocations or deallocations take place, with user-supplied labels
instead of trying to guess from timestamps.
For more about eventlogs in general, see GHC’s users guide or the event-log page on the GHC wiki.
GHC events
Starting with this commit,
by Ben Gamari about 3 months ago, GHC started emitting eventlog entries for all
calls to withTiming, which is a function we use to measure how long various
parts of GHC take. We then started adding a few more withTiming calls:
- 688a1b89 added tracing around calls to various external tools (C compiler, linker, assembler, etc);
- 0c5cd771 added tracing around all the individual passes that make up the codegen pipeline (we previously only tracked codegen as a whole, single event);
- e3cbe319 added tracing around the package database initialization.
And other similar patches later on. As a result, we can trace most of the execution of GHC when compiling a trivial hello world module:
-- hello.hs
main :: IO ()
main = putStrLn "hello, world!"First, you need to get a somewhat recent checkout of GHC’s master branch -
e3cbe319
or newer. Then you need to build that source tree and make sure that the stage 2
GHC is linked with -eventlog:
$ ./boot && ./configure
# Build with Hadrian:
$ hadrian/build.sh -j "stage1.ghc-bin.ghc.link.opts += -eventlog"
# stage 2 ghc at: _build/stage1/bin/ghc
# Build with Make:
$ make -j GhcStage2HcOpts+=-eventlog
# stage 2 ghc at: inplace/bin/ghc-stage2
# If you have a GHC build around already, both of those commands should not
# cause a lot of work to be done: just linking the GHC executable against a
# slightly different RTS flavour! No recompilation needed.You can then build any module, library or program with the resulting stage 2
executable as you would normally do. Except that if you pass +RTS -l with one
of -v2 or -ddump-timings, GHC will produce an eventlog at ghc.eventlog
with all the standard RTS events, but also events for each pass in GHC’s
pipeline. Let’s see this in action by compiling `hello.hs’ from earlier.
# use inplace/bin/ghc-stage2 if you built GHC with Make
$ _build/stage1/bin/ghc -ddump-timings hello.hs -o hello +RTS -lWe now have an eventlog. But what can we do with it? You’ve got several options:
- The ghc-events library and program, used by all the other options, provide primitives for decoding the eventlog and extracting data from it. The accompanying program provides various commands for querying eventlog files.
- The threadscope program lets you visualize eventlogs with a GTK+ frontend.
- ghc-events-analyze produces per-label totals, SVG visualizations and per-label timing data given an eventlog file as input.
- … and maybe other options that I’m not remembering or aware of at all.
I personally use ghc-events-analyze a lot, since it can produce pretty SVG
pictures that I can then embed in GitLab comments or… blog posts. :-)
To get the timings and SVG picture, you need to install ghc-events-analyze,
and then ask it to process ghc.eventlog with the right “markers”:
$ ghc-events-analyze --timed --timed-txt --totals \
                     --start "GHC:started:" --stop "GHC:finished:" \
		     ghc.eventlog
# --timed:     produce SVG visualization of the eventlog, in ghc.timed.svg
# --timed-txt: produce per-label groupings of timings that report when the events
               were emitted, in ghc.timed.txt
# --totals:    produce per-label totals, reporting how much time was spent in
               a given label, in total
# --start:     all events that we're interested in are wrapped in
# --stop       GHC:started:... / GHC:finished:... events, so we just tell
               ghc-events-analyze that it should be looking for those markers
	       and report about those events.Here are the totals that I get for our hello world program compilation.
GC                                389461503ns   0.389s
USER EVENTS (user events are corrected for GC)
 systool:linker                  2386891920ns   2.387s
 systool:cc                       801347228ns   0.801s
 systool:as                       145128851ns   0.145s
 Renamer/typechecker [Main]        45709853ns   0.046s
 initializing package database     20877412ns   0.021s
 CodeGen [Main]                    20754058ns   0.021s
 CoreTidy [Main]                    8262122ns   0.008s
 NCG                                7566252ns   0.008s
 Chasing dependencies               2441212ns   0.002s
 Cmm pipeline                       2040174ns   0.002s
 Desugar [Main]                     1657607ns   0.002s
 STG -> Cmm                         1103737ns   0.001s
 Simplifier [Main]                  1045768ns   0.001s
 Cmm -> Raw Cmm                      319442ns   0.000s
 Parser [Main]                       286350ns   0.000s
 CorePrep [Main]                      88795ns   0.000s
TOTAL                            3445520781ns   3.446s
THREAD EVENTS
1                                    460138ns   0.000s
IOManager on cap 0:2                 670044ns   0.001s
TimerManager:3                       143699ns   0.000s
4                                 153050783ns   0.153s
weak finalizer thread:5               43518ns   0.000s
weak finalizer thread:6               10677ns   0.000s
weak finalizer thread:7               33126ns   0.000s
weak finalizer thread:8               23787ns   0.000s
weak finalizer thread:9                1534ns   0.000s
weak finalizer thread:10               8142ns   0.000s
weak finalizer thread:11               1352ns   0.000s
weak finalizer thread:12              10080ns   0.000s
weak finalizer thread:13              10603ns   0.000s
weak finalizer thread:14               8767ns   0.000s
weak finalizer thread:15              10849ns   0.000s
16                                    69745ns   0.000s
17                                   307420ns   0.000s
18                                    83017ns   0.000s
19                                    76866ns   0.000s
weak finalizer thread:20              10083ns   0.000s
21                                    96582ns   0.000s
22                                   103373ns   0.000s
23                                    62562ns   0.000s
24                                    97655ns   0.000s
weak finalizer thread:25              11676ns   0.000s
26                                   238116ns   0.000s
27                                   245821ns   0.000s
weak finalizer thread:28               8268ns   0.000s
29                                    80235ns   0.000s
30                                    86759ns   0.000s
31                                    21571ns   0.000s
weak finalizer thread:32                396ns   0.000s
TOTAL                             156087244ns   0.156sAnd the SVG image:
(click here for full size picture)
We can see two time windows where we don’t have an “active” label:
- towards the beginning, where GHC likely parses settings, sets everything up and reads the input file;
- after code generation and before calling the assembler, where it turns out we spend most of the time getting some information about the C compiler being used.
We otherwise have a pretty good picture of what GHC is doing the rest of the time. It might be nice in the future to add a few more events, e.g to track when we’re running Template Haskell code.
If you want GHC to only emit so-called “user events” (= non-RTS events) to the
eventlog, pass +RTS -l-au instead of +RTS -l. More options can be found
in the relevant section of the GHC user manual.
Wrapping up
Of course, if you’re looking into a performance problem, once you narrow down
the part of the execution that is taking too long, you will quite likely start
using slightly fancier profiling methods, nothing new there. You can however
emit your own events when working on GHC, so as to collect some simple timing
data about those codepaths that you are working on, without affecting GHC’s
performance much. You simply have to use compiler/main/ErrUtils.hs:withTiming:
withTiming <action to get dynflags> (text "<label>") <forcing function> $
  <action to time>where the forcing function is used to force the evaluation of the action’s
result just as much as the user wants; the time it takes to perform this
evaluation is included in the timings reported by withTiming. Your <label>
would then show up in the eventlog reports, if the corresponding code path is
entered. See Note [withTiming]
for more explanations about the withTiming function and how to use it.
All the eventlog markers that withTiming emits can then be integrated in heap
profiles using eventlog2html,
written by Matthew Pickering; see
here
for an example. If you put your mouse on one of the vertical lines, you will see
the label of the phase that begins or ends at that moment.
Before ending this post, I would like to mention that we have some
infrastructure to collect timings and allocations for various compilation
phases in GHC, over the
head.hackage package set.
This can be used locally or triggered manually from the CI interface of
the GHC repository, and is set up to run nightly against GHC’s master
branch. The resulting data can be loaded in PostgreSQL and queried at will.
We plan on collecting more data about GHC’s behaviour when compiling all
those packages and on relying more on the eventlog than GHC output parsing to do
so, with two major goals in mind:
- have an easy way to track GHC’s performance over time and in particular provide some initial data about where a performance regression might be coming from;
- expose enough data to be able to identify “real world modules” where GHC doesn’t perform as well as we would like it to, which could in turn lead to an investigation and possibly a patch that improves GHC’s performance on the said module.
Happy profiling!