[Haskell-cafe] [14/16] SBM: Behind the measurements (rationale)

Peter Firefly Brodersen Lund firefly at vax64.dk
Sat Dec 22 04:17:26 EST 2007


 // I am getting sick and tired of working on this project and it's probably
 // better to fire it off than to polish it any further.
 //
 // This email could benefit from being rewritten from a rough draft into a
 // well-crafted letter but that would take a couple of hours.
 //
 // So here it is, a lot rougher than I'd like -- but it *IS* :)

why such big input files?
  It's the easiest way to spot non-linearity and bad memory behaviour.  In any
  case, the files should be big enough to overflow the caches and kick the GC
  in.
  (Short files are interesting, too, but the big ones provoke more complex
  behaviour in the run-time system and the CPU.  If the complex behaviour is
  well-behaved, then the simple behaviour probably is too -- though its
  constant factors can still be improved.  And if the complex behaviour is
  bad, shouldn't that be fixed in any case?)

wait4() fills in a struct rusage with info about the child program's resource
  usage.  Unfortunately, the ru_maxrss (peak RSS) field is not filled in.
  It seems to be a general Unix problem; I've seen complaints on the net that
  Solaris doesn't fill it in, either.  Another solution was needed (a sketch
  of what this route would have looked like follows below).
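  A minimal sketch of the wait4() route, for reference ("./benchmark" is a
  placeholder, and on Linux ru_maxrss simply comes back as 0):

      #include <stdio.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/resource.h>
      #include <sys/wait.h>

      int main(void)
      {
              pid_t pid = fork();
              if (pid == 0) {
                      /* child: run the program we want to measure */
                      execlp("./benchmark", "./benchmark", (char *)NULL);
                      _exit(127);
              }

              int status;
              struct rusage ru;
              if (wait4(pid, &status, 0, &ru) < 0) {
                      perror("wait4");
                      return 1;
              }
              /* ru_maxrss should be the child's peak RSS (in kB on
                 Linux) -- but Linux leaves it at 0, hence pause-at-end */
              printf("ru_maxrss = %ld\n", ru.ru_maxrss);
              return 0;
      }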

pause-at-end, /proc/self/maps + /proc/self/status.  VmHWM is the peak ("high
  water mark") of VmRSS, the Resident (working) Set Size.  It doesn't say what
  is shared with other processes or the operating system, though.  In our
  case, we don't expect to share anything but some libraries -- which nobody
  else wants to share with us anyway (except for the C library).  We are the
  only user of them.
  I discovered about a week ago that I could probably have used waitid() with
  the WNOWAIT flag, but I didn't know about it at the time.  Pause-at-end was
  quick to write, anyway: it took about 15 minutes from the desire to know the
  peak memory use to having written and tested the first cut of it.
  Pause-at-end is not completely bullet-proof if dynamic libraries get
  unloaded before the end of the program is reached.  On the other hand, it is
  plenty good enough for these tests, and it conceivably allows more intricate
  poking around than the waitid() solution would.
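  The parsing side of pause-at-end is short.  A sketch, assuming the usual
  "VmHWM:  <size> kB" formatting that Linux emits in /proc/<pid>/status:

      #include <stdio.h>
      #include <sys/types.h>

      /* While the paused child is still alive, read its peak RSS (in kB)
         out of /proc/<pid>/status.  Returns -1 on failure. */
      static long peak_rss_kb(pid_t pid)
      {
              char path[64], line[256];
              long kb = -1;

              snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
              FILE *f = fopen(path, "r");
              if (!f)
                      return -1;
              while (fgets(line, sizeof line, f))
                      if (sscanf(line, "VmHWM: %ld kB", &kb) == 1)
                              break;
              fclose(f);
              return kb;
      }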

getting good measurements -- eatmem, dd; probably should also dd the library.
  It is good to have a "sacrificial run", and good to measure how good the
  measurements are (relative std.dev + a user/sys/real check -- see the
  sketch at the end of this block).
  why average -- the disturbances are mostly interrupts, daemons that
    everybody has anyway, and slightly luckier/unluckier physical pages.
    These are real effects that nobody can control anyway.  I'm not
    interested in the best possible times on an ideal, undisturbed machine
    with a helpful kernel.  I'm interested in clean times under realistic
    circumstances.  Therefore the average instead of the minimum.
  why real and not user/sys -- because of how blocking reads vs. mmap vs.
   madvise/fadvise vs. reading in a separate thread would be accounted for in
   the future.  User+sys would probably give me better numbers at the moment
   and I could change to real later.  Still, I choose to stay with real (and
   the difference is marginal, anyway).
   
   Funny that the exact distribution of time between sys and user fluctuates a
   lot.  In space-bslc8-lenfil-2 sys varies between 0.160s and 0.244s.  Real is
   completely stable with 5x 1.396s and 1x 1.397s.

  look at /proc/interrupts, perhaps copy it before/after each run to a .intr
  file?  Warn if the interrupt rate exceeds the 100 (or 1000) Hz timer tick
  by more than 10%?
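  For the record, the "how good are the measurements" number is just the
  relative standard deviation of the run times.  A sketch (the six times are
  the real times quoted above; everything else is illustrative):

      #include <math.h>
      #include <stdio.h>

      /* relative standard deviation, in percent, of n run times */
      static double rel_stddev(const double *t, int n)
      {
              double sum = 0.0, sq = 0.0;
              for (int i = 0; i < n; i++) {
                      sum += t[i];
                      sq  += t[i] * t[i];
              }
              double mean = sum / n;
              double var  = sq / n - mean * mean;
              return 100.0 * sqrt(var > 0.0 ? var : 0.0) / mean;
      }

      int main(void)
      {
              double real[] = { 1.396, 1.396, 1.396, 1.396, 1.396, 1.397 };
              printf("rel. std.dev = %.3f%%\n", rel_stddev(real, 6));
              return 0;
      }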

write date/time + runlevel to platforminfo and/or sysinfo.

barcharts
  why barcharts.
  should the time/mem barcharts have equal lengths?  I don't think so (it is
    hard to colour them in a text file.  Colour would work with less -r and
    on the console, but not in an email or a text editor.  A visual
    difference in length is good).
    But they should perhaps not be /that/ different.
  visible markers if the measurements are bad.
   (the 5% real vs. user+sys check; the two are typically within 0.1% on my
   old laptop when doing a quick or thorough benchmark.  Occasionally up to
   1% -- and 3% on c/byte-4k, because it only takes 56ms in total.)
  prints out how tight the user/sys/real agreement is.

microarchitecture -- performance counters.  These would be interesting to
    look at once the obvious performance problems have been handled.  Let's
    fix the memory usage of bytestrings, the performance of lazy bytestrings,
    and start using registers in the machine code first.
    The regularity of the input file probably means that the branch predictor
    on all three CPUs can remember the pattern of spaces vs. non-spaces (or
    at least part of the pattern).  Branch predictors don't just use a
    two-bit saturating counter for strongly non-taken/weakly non-taken/weakly
    taken/strongly taken; they also try to remember the pattern of
    jumps/non-jumps.  A more realistic test would have a less regular input
    file (see the sketch below).  This effect is very small given the current
    performance limiters, though.
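    To see the pattern effect in isolation, one could count spaces in a
    perfectly regular buffer and then in a shuffled copy of it: same data,
    same count, different branch pattern.  A sketch (assuming the compiler
    keeps the comparison as a real branch rather than a conditional move):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (64 * 1024 * 1024)

        static long count_spaces(const unsigned char *buf, long n)
        {
                long c = 0;
                for (long i = 0; i < n; i++)
                        if (buf[i] == ' ')      /* the branch in question */
                                c++;
                return c;
        }

        int main(void)
        {
                unsigned char *buf = malloc(N);
                if (!buf)
                        return 1;

                /* regular: every 8th byte is a space -- trivially
                   predictable once the pattern has been seen */
                for (long i = 0; i < N; i++)
                        buf[i] = (i % 8 == 0) ? ' ' : 'x';

                clock_t t0 = clock();
                long a = count_spaces(buf, N);
                clock_t t1 = clock();

                /* shuffle to destroy the pattern without changing
                   the space count */
                srand(42);
                for (long i = N - 1; i > 0; i--) {
                        long j = rand() % (i + 1);
                        unsigned char tmp = buf[i];
                        buf[i] = buf[j];
                        buf[j] = tmp;
                }

                clock_t t2 = clock();
                long b = count_spaces(buf, N);
                clock_t t3 = clock();

                printf("regular:  %ld spaces, %.3fs\n",
                       a, (double)(t1 - t0) / CLOCKS_PER_SEC);
                printf("shuffled: %ld spaces, %.3fs\n",
                       b, (double)(t3 - t2) / CLOCKS_PER_SEC);
                free(buf);
                return 0;
        }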

cache -- turned out to be pretty regular (judging by eyeballing the
    cachegrind reports).  Going up a factor of 10 in file size, the number of
    accesses also went up a factor of 10, and the miss ratios stayed the
    same.  The miss ratios differed a bit between the benchmarks, but I don't
    think it's time to look into that yet.  The data are available, though,
    for those who can't wait.

minor page faults
    We gather these through /usr/bin/time -- and could also get the same info
    by dumping the right file inside /proc/self/.  Probably not important
    yet, but it probably will be once all the low-lying fruit has been
    gathered up from the ground (a sketch of the do-it-yourself route follows
    below).
    More of a factor on slower OS'es than Linux.
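    For the record, /usr/bin/time gets those numbers from the rusage struct
    that wait3()/wait4() fill in; a process can ask about itself the same
    way.  A minimal sketch:

        #include <stdio.h>
        #include <sys/resource.h>

        int main(void)
        {
                struct rusage ru;
                if (getrusage(RUSAGE_SELF, &ru) < 0) {
                        perror("getrusage");
                        return 1;
                }
                printf("minor faults: %ld\n", ru.ru_minflt);
                printf("major faults: %ld\n", ru.ru_majflt);
                return 0;
        }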


C files.  Buffer size.
    Reading it all in one go is slower than (re)using a small buffer.  Cache
    effects, both in the operating system when copying (because the
    destination will be cached with a small buffer but not with a big buffer)
    and in the application (everything will be cached with the small buffer,
    nothing with the big buffer).  Note that at least the Core and the
    Athlon64 have automatic prefetchers that try to fill the cache in advance
    so we don't have to wait for the cache misses.  That doesn't quite seem
    to work here (see the sketch below).
    Older caches had a different write behaviour: they were write-through
    instead of the modern (lazy) write-back.  For those caches, writing to
    the user-space buffer should be slow even when a small buffer is reused
    all the time (because we would have to wait for all the writes to be
    flushed out to main memory).
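    The two strategies, reduced to their essence (the buffer size and the
    space-counting body are illustrative; the real benchmarks differ in
    details):

        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* small reused buffer: stays cache-resident across iterations */
        static long count_small(void)
        {
                static char buf[64 * 1024];
                long spaces = 0;
                ssize_t n;
                while ((n = read(0, buf, sizeof buf)) > 0)
                        for (ssize_t i = 0; i < n; i++)
                                if (buf[i] == ' ')
                                        spaces++;
                return spaces;
        }

        /* one big buffer: every byte is written once (by the kernel's
           copy) and read once (by us), so the caches never get to help */
        static long count_big(long filesize)
        {
                char *buf = malloc(filesize);
                long spaces = 0, got = 0;
                ssize_t n;
                if (!buf)
                        return -1;
                while (got < filesize &&
                       (n = read(0, buf + got, filesize - got)) > 0)
                        got += n;
                for (long i = 0; i < got; i++)
                        if (buf[i] == ' ')
                                spaces++;
                free(buf);
                return spaces;
        }

        int main(int argc, char **argv)
        {
                /* "./a.out < file" uses the small buffer;
                   "./a.out <filesize> < file" reads it all in one go */
                if (argc > 1)
                        printf("%ld\n", count_big(atol(argv[1])));
                else
                        printf("%ld\n", count_small());
                return 0;
        }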

C files.
    getchar()/getchar_unlocked().  getchar() is NOT a fair comparison with
    what the simple Haskell program does: it is thread-safe by default
    (because of libraries), taking a lock on stdin for every character.
    getchar_unlocked() matches what the Haskell programs do.  Both getchar()
    and getchar_unlocked() use a single buffer for stdin (see the sketch
    below).
    getwchar() and getwchar_unlocked() are included at the insistence of wli.
    They are much slower, because the encoding depends on the locale at run
    time.  Therefore, getwchar_unlocked() can't be a macro the way
    getchar_unlocked() is.  With an indirect jump, it should be the same
    speed as getchar() on the Core and the Athlon64 -- but curiously it
    isn't.
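    The getchar_unlocked() variant boils down to this (a sketch, not a copy
    of the actual benchmark):

        #define _POSIX_C_SOURCE 199506L
        #include <stdio.h>

        int main(void)
        {
                long spaces = 0;
                int c;
                /* no per-call locking of stdin -- the whole point */
                while ((c = getchar_unlocked()) != EOF)
                        if (c == ' ')
                                spaces++;
                printf("%ld\n", spaces);
                return 0;
        }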

C and Haskell integer sizes and other limitations.
    Haskell uses unboxed 32-bit signed integers, except in the lazy lenfil
    tests.

    Most of the C programs are simple and just use an int for the space
    count.

    One of them (space-megabuf) is more complicated: off_t is 64-bit,
    ssize_t is 32-bit, so there is a potential overflow in c/space-megabuf.
    There is a potential 32-bit wrap-around in all my C tests.  The same
    problem exists in all the Haskell tests, except for the two lenfil tests
    that use lazy bytestrings, because they use a 64-bit int for the length
    of the intermediate string of just the spaces filtered out from stdin.
    Those two can also potentially run out of memory, since they use about
    107MB for the 143MB input file.  In practice, they have almost the same
    limit as the rest: they will run out of virtual address space or RAM or
    swap at about the same time that the others run out of bits in a 32-bit
    signed integer.
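    Rough arithmetic on where the limits bite (the ~3GB usable address space
    is an assumption about 32-bit Linux; the 107/143 ratio is from the
    measurements above):

        #include <stdio.h>

        int main(void)
        {
                long long int_max = 2147483647LL;       /* 2^31 - 1 */

                /* worst case for the int counters: a file of nothing
                   but spaces overflows after INT_MAX + 1 bytes */
                printf("counter limit: %lld bytes (~%.1f GB)\n",
                       int_max + 1, (int_max + 1) / 1e9);

                /* the lazy lenfil tests keep an intermediate string of
                   just the spaces: ~107MB of heap per 143MB of input,
                   so a ~3GB address space runs out near... */
                printf("address-space limit: ~%.1f GB of input\n",
                       3e9 * (143.0 / 107.0) / 1e9);
                return 0;
        }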

-Peter


