[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

Sat Dec 15 22:07:17 EST 2007

firefly:
> What do you think the relative speeds are of the six small haskell
> programs at the end of this email?
> 
> All they do is read from stdin and count the number of spaces they see.
> There are two that use strict bytestrings, two that use lazy
> bytestrings, and two that use the standard Haskell strings.  Three use a
> recursive function with an accumulator parameter and three use a foldl
> with a lambda function.
> 
> Say the fastest one takes the time 1.  How much time will the others
> take?
> 
> And how about memory?  How much memory do you think they require?  Let's
> say we feed a 150MB(*) file into each of them, how many megabytes do you
> think they end up using (as seen from the OS, not in terms of how big
> the live heap is)?
> 
> I'm going to post full benchmarks + analysis on Wednesday.
> 
> -Peter
> 
> *) hardddisk megabytes.  The file is 150000034 bytes ≈ 143 mebibytes.
> 
> 
> PS: For extra credit, what do you think is the peak memory use for this
>     program when given an input file of 150MB?

Well, I'm not going to wait till Wednesday for the numbers!

    $ ghc --version
    The Glorious Glasgow Haskell Compilation System, version 6.8.2

    $ du -hs 150M 
    150M

------------------------------------------------------------------------
Program 1:

 {-# LANGUAGE BangPatterns #-}

 import qualified Data.ByteString.Lazy.Char8 as B
 import GHC.Int (Int64)

 -- note that D.BS.Lazy.Char8.length is ByteString -> Int64
 --           D.BS.C8.length is ByteString -> Int
 cnt    :: B.ByteString -> Int64
 cnt bs = B.length (B.filter (== ' ') bs)

 main = do s <- B.getContents
          print (cnt s)

Ok, so lazy bytestrings. Should be constant space use, but two traversals
(since there's no lazy bytestring fusion, over each lazy chunk).  Not perfect,
but should be ok. length will consume chunks as they're produced by filter.

** Prediction:
    Constant ~3M space use (runtime, plus small overhead)
    Fast, due to chunk-wise processing.

** Result:
    $ ghc -O2 A.hs -o A --make
    $ time ./A < 150M +RTS -sstderr
    ./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total

    And top says 40M allocated.

** Summary: Ok, pretty fast, but an unexpected(!) amount of memory allocated.

Now, this memory result is suspicious, I wonder if the now obsolete 'array fusion'
is messing things up. In Data.ByteString.Lazy, we have:

    filter p = F.loopArr . F.loopL (F.filterEFL p) F.NoAcc

We keep meaning to replace this stuff with the stream fusion mechanisms, which 
compile a lot better. Perhaps the time has come to look at that :)

I'll put this memory allocation down as a bug that needs to be looked at.

------------------------------------------------------------------------
Program 2:

 hs/space-bs-c8-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}

 import qualified Data.ByteString.Char8 as B

 cnt    :: Int -> B.ByteString -> Int
 cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

 main = do s <- B.getContents
          print (cnt 0 s)

Strict bytestrings, and you manually fuse the length/filter calculation.
Allocating all that memory will cost you.

** Prediction:
    O(N) memory, around 150M allocated
    Slower, due to cache effects (more data to traverse) and more indirections.

** Result   
    top says 154M
    ./B +RTS -sstderr < 150M  1.10s user 0.52s system 111% cpu 1.454 total

** Summary: Seems reasonable, when its doing all that allocation. 

------------------------------------------------------------------------
Program 2a:

But we could easily improve this program: Since:
    length (filter (== ' ')
  ==
    length (filterByte ' '
  ==
    count ' '

And we have:

    import qualified Data.ByteString.Char8 as B

    cnt   :: B.ByteString -> Int
    cnt x = B.count ' ' x

    main = do s <- B.getContents
              print (cnt s)

** Prediction: Which should be a bit faster.

** Result:
    $ time ./B < 150M                            
    24569024
    ./B < 150M  0.66s user 0.55s system 113% cpu 1.070 total

** Summary: So that's the fastest program so far.

The rewrite rules to do these transformatoins used to be enabled, but 
need looking at again. There should also be no real benefit to manually fuse a
length . filter loop like this, however, the old fusion system used in
bytestring might have some small overhead here. This also needs looking at.

------------------------------------------------------------------------
Program 2b:

We can do even better if we read the file in:

    import qualified Data.ByteString.Char8    as B
    import System.IO.Posix.MMap

    main = print . B.count ' ' =<< mmapFile "150M"

** Prediction: super fast

** Result:

    $ time ./B2
    24569024
    ./B2  0.31s user 0.01s system 101% cpu 0.314 total

(similar results if you use vanilla B.readFile too, fwiw).

Summary: This suggests to me we could look again at how strings of unknown size
         are read in.

------------------------------------------------------------------------
Program 3:

 hs/space-bslc8-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}

 import qualified Data.ByteString.Lazy.Char8 as B

 cnt    :: Int -> B.ByteString -> Int
 cnt !acc bs = if B.null bs
                    then acc
                    else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

 main = do s <- B.getContents
                print (cnt 0 s)

Ok, for lazy bytestrings.  

** Prediction: Should run in constant space, but there are 
   more checks than in the unfused length/filter case.
   It should be a bit slower, and hopefully run in constant space.

** Result:
    $ time ./C < 150M
    24569024
    ./C < 150M  2.36s user 0.11s system 99% cpu 2.489 total
    top says 3804K

** Summary:
    So it runs in the 3M constant space I'd expected the original program to run in,
    but its a fair bit slower. The generated code looks pretty good though.

    Investigate why this is slower.

------------------------------------------------------------------------
Program 4:

  ==============================
 hs/space-xxxxx-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}

 cnt   :: Int -> String -> Int
 cnt !acc bs = if null bs
               then acc
               else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

 main = do s <- getContents
         print (cnt 0 s)

** Prediction:
      Lazy, so constant space, but probably 10x slower than the previous program.

** I'm not going to bother with this one, since strings suck for large data.

------------------------------------------------------------------------
Program 5:

 hs/space-bs-c8-foldlx-1.hs:
 {-# LANGUAGE BangPatterns #-}

 import qualified Data.ByteString.Char8 as B

 cnt   :: B.ByteString -> Int
 cnt bs        = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

 main = do s <- B.getContents
           print (cnt s)

Ok, a strict foldl'. 

** Prediction:
     Should be similar to the strict bytestring first example, and a little
     faster due to no redundant null checks.

** Result

     $ time ./D < 150M 
     24569024
    ./D < 150M  1.02s user 0.58s system 111% cpu 1.436 tota

** Summary: As expected

------------------------------------------------------------------------
Program 6:

  ==============================
  hs/space-bslc8-foldlx-1.hs:
  {-# LANGUAGE BangPatterns #-}

  import qualified Data.ByteString.Lazy.Char8 as B

  cnt   :: B.ByteString -> Int
  cnt bs        = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

  main = do s <- B.getContents
          print (cnt s)

A strict fold over a lazy bytestring. 

Prediction: Ok, constant memory use, and should be similar in speed
to the first length. filter program. It'll be faster than manual
accumulating lazy bytestring program, since the redundant null checks
are gone from the inner loop.

Result:
    $ time ./E < 150M
    24569024
    ./E < 150M  0.84s user 0.07s system 98% cpu 0.928 total

    2468k memory.

Wow. Didn't expect it to be that fast. So properly lazy, and nicely fast.
This is what we'd hope to see :) It also re-enforces that there's a bug in the
length . filter program.

------------------------------------------------------------------------
Program 7:

     ==============================
     hs/space-xxxxx-foldl.hs:
     {-# LANGUAGE BangPatterns #-}

     cnt   :: String -> Int
     cnt bs        = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

     main = do s <- getContents
             print (cnt s)

Hmm. Lazy accumulator eh, on String?  Should exhibit a space leak.
Not worth running..

**********************

Summary,

      * Program 1 is fast, as expected, but exhbits a bug in the bytestring
        library's lazy bytestring fusion system.  Something in length or filter
        isn't doing the right job. This code will be replaced by the stream fusion
        system soon.

      * Program 2: as expected. strict IO uses O(N) space, and that has performance 
        effects.

      * Program 3: lazy bytestrings use constant space, but you better avoid
        redundant bounds checks in the inner loops. 

      * Program 4: strings are silly

      * Program 5: as expected. similar to program 2.

      * Program 6: strict foldl's over lazy bytestrings are good :)
                   fast, and constant space.

      * Program 7: see program 4.

Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think).

Nice little benchmark.

-- Don