Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

Don Stewart dons at galois.com
Mon Feb 23 20:03:14 EST 2009


wren:
> Neil Mitchell wrote:
>> 2) The storage for String seems to be raw strings, which is nice.
>> Would I get a substantial speedup by moving to bytestrings instead of
>> strings? If I hashed the strings and stored common ones in a hash
>> table is it likely to be a big win?
>
> Bytestrings should help. The big wins in this application are likely to  
> be cache issues, though the improved memory/GC overhead is nice too.
>

Here's a quick demo using Data.Binary directly.

First, let's read in the dictionary file and build a big, worst-case finite
map of words to their occurrence count (always 1).

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import System.Environment
    import qualified Data.Map as M

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        -- group identical adjacent lines; each word maps to its count
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        encodeFile "dict" m
        print "done"

So that writes a "dict" file containing a binary-encoded Map ByteString Int.
Everything here is compiled with ghc -O2 --make.

    $ time ./A /usr/share/dict/cracklib-small
    "done"
    ./A /usr/share/dict/cracklib-small  0.28s user 0.03s system 94% cpu 0.331 total

That yields the dictionary file:

    $ du -hs dict
    1.3M    dict

Now, let's read the file back in and decode it to a Map:

    -- (same imports as the writer above)
    main = do
        [f] <- getArgs
        m   <- decodeFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Easy enough:

    $ time ./A dict +RTS -K20M
    52848
    "done"
    ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total


OK. So it takes 1.5s to decode the 1.3M Map (and note the enlarged stack via -K20M). There may be faster ways to
rebuild the Map, since we know the keys arrive in sorted order, but the generic Data.Binary instance can't exploit that.
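
For what it's worth, here's a sketch (mine, not from the original post) of how
the sortedness could be exploited: binary's Map instance serializes the map as
its size-prefixed ascending association list, so the same bytes decode as a
plain list of pairs, which M.fromDistinctAscList can rebuild in a single pass,
skipping the comparisons a generic fromList would perform (depending on your
binary version, the stock instance may already do this):

    import Data.Binary
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy  as L
    import qualified Data.Map as M
    import System.Environment

    -- Assumes "dict" was written by encodeFile with the standard Map
    -- instance, i.e. as an ascending list of key/value pairs.
    decodeSortedMap :: L.ByteString -> M.Map B.ByteString Int
    decodeSortedMap = M.fromDistinctAscList . decode

    main = do
        [f] <- getArgs
        m <- decodeSortedMap `fmap` L.readFile f
        print (M.size m)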

Since encode and decode are nice pure functions on lazy bytestrings, we can try the trick of
compressing/decompressing the dictionary in memory.

Compressing the dictionary:

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy  as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

Pretty cool, imo, is the "compress . encode" composition.
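
The types line up because both sides speak lazy ByteStrings; these are the
signatures from binary and zlib:

    encode   :: Binary a => a -> L.ByteString
    compress :: L.ByteString -> L.ByteString

Timing the compressed build: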

    $ time ./A /usr/share/dict/cracklib-small 
    "done"
    ./A /usr/share/dict/cracklib-small  0.38s user 0.02s system 85% cpu 0.470 total

OK. So building the compressed dictionary takes only a bit longer than building the uncompressed one (zlib is fast).
And the size:

    $ du -hs dict.gz 
    216K    dict.gz

The compressed dictionary is much smaller (216K vs 1.3M). Let's load it back in and unpickle it:
    
    -- (same imports as the compressing writer above)
    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Also cute. But how fast does it run?

    $ time ./A dict.gz
    52848
    "done"
    ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total

Interesting. So decoding the Map from a compressed bytestring in memory is a fair bit faster than loading it
directly, uncompressed, from disk: 0.31s vs 1.58s total.
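
Packaged up, the whole trick is a pair of tiny helpers (a sketch with names of
my own choosing, not part of the original post):

    import Data.Binary
    import qualified Data.ByteString.Lazy as L
    import Codec.Compression.GZip

    -- Write any Binary value as a gzip-compressed file.
    encodeFileGz :: Binary a => FilePath -> a -> IO ()
    encodeFileGz f = L.writeFile f . compress . encode

    -- Read it back, decompressing in memory before decoding.
    decodeFileGz :: Binary a => FilePath -> IO a
    decodeFileGz f = (decode . decompress) `fmap` L.readFile f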

Neil, does that give you some ballpark figures to work with?

-- Don

