[Haskell-cafe] Compress and serialise data with lazy bytestrings, zlib and Data.Binary (was: Allocating enormous amounts of memory)

Sun Jul 8 22:38:57 EDT 2007

Jefferson Heard write:

> I'm using the Data.AltBinary package to read in a list of 4.8 million
> floats and 1.6 million ints.  Doing so caused the memory footprint to
> blow up to more than 2gb, which on my laptop simply causes the program
> to crash.  I can do it on my workstation, but I'd really rather not,
> because I want my program to be fairly portable.  
> 
> The file that I wrote out in packing the data structure was only 28MB,
> so I assume I'm just using the wrong data structure, or I'm using full
> laziness somewhere I shouldn't be.

Here's a quick example of how to efficient read and write such a structure to
disk, compressing and decompressing on the fly. 

    $ time ./A
    Wrote 4800000 floats, and 1600000 ints
    Read  4800000 floats, and 1600000 ints
    ./A  0.93s user 0.06s system 89% cpu 1.106 total

It uses Data.Binary to provide quick serialisation, and the zlib library to
compress the resulting stream. It builds the tables in memory, writes and
compresses the result to disk, reads them back in, and checks we read the right
amount of CFloats and CInts. You'd then pass the CFloats over to your C library
that needs them.

Compressing with zlib is a flourish, but cheap and simple, so we may as well do
it. With zlib and Data.Binary, the core code just becomes:

        encodeFile "/tmp/table.gz" table
        table' <- decodeFile "/tmp/table.gz"

Which transparently streams the data through zlib, and onto the disk, and back.

Simple and efficient.

-- Don

Here's the code:

    -- some imports

    import Text.Printf
    import Data.Binary
    import Codec.Compression.Zlib
    import Control.Monad
    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Base as B
    import Foreign
    import Foreign.C.Types

    -- A simple table type for the arrays, that will be easy to manipulate

    data Table = Table { floats :: L.ByteString
                       , ints   :: L.ByteString }

    -- Serialise this data in gzip-compressed form: a gzipping Binary instance.

    instance Binary Table where
        put t = do put (compress (floats t))
                   put (compress (ints   t))

        get = liftM2 Table (decompress `liftM` get)
                           (decompress `liftM` get)

    -- Default sizes

    floatSize = 4800000
    intSize   = 1600000

    -- Build a new empty table

    newTable :: IO Table
    newTable = do
        f <- mallocArray floatSize :: IO (Ptr CFloat)
        i <- mallocArray intSize   :: IO (Ptr CInt)

        -- fill them with data here --

        -- convert to bytestrings.
        bf <- B.packCStringFinalizer (castPtr f)
                    (floatSize * sizeOf (undefined :: CFloat)) (return ())
        bi <- B.packCStringFinalizer (castPtr i)
                    (intSize   * sizeOf (undefined :: CInt))   (return ())
        return $ Table (L.fromChunks [bf]) (L.fromChunks [bi])

    -- Now just build the table, serialise it, read it back in, and print the result

    main = do
        table <- newTable

        -- write the data to disk, compressed with gzip as we go.
        encodeFile "/tmp/table.gz" table
        draw "Wrote" table

        -- load it back in, decompressing on the fly
        table' <- decodeFile "/tmp/table.gz"
        draw "Read"  table'

        -- now use 'floats' as a Ptr to pass back to C.

      where
        draw s v = printf "%s %d floats, and %d ints\n" s
                        (fromIntegral (lengthFloats v) `div` sizeOf (undefined :: CFloat))
                        (fromIntegral (lengthInts v  ) `div` sizeOf (undefined :: CInt))

        lengthFloats = L.length . floats
        lengthInts   = L.length . ints