[Haskell-cafe] Using Data.Binary for compression

Chad Scherrer chad.scherrer at gmail.com
Thu Nov 15 14:10:01 EST 2007


> > Almost all 'real users' just use Codec.Compression.GZip.  It's very
> > fast, very compositional, and (perhaps surprisingly) almost as effective
> > as application-specific schemes.
>
> I was about to say the same thing. So so much simpler to use Duncan's
> carefully written zlib binding,
>
>     import Data.Binary
>     import Codec.Compression.GZip
>     import qualified Data.ByteString.Lazy as L
>
>     main = L.writeFile "log.gz" . compress . encode $ [1..10::Int]
>
> Simple, purely functional, fast.
>
> -- Don

I have several types of symbols, and for each type the probabilities
are very predictable - predictable enough that they could even be
hard-coded. And upon completion I can be sure the first two questions
will be "Can we make it smaller?" and "Can we make it faster?". GZip
(while very cool) is adaptive and general-purpose, so it builds its
frequency tables as it goes and ignores the structure of the data that
I should be able to take advantage of.

With a great deal of effort, it should be possible to write something
in C that runs faster and compresses better than gzip for this
particular data. With the probability structure known in advance, many
of the steps gzip takes are simply unnecessary. Besides this, gzip
assumes only an arbitrary sequence of bytes, whereas my data are much
more structured than that.
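
To make that concrete, here is a rough sketch of what a hard-coded
prefix code might look like. The symbol type and the code words below
are invented for illustration; in practice they would be derived from
the known probabilities for each symbol type.

    -- Hypothetical symbol type; the symbols and code words here are
    -- made up, standing in for the real (known) probability tables.
    data Sym = A | B | C | D

    -- Frequent symbols get short code words, rare ones longer ones.
    encodeSym :: Sym -> [Bool]
    encodeSym A = [False]                 -- p ~ 0.5
    encodeSym B = [True, False]           -- p ~ 0.25
    encodeSym C = [True, True, False]     -- p ~ 0.125
    encodeSym D = [True, True, True]      -- p ~ 0.125

    encodeSyms :: [Sym] -> [Bool]
    encodeSyms = concatMap encodeSym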

Considering the high performance achieved with idiomatic Haskell and
the ByteString and Binary libraries, I would think a similar approach
could work for writing individual bits. That would make it
(relatively) easy to write custom compression routines in Haskell with
reasonable performance - I don't think the same can be said of any
other language.
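
For what it's worth, here is a minimal sketch of the sort of
bit-packing I have in mind, built on lazy ByteStrings. This is not an
existing Binary API - just a hand-rolled illustration, and packBits is
a made-up name:

    import Data.Bits (setBit)
    import Data.Word (Word8)
    import qualified Data.ByteString.Lazy as L

    -- Pack a list of bits (most significant bit first) into bytes,
    -- padding the final byte with zero bits.
    packBits :: [Bool] -> L.ByteString
    packBits = L.pack . go
      where
        go [] = []
        go bs = let (byte, rest) = splitAt 8 bs
                in packByte byte : go rest

        packByte :: [Bool] -> Word8
        packByte bits = foldl set 0 (zip [7,6..0] bits)
          where
            set w (i, True)  = setBit w i
            set w (_, False) = w

Combined with a symbol encoder like the one above, something like
L.writeFile "out.bin" (packBits (encodeSyms syms)) would write the
whole compressed stream, much in the style of Don's gzip example.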

Chad

