[Haskell-cafe] Data.Binary suboptimal instance

Andrew Coppin andrewcoppin at btinternet.com
Fri May 22 14:42:51 EDT 2009


So lately I've been working on a little program to generate trippy 
graphics. (Indeed, some of you may remember it from a few years back...) 
Anyway, getting to the point, I just restructured my program. As part of 
the restructuring, I got rid of all the jiggery-pokery with 
Data.Array.Storable and so forth and decided to use Data.Binary.

My first problem was quite simple. I want to write a program that reads 
an item from the input file, processes it, writes it to the output file, 
and repeats until the input file is empty. Data.Binary doesn't appear to 
provide any obvious way to do this. You can't be in the Get and Put 
monads simultaneously, and I can't figure out how to interleave them. In 
the end I settled for something like

  xs <- decodeFile f1
  encodeFile f2 (map f xs)

Unfortunately, it looks like this doesn't do what I want. Serialised 
lists have their length prepended to them, which means that encodeFile 
has to generate the *entire* list in memory so it can count how big it 
is before it can output a single byte to disk. (!) This is particularly 
annoying since the output list will be exactly the same size as the 
input one, and the very first thing decodeFile does is read in that 
size figure.

Maybe there's a way around this glitch, and I just haven't thought of it 
yet. But you'd think that wanting to lazily process data from an input 
file and write it into an output file would be an extremely common 
use-case. Given that, I'm not seeing an obvious way to handle it.
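
For what it's worth, the best dodge I've come up with so far (just a 
sketch; it assumes the file is a bare sequence of encoded items with 
*no* length header, i.e. not something written by encodeFile on a 
list) is to bypass the list instance and pull items out one at a time 
with runGetState:

  import           Data.Binary          (Binary, get, put)
  import           Data.Binary.Get      (runGetState)
  import           Data.Binary.Put      (runPut)
  import qualified Data.ByteString.Lazy as L

  -- Decode consecutive items until the input runs out.
  decodeMany :: Binary a => L.ByteString -> [a]
  decodeMany bs
    | L.null bs = []
    | otherwise = let (x, rest, _) = runGetState get bs 0
                  in  x : decodeMany rest

  -- Read items from f1, apply f, and write the results to f2 one at a
  -- time, with no length header and no need to hold everything in RAM.
  streamFile :: (Binary a, Binary b) => (a -> b) -> FilePath -> FilePath -> IO ()
  streamFile f f1 f2 = do
      input <- L.readFile f1
      L.writeFile f2 (L.concat [ runPut (put (f x)) | x <- decodeMany input ])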

Anyway, that's one small problem, but I can live with it. There's 
another, much bigger problem though, and that's really what I wanted to 
talk about...



As we all know, Data.Binary is *built* for speed. So imagine my shock 
when my newly adjusted program, now based around this library, turned 
out to be massively slower, and the files it produced were drastically 
larger. It didn't take me long to dig up the problem.

The problem seems to boil down to this: the Binary instance for Double 
(and Float, for that matter) is... well, I guess you could argue it's 
very portable, but efficient it isn't. As we all know, an IEEE-754 
double-precision floating-point number occupies 64 bits: 1 sign bit, 11 
exponent bits, and 52 mantissa bits (53 counting the implicit leading 
bit). I had assumed that the Binary instance for Double would simply 
write these bits to disk, requiring approximately zero computational 
power and exactly 64 bits of disk space. I was wrong.

...what it *actually* does is convert the 64-bit Double into a 32-bit 
exponent and an arbitrary-precision integer mantissa. (!) It appears to 
do this with floating-point arithmetic rather than just low-level bit 
shuffling. Looking up how arbitrary-precision integers are serialised, 
I see that there's an 8-bit "flag" indicating whether the integer fits 
into 32 bits, and if it doesn't, there's an 8-bit sign value, followed 
by a serialised list of bytes. So, in other words, a 32-bit length 
value, followed by the bytes themselves.
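
As far as I can tell, the instance is morally equivalent to something 
like this (my paraphrase, not the actual library source):

  -- decodeFloat splits a Double into an arbitrary-precision mantissa
  -- and an Int exponent; that pair is what actually gets serialised.
  instance Binary Double where
      put d = put (decodeFloat d)            -- (Integer, Int) pair
      get   = fmap (uncurry encodeFloat) get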

To summarise, when I ask for a 64-bit Double to be serialised, I get this:
  32-bit exponent.
  8-bit flag (probably always "1")
  8-bit sign (either "0" or "1")
  32-bit size (probably always "8")
  64-bit mantissa

That's 32 + 8 + 8 + 32 + 64 = 144 bits in total, i.e., 2.25 times the 
size of the original 64-bit Double. (Let's not even go into how much 
computing power it takes to construct all this data before it's 
written to disk...)



So, that's why my files ballooned in size, and why my program slowed 
to a crawl. But how do I fix this? Well, my solution was simple, 
brutal, and probably *highly* non-portable. It begins with 
Unsafe.Coerce (and that's never a good thing!).

Put simply, if you take your 64-bit Double and unsafeCoerce it into a 
Word64 and then ask Data.Binary to serialise that, it writes it straight 
to disk without screwing around with it first. The result ought to be 
portable to any architecture that uses IEEE-754 arithmetic natively. 
(Anybody know of an arch this *doesn't* apply to?) But sure, if you 
wanted to do this on some other architecture with a different native 
float format, you'd have to write some tricky code to handle it. (And 
then add conditional compilation for it.)
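
In code, the hack looks something like this (a sketch only; RawDouble 
is just an illustrative wrapper name, and the unsafeCoerce round trip 
is GHC-specific on top of assuming native IEEE-754 Doubles):

  import Data.Binary    (Binary (..))
  import Data.Word      (Word64)
  import Unsafe.Coerce  (unsafeCoerce)

  -- Serialise the raw 64 bits of the Double, nothing more.
  newtype RawDouble = RawDouble Double

  instance Binary RawDouble where
      put (RawDouble d) = put (unsafeCoerce d :: Word64)
      get = do w <- get
               return (RawDouble (unsafeCoerce (w :: Word64)))

A wrapper like that at least keeps the ugliness in one place instead 
of scattering unsafeCoerce calls all over the program.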

Is there any danger that there might be some kind of improvement to the 
Double instance in the next version of Data.Binary?

END.

