Difference between revisions of "Dealing with binary data"

From HaskellWiki
Jump to navigation Jump to search
(18 intermediate revisions by 8 users not shown)
Line 1: Line 1:
  +
[[Category:How to]]
 
== Handling Binary Data with Haskell ==
 
== Handling Binary Data with Haskell ==
   
Line 5: Line 6:
 
libraries for handling binary data in Haskell.
 
libraries for handling binary data in Haskell.
   
=== ByteStrings ===
+
=== Bytestrings ===
   
 
Everything else in this tutorial will be based on bytestrings. Normal Haskell
 
Everything else in this tutorial will be based on bytestrings. Normal Haskell
<hask>String</hask> types are linked lists of 32-bit charactors. This has a
+
<hask>String</hask> types are linked lists of 32-bit characters. This has a
number of useful properties like coverage of the Unicode space and lazyness,
+
number of useful properties like coverage of the Unicode space and laziness,
however when it comes to dealing with byte-wise data the <hask>String</hask>
+
however when it comes to dealing with bytewise data, <hask>String</hask>
 
involves a space-inflation of about 24x and a large reduction in speed.
 
involves a space-inflation of about 24x and a large reduction in speed.
   
 
Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
 
Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a <code>uint8_t[]</code>
+
in C, their memory representation would be the same as a <code>uint8_t[]</code>—although bytestrings know their length and don't allow overflows, etc.
- although bytestrings know their length and don't allow overflows etc.
 
   
Their are two major flavours of bytestrings, strict and lazy. Strict
+
There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect - a linear array of bytes in
+
bytestrings are exactly what you would expect—a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings, often this is called
+
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
 
a cord in other languages. When reading a lazy bytestring from a file, the data
 
a cord in other languages. When reading a lazy bytestring from a file, the data
 
will be read chunk by chunk and the file can be larger than the size of memory.
 
will be read chunk by chunk and the file can be larger than the size of memory.
Line 51: Line 51:
 
Note that we are using strict bytestrings here. (It's quite common to import the
 
Note that we are using strict bytestrings here. (It's quite common to import the
 
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
 
<code>ByteString</code> module under the names <code>B</code> or <code>BS</code>.)
Since the bytestrings are strict the code will read the whole of stdin into
+
Since the bytestrings are strict, the code will read the whole of <code>stdin</code> into
 
memory and then write it out. If the input was too large this would overflow
 
memory and then write it out. If the input was too large this would overflow
the availble memory and fail.
+
the available memory and fail.
   
 
Let's see the same program using lazy bytestrings. We are just changing the
 
Let's see the same program using lazy bytestrings. We are just changing the
Line 75: Line 75:
 
found.
 
found.
   
You should review the [[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]]
+
You should review the [http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString.html documentation]
 
which lists all the functions which operate on ByteStrings. The documentation
 
which lists all the functions which operate on ByteStrings. The documentation
 
for the various types (lazy Word8, strict Char8, ...) are all very similar. You
 
for the various types (lazy Word8, strict Char8, ...) are all very similar. You
Line 81: Line 81:
 
import the modules as <code>qualified</code> and give them different names.
 
import the modules as <code>qualified</code> and give them different names.
   
==== The Guts of ByteStrings ====
+
==== The guts of ByteStrings ====
   
I'll just mention in passing that somes you need to do something which would
+
I'll just mention in passing that sometimes you need to do something which would
 
endanger the referential transparency of ByteStrings. Generally you only need
 
endanger the referential transparency of ByteStrings. Generally you only need
 
to do this when using the FFI to interface with C libraries. Should such a need
 
to do this when using the FFI to interface with C libraries. Should such a need
arise, you have have a look at the
+
arise, you can have a look at the
[[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions]] and the
+
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Internal.html internal functions] and the
[[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions]].
+
[http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Unsafe.html unsafe functions].
Remember that the last set of functions are called unsafe for a reason - misuse
+
Remember that the last set of functions are called unsafe for a reason—misuse
can crash you program!.
+
can crash your program!
   
 
=== Binary parsing ===
 
=== Binary parsing ===
Line 96: Line 96:
 
Once you have your data as a bytestring you'll be wanting to parse something
 
Once you have your data as a bytestring you'll be wanting to parse something
 
from it. Here you need to install the
 
from it. Here you need to install the
<tt>[[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]]</tt> package.
+
<tt>[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1 binary]</tt> package. You should read the instructions on
  +
[http://haskell.org/haskellwiki/Cabal/How_to_install_a_Cabal_package how to install a Cabal package] if you haven't done so already.
Instructions for installing Cabal packages are out of scope for this tutorial.
 
   
 
The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
 
The <tt>binary</tt> package has three major parts: the <code>Get</code> monad,
 
the <code>Put</code> monad and a general serialisation for Haskell types. The
 
the <code>Put</code> monad and a general serialisation for Haskell types. The
latter is like the <tt>pickle</tt> module that you may know from Python - it
+
latter is like the <tt>pickle</tt> module that you may know from Python—it
has it's own serialisation format and I won't be covering it any more here.
+
has its own serialisation format and I won't be covering it any more here.
 
However, if you just need to persist some Haskell data structures, it might be
 
However, if you just need to persist some Haskell data structures, it might be
 
exactly what you want: the documentation is
 
exactly what you want: the documentation is
[[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here]]
+
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html here]
   
 
==== The <tt>Get</tt> monad ====
 
==== The <tt>Get</tt> monad ====
  +
  +
The <tt>Get</tt> monad is a state monad; it keeps some state and each action
  +
updates that state. The state in this case is an offset into the bytestring
  +
which is getting parsed. <tt>Get</tt> parses lazy bytestrings; this is how
  +
packages like
  +
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]
  +
can parse files several gigabytes long in constant memory: they are using a
  +
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
  +
lazy bytestring a parse failure (such as running off the end of the bytestring)
  +
is signified by an exception. Exceptions can only be caught in the IO monad
  +
and, because of laziness, might not be thrown exactly where you expect. If this
  +
is a problem, you probably want a strict version of <tt>Get</tt>, which is
  +
covered below.
  +
  +
Here's an example of using the <tt>Get</tt> monad:
  +
  +
<haskell>import qualified Data.ByteString.Lazy as BL
  +
import Data.Binary.Get
  +
import Data.Word
  +
  +
deserialiseHeader :: Get (Word32, Word32, Word32)
  +
deserialiseHeader = do
  +
alen <- getWord32be
  +
plen <- getWord32be
  +
chksum <- getWord32be
  +
return (alen, plen, chksum)
  +
  +
main :: IO ()
  +
main = do
  +
input <- BL.getContents
  +
print $ runGet deserialiseHeader input</haskell>
  +
  +
This code takes three big-endian, 32-bit unsigned numbers from the input string
  +
and returns them as a tuple. Let's try running it:
  +
  +
<pre>% runhaskell /tmp/example.hs << EOF
  +
heredoc> 123412341235
  +
heredoc> EOF
  +
(825373492,825373492,825373493)</pre>
  +
  +
Makes sense, right? Look what happens if the input is too short:
  +
  +
<pre>% runhaskell /tmp/example.hs << EOF
  +
tooshort
  +
EOF
  +
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12</pre>
  +
  +
Here an exception was thrown because we ran out of bytes.
  +
  +
So the <tt>Get</tt> monad consists of a set of operations like
  +
<hask>getWord32be</hask> which walk over the input and return some type of
  +
data. You can see the full list of those functions in the
  +
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Get.html documentation].
  +
  +
Here's another example; decoding an EOF-terminated list of
  +
numbers just involves recursion:
  +
  +
<haskell>listOfWord16 = do
  +
empty <- isEmpty
  +
if empty
  +
then return []
  +
else do v <- getWord64be
  +
rest <- listOfWord16
  +
return (v : rest)</haskell>
  +
  +
==== Strict <tt>Get</tt> monad ====
  +
  +
If you're parsing small messages then, firstly your input isn't going to be a
  +
lazy bytestring but a strict one. That's not reallly a problem because you can
  +
easilly convert between them. However, if you want to handle parse failures you
  +
either have to write your parser very carefully, or you have to deal with the
  +
fact that you can only catch exceptions in the IO monad.
  +
  +
If this is your dilemma, then you need a strict version of the <tt>Get</tt>
  +
monad. It's almost exactly the same, but a parser of type <hask>Get a</hask>
  +
results in <hask>(Either String a, ByteString)</hask> as the result of
  +
<hask>runGet</hask>. That type is a tuple where the first value is ''either'' a
  +
string (an error string from the parse) or the result, and the second value is
  +
the remaining bytestring when the parser finished.
  +
  +
Let's update the first example with this strict version of <tt>Get</tt>. You'll
  +
have to install the
  +
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict binary-strict]
  +
package for it to work.
  +
  +
<haskell>import qualified Data.ByteString as B
  +
import Data.Binary.Strict.Get
  +
import Data.Word
  +
  +
deserialiseHeader :: Get (Word32, Word32, Word32)
  +
deserialiseHeader = do
  +
alen <- getWord32be
  +
plen <- getWord32be
  +
chksum <- getWord32be
  +
return (alen, plen, chksum)
  +
  +
main :: IO ()
  +
main = do
  +
input <- B.getContents
  +
print $ runGet deserialiseHeader input</haskell>
  +
  +
Note that all we've done is change from lazy bytestrings to strict bytestrings
  +
and change to importing <tt>Data.Binary.Strict.Get</tt>. Now we'll run
  +
it again:
  +
  +
<pre>% runhaskell /tmp/example.hs << EOF
  +
heredoc> 123412341235
  +
heredoc> EOF
  +
(Right (825373492,825373492,825373493),"\n")</pre>
  +
  +
Now we can see that the parser was successful (we got a <tt>Right</tt>) and we
  +
can see that our shell actually added an extra newline on the input (correctly)
  +
and the parser didn't consume that, so it's also returned to us. Now we try it
  +
with a truncated input:
  +
  +
<pre>% runhaskell /tmp/example.hs << EOF
  +
heredoc> tooshort
  +
heredoc> EOF
  +
(Left "too few bytes","\n")</pre>
  +
  +
This time we didn't get an exception, but a <tt>Left</tt> value, which can be
  +
handled in pure code. The remaining bytestring is the same because our
  +
truncated input is 9 bytes long, parsing the first two <tt>Word32</tt>'s
  +
consumed 8 bytes and parsing the third failed—at which point we had the last
  +
byte still in the input.
  +
  +
In your parser, you can also call <hask>fail</hask>, with an error string,
  +
which will result in a <tt>Left</tt> value.
  +
  +
That's it; it's otherwise the same as the <tt>Get</tt> monad.
  +
  +
====Incremental parsing====
  +
  +
If you have to deal with a protocol which isn't length prefixed, or otherwise
  +
chunkable, from the network then you are faced with the problem of knowing when
  +
you have enough data to parse something semantically useful. You could run a
  +
strict <tt>Get</tt> over what you have and catch the truncation result, but
  +
that means that you're parsing the data multiple times etc.
  +
  +
Instead, you can use an incremental parser. There's an incremental version of
  +
the <tt>Get</tt> monad in <tt>Data.Binary.Strict.IncrementalGet</tt> (you'll
  +
need the <tt>binary-strict</tt> package).
  +
  +
You use it as normal, but rather than returning an <tt>Either</tt> value, you
  +
get a [http://hackage.haskell.org/packages/archive/binary-strict/0.2.4/doc/html/Data-Binary-Strict-IncrementalGet.html#t%3AResult Result]. You need to go follow that link and look at the documentation for <tt>Result</tt>.
  +
  +
It reflects the three outcomes of parsing possibly truncated data. Either the
  +
data is invalid as is, or it's complete, or it's truncated. In the truncated
  +
case you are given a function (called a continuation), to which you can pass
  +
more data, when you get it, and continue the parse. The continuation, again,
  +
returns a <tt>Result</tt> depending on the result of parsing the additional
  +
data as well.
  +
  +
====Bit twiddling====
  +
  +
Even with all this monadic goodness, sometimes you just need to move some bits
  +
around. That's perfectly possible in Haskell too. Just import
  +
<tt>Data.Bits</tt> and use the following table.
  +
  +
<table>
  +
<tr><th>Name</th><th>C operator</th><th>Haskell</th></tr>
  +
<tr><td>AND</td><td><tt>&amp;</tt></td><td><hask>.&.</hask></td></tr>
  +
<tr><td>OR</td><td><tt>|</tt></td><td><hask>.|.</hask></td></tr>
  +
<tr><td>XOR</td><td><tt>^</tt></td><td><hask>`xor`</hask></td></tr>
  +
<tr><td>NOT</td><td><tt>~</tt></td><td><hask>`complement`</hask></td></tr>
  +
<tr><td>Left shift</td><td><tt>&lt;&lt;</tt></td><td><hask>`shiftL`</hask></td></tr>
  +
<tr><td>Right shift</td><td><tt>&gt;&gt;</tt></td><td><hask>`shiftR`</hask></td></tr>
  +
</table>
  +
  +
====The <tt>BitGet</tt> monad====
  +
  +
As an alternative to bit twiddling, you can also use the <tt>BitGet</tt> monad.
  +
This is another state-like monad, like <tt>Get</tt>, but here the state
  +
includes the current bit-offset in the input. This means that you can easily pull out
  +
unaligned data. Sadly, haddock is currently breaking when trying to generate the
  +
documentation for <tt>BitGet</tt> so I'll start with an example. Again, you'll
  +
need the
  +
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-strict binary-strict] package installed.
  +
  +
Here's a description of the header of a DNS packet, direct from RFC 1035:
  +
  +
<pre> 1 1 1 1 1 1
  +
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
| ID |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
|QR| Opcode |AA|TC|RD|RA| Z | RCODE |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
| QDCOUNT |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
| ANCOUNT |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
| NSCOUNT |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  +
| ARCOUNT |
  +
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</pre>
  +
  +
The actual fields don't matter, but here's a function for parsing it:
  +
  +
<haskell>
  +
parseHeader :: G.Get Header
  +
parseHeader = do
  +
id <- G.getWord16be
  +
flags <- G.getByteString 2
  +
qdcount <- G.getWord16be >>= return . fromIntegral
  +
ancount <- G.getWord16be >>= return . fromIntegral
  +
nscount <- G.getWord16be >>= return . fromIntegral
  +
arcount <- G.getWord16be >>= return . fromIntegral
  +
  +
let r = BG.runBitGet flags (do
  +
isquery <- BG.getBit
  +
opcode <- BG.getAsWord8 4 >>= parseEnum
  +
aa <- BG.getBit
  +
tc <- BG.getBit
  +
rd <- BG.getBit
  +
ra <- BG.getBit
  +
  +
BG.getAsWord8 3
  +
rcode <- BG.getAsWord8 4 >>= parseEnum
  +
  +
return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)
  +
  +
case r of
  +
Left error -> fail error
  +
Right x -> return x</haskell>
  +
  +
Here you can see that only the second line (from the ASCII-art diagram) is
  +
parsed using <tt>BitGet</tt>. An outer <tt>Get</tt> monad is used for
  +
everything else and the bit fields are pulled out with
  +
<hask>getByteString</hask>. Again, <tt>BitGet</tt> is a strict monad and
  +
returns an <tt>Either</tt>, but it doesn't return the remaining bytestring,
  +
just because there's no obvious way to represent a bytestring of a fractional
  +
number of bytes.
  +
  +
You can see the list of <tt>BitGet</tt> functions and their comments in the
  +
[http://darcs.imperialviolet.org/darcsweb.cgi?r=binary-strict;a=headblob;f=/src/Data/Binary/Strict/BitGet.hs source code].
  +
  +
===Binary generation===
  +
  +
In contrast to parsing binary data, you might want to generate it. This is the
  +
job of the <tt>Put</tt> monad. Follow along with the
  +
[http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary-Put.html documentation]
  +
if you like.
  +
  +
The <tt>Put</tt> monad is another state-like monad, but the state is an offset
  +
into a series of buffers where the generated data is placed. All the buffer
  +
creation and handling is done for you, so you can just forget about it. It
  +
results in a lazy bytestring (so you can generate outputs that are larger than memory).
  +
  +
Here's the reverse of our simple <tt>Get</tt> example:
  +
  +
<haskell>import qualified Data.ByteString.Lazy as BL
  +
import Data.Binary.Put
  +
  +
serialiseSomething :: Put
  +
serialiseSomething = do
  +
putWord32be 1
  +
putWord16be 2
  +
putWord8 3
  +
  +
main :: IO ()
  +
main = BL.putStr $ runPut serialiseSomething</haskell>
  +
  +
And running it shows that it's generating the correct serialisation:
  +
  +
<pre>% runhaskell /tmp/example.hs| hexdump -C
  +
00000000 00 00 00 01 00 02 03 |.......|</pre>
  +
  +
If you want the output of <tt>runPut</tt> to be a strict bytestring, you just
  +
need to convert it with <hask>B.concat $ BL.toChunks $ runPut xyz</hask>.
  +
  +
One limitation of <tt>Put</tt>, due to the nature of the <tt>Builder</tt> monad
  +
which it works with, is that you can't get the current offset into the output.
  +
This can be an issue with some formats which require you to encode byte offsets
  +
into the file. You have to calculate these byte offsets yourself.
  +
  +
=== Other useful packages ===
  +
  +
There are other packages which you should know about, but which are mostly
  +
covered by their documentation:
  +
  +
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/network-bytestring-0.1.1 network-bytestring]: for reading and writing bytestring from the network
  +
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/zlib-0.4.0.2 zlib] and [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bzlib-0.4.0.1 bzlib]: for compressed formats
  +
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding-0.3 encoding]: for dealing with character encodings
  +
* [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar-0.1.1.1 tar]: as an example of lazy parsing/serialisation

Revision as of 04:07, 3 August 2012

Handling Binary Data with Haskell

Many programming problems call for the use of binary formats for compactness, ease-of-use, compatibility or speed. This page quickly covers some common libraries for handling binary data in Haskell.

Bytestrings

Everything else in this tutorial will be based on bytestrings. Normal Haskell String types are linked lists of 32-bit characters. This has a number of useful properties like coverage of the Unicode space and laziness, however when it comes to dealing with bytewise data, String involves a space-inflation of about 24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience in C, their memory representation would be the same as a uint8_t[]—although bytestrings know their length and don't allow overflows, etc.

There are two major flavours of bytestrings: strict and lazy. Strict bytestrings are exactly what you would expect—a linear array of bytes in memory. Lazy bytestrings are a list of strict bytestrings; often this is called a cord in other languages. When reading a lazy bytestring from a file, the data will be read chunk by chunk and the file can be larger than the size of memory. The default chunk size is currently 32K.

Within each flavour of bytestring comes the Word8 and Char8 versions. These are mostly an aid to the type system since they are fundamentally the same size of element. The Word8 unpacks as a list of Word8 elements (bytes), the Char8 unpacks as a list of Char, which may be useful if you want to convert them to Strings

You might want to open the documentation for strict bytestrings and lazy bytestrings in another tab so that you can follow along.

Simple file IO

Here's a very simple program which copies a file from standard input to standard output

module Main where

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.getContents
  B.putStr contents

Note that we are using strict bytestrings here. (It's quite common to import the ByteString module under the names B or BS.) Since the bytestrings are strict, the code will read the whole of stdin into memory and then write it out. If the input was too large this would overflow the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the imported ByteString module to be the lazy one and calling the exact same functions from the new module:

module Main where

import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  contents <- BL.getContents
  BL.putStr contents

This code, because of the lazy bytestrings, will cope with any sized input and will start producing output before all the input has been read. You can think of the code as setting up a pipeline, rather than executing in-order, as you might expect. As putStr needs more data, it will cause the lazy bytestring contents to read more until the end of the input is found.

You should review the documentation which lists all the functions which operate on ByteStrings. The documentation for the various types (lazy Word8, strict Char8, ...) are all very similar. You generally find the same functions in each, with the same names. Remember to import the modules as qualified and give them different names.

The guts of ByteStrings

I'll just mention in passing that sometimes you need to do something which would endanger the referential transparency of ByteStrings. Generally you only need to do this when using the FFI to interface with C libraries. Should such a need arise, you can have a look at the internal functions and the unsafe functions. Remember that the last set of functions are called unsafe for a reason—misuse can crash your program!

Binary parsing

Once you have your data as a bytestring you'll be wanting to parse something from it. Here you need to install the binary package. You should read the instructions on how to install a Cabal package if you haven't done so already.

The binary package has three major parts: the Get monad, the Put monad and a general serialisation for Haskell types. The latter is like the pickle module that you may know from Python—it has its own serialisation format and I won't be covering it any more here. However, if you just need to persist some Haskell data structures, it might be exactly what you want: the documentation is here

The Get monad

The Get monad is a state monad; it keeps some state and each action updates that state. The state in this case is an offset into the bytestring which is getting parsed. Get parses lazy bytestrings; this is how packages like tar can parse files several gigabytes long in constant memory: they are using a pipeline of lazy bytestrings. However, this also has a downside. When parsing a lazy bytestring a parse failure (such as running off the end of the bytestring) is signified by an exception. Exceptions can only be caught in the IO monad and, because of laziness, might not be thrown exactly where you expect. If this is a problem, you probably want a strict version of Get, which is covered below.

Here's an example of using the Get monad:

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- BL.getContents
  print $ runGet deserialiseHeader input

This code takes three big-endian, 32-bit unsigned numbers from the input string and returns them as a tuple. Let's try running it:

% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)

Makes sense, right? Look what happens if the input is too short:

% runhaskell /tmp/example.hs << EOF
tooshort
EOF
(1953460083,1752134260,example.hs: too few bytes. Failed reading at byte position 12

Here an exception was thrown because we ran out of bytes.

So the Get monad consists of a set of operations like getWord32be which walk over the input and return some type of data. You can see the full list of those functions in the documentation.

Here's another example; decoding an EOF-terminated list of numbers just involves recursion:

listOfWord16 = do
  empty <- isEmpty
  if empty
     then return []
     else do v <- getWord64be
             rest <- listOfWord16
             return (v : rest)

Strict Get monad

If you're parsing small messages then, firstly your input isn't going to be a lazy bytestring but a strict one. That's not reallly a problem because you can easilly convert between them. However, if you want to handle parse failures you either have to write your parser very carefully, or you have to deal with the fact that you can only catch exceptions in the IO monad.

If this is your dilemma, then you need a strict version of the Get monad. It's almost exactly the same, but a parser of type Get a results in (Either String a, ByteString) as the result of runGet. That type is a tuple where the first value is either a string (an error string from the parse) or the result, and the second value is the remaining bytestring when the parser finished.

Let's update the first example with this strict version of Get. You'll have to install the binary-strict package for it to work.

import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word

deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen <- getWord32be
  plen <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

main :: IO ()
main = do
  input <- B.getContents
  print $ runGet deserialiseHeader input

Note that all we've done is change from lazy bytestrings to strict bytestrings and change to importing Data.Binary.Strict.Get. Now we'll run it again:

% runhaskell /tmp/example.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"\n")

Now we can see that the parser was successful (we got a Right) and we can see that our shell actually added an extra newline on the input (correctly) and the parser didn't consume that, so it's also returned to us. Now we try it with a truncated input:

% runhaskell /tmp/example.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","\n")

This time we didn't get an exception, but a Left value, which can be handled in pure code. The remaining bytestring is the same because our truncated input is 9 bytes long, parsing the first two Word32's consumed 8 bytes and parsing the third failed—at which point we had the last byte still in the input.

In your parser, you can also call fail, with an error string, which will result in a Left value.

That's it; it's otherwise the same as the Get monad.

Incremental parsing

If you have to deal with a protocol which isn't length prefixed, or otherwise chunkable, from the network then you are faced with the problem of knowing when you have enough data to parse something semantically useful. You could run a strict Get over what you have and catch the truncation result, but that means that you're parsing the data multiple times etc.

Instead, you can use an incremental parser. There's an incremental version of the Get monad in Data.Binary.Strict.IncrementalGet (you'll need the binary-strict package).

You use it as normal, but rather than returning an Either value, you get a Result. You need to go follow that link and look at the documentation for Result.

It reflects the three outcomes of parsing possibly truncated data. Either the data is invalid as is, or it's complete, or it's truncated. In the truncated case you are given a function (called a continuation), to which you can pass more data, when you get it, and continue the parse. The continuation, again, returns a Result depending on the result of parsing the additional data as well.

Bit twiddling

Even with all this monadic goodness, sometimes you just need to move some bits around. That's perfectly possible in Haskell too. Just import Data.Bits and use the following table.

NameC operatorHaskell
AND&.&.
OR|.|.
XOR^`xor`
NOT~`complement`
Left shift<<`shiftL`
Right shift>>`shiftR`

The BitGet monad

As an alternative to bit twiddling, you can also use the BitGet monad. This is another state-like monad, like Get, but here the state includes the current bit-offset in the input. This means that you can easily pull out unaligned data. Sadly, haddock is currently breaking when trying to generate the documentation for BitGet so I'll start with an example. Again, you'll need the binary-strict package installed.

Here's a description of the header of a DNS packet, direct from RFC 1035:

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

The actual fields don't matter, but here's a function for parsing it:

parseHeader :: G.Get Header
parseHeader = do
  id <- G.getWord16be
  flags <- G.getByteString 2
  qdcount <- G.getWord16be >>= return . fromIntegral
  ancount <- G.getWord16be >>= return . fromIntegral
  nscount <- G.getWord16be >>= return . fromIntegral
  arcount <- G.getWord16be >>= return . fromIntegral

  let r = BG.runBitGet flags (do
    isquery <- BG.getBit
    opcode <- BG.getAsWord8 4 >>= parseEnum
    aa <- BG.getBit
    tc <- BG.getBit
    rd <- BG.getBit
    ra <- BG.getBit

    BG.getAsWord8 3
    rcode <- BG.getAsWord8 4 >>= parseEnum

    return $ Header id isquery opcode aa tc rd ra rcode qdcount ancount nscount arcount)

  case r of
    Left error -> fail error
    Right x -> return x

Here you can see that only the second line (from the ASCII-art diagram) is parsed using BitGet. An outer Get monad is used for everything else and the bit fields are pulled out with getByteString. Again, BitGet is a strict monad and returns an Either, but it doesn't return the remaining bytestring, just because there's no obvious way to represent a bytestring of a fractional number of bytes.

You can see the list of BitGet functions and their comments in the source code.

Binary generation

In contrast to parsing binary data, you might want to generate it. This is the job of the Put monad. Follow along with the documentation if you like.

The Put monad is another state-like monad, but the state is an offset into a series of buffers where the generated data is placed. All the buffer creation and handling is done for you, so you can just forget about it. It results in a lazy bytestring (so you can generate outputs that are larger than memory).

Here's the reverse of our simple Get example:

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put

serialiseSomething :: Put
serialiseSomething = do
  putWord32be 1
  putWord16be 2
  putWord8 3

main :: IO ()
main = BL.putStr $ runPut serialiseSomething

And running it shows that it's generating the correct serialisation:

% runhaskell /tmp/example.hs| hexdump -C
00000000  00 00 00 01 00 02 03                              |.......|

If you want the output of runPut to be a strict bytestring, you just need to convert it with B.concat $ BL.toChunks $ runPut xyz.

One limitation of Put, due to the nature of the Builder monad which it works with, is that you can't get the current offset into the output. This can be an issue with some formats which require you to encode byte offsets into the file. You have to calculate these byte offsets yourself.

Other useful packages

There are other packages which you should know about, but which are mostly covered by their documentation:

  • network-bytestring: for reading and writing bytestring from the network
  • zlib and bzlib: for compressed formats
  • encoding: for dealing with character encodings
  • tar: as an example of lazy parsing/serialisation