bytestring-0.10.1.0: Fast, compact, strict and lazy byte strings with a list interface

PortabilityGHC
MaintainerSimon Meier <[email protected]>
Safe HaskellNone

Data.ByteString.Builder.Prim

Contents

Description

This module provides Builder primitives, which are lower level building blocks for constructing Builders. You don't need to go down to this level but it can be slightly faster.

Morally, builder primitives are like functions a -> Builder, that is they take a value and encode it as a sequence of bytes, represented as a Builder. Of course their implementation is a bit more specialised.

Builder primitives come in two forms: fixed-size and bounded-size.

  • Fixed(-size) primitives are builder primitives that always result in a sequence of bytes of a fixed length. That is, the length is independent of the value that is encoded. An example of a fixed size primitive is the big-endian encoding of a Word64, which always results in exactly 8 bytes.
  • Bounded(-size) primitives are builder primitives that always result in a sequence of bytes that is no larger than a predetermined bound. That is, the bound is independent of the value that is encoded but the actual length will depend on the value. An example for a bounded primitive is the UTF-8 encoding of a Char, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.

Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.

Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.

The purpose of using builder primitives is to improve the performance of Builders. These improvements stem from making the two most common steps performed by a Builder more efficient. We explain these two steps in turn.

The first most common step is the concatenation of two Builders. Internally, concatenation corresponds to function composition. (Note that Builders can be seen as difference-lists of buffer-filling functions; cf. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. ) Function composition is a fast O(1) operation. However, we can use bounded primitives to remove some of these function compositions altogether, which is more efficient.

The second most common step performed by a Builder is to fill a buffer using a bounded primitives, which works as follows. The Builder checks whether there is enough space left to execute the bounded primitive. If there is, then the Builder executes the bounded primitive and calls the next Builder with the updated buffer. Otherwise, the Builder signals its driver that it requires a new buffer. This buffer must be at least as large as the bound of the primitive. We can use bounded primitives to reduce the number of buffer-free checks by fusing the buffer-free checks of consecutive Builders. We can also use bounded primitives to simplify the control flow for signalling that a buffer is full by ensuring that we check first that there is enough space left and only then decide on how to encode a given value.

Let us illustrate these improvements on the CSV-table rendering example from Data.ByteString.Builder. Its "hot code" is the rendering of a table's cells, which we implement as follows using only the functions from the Builder API.

import Data.ByteString.Builder as B

renderCell :: Cell -> Builder
renderCell (StringC cs) = renderString cs
renderCell (IntC i)     = B.intDec i

renderString :: String -> Builder
renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"'
  where
    escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\'
    escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"'
    escape c    = B.charUtf8 c

Efficient encoding of Ints as decimal numbers is performed by intDec. Optimization potential exists for the escaping of Strings. The above implementation has two optimization opportunities. First, the buffer-free checks of the Builders for escaping double quotes and backslashes can be fused. Second, the concatenations performed by foldMap can be eliminated. The following implementation exploits these optimizations.

import qualified Data.ByteString.Builder.Prim  as P
import           Data.ByteString.Builder.Prim
                 ( condB, liftFixedToBounded, (>*<), (>$<) )

renderString :: String -> Builder
renderString cs =
    B.charUtf8 '"' <> E.encodeListWithB escape cs <> B.charUtf8 '"'
  where
    escape :: E.BoundedPrim Char
    escape =
      condB (== '\\') (fixed2 ('\\', '\\')) $
      condB (== '\"') (fixed2 ('\\', '\"')) $
      E.charUtf8
     
    {-# INLINE fixed2 #-}
    fixed2 x = liftFixedToBounded $ const x >$< E.char7 >*< E.char7

The code should be mostly self-explanatory. The slightly awkward syntax is because the combinators are written such that the size-bound of the resulting BoundedPrim can be computed at compile time. We also explicitly inline the fixed2 primitive, which encodes a fixed tuple of characters, to ensure that the bound computation happens at compile time. When encoding the following list of Strings, the optimized implementation of renderString is two times faster.

maxiStrings :: [String]
maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]

Most of the performance gain stems from using primMapListBounded, which encodes a list of values from left-to-right with a BoundedPrim. It exploits the Builder internals to avoid unnecessary function compositions (i.e., concatenations). In the future, we might expect the compiler to perform the optimizations implemented in primMapListBounded. However, it seems that the code is currently to complicated for the compiler to see through. Therefore, we provide the BoundedPrim escape hatch, which allows data structures to provide very efficient encoding traversals, like primMapListBounded for lists.

Note that BoundedPrims are a bit verbose, but quite versatile. Here is an example of a BoundedPrim for combined HTML escaping and UTF-8 encoding. It exploits that the escaped character with the maximal Unicode codepoint is '>'.

{-# INLINE charUtf8HtmlEscaped #-}
charUtf8HtmlEscaped :: E.BoundedPrim Char
charUtf8HtmlEscaped =
    condB (>  '>' ) E.charUtf8 $
    condB (== '<' ) (fixed4 ('&',('l',('t',';')))) $        -- &lt;
    condB (== '>' ) (fixed4 ('&',('g',('t',';')))) $        -- &gt;
    condB (== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $  -- &amp;
    condB (== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $  -- &#34;
    condB (== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $  -- &#39;
    (liftFixedToBounded E.char7)         -- fallback for Chars smaller than '>'
  where
    {-# INLINE fixed4 #-}
    fixed4 x = liftFixedToBounded $ const x >$<
      E.char7 >*< E.char7 >*< E.char7 >*< E.char7
     
    {-# INLINE fixed5 #-}
    fixed5 x = liftFixedToBounded $ const x >$<
      E.char7 >*< E.char7 >*< E.char7 >*< E.char7 >*< E.char7

This module currently does not expose functions that require the special properties of fixed-size primitives. They are useful for prefixing Builders with their size or for implementing chunked encodings. We will expose the corresponding functions in future releases of this library.

Synopsis

Bounded-size primitives

data BoundedPrim a Source

A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.

Combinators

The combinators for BoundedPrims are implemented such that the size of the resulting BoundedPrim can be computed at compile time.

emptyB :: BoundedPrim aSource

The BoundedPrim that always results in the zero-length sequence.

(>*<) :: Monoidal f => f a -> f b -> f (a, b)Source

A pairing/concatenation operator for builder primitives, both bounded and fixed size.

For example,

 showF (char7 >*< char7) ('x','y') = "xy"

We can combine multiple primitives using >*< multiple times.

 showEncoding (char7 >*< char7 >*< char7) ('x',('y','z')) = "xyz"

(>$<) :: Contravariant f => (b -> a) -> f a -> f bSource

A fmap-like operator for builder primitives, both bounded and fixed size.

Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)

We can use it for example to prepend and/or append fixed values to an primitive.

showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'"
  where
    fixed3 = char7 >*< char7 >*< char7

Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time.

eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b)Source

Encode an Either value using the first BoundedPrim for Left values and the second BoundedPrim for Right values.

Note that the functions eitherB, pairB, and contramapB (written below using >$<) suffice to construct BoundedPrims for all non-recursive algebraic datatypes. For example,

maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a)
maybeB nothing just = maybe (Left ()) Right >$< eitherB nothing just

condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim aSource

Conditionally select a BoundedPrim. For example, we can implement the ASCII primitive that drops characters with Unicode codepoints above 127 as follows.

charASCIIDrop = condB (< '\128') (fromF char7) emptyB

Builder construction

primBounded :: BoundedPrim a -> a -> BuilderSource

Create a Builder that encodes values with the given BoundedPrim.

We rewrite consecutive uses of primBounded such that the bound-checks are fused. For example,

 primBounded (word32 c1) `mappend` primBounded (word32 c2)

is rewritten such that the resulting Builder checks only once, if ther are at 8 free bytes, instead of checking twice, if there are 4 free bytes. This optimization is not observationally equivalent in a strict sense, as it influences the boundaries of the generated chunks. However, for a user of this library it is observationally equivalent, as chunk boundaries of a lazy ByteString can only be observed through the internal interface. Morevoer, we expect that all primitives write much fewer than 4kb (the default short buffer size). Hence, it is safe to ignore the additional memory spilled due to the more agressive buffer wrapping introduced by this optimization.

primMapListBounded :: BoundedPrim a -> [a] -> BuilderSource

Create a Builder that encodes a list of values consecutively using a BoundedPrim for each element. This function is more efficient than the canonical

 filter p =
  B.toLazyByteString .
  E.encodeLazyByteStringWithF (E.ifF p E.word8) E.emptyF)

 mconcat . map (primBounded w)

or

 foldMap (primBounded w)

because it moves several variables out of the inner loop.

primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> BuilderSource

Create a Builder that encodes a sequence generated from a seed value using a BoundedPrim for each sequence element.

primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> BuilderSource

Create a Builder that encodes each Word8 of a strict ByteString using a BoundedPrim. For example, we can write a Builder that filters a strict ByteString as follows.

 import Data.ByteString.Builder.Primas P (word8, condB, emptyB)
 filterBS p = P.condB p P.word8 P.emptyB

Fixed-size primitives

data FixedPrim a Source

A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size.

Combinators

The combinators for FixedPrims are implemented such that the size of the resulting FixedPrim is computed at compile time.

The '(>*and '($<)' pairing and mapping operators can be used with FixedPrim.

emptyF :: FixedPrim aSource

The FixedPrim that always results in the zero-length sequence.

Builder construction

In terms of expressivity, the function fixedPrim would be sufficient for constructing Builders from FixedPrims. The fused variants of this function are provided because they allow for more efficient implementations. Our compilers are just not smart enough yet; and for some of the employed optimizations (see the code of encodeByteStringWithF) they will very likely never be.

Note that functions marked with "Heavy inlining." are forced to be inlined because they must be specialized for concrete encodings, but are rather heavy in terms of code size. We recommend to define a top-level function for every concrete instantiation of such a function in order to share its code. A typical example is the function byteStringHex from Data.ByteString.Builder.ASCII, which is implemented as follows.

 byteStringHex :: S.ByteString -> Builder
 byteStringHex = encodeByteStringWithF word8HexFixed

primFixed :: FixedPrim a -> a -> BuilderSource

Encode a value with a FixedPrim.

primMapListFixed :: FixedPrim a -> [a] -> BuilderSource

Encode a list of values from left-to-right with a FixedPrim.

primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> BuilderSource

Encode a list of values represented as an unfoldr with a FixedPrim.

primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a strict ByteString from left-to-right with a FixedPrim. This function is quite versatile. For example, we can use it to construct a Builder that maps every byte before copying it to the buffer to be filled.

 mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder
 mapToBuilder f = encodeByteStringWithF (contramapF f word8)

We can also use it to hex-encode a strict ByteString as shown by the byteStringHex example above.

primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a lazy ByteString from left-to-right with a FixedPrim.

Standard encodings of Haskell values

Binary encodings

int8 :: FixedPrim Int8Source

Encoding single signed bytes as-is.

word8 :: FixedPrim Word8Source

Encoding single unsigned bytes as-is.

Big-endian

int16BE :: FixedPrim Int16Source

Encoding Int16s in big endian format.

int32BE :: FixedPrim Int32Source

Encoding Int32s in big endian format.

int64BE :: FixedPrim Int64Source

Encoding Int64s in big endian format.

word16BE :: FixedPrim Word16Source

Encoding Word16s in big endian format.

word32BE :: FixedPrim Word32Source

Encoding Word32s in big endian format.

word64BE :: FixedPrim Word64Source

Encoding Word64s in big endian format.

floatBE :: FixedPrim FloatSource

Encode a Float in big endian format.

doubleBE :: FixedPrim DoubleSource

Encode a Double in big endian format.

Little-endian

int16LE :: FixedPrim Int16Source

Encoding Int16s in little endian format.

int32LE :: FixedPrim Int32Source

Encoding Int32s in little endian format.

int64LE :: FixedPrim Int64Source

Encoding Int64s in little endian format.

word16LE :: FixedPrim Word16Source

Encoding Word16s in little endian format.

word32LE :: FixedPrim Word32Source

Encoding Word32s in little endian format.

word64LE :: FixedPrim Word64Source

Encoding Word64s in little endian format.

floatLE :: FixedPrim FloatSource

Encode a Float in little endian format.

doubleLE :: FixedPrim DoubleSource

Encode a Double in little endian format.

Non-portable, host-dependent

intHost :: FixedPrim IntSource

Encode a single native machine Int. The Ints is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Int is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or integer sized machines, without conversion.

int16Host :: FixedPrim Int16Source

Encoding Int16s in native host order and host endianness.

int32Host :: FixedPrim Int32Source

Encoding Int32s in native host order and host endianness.

int64Host :: FixedPrim Int64Source

Encoding Int64s in native host order and host endianness.

wordHost :: FixedPrim WordSource

Encode a single native machine Word. The Words is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Word is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or word sized machines, without conversion.

word16Host :: FixedPrim Word16Source

Encoding Word16s in native host order and host endianness.

word32Host :: FixedPrim Word32Source

Encoding Word32s in native host order and host endianness.

word64Host :: FixedPrim Word64Source

Encoding Word64s in native host order and host endianness.

floatHost :: FixedPrim FloatSource

Encode a Float in native host order and host endianness. Values written this way are not portable to different endian machines, without conversion.

doubleHost :: FixedPrim DoubleSource

Encode a Double in native host order and host endianness.

Character encodings

ASCII

char7 :: FixedPrim CharSource

Encode the least 7-bits of a Char using the ASCII encoding.

Decimal numbers

Decimal encoding of numbers using ASCII encoded characters.

int8Dec :: BoundedPrim Int8Source

Decimal encoding of an Int8.

int16Dec :: BoundedPrim Int16Source

Decimal encoding of an Int16.

int32Dec :: BoundedPrim Int32Source

Decimal encoding of an Int32.

int64Dec :: BoundedPrim Int64Source

Decimal encoding of an Int64.

intDec :: BoundedPrim IntSource

Decimal encoding of an Int.

word8Dec :: BoundedPrim Word8Source

Decimal encoding of a Word8.

word16Dec :: BoundedPrim Word16Source

Decimal encoding of a Word16.

word32Dec :: BoundedPrim Word32Source

Decimal encoding of a Word32.

word64Dec :: BoundedPrim Word64Source

Decimal encoding of a Word64.

wordDec :: BoundedPrim WordSource

Decimal encoding of a Word.

Hexadecimal numbers

Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,

 showB word16Hex 0x0a10 = "a10"

Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.

word8Hex :: BoundedPrim Word8Source

Hexadecimal encoding of a Word8.

word16Hex :: BoundedPrim Word16Source

Hexadecimal encoding of a Word16.

word32Hex :: BoundedPrim Word32Source

Hexadecimal encoding of a Word32.

word64Hex :: BoundedPrim Word64Source

Hexadecimal encoding of a Word64.

wordHex :: BoundedPrim WordSource

Hexadecimal encoding of a Word.

Fixed-width hexadecimal numbers

Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,

 showF word16HexFixed 0x0a10 = "0a10"

int8HexFixed :: FixedPrim Int8Source

Encode a Int8 using 2 nibbles (hexadecimal digits).

int16HexFixed :: FixedPrim Int16Source

Encode a Int16 using 4 nibbles.

int32HexFixed :: FixedPrim Int32Source

Encode a Int32 using 8 nibbles.

int64HexFixed :: FixedPrim Int64Source

Encode a Int64 using 16 nibbles.

word8HexFixed :: FixedPrim Word8Source

Encode a Word8 using 2 nibbles (hexadecimal digits).

word16HexFixed :: FixedPrim Word16Source

Encode a Word16 using 4 nibbles.

word32HexFixed :: FixedPrim Word32Source

Encode a Word32 using 8 nibbles.

word64HexFixed :: FixedPrim Word64Source

Encode a Word64 using 16 nibbles.

floatHexFixed :: FixedPrim FloatSource

Encode an IEEE Float using 8 nibbles.

doubleHexFixed :: FixedPrim DoubleSource

Encode an IEEE Double using 16 nibbles.

ISO/IEC 8859-1 (Char8)

The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8-bits and encoding them as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the Char8 encoding is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with Char8 encoded characters.

char8 :: FixedPrim CharSource

Char8 encode a Char.

UTF-8

The UTF-8 encoding can encode all Unicode codepoints. It is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with UTF-8 encoded characters.

charUtf8 :: BoundedPrim CharSource

UTF-8 encode a Char.