Builder +bytestring

data Builder
bytestring Data.ByteString.Builder
Builders denote sequences of bytes. They are Monoids where mempty is the zero-length sequence and mappend is concatenation, which runs in O(1).
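As a minimal sketch of the Monoid interface described above, the following snippet builds a short byte sequence from smaller Builders. The names greeting and rendered are hypothetical and chosen only for illustration; string7, char7, intDec and toLazyByteString are exported by Data.ByteString.Builder.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L8
import Data.ByteString.Builder
  (Builder, toLazyByteString, string7, char7, intDec)

-- Concatenate three small Builders; each mappend is O(1).
greeting :: Builder
greeting = mconcat [string7 "n = ", intDec 42, char7 '\n']

-- mempty is the zero-length sequence, so it contributes no bytes.
rendered :: L8.ByteString
rendered = toLazyByteString (mempty `mappend` greeting `mappend` mempty)
```

No bytes are produced until toLazyByteString runs the Builder, which is why the concatenation itself stays cheap.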
module Data.ByteString.Builder
bytestring Data.ByteString.Builder
>     IntC Int deriving( Eq, Ord, Show )
>
> type Row   = [Cell]
> type Table = [Row]

We use the following imports and abbreviate mappend to simplify reading.

> import qualified Data.ByteString.Lazy               as L
> import           Data.ByteString.Builder
> import           Data.ByteString.Builder.ASCII (intDec)
> import           Data.Monoid
> import           Data.Foldable                 (foldMap)
> import           Data.List                     (intersperse)
>
> infixr 4 <>
> (<>) :: Monoid m => m -> m -> m
> (<>) = mappend

CSV is a character-based representation of tables. For maximal modularity, we could first render Tables as Strings and then encode this String using some Unicode character encoding. However, this sacrifices performance due to the intermediate String representation being built and thrown away right afterwards. We get rid of this intermediate String representation by fixing the character encoding to UTF-8 and using Builders to convert Tables directly to UTF-8 encoded CSV tables represented as lazy ByteStrings.

> encodeUtf8CSV :: Table -> L.ByteString
> encodeUtf8CSV = toLazyByteString . renderTable
>
> renderTable :: Table -> Builder
> renderTable rs = mconcat [renderRow r <> charUtf8 '\n' | r <- rs]
>
> renderRow :: Row -> Builder
> renderRow []     = mempty
> renderRow (c:cs) =
>     renderCell c <> mconcat [ charUtf8 ',' <> renderCell c' | c' <- cs ]
>
> renderCell :: Cell -> Builder
> renderCell (StringC cs) = renderString cs
> renderCell (IntC i)     = intDec i
>
> renderString :: String -> Builder
> renderString cs = charUtf8 '"' <> foldMap escape cs <> charUtf8 '"'
>
> escape '\\' = charUtf8 '\\' <> charUtf8 '\\'
> escape '\"' = charUtf8 '\\' <> charUtf8 '\"'
> escape c    = charUtf8 c

Note that the ASCII encoding is a subset of the UTF-8 encoding, which is why we can use the optimized function intDec to encode an Int as a decimal number with UTF-8 encoded digits. Using intDec is more efficient than stringUtf8 . show, as it avoids constructing an intermediate String.
Avoiding this intermediate data structure significantly improves performance because encoding Cells is the core operation for rendering CSV tables. See Data.ByteString.Builder.Prim for further information on how to improve the performance of renderString.

We demonstrate our UTF-8 CSV encoding function on the following table.

> strings :: [String]
> strings = ["hello", "\"1\"", "λ-wörld"]
>
> table :: Table
> table = [map StringC strings, map IntC [-3..3]]

The expression encodeUtf8CSV table results in the following lazy ByteString.

> Chunk "\"hello\",\"\\\"1\\\"\",\"\206\187-w\195\182rld\"\n-3,-2,-1,0,1,2,3\n" Empty

We can clearly see that we are converting to a binary format. The 'λ' and 'ö' characters, which have a Unicode codepoint above 127, are expanded to their corresponding UTF-8 multi-byte representation.

We use the criterion library (http://hackage.haskell.org/package/criterion) to benchmark the efficiency of our encoding function on the following table.

> import Criterion.Main     -- add this import to the ones above
>
> maxiTable :: Table
> maxiTable = take 1000 $ cycle table
>
> main :: IO ()
> main = defaultMain
>   [ bench "encodeUtf8CSV maxiTable (original)" $
>       whnf (L.length . encodeUtf8CSV) maxiTable
>   ]

On a Core2 Duo 2.20GHz on a 32-bit Linux, the above code takes 1ms to generate the 22'500 bytes long lazy ByteString. Looking again at the definitions above, we see that we took care to avoid intermediate data structures, as otherwise we would sacrifice performance. For example, the following (arguably simpler) definition of renderRow is about 20% slower.

> renderRow :: Row -> Builder
> renderRow = mconcat . intersperse (charUtf8 ',') . map renderCell

Similarly, using O(n) concatenations like ++ or the equivalent concat operations on strict and lazy ByteStrings should be avoided. The following definition of renderString is also about 20% slower.
> renderString :: String -> Builder
> renderString cs = stringUtf8 $ "\"" ++ concatMap escape cs ++ "\""
>
> escape '\\' = "\\\\"
> escape '\"' = "\\\""
> escape c    = return c

Apart from removing intermediate data structures, encodings can be optimized further by fine-tuning their execution parameters using the functions in Data.ByteString.Builder.Extra and their "inner loops" using the functions in Data.ByteString.Builder.Prim.
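To illustrate what fine-tuning execution parameters can look like, here is a small sketch that runs a Builder with an explicit allocation strategy from Data.ByteString.Builder.Extra. The names numbers, encodeTuned and sameBytes are hypothetical, and the chosen buffer sizes are an assumption for illustration, not a recommendation from the documentation above.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, toLazyByteString, intDec, charUtf8)
import Data.ByteString.Builder.Extra
  (toLazyByteStringWith, untrimmedStrategy, smallChunkSize, defaultChunkSize)

-- One decimal number per line.
numbers :: Builder
numbers = mconcat [intDec i `mappend` charUtf8 '\n' | i <- [1 .. 1000 :: Int]]

-- Start with a small first buffer, then use default-sized buffers,
-- and do not trim the last chunk. L.empty is the suffix appended
-- after the generated bytes.
encodeTuned :: Builder -> L.ByteString
encodeTuned =
  toLazyByteStringWith (untrimmedStrategy smallChunkSize defaultChunkSize) L.empty

-- The strategy changes how chunks are allocated, not which bytes result.
sameBytes :: Bool
sameBytes = encodeTuned numbers == toLazyByteString numbers
```

Whether a given strategy is actually faster depends on the workload, so it is worth benchmarking each candidate against toLazyByteString before committing to it.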
module Data.ByteString.Lazy.Builder
bytestring Data.ByteString.Lazy.Builder
We decided to rename the Builder modules. Sorry about that. The old names will hang about for at least one release cycle before we deprecate them and then later remove them.
hPutBuilder :: Handle -> Builder -> IO ()
bytestring Data.ByteString.Builder
Output a Builder to a Handle. The Builder is executed directly on the buffer of the Handle. If the buffer is too small (or not present), then it is replaced with a large enough buffer. It is recommended that the Handle is set to binary and BlockBuffering mode. See hSetBinaryMode and hSetBuffering. This function is more efficient than hPut . toLazyByteString because in many cases no buffer allocation has to be done. Moreover, the results of several executions of short Builders are concatenated in the Handle's buffer, therefore avoiding unnecessary buffer flushes.
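A minimal sketch of the recommended setup follows: the Handle is put into binary and BlockBuffering mode before several short Builders are written to it. The names numberLine and writeNumbers are hypothetical helpers introduced only for this example.

```haskell
import Data.ByteString.Builder (Builder, hPutBuilder, intDec, charUtf8)
import System.IO

-- A short Builder: one decimal number followed by a newline.
numberLine :: Int -> Builder
numberLine i = intDec i `mappend` charUtf8 '\n'

-- Write one number per line. Each hPutBuilder call runs directly on
-- the Handle's buffer, so consecutive short outputs share that buffer.
writeNumbers :: FilePath -> [Int] -> IO ()
writeNumbers path xs = withFile path WriteMode $ \h -> do
  hSetBinaryMode h True
  hSetBuffering h (BlockBuffering Nothing)
  mapM_ (hPutBuilder h . numberLine) xs
```

For example, writeNumbers "numbers.txt" [1, 2, 3] writes the three lines 1, 2 and 3 to the file.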
runBuilder :: Builder -> BufferWriter
bytestring Data.ByteString.Builder.Extra
Turn a Builder into its initial BufferWriter action.
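The sketch below shows one way a BufferWriter could be driven by hand, assuming the Next constructors Done, More and Chunk exported by Data.ByteString.Builder.Extra. The helper runWithBuffer and the value sample are hypothetical; the helper reuses a single fixed-size scratch buffer and simply fails if a step asks for more space than that buffer provides.

```haskell
import qualified Data.ByteString.Char8 as S8
import Data.ByteString.Builder (Builder, intDec, charUtf8)
import Data.ByteString.Builder.Extra (BufferWriter, Next (..), runBuilder)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, castPtr)

-- Hypothetical helper: run a Builder through a scratch buffer of the
-- given size and collect the output as a list of strict chunks.
runWithBuffer :: Int -> Builder -> IO [S8.ByteString]
runWithBuffer size b = allocaBytes size $ \buf -> go buf (runBuilder b)
  where
    go :: Ptr Word8 -> BufferWriter -> IO [S8.ByteString]
    go buf writer = do
      (n, next) <- writer buf size
      -- Copy out the bytes written so far, since the buffer is reused.
      filled <- S8.packCStringLen (castPtr buf, n)
      let out = [filled | n > 0]
      case next of
        Done -> return out
        More minSize writer'
          | minSize > size -> ioError (userError "scratch buffer too small")
          | otherwise      -> fmap (out ++) (go buf writer')
        -- A ready-made chunk goes directly after the buffered bytes.
        Chunk bs writer' -> fmap ((out ++ [bs]) ++) (go buf writer')

-- Sample input: the numbers 1 to 50, each followed by a comma.
sample :: Builder
sample = mconcat [intDec i `mappend` charUtf8 ',' | i <- [1 .. 50 :: Int]]
```

Concatenating the chunks returned by runWithBuffer 64 sample should give the same bytes as running the Builder normally; a production driver would grow the buffer instead of failing when More requests more space.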