[Haskell-cafe] file splitter with enumerator package

Sat Jul 23 04:56:53 CEST 2011

Hi Felipe,

Thank you for the very detailed explanation and help. Regarding the first
point, for this particular use case it's fine if the user-specified file
size is extended by the length of a partial line (it's a compact csv file so
if the user breaks a big file into 100mb chunks, each chunk would only ever
be about 100mb + up to 80 bytes, which is fine for the user).

I'm intrigued by the idea of making the bulk copy function with EB.isolate
and EB.iterHandle, but I couldn't find a way to fit these into the larger
context of writing to multiple file handles. I'll keep working on it and see
if I can address the concerns you brought up.

Thanks again!
Eric

On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <
felipe.lessa at gmail.com> wrote:

> There is one problem with your algorithm.  If the user asks for 4 GiB,
> then the program will create files with *at least* 4 GiB.  So the user
> would need to ask for less, maybe 3.9 GiB.  Even so there's some
> danger, because there could be a 0.11 GiB line in the file.
>
> Now, the biggest problem: your code won't run in constant memory.
> 'EB.take' does not lazily return a lazy ByteString.  It strictly
> returns a lazy ByteString [1].  The lazy ByteString is used to avoid
> copying data (as it is basically the same as a linked list of strict
> bytestrings).  So if the user asked for 4 GiB files, this program
> would need at least 4 GiB of memory, probably more due to overheads.
>
> If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
> I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
> package doesn't really buy you anything.  You should just use
> bytestring package's lazy I/O functions.
>
> If you want the guarantee of no leaks that enumerator gives, then you
> have to use another way of constructing your program.  One safe way of
> doing it is something like:
>
>  takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
>  takeNextLine = ...
>
>  -- L.hPut runs in IO, so it is lifted with liftIO (Control.Monad.IO.Class),
>  -- which requires MonadIO m rather than just Monad m.
>  go :: MonadIO m => Handle -> Int64 -> E.Iteratee B.ByteString m (Maybe L.ByteString)
>  go h n = do
>    mline <- takeNextLine
>    case mline of
>      Nothing -> return Nothing
>      Just line
>        | L.length line <= n -> liftIO (L.hPut h line) >> go h (n - L.length line)
>        | otherwise -> return mline
>
> So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
> and returns the leftover data.  The driver code needs to check its
> results.  Case 'Nothing', then the program finishes.  Case 'Just
> line', save line on a new file and call 'go h2 (n - L.length line)'.
> It isn't efficient because lines could be small, resulting in many
> small hPuts (bad).  But it is correct and will never use more than 'n'
> bytes (great).  You could also have some compromise where the user
> says that he'll never have lines longer than 'x' bytes (say, 1 MiB).
> Then you call a bulk copy function for 'n - x' bytes, and then call
> 'go h x'.  I think you can make the bulk copy function with EB.isolate
> and EB.iterHandle.
>
> Cheers, =)
>
> [1]
> http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take
>
> --
> Felipe.
>
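
The compromise described in the quoted message (bulk-copy 'n - x' bytes, then finish the chunk line by line within the remaining budget) can be sketched outside the enumerator package, using only strict-ByteString handle I/O from the bytestring package. This is a rough illustration, not the code from either message: the names bulkCopy, writeChunk, splitFile and nameFor are hypothetical, as is the 64 KiB block size. It assumes no line is longer than 'x' bytes; if that assumption is violated, a chunk may exceed 'n' or a line may be split across two files. It also writes a trailing newline after the last line even if the input lacked one.

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- Copy up to 'n' bytes from 'src' to 'dst' in small blocks, so memory use
-- stays bounded by the block size. Returns the number of bytes copied
-- (less than 'n' only when end of input is reached).
bulkCopy :: Handle -> Handle -> Int -> IO Int
bulkCopy src dst n = loop 0 n
  where
    blockSize = 64 * 1024 -- assumed block size, not taken from the thread
    loop copied remaining
      | remaining <= 0 = return copied
      | otherwise = do
          block <- B.hGetSome src (min blockSize remaining)
          if B.null block
            then return copied
            else do
              B.hPut dst block
              loop (copied + B.length block) (remaining - B.length block)

-- Write one chunk of at most 'n' bytes to 'dst'. 'carry' is the line that
-- did not fit into the previous chunk (empty for the first chunk).
-- Returns Nothing at end of input, or Just the line that did not fit.
writeChunk :: Handle -> Handle -> Int -> Int -> B.ByteString -> IO (Maybe B.ByteString)
writeChunk src dst n x carry = do
  B.hPut dst carry
  -- Bulk phase: everything except the last 'x' bytes of the budget.
  copied <- bulkCopy src dst (max 0 (n - x - B.length carry))
  -- Line phase: finish the current line, then add whole lines while they fit.
  copyLines (n - B.length carry - copied)
  where
    copyLines budget = do
      eof <- hIsEOF src
      if eof
        then return Nothing
        else do
          raw <- B.hGetLine src -- strips the newline, so it is added back
          let line = B.snoc raw '\n'
          if B.length line <= budget
            then B.hPut dst line >> copyLines (budget - B.length line)
            else return (Just line)

-- Split 'input' into files 'nameFor 0', 'nameFor 1', ... and return how
-- many files were written.
splitFile :: FilePath -> (Int -> FilePath) -> Int -> Int -> IO Int
splitFile input nameFor n x =
  withBinaryFile input ReadMode $ \src -> loop src 0 B.empty
  where
    loop src i carry = do
      leftover <- withBinaryFile (nameFor i) WriteMode $ \dst ->
        writeChunk src dst n x carry
      case leftover of
        Nothing -> return (i + 1)
        Just l  -> loop src (i + 1) l
```

At any moment only one block or one line is held in memory, which is the property the quoted message asks for. The bulk phase avoids the many small hPuts that a purely line-by-line copy would issue.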