[Haskell-cafe] file splitter with enumerator package

Yves Parès limestrael at gmail.com
Sun Jul 24 17:28:34 CEST 2011


If you used Data.Enumerator.Text, you might benefit from the "lines"
function:

lines :: Monad m => Enumeratee Text Text m b

But there is something I don't get with that signature:
why isn't it:
lines :: Monad m => Enumeratee Text [Text] m b
??


2011/7/23 Eric Rasmussen <ericrasmussen at gmail.com>

> Hi Felipe,
>
> Thank you for the very detailed explanation and help. Regarding the first
> point, for this particular use case it's fine if the user-specified file
> size is extended by the length of a partial line (it's a compact csv file so
> if the user breaks a big file into 100mb chunks, each chunk would only ever
> be about 100mb + up to 80 bytes, which is fine for the user).
>
> I'm intrigued by the idea of making the bulk copy function with EB.isolate
> and EB.iterHandle, but I couldn't find a way to fit these into the larger
> context of writing to multiple file handles. I'll keep working on it and see
> if I can address the concerns you brought up.
>
> Thanks again!
> Eric
>
> On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <
> felipe.lessa at gmail.com> wrote:
>
>> There is one problem with your algorithm.  If the user asks for 4 GiB,
>> then the program will create files with *at least* 4 GiB.  So the user
>> would need to ask for less, maybe 3.9 GiB.  Even so there's some
>> danger, because there could be a 0.11 GiB line in the file.
>>
>> Now, the biggest problem: your code won't run in constant memory.
>> 'EB.take' does not lazily return a lazy ByteString.  It strictly
>> returns a lazy ByteString [1].  The lazy ByteString is used to avoid
>> copying data (as it is basically the same as a linked list of strict
>> bytestrings).  So if the user asked for 4 GiB files, this program
>> would need at least 4 GiB of memory, probably more due to overheads.
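
To illustrate the point that a lazy ByteString is basically a linked list of strict chunks, here is a small sketch that uses only the bytestring package. The names chunks, joined, roundTrip and totalBytes are made up for the example and do not come from the enumerator package.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as L

-- Three strict chunks, standing in for blocks read from a file.
chunks :: [B.ByteString]
chunks = [B.pack "id,name\n", B.pack "1,foo\n", B.pack "2,bar\n"]

-- L.fromChunks links the existing strict chunks into one lazy ByteString.
joined :: L.ByteString
joined = L.fromChunks chunks

-- L.toChunks exposes the same chunk list again.
roundTrip :: [B.ByteString]
roundTrip = L.toChunks joined

-- The size of the lazy ByteString is the sum of its chunk sizes.
totalBytes :: Int
totalBytes = sum (map B.length roundTrip)
```

So a function that strictly builds such a value has to hold every chunk at once, which is why a 4 GiB request would need about 4 GiB of memory.
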
>>
>> If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
>> I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
>> package doesn't really buy you anything.  You should just use
>> bytestring package's lazy I/O functions.
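
For comparison, here is a rough sketch of what the lazy I/O route could look like, using only the bytestring package. It follows Eric's rule that a piece may run past the requested size until the next newline. splitAtLines and splitFile are hypothetical names, and the "<prefix>.<k>" output naming is just an assumption for the example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- Cut the input into pieces of about 'n' bytes. Each piece is extended
-- through the next newline, so no line is split across two pieces.
splitAtLines :: Int64 -> L.ByteString -> [L.ByteString]
splitAtLines n input
  | L.null input = []
  | otherwise    = piece : splitAtLines n rest
  where
    (bulk, after)    = L.splitAt n input
    (lineEnd, rest') = L.break (== '\n') after
    -- keep the newline, if there is one, with the piece it terminates
    (newline, rest)  = L.splitAt 1 rest'
    piece            = L.concat [bulk, lineEnd, newline]

-- Hypothetical driver: read 'src' lazily and write piece k to "<prefix>.<k>".
splitFile :: Int64 -> FilePath -> FilePath -> IO ()
splitFile n src prefix = do
  input <- L.readFile src
  mapM_ write (zip [1 :: Int ..] (splitAtLines n input))
  where
    write (k, piece) = L.writeFile (prefix ++ "." ++ show k) piece
```

This relies on lazy I/O to stream the file, so it comes with the usual caveats and none of the resource guarantees that enumerator gives.
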
>>
>> If you want the guarantee of no leaks that enumerator gives, then you
>> have to use another way of constructing your program.  One safe way of
>> doing it is something like:
>>
>>  takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
>>  takeNextLine = ...
>>
>>  -- L.hPut runs in IO, so it is lifted with liftIO (Control.Monad.IO.Class).
>>  go :: MonadIO m => Handle -> Int64
>>     -> E.Iteratee B.ByteString m (Maybe L.ByteString)
>>  go h n = do
>>    mline <- takeNextLine
>>    case mline of
>>      Nothing -> return Nothing
>>      Just line
>>        | L.length line <= n -> liftIO (L.hPut h line) >> go h (n - L.length line)
>>        | otherwise -> return mline
>>
>> So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
>> and returns the leftover data.  The driver code needs to check its
>> results.  Case 'Nothing', then the program finishes.  Case 'Just
>> line', save line on a new file and call 'go h2 (n - L.length line)'.
>> It isn't efficient because lines could be small, resulting in many
>> small hPuts (bad).  But it is correct and will never use more than 'n'
>> bytes (great).  You could also have some compromise where the user
>> says that he'll never have lines longer than 'x' bytes (say, 1 MiB).
>> Then you call a bulk copy function for 'n - x' bytes, and then call
>> 'go h x'.  I think you can make the bulk copy function with EB.isolate
>> and EB.iterHandle.
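
The control flow of 'go' and its driver can be modelled in pure code, with a list of lines standing in for the stream and a list of line groups standing in for the output files. This is only a model of the logic described above, not enumerator code; fill and packFiles are hypothetical names.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- Stand-in for 'go': take lines while they fit in a budget of 'n' bytes.
-- Returns the lines taken and the lines left over.
fill :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fill _ [] = ([], [])
fill n (line : rest)
  | len <= n  = let (taken, left) = fill (n - len) rest
                in (line : taken, left)
  | otherwise = ([], line : rest)
  where
    len = L.length line

-- Stand-in for the driver: each inner list is the content of one file.
-- The leftover line opens the next file, even if it alone exceeds 'n'.
packFiles :: Int64 -> [L.ByteString] -> [[L.ByteString]]
packFiles n = filter (not . null) . start []
  where
    -- 'pending' is the leftover line that was written first to this file.
    start pending ls =
      let used          = sum (map L.length pending)
          (taken, left) = fill (n - used) ls
          file          = pending ++ taken
      in case left of
           []            -> [file]
           (line : rest) -> file : start [line] rest
```

A file only goes over 'n' bytes when a single line is longer than 'n', which matches the behaviour described for the driver.
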
>>
>> Cheers, =)
>>
>> [1]
>> http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take
>>
>> --
>> Felipe.
>>