FPS again

Bulat Ziganshin bulat.ziganshin at gmail.com
Sat Jul 15 13:57:45 EDT 2006


Hello Duncan,

Saturday, July 15, 2006, 8:04:26 PM, you wrote:
>> can you test that this implementation
>>   lines = split 0x0a
>> is as fast as existing (long) ones both for Lazy and Strict ByteString?

> It might actually be the other way around, that the split implementation
> could benefit from the work that went into the optimisation of the lines
> function. I spent quite some time trying to optimise the lines
> implementation, at least for the Lazy module. To get better performance
> it relies on the assumption that many lines fit into a chunk. That may
> not be true for uses of split in general. It's worth investigating.

well, you know this problem much more deeply than i do, so i'll shut up :)

although i can say that strict ByteString should benefit from your
implementation too (both for lines and split, for obvious reasons)
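
one detail to keep in mind when testing: 'split 0x0a' is not exactly
'lines', because split keeps an empty final piece when the input ends
with a newline. a minimal sketch of a split-based lines for the strict
module, assuming the current Data.ByteString API (the name
linesViaSplit is made up for this example):

```haskell
import qualified Data.ByteString as S

-- hypothetical helper, not part of the library:
-- lines defined through split, dropping the empty piece that
-- split leaves after a trailing '\n'
linesViaSplit :: S.ByteString -> [S.ByteString]
linesViaSplit bs = case S.split 0x0a bs of
    [] -> []
    ps | S.null (last ps) -> init ps   -- input ended with '\n'
       | otherwise        -> ps
```

this only shows the semantics, it says nothing about which version is
faster.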

imho, Lazy.split should just use (map P.split) and then join the pieces
that were split between adjacent blocks
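
a rough sketch of that idea, assuming the current Data.ByteString and
Data.ByteString.Lazy APIs (toChunks/fromChunks). the name splitLazy is
made up here, and this is not how the library implements Lazy.split:

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8)

-- hypothetical helper: split every strict chunk on its own, then glue
-- the last piece of one chunk to the first piece of the next chunk
splitLazy :: Word8 -> L.ByteString -> [L.ByteString]
splitLazy w = map L.fromChunks . go [] . filter (not . S.null) . L.toChunks
  where
    -- acc holds, in reverse order, the parts of the piece still open
    go :: [S.ByteString] -> [S.ByteString] -> [[S.ByteString]]
    go acc [] = if null acc then [] else [reverse acc]
    go acc (c:cs) = case S.split w c of
        []     -> go acc cs            -- cannot happen, empty chunks are filtered
        [p]    -> go (p : acc) cs      -- no delimiter in this chunk
        (p:ps) -> reverse (p : acc)    -- closes the piece started earlier
                  : map (: []) (init ps)
                  ++ go [last ps] cs   -- last piece may continue in the next chunk
```

pieces that lie entirely inside one chunk are returned without copying;
only pieces that cross a chunk boundary consist of more than one chunk.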

> Btw, you can run the benchmarks too, they are included in the fps repo.

>> also, is not it faster to use the following implementation:
>>   isSpaceWord8 = (spacesFlagsArray!)?

> Benchmark it and tell us which is faster.

can my laziness be enough justification? :)
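
for anyone less lazy, here is a sketch of the table-based version so
the two can be compared. it assumes the array package (UArray); the set
of space bytes below and the name isSpaceByCompare are my assumptions
for this example, not the library's definition:

```haskell
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word (Word8)

-- comparison-based predicate; the choice of bytes is an assumption
-- (ASCII space, \t \n \v \f \r, and Latin-1 non-breaking space)
isSpaceByCompare :: Word8 -> Bool
isSpaceByCompare w = w == 0x20 || (w >= 0x09 && w <= 0x0d) || w == 0xa0

-- 256-entry table, one flag per possible byte value
spacesFlagsArray :: UArray Word8 Bool
spacesFlagsArray =
    listArray (minBound, maxBound) (map isSpaceByCompare [minBound .. maxBound])

-- table-based predicate: a single bounds-checked array lookup
isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 = (spacesFlagsArray !)
```

both functions give the same answers by construction; which one is
faster still has to be measured.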

>> also, i propose to move getLine/getContents/putStr/interact/readFile-type
>> functions into .Char8 modules (both for strict and lazy bytestrings),
>> because these functions are encoding-dependent and work with texts
>> (as opposite to hGet/hPut which works with raw binary data blocks).

> Yes, getLine and putStrLn are encoding dependent (they know the encoding
> of '\n'). getContents, putStr, readfile, interact etc are
> encoding-independent, they're just the same as hGet/hPut, working on
> binary data blocks. Indeed putStr = hPut stdout.

they all work with text files, so they are also encoding-dependent
(translating CR+LF to LF on windows). putStr is the only exception, but
it can be moved along with the others for company :)

this will make a clear distinction between functions using ByteString
as a raw sequence of bytes (hGet/hPut) and functions using ByteString
as a packed String representing text data

>> in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines'
>> but it was impossible because 'lines' function is defined only in
>> Lazy.Char8 module

> Yes, that's the way it should be. And of course there is no need for
> hGetLines in the Lazy module since it is just hGetContents >>= lines
> In my opinion the hGetLines in the other module should be removed too as
> it's just a special case of what the Lazy module does.

that's also possible. but a situation where one ByteString
implementation supports a particular function while another doesn't
is imho not very good. the user should be able to switch between
implementations w/o rewriting his entire program
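
for reference, the definition under discussion could look roughly like
this with the Lazy.Char8 module. written out with types, 'hGetContents
>>= lines' becomes an fmap. the names hGetLines and readLinesFrom are
only illustrative here:

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import System.IO (Handle, IOMode (ReadMode), openFile)

-- hypothetical definition: read the whole handle lazily, cut into lines
hGetLines :: Handle -> IO [LC.ByteString]
hGetLines h = fmap LC.lines (LC.hGetContents h)

-- usage sketch: the lines of a file, read on demand;
-- hGetContents closes the handle once the end of file is reached
readLinesFrom :: FilePath -> IO [LC.ByteString]
readLinesFrom path = openFile path ReadMode >>= hGetLines
```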

btw, you may be interested to know that i implemented mmapBinaryFile
in the Streams lib, based on the code from ByteString. it works on both
Windows and Unix, using the universal mmap API i described in my letter
to David Roundy

-- 
Best regards,
 Bulat                            mailto:Bulat.Ziganshin at gmail.com


