FPS again

Sat Jul 15 14:20:08 EDT 2006

On Sat, 2006-07-15 at 21:57 +0400, Bulat Ziganshin wrote:
> Hello Duncan,
> 
> Saturday, July 15, 2006, 8:04:26 PM, you wrote:
> >> can you test that this implementation
> >>   lines = split 0x0a
> >> is as fast as existing (long) ones both for Lazy and Strict ByteString?
> 
> > It might actually be the other way around, that the split implementation
> > could benefit from the work that went into the optimisation of the lines
> > function. I spent quite some time trying to optimise the lines
> > implementation, at least for the Lazy module. To get better performance
> > it relies on the assumption that many lines fit into a chunk. That may
> > not be true for uses of split in general. It's worth investigating.
> 
> well, you know this problem much deeper than me. so i'm shutting up :)
> 
> although i can say that strict ByteString should benefit from your
> implementation too (both for lines and split, for obvious reasons)
> 
> imho, Lazy.split should just use (map P.split) and then join lines
> that was split between adjacent blocks

That's what I did first. Keeping track of re-joining bits between
adjacent blocks adds quite a bit of bookkeeping overhead.
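
To illustrate the bookkeeping, here is a minimal sketch of the "split each chunk, then re-join across boundaries" approach. It models a lazy ByteString as a plain list of strict chunks. The name splitChunks is hypothetical and this is not the code in the fps repo, only an assumption-laden illustration of where the overhead comes from.

```haskell
import qualified Data.ByteString as S
import Data.Word (Word8)

-- Hypothetical helper: split a list of strict chunks on a separator byte,
-- re-joining the pieces that straddle a chunk boundary.
splitChunks :: Word8 -> [S.ByteString] -> [S.ByteString]
splitChunks w chunks =
  case filter (not . S.null) chunks of
    [] -> []                 -- matches S.split on empty input
    cs -> go [] cs
  where
    -- 'acc' holds, in reverse order, the fragments of the piece that is
    -- still open, i.e. has not yet been closed by a separator.
    go acc [] = [S.concat (reverse acc)]
    go acc (c : cs) =
      case S.split w c of
        []       -> go acc cs          -- empty chunk, nothing to add
        [p]      -> go (p : acc) cs    -- no separator: piece stays open
        (p : ps) ->
          -- the first fragment closes the open piece, the last one opens
          -- a new piece, and everything in between is already complete
          S.concat (reverse (p : acc)) : init ps ++ go [last ps] cs
```

The accumulator and the extra S.concat on every boundary are the overhead in question. For any list of chunks the result should agree with S.split applied to the concatenation of those chunks.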

> > Btw, you can run the benchmarks too, they are included in the fps repo.
> 
> >> also, is not it faster to use the following implementation:
> >>   isSpaceWord8 = (spacesFlagsArray!)?
> 
> > Benchmark it and tell us which is faster.
> 
> can my laziness be enough justification? :)
> 
> >> also, i propose to move getLine/getContents/putStr/interact/readFile-type
> >> functions into .Char8 modules (both for strict and lazy bytestrings),
> >> because these functions are encoding-dependent and work with texts
> >> (as opposed to hGet/hPut, which work with raw binary data blocks).
> 
> > Yes, getLine and putStrLn are encoding dependent (they know the encoding
> > of '\n'). getContents, putStr, readFile, interact etc are
> > encoding-independent, they're just the same as hGet/hPut, working on
> > binary data blocks. Indeed putStr = hPut stdout.
> 
> they all work with text files, so they are also encoding-dependent
> (translating CR+LF to LF on windows). putStr is only exception, but
> it can be moved for company :)

Ok fair enough, they should be using openBinaryFile then rather than
openFile.
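
A rough sketch of what that looks like, assuming a readFile-style function built on hGet. The name readFileBinary is made up for illustration and this is not the library's actual definition. openBinaryFile, hFileSize and hClose are the standard System.IO functions.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as S
import System.IO (IOMode (ReadMode), hClose, hFileSize, openBinaryFile)

-- Hypothetical readFile variant: open the handle in binary mode so that
-- no CR+LF translation happens on Windows, then read the whole file
-- with a single hGet.
readFileBinary :: FilePath -> IO S.ByteString
readFileBinary path =
  bracket (openBinaryFile path ReadMode) hClose $ \h -> do
    size <- hFileSize h               -- file size in bytes, as an Integer
    S.hGet h (fromIntegral size)      -- raw bytes, no newline translation
```

With the handle opened this way the function is just hGet on a binary data block, so the bytes 0x0d 0x0a come back unchanged on every platform.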

> this will make clear distinction between functions using ByteString as
> raw sequence of bytes (hGet/hPut) and functions using ByteString as
> packed String representing text data

There really is no difference with hGet/hPut. readFile/writeFile etc are
implemented using hGet/hPut.

> >> in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines'
> >> but it was impossible because 'lines' function is defined only in
> >> Lazy.Char8 module
> 
> > Yes, that's the way it should be. And of course there is no need for
> > hGetLines in the Lazy module since it is just hGetContents >>= lines
> > In my opinion the hGetLines in the other module should be removed too as
> > it's just a special case of what the Lazy module does.
> 
> it's also possible. but the situation when one ByteString
> implementation supports a particular function while another doesn't
> imho is not very good. user should be able to switch between
> implementations w/o rewriting his entire program

Yeah, I think we should eliminate hGetLines partly for that reason.
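
For reference, a sketch of the one-liner that makes a dedicated hGetLines redundant on the lazy side. The name hGetLinesLazy is hypothetical. The "hGetContents >>= lines" above is shorthand, since lines is pure and so it is mapped over the result rather than bound.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import System.IO (Handle)

-- Hypothetical helper: lazily read everything from the handle and split
-- it into lines. Because hGetContents is lazy, the lines are produced
-- on demand as chunks are read.
hGetLinesLazy :: Handle -> IO [L.ByteString]
hGetLinesLazy h = fmap L.lines (L.hGetContents h)
```

Since lines lives in the Lazy.Char8 module, the helper has to import that module rather than the plain Lazy one.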

> btw, you may be interested to know that i implemented in Streams lib
> mmapBinaryFile, based on the code from ByteString. it works both on
> Windows and Unix, using universal mmap API i described in letter to

Sounds good. If we can get a universal mmap API into the base lib then
we can add mmapFile back into the ByteString module (it's currently got
a commented-out posix version).

Duncan