ByteString I/O Performance

Wed Sep 5 16:32:29 EDT 2007

On Wed, 2007-09-05 at 21:30 +0200, Peter Simons wrote:

> As far as I can tell, the only reason why a function like
> 'unsafeUseAsCStringLen' has to be dubbed unsafe is because 'index' makes
> it unsafe. The limitation that ByteString has to be immutable is a
> consequence of the choice to provide 'index' as a pure function.

Well, it's not just index: all the functions that get data from the
ByteString, like head/tail/uncons etc., are pure. That is the whole
point of the design of ByteString: to provide pure, immutable,
high-performance strings.

What you want is just fine, but it's a mutable interface, not a pure one.
We cannot provide any operations that mutate an existing ByteString
without breaking the semantics of all the pure operations.

It's very much like the difference between the MArray and IArray
classes, for mutable and immutable arrays. One provides index in a
monad, the other is pure.
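
To make the comparison concrete, here is a minimal sketch using the
standard array package. The names sample, newSample, pureIndex and
readThenOverwrite are made up for illustration; only the array
functions themselves come from the library.

```haskell
import Data.Array.IO (IOUArray)
import Data.Array.MArray (newListArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word (Word8)

-- Hypothetical sample data, used only for illustration.
sample :: UArray Int Word8
sample = listArray (0, 2) [10, 20, 30]

newSample :: IO (IOUArray Int Word8)
newSample = newListArray (0, 2) [10, 20, 30]

-- Immutable array: indexing is a pure function, so the same array and
-- index always give the same answer.
pureIndex :: UArray Int Word8 -> Int -> Word8
pureIndex arr i = arr ! i

-- Mutable array: reading is an action in a monad, and its result
-- depends on which writes have already happened.
readThenOverwrite :: IOUArray Int Word8 -> Int -> Word8 -> IO (Word8, Word8)
readThenOverwrite arr i new = do
  before <- readArray arr i
  writeArray arr i new
  after <- readArray arr i
  return (before, after)
```

Reading the same slot twice in readThenOverwrite can give two different
values, which is exactly what a pure index is not allowed to do.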

> Personally, I won't use 'index' in my code. I'll happily dereference the
> pointer in the IO monad, because I've found that to be no effort
> whatsoever. I love monads. For my purposes, 'unsafeUseAsCStringLen' is a
> perfectly safe function. The efficient variant of 'hGet' I posted can be
> implemented on top of it, so that 'hGet' is by all means a safe function
> in my code. There really is no risk at all, unless one uses 'index' or
> something that's based on it.

Right, or if you were to hand out a ByteString and then change its
contents when nobody is looking, then that's very much unsafe.

So the point is you can break the semantics locally and nobody will
notice. It's not a technique we should encourage, however.
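
As a warning rather than a recommendation, here is a small sketch of
what goes wrong once someone else holds a reference. The helper names
overwriteFirstByte and observeMutation are hypothetical. The sketch
assumes a bytestring version in which take shares the underlying
buffer, and it uses copy to get a fresh writable buffer first, since
writing into a buffer that came from a string literal may not be
possible at all.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)

-- Overwrite the first byte of the buffer behind a ByteString, in place.
-- This deliberately violates immutability.
overwriteFirstByte :: B.ByteString -> Word8 -> IO ()
overwriteFirstByte bs w = unsafeUseAsCStringLen bs $ \(ptr, len) ->
  if len > 0 then pokeByteOff ptr 0 w else return ()

-- Returns (private copy taken before the write, slice sharing the buffer).
observeMutation :: IO (B.ByteString, B.ByteString)
observeMutation = do
  original <- evaluate (B.copy (BC.pack "hello"))  -- fresh, writable buffer
  let shared = B.take 3 original                    -- assumed to share the buffer
  snapshot <- evaluate (B.copy shared)              -- independent copy of "hel"
  overwriteFirstByte original 72                    -- 72 is the byte for 'H'
  return (snapshot, shared)
```

Under those assumptions the snapshot still reads "hel" while the
supposedly immutable slice now reads "Hel", even though nothing in its
type suggests it could ever change.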

> The way I see it, there will be other people who'll find the performance
> limitations of standard 'hGet' a decisive factor in their design
> decisions. Chances are, those people will wonder about using the base
> pointer for hGetBuf and then they'll end up re-inventing the wheel we
> just came up with.

I'd rather not provide a quick easy way to break the semantics.
unsafeUseAsCStringLen and friends are already plenty enough rope...

> Maybe I'll find the time to submit a patch to the documentation, so that
> fine points like an optimal buffer size etc. are explained in more
> detail than they are right now. It would be nice if some kind of result
> would come out of this discussion.

I really don't think we can provide anything that copies into an
existing pre-allocated ByteString. As far as I can see, the best we can
do is to allocate a fresh buffer and do a single copy into that. 
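
A minimal sketch of that approach, assuming createAndTrim from
Data.ByteString.Internal is acceptable to use. The name hGetFresh is
made up here, and this is not meant as the library's own hGet.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Internal (createAndTrim)
import System.IO

-- Read up to n bytes into a freshly allocated ByteString. hGetBuf
-- writes straight into the new buffer; createAndTrim shrinks the
-- result if fewer than n bytes arrived.
hGetFresh :: Handle -> Int -> IO B.ByteString
hGetFresh h n
  | n <= 0    = return B.empty
  | otherwise = createAndTrim n $ \ptr -> hGetBuf h ptr n
```

Nobody else can see the buffer until the read has finished, so the
resulting ByteString is safe to treat as immutable.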

Mutating an existing buffer is fine, and System.IO already provides
hGetBuf. But you have to be very careful if you create a ByteString
based on the contents of that mutable buffer without making a copy
first.
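
For illustration, a sketch of the careful version: one mutable buffer
is reused for every read, and each chunk is copied out with
packCStringLen before the buffer is overwritten again. The name
readChunks is hypothetical, and size is assumed to be positive.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, castPtr)
import System.IO

-- Read a handle to the end in chunks of at most size bytes, reusing a
-- single buffer for all the reads.
readChunks :: Handle -> Int -> IO [B.ByteString]
readChunks h size = allocaBytes size loop
  where
    loop :: Ptr Word8 -> IO [B.ByteString]
    loop buf = do
      n <- hGetBuf h buf size
      if n <= 0
        then return []
        else do
          -- packCStringLen copies the n bytes, so the chunk stays
          -- valid after the next hGetBuf overwrites buf.
          chunk <- B.packCStringLen (castPtr buf, n)
          rest <- loop buf
          return (chunk : rest)
```

Skipping the copy and wrapping buf directly would make every earlier
chunk change underneath its holder on the next read.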

> Anyway, thank you. I appreciate everyone's efforts in helping me figure
> out why I/O with ByteString is more than two times slower than it could
> be.

Thanks very much for pointing out where we are copying more than
necessary.

As for the last bit of performance difference due to the cache benefits
of reusing a mutable buffer rather than allocating and GCing a range of
buffers, I can't see any way within the existing design to achieve
that.

Bear in mind that these cache benefits are fairly small in real
benchmarks, as opposed to 'cat' on fully cached files. Usually you do
some actual IO and some operation on the data rather than just copying
it from one file descriptor to another.

For example, my lazy bytestring binding to iconv performs exactly the
same as the command line iconv. In that case we are doing a bit of work
on the data, which swamps the cache benefits that the command line iconv
program gets from using mutable buffers.

If we are trying to optimise the 'cat' case however, e.g. for network
servers, there are even lower level things we can do so that no copies
of the data have to be made at all, e.g. mmap or Linux's copyfile or
splice. ByteString certainly isn't the right abstraction for that
though.

Duncan