[Haskell-cafe] The question of ByteString

Fri Nov 2 19:13:23 EDT 2007

On Fri, 2007-11-02 at 21:35 +0000, Andrew Coppin wrote:

> Well OK, maybe I was a little vague. Let me be a bit more specific...
> 
> If you do text processing using ByteString rather than String, you get 
> dramatically better performance in time and space. For me, this raises a 
> number of questions:
> 
> 1. Why do I have to type "ByteString" in my code? Why isn't the compiler 
> automatically performing this optimisation for me? (I.e., is there some 
> observable property that is changed? 

Yes, the semantics are different. ByteString is stricter. In some
circumstances you could discover that some list is being used
sufficiently strictly (spine and element strict) that you could do a
representation change to use strict arrays. It is something I have
pondered occasionally and I think that is an interesting avenue for
research.

One approach might be to do a more sophisticated strictness analysis
earlier in the compilation process; one that gives details on strictness
of substructure, ie the tail/element strictness in lists. Then if this
strictness information were available to the rule matching then we might
be able to write rules that change certain functions to work on
optimised data representations.

However this is likely to be quite fragile. I usually think that it's
better to declare the strictness you want up front in one place, and
have that be propagated, rather than doing the reverse of inferring that
something could be stricter from all the use sites. Strictness
annotations on data constructors are a good example of this.

> Currently the answer is yes: the ByteString interface only provides
> "trancated" Unicode characters. But, in principle, that could be
> changed.)

Indeed it could, we could provide a proper Unicode string type.

> 2. ByteString makes text strings faster. But what about other kinds of 
> collections? Can't we do something similar to them that makes them go 
> faster?

There is much less benefit for other collections since the overheads of
generic structures are smaller for other types.

Note that the NDP parallel arrays stuff uses type functions to calculate
optimised data representations for arrays of types.

> As I understand it, ByteString is faster due to several factors. First 
> of all, it's stricter.

Do that's the semantic difference.

> Secondly, it's an "unboxed" structure (so you eliminate layers of
> indirection and there's less GC load). 

Which is the representation optimisation allowed by the semantic change
of making it stricter.

> Third, it's implemented as an array that looks like a linked list.
> Given how ubiquitous lists are in Haskell, "array that looks like a
> linked list" sounds like one seriously useful data type! Yet
> ByteString seems to be the only implementation of this concept - and
> only for lists on unboxed bytes. (Not even unboxed Word16 or anything
> else, *only* Word8.) If I understand this correctly, a ByteString is
> actually a linked list of large array chunks. (This presumably yields
> fastER random access than a plain linked list?) Also, it seems to be
> possible to create a new array which is merely a subrange of an
> existing one, without any copying; the standard array API doesn't seem
> to provide this, yet it sounds damn useful.

I think the NDP project should get us most of this stuff actually.

Duncan