[Haskell-cafe] bytestring vs. uvector

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Sat Mar 14 12:02:06 EDT 2009


On Mon, 2009-03-09 at 18:29 -0700, Alexander Dunlap wrote:

> Thanks for all of the responses!
> 
> So let me see if my summary is accurate here:
> 
> - ByteString is for just that: strings of bytes, generally read off of
> a disk. The Char8 version just interprets the Word8s as Chars but
> doesn't do anything special with that.

Right. So it's only suitable for binary or ASCII (or mixed) formats.
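
To illustrate why, here is a minimal sketch, assuming the bytestring
package as shipped with GHC. The names asciiRoundTrip and truncatedBytes
are made up for the example. It shows that Char8 keeps only the low 8
bits of each Char, so anything outside the 0-255 range is silently
mangled:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- ASCII text survives a round trip through Char8 unchanged.
asciiRoundTrip :: String
asciiRoundTrip = BC.unpack (BC.pack "hello")

-- U+03BB (Greek small lambda, code point 955) does not fit in one byte.
-- Char8.pack keeps only the low 8 bits: 955 `mod` 256 == 187 (0xBB).
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack "\955")

main :: IO ()
main = do
  putStrLn asciiRoundTrip   -- hello
  print truncatedBytes      -- [187]
```

The lambda is gone after packing; only a single unrelated byte remains.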

> - Data.Text/text library is a higher-level library that deals with
> "text," abstracting over Unicode details and treating each element as
> a potentially-multibyte "character."

If you're writing about this on the wiki for people, it's best not to
confuse the issue by talking about multibyte anything. We do not
describe [Char] as a multibyte encoding of Unicode. We say it is a
Unicode string. The abstraction is at the level of Unicode code points.
The String type *is* a sequence of Unicode code points.

This is exactly the same for Data.Text. It is a sequence of Unicode code
points. It is not an encoding. It is not UTF-anything. It does not
abstract over Unicode.

The Text type is just like the String type but with different strictness
and performance characteristics. Both are just sequences of Unicode code
points.

There is a reasonably close correspondence between Unicode code points
and what people normally think of as characters.
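
A small sketch of that equivalence, assuming the text package's
Data.Text module (the names sample, sampleText and the two codePoints
lists are hypothetical, chosen for the example). Both types report the
same length and the same code points for the same string:

```haskell
import Data.Char (ord)
import qualified Data.Text as T

-- "naive" with a diaeresis: the third character is U+00EF (decimal 239).
sample :: String
sample = "na\239ve"

sampleText :: T.Text
sampleText = T.pack sample

-- Code points as seen through String.
codePointsOfString :: [Int]
codePointsOfString = map ord sample

-- Code points as seen through Text.
codePointsOfText :: [Int]
codePointsOfText = map ord (T.unpack sampleText)

main :: IO ()
main = do
  print (length sample, T.length sampleText)        -- (5,5)
  print (codePointsOfString == codePointsOfText)    -- True
```

Neither length says anything about bytes. One caveat I am aware of:
Text cannot hold lone surrogate code points, and T.pack replaces them
with U+FFFD, so the correspondence is exact only for valid scalar
values.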

> - utf8-string is a wrapper over ByteString that interprets the bytes
> in the bytestring as potentially-multibyte Unicode "characters."

This on the other hand is an encoding. ByteString is a sequence of bytes
and when we interpret that as UTF-8 then we are looking at an encoding
of a sequence of Unicode code points.
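
As a sketch of the distinction, the following uses encodeUtf8 and
decodeUtf8 from the text package's Data.Text.Encoding rather than
utf8-string itself; that substitution is my assumption for the sake of
a self-contained example, and the top-level names are hypothetical. The
same five code points become six bytes once encoded:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Five Unicode code points; the third is U+00EF.
sampleText :: T.Text
sampleText = T.pack "na\239ve"

-- The UTF-8 encoding of those code points, as raw bytes.
encoded :: B.ByteString
encoded = TE.encodeUtf8 sampleText

-- U+00EF encodes as the two bytes 0xC3 0xAF (195, 175).
encodedBytes :: [Word8]
encodedBytes = B.unpack encoded

-- Interpreting the bytes as UTF-8 recovers the code points.
-- Note: decodeUtf8 throws an exception on invalid UTF-8 input.
decoded :: T.Text
decoded = TE.decodeUtf8 encoded

main :: IO ()
main = do
  print (T.length sampleText)     -- 5
  print (B.length encoded)        -- 6
  print encodedBytes              -- [110,97,195,175,118,101]
  print (decoded == sampleText)   -- True
```

The ByteString only becomes text once we decide to read it as UTF-8;
read under a different encoding, the same bytes would mean something
else.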

Clear as mud? :-)

Duncan


