Data.ByteString candidate 3

Tue Apr 25 21:48:52 EDT 2006

On 25.04 17:26, John Meacham wrote:
> On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
> > Using the Word8 API is not very pleasant, because all
> > character constants etc are not Word8.
> 
> yeah, but using the version restricted to latin1 seems rather special
> case, I can't imagine (or certainly hope) it won't be used in general
> internally unless people are already doing low level stuff. In this day
> and age, I expect unicode to work pretty much everywhere.

Like in protocols where some segments may be compressed binary data?
And they use ascii character based matching to distinguish header
fields, which may have text data that is actually Utf8?
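
To make that case concrete, here is a minimal sketch assuming the Data.ByteString and Data.ByteString.Char8 modules as they exist in the present-day bytestring package (the candidate under discussion may differ). The wire format, `message`, and `parseHeader` are hypothetical and only serve as illustration: the header name is matched with ascii characters, the header value happens to hold UTF-8 bytes, and the body is opaque binary.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- Hypothetical message: "Subject: " ++ UTF-8 bytes for e-acute ++ "\n" ++ binary body.
message :: B.ByteString
message = B.concat
  [ C.pack "Subject: "
  , B.pack [0xC3, 0xA9]        -- header value, UTF-8 encoded
  , C.pack "\n"
  , B.pack [0x00, 0xFF, 0x10]  -- body, arbitrary bytes
  ]

-- Split off the first line, then split that line at the first ':'.
-- Only ascii characters are inspected; all other bytes pass through untouched.
parseHeader :: B.ByteString -> (B.ByteString, B.ByteString, B.ByteString)
parseHeader msg =
  let (line, rest)  = C.break (== '\n') msg
      (name, value) = C.break (== ':') line
  in ( name
     , C.dropWhile (== ' ') (B.drop 1 value)  -- drop the ':' and leading spaces
     , B.drop 1 rest                          -- drop the '\n'
     )
```

Nothing here decodes the value or the body, so neither Latin1 nor UTF-8 semantics are imposed on bytes that are not known to be text.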

> I am not saying we should kill the latin1 version, since there is
> interest in it, just that it doesn't fill the need for a general fast
> string replacement.

It mostly fills the "I want to use the Word8 module with a nicer API" niche.
But most of the time the data may not actually be Latin1. If we implement a
Latin1 module then we should implement it properly. Also, if we implement
Latin1 there is a case for implementing Latin2-5 as well.

Of course the people really arguing for this module are not interested in
a proper Latin1 implementation but just want the agnostic ascii superset.

I think the wishes on the libraries list have been mainly:
* UTF8
* Word8 interface
* "Ascii superset"

The easiest way seems to be to have three modules, one for each. Then we get
to the naming part.

I would like:
* Data.ByteString.Word8
* Data.ByteString.Char8
* Data.ByteString.UTF

Then pick a favorite and make Data.ByteString re-export that one.
I think that could be either the Word8 or the UTF one.

> I don't see why. ascii is a subset of utf8, the routines building a
> packedstring from an ascii string or a utf8 string can be identical, if
> you know your string is ascii to begin with you can use an optimized
> routine but the end result is the same as if you used the general utf8
> version.

Actually toUpper works differently on "ascii + something in the high bytes"
than it does on ISO-8859-1. The same goes for all the isXXX predicates,
though fortunately that is not a problem for things like whitespace.
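
A small sketch of that difference, using only Data.Char from base. The helpers `asciiToUpper` and `asciiIsAlpha` are hypothetical names for what an ascii-superset module might do; they are not taken from any existing library.

```haskell
import Data.Char (toUpper, isAlpha, chr, ord)

-- Hypothetical ascii-only upper-casing: anything outside 'a'..'z' is left alone.
asciiToUpper :: Char -> Char
asciiToUpper c
  | c >= 'a' && c <= 'z' = chr (ord c - 32)
  | otherwise            = c

-- Hypothetical ascii-only letter test.
asciiIsAlpha :: Char -> Bool
asciiIsAlpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

-- Byte 0xE9 is e-acute when read as ISO-8859-1.
highByte :: Char
highByte = '\xE9'

-- toUpper from Data.Char follows the Unicode tables, so the byte changes;
-- the ascii-only version treats it as opaque and leaves it as it was.
unicodeResult, asciiResult :: Char
unicodeResult = toUpper highByte       -- '\xC9'
asciiResult   = asciiToUpper highByte  -- '\xE9'
```

If the high byte is really part of a UTF-8 sequence or of compressed data, only the ascii-only behaviour leaves it intact.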

> the proper thing for PackedString is to make it behave exactly as the
> String instances behave, since it is supposed to be a drop in
> replacement. Which means the natural ordering based on the Char order
> and the toLower and toUpper from the libraries.

toUpper and toLower from the standard are the correct versions,
and they use the Unicode tables. The natural ordering by
codepoint without any normalization is not very useful for
text handling, but works for e.g. putting strings in a Map.
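
As an illustration of both halves of that statement, a sketch using Data.List and Data.Map (from the containers package). The word list is made up for the example.

```haskell
import Data.List (sort)
import qualified Data.Map as M

-- Made-up sample words; "\xE9t\xE9" is "ete" with e-acute at both ends.
names :: [String]
names = ["zebra", "Zoo", "apple", "\xE9t\xE9"]

-- Ord on String compares codepoint by codepoint, so every uppercase ascii
-- letter sorts before every lowercase one, and accented letters sort last.
sortedNames :: [String]
sortedNames = sort names

-- The same ordering is perfectly adequate for Map keys, where all that is
-- needed is some consistent total order.
index :: M.Map String Int
index = M.fromList (zip names [1 ..])
```

Here `sortedNames` comes out as "Zoo", "apple", "zebra", then the accented word, which is not what a reader of sorted text would expect, while `M.lookup "Zoo" index` still works fine.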

> unicode collation, graphemes, normalization, and localized sorting can be
> provided as separate routines as another project (it would be nice to
> have them work on both Strings and PackedStrings, so perhaps they could
> be in a class?)

These are quite essential for really working with Unicode characters.
It didn't matter much before, as Haskell didn't provide good ways
to do IO with Unicode chars, but now they become very important:
without them it is hard to do many useful things with the parsed
Unicode characters.

How are we supposed to process user input without normalization
e.g. if we need to compare Strings for equivalence?
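
A minimal sketch of the problem, using nothing beyond the Prelude. The two strings below render identically for the user, but one uses the precomposed character U+00E9 and the other uses 'e' followed by the combining acute accent U+0301.

```haskell
-- Two encodings of the same user-visible text, e-acute.
precomposed, decomposed :: String
precomposed = "\x00E9"   -- single codepoint U+00E9
decomposed  = "e\x0301"  -- 'e' followed by combining acute U+0301

-- Plain String equality compares codepoints, so these differ.
sameText :: Bool
sameText = precomposed == decomposed

-- Even the lengths disagree: one codepoint versus two.
lengths :: (Int, Int)
lengths = (length precomposed, length decomposed)
```

Without a normalization step applied to both sides first, `sameText` is False, so user input typed on one system may fail to match the "same" string stored by another.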

But a simple UTF8 layer, with more features added later, is a good way forward.

- Einar Karttunen