[Haskell-cafe] Re: String vs ByteString

John Millikin jmillikin at gmail.com
Tue Aug 17 12:19:14 EDT 2010


On Tue, Aug 17, 2010 at 06:12, Michael Snoyman <michael at snoyman.com> wrote:
> I'm not talking about API changes here; the topic at hand is the internal
> representation of the stream of characters used by the text package. That is
> currently UTF-16; I would argue for switching to UTF-8.

The Data.Text.Foreign module is part of the API, and is currently
hardcoded to use UTF-16. Any change of the internal encoding will
require breaking this module's API.

>> > We can't consider a CJK encoding for text,
>>
>> Not as a default, certainly not as the only option. But
>> nice to have as a choice.
>>
> I think you're missing the point at hand: I don't think *anyone* is opposed
> to offering encoders/decoders for the multitude of encoding types out
> there. In fact, I believe the text-icu package already supports every
> encoding type under discussion. The question is the internal representation
> for text, for which a language-specific encoding is *not* a choice, since it
> does not support all Unicode code points.
> Michael

The reason many Japanese and Chinese users reject UTF-8 isn't
space constraints (UTF-8 and UTF-16 are roughly equal there); it's that
they reject Unicode itself. Shift-JIS and the various Chinese
encodings contain Han characters that are missing from Unicode,
either because of Han unification or because they were simply not
considered important enough to include (yet there's a codepage for
Linear B...).
Ruby, which has an enormous Japanese userbase, solved the problem by
essentially defining Text = (Encoding, ByteString) and then
re-implementing the text logic for each encoding. This allows very
efficient operation with every possible encoding, at the cost of
increased complexity (caching decoded characters, multi-byte handling,
etc.).
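The (Encoding, ByteString) design can be sketched in Haskell. This is a toy
model using only base: the Encoding and TaggedText types, the small encoding
set, and charCount are all hypothetical illustrations (with [Word8] standing
in for ByteString), not Ruby's or the text package's actual API. It shows why
even a trivial operation like character counting must be re-implemented per
encoding:

```haskell
import Data.Bits ((.&.))
import Data.Word (Word16, Word8)

-- Hypothetical sketch of the Ruby model: bytes tagged with their encoding.
data Encoding = UTF8 | UTF16LE | ShiftJIS deriving (Show, Eq)

data TaggedText = TaggedText Encoding [Word8] deriving (Show, Eq)

-- Each encoding needs its own multi-byte handling, even for a simple count.
charCount :: TaggedText -> Int
charCount (TaggedText UTF8 bs) =
  -- UTF-8: a character starts at every non-continuation byte (not 10xxxxxx).
  length [ b | b <- bs, b .&. 0xC0 /= 0x80 ]
charCount (TaggedText UTF16LE bs) =
  -- UTF-16LE: count code units, skipping the low surrogate of each pair.
  length [ u | u <- units bs, u < 0xDC00 || u > 0xDFFF ]
  where
    units :: [Word8] -> [Word16]
    units (lo : hi : rest) =
      (fromIntegral hi * 256 + fromIntegral lo) : units rest
    units _ = []
charCount (TaggedText ShiftJIS bs) = go bs
  where
    -- Shift-JIS: lead bytes 0x81-0x9F and 0xE0-0xEF begin 2-byte characters.
    go [] = 0
    go (b : rest)
      | (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF) =
          1 + go (drop 1 rest)
      | otherwise = 1 + go rest
```

For example, the character U+65E5 counts as one character under each tag:
TaggedText UTF8 [0xE6,0x97,0xA5], TaggedText UTF16LE [0xE5,0x65], and
TaggedText ShiftJIS [0x93,0xFA].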
