[Haskell-cafe] PROPOSAL: New efficient Unicode string library.

Wed Sep 26 13:46:35 EDT 2007

In message <1190825044.9435.1.camel at jcchost> Jonathan Cast <jcast at ou.edu> writes:
> On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:

> > If UTF-16 is what's used by everyone else (how about Java? Python?) I
> > think that's a strong reason to use it. I don't know Unicode well
> > enough to say otherwise.
> 
> I disagree.  I realize I'm a dissenter in this regard, but my position
> is: excellent Unix support first, portability second, excellent support
> for Win32/MacOS a distant third.  That seems to be the opposite of every
> language's position.  Unix absolutely needs UTF-8 for backward
> compatibility.

I think you're talking about different things: internal versus external representations.

Certainly we must support UTF-8 as an external representation. The choice of
internal representation is independent of that. It could be [Char] or some
memory-efficient packed format in a standard encoding such as UTF-8, UTF-16 or
UTF-32. The choice depends mostly on ease of implementation and performance.
Some formats are easier or faster to process, but there are also conversion
costs, so in some use cases there is a performance benefit to the internal
representation being the same as the external representation.

So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
has the advantage of being the same as a common external representation, so
conversion is cheap (we only need to validate rather than copy). UTF-8 is more
compact than UTF-16 for western languages (one byte per ASCII character rather
than two) but less compact for eastern languages (typically three bytes per CJK
character rather than two). UTF-8 is a more complex encoding in the common cases
than UTF-16. In the common case, meaning characters within the Basic
Multilingual Plane, UTF-16 is effectively fixed width. According to the ICU
implementors this has speed advantages (probably due to branch prediction and
smaller code size).
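
To make the size trade-off concrete, here is a rough sketch that computes how
many bytes a String would occupy in each encoding. It is only an illustration:
the function names (utf8Len, utf16Units, utf8Size, utf16Size) are made up for
this example and are not part of any proposed library, and it assumes the
String contains only valid Unicode scalar values (no lone surrogates).

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
-- (Hypothetical helper; assumes the Char is not a surrogate.)
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin accents, Greek, Cyrillic
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane, incl. CJK
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- 16-bit code units needed to encode one code point in UTF-16:
-- one inside the BMP, two (a surrogate pair) outside it.
utf16Units :: Char -> Int
utf16Units c
  | ord c < 0x10000 = 1
  | otherwise       = 2

-- Total encoded size of a string, in bytes, for each encoding.
utf8Size :: String -> Int
utf8Size = sum . map utf8Len

utf16Size :: String -> Int
utf16Size = sum . map ((* 2) . utf16Units)
```

With these definitions, "hello" comes out at 5 bytes in UTF-8 against 10 in
UTF-16, while the three CJK characters "\x65E5\x672C\x8A9E" come out at 9 bytes
in UTF-8 against 6 in UTF-16. Note that utf16Units only takes its second branch
for characters outside the BMP, which is the sense in which UTF-16 is
"effectively fixed width" for most text.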

One solution is to do both and benchmark them.

Duncan