UniCode

Andrew J Bromage andrew@bromage.org
Fri, 5 Oct 2001 23:23:50 +1000


G'day all.

On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:

> Why Char is 32 bit. UniCode characters is 16 bit.

It's not quite as simple as that.  There is a set of one million
(1,048,576, to be precise) Unicode characters which are only accessible
using surrogate pairs (i.e. two UTF-16 code units).  Hardly any of
these codes are assigned yet, and when they are, they'll be extremely
rare.  So rare, in fact, that having strings take up twice the space
they currently do simply isn't worth the cost.
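
To make that concrete, here is a rough Haskell sketch (the function
name is mine, and it assumes Char can hold the full code-point range)
of how a supplementary code point splits into its two surrogate code
units:

    import Data.Bits (shiftR, (.&.))
    import Data.Char (ord)

    -- Split a supplementary code point (U+10000..U+10FFFF) into
    -- its high and low UTF-16 surrogates.
    toSurrogatePair :: Char -> (Int, Int)
    toSurrogatePair c =
        (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))
      where v = ord c - 0x10000

For example, U+10400 maps to the pair (0xD801, 0xDC00).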

However, you still need to be able to handle them.  I don't know what
the "official" Haskell reasoning is (it may have more to do with word
size than Unicode semantics), but it makes sense to me to store single
characters in UTF-32 but strings in a more compressed format (UTF-8 or
UTF-16).
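
As an illustration of the compressed-string idea, here is a minimal
sketch (the function name is mine, not any existing library's) that
expands a single code point into its UTF-8 byte sequence:

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Char (ord)
    import Data.Word (Word8)

    -- Encode one code point as its UTF-8 byte sequence (1-4 bytes).
    utf8Bytes :: Char -> [Word8]
    utf8Bytes c
      | n < 0x80    = [w n]
      | n < 0x800   = [w (0xC0 .|. shiftR n 6), cont 0]
      | n < 0x10000 = [w (0xE0 .|. shiftR n 12), cont 6, cont 0]
      | otherwise   = [w (0xF0 .|. shiftR n 18), cont 12, cont 6, cont 0]
      where
        n      = ord c
        w      = fromIntegral
        cont s = w (0x80 .|. (shiftR n s .&. 0x3F))

ASCII-heavy text stored this way costs about one byte per character,
which is where most of the space saving comes from.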

See also: http://www.unicode.org/unicode/faq/utf_bom.html

It just goes to show that strings are not merely arrays of characters
as some languages would have you believe.

Cheers,
Andrew Bromage