UniCode

Dylan Thurston dpt@math.harvard.edu
Sat, 6 Oct 2001 01:00:34 +0900


On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
> G'day all.
> 
> On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
> 
> > Why is Char 32 bits?  Unicode characters are 16 bits.
> 
> It's not quite as simple as that.  There is a set of one million
> (more correctly, 1M) Unicode characters which are only accessible
> using surrogate pairs (i.e. two UTF-16 codes).  There are currently 
> none of these codes assigned, and when they are, they'll be extremely
> rare.  So rare, in fact, that having strings take up twice the
> space that they currently do simply isn't worth the cost.

This is no longer true, as of Unicode 3.1.  Almost half of all
characters currently assigned are outside of the BMP (i.e., require
surrogate pairs in the UTF-16 encoding), including many Chinese
characters.  In current usage, these characters probably occur mainly
in names, and are rare, but obviously important for the people
involved.
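
To make "require surrogate pairs" concrete, here is a minimal sketch of
the UTF-16 encoding step.  The name encodeUtf16 is made up for this
example and is not part of any standard library; it also assumes a
32-bit Char and makes no attempt to reject the surrogate code points
themselves (U+D800..U+DFFF).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one character as a list of UTF-16 code units.
-- Hypothetical helper for illustration only.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]   -- inside the BMP: a single code unit
  | otherwise   = [hi, lo]           -- outside the BMP: a surrogate pair
  where
    n  = ord c
    m  = n - 0x10000                 -- 20-bit offset past the BMP
    hi = fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate: top 10 bits
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))    -- low surrogate: bottom 10 bits
```

For instance, U+20000 (the first of the CJK ideographs added in
Unicode 3.1) should come out as the two code units 0xD840 0xDC00,
while 'A' stays a single unit.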

> However, you still need to be able to handle them.  I don't know what
> the "official" Haskell reasoning is (it may have more to do with word
> size than Unicode semantics), but it makes sense to me to store single
> characters in UTF-32 but strings in a more compressed format (UTF-8 or
> UTF-16).

Haskell already stores strings as lists of characters, so I see no
advantage to anything other than UTF-32, since each list element will
take up a full machine word in any case.  (Right?)  There's even plenty
of room for tags if any implementations want to use them.

> See also: http://www.unicode.org/unicode/faq/utf_bom.html
> 
> It just goes to show that strings are not merely arrays of characters
> like some languages would have you believe.

Right.  In Unicode, the concept of a "character" is not really so
useful; most functions that traditionally operate on characters (e.g.,
uppercase or display-width) fundamentally need to operate on strings.
(This is due to properties of particular languages, not any design
flaw of Unicode.)

Err, this raises some questions as to just what the "Char" module
from the standard library is supposed to do.  Most of the functions
are just not well-defined:
  isAscii, isLatin1 - OK
  isControl - I don't know about this.
  isPrint - Dubious.  Is a non-spacing accent a printable character?
  isSpace - OK, by the comment in the report: "The isSpace function
            recognizes only white characters in the Latin-1 range".
  isUpper, isLower - Maybe OK.
  toUpper, toLower - Not OK.  There are cases where upper casing a
     character yields two characters.
etc.  Any program using this library is bound to get confused on
Unicode strings.  Even leaving Unicode aside, there is much
functionality missing; for instance, I don't see any way to compare
strings using a localized order.
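
As a rough illustration of the toUpper problem, here is a sketch of
upper casing at the string level, where one character may expand to
several.  The name upperString is invented for this example, and only
a single special case (German sharp s, U+00DF, which upper-cases to
"SS") is handled; a real library would presumably take the full set of
such mappings from the Unicode data files.

```haskell
import Data.Char (toUpper)

-- | Upper-case a string, allowing a character to expand to several.
-- Hypothetical helper; covers only one special case for illustration.
upperString :: String -> String
upperString = concatMap upperChar
  where
    upperChar '\xDF' = "SS"         -- sharp s: upper case is two characters
    upperChar c      = [toUpper c]  -- everything else: per-character mapping
```

With this, upperString applied to "stra" ++ "\xDF" ++ "e" gives
"STRASSE", one character longer than the input.  No function of type
Char -> Char can express that.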

Is anyone working on honest support for Unicode, in the form of a real
Unicode library with an interface at the correct level?

Best,
	Dylan Thurston