Unicode

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
24 May 2001 22:04:22 GMT


Thu, 24 May 2001 14:41:21 -0700, Ashley Yakeley <ashley@semantic.org> wrote:

>>   - Initial Unicode support - the Char type is now 31 bits.
> 
> It might be appropriate to have two types for Unicode, a UCS2 type
> (16 bits) and a UCS4 type (31 bits).

Actually it's 20.087462841250343 bits, i.e. log2 of the 0x110000
possible values. Unicode 3.1 extends to U+10FFFF, and ISO-10646-1 is
said to shrink to U+10FFFF in the future, so maxBound::Char is
'\x10FFFF' now.
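
The figure above can be reproduced with a few lines of Haskell. This
is only a sketch: the names codeSpaceSize and bitsPerChar are made up
for illustration, and it assumes a compiler where maxBound :: Char is
'\x10FFFF'.

```haskell
-- Size of the code space as seen through Haskell's Char type.
-- Assumption: maxBound :: Char is '\x10FFFF', as stated in this post.

-- Number of values in the range 0 .. maxBound :: Char.
codeSpaceSize :: Int
codeSpaceSize = fromEnum (maxBound :: Char) + 1

-- Bits of information needed to select one value from that range.
bitsPerChar :: Double
bitsPerChar = logBase 2 (fromIntegral codeSpaceSize)
```

Evaluating bitsPerChar in an interpreter should give a value just
above 20, since 0x110000 is 2^20 + 2^16.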

Among the encodings of Unicode as a stream of bytes are UTF-8,
UTF-16 and UTF-32 (with endianness variants). AFAIK the terms UCS2
and UCS4 are obsolete: there is a single code space 0..0x10FFFF and
various ways to serialize characters.
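
To illustrate "one code space, several serializations", here is a
minimal sketch of one such serialization, UTF-8, written by hand. The
function name encodeUtf8 is hypothetical and is not a GHC library
function; the sketch does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-8: 1 to 4 bytes depending on the
-- code point. Sketch only: U+D800..U+DFFF are not rejected.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length marker plus the top bits of the code point.
    lead marker s = fromIntegral (marker .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))
```

The same Char value could equally be written out as UTF-16 or UTF-32;
only the byte layout changes, not the code point.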

GHC is going to support conversion between internal Unicode and
some encodings for external byte streams. Among them there will be
UTF-{8,16,32} (with endianness variants), all treated as streams
of bytes.
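
What "endianness variants, treated as streams of bytes" means can be
sketched for UTF-16 as follows. This is not GHC's conversion code;
the Endian type and the functions utf16Units, unitBytes and
encodeUtf16 are hypothetical names introduced for illustration, and
no byte order mark is written.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- Byte order of the serialized form (hypothetical helper type).
data Endian = BigEndian | LittleEndian

-- Split a character into one or two 16-bit UTF-16 code units.
-- Code points above U+FFFF become a surrogate pair.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- Serialize one code unit as two bytes in the requested byte order.
unitBytes :: Endian -> Word16 -> [Word8]
unitBytes e u = case e of
    BigEndian    -> [hi, lo]
    LittleEndian -> [lo, hi]
  where
    hi = fromIntegral (u `shiftR` 8)
    lo = fromIntegral (u .&. 0xFF)

-- Encode a whole string as a stream of bytes.
encodeUtf16 :: Endian -> String -> [Word8]
encodeUtf16 e = concatMap (concatMap (unitBytes e) . utf16Units)
```

The 16-bit units exist only as an intermediate step here; what leaves
the program is a list of bytes, and the two variants differ only in
the order of each byte pair.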

There is no point in storing characters in UTF-16 internally,
especially in GHC, where characters are boxed objects and Word16 is
represented as a full machine word (32 or 64 bits). UTF-16 will be
supported as an external encoding, alongside ISO-8859-x etc.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK