Unicode

Fergus Henderson fjh@cs.mu.oz.au
Sat, 26 May 2001 03:17:40 +1000


On 24-May-2001, Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> wrote:
> Thu, 24 May 2001 14:41:21 -0700, Ashley Yakeley <ashley@semantic.org> writes:
> 
> >>   - Initial Unicode support - the Char type is now 31 bits.
> > 
> > It might be appropriate to have two types for Unicode, a UCS2 type
> > (16 bits) and a UCS4 type (31 bits).
> 
> Actually it's 20.087462841250343 bits. Unicode 3.1 extends to U+10FFFF,
> and ISO 10646-1 is said to shrink to U+10FFFF in the future, so
> maxBound::Char is '\x10FFFF' now.
> 
> Among the encodings of Unicode as a stream of bytes there are UTF-8,
> UTF-16 and UTF-32 (with endianness variants). AFAIK the terms UCS-2
> and UCS-4 are obsolete: there is a single code space 0..0x10FFFF and
> various ways to serialize characters.
> 
> GHC is going to support conversion between its internal Unicode
> representation and various encodings for external byte streams.
> Among them will be UTF-{8,16,32} (with endianness variants), all
> treated as streams of bytes.
> 
> There is no point in storing characters in UTF-16 internally,
> especially in GHC, where characters are boxed objects and Word16 is
> represented as a full machine word (32 or 64 bits). UTF-16 will be
> supported as an external encoding, parallel to ISO-8859-x etc.
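
For concreteness, the Char-to-bytes direction of such a conversion can be
sketched in plain Haskell. (Incidentally, logBase 2 0x110000 is indeed
20.087462841250343, the figure quoted above.) This is only an illustration
of the scheme; encodeCharUtf8 is a hypothetical name, not GHC's actual
conversion API:

import Data.Word (Word8)
import Data.Char (ord)
import Data.Bits (shiftR, (.&.), (.|.))

-- Encode one code point (0 .. 0x10FFFF) as 1-4 UTF-8 bytes.
-- Hypothetical name; not part of any GHC library.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    top  s = fromIntegral (n `shiftR` s)                      -- leading bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)  -- 10xxxxxx byte

Encoding a whole String is then just concatMap encodeCharUtf8, and the
ISO-8859-x conversions mentioned above are the analogous one-byte tables.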

What about interfacing with Win32, Mac OS X, or Java?
Your talk of "external" versus "internal" worries me a bit,
since the distinction between the two is not always clear.
Is there a way to convert a Haskell String into a UTF-16-encoded
byte stream without writing it to a file and then reading the
file back in?
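
In principle nothing more than a pure function is needed. A minimal
sketch, with hypothetical names rather than a proposed API: split each
Char into UTF-16 code units (a surrogate pair above U+FFFF), then
serialise the units big-endian:

import Data.Word (Word8, Word16)
import Data.Char (ord)
import Data.Bits (shiftR, (.&.), (.|.))

-- One UTF-16 code unit for the BMP, a surrogate pair above U+FFFF.
toUtf16Units :: Char -> [Word16]
toUtf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 .|. fromIntegral (m `shiftR` 10)
                  , 0xDC00 .|. fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000          -- 20 bits, split 10/10 across the pair

-- Serialise big-endian; swapping the two bytes gives UTF-16LE.
utf16BE :: String -> [Word8]
utf16BE = concatMap unitBytes . concatMap toUtf16Units
  where
    unitBytes u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]

That never touches the file system; the question is whether conversions
like this will be exposed in memory, rather than only on file Handles.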

-- 
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.