Unicode support

Kent Karlsson [email protected]
Mon, 8 Oct 2001 11:46:44 +0200


----- Original Message -----
From: "Wolfgang Jeltsch" <[email protected]>
To: "The Haskell Mailing List" <[email protected]>
Sent: Thursday, October 04, 2001 8:47 PM
Subject: Re: Unicode support


> On Sunday, 30 September 2001 20:01, John Meacham wrote:
> > sorry for the me too post, but this has been a major pet peeve of mine
> > for a long time. 16 bit unicode should be gotten rid of, being the worst
> > of both worlds, non backwards compatible with ascii, endianness issues
> > and no constant length encoding.... utf8 externally and utf32 when
> > working with individual characters is the way to go.
>
> I totally agree with you.

Now, what are your technical arguments for this position?
(B.t.w., UTF-16 isn't going to go away; it's very firmly established.)

From what I've seen, those who take the position you seem to
prefer are people not very involved with Unicode and its implementation,
whereas people who are so involved strongly prefer UTF-16.

Note that almost no string operation of interest (low-level stuff
like buffer sizing and copying excepted) can be done on a string by
looking at individual characters only.  Just about the only thing that
can sensibly be done on isolated characters is property interrogation.
You can't do case mapping of a string (involving Greek or Lithuanian
text) without being sensitive to the context of each character.  And, as
somebody already noted, combining characters have to be taken
into account.  E.g. Å (U+212B (deprecated), or U+00C5) must
collate the same as <U+0041,U+030A>, even when not collating
them among the A's (U+0041).
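
A minimal sketch of both points, in Python only because its standard
library ships the needed Unicode data (nothing in this thread uses
Python, and the variable names are purely illustrative).  It assumes a
Python 3 whose `str.lower()` applies the context-sensitive final-sigma
rule, and it uses NFD normalisation as a stand-in for "canonically
equivalent"; it is not a collation algorithm.

```python
import unicodedata

# Case mapping depends on context: a Greek capital sigma at the end of a
# word lowercases to the final form U+03C2, elsewhere to U+03C3.
word = "\u039F\u0394\u039F\u03A3"              # GREEK CAPITAL O, D, O, SIGMA
per_char = "".join(c.lower() for c in word)    # each character in isolation
whole = word.lower()                           # whole-string case mapping

# Canonical equivalence: three spellings of "A with ring above".
angstrom_sign = "\u212B"    # ANGSTROM SIGN
precomposed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"      # A followed by COMBINING RING ABOVE


def nfd(s: str) -> str:
    """Canonical decomposition (Normalization Form D)."""
    return unicodedata.normalize("NFD", s)


# The per-character result ends in U+03C3, the whole-string one in U+03C2.
print(per_char == whole)
# The three spellings differ as code point sequences but share one NFD form.
print(nfd(angstrom_sign) == nfd(precomposed) == nfd(decomposed))
```

The three spellings are distinct strings when compared code point by
code point, so any comparison or collation that looks at isolated code
points, in whatever encoding form, treats them as different.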

So it is not surprising that most people involved do not consider
UTF-16 a bad idea.  The extra complexity is minimal and, further,
surfaces rarely.  Indeed they think UTF-16 is a good idea, since the
supplementary characters will in most cases occur very rarely,
BMP characters are still (relatively) easy to process, and it saves
memory space and cache misses when large amounts of text data
are processed (e.g. databases).
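
To make the size argument concrete, here is a rough Python sketch (again
Python is only an illustration language, and `utf16_code_units` and
`encoded_sizes` are hypothetical helper names, not part of any library).
It shows the surrogate-pair arithmetic that is the "extra complexity" of
UTF-16, and compares byte counts of a short text in the three encoding
forms.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]                       # BMP character: one 16-bit unit
    v = cp - 0x10000                      # 20 bits left to distribute
    high = 0xD800 + (v >> 10)             # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits -> low surrogate
    return [high, low]


def encoded_sizes(text: str) -> dict[str, int]:
    """Byte counts of text in each encoding form, without a byte order mark."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


bmp_text = "\u65E5\u672C\u8A9E"           # three BMP (CJK) characters
clef = "\U0001D11E"                       # one supplementary character

print(encoded_sizes(bmp_text))            # 9, 6 and 12 bytes respectively
print([hex(u) for u in utf16_code_units(ord(clef))])
```

For BMP-heavy text such as the CJK sample, UTF-16 takes half the space
of UTF-32 and less than UTF-8; the supplementary character costs four
bytes in all three forms.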

On the other hand, Haskell implementations are probably still
rather wasteful when representing strings, and Haskell isn't used to hold
large databases, so going to UTF-32 is not a big deal for Haskell,
I guess. (Though I don't think that will happen for Java.)

> > seeing as how the haskell standard is horribly vague when it comes to
> > character set encodings anyway, I would recommend that we just omit any
> > reference to the bit size of Char, and just say abstractly that each
> > Char represents one unicode character, but the entire range of unicode
> > is not guaranteed to be expressable, which must be true, since haskell
> > 98 implementations can be written now, but unicode can change in the
> > future. The only range guaranteed to be expressable in any
> > representation are the values 0-127 US ASCII (or perhaps latin1)
>
> This sounds also very good.

Why?  This is the approach taken by programming languages like C,
where the character encoding *at runtime* (both for char and wchar_t)
is essentially unknown.  This, of course, leads to all sorts of trouble,
which some try to mitigate by *suggesting* that all sorts of
locale-independent stuff be put in (POSIX) "locales".  Nobody has worked
out any sufficiently comprehensive set of data for this though, and
nobody ever will, both because it is open-ended and because nobody is
really trying.  Furthermore, this is not the approach of Java, Ada, or
Haskell.  And it is not the approach advocated by people involved with
implementing support for Unicode (and other things related to
internationalisation and localisation).  Even C is (slowly) leaving that
approach, having introduced the __STDC_ISO_10646__ property macro (with
its semantics), and the \uhhhh and \Uhhhhhhhh 'universal character names'.
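
As a small illustration of what a defined runtime encoding buys, here is
a sketch in Python 3, used as a stand-in because its string literals
happen to accept the same \uhhhh and \Uhhhhhhhh escape syntax; it says
nothing about how C or Haskell implement this, and `code_points` is a
hypothetical helper name.  The assumption being shown is that the
numeric value of a character is its ISO/IEC 10646 code point, whatever
the locale.

```python
# Both escape forms name the same character by its code point.
short_form = "\u00C5"        # four hex digits
long_form = "\U000000C5"     # eight hex digits


def code_points(s: str) -> list[str]:
    """List the code points of s in the usual U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


print(short_form == long_form)
print(code_points("A\u00C5\U0001D11E"))
```

Because the character values are fixed by the language rather than by
the environment, this gives the same answer on every platform.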

            Kind regards
            /kent k