Unicode support

Kent Karlsson kentk@md.chalmers.se
Tue, 9 Oct 2001 13:50:40 +0200

----- Original Message -----
From: "Ketil Malde" <ketil@ii.uib.no>
> >>> for a long time. 16 bit unicode should be gotten rid of, being the worst
> >>> of both worlds, non backwards compatible with ascii, endianness issues
> >>> and no constant length encoding.... utf8 externally and utf32 when
> >>> working with individual characters is the way to go.
> >> I totally agree with you.
> > Now, what are your technical arguments for this position?
> > (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)
> What's wrong with the ones already mentioned?
> You have endianness issues, and you need to explicitly type text files
> or insert BOMs.

You have to distinguish between the encoding form (what you use internally)
and encoding scheme (externally).  For the encoding form, there is no endian
issue, just like there is no endian issue for int internally in your program.
For the encoding form there is no BOM either (or rather, it should have been
removed upon reading, if the data is taken in from an external source).
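
A minimal sketch of that distinction, in Python (the sample string and the
variable names are illustrative assumptions, not part of the discussion).
The same internal string is serialised under two byte orders; decoding with
the generic "utf-16" codec reads the BOM, picks the byte order, and drops the
BOM, so both inputs yield the same in-memory text:

```python
import codecs

# Internal text: a sequence of code points with no byte order attached.
text = "h\u00e4st"

# Two external serialisations (encoding schemes) of the same text,
# each prefixed with the matching byte order mark.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The "utf-16" codec uses the BOM to choose the byte order and removes it.
from_big_endian = big_endian.decode("utf-16")
from_little_endian = little_endian.decode("utf-16")

print(big_endian != little_endian)            # the byte streams differ
print(from_big_endian == from_little_endian)  # the decoded text does not
```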

But I agree that the BOM (for all of the Unicode encoding schemes) and
the byte order issue (for the non-UTF-8 encoding schemes; the external ones)
are a pain.  But as I said: they will not go away now, they are too firmly established.

> A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.

Which is a large portion of the raison d'être for UTF-8.
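
As a small illustration of the ASCII compatibility (a Python sketch; the
sample strings are assumptions chosen for the example): pure ASCII text
encodes to identical bytes under both codecs, and every byte of a multi-byte
UTF-8 sequence has its high bit set, so it cannot be mistaken for ASCII.

```python
ascii_text = "Unicode support"

# For 7-bit ASCII input, the UTF-8 bytes are exactly the ASCII bytes.
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
same_bytes = utf8_bytes == ascii_bytes

# A non-ASCII character (U+00EA) becomes a multi-byte sequence in which
# every byte is >= 0x80, so none of them collides with an ASCII byte.
non_ascii_bytes = "\u00ea".encode("utf-8")
all_high_bit = all(b >= 0x80 for b in non_ascii_bytes)

print(same_bytes, list(non_ascii_bytes), all_high_bit)
```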

> When not limited to ASCII, at least it avoids zero bytes and other
> potential problems.  UTF-16 will among other things, be full of
> NULLs.

Yes, and so what?

So will a file filled with image data, video clips, or plainly a list of raw
integers dumped to file (not formatted as strings).  I know, many old
utility programs choke on NULL bytes, but that's not Unicode's fault.
Further, NULL (as a character) is a perfectly valid character code.
Always was.
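
To make the zero-byte point concrete, here is a short Python sketch (the
sample strings are assumptions for illustration). ASCII-range text in UTF-16
carries one zero byte per character, UTF-8 carries none, and U+0000 itself
round-trips as an ordinary character:

```python
sample = "abc"

# In UTF-16 (little-endian here), each ASCII-range character is followed
# by a zero byte; in UTF-8 the same text contains no zero bytes.
utf16_bytes = sample.encode("utf-16-le")
utf8_bytes = sample.encode("utf-8")
zero_bytes_utf16 = utf16_bytes.count(0)
zero_bytes_utf8 = utf8_bytes.count(0)

# U+0000 is a valid character: it survives an encode/decode round trip.
with_nul = "a\x00b"
round_trip = with_nul.encode("utf-8").decode("utf-8")

print(utf16_bytes, zero_bytes_utf16, zero_bytes_utf8, round_trip == with_nul)
```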

> I can understand UCS-2 looking attractive when it looked like a
> fixed-length encoding, but that no longer applies.
> > So it is not surprising that most people involved do not consider
> > UTF-16 a bad idea.  The extra complexity is minimal, and further
> > surfaces rarely.
> But it needs to be there.  It will introduce larger programs, more
> bugs

True.  But implementing normalisation, or case mapping for that matter,
is non-trivial too.  In practice, the additional complexity with UTF-16 seems small.
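
For a sense of how much extra code is involved, the following Python sketch
shows the surrogate-pair handling that UTF-16 adds on top of the BMP case.
The function names are hypothetical, and this is an illustration of the
standard surrogate arithmetic rather than anyone's production implementation:

```python
def to_utf16_units(code_point):
    """Split one code point into one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]                  # BMP: a single code unit
    offset = code_point - 0x10000            # 20 bits to distribute
    return [0xD800 + (offset >> 10),         # high (lead) surrogate
            0xDC00 + (offset & 0x3FF)]       # low (trail) surrogate


def from_utf16_units(units):
    """Combine 16-bit code units back into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit (an unpaired surrogate is passed through as-is).
            code_points.append(unit)
            i += 1
    return code_points


units = to_utf16_units(0x1D11E)  # U+1D11E, outside the BMP
print([hex(u) for u in units], from_utf16_units(units) == [0x1D11E])
```

BMP characters take the one-line fast path; only supplementary characters
reach the pair logic.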

> , lower efficiency.


> > BMP characters are still (relatively) easy to process, and it saves
> > memory space and cache misses when large amounts of text data
> > is processed (e.g. databases).
> I couldn't find anything about the relative efficiencies of UTF-8 and
> UTF-16 on various languages.  Do you have any pointers?  From a
> Scandinavian POV, (using ASCII plus a handful of extra characters)
> UTF-8 should be a big win, but I'm sure there are counter examples.

So, how big is your personal hard disk now? 3GiB? 10GiB? How many images,
mp3 files and video clips do you have?  (I'm sorry, but your argument here
is getting old and stale.  Very few worry about that aspect anymore, except
when it comes to databases held in RAM, and when comparing UTF-16 with
UTF-32, which is guaranteed to be wasteful.)
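
The relative sizes are easy to measure for any given text. A Python sketch
follows; the helper name and the two sample strings are assumptions picked
for illustration, and real corpora will give different ratios:

```python
def encoded_sizes(text):
    """Return the byte length of text in each encoding (no BOM)."""
    codecs_by_name = (("UTF-8", "utf-8"),
                      ("UTF-16", "utf-16-le"),
                      ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs_by_name}


# Mostly ASCII with a couple of extra letters: UTF-8 is the smallest.
scandinavian = encoded_sizes("sm\u00f6rg\u00e5sbord")

# CJK text in the BMP: UTF-16 is the smallest.
japanese = encoded_sizes("\u65e5\u672c\u8a9e")

print(scandinavian)
print(japanese)
```

In both samples UTF-32 is the largest, at four bytes per character.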

        Kind regards
        /kent k