More Unicode nit-picking

Kent Karlsson kentk@md.chalmers.se
Fri, 19 Oct 2001 10:08:18 +0200


----- Original Message ----- 
From: "Colin Paul Adams" 
...
> But this seems to assume there is a one-to-one mapping of upper-case
> to lower-case equivalent, and vice-versa. Apparently this is not
> so. 

True.  It's quite tricky. See below.

> It seems that whilst the Unicode database's definitions of whether or
> not a character is upper/lower/title case are normative, the mappings
> from upper to lower case are only suggestive.

Not anymore. TOGETHER with the case mappings in SpecialCasing.txt
(ref. below), the case mappings have now (very recently) been made normative.
SpecialCasing.txt also lists the locale (or rather, language) specific
exceptions, as well as the cases where case mapping a single character
returns multiple characters.  Note also that there are contextual
requirements, concerning not only the language but also the surrounding
string: e.g. lowercasing a capital sigma should yield either an 'ordinary'
lowercase sigma (when not at the end of a word) or a terminal (final) sigma
(when at the end of a word). Handling of i, j and related characters is also
non-trivial (and some remaining problems with SpecialCasing.txt will
hopefully soon be fixed).
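
To make the two points above concrete (one character mapping to several,
and the context-dependent sigma), here is a small sketch in Python 3.  It
is only an illustration, not part of the original discussion: it assumes
that Python's built-in str.upper() and str.lower() follow the default
(language-independent) Unicode case mappings, including the multi-character
entries and the final-sigma condition.  The variable names are made up for
this example.

```python
# One character in, two characters out: sharp s has no single-character
# uppercase form in the default mappings.
sharp_s = "\u00df"                      # LATIN SMALL LETTER SHARP S
sharp_s_upper = sharp_s.upper()         # "SS"
sharp_s_round_trip = sharp_s_upper.lower()  # "ss", not the original character

# Context-dependent lowercasing of capital sigma (U+03A3).
word_final = "\u039f\u0394\u039f\u03a3"     # sigma at the end of a word
word_initial = "\u03a3\u039f"               # sigma at the start of a word

lower_final = word_final.lower()        # ends in final sigma, U+03C2
lower_initial = word_initial.lower()    # starts with ordinary sigma, U+03C3

print(len(sharp_s), len(sharp_s_upper))
print(hex(ord(lower_final[-1])), hex(ord(lower_initial[0])))
```

The round trip through upper() and lower() does not return the original
string, which is exactly why a one-to-one table of upper/lower pairs is
not enough.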

> This is because it depends upon language conventions as to how the
> mapping is done. In Turkish for instance, I is not the upper-case
> equivalent of i, and vice-versa (apparently there is a dotted i, and a
> non-dotted i, and likewise for I).

There's more to it than that.  See UTR 21 (http://www.unicode.org/reports/tr21/),
and in particular the UCD (Unicode Character Database) file SpecialCasing.txt
(http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt).
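
As a rough illustration of the language-specific part, the sketch below
handles only the Turkish dotted/dotless i pairs on top of Python 3's
default case mappings.  The helper names upper_tr and lower_tr are
hypothetical (they are not from any library), and the sketch deliberately
ignores the other conditions listed in SpecialCasing.txt, such as a
capital I followed by a combining dot above.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_tr(s: str) -> str:
    """Uppercase with the Turkish pairs i -> dotted I, dotless i -> I."""
    # Handle the two special pairs first, then use the default mapping
    # for everything else.
    s = s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return s.upper()


def lower_tr(s: str) -> str:
    """Lowercase with the Turkish pairs dotted I -> i, I -> dotless i."""
    s = s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return s.lower()


city = "diyarbak\u0131r"      # contains both a dotted and a dotless i
city_upper = upper_tr(city)
city_lower = lower_tr(city_upper)
print(city_upper, city_lower)
```

With the default (non-Turkish) mappings, "i".upper() gives "I" and
"I".lower() gives "i", so the dotted/dotless distinction would be lost;
the language has to be supplied by the caller.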

                Kind regards
                /kent k