Why are strings linked lists?

Kent Karlsson kentk at cs.chalmers.se
Sun Nov 30 12:10:17 EST 2003


> > Glynn Clements wrote:
> >> What Unicode support?
> 
> >> Simply claiming that values of type Char are Unicode characters
> >> doesn't make it so.

Well, *claiming* so doesn't make it so. But actually representing
characters in such a way that the Unicode conformance rules are
followed, makes it so. There is no requirement for a particular API,
for instance.

> > Just because some implementations lack toUpper etc. doesn't mean
> > they all do.  

toUpper etc. are over-rated. They are very rarely used in real life,
or at least should be very rarely used, with very few exceptions:
auto-titlecasing of the first word of a sentence (which I find rather
handy for natural language texts), and for making "small caps"
(some fonts do that internally, but that's a mistake, since it is then
not language dependent).

Some things that are much more interesting and of practical use are:
Unicode normalisation, transformation between encoding forms
(mainly for I/O), finding formal character (or rather, code point)
properties, line breaking, combining character handling, language
dependent collation (UCA based), decimal number parsing and
formatting (for several scripts), regular expressions generalised
to Unicode (including support for "default ignorable"), ...
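
As an illustration of the "transformation between encoding forms"
item, here is a minimal sketch of turning code points into UTF-8 code
units. The names encodeCharUtf8 and encodeUtf8 are made up for this
example and are not from any particular library; the sketch assumes
the input holds no surrogate code points and does not check for them.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 code units (hypothetical helper).
-- Surrogate code points are not rejected in this sketch.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Lead byte: length marker bits plus the top payload bits.
    lead marker s = fromIntegral (marker .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- | Encode a whole string, code point by code point.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8
```

Decoding, which has to reject overlong and otherwise ill-formed
sequences, is the harder direction and is left out here.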

Case mapping falls rather low on the priority list. The main exception
is the special form of "case folding" (almost lowercasing, but not
quite) used for IDNs. It is used almost only there, though it could
also be used for Ada, SQL, etc., which "ignore" case.
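
To show why case folding is "almost lowercasing but not quite", here
is a rough sketch of caseless comparison. The names caseFoldSketch and
equalIgnoringCase are hypothetical, and only two folding rules that
differ from lowercasing are included; a real implementation would be
driven by the full Unicode case folding data (and, for IDNs, by the
additional processing specified for them).

```haskell
import Data.Char (toLower)

-- | Approximate case folding (hypothetical helper, not a library API).
-- Folding differs from lowercasing in a few places; two are shown.
caseFoldSketch :: String -> String
caseFoldSketch = concatMap foldOne
  where
    foldOne '\223' = "ss"      -- U+00DF sharp s folds to two letters
    foldOne '\962' = "\963"    -- U+03C2 final sigma folds to U+03C3
    foldOne c      = [toLower c]

-- | Compare two identifiers while "ignoring" case.
equalIgnoringCase :: String -> String -> Bool
equalIgnoringCase a b = caseFoldSketch a == caseFoldSketch b
```

With this, equalIgnoringCase "SELECT" "select" holds, and so does
equalIgnoringCase "STRASSE" "stra\223e", which plain lowercasing of
both sides would not give.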

B.t.w., for line breaking Thai, Lao, or Khmer, you need a dictionary.
ZERO WIDTH SPACE can be used between words, but normally isn't.

> I think the point is that for toUpper etc to be properly Unicoded,
> they can't simply look at a single character.  IIRC, there are some
> characters that expand to two characters when the case is changed,

Yes, for instance for ß (sharp s). The uppercase of ß is SS. For
proper lowercasing of SS you need a dictionary, to decide between ß
and ss. Case mapping is also language dependent: Lithuanian and
Turkish/Azerbaijani have exceptions to what is done elsewhere. See
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
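
A minimal sketch of what this means for an API: uppercasing has to work
on strings (one character may become two) and has to know the language.
The type Lang and the function toUpperFull are invented for this
example; only the sharp s rule and the Turkish/Azerbaijani dotted i
rule are included, and Lithuanian is left out entirely.

```haskell
import Data.Char (toUpper)

-- | Language tag for the sketch (hypothetical, far from complete).
data Lang = Turkic | OtherLang
  deriving (Eq, Show)

-- | Uppercase a string, allowing one character to map to several.
toUpperFull :: Lang -> String -> String
toUpperFull lang = concatMap upperOne
  where
    upperOne '\223' = "SS"          -- U+00DF sharp s becomes two letters
    upperOne 'i'
      | lang == Turkic = "\304"     -- U+0130 capital I with dot above
    upperOne c = [toUpper c]        -- everything else: per-character map
```

For example, toUpperFull OtherLang "stra\223e" gives "STRASSE", one
character longer than the input, which a function of type
Char -> Char cannot express.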

		/kent k

