Why are strings linked lists?

Kent Karlsson kentk at cs.chalmers.se
Tue Dec 9 00:06:53 EST 2003


> GHC 6.2 (shortly to be released) also supports toUpper, toLower, and the
> character predicates isUpper, isLower etc. on the full Unicode character
> set.
> 
> There is one caveat: the implementation is based on the C library's
> towupper() and so on, so the support is only as good as the C library
> provides, and it relies on wchar_t being equivalent to Unicode (the
> sensible choice, but not all libcs do this).

Now, why would one want to base this on C's wchar_t and its
"w" routines? wchar_t is sometimes (isolated) UTF-32 code units,
including in Linux, sometimes it is (isolated) UTF-16 code units,
including in Windows, and sometimes something utterly useless.
The casing data is not reliable (it could be entirely wrong, and even
locale dependent in an erroneous way), nor kept up to date with the
Unicode character database in all implementations (even where
wchar_t is some form of Unicode/10646). wchar_t is best forgotten,
especially for portable programs.
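To make the "isolated UTF-16 code units" point concrete, here is a
minimal Haskell sketch (utf16Units is a hypothetical helper written for
this illustration, not part of any library). It shows that a code point
above U+FFFF occupies two 16-bit units, so a routine that sees one
16-bit wchar_t at a time cannot case-map such a character at all.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Hypothetical helper: encode one code point as UTF-16 code units.
-- Code points below U+10000 take one unit, all others a surrogate pair.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000

main :: IO ()
main = do
  print (utf16Units 'A')        -- [65]
  print (utf16Units '\x10400')  -- [55297,56320], i.e. D801 DC00
```

U+10400 (DESERET CAPITAL LETTER LONG I) has a lowercase counterpart, but
neither of its two surrogates is a character on its own.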

Please instead use ICU's UChar32, which is (isolated) UTF-32, and
Unicode::isUpperCase(cp), Unicode::toUpperCase(cp) (C++ here),
etc. The ICU data is kept up to date with Unicode versions. The
case mappings are the simplistic ones, not taking SpecialCasing.txt
into account, just the UnicodeData.txt case mapping data. It is thus
not locale dependent, nor context dependent, nor does it case-map
a character to more than one character (so it is not fully appropriate
for strings, but still much, much better than C's wchar_t and its
w-functions).
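As a rough illustration of why one-to-one mappings fall short for
strings, here is a small Haskell sketch. It assumes a GHC whose
Data.Char.toUpper applies the simple UnicodeData.txt mappings; it does
not call ICU. fullUpperChar is a hypothetical helper with a single
hand-written entry in the spirit of SpecialCasing.txt, not a real table.

```haskell
import Data.Char (toUpper)

-- Simple case mapping: one code point in, one code point out.
simpleUpper :: String -> String
simpleUpper = map toUpper

-- Hypothetical one-to-many mapping; only U+00DF (sharp s) is handled
-- specially, everything else falls back to the simple mapping.
fullUpperChar :: Char -> String
fullUpperChar '\xDF' = "SS"
fullUpperChar c      = [toUpper c]

fullUpper :: String -> String
fullUpper = concatMap fullUpperChar

main :: IO ()
main = do
  let s = "stra\xDF\&e"                       -- "strasse" spelled with sharp s
  print (simpleUpper s)                       -- sharp s is left unchanged
  print (fullUpper s)                         -- "STRASSE"
  print (length (simpleUpper s) == length s)  -- True: length is preserved
```

The simple mapping always preserves string length, which is exactly
what makes it unable to express mappings such as sharp s to "SS".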

> Proper support for character set conversions in the I/O library has been
> talked about for some time, and there are a couple of implementations

One can base this on the ICU character encoding conversions. I would
very much recommend that over the C locale dependent "mb"
conversion routines, for the same reasons as above.

	/kent k


