H98 Report: Unicode (was: Re: H98 Report: input functions)

10 Sep 2002 09:20:08 +0200

(This is mostly a summary of discussions taking place on the i18n
list.  Some things have come up that might impact the Report, and I'm
not sure whether they have propagated to the right places.)

While we're at it, are there any plans to remove this paragraph from
section 2.1:

| Haskell uses a pre-processor to convert non-Unicode character sets
| into Unicode. This pre-processor converts all characters to Unicode
| and uses the escape sequence \uhhhh, where the "h" are hex digits,
| to denote escaped Unicode characters. Since this translation occurs
| before the program is compiled, escaped Unicode characters may
| appear in identifiers and any other place in the program.

Apparently, no compilers implement this, and the backslash introduces
quite a bit of syntactic confusion.  Besides, four 'h's only reach
U+FFFF, which is not sufficient for all code points.
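
To make the digit count concrete, here is a minimal sketch.  It
assumes a compiler such as GHC, where Char covers U+0000 through
U+10FFFF; the helper names hexOf and hexDigitsNeeded are mine and
not from the Report.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Hexadecimal spelling of a character's code point, without padding.
hexOf :: Char -> String
hexOf c = showHex (ord c) ""

-- Number of hex digits needed to write that code point.
hexDigitsNeeded :: Char -> Int
hexDigitsNeeded = length . hexOf
```

Under that assumption, hexOf maxBound gives "10ffff", so the largest
code point needs six hex digits rather than four.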

Unicode is already supported in strings with \nnnn, from section 2.6:

| Escape characters for the Unicode character set, including control
| characters such as \^X, are also provided. Numeric escapes such as
| \137 are used to designate the character with decimal representation
| 137; octal (e.g. \o137) and hexadecimal (e.g. \x37) representations
| are also allowed. Numeric escapes that are out-of-range of the
| Unicode standard (16 bits) are an error.

Note that the 16-bit remark should probably be removed, since Unicode
code points extend beyond that nowadays.
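
As an illustration of the numeric escapes, here is a small sketch.
It assumes a compiler that accepts code points above 16 bits (GHC
does); the names sameChar, gClef and beyond16Bits are only for this
example.

```haskell
import Data.Char (ord)

-- Decimal, octal, and hexadecimal escapes naming the same character
-- (code point 137 = octal 211 = hex 89).
sameChar :: Bool
sameChar = '\137' == '\o211' && '\137' == '\x89'

-- A code point outside the 16-bit range:
-- U+1D11E MUSICAL SYMBOL G CLEF.
gClef :: Char
gClef = '\x1D11E'

-- True when the escape above denotes a code point beyond 16 bits.
beyond16Bits :: Bool
beyond16Bits = ord gClef > 0xFFFF
```

Read literally, the "16 bits" wording in 2.6 would make the gClef
literal an error.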

Also, if a provision is made for escaped Unicode in identifiers, it
would be nice if the section on layout (2.7) discouraged layout rules
where the indentation level depended on the width of non-space
characters.  (Ideally, this would result in a compiler warning.)
In fact, this might always be useful, since some Unicode characters
are defined as double width.
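
To show what I mean by width-dependent layout, here is a sketch with
two made-up functions.  In the first, the column of the bindings is
fixed by the width of everything before "a" on the first line; in
the second, the block starts on a fresh line, so no preceding token
affects it.

```haskell
-- Width-dependent: `b` must sit in the same column as `a`, and that
-- column is determined by the width of "aligned x = let ".
aligned :: Int -> Int
aligned x = let a = x + 1
                b = x * 2
            in a + b

-- Width-independent: the bindings start on their own lines, so their
-- indentation does not depend on the width of any earlier token.
independent :: Int -> Int
independent x =
  let
    a = x + 1
    b = x * 2
  in a + b
```

Both compute the same result.  If the function name in the first
form contained double-width characters, the visual alignment and the
column count used by the layout rule could disagree; the second form
avoids the question.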

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants