[Haskell-cafe] invalid character encoding

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Sat Mar 19 13:18:36 EST 2005


Wolfgang Thaller <wolfgang.thaller at gmx.net> writes:

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.


Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".


C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception
   (System.ArgumentException: Path contains invalid chars), paired
   surrogates are treated correctly, and an isolated surrogate causes
   an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


More information about the Haskell-Cafe mailing list