[Haskell-cafe] invalid character encoding

Thu Mar 17 15:01:13 EST 2005

Glynn Clements <glynn at gclements.plus.com> writes:

> The (non-wchar) curses API functions take byte strings (char*),
> so the Haskell bindings should take CString or [Word8] arguments.

Programmers will not want to use such an interface. When they want to
display a string, it will be of the Haskell String type.

And it prevents having a single Haskell interface which uses either
the narrow or the wide version of the curses interface, depending on
what is available.

> If you provide "wrapper" functions which take String arguments,
> either they should have an encoding argument or the encoding should
> be a mutable per-terminal setting.

There is already a mutable setting. It's called "locale".

> I don't know enough about the wchar version of curses to comment on
> that.

It uses wcsrtombs or equivalents to display characters, and the
reverse to interpret keystrokes.
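
To illustrate the conversion involved, here is a minimal C sketch of
turning a wide string into bytes in the locale encoding. The helper
name wide_to_locale_bytes is hypothetical and is not part of curses;
only wcsrtombs is a real library call, and how curses uses it
internally is not shown here.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to a newly allocated multibyte string in the
 * encoding of the current LC_CTYPE locale. Returns NULL if a character
 * cannot be represented in that encoding or if allocation fails.
 * The caller frees the result. (Hypothetical helper, for illustration.) */
char *wide_to_locale_bytes(const wchar_t *ws)
{
    mbstate_t state;
    const wchar_t *src = ws;
    size_t len;
    char *out;

    /* First pass: with a NULL destination, wcsrtombs only counts bytes. */
    memset(&state, 0, sizeof state);
    len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: convert for real, including the terminating NUL. */
    memset(&state, 0, sizeof state);
    src = ws;
    if (wcsrtombs(out, &src, len + 1, &state) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

The bytes produced depend only on the current locale, which is why
the terminal has to agree with it.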

> It is possible for curses to be used with a terminal which doesn't
> use the locale's encoding.

No, it will break under the new wide character curses API, and it will
confuse programs which use the old narrow character API.

The user (or the administrator) is responsible for matching the locale
encoding with the terminal encoding.

> Also, it's quite common to use non-standard encodings with terminals
> (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> which terminfo understands).

curses doesn't support that.

>> The locale encoding is the right encoding to use for conversion of the
>> result of strerror, gai_strerror, msg member of gzip compressor state
>> etc. When an I/O error occurs and the error code is translated to a
>> Haskell exception and then shown to the user, why would the application
>> need to specify the encoding and how?
>
> Because the application may be using multiple locales/encodings.

But strerror always returns messages in the locale encoding.
Just like Gtk+2 always accepts text in UTF-8.
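
A minimal C sketch of what "decode strerror's result with the locale
encoding" means: mbstowcs interprets the bytes according to the
current LC_CTYPE locale. The function name error_message_wide is
hypothetical, and a Haskell binding would do the equivalent when
building its exception message.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Decode the message for errnum into wide characters, interpreting the
 * bytes from strerror in the current LC_CTYPE locale's encoding.
 * Writes at most capacity - 1 characters plus a terminator.
 * Returns the number of characters written, or (size_t)-1 on error.
 * (Hypothetical helper, for illustration.) */
size_t error_message_wide(int errnum, wchar_t *out, size_t capacity)
{
    const char *msg = strerror(errnum);
    size_t n;

    if (capacity == 0)
        return (size_t)-1;

    /* mbstowcs does not terminate the output when it fills the buffer,
     * so reserve one slot and terminate explicitly. */
    n = mbstowcs(out, msg, capacity - 1);
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';
    return n;
}
```

No encoding argument appears anywhere: the only information the
caller could supply is what the locale already says.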

For compatibility the default locale is "C", but new programs
which are prepared for I18N should do setlocale(LC_CTYPE, "")
and setlocale(LC_MESSAGES, "").
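
As a sketch of that startup step in C, assuming a POSIX system that
provides nl_langinfo (the function name adopt_user_encoding is
hypothetical):

```c
#include <langinfo.h>
#include <locale.h>

/* Adopt the user's character-type locale from the environment and
 * return the name of its encoding, for example "UTF-8".
 * If the environment names a locale that is not installed, setlocale
 * fails and the previous locale ("C" at program start) stays in effect.
 * (Hypothetical helper, for illustration.) */
const char *adopt_user_encoding(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

Without the setlocale call the program stays in the "C" locale, and
nl_langinfo reports that locale's codeset rather than the user's.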

There are places where the encoding is settable independently,
or stored explicitly. For them Haskell should have withCString /
peekCString / etc. with an explicit encoding. And there are
places which use the locale encoding instead of having a separate
switch.

> [The most common example is printf("%f"). You need to use the C
> locale (decimal point) for machine-readable text but the user's
> locale (locale-specific decimal separator) for human-readable text.

This is a different thing, and it is what IMHO C did wrong.
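
For readers who have not run into the printf behaviour, a minimal C
sketch: the output of "%f" depends on the LC_NUMERIC category, so
code has to switch the global locale around the call. The function
name format_double_in is hypothetical.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format value with "%f" while LC_NUMERIC is temporarily set to
 * locale_name, then restore the previous setting.
 * Returns 0 on success, -1 if the locale is unavailable or memory
 * runs out. (Hypothetical helper, for illustration.) */
int format_double_in(const char *locale_name, double value,
                     char *buf, size_t size)
{
    /* Copy the current name: later setlocale calls may overwrite it. */
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved = malloc(strlen(current) + 1);

    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) == NULL) {
        free(saved);
        return -1;
    }
    snprintf(buf, size, "%f", value);

    setlocale(LC_NUMERIC, saved);
    free(saved);
    return 0;
}
```

Calling it with "C" and 1.5 gives "1.500000". With a locale that uses
a decimal comma, such as de_DE.UTF-8 where it is installed, the
separator changes.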

> This isn't directly related to encodings per se, but a good example
> of why parameters are preferable to state.]

The LC_* environment variables are the parameters for the encoding.
There is no other convention for passing the encoding to be used for
textual output to stdout, for example.
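
For reference, POSIX resolves these variables in a fixed order:
LC_ALL first, then the variable for the category, then LANG, with
empty values treated as unset. A minimal C sketch of that lookup for
LC_CTYPE follows. The function names are hypothetical, the "C"
fallback is an assumption (POSIX leaves the default to the
implementation), and in real code setlocale(LC_CTYPE, "") does this
for you.

```c
#include <stdlib.h>

/* Return the value of an environment variable, or NULL if it is unset
 * or empty. (Hypothetical helper, for illustration.) */
static const char *nonempty_env(const char *name)
{
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/* Name of the locale that the environment selects for LC_CTYPE,
 * following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
 * Falls back to "C" here as an assumed default. */
const char *ctype_locale_from_env(void)
{
    const char *name;

    if ((name = nonempty_env("LC_ALL")) != NULL)
        return name;
    if ((name = nonempty_env("LC_CTYPE")) != NULL)
        return name;
    if ((name = nonempty_env("LANG")) != NULL)
        return name;
    return "C";
}
```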

> C libraries which use the locale do so as a last resort.

No, they do it by default.

> The only reason that the C locale mechanism isn't a major nuisance
> is that you can largely ignore it altogether.

Then how would a Haskell program know what encoding to use for stdout
messages? How would it know how to interpret filenames for graphical
display?

Do you want to invent a separate mechanism for communicating that, so
that an administrator has to set up a dozen environment variables
and teach each program separately about the encoding it should assume
by default? We had this mess 10 years ago, and parts of it are still
alive today: you must sometimes configure xterm or Emacs separately,
but it is becoming more common for programs to use the
system-supplied setting without having to be configured separately.

> Code which requires real I18N can use other mechanisms, and code
> which doesn't require any I18N can just pass byte strings around and
> leave encoding issues to code which actually has enough context to
> handle them correctly.

Haskell can't just pass byte strings around without turning the
Unicode support into a joke (which it is now).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/