[Haskell-cafe] invalid character encoding

Glynn Clements glynn at gclements.plus.com
Thu Mar 17 14:25:48 EST 2005


Marcin 'Qrczak' Kowalczyk wrote:

> Glynn Clements <glynn at gclements.plus.com> writes:
> 
> >> It should be possible to specify the encoding explicitly.
> >
> > Conversely, it shouldn't be possible to avoid specifying the
> > encoding explicitly.
> 
> What encoding should a binding to readline or curses use?
> 
> Curses in C comes in two flavors: the traditional byte version and a
> wide character version. The second version is easy if we can assume
> that wchar_t is Unicode, but it's not always available and until
> recently in ncurses it was buggy. Let's assume we are using the byte
> version. How to encode strings?

The (non-wchar) curses API functions take byte strings (char*), so the
Haskell bindings should take CString or [Word8] arguments. If you
provide "wrapper" functions which take String arguments, either they
should have an encoding argument or the encoding should be a mutable
per-terminal setting.
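
As a rough sketch of what that could look like (the names below are
made up, not any existing binding's API): the low-level call takes
bytes, and the String-level wrapper demands an explicit encoder rather
than silently consulting the locale:

    import Data.Word (Word8)
    import Data.Char (ord)

    -- Stand-in for the real waddstr-style FFI call, which takes
    -- raw bytes (char* in C).
    addBytes :: [Word8] -> IO ()
    addBytes = mapM_ print

    -- An encoder is just a function from Haskell text to bytes.
    type Encoder = String -> [Word8]

    -- Only correct for code points below 256.
    latin1 :: Encoder
    latin1 = map (fromIntegral . ord)

    -- The String wrapper: the caller says which encoding the
    -- terminal expects; nothing is taken from the locale behind
    -- its back.
    addString :: Encoder -> String -> IO ()
    addString enc = addBytes . enc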

> A terminal uses an ASCII-compatible encoding. The wide character
> version of curses converts characters to the locale encoding, and the
> byte version passes bytes unchanged. This means that if a Haskell
> binding to the wide character version does the obvious thing and
> passes Unicode directly, then an equivalent behavior can be obtained
> from the byte version (only limited to 256-character encodings) by
> using the locale encoding.

I don't know enough about the wchar version of curses to comment on
that.

I do know that, to work reliably, the normal (byte) version of curses
needs to pass "printable" bytes through unmodified.

It is possible for curses to be used with a terminal which doesn't use
the locale's encoding. Specifically, a single process may use curses
with multiple terminals with differing encodings, e.g. an airport
public information system displaying information in multiple
languages.
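
A minimal sketch of the per-terminal idea, with hypothetical types (no
real curses binding is assumed): each terminal carries its own
encoder, so one process can drive terminals whose encodings differ:

    import Data.Word (Word8)
    import Data.Char (ord)

    -- Hypothetical: each terminal records how text must be encoded
    -- for it.
    data Terminal = Terminal
      { termName    :: String
      , termEncoder :: String -> [Word8]
      }

    -- Placeholder for the real per-terminal output call.
    writeTo :: Terminal -> String -> IO ()
    writeTo t s = putStrLn (termName t ++ ": " ++ show (termEncoder t s))

    -- A Latin-1 terminal; a display in another encoding would just
    -- carry a different encoder.
    gateA :: Terminal
    gateA = Terminal "gate A" (map (fromIntegral . ord))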

Also, it's quite common to use non-standard encodings with terminals
(e.g. codepage 437, which has graphic characters beyond the ACS_* set
which terminfo understands).

> The locale encoding is the right encoding to use for conversion of the
> result of strerror, gai_strerror, the msg member of the gzip
> compressor state, etc. When an I/O error occurs and the error code is
> translated to a
> Haskell exception and then shown to the user, why would the application
> need to specify the encoding and how?

Because the application may be using multiple locales/encodings.
Having had to do this in C (i.e. repeatedly calling setlocale() to
select the correct encoding), I would much prefer to have been able to
pass the locale as a parameter.

[The most common example is printf("%f"). You need to use the C locale
(decimal point) for machine-readable text but the user's locale
(locale-specific decimal separator) for human-readable text. This
isn't directly related to encodings per se, but it's a good example of
why parameters are preferable to state.]
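
To illustrate in Haskell (just a sketch, not tied to any library):
make the separator an explicit parameter and both forms coexist
without touching any global state:

    import Numeric (showFFloat)

    -- Machine-readable: always a '.' decimal point, regardless of
    -- locale.
    machineShow :: Double -> String
    machineShow x = showFFloat (Just 2) x ""

    -- Human-readable: the separator is an explicit parameter, not
    -- hidden global state in the style of setlocale().
    humanShow :: Char -> Double -> String
    humanShow sep = map swap . machineShow
      where swap '.' = sep
            swap c   = c

    main :: IO ()
    main = do
      putStrLn (machineShow 3.14)    -- "3.14", for machine consumption
      putStrLn (humanShow ',' 3.14)  -- "3,14", e.g. for a Polish user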

> > If application code doesn't want to use the locale's encoding, it
> > shouldn't be shoe-horned into doing so because a library developer
> > decided to duck the encoding issues by grabbing whatever encoding
> > was readily to hand (i.e. the locale's encoding).
> 
> If a C library is written with the assumption that texts are in the
> locale encoding, a Haskell binding to such library should respect that
> assumption.

C libraries which use the locale do so as a last resort. K&R C
completely ignored I18N issues. ANSI C added the locale mechanism as a
hack to provide minimal I18N support in a minimally intrusive manner
while maintaining backward compatibility.

The only reason that the C locale mechanism isn't a major nuisance is
that you can largely ignore it altogether. Code which requires real
I18N can use other mechanisms, and code which doesn't require any I18N
can just pass byte strings around and leave encoding issues to code
which actually has enough context to handle them correctly.
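
In Haskell terms, a sketch of that division of labour (hypothetical
functions, not a real library): the middle layer traffics purely in
bytes, and only the caller, which knows the encoding in use, decodes
for display:

    import Data.Word (Word8)
    import Data.Char (chr)

    -- A library that doesn't care about text: it just moves bytes
    -- around (here, wrapping a payload in STX/ETX, say).
    frame :: [Word8] -> [Word8]
    frame msg = 0x02 : msg ++ [0x03]

    -- Only code with enough context supplies the decoder.
    display :: ([Word8] -> String) -> [Word8] -> IO ()
    display decode = putStrLn . decode

    latin1Decode :: [Word8] -> String
    latin1Decode = map (chr . fromIntegral)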

> Only some libraries allow working with different, explicitly
> specified encodings. Many libraries don't, especially if the texts
> are not the core of the library's functionality but error messages.

And most such libraries just treat text as byte strings. They don't
care how those strings are interpreted, or even whether the bytes are
valid in the locale's encoding.

-- 
Glynn Clements <glynn at gclements.plus.com>

