[Haskell-cafe] invalid character encoding

Sat Mar 19 10:04:04 EST 2005

Marcin 'Qrczak' Kowalczyk wrote:

> > I'm talking about standard (XSI) curses, which will just pass
> > printable (non-control) bytes straight to the terminal. If your
> > terminal uses CP437 (or some other non-standard encoding), you can
> > just pass the appropriate bytes to waddstr() etc and the corresponding
> > characters will appear on the terminal.
> 
> Which terminal uses CP437?

Most software terminal emulators can use any encoding. Traditional
comms packages tend to support CP437 (including their own "VGA" font
if necessary) because of its widespread use on BBSes, which were
targeted at MS-DOS systems.

There exist hardware terminals (I can't name specific models, but I
have seen them in use) which support CP437, specifically for use with
MS-DOS systems.

> Linux console doesn't, except temporarily after switching the mapping
> to builtin CP437 (but this state is not used by curses) or after
> loading CP437 as the user map (nobody does this, and it won't work
> properly with all characters from the range 0x80-0x9F anyway).

I *still* encounter programs written for the Linux console which
assume that the built-in CP437 font is being used (if you use an
ISO-8859-1 font, you get dialogs with accented characters where you
would expect line-drawing characters).

> >> You can treat it as immutable. Just don't call setlocale with
> >> different arguments again.
> >
> > Which limits you to a single locale. If you are using the locale's
> > encoding, that limits you to a single encoding.
> 
> There is no support for changing the encoding of a terminal on the fly
> by programs running inside it.

If you support multiple terminals with different encodings, and the
library uses the global locale settings to determine the encoding, you
need to switch locale every time you write to a different terminal.
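
To make the cost concrete, here is a minimal sketch in Haskell of what
a program ends up doing when the library takes its encoding from one
global setting. Everything in it (Terminal, Action, schedule, and the
encoding names as plain strings) is a hypothetical model made up for
illustration, not a real curses or locale API.

```haskell
import Data.List (foldl')

-- Hypothetical model: a terminal with a name and a fixed encoding.
data Terminal = Terminal
  { termName     :: String
  , termEncoding :: String
  }

-- The steps a program must perform against a library that reads
-- the encoding from a single global locale setting.
data Action
  = SetLocale String      -- switch the global setting to this encoding
  | Write String String   -- write the given text to the named terminal
  deriving (Eq, Show)

-- Turn a sequence of (terminal, text) writes into the actions needed.
-- A SetLocale is required whenever the target terminal's encoding
-- differs from the one currently in force.
schedule :: String -> [(Terminal, String)] -> [Action]
schedule initial writes = reverse (snd (foldl' step (initial, []) writes))
  where
    step (current, acc) (term, text)
      | termEncoding term == current = (current, write : acc)
      | otherwise = (termEncoding term, write : SetLocale (termEncoding term) : acc)
      where
        write = Write (termName term) text
```

Alternating writes between an ISO-8859-1 terminal and a CP437 terminal
puts a SetLocale in front of every Write that changes terminal, which
is the switching described above.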

> > The point is that a single program often generates multiple streams of
> > text, possibly for different "audiences" (e.g. humans and machines).
> > Different streams may require different conventions (encodings,
> > numeric formats, collating orders), but may use the same functions.
> 
> A single program has a single stdout and a single filesystem. The
> contexts which use the locale encoding don't need multiple encodings.
> 
> Multiple encodings are needed e.g. for exchanging data with other
> machines for the network, for reading contents of text files after the
> user has specified an encoding explicitly etc. In these cases an API
> with explicitly provided encoding should be used.

An API which is used for reading and writing text files or sockets is
just as applicable to stdin/stdout.
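
As a rough sketch of that point, assuming a GHC whose System.IO
provides hSetEncoding and TextEncoding values such as latin1 and utf8:
the helper below takes the encoding as an argument and does not care
whether the handle is a file or stdout. The names hPutStrWith and demo
and the path "report.txt" are made up for the example.

```haskell
import System.IO

-- Write text to a handle using an explicitly supplied encoding,
-- instead of whatever the process-wide locale implies.
-- Note: this changes the handle's encoding for later writes too.
hPutStrWith :: TextEncoding -> Handle -> String -> IO ()
hPutStrWith enc h text = do
  hSetEncoding h enc
  hPutStr h text

-- The same call serves a file and stdout; only the handle and the
-- encoding chosen by the application differ.
-- "report.txt" is a placeholder path.
demo :: IO ()
demo = do
  withFile "report.txt" WriteMode $ \h -> hPutStrWith latin1 h "caf\233\n"
  hPutStrWith utf8 stdout "caf\233\n"
```

The choice of encoding stays with the caller, who is in a position to
know what each stream's consumer expects.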

> >> > The "current locale" mechanism is just a way of avoiding the issues
> >> > as much as possible when you can't get away with avoiding them
> >> > altogether.
> >> 
> >> It's a way to communicate the encoding of the terminal, filenames,
> >> strerror, gettext etc.
> >
> > It's *a* way, but it's not a very good way. It sucks when you can't
> > apply a single convention to everything.
> 
> It's not so bad to justify inventing our own conventions and forcing
> users to configure the encoding of Haskell programs separately.

I'm not suggesting inventing conventions. I'm suggesting leaving such
issues to the application programmer who, unlike the library
programmer, probably has enough context to be able to reliably
determine the correct encoding in any specific instance.

> >> Unicode has no viable competition.
> >
> > There are two viable alternatives. Byte strings with associated
> > encodings and ISO-2022.
> 
> ISO-2022 is an insanely complicated brain-damaged mess. I know it's
> being used in some parts of the world, but the sooner it will die,
> the better.

ISO-2022 has advantages and disadvantages relative to UTF-8. I don't
want to go on about the specifics here because they aren't
particularly relevant. What's relevant is that it isn't likely to
disappear any time soon.

A large part of the world already has a universal encoding which works
well enough; they don't *need* UTF-8, and aren't going to rebuild
their IT infrastructure from scratch for the sake of it.

-- 
Glynn Clements <glynn at gclements.plus.com>