[Haskell-cafe] Writing binary files?

Glynn Clements glynn.clements at virgin.net
Mon Sep 13 18:29:29 EDT 2004


Marcin 'Qrczak' Kowalczyk wrote:

> > Unless you are the sole user of a system, you have no control over
> > what filenames may occur on it (and even if you are the sole user,
> > you may wish to use packages which don't conform to your rules).
> 
> For these occasions you may set the encoding to ISO-8859-1. But then
> you can't sensibly show them to the user in a GUI, nor in ncurses
> using the wide character API, nor can you sensibly store them in a
> file which is to be always encoded in UTF-8 (e.g. XML file where you
> can't put raw bytes without knowing their encoding).

If you need to preserve the data exactly, you can use octal escapes
(\337), URL encoding (%DF) or similar. If you don't, you can just
approximate it (e.g. display unrepresentable characters as "?"). But
this is an inevitable consequence of filenames being bytes rather than
chars.
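
To make that concrete, here's roughly what I mean, as a sketch in
Haskell; the function name and the exact escape syntax are mine, not
a proposal:

    import Data.Char (chr, isPrint)
    import Data.Word (Word8)
    import Text.Printf (printf)

    -- Render a byte-string filename for display: pass printable
    -- ASCII through, octal-escape everything else. (A real version
    -- would also escape the backslash itself, to keep the mapping
    -- reversible.)
    escapeFileName :: [Word8] -> String
    escapeFileName = concatMap escapeByte
      where
        escapeByte b
          | b < 0x80 && isPrint c = [c]
          | otherwise = printf "\\%03o" (fromIntegral b :: Int)
          where c = chr (fromIntegral b)

The byte 0xDF, for instance, comes out as \337, and the original
bytes remain recoverable.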

[Actually, regarding on-screen display, this is also an issue for
Unicode. How many people actually have all of the Unicode glyphs? I
certainly don't.]

> There are two paradigms: manipulate bytes not knowing their encoding,
> and manipulating characters explicitly encoded in various encodings
> (possibly UTF-8). The world is slowly migrating from the first to the
> second.

This migration isn't a process which will ever be complete. There will
always be plenty of cases where bytes really are just bytes.

And even to the extent that it can be done, it will take a long time. 
Outside of the Free Software ghetto, long-term backward compatibility
still means a lot.

[E.g.: EBCDIC has been in existence longer than I have and, in spite
of the fact that it's about the only widely-used encoding in existence
which doesn't have ASCII as a subset, it shows no sign of dying out
any time soon.]

> >> > There are limits to the extent to which this can be achieved. E.g. 
> >> > what happens if you set the encoding to UTF-8, then call
> >> > getDirectoryContents for a directory which contains filenames which
> >> > aren't valid UTF-8 strings?
> >> 
> >> The library fails. Don't do that. This environment is internally
> >> inconsistent.
> >
> > Call it what you like, it's a reality, and one which programs need to
> > deal with.
> 
> The reality is that filenames are encoded in different encodings
> depending on the system. Sometimes it's ISO-8859-1, sometimes
> ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
> of UTF-8-encoded filenames.

I'm not suggesting we do.

> In CLisp it fails silently (undecodable filenames are skipped), which
> is bad. It should fail loudly.

No, it shouldn't fail at all.

> > Most programs don't care whether any filenames which they deal with
> > are valid in the locale's encoding (or any other encoding). They just
> > receive lists (i.e. NUL-terminated arrays) of bytes and pass them
> > directly to the OS or to libraries.
> 
> And this is why I can't switch my home environment to UTF-8 yet. Too
> many programs are broken; almost all terminal programs which use more
> than stdin and stdout in default modes, i.e. which use line editing or
> work in full screen. How would you display a filename in a full screen
> text editor, such that it works in a UTF-8 environment?

So, what are you suggesting? That the whole world switches to UTF-8? 
Or that every program should pass everything through iconv() (and
handle the failures)? Or what?

> > If the assumed encoding is ISO-8859-*, this program will work
> > regardless of the filenames which it is passed or the contents of the
> > file (modulo the EOL translation on Windows). OTOH, if it were to use
> > UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
> > correctly if either filename or the file's contents weren't valid
> > UTF-8.
> 
> A program is not supposed to encounter filenames which are not
> representable in the locale's encoding.

Huh? What does "supposed to" mean in this context? That everything
would be simpler if reality weren't the way it is?

If that's your position, then my response is essentially: Yes, but so
what?
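
For what it's worth, the kind of program being discussed is trivial
to write at the byte level. A sketch, using openBinaryFile and the
buffer I/O from GHC's System.IO (bracket is from Control.Exception;
the function name is mine):

    import Control.Exception (bracket)
    import Foreign.Marshal.Alloc (allocaBytes)
    import System.IO

    -- Copy a file byte-for-byte. The contents never go near Char,
    -- so no encoding can break it; the FilePath arguments still do,
    -- which is exactly the problem under discussion.
    copyBytes :: FilePath -> FilePath -> IO ()
    copyBytes src dst =
      withBin src ReadMode  $ \hin  ->
      withBin dst WriteMode $ \hout ->
      allocaBytes bufSize   $ \buf  ->
        let loop = do
              n <- hGetBuf hin buf bufSize
              if n == 0
                then return ()
                else hPutBuf hout buf n >> loop
        in loop
      where
        bufSize = 65536
        withBin path mode = bracket (openBinaryFile path mode) hClose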

> In your setting it's
> impossible to display a filename in a way other than printing
> to stdout.

Writing to stdout doesn't amount to "displaying" anything; stdout
doesn't have to be a terminal.

> > More accurately, it specifies which encoding to assume when you *need*
> > to know the encoding (i.e. ctype.h etc), but you can't obtain that
> > information from a more reliable source.
> 
> In the case of filenames there is no more reliable source.

Sure; but that doesn't automatically mean that the locale's encoding
is correct for any given filename. The point is that you often don't
need to know the encoding.

Converting a byte string to a character string when you're just going
to be converting it back to the original byte string is pointless. And
it introduces unnecessary errors. If the only difference between
(encode . decode) and the identity function is that the former
sometimes fails, what's the point?
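
To spell that out with a toy decoder (ASCII here, for brevity; the
argument is identical for UTF-8 or any other encoding which rejects
some byte sequences):

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- A deliberately minimal decoder: bytes-as-ASCII, failing on
    -- anything else.
    decode :: [Word8] -> Maybe String
    decode = mapM decodeByte
      where decodeByte b | b < 0x80  = Just (chr (fromIntegral b))
                         | otherwise = Nothing

    encode :: String -> [Word8]
    encode = map (fromIntegral . ord)

    -- Where decode succeeds, the round trip is the identity:
    --   fmap encode (decode [104,105]) == Just [104,105]
    -- Where it doesn't, the data is simply gone:
    --   fmap encode (decode [104,223]) == Nothing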

> > My central point is that the existing API forces the encoding to be
> > an issue when it shouldn't be.
> 
> It is an unavoidable issue because not every interface in a given
> computer system uses the same encoding. Gtk+ uses UTF-8; you must
> convert text to UTF-8 in order to display it, and in order to convert
> you must know its encoding.

It frequently *is* an avoidable issue, because not every interface
uses *any* encoding. Most of the standard Unix utilities work fine
without even considering encodings.

> > Extending something like curses to handle encoding issues is far
> > from trivial; which is probably why it hasn't been finished yet.
> 
> It's almost finished. The API specification was ready in 1997.
> It works in ncurses modulo unfixed bugs.

Right, so it's taken 7 years to get to "almost finished, modulo
unfixed bugs". Which seems to back up my "far from trivial" claim.

> > Although, if you're going to have implicit String -> [Word8]
> > converters, there's no reason why you can't do the reverse, and have
> > isAlpha :: Word8 -> IO Bool. Although, like ctype.h, this will only
> > work for single-byte encodings.
> 
> We should not ignore multibyte encodings like UTF-8, which means that
> Haskell should have a Unicoded character type. And it's already
> specified in Haskell 98 that Char is such a type!

I'm not suggesting that we ignore them. I'm suggesting that we:

1. Don't provide a broken API which makes it impossible to write
programs which work reliably in the real world (rather than some
fantasy world where inconveniences (like filenames which don't match
the locale's encoding) never happen).

2. Don't force everyone to deal with all of the complexities
involved in character encoding even when they shouldn't have to.
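
As an aside, the isAlpha :: Word8 -> IO Bool mentioned above is a
one-liner over the C library. A sketch (the Haskell names are mine):

    import Data.Word (Word8)
    import Foreign.C.Types (CInt)

    -- The locale decides the answer (via setlocale()), hence IO.
    -- Like ctype.h itself, this is only meaningful for single-byte
    -- encodings.
    foreign import ccall unsafe "ctype.h isalpha"
      c_isalpha :: CInt -> IO CInt

    isAlphaByte :: Word8 -> IO Bool
    isAlphaByte b = do
      r <- c_isalpha (fromIntegral b)
      return (r /= 0)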

> What is missing is API for manipulating binary files, and conversion
> between byte streams and character streams using particular text
> encodings.
> 
> >> A mail client is expected to respect the encoding set in headers.
> >
> > A client typically needs to know the encoding in order to display
> > the text.
> 
> This is easier to handle when String type means Unicode.

Not necessarily. There are advantages and disadvantages to both the
byte-stream approach and the wide-character approach.

And, given that Unicode isn't a simple "one code point, one
character" system (what with combining characters), it isn't actually
all that much simpler than dealing with multi-byte strings. From a
display perspective, it's largely multi-byte strings vs multi-wchar_t
strings.

The main advantage of Unicode for display is that there's only one
encoding. Unfortunately, given that most of the existing Unicode fonts
are a bit short on actual glyphs, you typically just end up converting
the Unicode back into pseudo-ISO-2022 anyhow.

> > As a counter-example, a mail *server* can do its job without paying
> > any attention to the encodings used. It can also handle non-MIME email
> > (which doesn't specify any encoding) regardless of the encoding.
> 
> So it should push bytes, not characters.

And so should a lot of software. But it helps if languages and
libraries don't go to great lengths trying to coerce everything
into characters.

> >> This is why I said "1. API for manipulating byte sequences in I/O
> >> (without representing them in String type)".
> >
> > Yes. But that API also needs to include functions such as those in the
> > Directory and System modules.
> 
> If deemed really necessary, I will not fight against them.

Oh, it's necessary. As to whether it will be *deemed* necessary might
be a different issue. The "wishful thinking" approach to problem
solving is no stranger around these parts.
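
For concreteness, this is roughly the shape I have in mind. Every
name below is made up for illustration, and the bodies are stubs; a
real implementation would wrap the corresponding system calls via
the FFI:

    import Data.Word (Word8)

    -- A filename is just the bytes the kernel gave us.
    type RawFilePath = [Word8]

    getDirectoryContentsRaw :: RawFilePath -> IO [RawFilePath]
    getDirectoryContentsRaw = error "stub: wrap opendir()/readdir()"

    removeFileRaw :: RawFilePath -> IO ()
    removeFileRaw = error "stub: wrap unlink()"

    renameFileRaw :: RawFilePath -> RawFilePath -> IO ()
    renameFileRaw = error "stub: wrap rename()"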

> > It isn't just about reading and writing streams. Most of the Unix
> > API (kernel, libc, and many standard libraries) is byte-oriented
> > rather than character-oriented.
> 
> Because they are primarily used from C,

The Unix API is primarily used from *everything which runs on Unix*. 

Particularly the kernel API (i.e. system calls), which you have no
choice but to use; if you want to open a file, it *will* go through
the open() system call at some point.

> which use the older paradigm
> of handling text: represent it in an unspecified external encoding
> rather than in Unicode.
> 
> OTOH newer Windows APIs use Unicode.
> 
> Haskell aims at being portable. It's easier to emulate the traditional
> C paradigm in the Unicode paradigm than vice versa,

I'm not entirely sure what you mean by that, but I think that I
disagree. The C/Unix approach is more general; it isn't tied to any
specific encoding.

> and Haskell already tries to specify that it uses Unicode
> internally.

I don't have any complaints about it using Unicode for characters. My
gripe is with the fact that it tries hard to coerce what is actually
arbitrary binary data into being characters.

> It's not that hard if you may sacrifice supporting every broken
> configuration. I did it myself, albeit without serious testing in real
> world situations and without trying to interface to too many libraries.

I take it that, by "broken", you mean any string of bytes (file,
string, network stream, etc) which neither explicitly specifies its
encoding(s) nor uses your locale's encoding?

That's a pretty big sacrifice.

> > My view is that, right now, we have the worst of both worlds, and
> > taking a short step backwards (i.e. narrow the Char type and leave the
> > rest alone) is a lot simpler (and more feasible) than the long journey
> > towards real I18N.
> 
> It would bury any hope in supporting a UTF-8 environment.
> 
> I've heard that RedHat tried to impose UTF-8 by default. It was mostly
> a failure because it's too early, too many programs are not ready for
> it. I guess the RedHat move helped to identify some of them. But UTF-8
> will inevitably be usable in future.

If they tried a decade hence, it would still be too early. The
single-byte encodings (ISO-8859-*, KOI8, Windows-125x) aren't likely
to disappear any time soon, nor is ISO-2022 (UTF-8 has quite
spectacularly failed to make inroads in CJK-land; there are probably
more UTF-8 users in the US than in all of CJK-land).

> It would be great if Haskell programs were in the group which can
> support it instead of being forced to be abandoned because of lack
> of Unicode support in the language they are written in.

Haskell should be able to support it, but it shouldn't refuse to
support anything else, it shouldn't make you jump through hoops to
write usable programs, and we shouldn't have to wait until all of the
encoding issues have been sorted out to do things which don't even
deal with encodings.

Look, C has all of the functionality that we're talking about: wide
characters, wide-character versions of string.h and ctype.h (wchar.h
and wctype.h), and conversion between byte streams and wide
characters.

But it did so without getting in the way of writing programs which
don't care about encodings, without consigning everything which has
gone before to the scrap heap, and without everyone having to wait a
couple of decades to (reliably) do simple things like copying a file
to a socket or enumerating a directory.

-- 
Glynn Clements <glynn.clements at virgin.net>

