[Haskell-cafe] Writing binary files?

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Wed Sep 15 15:24:09 EDT 2004


Glynn Clements <glynn.clements at virgin.net> writes:

> But this seems to be assuming a closed world. I.e. the only files
> which the program will ever see are those which were created by you,
> or by others who are compatible with your conventions.

Yes, unless you set the default encoding to Latin1.

>> Some programs use UTF-8 in filenames no matter what the locale is. For
>> example the Evolution mail program which stores mail folders as files
>> under names the user entered in a GUI.
>
> This is entirely reasonable for a file which a program creates. If a
> filename is just a string of bytes, a program can use whatever
> encoding it wants.

But then they display incorrectly in any other program.

> If it had just treated them as bytes, rather than trying to interpret
> them as characters, there wouldn't have been any problems.

I suspect it treats some characters in these synthesized newsgroup
names, such as dots, specially, so it wouldn't work unless it were
designed differently.

>> When I switch my environment to UTF-8, which may happen in a few
>> years, I will convert filenames to UTF-8 and set up mount options to
>> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
>
> But what about files which were created by other people, who
> don't use UTF-8?

All people sharing a filesystem should use the same encoding.

BTW, when ftping files between Windows and Unix, a good ftp client
should convert filenames so as to preserve the same characters rather
than the same bytes, so that CP-1250 encoded names don't come out as
garbage in the encoding used on Unix, which is definitely different
(ISO-8859-2 or UTF-8), and vice versa.

>> I expect good programs to understand that and display them
>> correctly no matter what technique they are using for the display.
>
> When it comes to display, you have to deal with encoding
> issues one way or another. But not all programs deal with display.

So you advocate using multiple encodings internally. This is in
general more complicated than what I advocate: using only Unicode
internally and limiting other encodings to the I/O boundary.
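
To illustrate, a rough sketch of that boundary (decodeLocale and
encodeLocale are hypothetical helpers standing in for whatever
conversion the I/O layer does, e.g. via iconv; here they just pretend
the locale is Latin1):

    import Data.Char (toUpper)
    import Data.Word (Word8)

    -- Hypothetical boundary converters: bytes from the OS <-> Unicode text.
    -- (Placeholder implementation: treat the locale as Latin1.)
    decodeLocale :: [Word8] -> String
    decodeLocale = map (toEnum . fromIntegral)

    encodeLocale :: String -> [Word8]
    encodeLocale = map (fromIntegral . fromEnum)

    -- All program logic works on Unicode Strings; byte encodings
    -- appear only at this boundary.
    upcaseName :: [Word8] -> [Word8]
    upcaseName = encodeLocale . map toUpper . decodeLocale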

> Assuming that everything is UTF-8 allows a lot of potential problems
> to be ignored.

I don't assume UTF-8 when the locale doesn't say so.

> The core OS and network server applications essentially remain
> encoding-agnostic.

Which is a problem when they generate an email, e.g. to send the
non-empty output of a cron job, or to report unauthorized use of sudo.
If the data involved is not pure ASCII, it will often be mangled.

It's rarely a problem in practice because filenames, command
arguments, error messages, user full names etc. are usually pure
ASCII. But this is slowly changing.

> But, as I keep pointing out, filenames are byte strings, not
> character strings. You shouldn't be converting them to character
> strings unless you have to.

Processing data in their original byte encodings makes supporting
multiple languages harder. Filenames which are inexpressible as
character strings get in the way of clean APIs. When considering only
filenames, using bytes would be sufficient, but overall it's more
convenient to Unicodize them like other strings.

> 1. Actually, each user decides which locale they wish to use. Nothing
> forces two users of a system to use the same locale.

Locales may be different, but they should use the same encoding when
they share files. This applies to file contents too - various formats
don't have a fixed encoding and don't specify the encoding explicitly,
so these files are assumed to be in the locale encoding.

> 2. Even if the locale was constant for all users on a system, there's
> still the (not exactly minor) issue of networking.

Depends on the networking protocols. They might insist that filenames
be represented in UTF-8, for example.

>> > Or that every program should pass everything through iconv()
>> > (and handle the failures)?
>> 
>> If it uses Unicode as internal string representation, yes (because the
>> OS API on Unix generally uses byte encodings rather than Unicode).
>
> The problem with that is that you need to *know* the source and
> destination encodings. The program gets to choose one of them, but it
> may not even know the other one.

If it can't know the encoding, it should process the data as a
sequence of bytes and output it only to another channel which accepts
raw bytes.

But usually it's either known or can be assumed to be the locale
encoding.
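
For example, a pure bytes-in, bytes-out program could look like this
(a sketch; openBinaryFile keeps the handle free of any character or
newline interpretation):

    import System.IO

    -- Copy a file whose encoding we cannot know: every byte passes
    -- through untouched.
    copyBytes :: FilePath -> FilePath -> IO ()
    copyBytes src dst = do
        hIn  <- openBinaryFile src ReadMode
        hOut <- openBinaryFile dst WriteMode
        hGetContents hIn >>= hPutStr hOut
        hClose hOut
        hClose hIn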

> The term "mismatch" implies that there have to be at least two things.
> If they don't match, which one is at fault? If I make a tar file
> available for you to download, and it contains non-UTF-8 filenames, is
> that my fault or yours?

Such tarballs are not portable across systems using different encodings.

If I tar a subdirectory stored on an ext2 partition, and you untar it
on a vfat partition, whose fault is it that files which differ only in
case are conflated?

> In any case, if a program refuses to deal with a file because it
> cannot convert the filename to characters, even when it doesn't have
> to, it's the program which is at fault.

Only if it's a low-level utility, to be used in an unfriendly
environment.

A Haskell program in my world can do that too. Just set the encoding
to Latin1.
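
Something like this sketch, assuming a settable default encoding along
the lines of GHC's GHC.IO.Encoding interface (the exact API is not the
point):

    import GHC.IO.Encoding (latin1, setFileSystemEncoding, setLocaleEncoding)

    main :: IO ()
    main = do
        -- With Latin1 every byte sequence decodes to *some* string,
        -- so no filename or file content is ever rejected.
        setLocaleEncoding     latin1
        setFileSystemEncoding latin1
        -- ... the rest of the program handles Strings as usual ...
        return ()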

> My specific point is that the Haskell98 API has a very big problem due
> to the assumption that the encoding is always known. Existing
> implementations work around the problem by assuming that the encoding
> is always ISO-8859-1.

The API is incomplete and needs to be enhanced. Programs written using
the current API will be limited to using the locale encoding.

Just as readFile is limited to text files because of line endings.
What do you prefer: to provide a non-Haskell98 API for binary files,
or to "fix" the current API by forcing programs to use "\r\n" on
Windows and "\n" on Unix manually?
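
For comparison, a sketch of the non-Haskell98 route for binary data,
using openBinaryFile, which disables the "\r\n" translation that
text-mode I/O performs on Windows:

    import System.IO

    -- Read a file as raw bytes: one Char per byte, no "\r\n" -> "\n"
    -- translation.
    readBinary :: FilePath -> IO String
    readBinary path = do
        h <- openBinaryFile path ReadMode
        hGetContents h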

>> If filenames were expressed as bytes in the Haskell program, how would
>> you map them to WinAPI? If you use the current Windows code page, the
>> set of valid characters is limited without a good reason.
>
> Windows filenames are arguably characters rather than bytes. However,
> if you want to present a common API, you can just use a fixed encoding
> on Windows (either UTF-8 or UTF-16).

This encoding would be incompatible with most other text seen by the
program. In particular, reading a filename from a file would not work
without manual recoding.

>> Which is a pity. ISO-2022 is brain-damaged because of enormous
>> complexity,
>
> Or, depending upon one's perspective, Unicode is brain-damaged because,
> for the sake of simplicity, it over-simplifies the situation. The
> over-simplification is one reason for its lack of adoption in the CJK
> world.

It's necessary to simplify things in order to make them usable by
ordinary programs. People reject overly complicated designs even if
they are in some respects more general.

ISO-2022 didn't catch on; about the only program I've seen which tries
to fully support it is Emacs.

> Multi-lingual text consists of distinct sections written in distinct
> languages with distinct "alphabets". It isn't actually one big chunk
> in a single global language with a single massive alphabet.

Multi-lingual text is almost context-insensitive. You can copy a part
of it into another text, even one written in another language, and it
will retain its alphabet - this is much harder with stateful ISO-2022.

ISO-2022 is wrong not by distinguishing alphabets but by being
stateful.

>> and ISO-8859-x have small repertoires.
>
> Which is one of the reasons why they are likely to persist for longer
> than UTF-8 "true believers" might like.

My I/O design doesn't force UTF-8; it works with ISO-8859-x as well.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

