[Haskell-cafe] invalid character encoding

Wolfgang Thaller wolfgang.thaller at gmx.net
Sat Mar 19 18:56:17 EST 2005


>> Also, IIRC, Java strings are supposed to be unicode, too -
>> how do they deal with the problem?
>
> Files are represented by instances of the File class:
> [...]
> The documentation for the File class doesn't mention encoding issues
> at all.

... which led me to conclude that they don't deal with the problem 
properly.

>> I think that if we wait long enough, the filename encoding problems
>> will become irrelevant and we will live in an ideal world where
>> unicode actually works. Maybe next year, maybe only in ten years.
>
> Maybe not even then. If Unicode really solved encoding problems, you'd
> expect the CJK world to be the first adopters, but they're actually
> the least eager; you are more likely to find UTF-8 in an
> English-language HTML page or email message than a Japanese one.

Hmm, that's possibly because English-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing about files
or HTML pages here, I'm only concerned with filenames. I prefer Unicode
nowadays because I was born within a hundred kilometers of the "border"
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language
texts, but as soon as I write about where I went for vacation, I need a
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody
ever tried to sell ISO-2022 to me, so Unicode was the only alternative.

So you've now convinced me that there is a considerable number of
computers using ISO-2022, where there's more than one way to encode the
same text (how do people use this from the command line??). There are
also multi-user systems where the users don't agree on a single
encoding. I still reserve the right to call those systems messed-up,
but that's just my personal opinion and "reality" couldn't care less
about what I think.

So, as I don't want to stick with the status quo forever (lists of
bytes that pretend to be lists of Unicode chars, even on platforms
where Unicode is used anyway), how about we get to work - what do we
want?

I don't think we want a type class here, a plain (abstract) data type 
will do:

 > data File

Obviously, we'll need conversion from and to C strings. On Mac OS X, 
they'd be guaranteed to be in UTF-8.

 > withFilePathCString :: String -> (CString -> IO a) -> IO a
 > fileFromCString :: CString -> IO File
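
To make the C string conversions concrete, here is a minimal sketch, assuming File simply wraps the raw bytes exactly as the OS handed them over. The ByteString representation and the name withFileCString (a variant that takes a File rather than a String) are assumptions made for illustration, not part of the proposal.

```haskell
import qualified Data.ByteString as B
import Foreign.C.String (CString)

-- Hypothetical representation: the untouched, NUL-free bytes of the name.
newtype File = File B.ByteString

-- Copy the bytes out of a NUL-terminated C string; no decoding happens here.
fileFromCString :: CString -> IO File
fileFromCString cstr = File <$> B.packCString cstr

-- Hand the stored bytes back to C as a temporary NUL-terminated string.
-- (Hypothetical name; it takes a File, unlike the String-based signature above.)
withFileCString :: File -> (CString -> IO a) -> IO a
withFileCString (File bytes) = B.useAsCString bytes
```

Because nothing is decoded, a name that is not valid in any particular encoding still survives the round trip unchanged.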

We will need functions for converting to and from Unicode strings. I'm
pretty sure that we want to keep those functions pure, otherwise
they'll be very annoying to use.

 > fileFromPath :: String -> File

Any impure operations that might be needed to decide how to encode the 
file name will have to be delayed until the File is actually used.

 > fileToPath :: File -> String

Same here: any impure operation necessary to convert the File to a
Unicode string needs to be done when the File value is created.
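
One way to keep both conversions pure is to let the File value remember how it was made. The following is only a sketch under assumptions: the two constructors, the name withFileCString, and the use of GHC's file system encoding (getFileSystemEncoding together with GHC.Foreign) are illustrative choices, not something the proposal prescribes.

```haskell
import qualified Data.ByteString as B
import Foreign.C.String (CString)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- Hypothetical representation with two origins.
data File
  = FromText String                -- made from Unicode; encoding is deferred
  | FromBytes B.ByteString String  -- raw bytes plus text decoded at creation

-- Pure: nothing is encoded yet.
fileFromPath :: String -> File
fileFromPath = FromText

-- Pure: the text was either given or decoded when the File was created.
fileToPath :: File -> String
fileToPath (FromText s)    = s
fileToPath (FromBytes _ s) = s

-- Impure decoding happens here, at creation time.
fileFromCString :: CString -> IO File
fileFromCString cstr = do
  enc   <- getFileSystemEncoding
  bytes <- B.packCString cstr
  text  <- GHC.peekCString enc cstr
  return (FromBytes bytes text)

-- Impure encoding happens here, when the File is actually used.
withFileCString :: File -> (CString -> IO a) -> IO a
withFileCString (FromBytes bytes _) act = B.useAsCString bytes act
withFileCString (FromText s) act = do
  enc <- getFileSystemEncoding
  GHC.withCString enc s act
```

A File that came from the OS keeps its original bytes, so passing it back to the OS never depends on the decoded text being exact.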

What about failure? If you go from String to File, errors should be 
reported when you actually access the file. At an earlier time, you 
can't know whether the file name is valid (e.g. if you mount a 
"classic" HFS volume on Mac OS X, you can only create files there whose 
names can be represented in the volume's file name encoding - but you 
only find that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very 
annoyed if I had to deal with a Maybe String return type on platforms 
where it will always succeed. Maybe there should be separate functions 
for different purposes - i.e. for display, you'd use a File -> String 
function that will silently use '?'s when things can't be decoded, but 
in other situations you might use a File -> Maybe String function and 
check for Nothing.
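
The two flavours of decoding could look roughly like this. It is a sketch, assuming the stored bytes are meant to be UTF-8 and that File wraps a ByteString; the names fileToDisplayString and fileToPathMaybe are hypothetical, and the decoding relies on the text package's decodeUtf8With and decodeUtf8'.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)

-- Hypothetical representation: the raw bytes of the file name.
newtype File = File B.ByteString

-- For display: never fails, undecodable bytes become '?'.
fileToDisplayString :: File -> String
fileToDisplayString (File bytes) =
  T.unpack (decodeUtf8With (\_ _ -> Just '?') bytes)

-- For everything else: Nothing if the name is not valid UTF-8.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe (File bytes) =
  either (const Nothing) (Just . T.unpack) (decodeUtf8' bytes)
```

For the bytes 0x61 0xFF 0x62, the display function yields "a?b" while the strict one yields Nothing. On a platform where decoding always succeeds, only the first function would be needed.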

If people want to implement more sophisticated ways of decoding file
names than can be provided by the library, they can get the C string
and do the decoding themselves.

Of course, there should also be lots of other useful functions that 
make it more or less unnecessary to deal with path names directly in 
most cases.

Thoughts?

Cheers,

Wolfgang


