FilePath as ADT

Ben Rudiak-Gould Benjamin.Rudiak-Gould at cl.cam.ac.uk
Sun Feb 5 23:02:03 EST 2006


Marcin 'Qrczak' Kowalczyk wrote:
> Encouraged by Mono, for my language Kogut I adopted a hack that
> Unicode people hate: the possibility to use a modified UTF-8 variant
> where byte sequences which are illegal in UTF-8 are decoded into
> U+0000 followed by another character.

I don't like the idea of using U+0000, because it looks like an ASCII 
control character, and in any case has a long tradition of being used for 
something else. Why not use a code point that can't result from decoding a 
valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8, for example, and 
I don't think it's legal UTF-16 either. This would give you round-tripping 
for all legal UTF-8 and UTF-16 strings.
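[Editor's sketch, not part of the original post: the "EF BF BF" claim above is just the standard three-byte UTF-8 arithmetic applied to U+FFFF. A minimal encoder for the three-byte range, with the helper name `utf8_3byte` being mine:]

```python
def utf8_3byte(cp):
    """Encode a BMP code point in U+0800..U+FFFF as three UTF-8 bytes.

    Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx, high bits first.
    """
    assert 0x800 <= cp <= 0xFFFF
    return bytes([0xE0 | (cp >> 12),           # leading byte: top 4 bits
                  0x80 | ((cp >> 6) & 0x3F),   # continuation: middle 6 bits
                  0x80 | (cp & 0x3F)])         # continuation: low 6 bits

# utf8_3byte(0xFFFF) gives the EF BF BF byte sequence mentioned above.
```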

Or you could use values from U+DC00 to U+DFFF, which definitely aren't legal 
UTF-8 or UTF-16. There's plenty of room there to encode each invalid UTF-8 
byte in a single word, instead of a sequence of two words.
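[Editor's sketch, not part of the original post: the byte-per-word scheme described above, with hypothetical names `escape_byte`/`unescape_byte`. Python 3 later standardized essentially this mapping as the `surrogateescape` error handler (PEP 383), using U+DC80..U+DCFF since only bytes >= 0x80 can be undecodable:]

```python
def escape_byte(b):
    """Map an undecodable byte (0..255) to a single word in U+DC00..U+DCFF."""
    assert 0 <= b <= 0xFF
    return chr(0xDC00 + b)

def unescape_byte(c):
    """Recover the original byte from an escape word, or None otherwise."""
    n = ord(c)
    return n - 0xDC00 if 0xDC00 <= n <= 0xDCFF else None
```

Because lone surrogates can never appear in decoded UTF-8 or well-formed UTF-16, escaped bytes are unambiguously distinguishable from real characters, so any byte string round-trips.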

A much cleaner solution would be to reserve part of the private use area, 
say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF). There's a 
pretty good chance you won't collide with anyone. It's too bad Unicode 
hasn't set aside 128 code points for this purpose. Maybe we should grab some 
unassigned code points, document them, and hope it catches on.
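[Editor's sketch, not part of the original post: the surrogate-pair values quoted above (DBE5 DF80 through DBE5 DFFF) follow from the standard UTF-16 encoding of supplementary code points; `to_utf16_pair` is my name for it:]

```python
def to_utf16_pair(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    n = cp - 0x10000                      # 20-bit value
    return (0xD800 + (n >> 10),           # high surrogate: top 10 bits
            0xDC00 + (n & 0x3FF))         # low surrogate: bottom 10 bits

# to_utf16_pair(0x109780) yields (0xDBE5, 0xDF80), matching the text.
```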

There's a lot to be said for any encoding, however nasty, that at least 
takes ASCII to ASCII. Often people just want to inspect the ASCII portions 
of a string while leaving the rest untouched (e.g. when parsing 
"--output-file=¡£ª±ïñ¹!.txt"), and any encoding that permits this is good 
enough.
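[Editor's sketch, not part of the original post: the point above rests on the fact that in UTF-8, and in any ASCII-compatible encoding, ASCII bytes never occur inside a multibyte sequence, so byte-level parsing of the ASCII parts is safe. A hypothetical `split_option` helper illustrating this:]

```python
def split_option(arg):
    """Split b'--output-file=NAME' at the first '=' without decoding NAME.

    Safe for any ASCII-compatible encoding: the byte 0x3D ('=') can only
    ever mean '=', so the filename bytes pass through untouched.
    """
    key, sep, value = arg.partition(b'=')
    return (key, value) if sep else (arg, b'')
```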

> Alternatives were:
> 
> * Use byte strings and character strings in different places,
>   sometimes using a different type depending on the OS (Windows
>   filenames would be character strings).
> 
> * Fail when encountering byte strings which can't be decoded.

Another alternative is to simulate the existence of a UTF-8 locale on Win32. 
Represent filenames as byte strings on both platforms; on NT convert between 
UTF-8 and UTF-16 when interfacing with the outside; on 9x either use the 
ANSI/OEM encoding internally or convert between UTF-8 and the ANSI/OEM 
encoding. I suppose NT probably doesn't check that the filenames you pass to 
the kernel are valid UTF-16, so there's some possibility that files with 
illegal names might be accessible to other applications but not to Haskell 
applications. But I imagine such files are much rarer than Unix filenames 
that aren't legal in the current locale. And if such files do turn up, you 
could still fall back on the private-encoding trick.
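[Editor's sketch, not part of the original post: the ill-formed names in question are UTF-16 unit sequences with unpaired surrogates. A hypothetical `valid_utf16` check for them:]

```python
def valid_utf16(units):
    """Check that a sequence of 16-bit code units pairs its surrogates."""
    it = iter(units)
    for w in it:
        if 0xD800 <= w <= 0xDBFF:                 # high surrogate...
            w2 = next(it, None)
            if w2 is None or not (0xDC00 <= w2 <= 0xDFFF):
                return False                      # ...needs a low one after it
        elif 0xDC00 <= w <= 0xDFFF:
            return False                          # stray low surrogate
    return True
```

A filename failing this check is exactly the case where the private-encoding trick would be needed on NT.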

-- Ben



More information about the Haskell-prime mailing list