Raw filenames vs locales

Daan Leijen daan at cs.uu.nl
Sat Jul 30 14:57:48 EDT 2005


Hi all,

Just to clarify: filenames can be written (by different users) in different locales. 
Therefore, one should treat filenames as abstract entities (sequences of bytes), 
since one can't sensibly convert a filename to a string if the locale in which it 
was created is unknown.

If the above is true, we should just treat file names as an abstract data type 
(FilePath) with a set of operations to break them down into smaller pieces (directory, 
extension, etc.), to append them again, and to compare them. FilePaths can be created 
from strings, and even be shown. But showing a FilePath and creating it again from 
the result would not be an identity (i.e. makeFilePath . show /= id).
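
To make this concrete, here is a minimal sketch of what such an abstract type
could look like. Everything in it is an assumption for illustration only: the
names RawFilePath, fromBytes, toBytes, splitFileName and extension are made up,
makeFilePath simply assumes UTF-8 when turning a string into bytes, '/' is
assumed to be the separator, and show replaces every byte outside printable
ASCII with U+FFFD. It is not an existing library API.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | A file name kept as the raw bytes handed over by the file system.
-- The derived Eq and Ord compare names byte by byte, with no locale involved.
newtype RawFilePath = RawFilePath [Word8]
  deriving (Eq, Ord)

-- | Wrap and unwrap the bytes exactly as the operating system sees them.
fromBytes :: [Word8] -> RawFilePath
fromBytes = RawFilePath

toBytes :: RawFilePath -> [Word8]
toBytes (RawFilePath bs) = bs

-- | Create a path from a string. ASSUMPTION: the string is encoded as UTF-8.
makeFilePath :: String -> RawFilePath
makeFilePath = RawFilePath . concatMap encodeChar

-- | Hand-written UTF-8 encoder for a single character.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 + hi 6, cont 0]
  | n < 0x10000 = [0xE0 + hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 + hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 + fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Lossy display: printable ASCII is shown as is, any other byte becomes
-- U+FFFD. This is why makeFilePath . show is not the identity.
instance Show RawFilePath where
  show (RawFilePath bs) = map toChar bs
    where
      toChar b
        | b >= 0x20 && b < 0x7F = chr (fromIntegral b)
        | otherwise             = '\xFFFD'

sep, dot :: Word8
sep = 0x2F  -- '/'
dot = 0x2E  -- '.'

-- | Append two paths, adding a separator only when one is needed.
(</>) :: RawFilePath -> RawFilePath -> RawFilePath
RawFilePath a </> RawFilePath b
  | null a        = RawFilePath b
  | last a == sep = RawFilePath (a ++ b)
  | otherwise     = RawFilePath (a ++ sep : b)

-- | Split at the last separator. The directory part keeps its trailing
-- separator, so joining both parts with (</>) gives back the original path.
splitFileName :: RawFilePath -> (RawFilePath, RawFilePath)
splitFileName (RawFilePath bs) =
  (RawFilePath (reverse revDir), RawFilePath (reverse revFile))
  where
    (revFile, revDir) = break (== sep) (reverse bs)

-- | Bytes after the last dot of the file component, or empty if there is
-- no dot. Dot-files get no special treatment in this sketch.
extension :: RawFilePath -> RawFilePath
extension p
  | dot `elem` name = RawFilePath (reverse (takeWhile (/= dot) (reverse name)))
  | otherwise       = RawFilePath []
  where
    (_, RawFilePath name) = splitFileName p
```

With these definitions a name such as fromBytes [0x66, 0xE9, 0x2E, 0x74, 0x78, 0x74]
(written in a Latin-1 locale) can still be split, joined and compared, even
though makeFilePath (show p) yields a different path than p. Only names made
entirely of printable ASCII survive the trip through show unchanged.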

(Ian: I haven't studied your proposal in detail, but I can't immediately see why you 
propose a separate FilePath class.)

All the best,
  -- Daan.

David Roundy wrote:
> On Sat, Jul 30, 2005 at 06:13:21PM +0200, Udo Stenzel wrote:
> 
>>Ian Lynagh wrote:
>>
>>>With its closer adherence to the Haskell 98 report, it is no longer
>>>possible with hugs to manipulate files using the standard IO functions
>>>if the filenames are not representable in your locale.
>>
>>Note that this basically means your filesystem is broken.  This
>>situation can only occur if a filesystem is written in one and then read
>>in another locale. [...]
> 
> 
> That is true, but on any multiuser system it's quite a reasonable scenario
> to have different users using different locales.  It's an embarrassing
> scenario that I can't write a tool in Haskell that recursively deletes a
> directory in which there are files that aren't representable in my current
> locale... or display the contents of such files, or anything else.
> 
> 
>>This "problem" cannot really be fixed, only worked around.
> 
> 
> On the contrary, the problem *can* be fixed, by only requiring that
> filenames be converted to unicode if necessary.  For many purposes (possibly
> even *most* purposes), knowledge of the character encoding is completely
> unnecessary.
> 
> More to the point, the "problem" is inherent in the language, not the
> filesystem--or perhaps you'd prefer to say that it's a problem with writing
> portable code.  The point is that it would seem best to present an API
> which makes it possible to write portable code.  On POSIX filesystems
> filenames are not sequences of unicode characters, and treating them as
> such causes trouble.
> 
> 
>>>UTF-8:       65533 = U+FFFD = "replacement character"
>>>
>>>=================
>>>Proposed solution
>>>=================
>>
>>I have a simpler proposal: allocate 128 "replacement characters" in the
>>"Vendor Zone" of Unicode.  Their purpose is as place holders for
>>incorrect UTF8.  Then use these replacement characters when decoding
>>UTF8 and reproduce the original, broken, code when re-encoding.  Under
>>ordinary circumstances these codes should never occur in strings.
> 
> 
> I guess you'd then want a couple of functions in the IO monad to convert
> between FilePath and CString (or something we could actually use)?
> 
> While your suggestion would solve the problem of being unable to access
> some files, it would also result in FilePaths themselves (without
> conversion routines) being useless for purposes other than actually
> accessing the same files.
> 
> 


