[Haskell] [Haskell-cafe] ANNOUNCE: system-filepath 0.4.5 and system-fileio 0.3.4

John Millikin jmillikin at gmail.com
Mon Feb 6 04:17:32 CET 2012

On Sun, Feb 5, 2012 at 18:49, Joey Hess <joey at kitenet.net> wrote:
> John Millikin wrote:
>> In GHC 7.2 and later, file path handling in the platform libraries
>> was changed to treat all paths as text (encoded according to locale).
>> This does not work well on POSIX systems, because POSIX paths are byte
>> sequences. There is no guarantee that any particular path will be
>> valid in the user's locale encoding.
> I've been dealing with this change too, but my current understanding
> is that GHC's handling of encoding for FilePath is documented to allow
> "arbitrary undecodable bytes to be round-tripped through it".
> As long as FilePaths are read using this file system encoding, any
> FilePath should be usable even if it does not match the user's encoding.

That was my understanding too, until QuickCheck found a
counter-example. It turns out that there are cases where a valid path
cannot be round-tripped through the GHC 7.2 encoding.

$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile ".txt" "test"
Prelude> readFile ".txt"

$ ~/ghc-7.2.1/bin/ghci
Prelude> import System.Directory
Prelude System.Directory> getDirectoryContents "."
Prelude System.Directory> readFile "\61347.txt"
*** Exception: .txt: openFile: does not exist (No such file or directory)
Prelude System.Directory>

The issue is that the byte sequence [238,189,178] decodes (as UTF-8)
to 0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to
represent undecodable bytes.
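
To make the arithmetic concrete, here is a minimal sketch that decodes
that three-byte sequence by hand and checks it against the escape range.
The names decode3 and inEscapeRange are made up for this illustration and
are not part of GHC or system-filepath; decode3 only handles the
three-byte UTF-8 form and does not reject overlong or surrogate encodings.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- Decode a single three-byte UTF-8 sequence (lead byte 0xE0-0xEF) to a
-- code point. Anything else yields Nothing; this is not a full decoder.
decode3 :: [Word8] -> Maybe Int
decode3 [a, b, c]
  | a .&. 0xF0 == 0xE0, isCont b, isCont c =
      Just (   (fromIntegral (a .&. 0x0F) `shiftL` 12)
           .|. (fromIntegral (b .&. 0x3F) `shiftL` 6)
           .|.  fromIntegral (c .&. 0x3F) )
  where
    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont x = x .&. 0xC0 == 0x80
decode3 _ = Nothing

-- The range that, per the message above, GHC 7.2 reserves for
-- representing undecodable bytes.
inEscapeRange :: Int -> Bool
inEscapeRange cp = cp >= 0xEF00 && cp <= 0xEFFF
```

In ghci, decode3 [238,189,178] evaluates to Just 61298, which is 0xEF72,
and inEscapeRange 0xEF72 is True.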

> For FFI, anything that deals with a FilePath should use this
> withFilePath, which GHC contains but doesn't export(?), rather than the
> old withCString or withCAString:
> import GHC.IO.Encoding (getFileSystemEncoding)
> import GHC.Foreign as GHC
> withFilePath :: FilePath -> (CString -> IO a) -> IO a
> withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f

If code uses either withFilePath or withCString, then the filenames
written will depend on the user's locale. This is wrong. Filenames are
either non-encoded text strings (Windows), UTF-8 (OSX), or arbitrary
bytes (non-OSX POSIX). They must not change depending on the locale.
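
A rough sketch of what "depends on the locale" means in practice: the
same String marshals to different bytes under different encodings. Here
two explicit encodings from GHC.IO.Encoding stand in for two users'
locales, and encodedBytes is a hypothetical helper written only for this
illustration.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (castPtr)
import GHC.IO.Encoding (TextEncoding, latin1, utf8)
import qualified GHC.Foreign as GHC

-- Marshal a String with an explicit encoding and return the bytes that
-- would be handed to a C function expecting a file name.
encodedBytes :: TextEncoding -> String -> IO [Word8]
encodedBytes enc s =
  GHC.withCStringLen enc s $ \(ptr, len) ->
    peekArray len (castPtr ptr)
```

With this, encodedBytes latin1 "caf\233" gives [99,97,102,233] while
encodedBytes utf8 "caf\233" gives [99,97,102,195,169], so the same
program would create two different files depending on which encoding
was in effect.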

> Code that reads or writes a FilePath to a Handle (including even to
> stdout!) must take care to set the right encoding too:
> fileEncoding :: Handle -> IO ()
> fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

This is also wrong. A "file path" cannot be written to a handle with
any hope of correct behavior. If it's to be displayed to the user, a
path should be converted to text first, then displayed.
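
As one possible illustration of "convert to text first", the sketch
below renders raw path bytes as display-only text by keeping printable
ASCII and escaping everything else. displayPath is a made-up name, and
this is an assumption about what a display conversion could look like,
not how system-filepath does it; the result is for showing to a user
and should not be passed back to the file system as a path.

```haskell
import Data.Char (chr)
import Data.Word (Word8)
import Numeric (showHex)

-- Render raw path bytes for display. Printable ASCII bytes are kept
-- as-is; every other byte, and the backslash itself, becomes a visible
-- \xNN escape so the output stays unambiguous.
displayPath :: [Word8] -> String
displayPath = concatMap render
  where
    render b
      | b >= 0x20 && b < 0x7F && b /= 0x5C = [chr (fromIntegral b)]
      | otherwise = "\\x" ++ pad (showHex b "")
    -- Pad single hex digits to two characters.
    pad [d] = ['0', d]
    pad ds  = ds
```

For the file name from the example above,
putStrLn (displayPath [238,189,178,46,116,120,116]) prints
\xee\xbd\xb2.txt regardless of the locale.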

>> * system-filepath has been converted from GHC's escaping rules to its
>> own, more compatible rules. This lets it support file paths that
>> cannot be represented in GHC 7.2's escape format.
> I'm doubtful about adding yet another encoding to the mix. Things are
> complicated enough already! And in my tests, GHC 7.4's FilePath encoding
> does allow arbitrary bytes in FilePaths.

Unlike the GHC encoding, this encoding is entirely internal, and
should not change the API's behavior.

> BTW, GHC now also has RawFilePath. Parts of System.Directory could be
> usefully written to support that data type too. For example, the parent
> directory can be determined. Other things are more difficult to do with
> RawFilePath.

This is new in 7.4, and won't be backported, right? I tried compiling
the new "unix" package in 7.2 to get proper file path support, but it
failed with an error about some new language extension.
