[Haskell-cafe] ANNOUNCE: system-filepath 0.4.5 and system-fileio 0.3.4

John Millikin jmillikin at gmail.com
Mon Feb 6 04:17:32 CET 2012

On Sun, Feb 5, 2012 at 18:49, Joey Hess <joey at kitenet.net> wrote:
> John Millikin wrote:
>> In GHC 7.2 and later, file path handling in the platform libraries
>> was changed to treat all paths as text (encoded according to locale).
>> This does not work well on POSIX systems, because POSIX paths are byte
>> sequences. There is no guarantee that any particular path will be
>> valid in the user's locale encoding.
> I've been dealing with this change too, but my current understanding
> is that GHC's handling of encoding for FilePath is documented to allow
> "arbitrary undecodable bytes to be round-tripped through it".
> As long as FilePaths are read using this file system encoding, any
> FilePath should be usable even if it does not match the user's encoding.

That was my understanding also, but then QuickCheck found a
counter-example. It turns out that there are cases where a valid path
cannot be round-tripped through the GHC 7.2 encoding.

$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile ".txt" "test"
Prelude> readFile ".txt"

$ ~/ghc-7.2.1/bin/ghci
Prelude> import System.Directory
Prelude System.Directory> getDirectoryContents "."
Prelude System.Directory> readFile "\61347.txt"
*** Exception: .txt: openFile: does not exist (No such file or directory)
Prelude System.Directory>

The issue is that the byte sequence [238,189,178] decodes as UTF-8 to
0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to
represent undecodable bytes.
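
To make the arithmetic concrete, here is a minimal sketch that decodes
those three bytes by hand and checks the result against the range quoted
above. The names decode3 and inEscapeRange are made up for this
illustration; this is not GHC's actual codec, only the standard
three-byte UTF-8 bit layout.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)
import Numeric (showHex)

-- Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
-- No validation is performed; this is only meant for the example bytes.
decode3 :: Word8 -> Word8 -> Word8 -> Int
decode3 b1 b2 b3 =
      ((fromIntegral b1 .&. 0x0F) `shiftL` 12)
  .|. ((fromIntegral b2 .&. 0x3F) `shiftL` 6)
  .|.  (fromIntegral b3 .&. 0x3F)

-- The code point range described above as reserved for escaped bytes.
inEscapeRange :: Int -> Bool
inEscapeRange c = c >= 0xEF00 && c <= 0xEFFF

main :: IO ()
main = do
  let c = decode3 238 189 178
  putStrLn ("0x" ++ showHex c "")   -- the decoded code point, in hex
  print (inEscapeRange c)           -- whether it collides with the escape range
```

A perfectly valid file name can therefore contain a character that the
decoder cannot tell apart from one of its own escapes.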

> For FFI, anything that deals with a FilePath should use this
> withFilePath, which GHC contains but doesn't export(?), rather than the
> old withCString or withCAString:
> import GHC.IO.Encoding (getFileSystemEncoding)
> import Foreign.C.String (CString)
> import GHC.Foreign as GHC
> withFilePath :: FilePath -> (CString -> IO a) -> IO a
> withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f

If code uses either withFilePath or withCString, then the filenames
written will depend on the user's locale. This is wrong. Filenames are
either non-encoded text strings (Windows), UTF-8 (OSX), or arbitrary
bytes (non-OSX POSIX). They must not change depending on the locale.

> Code that reads or writes a FilePath to a Handle (including even to
> stdout!) must take care to set the right encoding too:
> fileEncoding :: Handle -> IO ()
> fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

This is also wrong. A "file path" cannot be written to a handle with
any hope of correct behavior. If it's to be displayed to the user, a
path should be converted to text first, then displayed.
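
As a rough sketch of "convert to text first", assuming the path is held
as raw bytes and that a best-effort UTF-8 reading is acceptable for
display. displayPath is a hypothetical helper written for this message,
not part of system-filepath; it uses only the bytestring and text
packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: turn raw POSIX path bytes into text for display.
-- Invalid UTF-8 is replaced with U+FFFD, so the conversion is lossy and
-- the result must never be used to reopen the file.
displayPath :: B.ByteString -> T.Text
displayPath = decodeUtf8With lenientDecode

main :: IO ()
main = do
  -- One byte that is not valid UTF-8, followed by ".txt".
  let raw = B.pack [0xFF, 0x2E, 0x74, 0x78, 0x74]
  -- 'print' shows non-ASCII characters escaped, so the output does not
  -- depend on the terminal's encoding.
  print (displayPath raw)
```

The original bytes stay the only thing handed to the file system; the
text is a one-way rendering for the user.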

>> * system-filepath has been converted from GHC's escaping rules to its
>> own, more compatible rules. This lets it support file paths that
>> cannot be represented in GHC 7.2's escape format.
> I'm doubtful about adding yet another encoding to the mix. Things are
> complicated enough already! And in my tests, GHC 7.4's FilePath encoding
> does allow arbitrary bytes in FilePaths.

Unlike the GHC encoding, this encoding is entirely internal, and
should not change the API's behavior.

> BTW, GHC now also has RawFilePath. Parts of System.Directory could be
> usefully written to support that data type too. For example, the parent
> directory can be determined. Other things are more difficult to do with
> RawFilePath.

This is new in 7.4, and won't be backported, right? I tried compiling
the new "unix" package in 7.2 to get proper file path support, but it
failed with an error about some new language extension.

