[PATCH] Better encoding/decoding for GHC
mark.lentczner at gmail.com
Sat May 7 18:38:48 CEST 2011
(Crud - Simon just pointed out that I accidentally sent my reply to just
him, not the list. D'oh! -- Sorry for the tardy reply-all, all!)
On Wed, Apr 20, 2011 at 2:59 AM, Simon Marlow <marlowsd at gmail.com> wrote:
> So that means filenames that are not legal in the current encoding won't
> round-trip? But wasn't that the problem that Max was originally trying to
I think the major issue was mapping file paths to strings, without requiring
every application to perform its own decoding/encoding.
That, in turn, brought up the issue of what to do about file path octet
sequences that don't match the expected (or any) encoding. (This issue existed
for every application before; they probably just ignored it!)
We have a choice. The current proposal maps the following two classes of
file paths onto the same string, and so when encoding back to the system we
must choose which it is, with the other class getting the short end of the
stick:
1. File paths that don't decode.
2. File paths with a small range of private use characters.
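To make the collision concrete, here is a minimal Python sketch (not GHC's actual code). It assumes UTF-8 as the system encoding and that an undecodable byte b is represented as U+F700 + b. The names `decode_path` and `pua_escape` are hypothetical, made up for this illustration.

```python
import codecs

ESCAPE_BASE = 0xF700  # assumption: undecodable byte b becomes U+F700 + b


def _escape_to_private_use(exc):
    """Error handler: turn each undecodable byte into a private-use character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), exc.end


# "pua_escape" is a made-up handler name for this sketch.
codecs.register_error("pua_escape", _escape_to_private_use)


def decode_path(raw: bytes) -> str:
    """Decode a file path as UTF-8, escaping bytes that do not decode."""
    return raw.decode("utf-8", "pua_escape")


# Class 1: a path that does not decode (a lone 0xE9 byte, as in Latin-1).
undecodable = b"caf\xe9"
# Class 2: a valid UTF-8 path that really contains the character U+F7E9.
private_use = "caf\uf7e9".encode("utf-8")

# Two different byte sequences on disk, one and the same string in the program.
collision = decode_path(undecodable) == decode_path(private_use)
```

Since both paths decode to the same string, the encoder has no way to tell which bytes to write back, which is the choice described above.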
If we encode in favor of file paths that don't decode (that is, encode
U+F700 ~ U+F7FF as the bytes 0x00 ~ 0xFF), then we incur a raft of security
issues, as various checks on the input ("there is no / in the file name",
for example) can be bypassed. The Python hack is to not encode 0x00 ~
0x7F. If the inferred encoding is one that has invalid encodings in this
range (for example EBCDIC, though these kinds of encodings are rare), then
this hack still results in some illegally encoded names failing to be
encodable back to the system.
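A rough Python sketch of the bypass, under the same assumptions as the proposal (escaped byte b is U+F700 + b, system encoding UTF-8). `encode_path`, `has_no_slash` and the `lowest_byte` parameter are hypothetical names for illustration only. Python's own mechanism (the `surrogateescape` error handler) uses U+DC80 ~ U+DCFF instead, but the idea of leaving 0x00 ~ 0x7F alone is the same.

```python
ESCAPE_BASE = 0xF700  # assumption: escaped byte b is represented as U+F700 + b


def encode_path(name: str, lowest_byte: int = 0x00) -> bytes:
    """Encode to UTF-8, turning U+F700+b back into the raw byte b.

    Only bytes in lowest_byte..0xFF are un-escaped; any other character,
    including the rest of the private-use range, is encoded as ordinary UTF-8.
    """
    out = bytearray()
    for ch in name:
        b = ord(ch) - ESCAPE_BASE
        if lowest_byte <= b <= 0xFF:
            out.append(b)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def has_no_slash(name: str) -> bool:
    """The kind of check an application might do on a file name."""
    return "/" not in name


# U+F72F is not '/', so the check passes...
sneaky = "etc" + chr(ESCAPE_BASE + 0x2F) + "passwd"
passes_check = has_no_slash(sneaky)

# ...but un-escaping the full range puts a real '/' in the system path.
full_range = encode_path(sneaky)            # b"etc/passwd"

# Leaving 0x00 ~ 0x7F alone (the Python-style hack) closes that hole.
high_only = encode_path(sneaky, 0x80)       # contains no 0x2F byte
```

With `lowest_byte=0x80`, a name like `"caf\uf7e9"` still encodes back to `b"caf\xe9"`, so undecodable high bytes keep round-tripping; only the low half of the range is given up.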
If we encode in favor of all valid encoded strings, then bad encodings fail.
However, I tried on Mac, and one can't actually create a file name with a
bad UTF-8 sequence! I bet the same is true for Windows.
I'd also be in favor of just presuming the encoding is UTF-8 (as it is on
Mac and modern Linux, and Windows doesn't matter as we get the paths in
Unicode anyway) rather than using the user's locale. In my tests, the file
path system calls do not respect the locale setting, and I don't think the
stdlib calls do either.
In the end, I don't think it matters much which way we go here. The private
use characters are highly unlikely to be in use. But then again, so are
non-UTF-8 file paths from the system. If we presume UTF-8 as the encoding,
then the security risk is lower, as we only have to worry about bytes 0x80 ~
0xff (assuming we detect decode failures of multi-byte UTF-8 sequences early).
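A quick Python check of that last claim, as a sketch only: under UTF-8, every byte in 0x00 ~ 0x7F decodes on its own, so only bytes with the high bit set can ever need escaping. The second half uses Python's built-in `surrogateescape` handler purely for illustration (it escapes to U+DC80 ~ U+DCFF, not the U+F700 range discussed here). `decodes_alone` is a hypothetical helper name.

```python
def decodes_alone(b: int) -> bool:
    """True if the single byte b is valid UTF-8 by itself."""
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every ASCII byte is valid on its own; no byte with the high bit set is.
ascii_ok = all(decodes_alone(b) for b in range(0x00, 0x80))
high_bad = not any(decodes_alone(b) for b in range(0x80, 0x100))

# A truncated multi-byte sequence followed by '/': only the stray lead
# byte 0xC3 is escaped, and the '/' stays an ordinary, visible '/'.
escaped = b"\xc3/".decode("utf-8", "surrogateescape")
```

So a decoder that flags the bad lead byte right away never has to escape anything below 0x80, and a '/' in the raw path always shows up as a '/' in the string.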