[PATCH] Better encoding/decoding for GHC
marlowsd at gmail.com
Wed Apr 20 11:59:03 CEST 2011
On 18/04/2011 21:46, Mark Lentczner wrote:
> > (A minor point: I think your definition D10, rather than D76,
> > is closest to what GHC implements as Char, since you can for
> > example evaluate (length "\xD800") with no complaints.)
> Yikes - I thought earlier versions of GHC wouldn't evaluate "\xD800". So
> you are right - GHC seems to be D10, but yes, I do believe it would be
> best if Haskell (and GHC) defined Char in terms of D76.
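
The D10/D76 distinction above can be illustrated with a small sketch. D10 is any code point; D76 (Unicode scalar value) excludes the surrogate range U+D800..U+DFFF. The names isScalarValue and loneSurrogate are hypothetical helpers for illustration, not part of base or of the patch under discussion.

```haskell
import Data.Char (ord)

-- D76: a Unicode scalar value is any code point outside the
-- surrogate range U+D800..U+DFFF. (Hypothetical helper.)
isScalarValue :: Char -> Bool
isScalarValue c = n < 0xD800 || n > 0xDFFF
  where
    n = ord c

-- GHC's Char accepts any code point (D10), so a lone surrogate is
-- allowed in a string literal and counts as a single Char.
loneSurrogate :: String
loneSurrogate = "\xD800"
```

Under a D76-based Char, a value like the one in loneSurrogate would have to be rejected or replaced; under D10 it is an ordinary one-element String.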
> > So to summarise, your proposal is to:
> I want to make sure that all agree on the "stance" the code should take:
> 1. The system infers, to the best it can, the encoding used for file
> paths. This encoding might be wrong, though on modern systems, if
> it is inferred as a Unicode encoding, it is almost certainly
> right. Nonetheless, there is no guarantee that file paths are
> valid encodings.
> 2. The system presents to user code file paths that were valid
> encodings as valid Strings, and user code can present such Strings
> back with perfect round-trip fidelity.
> 3. The system presents to user code file paths that are not valid
> encodings as valid Strings, by mapping the invalid encodings onto
> the private use area U+F700 to U+F7FF. These will of course be
> indistinguishable from valid file paths that contained such
> characters (only possible if the encoding is a Unicode encoding),
> and thus are not round-trippable.
> 4. If user code presents file paths as Strings that do not encode
> into the inferred encoding, an exception is thrown. This includes
> when the inferred encoding cannot encode the private use area.
> When the inferred encoding is a Unicode encoding (UTF-*), the
> private use characters will be encoded normally (and thus
> differently from the original bytes, if they were generated
> from an illegally encoded file path).
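
Points 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: one undecodable byte maps to one character at U+F700 plus the byte value (the range is taken from point 3), and the names escapeBase, escapeByte, unescapeChar and utf8Encode are hypothetical, not the functions in the patch. The UTF-8 encoder is a minimal hand-written one and does not reject surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Start of the private use range named in point 3 (assumed layout:
-- byte b maps to U+F700 + b).
escapeBase :: Int
escapeBase = 0xF700

-- Point 3: present an undecodable byte as a private use character.
escapeByte :: Word8 -> Char
escapeByte b = chr (escapeBase + fromIntegral b)

-- The inverse mapping. Point 4 says this is NOT applied when the
-- inferred encoding is a Unicode encoding.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= escapeBase && n <= escapeBase + 0xFF = Just (fromIntegral (n - escapeBase))
  | otherwise = Nothing
  where
    n = ord c

-- Minimal UTF-8 encoder for a single Char (no surrogate check).
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    top s  = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

With these definitions, an invalid byte 0xFF is presented as U+F7FF, and encoding that character "normally" as UTF-8 yields the three bytes EF 9F BF rather than the original FF, which is the loss of round-tripping described in point 4.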
So that means filenames that are not legal in the current encoding won't
round-trip? But wasn't that the problem that Max was originally trying
> The crux of the issue is the handling in #4. If we believe our inferred
> encoding is generally right, and that invalid encodings are rare to
> non-existent (and perhaps indicative of bigger problems on the whole),
> then the behaviour stated above is the way to go.
> > > Lastly, I'm curious how the proposed code infers the encoding
> > > from the locale.
> > This code already exists in GHC. The behaviour at the moment
> > is platform dependent and as follows:
> Thanks for those details! It looks good to me.
> - Mark