[PATCH] Better encoding/decoding for GHC

Simon Marlow marlowsd at gmail.com
Wed Apr 20 11:59:03 CEST 2011


On 18/04/2011 21:46, Mark Lentczner wrote:
>     (A minor point: I think your definition D10, rather than D76,
>     is closest to what GHC implements as Char, since you can for
>     example evaluate (length "\xD800") with no complaints)
>
> Yikes - I thought earlier versions of GHC wouldn't evaluate "\xD800". So
> you are right - GHC seems to be D10, but yes, I do believe it would be
> best if Haskell (and GHC) defined Char in terms of D76.
>
>     So to summarise, your proposal is to:
>
> I want to make sure that all agree on the "stance" the code should take:
>
>    1. The system infers, to the best it can, the encoding used for file
>       paths. This encoding might be wrong, though on modern systems, if
>       it is inferred as a Unicode encoding, it is almost certainly
>       right. Nonetheless, there is no guarantee that file paths are
>       valid encodings.
>    2. The system presents to user code file paths that were valid
>       encodings as valid Strings, and user code can present such Strings
>       back with perfect round-trip fidelity.
>    3. The system presents to user code file paths that are not valid
>       encodings as valid Strings, by mapping the invalid encodings onto
>       the private use area U+F700 to U+F7FF. These will of course be
>       indistinguishable from valid file paths that contained such
>       characters (only possible if the encoding is a Unicode encoding),
>       and thus are not round-trippable.
>    4. If user code presents file paths as Strings that do not encode
>       into the inferred encoding, an exception is thrown. This includes
>       when the inferred encoding cannot encode the private use area/.
>       /When the inferred encoding is a Unicode encoding (UTF-*), the
>       private use characters will be encoded normally (and thus
>       differently if they were generated due to an original illegally
>       encoded file path).

So that means filenames that are not legal in the current encoding won't 
round-trip?  But wasn't that the problem that Max was originally trying 
to solve?
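
To make the concern concrete, here is a minimal sketch of points 3 and 4 as
I read them, assuming the inferred encoding is UTF-8. The names decodePath
and encodePath are hypothetical and are not the code in the patch. The
decoder is deliberately a toy: it accepts ASCII only and treats every other
byte as invalid, which is enough to show what happens to a stray byte such
as 0xFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical toy decoder (assumption: ASCII bytes are the only valid
-- input). Any byte >= 0x80 is treated as invalid and escaped into the
-- private use range U+F700..U+F7FF, as in point 3. A real UTF-8 decoder
-- would also accept well-formed multi-byte sequences.
decodePath :: [Word8] -> String
decodePath = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = chr (0xF700 + fromIntegral b)

-- Plain UTF-8 encoder. Private use characters get no special treatment
-- and are encoded normally, as in point 4. Surrogates are not rejected
-- here, which a stricter encoder would do.
encodePath :: String -> [Word8]
encodePath = concatMap (enc . ord)
  where
    enc c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        -- Leading-byte payload: the bits above position n.
        hi n   = fromIntegral (c `shiftR` n)
        -- Continuation byte: six payload bits starting at position n.
        cont n = 0x80 .|. (fromIntegral (c `shiftR` n) .&. 0x3F)
```

With these definitions, the bytes [0x66, 0x6F, 0xFF] decode to "fo\xF7FF",
and encoding that String again gives [102,111,239,159,191], not the original
three bytes. A valid ASCII name such as [0x66, 0x6F] does come back
unchanged.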

Cheers,
	Simon



> The crux of the issue is the handling in #4. If we believe our inferred
> encoding is generally right, and that invalid encodings are rare to
> non-existent (and perhaps indicative of bigger problems on the whole) -
> then as stated above is the way to go.
>
>      > Lastly, I'm curious how the proposed code infers the encoding
>     from the locale.
>     This code already exists in GHC. The behaviour at the moment
>     is platform dependent and as follows:
>
> Thanks for those details! It looks good to me.
>
> - Mark
>
>
>



