[PATCH] Better encoding/decoding for GHC
mark.lentczner at gmail.com
Wed May 18 23:54:37 CEST 2011
On Wed, May 18, 2011 at 2:28 AM, Max Bolingbroke <batterseapower at hotmail.com> wrote:
> > U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!"
> > We can (I'll be happy to) register this with the unofficial registry(2).
I've prepared a draft for the registry and submitted it... only to have it
pointed out to me that the registry has a region reserved (rather than
allocated) for precisely this use! (I missed it as the main pages only
discuss the allocated ranges, and don't mention the reserved ranges.)
The range is U+EF80 through U+EFFF, called "Reserved for encoding hacks".
See *Roadmap to the ConScript Unicode
*. John Cowan informs me that our use is precisely what this range has been
This range is only 128 code points, and they didn't anticipate needing to
deal with encoding issues with octets 0x00 through 0x7F. So long as we
restrict ourselves to ASCII-superset encodings, that assumption holds. If we want to
be more general, we could use U+EF00 through U+EFFF and lobby for reserving
the additional 128 points. I've already enquired about this possibility.
On a related note, if we want to be able to round-trip file names that
contain proper UTF-8 encoded characters from this range, we can: Treat the
byte sequences 0xEE 0xBE 0x80 through 0xEE 0xBF 0xBF as if they were
encoding errors, and replace such bytes with the encoding hack characters
for each octet. In this way, *all* octet sequences are round-trippable,
and all map to legal Unicode strings:
    41       -> U+0041                -- ASCII character
    CE B1    -> U+03B1                -- Greek character
    E0 A4 85 -> U+0905                -- Devanagari character
    C0       -> U+EFC0                -- illegal UTF-8 byte
    C2 20    -> U+EFC2 U+0020         -- malformed UTF-8 sequence
    C2 F0    -> U+EFC2 U+EFF0         -- malformed UTF-8 sequence
    EE BE 80 -> U+EFEE U+EFBE U+EF80  -- special handling of encoding hack character
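
To make the mapping above concrete, here is a rough sketch of the scheme. It is written in Python purely for brevity and is not the GHC patch itself. The names `HACK_BASE`, `decode_with_escapes` and `encode_with_escapes` are hypothetical, and it assumes UTF-8 as the (ASCII-superset) encoding and the U+EF80 through U+EFFF range discussed above:

```python
# Sketch only: hypothetical names, assumes UTF-8 and the U+EF80..U+EFFF range.
HACK_BASE = 0xEF00                   # escape for octet b is U+EF00 + b
HACK_LO, HACK_HI = 0xEF80, 0xEFFF    # the "encoding hacks" range


def _expected_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, or a byte that is never legal (C0, C1, F5..FF)


def decode_with_escapes(data: bytes) -> str:
    """Decode UTF-8, turning every undecodable octet into an escape character."""
    out = []
    i = 0
    while i < len(data):
        n = _expected_length(data[i])
        chunk = data[i:i + n]
        ch = None
        if n and len(chunk) == n:
            try:
                ch = chunk.decode("utf-8")  # strict: rejects overlongs, surrogates
            except UnicodeDecodeError:
                ch = None
        if ch is not None and not (HACK_LO <= ord(ch) <= HACK_HI):
            out.append(ch)
            i += n
        else:
            # Malformed input, or a well-formed encoding of an escape character
            # itself: escape this one octet and resynchronise at the next one.
            out.append(chr(HACK_BASE + data[i]))
            i += 1
    return "".join(out)


def encode_with_escapes(text: str) -> bytes:
    """Inverse of decode_with_escapes: escape characters become raw octets."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if HACK_LO <= cp <= HACK_HI:
            out.append(cp - HACK_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# Two rows from the table above, then the round trip back to octets.
assert decode_with_escapes(bytes.fromhex("C220")) == "\uEFC2\u0020"
assert decode_with_escapes(bytes.fromhex("EEBE80")) == "\uEFEE\uEFBE\uEF80"
assert encode_with_escapes(decode_with_escapes(bytes.fromhex("EEBE80"))) == bytes.fromhex("EEBE80")
```

Note that only the octets-to-string-to-octets direction is an identity. A string that already contains escape characters spelling out a valid UTF-8 sequence will not survive string-to-octets-to-string unchanged, which seems an acceptable trade for file names.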