<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Nov 19, 2014 at 7:56 AM, Donn Cave <span dir="ltr">&lt;<a href="mailto:donn@avvanta.com" target="_blank">donn@avvanta.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">quoth Donn Cave &lt;<a href="mailto:donn@avvanta.com">donn@avvanta.com</a>&gt;<br>

...<br>

<span class="">> Umlaut u turns up as 0xFC for UTF-8 users;  0xDCFC, for Latin-1 users.<br>

> This is an ordinary hello world type program, can't think of any<br>

> unique environmental issues.<br>

<br>

</span>Well, I mischaracterized that problem, so to speak.<br>

<br>

I find that GHC is not picking up on my "current locale" encoding,<br>

and instead seems to be hard-wired to UTF-8.  On MacOS X, I can<br>

select an encoding in Terminal Preferences, open a new window, and<br>

for all intents and purposes it's an ISO8859-1 world, including<br>

LANG=en_US.ISO8859-1, but GHC isn't going along with it.<br>

<br>

So the ISO8859-1 umlaut u is undecodable if GHC is stuck in UTF-8,<br>

which seems to explain what I'm seeing.  If I understand this right,<br>

the 0xDC00 high byte is recognized in some circumstances, and the<br>

value is spared from UTF-8 encoding and instead simply copied.<br></blockquote><div><br></div><div>ISO8859 is not multibyte. And your earlier description is incorrect, in a way that shows a common confusion about the relationship between Unicode, UTF8, and ISO8859-1.</div><div><br></div><div>U+00FC is the Unicode codepoint for u-umlaut. This is, by design, the same value as the single byte for u-umlaut (0xFC) in ISO8859-1. It is *not* the UTF8 representation of u-umlaut; that is the two bytes 0xC3 0xBC.</div><div><br></div><div>The 0xDC prefix is, as I said earlier, a hack used by ghc. Internally it works only with Unicode codepoints; so a byte which cannot be decoded, but which it needs to roundtrip from its external representation (which per POSIX has no encoding / is an octet string) to its internal representation, is stored as the codepoint U+DC00 plus the byte value, so 0xFC becomes U+DCFC. That range is stolen: it is reserved for UTF-16 low surrogates, which are never valid characters on their own. On the way back out the prefix is stripped to recover the original byte. But this means that you will find yourself working with a "strange" Unicode codepoint.</div></div><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div>brandon s allbery kf8nh                               sine nomine associates</div><div><a href="mailto:allbery.b@gmail.com" target="_blank">allbery.b@gmail.com</a>                                  <a href="mailto:ballbery@sinenomine.net" target="_blank">ballbery@sinenomine.net</a></div><div>unix, openafs, kerberos, infrastructure, xmonad        <a href="http://sinenomine.net" target="_blank">http://sinenomine.net</a></div></div></div>

</div></div>