[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

shelarcy shelarcy at gmail.com
Tue Feb 6 01:16:17 EST 2007


On Tue, 06 Feb 2007 00:25:45 +0900, Chris Kuklewicz <haskell at list.mightyreason.com> wrote:
>> UTF-8 also uses 4 to 6 byte encodings now.
>> CJK Unified Ideographs Extension B, Tai Xuan Jing Symbol and Music Symbol,
>> etc ... use 4 byte encoding.
>
> Looking at several sources, it seems you are incorrect.
>
> Haskell Chars go up to Unicode 1114111 (decimal) or 0x10ffff (hexadecimal).
> These are encoded by UTF-8 in 1,2,3,or 4 bytes.

I see. I confused Unicode support with charset support.
Sorry about that.

UCS-4 can represent code points greater than 1114111.
So if we wanted to support the full UCS-4 range, we would have to
support the 5 and 6 byte encodings that RFC 2279 described.

http://www.rfc-editor.org/rfc/rfc2279.txt

But unfortunately UTF-16 can only represent code points up to
1114111 (0x10FFFF), and the Unicode Consortium adheres to UTF-16.
So the 5 and 6 byte encodings, and any 4 byte encoding of a code
point above 1114111, are invalid now.
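
The UTF-16 limit comes from surrogate pair arithmetic: 1024 high
surrogates times 1024 low surrogates gives 0x100000 code points on top
of the first 0x10000. A rough sketch of that arithmetic follows; the
names toSurrogates and maxUtf16 are made up for illustration and do not
come from any library.

```haskell
-- Illustrative sketch only; these names are hypothetical.

-- | Split a supplementary code point (U+10000..U+10FFFF) into a
-- UTF-16 surrogate pair (high, low). Anything else has no pair.
toSurrogates :: Int -> Maybe (Int, Int)
toSurrogates cp
  | cp < 0x10000 || cp > 0x10FFFF = Nothing
  | otherwise = Just (0xD800 + v `div` 0x400, 0xDC00 + v `mod` 0x400)
  where
    v = cp - 0x10000  -- 20-bit offset into the supplementary planes

-- | Largest code point reachable with one surrogate pair:
-- 0x10000 + 1024 * 1024 - 1.
maxUtf16 :: Int
maxUtf16 = 0x10000 + 0x400 * 0x400 - 1
```

Here maxUtf16 works out to 0x10FFFF, which is the same 1114111 as above.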

http://www.rfc-editor.org/rfc/rfc3629.txt
(RFC3629 says "This memo obsoletes and replaces RFC 2279.")
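
To make the difference between the two RFCs concrete, here is a small
sketch of how many bytes each one allows for a given code point. The
function names are hypothetical, and Int is assumed to be at least 32
bits wide (as in GHC).

```haskell
-- Illustrative sketch only; these names are hypothetical.

-- | Encoded length under RFC 2279, which covered the 31-bit UCS-4
-- range with up to 6 bytes.
utf8LenRFC2279 :: Int -> Maybe Int
utf8LenRFC2279 cp
  | cp < 0           = Nothing
  | cp <= 0x7F       = Just 1
  | cp <= 0x7FF      = Just 2
  | cp <= 0xFFFF     = Just 3
  | cp <= 0x1FFFFF   = Just 4
  | cp <= 0x3FFFFFF  = Just 5
  | cp <= 0x7FFFFFFF = Just 6
  | otherwise        = Nothing

-- | Encoded length under RFC 3629, which stops at U+10FFFF and
-- rejects the surrogate range U+D800..U+DFFF.
utf8LenRFC3629 :: Int -> Maybe Int
utf8LenRFC3629 cp
  | cp < 0 || cp > 0x10FFFF      = Nothing
  | cp >= 0xD800 && cp <= 0xDFFF = Nothing
  | cp <= 0x7F                   = Just 1
  | cp <= 0x7FF                  = Just 2
  | cp <= 0xFFFF                 = Just 3
  | otherwise                    = Just 4
```

For example, U+20000 (the start of CJK Unified Ideographs Extension B)
takes 4 bytes under both, while 0x110000 takes 4 bytes under RFC 2279
but is rejected under RFC 3629.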

And Haskell implementations use only the valid range. I forgot
about that.
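
This is easy to check from the Char type itself. A minimal sketch,
where maxCharCode and inCharRange are hypothetical helper names and
only ord and maxBound come from the standard libraries:

```haskell
import Data.Char (ord)

-- | Largest code point a Haskell Char can hold.
maxCharCode :: Int
maxCharCode = ord (maxBound :: Char)

-- | True when the number is a code point in Char's range, so that
-- applying Data.Char.chr to it would not fail.
inCharRange :: Int -> Bool
inCharRange n = n >= 0 && n <= maxCharCode
```

In GHCi, maxCharCode evaluates to 1114111, matching the limit quoted
above.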


I'm afraid that this fantasy will be broken again, just like the
idea that UCS-2 without surrogate pairs could cover every language,
which people in Europe and America once trusted.

-- 
shelarcy <shelarcy    capella.freemail.ne.jp>
http://page.freett.com/shelarcy/

