UTF-8 library

Ashley Yakeley ashley@semantic.org
Tue, 6 Aug 2002 16:53:42 -0700


At 2002-08-06 05:38, John Meacham wrote:

>One major nit I have with this is the type signature of 
>decodeUTF8 and encodeUTF8
>a String should always represent a string of characters, not a byte
>stream, the signatures should be
>
>decodeUTF8 :: String -> [Word8]
>encodeUTF8 :: [Word8] -> String

I think you mean

  encodeUTF8 :: String -> [Word8]
  decodeUTF8 :: [Word8] -> String

...or even

  decodeUTF8 :: [Word8] -> Maybe String

It might also be useful to have stream functions. Decoding UTF-8 octets is 
a kind of parsing, after all.
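
To make the signatures above concrete, here is a minimal sketch of both directions. It is not the code of the library under discussion; the helper names (encodeChar, cont, multi, isCont) are invented for illustration, and the decoder assumes that overlong forms, surrogates and truncated sequences should all yield Nothing.

```haskell
import Control.Monad (guard)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode each code point as one to four octets.
-- Assumption: surrogate code points are not rejected here; a stricter
-- encoder might refuse them.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation octet: 10xxxxxx carrying the low six bits of x.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- Decode octets, returning Nothing on any malformed sequence.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise          = Nothing
  where
    -- n continuation octets, payload bits of the lead octet, and the
    -- smallest code point that legitimately needs this length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCp = do
      let (cs, rest) = splitAt n bs
      guard (length cs == n && all isCont cs)
      let cp = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F))
                     lead cs
      -- Reject overlong forms, out-of-range values and surrogates.
      guard (cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF))
      (chr cp :) <$> decodeUTF8 rest
    isCont :: Word8 -> Bool
    isCont x = x .&. 0xC0 == 0x80
```

Note that the Maybe version cannot return any characters until it has seen the whole input, since a single bad octet at the end turns the result into Nothing. That is one reason a stream-style decoder, which reports errors as it goes, might be worth having alongside it.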

But yes, you're right. A Char is a Unicode code point, nothing else, and 
certainly not a C 'char'. A C char is _usually_ eight bits wide (so a 
Word8 or an Int8, depending on signedness), but IIRC the C standard only 
guarantees at least eight bits. I've always thought it a bit odd that the 
well-specified types Word8, Int8 etc. are hidden away in a package while 
the machine-dependent Int type, which I avoid in all my code, is in the 
Prelude.
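
A small sketch of the difference, assuming a GHC-style layout where the sized types live in Data.Word and Data.Int. The names word8Range, int8Range, intRange and wrapped are made up for this example.

```haskell
import Data.Int (Int8)
import Data.Word (Word8)

-- Fixed-width types have the same bounds on every platform.
word8Range :: (Word8, Word8)
word8Range = (minBound, maxBound)   -- (0, 255)

int8Range :: (Int8, Int8)
int8Range = (minBound, maxBound)    -- (-128, 127)

-- Int is only guaranteed by the Haskell Report to cover at least
-- [-2^29, 2^29 - 1]; the actual bounds depend on the implementation.
intRange :: (Int, Int)
intRange = (minBound, maxBound)

-- Word8 arithmetic in GHC wraps modulo 256.
wrapped :: Word8
wrapped = maxBound + 1              -- 0
```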

-- 
Ashley Yakeley, Seattle WA