UTF-8 encode/decode libraries.

Sven Panne Sven.Panne at aedion.de
Mon Apr 26 21:33:38 EDT 2004


Duncan Coutts wrote:
> On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
> toUTF :: String -> String

Hmmm, "String -> [Word8]" would be nicer...

> fromUTF :: String -> String

... and here: "[Word8] -> String" or "[Word8] -> Maybe String".
Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character.
Here is an excerpt from "man utf8" on my SuSE Linux:

        * UTF-8  encoded  UCS  characters  may  be up to six bytes
          long, however the Unicode standard specifies no  characters
          above  0x10ffff, so Unicode characters can only be up to
          four bytes long in UTF-8.
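
To make the suggested signatures concrete, here is a minimal sketch of
what they could look like with the four-byte case handled. The names
toUTF and fromUTF are taken from the quoted proposal; the bodies are my
own assumption, not the code under discussion. The decoder returns
Nothing for truncated input, stray continuation bytes, overlong forms,
surrogates and anything above 0x10ffff. The encoder does not check for
surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a String as UTF-8, using one to four bytes per character.
-- Assumption: surrogate code points (0xD800..0xDFFF) are passed through
-- unchecked; a stricter encoder would reject them.
toUTF :: String -> [Word8]
toUTF = concatMap (encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [lead 0xC0 6, cont 0]
      | c < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
      | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
      where
        -- Lead byte: length tag plus the topmost payload bits.
        lead tag s = fromIntegral (tag .|. (c `shiftR` s))
        -- Continuation byte: 10xxxxxx carrying six payload bits.
        cont s     = fromIntegral (0x80 .|. ((c `shiftR` s) .&. 0x3F))

-- Decode UTF-8 bytes; Nothing signals malformed input.
fromUTF :: [Word8] -> Maybe String
fromUTF [] = Just []
fromUTF (b:bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> fromUTF bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) 0x10000
  | otherwise          = Nothing
  where
    -- n continuation bytes expected, hi = payload bits of the lead byte,
    -- minCode = smallest code point allowed for this sequence length.
    multi n hi minCode =
      let (cs, rest) = splitAt n bs
          code = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                       (fromIntegral hi) cs :: Int
          ok = length cs == n
               && all (\x -> x .&. 0xC0 == 0x80) cs
               && code >= minCode && code <= 0x10FFFF
               && (code < 0xD800 || code > 0xDFFF)
      in if ok then (chr code :) <$> fromUTF rest else Nothing
```

With these types the encoded form can no longer be confused with
ordinary text, and "fromUTF (toUTF s)" gives "Just s" back for any
string without surrogates.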

IIRC we discussed encoders/decoders quite some time ago on the libraries
mailing list, but nothing really happened, which is a pity. We should
strive for something more general than UTF-8 <-> UCS/Unicode; there are
quite a few other widely used encodings, e.g. GSM 03.38. Any takers?

Cheers,
    S.


