Text in Haskell: A PROPOSAL

Ashley Yakeley ashley@semantic.org
Wed, 7 Aug 2002 23:05:28 -0700


At 2002-08-07 17:37, Sven Moritz Hallberg wrote:

>Hm, what about encodeCharUTF16? Would that return Word16s? 

UTF-16 may represent a single Char as one or two Word16s.

  encodeCharUTF16 :: Char -> [Word16];

or

  encodeCharUTF16 :: Char -> (Word16,Maybe Word16);

or

  encodeCharUTF16 :: Char -> Either Word16 (Word16,Word16);
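
As a rough illustration of the first signature, here is a minimal sketch of the surrogate-pair arithmetic. The function body is an assumption added for illustration, not an existing library implementation, and it passes lone surrogate Chars (U+D800..U+DFFF) through unchanged rather than rejecting them.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as one or two UTF-16 code units.
-- Code points below U+10000 fit in a single unit; anything above is
-- split into a high and a low surrogate.
-- Assumption: a Char that is itself a surrogate is passed through as-is.
encodeCharUTF16 :: Char -> [Word16]
encodeCharUTF16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    n' = n - 0x10000                             -- 20-bit offset
    hi = fromIntegral (0xD800 + (n' `shiftR` 10)) -- top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))   -- bottom 10 bits
```

The list form is the easiest to feed into concatMap; the other two signatures carry the same information with the "one or two" constraint made explicit in the type.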

>Hrm. But then, how to write that to a file? 

That depends on the order in which you want to write the two bytes of 
each Word16.

Unicode 3.0 defines four character encoding forms/schemes: UTF-8, UTF-16, 
UTF-16LE, and UTF-16BE. UTF-16 encodes as 16-bit units; the other three 
encode as 8-bit units.

So you might have something like this:

  encodeUTF8 :: String -> [Word8];
  encodeUTF16 :: String -> [Word16];
  encodeUTF16LE :: String -> [Word8];
  encodeUTF16BE :: String -> [Word8];
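
To make the byte-order point concrete, here is a minimal sketch of how the two byte-serialized variants could be layered on top of the 16-bit form. The helper names hiByte and loByte are hypothetical, the bodies are assumptions for illustration only, and no byte order mark is written.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- 16-bit form: one or two code units per Char.
-- Assumption: surrogate Chars are passed through unchanged.
encodeUTF16 :: String -> [Word16]
encodeUTF16 = concatMap units
  where
    units c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (n' `shiftR` 10))
                      , fromIntegral (0xDC00 + (n' .&. 0x3FF)) ]
      where
        n  = ord c
        n' = n - 0x10000

-- Hypothetical helpers: the two halves of a code unit.
hiByte, loByte :: Word16 -> Word8
hiByte w = fromIntegral (w `shiftR` 8)
loByte w = fromIntegral (w .&. 0xFF)

-- Big-endian: most significant byte of each unit first.
encodeUTF16BE :: String -> [Word8]
encodeUTF16BE = concatMap (\w -> [hiByte w, loByte w]) . encodeUTF16

-- Little-endian: least significant byte of each unit first.
encodeUTF16LE :: String -> [Word8]
encodeUTF16LE = concatMap (\w -> [loByte w, hiByte w]) . encodeUTF16
```

Either [Word8] result can then be written to a file handle opened in binary mode; only the order of the two bytes within each code unit differs.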

The authority here is Unicode Technical Report 17, which is part of the 
Unicode Standard.
<http://www.unicode.org/unicode/reports/tr17/>

But watch out... I've noticed a certain amount of incoherence in the 
Unicode standards. For instance, Unicode 3.0 sec. 2.3 refers to UTF-16 as 
one of four "Character Encoding Schemes", each of which is "an encoding 
form plus byte serialization", even though UTF-16 by itself doesn't 
include byte serialization.

-- 
Ashley Yakeley, Seattle WA