Text in Haskell: a second proposal

Simon Marlow simonmar@microsoft.com
Tue, 13 Aug 2002 12:13:17 +0100


> At 2002-08-09 03:26, Simon Marlow wrote:
>
> >Why combine I/O and {en,de}coding?  Firstly, efficiency.
>
> Hmm... surely the encoding functions can be defined efficiently?
>
>     decodeISO88591 :: [Word8] -> [Char];
>     encodeISO88591 :: [Char] -> [Word8]; -- uses low octet of codepoint
>
> You could surely define them as native functions very efficiently, if
> necessary.

That depends on what you mean by efficient: these functions introduce an
extra intermediate list between the handle buffer and the final [Char],
and furthermore they don't work with partial reads - the input has to be
a lazy stream obtained from hGetContents.  I don't want to be forced to
use lazy I/O.
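
To make the intermediate list concrete, here is a minimal sketch of the
pure versions, assuming the signatures quoted above (the bodies are my
guess at the obvious definitions, not anything from a library).  Every
octet passes through a [Word8] cell before it becomes a Char, and the
whole input has to be available as a single list:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 maps each octet directly to the code point of the same value.
decodeISO88591 :: [Word8] -> [Char]
decodeISO88591 = map (chr . fromIntegral)

-- Keeps only the low octet of each code point; anything above U+00FF is
-- silently truncated, as the comment in the quoted signature says.
encodeISO88591 :: [Char] -> [Word8]
encodeISO88591 = map (fromIntegral . ord)
```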

> A monadic stream-transformer:
>
>    decodeStreamUTF8 :: (Monad m) => m Word8 -> m Char;
>
>    hGetChar h = decodeStreamUTF8 (hGetWord8 h);
>
> This works provided each Char corresponds to a contiguous block of
> Word8s, with no state between them. I think that includes all the
> standard character encoding schemes.

This is better: it doesn't force you to use lazy I/O, and when
specialised to the IO monad it might get decent performance.  The
problem is that in general I don't think you can assume the lack of
state.  For example: UTF-7 has a state which needs to be retained
between characters, and UTF-16 and UTF-32 have an endianness state which
can be changed by a special sequence at the beginning of the file.  Some
other encodings have states too.
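
As a rough illustration of the state problem, here is a sketch of a
UTF-16 step function that threads the byte order explicitly.  All the
names here (DecoderState, decodeStepUTF16, listSource) are hypothetical,
the big-endian default when there is no byte-order mark is an assumption
of the sketch, and surrogate pairs are not combined.  The point is only
that the type has to grow a state argument and result, which the
stateless m Word8 -> m Char signature has no room for:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

data Endian = BigEndian | LittleEndian
  deriving (Eq, Show)

-- Hypothetical decoder state: either nothing has been read yet, or a
-- byte order is in effect.
data DecoderState = AtStart | Running Endian
  deriving (Eq, Show)

-- Byte order to use for the current code unit.  Defaulting to
-- big-endian before any byte-order mark is an assumption of this sketch.
endianOf :: DecoderState -> Endian
endianOf AtStart     = BigEndian
endianOf (Running e) = e

-- Combine two octets, given in stream order, into a 16-bit code unit.
codeUnit :: Endian -> Word8 -> Word8 -> Int
codeUnit BigEndian    first second =
  (fromIntegral first `shiftL` 8) .|. fromIntegral second
codeUnit LittleEndian first second =
  (fromIntegral second `shiftL` 8) .|. fromIntegral first

-- Read one code unit as a Char, returning the updated state.  A
-- byte-order mark is only recognised at the very start of the stream.
decodeStepUTF16 :: Monad m => m Word8 -> DecoderState -> m (Char, DecoderState)
decodeStepUTF16 getByte st = do
  a <- getByte
  b <- getByte
  case (st, a, b) of
    (AtStart, 0xFE, 0xFF) -> decodeStepUTF16 getByte (Running BigEndian)
    (AtStart, 0xFF, 0xFE) -> decodeStepUTF16 getByte (Running LittleEndian)
    _ -> let e = endianOf st
         in return (chr (codeUnit e a b), Running e)

-- Hypothetical byte source for experimenting: hands out octets from a
-- list and raises an IO error when the list is exhausted.
listSource :: [Word8] -> IO (IO Word8)
listSource bytes = do
  ref <- newIORef bytes
  return $ do
    remaining <- readIORef ref
    case remaining of
      (x:xs) -> writeIORef ref xs >> return x
      []     -> ioError (userError "listSource: end of input")
```

The caller has to carry the returned DecoderState from one call to the
next, so the state either shows up in the API or gets hidden inside the
handle.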

Cheers,
	Simon