UTF-8 library

Sven Moritz Hallberg pesco@gmx.de
10 Aug 2002 13:38:36 +0200


On Sat, 2002-08-10 at 12:03, anatoli wrote:
> --- Sven Moritz Hallberg <pesco@gmx.de> wrote:
> > I argue _strongly_ against associating some sort of locale state with
> > handles.
> >
> > 1) In agreement with Ashley's statements, file IO should use octets,
> > because that's what's in a file.
>
> By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
> (Files don't have lines in them, they are just sequences of octets.)

That's a good point; I'd forgotten about this mess. I think it's ugly,
though, to deal with it somewhere outside, pretending the issue's not
there. What I value about Haskell is its clean representation of reality.
Attaching all kinds of state to handles just isn't as clear as "Look
here, a file: It's a sequence of octets.", "Watch out though, each file
can use an entirely different encoding.", "The Char versions of the IO
functions will try to deal with encoding for you.", and "If you know you
need some special treatment, we have these functions blahblahblah..."


> I prefer somewhat higher-level view of files.

Of course, so do I. I just want the higher-level view to be implemented
in Haskell, not under the hood of some ominous "handle" type, which,
btw, will then no longer be simply a handle but some sort of great big
file IO "object". That's confusing for anyone who hasn't been exposed to
the C way of dealing with files. I'd rather teach some old people clean
concepts they might not be used to than repeat the same old yuck to
every new little programmer who's just starting.


> > 2) If you need to decode those octets to characters, or vice-versa,
> > compose a (de)serialization function before it.
>
> I *always* need that. (Except for binary IO). Might as well have this
> functionality built in a handle.

Well, then *always* use the Char functions. I don't see the point.


> > 3) A "best shot" character reading(or writing, for that matter)
> > function, will be convenient. This should probably use your current
> > locale, because when writing a character, you'll probably want to be
> > able to write your own language's characters correctly.
>
> I routinely read and write messages in three different languages that
> use three different encodings. All of them are my "own" languages.

Where is the problem? The system is not going to be able to decide which
one to use either way, so you must make the encoding explicit. Now we
just have to come up with a convenient way to do it. Transforming
between [Word8] and [Char] seems plausible to me.
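
To make "transforming between [Word8] and [Char]" concrete, here is a
minimal sketch for the simplest case, Latin-1. The names decodeLatin1
and encodeLatin1 are hypothetical and not part of any proposed library;
the point is only that the choice of encoding shows up as an ordinary
function the programmer picks explicitly.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical names, for illustration only.

-- Every octet is a valid Latin-1 character, so decoding cannot fail.
decodeLatin1 :: [Word8] -> [Char]
decodeLatin1 = map (chr . fromIntegral)

-- Encoding yields Nothing if any character lies outside U+0000..U+00FF.
encodeLatin1 :: [Char] -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

Someone who works with three encodings would pick one such pair of
functions per file; nothing needs to be attached to the handle.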


> > 4) For decoding, we'll need some parsing functionality, as someone
> > already mentioned. With that we can have functions like parseUTF8.
> > "Associating a locale with a stream", as you put it, is a matter of, if
> > f is the raw Word8 stream, g = parseUTF8 f, where g is the Char stream,
> > parsed as UTF-8-encoded characters from f.
>
> A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can transform
> [Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
> is needed as well.

The only reason for that would be efficiency. Simon said something about
that. I admit that I have no clue about it.
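
For the [Word8] to [Char] direction, the "g = parseUTF8 f" idea from
point 4 might look roughly like the sketch below. Only the name
parseUTF8 comes from the discussion; the helper multi and the error
policy are assumptions made for illustration. Malformed input is
replaced by U+FFFD, and overlong forms and surrogates are not rejected,
so this is not a conforming decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Sketch only: decode a list of octets as UTF-8, lazily, element by element.
parseUTF8 :: [Word8] -> [Char]
parseUTF8 [] = []
parseUTF8 (b : bs)
  | b < 0x80           = chr (fromIntegral b) : parseUTF8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) bs
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) bs
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) bs
  | otherwise          = '\xFFFD' : parseUTF8 bs  -- stray continuation or invalid lead octet

-- Hypothetical helper: consume n continuation octets and build the code point.
-- On failure, emit U+FFFD and resume right after the lead octet.
multi :: Int -> Int -> [Word8] -> [Char]
multi n acc bs
  | length conts == n && all isCont conts && code <= 0x10FFFF
              = chr code : parseUTF8 rest
  | otherwise = '\xFFFD' : parseUTF8 bs
  where
    (conts, rest) = splitAt n bs
    isCont c = c .&. 0xC0 == 0x80
    code = foldl (\a c -> (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) acc conts
```

Because the result is produced one character at a time, the function
composes with any lazily produced octet list; whether that is fast
enough compared to a decoding handle is exactly the open efficiency
question.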


Sven Moritz

