Text in Haskell: a second proposal

Ken Shan ken@digitas.harvard.edu
Fri, 9 Aug 2002 02:10:36 -0400


Thanks to the discussion on this list, I now see that there are not four
but five types to be distinguished in text processing:

 1. Octets.
 2. C "char".
 3. Unicode code points.
 4. Unicode code values, useful only for UTF-16, which is seldom used.
 5. "What handles handle".

The new entry in the list is the last one.  Wolfgang Jeltsch mentioned
"Stream_Element" in Ada 95; I don't know the specifics, but it sounds
like the same idea.

I suggest that the following Haskell types be used for the five items
above:

 1. Word8
 2. CChar
 3. CodePoint
 4. Word16
 5. Char

On most machines, Char will be a wrapper around Word8.  (This
contradicts the present language standard.)

Let me elaborate.  Files are funny because the information units they
contain can be treated as both numbers and characters.  Treating these
units as numbers, we can convert them to, say, octets and use them to
serialize higher-level data structures such as directed acyclic graphs,
bitstrings, and Unicode text.  Treating the same units as characters, we
can concatenate text like "cat" does, strip out or replace selected
characters like "tr" does, or compute the number of units in a stream
like "wc -c" does.
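
As a rough sketch of the "characters" view, the three utilities can be
written as pure functions over [Char] that never ask what number a unit
stands for.  The function names are made up for illustration and are not
part of any library.

```haskell
-- "cat": concatenate several streams of units into one.
catUnits :: [[Char]] -> [Char]
catUnits = concat

-- "tr": replace each unit listed in `from` by the unit at the same
-- position in `to`; all other units pass through unchanged.
trUnits :: [Char] -> [Char] -> [Char] -> [Char]
trUnits from to = map (\c -> maybe c id (lookup c (zip from to)))

-- "tr -d": strip out the selected units.
deleteUnits :: [Char] -> [Char] -> [Char]
deleteUnits unwanted = filter (`notElem` unwanted)

-- "wc -c": count the units in a stream.
countUnits :: [Char] -> Int
countUnits = length
```

None of these functions calls ord or chr, which is the sense in which
they treat the units as characters rather than numbers.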

When we treat a file as containing characters (rather than numbers), we
are in effect using a funky character encoding.  This encoding is not
UTF-8, not ISO-8859-1, and not ASCII.  The specifics of this encoding
depend on the machine.  On mine, it supports 256 characters, mapped to
the numbers 0 through 255.  Between 0 and 127 is ASCII.  From 128 to 255
are 128 additional characters -- let's call them "#128" through "#255".
These characters are completely new.  For example, "á" is not "#225",
just like the integer "225" and the unique complete graph with 225
vertices are not the character "#225", either.

Many people will never worry about the characters #128 through #255.
This is akin to the fact that the characters "a" through "z" do not
concern a program that reads two numbers (in decimal, textual form) from
standard input, adds them up, and prints the result to standard out
(again in decimal, textual form).  As long as the set of characters
handled by standard input and standard output includes the decimal
digits, the period, and whitespace, the program will work.
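
A minimal sketch of such a program's core, kept pure so that it can be
tried without any I/O.  The name addTwo is hypothetical; the input is
assumed to hold exactly two whitespace-separated decimal numbers.

```haskell
import Text.Read (readMaybe)

-- Parse exactly two whitespace-separated decimal numbers and add them.
-- Only digits, the period, a sign, and whitespace are ever interpreted;
-- anything else makes the parse fail with Nothing.
addTwo :: String -> Maybe Double
addTwo input = case mapM readMaybe (words input) of
  Just [x, y] -> Just (x + y)
  _           -> Nothing
```

Hooking it up to standard input and output would be a matter of
something like `interact (maybe "error\n" show . addTwo)`.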

What do the five Haskell types proposed above mean for the practical
programmer?  The types in the Haskell IO library will not change; for
instance, the only way to read an information unit from a handle is

    hGetChar :: Handle -> IO Char

As I mentioned above, Char under the present proposal is not the type of
a Unicode character, but the type of an information unit handled by
handles, contra the current language standard.  There should, however,
be a basic guarantee on how big this information unit is; a reasonable
one to make is that it contains at least 8 bits.  In other words,

    ord (chr i) == i        for all i such that 0 <= i <= 255.
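
The guarantee can be written down as an executable check.  This is only
a sketch: the name roundTripsOnOctetRange is made up, and the check is
run here against the Char of an existing implementation, not against
the Char proposed in this message.

```haskell
import Data.Char (chr, ord)

-- The proposed minimum guarantee: every number from 0 to 255 survives
-- a round trip through Char.
roundTripsOnOctetRange :: Bool
roundTripsOnOctetRange = all (\i -> ord (chr i) == i) [0 .. 255]
```

An implementation whose Char is larger than 8 bits satisfies this too;
the guarantee is a lower bound, not an exact size.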

CChar will be a synonym for Char on most systems, just like CInt is a
synonym for Int on most systems.  I don't know if the sockets library
should use Word8 or Char, but it should be one of the two.

Now for the tricky issue of converting and defaulting and guessing
encodings.  In short: Encodings should be handled separately from files.
Encoding conversion should be in a library separate from the I/O
library.  Without involving I/O, I should be able to write a program to
answer the question "how many 3-byte sequences are valid UTF-8 text?".
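
To show that the question really needs no I/O, here is one way it could
be answered by pure counting.  The sketch assumes the RFC 3629
definition of well-formed UTF-8 (no overlong forms, no surrogates,
nothing above U+10FFFF); under a looser definition the numbers differ.
The names scalarsOfLength and validSequences are hypothetical.

```haskell
-- Number of Unicode scalar values whose UTF-8 encoding is exactly
-- k bytes long, assuming the RFC 3629 rules.
scalarsOfLength :: Int -> Integer
scalarsOfLength 1 = 0x80                     -- U+0000..U+007F
scalarsOfLength 2 = 0x800 - 0x80             -- U+0080..U+07FF
scalarsOfLength 3 = 0x10000 - 0x800 - 0x800  -- U+0800..U+FFFF, minus
                                             -- surrogates D800..DFFF
scalarsOfLength 4 = 0x110000 - 0x10000       -- U+10000..U+10FFFF
scalarsOfLength _ = 0

-- Number of byte sequences of length n (n >= 0) that are well-formed
-- UTF-8.  A well-formed sequence is a first character of k bytes
-- followed by a well-formed sequence of n - k bytes, so the counts
-- satisfy a simple recurrence, memoised in a lazy list.
validSequences :: Int -> Integer
validSequences n = table !! n
  where
    table = map count [0 ..]
    count 0 = 1
    count m = sum [ scalarsOfLength k * table !! (m - k)
                  | k <- [1 .. min 4 m] ]
```

Under these assumptions, validSequences 3 evaluates to 2650112, out of
256^3 = 16777216 possible 3-byte sequences.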

We will want stuff in the library like

    data Encoding text code
        = Encoding { encode :: [text] -> Maybe [code]
                   , decode :: [code] -> Maybe [text] }

    utf8     :: Encoding CodePoint Word8
    iso88591 :: Encoding CodePoint Word8

as well as

    char     :: Encoding Char Word8

so that UTF-8 conversion from [Char] to [CodePoint] is

    (>>= decode utf8) . encode char :: [Char] -> Maybe [CodePoint]
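
A small self-contained sketch of how these pieces might fit together.
To keep it short it composes with iso88591 instead of utf8, whose codec
is longer to write out.  The CodePoint newtype and the name
charsToCodePoints are assumptions for illustration, and the char
encoding shown is the one for a machine with 8-bit units, approximated
with an existing implementation's Char.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical stand-in for the proposed CodePoint type.
newtype CodePoint = CodePoint Int
  deriving (Eq, Show)

-- The record from the text, unchanged.
data Encoding text code = Encoding
  { encode :: [text] -> Maybe [code]
  , decode :: [code] -> Maybe [text]
  }

-- ISO-8859-1 maps code points 0..255 directly onto octets; anything
-- else cannot be encoded.  Every octet decodes successfully.
iso88591 :: Encoding CodePoint Word8
iso88591 = Encoding
  { encode = mapM toOctet
  , decode = Just . map (CodePoint . fromIntegral)
  }
  where
    toOctet (CodePoint n)
      | n >= 0 && n <= 255 = Just (fromIntegral n)
      | otherwise          = Nothing

-- The machine-specific encoding of information units, here assumed to
-- be 8 bits wide: unit number i corresponds to octet i.
char :: Encoding Char Word8
char = Encoding
  { encode = mapM toOctet
  , decode = Just . map (chr . fromIntegral)
  }
  where
    toOctet c
      | ord c <= 255 = Just (fromIntegral (ord c))
      | otherwise    = Nothing

-- Same shape as the composition in the text, with iso88591 for utf8.
charsToCodePoints :: [Char] -> Maybe [CodePoint]
charsToCodePoints = (>>= decode iso88591) . encode char
```

Because both directions return Maybe, a failure at either stage (a unit
that has no octet, or octets that are not valid in the target encoding)
shows up as Nothing rather than as silently mangled text.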

I am sure many complexities of character encodings are not considered
here, but that is in part the point: I would like to see Char in the
language standard dissociated from Unicode and returned to the more
abstract concept of an information unit.

-- 
Edit this signature at http://www.digitas.harvard.edu/cgi-bin/ken/sig
http://www.ethnologue.com/
