Text in Haskell: A PROPOSAL

Sven Moritz Hallberg pesco@gmx.de
08 Aug 2002 02:37:04 +0200


On Thu, 2002-08-08 at 01:34, Joe English wrote:
> It's often very useful to treat a file as a sequence
> of characters; in fact I'd say that's probably more
> common than treating them as a sequence of octets.
> But both are clearly needed.

I agree with Ashley: a file is a sequence of Word8s. Very often we use
files to store a sequence of characters. Treating the file as a sequence
of characters is one level higher, though. In between lies the
(de)serialization of the characters (according to some character code),
just like you'd have it with every other "object".

Is there a compelling reason against simply providing character
en-/decoding functions?


> In my opinion, hPutChar :: Handle -> Char -> IO () should
> do what its name and type indicate -- write a character
> to the specified output handle.  The I/O subsystem
> should take care of translation to UTF-8 (or whatever
> the system encoding is).

Is there a sensible way of defining the "system encoding"?
UTF-8 is a superset of ASCII, correct? As in, encoding ASCII
characters in UTF-8 yields valid ASCII? If so, this actually sounds like
it would usually produce the desired result without much hassle to the
developer.

However, I think it is of great importance to make it very clear that a
file in itself _is not_ a sequence of characters but a sequence of
8-bit words, and that any function like hPutChar would actually be just
a shortcut for something like

  hPut . encodeChar

along with which there would also be

  encodeCharUTF8
  encodeCharASCII
  ...
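
To make the shortcut concrete, here is a minimal sketch, assuming that
encodeCharUTF8 returns a [Word8], that encodeCharASCII returns a Maybe
Word8, and that hPut takes a list of octets. These signatures, and the
name hPutCharUTF8, are assumptions for illustration only; the proposal
does not fix them. The first guard of encodeCharUTF8 also shows that a
character below 128 is encoded as the single octet equal to its ASCII
code.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
import System.IO (Handle)

-- Hypothetical encoder: the UTF-8 octets for one character.
-- Surrogate code points are not rejected in this sketch.
encodeCharUTF8 :: Char -> [Word8]
encodeCharUTF8 c
  | n < 0x80    = [fromIntegral n]                  -- same octet as ASCII
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the first octet.
    top s  = fromIntegral (n `shiftR` s)
    -- Continuation octet: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical encoder: fails for characters outside ASCII.
encodeCharASCII :: Char -> Maybe Word8
encodeCharASCII c
  | ord c < 0x80 = Just (fromIntegral (ord c))
  | otherwise    = Nothing

-- Assumed primitive: write raw octets to a handle.
hPut :: Handle -> [Word8] -> IO ()
hPut h = B.hPut h . B.pack

-- The "shortcut": encode first, then write the octets.
hPutCharUTF8 :: Handle -> Char -> IO ()
hPutCharUTF8 h = hPut h . encodeCharUTF8
```

Note that the composition has to be written per handle (hPut h .
encodeCharUTF8), since hPut takes the handle as its first argument.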

Hm, what about encodeCharUTF16? Would that return Word16s? Hrm. But
then, how to write that to a file? Would there be a reason against
encodeCharUTF16 returning Word8s? Otherwise there would have to be two
separate functions or another level which would convert Word16s to
Word8s.
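
One way to picture the "other level" is sketched below: produce the
UTF-16 code units as Word16s, then serialize each unit to two Word8s in
a fixed byte order. The names utf16Units, word16BE and encodeCharUTF16BE
are hypothetical, and big-endian order is an arbitrary choice made for
this sketch; a real design would have to state the byte order somewhere.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- UTF-16 code units for one character: a single unit for the Basic
-- Multilingual Plane, a surrogate pair for everything above it.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- The extra level: one Word16 becomes two Word8s, high octet first.
word16BE :: Word16 -> [Word8]
word16BE w = [fromIntegral (w `shiftR` 8), fromIntegral (w .&. 0xFF)]

-- Composition of the two levels, returning Word8s directly.
encodeCharUTF16BE :: Char -> [Word8]
encodeCharUTF16BE = concatMap word16BE . utf16Units
```

With this split, a single function returning Word8s and a separate
Word16 view can coexist without duplicating the encoding logic.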


> hPutWord8 :: Handle -> Word8 -> IO () should be available
> _in addition to_ hPutChar, for applications that need
> to treat files as a sequence of octets.

I think having one hPut :: Handle -> Word8 -> IO () along with a bunch
of serialization functions is preferable to separate hPutWord8,
hPutWord16, hPutChar, ... functions just because it gives a much cleaner
picture of what's actually happening. Maybe there could even be a class
Serializable so one could have

  hPut :: Serializable a => Handle -> a -> IO ()

I think (under the assumption that encoding ASCII characters in UTF-8
yields valid ASCII code) this would actually make the nicest design,
because it accomplishes these goals:

  - The programmer need not usually concern himself with en-/decoding.
  - If she needs control, she can use a specific function herself and
    pass the resulting [Word8] (which is trivially Serializable) to
    hPut.
  - The code expresses directly the need to encode a piece of data
    before it can be written to a file.
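
A minimal sketch of how such a class might look, under some assumptions
not made in the text above: the class has a single method named
serialize that returns a [Word8], Char defaults to UTF-8, and lists
serialize element by element (which makes [Word8] and String
Serializable without further instances). The underlying write goes
through Data.ByteString purely for illustration.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
import System.IO (Handle)

-- Hypothetical class: anything that can be turned into octets.
class Serializable a where
  serialize :: a -> [Word8]

-- An octet serializes to itself.
instance Serializable Word8 where
  serialize w = [w]

-- Lists serialize element by element, so [Word8] is trivially
-- Serializable and String picks up the Char instance.
instance Serializable a => Serializable [a] where
  serialize = concatMap serialize

-- Assumed default for characters: UTF-8.
instance Serializable Char where
  serialize c
    | n < 0x80    = [fromIntegral n]
    | n < 0x800   = [0xC0 .|. top 6, cont 0]
    | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
    | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
    where
      n      = ord c
      top s  = fromIntegral (n `shiftR` s)
      cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode first, then write the resulting octets to the handle.
hPut :: Serializable a => Handle -> a -> IO ()
hPut h = B.hPut h . B.pack . serialize
```

With these instances, hPut h "text" uses the default encoding, while
hPut h (someEncoder "text") lets the programmer pick a specific encoder
that yields a [Word8]; someEncoder is a placeholder here.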


Regards,
Sven Moritz