UTF-8 library

Ashley Yakeley ashley@semantic.org
Sat, 10 Aug 2002 03:42:22 -0700


At 2002-08-10 03:03, anatoli wrote:

>--- Sven Moritz Hallberg <pesco@gmx.de> wrote:
>> I argue _strongly_ against associating some sort of locale state with
>> handles.
>> 
>> 1) In agreement with Ashley's statements, file IO should use octets,
>> because that's what's in a file.
>
>By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
>(Files don't have lines in them, they are just sequences of octets.)

Correct. Exactly what kind of newline do you want in your file?

>I prefer a somewhat higher-level view of files.

Well, that's what encoding functions are for. You can take higher-level 
views of your octets as text, images, XML-structures, experimental 
datasets, whatever.

What's so special about text that the functionality should be bound 
_right into the API_?
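
To make "encoding functions" concrete, here is a minimal sketch of one such
view: a pure UTF-8 decoder from octets to characters. The name decodeUtf8 and
the choice to replace malformed input with U+FFFD are assumptions made for
illustration, not part of any existing library. It rejects overlong forms,
surrogates and values above U+10FFFF, but it is not tuned for speed.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical name. Decodes UTF-8 octets to characters; any malformed
-- sequence is replaced by U+FFFD (an assumed policy, not a standard API).
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) bs
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) bs
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) bs
  | otherwise             = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation octets, folding their low six bits into acc.
    multi :: Int -> Int -> [Word8] -> String
    multi n acc rest =
      case splitAt n rest of
        (cont, rest')
          | length cont == n && all isCont cont ->
              let cp = foldl step acc cont
              in (if valid n cp then chr cp else '\xFFFD') : decodeUtf8 rest'
        -- Truncated or bad continuation: emit U+FFFD, resume after the lead.
        _ -> '\xFFFD' : decodeUtf8 rest

    -- Continuation octets have the bit pattern 10xxxxxx.
    isCont c = c .&. 0xC0 == 0x80

    step a c = (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F)

    -- Reject overlong encodings, surrogates and out-of-range code points.
    valid n cp =
      cp >= minCp n && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)

    minCp :: Int -> Int
    minCp 1 = 0x80
    minCp 2 = 0x800
    minCp _ = 0x10000
```

Because the decoder is an ordinary function on [Word8], it composes with
whatever octet source you have, and a decoder for images or datasets would
have exactly the same shape.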

>> 2) If you need to decode those octets to characters, or vice-versa,
>> compose a (de)serialization function before it.
>
>I *always* need that. (Except for binary IO).

You *always* need that. (Except when you don't).

The term "binary" is quite misleading. It suggests a particular file 
type, but it's actually used to mean "something other than 
ASCII-compatible text". One might as well have a word that means 
"something other than a JPEG image".

...
>A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can
>transform [Word8] to [Char], but not Word8Handle to CharHandle. I argue
>that the latter is needed as well.

Well, it should be a utility library built on top of the real Word8-based 
functions:

  data TextHandle = MkTextHandle Handle TextEncoding;
  etc.
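
As a rough sketch of how such a utility layer might look, assuming the
octet-level primitives are the Handle functions in Data.ByteString from the
bytestring package: the TextEncoding record, latin1, textPutStr and
textGetContents are all hypothetical names invented for this example. This
TextEncoding is defined locally and is not the type of the same name in
System.IO. The Handle is assumed to have been opened in binary mode.

```haskell
import qualified Data.ByteString as BS
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (Handle)

-- Hypothetical: an encoding is just a pair of pure functions over octets.
data TextEncoding = MkTextEncoding
  { encodeText :: String -> [Word8]
  , decodeText :: [Word8] -> String
  }

-- A text handle is an ordinary octet handle plus the encoding to apply.
data TextHandle = MkTextHandle Handle TextEncoding

-- ISO 8859-1 as an example encoding: one octet per code point.
-- Characters above U+00FF are written as '?' (octet 63), an assumed policy.
latin1 :: TextEncoding
latin1 = MkTextEncoding
  { encodeText = map (\c -> if ord c < 256 then fromIntegral (ord c) else 63)
  , decodeText = map (chr . fromIntegral)
  }

-- Encode to octets, then hand them to the octet-level write.
textPutStr :: TextHandle -> String -> IO ()
textPutStr (MkTextHandle h enc) = BS.hPut h . BS.pack . encodeText enc

-- Read all remaining octets from the handle, then decode them.
textGetContents :: TextHandle -> IO String
textGetContents (MkTextHandle h enc) =
  decodeText enc . BS.unpack <$> BS.hGetContents h
```

Nothing here needs support from the underlying IO layer: the text functions
are thin wrappers, so another encoding is just another TextEncoding value.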


-- 
Ashley Yakeley, Seattle WA