[Haskell-cafe] Fun with ByteStrings [was: A very edgy language]

Stefan O'Rear stefanor at cox.net
Sun Jul 8 10:58:32 EDT 2007


On Sun, Jul 08, 2007 at 04:38:19PM +0200, Malte Milatz wrote:
> Tillmann Rendel:
> > As I understand it (which may or may not be correct):
> > 
> > A normal Haskell string is basically  [Word8]
> 
> Hm, let's see whether I understand it better or worse.  Actually it is
> [Char], and Char is a Unicode code point in the range 0..1114111 (at
> least in GHC).  Compare:
> 
> 	Prelude Data.Word> fromEnum (maxBound :: Char)
> 	1114111
> 	Prelude Data.Word> fromEnum (maxBound :: Word8)
> 	255
> 
> So it seems that the Char type abstracts the encoding away.  I'm
> actually a little confused by this, because I haven't found any means to
> make the I/O functions of the Prelude (getContents etc.) encoding-aware:
> The string "ä", when read from a UTF-8-encoded file via readFile, has a
> length of 2.  Anyone with a URI to enlighten me?

Not sure of any URIs.

Char is just a code point.  It's a 32-bit integer (64 bits on 64-bit
platforms, due to infelicities in the GHC backend) holding a code point;
it is not a sequence of bytes.  A Char on the heap also carries a header
pointer, bringing the total to 8 (16) bytes.  (However, GHC uses shared
Char objects for Latin-1 characters, so a "fresh" Char in that range
costs 0 additional bytes.)
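
To make that concrete, a quick GHCi check (using escapes so the terminal
encoding stays out of the picture; '\xE4' is 'ä', '\x3BB' is a Greek
lambda):

	Prelude> fromEnum '\xE4'
	228
	Prelude> fromEnum '\x3BB'
	955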

[a] is polymorphic.  It is a linked list that consumes 12 (24) bytes per
element; it stores only pointers to its elements, so it has no hope of
packing anything.

[Char] is therefore a linked list of pointers to heap-allocated
full-word integers: adding the 8 (16) bytes per Char to the 12 (24)
bytes per list cell gives 20 (40) bytes per character (assuming
non-Latin-1).
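
For contrast, and since this thread is about ByteStrings: a rough sketch
(assuming the bytestring package's Data.ByteString.Char8 module; the
million-element figure is arbitrary) of how a packed representation
avoids the per-character overhead:

import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
    -- [Char]: 12 (24) bytes per list cell, plus the Char itself
    -- when it falls outside the shared Latin-1 range
    let s = replicate 1000000 'x'
    -- ByteString: one byte per character in a single packed buffer,
    -- plus a small constant-size header
    let bs = B.replicate 1000000 'x'
    print (length s, B.length bs)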

The GHC I/O functions truncate each Char down to 8 bits.  There is no
way in GHC to read or write full UTF-8, short of doing the encoding
yourself (google for UTF8.lhs).
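
If you do want to do the encoding yourself, the core of it is short.  A
minimal sketch (not the UTF8.lhs I mentioned, just an illustration;
encodeChar and encodeString are names I'm making up here):

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as 1-4 UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
    | n < 0x80    = [fromIntegral n]
    | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
    | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
    | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi k   = fromIntegral (n `shiftR` k)
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)

encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

Packing the resulting [Word8] with Data.ByteString.pack and writing it
with Data.ByteString.hPut then gives you real UTF-8 output.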

Stefan

