Data.ByteString candidate 3

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Tue Apr 25 08:44:31 EDT 2006


On Tue, 2006-04-25 at 13:13 +0100, Duncan Coutts wrote:
> On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
> > Donald Bruce Stewart wrote:
> > 
> > > The code has been partitioned into:
> > >     Data.ByteString         a Word8 only layer. All functions are in terms of Word8
> > >     Data.ByteString.Char    provides an ascii/byte-Char layer over the Word8 layer.
> > 
> > Ok, but where would we put a UTF8 version of the Char layer?  I'm 
> > thinking that "Latin1" would be more correct than "Char", and leaves 
> > room for adding UTF8 and other encodings later.
> 
> As others have pointed out, it's not strictly Latin1. Don and I reckon
> it's probably safe to say that the current Data.ByteString.Char layer is
> ok for any 8-bit fixed-width encoding with ASCII as a subset, so that
> means it's probably ok for many of the Latin* encodings.
> 
> How would we distinguish a full fixed-width 4-byte Unicode version? A
> purist might say that this should be Data.ByteString.Char since a Char
> really is a 4-byte Unicode value and then change the current
> Data.ByteString.Char to be Data.ByteString.Char8 or something like that.

Actually, after further discussion we think that, strictly speaking,
Data.ByteString.Char will only fully work with Latin1 because only for
Latin1 will the Chars we get back be genuine Unicode code-points (since
the first 256 code points of Unicode are the same as Latin1 - or so I am
told).
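
To make the Latin1 case concrete, here is a minimal sketch of what the
two conversions amount to. The names w2c and c2w are the ones used in
this thread; the definitions below are an assumption for illustration
(using the safe chr rather than whatever the library does internally),
not a copy of the actual code.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Byte to Char: for Latin1 the byte value is itself the code point.
w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- Char to byte: lossy, code points above 255 wrap around modulo 256.
c2w :: Char -> Word8
c2w = fromIntegral . ord
```

Going from a byte to a Char and back is the identity for all 256 byte
values; going the other way only round-trips for Chars below 256.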

For other Latin encodings what you get back will only be a Unicode code
point for chars <128. So for other Latin encodings you'd need different
implementations of w2c & c2w that map the 256 chars to/from the correct
Unicode code points.
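
As an illustration of such a different mapping, here is a sketch for
ISO-8859-15 (Latin9), which differs from Latin1 in eight byte
positions. The function names w2cLatin9 and c2wLatin9 and the table
name are hypothetical, and returning Maybe from the Char-to-byte
direction is just one possible design.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The eight byte positions where ISO-8859-15 differs from Latin1.
latin9Diffs :: [(Word8, Char)]
latin9Diffs =
  [ (0xA4, '\x20AC'), (0xA6, '\x0160'), (0xA8, '\x0161'), (0xB4, '\x017D')
  , (0xB8, '\x017E'), (0xBC, '\x0152'), (0xBD, '\x0153'), (0xBE, '\x0178') ]

-- Byte to Char: consult the table, otherwise fall back to the Latin1 rule.
w2cLatin9 :: Word8 -> Char
w2cLatin9 w = case lookup w latin9Diffs of
  Just c  -> c
  Nothing -> chr (fromIntegral w)

-- Char to byte: Nothing when the Char has no ISO-8859-15 encoding.
c2wLatin9 :: Char -> Maybe Word8
c2wLatin9 c = case lookup c [ (u, w) | (w, u) <- latin9Diffs ] of
  Just w -> Just w
  Nothing
    | n < 256 && fromIntegral n `notElem` map fst latin9Diffs
                -> Just (fromIntegral n)
    | otherwise -> Nothing
  where
    n = ord c
```

With this mapping byte 0xA4 comes back as U+20AC (the euro sign),
whereas the Latin1 mapping would hand back U+00A4.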

So that suggests that we might want to call it Data.ByteString.Latin1.
At this point we wish we had parameterisable modules so we could have
various other encodings just by parameterising on the w2c/c2w mappings.
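
Lacking parameterised modules, one way to approximate the idea is to
pass the two mappings around as an ordinary record. This is only a
sketch: the Encoding type and the unpackWith, packWith and mapWith
functions are hypothetical names invented here, and only
Data.ByteString's pack, unpack and map are assumed from the library.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical record standing in for the module parameter.
data Encoding = Encoding
  { w2c :: Word8 -> Char
  , c2w :: Char -> Word8
  }

-- The Latin1 instance: byte value and code point coincide.
latin1 :: Encoding
latin1 = Encoding { w2c = chr . fromIntegral, c2w = fromIntegral . ord }

-- Char-level operations written once, against any Encoding.
unpackWith :: Encoding -> B.ByteString -> String
unpackWith enc = map (w2c enc) . B.unpack

packWith :: Encoding -> String -> B.ByteString
packWith enc = B.pack . map (c2w enc)

mapWith :: Encoding -> (Char -> Char) -> B.ByteString -> B.ByteString
mapWith enc f = B.map (c2w enc . f . w2c enc)
```

The obvious cost compared to a real parameterised module is that every
call carries the extra argument, and the compiler has to specialise on
it to recover the speed of the hand-written version.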

Most of the time you could use Data.ByteString.Latin1 for other Latin
encodings and get away with it (so long as you don't want to use things
like isUpper for chars >127) which is both a blessing and a curse.
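
A small sketch of where that goes wrong, taking ISO-8859-15 as the
"other" encoding. Byte 0xA6 there is U+0160 (capital S with caron),
but decoded through the Latin1 mapping it becomes U+00A6 (broken bar).
The names below are hypothetical.

```haskell
import Data.Char (chr, isUpper)
import Data.Word (Word8)

-- The Latin1 byte-to-Char rule: byte value is the code point.
w2cLatin1 :: Word8 -> Char
w2cLatin1 = chr . fromIntegral

-- Asking isUpper about byte 0xA6 under each interpretation.
upperAsLatin1, upperAsLatin9 :: Bool
upperAsLatin1 = isUpper (w2cLatin1 0xA6)  -- U+00A6, a symbol
upperAsLatin9 = isUpper '\x0160'          -- U+0160, an uppercase letter
```

The two answers disagree, so the classification is silently wrong for
ISO-8859-15 data even though the bytes themselves pass through intact.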

Duncan


