[Haskell-cafe] How does GHC read UNICODE.

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Tue May 20 07:03:27 EDT 2008


On Tue, 2008-05-20 at 09:30 +0200, Ketil Malde wrote:
> Don Stewart <dons at galois.com> writes:
> 
> > You can use either bytestrings, which will ignore any encoding, 
> 
> Uh, I am hesitant to voice my protest here, but I think this bears
> some elaboration:
> 
> Bytestrings are exactly that, strings of bytes.

Yes, we tried to make it explicit.

> Basically, bytestrings are the wrong tool for the job if you need more
> than 8 bits per character.

Right. ByteString is not intended for text, except for 8-bit formats that
mix binary data with ASCII, such as some network protocols and file formats.
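
As a small illustration of the point above, here is a sketch of what happens
when a character that needs more than 8 bits goes through
Data.ByteString.Char8. The names lambdaBytes and roundTrip are made up for
this example; only B8.pack and B8.unpack come from the bytestring package.

```haskell
import qualified Data.ByteString.Char8 as B8

-- Char8.pack keeps only the low 8 bits of each Char, so U+03BB
-- (GREEK SMALL LETTER LAMDA) is stored as the single byte 0xBB.
lambdaBytes :: B8.ByteString
lambdaBytes = B8.pack "\955"

-- Unpacking gives back U+00BB, not the original U+03BB:
-- the high bits were silently dropped by pack.
roundTrip :: String
roundTrip = B8.unpack lambdaBytes
```

Here roundTrip is the one-character string "\187", so the original character
cannot be recovered from the ByteString.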

> I think the predecessors of bytestring (FPS?) had support for other
> fixed-size encodings, that is, two-byte and four-byte characters.

I'm not sure about that, but there is the old Data.PackedString, which
uses UTF-32. There is no fixed-size two-byte encoding that covers all of
Unicode: UCS-2 can only represent the Basic Multilingual Plane, and UTF-16
is variable width.
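
To make the variable width of UTF-16 concrete, here is a minimal sketch of
encoding a single Char into UTF-16 code units using only the base library.
The function name utf16Units is hypothetical, and the sketch assumes its
input is not itself a surrogate code point (U+D800 to U+DFFF), which it does
not check.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as a list of UTF-16 code units.
-- Code points below U+10000 take one unit; all others take two
-- (a high surrogate followed by a low surrogate).
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000  -- 20-bit offset into the supplementary planes
```

For example, utf16Units 'A' is a single unit, 0x0041, while
utf16Units '\x1D11E' (MUSICAL SYMBOL G CLEF) is the pair 0xD834, 0xDD1E.
A two-byte fixed-width representation has no room for the second case.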

> Perhaps writing a Data.Word16String bytestrings-alike using UCS-2
> would be an option?

I'm supervising a master's student who is working on a proper Unicode ADT
with an API and underlying implementation similar to those of ByteString.
Hopefully people will be able to start using that, instead of ByteString,
as an internal representation of text.

Duncan


