[Haskell-cafe] PROPOSAL: New efficient Unicode string library.

Jonathan Cast jcast at ou.edu
Wed Sep 26 13:54:16 EDT 2007


On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:
> In message <1190825044.9435.1.camel at jcchost> Jonathan Cast <jcast at ou.edu> writes:
> > On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
> 
> > > If UTF-16 is what's used by everyone else (how about Java? Python?) I
> > > think that's a strong reason to use it. I don't know Unicode well
> > > enough to say otherwise.
> > 
> > I disagree.  I realize I'm a dissenter in this regard, but my position
> > is: excellent Unix support first, portability second, excellent support
> > for Win32/MacOS a distant third.  That seems to be the opposite of every
> > language's position.  Unix absolutely needs UTF-8 for backward
> > compatibility.
> 
> I think you're talking about different things, internal vs external representations.
> 
> Certainly we must support UTF-8 as an external representation. The choice of
> internal representation is independent of that. It could be [Char] or some
> memory-efficient packed format in a standard encoding such as UTF-8, UTF-16,
> or UTF-32. The choice depends mostly on ease of implementation and
> performance. Some formats are easier or faster to process, but there are also
> conversion costs, so in some use cases there is a performance benefit to the
> internal representation being the same as the external representation.
> 
> So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
> has the advantage of being the same as a common external representation so
> conversion is cheap (only need to validate rather than copy). UTF-8 is more
> compact for western languages but less compact for eastern languages compared to
> UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the
> common case UTF-16 is effectively fixed width. According to the ICU implementors
> this has speed advantages (probably due to branch prediction and smaller code size).
> 
> One solution is to do both and benchmark them.

OK, right.
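
For what it's worth, the size trade-off described above can be sketched in a
few lines. This is only an illustration under stated assumptions: the names
utf8Len, utf16Len and encodedSizes are made up for this example and belong to
no proposed library, and it counts encoded bytes per code point without
building any packed representation.

```haskell
import Data.Char (ord)

-- Hypothetical helper: bytes needed to encode one code point in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin supplements, Greek, Cyrillic
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane, incl. CJK
  | otherwise   = 4   -- supplementary planes
  where n = ord c

-- Hypothetical helper: bytes needed to encode one code point in UTF-16.
-- Everything in the Basic Multilingual Plane is a single 16-bit unit;
-- only supplementary-plane code points need a surrogate pair.
utf16Len :: Char -> Int
utf16Len c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- Total encoded size of a string as (UTF-8 bytes, UTF-16 bytes).
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Len s), sum (map utf16Len s))

main :: IO ()
main = do
  print (encodedSizes "hello")               -- (5,10)
  print (encodedSizes "\x65E5\x672C\x8A9E")  -- three CJK characters: (9,6)
```

The two-branch utf16Len against the four-branch utf8Len is also a crude
picture of the "effectively fixed width in the common case" point, though it
says nothing about real-world speed; that still needs the benchmark.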

jcc


