[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.

Deborah Goldsmith dgoldsmith at mac.com
Mon Oct 1 22:50:39 EDT 2007


Sorry for the long delay, work has been really busy...

On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
> On 2007-09-27, Aaron Denney <wnoise at ofb.net> wrote:
>>> Well, not so much. As Duncan mentioned, it's a matter of what the
>>> most common case is. UTF-16 is effectively fixed-width for the
>>> majority of text in the majority of languages. Combining sequences
>>> and surrogate pairs are relatively infrequent.
>>
>> Infrequent, but they exist, which means you can't seek x/2 bytes
>> ahead to seek x characters ahead.  All such seeking must be linear
>> for both UTF-16 *and* UTF-8.
>>
>>> Speaking as someone who has done a lot of Unicode implementation, I
>>> would say UTF-16 represents the best time/space tradeoff for an
>>> internal representation. As I mentioned, it's what's used in
>>> Windows, Mac OS X, ICU, and Java.
>
> I guess why I'm being something of a pain-in-the-ass here is that
> I want to use your Unicode implementation expertise to know what
> these time/space tradeoffs are.
>
> Are there any algorithmic asymptotic complexity differences, or are
> these all constant factors?  The constant factors depend on projected
> workload.  And are these actually tradeoffs, except between UTF-32
> (which uses native word sizes on 32-bit platforms) and the other two?
> Smaller space means smaller cache footprint, which can dominate.

Yes, cache footprint is one reason to prefer UTF-16 over UTF-32.
UTF-32's lack of surrogate pairs also doesn't buy you anything,
because you still have to handle multi-code-point sequences anyway,
such as combining marks and grapheme clusters.
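
To make the quoted seeking point concrete: even in UTF-16 you cannot
jump to the nth code point by arithmetic, because a lead surrogate
means the code point occupies two code units, so any seek has to scan.
A minimal sketch (my own illustration, not any proposed library's
API), using a plain list of Word16 code units; a real implementation
would use a packed buffer, but the scan is the same:

    import Data.Word (Word16)

    -- Skip past n code points in a stream of UTF-16 code units.
    -- O(n): each step has to inspect the unit to see whether it is
    -- a lead surrogate (0xD800..0xDBFF) opening a two-unit pair.
    advance :: Int -> [Word16] -> [Word16]
    advance 0 us = us
    advance _ [] = []
    advance n (u:us)
      | 0xD800 <= u && u <= 0xDBFF = advance (n - 1) (drop 1 us)
      | otherwise                  = advance (n - 1) us

And code points still aren't user-visible characters: "e" followed by
U+0301 COMBINING ACUTE ACCENT is two code points in every encoding
form, UTF-32 included, which is why the sequence handling has to be
there regardless.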

The best reference for all of this is:

http://www.unicode.org/faq/utf_bom.html

See especially:
http://www.unicode.org/faq/utf_bom.html#10
http://www.unicode.org/faq/utf_bom.html#12

Which data type is best depends on the purpose. If the data will
primarily be ASCII with an occasional non-ASCII character, UTF-8 may
be best. If the data is general Unicode text, UTF-16 is best. I would
think a Unicode string type would be intended for processing natural
language text, not just ASCII data.
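
For a rough feel for the space side of that tradeoff, here is a small
demonstration (my own illustration, written against the current text
and bytestring packages, not any API proposed in this thread):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let ascii = T.pack "hello, world"   -- ASCII only
          cjk   = T.pack "こんにちは世界" -- CJK text
      -- ASCII: UTF-8 takes half the bytes of UTF-16 (12 vs. 24).
      print (B.length (TE.encodeUtf8 ascii),
             B.length (TE.encodeUtf16LE ascii))
      -- CJK: UTF-8 takes 3 bytes per character, UTF-16 only 2
      -- (21 vs. 14).
      print (B.length (TE.encodeUtf8 cjk),
             B.length (TE.encodeUtf16LE cjk))

Text in the Latin-1 supplement, Greek, or Cyrillic ranges lands in
between: 2 bytes per character in both encodings.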

> Simplicity of algorithms is also a concern.  Validating a byte
> sequence as UTF-8 is harder than validating a sequence of 16-bit
> values as UTF-16.
>
> (I'd also like to see a reference to the Mac OS X encoding.  I know
> that the filesystem interface is UTF-8 (decomposed a certain way).
> Is it just that UTF-16 is a common application choice, or is there
> some common framework or library that uses that?)

UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,  
and is what appears in the APIs for all of them. UTF-16 is also what's  
stored in the volume catalog on Mac disks. UTF-8 is only used in BSD  
APIs for backward compatibility. It's also used in plain text files  
(or XML or HTML), again for compatibility.
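
On the validation point quoted above: in UTF-16 the only structural
rule is that surrogates must pair up, so validation is a couple of
comparisons per code unit.  A sketch of that check (again my own
illustration, over a plain list of code units):

    import Data.Word (Word16)

    validUtf16 :: [Word16] -> Bool
    validUtf16 [] = True
    validUtf16 (u:us)
      | 0xD800 <= u && u <= 0xDBFF =        -- lead: needs a trail
          case us of
            (v:vs) | 0xDC00 <= v && v <= 0xDFFF -> validUtf16 vs
            _                                   -> False
      | 0xDC00 <= u && u <= 0xDFFF = False  -- lone trail surrogate
      | otherwise = validUtf16 us

UTF-8 validation additionally has to reject overlong encodings, the
bytes 0xC0, 0xC1, and 0xF5..0xFF, encoded surrogates, and truncated
sequences, which takes a small state machine.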

Deborah


