[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 21:25:09 EDT 2010

Bulat Ziganshin wrote:
> Johan wrote:
>> So it's not clear to me that using UTF-16 makes the program
>> noticeably slower or use more memory on a real program.
> 
> it's clear misunderstanding. of course, not every program holds much
> text data in memory. but some does, and here you will double memory
> usage

I write programs that hold onto quite a good deal of natural language 
text; a few million words at least. Getting efficient Unicode for that 
is a high priority. However, all of that text is in Japanese, Chinese, 
Arabic, Hindi, Urdu, ... That's the reason I want Unicode. I'm pretty 
sure UTF-16 isn't going to be causing any special problems here.
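
To make the size question a bit more concrete, here is a rough sketch in 
Python (standard library only) that compares how many bytes a string takes 
in UTF-8 and in UTF-16. The function name and the sample strings are made 
up for illustration and are not drawn from any real corpus. It counts 
encoded bytes only, so it says nothing about the per-string overhead of 
any particular Haskell type, nor about speed.

```python
def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no byte-order mark is counted;
    # the plain "utf-16" codec would prepend a 2-byte BOM.
    utf16 = len(text.encode("utf-16-le"))
    return len(text), utf8, utf16


# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "English": "hello",
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",
    "Arabic": "\u0645\u0631\u062d\u0628\u0627",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        points, utf8, utf16 = encoded_sizes(text)
        print(f"{name}: {points} code points, "
              f"{utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

For code points in the Basic Multilingual Plane, kana, CJK ideographs, and 
Devanagari take three bytes each in UTF-8 against two in UTF-16, Arabic 
letters take two in both, and ASCII takes one against two. Real text mixes 
in ASCII spaces, digits, and markup, so the ratio on an actual corpus has 
to be measured rather than assumed.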

For NLP work, any language whose text is more or less ASCII isn't a problem. 
We've been shoving English and Western European languages into a subset 
of ASCII for years (heck, we don't even allow real parentheses!).

For the mostly English files on my hard drive, UTF-8 is a clear win. But 
when it comes to programming, I'm not so sure. I'd like to see some good 
benchmarks and a clear explanation of where the costs are. Intuitions 
are notoriously unreliable for these kinds of encoding issues.

-- 
Live well,
~wren