Haskell Platform Proposal: add the 'text' library

Tyson Whitehead twhitehead at gmail.com
Thu Oct 21 12:26:20 EDT 2010


On October 20, 2010 15:45:44 Axel Simon wrote:
> AFAIK there are scripts that have so many combinations that Unicode
> does not have a single codepoint for each character. In Arabic you
> can have one of 5 vowel signs on each of the 28 letters. But Unicode
> does not provide 5*28 codepoints for the combinations. That is
> probably the reason for having these combined characters.
> 
> Mac OS tries to split all the characters into as many codepoints as
> possible whereas Windows tries to merge them as much as possible. I
> don't think there is a good semantics for replace without knowing what
> (normal) form you're working on. Normally, search/replace and sorting
> on Unicode are specialized algorithms that cannot be reduced to simple
> substitutions or permutations.

Thanks to everyone for the examples.

Given that not all combined characters can be reduced to a single code point 
(from your first paragraph), it would seem that the decomposing normalization 
used by Mac OS has a conceptual advantage over the composing normalization 
used by Windows.

Specifically, it is appealing that the decomposed string is in some sense less 
complex, in that it contains only elementary codepoints (ones that can't be 
further decomposed) and combining codepoints.  The composed form would still 
contain a mix of precomposed and combining codepoints.

Am I correct then in understanding that, from the view of strings as a 
vector/list of elementary chars, the elementary chars would actually have to 
be a codepoint plus an arbitrary number of additional composition codepoints 
in order to correspond well to the human notion of a character?

This then doesn't map well onto the existing vector/list style interfaces 
because this elementary char type is not a simple enumeration to be treated 
atomically.  Operations would actually need to frequently look inside it 
(e.g., replace base codepoints irrespective of the compositional codepoints).
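
To make the question concrete, here is a minimal sketch of that view using 
only Data.Char from base.  The names Cluster, isCombining, toClusters, 
fromClusters and replaceBase are made up for illustration and are not part of 
the 'text' library.  The sketch assumes the input is already decomposed and 
treats any codepoint in a Unicode mark category as a composition codepoint, 
which is much cruder than real Unicode text segmentation.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical "elementary char": a base codepoint plus the
-- combining codepoints that follow it.
data Cluster = Cluster
  { clusterBase  :: Char
  , clusterMarks :: [Char]
  } deriving (Eq, Show)

-- | True for codepoints in the Unicode mark categories (Mn, Mc, Me).
-- Assumption: this is a good enough test for "composition codepoint".
isCombining :: Char -> Bool
isCombining c = case generalCategory c of
  NonSpacingMark       -> True
  SpacingCombiningMark -> True
  EnclosingMark        -> True
  _                    -> False

-- | Group a decomposed string into clusters.  A mark at the very start
-- of the string has no base, so it becomes the base of its own cluster.
toClusters :: String -> [Cluster]
toClusters []       = []
toClusters (c : cs) = Cluster c marks : toClusters rest
  where
    (marks, rest) = span isCombining cs

-- | Flatten clusters back into a plain list of codepoints.
fromClusters :: [Cluster] -> String
fromClusters = concatMap (\(Cluster b ms) -> b : ms)

-- | Replace a base codepoint irrespective of the marks attached to it.
-- The function has to look inside each cluster to do this.
replaceBase :: Char -> Char -> String -> String
replaceBase old new = fromClusters . map swap . toClusters
  where
    swap cl
      | clusterBase cl == old = cl { clusterBase = new }
      | otherwise             = cl
```

For example, replaceBase 'a' 'o' "a\x0301n" turns an 'a' carrying a combining 
acute accent (U+0301) into an 'o' carrying the same accent, whereas a plain 
Char-for-Char substitution on a composed string would not see the 'a' at all.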

Cheers!  -Tyson