Haskell Platform Proposal: add the 'text' library

wren ng thornton wren at freegeek.org
Tue Oct 19 22:36:47 EDT 2010


On 10/19/10 7:59 PM, Ross Paterson wrote:
> On Wed, Oct 20, 2010 at 12:35:33AM +0100, Duncan Coutts wrote:
>> On 19 October 2010 22:08, Roman Leshchinskiy <rl at cse.unsw.edu.au> wrote:
>>> On 19/10/2010, at 15:22, John Lato wrote:
>>>> I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not.  I think this difference is enough to warrant a break from the list API.
>>>
>>> Are you sure? From its interface Text looks exactly like a list of Chars to me.
>>
>> Right, that's a very common misunderstanding of Unicode. A Unicode
>> code point (type Char) does not correspond 1:1 with the human notion
>> of a character. It would be nice if it did, but unfortunately it is
>> not something we can ignore. Because of this it is better not to think
>> of operations on individual Chars but on short sequences of Chars. In
>> any case, when processing text (even ASCII where Chars do match
>> characters) many of the most common operations that you want are
>> substring not element based.
>
> I believe Roman is referring to the Text API, which does indeed look a lot
> like the list API specialized to Char, with relatively few exceptions.
> The above would be an argument against including any of the functions
> with Char parameters, but a high proportion of them do.

<musing>
I almost wonder if it would be worth defining a new type, Character,
which does correspond 1:1 to the human notion of a "character" (being
intentionally vague about what exactly that means). Then Text could be
a vector/list/sequence of Characters, and we could give it the
appropriate interface for being thought of that way.

Of course, under the covers, Character would just be a newtype of 
Text[1] and so the bulk of text/text-icu implementation would need no 
changes.
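
To make that concrete, here is a minimal sketch of what I have in mind.
Everything in it is an assumption for illustration: the names toText
and characters are hypothetical and not part of the text library, and
the splitting rule (a base code point plus any combining marks that
follow it) is only a crude stand-in for a real definition of
"character". A proper version would presumably lean on text-icu for the
segmentation.

```haskell
import           Data.Char (isMark)
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical type: one human-perceived "character", stored as a
-- short Text. In a real module the constructor would not be exported,
-- so that the invariant could not be broken from outside.
newtype Character = Character Text
  deriving (Eq, Ord, Show)

-- Forget the invariant and get the underlying Text back.
toText :: Character -> Text
toText (Character t) = t

-- View a Text as a list of Characters.
-- ASSUMPTION: a "character" is approximated as one code point followed
-- by any run of combining marks. This is not full Unicode grapheme
-- cluster segmentation.
characters :: Text -> [Character]
characters = map Character . T.groupBy (\_ c -> isMark c)
```

With this, for the three code points 'e', U+0301 (combining acute
accent) and 'x', T.length reports 3 while length . characters reports
2, and concatenating the pieces with toText gives back the original
Text.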

At least, that might let us get out of this impasse over whether the
text library should match the vector/list/sequence APIs when Text is
not a vector/list/array of Char. It also helps to codify what we mean
by "a short sequence of Chars", which could allow some simplifying
assumptions in the algorithms being used (since there are often better
(X,X)->Y algorithms available when we know one of the Xs is much
smaller than the other).
</musing>


[1] Using a type alias seems like it'd be too easy to break the API 
idealization.

-- 
Live well,
~wren

