[Haskell-cafe] ANNOUNCE: text 0.8.0.0, fast Unicode text support

Daniel Fischer daniel.is.fischer at web.de
Wed Sep 1 10:37:20 EDT 2010


On Wednesday 01 September 2010 06:08:07, Bryan O'Sullivan wrote:
> New in this release:
>
>    - Substantial performance improvements.
>    - Bug fixes, and better quality assurance via automated QuickCheck
> and HPC tests and coverage.
>    - An even nicer API than before.

Good job.

>
> The text library provides an efficient packed, immutable Unicode text
> type (both strict and lazy), with a powerful loop fusion optimization
> framework.
>
> The 'Text' type represents Unicode character strings, in a time
> and space-efficient manner. This package provides text
> processing capabilities that are optimized for performance critical use,
> both in terms of large data quantities and high speed.

Hmm, I have to throw a few grains of salt into that.

File: big.txt from Peter Norvig's spelling checker (6.2MiB)

1. read the file and output it to stdout (redirected to /dev/null)
ByteString.Lazy: <= 0.01s
text-0.7.2.1 (Lazy): 0.68s
text-0.8.0.0 (Lazy): 0.26s
String: 0.35s

ByteString doesn't do any de/encoding or validity checking, so it's clear 
that can be way faster than I/O which does.
The interesting part is the comparison between text and vanilla String I/O, 
the difference is smaller than I expected for text-0.8.0.0. It came as a 
surprise that text-0.7.2.1 actually had much worse performance than vanilla 
Strings.

Okay, now we know how long the I/O takes, let's measure some work.

2. read the file, replace some string, output to stdout (redirected to 
/dev/null)
a) "Professor" -> "Teacher"
stringsearch (Lazy): 0.02s, 2MB total
text-0.7.2.1 (Lazy): 0.92s, 28MB total
text-0.7.2.1 (Strict): 0.90s, 65MB total
text-0.8.0.0 (Lazy): 0.44s, 12MB total
text-0.8.0.0 (Strict): 0.36S, 55MB total
String (naive): 0.51s, 1MB total
String (KMP): 0.46s, 1MB total

b) "deteriorate" -> "worsen"
stringsearch (Lazy): 0.02s, 2MB total
text-0.7.2.1 (Lazy): 1.01s, 39MB total
text-0.7.2.1 (Strict): 0.91s, 65MB total
text-0.8.0.0 (Lazy): 0.45s, 17MB total
text-0.8.0.0 (Strict): 0.36s, 55MB total
String (naive): 0.52s, 1MB total
String (KMP): 0.50s, 1MB total

c) "hen" -> "here"
stringsearch (Lazy): 0.04s, 2MB total
text-0.7.2.1 (Lazy): 1.10s, 2MB total
text-0.7.2.1 (Strict): 0.95s, 65MB total
text-0.8.0.0 (Lazy): 0.66s, 2MB total
text-0.8.0.0 (Strict): 0.41s, 55MB total
String (naive): 0.53s, 1MB total
String (KMP): 0.51s, 1MB total

d) "kre" -> "here"
stringsearch (Lazy): 0.03s, 2MB total
text-0.7.2.1 (Lazy): 1.23s, 39MB total
text-0.7.2.1 (Strict): 0.96s, 65MB total
text-0.8.0.0 (Lazy): 0.65s, 17MB total
text-0.8.0.0 (Strict): 0.40s, 55MB total
String (naive): 0.51s, 1MB total
String (KMP): 0.47s, 1MB total

The performance of text-0.8.0.0 has improved significantly over that of 
text-0.7.2.1 (for the tested features), the improvement of the replacing 
algorithm is however not as impressive as that of I/O.

Replacing isn't much faster than for Strings, however, for (some?) very 
short patterns, even slower.
That's not so good.

What's *really* bad is the space behaviour.

My tests suggest that the space usage (for Text.Lazy) depends on the index 
of the pattern to replace, the later it appears (if at all), the more 
memory is used by text. Apparently it doesn't deliver the beginning before 
a match has been found (or the whole string has been scanned).
That space leak ought to be fixed.

The space behaviour of the strict variant doesn't show that problem, it 
seems to only depend on the file, but it uses disproportionally much 
memory.


More information about the Haskell-Cafe mailing list