[Haskell-beginners] Fwd: Implementing a spellchecker - problem with Data.HashTable performance

Fri Apr 20 11:33:51 CEST 2012

Hello fellow Haskellers,

It's my first post to this list, so at first: Greetings to everyone!

I've wanted to pick up Haskell for some time and I've run in
to a perfect (at least it seemed so) oocasion at my Natural Language
Processing course. At first I tried using Python, but it turned out
that some extra performance wouldn't hurt and I hoped Haskell would
give me some boost without forcing me to sacrifice high-level
expressiveness at the same time.

The first lesson was to use ByteStrings instead of ordinary Strings
for IO. Then I needed to build a dictionary of words found in the
source text. That's the problematic part - I've tried using a few
structures, namely Data.Map and Data.HashTable from base package and
Data.HashMap from unordered-containers. In all cases I couldn't reach
acceptable performance. As of that, I'd like to ask about some advice
and comments about the following examples.

First example [1] just builds a HashTable of ~1 500 000 ByteStrings.
The program takes about 11 seconds to execute. Second example [2] does
roughly the same, but taking the HashTable data from a file (just a
dictionary of valid Polish forms). Reading data itself is not a
problem, as I've measured the execution time of a program which just
reads the file (and prints its last line to actually force the read,
hence "handle" Haskell's laziness) and it takes no more than a second.
Coupled together reading and building, however, takes unbearably long.
I didn't wait till the program stopped and just terminated it, but it
had been running for ~10 minutes already.

The most disappointing fact is that a roughly equivalent (and rather
effortless to write in contrast to the Haskell example - I'm just
learning though) Python test which just reads everything into a dict
takes about 10 seconds to execute (reading and updating the dict of
course). Writing the spellchecker in Python seemed pretty
straightforward for me (but its certainly not the only language I
know; I've had some Java (who didn't nowadays), system level C and am
daily working in Erlang). I understand of course, that the most
convenient tool is generally the one you know the best, but my
experience writing the Haskell equivalent reached both extremes -
effortless expression of the logic, but frustration because of the
unexplainable (for me at least) performance behaviour.

The question is why is it so slow? (or AM I RILY DAT DUMB? ;)

The link to examples:
[1] https://gist.github.com/2411699 TestHashTableArtificial
[2] https://gist.github.com/2411699 TestReadIntoHashTable

I'm looking forward to somebody shedding some light on this. Thanks in advance!

Best regards,
Radek Szymczyszyn