[Haskell-cafe] Re: Haskell version of Norvig's Python Spelling Corrector

Mon Apr 23 20:51:21 EDT 2007

Albert Y. C. Lai wrote:
> I try using WordSet = [String] (plus corresponding change in code) and 
> get great speedup, actually way more than 3x. There was also a memory 
> growth phenomenon using Set String, and replacement by [String] stops 
> that too, now it's constant space (constant = 20M). It is possible to 
> attribute part of the speedup to excellent rewrite rules in GHC 
> regarding lists; however, I cannot explain the memory growth when using 
> Set.

Now I see. The higher memory usage of the Set version is not growth or 
leak, just brutal. For each input word, a ton of derived words are 
generated, and an optimal one is chosen (by a criterion that's 
essentially a mapping from words to scores). There are two ways to do this.

You could put the derived words into a lazy list and then traverse it, 
and therefore they're generated, examined, and thrown away on demand. At 
most two words are ever going to be stored (the best one so far and the 
next candidate), maybe plus a couple of cons cells if the compiler does 
not optimize.

Or you could put all derived words into a set first (to avoid 
duplicates, but note that you still have to generate a ton, you only 
save by storing just half a ton), then traverse it. Storing a ton, or 
even just half a ton, of these words is going to clog up memory.
The perceived growth is only because some input words cause fewer 
derived words and some other input words encountered later cause more 
derived words. So, plotting memory over time, once in a while we have 
more derived words than before, so the heap grows for that; they are 
still thrown away correctly in due course. If you input the same input 
word over and over gain, space usage is still constant.