[Haskell-cafe] Spelling checker exercise

Matthew Phillips mattphil at gmail.com
Mon Feb 8 16:58:03 EST 2010


Just wanted to follow up on this, after getting distracted by less important things.

Thanks to Daniel's suggestions re the ByteString variant, the fairly faithful Haskell translation of Norvig's Python clocks in at about 1.7 times slower than spelling.py, most of which is down to the fact that "train" doesn't have an O(1) hashmap to use.

I tried using Data.Trie rather than Data.Map, but that made it slightly slower. A counting Bloom filter might be better, but that's going further than I have time for right now.
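
For anyone who doesn't want to dig through the source: the training step is
essentially just a word-frequency count into a Data.Map. The sketch below is
from memory rather than the exact code in the repo, but it has the right
shape; each lookup is O(log n) rather than the amortised O(1) of Python's
dict.

  import qualified Data.ByteString.Char8 as B
  import qualified Data.Map as M
  import Data.Char (isAlpha, toLower)

  -- Count word frequencies in the training corpus.
  train :: B.ByteString -> M.Map B.ByteString Int
  train = M.fromListWith (+) . map (\w -> (w, 1)) . extractWords
    where
      -- lower-case the text and split on non-letters
      extractWords = filter (not . B.null)
                   . B.splitWith (not . isAlpha)
                   . B.map toLower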

So, in "real" usage this simple version would probably be usable, since training would only be done once. Of course there are far cleverer algorithms, but I think the version I have so far retains the value that Norvig's has as a learning example.

The latest version I came up with is here:

  http://github.com/scramjet/spelling/blob/master/spelling.hs

Cheers, and thanks again,

Matthew.

P.S.

> Daniel:

> [*] classic example: why will
> 
> average xs = sum / len
>   where
>      (sum,len) = foldl' accum (0,0) xs
>      accum (sm,ln) x = (sm+x,ln+1)
> 
> cause a stack overflow for long lists?

You gave a strong hint before this, so I'd guess it's due to the lazy tuple creation in "accum"?
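
If that's the problem, I gather the usual fix is to force the accumulator's
components so the partial sums can't pile up as thunks; a sketch of my own
(using bang patterns), which may or may not be what you had in mind:

  {-# LANGUAGE BangPatterns #-}
  import Data.List (foldl')

  average :: [Double] -> Double
  average xs = sum / len
    where
      (sum, len) = foldl' accum (0, 0) xs
      -- forcing sm and ln on every step keeps foldl' from carrying
      -- a pair of ever-growing (+) thunks
      accum (!sm, !ln) x = (sm + x, ln + 1)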

On 25/01/2010, at 5:49 PM, Daniel Fischer wrote:

> Am Montag 25 Januar 2010 05:34:50 schrieb Matthew Phillips:
>> On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
>> 
>> <...>
>> 
>>> I think I know what happened here:
>>> 
>>> $ ghc -fforce-recomp --make matthew -o matthew0
>> 
>> <...>
>> 
>>> I habitually compile all code with -O2, unless I have a specific
>>> reason not to. I tend to forget that some compile without
>>> optimisations. For some kinds of code that makes hardly any
>>> difference, for others it makes a big difference.
>> 
>> I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not*
>> -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would
>> turn on optimisation, but of course that's just the flag for GCC.
> 
> By the way, compiling via C nowadays hardly ever produces faster code than 
> the native code generator (most times I tried since 6.10, if there was a 
> difference, native was faster).
> 
>> 
>>  $ time ./spelling becuase
>>  becuase -> because
>> 
>>  real	0m4.467s
>>  user	0m3.865s
>>  sys	0m0.280s
>> 
>>  $ time ./spelling_df becuase
>>  becuase -> because
>> 
>>  real	0m2.422s
>>  user	0m2.198s
>>  sys	0m0.081s
>> 
>> Your previous version is close to twice as fast, and now only 2.5 times
>> slower than Python.
>> 
>> <snipped new version of code with toLower removed>
>> 
>> With your suggested changes, your latest version on my machine:
>> 
>>  $ time ./spelling_df becuase
>>  becuase -> because
>> 
>>  real	0m1.731s
>>  user	0m1.572s
>>  sys	0m0.056s
>> 
>> Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
>> 
>>>>> But once you start needing two edits (try korrekt), correct and
>>>>> edits1 start to show up. That shows that Norvig's algorithm isn't
>>>>> really good. With two edit steps, you create a _lot_ of strings you
>>>>> need to look up, far more than there are in the map. That takes
>>>>> time. It'll be better to scan the map for entries with an edit
>>>>> distance (< 3) if you have a good method to check that
>>> 
>>> Indeed:
>>> $ time ./nLDBSWSpelling becuase
>>> becuase -> because
>>> 2.84user 0.02system 0:02.86elapsed 100%CPU
>>> $ time ./nLDBSWSpelling korrekt
>>> korrekt -> correct
>>> 2.83user 0.05system 0:02.88elapsed 100%CPU
>> 
>> Not sure if I see what you're saying here: do you mean to point out the
>> 2.86 vs 2.88 elapsed?
> 
> Well, above the code, I had the times for Norvig's algorithm (creating all 
> two-step edits and checking which are known):
> 
>>> And, without profiling:
>>> $ time ./spellingBSW becuase
>>> becuase -> because
>>> 2.84user 0.03system 0:02.88elapsed 99%CPU
>>> 
>>> Finally, building the set of two-step edits takes longer than the
>>> additional lookups:
>>> 
>>> $ time ./spellingBSW becuase
>>> becuase -> because
>>> 2.84user 0.03system 0:02.88elapsed 99%CPU
>>> $ time ./spellingBSW korrekt
>>> korrekt -> correct
>>> 3.50user 0.02system 0:03.52elapsed 100%CPU
>>> 
>>> vs.
>>> 
>>> $ time ./spellingBSWL becuase
>>> becuase -> because
>>> 2.79user 0.04system 0:02.83elapsed 100%CPU
>>> $ time ./spellingBSWL3 korrekt
>>> korrekt -> correct
>>> 3.20user 0.02system 0:03.23elapsed 99%CPU
> 
> For "becuase", which is a one-step edit, all take the same time (within 
> normal fluctuation), 2.8x seconds.
> For the two-step edit "korrekt", the version building Sets takes 3.5 
> seconds, the version which doesn't build a Set of two-step edits takes 3.2 
> seconds, *and the version scanning the Map of known words for entries with 
> a Levenshtein distance [modified to account for transpositions] less than 
> 3, takes 2.8y seconds, practically the same time as for the one-step edit*.
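
(An aside from me for anyone reading along: the "scan the map" approach, as I
understand it, looks roughly like the sketch below. This is a naive
reconstruction, not Daniel's nLDBSWSpelling code, which uses a lazy, bounded
distance and also counts transpositions as single edits.)

  import qualified Data.Map as M

  -- Plain one-row Levenshtein distance: no transpositions, no early cutoff.
  levenshtein :: String -> String -> Int
  levenshtein s t = last (foldl nextRow [0 .. length s] t)
    where
      nextRow prev c = scanl step (head prev + 1) (zip3 s prev (tail prev))
        where
          step left (sc, diag, above) =
            minimum [left + 1, above + 1, diag + if sc == c then 0 else 1]

  -- Keep only the known words within two edits of the misspelling.
  closeMatches :: String -> M.Map String Int -> M.Map String Int
  closeMatches w = M.filterWithKey (\k _ -> levenshtein k w < 3)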
> 
> It becomes more obvious, perhaps, if we test a couple of two-steppers:
> 
> Lazy Levenshtein:
> 
> time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg
> becrase -> because
> korrekt -> correct
> halmos -> holmes
> territoir -> territory
> gutzenperg -> gutenberg
> 2.94user 0.03system 0:02.97elapsed 100%CPU
> 
> just something like 0.1 - 0.15 seconds more than for "becuase", which makes
> 0.02 - 0.04 seconds per two-edit correction.
> 
> Sets:
> 
> $ time ./spellingBSW becrase korrekt halmos territoir gutzenperg
> becrase -> because
> korrekt -> correct
> halmos -> holmes
> territoir -> territory
> gutzenperg -> gutenberg
> 7.48user 0.03system 0:07.55elapsed 99%CPU
> 
> Good grief! That takes almost a second per two-edit correction.
> 
> List:
> 
> $ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg
> becrase -> because
> korrekt -> correct
> halmos -> holmes
> territoir -> territory
> gutzenperg -> gutenberg
> 5.54user 0.04system 0:05.58elapsed 99%CPU
> 
> Better, but still bad, roughly half a second per correction.
> 
> Python without psyco:
> 
> $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
> Trained in  1.46993017197 seconds
> because
> correct
> holmes
> territory
> gutenberg
> 3.00user 0.08system 0:03.09elapsed 99%CPU
> $ time python ./norvig.py becuase
> Trained in  1.49027395248 seconds
> because
> 1.46user 0.08system 0:01.55elapsed 100%CPU
> 
> about 0.3 seconds per correction
> 
> and with psyco:
> 
> $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
> Trained in  1.22417902946 seconds
> because
> correct
> holmes
> territory
> gutenberg
> 2.12user 0.09system 0:02.21elapsed 100%CPU
> $ time python ./norvig.py becuase
> Trained in  1.17486715317 seconds
> because
> 1.14user 0.08system 0:01.23elapsed 99%CPU
> 
> about 0.2 seconds per correction.
> 
> (Just for the record, I found out what Python did in the half second I
> couldn't account for: it looked for matches for "./norvig.py"; I had forgotten
> that Python passes the name of the script in argv[0] - d'oh)
> 
>> 
>>>>> Another thing is
>>>>> 
>>>>> allWords = keysSet wordCounts
>>>>> 
>>>>> Ouch. For each correction, you construct that set anew. Just use
>>>>> member from Data.Map instead of Data.Set.member and look up the
>>>>> words in the map.
>>>> 
>>>> Whoops! Just to be clear though: Haskell will memoise the result of
>>>> "allWords" for a given invocation of "correct"?
>>> 
>>> Yes. But not across different invocations.
>>> 
>>>> So this would only make a large difference for multiple corrections?
>>> 
>>> Right. But that's the interesting use case, isn't it?
>> 
>> It will be when I get the rest of it working, yes :)
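
(For reference, my understanding of the suggested change is simply the
following; the names and types here are mine, not the actual code:)

  import qualified Data.Map as M
  import qualified Data.ByteString.Char8 as B

  -- Instead of rebuilding a Set of keys for every correction, e.g.
  --   allWords = keysSet wordCounts
  --   known ws = filter (`Set.member` allWords) ws
  -- look the candidates up in the Map directly:
  known :: M.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
  known wordCounts = filter (`M.member` wordCounts)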
>> 
>>>>>> I will look at using Bloom Filters or
>>>>>> Trie’s instead of Data.Map, but I wonder if readFile should be
>>>>>> taking nearly %30 of the run time, even for a 6MB file?
>>>>> 
>>>>> No way. But it doesn't seem to, from my GHC's point of view.
>>>> 
>>>> Just to be sure I wasn't using the SCC incorrectly, I split out the
>>>> readFile call into "myReadFile". The prof line comes out as:
>>>> 
>>>> myReadFile            Main         210           1  35.8    8.6    35.8    8.6
>>>> 
>>>> i.e. 35.8%  of runtime.
>>> 
>>> Can I see the exact code which gives that profile?
>>> Somehow, things which shouldn't be are getting attributed to readFile.
>> 
>> The version at this link has myReadFile split out.
>> 
>> http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs
>> 
>> Doing the same to your version has the same result:
>> 
>>  myReadFile       Main       210           1  45.7    9.6    45.7    9.6
>> 
>> It does seem that the profiler is wrong or misleading somehow.
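
(The split-out is nothing more than giving the call its own top-level name so
the profiler assigns it a cost centre, roughly:)

  import qualified Data.ByteString.Char8 as B

  -- just B.readFile under its own name, so its time and allocation show
  -- up against "myReadFile" in the .prof output
  myReadFile :: FilePath -> IO B.ByteString
  myReadFile = B.readFile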
> 
> Strange. Doing that here has no such effect. The only line in the profile 
> (ByteString) where myReadFile occurs is
> 
>   myReadFile  Main  260   1   0.0  0.9   0.0    0.9   0   6507348
> 
> (snipped whitespace), entered once, allocated 6.5 million bytes, took no measurable time.
> 
> That's the case both when compiling via C and with the native code generator.
> 
> With String IO, it costs 4.6% time and 8.1% allocation.
> 
>> 
>> Two other quick questions:
>> 
>> (1) you added parentheses to "candidates":
>> 
>>  candidates = known [word] `or` ((known e1) `or` known_edits2)
>> 
>> The "or"'s should short-circuit so that if "known [word]" is non-empty,
>> the others shouldn't be evaluated. I read the above as forcing
>> evaluation of the second "or" first: am I wrong?
> 
> Au contraire, 
> "Any operator lacking a fixity declaration is assumed to be infixl 9" 
> says the report, so without parentheses, it's parsed as
> 
> (known [word] `or` known e1) `or` known_edits2
> 
> and both 'or's must be evaluated even if word is known. Since 'or' is lazy 
> in its second argument, if we associate it
> 
> known [word] `or` (known e1 `or` known_edits2)
> 
> , in case of a known word, we needn't look inside the parentheses at all (it
> will probably not make a measurable difference in running time unless you
> check a couple of million known words, but still, it feels lazier). For a
> local definition, I thought an explicit fixity declaration was overkill.
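
(For anyone following along: the "or" in question is a left-biased choice that
never looks at its second argument when the first is non-empty. A list-based
sketch, together with the fixity declaration Daniel mentions as the
alternative to parentheses:)

  import Prelude hiding (or)

  -- return the first argument unless it is empty; lazy in the second,
  -- so with right association the parenthesised part is never forced
  -- when the word is already known
  or :: [a] -> [a] -> [a]
  or [] ys = ys
  or xs _  = xs

  infixr 2 `or`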
> 
>> 
>> (2) you eliminated the "fold" in "correct" in favour of a tail-recursive
>> search in "maxCount": was this for style or performance reasons (or both
>> :)?
> 
> Performance, kind of. Since the lists we fold over are in fact short, it 
> doesn't really matter, but if they were long, there'd be the risk of 
> unnecessarily allocating lots of pairs. I'm rather confident that with -O2, 
> GHC will eliminate the pairs anyway, but without looking at the core, I'm 
> not entirely sure.
> However, in fact it was an unthought-through rewrite: I had just been helping
> someone who got a stack overflow in spite of foldl', due to the laziness of
> the data constructors [*]. So I made sure that couldn't happen here, without
> checking whether the folded function already prevents it. In fact it does, so
> using a foldl' if you use lists instead of Sets is fine, and with respect to
> style even preferable.
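
(To illustrate the point about the folded function: the list-based foldl'
version I imagine is something like the sketch below; the names and types are
my guesses, not the code from the thread. The comparison in the step forces
the running count at each step, which is what stops the thunks piling up.)

  import Data.List (foldl')
  import qualified Data.Map as M
  import qualified Data.ByteString.Char8 as B

  -- Pick the candidate with the highest count in the training map.
  maxCount :: M.Map B.ByteString Int -> [B.ByteString] -> B.ByteString
  maxCount wordCounts = fst . foldl' pick (B.empty, 0)
    where
      pick best@(_, bestN) w
        | n > bestN = (w, n)
        | otherwise = best
        where n = M.findWithDefault 0 w wordCounts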
> 
> [*] classic example: why will
> 
> average xs = sum / len
>   where
>      (sum,len) = foldl' accum (0,0) xs
>      accum (sm,ln) x = (sm+x,ln+1)
> 
> cause a stack overflow for long lists?
> 
>> 
>> Cheers,
>> 
>> Matthew.
> 



More information about the Haskell-Cafe mailing list