[Haskell-cafe] Spelling checker exercise

Matthew Phillips mattphil at gmail.com
Sun Jan 24 23:34:50 EST 2010


On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:

<...>

> I think I know what happened here:
> 
> $ ghc -fforce-recomp --make matthew -o matthew0

<...>

> I habitually compile all code with -O2, unless I have a specific reason not 
> to. I tend to forget that some compile without optimisations.
> For some kinds of code that makes hardly any difference, for others it 
> makes a big difference.

I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.
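For anyone following along, the invocation I should have used looks something like this (flag set is illustrative; `-optc-O2` only affects the C code produced by `-fvia-C`, while GHC's own optimiser needs `-O2`):

```shell
$ ghc -O2 -funbox-strict-fields --make spelling.hs -o spelling
```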

  $ time ./spelling becuase
  becuase -> because

  real	0m4.467s
  user	0m3.865s
  sys	0m0.280s

  $ time ./spelling_df becuase
  becuase -> because

  real	0m2.422s
  user	0m2.198s
  sys	0m0.081s

Your previous version is close to twice as fast, and now only 2.5 times slower than Python.

<snipped new version of code with toLower removed>

With your suggested changes, your latest version on my machine:

  $ time ./spelling_df becuase
  becuase -> because

  real	0m1.731s
  user	0m1.572s
  sys	0m0.056s

Now, we're getting close: 4.7s -> 2.3s -> 1.7s.

>>> But once you start needing two edits (try korrekt), correct and edits1
>>> start to show up. That shows that Norvig's algorithm isn't really
>>> good. With two edit steps, you create a _lot_ of strings you need to
>>> look up, far more than there are in the map. That takes time. It'll be
>>> better to scan the map for entries with an edit distance (< 3) if you
>>> have a good method to check that
> 
> Indeed:
> $ time ./nLDBSWSpelling becuase
> becuase -> because
> 2.84user 0.02system 0:02.86elapsed 100%CPU
> $ time ./nLDBSWSpelling korrekt
> korrekt -> correct
> 2.83user 0.05system 0:02.88elapsed 100%CPU

I'm not sure I see what you're saying here: do you mean to point out the 2.86 vs 2.88 seconds elapsed?
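(As an aside, a brute-force version of "scan the map for entries with an edit distance < 3" might look like the sketch below. This is purely my illustration, not the code behind nLDBSWSpelling, and a serious version would abandon a row as soon as its minimum exceeds the bound.)

```haskell
import qualified Data.Map as Map

-- Plain dynamic-programming Levenshtein distance, one row at a time.
-- When the characters match we can take the diagonal value directly,
-- since adjacent DP cells differ by at most 1.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl step [0 .. length s] t)
  where
    step row@(d:ds) c = scanl go (d + 1) (zip3 s row ds)
      where
        go left (sc, diag, up)
          | sc == c   = diag
          | otherwise = 1 + minimum [left, diag, up]

-- Keep only the dictionary entries within edit distance 2 of the query,
-- instead of generating and looking up every possible 2-edit variant.
closeMatches :: String -> Map.Map String Int -> Map.Map String Int
closeMatches w = Map.filterWithKey (\k _ -> levenshtein w k < 3)
```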

>>> Another thing is
>>> 
>>> allWords = keysSet wordCounts
>>> 
>>> Ouch. For each correction, you construct that set anew. Just use
>>> member from Data.Map instead of Data.Set.member and look up the words
>>> in the map.
>> 
>> Whoops! Just to be clear though: Haskell will memoise the result of
>> "allWords" for a given invocation of "correct"?
> 
> Yes. But not across different invocations.
> 
>> So this would only make a large difference for multiple corrections?
> 
> Right. But that's the interesting use case, isn't it?

It will be when I get the rest of it working, yes :)
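For reference, my understanding of the suggested change, in a minimal sketch (the definitions are my own reconstruction, not the exact code from either version):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type WordCounts = Map.Map String Int

-- Before: build a Set of all keys, then test membership in that Set.
knownViaKeysSet :: WordCounts -> [String] -> Set.Set String
knownViaKeysSet wordCounts ws =
  let allWords = Map.keysSet wordCounts
  in  Set.fromList [w | w <- ws, w `Set.member` allWords]

-- After: test membership in the Map directly; no intermediate Set.
known :: WordCounts -> [String] -> Set.Set String
known wordCounts ws =
  Set.fromList [w | w <- ws, w `Map.member` wordCounts]
```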

>>>> I will look at using Bloom Filters or
>>>> tries instead of Data.Map, but I wonder if readFile should be taking
>>>> nearly 30% of the run time, even for a 6MB file?
>>> 
>>> No way. But it doesn't seem to, from my GHC's point of view.
>> 
>> Just to be sure I wasn't using the SCC incorrectly, I split out the
>> readFile call into "myReadFile". The prof line comes out as:
>> 
>>  myReadFile            Main         210           1  35.8    8.6    35.8    8.6
>> 
>> i.e. 35.8% of runtime.
>> 
> 
> Can I see the exact code which gives that profile?
> Somehow, things which shouldn't must be attributed to readFile.

The version at this link has myReadFile split out.

http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs

Doing the same to your version has the same result:

  myReadFile       Main       210           1  45.7    9.6    45.7    9.6

It does seem that the profiler is wrong or misleading somehow.
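(One possible explanation, which is my assumption rather than anything confirmed in the thread: readFile is lazy, so the file is opened here but actually read when the contents are demanded, and with a cost centre attached the work of forcing those thunks can be charged back to myReadFile rather than to the code that consumes the string. A hypothetical reconstruction of the split-out function:)

```haskell
-- Hypothetical reconstruction of the split-out cost centre, not the
-- exact code from the thread.  readFile returns immediately with a
-- lazily-read string; the real I/O happens as the string is consumed,
-- but may still be attributed to this SCC in the profile.
myReadFile :: FilePath -> IO String
myReadFile path = {-# SCC "myReadFile" #-} readFile path
```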

Two other quick questions:

(1) you added parentheses to "candidates":

  candidates = known [word] `or` ((known e1) `or` known_edits2)

The "or"s should short-circuit, so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
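(A quick experiment suggests the right-hand side is not forced, assuming "or" is defined along Norvig's lines; the definition below is my stand-in, since I'm going from memory. Under laziness the parentheses only change association, not what gets evaluated:)

```haskell
import qualified Data.Set as Set

-- Left-biased choice in the style of Norvig's `or`: keep the left set
-- unless it is empty.  Because Haskell is non-strict, `b` is only
-- evaluated when `Set.null a` is True.
or' :: Set.Set String -> Set.Set String -> Set.Set String
or' a b = if Set.null a then b else a

-- The right-hand side would blow up if it were ever demanded.
demo :: Set.Set String
demo = Set.fromList ["because"] `or'` error "right side never forced"
```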

(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
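(For context, the two shapes I mean, reconstructed from memory rather than copied from either version; both are partial on an empty candidate list, which the real code rules out:)

```haskell
import Data.List (foldl')
import qualified Data.Map as Map

-- Fold style: pairwise maximum by count, looking up both words each step.
maxByCountFold :: Map.Map String Int -> [String] -> String
maxByCountFold counts (w:ws) = foldl' pick w ws
  where
    pick best c
      | count c > count best = c
      | otherwise            = best
    count c = Map.findWithDefault 0 c counts

-- Explicit tail recursion carrying the best count seen so far, which
-- avoids re-looking-up the current best word on every comparison.
maxByCountRec :: Map.Map String Int -> [String] -> String
maxByCountRec counts (w:ws) = go w (count w) ws
  where
    go best _     []     = best
    go best bestN (c:cs)
      | n > bestN = go c n cs
      | otherwise = go best bestN cs
      where n = count c
    count c = Map.findWithDefault 0 c counts
```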

Cheers,

Matthew.


More information about the Haskell-Cafe mailing list