need help optimizing a function

Andrew J Bromage ajb@spamcop.net
Thu, 10 Oct 2002 13:19:19 +1000


G'day all.

On Wed, Oct 09, 2002 at 02:29:26PM -0400, David Roundy wrote:

> I get a speedup of about a factor of 6 for the test files I was using (of
> course, this would depend on file size), and find that now only 2% of my
> time is spent in that function.  I'm still something like 100 times slower
> than GNU diff, so I think (hope, certainly!) there's still room for more
> improvement (another day).  I'm sure I'll have more questions in a few
> days, once I've tracked down what the new bottlenecks are.

If you understand logic languages, you might want to look at the diff
implementation which I wrote for the Mercury distribution:

	http://www.cs.mu.oz.au/research/mercury/

It's pretty close to GNU diff in speed.  In fact, it was
indistinguishable on every test case I could think of.

There are, two main differences to what I could gather from your 
implementation.

First thing was that I noticed that a lot of time was being spent
doing string comparisons.  I inserted a pre-pass which mapped strings
to (unboxed) integers with the constraint that the integers are equal
if and only if the strings are equal.  This also turned out to be
critical for implementing flags such as --ignore-space-change.

The other was I used a different algorithm than you did:

	Eugene W. Myers. "An O(ND) difference algorithm and its
		variations." Algorithmica, 1:251-266, 1986. 

I found it to be much faster than the O(n log n) algorithm, even
on cases where you would expect it to perform poorly (i.e. where
D is large), partly because the constant factors are really, really
low and because in the pre-pass you can optimise the case where you
have a number of consecutive lines none of which appear anywhere
else, which is typical for most uses of diff.

Cheers,
Andrew Bromage