[Haskell-cafe] Re: Memoizing longest-common-subsequence

Mon Aug 21 05:44:45 EDT 2006

> From: Jared Updike [mailto:jupdike at gmail.com] 
> 
> Ian's LCS code looks like it works fine in terms of space usage, but
> for large input (lengrh as == 40000, length bs == 50000) it seems to
> be way too slow in terms of time complexity (GNU sdiff executes on
> this same data in a few seconds, the Hunt-Szymanski LCS algorithm
> didn't terminate, even after like 10 minutes).
> 
> If it's not too much trouble, could you send Myers STArray based code?
> 
> Thanks,
>   Jared.

(Sorry for the late reply; have been on holiday.)

Included below. I used to have a unit test module, and also a pure
DiffArray-based implementation for performance comparison (it was
sloooow), but I lost them when my laptop died (a harsh lesson in
infrequent backups), so all I have left is the implementation module
now.

I've used it to diff fairly large files (hundreds of K's, if not Megs)
where there were few differences. It seemed to perform OK, and in cases
where GNU diff (or whatever comes with MSYS) failed.

Alistair

module Diff (diff, diffArray, diffSTArray, Modification(..)) where

{-
See:
  http://www.eecs.berkeley.edu/~gene/Papers/np_diff.pdf
  http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps
    (An O(NP) sequence comparison algorithm)

The algorithm in the paper is:

    compare(src, dst)
      M := length src; N := length dst; delta := N - M;
      fp[-(M+1)..(N+1)] := -1;
      p := -1;
      repeat
        p++;
        for k := -p to delta-1 do
          fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
        for k := delta+p downto delta+1 by -1 do
          fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
        k := delta;
        fp[k] := snake(k, max(fp[k-1] + 1, fp[k+1]));
      until fp[delta] = N
      print "edit distance " + (delta + 2*p)
    end;

    snake(k, y)
      x := y - k;
      while x < M and y < N and src[x+1] = B[y+1] do
        x := x + 1; y := y + 1;
      return y;
    end;

Notes:

1. In the paper, y refers to the column position,
and x is the row number.
This is not conventional! i.e. x and y are swapped.
In this implementation we use x and y in the conventional
manner, and store the x value in the vertex array.

3. The algorithm as stated in the paper does not indicate
how to determine when a move down or right occurs.
I've assumed it's any update to a diagonal,
where the new x or y value is within the dst or src bounds.

4. We need to set vtx[0] = 0. This handles the case where the
destination string is empty.
In this case, the first move should be a delete i.e. a move down.
If vtx[0] = -1 then the nextEdit function would choose
a move right to diagonal 0 (this would put us further along
the diagonal than a move down), but we only want to move down,
so this is the wrong move. Hence we set vtx[0] = 0.

Note that the algorithm appears to work for all other cases
(including empty src and non-empty dst) when vtx[0] is
either 0 or -1.
All other entries of vtx[] are initialised to -1.

5. The algorithm in the paper only handles the case where
length dst >= length src
i.e. where delta >= 0.
How do we modify it to handle delta < 0?

The rules for generating digaonals for delta >= 0 are:
[-p up-to d-1]  (below delta)
[p+d downto d+1]  (above delta)
[d]

And the rules for delta < 0:
[p downto d+1]  (above delta)
[-d-p up-to d-1]  (below delta)
[d]

These are derived from the "recurrence relation" which is:
x[k,p] =
  k < delta | snake( max( x[k-1,p  ] + 1, x[k+1,p-1] ) )
  k > delta | snake( max( x[k-1,p  ] + 1, x[k+1,p  ] ) )
  k = delta | snake( max( x[k-1,p-1] + 1, x[k+1,p  ] ) )

Look at patterns for examining diagonals:

delta = 0:
0 | p=0: 0  p=1: -1,1,0  p=2: -2,-1,2,1,0  p=3: -3,-2,-1,3,2,1,0 p=4:
-4,...

delta > 0:
1 | p=0: 0,1  p=1: -1,0,2,1  p=2: -2,-1,0,3,2,1  p=3: -3,-2,-1,0,4,3,2,1
p=3: -4,...
2 | p=0: 0,1,2  p=1: -1,0,1,3,2  p=2: -2,-1,0,1,4,3,2  p=3: -3,...

First iteration goes from 0 to delta (outside in).
Let's see what pattern is like for delta < 0.

delta < 0
-1| p=0: 0,-1  p=1: 1,0,-2,-1  p=2: 2,1,0,-3,-2,-1
-2| p=0: 0,-1,-2  p=1: 1,0,-2,-1  p=2: 2,1,0,-3,-2,-1

-}

import Data.Array.ST as STA
import Data.Array
import Control.Monad.ST

data Modification a = Ins !Int !a | Del !Int !a
  deriving (Show, Eq)

inBounds arr i = case STA.bounds arr of (l, u) -> i >= l && i <= u
upperBound arr = case STA.bounds arr of (_, u) -> u
isUpperBound arr i = i == upperBound arr

-- list interface
diff :: Eq a => [a] -> [a] -> [Modification a]
diff [] [] = []
diff src dst = runST ( do
  srca <- makeArray1 src
  dsta <- makeArray1 dst
  diffSTArray srca dsta )

-- Array interface
diffArray :: Eq a => Array Int a -> Array Int a -> [Modification a]
diffArray src dst = runST ( do
  -- We can use unsafeThaw because we don't modify src or dst
  srca <- unsafeThaw src
  dsta <- unsafeThaw dst
  diffSTArray srca dsta )

-- STArray interface
diffSTArray :: Eq a => STArray s Int a -> STArray s Int a -> ST s
[Modification a]
diffSTArray src dst = do
  let
    boundLower = 1 + upperBound src
    boundUpper = 1 + upperBound dst
    delta = boundUpper - boundLower
  vtx <- makeArray boundLower boundUpper (-1)
  writeArray vtx 0 0
  edt <- makeArray boundLower boundUpper []
  diffLoop src dst vtx edt 0 delta

-- Need to help type-checker with array types
makeArray :: Int -> Int -> a -> ST s (STArray s Int a)
makeArray lower upper val = newArray (-lower, upper) val

makeArray1 :: [a] -> ST s (STArray s Int a)
makeArray1 str = newListArray (1, length str) str

diffLoop src dst vtx edt p delta = do
  updateForP src dst vtx edt p delta
  fin <- finished dst vtx delta
  if fin
    then do
      l <- readArray edt delta
      return (reverse l)
    else diffLoop src dst vtx edt (p+1) delta

finished dst vtx delta = do
  x <- readArray vtx delta
  return (isUpperBound dst x)

makeDiagsIncr from to = [from..to]
makeDiagsDecr from to = reverse [to..from]

-- process diagonals within p-band in certain order:
--  1. those below diagonal delta, from outside in
--  2. those above diagonal delta, from outside in
--  3. diagonal delta

updateForP src dst vtx edt p delta =
  if delta >= 0
    then do
      forDiagonals src dst vtx edt (makeDiagsIncr (-p) (delta-1))
      forDiagonals src dst vtx edt (makeDiagsDecr (p+delta) (delta+1))
      forDiagonals src dst vtx edt [delta]
    else do
      forDiagonals src dst vtx edt (makeDiagsDecr p (delta+1))
      forDiagonals src dst vtx edt (makeDiagsIncr (-delta-p) (delta-1))
      forDiagonals src dst vtx edt [delta]

forDiagonals src dst vtx edits [] = return ()
forDiagonals src dst vtx edits (d:diags) = do
  nextEdit src dst vtx edits d
  forDiagonals src dst vtx edits diags

-- Move down is delete, move right is insert.
-- Note that the position reported for Insert is relative to the
-- src (y) axis, not the dst axis.
-- This is because we expect to report the edit script as changes
-- relative to the src sequence.

-- type-sig is a big help with space usage - squashes unnecessary
polymorphism.
nextEdit :: Eq a => STArray s Int a -> STArray s Int a -> STArray s Int
Int -> STArray s Int [Modification a] -> Int -> ST s ()
nextEdit src dst vtx edits diag = do
  mr <- readArray vtx (diag-1)
  moveDown <- readArray vtx (diag+1)
  let
    moveRight = 1 + mr
    x = if moveDown >= moveRight then moveDown else moveRight
    y = x - diag
  newX <- slideDownDiagonal src dst x diag
  writeArray vtx diag newX
  if moveDown >= moveRight
      then do
        editList <- makeEdit src y Del y edits (diag+1)
        writeArray edits diag editList
      else do
        editList <- makeEdit dst x Ins (y+1) edits (diag-1)
        writeArray edits diag editList

makeEdit :: Eq a => STArray s Int a -> Int -> (Int -> a -> Modification
a) -> Int -> STArray s Int [Modification a] -> Int -> ST s [Modification
a]
makeEdit str i cons y edits diag = do
  es <- readArray edits diag
  if inBounds str i
    then do
      e <- readArray str i
      return ((cons y e):es)
    else return es

slideDownDiagonal :: Eq a => STArray s Int a -> STArray s Int a -> Int
-> Int -> ST s Int
slideDownDiagonal src dst x' dia = do
  let x = x' + 1; y = x - dia
  if (inBounds dst x) && (inBounds src y)
    then do
      a <- readArray dst x; b <- readArray src y
      if a == b
        then slideDownDiagonal src dst x dia
        else return x'
    else return x'
*****************************************************************
Confidentiality Note: The information contained in this message,
and any attachments, may contain confidential and/or privileged
material. It is intended solely for the person(s) or entity to
which it is addressed. Any review, retransmission, dissemination,
or taking of any action in reliance upon this information by
persons or entities other than the intended recipient(s) is
prohibited. If you received this in error, please contact the
sender and delete the material from any computer.
*****************************************************************