[Haskell-cafe] Re: CYK-style parsing and laziness

Wed May 23 13:32:23 EDT 2007

Steffen Mazanek wrote:
> I have two questions regarding a Cocke, Younger, Kasami parser.
> 
> type NT = Char -- Nonterminal
> type T = Char  -- Terminal
> -- a Chomsky production has either two nonterminals or one terminal on its
> right-hand side
> type ChomskyProd = (NT, Either T (NT, NT))
> -- a grammar consists of a startsymbol, nonterminal symbols, terminal
> symbols and productions
> type Grammar = (NT, [NT], [T], [ChomskyProd])
> 
> parse::Grammar->[T]->Bool
> parse (s, nts, ts, prods) w = s `elem` gs n 1
> where
> n = length w
> table = [[gs i j|j<-[1..n-i+1]]|i<-[1..n]]
> gs 1 j = [nt|p<-prods,termProd p,
>              let (nt, Left t)=p, w!!(j-1)==t]
> gs i j = [nt|k<-[1..i-1],p<-prods,
>              not (termProd p),
>              let (nt, Right (a, b))=p,
>              a `elem` table!!(k-1)!!(j-1), b `elem`
> table!!(i-k-1)!!(j+k-1)]
> 
> The sets gs i j contain all nonterminal symbols from which the substring of
> w starting at index j of length i can be derived.
> 
> Please have a look at the last line of the algorithm. In my first attempt I
> just referred to gs k j and gs (i-k) (j+k) what looks a lot more intuitive.
> However I noted that this way the sets gs i j are evaluated multiple
times.
> Is there a better and more readable way to prevent multiple evaluation of
> these sets?

The key point of the dynamic programming algorithm is indeed to memoize
the results gs i j for all pairs of i and j. In other words, the insight
 that yields a fast algorithm is that for solving the subproblems gs i j
(of which there are n^2), solution to smaller subproblems of the same
form are sufficient. Tabulation is a must.

Of course, you can still choose how to represent the table. There's a
nice higher order way to do that

  tabulate :: (Int -> Int -> a) -> (Int -> Int -> a)

  gs = tabulate gs'
     where
     gs' 1 j = ... uses  gs x y  for some  x y ...
     gs' i j = ... ditto ...

The function  tabulate  takes a function on an N x N grid and well,
tabulates it. Actually, it doesn't return the table but already the
function that indexes into the table.

Your current code uses lists to represent the table, i.e.

  tabulate f = f'
     where
     table = [[f i j | j<-[1..n-i+1]] | i<-[1..n]]
     f'    = \i j -> table !! i !! j

Note that this depends on sharing the  table  over multiple invocations
of f'. In other words,

  tabulate f i j = table !! i !! j
     where table = ...

would defeat it's purpose.

Of course, accessing the (i,j)-th element of the  table  takes O(i+j)
time. Another options is to use (boxed, immutable) arrays for the table
which offer O(1) access. There's an example for this on the wiki page

   http://haskell.org/haskellwiki/Edit_distance

> The second question regards lazy evaluation. Consider the stupid grammar
> S->S S
> S->A A
> A->a
> 
> that generates a^(2n).
>
> The performance of the algorithm drops very fast even for small n, probably
> because the gs i j are getting very large.

That would be strange. I mean, no gs i j may have more than two
elements, namely S or A. The other key point of the CYK algorithm is
that the sets gs i j are indeed sets and may only contain as many
elements as there are nonterminals.

In your case however, nonterminals get chosen multiple times (for each
production and, worse, for each k !). Be sure to keep every nonterminal
only once. Again, you can choose whether to represent the sets of
nonterminals by lists or with Data.Set or whatever.

Here's a one (suboptimal) way to do it with lists

  gs = tabulate gs'
     where
     gs' 1 j = [nt | p<-prods, termProd p,
                     let (nt, Left t)=p, w !! (j-1)==t ]

     gs' i j = [nt | p<-prods, not (termProd p)
                     let (nt, Right (a, b))=p,
                     not $ null
                        [k | k<-[1..i-1],
                             a `elem` gs k j,
                             b `elem` gs (i-k) (j+k)]]

The gs i j are still not sets, some nonterminals may appear multiple
times. But they have at most  length prods  elements instead of
previously  length prods * n  elements.

> Is there a trick to get lazy evaluation into play here?
> It is sufficient to find only one occurence of the start
> symbol in gs n 1.

The fact that the gs i j have to be sets is not related to laziness.
(Although the code above exploits that (not $ null [k | ...]) returns
True as soon as possible thanks to lazy evaluation).

Regards,
apfelmus