[Haskell-cafe] Suffix tree

Mon Aug 13 04:15:21 EDT 2007

On Sun, 2007-08-12 at 23:09 +0100, Jon Harrop wrote:
> Suffix trees are a data structure used to search for a substring of length "m" 
> in a string of length "n" in O(m) time. Suffix trees can also be used for 
> efficient approximate searches. This data structure is of particular 
> importance in bioinformatics.

Suffix trees (and their sibling, suffix arrays) are cool.  Basically a
suffix tree is a trie, but where any nodes with only one child are
collapsed.  Easy to construct through sorting, linear time is trickier,
and perhaps the biggest problem is keeping memory use reasonable.

Suffix arrays is a sorted array of indices, so it's more compact, but
takes O(m log n) to search.  There's also the enhanced suffix array,
which provides the same functionality as the suffix tree, but uses
approximately the same space as well (something like 12n bytes vs 5n
bytes, IIRC).

> Does anyone have any Haskell code implementing suffix trees? I'm particularly 
> interested in high-performance construction.

I was using a suffix tree to index some data, but I discovered I only
needed the tree to constant depth, so - shame on me - I ended up simply
sorting the suffixes in one case, and storing hashed words in a Map in
another.

I've also toyed with FFI'ing to comressed suffix arrays, but in the end,
this wasn't as successful as I'd hoped.  For straightforward suffix
array construction, this is probably the fastest way to go about it,
though.

Linear time suffix trees are a bit complicated to implement in a
functional language, as they depend on additional links in the tree
during construction.  I'm sure this can be done using IO/STrefs, but it
would of course be cooler if it could be done tying-the-knot style. 

Anyway, I have a handful of shaky FFI bindings, a reasonable bibtex
file, and lots of good will and encouragement.  Let me know if you have
use for any of them!

-k