Announcing Very Fast Searching of ByteStrings

ChrisK haskell at list.mightyreason.com
Fri Aug 17 09:49:28 EDT 2007


Attached is the Boyer-Moore algorithm implemented for strict and lazy
bytestrings (and combinations thereof).  It finds all the overlapping instances
of the pattern inside the target.

I have performance tuned it.  But the performance for searching a strict
bytestring is better then for a lazy bytestring (even if they only had a single
strict chunk), which almost certainly means I was not clever enough to get GHC
to produce the optimal code.

There is much more description in the module's haddock header.

Hopefully Don or other ByteString experts/maintainers can tweak this even further.

Also attaches is a Knuth-Morris-Pratt module which find non-overlapping
instances and is slightly slower on my benchmarks.

Happy Searching,
  Chris Kuklewicz
-------------- next part --------------
{-# OPTIONS_GHC -fbang-patterns -O2 #-}
-- | BoyerMoore module for searching String and Lazy ByteStrings
--
-- Authors: Daniel Fischer <daniel.is.fischer `at` web.de>
--          Chris Kuklewicz <haskell `at` list.mightyreason.com>
--
-- License: BSD3
--
-- This module exports 4 search functions: 'bmMatchLL', 'bmMatchLS',
-- 'bmMatchSL', and 'bmMatchSS'.
--
-- The first parameter is always the pattern string.  The second
-- parameter is always the target string to be searched.  The returned
-- list is all the locations of overlapping patterns.  A returned Int
-- or Int64 is an index into the target string which is aligned to the
-- head of the pattern string.  Strict targets return Int indices and
-- lazy targets return Int64 indices.  All returned lists are computed
-- and returned in a lazy fashion.
--
-- 'bmMatchLL' and 'bmMatchLS' take lazy bytestrings as patterns.  For
-- performance, if the pattern is not a single strict chunk then all
-- the the pattern chunks will copied into a concatenated strict
-- bytestring.  This limits the patterns to a length of (maxBound ::
-- Int).
--
-- 'bmMatchLL' and 'bmMatchSL' take lazy bytestrings are targets.
-- These are written so that while they work they will not retain a
-- reference to all the earlier parts of the the lazy bytestring.
-- This means the garbage collector would be able to keep only a small
-- amount of the target string and free the rest.
--
-- If given an empty pattern the search will always return an empty
-- list.
--
-- These can all be usefully curried.  Given only a pattern the
-- curried version will compute the supporting lookup tables only
-- once, allowing for efficient re-use.  Similarly, the curried
-- 'bmMatchLL' and 'bmMatchLS' will compute the concatenated pattern
-- only once.
--
-- Overflow warning: the current code uses Int to keep track of the
-- locations in the target string.  If the length of the pattern plus
-- the length of any strict chunk of the target string is greater or
-- equal to (maxBound :: Int) then this will overflow causing an
-- error.  I try to detect this and call 'error' before a segfault
-- occurs.
--
-- Performance: Operating on a strict target string is faster than a
-- lazy target string.  It is unclear why the performance gap is as
-- large as it is (patches welcome).  To slightly ameliorate this, if
-- the lazy string is a single chunk then a copy of the strict
-- algorithm is used.
-- 
-- Complexity: Preprocessing the pattern string is O(patternLength).
-- The search performance is O(targetLength / patternLength) in the
-- best case, allowing it to go faster than a Knuth-Morris-Pratt
-- algorithm.  With a non-periodic pattern the worst case uses
-- (3*targetLength) comparisons.  The periodic pattern worst case is
-- quadratic O(targetLength*patternLength) complexity.  Improvements
-- (e.g. Turbo-Boyer-Moore) to catch and linearize worst case
-- performance slow down the loop significantly.
--
-- Descriptions of the algorithm can be found at
-- http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140
-- and
-- http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
module BoyerMoore ( bmMatchLL
                  , bmMatchLS
                  , bmMatchSL
                  , bmMatchSS
                  ) where
 
import qualified Data.ByteString as S (ByteString,null,length,concat,unpack)
import qualified Data.ByteString.Lazy as L (ByteString,toChunks)
import qualified Data.ByteString.Base as B (unsafeIndex)
import qualified Data.ByteString.Char8 as SC (pack) -- used for testing

import Data.Array.Base (unsafeAt,unsafeRead,unsafeWrite)
import Data.Array.ST (newArray,newArray_,runSTUArray)
import Data.Array.IArray (array,accumArray,assocs)
import Data.Array.Unboxed (UArray)
import Data.Word (Word8)
import Data.Int (Int64)

{-# INLINE bmMatchLL #-}
bmMatchLL :: L.ByteString -> L.ByteString -> [Int64]
bmMatchLL pat = let search = bmMatchSSsd (S.concat (L.toChunks pat))
                in search . L.toChunks

{-# INLINE bmMatchLS #-}
bmMatchLS :: L.ByteString -> S.ByteString -> [Int]
bmMatchLS pat = bmMatchSSd (S.concat (L.toChunks pat))

{-# INLINE bmMatchSL #-}
bmMatchSL :: S.ByteString -> L.ByteString -> [Int64]
bmMatchSL pat = let search = bmMatchSSsd pat
                in search . L.toChunks

{-# INLINE bmMatchSS #-}
bmMatchSS :: S.ByteString -> S.ByteString -> [Int]
bmMatchSS pat = bmMatchSSd pat

bmMatchSSd :: S.ByteString -> S.ByteString -> [Int]
bmMatchSSd pat | S.null pat = const []
               | otherwise = 
  let !patLen = S.length pat
      !patEnd = pred patLen
      !maxStrLen = maxBound - patLen
      !occT   = occurs pat       -- used to compute bad-character shift
      !suffT  = suffShifts pat   -- used to compute good-suffix shift
      !skip   = unsafeAt suffT 0 -- used after each matching position is found
      -- 0 < skip <= patLen

      {-# INLINE patAt #-}
      patAt :: Int -> Word8
      patAt !i = B.unsafeIndex pat i

      searcher str | maxStrLen <= S.length str = error "Overflow error in BoyerMoore.bmMatchSSd"
                   | otherwise =
        let !strLen = S.length str
            !maxDiff = strLen-patLen
            {-# INLINE strAt #-}
            strAt :: Int -> Word8
            strAt !i = B.unsafeIndex str i

            findMatch !diff !patI =
              case strAt (diff+patI) of
                c | c==patAt patI -> if patI == 0
                                       then diff :
                                              let diff' = diff + skip
                                              in if maxDiff < diff'
                                                   then []
                                                   else findMatch diff' patEnd
                                       else findMatch diff (pred patI)
                  | otherwise -> let {-# INLINE badShift #-}
                                     badShift = patI - unsafeAt occT (fromIntegral c)
                                     -- (-patEnd) < badShift <= patLen
                                     {-# INLINE goodShift #-}
                                     goodShift = unsafeAt suffT patI
                                     -- 0 < goodShift <= patLen
                                     diff' = diff + max badShift goodShift
                                 in if maxDiff < diff'
                                      then []
                                      else findMatch diff' patEnd
        in if maxDiff < 0
             then []
             else findMatch 0 patEnd
  in searcher

-- release is used to keep the zipper in bmMatchSSs from remembering
-- the leading part of the searched string.  The deep parameter is the
-- number of characters that the past needs to hold.  This ensures
-- lazy streaming consumption of the searched string.
{-# INLINE release #-}
release :: Int ->  [S.ByteString] -> [S.ByteString]
release !deep _ | deep <= 0 = []
release !deep (!x:xs) = let !rest = release (deep-S.length x) xs in x : rest
release _ [] = error "BoyerMoore 'release' could not find enough past of length deep!"

bmMatchSSsd :: S.ByteString -> [S.ByteString] -> [Int64]
bmMatchSSsd pat | S.null pat = const []
               | otherwise =
  let !patLen = S.length pat
      !patEnd = pred patLen
      !longestStr = maxBound - patLen
      !occT   = occurs pat       -- used to compute bad-character shift
      !suffT  = suffShifts pat   -- used to compute good-suffix shift
      !skip   = unsafeAt suffT 0 -- used after each matching position is found
      -- 0 < skip <= patLen

      {-# INLINE patAt #-}
      patAt :: Int -> Word8
      patAt !i = B.unsafeIndex pat i

      searcher string =
        let -- seek is used to position the "zipper" of
            -- (past,str,future) to the correct S.ByteString to search
            -- with matcher.  This is done by ensuring 0 <= strPos <
            -- strLen where (strPos == diffPos+patPos). Note that
            -- future is not a strict parameter.  The character being
            -- compared will then be (strAt strPos) and (patAt
            -- patPos).  Splitting this into specialized versions
            -- seems like going too, and is only useful if pat is
            -- close to (or larger than) the chunk size.
            seek :: Int64 -> [S.ByteString] -> S.ByteString -> [S.ByteString] -> Int -> Int -> [Int64]
            seek !prior !past !str future !diffPos !patPos | (diffPos+patPos) < 0 = {-# SCC "seek/past" #-}
              case past of
                [] -> error "seek back too far!"
                (h:t) -> let hLen = S.length h
                         in seek (prior - fromIntegral hLen) t h (str:future) (diffPos + hLen) patPos
                                                           | strLen <= (diffPos+patPos) = {-# SCC "seek/future" #-}
              case future of
                [] -> []
                (h:t) -> let {-# INLINE prior' #-}
                             prior' = prior + fromIntegral strLen
                             !diffPos' = diffPos - strLen
                             {-# INLINE past' #-}
                             past' = release (-diffPos') (str:past)
                         in if maxStrLen <= S.length h
                              then error "Overflow in BoyerMoore.bmMatchSSsd"
                              else seek prior' past' h t diffPos' patPos
                                                          | otherwise = {-# SCC "seek/str" #-}
              -- matcher is the tight loop that walks backwards from the end
              -- of the pattern checking for matching characters.  The upper
              -- bound of strLen is checked only when strI is shifted
              -- upwards to strI'.  The lower bound must be checked.
              let matcher !diff !patI =
                    case strAt (diff+patI) of
                      c | c==patAt patI ->
                            if patI == 0
                              then prior + fromIntegral (diff+patI) :
                                     let !diff' = (diff+patI) + skip -- Assert : diff < diff'
                                     in if maxDiff < diff'
                                          then seek prior past str future diff' patEnd
                                          else if diff' < 0
                                                 then matcher diff' patEnd
                                                 else matcherF diff' patEnd
                              else if (diff+patI) == 0 -- diff < 0 means need to check underflow
                                     then seek prior past str future diff (pred patI) 
                                     else matcher diff (pred patI)
                        | otherwise ->
                            let {-# INLINE badShift #-}
                                badShift = patI - unsafeAt occT (fromIntegral c)
                                -- (-patEnd) < badShift <= patLen
                                {-# INLINE goodShift #-}
                                goodShift = unsafeAt suffT patI
                                -- 0 < goodShift <= patLen
                                -- Assert : diff < diff'
                                !diff' = diff + max badShift goodShift
                            in if maxDiff < diff'
                                 then seek prior past str future diff' patEnd
                                 else if diff' < 0
                                        then matcher diff' patEnd
                                        else matcherF diff' patEnd

              -- mathcherF only needs to check overflow since 0<=diff
                  matcherF !diff !patI =
                    case strAt (diff+patI) of
                      c | c==patAt patI ->
                            if patI == 0
                              then prior + fromIntegral (diff+patI) :
                                     let !diff' = (diff+patI) + skip -- Assert : diff < diff'
                                     in if maxDiff < diff'
                                          then seek prior past str future diff' patEnd
                                          else matcherF diff' patEnd
                              else matcherF diff (pred patI) -- 0 <= diff means no need to check underflow
                        | otherwise ->
                            let {-# INLINE badShift #-}
                                badShift = patI - unsafeAt occT (fromIntegral c)
                                -- (-patEnd) < badShift <= patLen
                                {-# INLINE goodShift #-}
                                goodShift = unsafeAt suffT patI
                                -- 0 < goodShift <= patLen
                                -- Assert : diff < diff'
                                !diff' = diff + max badShift goodShift
                            in if maxDiff < diff'
                                 then seek prior past str future diff' patEnd
                                 else matcherF diff' patEnd
              in if diffPos < 0
                   then matcher diffPos patPos
                   else matcherF diffPos patPos

             where !strLen = S.length str
                   !maxDiff = strLen - patLen
                   !maxStrLen = pred ((maxBound::Int) - patLen)
                   {-# INLINE strAt #-}
                   strAt :: Int -> Word8
                   strAt !i = B.unsafeIndex str i
        in case string of
             [] -> []
             [str] -> -- Steal the quick findMatch from bmMatchSSd for this case:
               let findMatch !diff !patI =
                     case strAt (diff+patI) of
                       c | c==patAt patI -> if patI == 0
                                              then fromIntegral diff :
                                                     let diff' = diff + skip
                                                     in if maxDiff < diff'
                                                          then []
                                                          else findMatch diff' patEnd
                                              else findMatch diff (pred patI)
                         | otherwise -> let {-# INLINE badShift #-}
                                            badShift = patI - unsafeAt occT (fromIntegral c)
                                            -- (-patEnd) < badShift <= patLen
                                            {-# INLINE goodShift #-}
                                            goodShift = unsafeAt suffT patI
                                            -- 0 < goodShift <= patLen
                                            diff' = diff + max badShift goodShift
                                        in if maxDiff < diff'
                                             then []
                                             else findMatch diff' patEnd
                   !strLen = S.length str
                   !maxDiff = strLen - patLen
                   !maxStrLen = ((maxBound::Int) - patLen)
                   {-# INLINE strAt #-}
                   strAt :: Int -> Word8
                   strAt !i = B.unsafeIndex str i
               in if maxStrLen <= strLen
                    then error "Overflow in BoyerMoore.bmMatchSSsd"
                    else findMatch 0 patEnd
             (str:future) -> if ((maxBound::Int) - patLen) <= S.length str
                               then error "Overflow in BoyerMoore.bmMatchSSsd"
                               else seek 0 [] str future 0 patEnd
  in searcher

{- Format of bad character table generated by occurs:

Index is good for Word8 / ASCII searching only.
The last character (at the last index) in pat is ignored.
Excluding that last element, the value is largest index of occurances of that Word8 in the pat.
The default value for Word8's not in the pattern is (-1).

Range of values: -1 <= value < length of pattern

-}
{-# INLINE occurs #-}
occurs :: S.ByteString -> UArray Word8 Int
occurs !pat | patEnd < 0 = emptyOccurs
            | otherwise  = runSTUArray
    (do ar <- newArray (minBound,maxBound) (-1)
        let loop !i | i == patEnd = return ar
                    | otherwise   = do unsafeWrite ar (fromEnum $ pat `B.unsafeIndex` i) i
                                       loop (succ i)
        loop 0)
  where
    !patEnd = pred (S.length pat)

emptyOccurs :: UArray Word8 Int
emptyOccurs = accumArray const (-1) (minBound,maxBound) []

{- Non ST variants of occurs

occurs' :: S.ByteString -> UArray Word8 Int
occurs' !pat = accumArray (flip const) (-1) (0,255)
  [ (pat `B.unsafeIndex` i, i) | i <- [0..pred (S.length pat)] ]

occurs'' :: S.ByteString -> UArray Word8 Int
occurs'' !pat = accumArray (flip const) (-1) (minBound,maxBound) $ zip (init $ S.unpack pat) [0..]
-}

{-
suffLengths uses a ST array to allow for strict querying of previously
filled in values durring the fill loops.

Format for suffLengths array:

Valid index range is the same as for the pat.

The value at index k is used when there is a mismatch at index k in
pat after checking that all indices j where j > k correctly match.

For all indices consider the prefix of pat that ends with the
character at that index.  Now the value of suffLength is the number of
character at the end of this prefix that are identical to the end of
pat.

By the above definition, the last index has the length of the pattern
as its value, since the whole pattern is compared to itself and the
overlap is always the whole pattern length.  And the maximum value at
index k is (k+1).

This value itself is a non-negative integer less than the length of
pat except for the last index, where the value is the length of pat.

For most positions the value will be 0.  Aside from the at the last
index the value can be non-zero only at indices where the last
character of the pat occurs earlier in pat.
-}
{-# INLINE suffLengths #-}
suffLengths :: S.ByteString -> UArray Int Int
suffLengths !pat | 0==patLen = array (0,-1) []
                 | otherwise = runSTUArray
    (do ar <- newArray_ (0,patEnd)
        unsafeWrite ar patEnd patLen
        let {-# INLINE matchSuffix #-}
            matchSuffix !idx !from = do
                let !d = patEnd - idx
                    helper !i | i < 0 || (pat `B.unsafeIndex` i) /= (pat `B.unsafeIndex` (i+d)) = i
                              | otherwise = helper (pred i)
                    pre' = helper from
                unsafeWrite ar idx (idx-pre')
                idxLoop (pred idx) pre' start
            idxLoop !idx !pre !end
                | idx < 0   = return ar
                | pre < idx = do matching <- unsafeRead ar end  -- try and reuse old result
                                 if pre + matching < idx        -- check if old matching length is too long for current idx
                                   then do unsafeWrite ar idx matching
                                           idxLoop (pred idx) pre (pred end)
                                   else matchSuffix idx pre
                | otherwise = matchSuffix idx idx
        idxLoop start start start) -- the third argument, the initial value of "end", is never used and does not matter.
  where
    !patLen = S.length pat
    !patEnd = pred patLen
    !start  = pred patEnd


{- Format for suffShifts:

The valid index range is the same as for pat.

The index k is used when there is a mismatch at pat index k and all
indices j where j > k have matched.

The value is the smallest number of characters one can advance the
pattern such that there the shifted pattern agrees at the already
checked positions j>k.

Thus the value range is : 0 < value <= length of pattern

-}
{-# INLINE suffShifts #-}
suffShifts :: S.ByteString -> UArray Int Int
suffShifts !pat | patLen == 0 = array (0,-1) []
                | otherwise = runSTUArray
    (do ar <- newArray (0,patEnd) patLen
        let preShift !idx !j -- idx counts down and j starts at 0 and is non-decreasing
                | idx < 0   = return ()
                | suff `unsafeAt` idx == idx+1  =
              do let !shf = patEnd - idx
                     fill_to_shf !i | i==shf = return ()
                                    | otherwise = do unsafeWrite ar i shf
                                                     fill_to_shf (succ i)
                 fill_to_shf j
                 preShift (pred idx) shf
                | otherwise = preShift (pred idx) j
            sufShift !idx
                | idx == patEnd = return ar
                | otherwise = do unsafeWrite ar (patEnd - (suff `unsafeAt` idx)) (patEnd - idx)
                                 sufShift (succ idx)
        preShift start 0
        sufShift 0)
      where
        !patLen = S.length pat
        !patEnd = pred patLen
        !start = pred patEnd
        !suff = suffLengths pat


{- TESTING SECTION for suffLengths suffShifts occurs -}

testPats =
  [ "ANPANMAN"
  , "A"
  , "AA"
  , "AAA"
  , "AAAA"
  , "AAABBB"
  , "BBBAAABBB"
  , "ABC"
  , "AB"
  , "ABCD"
  , "ABCDABCD"
  , "DCBAABCD"
  , "GCAGAGAG"
  , "AGAGAG"
  , "GAGAG"
  , "AGAG"
  , "GAG"
  , ""
  ]

rightLens =
  [ [(0,0),(1,2),(2,0),(3,0),(4,2),(5,0),(6,0),(7,8)]
  , [(0,1)]
  , [(0,1),(1,2)]
  , [(0,1),(1,2),(2,3)]
  , [(0,1),(1,2),(2,3),(3,4)]
  , [(0,0),(1,0),(2,0),(3,1),(4,2),(5,6)]
  , [(0,1),(1,2),(2,3),(3,0),(4,0),(5,0),(6,1),(7,2),(8,9)]
  , [(0,0),(1,0),(2,3)]
  , [(0,0),(1,2)]
  , [(0,0),(1,0),(2,0),(3,4)]
  , [(0,0),(1,0),(2,0),(3,4),(4,0),(5,0),(6,0),(7,8)]
  , [(0,1),(1,0),(2,0),(3,0),(4,0),(5,0),(6,0),(7,8)]
  , [(0,1),(1,0),(2,0),(3,2),(4,0),(5,4),(6,0),(7,8)]
  , [(0,0),(1,2),(2,0),(3,4),(4,0),(5,6)]
  , [(0,1),(1,0),(2,3),(3,0),(4,5)]
  , [(0,0),(1,2),(2,0),(3,4)]
  , [(0,1),(1,0),(2,3)]
  , []
  ]

rightSuffs =
 [ [(0,6),(1,6),(2,6),(3,6),(4,6),(5,3),(6,8),(7,1)]
 , [(0,1)]
 , [(0,1),(1,2)]
 , [(0,1),(1,2),(2,3)]
 , [(0,1),(1,2),(2,3),(3,4)]
 , [(0,6),(1,6),(2,6),(3,1),(4,2),(5,3)]
 , [(0,6),(1,6),(2,6),(3,6),(4,6),(5,6),(6,1),(7,2),(8,3)]
 , [(0,3),(1,3),(2,1)]
 , [(0,2),(1,1)]
 , [(0,4),(1,4),(2,4),(3,1)]
 , [(0,4),(1,4),(2,4),(3,4),(4,8),(5,8),(6,8),(7,1)]
 , [(0,7),(1,7),(2,7),(3,7),(4,7),(5,7),(6,7),(7,1)]
 , [(0,7),(1,7),(2,7),(3,2),(4,7),(5,4),(6,7),(7,1)]
 , [(0,2),(1,2),(2,4),(3,4),(4,6),(5,1)]
 , [(0,2),(1,2),(2,4),(3,4),(4,1)]
 , [(0,2),(1,2),(2,4),(3,1)]
 , [(0,2),(1,2),(2,1)]
 , []
 ]

prop_occurs :: String -> Bool
prop_occurs [] = occurs (SC.pack []) == accumArray (flip const) (-1) (minBound,maxBound) []
prop_occurs x = let s = SC.pack x
                in occurs s == occurs' s
  where occurs' :: S.ByteString -> UArray Word8 Int
        occurs' !pat = accumArray (flip const) (-1) (minBound,maxBound) $ zip (init $ S.unpack pat) [0..]

testOccurs = all prop_occurs testPats
testLens = rightLens == map (assocs . suffLengths . SC.pack) testPats
testSuffs = rightSuffs == map (assocs . suffShifts . SC.pack) testPats
-------------- next part --------------
{-# OPTIONS_GHC -fbang-patterns -O2  #-}
{-|
 Code by Justin Bailey (jgbailey at gmail.com) and
 Chris Kuklewicz (haskell at list.mightyreason.com).

 License: BSD3

 Execute 'allTests' to run quickcheck and regression tests.

 This finds non-overlapping patterns with a Knuth-Morris-Pratt
 algorithm.
-}

module KMPSeq ( kmpMatchLL
              , kmpMatchLS
              , kmpMatchSS
              , kmpMatchSL) where

import qualified Data.Array.Base as Base (unsafeAt)
import qualified Data.Array.Unboxed as Unboxed (UArray)
import qualified Data.Array.IArray as IArray (array)
import Data.List as List (map, filter, length, null, take, maximum, foldr, all, drop)
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString as S
import qualified Data.ByteString.Base as B (unsafeHead,unsafeTail,unsafeDrop,unsafeIndex)
import GHC.Int (Int64)

import Test.QuickCheck
import Debug.Trace (trace)

{-|
  Returns list of indices for a given substring in a search string, or the empty
  list if none were found.
  
  Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}  
{-# INLINE kmpMatchLL #-}
kmpMatchLL :: L.ByteString -- ^ Pattern to search for.
  -> L.ByteString -- ^ String to search.
  -> [Int64] -- ^ List of indices where string was found.
kmpMatchLL pat = let search = kmpMatchSSs' (S.concat (L.toChunks pat)) in search . L.toChunks

{-|
  Returns list of indices for a given substring in a search string, or the empty
  list if none were found.
  
  Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}  
{-# INLINE kmpMatchLS #-}
kmpMatchLS :: L.ByteString -- ^ Pattern to search for.
  -> S.ByteString -- ^ String to search.
  -> [Int64] -- ^ List of indices where string was found.
kmpMatchLS pat = let search = kmpMatchSSs' (S.concat (L.toChunks pat)) in search . (:[])

{-|
  Returns list of indices for a given substring in a search string, or the empty
  list if none were found.
  
  Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}  
{-# INLINE kmpMatchSS #-}
kmpMatchSS :: S.ByteString -- ^ Pattern to search for.
  -> S.ByteString -- ^ String to search.
  -> [Int64] -- ^ List of indices where string was found.
kmpMatchSS pat = let search = kmpMatchSSs' pat in search . (:[])

{-|
  Returns list of indices for a given substring in a search string, or the empty
  list if none were found.
  
  Uses the Knuth-Morris-Pratt fast string matching algorithm.
-}  
{-# INLINE kmpMatchSL #-}
kmpMatchSL :: S.ByteString -- ^ Pattern to search for.
  -> L.ByteString -- ^ String to search.
  -> [Int64] -- ^ List of indices where string was found.
kmpMatchSL pat = let search = kmpMatchSSs' pat in search . L.toChunks

kmpMatchSSs' :: S.ByteString -> [S.ByteString] -> [Int64]
kmpMatchSSs' pat | S.null pat = const []
                 | otherwise =
  let !patLen = S.length pat -- Evaluate S.length once; 
      !lookupTable = computeLookup pat -- lower bound of UArray must be 0 for Base.unsafeAt, but index 0 will never be looked up
      searcher :: Int64 -> Int -> [S.ByteString] -> [Int64]
      searcher _ _ [] = []
      searcher !prior !patStart (!str:strRest) =
        let !strLen = S.length str -- Evaluate S.length once; 
            findMatch :: Int -> Int -> [Int64]
            findMatch !strIndex !patIndex | patIndex == patLen = (prior + fromIntegral strIndex - fromIntegral patLen) : findMatch strIndex 0
                                          | strIndex == strLen = searcher (prior + fromIntegral strLen) patIndex strRest
                                          | otherwise =
              if (B.unsafeIndex str strIndex) == (B.unsafeIndex pat patIndex)
                then findMatch (succ strIndex) (succ patIndex)
                else if patIndex == 0
                       then findMatch (succ strIndex) 0
                       else findMatch strIndex (Base.unsafeAt lookupTable patIndex) -- here 1 <= patIndex <= patLen-1
        in
          findMatch 0 patStart
  in searcher 0 0

{-|

 Given our pattern, get all the prefixes of the pattern. For each of those
 prefixes, find the longest prefix from the original pattern that is also a
 suffix of the prefix segment being considered, and is not equal to it. The
 argument given to overlap is the length of the prefix matched so far, and the
 length of the longest prefix, which is a suffix and is not equal to it, is the
 value overlap returns.

 If a given prefix has no possible overlap, it is mapped to -1.

-}
overlap :: S.ByteString -> [(Int, Int)]
overlap pat =
 let patternLength = S.length pat
     -- Given an index into the pattern (representing a substring), find the longest prefix of
     -- the pattern which is a suffix of the substring given, without being
     -- equal to it.
     --
     -- patIdx represents the index of the last character in the prefix, not the
     -- character after it. Therefore, compare the pattern starting at the first
     -- character of the prefix, not the zeroth.
     longestSuffix !patIdx =
      let longestSuffix' !shiftPrefix !prefixIdx 
            | shiftPrefix == patIdx = 0 -- No match
            | shiftPrefix + prefixIdx == patIdx = prefixIdx -- Suffix found.
            -- Compare pattern to itself, but shifted, here.
            | B.unsafeIndex pat (shiftPrefix + prefixIdx) == B.unsafeIndex pat prefixIdx = longestSuffix' shiftPrefix (prefixIdx + 1)
            | otherwise = longestSuffix' (shiftPrefix + 1) 0
      in
        longestSuffix' 1 0
 in
  (0, 0) : [(matchLen, longestSuffix matchLen) | matchLen <- [1 .. patternLength - 1]]
  -- List.map (\prefix -> (fromIntegral $ S.length prefix, fromIntegral $ longestPreSuffix prefix)) prefixes

 
{-|
 Given a string representing a search pattern, this function
 returns a function which represents, for each prefix of that
 pattern, the maximally long prefix of the pattern which is a suffix
 of the indicated pattern segment.

 If there is no such prefix, 0 is returned.
 -}
computeLookup :: S.ByteString -> Unboxed.UArray Int Int
computeLookup pat =
 let patLen = fromIntegral $ S.length pat
     table :: Unboxed.UArray Int Int
     table = {-# SCC "computeLookup_table" #-} IArray.array (0, patLen - 1) (overlap pat)
 in table

-- Types, instances and utility functions for testing purposes.

newtype PatternChar = PatternChar Char
 deriving Show

instance Arbitrary PatternChar where
 arbitrary = oneof (List.map (return . PatternChar) ['a', 'b', 'c', 'd'])
 coarbitrary = undefined

-- Holds the search pattern, the search string, and the expected
-- position. Used to test for strings that have failed in the past.
data Regressions = R String String Int64
  deriving Show
  
instance Arbitrary Regressions where
  arbitrary =
    oneof $ map return 
      [R "ccb" "abcdcccb" 5,
       R "bbaa" "bdcdbbbbaa" 6,
       R "cccadadc" "adaccccadadc" 4,
       R "bbdbbdaabdb" "dbbdbbbdbbdaabdb" 5,
       R "bbdbbd" "dbbdbbbdbbd" 5,
       R "ccbb" "acccbb" 2,
       R "bbbb" "dddcaaaddaaabdacbcccabbada" (-1), -- This string has caused an infinite loop.
       R "bbc" "bbbbc" 2,
       R "" "" (-1)]
  coarbitrary = undefined
    
patternsToString :: [PatternChar] -> L.ByteString
patternsToString chars = L.pack $ List.foldr (\(PatternChar char) str -> (toEnum $ fromEnum char) : str) [] chars

patternsToStrictString :: [PatternChar] -> S.ByteString
patternsToStrictString chars = S.pack $ List.foldr (\(PatternChar char) str -> (toEnum $ fromEnum char) : str) [] chars

regressionsToLazy :: Regressions -> (L.ByteString, L.ByteString, Int64)
regressionsToLazy (R pat srch idx) = (toLazyBS pat, toLazyBS srch, idx)

toLazyBS :: String -> L.ByteString
toLazyBS = L.pack . List.map (toEnum . fromEnum)

toStrictBS :: String -> S.ByteString
toStrictBS = S.pack . List.map (toEnum . fromEnum)

-- Test that 0 and 1 element always return 0, if present.
prop_testZero :: [PatternChar] -> Property
prop_testZero pat =
 let table = computeLookup (patternsToStrictString pat)
 in
   not (List.null pat) ==>
     if List.length pat > 1
       then Base.unsafeAt table 0 == 0 && Base.unsafeAt table 1 == 0
       else Base.unsafeAt table 0 == 0

-- Test that all overlaps found are actually prefixes of the
-- pattern string
prop_testSubset :: [PatternChar] -> Property
prop_testSubset pat =
 let patStr = patternsToString pat
     table = computeLookup (patternsToStrictString pat)
     prefix len = L.take (fromIntegral len)
     testPrefix len =
       if Base.unsafeAt table len == 0
         then True
         else (prefix (Base.unsafeAt table len) patStr) `L.isPrefixOf` (prefix len patStr)
 in
   not (List.null pat) ==>
     trivial (L.null patStr) $
       List.all testPrefix [0 .. List.length pat - 1]


-- Test that the prefix given is the maximal prefix. That is,
-- add one more character makes it either equal to the string
-- or not a prefix.
prop_testCorrectPrefix :: [PatternChar] -> Property
prop_testCorrectPrefix pat =
 let patStr = patternsToString pat
     table = computeLookup (patternsToStrictString pat)
     isNeverSuffix len =
       let origPrefix = prefix len patStr
           -- Drop 1 to remove empty list
           allPrefixes = List.drop 1 $ L.inits origPrefix
       in
         List.all (\p -> L.null p || p == origPrefix || not ((L.reverse p) `L.isPrefixOf` (L.reverse origPrefix))) allPrefixes
     prefix len = L.take (fromIntegral len)
     -- True if the prefix returned from table for the length given produces
     -- a string which is a suffix of the original prefix.
     isRealSuffix len = (L.reverse (prefix (Base.unsafeAt table len) patStr)) `L.isPrefixOf` (L.reverse $ prefix len patStr)
     isLongestSuffix len =
       let prefixPlus = prefix (Base.unsafeAt table len + 1) patStr
           inputPrefix =  prefix len patStr
       in
         prefixPlus == inputPrefix ||
           not ((L.reverse prefixPlus) `L.isPrefixOf` (L.reverse inputPrefix))
     testTable len =
       if Base.unsafeAt table len == 0
         then isNeverSuffix len
         else isRealSuffix len &&
               isLongestSuffix len
 in
   not (List.null pat) ==>
     List.all testTable [0 .. List.length pat - 1]

-- Verify that, if a match is found, it is where it's supposed to be in
-- the string and it can be independently found by other means.
prop_testMatch :: [PatternChar] -> [PatternChar] -> Property
prop_testMatch pat search =
 let patStr = patternsToString pat
     searchStr = patternsToString search
     strictStr = patternsToStrictString search
     strictPat = patternsToStrictString pat
     patLen = L.length patStr
     searchLen = L.length searchStr
     matches = kmpMatchLL patStr searchStr
     verify matchIdx =
       if matchIdx > -1
         then (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
         else (L.null patStr && L.null searchStr) || not (strictPat `S.isSubstringOf` strictStr)
 in
   not (List.null pat) ==>
     trivial (L.null searchStr) $
       classify (patLen > searchLen) "Bigger pattern than search" $
       classify (patLen < searchLen) "Smaller pattern than search" $
       classify (patLen == searchLen) "Equal pattern and search" $
       classify (not (null matches)) "Match Exists" $
       classify (null matches) "Match Doesn't Exist" $
         all verify matches

-- Test a pattern that was known to fail.
prop_testBadPat :: [PatternChar] -> Property
prop_testBadPat search =
 let patStr = toLazyBS "bbc"
     patLen = L.length patStr
     searchStr = patternsToString search
     matches = kmpMatchLL patStr searchStr
     strictStr = patternsToStrictString search
     strictPat = toStrictBS "bbc"
     verify matchIdx =
            if matchIdx > -1
              then (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
              else (L.null patStr && L.null searchStr) || not (strictPat `S.isSubstringOf` strictStr)
 in
   trivial (List.null search) $
     all verify matches

-- Test that a pattern on the end of the string is found OK.
prop_testEndPattern :: [PatternChar] -> [PatternChar] -> Property
prop_testEndPattern pat search =
 let patStr = patternsToString pat
     searchStr = patternsToString (search ++ pat)
     matches = kmpMatchLL patStr searchStr
     strictStr = patternsToStrictString (search ++ pat)
     strictPat = patternsToStrictString pat
     patLen = L.length patStr
     verify matchIdx = (L.take patLen $ L.drop matchIdx $ searchStr) == patStr && strictPat `S.isSubstringOf` strictStr
 in
   not (List.null pat) ==> all verify matches

prop_testRegressions :: Regressions -> Bool
prop_testRegressions r =
  let (pat, srch, idx) = regressionsToLazy r
      matches = kmpMatchLL pat srch 
  in
    if null matches
      then idx == -1
      else (head matches) == idx  
    
props1 = [ prop_testZero
        , prop_testSubset
        , prop_testCorrectPrefix
        , prop_testBadPat
        ]

props2 =  [ prop_testMatch
         , prop_testEndPattern
         ]

allTests = do
 mapM_ quickCheck props1
 mapM_ quickCheck props2
 quickCheck prop_testRegressions


More information about the Libraries mailing list