HaskellWiki

Haskell | Wiki community | Recent changes
Random page | Special pages

 

Not logged in
Log in | Help

Shootout/Regex DNA

< Shootout

Categories: Code

This is a Shootout Entry for http://shootout.alioth.debian.org/benchmark.php?test=regexdna&lang=all.

Todo: this entry should be replaced by the new PCRE Text.Regex.ByteString module.

Contents

1 About the regex-dna (new) benchmark

Each program should:

linefeed characters, and record the sequence length

wildcard in one position), and (one pattern at a time) count matches in the redirected file

time) match-replace the pattern in the redirect file, and record the sequence length

We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.

Correct output for this 100KB input file (generated with the fasta program N = 10000), is

   agggtaaa|tttaccct 0
   [cgt]gggtaaa|tttaccc[acg] 3
   a[act]ggtaaa|tttacc[agt]t 9
   ag[act]gtaaa|tttac[agt]ct 8
   agg[act]taaa|ttta[agt]cct 10
   aggg[acg]aaa|ttt[cgt]ccct 3
   agggt[cgt]aa|tt[acg]accct 4
   agggta[cgt]a|t[acg]taccct 3
   agggtaa[cgt]|[acg]ttaccct 5
   101745
   100000
   133640

2 Discussion

Don's Idiomatic Entry seems pretty well baked as the idiomatic GHC entry. For the high-speed GHC #2, we appear to be heading towards a new, full implementation of regular expressions in this entry. I would vote that we step back and implement enough to complete the entry because we've already bent the rules in this entry. One of the aspects that I like about my original non-Parsec entry was that it was fairly short and, IMO, fairly readable. In adding Parsec, I think that some of the comprehensibilitinessous has been lost.

I do think that we've improved the entry a lot! -- AlsonKemp

Is there any way to shorten the fastest entry? I still think that we've lost a lot of comprehensibility. -- AlsonKemp

Note that the Idiomatic Entry has already been submitted -- DonStewart

errmmm... Does the work that's done here suggest that the Text.Regex should be replaced? Not sure how to replace it or with which replacement... See also RegexSyntax. -- AlsonKemp

Well, if not replaced, we can at least provide alternatives. I propose:

 * we add a Word8 interface to the existing Text.Regex, so we can avoid marshalling from packed strings and buffers
 * we add the lazy regex combinators to the Text.ParserCombinators suite. I use them all the time, and they're more than a decade old, so they should be standardised. I'll begin by cabalising the library. -- DonStewart

3 Benchmarks

N=5,000,000, Linux/x86

Old spec

||Name || Time || ||Original || Timeout || ||Functional 1 || Timeout || ||Functional 2 || Timeout || ||Combination || Timeout || ||Packed entry || 1200s || ||Lazy lexers 1 || 33.950 || ||Alson Kemp's parsec || 30.682s || ||Lazy lexers 2 || 28.000 || ||Modified Alson || 19.950 || ||Lazy lexers 5 + regex syntax || 12.106 || ||Lazy lexers 4 || 11.192 || ||Lexer 6 || 11.000 || ||Lazy lexers 3 (illegal)|| 6.300 || |||||||| ||perl entry || 4.005s ||

New spec ||Name || Time || ||Parsec 1 || 126s || ||CTK 1 || 60s ||

N.B. the syntax highlighting breaks on many of the following programs, resulting in bogus output.

4 Proposed entry

A tuned version of Chris' Submitted

{-# OPTIONS -funbox-strict-fields -fbang-patterns #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
-- Updated for ByteString by Chris Kuklewicz February, 2007
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
 
import Monad
 
import Data.Array (Array, assocs, accumArray,bounds,listArray)
import Data.Array.Base (unsafeAt)
 
import Data.ByteString.Base
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Char8  as BC
 
import Word
import List hiding (delete)
import qualified Data.Map as M
import System
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)
 
------------------------------------------------------------------------
 
main = putStr . work =<< B.getContents
 
work !fileIn = unlines $ counts ++ [[],l0, l1, l2]
  where
    l0     = show $ B.length fileIn
    clean  = B.concat . delete "\n|>[^\n]+\n" $ fileIn
    l1     = show $ B.length clean
    counts = [ re++" "++show (count re clean) | re <- variants ]
    l2     = show $ L.length (iubsExpand clean)
 
iubsExpand = L.fromChunks . foldr1 (.) (map replace iubs) . return
 
variants = [    "agggtaaa|tttaccct"    ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
           ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
           ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]
 
iubs = map (\(s,new) -> (regex s,BC.pack new)) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")]
 
-- And that's it!
 
------------------------------------------------------------------------
 
-- external interface to regular expressions
 
regex = (either (error.show) accept) . (\x -> P.parse p_regex x x)
  where accept re = re (Lexer True Done) -- Close a regular expression into a Lexer.
 
(!) !a b = unsafeAt a (fromEnum $ (b - fst (bounds a)))
 
delete s = del where
  r = regex s
  del b = pieces 0 (run r b 0) where
    pieces end [] = unsafeDrop end b : []
    pieces !end ((start,stop):rest) = unsafeTake (start-end) (unsafeDrop end b) : pieces stop rest
 
count s b = runOverlappingCount r b 0
  where r = regex s
 
replace (r,new) = rep where
  rep [] = []
  rep (!b:bs) = pieces 0 (run r b 0) where
    pieces 0 []   = b : rep bs
    pieces end [] | B.length b > end = unsafeDrop end b : rep bs
                  | otherwise        = rep bs
 
    pieces end ((start,stop):rest)
        | start > end = unsafeTake (start-end) (unsafeDrop end b) : new : pieces stop rest
        | otherwise = new : pieces stop rest
 
run :: Lexer -> B.ByteString -> Int -> [(Int,Int)]
run lexerIn !b offsetIn = loop offsetIn
  where
    end = B.length b
    loop offset | offset == end = []
                | otherwise =
      let t@(start,stop) = lexOne b lexerIn offset
      in if start == -1 then  []
           else t : loop stop
 
runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int
runOverlappingCount lexerIn !b offsetIn = loop offsetIn 0
  where
    end = B.length b
    loop !offset !c | offset == end = c
                    | otherwise =
      let start = fst $ lexOne b lexerIn offset
      in if start == -1 then c
           else loop (succ start) (succ c)
 
--
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- The regular expressions accepted by the program include those using
-- |, empty group (), grouping with ( and ), wildcard '.', backslach
-- escaped characters "\.", greedy modifiers ? + and *, bracketed
-- alternatives including ranges such as [a-z0-9] and inverted
-- brackets such as [^]\n-].  Only 7-bit Ascii accepted.
 
-- 'p_regex' is the only "exported" function, used by 'regex' above
 
p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))
 
p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))
 
p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")
 
p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))
 
p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )
 
p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters
 
p_dot = P.char '.' >> return (alt ['\0'..'\127'])
 
p_escaped_char = P.char '\\' >> liftM char P.anyChar
 
p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"
 
--
--  And everything else is the modified CTK library.
--
--  Compiler Toolkit: Self-optimizing lexers
--  Author : Manuel M. T. Chakravarty
--
 
-- tree structure used to represent the lexer table
data Lexer = Lexer !Bool !Cont
 
-- represent the continuation of a lexer
            -- on top of the tree, where entries are dense, we use arrays
data Cont = Dense !BoundsNum !(Array Word8 Lexer)
          -- further down, where the valid entries are sparse, we
          -- use association lists, to save memory
          | Sparse !BoundsNum !(M.Map Word8 Lexer)
          -- end of a automaton
          | Done
 
type Regexp = Lexer -> Lexer
 
infixr 4 `quest`, `star`, `plus`
infixl 3 +>
 
-- Empty lexeme (noop)
epsilon = id :: Regexp
 
-- One character regexp
char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l])))
  where w = c2w c
 
-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length ws) (minimum ws) (maximum ws)
                in Lexer False (aggregate bnds [(w, l) | w <- ws])
  where ws = map c2w cs
 
-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs
 
-- Concatenation of regexps is just concatenation of functions
(+>)  = (.) :: Regexp -> Regexp -> Regexp
 
-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2  = \l -> re1 l >||< re2 l
 
-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2
 
-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)
 
-- x `star` y corresponds to the regular expression x*y
star re1 re2  = \l -> let self = re1 self >||< re2 l in self
 
-- Scan forwards searching for a match anywhere at or after
-- startOffset.  Return offsets of first matching character and after
-- last matching character or (-1,-1) for failure
lexOne !b lexerIn !startOffset =
    let stop = oneLexeme lexerIn startOffset (-1)
    in if stop == -1
             then if startOffset < end
                    then lexOne b lexerIn (succ startOffset)
                    else (-1,-1)
             else (startOffset,stop)
  where
    end = B.length b
    oneLexeme (Lexer win cont) !offset !last =
      let last' = if win then offset else last
      in if offset == end
           then last' -- at end, has to be this action
           else oneChar cont (unsafeIndex b offset) (succ offset) last'   -- keep looking
 
    oneChar tbl !c !offset' !last = case peek tbl c of
      (Lexer win Done)   -> if win then offset' else last
      l'                 -> oneLexeme l' offset' last
 
    peek (Dense bn arr)  !c | c `inBounds` bn = arr ! c
    peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls
    peek _ _ = (Lexer False Done)
 
-- disjunctive combination of two lexers (longest match, right biased)
(Lexer win c) >||< (Lexer win' c')  = Lexer (win || win') (joinConts c c')
 
-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Word8 !Word8
 
-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')
 
-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc
 
-- combine two disjunctive continuations
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addsBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)
 
-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin  = 1 :: Int
 
-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))
 
-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

5 Chris' Bytestring Proposal

I have modified the old entry to use Data.ByteString and modified the matching engine to produce (start,stop) offsets. This runs really fast for me. Oddly, it does not use the 'Sparse' variant of 'Cont' but erasing 'Sparse' from the code makes it perform much much worse.

I expect there is a more performance to be tweaked, but memory consumption is good since I see nearly perfect productivity on the full data set (WinXP/Core Duo 2):

agggtaaa|tttaccct 36
[cgt]gggtaaa|tttaccc[acg] 125
a[act]ggtaaa|tttacc[agt]t 426
ag[act]gtaaa|tttac[agt]ct 290
agg[act]taaa|ttta[agt]cct 536
aggg[acg]aaa|ttt[cgt]ccct 153
agggt[cgt]aa|tt[acg]accct 143
agggta[cgt]a|t[acg]taccct 160
agggtaa[cgt]|[acg]ttaccct 219

5083411
5000000
6678892
5,742,128,636 bytes allocated in the heap
 12,691,824 bytes copied during GC (scavenged)
 30,713,012 bytes copied during GC (not scavenged)
  5,248,108 bytes maximum residency (5 sample(s))

      10944 collections in generation 0 (  0.06s)
          5 collections in generation 1 (  0.02s)

         11 Mb total memory in use

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    8.36s  (  8.58s elapsed)
  GC    time    0.08s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.45s  (  8.67s elapsed)

  %GC time       0.9%  (1.1% elapsed)

  Alloc rate    685,627,299 bytes per MUT second

  Productivity  98.9% of total user, 96.4% of total elapsed

5.1 The code

{-# OPTIONS -funbox-strict-fields -fbang-patterns #-}
--
-- This never uses 'Sparse' but remiving it killed performance -- ghc glitch?
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/haskellwiki/Shootout/Regex_DNA
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
-- Updated for ByteString by Chris Kuklewicz February, 2007
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
 
module Main(main) where
 
import Control.Monad (liftM)
import Data.Array (Array, assocs, accumArray,bounds,listArray)
import Data.Array.Base (unsafeAt)
import qualified Data.ByteString as B (ByteString,length,take,drop,index,concat,hGetContents)
import Data.ByteString.Base (c2w)
import qualified Data.ByteString.Char8  as BC (pack)
import qualified Data.ByteString.Lazy as L (length,fromChunks)
import Data.Word (Word8)
import Data.List (sort,nub,(\\))
import qualified Data.Map as M (Map,toList,fromList,findWithDefault)
import System.IO (stdin)
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)
 
(!) a b = unsafeAt a (fromEnum $ (b - fst (bounds a)))
 
main = do
  wholeFile <- B.hGetContents stdin
  putStr (work wholeFile)
 
work fileIn =
    let l0     = show $ B.length fileIn in l0 `seq`
    let clean  = B.concat . delete "\n|>[^\n]+\n" $ fileIn
        l1     = show $ B.length clean in l1 `seq`
    let counts = [ stringRegex++" "++show (count stringRegex clean) |
                   stringRegex <- variants ]  -- Count the variants one at a time.
        l2     = show $ L.length (iubsExpand clean)
    in unlines $ counts ++ [[],l0, l1, l2]
 
iubsExpand = L.fromChunks
             . foldr1 (.) (map (\snew input -> replace snew input) iubs)
             . (:[])
 
variants = [    "agggtaaa|tttaccct"    ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t"
           ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct"
           ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"]
 
iubs = map (\(s,new) -> (regex s,BC.pack new)) $
  [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)")
  ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")]
 
-- EXTERNAL INTERFACE TO REGULAR EXPRESSIONS
 
regex = (either (error.show) accept) . (\x -> P.parse p_regex x x)
  where accept re = re (Lexer True Done) -- Close a regular expression into a Lexer.
 
delete s = del where
  r = regex s
  del b = pieces 0 (run r b 0) where
    pieces end [] = B.drop end b : []
    pieces end ((start,stop):rest) = B.take (start-end) (B.drop end b) : pieces stop rest
 
count s b = runOverlappingCount r b 0
  where r = regex s
 
replace (r,new) = rep where
  rep [] = []
  rep (b:bs) = pieces 0 (run r b 0) where
    pieces 0 [] = b : rep bs
    pieces end [] | B.length b > end = B.drop end b : rep bs
                  | otherwise = rep bs
    pieces end ((start,stop):rest) | start > end = B.take (start-end) (B.drop end b) : new : pieces stop rest
                                   | otherwise = new : pieces stop rest
 
run :: Lexer -> B.ByteString -> Int -> [(Int,Int)]
run lexerIn b offsetIn = loop offsetIn
  where
    end = B.length b
    loop offset | offset == end = []
                | otherwise =
      let t@(start,stop) = lexOne b lexerIn offset
      in if start == -1 then  []
           else t : loop stop
 
runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int
runOverlappingCount lexerIn b offsetIn = loop offsetIn 0
  where
    end = B.length b
    loop !offset !c | offset == end = c
                    | otherwise =
      let start = fst $ lexOne b lexerIn offset
      in if start == -1 then c
           else loop (succ start) (succ c)
 
----------------------------------------------------------------
-- Construct a regex combinator from a string regex (use Parsec)
-- Designed to follow "man re_format" (Mac OS X 10.4.4)
--
-- The regular expressions accepted by the program include those using
-- |, empty group (), grouping with ( and ), wildcard '.', backslach
-- escaped characters "\.", greedy modifiers ? + and *, bracketed
-- alternatives including ranges such as [a-z0-9] and inverted
-- brackets such as [^]\n-].  Only 7-bit Ascii accepted.
 
-- 'p_regex' is the only "exported" function, used by 'regex' above
 
p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|'))
 
p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom))
 
p_atom = P.try (P.string "()" >> return epsilon)
     <|> P.between (P.char '(') (P.char ')') p_regex
     <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char
     <|> (pzero <?> "cannot parse regexp atom")
 
p_post_atom atom = (P.char '?' >> return (atom `quest`))
               <|> (P.char '+' >> return (atom `plus`))
               <|> (P.char '*' >> return (atom `star`))
               <|> (return (atom +>))
 
p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) )
 
p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-")))
                  middle <- P.manyTill P.anyChar (P.char ']')
                  let expand [] = []
                      expand ('-':[]) = "-"
                      expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest)
                      expand (x:xs) | x /= '-'  = x:(expand xs)
                                    | otherwise = error "A dash is in the wrong place in a p_set"
                      characters = nub ( sort (initial ++ (expand middle)) )
                  return $ if invert then alt ( ['\0'..'\127'] \\ characters )
                                     else alt characters
 
p_dot = P.char '.' >> return (alt ['\0'..'\127'])
 
p_escaped_char = P.char '\\' >> liftM char P.anyChar
 
p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\"
 
------------------------------------------------------------------------
--  And everything else is the midified CTK library.
--
--  Compiler Toolkit: Self-optimizing lexers
--
--  Author : Manuel M. T. Chakravarty
--  Created: 24 February 95, 2 March 99
--  Copyright (c) [1995..2000] Manuel M. T. Chakravarty
--  Copyright (c) 2004-6 Don Stewart
--
--  Self-optimizing lexer combinators.
--
--  For detailed information, see ``Lazy Lexing is Fast'', Manuel
--  M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of
--  Fourth Fuji International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.  (See my Web page for details.)
--
--             http://www.cse.unsw.edu.au/~chak/papers/Cha99.html
--
--  Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their
--  helpful suggestions that improved the design of this library.
 
-- INTERNAL IMPLEMENTATION OF REGULAR EXPRESSIONS
 
-- tree structure used to represent the lexer table
data Lexer = Lexer !Bool !Cont
 
-- represent the continuation of a lexer
            -- on top of the tree, where entries are dense, we use arrays
data Cont = Dense !BoundsNum !(Array Word8 Lexer)
          -- further down, where the valid entries are sparse, we
          -- use association lists, to save memory
          | Sparse !BoundsNum !(M.Map Word8 Lexer)
          -- end of a automaton
          | Done
 
type Regexp = Lexer -> Lexer
 
infixr 4 `quest`, `star`, `plus`
infixl 3 +>
 
-- Empty lexeme (noop)
epsilon = id :: Regexp
 
-- One character regexp
char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l])))
  where w = c2w c
 
-- accepts a non-empty set of alternative characters
-- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
alt cs  = \l -> let bnds = B (length ws) (minimum ws) (maximum ws)
                in Lexer False (aggregate bnds [(w, l) | w <- ws])
  where ws = map c2w cs
 
-- accept a character sequence
string cs = (foldr1 (+>) . map char) cs
 
-- Concatenation of regexps is just concatenation of functions
(+>)  = (.) :: Regexp -> Regexp -> Regexp
 
-- disjunctive combination of two regexps, corresponding to x|y
re1 >|< re2  = \l -> re1 l >||< re2 l
 
-- x `quest` y corresponds to the regular expression x?y
quest re1 re2  = (re1 +> re2) >|< re2
 
-- x `plus` y corresponds to the regular expression x+y
plus re1 re2  = re1 +> (re1 `star` re2)
 
-- x `star` y corresponds to the regular expression x*y
star re1 re2  = \l -> let self = re1 self >||< re2 l in self
 
-- Scan forwards searching for a match anywhere at or after
-- startOffset.  Return offsets of first matching character and after
-- last matching character or (-1,-1) for failure
lexOne b lexerIn !startOffset = let stop = oneLexeme lexerIn startOffset (-1)
                                in if stop == -1
                                     then if startOffset < end
                                            then lexOne b lexerIn (succ startOffset)
                                            else (-1,-1)
                                     else (startOffset,stop)
  where
    end = B.length b
    oneLexeme (Lexer win cont) !offset !last =
      let last' = if win then offset else last
      in if offset == end
           then last' -- at end, has to be this action
           else oneChar cont (B.index b offset) (succ offset) last'   -- keep looking
 
    oneChar tbl !c !offset' !last = case peek tbl c of
      (Lexer win Done)   -> if win then offset' else last
      l'                 -> oneLexeme l' offset' last
 
    peek (Dense bn arr)  !c | c `inBounds` bn = arr ! c
    peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls
    peek _ _ = (Lexer False Done)
 
-- disjunctive combination of two lexers (longest match, right biased)
(Lexer win c) >||< (Lexer win' c')  = Lexer (win || win') (joinConts c c')
 
-- represents the number of (non-error) elements and the bounds of a
-- DFA transition table
data BoundsNum = B !Int !Word8 !Word8
 
-- combine two bounds
addBoundsNum (B n lc hc) (B n' lc' hc')  = B (n + n') (min lc lc') (max hc hc')
 
-- check whether a character is in the bounds
inBounds c (B _ lc hc) = c >= lc && c <= hc
 
-- combine two disjunctive continuations
joinConts Done c'   = c'
joinConts c    Done = c
joinConts c    c'   = let (bn , cls ) = listify c
                          (bn', cls') = listify c'
                      -- note: `addsBoundsNum' can, at this point, only
                      --       approx. the number of *non-overlapping* cases;
                      --       however, the bounds are correct
                      in aggregate (addBoundsNum bn bn') (cls ++ cls')
  where listify (Dense  n arr) = (n, assocs arr)
        listify (Sparse n cls) = (n, M.toList cls)
 
-- we use the dense representation if a table has at least the given
-- number of (non-error) elements
denseMin  = 1 :: Int
 
-- Note: `n' is only an upper bound of the number of non-overlapping cases
aggregate bn@(B n lc hc) cls
  | n >= denseMin = Dense  bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls)
  | otherwise     = Sparse bn (M.fromList (accum (>||<) cls))
 
-- combine the elements in the association list that have the same key
accum _ []           = []
accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces'
  where gather k e []                             = ((k, e), [])
        gather k e (ke'@(k', e'):kes) | k == k'   = gather k (f e e') kes
                                      | otherwise = let (ke'', kes') = gather k e kes
                                                    in (ke'', ke':kes')

6 Proposed Legal entry : CTK 1

I have hacked and cleaned up the Lazy Lexer 6 to run one-at-a-time to be legal under the new rules. I will submit it after it getts benchmarked here. (Runs about 6 times slower on G4; still three times faster that the all parsec entry below) -- ChrisKuklewicz

{-# OPTIONS -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- http://haskell.org/hawiki/ShootoutEntry
-- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp.
--
-- Compile with: -O2 -package parsec
--
-- An idiomatic Haskell entry using lazy regex combinators, described in the paper:
--
--  Manuel  M. T. Chakravarty, Lazy Lexing is Fast.  
--  In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji
--  International Symposium on Functional and Logic Programming,
--  Springer-Verlag, LNCS 1722, 1999.
--
-- For more about higher-order combinator-based lexing and parsing in Haskell consult:
--
--  Graham Hutton. Higher-order functions for parsing. (1992)
--  Journal of Functional Programming 2: 232-343. 
--
--  Jeroen Fokker. Functional Parsers. (1995) 
--  Lecture Notes of the Baastad Spring school on Functional Programming.
--
--  Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996)
--  Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham.
--
--  Steve Hill. Combinators for parsing expressions. (1996)
--  Journal of Functional Programming 6(3): 445-463. 
--
--  Andrew Partridge and David Wright. 
--  Predictive parser combinators need four values to report errors. (1996)
--  Journal of Functional Programming 6(2): 355-364.
--
--  Doaitse Swierstra and Luc Duponcheel. 
--  Deterministic, Error-Correcting Combinator Parsers. (1996)
--  Advanced Functional Programming. LNCS 1129: 185-207.
--
--  Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999)
--  Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138.
--
--  Doaitse Swierstra and Pablo Azero. 
--  Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999)
--  SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129.
--
--  Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) 
--  Proceedings of the ACM SIGPLAN Haskell Workshop, 171?183.
--
-- And many other sources.
--
import List (sort,nub,(\\))
import Data.Array (Array, (!), assocs, accumArray)
import Control.Monad (liftM)
import qualified Data.Map as M
import qualified Text.ParserCombinators.Parsec as P
import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero)
 
main = interact $ \stdin ->
    let l0     = show $ length stdin in l0 `seq` 
    let clean  = fst $ run (reDelete "\n|>[^\n]+\n") stdin 0