Shootout/Regex DNA
< Shootout
This is a Shootout Entry for http://shootout.alioth.debian.org/benchmark.php?test=regexdna&lang=all.
Todo: this entry should be replaced by the new PCRE Text.Regex.ByteString module.
Contents |
1 About the regex-dna (new) benchmark
Each program should:
- read all of a redirected FASTA format file from stdin, and record the sequence length
- use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all
linefeed characters, and record the sequence length
- use the same simple regex patterns, representing DNA 8-mers and their reverse complement (with a
wildcard in one position), and (one pattern at a time) count matches in the redirected file
- write the regex pattern and count
- use the same simple regex patterns to make IUB code alternatives explicit, and (one pattern at a
time) match-replace the pattern in the redirect file, and record the sequence length
- write the 3 recorded sequence lengths
We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.
Correct output for this 100KB input file (generated with the fasta program N = 10000), is
agggtaaa|tttaccct 0 [cgt]gggtaaa|tttaccc[acg] 3 a[act]ggtaaa|tttacc[agt]t 9 ag[act]gtaaa|tttac[agt]ct 8 agg[act]taaa|ttta[agt]cct 10 aggg[acg]aaa|ttt[cgt]ccct 3 agggt[cgt]aa|tt[acg]accct 4 agggta[cgt]a|t[acg]taccct 3 agggtaa[cgt]|[acg]ttaccct 5
101745 100000 133640
2 Discussion
Don's Idiomatic Entry seems pretty well baked as the idiomatic GHC entry. For the high-speed GHC #2, we appear to be heading towards a new, full implementation of regular expressions in this entry. I would vote that we step back and implement enough to complete the entry because we've already bent the rules in this entry. One of the aspects that I like about my original non-Parsec entry was that it was fairly short and, IMO, fairly readable. In adding Parsec, I think that some of the comprehensibilitinessous has been lost.
I do think that we've improved the entry a lot! -- AlsonKemp
Is there any way to shorten the fastest entry? I still think that we've lost a lot of comprehensibility. -- AlsonKemp
Note that the Idiomatic Entry has already been submitted -- DonStewart
errmmm... Does the work that's done here suggest that the Text.Regex should be replaced? Not sure how to replace it or with which replacement... See also RegexSyntax. -- AlsonKemp
Well, if not replaced, we can at least provide alternatives. I propose:
* we add a Word8 interface to the existing Text.Regex, so we can avoid marshalling from packed strings and buffers * we add the lazy regex combinators to the Text.ParserCombinators suite. I use them all the time, and they're more than a decade old, so they should be standardised. I'll begin by cabalising the library. -- DonStewart
3 Benchmarks
N=5,000,000, Linux/x86
Old spec
||Name || Time || ||Original || Timeout || ||Functional 1 || Timeout || ||Functional 2 || Timeout || ||Combination || Timeout || ||Packed entry || 1200s || ||Lazy lexers 1 || 33.950 || ||Alson Kemp's parsec || 30.682s || ||Lazy lexers 2 || 28.000 || ||Modified Alson || 19.950 || ||Lazy lexers 5 + regex syntax || 12.106 || ||Lazy lexers 4 || 11.192 || ||Lexer 6 || 11.000 || ||Lazy lexers 3 (illegal)|| 6.300 || |||||||| ||perl entry || 4.005s ||
New spec ||Name || Time || ||Parsec 1 || 126s || ||CTK 1 || 60s ||
N.B. the syntax highlighting breaks on many of the following programs, resulting in bogus output.
4 Proposed entry
A tuned version of Chris' Submitted
{-# OPTIONS -funbox-strict-fields -fbang-patterns #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp. -- Updated for ByteString by Chris Kuklewicz February, 2007 -- -- Compile with: -O2 -package parsec -- -- An idiomatic Haskell entry using lazy regex combinators, described in the paper: -- -- Manuel M. T. Chakravarty, Lazy Lexing is Fast. -- In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji -- International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. import Monad import Data.Array (Array, assocs, accumArray,bounds,listArray) import Data.Array.Base (unsafeAt) import Data.ByteString.Base import qualified Data.ByteString as B import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Char8 as BC import Word import List hiding (delete) import qualified Data.Map as M import System import qualified Text.ParserCombinators.Parsec as P import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero) ------------------------------------------------------------------------ main = putStr . work =<< B.getContents work !fileIn = unlines $ counts ++ [[],l0, l1, l2] where l0 = show $ B.length fileIn clean = B.concat . delete "\n|>[^\n]+\n" $ fileIn l1 = show $ B.length clean counts = [ re++" "++show (count re clean) | re <- variants ] l2 = show $ L.length (iubsExpand clean) iubsExpand = L.fromChunks . foldr1 (.) (map replace iubs) . return variants = [ "agggtaaa|tttaccct" ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] iubs = map (\(s,new) -> (regex s,BC.pack new)) $ [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)") ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")] -- And that's it! ------------------------------------------------------------------------ -- external interface to regular expressions regex = (either (error.show) accept) . (\x -> P.parse p_regex x x) where accept re = re (Lexer True Done) -- Close a regular expression into a Lexer. (!) !a b = unsafeAt a (fromEnum $ (b - fst (bounds a))) delete s = del where r = regex s del b = pieces 0 (run r b 0) where pieces end [] = unsafeDrop end b : [] pieces !end ((start,stop):rest) = unsafeTake (start-end) (unsafeDrop end b) : pieces stop rest count s b = runOverlappingCount r b 0 where r = regex s replace (r,new) = rep where rep [] = [] rep (!b:bs) = pieces 0 (run r b 0) where pieces 0 [] = b : rep bs pieces end [] | B.length b > end = unsafeDrop end b : rep bs | otherwise = rep bs pieces end ((start,stop):rest) | start > end = unsafeTake (start-end) (unsafeDrop end b) : new : pieces stop rest | otherwise = new : pieces stop rest run :: Lexer -> B.ByteString -> Int -> [(Int,Int)] run lexerIn !b offsetIn = loop offsetIn where end = B.length b loop offset | offset == end = [] | otherwise = let t@(start,stop) = lexOne b lexerIn offset in if start == -1 then [] else t : loop stop runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int runOverlappingCount lexerIn !b offsetIn = loop offsetIn 0 where end = B.length b loop !offset !c | offset == end = c | otherwise = let start = fst $ lexOne b lexerIn offset in if start == -1 then c else loop (succ start) (succ c) -- -- Construct a regex combinator from a string regex (use Parsec) -- Designed to follow "man re_format" (Mac OS X 10.4.4) -- -- The regular expressions accepted by the program include those using -- |, empty group (), grouping with ( and ), wildcard '.', backslach -- escaped characters "\.", greedy modifiers ? + and *, bracketed -- alternatives including ranges such as [a-z0-9] and inverted -- brackets such as [^]\n-]. Only 7-bit Ascii accepted. -- 'p_regex' is the only "exported" function, used by 'regex' above p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|')) p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom)) p_atom = P.try (P.string "()" >> return epsilon) <|> P.between (P.char '(') (P.char ')') p_regex <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char <|> (pzero <?> "cannot parse regexp atom") p_post_atom atom = (P.char '?' >> return (atom `quest`)) <|> (P.char '+' >> return (atom `plus`)) <|> (P.char '*' >> return (atom `star`)) <|> (return (atom +>)) p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) ) p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-"))) middle <- P.manyTill P.anyChar (P.char ']') let expand [] = [] expand ('-':[]) = "-" expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest) expand (x:xs) | x /= '-' = x:(expand xs) | otherwise = error "A dash is in the wrong place in a p_set" characters = nub ( sort (initial ++ (expand middle)) ) return $ if invert then alt ( ['\0'..'\127'] \\ characters ) else alt characters p_dot = P.char '.' >> return (alt ['\0'..'\127']) p_escaped_char = P.char '\\' >> liftM char P.anyChar p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\" -- -- And everything else is the modified CTK library. -- -- Compiler Toolkit: Self-optimizing lexers -- Author : Manuel M. T. Chakravarty -- -- tree structure used to represent the lexer table data Lexer = Lexer !Bool !Cont -- represent the continuation of a lexer -- on top of the tree, where entries are dense, we use arrays data Cont = Dense !BoundsNum !(Array Word8 Lexer) -- further down, where the valid entries are sparse, we -- use association lists, to save memory | Sparse !BoundsNum !(M.Map Word8 Lexer) -- end of a automaton | Done type Regexp = Lexer -> Lexer infixr 4 `quest`, `star`, `plus` infixl 3 +> -- Empty lexeme (noop) epsilon = id :: Regexp -- One character regexp char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l]))) where w = c2w c -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length ws) (minimum ws) (maximum ws) in Lexer False (aggregate bnds [(w, l) | w <- ws]) where ws = map c2w cs -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- Concatenation of regexps is just concatenation of functions (+>) = (.) :: Regexp -> Regexp -> Regexp -- disjunctive combination of two regexps, corresponding to x|y re1 >|< re2 = \l -> re1 l >||< re2 l -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `star` y corresponds to the regular expression x*y star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- Scan forwards searching for a match anywhere at or after -- startOffset. Return offsets of first matching character and after -- last matching character or (-1,-1) for failure lexOne !b lexerIn !startOffset = let stop = oneLexeme lexerIn startOffset (-1) in if stop == -1 then if startOffset < end then lexOne b lexerIn (succ startOffset) else (-1,-1) else (startOffset,stop) where end = B.length b oneLexeme (Lexer win cont) !offset !last = let last' = if win then offset else last in if offset == end then last' -- at end, has to be this action else oneChar cont (unsafeIndex b offset) (succ offset) last' -- keep looking oneChar tbl !c !offset' !last = case peek tbl c of (Lexer win Done) -> if win then offset' else last l' -> oneLexeme l' offset' last peek (Dense bn arr) !c | c `inBounds` bn = arr ! c peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls peek _ _ = (Lexer False Done) -- disjunctive combination of two lexers (longest match, right biased) (Lexer win c) >||< (Lexer win' c') = Lexer (win || win') (joinConts c c') -- represents the number of (non-error) elements and the bounds of a -- DFA transition table data BoundsNum = B !Int !Word8 !Word8 -- combine two bounds addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc') -- check whether a character is in the bounds inBounds c (B _ lc hc) = c >= lc && c <= hc -- combine two disjunctive continuations joinConts Done c' = c' joinConts c Done = c joinConts c c' = let (bn , cls ) = listify c (bn', cls') = listify c' -- note: `addsBoundsNum' can, at this point, only -- approx. the number of *non-overlapping* cases; -- however, the bounds are correct in aggregate (addBoundsNum bn bn') (cls ++ cls') where listify (Dense n arr) = (n, assocs arr) listify (Sparse n cls) = (n, M.toList cls) -- we use the dense representation if a table has at least the given -- number of (non-error) elements denseMin = 1 :: Int -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls) | otherwise = Sparse bn (M.fromList (accum (>||<) cls)) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes')
5 Chris' Bytestring Proposal
I have modified the old entry to use Data.ByteString and modified the matching engine to produce (start,stop) offsets. This runs really fast for me. Oddly, it does not use the 'Sparse' variant of 'Cont' but erasing 'Sparse' from the code makes it perform much much worse.
I expect there is a more performance to be tweaked, but memory consumption is good since I see nearly perfect productivity on the full data set (WinXP/Core Duo 2):
agggtaaa|tttaccct 36
[cgt]gggtaaa|tttaccc[acg] 125
a[act]ggtaaa|tttacc[agt]t 426
ag[act]gtaaa|tttac[agt]ct 290
agg[act]taaa|ttta[agt]cct 536
aggg[acg]aaa|ttt[cgt]ccct 153
agggt[cgt]aa|tt[acg]accct 143
agggta[cgt]a|t[acg]taccct 160
agggtaa[cgt]|[acg]ttaccct 219
5083411
5000000
6678892
5,742,128,636 bytes allocated in the heap
12,691,824 bytes copied during GC (scavenged)
30,713,012 bytes copied during GC (not scavenged)
5,248,108 bytes maximum residency (5 sample(s))
10944 collections in generation 0 ( 0.06s)
5 collections in generation 1 ( 0.02s)
11 Mb total memory in use
INIT time 0.02s ( 0.00s elapsed)
MUT time 8.36s ( 8.58s elapsed)
GC time 0.08s ( 0.09s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.45s ( 8.67s elapsed)
%GC time 0.9% (1.1% elapsed)
Alloc rate 685,627,299 bytes per MUT second
Productivity 98.9% of total user, 96.4% of total elapsed
5.1 The code
{-# OPTIONS -funbox-strict-fields -fbang-patterns #-} -- -- This never uses 'Sparse' but remiving it killed performance -- ghc glitch? -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- http://haskell.org/haskellwiki/Shootout/Regex_DNA -- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp. -- Updated for ByteString by Chris Kuklewicz February, 2007 -- -- Compile with: -O2 -package parsec -- -- An idiomatic Haskell entry using lazy regex combinators, described in the paper: -- -- Manuel M. T. Chakravarty, Lazy Lexing is Fast. -- In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji -- International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. module Main(main) where import Control.Monad (liftM) import Data.Array (Array, assocs, accumArray,bounds,listArray) import Data.Array.Base (unsafeAt) import qualified Data.ByteString as B (ByteString,length,take,drop,index,concat,hGetContents) import Data.ByteString.Base (c2w) import qualified Data.ByteString.Char8 as BC (pack) import qualified Data.ByteString.Lazy as L (length,fromChunks) import Data.Word (Word8) import Data.List (sort,nub,(\\)) import qualified Data.Map as M (Map,toList,fromList,findWithDefault) import System.IO (stdin) import qualified Text.ParserCombinators.Parsec as P import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero) (!) a b = unsafeAt a (fromEnum $ (b - fst (bounds a))) main = do wholeFile <- B.hGetContents stdin putStr (work wholeFile) work fileIn = let l0 = show $ B.length fileIn in l0 `seq` let clean = B.concat . delete "\n|>[^\n]+\n" $ fileIn l1 = show $ B.length clean in l1 `seq` let counts = [ stringRegex++" "++show (count stringRegex clean) | stringRegex <- variants ] -- Count the variants one at a time. l2 = show $ L.length (iubsExpand clean) in unlines $ counts ++ [[],l0, l1, l2] iubsExpand = L.fromChunks . foldr1 (.) (map (\snew input -> replace snew input) iubs) . (:[]) variants = [ "agggtaaa|tttaccct" ,"[cgt]gggtaaa|tttaccc[acg]","a[act]ggtaaa|tttacc[agt]t" ,"ag[act]gtaaa|tttac[agt]ct","agg[act]taaa|ttta[agt]cct","aggg[acg]aaa|ttt[cgt]ccct" ,"agggt[cgt]aa|tt[acg]accct","agggta[cgt]a|t[acg]taccct","agggtaa[cgt]|[acg]ttaccct"] iubs = map (\(s,new) -> (regex s,BC.pack new)) $ [("B","(c|g|t)"),("D","(a|g|t)"),("H","(a|c|t)"),("K","(g|t)") ,("M","(a|c)") ,("N","(a|c|g|t)"),("R","(a|g)") ,("S","(c|g)"),("V","(a|c|g)"),("W","(a|t)") ,("Y","(c|t)")] -- EXTERNAL INTERFACE TO REGULAR EXPRESSIONS regex = (either (error.show) accept) . (\x -> P.parse p_regex x x) where accept re = re (Lexer True Done) -- Close a regular expression into a Lexer. delete s = del where r = regex s del b = pieces 0 (run r b 0) where pieces end [] = B.drop end b : [] pieces end ((start,stop):rest) = B.take (start-end) (B.drop end b) : pieces stop rest count s b = runOverlappingCount r b 0 where r = regex s replace (r,new) = rep where rep [] = [] rep (b:bs) = pieces 0 (run r b 0) where pieces 0 [] = b : rep bs pieces end [] | B.length b > end = B.drop end b : rep bs | otherwise = rep bs pieces end ((start,stop):rest) | start > end = B.take (start-end) (B.drop end b) : new : pieces stop rest | otherwise = new : pieces stop rest run :: Lexer -> B.ByteString -> Int -> [(Int,Int)] run lexerIn b offsetIn = loop offsetIn where end = B.length b loop offset | offset == end = [] | otherwise = let t@(start,stop) = lexOne b lexerIn offset in if start == -1 then [] else t : loop stop runOverlappingCount :: Lexer -> B.ByteString -> Int -> Int runOverlappingCount lexerIn b offsetIn = loop offsetIn 0 where end = B.length b loop !offset !c | offset == end = c | otherwise = let start = fst $ lexOne b lexerIn offset in if start == -1 then c else loop (succ start) (succ c) ---------------------------------------------------------------- -- Construct a regex combinator from a string regex (use Parsec) -- Designed to follow "man re_format" (Mac OS X 10.4.4) -- -- The regular expressions accepted by the program include those using -- |, empty group (), grouping with ( and ), wildcard '.', backslach -- escaped characters "\.", greedy modifiers ? + and *, bracketed -- alternatives including ranges such as [a-z0-9] and inverted -- brackets such as [^]\n-]. Only 7-bit Ascii accepted. -- 'p_regex' is the only "exported" function, used by 'regex' above p_regex = liftM (foldr1 (>|<)) (P.sepBy1 p_branch (P.char '|')) p_branch = liftM (($ epsilon).(foldr (.) id)) (P.many1 (p_atom >>= p_post_atom)) p_atom = P.try (P.string "()" >> return epsilon) <|> P.between (P.char '(') (P.char ')') p_regex <|> p_bracket <|> p_dot <|> p_escaped_char <|> p_other_char <|> (pzero <?> "cannot parse regexp atom") p_post_atom atom = (P.char '?' >> return (atom `quest`)) <|> (P.char '+' >> return (atom `plus`)) <|> (P.char '*' >> return (atom `star`)) <|> (return (atom +>)) p_bracket = (P.char '[') >> ( (P.char '^' >> p_set True) <|> (p_set False) ) p_set invert = do initial <- (P.option "" ((P.char ']' >> return "]") <|> (P.char '-' >> return "-"))) middle <- P.manyTill P.anyChar (P.char ']') let expand [] = [] expand ('-':[]) = "-" expand (a:'-':b:rest) | a /= '-' = (enumFromTo a b)++(expand rest) expand (x:xs) | x /= '-' = x:(expand xs) | otherwise = error "A dash is in the wrong place in a p_set" characters = nub ( sort (initial ++ (expand middle)) ) return $ if invert then alt ( ['\0'..'\127'] \\ characters ) else alt characters p_dot = P.char '.' >> return (alt ['\0'..'\127']) p_escaped_char = P.char '\\' >> liftM char P.anyChar p_other_char = liftM char (P.noneOf specials) where specials = "^.[$()|*+?\\" ------------------------------------------------------------------------ -- And everything else is the midified CTK library. -- -- Compiler Toolkit: Self-optimizing lexers -- -- Author : Manuel M. T. Chakravarty -- Created: 24 February 95, 2 March 99 -- Copyright (c) [1995..2000] Manuel M. T. Chakravarty -- Copyright (c) 2004-6 Don Stewart -- -- Self-optimizing lexer combinators. -- -- For detailed information, see ``Lazy Lexing is Fast'', Manuel -- M. T. Chakravarty, in A. Middeldorp and T. Sato, editors, Proceedings of -- Fourth Fuji International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. (See my Web page for details.) -- -- http://www.cse.unsw.edu.au/~chak/papers/Cha99.html -- -- Thanks to Simon L. Peyton Jones and Roman Leshchinskiy for their -- helpful suggestions that improved the design of this library. -- INTERNAL IMPLEMENTATION OF REGULAR EXPRESSIONS -- tree structure used to represent the lexer table data Lexer = Lexer !Bool !Cont -- represent the continuation of a lexer -- on top of the tree, where entries are dense, we use arrays data Cont = Dense !BoundsNum !(Array Word8 Lexer) -- further down, where the valid entries are sparse, we -- use association lists, to save memory | Sparse !BoundsNum !(M.Map Word8 Lexer) -- end of a automaton | Done type Regexp = Lexer -> Lexer infixr 4 `quest`, `star`, `plus` infixl 3 +> -- Empty lexeme (noop) epsilon = id :: Regexp -- One character regexp char c = (\l -> Lexer False (Dense (B 1 w w) (listArray (w,w) [l]))) where w = c2w c -- accepts a non-empty set of alternative characters -- Equiv. to `(foldr1 (>|<) . map char) cs', but much faster alt cs = \l -> let bnds = B (length ws) (minimum ws) (maximum ws) in Lexer False (aggregate bnds [(w, l) | w <- ws]) where ws = map c2w cs -- accept a character sequence string cs = (foldr1 (+>) . map char) cs -- Concatenation of regexps is just concatenation of functions (+>) = (.) :: Regexp -> Regexp -> Regexp -- disjunctive combination of two regexps, corresponding to x|y re1 >|< re2 = \l -> re1 l >||< re2 l -- x `quest` y corresponds to the regular expression x?y quest re1 re2 = (re1 +> re2) >|< re2 -- x `plus` y corresponds to the regular expression x+y plus re1 re2 = re1 +> (re1 `star` re2) -- x `star` y corresponds to the regular expression x*y star re1 re2 = \l -> let self = re1 self >||< re2 l in self -- Scan forwards searching for a match anywhere at or after -- startOffset. Return offsets of first matching character and after -- last matching character or (-1,-1) for failure lexOne b lexerIn !startOffset = let stop = oneLexeme lexerIn startOffset (-1) in if stop == -1 then if startOffset < end then lexOne b lexerIn (succ startOffset) else (-1,-1) else (startOffset,stop) where end = B.length b oneLexeme (Lexer win cont) !offset !last = let last' = if win then offset else last in if offset == end then last' -- at end, has to be this action else oneChar cont (B.index b offset) (succ offset) last' -- keep looking oneChar tbl !c !offset' !last = case peek tbl c of (Lexer win Done) -> if win then offset' else last l' -> oneLexeme l' offset' last peek (Dense bn arr) !c | c `inBounds` bn = arr ! c peek (Sparse bn cls) !c | c `inBounds` bn = M.findWithDefault (Lexer False Done) c cls peek _ _ = (Lexer False Done) -- disjunctive combination of two lexers (longest match, right biased) (Lexer win c) >||< (Lexer win' c') = Lexer (win || win') (joinConts c c') -- represents the number of (non-error) elements and the bounds of a -- DFA transition table data BoundsNum = B !Int !Word8 !Word8 -- combine two bounds addBoundsNum (B n lc hc) (B n' lc' hc') = B (n + n') (min lc lc') (max hc hc') -- check whether a character is in the bounds inBounds c (B _ lc hc) = c >= lc && c <= hc -- combine two disjunctive continuations joinConts Done c' = c' joinConts c Done = c joinConts c c' = let (bn , cls ) = listify c (bn', cls') = listify c' -- note: `addsBoundsNum' can, at this point, only -- approx. the number of *non-overlapping* cases; -- however, the bounds are correct in aggregate (addBoundsNum bn bn') (cls ++ cls') where listify (Dense n arr) = (n, assocs arr) listify (Sparse n cls) = (n, M.toList cls) -- we use the dense representation if a table has at least the given -- number of (non-error) elements denseMin = 1 :: Int -- Note: `n' is only an upper bound of the number of non-overlapping cases aggregate bn@(B n lc hc) cls | n >= denseMin = Dense bn (accumArray (>||<) (Lexer False Done) (lc, hc) cls) | otherwise = Sparse bn (M.fromList (accum (>||<) cls)) -- combine the elements in the association list that have the same key accum _ [] = [] accum f ((c, el):ces) = let (ce, ces') = gather c el ces in ce : accum f ces' where gather k e [] = ((k, e), []) gather k e (ke'@(k', e'):kes) | k == k' = gather k (f e e') kes | otherwise = let (ke'', kes') = gather k e kes in (ke'', ke':kes')
6 Proposed Legal entry : CTK 1
I have hacked and cleaned up the Lazy Lexer 6 to run one-at-a-time to be legal under the new rules. I will submit it after it getts benchmarked here. (Runs about 6 times slower on G4; still three times faster that the all parsec entry below) -- ChrisKuklewicz
{-# OPTIONS -funbox-strict-fields #-} -- -- The Computer Language Shootout -- http://shootout.alioth.debian.org/ -- -- http://haskell.org/hawiki/ShootoutEntry -- Contributed by Don Stewart, Chris Kuklewicz and Alson Kemp. -- -- Compile with: -O2 -package parsec -- -- An idiomatic Haskell entry using lazy regex combinators, described in the paper: -- -- Manuel M. T. Chakravarty, Lazy Lexing is Fast. -- In A. Middeldorp and T. Sato, editors, Proceedings of Fourth Fuji -- International Symposium on Functional and Logic Programming, -- Springer-Verlag, LNCS 1722, 1999. -- -- For more about higher-order combinator-based lexing and parsing in Haskell consult: -- -- Graham Hutton. Higher-order functions for parsing. (1992) -- Journal of Functional Programming 2: 232-343. -- -- Jeroen Fokker. Functional Parsers. (1995) -- Lecture Notes of the Baastad Spring school on Functional Programming. -- -- Graham Hutton and Erik Meijer. Monadic Parser Combinators. (1996) -- Technical report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham. -- -- Steve Hill. Combinators for parsing expressions. (1996) -- Journal of Functional Programming 6(3): 445-463. -- -- Andrew Partridge and David Wright. -- Predictive parser combinators need four values to report errors. (1996) -- Journal of Functional Programming 6(2): 355-364. -- -- Doaitse Swierstra and Luc Duponcheel. -- Deterministic, Error-Correcting Combinator Parsers. (1996) -- Advanced Functional Programming. LNCS 1129: 185-207. -- -- Pieter Koopman and Rinus Plasmeijer. Efficient Combinator Parsers. (1999) -- Implementation of Functional Languages. Springer Verlag, LNCS 1595: 122-138. -- -- Doaitse Swierstra and Pablo Azero. -- Fast, Error Correcting Parser Combinators: A Short Tutorial. (1999) -- SOFSEM'99 Theory and Practice of Informatics. LNCS 1725: 111-129. -- -- Arthur Baars, Andres Loh, and Doaitse Swierstra. Parsing Permutation Phrases. (2001) -- Proceedings of the ACM SIGPLAN Haskell Workshop, 171?183. -- -- And many other sources. -- import List (sort,nub,(\\)) import Data.Array (Array, (!), assocs, accumArray) import Control.Monad (liftM) import qualified Data.Map as M import qualified Text.ParserCombinators.Parsec as P import Text.ParserCombinators.Parsec ((<|>),(<?>),pzero) main = interact $ \stdin -> let l0 = show $ length stdin in l0 `seq` let clean = fst $ run (reDelete "\n|>[^\n]+\n") stdin 0
