[Haskell] regular expression syntax - perl ain't got nothin on haskell

John Meacham john at repetae.net
Mon Feb 23 18:07:42 EST 2004


Inspired by an idea by Andrew Pang and an old project of mine, I decided
to fill out a reusable regular expression library which is similar to
Perl's, but much more expressive.

It provides regular and monadic versions, a very overloaded and useful
interface, as well as extensibility. although currently the only
instance  is based on Text.Regex, it generalizes to matching lists of
arbitrary type, not just strings, and also leaves the door open for
compile-time checked and optimized regular expressions via template
Haskell.

so, does this seem interesting? I am really enjoying the =~ syntax
already in projects, and the monadic version is great for concise ad-hoc
parsers. 

my next steps are to 
1. finish my template Haskell regex compiler, not only to optimize at
compile time, but allow matching of arbitrary types. not just strings.
(perhaps reusing some code from the old Regular Expressions in Haskell
project)
2. make a pcre binding. hook it in.

and possible ideas for the future are
 * remove restriction to lists, so tree-like structures can be matched.
    perhaps grabbing default value from Monoid. The only reason I don't
    have this now is I didn't want to enable overlapping-instances,
    things are already complicated enough :)
 * add further optimized versions of match*, perhaps split up class for 
    when there is no clear concept of the preceding part and postceeding
    parts around a match..
 * develop a substitution syntax
 * implement pcre in pure Haskell

although, even now, it seems quite useful.

new versions will appear at
http://repetae.net/john/computer/haskell/RegexSyntax.hs

plus, this is something to brag about to Perl people, they can only do
things differently based on scalar vs. list context. our =~ does 10
different things typesafely and probably some more I missed :)

{-# OPTIONS -fglasgow-exts  #-}

module RegexSyntax(RegexLike(..), RegexLikeImp(..), RegexContext(..), (!~)) where

import Array
import Text.Regex
import Maybe

{-

basic usage:

> string =~ "regular expression" 
returns different things depending on context

type - what it evaluates to
---------------------------
Int - number of times the regular expression matches
String -  matching portion of string
(String,String,String) - (text before match, matching text, text after match)
[Either String String] - list of matching and nonmatching strings, if concated,
  the original string results.  Left = notmatching, Right = matching.
Bool - whether the string matches
() - always returns () (useful in monad context, see below)
[String] - list of matches
Array Int String - list of substring matches for first match 
(String, Array Int String) - full matching text and substring matches
[(String, Array Int String)] - all matches, full match plus substrings
[Array Int String] - all substrings from all matches

also, there is the monadic version (=~~) which always behaves exactly the same
as (=~) except when the match fails, instead of returning a default value, the
monad fails. 

s !~ re = not (s =~ re) for convinience

regular expressions:

these may be strings, which are interpreted as regular expressions, or Regex's
from the Text.Regex module. or any other instance of the RegexLike class. 

when using strings, you may prefix the regex by "i/" for a case-insensitive
match and "s/" to treat the string as a single line. (or both as "si/")
A leading "/" is ignored, other than these cases "/" is not special.

advanced features:

not just strings can be matched, but rather lists of anything a matcher is
defined for.  RegexLikeImp data class can be used for in-place code generated
by template haskell for compile-time checked regular expresions


-}

class RegexLike r a | r -> a where
    matchOnce :: r  -> [a] -> Maybe ([a],[a],[a],Array Int [a])
    matchTest :: r  -> [a] -> Bool
    matchAll  :: r  -> [a] -> [Either [a] ([a],Array Int [a])]
    matchShow :: r -> String  -- for error messages
    matchTest r xs = isJust (matchOnce r xs)
    matchAll r xs = case matchOnce r xs of
        Nothing -> pn xs []
        Just (p,m,rest,as) -> pn p (Right (m,as):matchAll r rest)
      where pn x = if null x then id else (Left x:) 
    matchShow _ = "Unknown"


instance RegexLike Regex Char where 
    matchOnce re xs = fmap f (matchRegexAll re xs) where
        f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls)
    matchShow _ = "Regex"
    
instance RegexLike String Char where 
    matchOnce re xs = fmap f (matchRegexAll (mr re) xs) where
        f (x,y,z,ls) = (x,y,z,listArray (1,length ls) ls)
        mr ('i':'/':re) = mkRegexWithOpts re True False
        mr ('s':'/':re) = mkRegexWithOpts re False True
        mr ('i':'s':'/':re)  = mkRegexWithOpts re False False
        mr ('s':'i':'/':re)  = mkRegexWithOpts re False False
        mr ('/':re) = mkRegex re 
        mr (re) = mkRegex re 
    matchShow re = re


class RegexContext x a where
    (=~) :: RegexLike r x => [x] -> r -> a
    (=~~) :: (Monad m, RegexLike r x) => [x] -> r -> m a
    -- s =~~ re = return (s =~~ re)  not default because probably not what you want

s !~ re = not (s =~ re)
regexFailed re =  fail $ "regex failed to match: " ++ matchShow re

instance  RegexContext x Int where
    s =~ re = let xs =  matchAll  re s in length [x | Right x <- xs]
    s =~~ re = case (s =~ re) of
        0 -> regexFailed re
        xs -> return $  xs

instance RegexContext x ([x],[x],[x]) where
    s =~ re =  case matchOnce re s of 
        Nothing -> (s,[],[])
        Just (x,y,z,_) -> (x,y,z)
    s =~~ re =  case matchOnce re s of 
        Nothing -> regexFailed re 
        Just (x,y,z,_) -> return (x,y,z)

instance RegexContext x [Either [x] [x]] where
    s =~ re = map f $  matchAll re s where
        f (Left s) = Left s
        f (Right (x,_)) = Right x
    s =~~ re = case (s =~ re) of
        [Left _] -> regexFailed re
        xs -> return $  xs


instance RegexContext x [x] where
    s =~ re = case matchOnce re s of 
        Nothing -> [] `asTypeOf` s
        Just (_,s,_,_) -> s
    s =~~ re = case matchOnce re s of 
        Nothing -> regexFailed re
        Just (_,s,_,_) -> return s

-- useful in non-monad context
instance RegexContext x Bool where 
    s =~ re = matchTest re s
    s =~~ re = case s =~ re of
        False -> regexFailed re
        True -> return True
-- useful in monad context        
instance RegexContext x () where  
    s =~ re = ()
    s =~~ re = case s =~ re of
        False -> regexFailed re
        True -> return ()

instance RegexContext x [[x]] where
    s =~ re =  [x | Right (x,_) <- matchAll re s]
    s =~~ re = case (s =~ re) of
        [] -> regexFailed re
        xs -> return  xs

instance RegexContext x [([x],Array Int [x])] where 
    s =~ re =  [x | Right x <- matchAll re s]
    s =~~ re = case (s =~ re) of
        [] -> regexFailed re
        xs -> return  xs

instance RegexContext x [Array Int [x]] where 
    s =~ re =  [x | Right (_,x) <- matchAll re s]
    s =~~ re = case (s =~ re) of
        [] -> regexFailed re
        xs -> return  xs

instance RegexContext x (Array Int [x]) where 
    s =~ re = case matchOnce re s of 
        Nothing -> listArray (1,0) [] 
        Just (_,_,_,z) -> z
    s =~~ re = case matchOnce re s of 
        Nothing -> regexFailed re
        Just (_,_,_,z) -> return z


-- this is used for template haskell to generate compile-time parsed regular
-- expressions

data RegexLikeImp a = RegexLikeImp { 
    reImpMatchOnce ::  [a] -> Maybe ([a],[a],[a],Array Int [a]),
    reImpMatchTest ::  [a] -> Bool,
    reImpMatchAll  ::  [a] -> [Either [a] ([a],Array Int [a])],
    reImpMatchShow ::  String  -- for error messages
    }

instance RegexLike (RegexLikeImp a) a where
    matchOnce = reImpMatchOnce
    matchTest = reImpMatchTest
    matchAll  = reImpMatchAll
    matchShow = reImpMatchShow

-- 
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john at foo.net
--------------------------------------------------------------------------


More information about the Haskell mailing list