Regular expressions for XML Schema
From HaskellWiki
UweSchmidt (Talk | contribs) |
UweSchmidt (Talk | contribs) m |
||
(4 intermediate revisions by one user not shown) | |||
Line 3: | Line 3: | ||
[[Category:Tools]] |
[[Category:Tools]] |
||
[[Category:Tutorials]] |
[[Category:Tutorials]] |
||
+ | [[Category:Libraries]] |
||
== Regular expressions for XML Schema == |
== Regular expressions for XML Schema == |
||
The [http://www.w3.org/TR/xmlschema11-2/#regexs W3C XML Schema] specification defines a language for regular expressions. This language is used |
The [http://www.w3.org/TR/xmlschema11-2/#regexs W3C XML Schema] specification defines a language for regular expressions. This language is used |
||
− | in the XML Schema spec when defining the data type library part. The [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-xmlschema regex-xmlschema] |
+ | in the XML Schema spec when defining the data type library part. The [http://hackage.haskell.org/package/hxt-regex-xmlschema hxt-regex-xmlschema] |
− | package contains a complete implementation of this spec. It is implemented with the technique of derivations of regular expression. Main features are |
+ | package contains a complete implementation of this spec. |
+ | (old package name:[http://hackage.haskell.org/package/regex-xmlschema regex-xmlschema]) |
||
+ | It is implemented with the technique of derivations of regular expression. Main features are |
||
full support of Unicode including all Unicode codeblocks and character properties, purely functional interface, extensions for intersection, |
full support of Unicode including all Unicode codeblocks and character properties, purely functional interface, extensions for intersection, |
||
set difference and |
set difference and |
||
− | exclusive or of regular sets (regular expressions), extensions for subexpression matches, interface for matching, stream like editing and tokenizing. |
+ | exclusive OR of regular sets (regular expressions), extensions for subexpression matches, interface for matching, stream (sed) like editing and tokenizing. |
+ | This library is part of the Haskell XML Toolbox (HXT), which is described in |
||
+ | [[HXT| HXT: A gentle introduction to the Haskell XML Toolbox]] |
||
__TOC__ |
__TOC__ |
||
Line 20: | Line 21: | ||
there was the need for a complete regular expression matcher for the W3C XML Schema regular expression syntax. |
there was the need for a complete regular expression matcher for the W3C XML Schema regular expression syntax. |
||
The string representation of basic data types in XML Schema as well as in the RelaxNG standard can be defined by regular expressions. |
The string representation of basic data types in XML Schema as well as in the RelaxNG standard can be defined by regular expressions. |
||
− | The available Haskell libraries for processing of regular expressions where not applicable for this task, e.g. they used other r.e. grammars or |
+ | The available Haskell libraries for processing regular expressions where not applicable for this task, e.g. they used other r.e. grammars or |
did not support Unicode. |
did not support Unicode. |
||
Line 32: | Line 33: | ||
that this library is generally useful, not only for XML validation. And with some rather simple extensions it became |
that this library is generally useful, not only for XML validation. And with some rather simple extensions it became |
||
possible to not only use this for matching strings, as required in HXT, but also for sed like stream editing and for |
possible to not only use this for matching strings, as required in HXT, but also for sed like stream editing and for |
||
− | easy construction of lightweight scanners and tokenizers. This was the motivation for doing a stand alone version of |
+ | easy construction of lightweight scanners and tokenizers. This was the motivation for doing a stand alone package of |
that regular expression library. |
that regular expression library. |
||
== Resources == |
== Resources == |
||
− | ; [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-xmlschema Hackage download] |
+ | ; [http://hackage.haskell.org/package/hxt-regex-xmlschema Hackage download] |
− | ; [http://git.fh-wedel.de/repos/regex.git/ git repository with head revision of regex-xmlschema] (git clone http://git.fh-wedel.de/repos/regex.git) |
+ | ; [http://github.com/UweSchmidt/hxt| HXT Repository on GitHub] |
== The idea of derivations of regular expressions == |
== The idea of derivations of regular expressions == |
||
The idea of derivations of regular expression was developed by Janusz A. Brzozowski, Princeton Univ. in 1964. |
The idea of derivations of regular expression was developed by Janusz A. Brzozowski, Princeton Univ. in 1964. |
||
− | Goal was to perform the test whether a string '''w''' is a word of a given regular set (given by a regular expression '''r''') |
+ | Goal was to perform the test whether a string ''w'' is a word of a given regular set (given by a regular expression ''r'') |
− | by manipulating the (internal tree like representation of the) regular expression '''r'''. |
+ | by manipulating the (internal tree like representation of the) regular expression ''r''. |
− | The word test is based on two functions. The first, here called '''nullable''' checks, whether the empty word ε |
+ | The word test is based on two functions. The first, here called ''nullable'' checks, whether the empty word ε |
− | is contained in the regular set associated with '''r'''. This test can easily be done and it can be done efficiently. |
+ | is contained in the regular set associated with ''r''. This test can easily be done and it can be done efficiently. |
− | Given an expression '''r''' and a single char '''x''' the second function, here called '''delta''' computes the so called |
+ | Given an expression ''r'' and a single char ''x'' the second function, here called ''delta'' computes the so called |
− | derivative of '''r''' with respect to '''x'''. The derivative '''delta r x''' is again a regular expression. |
+ | derivative of ''r'' with respect to ''x''. The derivative ''delta r x'' is again a regular expression. |
− | The following law (in Haskell like notation) must hold for '''delta r x''': |
+ | The following law (in Haskell like notation) must hold for ''delta r x'': |
<haskell> |
<haskell> |
||
Line 62: | Line 63: | ||
Let's see how regular Expression can be modelled with Haskell data types and how |
Let's see how regular Expression can be modelled with Haskell data types and how |
||
− | '''nullable''' and '''delta''' work. In the example we will work with the fixed alphabet of Haskell chars. |
+ | ''nullable'' and ''delta'' work. In the example we will work with the fixed alphabet of Haskell chars. |
<haskell> |
<haskell> |
||
Line 75: | Line 76: | ||
In this first version we use the minimal set of regular expressions: |
In this first version we use the minimal set of regular expressions: |
||
The empty set, the set containing ε and the single char sets |
The empty set, the set containing ε and the single char sets |
||
− | form the primitive sets, '''Star''' is the repetition, '''Seq''' the concatenation |
+ | form the primitive sets, ''Star'' is the repetition, ''Seq'' the concatenation |
− | and '''Alt''' the choice operator. |
+ | and ''Alt'' the choice operator. |
− | '''nullable''' is defined like this, so it's easy and efficient to compute this predicate: |
+ | ''nullable'' is defined like this, so it's easy and efficient to compute this predicate: |
<haskell> |
<haskell> |
||
Line 94: | Line 95: | ||
</haskell> |
</haskell> |
||
− | We see for the three simple cases that only '''Unit''' is nullable, |
+ | We see for the three simple cases that only ''Unit'' is nullable, |
− | '''r*''' contains per definition the empty word. A sequence contains the empty word |
+ | ''r*'' contains per definition the empty word. A sequence contains the empty word |
only in case where both parts are nullable, a union is nullable when at least one operand |
only in case where both parts are nullable, a union is nullable when at least one operand |
||
is nullable. |
is nullable. |
||
− | For '''delta''' we've again 6 cases: |
+ | For ''delta'' we've again 6 cases: |
<haskell> |
<haskell> |
||
Line 124: | Line 125: | ||
</haskell> |
</haskell> |
||
− | '''delta''' can be viewed as a parser that can accept a single character and delivers a new parser. |
+ | ''delta'' can be viewed as a parser that can accept a single character and delivers a new parser. |
− | '''delta''' fails, when the parser (the regular expression) is '''Zero''' or '''Unit'''. A '''Sym''' parser |
+ | ''delta'' fails, when the parser (the regular expression) is ''Zero'' or ''Unit''. A ''Sym'' parser |
− | checks the input against the required character and fails ('''Zero''') or results in '''Unit''' (the EOF parser). |
+ | checks the input against the required character and fails (''Zero'') or results in ''Unit'' (the EOF parser). |
− | A '''r*''' expression is expanded into a sequence '''r⋅r*''', and the input character |
+ | A ''r*'' expression is expanded into a sequence ''r⋅r*'', and the input character |
− | is parsed with the simple parser '''r'''. |
+ | is parsed with the simple parser ''r''. |
− | The most complicated rule is the rule for '''Seq r1 r2'''. There are two cases. The simple one is |
+ | The most complicated rule is the rule for ''Seq r1 r2''. There are two cases. The simple one is |
− | that '''r1''' is not '''nullable'''. In this case the input must be consumed by '''r1'''. In the second case |
+ | that ''r1'' is not ''nullable''. In this case the input must be consumed by ''r1''. In the second case |
− | ('''r1''' is '''nullable''') |
+ | (''r1'' is ''nullable'') |
− | there are to choices: The input could be consumed by '''r1''', but it also could be consumed by '''r2''' |
+ | there are to choices: The input could be consumed by ''r1'', but it also could be consumed by ''r2'' |
− | when '''r1''' only accepts the empty word. |
+ | when ''r1'' only accepts the empty word. |
− | The '''Alt r1 r2''' rule is defined such that both '''r1''' and '''r2''' are run in parallel. Here is the |
+ | The ''Alt r1 r2'' rule is defined such that both ''r1'' and ''r2'' are run in parallel. Here is the |
point where the nondeterminism is implemented. |
point where the nondeterminism is implemented. |
||
− | '''delta''' can easily be expanded on strings |
+ | ''delta'' can easily be expanded on strings |
<haskell> |
<haskell> |
||
Line 146: | Line 147: | ||
</haskell> |
</haskell> |
||
− | Combining '''deltaS''' with '''nullable''' gives us a simple matching function |
+ | Combining ''deltaS'' with ''nullable'' gives us a simple matching function |
<haskell> |
<haskell> |
||
− | match :: Regex -> String -> Bool |
+ | matchRE :: Regex -> String -> Bool |
− | match re = nullable . deltaS re |
+ | matchRE re = nullable . deltaS re |
</haskell> |
</haskell> |
||
Line 158: | Line 159: | ||
The above shown code runs, but it just runs for toy examples. When checking words |
The above shown code runs, but it just runs for toy examples. When checking words |
||
− | with this simple version of '''delta''' space and time can grow exponentially with the length of |
+ | with this simple version of ''delta'' space and time can grow exponentially with the length of |
− | the input. A simple example is the stepwise derivation of '''a*''' with '''a'''-s as input |
+ | the input. A simple example is the stepwise derivation of ''a*'' with ''a''-s as input |
<haskell> |
<haskell> |
||
Line 171: | Line 172: | ||
</haskell> |
</haskell> |
||
− | The problem here is that within '''delta''' the derivations become more complex. |
+ | The problem here is that within ''delta'' the derivations become more complex. |
− | This happens in two places: In the rule for '''Star''', where '''r*''' becomes '''r⋅r*''' |
+ | This happens in two places: In the rule for ''Star'', where ''r*'' becomes ''r⋅r*'' |
− | and in the '''Seq''' rule, when '''r1''' is nullable. In this case a choice is introduced. |
+ | and in the ''Seq'' rule, when ''r1'' is nullable. In this case a choice is introduced. |
The solution for this problem is, like in symbolic algebra systems, the introduction |
The solution for this problem is, like in symbolic algebra systems, the introduction |
||
− | of a simplification step after derivation. We easily see that '''Seq Unit r2''' is |
+ | of a simplification step after derivation. We easily see that ''Seq Unit r2'' is |
− | the same as '''r2'''. This rule is applicable in the 1. derivation. Furthermore |
+ | the same as ''r2''. This rule is applicable in the 1. derivation. Furthermore |
− | '''Seq Zero r2''' is equivalent to '''Zero''' (failure remains failure). A third effective rule |
+ | ''Seq Zero r2'' is equivalent to ''Zero'' (failure remains failure). A third effective rule |
− | is: '''Alt Zero r2''' equals '''r2'''. |
+ | is: ''Alt Zero r2'' equals ''r2''. |
These and some more simplification rules can be added by introducing smart constructors: |
These and some more simplification rules can be added by introducing smart constructors: |
||
Line 203: | Line 204: | ||
=== Implementation of other operators on regular sets and regular expressions === |
=== Implementation of other operators on regular sets and regular expressions === |
||
− | To check, whether a word '''w''' is in the union of two regular sets represented by expressions |
+ | To check, whether a word ''w'' is in the union of two regular sets represented by expressions |
− | '''r1''' and '''r2''', we apply '''delta''' to both '''r1''' and '''r2'''. The real test is then done within |
+ | ''r1'' and ''r2'', we apply ''delta'' to both ''r1'' and ''r2''. The real test is then done within |
− | '''nullable''' and there it's done by a simple logical OR operation. This observation leads to the |
+ | ''nullable'' and there it's done by a simple logical OR operation. This observation leads to the |
question, whether we could implement other binary operations on regular set. And indeed this can be done |
question, whether we could implement other binary operations on regular set. And indeed this can be done |
||
− | for all binary operations. A test whether a word '''w''' is in the intersection of two regular sets |
+ | for all binary operations. A test whether a word ''w'' is in the intersection of two regular sets |
− | can be added just by adding the operator to the '''Regex''' data type, add a rule for '''delta''' with |
+ | can be added just by adding the operator to the ''Regex'' data type, add a rule for ''delta'' with |
− | precisely the same structure as for '''Alt''' and add the predicate with the corresponding logical |
+ | precisely the same structure as for ''Alt'' and add the predicate with the corresponding logical |
− | operation to '''nullable'''. Here's the example for intersection: |
+ | operation to ''nullable''. Here's the example for intersection: |
<haskell> |
<haskell> |
||
Line 242: | Line 243: | ||
should not change. In the concrete syntax the curly braces are special symbols |
should not change. In the concrete syntax the curly braces are special symbols |
||
and they are only allowed in so called quantifiers, postfix operators that specify |
and they are only allowed in so called quantifiers, postfix operators that specify |
||
− | a kind of repetition, e..g. '''a{5,7}''' stand for 5,6 or 7 a's. |
+ | a kind of repetition, e..g. ''a{5,7}'' stand for 5,6 or 7 a's. |
− | So operators in curly braces, e.g. '''{&}''' for intersection, |
+ | So operators in curly braces, e.g. ''{&}'' for intersection, |
is an extension that does not conflict with the standard syntax. |
is an extension that does not conflict with the standard syntax. |
||
− | The following new operators are added: '''{:}''' for interleave, '''{&}''' for intersection '''{\}''' for set difference and |
+ | The following new operators are added: ''{:}'' for interleave, ''{&}'' for intersection ''{\}'' for set difference and |
− | '''{^}''' for exclusive or, here enumerated with decreasing priority. |
+ | ''{^}'' for exclusive or, here enumerated with decreasing priority. |
− | Examples: |
+ | There are two parsers, one for the standard XML Schema syntax and another one for the extended syntax |
+ | with the new operators. |
||
− | ; '''.*a.*{&}.*b.*''' all words containing at least one '''a''' and one '''b''' |
+ | Examples with the extended syntax: |
− | ; '''[a-z]+{\}bush''' all names but not '''bush''' |
+ | |
− | ; '''.*a.*{^}.*b.*''' all words containing at least one '''a''' or one '''b''' but not both an '''a''' and a '''b''' |
+ | ; ''.*a.*{&}.*b.*'' all words containing at least one ''a'' and one ''b'' |
− | ; '''aaa{:}bbb''' all 6 char long words containing 3 a's and 3 b's |
+ | ; ''[a-z]+{\}bush'' all names but not ''bush'' |
+ | ; ''.*a.*{^}.*b.*'' all words containing at least one ''a'' or one ''b'' but not both an ''a'' and a ''b'' |
||
+ | ; ''aaa{:}bbb'' all 6 char long words containing 3 a's and 3 b's |
||
A complement operator can be formulated by using the set difference operator. |
A complement operator can be formulated by using the set difference operator. |
||
− | The first attempt '''.*{\}bush''' (everything but not '''bush''') fails. The '''.''' in XML Schema syntax |
+ | The first attempt ''.*{\}bush'' (everything but not ''bush'') fails. The ''.'' in XML Schema syntax |
− | is a shortcut for every character but newline (\n) and carriage return (\r). So '''.|\n|\r''' is |
+ | is a shortcut for every character but newline (\n) and carriage return (\r). So ''.|\n|\r'' is |
− | the whole alphabet and '''(.|\n|\r)*''' all words over the alphabet. This makes the |
+ | the whole alphabet and ''(.|\n|\r)*'' all words over the alphabet. This makes the |
− | complement expression a bit clumsy: '''(.|\n|\r)*{\}bush'''. There are some character escape sequences |
+ | complement expression a bit clumsy: ''(.|\n|\r)*{\}bush''. There are some character escape sequences |
− | for character sets, e..g. '''\s''' for whitespace, '''\i''' for XML name start characters, '''\d''' for |
+ | for character sets, e..g. ''\s'' for whitespace, ''\i'' for XML name start characters, ''\d'' for |
− | digits and others. This list has been extended by '''\a''' for '''.|\n|\r''' and '''\A''' for '''\a*'''. |
+ | digits and others. This list has been extended by ''\a'' for ''.|\n|\r'' and ''\A'' for ''\a*''. |
− | With this new '''multiCharEsc''' spec the above complement expression becomes '''\A{\}bush'''. |
+ | With this new ''multiCharEsc'' spec the above complement expression becomes ''\A{\}bush''. |
=== Examples using the extended syntax === |
=== Examples using the extended syntax === |
||
==== Substitution of none greedy operators ==== |
==== Substitution of none greedy operators ==== |
||
+ | |||
+ | All following examples must be processed with ''matchExt''. The standard ''match'' does not support |
||
+ | the extensions. |
||
In Perl and other libraries there are so called none greedy repetition operators. |
In Perl and other libraries there are so called none greedy repetition operators. |
||
Line 272: | Line 276: | ||
none greedy expressions can be reformulated with the use of set difference. |
none greedy expressions can be reformulated with the use of set difference. |
||
A classical example is a regular expression for comments, which are delimited by character |
A classical example is a regular expression for comments, which are delimited by character |
||
− | sequences, like in C with '''/*''' and '''*/'''. The naive approach '''/[*].*[*]/''' does |
+ | sequences, like in C with ''/*'' and ''*/''. The naive approach ''/[*].*[*]/'' does |
− | not work. A word like '''/*abc*/123*/''' is not a C comment, but it matches the above given expression. |
+ | not work. A word like ''/*abc*/123*/'' is not a C comment, but it matches the above given expression. |
The solution with this library is an expression like the following: |
The solution with this library is an expression like the following: |
||
− | '''/[*](\A{\}(\A[*]/\A))[*]/''' |
+ | ''/[*](\A{\}(\A[*]/\A))[*]/'' |
in words: The contents of a C (multi line) comment is every word, that does not contain a subsequence |
in words: The contents of a C (multi line) comment is every word, that does not contain a subsequence |
||
− | of '''*/''' |
+ | of ''*/'' |
==== Identifiers except keywords ==== |
==== Identifiers except keywords ==== |
||
Line 290: | Line 294: | ||
A simple example: |
A simple example: |
||
− | '''[a-z][a-z0-9]*{\}(if|then|else|while|do)''' |
+ | ''[a-z][a-z0-9]*{\}(if|then|else|while|do)'' |
excludes the 5 keywords from the set of identifiers. |
excludes the 5 keywords from the set of identifiers. |
||
Line 296: | Line 300: | ||
==== Permutations ==== |
==== Permutations ==== |
||
− | With the use of the intersection operator '''{&}''' it is rather easy to formulate |
+ | With the use of the intersection operator ''{&}'' it is rather easy to formulate |
a regular expression for the permutations of a character set. |
a regular expression for the permutations of a character set. |
||
− | '''.*a.*{&}.*b.*{&}.*c.*{&}.{3}''' |
+ | ''.*a.*{&}.*b.*{&}.*c.*{&}.{3}'' |
− | is an expressions for all permutations of '''a''', '''b''' and '''c'''. |
+ | is an expressions for all permutations of ''a'', ''b'' and ''c''. |
The above expression for permutations can be formulated even simpler |
The above expression for permutations can be formulated even simpler |
||
by using the interleave operator. |
by using the interleave operator. |
||
− | '''a{:}b{:}c''' |
+ | ''a{:}b{:}c'' |
− | Given 2 words '''w1''' and '''w2''' matching the regular expressions |
+ | Given 2 words ''w1'' and ''w2'' matching the regular expressions |
− | '''r1''' respectively '''r2''', all words '''w''' constructed by |
+ | ''r1'' respectively ''r2'', all words ''w'' constructed by |
− | merging '''w1''' and '''w2''' match the regular expression '''r1{:}r2''' |
+ | merging ''w1'' and ''w2'' match the regular expression ''r1{:}r2'' |
The interleave operator is used when validating XML with RelaxNG Schema. |
The interleave operator is used when validating XML with RelaxNG Schema. |
||
Line 327: | Line 331: | ||
in case of success giving back the strings for the year, month and day: |
in case of success giving back the strings for the year, month and day: |
||
− | '''.*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*''' |
+ | ''.*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*'' |
− | The three pairs of parentheses are labeled '''y''', '''m''' and '''d'''. Let's see |
+ | The three pairs of parentheses are labeled ''y'', ''m'' and ''d''. Let's see |
what function we call with this pattern and how the result looks like. |
what function we call with this pattern and how the result looks like. |
||
− | We've already seen the '''match''' function giving a back a Boolean for the match result. |
+ | We've already seen the ''match'' function giving a back a Boolean for the match result. |
− | This is too less information in this case. We need an extended form of match called '''matchSubex''' |
+ | This is too less information in this case. We need an extended form of match called ''matchSubex'' |
with the following signature: |
with the following signature: |
||
Line 340: | Line 344: | ||
</haskell> |
</haskell> |
||
− | === The '''matchSubex''' function === |
+ | === The ''matchSubex'' function === |
− | Extracting the date from a single line of output from a Unix '''ls -l''' command can be done with the following |
+ | Extracting the date from a single line of output from a Unix ''ls -l'' command can be done with the following |
code: |
code: |
||
Line 355: | Line 359: | ||
</haskell> |
</haskell> |
||
− | The '''matchSubex''' function returns a list of label-value pairs for the subexpression matches. |
+ | The ''matchSubex'' function returns a list of label-value pairs for the subexpression matches. |
The label-value pairs occur in the same sequence as in the regular expression. If there is no match |
The label-value pairs occur in the same sequence as in the regular expression. If there is no match |
||
or there is no labeled subexpression the result is the empty list. |
or there is no labeled subexpression the result is the empty list. |
||
Line 383: | Line 387: | ||
</haskell> |
</haskell> |
||
− | The expression is '''(x+)*''' and |
+ | The expression is ''(x+)*'' and |
− | the '''x+''' is labeled with '''l'''. The match can be done in two ways. |
+ | the ''x+'' is labeled with ''l''. The match can be done in two ways. |
− | The first one is: apply the *-expression once and take '''xx''' as the matching subexpression, |
+ | The first one is: apply the *-expression once and take ''xx'' as the matching subexpression, |
− | the second is apply the *-expression two times and take tow times '''x''' as matching subexpressions. |
+ | the second is apply the *-expression two times and take tow times ''x'' as matching subexpressions. |
− | This gives 3 matches for label '''l'''. The situation becomes much more complicated for longer input. |
+ | This gives 3 matches for label ''l''. The situation becomes much more complicated for longer input. |
There is one extension for resolving nondeterministic results with subexpressions. |
There is one extension for resolving nondeterministic results with subexpressions. |
||
Line 402: | Line 406: | ||
</haskell> |
</haskell> |
||
− | In the 2. test we would like to detect the '''else''' as a keyword. This could be done |
+ | In the 2. test we would like to detect the ''else'' as a keyword. This could be done |
by subtracting all keywords from the subexpressions for name. but this make the expression |
by subtracting all keywords from the subexpressions for name. but this make the expression |
||
− | unnecessarily complicated. There is an operator '''{|}''' that works like set union, but for |
+ | unnecessarily complicated. There is an operator ''{|}'' that works like set union, but for |
subexpressions it's not symmetric. If the left hand side matches, only these results are taken |
subexpressions it's not symmetric. If the left hand side matches, only these results are taken |
||
the results from the right hand are ignored. So the above example can be rewritten to prioritize |
the results from the right hand are ignored. So the above example can be rewritten to prioritize |
||
Line 421: | Line 425: | ||
== Examples for editing == |
== Examples for editing == |
||
− | '''sed''' is a function for stream editing of substrings matching a regular expressions |
+ | ''sed'' is a function for stream editing of substrings matching a regular expressions |
<haskell> |
<haskell> |
||
− | sed :: (String -> String) -> String -> String -> String |
+ | sed :: (String -> String) -> String -> String -> String |
− | sed edit regex input = ... |
+ | sed edit regex input = ... |
+ | |||
+ | -- sed for extended R.E.s |
||
+ | sedExt :: (String -> String) -> String -> String -> String |
||
+ | sedExt edit regex input = ... |
||
-- examples |
-- examples |
||
Line 442: | Line 446: | ||
== Examples for tokenizing == |
== Examples for tokenizing == |
||
− | The '''tokenize''' function can be used for constructing simple tokenizers. |
+ | The ''tokenize'' function can be used for constructing simple tokenizers. |
It is recommended to use regular expressions where the empty word does not match. |
It is recommended to use regular expressions where the empty word does not match. |
||
Else there will appear a lot of probably useless empty tokens in the output. |
Else there will appear a lot of probably useless empty tokens in the output. |
||
Line 450: | Line 454: | ||
<haskell> |
<haskell> |
||
− | tokenize :: String -> String -> [String] |
+ | tokenize :: String -> String -> [String] |
− | tokenize regex string = ... |
+ | tokenize regex string = ... |
+ | |||
+ | -- extended form |
||
+ | tokenizeExt :: String -> String -> [String] |
||
+ | tokenizeExt regex string = ... |
||
-- example runs |
-- example runs |
||
Line 477: | Line 481: | ||
</haskell> |
</haskell> |
||
− | There are two more tokenizer functions, '''tokenize' ''' does not throw |
+ | There are two more tokenizer functions, ''tokenize' '' does not throw |
− | away the none matching characters, and '''tokenizeSubex''' works with labeled |
+ | away the none matching characters, and ''tokenizeSubex'' works with labeled |
subexpressions, so the resulting words can be labeled and can further be processed |
subexpressions, so the resulting words can be labeled and can further be processed |
||
with respect to that label. |
with respect to that label. |
||
Line 510: | Line 514: | ||
in the example run with 2^25 characters (about 33Mb). This file is copied by reading and writing the complete file |
in the example run with 2^25 characters (about 33Mb). This file is copied by reading and writing the complete file |
||
to get a figure about the time spend in IO. |
to get a figure about the time spend in IO. |
||
− | The second test reads in the file splits it up into lines with the predefined '''line''' function |
+ | The second test reads in the file splits it up into lines with the predefined ''line'' function |
and writes the lines out again. This is compared with a line function defined as |
and writes the lines out again. This is compared with a line function defined as |
||
− | '''tokenize ".*"'''. The same comparison is done with the built in '''words''' and |
+ | ''tokenize ".*"''. The same comparison is done with the built in ''words'' and |
− | '''tokenize "\\S+"'''. |
+ | ''tokenize "\\S+"''. |
The following runtimes in second where measured: |
The following runtimes in second where measured: |
Latest revision as of 16:46, 1 October 2010
[edit] 1 Regular expressions for XML Schema
The W3C XML Schema specification defines a language for regular expressions. This language is used in the XML Schema spec when defining the data type library part. The hxt-regex-xmlschema package contains a complete implementation of this spec. (old package name:regex-xmlschema) It is implemented with the technique of derivations of regular expression. Main features are full support of Unicode including all Unicode codeblocks and character properties, purely functional interface, extensions for intersection, set difference and exclusive OR of regular sets (regular expressions), extensions for subexpression matches, interface for matching, stream (sed) like editing and tokenizing. This library is part of the Haskell XML Toolbox (HXT), which is described in HXT: A gentle introduction to the Haskell XML Toolbox
[edit] 2 Motivation
When developing the RelaxNG schema validator in the Haskell XML Toolbox (HXT) there was the need for a complete regular expression matcher for the W3C XML Schema regular expression syntax. The string representation of basic data types in XML Schema as well as in the RelaxNG standard can be defined by regular expressions. The available Haskell libraries for processing regular expressions where not applicable for this task, e.g. they used other r.e. grammars or did not support Unicode.
When implementing the DTD validator and the RelaxNG validator in HXT, we used the rather old idea of derivations of regular expression. This is a technique to make the regular expressions operational in a direct way without the clumsy construction of a finite state machine. This worked fine for validating the content model of XML elements and the implementations were very compact and sufficiently efficient. So we could expect it to also work fine for Unicode.
The XML Schema grammar is well designed, so the transformation of this grammar in a parsec parser was straight forward, the interesting part was the internal data structure and it's processing. When completing the work for HXT it showed up, that this library is generally useful, not only for XML validation. And with some rather simple extensions it became possible to not only use this for matching strings, as required in HXT, but also for sed like stream editing and for easy construction of lightweight scanners and tokenizers. This was the motivation for doing a stand alone package of that regular expression library.
[edit] 3 Resources
[edit] 4 The idea of derivations of regular expressions
The idea of derivations of regular expression was developed by Janusz A. Brzozowski, Princeton Univ. in 1964. Goal was to perform the test whether a string w is a word of a given regular set (given by a regular expression r) by manipulating the (internal tree like representation of the) regular expression r.
The word test is based on two functions. The first, here called nullable checks, whether the empty word ε is contained in the regular set associated with r. This test can easily be done and it can be done efficiently.
Given an expression r and a single char x the second function, here called delta computes the so called derivative of r with respect to x. The derivative delta r x is again a regular expression. The following law (in Haskell like notation) must hold for delta r x:
match r (x:xs) <=> match (delta r x) xs
Brzozowski has shown that this derivative exists, and that it is simple to construct it.
Let's see how regular Expression can be modelled with Haskell data types and how nullable and delta work. In the example we will work with the fixed alphabet of Haskell chars.
data Regex = Zero -- {} | Unit -- {ε} | Sym Char -- {a} | Star Regex -- r* | Seq Regex Regex -- r1 . r2 | Alt Regex Regex -- r1 | r2
In this first version we use the minimal set of regular expressions: The empty set, the set containing ε and the single char sets form the primitive sets, Star is the repetition, Seq the concatenation and Alt the choice operator.
nullable is defined like this, so it's easy and efficient to compute this predicate:
nullable :: Regex -> Bool nullable Zero = False nullable Unit = True nullable (Sym a) = False nullable (Star r) = True nullable (Seq r1 r2) = nullable r1 && nullable r2 nullable (Alt r1 r2) = nullable r1 || nullable r2
We see for the three simple cases that only Unit is nullable, r* contains per definition the empty word. A sequence contains the empty word only in case where both parts are nullable, a union is nullable when at least one operand is nullable.
For delta we've again 6 cases:
delta :: Regex -> Char -> Regex delta Zero x = Zero delta Unit x = Zero delta (Sym y) x | x == y = Unit | otherwise = Zero delta (Star r) x = Seq (delta r x) (Star r) delta (Seq r1 r2) x | nullable r1 = Alt dr1 dr2 | otherwise = dr1 where dr1 = Seq (delta r1 x) r2 dr2 = delta r2 x delta (Alt r1 r2) x = Alt (delta r1 x) (delta r2 x)
delta can be viewed as a parser that can accept a single character and delivers a new parser. delta fails, when the parser (the regular expression) is Zero or Unit. A Sym parser checks the input against the required character and fails (Zero) or results in Unit (the EOF parser). A r* expression is expanded into a sequence r⋅r*, and the input character is parsed with the simple parser r.
The most complicated rule is the rule for Seq r1 r2. There are two cases. The simple one is that r1 is not nullable. In this case the input must be consumed by r1. In the second case (r1 is nullable) there are to choices: The input could be consumed by r1, but it also could be consumed by r2 when r1 only accepts the empty word. The Alt r1 r2 rule is defined such that both r1 and r2 are run in parallel. Here is the point where the nondeterminism is implemented.
delta can easily be expanded on strings
deltaS :: Regex -> String -> Regex deltaS = foldl delta
Combining deltaS with nullable gives us a simple matching function
matchRE :: Regex -> String -> Bool matchRE re = nullable . deltaS re
These are the essential facts from theory, but is this approach practically applicable?
[edit] 5 Making the theory practically applicable
The above shown code runs, but it just runs for toy examples. When checking words with this simple version of delta space and time can grow exponentially with the length of the input. A simple example is the stepwise derivation of a* with a-s as input
deltaS (Star (Sym 'a')) "a" = Seq Unit (Star (Sym 'a')) deltaS (Star (Sym 'a')) "aa" = Alt (Seq Zero (Star (Sym 'a'))) (Seq Unit (Star (Sym 'a'))) deltaS (Star (Sym 'a')) "aaa" = Alt (Seq Zero (Star (Sym 'a'))) (Alt (Seq Zero (Star (Sym 'a'))) (Seq Unit (Star (Sym 'a')))) ...
The problem here is that within delta the derivations become more complex. This happens in two places: In the rule for Star, where r* becomes r⋅r* and in the Seq rule, when r1 is nullable. In this case a choice is introduced.
The solution for this problem is, like in symbolic algebra systems, the introduction of a simplification step after derivation. We easily see that Seq Unit r2 is the same as r2. This rule is applicable in the 1. derivation. Furthermore Seq Zero r2 is equivalent to Zero (failure remains failure). A third effective rule is: Alt Zero r2 equals r2.
These and some more simplification rules can be added by introducing smart constructors:
mkSeq :: Regex -> Regex -> Regex mkSeq Zero r2 = Zero mkSeq r1 Zero = Zero mkSeq Unit r2 = r2 mkSeq r1 r2 = Seq r1 r2 mkAlt Zero r2 = r2 mkAlt r1 Zero = r1 mkAlt r1 r2 = Alt r1 r2
With these simplification rules the resulting regular expressions remain constant in size when successively derive an expression. Furthermore the simplification rules run in constant time. So space as well as runtime remains proportional to the length of the input.
[edit] 6 Extension: Operators for intersection, complement, set difference, exclusive or and interleave
[edit] 6.1 Implementation of other operators on regular sets and regular expressions
To check, whether a word w is in the union of two regular sets represented by expressions r1 and r2, we apply delta to both r1 and r2. The real test is then done within nullable and there it's done by a simple logical OR operation. This observation leads to the question, whether we could implement other binary operations on regular set. And indeed this can be done for all binary operations. A test whether a word w is in the intersection of two regular sets can be added just by adding the operator to the Regex data type, add a rule for delta with precisely the same structure as for Alt and add the predicate with the corresponding logical operation to nullable. Here's the example for intersection:
data Regex = ... | ... | Isect Regex Regex nullable (Isect r1 r2) = nullable r1 && nullable r2 delta (Isect r1 r2) x = mkIsect (delta r1 x) (delta r2 x) -- smart constructor with some simplification rules mkIsect Zero r2 = Zero mkIsect r1 Zero = Zero mkIsect Unit r2 = Unit mkIsect r1 Unit = Unit ... mkIsect r1 r2 = Isect r1 r2
So adding intersection, difference or other set operations are done by adding about 10 lines of code.
[edit] 6.2 Syntax extensions for new operators
When extending the XML Schema regular expression syntax, there was one leading principle: All legal regular expressions should remain correct and their semantics should not change. In the concrete syntax the curly braces are special symbols and they are only allowed in so called quantifiers, postfix operators that specify a kind of repetition, e..g. a{5,7} stand for 5,6 or 7 a's. So operators in curly braces, e.g. {&} for intersection, is an extension that does not conflict with the standard syntax. The following new operators are added: {:} for interleave, {&} for intersection {\} for set difference and {^} for exclusive or, here enumerated with decreasing priority.
There are two parsers, one for the standard XML Schema syntax and another one for the extended syntax with the new operators.
Examples with the extended syntax:
- .*a.*{&}.*b.* all words containing at least one a and one b
- [a-z]+{\}bush all names but not bush
- .*a.*{^}.*b.* all words containing at least one a or one b but not both an a and a b
- aaa{:}bbb all 6 char long words containing 3 a's and 3 b's
A complement operator can be formulated by using the set difference operator. The first attempt .*{\}bush (everything but not bush) fails. The . in XML Schema syntax is a shortcut for every character but newline (\n) and carriage return (\r). So .|\n|\r is the whole alphabet and (.|\n|\r)* all words over the alphabet. This makes the complement expression a bit clumsy: (.|\n|\r)*{\}bush. There are some character escape sequences for character sets, e..g. \s for whitespace, \i for XML name start characters, \d for digits and others. This list has been extended by \a for .|\n|\r and \A for \a*. With this new multiCharEsc spec the above complement expression becomes \A{\}bush.
[edit] 6.3 Examples using the extended syntax
[edit] 6.3.1 Substitution of none greedy operators
All following examples must be processed with matchExt. The standard match does not support the extensions.
In Perl and other libraries there are so called none greedy repetition operators. These are not present in the W3C XML Schema syntax. But for many real world examples these none greedy expressions can be reformulated with the use of set difference. A classical example is a regular expression for comments, which are delimited by character sequences, like in C with /* and */. The naive approach /[*].*[*]/ does not work. A word like /*abc*/123*/ is not a C comment, but it matches the above given expression.
The solution with this library is an expression like the following:
/[*](\A{\}(\A[*]/\A))[*]/
in words: The contents of a C (multi line) comment is every word, that does not contain a subsequence of */
[edit] 6.3.2 Identifiers except keywords
In most scanner specs, the regular expressions for names and keywords overlap and the sequence of the rules in the scanner spec becomes important for solving this ambiguity. With the set difference it becomes simple to exclude keywords from the regular expression for identifier.
A simple example:
[a-z][a-z0-9]*{\}(if|then|else|while|do)
excludes the 5 keywords from the set of identifiers.
[edit] 6.3.3 Permutations
With the use of the intersection operator {&} it is rather easy to formulate a regular expression for the permutations of a character set.
.*a.*{&}.*b.*{&}.*c.*{&}.{3}
is an expressions for all permutations of a, b and c.
The above expression for permutations can be formulated even simpler by using the interleave operator.
a{:}b{:}c
Given 2 words w1 and w2 matching the regular expressions r1 respectively r2, all words w constructed by merging w1 and w2 match the regular expression r1{:}r2
The interleave operator is used when validating XML with RelaxNG Schema. In RelaxNG the content model is described by a regular expression. But there it is allowed to specify two different content models for a single element and then mix theses content models together.
[edit] 7 Extension: Matching of subexpressions
This library supports matching of subexpressions like in Perl, but the syntax for subexpressions
is different. Labeling subexpressions is not done implicitly by counting and numbering the pairs of parentheses, but
the parentheses of interest can be labeled with a name. This is more flexible and less error prone when extending the regular expressions.
Here's an example for searching a date pattern in YYYY-MM-DD format in a line of text and in case of success giving back the strings for the year, month and day:
.*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*
The three pairs of parentheses are labeled y, m and d. Let's see what function we call with this pattern and how the result looks like. We've already seen the match function giving a back a Boolean for the match result. This is too less information in this case. We need an extended form of match called matchSubex with the following signature:
matchSubex :: String -> String -> [(String, String)] matchSubex re input = ...
[edit] 7.1 The matchSubex function
Extracting the date from a single line of output from a Unix ls -l command can be done with the following code:
getDate [(String,String)] -> Maybe (Int, Int, Int) getDate [("y",y),("m",m),("d",d)] = Just (read y, read m, read d) getDate _ = Nothing getDate . matchSubex ".*({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2}).*" $ "-rw-r--r-- 1 uwe users 2264 2008-11-19 15:36 Main.hs" => Just (2008,11,19)
The matchSubex function returns a list of label-value pairs for the subexpression matches. The label-value pairs occur in the same sequence as in the regular expression. If there is no match or there is no labeled subexpression the result is the empty list. Nesting of labeled subexpressions is possible. Examples:
matchSubex ".*({date}({y}[0-9]{4})-({m}[0-9]{2})-({d}[0-9]{2})).*" $ "-rw-r--r-- 1 uwe users 2264 2008-11-19 15:36 Main.hs" => [("date","2008-11-19"),("y","2008"),("m","11"),("d",19")]
[edit] 7.2 Matching with subexpressions and nondeterminism
When writing regular expressions with labeled subexpressions there are some traps, if this is not done carefully. When an expression is nondeterministic and contains labeled subexpressions, all matches for these subexpressions are computed. The number of matches can grow exponentially, so the runtime also can grow exponentially.
Here is a simple example demonstrating this situation.
matchSubex "(({l}x+))*"$ "xx" => [("l","xx"),("l","x"),("l","x")]
The expression is (x+)* and the x+ is labeled with l. The match can be done in two ways. The first one is: apply the *-expression once and take xx as the matching subexpression, the second is apply the *-expression two times and take tow times x as matching subexpressions. This gives 3 matches for label l. The situation becomes much more complicated for longer input.
There is one extension for resolving nondeterministic results with subexpressions. As an example, let's match a text with identifiers or keywords. This could be done as follows
matchSubex "({name}[a-z][a-z0-9]*)|({keyword}if|then|else|while|do)" "abc" => [("name","abc")] matchSubex "({name}[a-z][a-z0-9]*)|({keyword}if|then|else|while|do)" "else" => [("name","else"),("keyword", "else")]
In the 2. test we would like to detect the else as a keyword. This could be done by subtracting all keywords from the subexpressions for name. but this make the expression unnecessarily complicated. There is an operator {|} that works like set union, but for subexpressions it's not symmetric. If the left hand side matches, only these results are taken the results from the right hand are ignored. So the above example can be rewritten to prioritize the keywords as follows:
matchSubex "({keyword}if|then|else|while|do){|}({name}[a-z][a-z0-9]*)" "abc" => [("name","abc")] matchSubex "({keyword}if|then|else|while|do){|}({name}[a-z][a-z0-9]*)" "else" => [("keyword", "else")]
[edit] 8 Examples for editing
sed is a function for stream editing of substrings matching a regular expressions
sed :: (String -> String) -> String -> String -> String sed edit regex input = ... -- sed for extended R.E.s sedExt :: (String -> String) -> String -> String -> String sedExt edit regex input = ... -- examples sed (const "b") "a" "xaxax" => "xbxbx" sed (\ x -> x ++ x) "a" "xax" => "xaax" sed jandl "l|r" "left or right" => "reft ol light" -- with jandl "l" = "r" jandl "r" = "l"
[edit] 9 Examples for tokenizing
The tokenize function can be used for constructing simple tokenizers. It is recommended to use regular expressions where the empty word does not match. Else there will appear a lot of probably useless empty tokens in the output. All none matching chars are discarded.
Here are some test cases:
tokenize :: String -> String -> [String] tokenize regex string = ... -- extended form tokenizeExt :: String -> String -> [String] tokenizeExt regex string = ... -- example runs tokenize "a" "aabba" => ["a","a","a"] tokenize "a*" "aaaba" => ["aaa","a"] tokenize "a*" "bbb" => ["","",""] tokenize "a+" "bbb" => [] tokenize "[a-z]{2,}|[0-9]{2,}|[0-9]+[.][0-9]+" "ab123 456.7abc" => ["ab","123","456.7","abc"] tokenize "[^ \t\n\r]*" "abc def\t\n\rxyz" => ["abc","def","xyz"] tokenize ".*" "\nabc\n123\n\nxyz\n" = ["","abc","123","","xyz"] tokenize ".*" = lines tokenize "[^ \t\n\r]*" = words
There are two more tokenizer functions, tokenize' does not throw away the none matching characters, and tokenizeSubex works with labeled subexpressions, so the resulting words can be labeled and can further be processed with respect to that label.
tokenizeSubex ( "({keyword}if|then|else|while|do)" ++ "{|}" ++ "({name}[a-z][a-z0-9]*)" ++ "|" ++ "({num}[0-9]+)" ++ "|" ++ "({op}==|/=|:=|[+])" ) "if abc /= 42 then abc := 42" => [ ("keyword", "if" ) , ("name", "abc" ) , ("op", "/=" ) , ("num", "42" ) , ("keyword", "then" ) , ("name", "abc" ) , ("op", ":=" ) , ("num", "42" ) ]
[edit] 10 Performance
A simple performance test shows that it's possible to process even large data rather efficiently. The simple test does the following: A large text file is generated, in the example run with 2^25 characters (about 33Mb). This file is copied by reading and writing the complete file to get a figure about the time spend in IO. The second test reads in the file splits it up into lines with the predefined line function and writes the lines out again. This is compared with a line function defined as tokenize ".*". The same comparison is done with the built in words and tokenize "\\S+".
The following runtimes in second where measured:
- 1.33 copy file
- 8.48 split into lines with lines from prelude
- 10.93 split into lines with regex tokenize
- 20.44 split into words with words from prelude
- 39.75 split into words with regex tokenize
So the overhead introduced by repeated computation of derivation is not too bad. Of course this throughput can not be expected by more complex regular expressions.