[Haskell-cafe] Regular Expressions without GADTs

Conor McBride ctm at cs.nott.ac.uk
Wed Oct 19 10:17:41 EDT 2005


oleg at pobox.com wrote:

>>I suspect that once you start producing values with the
>>relevant properties (as I do) rather than just consuming them (as Oleg
>>and Bruno do), the GADT method might work out a little neater.
>>    
>>
>
>Actually, the code is pretty much your original code, with downcased
>identifiers. It faithfully implements that parser division
>approach. Still, there are no existentials.
>

That's because this isn't quite my program either, although I'll grant 
you it's a lot closer. Give me back the uppercase identifiers and you've 
pretty much done it. That's to say, I want a datatype of regular 
expressions which happens (at least) to support parsing, rather than a 
parser library which supports (at least) regular languages.

We can leave it to others to decide which is neater, but I'd remark that...

> I wouldn't say that GADT
>code is so much different. Perhaps the code below is a bit neater due
>to the absence of existentials, `case' statements, and local type
>annotations.
>  
>

...I'm also irritated by the case statements, the local type 
annotations, and so on. I was rather surprised to find that I needed 
them. I bet it's not an essential or permanent problem, and it's 
certainly a problem we have planned not to have in Epigram.

The existential also bothers me, but for different reasons. In the 
dependently typed version, RegExp isn't a GADT: it's just an ordinary 
datatype, and there's a perfectly ordinary function, Sem :: RegExp -> *, 
which computes the type corresponding to a regular expression. The 
syntactic part of division is thus completely separated from the 
semantic part (glueing the head back on). The GADT version here replaces 
Sem by an index in the datatype, so sometimes I'm forced to write an 
existential in order to collect what would have been the result of 
applying Sem. You've avoided the existential by throwing away the syntax 
and keeping only the specific empty-and-divide semantics required for 
this particular example. In effect

 * I define regular expressions and show they possess at least the 
empty-and-divide semantics;
 * You define the empty-and-divide semantics and show that it's closed 
under at least the regular expression formers (nicely expressed via 
Bruno's class)

Let's see...

Here's the class of regular expression algebras:

>-- Bruno.Oliveira's type class
>class RegExp g where
>    zero   :: g tok Zero
>    one    :: g tok ()
>    check  :: (tok -> Bool) -> g tok tok
>    plus   :: g tok a -> g tok b -> g tok (Either a b)
>    mult   :: g tok a -> g tok b -> g tok (a,b)
>    star   :: g tok a -> g tok [a]
>  
>

Here's the specification of the intended semantics:

>data Parse tok t = Parse { empty :: Maybe t
>			 , divide :: tok -> Parse tok t}
>  
>

Here's where you explain how stuff other than regular expressions also 
has the semantics. This is how the existential is made to disappear: you 
use liftP to fuse the head-glueing function into the parsers themselves. 
It's something like adding

  Map :: (s -> t) -> RegExp tok s -> RegExp tok t

to the syntax of regular expressions, isn't it?

>liftP f p = Parse{empty  = liftM f (empty p),
>		  divide = \tok -> liftP f (divide p tok)}
>  
>

And then you show that indeed the desired semantics is closed under 
RegExp formation.

>instance RegExp Parse  where
>  
>

It's a classic instance of the Expression Problem. If you fix the 
elimination behaviour, it's easy to add new introduction behaviour; if 
you fix the introduction behaviour, it's easy to add new elimination 
behaviour. I think it's important to support both modes of working: type 
classes are good at the former, datatypes the latter, so let's be open 
to both. There's always a danger with small examples that there's not so 
much to choose between the two approaches---except that they scale up in 
completely different ways. It's important not to get into a sterile 
argument 'Oh, but I can extend the grammar', 'Oh, but I can extend the 
functionality', as if only one of these is ever important. I will 
happily cheer anyone who can help us to be lighter on our feet in 
jumping between the two.

I'm not saying there's no value to 'poor man's' alternatives to GADTs. I 
certainly agree with Bruno
  (a) that it's important to know how to work within the standard; my 
experiences implementing the first version of Epigram have convinced me 
to be much more cautious about which Haskell extensions I'm willing to 
use in the second
  (b) that the pattern 'show type constructor f has such-an-such an 
algebra' is useful and worth abstracting.

I was just about to send, when up popped Ross's program, which does give 
me a datatype, including just enough extra stuff to support functoriality.

>> data RegExp tok a
>>	= Zero
>>	| One a
>>	| Check (tok -> a) (tok -> Bool)
>>	| Plus (RegExp tok a) (RegExp tok a)
>>	| forall b. Mult (RegExp tok (b -> a)) (RegExp tok b)
>>	| forall e. Star ([e] -> a) (RegExp tok e)
>  
>

This combines several dodges rather neatly. You can see the traces of 
the translation from GADT to nested-datatype-with-equations, but instead 
of saying forall e. Star (Eq [e] a) (RegExp tok e), Ross has kept just 
the half of the iso he needs in order to extract the parsed value. 
Throwing away the other half more-or-less allows him to hide the 
head-glueing functions inside the grammar of the regular expressions. In 
effect, Map has been distributed through the syntax. It's nice that the 
syntax of regular expressions survives, but it has been somewhat spun in 
favour of the functionality required in the example. Of the fakes so 
far, I like this the best, because it is *data*.

I would say that the availability of 'this workaround for this example, 
that workaround for that example', is evidence in *favour* of adopting 
the general tools which are designed to support many examples. If a 
job's worth doing, it's worth doing well. Is there a *uniform* 
workaround which delivers all the GADT functionality cheaply and 
cleanly? Kind of subjective, I suppose, but I haven't seen anything 
cheap enough for me.

All the best

Conor


More information about the Haskell-Cafe mailing list