2.5. Monadic Parsers

Happy has support for threading a monad through the generated parser. This might be useful for several reasons:

Adding monadic support to your parser couldn't be simpler. Just add the following directive to the declaration section of the grammar file:

%monad { <type> } [ { <then> } { <return> } ]

where <type> is the type constructor for the monad, <then> is the bind operation of the monad, and <return> is the return operation. If you leave out the names for the bind and return operations, Happy assumes that <type> is an instance of the standard Haskell type class Monad and uses the overloaded names for the bind and return operations.

When this declaration is included in the grammar, Happy makes a couple of changes to the generated parser: the types of the main parser function and parseError (the function named in %error) become [Token] -> P a where P is the monad type constructor, and the function must be polymorphic in a. In other words, Happy adds an application of the <return> operation defined in the declaration above, around the result of the parser (parseError is affected because it must have the same return type as the parser). And that's all it does.

This still isn't very useful: all you can do is return something of monadic type from parseError. How do you specify that the productions can also have type P a? Most of the time, you don't want a production to have this type: you'd have to write explicit returnPs everywhere. However, there may be a few rules in a grammar that need to get at the monad, so Happy has a special syntax for monadic actions:

n  :  t_1 ... t_n          {% <expr> }

The % in the action indicates that this is a monadic action, with type P a, where a is the real return type of the production. When Happy reduces one of these rules, it evaluates the expression

<expr> `then` \result -> <continue parsing>

Happy uses result as the real semantic value of the production. During parsing, several monadic actions might be reduced, resulting in a sequence like

<expr1> `then` \r1 ->
<expr2> `then` \r2 ->
...
return <expr3>

The monadic actions are performed in the order that they are reduced. If we consider the parse as a tree, then reductions happen in a depth-first left-to-right manner. The great thing about adding a monad to your parser is that it doesn't impose any performance overhead for normal reductions - only the monadic ones are translated like this.

Take a look at the Haskell parser for a good illustration of how to use a monad in your parser: it contains examples of all the principles discussed in this section, namely parse errors, a threaded lexer, line/column numbers, and state communication between the parser and lexer.

The following sections consider a couple of uses for monadic parsers, and describe how to also thread the monad through the lexical analyser.

2.5.1. Handling Parse Errors

It's not very convenient to just call error when a parse error is detected: in a robust setting, you'd like the program to recover gracefully and report a useful error message to the user. Exceptions (of which errors are a special case) are normally implemented in Haskell by using an exception monad, something like:

data E a = Ok a | Failed String

thenE :: E a -> (a -> E b) -> E b
m `thenE` k = 
   case m of 
       Ok a -> k a
	 Failed e -> Failed e

returnE :: a -> E a
returnE a = Ok a

failE :: String -> E a
failE err = Failed err

catchE :: E a -> (String -> E a) -> E a
catchE m k = 
   case m of
      Ok a -> OK a
	Failed e -> k e

This monad just uses a string as the error type. The functions thenE and returnE are the usual bind and return operations of the monad, failE raises an error, and catchE is a combinator for handling exceptions.

We can add this monad to the parser with the declaration

%monad { E } { thenE } { returnE }

Now, without changing the grammar, we can change the definition of parseError and have something sensible happen for a parse error:

parseError tokens = failE "Parse error"

The parser now raises an exception in the monad instead of bombing out on a parse error.

We can also generate errors during parsing. There are times when it is more convenient to parse a more general language than that which is actually intended, and check it later. An example comes from Haskell, where the precedence values in infix declarations must be between 0 and 9:

prec :: { Int }
      : int    {% if $1 < 0 || $1 > 9 
	                then failE "Precedence out of range"
		        else returnE $1
		}

The monadic action allows the check to be placed in the parser itself, where it belongs.

2.5.2. Threaded Lexers

Happy allows the monad concept to be extended to the lexical analyser, too. This has several useful consequences:

  • Lexical errors can be treated in the same way as parse errors, using an exception monad.

  • Information such as the current file and line number can be communicated between the lexer and parser.

  • General state communication between the parser and lexer - for example, implementation of the Haskell layout rule requires this kind of interaction.

  • IO operations can be performed in the lexer - this could be useful for following import/include declarations for instance.

A monadic lexer is requested by adding the following declaration to the grammar file:

%lexer { <lexer> } { <eof> }

where <lexer> is the name of the lexical analyser function, and <eof> is a token that is to be treated as the end of file.

When using a monadic lexer, the parser no longer reads a list of tokens. Instead, it calls the lexical analysis function for each new token to be read. This has the side effect of eliminating the intermediate list of tokens, which is a slight performance win.

The type of the main parser function is now just P a - the input is being handled completely within the monad.

The type of parseError becomes Token -> P a; that is it takes Happy's current lookahead token as input. This can be useful, because the error function probably wants to report the token at which the parse error occurred, and otherwise the lexer would have to store this token in the monad.

The lexical analysis function must have the following type:

lexer :: (Token -> P a) -> P a

where P is the monad type constructor declared with %monad, and a can be replaced by the parser return type if desired.

You can see from this type that the lexer takes a continuation as an argument. The lexer is to find the next token, and pass it to this continuation to carry on with the parse. Obviously, we need to keep track of the input in the monad somehow, so that the lexer can do something different each time it's called!

Let's take the exception monad above, and extend it to add the input string so that we can use it with a threaded lexer.

data ParseResult a = Ok a | Failed String
type P a = String -> ParseResult a

thenP :: P a -> (a -> P b) -> P b
m `thenP` k = \s ->
   case m s of 
       Ok a -> k a s
	 Failed e -> Failed e

returnP :: a -> P a
returnP a = \s -> Ok a

failP :: String -> P a
failP err = \s -> Failed err

catchP :: P a -> (String -> P a) -> P a
catchP m k = \s ->
   case m s of
      Ok a -> OK a
	Failed e -> k e s

Notice that this isn't a real state monad - the input string just gets passed around, not returned. Our lexer will now look something like this:

lexer :: (Token -> P a) -> P a
lexer cont s = 
    ... lexical analysis code ...
    cont token s'

the lexer grabs the continuation and the input string, finds the next token token, and passes it together with the remaining input string s' to the continuation.

We can now indicate lexical errors by ignoring the continuation and calling failP "error message" s within the lexer (don't forget to pass the input string to make the types work out).

This may all seem a bit weird. Why, you ask, doesn't the lexer just have type P Token? It was done this way for performance reasons - this formulation sometimes means that you can use a reader monad instead of a state monad for P, and the reader monad might be faster. It's not at all clear that this reasoning still holds (or indeed ever held), and it's entirely possible that the use of a continuation here is just a misfeature.

If you want a lexer of type P Token, then just define a wrapper to deal with the continuation:

lexwrap :: (Token -> P a) -> P a
lexwrap cont = real_lexer `thenP` \token -> cont token

2.5.2.1. Monadic productions with %lexer

The {% ... } actions work fine with %lexer, but additionally there are two more forms which are useful in certain cases. Firstly:

n  :  t_1 ... t_n          {%^ <expr> }

In this case, <expr> has type Token -> P a. That is, Happy passes the current lookahead token to the monadic action <expr>. This is a useful way to get hold of Happy's current lookahead token without having to store it in the monad.

n  :  t_1 ... t_n          {%% <expr> }

This is a slight variant on the previous form. The type of <expr> is the same, but in this case the lookahead token is actually discarded and a new token is read from the input. This can be useful when you want to change the next token and continue parsing.

2.5.3. Line Numbers

Previous versions of Happy had a %newline directive that enabled simple line numbers to be counted by the parser and referenced in the actions. We warned you that this facility may go away and be replaced by something more general, well guess what? :-)

Line numbers can now be dealt with quite straightforwardly using a monadic parser/lexer combination. Ok, we have to extend the monad a bit more:

type LineNumber = Int
type P a = String -> LineNumber -> ParseResult a

getLineNo :: P LineNumber
getLineNo = \s l -> Ok l

(the rest of the functions in the monad follow by just adding the extra line number argument in the same way as the input string). Again, the line number is just passed down, not returned: this is OK because of the continuation-based lexer that can change the line number and pass the new one to the continuation.

The lexer can now update the line number as follows:

lexer cont s =
  case s of
     '\n':s  ->  \line -> lexer cont s (line + 1)
     ... rest of lexical analysis ...

It's as simple as that. Take a look at Happy's own parser if you have the sources lying around, it uses a monad just like the one above.

Reporting the line number of a parse error is achieved by changing parseError to look something like this:

parseError :: Token -> P a
parseError = getLineNo `thenP` \line -> 
             failP (show line ++ ": parse error")

We can also get hold of the line number during parsing, to put it in the parsed data structure for future reference. A good way to do this is to have a production in the grammar that returns the current line number:

lineno :: { LineNumber }
        : {- empty -}      {% getLineNo }

The semantic value of lineno is the line number of the last token read - this will always be the token directly following the lineno symbol in the grammar, since Happy always keeps one lookahead token in reserve.

2.5.4. Summary

The types of various functions related to the parser are dependent on what combination of %monad and %lexer directives are present in the grammar. For reference, we list those types here. In the following types, t is the return type of the parser. A type containing a type variable indicates that the specified function must be polymorphic.

  • No %monad or %lexer

    parse      :: [Token] -> t
    parseError :: [Token] -> a
    

  • with %monad

    parse      :: [Token] -> P t
    parseError :: [Token] -> P a
    

  • with %lexer

    parse      :: T t
    parseError :: Token -> T a
    lexer      :: (Token -> T a) -> T a
    

    where the type constructor T is whatever you want (usually T a = String -> a. I'm not sure if this is useful, or even if it works properly.

  • with %monad and %lexer

    parse      :: P t
    parseError :: Token -> P a
    lexer      :: (Token -> P a) -> P a