3.3. Including semantic results

This section discusses the options for including semantic information in grammars.

3.3.1. Forms of semantics

Semantic information may be attached to productions in the conventional way, but when more than one analysis is possible, the use of the semantic information must change. Two schemes have been implemented, which we call tree decoding and label decoding. The former is for simple applications, where there is not much ambiguity and hence where the effective unpacking of the parse forest isn't a factor. This mode is quite similar to the standard mode in Happy. The latter is for serious applications, where sharing is important and where processing of the forest (eg filtering) is needed. Here, the emphasis is about providing rich labels in nodes of the the parse forest, to support such processing.

The default mode is labelling. If you want the tree decode mode, use the --decode flag.

3.3.2. Tree decoding

Tree decoding corresponds to unpacking the parse forest to individual trees and collecting the list of semantic results computed from each of these. It is a mode intended for simple applications, where there is limited ambiguity. You may access semantic results from components of a reduction using the dollar variables. As a working example, the following is taken from the expr-tree grammar in the examples. Note that the type signature is required, else the types in use can't be determined by the parser generator.

	E :: {Int}		-- type signature needed
	  : E '+' E  { $1 + $3 }
	  | E '*' E  { $1 * $3 }
	  | i        { $1 }

This mode works by converting each of the semantic rules into functions (abstracted over the dollar variables mentioned), and labelling each Branch created from a reduction of that rule with the function value. This amounts to delaying the action of the rule, since we must wait until we know the results of all of the sub-analyses before computing any of the results. (Certain cases of packing can add new analyses at a later stage.)

At the end of parsing, the functions are applied across relevant sub-analyses via a recursive descent. The main interface to this is via the class and entry function below. Typically, decode should be called on the root of the forest, also supplying a function which maps node names to their list of analyses (typically a partial application of lookup in the forest value). The result is a list of semantic values. Note that the context of the call to decode should (eventually) supply a concrete type to allow selection of appropriate instance. Ie, you have to indicate in some way what type the semantic result should have. Decode_Result a is a synonym generated by Happy: for non-monadic semantics, it is equivalent to a; when monads are in use, it becomes the declared monad type. See the full expr-eval example for more information.

	class TreeDecode a where
	       decode_b :: (ForestId -> [Branch]) -> Branch -> [Decode_Result a]
	decode :: TreeDecode a => (ForestId -> [Branch]) -> ForestId -> [Decode_Result a]

The GLR parser generator identifies the types involved in each semantic rule, hence the types of the functions, then creates a union containing distinct types. Values of this union are stored in the branches. (The union is actually a bit more complex: it must also distinguish patterns of dollar-variable usage, eg a function \x y -> x + y could be applied to the first and second constituents, or to the first and third.) The parser generator also creates instances of the TreeDecode class, which unpacks the semantic function and applies it across the decodings of the possible combinations of children. Effectively, it does a cartesian product operation across the lists of semantic results from each of the children. Eg [1,2] "+" [3,4] produces [4,5,5,6]. Information is extracted from token values using the patterns supplied by the user when declaring tokens and their Haskell representation, so the dollar-dollar convention works also.

The decoding process could be made more efficient by using memoisation techniques, but this hasn't been implemented since we believe the other (label) decoding mode is more useful. (If someone sends in a patch, we may include it in a future release -- but this might be tricky, eg require higher-order polymorphism? Plus, are there other ways of using this form of semantic function?)

3.3.3. Label decoding

The labelling mode aims to label branches in the forest with information that supports subsequent processing, for example the filtering and prioritisation of analyses prior to extraction of favoured solutions. As above, code fragments are given in braces and can contain dollar-variables. But these variables are expanded to node names in the graph, with the intention of easing navigation. The following grammar is from the expr-tree example.

	E :: {Tree ForestId Int}
	  : E '+' E      { Plus  $1 $3 }
	  | E '*' E      { Times $1 $3 }
	  | i            { Const $1 }

Here, the semantic values provide more meaningful labels than the plain structural information. In particular, only the interesting parts of the branch are represented, and the programmer can clearly select or label the useful constituents if required. There is no need to remember that it is the first and third child in the branch which we need to extract, because the label only contains those values (the `noise' has been dropped). Consider also the difference between concrete and abstract syntax. The labels are oriented towards abstract syntax. Tokens are handled slightly differently here: when they appear as children in a reduction, their informational content can be extracted directly, hence the Const value above will be built with the Int value from the token, not some ForestId.

Note the useful technique of making the label types polymorphic in the position used for forest indices. This allows replacement at a later stage with more appropriate values, eg. inserting lists of actual subtrees from the final decoding.

Use of these labels is supported by a type class LabelDecode, which unpacks values of the automatically-generated union type GSem to the original type(s). The parser generator will create appropriate instances of this class, based on the type information in the grammar file. (Note that omitting type information leads to a default of ().) Observe that use of the labels is often like traversing an abstract syntax, and the structure of the abstract syntax type usually constrains the types of constituents; so once the overall type is fixed (eg. with a type cast or signature) then there are no problems with resolution of class instances.

	class LabelDecode a where
	       unpack :: GSem -> a

Internally, the semantic values are packed in a union type as before, but there is no direct abstraction step. Instead, the ForestId values (from the dollar-variables) are bound when the corresponding branch is created from the list of constituent nodes. At this stage, token information is also extracted, using the patterns supplied by the user when declaring the tokens.

3.3.4. Monadic tree decoding

You can use the %monad directive in the tree-decode mode. Essentially, the decoding process now creates a list of monadic values, using the monad type declared in the directive. The default handling of the semantic functions is to apply the relevant return function to the value being returned. You can over-ride this using the {% ... } convention. The declared (>>=) function is used to assemble the computations.

Note that no attempt is made to share the results of monadic computations from sub-trees. (You could possibly do this by supplying a memoising lookup function for the decoding process.) Hence, the usual behaviour is that decoding produces whole monadic computations, each part of which is computed afresh (in depth-first order) when the whole is computed. Hence you should take care to initialise any relevant state before computing the results from multiple solutions.

This facility is experimental, and we welcome comments or observations on the approach taken! An example is provided (examples/glr/expr-monad). It is the standard example of arithmetic expressions, except that the IO monad is used, and a user exception is thrown when the second argument to addition is an odd number. Running this example will show a zero (from the exception handler) instead of the expected number amongst the results from the other parses.