These notational conventions are used for presenting syntax:
[pattern] | optional |
{pattern} | zero or more repetitions |
(pattern) | grouping |
pat1 | pat2 | choice |
pat⟨pat′⟩ | difference—elements generated by pat |
except those generated by pat′ |
|
fibonacci | terminal syntax in typewriter font |
BNF-like syntax is used throughout, with productions having the form:
nonterm | → | alt1 | alt2 | … | altn |
In both the lexical and the context-free syntax, there are some ambiguities that are to be resolved by making grammatical phrases as long as possible, proceeding from left to right (in shift-reduce parsing, resolving shift/reduce conflicts by shifting). In the lexical syntax, this is the “maximal munch” rule. In the context-free syntax, this means that conditionals, let-expressions, and lambda abstractions extend to the right as far as possible.
program | → | { lexeme | whitespace } |
lexeme | → | qvarid | qconid | qvarsym | qconsym |
| | literal | special | reservedop | reservedid | |
literal | → | integer | float | char | string |
special | → | ( | ) | , | ; | [ | ] | ` | { | } |
whitespace | → | whitestuff {whitestuff} |
whitestuff | → | whitechar | comment | ncomment |
whitechar | → | newline | vertab | space | tab | uniWhite |
newline | → | return linefeed | return | linefeed | formfeed |
return | → | a carriage return |
linefeed | → | a line feed |
vertab | → | a vertical tab |
formfeed | → | a form feed |
space | → | a space |
tab | → | a horizontal tab |
uniWhite | → | any Unicode character defined as whitespace |
comment | → | dashes [ any⟨symbol⟩ {any} ] newline |
dashes | → | -- {-} |
opencom | → | {- |
closecom | → | -} |
ncomment | → | opencom ANY seq {ncomment ANY seq} closecom |
ANY seq | → | {ANY }⟨{ANY } ( opencom | closecom ) {ANY }⟩ |
ANY | → | graphic | whitechar |
any | → | graphic | space | tab |
graphic | → | small | large | symbol | digit | special | " | ' |
small | → | ascSmall | uniSmall | _ |
ascSmall | → | a | b | … | z |
uniSmall | → | any Unicode lowercase letter |
large | → | ascLarge | uniLarge |
ascLarge | → | A | B | … | Z |
uniLarge | → | any uppercase or titlecase Unicode letter |
symbol | → | ascSymbol | uniSymbol⟨special | _ | " | '⟩ |
ascSymbol | → | ! | # | $ | % | & | ⋆ | + | . | / | < | = | > | ? | @ |
| | \ | ^ | | | - | ~ | : | |
uniSymbol | → | any Unicode symbol or punctuation |
digit | → | ascDigit | uniDigit |
ascDigit | → | 0 | 1 | … | 9 |
uniDigit | → | any Unicode decimal digit |
octit | → | 0 | 1 | … | 7 |
hexit | → | digit | A | … | F | a | … | f |
varid | → | (small {small | large | digit | ' })⟨reservedid⟩ | |
conid | → | large {small | large | digit | ' } | |
reservedid | → | case | class | data | default | deriving | do | else | |
| | foreign | if | import | in | infix | infixl | ||
| | infixr | instance | let | module | newtype | of | ||
| | then | type | where | _ | ||
varsym | → | ( symbol⟨:⟩ {symbol} )⟨reservedop | dashes⟩ | |
consym | → | ( : {symbol})⟨reservedop⟩ | |
reservedop | → | .. | : | :: | = | \ | | | <- | -> | @ | ~ | => | |
varid | (variables) | ||
conid | (constructors) | ||
tyvar | → | varid | (type variables) |
tycon | → | conid | (type constructors) |
tycls | → | conid | (type classes) |
modid | → | {conid .} conid | (modules) |
qvarid | → | [ modid . ] varid | |
qconid | → | [ modid . ] conid | |
qtycon | → | [ modid . ] tycon | |
qtycls | → | [ modid . ] tycls | |
qvarsym | → | [ modid . ] varsym | |
qconsym | → | [ modid . ] consym | |
decimal | → | digit{digit} | |
octal | → | octit{octit} | |
hexadecimal | → | hexit{hexit} | |
integer | → | decimal | |
| | 0o octal | 0O octal | ||
| | 0x hexadecimal | 0X hexadecimal | ||
float | → | decimal . decimal [exponent] | |
| | decimal exponent | ||
exponent | → | (e | E) [+ | -] decimal | |
char | → | ' (graphic⟨' | \⟩ | space | escape⟨\&⟩) ' | |
string | → | " {graphic⟨" | \⟩ | space | escape | gap} " | |
escape | → | \ ( charesc | ascii | decimal | o octal | x hexadecimal ) | |
charesc | → | a | b | f | n | r | t | v | \ | " | ' | & | |
ascii | → | ^cntrl | NUL | SOH | STX | ETX | EOT | ENQ | ACK | |
| | BEL | BS | HT | LF | VT | FF | CR | SO | SI | DLE | ||
| | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | ||
| | EM | SUB | ESC | FS | GS | RS | US | SP | DEL | ||
cntrl | → | ascLarge | @ | [ | \ | ] | ^ | _ | |
gap | → | \ whitechar {whitechar} \ |
Section 2.7 gives an informal discussion of the layout rule. This section defines it more precisely.
The meaning of a Haskell program may depend on its layout. The effect of layout on its meaning can be completely described by adding braces and semicolons in places determined by the layout. The meaning of this augmented program is now layout insensitive.
The effect of layout is specified in this section by describing how to add braces and semicolons to a laid-out program. The specification takes the form of a function L that performs the translation. The input to L is:
There is no < n > inserted before the \Bill, because it is not the beginning of a complete lexeme; nor before the ,, because it is not preceded only by white space.)
The “indentation” of a lexeme is the column number of the first character of that lexeme; the indentation of a line is the indentation of its leftmost lexeme. To determine the column number, assume a fixed-width font with the following conventions:
For the purposes of the layout rule, Unicode characters in a source program are considered to be of the same, fixed, width as an ASCII character. However, to avoid visual confusion, programmers should avoid writing programs in which the meaning of implicit layout depends on the width of non-space characters.
The application L tokens [] delivers a layout-insensitive translation of tokens, where tokens is the result of lexically analysing a module and adding column-number indicators to it as described above. The definition of L is as follows, where we use “:” as a stream construction operator, and “[]” for the empty stream.
L (< n >: ts) (m : ms) | = | ; : (L ts (m : ms)) | if m = n |
= | } : (L (< n >: ts) ms) | if n < m |
|
L (< n >: ts) ms | = | L ts ms |
|
L ({n} : ts) (m : ms) | = | { : (L ts (n : m : ms)) | if n > m (Note 1) |
L ({n} : ts) [] | = | { : (L ts [n]) | if n > 0 (Note 1) |
L ({n} : ts) ms | = | { : } : (L (< n >: ts) ms) | (Note 2) |
L (} : ts) (0 : ms) | = | } : (L ts ms) | (Note 3) |
L (} : ts) ms | = | parse-error | (Note 3) |
L ({ : ts) ms | = | { : (L ts (0 : ms)) | (Note 4) |
L (t : ts) (m : ms) | = | } : (L (t : ts) ms) | if m∕ = 0 and parse-error(t) |
(Note 5) |
|||
L (t : ts) ms | = | t : (L ts ms) |
|
L [] [] | = | [] |
|
L [] (m : ms) | = | } : L [] ms | if m≠0 (Note 6) |
Here, the definition of p is indented less than the indentation of the enclosing context, which is set in this case by the definition of h.
The test m∕ = 0 checks that an implicitly-added closing brace would match an implicit open brace.
If none of the rules given above matches, then the algorithm fails. It can fail for instance when the end of the input is reached, and a non-layout context is active, since the close brace is missing. Some error conditions are not detected by the algorithm, although they could be: for example let }.
Note 1 implements the feature that layout processing can be stopped prematurely by a parse error. For example
is valid, because it translates to
The close brace is inserted due to the parse error rule above.
The “literate comment” convention, first developed by Richard Bird and Philip Wadler for Orwell, and inspired in turn by Donald Knuth’s “literate programming”, is an alternative style for encoding Haskell source code. The literate style encourages comments by making them the default. A line in which “>” is the first character is treated as part of the program; all other lines are comments.
The program text is recovered by taking only those lines beginning with “>”, and replacing the leading “>” with a space. Layout and comments apply exactly as described in Chapter 10 in the resulting text.
To capture some cases where one omits an “>” by mistake, it is an error for a program line to appear adjacent to a non-blank comment line, where a line is taken as blank if it consists only of whitespace.
By convention, the style of comment is indicated by the file extension, with “.hs” indicating a usual Haskell file and “.lhs” indicating a literate Haskell file. Using this style, a simple factorial program would be:
An alternative style of literate programming is particularly suitable for use with the LaTeX text processing system. In this convention, only those parts of the literate program that are entirely enclosed between \begin{code}…\end{code} delimiters are treated as program text; all other lines are comments. More precisely:
It is not necessary to insert additional blank lines before or after these delimiters, though it may be stylistically desirable. For example,
This style uses the same file extension. It is not advisable to mix these two styles in the same file.
module | → | module modid [exports] where body | |
| | body | ||
body | → | { impdecls ; topdecls } | |
| | { impdecls } | ||
| | { topdecls } | ||
impdecls | → | impdecl1 ; … ; impdecln | (n ≥ 1) |
exports | → | ( export1 , … , exportn [ , ] ) | (n ≥ 0) |
export | → | qvar | |
| | qtycon [(..) | ( cname1 , … , cnamen )] | (n ≥ 0) | |
| | qtycls [(..) | ( qvar1 , … , qvarn )] | (n ≥ 0) | |
| | module modid | ||
impdecl | → | import [qualified] modid [as modid] [impspec] | |
| | (empty declaration) | ||
impspec | → | ( import1 , … , importn [ , ] ) | (n ≥ 0) |
| | hiding ( import1 , … , importn [ , ] ) | (n ≥ 0) | |
import | → | var | |
| | tycon [ (..) | ( cname1 , … , cnamen )] | (n ≥ 0) | |
| | tycls [(..) | ( var1 , … , varn )] | (n ≥ 0) | |
cname | → | var | con | |
topdecls | → | topdecl1 ; … ; topdecln | (n ≥ 0) |
topdecl | → | type simpletype = type | |
| | data [context =>] simpletype [= constrs] [deriving] | ||
| | newtype [context =>] simpletype = newconstr [deriving] | ||
| | class [scontext =>] tycls tyvar [where cdecls] | ||
| | instance [scontext =>] qtycls inst [where idecls] | ||
| | default (type1 , … , typen) | (n ≥ 0) | |
| | foreign fdecl | ||
| | decl | ||
decls | → | { decl1 ; … ; decln } | (n ≥ 0) |
decl | → | gendecl | |
| | (funlhs | pat) rhs | ||
cdecls | → | { cdecl1 ; … ; cdecln } | (n ≥ 0) |
cdecl | → | gendecl | |
| | (funlhs | var) rhs | ||
idecls | → | { idecl1 ; … ; idecln } | (n ≥ 0) |
idecl | → | (funlhs | var) rhs | |
| | (empty) | ||
gendecl | → | vars :: [context =>] type | (type signature) |
| | fixity [integer] ops | (fixity declaration) | |
| | (empty declaration) | ||
ops | → | op1 , … , opn | (n ≥ 1) |
vars | → | var1 , …, varn | (n ≥ 1) |
fixity | → | infixl | infixr | infix | |
type | → | btype [-> type] | (function type) |
btype | → | [btype] atype | (type application) |
atype | → | gtycon | |
| | tyvar | ||
| | ( type1 , … , typek ) | (tuple type, k ≥ 2) | |
| | [ type ] | (list type) | |
| | ( type ) | (parenthesized constructor) | |
gtycon | → | qtycon | |
| | () | (unit type) | |
| | [] | (list constructor) | |
| | (->) | (function constructor) | |
| | (,{,}) | (tupling constructors) | |
context | → | class | |
| | ( class1 , … , classn ) | (n ≥ 0) | |
class | → | qtycls tyvar | |
| | qtycls ( tyvar atype1 … atypen ) | (n ≥ 1) | |
scontext | → | simpleclass | |
| | ( simpleclass1 , … , simpleclassn ) | (n ≥ 0) | |
simpleclass | → | qtycls tyvar | |
simpletype | → | tycon tyvar1 … tyvark | (k ≥ 0) |
constrs | → | constr1 | … | constrn | (n ≥ 1) |
constr | → | con [!] atype1 … [!] atypek | (arity con = k, k ≥ 0) |
| | (btype | ! atype) conop (btype | ! atype) | (infix conop) | |
| | con { fielddecl1 , … , fielddecln } | (n ≥ 0) | |
newconstr | → | con atype | |
| | con { var :: type } | ||
fielddecl | → | vars :: (type | ! atype) | |
deriving | → | deriving (dclass | (dclass1, … , dclassn)) | (n ≥ 0) |
dclass | → | qtycls | |
inst | → | gtycon | |
| | ( gtycon tyvar1 … tyvark ) | (k ≥ 0, tyvars distinct) | |
| | ( tyvar1 , … , tyvark ) | (k ≥ 2, tyvars distinct) | |
| | [ tyvar ] | ||
| | ( tyvar1 -> tyvar2 ) | tyvar1 and tyvar2 distinct | |
fdecl | → | import callconv [safety] impent var :: ftype | (define variable) |
| | export callconv expent var :: ftype | (expose variable) | |
callconv | → | ccall | stdcall | cplusplus | (calling convention) |
| | jvm | dotnet | ||
| | system-specific calling conventions | ||
impent | → | [string] | (see Section 8.5.1) |
expent | → | [string] | (see Section 8.5.1) |
safety | → | unsafe | safe | |
ftype | → | frtype | |
| | fatype → ftype | ||
frtype | → | fatype | |
| | () | ||
fatype | → | qtycon atype1 … atypek | (k ≥ 0) |
funlhs | → | var apat { apat } | |
| | pat varop pat | ||
| | ( funlhs ) apat { apat } | ||
rhs | → | = exp [where decls] | |
| | gdrhs [where decls] | ||
gdrhs | → | guards = exp [gdrhs] | |
guards | → | | guard1, …, guardn | (n ≥ 1) |
guard | → | pat <- infixexp | (pattern guard) |
| | let decls | (local declaration) | |
| | infixexp | (boolean guard) | |
exp | → | infixexp :: [context =>] type | (expression type signature) |
| | infixexp | ||
infixexp | → | lexp qop infixexp | (infix operator application) |
| | - infixexp | (prefix negation) | |
| | lexp | ||
lexp | → | \ apat1 … apatn -> exp | (lambda abstraction, n ≥ 1) |
| | let decls in exp | (let expression) | |
| | if exp [;] then exp [;] else exp | (conditional) | |
| | case exp of { alts } | (case expression) | |
| | do { stmts } | (do expression) | |
| | fexp | ||
fexp | → | [fexp] aexp | (function application) |
aexp | → | qvar | (variable) |
| | gcon | (general constructor) | |
| | literal | ||
| | ( exp ) | (parenthesized expression) | |
| | ( exp1 , … , expk ) | (tuple, k ≥ 2) | |
| | [ exp1 , … , expk ] | (list, k ≥ 1) | |
| | [ exp1 [, exp2] .. [exp3] ] | (arithmetic sequence) | |
| | [ exp | qual1 , … , qualn ] | (list comprehension, n ≥ 1) | |
| | ( infixexp qop ) | (left section) | |
| | ( qop⟨-⟩ infixexp ) | (right section) | |
| | qcon { fbind1 , … , fbindn } | (labeled construction, n ≥ 0) | |
| | aexp⟨qcon⟩ { fbind1 , … , fbindn } | (labeled update, n ≥ 1) | |
qual | → | pat <- exp | (generator) |
| | let decls | (local declaration) | |
| | exp | (guard) | |
alts | → | alt1 ; … ; altn | (n ≥ 1) |
alt | → | pat -> exp [where decls] | |
| | pat gdpat [where decls] | ||
| | (empty alternative) | ||
gdpat | → | guards -> exp [ gdpat ] | |
stmts | → | stmt1 … stmtn exp [;] | (n ≥ 0) |
stmt | → | exp ; | |
| | pat <- exp ; | ||
| | let decls ; | ||
| | ; | (empty statement) | |
fbind | → | qvar = exp | |
pat | → | lpat qconop pat | (infix constructor) |
| | lpat | ||
lpat | → | apat | |
| | - (integer | float) | (negative literal) | |
| | gcon apat1 … apatk | (arity gcon = k, k ≥ 1) | |
apat | → | var [ @ apat] | (as pattern) |
| | gcon | (arity gcon = 0) | |
| | qcon { fpat1 , … , fpatk } | (labeled pattern, k ≥ 0) | |
| | literal | ||
| | _ | (wildcard) | |
| | ( pat ) | (parenthesized pattern) | |
| | ( pat1 , … , patk ) | (tuple pattern, k ≥ 2) | |
| | [ pat1 , … , patk ] | (list pattern, k ≥ 1) | |
| | ~ apat | (irrefutable pattern) | |
fpat | → | qvar = pat | |
gcon | → | () | |
| | [] | ||
| | (,{,}) | ||
| | qcon | ||
var | → | varid | ( varsym ) | (variable) |
qvar | → | qvarid | ( qvarsym ) | (qualified variable) |
con | → | conid | ( consym ) | (constructor) |
qcon | → | qconid | ( gconsym ) | (qualified constructor) |
varop | → | varsym | ` varid ` | (variable operator) |
qvarop | → | qvarsym | ` qvarid ` | (qualified variable operator) |
conop | → | consym | ` conid ` | (constructor operator) |
qconop | → | gconsym | ` qconid ` | (qualified constructor operator) |
op | → | varop | conop | (operator) |
qop | → | qvarop | qconop | (qualified operator) |
gconsym | → | : | qconsym |
The following is an example implementation of fixity resolution for Haskell expressions. Fixity resolution also applies to Haskell patterns, but patterns are a subset of expressions so in what follows we consider only expressions for simplicity.
The function resolve takes a list in which the elements are expressions or operators, i.e. an instance of the infixexp non-terminal in the context-free grammar. It returns either Just e where e is the resolved expression, or Nothing if the input does not represent a valid expression. In a compiler, of course, it would be better to return more information about the operators involved for the purposes of producing a useful error message, but the Maybe type will suffice to illustrate the algorithm here.
The algorithm works as follows. At each stage we have a call
which means that we are looking at an expression like
(the caller holds E0). The job of parse is to build the expression to the right of op1, returning the expression and any remaining input.
There are three cases to consider:
To initialise the algorithm, we set op1 to be an imaginary operator with precedence lower than anything else. Hence parse will consume the whole input, and return the resulting expression.
The handling of the prefix negation operator, -, complicates matters only slightly. Recall that prefix negation has the same fixity as infix negation: left-associative with precedence 6. The operator to the left of -, if there is one, must have precedence lower than 6 for the expression to be legal. The negation operator itself may left-associate with operators of the same fixity (e.g. +). So for example -a + b is legal and resolves as (-a) + b, but a + -b is illegal.
The function parseNeg handles prefix negation. If we encounter a negation operator, and it is legal in this position (the operator to the left has precedence lower than 6), then we proceed in a similar way to case (3) above: compute the argument to - by recursively calling parseNeg, and then continue by calling parse.
Note that this algorithm is insensitive to the range and resolution of precedences. There is no reason in principle that Haskell should be limited to integral precedences in the range 1 to 10; a larger range, or fractional values, would present no additional difficulties.