Application letters at the Haskell workshop: suggestion

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
16 Sep 2001 09:30:12 GMT


Sat, 15 Sep 2001 15:44:52 -0500, Duncan Coutts <dcoutts@cray.com> pisze:

> I've been using a few variants: single error, multiple error and multiple
> error/warning types. I'm also particularly pleased with one that has an
> extra combinator which allows seperate 'branches' of an expression to
> combine all the errors (if there are any). That allows multiple
> *independant* errors to be returned to the user rather than just the first one.

Parsec does this and similar things. It tries to generate reasonable
messages of the form "expecting foo, found bar" or "unexpected bar"
annotated with source position, making use of labels of higher level
syntactic constructs inserted in the grammar as well as individual
characters matched.

I was also experimenting with this. Recently I had no time to work
on this, but I will finish it sometime. It's surprisingly hard to
get right in all cases, and there are non-obvious decisions to take.

For example consider the grammar 'A | B C' and input 'BZ'. We should
report that Z was found where C was expected, but do we also report
that possibly B was found where A was expected? Depending on the
situation both choices can make sense.

It depends whether the fact that input began with 'B' really means
that the user wanted to write 'B C' (e.g. if in reality 'B' = 'class';
one rarely writes 'class' when he wants to define something else)
or it's probably an artifact of the grammar and the error is at the
beginning (e.g. if 'B' is an identifier and the rest doesn't look
like any of things which could begin with an identifier; the user
could mean anything, even misspelled a keyword).

Consider the grammar 'A B C | A B D' and input 'ABZ'. Should we
merge errors to the form "expecting C or D, found Z", even if both
occurrences of 'A B' are independent and really produced by more
complex subgrammars? What if lengths of matching prefixes differ?
What if the grammar creator gave meaningful names to 'A B C' and
'A B D' as a whole, so we can't report the error where the mismatch
occurred?

Another complication: we definitely want to perform cuts after enough
has been successfully parsed, so we don't keep it in memory and
won't consider backtracking that far. I managed to do this, slightly
differently than Parsec, but the grammar creator must be aware of this
and insert annotations improving backtracking sometimes. This impacts
error messages a bit.

Getting right descriptions of what was expected or unexpected is not
trivial. For example when there is no separate lexer, we rarely have
anything besides raw characters as "unexpected". We have something
more descriptive only if the grammar explicitly calls 'unexpected'
after successfully parsing something. We really don't want to give
a message like "expecting 'e', found 'i'" when the real cause is
"expecting 'then', found 'thickness'".

Sometimes the grammar doesn't give a name and uses a character
predicate, so we don't have any description to put into error message.
Sometimes the grammar should explicitly mark some parts to not report
them as expected; I was getting silly errors about expected whitespace
or begin of comment, even though I surely know that whitespace is
possible almost everywhere and error messages should concentrate on
syntactic errors instead.

There are many corner cases giving surprising results. For the grammar
'A (comma A)* right_bracket' and input 'A,A,A,A,A,A,A,]' we really
don't want to say for each of the commas that it could be right
bracket instead, even though it would indeed allow to successfully
parse this part, because the real error is probably only at the end.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK