antlr/doc/left-recursion.md

51 lines
2.5 KiB
Markdown
Raw Normal View History

2015-11-01 08:31:25 +08:00
# Left-recursive rules
The most natural expression of a some common language constructs is left recursive. For example C declarators and arithmetic expressions. Unfortunately, left recursive specifications of arithmetic expressions are typically ambiguous but much easier to write out than the multiple levels required in a typical top-down grammar. Here is a sample ANTLR 4 grammar with a left recursive expression rule:
```
stat: expr '=' expr ';' // e.g., x=y; or x=f(x);
| expr ';' // e.g., f(x); or f(g(x));
;
expr: expr '*' expr
| expr '+' expr
| expr '(' expr ')' // f(x)
| id
;
```
In straight context free grammars, such a rule is ambiguous because `1+2*3` it can interpret either operator as occurring first, but ANTLR rewrites that to be non-left recursive and unambiguous using semantic predicates:
```
expr[int pr] : id
( {4 >= $pr}? '*' expr[5]
| {3 >= $pr}? '+' expr[4]
| {2 >= $pr}? '(' expr[0] ')'
)*
;
```
The predicates resolve ambiguities by comparing the precedence of the current operator against the precedence of the previous operator. An expansion of expr[pr] can match only those subexpressions whose precedence meets or exceeds pr.
## Formal rules
The formal 4.0, 4.1 ANTLR left-recursion elimination rules were changed (simplified) for 4.2 and are laid out in the [ALL(*) tech report](http://www.antlr.org/papers/allstar-techreport.pdf):
* Binary expressions are expressions which contain a recursive invocation of the rule as the first and last element of the alternative.
* Suffix expressions contain a recursive invocation of the rule as the first element of the alternative, but not as the last element.
* Prefix expressions contain a recursive invocation of the rule as the last element of the alternative, but not as the first element.
There is no such thing as a "ternary" expression--they are just binary expressions in disguise.
The right associativity specifiers used to be on the individual tokens but it's done on alternative basis anyway so the option is now on the individual alternative; e.g.,
```
e : e '*' e
| e '+' e
|<assoc=right> e '?' e ':' e
|<assoc=right> e '=' e
| INT
;
```
If your 4.0 or 4.1 grammar uses a right-associative ternary operator, you will need to update your grammar to include `<assoc=right>` on the alternative operator. To smooth the transition, `<assoc=right>` is still allowed on token references but it is ignored.