Merge pull request #1030 from parrt/move-faq-to-repo

Move faq to repo
This commit is contained in:
Terence Parr 2015-10-27 10:48:20 -07:00
commit e92a0e9ed4
10 changed files with 382 additions and 1 deletions

1
.gitignore vendored

@@ -45,7 +45,6 @@ nbactions*.xml
# Generated files
/out/
/doc/
/gen/
/gen3/
/gen4/

11
doc/faq/actions-preds.md Normal file

@@ -0,0 +1,11 @@
# Actions and semantic predicates
## How do I test if an optional rule was matched?
For optional rule references such as the initialization clause in the following
```
decl : 'var' ID (EQUALS expr)? ;
```
testing to see if that clause was matched can be done using `$EQUALS!=null` or `$expr.ctx!=null` where `$expr.ctx` points to the context or parse tree created for that reference to rule expr.
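For example, a minimal sketch with an embedded action (the action body is just illustrative; `EQUALS` and `expr` are assumed to be defined elsewhere in the grammar):

```
decl : 'var' ID (EQUALS expr)?
       {if ( $EQUALS!=null ) System.out.println("has initializer");}
     ;
```

From a listener or visitor, the equivalent test on the generated context is `ctx.EQUALS() != null`.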

5
doc/faq/error-handling.md Normal file

@@ -0,0 +1,5 @@
# Error handling
## How do I perform semantic checking with ANTLR?
See [How to implement error handling in ANTLR4](http://stackoverflow.com/questions/21613421/how-to-implement-error-handling-in-antlr4/21615751#21615751).

100
doc/faq/general.md Normal file

@@ -0,0 +1,100 @@
# General
## Why do we need ANTLR v4?
*Oliver Zeigermann asked me some questions about v4. Here is our conversation.*
*See the [preface from the book](http://media.pragprog.com/titles/tpantlr2/preface.pdf)*
**Q: Why is the new version of ANTLR also called “honey badger”?**
ANTLR v4 is called the honey badger release after the fearless hero of the YouTube sensation, The Crazy Nastyass Honey Badger.
**Q: Why did you create a new version of ANTLR?**
Well, I started creating a new version because v3 had gotten very messy on the inside and also relied on grammars written in ANTLR v2. Unfortunately, v2's open-source license was unclear, and so projects such as Eclipse could not include v3 because of its dependency on v2. In the end, Sam Harwell converted all of the v2 grammars into v3 so that v3 was written in itself. Because v3 has a very clean BSD license, the Eclipse project okayed v3 for inclusion in the summer of 2011.
As I was rewriting ANTLR, I wanted to experiment with a new variation of the LL(\*) parsing algorithm. As luck would have it, I came up with a cool new version called adaptive LL(\*) that pushes all of the grammar analysis effort to runtime. The parser warms up like Java does with its JIT on-the-fly compiler; the code gets faster and faster the longer it runs. The benefit is that the adaptive algorithm is much stronger than the static LL(\*) grammar analysis algorithm in v3. Honey Badger takes any grammar that you give it; it just doesn't give a damn. (v4 accepts even left recursive grammars, except for indirectly left recursive grammars where x calls y which calls x).
v4 is the culmination of 25 years of research into parsers and parser generators. I think I finally know what I want to build. :)
**Q: What makes you excited about ANTLR4?**
The biggest thing is the new adaptive parsing strategy, which lets us accept any grammar we care to write. That gives us a huge productivity boost because we can now write much more natural expression rules (which occur in almost every grammar). For example, bottom-up parser generators such as yacc let you write very natural grammars like this:
```
e : e '*' e
| e '+' e
| INT
;
```
ANTLR v4 will also take that grammar now, translating it secretly to a non-left recursive version.
Another big thing with v4 is that my goal has shifted from performance to ease-of-use. For example, ANTLR automatically can build parse trees for you and generate listeners and visitors. This is not only a huge productivity win, but also an important step forward in building grammars that don't depend on embedded actions. Those embedded actions (raw Java code or whatever) locked the grammar into use with only one language. If we keep all of the actions out of the grammar and put them into external visitors, we can reuse the same grammar to generate code in any language for which we have an ANTLR target.
**Q: What do you think are the things people had problems with in ANTLR3?**
The biggest problem was figuring out why ANTLR did not like their grammar. The static analysis often could not figure out how to generate a parser for the grammar. This problem totally goes away with the honey badger because it will take just about anything you give it without a whimper.
**Q: And what about other compiler generator tools?**
The biggest problem for the average practitioner is that most parser generators do not produce code you can load into a debugger and step through. This immediately removes bottom-up parser generators and the really powerful GLR parser generators from consideration by the average programmer. There are a few other tools that generate source code like ANTLR does, but they don't have v4's adaptive LL(\*) parsers. You will be stuck with contorting your grammar to fit the needs of the tool's weaker, say, LL(k) parsing strategy. PEG-based tools have a number of weaknesses, but to mention one, they have essentially no error recovery because they cannot report an error until they have parsed the entire input.
**Q: What are the main design decisions in ANTLR4?**
Ease-of-use over performance. I will worry about performance later. Simplicity over complexity. For example, I have taken out explicit/manual AST construction facilities and the tree grammar facilities. For 20 years I've been trying to get people to go that direction, but I've since decided that it was a mistake. It's much better to give people a parser generator that can automatically build trees and then let them use pure code to do whatever tree walking they want. People are extremely familiar and comfortable with visitors, for example.
**Q: What do you think people will like most on ANTLR4?**
The lack of errors when you run your grammar through ANTLR. The automatic tree construction and listener/visitor generation.
**Q: What do you think are the problems people will try to solve with ANTLR4?**
In my experience, almost no one uses parser generators to build commercial compilers. So, people are using ANTLR for their everyday work, building everything from configuration files to little scripting languages.
In response to a question about this entry from stackoverflow.com: I believe that compiler developers are very concerned with parsing speed, error reporting, and error recovery. For that, they want absolute control over their parser. Also, some languages are so complicated, such as C++, that parser generators might build parsers slower than compiler developers want. The compiler developers also like the control of a recursive-descent parser for predicating the parse to handle context-sensitive constructs such as `T(i)` in C++.
There is also likely a sense that parsing is the easy part of building a compiler so they don't immediately jump to parser generators. I think this is also a function of previous-generation parser generators. McPeak's Elkhound GLR-based parser generator is powerful enough and fast enough, in the hands of someone who knows what they're doing, to be suitable for compilers. I can also attest to the fact that ANTLR v4 is now powerful enough and fast enough to compete well with hand-built parsers. E.g., after warm-up, it's now taking just 1s to parse the entire JDK java/\* library.
## What is the difference between ANTLR 3 and 4?
The biggest difference between ANTLR 3 and 4 is that ANTLR 4 takes any grammar you give it unless the grammar has indirect left recursion. That means we don't need syntactic predicates or backtracking, so ANTLR 4 does not support that syntax; you will get a warning for using it. ANTLR 4 allows direct left recursion so that expressing things like arithmetic expression syntax is very easy and natural:
```
expr : expr '*' expr
| expr '+' expr
| INT
;
```
ANTLR 4 automatically constructs parse trees for you and abstract syntax tree (AST) construction is no longer an option. See also [What if I need ASTs not parse trees for a compiler, for example?](parse-trees.md)
Another big difference is that we discourage the use of actions directly within the grammar because ANTLR 4 automatically generates [listeners and visitors](https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parse+Tree+Listeners) for you to use that trigger method calls when some phrases of interest are recognized during a tree walk after parsing. See also [Parse Tree Matching and XPath](https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parse+Tree+Matching+and+XPath).
Semantic predicates are still allowed in both parser and lexer rules, as are actions. For efficiency's sake, keep semantic predicates to the right edge of lexical rules.
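For example, here is a sketch of a lexer rule with its semantic predicate at the right edge; it matches identifiers only when they start with a lowercase letter (the predicate body is illustrative):

```
ID : [a-zA-Z]+ {Character.isLowerCase(getText().charAt(0))}? ;
```

Placed at the right edge, the predicate is evaluated only after the whole token candidate has been matched, which keeps lexer prediction simple.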
There are no tree grammars because we use listeners and visitors instead.
## Why is my expression parser slow?
Make sure to use two-stage parsing. See example in [bug report](https://github.com/antlr/antlr4/issues/374).
```Java
CharStream input = new ANTLRFileStream(args[0]);
ExprLexer lexer = new ExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ExprParser parser = new ExprParser(tokens);
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
try {
    parser.stat();  // STAGE 1
}
catch (Exception ex) {
    tokens.reset(); // rewind input stream
    parser.reset();
    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    parser.stat();  // STAGE 2
    // if we parse ok, it's LL not SLL
}
```

11
doc/faq/getting-started.md Normal file

@@ -0,0 +1,11 @@
# Getting started
## How do I install and run a simple grammar?
See [Getting Started with ANTLR v4](https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Getting+Started+with+ANTLR+v4).
## Why does my parser test program hang?
Your test program is likely not hanging but simply waiting for you to type some input for standard input. Don't forget that you need to type the end of file character, generally on a line by itself, at the end of the input. On a Mac or Linux machine it is ctrl-D, as gawd intended, or ctrl-Z on a Windows machine.
See [Getting Started with ANTLR v4](https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Getting+Started+with+ANTLR+v4).

50
doc/faq/index.md Normal file

@@ -0,0 +1,50 @@
# Frequently-Asked Questions (FAQ)
This is the main landing page for the ANTLR 4 FAQ. The links below will take you to the appropriate file containing all answers for that subcategory.
*To add to or improve this FAQ, [fork](https://help.github.com/articles/fork-a-repo/) the [antlr/antlr4 repo](https://github.com/antlr/antlr4) then update this `doc/faq/index.md` or file(s) in that directory. Submit a [pull request](https://help.github.com/articles/creating-a-pull-request/) to get your changes incorporated into the main repository. Do not mix code and FAQ updates in the same pull request since code updates require signing the contributors.txt certificate of origin.*
## Getting Started
* [How do I install and run a simple grammar?](getting-started.md)
* [Why does my parser test program hang?](getting-started.md)
## Installation
* [Why can't ANTLR (grun) find my lexer or parser?](installation.md)
* [Why can't I run the ANTLR tool?](installation.md)
* [Why doesn't my parser compile?](installation.md)
## General
* [Why do we need ANTLR v4?](general.md)
* [What is the difference between ANTLR 3 and 4?](general.md)
* [Why is my expression parser slow?](general.md)
## Grammar syntax
## Lexical analysis
* [How can I parse non-ASCII text and use characters in token rules?](lexical.md)
* [How do I replace escape characters in string tokens?](lexical.md)
* [Why are my keywords treated as identifiers?](lexical.md)
* [Why are there no whitespace tokens in the token stream?](lexical.md)
## Parse Trees
* [How do I get the input text for a parse-tree subtree?](parse-trees.md)
* [What if I need ASTs not parse trees for a compiler, for example?](parse-trees.md)
* [When do I use listener/visitor vs XPath vs Tree pattern matching?](parse-trees.md)
## Translation
* [ASTs vs parse trees](parse-trees.md)
* [Decoupling input walking from output generation](parse-trees.md)
## Actions and semantic predicates
* [How do I test if an optional rule was matched?](actions-preds.md)
## Error handling
* [How do I perform semantic checking with ANTLR?](error-handling.md)

60
doc/faq/installation.md Normal file

@@ -0,0 +1,60 @@
# Installation
Please read carefully: [Getting Started with ANTLR v4](https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Getting+Started+with+ANTLR+v4).
## Why can't ANTLR (grun) find my lexer or parser?
If you see "Can't load Hello as lexer or parser", it's because you don't have '.' (current directory) in your CLASSPATH.
```bash
$ alias antlr4='java -jar /usr/local/lib/antlr-4.2.2-complete.jar'
$ alias grun='java org.antlr.v4.runtime.misc.TestRig'
$ export CLASSPATH="/usr/local/lib/antlr-4.2.2-complete.jar"
$ antlr4 Hello.g4
$ javac Hello*.java
$ grun Hello r -tree
Can't load Hello as lexer or parser
$
```
For Mac/Linux, use:
```bash
export CLASSPATH=".:/usr/local/lib/antlr-4.2.2-complete.jar:$CLASSPATH"
```
or for Windows:
```
SET CLASSPATH=.;C:\Javalib\antlr4-complete.jar;%CLASSPATH%
```
**See the dot at the beginning?** It's critical.
## Why can't I run the ANTLR tool?
If you get a no class definition found error, you are missing the ANTLR jar in your `CLASSPATH` (or you might only have the runtime jar):
```bash
/tmp $ java org.antlr.v4.Tool Hello.g4
Exception in thread "main" java.lang.NoClassDefFoundError: org/antlr/v4/Tool
Caused by: java.lang.ClassNotFoundException: org.antlr.v4.Tool
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
```
## Why doesn't my parser compile?
If you see these kinds of errors, it's because you don't have the runtime or complete ANTLR library in your CLASSPATH.
```bash
/tmp $ javac Hello*.java
HelloBaseListener.java:3: package org.antlr.v4.runtime does not exist
import org.antlr.v4.runtime.ParserRuleContext;
^
...
```

63
doc/faq/lexical.md Normal file

@@ -0,0 +1,63 @@
# Lexical analysis
## How can I parse non-ASCII text and use characters in token rules?
See [Using non-ASCII characters in token rules](http://stackoverflow.com/questions/28126507/antlr4-using-non-ascii-characters-in-token-rules/28129510#28129510).
## How do I replace escape characters in string tokens?
Unfortunately, manipulating the text of the token matched by a lexical rule is cumbersome (as of 4.2). You have to build up a buffer and then set the text at the end. Actions in the lexer execute at the associated position in the input just like they do in the parser. Here's an example that does escape character replacement in strings. It's not pretty but it works.
```
grammar Foo;
@members {
StringBuilder buf = new StringBuilder(); // can't make locals in lexer rules
}
STR : '"'
( '\\'
( 'r' {buf.append('\r');}
| 'n' {buf.append('\n');}
| 't' {buf.append('\t');}
| '\\' {buf.append('\\');}
| '\"' {buf.append('"');}
)
| ~('\\'|'"') {buf.append((char)_input.LA(-1));}
)*
'"'
{setText(buf.toString()); buf.setLength(0); System.out.println(getText());}
;
```
It's easier and more efficient to return the original input string and then use a small function to rewrite the string later, during a parse-tree walk or whatever. But here's how to do it from within the lexer.
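That post-hoc approach might look like the following (a hypothetical helper, not part of the ANTLR runtime), called on the raw token text during a later tree walk:

```java
// Hypothetical post-hoc unescape helper: the lexer keeps the raw token
// text, quotes included, and this rewrites the escapes afterward.
public class Unescape {
    public static String unescape(String quoted) {
        // strip the surrounding double quotes
        String s = quoted.substring(1, quoted.length() - 1);
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 'r':  buf.append('\r'); break;
                    case 'n':  buf.append('\n'); break;
                    case 't':  buf.append('\t'); break;
                    case '\\': buf.append('\\'); break;
                    case '"':  buf.append('"');  break;
                    default:   buf.append(next); // unknown escape: keep as-is
                }
            }
            else {
                buf.append(c);
            }
        }
        return buf.toString();
    }
}
```

Because the grammar never manipulates the text, the same lexer rule also works in the interpreter, XPath, and tree patterns.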
Lexer actions don't work in the interpreter, which includes xpath and tree patterns.
For more on the argument against doing complicated things in the lexer, see the [related lexer-action issue at github](https://github.com/antlr/antlr4/issues/483#issuecomment-37326067).
## Why are my keywords treated as identifiers?
Keywords such as `begin` are also valid identifiers lexically and so that input is ambiguous. To resolve ambiguities, ANTLR gives precedence to the lexical rules specified first. That implies that you must put the identifier rule after all of your keywords:
```
grammar T;
decl : DEF 'int' ID ';' ; // quoted ';' is a token; the final ; ends the rule
DEF : 'def' ; // ambiguous with ID as is 'int'
ID : [a-z]+ ;
```
Notice that literal `'int'` is also physically before the ID rule and will also get precedence.
## Why are there no whitespace tokens in the token stream?
The lexer is not sending whitespace to the parser, which means that the rewrite stream doesn't have access to those tokens either. This is because of the `skip` lexer command:
```
WS : [ \t\r\n\u000C]+ -> skip
;
```
You have to change all those to `-> channel(HIDDEN)`, which sends them to the token stream on a different channel, making them available in the stream but invisible to the parser.
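That is, the whitespace rule becomes (same pattern as above, only the command changes):

```
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN)
   ;
```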

73
doc/faq/parse-trees.md Normal file

@@ -0,0 +1,73 @@
# Parse Trees
## How do I get the input text for a parse-tree subtree?
In ParseTree, you have this method:
```java
/** Return the combined text of all leaf nodes. Does not get any
* off-channel tokens (if any) so won't return whitespace and
* comments if they are sent to parser on hidden channel.
*/
String getText();
```
But, you probably want this method from TokenStream:
```java
/**
* Return the text of all tokens in the source interval of the specified
* context. This method behaves like the following code, including potential
* exceptions from the call to {@link #getText(Interval)}, but may be
* optimized by the specific implementation.
*
* <p>If {@code ctx.getSourceInterval()} does not return a valid interval of
* tokens provided by this stream, the behavior is unspecified.</p>
*
* <pre>
* TokenStream stream = ...;
* String text = stream.getText(ctx.getSourceInterval());
* </pre>
*
* @param ctx The context providing the source interval of tokens to get
* text for.
* @return The text of all tokens within the source interval of {@code ctx}.
*/
public String getText(RuleContext ctx);
```
That is, do this:
```
mytokens.getText(mySubTree);
```
## What if I need ASTs not parse trees for a compiler, for example?
For writing a compiler, either generate [LLVM-type static-single-assignment](http://llvm.org/docs/LangRef.html) form or construct an AST from the parse tree using a listener or visitor. Or, use actions in grammar, turning off auto-parse-tree construction.
## When do I use listener/visitor vs XPath vs Tree pattern matching?
### XPath
XPath works great when you need to find specific nodes, possibly in certain contexts. The context is limited to the parents on the way to the root of the tree. For example, if you want to find all ID nodes, use path `//ID`. If you want all variable declarations, you might use path `//vardecl`. If you only want field declarations, then you can use some context information via path `/classdef/vardecl`, which would only find vardecls that are children of class definitions. You can merge the results of multiple XPath `findAll()`s, simulating a set union. The only caveat is that the order from the original tree is not preserved when you union multiple `findAll()` sets.
### Tree pattern matching
Use tree pattern matching when you want to find specific subtree structures such as all assignments to 0 using pattern `x = 0;`. (Recall that these are very convenient because you specify the tree structure in the concrete syntax of the language described by the grammar.) If you want to find all assignments of any kind, you can use pattern `x = <expr>;` where `<expr>` will find any expression. This works great for matching particular substructures and therefore gives you a bit more ability to specify context. I.e., instead of just finding all identifiers, you can find all identifiers on the left hand side of an expression.
### Listeners/Visitors
Using the listener or visitor interfaces gives you the most power but requires implementing more methods. It might be more challenging to discover the emergent behavior of a listener than that of a simple tree pattern matcher that says *go find me X under node Y*.
Listeners are great when you want to visit many nodes in a tree.
Listeners allow you to compute and save context information necessary for processing at various nodes. For example, when building a symbol table manager for a compiler or translator, you need to compute symbol scopes such as globals, class, function, and code block. When you enter a class or function, you push a new scope and then pop it when you exit that class or function. When you see a symbol, you need to define it or look it up in the proper scope. By having enter/exit listener functions push and pop scopes, listener functions for defining variables simply say something like:
```java
scopeStack.peek().define(new VariableSymbol("foo"));
```
That way each listener function does not have to compute its appropriate scope.
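A minimal self-contained sketch of that scope-stack pattern (the class and method names here are illustrative, not taken from the ANTLR runtime or the linked examples):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// A scope maps names to some symbol info and chains to its enclosing scope.
class Scope {
    final Scope parent;
    final Map<String, String> symbols = new HashMap<>(); // name -> kind
    Scope(Scope parent) { this.parent = parent; }
    void define(String name, String kind) { symbols.put(name, kind); }
    String resolve(String name) { // walk enclosing scopes outward
        if (symbols.containsKey(name)) return symbols.get(name);
        return parent != null ? parent.resolve(name) : null;
    }
}

// Listener enter/exit methods would call enterScope()/exitScope();
// definition-site listener methods call define().
class SymbolTable {
    final Deque<Scope> scopeStack = new ArrayDeque<>();
    SymbolTable() { scopeStack.push(new Scope(null)); } // global scope
    void enterScope() { scopeStack.push(new Scope(scopeStack.peek())); }
    void exitScope()  { scopeStack.pop(); }
    void define(String name, String kind) { scopeStack.peek().define(name, kind); }
    String resolve(String name) { return scopeStack.peek().resolve(name); }
}
```

Each listener method just calls `define()` or `resolve()`; the current scope is always whatever is on top of the stack.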
Examples: [DefScopesAndSymbols.java](https://github.com/mantra/compiler/blob/master/src/java/mantra/semantics/DefScopesAndSymbols.java) and [SetScopeListener.java](https://github.com/mantra/compiler/blob/master/src/java/mantra/semantics/SetScopeListener.java) and [VerifyListener.java](https://github.com/mantra/compiler/blob/master/src/java/mantra/semantics/VerifyListener.java)

9
doc/faq/translation.md Normal file

@@ -0,0 +1,9 @@
# Translation
## ASTs vs parse trees
I used to do specialized AST (**abstract** syntax tree) nodes rather than (concrete) parse trees because I used to think more about compilation and generating bytecode/assembly code. When I started thinking more about translation, I started using parse trees. For v4, I realized that I did mostly translation. I guess what I'm saying is that maybe parse trees are not as good as ASTs for generating bytecodes. Personally, I would rather see `(+ 3 4)` rather than `(expr 3 + 4)` for generating byte codes, but it's not the end of the world. (*Can someone fill this in?*)
## Decoupling input walking from output generation
I suggest creating an intermediate model that represents your output. You walk the parse tree to collect information and create your model. Then, you could almost certainly automatically walk this internal model to generate output based upon StringTemplate templates that match the class names of the internal model. In other words, define a special `IFStatement` object that has all of the fields you want and then create them as you walk the parse tree. This decoupling of the input from the output is very powerful. Just because we have a parse tree listener doesn't mean that the parse tree itself is necessarily the best data structure to hold all information necessary to generate code. Imagine a situation where the output is the exact reverse of the input. In that case, you really want to walk the input just to collect data. Generating output should be driven by the internal model, not the way it was represented in the input.
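A minimal sketch of such an output model (plain Java rendering stands in for StringTemplate here; the class shapes are illustrative):

```java
// Tiny output-model sketch: a parse-tree listener would construct these
// objects, and output generation then walks the model, not the parse tree.
interface OutputModel { String render(); }

// Leaf node holding already-rendered text collected from the input.
class CodeFragment implements OutputModel {
    final String text;
    CodeFragment(String text) { this.text = text; }
    public String render() { return text; }
}

// Model object with exactly the fields the output needs, independent of
// how an if-statement happened to be phrased in the input language.
class IFStatement implements OutputModel {
    final OutputModel condition;
    final OutputModel thenBlock;
    IFStatement(OutputModel condition, OutputModel thenBlock) {
        this.condition = condition;
        this.thenBlock = thenBlock;
    }
    public String render() {
        return "if (" + condition.render() + ") { " + thenBlock.render() + " }";
    }
}
```

Because rendering lives entirely in the model, the same tree walk can feed any output language just by swapping the `render()` logic (or the templates keyed on the class names).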