antlr/doc/grammars.md

7.1 KiB
Raw Blame History

Grammar Structure

A grammar is essentially a grammar declaration followed by a list of rules, but has the general form:

/** Optional javadoc style comment */
grammar Name; ①
options {...}
import ... ;
 	
tokens {...}
channels {...} // lexer only
@actionName {...}
 	 
rule1 // parser and lexer rules, possibly intermingled
...
ruleN

The file name containing grammar X must be called X.g4. You can specify options, imports, token specifications, and actions in any order. There can be at most one each of options, imports, and token specifications. All of those elements are optional except for the header ① and at least one rule. Rules take the basic form:

ruleName : alternative1 | ... | alternativeN ;

Parser rule names must start with a lowercase letter and lexer rules must start with a capital letter.

Grammars defined without a prefix on the grammar header are combined grammars that can contain both lexical and parser rules. To make a parser grammar that only allows parser rules, use the following header.

parser grammar Name;
...

And, naturally, a pure lexer grammar looks like this:

lexer grammar Name;
...

Only lexer grammars can contain mode specifications.

Only lexer grammars can contain custom channels specifications

channels {
  WHITESPACE_CHANNEL,
  COMMENTS_CHANNEL
}

Those channels can then be used like enums within lexer rules:

WS : [ \r\t\n]+ -> channel(WHITESPACE_CHANNEL) ;

Sections 15.5, Lexer Rules and Section 15.3, Parser Rules contain details on rule syntax. Section 15.8, Options describes grammar options and Section 15.4, Actions and Attributes has information on grammar-level actions.

Grammar Imports

Grammar imports let you break up a grammar into logical and reusable chunks, as we saw in Importing Grammars. ANTLR treats imported grammars very much like object-oriented programming languages treat superclasses. A grammar inherits all of the rules, tokens specifications, and named actions from the imported grammar. Rules in the “main grammar” override rules from imported grammars to implement inheritance.

Think of import as more like a smart include statement (which does not include rules that are already defined). The result of all imports is a single combined grammar; the ANTLR code generator sees a complete grammar and has no idea there were imported grammars.

To process a main grammar, the ANTLR tool loads all of the imported grammars into subordinate grammar objects. It then merges the rules, token types, and named actions from the imported grammars into the main grammar. In the diagram below, the grammar on the right illustrates the effect of grammar MyELang importing grammar ELang.

MyELang inherits rules stat, WS, and ID, but overrides rule expr and adds INT. Heres a sample build and test run that shows MyELang can recognize integer expressions whereas the original ELang cant. The third, erroneous input statement triggers an error message that also demonstrates the parser was looking for MyELangs expr not ELangs.

$ antlr4 MyELang.g4
$ javac MyELang*.java
$ grun MyELang stat
=> 	34;
=> 	a;
=> 	;
=> 	EOF
<= 	line 3:0 extraneous input ';' expecting {INT, ID}

If there were any tokens specifications, the main grammar would merge the token sets. Any named actions such as @members would be merged. In general, you should avoid named actions and actions within rules in imported grammars since that limits their reuse. ANTLR also ignores any options in imported grammars.

Imported grammars can also import other grammars. ANTLR pursues all imported grammars in a depth-first fashion. If two or more imported grammars define ruler, ANTLR chooses the first version of r it finds. In the following diagram, ANTLR examines grammars in the following order Nested, G1, G3, G2.

Nested includes the r rule from G3 because it sees that version before the r in G2.

Not every kind of grammar can import every other kind of grammar:

  • Lexer grammars can import lexers.
  • Parsers can import parsers.
  • Combined grammars can import lexers or parsers.

ANTLR adds imported rules to the end of the rule list in a main lexer grammar. That means lexer rules in the main grammar get precedence over imported rules. For example, if a main grammar defines rule IF : if ; and an imported grammar defines rule ID : [a-z]+ ; (which also recognizes if), the imported ID wont hide the main grammars IF token definition.

Tokens Section

The purpose of the tokens section is to define token types needed by a grammar for which there is no associated lexical rule. The basic syntax is:

tokens { Token1, ..., TokenN }

Most of the time, the tokens section is used to define token types needed by actions in the grammar as shown in Section 10.3, Recognizing Languages whose Keywords Arent Fixed:

// explicitly define keyword token types to avoid implicit definition warnings
tokens { BEGIN, END, IF, THEN, WHILE }
 
@lexer::members { // keywords map used in lexer to assign token types
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
	put("begin", KeywordsParser.BEGIN);
	put("end", KeywordsParser.END);
	...
}};
}

The tokens section really just defines a set of tokens to add to the overall set.

$ cat Tok.g4
grammar Tok;
tokens { A, B, C }
a : X ;
$ antlr4 Tok.g4
warning(125): Tok.g4:3:4: implicit definition of token X in parser
$ cat Tok.tokens
A=1
B=2
C=3
X=4

Actions at the Grammar Level

Currently there are only two defined named actions (for the Java target) used outside of grammar rules: header and members. The former injects code into the generated recognizer class file, before the recognizer class definition, and the latter injects code into the recognizer class definition, as fields and methods.

For combined grammars, ANTLR injects the actions into both the parser and the lexer. To restrict an action to the generated parser or lexer, use @parser::name or @lexer::name.

Heres an example where the grammar specifies a package for the generated code:

grammar Count;
 
@header {
package foo;
}
 
@members {
int count = 0;
}
 
list
@after {System.out.println(count+" ints");}
: INT {count++;} (',' INT {count++;} )*
;
 
INT : [0-9]+ ;
WS : [ \r\t\n]+ -> skip ;

The grammar itself then should be in directory foo so that ANTLR generates code in that same foo directory (at least when not using the -o ANTLR tool option):

$ cd foo
$ antlr4 Count.g4 # generates code in the current directory (foo)
$ ls
Count.g4		CountLexer.java	CountParser.java
Count.tokens	CountLexer.tokens
CountBaseListener.java CountListener.java
$ javac *.java
$ cd ..
$ grun foo.Count list
=> 	9, 10, 11
=> 	EOF
<= 	3 ints

The Java compiler expects classes in package foo to be in directory foo.