# Lexer Rules

A lexer grammar is composed of lexer rules, optionally broken into multiple modes. Lexical modes allow us to split a single lexer grammar into multiple sublexers. The lexer can only return tokens matched by rules from the current mode.

Lexer rules specify token definitions and more or less follow the syntax of parser rules except that lexer rules cannot have arguments, return values, or local variables. Lexer rule names must begin with an uppercase letter, which distinguishes them from parser rule names:

```
/** Optional document comment */
TokenName : alternative1 | ... | alternativeN ;
```

You can also define rules that are not tokens but rather aid in the recognition of tokens. These fragment rules do not result in tokens visible to the parser:

```
fragment
HelperTokenRule : alternative1 | ... | alternativeN ;
```

For example, `DIGIT` is a pretty common fragment rule:

```
INT : DIGIT+ ; // references the DIGIT helper rule
fragment DIGIT : [0-9] ; // not a token by itself
```

## Lexical Modes

Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It’s like having multiple sublexers, one for each context. The lexer can only return tokens matched by entering a rule in the current mode. Lexers start out in the so-called default mode. All rules are considered to be within the default mode unless you specify a mode command. Modes are not allowed within combined grammars, just lexer grammars. (See grammar `XMLLexer` from [Tokenizing XML](http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference).)

```
rules in default mode
...
mode MODE1;
rules in MODE1
...
mode MODEN;
rules in MODEN
...
```

## Lexer Rule Elements

Lexer rules allow two constructs that are unavailable to parser rules: the `..` range operator and the character set notation enclosed in square brackets, `[characters]`. Don’t confuse character sets with arguments to parser rules. `[characters]` only means character set in a lexer.
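
As a small illustration of the two lexer-only constructs, the sketch below writes the same hexadecimal-digit helper once with the `..` range operator and once, in a comment, with character set notation. The grammar name `RangeSketch` and the rule names `HEX` and `HEX_DIGIT` are hypothetical, chosen only for this example.

```
// Hypothetical lexer grammar; grammar and rule names are illustrative only.
lexer grammar RangeSketch;

HEX : HEX_DIGIT+ ;

// Range operator: each alternative matches one character in the given range.
fragment HEX_DIGIT : '0'..'9' | 'a'..'f' | 'A'..'F' ;

// Equivalent character set form (commented out so the rule is defined only once):
// fragment HEX_DIGIT : [0-9a-fA-F] ;
```

The character set form is usually the more compact of the two when several ranges are combined.
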
Here’s a summary of all lexer rule elements:
Syntax | Description |
---|---|
T | Match token T at the current input position. Tokens always begin with a capital letter. |
`'literal'` | Match that character or sequence of characters, e.g., `'while'` or `'='`. |
`[char set]` | Match one of the characters specified in the character set. Interpret `x-y` as the set of characters between `x` and `y`, inclusive. The following escape sequences are interpreted as single special characters: `\n`, `\r`, `\b`, `\t`, `\f`, `\uXXXX`, and `\u{XXXXXX}`. To get `]`, `\`, or `-` you must escape them with `\`. You can also include all characters matching Unicode properties (general category, boolean, or enumerated including scripts and blocks) with `\p{PropertyName}` or `\p{EnumProperty=Value}`. (You can invert the test with `\P{PropertyName}` or `\P{EnumProperty=Value}`.) For a list of valid Unicode property names, see Unicode Standard Annex #44. (ANTLR also supports short and long Unicode general category names and values like `\p{Lu}`, `\p{Z}`, `\p{Symbol}`, `\p{Blk=Latin_1_Sup}`, and `\p{Block=Latin_1_Supplement}`.) As a shortcut for `\p{Block=Latin_1_Supplement}`, you can refer to blocks using Unicode block names prefixed with `In` and with spaces changed to `_`. For example: `\p{InLatin_1_Supplement}`, `\p{InYijing_Hexagram_Symbols}`, and `\p{InAncient_Greek_Numbers}`. A few extra properties are supported. Property names are case-insensitive, and `_` and `-` are treated identically. Here are a few examples:<br>`WS : [ \n\u000D] -> skip ; // same as [ \n\r]`<br>`UNICODE_WS : [\p{White_Space}] -> skip; // match all Unicode whitespace`<br>`ID : [a-zA-Z] [a-zA-Z0-9]* ; // match usual identifier spec`<br>`UNICODE_ID : [\p{Alpha}\p{General_Category=Other_Letter}] [\p{Alnum}\p{General_Category=Other_Letter}]* ; // match full Unicode alphabetic ids`<br>`EMOJI : [\u{1F4A9}\u{1F926}] ; // note Unicode code points > U+FFFF`<br>`DASHBRACK : [\-\]]+ ; // match - or ] one or more times` |
`'x'..'y'` | Match any single character between `x` and `y`, inclusive, e.g., `'a'..'z'`, which is identical to `[a-z]`. |
`T` | Invoke lexer rule `T`; recursion is allowed in general, but not left recursion. `T` can be a regular token or fragment rule. Example:<br>`ID : LETTER (LETTER\|'0'..'9')* ;`<br>`fragment LETTER : [a-zA-Z\u0080-\u00FF_] ;` |
`.` | The dot is a single-character wildcard that matches any single character. Example:<br>`ESC : '\\' . ; // match any escaped \x character` |
`{«action»}` | Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a rule that has multiple alternatives, you can enclose the alternatives in parentheses and put the action afterwards:<br>`END : ('endif'\|'end') {System.out.println("found an end");} ;`<br>The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like `$x.y` as there is in parser actions. Only actions within the outermost token rule are executed. In other words, if `STRING` calls `ESC_CHAR` and `ESC_CHAR` has an action, that action is not executed when the lexer starts matching in `STRING`. |
{«p»}? | Evaluate semantic predicate «p». If «p» evaluates to false at runtime, the surrounding rule becomes “invisible” (nonviable). Expression «p» conforms to the target language syntax. While semantic predicates can appear anywhere within a lexer rule, it is most efficient to have them at the end of the rule. The one caveat is that semantic predicates must precede lexer actions. See Predicates in Lexer Rules. |
`~x` | Match any single character not in the set described by `x`. Set `x` can be a single character literal, a range, or a subrule set like `~('x'\|'y'\|'z')` or `~[xyz]`. Here is a rule that uses `~[\r\n]*` to match any run of characters other than carriage return and newline:<br>`COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip ;` |
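
To show how several of the elements in the table fit together, here is a minimal sketch of a lexer grammar that combines rule references, character sets, the range operator, the wildcard, and a negated set. The grammar name `ElementsSketch` and all rule names are hypothetical and not taken from any published grammar.

```
// Hypothetical lexer grammar; all names are illustrative only.
lexer grammar ElementsSketch;

ID      : LETTER (LETTER | DIGIT)* ;       // references to fragment rules
INT     : DIGIT+ ;
STRING  : '"' (ESC | ~["\\\r\n])* '"' ;    // negated set: anything but quote, backslash, CR, LF
COMMENT : '#' ~[\r\n]* -> skip ;           // negated set: everything up to the end of the line
WS      : [ \t\r\n]+ -> skip ;             // character set with escape sequences

fragment ESC    : '\\' . ;                 // wildcard: a backslash followed by any one character
fragment LETTER : [a-zA-Z_] ;              // character set with ranges
fragment DIGIT  : '0'..'9' ;               // range operator, same as [0-9]
```

Unlike the `COMMENT` rule in the table, this variant does not require a trailing newline and leaves it for `WS` to consume, so a comment on the last line of the input still matches.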