start on doc for Case insensitive lexing

This commit is contained in:
parrt 2017-12-05 15:15:46 -08:00
parent a5f47c8c55
commit 529903a7f6
2 changed files with 68 additions and 8 deletions

View File

@ -2,12 +2,70 @@
In some languages, keywords are case insensitive meaning that `BeGiN` means the same thing as `begin` or `BEGIN`. ANTLR has two mechanisms to support building grammars for such languages:
1. build lexical rules that match either upper or lower case.<br>
**Advantage**: no changes required to ANTLR. <br>
**Disadvantage**: might have a small efficiency cost and grammar is a more verbose and more of a hassle to write.
1. Build lexical rules that match either upper or lower case.
* **Advantage**: no changes required to ANTLR, makes it clear in the grammar that the language in this case insensitive.
* **Disadvantage**: might have a small efficiency cost and grammar is a more verbose and more of a hassle to write.
2. build lexical rules that match keywords in all uppercase and then parse with a custom [character stream](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/CharStream.java) that converts all characters to uppercase before sending them to the lexer (via the `LA()` method). Care must be taken not to convert all characters in the stream to uppercase because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase.<br>
**Advantage**: Could have a speed advantage depending on implementation, no change required to the grammar.<br>
**Disadvantage**: Requires that the case-insensitive stream and grammar are used in correctly in conjunction with each other, makes all characters appear as uppercase/lowercase to the lexer but some grammars are case sensitive outside of keywords.
2. Build lexical rules that match keywords in all uppercase and then parse with a custom [character stream](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/CharStream.java) that converts all characters to uppercase before sending them to the lexer (via the `LA()` method). Care must be taken not to convert all characters in the stream to uppercase because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase.
* **Advantage**: Could have a speed advantage depending on implementation, no change required to the grammar.
* **Disadvantage**: Requires that the case-insensitive stream and grammar are used in correctly in conjunction with each other, makes all characters appear as uppercase/lowercase to the lexer but some grammars are case sensitive outside of keywords, errors new case insensitive streams and language output targets (java, C#, C++, ...).
For the 4.7.1 release, we discussed both approaches in [detail](https://github.com/antlr/antlr4/pull/2046) and even possibly altering the ANTLR metalanguage to directly support case-insensitive lexing. We discussed including
For the 4.7.1 release, we discussed both approaches in [detail](https://github.com/antlr/antlr4/pull/2046) and even possibly altering the ANTLR metalanguage to directly support case-insensitive lexing. We discussed including the case insensitive streams into the runtime but not all would be immediately supported. I decided to simply make documentation that clearly states how to handle this and include the appropriate snippets that people can cut-and-paste into their grammars.
## Case-insensitive grammars
As a prime example of a grammar that specifically describes case insensitive keywords, see the
[SQLite grammar](https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4). To match a case insensitive keyword, there are rules such as
```
K_UPDATE : U P D A T E;
```
that will match `UpdaTE` and `upDATE` etc... as the `update` keyword. This rule makes use of some generically useful fragment rules that you can cut-and-paste into your grammars:
```
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
```
No special streams are required to use this mechanism for case insensitivity.
## Custom character streams approach
The other approach is to use lexical rules that match either all uppercase or all lowercase, such as:
```
K_UPDATE : 'UPDATE';
```
Then, when creating the character stream to parse from, we need a custom class that overrides methods used by the lexer. Below you will find custom character streams for a number of the targets that you can copy into your projects, but here is how to use the streams in Java as an example:
```java
CharStream s = CharStreams.fromPath(Paths.get('test.sql'));
CaseChangingCharStream upper = new CaseChangingCharStream(s, true);
Lexer lexer = new SomeSQLLexer(upper);
```

View File

@ -8,7 +8,7 @@ Notes:
<li>Copyright © 2012, The Pragmatic Bookshelf. Pragmatic Bookshelf grants a nonexclusive, irrevocable, royalty-free, worldwide license to reproduce, distribute, prepare derivative works, and otherwise use this contribution as part of the ANTLR project and associated documentation.</li>
<li>This text was copied with permission from the <a href=http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference>The Definitive ANTLR 4 Reference</a>, though it is being morphed over time as the tool changes.</li>
<li>Much of this text was copied with permission from the <a href=http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference>The Definitive ANTLR 4 Reference</a>, though it is being morphed over time as the tool changes.</li>
</ul>
Links in the documentation refer to various sections of the book but have been redirected to the general book page on the publisher's site. There are two excerpts on the publisher's website that might be useful to you without having to purchase the book: [Let's get Meta](http://media.pragprog.com/titles/tpantlr2/picture.pdf) and [Building a Translator with a Listener](http://media.pragprog.com/titles/tpantlr2/listener.pdf). You should also consider reading the following books (the vid describes the reference book):
@ -55,6 +55,8 @@ This documentation is a reference and summarizes grammar syntax and the key sema
* [Parsing binary streams](parsing-binary-files.md)
* [Case-Insensitive Lexing](case-insensitive-lexing.md)
* [Parser and lexer interpreters](interpreters.md)
* [Resources](resources.md)