Merge pull request #2146 from parrt/case-insensitivity-doc
Case insensitivity doc
commit a4a14213f9
@ -175,4 +175,6 @@ YYYY/MM/DD, github id, Full name, email
2017/11/05, ajaypanyala, Ajay Panyala, ajay.panyala@gmail.com
2017/11/24, zqlu.cn, Zhiqiang Lu, zqlu.cn@gmail.com
2017/11/28, niccroad, Nicolas Croad, nic.croad@gmail.com
2017/12/01, DavidMoraisFerreira, David Morais Ferreira, david.moraisferreira@gmail.com
2017/12/01, SebastianLng, Sebastian Lang, sebastian.lang@outlook.com
2017/12/03, oranoran, Oran Epelbaum, oran / epelbaum me
@ -0,0 +1,78 @@
# Case-Insensitive Lexing

In some languages, keywords are case insensitive, meaning that `BeGiN` means the same thing as `begin` or `BEGIN`. ANTLR has two mechanisms to support building grammars for such languages:

1. Build lexical rules that match either upper or lower case.
   * **Advantage**: no changes required to ANTLR, and the grammar makes it clear that the language is case insensitive.
   * **Disadvantage**: might have a small efficiency cost, and the grammar is more verbose and more of a hassle to write.

2. Build lexical rules that match keywords in all uppercase and then parse with a custom [character stream](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/CharStream.java) that converts all characters to uppercase before sending them to the lexer (via the `LA()` method). Care must be taken not to convert the text of the stream itself to uppercase, because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase.
   * **Advantage**: could have a speed advantage depending on implementation; no change required to the grammar.
   * **Disadvantage**: requires that the case-insensitive stream and the grammar are used correctly in conjunction with each other; makes all characters appear as uppercase/lowercase to the lexer, even though some grammars are case sensitive outside of keywords; requires new case-insensitive streams for each language output target (Java, C#, C++, ...).

For the 4.7.1 release, we discussed both approaches in [detail](https://github.com/antlr/antlr4/pull/2046) and even considered altering the ANTLR metalanguage to directly support case-insensitive lexing. We discussed including the case-insensitive streams in the runtime, but not all targets would be immediately supported. I decided to simply write documentation that clearly states how to handle this and include the appropriate snippets that people can cut-and-paste into their grammars.

## Case-insensitive grammars

As a prime example of a grammar that specifically describes case insensitive keywords, see the
[SQLite grammar](https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4). To match a case insensitive keyword, there are rules such as

```
K_UPDATE : U P D A T E;
```

that will match `UpdaTE`, `upDATE`, etc. as the `update` keyword. This rule makes use of some generically useful fragment rules that you can cut-and-paste into your grammars:

```
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
```

No special streams are required to use this mechanism for case insensitivity.
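
If typing the fragment rules and keyword rules by hand is tedious, a small helper can print them. The following is only a sketch in plain Java, not part of ANTLR: `FragmentRuleGenerator`, `fragmentRule`, `keywordRule`, and `allFragmentRules` are hypothetical names, and the `K_` prefix simply mirrors the naming used in the rule above. It assumes keywords consist only of ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of ANTLR): prints the case-insensitive
// fragment rules and a keyword rule built from them.
public class FragmentRuleGenerator {

    // Builds e.g. "fragment A : [aA];" for an uppercase ASCII letter.
    static String fragmentRule(char upper) {
        char lower = Character.toLowerCase(upper);
        return "fragment " + upper + " : [" + lower + upper + "];";
    }

    // Builds e.g. "K_UPDATE : U P D A T E;" from the keyword "update".
    // Assumption: the keyword contains only ASCII letters.
    static String keywordRule(String keyword) {
        String upper = keyword.toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder("K_" + upper + " :");
        for (char ch : upper.toCharArray()) {
            sb.append(' ').append(ch);
        }
        return sb.append(';').toString();
    }

    // Collects the 26 fragment rules for A through Z.
    static List<String> allFragmentRules() {
        List<String> rules = new ArrayList<>();
        for (char ch = 'A'; ch <= 'Z'; ch++) {
            rules.add(fragmentRule(ch));
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(keywordRule("update"));
        allFragmentRules().forEach(System.out::println);
    }
}
```

The output can be pasted into a lexer grammar; the keyword rule only works if the fragment rules for every letter it uses are present.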

## Custom character streams approach

The other approach is to use lexical rules that match either all uppercase or all lowercase, such as:

```
K_UPDATE : 'UPDATE';
```

Then, when creating the character stream to parse from, we need a custom class that overrides methods used by the lexer. Below you will find custom character streams for a number of the targets that you can copy into your projects, but here is how to use the streams in Java as an example:

```java
CharStream s = CharStreams.fromPath(Paths.get("test.sql"));
CaseChangingCharStream upper = new CaseChangingCharStream(s, true);
Lexer lexer = new SomeSQLLexer(upper);
```
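
To see why token text keeps its original case with this approach, here is a dependency-free sketch of the idea. It deliberately does not use the ANTLR runtime: `CaseFoldingSketch`, `TinyStream`, `StringStream`, `UpperCasingStream`, and `seenByLexer` are hypothetical stand-ins that model only a lookahead method and a text accessor, so treat it as an illustration rather than the actual stream implementation.

```java
public class CaseFoldingSketch {

    // Hypothetical, simplified stand-in for a character stream.
    interface TinyStream {
        int EOF = -1;
        int LA(int i);                       // look ahead i characters (1-based)
        String getText(int start, int stop); // original text, inclusive bounds
    }

    // Serves characters straight from a string; the position stays at 0 for brevity.
    static final class StringStream implements TinyStream {
        private final String data;

        StringStream(String data) { this.data = data; }

        @Override public int LA(int i) {
            int pos = i - 1;
            return (pos < 0 || pos >= data.length()) ? EOF : data.charAt(pos);
        }

        @Override public String getText(int start, int stop) {
            return data.substring(start, stop + 1);
        }
    }

    // Wrapper: changes only what is seen through LA(), nothing else.
    static final class UpperCasingStream implements TinyStream {
        private final TinyStream stream;

        UpperCasingStream(TinyStream stream) { this.stream = stream; }

        @Override public int LA(int i) {
            int c = stream.LA(i);
            return c <= 0 ? c : Character.toUpperCase(c);
        }

        @Override public String getText(int start, int stop) {
            return stream.getText(start, stop); // untouched original text
        }
    }

    // Collects the characters a lexer would see via LA(1) through LA(n).
    static String seenByLexer(TinyStream s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= n; i++) {
            sb.appendCodePoint(s.LA(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TinyStream s = new UpperCasingStream(new StringStream("BeGiN"));
        System.out.println(seenByLexer(s, 5)); // what the lexer matches against
        System.out.println(s.getText(0, 4));   // what the token text contains
    }
}
```

Because only the lookahead is converted, a rule written as `'BEGIN'` can match mixed-case input while string literals and comments retain the characters the user typed.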

Here are implementations of `CaseChangingCharStream` in various target languages:

* [Java](https://github.com/parrt/antlr4/blob/case-insensitivity-doc/doc/resources/CaseChangingCharStream.java)
* [JavaScript](https://github.com/parrt/antlr4/blob/case-insensitivity-doc/doc/resources/CaseInsensitiveInputStream.js)
* [Go](https://github.com/parrt/antlr4/blob/case-insensitivity-doc/doc/resources/case_changing_stream.go)
* [C#](https://github.com/parrt/antlr4/blob/case-insensitivity-doc/doc/resources/CaseChangingCharStream.cs)
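
When adapting one of these streams, it may be worth checking whether the lookahead value is a full Unicode code point or a 16-bit code unit, since case conversion differs for characters outside the Basic Multilingual Plane. The Java sketch below is illustrative only (`CodePointCaseDemo` and its methods are hypothetical names); it compares `Character.toUpperCase(int)`, which accepts a code point, with converting after a narrowing cast to `char`.

```java
public class CodePointCaseDemo {

    // Case-converts a full code point via Character.toUpperCase(int).
    static int upperByCodePoint(int c) {
        return Character.toUpperCase(c);
    }

    // Narrowing to char first discards the high bits of supplementary code points.
    static int upperAfterCharCast(int c) {
        return Character.toUpperCase((char) c);
    }

    public static void main(String[] args) {
        int small = 0x10428; // DESERET SMALL LETTER LONG I, outside the BMP
        System.out.printf("by code point: U+%04X%n", upperByCodePoint(small));
        System.out.printf("after cast:    U+%04X%n", upperAfterCharCast(small));
        System.out.printf("ASCII 'u':     %c%n", (char) upperByCodePoint('u'));
    }
}
```

For keywords made of ASCII letters both variants agree, so this only matters for grammars whose case-insensitive tokens can contain supplementary characters.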
@ -8,7 +8,7 @@ Notes:
<li>Copyright © 2012, The Pragmatic Bookshelf. Pragmatic Bookshelf grants a nonexclusive, irrevocable, royalty-free, worldwide license to reproduce, distribute, prepare derivative works, and otherwise use this contribution as part of the ANTLR project and associated documentation.</li>

<li>This text was copied with permission from the <a href=http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference>The Definitive ANTLR 4 Reference</a>, though it is being morphed over time as the tool changes.</li>
<li>Much of this text was copied with permission from the <a href=http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference>The Definitive ANTLR 4 Reference</a>, though it is being morphed over time as the tool changes.</li>
</ul>

Links in the documentation refer to various sections of the book but have been redirected to the general book page on the publisher's site. There are two excerpts on the publisher's website that might be useful to you without having to purchase the book: [Let's get Meta](http://media.pragprog.com/titles/tpantlr2/picture.pdf) and [Building a Translator with a Listener](http://media.pragprog.com/titles/tpantlr2/listener.pdf). You should also consider reading the following books (the vid describes the reference book):

@ -55,6 +55,8 @@ This documentation is a reference and summarizes grammar syntax and the key sema
* [Parsing binary streams](parsing-binary-files.md)

* [Case-Insensitive Lexing](case-insensitive-lexing.md)

* [Parser and lexer interpreters](interpreters.md)

* [Resources](resources.md)

@ -0,0 +1,105 @@
/* Copyright (c) 2012-2017 The ANTLR Project. All rights reserved.
 * Use of this file is governed by the BSD 3-clause license that
 * can be found in the LICENSE.txt file in the project root.
 */
using System;
using Antlr4.Runtime.Misc;

namespace Antlr4.Runtime
{
    /// <summary>
    /// This class supports case-insensitive lexing by wrapping an existing
    /// <see cref="ICharStream"/> and forcing the lexer to see either upper or
    /// lowercase characters. Grammar literals should then be either upper or
    /// lower case such as 'BEGIN' or 'begin'. The text of the character
    /// stream is unaffected. Example: input 'BeGiN' would match lexer rule
    /// 'BEGIN' if constructor parameter upper=true but getText() would return
    /// 'BeGiN'.
    /// </summary>
    public class CaseChangingCharStream : ICharStream
    {
        private ICharStream stream;
        private bool upper;

        /// <summary>
        /// Constructs a new CaseChangingCharStream wrapping the given <paramref name="stream"/> forcing
        /// all characters to upper case or lower case.
        /// </summary>
        /// <param name="stream">The stream to wrap.</param>
        /// <param name="upper">If true force each symbol to upper case, otherwise force to lower.</param>
        public CaseChangingCharStream(ICharStream stream, bool upper)
        {
            this.stream = stream;
            this.upper = upper;
        }

        public int Index
        {
            get
            {
                return stream.Index;
            }
        }

        public int Size
        {
            get
            {
                return stream.Size;
            }
        }

        public string SourceName
        {
            get
            {
                return stream.SourceName;
            }
        }

        public void Consume()
        {
            stream.Consume();
        }

        [return: NotNull]
        public string GetText(Interval interval)
        {
            return stream.GetText(interval);
        }

        public int LA(int i)
        {
            int c = stream.LA(i);

            if (c <= 0)
            {
                return c;
            }

            char o = (char)c;

            if (upper)
            {
                return (int)char.ToUpperInvariant(o);
            }

            return (int)char.ToLowerInvariant(o);
        }

        public int Mark()
        {
            return stream.Mark();
        }

        public void Release(int marker)
        {
            stream.Release(marker);
        }

        public void Seek(int index)
        {
            stream.Seek(index);
        }
    }
}
@ -0,0 +1,81 @@
package org.antlr.v4.runtime;

import org.antlr.v4.runtime.misc.Interval;

/**
 * This class supports case-insensitive lexing by wrapping an existing
 * {@link CharStream} and forcing the lexer to see either upper or
 * lowercase characters. Grammar literals should then be either upper or
 * lower case such as 'BEGIN' or 'begin'. The text of the character
 * stream is unaffected. Example: input 'BeGiN' would match lexer rule
 * 'BEGIN' if constructor parameter upper=true but getText() would return
 * 'BeGiN'.
 */
public class CaseChangingCharStream implements CharStream {

    final CharStream stream;
    final boolean upper;

    /**
     * Constructs a new CaseChangingCharStream wrapping the given {@link CharStream} forcing
     * all characters to upper case or lower case.
     * @param stream The stream to wrap.
     * @param upper If true force each symbol to upper case, otherwise force to lower.
     */
    public CaseChangingCharStream(CharStream stream, boolean upper) {
        this.stream = stream;
        this.upper = upper;
    }

    @Override
    public String getText(Interval interval) {
        return stream.getText(interval);
    }

    @Override
    public void consume() {
        stream.consume();
    }

    @Override
    public int LA(int i) {
        int c = stream.LA(i);
        if (c <= 0) {
            return c;
        }
        if (upper) {
            return Character.toUpperCase(c);
        }
        return Character.toLowerCase(c);
    }

    @Override
    public int mark() {
        return stream.mark();
    }

    @Override
    public void release(int marker) {
        stream.release(marker);
    }

    @Override
    public int index() {
        return stream.index();
    }

    @Override
    public void seek(int index) {
        stream.seek(index);
    }

    @Override
    public int size() {
        return stream.size();
    }

    @Override
    public String getSourceName() {
        return stream.getSourceName();
    }
}
@ -0,0 +1,54 @@
//
/* Copyright (c) 2012-2017 The ANTLR Project. All rights reserved.
 * Use of this file is governed by the BSD 3-clause license that
 * can be found in the LICENSE.txt file in the project root.
 */
//

function CaseInsensitiveInputStream(stream, upper) {
    this._stream = stream;
    // The case-conversion methods live on String.prototype, not on String itself.
    this._case = upper ? String.prototype.toUpperCase : String.prototype.toLowerCase;
    return this;
}

CaseInsensitiveInputStream.prototype.LA = function (offset) {
    var c = this._stream.LA(offset);
    if (c <= 0) {
        return c;
    }
    // Convert the code point to a string, change its case, and return a code point again.
    return this._case.call(String.fromCodePoint(c)).codePointAt(0);
};

CaseInsensitiveInputStream.prototype.reset = function() {
    return this._stream.reset();
};

CaseInsensitiveInputStream.prototype.consume = function() {
    return this._stream.consume();
};

CaseInsensitiveInputStream.prototype.LT = function(offset) {
    return this._stream.LT(offset);
};

CaseInsensitiveInputStream.prototype.mark = function() {
    return this._stream.mark();
};

CaseInsensitiveInputStream.prototype.release = function(marker) {
    return this._stream.release(marker);
};

CaseInsensitiveInputStream.prototype.seek = function(_index) {
    // Delegate to the wrapped stream's seek(), not getText().
    return this._stream.seek(_index);
};

CaseInsensitiveInputStream.prototype.getText = function(start, stop) {
    return this._stream.getText(start, stop);
};

CaseInsensitiveInputStream.prototype.toString = function() {
    return this._stream.toString();
};

exports.CaseInsensitiveInputStream = CaseInsensitiveInputStream;
@ -0,0 +1,37 @@
package antlr

import (
	"unicode"
)

// CaseChangingStream wraps an existing CharStream, but upper cases, or
// lower cases the input before it is tokenized.
type CaseChangingStream struct {
	CharStream

	upper bool
}

// NewCaseChangingStream returns a new CaseChangingStream that forces
// all tokens read from the underlying stream to be either upper case
// or lower case based on the upper argument.
func NewCaseChangingStream(in CharStream, upper bool) *CaseChangingStream {
	return &CaseChangingStream{
		in, upper,
	}
}

// LA gets the value of the symbol at offset from the current position
// from the underlying CharStream and converts it to either upper case
// or lower case.
func (is *CaseChangingStream) LA(offset int) int {
	in := is.CharStream.LA(offset)
	if in < 0 {
		// Such as antlr.TokenEOF which is -1
		return in
	}
	if is.upper {
		return int(unicode.ToUpper(rune(in)))
	}
	return int(unicode.ToLower(rune(in)))
}