fix antlr notation

parrt 2017-03-20 14:48:00 -07:00
parent 6de2f3f335
commit e0ea43ccb8
1 changed files with 6 additions and 7 deletions


@@ -16,23 +16,22 @@ a lexer rule for a single Cyrillic character by creating a range from
`U+0400` to `U+04FF`:
```ANTLR
CYRILLIC = ('\u0400'..'\u04FF'); // or [\u0400-\u04FF] without quotes
CYRILLIC : '\u0400'..'\u04FF' ; // or [\u0400-\u04FF] without quotes
```
Unicode literals larger than U+FFFF must use the extended `\u{12345}` syntax.
For example, to create a lexer rule for a selection of smiley faces
Unicode literals larger than U+FFFF must use the extended `\u{12345}` syntax. For example, to create a lexer rule for a selection of smiley faces
from the [Emoticons Unicode block](http://www.unicode.org/charts/PDF/U1F600.pdf):
```ANTLR
EMOTICONS = ('\u{1F600}' | '\u{1F602}' | '\u{1F615}'); // or [\u{1F600}\u{1F602}\u{1F615}]
EMOTICONS : ('\u{1F600}' | '\u{1F602}' | '\u{1F615}') ; // or [\u{1F600}\u{1F602}\u{1F615}]
```
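As a rough illustration of why code points above `U+FFFF` need their own escape form (a four-hex-digit `\uXXXX` escape cannot express them), the Java sketch below reports how many UTF-16 code units each code point from the examples above occupies. The names `CodePointDemo`, `utf16Units`, and `needsExtendedEscape` are hypothetical and not part of ANTLR; only `java.lang.Character` is used.

```java
public class CodePointDemo {
    // Number of UTF-16 code units needed to encode one Unicode code point.
    public static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // True when the code point lies above U+FFFF, i.e. outside the range
    // that a four-hex-digit \uXXXX escape can express.
    public static boolean needsExtendedEscape(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        // Code points taken from the CYRILLIC and EMOTICONS examples.
        int[] samples = {0x0400, 0x04FF, 0x1F600, 0x1F602, 0x1F615};
        for (int cp : samples) {
            System.out.printf("U+%04X units=%d extended=%b%n",
                    cp, utf16Units(cp), needsExtendedEscape(cp));
        }
    }
}
```

The Cyrillic range endpoints fit in a single code unit, while each emoticon needs two, which is the case the `\u{...}` form covers.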
Finally, lexer char sets can include Unicode properties. Each Unicode code point has at least one property that describes the general group to which it belongs (e.g. alpha, number, punctuation). Other properties can be the language script, special binary properties, or Unicode code blocks. Because a property specifies a group of code points rather than a single one, properties are allowed only in lexer char sets.
```ANTLR
EMOJI = [\p{Emoji}];
JAPANESE = [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}];
NOT_CYRILLIC = [\P{Script=Cyrillic}];
EMOJI : [\p{Emoji}] ;
JAPANESE : [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}] ;
NOT_CYRILLIC : [\P{Script=Cyrillic}] ;
```
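To get a feel for which characters the script-based sets above would cover, the following Java sketch approximates `JAPANESE` and `NOT_CYRILLIC` with `Character.UnicodeScript`. This is only an approximation under stated assumptions: the JDK's Unicode tables may be a different Unicode version from the data ANTLR uses, and `ScriptDemo`, `isJapanese`, and `isNotCyrillic` are hypothetical names, not ANTLR API.

```java
public class ScriptDemo {
    // Rough analogue of [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}].
    public static boolean isJapanese(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HAN;
    }

    // Rough analogue of [\P{Script=Cyrillic}]: any code point whose script is not Cyrillic.
    public static boolean isNotCyrillic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) != Character.UnicodeScript.CYRILLIC;
    }

    public static void main(String[] args) {
        // U+3042 HIRAGANA LETTER A, U+30AB KATAKANA LETTER KA,
        // U+6F22 (a Han ideograph), U+0416 CYRILLIC CAPITAL LETTER ZHE, U+0041 LATIN CAPITAL LETTER A.
        int[] samples = {0x3042, 0x30AB, 0x6F22, 0x0416, 0x0041};
        for (int cp : samples) {
            System.out.printf("U+%04X japanese=%b notCyrillic=%b%n",
                    cp, isJapanese(cp), isNotCyrillic(cp));
        }
    }
}
```

Note that the uppercase `\P{...}` form inverts the property, which is why `isNotCyrillic` is written as a negated comparison.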
See [lexer-rules.md](lexer-rules.md#lexer-rule-elements) for more detail on Unicode