Update docs for new Unicode literal and property escapes

2017-03-01 15:04:30 -08:00 · 2017-03-01 15:04:30 -08:00 · c209f2a51e
parent d9ae13fc1a
commit c209f2a51e
3 changed files with 27 additions and 11 deletions
--- a/doc/cpp-target.md
+++ b/doc/cpp-target.md
@ -106,16 +106,12 @@ For gcc and clang it is possible to use the `-fvisibility=hidden` setting to hid
 Since C++ has no built-in memory management we need to take extra care. For that we rely mostly on smart pointers, which however might cause time penalties or memory side effects (like cyclic references) if not used with care. Currently however the memory household looks very stable. Generally, when you see a raw pointer in code consider this as being managed elsewehere. You should never try to manage such a pointer (delete, assign to smart pointer etc.).

 ### Unicode Support
-Encoding is mostly an input issue, i.e. when the lexer converts text input into lexer tokens. The parser is completely encoding unaware. However, lexer input in the grammar is defined by character ranges with either a single member (e.g. 'a' or [a] or [abc]), an explicit range (e.g. 'a'..'z' or [a-z]), the full Unicode range (for a wildcard) and the full Unicode range minus a sub range (for negated ranges, e.g. ~[a]). The explicit ranges (including single member ranges) are encoded in the serialized ATN by 16bit numbers, hence cannot reach beyond 0xFFFF (the Unicode BMP), while the implicit ranges can include any value (and hence support the full Unicode set, up to 0x10FFFF).
+Encoding is mostly an input issue, i.e. when the lexer converts text input into lexer tokens. The parser is completely encoding unaware.

-> An interesting side note here is that the Java target fully supports Unicode as well, despite the inherent limitations from the serialized ATN. That's possible because the Java String class represents characters beyond the BMP as surrogate pairs (two 16bit values) and even reads them as 2 separate input characters. To make this work a character range for an identifier in a grammar must include the surrogate pairs area (for a Java parser).
-
-The C++ target however always expects UTF-8 input (either in a string or via a wide stream) which is then converted to UTF-32 (a char32_t array) and fed to the lexer. ANTLR, when parsing your grammar, limits character ranges explicitly to the BMP currently. So, in order to allow specifying the full Unicode set the C++ target uses a little trick: whenever an explicit character range includes the (unused) codepoint 0xFFFF in a grammar it is silently extended to the full Unicode range. It's clear that this is an all-or-nothing solution. You cannot define a subset of Unicode codepoints > 0xFFFF that way. This can only be solved if ANTLR supports larger character intervals.
-
-The differences in handling characters beyond the BMP leads to a difference between Java and C++ lexers: the character offsets may not concur. This is because Java reads two 16bit values per Unicode char (if that falls into the surrogate area) while a C++ parser only reads one 32bit value. That usually doesn't have practical consequences, but might confuse people when comparing token positions.
+The C++ target always expects UTF-8 input (either in a string or stream) which is then converted to UTF-32 (a char32_t array) and fed to the lexer.

 ### Named Actions
-In order to help customizing the generated files there are a number of additional socalled **named actions**. These actions are tight to specific areas in the generated code and allow to add custom (target specific) code. All targets support these actions 
+In order to help customizing the generated files there are a number of additional socalled **named actions**. These actions are tight to specific areas in the generated code and allow to add custom (target specific) code. All targets support these actions

 * @parser::header
 * @parser::members
--- a/doc/lexer-rules.md
+++ b/doc/lexer-rules.md
@ -58,13 +58,29 @@ Match that character or sequence of characters. E.g., ’while’ or ’=’.</t

 <tr>
 <td>[char set]</td><td>
-Match one of the characters specified in the character set. Interpret x-y as set of characters between range x and y, inclusively. The following escaped characters are interpreted as single special characters: \n, \r, \b, \t, and \f. To get ], \, or - you must escape them with \. You can also use Unicode character specifications: \uXXXX. Here are a few examples:
+<p>Match one of the characters specified in the character set. Interpret <tt>x-y</tt> as the set of characters between range <tt>x</tt> and <tt>y</tt>, inclusively. The following escaped characters are interpreted as single special characters: <tt>\n</tt>, <tt>\r</tt>, <tt>\b</tt>, <tt>\t</tt>, <tt>\f</tt>, <tt>\uXXXX</tt>, and <tt>\u{XXXXXX}</tt>. To get <tt>]</tt>, <tt>\</tt>, or <tt>-</tt> you must escape them with <tt>\</tt>.</p>
+
+<p>You can also include all characters matching Unicode properties (general category, boolean, script, or block) with <tt>\p{PropertyName}</tt>. (You can invert the test with <tt>\P{PropertyName}</tt>).</p>
+
+<p>For a list of valid Unicode property names, see <a href="http://unicode.org/reports/tr44/#Properties">Unicode Standard Annex #44</a>. (ANTLR also supports <a href="http://unicode.org/reports/tr44/#General_Category_Values">short and long Unicode general category names</a> like <tt>\p{Lu}</tt>, <tt>\p{Z}</tt>, and <tt>\p{Symbol}</tt>.)</p>
+
+<p>Property names include <a href="http://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt">Unicode block names</a> prefixed with <tt>In</tt> (they overlap with script names) and with spaces changed to <tt>_</tt>. For example: <tt>\p{InLatin_1_Supplement}</tt>, <tt>\p{InYijing_Hexagram_Symbols}</tt>, and <tt>\p{InAncient_Greek_Numbers}</tt>.</p>
+
+<p>Property names are <b>case-insensitive</b>, and <tt>_</tt> and <tt>-</tt> are treated identically</p>
+
+<p>Here are a few examples:</p>

 <pre>
 WS : [ \n\u000D] -> skip ; // same as [ \n\r]
- 	
+
+UNICODE_WS : [\p{White_Space}] -> skip; // match all Unicode whitespace
+
 ID : [a-zA-Z] [a-zA-Z0-9]* ; // match usual identifier spec
- 	
+
+UNICODE_ID : [\p{Alpha}] [\p{Alnum}]* ; // match full Unicode alphabetic ids
+
+EMOJI : [\u{1F4A9}\u{1F926}] ; // note Unicode code points > U+FFFF
+
 DASHBRACK : [\-\]]+ ; // match - or ] one or more times
 </pre>
 </td>
--- a/doc/lexicon.md
+++ b/doc/lexicon.md
@ -81,7 +81,11 @@ These more or less correspond to `isJavaIdentifierPart` and `isJavaIdentifierSta

 ANTLR does not distinguish between character and string literals as most languages do. All literal strings one or more characters in length are enclosed in single quotes such as `’;’`, `’if’`, `’>=’`, and `’\’'` (refers to the one-character string containing the single quote character). Literals never contain regular expressions.

-Literals can contain Unicode escape sequences of the form `\uXXXX`, where XXXX is the hexadecimal Unicode character value. For example, `’\u00E8’` is the French letter with a grave accent: `’è’`. ANTLR also understands the usual special escape sequences: `’\n’` (newline), `’\r’` (carriage return), `’\t’` (tab), `’\b’` (backspace), and `’\f’` (form feed). You can use Unicode characters directly within literals or use the Unicode escape sequences:
+Literals can contain Unicode escape sequences of the form `’\uXXXX’` (for Unicode code points up to `’U+FFFF’`) or `’\u{XXXXXX}’` (for all Unicode code points), where `’XXXX’` is the hexadecimal Unicode code point value.
+
+For example, `’\u00E8’` is the French letter with a grave accent: `’è’`, and `’\u{1F4A9}’` is the famous emoji: `’💩’`.
+
+ANTLR also understands the usual special escape sequences: `’\n’` (newline), `’\r’` (carriage return), `’\t’` (tab), `’\b’` (backspace), and `’\f’` (form feed). You can use Unicode code points directly within literals or use the Unicode escape sequences:

 ```
 grammar Foreign;