Added some notes about Unicode to the C++ target doc.

2016-06-04 16:46:09 +02:00 · 2016-06-04 16:46:09 +02:00 · d6339f5cf4
parent 80ced92d55
commit d6339f5cf4
1 changed files with 9 additions and 0 deletions
--- a/doc/cpp-target.md
+++ b/doc/cpp-target.md
@ -77,6 +77,15 @@ This example assumes your grammar contains a parser rule named `key` for which t
 ### Memory Management
 Caused by the nature of C++ there are a couple of things that only the C++ ANTLR target has. Since C++ has no built-in memory management we need to take extra care for that. For that we rely mostly on smart pointers, which however might cause time penalties or memory side effects (like cyclic references) if not used with care. Currently however the memory household looks very stable.

+### Unicode Support
+Encoding is mostly an input issue, i.e. when the lexer converts text input into lexer tokens. The parser is completely encoding unaware. However, lexer input in in the grammar is defined by character ranges with either a single member (e.g. 'a' or [a]), an explicit range (e.g. 'a'..'z' or [a-z]), the full Unicode range (for a wildcard) and the full Unicode range minus a sub range (for negated ranges, e.g. ~[a]). The explicit ranges are encoded in the serialized ATN by 16bit numbers, hence cannot reach beyond 0xFFFF (the Unicode BMP), while the implicit ranges can include any value (and hence support the full Unicode set, up to 0x10FFFF).
+
+> An interesting side note here is that the Java target fully supports Unicode as well, despite the inherent limitations from the serialized ATN. That's possible because the Java String class represents characters beyond the BMP as surrogate pairs (two 16bit values) and even reads them as 2 characters. To make this work a character range for an identifier in a grammar must include the surrogate pairs area (for a Java parser).
+
+The C++ target however always expects UTF-8 input (either in a string or via a wide stream) which is then converted to a char32_t array and fed to the lexer. ANTLR, when parsing your grammar, limits character ranges explicitly to the BMP currently. So, in order to allow specifying the full Unicode set the C++ target uses a little trick: whenever an explicit character range includes the (unused) codepoint 0xFFFF in a grammar it is silently extended to the full Unicode range. It's clear that this is an all-or-nothing solution. You cannot define a subset of Unicode codepoints > 0xFFFF that way. This can only be solved if ANTLR supports larger character intervals.
+
+The differences in handling characters beyond the BMP leads to a difference between Java and C++ parsers: the character offsets may not concur. This is because Java reads two 16bit values per Unicode char (if that falls into the surrogate area) while a C++ parser only reads one 32bit value. That usually doesn't have practical consequences, but might confuse people when comparing token positions.
+
 ### Named Actions
 In order to help customizing the generated files there are a number of additional socalled **named actions**. These actions are tight to specific areas in the generated code and allow to add custom (target specific) code. All targets support these actions