The C++ target supports all platforms that can either run MS Visual Studio 2013 (or newer), XCode 7 (or newer) or CMake (C++11 required). All build tools can either create static or dynamic libraries, both as 64bit or 32bit arch. Additionally, XCode can create an iOS library.
## How to create a C++ lexer or parser?
This is pretty much the same as creating a Java lexer or parser, except you need to specify the language target, for example:
```
$ antlr4 -Dlanguage=Cpp MyGrammar.g4
```
You will see that there are a whole bunch of files generated by this call. If visitor or listener are not suppressed (which is the default) you'll get:
Once you've generated the lexer and/or parser code, you need to download or build the runtime. Prebuilt C++ runtime binaries for Windows (Visual Studio 2013/2015), OSX/macOS and iOS are available on the ANTLR web site:
Use CMake to build a Linux library (works also on OSX, however not for the iOS library).
Instead of downloading a prebuilt binary you can also easily build your own library on OSX or Windows. Just use the provided projects for XCode or Visual Studio and build it. Should work out of the box without any additional dependency.
## How do I run the generated lexer and/or parser?
Putting it all together to get a working parser is really easy. Look in the [runtime/Cpp/demo](../runtime/Cpp/demo) folder for a simple example. The [README](../runtime/Cpp/demo/README.md) there describes shortly how to build and run the demo on OSX, Windows or Linux.
The generation step above created a listener and base listener class for you. The listener class is an abstract interface, which declares enter and exit methods for each of your parser rules. The base listener implements all those abstract methods with an empty body, so you don't have to do it yourself if you just want to implement a single function. Hence use this base listener as the base class for your custom listener:
This example assumes your grammar contains a parser rule named `key` for which the enterKey function was generated. The `Ref<>` template is an alias for `std::shared_ptr<>` to simplify the runtime source code which often makes use of smart pointers.
The code generation (by running the ANTLR4 jar) allows to specify 2 values you might find useful for better integration of the generated files into your application (both are optional):
* A **namespace**: use the **`-package`** parameter to specify the namespace you want.
* An **export macro**: especially in VC++ extra work is required to export your classes from a DLL. This is usually accomplished by a macro that has different values depending on whether you are creating the DLL or import it. The ANTLR4 runtime itself also uses one for its classes:
```c++
#ifdef ANTLR4CPP_EXPORTS
#define ANTLR4CPP_PUBLIC __declspec(dllexport)
#else
#ifdef ANTLR4CPP_STATIC
#define ANTLR4CPP_PUBLIC
#else
#define ANTLR4CPP_PUBLIC __declspec(dllimport)
#endif
#endif
```
Just like the `ANTLR4CPP_PUBLIC` macro here you can specify your own one for the generated classes using the **`-export-macro`** parameter.
In order to create a static lib in Visual Studio define the `ANTLR4CPP_STATIC` macro in addition to the project settings that must be set for a static library (if you compile the runtime yourself).
For gcc and clang it is possible to use the `-fvisibility=hidden` setting to hide all symbols except those that are made default-visible (which has been defined for all public classes in the runtime).
Since C++ has no built-in memory management we need to take extra care. For that we rely mostly on smart pointers, which however might cause time penalties or memory side effects (like cyclic references) if not used with care. Currently however the memory household looks very stable. Generally, when you see a raw pointer in code consider this as being managed elsewehere. You should never try to manage such a pointer (delete, assign to smart pointer etc.).
Encoding is mostly an input issue, i.e. when the lexer converts text input into lexer tokens. The parser is completely encoding unaware. However, lexer input in the grammar is defined by character ranges with either a single member (e.g. 'a' or [a] or [abc]), an explicit range (e.g. 'a'..'z' or [a-z]), the full Unicode range (for a wildcard) and the full Unicode range minus a sub range (for negated ranges, e.g. ~[a]). The explicit ranges (including single member ranges) are encoded in the serialized ATN by 16bit numbers, hence cannot reach beyond 0xFFFF (the Unicode BMP), while the implicit ranges can include any value (and hence support the full Unicode set, up to 0x10FFFF).
> An interesting side note here is that the Java target fully supports Unicode as well, despite the inherent limitations from the serialized ATN. That's possible because the Java String class represents characters beyond the BMP as surrogate pairs (two 16bit values) and even reads them as 2 separate input characters. To make this work a character range for an identifier in a grammar must include the surrogate pairs area (for a Java parser).
The C++ target however always expects UTF-8 input (either in a string or via a wide stream) which is then converted to UTF-32 (a char32_t array) and fed to the lexer. ANTLR, when parsing your grammar, limits character ranges explicitly to the BMP currently. So, in order to allow specifying the full Unicode set the C++ target uses a little trick: whenever an explicit character range includes the (unused) codepoint 0xFFFF in a grammar it is silently extended to the full Unicode range. It's clear that this is an all-or-nothing solution. You cannot define a subset of Unicode codepoints > 0xFFFF that way. This can only be solved if ANTLR supports larger character intervals.
The differences in handling characters beyond the BMP leads to a difference between Java and C++ lexers: the character offsets may not concur. This is because Java reads two 16bit values per Unicode char (if that falls into the surrogate area) while a C++ parser only reads one 32bit value. That usually doesn't have practical consequences, but might confuse people when comparing token positions.
In order to help customizing the generated files there are a number of additional socalled **named actions**. These actions are tight to specific areas in the generated code and allow to add custom (target specific) code. All targets support these actions
*@parser::header
*@parser::members
*@lexer::header
*@lexer::members
(and their scopeless alternatives `@header` and `@members`) where header doesn't mean a C/C++ header file, but the top of a code file. The content of the header action appears in all generated files at the first line. So it's good for things like license/copyright information.
The content of a *members* action is placed in the public section of lexer or parser class declarations. Hence it can be used for public variables or predicate functions used in a grammar predicate. Since all targets support *header* + *members* they are the best place for stuff that should be available also in generated files for other languages.
In addition to that the C++ target supports many more such named actions. Unfortunately, it's not possible to define new scopes (e.g. *listener* in addition to *parser*) so they had to be defined as part of the existing scopes (*lexer* or *parser*). The grammar in the demo application contains all of the named actions as well for reference. Here's the list:
* **@lexer::preinclude** - Placed right before the first #include (e.g. good for headers that must appear first, for system headers etc.). Appears in both lexer h and cpp file.
* **@lexer::postinclude** - Placed right after the last #include, but before any class code (e.g. for additional namespaces). Appears in both lexer h and cpp file.
* **@lexer::context** - Placed right before the lexer class declaration. Use for e.g. additional types, aliases, forward declarations and the like. Appears in the lexer h file.
* **@lexer::declarations** - Placed in the private section of the lexer declaration (generated sections in all classes strictly follow the pattern: public, protected, privat, from top to bottom). Use this for private vars etc.
* **@lexer::definitions** - Placed before other implementations in the cpp file (but after *@postinclude*). Use this to implement e.g. private types.
For the parser there are the same actions as shown above for the lexer. In addition to that there are even more actions for visitor and listener classes:
and should be self explanatory now. Note: there is no *context* action for listeners or visitors, simply because they would be even less used than the other actions and there are so many already.