C++

The C++ target supports all platforms that can run MS Visual Studio 2013 (or newer), Xcode 7 (or newer), or CMake (C++11 required). All build tools can create static or dynamic libraries, for both 64-bit and 32-bit architectures. Additionally, Xcode can create an iOS library.

How to create a C++ lexer or parser?

This is pretty much the same as creating a Java lexer or parser, except you need to specify the language target, for example:

$ antlr4 -Dlanguage=Cpp MyGrammar.g4

This call generates a number of files. With both listener and visitor generation enabled (listeners are generated by default, visitors need the -visitor option), you'll get:

  • MyGrammarLexer.h + MyGrammarLexer.cpp
  • MyGrammarParser.h + MyGrammarParser.cpp
  • MyGrammarVisitor.h + MyGrammarVisitor.cpp
  • MyGrammarBaseVisitor.h + MyGrammarBaseVisitor.cpp
  • MyGrammarListener.h + MyGrammarListener.cpp
  • MyGrammarBaseListener.h + MyGrammarBaseListener.cpp

Where can I get the runtime?

Once you've generated the lexer and/or parser code, you need to download or build the runtime. Prebuilt C++ runtime binaries for Windows (VS 2013 runtime), OSX and iOS are available on the ANTLR web site.

Use CMake to build a Linux library. This also works on OSX if you don't have Xcode, since the runtime is pure C++ code. Building your own library on OSX or Windows is also straightforward: open the VS or Xcode project, select the target and architecture, and build it. This should work out of the box without any additional dependencies.

How do I run the generated lexer and/or parser?

Putting it all together to get a working parser is straightforward. Look in the runtime/Cpp/demo folder for a simple example. The README there briefly describes how to build and run the demo on OSX, Windows or Linux.

How do I create and run a custom listener?

The generation step above created a listener and a base listener class for you. The listener class is an abstract interface that declares enter and exit methods for each of your parser rules. The base listener implements all those abstract methods with an empty body, so you don't have to do it yourself if you just want to implement a single function. Hence, use this base listener as the base class for your custom listener:

#include <fstream>
#include <iostream>

#include "antlr4-runtime.h"
#include "MyGrammarLexer.h"
#include "MyGrammarParser.h"
#include "MyGrammarBaseListener.h"

using namespace org::antlr::v4::runtime;

class TreeShapeListener : public MyGrammarBaseListener {
public:
  void enterKey(Ref<ParserRuleContext> ctx) {
    // Do something when entering the key rule.
  }
};


int main(int argc, const char* argv[]) {
  // Read the input from the file named by the first command line argument.
  std::wifstream stream;
  stream.open(argv[1]);
  ANTLRInputStream input(stream);
  MyGrammarLexer lexer(&input);
  CommonTokenStream tokens(&lexer);
  MyGrammarParser parser(&tokens);

  Ref<tree::ParseTree> tree = parser.key();
  Ref<TreeShapeListener> listener(new TreeShapeListener());
  tree::ParseTreeWalker::DEFAULT->walk(listener, tree);

  return 0;
}

This example assumes your grammar contains a parser rule named key for which the enterKey function was generated. The Ref<> template is an alias for std::shared_ptr<> to simplify the runtime source code which often makes use of smart pointers.
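The relationship between the listener interface, the base listener and a custom listener can be sketched without the runtime. The following is only an illustration of the pattern: KeyContext, KeyListener, KeyBaseListener, CollectingListener, walk and collectKeys are hypothetical names, not generated or runtime classes, and the real generated signatures differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated rule context; not a runtime type.
struct KeyContext {
  std::string text;
};

// Plays the role of the generated listener: a pure abstract interface
// with one enter/exit pair per parser rule.
class KeyListener {
public:
  virtual ~KeyListener() = default;
  virtual void enterKey(const KeyContext &ctx) = 0;
  virtual void exitKey(const KeyContext &ctx) = 0;
};

// Plays the role of the generated base listener: every method has an
// empty body, so subclasses override only what they need.
class KeyBaseListener : public KeyListener {
public:
  void enterKey(const KeyContext &) override {}
  void exitKey(const KeyContext &) override {}
};

// A custom listener that only cares about entering the rule.
class CollectingListener : public KeyBaseListener {
public:
  void enterKey(const KeyContext &ctx) override { seen.push_back(ctx.text); }
  std::vector<std::string> seen;
};

// Minimal "walker": fires enter and exit for each context in order.
inline void walk(KeyListener &listener, const std::vector<KeyContext> &nodes) {
  for (const KeyContext &node : nodes) {
    listener.enterKey(node);
    listener.exitKey(node);
  }
}

// Usage: walks two contexts and returns the texts seen on entry.
inline std::vector<std::string> collectKeys() {
  std::vector<KeyContext> nodes;
  nodes.push_back(KeyContext{"a"});
  nodes.push_back(KeyContext{"e"});
  CollectingListener listener;
  walk(listener, nodes);
  assert(listener.seen.size() == nodes.size());
  return listener.seen;
}
```

Because the base class already supplies empty bodies, CollectingListener compiles without defining exitKey.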

Specialities of this ANTLR target

Memory Management

Due to the nature of C++, this target has a couple of peculiarities that other ANTLR targets don't have. Since C++ has no built-in memory management, we need to take extra care of it. We rely mostly on smart pointers, which can, however, cause time penalties or memory side effects (like cyclic references) if not used with care. Currently, however, memory handling appears to be very stable.
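To illustrate the cyclic reference problem mentioned above, here is a small sketch using only the standard library. Node and LeakyNode are hypothetical types, not runtime classes. The example shows that two objects holding std::shared_ptr to each other are never released on their own, while a std::weak_ptr back link avoids this.

```cpp
#include <cassert>
#include <memory>

// Hypothetical tree node. The child is owned through shared_ptr, the
// parent link is a weak_ptr, so no ownership cycle forms.
struct Node {
  std::weak_ptr<Node> parent;
  std::shared_ptr<Node> child;
};

// Same shape, but the back link is an owning pointer: parent and child
// keep each other alive.
struct LeakyNode {
  std::shared_ptr<LeakyNode> parent;
  std::shared_ptr<LeakyNode> child;
};

// Returns true if the parent is destroyed once the last outside
// reference goes out of scope.
inline bool parentFreedWithWeakLink() {
  std::weak_ptr<Node> observer;
  {
    auto parent = std::make_shared<Node>();
    parent->child = std::make_shared<Node>();
    parent->child->parent = parent;   // non-owning back link
    observer = parent;
    assert(parent.use_count() == 1);  // the weak links add no owner
  }
  return observer.expired();
}

inline bool parentFreedWithOwningLink() {
  std::weak_ptr<LeakyNode> observer;
  {
    auto parent = std::make_shared<LeakyNode>();
    parent->child = std::make_shared<LeakyNode>();
    parent->child->parent = parent;   // owning back link: a cycle
    observer = parent;
  }
  const bool freed = observer.expired();
  // Break the cycle by hand so this sketch itself does not leak.
  if (auto p = observer.lock()) p->child.reset();
  return freed;
}
```

With the weak back link the first function returns true. The second returns false, because the cycle keeps both nodes alive until it is broken manually.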

Unicode Support

Encoding is mostly an input issue, i.e. it matters when the lexer converts text input into lexer tokens. The parser is completely encoding unaware. However, lexer input in the grammar is defined by character ranges with either a single member (e.g. 'a' or [a]), an explicit range (e.g. 'a'..'z' or [a-z]), the full Unicode range (for a wildcard) or the full Unicode range minus a sub range (for negated ranges, e.g. ~[a]). The explicit ranges are encoded in the serialized ATN as 16-bit numbers and hence cannot reach beyond 0xFFFF (the Unicode BMP), while the implicit ranges can include any value (and hence support the full Unicode set, up to 0x10FFFF).

An interesting side note here is that the Java target fully supports Unicode as well, despite the inherent limitations from the serialized ATN. That's possible because the Java String class represents characters beyond the BMP as surrogate pairs (two 16bit values) and even reads them as 2 characters. To make this work a character range for an identifier in a grammar must include the surrogate pairs area (for a Java parser).
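As a sketch of why the surrogate area matters for a Java parser, the following standard UTF-16 computation splits a code point beyond the BMP into its two 16-bit values. The function names toSurrogates and inSurrogateArea are made up for this illustration.

```cpp
#include <cassert>
#include <utility>

// Splits a code point above the BMP into its UTF-16 surrogate pair
// (high surrogate first, low surrogate second).
inline std::pair<char16_t, char16_t> toSurrogates(char32_t cp) {
  assert(cp > 0xFFFF && cp <= 0x10FFFF);
  const char32_t v = cp - 0x10000;  // 20 significant bits remain
  return {static_cast<char16_t>(0xD800 + (v >> 10)),
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// True if a 16-bit value lies in the surrogate area 0xD800..0xDFFF.
inline bool inSurrogateArea(char16_t unit) {
  return unit >= 0xD800 && unit <= 0xDFFF;
}
```

For example, U+1F600 becomes 0xD83D followed by 0xDE00. Both values lie in the surrogate area, which is why a 16-bit character range has to cover that area to accept such input.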

The C++ target however always expects UTF-8 input (either in a string or via a wide stream) which is then converted to a char32_t array and fed to the lexer. ANTLR, when parsing your grammar, limits character ranges explicitly to the BMP currently. So, in order to allow specifying the full Unicode set the C++ target uses a little trick: whenever an explicit character range includes the (unused) codepoint 0xFFFF in a grammar it is silently extended to the full Unicode range. It's clear that this is an all-or-nothing solution. You cannot define a subset of Unicode codepoints > 0xFFFF that way. This can only be solved if ANTLR supports larger character intervals.
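The 0xFFFF trick can be pictured as follows. This is only a sketch under an assumption: it reads "extended to the full Unicode range" as raising the upper bound of the range to 0x10FFFF. Interval, widen and contains are hypothetical names and do not reflect the actual runtime code.

```cpp
#include <cassert>

// Inclusive code point range, as a grammar range like 'a'..'z' would define.
struct Interval {
  char32_t low;
  char32_t high;
};

constexpr char32_t kBmpMax = 0xFFFF;        // last value a grammar range can name
constexpr char32_t kUnicodeMax = 0x10FFFF;  // last Unicode code point

// Assumed behavior: a range that includes 0xFFFF gets its upper bound
// raised to the end of the Unicode range; all other ranges are unchanged.
inline Interval widen(Interval range) {
  assert(range.low <= range.high);
  if (range.low <= kBmpMax && range.high >= kBmpMax) range.high = kUnicodeMax;
  return range;
}

inline bool contains(Interval range, char32_t cp) {
  return range.low <= cp && cp <= range.high;
}
```

Under this reading a range such as 0x80..0xFFFF also matches U+1F600 after widening, while 'a'..'z' stays as it is. There is no way to express, say, only 0x10000..0x1FFFF.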

The differences in handling characters beyond the BMP lead to a difference between Java and C++ parsers: the character offsets may not match. This is because Java reads two 16-bit values per Unicode character (if it falls into the surrogate area), while a C++ parser reads only one 32-bit value. That usually doesn't have practical consequences, but it might confuse people when comparing token positions.
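The offset difference can be reproduced with a few lines of standard C++. The sketch below decodes UTF-8 into 32-bit code points and then computes where a character would sit when counted in 16-bit units instead. decodeUtf8 and utf16Offset are hypothetical helpers, and the decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes well-formed UTF-8 into code points (one 32-bit value each).
inline std::u32string decodeUtf8(const std::string &utf8) {
  std::u32string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    // Number of continuation bytes, derived from the lead byte.
    const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    assert(i + extra < utf8.size());  // sequence must not be truncated
    char32_t cp = lead & (extra == 0 ? 0x7F : (0x3F >> extra));
    for (std::size_t k = 1; k <= extra; ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
    out.push_back(cp);
    i += extra + 1;
  }
  return out;
}

// Offset of the code point at `index` when counted in 16-bit units:
// every code point above 0xFFFF takes two units.
inline std::size_t utf16Offset(const std::u32string &text, std::size_t index) {
  std::size_t units = 0;
  for (std::size_t i = 0; i < index && i < text.size(); ++i)
    units += text[i] > 0xFFFF ? 2 : 1;
  return units;
}
```

For the input "a", U+1F600, "b", the letter b is at offset 2 when counting code points, but at offset 3 when counting 16-bit units.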

Named Actions

In order to help customize the generated files, there are a number of additional so-called named actions. These actions are tied to specific areas in the generated code and allow you to add custom (target-specific) code. All targets support these actions

  • @parser::header
  • @parser::members
  • @lexer::header
  • @lexer::members

(and their scopeless alternatives @header and @members) where header doesn't mean a C/C++ header file, but the top of a code file. The content of the header action appears in all generated files at the first line. So it's good for things like license/copyright information.

The content of a members action is placed in the public section of lexer or parser class declarations. Hence it can be used for public variables or predicate functions used in a grammar predicate. Since all targets support header + members they are the best place for stuff that should be available also in generated files for other languages.

In addition to that the C++ target supports many more such named actions. Unfortunately, it's not possible to define new scopes (e.g. listener in addition to parser) so they had to be defined as part of the existing scopes (lexer or parser). The grammar in the demo application contains all of the named actions as well for reference. Here's the list:

  • @lexer::preinclude - Placed right before the first #include (e.g. good for headers that must appear first, for system headers etc.). Appears in both lexer h and cpp file.
  • @lexer::postinclude - Placed right after the last #include, but before any class code (e.g. for additional namespaces). Appears in both lexer h and cpp file.
  • @lexer::context - Placed right before the lexer class declaration. Use for e.g. additional types, aliases, forward declarations and the like. Appears in the lexer h file.
  • @lexer::declarations - Placed in the private section of the lexer declaration (generated sections in all classes strictly follow the pattern: public, protected, private, from top to bottom). Use this for private vars etc.
  • @lexer::definitions - Placed before other implementations in the cpp file (but after @postinclude). Use this to implement e.g. private types.
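To visualize where the lexer actions end up, here is a rough sketch of a header file laid out according to the descriptions above. It is not actual generator output: SketchLexer, TokenType, isKeyword, keywordType and the demo namespace are invented for the illustration, and the real generated class contains much more.

```cpp
// @lexer::header content: first lines of the file (e.g. license text).

// @lexer::preinclude content: right before the first #include.
#include <cassert>
#include <string>
// @lexer::postinclude content: right after the last #include.

namespace demo {

// @lexer::context content: types, aliases, forward declarations.
using TokenType = int;

class SketchLexer {  // hypothetical stand-in for a generated lexer class
public:
  // @lexer::members content: public variables or predicate functions.
  bool isKeyword(const std::string &text) const {
    assert(!text.empty());  // predicates are expected to see real token text
    return text == keyword_;
  }
  TokenType keywordType() const { return 1; }

protected:
  // (generated protected section)

private:
  // @lexer::declarations content: private variables etc.
  std::string keyword_ = "key";
};

// @lexer::definitions content belongs in the cpp file, after the
// postinclude content and before the other implementations.

}  // namespace demo
```

The section order public, protected, private follows the pattern described for @lexer::declarations.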

For the parser there are the same actions as shown above for the lexer. In addition to that there are even more actions for visitor and listener classes, which are:

  • @parser::listenerpreinclude
  • @parser::listenerpostinclude
  • @parser::listenerdeclarations
  • @parser::listenermembers
  • @parser::listenerdefinitions
  • @parser::baselistenerpreinclude
  • @parser::baselistenerpostinclude
  • @parser::baselistenerdeclarations
  • @parser::baselistenermembers
  • @parser::baselistenerdefinitions
  • @parser::visitorpreinclude
  • @parser::visitorpostinclude
  • @parser::visitordeclarations
  • @parser::visitormembers
  • @parser::visitordefinitions
  • @parser::basevisitorpreinclude
  • @parser::basevisitorpostinclude
  • @parser::basevisitordeclarations
  • @parser::basevisitormembers
  • @parser::basevisitordefinitions

These should be self-explanatory by now. Note: there is no context action for listeners or visitors, simply because they would be used even less than the other actions and there are so many already.