Formatted documentation

Sam Harwell 2014-01-22 21:30:47 -06:00
parent d5b269b6b6
commit aba1178c49
2 changed files with 289 additions and 246 deletions

TokenStreamRewriter.java

@@ -37,66 +37,79 @@ import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* Useful for rewriting out a buffered input token stream after doing some
* augmentation or other manipulations on it.
*
* <p>
* You can insert stuff, replace, and delete chunks. Note that the operations
* are done lazily--only if you convert the buffer to a String with getText().
* This is very efficient because you are not moving data around all the time.
* As the buffer of tokens is converted to strings, the getText() method(s) scan
* the input token stream and check to see if there is an operation at the
* current index. If so, the operation is done and then normal String rendering
* continues on the buffer. This is like having multiple Turing machine
* instruction streams (programs) operating on a single input tape. :)</p>
*
* <p>
* This rewriter makes no modifications to the token stream. It does not ask the
* stream to fill itself up nor does it advance the input cursor. The token
* stream index() will return the same value before and after any getText()
* call.</p>
*
* <p>
* The rewriter only works on tokens that you have in the buffer and ignores the
* current input cursor. If you are buffering tokens on-demand, calling
* getText() halfway through the input will only do rewrites for those tokens in
* the first half of the file.</p>
*
* <p>
* Since the operations are done lazily at getText-time, operations do not screw
* up the token index values. That is, an insert operation at token index i does
* not change the index values for tokens i+1..n-1.</p>
*
* <p>
* Because operations never actually alter the buffer, you may always get the
* original token stream back without undoing anything. Since the instructions
* are queued up, you can easily simulate transactions and roll back any changes
* if there is an error just by removing instructions. For example,</p>
*
* <pre>
* CharStream input = new ANTLRFileStream("input");
* TLexer lex = new TLexer(input);
* CommonTokenStream tokens = new CommonTokenStream(lex);
* T parser = new T(tokens);
* TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
* parser.startRule();
* </pre>
*
* <p>
* Then in the rules, you can execute (assuming rewriter is visible):</p>
*
* <pre>
* Token t,u;
* ...
* rewriter.insertAfter(t, "text to put after t");
* rewriter.insertAfter(u, "text after u");
* System.out.println(rewriter.getText());
* </pre>
*
* tokens.insertAfter("pass1", t, "text to put after t");}
* tokens.insertAfter("pass2", u, "text after u");}
* System.out.println(tokens.toString("pass1"));
* System.out.println(tokens.toString("pass2"));
* <p>
* You can also have multiple "instruction streams" and get multiple rewrites
* from a single pass over the input. Just name the instruction streams and use
* that name again when printing the buffer. This could be useful for generating
* a C file and also its header file--all from the same buffer:</p>
*
* <pre>
* tokens.insertAfter("pass1", t, "text to put after t");}
* tokens.insertAfter("pass2", u, "text after u");}
* System.out.println(tokens.toString("pass1"));
* System.out.println(tokens.toString("pass2"));
* </pre>
*
* <p>
* If you don't use named rewrite streams, a "default" stream is used as the
* first example shows.</p>
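*
* <p>
* A minimal transaction sketch: because instructions are only queued, never
* applied, discarding them is a complete rollback. (somethingWentWrong below
* is a hypothetical stand-in for whatever error check applies.)</p>
*
* <pre>
* rewriter.replace(t, "replacement text");
* rewriter.delete(u);
* if (somethingWentWrong) {
*     rewriter.deleteProgram(); // drop every queued instruction
* }
* System.out.println(rewriter.getText()); // original text if rolled back
* </pre>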
*/
public class TokenStreamRewriter {
public static final String DEFAULT_PROGRAM_NAME = "default";

ParserATNSimulator.java

@@ -55,202 +55,232 @@ import java.util.List;
import java.util.Set;
/**
* The embodiment of the adaptive LL(*), ALL(*), parsing strategy.
*
* <p>
* The basic complexity of the adaptive strategy makes it harder to understand.
* We begin with ATN simulation to build paths in a DFA. Subsequent prediction
* requests go through the DFA first. If they reach a state without an edge for
* the current symbol, the algorithm fails over to the ATN simulation to
* complete the DFA path for the current input (until it finds a conflict state
* or uniquely predicting state).</p>
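*
* <p>
* Conceptually, prediction looks like the following sketch. This is not the
* actual method body; the real implementation offsets edge indexes so EOF
* can be stored, and uses marks rather than consuming the parser's
* stream:</p>
*
* <pre>
* int adaptivePredict(DFA dfa, TokenStream input) {
*     DFAState s = dfa.s0;
*     while (s != null) {
*         if (s.isAcceptState) return s.prediction;
*         DFAState target = (s.edges != null) ? s.edges[input.LA(1)] : null;
*         if (target == null) return execATN(dfa, s, input); // fail over to ATN
*         s = target;
*         input.consume();
*     }
*     return execATN(dfa, null, input); // no DFA path yet; build one
* }
* </pre>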
*
* <p>
* All of that is done without using the outer context because we want to create
* a DFA that is not dependent upon the rule invocation stack when we do a
* prediction. One DFA works in all contexts. We avoid using context not
* necessarily because it's slower, although it can be, but because of the DFA
* caching problem. The closure routine only considers the rule invocation stack
* created during prediction beginning in the decision rule. For example, if
* prediction occurs without invoking another rule's ATN, there are no context
* stacks in the configurations. When lack of context leads to a conflict, we
* don't know if it's an ambiguity or a weakness in the strong LL(*) parsing
* strategy (versus full LL(*)).</p>
*
* <p>
* When SLL yields a configuration set with conflict, we rewind the input and
* retry the ATN simulation, this time using full outer context without adding
* to the DFA. Configuration context stacks will be the full invocation stacks
* from the start rule. If we get a conflict using full context, then we can
* definitively say we have a true ambiguity for that input sequence. If we
* don't get a conflict, it implies that the decision is sensitive to the outer
* context. (It is not context-sensitive in the sense of context-sensitive
* grammars.)</p>
*
* <p>
* The next time we reach this DFA state with an SLL conflict, through DFA
* simulation, we will again retry the ATN simulation using full context mode.
* This is slow because we can't save the results and have to "interpret" the
* ATN each time we get that input.</p>
*
* <p>
* <strong>CACHING FULL CONTEXT PREDICTIONS</strong></p>
*
* <p>
* We could cache results from full context to predicted alternative easily and
* that saves a lot of time but doesn't work in presence of predicates. The set
* of visible predicates from the ATN start state changes depending on the
* context, because closure can fall off the end of a rule. I tried to cache
* tuples (stack context, semantic context, predicted alt) but it was slower
* than interpreting and much more complicated. Also required a huge amount of
* memory. The goal is not to create the world's fastest parser anyway. I'd like
* to keep this algorithm simple. By launching multiple threads, we can improve
* the speed of parsing across a large number of files.</p>
*
* <p>
* There is no strict ordering between the amount of input used by SLL vs LL,
* which makes it really hard to build a cache for full context. Let's say that
* we have input A B C that leads to an SLL conflict with full context X. That
* implies that using X we might only use A B but we could also use A B C D to
* resolve conflict. Input A B C D could predict alternative 1 in one position
* in the input and A B C E could predict alternative 2 in another position in
* input. The conflicting SLL configurations could still be non-unique in the
* full context prediction, which would lead us to requiring more input than the
* original A B C. To make a prediction cache work, we have to track the exact
* input used during the previous prediction. That amounts to a cache that maps
* X to a specific DFA for that context.</p>
*
* <p>
* Something should be done for left-recursive expression predictions. They are
* likely LL(1) + pred eval. Easier to do the whole SLL unless error and retry
* with full LL thing Sam does.</p>
*
* <p>
* <strong>AVOIDING FULL CONTEXT PREDICTION</strong></p>
*
* <p>
* We avoid doing full context retry when the outer context is empty, we did not
* dip into the outer context by falling off the end of the decision state rule,
* or when we force SLL mode.</p>
*
* <p>
* As an example of a decision that does not dip into the outer context,
* consider super constructor calls versus function calls. One grammar
* might look like
* this:</p>
*
* <pre>
* ctorBody
* : '{' superCall? stat* '}'
* ;
* </pre>
*
* <p>
* Or, you might see something like</p>
*
* <pre>
* stat
* : superCall ';'
* | expression ';'
* | ...
* ;
* </pre>
*
* <p>
* In both cases I believe that no closure operations will dip into the outer
* context. In the first case ctorBody in the worst case will stop at the '}'.
* In the 2nd case it should stop at the ';'. Both cases should stay within the
* entry rule and not dip into the outer context.</p>
*
* <p>
* <strong>PREDICATES</strong></p>
*
* <p>
* Predicates are always evaluated, if present, in both SLL and LL. SLL and
* LL simulation deals with predicates differently. SLL collects predicates as
* it performs closure operations like ANTLR v3 did. It delays predicate
* evaluation until it reaches an accept state. This allows us to cache the SLL
* ATN simulation whereas, if we had evaluated predicates on-the-fly during
* closure, the DFA state configuration sets would be different and we couldn't
* build up a suitable DFA.</p>
*
* <p>
* When building a DFA accept state during ATN simulation, we evaluate any
* predicates and return the sole semantically valid alternative. If there is
* more than 1 alternative, we report an ambiguity. If there are 0 alternatives,
* we throw an exception. Alternatives without predicates act like they have
* true predicates. The simple way to think about it is to strip away all
* alternatives with false predicates and choose the minimum alternative that
* remains.</p>
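*
* <p>
* A sketch of that resolution rule (a hypothetical helper, not the
* simulator's actual code):</p>
*
* <pre>
* int resolveWithPredicates(BitSet alts, SemanticContext[] altToPred,
*                           ParserRuleContext outerContext) {
*     for (int alt = alts.nextSetBit(0); alt &gt;= 0; alt = alts.nextSetBit(alt + 1)) {
*         SemanticContext pred = altToPred[alt];
*         // alternatives without predicates act like they have a true predicate
*         if (pred == null || pred.eval(parser, outerContext)) {
*             return alt; // the minimum semantically viable alternative
*         }
*     }
*     throw new NoViableAltException(parser); // 0 viable alternatives
* }
* </pre>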
*
* <p>
* When we start in the DFA and reach an accept state that's predicated, we test
* those and return the minimum semantically viable alternative. If no
* alternatives are viable, we throw an exception.</p>
*
* <p>
* During full LL ATN simulation, closure always evaluates predicates
* on-the-fly. This is crucial to reducing the configuration set size during
* closure. It hits a landmine when parsing with the Java grammar, for example,
* without this on-the-fly evaluation.</p>
*
* <p>
* <strong>SHARING DFA</strong></p>
*
* <p>
* All instances of the same parser share the same decision DFAs through a
* static field. Each instance gets its own ATN simulator but they share the
* same decisionToDFA field. They also share a PredictionContextCache object
* that makes sure that all PredictionContext objects are shared among the DFA
* states. This makes a big size difference.</p>
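*
* <p>
* Generated parsers wire this sharing up along these lines (a sketch;
* MyParser stands in for any generated parser, and _ATN is the static ATN
* the code generator emits):</p>
*
* <pre>
* public class MyParser extends Parser {
*     protected static final DFA[] _decisionToDFA;
*     protected static final PredictionContextCache _sharedContextCache =
*         new PredictionContextCache();
*
*     static {
*         _decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
*         for (int i = 0; i &lt; _ATN.getNumberOfDecisions(); i++) {
*             _decisionToDFA[i] = new DFA(_ATN.getDecisionState(i), i);
*         }
*     }
*
*     public MyParser(TokenStream input) {
*         super(input);
*         _interp = new ParserATNSimulator(this, _ATN, _decisionToDFA, _sharedContextCache);
*     }
* }
* </pre>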
*
* <p>
* <strong>THREAD SAFETY</strong></p>
*
* <p>
* The parser ATN simulator locks on the decisionDFA field when it adds a new
* DFA object to that array. addDFAEdge locks on the DFA for the current
* decision when setting the edges[] field. addDFAState locks on the DFA for the
* current decision when looking up a DFA state to see if it already exists. We
* must make sure that all requests to add DFA states that are equivalent result
* in the same shared DFA object. This is because lots of threads will be trying
* to update the DFA at once. The addDFAState method also locks inside the DFA
* lock but this time on the shared context cache when it rebuilds the
* configurations' PredictionContext objects using cached subgraphs/nodes. No
* other locking occurs, even during DFA simulation. This is safe as long as we
* can guarantee that all threads referencing s.edge[t] get the same physical
* target DFA state, or none. Once into the DFA, the DFA simulation does not
* reference the dfa.state map. It follows the edges[] field to new targets. The
* DFA simulator will either find dfa.edges to be null, to be non-null and
* dfa.edges[t] null, or dfa.edges[t] to be non-null. The addDFAEdge method
* could be racing to set the field but in either case the DFA simulator
* works; if the edge is null, it requests ATN simulation. It could also
* race trying to get
* dfa.edges[t], but either way it will work because it's not doing a test and
* set operation.</p>
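*
* <p>
* The guarded edge update itself is tiny (a simplified sketch of the
* locking described above; array sizing details elided):</p>
*
* <pre>
* // inside addDFAEdge: add p --t--&gt; q under the per-decision DFA lock
* synchronized (dfa) {
*     if (p.edges == null) {
*         p.edges = new DFAState[atn.maxTokenType + 2];
*     }
*     p.edges[t + 1] = q; // shift by 1 so t == -1 (EOF) lands in slot 0
* }
* </pre>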
*
* <p>
* <strong>Starting with SLL then failing over to combined SLL/LL</strong></p>
*
* <p>
* Sam pointed out that if SLL does not give a syntax error, then there is no
* point in doing full LL, which is slower. We only have to try LL if we get a
* syntax error. For maximum speed, Sam starts the parser with pure SLL
* mode:</p>
*
* <pre>
* parser.getInterpreter().setSLL(true);
* </pre>
*
* <p>
* and with the bail error strategy:</p>
*
* <pre>
* parser.setErrorHandler(new BailErrorStrategy());
* </pre>
*
* <p>
* If it does not get a syntax error, then we're done. If it does get a syntax
* error, we need to retry with the combined SLL/LL strategy.</p>
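*
* <p>
* Put together, the full fallback loop looks roughly like this (a sketch
* using the PredictionMode API, which is the same mechanism the setSLL call
* above refers to):</p>
*
* <pre>
* parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
* parser.setErrorHandler(new BailErrorStrategy());
* try {
*     parser.startRule();
* }
* catch (ParseCancellationException ex) {
*     // SLL failed; rewind the input and retry once with full LL
*     tokens.seek(0);
*     parser.reset();
*     parser.setErrorHandler(new DefaultErrorStrategy());
*     parser.getInterpreter().setPredictionMode(PredictionMode.LL);
*     parser.startRule();
* }
* </pre>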
*
* <p>
* The reason this works is as follows. If there are no SLL conflicts then the
* grammar is SLL for sure, at least for that input set. If there is an SLL
* conflict, the full LL analysis must yield a set of ambiguous alternatives
* that is no larger than the SLL set. If the LL set is a singleton, then the
* grammar is LL but not SLL. If the LL set is the same size as the SLL set, the
* decision is SLL. If the LL set has size &gt; 1, then that decision is truly
* ambiguous on the current input. If the LL set is smaller, then the SLL
* conflict resolution might choose an alternative that the full LL would rule
* out as a possibility based upon better context information. If that's the
* case, then the SLL parse will definitely get an error because the full LL
* analysis says it's not viable. If SLL conflict resolution chooses an
* alternative within the LL set, then both SLL and LL would choose the same
* alternative because they both choose the minimum of multiple conflicting
* alternatives.</p>
*
* <p>
* Let's say we have a set of SLL conflicting alternatives {@code {1, 2, 3}} and
* a smaller LL set called <em>s</em>. If <em>s</em> is {@code {2, 3}}, then SLL
* parsing will get an error because SLL will pursue alternative 1. If
* <em>s</em> is {@code {1, 2}} or {@code {1, 3}} then both SLL and LL will
* choose the same alternative because alternative one is the minimum of either
* set. If <em>s</em> is {@code {2}} or {@code {3}} then SLL will get a syntax
* error. If <em>s</em> is {@code {1}} then SLL will succeed.</p>
*
* <p>
* Of course, if the input is invalid, then we will get an error for sure in
* both SLL and LL parsing. Erroneous input will therefore require 2 passes over
* the input.</p>
*/
public class ParserATNSimulator extends ATNSimulator {
public static final boolean debug = false;
public static final boolean debug_list_atn_decisions = false;