Formatted documentation
parent d5b269b6b6
commit aba1178c49

@@ -37,66 +37,79 @@ import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Useful for rewriting out a buffered input token stream after doing some
 * augmentation or other manipulations on it.
/**
 * Useful for rewriting out a buffered input token stream after doing some
 * augmentation or other manipulations on it.
 *
 * You can insert stuff, replace, and delete chunks. Note that the
 * operations are done lazily--only if you convert the buffer to a
 * String with getText(). This is very efficient because you are not moving
 * data around all the time. As the buffer of tokens is converted to strings,
 * the getText() method(s) scan the input token stream and check
 * to see if there is an operation at the current index.
 * If so, the operation is done and then normal String
 * rendering continues on the buffer. This is like having multiple Turing
 * machine instruction streams (programs) operating on a single input tape. :)
 * <p>
 * You can insert stuff, replace, and delete chunks. Note that the operations
 * are done lazily--only if you convert the buffer to a String with getText().
 * This is very efficient because you are not moving data around all the time.
 * As the buffer of tokens is converted to strings, the getText() method(s) scan
 * the input token stream and check to see if there is an operation at the
 * current index. If so, the operation is done and then normal String rendering
 * continues on the buffer. This is like having multiple Turing machine
 * instruction streams (programs) operating on a single input tape. :)</p>
 *
 * This rewriter makes no modifications to the token stream. It does not
 * ask the stream to fill itself up nor does it advance the input cursor.
 * The token stream index() will return the same value before and after
 * any getText() call.
 * <p>
 * This rewriter makes no modifications to the token stream. It does not ask the
 * stream to fill itself up nor does it advance the input cursor. The token
 * stream index() will return the same value before and after any getText()
 * call.</p>
 *
 * The rewriter only works on tokens that you have in the buffer and
 * ignores the current input cursor. If you are buffering tokens on-demand,
 * calling getText() halfway through the input will only do rewrites
 * for those tokens in the first half of the file.
 * <p>
 * The rewriter only works on tokens that you have in the buffer and ignores the
 * current input cursor. If you are buffering tokens on-demand, calling
 * getText() halfway through the input will only do rewrites for those tokens in
 * the first half of the file.</p>
 *
 * Since the operations are done lazily at getText-time, operations do not
 * screw up the token index values. That is, an insert operation at token
 * index i does not change the index values for tokens i+1..n-1.
 * <p>
 * Since the operations are done lazily at getText-time, operations do not screw
 * up the token index values. That is, an insert operation at token index i does
 * not change the index values for tokens i+1..n-1.</p>
 *
 * Because operations never actually alter the buffer, you may always get
 * the original token stream back without undoing anything. Since
 * the instructions are queued up, you can easily simulate transactions and
 * roll back any changes if there is an error just by removing instructions.
 * For example,
 * <p>
 * Because operations never actually alter the buffer, you may always get the
 * original token stream back without undoing anything. Since the instructions
 * are queued up, you can easily simulate transactions and roll back any changes
 * if there is an error just by removing instructions. For example,</p>
 *
 * CharStream input = new ANTLRFileStream("input");
 * TLexer lex = new TLexer(input);
 * CommonTokenStream tokens = new CommonTokenStream(lex);
 * T parser = new T(tokens);
 * TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
 * parser.startRule();
 * <pre>
 * CharStream input = new ANTLRFileStream("input");
 * TLexer lex = new TLexer(input);
 * CommonTokenStream tokens = new CommonTokenStream(lex);
 * T parser = new T(tokens);
 * TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
 * parser.startRule();
 * </pre>
 *
 * Then in the rules, you can execute (assuming rewriter is visible):
 * Token t,u;
 * ...
 * rewriter.insertAfter(t, "text to put after t");}
 * rewriter.insertAfter(u, "text after u");}
 * System.out.println(tokens.toString());
 * <p>
 * Then in the rules, you can execute (assuming rewriter is visible):</p>
 *
 * You can also have multiple "instruction streams" and get multiple
 * rewrites from a single pass over the input. Just name the instruction
 * streams and use that name again when printing the buffer. This could be
 * useful for generating a C file and also its header file--all from the
 * same buffer:
 * <pre>
 * Token t,u;
 * ...
 * rewriter.insertAfter(t, "text to put after t");}
 * rewriter.insertAfter(u, "text after u");}
 * System.out.println(tokens.toString());
 * </pre>
 *
 * tokens.insertAfter("pass1", t, "text to put after t");}
 * tokens.insertAfter("pass2", u, "text after u");}
 * System.out.println(tokens.toString("pass1"));
 * System.out.println(tokens.toString("pass2"));
 * <p>
 * You can also have multiple "instruction streams" and get multiple rewrites
 * from a single pass over the input. Just name the instruction streams and use
 * that name again when printing the buffer. This could be useful for generating
 * a C file and also its header file--all from the same buffer:</p>
 *
 * If you don't use named rewrite streams, a "default" stream is used as
 * the first example shows.
 * <pre>
 * tokens.insertAfter("pass1", t, "text to put after t");}
 * tokens.insertAfter("pass2", u, "text after u");}
 * System.out.println(tokens.toString("pass1"));
 * System.out.println(tokens.toString("pass2"));
 * </pre>
 *
 * <p>
 * If you don't use named rewrite streams, a "default" stream is used as the
 * first example shows.</p>
 */
public class TokenStreamRewriter {
	public static final String DEFAULT_PROGRAM_NAME = "default";
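
As a quick orientation for the API documented above, here is a minimal usage sketch (not part of this commit): it buffers the tokens, queues a few lazy edit instructions against the default program, and renders the result with getText(). TLexer, TParser, and startRule are placeholder names for a generated grammar, and the edits shown are arbitrary; the overloads used (insertBefore, replace, insertAfter, getText) are the commonly available TokenStreamRewriter methods rather than text taken from this diff.

// Hypothetical sketch, assuming a generated TLexer/TParser pair as in the javadoc example.
import org.antlr.v4.runtime.*;

public class RewriteDemo {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("input"); // 4.7+; older runtimes: new ANTLRFileStream("input")
        TLexer lex = new TLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        TParser parser = new TParser(tokens);
        TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
        parser.startRule();

        // Edits are queued lazily; token indexes never shift.
        Token first = tokens.get(0);
        rewriter.insertBefore(first, "// header\n");
        rewriter.replace(first, "FIRST");        // swap the text of one token
        rewriter.insertAfter(first, " /* note */");
        // replace(from, to, text) and delete(...) queue up the same way.

        System.out.println(tokens.getText());    // original stream, unchanged
        System.out.println(rewriter.getText());  // stream with the queued edits applied
    }
}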

@@ -55,202 +55,232 @@ import java.util.List;
import java.util.Set;

/**
The embodiment of the adaptive LL(*), ALL(*), parsing strategy.

The basic complexity of the adaptive strategy makes it harder to
understand. We begin with ATN simulation to build paths in a
DFA. Subsequent prediction requests go through the DFA first. If
they reach a state without an edge for the current symbol, the
algorithm fails over to the ATN simulation to complete the DFA
path for the current input (until it finds a conflict state or
uniquely predicting state).

All of that is done without using the outer context because we
want to create a DFA that is not dependent upon the rule
invocation stack when we do a prediction. One DFA works in all
contexts. We avoid using context not necessarily because it's
slower, although it can be, but because of the DFA caching
problem. The closure routine only considers the rule invocation
stack created during prediction beginning in the decision rule. For
example, if prediction occurs without invoking another rule's
ATN, there are no context stacks in the configurations.
When lack of context leads to a conflict, we don't know if it's
an ambiguity or a weakness in the strong LL(*) parsing strategy
(versus full LL(*)).

When SLL yields a configuration set with conflict, we rewind the
input and retry the ATN simulation, this time using
full outer context without adding to the DFA. Configuration context
stacks will be the full invocation stacks from the start rule. If
we get a conflict using full context, then we can definitively
say we have a true ambiguity for that input sequence. If we don't
get a conflict, it implies that the decision is sensitive to the
outer context. (It is not context-sensitive in the sense of
context-sensitive grammars.)

The next time we reach this DFA state with an SLL conflict, through
DFA simulation, we will again retry the ATN simulation using full
context mode. This is slow because we can't save the results and have
to "interpret" the ATN each time we get that input.

CACHING FULL CONTEXT PREDICTIONS

We could cache results from full context to predicted
alternative easily and that saves a lot of time but doesn't work
in presence of predicates. The set of visible predicates from
the ATN start state changes depending on the context, because
closure can fall off the end of a rule. I tried to cache
tuples (stack context, semantic context, predicted alt) but it
was slower than interpreting and much more complicated. Also
required a huge amount of memory. The goal is not to create the
world's fastest parser anyway. I'd like to keep this algorithm
simple. By launching multiple threads, we can improve the speed
of parsing across a large number of files.

There is no strict ordering between the amount of input used by
SLL vs LL, which makes it really hard to build a cache for full
context. Let's say that we have input A B C that leads to an SLL
conflict with full context X. That implies that using X we
might only use A B but we could also use A B C D to resolve
conflict. Input A B C D could predict alternative 1 in one
position in the input and A B C E could predict alternative 2 in
another position in input. The conflicting SLL configurations
could still be non-unique in the full context prediction, which
would lead us to requiring more input than the original A B C. To
make a prediction cache work, we have to track the exact input used
during the previous prediction. That amounts to a cache that maps X
to a specific DFA for that context.

Something should be done for left-recursive expression predictions.
They are likely LL(1) + pred eval. Easier to do the whole SLL unless
error and retry with full LL thing Sam does.

AVOIDING FULL CONTEXT PREDICTION

We avoid doing full context retry when the outer context is empty,
we did not dip into the outer context by falling off the end of the
decision state rule, or when we force SLL mode.

As an example of the not dip into outer context case, consider
super constructor calls versus function calls. One grammar
might look like this:

ctorBody : '{' superCall? stat* '}' ;

Or, you might see something like

stat : superCall ';' | expression ';' | ... ;

In both cases I believe that no closure operations will dip into the
outer context. In the first case ctorBody in the worst case will stop
at the '}'. In the 2nd case it should stop at the ';'. Both cases
should stay within the entry rule and not dip into the outer context.

PREDICATES

Predicates are always evaluated if present in either SLL or LL both.
SLL and LL simulation deals with predicates differently. SLL collects
predicates as it performs closure operations like ANTLR v3 did. It
delays predicate evaluation until it reaches an accept state. This
allows us to cache the SLL ATN simulation whereas, if we had evaluated
predicates on-the-fly during closure, the DFA state configuration sets
would be different and we couldn't build up a suitable DFA.

When building a DFA accept state during ATN simulation, we evaluate
any predicates and return the sole semantically valid alternative. If
there is more than 1 alternative, we report an ambiguity. If there are
0 alternatives, we throw an exception. Alternatives without predicates
act like they have true predicates. The simple way to think about it
is to strip away all alternatives with false predicates and choose the
minimum alternative that remains.

When we start in the DFA and reach an accept state that's predicated,
we test those and return the minimum semantically viable
alternative. If no alternatives are viable, we throw an exception.

During full LL ATN simulation, closure always evaluates predicates
on-the-fly. This is crucial to reducing the configuration set size
during closure. It hits a landmine when parsing with the Java grammar,
for example, without this on-the-fly evaluation.

SHARING DFA

All instances of the same parser share the same decision DFAs through
a static field. Each instance gets its own ATN simulator but they
share the same decisionToDFA field. They also share a
PredictionContextCache object that makes sure that all
PredictionContext objects are shared among the DFA states. This makes
a big size difference.

THREAD SAFETY

The parser ATN simulator locks on the decisionDFA field when it adds a
new DFA object to that array. addDFAEdge locks on the DFA for the
current decision when setting the edges[] field. addDFAState locks on
the DFA for the current decision when looking up a DFA state to see if
it already exists. We must make sure that all requests to add DFA
states that are equivalent result in the same shared DFA object. This
is because lots of threads will be trying to update the DFA at
once. The addDFAState method also locks inside the DFA lock but this
time on the shared context cache when it rebuilds the configurations'
PredictionContext objects using cached subgraphs/nodes. No other
locking occurs, even during DFA simulation. This is safe as long as we
can guarantee that all threads referencing s.edge[t] get the same
physical target DFA state, or none. Once into the DFA, the DFA
simulation does not reference the dfa.state map. It follows the
edges[] field to new targets. The DFA simulator will either find
dfa.edges to be null, to be non-null and dfa.edges[t] null, or
dfa.edges[t] to be non-null. The addDFAEdge method could be racing to
set the field but in either case the DFA simulator works; if null, it
simply requests ATN simulation. It could also race trying to get
dfa.edges[t], but either way it will work because it's not doing a
test and set operation.

Starting with SLL then failing over to combined SLL/LL

Sam pointed out that if SLL does not give a syntax error, then there
is no point in doing full LL, which is slower. We only have to try LL
if we get a syntax error. For maximum speed, Sam starts the parser
with pure SLL mode:

parser.getInterpreter().setSLL(true);

and with the bail error strategy:

parser.setErrorHandler(new BailErrorStrategy());

If it does not get a syntax error, then we're done. If it does get a
syntax error, we need to retry with the combined SLL/LL strategy.

The reason this works is as follows. If there are no SLL
conflicts then the grammar is SLL for sure, at least for that
input set. If there is an SLL conflict, the full LL analysis
must yield a set of ambiguous alternatives that is no larger
than the SLL set. If the LL set is a singleton, then the grammar
is LL but not SLL. If the LL set is the same size as the SLL
set, the decision is SLL. If the LL set has size > 1, then that
decision is truly ambiguous on the current input. If the LL set
is smaller, then the SLL conflict resolution might choose an
alternative that the full LL would rule out as a possibility
based upon better context information. If that's the case, then
the SLL parse will definitely get an error because the full LL
analysis says it's not viable. If SLL conflict resolution
chooses an alternative within the LL set, then both SLL and LL
would choose the same alternative because they both choose the
minimum of multiple conflicting alternatives.

Let's say we have a set of SLL conflicting alternatives {1, 2, 3} and
a smaller LL set called s. If s is {2, 3}, then SLL parsing will get
an error because SLL will pursue alternative 1. If s is {1, 2} or {1,
3} then both SLL and LL will choose the same alternative because
alternative one is the minimum of either set. If s is {2} or {3} then
SLL will get a syntax error. If s is {1} then SLL will succeed.

Of course, if the input is invalid, then we will get an error for sure
in both SLL and LL parsing. Erroneous input will therefore require 2
passes over the input.

*/
 * The embodiment of the adaptive LL(*), ALL(*), parsing strategy.
 *
 * <p>
 * The basic complexity of the adaptive strategy makes it harder to understand.
 * We begin with ATN simulation to build paths in a DFA. Subsequent prediction
 * requests go through the DFA first. If they reach a state without an edge for
 * the current symbol, the algorithm fails over to the ATN simulation to
 * complete the DFA path for the current input (until it finds a conflict state
 * or uniquely predicting state).</p>
 *
 * <p>
 * All of that is done without using the outer context because we want to create
 * a DFA that is not dependent upon the rule invocation stack when we do a
 * prediction. One DFA works in all contexts. We avoid using context not
 * necessarily because it's slower, although it can be, but because of the DFA
 * caching problem. The closure routine only considers the rule invocation stack
 * created during prediction beginning in the decision rule. For example, if
 * prediction occurs without invoking another rule's ATN, there are no context
 * stacks in the configurations. When lack of context leads to a conflict, we
 * don't know if it's an ambiguity or a weakness in the strong LL(*) parsing
 * strategy (versus full LL(*)).</p>
 *
 * <p>
 * When SLL yields a configuration set with conflict, we rewind the input and
 * retry the ATN simulation, this time using full outer context without adding
 * to the DFA. Configuration context stacks will be the full invocation stacks
 * from the start rule. If we get a conflict using full context, then we can
 * definitively say we have a true ambiguity for that input sequence. If we
 * don't get a conflict, it implies that the decision is sensitive to the outer
 * context. (It is not context-sensitive in the sense of context-sensitive
 * grammars.)</p>
 *
 * <p>
 * The next time we reach this DFA state with an SLL conflict, through DFA
 * simulation, we will again retry the ATN simulation using full context mode.
 * This is slow because we can't save the results and have to "interpret" the
 * ATN each time we get that input.</p>
 *
 * <p>
 * <strong>CACHING FULL CONTEXT PREDICTIONS</strong></p>
 *
 * <p>
 * We could cache results from full context to predicted alternative easily and
 * that saves a lot of time but doesn't work in presence of predicates. The set
 * of visible predicates from the ATN start state changes depending on the
 * context, because closure can fall off the end of a rule. I tried to cache
 * tuples (stack context, semantic context, predicted alt) but it was slower
 * than interpreting and much more complicated. Also required a huge amount of
 * memory. The goal is not to create the world's fastest parser anyway. I'd like
 * to keep this algorithm simple. By launching multiple threads, we can improve
 * the speed of parsing across a large number of files.</p>
 *
 * <p>
 * There is no strict ordering between the amount of input used by SLL vs LL,
 * which makes it really hard to build a cache for full context. Let's say that
 * we have input A B C that leads to an SLL conflict with full context X. That
 * implies that using X we might only use A B but we could also use A B C D to
 * resolve conflict. Input A B C D could predict alternative 1 in one position
 * in the input and A B C E could predict alternative 2 in another position in
 * input. The conflicting SLL configurations could still be non-unique in the
 * full context prediction, which would lead us to requiring more input than the
 * original A B C. To make a prediction cache work, we have to track the exact
 * input used during the previous prediction. That amounts to a cache that maps
 * X to a specific DFA for that context.</p>
 *
 * <p>
 * Something should be done for left-recursive expression predictions. They are
 * likely LL(1) + pred eval. Easier to do the whole SLL unless error and retry
 * with full LL thing Sam does.</p>
 *
 * <p>
 * <strong>AVOIDING FULL CONTEXT PREDICTION</strong></p>
 *
 * <p>
 * We avoid doing full context retry when the outer context is empty, we did not
 * dip into the outer context by falling off the end of the decision state rule,
 * or when we force SLL mode.</p>
 *
 * <p>
 * As an example of the not dip into outer context case, consider super
 * constructor calls versus function calls. One grammar might look like
 * this:</p>
 *
 * <pre>
 * ctorBody
 *   : '{' superCall? stat* '}'
 *   ;
 * </pre>
 *
 * <p>
 * Or, you might see something like</p>
 *
 * <pre>
 * stat
 *   : superCall ';'
 *   | expression ';'
 *   | ...
 *   ;
 * </pre>
 *
 * <p>
 * In both cases I believe that no closure operations will dip into the outer
 * context. In the first case ctorBody in the worst case will stop at the '}'.
 * In the 2nd case it should stop at the ';'. Both cases should stay within the
 * entry rule and not dip into the outer context.</p>
 *
 * <p>
 * <strong>PREDICATES</strong></p>
 *
 * <p>
 * Predicates are always evaluated if present in either SLL or LL both. SLL and
 * LL simulation deals with predicates differently. SLL collects predicates as
 * it performs closure operations like ANTLR v3 did. It delays predicate
 * evaluation until it reaches an accept state. This allows us to cache the SLL
 * ATN simulation whereas, if we had evaluated predicates on-the-fly during
 * closure, the DFA state configuration sets would be different and we couldn't
 * build up a suitable DFA.</p>
 *
 * <p>
 * When building a DFA accept state during ATN simulation, we evaluate any
 * predicates and return the sole semantically valid alternative. If there is
 * more than 1 alternative, we report an ambiguity. If there are 0 alternatives,
 * we throw an exception. Alternatives without predicates act like they have
 * true predicates. The simple way to think about it is to strip away all
 * alternatives with false predicates and choose the minimum alternative that
 * remains.</p>
 *
 * <p>
 * When we start in the DFA and reach an accept state that's predicated, we test
 * those and return the minimum semantically viable alternative. If no
 * alternatives are viable, we throw an exception.</p>
 *
 * <p>
 * During full LL ATN simulation, closure always evaluates predicates
 * on-the-fly. This is crucial to reducing the configuration set size during
 * closure. It hits a landmine when parsing with the Java grammar, for example,
 * without this on-the-fly evaluation.</p>
 *
 * <p>
 * <strong>SHARING DFA</strong></p>
 *
 * <p>
 * All instances of the same parser share the same decision DFAs through a
 * static field. Each instance gets its own ATN simulator but they share the
 * same decisionToDFA field. They also share a PredictionContextCache object
 * that makes sure that all PredictionContext objects are shared among the DFA
 * states. This makes a big size difference.</p>
 *
 * <p>
 * <strong>THREAD SAFETY</strong></p>
 *
 * <p>
 * The parser ATN simulator locks on the decisionDFA field when it adds a new
 * DFA object to that array. addDFAEdge locks on the DFA for the current
 * decision when setting the edges[] field. addDFAState locks on the DFA for the
 * current decision when looking up a DFA state to see if it already exists. We
 * must make sure that all requests to add DFA states that are equivalent result
 * in the same shared DFA object. This is because lots of threads will be trying
 * to update the DFA at once. The addDFAState method also locks inside the DFA
 * lock but this time on the shared context cache when it rebuilds the
 * configurations' PredictionContext objects using cached subgraphs/nodes. No
 * other locking occurs, even during DFA simulation. This is safe as long as we
 * can guarantee that all threads referencing s.edge[t] get the same physical
 * target DFA state, or none. Once into the DFA, the DFA simulation does not
 * reference the dfa.state map. It follows the edges[] field to new targets. The
 * DFA simulator will either find dfa.edges to be null, to be non-null and
 * dfa.edges[t] null, or dfa.edges[t] to be non-null. The addDFAEdge method
 * could be racing to set the field but in either case the DFA simulator works;
 * if null, it simply requests ATN simulation. It could also race trying to get
 * dfa.edges[t], but either way it will work because it's not doing a test and
 * set operation.</p>
 *
 * <p>
 * <strong>Starting with SLL then failing over to combined SLL/LL</strong></p>
 *
 * <p>
 * Sam pointed out that if SLL does not give a syntax error, then there is no
 * point in doing full LL, which is slower. We only have to try LL if we get a
 * syntax error. For maximum speed, Sam starts the parser with pure SLL
 * mode:</p>
 *
 * <pre>
 * parser.getInterpreter().setSLL(true);
 * </pre>
 *
 * <p>
 * and with the bail error strategy:</p>
 *
 * <pre>
 * parser.setErrorHandler(new BailErrorStrategy());
 * </pre>
 *
 * <p>
 * If it does not get a syntax error, then we're done. If it does get a syntax
 * error, we need to retry with the combined SLL/LL strategy.</p>
 *
 * <p>
 * The reason this works is as follows. If there are no SLL conflicts then the
 * grammar is SLL for sure, at least for that input set. If there is an SLL
 * conflict, the full LL analysis must yield a set of ambiguous alternatives
 * that is no larger than the SLL set. If the LL set is a singleton, then the
 * grammar is LL but not SLL. If the LL set is the same size as the SLL set, the
 * decision is SLL. If the LL set has size > 1, then that decision is truly
 * ambiguous on the current input. If the LL set is smaller, then the SLL
 * conflict resolution might choose an alternative that the full LL would rule
 * out as a possibility based upon better context information. If that's the
 * case, then the SLL parse will definitely get an error because the full LL
 * analysis says it's not viable. If SLL conflict resolution chooses an
 * alternative within the LL set, then both SLL and LL would choose the same
 * alternative because they both choose the minimum of multiple conflicting
 * alternatives.</p>
 *
 * <p>
 * Let's say we have a set of SLL conflicting alternatives {@code {1, 2, 3}} and
 * a smaller LL set called <em>s</em>. If <em>s</em> is {@code {2, 3}}, then SLL
 * parsing will get an error because SLL will pursue alternative 1. If
 * <em>s</em> is {@code {1, 2}} or {@code {1, 3}} then both SLL and LL will
 * choose the same alternative because alternative one is the minimum of either
 * set. If <em>s</em> is {@code {2}} or {@code {3}} then SLL will get a syntax
 * error. If <em>s</em> is {@code {1}} then SLL will succeed.</p>
 *
 * <p>
 * Of course, if the input is invalid, then we will get an error for sure in
 * both SLL and LL parsing. Erroneous input will therefore require 2 passes over
 * the input.</p>
 */
public class ParserATNSimulator extends ATNSimulator {
	public static final boolean debug = false;
	public static final boolean debug_list_atn_decisions = false;
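
The two-stage strategy described under "Starting with SLL then failing over to combined SLL/LL" above (parse with pure SLL and a bail-out error handler, and only fall back to the combined SLL/LL strategy when that first pass hits a syntax error) would look roughly like the sketch below. This is illustrative, not code from this commit: TLexer, TParser, and startRule are placeholder names for a generated grammar, setSLL(true) is the switch quoted in the javadoc (newer runtimes expose the same toggle through PredictionMode), and CharStreams.fromFileName assumes a 4.7+ runtime.

// Hypothetical two-stage parse, assuming a generated TLexer/TParser with entry rule startRule.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class TwoStageParse {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("input");
        TLexer lex = new TLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        TParser parser = new TParser(tokens);

        // Stage 1: pure SLL, bail out on the first syntax error.
        parser.getInterpreter().setSLL(true);            // as quoted in the javadoc above
        parser.setErrorHandler(new BailErrorStrategy());
        try {
            parser.startRule();
        }
        catch (ParseCancellationException e) {
            // Stage 2: rewind and reparse with the default (combined SLL/LL) strategy.
            tokens.seek(0);
            parser.reset();
            parser.setErrorHandler(new DefaultErrorStrategy());
            parser.getInterpreter().setSLL(false);
            parser.startRule();
        }
    }
}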