From 86e47b9c02c734ad1644cbd7c446cdd4c9dadfa7 Mon Sep 17 00:00:00 2001 From: Terence Parr Date: Thu, 2 Aug 2012 18:23:11 -0700 Subject: [PATCH] update comments to explain SLL vs LL, predicate strategy, etc... (and did a small tweak in code) --- .../v4/runtime/atn/ParserATNSimulator.java | 151 +++++++++++++++--- .../runtime/atn/PredictionContextCache.java | 7 +- 2 files changed, 132 insertions(+), 26 deletions(-) diff --git a/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java b/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java index a6a9533e0..6d2420535 100755 --- a/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java +++ b/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java @@ -72,23 +72,37 @@ import java.util.Set; problem. The closure routine only considers the rule invocation stack created during prediction beginning in the decision rule. For example, if prediction occurs without invoking another rule's - ATN, there are no context stacks in the configurations. When this - leads to a conflict, we don't know if it's an ambiguity or a - weakness in the strong LL(*) parsing strategy (versus full - LL(*)). + ATN, there are no context stacks in the configurations. + When lack of context leads to a conflict, we don't know if it's + an ambiguity or a weakness in the strong LL(*) parsing strategy + (versus full LL(*)). - So, we simply rewind and retry the ATN simulation again, this time - using full outer context without adding to the DFA. Configuration context - stacks will be the full invocation stack from the start rule. If + When SLL yields a configuration set with a conflict, we rewind the + input and retry the ATN simulation, this time using + full outer context without adding to the DFA. Configuration context + stacks will be the full invocation stacks from the start rule. If we get a conflict using full context, then we can definitively say we have a true ambiguity for that input sequence.
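The SLL-then-full-LL retry described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual adaptivePredict code: the two ATN simulation passes are stubbed out as precomputed viable-alternative sets, and all names here are illustrative.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class TwoStagePredictSketch {
    // Stub for ATN simulation: the viable-alternative set for each mode is a
    // precomputed input here; the real code runs closure over the ATN.
    static SortedSet<Integer> simulate(boolean fullCtx,
                                       SortedSet<Integer> sllAlts,
                                       SortedSet<Integer> llAlts) {
        return fullCtx ? llAlts : sllAlts;
    }

    static int adaptivePredict(SortedSet<Integer> sllAlts, SortedSet<Integer> llAlts) {
        SortedSet<Integer> alts = simulate(false, sllAlts, llAlts); // SLL pass
        if (alts.size() > 1) {
            // SLL conflict: rewind the input and retry with full outer
            // context, without adding anything to the DFA.
            alts = simulate(true, sllAlts, llAlts);
            // A conflict that survives full context is a true ambiguity;
            // it is resolved by taking the minimum alternative below.
        }
        return alts.first();
    }
}
```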
If we don't get a conflict, it implies that the decision is sensitive to the outer context. (It is not context-sensitive in the sense of - context sensitive grammars.) We create a special DFA accept state - that maps rule context to a predicted alternative. That is the - only modification needed to handle full LL(*) vs SLL(*) prediction. + context-sensitive grammars.) - So, the strategy is complex because we bounce back and forth from + The next time we reach this DFA state with an SLL conflict, through + DFA simulation, we will again retry the ATN simulation using full + context mode. This is slow because we can't save the results and have + to "interpret" the ATN each time we get that input. We could easily cache + results mapping full context to predicted alternative, and that + saves a lot of time, but it doesn't work in the presence of predicates. The set + of visible predicates from the ATN start state changes depending on + the context, because closure can fall off the end of a rule. I tried + to cache tuples (stack context, semantic context, predicted alt) but + it was slower than interpreting and much more complicated. It also + required a huge amount of memory. The goal is not to create the + world's fastest parser anyway. I'd like to keep this algorithm + simple. By launching multiple threads, we can improve the speed of + parsing across a large number of files. + + This strategy is complex because we bounce back and forth from the ATN to the DFA, simultaneously performing predictions and extending the DFA according to previously unseen input sequences. @@ -112,15 +126,109 @@ import java.util.Set; at the '}'. In the 2nd case it should stop at the ';'. Both cases should stay within the entry rule and not dip into the outer context. - When we are forced to do full context parsing, I mark the DFA state - with isCtxSensitive=true when we reach conflict in SLL prediction.
- Any further DFA simulation that reaches that state will - launch an ATN simulation to get the prediction, without updating the - DFA or storing any context information. + PREDICATES + + Predicates, if present, are always evaluated, in both SLL and LL mode. + SLL and LL simulation deal with predicates differently, however. SLL collects + predicates as it performs closure operations, like ANTLR v3 did. It + delays predicate evaluation until it reaches an accept state. This + allows us to cache the SLL ATN simulation whereas, if we had evaluated + predicates on-the-fly during closure, the DFA state configuration sets + would be different and we couldn't build up a suitable DFA. + + When building a DFA accept state during ATN simulation, we evaluate + any predicates and return the sole semantically valid alternative. If + there is more than 1 alternative, we report an ambiguity. If there are + 0 alternatives, we throw an exception. Alternatives without predicates + act like they have true predicates. The simple way to think about it + is to strip away all alternatives with false predicates and choose the + minimum alternative that remains. + + When we start in the DFA and reach an accept state that's predicated, + we evaluate those predicates and return the minimum semantically viable + alternative. If no alternatives are viable, we throw an exception. We + don't report ambiguities in the DFA, but I'm not sure why anymore. + + During full LL ATN simulation, closure always evaluates predicates + on-the-fly. This is crucial to reducing the configuration set size + during closure. Without this on-the-fly evaluation, closure hits a + landmine when parsing with the Java grammar, for example. + + SHARING DFA + + All instances of the same parser share the same decision DFAs through + a static field. Each instance gets its own ATN simulator but they + share the same decisionToDFA field.
They also share a + PredictionContextCache object that makes sure that all + PredictionContext objects are shared among the DFA states. This makes + a big size difference. + + THREAD SAFETY + + The parser ATN simulator locks on the decisionToDFA field when it adds a + new DFA object to that array. addDFAEdge locks on the DFA for the + current decision when setting the edges[] field. addDFAState locks on + the DFA for the current decision when looking up a DFA state to see if + it already exists. We must make sure that all requests to add DFA + states that are equivalent result in the same shared DFA object. This + is because lots of threads will be trying to update the DFA at + once. The addDFAState method also locks inside the DFA lock but this + time on the shared context cache when it rebuilds the configurations' + PredictionContext objects using cached subgraphs/nodes. No other + locking occurs, even during DFA simulation. This is safe as long as we + can guarantee that all threads referencing s.edges[t] get the same + physical target DFA state, or none. Once into the DFA, the DFA + simulation does not reference the dfa.state map. It follows the + edges[] field to new targets. The DFA simulator will either find + dfa.edges to be null, to be non-null and dfa.edges[t] null, or + dfa.edges[t] to be non-null. The addDFAEdge method could be racing to + set the field but in either case the DFA simulator works; if the edge is + null, it requests ATN simulation. It could also race trying to get + dfa.edges[t], but either way it will work because it's not doing a + test-and-set operation. + + Starting with SLL then failing over to combined SLL/LL + + Sam pointed out that if SLL does not give a syntax error, then there + is no point in doing full LL, which is slower. We only have to try LL + if we get a syntax error.
For maximum speed, Sam starts the parser + with pure SLL mode: + + parser.getInterpreter().setSLL(true); + + and with the bail error strategy: + + parser.setErrorHandler(new BailErrorStrategy()); + + If it does not get a syntax error, then we're done. If it does get a + syntax error, we need to retry with the combined SLL/LL strategy. + + The reason this works is as follows. If there are no SLL conflicts + then the grammar is SLL for sure. If there is an SLL conflict, the + full LL analysis must yield a set of ambiguous alternatives that is no + larger than the SLL set. If the LL set is a singleton, then the + grammar is LL but not SLL. If the LL set is the same size as the SLL + set, the decision is SLL. If the LL set has size > 1, then that + decision is truly ambiguous on the current input. If the LL set is + smaller, then the SLL conflict resolution might choose an alternative + that the full LL would rule out as a possibility based upon better + context information. If that's the case, then the SLL parse will + definitely get an error because the full LL analysis says it's not + viable. If SLL conflict resolution chooses an alternative within the + LL set, then both SLL and LL would choose the same alternative because + they both choose the minimum of multiple conflicting alternatives. + + Let's say we have a set of SLL conflicting alternatives {1, 2, 3} and + a smaller LL set called s. If s is {2, 3}, then SLL parsing will get + an error because SLL will pursue alternative 1. If s is {1, 2} or {1, + 3} then both SLL and LL will choose the same alternative because + alternative one is the minimum of either set. If s is {2} or {3} then + SLL will get a syntax error. If s is {1} then SLL will succeed. + + Of course, if the input is invalid, then we will get an error for sure + in both SLL and LL parsing. Erroneous input will therefore require 2 + passes over the input.
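The set reasoning in the last two paragraphs can be checked with a small sketch. The helper names here are hypothetical, not ANTLR API; the only facts taken from the text above are that both SLL and LL resolve a conflict by choosing the minimum alternative, and that the SLL parse fails exactly when the alternative SLL picks lies outside the LL set.

```java
import java.util.SortedSet;

class SllVsLlSketch {
    // Both SLL and LL resolve a conflict set by choosing the minimum alternative.
    static int resolve(SortedSet<Integer> conflictSet) {
        return conflictSet.first();
    }

    // The SLL parse gets a syntax error on this decision exactly when the
    // alternative SLL picks has been ruled out by the full-LL analysis.
    static boolean sllGetsError(SortedSet<Integer> sllSet, SortedSet<Integer> llSet) {
        return !llSet.contains(resolve(sllSet));
    }
}
```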
- Predicates can be tested during SLL mode when we are sure that - the conflicted state is a true ambiguity not an unknown conflict. - This only happens with the special context circumstances mentioned above. */ public class ParserATNSimulator extends ATNSimulator { public static boolean debug = false; @@ -845,7 +953,7 @@ public class ParserATNSimulator extends ATNSimulator { } /** Look through a list of predicate/alt pairs, returning alts for the - * pairs that win. A {@code null} predicate indicates an alt containing an + * pairs that win. A {@code NONE} predicate indicates an alt containing an * unpredicated config which behaves as "always true." If !complete * then we stop at the first predicate that evaluates to true. This * includes pairs with null predicates. */ @@ -856,12 +964,11 @@ { IntervalSet predictions = new IntervalSet(); for (DFAState.PredPrediction pair : predPredictions) { - if ( pair.pred==null ) { // TODO: can't be null, can it? + if ( pair.pred==SemanticContext.NONE ) { predictions.add(pair.alt); if (!complete) { break; } - continue; } diff --git a/runtime/Java/src/org/antlr/v4/runtime/atn/PredictionContextCache.java b/runtime/Java/src/org/antlr/v4/runtime/atn/PredictionContextCache.java index 768034ce9..102758873 100644 --- a/runtime/Java/src/org/antlr/v4/runtime/atn/PredictionContextCache.java +++ b/runtime/Java/src/org/antlr/v4/runtime/atn/PredictionContextCache.java @@ -3,10 +3,9 @@ package org.antlr.v4.runtime.atn; import java.util.HashMap; import java.util.Map; -/** Used to cache PredictionContext objects. Its use for both the shared - * context cash associated with contacts in DFA states as well as the - * transient cash used for adaptivePredict(). This cache can be used for - * both lexers and parsers. +/** Used to cache PredictionContext objects. It's used for the shared + * context cache associated with contexts in DFA states. This cache + * can be used for both lexers and parsers.
*/ public class PredictionContextCache { protected Map cache =
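The predicate-filtering rule described in the simulator comment above ("strip away all alternatives with false predicates and choose the minimum alternative that remains") can be sketched roughly like this. PredPrediction here is a hypothetical stand-in for DFAState.PredPrediction, with a null predicate playing the role of SemanticContext.NONE, i.e. "always true"; it is not the actual ANTLR implementation.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

class PredFilterSketch {
    // Hypothetical stand-in for DFAState.PredPrediction: an alternative
    // guarded by a predicate; null plays the role of SemanticContext.NONE.
    static class PredPrediction {
        final BooleanSupplier pred; // null => unpredicated, acts as "always true"
        final int alt;
        PredPrediction(BooleanSupplier pred, int alt) {
            this.pred = pred;
            this.alt = alt;
        }
    }

    // Strip alternatives whose predicate evaluates to false and return the
    // minimum alternative that remains; zero survivors is an error.
    static int evalPredicates(List<PredPrediction> pairs) {
        int best = Integer.MAX_VALUE;
        for (PredPrediction p : pairs) {
            if (p.pred == null || p.pred.getAsBoolean()) {
                best = Math.min(best, p.alt);
            }
        }
        if (best == Integer.MAX_VALUE) {
            throw new RuntimeException("no alternative is semantically viable");
        }
        return best;
    }
}
```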