[WIP] Nascent design doc for on demand

2020-08-25 00:06:24 -07:00 · 2020-08-25 00:06:24 -07:00 · 4e3b4809ea
parent a90b8fb449
commit 4e3b4809ea
1 changed files with 315 additions and 0 deletions
--- a/doc/ondemand.md
+++ b/doc/ondemand.md
@@ -0,0 +1,315 @@
+I'll write a few of the design features / innovations here:
+
+## Classes
+
+In general, simdjson's parser classes are divided into two categories:
+
+* Persistent State:
+  - Generally persisted through multiple algorithm runs.
+  - Intended to be stored in memory.
+  - We try not to access members of these structs repeatedly if we can avoid it.
+  - We limit mallocs to these classes so they can be reused.
+  - Examples: dom::parser, ondemand::parser, dom_parser_implementation
+* Inline Algorithm Context:
+  - Context and counters for a single hot-loop algorithm (stage 1, stage 2, etc.).
+  - Member variables intended to be broken up and stored in registers where possible.
+  - We try to store as few things as we can manage to avoid register pressure.
+  - We never copy these (except perhaps during construction).
+  - We *only* pass references / pointers to really_inline functions, expecting the
+    compiler to eliminate the indirection.
+  - Examples: stage2::structural_parser, stage1::json_structural_indexer
+
+In ondemand, `ondemand::parser` and `dom_parser_implementation` are used to store persistent state,
+and all other classes are inline algorithm context.
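
As a rough illustration of the split, here is a minimal sketch. The names `toy_parser`, `structural_walker` and `count_open_brackets` are hypothetical and are not simdjson's actual classes. The persistent struct owns the allocation and is reused, while the context struct holds only a few scalars that the compiler can keep in registers once everything is inlined.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Persistent State: owns the allocations and is reused across runs.
// (Hypothetical stand-in, not the real ondemand::parser.)
struct toy_parser {
  std::vector<uint32_t> structural_indexes; // filled by a stage-1-like pass
};

// Inline Algorithm Context: a few scalars for one hot loop.
// It is built once per run and never copied afterwards.
struct structural_walker {
  const char *buf;       // JSON being walked
  const uint32_t *index; // next structural index
  const uint32_t *end;   // one past the last structural index

  inline bool at_end() const { return index == end; }
  inline char advance() { return buf[*index++]; }
};

// One "algorithm run": read what we need out of persistent state once,
// then loop on the small context only.
inline size_t count_open_brackets(const toy_parser &parser, const char *json) {
  const uint32_t *first = parser.structural_indexes.data();
  structural_walker walker{json, first, first + parser.structural_indexes.size()};
  size_t count = 0;
  while (!walker.at_end()) {
    if (walker.advance() == '[') { count++; }
  }
  return count;
}
```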
+
+### ondemand::parser / dom_parser_implementation
+
+On-demand parsing has several primary classes that keep the main state:
+
+* `ondemand::parser`: Holds resources for parsing.
+  - Persistent State.
+  - `parser.parse()` calls stage 1 and then returns `ondemand::document` for you to iterate.
+  - `structural_indexes`: Buffer for structural indexes for stage 1.
+  - `string_buf`: Allocates string buffer for fast string parsing.
+  - `buf`, `len`: cached pointer to JSON buffer (and length) being parsed.
+  - `current_string_buf_loc`: The current append position in string_buf (see string parsing, later).
+    Like buf and len, this is reset on every parse.
+* `json_iterator`: Low-level iteration over JSON structurals from stage 1.
+  - Not user-facing (though you can call doc.iterate() to get it right now, I don't think we want to
+    expose that API unless we absolutely must).
+  - Inline Algorithm Context. Stored internally in `document`.
+  - `buf`: Pointer to JSON buffer, intended to be stored in a register since it's used so much.
+    (This is also in dom_parser_implementation, but that's stored in memory and we're avoiding
+    indirection here.)
+  - `index`: Pointer to next structural index.
+  - NOTE: probably *should* have a cached (or primary) copy of current_string_buf_loc for registerness.
+
+### Value
+
+value represents a value whose type is not yet known. The On Demand philosophy is to let the *user*
+tell us what type the value is by asking us to convert it, and *then* check the type. A value can
+be converted to an array or object, parsed to string, double, uint64_t, int64_t, or bool, or checked
+for is_null().
+
+* **Arrays** can be used with get_array(). More on that in the array section.
+* **Objects** can be used with get_object(). More on that in the object section.
+* **Strings** are parsed with get_string(), which parses the string, appending the unescaped value
+  into a buffer allocated by the parser, and returning a `std::string_view` pointing at the buffer
+  and telling you the location. We append a terminating `\0` in case you end up using these values
+  with routines designed for C strings (the string_view's length of course does not include that
+  `\0`).
+  
+  These string_views *will be invalidated on the next parse,* so if you want to persist a copy,
+  you'll need to allocate your own actual strings. This case is designed to be fast in the
+  server-like scenario, where you parse a document, use the values, and only then move on to parse
+  another one.
+
+  Optionally, get_raw_json_string() gives you a pointer to the raw JSON string, with any escapes
+  left unprocessed, allowing you to work with strings efficiently in cases where your format
+  disallows escaping--for example, enumeration values and GUIDs.
+* **Numbers** can be parsed with get_double(), get_uint64() and get_int64(), each of which is a separate,
+  battle-tested parser specifically targeted to the type. The algorithms give exact answers, even for
+  floating-point numbers, and check for overflow or invalid characters.
+* **true, false and null** are parsed through get_bool() and is_null(). simdjson checks these quickly by
+  reading the next 4 characters as a 32-bit integer and comparing that to `true`, `fals`, or `null`. Then it
+  checks the extra "e" in false, and ensures that the next character is either whitespace or a JSON
+  operator (`,` `:` `]` or `}`).
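
The atom check described above can be sketched roughly as follows. The helper names (`load4`, `is_structural_or_whitespace`, `is_true_atom`, and so on) are made up for illustration and are not simdjson's actual functions. The sketch also assumes that at least 6 bytes are readable at `json`, which holds when the input buffer is padded.

```cpp
#include <cstdint>
#include <cstring>

// Load 4 bytes as one 32-bit integer. Both sides of each comparison below are
// loaded the same way, so byte order does not matter.
inline uint32_t load4(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Whitespace or one of the JSON operators that may legally follow an atom.
inline bool is_structural_or_whitespace(char c) {
  switch (c) {
    case ' ': case '\t': case '\n': case '\r':
    case ',': case ':': case ']': case '}':
      return true;
    default:
      return false;
  }
}

inline bool is_true_atom(const char *json) {
  return load4(json) == load4("true") && is_structural_or_whitespace(json[4]);
}

inline bool is_false_atom(const char *json) {
  // Compare "fals" in one shot, then check the extra 'e'.
  return load4(json) == load4("fals") && json[4] == 'e' &&
         is_structural_or_whitespace(json[5]);
}

inline bool is_null_atom(const char *json) {
  return load4(json) == load4("null") && is_structural_or_whitespace(json[4]);
}
```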
+
+### Document
+
+document represents the top-level value in the document. It behaves much like a value but also
+does double duty, storing iteration state. It can be converted to an array or object, parsed to
+string, double, uint64_t, int64_t, or bool, or checked for is_null().
+
+The document *must* be kept around during iteration, as all other iterator classes rely on it to
+provide iterator state. 
+
+If the document itself is a single number, boolean, or null, the parsing algorithm is very slightly
+different from the value parsers, which rely on there being more JSON after the value. To
+accommodate, simdjson copies the number or atom into a small buffer, places a space after it, and
+then runs the normal algorithm. Strings, arrays and objects at the root all use exactly the same
+code paths as non-root values. (NOTE: I haven't implemented this difference yet.)
+
+### Array
+
+The `ondemand::array` object lets you iterate the values of an array in a forward-only manner:
+
+```c++
+for (object tweet : doc["tweets"].get_array()) {
+  ...
+}
+```
+
+This is equivalent to:
+
+```c++
+array tweets = doc["tweets"].get_array();
+array::iterator iter = tweets.begin();
+array::iterator end = tweets.end();
+while (iter != end) {
+  object tweet = *iter;
+  ...
+  iter++;
+}
+```
+
+The way you *parse* an array is somewhat split into pieces by the iterator. Here is how we section
+the work to parse and validate the array:
+
+1. `get_array()`:
+   - Validates that this is an array (starts with `[`). Returns INCORRECT_TYPE if not.
+   - If the array is empty (followed by `]`), advances the iterator past the `]` and returns an
+     array with finished == true.
+   - If the array is not empty, returns an array with finished == false. Iterator remains pointed
+     at the first value (the token just after `[`).
+2. `tweets.begin()`, `tweets.end()`: Returns an `array::iterator`, which just points back at the
+   array object.
+3. `while (iter != end)`: Stops if finished == true.
+4. `*iter`: Yields the value (or error).
+   - If there is an error, returns it and sets finished == true.
+   - Returns a value object, advancing the iterator just past the value token (if it is `[`, `{`,
+     etc. then that will be handled when the value is converted to an array/object).
+5. `iter++`: Expects the iterator to have finished with the previous array value, and looks for `,` or `]`.
+   - Advances the iterator and gets the JSON `]` or `,`.
+   - If the array just ended (there is a `]`), sets finished == true.
+   - If the array continues (`,`), does nothing.
+   - If anything else is there (`true`, `:`, `}`, etc.), sets error = TAPE_ERROR.
+   - #3 gets run next.
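
The five steps can be modeled with a small toy. This is a sketch, not the real implementation: `token_cursor`, `toy_array` and `collect` are hypothetical names, the cursor walks a pre-split token list instead of structural indexes, and the token list is assumed to be properly terminated (no bounds checks). Marking the array as finished when `get_array()` fails is a simplification of this sketch.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

enum class error_code { SUCCESS, INCORRECT_TYPE, TAPE_ERROR };

// Stand-in for json_iterator.
struct token_cursor {
  const std::vector<std::string_view> *tokens;
  size_t index;
  std::string_view peek() const { return (*tokens)[index]; }
  std::string_view advance() { return (*tokens)[index++]; }
};

struct toy_array {
  token_cursor *iter;
  bool finished;
  error_code error;

  // Step 1: validate `[` and detect the empty array.
  static toy_array start(token_cursor &cursor) {
    if (cursor.advance() != "[") { return {&cursor, true, error_code::INCORRECT_TYPE}; }
    if (cursor.peek() == "]") { cursor.advance(); return {&cursor, true, error_code::SUCCESS}; }
    return {&cursor, false, error_code::SUCCESS};
  }
  // Step 4: yield the value token, or surface a pending error and stop.
  error_code value(std::string_view &out) {
    if (error != error_code::SUCCESS) { finished = true; return error; }
    out = iter->advance();
    return error_code::SUCCESS;
  }
  // Step 5: expect `,` or `]` after the previous value.
  void next() {
    std::string_view token = iter->advance();
    if (token == "]") { finished = true; }
    else if (token != ",") { error = error_code::TAPE_ERROR; }
  }
};

// Drives steps 1, 3, 4 and 5; returns the first error encountered.
inline error_code collect(token_cursor &cursor, std::vector<std::string_view> &out) {
  toy_array array = toy_array::start(cursor);
  while (!array.finished) {          // step 3
    std::string_view v;
    error_code e = array.value(v);   // step 4
    if (e != error_code::SUCCESS) { return e; }
    out.push_back(v);
    array.next();                    // step 5
  }
  return array.error;
}
```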
+
+#### Error Chaining
+
+When you iterate over an array or object with an error, the error is yielded in the loop.
+
+
+* `document`: Represents the root value of the document, as well as iteration state.
+  - Inline Algorithm Context. MUST be kept around for the duration of the algorithm.
+  - `iter`: The `json_iterator`, which handles low-level iteration.
+  - `parser`: A pointer to the parser to get at persistent state.
+  - Can be converted to `value` (and by proxy, to array/object/string/number/bool/null).
+  - Can be iterated (like `value`, assumes it is an array).
+  - Safety: can only be converted to `value` once. This prevents double iteration of arrays/objects
+    (which will cause inconsistent iterator state) or double-unescaping strings (which could cause
+    memory overruns if done too much). NOTE: this is actually not ideal; it means you can't do
+    `if (document.parse_double().error() == INCORRECT_TYPE) { document.parse_string(); }`
+
+The `value` class is expected to be temporary (NOT kept around), used to check the type of a value
+and parse it to another type. It can convert to array or object, and has helpers to parse string,
+double, int64, uint64, boolean and is_null().
+
+`value` does not check the type of the value ahead of time. Instead, when you ask for a value of a
+given type, it tries to parse as that type, treating a parser error as INCORRECT_TYPE. For example,
+if you call get_int64() on the JSON string `"123"`, it will fail because it is looking for either a
+`-` or digit at the front. The philosophy here is "proceed AS IF the JSON has the desired type, and
+have an unlikely error branch if not." Since most files have well-known types, this avoids
+unnecessary branchiness and extra checks for the value type.
+
+It does preemptively advance the iterator and store the pointer to the JSON, allowing you to attempt
+to parse more than one type. This is how you do nullable ints, for example: check is_null() and
+then get_int64(). If we didn't store the JSON, and instead did `iter.advance()` during is_null(),
+you would get garbage if you then did get_int64().
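
Here is a minimal sketch of the "try one type, then another" pattern for a nullable int. `toy_value` and `get_nullable_int64` are hypothetical names. The integer parser is deliberately simplified (no overflow checks), and the sketch assumes at least 4 readable bytes at `json`. The point is only that both attempts start from the same stored pointer.

```cpp
#include <cstdint>
#include <cstring>

enum class error_code { SUCCESS, INCORRECT_TYPE };

// Hypothetical stand-in for `value`: it only remembers where the value's JSON starts.
struct toy_value {
  const char *json;

  bool is_null() const { return std::memcmp(json, "null", 4) == 0; }

  // Proceeds as if the JSON were an integer; not finding a digit is the
  // unlikely branch that reports INCORRECT_TYPE.
  error_code get_int64(int64_t &out) const {
    const char *p = json;
    bool negative = (*p == '-');
    if (negative) { p++; }
    if (*p < '0' || *p > '9') { return error_code::INCORRECT_TYPE; }
    uint64_t magnitude = 0;
    while (*p >= '0' && *p <= '9') {
      magnitude = magnitude * 10 + uint64_t(*p - '0');
      p++;
    }
    out = negative ? -int64_t(magnitude) : int64_t(magnitude);
    return error_code::SUCCESS;
  }
};

// Nullable int: check is_null() first, then fall back to get_int64().
inline error_code get_nullable_int64(const toy_value &v, bool &is_null, int64_t &out) {
  is_null = v.is_null();
  if (is_null) { return error_code::SUCCESS; }
  return v.get_int64(out);
}
```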
+
+## 
+
+until it's asked to convert, in which case it proceeds
+    *expecting* the value is of the given type, treating an error in parsing as "it must be some
+    other type." This saves the extra explicit type check in the common case where you already know
+    saving the if/switch statement
+    and . We find out it's not a double
+    when the double parser says "I couldn't find any digits and I don't understand what this `t`
+    character is for."
+  - The iterator has already been advanced past the value token. If it's { or [, the iterator is just
+    after the open brace.
+  - Can be parsed as a string, double, int, unsigned, boolean, or `is_null()`, in which case a parser is run
+    and the value is returned.
+  - Can be converted to array or object iterator.
+* `object`: Represents an object iteration.
+  - `doc`: A pointer to the document (and the iterator).
+  - `at_start`: boolean saying whether we're at the start of the object, used for `[]`.
+  - Can do order-sensitive field lookups.
+  - Can iterate over key/value pairs.
+* `array`: Represents an array iteration.
+  - `doc`: A pointer to the document (and the iterator).
+  
+
+    
+  - Can be returned as a `raw_json_string`, so that people who want to check JSON strings without
+    unescaping can do so.
+  - Can be converted to array or object
+     and keep member variables in the same registers.
+    In fact, , and . Usually based on whether they are persistent--i.e. we intend them
+to be stored in memory--or algorithmic--
+(which generally means we intend to persist them), or algorithm-lifetime, possibly
+
+- persistent classes
+- non-persistent 
+`ondemand::parser` is the equivalent of `dom::parser`, and . 
+`json_iterator` handles all iteration state.
+
+### json_iterator / document
+
+`document` owns the json_iterator and the . I'm considering moving this into the json_iterator so there's one
+less class that requires persistent ownership, however.
+
+### String unescaping
+
+When the user requests strings, we unescape them to a single string_buf (much like the DOM parser)
+so that users enjoy the same string performance as the core simdjson. The current_string_buf_loc is
+presently stored in the 
+
+We do not write the length to the string buffer, however; that is stored in the string_view we
+return to the user, and immediately forgotten.
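
A rough sketch of the append-only buffer idea follows. `string_buffer` is a hypothetical name, `loc` plays the role of current_string_buf_loc, and the unescaper is a plain byte loop that skips `\u` escapes and all validation, so it is nothing like the real fast path. It assumes the buffer is at least as large as the JSON input, which is enough because each string gives up two quote characters and gains only one `\0`.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

struct string_buffer {
  std::vector<char> buf;
  size_t loc; // current append position; reset on every parse

  explicit string_buffer(size_t capacity) : buf(capacity), loc(0) {}

  void reset() { loc = 0; }

  // `src` points just past the opening quote of a JSON string.
  std::string_view unescape(const char *src) {
    char *start = buf.data() + loc;
    char *dst = start;
    while (*src != '"') {
      char c = *src++;
      if (c == '\\') {
        char e = *src++;
        switch (e) {
          case 'b': c = '\b'; break;
          case 'f': c = '\f'; break;
          case 'n': c = '\n'; break;
          case 'r': c = '\r'; break;
          case 't': c = '\t'; break;
          default:  c = e;    break; // covers \" \\ and \/
        }
      }
      *dst++ = c;
    }
    *dst = '\0'; // terminator for C-string consumers; not counted in the view
    size_t len = size_t(dst - start);
    loc += len + 1;
    // The length lives only in the returned view; it is not written to the buffer.
    return std::string_view(start, len);
  }
};
```

Views returned before `reset()` point into the same buffer, which is why they stop being valid once the next parse starts overwriting it.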
+
+Presently, we use the `char string_buf[]` to write
+
+ The top-level document stores a json_iterator inside, which 
+
+It is illegal to move the document after it has been iterated (otherwise, pointers would be
+invalidated). Note: I need to check that the current code actually checks this.
+
+If it's stored in a `simdjson_result<document>` (as it would be in `auto doc = parser.parse(json)`),
+the document pointed to by children is the one inside the simdjson_result, and the simdjson_result,
+therefore, can't be moved either.
+
+### Object Iteration
+
+Because the C++ iterator contract requires iterators to be const-assignable and const-constructible,
+object and array iterators are separate classes from the object/array itself, and have an interior
+mutable reference to it.
+
+### Array Iteration
+
+### Iteration Safety
+
+  - If the value fails to be parsed as one type, you can try to parse it as something else until you
+    succeed.
+  - Safety: If the value succeeds in being parsed or converted to a type, you cannot try again. (It
+    sets `json` to null, so you will segfault.) This prevents double iteration of an array (which
+    will cause inconsistent iterator state) or double-unescaping a string (which could cause memory
+    overruns if done too much).
+  - Guaranteed Iteration: If you discard a value without using it--perhaps you just wanted to know
+    if it was null but didn't care what the actual value was--it will iterate. See 
+
+This is an area I'm chomping at the bit for Rust, actually ... a whole lot of this would be safer if
+we had compiler-enforced borrowing.
+
+### Raw Key Lookup
+
+### Skip Algorithm
+
+### Root Scalar Parsing Without Malloc
+
+The malloc when we parse the number / atoms at the root of the document has always bothered me a little, so I wrote alternate routines that use a stack-based buffer instead, based on the type in question. Atoms require no more than 6 bytes; integers no more than 21; and floats ... well, I [wanted your opinion on that, actually](https://github.com/simdjson/simdjson/pull/947/files#diff-979f6706620f56f5d6a45ca3bf511669R166). I wanted to set a limit on the biggest possible float, and came up with:
+
+> Per https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/, 1074 is the maximum number of fractional digits needed to distinguish any two doubles (including zeroes and significant digits). Add 8 more digits for the other stuff (`-0.<fraction>e-308`) -- and you get 1082.
+
+Technically, people could add an arbitrary number of digits after that ... but we could actually scan for that and ignore them, if we wanted. I know it's a lot of convolutions to avoid malloc / free, but I think there are really good reasons to have a 100% malloc-free library (well, we have integration points that malloc, but they are predictable and could easily be swapped out).
+
+I considered just using separate algorithms, and I think for numbers in particular there is probably a way to do that without having two separate functions, but I haven't figured out the *clean* way yet.
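
For the integer case, the stack-buffer approach might look roughly like this. It is a sketch under assumptions: `copy_with_trailing_space` and `parse_root_int64` are hypothetical names, the digit loop stands in for the real number parser and does no overflow checking, and reading the 21-byte limit as "20 characters plus the trailing space" is my interpretation. The trailing space is what lets the loop stop without knowing the length.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Limits quoted in the text above.
constexpr size_t ATOM_BUF_LEN = 6;
constexpr size_t INT_BUF_LEN = 21;
constexpr size_t DOUBLE_BUF_LEN = 1074 + 8;
static_assert(DOUBLE_BUF_LEN == 1082, "1074 fractional digits plus 8 other characters");

// Copies a root scalar into a fixed-size stack buffer and places a space after it.
// Fails instead of allocating when the scalar does not fit.
template <size_t N>
inline bool copy_with_trailing_space(const char *json, size_t len, char (&tmp)[N]) {
  if (len >= N) { return false; }
  std::memcpy(tmp, json, len);
  tmp[len] = ' ';
  return true;
}

inline bool parse_root_int64(const char *json, size_t len, int64_t &out) {
  char tmp[INT_BUF_LEN];
  if (!copy_with_trailing_space(json, len, tmp)) { return false; }
  const char *p = tmp;
  bool negative = (*p == '-');
  if (negative) { p++; }
  if (*p < '0' || *p > '9') { return false; }
  uint64_t magnitude = 0;
  while (*p >= '0' && *p <= '9') {
    magnitude = magnitude * 10 + uint64_t(*p - '0');
    p++;
  }
  if (*p != ' ') { return false; } // must end exactly at the space we placed
  out = negative ? -int64_t(magnitude) : int64_t(magnitude);
  return true;
}
```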