[WIP] Nascent design doc for on demand

John Keiser 2020-08-25 00:06:24 -07:00
parent a90b8fb449
commit 4e3b4809ea
doc/ondemand.md Normal file
I'll write a few of the design features / innovations here:
## Classes
In general, simdjson's parser classes are divided into two categories:
* Persistent State:
  - Generally persisted through multiple algorithm runs.
  - Intended to be stored in memory.
  - We try not to access members of these structs repeatedly if we can avoid it.
  - We limit mallocs to these classes so they can be reused.
  - Examples: dom::parser, ondemand::parser, dom_parser_implementation
* Inline Algorithm Context:
  - Context and counters for a single hot-loop algorithm (stage 1, stage 2, etc.).
  - Member variables intended to be broken up and stored in registers where possible.
  - We try to store as few things as we can manage to avoid register pressure.
  - We never copy these (except perhaps during construction).
  - We *only* pass references / pointers to really_inline functions, expecting the
    compiler to eliminate the indirection.
  - Examples: stage2::structural_parser, stage1::json_structural_indexer

In ondemand, `ondemand::parser` and `dom_parser_implementation` are used to store persistent state,
and all other classes are inline algorithm context.
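
To make the split concrete, here is a minimal sketch of the two categories. Everything in it is hypothetical (`persistent_buffers`, `scan_context`, `record_if_structural` and `index_structurals` are illustration names, not simdjson classes), and the "stage 1" here is a toy that only records punctuation and does not understand strings or scalars.

```c++
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical persistent state: lives in memory, owns the only allocation,
// and is reused across runs.
struct persistent_buffers {
  std::unique_ptr<uint32_t[]> structural_indexes;
  size_t capacity = 0;

  // Reallocates only when a run needs more room than any previous run did.
  void ensure_capacity(size_t needed) {
    if (needed > capacity) {
      structural_indexes.reset(new uint32_t[needed]);
      capacity = needed;
    }
  }
};

// Hypothetical inline algorithm context: two members, created once per run,
// never copied, small enough that the compiler can keep it in registers.
struct scan_context {
  const char *buf;      // JSON being scanned
  uint32_t *next_index; // where the next structural index gets written
};

// The context is only ever passed by reference to small inline functions.
inline void record_if_structural(scan_context &ctx, size_t pos) {
  const char c = ctx.buf[pos];
  if (c == '[' || c == ']' || c == '{' || c == '}' || c == ',' || c == ':') {
    *ctx.next_index++ = static_cast<uint32_t>(pos);
  }
}

// Toy hot loop: returns how many structural indexes were recorded.
inline size_t index_structurals(persistent_buffers &state, const char *json, size_t len) {
  state.ensure_capacity(len);
  scan_context ctx{json, state.structural_indexes.get()};
  for (size_t pos = 0; pos < len; pos++) {
    record_if_structural(ctx, pos);
  }
  return static_cast<size_t>(ctx.next_index - state.structural_indexes.get());
}
```

Calling `index_structurals` twice on the same `persistent_buffers` reuses the first allocation as long as the second input is no longer than the first, which is the point of confining mallocs to the persistent classes.
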
### ondemand::parser / dom_parser_implementation
On-demand parsing has several primary classes that keep the main state:
* `ondemand::parser`: Holds resources for parsing.
  - Persistent State.
  - `parser.parse()` calls stage 1 and then returns `ondemand::document` for you to iterate.
  - `structural_indexes`: Buffer for structural indexes for stage 1.
  - `string_buf`: Allocated string buffer for fast string parsing.
  - `buf`, `len`: Cached pointer to the JSON buffer being parsed, and its length.
  - `current_string_buf_loc`: The current append position in `string_buf` (see string parsing, later).
    Like `buf` and `len`, this is reset on every parse.
* `json_iterator`: Low-level iteration over JSON structurals from stage 1.
  - Not user-facing. (You can call `doc.iterate()` to get it right now, but I don't think we want to
    expose that API unless we absolutely must.)
  - Inline Algorithm Context. Stored internally in `document`.
  - `buf`: Pointer to JSON buffer, intended to be stored in a register since it's used so much.
    (This is also in dom_parser_implementation, but that's stored in memory and we're avoiding
    indirection here.)
  - `index`: Pointer to next structural index.
  - NOTE: probably *should* have a cached (or primary) copy of `current_string_buf_loc` so it can
    live in a register.
### Value
`value` represents a value whose type is not yet known. The On Demand philosophy is to let the *user*
tell us what type the value is by asking us to convert it, and *then* check the type. A value can
be converted to an array or object, parsed to string, double, uint64_t, int64_t, or bool, or checked
for is_null().
* **Arrays** can be used with get_array(). More on that in the array section.
* **Objects** can be used with get_object(). More on that in the object section.
* **Strings** are parsed with get_string(), which parses the string, appends the unescaped value
to a buffer allocated by the parser, and returns a `std::string_view` pointing at its location
in that buffer. We append a terminating `\0` in case you end up using these values
with routines designed for C strings (the string_view's length of course does not include that
`\0`).
These string_views *will be invalidated on the next parse,* so if you want to persist a copy,
you'll need to allocate your own actual strings. This case is designed to be fast in the
server-like scenario, where you parse a document, use the values, and only then move on to parse
another one.
Optionally, get_raw_json_string() gives you a pointer to the raw JSON string, with escapes left as
written, allowing you to work with strings efficiently in cases where your format disallows
escaping--for example, enumeration values and GUIDs.
* **Numbers** can be parsed with get_double(), get_uint64() and get_int64(), each of which is a separate,
battle-tested parser specifically targeted to the type. The algorithms give exact answers, even for
floating-point numbers, and check for overflow or invalid characters.
* **true, false and null** are parsed through get_bool() and is_null(). simdjson checks these quickly by
reading the next 4 characters as a 32-bit integer and comparing that to `true`, `fals`, or `null`. Then it
checks the extra `e` in `false`, and ensures that the next character is either whitespace or a JSON
operator (`,` `:` `]` or `}`).
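
A minimal sketch of the atom check described in the last bullet. The helper names (`load4`, `ends_atom`, `is_true_atom`, `is_false_atom`, `is_null_atom`) are made up for illustration and are not the library's functions; the sketch also assumes the caller guarantees enough readable bytes after the atom, which is the normal case for a value nested inside an array or object.

```c++
#include <cstdint>
#include <cstring>

// Load 4 bytes as one 32-bit integer. memcpy sidesteps alignment and aliasing
// problems and compiles down to a single load on common targets.
inline uint32_t load4(const char *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// True if c may follow an atom: whitespace, or one of , : ] }
inline bool ends_atom(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
         c == ',' || c == ':' || c == ']' || c == '}';
}

// Both sides go through the same load, so the comparison is endian-neutral.
inline bool is_true_atom(const char *json) {
  return load4(json) == load4("true") && ends_atom(json[4]);
}

inline bool is_null_atom(const char *json) {
  return load4(json) == load4("null") && ends_atom(json[4]);
}

// "false" is 5 characters: compare "fals" as an integer, then the extra 'e'.
inline bool is_false_atom(const char *json) {
  return load4(json) == load4("fals") && json[4] == 'e' && ends_atom(json[5]);
}
```

The terminator check is what rejects input such as `truex` or `nullable`, where the first four bytes match but the token keeps going.
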
### Document
`document` represents the top-level value in the document. It behaves much like a value but also does
double duty, storing iteration state. It can be converted to an array or object, parsed to
string, double, uint64_t, int64_t, or bool, or checked for is_null().
The document *must* be kept around during iteration, as all other iterator classes rely on it to
provide iterator state.
If the document itself is a single number, boolean, or null, the parsing algorithm is very slightly
different from the value parsers, which rely on there being more JSON after the value. To
accommodate, simdjson copies the number or atom into a small buffer, places a space after it, and
then runs the normal algorithm. Strings, arrays and objects at the root all use exactly the same
code as nested values. (NOTE: I haven't implemented this difference yet.)
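
As a rough sketch of the copy-and-pad step (the function name `copy_root_scalar` and its signature are assumptions for illustration; as noted, the real thing is not implemented yet):

```c++
#include <cstddef>
#include <cstring>

// Copies a root-level number or atom into a caller-provided buffer and places a
// space after it, so a parser that expects more JSON after the value finds a
// valid terminator. Returns false if the scalar plus the space does not fit.
template <size_t N>
inline bool copy_root_scalar(const char *json, size_t len, char (&tmp)[N]) {
  if (len + 1 > N) { return false; }
  std::memcpy(tmp, json, len);
  tmp[len] = ' ';
  return true;
}
```

After a successful call, the normal value parser can be pointed at `tmp` instead of the original buffer.
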
### Array
The `ondemand::array` object lets you iterate the values of an array in a forward-only manner:
```c++
for (object tweet : doc["tweets"].get_array()) {
...
}
```
This is equivalent to:
```c++
array tweets = doc["tweets"].get_array();
array::iterator iter = tweets.begin();
array::iterator end = tweets.end();
while (iter != end) {
object tweet = *iter;
...
iter++;
}
```
The way you *parse* an array is somewhat split into pieces by the iterator. Here is how we section
the work to parse and validate the array:
1. `get_array()`:
   - Validates that this is an array (starts with `[`). Returns INCORRECT_TYPE if not.
   - If the array is empty (followed by `]`), advances the iterator past the `]` and returns an
     array with finished == true.
   - If the array is not empty, returns an array with finished == false. Iterator remains pointed
     at the first value (the token just after `[`).
2. `tweets.begin()`, `tweets.end()`: Returns an `array::iterator`, which just points back at the
   array object.
3. `while (iter != end)`: Stops if finished == true.
4. `*iter`: Yields the value (or error).
   - If there is an error, returns it and sets finished == true.
   - Returns a value object, advancing the iterator just past the value token (if it is `[`, `{`,
     etc. then that will be handled when the value is converted to an array/object).
5. `iter++`: Expects the iterator to have finished with the previous array value, and looks for `,` or `]`.
   - Advances the iterator and gets the JSON `]` or `,`.
   - If the array just ended (there is a `]`), sets finished == true.
   - If the array continues (`,`), does nothing.
   - If anything else is there (`true`, `:`, `}`, etc.), sets error = TAPE_ERROR.
   - #3 gets run next.
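
The steps above can be sketched as a tiny state machine. This is a toy under loud assumptions: every character of the input counts as one token (so scalars are single digits), `toy_array`, `current_value`, `advance` and `collect` are invented names, and the sketch stops the loop immediately on a bad separator rather than yielding the error from the next `*iter`.

```c++
#include <string>

enum error_code { SUCCESS, INCORRECT_TYPE, TAPE_ERROR };

// Toy stand-in for the array object: a token cursor plus the finished/error flags.
struct toy_array {
  const char *next; // next token to read
  bool finished;
  error_code error;
};

// Step 1: validate '[' and detect the empty array.
inline toy_array get_array(const char *tokens) {
  if (*tokens != '[') { return {tokens, true, INCORRECT_TYPE}; }
  tokens++;
  if (*tokens == ']') { return {tokens + 1, true, SUCCESS}; }
  return {tokens, false, SUCCESS}; // cursor stays on the first value
}

// Step 4: yield the value token and move just past it.
inline char current_value(toy_array &a) { return *a.next++; }

// Step 5: consume the separator and decide whether the array goes on.
inline void advance(toy_array &a) {
  const char c = *a.next++;
  if (c == ']') {
    a.finished = true;
  } else if (c != ',') {
    a.error = TAPE_ERROR;
    a.finished = true; // simplification: stop right away
  }
}

// Steps 2 and 3: the loop a range-based for would expand to.
inline std::string collect(const char *tokens, error_code &error) {
  toy_array a = get_array(tokens);
  std::string values;
  while (!a.finished) {
    values += current_value(a);
    advance(a);
  }
  error = a.error;
  return values;
}
```

Note that all of the state lives in the array object; the iterator in the real design only points back at it, which is why `begin()` and `end()` are so cheap.
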
#### Error Chaining
When you iterate over an array or object with an error, the error is yielded in the loop
* `document`: Represents the root value of the document, as well as iteration state.
  - Inline Algorithm Context. MUST be kept around for the duration of the algorithm.
  - `iter`: The `json_iterator`, which handles low-level iteration.
  - `parser`: A pointer to the parser to get at persistent state.
  - Can be converted to `value` (and by proxy, to array/object/string/number/bool/null).
  - Can be iterated (like `value`, assumes it is an array).
  - Safety: can only be converted to `value` once. This prevents double iteration of arrays/objects
    (which will cause inconsistent iterator state) or double-unescaping strings (which could cause
    memory overruns if done too much). NOTE: this is actually not ideal; it means you can't do
    `if (document.parse_double().error() == INCORRECT_TYPE) { document.parse_string(); }`.

The `value` class is expected to be temporary (NOT kept around), used to check the type of a value
and parse it to another type. It can convert to array or object, and has helpers to parse string,
double, int64, uint64, boolean and is_null().
`value` does not check the type of the value ahead of time. Instead, when you ask for a value of a
given type, it tries to parse as that type, treating a parser error as INCORRECT_TYPE. For example,
if you call get_int64() on the JSON string `"123"`, it will fail because it is looking for either a
`-` or digit at the front. The philosophy here is "proceed AS IF the JSON has the desired type, and
have an unlikely error branch if not." Since most files have well-known types, this avoids
unnecessary branchiness and extra checks for the value type.
It does preemptively advance the iterator and store the pointer to the JSON, allowing you to attempt
to parse more than one type. This is how you do nullable ints, for example: check is_null() and
then get_int64(). If we didn't store the JSON, and instead did `iter.advance()` during is_null(),
you would get garbage if you then did get_int64().
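
A minimal sketch of why storing the JSON pointer makes the nullable-int pattern work. `toy_value` and its members are hypothetical stand-ins, not the real `value` class, and the integer parser omits overflow checking for brevity.

```c++
#include <cstdint>
#include <cstring>

// Toy value: remembers where its token starts. The iterator has already moved
// on, so trying one type and then another always re-reads the same bytes.
struct toy_value {
  const char *json;

  bool is_null() const { return std::strncmp(json, "null", 4) == 0; }

  // "Proceed as if it is an integer": the only type check is the parser
  // failing to find a '-' or a digit. Returns false in that (unlikely) case,
  // standing in for INCORRECT_TYPE. Overflow is NOT checked in this sketch.
  bool get_int64(int64_t &out) const {
    const char *p = json;
    const bool negative = (*p == '-');
    if (negative) { p++; }
    if (*p < '0' || *p > '9') { return false; }
    int64_t v = 0;
    for (; *p >= '0' && *p <= '9'; p++) {
      v = v * 10 + (*p - '0');
    }
    out = negative ? -v : v;
    return true;
  }
};
```

A nullable int is then `if (!v.is_null()) { v.get_int64(x); }`: both calls start from the same `json` pointer, so the failed or successful null check cannot disturb the integer parse.
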
`value` does not check the type until it's asked to convert, in which case it proceeds
*expecting* the value is of the given type, treating an error in parsing as "it must be some
other type." This saves the extra explicit type check (the if/switch statement) in the common
case where you already know the type. We find out it's not a double
when the double parser says "I couldn't find any digits and I don't understand what this `t`
character is for."
- The iterator has already been advanced past the value token. If it's { or [, the iterator is just
after the open brace.
- Can be parsed as a string, double, int, unsigned, boolean, or `is_null()`, in which case a parser is run
and the value is returned.
- Can be converted to array or object iterator.
* `object`: Represents an object iteration.
- `doc`: A pointer to the document (and the iterator).
- `at_start`: boolean saying whether we're at the start of the object, used for `[]`.
- Can do order-sensitive field lookups.
- Can iterate over key/value pairs.
* `array`: Represents an array iteration.
- `doc`: A pointer to the document (and the iterator).
- Can be returned as a `raw_json_string`, so that people who want to check JSON strings without
unescaping can do so.
- Can be converted to array or object
and keep member variables in the same registers.
In fact, , and . Usually based on whether they are persistent--i.e. we intend them
to be stored in memory--or algorithmic--
(which generally means we intend to persist them), or algorithm-lifetime, possibly
- persistent classes
- non-persistent
`ondemand::parser` is the equivalent of `dom::parser`, and .
`json_iterator` handles all iteration state.
### json_iterator / document
`document` owns the json_iterator and the . I'm considering moving this into the json_iterator so there's one
less class that requires persistent ownership, however.
### String unescaping
When the user requests strings, we unescape them to a single string_buf (much like the DOM parser)
so that users enjoy the same string performance as the core simdjson. The current_string_buf_loc is
presently stored in the
We do not write the length to the string buffer, however; that is stored in the string_view we
return to the user, and immediately forgotten.
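
A sketch of the append-only string buffer idea, assuming C++17 for `std::string_view`. `toy_string_buf` is an invented name with a fixed 256-byte array standing in for the parser-owned `string_buf`, and only a handful of single-character escapes are handled (no `\u` sequences).

```c++
#include <cstddef>
#include <string_view>

struct toy_string_buf {
  char storage[256];
  char *current_string_buf_loc = storage; // next append position

  // Called at the start of every parse; invalidates all earlier string_views.
  void reset() { current_string_buf_loc = storage; }

  // src points just past the opening quote. Appends the unescaped bytes plus a
  // terminating '\0' and returns a view of them. The length lives only in the
  // returned view; it is never written to the buffer. Returns a view with
  // data() == nullptr on malformed input or when the buffer is full.
  std::string_view unescape(const char *src) {
    char *const start = current_string_buf_loc;
    char *const end = storage + sizeof(storage);
    char *dst = start;
    while (*src != '"') {
      if (*src == '\0' || dst + 1 >= end) { return {}; }
      char c = *src++;
      if (c == '\\') {
        switch (*src++) {
          case 'n': c = '\n'; break;
          case 't': c = '\t'; break;
          case '"': c = '"'; break;
          case '\\': c = '\\'; break;
          case '/': c = '/'; break;
          default: return {}; // unsupported or truncated escape
        }
      }
      *dst++ = c;
    }
    if (dst >= end) { return {}; } // no room left for the terminator
    *dst = '\0';
    current_string_buf_loc = dst + 1; // the next string starts after the '\0'
    return std::string_view(start, static_cast<size_t>(dst - start));
  }
};
```

Successive strings land back to back in the same buffer, so nothing is allocated per string. The struct holds a pointer into its own storage, so this toy must not be copied or moved.
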
Presently, we use the `char string_buf[]` to write
The top-level document stores a json_iterator inside, which
It is illegal to move the document after it has been iterated (otherwise, pointers would be
invalidated). Note: I need to check that the current code actually checks this.
If it's stored in a `simdjson_result<document>` (as it would be in `auto doc = parser.parse(json)`),
the document pointed to by children is the one inside the simdjson_result, and the simdjson_result,
therefore, can't be moved either.
### Object Iteration
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
object and array iterators are separate classes from the object/array itself, and have an interior
mutable reference to it.
### Array Iteration
### Iteration Safety
- If the value fails to be parsed as one type, you can try to parse it as something else until you
succeed.
- Safety: If the value succeeds in being parsed or converted to a type, you cannot try again. (It
sets `json` to null, so you will segfault.) This prevents double iteration of an array (which
will cause inconsistent iterator state) or double-unescaping a string (which could cause memory
overruns if done too much).
- Guaranteed Iteration: If you discard a value without using it--perhaps you just wanted to know
if it was null but didn't care what the actual value was--it will iterate. See
This is an area I'm chomping at the bit for Rust, actually ... a whole lot of this would be safer if
we had compiler-enforced borrowing.
### Raw Key Lookup
### Skip Algorithm
### Root Scalar Parsing Without Malloc
The malloc when we parse the number / atoms at the root of the document has always bothered me a little, so I wrote alternate routines that use a stack-based buffer instead, sized according to the type in question. Atoms require no more than 6 bytes; integers no more than 21; and floats ... well, I [wanted your opinion on that, actually](https://github.com/simdjson/simdjson/pull/947/files#diff-979f6706620f56f5d6a45ca3bf511669R166). I wanted to set a limit on the biggest possible float, and came up with:
> Per https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/, 1074 is the maximum number of fractional digits needed to distinguish any two doubles (including zeroes and significant digits). Add 8 more digits for the other stuff (`-0.<fraction>e-308`) -- and you get 1082.
Technically, people could add an arbitrary number of digits after that ... but we could actually scan for that and ignore, if we wanted. I know it's a lot of convolutions to avoid malloc / free, but I think there are really good reasons to have a 100% malloc-free library (well, we have integration points that malloc, but they are predictable and could easily be swapped out).
I considered just using separate algorithms, and I think for numbers in particular there is probably a way to do that without having two separate functions, but I haven't figured out the *clean* way yet.
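
The size arithmetic above can be written down as compile-time constants. The constant names are invented for this sketch, and reading the extra byte in "6" and "21" as the trailing space placed after the scalar is my assumption; the text does not say what that byte is for.

```c++
#include <cstddef>

// ASSUMPTION: one extra byte for the space placed after a root scalar.
constexpr size_t TRAILING_SPACE = 1;

// Longest atom is "false" (5 characters).
constexpr size_t ATOM_BUFFER_SIZE = (sizeof("false") - 1) + TRAILING_SPACE;

// Longest 64-bit integers: INT64_MIN and UINT64_MAX are both 20 characters.
constexpr size_t INTEGER_BUFFER_SIZE = (sizeof("-9223372036854775808") - 1) + TRAILING_SPACE;

// Floats: 1074 fractional digits (from the quoted article) plus the 8 other
// characters in "-0.<fraction>e-308".
constexpr size_t MAX_FRACTION_DIGITS = 1074;
constexpr size_t FLOAT_OTHER_CHARS = (sizeof("-0.") - 1) + (sizeof("e-308") - 1);
constexpr size_t MAX_FLOAT_DIGITS = MAX_FRACTION_DIGITS + FLOAT_OTHER_CHARS;

static_assert(ATOM_BUFFER_SIZE == 6, "atoms need no more than 6 bytes");
static_assert(INTEGER_BUFFER_SIZE == 21, "integers need no more than 21 bytes");
static_assert((sizeof("18446744073709551615") - 1) + TRAILING_SPACE == 21,
              "UINT64_MAX fits in the same buffer");
static_assert(MAX_FLOAT_DIGITS == 1082, "matches the 1082 figure above");
```

Keeping the sums as `static_assert`s means the numbers in the prose and the buffer sizes in the code cannot drift apart silently.
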