Add ondemand rationale to beginning of document

This commit is contained in:
John Keiser 2020-09-12 11:59:39 -07:00
parent 88f0dc4726
commit 9dcf5fca5b
1 changed files with 217 additions and 86 deletions


@ -1,6 +1,208 @@
I'll write a few of the design features / innovations here:
How simdjson's On Demand Parsing works
======================================
## Classes
Current JSON parsers generally have either ease of use or performance. Very few have both at
once. simdjson's On Demand API bridges that gap with a familiar, friendly DOM API and the
performance of just-in-time parsing on top of the simdjson core's legendary performance.
To achieve ease of use, we mimicked the *form* of a traditional DOM API: you can iterate over
arrays, look up fields in objects, and extract native values like double, uint64_t, string and bool.
To achieve performance, we introduced some key limitations that make the DOM API *streaming*:
array/object iteration cannot be restarted, fields must be looked up in order, and string/number
values can only be parsed once.
```c++
ondemand::parser parser;
auto doc = parser.iterate(json);
for (auto tweet : doc["statuses"]) {
  std::string_view text = tweet["text"];
  std::string_view screen_name = tweet["user"]["screen_name"];
  uint64_t retweets = tweet["retweet_count"];
  uint64_t favorites = tweet["favorite_count"];
  std::cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << std::endl;
}
```
This streaming approach means that fields or values you don't use don't have to get parsed or
converted, saving space and time.
Further, the On Demand API doesn't parse a value *at all* until you try to convert it to double,
int, string, or bool. Because you have told it the type at that point, it can avoid the key
"what type is this" branch present in almost all other parsers, avoiding the branch mispredictions
that cause massive (sometimes 2-4x) slowdowns.
Parsers Today
-------------
To understand exactly what's happening here and why it's different, it's helpful to review the major
approaches to parsing and parser APIs in use today.
### Generic DOM
Many of the most usable, popular JSON APIs deserialize into a **DOM**: an intermediate tree of
objects, arrays and values. The resulting API lets you refer to each array or object separately,
using familiar techniques like iteration (`for (auto value : array)`) or indexing (`object["key"]`).
In some cases the values are even deserialized straight into familiar C++ constructs like vector and
map.
This model is dead simple to use, since it talks in terms of *data types* instead of JSON. It's
often easy enough that many users use the deserialized JSON as-is instead of deserializing into
their own custom structs, saving a ton of development work.
simdjson's DOM parser is one such example. It looks very similar to the ondemand example, except
it calls `parse` instead of `iterate`:
```c++
dom::parser parser;
auto doc = parser.parse(json);
for (auto tweet : doc["statuses"]) {
  std::string_view text = tweet["text"];
  std::string_view screen_name = tweet["user"]["screen_name"];
  uint64_t retweets = tweet["retweet_count"];
  uint64_t favorites = tweet["favorite_count"];
  std::cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << std::endl;
}
```
Pros and cons of generic DOM:
* Straightforward, user-friendly interface (arrays and objects)
* No lifetime concerns (arrays and objects are often independent of JSON text and parser internal state)
* Parses and stores everything, using memory/CPU even on values you don't end up using (cost can be brought down some with lazy numbers/strings and top-level iterators)
* Values stay in memory even if you only use them once
* Heavy performance drain from [type blindness](#type-blindness).
### SAX (SAJ?)
The SAX model ("Streaming API for XML") uses streaming to eliminate the high cost of
parsing and storing the entire JSON. In the SAX model, a core JSON parser parses the JSON document
piece by piece, but instead of stuffing values in a DOM tree, it passes each value to a callback,
letting the user use the value and decide for themselves whether to store it (and where) or
discard it.
This allows users to work with much larger files without running out of memory. Some SAX APIs even
allow the user to skip values entirely, lightening the parsing burden as well.
The big drawback is complexity: SAX APIs generally have you define a single callback for each type
(e.g. `string_field(std::string_view key, std::string_view value)`). Because of this, you suffer
from context blindness: when you find a string you have to check where it is before you know what to
do with it. Is this string the text of the tweet, the screen name, or something else I don't even
care about? Are we even in a tweet right now, or is this from some other place in the document
entirely?
The following is a SAX example of the Twitter problem we've seen in the Generic DOM and On Demand
examples. To make it short enough to use as an example at all, it's heavily redacted: it only solves
a part of the problem (it doesn't get user.screen_name), it has bugs (it doesn't handle sub-objects
in a tweet at all), and it uses a theoretical, simple SAX API that minimizes ceremony and skips over
the issue of lazy parsing and number types entirely.
```c++
struct twitter_callbacks {
  // Context the user has to track by hand: where in the document are we?
  bool in_statuses = false;
  bool in_tweet = false;
  // Values collected for the tweet currently being read
  std::string_view text;
  uint64_t retweets = 0;
  uint64_t favorites = 0;
  void start_object_field(std::string_view key) {
    if (key == "statuses") { in_statuses = true; }
  }
  void start_object() {
    // Known bug: any object under "statuses" counts as a tweet, including sub-objects
    if (in_statuses) { in_tweet = true; }
  }
  void string_field(std::string_view key, std::string_view value) {
    if (in_tweet && key == "text") { text = value; }
  }
  void number_field(std::string_view key, uint64_t value) {
    if (in_tweet) {
      if (key == "retweet_count") { retweets = value; }
      if (key == "favorite_count") { favorites = value; }
    }
  }
  void end_object() {
    if (in_tweet) {
      std::cout << "[redacted] (" << retweets << " retweets / " << favorites << " favorites): " << text << std::endl;
      in_tweet = false;
    } else if (in_statuses) {
      in_statuses = false;
    }
  }
};
sax::parser parser;
parser.parse(twitter_callbacks());
```
This is a startling amount of code, requiring mental gymnastics even to read, and in order to get it
this small and illustrate basic usage, *it has bugs* and skips over parsing user.screen_name
entirely. The real implementation is much, much harder to write (and to read).
Pros and cons of SAX (SAJ):
* Speed and space benefits from low, predictable memory usage
* Some SAX APIs can lazily parse numbers/strings, another performance win (pay for what you use)
* Performance drain from context blindness (switch statement for "where am I in the document")
* Startlingly difficult to use
### Schema-Based Parser Generators
There is another breed of parser, commonly used to generate REST API clients, which is in principle
capable of fixing most of the issues with DOM and SAX. These parsers take a schema--a description of
your JSON, with field names, types, everything--and generate classes/structs in your language of
choice, as well as a parser to deserialize the JSON into those structs. (In another variant, you
define your own struct and a preprocessor inspects it and generates a JSON parser for it.)
Not all of these schema-based parser generators actually generate a parser or even optimize for
streaming, but they are *able* to.
Some of the features help entirely eliminate the DOM and SAX issues:
Pros and cons:
* Ease of Use is on par with DOM
* Parsers that generate iterators and lazy values in structs can keep memory pressure down to SAX levels.
* Type Blindness can be entirely solved with specific parsers for each type, saving many branches.
* Context Blindness can be solved, especially if object fields are required and in order, saving
even more branches.
* Scenarios are limited by declarative language (often limited to deserialization-to-objects)
Rust's serde does a lot of the necessary legwork here, for example. (Editor's Note: I don't know
*how much* it does, but I know it does a decent amount, and is very fast.)
Type Blindness and Branch Misprediction
---------------------------------------
The DOM parsing model, and even the SAX model to a great extent, suffers from **type
blindness:** you, the user, almost always know exactly what fields and what types are in your JSON,
but the parser doesn't. When you say `json_parser.parse(json)`, the parser doesn't get told *any*
of this. It has no way to know. This means it has to look at each value blind with a big "switch"
statement, asking "is this a number? A string? A boolean? An array? An object?"
In modern processors, this kind of switch statement can make your program run *3-4 times slower*
than it needs to. This is because of the high cost of branch misprediction.
Modern processors have more than one execution unit in each core, so they can work on several
instructions at once even on a single thread. To go fast, the processor "reads ahead" in your
program, picking instructions to run as soon as their data is available. If all the execution units
are working almost all the time, your single-threaded program will run around 4 instructions per
cycle, *4 times faster* than running one instruction at a time.
Most modern programs don't manage to get much past 1 instruction per cycle, however. This is
because of branch misprediction. Programs have a lot of if statements, so to read ahead, processors
guess which branch will be taken and read ahead from that branch. If it guesses wrong, all that
wonderful work it did is discarded, and it starts over from the if statement. It *was* running at
4x speed, but it was all wasted work!
And this brings us back to that switch statement. Type blindness means the processor essentially has
to guess, for every JSON value, whether it will be an array, an object, number, string or boolean.
Unless your file is something ridiculously predictable, like a giant array of numbers, it's going to
trip up a lot. (Processors get better about this all the time, but for something complex like this
there is only so much they can do in the tiny amount of time they have to guess.)
On Demand parsing is tailor-made to solve this problem at the source, parsing values only after the
user declares their type by asking for a double, an int, a string, etc.
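To make the contrast concrete, here is a minimal sketch of the two styles. Everything in it (`json_type`, `classify`, `get_uint64`) is a hypothetical toy written for this document, not simdjson's implementation, and it ignores overflow, whitespace and trailing characters. The type-blind path has to branch on the first character before it can do anything; the type-directed path goes straight into the digit loop because the caller already said what it expects.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical toy type tag (an assumption for illustration only).
enum class json_type { number, string, boolean, null, array, object, invalid };

// Type-blind: look at the first character and branch on it. This is the
// hard-to-predict switch that a generic parser runs for every value.
inline json_type classify(std::string_view json) {
  if (json.empty()) { return json_type::invalid; }
  switch (json[0]) {
    case '"': return json_type::string;
    case 't': case 'f': return json_type::boolean;
    case 'n': return json_type::null;
    case '[': return json_type::array;
    case '{': return json_type::object;
    case '-': case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      return json_type::number;
    default: return json_type::invalid;
  }
}

// Type-directed: the caller asked for a uint64, so run the digit loop
// directly. "Wrong type" is discovered only when no digits are found.
// (Toy: no overflow check, and whatever follows the digits is ignored.)
inline std::optional<uint64_t> get_uint64(std::string_view json) {
  uint64_t value = 0;
  std::size_t i = 0;
  while (i < json.size() && json[i] >= '0' && json[i] <= '9') {
    value = value * 10 + static_cast<uint64_t>(json[i] - '0');
    i++;
  }
  if (i == 0) { return std::nullopt; } // no digits: it must be some other type
  return value;
}
```

In the type-directed version the only branch besides the digit loop is the rarely-taken "no digits" check, which is far easier for a processor to predict than a switch over every possible value type.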
NOTE: EVERYTHING BELOW THIS NEEDS REWRITING AND MAY NOT BE ACCURATE AT PRESENT
==============================================================================
Classes
-------
In general, simdjson's parser classes are divided into two categories:
@ -141,7 +343,7 @@ the work to parse and validate the array:
The `ondemand::array` object lets you iterate the values of an array in a forward-only manner:
```c++
for (object tweet : doc["tweets"].get_array()) {
for (object tweet : doc["statuses"].get_array()) {
...
}
```
@ -149,7 +351,7 @@ for (object tweet : doc["tweets"].get_array()) {
This is equivalent to:
```c++
array tweets = doc["tweets"].get_array();
array tweets = doc["statuses"].get_array();
array::iterator iter = tweets.begin();
array::iterator end = tweets.end();
while (iter != end) {
@ -164,13 +366,13 @@ the work to parse and validate the array:
1. `get_array()`:
- Validates that this is an array (starts with `[`). Returns INCORRECT_TYPE if not.
- If the array is empty (followed by `]`), advances the iterator past the `]` and returns an
array with finished == true.
- If the array is not empty, returns an array with finished == false. Iterator remains pointed
at the first value (the token just after `[`).
2. `tweets.begin()`, `tweets.end()`: Returns an `array::iterator`, which just points back at the
- If the array is empty (followed by `]`), advances the iterator past the `]` and then releases the
iterator (indicating the array is finished iterating).
- If the array is not empty, the iterator remains pointed at the first value (the token just
after `[`).
2. `tweets.begin()`, `tweets.end()`: Returns an `array_iterator`, which just points back at the
array object.
3. `while (iter != end)`: Stops if finished == true.
3. `while (iter != end)`: Stops if the iterator has been released.
4. `*iter`: Yields the value (or error).
- If there is an error, returns it and sets finished == true.
- Returns a value object, advancing the iterator just past the value token (if it is `[`, `{`,
@ -182,83 +384,14 @@ the work to parse and validate the array:
- If anything else is there (`true`, `:`, `}`, etc.), sets error = TAPE_ERROR.
- #3 gets run next.
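The steps above can be sketched with a toy forward-only array walker. This is an illustration under stated assumptions, not the actual `ondemand::array` code: the names `toy_array`, `next_value`, `advance` and `collect` are hypothetical, the input is assumed to be a whitespace-free array of unsigned integers, and the handling of `,` is a guess at the part of the list that is elided here.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical error codes mirroring the names used in the text.
enum class toy_error { SUCCESS, INCORRECT_TYPE, TAPE_ERROR };

// Hypothetical iteration state: a position in the JSON plus a finished flag.
struct toy_array {
  std::string_view json;
  std::size_t pos;
  bool finished;
  toy_error error;
};

// Step 1: validate the `[` and handle the empty array up front.
inline toy_array get_array(std::string_view json) {
  toy_array a{json, 0, false, toy_error::SUCCESS};
  if (json.empty() || json[0] != '[') {
    a.finished = true;
    a.error = toy_error::INCORRECT_TYPE;
    return a;
  }
  a.pos = 1; // now pointing at the first value (the token just after `[`)
  if (a.pos < json.size() && json[a.pos] == ']') { a.pos++; a.finished = true; }
  return a;
}

// Step 4 (`*iter`): parse the value at the current position and move past it.
inline uint64_t next_value(toy_array &a) {
  uint64_t value = 0;
  const std::size_t start = a.pos;
  while (a.pos < a.json.size() && a.json[a.pos] >= '0' && a.json[a.pos] <= '9') {
    value = value * 10 + static_cast<uint64_t>(a.json[a.pos] - '0');
    a.pos++;
  }
  if (a.pos == start) { a.error = toy_error::TAPE_ERROR; a.finished = true; }
  return value;
}

// `++iter`: `,` means another value follows, `]` ends the array,
// anything else is an error.
inline void advance(toy_array &a) {
  if (a.finished) { return; }
  const char c = a.pos < a.json.size() ? a.json[a.pos] : '\0';
  if (c == ',') { a.pos++; }
  else if (c == ']') { a.pos++; a.finished = true; }
  else { a.error = toy_error::TAPE_ERROR; a.finished = true; }
}

// Steps 2-3: the loop runs until the array reports it is finished.
inline std::vector<uint64_t> collect(std::string_view json, toy_error &error) {
  toy_array a = get_array(json);
  std::vector<uint64_t> out;
  while (!a.finished) {
    const uint64_t v = next_value(a);
    if (a.error != toy_error::SUCCESS) { break; }
    out.push_back(v);
    advance(a);
  }
  error = a.error;
  return out;
}
```

Note that nothing here looks ahead for the closing `]`: the array only learns it is finished when iteration actually reaches that token, which is what makes the iteration forward-only.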
Design Features
---------------
#### Error Chaining
When you iterate over an array or object with an error, the error is yielded in the loop
#### Error Chaining
* `document`: Represents the root value of the document, as well as iteration state.
- Inline Algorithm Context. MUST be kept around for the duration of the algorithm.
- `iter`: The `json_iterator`, which handles low-level iteration.
- `parser`: A pointer to the parser to get at persistent state.
- Can be converted to `value` (and by proxy, to array/object/string/number/bool/null).
- Can be iterated (like `value`, assumes it is an array).
- Safety: can only be converted to `value` once. This prevents double iteration of arrays/objects
(which will cause inconsistent iterator state) or double-unescaping strings (which could cause
memory overruns if done too much). NOTE: this is actually not ideal; it means you can't do
`if (document.parse_double().error() == INCORRECT_TYPE) { document.parse_string(); }`
The `value` class is expected to be temporary (NOT kept around), used to check the type of a value
and parse it to another type. It can convert to array or object, and has helpers to parse string,
double, int64, uint64, boolean and is_null().
`value` does not check the type of the value ahead of time. Instead, when you ask for a value of a
given type, it tries to parse as that type, treating a parser error as INCORRECT_TYPE. For example,
if you call get_int64() on the JSON string `"123"`, it will fail because it is looking for either a
`-` or digit at the front. The philosophy here is "proceed AS IF the JSON has the desired type, and
have an unlikely error branch if not." Since most files have well-known types, this avoids
unnecessary branchiness and extra checks for the value type.
It does preemptively advance the iterator and store the pointer to the JSON, allowing you to attempt
to parse more than one type. This is how you do nullable ints, for example: check is_null() and
then get_int64(). If we didn't store the JSON, and instead did `iter.advance()` during is_null(),
you would get garbage if you then did get_int64().
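As a rough illustration of the nullable-int pattern, the sketch below keeps the start of the value so that a second parse attempt sees the same bytes as the first. The names `toy_value`, `nullable_int64` and `get_nullable_int64` are hypothetical and exist only for this example; the real `value` class is different, and this toy skips overflow and trailing-character checks.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical stand-in for a value: it only remembers where the value starts.
struct toy_value {
  std::string_view json; // stored start of the value, so several parses can be tried

  bool is_null() const { return json.substr(0, 4) == "null"; }

  // Proceeds AS IF the value is an integer. An empty optional plays the role
  // of INCORRECT_TYPE: no `-` or digit was found at the front.
  std::optional<int64_t> get_int64() const {
    std::size_t i = 0;
    const bool negative = !json.empty() && json[0] == '-';
    if (negative) { i++; }
    const std::size_t first_digit = i;
    int64_t magnitude = 0;
    while (i < json.size() && json[i] >= '0' && json[i] <= '9') {
      magnitude = magnitude * 10 + (json[i] - '0');
      i++;
    }
    if (i == first_digit) { return std::nullopt; }
    return negative ? -magnitude : magnitude;
  }
};

struct nullable_int64 {
  bool ok;                      // false: neither null nor an integer
  std::optional<int64_t> value; // empty when the JSON value was null
};

// Check is_null() first, then get_int64(). Both look at the same stored bytes,
// so the failed null check does not disturb the integer parse.
inline nullable_int64 get_nullable_int64(const toy_value &v) {
  if (v.is_null()) { return {true, std::nullopt}; }
  const std::optional<int64_t> parsed = v.get_int64();
  return {parsed.has_value(), parsed};
}
```

If `is_null()` consumed input instead of reading from the stored position, the `get_int64()` call that follows would start in the wrong place, which is the "garbage" case described above.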
##
until it's asked to convert, in which case it proceeds
*expecting* the value is of the given type, treating an error in parsing as "it must be some
other type." This saves the extra explicit type check in the common case where you already know
saving the if/switch statement
and . We find out it's not a double
when the double parser says "I couldn't find any digits and I don't understand what this `t`
character is for."
- The iterator has already been advanced past the value token. If it's { or [, the iterator is just
after the open brace.
- Can be parsed as a string, double, int, unsigned, boolean, or `is_null()`, in which case a parser is run
and the value is returned.
- Can be converted to array or object iterator.
* `object`: Represents an object iteration.
- `doc`: A pointer to the document (and the iterator).
- `at_start`: boolean saying whether we're at the start of the object, used for `[]`.
- Can do order-sensitive field lookups.
- Can iterate over key/value pairs.
* `array`: Represents an array iteration.
- `doc`: A pointer to the document (and the iterator).
- Can be returned as a `raw_json_string`, so that people who want to check JSON strings without
unescaping can do so.
- Can be converted to array or object
and keep member variables in the same registers.
In fact, , and . Usually based on whether they are persistent--i.e. we intend them
to be stored in memory--or algorithmic--
(which generally means we intend to persist them), or algorithm-lifetime, possibly
- persistent classes
- non-persistent
`ondemand::parser` is the equivalent of `dom::parser`, and .
`json_iterator` handles all iteration state.
### json_iterator / document
`document` owns the json_iterator and the . I'm considering moving this into the json_iterator so there's one
less class that requires persistent ownership, however.
### String unescaping
### String Parsing
When the user requests strings, we unescape them to a single string_buf (much like the DOM parser)
so that users enjoy the same string performance as the core simdjson. The current_string_buf_loc is
@ -278,14 +411,12 @@ If it's stored in a `simdjson_result<document>` (as it would be in `auto doc = p
the document pointed to by children is the one inside the simdjson_result, and the simdjson_result,
therefore, can't be moved either.
### Object Iteration
### Object/Array Iteration
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
object and array iterators are separate classes from the object/array itself, and have an interior
mutable reference to it.
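A minimal sketch of that arrangement, using a hypothetical `toy_countdown` range rather than the real object/array classes: the iterator is a small copyable handle that holds a pointer back to its owner, and all of the mutable state lives in the owner.

```c++
#include <cassert>
#include <vector>

// Hypothetical forward-only range that yields n, n-1, ..., 1 exactly once.
class toy_countdown {
public:
  explicit toy_countdown(int start) : remaining(start) {}

  // The iterator carries no state of its own; it points back at the range.
  class iterator {
  public:
    explicit iterator(toy_countdown *owner_ptr) : owner(owner_ptr) {}
    int operator*() const { return owner->remaining; }
    iterator &operator++() { owner->remaining--; return *this; }
    // Compares against "finished" rather than against the other iterator.
    bool operator!=(const iterator &) const { return owner->remaining > 0; }
  private:
    toy_countdown *owner;
  };

  iterator begin() { return iterator(this); }
  iterator end() { return iterator(this); }

private:
  int remaining; // the real iteration state lives in the range itself
};

// Range-for works because begin()/end() hand out the lightweight handles.
inline std::vector<int> drain(toy_countdown &c) {
  std::vector<int> out;
  for (int v : c) { out.push_back(v); }
  return out;
}
```

Because the state is shared through the pointer, copying an iterator is cheap, and iterating the same range a second time yields nothing, matching the forward-only behavior described elsewhere in this document.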
### Array Iteration
### Iteration Safety
- If the value fails to be parsed as one type, you can try to parse it as something else until you