Revamp design documentation to match new design

John Keiser 2020-12-07 12:47:32 -08:00
parent 3baba73cf5
commit 2eaeac53e4
1 changed files with 217 additions and 77 deletions

@ -29,11 +29,6 @@ auto doc = parser.iterate(json);
for (auto tweet : doc["statuses"]) {
std::string_view text = tweet["text"];
std::string_view screen_name = tweet["user"]["screen_name"];
std::string_view screen_name;
{
ondemand::object user = tweet["user"];
screen_name = user["screen_name"];
}
uint64_t retweets = tweet["retweet_count"];
uint64_t favorites = tweet["favorite_count"];
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
@ -66,7 +61,10 @@ Such code would apply to a JSON document such as the following JSON mimicking
"retweet_count": 82,
"favorite_count": 42
}
]
],
"search_metadata": {
"count": 100
}
}
```
@ -91,7 +89,6 @@ The On Demand approach is designed around several principles:
* **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
To understand why On Demand is different, it is helpful to review the major
approaches to parsing and parser APIs in use today.
@ -119,8 +116,7 @@ for (auto tweet : doc["statuses"]) {
std::string_view text = tweet["text"];
std::string_view screen_name = tweet["user"]["screen_name"];
uint64_t retweets = tweet["retweet_count"];
uint64_t favorites = tweet["favorite_count"];
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
cout << screen_name << " (" << retweets << " retweets): " << text << endl;
}
```
@ -273,9 +269,10 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "favorite_count": 100, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "favorite_count": 2, "retweet_count": 3 }
]
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
@ -318,11 +315,24 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
rely on error chaining, so it is possible to delay error checks: we shall shortly explain error
chaining more fully.
NOTE: You should always have such a `document` instance (here `doc`) and it should remain in scope for the duration
of your parsing function. E.g., you should not use the returned document as a temporary (e.g., `auto x = parser.iterate(json).get_object();`)
followed by other operations as the destruction of the `document` instance makes all of the derived instances
ill-defined.
> NOTE: You should always have such a `document` instance (here `doc`) and it should remain in scope for the duration
> of your parsing function. E.g., you should not use the returned document as a temporary (e.g., `auto x = parser.iterate(json).get_object();`)
> followed by other operations as the destruction of the `document` instance makes all of the derived instances
> ill-defined.
At this point, the iterator is at the start of the JSON:
```json
{
^ (depth 1)
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
3. We iterate over the "statuses" field using a typical C++ iterator, reading past the initial
`{ "statuses": [ {`.
@ -330,45 +340,59 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
```c++
for (ondemand::object tweet : doc["statuses"]) {
```
This shorthand does much, and it is helpful to see what it expands to.
Comments in front of each one explain what's going on:
```c++
// Validate that the top-level value is an object: check for {
ondemand::object top = doc.get_object();
// Find the field statuses by:
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
// 3. Validate that there is a `:` after it.
auto tweets_field = top["statuses"];
This shorthand does a lot, and it is helpful to see what it expands to.
Comments in front of each one explain what's going on:
// Validate that the field value is an array: check for [
// Also mark the array as finished if there is a ] next, which would cause the while () statement to exit immediately.
ondemand::array tweets = tweets_field.get_array();
// These three method calls do nothing substantial (the real checking happens in get_array() and ++)
// != checks whether the array is marked as finished (if we have found a ]).
ondemand::array_iterator tweets_iter = tweets.begin();
while (tweets_iter != tweets.end()) {
auto tweet_value = *tweets_iter;
```c++
// Validate that the top-level value is an object: check for {. Increase depth to 2 (root > field).
ondemand::object top = doc.get_object();
// Validate that the array element is an object: check for {
ondemand::object tweet = tweet_value.get_object();
...
}
```
What is not explained in this code expansion is *error chaining*.
Generally, you can use `document` methods on a `simdjson_result<...>` value; any errors will
just be passed down the chain. Many method calls
can be chained in this manner. So `for (object tweet : doc["statuses"])`, which is the equivalent of
`object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object()`, could fail in any of
6 method calls, and the error will only be checked at the end,
when you attempt to cast the final `simdjson_result<object>` to object. Upon casting, an exception is
thrown if there was an error.
// Find the field statuses by:
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless
// the key lookup fails!)
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
// 3. Validate that there is a `:` after it.
auto tweets_field = top["statuses"];
NOTE: while the document can be queried once for a key as if it were an object, it is not an actual object
instance. If you need to treat it as an object (e.g., to query more than one key), you can cast it as
such: `ondemand::object root_object = doc.get_object();`.
// - Validate that the field value is an array: check for [
// - If the array is empty (if there is a ] next), decrease depth back to 0.
// - If not, increase depth to 3 (root > statuses > tweet).
ondemand::array tweets = tweets_field.get_array();
// These three method calls do nothing substantial (the real checking happens in get_array() and ++)
// != checks whether the array is finished (if we found a ] and decreased depth back to 0).
ondemand::array_iterator tweets_iter = tweets.begin();
while (tweets_iter != tweets.end()) {
auto tweet_value = *tweets_iter;
// - Validate that the array element is an object: check for {
// - If the object is empty (if there is a } next), decrease depth back to 1.
// - If not, increase depth to 4 (root > statuses > tweet > field).
ondemand::object tweet = tweet_value.get_object();
...
}
```
> NOTE: What is not explained in this code expansion is *error chaining*.
> Generally, you can use `document` methods on a `simdjson_result<...>` value; any errors will
> just be passed down the chain. Many method calls
> can be chained in this manner. So `for (object tweet : doc["statuses"])`, which is the equivalent of
> `object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object()`, could fail in any of
> 6 method calls, and the error will only be checked at the end,
> when you attempt to cast the final `simdjson_result<object>` to object. Upon casting, an exception is
> thrown if there was an error.
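
The error chaining described in the note above can be illustrated with a small stand-alone sketch. This is not simdjson's `simdjson_result<...>`; `int_result` and `then` are hypothetical names invented for this illustration, and the payload is a plain `int` to keep it short. It only shows the pattern: each chained step passes an earlier error along untouched, and the error is checked once, at the final conversion.

```c++
#include <stdexcept>
#include <string>

// Hypothetical miniature of a chained result (not simdjson's type).
struct int_result {
  int value;
  std::string error; // empty means success

  // Chained step: if an earlier step failed, skip `step` and pass the error along.
  template <typename F>
  int_result then(F step) const {
    if (!error.empty()) { return *this; }
    return step(value);
  }

  // The conversion is the single point where the error is checked.
  operator int() const {
    if (!error.empty()) { throw std::runtime_error(error); }
    return value;
  }
};
```

With this shape, a chain such as `a.then(f).then(g)` never inspects the error in between; only the final conversion to `int` throws.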
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
^ (depth 4 - root > statuses > tweet > field)
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
4. We get the `"text"` field as a string.
@ -382,8 +406,10 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
The second field is matched (`"text"`), so we validate the `:` and move to the actual value.
NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This means
that keys with escapes in them may not be matched and the letter case must match exactly.
> NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This
> allows simdjson to do field lookup very, very quickly when the keys you want to match have
> letters, numbers and punctuation. However, this means that fields with escapes in them will not
> be matched.
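
As a rough illustration of the raw match described in the note above, the sketch below compares a key against the raw JSON bytes without unescaping anything. `raw_key_matches` is a hypothetical helper written for this example and is not part of simdjson's API; the library's actual comparison is more optimized.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true when the bytes starting at json[pos]
// are exactly `"key"` (quote, key bytes, quote), compared byte by byte.
bool raw_key_matches(std::string_view json, std::size_t pos, std::string_view key) {
  // There must be room for the opening quote, the key, and the closing quote.
  if (pos > json.size() || json.size() - pos < key.size() + 2) { return false; }
  if (json[pos] != '"') { return false; }
  if (json.compare(pos + 1, key.size(), key) != 0) { return false; }
  return json[pos + 1 + key.size()] == '"';
}
```

Because nothing is unescaped, a key written in the JSON as `"te\u0078t"` does not match `text`, and neither does `"Text"`.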
To convert to a string, we check for `"` and use simdjson's fast unescaping algorithm to copy
`first!` (plus a terminating `\0`) into a buffer managed by the `document`. This buffer stores
@ -391,6 +417,22 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
A `string_view` is returned which points to that buffer, and contains the length.
We advance to the comma, and decrease depth to 3 (root > statuses > tweet).
At this point, we are here in the JSON:
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
^ (depth 3 - root > statuses > tweet)
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
4. We get the `"screen_name"` from the `"user"` object.
```c++
@ -398,29 +440,75 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
screen_name = user["screen_name"];
```
First, `["user"]` checks whether there are any more object fields by looking for either `,` or
`}`. Then it matches `"user"` and validates the `:`.
First, `["user"]` finds the `,`, discovers the next key is `"user"`, validates that the `:`
is there, and increases depth to 4 (root > statuses > tweet > field).
`["screen_name"]` then converts to object, checking for `{`, and finds `"screen_name"`.
Next, the cast to ondemand::object checks for `{` and increases depth to 5 (root > statuses >
tweet > user > field).
`["screen_name"]` finds the first field `"screen_name"` and validates the `:`.
To convert the result to a usable string (i.e., the screen name `lemire`), the characters are written to the document's
string buffer (after possibly unescaping them), which now has *two* string_views pointing into it, and looks like `first!\0lemire\0`.
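
The buffer layout described above can be sketched with a small stand-in class. `string_buffer` is a hypothetical name used only for this illustration; it copies bytes without unescaping them and assumes the reserved capacity is never exceeded, so that earlier views stay valid.

```c++
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the document's string buffer.
class string_buffer {
public:
  // Reserve up front: as long as the total stays within `capacity`,
  // appending never reallocates, so previously returned views remain valid.
  explicit string_buffer(std::size_t capacity) { storage_.reserve(capacity); }

  // Copy `value` plus a terminating '\0' and return a view over the copy
  // (the view's length excludes the terminator).
  std::string_view append(std::string_view value) {
    std::size_t start = storage_.size();
    storage_.append(value);
    storage_.push_back('\0');
    return std::string_view(storage_.data() + start, value.size());
  }

  std::string_view contents() const { return storage_; }

private:
  std::string storage_;
};
```

Appending `first!` and then `lemire` yields two independent views into one contiguous buffer holding `first!\0lemire\0`.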
Finally, the temporary user object is destroyed, causing it to skip the remainder of the object
(`}`).
The iterator advances to the comma and decreases depth back to 4 (root > statuses > tweet > user).
NOTE: You may only have one active array or object active at any given time. An array or an object becomes
active when the `ondemand::object` or `ondemand::array` is created, and it releases its 'focus' when
its destructor is called. If you create an array or an object located inside a parent object or array,
the child array or object becomes active while the parent becomes temporarily inactive. If you access
several sibling objects or arrays, you must ensure that the destructor is called by scoping each access
(see Iteration Safety section below for further details).
At this point, the iterator is here in the JSON:
5. We get `"retweet_count"` and `"favorite_count"` as unsigned integers.
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
^ (depth 4 - root > statuses > tweet > user)
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
5. We get `"retweet_count"` as an unsigned integer.
```c++
uint64_t retweets = tweet["retweet_count"];
uint64_t favorites = tweet["favorite_count"];
```
First, `["retweet_count"]` checks whether the previous field value is finished; if it were, depth
would be 3 (root > statuses > tweet). Since it is not, we skip JSON until depth is 3. This brings
the iterator to the `,` after the user object:
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
^ (depth 3 - root > statuses > tweet)
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
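
The "skip JSON until depth is 3" step can be sketched as follows. This is a simplified model, not simdjson's implementation: here depth counts only `{`/`[` nesting and the function scans raw bytes, whereas simdjson walks a precomputed index of structural characters and also counts fields in its depth. `skip_to_depth` is a hypothetical name.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical sketch: advance from `pos` until bracket nesting drops from
// `depth` to `target_depth`. Returns the index just past the closing bracket
// that reached the target, or json.size() if the input ends first.
std::size_t skip_to_depth(std::string_view json, std::size_t pos, int depth, int target_depth) {
  bool in_string = false;
  while (pos < json.size() && depth > target_depth) {
    char c = json[pos++];
    if (in_string) {
      if (c == '\\') {
        if (pos < json.size()) { ++pos; } // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true; // brackets inside strings must not change depth
    } else if (c == '{' || c == '[') {
      ++depth;
    } else if (c == '}' || c == ']') {
      --depth;
    }
  }
  return pos;
}
```

Nothing that is skipped is interpreted: the unused `"name"` field is stepped over without being parsed as a string.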
Because of the cast to `uint64_t`, simdjson knows it's parsing an unsigned integer. This lets
us use a fast parser which *only* knows how to parse digits. It validates that it is an integer
by rejecting negative numbers, strings, and other values based on the fact that they are not the
digits 0-9. This type specificity is part of why parsing with On Demand is so fast: you skip all
the code that has to understand those other types.
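
A digits-only parser of the kind described above might look like the following sketch. It is not simdjson's number parser; `parse_unsigned` is a hypothetical function, and the overflow, leading-zero, and fraction/exponent checks are assumptions about what a reasonable implementation would reject.

```c++
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch: parse an unsigned integer at the start of `json`,
// accepting only the digits 0-9.
std::optional<std::uint64_t> parse_unsigned(std::string_view json) {
  std::uint64_t value = 0;
  std::size_t i = 0;
  while (i < json.size() && json[i] >= '0' && json[i] <= '9') {
    std::uint64_t digit = static_cast<std::uint64_t>(json[i] - '0');
    // Reject values that would not fit in 64 bits.
    if (value > (UINT64_MAX - digit) / 10) { return std::nullopt; }
    value = value * 10 + digit;
    ++i;
  }
  // No digits at all: '-', '"', '{', 't', ... are rejected simply because
  // they are not digits.
  if (i == 0) { return std::nullopt; }
  // JSON does not allow leading zeros such as "007".
  if (i > 1 && json[0] == '0') { return std::nullopt; }
  // A fraction or exponent means the value is not an integer.
  if (i < json.size() && (json[i] == '.' || json[i] == 'e' || json[i] == 'E')) {
    return std::nullopt;
  }
  return value;
}
```

There is no branch for strings, booleans, or negative numbers; anything that does not start with a digit falls out through the same "no digits" check.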
The iterator is advanced to the `}`, and depth decreased back to 3 (root > statuses > tweet).
At this point, we are here in the JSON:
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
^ (depth 3 - root > statuses > tweet)
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
"search_metadata": { "count": 2 }
}
```
6. We loop to the next tweet.
@ -441,25 +529,77 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
}
```
First, the `tweet` destructor runs, skipping the remainder of the object which in this case is
just `}`.
First, `iter++` (remember, this is the array of tweets) checks whether the previous object was
fully iterated. It was not--depth is 3 (root > statuses > tweet), so we skip until it's 2--which
in this case just means consuming the `}`, leaving the iterator at the next comma. Depth is now 2
(root > statuses).
Next, `iter++` checks whether there are more values and finds `,`. The loop continues.
Next, `iter++` finds the `,` and advances past it to the `{`, increasing depth to 3 (root >
statuses > tweet).
Finally, `ondemand::object tweet = *iter` checks for `{` and returns the object.
Finally, `ondemand::object tweet = *iter` validates the `{` and increases depth to 4 (root >
statuses > tweet > field). This leaves the iterator here:
This tweet is processed just like the previous one.
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
^ (depth 4 - root > statuses > tweet > field)
],
"search_metadata": { "count": 2 }
}
```
7. We finish the last tweet.
7. This tweet is processed just like the previous one, leaving the iterator here:
At the end of the loop, the `tweet` is first destroyed, skipping the remainder of the tweet
object (`}`).
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
^ (depth 3 - root > statuses > tweet)
],
"search_metadata": { "count": 2 }
}
```
The `iter++` instruction from `for (ondemand::object tweet : doc["statuses"])` then checks whether there are
more values and finds that there are none (`]`). It marks the array iteration as finished and the for
loop terminates.
8. The loop ends. Recall the relevant parts of the statuses loop:
Then the outer object is destroyed, skipping everything up to the `}`.
```c++
while (iter != statuses.end()) {
ondemand::object tweet = *iter;
...
iter++;
}
```
First, `iter++` finishes up any children, consuming the `}` and leaving depth at 2 (root > statuses).
Next, `iter++` notices the `]` and ends the array by decreasing depth to 1. This leaves the iterator
here in the JSON:
```json
{
"statuses": [
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
],
^ (depth 1 - root)
"search_metadata": { "count": 2 }
}
```
9. The remainder of the file is skipped.
Because no more action is taken, JSON processing stops: processing only occurs when you ask for
values.
This means you can very efficiently do things like read a single value from a JSON file, or take
the top N, for example. It also means the things you don't use won't be fully validated. This is
a general principle of On Demand: don't validate what you don't use. We still fully validate
values you do use, however, as well as the objects and arrays that lead to them, so that you can
be sure you get the information you need.
Design Features
---------------