Tweaking further the documentation. (#1237)

* Tweaking further the documentation.

* More details.

* Another sentence.

* Saving.

* Tweaking more
Daniel Lemire 2020-10-19 16:51:04 -04:00 committed by GitHub
parent f1b4a54991
commit 0a907ec694
4 changed files with 31 additions and 29 deletions


@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
* Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
* Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
* Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
**Describe the bug**


@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
* Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
* Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
* Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
We do not make changes to simdjson without clearly identifiable benefits, which typically means either performance improvements, bug fixes or new features. Avoid bike-shedding: we all have opinions about how to write code, but we want to focus on what makes simdjson objectively better.


@ -8,8 +8,8 @@ Whether we parse JSON or XML, or any other serialized format, there are relative
- Another established approach is an event-based approach (like SAX, SAJ).
- Another popular approach is the schema-based deserialization model.
We propose an approach that is as easy to use and often as flexible as the DOM approach, yet as fast and
efficient as the schema-based or event-based approaches. We call this new approach "On Demand". The
simdjson On Demand API offers a familiar, friendly DOM API and
provides the performance of just-in-time parsing on top of the superior performance of simdjson.
@ -71,11 +71,12 @@ and `"friends_count"` keys and matching values are skipped.
Further, the On Demand API does not parse a value *at all* until you try to convert it (e.g., to `double`,
`int`, `string`, or `bool`). In our example, when accessing the key-value pair `"retweet_count": 82`, the parser
may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data type, we avoid branch
mispredictions related to data type determination and improve the performance.
may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data
type, we avoid branch mispredictions related to data type determination and improve the performance.
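
The deferred conversion described above can be illustrated with a minimal, self-contained sketch. It does not use simdjson and is not how the library is implemented: the `lazy_value` type and its `get_int64` method are hypothetical names chosen for illustration. The value keeps the raw characters (`82`) and only produces a binary integer when the caller asks for one. The sketch relies on `std::from_chars` and does not enforce every JSON number rule (for example, it accepts leading zeros).

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical lazy value (not a simdjson type): it stores the raw JSON text
// and converts it only when the caller requests a specific type.
struct lazy_value {
  std::string_view raw; // e.g. the two characters "82" as found in the document

  // Convert the raw text to a signed 64-bit integer on request.
  // Returns std::nullopt unless the whole text is consumed as an integer.
  std::optional<std::int64_t> get_int64() const {
    std::int64_t out = 0;
    const char* first = raw.data();
    const char* last = raw.data() + raw.size();
    auto [ptr, ec] = std::from_chars(first, last, out);
    if (ec != std::errc() || ptr != last) { return std::nullopt; }
    return out;
  }
};
```

Because the caller names the target type, the sketch never has to inspect the text to guess whether it is a number, a string or a boolean; a value that is never requested is never converted.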
We expect users of an On Demand API to work in terms of a JSON dialect, which is a set of expectations and
specifications that come in addition to the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt).
The On Demand approach is designed around several principles:
* **Streaming (\*):** It avoids preparsing values, keeping the memory usage and the latency down.
@ -85,6 +86,7 @@ The On Demand approach is designed around several principles:
* **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
To understand why On Demand is different, it is helpful to review the major
approaches to parsing and parser APIs in use today.
@ -275,7 +277,7 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
### Starting the iteration
1. First, we declare a parser object that keeps internal buffers necessary for parsing. This can be
reused to parse multiple JSON files, so you don't pay the high cost of allocating memory every
reused to parse multiple JSON files, so you do not pay the high cost of allocating memory every
time (and so it can stay in cache!).
This declaration does not allocate any memory; that will happen in the next step.
@ -325,8 +327,8 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
ondemand::object top = doc.get_object();
// Find the field statuses by:
// 1. Check whether the object is empty (check for }). (TODO we don't really need to do this unless the key lookup fails!)
// 2. Check if we're at the field by looking for the string "statuses".
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
// 3. Validate that there is a `:` after it.
auto tweets_field = top["statuses"];
@ -359,14 +361,14 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
std::string_view text = tweet["text"];
```
First, `["text"]` skips the `"id"` field because it doesn't match: skips the key, `:` and
First, `["text"]` skips the `"id"` field because it does not match: skips the key, `:` and
value (`1`). We then check whether there are more fields by looking for either `,`
or `}`.
The second field is matched (`"text"`), so we validate the `:` and move to the actual value.
NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This means
that keys with escapes in them may not be matched.
that keys with escapes in them may not be matched and the letter case must match exactly.
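
As a rough illustration of what a raw match implies, the following standalone sketch compares the key bytes exactly as they appear in the JSON text against the requested key. The helper `raw_key_matches` is a hypothetical name and not part of simdjson; the actual comparison in the library may be organized differently.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper (not a simdjson function): compare the raw, still-escaped
// key bytes from the JSON text against the key requested by the programmer.
inline bool raw_key_matches(std::string_view raw_json_key, std::string_view wanted) {
  // Exact byte-by-byte comparison: no unescaping and no case folding.
  return raw_json_key == wanted;
}
```

Under this sketch, a key written in the document as `te\u0078t` would not match `"text"` even though it unescapes to the same string, and `"Text"` would not match `"text"` either.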
To convert to a string, we check for `"` and use simdjson's fast unescaping algorithm to copy
`first!` (plus a terminating `\0`) into a buffer managed by the `document`. This buffer stores
@ -445,11 +447,6 @@ When the user requests strings, we unescape them to a single string buffer much
so that users enjoy the same string performance as the core simdjson. We do not write the length to the
string buffer, however; that is stored in the `string_view` instance we return to the user.
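
To make the idea of a shared string buffer concrete, here is a simplified sketch under stated assumptions. The `string_arena` class is hypothetical and not the simdjson implementation: it skips unescaping and the terminating `\0`, and only shows that the characters live in one buffer while the length travels in the returned `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <optional>
#include <string_view>

// Hypothetical fixed-capacity buffer (not a simdjson type) shared by all the
// strings read from one document. The buffer never grows, so views stay valid.
class string_arena {
public:
  explicit string_arena(std::size_t capacity)
    : buf_(new char[capacity]), capacity_(capacity), used_(0) {}

  // Copy `s` to the end of the buffer and return a view of the copy.
  // Only the characters are written; the length lives in the returned view.
  // Returns std::nullopt when the remaining capacity is too small.
  std::optional<std::string_view> store(std::string_view s) {
    if (s.size() > capacity_ - used_) { return std::nullopt; }
    char* dst = buf_.get() + used_;
    if (!s.empty()) { std::memcpy(dst, s.data(), s.size()); }
    used_ += s.size();
    return std::string_view(dst, s.size());
  }

  std::size_t used() const { return used_; }

private:
  std::unique_ptr<char[]> buf_;
  std::size_t capacity_;
  std::size_t used_;
};
```

A caller would create one arena per document and call `store` for each string it requests; the returned views remain valid for as long as the arena is alive.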
### Object/Array Iteration
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
object and array iterators are separate classes from the object/array itself, and have an interior
mutable reference to it.
### Iteration Safety
@ -466,24 +463,27 @@ in production systems:
if it was `nullptr` but did not care what the actual value was--it will iterate. The destructor automates
the iteration.
### Limitations of the On Demand Approach
### Benefits of the On Demand Approach
We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach. However, there are some limitations.
We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach.
Pros of the On Demand approach:
* Faster than DOM in some cases. Reduced memory usage.
* Straightforward, programmer-friendly interface (arrays and objects).
* Highly expressive, beyond deserialization and pointer queries: many tasks can be accomplished with little code.
Cons of the On Demand approach:
* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX). It is possible for the programmer to handle out-of-order keys when the JSON dialect is underspecified, but it requires additional care. You should be mindful that, though your software might write the keys in a consistent manner, the JSON specification does not prescribe that the order be significant and thus, a JSON producer could change the order of the keys within an object. The On Demand API will still help the programmer by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases (e.g., by rejecting the JSON input that does not follow the expected JSON dialect).
* Less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?
### Limitations of the On Demand Approach
The On Demand approach has some limitations:
* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX).
* The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?
There are currently additional technical limitations which we expect to resolve in future releases of the simdjson library:
* The simdjson library offers runtime dispatching which allows you to compile one binary and have it run at full speed on different processors, taking advantage of the specific features of the processor. The On Demand API does not have runtime dispatch support at this time. To benefit from the On Demand API, you must compile your code for a specific processor. E.g., if your processor supports AVX2 instructions, you should compile your binary executable with AVX2 instruction support (by using your compiler's commands). If you are sufficiently technically proficient, you can implement runtime dispatching within your application, by compiling your On Demand code for different processors.
* There is an initial phase which scans the entire document quickly, irrespective of the size of the document. We plan to break this phase into distinct steps for large files in a future release as we have done with other components of our API (e.g., `parse_many`).
* The On Demand API does not support JSON Pointer. This capability is currently limited to our core API.
* We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
* You should be mindful that, though your software might write the keys in a consistent manner, the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt) states that "JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible". The On Demand API will help the programmer handle unexpected JSON dialects by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases: e.g., by rejecting the JSON input that does not follow the expected JSON dialect. We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
```C++
for (ondemand::object my_object : doc["mykey"]) {
for (auto field : my_object) {
@ -519,11 +519,10 @@ most programmers will want to target `arm64`. The `fallback` is probably only go
std::cout << simdjson::builtin_implementation()->name() << std::endl;
```
If you are using CMake for your C++ project, then you can pass compilation flags to your compiler during the first configuration
by using the `CXXFLAGS` configuration variable:
```
CXXFLAGS=-march=haswell cmake -B build_haswell
cmake --build build_haswell
```
If you are using CMake for your C++ project, then you can pass compilation flags to your compiler by using
the `CMAKE_CXX_FLAGS` variable:
You may also use the `CMAKE_CXX_FLAGS` variable.
```
cmake -DCMAKE_CXX_FLAGS="-march=haswell" -B build_haswell
cmake --build build_haswell
```