Tweaking further the documentation. (#1237)
* Tweaking further the documentation.
* More details.
* Another sentence.
* Saving.
* Tweaking more
parent
f1b4a54991
commit
0a907ec694
|
@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
|
|||
* Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
|
||||
* Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
|
||||
* Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
|
||||
* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
|
||||
|
||||
|
||||
**Describe the bug**
|
||||
|
|
|
@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
|
|||
* Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
|
||||
* Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
|
||||
* Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
|
||||
* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
|
||||
|
||||
We do not make changes to simdjson without clearly identifiable benefits, which typically means either performance improvements, bug fixes or new features. Avoid bike-shedding: we all have opinions about how to write code, but we want to focus on what makes simdjson objectively better.
|
||||
|
||||
|
|
|
@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
|
|||
* Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
|
||||
* Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
|
||||
* Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
|
||||
* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
|
||||
|
||||
We do not make changes to simdjson without clearly identifiable benefits, which typically means either performance improvements, bug fixes or new features. Avoid bike-shedding: we all have opinions about how to write code, but we want to focus on what makes simdjson objectively better.
|
||||
|
||||
|
|
|
@ -8,8 +8,8 @@ Whether we parse JSON or XML, or any other serialized format, there are relative
|
|||
- Another established approach is an event-based approach (like SAX, SAJ).
|
||||
- Another popular approach is the schema-based deserialization model.
|
||||
|
||||
We propose an approach that is as easy to use and often as flexible as the DOM approach, yet as fast and
|
||||
efficient as the schema-based or event-based approaches. We call this new approach "On Demand". The
|
||||
We propose an approach that is as easy to use and often as flexible as the DOM approach, yet as fast and
|
||||
efficient as the schema-based or event-based approaches. We call this new approach "On Demand". The
|
||||
simdjson On Demand API offers a familiar, friendly DOM API and
|
||||
provides the performance of just-in-time parsing on top of simdjson's superior performance.
|
||||
|
||||
|
@ -71,11 +71,12 @@ and `"friends_count"` keys and matching values are skipped.
|
|||
|
||||
Further, the On Demand API does not parse a value *at all* until you try to convert it (e.g., to `double`,
|
||||
`int`, `string`, or `bool`). In our example, when accessing the key-value pair `"retweet_count": 82`, the parser
|
||||
may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data type, we avoid branch
|
||||
mispredictions related to data type determination and improve the performance.
|
||||
|
||||
may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data
|
||||
type, we avoid branch mispredictions related to data type determination and improve the performance.
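
The deferred conversion described above can be illustrated with a minimal, self-contained sketch. This is not simdjson's implementation: `lazy_value` and `as_uint64` are hypothetical names introduced only for illustration. The value keeps a view of the raw bytes (such as `82`) and converts them only when the caller asks for a specific type, so no code has to guess the type first.

```C++
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical illustration type; NOT part of the simdjson API.
struct lazy_value {
  std::string_view raw; // view of the unparsed bytes, e.g. "82"

  // Convert only on request, and only to the type the caller named.
  // Returns std::nullopt if the bytes are not a plain unsigned integer
  // or if the result would not fit in 64 bits.
  std::optional<std::uint64_t> as_uint64() const {
    if (raw.empty()) { return std::nullopt; }
    std::uint64_t result = 0;
    for (char c : raw) {
      if (c < '0' || c > '9') { return std::nullopt; }
      const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
      // Reject inputs where result * 10 + digit would overflow.
      if (result > (UINT64_MAX - digit) / 10) { return std::nullopt; }
      result = result * 10 + digit;
    }
    return result;
  }
};
```

Constructing `lazy_value{"82"}` does no numeric work at all; the digits are only examined when `as_uint64()` is called.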
|
||||
|
||||
|
||||
We expect users of an On Demand API to work in terms of a JSON dialect, which is a set of expectations and
|
||||
specifications that come in addition to the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt).
|
||||
The On Demand approach is designed around several principles:
|
||||
|
||||
* **Streaming (\*):** It avoids preparsing values, keeping the memory usage and the latency down.
|
||||
|
@ -85,6 +86,7 @@ The On Demand approach is designed around several principles:
|
|||
* **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
|
||||
|
||||
|
||||
|
||||
To understand why On Demand is different, it is helpful to review the major
|
||||
approaches to parsing and parser APIs in use today.
|
||||
|
||||
|
@ -275,7 +277,7 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
|
|||
### Starting the iteration
|
||||
|
||||
1. First, we declare a parser object that keeps internal buffers necessary for parsing. This can be
|
||||
reused to parse multiple JSON files, so you don't pay the high cost of allocating memory every
|
||||
reused to parse multiple JSON files, so you do not pay the high cost of allocating memory every
|
||||
time (and so it can stay in cache!).
|
||||
|
||||
This declaration does not allocate any memory; that will happen in the next step.
|
||||
|
@ -325,8 +327,8 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
|
|||
ondemand::object top = doc.get_object();
|
||||
|
||||
// Find the field statuses by:
|
||||
// 1. Check whether the object is empty (check for }). (TODO we don't really need to do this unless the key lookup fails!)
|
||||
// 2. Check if we're at the field by looking for the string "statuses".
|
||||
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
|
||||
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
|
||||
// 3. Validate that there is a `:` after it.
|
||||
auto tweets_field = top["statuses"];
|
||||
|
||||
|
@ -359,14 +361,14 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
|
|||
std::string_view text = tweet["text"];
|
||||
```
|
||||
|
||||
First, `["text"]` skips the `"id"` field because it doesn't match: skips the key, `:` and
|
||||
First, `["text"]` skips the `"id"` field because it does not match: skips the key, `:` and
|
||||
value (`1`). We then check whether there are more fields by looking for either `,`
|
||||
or `}`.
|
||||
|
||||
The second field is matched (`"text"`), so we validate the `:` and move to the actual value.
|
||||
|
||||
NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This means
|
||||
that keys with escapes in them may not be matched.
|
||||
that keys with escapes in them may not be matched and the letter case must match exactly.
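
To make the consequences of a raw match concrete, here is a minimal sketch of a byte-by-byte key comparison. It is an illustration under stated assumptions, not simdjson's code: `raw_key_equals` is a hypothetical helper, `json` is assumed to start at the opening quote of a key, and the search key is assumed to contain no quotes or backslashes.

```C++
#include <cstring>
#include <string_view>

// Hypothetical helper; NOT part of the simdjson API.
// `json` is assumed to begin at the opening '"' of a key in the raw JSON.
// The key is compared against the raw bytes without any unescaping.
bool raw_key_equals(std::string_view json, std::string_view key) {
  // We need room for the opening quote, the key bytes, and the closing quote.
  if (json.size() < key.size() + 2) { return false; }
  if (json[0] != '"') { return false; }
  // Byte-by-byte comparison: case-sensitive, escape sequences are not decoded.
  if (std::memcmp(json.data() + 1, key.data(), key.size()) != 0) { return false; }
  // The key must end exactly here, otherwise "textual" would match "text".
  return json[key.size() + 1] == '"';
}
```

With this sketch, the raw bytes `"text":1` match the key `text`, while `"Text":1` (different letter case) and `"te\u0078t":1` (an escaped spelling of the same key) do not.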
|
||||
|
||||
To convert to a string, we check for `"` and use simdjson's fast unescaping algorithm to copy
|
||||
`first!` (plus a terminating `\0`) into a buffer managed by the `document`. This buffer stores
|
||||
|
@ -445,11 +447,6 @@ When the user requests strings, we unescape them to a single string buffer much
|
|||
so that users enjoy the same string performance as the core simdjson. We do not write the length to the
|
||||
string buffer, however; that is stored in the `string_view` instance we return to the user.
|
||||
|
||||
### Object/Array Iteration
|
||||
|
||||
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
|
||||
object and array iterators are separate classes from the object/array itself, and have an interior
|
||||
mutable reference to it.
|
||||
|
||||
### Iteration Safety
|
||||
|
||||
|
@ -466,24 +463,27 @@ in production systems:
|
|||
if it was `nullptr` but did not care what the actual value was--it will iterate. The destructor automates
|
||||
the iteration.
|
||||
|
||||
### Limitations of the On Demand Approach
|
||||
### Benefits of the On Demand Approach
|
||||
|
||||
We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach. However, there are some limitations.
|
||||
We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach.
|
||||
|
||||
Pros of the On Demand approach:
|
||||
* Faster than DOM in some cases. Reduced memory usage.
|
||||
* Straightforward, programmer-friendly interface (arrays and objects).
|
||||
* Highly expressive, beyond deserialization and pointer queries: many tasks can be accomplished with little code.
|
||||
|
||||
Cons of the On Demand approach:
|
||||
* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX). It is possible for the programmer to handle out-of-order keys when the JSON dialect is underspecified, but it requires additional care. You should be mindful that, though your software might write the keys in a consistent manner, the JSON specification does not prescribe that the order be significant; thus, a JSON producer could change the order of the keys within an object. The On Demand API will still help the programmer by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases (e.g., by rejecting the JSON input that does not follow the expected JSON dialect).
|
||||
* Less safe than DOM: we only validate the components of the JSON document that are used, so it is possible to begin ingesting a document only to find out later that it is invalid. Are you fine ingesting a large JSON document that starts with well-formed JSON but ends with invalid JSON content?
|
||||
### Limitations of the On Demand Approach
|
||||
|
||||
The On Demand approach has some limitations:
|
||||
|
||||
* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX).
|
||||
* The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used, so it is possible to begin ingesting a document only to find out later that it is invalid. Are you fine ingesting a large JSON document that starts with well-formed JSON but ends with invalid JSON content?
|
||||
|
||||
There are currently additional technical limitations which we expect to resolve in future releases of the simdjson library:
|
||||
|
||||
* The simdjson library offers runtime dispatching which allows you to compile one binary and have it run at full speed on different processors, taking advantage of the specific features of the processor. The On Demand API does not have runtime dispatch support at this time. To benefit from the On Demand API, you must compile your code for a specific processor. E.g., if your processor supports AVX2 instructions, you should compile your binary executable with AVX2 instruction support (by using your compiler's commands). If you are sufficiently technically proficient, you can implement runtime dispatching within your application, by compiling your On Demand code for different processors.
|
||||
* There is an initial phase which scans the entire document quickly, irrespective of the size of the document. We plan to break this phase into distinct steps for large files in a future release as we have done with other components of our API (e.g., `parse_many`).
|
||||
* The On Demand API does not support JSON Pointer. This capability is currently limited to our core API.
|
||||
* We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
|
||||
* You should be mindful that, though your software might write the keys in a consistent manner, the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt) states that "JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible". The On Demand API will help the programmer handle unexpected JSON dialects by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases: e.g., by rejecting the JSON input that does not follow the expected JSON dialect. We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
|
||||
```C++
|
||||
for (ondemand::object my_object : doc["mykey"]) {
|
||||
for (auto field : my_object) {
|
||||
|
@ -519,11 +519,10 @@ most programmers will want to target `arm64`. The `fallback` is probably only go
|
|||
std::cout << simdjson::builtin_implementation()->name() << std::endl;
|
||||
```
|
||||
|
||||
If you are using CMake for your C++ project, then you can pass compilation flags to your compiler during the first configuration
|
||||
by using the `CXXFLAGS` configuration variable:
|
||||
```
|
||||
CXXFLAGS=-march=haswell cmake -B build_haswell
|
||||
cmake --build build_haswell
|
||||
```
|
||||
If you are using CMake for your C++ project, then you can pass compilation flags to your compiler by using
|
||||
the `CMAKE_CXX_FLAGS` variable:
|
||||
|
||||
You may also use the `CMAKE_CXX_FLAGS` variable.
|
||||
```
|
||||
cmake -DCMAKE_CXX_FLAGS="-march=haswell" -B build_haswell
|
||||
cmake --build build_haswell
|
||||
```
|