Let us try to improve the documentation so that it is clearer. (#1733)

* Let us try to improve the documentation so that it is clearer.

* Minor tweaks.
Daniel Lemire 2021-10-19 13:09:41 -04:00 committed by GitHub
parent b7c4d1eeef
commit 9e477ddb00
3 changed files with 42 additions and 18 deletions


@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
----------------------------------------------
The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
As required by the standard, your JSON document should be Unicode (UTF-8) strings.
As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
inputs before or after a document.
The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating an `ondemand::parser` and calling the `iterate()` method:
The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
access by creating an `ondemand::parser` and calling the `iterate()` method. The iterate method
quickly indexes the input string and may detect some errors. The following example illustrates
how to get started with an input JSON file (`"twitter.json"`):
```c++
ondemand::parser parser;
auto json = padded_string::load("twitter.json");
auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
```
Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
You can also create a padded string---for efficiency reasons, simdjson requires a string
with a few bytes (`simdjson::SIMDJSON_PADDING`) at the end---and call `iterate()`:
```c++
ondemand::parser parser;
@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
code has been tested.
Once you have a document (`simdjson::ondemand::document`), you can navigate it with
idiomatic C++ iterators, operators and casts. Besides the documents instances and
idiomatic C++ iterators, operators and casts. Besides the document instances and
native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
and arrays (`simdjson::ondemand::array`).
We also have a generic type (`simdjson::ondemand::value`) which represent a potential
We also have a generic type (`simdjson::ondemand::value`) which represents a potential
array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).
@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
we have an additional `ondemand::number` type which may represent either integers
or floating-point values, depending on how the numbers are formatted.
floating-point values followed by an integer.
While you are accessing the document, the `document` instance should remain in scope:
it is your "iterator" which keeps track of where you are in the JSON document.
By design, there is one and only one `document` instance per JSON document.
@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
Newline-Delimited JSON (ndjson) and JSON lines
----------------------------------------------
The simdjson library also supports multithreaded JSON streaming through a large file containing many
smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
format. If your JSON documents all contain arrays or objects, we even support direct file
concatenation without whitespace. The concatenated file has no size restrictions (including larger
than 4GB), though each individual document must be no larger than 4 GB.
When processing large inputs (e.g., in the context of data engineering), engineers commonly
serialize data into streams of multiple JSON documents. That is, instead of one large
(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
write out multiple records as independent JSON documents, to be read one-by-one.
The simdjson library also supports multithreaded JSON streaming through a large file
containing many smaller JSON documents in either [ndjson](http://ndjson.org)
or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
or objects, we even support direct file concatenation without whitespace. However, if there
is content between your JSON documents, it should be exclusively ASCII white-space characters.
The concatenated file has no size restrictions (including larger than 4 GB), though each
individual document must be no larger than 4 GB.
Here is an example:


@ -1,7 +1,9 @@
iterate_many
==========
An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
When serializing large databases, it is often better to write out many independent JSON
documents, instead of one large monolithic document containing many records. The simdjson
library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
```JSON
{"text":"a"}
{"text":"b"}
@ -114,7 +116,9 @@ Whitespace Characters:
- **Linefeed**
- **Carriage return**
- **Horizontal tab**
- **Nothing**
If your documents are all objects or arrays, then you may even have nothing between them.
E.g., `[1,2]{"32":1}` is recognized as two documents.
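To see why objects and arrays can be concatenated without a separator, here is a minimal, self-contained sketch (C++17, standard library only). It is not how simdjson is implemented, and `split_documents` is a hypothetical helper name. It only tracks bracket depth and string boundaries: a document ends when the depth returns to zero, and anything between documents must be an ASCII space, linefeed, carriage return or horizontal tab. It does not validate the JSON itself, does not check that `{` is closed by `}` rather than `]`, and does not handle top-level primitives, which have no closing bracket to mark their end.

```c++
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of simdjson: splits a buffer of concatenated
// JSON objects/arrays into one string_view per document.
// Returns std::nullopt if something other than ASCII whitespace appears
// between documents, or if the last document is left unclosed.
std::optional<std::vector<std::string_view>> split_documents(std::string_view input) {
  std::vector<std::string_view> docs;
  std::size_t depth = 0;   // current nesting level of {} and []
  std::size_t start = 0;   // offset where the current document began
  bool in_string = false;  // true while inside a JSON string
  bool escaped = false;    // true right after a backslash inside a string
  for (std::size_t i = 0; i < input.size(); i++) {
    const char c = input[i];
    if (in_string) {
      // Brackets inside strings must not change the depth.
      if (escaped) { escaped = false; }
      else if (c == '\\') { escaped = true; }
      else if (c == '"') { in_string = false; }
      continue;
    }
    if (depth == 0) {
      // Between documents: skip ASCII whitespace, accept only '{' or '['.
      if (c == ' ' || c == '\n' || c == '\r' || c == '\t') { continue; }
      if (c != '{' && c != '[') { return std::nullopt; }
      start = i;
      depth = 1;
      continue;
    }
    if (c == '"') { in_string = true; }
    else if (c == '{' || c == '[') { depth++; }
    else if (c == '}' || c == ']') {
      depth--;
      // Depth back to zero: the current document is complete.
      if (depth == 0) { docs.push_back(input.substr(start, i - start + 1)); }
    }
  }
  if (depth != 0) { return std::nullopt; }  // input ended inside a document
  return docs;
}
```

Calling `split_documents("[1,2]{\"32\":1}")` yields two views, `[1,2]` and `{"32":1}`, whereas an input such as `[1] x [2]` is rejected because `x` is not ASCII whitespace.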
Some official formats **(non-exhaustive list)**:
- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)


@ -54,6 +54,11 @@ public:
* ondemand::parser parser;
* document doc = parser.iterate(json);
*
* It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
* Otherwise the iterate method may return an error. In particular, the whole input should be
* valid: we do not attempt to tolerate incorrect content either before or after a JSON
* document.
*
* ### IMPORTANT: Validate what you use
*
* Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
@ -166,13 +171,15 @@ public:
* ### Format
*
* The buffer must contain a series of one or more JSON documents, concatenated into a single
* buffer, separated by whitespace. It effectively parses until it has a fully valid document,
* buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
* then starts parsing the next document at that point. (It does this with more parallelism and
* lookahead than you might think, though.)
*
* Documents that consist of an object or array may omit the whitespace between them, concatenating
* with no separator. documents that consist of a single primitive (i.e. documents that are not
* arrays or objects) MUST be separated with whitespace.
* with no separator. Documents that consist of a single primitive (i.e. documents that are not
* arrays or objects) MUST be separated with ASCII whitespace.
*
* The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
*
* The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
* Setting batch_size to excessively large or excessively small values may impact negatively the