Let us try to improve the documentation so that it is clearer. (#1733)
* Let us try to improve the documentation so that it is clearer.
* Minor tweaks.
parent
b7c4d1eeef
commit
9e477ddb00
|
@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
|
||||||
----------------------------------------------
|
----------------------------------------------
|
||||||
|
|
||||||
The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
|
The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
|
||||||
As required by the standard, your JSON document should be Unicode (UTF-8) strings.
|
As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
|
||||||
|
string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
|
||||||
|
inputs before or after a document.
|
||||||
|
|
||||||
|
The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
|
||||||
The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating a `ondemand::parser` and calling the `iterate()` method:
|
access by creating a `ondemand::parser` and calling the `iterate()` method. The iterate method
|
||||||
|
quickly indexes the input string and may detect some errors. The following example illustrates
|
||||||
|
how to get started with an input JSON file (`"twitter.json"`):
|
||||||
|
|
||||||
```c++
|
```c++
|
||||||
ondemand::parser parser;
|
ondemand::parser parser;
|
||||||
auto json = padded_string::load("twitter.json");
|
auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
|
||||||
ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
|
ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
|
||||||
```
|
```
|
||||||
|
|
||||||
Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
|
You can also create a padded string---for efficiency reasons, simdjson requires a string
|
||||||
bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
|
with a few bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
|
||||||
|
|
||||||
```c++
|
```c++
|
||||||
ondemand::parser parser;
|
ondemand::parser parser;
|
||||||
|
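The padding requirement mentioned in this hunk can be pictured with a small standard-library sketch. It is not simdjson code: `make_padded_copy` and `kIllustrativePadding` are hypothetical names, and the value 64 is only an assumption for illustration. Real code should rely on `simdjson::padded_string` and `simdjson::SIMDJSON_PADDING`.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical padding size, for illustration only. Real code should use
// simdjson::SIMDJSON_PADDING instead of a hard-coded number.
constexpr std::size_t kIllustrativePadding = 64;

// Copies `json` into a new buffer that has `padding` extra zero bytes after
// the content, so that reads slightly past the end stay inside the buffer.
std::vector<char> make_padded_copy(std::string_view json,
                                   std::size_t padding = kIllustrativePadding) {
  std::vector<char> buffer(json.size() + padding, '\0');
  if (!json.empty()) {
    std::memcpy(buffer.data(), json.data(), json.size());
  }
  return buffer;
}
```

The logical length of the document remains `json.size()`; the extra bytes only guarantee that the allocation extends beyond the content.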
@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
|
||||||
code has been tested.
|
code has been tested.
|
||||||
|
|
||||||
Once you have a document (`simdjson::ondemand::document`), you can navigate it with
|
Once you have a document (`simdjson::ondemand::document`), you can navigate it with
|
||||||
idiomatic C++ iterators, operators and casts. Besides the documents instances and
|
idiomatic C++ iterators, operators and casts. Besides the document instances and
|
||||||
native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
|
native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
|
||||||
Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
|
Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
|
||||||
and arrays (`simdjson::ondemand::array`).
|
and arrays (`simdjson::ondemand::array`).
|
||||||
We also have a generic type (`simdjson::ondemand::value`) which represent a potential
|
We also have a generic type (`simdjson::ondemand::value`) which represents a potential
|
||||||
array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
|
array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
|
||||||
a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).
|
a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).
|
||||||
|
|
||||||
|
@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
|
||||||
we have an additional `ondemand::number` type which may represent either integers
|
we have an additional `ondemand::number` type which may represent either integers
|
||||||
or floating-point values, depending on how the numbers are formatted.
|
or floating-point values, depending on how the numbers are formatted.
|
||||||
floating-point values followed by an integer.
|
floating-point values followed by an integer.
|
||||||
|
|
||||||
While you are accessing the document, the `document` instance should remain in scope:
|
While you are accessing the document, the `document` instance should remain in scope:
|
||||||
it is your "iterator" which keeps track of where you are in the JSON document.
|
it is your "iterator" which keeps track of where you are in the JSON document.
|
||||||
By design, there is one and only one `document` instance per JSON document.
|
By design, there is one and only one `document` instance per JSON document.
|
||||||
|
@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
|
||||||
Newline-Delimited JSON (ndjson) and JSON lines
|
Newline-Delimited JSON (ndjson) and JSON lines
|
||||||
----------------------------------------------
|
----------------------------------------------
|
||||||
|
|
||||||
The simdjson library also supports multithreaded JSON streaming through a large file containing many
|
When processing large inputs (e.g., in the context of data engineering), engineers commonly
|
||||||
smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
|
serialize data into streams of multiple JSON documents. That is, instead of one large
|
||||||
format. If your JSON documents all contain arrays or objects, we even support direct file
|
(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
|
||||||
concatenation without whitespace. The concatenated file has no size restrictions (including larger
|
write out multiple records as independent JSON documents, to be read one-by-one.
|
||||||
than 4GB), though each individual document must be no larger than 4 GB.
|
|
||||||
|
The simdjson library also supports multithreaded JSON streaming through a large file
|
||||||
|
containing many smaller JSON documents in either [ndjson](http://ndjson.org)
|
||||||
|
or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
|
||||||
|
or objects, we even support direct file concatenation without whitespace. However, if there
|
||||||
|
is content between your JSON documents, it should be exclusively ASCII white-space characters.
|
||||||
|
|
||||||
|
The concatenated file has no size restrictions (including larger than 4GB), though each
|
||||||
|
individual document must be no larger than 4 GB.
|
||||||
|
|
||||||
Here is an example:
|
Here is an example:
|
||||||
|
|
||||||
|
|
|
@ -1,7 +1,9 @@
|
||||||
iterate_many
|
iterate_many
|
||||||
==========
|
==========
|
||||||
|
|
||||||
An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
|
When serializing large databases, it is often better to write out many independent JSON
|
||||||
|
documents, instead of one large monolithic document containing many records. The simdjson
|
||||||
|
library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
|
||||||
```JSON
|
```JSON
|
||||||
{"text":"a"}
|
{"text":"a"}
|
||||||
{"text":"b"}
|
{"text":"b"}
|
||||||
|
@ -114,7 +116,9 @@ Whitespace Characters:
|
||||||
- **Linefeed**
|
- **Linefeed**
|
||||||
- **Carriage return**
|
- **Carriage return**
|
||||||
- **Horizontal tab**
|
- **Horizontal tab**
|
||||||
- **Nothing**
|
|
||||||
|
If your documents are all objects or arrays, then you may even have nothing between them.
|
||||||
|
E.g., `[1,2]{"32":1}` is recognized as two documents.
|
||||||
|
|
||||||
Some official formats **(non-exhaustive list)**:
|
Some official formats **(non-exhaustive list)**:
|
||||||
- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
|
- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
|
||||||
|
|
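To see why objects and arrays can be concatenated with nothing between them, as in `[1,2]{"32":1}`, consider the following standard-library sketch. It is not how simdjson finds document boundaries; `split_concatenated` is a hypothetical helper that assumes well-formed input and performs no validation. It only tracks nesting depth and string state: a document ends where the depth returns to zero.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Splits a buffer of concatenated JSON arrays/objects into one view per
// document. Illustration only: assumes well-formed input, validates nothing,
// and ignores top-level scalars (which need whitespace separators anyway).
std::vector<std::string_view> split_concatenated(std::string_view input) {
  std::vector<std::string_view> docs;
  std::size_t depth = 0;   // current nesting level of [] and {}
  std::size_t start = 0;   // index where the current document began
  bool in_string = false;  // true while inside a JSON string
  bool escaped = false;    // true right after a backslash inside a string
  for (std::size_t i = 0; i < input.size(); ++i) {
    const char c = input[i];
    if (in_string) {
      // Brackets inside strings must not affect the depth.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      if (depth == 0) start = i;
      ++depth;
    } else if ((c == '}' || c == ']') && depth > 0) {
      if (--depth == 0) docs.push_back(input.substr(start, i - start + 1));
    }
  }
  return docs;
}
```

A scalar document such as `1` or `true` has no closing delimiter, which is why such documents must be separated by whitespace.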
|
@ -54,6 +54,11 @@ public:
|
||||||
* ondemand::parser parser;
|
* ondemand::parser parser;
|
||||||
* document doc = parser.iterate(json);
|
* document doc = parser.iterate(json);
|
||||||
*
|
*
|
||||||
|
* It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
|
||||||
|
* Otherwise the iterate method may return an error. In particular, the whole input should be
|
||||||
|
* valid: we do not attempt to tolerate incorrect content either before or after a JSON
|
||||||
|
* document.
|
||||||
|
*
|
||||||
* ### IMPORTANT: Validate what you use
|
* ### IMPORTANT: Validate what you use
|
||||||
*
|
*
|
||||||
* Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
|
* Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
|
||||||
|
@ -166,13 +171,15 @@ public:
|
||||||
* ### Format
|
* ### Format
|
||||||
*
|
*
|
||||||
* The buffer must contain a series of one or more JSON documents, concatenated into a single
|
* The buffer must contain a series of one or more JSON documents, concatenated into a single
|
||||||
* buffer, separated by whitespace. It effectively parses until it has a fully valid document,
|
* buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
|
||||||
* then starts parsing the next document at that point. (It does this with more parallelism and
|
* then starts parsing the next document at that point. (It does this with more parallelism and
|
||||||
* lookahead than you might think, though.)
|
* lookahead than you might think, though.)
|
||||||
*
|
*
|
||||||
* documents that consist of an object or array may omit the whitespace between them, concatenating
|
* documents that consist of an object or array may omit the whitespace between them, concatenating
|
||||||
* with no separator. documents that consist of a single primitive (i.e. documents that are not
|
* with no separator. Documents that consist of a single primitive (i.e. documents that are not
|
||||||
* arrays or objects) MUST be separated with whitespace.
|
* arrays or objects) MUST be separated with ASCII whitespace.
|
||||||
|
*
|
||||||
|
* The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
|
||||||
*
|
*
|
||||||
* The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
|
* The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
|
||||||
* Setting batch_size to excessively large or excessively small values may impact negatively the
|
* Setting batch_size to excessively large or excessively small values may impact negatively the
|
||||||
|
|