Let us try to improve the documentation so that it is clearer. (#1733)
* Let us try to improve the documentation so that it is clearer.
* Minor tweaks.
This commit is contained in:
parent b7c4d1eeef
commit 9e477ddb00
@@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
----------------------------------------------

The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
As required by the standard, your JSON document should be Unicode (UTF-8) strings.
As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
inputs before or after a document.


The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating a `ondemand::parser` and calling the `iterate()` method:
The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
access by creating a `ondemand::parser` and calling the `iterate()` method. The iterate method
quickly indexes the input string and may detect some errors. The following example illustrates
how to get started with an input JSON file (`"twitter.json"`):

```c++
ondemand::parser parser;
auto json = padded_string::load("twitter.json");
auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
```

Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
You can also create a padded string---for efficiency reasons, simdjson requires a string
with a few bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:

```c++
ondemand::parser parser;
@@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
code has been tested.

Once you have a document (`simdjson::ondemand::document`), you can navigate it with
idiomatic C++ iterators, operators and casts. Besides the documents instances and
idiomatic C++ iterators, operators and casts. Besides the document instances and
native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
and arrays (`simdjson::ondemand::array`).
We also have a generic type (`simdjson::ondemand::value`) which represent a potential
We also have a generic type (`simdjson::ondemand::value`) which represents a potential
array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).

@@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
we have an additional `ondemand::number` type which may represent either integers
or floating-point values, depending on how the numbers are formatted.
floating-point values followed by an integer.

While you are accessing the document, the `document` instance should remain in scope:
it is your "iterator" which keeps track of where you are in the JSON document.
By design, there is one and only one `document` instance per JSON document.
@@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
Newline-Delimited JSON (ndjson) and JSON lines
----------------------------------------------

The simdjson library also supports multithreaded JSON streaming through a large file containing many
smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
format. If your JSON documents all contain arrays or objects, we even support direct file
concatenation without whitespace. The concatenated file has no size restrictions (including larger
than 4GB), though each individual document must be no larger than 4 GB.
When processing large inputs (e.g., in the context of data engineering), engineers commonly
serialize data into streams of multiple JSON documents. That is, instead of one large
(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
write out multiple records as independent JSON documents, to be read one-by-one.

The simdjson library also supports multithreaded JSON streaming through a large file
containing many smaller JSON documents in either [ndjson](http://ndjson.org)
or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
or objects, we even support direct file concatenation without whitespace. However, if there
is content between your JSON documents, it should be exclusively ASCII white-space characters.

The concatenated file has no size restrictions (including larger than 4GB), though each
individual document must be no larger than 4 GB.

Here is an example:

@@ -1,7 +1,9 @@
iterate_many
==========

An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
When serializing large databases, it is often better to write out many independent JSON
documents, instead of one large monolithic document containing many records. The simdjson
library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
```JSON
{"text":"a"}
{"text":"b"}
@@ -114,7 +116,9 @@ Whitespace Characters:
- **Linefeed**
- **Carriage return**
- **Horizontal tab**
- **Nothing**

If your documents are all objects or arrays, then you may even have nothing between them.
E.g., `[1,2]{"32":1}` is recognized as two documents.

Some official formats **(non-exhaustive list)**:
- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
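
Reviewer note on the `[1,2]{"32":1}` example in the hunk above: the reason arrays and objects can be concatenated with no separator is that their closing bracket or brace marks the end of the document unambiguously, whereas adjacent scalars such as `1` and `2` would fuse into `12`. The sketch below illustrates that idea with the standard library only. It is not how simdjson implements `iterate_many`; `split_documents` is a hypothetical helper introduced here for illustration, it performs no JSON validation, and it ignores top-level scalars.

```c++
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (NOT part of simdjson): split a buffer of concatenated
// JSON arrays/objects into one string_view per document by tracking nesting
// depth. It does not validate the JSON and skips top-level scalars.
std::vector<std::string_view> split_documents(std::string_view input) {
  std::vector<std::string_view> docs;
  std::size_t start = 0;   // index where the current document began
  int depth = 0;           // current nesting depth of [] and {}
  bool in_string = false;  // true while inside a JSON string
  bool escaped = false;    // true right after a backslash inside a string
  for (std::size_t i = 0; i < input.size(); ++i) {
    const char c = input[i];
    if (in_string) {
      // Brackets and braces inside strings must not affect the depth.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '[' || c == '{') {
      if (depth == 0) { start = i; }  // a new top-level document begins
      ++depth;
    } else if (c == ']' || c == '}') {
      if (depth > 0 && --depth == 0) {
        // Back at the top level: the document spans [start, i].
        docs.push_back(input.substr(start, i - start + 1));
      }
    }
  }
  return docs;
}
```

Called on `[1,2]{"32":1}`, this helper yields two views, `[1,2]` and `{"32":1}`, which matches the behavior the added documentation describes. The returned views point into the caller's buffer, so that buffer must outlive them.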
@@ -54,6 +54,11 @@ public:
* ondemand::parser parser;
* document doc = parser.iterate(json);
*
* It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
* Otherwise the iterate method may return an error. In particular, the whole input should be
* valid: we do not attempt to tolerate incorrect content either before or after a JSON
* document.
*
* ### IMPORTANT: Validate what you use
*
* Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
@@ -166,13 +171,15 @@ public:
* ### Format
*
* The buffer must contain a series of one or more JSON documents, concatenated into a single
* buffer, separated by whitespace. It effectively parses until it has a fully valid document,
* buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
* then starts parsing the next document at that point. (It does this with more parallelism and
* lookahead than you might think, though.)
*
* documents that consist of an object or array may omit the whitespace between them, concatenating
* with no separator. documents that consist of a single primitive (i.e. documents that are not
* arrays or objects) MUST be separated with whitespace.
* with no separator. Documents that consist of a single primitive (i.e. documents that are not
* arrays or objects) MUST be separated with ASCII whitespace.
*
* The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
*
* The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
* Setting batch_size to excessively large or excesively small values may impact negatively the