Let us try to improve the documentation so that it is clearer. (#1733)

* Let us try to improve the documentation so that it is clearer.
* Minor tweaks.
2021-10-19 13:09:41 -04:00 · 9e477ddb00
parent b7c4d1eeef
commit 9e477ddb00
3 changed files with 42 additions and 18 deletions
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
 ----------------------------------------------

 The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
-As required by the standard, your JSON document should be Unicode (UTF-8) strings.
+As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
+string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
+inputs before or after a document.

-
-The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating a `ondemand::parser` and calling the `iterate()` method:
+The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
+access by creating an `ondemand::parser` and calling the `iterate()` method. The iterate method
+quickly indexes the input string and may detect some errors. The following example illustrates
+how to get started with an input JSON file (`"twitter.json"`):

 ```c++
 ondemand::parser parser;
-auto json = padded_string::load("twitter.json");
+auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
 ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
 ```

-Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
- bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
+You can also create a padded string and call `iterate()` on it; for efficiency reasons,
+simdjson requires a string with a few extra bytes (`simdjson::SIMDJSON_PADDING`) at the end:

 ```c++
 ondemand::parser parser;
@@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
 code has been tested.

 Once you have a document (`simdjson::ondemand::document`), you can navigate it with
-idiomatic C++ iterators, operators and casts. Besides the documents instances and
+idiomatic C++ iterators, operators and casts. Besides the document instances and
 native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
 Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
 and arrays (`simdjson::ondemand::array`).
-We also have a generic type (`simdjson::ondemand::value`) which represent a potential
+We also have a generic type (`simdjson::ondemand::value`) which represents a potential
 array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
 a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).

@@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
 we have an additional `ondemand::number` type which may represent either integers
 or floating-point values, depending on how the numbers are formatted.
 floating-point values followed by an integer.
+
 While you are accessing the document, the `document` instance should remain in scope:
 it is your "iterator" which keeps track of where you are in the JSON document.
 By design, there is one and only one `document` instance per JSON document.
@@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
 Newline-Delimited JSON (ndjson) and JSON lines
 ----------------------------------------------

-The simdjson library also supports multithreaded JSON streaming through a large file containing many
-smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
-format. If your JSON documents all contain arrays or objects, we even support direct file
-concatenation without whitespace. The concatenated file has no size restrictions (including larger
-than 4GB), though each individual document must be no larger than 4 GB.
+When processing large inputs (e.g., in the context of data engineering), engineers commonly
+serialize data into streams of multiple JSON documents. That is, instead of one large
+(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
+write out multiple records as independent JSON documents, to be read one-by-one.
+
+The simdjson library also supports multithreaded JSON streaming through a large file
+containing many smaller JSON documents in either [ndjson](http://ndjson.org)
+or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
+or objects, we even support direct file concatenation without whitespace. However, if there
+is content between your JSON documents, it should be exclusively ASCII white-space characters.
+
+The concatenated file has no size restrictions (it may exceed 4 GB), though each
+individual document must be no larger than 4 GB.

 Here is an example:

--- a/doc/iterate_many.md
+++ b/doc/iterate_many.md
@@ -1,7 +1,9 @@
 iterate_many
 ==========

-An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
+When serializing large databases, it is often better to write out many independent JSON
+documents, instead of one large monolithic document containing many records. The simdjson
+library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
 ```JSON
 {"text":"a"}
 {"text":"b"}
@@ -114,7 +116,9 @@ Whitespace Characters:
 - **Linefeed**
 - **Carriage return**
 - **Horizontal tab**
-- **Nothing**
+
+If your documents are all objects or arrays, then you may even have nothing between them.
+E.g., `[1,2]{"32":1}` is recognized as two documents.

 Some official formats **(non-exhaustive list)**:
 - [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
--- a/include/simdjson/generic/ondemand/parser.h
+++ b/include/simdjson/generic/ondemand/parser.h
@@ -54,6 +54,11 @@ public:
   *   ondemand::parser parser;
   *   document doc = parser.iterate(json);
   *
+   * It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
+   * Otherwise the iterate method may return an error. In particular, the whole input should be
+   * valid: we do not attempt to tolerate incorrect content either before or after a JSON
+   * document.
+   *
   * ### IMPORTANT: Validate what you use
   *
   * Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
@@ -166,13 +171,15 @@ public:
   * ### Format
   *
   * The buffer must contain a series of one or more JSON documents, concatenated into a single
-   * buffer, separated by whitespace. It effectively parses until it has a fully valid document,
+   * buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
   * then starts parsing the next document at that point. (It does this with more parallelism and
   * lookahead than you might think, though.)
   *
   * Documents that consist of an object or array may omit the whitespace between them, concatenating
-   * with no separator. documents that consist of a single primitive (i.e. documents that are not
-   * arrays or objects) MUST be separated with whitespace.
+   * with no separator. Documents that consist of a single primitive (i.e. documents that are not
+   * arrays or objects) MUST be separated with ASCII whitespace.
+   *
+   * The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
   *
   * The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
   * Setting batch_size to excessively large or excessively small values may impact negatively the