Let us try to improve the documentation so that it is clearer. (#1733)

* Let us try to improve the documentation so that it is clearer.
* Minor tweaks.
2021-10-19 13:09:41 -04:00 · 9e477ddb00
parent b7c4d1eeef
commit 9e477ddb00
3 changed files with 42 additions and 18 deletions
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
 ----------------------------------------------
 The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
-As required by the standard, your JSON document should be Unicode (UTF-8) strings.
+As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
 string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
 inputs before or after a document.
-
+The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
-The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating a `ondemand::parser` and calling the `iterate()` method:
+access by creating a `ondemand::parser` and calling the `iterate()` method. The iterate method
 quickly indexes the input string and may detect some errors. The following example illustrates
 how to get started with an input JSON file (`"twitter.json"`):
 ```c++
 ondemand::parser parser;
-auto json = padded_string::load("twitter.json");
+auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
 ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
 ```
-Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
+You can also create a padded string---for efficiency reasons, simdjson requires a string
- bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
+with a few bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
 ```c++
 ondemand::parser parser;
@@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
 code has been tested.
 Once you have a document (`simdjson::ondemand::document`), you can navigate it with
-idiomatic C++ iterators, operators and casts. Besides the documents instances and
+idiomatic C++ iterators, operators and casts. Besides the document instances and
 native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
 Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
 and arrays (`simdjson::ondemand::array`).
-We also have a generic type (`simdjson::ondemand::value`) which represent a potential
+We also have a generic type (`simdjson::ondemand::value`) which represents a potential
 array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object. Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning
 a `json_type` value describing the value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).
@@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
 we have an additional `ondemand::number` type which may represent either integers
 or floating-point values, depending on how the numbers are formatted.
 floating-point values followed by an integer.
 While you are accessing the document, the `document` instance should remain in scope:
 it is your "iterator" which keeps track of where you are in the JSON document.
 By design, there is one and only one `document` instance per JSON document.
@@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
 Newline-Delimited JSON (ndjson) and JSON lines
 ----------------------------------------------
-The simdjson library also supports multithreaded JSON streaming through a large file containing many
+When processing large inputs (e.g., in the context of data engineering), engineers commonly
-smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
+serialize data into streams of multiple JSON documents. That is, instead of one large
-format. If your JSON documents all contain arrays or objects, we even support direct file
+(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
-concatenation without whitespace. The concatenated file has no size restrictions (including larger
+write out multiple records as independent JSON documents, to be read one-by-one.
-than 4GB), though each individual document must be no larger than 4 GB.
+
 The simdjson library also supports multithreaded JSON streaming through a large file
 containing many smaller JSON documents in either [ndjson](http://ndjson.org)
 or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
 or objects, we even support direct file concatenation without whitespace. However, if there
 is content between your JSON documents, it should be exclusively ASCII white-space characters.
 The concatenated file has no size restrictions (including larger than 4GB), though each
 individual document must be no larger than 4 GB.
 Here is an example:
--- a/doc/iterate_many.md
+++ b/doc/iterate_many.md
@@ -1,7 +1,9 @@
 iterate_many
 ==========
-An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
+When serializing large databases, it is often better to write out many independent JSON
 documents, instead of one large monolithic document containing many records. The simdjson
 library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
 ```JSON
 {"text":"a"}
 {"text":"b"}
@@ -114,7 +116,9 @@ Whitespace Characters:
 - **Linefeed**
 - **Carriage return**
 - **Horizontal tab**
-- **Nothing**
+
 If your documents are all objects or arrays, then you may even have nothing between them.
 E.g., `[1,2]{"32":1}` is recognized as two documents.
 Some official formats **(non-exhaustive list)**:
 - [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
--- a/include/simdjson/generic/ondemand/parser.h
+++ b/include/simdjson/generic/ondemand/parser.h
@@ -54,6 +54,11 @@ public:
   *   ondemand::parser parser;
   *   document doc = parser.iterate(json);
   *
   * It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
   * Otherwise the iterate method may return an error. In particular, the whole input should be
   * valid: we do not attempt to tolerate incorrect content either before or after a JSON
   * document.
   *
   * ### IMPORTANT: Validate what you use
   *
   * Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
@@ -166,13 +171,15 @@ public:
   * ### Format
   *
   * The buffer must contain a series of one or more JSON documents, concatenated into a single
-   * buffer, separated by whitespace. It effectively parses until it has a fully valid document,
+   * buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
   * then starts parsing the next document at that point. (It does this with more parallelism and
   * lookahead than you might think, though.)
   *
   * documents that consist of an object or array may omit the whitespace between them, concatenating
-   * with no separator. documents that consist of a single primitive (i.e. documents that are not
+   * with no separator. Documents that consist of a single primitive (i.e. documents that are not
-   * arrays or objects) MUST be separated with whitespace.
+   * arrays or objects) MUST be separated with ASCII whitespace.
   *
   * The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
   *
   * The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
   * Setting batch_size to excessively large or excesively small values may impact negatively the