diff --git a/doc/basics.md b/doc/basics.md
index 776245be..1f57b954 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -127,19 +127,23 @@ The Basics: Loading and Parsing JSON Documents
 ----------------------------------------------
 
 The simdjson library allows you to navigate and validate JSON documents ([RFC 8259](https://www.tbray.org/ongoing/When/201x/2017/12/14/rfc8259.html)).
-As required by the standard, your JSON document should be Unicode (UTF-8) strings.
+As required by the standard, your JSON document should be in a Unicode (UTF-8) string. The whole
+string, from the beginning to the end, needs to be valid: we do not attempt to tolerate bad
+inputs before or after a document.
 
-
-The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can access by creating a `ondemand::parser` and calling the `iterate()` method:
+The simdjson library offers a tree-like [API](https://en.wikipedia.org/wiki/API), which you can
+access by creating a `ondemand::parser` and calling the `iterate()` method. The iterate method
+quickly indexes the input string and may detect some errors. The following example illustrates
+how to get started with an input JSON file (`"twitter.json"`):
 
 ```c++
 ondemand::parser parser;
-auto json = padded_string::load("twitter.json");
+auto json = padded_string::load("twitter.json"); // load JSON file 'twitter.json'.
 ondemand::document doc = parser.iterate(json); // position a pointer at the beginning of the JSON data
 ```
 
-Or by creating a padded string---for efficiency reasons, simdjson requires a string with a few
- bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
+You can also create a padded string---for efficiency reasons, simdjson requires a string
+with a few bytes (`simdjson::SIMDJSON_PADDING`) at the end---and calling `iterate()`:
 
 ```c++
 ondemand::parser parser;
@@ -242,11 +246,11 @@ to enable these additional checks: just make sure you remove the definition once
 code has been tested.
 
 Once you have a document (`simdjson::ondemand::document`), you can navigate it with
-idiomatic C++ iterators, operators and casts. Besides the documents instances and
+idiomatic C++ iterators, operators and casts. Besides the document instances and
 native types (`double`, `uint64_t`, `int64_t`, `bool`), we also access
 Unicode (UTF-8) strings (`std::string_view`), objects (`simdjson::ondemand::object`)
 and arrays (`simdjson::ondemand::array`).
-We also have a generic type (`simdjson::ondemand::value`) which represent a potential
+We also have a generic type (`simdjson::ondemand::value`) which represents a potential
 array or object, or scalar type (`double`, `uint64_t`, `int64_t`, `bool`, `null`, string) inside an array or an object.
 Both generic types (`simdjson::ondemand::document` and `simdjson::ondemand::value`) have a `type()` method returning a `json_type` value describing the
 value (`json_type::array`, `json_type::object`, `json_type::number`, `json_type::string`, `json_type::boolean`, `json_type::null`).
@@ -255,6 +259,7 @@ should review our section [dynamic number types](#dynamic-number-types). Indeed,
 we have an additional `ondemand::number` type which may represent either integers
 or floating-point values, depending on how the numbers are formatted.
 floating-point values followed by an integer.
+
 While you are accessing the document, the `document` instance should remain in scope:
 it is your "iterator" which keeps track of where you are in the JSON document.
 By design, there is one and only one `document` instance per JSON document.
@@ -1157,11 +1162,19 @@ The `raw_json_token()` should be fast and free of allocation.
 Newline-Delimited JSON (ndjson) and JSON lines
 ----------------------------------------------
 
-The simdjson library also supports multithreaded JSON streaming through a large file containing many
-smaller JSON documents in either [ndjson](http://ndjson.org) or [JSON lines](http://jsonlines.org)
-format. If your JSON documents all contain arrays or objects, we even support direct file
-concatenation without whitespace. The concatenated file has no size restrictions (including larger
-than 4GB), though each individual document must be no larger than 4 GB.
+When processing large inputs (e.g., in the context of data engineering), engineers commonly
+serialize data into streams of multiple JSON documents. That is, instead of one large
+(e.g., 2 GB) JSON document containing multiple records, it is often preferable to
+write out multiple records as independent JSON documents, to be read one-by-one.
+
+The simdjson library also supports multithreaded JSON streaming through a large file
+containing many smaller JSON documents in either [ndjson](http://ndjson.org)
+or [JSON lines](http://jsonlines.org) format. If your JSON documents all contain arrays
+or objects, we even support direct file concatenation without whitespace. However, if there
+is content between your JSON documents, it should be exclusively ASCII white-space characters.
+
+The concatenated file has no size restrictions (including larger than 4GB), though each
+individual document must be no larger than 4 GB.
 
 Here is an example:
 
diff --git a/doc/iterate_many.md b/doc/iterate_many.md
index 01f44856..60fb6e40 100644
--- a/doc/iterate_many.md
+++ b/doc/iterate_many.md
@@ -1,7 +1,9 @@
 iterate_many
 ==========
 
-An interface providing features to work with files or streams containing multiple small JSON documents. Given an input such as
+When serializing large databases, it is often better to write out many independent JSON
+documents, instead of one large monolithic document containing many records. The simdjson
+library provides high-speed access to files or streams containing multiple small JSON documents separated by ASCII white-space characters. Given an input such as
 ```JSON
 {"text":"a"}
 {"text":"b"}
@@ -114,7 +116,9 @@ Whitespace Characters:
 - **Linefeed**
 - **Carriage return**
 - **Horizontal tab**
-- **Nothing**
+
+If your documents are all objects or arrays, then you may even have nothing between them.
+E.g., `[1,2]{"32":1}` is recognized as two documents.
 
 Some official formats **(non-exhaustive list)**:
 - [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
diff --git a/include/simdjson/generic/ondemand/parser.h b/include/simdjson/generic/ondemand/parser.h
index 681b10dd..39870e1e 100644
--- a/include/simdjson/generic/ondemand/parser.h
+++ b/include/simdjson/generic/ondemand/parser.h
@@ -54,6 +54,11 @@ public:
  * ondemand::parser parser;
  * document doc = parser.iterate(json);
  *
+ * It is expected that the content is a valid UTF-8 file, containing a valid JSON document.
+ * Otherwise the iterate method may return an error. In particular, the whole input should be
+ * valid: we do not attempt to tolerate incorrect content either before or after a JSON
+ * document.
+ *
  * ### IMPORTANT: Validate what you use
  *
  * Calling iterate on an invalid JSON document may not immediately trigger an error. The call to
@@ -166,13 +171,15 @@ public:
  * ### Format
  *
  * The buffer must contain a series of one or more JSON documents, concatenated into a single
- * buffer, separated by whitespace. It effectively parses until it has a fully valid document,
+ * buffer, separated by ASCII whitespace. It effectively parses until it has a fully valid document,
  * then starts parsing the next document at that point. (It does this with more parallelism and
  * lookahead than you might think, though.)
  *
  * documents that consist of an object or array may omit the whitespace between them, concatenating
- * with no separator. documents that consist of a single primitive (i.e. documents that are not
- * arrays or objects) MUST be separated with whitespace.
+ * with no separator. Documents that consist of a single primitive (i.e. documents that are not
+ * arrays or objects) MUST be separated with ASCII whitespace.
+ *
+ * The characters inside a JSON document, and between JSON documents, must be valid Unicode (UTF-8).
  *
  * The documents must not exceed batch_size bytes (by default 1MB) or they will fail to parse.
  * Setting batch_size to excessively large or excesively small values may impact negatively the