diff --git a/doc/basics.md b/doc/basics.md
index fd896f04..bcc6a6e0 100644
--- a/doc/basics.md
+++ b/doc/basics.md
@@ -582,7 +582,7 @@ format. If your JSON documents all contain arrays or objects, we even support di
 concatenation without whitespace. The concatenated file has no size restrictions (including
 larger than 4GB), though each individual document must be no larger than 4 GB.
 
-Here is a simple example, given "x.json" with this content:
+Here is a simple example, given `x.json` with this content:
 
 ```json
 { "foo": 1 }
@@ -626,6 +626,8 @@ document at a time.
 
 Both `load_many` and `parse_many` take an optional parameter `size_t batch_size` which defines the window processing size. It is set by default to a large value (`1000000` corresponding to 1 MB). None of your JSON documents should exceed this window size, or else you will get the error `simdjson::CAPACITY`. You cannot set this window size larger than 4 GB: you will get the error `simdjson::CAPACITY`. The smaller the window size is, the less memory the function will use. Setting the window size too small (e.g., less than 100 kB) may also impact performance negatively. Leaving it to 1 MB is expected to be a good choice, unless you have some larger documents.
 
+If your documents are large (e.g., larger than a megabyte), then the `load_many` and `parse_many` functions may be ill-suited. They are really meant to support efficiently reading streams of relatively small documents (e.g., a few kilobytes each). If you have larger documents, you should use other functions like `parse`.
+
 See [parse_many.md](parse_many.md) for detailed information and design.
 
 Thread Safety
diff --git a/doc/parse_many.md b/doc/parse_many.md
index 49e93d37..64fe928e 100644
--- a/doc/parse_many.md
+++ b/doc/parse_many.md
@@ -1,7 +1,7 @@
 parse_many
 ==========
 
-An interface providing features to work with files or streams containing multiple JSON documents.
+An interface providing features to work with files or streams containing multiple small JSON documents.
 As fast and convenient as possible.
 
 Contents
@@ -14,16 +14,16 @@ Contents
 - [API](#api)
 - [Use cases](#use-cases)
 
-Motivations
+Motivation
 -----------
 
 The main motivation for this piece of software is to achieve maximum speed and offer a
-better quality of life in parsing files containing multiple JSON documents.
+better quality of life in parsing files containing multiple small JSON documents.
 
-The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a very handy
+The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a handy
 serialization format. However, when serializing a large sequence of
 values as an array, or a possibly indeterminate-length or never-
-ending sequence of values, JSON becomes difficult to work with.
+ending sequence of values, JSON may be inconvenient.
 
 Consider a sequence of one million values, each possibly one kilobyte
 when encoded -- roughly one gigabyte. It is often desirable to process such a dataset incrementally
@@ -32,9 +32,9 @@ without having to first read all of it before beginning to produce results.
 Performance
 -----------
 
-Here is a chart comparing the speed of the different alternatives to parse a multiline JSON.
-The simdjson library provides a threaded and non-threaded parse_many() implementation. As the
-figure below shows, if you can, use threads, but if you can't, it's still pretty fast!
+The following is a chart comparing the speed of the different alternatives to parse a multiline JSON.
+The simdjson library provides a threaded and non-threaded `parse_many()` implementation. As the
+figure below shows, if you can, use threads, but if you cannot, the unthreaded mode is still fast!
 [![Chart.png](/doc/Multiline_JSON_Parse_Competition.png)](/doc/Multiline_JSON_Parse_Competition.png)
 
 How it works
@@ -71,7 +71,7 @@ allocate enough memory so that all the documents can fit. This value is what we
 As of right now, we need to manually specify a value for this batch size, it has to be at least as
 big as the biggest document in your file, but not too big so that it submerges the cached memory.
 The bigger the batch size, the fewer we need to make allocations. We found that 1MB is somewhat a
-sweet spot for now.
+sweet spot.
 
 1. When the user calls `parse_many`, we return a `document_stream` which the user can iterate over
 to receive parsed documents.