Making it clearer that parse_many is meant for *small* documents. (#1205)

* Making it clearer that parse_many is meant for *small* documents.

* Update parse_many.md
Daniel Lemire 2020-10-06 17:19:34 -04:00 committed by GitHub
parent 04267e0f6b
commit 1f41cc2030
2 changed files with 12 additions and 10 deletions


@@ -582,7 +582,7 @@ format. If your JSON documents all contain arrays or objects, we even support di
 concatenation without whitespace. The concatenated file has no size restrictions (including larger
 than 4GB), though each individual document must be no larger than 4 GB.
-Here is a simple example, given "x.json" with this content:
+Here is a simple example, given `x.json` with this content:
 ```json
 { "foo": 1 }
@@ -626,6 +626,8 @@ document at a time.
 Both `load_many` and `parse_many` take an optional parameter `size_t batch_size` which defines the window processing size. It is set by default to a large value (`1000000` corresponding to 1 MB). None of your JSON documents should exceed this window size, or else you will get the error `simdjson::CAPACITY`. You cannot set this window size larger than 4 GB: you will get the error `simdjson::CAPACITY`. The smaller the window size is, the less memory the function will use. Setting the window size too small (e.g., less than 100 kB) may also impact performance negatively. Leaving it to 1 MB is expected to be a good choice, unless you have some larger documents.
+If your documents are large (e.g., larger than a megabyte), then the `load_many` and `parse_many` functions are maybe ill-suited. They are really meant to support reading efficiently streams of relatively small documents (e.g., a few kilobytes each). If you have larger documents, you should use other functions like `parse`.
 See [parse_many.md](parse_many.md) for detailed information and design.
 Thread Safety
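
As a rough illustration of the batch-size rule in the hunk above (no document may exceed `batch_size`, which defaults to `1000000`), the following sketch checks a buffer before it is handed to `parse_many`. It is not part of simdjson: `largest_document_size` and `fits_in_batch` are hypothetical helpers, and the code assumes one JSON document per line, which the library itself does not require.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Default window size quoted in the documentation above (1 MB).
constexpr std::size_t kDefaultBatchSize = 1000000;

// Hypothetical helper: returns the size in bytes of the longest line,
// under the assumption that each line holds exactly one JSON document.
std::size_t largest_document_size(std::string_view text) {
  std::size_t largest = 0;
  std::size_t start = 0;
  while (start <= text.size()) {
    std::size_t end = text.find('\n', start);
    if (end == std::string_view::npos) {
      end = text.size();  // last line has no trailing newline
    }
    largest = std::max(largest, end - start);
    start = end + 1;
  }
  return largest;
}

// Hypothetical helper: true when every document fits in the window,
// i.e. when the documented CAPACITY error should not be triggered by size.
bool fits_in_batch(std::string_view text,
                   std::size_t batch_size = kDefaultBatchSize) {
  return largest_document_size(text) <= batch_size;
}
```

If `fits_in_batch` returns `false`, the text above suggests either raising `batch_size` (it cannot exceed 4 GB) or, for documents larger than about a megabyte, using `parse` instead.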


@@ -1,7 +1,7 @@
 parse_many
 ==========
-An interface providing features to work with files or streams containing multiple JSON documents.
+An interface providing features to work with files or streams containing multiple small JSON documents.
 As fast and convenient as possible.
 Contents
@@ -14,16 +14,16 @@ Contents
 - [API](#api)
 - [Use cases](#use-cases)
-Motivations
+Motivation
 -----------
 The main motivation for this piece of software is to achieve maximum speed and offer a
-better quality of life in parsing files containing multiple JSON documents.
-The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a very handy
+better quality of life in parsing files containing multiple small JSON documents.
+The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a handy
 serialization format. However, when serializing a large sequence of
 values as an array, or a possibly indeterminate-length or never-
-ending sequence of values, JSON becomes difficult to work with.
+ending sequence of values, JSON may be inconvenient.
 Consider a sequence of one million values, each possibly one kilobyte
 when encoded -- roughly one gigabyte. It is often desirable to process such a dataset incrementally
@@ -32,9 +32,9 @@ without having to first read all of it before beginning to produce results.
 Performance
 -----------
-Here is a chart comparing the speed of the different alternatives to parse a multiline JSON.
-The simdjson library provides a threaded and non-threaded parse_many() implementation. As the
-figure below shows, if you can, use threads, but if you can't, it's still pretty fast!
+The following is a chart comparing the speed of the different alternatives to parse a multiline JSON.
+The simdjson library provides a threaded and non-threaded `parse_many()` implementation. As the
+figure below shows, if you can, use threads, but if you cannot, the unthreaded mode is still fast!
 [![Chart.png](/doc/Multiline_JSON_Parse_Competition.png)](/doc/Multiline_JSON_Parse_Competition.png)
 How it works
@@ -71,7 +71,7 @@ allocate enough memory so that all the documents can fit. This value is what we
 As of right now, we need to manually specify a value for this batch size, it has to be at least as
 big as the biggest document in your file, but not too big so that it submerges the cached memory.
 The bigger the batch size, the fewer we need to make allocations. We found that 1MB is somewhat a
-sweet spot for now.
+sweet spot.
 1. When the user calls `parse_many`, we return a `document_stream` which the user can iterate over
 to receive parsed documents.