parse_many
==========

An interface providing features to work with files or streams containing multiple small JSON documents. As fast and convenient as possible.

Contents
--------

- [Motivation](#motivation)
- [Performance](#performance)
- [How it works](#how-it-works)
- [Support](#support)
- [API](#api)
- [Use cases](#use-cases)
- [Tracking your position](#tracking-your-position)
- [Incomplete streams](#incomplete-streams)

Motivation
----------

The main motivation for this piece of software is to achieve maximum speed and offer a better quality of life when parsing files containing multiple small JSON documents.

The JavaScript Object Notation (JSON) [RFC 7159](https://tools.ietf.org/html/rfc7159) is a handy serialization format. However, when serializing a large sequence of values as an array, or a possibly indeterminate-length or never-ending sequence of values, JSON may be inconvenient.

Consider a sequence of one million values, each possibly one kilobyte when encoded -- roughly one gigabyte. It is often desirable to process such a dataset incrementally, without having to read all of it before beginning to produce results.

Performance
-----------

The following chart compares the speed of different alternatives for parsing a multiline JSON file. The simdjson library provides a threaded and a non-threaded `parse_many()` implementation. As the figure below shows, if you can, use threads; but if you cannot, the unthreaded mode is still fast!

[![Chart.png](/doc/Multiline_JSON_Parse_Competition.png)](/doc/Multiline_JSON_Parse_Competition.png)

How it works
------------

### Context

Parsing in simdjson is divided into two stages. First, in stage 1, we parse the document, find all the structural indexes (`{`, `}`, `]`, `[`, `,`, `"`, ...), and validate UTF-8. Then, in stage 2, we go through the document again and build the tape using the structural indexes found during stage 1. Although stage 1 finds the structural indexes, it has no knowledge of the structure of the document, nor does it know whether it parsed a valid document, multiple documents, or even whether the document is complete.

Prior to parse_many, most people who had to parse a multiline JSON file would read the file line by line, using a utility function like `std::getline` or an equivalent, and would then call `parse` on each of those lines. From a performance point of view, this process is highly inefficient: it requires a lot of unnecessary memory allocation, and it relies on the `getline` function, which is fundamentally slow -- slower than the act of parsing with simdjson [(more on this here)](https://lemire.me/blog/2019/06/18/how-fast-is-getline-in-c/).

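For illustration, here is a minimal sketch of that traditional line-by-line pattern (the file name is hypothetical); every iteration pays for one `getline` and one separate `parse` call:

```C++
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

#include "simdjson.h"

int main() {
  std::ifstream file("docs.ndjson"); // hypothetical file: one JSON document per line
  simdjson::dom::parser parser;
  std::string line;
  while (std::getline(file, line)) { // fundamentally slow line-by-line reading
    simdjson::dom::element doc;
    auto error = parser.parse(line).get(doc); // one separate parse call per line
    if (error) {
      std::cerr << error << std::endl;
      return EXIT_FAILURE;
    }
    std::cout << doc << std::endl;
  }
  return EXIT_SUCCESS;
}
```
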
Unlike the popular parser RapidJSON, our DOM does not require the input buffer once the parsing job is completed: the DOM and the buffer are completely independent. The drawback of this architecture is that we need to allocate some additional memory to store our ParsedJson data for every document inside a given file. Memory allocation can be slow and become a bottleneck, so we want to minimize it as much as possible.

### Design

To keep the number of memory allocations to a minimum, we opted for a design where we create only one parser object, allocate its memory once, and then recycle it for every document in a given file. But, knowing that documents often vary widely in size, we need to make sure that we allocate enough memory for all of them to fit. This value is what we call the batch size. As of right now, we need to specify a value for this batch size manually; it has to be at least as big as the biggest document in your file, but not so big that it overflows the CPU cache. The bigger the batch size, the fewer allocations we need to make. We found that 1 MB is a sweet spot. The process works as follows (a short usage sketch follows the list):

1. When the user calls `parse_many`, we return a `document_stream`, which the user can iterate over to receive parsed documents.
2. We call stage 1 on the first batch_size bytes of JSON in the buffer, detecting the structural indexes for all documents in that batch.
3. We call stage 2 on the indexes, reading tokens until we reach the end of a valid document (i.e., a single array, object, string, boolean, number or null).
4. Each time the user calls `++` to read the next document, we call stage 2 to parse the next document where we left off.
5. When we reach the end of the batch, we call stage 1 on the next batch, starting from the end of the last document, and go to step 3.

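Here is a minimal sketch of the resulting usage pattern; the explicit one-megabyte batch size matches the default and is passed only for illustration:

```C++
#include <cstdlib>
#include <iostream>

#include "simdjson.h"

int main() {
  using namespace simdjson;
  // Three small documents in one buffer; parse_many indexes a whole
  // batch in stage 1, then hands out documents one by one via stage 2.
  auto json = R"({"k":1} {"k":2} {"k":3})"_padded;
  dom::parser parser; // allocated once, recycled for every document
  dom::document_stream stream;
  auto error = parser.parse_many(json, 1000000).get(stream); // 1 MB batch size
  if (error) {
    std::cerr << error << std::endl;
    return EXIT_FAILURE;
  }
  for (auto result : stream) { // each ++ runs stage 2 on the next document
    dom::element doc;
    if (!result.get(doc)) {
      std::cout << doc << std::endl;
    }
  }
  return EXIT_SUCCESS;
}
```
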
### Threads

But how can we make use of threads if they are available? We found a pretty cool algorithm that allows us to quickly identify the position of the last JSON document in a given batch. Knowing exactly where the batch ends, we no longer need to wait for stage 2 to finish before loading a new batch: we already know where the next batch starts. Therefore, we can run stage 1 on the next batch concurrently while the main thread is going through stage 2. Running stage 1 in a different thread can, in the best cases, remove its cost almost entirely, replacing it with the overhead of a thread, which is orders of magnitude cheaper. Ain't that awesome!

Thread support is only active if thread support is detected at build time, in which case the macro SIMDJSON_THREADS_ENABLED is set. Otherwise, the library runs in single-thread mode.

A `document_stream` instance uses at most two threads: there is a main thread and a worker thread. You should expect the main thread to be fully occupied while the worker thread is only partially busy (e.g., 80% of the time).

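Nothing is required from you to benefit from this: `parse_many` picks the threaded path automatically when available. A minimal compile-time check, as a sketch:

```C++
#include <iostream>

#include "simdjson.h"

int main() {
#ifdef SIMDJSON_THREADS_ENABLED
  // Stage 1 of the next batch can run on a worker thread.
  std::cout << "simdjson was built with thread support" << std::endl;
#else
  std::cout << "parse_many will run in single-thread mode" << std::endl;
#endif
}
```
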
Support
-------

Since we want to offer flexibility and not restrict ourselves to a specific file format, we support any file that contains any number of valid JSON documents, **separated by one or more characters that are considered whitespace** by the JSON spec. Anything that is not whitespace will be parsed as a JSON document and could lead to failure.

Whitespace characters (see the sketch after this list):

- **Space**
- **Linefeed**
- **Carriage return**
- **Horizontal tab**
- **Nothing** (documents may simply follow one another)

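For instance, here is a sketch of one buffer holding documents separated by a space, a linefeed, a horizontal tab, and nothing at all:

```C++
#include <cstdlib>
#include <iostream>

#include "simdjson.h"

int main() {
  using namespace simdjson;
  // Five documents: separated by a space, a linefeed, a tab,
  // and (between the last two objects) no separator at all.
  auto json = "[1,2,3] {\"a\":1}\n\"hello\"\t{\"b\":2}{\"c\":3}"_padded;
  dom::parser parser;
  dom::document_stream stream;
  if (parser.parse_many(json).get(stream)) { return EXIT_FAILURE; }
  size_t count = 0;
  for (auto result : stream) {
    if (!result.error()) { count++; }
  }
  std::cout << count << " documents" << std::endl; // expected: 5 documents
  return EXIT_SUCCESS;
}
```
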
Some official formats **(non-exhaustive list)**:

- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
- [JSON Lines (JSONL)](http://jsonlines.org/)
- [Record separator-delimited JSON (RFC 7464)](https://tools.ietf.org/html/rfc7464) <- Not supported by parse_many!
- [More on Wikipedia...](https://en.wikipedia.org/wiki/JSON_streaming)

API
---

See [basics.md](basics.md#newline-delimited-json-ndjson-and-json-lines) for an overview of the API.

## Use cases

From [jsonlines.org](http://jsonlines.org/examples/):

- **Better than CSV**

  ```json
  ["Name", "Session", "Score", "Completed"]
  ["Gilbert", "2013", 24, true]
  ["Alexa", "2013", 29, true]
  ["May", "2012B", 14, false]
  ["Deloise", "2012A", 19, true]
  ```

  CSV seems so easy that many programmers have written code to generate it themselves, and almost every implementation is different. Handling broken CSV files is a common and frustrating task. CSV has no standard encoding, no standard column separator, and multiple character-escaping standards. String is the only type supported for cell values, so some programs attempt to guess the correct types.

  JSON Lines handles tabular data cleanly and without ambiguity. Cells may use the standard JSON types.

  The biggest missing piece is an import/export filter for popular spreadsheet programs so that non-programmers can use this format.

- **Easy Nested Data**

  ```json
  {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
  {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
  {"name": "May", "wins": []}
  {"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
  ```

  JSON Lines' biggest strength is in handling lots of similar nested data structures. One .jsonl file is easier to work with than a directory full of XML files.

Tracking your position
----------------------

Some users would like to know where the document they parsed is located in the input array of bytes. You can do so by accessing the iterator directly and calling its `current_index()` method, which reports the location (in bytes) of the current document in the input stream.

Let us illustrate the idea with code (the `ASSERT_SUCCESS` helper comes from our test suite):

```C++
auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2,3] )"_padded;
simdjson::dom::parser parser;
simdjson::dom::document_stream stream;
ASSERT_SUCCESS( parser.parse_many(json).get(stream) );
auto i = stream.begin();
for(; i != stream.end(); ++i) {
  auto doc = *i;
  if(!doc.error()) {
    std::cout << "got full document at " << i.current_index() << std::endl;
  }
}
size_t index = i.current_index();
if(index != 38) {
  std::cerr << "Expected to stop after the three full documents " << std::endl;
  std::cerr << "index = " << index << std::endl;
  return false;
}
```

This code will print:

```
got full document at 0
got full document at 9
got full document at 29
```

The last call to `i.current_index()` returns the byte index 38, which is just beyond the last document.

Incomplete streams
------------------

Some users may need to work with truncated streams while tracking their location in the stream. The same code, with `current_index()`, will work. However, if the last block (by default 1 MB) terminates with an unclosed string, then no JSON document within this last block will validate. In particular, it means that if your input string is `[1,2,3] {"1":1,"2":3,"4":4} [1,2`, then no JSON document will be successfully parsed: the error `simdjson::UNCLOSED_STRING` will be reported (even for the first JSON document). It is then your responsibility to terminate the input, perhaps by appending the missing data at the end of the truncated string, or by copying the truncated data in front of the next chunk of input.

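As a sketch of that carry-over approach: recent simdjson releases expose `document_stream::truncated_bytes()`, which reports how many bytes at the end of the input were not turned into documents (check that your simdjson version provides it):

```C++
#include <cstdlib>
#include <iostream>

#include "simdjson.h"

int main() {
  using namespace simdjson;
  // Truncated input: the final array is missing its closing bracket.
  auto json = R"([1,2,3] {"1":1,"2":3,"4":4} [1,2)"_padded;
  dom::parser parser;
  dom::document_stream stream;
  if (parser.parse_many(json).get(stream)) { return EXIT_FAILURE; }
  for (auto result : stream) {
    if (result.error()) {
      // As described above, documents in a truncated batch may fail to validate.
      std::cerr << result.error() << std::endl;
    }
  }
  // Copy this many bytes from the end of the buffer in front of the
  // next chunk of input before calling parse_many again.
  std::cout << "unprocessed tail: " << stream.truncated_bytes()
            << " bytes" << std::endl;
  return EXIT_SUCCESS;
}
```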