JsonStream documentation (#381)
* Adding Multiline JSON competition chart to doc
* Completing the comments for JsonStream
* Adding a page for JsonStream's documentation
parent 9b6377fd80
commit f163155929
README.md | 23
@@ -154,6 +154,29 @@ if( ! pj.is_valid() ) {
As needed, the `json_parse` and `build_parsed_json` functions copy the input data to a temporary buffer readable up to SIMDJSON_PADDING bytes beyond the end of the data.

## JSON streaming

**API and detailed documentation found [here](doc/JsonStream.md).**

Here is a simple example, using the single-header version of simdjson:
```cpp
#include "simdjson.h"
#include "simdjson.cpp"

int parse_file(const char *filename) {
    simdjson::padded_string p = simdjson::get_corpus(filename);
    simdjson::ParsedJson pj;
    simdjson::JsonStream js{p};
    int parse_res = simdjson::SUCCESS_AND_HAS_MORE;

    while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
        parse_res = js.json_parse(pj);

        // Do something with pj...
    }

    return parse_res; // simdjson::SUCCESS (0) when the whole buffer parsed cleanly
}
```

## Usage: easy single-header version

See the "singleheader" repository for a single header version. See the included

doc/JsonStream.md
@@ -0,0 +1,220 @@
# JsonStream

An interface providing features to work with files or streams containing multiple JSON documents.
As fast and convenient as possible.

## Contents

- [Motivations](#Motivations)
- [Performance](#Performance)
- [How it works](#how-it-works)
- [Support](#Support)
- [API](#API)
- [Concurrency mode](#concurrency-mode)
- [Example](#Example)
- [Use cases](#use-cases)
## Motivations

The main motivation for this piece of software is to achieve maximum speed and offer
a better quality of life when parsing files containing multiple JSON documents.

The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a very handy
serialization format. However, when serializing a large sequence of
values as an array, or a possibly indeterminate-length or never-ending
sequence of values, JSON becomes difficult to work with.

Consider a sequence of one million values, each possibly one kilobyte
when encoded -- roughly one gigabyte. It is often desirable to process such a dataset incrementally
without having to first read all of it before beginning to produce results.
## Performance

Here is a chart comparing the speed of the different alternatives for parsing a multiline JSON file.
JsonStream provides a threaded and a non-threaded implementation. As the figure below shows,
use threads if you can; if you cannot, it is still pretty fast!

![Multiline JSON parse competition](/doc/Multiline_JSON_Parse_Competition.png)
## How it works

- **Context:**

Parsing in simdjson is divided into two stages. First, in stage 1, we parse the document and find all
the structural indexes (`{`, `}`, `]`, `[`, `,`, `"`, ...) and validate UTF-8. Then, in stage 2, we go through the
document again and build the tape using the structural indexes found during stage 1. Although stage 1 finds the
structural indexes, it has no knowledge of the structure of the document, nor does it know whether it parsed a
valid document, multiple documents, or even whether the document is complete.

Prior to JsonStream, most people who had to parse a multiline JSON file would proceed by reading the file
line by line, using a utility function like `std::getline` or equivalent, and would then use the function
`simdjson::json_parse` on each of those lines. From a performance point of view, this process is highly inefficient:
it requires a lot of unnecessary memory allocation and relies on the `getline` function,
which is fundamentally slow, slower than the act of parsing with simdjson [(more on this here)](https://lemire.me/blog/2019/06/18/how-fast-is-getline-in-c/).

Unlike the popular parser RapidJSON, our DOM does not require the buffer once the parsing job is completed;
the DOM and the buffer are completely independent. The drawback of this architecture is that we need to allocate
some additional memory to store our ParsedJson data, for every document inside a given file. Memory allocation can be
slow and can become a bottleneck; therefore, we want to minimize it as much as possible.
- **Design:**

To minimize the number of allocations, we opted for a design where we create only one ParsedJson object,
allocate its memory once, and then recycle it for every document in a given file. But, knowing that
documents often vary greatly in size, we need to make sure that we allocate enough memory so that all the
documents can fit. This value is what we call the batch size. As of right now, we need to manually specify a value
for this batch size; it has to be at least as big as the biggest document in your file, but not so big that it
overflows the cache. The bigger the batch size, the fewer allocations we need to make. We found that 1MB is
somewhat of a sweet spot for now.

We then proceed by running stage 1 on all the data that we can fit into this batch size. Then, every time the
user calls `JsonStream::json_parse`, we run stage 2 on this data until we detect that we parsed a valid
document, keep a reference to that point, and resume from it the next time `JsonStream::json_parse` is called.
We do so until we have consumed all the documents in the current batch, then we load another batch, and repeat
the process until we reach the end of the file.

But how can we make use of threads? We found a pretty cool algorithm that allows us to quickly identify the
position of the last JSON document in a given batch. Knowing exactly where the batch ends, we no longer need
stage 2 to finish before loading a new batch: we already know where to start the next one. Therefore, we can
run stage 1 on the next batch concurrently while the main thread goes through stage 2. Running stage 1 in a
different thread can, in the best cases, almost entirely remove its cost, replacing it with the overhead of a
thread, which is orders of magnitude cheaper. Ain't that awesome!
## Support

Since we want to offer flexibility and not restrict ourselves to a specific file
format, we support any file that contains any number of valid JSON documents, **separated by one
or more characters that are considered whitespace** by the JSON spec. Anything that is
not whitespace will be parsed as a JSON document and could lead to failure.

Whitespace Characters (an example input follows the list):

- **Space**
- **Linefeed**
- **Carriage return**
- **Horizontal tab**
- **Nothing** (documents may directly follow one another)
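For instance, the following input is valid: the documents are separated by a linefeed, a space, and nothing at all (a made-up sample for illustration):

```json
{"one": 1}
{"two": 2} {"three": 3}{"four": 4}
```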
Some official formats **(non-exhaustive list)**:

- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
- [JSON lines (JSONL)](http://jsonlines.org/)
- [RFC 7464](https://tools.ietf.org/html/rfc7464)
- [More on Wikipedia...](https://en.wikipedia.org/wiki/JSON_streaming)
## API

**Constructor**

```cpp
// Create a JsonStream object that can be used to sequentially parse the valid JSON
// documents found in the buffer "buf".
//
// The batch_size must be at least as large as the biggest document in the file, but
// not so large that it overflows the cache. We found that 1MB is somewhat of a sweet
// spot for now.
//
// The user is expected to call the following json_parse method to parse the next valid
// JSON document found in the buffer. This method can and is expected to be called in a
// loop.
//
// Various methods are offered to keep track of the status, like get_current_buffer_loc,
// get_n_parsed_docs, get_n_bytes_parsed, etc.
JsonStream(const char *buf, size_t len, size_t batch_size = 1000000);
JsonStream(const std::string &s, size_t batch_size = 1000000);
```
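For illustration, a minimal sketch using both constructors; the buffer content and the 2MB batch size are arbitrary example values:

```cpp
#include <string>
#include "simdjson.h"

int main() {
    std::string input = "{\"a\": 1} {\"b\": 2}";
    // Default batch size (1,000,000 bytes):
    simdjson::JsonStream js_default(input);
    // Explicit batch size; it must be at least as large as the biggest document:
    simdjson::JsonStream js_big(input.data(), input.size(), 2000000);
    return 0;
}
```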

**Methods**

- **Parsing**

```cpp
// Parse the next document found in the buffer previously given to JsonStream.
//
// You do NOT need to pre-allocate ParsedJson. This function takes care of
// pre-allocating a capacity defined by the batch_size defined when creating the
// JsonStream object.
//
// The function returns simdjson::SUCCESS_AND_HAS_MORE (an integer = 1) in case of
// success and indicates that the buffer still contains more data to be parsed,
// meaning this function can be called again to return the next JSON document
// after this one.
//
// The function returns simdjson::SUCCESS (an integer = 0) in case of success and
// indicates that the buffer has successfully been parsed to the end. Every
// document it contained has been parsed without error.
//
// The function returns an error code from simdjson/simdjson.h in case of failure,
// such as simdjson::CAPACITY, simdjson::MEMALLOC, simdjson::DEPTH_ERROR and so
// forth; the simdjson::error_message function converts these error codes into a
// string.
//
// You can also check validity by calling pj.is_valid(). The same ParsedJson can
// and should be reused for the other documents in the buffer.
int json_parse(ParsedJson &pj)
```
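A sketch of the return-code handling described above, assuming the single-header build and a placeholder file name:

```cpp
#include <iostream>
#include "simdjson.h"
#include "simdjson.cpp"

int main() {
    simdjson::padded_string p = simdjson::get_corpus("data.ndjson"); // placeholder path
    simdjson::ParsedJson pj;
    simdjson::JsonStream js{p};
    int res = simdjson::SUCCESS_AND_HAS_MORE;
    while (res == simdjson::SUCCESS_AND_HAS_MORE) {
        res = js.json_parse(pj);
        if (res != simdjson::SUCCESS && res != simdjson::SUCCESS_AND_HAS_MORE) {
            // Any other return value is an error code; convert it to a message.
            std::cerr << "parse error: " << simdjson::error_message(res) << std::endl;
            return 1;
        }
        // pj holds the current document here...
    }
    return 0;
}
```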

- **Buffer**

```cpp
// Sets a new buffer for this JsonStream. Also reinitializes all the internal
// variables, which acts as a reset: you get a fresh JsonStream without having
// to construct a new one.
void set_new_buffer(const char *buf, size_t len);
void set_new_buffer(const std::string &s)
```
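For example, a hypothetical pattern that reuses one stream (and one ParsedJson) across two buffers:

```cpp
#include <string>
#include "simdjson.h"

void parse_two_buffers(const std::string &first, const std::string &second) {
    simdjson::ParsedJson pj;
    simdjson::JsonStream js(first);
    int res = simdjson::SUCCESS_AND_HAS_MORE;
    while (res == simdjson::SUCCESS_AND_HAS_MORE) {
        res = js.json_parse(pj); // consume every document in the first buffer
    }
    // Reset the stream onto the second buffer instead of constructing a new JsonStream.
    js.set_new_buffer(second);
    res = simdjson::SUCCESS_AND_HAS_MORE;
    while (res == simdjson::SUCCESS_AND_HAS_MORE) {
        res = js.json_parse(pj); // consume every document in the second buffer
    }
}
```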

- **Utility**

```cpp
// Returns the location (index) of where the next document should be in the
// buffer. Can be used for debugging; it tells the user the position of the end
// of the last valid JSON document parsed.
size_t get_current_buffer_loc() const;

// Returns the total number of complete documents parsed by the JsonStream,
// in the current buffer, at the given time.
size_t get_n_parsed_docs() const;

// Returns the total amount of data (in bytes) parsed by the JsonStream,
// in the current buffer, at the given time.
size_t get_n_bytes_parsed() const;
```
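These can be combined into a simple progress report; a small sketch:

```cpp
#include <iostream>
#include "simdjson.h"

// Print how far a JsonStream has progressed in its current buffer.
void report_progress(const simdjson::JsonStream &js) {
    std::cout << "documents parsed: " << js.get_n_parsed_docs()
              << ", bytes parsed: " << js.get_n_bytes_parsed()
              << ", next document at index: " << js.get_current_buffer_loc()
              << std::endl;
}
```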

## Concurrency mode

To use concurrency mode, tell the compiler to enable threading:

- **GCC and Clang**: Compile with the `-pthread` flag.
- **Visual C++ (MSVC)**: Usually enabled by default. If not, add the `/MT` or `/MD` flag to the compiler.

**Note:** The JsonStream API remains the same whether you are using the threaded version or not.
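The implementation guards its threaded code path with the `SIMDJSON_THREADS_ENABLED` macro; a minimal sketch to check which path a given build uses (assuming, as the header suggests, that the build defines the macro when threading is enabled):

```cpp
#include <iostream>

int main() {
    // SIMDJSON_THREADS_ENABLED is expected to be defined by the build system
    // when the threaded version is compiled in.
#ifdef SIMDJSON_THREADS_ENABLED
    std::cout << "JsonStream: threaded implementation enabled" << std::endl;
#else
    std::cout << "JsonStream: single-threaded implementation" << std::endl;
#endif
    return 0;
}
```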
## Example

Here is a simple example, using the single-header version of simdjson:

```cpp
#include "simdjson.h"
#include "simdjson.cpp"

int parse_file(const char *filename) {
    simdjson::padded_string p = simdjson::get_corpus(filename);
    simdjson::ParsedJson pj;
    simdjson::JsonStream js{p};
    int parse_res = simdjson::SUCCESS_AND_HAS_MORE;

    while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
        parse_res = js.json_parse(pj);

        // Do something with pj...
    }

    return parse_res; // simdjson::SUCCESS (0) when the whole buffer parsed cleanly
}
```
## Use cases

From [jsonlines.org](http://jsonlines.org/examples/):

- **Better than CSV**

```json
["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]
```

CSV seems so easy that many programmers have written code to generate it themselves, and almost every implementation is
different. Handling broken CSV files is a common and frustrating task. CSV has no standard encoding, no standard column
separator, and multiple character escaping standards. String is the only type supported for cell values, so some programs
attempt to guess the correct types.

JSON Lines handles tabular data cleanly and without ambiguity. Cells may use the standard JSON types.

The biggest missing piece is an import/export filter for popular spreadsheet programs so that non-programmers can use
this format.

- **Easy Nested Data**

```json
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
```

JSON Lines' biggest strength is in handling lots of similar nested data structures. One .jsonl file is easier to
work with than a directory full of XML files.
doc/Multiline_JSON_Parse_Competition.png (binary file added, 34 KiB, not shown)

@@ -9,23 +9,111 @@
#include "simdjson/simdjson.h"
|
||||
|
||||
namespace simdjson {
|
||||
|
||||
/*************************************************************************************
|
||||
* The main motivation for this piece of software is to achieve maximum speed and offer
|
||||
* good quality of life while parsing files containing multiple JSON documents.
|
||||
*
|
||||
* Since we want to offer flexibility and not restrict ourselves to a specific file
|
||||
* format, we support any file that contains any valid JSON documents separated by one
|
||||
* or more character that is considered a whitespace by the JSON spec.
|
||||
* Namely: space, nothing, linefeed, carriage return, horizontal tab.
|
||||
* Anything that is not whitespace will be parsed as a JSON document and could lead
|
||||
* to failure.
|
||||
*
|
||||
* To offer maximum parsing speed, our implementation processes the data inside the
|
||||
* buffer by batches and their size is defined by the parameter "batch_size".
|
||||
* By loading data in batches, we can optimize the time spent allocating data in the
|
||||
* ParsedJson and can also open the possibility of multi-threading.
|
||||
* The batch_size must be at least as large as the biggest document in the file, but
|
||||
* not too large in order to submerge the chached memory. We found that 1MB is
|
||||
* somewhat a sweet spot for now. Eventually, this batch_size could be fully
|
||||
* automated and be optimal at all times.
|
||||
************************************************************************************/
|
||||
class JsonStream {
|
||||
public:
|
||||
    /* Create a JsonStream object that can be used to sequentially parse the valid
     * JSON documents found in the buffer "buf".
     *
     * The batch_size must be at least as large as the biggest document in the file,
     * but not so large that it overflows the cache. We found that 1MB is somewhat of
     * a sweet spot for now.
     *
     * The user is expected to call the following json_parse method to parse the next
     * valid JSON document found in the buffer. This method can and is expected to be
     * called in a loop.
     *
     * Various methods are offered to keep track of the status, like get_current_buffer_loc,
     * get_n_parsed_docs, get_n_bytes_parsed, etc.
     */
    JsonStream(const char *buf, size_t len, size_t batch_size = 1000000);

    /* Create a JsonStream object that can be used to sequentially parse the valid
     * JSON documents found in the buffer "buf".
     *
     * The batch_size must be at least as large as the biggest document in the file,
     * but not so large that it overflows the cache. We found that 1MB is somewhat of
     * a sweet spot for now.
     *
     * The user is expected to call the following json_parse method to parse the next
     * valid JSON document found in the buffer. This method can and is expected to be
     * called in a loop.
     *
     * Various methods are offered to keep track of the status, like get_current_buffer_loc,
     * get_n_parsed_docs, get_n_bytes_parsed, etc.
     */
    JsonStream(const std::string &s, size_t batch_size = 1000000) : JsonStream(s.data(), s.size(), batch_size) {};

    /* Parse the next document found in the buffer previously given to JsonStream.
     *
     * The content should be a valid JSON document encoded as UTF-8. If there is a
     * UTF-8 BOM, the caller is responsible for omitting it; UTF-8 BOMs are
     * discouraged.
     *
     * You do NOT need to pre-allocate ParsedJson. This function takes care of
     * pre-allocating a capacity defined by the batch_size defined when creating the
     * JsonStream object.
     *
     * The function returns simdjson::SUCCESS_AND_HAS_MORE (an integer = 1) in case
     * of success and indicates that the buffer still contains more data to be parsed,
     * meaning this function can be called again to return the next JSON document
     * after this one.
     *
     * The function returns simdjson::SUCCESS (an integer = 0) in case of success
     * and indicates that the buffer has successfully been parsed to the end.
     * Every document it contained has been parsed without error.
     *
     * The function returns an error code from simdjson/simdjson.h in case of failure,
     * such as simdjson::CAPACITY, simdjson::MEMALLOC, simdjson::DEPTH_ERROR and so forth;
     * the simdjson::error_message function converts these error codes into a
     * string.
     *
     * You can also check validity by calling pj.is_valid(). The same ParsedJson can
     * and should be reused for the other documents in the buffer. */
    int json_parse(ParsedJson &pj);

    /* Sets a new buffer for this JsonStream. Also reinitializes all the internal
     * variables, which acts as a reset: a new JsonStream without having to
     * initialize one again. */
    void set_new_buffer(const char *buf, size_t len);

    /* Sets a new buffer for this JsonStream. Also reinitializes all the internal
     * variables, which is basically a reset: a new JsonStream without having to
     * initialize one again. */
    void set_new_buffer(const std::string &s) { set_new_buffer(s.data(), s.size()); }

    /* Returns the location (index) of where the next document should be in the buffer.
     * Can be used for debugging; it tells the user the position of the end of the last
     * valid JSON document parsed. */
    size_t get_current_buffer_loc() const;

    /* Returns the total number of complete documents parsed by the JsonStream,
     * in the current buffer, at the given time. */
    size_t get_n_parsed_docs() const;

    /* Returns the total amount of data (in bytes) parsed by the JsonStream,
     * in the current buffer, at the given time. */
    size_t get_n_bytes_parsed() const;

private:

@@ -37,7 +125,6 @@ namespace simdjson {
    bool load_next_batch{true};
    size_t current_buffer_loc{0};
    size_t last_json_buffer_loc{0};
    size_t thread_current_buffer_loc{0};
    size_t n_parsed_docs{0};
    size_t n_bytes_parsed{0};

@@ -45,7 +132,29 @@ namespace simdjson {
    simdjson::ParsedJson pj_thread;

#ifdef SIMDJSON_THREADS_ENABLED
    size_t find_last_json(const ParsedJson &pj);
    /* This algorithm is used to quickly identify the buffer position of
     * the last JSON document inside the current batch.
     *
     * It does its work by finding the last pair of structural characters
     * that represent the end of one document followed by the start of the next.
     *
     * Simply put, we iterate over the structural characters, starting from
     * the end. We consider that we found the end of a JSON document when the
     * first element of the pair is NOT one of these characters: '{' '[' ';' ','
     * and when the second element is NOT one of these characters: '}' ']' ';' ','.
     *
     * This simple comparison works most of the time, but it does not cover cases
     * where the batch's structural indexes contain exactly a whole number of
     * documents. In such a case, we do not have access to the structural index
     * which follows the last document; therefore, we do not have access to the
     * second element in the pair, which means that we cannot identify the last
     * document. To fix this issue, we keep a count of the open and closed
     * curly/square braces found while searching for the pair. When we find a
     * pair AND the count of open and closed curly/square braces is the same,
     * we know that we just passed a complete document; therefore, the last json
     * buffer location is the end of the batch.
     */
    size_t find_last_json_buf_loc(const ParsedJson &pj);
#endif
};

@@ -75,7 +75,7 @@ int JsonStream::json_parse(ParsedJson &pj) {
    }

    if(_len-_batch_size > 0) {
        last_json_buffer_loc = find_last_json(pj);
        last_json_buffer_loc = find_last_json_buf_loc(pj);
        _batch_size = std::min(_batch_size, _len - last_json_buffer_loc);
        if(_batch_size>0)
            stage_1_thread = std::thread(

@@ -136,7 +136,7 @@ int JsonStream::json_parse(ParsedJson &pj) {
    }

#ifdef SIMDJSON_THREADS_ENABLED
size_t JsonStream::find_last_json(const ParsedJson &pj) {
size_t JsonStream::find_last_json_buf_loc(const ParsedJson &pj) {
    auto last_i = pj.n_structural_indexes - 1;
    if (pj.structural_indexes[last_i] == _batch_size)
        last_i = pj.n_structural_indexes - 2;