diff --git a/README.md b/README.md
index 83f3eb5f..2a8a4c59 100644
--- a/README.md
+++ b/README.md
@@ -154,6 +154,29 @@ if( ! pj.is_valid() ) {
 As needed, the `json_parse` and `build_parsed_json` functions copy the input data to a
 temporary buffer readable up to SIMDJSON_PADDING bytes beyond the end of the data.
 
+## JSON streaming
+
+**The API and detailed documentation can be found [here](doc/JsonStream.md).**
+
+Here is a simple example, using the single-header simdjson:
+```cpp
+#include "simdjson.h"
+#include "simdjson.cpp"
+
+int parse_file(const char *filename) {
+    simdjson::padded_string p = simdjson::get_corpus(filename);
+    simdjson::ParsedJson pj;
+    simdjson::JsonStream js{p};
+    int parse_res = simdjson::SUCCESS_AND_HAS_MORE;
+
+    while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
+        parse_res = js.json_parse(pj);
+
+        // Do something with pj...
+    }
+    return parse_res;
+}
+```
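+
+If the loop above exits with anything other than `simdjson::SUCCESS`, parsing failed somewhere
+in the stream. A minimal check, reusing `parse_res` from the example and the documented
+`simdjson::error_message` helper, might look like:
+```cpp
+if (parse_res != simdjson::SUCCESS) {
+    // error_message converts a simdjson error code into a human-readable string
+    std::cerr << simdjson::error_message(parse_res) << std::endl;
+}
+```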
+
 ## Usage: easy single-header version
 
 See the "singleheader" repository for a single header version. See the included
diff --git a/doc/JsonStream.md b/doc/JsonStream.md
new file mode 100644
index 00000000..81a31405
--- /dev/null
+++ b/doc/JsonStream.md
@@ -0,0 +1,220 @@
+# JsonStream
+An interface providing features to work with files or streams containing multiple JSON documents,
+as fast and conveniently as possible.
+## Contents
+- [Motivations](#motivations)
+- [Performance](#performance)
+- [How it works](#how-it-works)
+- [Support](#support)
+- [API](#api)
+- [Concurrency mode](#concurrency-mode)
+- [Example](#example)
+- [Use cases](#use-cases)
+
+## Motivations
+The main motivation for this piece of software is to achieve maximum speed and offer a
+better quality of life when parsing files containing multiple JSON documents.
+
+The JavaScript Object Notation (JSON) [RFC7159](https://tools.ietf.org/html/rfc7159) is a very handy
+serialization format. However, when serializing a large sequence of
+values as an array, or a possibly indeterminate-length or never-ending
+sequence of values, JSON becomes difficult to work with.
+
+Consider a sequence of one million values, each possibly one kilobyte
+when encoded -- roughly one gigabyte. It is often desirable to process such a dataset
+incrementally, without having to read all of it before beginning to produce results.
+## Performance
+Here is a chart comparing the speed of the different alternatives for parsing a multiline JSON file.
+JsonStream provides both a threaded and a non-threaded implementation. As the figure below shows,
+use threads if you can; if you cannot, it is still pretty fast!
+[![Chart.png](/doc/Multiline_JSON_Parse_Competition.png)](/doc/Multiline_JSON_Parse_Competition.png)
+## How it works
+- **Context:**
+  Parsing in simdjson is divided into two stages. First, in stage 1, we parse the document, find
+  all the structural indexes (`{`, `}`, `]`, `[`, `,`, `"`, ...) and validate UTF-8. Then, in
+  stage 2, we go through the document again and build the tape using the structural indexes found
+  during stage 1. Although stage 1 finds the structural indexes, it has no knowledge of the
+  structure of the document, nor does it know whether it parsed a valid document, multiple
+  documents, or even whether the document is complete.
+
+  Prior to JsonStream, most people who had to parse a multiline JSON file would proceed by reading
+  the file line by line, using a utility function like `std::getline` or equivalent, and would
+  then use the function `simdjson::json_parse` on each of those lines (a sketch of this pattern
+  appears after this list). From a performance point of view, this process is highly inefficient:
+  it requires a lot of unnecessary memory allocation, and it relies on the `getline` function,
+  which is fundamentally slow, slower than the act of parsing with simdjson [(more on this here)](https://lemire.me/blog/2019/06/18/how-fast-is-getline-in-c/).
+
+  Unlike the popular parser RapidJSON, our DOM does not require the buffer once the parsing job is
+  completed; the DOM and the buffer are completely independent. The drawback of this architecture
+  is that we need to allocate some additional memory to store our ParsedJson data for every
+  document inside a given file. Memory allocation can be slow and become a bottleneck, so we want
+  to minimize it as much as possible.
+- **Design:**
+  To minimize the number of allocations, we opted for a design where we create only one ParsedJson
+  object, allocate its memory once, and then recycle it for every document in a given file. But,
+  knowing that documents often vary widely in size, we need to make sure that we allocate enough
+  memory for all of them to fit. This value is what we call the batch size. As of right now, it
+  must be specified manually: it has to be at least as big as the biggest document in your file,
+  but not so big that it overwhelms the cached memory. The bigger the batch size, the fewer
+  allocations we need to make. We found that 1MB is something of a sweet spot for now.
+
+  We then proceed by running stage 1 on all the data that fits into this batch size. Every time
+  the user calls `JsonStream::json_parse`, we run stage 2 on this data until we detect that we
+  parsed a valid document, keep a reference to that point, and resume from it the next time
+  `JsonStream::json_parse` is called. We do so until we have consumed all the documents in the
+  current batch, then we load another batch, and repeat the process until we reach the end of the
+  file.
+
+  But how can we make use of threads? We found a pretty cool algorithm that allows us to quickly
+  identify the position of the last JSON document in a given batch. Knowing exactly where the
+  batch ends, we no longer need to wait for stage 2 to finish before loading a new batch: we
+  already know where the next one starts. Therefore, we can run stage 1 on the next batch
+  concurrently while the main thread is going through stage 2. Running stage 1 in a different
+  thread can, in the best cases, remove its cost almost entirely, replacing it with the overhead
+  of a thread, which is orders of magnitude cheaper. Ain't that awesome!
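+
+For contrast, the line-by-line pattern described in the **Context** bullet looks roughly like
+the sketch below (a simplified illustration, not part of the library; it assumes the classic
+`json_parse(buf, len, pj)` overload):
+```cpp
+#include <fstream>
+#include <string>
+#include "simdjson.h"
+#include "simdjson.cpp"
+
+void parse_line_by_line(const char *filename) {
+    std::ifstream in(filename);
+    std::string line;
+    while (std::getline(in, line)) {   // line-oriented I/O is fundamentally slow
+        simdjson::ParsedJson pj;       // a fresh ParsedJson, and allocation, per line
+        if (simdjson::json_parse(line.c_str(), line.size(), pj) != simdjson::SUCCESS) {
+            // handle the error...
+        }
+        // Do something with pj...
+    }
+}
+```
+Every iteration pays for I/O and allocation; JsonStream exists to amortize exactly these costs.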
+
+## Support
+Since we want to offer flexibility and not restrict ourselves to a specific file
+format, we support any file that contains any number of valid JSON documents, **separated by one
+or more characters considered whitespace** by the JSON spec. Anything that is
+not whitespace will be parsed as a JSON document and could lead to failure.
+
+Whitespace characters:
+- **Space**
+- **Linefeed**
+- **Carriage return**
+- **Horizontal tab**
+- **Nothing**
+
+Some official formats **(non-exhaustive list)**:
+- [Newline-Delimited JSON (NDJSON)](http://ndjson.org/)
+- [JSON lines (JSONL)](http://jsonlines.org/)
+- [RFC 7464](https://tools.ietf.org/html/rfc7464)
+- [More on Wikipedia...](https://en.wikipedia.org/wiki/JSON_streaming)
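+
+For instance, the following input is a single valid stream of four documents; the separators
+shown are a linefeed, spaces, and (between the last two documents) nothing at all. This is an
+illustrative input, not a file shipped with the library:
+```json
+{"a": 1}
+{"b": 2}   {"c": 3}{"d": 4}
+```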
+
+## API
+**Constructor**
+```cpp
+// Create a JsonStream object that can be used to parse sequentially the valid JSON
+// documents found in the buffer "buf".
+//
+// The batch_size must be at least as large as the biggest document in the file, but
+// not so large that it overwhelms the cached memory. We found that 1MB is something
+// of a sweet spot for now.
+//
+// The user is expected to call the json_parse method below to parse the next valid
+// JSON document found in the buffer. This method can and is expected to be called in
+// a loop.
+//
+// Various methods are offered to keep track of the status, like get_current_buffer_loc,
+// get_n_parsed_docs, get_n_bytes_parsed, etc.
+JsonStream(const char *buf, size_t len, size_t batch_size = 1000000);
+JsonStream(const std::string &s, size_t batch_size = 1000000);
+```
+**Methods**
+
+- **Parsing**
+  ```cpp
+  // Parse the next document found in the buffer previously given to JsonStream.
+  //
+  // You do NOT need to pre-allocate ParsedJson. This function takes care of
+  // pre-allocating a capacity defined by the batch_size chosen when creating the
+  // JsonStream object.
+  //
+  // The function returns simdjson::SUCCESS_AND_HAS_MORE (an integer = 1) in case of
+  // success and indicates that the buffer still contains more data to be parsed,
+  // meaning this function can be called again to return the next JSON document
+  // after this one.
+  //
+  // The function returns simdjson::SUCCESS (an integer = 0) in case of success and
+  // indicates that the buffer has successfully been parsed to the end. Every
+  // document it contained has been parsed without error.
+  //
+  // The function returns an error code from simdjson/simdjson.h in case of failure,
+  // such as simdjson::CAPACITY, simdjson::MEMALLOC, simdjson::DEPTH_ERROR and so
+  // forth; the simdjson::error_message function converts these error codes into a
+  // string.
+  //
+  // You can also check validity by calling pj.is_valid(). The same ParsedJson can
+  // and should be reused for the other documents in the buffer.
+  int json_parse(ParsedJson &pj);
+  ```
+- **Buffer**
+  ```cpp
+  // Sets a new buffer for this JsonStream. Will also reinitialize all the internal
+  // variables, which acts as a reset: you get a new JsonStream without having to
+  // initialize one again.
+  void set_new_buffer(const char *buf, size_t len);
+  void set_new_buffer(const std::string &s);
+  ```
+- **Utility**
+  ```cpp
+  // Returns the location (index) of where the next document should be in the
+  // buffer. Can be used for debugging; it tells the user the position of the end of
+  // the last valid JSON document parsed.
+  size_t get_current_buffer_loc() const;
+
+  // Returns the total number of complete documents parsed by the JsonStream,
+  // in the current buffer, at the given time.
+  size_t get_n_parsed_docs() const;
+
+  // Returns the total amount of data (in bytes) parsed by the JsonStream,
+  // in the current buffer, at the given time.
+  size_t get_n_bytes_parsed() const;
+  ```
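+
+The sketch below ties these methods together: it drains one buffer, prints the progress
+counters, then reuses the same JsonStream on a second buffer via `set_new_buffer`. The helper
+name and its two input strings are hypothetical; the calls are the ones documented above:
+```cpp
+#include <iostream>
+#include <string>
+#include "simdjson.h"
+#include "simdjson.cpp"
+
+void parse_both(const std::string &first, const std::string &second) {
+    simdjson::ParsedJson pj;
+    simdjson::JsonStream js{first};
+
+    int res = simdjson::SUCCESS_AND_HAS_MORE;
+    while (res == simdjson::SUCCESS_AND_HAS_MORE) {
+        res = js.json_parse(pj);       // reuse the same ParsedJson for every document
+    }
+    std::cout << js.get_n_parsed_docs() << " documents ("
+              << js.get_n_bytes_parsed() << " bytes) parsed" << std::endl;
+
+    js.set_new_buffer(second);         // acts as a reset: same object, fresh stream
+    res = simdjson::SUCCESS_AND_HAS_MORE;
+    while (res == simdjson::SUCCESS_AND_HAS_MORE) {
+        res = js.json_parse(pj);
+    }
+}
+```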
+
+## Concurrency mode
+To use concurrency mode, tell the compiler to enable threading:
+- **GCC and Clang**: Compile with the `-pthread` flag.
+- **Visual C++ (MSVC)**: Usually enabled by default. If not, add the `/MT` or `/MD` flag to the
+  compiler.
+
+**Note:** The JsonStream API remains the same whether you are using the threaded version or not.
+## Example
+
+Here is a simple example, using the single-header simdjson:
+```cpp
+#include "simdjson.h"
+#include "simdjson.cpp"
+
+int parse_file(const char *filename) {
+    simdjson::padded_string p = simdjson::get_corpus(filename);
+    simdjson::ParsedJson pj;
+    simdjson::JsonStream js{p};
+    int parse_res = simdjson::SUCCESS_AND_HAS_MORE;
+
+    while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
+        parse_res = js.json_parse(pj);
+
+        // Do something with pj...
+    }
+    return parse_res;
+}
+```
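+
+To make the loop do something concrete, the sketch below prints every document back out. It
+assumes the `ParsedJson::print_json` debugging helper; treat it as an illustration rather than
+part of the JsonStream API:
+```cpp
+while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
+    parse_res = js.json_parse(pj);
+    if (pj.is_valid()) {
+        pj.print_json(std::cout); // re-serialize the current document to stdout
+    }
+}
+```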
+## Use cases
+From [jsonlines.org](http://jsonlines.org/examples/):
+
+- **Better than CSV**
+  ```json
+  ["Name", "Session", "Score", "Completed"]
+  ["Gilbert", "2013", 24, true]
+  ["Alexa", "2013", 29, true]
+  ["May", "2012B", 14, false]
+  ["Deloise", "2012A", 19, true]
+  ```
+  CSV seems so easy that many programmers have written code to generate it themselves, and almost
+  every implementation is different. Handling broken CSV files is a common and frustrating task.
+  CSV has no standard encoding, no standard column separator and multiple character escaping
+  standards. String is the only type supported for cell values, so some programs attempt to guess
+  the correct types.
+
+  JSON Lines handles tabular data cleanly and without ambiguity. Cells may use the standard JSON
+  types.
+
+  The biggest missing piece is an import/export filter for popular spreadsheet programs so that
+  non-programmers can use this format.
+
+- **Easy Nested Data**
+  ```json
+  {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
+  {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
+  {"name": "May", "wins": []}
+  {"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
+  ```
+  JSON Lines' biggest strength is in handling lots of similar nested data structures. One .jsonl
+  file is easier to work with than a directory full of XML files.
diff --git a/doc/Multiline_JSON_Parse_Competition.png b/doc/Multiline_JSON_Parse_Competition.png
new file mode 100644
index 00000000..66308804
Binary files /dev/null and b/doc/Multiline_JSON_Parse_Competition.png differ
diff --git a/include/simdjson/jsonstream.h b/include/simdjson/jsonstream.h
index 1eb9907f..7dd62f4f 100644
--- a/include/simdjson/jsonstream.h
+++ b/include/simdjson/jsonstream.h
@@ -9,23 +9,111 @@
 #include "simdjson/simdjson.h"
 
 namespace simdjson {
-
+    /*************************************************************************************
+     * The main motivation for this piece of software is to achieve maximum speed and
+     * offer a good quality of life while parsing files containing multiple JSON
+     * documents.
+     *
+     * Since we want to offer flexibility and not restrict ourselves to a specific file
+     * format, we support any file that contains any valid JSON documents separated by
+     * one or more characters considered whitespace by the JSON spec.
+     * Namely: space, nothing, linefeed, carriage return, horizontal tab.
+     * Anything that is not whitespace will be parsed as a JSON document and could lead
+     * to failure.
+     *
+     * To offer maximum parsing speed, our implementation processes the data inside the
+     * buffer in batches, whose size is defined by the parameter "batch_size".
+     * By loading data in batches, we can optimize the time spent allocating data in the
+     * ParsedJson and can also open the possibility of multi-threading.
+     * The batch_size must be at least as large as the biggest document in the file, but
+     * not so large that it overwhelms the cached memory. We found that 1MB is
+     * something of a sweet spot for now. Eventually, this batch_size could be fully
+     * automated and be optimal at all times.
+     ************************************************************************************/
     class JsonStream {
     public:
+        /* Create a JsonStream object that can be used to parse sequentially the valid
+         * JSON documents found in the buffer "buf".
+         *
+         * The batch_size must be at least as large as the biggest document in the file,
+         * but not so large that it overwhelms the cached memory. We found that 1MB is
+         * something of a sweet spot for now.
+         *
+         * The user is expected to call the following json_parse method to parse the next
+         * valid JSON document found in the buffer. This method can and is expected to be
+         * called in a loop.
+         *
+         * Various methods are offered to keep track of the status, like
+         * get_current_buffer_loc, get_n_parsed_docs, get_n_bytes_parsed, etc.
+         */
        JsonStream(const char *buf, size_t len, size_t batch_size = 1000000);
 
+        /* Create a JsonStream object that can be used to parse sequentially the valid
+         * JSON documents found in the string "s".
+         *
+         * The batch_size must be at least as large as the biggest document in the file,
+         * but not so large that it overwhelms the cached memory. We found that 1MB is
+         * something of a sweet spot for now.
+         *
+         * The user is expected to call the following json_parse method to parse the next
+         * valid JSON document found in the buffer. This method can and is expected to be
+         * called in a loop.
+         *
+         * Various methods are offered to keep track of the status, like
+         * get_current_buffer_loc, get_n_parsed_docs, get_n_bytes_parsed, etc.
+         */
        JsonStream(const std::string &s, size_t batch_size = 1000000)
            : JsonStream(s.data(), s.size(), batch_size) {};
 
+        /* Parse the next document found in the buffer previously given to JsonStream.
+         *
+         * The content should be a valid JSON document encoded as UTF-8. If there is a
+         * UTF-8 BOM, the caller is responsible for omitting it; UTF-8 BOMs are
+         * discouraged.
+         *
+         * You do NOT need to pre-allocate ParsedJson. This function takes care of
+         * pre-allocating a capacity defined by the batch_size chosen when creating the
+         * JsonStream object.
+         *
+         * The function returns simdjson::SUCCESS_AND_HAS_MORE (an integer = 1) in case
+         * of success and indicates that the buffer still contains more data to be
+         * parsed, meaning this function can be called again to return the next JSON
+         * document after this one.
+         *
+         * The function returns simdjson::SUCCESS (an integer = 0) in case of success
+         * and indicates that the buffer has successfully been parsed to the end.
+         * Every document it contained has been parsed without error.
+         *
+         * The function returns an error code from simdjson/simdjson.h in case of
+         * failure, such as simdjson::CAPACITY, simdjson::MEMALLOC,
+         * simdjson::DEPTH_ERROR and so forth; the simdjson::error_message function
+         * converts these error codes into a string.
+         *
+         * You can also check validity by calling pj.is_valid(). The same ParsedJson can
+         * and should be reused for the other documents in the buffer. */
        int json_parse(ParsedJson &pj);
 
+        /* Sets a new buffer for this JsonStream. Will also reinitialize all the internal
+         * variables, which acts as a reset: you get a new JsonStream without having to
+         * initialize one again. */
        void set_new_buffer(const char *buf, size_t len);
 
+        /* Sets a new buffer for this JsonStream. Will also reinitialize all the internal
+         * variables, which acts as a reset: you get a new JsonStream without having to
+         * initialize one again. */
        void set_new_buffer(const std::string &s) { set_new_buffer(s.data(), s.size()); }
 
+        /* Returns the location (index) of where the next document should be in the
+         * buffer. Can be used for debugging; it tells the user the position of the end
+         * of the last valid JSON document parsed. */
        size_t get_current_buffer_loc() const;
 
+        /* Returns the total number of complete documents parsed by the JsonStream,
+         * in the current buffer, at the given time. */
        size_t get_n_parsed_docs() const;
 
+        /* Returns the total amount of data (in bytes) parsed by the JsonStream,
+         * in the current buffer, at the given time. */
        size_t get_n_bytes_parsed() const;
 
    private:
@@ -37,7 +125,6 @@ namespace simdjson {
        bool load_next_batch{true};
        size_t current_buffer_loc{0};
        size_t last_json_buffer_loc{0};
-        size_t thread_current_buffer_loc{0};
        size_t n_parsed_docs{0};
        size_t n_bytes_parsed{0};
 
@@ -45,7 +132,29 @@ namespace simdjson {
        simdjson::ParsedJson pj_thread;
 
 #ifdef SIMDJSON_THREADS_ENABLED
-        size_t find_last_json(const ParsedJson &pj);
+        /* This algorithm is used to quickly identify the buffer position of
+         * the last JSON document inside the current batch.
+         *
+         * It does its work by finding the last pair of structural characters
+         * that represent the end of one document followed by the start of the next.
+         *
+         * Simply put, we iterate over the structural characters, starting from
+         * the end. We consider that we found the end of a JSON document when the
+         * first element of the pair is NOT one of these characters: '{' '[' ';' ','
+         * and when the second element is NOT one of these characters: '}' ']' ';' ','.
+         *
+         * This simple comparison works most of the time, but it does not cover cases
+         * where the batch's structural indexes contain a perfect amount of documents.
+         * In such a case, we do not have access to the structural index that follows
+         * the last document; therefore, we do not have access to the second element of
+         * the pair, which means that we cannot identify the last document. To fix this
+         * issue, we keep a count of the opening and closing curly/square brackets found
+         * while searching for the pair. When we find a pair AND the count of opening
+         * and closing curly/square brackets is the same, we know that we just passed a
+         * complete document, and therefore the last JSON buffer location is the end of
+         * the batch.
+         */
+        size_t find_last_json_buf_loc(const ParsedJson &pj);
 
 #endif
    };
diff --git a/src/jsonstream.cpp b/src/jsonstream.cpp
index d6c5dee0..8fe73129 100755
--- a/src/jsonstream.cpp
+++ b/src/jsonstream.cpp
@@ -75,7 +75,7 @@ int JsonStream::json_parse(ParsedJson &pj) {
    }
 
    if(_len-_batch_size > 0) {
-        last_json_buffer_loc = find_last_json(pj);
+        last_json_buffer_loc = find_last_json_buf_loc(pj);
        _batch_size = std::min(_batch_size, _len - last_json_buffer_loc);
        if(_batch_size>0)
            stage_1_thread = std::thread(
@@ -136,7 +136,7 @@ int JsonStream::json_parse(ParsedJson &pj) {
 }
 
 #ifdef SIMDJSON_THREADS_ENABLED
-size_t JsonStream::find_last_json(const ParsedJson &pj) {
+size_t JsonStream::find_last_json_buf_loc(const ParsedJson &pj) {
    auto last_i = pj.n_structural_indexes - 1;
    if (pj.structural_indexes[last_i] == _batch_size)
        last_i = pj.n_structural_indexes - 2;