Daniel Lemire c8706c66ec Solving some build issues 2018-12-05 21:33:32 -05:00
benchmark Solving some build issues 2018-12-05 21:33:32 -05:00
dependencies Added dependencies. 2018-11-20 16:43:22 -05:00
include/simdjson Solving some build issues 2018-12-05 21:33:32 -05:00
jsonchecker Added more thorough testing. 2018-10-23 20:19:33 -04:00
jsonexamples Added dependencies. 2018-11-20 16:43:22 -05:00
scripts Various cleaning steps. 2018-11-09 21:31:14 -05:00
src Cleaning. 2018-11-30 22:02:32 -05:00
tests Solving some build issues 2018-12-05 21:33:32 -05:00
tools More cleaning. 2018-11-30 21:31:05 -05:00
.gitmodules Added dependencies. 2018-11-20 16:43:22 -05:00
Makefile Solving some build issues 2018-12-05 21:33:32 -05:00
Notes.md Update Notes.md 2018-04-14 22:22:08 +10:00
README.md More cleaning. 2018-11-30 21:31:05 -05:00

README.md

simdjson : Parsing gigabytes of JSON per second

A C++ library to see how fast we can parse JSON with complete validation.

Goal: Speed up the parsing of JSON per se.

Code example

#include "simdjson/jsonparser.h"

// ...

const char * filename = ... //
std::string_view p = get_corpus(filename);
ParsedJson pj;
size_t maxdepth = 1024; // support documents with a nesting "depth" of up to 1024
pj.allocateCapacity(p.size(), maxdepth); // allocate memory for parsing up to p.size() bytes
bool is_ok = json_parse(p, pj); // do the parsing, return false on error
// parsing is done!
// pj can be reused with other json_parse calls.

Usage

Requirements: clang or gcc and make. A system like Linux or macOS is expected.

To test:

make
make test

To run benchmarks:

make parse
./parse jsonexamples/twitter.json

To run comparative benchmarks (with other parsers):

make parsingcompetition
./parsingcompetition jsonexamples/twitter.json

Limitations

To simplify the engineering, we make some assumptions.

  • We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16).
  • We assume AVX2 support, which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included.
  • We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though adding it should not be difficult.
  • We expect the input memory pointer to be padded (e.g., with spaces) so that it can be read entirely in blocks of 512 bits (a cache line). In practice, this means that users should allocate the memory where the JSON bytes are located using the allocate_aligned_buffer function or the equivalent. Of course, the data you want to process may sit in a buffer that does not have this padding. However, copying the data is relatively cheap (much cheaper than parsing JSON), and we can eventually remove this constraint.
  • The input string should be null-terminated.
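The padding and termination requirements above can be met by copying the input into a larger buffer first. The following is only a minimal sketch: `make_padded_copy` and `kPadding` are hypothetical names, not part of simdjson, and the sketch does not address alignment, which the allocate_aligned_buffer function mentioned above presumably handles.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical constant: 64 bytes = 512 bits, the block size mentioned above.
constexpr std::size_t kPadding = 64;

// Hypothetical helper (not part of simdjson): copies `json` into a fresh
// buffer, adds a null terminator right after the JSON bytes, and fills the
// rest of the buffer with spaces so that a reader working in 64-byte blocks
// does not run past the end of the allocation.
std::vector<char> make_padded_copy(std::string_view json) {
  // Room for the JSON bytes, one terminator, and one full block of padding.
  std::vector<char> buffer(json.size() + 1 + kPadding, ' ');
  if (!json.empty()) {
    std::memcpy(buffer.data(), json.data(), json.size());
  }
  buffer[json.size()] = '\0';
  return buffer;
}
```

The copy costs one pass over the input, which fits the remark above that copying is much cheaper than parsing.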

Features

  • We parse integers and floating-point numbers as separate types, which allows us to support large 64-bit integers.
  • We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
  • We fully validate the numbers. (Parsers like gason and ultrajson will accept [0e+] as valid JSON.)
  • We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
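To illustrate what full number validation involves, here is a small scalar sketch of a checker for the JSON number grammar (RFC 8259, section 6). It is not the simdjson implementation; `is_valid_json_number` is a hypothetical helper that only shows why inputs such as 0e+ must be rejected.

```cpp
#include <cstddef>
#include <string_view>

// True for the ASCII digits '0'..'9'.
static bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Hypothetical helper: returns true only if `s` is, in its entirety,
// a number according to the JSON grammar:
//   -? (0 | [1-9][0-9]*) (. [0-9]+)? ([eE] [+-]? [0-9]+)?
bool is_valid_json_number(std::string_view s) {
  std::size_t i = 0;
  const std::size_t n = s.size();
  if (i < n && s[i] == '-') ++i;                // optional minus sign
  if (i == n || !is_digit(s[i])) return false;  // at least one integer digit
  if (s[i] == '0') {
    ++i;                                        // a leading zero stands alone
  } else {
    while (i < n && is_digit(s[i])) ++i;
  }
  if (i < n && s[i] == '.') {                   // optional fraction
    ++i;
    if (i == n || !is_digit(s[i])) return false;
    while (i < n && is_digit(s[i])) ++i;
  }
  if (i < n && (s[i] == 'e' || s[i] == 'E')) {  // optional exponent
    ++i;
    if (i < n && (s[i] == '+' || s[i] == '-')) ++i;
    if (i == n || !is_digit(s[i])) return false;
    while (i < n && is_digit(s[i])) ++i;
  }
  return i == n;                                // no trailing characters
}
```

With this checker, `0e+`, `.1`, `01` and `1.` are all rejected, while `0.1`, `-0` and `1e10` are accepted.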

Architecture

The parser works in three stages:

  • Stage 1. Quickly identifies structural elements, strings, and so forth. We validate UTF-8 encoding at this stage.
  • Stage 2. Involves the "flattening" of the data from stage 1, that is, converting bitsets into arrays of indexes.
  • Stage 3. (Structure building) Involves constructing a "tree" of sorts to navigate through the data. Strings and numbers are parsed at this stage.
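The "flattening" step of stage 2 can be sketched as follows. This is a simplified scalar illustration, not the simdjson code: `structural_mask` and `flatten_bits` are hypothetical names, the mask is built one byte at a time (the library uses AVX2 for that), and characters inside strings are not excluded here. `__builtin_ctzll` is a GCC/Clang builtin, in line with the compilers listed under Limitations.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical stand-in for stage 1: sets bit k when byte k of `block`
// (at most 64 bytes are examined) is one of the six structural characters.
// Assumption: quoted strings are ignored in this sketch.
std::uint64_t structural_mask(std::string_view block) {
  std::uint64_t mask = 0;
  const std::size_t n = block.size() < 64 ? block.size() : 64;
  for (std::size_t k = 0; k < n; ++k) {
    const char c = block[k];
    if (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',') {
      mask |= std::uint64_t{1} << k;
    }
  }
  return mask;
}

// Hypothetical stage 2 helper: appends to `out` the index of every set bit
// in `bits`, where bit k stands for input byte `base + k`.
void flatten_bits(std::uint64_t bits, std::uint32_t base,
                  std::vector<std::uint32_t>& out) {
  while (bits != 0) {
    // Position of the lowest set bit.
    out.push_back(base + static_cast<std::uint32_t>(__builtin_ctzll(bits)));
    bits &= bits - 1;  // clear the lowest set bit
  }
}
```

For the input `{"a":[1,2]}`, flattening the mask with `base` set to 0 yields the indexes 0, 4, 5, 7, 9 and 10. The loop runs once per set bit rather than once per input byte, which is the point of working on bitsets first.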

Various References

Inspiring links:

Validating UTF-8 takes no more than 0.7 cycles per byte:

Remarks on JSON parsing

  • The JSON spec defines what a JSON parser is:

A JSON parser transforms a JSON text into another representation. A JSON parser MUST accept all texts that conform to the JSON grammar. A JSON parser MAY accept non-JSON forms or extensions. An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.

  • JSON is not JavaScript:

All JSON is Javascript but NOT all Javascript is JSON. So {property:1} is invalid because property does not have double quotes around it. {'property':1} is also invalid, because it's single quoted while the only thing that can placate the JSON specification is double quoting. JSON is even fussy enough that {"property":.1} is invalid too, because you should have of course written {"property":0.1}. Also, don't even think about having comments or semicolons, you guessed it: they're invalid. (credit:https://github.com/elzr/vim-json)

  • The structural characters are:

    begin-array     = [ left square bracket
    begin-object    = { left curly bracket
    end-array       = ] right square bracket
    end-object      = } right curly bracket
    name-separator  = : colon
    value-separator = , comma

Pseudo-structural elements

A character is pseudo-structural if and only if:

  1. It is not enclosed in quotes, AND
  2. It is a non-whitespace character, AND
  3. Its preceding character is either: (a) a structural character, OR (b) whitespace.

This helps because it lets us treat some additional characters as pseudo-structural, such as the characters 1, 1, G, n in the following:

{ "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
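The three conditions can be checked with a short scalar sketch. `pseudo_structural_chars` is a hypothetical helper, not the simdjson code, and it makes a few assumptions: quote marks count as part of the quoted region, escaped quotes inside strings are not handled, the first character of the input (which has no predecessor) never qualifies, and the six structural characters are skipped since they are structural already.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

static bool is_structural(char c) {
  return c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',';
}

static bool is_json_whitespace(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical helper: returns, in order, the characters of `json` that
// satisfy the three conditions above.
std::string pseudo_structural_chars(std::string_view json) {
  std::string found;
  bool in_quotes = false;
  for (std::size_t i = 0; i < json.size(); ++i) {
    const char c = json[i];
    if (c == '"') {  // assumption: no escaped quotes inside strings
      in_quotes = !in_quotes;
      continue;
    }
    // Conditions 1 and 2, plus skipping characters that are structural anyway.
    if (in_quotes || is_json_whitespace(c) || is_structural(c)) continue;
    if (i == 0) continue;  // no preceding character to inspect
    // Condition 3: preceded by a structural character or by whitespace.
    const char prev = json[i - 1];
    if (is_structural(prev) || is_json_whitespace(prev)) found.push_back(c);
  }
  return found;
}
```

Applied to the example line above, this sketch returns `11Gbbn`: the characters 1, 1, G and n, and also the first letter of each `bla`, which meets the same three conditions.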

Academic References

  • T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. PVLDB, 6(14):1702-1713, 2013. (SIMD-based CSV parsing)
  • Mytkowicz, Todd, Madanlal Musuvathi, and Wolfram Schulte. "Data-parallel finite-state machines." ACM SIGARCH Computer Architecture News. Vol. 42. No. 1. ACM, 2014.
  • Lu, Yifan, et al. "Tree structured data processing on GPUs." Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on. IEEE, 2017.
  • Sidhu, Reetinder. "High throughput, tree automata based XML processing using FPGAs." Field-Programmable Technology (FPT), 2013 International Conference on. IEEE, 2013.
  • Dai, Zefu, Nick Ni, and Jianwen Zhu. "A 1 cycle-per-byte XML parsing accelerator." Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays. ACM, 2010.
  • Lin, Dan, et al. "Parabix: Boosting the efficiency of text processing on commodity processors." High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on. IEEE, 2012. http://parabix.costar.sfu.ca/export/1783/docs/HPCA2012/final_ieee/final.pdf
  • Deshmukh, V. M., and G. R. Bamnote. "An empirical evaluation of optimization parameters in XML parsing for performance enhancement." Computer, Communication and Control (IC4), 2015 International Conference on. IEEE, 2015.
  • Moussalli, Roger, et al. "Efficient XML Path Filtering Using GPUs." ADMS@ VLDB. 2011.
  • Jianliang, Ma, et al. "Parallel speculative dom-based XML parser." High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012.
  • Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J. and Kossmann, D., 2017. Mison: a fast JSON parser for data analytics. Proceedings of the VLDB Endowment, 10(10), pp.1118-1129. http://www.vldb.org/pvldb/vol10/p1118-li.pdf
  • Cameron, Robert D., et al. "Parallel scanning with bitstream addition: An xml case study." European Conference on Parallel Processing. Springer, Berlin, Heidelberg, 2011.
  • Cameron, Robert D., Kenneth S. Herdy, and Dan Lin. "High performance XML parsing using parallel bit stream technology." Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. ACM, 2008.
  • Shah, Bhavik, et al. "A data parallel algorithm for XML DOM parsing." International XML Database Symposium. Springer, Berlin, Heidelberg, 2009.
  • Cameron, Robert D., and Dan Lin. "Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle." ACM Sigplan Notices. Vol. 44. No. 3. ACM, 2009.
  • Amagasa, Toshiyuki, Mana Seino, and Hiroyuki Kitagawa. "Energy-Efficient XML Stream Processing through Element-Skipping Parsing." Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on. IEEE, 2013.
  • Medforth, Nigel Woodland. "icXML: Accelerating Xerces-C 3.1.1 using the Parabix Framework." (2013).
  • Zhang, Qiang Scott. Embedding Parallel Bit Stream Technology Into Expat. Diss. Simon Fraser University, 2010.
  • Cameron, Robert D., et al. "Fast Regular Expression Matching with Bit-parallel Data Streams."
  • Lin, Dan. Bits filter: a high-performance multiple string pattern matching algorithm for malware detection. Diss. School of Computing Science-Simon Fraser University, 2010.
  • Yang, Shiyang. Validation of XML Document Based on Parallel Bit Stream Technology. Diss. Applied Sciences: School of Computing Science, 2013.
  • N. Nakasato, "Implementation of a parallel tree method on a GPU", Journal of Computational Science, vol. 3, no. 3, pp. 132-141, 2012.