diff --git a/README.md b/README.md
index d71b0bfa..91f9a766 100644
--- a/README.md
+++ b/README.md
@@ -78,7 +78,7 @@ make benchmark
 ## Tools
 - `json2json mydoc.json` parses the document, constructs a model and then dumps back the result to standard output.
-- `json2json -d mydoc.json` parses the document, constructs a model and then dumps model (as a tape) to standard output.
+- `json2json -d mydoc.json` parses the document, constructs a model and then dumps model (as a tape) to standard output. The tape format is described in the accompanying file tape.md.
 - `minify mydoc.json` minifies the JSON document, outputting the result to standard output. Minifying means to remove the unneeded white space charaters.
 ## Scope
@@ -100,7 +100,7 @@ To simplify the engineering, we make some assumptions.
 ## Features
 - The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
-- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
+- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
 - We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
 - We fully validate the numbers. (Parsers like gason and ultranjson will accept `[0e+]` as valid JSON.)
 - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tags in strings.)
diff --git a/include/simdjson/parsedjson.h b/include/simdjson/parsedjson.h
index 173e74ea..582ee2cd 100644
--- a/include/simdjson/parsedjson.h
+++ b/include/simdjson/parsedjson.h
@@ -18,6 +18,11 @@
 #define DEFAULTMAXDEPTH 1024// a JSON document with a depth exceeding 1024 is probably de facto invalid
+
+/************
+ * The JSON is parsed to a tape; see the accompanying tape.md file
+ * for documentation.
+ ***********/
 struct ParsedJson {
 public:
diff --git a/src/stage34_unified.cpp b/src/stage34_unified.cpp
index f8abda6e..25786437 100644
--- a/src/stage34_unified.cpp
+++ b/src/stage34_unified.cpp
@@ -68,6 +68,11 @@ really_inline bool is_valid_null_atom(const u8 *loc) {
   return error == 0;
 }
+
+/************
+ * The JSON is parsed to a tape; see the accompanying tape.md file
+ * for documentation.
+ ***********/
 // Implemented using Labels as Values which works in GCC and CLANG (and maybe
 // also in Intel's compiler), but won't work in MSVC. This would need to be
 // reimplemented differently if one wants to be standard compliant.
diff --git a/tape.md b/tape.md
new file mode 100644
index 00000000..3bc2f65d
--- /dev/null
+++ b/tape.md
@@ -0,0 +1,129 @@
+
+# Tape structure in simdjson
+
+We parse a JSON document to a tape. A tape is an array of 64-bit values. Each node encountered in the JSON document is written to the tape using one or more 64-bit tape elements; the layout of the tape is in "document order". Throughout, little-endian encoding is assumed. The tape is indexed starting at 0 (the first element is at index 0).
+
+## Example
+
+It is sometimes useful to start with an example. Consider the following JSON document:
+
+```
+{
+  "Image": {
+    "Width": 800,
+    "Height": 600,
+    "Title": "View from 15th Floor",
+    "Thumbnail": {
+      "Url": "http://www.example.com/image/481989943",
+      "Height": 125,
+      "Width": 100
+    },
+    "Animated": false,
+    "IDs": [116, 943, 234, 38793]
+  }
+}
+```
+
+The following is a dump of the content of the tape, with the first number of each line representing the index of a tape element.
+
+```
+$ ./json2json -d jsonexamples/small/demo.json
+0 : r // pointing to 38 (right after last node)
+1 : { // pointing to next tape location 38 (first node after the scope)
+2 : string "Image"
+3 : { // pointing to next tape location 37 (first node after the scope)
+4 : string "Width"
+5 : integer 800
+7 : string "Height"
+8 : integer 600
+10 : string "Title"
+11 : string "View from 15th Floor"
+12 : string "Thumbnail"
+13 : { // pointing to next tape location 23 (first node after the scope)
+14 : string "Url"
+15 : string "http://www.example.com/image/481989943"
+16 : string "Height"
+17 : integer 125
+19 : string "Width"
+20 : integer 100
+22 : } // pointing to previous tape location 13 (start of the scope)
+23 : string "Animated"
+24 : false
+25 : string "IDs"
+26 : [ // pointing to next tape location 36 (first node after the scope)
+27 : integer 116
+29 : integer 943
+31 : integer 234
+33 : integer 38793
+35 : ] // pointing to previous tape location 26 (start of the scope)
+36 : } // pointing to previous tape location 3 (start of the scope)
+37 : } // pointing to previous tape location 1 (start of the scope)
+38 : r // pointing to 0 (start root)
+
+```
+
+## General format of the tape elements
+
+Most tape elements are written as ('c' << 56) + x where 'c' is some ASCII character determining the type of the element and where x is a 56-bit value called the payload.
+
+
+## Simple JSON values
+
+Simple JSON nodes are represented with one tape element:
+
+- null is represented as the 64-bit value ('n' << 56) where 'n' is the 8-bit code point value (in ASCII) corresponding to the letter 'n'.
+- true is represented as the 64-bit value ('t' << 56).
+- false is represented as the 64-bit value ('f' << 56).
+
+Performance consideration: It is somewhat wasteful to use 64-bit tape elements to store values that would require far less storage. However, we believe that this has no significant performance impact in most practical applications.
+
+## Integer and Double values
+
+Integer values are represented as two 64-bit tape elements:
+- The 64-bit value ('l' << 56) followed by the 64-bit integer value literally. Integer values are assumed to be signed 64-bit values, using two's complement notation.
+
+Floating-point values are represented as two 64-bit tape elements:
+- The 64-bit value ('d' << 56) followed by the 64-bit double value literally in standard IEEE 754 notation.
+
+Performance consideration: We store numbers on the main tape because we believe that locality of reference is helpful for performance. The format is somewhat wasteful of storage, as the 56-bit payload of the first element is unused.
+
+## Root node
+
+Each JSON document will have two special 64-bit tape elements representing a root node, one at the beginning and one at the end.
+
+- The first 64-bit tape element contains the value ('r'<<56) + x where x is the location on the tape of the last root element.
+- The last 64-bit tape element contains the value ('r'<< 56).
+
+All of the parsed document is located between these two 64-bit tape elements.
+
+Hint: we can read the first tape element to determine the length of the tape.
+
+
+## Strings
+
+We store string values using UTF-8 encoding with null termination on a separate tape. A string value is represented on the main tape as the 64-bit tape element ('"'<< 56) + x where x is the location on the string tape of the null-terminated string.
+
+## Arrays
+
+JSON arrays are represented using two 64-bit tape elements.
+
+- The first 64-bit tape element contains the value ('[' << 56) + x where the payload x is 1 + the index of the second 64-bit tape element on the tape.
+- The second 64-bit tape element contains the value (']' << 56) + x where the payload x contains the index of the first 64-bit tape element on the tape.
+
+All the content of the array is located between these two tape elements, including arrays and objects.
+
+Performance consideration: We can skip the content of an array entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.
+
+## Objects
+
+JSON objects are represented using two 64-bit tape elements.
+
+- The first 64-bit tape element contains the value ('{' << 56) + x where the payload x is 1 + the index of the second 64-bit tape element on the tape.
+- The second 64-bit tape element contains the value ('}' << 56) + x where the payload x contains the index of the first 64-bit tape element on the tape.
+
+In-between these two tape elements, we alternate between keys (which must be strings) and values. A value could be an object or an array.
+
+All the content of the object is located between these two tape elements, including arrays and objects.
+
+Performance consideration: We can skip the content of an object entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.
+
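
The element layout and the scope-skipping trick described in tape.md can be illustrated with a small sketch. It is not taken from simdjson: the names `make_element`, `element_type`, `element_payload`, `next_sibling`, `count_array_elements` and `example_tape` are hypothetical helpers, and the code assumes only what tape.md states, namely that an element is `('c' << 56) + x`, that numbers occupy two elements, and that an opening `[` or `{` carries the index of the first node after its scope.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not part of simdjson's API. They follow the layout
// described in tape.md: the top 8 bits hold the type, the low 56 the payload.
constexpr uint64_t PAYLOAD_MASK = 0x00FFFFFFFFFFFFFFULL;

constexpr uint64_t make_element(char type, uint64_t payload) {
  return (static_cast<uint64_t>(static_cast<unsigned char>(type)) << 56) |
         (payload & PAYLOAD_MASK);
}

constexpr char element_type(uint64_t element) {
  return static_cast<char>(element >> 56);
}

constexpr uint64_t element_payload(uint64_t element) {
  return element & PAYLOAD_MASK;
}

// Index of the node that follows the node starting at idx.
// Scopes are skipped in one step by reading the payload of '[' or '{';
// integers ('l') and doubles ('d') take two tape elements.
inline size_t next_sibling(const std::vector<uint64_t> &tape, size_t idx) {
  switch (element_type(tape[idx])) {
  case '[':
  case '{':
    return static_cast<size_t>(element_payload(tape[idx]));
  case 'l':
  case 'd':
    return idx + 2;
  default:
    return idx + 1;
  }
}

// Number of direct children of the array that opens at idx.
inline size_t count_array_elements(const std::vector<uint64_t> &tape,
                                   size_t idx) {
  size_t count = 0;
  for (size_t i = idx + 1; element_type(tape[i]) != ']';
       i = next_sibling(tape, i)) {
    ++count;
  }
  return count;
}

// Hand-built tape for the document [116, [true], null].
inline std::vector<uint64_t> example_tape() {
  return {
      make_element('r', 9), // 0: root, payload = index of the last root
      make_element('[', 9), // 1: outer array, payload = 1 + index of its ']'
      make_element('l', 0), // 2: integer marker
      116,                  // 3: the integer value itself
      make_element('[', 7), // 4: inner array, payload = 1 + index of its ']'
      make_element('t', 0), // 5: true
      make_element(']', 4), // 6: closes the inner array, points back to 4
      make_element('n', 0), // 7: null
      make_element(']', 1), // 8: closes the outer array, points back to 1
      make_element('r', 0), // 9: final root element
  };
}
```

With this tape, `count_array_elements(example_tape(), 1)` visits indexes 2, 4 and 7 only: the inner array is never entered, because the payload of its `[` element leads directly to index 7.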