Added documentation of the tape format.
parent 779ce184fb
commit 0a109508de
@ -78,7 +78,7 @@ make benchmark
## Tools

- `json2json mydoc.json` parses the document, constructs a model and then dumps back the result to standard output.
- `json2json -d mydoc.json` parses the document, constructs a model and then dumps model (as a tape) to standard output.
- `json2json -d mydoc.json` parses the document, constructs a model and then dumps model (as a tape) to standard output. The tape format is described in the accompanying file tape.md.
- `minify mydoc.json` minifies the JSON document, outputting the result to standard output. Minifying means to remove the unneeded white space characters.

## Scope

@ -100,7 +100,7 @@ To simplify the engineering, we make some assumptions.
## Features

- The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
- We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
- We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tags in strings.)

@ -18,6 +18,11 @@

#define DEFAULTMAXDEPTH 1024// a JSON document with a depth exceeding 1024 is probably de facto invalid


/************
* The JSON is parsed to a tape, see the accompanying tape.md file
* for documentation.
***********/
struct ParsedJson {
public:

@ -68,6 +68,11 @@ really_inline bool is_valid_null_atom(const u8 *loc) {
  return error == 0;
}


/************
* The JSON is parsed to a tape, see the accompanying tape.md file
* for documentation.
***********/
// Implemented using Labels as Values which works in GCC and CLANG (and maybe
// also in Intel's compiler), but won't work in MSVC. This would need to be
// reimplemented differently if one wants to be standard compliant.

@ -0,0 +1,129 @@
# Tape structure in simdjson

We parse a JSON document to a tape. A tape is an array of 64-bit values. Each node encountered in the JSON document is written to the tape using one or more 64-bit tape elements; the layout of the tape is in "document order". Throughout, little endian encoding is assumed. The tape is indexed starting at 0 (the first element is at index 0).

## Example

It is sometimes useful to start with an example. Consider the following JSON document:

```
{
    "Image": {
        "Width": 800,
        "Height": 600,
        "Title": "View from 15th Floor",
        "Thumbnail": {
            "Url": "http://www.example.com/image/481989943",
            "Height": 125,
            "Width": 100
        },
        "Animated": false,
        "IDs": [116, 943, 234, 38793]
    }
}
```

The following is a dump of the content of the tape, with the first number of each line representing the index of a tape element. An integer occupies two tape elements, which is why the index advances by two after each integer.

```
$ ./json2json -d jsonexamples/small/demo.json
0 : r // pointing to 38 (right after last node)
1 : { // pointing to next tape location 38 (first node after the scope)
2 : string "Image"
3 : { // pointing to next tape location 37 (first node after the scope)
4 : string "Width"
5 : integer 800
7 : string "Height"
8 : integer 600
10 : string "Title"
11 : string "View from 15th Floor"
12 : string "Thumbnail"
13 : { // pointing to next tape location 23 (first node after the scope)
14 : string "Url"
15 : string "http://www.example.com/image/481989943"
16 : string "Height"
17 : integer 125
19 : string "Width"
20 : integer 100
22 : } // pointing to previous tape location 13 (start of the scope)
23 : string "Animated"
24 : false
25 : string "IDs"
26 : [ // pointing to next tape location 36 (first node after the scope)
27 : integer 116
29 : integer 943
31 : integer 234
33 : integer 38793
35 : ] // pointing to previous tape location 26 (start of the scope)
36 : } // pointing to previous tape location 3 (start of the scope)
37 : } // pointing to previous tape location 1 (start of the scope)
38 : r // pointing to 0 (start root)
```

## General format of the tape elements

Most tape elements are written as ('c' << 56) + x where 'c' is some ASCII character determining the type of the element and where x is a 56-bit value called the payload. The exception is the second tape element of an integer or a double, which holds the raw 64-bit value.

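
The packing described above can be sketched in C++ as follows. The helper names (`tape_make`, `tape_type`, `tape_payload`) and the constant `kPayloadMask` are assumptions made for illustration; they are not taken from the simdjson sources. Note that the character has to be widened to 64 bits before the shift, since shifting an `int` by 56 positions is undefined in C++.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helpers (hypothetical names, not simdjson's API).
constexpr uint64_t kPayloadMask = (uint64_t(1) << 56) - 1;

// Packs a type character and a 56-bit payload into one tape element.
// Because the payload fits in 56 bits, '|' and '+' give the same result.
inline uint64_t tape_make(char type, uint64_t payload) {
  assert(payload <= kPayloadMask); // the payload must fit in 56 bits
  return (uint64_t(static_cast<unsigned char>(type)) << 56) | payload;
}

// Extracts the type character stored in the most significant byte.
inline char tape_type(uint64_t element) {
  return static_cast<char>(element >> 56);
}

// Extracts the 56-bit payload stored in the low bits.
inline uint64_t tape_payload(uint64_t element) {
  return element & kPayloadMask;
}
```

For instance, `tape_make('[', 36)` would produce an element like the one at index 26 of the example dump, and `tape_payload` recovers the 36 from it.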
## Simple JSON values

Simple JSON nodes are represented with one tape element:

- null is represented as the 64-bit value ('n' << 56) where 'n' is the 8-bit code point value (in ASCII) corresponding to the letter 'n'.
- true is represented as the 64-bit value ('t' << 56).
- false is represented as the 64-bit value ('f' << 56).

Performance consideration: It is somewhat wasteful to use 64-bit tape elements to store values that would require far less storage. However, we believe that this has no significant performance impact in most practical applications.

## Integer and Double values

Integer values are represented as two 64-bit tape elements:
- The 64-bit value ('l' << 56) followed by the 64-bit integer value literally. Integer values are assumed to be signed 64-bit values, using two's complement notation.

Float values are represented as two 64-bit tape elements:
- The 64-bit value ('d' << 56) followed by the 64-bit double value literally in standard IEEE 754 notation.

Performance consideration: We store numbers on the main tape because we believe that locality of reference is helpful for performance. The format is somewhat wasteful of storage, as the 56 payload bits of the first tape element are unused.

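
A minimal sketch of writing and reading such two-element number nodes is given below. All function names are hypothetical, and the tape is modeled as a `std::vector<uint64_t>` purely for the sake of the example. The raw bits are copied with `std::memcpy`, which avoids aliasing problems when reinterpreting a 64-bit word as an `int64_t` or a `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; not simdjson's API.

// Appends an integer node: the 'l' marker, then the raw 64-bit value.
inline void append_integer(std::vector<uint64_t> &tape, int64_t value) {
  tape.push_back(uint64_t('l') << 56);
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  tape.push_back(bits);
}

// Appends a double node: the 'd' marker, then the raw IEEE 754 bits.
inline void append_double(std::vector<uint64_t> &tape, double value) {
  static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");
  tape.push_back(uint64_t('d') << 56);
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  tape.push_back(bits);
}

// Reads the integer whose marker element sits at `index`.
inline int64_t read_integer(const std::vector<uint64_t> &tape, size_t index) {
  assert((tape[index] >> 56) == 'l');
  int64_t value;
  std::memcpy(&value, &tape[index + 1], sizeof(value));
  return value;
}

// Reads the double whose marker element sits at `index`.
inline double read_double(const std::vector<uint64_t> &tape, size_t index) {
  assert((tape[index] >> 56) == 'd');
  double value;
  std::memcpy(&value, &tape[index + 1], sizeof(value));
  return value;
}

// Builds a four-element tape fragment holding -800 and then 1.5.
inline std::vector<uint64_t> example_numbers() {
  std::vector<uint64_t> tape;
  append_integer(tape, -800);
  append_double(tape, 1.5);
  return tape;
}
```

In the fragment built by `example_numbers()`, the integer starts at index 0 and the double at index 2, mirroring the index jumps seen in the example dump.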
## Root node

Each JSON document will have two special 64-bit tape elements representing a root node, one at the beginning and one at the end.

- The first 64-bit tape element contains the value ('r' << 56) + x where x is the location on the tape of the last root element.
- The last 64-bit tape element contains the value ('r' << 56).

All of the parsed document is located between these two 64-bit tape elements.

Hint: we can read the first tape element to determine the length of the tape. In the example above, the payload is 38, so the tape has 39 elements.

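
The hint can be sketched as follows. The function `tape_length` is a hypothetical helper, and the tape for the document `[]` is built by hand from the rules in this file rather than produced by the parser, so treat it as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of 64-bit elements on a well-formed tape.
inline uint64_t tape_length(const uint64_t *tape) {
  assert((tape[0] >> 56) == 'r'); // the tape must start with a root element
  const uint64_t last_root = tape[0] & ((uint64_t(1) << 56) - 1);
  return last_root + 1; // indices run from 0 to last_root inclusive
}

// Hand-built tape for the document `[]` (an assumption for illustration).
inline std::vector<uint64_t> empty_array_tape() {
  return {
      (uint64_t('r') << 56) + 3, // 0: first root, points to the last root at 3
      (uint64_t('[') << 56) + 3, // 1: array start, points just past ']' (2 + 1)
      (uint64_t(']') << 56) + 1, // 2: array end, points back to '[' at 1
      (uint64_t('r') << 56) + 0, // 3: last root
  };
}
```

Calling `tape_length(empty_array_tape().data())` should give 4, which matches the number of elements in the hand-built tape.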
## Strings

We store string values using UTF-8 encoding with null termination on a separate tape. A string value is represented on the main tape as the 64-bit tape element ('"' << 56) + x where x is the location on the string tape of the null-terminated string.

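
Resolving a string element then amounts to an offset into the string tape. The sketch below assumes the string tape is a plain byte array; the names `kStringTape`, `tape_string` and `tape_string_equals` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Example string tape (an assumption for illustration):
// "Image" starts at offset 0 and "Width" starts at offset 6.
static const uint8_t kStringTape[] = "Image\0Width";

// Returns the null-terminated string a string element refers to.
inline const char *tape_string(uint64_t element, const uint8_t *string_tape) {
  assert((element >> 56) == '"'); // must be a string element
  const uint64_t offset = element & ((uint64_t(1) << 56) - 1);
  return reinterpret_cast<const char *>(string_tape + offset);
}

// Compares the referenced string with an expected C string.
inline bool tape_string_equals(uint64_t element, const uint8_t *string_tape,
                               const char *expected) {
  return std::strcmp(tape_string(element, string_tape), expected) == 0;
}
```

With this layout, the element `('"' << 56) + 6` resolves to "Width".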
## Arrays

JSON arrays are represented using two 64-bit tape elements.

- The first 64-bit tape element contains the value ('[' << 56) + x where the payload x is 1 + the index of the second 64-bit tape element on the tape.
- The second 64-bit tape element contains the value (']' << 56) + x where the payload x contains the index of the first 64-bit tape element on the tape.

All the content of the array is located between these two tape elements, including nested arrays and objects.

Performance consideration: We can skip the content of an array entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.

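
The skipping trick can be sketched as a small traversal that counts the direct children of an array without descending into nested scopes. The helper names are hypothetical, and the tape for `[116, [943]]` is hand-built from the rules in this file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; not simdjson's API.
inline char node_type(uint64_t element) {
  return static_cast<char>(element >> 56);
}
inline size_t node_payload(uint64_t element) {
  return static_cast<size_t>(element & ((uint64_t(1) << 56) - 1));
}

// Returns the index of the node that follows the node starting at `index`.
inline size_t next_node(const std::vector<uint64_t> &tape, size_t index) {
  switch (node_type(tape[index])) {
  case '[':
  case '{':
    return node_payload(tape[index]); // jump over the whole scope
  case 'l':
  case 'd':
    return index + 2; // marker element plus the raw 64-bit value
  default:
    return index + 1; // strings, true, false, null
  }
}

// Counts the direct children of the array starting at `array_index`.
inline size_t count_array_elements(const std::vector<uint64_t> &tape,
                                   size_t array_index) {
  assert(node_type(tape[array_index]) == '[');
  const size_t close = node_payload(tape[array_index]) - 1; // index of ']'
  size_t count = 0;
  for (size_t i = array_index + 1; i < close; i = next_node(tape, i)) {
    count++;
  }
  return count;
}

// Hand-built tape for `[116, [943]]` (an assumption for illustration).
inline std::vector<uint64_t> nested_array_tape() {
  return {
      (uint64_t('r') << 56) + 9, // 0: first root
      (uint64_t('[') << 56) + 9, // 1: outer array, closes at 8
      uint64_t('l') << 56,       // 2: integer marker
      uint64_t(116),             // 3: raw value
      (uint64_t('[') << 56) + 8, // 4: inner array, closes at 7
      uint64_t('l') << 56,       // 5: integer marker
      uint64_t(943),             // 6: raw value
      (uint64_t(']') << 56) + 4, // 7: end of inner array
      (uint64_t(']') << 56) + 1, // 8: end of outer array
      uint64_t('r') << 56,       // 9: last root
  };
}
```

The outer array has two children even though it spans seven tape elements; the inner array is skipped in a single step through its payload.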
## Objects

JSON objects are represented using two 64-bit tape elements.

- The first 64-bit tape element contains the value ('{' << 56) + x where the payload x is 1 + the index of the second 64-bit tape element on the tape.
- The second 64-bit tape element contains the value ('}' << 56) + x where the payload x contains the index of the first 64-bit tape element on the tape.

In between these two tape elements, we alternate between keys (which must be strings) and values. A value could be an object or an array.

All the content of the object is located between these two tape elements, including nested arrays and objects.

Performance consideration: We can skip the content of an object entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.

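
A key lookup over this layout could look like the sketch below, which walks the key/value pairs of one object and skips each non-matching value in a single step. The names `find_value`, `skip_value`, `kKeyStrings` and `example_object_tape` are hypothetical, the string tape is assumed to be a plain `char` array, and the tape for `{"a": [1], "b": true}` is hand-built from the rules in this file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration; not simdjson's API.
inline char elem_type(uint64_t element) {
  return static_cast<char>(element >> 56);
}
inline size_t elem_payload(uint64_t element) {
  return static_cast<size_t>(element & ((uint64_t(1) << 56) - 1));
}

// Returns the index of the node that follows the value starting at `index`.
inline size_t skip_value(const std::vector<uint64_t> &tape, size_t index) {
  const char type = elem_type(tape[index]);
  if (type == '[' || type == '{') return elem_payload(tape[index]);
  if (type == 'l' || type == 'd') return index + 2;
  return index + 1;
}

// Returns the tape index of the value stored under `key` in the object at
// `object_index`, or 0 if the key is absent (index 0 always holds the root).
inline size_t find_value(const std::vector<uint64_t> &tape, const char *strings,
                         size_t object_index, const char *key) {
  assert(elem_type(tape[object_index]) == '{');
  const size_t close = elem_payload(tape[object_index]) - 1; // index of '}'
  size_t i = object_index + 1;
  while (i < close) {
    assert(elem_type(tape[i]) == '"'); // keys must be strings
    const size_t value_index = i + 1;
    if (std::strcmp(strings + elem_payload(tape[i]), key) == 0) {
      return value_index;
    }
    i = skip_value(tape, value_index);
  }
  return 0;
}

// String tape for the example: "a" at offset 0 and "b" at offset 2.
static const char kKeyStrings[] = "a\0b";

// Hand-built tape for `{"a": [1], "b": true}` (an assumption).
inline std::vector<uint64_t> example_object_tape() {
  return {
      (uint64_t('r') << 56) + 10, // 0: first root
      (uint64_t('{') << 56) + 10, // 1: object, closes at 9
      (uint64_t('"') << 56) + 0,  // 2: key "a"
      (uint64_t('[') << 56) + 7,  // 3: array, closes at 6
      uint64_t('l') << 56,        // 4: integer marker
      uint64_t(1),                // 5: raw value
      (uint64_t(']') << 56) + 3,  // 6: end of array
      (uint64_t('"') << 56) + 2,  // 7: key "b"
      uint64_t('t') << 56,        // 8: true
      (uint64_t('}') << 56) + 1,  // 9: end of object
      uint64_t('r') << 56,        // 10: last root
  };
}
```

Looking up "b" never touches the contents of the array stored under "a": the search goes from index 3 straight to index 7 through the array's payload.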