138 lines
7.5 KiB
Markdown
138 lines
7.5 KiB
Markdown
|
|
# Tape structure in simdjson
|
|
|
|
We parse a JSON document to a tape. A tape is an array of 64-bit values. Each node encountered in the JSON document is written to the tape using one or more 64-bit tape elements; the layout of the tape is in "document order": elements are stored as they are encountered in the JSON document.
|
|
|
|
Throughout, little endian encoding is assumed. The tape is indexed starting at 0 (the first element is at index 0).
|
|
|
|
## Example
|
|
|
|
It is sometimes useful to start with an example. Consider the following JSON document:
|
|
|
|
```json
|
|
{
|
|
"Image": {
|
|
"Width": 800,
|
|
"Height": 600,
|
|
"Title": "View from 15th Floor",
|
|
"Thumbnail": {
|
|
"Url": "http://www.example.com/image/481989943",
|
|
"Height": 125,
|
|
"Width": 100
|
|
},
|
|
"Animated": false,
|
|
"IDs": [116, 943, 234, 38793]
|
|
}
|
|
}
|
|
```
|
|
|
|
The following is a dump of the content of the tape, with the first number of each line representing the index of a tape element.
|
|
|
|
### The Tape
|
|
| index | element (64 bit word) |
|
|
| ----- | ------------------------------------------------------------------- |
|
|
| 0 | r // pointing to 38 (right after last node) |
|
|
| 1 | { // pointing to next tape location 38 (first node after the scope) |
|
|
| 2 | string "Image" |
|
|
| 3 | { // pointing to next tape location 37 (first node after the scope) |
|
|
| 4 | string "Width" |
|
|
| 5 | integer 800 |
|
|
| 7 | string "Height" |
|
|
| 8 | integer 600 |
|
|
| 10 | string "Title" |
|
|
| 11 | string "View from 15th Floor" |
|
|
| 12 | string "Thumbnail" |
|
|
| 13 | { // pointing to next tape location 23 (first node after the scope) |
|
|
| 14 | string "Url" |
|
|
| 15 | string "http://www.example.com/image/481989943" |
|
|
| 16 | string "Height" |
|
|
| 17 | integer 125 |
|
|
| 19 | string "Width" |
|
|
| 20 | integer 100 |
|
|
| 22 | } // pointing to previous tape location 13 (start of the scope) |
|
|
| 23 | string "Animated" |
|
|
| 24 | false |
|
|
| 25 | string "IDs" |
|
|
| 26 | [ // pointing to next tape location 36 (first node after the scope) |
|
|
| 27 | integer 116 |
|
|
| 29 | integer 943 |
|
|
| 31 | integer 234 |
|
|
| 33 | integer 38793 |
|
|
| 35 | ] // pointing to previous tape location 26 (start of the scope) |
|
|
| 36 | } // pointing to previous tape location 3 (start of the scope) |
|
|
| 37 | } // pointing to previous tape location 1 (start of the scope) |
|
|
| 38 | r // pointing to 0 (start root) |
|
|
|
|
|
|
|
|
## General formal of the tape elements
|
|
|
|
Most tape elements are written as `('c' << 56) + x` where `'c'` is some ASCII character determining the type of the element (out of 't', 'f', 'n', 'l', 'u', 'd', '"', '{', '}', '[', ']' ,'r') and where `x` is a 56-bit value called the payload. The payload is normally interpreted as an unsigned 56-bit integer. Note that 56-bit integers can be quite large.
|
|
|
|
|
|
Performance consideration: We believe that accessing the tape in regular units of 64 bits is more important for performance than saving memory.
|
|
|
|
## Simple JSON values
|
|
|
|
Simple JSON nodes are represented with one tape element:
|
|
|
|
- null is represented as the 64-bit value `('n' << 56)` where `'n'` is the 8-bit code point values (in ASCII) corresponding to the letter `'n'`.
|
|
- true is represented as the 64-bit value `('t' << 56)`.
|
|
- false is represented as the 64-bit value `('f' << 56)`.
|
|
|
|
|
|
## Integer and Double values
|
|
|
|
Integer values are represented as two 64-bit tape elements:
|
|
- The 64-bit value `('l' << 56)` followed by the 64-bit integer value litterally. Integer values are assumed to be signed 64-bit values, using two's complement notation.
|
|
- The 64-bit value `('u' << 56)` followed by the 64-bit integer value litterally. Integer values are assumed to be unsigned 64-bit values.
|
|
|
|
|
|
Float values are represented as two 64-bit tape elements:
|
|
- The 64-bit value `('d' << 56)` followed by the 64-bit double value litterally in standard IEEE 754 notation.
|
|
|
|
Performance consideration: We store numbers of the main tape because we believe that locality of reference is helpful for performance.
|
|
|
|
## Root node
|
|
|
|
Each JSON document will have two special 64-bit tape elements representing a root node, one at the beginning and one at the end.
|
|
|
|
- The first 64-bit tape element contains the value `('r' << 56) + x` where `x` is the location on the tape of the last root element.
|
|
- The last 64-bit tape element contains the value `('r' << 56)`.
|
|
|
|
All of the parsed document is located between these two 64-bit tape elements.
|
|
|
|
Hint: We can read the first tape element to determine the length of the tape.
|
|
|
|
|
|
## Strings
|
|
|
|
We prefix the string data itself by a 32-bit header to be interpreted as a 32-bit integer. It indicates the length of the string. The actual string data starts at an offset of 4 bytes.
|
|
|
|
We store string values using UTF-8 encoding with null termination on a separate tape. A string value is represented on the main tape as the 64-bit tape element `('"' << 56) + x` where the payload `x` is the location on the string tape of the null-terminated string.
|
|
|
|
## Arrays
|
|
|
|
JSON arrays are represented using two 64-bit tape elements.
|
|
|
|
- The first 64-bit tape element contains the value `('[' << 56) + x` where the payload `x` is 1 + the index of the second 64-bit tape element on the tape.
|
|
- The second 64-bit tape element contains the value `(']' << 56) + x` where the payload `x` contains the index of the first 64-bit tape element on the tape.
|
|
|
|
All the content of the array is located between these two tape elements, including arrays and objects.
|
|
|
|
Performance consideration: We can skip the content of an array entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.
|
|
|
|
## Objects
|
|
|
|
JSON objects are represented using two 64-bit tape elements.
|
|
|
|
- The first 64-bit tape element contains the value `('{' << 56) + x` where the payload `x` is 1 + the index of the second 64-bit tape element on the tape.
|
|
- The second 64-bit tape element contains the value `('}' << 56) + x` where the payload `x` contains the index of the first 64-bit tape element on the tape.
|
|
|
|
In-between these two tape elements, we alternate between key (which must be strings) and values. A value could be an object or an array.
|
|
|
|
All the content of the object is located between these two tape elements, including arrays and objects.
|
|
|
|
Performance consideration: We can skip the content of an object entirely by accessing the first 64-bit tape element, reading the payload and moving to the corresponding index on the tape.
|
|
|