Create Notes.md

A notes file describing what's actually in the implementation. Will try to minimize the amount of speculative or TODO material here.
geofflangdale 2018-04-04 15:26:41 +10:00 committed by GitHub
parent e6478e33b3
commit 92110efa67
1 changed files with 93 additions and 0 deletions

Notes.md Normal file

@@ -0,0 +1,93 @@
# Notes on simdjson
## Rationale:
The simdjson project serves two purposes:
1. It creates a useful library for parsing JSON data quickly.
2. It is a demonstration of the use of SIMD and pipelined programming techniques to perform a complex and irregular task.
These techniques include the use of large registers and SIMD instructions to process large amounts of input data at once,
to hold larger entities than can typically be held in a single General Purpose Register (GPR), and to perform operations
that are not cheap to perform without use of a SIMD unit (for example table lookup using permute instructions).
The other key technique is that the system is designed to minimize the number of unpredictable branches that must be taken
to perform the task. Modern architectures are both wide and deep (4-wide pipelines with ~14 stages are commonplace). A
recent Intel Architecture processor, for example, can perform 3 256-bit SIMD operations or 2 512-bit SIMD operations per
cycle as well as other operations on general purpose registers or with the load/store unit. An incorrectly predicted branch
will clear this pipeline. While it is rare that a programmer can achieve the maximum throughput on a machine, a developer
may be missing the opportunity to carry out as many as 56 operations (4 per cycle over ~14 stages) for each branch miss.
Many code-bases make use of SIMD and deeply pipelined, "non-branchy", processing for regular tasks. Numerical problems
(e.g. "matrix multiply") or simple 'bulk search' tasks (e.g. "count all the occurrences of a given character in a text",
"find the first occurrence of the string 'foo' in a text") frequently use this class of techniques. We are demonstrating
that these techniques can be applied to a much more complex and less regular task.
## Design:
### Stage 1: SIMD over bytes; bit vector processing over bytes.
The first stage of our processing must identify key points in our input: the 'structural characters' of JSON (curly braces,
square brackets, colon, and comma), the start and end of strings as delineated by double quote characters, and other JSON 'atoms'
that are not distinguishable by simple characters (constructs such as "true", "false", "null" and numbers). It must
discover these characters and atoms in the presence of both quoting conventions and backslash escaping conventions.
As such we follow the broad outline of the construction of a structural index as set forth in the Mison paper [XXX]: first,
the discovery of odd-length sequences of backslash characters (which will cause quote characters immediately following to
be escaped and not serve their quoting role but instead be literal characters); second, the discovery of quote pairs (which
cause structural characters within the quote pairs to also be merely literal characters and have no function as structural
characters); then finally the discovery of structural characters not contained within the quote pairs.
We depart from the Mison paper in terms of method and overall design. In terms of method, the Mison paper uses iteration
over bit vectors to discover backslash sequences and quote pairs; we introduce branch-free techniques to discover both of
these properties.
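
To illustrate the branch-free idea on a single 64-byte block, here is a minimal sketch; it is not taken from the implementation. It assumes a scalar helper (`byte_eq_mask`, a hypothetical name) standing in for a SIMD compare, ignores backslash runs and strings that cross block boundaries, and uses a shift-and-XOR prefix XOR as a portable stand-in for whatever the real code does. All names here are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a SIMD byte compare plus movemask:
// bit i of the result is set when block[i] == c.
inline uint64_t byte_eq_mask(const char* block, size_t len, char c) {
    assert(len <= 64);  // one bit per input byte, one 64-bit word per block
    uint64_t mask = 0;
    for (size_t i = 0; i < len; i++) {
        mask |= uint64_t(block[i] == c) << i;
    }
    return mask;
}

// Given the backslash bitmask, return the positions that immediately follow
// an odd-length run of backslashes. No loop over runs, no data-dependent branch.
inline uint64_t odd_backslash_ends(uint64_t bs) {
    const uint64_t even_bits = 0x5555555555555555ULL;
    const uint64_t odd_bits = ~even_bits;
    uint64_t starts = bs & ~(bs << 1);         // first backslash of each run
    uint64_t even_starts = starts & even_bits;
    uint64_t odd_starts = starts & odd_bits;
    // Adding a run's start bit to the run makes the carry ripple to the
    // first position after the run.
    uint64_t even_carry_ends = (bs + even_starts) & ~bs;
    uint64_t odd_carry_ends = (bs + odd_starts) & ~bs;
    // A run has odd length when its start and end positions differ in parity.
    return (even_carry_ends & odd_bits) | (odd_carry_ends & even_bits);
}

// Prefix XOR: bit i of the result is the XOR of bits 0..i of x.
inline uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

struct StringMasks {
    uint64_t quotes;     // unescaped quote characters
    uint64_t in_string;  // from each opening quote up to (not including) its closing quote
};

inline StringMasks find_strings(const char* block, size_t len) {
    uint64_t bs = byte_eq_mask(block, len, '\\');
    uint64_t quotes = byte_eq_mask(block, len, '"') & ~odd_backslash_ends(bs);
    return {quotes, prefix_xor(quotes)};
}
```

For the 12-byte input `{"a":"b\"c"}`, the quote at offset 8 is dropped from `quotes` because it follows a single backslash, and `in_string` covers offsets 1-2 and 5-9.
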
We also make use of our ability to quickly detect whitespace in this early stage. We can use another bit-vector based
transformation to discover locations in our data that follow a structural character or quote followed by zero or more
characters of whitespace; excluding locations within strings, and the structural characters we have already discovered,
these locations are the only place that we can expect to see the starts of the JSON 'atoms'. These locations are thus
treated as 'structural' ('pseudo-structural characters').
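
One possible bit-vector formulation of this step is sketched below, under several assumptions: the function name and signature are hypothetical, the byte classification is done with a scalar loop standing in for SIMD, quotes are assumed to be recorded separately, and the `in_string` mask is assumed to come from an earlier step. The sketch marks any non-whitespace byte that directly follows a structural or whitespace byte, which is a simplification that agrees with the description above on well-formed JSON but may differ on malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns a bitmask of candidate atom starts ("pseudo-structural" positions)
// in one block of up to 64 bytes.
//   in_string:      bits set for bytes inside strings (assumed precomputed)
//   prev_ends_pred: true if the byte before this block was structural or
//                   whitespace (or if this is the start of the input)
inline uint64_t pseudo_structurals(const char* block, size_t len,
                                   uint64_t in_string, bool prev_ends_pred) {
    assert(len <= 64);
    uint64_t structural = 0, whitespace = 0;
    for (size_t i = 0; i < len; i++) {  // scalar stand-in for SIMD classification
        char c = block[i];
        bool s = c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',';
        bool w = c == ' ' || c == '\t' || c == '\n' || c == '\r';
        structural |= uint64_t(s) << i;
        whitespace |= uint64_t(w) << i;
    }
    // Bits at or beyond len do not correspond to input bytes.
    uint64_t valid = (len == 64) ? ~uint64_t(0) : ((uint64_t(1) << len) - 1);
    // A position qualifies if its predecessor was structural or whitespace...
    uint64_t pred = structural | whitespace;
    uint64_t follows = (pred << 1) | uint64_t(prev_ends_pred);
    // ...and it is itself not whitespace, not already structural, not in a string.
    return follows & ~whitespace & ~structural & ~in_string & valid;
}
```

For the input `[1, true ]` with no strings, this marks offsets 1 and 4, the first bytes of `1` and `true`.
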
This stage involves either SIMD processing over our bytes or the manipulation of bit arrays that have 1 bit corresponding
to 1 byte of input. As such, it can be quite inefficient for some inputs - it is possible to observe dozens of operations
taking place to discover that there are in fact no odd-length sequences of backslashes or quotes in a given block of
input. However, this inefficiency on such inputs is balanced by the fact that it costs no more to run this code over
complex structured input, and the alternatives would generally involve running a number of unpredictable branches (for
example, the loop branches in Mison that iterate over bit vectors).
### Stage 2: The transition from "SIMD over bytes" to "indices"
Our structural, pseudo-structural and other 'interesting' characters are relatively rare (TODO: quantify in detail -
it's typically about 1 in 10). As such, continuing to process them as bit vectors will involve manipulating data structures
that are relatively large as well as being fairly unpredictably spaced. We must transform these bitvectors of "interesting"
locations into offsets.
Note that we can examine the character at the offset to discover what the original function of the item in the bitvector
was. While the JSON structural characters and quotes are relatively self-explanatory (although working only with one offset
at a time, we have lost the distinction between opening quotes and closing quotes, something that was available in Stage 1),
it is a quirk of JSON that the legal atoms can all be distinguished from each other by their first character - 't' for
'true', 'f' for 'false', 'n' for 'null' and the character class [0-9-] for numerical values.
Thus, the offset suffices, as long as we retain our original input.
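
As a small illustration of recovering an item's role from the byte at its offset, consider the sketch below. The `ItemKind` enum and `kind_at` function are hypothetical names introduced for this example. Note that it only inspects the first character; a real parser would still need to validate the rest of the atom.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classification of an indexed location by its first byte.
enum class ItemKind { Structural, Quote, True, False, Null, Number, Unknown };

inline ItemKind kind_at(const char* input, size_t len, size_t offset) {
    assert(offset < len);  // the original input must be retained and in range
    switch (input[offset]) {
        case '{': case '}': case '[': case ']': case ':': case ',':
            return ItemKind::Structural;
        case '"':
            return ItemKind::Quote;  // opening vs closing is not known here
        case 't':
            return ItemKind::True;
        case 'f':
            return ItemKind::False;
        case 'n':
            return ItemKind::Null;
        case '-':
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            return ItemKind::Number;
        default:
            return ItemKind::Unknown;  // not a legal start of any JSON item
    }
}
```
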
Our current implementation involves a straightforward transformation of bitmaps to indices by use of the 'count trailing
zeros' operation and the well-known operation to clear the lowest set bit. Note that this implementation introduces an
unpredictable branch; unless there is a regular pattern in our bitmaps, we would expect to have at least one branch miss
for each bitmap.
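
The transformation described above can be sketched as follows. The function name `flatten_bits` and the use of `std::vector` are assumptions made for the example, and `__builtin_ctzll` is the GCC/Clang builtin for 'count trailing zeros'; other compilers need a different spelling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the absolute offset of every set bit in `bits` to `out`.
// `base` is the offset of the first byte of the 64-byte block.
inline void flatten_bits(uint64_t bits, uint32_t base, std::vector<uint32_t>& out) {
    assert(base % 64 == 0);  // assumption: blocks are 64-byte aligned in the input
    // The trip count depends on the data, so the loop exit is the
    // unpredictable branch mentioned in the text.
    while (bits != 0) {
        out.push_back(base + static_cast<uint32_t>(__builtin_ctzll(bits)));
        bits &= bits - 1;  // clear the lowest set bit
    }
}
```

For example, a bitmap with bits 1, 4 and 7 set in the block starting at offset 64 yields the indices 65, 68 and 71.
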
### Stage 3: Operation over indices
The indices form a relatively concise map of structurally important parts of our JSON input. However, since JSON is
recursively defined, we may nest structures (JSON "objects" and "arrays") inside other JSON structures. It is important
to be able to quickly traverse portions of our JSON structure at any given level (it is trivial for us to move around
in a way that follows the input text, but skipping to the next item at a given level may involve searching hundreds of
bytes of text).
We can construct a simple data structure that allows us to thread together such structures relatively simply; at this
stage this code is not branch-free. We use an implicit 'stack' structure by virtue of threading together 'up-level
pointers' within the structure as we build it (these are pointers that, for each item in the structure we have seen
already, tell us which item in the structure contains this one); to pop up a level, we simply follow one layer
of 'up-level pointers'.
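
A minimal sketch of threading 'up-level pointers' without an external stack is given below. The `Item` layout, the `NO_PARENT` sentinel and the function name are all hypothetical, and the choice to have a closing brace point at its matching opener is an assumption of this example, not something the notes specify. Unbalanced input is not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t NO_PARENT = UINT32_MAX;  // marks items at the top level

struct Item {
    uint32_t offset;  // byte offset into the original input
    uint32_t up;      // index (into the item array) of the enclosing '{' or '['
};

// Build one Item per indexed offset, linking each to its enclosing opener.
inline std::vector<Item> thread_up_pointers(const char* input,
                                            const std::vector<uint32_t>& offsets) {
    std::vector<Item> items;
    items.reserve(offsets.size());
    uint32_t current = NO_PARENT;  // innermost open structure so far
    for (uint32_t off : offsets) {
        char c = input[off];
        items.push_back({off, current});
        if (c == '{' || c == '[') {
            // Descend: later items are contained in this one.
            current = static_cast<uint32_t>(items.size() - 1);
        } else if (c == '}' || c == ']') {
            assert(current != NO_PARENT);  // sketch assumes balanced input
            // Pop up a level by following one up-level pointer.
            current = items[current].up;
        }
    }
    return items;
}
```

The only state carried between items is `current`; the 'stack' lives in the `up` fields of the items already emitted.
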
An equivalent operation requiring an external data structure would be to maintain a stack that essentially describes
all current levels of our structure as we traverse it; this may have performance advantages.