flushing out

parent bc1331283a
commit e6e3f42491

README.md

@@ -1,27 +1,41 @@
# simdjson : Parsing gigabytes of JSON per second

A *research* library. The purpose of this repository is to support an eventual paper. The immediate goal is not to produce a library that would be used in production systems. Of course, the end game is, indeed, to have an impact on production systems.

Goal: Speed up the parsing of JSON per se. No materialization.

## Architecture

The parser works in three stages (a fourth is planned):

- Stage 1. Quickly identifies the structural elements, strings, and so forth. Currently, there is no validation (the JSON is assumed to be correct). (A minimal detection sketch follows this list.)
- Stage 2. Involves "flattening" the data from stage 1, that is, converting the bitsets into arrays of indexes. (A flattening sketch also follows this list.)
- Stage 3. (Structure building) Involves constructing a "tree" of sorts to navigate through the data.
- Stage 4. (Currently unimplemented) Iterate through the structure without "stalling" (fighting back against latency).
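A minimal sketch of the stage 1 detection idea, assuming AVX2, a padded 32-byte block, and a plain compare-per-character approach; the function name and signature are illustrative, not this library's code:

```cpp
#include <immintrin.h>
#include <initializer_list>
#include <cstdint>

// Illustrative only: mark the JSON structural characters in a 32-byte block,
// producing one bit per input byte. Requires AVX2 (compile with -mavx2) and
// assumes the input is padded so that a full 32-byte load is safe.
inline uint32_t structural_bits(const uint8_t* block) {
    const __m256i in = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
    __m256i matches = _mm256_setzero_si256();
    for (char c : {'{', '}', '[', ']', ':', ','}) {
        matches = _mm256_or_si256(matches, _mm256_cmpeq_epi8(in, _mm256_set1_epi8(c)));
    }
    return static_cast<uint32_t>(_mm256_movemask_epi8(matches));
}
```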
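Likewise, a naive stage 2 flattening loop in the spirit of the "iterate over set bits" code discussed later; again only a sketch, and ``__builtin_ctzll`` is a GCC/Clang builtin:

```cpp
#include <cstdint>

// Illustrative only: convert a 64-bit bitset into indexes relative to 'base',
// one index per set bit, in increasing order. Returns a pointer just past the
// last index written.
inline uint32_t* flatten(uint32_t* out, uint64_t bits, uint32_t base) {
    while (bits != 0) {
        *out++ = base + static_cast<uint32_t>(__builtin_ctzll(bits)); // position of lowest set bit
        bits &= bits - 1;                                             // clear the lowest set bit
    }
    return out;
}
```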
## Todo (Priority)

- Write unit tests
- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives). Geoff wrote:
> 44% of the time is in the tree construction. Of the remaining 56%, pretty much half is in the code to discover structural characters and half is in the naive code to flatten that out into a vector of u32 offsets (the 'iterate over set bits' problem).
- Document the code better and make it easier to use
- Add some measure of error handling (maybe optional)
- Geoff wrote:
> A future goal for Stage 3 will also be to thread together vectors or lists of similar structural elements (e.g. strings, objects, numbers, etc). A putative 'stage 4' will then be able to iterate in parallel fashion over these vectors (branchlessly, or at least without a pipeline-killing "Giant What I Am Doing Anyway" switch at the top) and transform them into more usable values. Some of this code is also inherently interesting (de-escaping strings, high speed atoi - an old favorite).
- Geoff is unhappy with stage 3. He wrote (a hedged sketch of the UP/DOWN/SIDEWAYS idea he describes appears after this list):
> I'm focusing on the tree construction at the moment. I think we can abstract the structural characters to 3 operations during that stage (UP, DOWN, SIDEWAYS), batch them, and build out tree structure in bulk with data-driven SIMD operations rather than messing around with branches. It's probably OK to have a table with 3^^5 or even 3^^6 entries, and it's still probably OK to have some hard cases eliminated and handled on a slow path (e.g. someone hits you with 6 scope closes in a row, forcing you to pop 24 bytes into your tree construction stack). (...) I spend some time with SIMD and found myself going in circles. There's an utterly fantastic solution in there somewhere involving turning the tree building code into a transformation over multiple abstracted symbols (e.g. UP/UP/SIDEWAYS/DOWN). It looked great until I smacked up against the fact that the oh-so-elegant solution I had sketched out had, on its critical path, an unaligned store followed by load partially overlapping with that store, to maintain the stack of 'up' pointers. Ugh. (...) So I worked through about 6 alternate solutions of various levels of pretentiousness none of which made me particularly happy.
>
> The structure building is too slow for my taste. I'm not sure I want too much richer functionality in it. Many of the transformations which I would like to do seem better done on the tree (i.e. pruning out every second " character). But this code takes 44% of our time; it's outrageous.
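Below is a hedged sketch of the UP/DOWN/SIDEWAYS abstraction Geoff describes; it is only one plausible reading of the quote, and the enum and function names are hypothetical:

```cpp
#include <cstdint>

// Hypothetical mapping of JSON structural characters onto the three tree
// operations Geoff mentions: DOWN opens a scope, UP closes one, and SIDEWAYS
// stays at the same depth. Batches of these symbols could then drive
// table-based tree building instead of per-character branches.
enum class TreeOp : uint8_t { Up, Down, Sideways };

inline TreeOp classify(char structural) {
    switch (structural) {
        case '{': case '[': return TreeOp::Down;
        case '}': case ']': return TreeOp::Up;
        default:            return TreeOp::Sideways; // ',' and ':'
    }
}
```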
Of course, stage 4 is totally unimplemented, so it might be a priority as well (see Geoff's note above about a putative "stage 4" iterating over vectors of similar structural elements).
## Todo (Secondary)

- Build up a paper (use overleaf.com)
- Write unit tests
- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives).
- Document the code better and make it easier to use
- Add some measure of error handling (maybe optional)
## References
@@ -83,4 +97,4 @@ containing structural element ("up").
- It seems that the use of ``bzero`` is discouraged (it is considered a legacy interface; ``memset`` is the portable alternative).
- Per input byte, multiple bytes are allocated, which could be a problem when processing a very large document; in practice one might want to be more incremental to minimize memory usage. For really large documents, there might be caching issues as well.
- The ``clmul`` thing is tricky but nice. (Geoff's remark: finding the spaces between quotes is actually a ponderous way of doing a parallel prefix over XOR, which a mathematically adept person would have realized could be done with ``clmul`` by -1. Not me, I had to look it up: http://bitmath.blogspot.com.au/2016/11/parallel-prefixsuffix-operations.html.) A sketch of the prefix-XOR-by-``clmul`` trick follows this list.
- It is possible, though maybe unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512). A hedged sketch of a branch-reduced scalar decoder (not the SIMD variant) also follows this list.
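A sketch of the ``clmul`` trick mentioned above: a carry-less multiplication by an all-ones (-1) operand computes a parallel prefix over XOR, so the result marks the regions enclosed by quote characters. This assumes the PCLMUL extension, and the function name is illustrative:

```cpp
#include <immintrin.h>
#include <cstdint>

// Prefix XOR over the bits of quote_bits: bit i of the result is the XOR of
// bits 0..i, so stretches between quote characters come out filled with 1s.
// Carry-less multiplication by -1 (all ones) is exactly this parallel prefix.
// Compile with -mpclmul.
inline uint64_t prefix_xor(uint64_t quote_bits) {
    const __m128i all_ones = _mm_set1_epi8(static_cast<char>(0xFF));
    const __m128i input = _mm_set_epi64x(0, static_cast<long long>(quote_bits));
    const __m128i product = _mm_clmulepi64_si128(input, all_ones, 0);
    return static_cast<uint64_t>(_mm_cvtsi128_si64(product));
}
```

The immediate argument of 0 selects the low 64-bit lanes of both operands, so only the low halves are multiplied.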
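And a hedged sketch of the branch-reduced decoding idea from the linked post: write a fixed batch of candidate indexes per pass and advance by the popcount, so the inner loop does not branch on the bit pattern. Names are illustrative, and the output buffer needs a few spare slots:

```cpp
#include <cstdint>

// Decode the set bits of a 64-bit word into indexes, writing a fixed batch of
// 8 candidate slots per pass and advancing by the true popcount at the end.
// Slots written past the popcount hold junk and are simply ignored, so the
// caller must leave up to 7 spare slots at the end of the output buffer.
inline uint32_t* decode_word(uint32_t* out, uint64_t bits, uint32_t base) {
    const int count = __builtin_popcountll(bits);
    uint32_t* const start = out;
    while (bits != 0) {
        for (int i = 0; i < 8; i++) {
            // The |(1ULL << 63) guard keeps __builtin_ctzll defined once bits
            // reaches 0; the junk value it produces lands in an ignored slot.
            out[i] = base + static_cast<uint32_t>(__builtin_ctzll(bits | (1ULL << 63)));
            bits &= bits - 1; // clear the lowest set bit (no-op once bits is 0)
        }
        out += 8;
    }
    return start + count;
}
```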