flushing out

Daniel Lemire 2018-03-23 10:00:21 -04:00 committed by GitHub
parent bc1331283a
commit e6e3f42491
1 changed files with 26 additions and 12 deletions


@@ -1,27 +1,41 @@
# simdjson
# simdjson : Parsing gigabytes of JSON per second
A *research* library. The purpose of this repository is to support an eventual paper.
A *research* library. The purpose of this repository is to support an eventual paper. The immediate goal is not to produce a library that would be used in production systems. Of course, the end game is, indeed, to have an impact on production systems.
Goal: Speed up the parsing of JSON per se. No materialization.
Parsing gigabytes of JSON per second
## Architecture
The parser works in three stages:
## Todo
- Stage 1. Quickly identifies structural elements, strings, and so forth. Currently, there is no validation (JSON is assumed to be correct).
- Stage 2. "Flattens" the data from stage 1, that is, converts bitsets into arrays of indexes.
- Stage 3. (Structure building) Constructs a "tree" of sorts to navigate through the data.
- Stage 4. (Currently unimplemented) Iterates through the structure without "stalling" (fighting back against latency).
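
The stage 2 "flattening" step (turning a bitset into an array of indexes) can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's code: `flatten_bits`, `base`, and `out` are hypothetical names, and `__builtin_ctzll` assumes GCC or Clang.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not taken from the simdjson source).
// Appends the position of every set bit of `bits` to `out`, in increasing
// order. `base` is the offset of this 64-bit word within the input, so the
// stored values are absolute indexes.
inline void flatten_bits(std::uint64_t bits, std::uint32_t base,
                         std::vector<std::uint32_t> &out) {
  while (bits != 0) {
    // Count trailing zeros: index of the lowest set bit (GCC/Clang builtin).
    out.push_back(base + static_cast<std::uint32_t>(__builtin_ctzll(bits)));
    // Clear the lowest set bit and continue with the next one.
    bits &= bits - 1;
  }
}
```

Calling `flatten_bits` once per 64-byte block of input, with `base` advanced by 64 each time, would yield one flat vector of offsets. The loop runs once per set bit, which is the "iterate over set bits" cost discussed elsewhere in this README.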
## Todo (Priority)
Geoff is unhappy with stage 3. He writes:
- Write unit tests
- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives). Geoff wrote:
> 44% of the time is in the tree construction. Of the remaining 56%, pretty much half is in the code to discover structural characters and half is in the naive code to flatten that out into a vector of u32 offsets (the 'iterate over set bits' problem).
- Document the code better and make it easier to use
- Add some measure of error handling (maybe optional)
- Geoff wrote:
> A future goal for Stage 3 will also be to thread together vectors or lists of similar structural elements (e.g. strings, objects, numbers, etc). A putative 'stage 4' will then be able to iterate in parallel fashion over these vectors (branchlessly, or at least without a pipeline-killing "Giant What I Am Doing Anyway" switch at the top) and transform them into more usable values. Some of this code is also inherently interesting (de-escaping strings, high speed atoi - an old favorite).
- Geoff wrote:
> I'm focusing on the tree construction at the moment. I think we can abstract the structural characters to 3 operations during that stage (UP, DOWN, SIDEWAYS), batch them, and build out tree structure in bulk with data-driven SIMD operations rather than messing around with branches. It's probably OK to have a table with 3^^5 or even 3^^6 entries, and it's still probably OK to have some hard cases eliminated and handled on a slow path (e.g. someone hits you with 6 scope closes in a row, forcing you to pop 24 bytes into your tree construction stack). (...) I spend some time with SIMD and found myself going in circles. There's an utterly fantastic solution in there somewhere involving turning the tree building code into a transformation over multiple abstracted symbols (e.g. UP/UP/SIDEWAYS/DOWN). It looked great until I smacked up against the fact that the oh-so-elegant solution I had sketched out had, on its critical path, an unaligned store followed by load partially overlapping with that store, to maintain the stack of 'up' pointers. Ugh. (...) So I worked through about 6 alternate solutions of various levels of pretentiousness none of which made me particularly happy.
> The structure building is too slow for my taste. I'm not sure I want too much richer functionality in it. Many of the transformations which I would like to do seem better done on the tree (i.e. pruning out every second " character). But this code takes 44% of our time; it's outrageous.
Of course, stage 4 is totally unimplemented so it might be a priority as well:
> A future goal for Stage 3 will also be to thread together vectors or lists of similar structural elements (e.g. strings, objects, numbers, etc). A putative 'stage 4' will then be able to iterate in parallel fashion over these vectors (branchlessly, or at least without a pipeline-killing "Giant What I Am Doing Anyway" switch at the top) and transform them into more usable values. Some of this code is also inherently interesting (de-escaping strings, high speed atoi - an old favorite).
## Todo (Secondary)
- Build up a paper (use overleaf.com)
- Write unit tests
- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives).
- Document the code better and make it easier to use
- Add some measure of error handling (maybe optional)
## References
@@ -83,4 +97,4 @@ containing structural element ("up").
- It seems that the use of ``bzero`` is discouraged.
- Multiple bytes are allocated per input byte, which could be a problem when processing a very large document; in practice, one might want to process the input more incrementally to reduce memory usage. For really large documents, there might be caching issues as well.
- The ``clmul`` thing is tricky but nice. (Geoff's remark: find the spaces between quotes, is actually a ponderous way of doing parallel prefix over XOR, which a mathematically adept person would have realized could be done with clmul by -1. Not me, I had to look it up: http://bitmath.blogspot.com.au/2016/11/parallel-prefixsuffix-operations.html.)
- It is possible, though unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512)
- It is possible, though maybe unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512)
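
To illustrate the ``clmul`` remark above: a parallel prefix over XOR sets bit *i* of the result to the XOR of bits 0 through *i* of the input, which matches the low 64 bits of a carry-less multiplication by an all-ones word. The sketch below uses a portable shift-and-XOR formulation instead of the carry-less multiply instruction, so it is a stand-in for the technique rather than the code used in this repository; `prefix_xor` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper (not taken from the simdjson source).
// Portable prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// A carry-less multiply of x by an all-ones word gives the same low 64 bits;
// this version uses log2(64) = 6 shift/XOR steps instead.
inline std::uint64_t prefix_xor(std::uint64_t x) {
  x ^= x << 1;
  x ^= x << 2;
  x ^= x << 4;
  x ^= x << 8;
  x ^= x << 16;
  x ^= x << 32;
  return x;
}
```

If `x` has one bit set per unescaped quote character, the result marks the bytes from each opening quote up to, but not including, the matching closing quote. For example, quotes at positions 2 and 5 (`0x24`) give a mask covering positions 2 through 4 (`0x1C`).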