Private research repo.

2018-03-23 00:05:32 -04:00 · 2018-03-23 00:05:32 -04:00 · bc1331283a
parent 800811d04f
commit bc1331283a
8 changed files with 66487 additions and 1 deletions
--- a/25
+++ b/25
@ -0,0 +1,25 @@
+
+.SUFFIXES:
+#
+.SUFFIXES: .cpp .o .c .h
+
+
+.PHONY: all clean cleandist
+
+CXXFLAGS = -std=c++11 -O2 -march=native -Wall -Wextra -Wshadow
+
+EXECUTABLES=parse
+
+
+
+all: $(EXECUTABLES)
+
+parse: main.cpp common_defs.h
+	$(CXX) $(CXXFLAGS) -o parse main.cpp
+
+
+clean:
+	rm -f $(EXECUTABLES)
+
+cleandist:
+	rm -f $(EXECUTABLES)
--- a/README.md
+++ b/README.md
@ -1 +1,86 @@
-# simdjson
+# simdjson
+
+A *research* library. The purpose of this repository is to support an eventual paper.
+
+Goal: Speed up the parsing of JSON per se. No materialization.
+
+Parsing gigabytes of JSON per second
+
+
+## Todo
+
+- Write unit tests
+- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives). Geoff wrote:
+> 44% of the time is in the tree construction. Of the remaining 56%, pretty much half is in the code to discover structural characters and half is in the naive code to flatten that out into a vector of u32 offsets (the 'iterate over set bits' problem).
+- Document the code better and make it easier to use
+- Add some measure of error handling (maybe optional)
+- Geoff wrote:
+> A future goal for Stage 3 will also be to thread together vectors or lists of similar structural elements (e.g. strings, objects, numbers, etc). A putative 'stage 4' will then be able to iterate in parallel fashion over these vectors (branchlessly, or at least without a pipeline-killing "Giant What I Am Doing Anyway" switch at the top) and transform them into more usable values. Some of this code is also inherently interesting (de-escaping strings, high speed atoi - an old favorite).
+- Geoff wrote:
+> I'm focusing on the tree construction at the moment. I think we can abstract the structural characters to 3 operations during that stage (UP, DOWN, SIDEWAYS), batch them, and build out tree structure in bulk with data-driven SIMD operations rather than messing around with branches. It's probably OK to have a table with 3^^5 or even 3^^6 entries, and it's still probably OK to have some hard cases eliminated and handled on a slow path (e.g. someone hits you with 6 scope closes in a row, forcing you to pop 24 bytes into your tree construction stack). (...) I spend some time with SIMD and found myself going in circles. There's an utterly fantastic solution in there somewhere involving turning the tree building code into a transformation over multiple abstracted symbols (e.g. UP/UP/SIDEWAYS/DOWN). It looked great until I smacked up against the fact that the oh-so-elegant solution I had sketched out had, on its critical path, an unaligned store followed by load partially overlapping with that store, to maintain the stack of 'up' pointers. Ugh. (...) So I worked through about 6 alternate solutions of various levels of pretentiousness none of which made me particularly happy.
+
+
+
+
+
+## References
+
+- https://tools.ietf.org/html/rfc7159
+- Mison: A Fast JSON Parser for Data Analytics http://www.vldb.org/pvldb/vol10/p1118-li.pdf They build the index and parse at the same time. The index allows them to parse faster... so they don't materialize the index.  The speculative part is hastily described but they do hint that this is where much of the benefit comes from...
+- The only public Mison implementation (in Rust): https://github.com/pikkr/pikkr
+- http://rapidjson.org/md_doc_sax.html
+
+
+Inspiring links:
+- https://auth0.com/blog/beating-json-performance-with-protobuf/
+- https://gist.github.com/shijuvar/25ad7de9505232c87034b8359543404a
+- https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-11.md
+
+## Remarks on JSON parsing
+
+- The JSON spec defines what a JSON parser is:
+> A JSON parser transforms a JSON text into another representation. A JSON parser MUST accept all texts that conform to the JSON grammar. A JSON parser MAY accept non-JSON forms or extensions. An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.
+
+- The  structural characters are:
+
+
+      begin-array     =  [ left square bracket
+      begin-object    =  { left curly bracket
+      end-array       =  ] right square bracket
+      end-object      =  } right curly bracket
+      name-separator  =  : colon
+      value-separator =  , comma
+
+- It might be useful/important to prune "insignificant" white space characters. Maybe.
+
+- Parsing and validating numbers fast could be interesting, but the spec allows many things. (Geoff wrote: "Parsing numbers (especially floating point ones!) and other atoms is fiddly but doable.")
+
+- Error handling is a nice problem. Of course, we have to define what we mean by an error and which error types the parser should be responsible for.  Here are a few things that could/should probably trigger an error:
+
+   - An array structure is represented as square brackets surrounding zero  or more values (or elements).  Elements are separated by commas.
+
+   - An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members).  A name is a string.  A single colon comes after each name, separating the name from the value.  A single comma separates a value from a following name.
+
+   - Values must be one of false / null / true / object / array / number / string
+
+   - A string begins and ends with  quotation marks.  All Unicode characters may be placed within the   quotation marks, except for the characters that must be escaped:   quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
+
+   - Regarding strings, Geoff wrote:
+   > For example, in Stage 2 ("string detection") we could validate that the only place we saw backslashes was in places we consider "inside strings".
+
+
+- Geoff's remark regarding the structure:
+> - next structural element
+> - prev structural element
+> - next structural element at the same level (i.e. skip over complex structures)
+> - prev structural element at the same level
+> - containing structural element ("up").
+
+
+
+## Remarks on the code
+
+- The use of ``bzero`` is discouraged: POSIX.1-2001 marked it as legacy and POSIX.1-2008 removed it, so ``memset`` is the portable choice.
+- Multiple bytes are allocated per input byte (a ``u32`` index plus a 12-byte ``JsonNode`` for every potential structural character), which could be a problem when processing a very large document. A more incremental approach might be needed in practice to limit memory usage. For really large documents, there might be caching issues as well.
+- The ``clmul`` thing is tricky but nice. (Geoff's remark:  find the spaces between quotes, is actually a ponderous way of doing parallel prefix over XOR, which a mathematically adept person would have realized could be done with clmul by -1. Not me, I had to look it up: http://bitmath.blogspot.com.au/2016/11/parallel-prefixsuffix-operations.html.)
+- It is possible, though unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512)
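The ``clmul`` remark above describes a parallel prefix over XOR. As a hedged illustration, the sketch below computes the same prefix XOR with plain shifts, so it runs without PCLMULQDQ support. `byte_mask` and `prefix_xor` are hypothetical helper names introduced here; they are not functions from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: bit i is set when byte i of s equals c
// (only the first 64 bytes are considered).
inline uint64_t byte_mask(const std::string &s, char c) {
    uint64_t m = 0;
    for (size_t i = 0; i < s.size() && i < 64; i++) {
        if (s[i] == c) m |= 1ULL << i;
    }
    return m;
}

// Prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// A carry-less multiply of x by an all-ones operand produces the
// same low 64 bits; this shift cascade is a portable stand-in.
inline uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For the input `a"bc"d` the quote bits are `0x12` and `prefix_xor` turns them into `0x0E`: the opening quote and the string body are marked, the closing quote is not. An unterminated quote leaves bit 63 set, which is the condition `main.cpp` carries into the next block through `prev_iter_inside_quote`.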
--- a/common_defs.h
+++ b/common_defs.h
@ -0,0 +1,53 @@
+#pragma once
+
+#include <cassert>  // assert() in ctz64
+#include <cstdint>  // uintptr_t in ISALIGNED_N
+
+typedef unsigned char u8;
+typedef unsigned short u16;
+typedef unsigned int u32;
+typedef unsigned long long u64;
+typedef signed char s8;
+typedef signed short s16;
+typedef signed int s32;
+typedef signed long long s64;
+
+#include <x86intrin.h>
+#include <immintrin.h>
+
+typedef __m128i m128;
+typedef __m256i m256;
+
+// Snippets from Hyperscan
+
+// Align to N-byte boundary
+#define ROUNDUP_N(a, n) (((a) + ((n)-1)) & ~((n)-1))
+#define ROUNDDOWN_N(a, n) ((a) & ~((n)-1))
+
+#define ISALIGNED_N(ptr, n) (((uintptr_t)(ptr) & ((n) - 1)) == 0)
+
+#define really_inline inline __attribute__ ((always_inline, unused))
+#define never_inline inline __attribute__ ((noinline, unused))
+
+#ifndef likely
+  #define likely(x)     __builtin_expect(!!(x), 1)
+#endif
+#ifndef unlikely
+  #define unlikely(x)   __builtin_expect(!!(x), 0)
+#endif
+
+static inline
+u32 ctz64(u64 x) {
+	assert(x); // behaviour not defined for x == 0
+#if defined(_WIN64)
+	unsigned long r;
+	_BitScanForward64(&r, x);
+	return r;
+#elif defined(_WIN32)
+	unsigned long r;
+	if (_BitScanForward(&r, (u32)x)) {
+		return (u32)r;
+	}
+	_BitScanForward(&r, (u32)(x >> 32));
+	return (u32)(r + 32);
+#else
+	return (u32)__builtin_ctzll(x);
+#endif
+}
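To make the helpers in this header concrete, here is a small sketch of how `ROUNDUP_N`-style rounding and a count-trailing-zeros loop behave. `round_up` and `set_bit_positions` are illustrative names introduced here, and `__builtin_ctzll` assumes GCC or Clang, as in the non-Windows branch of `ctz64`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same arithmetic as ROUNDUP_N: round a up to a multiple of n,
// where n must be a power of two.
constexpr size_t round_up(size_t a, size_t n) {
    return (a + (n - 1)) & ~(n - 1);
}

// Collect the positions of the set bits of one 64-bit word,
// offset by base. Each iteration finds the lowest set bit with
// a trailing-zero count, then clears it with bits &= bits - 1.
inline std::vector<uint32_t> set_bit_positions(uint64_t bits, uint32_t base) {
    std::vector<uint32_t> out;
    while (bits) {
        out.push_back(base + (uint32_t)__builtin_ctzll(bits));
        bits &= bits - 1;
    }
    return out;
}
```

The loop runs once per set bit rather than once per bit position, which is why the zero check in `ctz64` matters: the trailing-zero count is undefined for an input of 0.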
--- a/jsonexamples/canada.json
+++ b/jsonexamples/canada.json
--- a/jsonexamples/citm_catalog.json
+++ b/jsonexamples/citm_catalog.json
--- a/jsonexamples/demo.json
+++ b/jsonexamples/demo.json
@ -0,0 +1,15 @@
+{
+        "Image": {
+            "Width":  800,
+            "Height": 600,
+            "Title":  "View from 15th Floor",
+            "Thumbnail": {
+                "Url":    "http://www.example.com/image/481989943",
+                "Height": 125,
+                "Width":  100
+            },
+            "Animated" : false,
+            "IDs": [116, 943, 234, 38793]
+          }
+      }
+
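The demo document above contains all six structural characters listed in the README. As a rough scalar sketch, assuming the same 16-entry tables that `main.cpp` passes to its 'shufti' step, a byte is structural when the table entries selected by its low and high nibbles share a bit. `kLowNibble`, `kHighNibble` and `is_structural` are names introduced here for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// One 16-byte lane of low_nibble_mask and high_nibble_mask from main.cpp.
static const uint8_t kLowNibble[16]  = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1, 2, 1, 0, 0};
static const uint8_t kHighNibble[16] = {0, 0, 2, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0};

// Scalar equivalent of the two table lookups followed by AND.
inline bool is_structural(uint8_t c) {
    // The byte shuffle yields 0 for a lane whose index has its high bit
    // set, so bytes >= 0x80 can never match.
    if (c & 0x80) return false;
    return (kLowNibble[c & 0x0f] & kHighNibble[c >> 4]) != 0;
}
```

Each bit in the tables acts as a class label: bit 0 pairs low nibbles `b`/`d` with high nibbles `5`/`7` (the brackets and braces), bit 1 pairs `c` with `2` (comma), and bit 2 pairs `a` with `3` (colon).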
--- a/jsonexamples/twitter.json
+++ b/jsonexamples/twitter.json
--- a/main.cpp
+++ b/main.cpp
@ -0,0 +1,348 @@
+#include <iostream>
+#include <iomanip>
+#include <chrono>
+#include <fstream>
+#include <sstream>
+#include <string>
+#include <cstring>
+#include <vector>
+#include <set>
+#include <map>
+#include <algorithm>
+#include <x86intrin.h>
+#include <assert.h>
+#include <cstdlib>  // posix_memalign
+#include <cctype>   // isprint
+#include "common_defs.h"
+
+using namespace std;
+
+#define DEBUG
+
+#ifdef DEBUG
+inline void dump256(m256 d, string msg) {
+	for (u32 i = 0; i < 32; i++) {
+		cout << setw(3) << (int)*(((u8 *)(&d)) + i);
+        if (!((i+1)%8))
+            cout << "|";
+        else if (!((i+1)%4))
+            cout << ":";
+        else
+            cout << " ";
+	}
+    cout << " " << msg << "\n";
+}
+
+// dump bits low to high
+void dumpbits(u64 v, string msg) {
+	for (u32 i = 0; i < 64; i++) {
+        std::cout << (((v>>(u64)i) & 0x1ULL) ? "1" : "_");
+    }
+    cout << " " << msg << "\n";
+}
+#else
+#define dump256(a,b) ;
+#define dumpbits(a,b) ;
+#endif
+
+// get a corpus; pad out to a multiple of 64 bytes (one cache line) so we can always use SIMD
+pair<u8 *, size_t> get_corpus(string filename) {
+    ifstream is(filename, ios::binary);
+    if (is) {
+        stringstream buffer;
+        buffer << is.rdbuf();
+        const string contents = buffer.str(); // take a single copy of the file contents
+        size_t length = contents.size();
+        char * aligned_buffer;
+        if (posix_memalign( (void **)&aligned_buffer, 64, ROUNDUP_N(length, 64))) {
+            throw "Allocation failed";
+        }
+        // zero the whole buffer so the padding bytes are 0; memset replaces the legacy bzero
+        memset(aligned_buffer, 0, ROUNDUP_N(length, 64));
+        memcpy(aligned_buffer, contents.data(), length);
+        is.close();
+        return make_pair((u8 *)aligned_buffer, length);
+    }
+    throw "No corpus";
+    return make_pair((u8 *)0, (size_t)0);
+}
+
+struct JsonNode {
+    u32 up;
+    u32 next;
+    u32 prev;
+};
+
+struct ParsedJson {
+    u8 * structurals;
+    u32 n_structural_indexes;
+    u32 * structural_indexes;
+    JsonNode * nodes;
+};
+
+// a straightforward comparison of a mask against input. 5 uops; would be cheaper in AVX512.
+really_inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi, m256 mask) {
+    m256 cmp_res_0 = _mm256_cmpeq_epi8(input_lo, mask);
+    u64 res_0 = (u32)_mm256_movemask_epi8(cmp_res_0);
+    m256 cmp_res_1 = _mm256_cmpeq_epi8(input_hi, mask);
+    u64 res_1 = _mm256_movemask_epi8(cmp_res_1);
+    return res_0 | (res_1 << 32);
+}
+
+// note this one is limited to masks that are aiming to detect things that don't have
+// a high bit set. The 0x80 bit is *not* masked off when we PSHUFB against shufti_low_nibble,
+// and will force us to 0 - which is fine and we can save the operations
+really_inline u64 shufti_against_input(m256 input_lo, m256 input_hi, m256 shufti_low_nibble, m256 shufti_high_nibble) {
+        m256 v_lo = _mm256_and_si256(
+                        _mm256_shuffle_epi8(shufti_low_nibble, input_lo),
+                        _mm256_shuffle_epi8(shufti_high_nibble,
+                                           _mm256_and_si256(_mm256_srli_epi32(input_lo, 4), _mm256_set1_epi8(0x7f))));
+
+        m256 v_hi = _mm256_and_si256(
+                        _mm256_shuffle_epi8(shufti_low_nibble, input_hi),
+                        _mm256_shuffle_epi8(shufti_high_nibble,
+                                           _mm256_and_si256(_mm256_srli_epi32(input_hi, 4), _mm256_set1_epi8(0x7f))));
+        v_lo = _mm256_cmpeq_epi8(v_lo, _mm256_set1_epi8(0));
+        v_hi = _mm256_cmpeq_epi8(v_hi, _mm256_set1_epi8(0));
+        u64 res_0 = (u32)_mm256_movemask_epi8(v_lo);
+        u64 res_1 = _mm256_movemask_epi8(v_hi);
+        return ~(res_0 | (res_1 << 32));
+}
+
+
+never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson & pj) {
+    // Useful constant masks
+    const u64 even_bits = 0x5555555555555555ULL;
+    const u64 odd_bits = ~even_bits;
+
+    // for now, just work in 64-byte chunks
+    // we have padded the input out to 64 byte multiple with the remainder being zeros
+
+    // persistent state across loop
+    u64 prev_iter_ends_odd_backslash = 0ULL; // either 0 or 1, but a 64-bit value
+    u64 prev_iter_inside_quote = 0ULL; // either all zeros or all ones
+    //u64 prev_iter_inside_quote2 = 0ULL; // either all zeros or all ones
+    //m256 prev_iter_prefix_sum = _mm256_setzero_si256();
+
+    for (size_t idx = 0; idx < len; idx+=64) {
+#ifdef DEBUG
+        cout << "Idx is " << idx << "\n";
+        for (u32 j = 0; j < 64; j++) {
+            char c = *(buf+idx+j);
+            if (isprint(c)) {
+                cout << c;
+            } else {
+                cout << '_';
+            }
+        }
+        cout << "|  ... input\n";
+#endif
+        m256 input_lo = _mm256_load_si256((const m256 *)(buf + idx + 0));
+        m256 input_hi = _mm256_load_si256((const m256 *)(buf + idx + 32));
+
+        ////////////////////////////////////////////////////////////////////////////////////////////
+        //     Step 1: detect odd sequences of backslashes
+        ////////////////////////////////////////////////////////////////////////////////////////////
+
+        u64 bs_bits = cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('\\'));
+        dumpbits(bs_bits, "backslash bits");
+        u64 start_edges = bs_bits & ~(bs_bits << 1);
+        dumpbits(start_edges, "start_edges");
+
+        // flip lowest if we have an odd-length run at the end of the prior iteration
+        u64 even_start_mask = even_bits ^ prev_iter_ends_odd_backslash;
+        u64 even_starts = start_edges & even_start_mask;
+        u64 odd_starts = start_edges & ~even_start_mask;
+
+        dumpbits(even_starts, "even_starts");
+        dumpbits(odd_starts, "odd_starts");
+
+        u64 even_carries = bs_bits + even_starts;
+
+        u64 odd_carries;
+        // must record the carry-out of our odd-carries out of bit 63; this indicates whether the
+        // sense of any edge going to the next iteration should be flipped
+        bool iter_ends_odd_backslash = __builtin_uaddll_overflow(bs_bits, odd_starts, &odd_carries);
+
+        odd_carries |= prev_iter_ends_odd_backslash; // push in bit zero as a potential end
+                                                     // if we had an odd-numbered run at the end of
+                                                     // the previous iteration
+        prev_iter_ends_odd_backslash = iter_ends_odd_backslash ? 0x1ULL : 0x0ULL;
+
+        dumpbits(even_carries, "even_carries");
+        dumpbits(odd_carries, "odd_carries");
+
+        u64 even_carry_ends = even_carries & ~bs_bits;
+        u64 odd_carry_ends = odd_carries & ~bs_bits;
+        dumpbits(even_carry_ends, "even_carry_ends");
+        dumpbits(odd_carry_ends, "odd_carry_ends");
+
+        u64 even_start_odd_end = even_carry_ends & odd_bits;
+        u64 odd_start_even_end = odd_carry_ends & even_bits;
+        dumpbits(even_start_odd_end, "esoe");
+        dumpbits(odd_start_even_end, "osee");
+
+        u64 odd_ends = even_start_odd_end | odd_start_even_end;
+        dumpbits(odd_ends, "odd_ends");
+
+        ////////////////////////////////////////////////////////////////////////////////////////////
+        //     Step 2: detect insides of quote pairs
+        ////////////////////////////////////////////////////////////////////////////////////////////
+
+        u64 quote_bits = cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
+        quote_bits = quote_bits & ~odd_ends;
+        dumpbits(quote_bits, "quote_bits");
+        u64 quote_mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(_mm_set_epi64x(0ULL, quote_bits),
+                                                                _mm_set1_epi8(0xFF), 0));
+        quote_mask ^= prev_iter_inside_quote;
+        prev_iter_inside_quote = (u64)((s64)quote_mask>>63);
+        dumpbits(quote_mask, "quote_mask");
+
+        // How do we build up a user traversable data structure
+        // first, do a 'shufti' to detect structural JSON characters
+        // they are { 0x7b } 0x7d : 0x3a [ 0x5b ] 0x5d , 0x2c
+        const m256 low_nibble_mask = _mm256_setr_epi8(
+        //                                a  b  c  d
+            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1, 2, 1, 0, 0,
+            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1, 2, 1, 0, 0
+        );
+        const m256 high_nibble_mask = _mm256_setr_epi8(
+        //        2  3     5     7
+            0, 0, 2, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0,
+            0, 0, 2, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0
+        );
+
+        u64 structurals = shufti_against_input(input_lo, input_hi, low_nibble_mask, high_nibble_mask);
+        dumpbits(structurals, "structurals");
+
+        // mask off anything inside quotes
+        structurals &= ~quote_mask;
+
+        // add the real quote bits back into our bitmask as well, so we can
+        // quickly traverse the strings we've spent all this trouble gathering
+        structurals |= quote_bits;
+        dumpbits(structurals, "final structurals");
+        *(u64 *)(pj.structurals + idx/8) = structurals;
+
+        // fifth, we will then call a function that takes nothing more than the array of integers
+        // and our input and the parsed_json structure. Alternately, *this* function becomes the
+        // thing that generates that array of input.
+
+        // TODO: think about error handling
+        // TODO: think about 'streaming' - how to process as we go?
+    }
+    return true;
+}
+
+const u32 NUM_RESERVED_NODES = 2;
+const u32 DUMMY_NODE = 0;
+const u32 ROOT_NODE = 1; // distinct from DUMMY_NODE: two node slots are reserved
+
+// just transform the bitmask to a big list of 32-bit integers for now
+// that's all; the type of character under the gun (now this is : { } [ ] , ") - will
+// tell us exactly what we need to know. Naive but straightforward implementation
+never_inline bool flatten_indexes(size_t len, ParsedJson & pj) {
+    u32 base = NUM_RESERVED_NODES;
+    u32 * base_ptr = pj.structural_indexes;
+    base_ptr[DUMMY_NODE] = base_ptr[ROOT_NODE] = 0; // really shouldn't matter
+    for (size_t idx = 0; idx < len; idx+=64) {
+        u64 s = *(u64 *)(pj.structurals + idx/8);
+        while (s) {
+            u32 si = (u32)idx + __builtin_ctzll(s);
+#ifdef DEBUG
+            cout << "Putting structural index " << si << " at array location " << base << "\n";
+#endif
+            base_ptr[base++] = si;
+            s &= s - 1ULL;
+        }
+    }
+    pj.n_structural_indexes = base;
+    return true;
+}
+
+// Parse our json given a big array of 32-bit integers telling us where
+// the interesting stuff is
+
+never_inline bool json_parse(const u8 * buf, size_t len, ParsedJson & pj) {
+    u32 last; // index of previous structure at this level or 0 if none
+    u32 up; // index of structure that contains this one
+
+    JsonNode * nodes = pj.nodes;
+
+    JsonNode & dummy = nodes[DUMMY_NODE];
+    JsonNode & root = nodes[ROOT_NODE];
+    dummy.prev = dummy.up = DUMMY_NODE;
+    root.prev = DUMMY_NODE;
+    root.up = ROOT_NODE;
+    last = up = ROOT_NODE;
+
+    for (u32 i = NUM_RESERVED_NODES; i < pj.n_structural_indexes; i++) {
+        u32 idx = pj.structural_indexes[i];
+        JsonNode & n = nodes[i];
+        u8 c = buf[idx];
+        if (unlikely((c & 0xdf) == 0x5b)) { // meaning 7b or 5b, { or [
+            // open a scope
+            n.prev = last;
+            n.up = up;
+            up = i;
+            last = 0;
+        } else if (unlikely((c & 0xdf) == 0x5d)) { // meaning 7d or 5d, } or ]
+            // close a scope
+            n.prev = up;
+            n.up = pj.nodes[up].up;
+            up = pj.nodes[up].up;
+            last = i;
+        } else {
+            n.prev = last;
+            n.up = up;
+            last = i;
+        }
+        n.next = 0;
+        nodes[n.prev].next = i;
+    }
+    dummy.next = DUMMY_NODE; // dummy.next is a sump for meaningless 'nexts', clear it
+#ifdef DEBUG
+    for (u32 i = 0; i < pj.n_structural_indexes; i++) {
+        JsonNode & n = nodes[i];
+        cout << "i: " << i;
+        cout << " n.up: " << n.up;
+        cout << " n.next: " << n.next;
+        cout << " n.prev: " << n.prev;
+        // printing buf[pj.structural_indexes[i]] here caused problems (segfault), so it stays disabled
+        cout << "\n";
+    }
+#endif
+    return true;
+}
+
+int main(int argc, char * argv[]) {
+    if (argc < 2) {
+        cerr << "Usage: " << argv[0] << " <jsonfile>\n";
+        return 1;
+    }
+    pair<u8 *, size_t> p = get_corpus(argv[1]);
+    ParsedJson pj;
+
+    if (posix_memalign( (void **)&pj.structurals, 8, ROUNDUP_N(p.second, 64)/8)) {
+        throw "Allocation failed";
+    };
+
+    pj.n_structural_indexes = 0;
+    // we have potentially 1 structure per byte of input
+    // as well as a dummy structure and a root structure
+    u32 max_structures = ROUNDUP_N(p.second, 64) + 2;
+    pj.structural_indexes = new u32[max_structures];
+    pj.nodes = new JsonNode[max_structures];
+
+#ifdef DEBUG
+    const u32 iterations = 1;
+#else
+    const u32 iterations = 1000;
+#endif
+    vector<double> res;
+    res.resize(iterations);
+    for (u32 i = 0; i < iterations; i++) {
+        auto start = std::chrono::steady_clock::now();
+        find_structural_bits(p.first, p.second, pj);
+        flatten_indexes(p.second, pj);
+        json_parse(p.first, p.second, pj);
+        auto end = std::chrono::steady_clock::now();
+        std::chrono::duration<double> secs = end - start;
+        res[i] = secs.count();
+    }
+    double min_result = *min_element(res.begin(), res.end());
+    cout << "Min:  " << min_result << " bytes read: " << p.second  << " Gigabytes/second: " << (p.second) / (min_result * 1000000000.0) << "\n";
+    return 0;
+}
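Step 1 of `find_structural_bits` can be hard to follow from the bit dumps alone. The following is a minimal scalar sketch of the same arithmetic for a single 64-bit block, assuming no backslash run carries in from a previous block or out past bit 63. `odd_backslash_ends` is a name introduced here for illustration; the commit keeps this logic inline and also tracks the carry between blocks.

```cpp
#include <cassert>
#include <cstdint>

// Given a bitmask of backslash positions (bit i = byte i is '\\'),
// return a mask of the positions that immediately follow an
// odd-length run of backslashes, i.e. the escaped characters.
inline uint64_t odd_backslash_ends(uint64_t bs_bits) {
    const uint64_t even_bits = 0x5555555555555555ULL;
    const uint64_t odd_bits = ~even_bits;

    // First backslash of every run.
    uint64_t start_edges = bs_bits & ~(bs_bits << 1);
    uint64_t even_starts = start_edges & even_bits;
    uint64_t odd_starts = start_edges & odd_bits;

    // Adding a run's start bit to the run makes the carry ripple through
    // it and land on the first position after the run.
    uint64_t even_carries = bs_bits + even_starts;
    uint64_t odd_carries = bs_bits + odd_starts;
    uint64_t even_carry_ends = even_carries & ~bs_bits;
    uint64_t odd_carry_ends = odd_carries & ~bs_bits;

    // A run has odd length when it starts on an even position and ends
    // on an odd one, or the other way around.
    return (even_carry_ends & odd_bits) | (odd_carry_ends & even_bits);
}
```

For the two bytes `\"` the backslash mask is `0x1` and the result is `0x2`, so the quote at position 1 is escaped and would be cleared from `quote_bits`. For `\\"` the mask is `0x3` and the result is 0, so the quote stays a real string delimiter.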