Merge branch 'master' of https://github.com/lemire/simdjson

2018-04-14 21:47:41 +10:00 · 2018-04-14 21:47:41 +10:00 · 694942e3cd
parent 03799855df 46d55fa6ce
commit 694942e3cd
16 changed files with 561 additions and 100 deletions
--- a/README.md
+++ b/README.md
@ -57,7 +57,7 @@ Of course, stage 4 is totally unimplemented so it might be a priority as well:
 > Using this parallel bit stream approach, the vast majority of conditional branches used to identify key positions and/or syntax errors at each parsing position are mostly eliminated, which, as Section 6.2 shows, minimizes branch misprediction penalties. Accurate parsing and parallel lexical analysis is done through processor-friendly equations that require neither speculation nor multithreading.

 - Deshmukh, V. M., and G. R. Bamnote. "An empirical evaluation of optimization parameters in XML parsing for performance enhancement." Computer, Communication and Control (IC4), 2015 International Conference on. IEEE, 2015.
-APA	
+APA


 - Moussalli, Roger, et al. "Efficient XML Path Filtering Using GPUs." ADMS@ VLDB. 2011.
@ -97,6 +97,10 @@ APA
 - http://rapidjson.org/md_doc_sax.html
 - https://github.com/Geal/parser_benchmarks/tree/master/json
 - Gron: A command line tool that makes JSON greppable https://news.ycombinator.com/item?id=16727665
+- GoogleGson https://github.com/google/gson
+- Jackson https://github.com/FasterXML/jackson
+- https://www.yelp.com/dataset_challenge
+- RapidJSON. http://rapidjson.org/

 Inspiring links:
 - https://auth0.com/blog/beating-json-performance-with-protobuf/
@ -108,6 +112,11 @@ Inspiring links:
 - The JSON spec defines what a JSON parser is:
 >  A JSON parser transforms a JSON text into another representation.  A JSON parser MUST accept all texts that conform to the JSON grammar.  A JSON parser MAY accept non-JSON forms or extensions. An implementation may set limits on the size of texts that it accepts.  An implementation may set limits on the maximum depth of nesting.  An implementation may set limits on the range and precision of numbers.  An implementation may set limits on the length and character contents of strings."

+
+- JSON is not JavaScript:
+
+> All JSON is Javascript but NOT all Javascript is JSON. So {property:1} is invalid because property does not have double quotes around it. {'property':1} is also invalid, because it's single quoted while the only thing that can placate the JSON specification is double quoting. JSON is even fussy enough that {"property":.1} is invalid too, because you should have of course written {"property":0.1}. Also, don't even think about having comments or semicolons, you guessed it: they're invalid. (credit:https://github.com/elzr/vim-json)
+
 - The  structural characters are:


@ -130,7 +139,7 @@ Inspiring links:

   - Values must be one of false / null / true / object / array / number / string

-   - A string begins and ends with  quotation marks.  All Unicode characters may be placed within the   quotation marks, except for the characters that must be escaped:   quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). [Decoding UTF-8 is fun](https://github.com/skeeto/branchless-utf8/blob/master/utf8.h).
+   - A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). We can probably safely assume that strings are in UTF-8. [Decoding UTF-8 is fun](https://github.com/skeeto/branchless-utf8/blob/master/utf8.h). However, any character may be escaped in a JSON string, so unescaping might still be required. It may be possible to check quickly whether a given string contains any escapes.

   - Regarding strings, Geoff wrote:
   > For example, in Stage 2 ("string detection") we could validate that the only place we saw backslashes was in places we consider "inside strings".
@ -144,6 +153,19 @@ prev structural element at the same level
 containing structural element ("up").


+### Pseudo-structural elements
+
+A character is pseudo-structural if and only if:
+
+1. It is not enclosed in quotes, AND
+2. It is a non-whitespace character, AND
+3. Its preceding character is either:
+(a) a structural character, OR
+(b) whitespace.
+
+This lets us treat some additional characters as pseudo-structural, such as the characters 1, 1, G, n in the following:
+
+> { "foo" : 1.5, "bar" : 1.5   GEOFF_IS_A_DUMMY bla bla , "baz", null } 
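
Note on the pseudo-structural rule: a scalar reference makes the three conditions easy to check by hand. The sketch below is not part of the commit. `pseudo_structural_positions` and its helpers are hypothetical names, both quote characters are treated as part of the quoted region, backslash escapes are ignored, and only characters that are not already structural are reported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not functions from this repository.
static bool is_structural(char c) {
    return c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',';
}

static bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Returns the offsets of characters that satisfy conditions 1-3 and are not
// themselves structural. Assumption: escaped quotes (\") do not occur.
std::vector<size_t> pseudo_structural_positions(const std::string & s) {
    std::vector<size_t> out;
    bool in_quotes = false;
    bool prev_opens = true; // start of input is treated like whitespace
    for (size_t i = 0; i < s.size(); i++) {
        const char c = s[i];
        if (c == '"') {          // quote characters count as quoted content
            in_quotes = !in_quotes;
            prev_opens = false;
            continue;
        }
        if (in_quotes) continue; // condition 1
        if (!is_whitespace(c) && !is_structural(c) && prev_opens) {
            out.push_back(i);    // conditions 2 and 3
        }
        prev_opens = is_whitespace(c) || is_structural(c);
    }
    return out;
}
```

Read literally, the rule flags the characters `1`, `1`, `G`, `b`, `b`, `n` in the README example: the two `b`s of `bla bla` also follow whitespace.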

 ## Remarks on the code

@ -152,6 +174,6 @@ containing structural element ("up").
 - The ``clmul`` thing is tricky but nice. (Geoff's remark:  find the spaces between quotes, is actually a ponderous way of doing parallel prefix over XOR, which a mathematically adept person would have realized could be done with clmul by -1. Not me, I had to look it up: http://bitmath.blogspot.com.au/2016/11/parallel-prefixsuffix-operations.html.)
 - It is possible, though maybe unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512)

-## Future work 
+## Future work

 Long term we should keep in mind the idea that what would be cool is a method to extract something like this code from an abstract description of something closer to a grammar.
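
Note on Geoff's `clmul` remark above: the same prefix XOR can be written without intrinsics, which helps when checking the SIMD path. This is a minimal sketch under stated assumptions, not code from the commit. `prefix_xor` and `quote_mask` are hypothetical names, and the shift-and-XOR ladder stands in for the carry-less multiply by all-ones.

```cpp
#include <cassert>
#include <cstdint>

// Bit i of the result is the XOR of bits 0..i of x. This equals the low
// 64 bits of a carry-less multiplication of x by an all-ones operand.
uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

// quote_bits has a 1 at every unescaped quote in a 64-byte chunk.
// prev_inside_quote is all zeros or all ones and carries the "still inside
// a string" state from the previous chunk, as in the diff.
uint64_t quote_mask(uint64_t quote_bits, uint64_t & prev_inside_quote) {
    uint64_t mask = prefix_xor(quote_bits) ^ prev_inside_quote;
    // Broadcast the top bit to all 64 bits without a signed shift.
    prev_inside_quote = uint64_t(0) - (mask >> 63);
    return mask;
}
```

With quotes at bits 2 and 6, `prefix_xor` returns bits 2 through 5 set: the opening quote and the string body are covered, the closing quote is not.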
--- a/jsonexamples/adversarial.json
+++ b/jsonexamples/adversarial.json
@ -1,9 +1,9 @@
 {
  "\"Name rue": [
-    116,
+    [ 116,
    "\"",
    234,
    "true",
-    false
+    false ]
  ]
 }
--- a/main.cpp
+++ b/main.cpp
@ -12,7 +12,7 @@
 #include <x86intrin.h>
 #include <assert.h>
 #include "common_defs.h"
- 
+
 using namespace std;

 //#define DEBUG
@ -31,7 +31,7 @@ inline void dump256(m256 d, string msg) {
    cout << " " << msg << "\n";
 }

-// dump bits low to high 
+// dump bits low to high
 void dumpbits(u64 v, string msg) {
 	for (u32 i = 0; i < 64; i++) {
        std::cout << (((v>>(u64)i) & 0x1ULL) ? "1" : "_");
@ -55,7 +55,7 @@ ifstream is(filename, ios::binary);
            throw "Allocation failed";
        };
        memset(aligned_buffer, 0x20, ROUNDUP_N(length, 64));
-        memcpy(aligned_buffer, buffer.str().c_str(), length); 
+        memcpy(aligned_buffer, buffer.str().c_str(), length);
        is.close();
        return make_pair((u8 *)aligned_buffer, length);
    }
@ -88,14 +88,14 @@ really_inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi, m256 mask
 never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson & pj) {
    // Useful constant masks
    const u64 even_bits = 0x5555555555555555ULL;
-    const u64 odd_bits = ~even_bits; 
+    const u64 odd_bits = ~even_bits;

    // for now, just work in 64-byte chunks
    // we have padded the input out to a 64-byte multiple, with the remainder being spaces (0x20)

    // persistent state across loop
    u64 prev_iter_ends_odd_backslash = 0ULL; // either 0 or 1, but a 64-bit value
-    u64 prev_iter_inside_quote = 0ULL; // either all zeros or all ones 
+    u64 prev_iter_inside_quote = 0ULL; // either all zeros or all ones
    u64 prev_iter_pseudo_structural_carry = 0ULL;

    for (size_t idx = 0; idx < len; idx+=64) {
@ -108,7 +108,7 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &
            } else {
                cout << '_';
            }
-        }   
+        }
        cout << "|  ... input\n";
 #endif
        m256 input_lo = _mm256_load_si256((const m256 *)(buf + idx + 0));
@ -126,11 +126,11 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &
        // flip lowest if we have an odd-length run at the end of the prior iteration
        u64 even_start_mask = even_bits ^ prev_iter_ends_odd_backslash;
        u64 even_starts = start_edges & even_start_mask;
-        u64 odd_starts = start_edges & ~even_start_mask; 
+        u64 odd_starts = start_edges & ~even_start_mask;

        dumpbits(even_starts, "even_starts");
        dumpbits(odd_starts, "odd_starts");
-        
+
        u64 even_carries = bs_bits + even_starts;

        u64 odd_carries;
@ -158,9 +158,9 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &

        u64 odd_ends = even_start_odd_end | odd_start_even_end;
        dumpbits(odd_ends, "odd_ends");
-    
+
        ////////////////////////////////////////////////////////////////////////////////////////////
-        //     Step 2: detect insides of quote pairs 
+        //     Step 2: detect insides of quote pairs
        ////////////////////////////////////////////////////////////////////////////////////////////

        u64 quote_bits = cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
@ -227,7 +227,7 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &

        // mask off anything inside quotes
        structurals &= ~quote_mask;
-        
+
        // whitespace inside our quotes also doesn't count; otherwise "    foo" would generate a spurious
        // pseudo-structural-character at 'foo'
        whitespace &= ~quote_mask;
@ -245,9 +245,9 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &

        // Slightly more painful than it would seem. It's possible that either structurals or whitespace are
        // all 1s (e.g. {{{{{{{....{{{{x64, or a really long whitespace). As such there is no safe place
-        // to add a '1' from the previous iteration without *that* triggering the carry we are looking 
+        // to add a '1' from the previous iteration without *that* triggering the carry we are looking
        // out for, so we must check both carries for overflow
-        
+
        u64 tmp = structurals | whitespace;
        u64 tmp2;
        bool ps_carry = __builtin_uaddll_overflow(tmp, structurals, &tmp2);
@ -260,7 +260,7 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &
        tmp3 &= ~whitespace;
        dumpbits(tmp3, "pseudo_structural add calculation without quotes and whitespace");
        dumpbits(structurals, "final structurals without quotes");
-        structurals |= tmp3;       
+        structurals |= tmp3;
        dumpbits(structurals, "final structurals and pseudo structurals");

        *(u64 *)(pj.structurals + idx/8) = structurals;
@ -528,6 +528,55 @@ never_inline bool ape_machine(const u8 * buf, UNUSED size_t len, ParsedJson & pj
    return true;
 }

+// https://stackoverflow.com/questions/2616906/how-do-i-output-coloured-text-to-a-linux-terminal
+namespace Color {
+    enum Code {
+        FG_DEFAULT = 39, FG_BLACK = 30, FG_RED = 31, FG_GREEN = 32,
+        FG_YELLOW = 33, FG_BLUE = 34, FG_MAGENTA = 35, FG_CYAN = 36,
+        FG_LIGHT_GRAY = 37, FG_DARK_GRAY = 90, FG_LIGHT_RED = 91,
+        FG_LIGHT_GREEN = 92, FG_LIGHT_YELLOW = 93, FG_LIGHT_BLUE = 94,
+        FG_LIGHT_MAGENTA = 95, FG_LIGHT_CYAN = 96, FG_WHITE = 97,
+        BG_RED = 41, BG_GREEN = 42, BG_BLUE = 44, BG_DEFAULT = 49
+    };
+    class Modifier {
+        Code code;
+    public:
+        Modifier(Code pCode) : code(pCode) {}
+        friend std::ostream&
+        operator<<(std::ostream& os, const Modifier& mod) {
+            return os << "\033[" << mod.code << "m";
+        }
+    };
+}
+
+void colorfuldisplay(ParsedJson & pj, const u8 * buf) {
+    Color::Modifier greenfg(Color::FG_GREEN);
+    Color::Modifier yellowfg(Color::FG_YELLOW);
+    Color::Modifier deffg(Color::FG_DEFAULT);
+    size_t i = 0;
+    // skip initial fluff
+    while((i+1< pj.n_structural_indexes) && (pj.structural_indexes[i]==pj.structural_indexes[i+1])){
+      i++;
+    }
+    for (; i < pj.n_structural_indexes; i++) {
+        u32 idx = pj.structural_indexes[i];
+        u8 c = buf[idx];
+        if (((c & 0xdf) == 0x5b)) { // meaning 7b or 5b, { or [
+            std::cout << greenfg <<  buf[idx] << deffg;
+        } else if (((c & 0xdf) == 0x5d)) { // meaning 7d or 5d, } or ]
+            std::cout << greenfg <<  buf[idx] << deffg;
+        } else {
+            std::cout << yellowfg <<  buf[idx] << deffg;
+        }
+        if(i + 1 < pj.n_structural_indexes) {
+          u32 nextidx = pj.structural_indexes[i + 1];
+          for(u32 pos = idx + 1 ; pos < nextidx; pos++) {
+            std::cout << buf[pos];
+          }
+        }
+    }
+    std::cout << std::endl;
+}
 int main(int argc, char * argv[]) {
    if (argc != 2) {
        cerr << "Usage: " << argv[0] << " <jsonfile>\n";
@ -546,7 +595,7 @@ int main(int argc, char * argv[]) {
    // we have potentially 1 structure per byte of input
    // as well as a dummy structure and a root structure
    // we also potentially write up to 7 iterations beyond
-    // in our 'cheesy flatten', so make some worst-case 
+    // in our 'cheesy flatten', so make some worst-case
    // space for that too
    u32 max_structures = ROUNDUP_N(p.second, 64) + 2 + 7;
    pj.structural_indexes = new u32[max_structures];
@ -568,6 +617,7 @@ int main(int argc, char * argv[]) {
        std::chrono::duration<double> secs = end - start;
        res[i] = secs.count();
    }
+    colorfuldisplay(pj, p.first);
 	double min_result = *min_element(res.begin(), res.end());
 	cout << "Min:  " << min_result << " bytes read: " << p.second  << " Gigabytes/second: " << (p.second) / (min_result * 1000000000.0) << "\n";
    return 0;
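
Note on the backslash handling that feeds Step 2 in `find_structural_bits`: a scalar loop gives a reference to compare `odd_ends` against. The sketch below is not from the commit. `escaped_positions` is a hypothetical name, it assumes `odd_ends` marks the character right after an odd-length run of backslashes, and it ignores runs that straddle a 64-byte boundary (the `prev_iter_ends_odd_backslash` case).

```cpp
#include <cassert>
#include <cstdint>

// bs_bits has a 1 at every backslash in a 64-byte chunk.
// The result has a 1 at every position that directly follows an odd-length
// run of backslashes, i.e. at every escaped character.
uint64_t escaped_positions(uint64_t bs_bits) {
    uint64_t out = 0;
    unsigned run = 0; // length of the backslash run ending just before bit i
    for (unsigned i = 0; i < 64; i++) {
        if ((bs_bits >> i) & 1ULL) {
            run++;
        } else {
            if (run & 1U) out |= 1ULL << i;
            run = 0;
        }
    }
    return out;
}
```

A quote is then a real string delimiter only if its bit is clear in this mask, which is what `quote_bits & ~odd_ends` expresses in the diff.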
--- a/scalarvssimd/Makefile
+++ b/scalarvssimd/Makefile
@ -0,0 +1,6 @@
+HEADERS:=include/avxprocessing.h    include/benchmark.h        include/common_defs.h      include/jsonstruct.h       include/scalarprocessing.h include/util.h
+bench: benchmarks/bench.cpp $(HEADERS)
+	$(CXX) -std=c++11 -O3 -o $@ benchmarks/bench.cpp -Iinclude  -march=native -lm -Wall -Wextra
+
+clean:
+	rm -f bench
--- a/scalarvssimd/README.md
+++ b/scalarvssimd/README.md
@ -0,0 +1,3 @@
+```
+./run.sh
+```
--- a/scalarvssimd/benchmarks/bench.cpp
+++ b/scalarvssimd/benchmarks/bench.cpp
@ -0,0 +1,61 @@
+#include "jsonstruct.h"
+
+#include "scalarprocessing.h"
+#include "avxprocessing.h"
+#include "benchmark.h"
+#include "util.h"
+
+//colorfuldisplay(ParsedJson & pj, const u8 * buf)
+//BEST_TIME_NOCHECK(dividearray32(array, N), , repeat,  N, timings,true);
+
+
+int main(int argc, char * argv[]) {
+    if (argc < 2) {
+        cerr << "Usage: " << argv[0] << " <jsonfile>\n";
+        cerr << "Or " << argv[0] << " -v <jsonfile>\n";
+        exit(1);
+    }
+    bool verbose = false;
+    if (argc > 2) {
+      if(strcmp(argv[1],"-v") == 0) verbose = true; // strcmp returns 0 on a match
+    }
+    pair<u8 *, size_t> p = get_corpus(argv[argc - 1]);
+    ParsedJson pj;
+    std::cout << "Input has ";
+    if(p.second > 1024 * 1024)
+      std::cout << p.second / (1024*1024) << " MB ";
+    else if (p.second > 1024)
+      std::cout << p.second / 1024 << " KB ";
+    else 
+      std::cout << p.second << " B ";
+    std::cout << std::endl;
+
+    if (posix_memalign( (void **)&pj.structurals, 8, ROUNDUP_N(p.second, 64)/8)) {
+        throw "Allocation failed";
+    };
+
+    pj.n_structural_indexes = 0;
+    // we have potentially 1 structure per byte of input
+    // as well as a dummy structure and a root structure
+    // we also potentially write up to 7 iterations beyond
+    // in our 'cheesy flatten', so make some worst-case
+    // space for that too
+    u32 max_structures = ROUNDUP_N(p.second, 64) + 2 + 7;
+    pj.structural_indexes = new u32[max_structures];
+    pj.nodes = new JsonNode[max_structures];
+    if(verbose) {
+      std::cout << "Parsing SIMD (once) " << std::endl;
+      avx_json_parse(p.first,  p.second, pj);
+      colorfuldisplay(pj, p.first);
+      debugdisplay(pj,p.first);
+      std::cout << "Parsing scalar (once) " << std::endl;
+      scalar_json_parse(p.first,  p.second, pj);
+      colorfuldisplay(pj, p.first);
+      debugdisplay(pj,p.first);
+    }
+    int repeat = 10;
+    int volume = p.second;
+    BEST_TIME_NOCHECK(avx_json_parse(p.first,  p.second, pj), , repeat, volume, true);
+    BEST_TIME_NOCHECK(scalar_json_parse(p.first,  p.second, pj), , repeat, volume, true);
+
+}
--- a/scalarvssimd/demo.cpp
+++ b/scalarvssimd/demo.cpp
--- a/scalarvssimd/include/avxprocessing.h
+++ b/scalarvssimd/include/avxprocessing.h
@ -18,7 +18,7 @@ using namespace std;


 // a straightforward comparison of a mask against input. 5 uops; would be cheaper in AVX512.
-really_inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi, m256 mask) {
+static inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi, m256 mask) {
    m256 cmp_res_0 = _mm256_cmpeq_epi8(input_lo, mask);
    u64 res_0 = (u32)_mm256_movemask_epi8(cmp_res_0);
    m256 cmp_res_1 = _mm256_cmpeq_epi8(input_hi, mask);
@ -26,7 +26,7 @@ really_inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi, m256 mask
    return res_0 | (res_1 << 32);
 }

-never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson & pj) {
+static bool find_structural_bits(const u8 * buf, size_t len, ParsedJson & pj) {
    // Useful constant masks
    const u64 even_bits = 0x5555555555555555ULL;
    const u64 odd_bits = ~even_bits;
@ -81,12 +81,10 @@ never_inline bool find_structural_bits(const u8 * buf, size_t len, ParsedJson &

        u64 quote_bits = cmp_mask_against_input(input_lo, input_hi, _mm256_set1_epi8('"'));
        quote_bits = quote_bits & ~odd_ends;
-        dumpbits(quote_bits, "quote_bits");
        u64 quote_mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(_mm_set_epi64x(0ULL, quote_bits),
                                                                _mm_set1_epi8(0xFF), 0));
        quote_mask ^= prev_iter_inside_quote;
        prev_iter_inside_quote = (u64)((s64)quote_mask>>63);
-        dumpbits(quote_mask, "quote_mask");

        // How do we build up a user traversable data structure
        // first, do a 'shufti' to detect structural JSON characters
@ -184,17 +182,31 @@ const u32 ROOT_NODE = 1;
 // just transform the bitmask to a big list of 32-bit integers for now
 // that's all; the type of character the offset points to will
 // tell us exactly what we need to know. Naive but straightforward implementation
-never_inline bool flatten_indexes(size_t len, ParsedJson & pj) {
+static bool flatten_indexes(size_t len, ParsedJson & pj) {
    u32 base = NUM_RESERVED_NODES;
    u32 * base_ptr = pj.structural_indexes;
    base_ptr[DUMMY_NODE] = base_ptr[ROOT_NODE] = 0; // really shouldn't matter
    for (size_t idx = 0; idx < len; idx+=64) {
        u64 s = *(u64 *)(pj.structurals + idx/8);
+        u32 cnt = __builtin_popcountll(s);
+        u32 next_base = base + cnt;
        while (s) {
-            u32 si = (u32)idx + __builtin_ctzll(s);
-            base_ptr[base++] = si;
-            s &= s - 1ULL;
+            // spoil the suspense
+            u64 s3 = _pdep_u64(~0x7ULL, s); // s3 will have bottom 3 1-bits unset
+            u64 s5 = _pdep_u64(~0x1fULL, s); // s5 will have bottom 5 1-bits unset
+
+            base_ptr[base+0] = (u32)idx + __builtin_ctzll(s);  u64 s1 = s  & (s  - 1ULL);
+            base_ptr[base+1] = (u32)idx + __builtin_ctzll(s1); u64 s2 = s1 & (s1 - 1ULL);
+            base_ptr[base+2] = (u32)idx + __builtin_ctzll(s2); //u64 s3 = s2 & (s2 - 1ULL);
+            base_ptr[base+3] = (u32)idx + __builtin_ctzll(s3); u64 s4 = s3 & (s3 - 1ULL);
+
+            base_ptr[base+4] = (u32)idx + __builtin_ctzll(s4); //u64 s5 = s4 & (s4 - 1ULL);
+            base_ptr[base+5] = (u32)idx + __builtin_ctzll(s5); u64 s6 = s5 & (s5 - 1ULL);
+            base_ptr[base+6] = (u32)idx + __builtin_ctzll(s6); u64 s7 = s6 & (s6 - 1ULL);
+            s = s7;
+            base += 7;
        }
+        base = next_base;
    }
    pj.n_structural_indexes = base;
    return true;
@ -202,7 +214,7 @@ never_inline bool flatten_indexes(size_t len, ParsedJson & pj) {

 // Parse our json given a big array of 32-bit integers telling us where
 // the interesting stuff is
-bool avx_json_parse(const u8 * buf, UNUSED size_t len, ParsedJson & pj) {
+static bool json_parse(const u8 * buf, UNUSED size_t len, ParsedJson & pj) {
    u32 last; // index of previous structure at this level or 0 if none
    u32 up; // index of structure that contains this one

@ -240,16 +252,13 @@ bool avx_json_parse(const u8 * buf, UNUSED size_t len, ParsedJson & pj) {
        nodes[n.prev].next = i;
    }
    dummy.next = DUMMY_NODE; // dummy.next is a sump for meaningless 'nexts', clear it
-#ifdef DEBUG
-    for (u32 i = 0; i < pj.n_structural_indexes; i++) {
-        u32 idx = pj.structural_indexes[i];
-        JsonNode & n = nodes[i];
-        cout << "i: " << i;
-        cout << " n.up: " << n.up;
-        cout << " n.next: " << n.next;
-        cout << " n.prev: " << n.prev;
-        cout << " idx: " << idx << " buf[idx] " << buf[idx] << "\n";
-    }
-#endif
    return true;
 }
+
+
+static bool avx_json_parse(const u8 * buf,  size_t len, ParsedJson & pj) {
+          find_structural_bits(buf, len, pj);
+          flatten_indexes(len, pj);
+          json_parse(buf, len, pj);
+          return true;
+}
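
Note on `flatten_indexes`: the unrolled loop is easier to follow next to the one-bit-at-a-time form it replaces. This is a minimal sketch of that baseline, assuming GCC or Clang for `__builtin_ctzll`. `set_bit_positions` is a hypothetical name, not a function in this repository.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lists the positions of the set bits of s, lowest first, each shifted by
// offset (the byte index of the 64-byte chunk the mask describes).
std::vector<uint32_t> set_bit_positions(uint64_t s, uint32_t offset) {
    std::vector<uint32_t> out;
    while (s != 0) {
        out.push_back(offset + (uint32_t)__builtin_ctzll(s)); // lowest set bit
        s &= s - 1ULL;                                        // clear it
    }
    return out;
}
```

The unrolled version in the diff writes seven slots per iteration however many bits remain, then resets `base` from the popcount, which is why the allocation reserves 7 extra entries. GCC documents `__builtin_ctzll(0)` as undefined, so the slots written past the last real bit should be treated as don't-care values.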
--- a/scalarvssimd/include/benchmark.h
+++ b/scalarvssimd/include/benchmark.h
@ -0,0 +1,196 @@
+#ifndef _BENCHMARK_H_
+#define _BENCHMARK_H_
+#include <stdint.h>
+#include <time.h>
+#ifdef __x86_64__
+
+const char *unitname = "cycles";
+
+#define RDTSC_START(cycles)                                                    \
+  do {                                                                         \
+    uint32_t cyc_high, cyc_low;                                                \
+    __asm volatile("cpuid\n"                                                   \
+                   "rdtsc\n"                                                   \
+                   "mov %%edx, %0\n"                                           \
+                   "mov %%eax, %1"                                             \
+                   : "=r"(cyc_high), "=r"(cyc_low)                             \
+                   :                                                           \
+                   :                              /* no read only */           \
+                   "%rax", "%rbx", "%rcx", "%rdx" /* clobbers */               \
+                   );                                                          \
+    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
+  } while (0)
+
+#define RDTSC_STOP(cycles)                                                     \
+  do {                                                                         \
+    uint32_t cyc_high, cyc_low;                                                \
+    __asm volatile("rdtscp\n"                                                  \
+                   "mov %%edx, %0\n"                                           \
+                   "mov %%eax, %1\n"                                           \
+                   "cpuid"                                                     \
+                   : "=r"(cyc_high), "=r"(cyc_low)                             \
+                   : /* no read only registers */                              \
+                   : "%rax", "%rbx", "%rcx", "%rdx" /* clobbers */             \
+                   );                                                          \
+    (cycles) = ((uint64_t)cyc_high << 32) | cyc_low;                           \
+  } while (0)
+
+#else
+const char *unitname = " (clock units) ";
+
+#define RDTSC_START(cycles)                                                    \
+  do {                                                                         \
+    cycles = clock();                                                          \
+  } while (0)
+
+#define RDTSC_STOP(cycles)                                                     \
+  do {                                                                         \
+    cycles = clock();                                                          \
+  } while (0)
+#endif
+
+static __attribute__((noinline)) uint64_t rdtsc_overhead_func(uint64_t dummy) {
+  return dummy;
+}
+
+uint64_t global_rdtsc_overhead = (uint64_t)UINT64_MAX;
+
+#define RDTSC_SET_OVERHEAD(test, repeat)                                       \
+  do {                                                                         \
+    uint64_t cycles_start, cycles_final, cycles_diff;                          \
+    uint64_t min_diff = UINT64_MAX;                                            \
+    for (int i = 0; i < repeat; i++) {                                         \
+      __asm volatile("" ::: /* pretend to clobber */ "memory");                \
+      RDTSC_START(cycles_start);                                               \
+      test;                                                                    \
+      RDTSC_STOP(cycles_final);                                                \
+      cycles_diff = (cycles_final - cycles_start);                             \
+      if (cycles_diff < min_diff)                                              \
+        min_diff = cycles_diff;                                                \
+    }                                                                          \
+    global_rdtsc_overhead = min_diff;                                          \
+  } while (0)
+
+/*
+ * Prints the best (minimum) number of cycles per operation, where
+ * test is the function call, expected is the value test should return,
+ * pre is code run before each timed repetition, repeat is the number of
+ * repetitions, and size is the number of operations represented by test.
+ */
+#define BEST_TIME(test, expected, pre, repeat, size, verbose)                  \
+  do {                                                                         \
+    if (global_rdtsc_overhead == UINT64_MAX) {                                 \
+      RDTSC_SET_OVERHEAD(rdtsc_overhead_func(1), repeat);                      \
+    }                                                                          \
+    if (verbose)                                                               \
+      printf("%-60s\t: ", #test);                                              \
+    fflush(NULL);                                                              \
+    uint64_t cycles_start, cycles_final, cycles_diff;                          \
+    uint64_t min_diff = (uint64_t)-1;                                          \
+    uint64_t sum_diff = 0;                                                     \
+    for (int i = 0; i < repeat; i++) {                                         \
+      pre;                                                                     \
+      __asm volatile("" ::: /* pretend to clobber */ "memory");                \
+      RDTSC_START(cycles_start);                                               \
+      if (test != expected) {                                                  \
+        printf("not expected (%d , %d )", (int)test, (int)expected);           \
+        break;                                                                 \
+      }                                                                        \
+      RDTSC_STOP(cycles_final);                                                \
+      cycles_diff = (cycles_final - cycles_start - global_rdtsc_overhead);     \
+      if (cycles_diff < min_diff)                                              \
+        min_diff = cycles_diff;                                                \
+      sum_diff += cycles_diff;                                                 \
+    }                                                                          \
+    uint64_t S = size;                                                         \
+    float cycle_per_op = (min_diff) / (double)S;                               \
+    float avg_cycle_per_op = (sum_diff) / ((double)S * repeat);                \
+    if (verbose)                                                               \
+      printf(" %.3f %s per operation (best) ", cycle_per_op, unitname);        \
+    if (verbose)                                                               \
+      printf("\t%.3f %s per operation (avg) ", avg_cycle_per_op, unitname);    \
+    if (verbose)                                                               \
+      printf("\n");                                                            \
+    if (!verbose)                                                              \
+      printf(" %.3f ", cycle_per_op);                                          \
+    fflush(NULL);                                                              \
+  } while (0)
+
+// like BEST_TIME, but no check
+#define BEST_TIME_NOCHECK(test, pre, repeat, size,  verbose)           \
+  do {                                                                         \
+    if (global_rdtsc_overhead == UINT64_MAX) {                                 \
+      RDTSC_SET_OVERHEAD(rdtsc_overhead_func(1), repeat);                      \
+    }                                                                          \
+    if (verbose)                                                               \
+      printf("%-40s\t: ", #test);                                              \
+    fflush(NULL);                                                              \
+    uint64_t cycles_start, cycles_final, cycles_diff;                          \
+    uint64_t min_diff = (uint64_t)-1;                                          \
+    uint64_t sum_diff = 0;                                                     \
+    for (int i = 0; i < repeat; i++) {                                         \
+      pre;                                                                     \
+      __asm volatile("" ::: /* pretend to clobber */ "memory");                \
+      RDTSC_START(cycles_start);                                               \
+      test;                                                                    \
+      RDTSC_STOP(cycles_final);                                                \
+      cycles_diff = (cycles_final - cycles_start - global_rdtsc_overhead);     \
+      if (cycles_diff < min_diff)                                              \
+        min_diff = cycles_diff;                                                \
+      sum_diff += cycles_diff;                                                 \
+    }                                                                          \
+    uint64_t S = size;                                                         \
+    float cycle_per_op = (min_diff) / (double)S;                               \
+    float avg_cycle_per_op = (sum_diff) / ((double)S * repeat);                \
+    if (verbose)                                                               \
+      printf(" %.3f %s per input byte (best) ", cycle_per_op, unitname);       \
+    if (verbose)                                                               \
+      printf(" %.3f %s per input byte (avg) ", avg_cycle_per_op, unitname);    \
+    if (verbose)                                                               \
+      printf("\n");                                                            \
+    if (!verbose)                                                              \
+      printf(" %.3f ", cycle_per_op);                                          \
+    fflush(NULL);                                                              \
+  } while (0)
+
+// like BEST_TIME except that we run a function to check the result
+#define BEST_TIME_CHECK(test, check, pre, repeat, size, verbose)               \
+  do {                                                                         \
+    if (global_rdtsc_overhead == UINT64_MAX) {                                 \
+      RDTSC_SET_OVERHEAD(rdtsc_overhead_func(1), repeat);                      \
+    }                                                                          \
+    if (verbose)                                                               \
+      printf("%-60s\t: ", #test);                                              \
+    fflush(NULL);                                                              \
+    uint64_t cycles_start, cycles_final, cycles_diff;                          \
+    uint64_t min_diff = (uint64_t)-1;                                          \
+    uint64_t sum_diff = 0;                                                     \
+    for (int i = 0; i < repeat; i++) {                                         \
+      pre;                                                                     \
+      __asm volatile("" ::: /* pretend to clobber */ "memory");                \
+      RDTSC_START(cycles_start);                                               \
+      test;                                                                    \
+      RDTSC_STOP(cycles_final);                                                \
+      if (!(check)) {                                                          \
+        printf("error");                                                       \
+        break;                                                                 \
+      }                                                                        \
+      cycles_diff = (cycles_final - cycles_start - global_rdtsc_overhead);     \
+      if (cycles_diff < min_diff)                                              \
+        min_diff = cycles_diff;                                                \
+      sum_diff += cycles_diff;                                                 \
+    }                                                                          \
+    uint64_t S = size;                                                         \
+    float cycle_per_op = (min_diff) / (double)S;                               \
+    float avg_cycle_per_op = (sum_diff) / ((double)S * repeat);                \
+    if (verbose)                                                               \
+      printf(" %.3f cycles per operation (best) ", cycle_per_op);              \
+    if (verbose)                                                               \
+      printf("\t%.3f cycles per operation (avg) ", avg_cycle_per_op);          \
+    if (verbose)                                                               \
+      printf("\n");                                                            \
+    if (!verbose)                                                              \
+      printf(" %.3f ", cycle_per_op);                                          \
+    fflush(NULL);                                                              \
+  } while (0)
+
+#endif
--- a/scalarvssimd/include/common_defs.h
+++ b/scalarvssimd/include/common_defs.h
@ -1,5 +1,5 @@
 #pragma once
-
+#include <cassert>
 typedef unsigned char u8;
 typedef unsigned short u16;
 typedef unsigned int u32;
--- a/scalarvssimd/include/jsonstruct.h
+++ b/scalarvssimd/include/jsonstruct.h
@ -0,0 +1,82 @@
+#pragma once
+
+#include "common_defs.h"
+
+struct JsonNode {
+    u32 up;
+    u32 next;
+    u32 prev;
+};
+
+struct ParsedJson {
+    u8 * structurals;
+    u32 n_structural_indexes;
+    u32 * structural_indexes;
+    JsonNode * nodes;
+};
+
+#include <algorithm>
+#include <iostream>
+#include <iterator>
+
+// https://stackoverflow.com/questions/2616906/how-do-i-output-coloured-text-to-a-linux-terminal
+namespace Color {
+    enum Code {
+        FG_DEFAULT = 39, FG_BLACK = 30, FG_RED = 31, FG_GREEN = 32,
+        FG_YELLOW = 33, FG_BLUE = 34, FG_MAGENTA = 35, FG_CYAN = 36,
+        FG_LIGHT_GRAY = 37, FG_DARK_GRAY = 90, FG_LIGHT_RED = 91,
+        FG_LIGHT_GREEN = 92, FG_LIGHT_YELLOW = 93, FG_LIGHT_BLUE = 94,
+        FG_LIGHT_MAGENTA = 95, FG_LIGHT_CYAN = 96, FG_WHITE = 97,
+        BG_RED = 41, BG_GREEN = 42, BG_BLUE = 44, BG_DEFAULT = 49
+    };
+    class Modifier {
+        Code code;
+    public:
+        Modifier(Code pCode) : code(pCode) {}
+        friend std::ostream&
+        operator<<(std::ostream& os, const Modifier& mod) {
+            return os << "\033[" << mod.code << "m";
+        }
+    };
+}
+
+inline void colorfuldisplay(ParsedJson & pj, const u8 * buf) {
+    Color::Modifier greenfg(Color::FG_GREEN);
+    Color::Modifier yellowfg(Color::FG_YELLOW);
+    Color::Modifier deffg(Color::FG_DEFAULT);
+    size_t i = 0;
+    // skip initial fluff
+    while((i+1< pj.n_structural_indexes) && (pj.structural_indexes[i]==pj.structural_indexes[i+1])){
+      i++;
+    }
+    for (; i < pj.n_structural_indexes; i++) {
+        u32 idx = pj.structural_indexes[i];
+        u8 c = buf[idx];
+        if (((c & 0xdf) == 0x5b)) { // meaning 7b or 5b, { or [
+            std::cout << greenfg <<  buf[idx] << deffg;
+        } else if (((c & 0xdf) == 0x5d)) { // meaning 7d or 5d, } or ]
+            std::cout << greenfg <<  buf[idx] << deffg;
+        } else {
+            std::cout << yellowfg <<  buf[idx] << deffg;
+        }
+        if(i + 1 < pj.n_structural_indexes) {
+          u32 nextidx = pj.structural_indexes[i + 1];
+          for(u32 pos = idx + 1 ; pos < nextidx; pos++) {
+            std::cout << buf[pos];
+          }
+        }
+    }
+    std::cout << std::endl;
+}
+
+inline void debugdisplay(ParsedJson & pj, const u8 * buf) {
+    for (u32 i = 0; i < pj.n_structural_indexes; i++) {
+        u32 idx = pj.structural_indexes[i];
+        JsonNode & n = pj.nodes[i];
+        std::cout << "i: " << i;
+        std::cout << " n.up: " << n.up;
+        std::cout << " n.next: " << n.next;
+        std::cout << " n.prev: " << n.prev;
+        std::cout << " idx: " << idx << " buf[idx] " << buf[idx] << std::endl;
+    }
+}
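
The bracket tests in `colorfuldisplay` rely on `[` (0x5b) and `{` (0x7b) differing only in bit 5, so masking with 0xdf folds both onto 0x5b; `]` (0x5d) and `}` (0x7d) fold onto 0x5d in the same way. A minimal standalone sketch of that check follows; the helper names `is_open_bracket` and `is_close_bracket` are hypothetical and not part of this commit.

```cpp
#include <cstdint>

// Hypothetical helpers (not in the patch) illustrating the mask trick.
// 0xdf clears bit 5 (0x20), the only bit in which '[' and '{' differ.
inline bool is_open_bracket(uint8_t c) {
  return (c & 0xdf) == 0x5b; // true only for '[' (0x5b) and '{' (0x7b)
}

// Same idea for the closing pair ']' (0x5d) and '}' (0x7d).
inline bool is_close_bracket(uint8_t c) {
  return (c & 0xdf) == 0x5d;
}
```

One comparison covers two characters, which is presumably why the display code prefers it over two separate equality tests.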
--- a/scalarvssimd/include/scalarprocessing.h
+++ b/scalarvssimd/include/scalarprocessing.h
@ -0,0 +1,79 @@
+#include "common_defs.h"
+#include "jsonstruct.h"
+
+inline bool is_valid_escape(char c) {
+  return (c == '"') ||  (c == '\\') ||  (c == '/') ||  (c == 'b') ||  (c == 'f') ||  (c == 'n') ||  (c == 'r') ||  (c == 't') ||  (c == 'u');
+}
+
+inline bool scalar_json_parse(const u8 * buf,  size_t len, ParsedJson & pj) {
+  // this is a naive attempt at this point
+  // it will probably be subject to failures given adversarial inputs
+  size_t pos = 0;
+  size_t last = 0;
+  size_t up = 0;
+
+  const u32 DUMMY_NODE = 0;
+  const u32 ROOT_NODE = 1;
+  pj.structural_indexes[DUMMY_NODE] = 0;
+  pj.structural_indexes[ROOT_NODE] = 0;
+  JsonNode & dummy = pj.nodes[DUMMY_NODE];
+  JsonNode & root = pj.nodes[ROOT_NODE];
+  dummy.prev = dummy.up = DUMMY_NODE;
+  dummy.next = 0;
+  root.prev = DUMMY_NODE;
+  root.up = ROOT_NODE;
+  root.next = 0;
+
+  last = up = ROOT_NODE;
+
+  pos = 2;
+  for(size_t i = 0; i < len; i++) {
+    JsonNode & n = pj.nodes[pos];
+    switch (buf[i]) {
+      case '[':
+      case '{':
+            pj.structural_indexes[pos] = i;
+            n.prev = last;
+            pj.nodes[last].next = pos;// two-way linked list
+            n.up = up;
+            up = pos;// new possible up
+            last = 0;
+            pos += 1;
+
+            break;
+      case ']':
+      case '}':
+            pj.structural_indexes[pos] = i;
+            n.prev = up;
+            n.next = 0;// necessary?
+            pj.nodes[up].next = pos;// two-way linked list
+            n.up = pj.nodes[up].up;
+            up = pj.nodes[up].up;
+            last = pos;// potential previous
+            pos += 1;
+            break;
+      case '"':
+      case ':':
+      case ',':
+          pj.structural_indexes[pos] = i;
+          n.prev = last;
+          n.next = 0;// necessary
+          pj.nodes[last].next = pos;// two-way linked list
+          n.up = up;
+          last = pos;// potential previous
+          pos += 1;
+          break;
+      case '\\':
+          if(i == len - 1) return false;
+          if(!is_valid_escape(buf[i+1])) return false;
+          i = i + 1; // skip the escaped character, then fall through to default
+      default:
+          // nothing
+          break;
+    }
+
+  }
+  pj.n_structural_indexes = pos;
+  dummy.next = DUMMY_NODE; // dummy.next is a sump for meaningless 'nexts', clear it
+  return true;
+}
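
To make the `up`/`next`/`prev` bookkeeping in `scalar_json_parse` easier to follow, here is a self-contained sketch of the same state machine. It is an illustration under stated assumptions, not the commit's code: `Node`, `Parsed`, `sketch_valid_escape` and `sketch_parse` are hypothetical names, and the sketch grows `std::vector`s where the original writes into arrays that the caller preallocated.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for JsonNode / ParsedJson (assumption: vectors
// replace the preallocated arrays used by the patch).
struct Node { uint32_t up, next, prev; };
struct Parsed {
  std::vector<uint32_t> structural_indexes; // byte offset of each structural
  std::vector<Node> nodes;                  // one node per structural
};

inline bool sketch_valid_escape(char c) {
  return std::string("\"\\/bfnrtu").find(c) != std::string::npos;
}

inline bool sketch_parse(const std::string &buf, Parsed &pj) {
  const uint32_t DUMMY = 0, ROOT = 1;
  pj.structural_indexes.assign(2, 0);  // dummy and root both map to offset 0
  pj.nodes.assign(2, Node{DUMMY, 0, DUMMY});
  pj.nodes[ROOT].up = ROOT;            // root is its own parent
  uint32_t last = ROOT, up = ROOT;
  for (size_t i = 0; i < buf.size(); i++) {
    const uint32_t pos = static_cast<uint32_t>(pj.nodes.size());
    const uint32_t off = static_cast<uint32_t>(i);
    switch (buf[i]) {
    case '[': case '{':                // open a scope: becomes the new 'up'
      pj.structural_indexes.push_back(off);
      pj.nodes.push_back(Node{up, 0, last});
      pj.nodes[last].next = pos;
      up = pos;
      last = DUMMY;                    // first child has no real predecessor
      break;
    case ']': case '}': {              // close a scope: link back to its opener
      const uint32_t parent = pj.nodes[up].up;
      pj.structural_indexes.push_back(off);
      pj.nodes.push_back(Node{parent, 0, up});
      pj.nodes[up].next = pos;
      up = parent;
      last = pos;
      break;
    }
    case '"': case ':': case ',':      // siblings inside the current scope
      pj.structural_indexes.push_back(off);
      pj.nodes.push_back(Node{up, 0, last});
      pj.nodes[last].next = pos;
      last = pos;
      break;
    case '\\':                         // reject dangling or unknown escapes
      if (i + 1 >= buf.size() || !sketch_valid_escape(buf[i + 1])) return false;
      i++;                             // skip the escaped character
      break;
    default:
      break;
    }
  }
  pj.nodes[DUMMY].next = DUMMY;        // discard meaningless 'next' writes
  return true;
}
```

As in the patch, an opening bracket's `next` ends up pointing at its matching closing bracket, while the first child of a scope links back to the dummy node rather than to a real sibling.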
--- a/scalarvssimd/include/util.h
+++ b/scalarvssimd/include/util.h
--- a/scalarvssimd/jsonstruct.h
+++ b/scalarvssimd/jsonstruct.h
@ -1,15 +0,0 @@
-#pragma once
-
-
-struct JsonNode {
-    u32 up;
-    u32 next;
-    u32 prev;
-};
-
-struct ParsedJson {
-    u8 * structurals;
-    u32 n_structural_indexes;
-    u32 * structural_indexes;
-    JsonNode * nodes;
-};
--- a/scalarvssimd/run.sh
+++ b/scalarvssimd/run.sh
@ -0,0 +1,12 @@
+#!/bin/bash
+echo "Note: the SIMD parser does a bit more work."
+SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
+cd "$SCRIPTPATH"
+make bench
+echo 
+for i in "$SCRIPTPATH"/../jsonexamples/*.json; do
+    [ -f "$i" ] || break
+    echo "$i"
+    "$SCRIPTPATH/bench" "$i"
+    echo
+done
--- a/scalarvssimd/scalarprocessing.h
+++ b/scalarvssimd/scalarprocessing.h
@ -1,44 +0,0 @@
-#include "common_defs.h"
-#include "jsonstruct.h"
-
-bool scalar_json_parse(const u8 * buf,  size_t len, ParsedJson & pj) {
-  // this is a naive attempt at this point
-  // it will probably be subject to failures given adversarial inputs
-  size_t pos = 0;
-  size_t last = 0;
-  size_t up = 0;
-  for(size_t i = 0; i < len; i++) {
-    JsonNode & n = pj.nodes[pos];
-    switch buf[i] {
-      case '[':
-      case '{':
-            n.prev = last;
-            n.up = up;
-            up = pos;
-            last = 0;
-            pos += 1;
-            break;
-      case ']':
-      case '}':
-            n.prev = up;
-            n.up = pj.nodes[up].up;
-            up = pj.nodes[up].up;
-            last = pos;
-            pos += 1;
-            break;
-
-      case ':':
-      case ',':
-          n.prev = last;
-          n.up = up;
-          last = pos;
-          pos += 1;
-          break;
-      default:
-          // nothing
-    }
-    n.next = 0;
-    nodes[n.prev].next = pos;
-  }
-  pj.n_structural_indexes = pos;
-}