Various cleaning steps.

Daniel Lemire 2018-11-09 21:31:14 -05:00
parent 0e5b939568
commit 76074a821f
13 changed files with 181 additions and 159 deletions

View File

@ -6,7 +6,12 @@
.PHONY: clean cleandist
CXXFLAGS = -std=c++11 -g2 -O3 -march=native -Wall -Wextra -Wshadow -Iinclude -Ibenchmark/linux -Idependencies/rapidjson/include -Idependencies/sajson/include -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined
CXXFLAGS = -std=c++11 -g2 -O3 -march=native -Wall -Wextra -Wshadow -Iinclude -Ibenchmark/linux -Idependencies/rapidjson/include -Idependencies/sajson/include
ifeq ($(SANITIZE),1)
CXXFLAGS += -g2 -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined
endif
EXECUTABLES=parse jsoncheck numberparsingcheck stringparsingcheck minifiercompetition parsingcompetition minify allparserscheckfile
HEADERS= include/jsonparser/simdutf8check.h include/jsonparser/stringparsing.h include/jsonparser/numberparsing.h include/jsonparser/jsonparser.h include/jsonparser/common_defs.h include/jsonparser/jsonioutil.h benchmark/benchmark.h benchmark/linux/linux-perf-events.h include/jsonparser/simdjson_internal.h include/jsonparser/stage1_find_marks.h include/jsonparser/stage2_flatten.h include/jsonparser/stage34_unified.h include/jsonparser/jsoncharutils.h

README.md
View File

@ -1,93 +1,76 @@
# simdjson : Parsing gigabytes of JSON per second
A *research* library. The purpose of this repository is to support an eventual paper. The immediate goal is not to produce a library that would be used in production systems. Of course, the end game is, indeed, to have an impact on production systems.
A C++ library to see how fast we can parse JSON with complete validation.
Goal: Speed up the parsing of JSON per se. No materialization.
Goal: Speed up the parsing of JSON per se.
## Code example
```C
#include "jsonparser/jsonparser.h"
// ...
const char * filename = ... //
pair<u8 *, size_t> p = get_corpus(filename);
ParsedJson *pj_ptr = allocate_ParsedJson(p.second); // allocate memory for parsing up to p.second bytes
bool is_ok = json_parse(p.first, p.second, *pj_ptr); // do the parsing, return false on error
// parsing is done!
free(p.first); // free JSON bytes, can be done right after parsing
deallocate_ParsedJson(pj_ptr); // once you are done with pj_ptr, free JSON document; hint: you can reuse pj_ptr
```
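As a further sketch using only the functions shown above, the same `ParsedJson` can be reused across documents, as the hint in the comment suggests. The driver below is hypothetical (the 16 MB capacity and the `main` loop are illustrative choices, not part of the library):
```C++
#include "jsonparser/jsonparser.h"
#include "jsonparser/jsonioutil.h"
#include <cstdlib>
#include <iostream>
#include <utility>

// Hypothetical driver (a sketch): parse several files while reusing a single
// ParsedJson object sized for the largest expected document.
int main(int argc, char *argv[]) {
  ParsedJson *pj_ptr = allocate_ParsedJson(1 << 24); // room for documents up to ~16 MB
  for (int i = 1; i < argc; i++) {
    std::pair<u8 *, size_t> p = get_corpus(argv[i]); // aligned, padded copy of the file
    bool is_ok = json_parse(p.first, p.second, *pj_ptr); // false on error
    std::cout << argv[i] << (is_ok ? ": valid JSON" : ": parse error") << std::endl;
    free(p.first); // the JSON bytes can be freed right after parsing
  }
  deallocate_ParsedJson(pj_ptr);
  return 0;
}
```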
## Usage
Requirements: clang or gcc and make. A system like Linux or macOS is expected.
To test:
```
make
make test
```
To run benchmarks:
```
make parse
./parse myjsonfile
```
## Limitations
To simplify the engineering, we make some assumptions.
- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16).
- We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included.
- We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though adding it should not be difficult.
- We expect the input memory pointer to be 256-bit aligned and padded (e.g., with spaces) so that it can be read entirely in blocks of 256 bits. In practice, this means that users should allocate the memory where the JSON bytes are located using the `allocate_aligned_buffer` function or its equivalent (see the sketch after this list).
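For JSON bytes that do not come from `get_corpus` (which presumably already produces a padded buffer), a minimal sketch of the copy-into-aligned-buffer step could look like this; `parse_from_string` is a hypothetical helper, not part of the library:
```C++
#include "jsonparser/jsonioutil.h"
#include "jsonparser/jsonparser.h"
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper (a sketch): copy JSON bytes from an arbitrary source into
// an aligned, padded buffer before parsing, as required by the parser.
// pj must come from allocate_ParsedJson with a capacity of at least json.size().
bool parse_from_string(const std::string &json, ParsedJson &pj) {
  char *buf = allocate_aligned_buffer(json.size()); // aligned allocation, padded with spaces
  memcpy(buf, json.data(), json.size());
  bool ok = json_parse((u8 *)buf, json.size(), pj);
  free(buf); // allocated with posix_memalign, so free() is appropriate
  return ok;
}
```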
## Architecture
The parser works in three stages:
- Stage 1. Quickly identifies structural elements, strings, and so forth. Currently, there is no validation (JSON is assumed to be correct).
- Stage 1. Quickly identifies structural elements, strings, and so forth. We validate UTF-8 encoding at that stage.
- Stage 2. Involves "flattening" the data from stage 1, that is, converting bitsets into arrays of indexes.
- Stage 3. (Structure building) Involves constructing a "tree" of sorts to navigate through the data.
- Stage 4. (Currently unimplemented) Iterate through the structure without "stalling" (fighting back against latency).
## Todo (Priority)
- The top level FSM is wrong and yields an error. Need to hand-hack the treatment of the first level.
- The tape machine is not robust against invalid input. A large string of closing brace characters, for example, will crash the tape machine and we will not consult the state machine in time to know that this wasn't valid input. Furthermore there are valid inputs in terms of the state machine that will create undefined behavior if the JSON nesting depth exceeds what we have provisioned for. This is a straightforward fix.
- The atoms need to be verified for validity. It would be nice to have a branch-free version for this. This stage can operate over our trees.
- The tape machine's output is nearly unreadable, both for debugging purposes and as a structure in itself. We don't know where each nested item starts, only where it ends. A linear traversal at depth N could track this information for entries at depth N+1, so I am leery of putting extra information on the tapes until we have a fuller idea of what the final stages look like.
- The tape machine's output is also cryptic in that it's not clear whether records point into the input (offsets) or point to other items on our tapes. The simplest trick to fix this would be to use negative offsets for the tapes (growing them upward) so negative values point to our tapes and positive values point to our input. Obviously the current situation is not sustainable as we could actually wind up not knowing what our tape values really mean!
## Todo (Secondary)
- Build up a paper (use overleaf.com)
- Write unit tests
- Write bona fide, accurate benchmarks (with fair comparisons using good alternatives). See https://github.com/Geal/parser_benchmarks
- Document better the code, make the code easier to use
- Add some measure of error handling. I no longer think error handling is optional; I think this could be a key differentiating feature over what I suspect Mison actually does. If Mison happily accepts broken JSON because of its much vaunted ability to skip input, is this a feature or a bug?
- Stage 3. (Structure building) Involves constructing a "tree" of sorts to navigate through the data. Strings and numbers are parsed at this stage.
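A hedged sketch of driving the stages one at a time instead of going through `json_parse`: `find_structural_bits` is the stage-1 entry point visible in this commit, while `flatten_indexes` and `unified_machine` are assumed names and signatures for the stage-2 and stage-3 entry points declared in `stage2_flatten.h` and `stage34_unified.h`.
```C++
#include "jsonparser/jsonparser.h"
#include "jsonparser/stage1_find_marks.h"
#include "jsonparser/stage2_flatten.h"
#include "jsonparser/stage34_unified.h"

// Sketch only: the stage-2 and stage-3 function names below are assumptions.
bool parse_in_stages(const u8 *buf, size_t len, ParsedJson &pj) {
  if (!find_structural_bits(buf, len, pj)) // stage 1: structural characters, UTF-8 validation
    return false;
  if (!flatten_indexes(len, pj))           // stage 2 (assumed name): bitsets -> index arrays
    return false;
  return unified_machine(buf, len, pj);    // stage 3 (assumed name): tape/tree, strings, numbers
}
```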
## Various References
## Academic References
- T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. PVLDB, 6(14):1702-1713, 2013. (SIMD-based CSV parsing)
- Mytkowicz, Todd, Madanlal Musuvathi, and Wolfram Schulte. "Data-parallel finite-state machines." ACM SIGARCH Computer Architecture News. Vol. 42. No. 1. ACM, 2014.
- Lu, Yifan, et al. "Tree structured data processing on GPUs." Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on. IEEE, 2017.
- Sidhu, Reetinder. "High throughput, tree automata based XML processing using FPGAs." Field-Programmable Technology (FPT), 2013 International Conference on. IEEE, 2013.
- Dai, Zefu, Nick Ni, and Jianwen Zhu. "A 1 cycle-per-byte XML parsing accelerator." Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays. ACM, 2010.
- Lin, Dan, et al. "Parabix: Boosting the efficiency of text processing on commodity processors." High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on. IEEE, 2012. http://parabix.costar.sfu.ca/export/1783/docs/HPCA2012/final_ieee/final.pdf
> Using this parallel bit stream approach, the vast majority of conditional branches used to identify key positions and/or syntax errors at each parsing position are mostly eliminated, which, as Section 6.2 shows, minimizes branch misprediction penalties. Accurate parsing and parallel lexical analysis is done through processor-friendly equations that require neither speculation nor multithreading.
- Deshmukh, V. M., and G. R. Bamnote. "An empirical evaluation of optimization parameters in XML parsing for performance enhancement." Computer, Communication and Control (IC4), 2015 International Conference on. IEEE, 2015.
- Moussalli, Roger, et al. "Efficient XML Path Filtering Using GPUs." ADMS@ VLDB. 2011.
- Jianliang, Ma, et al. "Parallel speculative dom-based XML parser." High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012.
- Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J. and Kossmann, D., 2017. Mison: a fast JSON parser for data analytics. Proceedings of the VLDB Endowment, 10(10), pp.1118-1129. http://www.vldb.org/pvldb/vol10/p1118-li.pdf
> They build the index and parse at the same time. The index allows them to parse faster... so they don't materialize the index. The speculative part is hastily described but they do hint that this is where much of the benefit comes from...
- Cameron, Robert D., et al. "Parallel scanning with bitstream addition: An xml case study." European Conference on Parallel Processing. Springer, Berlin, Heidelberg, 2011.
- Cameron, Robert D., Kenneth S. Herdy, and Dan Lin. "High performance XML parsing using parallel bit stream technology." Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. ACM, 2008.
- Shah, Bhavik, et al. "A data parallel algorithm for XML DOM parsing." International XML Database Symposium. Springer, Berlin, Heidelberg, 2009.
- Cameron, Robert D., and Dan Lin. "Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle." ACM Sigplan Notices. Vol. 44. No. 3. ACM, 2009.
- Amagasa, Toshiyuki, Mana Seino, and Hiroyuki Kitagawa. "Energy-Efficient XML Stream Processing through Element-Skipping Parsing." Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on. IEEE, 2013.
- Medforth, Nigel Woodland. "icXML: Accelerating Xerces-C 3.1.1 using the Parabix Framework." (2013).
- Zhang, Qiang Scott. Embedding Parallel Bit Stream Technology Into Expat. Diss. Simon Fraser University, 2010.
- Cameron, Robert D., et al. "Fast Regular Expression Matching with Bit-parallel Data Streams."
- Lin, Dan. Bits filter: a high-performance multiple string pattern matching algorithm for malware detection. Diss. School of Computing Science-Simon Fraser University, 2010.
- Yang, Shiyang. Validation of XML Document Based on Parallel Bit Stream Technology. Diss. Applied Sciences: School of Computing Science, 2013.
- N. Nakasato, "Implementation of a parallel tree method on a GPU", Journal of Computational Science, vol. 3, no. 3, pp. 132-141, 2012.
## References
- [Google double-conv](https://github.com/google/double-conversion/)
- [How to implement atoi using SIMD?](https://stackoverflow.com/questions/35127060/how-to-implement-atoi-using-simd)
- [Parsing JSON is a Minefield 💣](http://seriot.ch/parsing_json.php)
- https://tools.ietf.org/html/rfc7159
- The only public Mison implementation (in Rust) https://github.com/pikkr/pikkr
- The Mison implementation in Rust https://github.com/pikkr/pikkr
- http://rapidjson.org/md_doc_sax.html
- https://github.com/Geal/parser_benchmarks/tree/master/json
- Gron: A command line tool that makes JSON greppable https://news.ycombinator.com/item?id=16727665
@ -113,7 +96,6 @@ Validating UTF-8 takes no more than 0.7 cycles per byte:
- JSON is not JavaScript:
> All JSON is Javascript but NOT all Javascript is JSON. So {property:1} is invalid because property does not have double quotes around it. {'property':1} is also invalid, because it's single quoted while the only thing that can placate the JSON specification is double quoting. JSON is even fussy enough that {"property":.1} is invalid too, because you should have of course written {"property":0.1}. Also, don't even think about having comments or semicolons, you guessed it: they're invalid. (credit:https://github.com/elzr/vim-json)
- The structural characters are:
@ -126,24 +108,6 @@ Validating UTF-8 takes no more than 0.7 cycles per byte:
name-separator = : colon
value-separator = , comma
- It might be useful/important to prune "insignificant" white space characters. Maybe.
- Parsing and validating numbers fast could be potentially interesting, but the spec allows many things. (Geoff wrote: "Parsing numbers (especially floating point ones!) and other atoms is fiddly but doable.")
- Error handling is a nice problem. Of course, we have to define what we mean by an error and which error types the parser should be responsible for. Here are a few things that could/should probably trigger an error:
- An array structure is represented as square brackets surrounding zero or more values (or elements). Elements are separated by commas.
- An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name.
- Values must be one of false / null / true / object / array / number / string
- A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). We can probably safely assume that strings are in UTF-8. [Decoding UTF-8 is fun](https://github.com/skeeto/branchless-utf8/blob/master/utf8.h). However, any character can be escaped in a JSON string, and handling such escapes might be required. Perhaps one can quickly check whether a string needs escaping (see the sketch after this list).
- Regarding strings, Geoff wrote:
> For example, in Stage 2 ("string detection") we could validate that the only place we saw backslashes was in places we consider "inside strings".
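As a minimal, hypothetical sketch (not part of the library, assuming AVX2), checking 32 bytes at a time for characters that cannot appear unescaped in a JSON string could look like this:
```C++
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper (a sketch): returns true if the 32 bytes starting at 'in'
// contain a character that cannot appear unescaped in a JSON string: a quotation
// mark, a backslash, or a control character (0x00 through 0x1F).
static inline bool must_escape_block(const uint8_t *in) {
  __m256i v = _mm256_loadu_si256((const __m256i *)in);
  __m256i quote = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('"'));
  __m256i backslash = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\\'));
  // unsigned v < 0x20  <=>  min(v, 0x1F) == v
  __m256i control = _mm256_cmpeq_epi8(_mm256_min_epu8(v, _mm256_set1_epi8(0x1F)), v);
  __m256i bad = _mm256_or_si256(_mm256_or_si256(quote, backslash), control);
  return _mm256_movemask_epi8(bad) != 0;
}
```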
### Pseudo-structural elements
@ -159,13 +123,27 @@ This helps as we redefine some new characters as pseudo-structural such as the c
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
## Remarks on the code
- It seems that the use of ``bzero`` is discouraged.
- Several bytes are allocated per input byte, which could be a problem when processing a very large document; in practice, one might want to work more incrementally to minimize memory usage. For really large documents, there might be caching issues as well.
- The ``clmul`` thing is tricky but nice. (Geoff's remark: finding the spaces between quotes is actually a ponderous way of doing parallel prefix over XOR, which a mathematically adept person would have realized could be done with clmul by -1. Not me, I had to look it up: http://bitmath.blogspot.com.au/2016/11/parallel-prefixsuffix-operations.html.) See the sketch after this list.
- It is possible, though maybe unlikely, that parallelizing the bitset decoding could be useful (https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/), and there is VCOMPRESSB (AVX-512).
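A minimal sketch of that clmul trick (assuming PCLMULQDQ is available; the helper name is hypothetical): carry-less multiplication of the quote bits by an all-ones constant computes a prefix XOR, yielding a mask of the positions that lie between quotes.
```C++
#include <immintrin.h>
#include <cstdint>

// Sketch: given a 64-bit mask with a 1 at every unescaped quotation mark,
// return a mask with 1s at every position located between quotes.
static inline uint64_t between_quotes_mask(uint64_t quote_bits) {
  __m128i quotes = _mm_set_epi64x(0, (long long)quote_bits);
  __m128i all_ones = _mm_set1_epi8((char)0xFF); // carry-less multiply by -1 == prefix XOR
  __m128i product = _mm_clmulepi64_si128(quotes, all_ones, 0);
  return (uint64_t)_mm_cvtsi128_si64(product);
}
```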
## Future work
In the long term, it would be nice to have a way to derive code like this from an abstract description, something closer to a grammar.
## Academic References
- T. Mühlbauer, W. Rödiger, R. Seilbeck, A. Reiser, A. Kemper, and T. Neumann. Instant loading for main memory databases. PVLDB, 6(14):1702-1713, 2013. (SIMD-based CSV parsing)
- Mytkowicz, Todd, Madanlal Musuvathi, and Wolfram Schulte. "Data-parallel finite-state machines." ACM SIGARCH Computer Architecture News. Vol. 42. No. 1. ACM, 2014.
- Lu, Yifan, et al. "Tree structured data processing on GPUs." Cloud Computing, Data Science & Engineering-Confluence, 2017 7th International Conference on. IEEE, 2017.
- Sidhu, Reetinder. "High throughput, tree automata based XML processing using FPGAs." Field-Programmable Technology (FPT), 2013 International Conference on. IEEE, 2013.
- Dai, Zefu, Nick Ni, and Jianwen Zhu. "A 1 cycle-per-byte XML parsing accelerator." Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays. ACM, 2010.
- Lin, Dan, et al. "Parabix: Boosting the efficiency of text processing on commodity processors." High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on. IEEE, 2012. http://parabix.costar.sfu.ca/export/1783/docs/HPCA2012/final_ieee/final.pdf
- Deshmukh, V. M., and G. R. Bamnote. "An empirical evaluation of optimization parameters in XML parsing for performance enhancement." Computer, Communication and Control (IC4), 2015 International Conference on. IEEE, 2015.
- Moussalli, Roger, et al. "Efficient XML Path Filtering Using GPUs." ADMS@ VLDB. 2011.
- Jianliang, Ma, et al. "Parallel speculative dom-based XML parser." High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012.
- Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J. and Kossmann, D., 2017. Mison: a fast JSON parser for data analytics. Proceedings of the VLDB Endowment, 10(10), pp.1118-1129. http://www.vldb.org/pvldb/vol10/p1118-li.pdf
- Cameron, Robert D., et al. "Parallel scanning with bitstream addition: An xml case study." European Conference on Parallel Processing. Springer, Berlin, Heidelberg, 2011.
- Cameron, Robert D., Kenneth S. Herdy, and Dan Lin. "High performance XML parsing using parallel bit stream technology." Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. ACM, 2008.
- Shah, Bhavik, et al. "A data parallel algorithm for XML DOM parsing." International XML Database Symposium. Springer, Berlin, Heidelberg, 2009.
- Cameron, Robert D., and Dan Lin. "Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle." ACM Sigplan Notices. Vol. 44. No. 3. ACM, 2009.
- Amagasa, Toshiyuki, Mana Seino, and Hiroyuki Kitagawa. "Energy-Efficient XML Stream Processing through Element-Skipping Parsing." Database and Expert Systems Applications (DEXA), 2013 24th International Workshop on. IEEE, 2013.
- Medforth, Nigel Woodland. "icXML: Accelerating Xerces-C 3.1.1 using the Parabix Framework." (2013).
- Zhang, Qiang Scott. Embedding Parallel Bit Stream Technology Into Expat. Diss. Simon Fraser University, 2010.
- Cameron, Robert D., et al. "Fast Regular Expression Matching with Bit-parallel Data Streams."
- Lin, Dan. Bits filter: a high-performance multiple string pattern matching algorithm for malware detection. Diss. School of Computing Science-Simon Fraser University, 2010.
- Yang, Shiyang. Validation of XML Document Based on Parallel Bit Stream Technology. Diss. Applied Sciences: School of Computing Science, 2013.
- N. Nakasato, "Implementation of a parallel tree method on a GPU", Journal of Computational Science, vol. 3, no. 3, pp. 132-141, 2012.

View File

@ -20,10 +20,10 @@
#include <unistd.h>
#include <vector>
#include <x86intrin.h>
#include <ctype.h>
//#define DEBUG
#include "jsonparser/jsonparser.h"
#include "jsonparser/jsonioutil.h"
#include "jsonparser/simdjson_internal.h"
#include "jsonparser/stage1_find_marks.h"
@ -98,38 +98,37 @@ void colorfuldisplay(ParsedJson &pj, const u8 *buf) {
}
int main(int argc, char *argv[]) {
if (argc != 2) {
bool verbose = false;
int c;
while ((c = getopt (argc, argv, "v")) != -1)
switch (c)
{
case 'v':
verbose = true;
break;
default:
abort ();
}
if (optind >= argc) {
cerr << "Usage: " << argv[0] << " <jsonfile>" << endl;
exit(1);
}
pair<u8 *, size_t> p = get_corpus(argv[1]);
ParsedJson *pj_ptr = new ParsedJson;
ParsedJson &pj(*pj_ptr);
if (posix_memalign((void **)&pj.structurals, 8,
ROUNDUP_N(p.second, 64) / 8)) {
cerr << "Could not allocate memory" << endl;
exit(1);
};
if (p.second > 0xffffff) {
cerr << "Currently only support JSON files < 16MB\n";
exit(1);
const char * filename = argv[optind];
if(optind + 1 < argc) {
cerr << "warning: ignoring everything after " << argv[optind + 1] << endl;
}
pj.n_structural_indexes = 0;
// we have potentially 1 structure per byte of input
// as well as a dummy structure and a root structure
// we also potentially write up to 7 iterations beyond
// in our 'cheesy flatten', so make some worst-case
// space for that too
u32 max_structures = ROUNDUP_N(p.second, 64) + 2 + 7;
pj.structural_indexes = new u32[max_structures];
if(verbose) cout << "[verbose] loading " << filename << endl;
pair<u8 *, size_t> p = get_corpus(filename);
if(verbose) cout << "[verbose] loaded " << filename << " ("<< p.second << " bytes)" << endl;
ParsedJson *pj_ptr = allocate_ParsedJson(p.second);
ParsedJson &pj(*pj_ptr);
if(verbose) cout << "[verbose] allocated memory for parsed JSON " << endl;
#if defined(DEBUG)
const u32 iterations = 1;
#else
const u32 iterations = 1000;
const u32 iterations = p.second < 1 * 1000 * 1000? 1000 : 10;
#endif
vector<double> res;
res.resize(iterations);
@ -151,7 +150,9 @@ int main(int argc, char *argv[]) {
unsigned long mis1 = 0, mis2 = 0, mis3 = 0;
#endif
bool isok = true;
for (u32 i = 0; i < iterations; i++) {
if(verbose) cout << "[verbose] iteration # " << i << endl;
auto start = std::chrono::steady_clock::now();
#ifndef SQUASH_COUNTERS
unified.start();
@ -235,10 +236,8 @@ int main(int argc, char *argv[]) {
<< " Gigabytes/second: " << (p.second) / (min_result * 1000000000.0)
<< "\n";
free(pj.structurals);
free(p.first);
delete[] pj.structural_indexes;
delete pj_ptr;
deallocate_ParsedJson(pj_ptr);
if (!isok) {
printf(" Parsing failed. \n ");
return EXIT_FAILURE;

View File

@ -407,8 +407,7 @@ static really_inline bool parse_number(const u8 *const buf, UNUSED size_t len,
while (is_integer(*p)) {
unsigned char digit = *p - '0';
++p;
i = i * 10 + digit;
// exponent --;
i = i * 10 + digit; // in rare cases, this will overflow, but that's ok because we have parse_highprecision_float later.
}
exponent = firstafterperiod - p;
}
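A tiny worked example (hypothetical, just to illustrate the comment above): a 64-bit unsigned accumulator can hold at most 20 decimal digits, so longer integer literals silently wrap around, which is why the slower high-precision fallback is needed afterwards.
```C++
#include <cstdint>
#include <cstdio>

int main() {
  const char *digits = "184467440737095516150"; // 21 digits, larger than 2^64 - 1
  uint64_t i = 0;
  for (const char *p = digits; *p; ++p) {
    i = i * 10 + (uint64_t)(*p - '0'); // wraps modulo 2^64 on overflow
  }
  std::printf("%llu\n", (unsigned long long)i); // prints a wrapped value, not the true one
  return 0;
}
```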

View File

@ -32,9 +32,9 @@ public:
// grossly overprovisioned
u64 tape[MAX_TAPE];
u32 tape_locs[MAX_DEPTH];
u8 * string_buf;//[MAX_JSON_BYTES];
u8 * string_buf;// should be at least bytecapacity
u8 *current_string_buf_loc;
u8 * number_buf;//[MAX_JSON_BYTES * 4]; // holds either doubles or longs, really
u8 * number_buf;// holds either doubles or longs, really // should be at least 4 * bytecapacity
u8 *current_number_buf_loc;
void init() {

View File

@ -67,7 +67,7 @@ really_inline bool handle_unicode_codepoint(const u8 **src_ptr, u8 **dst_ptr) {
return true;
}
really_inline bool parse_string(const u8 *buf, UNUSED size_t len,
really_inline bool parse_string(const u8 *buf, UNUSED size_t len,
ParsedJson &pj, u32 depth, u32 offset) {
using namespace std;
const u8 *src = &buf[offset + 1]; // we know that buf at offset is a "

View File

@ -0,0 +1,3 @@
- npm install
- nodejs generatelargejson.js (or node generatelargejson.js)

View File

@ -0,0 +1,19 @@
var fs = require('fs');
var faker = require('faker');
// generate bigDataSet as example
var bigSet = [];
var mmax = 500000
console.log("this may take some time...")
for(var i = 10; i < mmax; i++){
if(i % 1024 == 0) process.stdout.write("\r"+i+" entries ("+Math.round(i * 100.0 /mmax)+" percent)");
bigSet.push(faker.helpers.userCard());
};
console.log()
fs.writeFile(__dirname + '/large.json', JSON.stringify(bigSet), function() {
console.log("large.json generated successfully!");
})

scripts/javascript/package-lock.json (generated)
View File

@ -0,0 +1,18 @@
{
"name": "generatelargejson",
"version": "1.0.0",
"lockfileVersion": 1,
"requires": true,
"dependencies": {
"faker": {
"version": "4.1.0",
"resolved": "https://registry.npmjs.org/faker/-/faker-4.1.0.tgz",
"integrity": "sha1-HkW7vsxndLPBlfrSg1EJxtdIzD8="
},
"fs": {
"version": "0.0.1-security",
"resolved": "https://registry.npmjs.org/fs/-/fs-0.0.1-security.tgz",
"integrity": "sha1-invTcYa23d84E/I4WLV+yq9eQdQ="
}
}
}

View File

@ -0,0 +1,16 @@
{
"name": "generatelargejson",
"version": "1.0.0",
"description": "",
"main": "generatelargejson.js",
"dependencies": {
"faker": "^4.1.0",
"fs": "0.0.1-security"
},
"devDependencies": {},
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "",
"license": "ISC"
}

View File

@ -1,22 +1,15 @@
#include "jsonparser/jsonioutil.h"
#include <cstring>
#define AVXOVERALLOCATE
char * allocate_aligned_buffer(size_t length) {
char *aligned_buffer;
size_t paddedlength = ROUNDUP_N(length, 64);
#ifdef AVXOVERALLOCATE
// allocate an extra sizeof(__m256i) just so we can always use AVX safely
if (posix_memalign((void **)&aligned_buffer, 64, paddedlength + 1 + 2 * sizeof(__m256i))) {
if (posix_memalign((void **)&aligned_buffer, 64, paddedlength + 1 + sizeof(__m256i))) {
throw std::runtime_error("Could not allocate sufficient memory");
};
for(size_t i = 0; i < 2 * sizeof(__m256i); i++) aligned_buffer[paddedlength + i] = '\0';
#else
if (posix_memalign((void **)&aligned_buffer, 64, paddedlength + 1)) {
throw std::runtime_error("Could not allocate sufficient memory");
};
#endif
for(size_t i = 0; i < sizeof(__m256i); i++) aligned_buffer[paddedlength + i] = '\0';
aligned_buffer[paddedlength] = '\0';
memset(aligned_buffer + length, 0x20, paddedlength - length);
return aligned_buffer;

View File

@ -6,18 +6,13 @@
// This structure is meant to be reused from document to document, as needed.
// you can use deallocate_ParsedJson to deallocate the memory.
ParsedJson *allocate_ParsedJson(size_t len) {
/*if (len > MAX_JSON_BYTES) {
std::cerr << "Currently only support JSON files having up to "<<MAX_JSON_BYTES<<" bytes, requested length: "
<< len << std::endl;
return NULL;
}*/
ParsedJson *pj_ptr = new ParsedJson;
if (pj_ptr == NULL) {
std::cerr << "Could not allocate memory for core struct." << std::endl;
return NULL;
}
ParsedJson &pj(*pj_ptr);
pj.bytecapacity = len;
pj.bytecapacity = 0; // will only set it to len after allocations are a success
if (posix_memalign((void **)&pj.structurals, 8, ROUNDUP_N(len, 64) / 8)) {
std::cerr << "Could not allocate memory for structurals" << std::endl;
delete pj_ptr;
@ -34,7 +29,7 @@ ParsedJson *allocate_ParsedJson(size_t len) {
delete pj_ptr;
return NULL;
}
pj.string_buf = new u8[len];
pj.string_buf = new u8[ROUNDUP_N(len, 64)];
if (pj.string_buf == NULL) {
std::cerr << "Could not allocate memory for string_buf"
<< std::endl;
@ -43,7 +38,7 @@ ParsedJson *allocate_ParsedJson(size_t len) {
delete pj_ptr;
return NULL;
}
pj.number_buf = new u8[4 * len];
pj.number_buf = new u8[4 * ROUNDUP_N(len, 64)];
if (pj.number_buf == NULL) {
std::cerr << "Could not allocate memory for number_buf"
<< std::endl;
@ -53,7 +48,7 @@ ParsedJson *allocate_ParsedJson(size_t len) {
delete pj_ptr;
return NULL;
}
pj.bytecapacity = len;
return pj_ptr;
}

View File

@ -35,8 +35,8 @@ really_inline u64 cmp_mask_against_input(m256 input_lo, m256 input_hi,
WARN_UNUSED
/*never_inline*/ bool find_structural_bits(const u8 *buf, size_t len,
ParsedJson &pj) {
if (len > 0xffffff) {
cerr << "Currently only support JSON files < 16MB\n";
if (len > pj.bytecapacity) {
cerr << "Your ParsedJson object only supports documents up to "<< pj.bytecapacity << " bytes but you are trying to process " << len << " bytes\n";
return false;
}
#ifdef UTF8VALIDATE
@ -77,9 +77,6 @@ WARN_UNUSED
}
cout << "| ... input\n";
#endif
if(idx+64 > len) {
printf("we are going to read %zu extra bytes \n", 64 + idx - len);
}
m256 input_lo = _mm256_load_si256((const m256 *)(buf + idx + 0));
m256 input_hi = _mm256_load_si256((const m256 *)(buf + idx + 32));
#ifdef UTF8VALIDATE