Commit Graph

1567 Commits

Author SHA1 Message Date
John Keiser fd44c2a2ff
Merge pull request #927 from simdjson/dlemire/exposingthestringminifier
Exposing the string minifier.
2020-06-13 07:47:20 -07:00
John Keiser a86a82b39c Rename minify class to minifier so the minify() method is cleared up 2020-06-12 17:05:25 -07:00
Daniel Lemire 89b059b1ea
Testing with GCC 10 and clang 10 (#926)
* Testing with GCC 10 and clang 10

* Fixing spurious space

* gcc10 does not need the cmake installation.

* We don't want to run the perf test on ARM. I ignore them systematically. ARM performance
should be assessed manually.

* Switching to GCC 10 and Clang 10

* Disabling some tests under sanitizers when they involve rapidjson or other parsers.

Co-authored-by: Daniel Lemire <lemire@gmai.com>
2020-06-12 17:58:53 -04:00
Daniel Lemire bd2d0f769f
One unlikely too many (#930) 2020-06-12 17:58:10 -04:00
Daniel Lemire d830422489
Put back the amalgamation files and add tests (#929)
Co-authored-by: John Keiser <john@johnkeiser.com>
Co-authored-by: Daniel Lemire <lemire@gmai.com>
2020-06-12 17:57:45 -04:00
Daniel Lemire d1a54249e7 New API traversal tests. 2020-06-12 17:42:57 -04:00
Daniel Lemire 4dfbf98e4e
Using a worker instead of a thread per batch (#920)
In the parse_many function, one thread does stage 1 while the main thread does stage 2. So if stage 1 and stage 2 each take half the time, parse_many could run at twice the speed. It is unlikely to do so in practice. Still, we see benefits of about 40% due to threading.

To achieve this interleaving, we load the data in batches (blocks) of some size. In the current code (master), we create a new thread for each batch. Thread creation is expensive, so our approach only works over sizeable batches. This PR improves things and makes parse_many faster when using small batches.

* This fixes our parse_stream benchmark, which is just busted.
* This replaces the one-thread-per-batch routine with a worker object that reuses the same thread. In benchmarks, this allows us to get the same maximal speed, but with smaller processing blocks. It does not help much with larger blocks because the cost of thread creation gets amortized efficiently.
This PR makes parse_many beneficial over small datasets. It also makes us less dependent on the thread creation time.

Unfortunately, it is going to be difficult to say anything definitive in general. The cost of creating a thread varies widely depending on the OS. On some systems it might be cheap, on others very expensive. The new code should depend less drastically on the performance of the underlying system, since we create just one thread.
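
The worker idea described above can be illustrated with a minimal sketch. This is not the code from the PR: the `worker` class, the `process_batches` function, and the arithmetic standing in for stage 1 (`* 2`) and stage 2 (`+ 1`) are all hypothetical, chosen only to show one long-lived thread being reused across batches while the main thread works on the previous batch.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical worker: a single long-lived thread that runs one task at a time.
class worker {
public:
  worker() : thread_([this] { loop(); }) {}
  ~worker() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    thread_.join(); // a task that is already running finishes first
  }
  worker(const worker &) = delete;
  worker &operator=(const worker &) = delete;

  // Hand a task to the worker thread. The previous task must have completed.
  void run(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(!busy_);
      task_ = std::move(task);
      busy_ = true;
    }
    cv_.notify_all();
  }

  // Block until the current task, if any, is done.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !busy_; });
  }

private:
  void loop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      cv_.wait(lock, [this] { return busy_ || stop_; });
      if (stop_) { return; } // a task that has not started yet is dropped
      lock.unlock();
      task_(); // run the task without holding the lock
      lock.lock();
      busy_ = false;
      cv_.notify_all();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::function<void()> task_;
  bool busy_ = false;
  bool stop_ = false;
  std::thread thread_; // declared last so it starts after the other members exist
};

// Hypothetical pipeline: "stage 1" of batch i+1 runs on the worker thread
// while "stage 2" of batch i runs on the calling thread.
std::vector<int> process_batches(const std::vector<int> &batches) {
  std::vector<int> results;
  if (batches.empty()) { return results; }
  int staged_next = 0;         // written by the worker, read after wait()
  worker w;                    // one thread for all batches
  int staged = batches[0] * 2; // stage 1 of the first batch, done inline
  for (std::size_t i = 0; i < batches.size(); i++) {
    const bool has_next = i + 1 < batches.size();
    if (has_next) {
      const int input = batches[i + 1];
      w.run([&staged_next, input] { staged_next = input * 2; });
    }
    results.push_back(staged + 1); // stage 2 on the calling thread
    if (has_next) {
      w.wait();
      staged = staged_next;
    }
  }
  return results;
}
```

Calling `process_batches({1, 2, 3})` creates exactly one extra thread regardless of the number of batches, which is the property the message argues for: the thread creation cost is paid once rather than once per batch.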

Co-authored-by: John Keiser <john@johnkeiser.com>
Co-authored-by: Daniel Lemire <lemire@gmai.com>
2020-06-12 16:51:18 -04:00
Daniel Lemire 1b6258ec8c Added std::minify 2020-06-12 16:37:41 -04:00
Daniel Lemire be707dbb6f Added a remark 2020-06-12 16:07:34 -04:00
John Keiser 664b03bb13 Short circuit find escapes if there is a backslash 2020-06-12 10:10:35 -07:00
John Keiser 7c6723d912 Print progress bar even if there is only one file 2020-06-12 10:01:19 -07:00
John Keiser 1febf2ec83
Merge pull request #919 from simdjson/jkeiser/move-current-loc
[4/4] Stop persisting current_loc (+2% parse throughput)
2020-06-12 09:55:04 -07:00
John Keiser fe69928764
Merge pull request #918 from simdjson/jkeiser/remove-iterator-variables
[3/4] Remove unneeded structural_iterator variables
2020-06-12 09:52:35 -07:00
John Keiser bbd61eb13f Let tape writing be put in a register 2020-06-12 09:18:20 -07:00
John Keiser e15e1e253d peek_char -> peek_next_char 2020-06-12 09:10:16 -07:00
Daniel Lemire 45e2178ada Duh. 2020-06-11 17:20:28 +00:00
Daniel Lemire a6e4933d93 Exposing the string minifier. 2020-06-11 13:07:18 -04:00
Daniel Lemire 98599e0972
Remove the circleci badge since it may appear to fail due to perfdiff 2020-06-11 11:37:53 -04:00
John Keiser b4837f2e2f
Merge pull request #915 from simdjson/jkeiser/stage2-common
[2/4] Use same state machine for stage 2 streaming and non-streaming
2020-06-10 08:37:08 -07:00
John Keiser ea08e7d192 Remove unused extra copy of find_next_document_index 2020-06-09 17:52:13 -07:00
John Keiser d178e089a6 Stop caching current structural, keep current index around instead of next
2020-06-08 15:21:54 -07:00
John Keiser 5f00b37e21 Stop caching the buffer index 2020-06-08 15:21:54 -07:00
John Keiser 8a8792d47f Remove most uses of current_char() 2020-06-08 15:21:54 -07:00
John Keiser 59d9bc9e48 Store the pointer to the next structural instead of base structural_indexes and an index
2020-06-08 15:21:54 -07:00
John Keiser 8793dd3ceb Don't store len locally 2020-06-08 15:21:54 -07:00
John Keiser 48062380fa Move parser to structural_iterator 2020-06-08 15:21:54 -07:00
John Keiser 3636aa5522 Extend structural_parser from structural_iterator 2020-06-08 15:21:54 -07:00
John Keiser a1aea4588f Move document stream state to implementation 2020-06-08 15:21:54 -07:00
John Keiser 1d4fffb799 Fix fallback implementation 2020-06-08 15:21:52 -07:00
John Keiser 6f90f5dc5f Remove templating from finish() method 2020-06-08 15:20:56 -07:00
John Keiser 9dd6972d26 Remove impossible checks, add EMPTY check to normal parser 2020-06-08 15:20:56 -07:00
John Keiser d731a7d52c Privatize structural_parser 2020-06-08 15:20:56 -07:00
John Keiser 059468b74e Eliminate streaming_structural_parser subclass with templates 2020-06-08 15:20:56 -07:00
John Keiser 5e69fb782a Call a function to parse structurals 2020-06-08 15:20:56 -07:00
John Keiser a5beffda78 Remove streaming_structural_parser.h 2020-06-08 15:20:56 -07:00
John Keiser 7de7ce5fdc Move document stream state to implementation 2020-06-08 15:20:56 -07:00
John Keiser 383e8c7f68
Merge pull request #913 from simdjson/jkeiser/internal-streaming
[1/4] Simplify parse_many() and fix bugs
2020-06-08 15:19:27 -07:00
John Keiser 0dbda65e44 Fix fallback implementation 2020-06-08 14:52:23 -07:00
John Keiser fe01da077e Make threaded version work again 2020-06-07 16:21:00 -07:00
John Keiser d43a4e9df9 Remove SUCCESS_AND_HAS_MORE (internal only value) 2020-06-07 16:20:55 -07:00
John Keiser 3e226795f0 Run all passing json against parse_many. Empty documents pass, too. 2020-06-07 16:20:51 -07:00
John Keiser c4a0fe1606 Add tests for parse_many() errors 2020-06-07 16:20:46 -07:00
John Keiser ef63a84a3e Move document stream state to implementation 2020-06-07 16:20:44 -07:00
John Keiser 8c16ba372e Acknowledge that we always have a remainder 2020-06-06 16:46:38 -07:00
John Keiser 9be4a17687 Separate definition from declaration, arrange top down 2020-06-06 16:46:38 -07:00
Furkan 89332e1696
Temporary fix to #914 (#917) 2020-06-05 21:01:41 -04:00
John Keiser 8a56129def
Merge pull request #916 from simdjson/jkeiser/issue906stage2
Move unclosed array check to stage 2
2020-06-05 14:23:44 -07:00
John Keiser ed0c815735 Move unclosed array check to stage 2 2020-06-05 12:39:13 -07:00
Daniel Lemire 7a69da16e4
Fixing issue 906 (#912)
* Fixing issue 906

* Safe patching.

* Now with explanations.

* Bumping up memory allocation.

* Putting the patch back.

* fallback fixes.

Co-authored-by: Daniel Lemire <lemire@gmai.com>
2020-06-05 15:37:09 -04:00
Daniel Lemire 351717414d
updating docker with instructions... (#901)
* Better dockerfile with instructions.

* Typo.
2020-06-04 20:06:29 -04:00