friendlyanon 5ec85197f8
CMake refactor stage1 (#1512)
* Remove CMP0025 policy

This policy is already set to NEW by the minimum required version.

* Use HOMEPAGE_URL in the project call

* Use VERSION in the project call

* Detect if this is the top project

* Port simdjson-user-cmakecache to a CMake script

* Create a developer mode

The SIMDJSON_DEVELOPER_MODE option set to ON will enable targets that
are only useful for developers of simdjson.

* Consolidate root CML commands into logical sections

* Warn about intended use of developer mode

* Prettify the just_ascii test

* Remove redundant CMake variables

* Inline CML contents from include and src

* Raise minimum CMake requirement to 3.14

* Define proper install rules

* Restore thread support variable

* Add BUILD_SHARED_LIBS as a top level only option

* Force developer mode to be on in CI

* Include flags earlier in developer mode

* Set CMAKE_BUILD_TYPE conditionally

CMAKE_BUILD_TYPE is used only by single configuration generators and is
otherwise completely ignored.

* Remove useless static/shared options

simdjson now uses the CMake builtin BUILD_SHARED_LIBS to switch the
built artifact's type.

* Remove unused CMAKE_MODULE_PATH variable

* Refactor implementation switching into a module

* Factor exception option out into a module

* Reformat simdjson-flags.cmake

* Rename simdjson-flags to developer-options

* Accumulate properties into an include module

This is done this way to avoid using utility targets that must be
exported and installed, which could potentially be misused by users of
the library.

* Port impl definitions to props

* Port exception options to props

* Lift normal options to the top

* Port developer options to props

* Remove simdjson-flags from benchmark

* Document the developer mode in HACKING

* Fix include path in installed config file

* Fix formatting of prop commands

* Fix tests that include .cpp files

* Change GCC AVX fixes back to compile options

* Deprecate SIMDJSON_BUILD_STATIC

* Always link fuzz targets to simdjson

* Install CMake from simdjson's debian repo

* Add gnupg for apt-key

* Make sure ASan link flags come first

* Pass CI env variable to cmake invocation

* Install package for apt-add-repository

* Remove return() from flush macro

* Use directory level commands instead of props

* Restore the github repository variable

* Set developer mode unconditionally for checkperf

The CI env variable is only set in the CI and this target is always run
in developer mode.

* Attempt to fix ODR violation in parsing checks

These tests were compiling the simdjson.cpp file again and linking to
the simdjson library target causes ODR violations.

Instead of linking to the target, just inherit its props.

* Move variables before the source dir

* Mark props to be flushed after adding more

* Use props for every command for the library

* Use keyword form for linking libs

* Handle deprecation of SIMDJSON_JUST_LIBRARY

* Handle deprecations in a separate module

Co-authored-by: friendlyanon <friendlyanon@users.noreply.github.com>
2021-04-23 09:24:56 -04:00
.circleci Don't keep a separate at_start boolean in object 2020-12-15 11:29:31 -08:00
.github CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
benchmark CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
cmake CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
dependencies Add PartialTweets<Yyjson> benchmark 2021-01-01 22:03:38 -08:00
doc Document stream: truncate final unfinished document and give access to the number of truncated bytes. (#1534) 2021-04-23 09:24:00 -04:00
examples Recommend simdjson::ondemand over simdjson::builtin::ondemand (#1380) 2021-01-14 17:33:49 -05:00
extra Removing all stdout, stderr from main library. (#455) 2020-01-20 16:03:15 -05:00
fuzz CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
images Improving the doxygen. (#687) 2020-04-08 17:53:04 -04:00
include CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
jsonchecker Basics. (#1116) 2020-08-14 17:28:09 -04:00
jsonexamples Add sajson benchmarks 2021-01-11 15:12:12 -08:00
scripts remove trailing whitespace (#1284) 2020-11-03 21:48:09 +01:00
singleheader CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
src CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
style Hiding the pointer away... (#252) 2019-08-04 15:41:00 -04:00
tests CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
tools CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
windows remove trailing whitespace (#1284) 2020-11-03 21:48:09 +01:00
.appveyor.yml Create acceptance_tests, all_tests, etc. make targets 2020-12-23 09:14:45 -08:00
.cirrus.yml Add test for out-of-order parse asserts 2020-12-13 13:39:47 -08:00
.clang-format We are adopting clang-format. 2019-08-01 15:40:07 -04:00
.dockerignore move amalgamate from bash to python (#1278) 2020-11-03 07:35:16 +01:00
.drone.yml Add test for out-of-order parse asserts 2020-12-13 13:39:47 -08:00
.gitattributes Use Unix line endings for c/c++ code (#1069) 2020-07-25 13:53:31 -04:00
.gitignore fix unintended early return in ondemand unit test loops (#1263) 2020-11-01 19:08:40 +01:00
.travis.yml This will disable the sanitizer runs on travis. (#1523) 2021-03-26 13:51:39 -04:00
AUTHORS Update AUTHORS 2020-04-30 19:49:20 -04:00
CMakeLists.txt CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
CONTRIBUTING.md Removing trailing white space. 2021-01-12 18:04:51 -05:00
CONTRIBUTORS Update CONTRIBUTORS 2020-10-27 18:44:28 -04:00
Dockerfile Add test for out-of-order parse asserts 2020-12-13 13:39:47 -08:00
Doxyfile Version 0.9.1 2021-03-18 11:31:38 -04:00
HACKING.md CMake refactor stage1 (#1512) 2021-04-23 09:24:56 -04:00
LICENSE Updating again. 2019-02-08 10:05:50 -05:00
README.md Adding another license note. 2021-04-13 10:25:02 -04:00
RELEASES.md release candidate (#1132) 2020-08-19 18:12:23 -04:00

README.md


simdjson : Parsing gigabytes of JSON per second

JSON is everywhere on the Internet. Servers spend a *lot* of time parsing it. We need a fresh approach. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.
  • Fast: Over 4x faster than commonly used production-grade JSON parsers.
  • Record Breaking Features: Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s.
  • Easy: First-class, easy to use and carefully documented APIs.
  • Strict: Full JSON and UTF-8 validation, lossless parsing. Performance with no compromises.
  • Automatic: Selects a CPU-tailored parser at runtime. No configuration needed.
  • Reliable: From memory allocation to error handling, simdjson's design avoids surprises.
  • Peer Reviewed: Our research appears in venues like VLDB Journal, Software: Practice and Experience.

This library is part of the Awesome Modern C++ list.


Quick Start

The simdjson library is easily consumable with a single .h and .cpp file.

  1. Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, FreeBSD). We also support programming environments like Visual Studio and Xcode, but different steps are needed.

  2. Pull simdjson.h and simdjson.cpp into a directory, along with the sample file twitter.json.

    wget https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.h https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.cpp https://raw.githubusercontent.com/simdjson/simdjson/master/jsonexamples/twitter.json
    
  3. Create quickstart.cpp:

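#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    // Load the file into a padded buffer, as the parser expects.
    padded_string json = padded_string::load("twitter.json");
    // Start iterating over the document.
    ondemand::document tweets = parser.iterate(json);
    // Look up search_metadata.count and print it.
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}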

  4. c++ -o quickstart quickstart.cpp simdjson.cpp
  5. ./quickstart
    100 results.
    

Documentation

Usage documentation is available:

  • Basics is an overview of how to use simdjson and its APIs.
  • Performance shows some more advanced scenarios and how to tune for them.
  • Implementation Selection describes runtime CPU detection and how you can work with it.
  • API contains the automatically generated API documentation.

Performance results

The simdjson library uses three-quarters fewer instructions than the state-of-the-art parser RapidJSON. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

The following figure represents parsing speed in GB/s for parsing various files on an Intel Skylake processor (3.4 GHz) using the GNU GCC 10 compiler (with the -O3 flag). We compare against the best and fastest C++ libraries on benchmarks that load and process the data. The simdjson library offers full Unicode (UTF-8) validation and exact number parsing.

The simdjson library offers high speed whether it processes tiny files (e.g., 300 bytes) or larger files (e.g., 3 MB). The following plot presents parsing speed for synthetic files over various sizes generated with a script on a 3.4 GHz Skylake processor (GNU GCC 9, -O3).

All our experiments are reproducible.

For NDJSON files, we can exceed 3 GB/s with our multithreaded parsing functions.
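
For readers unfamiliar with the format, the following sketch shows what "newline-delimited" means in practice. It is a plain C++17 illustration, not simdjson's multithreaded parsing code: the hypothetical helper `split_ndjson` assumes each document sits on its own line and only locates document boundaries without parsing or validating anything.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split an NDJSON buffer into one view per non-empty line.
// The views point into `buffer`, so it must outlive the result.
std::vector<std::string_view> split_ndjson(std::string_view buffer) {
    std::vector<std::string_view> docs;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string_view::npos) end = buffer.size();
        std::string_view line = buffer.substr(start, end - start);
        // Tolerate Windows line endings by dropping a trailing '\r'.
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        // Skip blank lines between documents.
        if (!line.empty()) docs.push_back(line);
        start = end + 1;
    }
    return docs;
}
```

Each returned view could then be handed to a parser independently, which is what makes the format a natural fit for processing documents in batches or across threads.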

Real-world usage

If you are planning to use simdjson in a product, please work from one of our releases.

Bindings and Ports of simdjson

We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).

About simdjson

The simdjson library takes advantage of modern microarchitectures, parallelizing with SIMD vector instructions, reducing branch misprediction, and reducing data dependency to take advantage of each CPU's multiple execution cores.
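
As a small scalar illustration of the "reducing branch misprediction" idea, the sketch below counts JSON structural characters in two ways: with a data-dependent `if`, and with a lookup table whose result is added unconditionally. This is an assumption-laden toy, not simdjson's kernel: the function names are hypothetical, it works one byte at a time rather than with SIMD registers, and it counts structural characters even when they appear inside strings, which a real parser must not do.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string_view>

// Build a 256-entry table: 1 for JSON structural characters, 0 otherwise.
std::array<std::uint8_t, 256> make_structural_table() {
    std::array<std::uint8_t, 256> table{};
    for (unsigned char c : {'{', '}', '[', ']', ':', ','}) table[c] = 1;
    return table;
}

// Branchy version: the outcome of the test depends on the input bytes.
std::size_t count_structurals_branchy(std::string_view json) {
    std::size_t n = 0;
    for (char c : json) {
        if (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',') {
            ++n;
        }
    }
    return n;
}

// Table version: every byte does the same work, a load and an add,
// so the loop body has no data-dependent branch in the source.
std::size_t count_structurals_table(std::string_view json) {
    static const std::array<std::uint8_t, 256> table = make_structural_table();
    std::size_t n = 0;
    for (char c : json) n += table[static_cast<unsigned char>(c)];
    return n;
}
```

Both functions return the same count for any input. Whether the second is actually faster depends on the compiler and the data, since an optimizer may already remove the branches from the first version.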

Some people enjoy reading our paper: A description of the design and implementation of simdjson is in our research article:

We have an in-depth paper focused on the UTF-8 validation:

We also have an informal blog post providing some background and context.

For the video inclined, simdjson at QCon San Francisco 2019 (it was the best voted talk, we're kinda proud of it).

Funding

The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.

Contributing to simdjson

Head over to CONTRIBUTING.md for information on contributing to simdjson, and HACKING.md for information on source, building, and architecture/design.

License

This code is made available under the Apache License 2.0.

Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.

For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license (http://www.boost.org/LICENSE_1_0.txt). Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

For efficient number serialization, we bundle Florian Loitsch's implementation of the Grisu2 algorithm for binary to decimal floating-point numbers. The implementation was slightly modified by the JSON for Modern C++ library. Both Florian Loitsch's implementation and JSON for Modern C++ are provided under the MIT license.

For runtime dispatching, we use some code from the PyTorch project licensed under 3-clause BSD.