On Demand Basics

On Demand is a new, faster simdjson API with all the ease of use you're used to. While it provides a familiar DOM interface, under the hood it is anything but: it parses values as you use them. This means you don't waste time parsing JSON you don't use, and you don't pay the cost of generating an intermediate DOM tree.

This document is an overview of what you need to know to use simdjson's On Demand API, with examples.

On Demand supports the same JSON standards and C++ compilers as simdjson's older DOM API. Refer to the DOM docs for more information.

For deeper information about the design and implementation of simdjson's ondemand API, refer to the design document.

Including ondemand

To include simdjson, copy simdjson.h and simdjson.cpp into your project. Then include it in your project with:

#include "simdjson.h"
using namespace simdjson; // optional
using namespace simdjson::builtin; // optional, for ondemand

You can compile with:

c++ -march=native myproject.cpp simdjson.cpp

Note:

  • Users on macOS and other platforms where the compiler does not enable C++11 (or later) by default should request a standard explicitly with the appropriate flag (e.g., c++ -march=native -std=c++17 myproject.cpp simdjson.cpp).

The native architecture flag

Passing -march=native to the compiler makes On Demand much faster by allowing it to use optimizations specific to your machine. You cannot do this, however, if the compiled code might run on machines whose processors lack those features.

On Demand uses advanced architecture-specific code for many common processors to make JSON preprocessing and string parsing faster. By default, however, most C++ compilers target the lowest common denominator (since the program could theoretically be run anywhere). Since On Demand is inlined into your own code, it cannot use these advanced versions unless the compiler is told to target them. -march=native says "target the current computer," which is a reasonable default for the many applications that are compiled and run on the same processor.
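
One way to see what the compiler was allowed to target is to inspect the feature macros it predefines. The sketch below is only an illustration: compiled_simd_targets is a hypothetical helper name (not part of simdjson), and it assumes the macros that GCC and Clang define (`__AVX2__`, `__SSE4_2__`, `__ARM_NEON`); other compilers may define only some of them.

```cpp
#include <string>

// Lists the SIMD instruction sets this translation unit was compiled for,
// based on feature macros predefined by GCC and Clang.
// compiled_simd_targets is a hypothetical helper, not a simdjson function.
inline std::string compiled_simd_targets() {
  std::string targets;
#if defined(__AVX2__)
  targets += "avx2 ";
#endif
#if defined(__SSE4_2__)
  targets += "sse4.2 ";
#endif
#if defined(__ARM_NEON)
  targets += "neon ";
#endif
  if (targets.empty()) { targets = "baseline "; } // none of the macros above is defined
  targets.pop_back(); // drop the trailing space
  return targets;
}
```

Printing the result from a build with -march=native and from a build without it should show whether the flag changed what the compiler targets on your machine.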

The Basics: Loading and Parsing JSON Documents

The simdjson library offers a simple DOM-like API, which you can access by creating an ondemand::parser and calling the iterate() method:

ondemand::parser parser;
auto json = padded_string::load("twitter.json");
ondemand::document doc = parser.iterate(json); // load and parse a file

Or by creating a padded string (for efficiency reasons, simdjson requires a string with SIMDJSON_PADDING bytes at the end) and calling iterate():

ondemand::parser parser;
auto json = "[1,2,3]"_padded; // The _padded suffix creates a simdjson::padded_string instance
ondemand::document doc = parser.iterate(json); // parse a string
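
To picture what "padded" means, here is a minimal standard-library sketch of a buffer that carries extra bytes after the JSON text, presumably so that wide reads near the end of the input stay inside the allocation. It is not how padded_string is implemented: make_padded_copy and kExamplePadding are hypothetical names, and the value 32 is an assumption for illustration (the real amount is SIMDJSON_PADDING).

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical padding size for illustration; the real value is SIMDJSON_PADDING.
constexpr std::size_t kExamplePadding = 32;

// Copies `json` into a buffer that has `padding` extra zero bytes at the end.
// The JSON text occupies the first json.size() bytes of the result.
inline std::vector<char> make_padded_copy(const std::string &json,
                                          std::size_t padding = kExamplePadding) {
  std::vector<char> buffer(json.size() + padding, '\0');
  if (!json.empty()) { std::memcpy(buffer.data(), json.data(), json.size()); }
  return buffer;
}
```

In real code, prefer padded_string or the _padded suffix shown above, which take care of this for you.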

Documents Are Iterators

A document is not a fully-parsed JSON value; rather, it is an iterator over the JSON text. This means that while you iterate an array, or search for a field in an object, it is actually walking through the original JSON text, merrily reading commas and colons and brackets to make sure you get where you're going. This is the key to On Demand's performance: since it's just an iterator, it lets you parse values as you use them. And particularly, it lets you skip values you don't want to use.
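
The idea of skipping a value without parsing it can be illustrated with a toy scanner. This is a simplified sketch and not simdjson's implementation: skip_value is a hypothetical function that assumes well-formed JSON and only tracks quotes and brackets, never converting a number or unescaping a string.

```cpp
#include <cstddef>
#include <string>

// Returns the index just past the JSON value that starts at `pos`.
// Toy sketch: assumes well-formed JSON and that `pos` is the first character of a value.
inline std::size_t skip_value(const std::string &json, std::size_t pos) {
  int depth = 0;          // how many arrays/objects we are currently inside
  bool in_string = false; // brackets and commas inside strings must be ignored
  for (std::size_t i = pos; i < json.size(); i++) {
    const char c = json[i];
    if (in_string) {
      if (c == '\\') {
        i++; // skip the escaped character
      } else if (c == '"') {
        in_string = false;
        if (depth == 0) { return i + 1; } // the value was a string and it just ended
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '[' || c == '{') {
      depth++;
    } else if (c == ']' || c == '}') {
      if (depth == 0) { return i; } // a scalar ended at the enclosing container's close
      depth--;
      if (depth == 0) { return i + 1; } // the array or object we started on just closed
    } else if (depth == 0 &&
               (c == ',' || c == ' ' || c == '\n' || c == '\t' || c == '\r')) {
      return i; // a scalar (number, true, false, null) ended
    }
  }
  return json.size();
}
```

For example, with the text {"a":[1,"]"],"b":2}, calling skip_value at the [ steps over the whole array, including the bracket hidden inside the string, without parsing any of its contents.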

Parser, Document and JSON Scope

Because a document is an iterator over the JSON text, both the JSON text and the parser must remain alive (in scope) while you are using it. Further, a parser may have at most one document open at a time, since it holds allocated memory used for the parsing.

During the iterate call, the original JSON text is never modified--only read. After you're done with the document, the source (whether file or string) can be safely discarded.

For best performance, a parser instance should be reused over several files: otherwise you will needlessly reallocate memory, which is expensive. It is also possible to avoid memory allocations entirely during parsing when using simdjson. See our performance notes for details.
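
The benefit of reuse can be sketched with a toy stand-in that owns a scratch buffer. Nothing here is simdjson code: scratch_parser, process and the allocations counter are hypothetical names that only model "grow the buffer when the input is larger than anything seen so far".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a parser that owns working memory (not simdjson's parser).
struct scratch_parser {
  std::vector<char> scratch;   // working memory kept between documents
  std::size_t allocations = 0; // how many times the scratch buffer had to grow

  // Pretends to process `json`: grows the scratch buffer only if it is too small.
  void process(const std::string &json) {
    if (scratch.capacity() < json.size()) {
      scratch.reserve(json.size());
      allocations++;
    }
    scratch.assign(json.begin(), json.end()); // fits in the existing capacity
  }
};
```

Processing inputs of 1000, 10 and 500 bytes with one scratch_parser grows the buffer once, whereas creating a fresh instance per input grows it three times.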

Using the Parsed JSON

Once you have a document, you can navigate it with idiomatic C++ iterators, operators and casts. The following shows how to use the JSON when exceptions are enabled, but simdjson has full, idiomatic support for users who avoid exceptions. See the simdjson DOM API's error handling documentation for more.

  • Extracting Values: You can cast a JSON element to a native type: double(element) or double x = json_element. This works for double, uint64_t, int64_t, bool, ondemand::object and ondemand::array. At this point, the number, string or boolean will be parsed, or the initial [ or { will be verified. An exception is thrown if the cast is not possible.

    IMPORTANT NOTE: values can only be parsed once. Since documents are iterators, once you have parsed a value (such as by casting to double), you can't get at it again.

  • Field Access: To get the value of the "foo" field in an object, use object["foo"]. This will scan through the object looking for the field with the matching string.

    NOTE: simdjson does not unescape keys when matching. This is not generally a problem for applications with well-defined key names (which generally do not use escapes). If you do need this support, it's best to iterate through the object fields to find the field you are looking for.

    By default, field lookup is order-insensitive, so you can look up values in any order. However, we still encourage you to look up fields in the order you expect them in the JSON, as it is still much faster.

    If you want to enforce finding fields in order, you can use object.find_field("foo") instead. This will only look forward, so it will not find fields that appear before the current position. For example, this fails:

    ondemand::parser parser;
    auto json = R"(  { "x": 1, "y": 2 }  )"_padded;
    auto doc = parser.iterate(json);
    double y = doc.find_field("y"); // The cursor is now after the 2 (at })
    double x = doc.find_field("x"); // This fails, because there are no more fields after "y"
    

    By contrast, using the default (order-insensitive) lookup succeeds:

    ondemand::parser parser;
    auto json = R"(  { "x": 1, "y": 2 }  )"_padded;
    auto doc = parser.iterate(json);
    double y = doc["y"]; // The cursor is now after the 2 (at })
    double x = doc["x"]; // Success: [] loops back around to find "x"
    
  • Array Iteration: To iterate through an array, use for (auto value : array) { ... }. This will step through each value in the JSON array.

    If you know the type of the value, you can cast it right there, too! for (double value : array) { ... }.

  • Object Iteration: You can iterate through an object's fields, as well: for (auto field : object) { ... }

    • field.unescaped_key() will get you the key string.
    • field.value() will get you the value, which you can then use all these other methods on.
  • Array Index: Because it is forward-only, you cannot look up an array element by index. Instead, you will need to iterate through the array and keep an index yourself.
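
The difference between forward-only and order-insensitive field lookup can be modeled with a small cursor over a list of fields. This is a toy model under stated assumptions, not simdjson's implementation: field_cursor, find_forward and find_wrapping are hypothetical names, and the fields are assumed to be already split into key/value pairs.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model of an object cursor: fields are stored in document order and
// `next` is the index of the next unread field.
struct field_cursor {
  std::vector<std::pair<std::string, double>> fields;
  std::size_t next = 0;

  // Forward-only lookup: scans from the cursor to the end and never goes back.
  bool find_forward(const std::string &key, double &out) {
    for (std::size_t i = next; i < fields.size(); i++) {
      if (fields[i].first == key) { out = fields[i].second; next = i + 1; return true; }
    }
    next = fields.size(); // the scan consumed everything after the cursor
    return false;
  }

  // Order-insensitive lookup: tries forward first, then wraps around to the start.
  bool find_wrapping(const std::string &key, double &out) {
    const std::size_t start = next;
    if (find_forward(key, out)) { return true; }
    for (std::size_t i = 0; i < start; i++) {
      if (fields[i].first == key) { out = fields[i].second; next = i + 1; return true; }
    }
    return false;
  }
};
```

With the fields x and y in that order, find_forward("y") followed by find_forward("x") fails on the second call, while the same two calls with find_wrapping succeed at the cost of a second scan. That extra scan is the reason in-order lookups are cheaper in this model.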

Examples

The following code illustrates many of the above concepts:

ondemand::parser parser;
auto cars_json = R"( [
  { "make": "Toyota", "model": "Camry",  "year": 2018, "tire_pressure": [ 40.1, 39.9, 37.7, 40.4 ] },
  { "make": "Kia",    "model": "Soul",   "year": 2012, "tire_pressure": [ 30.1, 31.0, 28.6, 28.7 ] },
  { "make": "Toyota", "model": "Tercel", "year": 1999, "tire_pressure": [ 29.8, 30.0, 30.2, 30.5 ] }
] )"_padded;

// Iterating through an array of objects
for (ondemand::object car : parser.iterate(cars_json)) {
  // Accessing a field by name
  cout << "Make/Model: " << std::string_view(car["make"]) << "/" << std::string_view(car["model"]) << endl;

  // Casting a JSON element to an integer
  uint64_t year = car["year"];
  cout << "- This car is " << 2020 - year << " years old." << endl;

  // Iterating through an array of floats
  double total_tire_pressure = 0;
  for (double tire_pressure : car["tire_pressure"]) {
    total_tire_pressure += tire_pressure;
  }
  cout << "- Average tire pressure: " << (total_tire_pressure / 4) << endl;
}

Here is a different example illustrating the same ideas:

ondemand::parser parser;
auto points_json = R"( [
    {  "12345" : {"x":12.34, "y":56.78, "z": 9998877}   },
    {  "12545" : {"x":11.44, "y":12.78, "z": 11111111}  }
  ] )"_padded;

// Parse and iterate through an array of objects
for (ondemand::object points : parser.iterate(points_json)) {
  for (auto point : points) {
    cout << "id: " << std::string_view(point.unescaped_key()) << ": (";
    cout << point.value()["x"].get_double() << ", ";
    cout << point.value()["y"].get_double() << ", ";
    cout << point.value()["z"].get_int64() << endl;
  }
}

And another one:

auto abstract_json = R"(
  { "str" : { "123" : {"abc" : 3.14 } } }
)"_padded;
ondemand::parser parser;
auto doc = parser.iterate(abstract_json);
cout << doc["str"]["123"]["abc"].get_double() << endl; // Prints 3.14

  • Extracting Values (without exceptions): You can use get() with error codes to avoid exceptions. First declare a variable of the appropriate type (double, uint64_t, int64_t, bool, ondemand::object or ondemand::array), then pass it by reference to get(), which returns an error code. For example:

    auto abstract_json = R"(
      { "str" : { "123" : {"abc" : 3.14 } } }
    )"_padded;
    ondemand::parser parser;
    
    double value;
    auto doc = parser.iterate(abstract_json);
    auto error = doc["str"]["123"]["abc"].get(value);
    if (error) { std::cerr << error << std::endl; return EXIT_FAILURE; }
    cout << value << endl; // Prints 3.14
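
The same "declare, pass by reference, check the returned code" pattern can be shown with the standard library alone. This sketch is not simdjson code: toy_error and toy_get_double are hypothetical names, and std::strtod is used as a stand-in for the actual number parsing.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical error codes for this sketch; simdjson defines its own error type.
enum class toy_error { success = 0, incorrect_type };

// Writes the parsed number to `out` and reports failure through the return
// value instead of throwing. `out` is left untouched on failure.
inline toy_error toy_get_double(const std::string &text, double &out) {
  char *end = nullptr;
  const double parsed = std::strtod(text.c_str(), &end);
  // Fail if nothing was consumed or if trailing characters remain.
  if (end == text.c_str() || *end != '\0') { return toy_error::incorrect_type; }
  out = parsed;
  return toy_error::success;
}
```

A caller checks the returned code before using the value, exactly as the get(value) example above does, so a failure never leaves the caller holding a half-initialized result.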