Improving the documentation.

Daniel Lemire 2019-07-29 14:10:49 -04:00
parent 771e9cd68a
commit 3c0f5a3fe4
2 changed files with 31 additions and 4 deletions

@@ -325,9 +325,8 @@ The parser builds a useful immutable (read-only) DOM (document-object model) whi
 To simplify the engineering, we make some assumptions.
-- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding.
+- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding. If the UTF-8 input contains a leading BOM, it should be omitted: the user is responsible for detecting and skipping the BOM; UTF-8 BOMs are discouraged.
 - All strings in the JSON document may have up to 4294967295 bytes in UTF-8 (4GB). To enforce this constraint, we refuse to parse a document that contains more than 4294967295 bytes (4GB). This should accommodate most JSON documents.
-- In cases of failure, we report a failure without any indication to the nature of the problem. (This can be easily improved without affecting performance.)
 - As allowed by the specification, we allow repeated keys within an object (other parsers like sajson do the same).
 - Performance is optimized for JSON documents spanning at least tens of kilobytes up to many megabytes: the performance issues with having to parse many tiny JSON documents or one truly enormous JSON document are different.
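The BOM rule added in this hunk leaves detection to the caller. A minimal sketch of such a check follows; the helper name `utf8_bom_length` is hypothetical and not part of the library. It relies only on the fact that a UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of simdjson): returns how many leading bytes
// the caller should skip, i.e. 3 if buf starts with a UTF-8 BOM
// (0xEF 0xBB 0xBF) and 0 otherwise.
inline size_t utf8_bom_length(const uint8_t *buf, size_t len) {
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
    return 3;
  }
  return 0;
}
```

A caller would compute `size_t skip = utf8_bom_length(buf, len);` and then hand `buf + skip` and `len - skip` to the parser. Skipping leading bytes does not shorten the readable region after the end of the document, so any padding behind the buffer is unaffected.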
@@ -339,9 +338,11 @@ _We do not aim to provide a general-purpose JSON library._ A library like RapidJ
 - We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent an integer exactly as a signed 64-bit value, we reject the JSON document.
 - We support the full range of 64-bit floating-point numbers (binary64). The values range from `std::numeric_limits<double>::lowest()` to `std::numeric_limits<double>::max()`, so from -1.7976e308 all the way to 1.7976e308. Extreme values (less than or equal to -1e308, greater than or equal to 1e308) are rejected: we refuse to parse the input document.
 - We aim for accurate float parsing with a bound on the [unit of least precision (ULP)](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of one.
-- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
+- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation. The sajson parser does incomplete UTF-8 validation, accepting byte
+sequences like 0xb1 0x87.)
 - We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
 - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
+- We fully validate the white-space characters outside of the strings. Parsers like RapidJSON will accept JSON documents with null characters outside of strings.
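To illustrate why a sequence like 0xb1 0x87 must be rejected, here is a plain scalar UTF-8 validity check. It is only a sketch under the rules of RFC 3629, not the library's implementation, and the function name `is_valid_utf8` is an assumption made for this example. Both 0xb1 and 0x87 have the bit pattern 10xxxxxx, so they are continuation bytes with no lead byte before them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar validator (illustrative only, not simdjson's code).
// Returns true if buf[0..len) is well-formed UTF-8.
bool is_valid_utf8(const uint8_t *buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    const uint8_t b = buf[i];
    if (b < 0x80) { // ASCII: one byte, always valid
      i++;
      continue;
    }
    size_t n;        // number of continuation bytes that must follow
    uint32_t cp;     // code point being decoded
    uint32_t min_cp; // smallest code point allowed for this sequence length
    if ((b & 0xE0) == 0xC0) {
      n = 1; cp = b & 0x1F; min_cp = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      n = 2; cp = b & 0x0F; min_cp = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      n = 3; cp = b & 0x07; min_cp = 0x10000;
    } else {
      return false; // stray continuation byte (such as 0xb1) or invalid lead byte
    }
    if (len - i <= n) {
      return false; // sequence is truncated at the end of the buffer
    }
    for (size_t k = 1; k <= n; k++) {
      const uint8_t c = buf[i + k];
      if ((c & 0xC0) != 0x80) {
        return false; // expected a continuation byte (10xxxxxx)
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp) return false;                  // overlong encoding
    if (cp > 0x10FFFF) return false;                // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
    i += n + 1;
  }
  return true;
}
```

For the two-byte input 0xb1 0x87, the first byte matches none of the lead-byte patterns, so the function returns false immediately.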
 ## Architecture

@@ -74,6 +74,10 @@ int json_parse_implementation(const uint8_t *buf, size_t len, ParsedJson &pj, bo
 }
 // Parse a document found in buf.
+//
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@@ -94,6 +98,10 @@ inline int json_parse(const uint8_t *buf, size_t len, ParsedJson &pj, bool reall
 }
 // Parse a document found in buf.
+//
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@@ -131,6 +139,10 @@ inline int json_parse(const std::string &s, ParsedJson &pj) {
 }
 // Parse a document found in string s.
+//
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@@ -150,9 +162,12 @@ inline int json_parse(const padded_string &s, ParsedJson &pj) {
 // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
 // (a copy of the input string is made).
 //
-// the input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
+// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
 // all bytes at and after buf + len are ignored (can be garbage).
 //
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // This is a convenience function which calls json_parse.
 WARN_UNUSED
 ParsedJson build_parsed_json(const uint8_t *buf, size_t len, bool reallocifneeded = true);
@@ -162,9 +177,14 @@ WARN_UNUSED
 // by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
 // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
 // (a copy of the input string is made).
+//
 // The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
 // all bytes at and after buf + len are ignored (can be garbage).
 //
+//
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // This is a convenience function which calls json_parse.
 inline ParsedJson build_parsed_json(const char * buf, size_t len, bool reallocifneeded = true) {
   return build_parsed_json(reinterpret_cast<const uint8_t *>(buf), len, reallocifneeded);
@@ -183,6 +203,9 @@ ParsedJson build_parsed_json(const char *buf) = delete;
 // A temporary buffer is created when needed during processing
 // (a copy of the input string is made).
 //
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // This is a convenience function which calls json_parse.
 WARN_UNUSED
 inline ParsedJson build_parsed_json(const std::string &s) {
@@ -195,6 +218,9 @@ inline ParsedJson build_parsed_json(const std::string &s) {
 // Return SUCCESS (an integer = 0) in case of a success. You can also check validity
 // by calling pj.isValid(). The same ParsedJson can be reused for other documents.
 //
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // This is a convenience function which calls json_parse.
 WARN_UNUSED
 inline ParsedJson build_parsed_json(const padded_string &s) {