Improving the documentation.

Daniel Lemire 2019-07-29 14:10:49 -04:00
parent 771e9cd68a
commit 3c0f5a3fe4
2 changed files with 31 additions and 4 deletions


@ -325,9 +325,8 @@ The parser builds a useful immutable (read-only) DOM (document-object model) whi
To simplify the engineering, we make some assumptions.
- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding.
- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding. If the input begins with a UTF-8 BOM, the BOM must not be passed to the parser: the user is responsible for detecting and skipping it. UTF-8 BOMs are discouraged.
- All strings in the JSON document may have up to 4294967295 bytes in UTF-8 (4GB). To enforce this constraint, we refuse to parse a document that contains more than 4294967295 bytes (4GB). This should accommodate most JSON documents.
- In cases of failure, we report a failure without any indication as to the nature of the problem. (This can be easily improved without affecting performance.)
- As allowed by the specification, we allow repeated keys within an object (other parsers like sajson do the same).
- Performance is optimized for JSON documents spanning from at least tens of kilobytes up to many megabytes: the performance issues with having to parse many tiny JSON documents or one truly enormous JSON document are different.
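
Since the caller is responsible for detecting and skipping a UTF-8 BOM, the check can be sketched as below. This is a minimal illustration, not part of the library: the helper name `utf8_bom_length` is hypothetical, and the only fact it relies on is that the UTF-8 BOM is the three-byte sequence EF BB BF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not a simdjson API).
// Returns the number of leading bytes to skip: 3 if buf starts with the
// UTF-8 BOM (0xEF 0xBB 0xBF), 0 otherwise.
inline size_t utf8_bom_length(const uint8_t *buf, size_t len) {
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
    return 3;
  }
  return 0;
}
```

A caller would then hand `buf + skip` and `len - skip` to the parser, where `skip` is the returned value. Any padding requirement on the input buffer still applies to the adjusted pointer.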
@ -339,9 +338,11 @@ _We do not aim to provide a general-purpose JSON library._ A library like RapidJ
- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
- We support the full range of 64-bit floating-point numbers (binary64). The values range from `std::numeric_limits<double>::lowest()` to `std::numeric_limits<double>::max()`, so from -1.7976e308 all the way to 1.7976e308. Extreme values (less than or equal to -1e308, greater than or equal to 1e308) are rejected: we refuse to parse the input document.
- We aim for accurate float parsing with a bound on the [unit of least precision (ULP)](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of one.
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation. The sajson parser does incomplete UTF-8 validation, accepting invalid byte sequences like 0xb1 0x87.)
- We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
- We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
- We fully validate the white-space characters outside of the strings. Parsers like RapidJSON will accept JSON documents with null characters outside of strings.
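
To make the UTF-8 point concrete, here is a minimal scalar sketch of what full validation has to check. It is an illustration written for this document, not the library's vectorized validator, and the function name `is_valid_utf8` is hypothetical. It rejects stray continuation bytes (such as 0xb1 0x87), truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not a simdjson API).
// Returns true if buf[0..len) is well-formed UTF-8.
inline bool is_valid_utf8(const uint8_t *buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    const uint8_t lead = buf[i];
    if (lead < 0x80) { // ASCII byte
      i++;
      continue;
    }
    size_t n;        // number of continuation bytes expected
    uint32_t cp;     // code point being decoded
    uint32_t min_cp; // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      n = 1; cp = lead & 0x1F; min_cp = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      n = 2; cp = lead & 0x0F; min_cp = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      n = 3; cp = lead & 0x07; min_cp = 0x10000;
    } else {
      return false; // stray continuation byte (0x80..0xBF) or invalid lead
    }
    if (len - i <= n) {
      return false; // sequence is truncated
    }
    for (size_t k = 1; k <= n; k++) {
      const uint8_t c = buf[i + k];
      if ((c & 0xC0) != 0x80) {
        return false; // not a continuation byte
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp) return false;                  // overlong encoding
    if (cp > 0x10FFFF) return false;                // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
    i += n + 1;
  }
  return true;
}
```

With this check, the two-byte input 0xb1 0x87 is rejected immediately, because 0xb1 is a continuation byte with no lead byte before it.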
## Architecture


@ -74,6 +74,10 @@ int json_parse_implementation(const uint8_t *buf, size_t len, ParsedJson &pj, bo
}
// Parse a document found in buf.
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
//
// The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@ -94,6 +98,10 @@ inline int json_parse(const uint8_t *buf, size_t len, ParsedJson &pj, bool reall
}
// Parse a document found in buf.
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
//
// The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@ -131,6 +139,10 @@ inline int json_parse(const std::string &s, ParsedJson &pj) {
}
// Parse a document found in string s.
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
//
// The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from
@ -150,8 +162,11 @@ inline int json_parse(const padded_string &s, ParsedJson &pj) {
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
//
// the input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false;
// all bytes at and after buf + len are ignored (can be garbage).
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// This is a convenience function which calls json_parse.
WARN_UNUSED
@ -162,9 +177,14 @@ WARN_UNUSED
// by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
//
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false;
// all bytes at and after buf + len are ignored (can be garbage).
//
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// This is a convenience function which calls json_parse.
inline ParsedJson build_parsed_json(const char * buf, size_t len, bool reallocifneeded = true) {
return build_parsed_json(reinterpret_cast<const uint8_t *>(buf), len, reallocifneeded);
@ -182,6 +202,9 @@ ParsedJson build_parsed_json(const char *buf) = delete;
//
// A temporary buffer is created when needed during processing
// (a copy of the input string is made).
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// This is a convenience function which calls json_parse.
WARN_UNUSED
@ -194,6 +217,9 @@ inline ParsedJson build_parsed_json(const std::string &s) {
// You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
// Return SUCCESS (an integer = 0) in case of a success. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
// is responsible for omitting it; UTF-8 BOMs are discouraged.
//
// This is a convenience function which calls json_parse.
WARN_UNUSED