Improving the documentation.

2019-07-29 14:10:49 -04:00 · 2019-07-29 14:10:49 -04:00 · 3c0f5a3fe4
parent 771e9cd68a
commit 3c0f5a3fe4
2 changed files with 31 additions and 4 deletions
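
This commit documents that the caller must detect and skip a leading UTF-8 BOM (the three bytes 0xEF 0xBB 0xBF) before handing the input to the parser. A minimal sketch of such a check follows; `skip_utf8_bom` is a hypothetical helper written for illustration and is not part of simdjson.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper (not a simdjson function): returns a pointer and length
// that exclude a leading UTF-8 BOM (0xEF 0xBB 0xBF) when one is present.
// The input is returned unchanged when there is no BOM.
inline std::pair<const uint8_t *, size_t> skip_utf8_bom(const uint8_t *buf,
                                                        size_t len) {
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
    return {buf + 3, len - 3}; // skip the three BOM bytes
  }
  return {buf, len};
}
```

The returned pointer and length could then be passed to `json_parse(buf, len, pj)`. Because the end of the buffer does not move, any `SIMDJSON_PADDING` bytes that were readable after the original input remain readable after the adjusted one.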
--- a/README.md
+++ b/README.md
@@ -325,9 +325,8 @@ The parser builds a useful immutable (read-only) DOM (document-object model) whi

 To simplify the engineering, we make some assumptions.

- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding.
+- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe this is a genuine limitation, because we do not think there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding. If the UTF-8 input starts with a byte-order mark (BOM), the BOM must not be passed to the parser: the user is responsible for detecting and skipping it. UTF-8 BOMs are discouraged.
 - All strings in the JSON document may have up to 4294967295 bytes in UTF-8 (4GB). To enforce this constraint, we refuse to parse a document that contains more than 4294967295 bytes (4GB). This should accommodate most JSON documents.
- In cases of failure, we report a failure without any indication to the nature of the problem. (This can be easily improved without affecting performance.)
 - As allowed by the specification, we allow repeated keys within an object (other parsers like sajson do the same).
 - Performance is optimized for JSON documents spanning at least a tens kilobytes up to many megabytes: the performance issues with having to parse many tiny JSON documents or one truly enormous JSON document are different.

@@ -339,9 +338,11 @@ _We do not aim to provide a general-purpose JSON library._ A library like RapidJ
 - We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers in [-9223372036854775808,9223372036854775808), like a Java `long` or a C/C++ `long long`. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
 - We support the full range of 64-bit floating-point numbers (binary64). The values range from ` std::numeric_limits<double>::lowest()`  to `std::numeric_limits<double>::max()`, so from -1.7976e308 all the way to 1.7975e308. Extreme values (less or equal to -1e308, greater or equal to 1e308) are rejected: we refuse to parse the input document.
 - We aim for accurate float parsing with a bound on the [unit of least precision (ULP)](https://en.wikipedia.org/wiki/Unit_in_the_last_place) of one.
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
+- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation. The sajson parser does incomplete UTF-8 validation, accepting invalid byte
+sequences like 0xb1 0x87.)
 - We fully validate the numbers. (Parsers like gason and ultranjson will accept `[0e+]` as valid JSON.)
 - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
+- We fully validate the white-space characters outside of the strings. Parsers like RapidJSON will accept JSON documents with null characters outside of strings.

 ## Architecture

--- a/include/simdjson/jsonparser.h
+++ b/include/simdjson/jsonparser.h
@@ -74,6 +74,10 @@ int json_parse_implementation(const uint8_t *buf, size_t len, ParsedJson &pj, bo
 }

 // Parse a document found in buf. 
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from 
@@ -94,6 +98,10 @@ inline int json_parse(const uint8_t *buf, size_t len, ParsedJson &pj, bool reall
 }

 // Parse a document found in buf.
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from 
@@ -131,6 +139,10 @@ inline int json_parse(const std::string &s, ParsedJson &pj) {
 }

 // Parse a document found in in string s.
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 //
 // The function returns simdjson::SUCCESS (an integer = 0) in case of a success or an error code from 
@@ -150,8 +162,11 @@ inline int json_parse(const padded_string &s, ParsedJson &pj) {
 // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
 // (a copy of the input string is made).
 //
-// the input buf should be readable up to buf + len + SIMDJSON_PADDING  if reallocifneeded is false,
+// The input buf should be readable up to buf + len + SIMDJSON_PADDING  if reallocifneeded is false,
 // all bytes at and after buf + len  are ignored (can be garbage).
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
 //
 // This is a convenience function which calls json_parse.
 WARN_UNUSED
@@ -162,9 +177,14 @@ WARN_UNUSED
 // by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
 // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
 // (a copy of the input string is made).
+//
 // The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
 // all bytes at and after buf + len  are ignored (can be garbage).
 //
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
+//
 // This is a convenience function which calls json_parse.
 inline ParsedJson build_parsed_json(const char * buf, size_t len, bool reallocifneeded = true) {
  return build_parsed_json(reinterpret_cast<const uint8_t *>(buf), len, reallocifneeded);
@@ -182,6 +202,9 @@ ParsedJson build_parsed_json(const char *buf) = delete;
 //
 // A temporary buffer is created when needed during processing
 // (a copy of the input string is made).
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
 //
 // This is a convenience function which calls json_parse.
 WARN_UNUSED
@@ -194,6 +217,9 @@ inline ParsedJson build_parsed_json(const std::string &s) {
 // You need to preallocate ParsedJson with a capacity of len (e.g., pj.allocateCapacity(len)).
 // Return SUCCESS (an integer = 0) in case of a success. You can also check validity
 // by calling pj.isValid(). The same ParsedJson can be reused for other documents.
+// 
+// The content should be a valid JSON document encoded as UTF-8. If there is a UTF-8 BOM, the caller
+// is responsible for omitting it; UTF-8 BOMs are discouraged.
 //
 // This is a convenience function which calls json_parse.
 WARN_UNUSED