Better doc.

This commit is contained in:
Daniel Lemire 2018-12-10 22:00:16 -05:00
parent d61aa21d50
commit 058eb917d1
3 changed files with 77 additions and 39 deletions


@ -68,18 +68,17 @@ make parsingcompetition
To simplify the engineering, we make some assumptions. To simplify the engineering, we make some assumptions.
- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe that this is a genuine limitation in the sense that we do not think that there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding. - We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe that this is a genuine limitation in the sense that we do not think that there is any serious application that needs to process JSON data without an ASCII or UTF-8 encoding.
- We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included though it can be done. - We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included though it can be done. We plan to support ARM processors (help is invited).
- We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though it should not be difficult. - We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though it should not be difficult (help is invited).
- We expect the input memory to be readable up to 32 bytes beyond the end of the JSON document (to support fast vector loads). All bytes beyond the end of the JSON document are ignored (they can be garbage), and the JSON document does not need to be null-terminated. You can allocate a suitably padded memory region with the provided `allocate_padded_buffer` function or simply by allocating your memory with extra capacity (`malloc(length + SIMDJSON_PADDING)`). (We expect that this limitation can be lifted without a performance penalty, at the cost of some added code complexity.)
- In cases of failure, we just report a failure without any indication as to the nature of the problem. (This can be easily improved without affecting performance.) - In cases of failure, we just report a failure without any indication as to the nature of the problem. (This can be easily improved without affecting performance.)
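
The padding requirement above can be met by copying the document into an over-allocated region. The following is a minimal sketch, not the library's own `allocate_padded_buffer`: the helper `copy_with_padding` and the constant `kAssumedPadding` are hypothetical names, and the value 32 is taken from the "32 bytes beyond the end" wording rather than from the real `SIMDJSON_PADDING` definition.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string_view>

// Assumption: stands in for SIMDJSON_PADDING (the text asks for 32 readable bytes).
constexpr std::size_t kAssumedPadding = 32;

// Hypothetical helper, not part of the library: copy `json` into a malloc'ed
// region that has kAssumedPadding extra readable bytes after the document.
// Returns nullptr on allocation failure; release the result with std::free.
inline char *copy_with_padding(std::string_view json) {
  char *buf = static_cast<char *>(std::malloc(json.size() + kAssumedPadding));
  if (buf == nullptr) {
    return nullptr;
  }
  if (!json.empty()) {
    std::memcpy(buf, json.data(), json.size());
  }
  // The parser ignores the padding bytes; zeroing them only keeps
  // memory checkers from flagging reads of uninitialized memory.
  std::memset(buf + json.size(), 0, kAssumedPadding);
  return buf;
}
```

The returned pointer and the original length (not the padded length) are what a caller would pass to the parser; the buffer must outlive the parse call.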
## Features ## Features
- The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers. - We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers.
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.) - We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
- We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.) - We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
- We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tags in strings.) - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tags in strings.)
- The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
## Architecture ## Architecture


@ -12,46 +12,71 @@
// Parse a document found in buf, need to preallocate ParsedJson. // Parse a document found in buf, need to preallocate ParsedJson.
// Return false in case of a failure. You can also check validity // Return false in case of a failure. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused. // by calling pj.isValid(). The same ParsedJson can be reused for other documents.
// the input buf should be readable up to buf + len + SIMDJSON_PADDING //
// all bytes at and after buf + len are ignored (can be garbage) // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED WARN_UNUSED
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj); bool json_parse(const u8 *buf, size_t len, ParsedJson &pj, bool reallocifneeded = true);
// the input buf should be readable up to buf + len + SIMDJSON_PADDING // Parse a document found in buf, need to preallocate ParsedJson.
// all bytes at and after buf + len are ignored (can be garbage) // Return false in case of a failure. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED WARN_UNUSED
static inline bool json_parse(const char * buf, size_t len, ParsedJson &pj) { static inline bool json_parse(const char * buf, size_t len, ParsedJson &pj, bool reallocifneeded = true) {
return json_parse((const u8 *) buf, len, pj); return json_parse((const u8 *) buf, len, pj, reallocifneeded);
} }
// convenience function // Parse a document found in buf, need to preallocate ParsedJson.
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING // Return false in case of a failure. You can also check validity
// all bytes at and after s.data()+s.size() are ignored (can be garbage) // by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after s.data()+s.size() are ignored (can be garbage).
WARN_UNUSED WARN_UNUSED
static inline bool json_parse(const std::string_view &s, ParsedJson &pj) { static inline bool json_parse(const std::string_view &s, ParsedJson &pj, bool reallocifneeded = true) {
return json_parse(s.data(), s.size(), pj); return json_parse(s.data(), s.size(), pj, reallocifneeded);
} }
// Build a ParsedJson object. You can check validity // Build a ParsedJson object. You can check validity
// by calling pj.isValid(). This does memory allocation. // by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
// the input buf should be readable up to buf + len + SIMDJSON_PADDING // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// all bytes at and after buf + len are ignored (can be garbage) // (a copy of the input string is made).
//
// the input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED WARN_UNUSED
ParsedJson build_parsed_json(const u8 *buf, size_t len); ParsedJson build_parsed_json(const u8 *buf, size_t len, bool reallocifneeded = true);
WARN_UNUSED WARN_UNUSED
// the input buf should be readable up to buf + len + SIMDJSON_PADDING // Build a ParsedJson object. You can check validity
// all bytes at and after buf + len are ignored (can be garbage) // by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
static inline ParsedJson build_parsed_json(const char * buf, size_t len) { // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
return build_parsed_json((const u8 *) buf, len); // (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
static inline ParsedJson build_parsed_json(const char * buf, size_t len, bool reallocifneeded = true) {
return build_parsed_json((const u8 *) buf, len, reallocifneeded);
} }
// convenience function // convenience function
WARN_UNUSED WARN_UNUSED
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING // Build a ParsedJson object. You can check validity
// all bytes at and after s.data()+s.size() are ignored (can be garbage) // by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
static inline ParsedJson build_parsed_json(const std::string_view &s) { // If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
return build_parsed_json(s.data(), s.size()); // (a copy of the input string is made).
// The input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after s.data()+s.size() are ignored (can be garbage).
static inline ParsedJson build_parsed_json(const std::string_view &s, bool reallocifneeded = true) {
return build_parsed_json(s.data(), s.size(), reallocifneeded);
} }
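
The comments above say that a temporary copy is made "when needed". One way to decide this is to test whether reading the padding bytes past the end of the input could touch a different memory page than the last valid byte. The sketch below is an illustration under that assumption, not necessarily the exact expression used in this commit; the function name and its parameters are hypothetical, and the page size would normally come from something like `sysconf(_SC_PAGESIZE)`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when reading `padding` bytes past the end
// of a buffer starting at address `addr` with `len` valid bytes would reach
// into a different page than the one holding the last valid byte.
inline bool read_past_end_may_cross_page(std::uintptr_t addr, std::size_t len,
                                         std::size_t padding,
                                         std::size_t pagesize) {
  if (len == 0 || pagesize == 0) {
    return true;  // degenerate input: be conservative and ask for a copy
  }
  const std::uintptr_t last_valid = addr + len - 1;       // last document byte
  const std::uintptr_t last_read = last_valid + padding;  // furthest byte a padded read may touch
  return (last_valid / pagesize) != (last_read / pagesize);
}
```

When the function returns false, every byte a padded read can touch lies on a page the caller already owns at least one byte of, so the read cannot fault; when it returns true, copying into a padded buffer is the safe choice.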


@ -1,35 +1,49 @@
#include "simdjson/jsonparser.h" #include "simdjson/jsonparser.h"
#include <unistd.h>
// parse a document found in buf, need to preallocate ParsedJson. // parse a document found in buf, need to preallocate ParsedJson.
WARN_UNUSED WARN_UNUSED
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj) { bool json_parse(const u8 *buf, size_t len, ParsedJson &pj, bool reallocifneeded) {
if (pj.bytecapacity < len) { if (pj.bytecapacity < len) {
std::cerr << "Your ParsedJson cannot support documents that big: " << len std::cerr << "Your ParsedJson cannot support documents that big: " << len
<< std::endl; << std::endl;
return false; return false;
} }
bool reallocated = false;
if(reallocifneeded) {
// realloc is needed if the end of the memory crosses a page
long pagesize = sysconf (_SC_PAGESIZE); // on windows this should be SYSTEM_INFO sysInfo; GetSystemInfo(&sysInfo); sysInfo.dwPageSize
if ( (reinterpret_cast<uintptr_t>(buf + len - 1) % pagesize ) < SIMDJSON_PADDING ) {
const u8 *tmpbuf = buf;
buf = (u8 *) allocate_padded_buffer(len);
if(buf == NULL) return false;
memcpy((void*)buf,tmpbuf,len);
reallocated = true;
}
}
bool isok = find_structural_bits(buf, len, pj); bool isok = find_structural_bits(buf, len, pj);
if (isok) { if (isok) {
isok = flatten_indexes(len, pj); isok = flatten_indexes(len, pj);
} else { } else {
return false; if(reallocated) free((void*)buf);
}//printf("ok\n");
if (isok) {
isok = unified_machine(buf, len, pj);
//printf("ok %d \n",isok);
} else {
return false; return false;
} }
if (isok) {
isok = unified_machine(buf, len, pj);
} else {
if(reallocated) free((void*)buf);
return false;
}
if(reallocated) free((void*)buf);
return isok; return isok;
} }
WARN_UNUSED WARN_UNUSED
ParsedJson build_parsed_json(const u8 *buf, size_t len) { ParsedJson build_parsed_json(const u8 *buf, size_t len, bool reallocifneeded) {
ParsedJson pj; ParsedJson pj;
bool ok = pj.allocateCapacity(len); bool ok = pj.allocateCapacity(len);
if(ok) { if(ok) {
ok = json_parse(buf, len, pj); ok = json_parse(buf, len, pj, reallocifneeded);
assert(ok == pj.isValid()); assert(ok == pj.isValid());
} else { } else {
std::cerr << "failure during memory allocation " << std::endl; std::cerr << "failure during memory allocation " << std::endl;