Better doc.

Daniel Lemire 2018-12-10 22:00:16 -05:00
parent d61aa21d50
commit 058eb917d1
3 changed files with 77 additions and 39 deletions

@ -68,18 +68,17 @@ make parsingcompetition
To simplify the engineering, we make some assumptions.
- We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16). We do not believe that this is a genuine limitation: we do not think that any serious application needs to process JSON data in an encoding other than ASCII or UTF-8.
- We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included though it can be done.
- We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though it should not be difficult.
- We expect the input memory to be readable up to 32 bytes beyond the end of the JSON document (to support fast vector loads). All bytes beyond the end of the JSON document are ignored (they can be garbage) and the JSON document does not need to be null-terminated. You can allocate a properly overallocated memory region with the provided `allocate_padded_buffer` function or simply by allocating your memory with extra capacity (`malloc(length + SIMDJSON_PADDING)`). (We expect that this limitation can be lifted without a performance penalty, but at the cost of some code complexity.)
- We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included though it can be done. We plan to support ARM processors (help is invited).
- We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though it should not be difficult (help is invited).
- In cases of failure, we just report a failure without any indication as to the nature of the problem. (This can be easily improved without affecting performance.)
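
As a rough illustration of the padding assumption above, here is a minimal sketch of copying a document into an overallocated buffer. The names `kPadding` and `copy_with_padding` are hypothetical and not part of the library; `kPadding` stands in for `SIMDJSON_PADDING`, and its value of 32 is taken from the "32 bytes" mentioned in the list.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Assumption: stand-in for the library's SIMDJSON_PADDING constant.
constexpr std::size_t kPadding = 32;

// Hypothetical helper: copies `length` bytes of JSON into a freshly
// allocated buffer that stays readable for kPadding bytes past the end.
// Returns nullptr on allocation failure; the caller releases it with free().
char *copy_with_padding(const char *json, std::size_t length) {
  char *padded = static_cast<char *>(std::malloc(length + kPadding));
  if (padded == nullptr) {
    return nullptr;
  }
  std::memcpy(padded, json, length);
  // The padding content is ignored by the parser; zeroing it is optional
  // and only keeps the bytes deterministic.
  std::memset(padded + length, 0, kPadding);
  return padded;
}
```

The document itself still has length `length`; only the allocation is larger, so the length passed to the parser does not change.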
## Features
- The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
- We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers.
- We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
- We fully validate the numbers. (Parsers like gason and ultrajson will accept `[0e+]` as valid JSON.)
- We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tabs in strings.)
- The input string is unmodified. (Parsers like sajson and RapidJSON use the input string as a buffer.)
## Architecture

@ -12,46 +12,71 @@
// Parse a document found in buf, need to preallocate ParsedJson.
// Return false in case of a failure. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused.
// the input buf should be readable up to buf + len + SIMDJSON_PADDING
// all bytes at and after buf + len are ignored (can be garbage)
// by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj);
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj, bool reallocifneeded = true);
// the input buf should be readable up to buf + len + SIMDJSON_PADDING
// all bytes at and after buf + len are ignored (can be garbage)
// Parse a document found in buf, need to preallocate ParsedJson.
// Return false in case of a failure. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED
static inline bool json_parse(const char * buf, size_t len, ParsedJson &pj) {
return json_parse((const u8 *) buf, len, pj);
static inline bool json_parse(const char * buf, size_t len, ParsedJson &pj, bool reallocifneeded = true) {
return json_parse((const u8 *) buf, len, pj, reallocifneeded);
}
// convenience function
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING
// all bytes at and after s.data()+s.size() are ignored (can be garbage)
// Parse a document found in buf, need to preallocate ParsedJson.
// Return false in case of a failure. You can also check validity
// by calling pj.isValid(). The same ParsedJson can be reused for other documents.
//
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after s.data()+s.size() are ignored (can be garbage).
WARN_UNUSED
static inline bool json_parse(const std::string_view &s, ParsedJson &pj) {
return json_parse(s.data(), s.size(), pj);
static inline bool json_parse(const std::string_view &s, ParsedJson &pj, bool reallocifneeded = true) {
return json_parse(s.data(), s.size(), pj, reallocifneeded);
}
// Build a ParsedJson object. You can check validity
// by calling pj.isValid(). This does memory allocation.
// the input buf should be readable up to buf + len + SIMDJSON_PADDING
// all bytes at and after buf + len are ignored (can be garbage)
// by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
//
// the input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
WARN_UNUSED
ParsedJson build_parsed_json(const u8 *buf, size_t len);
ParsedJson build_parsed_json(const u8 *buf, size_t len, bool reallocifneeded = true);
WARN_UNUSED
// the input buf should be readable up to buf + len + SIMDJSON_PADDING
// all bytes at and after buf + len are ignored (can be garbage)
static inline ParsedJson build_parsed_json(const char * buf, size_t len) {
return build_parsed_json((const u8 *) buf, len);
// Build a ParsedJson object. You can check validity
// by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input buf should be readable up to buf + len + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after buf + len are ignored (can be garbage).
static inline ParsedJson build_parsed_json(const char * buf, size_t len, bool reallocifneeded = true) {
return build_parsed_json((const u8 *) buf, len, reallocifneeded);
}
// convenience function
WARN_UNUSED
// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING
// all bytes at and after s.data()+s.size() are ignored (can be garbage)
static inline ParsedJson build_parsed_json(const std::string_view &s) {
return build_parsed_json(s.data(), s.size());
// Build a ParsedJson object. You can check validity
// by calling pj.isValid(). This does the memory allocation needed for ParsedJson.
// If reallocifneeded is true (default) then a temporary buffer is created when needed during processing
// (a copy of the input string is made).
// The input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING if reallocifneeded is false,
// all bytes at and after s.data()+s.size() are ignored (can be garbage).
static inline ParsedJson build_parsed_json(const std::string_view &s, bool reallocifneeded = true) {
return build_parsed_json(s.data(), s.size(), reallocifneeded);
}
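
The comments above say that a temporary copy is made "when needed". One way to decide this is to ask whether reading the padding bytes past the end of the input could touch the next memory page. The following is a minimal sketch of such a test, not necessarily the check this commit uses. The function name `padding_may_cross_page` is hypothetical, and the page size is passed as a parameter rather than queried from the operating system.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when reading `padding` bytes beyond the
// last byte of [buf, buf + len) could land on a different memory page than
// the last byte itself. An empty input is treated conservatively as "may cross".
bool padding_may_cross_page(const void *buf, std::size_t len,
                            std::size_t padding, std::size_t pagesize) {
  if (len == 0) {
    return true;
  }
  // Address of the last byte that belongs to the document.
  const std::uintptr_t last = reinterpret_cast<std::uintptr_t>(buf) + len - 1;
  // Compare the page index of the last document byte with the page index
  // of the furthest padding byte that a vector load might read.
  return (last / pagesize) != ((last + padding) / pagesize);
}
```

When the test returns false, the padded reads stay on a page the caller already owns, so the input can be parsed in place. Otherwise copying into a padded buffer is the safe choice.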

@ -1,35 +1,49 @@
#include "simdjson/jsonparser.h"
#include <unistd.h>
// parse a document found in buf, need to preallocate ParsedJson.
WARN_UNUSED
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj) {
bool json_parse(const u8 *buf, size_t len, ParsedJson &pj, bool reallocifneeded) {
if (pj.bytecapacity < len) {
std::cerr << "Your ParsedJson cannot support documents that big: " << len
<< std::endl;
return false;
}
bool reallocated = false;
if(reallocifneeded) {
// realloc is needed if the end of the memory crosses a page
long pagesize = sysconf (_SC_PAGESIZE); // on windows this should be SYSTEM_INFO sysInfo; GetSystemInfo(&sysInfo); sysInfo.dwPageSize
if ( (reinterpret_cast<uintptr_t>(buf + len - 1) % pagesize ) < SIMDJSON_PADDING ) {
const u8 *tmpbuf = buf;
buf = (u8 *) allocate_padded_buffer(len);
if(buf == NULL) return false;
memcpy((void*)buf,tmpbuf,len);
reallocated = true;
}
}
bool isok = find_structural_bits(buf, len, pj);
if (isok) {
isok = flatten_indexes(len, pj);
} else {
return false;
}//printf("ok\n");
if (isok) {
isok = unified_machine(buf, len, pj);
//printf("ok %d \n",isok);
} else {
if(reallocated) free((void*)buf);
return false;
}
if (isok) {
isok = unified_machine(buf, len, pj);
} else {
if(reallocated) free((void*)buf);
return false;
}
if(reallocated) free((void*)buf);
return isok;
}
WARN_UNUSED
ParsedJson build_parsed_json(const u8 *buf, size_t len) {
ParsedJson build_parsed_json(const u8 *buf, size_t len, bool reallocifneeded) {
ParsedJson pj;
bool ok = pj.allocateCapacity(len);
if(ok) {
ok = json_parse(buf, len, pj);
ok = json_parse(buf, len, pj, reallocifneeded);
assert(ok == pj.isValid());
} else {
std::cerr << "failure during memory allocation " << std::endl;