Using just ASCII. (#899)

* Using just ASCII.

* Let us prune checkperf.

* Moving the description of lookup2 to the HACKING.md file.
This commit is contained in:
Daniel Lemire 2020-05-21 21:59:06 -04:00 committed by GitHub
parent 219b02c1e5
commit 12150baa5e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
9 changed files with 259 additions and 242 deletions

View File

@ -38,13 +38,27 @@ add_subdirectory(singleheader)
#
# Compile tools / tests / benchmarks
#
add_subdirectory(dependencies)
add_subdirectory(tests)
add_subdirectory(examples)
add_subdirectory(benchmark)
add_subdirectory(fuzz)
#
# Source files should be just ASCII
#
find_program(FIND find)
find_program(FILE file)
find_program(GREP grep)
if((FIND) AND (FILE) AND (GREP))
add_test(
NAME "just_ascii"
COMMAND sh -c "${FIND} include src windows tools singleheader tests examples benchmark -path benchmark/checkperf-reference -prune -name '*.h' -o -name '*.cpp' -type f -exec ${FILE} '{}' \; |${GREP} -v ASCII || exit 0 && exit 1"
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)
endif()
#
# CPack
#

View File

@ -40,6 +40,7 @@ We have few hard rules, but we have some:
- Printing to standard output or standard error (`stderr`, `stdout`, `std::cerr`, `std::cout`) in the core library is forbidden. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Compiled code should not write to stdout or stderr".
- Calls to `abort()` are forbidden in the core library. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Under no circumstances should your compiled code ever call abort or exit".
- All source code files (.h, .cpp) must be ASCII.
Tools, tests and benchmarks are not held to these same strict rules.

View File

@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
### UTF-8 validation (lookup2)
The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
This algorithm validate the length of multibyte characters (that each multibyte character has the right number of continuation characters, and that all continuation characters are part of a multibyte character).
#### Algorithm
This algorithm compares *expected* continuation characters with *actual* continuation bytes, and emits an error anytime there is a mismatch.
For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
characters, the file will look like this:
| Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|-----------------------|----|----|----|----|----|----|----|----|----|----|----|
| Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
| Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
| is_second_byte | | X | | | | X | | | X | | |
| is_third_byte | | | X | | | | X | | | | |
| is_fourth_byte | | | | X | | | | | | | |
| expected_continuation | | X | X | X | | X | X | | X | | |
| is_continuation | | X | X | X | | X | X | | X | | |
The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
- **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
floating around extra outside of any character, or that there is an illegal 5-byte character,
or maybe it's at the beginning of the file before any characters have started; but it's an
error in all these cases.
- **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
we started a new character before we were finished with the current one.
#### Getting the Previous Bytes
Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
character, we need to "shift the bytes" to find that out. This is what they mean:
- `is_continuation`: if the current byte is a continuation.
- `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
- `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
- `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
We use shuffles to go n bytes back, selecting part of the current `input` and part of the
`prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
function, because the 1-byte-back data is used by other checks as well.
#### Getting the Continuation Mask
Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
numbers, using signed `<` and `>` operations to check if they are continuations or leads.
In fact, we treat the numbers as *signed*, partly because it helps us, and partly because
Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
When treated as signed numbers, they look like this:
| Type | High Bits | Binary Range | Signed |
|--------------|------------|--------------|--------|
| ASCII | `0` | `01111111` | 127 |
| | | `00000000` | 0 |
| 4+-Byte Lead | `1111` | `11111111` | -1 |
| | | `11110000 | -16 |
| 3-Byte Lead | `1110` | `11101111` | -17 |
| | | `11100000 | -32 |
| 2-Byte Lead | `110` | `11011111` | -33 |
| | | `11000000 | -64 |
| Continuation | `10` | `10111111` | -65 |
| | | `10000000 | -128 |
This makes it pretty easy to get the continuation mask! It's just a single comparison:
```
is_continuation = input < -64`
```
We can do something similar for the others, but it takes two comparisons instead of one: "is
the start of a 4-byte character" is `< -32` and `> -65`, for example. And 2+ bytes is `< 0` and
`> -64`. Surely we can do better, they're right next to each other!
#### Getting the is_xxx Masks: Shifting the Range
Notice *why* continuations were a single comparison. The actual *range* would require two
comparisons--`< -64` and `> -129`--but all characters are always greater than -128, so we get
that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
ASCII down into the negative, and puts 4+-Byte Lead at the top:
| Type | High Bits | Binary Range | Signed |
|----------------------|------------|--------------|-------|
| 4+-Byte Lead (+ 127) | `0111` | `01111111` | 127 |
| | | `01110000 | 112 |
|----------------------|------------|--------------|-------|
| 3-Byte Lead (+ 127) | `0110` | `01101111` | 111 |
| | | `01100000 | 96 |
|----------------------|------------|--------------|-------|
| 2-Byte Lead (+ 127) | `010` | `01011111` | 95 |
| | | `01000000 | 64 |
|----------------------|------------|--------------|-------|
| Continuation (+ 127) | `00` | `00111111` | 63 |
| | | `00000000 | 0 |
|----------------------|------------|--------------|-------|
| ASCII (+ 127) | `1` | `11111111` | -1 |
| | | `10000000` | -128 |
|----------------------|------------|--------------|-------|
*Now* we can use signed `>` on all of them:
```
prev1 = input.prev<1>
prev2 = input.prev<2>
prev3 = input.prev<3>
prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
is_second_byte = prev1_flipped > 63;2+-byte lead
is_third_byte = prev2_flipped > 95;3+-byte lead
is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
```
NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
`^`'s at a time on Haswell, but only 2 `+`'s).
That doesn't look like it saved us any instructions, did it? Well, because we're adding the
same number to all of them, we can save one of those `+ 128` operations by assembling
`prev2_flipped` out of prev 1 and prev 3 instead of assembling it from input and adding 128
to it. One more instruction saved!
```
prev1 = input.prev<1>
prev3 = input.prev<3>
prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
prev2_flipped = prev1_flipped.concat<2>(prev3_flipped): // <shuffle: take the first 2 bytes from prev1 and the rest from prev3
```
#### Bringing It All Together: Detecting the Errors
At this point, we have `is_continuation`, `is_first_byte`, `is_second_byte` and `is_third_byte`.
All we have left to do is check if they match!
```
return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
```
But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
parallel*. You have to do 2 `|`'s and then 1 `&`. Haswell, at least, has 3 ports that can do
bitwise operations, and we're only using 1!
#### Epilogue: Addition For Booleans
There is one big case the above code doesn't explicitly talk about--what if is_second_byte
and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
to each other (or any combination), and the continuation could be part of either of them!
Our algorithm using `&` and `|` won't detect that the continuation byte is problematic.
Never fear, though. If that situation occurs, we'll already have detected that the second
leading byte was an error, because it was supposed to be a part of the preceding multibyte
character, but it *wasn't a continuation*.
We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
`&`, which is both interesting and possibly useful (even though we're not using it here). It
exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
comparisons were giving us numbers!
Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
`(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
*both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
*any* nonzero value is treated as an error (not just -1), we're just fine here :)
Further, if *more than one* multibyte character overlaps,
`is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
considered an error.
One reason you might want to do this is parallelism. ^ and | are not associative, so
(A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
you do B | C -> | A -> ^ D. But addition and subtraction *are* associative: (A + B + C) - D can
be written as `(A + B) + (C - D)`. This means you can do A + B and C - D at the same time, and
then adds the result together. Same number of operations, but if the processor can run
independent things in parallel (which most can), it runs faster.
This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
a super nice advantage in that more of them can be run at the same time (they can run on 3
ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
saving us the cycle we would have earned by using +. Even more, using an instruction with a
wider array of ports can help *other* code run ahead, too, since these instructions can "get
out of the way," running on a port other instructions can't.
#### Epilogue II: One More Trick
There's one more relevant trick up our sleeve, it turns out: it turns out on Intel we can "pay
for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
check_special_cases()--but we'll talk about that there :)
## About the Project
### Bindings and Ports of simdjson
@ -420,6 +627,8 @@ make allparsingcompetition
Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
a table-oriented output that can be conveniently parsed by other tools.
### Various References
- [Google double-conv](https://github.com/google/double-conversion/)

View File

@ -1,4 +1,4 @@
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
#include <iostream>
#include "simdjson.h"

View File

@ -1,4 +1,4 @@
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
/* begin file src/simdjson.cpp */
#include "simdjson.h"
@ -3180,10 +3180,10 @@ namespace utf8_validation {
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
// and emits an error anytime there is a mismatch.
//
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
// characters, the file will look like this:
//
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
// | Character | | | | | | | | | | a | b |
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@ -4049,10 +4049,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
// If you consume a large value and you map it to "infinity", you will no
// longer be able to serialize back a standard-compliant JSON. And there is
// no realistic application where you might need values so large than they
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
// 10^308 It is an unimaginable large number. There will never be any piece of
// engineering involving as many as 10^308 parts. It is estimated that there
// are about 10^80 atoms in the universe.  The estimate for the total number
// are about 10^80 atoms in the universe. The estimate for the total number
// of electrons is similar. Using a double-precision floating-point value, we
// can represent easily the number of atoms in the universe. We could also
// represent the number of ways you can pick any three individual atoms at
@ -5872,10 +5872,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
// If you consume a large value and you map it to "infinity", you will no
// longer be able to serialize back a standard-compliant JSON. And there is
// no realistic application where you might need values so large than they
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
// 10^308 It is an unimaginable large number. There will never be any piece of
// engineering involving as many as 10^308 parts. It is estimated that there
// are about 10^80 atoms in the universe.  The estimate for the total number
// are about 10^80 atoms in the universe. The estimate for the total number
// of electrons is similar. Using a double-precision floating-point value, we
// can represent easily the number of atoms in the universe. We could also
// represent the number of ways you can pick any three individual atoms at
@ -8142,10 +8142,10 @@ namespace utf8_validation {
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
// and emits an error anytime there is a mismatch.
//
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
// characters, the file will look like this:
//
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
// | Character | | | | | | | | | | a | b |
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@ -9015,10 +9015,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
// If you consume a large value and you map it to "infinity", you will no
// longer be able to serialize back a standard-compliant JSON. And there is
// no realistic application where you might need values so large than they
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
// 10^308 It is an unimaginable large number. There will never be any piece of
// engineering involving as many as 10^308 parts. It is estimated that there
// are about 10^80 atoms in the universe.  The estimate for the total number
// are about 10^80 atoms in the universe. The estimate for the total number
// of electrons is similar. Using a double-precision floating-point value, we
// can represent easily the number of atoms in the universe. We could also
// represent the number of ways you can pick any three individual atoms at
@ -11254,10 +11254,10 @@ namespace utf8_validation {
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
// and emits an error anytime there is a mismatch.
//
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
// characters, the file will look like this:
//
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
// | Character | | | | | | | | | | a | b |
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@ -12130,10 +12130,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
// If you consume a large value and you map it to "infinity", you will no
// longer be able to serialize back a standard-compliant JSON. And there is
// no realistic application where you might need values so large than they
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
// 10^308 It is an unimaginable large number. There will never be any piece of
// engineering involving as many as 10^308 parts. It is estimated that there
// are about 10^80 atoms in the universe.  The estimate for the total number
// are about 10^80 atoms in the universe. The estimate for the total number
// of electrons is similar. Using a double-precision floating-point value, we
// can represent easily the number of atoms in the universe. We could also
// represent the number of ways you can pick any three individual atoms at

View File

@ -1,4 +1,4 @@
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
/* begin file include/simdjson.h */
#ifndef SIMDJSON_H
#define SIMDJSON_H

View File

@ -47,9 +47,9 @@
// support values with more than 23 bits (which a 4-byte character supports).
//
// e.g. 11111000 10100000 10000000 10000000 10000000 (U+800000)
//
//
// Legal utf-8 byte sequences per http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94:
//
//
// Code Points 1st 2s 3s 4s
// U+0000..U+007F 00..7F
// U+0080..U+07FF C2..DF 80..BF
@ -64,6 +64,7 @@
using namespace simd;
namespace utf8_validation {
// For a detailed description of the lookup2 algorithm, see the file HACKING.md under "UTF-8 validation (lookup2)".
//
// Find special case UTF-8 errors where the character is technically readable (has the right length)
@ -108,7 +109,7 @@ namespace utf8_validation {
const simd8<uint8_t> byte_1_high = prev1.shr<4>().lookup_16<uint8_t>(
// [0___]____ (ASCII)
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
// [10__]____ (continuation)
0, 0, 0, 0,
@ -139,214 +140,6 @@ namespace utf8_validation {
return byte_1_high & byte_1_low & byte_2_high;
}
//
// Validate the length of multibyte characters (that each multibyte character has the right number
// of continuation characters, and that all continuation characters are part of a multibyte
// character).
//
// Algorithm
// =========
//
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
// and emits an error anytime there is a mismatch.
//
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
// characters, the file will look like this:
//
// | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
// | is_second_byte | | X | | | | X | | | X | | |
// | is_third_byte | | | X | | | | X | | | | |
// | is_fourth_byte | | | | X | | | | | | | |
// | expected_continuation | | X | X | X | | X | X | | X | | |
// | is_continuation | | X | X | X | | X | X | | X | | |
//
// The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
//
// - **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
// part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
// floating around extra outside of any character, or that there is an illegal 5-byte character,
// or maybe it's at the beginning of the file before any characters have started; but it's an
// error in all these cases.
// - **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
// we started a new character before we were finished with the current one.
//
// Getting the Previous Bytes
// --------------------------
//
// Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
// character, we need to "shift the bytes" to find that out. This is what they mean:
//
// - `is_continuation`: if the current byte is a continuation.
// - `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
// - `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
// - `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
//
// We use shuffles to go n bytes back, selecting part of the current `input` and part of the
// `prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
// function, because the 1-byte-back data is used by other checks as well.
//
// Getting the Continuation Mask
// -----------------------------
//
// Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
// numbers, using signed `<` and `>` operations to check if they are continuations or leads.
// In fact, we treat the numbers as *signed*, partly because it helps us, and partly because
// Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
//
// In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
// respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
// Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
//
// When treated as signed numbers, they look like this:
//
// | Type | High Bits | Binary Range | Signed |
// |--------------|------------|--------------|--------|
// | ASCII | `0` | `01111111` | 127 |
// | | | `00000000` | 0 |
// | 4+-Byte Lead | `1111` | `11111111` | -1 |
// | | | `11110000 | -16 |
// | 3-Byte Lead | `1110` | `11101111` | -17 |
// | | | `11100000 | -32 |
// | 2-Byte Lead | `110` | `11011111` | -33 |
// | | | `11000000 | -64 |
// | Continuation | `10` | `10111111` | -65 |
// | | | `10000000 | -128 |
//
// This makes it pretty easy to get the continuation mask! It's just a single comparison:
//
// ```
// is_continuation = input < -64`
// ```
//
// We can do something similar for the others, but it takes two comparisons instead of one: "is
// the start of a 4-byte character" is `< -32` and `> -65`, for example. And 2+ bytes is `< 0` and
// `> -64`. Surely we can do better, they're right next to each other!
//
// Getting the is_xxx Masks: Shifting the Range
// --------------------------------------------
//
// Notice *why* continuations were a single comparison. The actual *range* would require two
// comparisons--`< -64` and `> -129`--but all characters are always greater than -128, so we get
// that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
// just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
//
// Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
// ASCII down into the negative, and puts 4+-Byte Lead at the top:
//
// | Type | High Bits | Binary Range | Signed |
// |----------------------|------------|--------------|-------|
// | 4+-Byte Lead (+ 127) | `0111` | `01111111` | 127 |
// | | | `01110000 | 112 |
// |----------------------|------------|--------------|-------|
// | 3-Byte Lead (+ 127) | `0110` | `01101111` | 111 |
// | | | `01100000 | 96 |
// |----------------------|------------|--------------|-------|
// | 2-Byte Lead (+ 127) | `010` | `01011111` | 95 |
// | | | `01000000 | 64 |
// |----------------------|------------|--------------|-------|
// | Continuation (+ 127) | `00` | `00111111` | 63 |
// | | | `00000000 | 0 |
// |----------------------|------------|--------------|-------|
// | ASCII (+ 127) | `1` | `11111111` | -1 |
// | | | `10000000` | -128 |
// |----------------------|------------|--------------|-------|
//
// *Now* we can use signed `>` on all of them:
//
// ```
// prev1 = input.prev<1>
// prev2 = input.prev<2>
// prev3 = input.prev<3>
// prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
// prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
// prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
// is_second_byte = prev1_flipped > 63; // 2+-byte lead
// is_third_byte = prev2_flipped > 95; // 3+-byte lead
// is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
// ```
//
// NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
// of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
// `^`'s at a time on Haswell, but only 2 `+`'s).
//
// That doesn't look like it saved us any instructions, did it? Well, because we're adding the
// same number to all of them, we can save one of those `+ 128` operations by assembling
// `prev2_flipped` out of prev 1 and prev 3 instead of assembling it from input and adding 128
// to it. One more instruction saved!
//
// ```
// prev1 = input.prev<1>
// prev3 = input.prev<3>
// prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
// prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
// prev2_flipped = prev1_flipped.concat<2>(prev3_flipped): // <shuffle: take the first 2 bytes from prev1 and the rest from prev3
// ```
//
// ### Bringing It All Together: Detecting the Errors
//
// At this point, we have `is_continuation`, `is_first_byte`, `is_second_byte` and `is_third_byte`.
// All we have left to do is check if they match!
//
// ```
// return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
// ```
//
// But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
// parallel*. You have to do 2 `|`'s and then 1 `&`. Haswell, at least, has 3 ports that can do
// bitwise operations, and we're only using 1!
//
// Epilogue: Addition For Booleans
// -------------------------------
//
// There is one big case the above code doesn't explicitly talk about--what if is_second_byte
// and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
// to each other (or any combination), and the continuation could be part of either of them!
// Our algorithm using `&` and `|` won't detect that the continuation byte is problematic.
//
// Never fear, though. If that situation occurs, we'll already have detected that the second
// leading byte was an error, because it was supposed to be a part of the preceding multibyte
// character, but it *wasn't a continuation*.
//
// We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
// `&`, which is both interesting and possibly useful (even though we're not using it here). It
// exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
// comparisons were giving us numbers!
//
// Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
// circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
// `(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
// *both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
// *any* nonzero value is treated as an error (not just -1), we're just fine here :)
//
// Further, if *more than one* multibyte character overlaps,
// `is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
// from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
// considered an error.
//
// One reason you might want to do this is parallelism. ^ and | are not associative, so
// (A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
// you do B | C -> | A -> ^ D. But addition and subtraction *are* associative: (A + B + C) - D can
// be written as `(A + B) + (C - D)`. This means you can do A + B and C - D at the same time, and
// then adds the result together. Same number of operations, but if the processor can run
// independent things in parallel (which most can), it runs faster.
//
// This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
// a super nice advantage in that more of them can be run at the same time (they can run on 3
// ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
// saving us the cycle we would have earned by using +. Even more, using an instruction with a
// wider array of ports can help *other* code run ahead, too, since these instructions can "get
// out of the way," running on a port other instructions can't.
//
// Epilogue II: One More Trick
// ---------------------------
//
// There's one more relevant trick up our sleeve, it turns out: it turns out on Intel we can "pay
// for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
// check_special_cases()--but we'll talk about that there :)
//
really_inline simd8<uint8_t> check_multibyte_lengths(simd8<uint8_t> input, simd8<uint8_t> prev_input, simd8<uint8_t> prev1) {
simd8<uint8_t> prev2 = input.prev<2>(prev_input);
simd8<uint8_t> prev3 = input.prev<3>(prev_input);
@ -422,4 +215,4 @@ namespace utf8_validation {
}; // struct utf8_checker
}
using utf8_validation::utf8_checker;
using utf8_validation::utf8_checker;

View File

@ -191,10 +191,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
// If you consume a large value and you map it to "infinity", you will no
// longer be able to serialize back a standard-compliant JSON. And there is
// no realistic application where you might need values so large than they
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
// 10^308 It is an unimaginable large number. There will never be any piece of
// engineering involving as many as 10^308 parts. It is estimated that there
// are about 10^80 atoms in the universe.  The estimate for the total number
// are about 10^80 atoms in the universe. The estimate for the total number
// of electrons is similar. Using a double-precision floating-point value, we
// can represent easily the number of atoms in the universe. We could also
// represent the number of ways you can pick any three individual atoms at

View File

@ -10,7 +10,7 @@
#include <string_view>
#include <sstream>
#include <utility>
#include <ciso646>
#include <ciso646>
#include <unistd.h>
#include "simdjson.h"
@ -62,7 +62,7 @@ namespace number_tests {
std::cerr << "JSON '" << str << "' parsed to " << actual << " instead of " << i << std::endl;
return false;
}
}
}
}
return true;
}
@ -79,7 +79,7 @@ namespace number_tests {
fflush(NULL);
auto [actual, error] = parser.parse(buf, n).get<double>();
if (error) { std::cerr << error << std::endl; return false; }
uint64_t ulp = f64_ulp_dist(actual,expected);
uint64_t ulp = f64_ulp_dist(actual,expected);
if(ulp > maxulp) maxulp = ulp;
if(ulp > 0) {
std::cerr << "JSON '" << buf << " parsed to " << actual << " instead of " << expected << std::endl;
@ -452,8 +452,8 @@ namespace document_stream_tests {
size_t n = snprintf(buf,
sizeof(buf),
"{\"id\": %zu, \"name\": \"name%zu\", \"gender\": \"%s\", "
"\"été\": {\"id\": %zu, \"name\": \"éventail%zu\"}}",
i, i, (i % 2) ? "" : "", i % 10, i % 10);
"\"\xC3\xA9t\xC3\xA9\": {\"id\": %zu, \"name\": \"\xC3\xA9ventail%zu\"}}",
i, i, (i % 2) ? "\xE2\xBA\x83" : "\xE2\xBA\x95", i % 10, i % 10);
if (n >= sizeof(buf)) { abort(); }
data += std::string(buf, n);
}
@ -678,7 +678,7 @@ namespace dom_api_tests {
}
if (iter.move_to_key_insensitive("bad key")) {
printf("We should not move to a non-existing key\n");
return false;
return false;
}
if (!iter.is_object()) {
printf("We should have remained at the object.\n");
@ -726,7 +726,7 @@ namespace dom_api_tests {
}
if (!iter.move_to_key("IDs")) {
printf("We should be able to move to an existing key\n");
return false;
return false;
}
if (!iter.is_array()) {
printf("Value of IDs should be array, it is %c \n", iter.get_type());
@ -734,7 +734,7 @@ namespace dom_api_tests {
}
if (iter.move_to_index(4)) {
printf("We should not be able to move to a non-existing index\n");
return false;
return false;
}
if (!iter.is_array()) {
printf("We should have remained at the array\n");
@ -1930,8 +1930,8 @@ int main(int argc, char *argv[]) {
// this is put here deliberately to check that the documentation is correct (README),
// should this fail to compile, you should update the documentation:
if (simdjson::active_implementation->name() == "unsupported") {
printf("unsupported CPU\n");
if (simdjson::active_implementation->name() == "unsupported") {
printf("unsupported CPU\n");
}
std::cout << "Running basic tests." << std::endl;
if (parse_api_tests::run() &&