Using just ASCII. (#899)
* Using just ASCII.
* Let us prune checkperf.
* Moving the description of lookup2 to the HACKING.md file.
parent
219b02c1e5
commit
12150baa5e
|
@ -38,13 +38,27 @@ add_subdirectory(singleheader)
|
||||||
#
|
#
|
||||||
# Compile tools / tests / benchmarks
|
# Compile tools / tests / benchmarks
|
||||||
#
|
#
|
||||||
|
|
||||||
add_subdirectory(dependencies)
|
add_subdirectory(dependencies)
|
||||||
add_subdirectory(tests)
|
add_subdirectory(tests)
|
||||||
add_subdirectory(examples)
|
add_subdirectory(examples)
|
||||||
add_subdirectory(benchmark)
|
add_subdirectory(benchmark)
|
||||||
add_subdirectory(fuzz)
|
add_subdirectory(fuzz)
|
||||||
|
|
||||||
|
#
|
||||||
|
# Source files should be just ASCII
|
||||||
|
#
|
||||||
|
find_program(FIND find)
|
||||||
|
find_program(FILE file)
|
||||||
|
find_program(GREP grep)
|
||||||
|
if((FIND) AND (FILE) AND (GREP))
|
||||||
|
add_test(
|
||||||
|
NAME "just_ascii"
|
||||||
|
COMMAND sh -c "${FIND} include src windows tools singleheader tests examples benchmark -path benchmark/checkperf-reference -prune -name '*.h' -o -name '*.cpp' -type f -exec ${FILE} '{}' \; |${GREP} -v ASCII || exit 0 && exit 1"
|
||||||
|
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
|
||||||
|
)
|
||||||
|
endif()
|
||||||
|
|
||||||
|
|
||||||
#
|
#
|
||||||
# CPack
|
# CPack
|
||||||
#
|
#
|
||||||
|
|
|
@ -40,6 +40,7 @@ We have few hard rules, but we have some:
|
||||||
|
|
||||||
- Printing to standard output or standard error (`stderr`, `stdout`, `std::cerr`, `std::cout`) in the core library is forbidden. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Compiled code should not write to stdout or stderr".
|
- Printing to standard output or standard error (`stderr`, `stdout`, `std::cerr`, `std::cout`) in the core library is forbidden. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Compiled code should not write to stdout or stderr".
|
||||||
- Calls to `abort()` are forbidden in the core library. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Under no circumstances should your compiled code ever call abort or exit".
|
- Calls to `abort()` are forbidden in the core library. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Under no circumstances should your compiled code ever call abort or exit".
|
||||||
|
- All source code files (.h, .cpp) must be ASCII.
|
||||||
|
|
||||||
Tools, tests and benchmarks are not held to these same strict rules.
|
Tools, tests and benchmarks are not held to these same strict rules.
|
||||||
|
|
||||||
|
|
209
HACKING.md
|
@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
|
||||||
|
|
||||||
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
|
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
### UTF-8 validation (lookup2)
|
||||||
|
|
||||||
|
The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
|
||||||
|
|
||||||
|
This algorithm validates the length of multibyte characters (that each multibyte character has the right number of continuation characters, and that all continuation characters are part of a multibyte character).
|
||||||
|
|
||||||
|
#### Algorithm
|
||||||
|
|
||||||
|
This algorithm compares *expected* continuation characters with *actual* continuation bytes, and emits an error anytime there is a mismatch.
|
||||||
|
|
||||||
|
For example, in the string "𝄞₿֏ab", which has 4-, 3-, 2- and 1-byte
|
||||||
|
characters, the file will look like this:
|
||||||
|
|
||||||
|
| Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|
||||||
|
|-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
||||||
|
| Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
||||||
|
| Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
||||||
|
| is_second_byte | | X | | | | X | | | X | | |
|
||||||
|
| is_third_byte | | | X | | | | X | | | | |
|
||||||
|
| is_fourth_byte | | | | X | | | | | | | |
|
||||||
|
| expected_continuation | | X | X | X | | X | X | | X | | |
|
||||||
|
| is_continuation | | X | X | X | | X | X | | X | | |
|
||||||
|
|
||||||
|
A byte is valid when (Second Byte OR Third Byte OR Fourth Byte) == Continuation, so the errors are the two kinds of mismatch:
|
||||||
|
|
||||||
|
- **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
|
||||||
|
part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
|
||||||
|
floating around extra outside of any character, or that there is an illegal 5-byte character,
|
||||||
|
or maybe it's at the beginning of the file before any characters have started; but it's an
|
||||||
|
error in all these cases.
|
||||||
|
- **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
|
||||||
|
we started a new character before we were finished with the current one.
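
The two error cases above can be sketched in scalar code. This is a hypothetical reference version written for illustration, not the SIMD code in simdjson; the function name `continuation_errors` and the choice to treat bytes before the start of the input as ASCII are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for the length check (not simdjson's SIMD code).
// Returns one flag per byte: true where the expected continuation and the
// actual continuation disagree.
std::vector<bool> continuation_errors(const std::vector<uint8_t>& bytes) {
  // True if the byte `back` positions before index i is a lead byte of at
  // least the given class. Positions before the input count as ASCII.
  auto lead_at = [&](std::size_t i, std::size_t back, uint8_t min_lead) {
    return i >= back && bytes[i - back] >= min_lead;
  };
  std::vector<bool> errors(bytes.size());
  for (std::size_t i = 0; i < bytes.size(); i++) {
    bool is_second_byte = lead_at(i, 1, 0xC0);  // 2+-byte lead one byte back
    bool is_third_byte = lead_at(i, 2, 0xE0);   // 3+-byte lead two bytes back
    bool is_fourth_byte = lead_at(i, 3, 0xF0);  // 4+-byte lead three bytes back
    bool is_continuation = (bytes[i] & 0xC0) == 0x80;
    bool expected_continuation = is_second_byte || is_third_byte || is_fourth_byte;
    errors[i] = expected_continuation != is_continuation;
  }
  return errors;
}
```

For the example string in the table, every flag is false. A stray `0x80` after an ASCII byte is flagged as an extra continuation, and an ASCII byte where a third byte was expected is flagged as a missing continuation.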
|
||||||
|
|
||||||
|
#### Getting the Previous Bytes
|
||||||
|
|
||||||
|
Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
|
||||||
|
character, we need to "shift the bytes" to find that out. This is what they mean:
|
||||||
|
|
||||||
|
- `is_continuation`: if the current byte is a continuation.
|
||||||
|
- `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
|
||||||
|
- `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
|
||||||
|
- `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
|
||||||
|
|
||||||
|
We use shuffles to go n bytes back, selecting part of the current `input` and part of the
|
||||||
|
`prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
|
||||||
|
function, because the 1-byte-back data is used by other checks as well.
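
As a rough scalar picture of what going "n bytes back" means across a chunk boundary, here is a sketch. The 16-byte chunk width and the name `prev_bytes` are assumptions made for this example; the real code uses SIMD shuffles through `.prev<N>`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CHUNK = 16;  // assumed register width in bytes
using chunk = std::array<uint8_t, CHUNK>;

// Hypothetical scalar stand-in for `input.prev<N>(prev_input)`:
// out[i] is the byte N positions before input[i]. The first N lanes
// reach back into the tail of the previous chunk.
template <std::size_t N>
chunk prev_bytes(const chunk& input, const chunk& prev_input) {
  static_assert(N <= CHUNK, "cannot look back further than one chunk");
  chunk out{};
  for (std::size_t i = 0; i < CHUNK; i++) {
    out[i] = (i >= N) ? input[i - N] : prev_input[CHUNK - N + i];
  }
  return out;
}
```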
|
||||||
|
|
||||||
|
#### Getting the Continuation Mask
|
||||||
|
|
||||||
|
Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
|
||||||
|
numbers, using signed `<` and `>` operations to check if they are continuations or leads.
|
||||||
|
In fact, we treat the numbers as *signed*, partly because it helps us, and partly because
|
||||||
|
Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
|
||||||
|
|
||||||
|
In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
|
||||||
|
respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
|
||||||
|
Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
|
||||||
|
|
||||||
|
When treated as signed numbers, they look like this:
|
||||||
|
|
||||||
|
| Type | High Bits | Binary Range | Signed |
|
||||||
|
|--------------|------------|--------------|--------|
|
||||||
|
| ASCII | `0` | `01111111` | 127 |
|
||||||
|
| | | `00000000` | 0 |
|
||||||
|
| 4+-Byte Lead | `1111` | `11111111` | -1 |
|
||||||
|
|              |            | `11110000`   | -16    |
|
||||||
|
| 3-Byte Lead | `1110` | `11101111` | -17 |
|
||||||
|
|              |            | `11100000`   | -32    |
|
||||||
|
| 2-Byte Lead | `110` | `11011111` | -33 |
|
||||||
|
|              |            | `11000000`   | -64    |
|
||||||
|
| Continuation | `10` | `10111111` | -65 |
|
||||||
|
|              |            | `10000000`   | -128   |
|
||||||
|
|
||||||
|
This makes it pretty easy to get the continuation mask! It's just a single comparison:
|
||||||
|
|
||||||
|
```
|
||||||
|
is_continuation = input < -64
|
||||||
|
```
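
The same comparison on a single byte can be written as below. This is only a scalar illustration of the signed-range argument; the function name is chosen for this example.

```cpp
#include <cstdint>

// Scalar illustration: continuation bytes (0x80..0xBF) read as -128..-65
// when reinterpreted as signed 8-bit values, so one signed comparison
// is enough to identify them.
bool is_continuation(uint8_t byte) {
  return static_cast<int8_t>(byte) < -64;
}
```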
|
||||||
|
|
||||||
|
We can do something similar for the others, but it takes two comparisons instead of one: "is
|
||||||
|
the start of a 2-byte character" is `< -32` and `> -65`, for example. And 2+ bytes is `< 0` and
|
||||||
|
`> -65`. Surely we can do better: they're right next to each other!
|
||||||
|
|
||||||
|
#### Getting the is_xxx Masks: Shifting the Range
|
||||||
|
|
||||||
|
Notice *why* continuations were a single comparison. The actual *range* would require two
|
||||||
|
comparisons--`< -64` and `> -129`--but all bytes are always greater than -129, so we get
|
||||||
|
that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
|
||||||
|
just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
|
||||||
|
|
||||||
|
Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
|
||||||
|
ASCII down into the negative, and puts 4+-Byte Lead at the top:
|
||||||
|
|
||||||
|
| Type | High Bits | Binary Range | Signed |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
| 4+-Byte Lead (+ 128) | `0111`     | `01111111`   | 127   |
|
||||||
|
|                      |            | `01110000`   | 112   |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
| 3-Byte Lead (+ 128)  | `0110`     | `01101111`   | 111   |
|
||||||
|
|                      |            | `01100000`   | 96    |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
| 2-Byte Lead (+ 128)  | `010`      | `01011111`   | 95    |
|
||||||
|
|                      |            | `01000000`   | 64    |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
| Continuation (+ 128) | `00`       | `00111111`   | 63    |
|
||||||
|
|                      |            | `00000000`   | 0     |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
| ASCII (+ 128)        | `1`        | `11111111`   | -1    |
|
||||||
|
| | | `10000000` | -128 |
|
||||||
|
|----------------------|------------|--------------|-------|
|
||||||
|
|
||||||
|
*Now* we can use signed `>` on all of them:
|
||||||
|
|
||||||
|
```
|
||||||
|
prev1 = input.prev<1>
|
||||||
|
prev2 = input.prev<2>
|
||||||
|
prev3 = input.prev<3>
|
||||||
|
prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
|
||||||
|
prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
|
||||||
|
prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
|
||||||
|
is_second_byte = prev1_flipped > 63;  // 2+-byte lead
|
||||||
|
is_third_byte  = prev2_flipped > 95;  // 3+-byte lead
|
||||||
|
is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
|
||||||
|
```
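
To check the thresholds against the shifted-range table, here is a scalar sketch of the flip-then-compare idea. The helper names `flip` and `is_lead_N_plus` are made up for this example and do not appear in simdjson.

```cpp
#include <cstdint>

// `^ 0x80` flips the top bit, which is the same as adding 128 modulo 256.
int8_t flip(uint8_t byte) { return static_cast<int8_t>(byte ^ 0x80); }

// After the flip, each "N-or-more-byte lead" class is a single signed `>`.
bool is_lead_2_plus(uint8_t byte) { return flip(byte) > 63; }   // 0xC0..0xFF
bool is_lead_3_plus(uint8_t byte) { return flip(byte) > 95; }   // 0xE0..0xFF
bool is_lead_4_plus(uint8_t byte) { return flip(byte) > 111; }  // 0xF0..0xFF
```

ASCII bytes flip to negative values and continuation bytes flip to 0..63, so neither passes any of the three comparisons.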
|
||||||
|
|
||||||
|
NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
|
||||||
|
of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
|
||||||
|
`^`'s at a time on Haswell, but only 2 `+`'s).
|
||||||
|
|
||||||
|
That doesn't look like it saved us any instructions, did it? Well, because we're adding the
|
||||||
|
same number to all of them, we can save one of those `+ 128` operations by assembling
|
||||||
|
`prev2_flipped` out of prev 1 and prev 3 instead of assembling it from input and adding 128
|
||||||
|
to it. One more instruction saved!
|
||||||
|
|
||||||
|
```
|
||||||
|
prev1 = input.prev<1>
|
||||||
|
prev3 = input.prev<3>
|
||||||
|
prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
|
||||||
|
prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
|
||||||
|
prev2_flipped = prev1_flipped.concat<2>(prev3_flipped); // shuffle: take the first 2 bytes from prev1 and the rest from prev3
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Bringing It All Together: Detecting the Errors
|
||||||
|
|
||||||
|
At this point, we have `is_continuation`, `is_second_byte`, `is_third_byte` and `is_fourth_byte`.
|
||||||
|
All we have left to do is check if they match!
|
||||||
|
|
||||||
|
```
|
||||||
|
return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
|
||||||
|
```
|
||||||
|
|
||||||
|
But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
|
||||||
|
parallel*. You have to do 2 `|`'s and then 1 `^`. Haswell, at least, has 3 ports that can do
|
||||||
|
bitwise operations, and we're only using 1!
|
||||||
|
|
||||||
|
#### Epilogue: Addition For Booleans
|
||||||
|
|
||||||
|
There is one big case the above code doesn't explicitly talk about--what if is_second_byte
|
||||||
|
and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
|
||||||
|
to each other (or any combination), and the continuation could be part of either of them!
|
||||||
|
Our algorithm using `^` and `|` won't detect that the continuation byte is problematic.
|
||||||
|
|
||||||
|
Never fear, though. If that situation occurs, we'll already have detected that the second
|
||||||
|
leading byte was an error, because it was supposed to be a part of the preceding multibyte
|
||||||
|
character, but it *wasn't a continuation*.
|
||||||
|
|
||||||
|
We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
|
||||||
|
`^`, which is both interesting and possibly useful (even though we're not using it here). It
|
||||||
|
exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
|
||||||
|
comparisons were giving us numbers!
|
||||||
|
|
||||||
|
Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
|
||||||
|
circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
|
||||||
|
`(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
|
||||||
|
*both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
|
||||||
|
*any* nonzero value is treated as an error (not just -1), we're just fine here :)
|
||||||
|
|
||||||
|
Further, if *more than one* multibyte character overlaps,
|
||||||
|
`is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
|
||||||
|
from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
|
||||||
|
considered an error.
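
The difference between the two formulations can be seen on a single lane. The sketch below models a SIMD boolean lane as an `int8_t` holding -1 or 0; the type alias and function names are assumptions for illustration only.

```cpp
#include <cstdint>

using mask = int8_t;  // SIMD-style boolean lane: -1 is true, 0 is false

// Bitwise form: zero means "no error" for this lane.
mask error_bitwise(mask second, mask third, mask fourth, mask continuation) {
  return static_cast<mask>((second | third | fourth) ^ continuation);
}

// Arithmetic form: any nonzero result is treated as an error.
mask error_arithmetic(mask second, mask third, mask fourth, mask continuation) {
  return static_cast<mask>((second + third + fourth) - continuation);
}
```

When two lead bytes overlap on a continuation byte (`second` and `third` both -1, `continuation` -1), the bitwise form returns 0 for that lane while the arithmetic form returns -1, which is the case described above.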
|
||||||
|
|
||||||
|
One reason you might want to do this is parallelism. ^ and | are not associative with each other, so
|
||||||
|
(A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
|
||||||
|
you do B | C -> | A -> ^ D. But addition and subtraction *are* associative: (A + B + C) - D can
|
||||||
|
be written as `(A + B) + (C - D)`. This means you can do A + B and C - D at the same time, and
|
||||||
|
then add the results together. Same number of operations, but if the processor can run
|
||||||
|
independent things in parallel (which most can), it runs faster.
|
||||||
|
|
||||||
|
This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
|
||||||
|
a super nice advantage in that more of them can be run at the same time (they can run on 3
|
||||||
|
ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
|
||||||
|
saving us the cycle we would have earned by using +. Even more, using an instruction with a
|
||||||
|
wider array of ports can help *other* code run ahead, too, since these instructions can "get
|
||||||
|
out of the way," running on a port other instructions can't.
|
||||||
|
|
||||||
|
#### Epilogue II: One More Trick
|
||||||
|
|
||||||
|
There's one more relevant trick up our sleeve: it turns out that on Intel we can "pay
|
||||||
|
for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
|
||||||
|
check_special_cases()--but we'll talk about that there :)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## About the Project
|
## About the Project
|
||||||
|
|
||||||
### Bindings and Ports of simdjson
|
### Bindings and Ports of simdjson
|
||||||
|
@ -420,6 +627,8 @@ make allparsingcompetition
|
||||||
Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
|
Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
|
||||||
a table-oriented output that can be conveniently parsed by other tools.
|
a table-oriented output that can be conveniently parsed by other tools.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Various References
|
### Various References
|
||||||
|
|
||||||
- [Google double-conv](https://github.com/google/double-conversion/)
|
- [Google double-conv](https://github.com/google/double-conversion/)
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
|
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
|
||||||
|
|
||||||
#include <iostream>
|
#include <iostream>
|
||||||
#include "simdjson.h"
|
#include "simdjson.h"
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
|
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
|
||||||
/* begin file src/simdjson.cpp */
|
/* begin file src/simdjson.cpp */
|
||||||
#include "simdjson.h"
|
#include "simdjson.h"
|
||||||
|
|
||||||
|
@ -3180,10 +3180,10 @@ namespace utf8_validation {
|
||||||
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
||||||
// and emits an error anytime there is a mismatch.
|
// and emits an error anytime there is a mismatch.
|
||||||
//
|
//
|
||||||
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
|
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
|
||||||
// characters, the file will look like this:
|
// characters, the file will look like this:
|
||||||
//
|
//
|
||||||
// | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|
// | Character | | | | | | | | | | a | b |
|
||||||
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
||||||
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
||||||
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
||||||
|
@ -4049,10 +4049,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
|
||||||
// If you consume a large value and you map it to "infinity", you will no
|
// If you consume a large value and you map it to "infinity", you will no
|
||||||
// longer be able to serialize back a standard-compliant JSON. And there is
|
// longer be able to serialize back a standard-compliant JSON. And there is
|
||||||
// no realistic application where you might need values so large than they
|
// no realistic application where you might need values so large than they
|
||||||
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
|
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
|
||||||
// 10^308 It is an unimaginable large number. There will never be any piece of
|
// 10^308 It is an unimaginable large number. There will never be any piece of
|
||||||
// engineering involving as many as 10^308 parts. It is estimated that there
|
// engineering involving as many as 10^308 parts. It is estimated that there
|
||||||
// are about 10^80 atoms in the universe. The estimate for the total number
|
// are about 10^80 atoms in the universe. The estimate for the total number
|
||||||
// of electrons is similar. Using a double-precision floating-point value, we
|
// of electrons is similar. Using a double-precision floating-point value, we
|
||||||
// can represent easily the number of atoms in the universe. We could also
|
// can represent easily the number of atoms in the universe. We could also
|
||||||
// represent the number of ways you can pick any three individual atoms at
|
// represent the number of ways you can pick any three individual atoms at
|
||||||
|
@ -5872,10 +5872,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
|
||||||
// If you consume a large value and you map it to "infinity", you will no
|
// If you consume a large value and you map it to "infinity", you will no
|
||||||
// longer be able to serialize back a standard-compliant JSON. And there is
|
// longer be able to serialize back a standard-compliant JSON. And there is
|
||||||
// no realistic application where you might need values so large than they
|
// no realistic application where you might need values so large than they
|
||||||
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
|
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
|
||||||
// 10^308 It is an unimaginable large number. There will never be any piece of
|
// 10^308 It is an unimaginable large number. There will never be any piece of
|
||||||
// engineering involving as many as 10^308 parts. It is estimated that there
|
// engineering involving as many as 10^308 parts. It is estimated that there
|
||||||
// are about 10^80 atoms in the universe. The estimate for the total number
|
// are about 10^80 atoms in the universe. The estimate for the total number
|
||||||
// of electrons is similar. Using a double-precision floating-point value, we
|
// of electrons is similar. Using a double-precision floating-point value, we
|
||||||
// can represent easily the number of atoms in the universe. We could also
|
// can represent easily the number of atoms in the universe. We could also
|
||||||
// represent the number of ways you can pick any three individual atoms at
|
// represent the number of ways you can pick any three individual atoms at
|
||||||
|
@ -8142,10 +8142,10 @@ namespace utf8_validation {
|
||||||
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
||||||
// and emits an error anytime there is a mismatch.
|
// and emits an error anytime there is a mismatch.
|
||||||
//
|
//
|
||||||
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
|
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
|
||||||
// characters, the file will look like this:
|
// characters, the file will look like this:
|
||||||
//
|
//
|
||||||
// | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|
// | Character | | | | | | | | | | a | b |
|
||||||
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
||||||
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
||||||
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
||||||
|
@ -9015,10 +9015,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
|
||||||
// If you consume a large value and you map it to "infinity", you will no
|
// If you consume a large value and you map it to "infinity", you will no
|
||||||
// longer be able to serialize back a standard-compliant JSON. And there is
|
// longer be able to serialize back a standard-compliant JSON. And there is
|
||||||
// no realistic application where you might need values so large than they
|
// no realistic application where you might need values so large than they
|
||||||
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
|
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
|
||||||
// 10^308 It is an unimaginable large number. There will never be any piece of
|
// 10^308 It is an unimaginable large number. There will never be any piece of
|
||||||
// engineering involving as many as 10^308 parts. It is estimated that there
|
// engineering involving as many as 10^308 parts. It is estimated that there
|
||||||
// are about 10^80 atoms in the universe. The estimate for the total number
|
// are about 10^80 atoms in the universe. The estimate for the total number
|
||||||
// of electrons is similar. Using a double-precision floating-point value, we
|
// of electrons is similar. Using a double-precision floating-point value, we
|
||||||
// can represent easily the number of atoms in the universe. We could also
|
// can represent easily the number of atoms in the universe. We could also
|
||||||
// represent the number of ways you can pick any three individual atoms at
|
// represent the number of ways you can pick any three individual atoms at
|
||||||
|
@ -11254,10 +11254,10 @@ namespace utf8_validation {
|
||||||
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
||||||
// and emits an error anytime there is a mismatch.
|
// and emits an error anytime there is a mismatch.
|
||||||
//
|
//
|
||||||
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
|
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
|
||||||
// characters, the file will look like this:
|
// characters, the file will look like this:
|
||||||
//
|
//
|
||||||
// | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|
// | Character | | | | | | | | | | a | b |
|
||||||
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
||||||
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
||||||
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
||||||
|
@ -12130,10 +12130,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
|
||||||
// If you consume a large value and you map it to "infinity", you will no
|
// If you consume a large value and you map it to "infinity", you will no
|
||||||
// longer be able to serialize back a standard-compliant JSON. And there is
|
// longer be able to serialize back a standard-compliant JSON. And there is
|
||||||
// no realistic application where you might need values so large than they
|
// no realistic application where you might need values so large than they
|
||||||
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
|
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
|
||||||
// 10^308 It is an unimaginable large number. There will never be any piece of
|
// 10^308 It is an unimaginable large number. There will never be any piece of
|
||||||
// engineering involving as many as 10^308 parts. It is estimated that there
|
// engineering involving as many as 10^308 parts. It is estimated that there
|
||||||
// are about 10^80 atoms in the universe. The estimate for the total number
|
// are about 10^80 atoms in the universe. The estimate for the total number
|
||||||
// of electrons is similar. Using a double-precision floating-point value, we
|
// of electrons is similar. Using a double-precision floating-point value, we
|
||||||
// can represent easily the number of atoms in the universe. We could also
|
// can represent easily the number of atoms in the universe. We could also
|
||||||
// represent the number of ways you can pick any three individual atoms at
|
// represent the number of ways you can pick any three individual atoms at
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
|
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
|
||||||
/* begin file include/simdjson.h */
|
/* begin file include/simdjson.h */
|
||||||
#ifndef SIMDJSON_H
|
#ifndef SIMDJSON_H
|
||||||
#define SIMDJSON_H
|
#define SIMDJSON_H
|
||||||
|
|
|
@ -64,6 +64,7 @@
|
||||||
using namespace simd;
|
using namespace simd;
|
||||||
|
|
||||||
namespace utf8_validation {
|
namespace utf8_validation {
|
||||||
|
// For a detailed description of the lookup2 algorithm, see the file HACKING.md under "UTF-8 validation (lookup2)".
|
||||||
|
|
||||||
//
|
//
|
||||||
// Find special case UTF-8 errors where the character is technically readable (has the right length)
|
// Find special case UTF-8 errors where the character is technically readable (has the right length)
|
||||||
|
@ -139,214 +140,6 @@ namespace utf8_validation {
|
||||||
return byte_1_high & byte_1_low & byte_2_high;
|
return byte_1_high & byte_1_low & byte_2_high;
|
||||||
}
|
}
|
||||||
|
|
||||||
//
|
|
||||||
// Validate the length of multibyte characters (that each multibyte character has the right number
|
|
||||||
// of continuation characters, and that all continuation characters are part of a multibyte
|
|
||||||
// character).
|
|
||||||
//
|
|
||||||
// Algorithm
|
|
||||||
// =========
|
|
||||||
//
|
|
||||||
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
|
|
||||||
// and emits an error anytime there is a mismatch.
|
|
||||||
//
|
|
||||||
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
|
|
||||||
// characters, the file will look like this:
|
|
||||||
//
|
|
||||||
// | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
|
|
||||||
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
|
|
||||||
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
|
|
||||||
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
|
|
||||||
// | is_second_byte | | X | | | | X | | | X | | |
|
|
||||||
// | is_third_byte | | | X | | | | X | | | | |
|
|
||||||
// | is_fourth_byte | | | | X | | | | | | | |
|
|
||||||
// | expected_continuation | | X | X | X | | X | X | | X | | |
|
|
||||||
// | is_continuation | | X | X | X | | X | X | | X | | |
|
|
||||||
//
|
|
||||||
// The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
|
|
||||||
//
|
|
||||||
// - **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
|
|
||||||
// part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
|
|
||||||
// floating around extra outside of any character, or that there is an illegal 5-byte character,
|
|
||||||
// or maybe it's at the beginning of the file before any characters have started; but it's an
|
|
||||||
// error in all these cases.
|
|
||||||
// - **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
|
|
||||||
// we started a new character before we were finished with the current one.
|
|
||||||
//
|
|
||||||
// Getting the Previous Bytes
|
|
||||||
// --------------------------
|
|
||||||
//
|
|
||||||
// Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
|
|
||||||
// character, we need to "shift the bytes" to find that out. This is what they mean:
|
|
||||||
//
|
|
||||||
// - `is_continuation`: if the current byte is a continuation.
|
|
||||||
// - `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
|
|
||||||
// - `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
|
|
||||||
// - `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
|
|
||||||
//
|
|
||||||
// We use shuffles to go n bytes back, selecting part of the current `input` and part of the
|
|
||||||
// `prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
|
|
||||||
// function, because the 1-byte-back data is used by other checks as well.
|
|
||||||
//
|
|
||||||
// Getting the Continuation Mask
|
|
||||||
// -----------------------------
|
|
||||||
//
|
|
||||||
// Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
|
|
||||||
// numbers, using signed `<` and `>` operations to check if they are continuations or leads.
|
|
||||||
// In fact, we treat the numbers as *signed*, partly because it helps us, and partly because
|
|
||||||
// Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
|
|
||||||
//
|
|
||||||
// In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
|
|
||||||
// respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
|
|
||||||
// Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
|
|
||||||
//
|
|
||||||
// When treated as signed numbers, they look like this:
|
|
||||||
//
|
|
||||||
// | Type | High Bits | Binary Range | Signed |
|
|
||||||
// |--------------|------------|--------------|--------|
|
|
||||||
// | ASCII | `0` | `01111111` | 127 |
|
|
||||||
// | | | `00000000` | 0 |
|
|
||||||
// | 4+-Byte Lead | `1111` | `11111111` | -1 |
|
|
||||||
// | | | `11110000` | -16 |
|
|
||||||
// | 3-Byte Lead | `1110` | `11101111` | -17 |
|
|
||||||
// | | | `11100000` | -32 |
|
|
||||||
// | 2-Byte Lead | `110` | `11011111` | -33 |
|
|
||||||
// | | | `11000000` | -64 |
|
|
||||||
// | Continuation | `10` | `10111111` | -65 |
|
|
||||||
// | | | `10000000` | -128 |
|
|
||||||
//
|
|
||||||
// This makes it pretty easy to get the continuation mask! It's just a single comparison:
|
|
||||||
//
|
|
||||||
// ```
|
|
||||||
// is_continuation = input < -64
|
|
||||||
// ```
|
|
||||||
//
|
|
||||||
// We can do something similar for the others, but it takes two comparisons instead of one: "is
|
|
||||||
// the start of a 4-byte character" is `> -17` and `< 0`, for example. And 2+ bytes is `< 0` and
|
|
||||||
// `> -65`. Surely we can do better, they're right next to each other!
|
|
||||||
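A scalar sketch of the signed comparisons, one byte at a time, using the ranges from the signed table above. The function names are illustrative only. The cast from `uint8_t` to `int8_t` is assumed to wrap modulo 256, which C++20 guarantees and earlier compilers do in practice.

```cpp
#include <cstdint>

// Reinterpret a byte as a signed 8-bit value, as signed SIMD compares do.
// Assumption: the conversion wraps modulo 256.
inline std::int8_t as_signed(std::uint8_t b) {
  return static_cast<std::int8_t>(b);
}

// One comparison: -128 is the floor, so only the upper bound is needed.
inline bool is_continuation(std::uint8_t b) { return as_signed(b) < -64; }

// Two comparisons each: every lead class is bounded on both sides.
inline bool is_two_byte_lead(std::uint8_t b) {
  return as_signed(b) > -65 && as_signed(b) < -32;
}
inline bool is_three_byte_lead(std::uint8_t b) {
  return as_signed(b) > -33 && as_signed(b) < -16;
}
inline bool is_four_plus_byte_lead(std::uint8_t b) {
  return as_signed(b) > -17 && as_signed(b) < 0;
}
inline bool is_two_plus_byte_lead(std::uint8_t b) {
  return as_signed(b) > -65 && as_signed(b) < 0;
}
```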
//
|
|
||||||
// Getting the is_xxx Masks: Shifting the Range
|
|
||||||
// --------------------------------------------
|
|
||||||
//
|
|
||||||
// Notice *why* continuations were a single comparison. The actual *range* would require two
|
|
||||||
// comparisons--`< -64` and `> -129`--but all bytes are always greater than -129, so we get
|
|
||||||
// that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
|
|
||||||
// just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
|
|
||||||
//
|
|
||||||
// Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
|
|
||||||
// ASCII down into the negative, and puts 4+-Byte Lead at the top:
|
|
||||||
//
|
|
||||||
// | Type | High Bits | Binary Range | Signed |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
// | 4+-Byte Lead (+ 128) | `0111` | `01111111` | 127 |
|
|
||||||
// | | | `01110000` | 112 |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
// | 3-Byte Lead (+ 128) | `0110` | `01101111` | 111 |
|
|
||||||
// | | | `01100000` | 96 |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
// | 2-Byte Lead (+ 128) | `010` | `01011111` | 95 |
|
|
||||||
// | | | `01000000` | 64 |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
// | Continuation (+ 128) | `00` | `00111111` | 63 |
|
|
||||||
// | | | `00000000` | 0 |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
// | ASCII (+ 128) | `1` | `11111111` | -1 |
|
|
||||||
// | | | `10000000` | -128 |
|
|
||||||
// |----------------------|------------|--------------|-------|
|
|
||||||
//
|
|
||||||
// *Now* we can use signed `>` on all of them:
|
|
||||||
//
|
|
||||||
// ```
|
|
||||||
// prev1 = input.prev<1>
|
|
||||||
// prev2 = input.prev<2>
|
|
||||||
// prev3 = input.prev<3>
|
|
||||||
// prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
|
|
||||||
// prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
|
|
||||||
// prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
|
|
||||||
// is_second_byte = prev1_flipped > 63; // 2+-byte lead
|
|
||||||
// is_third_byte = prev2_flipped > 95; // 3+-byte lead
|
|
||||||
// is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
|
|
||||||
// ```
|
|
||||||
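The range shift can be checked exhaustively in scalar code. The sketch below uses hypothetical helper names and works on single bytes rather than SIMD lanes. It confirms that `^ 0x80` equals `+ 128` modulo 256, and that one signed `>` per class reproduces the unsigned thresholds 191, 223 and 239.

```cpp
#include <cstdint>

// Flip the top bit and reinterpret as signed. Assumption: the conversion to
// int8_t wraps modulo 256 (guaranteed since C++20).
inline std::int8_t flip(std::uint8_t b) {
  return static_cast<std::int8_t>(b ^ 0x80);
}

// A single signed comparison per class after the flip.
inline bool is_two_plus_lead(std::uint8_t b) { return flip(b) > 63; }
inline bool is_three_plus_lead(std::uint8_t b) { return flip(b) > 95; }
inline bool is_four_plus_lead(std::uint8_t b) { return flip(b) > 111; }

// Check all 256 byte values against the unsigned formulation.
inline bool flipped_matches_unsigned() {
  for (int v = 0; v < 256; v++) {
    const auto b = static_cast<std::uint8_t>(v);
    if (static_cast<std::uint8_t>(b ^ 0x80) !=
        static_cast<std::uint8_t>(v + 128)) {
      return false;  // XOR and addition would disagree
    }
    if (is_two_plus_lead(b) != (v > 191)) return false;
    if (is_three_plus_lead(b) != (v > 223)) return false;
    if (is_four_plus_lead(b) != (v > 239)) return false;
  }
  return true;
}
```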
//
|
|
||||||
// NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
|
|
||||||
// of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
|
|
||||||
// `^`'s at a time on Haswell, but only 2 `+`'s).
|
|
||||||
//
|
|
||||||
// That doesn't look like it saved us any instructions, did it? Well, because we're adding the
|
|
||||||
// same number to all of them, we can save one of those `+ 128` operations by assembling
|
|
||||||
// `prev2_flipped` out of prev 1 and prev 3 instead of assembling it from input and adding 128
|
|
||||||
// to it. One more instruction saved!
|
|
||||||
//
|
|
||||||
// ```
|
|
||||||
// prev1 = input.prev<1>
|
|
||||||
// prev3 = input.prev<3>
|
|
||||||
// prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
|
|
||||||
// prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
|
|
||||||
// prev2_flipped = prev1_flipped.concat<2>(prev3_flipped); // shuffle: take the first 2 bytes from prev1 and the rest from prev3
|
|
||||||
// ```
|
|
||||||
//
|
|
||||||
// ### Bringing It All Together: Detecting the Errors
|
|
||||||
//
|
|
||||||
// At this point, we have `is_continuation`, `is_second_byte`, `is_third_byte` and `is_fourth_byte`.
|
|
||||||
// All we have left to do is check if they match!
|
|
||||||
//
|
|
||||||
// ```
|
|
||||||
// return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
|
|
||||||
// ```
|
|
||||||
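The final check can be sketched one byte at a time in scalar code. This is an illustration under stated assumptions, not the SIMD routine: `length_errors` and `has_length_error` are hypothetical names, and bytes before the start of the input are treated as ASCII. A lead byte truncated at the very end of the input is not flagged by this per-byte comparison, since no following byte exists to disagree.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One flag per byte: true where "a continuation is expected here" and
// "this byte is a continuation" disagree.
inline std::vector<bool> length_errors(const std::string &s) {
  // Byte `back` positions before index i; positions before the start of the
  // input are treated as ASCII NUL (an assumption of this sketch).
  auto byte_at = [&s](std::size_t i, std::size_t back) -> std::uint8_t {
    return i >= back ? static_cast<std::uint8_t>(s[i - back]) : 0;
  };
  std::vector<bool> errors(s.size());
  for (std::size_t i = 0; i < s.size(); i++) {
    const bool is_continuation = (byte_at(i, 0) & 0xC0) == 0x80;
    const bool is_second_byte = byte_at(i, 1) >= 0xC0;  // 2+-byte lead
    const bool is_third_byte = byte_at(i, 2) >= 0xE0;   // 3+-byte lead
    const bool is_fourth_byte = byte_at(i, 3) >= 0xF0;  // 4+-byte lead
    const bool expected = is_second_byte | is_third_byte | is_fourth_byte;
    errors[i] = expected != is_continuation;  // same as `^` on booleans
  }
  return errors;
}

inline bool has_length_error(const std::string &s) {
  for (bool e : length_errors(s)) {
    if (e) return true;
  }
  return false;
}
```

An extra continuation such as `"a\x80"` is flagged at the `0x80` byte, and a missing continuation such as `E2 BA` followed by `a` is flagged at the `a`.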
//
|
|
||||||
// But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
|
|
||||||
// parallel*. You have to do 2 `|`'s and then 1 `^`. Haswell, at least, has 3 ports that can do
|
|
||||||
// bitwise operations, and we're only using 1!
|
|
||||||
//
|
|
||||||
// Epilogue: Addition For Booleans
|
|
||||||
// -------------------------------
|
|
||||||
//
|
|
||||||
// There is one big case the above code doesn't explicitly talk about--what if is_second_byte
|
|
||||||
// and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
|
|
||||||
// to each other (or any combination), and the continuation could be part of either of them!
|
|
||||||
// Our algorithm using `^` and `|` won't detect that the continuation byte is problematic.
|
|
||||||
//
|
|
||||||
// Never fear, though. If that situation occurs, we'll already have detected that the second
|
|
||||||
// leading byte was an error, because it was supposed to be a part of the preceding multibyte
|
|
||||||
// character, but it *wasn't a continuation*.
|
|
||||||
//
|
|
||||||
// We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
|
|
||||||
// `^`, which is both interesting and possibly useful (even though we're not using it here). It
|
|
||||||
// exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
|
|
||||||
// comparisons were giving us numbers!
|
|
||||||
//
|
|
||||||
// Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
|
|
||||||
// circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
|
|
||||||
// `(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
|
|
||||||
// *both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
|
|
||||||
// *any* nonzero value is treated as an error (not just -1), we're just fine here :)
|
|
||||||
//
|
|
||||||
// Further, if *more than one* multibyte character overlaps,
|
|
||||||
// `is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
|
|
||||||
// from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
|
|
||||||
// considered an error.
|
|
||||||
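The arithmetic variant can be compared against the bitwise one over all 16 combinations of the four flags. A minimal scalar sketch with hypothetical helper names, where `lane` mimics the SIMD convention of -1 for true and 0 for false:

```cpp
#include <cstdint>

// SIMD comparisons yield all-ones (-1) for true and 0 for false per lane.
inline std::int8_t lane(bool b) { return b ? -1 : 0; }

// (is_second_byte + is_third_byte + is_fourth_byte) - is_continuation
inline std::int8_t arithmetic_error(bool second, bool third, bool fourth,
                                    bool continuation) {
  const int sum = lane(second) + lane(third) + lane(fourth) - lane(continuation);
  return static_cast<std::int8_t>(sum);
}

// (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation
inline bool bitwise_error(bool second, bool third, bool fourth,
                          bool continuation) {
  return (second | third | fourth) != continuation;
}

// The arithmetic form is nonzero exactly when the bitwise form reports an
// error OR more than one of the three "expected" flags is set.
inline bool arithmetic_is_at_least_as_strict() {
  for (int mask = 0; mask < 16; mask++) {
    const bool s = mask & 1, t = mask & 2, f = mask & 4, c = mask & 8;
    const int overlapping = int(s) + int(t) + int(f);
    const bool expected = bitwise_error(s, t, f, c) || overlapping > 1;
    if ((arithmetic_error(s, t, f, c) != 0) != expected) return false;
  }
  return true;
}
```

For instance, with `is_second_byte` and `is_third_byte` both true and a continuation present, the bitwise form reports no error while the arithmetic form yields -1.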
//
|
|
||||||
// One reason you might want to do this is parallelism. ^ and | cannot be regrouped across each other, so
|
|
||||||
// (A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
|
|
||||||
// you do B | C -> | A -> ^ D. But addition and subtraction *can* be regrouped: (A + B + C) - D can
|
|
||||||
// be written as `(A + B) + (C - D)`. This means you can do A + B and C - D at the same time, and
|
|
||||||
// then add the results together. Same number of operations, but if the processor can run
|
|
||||||
// independent things in parallel (which most can), it runs faster.
|
|
||||||
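The regrouping claim holds for wrapping 8-bit arithmetic, which is what byte lanes use. The sketch below uses illustrative names and compares the dependent chain with the regrouped form over a sample of byte values that includes 0 and 0xFF, the only values a boolean lane takes.

```cpp
#include <cstdint>

using u8 = std::uint8_t;

// Dependent chain: each step waits for the previous result.
inline u8 chained(u8 a, u8 b, u8 c, u8 d) {
  const u8 ab = static_cast<u8>(a + b);
  const u8 abc = static_cast<u8>(ab + c);
  return static_cast<u8>(abc - d);
}

// Regrouped: `left` and `right` do not depend on each other, so a processor
// with two free ports could compute them in the same cycle.
inline u8 regrouped(u8 a, u8 b, u8 c, u8 d) {
  const u8 left = static_cast<u8>(a + b);
  const u8 right = static_cast<u8>(c - d);
  return static_cast<u8>(left + right);
}

// Compare both forms on a grid of values (0, 15, 30, ..., 255).
inline bool regrouping_is_exact() {
  for (int a = 0; a < 256; a += 15)
    for (int b = 0; b < 256; b += 15)
      for (int c = 0; c < 256; c += 15)
        for (int d = 0; d < 256; d += 15)
          if (chained(u8(a), u8(b), u8(c), u8(d)) !=
              regrouped(u8(a), u8(b), u8(c), u8(d)))
            return false;
  return true;
}
```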
//
|
|
||||||
// This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
|
|
||||||
// a super nice advantage in that more of them can be run at the same time (they can run on 3
|
|
||||||
// ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
|
|
||||||
// saving us the cycle we would have earned by using +. Even more, using an instruction with a
|
|
||||||
// wider array of ports can help *other* code run ahead, too, since these instructions can "get
|
|
||||||
// out of the way," running on a port other instructions can't.
|
|
||||||
//
|
|
||||||
// Epilogue II: One More Trick
|
|
||||||
// ---------------------------
|
|
||||||
//
|
|
||||||
// There's one more relevant trick up our sleeve, it turns out: on Intel we can "pay
|
|
||||||
// for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
|
|
||||||
// check_special_cases()--but we'll talk about that there :)
|
|
||||||
//
|
|
||||||
really_inline simd8<uint8_t> check_multibyte_lengths(simd8<uint8_t> input, simd8<uint8_t> prev_input, simd8<uint8_t> prev1) {
|
really_inline simd8<uint8_t> check_multibyte_lengths(simd8<uint8_t> input, simd8<uint8_t> prev_input, simd8<uint8_t> prev1) {
|
||||||
simd8<uint8_t> prev2 = input.prev<2>(prev_input);
|
simd8<uint8_t> prev2 = input.prev<2>(prev_input);
|
||||||
simd8<uint8_t> prev3 = input.prev<3>(prev_input);
|
simd8<uint8_t> prev3 = input.prev<3>(prev_input);
|
||||||
|
|
|
@ -191,10 +191,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
|
||||||
// If you consume a large value and you map it to "infinity", you will no
|
// If you consume a large value and you map it to "infinity", you will no
|
||||||
// longer be able to serialize back a standard-compliant JSON. And there is
|
// longer be able to serialize back a standard-compliant JSON. And there is
|
||||||
// no realistic application where you might need values so large than they
|
// no realistic application where you might need values so large than they
|
||||||
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
|
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
|
||||||
// 10^308 It is an unimaginable large number. There will never be any piece of
|
// 10^308 It is an unimaginable large number. There will never be any piece of
|
||||||
// engineering involving as many as 10^308 parts. It is estimated that there
|
// engineering involving as many as 10^308 parts. It is estimated that there
|
||||||
// are about 10^80 atoms in the universe. The estimate for the total number
|
// are about 10^80 atoms in the universe. The estimate for the total number
|
||||||
// of electrons is similar. Using a double-precision floating-point value, we
|
// of electrons is similar. Using a double-precision floating-point value, we
|
||||||
// can represent easily the number of atoms in the universe. We could also
|
// can represent easily the number of atoms in the universe. We could also
|
||||||
// represent the number of ways you can pick any three individual atoms at
|
// represent the number of ways you can pick any three individual atoms at
|
||||||
|
|
|
@ -452,8 +452,8 @@ namespace document_stream_tests {
|
||||||
size_t n = snprintf(buf,
|
size_t n = snprintf(buf,
|
||||||
sizeof(buf),
|
sizeof(buf),
|
||||||
"{\"id\": %zu, \"name\": \"name%zu\", \"gender\": \"%s\", "
|
"{\"id\": %zu, \"name\": \"name%zu\", \"gender\": \"%s\", "
|
||||||
"\"été\": {\"id\": %zu, \"name\": \"éventail%zu\"}}",
|
"\"\xC3\xA9t\xC3\xA9\": {\"id\": %zu, \"name\": \"\xC3\xA9ventail%zu\"}}",
|
||||||
i, i, (i % 2) ? "⺃" : "⺕", i % 10, i % 10);
|
i, i, (i % 2) ? "\xE2\xBA\x83" : "\xE2\xBA\x95", i % 10, i % 10);
|
||||||
if (n >= sizeof(buf)) { abort(); }
|
if (n >= sizeof(buf)) { abort(); }
|
||||||
data += std::string(buf, n);
|
data += std::string(buf, n);
|
||||||
}
|
}
|
||||||
|
|