If you have access to the BMI2 instruction set, I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:
I'm surprised there are no UTF-8-specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero".
/// Encode a UTF-8 codepoint.
/// […]
/// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
/// it's up to the caller to turn that into U+FFFD, or return an error.
It's not a "UTF-8 codepoint", that's horridly mangling the terminology. Code points are just code points.
The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)
(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)
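To make the distinction concrete, here is a minimal sketch using only the Rust standard library. The function names `to_scalar` and `encode` are hypothetical, chosen for illustration; this is not the article's code. Validation happens once, when an arbitrary `u32` is turned into a `char`, and after that point encoding has no failure case.

```rust
/// Hypothetical helper: the only fallible step. `char::from_u32` rejects
/// surrogates (U+D800..=U+DFFF) and values above U+10FFFF.
pub fn to_scalar(code_point: u32) -> Option<char> {
    char::from_u32(code_point)
}

/// Hypothetical helper: encoding a scalar value cannot fail.
/// Returns the buffer and the number of bytes used (always 1 to 4).
pub fn encode(c: char) -> ([u8; 4], usize) {
    let mut buf = [0u8; 4];
    let len = c.encode_utf8(&mut buf).len();
    (buf, len)
}
```

With this split, the caller decides what to do with a `None` (substitute U+FFFD, return an error) before encoding is ever attempted, so the encoder never needs a "length of zero" case.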
comex ·5 hours ago
First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:
https://github.com/llvm/llvm-project/pull/102885
Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).
If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSF with the same register as destination. If the BSF input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSF input is zero, then BSF leaves the register unmodified and you get 32.
Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.
Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.
(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")
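As a rough sketch of the preset-register trick, here it is for a 32-bit trailing-zero count, which is the case BSF handles directly. The names `bsf_model` and `cttz_preset` are made up for illustration; this is not LLVM's codegen. The x86-64 path assumes the CPU leaves the destination untouched on a zero input, which is exactly the behavior discussed above, and other targets fall back to a portable model of that behavior.

```rust
/// Portable model of BSF as described above: a zero source leaves the
/// destination unchanged; otherwise the destination receives the index of
/// the lowest set bit, which equals the trailing-zero count.
pub fn bsf_model(dest: u32, src: u32) -> u32 {
    if src == 0 { dest } else { src.trailing_zeros() }
}

/// Preset the destination to 32, then scan into the same register.
/// ASSUMPTION: the CPU does not modify the destination when `x == 0`.
#[cfg(target_arch = "x86_64")]
pub fn cttz_preset(x: u32) -> u32 {
    let mut out: u32 = 32;
    // Roughly `mov out, 32` followed by `bsf out, x` (Intel syntax).
    unsafe {
        core::arch::asm!(
            "bsf {out:e}, {src:e}",
            out = inout(reg) out,
            src = in(reg) x,
            options(pure, nomem, nostack),
        );
    }
    out
}

/// Fallback for other targets: the same sequence run through the model.
#[cfg(not(target_arch = "x86_64"))]
pub fn cttz_preset(x: u32) -> u32 {
    bsf_model(32, x)
}
```

In practice `x.trailing_zeros()` already returns 32 for a zero input, so this is the kind of sequence a compiler could emit for baseline x86-64 rather than something to write by hand. It uses 32-bit registers throughout, so the upper-half discrepancy Intel mentions does not affect the result.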
orlp ·4 hours ago
koala_man ·4 hours ago
deathanatos ·1 hour ago
xeeeeeeeeeeenu ·5 hours ago
Not if you're targeting x86-64-v3 or higher. That level requires LZCNT, which Intel introduced with Haswell (AMD has had it since K10, with TZCNT following in Piledriver) and which doesn't have this problem.