Avoid reduce_once in scalar_centered_binomial_distribution_eta_2_with_prf

The comment that a barrier-less reduce_once was safe turned out to not
*quite* be true. In Clang configurations without auto-vectorization
(notably -O1), Clang would emit a branch instead of a CMOV.

Unfortunately, adding a barrier to reduce_once has significant
performance costs. The problem seems to be that auto-vectorization
breaks. I suspect it is primarily because the value barrier forces the
value into a general-purpose register, while vectorized code puts it
straight into a SIMD register. Letting the compiler see that the mask
comes from a comparison also seems to help a bit.
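
For illustration, a value barrier of the kind discussed here can be
sketched as an empty inline-asm statement with a register constraint.
This is a minimal sketch assuming GCC or Clang; the names
value_barrier_u16_sketch and reduce_once_with_barrier are hypothetical,
and kPrime = 3329 is assumed. The "+r" constraint asks for a
general-purpose register, which is consistent with the suspicion that
the barrier gets in the way of vectorization.

```cpp
#include <cstdint>

// Assumption: the ML-KEM/Kyber modulus.
constexpr uint16_t kPrime = 3329;

// Hypothetical stand-in for a value barrier. The empty asm statement
// claims to read and write |a| in a general-purpose register, so the
// optimizer can no longer reason about how |a| was computed.
static inline uint16_t value_barrier_u16_sketch(uint16_t a) {
#if defined(__GNUC__) || defined(__clang__)
  __asm__("" : "+r"(a) : /* no inputs */);
#endif
  // On other compilers this sketch degrades to a no-op.
  return a;
}

// reduce_once with the barrier applied to the select mask.
// Requires x < 2 * kPrime.
static uint16_t reduce_once_with_barrier(uint16_t x) {
  const uint16_t subtracted = x - kPrime;
  // All ones if the subtraction underflowed (x < kPrime), else zero.
  uint16_t mask = 0u - (subtracted >> 15);
  mask = value_barrier_u16_sketch(mask);
  return (mask & x) | (~mask & subtracted);
}
```

The barrier does not change the result, only what the compiler is
allowed to assume about the mask.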

Based on what we've understood of Clang's select transforms thus far, it
would make sense that ML-KEM might not need the barrier. The main
culprit is turning multiple selects with the same condition into a
branch, and that does not happen in ML-KEM. Yet we observe a problem.

Based on valgrind instrumentation, the problem seems to be limited to
scalar_centered_binomial_distribution_eta_2_with_prf, likely because
the value there has such a limited range. For some reason, this causes
many recent versions of Clang to emit a branch.

I think this may actually be a misoptimization. Indeed the very latest
trunk build of Clang on godbolt does not have this problem. Somewhere
between 8cb44859cc31929521c09fc6a8add66d53db44de and
8daf4f16fa08b5d876e98108721dd1743a360326, LLVM seems to have fixed this
issue.

We can avoid this by computing it differently. We currently write
reduce_once(kPrime + a + b - (c + d)), where a through d are 0 or 1.
Instead, we can write a + b - (c + d), let the underflow happen, and
then conditionally add kPrime based on the sign bit of the result. This
seems to avoid mishaps, for now.
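
As a sanity check on the rewrite described above, the two formulations
can be compared over all 16 possible nibbles. This is a minimal sketch,
not the committed code: sample_old, sample_new and count_mismatches are
hypothetical names, kPrime = 3329 is assumed, and the functions are
made constexpr only so that the comparison runs at compile time. It
checks arithmetic equivalence only and says nothing about whether a
given compiler emits a branch.

```cpp
#include <cstdint>

// Assumption: the ML-KEM/Kyber modulus.
constexpr uint16_t kPrime = 3329;

// Mask-based select as in reduce_once, without a value barrier.
constexpr uint16_t reduce_once(uint16_t x) {
  const uint16_t subtracted = x - kPrime;
  const uint16_t mask = 0u - (subtracted >> 15);
  return (mask & x) | (~mask & subtracted);
}

// Old formulation: bias by kPrime first, then reduce.
constexpr uint16_t sample_old(uint8_t nibble) {
  uint16_t value = kPrime;
  value += (nibble & 1) + ((nibble >> 1) & 1);
  value -= ((nibble >> 2) & 1) + ((nibble >> 3) & 1);
  return reduce_once(value);
}

// New formulation: let a + b - (c + d) underflow, then add kPrime
// when the sign bit of the 16-bit result is set.
constexpr uint16_t sample_new(uint8_t nibble) {
  uint16_t value = (nibble & 1) + ((nibble >> 1) & 1);
  value -= ((nibble >> 2) & 1) + ((nibble >> 3) & 1);
  const uint16_t mask = 0u - (value >> 15);
  return ((value + kPrime) & mask) | (value & ~mask);
}

// Counts the nibbles on which the two formulations disagree.
constexpr int count_mismatches() {
  int mismatches = 0;
  for (int n = 0; n < 16; n++) {
    if (sample_old(static_cast<uint8_t>(n)) !=
        sample_new(static_cast<uint8_t>(n))) {
      mismatches++;
    }
  }
  return mismatches;
}

static_assert(count_mismatches() == 0,
              "old and new formulations should agree on every nibble");
```
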

If this breaks down again, we may need to get better value barriers, or
to stop relying on auto-vectorization and vectorize ourselves.

Change-Id: I917456348d63628880467d21138a57297532bc9a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/74447
Auto-Submit: David Benjamin <[email protected]>
Reviewed-by: Adam Langley <[email protected]>
Commit-Queue: David Benjamin <[email protected]>
davidben authored and Boringssl LUCI CQ committed Dec 17, 2024
1 parent f49081b commit ee0c13a
Showing 2 changed files with 40 additions and 20 deletions.
30 changes: 20 additions & 10 deletions crypto/fipsmodule/mlkem/mlkem.cc.inc
@@ -184,10 +184,15 @@ uint16_t reduce_once(uint16_t x) {
   declassify_assert(x < 2 * kPrime);
   const uint16_t subtracted = x - kPrime;
   uint16_t mask = 0u - (subtracted >> 15);
-  // On Aarch64, omitting a |value_barrier_u16| results in a 2x speedup of
-  // ML-KEM overall and Clang still produces constant-time code using `csel`. On
-  // other platforms & compilers on godbolt that we care about, this code also
-  // produces constant-time output.
+  // Although this is a constant-time select, we omit a value barrier here.
+  // Value barriers impede auto-vectorization (likely because it forces the
+  // value to transit through a general-purpose register). On AArch64, this is a
+  // difference of 2x.
+  //
+  // We usually add value barriers to selects because Clang turns consecutive
+  // selects with the same condition into a branch instead of CMOV/CSEL. This
+  // condition does not occur in ML-KEM, so omitting it seems to be safe so far,
+  // but see |scalar_centered_binomial_distribution_eta_2_with_prf|.
   return (mask & x) | (~mask & subtracted);
 }

@@ -393,16 +398,21 @@ void scalar_centered_binomial_distribution_eta_2_with_prf(
   for (int i = 0; i < DEGREE; i += 2) {
     uint8_t byte = entropy[i / 2];

-    uint16_t value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    uint16_t value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i] = reduce_once(value);
+    // Add |kPrime| if |value| underflowed. See |reduce_once| for a discussion
+    // on why the value barrier is omitted. While this could have been written
+    // reduce_once(value + kPrime), this is one extra addition and small range
+    // of |value| tempts some versions of Clang to emit a branch.
+    uint16_t mask = 0u - (value >> 15);
+    out->c[i] = ((value + kPrime) & mask) | (value & ~mask);

     byte >>= 4;
-    value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i + 1] = reduce_once(value);
+    // See above.
+    mask = 0u - (value >> 15);
+    out->c[i + 1] = ((value + kPrime) & mask) | (value & ~mask);
   }
 }

30 changes: 20 additions & 10 deletions crypto/kyber/kyber.cc
@@ -138,10 +138,15 @@ static uint16_t reduce_once(uint16_t x) {
   assert(x < 2 * kPrime);
   const uint16_t subtracted = x - kPrime;
   uint16_t mask = 0u - (subtracted >> 15);
-  // On Aarch64, omitting a |value_barrier_u16| results in a 2x speedup of Kyber
-  // overall and Clang still produces constant-time code using `csel`. On other
-  // platforms & compilers on godbolt that we care about, this code also
-  // produces constant-time output.
+  // Although this is a constant-time select, we omit a value barrier here.
+  // Value barriers impede auto-vectorization (likely because it forces the
+  // value to transit through a general-purpose register). On AArch64, this is a
+  // difference of 2x.
+  //
+  // We usually add value barriers to selects because Clang turns consecutive
+  // selects with the same condition into a branch instead of CMOV/CSEL. This
+  // condition does not occur in Kyber, so omitting it seems to be safe so far,
+  // but see |scalar_centered_binomial_distribution_eta_2_with_prf|.
   return (mask & x) | (~mask & subtracted);
 }

@@ -337,16 +342,21 @@ static void scalar_centered_binomial_distribution_eta_2_with_prf(
   for (int i = 0; i < DEGREE; i += 2) {
     uint8_t byte = entropy[i / 2];

-    uint16_t value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    uint16_t value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i] = reduce_once(value);
+    // Add |kPrime| if |value| underflowed. See |reduce_once| for a discussion
+    // on why the value barrier is omitted. While this could have been written
+    // reduce_once(value + kPrime), this is one extra addition and small range
+    // of |value| tempts some versions of Clang to emit a branch.
+    uint16_t mask = 0u - (value >> 15);
+    out->c[i] = value + (kPrime & mask);

     byte >>= 4;
-    value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i + 1] = reduce_once(value);
+    // See above.
+    mask = 0u - (value >> 15);
+    out->c[i + 1] = value + (kPrime & mask);
   }
 }

