Avoid reduce_once in scalar_centered_binomial_distribution_eta_2_with_prf

The comment that a barrier-less reduce_once was safe turned out to not *quite* be true. In Clang configurations without auto-vectorization (notably -O1), Clang would emit a branch instead of a CMOV.

Unfortunately, adding a barrier to reduce_once has significant performance costs. The problem seems to be that auto-vectorization breaks. I suspect it is primarily because the value barrier forces the value into a general-purpose register, while vectorized code puts it straight into a SIMD register. Though knowing the comparison is a comparison seems to also help a bit.

Based on what we've understood of Clang's select transforms thus far, it would make sense that ML-KEM might not need the barrier. The main culprit is turning multiple selects with the same condition into a branch, and that does not happen in ML-KEM. Yet we observe a problem.

Based on valgrind instrumentation, the problem seems to be limited to scalar_centered_binomial_distribution_eta_2_with_prf, likely because the value has such a limited range of values. For some reason, this causes many recent versions of Clang to emit a branch. I think this may actually be a misoptimization. Indeed the very latest trunk build of Clang on godbolt does not have this problem. Somewhere between 8cb44859cc31929521c09fc6a8add66d53db44de and 8daf4f16fa08b5d876e98108721dd1743a360326, LLVM seems to have fixed this issue.

We can avoid this by computing it differently. We currently write reduce_once(kPrime + a + b - (c + d)), where a through d are 0 or 1. Instead, we can write a + b - (c + d), let the underflow happen, and then conditionally add kPrime based on the sign bit of the result. This seems to avoid mishaps, for now.

If this breaks down again, we may need to get better value barriers, or to stop relying on auto-vectorization and vectorize ourselves.

Change-Id: I917456348d63628880467d21138a57297532bc9a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/74447
Auto-Submit: David Benjamin <[email protected]>
Reviewed-by: Adam Langley <[email protected]>
Commit-Queue: David Benjamin <[email protected]>
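As a rough illustration of the rewrite described in the message, the sketch below puts the old formulation (start at kPrime, then reduce_once) next to the new one (let the subtraction underflow, then add kPrime under a sign-bit mask) and compares them over all 16 possible nibbles. The helper names sample_old, sample_new and samples_agree are hypothetical, and kPrime = 3329 is assumed to be the ML-KEM modulus. This only checks that the two forms compute the same value; it says nothing about what a given compiler emits.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the ML-KEM / Kyber modulus, called kPrime in the diff.
constexpr uint16_t kPrime = 3329;

// Barrier-less reduce_once as it appears in the diff: for x in [0, 2*kPrime),
// returns x mod kPrime using a mask-based select.
uint16_t reduce_once(uint16_t x) {
  const uint16_t subtracted = x - kPrime;
  const uint16_t mask = 0u - (subtracted >> 15);
  return (mask & x) | (~mask & subtracted);
}

// Old formulation (hypothetical helper): start from kPrime so the subtraction
// cannot underflow, then reduce.
uint16_t sample_old(uint8_t nibble) {
  uint16_t value = kPrime;
  value += (nibble & 1) + ((nibble >> 1) & 1);
  value -= ((nibble >> 2) & 1) + ((nibble >> 3) & 1);
  return reduce_once(value);
}

// New formulation (hypothetical helper): let the 16-bit value wrap around,
// then add kPrime only when the sign bit (bit 15) is set.
uint16_t sample_new(uint8_t nibble) {
  uint16_t value = (nibble & 1) + ((nibble >> 1) & 1);
  value -= ((nibble >> 2) & 1) + ((nibble >> 3) & 1);
  const uint16_t mask = 0u - (value >> 15);
  return ((value + kPrime) & mask) | (value & ~mask);
}

// Exhaustively compares both formulations over every 4-bit input.
bool samples_agree() {
  for (unsigned nibble = 0; nibble < 16; nibble++) {
    if (sample_old(nibble) != sample_new(nibble)) {
      return false;
    }
  }
  return true;
}
```

Because the input is only four bits wide, the comparison is exhaustive. Whether a particular Clang version compiles either form to a branch still has to be checked in the generated assembly or with valgrind instrumentation, as the message describes.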
google · Dec 17, 2024 · ee0c13a
1 parent f49081b
commit ee0c13a
Showing 2 changed files with 40 additions and 20 deletions.
diff --git a/crypto/fipsmodule/mlkem/mlkem.cc.inc b/crypto/fipsmodule/mlkem/mlkem.cc.inc
@@ -184,10 +184,15 @@ uint16_t reduce_once(uint16_t x) {
   declassify_assert(x < 2 * kPrime);
   const uint16_t subtracted = x - kPrime;
   uint16_t mask = 0u - (subtracted >> 15);
-  // On Aarch64, omitting a |value_barrier_u16| results in a 2x speedup of
-  // ML-KEM overall and Clang still produces constant-time code using `csel`. On
-  // other platforms & compilers on godbolt that we care about, this code also
-  // produces constant-time output.
+  // Although this is a constant-time select, we omit a value barrier here.
+  // Value barriers impede auto-vectorization (likely because it forces the
+  // value to transit through a general-purpose register). On AArch64, this is a
+  // difference of 2x.
+  //
+  // We usually add value barriers to selects because Clang turns consecutive
+  // selects with the same condition into a branch instead of CMOV/CSEL. This
+  // condition does not occur in ML-KEM, so omitting it seems to be safe so far,
+  // but see |scalar_centered_binomial_distribution_eta_2_with_prf|.
   return (mask & x) | (~mask & subtracted);
 }
 
@@ -393,16 +398,21 @@ void scalar_centered_binomial_distribution_eta_2_with_prf(
   for (int i = 0; i < DEGREE; i += 2) {
     uint8_t byte = entropy[i / 2];
 
-    uint16_t value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    uint16_t value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i] = reduce_once(value);
+    // Add |kPrime| if |value| underflowed. See |reduce_once| for a discussion
+    // on why the value barrier is omitted. While this could have been written
+    // reduce_once(value + kPrime), this is one extra addition and small range
+    // of |value| tempts some versions of Clang to emit a branch.
+    uint16_t mask = 0u - (value >> 15);
+    out->c[i] = ((value + kPrime) & mask) | (value & ~mask);
 
     byte >>= 4;
-    value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i + 1] = reduce_once(value);
+    // See above.
+    mask = 0u - (value >> 15);
+    out->c[i + 1] = ((value + kPrime) & mask) | (value & ~mask);
   }
 }
 

diff --git a/crypto/kyber/kyber.cc b/crypto/kyber/kyber.cc
@@ -138,10 +138,15 @@ static uint16_t reduce_once(uint16_t x) {
   assert(x < 2 * kPrime);
   const uint16_t subtracted = x - kPrime;
   uint16_t mask = 0u - (subtracted >> 15);
-  // On Aarch64, omitting a |value_barrier_u16| results in a 2x speedup of Kyber
-  // overall and Clang still produces constant-time code using `csel`. On other
-  // platforms & compilers on godbolt that we care about, this code also
-  // produces constant-time output.
+  // Although this is a constant-time select, we omit a value barrier here.
+  // Value barriers impede auto-vectorization (likely because it forces the
+  // value to transit through a general-purpose register). On AArch64, this is a
+  // difference of 2x.
+  //
+  // We usually add value barriers to selects because Clang turns consecutive
+  // selects with the same condition into a branch instead of CMOV/CSEL. This
+  // condition does not occur in Kyber, so omitting it seems to be safe so far,
+  // but see |scalar_centered_binomial_distribution_eta_2_with_prf|.
   return (mask & x) | (~mask & subtracted);
 }
 
@@ -337,16 +342,21 @@ static void scalar_centered_binomial_distribution_eta_2_with_prf(
   for (int i = 0; i < DEGREE; i += 2) {
     uint8_t byte = entropy[i / 2];
 
-    uint16_t value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    uint16_t value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i] = reduce_once(value);
+    // Add |kPrime| if |value| underflowed. See |reduce_once| for a discussion
+    // on why the value barrier is omitted. While this could have been written
+    // reduce_once(value + kPrime), this is one extra addition and small range
+    // of |value| tempts some versions of Clang to emit a branch.
+    uint16_t mask = 0u - (value >> 15);
+    out->c[i] = value + (kPrime & mask);
 
     byte >>= 4;
-    value = kPrime;
-    value += (byte & 1) + ((byte >> 1) & 1);
+    value = (byte & 1) + ((byte >> 1) & 1);
     value -= ((byte >> 2) & 1) + ((byte >> 3) & 1);
-    out->c[i + 1] = reduce_once(value);
+    // See above.
+    mask = 0u - (value >> 15);
+    out->c[i + 1] = value + (kPrime & mask);
   }
 }
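Note that the two hunks above spell the conditional add differently: mlkem.cc.inc selects between value + kPrime and value, while kyber.cc adds kPrime & mask to value. The sketch below, with the hypothetical helper names add_prime_select, add_prime_masked and spellings_agree, and assuming kPrime = 3329, checks that the two spellings produce the same 16-bit result for every input. It makes no claim about which one a compiler handles better.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the ML-KEM / Kyber modulus, called kPrime in the diff.
constexpr uint16_t kPrime = 3329;

// Spelling used in the mlkem.cc.inc hunk: select between the two candidates.
uint16_t add_prime_select(uint16_t value) {
  const uint16_t mask = 0u - (value >> 15);  // 0xffff if bit 15 is set, else 0
  return ((value + kPrime) & mask) | (value & ~mask);
}

// Spelling used in the kyber.cc hunk: add either kPrime or zero.
uint16_t add_prime_masked(uint16_t value) {
  const uint16_t mask = 0u - (value >> 15);
  return value + (kPrime & mask);
}

// Exhaustively compares the two spellings over all 16-bit inputs.
bool spellings_agree() {
  for (uint32_t v = 0; v <= 0xffff; v++) {
    const uint16_t value = static_cast<uint16_t>(v);
    if (add_prime_select(value) != add_prime_masked(value)) {
      return false;
    }
  }
  return true;
}
```

Both functions truncate to 16 bits on return, which is why the intermediate sum overflowing past 0xffff does not matter here.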