[[TOC]]
Highway is a collection of 'ops': platform-agnostic pure functions that operate
on tuples (multiple values of the same type). These functions are implemented
using platform-specific intrinsics, which map to SIMD/vector instructions.
hwy/contrib also includes higher-level algorithms such as FindIf or Sorter implemented using these ops.
Highway can use dynamic dispatch, which chooses the best available implementation at runtime, or static dispatch which has no runtime overhead. Dynamic dispatch works by compiling your code once per target CPU and then selecting (via indirect call) at runtime.
Examples of both are provided in examples/. Dynamic dispatch uses the same source code as static, plus #define HWY_TARGET_INCLUDE, #include "third_party/highway/hwy/foreach_target.h" (which must come before any inclusion of highway.h) and HWY_DYNAMIC_DISPATCH.
The public headers are:
- hwy/highway.h: main header, included from source AND/OR header files that use vector types. Note that including in headers may increase compile time, but allows declaring functions implemented out of line.
- hwy/base.h: included from headers that only need compiler/platform-dependent definitions (e.g. PopCount) without the full highway.h.
- hwy/foreach_target.h: re-includes the translation unit (specified by HWY_TARGET_INCLUDE) once per enabled target to generate code from the same source code. highway.h must still be included.
- hwy/aligned_allocator.h: defines functions for allocating memory with alignment suitable for Load/Store.
- hwy/cache_control.h: defines stand-alone functions to control caching (e.g. prefetching), independent of actual SIMD.
- hwy/nanobenchmark.h: library for precisely measuring elapsed time (under varying inputs) for benchmarking small/medium regions of code.
- hwy/print-inl.h: defines Print() for writing vector lanes to stderr.
- hwy/tests/test_util-inl.h: defines macros for invoking tests on all available targets, plus per-target functions useful in tests.
SIMD implementations must be preceded and followed by the following:
#include "hwy/highway.h"
HWY_BEFORE_NAMESPACE(); // at file scope
namespace project { // optional
namespace HWY_NAMESPACE {
// implementation
// NOLINTNEXTLINE(google-readability-namespace-comments)
} // namespace HWY_NAMESPACE
} // namespace project - optional
HWY_AFTER_NAMESPACE();
- T denotes the type of a vector lane (integer or floating-point);
- N is a size_t value that governs (but is not necessarily identical to) the number of lanes;
- D is shorthand for a zero-sized tag type Simd<T, N, kPow2>, used to select the desired overloaded function (see next section). Use aliases such as ScalableTag instead of referring to this type directly;
- d is an lvalue of type D, passed as a function argument e.g. to Zero;
- V is the type of a vector, which may be a class or built-in type.
Highway vectors consist of one or more 'lanes' of the same built-in type uint##_t, int##_t for ## = 8, 16, 32, 64, plus float##_t for ## = 16, 32, 64 and bfloat16_t.
Beware that char may differ from these types, and is not supported directly. If your code loads from/stores to char*, use T=uint8_t for Highway's d tags (see below) or T=int8_t (which may enable faster less-than/greater-than comparisons), and cast your char* pointers to your T*.
In Highway, float16_t (an IEEE binary16 half-float) and bfloat16_t (the upper 16 bits of an IEEE binary32 float) only support load, store, and conversion to/from float32_t. The behavior of infinity and NaN in float16_t is implementation-defined due to ARMv7.
On RVV/SVE, vectors are sizeless and cannot be wrapped inside a class. The Highway API allows using built-in types as vectors because operations are expressed as overloaded functions. Instead of constructors, overloaded initialization functions such as Set take a zero-sized tag argument called d of type D and return an actual vector of unspecified type.
T is one of the lane types above, and may be retrieved via TFromD<D>.
The actual lane count (used to increment loop counters etc.) can be obtained via Lanes(d). This value might not be known at compile time, thus storage for vectors should be dynamically allocated, e.g. via AllocateAligned(Lanes(d)).
Note that Lanes(d) could potentially change at runtime. This is currently unlikely, and will not be initiated by Highway without user action, but could still happen in other circumstances:
- upon user request in future via special CPU instructions (switching to 'streaming SVE' mode for Arm SME), or
- via system software (prctl with PR_SVE_SET_VL on Linux for Arm SVE). When the vector length is changed using this mechanism, all but the lower 128 bits of vector registers are invalidated.
Thus we discourage caching the result; it is typically used inside a function or basic block. If the application anticipates that one of the above circumstances could happen, it should ensure by some out-of-band mechanism that such changes will not happen during the critical section (the vector code which uses the result of the previously obtained Lanes(d)).
MaxLanes(d) returns a (potentially loose) upper bound on Lanes(d), and is implemented as a constexpr function.
The actual lane count is guaranteed to be a power of two, even on SVE hardware where vectors can be a multiple of 128 bits (there, the extra lanes remain unused). This simplifies alignment: remainders can be computed as count & (Lanes(d) - 1) instead of an expensive modulo. It also ensures loop trip counts that are a large power of two (at least MaxLanes) are evenly divisible by the lane count, thus avoiding the need for a second loop to handle remainders.
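The masking trick for remainders can be illustrated without any SIMD code. The following is a minimal scalar sketch, assuming a hypothetical helper SplitCount whose lanes parameter stands in for the value returned by Lanes(d); neither the struct nor the function is part of Highway.

```cpp
#include <assert.h>
#include <stddef.h>

// Hypothetical helper type (not a Highway API): how a loop over `count`
// elements divides into whole vectors plus leftover elements.
struct LoopSplit {
  size_t whole;      // elements covered by full vectors
  size_t remainder;  // elements left for a tail/remainder step
};

// `lanes` stands in for Lanes(d) and must be a power of two.
inline LoopSplit SplitCount(size_t count, size_t lanes) {
  assert(lanes != 0 && (lanes & (lanes - 1)) == 0);
  // For power-of-two `lanes`, this equals count % lanes.
  const size_t remainder = count & (lanes - 1);
  return LoopSplit{count - remainder, remainder};
}
```

For example, 100 elements with 8 lanes split into 96 elements handled by full vectors and 4 leftover elements, whereas a trip count of 1024 leaves no remainder for any power-of-two lane count up to 1024.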
d lvalues (a tag, NOT actual vector) are obtained using aliases:
- Most common: ScalableTag<T[, kPow2=0]> d; or the macro form HWY_FULL(T[, LMUL=1]) d;. With the default value of the second argument, these both select full vectors which utilize all available lanes.
  Only for targets (e.g. RVV) that support register groups, the kPow2 (-3..3) and LMUL argument (1, 2, 4, 8) specify LMUL, the number of registers in the group. This effectively multiplies the lane count in each operation by LMUL, or left-shifts by kPow2 (negative values are understood as right-shifting by the absolute value). These arguments will eventually be optional hints that may improve performance on 1-2 wide machines (at the cost of reducing the effective number of registers), but RVV target does not yet support fractional LMUL. Thus, mixed-precision code (e.g. demoting float to uint8_t) currently requires LMUL to be at least the ratio of the sizes of the largest and smallest type, and smaller d to be obtained via Half<DLarger>.
- Less common: CappedTag<T, kCap> d or the macro form HWY_CAPPED(T, kCap) d;. These select vectors or masks where no more than the largest power of two not exceeding kCap lanes have observable effects such as loading/storing to memory, or being counted by CountTrue. The number of lanes may also be less; for the HWY_SCALAR target, vectors always have a single lane. For example, CappedTag<T, 3> will use up to two lanes.
- For applications that require fixed-size vectors: FixedTag<T, kCount> d; will select vectors where exactly kCount lanes have observable effects. These may be implemented using full vectors plus additional runtime cost for masking in Load etc. kCount must be a power of two not exceeding HWY_LANES(T), which is one for HWY_SCALAR. This tag can be used when the HWY_SCALAR target is anyway disabled (superseded by a higher baseline) or unusable (due to use of ops such as TableLookupBytes). As a convenience, we also provide Full128<T>, Full64<T> and Full32<T> aliases which are equivalent to FixedTag<T, 16 / sizeof(T)>, FixedTag<T, 8 / sizeof(T)> and FixedTag<T, 4 / sizeof(T)>.
- The result of UpperHalf/LowerHalf has half the lanes. To obtain a corresponding d, use Half<decltype(d)>; the opposite is Twice<>.
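As a worked example of the CappedTag rule, the following scalar sketch computes the upper bound on the number of lanes with observable effects. FloorPow2 and CappedLanes are hypothetical helper names chosen for illustration, and full_lanes stands in for the lane count of a full vector on the current target.

```cpp
#include <assert.h>
#include <stddef.h>

// Largest power of two not exceeding `cap` (hypothetical helper, cap >= 1).
inline size_t FloorPow2(size_t cap) {
  assert(cap >= 1);
  size_t p = 1;
  while (p <= cap / 2) p *= 2;
  return p;
}

// Upper bound on lanes for a capped tag: the cap is rounded down to a power
// of two, and cannot exceed what a full vector provides on this target.
inline size_t CappedLanes(size_t cap, size_t full_lanes) {
  const size_t capped = FloorPow2(cap);
  return capped < full_lanes ? capped : full_lanes;
}
```

With a cap of 3 this yields two lanes on a target with at least two lanes per vector, and a single lane when full_lanes is 1, as on HWY_SCALAR.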
User-specified lane counts or tuples of vectors could cause spills on targets with fewer or smaller vectors. By contrast, Highway encourages vector-length agnostic code, which is more performance-portable.
For mixed-precision code (e.g. uint8_t lanes promoted to float), tags for the smaller types must be obtained from those of the larger type (e.g. via Rebind<uint8_t, ScalableTag<float>>).
Vector types are unspecified and depend on the target. User code could define them as auto, but it is more readable (due to making the type visible) to use an alias such as Vec<D>, or decltype(Zero(d)). Similarly, the mask type can be obtained via Mask<D>.
Vectors are sizeless types on RVV/SVE. Therefore, vectors must not be used in arrays/STL containers (use the lane type T instead), class members, static/thread_local variables, new-expressions (use AllocateAligned instead), and sizeof/pointer arithmetic (increment T* by Lanes(d) instead).
Initializing constants requires a tag type D, or an lvalue d of that type. The D can be passed as a template argument or obtained from a vector type V via DFromV<V>. TFromV<V> is equivalent to TFromD<DFromV<V>>.
Note: Let DV = DFromV<V>. For builtin V (currently necessary on RVV/SVE), DV might not be the same as the D used to create V. In particular, DV must not be passed to Load/Store functions because it may lack the limit on N established by the original D. However, Vec<DV> is the same as V.
Thus a template argument V suffices for generic functions that do not load from/store to memory: template<class V> V Mul4(V v) { return Mul(v, Set(DFromV<V>(), 4)); }.
Example of mixing partial vectors with generic functions:
CappedTag<int16_t, 2> d2;
auto v = Mul4(Set(d2, 2));
Store(v, d2, ptr); // Use d2, NOT DFromV<decltype(v)>()
Let Target denote an instruction set, one of SCALAR/EMU128/SSSE3/SSE4/AVX2/AVX3/AVX3_DL/NEON/SVE/SVE2/WASM/RVV. Each of these is represented by a HWY_Target (for example, HWY_SSE4) macro which expands to a unique power-of-two value.
Note that x86 CPUs are segmented into dozens of feature flags and capabilities, which are often used together because they were introduced in the same CPU (example: AVX2 and FMA). To keep the number of targets and thus compile time and code size manageable, we define targets as 'clusters' of related features. To use HWY_AVX2, it is therefore insufficient to pass -mavx2. For definitions of the clusters, see kGroup* in targets.cc. The corresponding Clang/GCC compiler options to enable them (without -m prefix) are defined by HWY_TARGET_STR* in set_macros-inl.h.
Targets are only used if enabled (i.e. not broken nor disabled). Baseline targets are those for which the compiler is unconditionally allowed to generate instructions (implying the target CPU must support them).
- HWY_STATIC_TARGET is the best enabled baseline HWY_Target, and matches HWY_TARGET in static dispatch mode. This is useful even in dynamic dispatch mode for deducing and printing the compiler flags.
- HWY_TARGETS indicates which targets to generate for dynamic dispatch, and which headers to include. It is determined by configuration macros and always includes HWY_STATIC_TARGET.
- HWY_SUPPORTED_TARGETS is the set of targets available at runtime. Expands to a literal if only a single target is enabled, or SupportedTargets().
- HWY_TARGET: which HWY_Target is currently being compiled. This is initially identical to HWY_STATIC_TARGET and remains so in static dispatch mode. For dynamic dispatch, this changes before each re-inclusion and finally reverts to HWY_STATIC_TARGET. Can be used in #if expressions to provide an alternative to functions which are not supported by HWY_SCALAR.
  In particular, for x86 we sometimes wish to specialize functions for AVX-512 because it provides many new instructions. This can be accomplished via #if HWY_TARGET <= HWY_AVX3, which means AVX-512 or better (e.g. HWY_AVX3_DL). This is because numerically lower targets are better, and no other platform has targets numerically less than those of x86.
- HWY_WANT_SSSE3, HWY_WANT_SSE4: add SSSE3 and SSE4 to the baseline even if they are not marked as available by the compiler. On MSVC, the only ways to enable SSSE3 and SSE4 are defining these, or enabling AVX.
- HWY_WANT_AVX3_DL: opt-in for dynamic dispatch to HWY_AVX3_DL. This is unnecessary if the baseline already includes AVX3_DL.
In the following, the argument or return type V denotes a vector with N lanes, and M a mask. Operations limited to certain vector types begin with a constraint of the form V: {prefixes}[{bits}]. The prefixes u,i,f denote unsigned, signed, and floating-point types, and bits indicates the number of bits per lane: 8, 16, 32, or 64. Any combination of the specified prefixes and bits is allowed. Abbreviations of the form u32 = {u}{32} may also be used.
Note that Highway functions reside in hwy::HWY_NAMESPACE, whereas user-defined functions reside in project::[nested]::HWY_NAMESPACE. Highway functions generally take either a D or vector/mask argument. For targets where vectors and masks are defined in namespace hwy, the functions will be found via Argument-Dependent Lookup. However, this does not work for function templates, and RVV and SVE both use builtin vectors. There are three options for portable code, in descending order of preference:
- namespace hn = hwy::HWY_NAMESPACE; alias used to prefix ops, e.g. hn::LoadDup128(..);
- using hwy::HWY_NAMESPACE::LoadDup128; declarations for each op used;
- using namespace hwy::HWY_NAMESPACE; directive. This is generally discouraged, especially for SIMD code residing in a header.
Note that overloaded operators are not yet supported on RVV and SVE. Until that is resolved, code that wishes to run on all targets must use the corresponding equivalents mentioned in the description of each overloaded operator, for example Lt instead of operator<.
- V Zero(D): returns N-lane vector with all bits set to 0.
- V Set(D, T): returns N-lane vector with all lanes equal to the given value of type T.
- V Undefined(D): returns uninitialized N-lane vector, e.g. for use as an output parameter.
- V Iota(D, T): returns N-lane vector where the lane with index i has the given value of type T plus i. The least significant lane has index 0. This is useful in tests for detecting lane-crossing bugs.
- V SignBit(D): returns N-lane vector with all lanes set to a value whose representation has only the most-significant bit set.
- T GetLane(V): returns lane 0 within V. This is useful for extracting SumOfLanes results.
The following may be slow on some platforms (e.g. x86) and should not be used in time-critical code:
- T ExtractLane(V, size_t i): returns lane i within V. i must be in [0, Lanes(DFromV<V>())). Potentially slow, it may be better to store an entire vector to an array and then operate on its elements.
- V InsertLane(V, size_t i, T t): returns a copy of V whose lane i is set to t. i must be in [0, Lanes(DFromV<V>())). Potentially slow, it may be better to set all elements of an aligned array and then Load it.
- V Print(D, const char* caption, V [, size_t lane][, size_t max_lanes]): prints caption followed by up to max_lanes comma-separated lanes from the vector argument, starting at index lane. Defined in hwy/print-inl.h, also available if hwy/tests/test_util-inl.h has been included.
- V operator+(V a, V b): returns a[i] + b[i] (mod 2^bits). Currently unavailable on SVE/RVV; use the equivalent Add instead.
- V operator-(V a, V b): returns a[i] - b[i] (mod 2^bits). Currently unavailable on SVE/RVV; use the equivalent Sub instead.
- V: {i,f}
  V Neg(V a): returns -a[i].
- V: {i,f}
  V Abs(V a) returns the absolute value of a[i]; for integers, LimitsMin() maps to LimitsMax() + 1.
- V: f32
  V AbsDiff(V a, V b): returns |a[i] - b[i]| in each lane.
- V: u8
  VU64 SumsOf8(V v) returns the sums of 8 consecutive u8 lanes, zero-extending each sum into a u64 lane. This is slower on RVV/WASM.
- V: {u,i}{8,16}
  V SaturatedAdd(V a, V b) returns a[i] + b[i] saturated to the minimum/maximum representable value.
- V: {u,i}{8,16}
  V SaturatedSub(V a, V b) returns a[i] - b[i] saturated to the minimum/maximum representable value.
- V: {u}{8,16}
  V AverageRound(V a, V b) returns (a[i] + b[i] + 1) / 2.
- V Clamp(V a, V lo, V hi): returns a[i] clamped to [lo[i], hi[i]].
- V: {f}
  V operator/(V a, V b): returns a[i] / b[i] in each lane. Currently unavailable on SVE/RVV; use the equivalent Div instead.
- V: {f}
  V Sqrt(V a): returns sqrt(a[i]).
- V: f32
  V ApproximateReciprocalSqrt(V a): returns an approximation of 1.0 / sqrt(a[i]). sqrt(a) ~= ApproximateReciprocalSqrt(a) * a. x86 and PPC provide 12-bit approximations but the error on ARM is closer to 1%.
- V: f32
  V ApproximateReciprocal(V a): returns an approximation of 1.0 / a[i].
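The per-lane behavior of the saturating and rounding ops above can be sketched in scalar code. This is only a model of what happens in one u8 lane, under the assumption that intermediate results are computed at higher precision; the function names are hypothetical and this is not how Highway implements the ops.

```cpp
#include <stdint.h>

// One-lane model of SaturatedAdd for u8: clamps the sum to 255.
inline uint8_t SaturatedAddU8(uint8_t a, uint8_t b) {
  const int sum = static_cast<int>(a) + static_cast<int>(b);
  return static_cast<uint8_t>(sum > 255 ? 255 : sum);
}

// One-lane model of SaturatedSub for u8: clamps the difference to 0.
inline uint8_t SaturatedSubU8(uint8_t a, uint8_t b) {
  const int diff = static_cast<int>(a) - static_cast<int>(b);
  return static_cast<uint8_t>(diff < 0 ? 0 : diff);
}

// One-lane model of AverageRound for u8: (a + b + 1) / 2, computed in a
// wider type so that the intermediate sum cannot overflow.
inline uint8_t AverageRoundU8(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>(
      (static_cast<int>(a) + static_cast<int>(b) + 1) / 2);
}
```

Note the contrast with operator+ and operator-, which wrap around (mod 2^bits) rather than clamping.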
Note: Min/Max corner cases are target-specific and may change. If either argument is qNaN, x86 SIMD returns the second argument, ARMv7 Neon returns NaN, Wasm is supposed to return NaN but does not always, but other targets actually uphold IEEE 754-2019 minimumNumber: returning the other argument if exactly one is qNaN, and NaN if both are.
- V Min(V a, V b): returns min(a[i], b[i]).
- V Max(V a, V b): returns max(a[i], b[i]).
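To make the NaN corner case concrete, here is a scalar sketch of the minimumNumber-style behavior described in the note above for one f32 lane. MinimumNumber is a hypothetical name; as the note says, x86, ARMv7 and Wasm may behave differently, so this only models the targets that uphold that rule.

```cpp
#include <cmath>

// One-lane model: returns the other argument if exactly one is NaN,
// and NaN if both are. Otherwise returns the smaller value.
inline float MinimumNumber(float a, float b) {
  if (std::isnan(a)) return b;  // also yields NaN when both are NaN
  if (std::isnan(b)) return a;
  return (b < a) ? b : a;
}
```

Code that must behave identically on all targets should therefore avoid passing NaN to Min/Max, or handle NaN separately (for example via IsNaN).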
All other ops in this section are only available if HWY_TARGET != HWY_SCALAR:
- V: u64
  V Min128(D, V a, V b): returns the minimum of unsigned 128-bit values, each stored as an adjacent pair of 64-bit lanes (e.g. indices 1 and 0, where 0 is the least-significant 64-bits).
- V: u64
  V Max128(D, V a, V b): returns the maximum of unsigned 128-bit values, each stored as an adjacent pair of 64-bit lanes (e.g. indices 1 and 0, where 0 is the least-significant 64-bits).
- V: u64
  V Min128Upper(D, V a, V b): for each 128-bit key-value pair, returns a if it is considered less than b by Lt128Upper, else b.
- V: u64
  V Max128Upper(D, V a, V b): for each 128-bit key-value pair, returns a if it is considered > b by Lt128Upper, else b.
- V: {u,i}{16,32,64}
  V operator*(V a, V b): returns the lower half of a[i] * b[i] in each lane. Currently unavailable on SVE/RVV; use the equivalent Mul instead.
- V: {f}
  V operator*(V a, V b): returns a[i] * b[i] in each lane. Currently unavailable on SVE/RVV; use the equivalent Mul instead.
- V: i16
  V MulHigh(V a, V b): returns the upper half of a[i] * b[i] in each lane.
- V: i16
  V MulFixedPoint15(V a, V b): returns the result of multiplying two 1.15 fixed-point numbers. This corresponds to doubling the multiplication result and storing the upper half. Results are implementation-defined iff both inputs are -32768.
- V: {u,i}{32},u64
  V2 MulEven(V a, V b): returns double-wide result of a[i] * b[i] for every even i, in lanes i (lower) and i + 1 (upper). V2 is a vector with double-width lanes, or the same as V for 64-bit inputs (which are only supported if HWY_TARGET != HWY_SCALAR).
- V: u64
  V MulOdd(V a, V b): returns double-wide result of a[i] * b[i] for every odd i, in lanes i - 1 (lower) and i (upper). Only supported if HWY_TARGET != HWY_SCALAR.
- V: {bf,i}16, D: RepartitionToWide<DFromV<V>>
  Vec ReorderWidenMulAccumulate(D d, V a, V b, Vec sum0, Vec& sum1): widens a and b to TFromD<D>, then adds a[i] * b[i] to either sum1[j] or lane j of the return value, where j = P(i) and P is a permutation. The only guarantee is that SumOfLanes(d, Add(return_value, sum1)) is the sum of all a[i] * b[i]. This is useful for computing dot products and the L2 norm.
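The lane layout of MulEven is easier to see in a scalar model. The sketch below, for u32 inputs, represents vectors as std::vector and uses the hypothetical name MulEvenU32; it only models which input lanes contribute to which double-wide output lane and is not Highway code.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Scalar model of MulEven for u32 lanes. Each even input lane i produces one
// u64 product, which occupies the bits of input lanes i (lower half) and
// i + 1 (upper half), i.e. output lane i / 2. Odd input lanes are ignored.
inline std::vector<uint64_t> MulEvenU32(const std::vector<uint32_t>& a,
                                        const std::vector<uint32_t>& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::vector<uint64_t> wide(a.size() / 2);
  for (size_t i = 0; i < a.size(); i += 2) {
    wide[i / 2] = static_cast<uint64_t>(a[i]) * b[i];
  }
  return wide;
}
```

Unlike operator*, which keeps only the lower half of each product, the full 64-bit product of the even lanes is retained here.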
When implemented using special instructions, these functions are more precise and faster than separate multiplication followed by addition. The *Sub variants are somewhat slower on ARM; it is preferable to replace them with MulAdd using a negated constant.
- V: {f}
  V MulAdd(V a, V b, V c): returns a[i] * b[i] + c[i].
- V: {f}
  V NegMulAdd(V a, V b, V c): returns -a[i] * b[i] + c[i].
- V: {f}
  V MulSub(V a, V b, V c): returns a[i] * b[i] - c[i].
- V: {f}
  V NegMulSub(V a, V b, V c): returns -a[i] * b[i] - c[i].
Note: Counts not in [0, sizeof(T)*8) yield implementation-defined results. Left-shifting signed T and right-shifting positive signed T is the same as shifting MakeUnsigned<T> and casting to T. Right-shifting negative signed T is the same as an unsigned shift, except that 1-bits are shifted in.
Compile-time constant shifts: the amount must be in [0, sizeof(T)*8). Generally the most efficient variant, but 8-bit shifts are potentially slower than other lane sizes, and RotateRight is often emulated with shifts:
- V: {u,i}
  V ShiftLeft<int>(V a) returns a[i] << int.
- V: {u,i}
  V ShiftRight<int>(V a) returns a[i] >> int.
- V: {u}{32,64}
  V RotateRight<int>(V a) returns (a[i] >> int) | (a[i] << (sizeof(T)*8 - int)).
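The shift and rotate semantics above can be checked with a scalar model of a single 32-bit lane. The names ShiftRightSigned32 and RotateRight32 are hypothetical, and the masking of the left-shift count is an addition of this sketch to keep a rotate by zero well-defined in C++; it is not taken from Highway.

```cpp
#include <assert.h>
#include <stdint.h>

// Models right-shifting a signed lane: an unsigned shift, except that 1-bits
// are shifted in when the input is negative.
inline int32_t ShiftRightSigned32(int32_t x, int bits) {
  assert(0 <= bits && bits < 32);
  uint32_t u = static_cast<uint32_t>(x) >> bits;
  if (x < 0) u |= ~(~uint32_t{0} >> bits);  // set the vacated upper bits
  // Conversion back to signed is modular on all mainstream compilers
  // (and guaranteed to be so from C++20 onwards).
  return static_cast<int32_t>(u);
}

// Models RotateRight<kBits> for one u32 lane using the formula above.
template <int kBits>
inline uint32_t RotateRight32(uint32_t x) {
  static_assert(0 <= kBits && kBits < 32, "shift amount out of range");
  // "& 31" maps the left-shift count 32 (for kBits == 0) to 0.
  return (x >> kBits) | (x << ((32 - kBits) & 31));
}
```

For instance, rotating 0x12345678 right by 8 bits moves the lowest byte to the top, giving 0x78123456.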
Shift all lanes by the same (not necessarily compile-time constant) amount:
- V: {u,i}
  V ShiftLeftSame(V a, int bits) returns a[i] << bits.
- V: {u,i}
  V ShiftRightSame(V a, int bits) returns a[i] >> bits.
Per-lane variable shifts (slow if SSSE3/SSE4, or 16-bit, or Shr i64 on AVX2):
- V: {u,i}{16,32,64}
  V operator<<(V a, V b) returns a[i] << b[i]. Currently unavailable on SVE/RVV; use the equivalent Shl instead.
- V: {u,i}{16,32,64}
  V operator>>(V a, V b) returns a[i] >> b[i]. Currently unavailable on SVE/RVV; use the equivalent Shr instead.
- V: {f}
  V Round(V v): returns v[i] rounded towards the nearest integer, with ties to even.
- V: {f}
  V Trunc(V v): returns v[i] rounded towards zero (truncate).
- V: {f}
  V Ceil(V v): returns v[i] rounded towards positive infinity (ceiling).
- V: {f}
  V Floor(V v): returns v[i] rounded towards negative infinity.
- V: {f}
  M IsNaN(V v): returns mask indicating whether v[i] is "not a number" (unordered).
- V: {f}
  M IsInf(V v): returns mask indicating whether v[i] is positive or negative infinity.
- V: {f}
  M IsFinite(V v): returns mask indicating whether v[i] is neither NaN nor infinity, i.e. normal, subnormal or zero. Equivalent to Not(Or(IsNaN(v), IsInf(v))).
- V: {u,i}
  V PopulationCount(V a): returns the number of 1-bits in each lane, i.e. PopCount(a[i]).
The following operate on individual bits within each lane. Note that the non-operator functions (And instead of &) must be used for floating-point types, and on SVE/RVV.
- V: {u,i}
  V operator&(V a, V b): returns a[i] & b[i]. Currently unavailable on SVE/RVV; use the equivalent And instead.
- V: {u,i}
  V operator|(V a, V b): returns a[i] | b[i]. Currently unavailable on SVE/RVV; use the equivalent Or instead.
- V: {u,i}
  V operator^(V a, V b): returns a[i] ^ b[i]. Currently unavailable on SVE/RVV; use the equivalent Xor instead.
- V: {u,i}
  V Not(V v): returns ~v[i].
- V AndNot(V a, V b): returns ~a[i] & b[i].
The following three-argument functions may be more efficient than assembling them from 2-argument functions:
- V Or3(V o1, V o2, V o3): returns o1[i] | o2[i] | o3[i].
- V OrAnd(V o, V a1, V a2): returns o[i] | (a1[i] & a2[i]).
Special functions for signed types:
- V: {f}
  V CopySign(V a, V b): returns the number with the magnitude of a and sign of b.
- V: {f}
  V CopySignToAbs(V a, V b): as above, but potentially slightly more efficient; requires the first argument to be non-negative.
- V: i32/64
  V BroadcastSignBit(V a) returns a[i] < 0 ? -1 : 0.
- V: {f}
  V ZeroIfNegative(V v): returns v[i] < 0 ? 0 : v[i].
- V: {i,f}
  V IfNegativeThenElse(V v, V yes, V no): returns v[i] < 0 ? yes[i] : no[i]. This may be more efficient than IfThenElse(Lt..).
Let M denote a mask capable of storing a logical true/false for each lane (the encoding depends on the platform).
- M FirstN(D, size_t N): returns mask with the first N lanes (those with index < N) true. N >= Lanes(D()) results in an all-true mask. N must not exceed LimitsMax<SignedFromSize<HWY_MIN(sizeof(size_t), sizeof(TFromD<D>))>>(). Useful for implementing "masked" stores by loading prev followed by IfThenElse(FirstN(d, N), what_to_store, prev).
- M MaskFromVec(V v): returns false in lane i if v[i] == 0, or true if v[i] has all bits set. The result is implementation-defined if v[i] is neither zero nor all bits set.
- M LoadMaskBits(D, const uint8_t* p): returns a mask indicating whether the i-th bit in the array is set. Loads bytes and bits in ascending order of address and index. At least 8 bytes of p must be readable, but only (Lanes(D()) + 7) / 8 need be initialized. Any unused bits (happens if Lanes(D()) < 8) are treated as if they were zero.
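The "masked" store pattern mentioned for FirstN (load prev, blend, store) can be modeled in scalar code. In this sketch, vectors are represented as std::vector<int>, and FirstNModel and MaskedStoreFirstN are hypothetical names; the real ops operate on whole vectors rather than looping over lanes.

```cpp
#include <assert.h>
#include <stddef.h>
#include <vector>

// Scalar model of FirstN: lane i is true iff i < n. Values of n that are
// at least `lanes` yield an all-true mask.
inline std::vector<bool> FirstNModel(size_t lanes, size_t n) {
  std::vector<bool> mask(lanes);
  for (size_t i = 0; i < lanes; ++i) mask[i] = i < n;
  return mask;
}

// Models: prev = Load(dst); Store(IfThenElse(FirstN(n), what, prev), dst).
// Only the first n lanes of dst change; the others are rewritten unchanged.
inline void MaskedStoreFirstN(const std::vector<int>& what_to_store, size_t n,
                              std::vector<int>& dst) {
  assert(what_to_store.size() == dst.size());
  const std::vector<bool> mask = FirstNModel(dst.size(), n);
  for (size_t i = 0; i < dst.size(); ++i) {
    const int prev = dst[i];
    dst[i] = mask[i] ? what_to_store[i] : prev;
  }
}
```

Because the remaining lanes are read and written back, this pattern still touches a full vector of memory, so the whole vector must be accessible.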
- M1 RebindMask(D, M2 m): returns same mask bits as m, but reinterpreted as a mask for lanes of type TFromD<D>. M1 and M2 must have the same number of lanes.
- V VecFromMask(D, M m): returns 0 in lane i if m[i] == false, otherwise all bits set.
- size_t StoreMaskBits(D, M m, uint8_t* p): stores a bit array indicating whether m[i] is true, in ascending order of i, filling the bits of each byte from least to most significant, then proceeding to the next byte. Returns the number of bytes written: (Lanes(D()) + 7) / 8. At least 8 bytes of p must be writable.
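The bit layout used by StoreMaskBits and LoadMaskBits can be made concrete with a scalar model. Here a mask is represented as std::vector<bool>, and StoreMaskBitsModel and LoadMaskBitsModel are hypothetical names; the sketch follows the ordering described above (ascending lane index, least-significant bit first within each byte) and is not the library implementation.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Packs mask lanes into bytes: lane i goes to bit (i % 8) of byte (i / 8).
// Returns the number of bytes written, (lanes + 7) / 8.
inline size_t StoreMaskBitsModel(const std::vector<bool>& m, uint8_t* p) {
  const size_t num_bytes = (m.size() + 7) / 8;
  for (size_t b = 0; b < num_bytes; ++b) p[b] = 0;
  for (size_t i = 0; i < m.size(); ++i) {
    if (m[i]) p[i / 8] = static_cast<uint8_t>(p[i / 8] | (1u << (i % 8)));
  }
  return num_bytes;
}

// Inverse of the above: lane i is true iff the i-th bit of the array is set.
inline std::vector<bool> LoadMaskBitsModel(size_t lanes, const uint8_t* p) {
  std::vector<bool> m(lanes);
  for (size_t i = 0; i < lanes; ++i) m[i] = ((p[i / 8] >> (i % 8)) & 1) != 0;
  return m;
}
```

A 10-lane mask with lanes 0, 2, 3 and 8 set thus occupies two bytes with values 13 and 1.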
- bool AllTrue(D, M m): returns whether all m[i] are true.
- bool AllFalse(D, M m): returns whether all m[i] are false.
- size_t CountTrue(D, M m): returns how many of m[i] are true [0, N]. This is typically more expensive than AllTrue/False.
- intptr_t FindFirstTrue(D, M m): returns the index of the first (i.e. lowest index) m[i] that is true, or -1 if none are.
For IfThen*, masks must adhere to the invariant established by MaskFromVec: false is zero, true has all bits set:
- V IfThenElse(M mask, V yes, V no): returns mask[i] ? yes[i] : no[i].
- V IfThenElseZero(M mask, V yes): returns mask[i] ? yes[i] : 0.
- V IfThenZeroElse(M mask, V no): returns mask[i] ? 0 : no[i].
- V IfVecThenElse(V mask, V yes, V no): equivalent to and possibly faster than IfThenElse(MaskFromVec(mask), yes, no). The result is implementation-defined if mask[i] is neither zero nor all bits set.
- M Not(M m): returns mask of elements indicating whether the input mask element was false.
- M And(M a, M b): returns mask of elements indicating whether both input mask elements were true.
- M AndNot(M not_a, M b): returns mask of elements indicating whether not_a is false and b is true.
- M Or(M a, M b): returns mask of elements indicating whether either input mask element was true.
- M Xor(M a, M b): returns mask of elements indicating whether exactly one input mask element was true.
- V: {u,i,f}{16,32,64}
  V Compress(V v, M m): returns r such that r[n] is v[i], with i the n-th lane index (starting from 0) where m[i] is true. Compacts lanes whose mask is true into the lower lanes. For targets and lane type T where CompressIsPartition<T>::value is true, the upper lanes are those whose mask is false (thus Compress corresponds to partitioning according to the mask). Otherwise, the upper lanes are implementation-defined. Slow with 16-bit lanes. Use this form when the input is already a mask, e.g. returned by a comparison.
- V: {u,i,f}{16,32,64}
  V CompressNot(V v, M m): equivalent to Compress(v, Not(m)) but possibly faster if CompressIsPartition<T>::value is true.
- V: u64
  V CompressBlocksNot(V v, M m): equivalent to CompressNot(v, m) when m is structured as adjacent pairs (both true or false), e.g. as returned by Lt128. This is a no-op for 128 bit vectors. Unavailable if HWY_TARGET == HWY_SCALAR.
- V: {u,i,f}{16,32,64}
  size_t CompressStore(V v, M m, D d, T* p): writes lanes whose mask m is true into p, starting from lane 0. Returns CountTrue(d, m), the number of valid lanes. May be implemented as Compress followed by StoreU; lanes after the valid ones may still be overwritten! Slower for 16-bit lanes.
- V: {u,i,f}{16,32,64}
  size_t CompressBlendedStore(V v, M m, D d, T* p): writes only lanes whose mask m is true into p, starting from lane 0. Returns CountTrue(d, m), the number of lanes written. Does not modify subsequent lanes, but there is no guarantee of atomicity because this may be implemented as Compress, LoadU, IfThenElse(FirstN), StoreU.
- V: {u,i,f}{16,32,64}
  V CompressBits(V v, const uint8_t* HWY_RESTRICT bits): Equivalent to, but often faster than Compress(v, LoadMaskBits(d, bits)). bits is as specified for LoadMaskBits. If called multiple times, the bits pointer passed to this function must also be marked HWY_RESTRICT to avoid repeated work. Note that if the vector has less than 8 elements, incrementing bits will not work as intended for packed bit arrays. As with Compress, CompressIsPartition indicates the mask=false lanes are moved to the upper lanes; this op is also slow for 16-bit lanes.
- V: {u,i,f}{16,32,64}
  size_t CompressBitsStore(V v, const uint8_t* HWY_RESTRICT bits, D d, T* p): combination of CompressStore and CompressBits, see remarks there.
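A scalar model may help clarify what Compress does to lane order when it behaves as a partition. In the sketch below, CompressModel is a hypothetical name and vectors are represented as std::vector<int>. The order of the mask=false lanes in the upper part is an assumption of this sketch; the text only states which lanes end up there.

```cpp
#include <assert.h>
#include <stddef.h>
#include <vector>

// Scalar model of Compress in its partition form: lanes whose mask is true
// are moved to the lower lanes in their original order, followed by the
// lanes whose mask is false. Returns the number of true lanes, which is
// what CompressStore reports as the number of valid lanes.
inline size_t CompressModel(std::vector<int>& v, const std::vector<bool>& m) {
  assert(v.size() == m.size());
  std::vector<int> out;
  out.reserve(v.size());
  for (size_t i = 0; i < v.size(); ++i) {
    if (m[i]) out.push_back(v[i]);
  }
  const size_t num_true = out.size();
  for (size_t i = 0; i < v.size(); ++i) {
    if (!m[i]) out.push_back(v[i]);
  }
  v = out;
  return num_true;
}
```

Typical use is stream compaction: compare a vector against a threshold, compress with the resulting mask, and advance the output pointer by the returned count.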
These return a mask (see above) indicating whether the condition is true.
- M operator==(V a, V b): returns a[i] == b[i]. Currently unavailable on SVE/RVV; use the equivalent Eq instead.
- M operator!=(V a, V b): returns a[i] != b[i]. Currently unavailable on SVE/RVV; use the equivalent Ne instead.
- M operator<(V a, V b): returns a[i] < b[i]. Currently unavailable on SVE/RVV; use the equivalent Lt instead.
- M operator>(V a, V b): returns a[i] > b[i]. Currently unavailable on SVE/RVV; use the equivalent Gt instead.
- V: {f}
  M operator<=(V a, V b): returns a[i] <= b[i]. Currently unavailable on SVE/RVV; use the equivalent Le instead.
- V: {f}
  M operator>=(V a, V b): returns a[i] >= b[i]. Currently unavailable on SVE/RVV; use the equivalent Ge instead.
- V: {u,i}
  M TestBit(V v, V bit): returns (v[i] & bit[i]) == bit[i]. bit[i] must have exactly one bit set.
- V: u64
  M Lt128(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1]:a[0] concatenated to an unsigned 128-bit integer (least significant bits in a[0]) is less than b[1]:b[0]. For each pair, the mask lanes are either both true or both false. Unavailable if HWY_TARGET == HWY_SCALAR.
- V: u64
  M Lt128Upper(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1] is less than b[1]. For each pair, the mask lanes are either both true or both false. This is useful for comparing 64-bit keys alongside 64-bit values. Only available if HWY_TARGET != HWY_SCALAR.
- V: u64
  M Eq128(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1]:a[0] concatenated to an unsigned 128-bit integer (least significant bits in a[0]) equals b[1]:b[0]. For each pair, the mask lanes are either both true or both false. Unavailable if HWY_TARGET == HWY_SCALAR.
- V: u64
  M Eq128Upper(D, V a, V b): for each adjacent pair of 64-bit lanes (e.g. indices 1,0), returns whether a[1] equals b[1]. For each pair, the mask lanes are either both true or both false. This is useful for comparing 64-bit keys alongside 64-bit values. Only available if HWY_TARGET != HWY_SCALAR.
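The difference between Lt128 and Lt128Upper is easiest to see in a scalar model. In the following sketch, vectors of u64 lanes are represented as std::vector<uint64_t> with lane 0 first, and the function names are hypothetical; it mirrors the per-pair rules above rather than any actual implementation.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Scalar model of Lt128: compares a[i+1]:a[i] with b[i+1]:b[i] as unsigned
// 128-bit integers, where the even lane holds the least significant bits.
inline std::vector<bool> Lt128Model(const std::vector<uint64_t>& a,
                                    const std::vector<uint64_t>& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::vector<bool> m(a.size());
  for (size_t i = 0; i < a.size(); i += 2) {
    const bool lt = (a[i + 1] < b[i + 1]) ||
                    (a[i + 1] == b[i + 1] && a[i] < b[i]);
    m[i] = lt;  // both mask lanes of a pair receive the same result
    m[i + 1] = lt;
  }
  return m;
}

// Scalar model of Lt128Upper: only the upper lane (the key) is compared.
inline std::vector<bool> Lt128UpperModel(const std::vector<uint64_t>& a,
                                         const std::vector<uint64_t>& b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::vector<bool> m(a.size());
  for (size_t i = 0; i < a.size(); i += 2) {
    const bool lt = a[i + 1] < b[i + 1];
    m[i] = lt;
    m[i + 1] = lt;
  }
  return m;
}
```

When the upper lanes are equal, Lt128 falls back to comparing the lower lanes whereas Lt128Upper reports false, which is what makes the latter suitable for key-value pairs where only the key participates in ordering.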
Memory operands are little-endian, otherwise their order would depend on the lane configuration. Pointers are the addresses of N consecutive T values, either aligned (address is a multiple of the vector size) or possibly unaligned (denoted p).
Even unaligned addresses must still be a multiple of sizeof(T), otherwise StoreU may crash on some platforms (e.g. RVV and ARMv7). Note that C++ ensures automatic (stack) and dynamically allocated (via new or malloc) variables of type T are aligned to sizeof(T), hence such addresses are suitable for StoreU. However, casting pointers to char* and adding arbitrary offsets (not a multiple of sizeof(T)) can violate this requirement.
Note: computations with low arithmetic intensity (FLOP/s per memory traffic bytes), e.g. dot product, can be 1.5 times as fast when the memory operands are aligned to the vector size. An unaligned access may require two load ports.
- Vec<D> Load(D, const T* aligned): returns aligned[i]. May fault if the pointer is not aligned to the vector size (using aligned_allocator.h is safe). Using this whenever possible improves codegen on SSSE3/SSE4: unlike LoadU, Load can be fused into a memory operand, which reduces register pressure.
Requires only element-aligned vectors (e.g. from malloc/std::vector, or aligned memory at indices which are not a multiple of the vector length):
-
`Vec<D> LoadU(D, const T* p)`: returns `p[i]`.
-
`Vec<D> LoadDup128(D, const T* p)`: returns one 128-bit block loaded from `p` and broadcast into all 128-bit blocks. This may be faster than broadcasting single values, and is more convenient than preparing constants for the actual vector length. Only available if `HWY_TARGET != HWY_SCALAR`.
-
`Vec<D> MaskedLoad(M mask, D, const T* p)`: returns `p[i]`, or zero if the `mask` governing element `i` is false. May fault even where `mask` is false `#if HWY_MEM_OPS_MIGHT_FAULT`. If `p` is aligned, faults cannot happen unless the entire vector is inaccessible. Equivalent to, and potentially more efficient than, `IfThenElseZero(mask, Load(D(), aligned))`.
-
`void LoadInterleaved2(D, const T* p, Vec<D>& v0, Vec<D>& v1)`: equivalent to `LoadU` into `v0, v1` followed by shuffling, such that `v0[0] == p[0], v1[0] == p[1]`.
-
`void LoadInterleaved3(D, const T* p, Vec<D>& v0, Vec<D>& v1, Vec<D>& v2)`: as above, but for three vectors (e.g. RGB samples).
-
`void LoadInterleaved4(D, const T* p, Vec<D>& v0, Vec<D>& v1, Vec<D>& v2, Vec<D>& v3)`: as above, but for four vectors (e.g. RGBA).
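To make the lane mapping of the interleaved loads concrete, here is a scalar model of `LoadInterleaved3` for RGB samples. It is only a sketch of the documented result (`v0[0] == p[0], v1[0] == p[1]`, and so on); `LoadInterleaved3Model` is a hypothetical name and `std::vector` stands in for `Vec<D>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model for `n` lanes: p holds r0 g0 b0 r1 g1 b1 ...
// After the call, v0 holds all r, v1 all g, v2 all b.
void LoadInterleaved3Model(size_t n, const uint8_t* p,
                           std::vector<uint8_t>& v0,
                           std::vector<uint8_t>& v1,
                           std::vector<uint8_t>& v2) {
  v0.resize(n);
  v1.resize(n);
  v2.resize(n);
  for (size_t i = 0; i < n; ++i) {
    v0[i] = p[3 * i + 0];
    v1[i] = p[3 * i + 1];
    v2[i] = p[3 * i + 2];
  }
}
```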
Note: Offsets/indices are of type `VI = Vec<RebindToSigned<D>>` and need not be unique. The results are implementation-defined if any are negative.
Note: Where possible, applications should `Load/Store/TableLookup*` entire vectors, which is much faster than `Scatter/Gather`. Otherwise, code of the form `dst[tbl[i]] = F(src[i])` should when possible be transformed to `dst[i] = F(src[tbl[i]])` because `Scatter` is more expensive than `Gather`.
-
`D`: `{u,i,f}{32,64}`
`void ScatterOffset(Vec<D> v, D, T* base, VI offsets)`: stores `v[i]` to the base address plus byte `offsets[i]`.
-
`D`: `{u,i,f}{32,64}`
`void ScatterIndex(Vec<D> v, D, T* base, VI indices)`: stores `v[i]` to `base[indices[i]]`.
-
`D`: `{u,i,f}{32,64}`
`Vec<D> GatherOffset(D, const T* base, VI offsets)`: returns elements of base selected by byte `offsets[i]`.
-
`D`: `{u,i,f}{32,64}`
`Vec<D> GatherIndex(D, const T* base, VI indices)`: returns vector of `base[indices[i]]`.
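The scatter-to-gather transformation recommended in the note above can be sketched in scalar C++. This example assumes `tbl` is a permutation (all indices unique), so that an inverse table exists, and uses `F(x) = 2 * x` as a placeholder; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter form: dst[tbl[i]] = F(src[i]).
std::vector<int> ScatterForm(const std::vector<int>& src,
                             const std::vector<size_t>& tbl) {
  std::vector<int> dst(src.size());
  for (size_t i = 0; i < src.size(); ++i) dst[tbl[i]] = 2 * src[i];
  return dst;
}

// Inverse table, computed once outside the hot loop: inv[tbl[i]] = i.
std::vector<size_t> InvertPermutation(const std::vector<size_t>& tbl) {
  std::vector<size_t> inv(tbl.size());
  for (size_t i = 0; i < tbl.size(); ++i) inv[tbl[i]] = i;
  return inv;
}

// Gather form: dst[i] = F(src[inv[i]]), which yields the same output.
std::vector<int> GatherForm(const std::vector<int>& src,
                            const std::vector<size_t>& inv) {
  std::vector<int> dst(src.size());
  for (size_t i = 0; i < src.size(); ++i) dst[i] = 2 * src[inv[i]];
  return dst;
}
```

The gather form writes `dst` contiguously, so a vectorized version could use a plain store plus a gather instead of a scatter.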
-
`void Store(Vec<D> v, D, T* aligned)`: copies `v[i]` into `aligned[i]`, which must be aligned to the vector size. Writes exactly `N * sizeof(T)` bytes.
-
`void StoreU(Vec<D> v, D, T* p)`: as `Store`, but the alignment requirement is relaxed to element-aligned (multiple of `sizeof(T)`).
-
`void BlendedStore(Vec<D> v, M m, D d, T* p)`: as `StoreU`, but only updates `p` where `m` is true. May fault even where `m` is false `#if HWY_MEM_OPS_MIGHT_FAULT`. If `p` is aligned, faults cannot happen unless the entire vector is inaccessible. Equivalent to, and potentially more efficient than, `StoreU(IfThenElse(m, v, LoadU(d, p)), d, p)`. "Blended" indicates this may not be atomic; other threads must not concurrently update `[p, p + Lanes(d))` without synchronization.
-
`void SafeFillN(size_t num, T value, D d, T* HWY_RESTRICT to)`: sets `to[0, num)` to `value`. If `num` exceeds `Lanes(d)`, the behavior is target-dependent (either filling all, or no more than one vector). Potentially more efficient than a scalar loop, but will not fault, unlike `BlendedStore`. No alignment requirement. Potentially non-atomic, like `BlendedStore`.
-
`void SafeCopyN(size_t num, D d, const T* HWY_RESTRICT from, T* HWY_RESTRICT to)`: copies `from[0, num)` to `to`. If `num` exceeds `Lanes(d)`, the behavior is target-dependent (either copying all, or no more than one vector). Potentially more efficient than a scalar loop, but will not fault, unlike `BlendedStore`. No alignment requirement. Potentially non-atomic, like `BlendedStore`.
-
`void StoreInterleaved2(Vec<D> v0, Vec<D> v1, D, T* p)`: equivalent to shuffling `v0, v1` followed by two `StoreU()`, such that `p[0] == v0[0], p[1] == v1[0]`.
-
`void StoreInterleaved3(Vec<D> v0, Vec<D> v1, Vec<D> v2, D, T* p)`: as above, but for three vectors (e.g. RGB samples).
-
`void StoreInterleaved4(Vec<D> v0, Vec<D> v1, Vec<D> v2, Vec<D> v3, D, T* p)`: as above, but for four vectors (e.g. RGBA samples).
All functions except `Stream` are defined in cache_control.h.
-
`void Stream(Vec<D> a, D d, T* aligned)`: copies `a[i]` into `aligned[i]` with non-temporal hint if available (useful for write-only data; avoids cache pollution). May be implemented using a CPU-internal buffer. To avoid partial flushes and unpredictable interactions with atomics (for example, see Intel SDM Vol 4, Sec. 8.1.2.2), call this consecutively for an entire cache line (typically 64 bytes, aligned to its size). Each call may write a multiple of `HWY_STREAM_MULTIPLE` bytes, which can exceed `Lanes(d) * sizeof(T)`. The new contents of `aligned` may not be visible until `FlushStream` is called.
-
`void FlushStream()`: ensures values written by previous `Stream` calls are visible on the current core. This is NOT sufficient for synchronizing across cores; when `Stream` outputs are to be consumed by other core(s), the producer must publish availability (e.g. via mutex or atomic_flag) after `FlushStream`.
-
`void FlushCacheline(const void* p)`: invalidates and flushes the cache line containing "p", if possible.
-
`void Prefetch(const T* p)`: optionally begins loading the cache line containing "p" to reduce latency of subsequent actual loads.
-
`void Pause()`: when called inside a spin-loop, may reduce power consumption.
-
`Vec<D> BitCast(D, V)`: returns the bits of `V` reinterpreted as type `Vec<D>`.
-
`V`,`D`: (`u8,u16`), (`u16,u32`), (`u8,u32`), (`u32,u64`), (`u8,i16`), (`u8,i32`), (`u16,i32`), (`i8,i16`), (`i8,i32`), (`i16,i32`), (`i32,i64`)
`Vec<D> PromoteTo(D, V part)`: returns `part[i]` zero- or sign-extended to the integer type `MakeWide<T>`.
-
`V`,`D`: (`f16,f32`), (`bf16,f32`), (`f32,f64`)
`Vec<D> PromoteTo(D, V part)`: returns `part[i]` widened to the floating-point type `MakeWide<T>`.
-
`V`,`D`:
`Vec<D> PromoteTo(D, V part)`: returns `part[i]` converted to 64-bit floating point.
-
`V`,`D`: (`bf16,f32`)
`Vec<D> PromoteLowerTo(D, V v)`: returns `v[i]` widened to `MakeWide<T>`, for i in `[0, Lanes(D()))`. Note that `V` has twice as many lanes as `D` and the return value.
-
`V`,`D`: (`bf16,f32`)
`Vec<D> PromoteUpperTo(D, V v)`: returns `v[i]` widened to `MakeWide<T>`, for i in `[Lanes(D()), 2 * Lanes(D()))`. Note that `V` has twice as many lanes as `D` and the return value.
-
`V`,`V8`: (`u32,u8`)
`V8 U8FromU32(V)`: special-case `u32` to `u8` conversion when all lanes of `V` are already clamped to `[0, 256)`.
-
`V`,`D`: (`u64,u32`), (`u64,u16`), (`u64,u8`), (`u32,u16`), (`u32,u8`), (`u16,u8`)
`Vec<D> TruncateTo(D, V v)`: returns `v[i]` truncated to the smaller type indicated by `T = TFromD<D>`, with the same result as if the more-significant input bits that do not fit in `T` had been zero. Example: `ScalableTag<uint32_t> du32; Rebind<uint8_t, decltype(du32)> du8; TruncateTo(du8, Set(du32, 0xF08F))` is the same as `Set(du8, 0x8F)`.
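Per lane, truncation behaves like an unsigned narrowing cast in C++. The following scalar sketch reproduces the `0xF08F` example above; `TruncateModel` is a hypothetical name, not a Highway function.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of TruncateTo for one lane: keeps only the low bits that fit
// in To. Unsigned narrowing conversions in C++ are modulo 2^bits.
template <typename To, typename From>
To TruncateModel(From v) {
  static_assert(sizeof(To) < sizeof(From), "To must be narrower than From");
  return static_cast<To>(v);
}
```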
`DemoteTo` and float-to-int `ConvertTo` return the closest representable value if the input exceeds the destination range.
-
`V`,`D`: (`i16,i8`), (`i32,i8`), (`i32,i16`), (`i16,u8`), (`i32,u8`), (`i32,u16`), (`f64,f32`)
`Vec<D> DemoteTo(D, V a)`: returns `a[i]` after packing with signed/unsigned saturation to `MakeNarrow<T>`.
-
`V`,`D`: (`f64,i32`)
`Vec<D> DemoteTo(D, V a)`: rounds floating point towards zero and converts the value to 32-bit integers.
-
`V`,`D`: (`f32,f16`), (`f32,bf16`)
`Vec<D> DemoteTo(D, V a)`: narrows float to half (for bf16, it is unspecified whether this truncates or rounds).
-
`D`: `{bf,i}16`, `V`: `RepartitionToWide<D>`
`Vec<D> ReorderDemote2To(D, V a, V b)`: as above, but converts two inputs, `D` and the output have twice as many lanes as `V`, and the output order is some permutation of the inputs. Only available if `HWY_TARGET != HWY_SCALAR`.
-
`V`,`D`: (`i32`,`f32`), (`i64`,`f64`)
`Vec<D> ConvertTo(D, V)`: converts an integer value to same-sized floating point.
-
`V`,`D`: (`f32`,`i32`), (`f64`,`i64`)
`Vec<D> ConvertTo(D, V)`: rounds floating point towards zero and converts the value to same-sized integer.
-
`V`: `f32`; `Ret`: `i32`
`Ret NearestInt(V a)`: returns the integer nearest to `a[i]`; results are undefined for NaN.
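The saturating behavior of integer `DemoteTo` (returning the closest representable value) can be modeled per lane as clamp-then-narrow. This is a scalar sketch for `i32` inputs only; `DemoteModel` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of integer DemoteTo for one i32 lane: clamp to the range of
// the destination type To (8- or 16-bit, signed or unsigned), then narrow.
template <typename To>
To DemoteModel(int32_t v) {
  const int32_t lo = std::numeric_limits<To>::min();
  const int32_t hi = std::numeric_limits<To>::max();
  return static_cast<To>(std::min(std::max(v, lo), hi));
}
```

Contrast this with `TruncateTo`, which discards the upper bits instead of clamping.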
`V2 LowerHalf([D, ] V)`: returns the lower half of the vector `V`. The optional `D` (provided for consistency with `UpperHalf`) is `Half<DFromV<V>>`.
All other ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V2 UpperHalf(D, V)`: returns upper half of the vector `V`, where `D` is `Half<DFromV<V>>`.
-
`V ZeroExtendVector(D, V2)`: returns vector whose `UpperHalf` is zero and whose `LowerHalf` is the argument; `D` is `Twice<DFromV<V2>>`.
-
`V Combine(D, V2, V2)`: returns vector whose `UpperHalf` is the first argument and whose `LowerHalf` is the second argument; `D` is `Twice<DFromV<V2>>`.
Note: the following operations cross block boundaries, which is typically more expensive on AVX2/AVX-512 than per-block operations.
-
`V ConcatLowerLower(D, V hi, V lo)`: returns the concatenation of the lower halves of `hi` and `lo` without splitting into blocks. `D` is `DFromV<V>`.
-
`V ConcatUpperUpper(D, V hi, V lo)`: returns the concatenation of the upper halves of `hi` and `lo` without splitting into blocks. `D` is `DFromV<V>`.
-
`V ConcatLowerUpper(D, V hi, V lo)`: returns the inner half of the concatenation of `hi` and `lo` without splitting into blocks. Useful for swapping the two blocks in 256-bit vectors. `D` is `DFromV<V>`.
-
`V ConcatUpperLower(D, V hi, V lo)`: returns the outer quarters of the concatenation of `hi` and `lo` without splitting into blocks. Unlike the other variants, this does not incur a block-crossing penalty on AVX2/3. `D` is `DFromV<V>`.
-
`V ConcatOdd(D, V hi, V lo)`: returns the concatenation of the odd lanes of `hi` and the odd lanes of `lo`.
-
`V ConcatEven(D, V hi, V lo)`: returns the concatenation of the even lanes of `hi` and the even lanes of `lo`.
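A scalar model may help visualize two of the `Concat*` variants. This sketch assumes, as the `hi`/`lo` naming suggests, that lanes derived from `lo` occupy the lower half of the result; the `*Model` functions and `LaneVec` are hypothetical and not part of Highway.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using LaneVec = std::vector<int>;

// "Inner half" of hi:lo: lower result half = upper half of lo,
// upper result half = lower half of hi.
LaneVec ConcatLowerUpperModel(const LaneVec& hi, const LaneVec& lo) {
  const size_t half = lo.size() / 2;
  LaneVec out(lo.size());
  for (size_t i = 0; i < half; ++i) {
    out[i] = lo[half + i];
    out[half + i] = hi[i];
  }
  return out;
}

// Even lanes of lo fill the lower half, even lanes of hi the upper half.
LaneVec ConcatEvenModel(const LaneVec& hi, const LaneVec& lo) {
  const size_t half = lo.size() / 2;
  LaneVec out(lo.size());
  for (size_t i = 0; i < half; ++i) {
    out[i] = lo[2 * i];
    out[half + i] = hi[2 * i];
  }
  return out;
}
```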
Note: if vectors are larger than 128 bits, the following operations split their operands into independently processed 128-bit blocks.
`V`: `{u,i}{16,32,64},{f}`
`V Broadcast<int i>(V)`: returns individual blocks, each with lanes set to `input_block[i]`, `i = [0, 16/sizeof(T))`.
All other ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V`: `{u,i}`
`VI TableLookupBytes(V bytes, VI indices)`: returns `bytes[indices[i]]`. Uses byte lanes regardless of the actual vector types. Results are implementation-defined if `indices[i] < 0` or `indices[i] >= HWY_MIN(Lanes(DFromV<V>()), 16)`. `VI` are integers, possibly of a different type than those in `V`. The number of lanes in `V` and `VI` may differ, e.g. a full-length table vector loaded via `LoadDup128`, plus partial vector `VI` of 4-bit indices.
-
`V`: `{u,i}`
`VI TableLookupBytesOr0(V bytes, VI indices)`: returns `bytes[indices[i]]`, or 0 if `indices[i] & 0x80`. Uses byte lanes regardless of the actual vector types. Results are implementation-defined for `indices[i] < 0` or in `[HWY_MIN(Lanes(DFromV<V>()), 16), 0x80)`. The zeroing behavior has zero cost on x86 and ARM. For vectors of >= 256 bytes (can happen on SVE and RVV), this will set all lanes after the first 128 to 0. `VI` are integers, possibly of a different type than those in `V`. The number of lanes in `V` and `VI` may differ.
Ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V InterleaveLower([D, ] V a, V b)`: returns blocks with alternating lanes from the lower halves of `a` and `b` (`a[0]` in the least-significant lane). The optional `D` (provided for consistency with `InterleaveUpper`) is `DFromV<V>`.
-
`V InterleaveUpper(D, V a, V b)`: returns blocks with alternating lanes from the upper halves of `a` and `b` (`a[N/2]` in the least-significant lane). `D` is `DFromV<V>`.
-
`Ret`: `MakeWide<T>`; `V`: `{u,i}{8,16,32}`
`Ret ZipLower([D, ] V a, V b)`: returns the same bits as `InterleaveLower`, but repartitioned into double-width lanes (required in order to use this operation with scalars). The optional `D` (provided for consistency with `ZipUpper`) is `RepartitionToWide<DFromV<V>>`.
-
`Ret`: `MakeWide<T>`; `V`: `{u,i}{8,16,32}`
`Ret ZipUpper(D, V a, V b)`: returns the same bits as `InterleaveUpper`, but repartitioned into double-width lanes (required in order to use this operation with scalars). `D` is `RepartitionToWide<DFromV<V>>`. Only available if `HWY_TARGET != HWY_SCALAR`.
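The relationship between `InterleaveLower` and `ZipLower` can be sketched for a single 128-bit block of `u8` lanes. This scalar model assumes little-endian lane order, so that `a[i]` ends up in the low byte of each 16-bit result lane; the type aliases and `*Model` names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Block8 = std::array<uint8_t, 16>;   // one block of u8 lanes
using Block16 = std::array<uint16_t, 8>;  // same bits viewed as u16 lanes

// Alternating lanes from the lower halves: a0 b0 a1 b1 ... a7 b7.
Block8 InterleaveLowerModel(const Block8& a, const Block8& b) {
  Block8 out{};
  for (size_t i = 0; i < 8; ++i) {
    out[2 * i + 0] = a[i];
    out[2 * i + 1] = b[i];
  }
  return out;
}

// Same bits as InterleaveLowerModel, repartitioned into u16 lanes:
// a[i] is the low byte and b[i] the high byte of lane i.
Block16 ZipLowerModel(const Block8& a, const Block8& b) {
  Block16 out{};
  for (size_t i = 0; i < 8; ++i) {
    out[i] = static_cast<uint16_t>(a[i] | (b[i] << 8));
  }
  return out;
}
```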
Ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V`: `{u,i}`
`V ShiftLeftBytes<int>([D, ] V)`: returns the result of shifting independent blocks left by `int` bytes [1, 15]. The optional `D` (provided for consistency with `ShiftRightBytes`) is `DFromV<V>`.
-
`V ShiftLeftLanes<int>([D, ] V)`: returns the result of shifting independent blocks left by `int` lanes. The optional `D` (provided for consistency with `ShiftRightLanes`) is `DFromV<V>`.
-
`V`: `{u,i}`
`V ShiftRightBytes<int>(D, V)`: returns the result of shifting independent blocks right by `int` bytes [1, 15], shifting in zeros even for partial vectors. `D` is `DFromV<V>`.
-
`V ShiftRightLanes<int>(D, V)`: returns the result of shifting independent blocks right by `int` lanes, shifting in zeros even for partial vectors. `D` is `DFromV<V>`.
-
`V`: `{u,i}`
`V CombineShiftRightBytes<int>(D, V hi, V lo)`: returns a vector of blocks each the result of shifting two concatenated blocks `hi[i] || lo[i]` right by `int` bytes [1, 16). `D` is `DFromV<V>`.
-
`V CombineShiftRightLanes<int>(D, V hi, V lo)`: returns a vector of blocks each the result of shifting two concatenated blocks `hi[i] || lo[i]` right by `int` lanes [1, 16/sizeof(T)). `D` is `DFromV<V>`.
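For a single block, `CombineShiftRightBytes` can be modeled as indexing into the 32-byte concatenation `hi || lo`. This scalar sketch assumes `lo` supplies the lower 16 bytes of the concatenation; `Block` and `CombineShiftRightBytesModel` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Block = std::array<uint8_t, 16>;  // one 128-bit block of bytes

// Shifts the 32-byte concatenation hi:lo right by kBytes and returns the
// lower 16 bytes: bytes come from lo first, then from hi.
template <size_t kBytes>
Block CombineShiftRightBytesModel(const Block& hi, const Block& lo) {
  static_assert(kBytes >= 1 && kBytes < 16, "shift must be in [1, 16)");
  Block out{};
  for (size_t i = 0; i < 16; ++i) {
    const size_t src = i + kBytes;
    out[i] = (src < 16) ? lo[src] : hi[src - 16];
  }
  return out;
}
```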
Ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V`: `{u,i,f}{32}`
`V Shuffle1032(V)`: returns blocks with 64-bit halves swapped.
-
`V`: `{u,i,f}{32}`
`V Shuffle0321(V)`: returns blocks rotated right (toward the lower end) by 32 bits.
-
`V`: `{u,i,f}{32}`
`V Shuffle2103(V)`: returns blocks rotated left (toward the upper end) by 32 bits.
The following are equivalent to `Reverse2` or `Reverse4`, which should be used instead because they are more general:
-
`V`: `{u,i,f}{32}`
`V Shuffle2301(V)`: returns blocks with 32-bit halves swapped inside 64-bit halves.
-
`V`: `{u,i,f}{64}`
`V Shuffle01(V)`: returns blocks with 64-bit halves swapped.
-
`V`: `{u,i,f}{32}`
`V Shuffle0123(V)`: returns blocks with lanes in reverse order.
-
`V OddEven(V a, V b)`: returns a vector whose odd lanes are taken from `a` and the even lanes from `b`.
-
`V OddEvenBlocks(V a, V b)`: returns a vector whose odd blocks are taken from `a` and the even blocks from `b`. Returns `b` if the vector has no more than one block (i.e. is 128 bits or scalar).
-
`V`: `{u,i,f}{32,64}`
`V DupEven(V v)`: returns `r`, the result of copying even lanes to the next higher-indexed lane. For each even lane index `i`, `r[i] == v[i]` and `r[i + 1] == v[i]`.
-
`V ReverseBlocks(V v)`: returns a vector with blocks in reversed order.
-
`V`: `{u,i,f}{32,64}`
`V TableLookupLanes(V a, unspecified)`: returns a vector of `a[indices[i]]`, where `unspecified` is the return value of `SetTableIndices(D, &indices[0])` or `IndicesFromVec`. The indices are not limited to blocks, hence this is slower than `TableLookupBytes*` on AVX2/AVX-512. Results are implementation-defined unless `0 <= indices[i] < Lanes(D())`. `indices` are always integers, even if `V` is a floating-point type.
-
`D`: `{u,i}{32,64}`
`unspecified IndicesFromVec(D d, V idx)`: prepares for `TableLookupLanes` with integer indices in `idx`, which must be the same bit width as `TFromD<D>` and in the range `[0, Lanes(d))`, but need not be unique.
-
`D`: `{u,i}{32,64}`
`unspecified SetTableIndices(D d, TI* idx)`: prepares for `TableLookupLanes` by loading `Lanes(d)` integer indices from `idx`, which must be in the range `[0, Lanes(d))` but need not be unique. The index type `TI` must be an integer of the same size as `TFromD<D>`.
-
`V`: `{u,i,f}{16,32,64}`
`V Reverse(D, V a)`: returns a vector with lanes in reversed order (`out[i] == a[Lanes(D()) - 1 - i]`).
The following `ReverseN` must not be called if `Lanes(D()) < N`:
-
`V`: `{u,i,f}{16,32,64}`
`V Reverse2(D, V a)`: returns a vector with each group of 2 contiguous lanes in reversed order (`out[i] == a[i ^ 1]`).
-
`V`: `{u,i,f}{16,32,64}`
`V Reverse4(D, V a)`: returns a vector with each group of 4 contiguous lanes in reversed order (`out[i] == a[i ^ 3]`).
-
`V`: `{u,i,f}{16,32,64}`
`V Reverse8(D, V a)`: returns a vector with each group of 8 contiguous lanes in reversed order (`out[i] == a[i ^ 7]`).
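The index formulas `i ^ 1`, `i ^ 3` and `i ^ 7` all follow the pattern `i ^ (N - 1)` for a power-of-two group size `N`. The scalar sketch below demonstrates this; `ReverseGroupsModel` is a hypothetical name and `std::vector` stands in for a Highway vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of Reverse2/4/8: XOR with (kN - 1) flips the low bits of the
// lane index, which reverses lanes within each aligned group of kN lanes.
template <size_t kN>
std::vector<int> ReverseGroupsModel(const std::vector<int>& a) {
  static_assert(kN >= 2 && (kN & (kN - 1)) == 0, "kN must be a power of two");
  assert(a.size() % kN == 0);  // mirrors "must not be called if Lanes < N"
  std::vector<int> out(a.size());
  for (size_t i = 0; i < a.size(); ++i) out[i] = a[i ^ (kN - 1)];
  return out;
}
```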
All other ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V`: `{u,i,f}{32,64}`
`V DupOdd(V v)`: returns `r`, the result of copying odd lanes to the previous lower-indexed lane. For each odd lane index `i`, `r[i] == v[i]` and `r[i - 1] == v[i]`.
-
`V SwapAdjacentBlocks(V v)`: returns a vector where blocks of index `2*i` and `2*i+1` are swapped. Results are undefined for vectors with fewer than two blocks; callers must first check that via `Lanes`.
Note: these 'reduce' all lanes to a single result (e.g. sum), which is broadcast to all lanes. To obtain a scalar, you can call `GetLane`.
Because they are horizontal operations (across lanes of the same vector), these are slower than normal SIMD operations and are typically used outside critical loops.
-
`V`: `{u,i,f}{32,64},{u,i}{16}`
`V SumOfLanes(D, V v)`: returns the sum of all lanes in each lane.
-
`V`: `{u,i,f}{32,64},{u,i}{16}`
`V MinOfLanes(D, V v)`: returns the minimum-valued lane in each lane.
-
`V`: `{u,i,f}{32,64},{u,i}{16}`
`V MaxOfLanes(D, V v)`: returns the maximum-valued lane in each lane.
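The "reduce, then broadcast" result shape can be shown with a scalar model. In this sketch, `SumOfLanesModel` and `GetLaneModel` are hypothetical stand-ins for the ops named above, operating on `std::vector<int>` rather than Highway vectors.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scalar model of SumOfLanes: every output lane holds the sum of all lanes.
std::vector<int> SumOfLanesModel(const std::vector<int>& v) {
  const int sum = std::accumulate(v.begin(), v.end(), 0);
  return std::vector<int>(v.size(), sum);
}

// Scalar model of GetLane: extracts the first lane as a scalar.
int GetLaneModel(const std::vector<int>& v) { return v[0]; }
```

Since every lane holds the same value, extracting the first lane is enough to obtain the scalar result.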
Ops in this section are only available if `HWY_TARGET != HWY_SCALAR`:
-
`V`: `u8`
`V AESRound(V state, V round_key)`: one round of AES encryption: `MixColumns(SubBytes(ShiftRows(state))) ^ round_key`. This matches x86 AES-NI. The latency is independent of the input values.
-
`V`: `u8`
`V AESLastRound(V state, V round_key)`: the last round of AES encryption: `SubBytes(ShiftRows(state)) ^ round_key`. This matches x86 AES-NI. The latency is independent of the input values.
-
`V`: `u64`
`V CLMulLower(V a, V b)`: carryless multiplication of the lower 64 bits of each 128-bit block into a 128-bit product. The latency is independent of the input values (assuming that is true of normal integer multiplication) so this can safely be used in crypto. Applications that wish to multiply upper with lower halves can `Shuffle01` one of the operands; on x86 that is expected to be latency-neutral.
-
`V`: `u64`
`V CLMulUpper(V a, V b)`: as `CLMulLower`, but multiplies the upper 64 bits of each 128-bit block.
-
`HWY_ALIGN`: Prefix for stack-allocated (i.e. automatic storage duration) arrays to ensure they have suitable alignment for Load()/Store(). This is specific to `HWY_TARGET` and should only be used inside `HWY_NAMESPACE`.
Arrays should also only be used for partial (<= 128-bit) vectors, or `LoadDup128`, because full vectors may be too large for the stack and should be heap-allocated instead (see aligned_allocator.h).
Example: `HWY_ALIGN float lanes[4];`
-
`HWY_ALIGN_MAX`: as `HWY_ALIGN`, but independent of `HWY_TARGET` and may be used outside `HWY_NAMESPACE`.
`HWY_IDE` is 0 except when parsed by IDEs; adding it to conditions such as `#if HWY_TARGET != HWY_SCALAR || HWY_IDE` avoids code appearing greyed out.
The following indicate support for certain lane types and expand to 1 or 0:
-
`HWY_HAVE_INTEGER64`: support for 64-bit signed/unsigned integer lanes.
-
`HWY_HAVE_FLOAT16`: support for 16-bit floating-point lanes.
-
`HWY_HAVE_FLOAT64`: support for double-precision floating-point lanes.
The above were previously known as `HWY_CAP_INTEGER64`, `HWY_CAP_FLOAT16`, and `HWY_CAP_FLOAT64`, respectively. Those `HWY_CAP_*` names are DEPRECATED.
-
`HWY_HAVE_SCALABLE` indicates vector sizes are unknown at compile time, and determined by the CPU.
-
`HWY_MEM_OPS_MIGHT_FAULT` is 1 iff `MaskedLoad` may trigger a (page) fault when attempting to load lanes from unmapped memory, even if the corresponding mask element is false. This is the case on ASAN/MSAN builds, AMD x86 prior to AVX-512, and ARM NEON. If so, users can prevent faults by ensuring memory addresses are aligned to the vector size or at least padded (allocation size increased by at least `Lanes(d)`).
-
`HWY_NATIVE_FMA` expands to 1 if the `MulAdd` etc. ops use native fused multiply-add. Otherwise, `MulAdd(f, m, a)` is implemented as `Add(Mul(f, m), a)`. Checking this can be useful for increasing the tolerance of expected results (around 1E-5 or 1E-6).
The following were used to signal the maximum number of lanes for certain operations, but this is no longer necessary (nor possible on SVE/RVV), so they are DEPRECATED:
-
`HWY_CAP_GE256`: the current target supports vectors of >= 256 bits.
-
`HWY_CAP_GE512`: the current target supports vectors of >= 512 bits.
`SupportedTargets()` returns a non-cached (re-initialized on each call) bitfield of the targets supported on the current CPU, detected using CPUID on x86 or equivalent. This may include targets that are not in `HWY_TARGETS`, and vice versa. If there is no overlap, the binary will likely crash. This can only happen if:
- the specified baseline is not supported by the current CPU, which contradicts the definition of baseline, so the configuration is invalid; or
- the baseline does not include the enabled/attainable target(s), which are also not supported by the current CPU, and baseline targets (in particular `HWY_SCALAR`) were explicitly disabled.
The following macros govern which targets to generate. Unless specified otherwise, they may be defined per translation unit, e.g. to disable >128 bit vectors in modules that do not benefit from them (if bandwidth-limited or only called occasionally). This is safe because `HWY_TARGETS` always includes at least one baseline target which `HWY_EXPORT` can use.
-
`HWY_DISABLE_CACHE_CONTROL` makes the cache-control functions no-ops.
-
`HWY_DISABLE_BMI2_FMA` prevents emitting BMI/BMI2/FMA instructions. This allows using AVX2 in VMs that do not support the other instructions, but only if defined for all translation units.
The following `*_TARGETS` are zero or more `HWY_Target` bits and can be defined as an expression, e.g. `-DHWY_DISABLED_TARGETS=(HWY_SSE4|HWY_AVX3)`.
-
`HWY_BROKEN_TARGETS` defaults to a blocklist of known compiler bugs. Defining to 0 disables the blocklist.
-
`HWY_DISABLED_TARGETS` defaults to zero. This allows explicitly disabling targets without interfering with the blocklist.
-
`HWY_BASELINE_TARGETS` defaults to the set whose predefined macros are defined (i.e. those for which the corresponding flag, e.g. -mavx2, was passed to the compiler). If specified, this should be the same for all translation units, otherwise the safety check in SupportedTargets (that all enabled baseline targets are supported) may be inaccurate.
Zero or one of the following macros may be defined to replace the default policy for selecting `HWY_TARGETS`:
-
`HWY_COMPILE_ONLY_EMU128` selects only `HWY_EMU128`, which avoids intrinsics but implements all ops using standard C++.
-
`HWY_COMPILE_ONLY_SCALAR` selects only `HWY_SCALAR`, which implements single-lane-only ops using standard C++.
-
`HWY_COMPILE_ONLY_STATIC` selects only `HWY_STATIC_TARGET`, which effectively disables dynamic dispatch.
-
`HWY_COMPILE_ALL_ATTAINABLE` selects all attainable targets (i.e. enabled and permitted by the compiler, independently of autovectorization), which maximizes coverage in tests.
At most one `HWY_COMPILE_ONLY_*` may be defined. `HWY_COMPILE_ALL_ATTAINABLE` may also be defined even if one of `HWY_COMPILE_ONLY_*` is, but will then be ignored.
If none are defined, but `HWY_IS_TEST` is defined, the default is `HWY_COMPILE_ALL_ATTAINABLE`. Otherwise, the default is to select all attainable targets except any non-best baseline (typically `HWY_SCALAR`), which reduces code size.
Clang and GCC require e.g. -mavx2 flags in order to use SIMD intrinsics. However, this enables AVX2 instructions in the entire translation unit, which may violate the one-definition rule and cause crashes. Instead, we use target-specific attributes introduced via #pragma. Functions using SIMD must reside between `HWY_BEFORE_NAMESPACE` and `HWY_AFTER_NAMESPACE`. Alternatively, individual functions or lambdas may be prefixed with `HWY_ATTR`.
Immediates (compile-time constants) are specified as template arguments to avoid constant-propagation issues with Clang on ARM.
-
`IsFloat<T>()` returns true if `T` is a floating-point type.
-
`IsSigned<T>()` returns true if `T` is a signed or floating-point type.
-
`LimitsMin/Max<T>()` return the smallest/largest value representable in integer `T`.
-
`SizeTag<N>` is an empty struct, used to select overloaded functions appropriate for `N` bytes.
-
`MakeUnsigned<T>` is an alias for an unsigned type of the same size as `T`.
-
`MakeSigned<T>` is an alias for a signed type of the same size as `T`.
-
`MakeFloat<T>` is an alias for a floating-point type of the same size as `T`.
-
`MakeWide<T>` is an alias for a type with twice the size of `T` and the same category (unsigned/signed/float).
-
`MakeNarrow<T>` is an alias for a type with half the size of `T` and the same category (unsigned/signed/float).
`AllocateAligned<T>(items)` returns a unique pointer to newly allocated memory for `items` elements of POD type `T`. The start address is aligned as required by `Load/Store`. Furthermore, successive allocations are not congruent modulo a platform-specific alignment. This helps prevent false dependencies or cache conflicts. The memory allocation is analogous to using `malloc()` and `free()` with a `std::unique_ptr` since the returned items are not initialized or default constructed and it is released using `FreeAlignedBytes()` without calling `~T()`.
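The `malloc()`/`free()` with `std::unique_ptr` analogy can be sketched in standard C++17. This is not Highway's implementation: `AllocateAlignedSketch`, `FreeDeleter` and the 64-byte `kAlignment` are assumptions for illustration, the sketch does not reproduce the non-congruent offsets between successive allocations, and `std::aligned_alloc` is not available on every platform (e.g. MSVC).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Releases the memory without running ~T(), like the behavior described above.
struct FreeDeleter {
  void operator()(void* p) const { std::free(p); }
};

template <typename T>
using AlignedPtr = std::unique_ptr<T[], FreeDeleter>;

// Assumed alignment; the real value is platform-specific.
constexpr size_t kAlignment = 64;

// Returns uninitialized, kAlignment-aligned storage for `items` elements.
template <typename T>
AlignedPtr<T> AllocateAlignedSketch(size_t items) {
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  const size_t bytes =
      (items * sizeof(T) + kAlignment - 1) / kAlignment * kAlignment;
  return AlignedPtr<T>(static_cast<T*>(std::aligned_alloc(kAlignment, bytes)));
}
```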
`MakeUniqueAligned<T>(Args&&... args)` creates a single object in newly allocated aligned memory as above but constructed passing the `args` argument to `T`'s constructor and returning a unique pointer to it. This is analogous to using `std::make_unique` with `new` but for aligned memory since the object is constructed and later destructed when the unique pointer is deleted. Typically this type `T` is a struct containing multiple members with `HWY_ALIGN` or `HWY_ALIGN_MAX`, or arrays whose lengths are known to be a multiple of the vector size.
`MakeUniqueAlignedArray<T>(size_t items, Args&&... args)` creates an array of objects in newly allocated aligned memory as above and constructs every element of the new array using the passed constructor parameters, returning a unique pointer to the array. Note that only the first element is guaranteed to be aligned to the vector size; because there is no padding between elements, the alignment of the remaining elements depends on the size of `T`.
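The last point can be checked with simple arithmetic: in a tightly packed array whose first element is vector-aligned, element `index` starts at byte offset `index * sizeof(T)`. The sketch below uses hypothetical types `S48` and `S32` and an assumed 32-byte vector size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element types: 48 and 32 bytes, respectively.
struct S48 { float f[12]; };
struct S32 { float f[8]; };

// True if element `index` of a packed T array is aligned to `vector_bytes`,
// assuming element 0 is aligned to `vector_bytes`.
template <typename T>
constexpr bool ElementIsVectorAligned(size_t index, size_t vector_bytes) {
  return (index * sizeof(T)) % vector_bytes == 0;
}
```

With 32-byte vectors, every `S32` element is aligned, but only every second `S48` element is.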