Your CPU predicts the future. Every time your code hits an if, the processor guesses which way it goes and starts executing that path before the condition is evaluated.

When the guess is right, you pay nothing. When it is wrong, the pipeline flushes and restarts. Fifteen to twenty wasted cycles.

For most branches, the predictor gets it right over 95% of the time. But some branches resist prediction. Those are the ones worth eliminating — and eliminating them opens a door to something bigger.

How the branch predictor learns

The branch predictor is a pattern matcher. A loop that runs a thousand iterations produces a thousand consecutive “taken” decisions. The predictor sees that streak and bets “taken” on the next iteration. It is wrong once — at the loop exit — and right every other time.

A null check on a pointer that is almost never null follows the same logic. The predictor bets on the common case and wins overwhelmingly.

This works because most branches in real code are biased. They go one way far more often than the other. The predictor exploits that skew, and the branch costs you nearly nothing.

The trouble starts when the skew disappears.

A branch the predictor cannot learn

An order book processes messages. Each message is a buy or a sell. The stream looks like: buy, sell, buy, buy, sell, sell, buy, sell — determined by what the market does, and the market does not repeat.

void process(const Order& order) {
    if (order.side == Side::Buy)
        buy_book.insert(order);
    else
        sell_book.insert(order);
}

The predictor faces a coin flip on every call. It will be wrong roughly half the time. At 15–20 cycles per misprediction, in a loop processing millions of messages per second, those pipeline flushes accumulate into measurable latency.

The fix is not a smarter predictor. The fix is to remove the branch entirely.

Table dispatch: replacing prediction with a lookup

Map each side to an index. Use the index to select the target from an array.

enum class Side : int { Buy = 0, Sell = 1 };

std::array<Book*, 2> books = {&buy_book, &sell_book};

void process(const Order& order) {
    books[static_cast<int>(order.side)]->insert(order);
}

No if. No prediction. The CPU loads a pointer from a small, hot array — one indexed memory access — and calls insert on whichever book it gets. The prediction cost disappears. What you pay instead is one cache-friendly load from a two-element array.

Arithmetic dispatch: letting the hardware choose

Table lookup replaces a branch with a memory access. Sometimes you can replace a branch with pure arithmetic instead.

A common case is clamping:

// Branching clamp — two branches,
// both unpredictable on random input
int clamp(int v, int lo, int hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// Branchless — compiles to cmov instructions
int clamp(int v, int lo, int hi) {
    return std::max(lo, std::min(v, hi));
}

cmov — conditional move — is the hardware’s branchless primitive. The CPU evaluates the condition, then moves (or does not move) a value into a register without redirecting the pipeline. No prediction, no flush.

The compiler already knows this trick for simple integer conditions. It will often emit cmov without you asking. For more complex dispatch — multiple outcomes, function pointers, state machines — you have to structure the code so the compiler can see the opportunity.

When to leave the branch alone

Not every branch is worth eliminating. The predictor is good, and branchless code trades away readability.

A branch the predictor gets right 99% of the time costs you roughly 0.15–0.20 cycles per execution on average. A table lookup that hits L1 every time costs 3–4 cycles. You made the code harder to read and slower.

Branchless dispatch pays off when two conditions hold: the branch is unpredictable (close to 50/50, or a pattern the predictor cannot learn), and the code path runs hot enough for the misprediction cost to accumulate.

Order book dispatch fits both criteria. A null check before dereferencing a pointer fits neither.

perf stat gives you branch-misses as a percentage. If a hot function shows 2% misses, the predictor has it handled. If it shows 30%, you are leaving cycles on the table. Profile before you rewrite.
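A minimal invocation looks like this (the binary name is a placeholder; `perf` is Linux-only and may need elevated perf_event permissions):

```shell
# Overall branch behavior: counts plus the miss percentage.
perf stat -e branches,branch-misses ./your_app

# To find which code is missing, sample misses and inspect per function.
perf record -e branch-misses ./your_app
perf report
```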

The gate to SIMD

Once you remove the branches, something else becomes possible. Your data flows straight through without control divergence. And the CPU has hardware built for exactly that situation.

A scalar register holds one value — one double, one integer. A SIMD register holds several. On x86, SSE registers are 128 bits wide: four floats side by side. AVX extends that to 256 bits — eight floats. AVX-512 doubles again: sixteen floats in a single register. ARM has NEON (128-bit) and SVE, where the width is determined by the hardware at runtime — 128 to 2048 bits.

When you add two SIMD registers, every lane computes its sum simultaneously. One instruction, one cycle, four (or eight, or sixteen) results.

The execution units behind the width

The register is storage. The execution happens in dedicated vector units — FMA (fused multiply-add) units and ALUs built into each core.

The number and width of these units varies by chip. Intel’s Skylake Xeon cores have two 512-bit FMA units, so they can retire two 512-bit FMA operations per cycle. Ice Lake client cores have one 512-bit FMA unit. AMD Zen 4 uses two 256-bit units that double-pump to handle 512-bit operations across two cycles. Zen 5 shipped full native 512-bit units.

These units are physically large and draw significant current when active. On Intel’s early AVX-512 processors (Skylake-era), activating the 512-bit units triggered a license-based frequency downshift — the CPU dropped its clock by up to 40% to stay within its thermal design power.

Newer processors have reduced or eliminated this penalty. Ice Lake and Rocket Lake show minimal license-based downclocking. AMD Zen 4 and Zen 5 have no artificial frequency offsets — they throttle only when thermal limits are reached. But the thermal cost of wide vector execution remains real, especially under sustained all-core load.

GCC and Clang still default to preferring 256-bit vectors on Intel targets, a direct consequence of the Skylake-era frequency behavior. You can override this with -mprefer-vector-width=512, but the default exists for a reason.

Why branches kill vectorization

A scalar branch says: go left or go right. The CPU picks one path.

SIMD cannot do that. Eight values sit in the same register. Three of them might need the left path, five the right. There is no way to send part of a register down one branch and the rest down another.

If your scalar loop contains an if the compiler cannot flatten into a select — if-conversion handles simple, side-effect-free conditions, but not paths with stores, function calls, or tangled control flow — it cannot vectorize the loop. The branch forces each element to be processed individually: you are back to scalar, one value at a time.

Remove the branch and the compiler sees a straight-line computation. Every element takes the same path. That is what makes SIMD viable. The branchless dispatch from the previous sections is not just a pipeline optimization — it is the prerequisite for the 4–16x throughput multiplier sitting in your CPU’s vector unit.

What the auto-vectorizer does for free

You may not need to write SIMD code yourself. Modern compilers auto-vectorize loops when they can prove the transformation is safe:

void add(const float* a, const float* b,
         float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

Compile with -O2 -march=native (recent GCC and Clang vectorize at -O2; GCC before version 12 required -O3) and the compiler emits AVX instructions automatically. No intrinsics, no special types. It sees contiguous memory, no aliasing, no branches, a simple arithmetic body — and vectorizes the loop.

But the auto-vectorizer is conservative. It gives up when it cannot prove the transformation is safe. Pointer aliasing, complex control flow, cross-iteration dependencies, non-trivial reductions — any of these can silently prevent vectorization. Your code compiles, runs correctly, and processes one element at a time.

Two compiler flags tell you when this happens:

-fopt-info-vec-missed          (GCC)
-Rpass-missed=loop-vectorize   (Clang)

These report every loop the auto-vectorizer considered and rejected, along with the reason. Worth checking on any loop you care about.

std::simd in C++26: stating your intent

C++26 adds std::simd to the standard library — a portable data-parallel type that replaces both intrinsics and the auto-vectorizer’s guesswork with explicit intent.

The C++26 API lives in std::datapar and differs significantly from the Parallelism TS that preceded it. The TS used std::experimental::native_simd<T> and conditional assignment via where(mask, x) = value. The C++26 version uses std::datapar::simd<T> with a static size member, range-based constructors, and dp::select(mask, a, b) for conditional logic.

No compiler ships the final std::datapar namespace in a release yet. GCC has carried std::experimental::simd since version 11, and Matthias Kretz (the author of proposal P1928) maintains a standalone C++26 implementation for GCC. Michael Wong’s CppCon 2025 talk walks through the full interface.

Here is the same add function rewritten with C++26 target syntax:

#include <simd>
#include <span>

namespace dp = std::datapar;

void add(std::span<const float> a,
         std::span<const float> b,
         std::span<float> out) {
    constexpr auto W = dp::simd<float>::size;
    size_t i = 0;
    for (; i + W <= a.size(); i += W) {
        dp::simd<float> va(a.subspan(i, W));
        dp::simd<float> vb(b.subspan(i, W));
        (va + vb).copy_to(out.begin() + i);
    }
    for (; i < a.size(); ++i)
        out[i] = a[i] + b[i];  // scalar tail
}

simd<float> holds as many floats as the target architecture’s native register supports — four on SSE, eight on AVX, sixteen on AVX-512. The width resolves at compile time. You write the algorithm once; the register width adapts to the hardware. The scalar tail handles remaining elements that do not fill a full vector.

The cost of conditional SIMD

Scalar code uses if to pick a path. SIMD code computes both paths and picks results per lane with a mask:

dp::simd<float> x = /* ... */;
auto mask = x > 0.0f;
auto result = dp::select(mask, x * 2.0f, x);

Both x * 2.0f and x are computed for every lane. The mask discards the unwanted results afterward. When one path is expensive and the other is rare, you pay the full cost of the expensive path on every element. SIMD wins when both paths are cheap, or when both are common enough that the per-element work dominates the wasted computation.

std::simd carries other costs too. The <simd> header pulls in deep template machinery. Benchmarks on GCC’s std::experimental::simd show roughly 10x slower compilation for SIMD-heavy translation units compared to scalar equivalents. The C++26 standard version may improve this, but the template depth is inherent to the library-based design.

Third-party alternatives like Google Highway offer runtime dispatch — detecting the CPU’s SIMD width at startup. std::simd sits between auto-vectorization and intrinsics: more explicit than the auto-vectorizer, more portable than intrinsics, but without runtime width dispatch built in.

The full picture

The branch predictor keeps your pipeline full by betting on the future. For biased branches it wins. For branches that look like coin flips — order book dispatch, random input classification, unpatterned state transitions — it loses at a rate you can measure with perf stat.

Table dispatch, conditional moves, and arithmetic encoding each replace a prediction with a computation. The compiler handles some of these rewrites for you. The rest require you to see the pattern: a branch whose outcome depends on data the predictor cannot learn from.

And once the branches are gone, the data flows straight through. SIMD becomes viable — four to sixteen values per instruction, processed by dedicated vector hardware that has been sitting in your CPU since SSE shipped in 1999. But SIMD changes how you handle conditional logic (masks instead of branches), how you pay for power (wider units draw more current), and how you reason about both paths at once.

Pick a hot loop. Compile it with -fopt-info-vec-missed and read what the compiler reports. If it vectorized, check the throughput. If it did not, read the reason. Then check the branch-miss rate with perf stat. If a branch mispredicts above 10%, look at what it dispatches on. Can you map that value to an index, a pointer, an arithmetic expression?

Remove the branch. Let the vectorizer through. Benchmark before and after.