Branch Prediction: Why an if Inside a Hot Loop Costs Milliseconds

> $ stat metadata
Date: 2026.04.14
Time: 3 min read
Tags: [branch-prediction, cpu, performance, pipelining, optimization, systems]

Hot loops with unpredictable branches can waste millions of cycles on pipeline flushes. The CPU does not execute one instruction at a time: it pipelines work and speculates on branches, and when speculation is wrong, the pipeline is cleared. Same Big-O, different wall-clock time. This log is about making branches predictable, or removing them entirely, on critical paths.


Pipelining: the CPU assembly line

Modern CPUs overlap stages of instruction handling:

  1. Fetch
  2. Decode
  3. Execute
  4. Write-back

While one instruction executes, the next may already be decoded and the one after that fetched. Throughput stays high when the instruction stream is steady.

The pipeline wants a straight path. A branch is a fork: the CPU must know which instructions come next before it can keep the pipe full.


Branches and the branch predictor

Evaluating a condition like data[i] >= 128 takes several clock cycles, and the pipeline cannot sit idle waiting for the result.

Branch prediction: hardware guesses which way the branch will go and speculatively executes down that path. If the guess is right, work is already in flight. If wrong, that work is discarded.

Misprediction: pipeline flush

When the guess is wrong:

  1. Speculative results are thrown away.
  2. The pipeline is flushed.
  3. Correct instructions are fetched and the pipe refills.

Cost: on many cores, a mispredict is often on the order of 10 to 20 cycles (architecture-dependent). In a loop of 1,000,000 iterations with random branch outcomes, mispredictions dominate.


Example: conditional inside a tight loop

for (int i = 0; i < 100000; i++) {
    if (data[i] >= 128) {
        sum += data[i];
    }
}
  Data pattern              | Predictor behavior               | Typical outcome
  Sorted (false then true)  | Learns pattern; ~one transition  | Few mispredicts
  Random                    | Hard to predict                  | Many mispredicts, many flushes

Key idea: the same source code with a different data order has a radically different cycle cost.


Mermaid: speculate vs flush

sequenceDiagram
    participant Pipe as Pipeline
    participant Pred as Branch predictor
    participant ALU

    Pipe->>Pred: Branch at PC
    Pred->>Pipe: Guess taken / not taken
    Pipe->>ALU: Speculative execute
    ALU-->>Pipe: Actual condition result
    alt Guess correct
        Pipe->>Pipe: Commit speculative work
    else Guess wrong
        Pipe->>Pipe: Flush pipeline, refetch correct path
    end

Fix 1: make the branch predictable (sort)

If you can reorder data:

  1. Sort so all values < 128 come first, then >= 128.
  2. The branch outcome is false for a long run, then true for a long run.

The predictor sees a stable pattern; mispredicts cluster near the single transition instead of every iteration.

In practice: sorting before the loop can yield large speedups (reports often cite ~6x or more vs random data on the same loop; always validate on your CPU and workload).
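A minimal sketch of the sort-first fix (function name is illustrative, not from the original):

```c
#include <stdlib.h>

/* Overflow-safe comparator for qsort. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort first: the branch is then false for a long run, true for a long run. */
long sum_after_sort(int *data, int n) {
    qsort(data, n, sizeof *data, cmp_int);
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] >= 128)   /* mispredicts cluster at the single transition */
            sum += data[i];
    }
    return sum;
}
```

Note the trade-off: sorting is O(n log n) work up front, so it only pays off when the loop is hot enough (or runs often enough) to amortize it.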


Fix 2: branchless and/or loop unrolling

Branchless accumulation

Avoid the if by turning the condition into a mask or a 0/1 multiplier.

Pattern A (C): multiply by 0 or 1

for (int i = 0; i < 100000; i++) {
    sum += data[i] * (data[i] >= 128);
}

(data[i] >= 128) evaluates to 0 or 1 in C.

Pattern B: mask with sign extension (no branch)

for (int i = 0; i < 100000; i++) {
    int ge = data[i] >= 128;
    int mask = -(int)(ge);   /* 0 or 0xFFFFFFFF */
    sum += data[i] & mask;
}

Exact bit tricks depend on types and compiler; measure with -O2 / -O3 and check assembly.

Loop unrolling

Process multiple elements per iteration (e.g. 4 at a time):

  • Fewer backward branches per element (the loop counter branch runs 1/4 as often).
  • More work for the scheduler to fill the pipeline.

Trade-off: larger code, possible unroll factor tuning, and you must handle the tail when N is not a multiple of the unroll factor.
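A sketch combining both ideas, unrolling by 4 with branchless bodies and an explicit tail (function name and unroll factor are illustrative):

```c
/* Unroll by 4: the loop-counter branch runs once per 4 elements,
 * and four independent accumulators give the scheduler parallel work. */
long unrolled_sum(const int *data, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i]     * (data[i]     >= 128);  /* branchless: 0/1 multiplier */
        s1 += data[i + 1] * (data[i + 1] >= 128);
        s2 += data[i + 2] * (data[i + 2] >= 128);
        s3 += data[i + 3] * (data[i + 3] >= 128);
    }
    for (; i < n; i++)   /* tail: n not a multiple of the unroll factor */
        s0 += data[i] * (data[i] >= 128);
    return s0 + s1 + s2 + s3;
}
```

With -O2/-O3 the compiler may already unroll or vectorize the simple loop, so check the generated assembly before hand-unrolling.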


When this matters

  • Hot paths: inner loops in indexing, search, analytics, codecs.
  • Random or adversarial input: worst case for predictors.
  • Microservices at scale: a few cycles per request times QPS is real money.

Key takeaways

  • Pipelining + speculation make branches cheap when predictable and expensive when wrong.
  • Random branches in tight loops cause mispredictions and pipeline flushes; wall time is not captured by Big-O alone.
  • Make data ordered (sort) when the branch pattern can become stable.
  • Remove the branch (branchless math, masks) or reduce branch frequency (unrolling) when you cannot sort.
  • Always validate on target hardware: compilers differ, and perf/llvm-mca tell the real story.
