Hot loops with unpredictable branches can waste millions of cycles on pipeline flushes. Same Big-O, different wall-clock time: the CPU does not execute one instruction at a time; it pipelines work and speculates on branches. When speculation is wrong, the pipeline is cleared. This log is about making branches predictable or removing them on critical paths.
Pipelining: the CPU assembly line
Modern CPUs overlap stages of instruction handling:
- Fetch
- Decode
- Execute
- Write-back
While one instruction executes, the next may already be decoded and the one after that fetched. Throughput stays high when the instruction stream is steady.
The pipeline wants a straight path. A branch is a fork: the CPU must know which instructions come next before it can keep the pipe full.
Branches and the branch predictor
Evaluating data[i] >= 128 takes clock cycles, and the pipeline cannot simply sit idle waiting for the result.
Branch prediction: hardware guesses which way the branch will go and speculatively executes down that path. If the guess is right, work is already in flight. If wrong, that work is discarded.
Misprediction: pipeline flush
When the guess is wrong:
- Speculative results are thrown away.
- The pipeline is flushed.
- Correct instructions are fetched and the pipe refills.
Cost: on many cores, a mispredict is often on the order of 10 to 20 cycles (architecture-dependent). In a loop of 1,000,000 iterations with random branch outcomes, mispredictions dominate.
Example: conditional inside a tight loop
```c
for (int i = 0; i < 100000; i++) {
    if (data[i] >= 128) {
        sum += data[i];
    }
}
```
| Data pattern | Predictor behavior | Typical outcome |
|---|---|---|
| Sorted (false then true) | Learns pattern; ~one transition | Few mispredicts |
| Random | Hard to predict | Many mispredicts, many flushes |
Key idea: the same source shape, different data layout, radically different cycle cost.
Mermaid: speculate vs flush
```mermaid
sequenceDiagram
    participant Pipe as Pipeline
    participant Pred as Branch predictor
    participant ALU
    Pipe->>Pred: Branch at PC
    Pred->>Pipe: Guess taken / not taken
    Pipe->>ALU: Speculative execute
    ALU-->>Pipe: Actual condition result
    alt Guess correct
        Pipe->>Pipe: Commit speculative work
    else Guess wrong
        Pipe->>Pipe: Flush pipeline, refetch correct path
    end
```
Fix 1: make the branch predictable (sort)
If you can reorder data:
- Sort so all values < 128 come first, then all values >= 128.
- The branch outcome is false for a long run, then true for a long run.
The predictor sees a stable pattern; mispredicts cluster near the single transition instead of every iteration.
In practice: sorting before the loop can yield large speedups (reports often cite ~6x or more vs random data on the same loop; always validate on your CPU and workload).
Fix 2: branchless and/or loop unrolling
Branchless accumulation
Avoid the if by turning the condition into a mask or a 0/1 multiplier.
Pattern A (C): multiply by 0 or 1
```c
for (int i = 0; i < 100000; i++) {
    sum += data[i] * (data[i] >= 128);
}
```
(data[i] >= 128) evaluates to 0 or 1 in C.
Pattern B: mask with sign extension (no branch)
```c
for (int i = 0; i < 100000; i++) {
    int ge = data[i] >= 128;
    int mask = -(int)ge;   /* 0 or 0xFFFFFFFF */
    sum += data[i] & mask;
}
```
Exact bit tricks depend on types and compiler; measure with -O2 / -O3 and check assembly.
Loop unrolling
Process multiple elements per iteration (e.g. 4 at a time):
- Fewer backward branches per element (the loop-counter branch runs 1/4 as often).
- More independent work for the scheduler to fill the pipeline.
Trade-off: larger code, possible unroll factor tuning, and you must handle the tail when N is not a multiple of the unroll factor.
When this matters
- Hot paths: inner loops in indexing, search, analytics, codecs.
- Random or adversarial input: worst case for predictors.
- Microservices at scale: a few cycles per request times QPS is real money.
Key takeaways
- Pipelining + speculation make branches cheap when predictable and expensive when wrong.
- Random branches in tight loops cause mispredictions and pipeline flushes; wall time is not captured by Big-O alone.
- Make data ordered (sort) when the branch pattern can become stable.
- Remove the branch (branchless math, masks) or reduce branch frequency (unrolling) when you cannot sort.
- Always validate on target hardware: compilers differ, and perf / llvm-mca tell the real story.