Chapter 12: Cache & Branch Prediction
Part III: Theory
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
The Story of "One Line Change" That Made It 10× Faster
I inherited an image processing codebase. Processing a 4K image took 800ms—too slow.
I spent two days profiling and found the hotspot in a nested loop. The code looked normal:
// Original version
for (int x = 0; x < width; x++) {
for (int y = 0; y < height; y++) {
output[y][x] = process(input[y][x]);
}
}
I changed one thing:
// Modified version
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
output[y][x] = process(input[y][x]);
}
}
Just swapped the loop order of x and y.
Result: 800ms → 80ms. 10× faster.
Why? Because C's 2D arrays are row-major, and [y][x] access should have x in the inner loop (sequential access). The original version jumped an entire row each access, causing massive cache misses.
This is the power (or curse) of cache.
Cache Basics
Why We Need Cache
CPU speed vs Memory speed (approximate):
- CPU: 1 cycle = 0.3 ns (3 GHz)
- L1: ~4 cycles = 1.2 ns
- L2: ~12 cycles = 4 ns
- L3: ~40 cycles = 12 ns
- DRAM: ~200 cycles = 60 ns
If every access went to DRAM, CPU would spend most time waiting.
Cache Hierarchy
┌─────────────┐
│ CPU │
│ Registers │ ← Few bytes, < 1 cycle
└──────┬──────┘
│
┌──────▼──────┐
│ L1 Cache │ ← 32-64 KB, ~4 cycles
│ (per core) │
└──────┬──────┘
│
┌──────▼──────┐
│ L2 Cache │ ← 256 KB - 1 MB, ~12 cycles
│ (per core) │
└──────┬──────┘
│
┌──────▼──────┐
│ L3 Cache │ ← 8-64 MB, ~40 cycles (shared)
│ (shared) │
└──────┬──────┘
│
┌──────▼──────┐
│ DRAM │ ← GB scale, ~200 cycles
└─────────────┘
Cache Line
Cache doesn't operate byte-by-byte, but in cache lines (typically 64 bytes):
When you access address 0x1000:
- CPU doesn't read just 1 byte
- It reads the entire cache line: 0x1000 - 0x103F (64 bytes)
- Subsequent accesses to 0x1001, 0x1002... will hit
This is the basis of spatial locality.
Types of Cache Misses
The 3C Model
| Type | Name | Cause | Solution |
|---|---|---|---|
| Compulsory | Cold start | First access | Prefetching |
| Capacity | Capacity | Cache too small | Larger cache, better locality |
| Conflict | Conflict | Multiple addresses map to same location | Higher associativity |
(Sometimes a fourth C is added: Coherence, in multi-core systems)
Measuring Cache Misses
# Using perf
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program
1,234,567,890 L1-dcache-loads
12,345,678 L1-dcache-load-misses # 1% miss rate
123,456,789 LLC-loads
12,345,678 LLC-load-misses # 10% miss rate
Common Cache Problems
Problem 1: Wrong Access Order
Row-major vs Column-major:
// C/C++ is row-major: arr[row][col]
// Fortran is column-major: arr(row, col)
// Correct order for row-major (C)
for (int row = 0; row < N; row++)
for (int col = 0; col < N; col++)
sum += arr[row][col]; // Sequential access ✓
// Wrong order in C (column-major style)
for (int col = 0; col < N; col++)
for (int row = 0; row < N; row++)
sum += arr[row][col]; // Strided access in C ✗
Problem 2: Stride Access
Cache Optimization Techniques
Technique 1: Loop Tiling (Blocking)
Split large loops into small blocks that fit in cache:
// Original: matrix multiplication
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
// Tiled version
#define BLOCK 64
for (int ii = 0; ii < N; ii += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int kk = 0; kk < N; kk += BLOCK)
for (int i = ii; i < min(ii+BLOCK, N); i++)
for (int j = jj; j < min(jj+BLOCK, N); j++)
for (int k = kk; k < min(kk+BLOCK, N); k++)
C[i][j] += A[i][k] * B[k][j];
Each BLOCK × BLOCK sub-matrix can fit in L1/L2.
Technique 2: Data Layout Optimization
Array of Structures (AoS) vs Structure of Arrays (SoA):
// AoS: if you only need x, you load useless y, z too
struct Particle {
float x, y, z;
float vx, vy, vz;
float mass;
};
struct Particle particles[N];
// SoA: only access the fields you need
struct Particles {
float x[N], y[N], z[N];
float vx[N], vy[N], vz[N];
float mass[N];
};
struct Particles particles;
// When only accessing x, SoA is more cache-friendly
for (int i = 0; i < N; i++)
sum += particles.x[i]; // Sequential access
Technique 3: Prefetching
Tell CPU to load data ahead of time:
#include <xmmintrin.h> // for _mm_prefetch
for (int i = 0; i < N; i++) {
// Prefetch data we'll need soon
_mm_prefetch(&arr[i + 64], _MM_HINT_T0);
sum += arr[i];
}
Modern CPUs have hardware prefetchers that work well for sequential access. Random access may need software prefetch.
Branch Prediction Basics
Why We Need Branch Prediction
Modern CPUs are pipelined:
Instruction 1: Fetch → Decode → Execute → Memory → Writeback
Instruction 2: Fetch → Decode → Execute → Memory → Writeback
Instruction 3: Fetch → Decode → Execute → Memory → Writeback
...
When encountering a branch (if/else):
if (condition) {
// path A
} else {
// path B
}
CPU doesn't know whether to fetch A or B's instructions. It must guess (predict), then continue.
If wrong (misprediction) → flush pipeline, restart → waste 15-20 cycles.
Typical Branch Prediction Accuracy
| Pattern | Accuracy |
|---|---|
| Always-taken loop | ~99% |
| Always not-taken | ~99% |
| Predictable pattern (TTNTTN...) | ~95%+ |
| Random (50/50) | ~50% |
Measuring Branch Misprediction
perf stat -e branches,branch-misses ./my_program
500,000,000 branches
2,500,000 branch-misses # 0.5% miss rate (good)
Over 1-2% branch miss rate is worth investigating.
Branch Optimization Techniques
Technique 1: Branchless Programming
// With branch
int max1(int a, int b) {
if (a > b) return a;
else return b;
}
// Branchless (using conditional move)
int max2(int a, int b) {
return a > b ? a : b; // Compiler may use cmov
}
// Manual branchless (guaranteed)
int max3(int a, int b) {
int diff = a - b;
int mask = diff >> 31; // All 0s or all 1s
return a - (diff & mask);
}
Technique 2: Sort to Improve Branch Prediction
// Processing positive numbers in array
// If array is random, branch is hard to predict
// Original (assume 50% positive)
for (int i = 0; i < N; i++)
if (arr[i] > 0)
process(arr[i]);
// If sorted first
std::sort(arr, arr + N);
// Now all negatives first, positives last
// Branch becomes: N/2 consecutive not-taken, N/2 consecutive taken
// Much more predictable!
Of course, sorting has its own cost. Only worth it if used multiple times.
Technique 3: Profile-Guided Optimization (PGO)
Let the compiler optimize branch layout based on actual execution profile:
# 1. Compile with instrumentation
gcc -fprofile-generate -o program program.c
# 2. Run typical workload
./program typical_input
# 3. Recompile with profile
gcc -fprofile-use -o program_optimized program.c
PGO can:
- Put common paths together (better icache)
- Adjust default branch predictions
- Inline frequently-called functions
Measurement and Diagnosis
Using perf to Find Cache/Branch Problems
# Comprehensive analysis
perf stat -e cycles,instructions,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
branches,branch-misses \
./my_program
Summary
Cache Core Concepts
- Cache line (64 bytes) is the minimum unit
- 3C Model: Compulsory, Capacity, Conflict
- Spatial locality + temporal locality
Common Cache Problems
- Wrong access order (row vs column major)
- Stride access
- False sharing
- Cache thrashing
Cache Optimization Techniques
- Loop tiling/blocking
- AoS → SoA transformation
- Prefetching
- Hot/cold splitting
Branch Prediction
- Misprediction cost: 15-20 cycles
- Predictable patterns: ~99% accurate
- Random patterns: ~50% accurate
Branch Optimization Techniques
- Branchless programming
- Sort to improve patterns
- Profile-Guided Optimization (PGO)
Diagnostic Tools
perf stat -e L1-dcache-load-misses,branch-misses ./prog
Rules of Thumb
- Cache miss rate > 5%: worth investigating
- Branch miss rate > 2%: worth investigating
- Optimization order: algorithm → data structure → cache → branch