Chapter 12: Cache & Branch Prediction

Part III: Theory

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

The Story of "One Line Change" That Made It 10× Faster

I inherited an image processing codebase. Processing a 4K image took 800ms—too slow.

I spent two days profiling and found the hotspot in a nested loop. The code looked normal:

// Original version
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        output[y][x] = process(input[y][x]);
    }
}

I changed one thing:

// Modified version
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        output[y][x] = process(input[y][x]);
    }
}

Just swapped the loop order of x and y.

Result: 800ms → 80ms. 10× faster.

Why? Because C's 2D arrays are row-major, and [y][x] access should have x in the inner loop (sequential access). The original version jumped an entire row each access, causing massive cache misses.

This is the power (or curse) of cache.

Cache Basics

Why We Need Cache

CPU speed vs Memory speed (approximate):
- CPU: 1 cycle = 0.3 ns (3 GHz)
- L1:  ~4 cycles = 1.2 ns
- L2:  ~12 cycles = 4 ns
- L3:  ~40 cycles = 12 ns
- DRAM: ~200 cycles = 60 ns

If every access went to DRAM, CPU would spend most time waiting.

Cache Hierarchy

┌─────────────┐
│    CPU      │
│  Registers  │ ← Few bytes, < 1 cycle
└──────┬──────┘
       │
┌──────▼──────┐
│   L1 Cache  │ ← 32-64 KB, ~4 cycles
│  (per core) │
└──────┬──────┘
       │
┌──────▼──────┐
│   L2 Cache  │ ← 256 KB - 1 MB, ~12 cycles
│  (per core) │
└──────┬──────┘
       │
┌──────▼──────┐
│   L3 Cache  │ ← 8-64 MB, ~40 cycles (shared)
│  (shared)   │
└──────┬──────┘
       │
┌──────▼──────┐
│    DRAM     │ ← GB scale, ~200 cycles
└─────────────┘

Cache Line

Cache doesn't operate byte-by-byte, but in cache lines (typically 64 bytes):

When you access address 0x1000:
- CPU doesn't read just 1 byte
- It reads the entire cache line: 0x1000 - 0x103F (64 bytes)
- Subsequent accesses to 0x1001, 0x1002... will hit

This is the basis of spatial locality.

Types of Cache Misses

The 3C Model

Type	Name	Cause	Solution
Compulsory	Cold start	First access	Prefetching
Capacity	Capacity	Cache too small	Larger cache, better locality
Conflict	Conflict	Multiple addresses map to same location	Higher associativity

(Sometimes a fourth C is added: Coherence, in multi-core systems)

Measuring Cache Misses

# Using perf
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program

 1,234,567,890      L1-dcache-loads
    12,345,678      L1-dcache-load-misses  # 1% miss rate
   123,456,789      LLC-loads
    12,345,678      LLC-load-misses        # 10% miss rate

Common Cache Problems

Problem 1: Wrong Access Order

Row-major vs Column-major:

// C/C++ is row-major: arr[row][col]
// Fortran is column-major: arr(row, col)

// Correct order for row-major (C)
for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++)
        sum += arr[row][col];  // Sequential access ✓

// Wrong order in C (column-major style)
for (int col = 0; col < N; col++)
    for (int row = 0; row < N; row++)
        sum += arr[row][col];  // Strided access in C ✗

Problem 2: Stride Access

Cache Optimization Techniques

Technique 1: Loop Tiling (Blocking)

Split large loops into small blocks that fit in cache:

// Original: matrix multiplication
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

// Tiled version
#define BLOCK 64

for (int ii = 0; ii < N; ii += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int kk = 0; kk < N; kk += BLOCK)
    for (int i = ii; i < min(ii+BLOCK, N); i++)
    for (int j = jj; j < min(jj+BLOCK, N); j++)
    for (int k = kk; k < min(kk+BLOCK, N); k++)
        C[i][j] += A[i][k] * B[k][j];

Each BLOCK × BLOCK sub-matrix can fit in L1/L2.

Technique 2: Data Layout Optimization

Array of Structures (AoS) vs Structure of Arrays (SoA):

// AoS: if you only need x, you load useless y, z too
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};
struct Particle particles[N];

// SoA: only access the fields you need
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
};
struct Particles particles;

// When only accessing x, SoA is more cache-friendly
for (int i = 0; i < N; i++)
    sum += particles.x[i];  // Sequential access

Technique 3: Prefetching

Tell CPU to load data ahead of time:

#include <xmmintrin.h>  // for _mm_prefetch

for (int i = 0; i < N; i++) {
    // Prefetch data we'll need soon
    _mm_prefetch(&arr[i + 64], _MM_HINT_T0);

    sum += arr[i];
}

Modern CPUs have hardware prefetchers that work well for sequential access. Random access may need software prefetch.

Branch Prediction Basics

Why We Need Branch Prediction

Modern CPUs are pipelined:

Instruction 1: Fetch → Decode → Execute → Memory → Writeback
Instruction 2:        Fetch → Decode → Execute → Memory → Writeback
Instruction 3:               Fetch → Decode → Execute → Memory → Writeback
...

When encountering a branch (if/else):

if (condition) {
    // path A
} else {
    // path B
}

CPU doesn't know whether to fetch A or B's instructions. It must guess (predict), then continue.

If wrong (misprediction) → flush pipeline, restart → waste 15-20 cycles.

Typical Branch Prediction Accuracy

Pattern	Accuracy
Always-taken loop	~99%
Always not-taken	~99%
Predictable pattern (TTNTTN...)	~95%+
Random (50/50)	~50%

Measuring Branch Misprediction

perf stat -e branches,branch-misses ./my_program

   500,000,000      branches
     2,500,000      branch-misses  # 0.5% miss rate (good)

Over 1-2% branch miss rate is worth investigating.

Branch Optimization Techniques

Technique 1: Branchless Programming

// With branch
int max1(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless (using conditional move)
int max2(int a, int b) {
    return a > b ? a : b;  // Compiler may use cmov
}

// Manual branchless (guaranteed)
int max3(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;  // All 0s or all 1s
    return a - (diff & mask);
}

Technique 2: Sort to Improve Branch Prediction

// Processing positive numbers in array
// If array is random, branch is hard to predict

// Original (assume 50% positive)
for (int i = 0; i < N; i++)
    if (arr[i] > 0)
        process(arr[i]);

// If sorted first
std::sort(arr, arr + N);
// Now all negatives first, positives last
// Branch becomes: N/2 consecutive not-taken, N/2 consecutive taken
// Much more predictable!

Of course, sorting has its own cost. Only worth it if used multiple times.

Technique 3: Profile-Guided Optimization (PGO)

Let the compiler optimize branch layout based on actual execution profile:

# 1. Compile with instrumentation
gcc -fprofile-generate -o program program.c

# 2. Run typical workload
./program typical_input

# 3. Recompile with profile
gcc -fprofile-use -o program_optimized program.c

PGO can:

Put common paths together (better icache)
Adjust default branch predictions
Inline frequently-called functions

Measurement and Diagnosis

Using perf to Find Cache/Branch Problems

# Comprehensive analysis
perf stat -e cycles,instructions,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
branches,branch-misses \
./my_program

Summary

Cache Core Concepts

Cache line (64 bytes) is the minimum unit
3C Model: Compulsory, Capacity, Conflict
Spatial locality + temporal locality

Common Cache Problems

Wrong access order (row vs column major)
Stride access
False sharing
Cache thrashing

Cache Optimization Techniques

Loop tiling/blocking
AoS → SoA transformation
Prefetching
Hot/cold splitting

Branch Prediction

Misprediction cost: 15-20 cycles
Predictable patterns: ~99% accurate
Random patterns: ~50% accurate

Branch Optimization Techniques

Branchless programming
Sort to improve patterns
Profile-Guided Optimization (PGO)

Diagnostic Tools

perf stat -e L1-dcache-load-misses,branch-misses ./prog

Rules of Thumb

Cache miss rate > 5%: worth investigating
Branch miss rate > 2%: worth investigating
Optimization order: algorithm → data structure → cache → branch

Performance and Benchmarking