Chapter 2: Memory Hierarchy

Part I: Foundations

“Memory is the new disk, disk is the new tape.” — Jim Gray

The 100-Cycle Problem

In Chapter 1, we saw that cache misses cost 100-200 cycles while cache hits cost only 1-4 cycles. This isn’t a minor detail—it’s the single most important factor in modern performance.

Let me show you why.

I was optimizing a device driver for a RISC-V embedded system. The driver needed to process packets from a network interface, and we were dropping packets under load. The CPU was running at 1 GHz, and each packet required about 500 instructions to process. Simple math:

500 instructions ÷ 1 GHz = 500 nanoseconds per packet

At 500 ns per packet, we should handle 2 million packets per second. But we were only managing 200,000 packets per second—10× slower than expected.

The profiler told the story:

$ perf stat -e cycles,instructions,cache-misses ./driver_test
  Performance counter stats:
    5,000,000 cycles
      500,000 instructions
       45,000 cache-misses

Wait. 500,000 instructions should take 500,000 cycles (at 1 IPC). But we’re seeing 5,000,000 cycles. Where did the extra 4.5 million cycles go?

Cache misses: 45,000 misses × 100 cycles = 4,500,000 cycles

The cache misses were dominating our execution time. The actual computation (500,000 cycles) was only 10% of the total time. The other 90% was waiting for memory.

This is the reality of modern computing: memory is slow, and it’s getting slower relative to CPU speed.

The Memory Hierarchy

Modern computers don’t have “memory”—they have a hierarchy of memories, each with different speeds and sizes:

Level	Type	Latency	Size
Registers	32 registers	1 cycle	~128 B
L1 Cache	Split I/D	3-4 cycles	32-64 KB
L2 Cache	Unified	12-15 cycles	256-512 KB
L3 Cache (if present)	Shared	40-50 cycles	2-32 MB
DRAM	Main memory	100-200 cycles	GB-TB

Key observations:

Speed decreases as you go down (1 → 200 cycles)
Size increases as you go down (128 B → GB)
The gap is huge: DRAM is 100-200× slower than L1

On embedded systems, the hierarchy is often simpler:

Typical MCU (e.g., RISC-V RV32IMC @ 100 MHz):

Level	Type	Latency	Size
Registers	32 registers	1 cycle	128 B
L1 I-Cache	Instruction	1 cycle	16 KB
L1 D-Cache/SRAM	Data	1-2 cycles	8-32 KB
Flash	Code storage	~10 cycles	128 KB - 1 MB
External DRAM (optional)	Data (if present)	50-100 cycles	8-64 MB

Embedded differences:

Smaller caches (8-64 KB vs 256 KB-32 MB)
Often no L2/L3 cache
Flash memory instead of DRAM for code
Tighter memory budgets

Cache Lines: The Fundamental Unit

Here’s the crucial insight: caches don’t fetch individual bytes—they fetch cache lines.

A cache line is typically 64 bytes on modern processors (both desktop and embedded). When you access a single byte, the hardware fetches the entire 64-byte block containing that byte.

Example: Accessing a single integer

int x = array[0];  // Access 4 bytes at address 0x1000

What actually happens:

CPU requests: 4 bytes at 0x1000
Cache fetches: 64 bytes from 0x1000 to 0x103F

The cache line includes:

The requested integer (4 bytes)
The next 15 integers (60 bytes)

This is why sequential access is fast:

// Fast: All in the same cache line
for (int i = 0; i < 16; i++) {
    sum += array[i];  // First access: miss, next 15: hits
}

But random access is slow:

// Slow: Each access likely in different cache line
for (int i = 0; i < 16; i++) {
    sum += array[random_index[i]];  // Each access: likely miss
}

Cache Organization

Caches are organized into sets and ways. Understanding this helps explain cache conflicts.

Direct-mapped cache (1-way):

Address bits: [     Tag      |   Index   |      Offset      ]
               └─────────────┴───────────┴──────────────────
               Identifies     Selects     Byte within
               cache line     set         cache line

Example: 32 KB cache, 64-byte lines, direct-mapped

Cache lines: 32 KB ÷ 64 B = 512 lines
Index bits: log₂(512) = 9 bits
Offset bits: log₂(64) = 6 bits
Tag bits: remaining bits (e.g., 32 - 9 - 6 = 17 bits for 32-bit address)

Problem with direct-mapped: Cache conflicts

int a[1024];  // At address 0x10000
int b[1024];  // At address 0x18000

// These two arrays map to the SAME cache sets!
// 0x10000 and 0x18000 differ only in bit 15
// Index uses bits 6-14, so they collide

Set-associative cache (N-way):

A 4-way set-associative cache has 4 “slots” per set:

Set 0: [Line 0] [Line 1] [Line 2] [Line 3]
Set 1: [Line 0] [Line 1] [Line 2] [Line 3]
...

When address maps to Set 0, it can go in any of the 4 slots. This reduces conflicts.

Typical configurations:

L1: 8-way set-associative (32-64 KB)
L2: 8-16-way set-associative (256-512 KB)
L3: 16-way set-associative (2-32 MB)

Embedded systems:

Often direct-mapped or 2-way (simpler hardware)
Smaller caches mean more conflicts

Spatial and Temporal Locality

Cache performance depends on two types of locality:

Spatial locality: Accessing nearby addresses

// Good spatial locality
for (int i = 0; i < n; i++) {
    sum += array[i];  // Sequential access
}

// Poor spatial locality
for (int i = 0; i < n; i++) {
    sum += array[random[i]];  // Random access
}

Temporal locality: Accessing the same address repeatedly

// Good temporal locality
int temp = array[0];
for (int i = 0; i < 1000; i++) {
    result += temp * i;  // Reuse 'temp'
}

// Poor temporal locality
for (int i = 0; i < 1000; i++) {
    result += array[i % 10] * i;  // Evicts before reuse
}

Cache-friendly code exploits both:

// Matrix multiplication: cache-friendly version
for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        int r = A[i][k];
        for (int j = 0; j < N; j++) {
            C[i][j] += r * B[k][j];  // Good spatial locality on B
        }
    }
}

The Prefetcher

Modern CPUs have hardware prefetchers that predict memory access patterns and fetch data before you need it.

How the prefetcher works:

stateDiagram-v2
    [*] --> Idle: Power on
    Idle --> Detecting: Memory access
    Detecting --> Sequential: 2+ consecutive accesses
    Detecting --> Strided: Constant stride detected
    Detecting --> Idle: Random pattern

    Sequential --> Prefetching: Fetch ahead
    Strided --> Prefetching: Fetch ahead

    Prefetching --> Prefetching: Pattern continues
    Prefetching --> Idle: Pattern breaks

    note right of Sequential
        Access: A[0], A[1], A[2]
        Prefetch: A[3], A[4], A[5]
    end note

    note right of Strided
        Access: A[0], A[4], A[8]
        Prefetch: A[12], A[16], A[20]
    end note

Sequential prefetcher: Detects sequential access

// Prefetcher detects pattern and fetches ahead
for (int i = 0; i < n; i++) {
    sum += array[i];  // Prefetcher fetches array[i+1], array[i+2], ...
}

Stride prefetcher: Detects constant stride

// Prefetcher detects stride of 8 bytes
for (int i = 0; i < n; i++) {
    sum += array[i * 2];  // Accessing every other element
}

Prefetcher limitations:

Doesn’t help random access:

for (int i = 0; i < n; i++) {
    sum += array[random[i]];  // Unpredictable, no prefetch
}

Limited distance: Typically 10-20 cache lines ahead
Can be fooled:

// Alternating pattern confuses prefetcher
for (int i = 0; i < n; i++) {
    if (i % 2 == 0)
        sum += array[i];
    else
        sum += other_array[i];
}

Embedded systems: Many MCUs have no prefetcher or simple sequential-only prefetchers. This makes sequential access even more critical.

Memory Bandwidth

Even with perfect cache behavior, you’re limited by memory bandwidth.

Example calculation (desktop system):

DDR4-3200: 25.6 GB/s per channel
Dual channel: 51.2 GB/s total
L3 cache: ~200 GB/s
L2 cache: ~400 GB/s
L1 cache: ~1000 GB/s

Implication: Streaming through large arrays is bandwidth-limited

// Bandwidth-limited: streaming through 1 GB array
for (int i = 0; i < 256*1024*1024; i++) {
    array[i] = 0;  // Limited by DRAM bandwidth
}

Embedded systems have much lower bandwidth:

Typical MCU SRAM: 1-4 GB/s
External DRAM (if present): 100-500 MB/s

This makes working set size critical—keep data in on-chip SRAM.

Cache Coherency (Multi-core)

On multi-core systems, caches must stay coherent—all cores see consistent data.

MESI protocol (common on x86, ARM):

Modified: This cache has the only valid copy, modified
Exclusive: This cache has the only valid copy, clean
Shared: Multiple caches have valid copies
Invalid: This cache line is invalid

False sharing: Performance killer on multi-core

// BAD: False sharing
struct {
    int counter_core0;  // Used by core 0
    int counter_core1;  // Used by core 1
} shared;  // Both in same cache line!

// Core 0 writes counter_core0 → invalidates core 1's cache line
// Core 1 writes counter_core1 → invalidates core 0's cache line
// Ping-pong effect: terrible performance

Solution: Pad to separate cache lines

// GOOD: No false sharing
struct {
    int counter_core0;
    char pad[60];       // Pad to 64 bytes
    int counter_core1;
} shared;

RISC-V: Uses RVWMO (RISC-V Weak Memory Ordering) with fence instructions for synchronization.

RISC-V Memory Model

RISC-V has a weak memory model—memory operations can be reordered unless you use fences.

Memory ordering:

// Without fence: these can be reordered
store A
store B
load C
load D

Fence instruction:

sw   a0, 0(a1)    # Store A
fence w, w        # Ensure store completes before next store
sw   a2, 0(a3)    # Store B

Fence types:

fence r, r: Load-load fence
fence w, w: Store-store fence
fence rw, rw: Full fence
fence.i: Instruction fence (for self-modifying code)

Atomic operations (A extension):

lr.w  a0, (a1)    # Load-reserved
# ... modify a0 ...
sc.w  a2, a0, (a1) # Store-conditional (fails if reservation broken)

Practical Guidelines

Based on this understanding of memory hierarchy, here are practical guidelines for data structure design:

1. Minimize cache misses

Use sequential access patterns when possible
Keep working set small (fit in L1/L2)
Avoid pointer chasing (linked lists, trees)

2. Exploit cache lines

Pack related data together (structs)
Align data structures to cache line boundaries
Avoid false sharing on multi-core

3. Consider prefetcher

Use predictable access patterns
Sequential or constant-stride access
Avoid random access when possible

4. Know your hardware

Cache sizes (L1, L2, L3)
Cache line size (usually 64 bytes)
Associativity (affects conflicts)
Prefetcher capabilities

5. Measure, don’t guess

Use perf to measure cache misses
Profile before optimizing
Test on target hardware

Summary

The 100-cycle problem was solved by understanding the memory hierarchy. The device driver’s packet loss came from 45,000 cache misses consuming 4.5 million cycles—90% of execution time spent waiting for memory. Optimizing memory access patterns reduced cache misses and brought throughput from 200,000 to the expected 2 million packets per second.

Key insights:

Cache misses cost 100-200 cycles (vs 1-4 for hits)
Caches fetch 64-byte lines, not individual bytes
Sequential access is 10-100× faster than random access
Embedded systems have smaller caches and simpler hierarchies

Design implications:

Arrays beat linked lists (spatial locality)
Small working sets beat large ones (temporal locality)
Sequential beats random (prefetcher)
Measurement beats intuition (use profiling tools)

Next Chapter: We’ll build a comprehensive benchmarking framework to measure these effects precisely and learn how to use profiling tools effectively.

Keyboard shortcuts

Data Structures in Practice