Chapter 6: Memory Benchmarks
Part II: Tools
"Memory is the new disk, and disk is the new tape." — Jim Gray
The Afternoon When O(1) Was Slower Than O(n)
"This can't be right."
I stared at the numbers on my screen, making sure I wasn't misreading them. Our hash table lookup (theoretically O(1)) was slower than linear search (O(n)). On an array of just 64 elements.
This was a few years ago. I was optimizing a lookup table in an embedded system. The original implementation was simple linear search; I "improved" it to a hash table, expecting massive performance gains.
The results:
Linear search (64 elements): 180 cycles
Hash table lookup: 340 cycles
The hash table was almost twice as slow.
I was confused at the time. Later I learned to use memory benchmarks to understand this phenomenon. The problem wasn't the algorithm—it was the memory access pattern.
Memory Is the Bottleneck in Modern Systems
Let's look at some numbers. Here's a typical modern processor's memory hierarchy:
Level Size Latency Bandwidth
─────────────────────────────────────────────────
Register ~1 KB 0 cycles N/A
L1 Cache 32-64 KB 3-4 cycles ~1 TB/s
L2 Cache 256-512 KB 10-14 cycles ~500 GB/s
L3 Cache 8-32 MB 30-50 cycles ~200 GB/s
DRAM 16-128 GB 100-300 cycles ~50 GB/s
─────────────────────────────────────────────────
From L1 cache to DRAM, latency differs by nearly 100×. This means:
- A single cache miss can waste 100 CPU cycles
- In those 100 cycles, the CPU could execute hundreds of instructions
- If your algorithm has poor memory access patterns, the CPU spends most of its time waiting for memory
Back to my hash table problem. Linear search operates on 64 contiguous elements, all in L1 cache. Hash table access patterns are random—each lookup might cache miss.
Linear search: 64 × 3 cycles (L1 hit) = 192 cycles
Hash table: 1 hash + 1-2 random access × 150 cycles = 300+ cycles
This is why understanding memory performance is so important.
LMbench: The Classic Memory Benchmark
LMbench is a benchmark suite developed by Larry McVoy in the 1990s. Though old, the fundamental concepts it measures remain important.
Memory Latency Measurement
LMbench uses pointer chasing to measure memory latency:
// Pointer chasing: each node points to the next (random location)
struct node {
struct node *next;
char padding[STRIDE - sizeof(void*)];
};
// Measure latency
void measure_latency(struct node *head, int count) {
struct node *p = head;
for (int i = 0; i < count; i++) {
p = p->next; // Must wait for previous access to complete
}
}
The key to this technique: each memory access depends on the previous result, so the CPU cannot prefetch or parallelize. This measures true latency, not bandwidth.
Memory Bandwidth Measurement
Bandwidth is measured differently:
// Sequential read - measures bandwidth
void measure_bandwidth(char *buffer, size_t size) {
volatile int sum = 0;
for (size_t i = 0; i < size; i += 64) { // Each cache line
sum += buffer[i];
}
}
Here accesses are sequential; the CPU can prefetch. We're measuring how fast the system can "feed" the CPU.
STREAM Benchmark
STREAM is a memory bandwidth benchmark developed by John McCalpin, specifically measuring sustained memory bandwidth.
Four Core Tests
// STREAM's four operations
// Assume a, b, c are large arrays, scalar is a constant
// 1. Copy: c = a
for (int i = 0; i < N; i++)
c[i] = a[i];
// 2. Scale: b = scalar * c
for (int i = 0; i < N; i++)
b[i] = scalar * c[i];
// 3. Add: c = a + b
for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];
// 4. Triad: a = b + scalar * c
for (int i = 0; i < N; i++)
a[i] = b[i] + scalar * c[i];
Why These Four?
Each operation measures a different memory access pattern:
| Operation | Reads | Writes | Bytes/Element |
|---|---|---|---|
| Copy | 1 | 1 | 16 (read 8 + write 8) |
| Scale | 1 | 1 | 16 |
| Add | 2 | 1 | 24 |
| Triad | 2 | 1 | 24 |
Pointer Chasing: Measuring True Latency
STREAM measures bandwidth, but sometimes you need to know latency. Pointer chasing is the standard method.
Basic Principle
// Create a randomly linked array
void setup_pointer_chase(void **array, size_t count) {
// Initialize sequentially first
for (size_t i = 0; i < count - 1; i++) {
array[i] = &array[i + 1];
}
array[count - 1] = &array[0];
// Then shuffle (Fisher-Yates)
for (size_t i = count - 1; i > 0; i--) {
size_t j = rand() % (i + 1);
void *temp = array[i];
array[i] = array[j];
array[j] = temp;
}
}
// Measure
uint64_t measure_latency(void **array, size_t iterations) {
void **p = array;
uint64_t start = rdtsc();
for (size_t i = 0; i < iterations; i++) {
p = (void **)*p; // Depends on previous result
}
uint64_t end = rdtsc();
// Prevent optimization
volatile void *sink = p;
(void)sink;
return (end - start) / iterations;
}
Why Shuffle?
Without shuffling, the access pattern is sequential, and the CPU prefetcher will preload the next location. After shuffling, access is random—each is a true memory access.
Measuring Different Working Set Sizes
// Measure L1, L2, L3, DRAM latency
size_t sizes[] = {
8 * 1024, // 8 KB - should be in L1
64 * 1024, // 64 KB - probably in L2
512 * 1024, // 512 KB - should be in L2/L3
4 * 1024 * 1024, // 4 MB - should be in L3
64 * 1024 * 1024 // 64 MB - should be in DRAM
};
for (int i = 0; i < 5; i++) {
size_t count = sizes[i] / sizeof(void*);
void **array = malloc(sizes[i]);
setup_pointer_chase(array, count);
uint64_t latency = measure_latency(array, 10000000);
printf("%8zu KB: %3lu cycles\n", sizes[i] / 1024, latency);
free(array);
}
Typical results:
Working Set Latency
8 KB: 4 cycles (L1)
64 KB: 12 cycles (L2)
512 KB: 35 cycles (L3)
4096 KB: 45 cycles (L3)
65536 KB: 150 cycles (DRAM)
This is a visualization of the memory hierarchy. You can clearly see the latency difference at each level.
TLB Effects
Memory access isn't only affected by cache—there's also the TLB (Translation Lookaside Buffer).
What Is TLB
Modern systems use virtual memory; every memory access requires address translation. TLB is a cache for the page table:
Virtual Address → TLB lookup → Physical Address → Cache/Memory
↓
TLB miss?
↓
Page table walk
(slow, may require multiple memory accesses)
Measuring TLB Effects
// Access once per page
#define PAGE_SIZE 4096
void measure_tlb(char *buffer, size_t pages) {
volatile int sum = 0;
for (size_t i = 0; i < pages; i++) {
sum += buffer[i * PAGE_SIZE]; // One access per page
}
}
If pages exceeds the number of TLB entries, you'll start seeing TLB miss costs.
Typical results:
Pages accessed Latency per access
16: 4 cycles (TLB hit)
64: 4 cycles (TLB hit)
256: 5 cycles (TLB misses starting)
1024: 25 cycles (heavy TLB misses)
4096: 35 cycles (TLB thrashing)
Huge Pages
One solution to TLB problems is using huge pages (2MB or 1GB):
# Check huge pages
cat /proc/meminfo | grep Huge
# Allocate huge pages
echo 100 | sudo tee /proc/sys/vm/nr_hugepages
# Use in program
#include <sys/mman.h>
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
With 2MB huge pages, you need only 1/512 the TLB entries to cover the same memory range.
Common Pitfalls
Pitfall 1: Array Too Small
If your test array fits entirely in L1 cache, you're not measuring memory performance:
// Bad: only measures L1 cache
char buffer[4096];
measure_bandwidth(buffer, 4096);
// Good: ensure it exceeds cache size
char *buffer = malloc(100 * 1024 * 1024); // 100 MB
measure_bandwidth(buffer, 100 * 1024 * 1024);
Pitfall 2: Compiler Optimization
The compiler might optimize away your memory accesses:
// Bad: compiler might optimize away
int sum = 0;
for (int i = 0; i < N; i++) {
sum += array[i];
}
// Good: use volatile to prevent optimization
volatile int sum = 0;
for (int i = 0; i < N; i++) {
sum += array[i];
}
Pitfall 3: Not Considering Prefetcher
Modern CPU prefetchers are smart. Sequential access gets prefetched, showing lower latency:
// This gets prefetched, not true memory latency
for (int i = 0; i < N; i++) {
sum += array[i];
}
// Use pointer chasing to measure true latency
p = *p; // Must wait for previous access to complete
Pitfall 4: NUMA Effects
On multi-socket systems, memory location matters:
# Check NUMA topology
numactl --hardware
# Bind to specific node
numactl --cpunodebind=0 --membind=0 ./benchmark
Back to That Hash Table
Now let's re-analyze my hash table problem with this knowledge.
Linear search (64 elements):
- Array size: 64 × 8 bytes = 512 bytes
- Entirely in L1 cache
- Sequential access, prefetcher effective
- Each comparison: ~3 cycles
- Average 32 lookups: 32 × 3 = 96 cycles
- Plus some overhead: ~180 cycles ✓
Hash table lookup:
- Hash computation: ~20 cycles
- Bucket access: possible cache miss, ~100 cycles
- If collision, another random access
- Total: ~200-400 cycles ✓
Lesson: On small datasets, cache-friendly simple algorithms are often faster than "clever" algorithms.
Summary
Memory performance is often the bottleneck in modern systems:
Memory Hierarchy
- L1 → L2 → L3 → DRAM, latency can differ by 100×
- Understanding which level your working set is in matters
Benchmark Tools
- LMbench: Classic latency and bandwidth measurement
- STREAM: Standard bandwidth benchmark
- Pointer chasing: Measures true random access latency
Measurement Techniques
- Use large enough arrays to avoid measuring only cache
- Use volatile to prevent compiler optimization
- Use pointer chasing to measure latency
- Consider TLB and NUMA effects
Practical Advice
- On small datasets, cache locality matters more than Big-O
- Random access is 10-100× slower than sequential access
- Profile first, then optimize—don't guess