Chapter 6: Memory Benchmarks

Part II: Tools


"Memory is the new disk, and disk is the new tape." — Jim Gray

The Afternoon When O(1) Was Slower Than O(n)

"This can't be right."

I stared at the numbers on my screen, making sure I wasn't misreading them. Our hash table lookup (theoretically O(1)) was slower than linear search (O(n)). On an array of just 64 elements.

This was a few years ago. I was optimizing a lookup table in an embedded system. The original implementation was simple linear search; I "improved" it to a hash table, expecting massive performance gains.

The results:

Linear search (64 elements): 180 cycles
Hash table lookup:           340 cycles

The hash table was almost twice as slow.

I was confused at the time. Later I learned to use memory benchmarks to understand this phenomenon. The problem wasn't the algorithm—it was the memory access pattern.

Memory Is the Bottleneck in Modern Systems

Let's look at some numbers. Here's a typical modern processor's memory hierarchy:

Level        Size        Latency      Bandwidth
─────────────────────────────────────────────────
Register     ~1 KB       0 cycles     N/A
L1 Cache     32-64 KB    3-4 cycles   ~1 TB/s
L2 Cache     256-512 KB  10-14 cycles ~500 GB/s
L3 Cache     8-32 MB     30-50 cycles ~200 GB/s
DRAM         16-128 GB   100-300 cycles ~50 GB/s
─────────────────────────────────────────────────

From L1 cache to DRAM, latency differs by nearly 100×. This means:

  • A single cache miss can waste 100 CPU cycles
  • In those 100 cycles, the CPU could execute hundreds of instructions
  • If your algorithm has poor memory access patterns, the CPU spends most of its time waiting for memory

Back to my hash table problem. Linear search operates on 64 contiguous elements, all in L1 cache. Hash table access patterns are random—each lookup might cache miss.

Linear search: 64 × 3 cycles (L1 hit) = 192 cycles
Hash table:    1 hash + 1-2 random access × 150 cycles = 300+ cycles

This is why understanding memory performance is so important.

LMbench: The Classic Memory Benchmark

LMbench is a benchmark suite developed by Larry McVoy in the 1990s. Though old, the fundamental concepts it measures remain important.

Memory Latency Measurement

LMbench uses pointer chasing to measure memory latency:

// Pointer chasing: each node points to the next (random location)
struct node {
    struct node *next;
    char padding[STRIDE - sizeof(void*)];
};

// Measure latency
void measure_latency(struct node *head, int count) {
    struct node *p = head;
    for (int i = 0; i < count; i++) {
        p = p->next;  // Must wait for previous access to complete
    }
}

The key to this technique: each memory access depends on the previous result, so the CPU cannot prefetch or parallelize. This measures true latency, not bandwidth.

Memory Bandwidth Measurement

Bandwidth is measured differently:

// Sequential read - measures bandwidth
void measure_bandwidth(char *buffer, size_t size) {
    volatile int sum = 0;
    for (size_t i = 0; i < size; i += 64) {  // Each cache line
        sum += buffer[i];
    }
}

Here accesses are sequential; the CPU can prefetch. We're measuring how fast the system can "feed" the CPU.

STREAM Benchmark

STREAM is a memory bandwidth benchmark developed by John McCalpin, specifically measuring sustained memory bandwidth.

Four Core Tests

// STREAM's four operations
// Assume a, b, c are large arrays, scalar is a constant

// 1. Copy: c = a
for (int i = 0; i < N; i++)
    c[i] = a[i];

// 2. Scale: b = scalar * c
for (int i = 0; i < N; i++)
    b[i] = scalar * c[i];

// 3. Add: c = a + b
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

// 4. Triad: a = b + scalar * c
for (int i = 0; i < N; i++)
    a[i] = b[i] + scalar * c[i];

Why These Four?

Each operation measures a different memory access pattern:

OperationReadsWritesBytes/Element
Copy1116 (read 8 + write 8)
Scale1116
Add2124
Triad2124

Pointer Chasing: Measuring True Latency

STREAM measures bandwidth, but sometimes you need to know latency. Pointer chasing is the standard method.

Basic Principle

// Create a randomly linked array
void setup_pointer_chase(void **array, size_t count) {
    // Initialize sequentially first
    for (size_t i = 0; i < count - 1; i++) {
        array[i] = &array[i + 1];
    }
    array[count - 1] = &array[0];

    // Then shuffle (Fisher-Yates)
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        void *temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }
}

// Measure
uint64_t measure_latency(void **array, size_t iterations) {
    void **p = array;
    uint64_t start = rdtsc();

    for (size_t i = 0; i < iterations; i++) {
        p = (void **)*p;  // Depends on previous result
    }

    uint64_t end = rdtsc();

    // Prevent optimization
    volatile void *sink = p;
    (void)sink;

    return (end - start) / iterations;
}

Why Shuffle?

Without shuffling, the access pattern is sequential, and the CPU prefetcher will preload the next location. After shuffling, access is random—each is a true memory access.

Measuring Different Working Set Sizes

// Measure L1, L2, L3, DRAM latency
size_t sizes[] = {
    8 * 1024,        // 8 KB - should be in L1
    64 * 1024,       // 64 KB - probably in L2
    512 * 1024,      // 512 KB - should be in L2/L3
    4 * 1024 * 1024, // 4 MB - should be in L3
    64 * 1024 * 1024 // 64 MB - should be in DRAM
};

for (int i = 0; i < 5; i++) {
    size_t count = sizes[i] / sizeof(void*);
    void **array = malloc(sizes[i]);
    setup_pointer_chase(array, count);

    uint64_t latency = measure_latency(array, 10000000);
    printf("%8zu KB: %3lu cycles\n", sizes[i] / 1024, latency);

    free(array);
}

Typical results:

Working Set    Latency
      8 KB:    4 cycles   (L1)
     64 KB:   12 cycles   (L2)
    512 KB:   35 cycles   (L3)
   4096 KB:   45 cycles   (L3)
  65536 KB:  150 cycles   (DRAM)

This is a visualization of the memory hierarchy. You can clearly see the latency difference at each level.

TLB Effects

Memory access isn't only affected by cache—there's also the TLB (Translation Lookaside Buffer).

What Is TLB

Modern systems use virtual memory; every memory access requires address translation. TLB is a cache for the page table:

Virtual Address → TLB lookup → Physical Address → Cache/Memory
                     ↓
                 TLB miss?
                     ↓
              Page table walk
              (slow, may require multiple memory accesses)

Measuring TLB Effects

// Access once per page
#define PAGE_SIZE 4096

void measure_tlb(char *buffer, size_t pages) {
    volatile int sum = 0;
    for (size_t i = 0; i < pages; i++) {
        sum += buffer[i * PAGE_SIZE];  // One access per page
    }
}

If pages exceeds the number of TLB entries, you'll start seeing TLB miss costs.

Typical results:

Pages accessed    Latency per access
         16:      4 cycles    (TLB hit)
         64:      4 cycles    (TLB hit)
        256:      5 cycles    (TLB misses starting)
       1024:     25 cycles    (heavy TLB misses)
       4096:     35 cycles    (TLB thrashing)

Huge Pages

One solution to TLB problems is using huge pages (2MB or 1GB):

# Check huge pages
cat /proc/meminfo | grep Huge

# Allocate huge pages
echo 100 | sudo tee /proc/sys/vm/nr_hugepages

# Use in program
#include <sys/mman.h>
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

With 2MB huge pages, you need only 1/512 the TLB entries to cover the same memory range.

Common Pitfalls

Pitfall 1: Array Too Small

If your test array fits entirely in L1 cache, you're not measuring memory performance:

// Bad: only measures L1 cache
char buffer[4096];
measure_bandwidth(buffer, 4096);

// Good: ensure it exceeds cache size
char *buffer = malloc(100 * 1024 * 1024);  // 100 MB
measure_bandwidth(buffer, 100 * 1024 * 1024);

Pitfall 2: Compiler Optimization

The compiler might optimize away your memory accesses:

// Bad: compiler might optimize away
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Good: use volatile to prevent optimization
volatile int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}

Pitfall 3: Not Considering Prefetcher

Modern CPU prefetchers are smart. Sequential access gets prefetched, showing lower latency:

// This gets prefetched, not true memory latency
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Use pointer chasing to measure true latency
p = *p;  // Must wait for previous access to complete

Pitfall 4: NUMA Effects

On multi-socket systems, memory location matters:

# Check NUMA topology
numactl --hardware

# Bind to specific node
numactl --cpunodebind=0 --membind=0 ./benchmark

Back to That Hash Table

Now let's re-analyze my hash table problem with this knowledge.

Linear search (64 elements):

  • Array size: 64 × 8 bytes = 512 bytes
  • Entirely in L1 cache
  • Sequential access, prefetcher effective
  • Each comparison: ~3 cycles
  • Average 32 lookups: 32 × 3 = 96 cycles
  • Plus some overhead: ~180 cycles ✓

Hash table lookup:

  • Hash computation: ~20 cycles
  • Bucket access: possible cache miss, ~100 cycles
  • If collision, another random access
  • Total: ~200-400 cycles ✓

Lesson: On small datasets, cache-friendly simple algorithms are often faster than "clever" algorithms.

Summary

Memory performance is often the bottleneck in modern systems:

Memory Hierarchy

  • L1 → L2 → L3 → DRAM, latency can differ by 100×
  • Understanding which level your working set is in matters

Benchmark Tools

  • LMbench: Classic latency and bandwidth measurement
  • STREAM: Standard bandwidth benchmark
  • Pointer chasing: Measures true random access latency

Measurement Techniques

  • Use large enough arrays to avoid measuring only cache
  • Use volatile to prevent compiler optimization
  • Use pointer chasing to measure latency
  • Consider TLB and NUMA effects

Practical Advice

  • On small datasets, cache locality matters more than Big-O
  • Random access is 10-100× slower than sequential access
  • Profile first, then optimize—don't guess