Chapter 1: The Performance Gap
Part I: Foundations
“In theory, theory and practice are the same. In practice, they are not.” — Attributed to various computer scientists
The Mystery
It was 2:00 AM, and I was staring at profiling data that made no sense.
I was working on a bootloader for a RISC-V SoC, and we had a performance problem. The bootloader needed to look up device configurations from a table—about 500 entries, each with a 32-bit device ID and a pointer to configuration data. Simple enough.
My colleague had implemented it using a hash table. “O(1) lookup,” he said confidently. “Can’t beat that.”
But the bootloader was slow. Unacceptably slow. We were missing our 100ms boot time target by a factor of three.
I tried the obvious optimization: replacing the hash table with a binary search on a sorted array. Binary search is O(log n), which is theoretically worse than O(1). The textbooks say so. My algorithms professor would have frowned.
The result? The bootloader was now 40% faster.
How could O(log n) beat O(1)? What was going on?
The Investigation
I fired up perf, Linux’s performance profiling tool, and ran both implementations:
# Hash table version
$ perf stat -e cache-references,cache-misses ./bootloader_hash
Performance counter stats:
1,247,832 cache-references
892,441 cache-misses (71.5% miss rate)
# Binary search version
$ perf stat -e cache-references,cache-misses ./bootloader_binsearch
Performance counter stats:
423,156 cache-references
89,234 cache-misses (21.1% miss rate)
There it was. The hash table had a 71.5% cache miss rate. The binary search had only 21.1%.
Each cache miss costs roughly 100 CPU cycles on this system. The hash table was spending most of its time waiting for memory.
The O(1) hash table was doing fewer operations, but each operation was expensive. The O(log n) binary search was doing more operations, but each operation was cheap.
The hardware had overruled the algorithm.
Why This Matters
This book is about that gap—the gap between what the textbooks teach and what actually happens when your code runs on real silicon.
Traditional data structures courses teach you to think in terms of Big-O complexity:
- Arrays: O(1) access, O(n) insertion
- Linked lists: O(1) insertion, O(n) access
- Hash tables: O(1) average case
- Binary search trees: O(log n) operations
These are useful abstractions. They help us reason about algorithms at scale. But they’re incomplete.
They assume all memory accesses cost the same. They assume operations happen in isolation. They assume an idealized computer that doesn’t exist.
Real computers have:
- Memory hierarchies: Registers, L1 cache, L2 cache, L3 cache, DRAM
- Latency gaps: 1 cycle vs 100+ cycles
- Cache lines: 64 bytes fetched together
- Prefetchers: Hardware that guesses what you’ll need next
- Limited bandwidth: You can’t fetch everything at once
And if you’re working on embedded systems, you have even more constraints:
- Tiny caches: 8KB to 64KB is common
- No L3 cache: Many MCUs stop at L1 or L2
- Slow memory: DRAM might be 100MHz, not 3GHz
- Real-time requirements: Worst-case matters, not average-case
The Real Performance Model
Here’s a better mental model for modern computers:
Time = Operations × (Computation Cost + Memory Cost)
Where:
- Computation Cost: The actual ALU operations (usually cheap)
- Memory Cost: Cache misses, DRAM accesses (usually expensive)
For many algorithms, Memory Cost dominates.
Let’s quantify this with real numbers from a typical embedded RISC-V system:
| Operation | Latency | Relative Cost |
|---|---|---|
| Register access | 1 cycle | 1× |
| L1 cache hit | 3-4 cycles | 3× |
| L2 cache hit | 12-15 cycles | 12× |
| L3 cache hit | 40-50 cycles | 40× |
| DRAM access | 100-200 cycles | 100× |
A single cache miss can cost as much as 100 register operations.
This means:
- An O(n) algorithm with good cache behavior can beat an O(log n) algorithm with poor cache behavior
- An O(1) hash table can lose to an O(log n) binary search
- A “slow” algorithm that fits in cache can beat a “fast” algorithm that doesn’t
Our First Benchmark: Array vs Linked List
Let’s make this concrete with a simple experiment. We’ll compare two ways to sum 100,000 integers:
- Array: Contiguous memory, perfect for cache
- Linked list: Scattered memory, cache nightmare
Both are O(n). The textbooks say they should perform similarly. Let’s see what really happens.
Here’s the array version:
// Array: contiguous memory
int array[100000];
for (int i = 0; i < 100000; i++) {
array[i] = i;
}
// Sum all elements
long long sum = 0;
for (int i = 0; i < 100000; i++) {
sum += array[i];
}
And the linked list version:
// Linked list: scattered memory
typedef struct node {
int value;
struct node *next;
} node_t;
node_t *head = NULL;
for (int i = 0; i < 100000; i++) {
node_t *node = malloc(sizeof(node_t));
node->value = i;
node->next = head;
head = node;
}
// Sum all elements
long long sum = 0;
node_t *curr = head;
while (curr) {
sum += curr->value;
curr = curr->next;
}
Using our benchmark framework (which we’ll explore in detail in Chapter 3), here are the results:
=== Array Sequential Sum ===
Mean time: 70,147 ns
Median time: 71,724 ns
Total cycles: 17,557,410
=== Linked List Sequential Sum ===
Mean time: 179,169 ns
Median time: 160,527 ns
Total cycles: 44,740,656
Array is 2.55× faster than Linked List
Same algorithm (sequential sum), same O(n) complexity, but the array is 2.5× faster.
Why? Let’s look at the cache behavior:
Array access pattern:
Memory: [0][1][2][3][4][5][6][7][8][9]...
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
Access: Sequential, predictable
Cache: Fetch 64 bytes (16 ints) at once
Result: ~94% cache hit rate
Linked list access pattern:
Memory: [node] ... [node] ... [node] ... [node]
↑ ↑ ↑ ↑
Access: Random, unpredictable (follows pointers)
Cache: Each node likely in different cache line
Result: ~70% cache miss rate
The array benefits from spatial locality—when you access array[0], the CPU fetches an entire cache line (64 bytes), which includes array[0] through array[15]. The next 15 accesses are free.
The linked list suffers from pointer chasing—each node is allocated separately by malloc(), scattered randomly in memory. Each access likely requires a new cache line fetch.
The Memory Hierarchy
To understand why cache matters so much, we need to understand the memory hierarchy.
Modern computers are not the simple “CPU + RAM” model from introductory courses. They’re more like this:
CPU Core
↓ 1 cycle
Registers (32-64 registers, ~256 bytes)
↓ 3-4 cycles
L1 Cache (32-64 KB, split I/D)
↓ 12-15 cycles
L2 Cache (256 KB - 1 MB, unified)
↓ 40-50 cycles
L3 Cache (4-32 MB, shared) [not on all systems]
↓ 100-200 cycles
DRAM (GB scale)
↓ 10,000+ cycles
SSD/Flash
Each level is:
- Faster but smaller than the level below
- More expensive per byte
- Closer to the CPU
The speed gap is enormous. On a 1 GHz RISC-V processor:
- L1 cache: 3-4 nanoseconds
- DRAM: 100-200 nanoseconds
- That’s a 50× difference
For comparison, if L1 cache access was 1 second, DRAM access would be 50 seconds. That’s the difference between a quick response and going to make coffee.
Cache Lines: The Fundamental Unit
Here’s a critical insight: CPUs don’t fetch individual bytes. They fetch cache lines.
A cache line is typically 64 bytes. When you access a single byte, the CPU fetches the entire 64-byte block containing that byte.
This has profound implications:
Good: If you access nearby data, it’s already in cache (spatial locality)
// Excellent: sequential access
for (int i = 0; i < n; i++) {
sum += array[i]; // Next element likely in same cache line
}
Bad: If you access scattered data, you waste 63 bytes per fetch
// Terrible: random access
for (int i = 0; i < n; i++) {
sum += array[random()]; // Each access likely misses cache
}
Worse: If your data structure has poor layout, you pay for data you don’t use
// Linked list node: 16 bytes (4-byte value + 8-byte pointer + padding)
// Cache line: 64 bytes
// Waste: 48 bytes (75% of cache line unused!)
Prefetching: Hardware Tries to Help
Modern CPUs have hardware prefetchers that try to predict what you’ll access next. They’re good at detecting simple patterns:
Sequential access: Prefetcher loves this
for (int i = 0; i < n; i++) {
process(array[i]); // Prefetcher: "I see a pattern! Fetch ahead!"
}
Strided access: Prefetcher can handle this
for (int i = 0; i < n; i += 2) {
process(array[i]); // Prefetcher: "Stride of 2, got it!"
}
Pointer chasing: Prefetcher gives up
while (node) {
process(node->value);
node = node->next; // Prefetcher: "No idea what's next..."
}
This is why linked lists are so slow—the prefetcher can’t help. Each pointer dereference is a surprise.
Embedded Systems: Even Harsher Constraints
If you’re working on embedded systems, the situation is more extreme:
Typical embedded RISC-V MCU:
- L1 cache: 16-32 KB (vs 32-64 KB on desktop)
- L2 cache: 128-256 KB (vs 256 KB - 1 MB on desktop)
- L3 cache: None (vs 4-32 MB on desktop)
- DRAM: 100 MHz (vs 3 GHz on desktop)
With a 16 KB L1 cache, your entire working set needs to fit in 16 KB or you’ll thrash the cache.
For comparison:
- 100,000 integers (array): 400 KB → won’t fit in L1
- 100,000 linked list nodes: 1.6 MB → won’t even fit in L2
This is why embedded systems developers obsess over data structure size and layout. Every byte counts.
Real-Time Considerations
In embedded systems, we often care about worst-case performance, not average-case.
Consider a real-time control loop running at 1 kHz (1ms period):
- Best case: All data in L1 cache → 50 microseconds
- Worst case: All data in DRAM → 500 microseconds
If your algorithm has unpredictable cache behavior, you can’t guarantee real-time deadlines.
This is why real-time systems often prefer:
- Static allocation: Predictable memory layout
- Fixed-size data structures: No dynamic resizing
- Simple algorithms: Predictable cache behavior
Even if they’re “slower” in average-case Big-O terms.
What You’ll Learn in This Book
This book will teach you to think about data structures in terms of hardware reality:
Part I: Foundations
- How memory hierarchy works (Chapter 2)
- How to measure and profile performance (Chapter 3)
Part II: Basic Data Structures
- Arrays: The cache-friendly foundation (Chapter 4)
- Linked lists: When and how to use them (Chapter 5)
- Stacks, queues, and ring buffers (Chapter 6)
- Hash tables: Cache-conscious design (Chapter 7)
- Dynamic arrays and memory management (Chapter 8)
Part III: Trees and Hierarchies
- Binary search trees: Cache behavior (Chapter 9)
- B-trees: Cache-conscious trees (Chapter 10)
- Tries and radix trees (Chapter 11)
- Heaps and priority queues (Chapter 12)
Part IV: Advanced Topics
- Lock-free data structures (Chapter 13)
- String processing (Chapter 14)
- Graphs and networks (Chapter 15)
- Probabilistic structures (Chapter 16)
Part V: Case Studies
- Bootloader data structures (Chapter 17)
- Device driver queues (Chapter 18)
- Firmware memory management (Chapter 19)
Each chapter will include:
- Real-world examples from embedded systems
- Benchmarks showing actual performance
- Cache analysis with profiling tools
- Design guidelines for your own code
Prerequisites and Setup
To get the most out of this book, you should:
Know:
- C programming (pointers, structs, memory management)
- Basic data structures (arrays, linked lists, trees)
- Basic algorithms (sorting, searching)
Have:
- Linux system (Ubuntu/Debian recommended)
- GCC compiler
- Basic command-line skills
Optional but helpful:
- RISC-V development board or QEMU
- Experience with embedded systems
- Familiarity with assembly language
All code examples and benchmarks are available at:
github.com/dannyjiang/ds-in-practice (placeholder)
The Road Ahead
In the next chapter, we’ll dive deep into the memory hierarchy. You’ll learn:
- How caches work at the hardware level
- What cache lines, sets, and ways mean
- How to predict cache behavior
- How to measure cache performance
Then in Chapter 3, we’ll build a complete benchmarking framework—the same one used for all measurements in this book.
By the end of Part I, you’ll have the tools and knowledge to analyze any data structure’s real-world performance.
Let’s get started.
Summary
The mystery from 2:00 AM was solved. The O(log n) binary search beat the O(1) hash table by 40% because cache behavior mattered more than algorithmic complexity. The hash table’s 71.5% cache miss rate versus the binary search’s 21.1% explained everything. The hardware had overruled the algorithm.
Key insights:
- Big-O complexity is necessary but not sufficient for understanding real-world performance
- Memory hierarchy dominates modern computer performance
- Cache misses cost 100× more than cache hits
- Spatial locality matters: Sequential access beats random access
- Embedded systems have harsher constraints: Smaller caches, slower memory
- Real-time systems need predictable performance: Worst-case matters
The Performance Gap:
- Textbook: O(1) hash table beats O(log n) binary search
- Reality: Cache behavior can reverse this
- Lesson: Measure, don’t assume
Next Chapter: We’ll explore the memory hierarchy in detail and learn how caches work at the hardware level.