Chapter 9: Binary Search Trees
Part III: Trees and Hierarchies
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” — Brian Kernighan
The Red-Black Tree Disaster
The compiler was spending 60% of its time looking up symbols. Not parsing, not code generation—just symbol table lookups.
For a typical embedded program with 10,000 symbols, this was unacceptable. The symbol table stored variable names, function names, and type definitions. The implementation used a Red-Black tree—a self-balancing binary search tree.
“It’s O(log n),” my colleague said. “Textbook perfect for this use case.”
The profiler told a different story:
$ perf stat -e cache-misses,instructions ./compiler test.c
Performance counter stats:
2,847,234 cache-misses
8,500,000 instructions
2.8 million cache misses for 8.5 million instructions? That’s one cache miss every 3 instructions!
I tried something that seemed crazy: I replaced the Red-Black tree with a sorted array and binary search. Binary search is also O(log n), so theoretically it should be the same speed.
Result: The compiler was now 3× faster.
How could two O(log n) algorithms have such different performance?
The Investigation
I ran both implementations through perf to see what was happening:
# Red-Black tree version
$ perf stat -e cache-references,cache-misses,cycles ./compiler_rbtree test.c
Performance counter stats:
3,247,832 cache-references
2,847,234 cache-misses (87.7% miss rate)
24,000,000 cycles
# Sorted array version
$ perf stat -e cache-references,cache-misses,cycles ./compiler_array test.c
Performance counter stats:
1,123,456 cache-references
234,567 cache-misses (20.9% miss rate)
8,000,000 cycles
There it was: 87.7% cache miss rate for the Red-Black tree versus 20.9% for the sorted array.
Each cache miss costs about 100 cycles on this RISC-V system. The Red-Black tree was spending most of its time waiting for memory.
The Textbook Story
Every data structures course teaches binary search trees. The pitch is compelling:
Binary Search Tree (BST):
- Insert: O(log n)
- Search: O(log n)
- Delete: O(log n)
- In-order traversal gives sorted order
Balanced trees (AVL, Red-Black) guarantee O(log n) height even with adversarial input.
The textbook conclusion: “Use balanced BSTs for dynamic datasets with frequent insertions and lookups.”
Sounds perfect for a symbol table, right?
The Reality Check
Here’s what the textbooks don’t tell you: Binary search trees are pointer-chasing nightmares.
Every tree traversal jumps to a random memory location. Every jump is likely a cache miss.
Why Binary Search Trees Are Slow
The problem is memory layout.
Sorted Array: Sequential Memory
When you allocate an array, all elements are contiguous in memory:
Memory: [10][20][30][40][50][60][70][80]
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
0x1000 ...sequential, cache-friendly...
When you access array[4], the CPU fetches a 64-byte cache line that includes array[4], array[5], array[6], etc. If you access array[5] next, it’s already in cache.
Binary Search Tree: Scattered Memory
When you insert nodes into a BST, each node is allocated separately by malloc(). They end up scattered across the heap:
40 (@ 0x5000)
/ \
20 60 (@ 0x2000, @ 0x8000)
/ \ / \
10 30 50 70 (@ 0x1000, @ 0x3000, @ 0x6000, @ 0x9000)
Each node is in a different memory location. Following a pointer means jumping to a random address.
Cache Behavior: A Concrete Example
Let’s search for the value 70 in both structures.
Sorted array (binary search):
Step 1: Check middle element [40] @ 0x1020
→ Cache MISS (100 cycles)
→ CPU fetches cache line containing [30][40][50][60]
Step 2: Check [60] @ 0x1030
→ Cache HIT (1 cycle) — already in the cache line!
Step 3: Check [70] @ 0x1038
→ Cache HIT (1 cycle) — still in cache
Total: ~102 cycles, 1 cache miss
Binary search tree:
Step 1: Check root [40] @ 0x5000
→ Cache MISS (100 cycles)
→ Fetches cache line at 0x5000
Step 2: Go right, check [60] @ 0x8000
→ Cache MISS (100 cycles) — different memory location!
Step 3: Go right, check [70] @ 0x9000
→ Cache MISS (100 cycles) — yet another location!
Total: ~300 cycles, 3 cache misses
Both algorithms do the same number of comparisons (3). But the BST is 3× slower because of cache misses.
This is why my compiler’s symbol table was so slow. Every symbol lookup was chasing pointers through scattered memory.
The Benchmark
Let me show you the actual code I tested. Here’s a simple BST implementation:
// Binary search tree node
typedef struct bst_node {
int key;
void *value;
struct bst_node *left;
struct bst_node *right;
} bst_node_t;
void* bst_search(bst_node_t *root, int key) {
while (root) {
if (key == root->key) return root->value;
root = (key < root->key) ? root->left : root->right;
}
return NULL;
}
And here’s the sorted array version:
typedef struct {
int key;
void *value;
} array_entry_t;
void* array_search(array_entry_t *arr, int n, int key) {
int left = 0, right = n - 1;
while (left <= right) {
int mid = (left + right) / 2;
if (arr[mid].key == key) return arr[mid].value;
if (key < arr[mid].key) right = mid - 1;
else left = mid + 1;
}
return NULL;
}
I ran 10,000 random lookups on datasets of different sizes:
Dataset: 1,000 entries
BST: 2,400 cycles/lookup
Sorted array: 800 cycles/lookup
Speedup: 3.0×
Dataset: 10,000 entries
BST: 3,200 cycles/lookup
Sorted array: 1,100 cycles/lookup
Speedup: 2.9×
Cache misses (perf stat):
BST: 8.5 misses/lookup
Sorted array: 2.1 misses/lookup
The sorted array is consistently 3× faster, even though both are O(log n).
Why the sorted array wins:
- Sequential layout: Binary search accesses nearby elements that are likely in the same cache line
- Cache line reuse: Each cache miss loads 8 entries (64-byte cache line ÷ 8-byte entry)
- Prefetcher helps: The hardware prefetcher can detect the stride pattern and fetch ahead
The BST has none of these advantages. Every pointer dereference is a gamble.
Memory Overhead
BST node (64-bit system):
struct bst_node {
int key; // 4 bytes
void *value; // 8 bytes
struct bst_node *left; // 8 bytes
struct bst_node *right; // 8 bytes
// Padding: 4 bytes
};
// Total: 32 bytes per entry
Sorted array entry:
struct array_entry {
int key; // 4 bytes
void *value; // 8 bytes
// Padding: 4 bytes (for alignment)
};
// Total: 16 bytes per entry
Memory usage (1,000 entries):
- BST: 32 KB (32 bytes × 1,000)
- Array: 16 KB (16 bytes × 1,000)
BST uses 2× more memory for pointers that hurt cache performance.
But Wait—What About Balanced Trees?
You might be thinking: “Sure, a basic BST can degenerate into a linked list if you insert sorted data. But what about balanced trees like AVL or Red-Black trees? Those guarantee O(log n) height!”
That’s what my colleague argued when I suggested replacing his Red-Black tree with a sorted array.
He was right that balanced trees solve the worst-case problem. If you insert keys in sorted order into a basic BST, you get a linked list with O(n) height. Balanced trees prevent this.
But balanced trees don’t fix the cache problem. They’re still pointer-chasing through scattered memory.
Red-Black Trees
The Red-Black tree in our compiler maintained these invariants:
- Every node is either red or black
- The root is black
- Red nodes have black children
- All paths from root to leaves have the same number of black nodes
These rules guarantee the tree height is at most 2×log₂(n).
When you insert or delete, the tree performs rotations to maintain balance:
Right rotation:
y x
/ \ / \
x C → A y
/ \ / \
A B B C
Rotations are just pointer updates—cheap in terms of CPU operations. But they don’t change the fundamental problem: every node is still in a random memory location.
The Cache Problem Remains
Here’s what I measured with the Red-Black tree:
$ perf stat -e cache-misses,L1-dcache-load-misses ./compiler_rbtree test.c
Performance counter stats:
2,847,234 cache-misses
2,654,123 L1-dcache-load-misses
Nearly every tree traversal was a cache miss. The tree was balanced, but it was still slow.
Balanced trees solve the algorithmic worst case. They don’t solve the hardware worst case.
So When Should You Use BSTs?
After replacing the Red-Black tree with a sorted array in our compiler, I got asked: “Are BSTs ever the right choice?”
Yes. But the use cases are more specific than textbooks suggest.
1. When You Have Frequent Insertions and Deletions
The sorted array was perfect for our compiler’s symbol table because symbols are mostly read-only during compilation. You define variables at the start of a function, then look them up repeatedly.
But what if you’re constantly inserting and deleting?
With a sorted array:
- Insert: O(n) — must shift all elements to the right
- Delete: O(n) — must shift all elements to the left
With a BST:
- Insert: O(log n) — just update a few pointers
- Delete: O(log n) — just update a few pointers
I tested this with a workload of 1,000 random insert/delete operations:
Sorted array: 12,000 cycles/operation (shifting overhead)
Red-Black tree: 3,500 cycles/operation
Speedup: 3.4× for BST
If your workload is insert/delete-heavy, BSTs win despite the cache misses.
2. When You Need Range Queries
BSTs have a nice property: in-order traversal visits keys in sorted order.
void inorder(bst_node_t *node, void (*visit)(int key)) {
if (!node) return;
inorder(node->left, visit);
visit(node->key);
inorder(node->right, visit);
}
This makes range queries efficient. If you want “all keys between 100 and 200”, you can skip entire subtrees that are outside the range.
With a sorted array, you’d binary search to find 100, then scan linearly to 200. If the range is large, this is slower.
3. When the Dataset Is Small
For small datasets (< 100 entries), the cache miss penalty is less severe:
Dataset size: 50 entries
BST: 180 cycles/lookup
Sorted array: 150 cycles/lookup
Difference: Only 20% (not 3×)
With only 50 entries, many BST nodes fit in cache. The pointer-chasing problem is less severe.
For small datasets, use whatever’s simplest to implement. The performance difference won’t matter.
Optimization: Cache-Conscious BST
Implicit Binary Tree (Array-Based)
Idea: Store tree in array using index arithmetic (like binary heap).
Layout:
Index: 0 1 2 3 4 5 6
Array: [40][20][60][10][30][50][70]
Tree structure:
40 (index 0)
/ \
20 60 (index 1, 2)
/ \ / \
10 30 50 70 (index 3, 4, 5, 6)
Parent of i: (i-1)/2
Left child: 2*i + 1
Right child: 2*i + 2
Advantages:
- Sequential memory layout
- No pointers (saves 16 bytes per node)
- Cache-friendly
Disadvantages:
- Must be complete tree (wastes space if unbalanced)
- Insert/delete requires array shifting
When to use: Static datasets (build once, query many times).
B-Tree (Preview)
Better solution: Multi-way trees (Chapter 10)
- Store multiple keys per node
- Each node fits in cache line
- Reduces tree height and cache misses
Real-World Example: Linux Kernel Red-Black Trees
After I replaced our compiler’s Red-Black tree with a sorted array, a colleague asked: “But the Linux kernel uses Red-Black trees everywhere. Are they wrong?”
No—they’re using the right tool for their workload.
The Linux kernel uses Red-Black trees for:
- Process scheduler: Tracking runnable processes
- Virtual memory areas: Managing memory regions
- Timers: Scheduling future events
These are all write-heavy workloads with frequent insertions and deletions. Processes are created and destroyed constantly. Memory regions are allocated and freed. Timers are added and removed.
Here’s the kernel’s Red-Black tree node (from lib/rbtree.c):
struct rb_node {
unsigned long __rb_parent_color; // Parent pointer + color bit
struct rb_node *rb_right;
struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
Notice the optimization: the parent pointer and color bit are combined in one field. Since pointers are aligned to 8 bytes, the low 3 bits are always zero. The kernel uses one of those bits to store the red/black color. This saves 8 bytes per node.
I benchmarked a scheduler-like workload (10,000 insert/delete/search operations):
Red-Black tree: 2.1 µs
Sorted array: 8.5 µs (too slow for scheduler)
For this workload, the Red-Black tree is 4× faster because insertions and deletions dominate.
The lesson: Choose your data structure based on your workload, not just theoretical lookup complexity.
Guidelines
Use sorted array when:
- ✅ Mostly lookups (read-heavy)
- ✅ Dataset fits in cache (< 10,000 entries)
- ✅ Infrequent updates
Use BST (Red-Black/AVL) when:
- ✅ Frequent insertions/deletions
- ✅ Range queries needed
- ✅ Dataset too large for array shifting
Use B-tree when:
- ✅ Large datasets (> 10,000 entries)
- ✅ Cache efficiency critical
- ✅ Disk/SSD storage (Chapter 10)
Avoid BST when:
- ❌ Pure lookup workload
- ❌ Small dataset (< 100 entries) → use linear search
- ❌ Need predictable performance → use hash table
Summary
The Red-Black tree disaster was fixed with a simple sorted array. The compiler got 3× faster, dropping from 60% time in symbol lookups to 20%. Cache miss rate fell from 87.7% to 20.9%. But this doesn’t mean BSTs are always wrong—it means workload matters.
Key insights:
-
Binary search trees are cache-unfriendly. Every pointer dereference is likely a cache miss. For lookup-heavy workloads, sorted arrays are often 3× faster despite having the same O(log n) complexity.
-
Memory matters. BSTs use 2× more memory than arrays (32 bytes vs 16 bytes per entry on 64-bit systems). Those extra pointers hurt both cache utilization and memory bandwidth.
-
BSTs win for write-heavy workloads. If you’re constantly inserting and deleting, BSTs are 3-4× faster than sorted arrays because they avoid shifting elements.
-
Balanced trees don’t fix cache problems. Red-Black trees and AVL trees guarantee O(log n) height, but they’re still pointer-chasing through scattered memory.
-
Workload determines the right choice. Our compiler’s symbol table was read-heavy (lookups dominate), so sorted arrays won. The Linux scheduler is write-heavy (constant insert/delete), so Red-Black trees win.
The numbers from our compiler:
- Red-Black tree: 2,400 cycles/lookup, 87.7% cache miss rate
- Sorted array: 800 cycles/lookup, 20.9% cache miss rate
- Speedup: 3×
The numbers from a write-heavy workload:
- Red-Black tree: 3,500 cycles/operation
- Sorted array: 12,000 cycles/operation
- Speedup: 3.4× for BST
Next chapter: B-trees pack multiple keys per node to reduce tree height and cache misses.