Chapter 13: Array vs Linked List
Part IV: Data Structures & Algorithms
"In theory, there is no difference between theory and practice. But in practice, there is." — Jan L. A. van de Snepscheut
The Story of the "Optimal" Data Structure
A senior engineer once joined a team working on a high-frequency trading system. The codebase had a critical hot path that maintained an order book—a sorted collection of pending orders. The previous developer had implemented it using a doubly-linked list, reasoning that O(1) insertion and deletion would be optimal for the frequent updates.
The senior engineer ran the profiler and found something surprising: the linked list implementation was spending 70% of its time on cache misses. The "O(1)" insertions were indeed fast in terms of pointer operations, but each node access triggered a cache miss that cost 100+ cycles.
She rewrote the hot path using a sorted array with binary search and memmove for insertions. Despite the O(n) insertion complexity, the new implementation was 8x faster. The lesson: Big-O notation hides constants, and those constants are dominated by memory access patterns.
Why This Comparison Matters
In algorithm textbooks, we learn:
- Arrays: O(1) random access, O(n) insertion/deletion
- Linked Lists: O(n) random access, O(1) insertion/deletion
This theoretical analysis assumes uniform memory access cost—an assumption that hasn't been true since the 1980s. Modern CPUs have multi-level cache hierarchies where L1 cache access takes ~4 cycles, while main memory access takes 100-300 cycles.
Cache Locality: The Decisive Victory
Although linked lists have O(1) theoretical insertion/deletion advantage, arrays (or std::vector) are overwhelmingly superior in practice.
Cache Line Utilization
Arrays occupy contiguous memory. When the CPU reads one element, it loads an entire cache line (typically 64 bytes), bringing multiple adjacent elements into cache for "free":
Array (contiguous memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A0 │ A1 │ A2 │ A3 │ A4 │ A5 │ A6 │ A7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
↑ One cache line fetch (64B) gets 16 integers
Linked list nodes are scattered across the heap. Each node access potentially triggers a cache miss:
Linked List (scattered memory):
┌────┬────┐ ┌────┬────┐ ┌────┬────┐
│ D0 │ *──┼────→│ D1 │ *──┼────→│ D2 │ *──┼→...
└────┴────┘ └────┴────┘ └────┴────┘
0x1000 0x5F20 0x3A80
↑ Each access = potential cache miss (100+ cycles)
Hardware Prefetching
Modern CPUs have hardware prefetchers that detect sequential access patterns and proactively load upcoming data. This works beautifully for arrays but is useless for linked lists.
The "pointer chasing" pattern of linked lists has data dependencies: you must load node N to discover the address of node N+1. This serializes memory accesses and prevents the CPU pipeline from hiding latency.
The Crossover Point
Empirical measurements on modern desktop CPUs show that even for middle insertion, arrays beat linked lists when N < several thousand elements. The memmove operation is highly optimized (often vectorized) and benefits from contiguous memory.
Memory Overhead
On a 64-bit system, each linked list node requires two 8-byte pointers (next, prev) plus alignment padding. For a list of integers (4 bytes each), the pointer overhead can exceed 400% of actual data.
Benchmarking Methodology
When benchmarking data structures, you must avoid measuring the allocator rather than the data structure itself. As with benchmarking in general, the goal is to control confounding factors so that you are actually measuring the behavior of the data structure, not side effects from the allocator or environment.
Fair Comparison Requires
Same allocation strategy: Linked list performance heavily depends on node memory layout. Using a pool allocator that places nodes contiguously dramatically improves linked list performance. Note whether custom allocators are used.
Warm-up consideration: First access triggers page faults and cold cache misses. Run multiple iterations and distinguish cold start from warm state.
Background memory pressure: In production, memory isn't pristine. Add background memory activity to simulate realistic cache pollution.
Sample Benchmark Structure
// Prevent compiler from optimizing away the computation
volatile int sink;
void benchmark_sequential_access(int* array, int n) {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
long sum = 0;
for (int i = 0; i < n; i++) {
sum += array[i];
}
sink = sum; // Prevent dead code elimination
clock_gettime(CLOCK_MONOTONIC, &end);
// Report timing...
}
Measuring Cache Effects Directly
To truly understand the array vs. linked list difference, let's measure cache behavior using hardware performance counters:
# Using perf to measure cache misses during sequential traversal
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./array_traverse 1000000
# Typical output for array:
# 12,543,210 cache-references
# 62,891 cache-misses # 0.50%
# Typical output for linked list:
# 12,892,344 cache-references
# 9,234,567 cache-misses # 71.6%
The linked list shows 140x more cache misses for the same logical operation. Each cache miss costs approximately 100 cycles on modern hardware, explaining why the linked list is 30-50x slower despite doing the "same" work.
Measuring with perf to Confirm the Theory
# Memory bandwidth comparison
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
./benchmark_both_structures
# Array traversal:
# 1,024,567 L1-dcache-loads
# 12,345 L1-dcache-load-misses # 1.2%
# 2,345 LLC-loads
# 234 LLC-load-misses # 10.0%
# Linked list traversal:
# 1,089,234 L1-dcache-loads
# 987,654 L1-dcache-load-misses # 90.7%
# 876,543 LLC-loads
# 765,432 LLC-load-misses # 87.3%
Small Size Optimization (SSO)
An important insight that surprises many developers: when N is small, simple linear search often beats complex data structures.
O(n) vs O(1): When N < 10-20, linear array search is typically faster than hash table lookup. Array operations fit entirely in L1 cache with no hash computation overhead.
Branch prediction: Binary search on sorted arrays causes ~50% branch misprediction rate on random queries. Linear search has trivial prediction patterns.
Here's a demonstration:
// Linear search: trivial, but cache-friendly
int linear_search(int* arr, int n, int target) {
for (int i = 0; i < n; i++) {
if (arr[i] == target) return i;
}
return -1;
}
// Measured performance for N=16, random queries:
// Linear search: ~15 cycles average
// Binary search: ~45 cycles average (branch misprediction penalty)
// Hash table: ~60 cycles average (hash computation + indirection)
Many high-performance libraries and standard containers implement SSO-style optimizations: when the element count is below a threshold, they fall back to simple arrays or small in-object buffers internally. The std::string small-string optimization and boost::container::small_vector are canonical examples.
When to Use Each
Use Arrays (std::vector) When
- Sequential or random access patterns dominate
- Size is known or changes infrequently
- Cache performance is critical
- Elements are small (pointers, integers, small structs)
Use Linked Lists When
- Frequent insertions/deletions at known positions (with held iterators)
- Need stable pointers/references after insertion
- Elements are very large (moving is expensive)
- Memory fragmentation is acceptable
The Modern Reality
In C++, Bjarne Stroustrup and others have empirically shown that std::vector outperforms std::list for almost all use cases. The exceptions are rare and specific.
Practical Measurements
Typical results on modern hardware (Intel Core i7, DDR4):
| Operation | Array (N=1000) | Linked List (N=1000) | Ratio |
|---|---|---|---|
| Sequential traverse | 0.5 μs | 15 μs | 30x |
| Random access | 0.02 μs | 8 μs | 400x |
| Middle insertion | 0.3 μs | 0.05 μs* | 0.17x |
| Memory per element | 4 bytes | 24 bytes | 6x |
*Linked list insertion assumes you already have the iterator—finding the position is O(n).
These results come from one specific CPU, compiler, and standard library implementation. On a different machine or implementation the absolute timings will change, but the qualitative trend is robust: contiguous arrays dominate linked lists for traversal and random access on modern hardware.
Summary
- Cache locality often matters more than algorithmic complexity. The gap between L1 cache and main memory is 100x.
- Measure, don't assume. Your intuition from algorithm class may be wrong for modern hardware.
- Consider your actual access patterns. If you traverse more than you insert, prefer arrays.
- When in doubt, use std::vector. It's the default choice for a reason.