Chapter 13: Array vs Linked List

Part IV: Data Structures & Algorithms

"In theory, there is no difference between theory and practice. But in practice, there is." — Jan L. A. van de Snepscheut

The Story of the "Optimal" Data Structure

A senior engineer once joined a team working on a high-frequency trading system. The codebase had a critical hot path that maintained an order book—a sorted collection of pending orders. The previous developer had implemented it using a doubly-linked list, reasoning that O(1) insertion and deletion would be optimal for the frequent updates.

The senior engineer ran the profiler and found something surprising: the linked list implementation was spending 70% of its time on cache misses. The "O(1)" insertions were indeed fast in terms of pointer operations, but each node access triggered a cache miss that cost 100+ cycles.

She rewrote the hot path using a sorted array with binary search and memmove for insertions. Despite the O(n) insertion complexity, the new implementation was 8x faster. The lesson: Big-O notation hides constants, and those constants are dominated by memory access patterns.

Why This Comparison Matters

In algorithm textbooks, we learn:

Arrays: O(1) random access, O(n) insertion/deletion
Linked Lists: O(n) random access, O(1) insertion/deletion

This theoretical analysis assumes uniform memory access cost—an assumption that hasn't been true since the 1980s. Modern CPUs have multi-level cache hierarchies where L1 cache access takes ~4 cycles, while main memory access takes 100-300 cycles.

Cache Locality: The Decisive Victory

Although linked lists have O(1) theoretical insertion/deletion advantage, arrays (or std::vector) are overwhelmingly superior in practice.

Cache Line Utilization

Arrays occupy contiguous memory. When the CPU reads one element, it loads an entire cache line (typically 64 bytes), bringing multiple adjacent elements into cache for "free":

Array (contiguous memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A0 │ A1 │ A2 │ A3 │ A4 │ A5 │ A6 │ A7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
  ↑ One cache line fetch (64B) gets 16 integers

Linked list nodes are scattered across the heap. Each node access potentially triggers a cache miss:

Linked List (scattered memory):
┌────┬────┐     ┌────┬────┐     ┌────┬────┐
│ D0 │ *──┼────→│ D1 │ *──┼────→│ D2 │ *──┼→...
└────┴────┘     └────┴────┘     └────┴────┘
 0x1000          0x5F20          0x3A80
  ↑ Each access = potential cache miss (100+ cycles)

Hardware Prefetching

Modern CPUs have hardware prefetchers that detect sequential access patterns and proactively load upcoming data. This works beautifully for arrays but is useless for linked lists.

The "pointer chasing" pattern of linked lists has data dependencies: you must load node N to discover the address of node N+1. This serializes memory accesses and prevents the CPU pipeline from hiding latency.

The Crossover Point

Empirical measurements on modern desktop CPUs show that even for middle insertion, arrays beat linked lists when N < several thousand elements. The memmove operation is highly optimized (often vectorized) and benefits from contiguous memory.

Memory Overhead

On a 64-bit system, each linked list node requires two 8-byte pointers (next, prev) plus alignment padding. For a list of integers (4 bytes each), the pointer overhead can exceed 400% of actual data.

Benchmarking Methodology

When benchmarking data structures, you must avoid measuring the allocator rather than the data structure itself. As with benchmarking in general, the goal is to control confounding factors so that you are actually measuring the behavior of the data structure, not side effects from the allocator or environment.

Fair Comparison Requires

Same allocation strategy: Linked list performance heavily depends on node memory layout. Using a pool allocator that places nodes contiguously dramatically improves linked list performance. Note whether custom allocators are used.

Warm-up consideration: First access triggers page faults and cold cache misses. Run multiple iterations and distinguish cold start from warm state.

Background memory pressure: In production, memory isn't pristine. Add background memory activity to simulate realistic cache pollution.

Sample Benchmark Structure

// Prevent compiler from optimizing away the computation
volatile int sink;

void benchmark_sequential_access(int* array, int n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += array[i];
    }
    sink = sum;  // Prevent dead code elimination

    clock_gettime(CLOCK_MONOTONIC, &end);
    // Report timing...
}

Measuring Cache Effects Directly

To truly understand the array vs. linked list difference, let's measure cache behavior using hardware performance counters:

# Using perf to measure cache misses during sequential traversal
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    ./array_traverse 1000000

# Typical output for array:
#   12,543,210      cache-references
#       62,891      cache-misses             # 0.50%

# Typical output for linked list:
#   12,892,344      cache-references
#    9,234,567      cache-misses             # 71.6%

The linked list shows 140x more cache misses for the same logical operation. Each cache miss costs approximately 100 cycles on modern hardware, explaining why the linked list is 30-50x slower despite doing the "same" work.

Measuring with perf to Confirm the Theory

# Memory bandwidth comparison
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    ./benchmark_both_structures

# Array traversal:
#   1,024,567      L1-dcache-loads
#      12,345      L1-dcache-load-misses    # 1.2%
#       2,345      LLC-loads
#         234      LLC-load-misses          # 10.0%

# Linked list traversal:
#   1,089,234      L1-dcache-loads
#     987,654      L1-dcache-load-misses    # 90.7%
#     876,543      LLC-loads
#     765,432      LLC-load-misses          # 87.3%

Small Size Optimization (SSO)

An important insight that surprises many developers: when N is small, simple linear search often beats complex data structures.

O(n) vs O(1): When N < 10-20, linear array search is typically faster than hash table lookup. Array operations fit entirely in L1 cache with no hash computation overhead.

Branch prediction: Binary search on sorted arrays causes ~50% branch misprediction rate on random queries. Linear search has trivial prediction patterns.

Here's a demonstration:

// Linear search: trivial, but cache-friendly
int linear_search(int* arr, int n, int target) {
    for (int i = 0; i < n; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}

// Measured performance for N=16, random queries:
// Linear search: ~15 cycles average
// Binary search: ~45 cycles average (branch misprediction penalty)
// Hash table:    ~60 cycles average (hash computation + indirection)

Many high-performance libraries and standard containers implement SSO-style optimizations: when the element count is below a threshold, they fall back to simple arrays or small in-object buffers internally. The std::string small-string optimization and boost::container::small_vector are canonical examples.

When to Use Each

Use Arrays (std::vector) When

Sequential or random access patterns dominate
Size is known or changes infrequently
Cache performance is critical
Elements are small (pointers, integers, small structs)

Use Linked Lists When

Frequent insertions/deletions at known positions (with held iterators)
Need stable pointers/references after insertion
Elements are very large (moving is expensive)
Memory fragmentation is acceptable

The Modern Reality

In C++, Bjarne Stroustrup and others have empirically shown that std::vector outperforms std::list for almost all use cases. The exceptions are rare and specific.

Practical Measurements

Typical results on modern hardware (Intel Core i7, DDR4):

Operation	Array (N=1000)	Linked List (N=1000)	Ratio
Sequential traverse	0.5 μs	15 μs	30x
Random access	0.02 μs	8 μs	400x
Middle insertion	0.3 μs	0.05 μs*	0.17x
Memory per element	4 bytes	24 bytes	6x

*Linked list insertion assumes you already have the iterator—finding the position is O(n).

These results come from one specific CPU, compiler, and standard library implementation. On a different machine or implementation the absolute timings will change, but the qualitative trend is robust: contiguous arrays dominate linked lists for traversal and random access on modern hardware.

Summary

Cache locality often matters more than algorithmic complexity. The gap between L1 cache and main memory is 100x.
Measure, don't assume. Your intuition from algorithm class may be wrong for modern hardware.
Consider your actual access patterns. If you traverse more than you insert, prefer arrays.
When in doubt, use std::vector. It's the default choice for a reason.

Performance and Benchmarking