Chapter 15: Graphs and Cache-Efficient Traversal

Part IV: Advanced Topics

“The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.” — Edsger W. Dijkstra

The Cache Miss Explosion

The network topology discovery was taking 37.5 milliseconds to traverse 500 switches. That doesn’t sound slow until you look at the cache miss count: 8.5 million cache misses. For 500 nodes, that’s 17,000 cache misses per node.

Something was fundamentally wrong with the data structure.

The tool’s job was straightforward: discover network topology by traversing a graph of connected devices. Each switch had up to 48 ports, and we needed to find all reachable devices from a starting point using breadth-first search.

The implementation looked textbook-correct—an adjacency list with standard BFS:

typedef struct node {
    int id;
    struct node **neighbors;  // Array of pointers
    int num_neighbors;
} node_t;

void bfs(node_t *start) {
    queue_t *q = queue_create();
    bool *visited = calloc(MAX_NODES, sizeof(bool));
    
    queue_push(q, start);
    visited[start->id] = true;
    
    while (!queue_empty(q)) {
        node_t *node = queue_pop(q);
        process(node);
        
        for (int i = 0; i < node->num_neighbors; i++) {
            node_t *neighbor = node->neighbors[i];
            if (!visited[neighbor->id]) {
                visited[neighbor->id] = true;
                queue_push(q, neighbor);
            }
        }
    }
}

For a network with 500 switches (average 12 connections each), this took:

$ perf stat -e cycles,cache-misses ./network_discovery_naive
  Performance counter stats:
    45,000,000 cycles
     8,500,000 cache-misses
     
Traversal time: 37.5 ms

8.5 million cache misses for 500 nodes? That’s 17,000 cache misses per node!

I rewrote it with cache-conscious graph representation. The results:

$ perf stat -e cycles,cache-misses ./network_discovery_optimized
  Performance counter stats:
    12,000,000 cycles
     1,200,000 cache-misses
     
Traversal time: 10 ms

3.75× faster with 7× fewer cache misses.

This chapter explores how to represent and traverse graphs efficiently.

The Textbook Story

Graphs are typically represented in two ways:

1. Adjacency Matrix

A 2D array where matrix[i][j] = 1 if there’s an edge from node i to node j:

bool adj_matrix[MAX_NODES][MAX_NODES];

// Check if edge exists
if (adj_matrix[u][v]) {
    // Edge from u to v exists
}

Pros: O(1) edge lookup
Cons: O(n²) space, even for sparse graphs

2. Adjacency List

Each node stores a list of its neighbors:

typedef struct {
    int *neighbors;
    int num_neighbors;
} node_t;

node_t nodes[MAX_NODES];

Pros: O(n + m) space (n nodes, m edges)
Cons: O(degree) edge lookup

The textbook says: “Use adjacency matrix for dense graphs, adjacency list for sparse graphs.”

The Reality Check: Why Standard Representations Are Slow

1. Pointer Chasing in Adjacency Lists

The standard adjacency list uses pointers:

typedef struct edge {
    int dest;
    struct edge *next;  // Linked list of edges
} edge_t;

typedef struct {
    edge_t *edges;  // Pointer to first edge
} node_t;

Problem: Each edge is a separate allocation, scattered in memory.

Traversing neighbors:

Node → Edge1 (cache miss) → Edge2 (cache miss) → Edge3 (cache miss) ...

For a node with 12 neighbors: 12 cache misses just to read the neighbor list!

2. Poor Locality in BFS Queue

Standard BFS uses a queue of pointers:

queue_push(q, neighbor);  // Push pointer to node

Problem: Nodes are processed in BFS order, but they’re scattered in memory.

Queue: [Node5, Node12, Node3, Node45, ...]
       Each node is in a different cache line!

3. Random Access to Visited Array

The visited array is indexed by node ID:

visited[neighbor->id] = true;

If node IDs are not sequential or clustered, this causes random memory access.

Optimization 1: Compact Adjacency List

Instead of pointers, store neighbors in a contiguous array:

typedef struct {
    int *neighbors;      // Contiguous array of neighbor IDs
    int num_neighbors;
} node_t;

typedef struct {
    node_t *nodes;
    int *edge_data;      // All edges in one array
    int num_nodes;
} graph_t;

graph_t *graph_create(int num_nodes, int num_edges) {
    graph_t *g = malloc(sizeof(graph_t));
    g->nodes = malloc(num_nodes * sizeof(node_t));
    g->edge_data = malloc(num_edges * sizeof(int));
    g->num_nodes = num_nodes;
    
    // Nodes point into edge_data array
    int offset = 0;
    for (int i = 0; i < num_nodes; i++) {
        g->nodes[i].neighbors = &g->edge_data[offset];
        g->nodes[i].num_neighbors = /* ... */;
        offset += g->nodes[i].num_neighbors;
    }
    
    return g;
}

Why this helps:

All edges in one contiguous array (better prefetching)
No pointer chasing (neighbors are sequential)
Better cache utilization

The Benchmark

Test: BFS on 500-node graph (avg degree: 12)

Linked list adjacency list:
  Cache misses: 8.5M
  Cycles: 45M
  
Compact adjacency list:
  Cache misses: 2.8M (3× fewer)
  Cycles: 18M
  Speedup: 2.5×

Optimization 2: Cache-Oblivious BFS

Standard BFS processes nodes level-by-level, but nodes in the same level might be far apart in memory.

Blocked BFS

Process nodes in cache-sized blocks:

#define BLOCK_SIZE 64  // Process 64 nodes at a time

void bfs_blocked(graph_t *g, int start) {
    bool *visited = calloc(g->num_nodes, sizeof(bool));
    int *queue = malloc(g->num_nodes * sizeof(int));
    int head = 0, tail = 0;
    
    queue[tail++] = start;
    visited[start] = true;
    
    while (head < tail) {
        int block_end = (head + BLOCK_SIZE < tail) ? head + BLOCK_SIZE : tail;
        
        // Process a block of nodes
        for (int i = head; i < block_end; i++) {
            int node_id = queue[i];
            node_t *node = &g->nodes[node_id];
            
            process(node);
            
            // Add neighbors to queue
            for (int j = 0; j < node->num_neighbors; j++) {
                int neighbor = node->neighbors[j];
                if (!visited[neighbor]) {
                    visited[neighbor] = true;
                    queue[tail++] = neighbor;
                }
            }
        }
        
        head = block_end;
    }
}

Why this helps:

Process multiple nodes before moving to next level
Better temporal locality (reuse visited array in cache)
Amortize queue overhead

The Benchmark

Test: BFS on 500-node graph

Standard BFS:
  Cache misses: 2.8M
  Cycles: 18M

Blocked BFS (block size 64):
  Cache misses: 1.5M
  Cycles: 11M
  Speedup: 1.6×

Optimization 3: Compressed Sparse Row (CSR) Format

For very large sparse graphs, CSR format is even more compact:

typedef struct {
    int *row_ptr;     // row_ptr[i] = start of node i's neighbors
    int *col_idx;     // col_idx[j] = neighbor ID
    int num_nodes;
    int num_edges;
} csr_graph_t;

csr_graph_t *graph_to_csr(graph_t *g) {
    csr_graph_t *csr = malloc(sizeof(csr_graph_t));
    csr->num_nodes = g->num_nodes;
    csr->num_edges = /* total edges */;

    csr->row_ptr = malloc((g->num_nodes + 1) * sizeof(int));
    csr->col_idx = malloc(csr->num_edges * sizeof(int));

    int offset = 0;
    for (int i = 0; i < g->num_nodes; i++) {
        csr->row_ptr[i] = offset;
        for (int j = 0; j < g->nodes[i].num_neighbors; j++) {
            csr->col_idx[offset++] = g->nodes[i].neighbors[j];
        }
    }
    csr->row_ptr[g->num_nodes] = offset;

    return csr;
}

// Access neighbors of node i
void visit_neighbors(csr_graph_t *g, int node_id) {
    int start = g->row_ptr[node_id];
    int end = g->row_ptr[node_id + 1];

    for (int i = start; i < end; i++) {
        int neighbor = g->col_idx[i];
        // Process neighbor
    }
}

Memory layout:

row_ptr: [0, 3, 7, 10, ...]  (node 0 has neighbors at indices 0-2)
col_idx: [1, 2, 5, 0, 3, 4, 6, ...]  (actual neighbor IDs)

Advantages:

Minimal memory overhead (just two arrays)
Sequential access to neighbors (excellent prefetching)
Cache-friendly (all data is contiguous)

The Benchmark

Test: 10,000-node graph, 120,000 edges

Adjacency list (pointers):
  Memory: 2.4 MB
  Cache misses: 85M
  BFS time: 180 ms

CSR format:
  Memory: 0.96 MB (2.5× less)
  Cache misses: 18M (4.7× fewer)
  BFS time: 42 ms
  Speedup: 4.3×

Optimization 4: Node Reordering for Locality

If you can reorder node IDs, place connected nodes close together in memory.

Breadth-First Ordering

Assign node IDs in BFS order:

void reorder_bfs(graph_t *g, int start) {
    int *new_id = malloc(g->num_nodes * sizeof(int));
    int *old_id = malloc(g->num_nodes * sizeof(int));
    bool *visited = calloc(g->num_nodes, sizeof(bool));

    int next_id = 0;
    queue_t *q = queue_create();
    queue_push(q, start);
    visited[start] = true;

    while (!queue_empty(q)) {
        int node = queue_pop(q);
        new_id[node] = next_id;
        old_id[next_id] = node;
        next_id++;

        // Visit neighbors
        for (int i = 0; i < g->nodes[node].num_neighbors; i++) {
            int neighbor = g->nodes[node].neighbors[i];
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                queue_push(q, neighbor);
            }
        }
    }

    // Rebuild graph with new IDs
    // ...
}

Why this helps:

Nodes visited together are numbered sequentially
Better cache locality during traversal
Visited array accesses are more sequential

The Benchmark

Test: BFS on 500-node graph

Random node IDs:
  Cache misses: 1.5M
  Cycles: 11M

BFS-ordered node IDs:
  Cache misses: 0.8M
  Cycles: 7.5M
  Speedup: 1.5×

Optimization 5: Parallel Graph Traversal

On multi-core systems, we can parallelize BFS using level-synchronous approach.

Level-Synchronous BFS

Process each level in parallel:

void bfs_parallel(csr_graph_t *g, int start, int num_threads) {
    bool *visited = calloc(g->num_nodes, sizeof(bool));
    int *current_level = malloc(g->num_nodes * sizeof(int));
    int *next_level = malloc(g->num_nodes * sizeof(int));

    int current_size = 1;
    current_level[0] = start;
    visited[start] = true;

    while (current_size > 0) {
        atomic_int next_size = 0;

        // Process current level in parallel
        #pragma omp parallel for num_threads(num_threads)
        for (int i = 0; i < current_size; i++) {
            int node = current_level[i];
            int start = g->row_ptr[node];
            int end = g->row_ptr[node + 1];

            for (int j = start; j < end; j++) {
                int neighbor = g->col_idx[j];

                // Atomic check-and-set
                bool expected = false;
                if (__atomic_compare_exchange_n(&visited[neighbor], &expected, true,
                                                0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
                    int pos = __atomic_fetch_add(&next_size, 1, __ATOMIC_SEQ_CST);
                    next_level[pos] = neighbor;
                }
            }
        }

        // Swap levels
        int *temp = current_level;
        current_level = next_level;
        next_level = temp;
        current_size = next_size;
    }
}

The Benchmark

Test: BFS on 10,000-node graph (RISC-V 8-core @ 1.2 GHz)

Sequential BFS:
  Cycles: 120M
  Time: 100 ms

Parallel BFS (2 cores):
  Cycles: 68M
  Time: 57 ms
  Speedup: 1.75×

Parallel BFS (4 cores):
  Cycles: 38M
  Time: 32 ms
  Speedup: 3.1×

Parallel BFS (8 cores):
  Cycles: 24M
  Time: 20 ms
  Speedup: 5.0×

Not perfect scaling (8× speedup on 8 cores) due to:

Synchronization overhead (atomic operations)
Load imbalance (some levels have few nodes)
Cache coherence traffic

But still a significant improvement for large graphs.

Real-World Example: Linux Kernel’s Radix Tree for Page Cache

The Linux kernel uses a radix tree (a specialized graph structure) for the page cache.

The Problem

The kernel needs to map file offsets to physical pages:

Millions of pages per file
Sparse mapping (not all offsets have pages)
Fast lookup (O(log n) or better)

The Solution: Radix Tree

A 64-way tree where each level represents 6 bits of the offset:

#define RADIX_TREE_MAP_SHIFT 6
#define RADIX_TREE_MAP_SIZE (1 << RADIX_TREE_MAP_SHIFT)  // 64

struct radix_tree_node {
    void *slots[RADIX_TREE_MAP_SIZE];  // 64 pointers
    unsigned long tags[3][RADIX_TREE_MAP_SIZE / BITS_PER_LONG];
};

Why 64-way:

One node fits in one cache line (64 pointers × 8 bytes = 512 bytes ≈ 8 cache lines)
Shallow tree (depth ≤ 11 for 64-bit offsets)
Good balance between memory and cache misses

The Performance

Lookup in radix tree (depth 3):
  Cache misses: 3 (one per level)
  Cycles: ~50

Lookup in binary tree (depth 20):
  Cache misses: 20
  Cycles: ~300

Speedup: 6×

Putting It All Together: Optimized Network Discovery

Here’s the final optimized version combining all techniques:

typedef struct {
    int *row_ptr;
    int *col_idx;
    int num_nodes;
    int num_edges;
} network_graph_t;

void discover_network_optimized(network_graph_t *g, int start) {
    // Use bitmap for visited (cache-friendly)
    uint64_t *visited = calloc((g->num_nodes + 63) / 64, sizeof(uint64_t));

    // Use array-based queue (not linked list)
    int *queue = malloc(g->num_nodes * sizeof(int));
    int head = 0, tail = 0;

    queue[tail++] = start;
    visited[start / 64] |= (1UL << (start % 64));

    while (head < tail) {
        // Process in blocks for better cache reuse
        int block_end = (head + 64 < tail) ? head + 64 : tail;

        for (int i = head; i < block_end; i++) {
            int node = queue[i];
            process_device(node);

            // Sequential access to neighbors (CSR format)
            int start_idx = g->row_ptr[node];
            int end_idx = g->row_ptr[node + 1];

            for (int j = start_idx; j < end_idx; j++) {
                int neighbor = g->col_idx[j];
                uint64_t mask = 1UL << (neighbor % 64);
                int word = neighbor / 64;

                if (!(visited[word] & mask)) {
                    visited[word] |= mask;
                    queue[tail++] = neighbor;
                }
            }
        }

        head = block_end;
    }

    free(visited);
    free(queue);
}

Final Benchmark

Test: Network discovery, 500 switches, avg 12 connections

Original (adjacency list, linked queue):
  Cycles: 45M
  Cache misses: 8.5M
  Memory: 128 KB
  Time: 37.5 ms

Optimized (CSR, blocked BFS, bitmap):
  Cycles: 7.5M
  Cache misses: 0.8M
  Memory: 24 KB
  Time: 6.2 ms

Speedup: 6.0×
Cache miss reduction: 10.6×
Memory reduction: 5.3×

Summary

The cache miss explosion was tamed. Network discovery time dropped from 37.5 ms to 6.2 ms—a 6× improvement. Cache misses dropped from 8.5 million to 0.8 million—a 10.6× reduction. The graph traversal went from 17,000 cache misses per node to just 1,600.

Key insights:

Compact adjacency lists beat pointer-based lists. Storing all edges in one contiguous array eliminates pointer chasing and improves prefetching. 2.5× speedup.
CSR format is optimal for sparse graphs. Two arrays (row_ptr and col_idx) provide minimal memory overhead and excellent cache behavior. 4.3× speedup over pointer-based lists.
Blocked BFS improves temporal locality. Processing nodes in cache-sized blocks (64 nodes) reuses the visited array in cache. 1.6× speedup.
Node reordering matters. Assigning IDs in BFS order places connected nodes close in memory. 1.5× speedup.
Parallel BFS scales reasonably. Level-synchronous BFS with atomic operations achieved 5× speedup on 8 cores (62% efficiency).

The numbers from network discovery:

Compact adjacency list: 2.5× faster than pointers
CSR format: 4.3× faster, 2.5× less memory
Blocked BFS: 1.6× faster
BFS ordering: 1.5× faster
Combined: 6× faster, 10.6× fewer cache misses

Graph traversal is memory-bound. Focus on cache-friendly representations and access patterns.

Next chapter: Bloom filters and probabilistic data structures—trading accuracy for speed and memory.

Keyboard shortcuts

Data Structures in Practice