Chapter 10: B-Trees and Cache-Conscious Trees

Part III: Trees and Hierarchies

“The purpose of computing is insight, not numbers.” — Richard Hamming

The Database Mystery

The database was all in-memory, yet lookups were taking 12,000 cycles. For 1 million sensor readings on an IoT device with 64 KB of cache, the Red-Black tree implementation was too slow for real-time queries.

“Let’s try a B-tree,” I suggested during the performance review.

“Isn’t that just for disk-based databases?” the lead engineer asked. “We’re all in-memory. Why would we need a B-tree?”

The question was reasonable. B-trees were designed for disk access, where each node is a disk block. But the cache miss patterns looked suspiciously similar to disk I/O patterns—just 100× faster instead of 100,000× faster.

We implemented a B-tree anyway. The results surprised everyone:

$ perf stat -e cache-misses,cycles ./db_query_rbtree
  Performance counter stats:
    18,500,000 cache-misses
   120,000,000 cycles

$ perf stat -e cache-misses,cycles ./db_query_btree
  Performance counter stats:
     2,800,000 cache-misses
    18,000,000 cycles

The B-tree was 6.7× faster than the Red-Black tree. Cache misses dropped from 18.5 million to 2.8 million.

Why? The B-tree had only 3 levels versus the Red-Black tree’s 20 levels. Fewer levels = fewer cache misses.

The Problem with Binary Trees

In Chapter 9, we saw that binary search trees suffer from pointer-chasing. Every node is in a random memory location, so every traversal step is a cache miss.

But there’s a deeper problem: tree height.

For 1 million entries:

Binary tree height: log₂(1,000,000) ≈ 20 levels
Each lookup: 20 pointer dereferences
Cache misses: ~18-20 (almost every node is a miss)

Even if we could magically make every node cache-friendly, we’d still have 20 levels to traverse.

The insight: Most cache misses come from tree height, not from individual node access.

The solution: Reduce height by increasing the branching factor.

What Is a B-Tree?

A B-tree is like a binary search tree, but each node can have many children instead of just two.

Here’s a simple example with order 4 (max 3 keys per node):

                [40|80]
               /   |   \
         [10|20] [50|60] [90|100]

The root has 2 keys (40 and 80) and 3 children. Each child is also a node with multiple keys.

Key properties:

Each node contains up to M-1 keys (M is the “order”)
Each internal node has up to M children
All leaves are at the same depth (the tree is balanced)
Keys within each node are sorted

For our IoT database, we used order 64. That means:

Each node has up to 63 keys
Each internal node has up to 64 children
Tree height: log₆₄(1,000,000) ≈ 3 levels

Compare that to a binary tree’s 20 levels!

Why B-Trees Are Cache-Friendly

The magic of B-trees is that all the keys in a node are stored sequentially in memory.

Here’s the node structure I used:

#define BTREE_ORDER 64

typedef struct btree_node {
    int num_keys;                    // 4 bytes
    int keys[BTREE_ORDER - 1];       // 252 bytes (63 keys)
    void *values[BTREE_ORDER - 1];   // 504 bytes
    struct btree_node *children[BTREE_ORDER];  // 512 bytes
    // Total: ~1,272 bytes (fits in ~20 cache lines)
} btree_node_t;

When you access a node, you get all 63 keys in a contiguous array. You can binary search through them without any pointer-chasing:

int find_key(btree_node_t *node, int key) {
    // Binary search in sorted array (cache-friendly!)
    int left = 0, right = node->num_keys - 1;
    while (left <= right) {
        int mid = (left + right) / 2;
        if (node->keys[mid] == key) return mid;
        if (key < node->keys[mid]) right = mid - 1;
        else left = mid + 1;
    }
    return -1;  // Not found
}

Cache behavior:

First access to node: 1 cache miss (loads the node into cache)
Binary search within node: 0 additional cache misses (all keys are sequential)
Total: 1 cache miss per tree level

With only 3 levels, that’s only 3 cache misses per lookup!

B-Tree Search

void* btree_search(btree_node_t *root, int key) {
    btree_node_t *node = root;
    
    while (node) {
        // Binary search within node (cache-friendly)
        int i = 0;
        while (i < node->num_keys && key > node->keys[i]) {
            i++;
        }
        
        // Found?
        if (i < node->num_keys && key == node->keys[i]) {
            return node->values[i];
        }
        
        // Leaf node?
        if (!node->children[0]) {
            return NULL;  // Not found
        }
        
        // Descend to child (cache miss here)
        node = node->children[i];
    }
    
    return NULL;
}

Complexity:

Tree height: O(log_M N)
Search within node: O(log M)
Total: O(log M × log_M N) = O(log N)

Cache misses: O(log_M N) (one per level)

The Benchmark Results

I tested different B-tree orders on our IoT database with 1 million sensor readings:

Dataset: 1,000,000 entries, 10,000 random lookups

Red-Black tree:
  Height: 20 levels
  Cycles/lookup: 12,000
  Cache misses: 18.5

B-tree (order 16):
  Height: 5 levels
  Cycles/lookup: 3,200
  Cache misses: 4.8
  Speedup: 3.75×

B-tree (order 64):
  Height: 3 levels
  Cycles/lookup: 1,800
  Cache misses: 2.8
  Speedup: 6.7×

B-tree (order 256):
  Height: 2 levels
  Cycles/lookup: 1,200
  Cache misses: 1.9
  Speedup: 10×

The B-tree with order 64 was our sweet spot—6.7× faster than the Red-Black tree.

Why it works:

Fewer levels: 3 vs 20 means 3 cache misses vs 20
Sequential keys: Binary search within each node is cache-friendly
Amortized cost: The cost of searching within a node (log₆₄ ≈ 6 comparisons) is tiny compared to the cost of a cache miss (100 cycles)

Choosing B-Tree Order

Trade-off: Larger order → fewer levels, but more comparisons per node.

Optimal order: Fit node in one cache line (64 bytes).

Cache Line Analysis

Order 4 (3 keys):

struct btree_node {
    int num_keys;        // 4 bytes
    int keys[3];         // 12 bytes
    void *values[3];     // 24 bytes
    void *children[4];   // 32 bytes
    // Total: 72 bytes (2 cache lines)
};

Order 8 (7 keys):

struct btree_node {
    int num_keys;        // 4 bytes
    int keys[7];         // 28 bytes
    void *values[7];     // 56 bytes
    void *children[8];   // 64 bytes
    // Total: 152 bytes (3 cache lines)
};

Recommendation:

In-memory B-tree: Order 16-64 (balance height vs node size)
Disk-based B-tree: Order 128-512 (minimize disk seeks)

B-Tree Insertion

Challenge: Maintain balance (all leaves at same depth).

Strategy: Split full nodes.

Insertion Algorithm

void btree_insert(btree_node_t **root, int key, void *value) {
    btree_node_t *node = *root;

    // If root is full, split it
    if (node->num_keys == BTREE_ORDER - 1) {
        btree_node_t *new_root = create_node();
        new_root->children[0] = node;
        split_child(new_root, 0);
        *root = new_root;
    }

    insert_non_full(*root, key, value);
}

void insert_non_full(btree_node_t *node, int key, void *value) {
    int i = node->num_keys - 1;

    if (!node->children[0]) {  // Leaf node
        // Shift keys to make room
        while (i >= 0 && key < node->keys[i]) {
            node->keys[i + 1] = node->keys[i];
            node->values[i + 1] = node->values[i];
            i--;
        }
        node->keys[i + 1] = key;
        node->values[i + 1] = value;
        node->num_keys++;
    } else {  // Internal node
        // Find child to descend
        while (i >= 0 && key < node->keys[i]) {
            i--;
        }
        i++;

        // Split child if full
        if (node->children[i]->num_keys == BTREE_ORDER - 1) {
            split_child(node, i);
            if (key > node->keys[i]) i++;
        }

        insert_non_full(node->children[i], key, value);
    }
}

Node Splitting

Example (order 4, max 3 keys):

Before split:
  Node: [10|20|30]  (full)

After split:
  Left:  [10]
  Parent: [20]
  Right: [30]

Cost: O(M) to split (copy keys), but amortized O(1) (rare).

B+ Trees: Optimized for Range Queries

Problem with B-tree: Values scattered across all levels.

B+ tree: All values in leaves, internal nodes only store keys.

Structure:

Internal nodes (keys only):
                [40|80]
               /   |   \
Leaf nodes (keys + values):
  [10:v1|20:v2|30:v3] → [40:v4|50:v5|60:v6] → [80:v7|90:v8|100:v9]
   ↑                      ↑                      ↑
   └──────────────────────┴──────────────────────┘
         Linked list for range scans

Advantages:

Range queries: Scan linked list of leaves (sequential access)
Higher fanout: Internal nodes smaller (no values)
All data in leaves: Simpler code

Use case: Databases (MySQL InnoDB, PostgreSQL, SQLite).

Cache-Oblivious B-Trees

Problem: Optimal B-tree order depends on cache line size (64 bytes on x86, 128 bytes on some ARM).

Cache-oblivious B-tree: Adapts to any cache size without tuning.

Idea: Recursive layout (van Emde Boas layout).

Example (16 keys):

Memory layout:
[8] [4|12] [2|6|10|14] [1|3|5|7|9|11|13|15]
 ↑    ↑        ↑              ↑
Root  Level 1  Level 2        Leaves

Sequential in memory, but logically a tree

Advantage: Works well across different cache sizes.

Disadvantage: Complex implementation, harder to modify.

Real-World Example: SQLite B-Tree

Use case: Embedded database (browsers, mobile apps).

Design:

Page size: 4 KB (matches filesystem block size)
Order: ~340 (4 KB / 12 bytes per entry)
B+ tree: All data in leaves

Optimization: Page cache in memory.

Benchmark (1M records):

Lookup:
  In-memory:  1,200 cycles (3 levels, all in cache)
  On-disk:    8 ms (3 disk seeks)

Range scan (1000 records):
  In-memory:  180,000 cycles (sequential leaf scan)
  On-disk:    12 ms (sequential disk read)

Why B-tree for disk:

Minimize seeks: 3 seeks vs 20 (BST)
Sequential reads: Leaf nodes linked
Page-aligned: Each node = one disk block

Embedded Systems: Fixed-Size B-Trees

Challenge: No dynamic allocation in embedded systems.

Solution: Pre-allocate B-tree nodes in array.

#define MAX_NODES 1024
#define BTREE_ORDER 16

typedef struct {
    int num_keys;
    int keys[BTREE_ORDER - 1];
    void *values[BTREE_ORDER - 1];
    uint16_t children[BTREE_ORDER];  // Indices, not pointers
} btree_node_t;

typedef struct {
    btree_node_t nodes[MAX_NODES];
    uint16_t root;
    uint16_t free_list;
} btree_t;

Advantages:

No malloc: Predictable memory usage
Cache-friendly: Nodes in contiguous array
Indices instead of pointers: Saves memory (2 bytes vs 8 bytes)

Disadvantage: Fixed capacity (MAX_NODES).

Guidelines

Use B-tree when:

✅ Large datasets (> 10,000 entries)
✅ Frequent insertions/deletions
✅ Range queries needed
✅ Disk/SSD storage

Use B+ tree when:

✅ Database indexing
✅ Range scans common
✅ All data can be in leaves

Use BST when:

✅ Small datasets (< 1,000 entries)
✅ Simple implementation needed

Use sorted array when:

✅ Read-only or rare updates
✅ Dataset fits in cache

Optimization Techniques

1. Bulk Loading

Problem: Inserting sorted data one-by-one is slow.

Solution: Build B-tree bottom-up.

btree_t* bulk_load(int *keys, void **values, int n) {
    // Sort input
    qsort_pairs(keys, values, n);

    // Build leaves
    int num_leaves = (n + BTREE_ORDER - 2) / (BTREE_ORDER - 1);
    btree_node_t *leaves = build_leaves(keys, values, n);

    // Build internal levels bottom-up
    while (num_leaves > 1) {
        leaves = build_level(leaves, num_leaves);
        num_leaves = (num_leaves + BTREE_ORDER - 1) / BTREE_ORDER;
    }

    return leaves;  // Root
}

Speedup: 10-100× faster than individual inserts.

2. Prefix Compression

Observation: Keys often share prefixes (e.g., URLs, file paths).

Optimization: Store common prefix once per node.

struct compressed_node {
    char prefix[32];         // Common prefix
    int prefix_len;
    char suffixes[BTREE_ORDER][32];  // Only unique parts
};

Savings: 50-80% memory reduction for string keys.

3. SIMD Search

Idea: Use SIMD to compare key against multiple node keys in parallel.

#include <immintrin.h>

int simd_search(int *keys, int n, int target) {
    __m256i target_vec = _mm256_set1_epi32(target);

    for (int i = 0; i < n; i += 8) {
        __m256i keys_vec = _mm256_loadu_si256((__m256i*)&keys[i]);
        __m256i cmp = _mm256_cmpeq_epi32(keys_vec, target_vec);
        int mask = _mm256_movemask_epi8(cmp);
        if (mask) {
            return i + __builtin_ctz(mask) / 4;
        }
    }
    return -1;
}

Speedup: 2-3× for large nodes (order 64+).

Summary

The database mystery was solved. The B-tree delivered 6.7× faster queries than the Red-Black tree, dropping lookup time from 12,000 to 1,800 cycles. Cache misses fell from 18.5 to 2.8 per lookup. The IoT device could now handle real-time sensor queries with ease. The “disk-only” data structure turned out to be perfect for in-memory cache optimization.

Key insights:

Tree height matters more than node complexity. A B-tree with 3 levels beats a binary tree with 20 levels, even though searching within a B-tree node takes more comparisons.
Sequential memory layout is king. Storing all keys in a node sequentially means binary search within the node is cache-friendly. One cache miss loads the entire node.
B-trees aren’t just for disk. The textbooks teach B-trees for databases on disk, but they’re equally valuable for in-memory data structures when the dataset is large.
Order matters. Too small (order 4-8) and you don’t reduce height enough. Too large (order 256+) and nodes don’t fit in cache. Order 16-64 is the sweet spot for in-memory B-trees.
B+ trees are better for range queries. By storing all data in leaves and linking them, you can scan ranges sequentially without traversing the tree.

The numbers from our IoT database:

Red-Black tree: 12,000 cycles/lookup, 18.5 cache misses
B-tree (order 64): 1,800 cycles/lookup, 2.8 cache misses
Speedup: 6.7×

Next chapter: Tries and radix trees for prefix matching and string keys.

Keyboard shortcuts

Data Structures in Practice