Chapter 18: Memory Allocators

Part V: Parallelism & Low-Level Optimization

"Memory allocation is the dark matter of performance." — Anonymous

The malloc That Wasn't Free

A high-frequency trading firm was debugging mysterious latency spikes. Their code was tight—no unnecessary copies, no blocking I/O, carefully tuned algorithms. Yet every few seconds, a trade would take 10x longer than expected.

The culprit: malloc. Under heavy allocation load, the default glibc allocator would occasionally need to request memory from the kernel via mmap, causing a 50-100 μs stall. In a system where microseconds mattered, this was catastrophic.

Switching to jemalloc with pre-warmed arenas eliminated the spikes entirely. The fastest allocation is the one you don't make—but when you must allocate, the allocator matters enormously.

How malloc Works

The Allocation Problem

Memory allocation seems simple: give me N bytes, return a pointer. But the allocator must solve several hard problems:

Speed: Allocation should be O(1) or close to it
Fragmentation: Avoid wasting memory on gaps
Thread safety: Multiple threads allocating simultaneously
Cache efficiency: Keep related allocations together

Traditional malloc (glibc ptmalloc2)

Heap Structure:
┌─────────────────────────────────────────────────┐
│ Chunk │ Chunk │ Free │ Chunk │ Free │ Chunk │...│
│  16B  │  32B  │ 64B  │ 128B  │ 48B  │  24B  │   │
└─────────────────────────────────────────────────┘
         ↑
    Free list links free chunks together

Free list: Freed memory is linked together. Allocation searches for a suitable chunk.

Coalescing: Adjacent free chunks are merged to reduce fragmentation.

Size classes: Small allocations use bins of fixed sizes (16, 32, 48, 64... bytes).

System Call Overhead

User Space:              Kernel Space:
┌──────────────┐         ┌──────────────┐
│   malloc()   │ ──brk──→│ Extend heap  │  (small allocs)
│              │         │              │
│   malloc()   │ ─mmap──→│ Map pages    │  (large allocs, >128KB)
│              │         │              │
│   free()     │ ─munmap→│ Unmap pages  │  (large frees)
└──────────────┘         └──────────────┘

The problem: System calls are expensive (1-10 μs). Good allocators minimize them by:

Caching freed memory for reuse
Requesting memory in large chunks
Delaying returns to the OS

Modern Allocators

jemalloc (Facebook/Meta)

Originally developed for FreeBSD, now widely used (Firefox, Redis, Facebook services).

Key innovations:

Thread-local caches: Each thread has its own cache, avoiding lock contention
Arenas: Multiple independent heaps reduce contention
Size classes: Carefully chosen to minimize internal fragmentation
Huge page support: Reduces TLB misses for large allocations

tcmalloc (Google)

Google's thread-caching malloc, used in most Google services.

Architecture:

┌─────────────────────────────────────────────────┐
│ Thread 1 Cache │ Thread 2 Cache │ Thread N Cache│
└───────┬────────┴───────┬────────┴───────┬───────┘
        │                │                │
        └────────────────┼────────────────┘
                         ↓
              ┌─────────────────────┐
              │   Central Free List │
              └─────────────────────┘
                         ↓
              ┌─────────────────────┐
              │    Page Heap        │
              └─────────────────────┘

Small objects (< 256KB): Served from thread-local cache, no locking. Large objects: Served from central page heap with locking.

mimalloc (Microsoft)

Microsoft's allocator, designed for maximum performance.

Key features:

Free list sharding: Multiple free lists per size class reduce contention
Eager page reuse: Freed pages are immediately available for reuse
Segment-based: Memory organized in segments for better locality

Performance Comparison

Allocator	Single-thread	Multi-thread	Memory Overhead	Latency Consistency
glibc ptmalloc2	Baseline	Poor (lock contention)	Medium	Variable
jemalloc	1.1-1.3x	2-5x	Low	Good
tcmalloc	1.1-1.2x	2-4x	Low	Good
mimalloc	1.2-1.5x	3-6x	Very Low	Excellent

Speedup relative to glibc on typical workloads

Memory Pools and Custom Allocators

When general-purpose allocators aren't enough:

Object Pools

Pre-allocate a fixed number of same-sized objects:

template<typename T, size_t N>
class ObjectPool {
    std::array<T, N> storage;
    std::array<T*, N> free_list;
    size_t free_count = N;

public:
    T* allocate() {
        if (free_count == 0) return nullptr;
        return free_list[--free_count];
    }

    void deallocate(T* ptr) {
        free_list[free_count++] = ptr;
    }
};

Use case: Game entities, network connections, fixed-size messages.

Arena/Bump Allocators

Allocate sequentially, free all at once:

class Arena {
    char* buffer;
    size_t offset = 0;
    size_t capacity;

public:
    void* allocate(size_t size) {
        size = align_up(size, 8);
        if (offset + size > capacity) return nullptr;
        void* ptr = buffer + offset;
        offset += size;
        return ptr;
    }

    void reset() { offset = 0; }  // "Free" everything
};

Use case: Per-frame game allocations, request-scoped web server data, compiler passes.

Stack Allocators

LIFO allocation pattern:

class StackAllocator {
    char* buffer;
    size_t offset = 0;

public:
    void* allocate(size_t size);
    void deallocate(void* ptr);  // Must be most recent allocation
};

Benchmarking Allocators

Methodology

Use realistic workloads: Synthetic benchmarks often don't reflect real patterns
Measure latency distribution: Mean is less important than P99/P999
Test under contention: Single-threaded benchmarks miss the point
Monitor memory usage: Fast but memory-hungry isn't always better

Tools

# heaptrack: Allocation profiling
heaptrack ./program
heaptrack_gui heaptrack.program.*.gz

# Valgrind massif: Heap profiling
valgrind --tool=massif ./program
ms_print massif.out.*

# perf for allocation-related events
perf stat -e page-faults,minor-faults,major-faults ./program

Switching Allocators

Most modern allocators can be used via LD_PRELOAD:

# Use jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so ./program

# Use tcmalloc
LD_PRELOAD=/usr/lib/libtcmalloc.so ./program

# Use mimalloc
LD_PRELOAD=/usr/lib/libmimalloc.so ./program

Embedded and Real-Time Considerations

Deterministic Allocation

Real-time systems need bounded allocation time. Solutions:

Static allocation: Allocate everything at startup
Pool allocators: Fixed-size pools with O(1) allocation
TLSF (Two-Level Segregated Fit): O(1) general-purpose allocator

Memory Budget

Embedded systems have fixed memory. Strategies:

Pre-calculate maximum memory needs
Use compile-time allocation where possible
Implement memory monitoring and alerts

Summary

Default malloc is often a bottleneck in multi-threaded applications
Modern allocators (jemalloc, tcmalloc, mimalloc) provide 2-6x better multi-threaded performance
Custom allocators (pools, arenas) can provide 10-100x speedup for specific patterns
Measure before switching: Profile allocation patterns to understand your needs
Consider latency, not just throughput: P99 latency often matters more than average

Performance and Benchmarking