Chapter 18: Memory Allocators

Part V: Parallelism & Low-Level Optimization


"Memory allocation is the dark matter of performance." — Anonymous

The malloc That Wasn't Free

A high-frequency trading firm was debugging mysterious latency spikes. Their code was tight—no unnecessary copies, no blocking I/O, carefully tuned algorithms. Yet every few seconds, a trade would take 10x longer than expected.

The culprit: malloc. Under heavy allocation load, the default glibc allocator would occasionally need to request memory from the kernel via mmap, causing a 50-100 μs stall. In a system where microseconds mattered, this was catastrophic.

Switching to jemalloc with pre-warmed arenas eliminated the spikes entirely. The fastest allocation is the one you don't make—but when you must allocate, the allocator matters enormously.

How malloc Works

The Allocation Problem

Memory allocation seems simple: give me N bytes, return a pointer. But the allocator must solve several hard problems:

  1. Speed: Allocation should be O(1) or close to it
  2. Fragmentation: Avoid wasting memory on gaps
  3. Thread safety: Multiple threads allocating simultaneously
  4. Cache efficiency: Keep related allocations together

Traditional malloc (glibc ptmalloc2)

Heap Structure:
┌─────────────────────────────────────────────────┐
│ Chunk │ Chunk │ Free │ Chunk │ Free │ Chunk │...│
│  16B  │  32B  │ 64B  │ 128B  │ 48B  │  24B  │   │
└─────────────────────────────────────────────────┘
         ↑
    Free list links free chunks together

Free list: Freed memory is linked together. Allocation searches for a suitable chunk.

Coalescing: Adjacent free chunks are merged to reduce fragmentation.

Size classes: Small allocations use bins of fixed sizes (16, 32, 48, 64... bytes).

System Call Overhead

User Space:              Kernel Space:
┌──────────────┐         ┌──────────────┐
│   malloc()   │ ──brk──→│ Extend heap  │  (small allocs)
│              │         │              │
│   malloc()   │ ─mmap──→│ Map pages    │  (large allocs, >128KB)
│              │         │              │
│   free()     │ ─munmap→│ Unmap pages  │  (large frees)
└──────────────┘         └──────────────┘

The problem: System calls are expensive (1-10 μs). Good allocators minimize them by:

  • Caching freed memory for reuse
  • Requesting memory in large chunks
  • Delaying returns to the OS

Modern Allocators

jemalloc (Facebook/Meta)

Originally developed for FreeBSD, now widely used (Firefox, Redis, Facebook services).

Key innovations:

  • Thread-local caches: Each thread has its own cache, avoiding lock contention
  • Arenas: Multiple independent heaps reduce contention
  • Size classes: Carefully chosen to minimize internal fragmentation
  • Huge page support: Reduces TLB misses for large allocations

tcmalloc (Google)

Google's thread-caching malloc, used in most Google services.

Architecture:

┌─────────────────────────────────────────────────┐
│ Thread 1 Cache │ Thread 2 Cache │ Thread N Cache│
└───────┬────────┴───────┬────────┴───────┬───────┘
        │                │                │
        └────────────────┼────────────────┘
                         ↓
              ┌─────────────────────┐
              │   Central Free List │
              └─────────────────────┘
                         ↓
              ┌─────────────────────┐
              │    Page Heap        │
              └─────────────────────┘

Small objects (< 256KB): Served from thread-local cache, no locking. Large objects: Served from central page heap with locking.

mimalloc (Microsoft)

Microsoft's allocator, designed for maximum performance.

Key features:

  • Free list sharding: Multiple free lists per size class reduce contention
  • Eager page reuse: Freed pages are immediately available for reuse
  • Segment-based: Memory organized in segments for better locality

Performance Comparison

AllocatorSingle-threadMulti-threadMemory OverheadLatency Consistency
glibc ptmalloc2BaselinePoor (lock contention)MediumVariable
jemalloc1.1-1.3x2-5xLowGood
tcmalloc1.1-1.2x2-4xLowGood
mimalloc1.2-1.5x3-6xVery LowExcellent

Speedup relative to glibc on typical workloads

Memory Pools and Custom Allocators

When general-purpose allocators aren't enough:

Object Pools

Pre-allocate a fixed number of same-sized objects:

template<typename T, size_t N>
class ObjectPool {
    std::array<T, N> storage;
    std::array<T*, N> free_list;
    size_t free_count = N;

public:
    T* allocate() {
        if (free_count == 0) return nullptr;
        return free_list[--free_count];
    }

    void deallocate(T* ptr) {
        free_list[free_count++] = ptr;
    }
};

Use case: Game entities, network connections, fixed-size messages.

Arena/Bump Allocators

Allocate sequentially, free all at once:

class Arena {
    char* buffer;
    size_t offset = 0;
    size_t capacity;

public:
    void* allocate(size_t size) {
        size = align_up(size, 8);
        if (offset + size > capacity) return nullptr;
        void* ptr = buffer + offset;
        offset += size;
        return ptr;
    }

    void reset() { offset = 0; }  // "Free" everything
};

Use case: Per-frame game allocations, request-scoped web server data, compiler passes.

Stack Allocators

LIFO allocation pattern:

class StackAllocator {
    char* buffer;
    size_t offset = 0;

public:
    void* allocate(size_t size);
    void deallocate(void* ptr);  // Must be most recent allocation
};

Benchmarking Allocators

Methodology

  1. Use realistic workloads: Synthetic benchmarks often don't reflect real patterns
  2. Measure latency distribution: Mean is less important than P99/P999
  3. Test under contention: Single-threaded benchmarks miss the point
  4. Monitor memory usage: Fast but memory-hungry isn't always better

Tools

# heaptrack: Allocation profiling
heaptrack ./program
heaptrack_gui heaptrack.program.*.gz

# Valgrind massif: Heap profiling
valgrind --tool=massif ./program
ms_print massif.out.*

# perf for allocation-related events
perf stat -e page-faults,minor-faults,major-faults ./program

Switching Allocators

Most modern allocators can be used via LD_PRELOAD:

# Use jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so ./program

# Use tcmalloc
LD_PRELOAD=/usr/lib/libtcmalloc.so ./program

# Use mimalloc
LD_PRELOAD=/usr/lib/libmimalloc.so ./program

Embedded and Real-Time Considerations

Deterministic Allocation

Real-time systems need bounded allocation time. Solutions:

  • Static allocation: Allocate everything at startup
  • Pool allocators: Fixed-size pools with O(1) allocation
  • TLSF (Two-Level Segregated Fit): O(1) general-purpose allocator

Memory Budget

Embedded systems have fixed memory. Strategies:

  • Pre-calculate maximum memory needs
  • Use compile-time allocation where possible
  • Implement memory monitoring and alerts

Summary

  • Default malloc is often a bottleneck in multi-threaded applications
  • Modern allocators (jemalloc, tcmalloc, mimalloc) provide 2-6x better multi-threaded performance
  • Custom allocators (pools, arenas) can provide 10-100x speedup for specific patterns
  • Measure before switching: Profile allocation patterns to understand your needs
  • Consider latency, not just throughput: P99 latency often matters more than average