Chapter 18: Memory Allocators
Part V: Parallelism & Low-Level Optimization
"Memory allocation is the dark matter of performance." — Anonymous
The malloc That Wasn't Free
A high-frequency trading firm was debugging mysterious latency spikes. Their code was tight—no unnecessary copies, no blocking I/O, carefully tuned algorithms. Yet every few seconds, a trade would take 10x longer than expected.
The culprit: malloc. Under heavy allocation load, the default glibc allocator would occasionally need to request memory from the kernel via mmap, causing a 50-100 μs stall. In a system where microseconds mattered, this was catastrophic.
Switching to jemalloc with pre-warmed arenas eliminated the spikes entirely. The fastest allocation is the one you don't make—but when you must allocate, the allocator matters enormously.
How malloc Works
The Allocation Problem
Memory allocation seems simple: give me N bytes, return a pointer. But the allocator must solve several hard problems:
- Speed: Allocation should be O(1) or close to it
- Fragmentation: Avoid wasting memory on gaps
- Thread safety: Multiple threads allocating simultaneously
- Cache efficiency: Keep related allocations together
Traditional malloc (glibc ptmalloc2)
Heap Structure:
┌─────────────────────────────────────────────────┐
│ Chunk │ Chunk │ Free │ Chunk │ Free │ Chunk │...│
│ 16B │ 32B │ 64B │ 128B │ 48B │ 24B │ │
└─────────────────────────────────────────────────┘
↑
Free list links free chunks together
Free list: Freed memory is linked together. Allocation searches for a suitable chunk.
Coalescing: Adjacent free chunks are merged to reduce fragmentation.
Size classes: Small allocations use bins of fixed sizes (16, 32, 48, 64... bytes).
System Call Overhead
User Space: Kernel Space:
┌──────────────┐ ┌──────────────┐
│ malloc() │ ──brk──→│ Extend heap │ (small allocs)
│ │ │ │
│ malloc() │ ─mmap──→│ Map pages │ (large allocs, >128KB)
│ │ │ │
│ free() │ ─munmap→│ Unmap pages │ (large frees)
└──────────────┘ └──────────────┘
The problem: System calls are expensive (1-10 μs). Good allocators minimize them by:
- Caching freed memory for reuse
- Requesting memory in large chunks
- Delaying returns to the OS
Modern Allocators
jemalloc (Facebook/Meta)
Originally developed for FreeBSD, now widely used (Firefox, Redis, Facebook services).
Key innovations:
- Thread-local caches: Each thread has its own cache, avoiding lock contention
- Arenas: Multiple independent heaps reduce contention
- Size classes: Carefully chosen to minimize internal fragmentation
- Huge page support: Reduces TLB misses for large allocations
tcmalloc (Google)
Google's thread-caching malloc, used in most Google services.
Architecture:
┌─────────────────────────────────────────────────┐
│ Thread 1 Cache │ Thread 2 Cache │ Thread N Cache│
└───────┬────────┴───────┬────────┴───────┬───────┘
│ │ │
└────────────────┼────────────────┘
↓
┌─────────────────────┐
│ Central Free List │
└─────────────────────┘
↓
┌─────────────────────┐
│ Page Heap │
└─────────────────────┘
Small objects (< 256KB): Served from thread-local cache, no locking. Large objects: Served from central page heap with locking.
mimalloc (Microsoft)
Microsoft's allocator, designed for maximum performance.
Key features:
- Free list sharding: Multiple free lists per size class reduce contention
- Eager page reuse: Freed pages are immediately available for reuse
- Segment-based: Memory organized in segments for better locality
Performance Comparison
| Allocator | Single-thread | Multi-thread | Memory Overhead | Latency Consistency |
|---|---|---|---|---|
| glibc ptmalloc2 | Baseline | Poor (lock contention) | Medium | Variable |
| jemalloc | 1.1-1.3x | 2-5x | Low | Good |
| tcmalloc | 1.1-1.2x | 2-4x | Low | Good |
| mimalloc | 1.2-1.5x | 3-6x | Very Low | Excellent |
Speedup relative to glibc on typical workloads
Memory Pools and Custom Allocators
When general-purpose allocators aren't enough:
Object Pools
Pre-allocate a fixed number of same-sized objects:
template<typename T, size_t N>
class ObjectPool {
std::array<T, N> storage;
std::array<T*, N> free_list;
size_t free_count = N;
public:
T* allocate() {
if (free_count == 0) return nullptr;
return free_list[--free_count];
}
void deallocate(T* ptr) {
free_list[free_count++] = ptr;
}
};
Use case: Game entities, network connections, fixed-size messages.
Arena/Bump Allocators
Allocate sequentially, free all at once:
class Arena {
char* buffer;
size_t offset = 0;
size_t capacity;
public:
void* allocate(size_t size) {
size = align_up(size, 8);
if (offset + size > capacity) return nullptr;
void* ptr = buffer + offset;
offset += size;
return ptr;
}
void reset() { offset = 0; } // "Free" everything
};
Use case: Per-frame game allocations, request-scoped web server data, compiler passes.
Stack Allocators
LIFO allocation pattern:
class StackAllocator {
char* buffer;
size_t offset = 0;
public:
void* allocate(size_t size);
void deallocate(void* ptr); // Must be most recent allocation
};
Benchmarking Allocators
Methodology
- Use realistic workloads: Synthetic benchmarks often don't reflect real patterns
- Measure latency distribution: Mean is less important than P99/P999
- Test under contention: Single-threaded benchmarks miss the point
- Monitor memory usage: Fast but memory-hungry isn't always better
Tools
# heaptrack: Allocation profiling
heaptrack ./program
heaptrack_gui heaptrack.program.*.gz
# Valgrind massif: Heap profiling
valgrind --tool=massif ./program
ms_print massif.out.*
# perf for allocation-related events
perf stat -e page-faults,minor-faults,major-faults ./program
Switching Allocators
Most modern allocators can be used via LD_PRELOAD:
# Use jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so ./program
# Use tcmalloc
LD_PRELOAD=/usr/lib/libtcmalloc.so ./program
# Use mimalloc
LD_PRELOAD=/usr/lib/libmimalloc.so ./program
Embedded and Real-Time Considerations
Deterministic Allocation
Real-time systems need bounded allocation time. Solutions:
- Static allocation: Allocate everything at startup
- Pool allocators: Fixed-size pools with O(1) allocation
- TLSF (Two-Level Segregated Fit): O(1) general-purpose allocator
Memory Budget
Embedded systems have fixed memory. Strategies:
- Pre-calculate maximum memory needs
- Use compile-time allocation where possible
- Implement memory monitoring and alerts
Summary
- Default malloc is often a bottleneck in multi-threaded applications
- Modern allocators (jemalloc, tcmalloc, mimalloc) provide 2-6x better multi-threaded performance
- Custom allocators (pools, arenas) can provide 10-100x speedup for specific patterns
- Measure before switching: Profile allocation patterns to understand your needs
- Consider latency, not just throughput: P99 latency often matters more than average