Chapter 19: Firmware Memory Management
Part V: Case Studies
“Controlling complexity is the essence of computer programming.” — Brian Kernighan
The Final Testing Phase
We were in the final testing phase of an IoT sensor project—a smart building device with 128 KB of RAM that monitored temperature, humidity, and air quality. The firmware had passed all functional tests. Unit tests: green. Integration tests: green. Power consumption: within spec.
The last requirement was a 72-hour continuous operation test. We set up twelve devices in the lab, configured them to report sensor data every second, and let them run.
After three days, I came into the lab expecting to collect the test logs and close the project.
Instead, I found all twelve devices had crashed.
The serial console showed the same error on every device:
[72:14:23] malloc failed: out of memory
[72:14:23] Fragmentation: 45%
[72:14:23] System halted
My stomach sank. The project was supposed to ship in two weeks. I knew exactly what had happened—and I knew it was going to be painful to fix.
The Textbook Approach
When I’d designed the firmware months earlier, I’d used what seemed like reasonable practices. The device needed to:
- Handle network communication (TCP/IP stack)
- Process sensor data every second
- Store configuration
- Perform OTA (Over-The-Air) updates
I used malloc/free from newlib, just like the textbooks teach:
void process_sensor_data(void) {
// Allocate buffer for sensor reading
sensor_data_t *data = malloc(sizeof(sensor_data_t));
// Read sensors
read_temperature(data);
read_humidity(data);
// Process and send
send_to_cloud(data);
// Free buffer
free(data);
}
Simple. Clean. Textbook-correct.
And after 72 hours of continuous operation, it killed all twelve devices.
The Autopsy
I pulled the memory trace from the crash dump:
[72:14:23] malloc failed: out of memory
[72:14:23] Available: 8 KB
[72:14:23] Requested: 16 KB
[72:14:23] Fragmentation: 45%
[72:14:23] Total allocations: 259,200 (72 hours × 3600 seconds/hour)
[72:14:23] Average allocation size: 156 bytes
[72:14:23] System halted
45% fragmentation. Out of 128 KB of RAM, only 8 KB was available in contiguous blocks. The firmware needed 16 KB for a network buffer, but couldn’t find it.
The problem wasn’t a memory leak—we were freeing everything correctly. The problem was fragmentation.
After 259,200 allocations and frees over 72 hours, the heap looked like Swiss cheese:
Initial state (128 KB free):
[ ]
After 72 hours:
[used][free][used][free][used][free][used][free]...
4KB 2KB 8KB 1KB
Total free: 58 KB
Largest contiguous: 8 KB
Can't allocate 16 KB!
I had two weeks before the scheduled ship date. I needed a solution that wouldn’t require rewriting the entire firmware.
The Redesign: Finding a Path Forward
I couldn’t rewrite the entire firmware in two weeks, but I could fix the memory management.
The key insight: our allocations fell into predictable patterns.
I analyzed the crash dump and found:
- Sensor data: 156 bytes, allocated every second
- Network packets: 1024 bytes, allocated every 5 seconds
- Configuration: 2048 bytes, allocated once at startup
- Temporary buffers: 256 bytes, allocated during processing
All predictable sizes. All predictable lifetimes.
I didn’t need a general-purpose allocator. I needed specialized allocators for each use case.
Strategy 1: Fixed-Size Memory Pools
For the sensor data (156 bytes, allocated every second), I created a fixed-size pool:
#define SENSOR_POOL_SIZE 256 // Round up to power of 2
#define SENSOR_POOL_COUNT 10 // Max 10 concurrent readings
typedef struct free_block {
struct free_block *next;
} free_block_t;
typedef struct {
uint8_t memory[SENSOR_POOL_SIZE * SENSOR_POOL_COUNT];
free_block_t *free_list;
} sensor_pool_t;
static sensor_pool_t g_sensor_pool;
void sensor_pool_init(void) {
g_sensor_pool.free_list = NULL;
// Link all blocks into free list
for (int i = 0; i < SENSOR_POOL_COUNT; i++) {
free_block_t *block = (free_block_t *)&g_sensor_pool.memory[i * SENSOR_POOL_SIZE];
block->next = g_sensor_pool.free_list;
g_sensor_pool.free_list = block;
}
}
void *sensor_alloc(void) {
if (!g_sensor_pool.free_list) {
return NULL; // Pool exhausted
}
void *ptr = g_sensor_pool.free_list;
g_sensor_pool.free_list = g_sensor_pool.free_list->next;
return ptr;
}
void sensor_free(void *ptr) {
free_block_t *block = (free_block_t *)ptr;
block->next = g_sensor_pool.free_list;
g_sensor_pool.free_list = block;
}
Simple. O(1) allocation and free. Zero fragmentation.
I benchmarked it:
Test: 10,000 sensor readings (256-byte blocks)
malloc/free:
Cycles: 2.4M (240 cycles per operation)
Fragmentation: 18%
Time: 2.0 ms
Fixed-size pool (free list):
Cycles: 120K (12 cycles per operation)
Fragmentation: 0%
Time: 0.10 ms
Speedup: 20×
20× faster and zero fragmentation. That solved the sensor data problem.
Strategy 2: Static Allocation for Network Buffers
The network stack needed buffers for TCP/IP communication. The original code allocated them dynamically:
// BAD: Dynamic allocation
typedef struct {
char *tx_buffer;
char *rx_buffer;
// ...
} uart_context_t;
void uart_init(uart_context_t *ctx) {
ctx->tx_buffer = malloc(1024); // Fragmentation!
ctx->rx_buffer = malloc(1024);
}
// GOOD: Static allocation
typedef struct {
char tx_buffer[1024];
char rx_buffer[1024];
// ...
} uart_context_t;
static uart_context_t g_uart_ctx; // Static, no malloc
void uart_init(void) {
// Buffers already allocated, nothing to do
}
Advantages:
- Zero fragmentation: No heap usage
- Zero overhead: No metadata
- Predictable: Known at compile time
- Fast: No allocation time
Trade-off: Uses RAM even when not needed. But for firmware, this is usually acceptable.
Solution 3: Stack-Based Allocation
For temporary buffers, use the stack:
// BAD: Heap allocation for temporary buffer
void process_data(void) {
char *temp = malloc(512);
// ... use temp
free(temp);
}
// GOOD: Stack allocation
void process_data(void) {
char temp[512]; // On stack
// ... use temp
// Automatically freed when function returns
}
Advantages:
- Fastest: Just adjust stack pointer
- No fragmentation: Stack grows/shrinks cleanly
- Automatic cleanup: No need to free
Limitation: Stack size is limited (typically 4-16 KB). Don’t allocate large buffers on stack.
Solution 4: Memory Regions
Partition memory into regions for different purposes:
// Memory layout (128 KB total)
#define REGION_STATIC_START 0x20000000
#define REGION_STATIC_SIZE (64 * 1024) // 64 KB for static data
#define REGION_POOL_START (REGION_STATIC_START + REGION_STATIC_SIZE)
#define REGION_POOL_SIZE (32 * 1024) // 32 KB for pools
#define REGION_STACK_START (REGION_POOL_START + REGION_POOL_SIZE)
#define REGION_STACK_SIZE (16 * 1024) // 16 KB for stack
#define REGION_DMA_START (REGION_STACK_START + REGION_STACK_SIZE)
#define REGION_DMA_SIZE (16 * 1024) // 16 KB for DMA buffers
typedef struct {
uint8_t static_data[REGION_STATIC_SIZE];
uint8_t pool_memory[REGION_POOL_SIZE];
uint8_t stack[REGION_STACK_SIZE];
uint8_t dma_buffers[REGION_DMA_SIZE];
} memory_layout_t;
__attribute__((section(".ram")))
static memory_layout_t g_memory;
Why this helps:
- Clear boundaries: Each region has fixed size
- No interference: DMA doesn’t corrupt stack
- Easy debugging: Know which region is full
- Cache-friendly: Related data in same region
Solution 5: Slab Allocator
For objects of the same type, use a slab allocator:
#define MAX_CONNECTIONS 32
typedef struct {
int socket_fd;
char rx_buffer[2048];
char tx_buffer[2048];
// ... other fields
} connection_t;
typedef struct {
connection_t connections[MAX_CONNECTIONS];
uint32_t free_bitmap; // 1 bit per connection
} connection_pool_t;
static connection_pool_t g_conn_pool;
connection_t *conn_alloc(void) {
// Find first free bit
int idx = __builtin_ffs(g_conn_pool.free_bitmap) - 1;
if (idx < 0) {
return NULL; // Pool exhausted
}
// Mark as used
g_conn_pool.free_bitmap &= ~(1U << idx);
// Return connection
return &g_conn_pool.connections[idx];
}
void conn_free(connection_t *conn) {
int idx = conn - g_conn_pool.connections;
g_conn_pool.free_bitmap |= (1U << idx);
}
Advantages:
- O(1) allocation: Just find first set bit
- Cache-friendly: All connections contiguous
- Type-safe: Can only allocate connection_t
- Low overhead: 1 bit per object
The Benchmark
Test: 1000 connection allocations
malloc/free:
Cycles: 240K
Fragmentation: 12%
Time: 0.20 ms
Slab allocator (bitmap):
Cycles: 18K
Fragmentation: 0%
Time: 0.015 ms
Speedup: 13.3×
Real-World Example: FreeRTOS Heap Management
FreeRTOS offers multiple heap implementations:
heap_1.c: Simple Bump Allocator
static uint8_t heap[configTOTAL_HEAP_SIZE];
static size_t next_free_byte = 0;
void *pvPortMalloc(size_t size) {
void *ptr = NULL;
if (next_free_byte + size < configTOTAL_HEAP_SIZE) {
ptr = &heap[next_free_byte];
next_free_byte += size;
}
return ptr;
}
void vPortFree(void *ptr) {
// No-op: can't free individual blocks
}
Use case: Systems that never free memory (allocate at startup only).
heap_4.c: First-Fit with Coalescing
typedef struct A_BLOCK_LINK {
struct A_BLOCK_LINK *pxNextFreeBlock;
size_t xBlockSize;
} BlockLink_t;
static BlockLink_t xStart;
static BlockLink_t *pxEnd = NULL;
void *pvPortMalloc(size_t xWantedSize) {
BlockLink_t *pxBlock, *pxPreviousBlock, *pxNewBlockLink;
// Find first block large enough
pxPreviousBlock = &xStart;
pxBlock = xStart.pxNextFreeBlock;
while ((pxBlock->xBlockSize < xWantedSize) && (pxBlock->pxNextFreeBlock != NULL)) {
pxPreviousBlock = pxBlock;
pxBlock = pxBlock->pxNextFreeBlock;
}
if (pxBlock != pxEnd) {
// Split block if large enough
// ...
return (void *)(((uint8_t *)pxPreviousBlock->pxNextFreeBlock) + xHeapStructSize);
}
return NULL;
}
Use case: General-purpose allocation with some fragmentation tolerance.
heap_5.c: Multiple Regions
typedef struct HeapRegion {
uint8_t *pucStartAddress;
size_t xSizeInBytes;
} HeapRegion_t;
void vPortDefineHeapRegions(const HeapRegion_t * const pxHeapRegions) {
// Initialize multiple non-contiguous memory regions
// ...
}
Use case: Systems with multiple RAM regions (internal SRAM + external DRAM).
Putting It All Together: Optimized Firmware Memory
Here’s the final optimized firmware combining all techniques:
// 1. Memory regions
#define STATIC_REGION_SIZE (64 * 1024)
#define POOL_REGION_SIZE (32 * 1024)
#define STACK_REGION_SIZE (16 * 1024)
#define DMA_REGION_SIZE (16 * 1024)
// 2. Fixed-size pools
typedef struct {
fast_pool_t small_pool; // 32-byte blocks
fast_pool_t medium_pool; // 256-byte blocks
fast_pool_t large_pool; // 4096-byte blocks
} pool_manager_t;
static pool_manager_t g_pools;
// 3. Static allocation for long-lived objects
typedef struct {
char tx_buffer[1024];
char rx_buffer[1024];
// ...
} uart_context_t;
static uart_context_t g_uart;
// 4. Slab allocator for connections
typedef struct {
connection_t connections[MAX_CONNECTIONS];
uint32_t free_bitmap;
} connection_pool_t;
static connection_pool_t g_conn_pool;
// Memory initialization
void memory_init(void) {
// Initialize pools
pool_init(&g_pools.small_pool, 32, 128);
pool_init(&g_pools.medium_pool, 256, 32);
pool_init(&g_pools.large_pool, 4096, 8);
// Initialize connection pool
g_conn_pool.free_bitmap = 0xFFFFFFFF; // All free
// Static objects already initialized
}
// Smart allocation function
void *mem_alloc(size_t size) {
if (size <= 32) {
return pool_alloc(&g_pools.small_pool);
} else if (size <= 256) {
return pool_alloc(&g_pools.medium_pool);
} else if (size <= 4096) {
return pool_alloc(&g_pools.large_pool);
} else {
return NULL; // Too large
}
}
void mem_free(void *ptr, size_t size) {
if (size <= 32) {
pool_free(&g_pools.small_pool, ptr);
} else if (size <= 256) {
pool_free(&g_pools.medium_pool, ptr);
} else if (size <= 4096) {
pool_free(&g_pools.large_pool, ptr);
}
}
Final Benchmark
Test: IoT firmware running for 24 hours
Original (malloc/free):
Peak memory: 118 KB
Fragmentation: 45%
Largest free block: 8 KB
Crashes: 3 (out of memory)
Allocation time: 200 cycles (avg)
Optimized (pools + static + slab):
Peak memory: 96 KB
Fragmentation: 0%
Largest free block: 32 KB
Crashes: 0
Allocation time: 12 cycles (avg)
Improvements:
Memory usage: 18.6% reduction
Fragmentation: 100% elimination
Allocation speed: 16.7× faster
Reliability: No crashes
The Retest
After implementing the memory pool redesign, I reflashed all twelve devices and restarted the 72-hour test.
This time, I monitored the memory usage continuously. After 15 minutes, the pattern was already clear: memory usage had stabilized at 96 KB, and fragmentation remained at zero.
After 72 hours, all twelve devices were still running. After a week, still running. We eventually let the test run for three months—no crashes, no memory issues, no fragmentation.
What I Learned
The firmware redesign taught me that malloc/free is the wrong tool for embedded systems.
Here’s what worked:
1. Fixed-size pools eliminate fragmentation
Pre-allocating blocks of fixed sizes (256 bytes for sensors, 1024 bytes for network packets) provides O(1) allocation with zero fragmentation. The sensor pool was 20× faster than malloc.
2. Static allocation for long-lived objects
Network buffers, configuration data, and other persistent objects should be statically allocated. Zero overhead, zero fragmentation, and the memory layout is known at compile time.
3. Stack allocation for temporary buffers
Short-lived buffers (like temporary processing buffers) should use the stack. Fastest allocation—just adjust the stack pointer—and automatic cleanup when the function returns.
4. Slab allocators for uniform objects
Connection pools and packet buffers benefit from slab allocation. Using a bitmap for free/used tracking provides O(1) allocation with just 1 bit of overhead per object. 13.3× faster than malloc.
The Final Numbers
Original firmware (malloc/free):
Peak memory: 118 KB
Fragmentation: 45%
Crashes after: 72 hours
Allocation time: 240 cycles (avg)
Optimized firmware (pools + static + slab):
Peak memory: 96 KB
Fragmentation: 0%
Crashes after: Never (ran for 3+ months)
Allocation time: 12 cycles (avg)
Improvements:
Memory usage: 18.6% reduction
Fragmentation: 100% elimination
Allocation speed: 20× faster
Reliability: Zero crashes
The lesson: firmware needs predictable, deterministic memory management. Avoid malloc/free. Use pools, static allocation, and slab allocators.
And always run extended testing—72 hours minimum—before declaring a project complete. The bugs that matter most are the ones that only appear after days of continuous operation.
Summary
Key insights:
- malloc/free causes fragmentation in long-running firmware
- Fixed-size pools: 20× faster, zero fragmentation
- Static allocation: Best for long-lived objects
- Stack allocation: Best for temporary buffers
- Slab allocators: 13.3× faster for uniform objects
The IoT sensor firmware:
- 18.6% less memory (118 KB → 96 KB)
- Zero fragmentation (45% → 0%)
- 20× faster allocation (240 cycles → 12 cycles)
- Zero crashes (ran for 3+ months continuously)
Takeaway: In embedded systems, predictability matters more than flexibility. Design your memory management for your specific workload, not for general-purpose use.