Chapter 4: Arrays and Cache Locality
Part II: Basic Data Structures
“The array is the most important data structure in computer science.” — Donald Knuth (paraphrased)
The Simplest Data Structure
Arrays are so simple that we often take them for granted. Contiguous memory, O(1) access, what’s there to optimize?
Everything.
I was working on a packet processing pipeline for a network switch. The code was straightforward: read packets from a ring buffer (an array), process them, and write results to another array. Simple, right?
The performance was terrible. We were processing 100,000 packets per second when the hardware should handle 1 million.
The profiler showed something strange:
$ perf stat -e cache-misses,instructions ./packet_processor
Performance counter stats:
450,000 cache-misses
1,000,000 instructions
450,000 cache misses for 1,000,000 instructions? That’s a cache miss every 2-3 instructions. For simple array operations, this made no sense.
The problem wasn’t the arrays themselves—it was how we were using them.
Memory Layout Matters
Let’s start with the basics. An array is contiguous memory:
int array[8] = {0, 1, 2, 3, 4, 5, 6, 7};
In memory (assuming 4-byte integers):
Address: 0x1000 0x1004 0x1008 0x100C 0x1010 0x1014 0x1018 0x101C
Value: 0 1 2 3 4 5 6 7
└───────────────────────────────────────────────────────┘
One 64-byte cache line
Key insight: All 8 integers fit in a single 64-byte cache line.
Accessing the array sequentially:
int sum = 0;
for (int i = 0; i < 8; i++) {
sum += array[i];
}
Cache behavior with prefetching:
sequenceDiagram
participant CPU
participant Cache
participant Prefetcher
participant Memory
CPU->>Cache: Request array[0]
Cache->>Memory: MISS - Fetch cache line
Memory-->>Cache: Return 64 bytes (array[0-15])
Prefetcher->>Prefetcher: Detect sequential pattern
Prefetcher->>Memory: Prefetch next cache line
Memory-->>Cache: Prefetch array[16-31]
CPU->>Cache: Request array[1]
Cache-->>CPU: HIT (already in cache)
CPU->>Cache: Request array[2-15]
Cache-->>CPU: HIT (all in cache)
CPU->>Cache: Request array[16]
Cache-->>CPU: HIT (prefetched!)
Prefetcher->>Memory: Prefetch array[32-47]
Note over CPU,Memory: Prefetcher stays ahead,<br/>hiding memory latency
Cache behavior:
- First access (
array[0]): Cache miss (100 cycles) - Fetches entire cache line (64 bytes = 16 integers)
- Next 7 accesses (
array[1]toarray[7]): Cache hits (1 cycle each) - Prefetcher: Detects pattern, fetches ahead
Total cost: 100 + 7 = 107 cycles for 8 accesses = 13.4 cycles per access
Compare this to random access:
int indices[8] = {7, 2, 5, 0, 3, 6, 1, 4};
int sum = 0;
for (int i = 0; i < 8; i++) {
sum += array[indices[i]];
}
If indices causes accesses to different cache lines:
- Each access: Cache miss (100 cycles)
- Total cost: 800 cycles for 8 accesses = 100 cycles per access
Sequential is 7.5× faster than random, even though both are O(n).
Stride Patterns
Not all sequential access is equal. Stride matters.
Stride-1 access (best case):
for (int i = 0; i < n; i++) {
sum += array[i]; // Stride = 1 element = 4 bytes
}
Stride-2 access (still good):
for (int i = 0; i < n; i += 2) {
sum += array[i]; // Stride = 2 elements = 8 bytes
}
Large stride (worse):
for (int i = 0; i < n; i += 16) {
sum += array[i]; // Stride = 16 elements = 64 bytes
}
Why does stride matter?
-
Cache line utilization: Stride-1 uses all 64 bytes fetched. Stride-16 uses only 4 bytes per cache line (6.25% utilization).
-
Prefetcher effectiveness: Hardware prefetchers detect stride patterns, but large strides may exceed prefetch distance.
Benchmark (1M element array):
Stride-1: 1.2 ms (100% cache line utilization)
Stride-2: 1.3 ms (50% utilization, still prefetched)
Stride-4: 1.5 ms (25% utilization)
Stride-8: 2.1 ms (12.5% utilization)
Stride-16: 3.8 ms (6.25% utilization)
Stride-64: 8.5 ms (1.56% utilization, new cache line each access)
Guideline: Keep stride small (≤ 8 elements) for good performance.
Real-World Tool: lmbench lat_mem_rd
The classic lmbench benchmark suite includes lat_mem_rd, which measures memory latency across different array sizes and strides. This is exactly what we’ve been discussing.
How it works:
// Simplified version of lmbench lat_mem_rd
char *p = array;
for (int i = 0; i < iterations; i++) {
// Pointer chasing with configurable stride
p = *(char **)p; // Follow pointer to next element
}
The array is initialized so each element points to the next element at distance stride:
// Initialize array with stride
for (size_t i = 0; i < size; i += stride) {
array[i] = &array[(i + stride) % size];
}
Running lmbench:
$ lat_mem_rd 64M 128
# Array size: 64 MB, stride: 128 bytes
Output:
Stride Latency
128 3.2 ns (L1 cache)
256 3.5 ns (L1 cache)
512 4.1 ns (L1 cache)
1024 5.8 ns (L2 cache)
4096 12.5 ns (L2 cache)
16384 45.0 ns (L3 cache)
65536 102.0 ns (DRAM)
What this shows:
- Small strides (128-512 bytes): Stay in L1 cache (~3-4 ns)
- Medium strides (1-4 KB): L2 cache (~6-12 ns)
- Large strides (16-64 KB): L3 cache or DRAM (45-100+ ns)
Why stride affects latency:
- Small stride: Sequential access, prefetcher helps, stays in L1
- Large stride: Jumps across cache lines, defeats prefetcher, evicts from L1
Key insight: This is why data structure layout matters. If your struct is 128 bytes and you iterate through an array of them, you’re doing stride-128 access. If only 8 bytes of the struct are “hot” (frequently accessed), you’re wasting 93.75% of each cache line.
Embedded perspective: On embedded systems without L3 cache, the latency cliff is steeper. Once you exceed L1/L2 capacity, you go straight to DRAM or flash (100-1000× slower).
Multi-Dimensional Arrays
Multi-dimensional arrays introduce a critical choice: row-major vs column-major layout.
C uses row-major order:
int matrix[4][4] = {
{0, 1, 2, 3},
{4, 5, 6, 7},
{8, 9, 10, 11},
{12, 13, 14, 15}
};
Memory layout (row-major):
Address: 0x1000 0x1004 0x1008 0x100C 0x1010 0x1014 0x1018 0x101C ...
Value: 0 1 2 3 4 5 6 7 ...
└───────────── Row 0 ──────────┘└───────────── Row 1 ──────────┘
Row-major traversal (good):
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
sum += matrix[i][j]; // Sequential in memory
}
}
Column-major traversal (bad):
for (int j = 0; j < 4; j++) {
for (int i = 0; i < 4; i++) {
sum += matrix[i][j]; // Stride = 4 elements = 16 bytes
}
}
Benchmark (1024×1024 matrix):
Row-major: 12 ms (sequential access)
Column-major: 45 ms (stride-1024 access)
Column-major is 3.75× slower for the same algorithm!
The Matrix Multiplication Problem
Matrix multiplication is the classic example of cache optimization:
// Naive implementation
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Access patterns:
A[i][k]: Row-major, good (stride-1)C[i][j]: Same element repeatedly, excellent (temporal locality)B[k][j]: Column-major, terrible (stride-N)
For N=1024: Accessing B[k][j] has stride of 1024 elements = 4096 bytes = 64 cache lines!
Solution 1: Loop reordering (ikj order)
// Better: ikj order
for (int i = 0; i < N; i++) {
for (int k = 0; k < N; k++) {
int r = A[i][k];
for (int j = 0; j < N; j++) {
C[i][j] += r * B[k][j]; // Now B is row-major!
}
}
}
Access patterns now:
A[i][k]: Row-major, goodB[k][j]: Row-major, good (was column-major)C[i][j]: Row-major, good
Benchmark (512×512 matrices):
ijk order (naive): 2,450 ms
ikj order: 680 ms (3.6× faster)
Solution 2: Blocking (tiling)
For very large matrices that don’t fit in cache, use blocking:
#define BLOCK_SIZE 64
for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
// Process BLOCK_SIZE × BLOCK_SIZE submatrix
for (int i = ii; i < ii + BLOCK_SIZE && i < N; i++) {
for (int k = kk; k < kk + BLOCK_SIZE && k < N; k++) {
int r = A[i][k];
for (int j = jj; j < jj + BLOCK_SIZE && j < N; j++) {
C[i][j] += r * B[k][j];
}
}
}
}
}
}
Why blocking works:
- Processes small blocks that fit in L1 cache
- Reuses data before eviction
- Reduces cache misses dramatically
Benchmark (1024×1024 matrices):
Naive (ijk): 18,500 ms
Reordered (ikj): 5,200 ms (3.6× faster)
Blocked: 1,800 ms (10.3× faster than naive, 2.9× faster than reordered)
Structure of Arrays vs Array of Structures
How you organize data in arrays has huge performance implications.
Memory layout comparison:
graph TD
subgraph "AoS: Array of Structures"
A1["Cache Line 0<br/>[x,y,z,vx,vy,vz,mass,id]<br/>Particle 0"]
A2["Cache Line 1<br/>[x,y,z,vx,vy,vz,mass,id]<br/>Particle 1"]
A3["Cache Line 2<br/>[x,y,z,vx,vy,vz,mass,id]<br/>Particle 2"]
A1 --> A2 --> A3
style A1 fill:#ffcccb
style A2 fill:#ffcccb
style A3 fill:#ffcccb
end
subgraph "SoA: Structure of Arrays"
S1["Cache Line 0<br/>[x[0], x[1], ..., x[15]]"]
S2["Cache Line 1<br/>[x[16], x[17], ..., x[31]]"]
S3["Cache Line N<br/>[vx[0], vx[1], ..., vx[15]]"]
S1 --> S2 --> S3
style S1 fill:#90ee90
style S2 fill:#90ee90
style S3 fill:#90ee90
end
note1["AoS: 37.5% cache utilization<br/>Only need x,y,z,vx,vy,vz<br/>but fetch mass,id too"]
note2["SoA: 100% cache utilization<br/>Each cache line contains<br/>only needed data"]
A3 -.-> note1
S3 -.-> note2
Array of Structures (AoS):
typedef struct {
float x, y, z; // Position (12 bytes)
float vx, vy, vz; // Velocity (12 bytes)
float mass; // Mass (4 bytes)
int id; // ID (4 bytes)
} particle_t; // Total: 32 bytes
particle_t particles[1000];
// Update positions
for (int i = 0; i < 1000; i++) {
particles[i].x += particles[i].vx * dt;
particles[i].y += particles[i].vy * dt;
particles[i].z += particles[i].vz * dt;
}
Memory layout:
Cache line 0: [p0.x, p0.y, p0.z, p0.vx, p0.vy, p0.vz, p0.mass, p0.id]
Cache line 1: [p1.x, p1.y, p1.z, p1.vx, p1.vy, p1.vz, p1.mass, p1.id]
...
Problem: Each cache line contains data we don’t need (mass, id). We’re using only 24 bytes out of 64 (37.5% utilization).
Structure of Arrays (SoA):
typedef struct {
float x[1000];
float y[1000];
float z[1000];
float vx[1000];
float vy[1000];
float vz[1000];
float mass[1000];
int id[1000];
} particles_t;
particles_t particles;
// Update positions
for (int i = 0; i < 1000; i++) {
particles.x[i] += particles.vx[i] * dt;
particles.y[i] += particles.vy[i] * dt;
particles.z[i] += particles.vz[i] * dt;
}
Memory layout:
Cache line 0: [x[0], x[1], x[2], ..., x[15]]
Cache line 1: [x[16], x[17], ..., x[31]]
...
Advantage: 100% cache line utilization. Each cache line contains only the data we need.
Benchmark (1M particles, 1000 iterations):
AoS: 2,850 ms
SoA: 1,200 ms (2.4× faster)
When to use SoA:
- Operations access only a few fields
- Large arrays (> cache size)
- Performance-critical loops
When to use AoS:
- Operations access all fields
- Small arrays (< cache size)
- Code clarity matters more than performance
Alignment and Padding
Memory alignment affects both correctness and performance.
Natural alignment:
char: 1-byte alignedshort: 2-byte alignedint: 4-byte alignedlong: 8-byte aligneddouble: 8-byte aligned
Unaligned access:
char buffer[16];
int *p = (int*)(buffer + 1); // Unaligned!
*p = 42; // May be slow or crash
On x86: Unaligned access works but is slower (may cross cache line boundary) On ARM/RISC-V: May trap or require multiple accesses
Structure padding:
struct bad {
char a; // 1 byte
int b; // 4 bytes, needs 4-byte alignment
char c; // 1 byte
}; // Size: 12 bytes (with padding)
Memory layout:
Offset: 0 1 2 3 4 5 6 7 8 9 10 11
Value: a pad pad pad b b b b c pad pad pad
Better ordering:
struct good {
int b; // 4 bytes
char a; // 1 byte
char c; // 1 byte
}; // Size: 8 bytes (with padding)
Memory layout:
Offset: 0 1 2 3 4 5 6 7
Value: b b b b a c pad pad
Guideline: Order struct members from largest to smallest to minimize padding.
Cache line alignment:
For performance-critical structures, align to cache line boundaries:
struct __attribute__((aligned(64))) cache_aligned {
int data[16];
};
Why?
- Prevents false sharing on multi-core
- Ensures structure doesn’t span cache lines
- Predictable cache behavior
Array Bounds and Prefetching
Modern CPUs prefetch data, but they can’t prefetch past array bounds they don’t know.
Helping the prefetcher:
// BAD: Unpredictable loop bound
for (int i = 0; i < get_count(); i++) {
sum += array[i];
}
// GOOD: Constant loop bound
int n = get_count();
for (int i = 0; i < n; i++) {
sum += array[i];
}
// BETTER: Compiler can see bound
#define SIZE 1000
for (int i = 0; i < SIZE; i++) {
sum += array[i];
}
Loop unrolling helps prefetching:
// Manual unrolling
for (int i = 0; i < n; i += 4) {
sum += array[i];
sum += array[i+1];
sum += array[i+2];
sum += array[i+3];
}
Benefits:
- Reduces loop overhead
- Exposes more parallelism
- Helps prefetcher see pattern
Compiler can auto-unroll:
#pragma GCC unroll 4
for (int i = 0; i < n; i++) {
sum += array[i];
}
Embedded Systems: Small Arrays, Big Impact
On embedded systems with tiny caches (8-32 KB), array optimization is even more critical.
Example: RISC-V MCU with 16 KB L1 cache
// This array is 40% of your entire cache!
int buffer[1000]; // 4 KB
Guidelines for embedded:
1. Keep arrays small
// BAD: Wastes cache
int large_buffer[10000]; // 40 KB, doesn't fit in cache
// GOOD: Fits in cache
int small_buffer[1000]; // 4 KB, fits comfortably
2. Reuse arrays
// BAD: Multiple arrays compete for cache
int input[1000];
int temp[1000];
int output[1000];
// GOOD: Reuse same buffer
int buffer[1000];
process_in_place(buffer);
3. Use smaller types
// BAD: Wastes memory and cache
int32_t values[1000]; // 4 KB
// GOOD: If range allows
int16_t values[1000]; // 2 KB, 2× more data in cache
4. Pack data
// BAD: 4 bytes per flag
int flags[1000]; // 4 KB
// GOOD: 1 bit per flag
uint32_t flags[32]; // 128 bytes, 32× smaller!
void set_flag(int i) {
flags[i / 32] |= (1 << (i % 32));
}
int get_flag(int i) {
return (flags[i / 32] >> (i % 32)) & 1;
}
Real-World Example: Packet Buffer
Back to my packet processing problem. Here’s what was wrong:
Original code (bad):
typedef struct {
uint8_t data[1500]; // Packet data
uint32_t length; // Packet length
uint32_t timestamp; // Timestamp
uint32_t src_ip; // Source IP
uint32_t dst_ip; // Dest IP
uint16_t src_port; // Source port
uint16_t dst_port; // Dest port
uint8_t protocol; // Protocol
uint8_t flags; // Flags
} packet_t; // Total: ~1520 bytes
packet_t packets[1000]; // 1.52 MB
// Process packets
for (int i = 0; i < count; i++) {
if (packets[i].protocol == TCP) {
process_tcp(&packets[i]);
}
}
Problem: Each iteration fetches 1520 bytes (24 cache lines) just to check protocol (1 byte).
Fixed code (good):
// Separate hot and cold data
typedef struct {
uint8_t protocol; // Hot: checked every iteration
uint8_t flags; // Hot: checked often
uint16_t length; // Hot: used for processing
uint32_t data_offset; // Offset into data array
} packet_header_t; // 8 bytes
packet_header_t headers[1000]; // 8 KB
uint8_t packet_data[1500 * 1000]; // 1.5 MB
// Process packets
for (int i = 0; i < count; i++) {
if (headers[i].protocol == TCP) {
uint8_t *data = &packet_data[headers[i].data_offset];
process_tcp(&headers[i], data);
}
}
Result:
- Headers fit in cache (8 KB vs 1.52 MB)
- First loop: 8× fewer cache lines
- Only fetch packet data when needed
- Performance: 100K → 950K packets/sec (9.5× faster)
Summary
The packet processing pipeline’s terrible performance—100,000 packets per second instead of 1 million—was fixed by understanding array access patterns. The 450,000 cache misses came from poor memory layout and access order. Restructuring to Structure of Arrays and optimizing traversal order brought performance to 950,000 packets per second, nearly 10× faster.
Key principles:
- Sequential access beats random (7-10× faster)
- Small strides beat large strides
- Row-major traversal for C arrays
- SoA beats AoS for selective field access
- Alignment matters (correctness and performance)
- Keep working set in cache
Optimization techniques:
- Loop reordering (ikj vs ijk)
- Blocking/tiling for large arrays
- Structure of Arrays (SoA)
- Proper alignment and padding
- Loop unrolling
Embedded considerations:
- Keep arrays small (fit in cache)
- Reuse buffers
- Use smaller types
- Pack data (bit arrays)
- Separate hot/cold data
Measurement:
- Profile before optimizing
- Measure cache misses
- Test on target hardware
Next Chapter: We’ve seen why arrays are fast. Now let’s explore why linked lists are slow—and when you should use them anyway.