Chapter 16: SIMD & Vectorization
Part V: Parallelism & Low-Level Optimization
"The free lunch is over." — Herb Sutter
The 10x Speedup That Cost Nothing
A game studio was optimizing their physics engine. The collision detection code was clean, well-structured, and algorithmically optimal—yet it consumed 40% of frame time. A senior engineer spent an afternoon adding SIMD intrinsics to the inner loop. The result: 8x speedup, bringing collision detection down to 5% of frame time.
No algorithmic changes. No architectural redesign. Just telling the CPU to do 8 operations at once instead of 1.
This is the promise of SIMD: massive speedups for data-parallel workloads, often with minimal code changes. But it's also a minefield of alignment requirements, instruction set variations, and subtle correctness issues.
What is SIMD?
Single Instruction, Multiple Data: process multiple data elements with one instruction.
Scalar Operation:
A[0] + B[0] = C[0]
A[1] + B[1] = C[1]
A[2] + B[2] = C[2]
A[3] + B[3] = C[3]
→ 4 instructions
SIMD Operation (256-bit AVX):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│A[0]│A[1]│A[2]│A[3]│A[4]│A[5]│A[6]│A[7]│ (8 floats)
└────┴────┴────┴────┴────┴────┴────┴────┘
+
┌────┬────┬────┬────┬────┬────┬────┬────┐
│B[0]│B[1]│B[2]│B[3]│B[4]│B[5]│B[6]│B[7]│
└────┴────┴────┴────┴────┴────┴────┴────┘
↓
┌────┬────┬────┬────┬────┬────┬────┬────┐
│C[0]│C[1]│C[2]│C[3]│C[4]│C[5]│C[6]│C[7]│
└────┴────┴────┴────┴────┴────┴────┴────┘
→ 1 instruction
Theoretical vs Practical Speedup
With 256-bit registers processing 8 floats, you might expect 8x speedup. Reality is more nuanced:
| Factor | Impact |
|---|---|
| Memory bandwidth | Often the bottleneck, not compute |
| Alignment overhead | Unaligned loads are slower |
| Remainder handling | N not divisible by vector width |
| Register pressure | Limited SIMD registers |
| Instruction latency | Some SIMD ops have higher latency |
Typical real-world speedup: 2-6x for memory-bound workloads, 4-8x for compute-bound.
SIMD Instruction Sets
x86/x64 Evolution
| Generation | Width | Year | Key Features |
|---|---|---|---|
| SSE | 128-bit | 1999 | 4x float |
| SSE2 | 128-bit | 2001 | 2x double, integer ops |
| AVX | 256-bit | 2011 | 8x float, 3-operand |
| AVX2 | 256-bit | 2013 | 256-bit integer |
| AVX-512 | 512-bit | 2017 | Masking, scatter/gather |
ARM
- NEON: 128-bit, ubiquitous on ARM (phones, tablets, M-series Macs)
- SVE/SVE2: Scalable Vector Extension (128-2048 bit), write-once-run-anywhere
RISC-V
- RVV (Vector Extension): Variable-length VLEN (128-65536 bits), vector-length agnostic (VLA) programming model. One codebase can run on different hardware widths without recompilation.
The Portability Challenge
Code written for AVX won't run on ARM. Code for AVX-512 won't run on older x86. Solutions:
- Runtime dispatch: Detect CPU features, call appropriate implementation
- Portable libraries: Highway, XSIMD, std::simd (C++23)
- Compiler auto-vectorization: Let the compiler handle it
Auto-Vectorization
Modern compilers can automatically vectorize simple loops. This is the easiest path to SIMD benefits.
Helping the Compiler
// ✓ Easy to vectorize: simple loop, no dependencies
void add_arrays(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✗ Hard to vectorize: potential aliasing
void add_arrays_bad(float* a, float* b, float* c, int n) {
// Compiler doesn't know if a, b, c overlap
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✓ Fixed with restrict (C) or __restrict (C++)
void add_arrays_good(float* __restrict a,
float* __restrict b,
float* __restrict c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Compiler Flags
# GCC/Clang: Enable vectorization
-O3 -march=native -ffast-math
# Check what got vectorized
-fopt-info-vec-optimized # GCC: show successful vectorization
-fopt-info-vec-missed # GCC: show failed attempts
-Rpass=loop-vectorize # Clang: show successful
-Rpass-missed=loop-vectorize # Clang: show failures
When Auto-Vectorization Fails
Common blockers:
- Pointer aliasing: Use
restrict/__restrict - Loop-carried dependencies:
a[i] = a[i-1] + 1can't parallelize - Function calls in loop: Unless inlined
- Complex control flow:
ifstatements inside loops - Non-contiguous access: Strided or indirect indexing
Manual SIMD Programming
When auto-vectorization isn't enough, you have options:
Intel Intrinsics Example
#include <immintrin.h>
void add_arrays_avx(float* a, float* b, float* c, int n) {
int i = 0;
// Process 8 floats at a time
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
// Handle remainder
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Portable SIMD with Highway
#include "hwy/highway.h"
void add_arrays_highway(float* a, float* b, float* c, int n) {
namespace hn = hwy::HWY_NAMESPACE;
using D = hn::ScalableTag<float>;
D d;
for (int i = 0; i < n; i += hn::Lanes(d)) {
auto va = hn::LoadU(d, a + i);
auto vb = hn::LoadU(d, b + i);
hn::StoreU(hn::Add(va, vb), d, c + i);
}
}
Highway automatically uses the best available SIMD for the target platform.
Memory Alignment
SIMD performance is sensitive to memory alignment:
// Aligned allocation (C++17)
float* data = static_cast<float*>(
std::aligned_alloc(32, n * sizeof(float))); // 32-byte for AVX
// Aligned load (faster)
__m256 v = _mm256_load_ps(aligned_ptr); // Requires 32-byte alignment
// Unaligned load (works anywhere, slightly slower)
__m256 v = _mm256_loadu_ps(any_ptr);
Rule of thumb: Align to vector width (16 bytes for SSE, 32 for AVX, 64 for AVX-512).
Measuring SIMD Performance
Verify Vectorization Happened
# Check assembly for vector instructions
objdump -d binary | grep -E 'vmov|vadd|vmul' # AVX
objdump -d binary | grep -E 'movaps|addps' # SSE
Benchmark Methodology
- Warm up: First iterations may have cold cache/code effects
- Large enough N: Small arrays may not show SIMD benefit
- Measure throughput: Elements processed per second
- Check memory bandwidth: SIMD can't help if memory-bound
Common Pitfalls
1. Assuming Linear Speedup
8-wide SIMD ≠ 8x speedup. Memory bandwidth, alignment, and remainder handling all reduce gains.
2. Ignoring Horizontal Operations
Summing a vector requires "horizontal" operations that are slower:
// Horizontal sum of 8 floats in AVX - multiple instructions
float hsum(__m256 v) {
__m128 lo = _mm256_castps256_ps128(v);
__m128 hi = _mm256_extractf128_ps(v, 1);
lo = _mm_add_ps(lo, hi);
// ... more shuffles and adds
}
3. AVX-512 Frequency Throttling
On some Intel CPUs, heavy AVX-512 use causes frequency reduction (up to 20%). The wider operations may not compensate for the lower clock speed.
4. Forgetting the Scalar Remainder
When N isn't divisible by vector width, you need scalar cleanup code.
Summary
- SIMD provides 2-8x speedup for data-parallel workloads
- Start with auto-vectorization: Use
-O3 -march=native, check compiler output - Help the compiler: Use
restrict, avoid aliasing, keep loops simple - Manual SIMD when needed: Intrinsics for maximum control, portable libraries for cross-platform
- Measure carefully: Verify vectorization happened, account for memory bandwidth