Chapter 16: SIMD & Vectorization

Part V: Parallelism & Low-Level Optimization

"The free lunch is over." — Herb Sutter

The 10x Speedup That Cost Nothing

A game studio was optimizing their physics engine. The collision detection code was clean, well-structured, and algorithmically optimal—yet it consumed 40% of frame time. A senior engineer spent an afternoon adding SIMD intrinsics to the inner loop. The result: 8x speedup, bringing collision detection down to 5% of frame time.

No algorithmic changes. No architectural redesign. Just telling the CPU to do 8 operations at once instead of 1.

This is the promise of SIMD: massive speedups for data-parallel workloads, often with minimal code changes. But it's also a minefield of alignment requirements, instruction set variations, and subtle correctness issues.

What is SIMD?

Single Instruction, Multiple Data: process multiple data elements with one instruction.

Scalar Operation:
  A[0] + B[0] = C[0]
  A[1] + B[1] = C[1]
  A[2] + B[2] = C[2]
  A[3] + B[3] = C[3]
  → 4 instructions

SIMD Operation (256-bit AVX):
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │A[0]│A[1]│A[2]│A[3]│A[4]│A[5]│A[6]│A[7]│  (8 floats)
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    +
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │B[0]│B[1]│B[2]│B[3]│B[4]│B[5]│B[6]│B[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    ↓
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │C[0]│C[1]│C[2]│C[3]│C[4]│C[5]│C[6]│C[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
  → 1 instruction

Theoretical vs Practical Speedup

With 256-bit registers processing 8 floats, you might expect 8x speedup. Reality is more nuanced:

Factor	Impact
Memory bandwidth	Often the bottleneck, not compute
Alignment overhead	Unaligned loads are slower
Remainder handling	N not divisible by vector width
Register pressure	Limited SIMD registers
Instruction latency	Some SIMD ops have higher latency

Typical real-world speedup: 2-6x for memory-bound workloads, 4-8x for compute-bound.

SIMD Instruction Sets

x86/x64 Evolution

Generation	Width	Year	Key Features
SSE	128-bit	1999	4x float
SSE2	128-bit	2001	2x double, integer ops
AVX	256-bit	2011	8x float, 3-operand
AVX2	256-bit	2013	256-bit integer
AVX-512	512-bit	2017	Masking, scatter/gather

ARM

NEON: 128-bit, ubiquitous on ARM (phones, tablets, M-series Macs)
SVE/SVE2: Scalable Vector Extension (128-2048 bit), write-once-run-anywhere

RISC-V

RVV (Vector Extension): Variable-length VLEN (128-65536 bits), vector-length agnostic (VLA) programming model. One codebase can run on different hardware widths without recompilation.

The Portability Challenge

Code written for AVX won't run on ARM. Code for AVX-512 won't run on older x86. Solutions:

Runtime dispatch: Detect CPU features, call appropriate implementation
Portable libraries: Highway, XSIMD, std::simd (C++23)
Compiler auto-vectorization: Let the compiler handle it

Auto-Vectorization

Modern compilers can automatically vectorize simple loops. This is the easiest path to SIMD benefits.

Helping the Compiler

// ✓ Easy to vectorize: simple loop, no dependencies
void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✗ Hard to vectorize: potential aliasing
void add_arrays_bad(float* a, float* b, float* c, int n) {
    // Compiler doesn't know if a, b, c overlap
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✓ Fixed with restrict (C) or __restrict (C++)
void add_arrays_good(float* __restrict a,
                     float* __restrict b,
                     float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Compiler Flags

# GCC/Clang: Enable vectorization
-O3 -march=native -ffast-math

# Check what got vectorized
-fopt-info-vec-optimized      # GCC: show successful vectorization
-fopt-info-vec-missed         # GCC: show failed attempts
-Rpass=loop-vectorize         # Clang: show successful
-Rpass-missed=loop-vectorize  # Clang: show failures

When Auto-Vectorization Fails

Common blockers:

Pointer aliasing: Use restrict/__restrict
Loop-carried dependencies: a[i] = a[i-1] + 1 can't parallelize
Function calls in loop: Unless inlined
Complex control flow: if statements inside loops
Non-contiguous access: Strided or indirect indexing

Manual SIMD Programming

When auto-vectorization isn't enough, you have options:

Intel Intrinsics Example

#include <immintrin.h>

void add_arrays_avx(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 8 floats at a time
    for (; i + 7 < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Portable SIMD with Highway

#include "hwy/highway.h"

void add_arrays_highway(float* a, float* b, float* c, int n) {
    namespace hn = hwy::HWY_NAMESPACE;
    using D = hn::ScalableTag<float>;
    D d;

    for (int i = 0; i < n; i += hn::Lanes(d)) {
        auto va = hn::LoadU(d, a + i);
        auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::Add(va, vb), d, c + i);
    }
}

Highway automatically uses the best available SIMD for the target platform.

Memory Alignment

SIMD performance is sensitive to memory alignment:

// Aligned allocation (C++17)
float* data = static_cast<float*>(
    std::aligned_alloc(32, n * sizeof(float)));  // 32-byte for AVX

// Aligned load (faster)
__m256 v = _mm256_load_ps(aligned_ptr);   // Requires 32-byte alignment

// Unaligned load (works anywhere, slightly slower)
__m256 v = _mm256_loadu_ps(any_ptr);

Rule of thumb: Align to vector width (16 bytes for SSE, 32 for AVX, 64 for AVX-512).

Measuring SIMD Performance

Verify Vectorization Happened

# Check assembly for vector instructions
objdump -d binary | grep -E 'vmov|vadd|vmul'  # AVX
objdump -d binary | grep -E 'movaps|addps'    # SSE

Benchmark Methodology

Warm up: First iterations may have cold cache/code effects
Large enough N: Small arrays may not show SIMD benefit
Measure throughput: Elements processed per second
Check memory bandwidth: SIMD can't help if memory-bound

// Horizontal sum of 8 floats in AVX - multiple instructions
float hsum(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    lo = _mm_add_ps(lo, hi);
    // ... more shuffles and adds
}

SIMD provides 2-8x speedup for data-parallel workloads
Start with auto-vectorization: Use -O3 -march=native, check compiler output
Help the compiler: Use restrict, avoid aliasing, keep loops simple
Manual SIMD when needed: Intrinsics for maximum control, portable libraries for cross-platform
Measure carefully: Verify vectorization happened, account for memory bandwidth

Performance and Benchmarking