Chapter 16: SIMD & Vectorization

Part V: Parallelism & Low-Level Optimization


"The free lunch is over." — Herb Sutter

The 10x Speedup That Cost Nothing

A game studio was optimizing their physics engine. The collision detection code was clean, well-structured, and algorithmically optimal—yet it consumed 40% of frame time. A senior engineer spent an afternoon adding SIMD intrinsics to the inner loop. The result: 8x speedup, bringing collision detection down to 5% of frame time.

No algorithmic changes. No architectural redesign. Just telling the CPU to do 8 operations at once instead of 1.

This is the promise of SIMD: massive speedups for data-parallel workloads, often with minimal code changes. But it's also a minefield of alignment requirements, instruction set variations, and subtle correctness issues.

What is SIMD?

Single Instruction, Multiple Data: process multiple data elements with one instruction.

Scalar Operation:
  A[0] + B[0] = C[0]
  A[1] + B[1] = C[1]
  A[2] + B[2] = C[2]
  A[3] + B[3] = C[3]
  → 4 instructions

SIMD Operation (256-bit AVX):
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │A[0]│A[1]│A[2]│A[3]│A[4]│A[5]│A[6]│A[7]│  (8 floats)
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    +
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │B[0]│B[1]│B[2]│B[3]│B[4]│B[5]│B[6]│B[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    ↓
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │C[0]│C[1]│C[2]│C[3]│C[4]│C[5]│C[6]│C[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
  → 1 instruction

Theoretical vs Practical Speedup

With 256-bit registers processing 8 floats, you might expect 8x speedup. Reality is more nuanced:

FactorImpact
Memory bandwidthOften the bottleneck, not compute
Alignment overheadUnaligned loads are slower
Remainder handlingN not divisible by vector width
Register pressureLimited SIMD registers
Instruction latencySome SIMD ops have higher latency

Typical real-world speedup: 2-6x for memory-bound workloads, 4-8x for compute-bound.

SIMD Instruction Sets

x86/x64 Evolution

GenerationWidthYearKey Features
SSE128-bit19994x float
SSE2128-bit20012x double, integer ops
AVX256-bit20118x float, 3-operand
AVX2256-bit2013256-bit integer
AVX-512512-bit2017Masking, scatter/gather

ARM

  • NEON: 128-bit, ubiquitous on ARM (phones, tablets, M-series Macs)
  • SVE/SVE2: Scalable Vector Extension (128-2048 bit), write-once-run-anywhere

RISC-V

  • RVV (Vector Extension): Variable-length VLEN (128-65536 bits), vector-length agnostic (VLA) programming model. One codebase can run on different hardware widths without recompilation.

The Portability Challenge

Code written for AVX won't run on ARM. Code for AVX-512 won't run on older x86. Solutions:

  1. Runtime dispatch: Detect CPU features, call appropriate implementation
  2. Portable libraries: Highway, XSIMD, std::simd (C++23)
  3. Compiler auto-vectorization: Let the compiler handle it

Auto-Vectorization

Modern compilers can automatically vectorize simple loops. This is the easiest path to SIMD benefits.

Helping the Compiler

// ✓ Easy to vectorize: simple loop, no dependencies
void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✗ Hard to vectorize: potential aliasing
void add_arrays_bad(float* a, float* b, float* c, int n) {
    // Compiler doesn't know if a, b, c overlap
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✓ Fixed with restrict (C) or __restrict (C++)
void add_arrays_good(float* __restrict a,
                     float* __restrict b,
                     float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Compiler Flags

# GCC/Clang: Enable vectorization
-O3 -march=native -ffast-math

# Check what got vectorized
-fopt-info-vec-optimized      # GCC: show successful vectorization
-fopt-info-vec-missed         # GCC: show failed attempts
-Rpass=loop-vectorize         # Clang: show successful
-Rpass-missed=loop-vectorize  # Clang: show failures

When Auto-Vectorization Fails

Common blockers:

  • Pointer aliasing: Use restrict/__restrict
  • Loop-carried dependencies: a[i] = a[i-1] + 1 can't parallelize
  • Function calls in loop: Unless inlined
  • Complex control flow: if statements inside loops
  • Non-contiguous access: Strided or indirect indexing

Manual SIMD Programming

When auto-vectorization isn't enough, you have options:

Intel Intrinsics Example

#include <immintrin.h>

void add_arrays_avx(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 8 floats at a time
    for (; i + 7 < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Portable SIMD with Highway

#include "hwy/highway.h"

void add_arrays_highway(float* a, float* b, float* c, int n) {
    namespace hn = hwy::HWY_NAMESPACE;
    using D = hn::ScalableTag<float>;
    D d;

    for (int i = 0; i < n; i += hn::Lanes(d)) {
        auto va = hn::LoadU(d, a + i);
        auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::Add(va, vb), d, c + i);
    }
}

Highway automatically uses the best available SIMD for the target platform.

Memory Alignment

SIMD performance is sensitive to memory alignment:

// Aligned allocation (C++17)
float* data = static_cast<float*>(
    std::aligned_alloc(32, n * sizeof(float)));  // 32-byte for AVX

// Aligned load (faster)
__m256 v = _mm256_load_ps(aligned_ptr);   // Requires 32-byte alignment

// Unaligned load (works anywhere, slightly slower)
__m256 v = _mm256_loadu_ps(any_ptr);

Rule of thumb: Align to vector width (16 bytes for SSE, 32 for AVX, 64 for AVX-512).

Measuring SIMD Performance

Verify Vectorization Happened

# Check assembly for vector instructions
objdump -d binary | grep -E 'vmov|vadd|vmul'  # AVX
objdump -d binary | grep -E 'movaps|addps'    # SSE

Benchmark Methodology

  1. Warm up: First iterations may have cold cache/code effects
  2. Large enough N: Small arrays may not show SIMD benefit
  3. Measure throughput: Elements processed per second
  4. Check memory bandwidth: SIMD can't help if memory-bound

Common Pitfalls

1. Assuming Linear Speedup

8-wide SIMD ≠ 8x speedup. Memory bandwidth, alignment, and remainder handling all reduce gains.

2. Ignoring Horizontal Operations

Summing a vector requires "horizontal" operations that are slower:

// Horizontal sum of 8 floats in AVX - multiple instructions
float hsum(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    lo = _mm_add_ps(lo, hi);
    // ... more shuffles and adds
}

3. AVX-512 Frequency Throttling

On some Intel CPUs, heavy AVX-512 use causes frequency reduction (up to 20%). The wider operations may not compensate for the lower clock speed.

4. Forgetting the Scalar Remainder

When N isn't divisible by vector width, you need scalar cleanup code.

Summary

  • SIMD provides 2-8x speedup for data-parallel workloads
  • Start with auto-vectorization: Use -O3 -march=native, check compiler output
  • Help the compiler: Use restrict, avoid aliasing, keep loops simple
  • Manual SIMD when needed: Intrinsics for maximum control, portable libraries for cross-platform
  • Measure carefully: Verify vectorization happened, account for memory bandwidth