Chapter 17: Multi-core Performance

Part V: Parallelism & Low-Level Optimization

"More threads doesn't always mean more performance." — Every performance engineer

The Thread That Made It Slower

A team was optimizing a data processing pipeline. The single-threaded version processed 100,000 records per second. "Easy win," they thought—just parallelize it across 8 cores.

The 8-threaded version processed... 60,000 records per second.

After a week of investigation, they found the culprit: false sharing. Each thread had its own counter variable, but all counters happened to be allocated on the same cache line. Every increment by any thread invalidated the cache for all other threads. The "parallel" code was actually serialized by cache coherence traffic.

Adding 56 bytes of padding between counters brought performance to 750,000 records per second—7.5x the original, finally matching expectations.

Parallelism Fundamentals

Amdahl's Law: The Ceiling You Can't Break

Speedup = 1 / ((1-P) + P/N)

Where:
- P = parallelizable fraction
- N = number of processors

The brutal math: If 10% of your code is sequential (P = 0.9), maximum speedup with infinite processors is 10x. With 8 cores, you get 4.7x. With 64 cores, you get 8.8x.

Speedup vs Cores (P = 0.9):
     │
  10 │                    ─────────── Theoretical max
     │              ╱
   8 │         ╱
     │     ╱
   4 │  ╱
     │╱
   1 └────────────────────────────
     1    4    8   16   32   64  Cores

Gustafson's Law: Scale the Problem

Amdahl assumes fixed problem size. Gustafson observed that in practice, we often scale the problem with the hardware:

Scaled Speedup = N + (1-N) × s

Where s = sequential fraction

If you have 8 cores and 5% sequential work, scaled speedup = 8 + (1-8) × 0.05 = 7.65x.

Key insight: Larger problems often have proportionally less sequential work.

Thread Overhead

Thread Creation Cost

Operation	Typical Cost
Thread creation (OS)	10-100 μs
Thread pool task dispatch	0.1-1 μs
Context switch	1-10 μs
Mutex lock (uncontended)	10-50 ns
Mutex lock (contended)	1-100 μs
Atomic increment	5-50 ns

Rule of thumb: Don't create threads for tasks shorter than 10-100 μs. Use thread pools.

Context Switch Overhead

When threads exceed available cores, the OS must context switch. Each switch:

Saves/restores registers
May flush TLB entries
Pollutes caches with new thread's data

Symptom: Performance degrades when thread count >> core count.

Synchronization Costs

Lock Contention

Uncontended lock:
Thread 1: [acquire]────[work]────[release]
                                          Thread 2: [acquire]────[work]────[release]

Contended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2:          [spin/wait............][acquire]────[work]────[release]
                   ↑ Wasted time

Contention scales poorly: With N threads competing for one lock, average wait time grows as O(N).

Atomic Operations

Atomics avoid locks but aren't free:

std::atomic<int> counter;
counter++;  // Generates LOCK XADD instruction

Cost: 5-50 ns per atomic operation, depending on contention. Under high contention, atomics can be as slow as locks.

Cache Line (64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ counter_1 │ counter_2 │ counter_3 │ counter_4 │ ... padding ... │
│  (4 bytes) │  (4 bytes) │  (4 bytes) │  (4 bytes) │               │
└────────────────────────────────────────────────────────────────┘
     ↑            ↑            ↑            ↑
  Thread 1    Thread 2    Thread 3    Thread 4

Every write invalidates the entire line for all other threads!

Solution: Pad to cache line boundaries:

struct alignas(64) PaddedCounter {
    std::atomic<int> value;
    char padding[60];  // Fill to 64 bytes
};

Cache Coherence: MESI Protocol

When multiple cores share memory, hardware maintains coherence via MESI:

State	Meaning
Modified	This cache has the only valid copy (dirty)
Exclusive	This cache has the only copy (clean)
Shared	Multiple caches have this line
Invalid	This cache line is stale

Cache line bouncing: When two cores repeatedly write to the same line, it bounces between Modified states, generating bus traffic.

Measuring Parallel Performance

Scalability Metrics

Strong scaling: Fixed problem size, increase cores.

Ideal: 2x cores → 2x speedup
Reality: Diminishing returns due to Amdahl's Law

Weak scaling: Proportional problem size (2x cores → 2x data).

Ideal: Constant time regardless of core count
Reality: Communication overhead grows

Profiling Tools

# Linux perf for cache and synchronization events
perf stat -e cache-misses,cache-references,\
          context-switches,cpu-migrations ./program

# Intel VTune for detailed threading analysis
vtune -collect threading ./program

# Detect false sharing
perf c2c record ./program
perf c2c report

Patterns for Good Parallelism

Embarrassingly Parallel

No communication between tasks. Maximum scalability.

// Perfect parallelism: each iteration independent
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    result[i] = expensive_computation(input[i]);
}

Data Parallelism with Reduction

// Reduction: each thread accumulates locally, then combine
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += data[i];
}

Pipeline Parallelism

Stage 1 (Parse) → Stage 2 (Process) → Stage 3 (Write)
     ↓                  ↓                   ↓
  Thread 1           Thread 2           Thread 3

Good when stages have similar duration. Bottleneck is the slowest stage.

Task Parallelism with Work Stealing

// Task-based parallelism (TBB, OpenMP tasks)
tbb::parallel_for(0, n, [&](int i) {
    process(items[i]);
});

Work stealing balances load dynamically: idle threads steal tasks from busy threads' queues.

Common Anti-Patterns

1. Over-Synchronization

// Bad: Lock held during I/O
mutex.lock();
result = expensive_network_call();  // Other threads blocked!
mutex.unlock();

// Better: Minimize critical section
auto data = expensive_network_call();
mutex.lock();
result = data;
mutex.unlock();

2. Thread-per-Request

Creating a thread for each request doesn't scale. Use thread pools with bounded concurrency.

3. Ignoring NUMA

On multi-socket systems, memory access time varies by location:

Socket 0          Socket 1
┌─────────┐      ┌─────────┐
│ Core 0-7│      │Core 8-15│
│ Memory A│ ←──→ │Memory B │
└─────────┘      └─────────┘
     ↑ Local: 80ns    ↑ Remote: 150ns

Solution: Pin threads to cores, allocate memory locally.

Summary

Amdahl's Law sets hard limits on parallel speedup. Know your sequential fraction.
Synchronization is expensive. Minimize critical sections, prefer lock-free when possible.
False sharing is a silent killer. Pad data structures to cache line boundaries.
Measure scalability with both strong and weak scaling tests.
Use appropriate patterns: Embarrassingly parallel > reduction > pipeline > fine-grained locking.

Performance and Benchmarking