Chapter 17: Multi-core Performance

Part V: Parallelism & Low-Level Optimization


"More threads doesn't always mean more performance." — Every performance engineer

The Thread That Made It Slower

A team was optimizing a data processing pipeline. The single-threaded version processed 100,000 records per second. "Easy win," they thought—just parallelize it across 8 cores.

The 8-threaded version processed... 60,000 records per second.

After a week of investigation, they found the culprit: false sharing. Each thread had its own counter variable, but all counters happened to be allocated on the same cache line. Every increment by any thread invalidated the cache for all other threads. The "parallel" code was actually serialized by cache coherence traffic.

Adding 56 bytes of padding between counters brought performance to 750,000 records per second—7.5x the original, finally matching expectations.

Parallelism Fundamentals

Amdahl's Law: The Ceiling You Can't Break

Speedup = 1 / ((1-P) + P/N)

Where:
- P = parallelizable fraction
- N = number of processors

The brutal math: If 10% of your code is sequential (P = 0.9), maximum speedup with infinite processors is 10x. With 8 cores, you get 4.7x. With 64 cores, you get 8.8x.

Speedup vs Cores (P = 0.9):
     │
  10 │                    ─────────── Theoretical max
     │              ╱
   8 │         ╱
     │     ╱
   4 │  ╱
     │╱
   1 └────────────────────────────
     1    4    8   16   32   64  Cores

Gustafson's Law: Scale the Problem

Amdahl assumes fixed problem size. Gustafson observed that in practice, we often scale the problem with the hardware:

Scaled Speedup = N + (1-N) × s

Where s = sequential fraction

If you have 8 cores and 5% sequential work, scaled speedup = 8 + (1-8) × 0.05 = 7.65x.

Key insight: Larger problems often have proportionally less sequential work.

Thread Overhead

Thread Creation Cost

OperationTypical Cost
Thread creation (OS)10-100 μs
Thread pool task dispatch0.1-1 μs
Context switch1-10 μs
Mutex lock (uncontended)10-50 ns
Mutex lock (contended)1-100 μs
Atomic increment5-50 ns

Rule of thumb: Don't create threads for tasks shorter than 10-100 μs. Use thread pools.

Context Switch Overhead

When threads exceed available cores, the OS must context switch. Each switch:

  • Saves/restores registers
  • May flush TLB entries
  • Pollutes caches with new thread's data

Symptom: Performance degrades when thread count >> core count.

Synchronization Costs

Lock Contention

Uncontended lock:
Thread 1: [acquire]────[work]────[release]
                                          Thread 2: [acquire]────[work]────[release]

Contended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2:          [spin/wait............][acquire]────[work]────[release]
                   ↑ Wasted time

Contention scales poorly: With N threads competing for one lock, average wait time grows as O(N).

Atomic Operations

Atomics avoid locks but aren't free:

std::atomic<int> counter;
counter++;  // Generates LOCK XADD instruction

Cost: 5-50 ns per atomic operation, depending on contention. Under high contention, atomics can be as slow as locks.

False Sharing: The Silent Killer

Cache Line (64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ counter_1 │ counter_2 │ counter_3 │ counter_4 │ ... padding ... │
│  (4 bytes) │  (4 bytes) │  (4 bytes) │  (4 bytes) │               │
└────────────────────────────────────────────────────────────────┘
     ↑            ↑            ↑            ↑
  Thread 1    Thread 2    Thread 3    Thread 4

Every write invalidates the entire line for all other threads!

Solution: Pad to cache line boundaries:

struct alignas(64) PaddedCounter {
    std::atomic<int> value;
    char padding[60];  // Fill to 64 bytes
};

Cache Coherence: MESI Protocol

When multiple cores share memory, hardware maintains coherence via MESI:

StateMeaning
ModifiedThis cache has the only valid copy (dirty)
ExclusiveThis cache has the only copy (clean)
SharedMultiple caches have this line
InvalidThis cache line is stale

Cache line bouncing: When two cores repeatedly write to the same line, it bounces between Modified states, generating bus traffic.

Measuring Parallel Performance

Scalability Metrics

Strong scaling: Fixed problem size, increase cores.

  • Ideal: 2x cores → 2x speedup
  • Reality: Diminishing returns due to Amdahl's Law

Weak scaling: Proportional problem size (2x cores → 2x data).

  • Ideal: Constant time regardless of core count
  • Reality: Communication overhead grows

Profiling Tools

# Linux perf for cache and synchronization events
perf stat -e cache-misses,cache-references,\
          context-switches,cpu-migrations ./program

# Intel VTune for detailed threading analysis
vtune -collect threading ./program

# Detect false sharing
perf c2c record ./program
perf c2c report

Patterns for Good Parallelism

Embarrassingly Parallel

No communication between tasks. Maximum scalability.

// Perfect parallelism: each iteration independent
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    result[i] = expensive_computation(input[i]);
}

Data Parallelism with Reduction

// Reduction: each thread accumulates locally, then combine
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += data[i];
}

Pipeline Parallelism

Stage 1 (Parse) → Stage 2 (Process) → Stage 3 (Write)
     ↓                  ↓                   ↓
  Thread 1           Thread 2           Thread 3

Good when stages have similar duration. Bottleneck is the slowest stage.

Task Parallelism with Work Stealing

// Task-based parallelism (TBB, OpenMP tasks)
tbb::parallel_for(0, n, [&](int i) {
    process(items[i]);
});

Work stealing balances load dynamically: idle threads steal tasks from busy threads' queues.

Common Anti-Patterns

1. Over-Synchronization

// Bad: Lock held during I/O
mutex.lock();
result = expensive_network_call();  // Other threads blocked!
mutex.unlock();

// Better: Minimize critical section
auto data = expensive_network_call();
mutex.lock();
result = data;
mutex.unlock();

2. Thread-per-Request

Creating a thread for each request doesn't scale. Use thread pools with bounded concurrency.

3. Ignoring NUMA

On multi-socket systems, memory access time varies by location:

Socket 0          Socket 1
┌─────────┐      ┌─────────┐
│ Core 0-7│      │Core 8-15│
│ Memory A│ ←──→ │Memory B │
└─────────┘      └─────────┘
     ↑ Local: 80ns    ↑ Remote: 150ns

Solution: Pin threads to cores, allocate memory locally.

Summary

  • Amdahl's Law sets hard limits on parallel speedup. Know your sequential fraction.
  • Synchronization is expensive. Minimize critical sections, prefer lock-free when possible.
  • False sharing is a silent killer. Pad data structures to cache line boundaries.
  • Measure scalability with both strong and weak scaling tests.
  • Use appropriate patterns: Embarrassingly parallel > reduction > pipeline > fine-grained locking.