Chapter 17: Multi-core Performance
Part V: Parallelism & Low-Level Optimization
"More threads doesn't always mean more performance." — Every performance engineer
The Thread That Made It Slower
A team was optimizing a data processing pipeline. The single-threaded version processed 100,000 records per second. "Easy win," they thought—just parallelize it across 8 cores.
The 8-threaded version processed... 60,000 records per second.
After a week of investigation, they found the culprit: false sharing. Each thread had its own counter variable, but all counters happened to be allocated on the same cache line. Every increment by any thread invalidated the cache for all other threads. The "parallel" code was actually serialized by cache coherence traffic.
Adding 56 bytes of padding between counters brought performance to 750,000 records per second—7.5x the original, finally matching expectations.
Parallelism Fundamentals
Amdahl's Law: The Ceiling You Can't Break
Speedup = 1 / ((1-P) + P/N)
Where:
- P = parallelizable fraction
- N = number of processors
The brutal math: If 10% of your code is sequential (P = 0.9), maximum speedup with infinite processors is 10x. With 8 cores, you get 4.7x. With 64 cores, you get 8.8x.
Speedup vs Cores (P = 0.9):
│
10 │ ─────────── Theoretical max
│ ╱
8 │ ╱
│ ╱
4 │ ╱
│╱
1 └────────────────────────────
1 4 8 16 32 64 Cores
Gustafson's Law: Scale the Problem
Amdahl assumes fixed problem size. Gustafson observed that in practice, we often scale the problem with the hardware:
Scaled Speedup = N + (1-N) × s
Where s = sequential fraction
If you have 8 cores and 5% sequential work, scaled speedup = 8 + (1-8) × 0.05 = 7.65x.
Key insight: Larger problems often have proportionally less sequential work.
Thread Overhead
Thread Creation Cost
| Operation | Typical Cost |
|---|---|
| Thread creation (OS) | 10-100 μs |
| Thread pool task dispatch | 0.1-1 μs |
| Context switch | 1-10 μs |
| Mutex lock (uncontended) | 10-50 ns |
| Mutex lock (contended) | 1-100 μs |
| Atomic increment | 5-50 ns |
Rule of thumb: Don't create threads for tasks shorter than 10-100 μs. Use thread pools.
Context Switch Overhead
When threads exceed available cores, the OS must context switch. Each switch:
- Saves/restores registers
- May flush TLB entries
- Pollutes caches with new thread's data
Symptom: Performance degrades when thread count >> core count.
Synchronization Costs
Lock Contention
Uncontended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2: [acquire]────[work]────[release]
Contended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2: [spin/wait............][acquire]────[work]────[release]
↑ Wasted time
Contention scales poorly: With N threads competing for one lock, average wait time grows as O(N).
Atomic Operations
Atomics avoid locks but aren't free:
std::atomic<int> counter;
counter++; // Generates LOCK XADD instruction
Cost: 5-50 ns per atomic operation, depending on contention. Under high contention, atomics can be as slow as locks.
False Sharing: The Silent Killer
Cache Line (64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ counter_1 │ counter_2 │ counter_3 │ counter_4 │ ... padding ... │
│ (4 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │ │
└────────────────────────────────────────────────────────────────┘
↑ ↑ ↑ ↑
Thread 1 Thread 2 Thread 3 Thread 4
Every write invalidates the entire line for all other threads!
Solution: Pad to cache line boundaries:
struct alignas(64) PaddedCounter {
std::atomic<int> value;
char padding[60]; // Fill to 64 bytes
};
Cache Coherence: MESI Protocol
When multiple cores share memory, hardware maintains coherence via MESI:
| State | Meaning |
|---|---|
| Modified | This cache has the only valid copy (dirty) |
| Exclusive | This cache has the only copy (clean) |
| Shared | Multiple caches have this line |
| Invalid | This cache line is stale |
Cache line bouncing: When two cores repeatedly write to the same line, it bounces between Modified states, generating bus traffic.
Measuring Parallel Performance
Scalability Metrics
Strong scaling: Fixed problem size, increase cores.
- Ideal: 2x cores → 2x speedup
- Reality: Diminishing returns due to Amdahl's Law
Weak scaling: Proportional problem size (2x cores → 2x data).
- Ideal: Constant time regardless of core count
- Reality: Communication overhead grows
Profiling Tools
# Linux perf for cache and synchronization events
perf stat -e cache-misses,cache-references,\
context-switches,cpu-migrations ./program
# Intel VTune for detailed threading analysis
vtune -collect threading ./program
# Detect false sharing
perf c2c record ./program
perf c2c report
Patterns for Good Parallelism
Embarrassingly Parallel
No communication between tasks. Maximum scalability.
// Perfect parallelism: each iteration independent
#pragma omp parallel for
for (int i = 0; i < n; i++) {
result[i] = expensive_computation(input[i]);
}
Data Parallelism with Reduction
// Reduction: each thread accumulates locally, then combine
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
sum += data[i];
}
Pipeline Parallelism
Stage 1 (Parse) → Stage 2 (Process) → Stage 3 (Write)
↓ ↓ ↓
Thread 1 Thread 2 Thread 3
Good when stages have similar duration. Bottleneck is the slowest stage.
Task Parallelism with Work Stealing
// Task-based parallelism (TBB, OpenMP tasks)
tbb::parallel_for(0, n, [&](int i) {
process(items[i]);
});
Work stealing balances load dynamically: idle threads steal tasks from busy threads' queues.
Common Anti-Patterns
1. Over-Synchronization
// Bad: Lock held during I/O
mutex.lock();
result = expensive_network_call(); // Other threads blocked!
mutex.unlock();
// Better: Minimize critical section
auto data = expensive_network_call();
mutex.lock();
result = data;
mutex.unlock();
2. Thread-per-Request
Creating a thread for each request doesn't scale. Use thread pools with bounded concurrency.
3. Ignoring NUMA
On multi-socket systems, memory access time varies by location:
Socket 0 Socket 1
┌─────────┐ ┌─────────┐
│ Core 0-7│ │Core 8-15│
│ Memory A│ ←──→ │Memory B │
└─────────┘ └─────────┘
↑ Local: 80ns ↑ Remote: 150ns
Solution: Pin threads to cores, allocate memory locally.
Summary
- Amdahl's Law sets hard limits on parallel speedup. Know your sequential fraction.
- Synchronization is expensive. Minimize critical sections, prefer lock-free when possible.
- False sharing is a silent killer. Pad data structures to cache line boundaries.
- Measure scalability with both strong and weak scaling tests.
- Use appropriate patterns: Embarrassingly parallel > reduction > pipeline > fine-grained locking.