Chapter 10: Performance Modeling
Part III: Theory
"All models are wrong, but some are useful." — George Box
The "Theoretically Impossible" Optimization
"That's impossible. Your optimization violates Amdahl's Law."
That's what a senior engineer said during code review. I claimed to have sped up a program by 10×, but according to his analysis, only 50% was parallelizable—so by Amdahl's Law, the theoretical limit was 2×.
He was right—if I had just added more threads.
But I didn't parallelize. I changed the algorithm from O(n²) to O(n log n). That's outside Amdahl's Law's scope.
This experience taught me two things:
- Performance models are useful, but know their applicable scope
- Sometimes, breaking out of the model's framework is where real breakthroughs happen
This chapter covers the most important performance models in performance engineering:
- Amdahl's Law: The parallelization ceiling
- Gustafson's Law: The scaling horizon
- Universal Scalability Law: Real-world gravity
- Roofline Model: Compute or Memory bound?
- Little's Law: System's physical conservation
- Queuing Theory: Why 90% utilization causes collapse
Amdahl's Law: The Parallelization Ceiling
Historical Background
In 1967, legendary computer architect Gene Amdahl presented a profoundly influential paper at the AFIPS Spring Joint Computer Conference. The industry was debating whether to invest in "single extremely fast processors" or research connecting multiple "relatively slower processors" for parallel computing.
Amdahl pointed out that regardless of hardware scaling, programs always contain serial portions that cannot be parallelized—such as I/O initialization, memory allocation, or specific logical dependencies. These serial portions become the "ceiling" of overall system performance.
Basic Formula
Assume a program has a parallelizable portion (fraction p) and a serial portion (fraction 1-p):
Speedup = 1 / ((1 - p) + p/n)
Where:
- p = parallelizable fraction
- n = number of processors
- 1-p = serial fraction
Visualization
Original program (single thread):
┌──────────────────────────────────────────┐
│ Serial (20%) │ Parallel (80%) │
└──────────────────────────────────────────┘
Total: 100 time units
4 threads:
┌───────────┬──────────┐
│ Serial │ Parallel │ Thread 1
│ (20%) │ (20%) │
└───────────┴──────────┘
│ (20%) │ Thread 2
├──────────┤
│ (20%) │ Thread 3
├──────────┤
│ (20%) │ Thread 4
└──────────┘
Total: 20 + 20 = 40 time units
Speedup: 100/40 = 2.5×
Practical Calculation
def amdahl_speedup(p, n):
"""
p: parallelizable fraction (0 to 1)
n: number of processors
"""
return 1 / ((1 - p) + p / n)
# 80% parallelizable
p = 0.8
print(f"1 processor: {amdahl_speedup(p, 1):.2f}x")
print(f"2 processors: {amdahl_speedup(p, 2):.2f}x")
print(f"4 processors: {amdahl_speedup(p, 4):.2f}x")
print(f"8 processors: {amdahl_speedup(p, 8):.2f}x")
print(f"16 processors: {amdahl_speedup(p, 16):.2f}x")
print(f"∞ processors: {amdahl_speedup(p, 1000000):.2f}x")
1 processor: 1.00x
2 processors: 1.67x
4 processors: 2.50x
8 processors: 3.33x
16 processors: 4.00x
∞ processors: 5.00x ← This is the ceiling
The Harsh Reality
| Parallelizable Fraction | Theoretical Max Speedup |
|---|---|
| 50% | 2× |
| 75% | 4× |
| 90% | 10× |
| 95% | 20× |
| 99% | 100× |
Even if 99% of code is parallelizable, maximum speedup is only 100×. That 1% serial portion determines the ceiling.
Amdahl's Law Limitations
1. Assumes fixed workload
Amdahl's Law assumes total work is constant. In reality, more resources might mean processing larger problems (Gustafson's Law).
2. Ignores parallelization overhead
In practice, adding threads brings:
- Thread creation/destruction costs
- Synchronization costs (mutex, barrier)
- Cache coherence overhead
- False sharing
3. Only considers CPU
Doesn't account for memory bandwidth, I/O, network, or other bottlenecks.
Real-World Serial Bottlenecks
In practice, serial bottlenecks often hide in details:
- Lock Contention: Even with 128 threads, if they all compete for the same
mutex, lock-waiting time is serial - I/O Operations: Reading disk or network packets is typically sequential
- Memory Allocation: Frequent
malloccalls may cause global lock contention in the allocator
How to measure the parallel fraction p? Typically through empirical measurement: measure execution time at different core counts and fit backwards to find p. See Appendix H for detailed measurement methods and Python fitting code.
Gustafson's Law: A Different Perspective
Gustafson proposed a different assumption: more processors means solving larger problems, not solving the same problem faster.
Formula
Speedup = (1 - p) + p × n
Where:
- p = parallel portion fraction
- n = number of processors
Comparison
def gustafson_speedup(p, n):
return (1 - p) + p * n
p = 0.8 # 80% parallel
print("Processors | Amdahl | Gustafson")
print("-" * 35)
for n in [1, 2, 4, 8, 16, 64]:
a = amdahl_speedup(p, n)
g = gustafson_speedup(p, n)
print(f"{n:10} | {a:6.2f} | {g:6.2f}")
Processors | Amdahl | Gustafson
-----------------------------------
1 | 1.00 | 1.00
2 | 1.67 | 1.80
4 | 2.50 | 3.40
8 | 3.33 | 6.60
16 | 4.00 | 13.00
64 | 4.71 | 51.40
Gustafson's view: 64 processors can process 51× larger problems.
When to Use Which?
| Scenario | Applicable Law | Key Metric |
|---|---|---|
| Fixed-size problem, finish faster | Amdahl | Strong Scaling |
| Fixed time, process larger problem | Gustafson | Weak Scaling |
| Real-time systems (fixed deadline) | Amdahl | Latency |
| Scientific computing (bigger is better) | Gustafson | Throughput |
| UI response, single function optimization | Amdahl | Response Time |
| Big data processing, AI training | Gustafson | Data Volume |
Practical advice: If customers complain "App starts too slowly," use Amdahl to find serial bottlenecks; if they want "more transactions in the same time," use Gustafson to think about scaling.
Universal Scalability Law: Real-World Gravity
If Amdahl's Law is the performance "ceiling" and Gustafson's Law is the "distant horizon," then the Universal Scalability Law (USL) is real-world "gravity"—it explains why some systems' performance not only plateaus but actually declines when adding more cores.
Why Amdahl Isn't Enough
In 1993, Neil Gunther proposed USL to address "Negative Scaling" phenomena. Amdahl assumes performance eventually approaches a constant, but in real distributed systems, we often see performance reach a peak then decline.
Amdahl only considers "work being serialized" overhead, ignoring the "communication and coordination" between cores needed to maintain data consistency.
Formula and Parameters
C(N) = N / (1 + σ(N-1) + κN(N-1))
Where:
- N = number of processors
- σ (sigma) = contention coefficient - overhead from waiting for same resource
- κ (kappa) = coherence coefficient - communication overhead for consistency
Key insights:
- σ is linear: Like Amdahl's serial portion, growth slows then plateaus
- κ is quadratic: N(N-1) represents pairwise node communication, overhead grows explosively
Three Scaling Behaviors
Performance
^
| Linear (σ=0, κ=0)
| /
| / Amdahl (κ=0)
| / _______________
| / /
| / / USL (σ>0, κ>0)
| / / /\
| / / / \ ← Retrograde!
|/ / / \
└────────────────────────> N (cores)
When κ > 0, systems exhibit "Retrograde Behavior"—beyond a critical point, communication overhead exceeds computation gains, and performance declines.
Identifying System Bottlenecks
- Contention-bound (σ dominant): Curve gradually flattens like a slope. Optimization: reduce lock contention, shrink critical sections
- Coherence-bound (κ dominant): Curve like a mountain with clear peak then rapid drop. Optimization: reduce cross-node communication, avoid false sharing
Practical Application
USL's greatest power: with just a few measurement points (e.g., N=1,2,4,8), you can predict system behavior at N=64.
# Optimal parallelism
N_optimal = sqrt((1 - sigma) / kappa)
Real-World Example: Web Server Scaling
A team benchmarked their API server at different instance counts:
Instances │ Throughput (req/s) │ Speedup
──────────┼────────────────────┼─────────
1 │ 1,000 │ 1.0×
2 │ 1,850 │ 1.85×
4 │ 3,200 │ 3.2×
8 │ 4,800 │ 4.8×
16 │ 5,600 │ 5.6×
32 │ 4,200 │ 4.2× ← Degradation!
After USL fitting: σ = 0.05, κ = 0.008
Diagnosis: High κ indicates coherence-bound behavior—each request hits a shared database, causing cross-instance coordination overhead. The optimal instance count is √((1-0.05)/0.008) ≈ 11 instances.
Solution: Add read replicas and cache layer to reduce database roundtrips.
USL Limitations
- Assumes homogeneous nodes: All processors/nodes must be identical
- Steady-state assumption: Doesn't capture transient behavior during ramp-up
- Single resource model: Real systems have multiple bottlenecks (CPU, memory, network)
- Fitting sensitivity: Results depend on measurement quality; outliers can skew σ and κ
See Appendix H for detailed Python fitting code and case studies.
Roofline Model: Finding the Bottleneck
The Roofline Model is a visualization tool proposed by UC Berkeley in 2008 for analyzing whether a program is compute-bound or memory-bound.
Core Concepts
Every program has two characteristics:
-
Operational Intensity (OI): How many operations per byte of memory access
OI = FLOPs / Bytes moved -
Attainable Performance: Actual achievable FLOPS
The system has two limits:
- Peak Compute: Maximum CPU FLOPS (horizontal line)
- Peak Memory Bandwidth: Memory bandwidth limit (sloped line)
The Roofline Diagram
Performance (GFLOPS)
^
| __________________ Peak Compute (roof)
| /← Ridge Point
| /
| /
| / ← Memory Bandwidth (slope)
| /
| /
| /
| /
| /
|──────────────────────────────────────> Operational Intensity
(FLOPs/Byte)
Calculation Example
Assume a system with:
- Peak Compute: 100 GFLOPS
- Memory Bandwidth: 50 GB/s
def roofline_performance(oi, peak_compute, bandwidth):
"""
oi: Operational Intensity (FLOPs/Byte)
peak_compute: Peak GFLOPS
bandwidth: Memory bandwidth (GB/s)
"""
memory_bound = oi * bandwidth # GFLOPS
return min(memory_bound, peak_compute)
peak = 100 # GFLOPS
bw = 50 # GB/s
ridge_point = peak / bw # 2 FLOPs/Byte
print("Operational Intensity | Attainable GFLOPS | Bound")
print("-" * 55)
for oi in [0.1, 0.5, 1, 2, 4, 8, 16]:
perf = roofline_performance(oi, peak, bw)
bound = "Memory" if oi < ridge_point else "Compute"
print(f"{oi:21.1f} | {perf:17.1f} | {bound}")
Operational Intensity | Attainable GFLOPS | Bound
-------------------------------------------------------
0.1 | 5.0 | Memory
0.5 | 25.0 | Memory
1.0 | 50.0 | Memory
2.0 | 100.0 | Compute ← Ridge Point
4.0 | 100.0 | Compute
8.0 | 100.0 | Compute
16.0 | 100.0 | Compute
Real-World Examples
Operational Intensity of different algorithms:
| Algorithm | OI (FLOPs/Byte) | Usually... |
|---|---|---|
| STREAM copy | 0 | Memory-bound |
| SpMV (sparse) | 0.25 | Memory-bound |
| BLAS Level 1 | 0.25-0.5 | Memory-bound |
| Stencil | 0.5-1 | Memory-bound |
| BLAS Level 2 | 1-2 | Borderline |
| Dense GEMM | High | Compute-bound |
| FFT | Medium | Depends on implementation |
Cache-Aware Roofline Model (CARM)
Traditional Roofline only considers DRAM bandwidth, but modern processors have multiple cache levels. CARM draws different rooflines for each memory level:
Performance
^
| __________________________ L1 Peak (highest slope)
| /_________________________ L2 Peak
|//________________________ L3 Peak
|||_______________________ DRAM Peak (lowest slope)
|||/
||/
|/
└──────────────────────────> Operational Intensity
Diagnostic logic: If your point falls below the DRAM slope, the problem is cache misses (optimize prefetching, data locality); if near L1 but below compute peak, arithmetic intensity is insufficient (consider loop fusion).
Multi-core Roofline Considerations
- Shared Bandwidth: Multiple cores share DRAM bus, total bandwidth saturates faster
- Ridge Point shifts right: Compute peak increases linearly with cores, but bandwidth doesn't, making programs more likely to be memory-bound
- NUMA effects: Local DRAM bandwidth is much higher than Remote DRAM; label separately
Roofline Limitations
- Static view: Doesn't capture phase behavior—a program may be memory-bound in one phase, compute-bound in another
- Assumes perfect overlap: Ignores latency hiding and out-of-order execution limitations
- Single bottleneck model: Real programs may have mixed OI across different kernels
- Measurement challenges: Accurately counting FLOPs and bytes moved requires careful instrumentation
See Appendix H for detailed CARM analysis and tool usage.
Little's Law: System's Physical Conservation
In performance engineering, some laws transcend algorithms and hardware architectures. Little's Law is like conservation of energy in physics—it defines the fundamental boundaries of system operation.
In 1961, John Little proved that in stable systems, three core metrics have an invariant relationship.
Formula
L = λ × W
Where:
- L = average number of items in system (in-flight requests)
- λ = arrival rate (throughput)
- W = average wait time (latency)
Intuitive Understanding
Imagine a restaurant:
30 people in restaurant (L)
10 people arrive per minute (λ)
Each person stays 3 minutes on average (W)
L = λ × W
30 = 10 × 3 ✓
Applications in Computer Systems
1. Memory System
Outstanding memory requests = Bandwidth × Latency
Example:
- Memory latency: 100 ns
- Required bandwidth: 50 GB/s
Outstanding requests = 50 GB/s × 100 ns = 5000 bytes = 78 cache lines
If CPU can only maintain 16 outstanding requests,
Actual bandwidth = 16 × 64 bytes / 100 ns = 10.24 GB/s
This is why modern CPUs need deep memory hierarchies and prefetchers.
2. Network System
Bandwidth-Delay Product (BDP) = Bandwidth × RTT
Example trans-Pacific connection:
- Bandwidth: 10 Gbps
- RTT: 150 ms
BDP = 10 Gbps × 150 ms = 1.5 Gb = 187.5 MB
TCP window needs at least 187.5 MB to fill the pipe
3. Concurrent System
Throughput = Concurrency / Latency
Example web server:
- Each request latency: 50 ms
- Want 1000 requests/sec throughput
Required concurrency = 1000 × 0.05 = 50 concurrent requests
Key Prerequisites
Little's Law is powerful because it makes no assumptions about task distribution or service order. But three prerequisites must hold:
- System must be stable: Arrival rate = Departure rate. If tasks keep accumulating, the formula fails
- Long-term average: Describes equilibrium state, not instantaneous bursts
- Task conservation: Tasks don't disappear (like dropped packets) or self-replicate
Dimensional analysis verification: Throughput [tasks/time] × Latency [time/task] = [tasks]
Diagnostic Value When Formula "Fails"
When measured data doesn't match L = λW, this typically indicates:
- Actual concurrency > Expected: Long-tail requests inflating average, or resource leaks (unclosed connections)
- Actual concurrency < Expected: Throughput overestimated, or tasks being batched
Little's Law is the performance engineer's "Sanity Check." See Appendix H for detailed mathematical proof, verification methods, and architectural applications.
Queuing Theory Fundamentals
If Little's Law tells us system's physical conservation, queuing theory reveals the dynamic changes as requests wait to be processed. Understanding queuing theory explains why systems "suddenly collapse" at 90% load.
M/M/1 Model
This is the most basic queuing model, assuming Poisson arrivals, exponential service times, and a single server.
Key formulas:
- ρ = λ/μ (utilization, λ=arrival rate, μ=service rate)
- L = ρ/(1-ρ) (average queue length)
- W = 1/(μ-λ) (average wait time)
Why does latency explode at 90% utilization?
When ρ → 1, (1-ρ) → 0, causing W → ∞. Any small arrival fluctuation causes queue accumulation, and subsequent requests' wait times grow in a chain reaction—this is the "Hockey Stick Effect."
Practical Rules of Thumb
- 70% Rule: For latency-sensitive systems, keep Utilization ≤ 70%
- Latency multiplier: At 50% utilization, wait time is 2× pure processing time; at 90% it's 10×
- Separate processing from waiting: If response time increases but Service Time hasn't changed, the problem is "queuing"
Real-World Example: Database Connection Pool Sizing
A service connects to PostgreSQL with a connection pool. Current setup:
- Arrival rate: 500 queries/sec
- Average query time: 10ms
- Current pool size: 10 connections
Analysis using M/M/c:
ρ = λ / (c × μ) = 500 / (10 × 100) = 0.5 per connection
With 10 connections at 50% average utilization, queuing probability is acceptable.
But during peak hours:
Peak arrival: 800 queries/sec
ρ = 800 / (10 × 100) = 0.8 per connection
At 80% utilization, wait time ≈ 4× service time = 40ms added latency!
Solution: Increase pool to 15 connections:
ρ = 800 / (15 × 100) = 0.53 per connection
Wait time drops to ~1.1× service time = 11ms added latency
See Appendix H for detailed M/M/c model, Erlang C formula, and capacity planning.
Integrated Application: Finding the Real Bottleneck
Let's use an example to apply these models together.
Problem
You have an image processing pipeline:
- Read image (I/O)
- Apply filter (compute)
- Write result (I/O)
Current performance: 100 images/sec Target: 500 images/sec
Analysis
Step 1: Amdahl Analysis
First, measure each stage's time:
Read: 2 ms (20%)
Filter: 6 ms (60%)
Write: 2 ms (20%)
Total: 10 ms
If we only optimize Filter (parallelize):
# Filter is 60%, even with infinite parallelism
max_speedup = amdahl_speedup(0.6, float('inf'))
print(f"Max speedup: {max_speedup:.2f}x") # 2.5x
2.5x only gets us to 250 images/sec—not enough.
Step 2: Roofline Analysis of Filter
Filter characteristics:
- Per pixel: 20 FLOPs
- Per pixel: read 4 bytes, write 4 bytes = 8 bytes
- OI = 20/8 = 2.5 FLOPs/Byte
System:
- Peak: 200 GFLOPS
- Bandwidth: 50 GB/s
- Ridge point: 4 FLOPs/Byte
OI = 2.5 < 4 → Memory-bound
Optimization direction: not more threads, but improve memory access pattern.
Step 3: Little's Law Analysis of I/O
Read throughput: 500 images/sec × 4 MB/image = 2 GB/s
Disk latency: assume SSD, 0.1 ms
Required queue depth = 2 GB/s × 0.1 ms = 200 KB ≈ 50 images
But we're using sync I/O (queue depth = 1)
Problem found: I/O isn't pipelined, need async I/O.
Solution
- Use async I/O with queue depth = 64
- Improve filter's cache locality (loop tiling)
- Now filter is compute-bound, can apply SIMD optimization
Result: Achieved 600 images/sec.
Common Pitfalls
Pitfall 1: Only Looking at Averages
Bad: "Average latency 10ms, throughput 100/sec"
Little's Law: L = 100 × 0.01 = 1
Good: "Average latency 10ms, P99 latency 500ms"
That P99 might cause problems
Pitfall 2: Ignoring Model Assumptions
Amdahl's Law assumes:
- Fixed workload
- No parallelization overhead
- Perfect parallelism (no synchronization waits)
None of these hold in reality.
Pitfall 3: Over-trusting Roofline
Roofline says your program is memory-bound, but:
- Might be due to poor cache miss patterns
- Might be because prefetcher can't predict
- Might be due to false sharing
Need deeper analysis (perf, VTune).
Pitfall 4: Confusing Throughput and Latency
System A: 1000 req/s, 100ms latency
System B: 800 req/s, 10ms latency
Which is better? Depends on your needs.
Little's Law:
A: 1000 × 0.1 = 100 concurrent
B: 800 × 0.01 = 8 concurrent
A needs more resources to maintain that throughput
Which Model to Use: Decision Guide
┌─────────────────────────────────┐
│ What's your performance │
│ question? │
└────────────────┬────────────────┘
│
┌────────────────────────────┼────────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ "Will more cores │ │ "Is my code │ │ "Why is latency │
│ help?" │ │ compute or │ │ so high?" │
└─────────┬─────────┘ │ memory bound?" │ └─────────┬─────────┘
│ └─────────┬─────────┘ │
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Amdahl's Law │ │ Roofline Model │ │ Little's Law │
│ (fixed workload) │ │ │ │ + Queuing Theory │
└─────────┬─────────┘ └───────────────────┘ └───────────────────┘
│
▼ Performance degrades at high N?
│
┌─────────┴─────────┐
│ Yes │ No
│ ▼ │ ▼
│ USL │ Gustafson
│ (find σ, κ) │ (scale workload)
└───────────────────┘
Quick Reference:
| Symptom | Model | Key Metric |
|---|---|---|
| Adding cores doesn't help | Amdahl | Serial fraction (1-p) |
| Performance drops at high N | USL | σ (contention), κ (coherence) |
| Slow despite high CPU | Roofline | Operational Intensity |
| Latency spikes at peak load | Queuing | ρ (utilization) |
| Throughput × Latency mismatch | Little's Law | L = λ × W |
Summary
Six Performance Models, six perspectives:
Amdahl's Law
- Theoretical upper limit for parallelization
- Serial portion determines the ceiling
- Use to judge "will adding more cores help"
Gustafson's Law
- More resources → process larger problems
- More optimistic than Amdahl
- Applies to scalable workloads
Universal Scalability Law (USL)
- Captures contention (σ) and coherence (κ) overhead
- Predicts performance degradation with more nodes
- Find optimal parallelism: N_opt = √((1-σ)/κ)
Roofline Model
- Compute-bound vs Memory-bound
- Visualize performance bottlenecks
- Guide optimization direction
Little's Law
- L = λ × W
- Connects throughput, latency, concurrency
- Diagnose queuing and resource utilization
Queuing Theory (M/M/1)
- Explains "Hockey Stick Effect" at high utilization
- 70% rule for latency-sensitive systems
- Response time = Service time / (1 - utilization)
Usage Recommendations
- First use Amdahl/Gustafson to evaluate parallelization potential
- Use USL when scaling shows degradation—identify σ vs κ bottlenecks
- Use Roofline to determine compute vs memory bound
- Use Little's Law to analyze throughput/latency relationships
- Use queuing theory to understand utilization vs latency tradeoffs
- Remember: all models are simplifications; actual measurement is always the final answer