Chapter 10: Performance Modeling

Part III: Theory


"All models are wrong, but some are useful." — George Box

The "Theoretically Impossible" Optimization

"That's impossible. Your optimization violates Amdahl's Law."

That's what a senior engineer said during code review. I claimed to have sped up a program by 10×, but according to his analysis, only 50% was parallelizable—so by Amdahl's Law, the theoretical limit was 2×.

He was right—if I had just added more threads.

But I didn't parallelize. I changed the algorithm from O(n²) to O(n log n). That's outside Amdahl's Law's scope.

This experience taught me two things:

  1. Performance models are useful, but know their applicable scope
  2. Sometimes, breaking out of the model's framework is where real breakthroughs happen

This chapter covers the most important performance models in performance engineering:

  • Amdahl's Law: The parallelization ceiling
  • Gustafson's Law: The scaling horizon
  • Universal Scalability Law: Real-world gravity
  • Roofline Model: Compute or Memory bound?
  • Little's Law: System's physical conservation
  • Queuing Theory: Why 90% utilization causes collapse

Amdahl's Law: The Parallelization Ceiling

Historical Background

In 1967, legendary computer architect Gene Amdahl presented a profoundly influential paper at the AFIPS Spring Joint Computer Conference. The industry was debating whether to invest in "single extremely fast processors" or research connecting multiple "relatively slower processors" for parallel computing.

Amdahl pointed out that regardless of hardware scaling, programs always contain serial portions that cannot be parallelized—such as I/O initialization, memory allocation, or specific logical dependencies. These serial portions become the "ceiling" of overall system performance.

Basic Formula

Assume a program has a parallelizable portion (fraction p) and a serial portion (fraction 1-p):

Speedup = 1 / ((1 - p) + p/n)

Where:
- p = parallelizable fraction
- n = number of processors
- 1-p = serial fraction

Visualization

Original program (single thread):
┌──────────────────────────────────────────┐
│ Serial (20%) │    Parallel (80%)         │
└──────────────────────────────────────────┘
Total: 100 time units

4 threads:
┌───────────┬──────────┐
│ Serial    │ Parallel │  Thread 1
│  (20%)    │  (20%)   │
└───────────┴──────────┘
            │  (20%)   │  Thread 2
            ├──────────┤
            │  (20%)   │  Thread 3
            ├──────────┤
            │  (20%)   │  Thread 4
            └──────────┘
Total: 20 + 20 = 40 time units
Speedup: 100/40 = 2.5×

Practical Calculation

def amdahl_speedup(p, n):
    """
    p: parallelizable fraction (0 to 1)
    n: number of processors
    """
    return 1 / ((1 - p) + p / n)

# 80% parallelizable
p = 0.8

print(f"1 processor:   {amdahl_speedup(p, 1):.2f}x")
print(f"2 processors:  {amdahl_speedup(p, 2):.2f}x")
print(f"4 processors:  {amdahl_speedup(p, 4):.2f}x")
print(f"8 processors:  {amdahl_speedup(p, 8):.2f}x")
print(f"16 processors: {amdahl_speedup(p, 16):.2f}x")
print(f"∞ processors:  {amdahl_speedup(p, 1000000):.2f}x")
1 processor:   1.00x
2 processors:  1.67x
4 processors:  2.50x
8 processors:  3.33x
16 processors: 4.00x
∞ processors:  5.00x  ← This is the ceiling

The Harsh Reality

Parallelizable FractionTheoretical Max Speedup
50%
75%
90%10×
95%20×
99%100×

Even if 99% of code is parallelizable, maximum speedup is only 100×. That 1% serial portion determines the ceiling.

Amdahl's Law Limitations

1. Assumes fixed workload

Amdahl's Law assumes total work is constant. In reality, more resources might mean processing larger problems (Gustafson's Law).

2. Ignores parallelization overhead

In practice, adding threads brings:

  • Thread creation/destruction costs
  • Synchronization costs (mutex, barrier)
  • Cache coherence overhead
  • False sharing

3. Only considers CPU

Doesn't account for memory bandwidth, I/O, network, or other bottlenecks.

Real-World Serial Bottlenecks

In practice, serial bottlenecks often hide in details:

  • Lock Contention: Even with 128 threads, if they all compete for the same mutex, lock-waiting time is serial
  • I/O Operations: Reading disk or network packets is typically sequential
  • Memory Allocation: Frequent malloc calls may cause global lock contention in the allocator

How to measure the parallel fraction p? Typically through empirical measurement: measure execution time at different core counts and fit backwards to find p. See Appendix H for detailed measurement methods and Python fitting code.

Gustafson's Law: A Different Perspective

Gustafson proposed a different assumption: more processors means solving larger problems, not solving the same problem faster.

Formula

Speedup = (1 - p) + p × n

Where:
- p = parallel portion fraction
- n = number of processors

Comparison

def gustafson_speedup(p, n):
    return (1 - p) + p * n

p = 0.8  # 80% parallel

print("Processors | Amdahl | Gustafson")
print("-" * 35)
for n in [1, 2, 4, 8, 16, 64]:
    a = amdahl_speedup(p, n)
    g = gustafson_speedup(p, n)
    print(f"{n:10} | {a:6.2f} | {g:6.2f}")
Processors | Amdahl | Gustafson
-----------------------------------
         1 |   1.00 |   1.00
         2 |   1.67 |   1.80
         4 |   2.50 |   3.40
         8 |   3.33 |   6.60
        16 |   4.00 |  13.00
        64 |   4.71 |  51.40

Gustafson's view: 64 processors can process 51× larger problems.

When to Use Which?

ScenarioApplicable LawKey Metric
Fixed-size problem, finish fasterAmdahlStrong Scaling
Fixed time, process larger problemGustafsonWeak Scaling
Real-time systems (fixed deadline)AmdahlLatency
Scientific computing (bigger is better)GustafsonThroughput
UI response, single function optimizationAmdahlResponse Time
Big data processing, AI trainingGustafsonData Volume

Practical advice: If customers complain "App starts too slowly," use Amdahl to find serial bottlenecks; if they want "more transactions in the same time," use Gustafson to think about scaling.

Universal Scalability Law: Real-World Gravity

If Amdahl's Law is the performance "ceiling" and Gustafson's Law is the "distant horizon," then the Universal Scalability Law (USL) is real-world "gravity"—it explains why some systems' performance not only plateaus but actually declines when adding more cores.

Why Amdahl Isn't Enough

In 1993, Neil Gunther proposed USL to address "Negative Scaling" phenomena. Amdahl assumes performance eventually approaches a constant, but in real distributed systems, we often see performance reach a peak then decline.

Amdahl only considers "work being serialized" overhead, ignoring the "communication and coordination" between cores needed to maintain data consistency.

Formula and Parameters

C(N) = N / (1 + σ(N-1) + κN(N-1))

Where:
- N = number of processors
- σ (sigma) = contention coefficient - overhead from waiting for same resource
- κ (kappa) = coherence coefficient - communication overhead for consistency

Key insights:

  • σ is linear: Like Amdahl's serial portion, growth slows then plateaus
  • κ is quadratic: N(N-1) represents pairwise node communication, overhead grows explosively

Three Scaling Behaviors

Performance
    ^
    |        Linear (σ=0, κ=0)
    |       /
    |      /    Amdahl (κ=0)
    |     /   _______________
    |    /   /
    |   /   /   USL (σ>0, κ>0)
    |  /   /  /\
    | /   /  /  \  ← Retrograde!
    |/   /  /    \
    └────────────────────────> N (cores)

When κ > 0, systems exhibit "Retrograde Behavior"—beyond a critical point, communication overhead exceeds computation gains, and performance declines.

Identifying System Bottlenecks

  • Contention-bound (σ dominant): Curve gradually flattens like a slope. Optimization: reduce lock contention, shrink critical sections
  • Coherence-bound (κ dominant): Curve like a mountain with clear peak then rapid drop. Optimization: reduce cross-node communication, avoid false sharing

Practical Application

USL's greatest power: with just a few measurement points (e.g., N=1,2,4,8), you can predict system behavior at N=64.

# Optimal parallelism
N_optimal = sqrt((1 - sigma) / kappa)

Real-World Example: Web Server Scaling

A team benchmarked their API server at different instance counts:

Instances │ Throughput (req/s) │ Speedup
──────────┼────────────────────┼─────────
    1     │      1,000         │   1.0×
    2     │      1,850         │   1.85×
    4     │      3,200         │   3.2×
    8     │      4,800         │   4.8×
   16     │      5,600         │   5.6×
   32     │      4,200         │   4.2× ← Degradation!

After USL fitting: σ = 0.05, κ = 0.008

Diagnosis: High κ indicates coherence-bound behavior—each request hits a shared database, causing cross-instance coordination overhead. The optimal instance count is √((1-0.05)/0.008) ≈ 11 instances.

Solution: Add read replicas and cache layer to reduce database roundtrips.

USL Limitations

  1. Assumes homogeneous nodes: All processors/nodes must be identical
  2. Steady-state assumption: Doesn't capture transient behavior during ramp-up
  3. Single resource model: Real systems have multiple bottlenecks (CPU, memory, network)
  4. Fitting sensitivity: Results depend on measurement quality; outliers can skew σ and κ

See Appendix H for detailed Python fitting code and case studies.

Roofline Model: Finding the Bottleneck

The Roofline Model is a visualization tool proposed by UC Berkeley in 2008 for analyzing whether a program is compute-bound or memory-bound.

Core Concepts

Every program has two characteristics:

  1. Operational Intensity (OI): How many operations per byte of memory access

    OI = FLOPs / Bytes moved
    
  2. Attainable Performance: Actual achievable FLOPS

The system has two limits:

  1. Peak Compute: Maximum CPU FLOPS (horizontal line)
  2. Peak Memory Bandwidth: Memory bandwidth limit (sloped line)

The Roofline Diagram

Performance (GFLOPS)
     ^
     |                    __________________ Peak Compute (roof)
     |                   /← Ridge Point
     |                  /
     |                 /
     |                /  ← Memory Bandwidth (slope)
     |               /
     |              /
     |             /
     |            /
     |           /
     |──────────────────────────────────────> Operational Intensity
                                               (FLOPs/Byte)

Calculation Example

Assume a system with:

  • Peak Compute: 100 GFLOPS
  • Memory Bandwidth: 50 GB/s
def roofline_performance(oi, peak_compute, bandwidth):
    """
    oi: Operational Intensity (FLOPs/Byte)
    peak_compute: Peak GFLOPS
    bandwidth: Memory bandwidth (GB/s)
    """
    memory_bound = oi * bandwidth  # GFLOPS
    return min(memory_bound, peak_compute)

peak = 100  # GFLOPS
bw = 50     # GB/s
ridge_point = peak / bw  # 2 FLOPs/Byte

print("Operational Intensity | Attainable GFLOPS | Bound")
print("-" * 55)
for oi in [0.1, 0.5, 1, 2, 4, 8, 16]:
    perf = roofline_performance(oi, peak, bw)
    bound = "Memory" if oi < ridge_point else "Compute"
    print(f"{oi:21.1f} | {perf:17.1f} | {bound}")
Operational Intensity | Attainable GFLOPS | Bound
-------------------------------------------------------
                  0.1 |               5.0 | Memory
                  0.5 |              25.0 | Memory
                  1.0 |              50.0 | Memory
                  2.0 |             100.0 | Compute  ← Ridge Point
                  4.0 |             100.0 | Compute
                  8.0 |             100.0 | Compute
                 16.0 |             100.0 | Compute

Real-World Examples

Operational Intensity of different algorithms:

AlgorithmOI (FLOPs/Byte)Usually...
STREAM copy0Memory-bound
SpMV (sparse)0.25Memory-bound
BLAS Level 10.25-0.5Memory-bound
Stencil0.5-1Memory-bound
BLAS Level 21-2Borderline
Dense GEMMHighCompute-bound
FFTMediumDepends on implementation

Cache-Aware Roofline Model (CARM)

Traditional Roofline only considers DRAM bandwidth, but modern processors have multiple cache levels. CARM draws different rooflines for each memory level:

Performance
    ^
    |  __________________________ L1 Peak (highest slope)
    | /_________________________ L2 Peak
    |//________________________ L3 Peak
    |||_______________________ DRAM Peak (lowest slope)
    |||/
    ||/
    |/
    └──────────────────────────> Operational Intensity

Diagnostic logic: If your point falls below the DRAM slope, the problem is cache misses (optimize prefetching, data locality); if near L1 but below compute peak, arithmetic intensity is insufficient (consider loop fusion).

Multi-core Roofline Considerations

  • Shared Bandwidth: Multiple cores share DRAM bus, total bandwidth saturates faster
  • Ridge Point shifts right: Compute peak increases linearly with cores, but bandwidth doesn't, making programs more likely to be memory-bound
  • NUMA effects: Local DRAM bandwidth is much higher than Remote DRAM; label separately

Roofline Limitations

  1. Static view: Doesn't capture phase behavior—a program may be memory-bound in one phase, compute-bound in another
  2. Assumes perfect overlap: Ignores latency hiding and out-of-order execution limitations
  3. Single bottleneck model: Real programs may have mixed OI across different kernels
  4. Measurement challenges: Accurately counting FLOPs and bytes moved requires careful instrumentation

See Appendix H for detailed CARM analysis and tool usage.

Little's Law: System's Physical Conservation

In performance engineering, some laws transcend algorithms and hardware architectures. Little's Law is like conservation of energy in physics—it defines the fundamental boundaries of system operation.

In 1961, John Little proved that in stable systems, three core metrics have an invariant relationship.

Formula

L = λ × W

Where:
- L = average number of items in system (in-flight requests)
- λ = arrival rate (throughput)
- W = average wait time (latency)

Intuitive Understanding

Imagine a restaurant:

30 people in restaurant (L)
10 people arrive per minute (λ)
Each person stays 3 minutes on average (W)

L = λ × W
30 = 10 × 3 ✓

Applications in Computer Systems

1. Memory System

Outstanding memory requests = Bandwidth × Latency

Example:
- Memory latency: 100 ns
- Required bandwidth: 50 GB/s

Outstanding requests = 50 GB/s × 100 ns = 5000 bytes = 78 cache lines

If CPU can only maintain 16 outstanding requests,
Actual bandwidth = 16 × 64 bytes / 100 ns = 10.24 GB/s

This is why modern CPUs need deep memory hierarchies and prefetchers.

2. Network System

Bandwidth-Delay Product (BDP) = Bandwidth × RTT

Example trans-Pacific connection:
- Bandwidth: 10 Gbps
- RTT: 150 ms

BDP = 10 Gbps × 150 ms = 1.5 Gb = 187.5 MB

TCP window needs at least 187.5 MB to fill the pipe

3. Concurrent System

Throughput = Concurrency / Latency

Example web server:
- Each request latency: 50 ms
- Want 1000 requests/sec throughput

Required concurrency = 1000 × 0.05 = 50 concurrent requests

Key Prerequisites

Little's Law is powerful because it makes no assumptions about task distribution or service order. But three prerequisites must hold:

  1. System must be stable: Arrival rate = Departure rate. If tasks keep accumulating, the formula fails
  2. Long-term average: Describes equilibrium state, not instantaneous bursts
  3. Task conservation: Tasks don't disappear (like dropped packets) or self-replicate

Dimensional analysis verification: Throughput [tasks/time] × Latency [time/task] = [tasks]

Diagnostic Value When Formula "Fails"

When measured data doesn't match L = λW, this typically indicates:

  • Actual concurrency > Expected: Long-tail requests inflating average, or resource leaks (unclosed connections)
  • Actual concurrency < Expected: Throughput overestimated, or tasks being batched

Little's Law is the performance engineer's "Sanity Check." See Appendix H for detailed mathematical proof, verification methods, and architectural applications.

Queuing Theory Fundamentals

If Little's Law tells us system's physical conservation, queuing theory reveals the dynamic changes as requests wait to be processed. Understanding queuing theory explains why systems "suddenly collapse" at 90% load.

M/M/1 Model

This is the most basic queuing model, assuming Poisson arrivals, exponential service times, and a single server.

Key formulas:
- ρ = λ/μ (utilization, λ=arrival rate, μ=service rate)
- L = ρ/(1-ρ) (average queue length)
- W = 1/(μ-λ) (average wait time)

Why does latency explode at 90% utilization?

When ρ → 1, (1-ρ) → 0, causing W → ∞. Any small arrival fluctuation causes queue accumulation, and subsequent requests' wait times grow in a chain reaction—this is the "Hockey Stick Effect."

Practical Rules of Thumb

  1. 70% Rule: For latency-sensitive systems, keep Utilization ≤ 70%
  2. Latency multiplier: At 50% utilization, wait time is 2× pure processing time; at 90% it's 10×
  3. Separate processing from waiting: If response time increases but Service Time hasn't changed, the problem is "queuing"

Real-World Example: Database Connection Pool Sizing

A service connects to PostgreSQL with a connection pool. Current setup:

- Arrival rate: 500 queries/sec
- Average query time: 10ms
- Current pool size: 10 connections

Analysis using M/M/c:

ρ = λ / (c × μ) = 500 / (10 × 100) = 0.5 per connection

With 10 connections at 50% average utilization, queuing probability is acceptable.

But during peak hours:

Peak arrival: 800 queries/sec
ρ = 800 / (10 × 100) = 0.8 per connection

At 80% utilization, wait time ≈ 4× service time = 40ms added latency!

Solution: Increase pool to 15 connections:

ρ = 800 / (15 × 100) = 0.53 per connection
Wait time drops to ~1.1× service time = 11ms added latency

See Appendix H for detailed M/M/c model, Erlang C formula, and capacity planning.

Integrated Application: Finding the Real Bottleneck

Let's use an example to apply these models together.

Problem

You have an image processing pipeline:

  • Read image (I/O)
  • Apply filter (compute)
  • Write result (I/O)

Current performance: 100 images/sec Target: 500 images/sec

Analysis

Step 1: Amdahl Analysis

First, measure each stage's time:

Read:   2 ms (20%)
Filter: 6 ms (60%)
Write:  2 ms (20%)
Total:  10 ms

If we only optimize Filter (parallelize):

# Filter is 60%, even with infinite parallelism
max_speedup = amdahl_speedup(0.6, float('inf'))
print(f"Max speedup: {max_speedup:.2f}x")  # 2.5x

2.5x only gets us to 250 images/sec—not enough.

Step 2: Roofline Analysis of Filter

Filter characteristics:

  • Per pixel: 20 FLOPs
  • Per pixel: read 4 bytes, write 4 bytes = 8 bytes
  • OI = 20/8 = 2.5 FLOPs/Byte

System:

  • Peak: 200 GFLOPS
  • Bandwidth: 50 GB/s
  • Ridge point: 4 FLOPs/Byte

OI = 2.5 < 4 → Memory-bound

Optimization direction: not more threads, but improve memory access pattern.

Step 3: Little's Law Analysis of I/O

Read throughput: 500 images/sec × 4 MB/image = 2 GB/s
Disk latency: assume SSD, 0.1 ms

Required queue depth = 2 GB/s × 0.1 ms = 200 KB ≈ 50 images

But we're using sync I/O (queue depth = 1)

Problem found: I/O isn't pipelined, need async I/O.

Solution

  1. Use async I/O with queue depth = 64
  2. Improve filter's cache locality (loop tiling)
  3. Now filter is compute-bound, can apply SIMD optimization

Result: Achieved 600 images/sec.

Common Pitfalls

Pitfall 1: Only Looking at Averages

Bad:  "Average latency 10ms, throughput 100/sec"
      Little's Law: L = 100 × 0.01 = 1

Good: "Average latency 10ms, P99 latency 500ms"
      That P99 might cause problems

Pitfall 2: Ignoring Model Assumptions

Amdahl's Law assumes:

  • Fixed workload
  • No parallelization overhead
  • Perfect parallelism (no synchronization waits)

None of these hold in reality.

Pitfall 3: Over-trusting Roofline

Roofline says your program is memory-bound, but:

  • Might be due to poor cache miss patterns
  • Might be because prefetcher can't predict
  • Might be due to false sharing

Need deeper analysis (perf, VTune).

Pitfall 4: Confusing Throughput and Latency

System A: 1000 req/s, 100ms latency
System B: 800 req/s, 10ms latency

Which is better? Depends on your needs.

Little's Law:
A: 1000 × 0.1 = 100 concurrent
B: 800 × 0.01 = 8 concurrent

A needs more resources to maintain that throughput

Which Model to Use: Decision Guide

                    ┌─────────────────────────────────┐
                    │   What's your performance       │
                    │          question?              │
                    └────────────────┬────────────────┘
                                     │
        ┌────────────────────────────┼────────────────────────────┐
        ▼                            ▼                            ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│ "Will more cores  │    │ "Is my code       │    │ "Why is latency   │
│    help?"         │    │  compute or       │    │   so high?"       │
└─────────┬─────────┘    │  memory bound?"   │    └─────────┬─────────┘
          │              └─────────┬─────────┘              │
          ▼                        ▼                        ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│  Amdahl's Law     │    │  Roofline Model   │    │  Little's Law     │
│  (fixed workload) │    │                   │    │  + Queuing Theory │
└─────────┬─────────┘    └───────────────────┘    └───────────────────┘
          │
          ▼ Performance degrades at high N?
          │
┌─────────┴─────────┐
│  Yes              │ No
│  ▼                │ ▼
│  USL              │ Gustafson
│  (find σ, κ)      │ (scale workload)
└───────────────────┘

Quick Reference:

SymptomModelKey Metric
Adding cores doesn't helpAmdahlSerial fraction (1-p)
Performance drops at high NUSLσ (contention), κ (coherence)
Slow despite high CPURooflineOperational Intensity
Latency spikes at peak loadQueuingρ (utilization)
Throughput × Latency mismatchLittle's LawL = λ × W

Summary

Six Performance Models, six perspectives:

Amdahl's Law

  • Theoretical upper limit for parallelization
  • Serial portion determines the ceiling
  • Use to judge "will adding more cores help"

Gustafson's Law

  • More resources → process larger problems
  • More optimistic than Amdahl
  • Applies to scalable workloads

Universal Scalability Law (USL)

  • Captures contention (σ) and coherence (κ) overhead
  • Predicts performance degradation with more nodes
  • Find optimal parallelism: N_opt = √((1-σ)/κ)

Roofline Model

  • Compute-bound vs Memory-bound
  • Visualize performance bottlenecks
  • Guide optimization direction

Little's Law

  • L = λ × W
  • Connects throughput, latency, concurrency
  • Diagnose queuing and resource utilization

Queuing Theory (M/M/1)

  • Explains "Hockey Stick Effect" at high utilization
  • 70% rule for latency-sensitive systems
  • Response time = Service time / (1 - utilization)

Usage Recommendations

  1. First use Amdahl/Gustafson to evaluate parallelization potential
  2. Use USL when scaling shows degradation—identify σ vs κ bottlenecks
  3. Use Roofline to determine compute vs memory bound
  4. Use Little's Law to analyze throughput/latency relationships
  5. Use queuing theory to understand utilization vs latency tradeoffs
  6. Remember: all models are simplifications; actual measurement is always the final answer