Appendix H: Performance Models Deep Dive
"In theory, theory and practice are the same. In practice, they are not." — Yogi Berra
This appendix provides detailed mathematical foundations, proofs, and advanced applications for the performance models introduced in Chapter 10.
Little's Law: Mathematical Foundation
Rigorous Derivation
Little's Law states: L = λ × W
Where:
- L = average number of items in system
- λ = arrival rate (throughput)
- W = average time in system (latency)
Intuitive Analogy
Imagine a highway toll booth:
- Vehicle arrival rate (λ) = 100 vehicles/hour
- Time to pass through (W) = 0.1 hour/vehicle
- Vehicles at booth (L) = 100 × 0.1 = 10 vehicles
This makes sense: if each vehicle needs 0.1 hours to pass,
and 100 vehicles arrive per hour, then at any moment,
there are on average 10 vehicles in the system.
Formal Proof Sketch
Consider a time interval T.
During T:
- Tasks arriving = N = λ × T
- Each task spends average time W in system
- Task j contributes time W_j to system occupancy
Average items in system (L):
L = (1/T) × Σ [time each task spent in system]
= (1/T) × [total time contribution from all tasks]
Since each task contributes W on average:
L ≈ (1/T) × (N × W)
= (1/T) × (λ × T × W)
= λ × W
Why It Works in Computer Systems
1. Conservation Principle
Tasks enter → [ Processing System ] → Tasks leave
Conservation: tasks_in = tasks_out (steady state)
Items in system = tasks that entered but haven't left yet
2. Time Average = Space Average
Observing system for time T:
Time-averaged concurrency = (1/T) × ∫[0,T] items_in_system(t) dt
This equals the task-centric view:
Each task sees an average system load during its stay
3. Valid for All Queuing Models
Proven mathematically for:
M/M/1, M/M/c, M/G/1, G/G/c, and more
Practical Verification
def verify_littles_law(throughput, latency, measured_concurrency):
"""Verify Little's Law with measured data."""
expected = throughput * latency
error = abs(measured_concurrency - expected) / expected
print(f"Throughput: {throughput:.1f}/s")
print(f"Latency: {latency:.3f}s")
print(f"Expected concurrency: {expected:.1f}")
print(f"Measured concurrency: {measured_concurrency:.1f}")
print(f"Error: {error:.1%}")
# Error < 5% indicates stable system
return error < 0.05
# Example
verify_littles_law(150, 0.08, 11.8)
# Output: Error: 1.7% ✓
Key Prerequisites
| Prerequisite | Description |
|---|---|
| Stable System | Input rate ≈ output rate (no unbounded queue growth) |
| Long-term Average | Short-term may violate; converges over time |
| Task Conservation | Every task that enters eventually leaves |
Counter-Examples (When It Fails)
Counter-Example 1: Task Loss
throughput_in = 100/s
throughput_out = 80/s (20% dropped)
latency = 0.1s
Calculation: 100 × 0.1 = 10
Reality: tasks accumulate unboundedly → formula fails
Counter-Example 2: Burst Arrival
1000 tasks arrive instantly
throughput = 1000/s (instantaneous)
latency = 1s
Calculation: 1000 × 1 = 1000
But this is instantaneous, not steady-state average
Diagnostic Value
When measured values don't match expectations:
| Situation | Possible Causes |
|---|---|
| Actual > Expected | Latency underestimated, tasks stuck, memory leak |
| Actual < Expected | Throughput overestimated, parallel processing, measurement too short |
Amdahl's Law: Extended Analysis
Mathematical Derivation
Let T be the total execution time on a single core. Let p be the parallelizable fraction.
- Single-core execution time: T_1 = T
- On N cores, parallel portion shrinks to (p×T)/N, serial portion remains (1-p)×T
- N-core execution time: T_N = (1-p)×T + (p×T)/N
- Speedup S = T_1 / T_N = 1 / ((1-p) + p/N)
- As N → ∞: S_max = 1 / (1-p)
If p = 0.9 (90% parallelizable): S_max = 10×
Measuring the Parallel Fraction p
Since p is difficult to calculate from code inspection, use empirical fitting:
import numpy as np
from scipy.optimize import curve_fit
def amdahl_model(n, p):
"""Amdahl's Law model."""
return 1 / ((1 - p) + p / n)
# Measured data: cores and corresponding speedup
n_data = np.array([1, 2, 4, 8, 16])
speedup_data = np.array([1.0, 1.85, 3.20, 4.80, 5.60])
# Fit p
params, _ = curve_fit(amdahl_model, n_data, speedup_data)
p_fitted = params[0]
print(f"Fitted parallel fraction p = {p_fitted:.2%}")
print(f"Theoretical maximum speedup = {1/(1-p_fitted):.2f}x")
Real-World Serial Bottleneck Examples
| Bottleneck Type | Example | Mitigation |
|---|---|---|
| Lock Contention | Global mutex for logging | Lock-free queues, per-thread buffers |
| I/O Serialization | Sequential file reads | Async I/O, memory-mapped files |
| Memory Allocator | malloc global lock | jemalloc, tcmalloc, arena allocators |
| Kernel Bottleneck | syscall serialization | io_uring, batched operations |
Gustafson's Law: Scaled Speedup
Mathematical Derivation
Unlike Amdahl, Gustafson starts from the parallel execution result and works backward.
- On N cores, normalized execution time T_N = 1
- Let s be the serial fraction of this time, so parallel fraction = 1-s
- On single core, serial part still takes s, but parallel part takes N×(1-s)
- Single-core time: T_1 = s + N×(1-s)
- Scaled Speedup: S(N) = s + N×(1-s) = N - s×(N-1)
Conclusion: Speedup grows linearly with N as long as s is small.
Strong vs Weak Scaling
| Scaling Type | Definition | Model | Metric |
|---|---|---|---|
| Strong Scaling | Fixed problem size, add resources | Amdahl | Time reduction |
| Weak Scaling | Problem grows with resources | Gustafson | Constant time, larger problem |
When to Use Each
Amdahl scenarios:
- User-facing latency requirements
- Real-time constraints
- Interactive applications
Gustafson scenarios:
- Scientific computing (higher resolution simulations)
- Big data processing (process more data in same time)
- Machine learning training (larger batches)
Universal Scalability Law (USL)
Formula and Parameters
C(N) = N / (1 + σ(N-1) + κN(N-1))
σ (sigma): Contention/Serialization coefficient
κ (kappa): Coherence/Crosstalk coefficient
Physical Meaning of Parameters
σ (Contention):
- Represents queuing for shared resources
- Like Amdahl's serial fraction
- Effect: Linear degradation as N increases
- Examples: mutex waits, database locks
κ (Coherence):
- Represents pairwise communication overhead
- Each node must communicate with others for consistency
- Effect: Quadratic degradation (N×(N-1) pairs)
- Examples: cache coherence traffic, distributed consensus
Three Scaling Regimes
| Condition | Behavior | Curve Shape |
|---|---|---|
| σ=0, κ=0 | Linear | Straight diagonal |
| σ>0, κ=0 | Amdahl | Approaches horizontal asymptote |
| σ>0, κ>0 | USL | Peak then decline (retrograde) |
Optimal Parallelism
The peak of C(N) occurs at:
N* = sqrt((1 - σ) / κ)
Beyond N*, adding more processors hurts performance.
Python Fitting Example
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def usl_model(n, sigma, kappa):
"""Universal Scalability Law model."""
return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))
# Measured data
n_data = np.array([1, 2, 4, 8, 16, 32, 64])
throughput_data = np.array([100, 185, 340, 580, 820, 900, 750])
# Normalize to relative capacity
relative_capacity = throughput_data / throughput_data[0]
# Fit USL parameters
params, covariance = curve_fit(usl_model, n_data, relative_capacity,
p0=[0.01, 0.001], bounds=(0, 1))
sigma, kappa = params
print(f"Contention (σ): {sigma:.4f}")
print(f"Coherence (κ): {kappa:.6f}")
# Optimal parallelism
n_optimal = np.sqrt((1 - sigma) / kappa)
print(f"Optimal parallelism N*: {n_optimal:.1f}")
# Predict and plot
n_pred = np.linspace(1, 128, 100)
c_pred = usl_model(n_pred, sigma, kappa)
plt.figure(figsize=(10, 6))
plt.scatter(n_data, relative_capacity, label='Measured', s=100)
plt.plot(n_pred, c_pred, 'r-', label=f'USL fit (σ={sigma:.3f}, κ={kappa:.5f})')
plt.axvline(n_optimal, color='g', linestyle='--', label=f'N* = {n_optimal:.1f}')
plt.xlabel('Number of Processors (N)')
plt.ylabel('Relative Capacity C(N)')
plt.title('USL Analysis: Identifying Scalability Limits')
plt.legend()
plt.grid(True)
plt.savefig('usl_analysis.png', dpi=150)
Case Study: Database Connection Pooling
Observations:
- 10 connections: 1000 TPS
- 20 connections: 1800 TPS
- 40 connections: 2800 TPS
- 80 connections: 3200 TPS (peak!)
- 160 connections: 2400 TPS (retrograde)
Fitted: σ=0.02, κ=0.0001
N* = sqrt(0.98/0.0001) ≈ 99 connections
Diagnosis: κ > 0 indicates coherence issues
- Connection pool management overhead
- Distributed lock contention
- Context switch costs
Roofline Model: Cache-Aware Extensions
Cache-Aware Roofline Model (CARM)
Traditional Roofline considers only DRAM bandwidth. CARM extends to multiple memory hierarchy levels.
Multiple Rooflines
Performance (GFLOPS)
^
| Peak Compute ─────────────────────────
| / / / / /
| / L1 Roofline (highest slope)
|/ / L2 Roofline
| / / L3 Roofline
| / / / DRAM Roofline (lowest slope)
|/ / / /
└──────────────────────────────────────> Arithmetic Intensity
Calculating AI for Each Level
| Level | Arithmetic Intensity | When to Use |
|---|---|---|
| AI_L1 | FLOPs / L1_traffic | Data fits in L1 |
| AI_L2 | FLOPs / L2_traffic | Data fits in L2 |
| AI_L3 | FLOPs / L3_traffic | Data fits in L3 |
| AI_DRAM | FLOPs / DRAM_traffic | Streaming from memory |
Measurement with perf
# Measure floating-point operations
perf stat -e fp_arith_inst_retired.scalar_double,\
fp_arith_inst_retired.128b_packed_double ./my_program
# Measure memory traffic (cache misses × cache line size)
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,\
LLC-load-misses,LLC-store-misses ./my_program
Optimization Strategy by Location
| Point Location | Diagnosis | Optimization |
|---|---|---|
| Below DRAM slope | DRAM bandwidth limited | Prefetching, streaming stores |
| Between L3 and DRAM | L3 miss issues | Improve data locality, blocking |
| Near L1 slope, below compute | Low arithmetic intensity | Loop fusion, vectorization |
| Near compute ceiling | Compute limited | Better algorithms, SIMD |
Integer Roofline
For non-FP workloads (compilers, databases, encryption):
- Y-axis: GOPS (Giga-Operations Per Second)
- Integer ops often faster than FP
- More sensitive to cache latency than FP
Energy Roofline (Green Computing)
- Y-axis: GFLOPS/Watt
- Finds energy-efficient operating points
- Important for HPC and data centers
Queuing Theory Fundamentals
Kendall's Notation: A/S/c/K/N/D
| Symbol | Meaning | Common Values |
|---|---|---|
| A | Arrival distribution | M (Poisson), D (Deterministic), G (General) |
| S | Service distribution | M (Exponential), D, G |
| c | Number of servers | 1, c, ∞ |
| K | System capacity | ∞ (unbounded), finite |
| N | Population size | ∞, finite |
| D | Queue discipline | FIFO, LIFO, Priority |
M/M/1 Model: Complete Analysis
Assumptions:
- Poisson arrivals (rate λ)
- Exponential service times (rate μ)
- Single server
- FIFO queue
- Infinite capacity
Key Formulas:
Utilization: ρ = λ/μ (must be < 1 for stability)
Probability of n items in system: P_n = (1-ρ) × ρ^n
Average items in system: L = ρ/(1-ρ)
Average items in queue: L_q = ρ²/(1-ρ)
Average time in system: W = 1/(μ-λ)
Average time in queue: W_q = ρ/(μ-λ)
The Hockey Stick Effect:
| Utilization (ρ) | Avg Queue Length (L) | Wait Time Multiplier |
|---|---|---|
| 50% | 1.0 | 2× service time |
| 70% | 2.3 | 3.3× |
| 80% | 4.0 | 5× |
| 90% | 9.0 | 10× |
| 95% | 19.0 | 20× |
| 99% | 99.0 | 100× |
M/M/c Model: Multiple Servers
Erlang C Formula (probability of waiting):
import math
def erlang_c(c, rho):
"""Calculate probability of queuing in M/M/c system."""
a = c * rho # Offered load
# Erlang C formula
sum_term = sum((a**k) / math.factorial(k) for k in range(c))
last_term = (a**c) / (math.factorial(c) * (1 - rho))
return last_term / (sum_term + last_term)
# Example: 10 servers, 80% average utilization
prob_wait = erlang_c(10, 0.8)
print(f"Probability of queuing: {prob_wait:.1%}")
Capacity Planning Guidelines
- 70% Rule: Keep utilization ≤ 70% for latency-sensitive systems
- Target Wait Probability: For SLA, target < 5% probability of waiting
- Headroom for Bursts: Leave 30% headroom for traffic spikes
Connection Pool Sizing Formula
def optimal_pool_size(arrival_rate, service_time, target_wait_prob=0.05):
"""Calculate optimal connection pool size."""
rho = arrival_rate * service_time
# Binary search for minimum c where P(wait) < target
for c in range(1, 1000):
if c * rho > c: # Unstable
continue
if erlang_c(c, rho/c) < target_wait_prob:
return c
return None
# Example: 100 requests/sec, 50ms average service time
pool_size = optimal_pool_size(100, 0.05)
print(f"Recommended pool size: {pool_size} connections")
Tools and Measurement
Performance Counter Events
| Tool | Best For | Key Events |
|---|---|---|
| perf | Linux, general | cycles, instructions, cache-misses |
| Intel VTune | Intel CPUs | vectorization, memory bandwidth |
| Intel Advisor | Roofline analysis | FLOPS, memory traffic |
| Likwid | HPC, multi-arch | configurable groups |
| PAPI | Cross-platform | portable API |
Measurement Best Practices
- Warm up: Run several iterations before measuring
- Steady state: Ensure system reaches equilibrium
- Multiple runs: Report mean and standard deviation
- Control variables: Pin cores, disable frequency scaling
- Representative load: Use realistic workloads
References
- Amdahl, G.M. (1967). "Validity of the single processor approach"
- Gustafson, J.L. (1988). "Reevaluating Amdahl's Law"
- Gunther, N.J. (1993). "Practical Performance Analyst"
- Williams, S. et al. (2009). "Roofline: An Insightful Visual Performance Model"
- Little, J.D.C. (1961). "A Proof for the Queuing Formula: L = λW"
- Kleinrock, L. (1975). "Queueing Systems, Volume 1: Theory"
- Iyer, L.M. et al. (2015). "Cache-Aware Roofline Model"