Appendix H: Performance Models Deep Dive

"In theory, theory and practice are the same. In practice, they are not." — Yogi Berra

This appendix provides detailed mathematical foundations, proofs, and advanced applications for the performance models introduced in Chapter 10.

Little's Law: Mathematical Foundation

Rigorous Derivation

Little's Law states: L = λ × W

Where:

L = average number of items in system
λ = arrival rate (throughput)
W = average time in system (latency)

Intuitive Analogy

Imagine a highway toll booth:
- Vehicle arrival rate (λ) = 100 vehicles/hour
- Time to pass through (W) = 0.1 hour/vehicle
- Vehicles at booth (L) = 100 × 0.1 = 10 vehicles

This makes sense: if each vehicle needs 0.1 hours to pass,
and 100 vehicles arrive per hour, then at any moment,
there are on average 10 vehicles in the system.

Formal Proof Sketch

Consider a time interval T.
During T:
- Tasks arriving = N = λ × T
- Each task spends average time W in system
- Task j contributes time W_j to system occupancy

Average items in system (L):
L = (1/T) × Σ [time each task spent in system]
  = (1/T) × [total time contribution from all tasks]

Since each task contributes W on average:
L ≈ (1/T) × (N × W)
  = (1/T) × (λ × T × W)
  = λ × W

Why It Works in Computer Systems

1. Conservation Principle

Tasks enter → [ Processing System ] → Tasks leave

Conservation: tasks_in = tasks_out (steady state)
Items in system = tasks that entered but haven't left yet

2. Time Average = Space Average

Observing system for time T:
Time-averaged concurrency = (1/T) × ∫[0,T] items_in_system(t) dt

This equals the task-centric view:
Each task sees an average system load during its stay

3. Valid for All Queuing Models

Proven mathematically for:
M/M/1, M/M/c, M/G/1, G/G/c, and more

Practical Verification

def verify_littles_law(throughput, latency, measured_concurrency):
    """Verify Little's Law with measured data."""
    expected = throughput * latency
    error = abs(measured_concurrency - expected) / expected
    
    print(f"Throughput: {throughput:.1f}/s")
    print(f"Latency: {latency:.3f}s")
    print(f"Expected concurrency: {expected:.1f}")
    print(f"Measured concurrency: {measured_concurrency:.1f}")
    print(f"Error: {error:.1%}")
    
    # Error < 5% indicates stable system
    return error < 0.05

# Example
verify_littles_law(150, 0.08, 11.8)
# Output: Error: 1.7% ✓

Key Prerequisites

Prerequisite	Description
Stable System	Input rate ≈ output rate (no unbounded queue growth)
Long-term Average	Short-term may violate; converges over time
Task Conservation	Every task that enters eventually leaves

Counter-Examples (When It Fails)

Counter-Example 1: Task Loss

throughput_in = 100/s
throughput_out = 80/s  (20% dropped)
latency = 0.1s

Calculation: 100 × 0.1 = 10
Reality: tasks accumulate unboundedly → formula fails

Counter-Example 2: Burst Arrival

1000 tasks arrive instantly
throughput = 1000/s (instantaneous)
latency = 1s
Calculation: 1000 × 1 = 1000

But this is instantaneous, not steady-state average

Diagnostic Value

When measured values don't match expectations:

Situation	Possible Causes
Actual > Expected	Latency underestimated, tasks stuck, memory leak
Actual < Expected	Throughput overestimated, parallel processing, measurement too short

Amdahl's Law: Extended Analysis

Mathematical Derivation

Let T be the total execution time on a single core. Let p be the parallelizable fraction.

Single-core execution time: T_1 = T
On N cores, parallel portion shrinks to (p×T)/N, serial portion remains (1-p)×T
N-core execution time: T_N = (1-p)×T + (p×T)/N
Speedup S = T_1 / T_N = 1 / ((1-p) + p/N)
As N → ∞: S_max = 1 / (1-p)

If p = 0.9 (90% parallelizable): S_max = 10×

Measuring the Parallel Fraction p

Since p is difficult to calculate from code inspection, use empirical fitting:

import numpy as np
from scipy.optimize import curve_fit

def amdahl_model(n, p):
    """Amdahl's Law model."""
    return 1 / ((1 - p) + p / n)

# Measured data: cores and corresponding speedup
n_data = np.array([1, 2, 4, 8, 16])
speedup_data = np.array([1.0, 1.85, 3.20, 4.80, 5.60])

# Fit p
params, _ = curve_fit(amdahl_model, n_data, speedup_data)
p_fitted = params[0]

print(f"Fitted parallel fraction p = {p_fitted:.2%}")
print(f"Theoretical maximum speedup = {1/(1-p_fitted):.2f}x")

Real-World Serial Bottleneck Examples

Bottleneck Type	Example	Mitigation
Lock Contention	Global mutex for logging	Lock-free queues, per-thread buffers
I/O Serialization	Sequential file reads	Async I/O, memory-mapped files
Memory Allocator	malloc global lock	jemalloc, tcmalloc, arena allocators
Kernel Bottleneck	syscall serialization	io_uring, batched operations

Gustafson's Law: Scaled Speedup

Mathematical Derivation

Unlike Amdahl, Gustafson starts from the parallel execution result and works backward.

On N cores, normalized execution time T_N = 1
Let s be the serial fraction of this time, so parallel fraction = 1-s
On single core, serial part still takes s, but parallel part takes N×(1-s)
Single-core time: T_1 = s + N×(1-s)
Scaled Speedup: S(N) = s + N×(1-s) = N - s×(N-1)

Conclusion: Speedup grows linearly with N as long as s is small.

Strong vs Weak Scaling

Scaling Type	Definition	Model	Metric
Strong Scaling	Fixed problem size, add resources	Amdahl	Time reduction
Weak Scaling	Problem grows with resources	Gustafson	Constant time, larger problem

When to Use Each

Amdahl scenarios:

User-facing latency requirements
Real-time constraints
Interactive applications

Gustafson scenarios:

Scientific computing (higher resolution simulations)
Big data processing (process more data in same time)
Machine learning training (larger batches)

Universal Scalability Law (USL)

Formula and Parameters

C(N) = N / (1 + σ(N-1) + κN(N-1))

σ (sigma): Contention/Serialization coefficient
κ (kappa): Coherence/Crosstalk coefficient

Physical Meaning of Parameters

σ (Contention):

Represents queuing for shared resources
Like Amdahl's serial fraction
Effect: Linear degradation as N increases
Examples: mutex waits, database locks

κ (Coherence):

Represents pairwise communication overhead
Each node must communicate with others for consistency
Effect: Quadratic degradation (N×(N-1) pairs)
Examples: cache coherence traffic, distributed consensus

Three Scaling Regimes

Condition	Behavior	Curve Shape
σ=0, κ=0	Linear	Straight diagonal
σ>0, κ=0	Amdahl	Approaches horizontal asymptote
σ>0, κ>0	USL	Peak then decline (retrograde)

Optimal Parallelism

The peak of C(N) occurs at:

N* = sqrt((1 - σ) / κ)

Beyond N*, adding more processors hurts performance.

Python Fitting Example

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def usl_model(n, sigma, kappa):
    """Universal Scalability Law model."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Measured data
n_data = np.array([1, 2, 4, 8, 16, 32, 64])
throughput_data = np.array([100, 185, 340, 580, 820, 900, 750])

# Normalize to relative capacity
relative_capacity = throughput_data / throughput_data[0]

# Fit USL parameters
params, covariance = curve_fit(usl_model, n_data, relative_capacity,
                                p0=[0.01, 0.001], bounds=(0, 1))
sigma, kappa = params

print(f"Contention (σ): {sigma:.4f}")
print(f"Coherence (κ): {kappa:.6f}")

# Optimal parallelism
n_optimal = np.sqrt((1 - sigma) / kappa)
print(f"Optimal parallelism N*: {n_optimal:.1f}")

# Predict and plot
n_pred = np.linspace(1, 128, 100)
c_pred = usl_model(n_pred, sigma, kappa)

plt.figure(figsize=(10, 6))
plt.scatter(n_data, relative_capacity, label='Measured', s=100)
plt.plot(n_pred, c_pred, 'r-', label=f'USL fit (σ={sigma:.3f}, κ={kappa:.5f})')
plt.axvline(n_optimal, color='g', linestyle='--', label=f'N* = {n_optimal:.1f}')
plt.xlabel('Number of Processors (N)')
plt.ylabel('Relative Capacity C(N)')
plt.title('USL Analysis: Identifying Scalability Limits')
plt.legend()
plt.grid(True)
plt.savefig('usl_analysis.png', dpi=150)

Case Study: Database Connection Pooling

Observations:
- 10 connections: 1000 TPS
- 20 connections: 1800 TPS
- 40 connections: 2800 TPS
- 80 connections: 3200 TPS (peak!)
- 160 connections: 2400 TPS (retrograde)

Fitted: σ=0.02, κ=0.0001
N* = sqrt(0.98/0.0001) ≈ 99 connections

Diagnosis: κ > 0 indicates coherence issues
- Connection pool management overhead
- Distributed lock contention
- Context switch costs

Roofline Model: Cache-Aware Extensions

Cache-Aware Roofline Model (CARM)

Traditional Roofline considers only DRAM bandwidth. CARM extends to multiple memory hierarchy levels.

Multiple Rooflines

Performance (GFLOPS)
    ^
    |  Peak Compute ─────────────────────────
    |  /  /  /  /  /
    | / L1 Roofline (highest slope)
    |/  / L2 Roofline
    |  /  / L3 Roofline
    | /  /  / DRAM Roofline (lowest slope)
    |/  /  /  /
    └──────────────────────────────────────> Arithmetic Intensity

Calculating AI for Each Level

Level	Arithmetic Intensity	When to Use
AI_L1	FLOPs / L1_traffic	Data fits in L1
AI_L2	FLOPs / L2_traffic	Data fits in L2
AI_L3	FLOPs / L3_traffic	Data fits in L3
AI_DRAM	FLOPs / DRAM_traffic	Streaming from memory

Measurement with perf

# Measure floating-point operations
perf stat -e fp_arith_inst_retired.scalar_double,\
fp_arith_inst_retired.128b_packed_double ./my_program

# Measure memory traffic (cache misses × cache line size)
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,\
LLC-load-misses,LLC-store-misses ./my_program

Optimization Strategy by Location

Point Location	Diagnosis	Optimization
Below DRAM slope	DRAM bandwidth limited	Prefetching, streaming stores
Between L3 and DRAM	L3 miss issues	Improve data locality, blocking
Near L1 slope, below compute	Low arithmetic intensity	Loop fusion, vectorization
Near compute ceiling	Compute limited	Better algorithms, SIMD

Integer Roofline

For non-FP workloads (compilers, databases, encryption):

Y-axis: GOPS (Giga-Operations Per Second)
Integer ops often faster than FP
More sensitive to cache latency than FP

Energy Roofline (Green Computing)

Y-axis: GFLOPS/Watt
Finds energy-efficient operating points
Important for HPC and data centers

Queuing Theory Fundamentals

Kendall's Notation: A/S/c/K/N/D

Symbol	Meaning	Common Values
A	Arrival distribution	M (Poisson), D (Deterministic), G (General)
S	Service distribution	M (Exponential), D, G
c	Number of servers	1, c, ∞
K	System capacity	∞ (unbounded), finite
N	Population size	∞, finite
D	Queue discipline	FIFO, LIFO, Priority

M/M/1 Model: Complete Analysis

Assumptions:

Poisson arrivals (rate λ)
Exponential service times (rate μ)
Single server
FIFO queue
Infinite capacity

Key Formulas:

Utilization: ρ = λ/μ (must be < 1 for stability)

Probability of n items in system: P_n = (1-ρ) × ρ^n

Average items in system: L = ρ/(1-ρ)

Average items in queue: L_q = ρ²/(1-ρ)

Average time in system: W = 1/(μ-λ)

Average time in queue: W_q = ρ/(μ-λ)

The Hockey Stick Effect:

Utilization (ρ)	Avg Queue Length (L)	Wait Time Multiplier
50%	1.0	2× service time
70%	2.3	3.3×
80%	4.0	5×
90%	9.0	10×
95%	19.0	20×
99%	99.0	100×

M/M/c Model: Multiple Servers

Erlang C Formula (probability of waiting):

import math

def erlang_c(c, rho):
    """Calculate probability of queuing in M/M/c system."""
    a = c * rho  # Offered load

    # Erlang C formula
    sum_term = sum((a**k) / math.factorial(k) for k in range(c))
    last_term = (a**c) / (math.factorial(c) * (1 - rho))

    return last_term / (sum_term + last_term)

# Example: 10 servers, 80% average utilization
prob_wait = erlang_c(10, 0.8)
print(f"Probability of queuing: {prob_wait:.1%}")

Capacity Planning Guidelines

70% Rule: Keep utilization ≤ 70% for latency-sensitive systems
Target Wait Probability: For SLA, target < 5% probability of waiting
Headroom for Bursts: Leave 30% headroom for traffic spikes

Connection Pool Sizing Formula

def optimal_pool_size(arrival_rate, service_time, target_wait_prob=0.05):
    """Calculate optimal connection pool size."""
    rho = arrival_rate * service_time

    # Binary search for minimum c where P(wait) < target
    for c in range(1, 1000):
        if c * rho > c:  # Unstable
            continue
        if erlang_c(c, rho/c) < target_wait_prob:
            return c
    return None

# Example: 100 requests/sec, 50ms average service time
pool_size = optimal_pool_size(100, 0.05)
print(f"Recommended pool size: {pool_size} connections")

Tools and Measurement

Performance Counter Events

Tool	Best For	Key Events
perf	Linux, general	cycles, instructions, cache-misses
Intel VTune	Intel CPUs	vectorization, memory bandwidth
Intel Advisor	Roofline analysis	FLOPS, memory traffic
Likwid	HPC, multi-arch	configurable groups
PAPI	Cross-platform	portable API

Measurement Best Practices

Warm up: Run several iterations before measuring
Steady state: Ensure system reaches equilibrium
Multiple runs: Report mean and standard deviation
Control variables: Pin cores, disable frequency scaling
Representative load: Use realistic workloads

References

Amdahl, G.M. (1967). "Validity of the single processor approach"
Gustafson, J.L. (1988). "Reevaluating Amdahl's Law"
Gunther, N.J. (1993). "Practical Performance Analyst"
Williams, S. et al. (2009). "Roofline: An Insightful Visual Performance Model"
Little, J.D.C. (1961). "A Proof for the Queuing Formula: L = λW"
Kleinrock, L. (1975). "Queueing Systems, Volume 1: Theory"
Iyer, L.M. et al. (2015). "Cache-Aware Roofline Model"

Performance and Benchmarking