Appendix H: Performance Models Deep Dive


"In theory, theory and practice are the same. In practice, they are not." — Yogi Berra

This appendix provides detailed mathematical foundations, proofs, and advanced applications for the performance models introduced in Chapter 10.

Little's Law: Mathematical Foundation

Rigorous Derivation

Little's Law states: L = λ × W

Where:

  • L = average number of items in system
  • λ = arrival rate (throughput)
  • W = average time in system (latency)

Intuitive Analogy

Imagine a highway toll booth:
- Vehicle arrival rate (λ) = 100 vehicles/hour
- Time to pass through (W) = 0.1 hour/vehicle
- Vehicles at booth (L) = 100 × 0.1 = 10 vehicles

This makes sense: if each vehicle needs 0.1 hours to pass,
and 100 vehicles arrive per hour, then at any moment,
there are on average 10 vehicles in the system.

Formal Proof Sketch

Consider a time interval T.
During T:
- Tasks arriving = N = λ × T
- Each task spends average time W in system
- Task j contributes time W_j to system occupancy

Average items in system (L):
L = (1/T) × Σ [time each task spent in system]
  = (1/T) × [total time contribution from all tasks]

Since each task contributes W on average:
L ≈ (1/T) × (N × W)
  = (1/T) × (λ × T × W)
  = λ × W

Why It Works in Computer Systems

1. Conservation Principle

Tasks enter → [ Processing System ] → Tasks leave

Conservation: tasks_in = tasks_out (steady state)
Items in system = tasks that entered but haven't left yet

2. Time Average = Space Average

Observing system for time T:
Time-averaged concurrency = (1/T) × ∫[0,T] items_in_system(t) dt

This equals the task-centric view:
Each task sees an average system load during its stay

3. Valid for All Queuing Models

Proven mathematically for:
M/M/1, M/M/c, M/G/1, G/G/c, and more

Practical Verification

def verify_littles_law(throughput, latency, measured_concurrency):
    """Verify Little's Law with measured data."""
    expected = throughput * latency
    error = abs(measured_concurrency - expected) / expected
    
    print(f"Throughput: {throughput:.1f}/s")
    print(f"Latency: {latency:.3f}s")
    print(f"Expected concurrency: {expected:.1f}")
    print(f"Measured concurrency: {measured_concurrency:.1f}")
    print(f"Error: {error:.1%}")
    
    # Error < 5% indicates stable system
    return error < 0.05

# Example
verify_littles_law(150, 0.08, 11.8)
# Output: Error: 1.7% ✓

Key Prerequisites

PrerequisiteDescription
Stable SystemInput rate ≈ output rate (no unbounded queue growth)
Long-term AverageShort-term may violate; converges over time
Task ConservationEvery task that enters eventually leaves

Counter-Examples (When It Fails)

Counter-Example 1: Task Loss

throughput_in = 100/s
throughput_out = 80/s  (20% dropped)
latency = 0.1s

Calculation: 100 × 0.1 = 10
Reality: tasks accumulate unboundedly → formula fails

Counter-Example 2: Burst Arrival

1000 tasks arrive instantly
throughput = 1000/s (instantaneous)
latency = 1s
Calculation: 1000 × 1 = 1000

But this is instantaneous, not steady-state average

Diagnostic Value

When measured values don't match expectations:

SituationPossible Causes
Actual > ExpectedLatency underestimated, tasks stuck, memory leak
Actual < ExpectedThroughput overestimated, parallel processing, measurement too short

Amdahl's Law: Extended Analysis

Mathematical Derivation

Let T be the total execution time on a single core. Let p be the parallelizable fraction.

  1. Single-core execution time: T_1 = T
  2. On N cores, parallel portion shrinks to (p×T)/N, serial portion remains (1-p)×T
  3. N-core execution time: T_N = (1-p)×T + (p×T)/N
  4. Speedup S = T_1 / T_N = 1 / ((1-p) + p/N)
  5. As N → ∞: S_max = 1 / (1-p)

If p = 0.9 (90% parallelizable): S_max = 10×

Measuring the Parallel Fraction p

Since p is difficult to calculate from code inspection, use empirical fitting:

import numpy as np
from scipy.optimize import curve_fit

def amdahl_model(n, p):
    """Amdahl's Law model."""
    return 1 / ((1 - p) + p / n)

# Measured data: cores and corresponding speedup
n_data = np.array([1, 2, 4, 8, 16])
speedup_data = np.array([1.0, 1.85, 3.20, 4.80, 5.60])

# Fit p
params, _ = curve_fit(amdahl_model, n_data, speedup_data)
p_fitted = params[0]

print(f"Fitted parallel fraction p = {p_fitted:.2%}")
print(f"Theoretical maximum speedup = {1/(1-p_fitted):.2f}x")

Real-World Serial Bottleneck Examples

Bottleneck TypeExampleMitigation
Lock ContentionGlobal mutex for loggingLock-free queues, per-thread buffers
I/O SerializationSequential file readsAsync I/O, memory-mapped files
Memory Allocatormalloc global lockjemalloc, tcmalloc, arena allocators
Kernel Bottlenecksyscall serializationio_uring, batched operations

Gustafson's Law: Scaled Speedup

Mathematical Derivation

Unlike Amdahl, Gustafson starts from the parallel execution result and works backward.

  1. On N cores, normalized execution time T_N = 1
  2. Let s be the serial fraction of this time, so parallel fraction = 1-s
  3. On single core, serial part still takes s, but parallel part takes N×(1-s)
  4. Single-core time: T_1 = s + N×(1-s)
  5. Scaled Speedup: S(N) = s + N×(1-s) = N - s×(N-1)

Conclusion: Speedup grows linearly with N as long as s is small.

Strong vs Weak Scaling

Scaling TypeDefinitionModelMetric
Strong ScalingFixed problem size, add resourcesAmdahlTime reduction
Weak ScalingProblem grows with resourcesGustafsonConstant time, larger problem

When to Use Each

Amdahl scenarios:

  • User-facing latency requirements
  • Real-time constraints
  • Interactive applications

Gustafson scenarios:

  • Scientific computing (higher resolution simulations)
  • Big data processing (process more data in same time)
  • Machine learning training (larger batches)

Universal Scalability Law (USL)

Formula and Parameters

C(N) = N / (1 + σ(N-1) + κN(N-1))

σ (sigma): Contention/Serialization coefficient
κ (kappa): Coherence/Crosstalk coefficient

Physical Meaning of Parameters

σ (Contention):

  • Represents queuing for shared resources
  • Like Amdahl's serial fraction
  • Effect: Linear degradation as N increases
  • Examples: mutex waits, database locks

κ (Coherence):

  • Represents pairwise communication overhead
  • Each node must communicate with others for consistency
  • Effect: Quadratic degradation (N×(N-1) pairs)
  • Examples: cache coherence traffic, distributed consensus

Three Scaling Regimes

ConditionBehaviorCurve Shape
σ=0, κ=0LinearStraight diagonal
σ>0, κ=0AmdahlApproaches horizontal asymptote
σ>0, κ>0USLPeak then decline (retrograde)

Optimal Parallelism

The peak of C(N) occurs at:

N* = sqrt((1 - σ) / κ)

Beyond N*, adding more processors hurts performance.

Python Fitting Example

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def usl_model(n, sigma, kappa):
    """Universal Scalability Law model."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Measured data
n_data = np.array([1, 2, 4, 8, 16, 32, 64])
throughput_data = np.array([100, 185, 340, 580, 820, 900, 750])

# Normalize to relative capacity
relative_capacity = throughput_data / throughput_data[0]

# Fit USL parameters
params, covariance = curve_fit(usl_model, n_data, relative_capacity,
                                p0=[0.01, 0.001], bounds=(0, 1))
sigma, kappa = params

print(f"Contention (σ): {sigma:.4f}")
print(f"Coherence (κ): {kappa:.6f}")

# Optimal parallelism
n_optimal = np.sqrt((1 - sigma) / kappa)
print(f"Optimal parallelism N*: {n_optimal:.1f}")

# Predict and plot
n_pred = np.linspace(1, 128, 100)
c_pred = usl_model(n_pred, sigma, kappa)

plt.figure(figsize=(10, 6))
plt.scatter(n_data, relative_capacity, label='Measured', s=100)
plt.plot(n_pred, c_pred, 'r-', label=f'USL fit (σ={sigma:.3f}, κ={kappa:.5f})')
plt.axvline(n_optimal, color='g', linestyle='--', label=f'N* = {n_optimal:.1f}')
plt.xlabel('Number of Processors (N)')
plt.ylabel('Relative Capacity C(N)')
plt.title('USL Analysis: Identifying Scalability Limits')
plt.legend()
plt.grid(True)
plt.savefig('usl_analysis.png', dpi=150)

Case Study: Database Connection Pooling

Observations:
- 10 connections: 1000 TPS
- 20 connections: 1800 TPS
- 40 connections: 2800 TPS
- 80 connections: 3200 TPS (peak!)
- 160 connections: 2400 TPS (retrograde)

Fitted: σ=0.02, κ=0.0001
N* = sqrt(0.98/0.0001) ≈ 99 connections

Diagnosis: κ > 0 indicates coherence issues
- Connection pool management overhead
- Distributed lock contention
- Context switch costs

Roofline Model: Cache-Aware Extensions

Cache-Aware Roofline Model (CARM)

Traditional Roofline considers only DRAM bandwidth. CARM extends to multiple memory hierarchy levels.

Multiple Rooflines

Performance (GFLOPS)
    ^
    |  Peak Compute ─────────────────────────
    |  /  /  /  /  /
    | / L1 Roofline (highest slope)
    |/  / L2 Roofline
    |  /  / L3 Roofline
    | /  /  / DRAM Roofline (lowest slope)
    |/  /  /  /
    └──────────────────────────────────────> Arithmetic Intensity

Calculating AI for Each Level

LevelArithmetic IntensityWhen to Use
AI_L1FLOPs / L1_trafficData fits in L1
AI_L2FLOPs / L2_trafficData fits in L2
AI_L3FLOPs / L3_trafficData fits in L3
AI_DRAMFLOPs / DRAM_trafficStreaming from memory

Measurement with perf

# Measure floating-point operations
perf stat -e fp_arith_inst_retired.scalar_double,\
fp_arith_inst_retired.128b_packed_double ./my_program

# Measure memory traffic (cache misses × cache line size)
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,\
LLC-load-misses,LLC-store-misses ./my_program

Optimization Strategy by Location

Point LocationDiagnosisOptimization
Below DRAM slopeDRAM bandwidth limitedPrefetching, streaming stores
Between L3 and DRAML3 miss issuesImprove data locality, blocking
Near L1 slope, below computeLow arithmetic intensityLoop fusion, vectorization
Near compute ceilingCompute limitedBetter algorithms, SIMD

Integer Roofline

For non-FP workloads (compilers, databases, encryption):

  • Y-axis: GOPS (Giga-Operations Per Second)
  • Integer ops often faster than FP
  • More sensitive to cache latency than FP

Energy Roofline (Green Computing)

  • Y-axis: GFLOPS/Watt
  • Finds energy-efficient operating points
  • Important for HPC and data centers

Queuing Theory Fundamentals

Kendall's Notation: A/S/c/K/N/D

SymbolMeaningCommon Values
AArrival distributionM (Poisson), D (Deterministic), G (General)
SService distributionM (Exponential), D, G
cNumber of servers1, c, ∞
KSystem capacity∞ (unbounded), finite
NPopulation size∞, finite
DQueue disciplineFIFO, LIFO, Priority

M/M/1 Model: Complete Analysis

Assumptions:

  • Poisson arrivals (rate λ)
  • Exponential service times (rate μ)
  • Single server
  • FIFO queue
  • Infinite capacity

Key Formulas:

Utilization: ρ = λ/μ (must be < 1 for stability)

Probability of n items in system: P_n = (1-ρ) × ρ^n

Average items in system: L = ρ/(1-ρ)

Average items in queue: L_q = ρ²/(1-ρ)

Average time in system: W = 1/(μ-λ)

Average time in queue: W_q = ρ/(μ-λ)

The Hockey Stick Effect:

Utilization (ρ)Avg Queue Length (L)Wait Time Multiplier
50%1.02× service time
70%2.33.3×
80%4.0
90%9.010×
95%19.020×
99%99.0100×

M/M/c Model: Multiple Servers

Erlang C Formula (probability of waiting):

import math

def erlang_c(c, rho):
    """Calculate probability of queuing in M/M/c system."""
    a = c * rho  # Offered load

    # Erlang C formula
    sum_term = sum((a**k) / math.factorial(k) for k in range(c))
    last_term = (a**c) / (math.factorial(c) * (1 - rho))

    return last_term / (sum_term + last_term)

# Example: 10 servers, 80% average utilization
prob_wait = erlang_c(10, 0.8)
print(f"Probability of queuing: {prob_wait:.1%}")

Capacity Planning Guidelines

  1. 70% Rule: Keep utilization ≤ 70% for latency-sensitive systems
  2. Target Wait Probability: For SLA, target < 5% probability of waiting
  3. Headroom for Bursts: Leave 30% headroom for traffic spikes

Connection Pool Sizing Formula

def optimal_pool_size(arrival_rate, service_time, target_wait_prob=0.05):
    """Calculate optimal connection pool size."""
    rho = arrival_rate * service_time

    # Binary search for minimum c where P(wait) < target
    for c in range(1, 1000):
        if c * rho > c:  # Unstable
            continue
        if erlang_c(c, rho/c) < target_wait_prob:
            return c
    return None

# Example: 100 requests/sec, 50ms average service time
pool_size = optimal_pool_size(100, 0.05)
print(f"Recommended pool size: {pool_size} connections")

Tools and Measurement

Performance Counter Events

ToolBest ForKey Events
perfLinux, generalcycles, instructions, cache-misses
Intel VTuneIntel CPUsvectorization, memory bandwidth
Intel AdvisorRoofline analysisFLOPS, memory traffic
LikwidHPC, multi-archconfigurable groups
PAPICross-platformportable API

Measurement Best Practices

  1. Warm up: Run several iterations before measuring
  2. Steady state: Ensure system reaches equilibrium
  3. Multiple runs: Report mean and standard deviation
  4. Control variables: Pin cores, disable frequency scaling
  5. Representative load: Use realistic workloads

References

  1. Amdahl, G.M. (1967). "Validity of the single processor approach"
  2. Gustafson, J.L. (1988). "Reevaluating Amdahl's Law"
  3. Gunther, N.J. (1993). "Practical Performance Analyst"
  4. Williams, S. et al. (2009). "Roofline: An Insightful Visual Performance Model"
  5. Little, J.D.C. (1961). "A Proof for the Queuing Formula: L = λW"
  6. Kleinrock, L. (1975). "Queueing Systems, Volume 1: Theory"
  7. Iyer, L.M. et al. (2015). "Cache-Aware Roofline Model"