title: "Performance and Benchmarking"
subtitle: "Beyond the Bottleneck: From Classic Systems to Modern AI and HPC"
author: "Danny Jiang"
version: "Draft v0p9"
date: "January 2026"

Performance and Benchmarking

Beyond the Bottleneck

From Classic Systems to Modern AI and HPC

Danny Jiang

Draft v0p9 - January 2026

Complete Book Contents:

  • 35 chapters in 9 parts
  • 8 appendices with exercises and reference materials
  • Comprehensive coverage from benchmarking basics to AI/HPC and embedded systems

Licensed under CC BY 4.0

Copyright and License


Performance and Benchmarking

Beyond the Bottleneck: From Classic Systems to Modern AI and HPC

Copyright © 2025-2026 Danny Jiang

  • Version: Draft v0p9
  • Published: January 2026
  • Author: Danny Jiang
  • Contact: djiang.tw@gmail.com

License

This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

  • Share: Copy and redistribute the material in any medium or format, even for commercial purposes

  • Adapt: Remix, transform, and build upon the material, even for commercial purposes

Under the following terms:

  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Full License Terms: https://creativecommons.org/licenses/by/4.0/


Trademarks

  • RISC-V is a trademark of RISC-V International
  • ARM is a trademark of Arm Limited
  • Intel, x86, and VTune are trademarks of Intel Corporation
  • NVIDIA, CUDA, and Nsight are trademarks of NVIDIA Corporation
  • Linux is a trademark of Linus Torvalds
  • Other product and company names mentioned may be trademarks of their respective owners

Disclaimer

This book is provided "as is" without warranty of any kind, express or implied. The author and publisher disclaim all warranties, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement.

The information in this book is based on publicly available documentation, specifications, and the author's professional experience. While every effort has been made to ensure accuracy, hardware and software continue to evolve. Readers should verify information against current documentation and test thoroughly in their specific environments.

Performance measurements and benchmarks in this book are for the specific hardware and software configurations described. Results may vary on different systems.


About This Book

This is the complete book "Performance and Benchmarking." It contains:

  • 35 chapters in 9 parts
  • 8 appendices with exercises, tools, and reference materials
  • Comprehensive coverage from benchmarking basics to AI/HPC and embedded footprint analysis

Author GitHub: https://github.com/djiangtw

Updates and Errata: To be announced


January 2026

Preface

Who This Book Is For

This book is written for software engineers who need to measure and analyze performance. You might be:

  • A developer who needs to evaluate different algorithms or systems
  • A QA engineer responsible for performance testing
  • An embedded systems engineer evaluating hardware platforms
  • A technical lead who needs to present performance data to customers
  • A student interested in performance engineering

Regardless of your background, this book will help you avoid common benchmarking pitfalls and produce reliable, reproducible, and meaningful performance data.

Book Structure

This book is organized into nine parts:

Part I: Foundations - Benchmarking Methodology (Chapters 1-4). Establishes the correct benchmarking mindset and methodology. Covers measurement environment, statistical methods, and result presentation.

Part II: Tools - Classic Benchmarks & Profiling (Chapters 5-9). Introduces various benchmark tools, including CPU, memory, system-level, profiling, and embedded benchmarks.

Part III: Theory - Performance Modeling (Chapters 10-12). A deep dive into performance modeling, including the Roofline Model, Amdahl's Law, cache, and branch prediction.

Part IV: Practice - Data Structures & Algorithms (Chapters 13-15). Practical data structure performance analysis, including array vs linked list, hash table vs tree, and sorting algorithms.

Part V: Advanced - Parallelism & Vectorization (Chapters 16-18). Advanced topics: SIMD, multi-core performance, and memory allocators.

Part VI: Embedded Constraints (Chapters 19-22). Embedded system footprint analysis: static/dynamic analysis, compiler optimization, stack analysis, and an RTOS case study.

Part VII: AI/HPC Performance (Chapters 23-29). Modern performance domains: AI/ML benchmarks, HPC, GPU, LLM, ML compilers, and Edge AI.

Part VIII: Case Studies (Chapters 30-32). Real-world optimization case studies: web server, database query, and ML inference.

Part IX: Synthesis (Chapters 33-35). Bringing it all together: how to benchmark, how to optimize, and CI/CD for performance.

About Code and Commands

Code examples in this book are primarily in C, the most common language for performance measurement. Concepts apply to any language.

Commands and tools in this book focus on Linux. Linux is chosen because:

  1. Linux is the mainstream platform for servers and embedded systems
  2. Linux provides the most complete performance measurement tools
  3. Linux behavior is most predictable and controllable

Most concepts and methodologies apply to all operating systems. Appendices provide corresponding tools and commands for Windows and macOS users.

How to Use This Book

You can read sequentially or jump to chapters that interest you. However, I recommend:

  1. Read Part I first: Even if experienced, the methodology here helps avoid common mistakes
  2. Parts II-III are selectively readable: Choose relevant chapters based on what you need to measure
  3. Part IV is for data structures: When you need to understand how data structures perform in practice
  4. Part V is for low-level optimization: SIMD, multi-core, and memory allocators
  5. Part VI is for embedded: When you need to analyze footprint in resource-constrained systems
  6. Part VII is for AI/HPC: When you need to handle modern AI and HPC workloads
  7. Part VIII contains case studies: Real-world optimization examples
  8. Part IX provides synthesis: How to benchmark, optimize, and integrate into CI/CD

Each chapter ends with a Summary, and appendices contain exercises for practice.

Acknowledgments

This book exists thanks to the inspiration and support of many people.

First, I want to thank Gavin Guo and Jim Huang (jserv), two former colleagues from whom I learned a great deal—both through direct collaboration and through their publications, talks, and open-source contributions to performance analysis tools and methodology. Their work in the public domain continues to benefit engineers everywhere.

I thank the open-source community for creating the tools that make this book possible—perf, Valgrind, GCC, LLVM, and countless others. The transparency of open-source software allows us to understand performance at the deepest levels.

Thanks to engineers who share knowledge through blogs, papers, and conference talks. The work of Brendan Gregg on performance analysis, Fedor Pikus on C++ optimization, Ulrich Drepper on memory systems, and Agner Fog on x86 optimization has shaped my understanding and influenced this book.

I thank colleagues at SiFive, MIPS, Andes Technology, Broadcom, Western Digital, and SiS. Performance analysis and benchmarking has been my primary focus at SiFive and a significant part of my responsibilities at other companies. The practical experience gained from these teams—debugging real performance issues, building measurement infrastructure, and optimizing production systems—forms the foundation of the examples and case studies throughout this book.

Thanks to early reviewers who provided feedback on draft chapters. Your suggestions improved the technical accuracy and clarity of the material.

Finally, thanks to my family for their patience and support during many evenings and weekends of writing.

Feedback

If you find errors or have suggestions, please contact: djiang.tw@gmail.com


Let's begin.

Table of Contents

Part I: Foundations - Benchmarking Methodology

  • Chapter 1: Why Benchmarking Is Hard
  • Chapter 2: Setting Up Your Measurement Environment
  • Chapter 3: Measurement Methodology
  • Chapter 4: Presenting Results

Part II: Tools - Classic Benchmarks & Profiling

  • Chapter 5: CPU Benchmarks
  • Chapter 6: Memory Benchmarks
  • Chapter 7: System-Level Benchmarks
  • Chapter 8: Profiling Tools
  • Chapter 9: Embedded & RTOS Benchmarks

Part III: Theory - Performance Modeling

  • Chapter 10: Performance Modeling
  • Chapter 11: Galactic Algorithms
  • Chapter 12: Cache & Branch Prediction

Part IV: Data Structures & Algorithms

  • Chapter 13: Array vs Linked List
  • Chapter 14: Hash Table vs Tree
  • Chapter 15: Sorting Algorithms

Part V: Parallelism & Low-Level Optimization

  • Chapter 16: SIMD & Vectorization
  • Chapter 17: Multi-core Performance
  • Chapter 18: Memory Allocators

Part VI: Embedded Constraints

  • Chapter 19: Footprint Analysis Fundamentals
  • Chapter 20: Compiler Size Optimization
  • Chapter 21: Stack Analysis and Estimation
  • Chapter 22: RTOS Footprint Case Study

Part VII: AI/HPC Performance

  • Chapter 23: Evolution of Performance Metrics
  • Chapter 24: AI/ML Benchmarks
  • Chapter 25: HPC Benchmarks
  • Chapter 26: GPU Benchmarking
  • Chapter 27: LLM Performance Analysis
  • Chapter 28: ML Compilers and Runtime
  • Chapter 29: Edge AI Performance

Part VIII: Case Studies

  • Chapter 30: Case Study: Web Server Optimization
  • Chapter 31: Case Study: Database Query Optimization
  • Chapter 32: Case Study: ML Inference Optimization

Part IX: Synthesis

  • Chapter 33: How to Benchmark
  • Chapter 34: How to Optimize
  • Chapter 35: CI/CD for Performance

Appendices

  • Appendix A: Benchmark Automation
  • Appendix B: Embedded and RTOS Implementation
  • Appendix C: I/O and Storage Performance
  • Appendix D: Power and Performance
  • Appendix E: Exercises and Solutions
  • Appendix F: Environment Setup Guide
  • Appendix G: Further Reading
  • Appendix H: Performance Models Deep Dive

Chapter 1: Why Benchmarking Is Hard

Part I: Foundations


"There are three kinds of lies: lies, damned lies, and benchmarks." — Adapted from Benjamin Disraeli

The Meeting Before Launch

It was a Monday morning, two weeks before product launch. I sat in a conference room listening to Kevin, our marketing manager, describe what he needed.

"We need some performance numbers," he said, "for the datasheet and press release. Customers want to know how much faster our new chip is compared to the previous generation."

Fair enough. I was the performance engineer—this was literally my job.

"No problem," I said. "I can run some benchmarks. Should take about a week for a proper analysis."

Kevin frowned. "A week? We just need a few numbers. Tony got us the data in one day last time."

Tony was the engineer who handled the previous generation. He'd since left the company. I found his benchmark scripts and decided to run them first.

The results shocked me.

Those "Perfect" Numbers

Tony's scripts produced this output:

New Chip vs Old Chip Performance Comparison
============================================
Integer Operations:   +47%
Floating Point:       +62%
Memory Bandwidth:     +35%
Overall Score:        +48%

Too perfect. Every number showed improvement, and the gains were right in that "impressive but believable" range.

But I noticed a few oddities:

  1. No variance data — Each test had exactly one number, no standard deviation
  2. No environment description — No mention of test conditions
  3. No raw data — Only final conclusions, no underlying measurements

I decided to dig deeper.

Running It Again

I re-ran the same benchmark ten times. The results:

Run 1:   +52%
Run 2:   +31%
Run 3:   +47%
Run 4:   +68%
Run 5:   +29%
Run 6:   +41%
Run 7:   +55%
Run 8:   +33%
Run 9:   +44%
Run 10:  +38%

The variance was enormous. The best run showed +68%, the worst +29%—more than a 2× difference.

Tony's reported +47% was within this range, but he'd only run it once and happened to hit a favorable number. This wasn't fraud, but it wasn't accurate either.

Worse, when I checked the test environments:

  • The new chip was tested in a 25°C air-conditioned lab
  • The old chip was tested in a 35°C regular office
  • The new chip had been freshly rebooted before testing
  • The old chip had been running for three days before testing

This wasn't a fair comparison at all.

I Can Make the Numbers Say Anything

That evening, I ran an experiment. I wanted to know: if I deliberately manipulated test conditions, how wide could I make the performance gap?

Best case (conditions favoring the new chip):

  • New chip: cold start, fresh reboot, all background processes killed, CPU frequency locked to maximum
  • Old chip: warm, running for days, multiple background processes, CPU in power-saving mode

Result: +89%

Worst case (conditions reversed):

Result: +12%

Same hardware, same benchmark program, and the performance difference ranged from +12% to +89% depending purely on how I set up the test environment.

This is what makes benchmarking terrifying: numbers don't lie, but numbers can be manipulated.

I Told Kevin the Truth

The next day, I scheduled a meeting with Kevin.

"I have good news and bad news," I said.

"Bad news first."

"The +47% figure isn't reliable. The test environments were inconsistent, and there was no statistical analysis. If we publish that number, tech journalists will tear it apart."

Kevin's face fell. "And the good news?"

"The good news is the new chip really is faster. Under controlled, fair testing conditions, the performance improvement is somewhere between +25% and +35%, with 95% confidence. That's a number we can defend."

Kevin was quiet for a moment. "+25% doesn't sound as impressive as +47%."

"But +25% is real. +47% was a lucky single run."

In the end, we used +30% (the middle of our confidence interval) and added a footnote to the datasheet describing our test methodology.

That decision taught me a lesson: honest benchmarks may not look as impressive, but at least they won't blow up in your face later.

Why Benchmarking Is So Hard

This experience taught me the fundamental challenge of benchmarking: too many factors affect measurement results, and our intuition consistently overlooks them.

Let me walk through the six major factors that influence benchmark results:

1. System Noise

Your computer never does just one thing. Background processes, kernel threads, and interrupt handlers are all competing for CPU time.

$ perf stat -r 10 ./my_benchmark

Performance counter stats for './my_benchmark' (10 runs):

    1,234,567 cycles    ( +- 15.2% )

System noise alone can cause 15% variance—and that's on a "quiet" system.

2. CPU Frequency Scaling

Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle.

Run 1 (cold):   1,000 μs @ 4.2 GHz
Run 2 (warm):   1,150 μs @ 3.8 GHz
Run 3 (hot):    1,400 μs @ 3.2 GHz

Statistics 101: Three Things You Must Know

After seeing these six factors, you understand why a single measurement isn't enough. Let me introduce three statistical concepts every performance engineer should know.

Mean: The Most Common Lie

Mean is the most commonly reported statistic—and often the most misleading.

Consider these two benchmark results:

Benchmark A: 100, 100, 100, 100, 100
Mean: 100 μs

Benchmark B: 50, 50, 50, 50, 300
Mean: 100 μs

Same mean, completely different behavior. Benchmark B has a tail latency problem, but the mean hides it.

Lesson: Never report just the mean. Always include variance or percentiles.

Variance and Standard Deviation

Variance measures how spread out your data is. Standard deviation (σ) is the square root of variance, with the same units as your measurement.

Benchmark A: σ = 0 μs (perfectly consistent)
Benchmark B: σ = 100 μs (high variance)

Rule of thumb: if σ exceeds 5% of the mean, your measurements are too noisy.

Confidence Intervals

When you say "my optimization is 15% faster," how sure are you?

A confidence interval tells you where the true value likely falls. A 95% confidence interval means: if you repeated this experiment 100 times and computed an interval each time, about 95 of those intervals would contain the true value.

Performance improvement: 15% (95% CI: 8% to 22%)

This says: "I'm 95% confident the real improvement is between 8% and 22%."

If your confidence interval crosses zero, you can't claim any improvement at all:

Performance improvement: 5% (95% CI: -3% to 13%)

That might just be noise, not signal.

How Many Runs Do I Need?

One of the most common questions: how many times should I run my benchmark?

The answer depends on variance. Here's a practical approach:

Step 1: Run 10 times, calculate the standard deviation.

Step 2: Use this formula to estimate the required sample size:

n = (z × σ / E)²

where:
  n = required sample size
  z = 1.96 (for 95% confidence)
  σ = standard deviation
  E = acceptable margin of error

Example: Your benchmark has σ = 100 μs, and you want the error within ±10 μs:

n = (1.96 × 100 / 10)² = 19.6² ≈ 384.2, rounded up to 385 samples

You need about 400 runs to get reliable results.

Step 3: If you can't run that many, either relax your error margin or reduce variance by controlling the test environment.
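The formula in Step 2 translates directly into code. This is a small sketch with an illustrative function name; it rounds up because a fractional run is not possible.

```c
#include <stddef.h>

/* Required sample size n = (z * sigma / e)^2, rounded up.
 *   z     : critical value (1.96 for 95% confidence)
 *   sigma : standard deviation estimated from the pilot runs
 *   e     : acceptable margin of error, in the same units as sigma (> 0) */
size_t required_samples(double z, double sigma, double e) {
    double ratio = z * sigma / e;
    double exact = ratio * ratio;
    size_t n = (size_t)exact;        /* truncates toward zero */
    if ((double)n < exact) n++;      /* round up to a whole number of runs */
    return n;
}
```

Because n grows with the square of σ / E, halving the noise cuts the required runs by four, which is why controlling the environment usually pays off faster than running longer.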

Warm-up: The Hidden Requirement

Watch what happens when I run a benchmark 100 times consecutively and plot the results:

Run   Time (μs)
1     5,234    ← cold start
2     3,891
3     2,456
4     1,234
5     1,198
...
50    1,201
100   1,199    ← steady state

The first few runs are outliers—JIT compilation, cache warming, branch predictor training. These don't represent steady-state performance.

Solution: Always include warm-up runs, then discard that data:

// Warm-up phase (discard these)
for (int i = 0; i < WARMUP_RUNS; i++) {
    run_benchmark();
}

// Measurement phase (keep these)
for (int i = 0; i < MEASURED_RUNS; i++) {
    times[i] = run_benchmark();
}

How many warm-up runs? Enough that subsequent results stabilize. Plot the data—you'll see when it converges.
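If you would rather have a number than read it off a plot, one possible heuristic is sketched below: treat the mean of the last few runs as the steady state and find the first run from which a whole window of samples stays within a tolerance of it. The function name, the window size, and the tolerance are all assumptions for illustration, not a standard method.

```c
#include <stddef.h>

/* Return the index of the first run from which `window` consecutive samples
 * all lie within `tol` (a fraction, e.g. 0.05) of the steady-state estimate,
 * taken here as the mean of the last `window` samples.
 * Returns n if no such run exists or the input is too short. */
size_t first_stable_run(const double *t, size_t n, size_t window, double tol) {
    if (window == 0 || n < window) return n;

    double steady = 0.0;
    for (size_t i = n - window; i < n; i++) steady += t[i];
    steady /= (double)window;

    for (size_t i = 0; i + window <= n; i++) {
        int ok = 1;
        for (size_t j = i; j < i + window; j++) {
            double dev = t[j] - steady;
            if (dev < 0) dev = -dev;
            if (dev > tol * steady) { ok = 0; break; }
        }
        if (ok) return i;   /* runs 0 .. i-1 are warm-up */
    }
    return n;
}
```

The returned index is a lower bound on WARMUP_RUNS; adding a safety margin on top is cheap compared with publishing a number polluted by cold-start runs.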

Back to That Meeting

Looking back at the pre-launch meeting, here's what proper analysis of Tony's data would have shown:

Tony's method (1 run):
  Result: +47%
  Confidence: Unknown
  Reproducibility: Unverified

Proper method (100 runs, controlled environment):
  Mean: +30%
  σ: 4.2%
  95% CI: [+25%, +35%]
  Reproducibility: ✓

+47% became +30%. Less impressive, but true.

More importantly, this number was defensible. When tech journalists or competitors challenged us, we could produce complete methodology and raw data.

That's the value of proper benchmarking: not prettier numbers, but trustworthy numbers.

Guidelines for Reliable Benchmarking

Based on these experiences, here are my guidelines:

Guideline 1: Control the Environment

# Disable CPU frequency scaling
sudo cpupower frequency-set -g performance

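# Disable turbo boost (this file exists with drivers such as acpi-cpufreq;
# with intel_pstate, write 1 to /sys/devices/system/cpu/intel_pstate/no_turbo instead)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost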

# Pin to a specific CPU core
taskset -c 2 ./benchmark  # bind to core 2

Guideline 2: Warm Up Before Measuring

Always discard the first runs. How many depends on your workload—measure until results stabilize.

Guideline 3: Report Variance, Not Just Mean

✗ Bad:   "Latency: 1.2 ms"
✓ Good:  "Latency: 1.2 ms (σ = 0.1 ms, n = 1000)"

Guideline 4: Use Confidence Intervals When Comparing

Don't say "A is 15% faster than B." Say "A is 15% faster than B (95% CI: 10% to 20%)."

Guideline 5: Be Suspicious of Large Improvements

If your optimization shows a 10× improvement, check everything three times. You've probably made a measurement error.

Summary

Benchmarking is hard because reality is messy. CPUs change frequency, caches retain old data, background processes steal cycles, and our intuition—trained on Big-O analysis—consistently fails us.

To benchmark correctly:

  • Measure multiple times and compute variance
  • Warm up before measuring
  • Report uncertainty using confidence intervals
  • Control the environment to reduce noise
  • Stay skeptical of your own results

Chapter 2: Setting Up Your Measurement Environment

Part I: Foundations


"Measure what is measurable, and make measurable what is not so." — Galileo Galilei

The Unreproducible Bug

"I ran it ten times, and I got a different number each time."

This was the first thing Jason, a new engineer I was mentoring, said during his first performance analysis task. He was measuring a sorting algorithm, and his results varied by 40% between runs.

"What's your measurement environment?" I asked.

"Just my laptop."

I glanced at his screen—Slack was open, Chrome had twenty-something tabs, Spotify was playing music, and a Docker container was running in the background.

"That's your problem," I said. "You're not measuring your program. You're measuring your program plus Slack, plus Chrome, plus Spotify, plus Docker, plus whatever mood your laptop is in."

System Noise: The Invisible Enemy

Modern operating systems are time-shared. Even when you're running a single program, the OS is doing many things in the background:

  • Kernel threads: Memory management, I/O scheduling, network processing
  • Interrupts: Hardware interrupts from network cards, USB, timers
  • Background daemons: Cron jobs, logging, file indexing
  • Power management: CPU frequency scaling, thermal throttling

This "noise" steals CPU cycles from your program—unpredictably.

Let me show you a simple experiment. This is the same benchmark run 100 times on a "busy" system:

Run    Time (μs)    Notes
1      1,234
2      1,198
3      5,678        ← Context switch?
4      1,201
5      1,245
...
47     8,901        ← Interrupt storm?
...
100    1,199

Mean: 1,456 μs, but median: only 1,215 μs. Those outliers completely distort the average.
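The median is easy to compute alongside the mean. A minimal sketch follows; the function name is illustrative, and it sorts a copy so the original run order stays available for plotting.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples, computed on a sorted copy.
 * Returns 0.0 if n == 0 or the allocation fails. */
double median(const double *x, size_t n) {
    if (n == 0) return 0.0;
    double *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return 0.0;
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2]
                       : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return m;
}
```

For the first five runs in the table above (1,234, 1,198, 5,678, 1,201, 1,245) the median is 1,234 μs while the mean is 2,111.2 μs: a single outlier moves the mean by almost 900 μs and leaves the median untouched.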

Step 1: Reduce System Noise

To get reproducible measurements, you must first reduce noise sources.

Kill Unnecessary Programs

This is basic, but often overlooked:

# Check what's running
ps aux | head -20

# Kill the usual CPU hogs
pkill chrome
pkill slack
pkill spotify
pkill docker

Disable Background Services

# Stop cron
sudo systemctl stop cron

# Stop logging daemon (careful—this stops system logs)
sudo systemctl stop rsyslog

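# Stop indexing services (unit names vary by distribution and version)
systemctl --user stop tracker-miner-fs  # GNOME indexer, usually a per-user service
sudo systemctl stop mlocate.timer       # updatedb scheduler; may be named differently or run from cron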

Set Up CPU Isolation

Linux can reserve certain CPU cores exclusively for your benchmark, preventing the kernel from scheduling other work on them:

# Add to boot parameters (requires reboot)
isolcpus=2,3

# Or use cgroups for dynamic isolation
sudo cset shield -c 2,3 -k on

Then pin your benchmark to these isolated cores:

taskset -c 2 ./my_benchmark
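The same pinning can be done from inside the benchmark, which avoids depending on how it is launched. This sketch uses the Linux-specific sched_setaffinity call as exposed by glibc; the helper names pin_to_cpu and first_allowed_cpu are illustrative.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes sched_setaffinity and the CPU_* macros in glibc */
#endif
#include <sched.h>

/* Pin the calling thread to one CPU. Returns 0 on success, -1 on error
 * (errno is set by sched_setaffinity, e.g. EINVAL if the CPU is not allowed). */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}

/* Return the lowest-numbered CPU this thread may currently run on, or -1. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Calling pin_to_cpu(2) at the top of main has the same effect as the taskset command above. Check the return value: inside a container or cpuset the requested core may not be available.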

CPU Frequency: The Hidden Variable

Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle. Great for laptops, terrible for benchmarking.

Turbo Boost: Friend or Foe?

When there is thermal and power headroom, typically when only a few cores are busy, the processor can temporarily run above its base frequency. But as temperature rises, frequency drops back down.

Time (s)    Frequency    Benchmark Time
0           4.2 GHz      952 μs
10          4.0 GHz      1,000 μs
30          3.6 GHz      1,111 μs
60          3.2 GHz      1,250 μs

Same benchmark, but the last run is about 31% slower than the first (1,250 μs vs. 952 μs) due to thermal throttling.
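The times in the table are what you would see for a purely core-bound workload of about 4,000,000 cycles, since time = cycles / frequency. That cycle count is inferred from the table, not stated by it, and the sketch below assumes the cycle count does not change with frequency (memory-bound code behaves differently).

```c
/* Time in microseconds to execute `cycles` core cycles at `ghz` GHz.
 * 1 GHz = 1000 cycles per microsecond. Assumes purely core-bound work,
 * so the cycle count is independent of clock frequency. */
double time_us(double cycles, double ghz) {
    return cycles / (ghz * 1000.0);
}
```

For example, time_us(4e6, 4.2) is about 952 μs and time_us(4e6, 3.2) is 1,250 μs, matching the first and last rows.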

Solution: Lock the CPU Frequency

# Check current frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set to performance mode (maximum frequency)
sudo cpupower frequency-set -g performance

# Or set a specific frequency (requires the userspace governor)
sudo cpupower frequency-set -f 2.0GHz

# Disable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Or for Intel:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

Important: Locking to "maximum frequency" isn't always best. If your benchmark runs long, the CPU may thermal-throttle anyway. Choose a frequency the system can sustain indefinitely.

Verify Frequency Stability

# Monitor CPU frequency in real-time
watch -n 0.5 "grep MHz /proc/cpuinfo"

# Or use turbostat for detailed stats
sudo turbostat --interval 1
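A benchmark can also check the frequency itself, before and after the timed region. The sketch below reads the cpufreq sysfs file scaling_cur_freq, which reports kHz; the function name is illustrative, and the file may be absent in virtual machines or containers, in which case the function returns -1.

```c
#include <stdio.h>

/* Read the kernel's current frequency value for one CPU, in kHz, from the
 * cpufreq sysfs interface. Returns -1 if the file is missing or unreadable. */
long read_cpu_khz(int cpu) {
    char path[96];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1) khz = -1;
    fclose(f);
    return khz;
}
```

Recording this value next to every batch of timings makes it possible to discard, after the fact, any batch during which the clock drifted.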

Cache State: Cold vs. Warm

CPU cache is another hidden variable. The same program can be 10× slower with a cold cache than a warm one.

The Problem

Consider this simple array sum:

long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

On first execution, the data isn't in cache. Every access goes to main memory:

First run (cold cache):   5,234 μs
Second run (warm cache):  523 μs

A 10× difference! If you only measure once, which result did you capture?

Solution: Explicitly Choose Cold or Warm

Option 1: Measure warm cache (steady state)

This is usually what you want—performance during normal operation:

// Warm-up runs (discard results)
for (int i = 0; i < WARMUP; i++) {
    sum_array(arr, n);
}

// Measurement runs (record results)
for (int i = 0; i < RUNS; i++) {
    times[i] = measure(sum_array, arr, n);
}
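The measure helper used above is not defined in the text. One possible implementation, built on clock_gettime(CLOCK_MONOTONIC), is sketched here; g_sink is a hypothetical global that keeps the compiler from optimizing the call away, and sum_array is repeated so the block compiles on its own.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std=c11 */
#endif
#include <stddef.h>
#include <time.h>

/* The function under test, as defined earlier in this chapter. */
long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

/* The result is stored here so the compiler cannot discard the call. */
volatile long g_sink;

/* Time one call of fn(arr, n) and return the elapsed nanoseconds. */
long long measure(long (*fn)(int *, size_t), int *arr, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    g_sink = fn(arr, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

This returns nanoseconds, so times would be an array of long long. For very short functions, time a batch of calls rather than a single one, so the work being measured dwarfs the cost of the two timer calls.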

Option 2: Measure cold cache (worst case)

Sometimes you need first-run performance, like startup latency:

for (int i = 0; i < RUNS; i++) {
    // Flush cache
    flush_cache();

    // Measure cold cache performance
    times[i] = measure(sum_array, arr, n);
}

How to flush the cache:

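#include <emmintrin.h>  // _mm_clflush, _mm_mfence (x86 SSE2)
#include <stddef.h>
#include <stdint.h>

// Method 1: Use clflush instruction (need to know addresses)
void flush_array(void *ptr, size_t size) {
    // Start at the cache-line boundary at or below ptr, so the last line
    // of an unaligned buffer is not skipped
    uintptr_t start = (uintptr_t)ptr & ~(uintptr_t)63;  // 64 = cache line size
    uintptr_t end = (uintptr_t)ptr + size;
    for (uintptr_t a = start; a < end; a += 64) {
        _mm_clflush((const void *)a);
    }
    _mm_mfence();
}

// Method 2: Touch a large "trash" array to evict old data
void evict_cache(void) {
    static volatile char trash[32 * 1024 * 1024];  // 32MB, assumed > L3 cache
    for (size_t i = 0; i < sizeof(trash); i += 64) {
        // Write, not just read: never-written zero pages can all map to one
        // shared physical page, which would evict almost nothing
        trash[i]++;
    }
}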

Key principle: Whatever you choose, be explicit and consistent. Don't mix approaches between runs.

ASLR and Memory Layout

Address Space Layout Randomization (ASLR) is a security feature. Each time you run a program, the addresses of the stack, heap, and shared libraries are randomized.

How does this affect performance measurements?

Cache Conflicts

CPU caches use certain address bits to determine which cache set data belongs to. If your data structures happen to land at addresses that conflict, cache efficiency drops dramatically.

Because of ASLR, the same program may have different cache behavior on each run.
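To see how addresses map to sets, consider a simplified model where the set index is the line number modulo the number of sets. The sketch below assumes a hypothetical 32 KiB, 8-way cache with 64-byte lines (64 sets); real CPUs may index by physical address or use hashed indexing, so treat it as an illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Set index for a set-associative cache indexed by plain address bits:
 * drop the offset within the line, then take the line number modulo the
 * number of sets. num_sets = cache_size / (ways * line_size). */
size_t cache_set_index(uintptr_t addr, size_t line_size, size_t num_sets) {
    return (size_t)((addr / line_size) % num_sets);
}
```

In this model, addresses 0x1000 and 0x2000 both land in set 0, as does every address 4,096 bytes further on. Nine hot objects spaced 4 KiB apart would compete for the eight ways of a single set, and whether that happens can depend on where the allocator and ASLR place them.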

Solutions

Option 1: Disable ASLR

# Temporarily disable (system-wide, until reboot or until the value is restored)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Disable for a specific program only
setarch $(uname -m) -R ./my_benchmark

Option 2: Run enough iterations to average it out

If you run enough times, ASLR's effects average out. But this requires more runs and produces higher variance.

My recommendation: Disable ASLR for benchmarking. We want reproducibility, not security.

NUMA: The Multi-Socket Trap

If you're benchmarking on a multi-socket server, there's another pitfall: NUMA (Non-Uniform Memory Access).

In NUMA systems, each CPU socket has its own "local" memory. Accessing local memory is fast; accessing remote memory is slow.

CPU 0 accessing local memory:   100 ns
CPU 0 accessing remote memory:  300 ns

If your program runs on CPU 0 but its data is allocated in CPU 1's memory, performance tanks.

Solution: Pin Both CPU and Memory

# Bind to node 0's CPUs and memory
numactl --cpunodebind=0 --membind=0 ./my_benchmark

# Or use interleave mode (distribute evenly)
numactl --interleave=all ./my_benchmark

Putting It All Together: Benchmark Environment Checklist

Based on everything above, here's my benchmark environment setup script:

#!/bin/bash
# benchmark_setup.sh - Create a reproducible benchmark environment

echo "=== Setting up benchmark environment ==="

# 1. Check for root privileges
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    exit 1
fi

# 2. Stop background services
echo "Stopping background services..."
systemctl stop cron
systemctl stop rsyslog
systemctl stop NetworkManager  # if networking not needed

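# 3. Set CPU frequency
echo "Setting CPU frequency..."
cpupower frequency-set -g performance
# Disable turbo boost (the control file depends on the cpufreq driver)
if [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
elif [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
fi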

# 4. Disable ASLR
echo "Disabling ASLR..."
echo 0 > /proc/sys/kernel/randomize_va_space

# 5. Show CPU isolation status
echo "CPU isolation: $(cat /sys/devices/system/cpu/isolated)"

# 6. Display current status
echo ""
echo "=== Current status ==="
echo "CPU frequency: $(grep -m1 MHz /proc/cpuinfo)"
echo "Turbo boost: $(cat /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || echo 'N/A')"
echo "ASLR: $(cat /proc/sys/kernel/randomize_va_space)"
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"

echo ""
echo "Ready for benchmarking!"
echo "Remember to run: taskset -c <isolated_cpu> ./your_benchmark"

Back to Jason's Problem

Remember Jason's 40% variance?

I had him make these changes:

  1. Close all unnecessary programs
  2. Lock CPU frequency
  3. Add warm-up runs
  4. Pin to a specific CPU core

Results:

Before:
  Mean: 1,234 μs
  Std Dev: 512 μs (41.5%)

After:
  Mean: 1,198 μs
  Std Dev: 12 μs (1.0%)

The relative standard deviation dropped from 41.5% to 1.0%. Now that's a measurement you can trust.

"So there was nothing wrong with my program," Jason said. "It was my measurement environment."

"Exactly," I said. "You've just learned the most important lesson in performance analysis: Before you measure your program, measure your measurement environment."

Summary

A reliable measurement environment is the foundation of correct benchmarking. This chapter covered:

System Noise

  • Kill unnecessary programs and background services
  • Use CPU isolation to reduce kernel interference
  • Pin benchmarks to fixed CPU cores with taskset

CPU Frequency

  • Lock frequency to avoid turbo boost and thermal throttling
  • Choose a frequency the system can sustain
  • Verify stability throughout the entire test

Cache State

  • Explicitly choose cold cache or warm cache measurement
  • Be consistent—don't mix approaches

ASLR and NUMA

  • Disable ASLR for reproducibility
  • On NUMA systems, bind both CPU and memory to the same node

Chapter 3: Measurement Methodology

Part I: Foundations


"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron

A Tale of Two Timers

It was a simple enough task: measure how long a function takes to execute.

My colleague Lisa used gettimeofday():

struct timeval start, end;
gettimeofday(&start, NULL);
do_something();
gettimeofday(&end, NULL);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000 +
               (end.tv_usec - start.tv_usec);
printf("Time: %ld μs\n", elapsed);

I used clock_gettime(CLOCK_MONOTONIC):

struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_something();
clock_gettime(CLOCK_MONOTONIC, &end);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000000L +
               (end.tv_nsec - start.tv_nsec);
printf("Time: %ld ns\n", elapsed);

We measured the same function but got different results.

Lisa's results bounced between 1,000 and 1,200 microseconds. Mine stayed stable around 1,050 microseconds.

Stranger still: after a sysadmin adjusted NTP (Network Time Protocol), Lisa's measurement came back negative—the function ended before it started.

"That's impossible," she said. "Time doesn't run backwards."

But the timer she was using can.

Wall Clock vs. Monotonic Clock

This is the first lesson in understanding timers: not all clocks are suitable for measuring elapsed time.

Wall Clock (gettimeofday, CLOCK_REALTIME)

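A wall clock answers "what time is it right now?" It can be stepped forward or backward by NTP or changed manually by an administrator. (Daylight saving time only changes how the value is displayed as local time; the counter returned by gettimeofday() and CLOCK_REALTIME is UTC-based.)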

12:00:00.000  - Start measurement
12:00:01.000  - NTP adjusts time back to 11:59:59.500
11:59:59.600  - End measurement

Elapsed time = -0.4 seconds (negative!)

Wall clocks are for recording when something happened, not how long it took.

Monotonic Clock (CLOCK_MONOTONIC)

A monotonic clock is guaranteed never to go backwards. On Linux, NTP may still slew the rate of CLOCK_MONOTONIC slightly, but it never steps the value; CLOCK_MONOTONIC_RAW is not adjusted at all.

Monotonic: 1000.000 - Start measurement
Monotonic: 1001.050 - End measurement

Elapsed time = 1.050 seconds (always positive)

Rule: When measuring elapsed time, always use a monotonic clock.
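
The same rule applies outside C. As a minimal sketch in Python (the helper name `timed` is hypothetical, not part of any library), `time.perf_counter_ns()` reads a monotonic clock, and `time.get_clock_info()` lets you confirm which clocks can be adjusted:

```python
import time

def timed(fn):
    """Run fn once and return the elapsed nanoseconds on a monotonic clock."""
    start = time.perf_counter_ns()   # monotonic, highest available resolution
    fn()
    return time.perf_counter_ns() - start

elapsed_ns = timed(lambda: sum(range(100_000)))
print(f"Elapsed: {elapsed_ns} ns")

# Compare the properties of the wall clock and the monotonic clocks
for name in ("time", "monotonic", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name:13s} monotonic={info.monotonic} adjustable={info.adjustable}")
```

On a typical system the `time` clock reports `adjustable=True`, which is exactly why it is unsuitable for intervals.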

Timer Precision vs. Timer Resolution

Another common confusion: precision versus resolution.

Resolution: Smallest Reportable Unit

struct timespec res;
clock_getres(CLOCK_MONOTONIC, &res);
printf("Resolution: %ld ns\n", res.tv_nsec);
// Typically outputs: 1 ns

This tells you the timer can report values in 1-nanosecond increments. But this does not mean you can accurately measure 1-nanosecond intervals.

Precision: What You Can Actually Trust

Calling the timer itself takes time:

// Measure timer overhead
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
clock_gettime(CLOCK_MONOTONIC, &t2);
printf("Timer overhead: %ld ns\n",
       (t2.tv_sec - t1.tv_sec) * 1000000000L + t2.tv_nsec - t1.tv_nsec);

On my system, this overhead is about 20-50 nanoseconds. This means:

  • Measuring a 1 μs event: 2-5% error
  • Measuring a 100 ns event: 20-50% error
  • Measuring a 10 ns event: completely unreliable

Rule: The interval you're measuring should be at least 100× the timer overhead for less than 1% error.
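
The 100× rule is simple arithmetic: relative error is roughly overhead divided by interval. The sketch below estimates the overhead of Python's own timer and derives the shortest interval worth measuring. The function names are illustrative, and the overhead you see in Python will be larger than the C figure quoted above.

```python
import time

def timer_overhead_ns(samples=10_000):
    """Estimate the cost of one timer call as the smallest back-to-back delta."""
    best = None
    for _ in range(samples):
        t1 = time.perf_counter_ns()
        t2 = time.perf_counter_ns()
        delta = t2 - t1
        if delta > 0 and (best is None or delta < best):
            best = delta
    return best if best is not None else 0

def min_interval_ns(overhead_ns, max_error=0.01):
    """Shortest interval whose overhead-induced relative error stays under max_error."""
    return overhead_ns / max_error

overhead = timer_overhead_ns()
print(f"Timer overhead: ~{overhead} ns")
print(f"Measure intervals of at least {min_interval_ns(overhead):.0f} ns for <1% error")
```

If the code under test is shorter than that, time a loop of many iterations and divide, rather than timing one call.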

CPU Cycles: The Most Precise Timer

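When you need to measure very short intervals (tens to hundreds of CPU cycles), read a hardware counter directly. Be aware of what each counter actually counts: on modern x86 the TSC ticks at a constant reference rate regardless of the current core frequency, and ARM's CNTVCT_EL0 is a fixed-frequency system timer, so neither reports true core cycles when the clock is scaled. Of the three shown here, only RISC-V's rdcycle counts core cycles, and some operating systems restrict access to it from user space.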

x86: RDTSC

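#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

uint64_t start = __rdtsc();
do_something_very_fast();
uint64_t end = __rdtsc();

// PRIu64 is the portable printf format for uint64_t.
// On modern CPUs the TSC ticks at a constant reference rate.
printf("TSC ticks: %" PRIu64 "\n", end - start);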

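ARM: CNTVCT_EL0

// Reads the generic timer's virtual count. It ticks at the fixed rate
// reported by CNTFRQ_EL0, not at the core clock, so the result is
// timer ticks rather than CPU cycles.
uint64_t read_timer_ticks(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r" (val));
    return val;
}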

RISC-V: rdcycle

uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("rdcycle %0" : "=r" (val));
    return val;
}

Handling Outliers

Real benchmark data will have outliers—extreme high or low values. The question is: should you remove them?

Why Outliers Happen

  1. Interrupts: System interrupts preempt your program
  2. Context switches: OS swaps your program out
  3. Page faults: First access to a new memory page
  4. GC pauses: If your language has garbage collection
  5. Thermal throttling: CPU overheats and slows down

Two Approaches

Conservative: Keep all data

If your goal is understanding "real world" behavior, outliers are part of reality.

# Report full percentiles
p50 = np.percentile(data, 50)
p90 = np.percentile(data, 90)
p99 = np.percentile(data, 99)
p999 = np.percentile(data, 99.9)

This approach is common for latency SLAs ("99% of requests complete within 10ms").

Aggressive: Remove outliers

If your goal is understanding algorithm performance, outliers are noise.

# Remove outliers beyond 3 standard deviations
mean = np.mean(data)
std = np.std(data)
filtered = [x for x in data if abs(x - mean) < 3 * std]

Or use a more robust method:

# Use IQR (Interquartile Range)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
filtered = [x for x in data if q1 - 1.5*iqr <= x <= q3 + 1.5*iqr]

My recommendation: Do both. Report filtered results as the primary data, but also report full percentiles so readers can see the outlier situation.
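
One way to "do both" is a single summary function that reports the IQR-filtered mean alongside the unfiltered percentiles. This is a sketch using only the standard library; `percentile` here is a simple nearest-index helper rather than NumPy's interpolating version, and the sample data is made up.

```python
import statistics

def percentile(sorted_data, pct):
    """Simple nearest-index percentile of an already sorted list."""
    k = round(pct / 100 * (len(sorted_data) - 1))
    return sorted_data[max(0, min(len(sorted_data) - 1, k))]

def summarize(data):
    """Report the filtered mean plus full (unfiltered) percentiles."""
    data = sorted(data)
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    kept = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return {
        "filtered_mean": statistics.mean(kept),
        "removed": len(data) - len(kept),
        "p50": percentile(data, 50),
        "p99": percentile(data, 99),
        "max": data[-1],
    }

# Hypothetical latencies in microseconds, with one interrupt-induced spike
samples = [1040, 1050, 1045, 1055, 1048, 1052, 1047, 1051, 1049, 5200]
for key, value in summarize(samples).items():
    print(f"{key}: {value}")
```

Reporting the number of removed samples next to the filtered mean keeps the filtering step visible to readers.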

Statistical Significance: Is Your Improvement Real?

You made an optimization. Before: 1,000 μs. After: 980 μs. Is that a 2% improvement?

Not necessarily. It might just be noise.

Hypothesis Testing

The formal approach uses hypothesis testing:

  1. Null hypothesis (H0): No difference between groups
  2. Alternative hypothesis (H1): There is a difference
  3. Run a statistical test (e.g., t-test)
  4. Check p-value: If p < 0.05, reject H0

from scipy import stats

before = [1000, 1020, 980, 1010, 990, ...]
after = [980, 970, 990, 960, 985, ...]

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"p-value: {p_value}")

if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Difference might be noise")
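
A t-test assumes roughly normal data, and benchmark timings are often skewed. A permutation test is a different technique that avoids that assumption: it asks how often a random relabeling of the samples produces a difference at least as large as the observed one. The sketch below uses only the standard library; the function name, the sample values, and the resample count are illustrative assumptions.

```python
import random
import statistics

def permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)          # fixed seed for reproducible p-values
    observed = abs(statistics.mean(after) - statistics.mean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[n_before:]) -
                   statistics.mean(pooled[:n_before]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical timings in microseconds
before = [1000, 1020, 980, 1010, 990]
after = [900, 910, 890, 905, 895]
print(f"p-value: {permutation_test(before, after):.4f}")
```

With only five samples per group the smallest achievable p-value is limited, which is one more reason to collect plenty of runs.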

Effect Size

Even if a difference is "statistically significant," it doesn't mean it's "important." With 10,000 samples, even a 0.1% difference can be "significant."

Effect size tells you how big the difference is:

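# Cohen's d (requires numpy imported as np)
def cohens_d(before, after):
    # Pooled standard deviation from the sample variances (ddof=1)
    pooled_std = np.sqrt((np.var(before, ddof=1) + np.var(after, ddof=1)) / 2)
    return (np.mean(after) - np.mean(before)) / pooled_std

d = cohens_d(before, after)
# Conventional thresholds (Cohen):
# |d| < 0.2: negligible
# 0.2 <= |d| < 0.5: small
# 0.5 <= |d| < 0.8: medium
# |d| >= 0.8: large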

Practical Significance

Finally, ask yourself: does this improvement matter to users?

  • 1,000 μs → 980 μs (2%): Probably unnoticeable
  • 100 ms → 50 ms (50%): Users will feel it
  • 10 s → 5 s (50%): Users will thank you

Rule: Statistical significance is necessary but not sufficient. You also need practical significance.

Warm-up Revisited

In Chapter 1 we mentioned warm-up. Here's a deeper look at determining how many warm-up iterations you need.

Visual Method: Plot It

The most intuitive approach is plotting each run's time:

Run#  Time (μs)
1     5,234    **********************
2     3,891    ****************
3     2,456    **********
4     1,334    *****
5     1,256    *****
6     1,243    *****
7     1,238    *****
...
50    1,241    *****

You can see runs 1-4 are warm-up; from run 5 onward, results stabilize.

Automatic Method: CV Convergence

Calculate the rolling coefficient of variation (CV). When it stabilizes, warm-up is complete:

def find_warmup(times, threshold=0.05, window=10):
    """
    Find where warm-up ends (requires numpy imported as np).
    Returns the index of the first run in the first window whose
    rolling CV is below threshold.
    """
    for i in range(window, len(times) + 1):  # +1 so the final window is checked too
        recent = times[i-window:i]
        cv = np.std(recent) / np.mean(recent)
        if cv < threshold:
            return i - window
    return len(times) // 2  # fallback: no stable window found
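
To see how the threshold interacts with real data, here is a standard-library variant of the same idea applied to the run times from the table above. The small window of 3 and the 2% threshold are assumptions chosen for this short series, not recommended defaults.

```python
import statistics

def find_warmup(times, threshold=0.05, window=3):
    """Return the index of the first run in the first stable window."""
    for i in range(window, len(times) + 1):
        recent = times[i - window:i]
        cv = statistics.pstdev(recent) / statistics.mean(recent)
        if cv < threshold:
            return i - window
    return len(times) // 2  # fallback: no stable window found

# Run times in microseconds (runs 1-7 from the table)
times = [5234, 3891, 2456, 1334, 1256, 1243, 1238]

strict = find_warmup(times, threshold=0.02)
loose = find_warmup(times, threshold=0.05)
print(f"2% threshold: discard the first {strict} runs")
print(f"5% threshold: discard the first {loose} runs")
```

The stricter threshold agrees with the visual reading (discard four runs), while the looser one already accepts run 4. The threshold is a judgment call, so check it against a plot.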

Rules of Thumb

  1. CPU-bound computation: Usually 3-5 warm-up runs (cache warming, branch predictor training)
  2. Memory-intensive workload: May need 10-20 runs (page tables, TLB)
  3. JIT languages (Java, JavaScript): May need hundreds to thousands

Important: Always verify your assumptions. Plot the data—don't blindly trust rules of thumb.

A Simple Measurement Framework

Let's put all these concepts together into a reusable framework:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>

typedef struct {
    double *samples;
    size_t count;
    double mean;
    double std;
    double median;
    double p95;
    double p99;
} BenchmarkResult;

double get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int compare_double(const void *a, const void *b) {
    double da = *(const double *)a;
    double db = *(const double *)b;
    return (da > db) - (da < db);
}

void analyze_result(BenchmarkResult *r) {
    qsort(r->samples, r->count, sizeof(double), compare_double);

    // Mean
    double sum = 0;
    for (size_t i = 0; i < r->count; i++) sum += r->samples[i];
    r->mean = sum / r->count;

    // Standard deviation
    double sq_sum = 0;
    for (size_t i = 0; i < r->count; i++) {
        sq_sum += (r->samples[i] - r->mean) * (r->samples[i] - r->mean);
    }
    r->std = sqrt(sq_sum / r->count);

    // Percentiles
    r->median = r->samples[r->count / 2];
    r->p95 = r->samples[(size_t)(r->count * 0.95)];
    r->p99 = r->samples[(size_t)(r->count * 0.99)];
}

void print_result(BenchmarkResult *r, const char *name) {
    printf("%s:\n", name);
    printf("  Mean:   %.2f ns (σ = %.2f, CV = %.2f%%)\n",
           r->mean, r->std, (r->std / r->mean) * 100);
    printf("  Median: %.2f ns\n", r->median);
    printf("  P95:    %.2f ns, P99: %.2f ns\n", r->p95, r->p99);
}

Usage:

#define WARMUP 100
#define RUNS 1000

void benchmark_example(void) {
    BenchmarkResult result;
    result.samples = malloc(RUNS * sizeof(double));
    if (result.samples == NULL) return;  // allocation failed; nothing to measure
    result.count = RUNS;

    // Warm-up (discard)
    for (int i = 0; i < WARMUP; i++) do_something();

    // Measurement
    for (int i = 0; i < RUNS; i++) {
        double start = get_time_ns();
        do_something();
        result.samples[i] = get_time_ns() - start;
    }

    analyze_result(&result);
    print_result(&result, "do_something()");
    free(result.samples);
}

Back to Lisa's Problem

Remember Lisa getting negative elapsed time with gettimeofday()?

That day she learned:

  1. Use the right timer: CLOCK_MONOTONIC won't go backwards due to NTP
  2. Understand timer overhead: Don't try to measure intervals too short
  3. CPU time vs. wall time: Choose based on your goal

A week later, she wrote a complete benchmark framework that became our team's standard tool.

"I had no idea timers were this complicated," she said.

"They are," I said. "But once you understand the pitfalls, you can write benchmarks that anyone can trust."

Summary

Correct measurement methodology is the core of reliable benchmarking. This chapter covered:

Timer Selection

  • Use CLOCK_MONOTONIC, not wall clocks
  • Understand timer overhead; don't measure intervals too short
  • For cycle-level precision, use CPU cycle counters

CPU Time vs. Wall Time

  • Wall time: user-perceived latency
  • CPU time: actual CPU usage
  • Their ratio tells you CPU utilization

Outlier Handling

  • Conservative: keep all data, report percentiles
  • Aggressive: remove outliers, but document your method
  • Recommendation: do both

Statistical Significance

  • Use hypothesis testing to confirm differences are real
  • Check effect size to confirm differences matter
  • Consider practical significance

Warm-up

  • Discard unstable initial runs
  • Use visual or automatic methods to determine warm-up count
  • Always verify your assumptions

Chapter 4: Presenting Results

Part I: Foundations


"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey

The Chart That Lost Us the Contract

This happened a few years ago. Our team spent three months optimizing a critical module, achieving a 35% performance improvement. We were excited to present the results to the client.

My colleague Mark prepared the presentation. He made a bar chart in Excel:

Performance Comparison

Old System    ████████████████████████████████████████  1000 ms
New System    █████  650 ms

Looks great, right? The new system's bar is clearly much shorter.

That was exactly the problem.

The client's CTO frowned. "Wait, where does the Y-axis start?"

"At 600," Mark said. "It makes the difference more visible."

The room went silent for a few seconds.

"So actually," the CTO said, "you went from 1000 to 650. If the Y-axis started at zero, the difference would look like this:"

Performance Comparison (Y-axis from 0)

Old    ████████████████████████████████████████  1000 ms
New    ██████████████████████████  650 ms

"Your 35% improvement is real," he said. "But your chart made it look like almost 90%. That makes me wonder if your other data has similar problems."

We didn't get that contract.

Common Misleading Chart Techniques

Mark's mistake is common. Here are techniques frequently used (intentionally or not) to exaggerate results:

1. Truncated Y-Axis

The most common trick. Start the Y-axis at a non-zero value, and small differences look huge.

Misleading (Y starts just below 95):
A    █  95
B    ████████████  100

Honest (Y starts at 0):
A    ███████████████████████████████████████████████  95
B    ██████████████████████████████████████████████████  100

A 5% difference looks like a 12× difference in the truncated version.
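
The distortion is easy to quantify: the ratio a viewer sees is the ratio of the drawn bar lengths, measured from wherever the axis starts. A small sketch (the function name is hypothetical) applied to Mark's chart:

```python
def apparent_ratio(small, large, axis_start=0.0):
    """Ratio of drawn bar lengths when the axis begins at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    return (large - axis_start) / (small - axis_start)

# Mark's chart: 650 ms vs 1000 ms with the axis starting at 600
print(f"Truncated axis: {apparent_ratio(650, 1000, axis_start=600):.1f}x")
# Same data with the axis starting at zero
print(f"Zero-based axis: {apparent_ratio(650, 1000):.2f}x")
```

The true ratio is about 1.5×, but the truncated axis draws it as 8×.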

2. Selective Data Range

Show only the data points that favor you.

"Our product has been ahead for the last three months!"
(But we were behind for the previous six months—not shown)

3. Dual Y-Axis Abuse

Use two Y-axes with different scales to make unrelated trends appear correlated.

Left Y-axis: Sales (0-1000)
Right Y-axis: Temperature (20-25°C)

"Look! Temperature and sales are perfectly correlated!"
(It's just a coincidence from scale manipulation)

4. 3D Effects

3D charts look fancy but distort visual proportions.

5. Area vs. Length

When using circle sizes to represent values, people confuse area with diameter.

A = 100, B = 200

If using circle area:
  A radius = 10
  B radius = 14.14 (√2 times)

Visually, B looks only slightly larger, but it's actually 2×.
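
The arithmetic behind this example, and behind the opposite mistake of scaling the radius linearly, can be sketched as follows. The helper names are illustrative.

```python
import math

def radius_for_value(value, ref_value, ref_radius):
    """Radius that makes the circle's AREA proportional to value."""
    return ref_radius * math.sqrt(value / ref_value)

def area_ratio(r_small, r_large):
    """How many times larger the second circle's area is."""
    return (r_large / r_small) ** 2

# Area encoding: B = 200 next to A = 100 (radius 10)
r_b = radius_for_value(200, 100, 10)
print(f"Area-scaled radius for B: {r_b:.2f}")

# Radius encoding: doubling the radius quadruples the area
print(f"Area ratio when radius is doubled: {area_ratio(10, 20):.0f}x")
```

Either encoding misleads some readers, which is why bar length is usually the safer choice for benchmark results.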

How to Present Benchmark Results Correctly

Rule 1: Start Y-Axis at Zero (Unless You Have Good Reason)

The only exception is when the data range is truly narrow, and you explicitly label it.

OK: "Note: Y-axis starts at 950 to show fine differences"
NOT OK: Sneakily starting at non-zero, hoping nobody notices

Rule 2: Show Uncertainty

Always include error bars. A bar chart without error bars is incomplete.

Performance (ms)
                Mean   [95% CI]
Algorithm A:    100    ████████████████████├──┤
Algorithm B:     95    ███████████████████├────┤

If error bars overlap, the difference may not be significant.

Rule 3: State Sample Size

"N = 1000 runs, 95% confidence interval"

A chart could come from 10 tests or 10,000 tests—the meaning is completely different.
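
As a sketch of how the numbers behind an error bar might be produced, the function below returns the mean and the half-width of an approximate 95% confidence interval. It uses the normal approximation (the 1.96 factor), which is an assumption that only holds for reasonably large N; for a handful of samples a t-distribution gives a wider, more honest interval. The sample values are hypothetical.

```python
import math
import statistics

def mean_ci95(samples):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)   # standard error of the mean
    return mean, 1.96 * sem                          # 1.96: normal approximation

latencies_ms = [100, 102, 98, 105, 97]   # hypothetical samples
mean, half = mean_ci95(latencies_ms)
print(f"{mean:.1f} ms ± {half:.1f} ms (N = {len(latencies_ms)}, 95% CI)")
```

Because the standard error shrinks with the square root of N, quadrupling the number of runs only halves the error bar.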

Rule 4: Provide Raw Data or Distribution

Summary statistics can hide bimodal or long-tailed behavior. Publish the raw samples, or at least a histogram or box plot, so readers can check the distribution for themselves.

Choosing the Right Chart Type

Different data needs different visualization.

Bar Chart: Compare Discrete Categories

Good for: comparing algorithms, systems, configurations

Throughput (ops/sec)

Algorithm A  ████████████████████████  2400
Algorithm B  ██████████████████  1800
Algorithm C  ██████████████████████████████  3000

Note: Bar charts are for independent categories, not trends.

Line Chart: Show Trends

Good for: change over time, change with parameters

Latency vs Data Size

Latency
(ms)
  │                                    ●
  │                               ●
  │                          ●
  │                     ●
  │                ●
  │           ●
  │      ●
  │ ●
  └────────────────────────────────
    1KB   10KB   100KB   1MB   10MB
              Data Size

Note: Line charts imply continuity. If your data is discrete, use bars.

Scatter Plot: Show Correlation or Distribution

Good for: relationships between variables, individual run results

Latency vs Throughput

Latency
  │  ●
  │    ●  ●
  │      ●●●
  │        ●●●●
  │          ●●●●
  │            ●●●
  │              ●●
  │                ●
  └──────────────────────
                  Throughput

Box Plot: Compare Distributions

Good for: comparing spread across multiple groups

Latency by Configuration

           Config A      Config B      Config C
              │             │              │
              ○             │              │    ← outlier
              │             │              │
           ┌──┴──┐       ┌──┴──┐        ┌──┴──┐
           │     │       │     │        │     │
           ├─────┤       ├─────┤        ├─────┤  ← median
           │     │       │     │        │     │
           └──┬──┘       └──┬──┘        └──┬──┘
              │             │              │
              │             ○              │    ← outlier

Box plots show: median (center line), quartiles (box), range (whiskers), outliers (dots).

Heatmap: Multi-Dimensional Data

Good for: effect of two parameters on performance

Throughput Heatmap

Thread Count
         1    2    4    8    16
      ┌────┬────┬────┬────┬────┐
  1KB │ ░░ │ ▒▒ │ ▓▓ │ ▓▓ │ ▒▒ │
      ├────┼────┼────┼────┼────┤
 10KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ▓▓ │
      ├────┼────┼────┼────┼────┤
100KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ██ │
      └────┴────┴────┴────┴────┘
Buffer
Size         ░ Low  ▒ Med  ▓ High  █ Best

Log Scale: When to Use It

When data spans multiple orders of magnitude, linear scale makes small values invisible.

Linear Scale:
Algorithm A  █  1 ms
Algorithm B  ██████████████████████████████████████████████████  1000 ms

Log Scale:
Algorithm A  ██████████  1 ms
Algorithm B  ██████████████████████████████████████████████████  1000 ms
                   (3 orders of magnitude difference)

When to use log scale:

  • Data spans 2+ orders of magnitude
  • You care about "ratios" rather than "absolute differences"
  • Comparing different scales (like latency percentiles: p50, p99, p99.9)

Caution: Log scale makes large differences look smaller. Ensure readers understand it's logarithmic.

Fair Comparisons

Same Conditions

Comparisons must use identical conditions. If you change multiple variables, you don't know what caused the difference.

Bad:
"Algorithm A on Intel Xeon vs Algorithm B on AMD EPYC"

Good:
"Algorithm A vs Algorithm B, both on Intel Xeon E5-2690"

Baseline Choice

Your choice of baseline affects interpretation.

Scenario 1: A is baseline
  A: 1.00× (baseline)
  B: 1.35× faster

Scenario 2: B is baseline
  A: 0.74× (26% slower)
  B: 1.00× (baseline)

Same data, different narrative. Choose a reasonable baseline (usually "current system" or "industry standard") and be consistent.
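
Both scenarios come from the same pair of measurements. A short sketch, assuming hypothetical run times of 135 ms for A and 100 ms for B:

```python
def relative_speed(times_ms, baseline):
    """Speed of each system relative to the baseline (higher is faster)."""
    base = times_ms[baseline]
    return {name: base / t for name, t in times_ms.items()}

times_ms = {"A": 135.0, "B": 100.0}   # hypothetical run times

for baseline in ("A", "B"):
    speeds = relative_speed(times_ms, baseline)
    line = ", ".join(f"{name}: {s:.2f}x" for name, s in speeds.items())
    print(f"Baseline {baseline} -> {line}")
```

Note that "1.35× faster" and "26% slower" describe the same gap; the percentages differ because they are taken against different denominators.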

Avoid Cherry-Picking

Don't only show favorable test cases.

Bad:
"Our system is 3× faster!" (on one specific workload we optimized for)

Good:
"Our system is 3× faster on workload A, 1.2× faster on workload B,
 and 0.9× (10% slower) on workload C"

Report all results honestly, including where you perform worse.

Structure of a Benchmark Report

A complete benchmark report should include:

1. Executive Summary

One paragraph summarizing the key findings. For people who don't have time to read the full report.

2. Test Environment

## Test Environment

- **Hardware**: Intel Xeon E5-2690 v4 @ 2.6GHz, 128GB RAM
- **OS**: Ubuntu 22.04 LTS, kernel 5.15.0
- **Compiler**: GCC 11.2 with -O3
- **Date**: 2024-01-15

3. Methodology

  • How many runs?
  • How many warm-up iterations?
  • How were outliers handled?
  • What statistical methods were used?

4. Results

Charts + data tables. Charts give visual impression; tables give precise numbers.

5. Analysis

Explain the results. Why is A faster than B? Where are the bottlenecks?

6. Limitations

Honestly state test limitations.

## Limitations

- Tests performed on a single machine; results may vary on different hardware
- Only tested with synthetic workloads; real-world performance may differ
- Memory-bound workloads not covered in this benchmark

7. Raw Data

Provide raw data for readers to analyze themselves (in appendix or via link).

Practical Tools

Simple Charts: gnuplot

set terminal png size 800,600
set output 'benchmark.png'
set title 'Algorithm Performance'
set xlabel 'Data Size'
set ylabel 'Time (ms)'
set style data linespoints
plot 'data.txt' using 1:2 title 'Algorithm A', \
     'data.txt' using 1:3 title 'Algorithm B'

Statistical Charts: Python + matplotlib

import matplotlib.pyplot as plt
import numpy as np

data_a = [100, 102, 98, 105, 97, ...]
data_b = [95, 93, 97, 94, 96, ...]

fig, ax = plt.subplots()
bp = ax.boxplot([data_a, data_b])
ax.set_xticklabels(['Algorithm A', 'Algorithm B'])  # avoids the boxplot label keyword, which newer matplotlib versions renamed
ax.set_ylabel('Latency (μs)')
ax.set_title('Latency Comparison')
plt.savefig('comparison.png', dpi=150)

Interactive: Jupyter Notebook

Jupyter Notebooks let you combine code, data, charts, and analysis text in one place—easy to reproduce and share.

Back to Mark's Story

After that failure, our team established visualization standards:

  1. Y-axis starts at zero (unless explicitly labeled)
  2. Always include error bars
  3. State sample size and test environment
  4. Provide raw data
  5. Report all results honestly, including bad ones

Six months later, we had another chance to present to the same client. This time our charts didn't look as "impressive," but the CTO said:

"This is how I want to see data presented. Your improvement is 35%, and your chart clearly shows 35%—no more, no less. That makes me trust your other data too."

We got that contract.

Summary

Presenting benchmark results correctly is as important as measuring correctly. This chapter covered:

Avoiding Misleading Charts

  • Don't truncate Y-axis (unless clearly labeled)
  • Don't cherry-pick data ranges
  • Avoid 3D effects and misleading area comparisons

Correct Visualization

  • Always show error bars
  • State sample size and test conditions
  • Provide raw data or distributions

Choosing the Right Chart

  • Bar chart: compare discrete categories
  • Line chart: show trends
  • Scatter plot: correlation
  • Box plot: compare distributions
  • Heatmap: multi-dimensional data

Fair Comparisons

  • Compare under identical conditions
  • Choose a reasonable baseline
  • Report all results, not just favorable ones

Complete Reports

  • Executive summary
  • Test environment and methodology
  • Results and analysis
  • Limitations
  • Raw data

Chapter 5: CPU Benchmarks

Part II: Tools


"Benchmarks are like statistics: you can prove anything with them if you try hard enough." — Unknown

The Dhrystone Revelation

In 1984, Reinhold Weicker released the Dhrystone benchmark. It's a short C program designed to measure CPU integer performance. More than forty years later, it's still widely used.

But Dhrystone has a fundamental problem. Let me start with a story.

A few years ago, I was evaluating two embedded processors. Vendor A claimed 3.0 DMIPS/MHz; Vendor B claimed 2.8 DMIPS/MHz. A looked faster, right?

We bought two development boards and ran Dhrystone:

Chip A: 3.1 DMIPS/MHz (matches spec)
Chip B: 2.9 DMIPS/MHz (matches spec)

Great, specs are accurate. Then we ran our actual application—an image processing pipeline:

Chip A: 45 fps
Chip B: 62 fps

Wait, Chip B is 38% faster? But A has higher DMIPS!

This is Dhrystone's problem.

Why Dhrystone Is Unreliable

Problem 1: Too Small, Fits in Cache

The entire Dhrystone program is only a few KB. On modern processors, it fits entirely in L1 instruction cache. This means it measures "best case," not real-world performance.

Dhrystone code size: ~4 KB
L1 I-cache size:     32-64 KB

Result: 100% cache hit rate (unrealistic)

Problem 2: Compilers Can "Cheat"

Dhrystone's source code has computations that can be optimized away. Smart compilers can dramatically boost scores.

// A piece of Dhrystone code
Proc_1(Ptr_Val_Par)
{
    // This function's result might not be used
    // Compiler might optimize the entire function away
}

This is why DMIPS numbers sometimes include compiler versions:

"3.0 DMIPS/MHz (GCC 4.8, -O2)"
"4.2 DMIPS/MHz (Commercial Compiler X, -O3)"

Same chip, different compilers, 40% score difference. Are we measuring CPU or compiler?

Problem 3: Doesn't Represent Real Workloads

Dhrystone was designed in 1984, based on "typical" instruction distributions of that era. Modern programs are completely different:

  • More memory access
  • More complex control flow
  • Larger working sets
  • More SIMD and floating-point operations

Using Dhrystone to predict modern application performance is like using 1984 traffic data to predict today's congestion.

CoreMark: The Modern Alternative

EEMBC (Embedded Microprocessor Benchmark Consortium) released CoreMark in 2009 as a Dhrystone replacement.

CoreMark's Improvements

1. Prevents Compiler Cheating

CoreMark results are validated. If the compiler optimizes away computations, validation fails.

// CoreMark uses CRC to validate results
crc = crc_calc(result);
if (crc != EXPECTED_CRC) {
    // Compiler cheated, result invalid
}

2. Larger Code Footprint

CoreMark is about 16-32 KB—larger than Dhrystone, but may still fit in L1 cache.

3. More Modern Workload Mix

Includes list processing, matrix operations, state machines—closer to modern applications.

CoreMark's Limitations

CoreMark is better than Dhrystone, but still has limits:

  1. Still synthetic — not a real application
  2. Still small — mainly measures cache-hot performance
  3. Single score — can't distinguish different workload types

SPEC CPU: The Industry Gold Standard

For serious CPU performance evaluation, SPEC CPU is the industry standard.

What Is SPEC CPU

SPEC (Standard Performance Evaluation Corporation) maintains several benchmark suites. SPEC CPU includes:

  • SPECint: Integer operations (compilers, compression, database engines, etc.)
  • SPECfp: Floating-point operations (scientific computing, simulation, etc.)

Each suite contains a dozen real applications, not synthetic code.

SPEC CPU 2006 Composition

SPECint 2006 (Integer)
----------------------
400.perlbench      Perl interpreter
401.bzip2          Compression
403.gcc            C compiler
429.mcf            Combinatorial optimization
445.gobmk          AI: Go game
456.hmmer          Search gene sequence
458.sjeng          AI: Chess
462.libquantum     Quantum computing simulation
464.h264ref        Video compression
471.omnetpp        Network simulation
473.astar          Path-finding
483.xalancbmk      XML processing

SPECfp 2006 (Floating Point)
----------------------------
410.bwaves         Fluid dynamics
416.gamess         Quantum chemistry
433.milc           Physics: QCD
434.zeusmp         Physics: CFD
... (and more)

SPEC 2006 is still widely used in academia because:

  • Many published papers use 2006 as baseline
  • Rich historical data for comparison
  • Some benchmarks (like mcf, gcc) are classic memory-bound and compute-bound representatives

SPEC CPU 2017 Composition

SPECint 2017 Rate (Integer)
----------------------------
500.perlbench_r    Perl interpreter
502.gcc_r          C compiler
505.mcf_r          Route planning
520.omnetpp_r      Network simulation
523.xalancbmk_r    XML processing
525.x264_r         Video compression
531.deepsjeng_r    AI game playing
541.leela_r        Monte Carlo Go
548.exchange2_r    AI puzzle solving
557.xz_r           Data compression

SPECfp 2017 Rate (Floating Point)
---------------------------------
503.bwaves_r       Fluid dynamics
507.cactuBSSN_r    Physics
508.namd_r         Molecular dynamics
510.parest_r       Biomedical imaging
511.povray_r       Ray tracing
... (and more)

2017 version improvements:

  • Larger working sets (reflecting modern applications)
  • More multi-threaded workloads (rate and speed versions)
  • Removed some outdated benchmarks
  • Added AI/ML-related workloads (like leela)

Why SPEC Is More Trustworthy

1. Real Applications

These aren't synthetic code written for benchmarking. They're actually used software.

2. Strict Execution Rules

  • Must run complete workloads (no cherry-picking)
  • Must report complete environment configuration
  • Results that SPEC publishes must pass its review first

3. Composite Score from Multiple Workloads

A single workload can be specifically optimized. But optimizing a dozen different applications simultaneously requires genuine architectural improvements.
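
SPEC's overall metrics are geometric means of the per-benchmark ratios, which is what limits the payoff from tuning a single workload. The sketch below uses made-up ratios for a twelve-benchmark suite to show the effect.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical suite of 12 benchmarks, all at ratio 1.0
base = [1.0] * 12
# Same suite after doubling the performance of just one benchmark
tuned = [2.0] + [1.0] * 11

print(f"Baseline score: {geometric_mean(base):.3f}")
print(f"One benchmark doubled: {geometric_mean(tuned):.3f}")
print(f"Arithmetic mean would report: {sum(tuned) / len(tuned):.3f}")
```

Doubling one of twelve results lifts the composite by under 6%, so a large gain in the overall score has to come from improvements across most of the suite.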

SPEC's Downsides

  1. Expensive — Commercial licensing isn't cheap
  2. Time-consuming — Running the full suite can take days
  3. Complex — Requires expertise to set up and interpret correctly

For embedded systems and everyday comparisons, SPEC may be overkill.

Whetstone: The Floating-Point Veteran

Whetstone is a floating-point benchmark released in 1972—even older than Dhrystone. It measures MWIPS (Millions of Whetstone Instructions Per Second).

Why People Still Use It

  1. Historical data — Decades of data for comparison
  2. Simple — Runs in minutes
  3. Floating-point focus — If you only care about FP performance

Why You Shouldn't Use It

Same problems as Dhrystone: too old, too small, too easy to optimize.

Modern alternatives are LINPACK (for HPC rankings) or SPEC FP.

How to Use CPU Benchmarks Correctly

Rule 1: Know What You're Measuring

Each benchmark measures different things:

Benchmark    Primary Measurement             Use Case
──────────────────────────────────────────────────────────────────────
Dhrystone    Integer ops (small program)     Quick comparison, embedded
CoreMark     Integer ops (more modern)       Embedded, MCU
SPEC CPU     Real application performance    Servers, desktops
Whetstone    Floating-point (old)            Historical comparison
LINPACK      Linear algebra                  HPC

Rule 2: Don't Just Look at a Single Number

Bad:  "Chip A: 5000 CoreMark"

Good: "Chip A: 5000 CoreMark @ 1GHz
       - CPU: ARM Cortex-A72, 32KB L1-I, 32KB L1-D, 1MB L2
       - Compiler: GCC 11.2 -O3 -mcpu=cortex-a72
       - CoreMark/MHz: 5.0"

A single number hides too much information. Reports should include hardware specs, compiler version, and flags.

Rule 3: Ensure Identical Conditions When Comparing

Bad:  "Chip A (3.0 GHz): 15000 CoreMark
       Chip B (2.5 GHz): 12000 CoreMark
       Conclusion: A is faster"

Good: "Chip A: 5000 CoreMark/GHz
       Chip B: 4800 CoreMark/GHz
       Conclusion: At same frequency, A is 4% faster"

Normalize to per-MHz or per-watt for fair comparison.
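
The normalization in the "Good" example is a single division per chip. A minimal sketch using the figures above (the function and variable names are illustrative):

```python
def per_ghz(score, freq_ghz):
    """Normalize a benchmark score by clock frequency."""
    return score / freq_ghz

# (CoreMark score, clock in GHz) from the example above
chips = {"Chip A": (15000, 3.0), "Chip B": (12000, 2.5)}

normalized = {name: per_ghz(score, freq) for name, (score, freq) in chips.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.0f} CoreMark/GHz")

advantage = normalized["Chip A"] / normalized["Chip B"] - 1
print(f"Chip A per-clock advantage: {advantage:.1%}")
```

The raw scores differ by 25%, but most of that gap is clock speed; per clock, the difference is about 4%.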

Rule 4: Cross-Validate with Multiple Benchmarks

Back to my opening story—Chip A had higher DMIPS, but Chip B was faster in practice.

If we had run more benchmarks:

Chip A:
  Dhrystone: 3.1 DMIPS/MHz
  CoreMark:  3.2 CM/MHz
  Memory BW: 1.5 GB/s

Chip B:
  Dhrystone: 2.9 DMIPS/MHz
  CoreMark:  3.0 CM/MHz
  Memory BW: 3.2 GB/s      ← Big difference here!

Chip B's memory bandwidth was more than 2× that of A. Our image processing pipeline was memory-bound, so B was faster.

A single benchmark is never enough.

Rule 5: Be Careful with Cross-Architecture Comparisons

Dhrystone/CoreMark scores across different CPU architectures can't be directly compared:

Typical DMIPS/MHz and CoreMark/MHz Reference Values (vary with compiler and optimization)
──────────────────────────────────────────────────────────────────────────
Architecture          DMIPS/MHz    CoreMark/MHz
──────────────────────────────────────────────────────────────────────────
ARM Cortex-M0+        0.95         2.4
ARM Cortex-M3         1.25         3.3
ARM Cortex-M4         1.25         3.4
ARM Cortex-M7         2.14         5.0
ARM Cortex-A53        2.3          5.5
ARM Cortex-A72        4.7          8.0
──────────────────────────────────────────────────────────────────────────
RISC-V RV32IMC        1.2-1.8      2.5-3.5
SiFive E31 (RV32IMAC) 1.61         3.1
SiFive E76 (RV32IMAFC) 2.36        4.5
SiFive U74 (RV64GC)   2.5          5.0
──────────────────────────────────────────────────────────────────────────
x86 Skylake           ~5.0         ~8.0
x86 Zen 3             ~5.5         ~9.0
──────────────────────────────────────────────────────────────────────────

Note: These numbers are highly dependent on compiler version, optimization level, and ISA extensions. The same RISC-V core can vary 30% across different compilers.

Practical Tips for Running Benchmarks

Setting Up the Environment

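# 1. Lock CPU frequency (pin min and max to the same sustainable value)
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz

# 2. Disable turbo (the path depends on the cpufreq driver)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost          # acpi-cpufreq
# echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo  # intel_pstate

# 3. Pin to a CPU and raise priority in a single invocation
sudo nice -n -20 taskset -c 0 ./coremark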

Record Complete Environment

## Benchmark Environment

- **CPU**: Intel Core i7-10700 @ 2.9 GHz (locked)
- **Memory**: 32GB DDR4-3200
- **OS**: Ubuntu 22.04, kernel 5.15.0
- **Compiler**: GCC 11.2.0
- **Flags**: -O3 -march=native
- **CoreMark Version**: 1.01
- **Iterations**: 250000 (runtime ~10 seconds)
- **Date**: 2024-01-15

Run Multiple Times, Report Statistics

CoreMark Results (10 runs):
  Mean:   24567.3 iterations/sec
  StdDev: 123.4 (0.5%)
  Min:    24312
  Max:    24789

Back to That Image Processing Project

When we discovered Chip A had higher Dhrystone scores but worse real performance, I learned an important lesson:

Benchmarks are tools, not answers.

We ultimately chose Chip B because our application was memory-bound. If our application had been compute-bound, we might have chosen Chip A.

The correct approach is:

  1. First understand your workload characteristics (CPU-bound? Memory-bound? I/O-bound?)
  2. Choose appropriate benchmarks to evaluate
  3. Cross-validate with multiple benchmarks
  4. Finally, test on your actual application

No benchmark can replace testing on your actual application.

Summary

CPU benchmarks are tools for evaluating processor performance, but each has limitations:

Dhrystone

  • Pros: Fast, universal, lots of historical data
  • Cons: Too small, easily distorted by compiler optimizations, doesn't represent modern workloads
  • Use for: Quick embedded system comparisons

CoreMark

  • Pros: More modern than Dhrystone, anti-cheat design
  • Cons: Still synthetic, still small
  • Use for: Embedded systems, MCU evaluation

SPEC CPU

  • Pros: Real applications, strict rules, industry standard
  • Cons: Expensive, time-consuming, complex
  • Use for: Formal server/desktop system evaluation

Correct Usage

  • Know what each benchmark measures
  • Don't just look at a single number
  • Ensure identical comparison conditions
  • Cross-validate with multiple benchmarks
  • Ultimately test on your actual application

Chapter 6: Memory Benchmarks

Part II: Tools


"Memory is the new disk, and disk is the new tape." — Jim Gray

The Afternoon When O(1) Was Slower Than O(n)

"This can't be right."

I stared at the numbers on my screen, making sure I wasn't misreading them. Our hash table lookup (theoretically O(1)) was slower than linear search (O(n)). On an array of just 64 elements.

This was a few years ago. I was optimizing a lookup table in an embedded system. The original implementation was simple linear search; I "improved" it to a hash table, expecting massive performance gains.

The results:

Linear search (64 elements): 180 cycles
Hash table lookup:           340 cycles

The hash table was almost twice as slow.

I was confused at the time. Later I learned to use memory benchmarks to understand this phenomenon. The problem wasn't the algorithm—it was the memory access pattern.

Memory Is the Bottleneck in Modern Systems

Let's look at some numbers. Here's a typical modern processor's memory hierarchy:

Level        Size        Latency          Bandwidth
────────────────────────────────────────────────────
Register     ~1 KB       0 cycles         N/A
L1 Cache     32-64 KB    3-4 cycles       ~1 TB/s
L2 Cache     256-512 KB  10-14 cycles     ~500 GB/s
L3 Cache     8-32 MB     30-50 cycles     ~200 GB/s
DRAM         16-128 GB   100-300 cycles   ~50 GB/s
────────────────────────────────────────────────────

From L1 cache to DRAM, latency differs by up to roughly 100×. This means:

  • A single cache miss that goes all the way to DRAM can cost 100 or more CPU cycles
  • In those cycles, a superscalar CPU could otherwise have executed hundreds of instructions
  • If your algorithm has poor memory access patterns, the CPU spends most of its time waiting for memory

Back to my hash table problem. Linear search operates on 64 contiguous elements, all in L1 cache. Hash table access patterns are random, so each lookup may miss the cache.

Linear search: 64 × 3 cycles (L1 hit) = 192 cycles
Hash table:    1 hash + 1-2 random accesses × ~150 cycles = 150-300+ cycles

This is why understanding memory performance is so important.

LMbench: The Classic Memory Benchmark

LMbench is a benchmark suite developed by Larry McVoy in the 1990s. Though old, the fundamental concepts it measures remain important.

Memory Latency Measurement

LMbench uses pointer chasing to measure memory latency:

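// Pointer chasing: each node points to the next (random location)
#define STRIDE 64  // one node per cache line (64-byte lines assumed)

struct node {
    struct node *next;
    char padding[STRIDE - sizeof(void*)];
};

// Walk the chain; timing this loop and dividing by count gives latency.
// Returning p keeps the compiler from deleting the loop as dead code.
struct node *measure_latency(struct node *head, int count) {
    struct node *p = head;
    for (int i = 0; i < count; i++) {
        p = p->next;  // Must wait for previous access to complete
    }
    return p;
}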

The key to this technique: each memory access depends on the previous result, so the CPU cannot prefetch or parallelize. This measures true latency, not bandwidth.

Memory Bandwidth Measurement

Bandwidth is measured differently:

// Sequential read - measures bandwidth
void measure_bandwidth(char *buffer, size_t size) {
    volatile int sum = 0;
    for (size_t i = 0; i < size; i += 64) {  // Each cache line
        sum += buffer[i];
    }
}

Here accesses are sequential; the CPU can prefetch. We're measuring how fast the system can "feed" the CPU.

STREAM Benchmark

STREAM is a memory bandwidth benchmark developed by John McCalpin, specifically measuring sustained memory bandwidth.

Four Core Tests

// STREAM's four operations
// Assume a, b, c are large arrays, scalar is a constant

// 1. Copy: c = a
for (int i = 0; i < N; i++)
    c[i] = a[i];

// 2. Scale: b = scalar * c
for (int i = 0; i < N; i++)
    b[i] = scalar * c[i];

// 3. Add: c = a + b
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

// 4. Triad: a = b + scalar * c
for (int i = 0; i < N; i++)
    a[i] = b[i] + scalar * c[i];

Why These Four?

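Each operation stresses a different mix of reads, writes, and arithmetic (byte counts assume 8-byte double elements):

Operation   Reads   Writes   Bytes/Element
─────────────────────────────────────────────────
Copy        1       1        16 (read 8 + write 8)
Scale       1       1        16
Add         2       1        24
Triad       2       1        24
─────────────────────────────────────────────────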
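
The bytes-per-element figures are what turn a measured time into a bandwidth number. As a rough sketch (the function names stream_triad and stream_gb_per_s are hypothetical, and timing the kernel is left to the caller), the arithmetic looks like this:

```c
#include <assert.h>
#include <stddef.h>

/* Triad kernel: a = b + scalar * c (2 reads + 1 write per element). */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bandwidth in GB/s (1 GB = 1e9 bytes), given the bytes moved per
   element and the measured wall-clock time of one pass. */
double stream_gb_per_s(size_t n, size_t bytes_per_element, double seconds) {
    assert(seconds > 0.0);
    return (double)n * (double)bytes_per_element / seconds / 1e9;
}
```

For example, a Triad pass over 10 million elements that takes 10 ms moves 240 MB, which works out to 24 GB/s.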

Pointer Chasing: Measuring True Latency

STREAM measures bandwidth, but sometimes you need to know latency. Pointer chasing is the standard method.

Basic Principle

// Create a randomly linked array that forms one cycle through every slot
void setup_pointer_chase(void **array, size_t count) {
    // Start with every slot pointing at itself
    for (size_t i = 0; i < count; i++) {
        array[i] = &array[i];
    }

    // Sattolo's shuffle: like Fisher-Yates, but with j < i, which
    // guarantees a single cycle. A plain Fisher-Yates shuffle can split
    // the chain into short loops that stay in cache.
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % i;
        void *temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }
}

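// Measure (x86: __rdtsc() comes from <x86intrin.h> on GCC/Clang)
#include <stdint.h>
#include <x86intrin.h>

uint64_t measure_latency(void **array, size_t iterations) {
    void **p = array;
    uint64_t start = __rdtsc();

    for (size_t i = 0; i < iterations; i++) {
        p = (void **)*p;  // Depends on previous result
    }

    uint64_t end = __rdtsc();

    // Prevent optimization: the pointer itself must be volatile
    void *volatile sink = p;
    (void)sink;

    // The TSC ticks at a fixed reference rate, so this equals core
    // cycles only when the core clock is locked to that rate
    return (end - start) / iterations;
}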

Why Shuffle?

Without shuffling, the access pattern is sequential, and the CPU prefetcher will preload the next location. After shuffling, access is random—each is a true memory access.
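
One property is worth checking before trusting the numbers: the walk should visit every slot before it returns to its start. A shuffle of next-pointers can leave the chain split into several shorter loops, and a short loop may fit in a cache level much smaller than the buffer. The helper below is a sketch of such a check; the name chain_cycle_length is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of hops needed to return to array[0] when following the chain.
   Returns 0 if the walk has not come back within `count` hops, which
   means slot 0 is not on a cycle at all. */
size_t chain_cycle_length(void **array, size_t count) {
    assert(count > 0);
    void **p = array;
    for (size_t hops = 1; hops <= count; hops++) {
        p = (void **)*p;
        if (p == array)
            return hops;
    }
    return 0;
}
```

A result equal to count means the chain covers the whole buffer; anything smaller means the measurement would only exercise part of the working set.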

Measuring Different Working Set Sizes

// Measure L1, L2, L3, DRAM latency
size_t sizes[] = {
    8 * 1024,        // 8 KB - should be in L1
    64 * 1024,       // 64 KB - probably in L2
    512 * 1024,      // 512 KB - should be in L2/L3
    4 * 1024 * 1024, // 4 MB - should be in L3
    64 * 1024 * 1024 // 64 MB - should be in DRAM
};

for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
    size_t count = sizes[i] / sizeof(void*);
    void **array = malloc(sizes[i]);
    setup_pointer_chase(array, count);

    uint64_t latency = measure_latency(array, 10000000);
    // PRIu64 (from <inttypes.h>) is the portable format for uint64_t
    printf("%8zu KB: %3" PRIu64 " cycles\n", sizes[i] / 1024, latency);

    free(array);
}

Typical results:

Working Set    Latency
      8 KB:    4 cycles   (L1)
     64 KB:   12 cycles   (L2)
    512 KB:   35 cycles   (L3)
   4096 KB:   45 cycles   (L3)
  65536 KB:  150 cycles   (DRAM)

These numbers trace out the memory hierarchy: latency steps up each time the working set outgrows a cache level.

TLB Effects

Memory access isn't only affected by cache—there's also the TLB (Translation Lookaside Buffer).

What Is the TLB

Modern systems use virtual memory, so every memory access requires a virtual-to-physical address translation. The TLB is a small cache of recent page table translations:

Virtual Address → TLB lookup → Physical Address → Cache/Memory
                     ↓
                 TLB miss?
                     ↓
              Page table walk
              (slow, may require multiple memory accesses)

Measuring TLB Effects

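// Access once per page
#define PAGE_SIZE 4096
#define LINE_SIZE 64

void measure_tlb(char *buffer, size_t pages) {
    volatile int sum = 0;
    for (size_t i = 0; i < pages; i++) {
        // One access per page. The rotating line offset keeps the
        // accesses from all landing in the same cache set, which
        // would add conflict misses on top of the TLB misses.
        size_t offset = (i % (PAGE_SIZE / LINE_SIZE)) * LINE_SIZE;
        sum += buffer[i * PAGE_SIZE + offset];
    }
}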

If pages exceeds the number of TLB entries, you'll start seeing TLB miss costs.

Typical results:

Pages accessed    Latency per access
         16:      4 cycles    (TLB hit)
         64:      4 cycles    (TLB hit)
        256:      5 cycles    (TLB misses starting)
       1024:     25 cycles    (heavy TLB misses)
       4096:     35 cycles    (TLB thrashing)

Huge Pages

One solution to TLB problems is using huge pages (2MB or 1GB):

# Check huge pages
grep Huge /proc/meminfo

# Reserve 100 huge pages
echo 100 | sudo tee /proc/sys/vm/nr_hugepages

// Use in program
#include <sys/mman.h>
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (ptr == MAP_FAILED) { perror("mmap"); }  // Fails if no huge pages are reserved

With 2MB huge pages, you need only 1/512 the TLB entries to cover the same memory range.
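
The 1/512 figure is simply the ratio of the two page sizes (2 MB / 4 KB). A small sketch of the arithmetic follows; the helper names are hypothetical, and the 64-entry TLB used in the example is an assumed, illustrative size rather than a figure for any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Memory covered by the TLB without a miss ("TLB reach"), in bytes. */
size_t tlb_reach(size_t entries, size_t page_size) {
    return entries * page_size;
}

/* TLB entries needed to map `bytes` of memory, rounding up. */
size_t tlb_entries_needed(size_t bytes, size_t page_size) {
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With 64 entries, tlb_reach gives 256 KB for 4 KB pages and 128 MB for 2 MB pages. Going the other way, mapping 1 GB takes 262,144 entries with 4 KB pages but only 512 with 2 MB pages.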

Common Pitfalls

Pitfall 1: Array Too Small

If your test array fits entirely in L1 cache, you're not measuring memory performance:

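// Bad: only measures L1 cache
char buffer[4096];
measure_bandwidth(buffer, 4096);

// Good: ensure it exceeds cache size
size_t size = 100 * 1024 * 1024;  // 100 MB
char *buffer = malloc(size);
memset(buffer, 1, size);  // Touch every page first; untouched pages may all map to one shared zero page
measure_bandwidth(buffer, size);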

Pitfall 2: Compiler Optimization

The compiler might optimize away your memory accesses:

// Bad: compiler might optimize away
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Good: use volatile to prevent optimization
volatile int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}

Pitfall 3: Not Considering Prefetcher

Modern CPU prefetchers are smart. Sequential access gets prefetched, showing lower latency:

// This gets prefetched, not true memory latency
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Use pointer chasing to measure true latency
p = *p;  // Must wait for previous access to complete

Pitfall 4: NUMA Effects

On multi-socket systems, memory location matters:

# Check NUMA topology
numactl --hardware

# Bind to specific node
numactl --cpunodebind=0 --membind=0 ./benchmark

Back to That Hash Table

Now let's re-analyze my hash table problem with this knowledge.

Linear search (64 elements):

  • Array size: 64 × 8 bytes = 512 bytes
  • Entirely in L1 cache
  • Sequential access, prefetcher effective
  • Each comparison: ~3 cycles
  • Average 32 comparisons (half the array): 32 × 3 = 96 cycles
  • Plus loop and branch overhead: ~180 cycles ✓

Hash table lookup:

  • Hash computation: ~20 cycles
  • Bucket access: possible cache miss, ~100 cycles
  • If collision, another random access
  • Total: ~200-400 cycles ✓

Lesson: On small datasets, cache-friendly simple algorithms are often faster than "clever" algorithms.

Summary

Memory performance is often the bottleneck in modern systems:

Memory Hierarchy

  • L1 → L2 → L3 → DRAM, latency can differ by 100×
  • Understanding which level your working set is in matters

Benchmark Tools

  • LMbench: Classic latency and bandwidth measurement
  • STREAM: Standard bandwidth benchmark
  • Pointer chasing: Measures true random access latency

Measurement Techniques

  • Use large enough arrays to avoid measuring only cache
  • Use volatile to prevent compiler optimization
  • Use pointer chasing to measure latency
  • Consider TLB and NUMA effects

Practical Advice

  • On small datasets, cache locality matters more than Big-O
  • Random access is 10-100× slower than sequential access
  • Profile first, then optimize—don't guess

Chapter 7: System-Level Benchmarks

Part II: Tools


"The best benchmark is the actual workload." — Anonymous sysadmin

The "New Server Is Slower" Ticket

"This new server has problems. It's slower than the old one."

That was the trouble ticket I received. A freshly racked server—faster CPU, more memory, newer SSD—but users complained it "felt slower."

"Felt" is a hard thing to debug.

I ran CPU benchmarks—new server was 40% faster. Memory benchmarks—30% faster. Disk I/O—3× faster. Every individual test showed the new server was better.

But users insisted: "It's just slower."

Finally I found the problem: network latency. The new server was in a different rack, adding 2ms latency to the database. For this database-heavy application, each request accessed the database dozens of times. The accumulated latency was noticeable.

This is why we need system-level benchmarks—measuring CPU, memory, and disk separately isn't enough. We need to measure how the entire system works together.

Micro-benchmarks vs System-level Benchmarks

Let's clarify the difference between these two types:

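Type              Measures           Examples
──────────────────────────────────────────────────────────────────
Micro-benchmark   Single component   CoreMark (CPU), STREAM (memory)
System-level      Entire system      UnixBench, Sysbench, Phoronix
──────────────────────────────────────────────────────────────────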

The problem with micro-benchmarks:

CPU score:     100 points
Memory score:  100 points
Disk score:    100 points
─────────────────────────
System score:  ???

(Not 300—might be 50)

System performance isn't a simple sum of component performance. The bottleneck could be anywhere—CPU, memory, disk, network, or even the OS kernel.

UnixBench: The Classic System Benchmark

UnixBench is one of the oldest system benchmarks, first released in 1984. Though dated, the fundamental concepts it measures remain important.

UnixBench Test Items

Dhrystone            CPU integer operations
Whetstone            CPU floating-point operations
Execl Throughput     Process creation (execl)
File Copy            Disk I/O
Pipe Throughput      IPC performance
Pipe-based Switching Context switch
Process Creation     fork() performance
Shell Scripts        Shell script execution
System Call          syscall overhead

Running UnixBench

# Download and compile
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench
make

# Run (single-threaded and multi-threaded)
./Run

Typical Output

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: myserver
   OS: GNU/Linux -- 5.15.0-generic -- #1 SMP
   Machine: x86_64 (x86_64)
   CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

------------------------------------------------------------------------
Benchmark Run: Wed Dec 18 2024 10:00:00

1 parallel copy of tests:

Dhrystone 2 using register variables    45000000.0 lps   (10.0 s, 7 samples)
Double-Precision Whetstone               8500.0 MWIPS (10.0 s, 7 samples)
Execl Throughput                         5000.0 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks  900000.0 KBps  (30.0 s, 2 samples)
...

System Benchmarks Index Values:

                                          BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0   45000000.0   3856.0
Double-Precision Whetstone                    55.0       8500.0   1545.5
...
                                                       ========
System Benchmarks Index Score:                           2150.3

UnixBench Limitations

  1. Too old — Many tests designed in the 1980-90s
  2. Doesn't represent modern workloads — No web server, database tests
  3. Controversial index calculation — Geometric mean may hide individual weaknesses
  4. Single-machine only — Doesn't test network performance

Sysbench: A More Modern Choice

Sysbench is a more modern benchmark tool, particularly suited for testing database servers.

Sysbench Test Types

# CPU test
sysbench cpu --cpu-max-prime=20000 run

# Memory test
sysbench memory --memory-block-size=1K --memory-total-size=10G run

# Disk I/O test
sysbench fileio --file-total-size=10G prepare
sysbench fileio --file-total-size=10G --file-test-mode=rndrw run
sysbench fileio --file-total-size=10G cleanup

# MySQL test
sysbench oltp_read_write --mysql-host=localhost --mysql-user=root \
    --mysql-db=test --tables=10 --table-size=100000 prepare
sysbench oltp_read_write --mysql-host=localhost --mysql-user=root \
    --mysql-db=test --tables=10 --table-size=100000 --threads=16 run

Sysbench CPU Test Analysis

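sysbench cpu --cpu-max-prime=20000 --threads=4 run

Each event checks every integer up to --cpu-max-prime for primality by trial division, so the test exercises integer arithmetic almost exclusively and barely touches memory. The headline result is events per second; compare runs only when --cpu-max-prime and --threads match.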

Sysbench Advantages

  1. Database-ready — Built-in MySQL/PostgreSQL support
  2. Scriptable — Can write custom Lua scripts
  3. Modern metrics — Provides latency percentiles
  4. Active maintenance — Continuously updated

Phoronix Test Suite: The Most Comprehensive Option

Phoronix Test Suite (PTS) is currently the most comprehensive open-source benchmark suite, containing hundreds of tests.

Installing Phoronix Test Suite

# Ubuntu/Debian
sudo apt install phoronix-test-suite

# Or download from official site
wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
tar xvf phoronix-test-suite-10.8.4.tar.gz
cd phoronix-test-suite
sudo ./install-sh

Common Commands

# List all available tests
phoronix-test-suite list-available-tests

# Install a test
phoronix-test-suite install pts/compress-7zip

# Run a single test
phoronix-test-suite run pts/compress-7zip

# Run a test suite
phoronix-test-suite run pts/disk

# Compare two results
phoronix-test-suite merge-results result1 result2

Common Test Suites

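Suite             Contents
───────────────────────────────────────────────────
pts/disk          Disk I/O (fio, iozone, bonnie++)
pts/cpu           CPU (compress, encode, compile)
pts/memory        Memory bandwidth and latency
pts/network       Network throughput
pts/compilation   Kernel compile, GCC compile
───────────────────────────────────────────────────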

Example Run

$ phoronix-test-suite run pts/compress-7zip

    Phoronix Test Suite v10.8.4

    7-Zip Compression 16.02

    Test: Compression Rating

    Processor: Intel Core i7-10700 @ 4.80GHz (8 Cores / 16 Threads)
    Memory: 32GB DDR4-3200
    OS: Ubuntu 22.04

    Compression Rating:
        58234 MIPS

    Decompression Rating:
        71823 MIPS

OpenBenchmarking.org Integration

Phoronix's unique feature is integration with OpenBenchmarking.org, where you can:

  1. Upload results — Share to the cloud
  2. Compare results — Compare with other users' results
  3. Track history — Observe performance trends over time

# Upload results
phoronix-test-suite upload-result my-result

# Compare results to baseline
phoronix-test-suite compare-results-to-baseline my-result baseline-id

Choosing the Right System Benchmark

Different scenarios call for different tools:

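Scenario                      Recommended Tool                Reason
──────────────────────────────────────────────────────────────────────────────────────────────
Quick system health check     UnixBench                       Simple, fast, covers basics
Database server               Sysbench                        Dedicated OLTP tests
Comprehensive analysis        Phoronix                        Hundreds of tests, customizable
CI/CD automation              Sysbench + custom scripts       Scriptable, easy to integrate
Hardware purchase decisions   Phoronix + public comparisons   Lots of public data
──────────────────────────────────────────────────────────────────────────────────────────────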

Practical Advice

1. Define "Performance" First

Before running benchmarks, ask yourself:

  • For this system, what is "performance"?
  • Do users care about throughput or latency?
  • Which component is most likely the bottleneck?

2. Tests Should Simulate Real Usage

# Bad: test CPU alone
sysbench cpu run

# Better: simulate real database workload
sysbench oltp_read_write --tables=10 --table-size=1000000 \
    --threads=32 --time=300 run

3. Run Multiple Times, Report Statistics

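# Run 5 times, take median
rm -f results.txt
for i in {1..5}; do
    sysbench cpu run | awk '/events per second/ {print $4}' >> results.txt
done
sort -n results.txt | sed -n '3p'   # 3rd of 5 sorted values = median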

4. Record Complete Environment

# Record environment before benchmark
echo "=== System Info ===" > env.txt
uname -a >> env.txt
cat /proc/cpuinfo | grep "model name" | head -1 >> env.txt
free -h >> env.txt
df -h >> env.txt

Back to That Ticket

Now you know why individual CPU/memory/disk benchmarks didn't find the problem. The real bottleneck was network latency, which wasn't in the tests I ran.

If I had used Sysbench's OLTP test to directly measure "the complete path from application to database," I would have found the problem.

Lesson: Choose benchmarks that represent your real workload.

Summary

System-level benchmarks measure overall system performance, not individual components:

Tool Selection

  • UnixBench: Classic but dated, good for quick checks
  • Sysbench: Modern, scriptable, good for database workloads
  • Phoronix: Most comprehensive, good for deep analysis

Best Practices

  • Define what "performance" means first
  • Choose tests that represent real workloads
  • Run multiple times, report statistics
  • Record complete environment information

Common Pitfalls

  • Testing components individually, ignoring system integration
  • Using tests that don't represent real workloads
  • Looking at only one metric, ignoring latency distribution

Chapter 8: Profiling Tools

Part II: Tools


"Premature optimization is the root of all evil." — Donald Knuth

"But you can't optimize what you don't measure." — Also Donald Knuth (paraphrased)

The Three Weeks I Optimized the Wrong Thing

I once spent three weeks optimizing a function.

It was an image processing pipeline that ran too slowly. I looked at the code and decided the bottleneck must be in the pixel processing loop—after all, it had millions of iterations.

So I started optimizing: SIMD vectorization, loop unrolling, cache blocking... Three weeks later, that loop was 5× faster.

Overall performance improvement? 3%.

Because the real bottleneck wasn't there. It was I/O. File reading time. The part I never looked at.

If I had run a profiler first, I would have found the real bottleneck in five minutes. Three weeks of work could have been avoided with five minutes of profiling.

This is the value of profiling.

Basic Profiling Concepts

Sampling vs Instrumentation

Profilers use two main approaches:

Sampling

Every N milliseconds, pause the program, record current location
     ↓
Count which functions appear most often
     ↓
More appearances = more execution time

Pros: Low overhead, minimal effect on program behavior

Cons: Statistical approximation, may miss short functions

Instrumentation

Insert timing code at every function entry/exit
     ↓
Precisely record time for each call
     ↓
Completely accurate call counts and times

Pros: Precise

Cons: High overhead, may change program behavior
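
To make the instrumentation approach concrete, here is a hand-written sketch of what a tool would insert automatically: timing code at the entry and exit of one function. Everything here (func_stats, now_ns, sum_to, work_stats) is hypothetical and only illustrates the idea; real instrumenting profilers generate the hooks for you.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Per-function record that the instrumentation maintains. */
typedef struct {
    uint64_t calls;
    uint64_t total_ns;
} func_stats;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static func_stats work_stats;  /* stats for the one instrumented function */

/* The function under test, with timing code at entry and exit. */
uint64_t sum_to(uint64_t n) {
    uint64_t t0 = now_ns();            /* entry hook */
    uint64_t sum = 0;
    for (uint64_t i = 1; i <= n; i++)
        sum += i;
    work_stats.calls++;                /* exit hook */
    work_stats.total_ns += now_ns() - t0;
    return sum;
}
```

The call count is exact, and total_ns / calls gives the mean time per call. The two now_ns() calls are also the overhead: for a function this short they can cost more than the work being measured, which is how instrumentation ends up changing program behavior.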

Common Profilers

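Tool          Type              Platform        Features
────────────────────────────────────────────────────────────────────────────────
perf          Sampling          Linux           Most powerful, hardware PMU support
gprof         Instrumentation   Unix            Classic, but outdated
Valgrind      Simulation        Linux/macOS     Precise but very slow
VTune         Sampling          Linux/Windows   Intel official, GUI
Instruments   Sampling          macOS           Apple official
────────────────────────────────────────────────────────────────────────────────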

gprof: The Classic Instrumentation Profiler

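gprof is a product of 1980s BSD Unix, still widely used in teaching. It counts calls with compiler-inserted instrumentation and estimates time by sampling the program counter at fixed intervals. Understanding how it works, and where it falls short, helps in appreciating modern tools.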

Basic Usage

# 1. Compile with -pg flag
gcc -pg -O2 -o my_program my_program.c

# 2. Run program (generates gmon.out)
./my_program

# 3. Analyze results
gprof my_program gmon.out > analysis.txt

Reading gprof Output

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.23      1.23     1.23   100000     0.01     0.02  process_pixel
 23.45      1.87     0.64   100000     0.01     0.01  apply_filter
 12.34      2.21     0.34        1   340.00   450.00  main
  8.76      2.45     0.24  1000000     0.00     0.00  get_value

  • % time: Percentage of total run time spent in this function
  • self seconds: Time in the function itself (excluding callees)
  • calls: Number of calls
  • self ms/call: Average self time per call, in milliseconds

gprof's Serious Limitations

1. Can only profile instrumented code

# These won't appear in gprof reports:
# - System call time
# - Dynamic libraries (unless also compiled with -pg)
# - I/O wait time

2. Instrumentation changes performance characteristics

3. Sampling resolution too low — 10ms intervals

4. Poor multi-threading support

5. Can't see inlined functions

gprof vs perf

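Aspect             gprof             perf
──────────────────────────────────────────────────────
Mechanism          Instrumentation   Sampling (PMU)
Overhead           High (10-30%)     Low (<5%)
System calls       Invisible         Visible
Shared libraries   Need recompile    Direct support
Multi-threading    Poor              Good
──────────────────────────────────────────────────────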

Conclusion: Unless there's a special reason, prefer perf on modern Linux.

perf: Linux Performance Analysis Powerhouse

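perf is the performance analysis tool maintained in the Linux kernel source tree. It is built on the kernel's perf_events interface and can read hardware performance counters (PMU) directly.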

Basic Usage

# Record performance data
perf record ./my_program

# View report
perf report

perf record + perf report

# Record (-g means record call stack)
perf record -g ./my_program

# Interactive report
perf report

Overhead  Command      Shared Object        Symbol
  45.23%  my_program   my_program           [.] process_pixel
  23.45%  my_program   libc.so.6            [.] memcpy
  12.34%  my_program   my_program           [.] read_file
   8.76%  my_program   libm.so.6            [.] exp
   ...

Now you know process_pixel takes 45% of the time. That's where to optimize.

Common perf Events

# CPU cycles
perf stat -e cycles,instructions ./program

# Cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Branch prediction
perf stat -e branch-misses,branches ./program

# List all available events
perf list

perf Advanced: annotate

# See source-level hotspots
perf annotate process_pixel

       │    void process_pixel(int* data, int n) {
       │        for (int i = 0; i < n; i++) {
 45.23 │            data[i] = expensive_calc(data[i]);
       │        }
       │    }

Now you know the expensive_calc line is slowest.

Flame Graphs: Visualizing Call Stacks

Flame Graph is a visualization method invented by Brendan Gregg that lets you see program hotspots at a glance.

What Is a Flame Graph

┌─────────────────────────────────────────────────────────────┐
│                          main                                │
├─────────────────────────────┬───────────────────────────────┤
│       process_image         │          read_file            │
├─────────────┬───────────────┤                               │
│ process_row │  apply_filter │                               │
├─────────────┴───────────────┤                               │
│        process_pixel        │                               │
└─────────────────────────────┴───────────────────────────────┘

Width = time proportion
Height = call stack depth
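
  • X-axis: Not time order; frames are sorted alphabetically by function name
  • Y-axis: Call stack depth. The sketches in this chapter draw main at the top; the SVG that flamegraph.pl generates puts main at the bottom, with callees stacked above their callers
  • Width: CPU time proportion of function (including children)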

Generating Flame Graphs

# 1. Record with perf
perf record -g ./my_program

# 2. Generate readable stack trace
perf script > out.perf

# 3. Convert with FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# 4. Open SVG in browser
firefox flamegraph.svg

Reading Flame Graphs

  1. Find the widest "plateaus" — These are where most time is spent
  2. Read bottom to top — Bottom is caller, top is callee
  3. Click to zoom in — SVG is interactive

Common Patterns

Pattern 1: Single Hotspot

┌──────────────────────────────────────────┐
│                  main                     │
├──────────────────────────────────────────┤
│              hot_function (90%)           │
└──────────────────────────────────────────┘

→ Optimize hot_function

Pattern 2: Wide Base

┌──────────────────────────────────────────┐
│                  main                     │
├────┬────┬────┬────┬────┬────┬────┬───────┤
│ f1 │ f2 │ f3 │ f4 │ f5 │ f6 │ f7 │ ...   │
└────┴────┴────┴────┴────┴────┴────┴───────┘

→ No single hotspot; need overall optimization or algorithm change

Pattern 3: Deep Narrow Stack

┌───┐
│ a │
├───┤
│ b │
├───┤
│ c │
├───┤
│ d │ ← Actual work here
└───┘

→ Consider reducing call stack depth or inlining

Valgrind: Memory and Cache Analysis

Valgrind is an instrumentation framework. The most commonly used tools are Memcheck (memory error detection) and Cachegrind (cache analysis).

Cachegrind: Cache Behavior Analysis

valgrind --tool=cachegrind ./my_program

==12345== Cachegrind, a cache and branch-prediction profiler
==12345==
==12345== I   refs:      1,234,567,890
==12345== I1  misses:          123,456
==12345== LLi misses:           12,345
==12345== I1  miss rate:          0.01%
==12345==
==12345== D   refs:        456,789,012  (234,567,890 rd + 222,221,122 wr)
==12345== D1  misses:       12,345,678  (  8,765,432 rd +   3,580,246 wr)
==12345== LLd misses:        1,234,567  (    876,543 rd +     358,024 wr)
==12345== D1  miss rate:           2.7% (        3.7%   +         1.6%)

  • I refs: Instruction fetches
  • D refs: Data reads/writes
  • D1 misses: L1 data cache misses
  • LLd misses: Last-level cache data misses (these go to DRAM)

Callgrind: Call Graph Analysis

valgrind --tool=callgrind ./my_program

# Visualize
kcachegrind callgrind.out.12345

KCachegrind provides interactive call graph visualization showing:

  • Inclusive/exclusive time per function
  • Call graph
  • Source code annotation

Valgrind Limitations

  1. Very slow — Usually 10-50× slower
  2. Not real execution — Runs on virtual CPU
  3. No GPU support — Only analyzes CPU code

Practical Workflow

Step 1: Quick Look with perf stat

perf stat ./my_program

Check if IPC, cache misses, branch misses are abnormal.

Step 2: Find Hotspots with perf record

perf record -g ./my_program
perf report

Find the hottest functions.

Step 3: Visualize with Flame Graph

perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg

See the overall structure at a glance.

Step 4: Deep Analysis

Choose tools based on problem type:

  • Cache issues → Cachegrind or VTune Memory Access
  • Branch issues → perf branch-misses events
  • Multi-threading issues → VTune Threading

Step 5: Profile Again After Optimization

# Before
perf stat ./my_program_v1

# After
perf stat ./my_program_v2

# Compare elapsed time, cycles, and instructions between the two outputs

Confirm optimization was effective.

Common Pitfalls

Pitfall 1: Debug Build Overhead

# Bad: debug build
gcc -g -O0 program.c
perf record ./a.out  # Results don't represent production

# Good: release build with debug info
gcc -g -O3 program.c
perf record ./a.out  # Closer to real performance

Pitfall 2: Compiler Optimization Changes Behavior

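Optimized builds often omit the frame pointer, which leaves perf record -g with truncated or missing call stacks. Compile with -fno-omit-frame-pointer to preserve the frame pointer and get more accurate call stacks.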

Pitfall 3: Short Execution Time Noise

# Bad: too short, too much noise
perf stat ./quick_program  # 0.01 seconds

# Good: run long enough
perf stat ./quick_program --iterations=10000  # 10 seconds

Back to That Three-Week Story

If I had done this:

$ perf record -g ./image_pipeline
$ perf report

Overhead  Symbol
  65.23%  read_file      ← This was the bottleneck!
  23.45%  write_file
   8.76%  process_pixel  ← Where I spent three weeks
   ...

Five minutes would have revealed I/O was the bottleneck. Three weeks of optimization could have been spent on the right thing.
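
Amdahl's law puts a number on this. The sketch below is a direct transcription of the formula; the function name amdahl_speedup is hypothetical, and the percentages plugged in afterwards are the illustrative ones from the report above.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `p` of total runtime
   (0 <= p <= 1) is made `s` times faster and the rest is unchanged. */
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Making process_pixel (8.76% of runtime) 5× faster gives amdahl_speedup(0.0876, 5.0), about 1.075×. Merely halving the time in read_file (65.23%) gives amdahl_speedup(0.6523, 2.0), about 1.48×. The share of runtime matters far more than how impressive the local speedup is.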

Summary

Profiling is the first step in performance optimization:

Tool Selection

  • perf: Linux first choice, free and powerful
  • Flame Graph: Visualize call stacks
  • Valgrind/Cachegrind: Cache behavior analysis
  • VTune: Intel CPU deep analysis
  • Instruments: macOS development

Workflow

  1. perf stat for quick overview
  2. perf record + perf report to find hotspots
  3. Flame Graph for visualization
  4. Deep analysis (cache/branch/threading)
  5. Profile again after optimization

Core Principles

  • Profile first, then optimize
  • Don't guess bottlenecks before profiling
  • Profile with release builds
  • Ensure execution time is long enough

Chapter 9: Embedded & RTOS Benchmarks

Part II: Tools


"In embedded systems, the worst case is the only case that matters." — Jack Ganssle

The "Average 1ms, But Sometimes 100ms" Disaster

"Average latency 1ms, fully meets spec."

That was the vendor's benchmark report. We were using this MCU for motor control, with a requirement to update PWM output every 1ms. Average 1ms? Perfect.

After the system went live, the motor started stuttering. Not every time—just "occasionally."

We spent three days debugging. Finally we discovered: behind that "average 1ms," there was a 0.1% chance of jumping to 50-100ms. In typical benchmark reports, these outliers get averaged away and become invisible.

But for motor control at 1,000 updates per second, a 0.1% outlier rate means a 50-100ms stall about once per second: visible stuttering.

This is the fundamental difference between embedded/RTOS benchmarking and GPOS benchmarking: we care about worst case, not average case.

GPOS vs RTOS vs Bare-metal

Let's clarify the differences between these three environments:

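Feature            GPOS                             RTOS                                 Bare-metal
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Examples           Linux, Windows, macOS            FreeRTOS, Zephyr, RT-Linux           Running directly on hardware
Scheduling         Time-slicing, variable priority  Fixed priority, preemptive           None (or super loop)
Memory             Virtual memory, paging           Usually flat memory                  Flat memory
Interrupt latency  Not guaranteed (may be ms)       Guaranteed upper bound (usually μs)  Minimal (cycles)
Jitter             High (background processes)      Low (deterministic)                  Lowest
Tool support       Rich (perf, VTune)               Medium (trace, SEGGER)               Basic (GPIO toggle)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────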

Why This Matters

On a GPOS, most applications can tolerate an operation that "usually" takes 1ms but "occasionally" takes 10ms.

On RTOS/bare-metal:

  • Motor control: 100ms delay = motor loses control
  • Automotive ABS: 10ms delay = brake failure
  • Medical devices: delay = potentially fatal

RTOS benchmarks must report worst-case, not just average.

Time Measurement: What If There's No OS?

On GPOS, we use clock_gettime() or rdtsc. On bare-metal, these APIs don't exist.

ARM Cortex-M: DWT Cycle Counter

The Data Watchpoint and Trace (DWT) unit provides a cycle counter:

// Enable DWT cycle counter (need to enable trace first)
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

// Read cycle count
static inline uint32_t get_cycles(void) {
    return DWT->CYCCNT;
}

// Usage
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
uint32_t elapsed = end - start;  // cycles

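Note: DWT->CYCCNT is a 32-bit counter, so it wraps quickly on fast MCUs (about 25.6 seconds at 168MHz). Unsigned subtraction still gives the correct elapsed count across a single wrap, but any interval longer than the wrap period needs a software extension of the counter.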
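The wrap behavior is easy to check on the host. The following Python sketch mirrors the uint32_t arithmetic with a modulo; the helper names are hypothetical and the 168 MHz clock is the figure used in the note above.

```python
# Host-side sketch of 32-bit cycle-counter arithmetic.
COUNTER_BITS = 32
MODULUS = 1 << COUNTER_BITS

def wrap_period_seconds(cpu_hz):
    """Time until a free-running 32-bit counter wraps at the given clock."""
    return MODULUS / cpu_hz

def elapsed_cycles(start, end):
    """Wrap-safe difference, equivalent to uint32_t subtraction in C.
    Valid only if the counter wrapped at most once between the two reads."""
    return (end - start) % MODULUS

def cycles_to_us(cycles, cpu_hz):
    """Convert a cycle count to microseconds."""
    return cycles * 1e6 / cpu_hz

print(f"wrap period: {wrap_period_seconds(168_000_000):.1f} s")
print(f"elapsed across a wrap: {elapsed_cycles(0xFFFFFF00, 0x00000100)} cycles")
print(f"200 cycles = {cycles_to_us(200, 168_000_000):.2f} us")
```

Reporting results in both cycles and microseconds makes them comparable across parts running at different clock rates.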

RISC-V: mcycle/minstret CSRs

RISC-V has standard cycle and instruction counters:

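// Read a 64-bit machine-mode counter (requires M-mode).
// On RV32 each counter is split across two 32-bit CSRs (for example
// mcycle/mcycleh), so re-read the high half until it is stable.
#if __riscv_xlen == 32
#define READ_COUNTER64(lo_csr, hi_csr, out) do {               \
        uint32_t hi_, lo_, hi2_;                               \
        do {                                                   \
            asm volatile ("csrr %0, " #hi_csr : "=r"(hi_));    \
            asm volatile ("csrr %0, " #lo_csr : "=r"(lo_));    \
            asm volatile ("csrr %0, " #hi_csr : "=r"(hi2_));   \
        } while (hi_ != hi2_);                                 \
        (out) = ((uint64_t)hi_ << 32) | lo_;                   \
    } while (0)
#else
#define READ_COUNTER64(lo_csr, hi_csr, out) \
    asm volatile ("csrr %0, " #lo_csr : "=r"(out))
#endif

// Read cycle counter
static inline uint64_t get_mcycle(void) {
    uint64_t cycle;
    READ_COUNTER64(mcycle, mcycleh, cycle);
    return cycle;
}

// Read retired-instruction counter
static inline uint64_t get_minstret(void) {
    uint64_t instret;
    READ_COUNTER64(minstret, minstreth, instret);
    return instret;
}

// Calculate CPI
uint64_t cycles_start = get_mcycle();
uint64_t instr_start = get_minstret();

my_function();

uint64_t cycles = get_mcycle() - cycles_start;
uint64_t instrs = get_minstret() - instr_start;
double cpi = (double)cycles / instrs;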

Running on QEMU

# ARM Cortex-M3 (lm3s6965evb)
qemu-system-arm -M lm3s6965evb -nographic -kernel firmware.elf

# RISC-V (sifive_e - FE310)
qemu-system-riscv32 -M sifive_e -nographic -kernel firmware.elf

QEMU's cycle counter is "functional," not cycle-accurate. Numbers can verify program logic but don't represent real hardware cycle counts.

Porting Open-Source Benchmarks to Bare-metal

Good news: most CPU/memory benchmarks port easily:

Benchmark │ Porting Difficulty │ Dependencies │ Notes
──────────┼────────────────────┼──────────────┼──────────────────────────────
Dhrystone │ Easy               │ libc only    │ Need to remove time() calls
CoreMark  │ Easy               │ libc only    │ Official bare-metal support
Embench   │ Easy               │ None         │ Designed for embedded
Whetstone │ Easy               │ libm         │ Needs floating-point support
STREAM    │ Medium             │ None         │ Needs enough memory
lmbench   │ Hard               │ POSIX        │ Core algorithms portable

CoreMark Bare-metal Port

CoreMark officially supports bare-metal; just implement a few porting functions:

// core_portme.c - ARM Cortex-M implementation

// 1. Timing start/end
void start_time(void) {
    start_cycles = DWT->CYCCNT;
}

void stop_time(void) {
    end_cycles = DWT->CYCCNT;
}

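CORE_TICKS get_time(void) {
    // Elapsed ticks between start_time() and stop_time();
    // unsigned subtraction tolerates a single counter wrap
    return (CORE_TICKS)(end_cycles - start_cycles);
}

Compile and Run

# Cross-compile for ARM Cortex-M4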
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_arm.elf

# Cross-compile for RISC-V (RV32IMAC)
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_riscv.elf

# Run on QEMU ARM (lm3s6965evb models a Cortex-M3, so build with -mcpu=cortex-m3 for this board)
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting -kernel coremark_arm.elf

# Run on QEMU RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -kernel coremark_riscv.elf

Embench: Designed for Embedded

Embench is a modern, free and open-source embedded benchmark suite developed by a working group of academic and industry contributors:

# Download
git clone https://github.com/embench/embench-iot.git
cd embench-iot

# Build ARM version
python3 build_all.py --arch arm --chip cortex-m4 \
    --board qemu-arm

# Run (needs appropriate runner)
python3 benchmark_speed.py --target-module run_qemu

Embench includes 19 real-application kernels:

aha-mont64     Montgomery multiplication
crc32          CRC calculation
cubic          Cubic root solver
edn            FIR filter
huffbench      Huffman encoding
matmult-int    Integer matrix multiply
md5sum         MD5 hash
minver         Matrix inversion
nbody          N-body simulation
nettle-aes     AES encryption
...

RTOS Benchmarks: Measuring the OS Itself

When using an RTOS, besides application performance, you need to measure OS overhead.

Context Switch Time

// FreeRTOS context switch benchmark
static TaskHandle_t task1, task2;
static volatile uint32_t switch_start, switch_end;

void Task1(void *pvParameters) {
    for (;;) {
        switch_start = get_cycles();
        xTaskNotifyGive(task2);  // Wake Task2
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  // Wait
    }
}

void Task2(void *pvParameters) {
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        switch_end = get_cycles();

        uint32_t elapsed = switch_end - switch_start;
        // Record or accumulate elapsed

        xTaskNotifyGive(task1);
    }
}

Typical results (depends on MCU and RTOS):

RTOS            MCU              Context Switch
─────────────────────────────────────────────────
FreeRTOS        Cortex-M4@168MHz     ~200 cycles
Zephyr          Cortex-M4@168MHz     ~300 cycles
RT-Thread       Cortex-M4@168MHz     ~250 cycles

Interrupt Latency

Time from interrupt trigger to ISR execution start:

// Timestamp taken just before the interrupt is triggered
static volatile uint32_t trigger_time;

// GPIO interrupt handler (STM32)
void EXTI0_IRQHandler(void) {
    uint32_t entry_time = get_cycles();  // First line of ISR

    // Calculate latency (includes the store that raises the interrupt)
    uint32_t latency = entry_time - trigger_time;
    record_latency(latency);

    // Clear interrupt flag
    EXTI->PR = EXTI_PR_PR0;
}

// Trigger in main program
trigger_time = get_cycles();
// Trigger via software or external GPIO
EXTI->SWIER = EXTI_SWIER_SWIER0;

Important: Measure multiple times, report distribution!

Interrupt Latency Distribution (10000 samples):
  Min:    12 cycles
  Max:    89 cycles
  Avg:    15 cycles
  P99:    45 cycles
  P99.9:  78 cycles

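Design against the tail, not the average: the P99.9 of 78 cycles and the observed maximum of 89 cycles are the numbers that matter here, not the 15-cycle average.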
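A distribution summary like the one above can be computed on the host from raw samples dumped by the target. This is a minimal Python sketch using the nearest-rank percentile definition; the function names are hypothetical and the sample data is synthetic, chosen only to resemble a mostly fast ISR with an occasional slow entry.

```python
import math

def percentile(sorted_samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of the data is at or below it. Expects an ascending, non-empty list."""
    rank = math.ceil(pct / 100.0 * len(sorted_samples))
    return sorted_samples[max(rank, 1) - 1]

def summarize(samples):
    """Return min, average, tail percentiles, and max for a list of latencies."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "avg": sum(ordered) / len(ordered),
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
        "max": ordered[-1],
    }

# Synthetic latencies in cycles (an assumption for illustration only).
latencies = [12] * 9850 + [20] * 100 + [45] * 35 + [78] * 14 + [89]
stats = summarize(latencies)
for name, value in stats.items():
    print(f"{name:>6}: {value}")
```

Keeping the raw samples, rather than only a running average on the target, is what makes the tail percentiles recoverable afterwards.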

Semaphore/Mutex Overhead

static SemaphoreHandle_t sem;

void measure_semaphore_overhead(void) {
    uint32_t total = 0;

    // A mutex is created in the "available" state, so the first take succeeds
    sem = xSemaphoreCreateMutex();
    configASSERT(sem != NULL);

    for (int i = 0; i < 10000; i++) {
        uint32_t start = get_cycles();
        xSemaphoreTake(sem, portMAX_DELAY);
        xSemaphoreGive(sem);
        uint32_t end = get_cycles();
        total += (end - start);
    }

    printf("Semaphore take+give: %lu cycles avg\n",
           (unsigned long)(total / 10000));
}

Determinism Measurement

A key RTOS characteristic is determinism. How do we quantify it?

Jitter Measurement

#define SAMPLES 10000
static uint32_t latencies[SAMPLES];

// Periodic task
void PeriodicTask(void *pvParameters) {
    TickType_t last_wake = xTaskGetTickCount();
    int idx = 0;

    for (;;) {
        uint32_t expected = last_wake * CYCLES_PER_TICK;
        uint32_t actual = get_cycles();

        if (idx < SAMPLES) {
            latencies[idx++] = actual - expected;
        }

        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(1));
    }
}

// Analyze jitter
void analyze_jitter(void) {
    uint32_t min = UINT32_MAX, max = 0;
    uint64_t sum = 0;

    for (int i = 0; i < SAMPLES; i++) {
        if (latencies[i] < min) min = latencies[i];
        if (latencies[i] > max) max = latencies[i];
        sum += latencies[i];
    }

    printf("Jitter: min=%lu, max=%lu, avg=%lu, range=%lu\n",
           min, max, (uint32_t)(sum/SAMPLES), max-min);
}

WCET Estimation

Worst-Case Execution Time (WCET) is critical for real-time system design:

#define WCET_SAMPLES 100000

uint32_t measure_wcet(void (*func)(void)) {
    uint32_t max_time = 0;

    for (int i = 0; i < WCET_SAMPLES; i++) {
        uint32_t start = get_cycles();
        func();
        uint32_t elapsed = get_cycles() - start;

        if (elapsed > max_time) {
            max_time = elapsed;
        }
    }

    return max_time;
}

Warning: Measured WCET is only the observed maximum; true WCET may be larger. Rigorous WCET analysis requires static analysis tools (like aiT, Bound-T).
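The warning can be illustrated with a small simulation. The sketch below is host-side Python with synthetic execution times (an assumption, not measured data): most runs take about 100 cycles, and a rare slow path adds 900 cycles. It tracks how the observed maximum evolves as more samples arrive.

```python
import random

def observed_wcet(durations):
    """Running maximum: the observed worst case after each sample."""
    worst = 0
    history = []
    for d in durations:
        worst = max(worst, d)
        history.append(worst)
    return history

# Synthetic execution times: ~100 cycles, plus a rare slow path
# (about 1 run in 20,000) that costs 900 extra cycles.
rng = random.Random(42)
durations = [100 + rng.randint(0, 10) + (900 if rng.random() < 1 / 20000 else 0)
             for _ in range(100_000)]

history = observed_wcet(durations)
for n in (100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} samples: observed max = {history[n - 1]} cycles")
```

The observed maximum can only rise as the sample count grows, and a short run may never hit the slow path at all. That is the gap between a measured maximum and a true WCET bound.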

Running on Simulators

QEMU + Semihosting

Semihosting lets bare-metal programs use host I/O:

// ARM semihosting
static inline void semihosting_write(const char *s) {
    asm volatile (
        "mov r0, #0x04\n"  // SYS_WRITE0
        "mov r1, %0\n"
        "bkpt #0xAB\n"
        :
        : "r"(s)
        : "r0", "r1"
    );
}

# ARM
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_arm.elf

# RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_riscv.elf

Common Pitfalls

Pitfall 1: Only Reporting Averages

Bad:  "Average latency 1ms"
Good: "Latency: avg=1ms, max=15ms, P99=3ms, P99.9=12ms"

Pitfall 2: Ignoring Interrupt Effects

During measurement, other interrupts can pollute results:

// Disable interrupts during measurement
__disable_irq();
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
__enable_irq();

But this doesn't represent reality. Real systems have interrupts—measure both "with interrupts" and "without interrupts" scenarios.

Pitfall 3: Simulator ≠ Real Hardware

QEMU cycle count:  1000 cycles
Real hardware:     3500 cycles

QEMU is a functional simulator, not cycle-accurate. Use it to verify program correctness, not for performance evaluation.

Pitfall 4: Cache Matters in Embedded Too

Many assume MCUs don't have cache. Wrong:

  • Cortex-M7 has I-cache and D-cache
  • Modern RISC-V MCUs may have cache
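  • Flash wait states make code and data in flash slower to access than RAM

// Cortex-M7 cache control
SCB_EnableICache();
SCB_EnableDCache();

// Clean and invalidate before measurement to start from a cold D-cache
// (invalidating alone would discard dirty lines not yet written back)
SCB_CleanInvalidateDCache();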

Summary

Embedded/RTOS benchmarking differs fundamentally from GPOS:

Core Differences

  • GPOS cares about average case
  • RTOS/bare-metal cares about worst case
  • Determinism matters more than throughput

Time Measurement

  • ARM: DWT cycle counter, SysTick
  • RISC-V: mcycle/minstret CSRs
  • Handle overflow carefully

Portable Benchmarks

  • CoreMark, Dhrystone, Embench: Easy to port
  • STREAM: Needs enough memory
  • lmbench: Core algorithms portable

RTOS Measurements

  • Context switch time
  • Interrupt latency (report distribution!)
  • Semaphore/mutex overhead
  • Jitter and WCET

Simulator Usage

  • QEMU: Functional verification, not performance evaluation
  • Renode: Better peripheral and RTOS support
  • Simulators cannot measure power consumption

Chapter 10: Performance Modeling

Part III: Theory


"All models are wrong, but some are useful." — George Box

The "Theoretically Impossible" Optimization

"That's impossible. Your optimization violates Amdahl's Law."

That's what a senior engineer said during code review. I claimed to have sped up a program by 10×, but according to his analysis, only 50% was parallelizable—so by Amdahl's Law, the theoretical limit was 2×.

He was right—if I had just added more threads.

But I didn't parallelize. I changed the algorithm from O(n²) to O(n log n). That's outside Amdahl's Law's scope.

This experience taught me two things:

  1. Performance models are useful, but know their applicable scope
  2. Sometimes, breaking out of the model's framework is where real breakthroughs happen

This chapter covers the most important performance models in performance engineering:

  • Amdahl's Law: The parallelization ceiling
  • Gustafson's Law: The scaling horizon
  • Universal Scalability Law: Real-world gravity
  • Roofline Model: Compute or Memory bound?
  • Little's Law: System's physical conservation
  • Queuing Theory: Why 90% utilization causes collapse

Amdahl's Law: The Parallelization Ceiling

Historical Background

In 1967, legendary computer architect Gene Amdahl presented a profoundly influential paper at the AFIPS Spring Joint Computer Conference. The industry was debating whether to invest in "single extremely fast processors" or research connecting multiple "relatively slower processors" for parallel computing.

Amdahl pointed out that regardless of hardware scaling, programs always contain serial portions that cannot be parallelized—such as I/O initialization, memory allocation, or specific logical dependencies. These serial portions become the "ceiling" of overall system performance.

Basic Formula

Assume a program has a parallelizable portion (fraction p) and a serial portion (fraction 1-p):

Speedup = 1 / ((1 - p) + p/n)

Where:
- p = parallelizable fraction
- n = number of processors
- 1-p = serial fraction

Visualization

Original program (single thread):
┌──────────────────────────────────────────┐
│ Serial (20%) │    Parallel (80%)         │
└──────────────────────────────────────────┘
Total: 100 time units

4 threads:
┌───────────┬──────────┐
│ Serial    │ Parallel │  Thread 1
│  (20%)    │  (20%)   │
└───────────┴──────────┘
            │  (20%)   │  Thread 2
            ├──────────┤
            │  (20%)   │  Thread 3
            ├──────────┤
            │  (20%)   │  Thread 4
            └──────────┘
Total: 20 + 20 = 40 time units
Speedup: 100/40 = 2.5×

Practical Calculation

def amdahl_speedup(p, n):
    """
    p: parallelizable fraction (0 to 1)
    n: number of processors
    """
    return 1 / ((1 - p) + p / n)

# 80% parallelizable
p = 0.8

print(f"1 processor:   {amdahl_speedup(p, 1):.2f}x")
print(f"2 processors:  {amdahl_speedup(p, 2):.2f}x")
print(f"4 processors:  {amdahl_speedup(p, 4):.2f}x")
print(f"8 processors:  {amdahl_speedup(p, 8):.2f}x")
print(f"16 processors: {amdahl_speedup(p, 16):.2f}x")
print(f"∞ processors:  {amdahl_speedup(p, 1000000):.2f}x")

1 processor:   1.00x
2 processors:  1.67x
4 processors:  2.50x
8 processors:  3.33x
16 processors: 4.00x
∞ processors:  5.00x  ← This is the ceiling

The Harsh Reality

Parallelizable Fraction │ Theoretical Max Speedup
────────────────────────┼────────────────────────
50%                     │ 2×
75%                     │ 4×
90%                     │ 10×
95%                     │ 20×
99%                     │ 100×

Even if 99% of code is parallelizable, maximum speedup is only 100×. That 1% serial portion determines the ceiling.
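These ceilings follow directly from the formula: as n grows without bound, the p/n term vanishes and the speedup approaches 1/(1 - p). A short sketch (the function name is hypothetical):

```python
def amdahl_limit(p):
    """Speedup ceiling as the processor count goes to infinity: 1 / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)

for p in (0.50, 0.75, 0.90, 0.95, 0.99):
    print(f"{p:.0%} parallel -> at most {amdahl_limit(p):.0f}x")
```

Read the other way around, a target speedup of S requires a serial fraction of at most 1/S, no matter how many processors are available.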

Amdahl's Law Limitations

1. Assumes fixed workload

Amdahl's Law assumes total work is constant. In reality, more resources might mean processing larger problems (Gustafson's Law).

2. Ignores parallelization overhead

In practice, adding threads brings:

  • Thread creation/destruction costs
  • Synchronization costs (mutex, barrier)
  • Cache coherence overhead
  • False sharing

3. Only considers CPU

Doesn't account for memory bandwidth, I/O, network, or other bottlenecks.

Real-World Serial Bottlenecks

In practice, serial bottlenecks often hide in details:

  • Lock Contention: Even with 128 threads, if they all compete for the same mutex, lock-waiting time is serial
  • I/O Operations: Reading disk or network packets is typically sequential
  • Memory Allocation: Frequent malloc calls may cause global lock contention in the allocator

How to measure the parallel fraction p? Typically through empirical measurement: measure execution time at different core counts and fit backwards to find p. See Appendix H for detailed measurement methods and Python fitting code.
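As a quick first pass before a full curve fit, Amdahl's formula can be inverted for a single measurement: from S = 1 / ((1 - p) + p/n) it follows that p = (1 - 1/S) / (1 - 1/n). The sketch below averages the per-point estimates; the timings are hypothetical, and averaging is a simple stand-in for a proper least-squares fit.

```python
def estimate_parallel_fraction(speedup, n):
    """Invert Amdahl's Law for one measurement: p = (1 - 1/S) / (1 - 1/n)."""
    if n <= 1:
        raise ValueError("need a measurement with more than one processor")
    return (1 - 1 / speedup) / (1 - 1 / n)

# Hypothetical wall-clock timings in seconds, keyed by core count.
timings = {1: 100.0, 2: 60.0, 4: 40.0, 8: 30.0}

baseline = timings[1]
estimates = [estimate_parallel_fraction(baseline / t, n)
             for n, t in timings.items() if n > 1]
p_hat = sum(estimates) / len(estimates)

print("per-point estimates:", [round(e, 3) for e in estimates])
print(f"estimated p = {p_hat:.3f}")
```

If the per-point estimates drift downward as the core count rises, that usually signals overhead that Amdahl's Law does not model, such as contention or coherence traffic.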

Gustafson's Law: A Different Perspective

Gustafson proposed a different assumption: more processors means solving larger problems, not solving the same problem faster.

Formula

Speedup = (1 - p) + p × n

Where:
- p = parallel portion fraction
- n = number of processors

Comparison

def gustafson_speedup(p, n):
    return (1 - p) + p * n

p = 0.8  # 80% parallel

print("Processors | Amdahl | Gustafson")
print("-" * 35)
for n in [1, 2, 4, 8, 16, 64]:
    a = amdahl_speedup(p, n)
    g = gustafson_speedup(p, n)
    print(f"{n:10} | {a:6.2f} | {g:6.2f}")

Processors | Amdahl | Gustafson
-----------------------------------
         1 |   1.00 |   1.00
         2 |   1.67 |   1.80
         4 |   2.50 |   3.40
         8 |   3.33 |   6.60
        16 |   4.00 |  13.00
        64 |   4.71 |  51.40

Gustafson's view: in the same wall-clock time, 64 processors can handle a problem about 51× larger.

When to Use Which?

Scenario                                  │ Applicable Law │ Key Metric
──────────────────────────────────────────┼────────────────┼───────────────
Fixed-size problem, finish faster         │ Amdahl         │ Strong Scaling
Fixed time, process larger problem        │ Gustafson      │ Weak Scaling
Real-time systems (fixed deadline)        │ Amdahl         │ Latency
Scientific computing (bigger is better)   │ Gustafson      │ Throughput
UI response, single function optimization │ Amdahl         │ Response Time
Big data processing, AI training          │ Gustafson      │ Data Volume

Practical advice: If customers complain "App starts too slowly," use Amdahl to find serial bottlenecks; if they want "more transactions in the same time," use Gustafson to think about scaling.

Universal Scalability Law: Real-World Gravity

If Amdahl's Law is the performance "ceiling" and Gustafson's Law is the "distant horizon," then the Universal Scalability Law (USL) is real-world "gravity"—it explains why some systems' performance not only plateaus but actually declines when adding more cores.

Why Amdahl Isn't Enough

In 1993, Neil Gunther proposed USL to address "Negative Scaling" phenomena. Amdahl assumes performance eventually approaches a constant, but in real distributed systems, we often see performance reach a peak then decline.

Amdahl only considers "work being serialized" overhead, ignoring the "communication and coordination" between cores needed to maintain data consistency.

Formula and Parameters

C(N) = N / (1 + σ(N-1) + κN(N-1))

Where:
- N = number of processors
- σ (sigma) = contention coefficient - overhead from waiting for same resource
- κ (kappa) = coherence coefficient - communication overhead for consistency

Key insights:

  • σ is linear: Like Amdahl's serial portion, growth slows then plateaus
  • κ is quadratic: N(N-1) represents pairwise node communication, overhead grows explosively

Three Scaling Behaviors

Performance
    ^
    |        Linear (σ=0, κ=0)
    |       /
    |      /    Amdahl (κ=0)
    |     /   _______________
    |    /   /
    |   /   /   USL (σ>0, κ>0)
    |  /   /  /\
    | /   /  /  \  ← Retrograde!
    |/   /  /    \
    └────────────────────────> N (cores)

When κ > 0, systems exhibit "Retrograde Behavior"—beyond a critical point, communication overhead exceeds computation gains, and performance declines.

Identifying System Bottlenecks

  • Contention-bound (σ dominant): Curve gradually flattens like a slope. Optimization: reduce lock contention, shrink critical sections
  • Coherence-bound (κ dominant): Curve like a mountain with clear peak then rapid drop. Optimization: reduce cross-node communication, avoid false sharing

Practical Application

USL's greatest power: with just a few measurement points (e.g., N=1,2,4,8), you can predict system behavior at N=64.

# Optimal parallelism
N_optimal = sqrt((1 - sigma) / kappa)

Real-World Example: Web Server Scaling

A team benchmarked their API server at different instance counts:

Instances │ Throughput (req/s) │ Speedup
──────────┼────────────────────┼─────────
    1     │      1,000         │   1.0×
    2     │      1,850         │   1.85×
    4     │      3,200         │   3.2×
    8     │      4,800         │   4.8×
   16     │      5,600         │   5.6×
   32     │      4,200         │   4.2× ← Degradation!

After USL fitting: σ = 0.05, κ = 0.008

Diagnosis: High κ indicates coherence-bound behavior—each request hits a shared database, causing cross-instance coordination overhead. The optimal instance count is √((1-0.05)/0.008) ≈ 11 instances.
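The optimum can be checked numerically. Setting the derivative of C(N) to zero gives 1 - σ - κN² = 0, which is where the square-root formula comes from. The sketch below evaluates the USL curve with the coefficients quoted above and searches integer instance counts for the peak; the function names are hypothetical.

```python
import math

def usl_speedup(n, sigma, kappa):
    """Universal Scalability Law: C(N) = N / (1 + sigma(N-1) + kappa N(N-1))."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_optimum(sigma, kappa):
    """Continuous N at which C(N) peaks: sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1 - sigma) / kappa)

sigma, kappa = 0.05, 0.008   # coefficients quoted in the example above
n_star = usl_optimum(sigma, kappa)
best_n = max(range(1, 65), key=lambda n: usl_speedup(n, sigma, kappa))

print(f"continuous optimum: {n_star:.1f}")
print(f"best integer N:     {best_n}")
print(f"modeled speedup at best N: {usl_speedup(best_n, sigma, kappa):.2f}x")
```

Because the curve is fairly flat near its peak, it is worth comparing the modeled values against the measured points before committing to a specific instance count.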

Solution: Add read replicas and cache layer to reduce database roundtrips.

USL Limitations

  1. Assumes homogeneous nodes: All processors/nodes must be identical
  2. Steady-state assumption: Doesn't capture transient behavior during ramp-up
  3. Single resource model: Real systems have multiple bottlenecks (CPU, memory, network)
  4. Fitting sensitivity: Results depend on measurement quality; outliers can skew σ and κ

See Appendix H for detailed Python fitting code and case studies.

Roofline Model: Finding the Bottleneck

The Roofline Model is a visualization tool proposed by UC Berkeley in 2008 for analyzing whether a program is compute-bound or memory-bound.

Core Concepts

Every program has two characteristics:

  1. Operational Intensity (OI): How many operations per byte of memory access

    OI = FLOPs / Bytes moved
    
  2. Attainable Performance: Actual achievable FLOPS

The system has two limits:

  1. Peak Compute: Maximum CPU FLOPS (horizontal line)
  2. Peak Memory Bandwidth: Memory bandwidth limit (sloped line)

The Roofline Diagram

Performance (GFLOPS)
     ^
     |                    __________________ Peak Compute (roof)
     |                   /← Ridge Point
     |                  /
     |                 /
     |                /  ← Memory Bandwidth (slope)
     |               /
     |              /
     |             /
     |            /
     |           /
     |──────────────────────────────────────> Operational Intensity
                                               (FLOPs/Byte)

Calculation Example

Assume a system with:

  • Peak Compute: 100 GFLOPS
  • Memory Bandwidth: 50 GB/s

def roofline_performance(oi, peak_compute, bandwidth):
    """
    oi: Operational Intensity (FLOPs/Byte)
    peak_compute: Peak GFLOPS
    bandwidth: Memory bandwidth (GB/s)
    """
    memory_bound = oi * bandwidth  # GFLOPS
    return min(memory_bound, peak_compute)

peak = 100  # GFLOPS
bw = 50     # GB/s
ridge_point = peak / bw  # 2 FLOPs/Byte

print("Operational Intensity | Attainable GFLOPS | Bound")
print("-" * 55)
for oi in [0.1, 0.5, 1, 2, 4, 8, 16]:
    perf = roofline_performance(oi, peak, bw)
    bound = "Memory" if oi < ridge_point else "Compute"
    print(f"{oi:21.1f} | {perf:17.1f} | {bound}")

Operational Intensity | Attainable GFLOPS | Bound
-------------------------------------------------------
                  0.1 |               5.0 | Memory
                  0.5 |              25.0 | Memory
                  1.0 |              50.0 | Memory
                  2.0 |             100.0 | Compute  ← Ridge Point
                  4.0 |             100.0 | Compute
                  8.0 |             100.0 | Compute
                 16.0 |             100.0 | Compute

Real-World Examples

Operational Intensity of different algorithms:

Algorithm     │ OI (FLOPs/Byte) │ Usually
──────────────┼─────────────────┼──────────────────────────
STREAM copy   │ 0               │ Memory-bound
SpMV (sparse) │ 0.25            │ Memory-bound
BLAS Level 1  │ 0.25-0.5        │ Memory-bound
Stencil       │ 0.5-1           │ Memory-bound
BLAS Level 2  │ 1-2             │ Borderline
Dense GEMM    │ High            │ Compute-bound
FFT           │ Medium          │ Depends on implementation

Cache-Aware Roofline Model (CARM)

Traditional Roofline only considers DRAM bandwidth, but modern processors have multiple cache levels. CARM draws different rooflines for each memory level:

Performance
    ^
    |  __________________________ L1 Peak (highest slope)
    | /_________________________ L2 Peak
    |//________________________ L3 Peak
    |||_______________________ DRAM Peak (lowest slope)
    |||/
    ||/
    |/
    └──────────────────────────> Operational Intensity

Diagnostic logic: If your point falls below the DRAM slope, the problem is cache misses (optimize prefetching, data locality); if near L1 but below compute peak, arithmetic intensity is insufficient (consider loop fusion).

Multi-core Roofline Considerations

  • Shared Bandwidth: Multiple cores share DRAM bus, total bandwidth saturates faster
  • Ridge Point shifts right: Compute peak increases linearly with cores, but bandwidth doesn't, making programs more likely to be memory-bound
  • NUMA effects: Local DRAM bandwidth is much higher than Remote DRAM; label separately

Roofline Limitations

  1. Static view: Doesn't capture phase behavior—a program may be memory-bound in one phase, compute-bound in another
  2. Assumes perfect overlap: Ignores latency hiding and out-of-order execution limitations
  3. Single bottleneck model: Real programs may have mixed OI across different kernels
  4. Measurement challenges: Accurately counting FLOPs and bytes moved requires careful instrumentation

See Appendix H for detailed CARM analysis and tool usage.

Little's Law: System's Physical Conservation

In performance engineering, some laws transcend algorithms and hardware architectures. Little's Law is like conservation of energy in physics—it defines the fundamental boundaries of system operation.

In 1961, John Little proved that in stable systems, three core metrics have an invariant relationship.

Formula

L = λ × W

Where:
- L = average number of items in the system (in-flight requests)
- λ = average arrival rate (equal to throughput in a stable system)
- W = average time an item spends in the system (latency)

Intuitive Understanding

Imagine a restaurant:

30 people in restaurant (L)
10 people arrive per minute (λ)
Each person stays 3 minutes on average (W)

L = λ × W
30 = 10 × 3 ✓

Applications in Computer Systems

1. Memory System

Outstanding memory requests = Bandwidth × Latency

Example:
- Memory latency: 100 ns
- Required bandwidth: 50 GB/s

Outstanding requests = 50 GB/s × 100 ns = 5000 bytes = 78 cache lines

If CPU can only maintain 16 outstanding requests,
Actual bandwidth = 16 × 64 bytes / 100 ns = 10.24 GB/s

This is why modern CPUs need deep memory hierarchies and prefetchers.
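The same calculation as a reusable sketch. It assumes 64-byte cache lines, as in the example above, and relies on the convenient fact that GB/s multiplied by ns gives bytes. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64

def required_outstanding_lines(bandwidth_gb_per_s, latency_ns):
    """Cache lines that must be in flight to sustain the target bandwidth.
    GB/s times ns equals bytes, because 1e9 * 1e-9 = 1."""
    bytes_in_flight = bandwidth_gb_per_s * latency_ns
    return bytes_in_flight / CACHE_LINE_BYTES

def achievable_bandwidth_gb_per_s(outstanding_lines, latency_ns):
    """Bandwidth ceiling when only this many lines can be in flight."""
    return outstanding_lines * CACHE_LINE_BYTES / latency_ns

print(f"{required_outstanding_lines(50, 100):.1f} lines needed for 50 GB/s")
print(f"{achievable_bandwidth_gb_per_s(16, 100):.2f} GB/s with 16 lines in flight")
```

Either lowering latency or raising the number of requests in flight lifts the ceiling, and prefetching is one way to do the latter.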

2. Network System

Bandwidth-Delay Product (BDP) = Bandwidth × RTT

Example trans-Pacific connection:
- Bandwidth: 10 Gbps
- RTT: 150 ms

BDP = 10 Gbps × 150 ms = 1.5 Gb = 187.5 MB

TCP window needs at least 187.5 MB to fill the pipe

3. Concurrent System

Throughput = Concurrency / Latency

Example web server:
- Each request latency: 50 ms
- Want 1000 requests/sec throughput

Required concurrency = 1000 × 0.05 = 50 concurrent requests
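Both the network and the concurrency examples are the same L = λ × W calculation with different units. A minimal sketch, with hypothetical function names and the figures from the two examples above:

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_s * rtt_s / 8

def required_concurrency(throughput_per_s, latency_s):
    """Little's Law, L = lambda * W."""
    return throughput_per_s * latency_s

bdp = bandwidth_delay_product_bytes(10e9, 0.150)    # 10 Gbps, 150 ms RTT
workers = required_concurrency(1000, 0.050)         # 1000 req/s, 50 ms each

print(f"BDP = {bdp / 1e6:.1f} MB")
print(f"required concurrency = {workers:.0f} requests")
```

The unit conversion, bits to bytes and milliseconds to seconds, is where these calculations most often go wrong, so it is kept explicit here.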

Key Prerequisites

Little's Law is powerful because it makes no assumptions about task distribution or service order. But three prerequisites must hold:

  1. System must be stable: Arrival rate = Departure rate. If tasks keep accumulating, the formula fails
  2. Long-term average: Describes equilibrium state, not instantaneous bursts
  3. Task conservation: Tasks don't disappear (like dropped packets) or self-replicate

Dimensional analysis verification: Throughput [tasks/time] × Latency [time] = [tasks]

Diagnostic Value When Formula "Fails"

When measured data doesn't match L = λW, this typically indicates:

  • Actual concurrency > Expected: Long-tail requests inflating average, or resource leaks (unclosed connections)
  • Actual concurrency < Expected: Throughput overestimated, or tasks being batched

Little's Law is the performance engineer's "Sanity Check." See Appendix H for detailed mathematical proof, verification methods, and architectural applications.

Queuing Theory Fundamentals

If Little's Law describes what a system conserves, queuing theory describes the dynamics of requests waiting to be processed. It explains why systems appear to collapse suddenly at around 90% load.

M/M/1 Model

This is the most basic queuing model, assuming Poisson arrivals, exponential service times, and a single server.

Key formulas:
- ρ = λ/μ (utilization, λ=arrival rate, μ=service rate)
- L = ρ/(1-ρ) (average number of requests in the system, waiting plus in service)
- W = 1/(μ-λ) (average time in the system, waiting plus service)

Why does latency explode at 90% utilization?

When ρ → 1, (1-ρ) → 0, causing W → ∞. Any small arrival fluctuation causes queue accumulation, and subsequent requests' wait times grow in a chain reaction—this is the "Hockey Stick Effect."
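Rewriting W = 1/(μ-λ) in terms of the service time S = 1/μ gives W = S / (1 - ρ), which makes the blow-up easy to tabulate. A short sketch with a hypothetical function name and an illustrative 10 ms service time:

```python
def mm1_response_time(service_time, utilization):
    """Mean time in system for M/M/1: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1 - utilization)

service_ms = 10.0  # illustrative service time
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    w = mm1_response_time(service_ms, rho)
    print(f"rho={rho:.2f}: response {w:7.1f} ms ({w / service_ms:5.1f}x service time)")
```

Moving from 50% to 90% utilization multiplies response time by five, and the last few percent before saturation cost far more than all the load before them.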

Practical Rules of Thumb

  1. 70% Rule: For latency-sensitive systems, keep Utilization ≤ 70%
  2. Latency multiplier: At 50% utilization, response time is 2× the pure processing time; at 90% it is 10×
  3. Separate processing from waiting: If response time increases but Service Time hasn't changed, the problem is "queuing"

Real-World Example: Database Connection Pool Sizing

A service connects to PostgreSQL with a connection pool. Current setup:

- Arrival rate: 500 queries/sec
- Average query time: 10ms
- Current pool size: 10 connections

Analysis, treating each connection as its own M/M/1 queue (a conservative simplification; a shared pool behaves as M/M/c and queues less):

ρ = λ / (c × μ) = 500 / (10 × 100) = 0.5 per connection

With 10 connections at 50% average utilization, queuing probability is acceptable.

But during peak hours:

Peak arrival: 800 queries/sec
ρ = 800 / (10 × 100) = 0.8 per connection

At 80% utilization, wait time ≈ ρ/(1-ρ) × service time = 4 × 10ms = 40ms added latency!

Solution: Increase pool to 15 connections:

ρ = 800 / (15 × 100) = 0.53 per connection
Wait time drops to ~1.1× service time = 11ms added latency
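The 40ms and 11ms figures above come from the single-server wait formula ρ/(1-ρ) applied to each connection, which is pessimistic for a pool that shares one queue. As a cross-check, here is a sketch of the Erlang C calculation for a shared M/M/c pool, using the stable Erlang B recursion. The function names are hypothetical and the inputs are the peak-hour numbers from the example.

```python
def erlang_c(servers, offered_load):
    """Probability that an arrival has to wait in an M/M/c queue.
    offered_load is a = lambda / mu (in erlangs) and must be below servers."""
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    blocking = 1.0                       # Erlang B with zero servers
    for k in range(1, servers + 1):      # numerically stable Erlang B recursion
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / servers
    return blocking / (1 - rho * (1 - blocking))

def mmc_mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent queuing (excluding service), in seconds."""
    a = arrival_rate / service_rate
    return erlang_c(servers, a) / (servers * service_rate - arrival_rate)

for pool in (10, 15):
    wait_ms = 1000 * mmc_mean_wait(800, 100, pool)
    print(f"pool={pool}: P(wait)={erlang_c(pool, 8.0):.3f}, "
          f"mean queue wait = {wait_ms:.2f} ms")
```

Under these assumptions the shared pool of 10 queues for roughly 2 ms on average at peak rather than 40 ms, so the per-connection estimate is best read as an upper bound. The conclusion that a larger pool helps still holds.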

See Appendix H for detailed M/M/c model, Erlang C formula, and capacity planning.

Integrated Application: Finding the Real Bottleneck

Let's use an example to apply these models together.

Problem

You have an image processing pipeline:

  • Read image (I/O)
  • Apply filter (compute)
  • Write result (I/O)

Current performance: 100 images/sec Target: 500 images/sec

Analysis

Step 1: Amdahl Analysis

First, measure each stage's time:

Read:   2 ms (20%)
Filter: 6 ms (60%)
Write:  2 ms (20%)
Total:  10 ms

If we only optimize Filter (parallelize):

# Filter is 60%, even with infinite parallelism
max_speedup = amdahl_speedup(0.6, float('inf'))
print(f"Max speedup: {max_speedup:.2f}x")  # 2.5x

2.5x only gets us to 250 images/sec—not enough.
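The same conclusion falls out of a per-stage model, which also shows why the filter cannot be the whole answer. The sketch below assumes the three stages run back to back with no overlap, as in the measurements above; the function name is hypothetical.

```python
def pipeline_throughput(stage_ms, speedups):
    """Images per second for stages that run back to back (no overlap).
    speedups maps stage name -> speedup factor; missing stages default to 1."""
    total_ms = sum(t / speedups.get(name, 1.0) for name, t in stage_ms.items())
    return 1000.0 / total_ms

stages = {"read": 2.0, "filter": 6.0, "write": 2.0}   # measured times from above

baseline = pipeline_throughput(stages, {})
filter_free = pipeline_throughput(stages, {"filter": float("inf")})

budget_ms = 1000.0 / 500            # time per image at the 500 images/sec target
io_ms = stages["read"] + stages["write"]

print(f"baseline:            {baseline:.0f} images/sec")
print(f"filter made free:    {filter_free:.0f} images/sec")
print(f"budget per image:    {budget_ms:.1f} ms, I/O alone: {io_ms:.1f} ms")
```

The I/O stages alone take twice the per-image budget, so the target is only reachable if I/O is overlapped with compute or made faster.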

Step 2: Roofline Analysis of Filter

Filter characteristics:

  • Per pixel: 20 FLOPs
  • Per pixel: read 4 bytes, write 4 bytes = 8 bytes
  • OI = 20/8 = 2.5 FLOPs/Byte

System:

  • Peak: 200 GFLOPS
  • Bandwidth: 50 GB/s
  • Ridge point: 4 FLOPs/Byte

OI = 2.5 < 4 → Memory-bound

Optimization direction: not more threads, but improve memory access pattern.

Step 3: Little's Law Analysis of I/O

Read throughput: 500 images/sec × 4 MB/image = 2 GB/s
Disk latency: assume SSD, 0.1 ms

Required queue depth = 2 GB/s × 0.1 ms = 200 KB in flight ≈ 50 requests (assuming 4 KB reads)

But we're using sync I/O (queue depth = 1)

Problem found: I/O isn't pipelined, need async I/O.

Solution

  1. Use async I/O with queue depth = 64
  2. Improve filter's cache locality (loop tiling)
  3. Now filter is compute-bound, can apply SIMD optimization

Result: Achieved 600 images/sec.

Common Pitfalls

Pitfall 1: Only Looking at Averages

Bad:  "Average latency 10ms, throughput 100/sec"
      Little's Law: L = 100 × 0.01 = 1

Good: "Average latency 10ms, P99 latency 500ms"
      That P99 might cause problems

Pitfall 2: Ignoring Model Assumptions

Amdahl's Law assumes:

  • Fixed workload
  • No parallelization overhead
  • Perfect parallelism (no synchronization waits)

None of these hold exactly in real systems.

Pitfall 3: Over-trusting Roofline

Roofline says your program is memory-bound, but:

  • Might be due to a cache-unfriendly access pattern
  • Might be because prefetcher can't predict
  • Might be due to false sharing

Need deeper analysis (perf, VTune).

Pitfall 4: Confusing Throughput and Latency

System A: 1000 req/s, 100ms latency
System B: 800 req/s, 10ms latency

Which is better? Depends on your needs.

Little's Law:
A: 1000 × 0.1 = 100 concurrent
B: 800 × 0.01 = 8 concurrent

A needs more resources to maintain that throughput

Which Model to Use: Decision Guide

                    ┌─────────────────────────────────┐
                    │   What's your performance       │
                    │          question?              │
                    └────────────────┬────────────────┘
                                     │
        ┌────────────────────────────┼────────────────────────────┐
        ▼                            ▼                            ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│ "Will more cores  │    │ "Is my code       │    │ "Why is latency   │
│    help?"         │    │  compute or       │    │   so high?"       │
└─────────┬─────────┘    │  memory bound?"   │    └─────────┬─────────┘
          │              └─────────┬─────────┘              │
          ▼                        ▼                        ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│  Amdahl's Law     │    │  Roofline Model   │    │  Little's Law     │
│  (fixed workload) │    │                   │    │  + Queuing Theory │
└─────────┬─────────┘    └───────────────────┘    └───────────────────┘
          │
          ▼ Performance degrades at high N?
          │
┌─────────┴─────────┐
│  Yes              │ No
│  ▼                │ ▼
│  USL              │ Gustafson
│  (find σ, κ)      │ (scale workload)
└───────────────────┘

Quick Reference:

Symptom                         Model          Key Metric
Adding cores doesn't help       Amdahl         Serial fraction (1-p)
Performance drops at high N     USL            σ (contention), κ (coherence)
Slow despite high CPU           Roofline       Operational Intensity
Latency spikes at peak load     Queuing        ρ (utilization)
Throughput × Latency mismatch   Little's Law   L = λ × W

Summary

Six Performance Models, six perspectives:

Amdahl's Law

  • Theoretical upper limit for parallelization
  • Serial portion determines the ceiling
  • Use to judge "will adding more cores help"

Gustafson's Law

  • More resources → process larger problems
  • More optimistic than Amdahl
  • Applies to scalable workloads
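Amdahl and Gustafson can be compared side by side with a few lines of Python. This is a sketch under the usual textbook forms: p is the parallel fraction (so 1-p is the serial fraction), n is the number of cores, and for Gustafson p is measured on the scaled, parallel run. The value p = 0.95 is an arbitrary example.

```python
def amdahl_speedup(p, n):
    """Fixed workload: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Scaled workload: speedup = (1 - p) + p * n."""
    return (1.0 - p) + p * n

p = 0.95  # example parallel fraction (an assumption, not a measurement)
for n in (1, 8, 64, 1024):
    print(f"n={n:5d}  Amdahl={amdahl_speedup(p, n):6.2f}  "
          f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

With a 5% serial fraction, Amdahl's speedup can never reach 20 no matter how many cores you add, while Gustafson's keeps growing because the problem grows with the machine.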

Universal Scalability Law (USL)

  • Captures contention (σ) and coherence (κ) overhead
  • Predicts performance degradation with more nodes
  • Find optimal parallelism: N_opt = √((1-σ)/κ)
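A minimal sketch of the USL in Python, assuming the standard form C(N) = N / (1 + σ(N-1) + κN(N-1)). The σ and κ values are made up for illustration; in practice you fit them to measured throughput.

```python
import math

def usl_throughput(n, sigma, kappa):
    """Relative capacity at n workers: n / (1 + sigma*(n-1) + kappa*n*(n-1))."""
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_optimal_n(sigma, kappa):
    """Parallelism at which throughput peaks: sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1.0 - sigma) / kappa)

sigma, kappa = 0.05, 0.0001  # hypothetical contention and coherence coefficients
n_opt = usl_optimal_n(sigma, kappa)
print(f"N_opt is about {n_opt:.1f}")
for n in (1, 16, 50, 97, 200, 400):
    print(f"n={n:4d}  relative throughput={usl_throughput(n, sigma, kappa):6.2f}")
```

Past N_opt the κ term wins and throughput falls as you add workers, which is the degradation that Amdahl's Law cannot express.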

Roofline Model

  • Compute-bound vs Memory-bound
  • Visualize performance bottlenecks
  • Guide optimization direction

Little's Law

  • L = λ × W
  • Connects throughput, latency, concurrency
  • Diagnose queuing and resource utilization

Queuing Theory (M/M/1)

  • Explains "Hockey Stick Effect" at high utilization
  • 70% rule for latency-sensitive systems
  • Response time = Service time / (1 - utilization)
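The hockey stick is easy to see by plugging numbers into the M/M/1 formula. A small sketch, with a 10 ms service time chosen purely as an example:

```python
def mm1_response_time(service_time, utilization):
    """M/M/1 mean response time: service_time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

service_ms = 10.0  # example service time, not a measurement
for rho in (0.10, 0.50, 0.70, 0.90, 0.95, 0.99):
    print(f"utilization={rho:.2f}  response={mm1_response_time(service_ms, rho):7.1f} ms")
```

Going from 50% to 70% utilization costs about 13 ms of extra response time; going from 90% to 99% costs about 900 ms. That asymmetry is the reason behind the 70% rule.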

Usage Recommendations

  1. First use Amdahl/Gustafson to evaluate parallelization potential
  2. Use USL when scaling shows degradation—identify σ vs κ bottlenecks
  3. Use Roofline to determine compute vs memory bound
  4. Use Little's Law to analyze throughput/latency relationships
  5. Use queuing theory to understand utilization vs latency tradeoffs
  6. Remember: all models are simplifications; actual measurement is always the final answer

Chapter 11: Galactic Algorithms

Part III: Theory


"In theory, there is no difference between theory and practice. In practice, there is." — Yogi Berra

The Story of an O(n) Algorithm Losing to O(n²)

I once encountered a classic problem in an interview: find two numbers in an array that sum to a target value.

I gave the O(n²) brute-force solution, then smugly added, "This can also be optimized to O(n) with a hash table."

The interviewer asked, "What if the array only has 10 elements?"

"Uh... O(n) is still faster, right?"

He shook his head. "Let's measure."

We measured. The O(n²) version was 3× faster.

Why? Because 10 elements in a nested loop means only 45 comparisons, while the hash table needs:

  • Hash computation (possibly more expensive than comparison)
  • Hash collision handling
  • Memory allocation
  • Cache misses (hash table isn't contiguous)

This is Big-O's blind spot: it only tells you behavior as n approaches infinity, but in reality n is often finite.
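For reference, here are the two solutions from the story as a Python sketch (the function names are mine). In CPython the interpreter overhead dominates, so don't expect this to reproduce the 3× figure, which depends on the language, the hash table implementation, and the hardware. What it does show is how much more machinery the "faster" version drags in.

```python
def two_sum_brute(nums, target):
    """O(n^2): try every pair; at most n*(n-1)/2 comparisons (45 for n = 10)."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return i, j
    return None

def two_sum_hash(nums, target):
    """O(n): one pass, but each step hashes a key and may allocate."""
    seen = {}  # value -> index of an earlier occurrence
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return i, j
        seen[value] = j
    return None

nums = [3, 9, 14, 2, 7, 11, 5, 8, 1, 6]
print(two_sum_brute(nums, 25), two_sum_hash(nums, 25))
```

To compare them on your own machine, time both with the standard `timeit` module at the n you actually care about.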

What Big-O Really Means

What Is Big-O

Big-O describes asymptotic behavior: the growth rate as n approaches infinity.

T(n) = 5n² + 3n + 100

As n → ∞:
- 100 can be ignored
- 3n can be ignored
- Coefficient 5 can be ignored

Conclusion: T(n) = O(n²)

What Big-O Ignores

1. Constant Factors

Algorithm A: T(n) = 1000n      → O(n)
Algorithm B: T(n) = n²         → O(n²)

Crossover point: 1000n = n²  →  n = 1000

When n < 1000, B is faster than A!
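If the cost functions are less tidy than these two, you can find the crossover by brute force. A minimal sketch, where `crossover` is a hypothetical helper and the cost functions are the models above rather than measured timings:

```python
def crossover(cost_a, cost_b, n_max=10**6):
    """Smallest n >= 1 where cost_a(n) < cost_b(n), or None if there is none up to n_max."""
    for n in range(1, n_max + 1):
        if cost_a(n) < cost_b(n):
            return n
    return None

def cost_linear(n):
    return 1000 * n      # Algorithm A: T(n) = 1000n

def cost_quadratic(n):
    return n * n         # Algorithm B: T(n) = n^2

print(crossover(cost_linear, cost_quadratic))
```

The two are tied at exactly n = 1000, so the first n at which A is strictly cheaper is 1001. With real code, replace the cost models with measured run times.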

2. Lower-Order Terms

T(n) = n² + 1000000n

Big-O says this is O(n²), but when n < 1000000,
the linear term dominates execution time.

3. Actual Hardware Characteristics

  • Cache behavior
  • Branch prediction
  • SIMD friendliness
  • Memory allocation

Classic Example: Insertion Sort vs Merge Sort

Insertion Sort: O(n²)
Merge Sort:     O(n log n)

But:
- Insertion sort has small constant factors
- Merge sort needs extra space and copying
- Insertion sort is cache-friendly

Measured: n < 32, insertion sort is usually faster

This is why std::sort and Arrays.sort are hybrid sorts—quicksort/mergesort for large arrays, switching to insertion sort for small ones.

Galactic Algorithms: Theoretically Optimal Monsters

Some algorithms are "optimal" in Big-O terms but nobody actually uses them. These are called Galactic Algorithms—only faster at galactic-scale input sizes.

Example 1: Matrix Multiplication

Basic algorithm: O(n³)

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

Strassen's algorithm: O(n^2.807)

In 1969, Strassen discovered that two 2×2 matrices can be multiplied with 7 multiplications instead of 8. Applied recursively to matrix blocks, this gives O(n^log₂7) ≈ O(n^2.807).
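Here is the 7-multiplication trick for a single 2×2 product, as a Python sketch using Strassen's standard formulas. In the real algorithm the entries are sub-matrices and each product is a recursive call; plain numbers are used here to keep it short.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications (m1..m7) instead of 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_2x2(A, B):
    """Textbook product: 8 multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
print(strassen_2x2(A, B), naive_2x2(A, B))
```

Notice the price of saving one multiplication: 18 additions and subtractions instead of 4. That extra bookkeeping is part of why Strassen only pays off for large matrices.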

Faster algorithms:

Year   Algorithm              Complexity
1969   Strassen               O(n^2.807)
1990   Coppersmith-Winograd   O(n^2.376)
2014   Le Gall                O(n^2.3728639)
2020   Alman-Williams         O(n^2.3728596)

What's actually used: Strassen (sometimes) + highly optimized O(n³)

Why not use O(n^2.37) algorithms?

Coppersmith-Winograd's constant factor is estimated at over 10^20

On that estimate, you need n > 10^20 before it beats Strassen
A 10^20 × 10^20 matrix has 10^40 entries, far beyond all the storage ever built

Example 2: Integer Multiplication

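Grade school algorithm: O(n²)

Karatsuba: O(n^1.585) - 1960

Schönhage-Strassen: O(n log n log log n) - 1971, FFT-based

Harvey-van der Hoeven: O(n log n) - 2019, the galactic one: its advantage only appears for numbers far larger than anything multiplied in practice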
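Karatsuba is the non-galactic member of this family, and it fits in a dozen lines. The sketch below splits on bits rather than decimal digits and uses a base-case cutoff of 10, both arbitrary choices of mine. Python's built-in `*` is used only for the tiny base case and for checking the result.

```python
def karatsuba(x, y):
    """Multiply non-negative integers with 3 recursive products instead of 4."""
    if x < 10 or y < 10:
        return x * y                      # small enough: multiply directly
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask      # x = x_hi * 2^half + x_lo
    y_hi, y_lo = y >> half, y & mask
    lo = karatsuba(x_lo, y_lo)
    hi = karatsuba(x_hi, y_hi)
    # (x_lo + x_hi)(y_lo + y_hi) - lo - hi equals x_lo*y_hi + x_hi*y_lo
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo

print(karatsuba(1234, 5678))
```

Three half-size products instead of four is where the exponent log₂3 ≈ 1.585 comes from.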

Constant Factors: The Ignored Elephant

A Tale of Cycles

Suppose we compare two algorithms:

Algorithm A: 10n operations, each takes 1 cycle
Algorithm B: n operations, each takes 100 cycles

A total: 10n cycles
B total: 100n cycles

A is 10× faster, even though it has 10× more "operations"

What operations take 100 cycles?

Operation                Approximate cycles
Integer addition         1
Integer multiplication   3-4
Integer division         20-80
L1 cache hit             4
L2 cache hit             12
L3 cache hit             40
DRAM access              100-300
Branch misprediction     15-20
System call              1000+

Real Example: Hash Table vs Array

// Version 1: Hash table lookup O(1)
// (hash_table and compute_hash are illustrative; key comparison is omitted)
int find_hash(const hash_table *ht, int key) {
    unsigned hash = compute_hash(key);   // ~10 cycles
    unsigned idx = hash % ht->size;      // ~20 cycles (division!); unsigned keeps idx non-negative
    // Possible collision handling...
    return ht->buckets[idx];             // Possible cache miss
}

// Version 2: Linear search O(n)
int find_linear(int *arr, int n, int key) {
    for (int i = 0; i < n; i++)          // Sequential access, cache-friendly
        if (arr[i] == key) return i;
    return -1;
}

Crossover point is around n = 10-50, depending on implementation details.

Cache Impact

// Version 1: Random access
int sum_random(int *arr, int *indices, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[indices[i]];  // Possible cache miss every time
    return sum;
}

// Version 2: Sequential access
int sum_sequential(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[i];  // Prefetcher can predict
    return sum;
}

Measured difference can be 10-50×, even though Big-O is O(n) for both.

Why Theory and Practice Diverge

Reason 1: Memory Hierarchy

Big-O assumes all memory accesses cost the same.

Theoretical model:
┌──────────────────────────────────┐
│              Memory              │
│         (uniform cost)           │
└──────────────────────────────────┘

Reality:
┌────────┐
│ L1 (4) │  ← 4 cycles
├────────┤
│ L2 (12)│  ← 12 cycles
├────────┤
│ L3 (40)│  ← 40 cycles
├────────┤
│RAM(200)│  ← 200 cycles
└────────┘

Memory hierarchy-friendly algorithms can beat theoretically faster ones.

Reason 2: Branch Prediction

// Version 1: With branches
int sum_positive(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (arr[i] > 0)
            sum += arr[i];
    return sum;
}

// Version 2: Branchless
int sum_positive_branchless(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int mask = arr[i] >> 31;  // 0 if non-negative, -1 if negative (32-bit int, arithmetic shift)
        sum += arr[i] & ~mask;
    }
    return sum;
}

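If positive/negative distribution is random, version 2 can be 2-3× faster (avoiding branch misprediction). Check the compiler output before rewriting by hand: at higher optimization levels, compilers often emit a conditional move or vectorized code for version 1 as well.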

Reason 3: SIMD

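// Scalar
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

// SIMD (AVX, 8 floats per instruction); needs #include <immintrin.h>
int i = 0;
for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);  // unaligned load; _mm256_load_ps needs 32-byte alignment
    __m256 vb = _mm256_loadu_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(&c[i], vc);
}
for (; i < n; i++)                       // scalar tail when n is not a multiple of 8
    c[i] = a[i] + b[i];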

SIMD version can be 4-8× faster, but Big-O is O(n) for both.

Practical Advice

1. Don't Prematurely Optimize Complexity

Bad:  "This is O(n²), must change to O(n log n)!"
Good: "This is O(n²), what's the max n? 100? Doesn't matter."

2. Profile First, Then Optimize

Bad:  Spend a week switching to theoretically faster algorithm
Good: Use perf to find bottleneck is cache misses, adjust data layout

3. Know Your n

n Range      Optimization Strategy
n < 10       Write whatever, prioritize readability
n < 100      Simple algorithms, watch constant factors
n < 10000    Algorithm starts to matter
n > 100000   Algorithm very important
n > 10^9     Algorithm + parallelization + distributed

4. Consider Hybrid Methods

def smart_sort(arr):
    if len(arr) < 20:
        return insertion_sort(arr)
    else:
        return quicksort(arr)
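A runnable version of the same idea, as a sketch. Unlike `smart_sort`, it applies the cutoff at every level of the recursion, which is closer to how library hybrid sorts behave. The cutoff of 20 is taken from the snippet above; the list-building quicksort is a simple stand-in chosen for clarity, not an efficient in-place implementation.

```python
SMALL_CUTOFF = 20  # from the text; the best value depends on machine and data

def insertion_sort(arr):
    """Return a sorted copy; quadratic, but very cheap for tiny inputs."""
    out = list(arr)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]   # shift larger elements one slot right
            j -= 1
        out[j + 1] = key
    return out

def hybrid_sort(arr):
    """Quicksort that hands small partitions to insertion sort."""
    if len(arr) < SMALL_CUTOFF:
        return insertion_sort(arr)
    pivot = arr[len(arr) // 2]
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return hybrid_sort(less) + equal + hybrid_sort(greater)

data = [(i * 7919) % 1009 for i in range(500)]  # deterministic pseudo-shuffled input
print(hybrid_sort(data)[:10])
```

Tune the cutoff by measurement on your own workload rather than copying a number from a book, this one included.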

5. Benchmark Real Workloads

Don't just test worst case or best case. Test:

  • Typical input sizes
  • Typical input distributions
  • On real hardware

Summary

Big-O Limitations

  • Ignores constant factors
  • Ignores lower-order terms
  • Assumes n → ∞
  • Doesn't consider hardware characteristics

Galactic Algorithms

  • Theoretically optimal
  • Practically unusable
  • Constant factors too large for the universe

Constant Factor Sources

  • Cache miss vs hit (10-100×)
  • Branch misprediction (15-20 cycles)
  • Division vs multiplication (10-20×)
  • System calls (1000+ cycles)

Practical Advice

  1. Know your n
  2. Profile first, then optimize
  3. Consider hybrid methods
  4. Test real workloads
  5. Don't blindly trust Big-O

Remember

O(n²) with cache hits beats O(n) with cache misses
...when n is small enough

O(n log n) quicksort beats O(n) radix sort
...when n is small enough

Theory is the map, measurement is the territory.
When map and territory disagree, trust the territory.

Chapter 12: Cache & Branch Prediction

Part III: Theory


"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

The Story of "One Line Change" That Made It 10× Faster

I inherited an image processing codebase. Processing a 4K image took 800ms—too slow.

I spent two days profiling and found the hotspot in a nested loop. The code looked normal:

// Original version
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        output[y][x] = process(input[y][x]);
    }
}

I changed one thing:

// Modified version
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        output[y][x] = process(input[y][x]);
    }
}

Just swapped the loop order of x and y.

Result: 800ms → 80ms. 10× faster.

Why? Because C's 2D arrays are row-major, and [y][x] access should have x in the inner loop (sequential access). The original version jumped an entire row each access, causing massive cache misses.

This is the power (or curse) of cache.

Cache Basics

Why We Need Cache

CPU speed vs Memory speed (approximate):
- CPU: 1 cycle = 0.3 ns (3 GHz)
- L1:  ~4 cycles = 1.2 ns
- L2:  ~12 cycles = 4 ns
- L3:  ~40 cycles = 12 ns
- DRAM: ~200 cycles = 60 ns

If every access went to DRAM, CPU would spend most time waiting.

Cache Hierarchy

┌─────────────┐
│    CPU      │
│  Registers  │ ← Few bytes, < 1 cycle
└──────┬──────┘
       │
┌──────▼──────┐
│   L1 Cache  │ ← 32-64 KB, ~4 cycles
│  (per core) │
└──────┬──────┘
       │
┌──────▼──────┐
│   L2 Cache  │ ← 256 KB - 1 MB, ~12 cycles
│  (per core) │
└──────┬──────┘
       │
┌──────▼──────┐
│   L3 Cache  │ ← 8-64 MB, ~40 cycles (shared)
│  (shared)   │
└──────┬──────┘
       │
┌──────▼──────┐
│    DRAM     │ ← GB scale, ~200 cycles
└─────────────┘

Cache Line

Cache doesn't operate byte-by-byte, but in cache lines (typically 64 bytes):

When you access address 0x1000:
- CPU doesn't read just 1 byte
- It reads the entire cache line: 0x1000 - 0x103F (64 bytes)
- Subsequent accesses to 0x1001, 0x1002... will hit

This is the basis of spatial locality.
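The address arithmetic behind this is just masking off the low bits. A small sketch, assuming 64-byte lines (the helper names are illustrative):

```python
CACHE_LINE_BYTES = 64  # typical on current x86 and many Arm cores; an assumption

def cache_line_range(address, line_bytes=CACHE_LINE_BYTES):
    """First and last byte address of the cache line containing `address`."""
    base = address & ~(line_bytes - 1)   # clear the low log2(line_bytes) bits
    return base, base + line_bytes - 1

def same_cache_line(a, b, line_bytes=CACHE_LINE_BYTES):
    """True if both addresses fall in the same line."""
    return a // line_bytes == b // line_bytes

first, last = cache_line_range(0x1001)
print(hex(first), hex(last))
print(same_cache_line(0x1000, 0x103F), same_cache_line(0x103F, 0x1040))
```

Two addresses one byte apart can still live in different lines if they straddle a 64-byte boundary, as 0x103F and 0x1040 do.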

Types of Cache Misses

The 3C Model

Type         Name         Cause                                     Solution
Compulsory   Cold start   First access                              Prefetching
Capacity     Capacity     Cache too small                           Larger cache, better locality
Conflict     Conflict     Multiple addresses map to same location   Higher associativity

(Sometimes a fourth C is added: Coherence, in multi-core systems)

Measuring Cache Misses

# Using perf
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program

# Example output:
 1,234,567,890      L1-dcache-loads
    12,345,678      L1-dcache-load-misses  # 1% miss rate
   123,456,789      LLC-loads
    12,345,678      LLC-load-misses        # 10% miss rate

Common Cache Problems

Problem 1: Wrong Access Order

Row-major vs Column-major:

// C/C++ is row-major: arr[row][col]
// Fortran is column-major: arr(row, col)

// Correct order for row-major (C)
for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++)
        sum += arr[row][col];  // Sequential access ✓

// Wrong order in C (column-major style)
for (int col = 0; col < N; col++)
    for (int row = 0; row < N; row++)
        sum += arr[row][col];  // Strided access in C ✗

Problem 2: Stride Access
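A strided traversal touches memory at a fixed step larger than one element. A minimal sketch of why that hurts, assuming 64-byte cache lines, 4-byte elements, and an array that starts on a line boundary (the helper name is illustrative): count how many distinct cache lines must be fetched to read the same number of elements.

```python
def cache_lines_touched(n_elements, stride, element_bytes=4, line_bytes=64):
    """Distinct cache lines touched when reading n_elements at a fixed stride (in elements)."""
    lines = {(i * stride * element_bytes) // line_bytes for i in range(n_elements)}
    return len(lines)

for stride in (1, 2, 4, 8, 16, 32):
    print(f"stride={stride:2d}  lines fetched for 1024 reads: "
          f"{cache_lines_touched(1024, stride)}")
```

At stride 1, 1024 reads need 64 lines; at stride 16 and above, every read lands on its own line, so you pay for 64 bytes of memory traffic to use 4.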

Cache Optimization Techniques

Technique 1: Loop Tiling (Blocking)

Split large loops into small blocks that fit in cache:

// Original: matrix multiplication
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

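// Tiled version
#define BLOCK 64
#define MIN(a, b) ((a) < (b) ? (a) : (b))  // C has no built-in min()

for (int ii = 0; ii < N; ii += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int kk = 0; kk < N; kk += BLOCK)
    for (int i = ii; i < MIN(ii+BLOCK, N); i++)
    for (int j = jj; j < MIN(jj+BLOCK, N); j++)
    for (int k = kk; k < MIN(kk+BLOCK, N); k++)
        C[i][j] += A[i][k] * B[k][j];

Choose BLOCK so that the three BLOCK × BLOCK tiles in use (one each from A, B, and C) fit in L1/L2 together. With BLOCK = 64 and 4-byte floats, each tile is 16 KB and the three total 48 KB.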
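Tiling only reorders the work, so it must produce exactly the same result as the plain triple loop, including when N is not a multiple of the block size. The Python sketch below checks that. It will not show the speedup (interpreter overhead hides cache effects), and the block size of 4 is chosen only to exercise the edge tiles.

```python
def matmul_naive(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, block=4):
    """Same arithmetic, visited one block x block tile at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # min() clips the last, partial tile when n % block != 0
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        for k in range(kk, min(kk + block, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

n = 10  # deliberately not a multiple of the block size
A = [[(i * 7 + j * 3) % 11 for j in range(n)] for i in range(n)]
B = [[(i * 5 + j * 2) % 13 for j in range(n)] for i in range(n)]
print(matmul_tiled(A, B) == matmul_naive(A, B))
```

A check like this is worth keeping next to any hand-tiled kernel, because off-by-one errors at tile edges are the classic bug.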

Technique 2: Data Layout Optimization

Array of Structures (AoS) vs Structure of Arrays (SoA):

// AoS: if you only need x, you load useless y, z too
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};
struct Particle particles[N];

// SoA: only access the fields you need
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
};
struct Particles particles;

// When only accessing x, SoA is more cache-friendly
for (int i = 0; i < N; i++)
    sum += particles.x[i];  // Sequential access

Technique 3: Prefetching

Tell CPU to load data ahead of time:

#include <xmmintrin.h>  // for _mm_prefetch

for (int i = 0; i < N; i++) {
    // Prefetch data we'll need soon
    _mm_prefetch((const char *)&arr[i + 64], _MM_HINT_T0);  // cast: the intrinsic takes a char pointer

    sum += arr[i];
}

Modern CPUs have hardware prefetchers that work well for sequential access. Random access may need software prefetch.

Branch Prediction Basics

Why We Need Branch Prediction

Modern CPUs are pipelined:

Instruction 1: Fetch → Decode → Execute → Memory → Writeback
Instruction 2:        Fetch → Decode → Execute → Memory → Writeback
Instruction 3:               Fetch → Decode → Execute → Memory → Writeback
...

When encountering a branch (if/else):

if (condition) {
    // path A
} else {
    // path B
}

CPU doesn't know whether to fetch A or B's instructions. It must guess (predict), then continue.

If wrong (misprediction) → flush pipeline, restart → waste 15-20 cycles.

Typical Branch Prediction Accuracy

Pattern                           Accuracy
Always-taken loop                 ~99%
Always not-taken                  ~99%
Predictable pattern (TTNTTN...)   ~95%+
Random (50/50)                    ~50%
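You can get a feel for these numbers with a toy predictor. The sketch below simulates the classic 2-bit saturating counter, optionally indexed by the last few branch outcomes. It is a teaching model of my choosing, far simpler than the predictors in real CPUs, and the pattern lengths and random seed are arbitrary.

```python
import random

def prediction_accuracy(outcomes, history_bits=0):
    """Fraction of branches predicted correctly by 2-bit counters indexed by recent history."""
    counters = [1] * (1 << history_bits)  # each counter: 0..3, start at "weakly not-taken"
    mask = (1 << history_bits) - 1
    history = 0
    correct = 0
    for taken in outcomes:
        counter = counters[history]
        if (counter >= 2) == taken:       # counter >= 2 means "predict taken"
            correct += 1
        counters[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)

rng = random.Random(42)
patterns = {
    "always taken": [True] * 30000,
    "TTN repeating": [True, True, False] * 10000,
    "random 50/50": [rng.random() < 0.5 for _ in range(30000)],
}
for name, outcomes in patterns.items():
    print(f"{name:14s}  no history: {prediction_accuracy(outcomes):.3f}   "
          f"2 bits of history: {prediction_accuracy(outcomes, 2):.3f}")
```

With no history, the counter mispredicts every N in TTN and lands near 67%; with two bits of history it learns the pattern after a short warm-up. Random outcomes stay near 50% either way, because there is nothing to learn.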

Measuring Branch Misprediction

perf stat -e branches,branch-misses ./my_program

# Example output:
   500,000,000      branches
     2,500,000      branch-misses  # 0.5% miss rate (good)

Over 1-2% branch miss rate is worth investigating.

Branch Optimization Techniques

Technique 1: Branchless Programming

// With branch
int max1(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless (using conditional move)
int max2(int a, int b) {
    return a > b ? a : b;  // Compiler may use cmov
}

// Manual branchless (assumes a - b does not overflow and >> is an arithmetic shift)
int max3(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;  // All 0s or all 1s
    return a - (diff & mask);
}

Technique 2: Sort to Improve Branch Prediction

// Processing positive numbers in array
// If array is random, branch is hard to predict

// Original (assume 50% positive)
for (int i = 0; i < N; i++)
    if (arr[i] > 0)
        process(arr[i]);

// If sorted first
std::sort(arr, arr + N);
// Now all negatives first, positives last
// Branch becomes: N/2 consecutive not-taken, N/2 consecutive taken
// Much more predictable!

Of course, sorting has its own cost. Only worth it if used multiple times.

Technique 3: Profile-Guided Optimization (PGO)

Let the compiler optimize branch layout based on actual execution profile:

# 1. Compile with instrumentation
gcc -fprofile-generate -o program program.c

# 2. Run typical workload
./program typical_input

# 3. Recompile with profile
gcc -fprofile-use -o program_optimized program.c

PGO can:

  • Put common paths together (better icache)
  • Adjust default branch predictions
  • Inline frequently-called functions

Measurement and Diagnosis

Using perf to Find Cache/Branch Problems

# Comprehensive analysis
perf stat -e cycles,instructions,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
branches,branch-misses \
./my_program

Summary

Cache Core Concepts

  • Cache line (64 bytes) is the minimum unit
  • 3C Model: Compulsory, Capacity, Conflict
  • Spatial locality + temporal locality

Common Cache Problems

  • Wrong access order (row vs column major)
  • Stride access
  • False sharing
  • Cache thrashing

Cache Optimization Techniques

  • Loop tiling/blocking
  • AoS → SoA transformation
  • Prefetching
  • Hot/cold splitting

Branch Prediction

  • Misprediction cost: 15-20 cycles
  • Predictable patterns: ~99% accurate
  • Random patterns: ~50% accurate

Branch Optimization Techniques

  • Branchless programming
  • Sort to improve patterns
  • Profile-Guided Optimization (PGO)

Diagnostic Tools

perf stat -e L1-dcache-load-misses,branch-misses ./prog

Rules of Thumb

  • Cache miss rate > 5%: worth investigating
  • Branch miss rate > 2%: worth investigating
  • Optimization order: algorithm → data structure → cache → branch

Chapter 13: Array vs Linked List

Part IV: Data Structures & Algorithms


"In theory, there is no difference between theory and practice. But in practice, there is." — Jan L. A. van de Snepscheut

The Story of the "Optimal" Data Structure

A senior engineer once joined a team working on a high-frequency trading system. The codebase had a critical hot path that maintained an order book—a sorted collection of pending orders. The previous developer had implemented it using a doubly-linked list, reasoning that O(1) insertion and deletion would be optimal for the frequent updates.

The senior engineer ran the profiler and found something surprising: the linked list implementation was spending 70% of its time on cache misses. The "O(1)" insertions were indeed fast in terms of pointer operations, but each node access triggered a cache miss that cost 100+ cycles.

She rewrote the hot path using a sorted array with binary search and memmove for insertions. Despite the O(n) insertion complexity, the new implementation was 8x faster. The lesson: Big-O notation hides constants, and those constants are dominated by memory access patterns.

Why This Comparison Matters

In algorithm textbooks, we learn:

  • Arrays: O(1) random access, O(n) insertion/deletion
  • Linked Lists: O(n) random access, O(1) insertion/deletion

This theoretical analysis assumes uniform memory access cost—an assumption that hasn't been true since the 1980s. Modern CPUs have multi-level cache hierarchies where L1 cache access takes ~4 cycles, while main memory access takes 100-300 cycles.

Cache Locality: The Decisive Victory

Although linked lists have O(1) theoretical insertion/deletion advantage, arrays (or std::vector) are overwhelmingly superior in practice.

Cache Line Utilization

Arrays occupy contiguous memory. When the CPU reads one element, it loads an entire cache line (typically 64 bytes), bringing multiple adjacent elements into cache for "free":

Array (contiguous memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A0 │ A1 │ A2 │ A3 │ A4 │ A5 │ A6 │ A7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
  ↑ One cache line fetch (64B) gets 16 integers

Linked list nodes are scattered across the heap. Each node access potentially triggers a cache miss:

Linked List (scattered memory):
┌────┬────┐     ┌────┬────┐     ┌────┬────┐
│ D0 │ *──┼────→│ D1 │ *──┼────→│ D2 │ *──┼→...
└────┴────┘     └────┴────┘     └────┴────┘
 0x1000          0x5F20          0x3A80
  ↑ Each access = potential cache miss (100+ cycles)

Hardware Prefetching

Modern CPUs have hardware prefetchers that detect sequential access patterns and proactively load upcoming data. This works beautifully for arrays but is useless for linked lists.

The "pointer chasing" pattern of linked lists has data dependencies: you must load node N to discover the address of node N+1. This serializes memory accesses and prevents the CPU pipeline from hiding latency.

The Crossover Point

Empirical measurements on modern desktop CPUs show that even for middle insertion, arrays beat linked lists when N < several thousand elements. The memmove operation is highly optimized (often vectorized) and benefits from contiguous memory.

Memory Overhead

On a 64-bit system, each doubly-linked list node carries two 8-byte pointers (next, prev) plus alignment padding. For a list of 4-byte integers, that is 16 bytes of pointers and 4 bytes of padding around 4 bytes of data: 24 bytes per node, or 500% overhead, before any allocator bookkeeping.
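The per-node arithmetic can be written down as a small sketch. It assumes the node is padded to a multiple of the pointer size and ignores the allocator's own per-block header, which makes the real figure worse; the function names are illustrative.

```python
def list_node_bytes(payload_bytes, pointer_bytes=8, pointers_per_node=2):
    """Size of one list node, rounded up to pointer alignment."""
    raw = pointers_per_node * pointer_bytes + payload_bytes
    return (raw + pointer_bytes - 1) // pointer_bytes * pointer_bytes

def overhead_percent(payload_bytes, **node_kwargs):
    """Non-payload bytes as a percentage of payload bytes."""
    node = list_node_bytes(payload_bytes, **node_kwargs)
    return 100.0 * (node - payload_bytes) / payload_bytes

for payload in (4, 8, 64, 256):
    print(f"payload={payload:3d} B  node={list_node_bytes(payload):3d} B  "
          f"overhead={overhead_percent(payload):5.1f}%")
```

The overhead shrinks as the payload grows, which is one reason linked lists look less bad for large elements.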

Benchmarking Methodology

When benchmarking data structures, make sure you are measuring the data structure itself and not the allocator or the environment. As with benchmarking in general, the goal is to control confounding factors.

Fair Comparison Requires

Same allocation strategy: Linked list performance heavily depends on node memory layout. Using a pool allocator that places nodes contiguously dramatically improves linked list performance. Note whether custom allocators are used.

Warm-up consideration: First access triggers page faults and cold cache misses. Run multiple iterations and distinguish cold start from warm state.

Background memory pressure: In production, memory isn't pristine. Add background memory activity to simulate realistic cache pollution.

Sample Benchmark Structure

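#include <stdio.h>
#include <time.h>

// Prevent compiler from optimizing away the computation
volatile long sink;

void benchmark_sequential_access(const int* array, int n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += array[i];
    }
    sink = sum;  // Prevent dead code elimination

    clock_gettime(CLOCK_MONOTONIC, &end);
    // Elapsed time in nanoseconds
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%d elements: %.0f ns (%.2f ns/element)\n", n, ns, ns / n);
}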

Measuring Cache Effects Directly

To truly understand the array vs. linked list difference, let's measure cache behavior using hardware performance counters:

# Using perf to measure cache misses during sequential traversal
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    ./array_traverse 1000000

# Typical output for array:
#   12,543,210      cache-references
#       62,891      cache-misses             # 0.50%

# Typical output for linked list:
#   12,892,344      cache-references
#    9,234,567      cache-misses             # 71.6%

The linked list shows roughly 147x more cache misses (9,234,567 vs. 62,891) for the same logical operation. At approximately 100 cycles per miss on modern hardware, this goes a long way toward explaining why the linked list is 30-50x slower despite doing the "same" work.

Measuring with perf to Confirm the Theory

# Cache miss comparison (L1 and last-level cache)
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    ./benchmark_both_structures

# Array traversal:
#   1,024,567      L1-dcache-loads
#      12,345      L1-dcache-load-misses    # 1.2%
#       2,345      LLC-loads
#         234      LLC-load-misses          # 10.0%

# Linked list traversal:
#   1,089,234      L1-dcache-loads
#     987,654      L1-dcache-load-misses    # 90.7%
#     876,543      LLC-loads
#     765,432      LLC-load-misses          # 87.3%

Small Size Optimization (SSO)

An important insight that surprises many developers: when N is small, simple linear search often beats complex data structures.

O(n) vs O(1): When N < 10-20, linear array search is typically faster than hash table lookup. Array operations fit entirely in L1 cache with no hash computation overhead.

Branch prediction: Binary search on sorted arrays causes ~50% branch misprediction rate on random queries. Linear search has trivial prediction patterns.

Here's a demonstration:

// Linear search: trivial, but cache-friendly
int linear_search(int* arr, int n, int target) {
    for (int i = 0; i < n; i++) {
        if (arr[i] == target) return i;
    }
    return -1;
}

// Measured performance for N=16, random queries:
// Linear search: ~15 cycles average
// Binary search: ~45 cycles average (branch misprediction penalty)
// Hash table:    ~60 cycles average (hash computation + indirection)

Many high-performance libraries and standard containers implement SSO-style optimizations: when the element count is below a threshold, they fall back to simple arrays or small in-object buffers internally. The std::string small-string optimization and boost::container::small_vector are canonical examples.

When to Use Each

Use Arrays (std::vector) When

  • Sequential or random access patterns dominate
  • Size is known or changes infrequently
  • Cache performance is critical
  • Elements are small (pointers, integers, small structs)

Use Linked Lists When

  • Frequent insertions/deletions at known positions (with held iterators)
  • Need stable pointers/references after insertion
  • Elements are very large (moving is expensive)
  • Memory fragmentation is acceptable

The Modern Reality

In C++, Bjarne Stroustrup and others have empirically shown that std::vector outperforms std::list for almost all use cases. The exceptions are rare and specific.

Practical Measurements

Typical results on modern hardware (Intel Core i7, DDR4):

Operation             Array (N=1000)   Linked List (N=1000)   Ratio
Sequential traverse   0.5 μs           15 μs                  30x
Random access         0.02 μs          8 μs                   400x
Middle insertion      0.3 μs           0.05 μs*               0.17x
Memory per element    4 bytes          24 bytes               6x

*Linked list insertion assumes you already have the iterator—finding the position is O(n).

These results come from one specific CPU, compiler, and standard library implementation. On a different machine or implementation the absolute timings will change, but the qualitative trend is robust: contiguous arrays dominate linked lists for traversal and random access on modern hardware.

Summary

  • Cache locality often matters more than algorithmic complexity. With L1 at ~4 cycles and main memory at 100-300 cycles, the gap is roughly 25-75x.
  • Measure, don't assume. Your intuition from algorithm class may be wrong for modern hardware.
  • Consider your actual access patterns. If you traverse more than you insert, prefer arrays.
  • When in doubt, use std::vector. It's the default choice for a reason.

Chapter 14: Hash Table vs Tree

Part IV: Data Structures & Algorithms


"O(1) is just O(n) that got lucky." — Anonymous

The O(1) That Wasn't

A startup was building a real-time bidding system that needed to look up advertiser data by ID. The initial implementation used a hash table—the obvious choice for O(1) lookups. Everything worked perfectly in testing.

In production, disaster struck. Under high load, the system would occasionally freeze for hundreds of milliseconds. Debugging revealed that when the hash table needed to resize, it triggered a massive rehashing operation that blocked the main thread. Worse, during a targeted attack, malicious requests with crafted IDs caused hash collisions, degrading the "O(1)" lookups to O(n).

The team switched to a Red-Black tree. The lookups were now O(log n)—technically "slower"—but performance became predictable. The worst-case latency dropped from 500ms to 2ms. The lesson: average-case complexity isn't the whole story.

Theoretical Complexity Review

This is the classic O(1) vs O(log n) battle, but the "constant" in O(1) can be surprisingly expensive.

Operation           Hash Table (avg)   Hash Table (worst)   Balanced Tree
Search              O(1)               O(n)                 O(log n)
Insert              O(1)               O(n)                 O(log n)
Delete              O(1)               O(n)                 O(log n)
Range query         O(n)               O(n)                 O(log n + k)
Ordered iteration   O(n log n)         O(n log n)           O(n)

Hash Table Deep Dive

The Hidden Costs of O(1)

The O(1) of hash tables includes:

  1. Computing the hash code
  2. Handling collisions
  3. (Occasionally) Resizing the entire table

Hash function complexity: For simple integer keys, hashing is trivial. For long strings, computing a good hash can require examining every character—potentially hundreds of operations before the "O(1)" lookup even begins.

Collision Handling

Chaining (separate chaining): Each bucket contains a linked list of colliding elements. When collisions are frequent, the hash table degrades into multiple linked lists—losing both O(1) complexity and cache locality.

Open addressing (linear probing, quadratic probing): Collisions are resolved by probing subsequent slots. Better cache locality than chaining, but performance degrades rapidly as load factor approaches 1.0.

Load Factor and Rehashing

Hash tables maintain a load factor (elements / buckets). When this exceeds a threshold (0.75 by default in Java's HashMap, 1.0 in std::unordered_map), the table must:

  1. Allocate a new, larger backing array
  2. Recompute hashes for ALL existing elements
  3. Insert everything into the new array

This operation is O(n) and causes latency spikes. In latency-sensitive applications, these spikes can be catastrophic.
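A small simulation makes the spike visible. It only counts how many elements each insert has to move, under assumptions of my choosing: 16 initial buckets, doubling on resize, and a maximum load factor of 0.75. It does not model any particular library.

```python
def simulate_inserts(n, initial_buckets=16, max_load=0.75):
    """Count elements re-inserted by resizes while adding n keys to a growing table."""
    buckets = initial_buckets
    size = 0
    total_moved = 0
    worst_single_insert = 0
    resizes = 0
    for _ in range(n):
        moved = 0
        if size + 1 > buckets * max_load:   # this insert would exceed the load factor
            buckets *= 2
            moved = size                    # every existing element is rehashed
            resizes += 1
        size += 1
        total_moved += moved
        worst_single_insert = max(worst_single_insert, moved)
    return {"buckets": buckets, "resizes": resizes,
            "total_moved": total_moved, "worst_single_insert": worst_single_insert}

stats = simulate_inserts(100_000)
print(stats)
print("average elements moved per insert:", stats["total_moved"] / 100_000)
```

Averaged over all inserts the cost is under two moves each, which is why insertion is called amortized O(1). But one unlucky insert moves 98,304 elements by itself, and that is the request your P99 will remember. If you know the final size in advance, reserving capacity up front avoids the spikes.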

Hash Collision Attacks

Balanced trees (like Red-Black trees) provide stable O(log n) performance regardless of data distribution. Hash tables are vulnerable to algorithmic complexity attacks: an attacker who knows the hash function can craft inputs that all collide, turning O(1) into O(n).

This vulnerability led to security incidents in web frameworks (PHP, Python, Ruby) that used predictable hash functions for request parameters. Many modern languages and libraries mitigate this by using randomized hash functions or stronger default hashers, which makes these attacks harder in practice. The underlying worst-case behavior of hash tables, however, has not changed.

Tree Structures

Balanced Trees (Red-Black, AVL)

Self-balancing binary search trees guarantee O(log n) operations by maintaining height invariants.

Advantages:

  • Predictable worst-case performance
  • Natural ordering (in-order traversal gives sorted sequence)
  • Efficient range queries

Cache behavior: Traditional BSTs have poor cache locality—each node comparison may trigger a cache miss. For N = 1 million elements, log₂(N) ≈ 20 comparisons, potentially 20 cache misses.

B-Trees: Cache-Friendly Trees

B-Trees store multiple keys per node, designed for systems where node access is expensive (originally disk, but now relevant for cache):

B-Tree node (order 4):
┌─────┬─────┬─────┐
│ K1  │ K2  │ K3  │  ← Multiple keys in one cache line
└──┬──┴──┬──┴──┬──┘
   ↓     ↓     ↓
 children...

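A B-Tree node sized to one 64-byte cache line can hold up to about 15 integer keys (4 bytes each; fewer if child pointers share the line), for a fanout of about 16. For 1 million elements, height ≈ log₁₆(10⁶) ≈ 5, meaning about 5 cache misses in the worst case (vs. 20 for a binary tree).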
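The height comparison is a one-loop calculation. The sketch below uses the idealized model of a perfectly balanced tree where each level multiplies capacity by the fanout; real trees are somewhat taller because nodes are not always full.

```python
def tree_height(n_keys, fanout):
    """Smallest h with fanout**h >= n_keys: roughly the node visits per lookup."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height

n = 1_000_000
for fanout in (2, 4, 16, 64):
    print(f"fanout={fanout:2d}  height={tree_height(n, fanout)}")
```

Each level is a potential cache miss, so widening the node from 2-way to 16-way cuts the worst-case misses per lookup from 20 to 5.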

Benchmarking Comparison

Key Type Matters

Integer keys: Hash tables shine. Computing a hash is trivial (often just a modulo or multiplication), and comparison is a single instruction.

String keys: The gap narrows. Hash computation requires examining the entire string. Trees only compare until a difference is found (often early).

Data Distribution Effects

Hash table performance depends heavily on key distribution:

| Distribution | Hash Table | Tree |
|---|---|---|
| Random | Excellent | Good |
| Sequential | Good | Good |
| Adversarial | Catastrophic | Good |
| Clustered | Degraded | Good |

Typical Results (N = 100,000, random keys)

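| Operation | std::unordered_map | std::map | Ratio (unordered_map / map) |
|---|---|---|---|
| Lookup (int key) | 25 ns | 180 ns | 0.14x |
| Lookup (string key) | 80 ns | 150 ns | 0.53x |
| Insert | 45 ns | 200 ns | 0.23x |
| Range query (1%) | 2500 μs | 50 μs | 50x |
| Iteration (sorted) | 3000 μs | 1500 μs | 2x |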

These measurements are representative of a particular hardware and standard-library implementation. Treat them as directional guidance rather than universal constants: the exact numbers will vary, but the trade-offs they illustrate are robust.

Measuring Hash vs Tree Performance

Here's how to benchmark the key operations:

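#include <algorithm>      // std::shuffle
#include <chrono>
#include <iostream>
#include <map>
#include <numeric>        // std::iota
#include <random>
#include <string>
#include <unordered_map>
#include <vector>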

template<typename Map>
void benchmark_lookup(Map& m, const std::vector<int>& keys,
                      const std::string& name) {
    volatile int sink = 0;  // Prevent optimization

    auto start = std::chrono::high_resolution_clock::now();
    for (int key : keys) {
        auto it = m.find(key);
        if (it != m.end()) sink = it->second;
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        end - start).count();
    std::cout << name << ": " << ns / keys.size() << " ns/lookup\n";
}

int main() {
    constexpr int N = 100000;
    std::vector<int> keys(N);
    std::iota(keys.begin(), keys.end(), 0);
    std::shuffle(keys.begin(), keys.end(), std::mt19937{42});

    std::unordered_map<int, int> hash_map;
    std::map<int, int> tree_map;

    for (int k : keys) {
        hash_map[k] = k;
        tree_map[k] = k;
    }

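    // Warm-up pass: touch both containers once so the measured pass
    // does not pay for cold caches (ignore this output)
    benchmark_lookup(hash_map, keys, "unordered_map (warm-up)");
    benchmark_lookup(tree_map, keys, "map (warm-up)");

    // Measured pass
    benchmark_lookup(hash_map, keys, "unordered_map");
    benchmark_lookup(tree_map, keys, "map");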
}

Measuring Latency Distribution

Average latency hides the story. For latency-sensitive applications, measure the distribution:

# perf shows where cycles are spent, not per-operation latency
perf record -e cycles:u ./hash_vs_tree_benchmark
perf report

# For the latency distribution itself, record timings in code:
# track min, max, P50, P99, P99.9 per operation
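
An in-code histogram can be as simple as timing each operation, sorting the samples, and reading off percentiles. The following is a minimal sketch under stated assumptions: `percentile` and `measure_latencies` are hypothetical helper names, the percentile uses the nearest-rank definition, and `steady_clock` is assumed to have fine enough resolution. For operations as short as a single lookup, the timer overhead is comparable to the work itself, so in practice each sample should cover a small batch of operations.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile over samples sorted in ascending order.
// p is in (0, 100]; sorted must be non-empty.
inline std::int64_t percentile(const std::vector<std::int64_t>& sorted, double p) {
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(sorted.size()) / 100.0));
    if (rank == 0) rank = 1;
    if (rank > sorted.size()) rank = sorted.size();
    return sorted[rank - 1];
}

// Times each call to op(key) individually and returns the latencies
// in nanoseconds, sorted ascending so percentile() can be applied.
template <typename Op>
std::vector<std::int64_t> measure_latencies(const std::vector<int>& keys, Op op) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> ns;
    ns.reserve(keys.size());
    for (int key : keys) {
        const auto t0 = clock::now();
        op(key);
        const auto t1 = clock::now();
        ns.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    std::sort(ns.begin(), ns.end());
    return ns;
}
```

A typical call passes a lambda that performs one `find` on the container under test, then reports `percentile(lat, 50)`, `percentile(lat, 99)`, and `lat.back()` side by side. A rehash or a chain of cache misses shows up in the tail long before it moves the mean.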

When to Use Each

Use Hash Tables When

  • Average-case performance matters more than worst-case
  • Keys have good hash distribution (or you control the hash function)
  • Order is irrelevant
  • Range queries are not needed
  • Latency spikes from rehashing are acceptable

Use Trees When

  • Ordered traversal is needed (logs, time-series data)
  • Latency stability is critical (avoid rehashing spikes)
  • Range queries are common (find all items between X and Y)
  • Data comes from untrusted sources (security)
  • Memory allocation must be predictable

The std::map vs std::unordered_map Decision

In C++, this is a common choice:

  • Default to std::unordered_map for simple key-value lookups
  • Use std::map when you need ordering or predictable performance
  • Consider sorted std::vector for read-heavy workloads (binary search with excellent cache locality). For small to medium, mostly-read tables, a sorted vector often outperforms both std::map and std::unordered_map thanks to tiny constant factors and contiguous storage.
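
The sorted-vector option takes only a few lines. This is a minimal sketch, assuming `int` keys and values and a build-once, read-many usage pattern; `FlatMap` is a hypothetical name, not a standard container.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical read-mostly map: (key, value) pairs stored contiguously,
// kept sorted by key so lookups can use binary search.
class FlatMap {
    std::vector<std::pair<int, int>> items_;

public:
    // Build once from unsorted input. If a key appears more than once,
    // the first occurrence in the input wins.
    explicit FlatMap(std::vector<std::pair<int, int>> items)
        : items_(std::move(items)) {
        std::stable_sort(items_.begin(), items_.end(),
                         [](const std::pair<int, int>& a,
                            const std::pair<int, int>& b) {
                             return a.first < b.first;
                         });
        items_.erase(std::unique(items_.begin(), items_.end(),
                                 [](const std::pair<int, int>& a,
                                    const std::pair<int, int>& b) {
                                     return a.first == b.first;
                                 }),
                     items_.end());
    }

    // O(log n) binary search; returns nullptr if the key is absent.
    const int* find(int key) const {
        auto it = std::lower_bound(
            items_.begin(), items_.end(), key,
            [](const std::pair<int, int>& item, int k) { return item.first < k; });
        if (it == items_.end() || it->first != key) return nullptr;
        return &it->second;
    }

    std::size_t size() const { return items_.size(); }
};
```

The trade-off is on the write side: inserting into the middle of a sorted vector is O(n), so this layout fits tables that are built in bulk and then queried.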

Real-World Implementation Notes

Why std::unordered_map Can Be Slow

The C++ standard's std::unordered_map is often criticized for performance. Reasons include:

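  • Effectively required to use chaining (linked lists for buckets), because of the bucket-iteration interface
  • Pointer and reference stability requirements limit optimization (elements may not move, even on rehash)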
  • Node-based allocation (each element is separately allocated)

Alternatives like absl::flat_hash_map (Google) or robin_hood::unordered_map use open addressing with better cache behavior.

Summary

  • Hash tables trade predictability for average-case speed. The O(1) comes with hidden costs and worst-case risks.
  • Trees provide guarantees. O(log n) may be "slower" but is always O(log n).
  • Consider your requirements: Need ordering? Use trees. Need range queries? Use trees. Need raw speed with controlled inputs? Use hash tables.
  • Measure with realistic data. Adversarial or clustered data can devastate hash table performance.

Chapter 15: Sorting Algorithms

Part IV: Data Structures & Algorithms


"Premature optimization is the root of all evil, but late optimization is the root of all frustration." — Anonymous

The Hybrid Sort Revolution

Why doesn't std::sort just use Quicksort—the algorithm with the best average-case performance? Why does it fall back to Heapsort sometimes? And why does it switch to Insertion Sort for small arrays?

A developer was benchmarking sorting algorithms for a data processing pipeline. She implemented a "pure" Quicksort, confident it would match or beat the standard library. On random data, it was competitive. On nearly-sorted data, it was actually faster. Then she tested on adversarial input—data specifically crafted to trigger O(n²) behavior—and watched her algorithm crawl.

The standard library's std::sort finished in the same time as always. She discovered that modern library sorts are hybrid algorithms: they combine multiple sorting strategies, switching between them based on data characteristics. There is no single "best" sorting algorithm—only best combinations.

From Theory to Hardware Reality

On modern processors, sorting performance depends not just on comparison count but also on branch prediction success rate, cache hit rate, and memory access patterns.

Algorithm Complexity Overview

| Algorithm | Best | Average | Worst | Space | Stable | Hardware Friendliness |
|---|---|---|---|---|---|---|
| Quicksort | O(n log n) | O(n log n) | O(n²) | O(log n) | No | Excellent locality |
| Mergesort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | Extra memory copy |
| Heapsort | O(n log n) | O(n log n) | O(n log n) | O(1) | No | Cache-unfriendly |
| Insertion | O(n) | O(n²) | O(n²) | O(1) | Yes | Excellent for small N |
| Timsort | O(n) | O(n log n) | O(n log n) | O(n) | Yes | Optimized for real data |

Classic Algorithms: Theory vs Practice

Quicksort:

  • Practical advantage: Excellent locality. The partition operation scans contiguous memory, which is extremely cache-friendly.
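  • Optimization: Modern implementations use median-of-three pivot selection, which makes the O(n²) worst case unlikely on common inputs such as sorted or reverse-sorted data. It does not eliminate it: adversarial inputs can still defeat a fixed pivot rule.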

Mergesort:

  • Characteristic: Stable, guaranteed O(n log n).
  • Disadvantage: Requires O(n) extra space; merge phase involves heavy data copying, pressuring memory bandwidth.

Heapsort:

  • Theory: O(n log n) in-place.
  • Practical bottleneck: Cache-unfriendly. Heap access constantly jumps between indices i and 2i+1; when N is large, nearly every comparison triggers a cache miss.

Small Array Optimization

When N shrinks, algorithm overhead (constant factors) dominates performance.

Why Insertion Sort Wins for Small Arrays

Insertion Sort has extremely simple code with minimal instruction overhead. When n < 20 or so, its O(n²) work costs less than Quicksort's recursive call overhead.

Cutoff threshold: Typically 16 to 32 elements. Hybrid implementations stop Quicksort recursion once a partition falls below the cutoff; some (libstdc++, for example) then finish with a single Insertion Sort pass over the whole, nearly-sorted array.

The crossover exists because below the cutoff there are no recursive calls and no partition logic, just simple element shifts over data that already sits in L1 cache. Above it, Quicksort's O(n log n) growth wins by a widening margin.

Branch Prediction Friendly Algorithms

Sorting is fundamentally about comparisons. With random data, the branch predictor has ~50% failure rate.

Optimization: Some implementations use Conditional Move (CMOV) instructions to replace if branches, avoiding misprediction penalties.
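
One way to see the idea in source form is a compare-exchange step written without an `if`. The sketch below is illustrative only: `compare_exchange` and `sort4` are hypothetical names, and whether the compiler actually emits CMOV (or min/max instructions) for `std::min`/`std::max` depends on the compiler, flags, and target. The five-step sequence is the standard optimal sorting network for four inputs.

```cpp
#include <algorithm>
#include <array>

// Compare-exchange without a data-dependent branch in the source:
// both results are computed unconditionally. Compilers commonly lower
// std::min/std::max on ints to CMOV or min/max instructions, but that
// is an optimization, not a guarantee.
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Sorting network for 4 elements: a fixed sequence of 5 compare-exchange
// steps, identical for every input, so there is nothing to mispredict.
inline void sort4(std::array<int, 4>& v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);
}
```

Because the sequence of operations never depends on the data, the same structure maps naturally onto SIMD lanes; checking the generated assembly is the only way to confirm that no branches remain.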

Cache-Aware Sorting

Cache-oblivious Algorithms: Designed to work efficiently without knowing exact cache size. These typically use recursive divide-and-conquer, ensuring that at some recursion level, the working set fits in cache.

Block-based strategies: Process data in cache-sized blocks to maximize temporal locality.

Input Patterns Matter

Sorting performance varies dramatically with input characteristics:

| Input Pattern | Quicksort | Mergesort | Timsort |
|---|---|---|---|
| Random | Excellent | Good | Good |
| Nearly sorted | Good | Good | Excellent (O(n)) |
| Reverse sorted | Degraded* | Good | Good |
| Many duplicates | Degraded* | Good | Good |

*Without proper pivot selection or 3-way partitioning.

Parallel Sorting

Parallel MergeSort: Classic divide-and-conquer. Dispatch subtasks to different threads early, merge at the end.

Sorting Networks (e.g., Bitonic Sort): Branch-free comparisons, ideal for GPU SIMD execution.

GPU Sorting: Leverages GPU's massive memory bandwidth. Usually uses Radix Sort, which can be transformed into parallel Prefix Sum (Scan) operations.

For small or medium-sized arrays, or on systems with high synchronization overhead, the thread-management and merge costs of parallel sorting can outweigh any speedup. Parallel sort is a tool for genuinely large problems with enough work to amortize its overhead, not a free performance knob to turn on by default.

Stability Considerations

When needed? When sorting by multiple keys (e.g., first by "date", then by "amount").

Cost: To maintain relative order of equal elements, algorithms typically cannot swap non-adjacent elements freely. This excludes efficient algorithms like Quicksort and Heapsort in their standard in-place forms.

Compromise: Stable sorts usually need O(n) or O(log n) auxiliary space, or use complex hybrid algorithms like TimSort.
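
The multi-key case can be sketched with two stable passes: sort by the secondary key first, then by the primary key, and stability preserves the secondary order inside each group. The `Transaction` type and the function name are hypothetical, chosen to match the date/amount example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record type for illustration.
struct Transaction {
    std::string date;  // ISO 8601 (YYYY-MM-DD), so string order is date order
    int amount;
};

// Orders by date, and by amount within each date, using two stable passes:
// first the secondary key (amount), then the primary key (date).
inline void sort_by_date_then_amount(std::vector<Transaction>& txs) {
    std::stable_sort(txs.begin(), txs.end(),
                     [](const Transaction& a, const Transaction& b) {
                         return a.amount < b.amount;
                     });
    std::stable_sort(txs.begin(), txs.end(),
                     [](const Transaction& a, const Transaction& b) {
                         return a.date < b.date;
                     });
}
```

A single `std::sort` with a comparator over both keys gives the same final order here. The two-pass form matters when the passes happen at different times, for example when a user re-sorts an already sorted table by another column.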

Real-World Implementations

std::sort (C++): Introsort

Introsort = Quicksort + Heapsort + Insertion Sort:

  1. Start with Quicksort
  2. If recursion depth exceeds 2·log₂(n), switch to Heapsort (guarantees O(n log n))
  3. For small partitions, use Insertion Sort
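
The three steps can be sketched as follows. This is a simplified illustration for `int` data, not the code of any particular standard library: the `sketch` namespace, the cutoff of 16, the median-of-three pivot, and the three-way partition built from two `std::partition` calls are all assumptions made to keep the example short and obviously correct.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

constexpr std::ptrdiff_t kSmall = 16;  // assumed insertion-sort cutoff

inline int median3(int a, int b, int c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

inline void insertion_sort(int* first, int* last) {
    if (last - first < 2) return;
    for (int* i = first + 1; i != last; ++i) {
        const int v = *i;
        int* j = i;
        while (j > first && *(j - 1) > v) { *j = *(j - 1); --j; }
        *j = v;
    }
}

// Quicksort phase: leaves partitions of at most kSmall elements unsorted.
inline void introsort_loop(int* first, int* last, int depth_limit) {
    while (last - first > kSmall) {
        if (depth_limit == 0) {
            // Step 2: too many unbalanced partitions, heapsort this range.
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        --depth_limit;
        const int pivot =
            median3(*first, *(first + (last - first) / 2), *(last - 1));
        // Three-way partition: [< pivot][== pivot][> pivot].
        int* mid1 = std::partition(first, last,
                                   [pivot](int x) { return x < pivot; });
        int* mid2 = std::partition(mid1, last,
                                   [pivot](int x) { return !(pivot < x); });
        introsort_loop(first, mid1, depth_limit);  // recurse on the left part
        first = mid2;                              // iterate on the right part
    }
}

inline void introsort(std::vector<int>& v) {
    if (v.size() < 2) return;
    const int depth_limit =
        2 * static_cast<int>(std::log2(static_cast<double>(v.size())));
    introsort_loop(v.data(), v.data() + v.size(), depth_limit);
    // Step 3: one insertion-sort pass finishes the small partitions.
    insertion_sort(v.data(), v.data() + v.size());
}

}  // namespace sketch
```

Because the pivot is always an element of the range, the middle block is never empty and every iteration shrinks the problem. The final pass is cheap because each element is at most one small partition away from its final position.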

Java's Arrays.sort()

  • Primitives: Dual-Pivot Quicksort
  • Objects: Timsort (for stability)

Python's sorted(): Timsort

Core idea: Real-world data is often partially sorted. Timsort finds existing runs (ascending, or strictly descending and then reversed) and merges them. For nearly-sorted data, it approaches O(n).
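
Counting the natural runs in an input gives a quick feel for how much work a run-merging sort has left to do. The function below is a simplified sketch of the run-detection scan only (`count_natural_runs` is a hypothetical name). Real Timsort implementations also extend short runs to a minimum length with insertion sort, which is omitted here.

```cpp
#include <cstddef>
#include <vector>

// Counts maximal runs in a single left-to-right scan. A run is either
// non-decreasing or strictly decreasing; strictness on the descending
// side is what lets such a run be reversed without breaking stability.
inline std::size_t count_natural_runs(const std::vector<int>& v) {
    std::size_t runs = 0;
    std::size_t i = 0;
    while (i < v.size()) {
        std::size_t j = i + 1;
        if (j < v.size()) {
            if (v[j] < v[i]) {
                while (j < v.size() && v[j] < v[j - 1]) ++j;   // strictly descending
            } else {
                while (j < v.size() && v[j] >= v[j - 1]) ++j;  // non-decreasing
            }
        }
        ++runs;
        i = j;
    }
    return runs;
}
```

Fully sorted and fully reversed inputs both contain a single run, which is why both are handled in linear time. Random data contains on the order of n/2 short runs.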

Radix Sort and Linear Time

When applicable, non-comparison sorts break the O(n log n) barrier:

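Radix Sort / Counting Sort: Counting Sort runs in O(n + r) for keys drawn from a range of r values, so it pays off when n is much larger than r. Radix Sort applies one counting pass per digit, giving O(n·k) where k is the digit count. Dominant in specific domains such as image processing, fixed-width integer IDs, and log/event UIDs.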

Requirements:

  • Keys must be decomposable into digits/characters
  • Value range should be bounded
  • Often requires O(n) extra space

Outside of these constrained settings, general-purpose comparison-based sorts are usually a better default: they handle arbitrary key types, integrate well with existing libraries, and have predictable performance on a wide range of workloads.

Benchmarking Sort Algorithms

Methodology

  1. Test multiple input patterns: Random, sorted, reverse, duplicates
  2. Warm up: First iteration has cold cache effects
  3. Multiple runs: Report median, not mean (avoids outlier skew)
  4. Realistic sizes: Test at N = 100, 1K, 10K, 100K, 1M

A Complete Benchmarking Example

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>    // std::iota
#include <random>
#include <vector>

enum class Pattern { RANDOM, SORTED, REVERSE, NEARLY_SORTED, DUPLICATES };

std::vector<int> generate_data(int n, Pattern pattern) {
    std::vector<int> data(n);
    std::mt19937 gen(42);

    switch (pattern) {
        case Pattern::RANDOM:
            std::iota(data.begin(), data.end(), 0);
            std::shuffle(data.begin(), data.end(), gen);
            break;
        case Pattern::SORTED:
            std::iota(data.begin(), data.end(), 0);
            break;
        case Pattern::REVERSE:
            std::iota(data.rbegin(), data.rend(), 0);
            break;
        case Pattern::NEARLY_SORTED:
            std::iota(data.begin(), data.end(), 0);
            // n/20 random swaps (each swap can displace two elements)
            for (int i = 0; i < n / 20; i++) {
                std::swap(data[gen() % n], data[gen() % n]);
            }
            break;
        case Pattern::DUPLICATES:
            for (int& x : data) x = gen() % 100;  // Only 100 unique values
            break;
    }
    return data;
}

double benchmark_sort(std::vector<int> data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::sort(data.begin(), data.end());
    auto end = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::micro>(end - start).count();
}
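
To follow the methodology above (warm-up, multiple runs, median), the single-shot timing can be wrapped in a small harness. This is a minimal sketch: `median_sort_time_us` is a hypothetical name, `runs` is assumed to be at least 1, and `steady_clock` is used because it is monotonic.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Sorts a fresh copy of `input` runs + 1 times and returns the median
// time in microseconds. The first iteration is an untimed-in-effect
// warm-up: it runs, but its result is discarded.
inline double median_sort_time_us(const std::vector<int>& input, int runs) {
    volatile int sink = 0;  // keeps the sorted data observable
    std::vector<double> times;
    times.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r <= runs; ++r) {
        std::vector<int> data = input;  // copy is outside the timed region
        const auto start = std::chrono::steady_clock::now();
        std::sort(data.begin(), data.end());
        const auto end = std::chrono::steady_clock::now();
        if (!data.empty()) sink = data[data.size() / 2];
        if (r > 0) {  // r == 0 is the warm-up iteration
            times.push_back(
                std::chrono::duration<double, std::micro>(end - start).count());
        }
    }
    (void)sink;
    std::sort(times.begin(), times.end());
    return times[times.size() / 2];  // upper median for even run counts
}
```

Copying the input inside the loop matters: sorting the same vector repeatedly would time a sort of already-sorted data after the first iteration.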

Measure Branch Mispredictions

# Compare branch behavior across input patterns
perf stat -e branches,branch-misses ./sort_benchmark random
perf stat -e branches,branch-misses ./sort_benchmark sorted

# Illustrative results (exact rates depend on CPU, compiler, and library):
# Random:  45% branch misprediction rate (comparisons are 50/50)
# Sorted:   3% branch misprediction rate (predictable patterns)

Measuring Cache Effects

# Compare cache behavior for different algorithms
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    ./quicksort_benchmark
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    ./heapsort_benchmark

# Heapsort typically shows 3-5x higher cache miss rate

This reveals why some algorithms suffer on random data despite good complexity.

Summary Comparison Table

| Algorithm | Average | Best Case | Space | Stable | Characteristics |
|---|---|---|---|---|---|
| std::sort | O(n log n) | O(n log n) | O(log n) | No | Cache-friendly, Introsort guarantees O(n log n) |
| Timsort | O(n log n) | O(n) | O(n) | Yes | Excellent for real data, complex implementation |
| Quicksort | O(n log n) | O(n log n) | O(log n) | No | Small partition constant, branch-heavy |
| Radix Sort | O(n·k) | O(n) | O(n+k) | Yes | Non-comparison, GPU-friendly |

Summary

  • No single "best" sorting algorithm. The winner depends on data characteristics and hardware.
  • Hybrid approaches dominate in practice. Modern library sorts combine multiple strategies.
  • Input characteristics matter as much as algorithm choice. Nearly-sorted data is very different from random.
  • Treat algorithm choice as a hypothesis and benchmark it on your real workload. Branch mispredictions and cache behavior often dominate theoretical complexity.

Chapter 16: SIMD & Vectorization

Part V: Parallelism & Low-Level Optimization


"The free lunch is over." — Herb Sutter

The 8x Speedup That Cost Nothing

A game studio was optimizing their physics engine. The collision detection code was clean, well-structured, and algorithmically optimal—yet it consumed 40% of frame time. A senior engineer spent an afternoon adding SIMD intrinsics to the inner loop. The result: 8x speedup, bringing collision detection down to 5% of frame time.

No algorithmic changes. No architectural redesign. Just telling the CPU to do 8 operations at once instead of 1.

This is the promise of SIMD: massive speedups for data-parallel workloads, often with minimal code changes. But it's also a minefield of alignment requirements, instruction set variations, and subtle correctness issues.

What is SIMD?

Single Instruction, Multiple Data: process multiple data elements with one instruction.

Scalar Operation:
  A[0] + B[0] = C[0]
  A[1] + B[1] = C[1]
  A[2] + B[2] = C[2]
  A[3] + B[3] = C[3]
  → 4 instructions

SIMD Operation (256-bit AVX):
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │A[0]│A[1]│A[2]│A[3]│A[4]│A[5]│A[6]│A[7]│  (8 floats)
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    +
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │B[0]│B[1]│B[2]│B[3]│B[4]│B[5]│B[6]│B[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
                    ↓
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │C[0]│C[1]│C[2]│C[3]│C[4]│C[5]│C[6]│C[7]│
  └────┴────┴────┴────┴────┴────┴────┴────┘
  → 1 instruction

Theoretical vs Practical Speedup

With 256-bit registers processing 8 floats, you might expect 8x speedup. Reality is more nuanced:

| Factor | Impact |
|---|---|
| Memory bandwidth | Often the bottleneck, not compute |
| Alignment overhead | Unaligned loads are slower |
| Remainder handling | N not divisible by vector width |
| Register pressure | Limited SIMD registers |
| Instruction latency | Some SIMD ops have higher latency |

Typical real-world speedup: 2-6x for memory-bound workloads, 4-8x for compute-bound.

SIMD Instruction Sets

x86/x64 Evolution

| Generation | Width | Year | Key Features |
|---|---|---|---|
| SSE | 128-bit | 1999 | 4x float |
| SSE2 | 128-bit | 2000 | 2x double, integer ops |
| AVX | 256-bit | 2011 | 8x float, 3-operand |
| AVX2 | 256-bit | 2013 | 256-bit integer |
| AVX-512 | 512-bit | 2017 | Masking, scatter/gather |

ARM

  • NEON: 128-bit, ubiquitous on ARM (phones, tablets, M-series Macs)
  • SVE/SVE2: Scalable Vector Extension (128-2048 bit), write-once-run-anywhere

RISC-V

  • RVV (Vector Extension): Variable-length VLEN (128-65536 bits), vector-length agnostic (VLA) programming model. One codebase can run on different hardware widths without recompilation.

The Portability Challenge

Code written for AVX won't run on ARM. Code for AVX-512 won't run on older x86. Solutions:

  1. Runtime dispatch: Detect CPU features, call appropriate implementation
  2. Portable libraries: Highway, XSIMD, std::experimental::simd (Parallelism TS v2; std::simd is targeted for C++26)
  3. Compiler auto-vectorization: Let the compiler handle it
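
Runtime dispatch (option 1) can be sketched with a function pointer chosen at startup. The example assumes GCC or Clang on x86-64, where `__builtin_cpu_supports` and the `target` function attribute are available; on other platforms it falls back to the baseline. All function names here are hypothetical, and the AVX2 variant relies on the auto-vectorizer rather than intrinsics.

```cpp
#include <cstddef>

// Baseline version, compiled for the default target.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

#if defined(__x86_64__) && defined(__GNUC__)
// Same loop, but this one function is compiled with AVX2 enabled,
// so the auto-vectorizer may use 256-bit instructions here.
__attribute__((target("avx2")))
void add_avx2(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
#endif

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Picks the widest implementation the running CPU supports.
AddFn select_add() {
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_avx2;
#endif
    return add_scalar;
}
```

In real code the selection would run once and the pointer would be cached (for example in a function-local `static`), so the feature check stays off the hot path.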

Auto-Vectorization

Modern compilers can automatically vectorize simple loops. This is the easiest path to SIMD benefits.

Helping the Compiler

// ✓ Vectorizable loop shape: unit stride, no loop-carried dependencies
void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✗ The same loop from the compiler's point of view: potential aliasing
void add_arrays_bad(float* a, float* b, float* c, int n) {
    // Compiler doesn't know if a, b, c overlap, so it must add a
    // runtime overlap check or fall back to scalar code
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✓ Fixed with restrict (C) or __restrict (compiler extension in C++)
void add_arrays_good(float* __restrict a,
                     float* __restrict b,
                     float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Compiler Flags

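# GCC/Clang: Enable vectorization
# (-ffast-math relaxes IEEE floating-point rules: it lets reductions
#  vectorize but can change numerical results, so enable it deliberately)
-O3 -march=native -ffast-math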

# Check what got vectorized
-fopt-info-vec-optimized      # GCC: show successful vectorization
-fopt-info-vec-missed         # GCC: show failed attempts
-Rpass=loop-vectorize         # Clang: show successful
-Rpass-missed=loop-vectorize  # Clang: show failures

When Auto-Vectorization Fails

Common blockers:

  • Pointer aliasing: Use restrict/__restrict
  • Loop-carried dependencies: a[i] = a[i-1] + 1 can't parallelize
  • Function calls in loop: Unless inlined
  • Complex control flow: if statements inside loops
  • Non-contiguous access: Strided or indirect indexing

Manual SIMD Programming

When auto-vectorization isn't enough, you have options:

Intel Intrinsics Example

#include <immintrin.h>

void add_arrays_avx(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 8 floats at a time
    for (; i + 7 < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Portable SIMD with Highway

#include "hwy/highway.h"

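void add_arrays_highway(float* a, float* b, float* c, int n) {
    namespace hn = hwy::HWY_NAMESPACE;
    using D = hn::ScalableTag<float>;
    D d;
    const int lanes = static_cast<int>(hn::Lanes(d));

    int i = 0;
    // Full vectors only: never load or store past the end of the arrays
    for (; i + lanes <= n; i += lanes) {
        auto va = hn::LoadU(d, a + i);
        auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::Add(va, vb), d, c + i);
    }
    // Scalar remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}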

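Highway compiles this code for the best SIMD instruction set enabled for the build target; its dynamic dispatch mode can additionally choose among several compiled targets at run time.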

Memory Alignment

SIMD performance is sensitive to memory alignment:

// Aligned allocation (C++17, <cstdlib>); the size should be a multiple
// of the alignment, and the memory is released with std::free
float* data = static_cast<float*>(
    std::aligned_alloc(32, n * sizeof(float)));  // 32-byte for AVX

// Aligned load: requires a 32-byte-aligned address, faults otherwise
__m256 v = _mm256_load_ps(aligned_ptr);

// Unaligned load: works anywhere; slower mainly when it crosses a cache line
__m256 u = _mm256_loadu_ps(any_ptr);

Rule of thumb: Align to vector width (16 bytes for SSE, 32 for AVX, 64 for AVX-512).

Measuring SIMD Performance

Verify Vectorization Happened

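# Check assembly for packed vector instructions ("ps" = packed single;
# "ss" forms such as vaddss are scalar and do not indicate vectorization)
objdump -d binary | grep -E 'v(mov[au]|add|mul)ps'  # AVX
objdump -d binary | grep -E 'movaps|addps'          # SSE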

Benchmark Methodology

  1. Warm up: First iterations may have cold cache/code effects
  2. Large enough N: Small arrays may not show SIMD benefit
  3. Measure throughput: Elements processed per second
  4. Check memory bandwidth: Once a loop saturates memory bandwidth, wider vectors add nothing

Common Pitfalls

1. Assuming Linear Speedup

8-wide SIMD ≠ 8x speedup. Memory bandwidth, alignment, and remainder handling all reduce gains.

2. Ignoring Horizontal Operations

Summing a vector requires "horizontal" operations that are slower:

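// Horizontal sum of 8 floats in AVX - multiple instructions
float hsum(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);    // Lower 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);  // Upper 4 floats
    lo = _mm_add_ps(lo, hi);                  // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(lo);        // Duplicate odd lanes (SSE3)
    __m128 sums = _mm_add_ps(lo, shuf);       // Lanes 0 and 2 hold pair sums
    shuf = _mm_movehl_ps(shuf, sums);         // Bring lane 2 down to lane 0
    sums = _mm_add_ss(sums, shuf);            // Total in lane 0
    return _mm_cvtss_f32(sums);
}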

3. AVX-512 Frequency Throttling

On some Intel CPUs, heavy AVX-512 use causes frequency reduction (up to 20%). The wider operations may not compensate for the lower clock speed.

4. Forgetting the Scalar Remainder

When N isn't divisible by vector width, you need scalar cleanup code.

Summary

  • SIMD provides 2-8x speedup for data-parallel workloads
  • Start with auto-vectorization: Use -O3 -march=native, check compiler output
  • Help the compiler: Use restrict, avoid aliasing, keep loops simple
  • Manual SIMD when needed: Intrinsics for maximum control, portable libraries for cross-platform
  • Measure carefully: Verify vectorization happened, account for memory bandwidth

Chapter 17: Multi-core Performance

Part V: Parallelism & Low-Level Optimization


"More threads doesn't always mean more performance." — Every performance engineer

The Thread That Made It Slower

A team was optimizing a data processing pipeline. The single-threaded version processed 100,000 records per second. "Easy win," they thought—just parallelize it across 8 cores.

The 8-threaded version processed... 60,000 records per second.

After a week of investigation, they found the culprit: false sharing. Each thread had its own counter variable, but all counters happened to be allocated on the same cache line. Every increment by any thread invalidated the cache for all other threads. The "parallel" code was actually serialized by cache coherence traffic.

Adding 56 bytes of padding between counters brought performance to 750,000 records per second—7.5x the original, finally matching expectations.

Parallelism Fundamentals

Amdahl's Law: The Ceiling You Can't Break

Speedup = 1 / ((1-P) + P/N)

Where:
- P = parallelizable fraction
- N = number of processors

The brutal math: If 10% of your code is sequential (P = 0.9), maximum speedup with infinite processors is 10x. With 8 cores, you get 4.7x. With 64 cores, you get 8.8x.

Speedup vs Cores (P = 0.9):
     │
  10 │                    ─────────── Theoretical max
     │              ╱
   8 │         ╱
     │     ╱
   4 │  ╱
     │╱
   1 └────────────────────────────
     1    4    8   16   32   64  Cores

Gustafson's Law: Scale the Problem

Amdahl assumes fixed problem size. Gustafson observed that in practice, we often scale the problem with the hardware:

Scaled Speedup = N + (1-N) × s

Where s = sequential fraction

If you have 8 cores and 5% sequential work, scaled speedup = 8 + (1-8) × 0.05 = 7.65x.

Key insight: Larger problems often have proportionally less sequential work.

Thread Overhead

Thread Creation Cost

| Operation | Typical Cost |
|---|---|
| Thread creation (OS) | 10-100 μs |
| Thread pool task dispatch | 0.1-1 μs |
| Context switch | 1-10 μs |
| Mutex lock (uncontended) | 10-50 ns |
| Mutex lock (contended) | 1-100 μs |
| Atomic increment | 5-50 ns |

Rule of thumb: Don't create threads for tasks shorter than 10-100 μs. Use thread pools.

Context Switch Overhead

When threads exceed available cores, the OS must context switch. Each switch:

  • Saves/restores registers
  • May flush TLB entries
  • Pollutes caches with new thread's data

Symptom: Performance degrades when thread count >> core count.

Synchronization Costs

Lock Contention

Uncontended lock:
Thread 1: [acquire]────[work]────[release]
                                          Thread 2: [acquire]────[work]────[release]

Contended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2:          [spin/wait............][acquire]────[work]────[release]
                   ↑ Wasted time

Contention scales poorly: With N threads competing for one lock, average wait time grows as O(N).

Atomic Operations

Atomics avoid locks but aren't free:

std::atomic<int> counter{0};
counter++;  // On x86: a LOCK-prefixed instruction (LOCK XADD, or LOCK ADD if the result is unused)

Cost: 5-50 ns per atomic operation, depending on contention. Under high contention, atomics can be as slow as locks.

False Sharing: The Silent Killer

Cache Line (64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ counter_1 │ counter_2 │ counter_3 │ counter_4 │ ... padding ... │
│  (4 bytes) │  (4 bytes) │  (4 bytes) │  (4 bytes) │               │
└────────────────────────────────────────────────────────────────┘
     ↑            ↑            ↑            ↑
  Thread 1    Thread 2    Thread 3    Thread 4

Every write invalidates the entire line for all other threads!

Solution: Pad to cache line boundaries:

struct alignas(64) PaddedCounter {
    std::atomic<int> value;
    char padding[60];  // Fill to 64 bytes
};
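
A fuller sketch of per-thread counters follows. It assumes a 64-byte cache line (common on x86-64, but not universal) and a fixed thread count `kThreads`, both chosen for illustration. With `alignas(64)` the compiler pads the struct automatically, so the explicit padding array is not needed; where the toolchain provides it, `std::hardware_destructive_interference_size` from `<new>` can replace the literal 64.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One counter per cache line. 64 bytes is an assumed line size.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per 64-byte line");
static_assert(alignof(PaddedCounter) == 64, "counters start on a line boundary");

constexpr int kThreads = 4;  // assumed thread count for the sketch

// Each thread increments only its own counter; returns the grand total.
inline long count_in_parallel(long increments_per_thread) {
    PaddedCounter counters[kThreads];
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    long total = 0;
    for (const auto& c : counters) total += c.value.load();
    return total;
}
```

Removing `alignas(64)` leaves the result unchanged but packs all four counters into one line, which is exactly the layout from the diagram above. Timing both variants on a multi-core machine is a simple way to observe the cost of false sharing directly.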

Cache Coherence: MESI Protocol

When multiple cores share memory, hardware maintains coherence via MESI:

| State | Meaning |
|---|---|
| Modified | This cache has the only valid copy (dirty) |
| Exclusive | This cache has the only copy (clean) |
| Shared | Multiple caches have this line |
| Invalid | This cache line is stale |

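Cache line bouncing: When two cores repeatedly write to the same line, ownership bounces between them. Each write moves the line to Modified in the writer's cache and invalidates the other core's copy, generating coherence traffic.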

Measuring Parallel Performance

Scalability Metrics

Strong scaling: Fixed problem size, increase cores.

  • Ideal: 2x cores → 2x speedup
  • Reality: Diminishing returns due to Amdahl's Law

Weak scaling: Proportional problem size (2x cores → 2x data).

  • Ideal: Constant time regardless of core count
  • Reality: Communication overhead grows

Profiling Tools

# Linux perf for cache and synchronization events
perf stat -e cache-misses,cache-references \
          -e context-switches,cpu-migrations ./program

# Intel VTune for detailed threading analysis
vtune -collect threading ./program

# Detect false sharing
perf c2c record ./program
perf c2c report

Patterns for Good Parallelism

Embarrassingly Parallel

No communication between tasks. Maximum scalability.

// Perfect parallelism: each iteration independent
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    result[i] = expensive_computation(input[i]);
}

Data Parallelism with Reduction

// Reduction: each thread accumulates locally, then combine
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += data[i];
}

Pipeline Parallelism

Stage 1 (Parse) → Stage 2 (Process) → Stage 3 (Write)
     ↓                  ↓                   ↓
  Thread 1           Thread 2           Thread 3

Good when stages have similar duration. Bottleneck is the slowest stage.

Task Parallelism with Work Stealing

// Task-based parallelism (TBB, OpenMP tasks)
tbb::parallel_for(0, n, [&](int i) {
    process(items[i]);
});

Work stealing balances load dynamically: idle threads steal tasks from busy threads' queues.

Common Anti-Patterns

1. Over-Synchronization

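// Bad: Lock held during I/O (mtx is a std::mutex shared by the threads)
{
    std::lock_guard<std::mutex> guard(mtx);
    result = expensive_network_call();  // Other threads blocked!
}

// Better: Minimize critical section
auto data = expensive_network_call();   // No lock held during the slow call
{
    std::lock_guard<std::mutex> guard(mtx);
    result = std::move(data);           // Lock covers only the shared write
}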

2. Thread-per-Request

Creating a thread for each request doesn't scale. Use thread pools with bounded concurrency.

3. Ignoring NUMA

On multi-socket systems, memory access time varies by location:

Socket 0          Socket 1
┌─────────┐      ┌─────────┐
│ Core 0-7│      │Core 8-15│
│ Memory A│ ←──→ │Memory B │
└─────────┘      └─────────┘
     ↑ Local: 80ns    ↑ Remote: 150ns

Solution: Pin threads to cores, allocate memory locally.

Summary

  • Amdahl's Law sets hard limits on parallel speedup. Know your sequential fraction.
  • Synchronization is expensive. Minimize critical sections, prefer lock-free when possible.
  • False sharing is a silent killer. Pad data structures to cache line boundaries.
  • Measure scalability with both strong and weak scaling tests.
  • Use appropriate patterns, in rough order of scalability: embarrassingly parallel > reduction > pipeline > fine-grained locking.

Chapter 18: Memory Allocators

Part V: Parallelism & Low-Level Optimization


"Memory allocation is the dark matter of performance." — Anonymous

The malloc That Wasn't Free

A high-frequency trading firm was debugging mysterious latency spikes. Their code was tight—no unnecessary copies, no blocking I/O, carefully tuned algorithms. Yet every few seconds, a trade would take 10x longer than expected.

The culprit: malloc. Under heavy allocation load, the default glibc allocator would occasionally need to request memory from the kernel via mmap, causing a 50-100 μs stall. In a system where microseconds mattered, this was catastrophic.

Switching to jemalloc with pre-warmed arenas eliminated the spikes entirely. The fastest allocation is the one you don't make—but when you must allocate, the allocator matters enormously.

How malloc Works

The Allocation Problem

Memory allocation seems simple: give me N bytes, return a pointer. But the allocator must solve several hard problems:

  1. Speed: Allocation should be O(1) or close to it
  2. Fragmentation: Avoid wasting memory on gaps
  3. Thread safety: Multiple threads allocating simultaneously
  4. Cache efficiency: Keep related allocations together

Traditional malloc (glibc ptmalloc2)

Heap Structure:
┌─────────────────────────────────────────────────┐
│ Chunk │ Chunk │ Free │ Chunk │ Free │ Chunk │...│
│  16B  │  32B  │ 64B  │ 128B  │ 48B  │  24B  │   │
└─────────────────────────────────────────────────┘
         ↑
    Free list links free chunks together

Free list: Freed memory is linked together. Allocation searches for a suitable chunk.

Coalescing: Adjacent free chunks are merged to reduce fragmentation.

Size classes: Small allocations use bins of fixed sizes (16, 32, 48, 64... bytes).

System Call Overhead

User Space:              Kernel Space:
┌──────────────┐         ┌──────────────┐
│   malloc()   │ ──brk──→│ Extend heap  │  (small allocs)
│              │         │              │
│   malloc()   │ ─mmap──→│ Map pages    │  (large allocs, >128KB)
│              │         │              │
│   free()     │ ─munmap→│ Unmap pages  │  (large frees)
└──────────────┘         └──────────────┘

The problem: System calls are expensive (1-10 μs). Good allocators minimize them by:

  • Caching freed memory for reuse
  • Requesting memory in large chunks
  • Delaying returns to the OS

Modern Allocators

jemalloc (Facebook/Meta)

Originally developed for FreeBSD, now widely used (Firefox, Redis, Facebook services).

Key innovations:

  • Thread-local caches: Each thread has its own cache, avoiding lock contention
  • Arenas: Multiple independent heaps reduce contention
  • Size classes: Carefully chosen to minimize internal fragmentation
  • Huge page support: Reduces TLB misses for large allocations

tcmalloc (Google)

Google's thread-caching malloc, used in most Google services.

Architecture:

┌─────────────────────────────────────────────────┐
│ Thread 1 Cache │ Thread 2 Cache │ Thread N Cache│
└───────┬────────┴───────┬────────┴───────┬───────┘
        │                │                │
        └────────────────┼────────────────┘
                         ↓
              ┌─────────────────────┐
              │   Central Free List │
              └─────────────────────┘
                         ↓
              ┌─────────────────────┐
              │    Page Heap        │
              └─────────────────────┘

Small objects (< 256 KB): Served from thread-local cache, no locking.

Large objects: Served from central page heap with locking.

mimalloc (Microsoft)

Microsoft's allocator, designed for maximum performance.

Key features:

  • Free list sharding: Multiple free lists per size class reduce contention
  • Eager page reuse: Freed pages are immediately available for reuse
  • Segment-based: Memory organized in segments for better locality

Performance Comparison

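Allocator         Single-thread   Multi-thread             Memory Overhead   Latency Consistency
────────────────────────────────────────────────────────────────────────────────────────────────
glibc ptmalloc2   Baseline        Poor (lock contention)   Medium            Variable
jemalloc          1.1-1.3x        2-5x                     Low               Good
tcmalloc          1.1-1.2x        2-4x                     Low               Good
mimalloc          1.2-1.5x        3-6x                     Very Low          Excellent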

Speedup relative to glibc on typical workloads

Memory Pools and Custom Allocators

When general-purpose allocators aren't enough:

Object Pools

Pre-allocate a fixed number of same-sized objects:

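#include <array>
#include <cstddef>

template<typename T, size_t N>
class ObjectPool {
    std::array<T, N> storage;
    std::array<T*, N> free_list;
    size_t free_count = N;

public:
    // Initially every slot in storage is free
    ObjectPool() {
        for (size_t i = 0; i < N; ++i) free_list[i] = &storage[i];
    }

    T* allocate() {
        if (free_count == 0) return nullptr;  // Pool exhausted
        return free_list[--free_count];
    }

    // ptr must have come from allocate() on this pool
    void deallocate(T* ptr) {
        free_list[free_count++] = ptr;
    }
};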

Use case: Game entities, network connections, fixed-size messages.

Arena/Bump Allocators

Allocate sequentially, free all at once:

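#include <cstddef>

class Arena {
    char* buffer;
    size_t offset = 0;
    size_t capacity;

    // Round n up to a multiple of align (align must be a power of two)
    static size_t align_up(size_t n, size_t align) {
        return (n + align - 1) & ~(align - 1);
    }

public:
    // The caller owns the backing memory and must keep it alive
    Arena(char* buf, size_t cap) : buffer(buf), capacity(cap) {}

    void* allocate(size_t size) {
        size = align_up(size, 8);
        if (size > capacity - offset) return nullptr;  // Out of space
        void* ptr = buffer + offset;
        offset += size;
        return ptr;
    }

    void reset() { offset = 0; }  // "Free" everything
};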

Use case: Per-frame game allocations, request-scoped web server data, compiler passes.

Stack Allocators

LIFO allocation pattern:

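class StackAllocator {
    char* buffer;
    size_t capacity;
    size_t offset = 0;

public:
    StackAllocator(char* buf, size_t cap) : buffer(buf), capacity(cap) {}

    void* allocate(size_t size) {  // Bump the offset (alignment omitted for brevity)
        if (size > capacity - offset) return nullptr;
        offset += size;
        return buffer + offset - size;
    }
    // ptr must be the most recent live allocation; frees it by rewinding
    void deallocate(void* ptr) { offset = static_cast<char*>(ptr) - buffer; }
};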

Benchmarking Allocators

Methodology

  1. Use realistic workloads: Synthetic benchmarks often don't reflect real patterns
  2. Measure latency distribution: Mean is less important than P99/P99.9
  3. Test under contention: Single-threaded benchmarks miss the point
  4. Monitor memory usage: Fast but memory-hungry isn't always better
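
To make point 2 concrete, here is a minimal sketch of a nearest-rank percentile over recorded allocation latencies. The function name and the per-mille argument (990 for P99, 999 for P99.9) are assumptions for this example, and collecting the samples (for instance with clock_gettime around each malloc call) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison for qsort. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts the samples in place and returns the
 * smallest sample such that at least per_mille/1000 of all samples are
 * less than or equal to it. Returns 0 for an empty set. */
uint64_t latency_percentile(uint64_t *samples, size_t n, unsigned per_mille) {
    if (n == 0) return 0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t rank = ((size_t)per_mille * n + 999) / 1000;  /* ceiling */
    if (rank == 0) rank = 1;
    if (rank > n) rank = n;
    return samples[rank - 1];
}
```

With nine samples of 10 ns and one of 500 ns, the mean is 59 ns and the median is 10 ns, while P99 reports the 500 ns outlier. That gap is why the tail, not the mean, is the number to watch.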

Tools

# heaptrack: Allocation profiling
heaptrack ./program
heaptrack_gui heaptrack.program.*.gz

# Valgrind massif: Heap profiling
valgrind --tool=massif ./program
ms_print massif.out.*

# perf for allocation-related events
perf stat -e page-faults,minor-faults,major-faults ./program

Switching Allocators

Most modern allocators can be used via LD_PRELOAD:

# Use jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so ./program

# Use tcmalloc
LD_PRELOAD=/usr/lib/libtcmalloc.so ./program

# Use mimalloc
LD_PRELOAD=/usr/lib/libmimalloc.so ./program

Embedded and Real-Time Considerations

Deterministic Allocation

Real-time systems need bounded allocation time. Solutions:

  • Static allocation: Allocate everything at startup
  • Pool allocators: Fixed-size pools with O(1) allocation
  • TLSF (Two-Level Segregated Fit): O(1) general-purpose allocator
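
A fixed-size pool with O(1) allocation can be sketched in C as follows. The block size, block count, and function names are assumptions for illustration, and the code is not safe to call from interrupts or multiple threads without added locking.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  32u   /* assumed block size, bytes */
#define POOL_BLOCK_COUNT 8u    /* assumed number of blocks */

/* A free block stores the link to the next free block in its own bytes. */
typedef union pool_block {
    union pool_block *next;
    uint8_t payload[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t pool_storage[POOL_BLOCK_COUNT];  /* static, lives in .bss */
static pool_block_t *pool_free_head;

/* Link every block into the free list; call once at startup. */
void pool_init(void) {
    pool_free_head = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_head;
        pool_free_head = &pool_storage[i];
    }
}

/* O(1): pop the head of the free list, or NULL when the pool is exhausted. */
void *pool_alloc(void) {
    pool_block_t *b = pool_free_head;
    if (b != NULL) pool_free_head = b->next;
    return b;
}

/* O(1): push the block back. p must have come from pool_alloc(). */
void pool_free(void *p) {
    pool_block_t *b = (pool_block_t *)p;
    b->next = pool_free_head;
    pool_free_head = b;
}
```

Both operations touch a single pointer, so the worst-case time is the same as the best case. The price is that every object must fit the one block size.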

Memory Budget

Embedded systems have fixed memory. Strategies:

  • Pre-calculate maximum memory needs
  • Use compile-time allocation where possible
  • Implement memory monitoring and alerts

Summary

  • Default malloc is often a bottleneck in multi-threaded applications
  • Modern allocators (jemalloc, tcmalloc, mimalloc) provide 2-6x better multi-threaded performance
  • Custom allocators (pools, arenas) can provide 10-100x speedup for specific patterns
  • Measure before switching: Profile allocation patterns to understand your needs
  • Consider latency, not just throughput: P99 latency often matters more than average

Chapter 19: Footprint Analysis Fundamentals

Part VI: Embedded Constraints


"In embedded systems, every byte counts—literally." — Jack Ganssle

The Missing 2 KB

Two weeks before mass production, our project war room reeked of espresso and anxiety.

The development team had just merged the final security patch, but the automated build server flashed an angry red warning:

region 'FLASH' overflowed by 2048 bytes

This was a 128 KB flash system, and our binary had grown to 130 KB. Those extra 2 KB stood like an insurmountable wall between us and product shipment.

Senior engineer Zhang immediately shouted: "Quick! Strip out all the debug strings from printf, and disable those unnecessary assert statements!"

Everyone scrambled through the source code, hunting for strings to delete. An hour later, the second build result arrived: only 400 bytes saved.

The team fell silent. Blind "intuition-based optimization" proved utterly powerless against hard memory constraints.

"We need data, not guesses." Junior performance engineer Ming broke the chaos.

Instead of rushing to delete code, he calmly ran size and nm --size-sort. In the detailed linker map file, he discovered the real "space killer" wasn't printf—it was a newly introduced third-party sensor driver.

That driver had inadvertently pulled in the floating-point emulation library, all because of a calibration routine that mistakenly used double for fewer than ten lines of data processing.

Through systematic analysis tools, the team fixed just two lines of code, converting floating-point to fixed-point arithmetic. The binary instantly shrank by about 50 KB.

Optimizing footprint isn't a guessing game of "deleting code"—it's a precise science of measurement.


What is Footprint?

In embedded systems, footprint refers to the memory space a program occupies. Unlike desktop systems, embedded memory is a hard constraint—your firmware must fit into fixed-size flash and RAM.

Static vs Dynamic Footprint

Footprint can be categorized into two types:

┌─────────────────────────────────────────────────────────┐
│  Static Footprint (determined at compile time)          │
├─────────────────────────────────────────────────────────┤
│  .text    │ Machine code, instructions │ Stored in Flash │
│  .rodata  │ Constants, string literals │ Stored in Flash │
│  .data    │ Initialized globals        │ Flash → RAM     │
│  .bss     │ Uninitialized globals      │ RAM (zeroed)    │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  Dynamic Footprint (changes at runtime)                  │
├─────────────────────────────────────────────────────────┤
│  Stack    │ Local variables, call frames │ RAM           │
│  Heap     │ Dynamically allocated memory │ RAM           │
└─────────────────────────────────────────────────────────┘

Flash vs RAM Occupancy

Understanding the mapping between sections and memory is crucial:

Flash usage = .text + .rodata + .data (initial values)
RAM usage   = .data + .bss + stack + heap

Note that .data occupies both Flash (storing initial values) and RAM (runtime storage). This detail trips up many engineers.
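
The two formulas translate directly into code. The sketch below is one way to encode them for a build-time size check; the struct and function names are assumptions, and the section sizes would come from a tool such as size.

```c
#include <stdint.h>

/* Section sizes in bytes, as reported by the toolchain. */
typedef struct {
    uint32_t text, rodata, data, bss, stack, heap;
} section_sizes_t;

/* Flash holds code, constants, and the initial values of .data. */
uint32_t flash_usage(const section_sizes_t *s) {
    return s->text + s->rodata + s->data;
}

/* RAM holds the runtime copy of .data plus .bss, stack, and heap. */
uint32_t ram_usage(const section_sizes_t *s) {
    return s->data + s->bss + s->stack + s->heap;
}
```

Note that .data appears in both sums, which is the double counting described above.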

Why "Won't Fit in Flash" Is So Common

Typical embedded system memory constraints:

Device Type       Flash      RAM
────────────────────────────────
Low-end MCU       32 KB      4 KB
Mid-range MCU     256 KB     64 KB
High-end MCU      1 MB       256 KB
Application CPU   Unlimited* 512 MB+

* Has file system and virtual memory

As features accumulate, code size easily grows unnoticed. A single "harmless" library reference might bring in tens of KB of hidden dependencies.


The Toolbox

Just as performance analysis needs profilers, footprint analysis requires specialized tools. Here are four essential tools for systems software engineers.

1. The size Command: Quick Overview

size is the most basic tool for quickly grasping a binary's overall structure:

$ riscv64-unknown-elf-size -A firmware.elf

section           size         addr
.text            0x4500   0x80000000
.rodata          0x0800   0x80004500
.data            0x0100   0x80004d00
.bss             0x0200   0x20000000
.stack           0x1000   0x20000200
Total            0x5c00

Key metrics:

  • Flash usage: .text + .rodata + .data = 19,968 bytes
  • RAM usage: .data + .bss + .stack = 4,864 bytes

Pro tip: Use -A (System V format) instead of the default Berkeley format for more detailed section breakdown.

2. The nm Command: Symbol-Level Analysis

When you find a section is too large, dig into symbol level to find the culprit:

$ riscv64-unknown-elf-nm -S --size-sort -r firmware.elf | head -10

80001a20 000005d4 T core_process_loop
20000040 00000400 B network_buffer
800021f4 00000210 t parse_json_string
80002404 000001c8 T uart_send_buffer
...

Output interpretation:

  • Column 1: Symbol address
  • Column 2: Symbol size (bytes)
  • Column 3: Symbol type (T=text, B=bss, D=data; lowercase means a local symbol)
  • Column 4: Symbol name

Pro tip: Filter by type, e.g., finding only large variables in RAM (include the lowercase letters so static variables are not missed):

$ nm -S --size-sort -r firmware.elf | grep -E ' [BbDd] '
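
When the same filtering has to run inside a host-side script or a CI check, the four-column format is easy to parse. The sketch below is a minimal example; the function names are assumptions, and it expects every line to carry a size, as is the case with --size-sort output.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of `nm -S` output: "<addr> <size> <type> <name>".
 * Returns 1 and fills the outputs on success, 0 if the line does not have
 * four fields in this shape. name_len must be at least 1. */
int parse_nm_line(const char *line, unsigned long *size, char *type,
                  char *name, size_t name_len) {
    unsigned long addr;
    char buf[128];
    if (sscanf(line, "%lx %lx %c %127s", &addr, size, type, buf) != 4)
        return 0;
    strncpy(name, buf, name_len - 1);
    name[name_len - 1] = '\0';
    return 1;
}

/* RAM-resident symbol types: B/b = .bss, D/d = .data. */
int is_ram_symbol(char type) {
    return type == 'B' || type == 'b' || type == 'D' || type == 'd';
}
```

Summing the sizes of all lines for which is_ram_symbol() returns 1 gives the static RAM footprint attributable to named variables.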

3. bloaty: Modern Footprint Analyzer

Bloaty McBloatface is an advanced footprint analysis tool from Google. It displays space distribution hierarchically and supports diff comparison between versions.

# Analyze by compile unit (source file)
$ bloaty firmware.elf -d compileunits

    VM SIZE                FILE SIZE
 --------------          --------------
  62.5%  5.15Ki tasks.c  62.5%  5.15Ki
  21.2%  1.75Ki queue.c  21.2%  1.75Ki
   8.5%    712B list.c    8.5%    712B
   7.8%    650B port.c    7.8%    650B

Version comparison (Diff)—bloaty's most powerful feature:

$ bloaty new_firmware.elf -- old_firmware.elf

     VM SIZE                     FILE SIZE
 --------------               --------------
  +15.2%   +2.1Ki .text       +15.2%   +2.1Ki
   [ = ]       0 .rodata       [ = ]       0
  +8.3%    +128B .bss         +8.3%    +128B
 --------------               --------------
  +12.1%  +2.2Ki TOTAL        +12.1%  +2.2Ki

This diff capability is especially useful in CI/CD—automatically compare footprint changes after each commit.

4. Linker Map File: The Ultimate Truth

The linker map file records how the linker combines all object files into the final binary. It's the ultimate weapon for solving "where did the space go?" mysteries.

Generating a map file:

$ riscv64-unknown-elf-gcc main.o lib.o -Wl,-Map=output.map -o firmware.elf

Map file example:

.text.core_init
                0x0000000080000100       0x48 main.o
.text.uart_send
                0x0000000080000148       0x20 uart.o
 *fill*         0x0000000080000168       0x08
.text.process_data
                0x0000000080000170      0x120 process.o

Key observations:

  • *fill* indicates padding (alignment)—hidden space waste
  • You can trace each symbol back to its source object file
  • You can discover libraries that were accidentally linked in
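
The size of a *fill* entry follows from the alignment of the section that comes after it. In the excerpt above, uart_send ends at 0x80000168 and the next section starts at 0x80000170, which is what a 16-byte alignment requirement would produce (the alignment value is an assumption; the map excerpt does not state it). A small sketch of the arithmetic, with a hypothetical function name:

```c
#include <stdint.h>

/* Bytes of padding needed to bring addr up to the next multiple of align.
 * align must be a power of two, as linker alignments are. */
uint32_t fill_bytes(uint32_t addr, uint32_t align) {
    return (align - (addr & (align - 1))) & (align - 1);
}
```

Many small sections with large alignment requirements can add up to a noticeable amount of padding, so it is worth scanning the map file for *fill* entries.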

Analysis Workflow

Establish a systematic analysis process instead of guessing by intuition:

Step 1: Baseline Measurement
        ↓
    $ size firmware.elf
    Record .text, .data, .bss sizes
        ↓
Step 2: Identify Heavy Hitters
        ↓
    $ nm -S --size-sort -r firmware.elf | head -20
    Find symbols consuming the most space
        ↓
Step 3: Trace Origins
        ↓
    Check linker map file
    Confirm which object files these symbols come from
        ↓
Step 4: Analyze Causes
        ↓
    - Is there an accidentally included library?
    - Are there unnecessary features being compiled in?
    - Are there oversized static buffers?
        ↓
Step 5: Verify Changes
        ↓
    $ bloaty new.elf -- old.elf
    Confirm changes actually reduced footprint

Common "Space Killers"

1. Floating-Point Library
   - Using float/double on MCUs without FPU
   - Even a single printf("%f") pulls in the entire float formatting library

2. Standard Library Functions
   - printf family: 10-20 KB
   - malloc/free: 1-5 KB
   - Consider newlib-nano or custom minimal versions

3. Oversized Static Buffers
   - char log_buffer[4096]; // Do you really need this big?

4. Unused Features
   - Referencing a library but only using a small part
   - Not enabling --gc-sections to remove dead code

Case Study: Tracing an Accidental Library Reference

Let's return to the opening story and reconstruct Ming's analysis process with tools.

Step 1: Discover the problem

$ size firmware_before.elf
   text    data     bss     dec     hex filename
 133120     256    4096  137472   21900 firmware_before.elf

Flash usage is 133,376 bytes (.text + .data), exceeding the 128 KB limit.

Step 2: Find the heavy hitters

$ nm -S --size-sort -r firmware_before.elf | head -5
80010000 00003a00 T __aeabi_ddiv
8000c600 00002800 T __aeabi_dmul
80009e00 00001c00 T __aeabi_dadd
80008200 00001c00 T __aeabi_dsub
80006600 00001400 T __aeabi_d2iz

These __aeabi_d* functions are software emulation for double floating-point operations! The five largest alone add up to about 43.5 KB, and the full set of emulation routines totals about 50 KB.

Step 3: Trace the origin

Search for these symbols' source in the linker map file:

$ grep -A1 "__aeabi_ddiv" output.map
__aeabi_ddiv
                0x80010000    0x3a00 libgcc.a(dp-bit.o)

It's libgcc's double-precision floating-point emulation.

Step 4: Find the caller

$ grep -rn "double\|float" src/
src/drivers/sensor.c:42:    double calibrated = raw_value * 0.0125;

There it is! A simple calibration operation pulled in 50 KB of floating-point library.

Step 5: Fix it

Convert double operations to fixed-point:

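// Before: pulls in 50 KB floating-point library
double calibrated = raw_value * 0.0125;

// After: fixed-point, no floating-point library needed
// (0.0125 = 125/10000; the result is truncated to a whole number,
//  and raw_value * 125 must fit in 32 bits)
int32_t calibrated = (raw_value * 125) / 10000;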
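
If the fractional part of the calibrated value matters, the same integer-only approach can round to nearest or keep a fixed number of decimal places. The following is a sketch under stated assumptions: the function names are hypothetical, raw is non-negative, and raw * 125 fits in int32_t (raw below roughly 17 million).

```c
#include <stdint.h>

/* Scale factor 0.0125 expressed as the fraction 125/10000. */
#define CAL_NUM 125
#define CAL_DEN 10000

/* Whole-unit result, rounded to nearest instead of truncated. */
int32_t calibrate_rounded(int32_t raw) {
    return (raw * CAL_NUM + CAL_DEN / 2) / CAL_DEN;
}

/* Same calibration in thousandths of a unit, keeping three decimal places:
 * raw * 0.0125 * 1000 = raw * 125 / 10. */
int32_t calibrate_milli(int32_t raw) {
    return (raw * CAL_NUM) / (CAL_DEN / 1000);
}
```

Carrying the value in thousandths keeps the precision the double version had for practical purposes, while the code still compiles to plain integer multiply and divide.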

Step 6: Verify

$ bloaty firmware_after.elf -- firmware_before.elf

     VM SIZE                     FILE SIZE
 --------------               --------------
 -37.5%  -50.0Ki .text       -37.5%  -50.0Ki
   [ = ]       0 .data         [ = ]       0
   [ = ]       0 .bss          [ = ]       0
 --------------               --------------
 -37.5%  -50.0Ki TOTAL       -37.5%  -50.0Ki

Success—50 KB saved!


Summary

  • Footprint = memory space a program occupies, including code size (Flash) and data size (RAM)
  • Measurement tools:
    • size: Quick overview of section sizes
    • nm --size-sort: Find the largest symbols
    • bloaty: Hierarchical analysis and version comparison
    • Linker map file: Trace symbol origins
  • Analysis workflow: Baseline measurement → Identify heavy hitters → Trace origins → Analyze causes → Verify changes
  • Common pitfalls: Floating-point library, standard library functions, oversized static buffers, unremoved dead code
  • Core principle: Measure, don't guess

Chapter 20: Compiler Size Optimization

Part VI: Embedded Constraints


"Premature optimization is the root of all evil, but so is premature pessimization." — Unknown

The Optimization Level Myth

Every C/C++ developer knows -O2 and -O3 make programs run faster. But in the embedded world, there are two often-overlooked friends: -Os and -Oz.

In the previous chapter's story, we used systematic analysis to identify the floating-point library as the "space killer." But that was just the beginning—true size optimization requires understanding what the compiler does behind the scenes.

This chapter answers a core question with experimental data: How much code size difference do different compiler options actually produce?


Optimization Level Comparison

GCC Optimization Level Definitions

Level       Goal               Characteristics
────────────────────────────────────────────────────────────
-O0         Debug              No optimization, max debuggability
-O1         Basic              Reduce code size, moderate optimization
-O2         Standard           Balance speed and size
-O3         Aggressive         Maximum speed, may increase code size
-Os         Size optimization  Based on -O2, but prefer smaller code
-Oz         Minimum size       More aggressive than -Os (Clang; GCC 12+)
-Og         Debug optimization Suitable for use with debugger

Experiment: Compiling the Same Program

We use a typical embedded application (RTOS task management + UART driver) as our test baseline:

# Compile same source with different optimization levels
$ riscv64-unknown-elf-gcc -O0 -o test_O0.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O1 -o test_O1.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O2 -o test_O2.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O3 -o test_O3.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -Os -o test_Os.elf main.c drivers/*.c

Measurement results (.text section size):

Opt Level   .text (bytes)    vs -O0        vs -Os
────────────────────────────────────────────────────────
-O0           28,672         100.0%        +78.1%
-O1           18,432          64.3%        +14.5%
-O2           20,480          71.4%        +27.2%
-O3           24,576          85.7%        +52.7%
-Os           16,096          56.1%        baseline
-Oz*          14,848          51.8%        -7.8%

* Clang, or GCC 12 and later

Key observations:

  1. -O3 is larger than -O2: Aggressive inlining and loop unrolling increase code size
  2. -Os is ~20% smaller than -O2: Significant for memory-constrained systems
  3. -O0 is largest: No optimization, each statement generates separate instructions

Speed vs Size Trade-off

  Code Size
  ▲
  │   ● -O0 (slowest and largest, no optimization)
  │
  │                                   ● -O3 (fastest, largest optimized build)
  │
  │                           ● -O2 (fast, balanced size)
  │
  │                   ● -O1 (basic optimization)
  │
  │                       ● -Os (small size, good speed)
  │               ● -Oz (smallest size, sacrifices speed)
  │
  └───────────────────────────────────────────────────────► Speed

Rules of thumb:

  • Development/debug: -Og
  • Release (speed priority): -O2 or -O3
  • Release (space priority): -Os or -Oz
  • Extremely constrained: -Oz + LTO + --gc-sections

Advanced Compiler Options

Basic optimization levels are just the beginning. Here are advanced size optimization techniques.

1. Dead Code Elimination

Use -ffunction-sections, -fdata-sections with linker's --gc-sections:

# Compile: place each function/data in its own section
$ gcc -ffunction-sections -fdata-sections -Os -c main.c -o main.o

# Link: remove unused sections
$ gcc -Wl,--gc-sections main.o -o firmware.elf

Effect example:

Options                           .text size
──────────────────────────────────────────────
Without gc-sections               24,576 bytes
With gc-sections                  18,432 bytes
Savings                           6,144 bytes (25%)

This is especially effective when using large libraries (like newlib)—you might only use memcpy and strlen, but without gc-sections, malloc, printf, and more get linked in.

2. Link-Time Optimization (LTO)

LTO allows the compiler to perform cross-compile-unit optimization at link time:

# Both compile and link need -flto
$ gcc -flto -Os -c main.c -o main.o
$ gcc -flto -Os -c uart.c -o uart.o
$ gcc -flto -Os main.o uart.o -o firmware.elf

LTO advantages:

  • Cross-file inlining decisions
  • More precise dead code elimination
  • Better constant propagation

LTO costs:

  • Significantly increased compile time (possibly 2-5x)
  • Debug information may be harder to trace
  • Some linker scripts need adjustment

Effect example:

Options             .text size    Compile time
───────────────────────────────────────────────
-Os                 16,096        1.2s
-Os -flto           14,336        3.8s
Savings             1,760 (11%)   +217%

3. Inlining Control

Inlining is the most critical speed vs size trade-off:

// Force inline (may increase code size)
static inline __attribute__((always_inline))
void critical_function(void) { ... }

// Prevent inline (ensure minimum code size)
__attribute__((noinline))
void large_function(void) { ... }

When to force inline:

  • Very small helper functions (1-3 lines)
  • Functions on hot paths
  • When call overhead exceeds the function itself

When to prevent inline:

  • Large functions (over 20-30 lines)
  • Functions with multiple call sites
  • Error handling paths

Standard Library Selection

The standard C library is one of the biggest "hidden costs" in embedded systems.

newlib vs newlib-nano

Library           printf support       .text increase
───────────────────────────────────────────────────────
newlib            Full                 ~50-80 KB
newlib-nano       Basic (no float)     ~8-15 KB
Custom minimal    Integer only         ~1-2 KB

Using newlib-nano:

# GCC's --specs option
$ arm-none-eabi-gcc --specs=nano.specs -Os main.c -o firmware.elf

Verifying the effect:

$ size firmware_newlib.elf
   text    data     bss     dec     hex filename
  52480     256    4096   56832    de00 firmware_newlib.elf

$ size firmware_nano.elf
   text    data     bss     dec     hex filename
  12288     256    4096   16640    4100 firmware_nano.elf

Difference: 40 KB. Huge on 64 KB or 128 KB flash systems.

Avoiding printf

If you only need to output integers or simple strings, consider a lightweight custom version:

// Full printf: ~15-50 KB
printf("Value: %d\n", value);

// Custom lightweight version: ~200 bytes
void print_int(const char* prefix, int value) {
    uart_puts(prefix);
    char buf[12];
    itoa(value, buf, 10);
    uart_puts(buf);
    uart_puts("\n");
}
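
Note that itoa() is a non-standard extension, so it may be missing from some C libraries. If it is, a small conversion routine is enough. The sketch below is one possible replacement (the function name is hypothetical) and assumes a 32-bit int.

```c
#include <stddef.h>

/* Write value as decimal text into out, which needs at least 12 bytes for a
 * 32-bit int (sign, 10 digits, terminator). Returns the string length. */
size_t format_int(int value, char *out) {
    char tmp[11];
    size_t n = 0, len = 0;
    /* Work in unsigned so that INT_MIN negates without overflow. */
    unsigned int u = (value < 0) ? 0u - (unsigned int)value
                                 : (unsigned int)value;

    do {                          /* digits come out least significant first */
        tmp[n++] = (char)('0' + u % 10);
        u /= 10;
    } while (u != 0);

    if (value < 0) out[len++] = '-';
    while (n > 0) out[len++] = tmp[--n];   /* reverse into the output */
    out[len] = '\0';
    return len;
}
```

On cores without a hardware divider, the division by 10 may pull in a small integer-division helper from libgcc, which is still far smaller than a printf implementation.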

Size Impact of Specific Features

Certain C/C++ features have significant impact on code size:

Floating-Point Operations

Feature                          Extra size on MCU without FPU
────────────────────────────────────────────────────────────────
float arithmetic                 +10-15 KB (software emulation)
double arithmetic                +25-50 KB (software emulation)
printf("%f")                     +15-25 KB (formatting)
math.h (sin, cos, etc.)          +10-30 KB (depends on usage)

Best practices:

  • Use fixed-point arithmetic whenever possible
  • If float is necessary, avoid double
  • Never use printf("%f") on MCUs without FPU

C++ Features

Feature                          Typical size impact
────────────────────────────────────────────────────────────────
Virtual functions                +8-16 bytes vtable per class
RTTI (typeid, dynamic_cast)      +2-10 KB
Exception handling               +10-50 KB
STL containers                   Depends on usage, can be tens of KB

Recommended embedded C++ compile options:

$ g++ -fno-rtti -fno-exceptions -Os ...

Experiment: Complete Optimization Workflow

Let's demonstrate the optimization workflow with a complete example:

Initial state:

$ riscv64-unknown-elf-gcc -O2 main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  45056     512    8192   53760    d200 firmware.elf

Flash usage: 44.5 KB (.text + .data), target: 32 KB.

Step 1: Switch to -Os

$ riscv64-unknown-elf-gcc -Os main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  36864     512    8192   45568    b200 firmware.elf

Saved: 8 KB (-18%). Still 4.5 KB over.

Step 2: Add gc-sections

$ riscv64-unknown-elf-gcc -Os -ffunction-sections -fdata-sections \
    -Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  30720     512    8192   39424    9a00 firmware.elf

Saved: 6 KB (-17%). Target achieved! But let's continue.

Step 3: Add LTO

$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
    -Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  28672     512    8192   37376    9200 firmware.elf

Saved: 2 KB (-7%).

Step 4: Use newlib-nano

$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
    -Wl,--gc-sections --specs=nano.specs main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  18432     512    8192   27136    6a00 firmware.elf

Saved: 10 KB (-36%).

Optimization summary:

Phase                       .text size    Cumulative savings
───────────────────────────────────────────────────────────
Original (-O2)              45,056        baseline
Step 1: -Os                 36,864        -18%
Step 2: gc-sections         30,720        -32%
Step 3: LTO                 28,672        -36%
Step 4: newlib-nano         18,432        -59%

From 44 KB to 18 KB of .text: nearly 60% of flash space saved!
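
The cumulative percentages in the table are each computed against the original -O2 size, not against the previous step. A small helper makes the arithmetic explicit and can be reused in a CI size report; the function name and the choice of tenths of a percent as the unit are assumptions for this sketch.

```c
#include <stdint.h>

/* Size reduction in tenths of a percent, rounded to nearest.
 * Example: 45056 -> 36864 bytes gives 182, i.e. 18.2%. */
uint32_t savings_per_mille(uint32_t before, uint32_t after) {
    if (before == 0 || after >= before) return 0;
    uint64_t saved = (uint64_t)(before - after);
    return (uint32_t)((saved * 1000 + before / 2) / before);
}
```

Using integer arithmetic keeps the helper usable on the target as well as on the host.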


Common Pitfalls

1. Over-Inlining

// This function gets inlined 20 times by -O3 = 20x original size
inline void update_display(int x, int y, int color) {
    // 50 lines of drawing logic
    ...
}

Solution: Use -Os or manually mark __attribute__((noinline)).

2. Forgetting gc-sections

# Wrong: added -ffunction-sections at compile, forgot --gc-sections at link
$ gcc -ffunction-sections -fdata-sections -c file.c
$ gcc file.o -o output   # Forgot -Wl,--gc-sections!

3. Debug Symbols Impact

Debug symbols don't increase Flash usage, but make the ELF file larger:

$ riscv64-unknown-elf-gcc -g -Os main.c -o debug.elf
$ riscv64-unknown-elf-gcc -Os main.c -o release.elf

$ ls -lh *.elf
-rwxr-xr-x 1 user user 245K debug.elf
-rwxr-xr-x 1 user user  35K release.elf

# But .text size is the same:
$ size debug.elf release.elf
   text    data     bss     dec     hex filename
  18432     512    8192   27136    6a00 debug.elf
  18432     512    8192   27136    6a00 release.elf

Summary

  • Optimization levels: -Os is usually the best choice for embedded, 15-25% smaller than -O2
  • Dead code elimination: -ffunction-sections -fdata-sections -Wl,--gc-sections saves 10-30%
  • LTO: Cross-compile-unit optimization, additional 5-15% savings (but increases compile time)
  • Standard library: newlib-nano can save 30-50 KB compared to newlib
  • Avoid: Floating-point operations, printf("%f"), C++ exceptions/RTTI
  • Optimization order: First analyze with tools (previous chapter), then choose appropriate compiler options (this chapter)

Chapter 21: Stack Analysis and Estimation

Part VI: Embedded Constraints


"Stack overflow is the most insidious bug in embedded systems—it corrupts silently and strikes randomly." — Miro Samek

The Unpredictable Crash

This is a classic embedded nightmare story.

The product had shipped. It was running smoothly. Customers were happy. Then suddenly, one customer reported: "The system randomly reboots. No pattern whatsoever."

The team checked all logs—no error messages. Checked power—stable. Checked temperature—normal range. The problem was like a ghost, impossible to reproduce.

Three weeks later, the issue was narrowed down to a specific usage scenario: when a user rapidly pressed three buttons in sequence, the system had a 30% chance of rebooting.

An engineer finally captured a crash dump and found the program counter (PC) pointing to a completely meaningless address.

The answer: Stack overflow.

The button interrupt handler had nested calls three levels deep into decoding logic, combined with a 512-byte local variable buffer. This combination only occurred under a specific sequence of operations that was never tested during development.

Stack overflow doesn't throw an exception. It doesn't leave an error message. It silently overwrites other memory regions, causing the system to crash at unpredictable times.

This chapter teaches you how to analyze, estimate, and monitor stack usage—before the problem becomes a field failure.


Stack Basics

What is the Stack?

The stack is a contiguous memory region used to store:

Stack contents:
────────────────────────────────────────
1. Function return addresses
2. Local variables
3. Function parameters (some calling conventions)
4. Saved register values (caller/callee saved)
5. Interrupt handler context

Stack Growth Direction

Most architectures have stacks that grow downward:

High addr   ┌───────────────────────┐
            │  Stack start          │ ← Initial SP value
            ├───────────────────────┤
            │  main() frame         │
            ├───────────────────────┤
            │  func_a() frame       │
            ├───────────────────────┤
            │  func_b() frame       │ ← Current SP
            ├───────────────────────┤
            │                       │
            │  Unused stack space   │
            │                       │
            ├───────────────────────┤
            │  Stack guard/bottom   │
Low addr    └───────────────────────┘

Why Stack Size Is Hard to Estimate

Static footprint problem:
─────────────────────────────────────────────────────────
.text, .data, .bss    → Fully determined at compile time
Stack                  → Only know actual usage at runtime

Difficulties:
1. Call depth varies dynamically
2. Recursion depth
3. Function pointer calls
4. Interrupt nesting
5. Conditional local variables (different paths use different sizes)

Static Analysis Tools

GCC's -fstack-usage

GCC can generate stack usage reports for each function:

$ riscv64-unknown-elf-gcc -fstack-usage -Os -c main.c
$ cat main.su
main.c:10:5:main	64	static
main.c:25:6:process_data	256	static
main.c:50:6:handle_interrupt	128	static

Output format:

file:line:column:function_name    stack_usage(bytes)    type

Type descriptions:
- static: Determinable at compile time
- dynamic: Uses VLA or alloca, size unknown
- bounded: Has upper limit but cannot be precisely determined

Pro tip: Find the largest stack users

$ find . -name '*.su' -exec cat {} + | sort -t$'\t' -k2,2nr | head -10
drivers/usb.c:120:6:usb_parse_descriptor	1024	static
app/json.c:45:6:parse_json_object	512	static
app/protocol.c:80:6:decode_frame	384	static
...

Checkstack Script

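The Linux kernel provides a checkstack.pl script (in its scripts/ directory) that analyzes stack usage directly from disassembly. Architecture support varies between kernel versions, so confirm that your copy handles your target: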

# Generate assembly and analyze
$ riscv64-unknown-elf-objdump -d firmware.elf | ./checkstack.pl riscv

0x80001234 usb_parse_descriptor [firmware.elf]:	1024
0x80002468 parse_json_object [firmware.elf]:	512
0x80003690 decode_frame [firmware.elf]:	384

Limitations

Static analysis has its limits:

Static analysis can handle:
✅ Fixed-size local variables
✅ Direct function call chains
✅ Call depth known at compile time

Static analysis cannot handle:
❌ Recursion depth (unless explicitly bounded)
❌ Function pointer calls
❌ Interrupt nesting levels
❌ VLA (Variable Length Array)
❌ alloca()
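
For the cases static analysis can handle, the worst-case depth is the largest sum of frame sizes along any call chain. The sketch below shows that computation over a small hand-built call graph. The data structure and names are assumptions for illustration; in practice the frame sizes would come from .su files and the call edges from a call-graph tool. It assumes no recursion and no indirect calls, and interrupt stack usage has to be added separately.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CALLEES 4

/* One node per function: its own frame size and the table indices of the
 * functions it calls directly (-1 marks an unused slot). */
typedef struct {
    const char *name;
    uint32_t frame_bytes;
    int callees[MAX_CALLEES];
} func_node_t;

/* Worst-case stack depth starting at function idx: its own frame plus the
 * deepest chain among its callees. Terminates only on an acyclic graph. */
uint32_t worst_case_stack(const func_node_t *table, int idx) {
    uint32_t deepest = 0;
    for (size_t i = 0; i < MAX_CALLEES; i++) {
        int callee = table[idx].callees[i];
        if (callee < 0) continue;
        uint32_t depth = worst_case_stack(table, callee);
        if (depth > deepest) deepest = depth;
    }
    return table[idx].frame_bytes + deepest;
}
```

Run on a host machine as part of the build, a check like this can fail the build when the computed worst case exceeds the configured stack size.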

Dynamic Measurement Methods

When static analysis is insufficient, we need dynamic measurement.

Stack Painting

This is the classic stack usage measurement method:

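#include <stddef.h>
#include <stdint.h>

#define STACK_PATTERN  0xDEADBEEF

// Defined in linker script: _stack_start is the lowest stack address,
// _stack_end the highest (the initial SP)
extern uint32_t _stack_start;
extern uint32_t _stack_end;

// Fill the unused part of the stack with the pattern at startup.
// Call before interrupts are enabled.
void stack_paint(void) {
    uint32_t *sp;
    __asm__ volatile ("mv %0, sp" : "=r"(sp));  // RISC-V; on ARM use "mov %0, sp"

    // Stop below the current SP so live frames are not overwritten
    uint32_t *p = &_stack_start;
    while (p < sp) {
        *p++ = STACK_PATTERN;
    }
}

// Measure stack high water mark (bytes used) at any time
size_t stack_get_high_water_mark(void) {
    // Untouched words still hold the pattern; scan up from the bottom
    uint32_t *p = &_stack_start;
    while (p < &_stack_end && *p == STACK_PATTERN) {
        p++;
    }

    return ((size_t)&_stack_end - (size_t)p);
}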

Usage:

int main(void) {
    stack_paint();  // Paint at startup
    
    // ... run application ...
    
    // Check periodically
    size_t used = stack_get_high_water_mark();
    printf("Stack high water mark: %zu bytes\n", used);
}

Important: This method only measures "maximum ever used"—it cannot guarantee future usage won't exceed this value.
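
The scan logic can be unit-tested on a host machine by running it against an ordinary array instead of the real stack. The sketch below is such a stand-in; the function names are assumptions, and the array plays the role of a downward-growing stack whose lowest address is low[0].

```c
#include <stddef.h>
#include <stdint.h>

#define PAINT_PATTERN 0xDEADBEEFu

/* Fill a region with the pattern. low is the lowest address of the region. */
void region_paint(uint32_t *low, size_t words) {
    for (size_t i = 0; i < words; i++) low[i] = PAINT_PATTERN;
}

/* For a downward-growing stack, untouched words sit at the low end. Count
 * them from the bottom and report everything above as used, in bytes. */
size_t region_high_water_bytes(const uint32_t *low, size_t words) {
    size_t untouched = 0;
    while (untouched < words && low[untouched] == PAINT_PATTERN) untouched++;
    return (words - untouched) * sizeof(uint32_t);
}
```

Two caveats apply to the technique in general: a stack word that happens to contain the pattern value is counted as untouched, and a large local buffer that is reserved but never written leaves painted words in the middle of the used region. In the second case the scan stops at the deepest written word, so the reported figure is still the correct high water mark as long as something below the buffer was written.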

Runtime Stack Monitoring

Add stack overflow detection in debug builds:

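#include <stdint.h>

#define STACK_GUARD_SIZE  64   // Bytes kept free above the stack bottom (example value)

extern uint32_t _stack_start;        // Lowest stack address, from linker script
void panic(const char *fmt, ...);    // Provided elsewhere in the firmware

// Check stack at each function entry
void __attribute__((no_instrument_function))
stack_check(void) {
    uintptr_t sp;
    __asm__ volatile ("mv %0, sp" : "=r"(sp));  // RISC-V; on ARM use "mov %0, sp"

    if (sp < (uintptr_t)&_stack_start + STACK_GUARD_SIZE) {
        // Stack overflow detected!
        panic("Stack overflow!");
    }
}

// GCC's -finstrument-functions can auto-insert
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *this_fn, void *call_site) {
    stack_check();
}

// The exit hook is called as well; define it unless your C library already does
void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *this_fn, void *call_site) {
}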

GCC compile option:

# Auto-insert hooks at function entry/exit
$ gcc -finstrument-functions -Os main.c -o firmware.elf

Note: This adds runtime overhead and code size—use only in debug builds.


RTOS Task Stack Sizing

In RTOS environments, each task has its own stack, making the problem more complex.

FreeRTOS Stack Configuration

// FreeRTOS task creation
#define TASK_STACK_SIZE  512  // words (2048 bytes on 32-bit)

xTaskCreate(
    task_function,
    "TaskName",
    TASK_STACK_SIZE,  // ← How to determine this?
    NULL,
    tskIDLE_PRIORITY + 1,
    &task_handle
);

Methods for Estimating Task Stack Size

Task Stack requirement =
    Context save size (RTOS framework)
  + Task function's own stack usage
  + Stack usage of all possible called functions
  + Interrupt handling (if sharing stack)
  + Safety margin (typically 25-50%)
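
The sum above can be sketched as a small helper. This is a minimal illustration, not part of FreeRTOS: the struct, its field names, and `stack_budget_words()` are hypothetical, and all inputs are assumed to be in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Inputs to the estimate, all in bytes. Names are illustrative only. */
typedef struct {
    size_t context_save;   /* RTOS context frame (port-specific) */
    size_t task_frame;     /* task function's own locals */
    size_t call_chain;     /* deepest chain of called functions */
    size_t isr_usage;      /* 0 if interrupts run on a separate stack */
    unsigned margin_pct;   /* safety margin in percent, e.g. 25 to 50 */
} stack_budget_t;

/* Sum the components, add the margin (rounded up), and convert to
 * whole stack words, which is the unit xTaskCreate() expects. */
size_t stack_budget_words(const stack_budget_t *b, size_t word_size)
{
    assert(b != NULL && word_size > 0);
    size_t base = b->context_save + b->task_frame + b->call_chain + b->isr_usage;
    size_t with_margin = base + (base * b->margin_pct + 99) / 100;
    return (with_margin + word_size - 1) / word_size;
}
```

For example, a 64-byte context, a 48-byte task frame, a 384-byte call chain, no ISR usage, and a 25% margin come to 620 bytes, or 155 words on a 32-bit target.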

FreeRTOS Context Save Size (ARM Cortex-M example):

Architecture          Context Size (bytes)
──────────────────────────────────────────
ARM Cortex-M0         64
ARM Cortex-M3/M4      64 (no FPU) / 200 (with FPU)
ARM Cortex-M7         64 (no FPU) / 232 (with FPU)
RISC-V RV32           64-128 (depends on ABI)

FreeRTOS Stack Monitoring

FreeRTOS provides built-in stack high water mark functionality:

// FreeRTOSConfig.h: enable overflow detection and the high water mark API
#define configCHECK_FOR_STACK_OVERFLOW       2
#define INCLUDE_uxTaskGetStackHighWaterMark  1

// Minimum free stack the task has ever had, in words (not bytes)
UBaseType_t stack_remaining = uxTaskGetStackHighWaterMark(task_handle);
printf("Task stack remaining: %u words\n", (unsigned)stack_remaining);

// Stack overflow hook (called when configCHECK_FOR_STACK_OVERFLOW >= 1)
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName) {
    // Handle stack overflow
    panic("Stack overflow in task: %s\n", pcTaskName);
}

Practical Recommendations

Task Stack Sizing Best Practices:
──────────────────────────────────────────────────────────

1. Initial configuration: Start with generous space (e.g., 2 KB)

2. Measure actual usage:
   - Run system, trigger all possible execution paths
   - Use uxTaskGetStackHighWaterMark() to check

3. Adjust configuration:
   - Actual usage + 25-50% safety margin
   - Example: Measured 800 bytes → Configure 1024-1200 bytes

4. Continuous monitoring:
   - Keep stack checking in production builds
   - Periodically log stack usage statistics
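
Steps 2 and 3 can be expressed as two small helpers. This is a sketch under the assumption that the high water mark is reported as the minimum number of free words, as `uxTaskGetStackHighWaterMark()` does. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes actually used, from the configured depth and the minimum
 * number of free words ever observed (the high water mark). */
size_t stack_used_bytes(size_t configured_words, size_t min_free_words,
                        size_t word_size)
{
    assert(min_free_words <= configured_words);
    return (configured_words - min_free_words) * word_size;
}

/* Recommended depth in words: measured usage plus margin_pct percent,
 * with both the margin and the word count rounded up. */
size_t recommended_stack_words(size_t used_bytes, unsigned margin_pct,
                               size_t word_size)
{
    assert(word_size > 0);
    size_t bytes = used_bytes + (used_bytes * margin_pct + 99) / 100;
    return (bytes + word_size - 1) / word_size;
}
```

A task configured with 512 words that reports 312 free words has used 800 bytes. A 25% margin then gives 250 words (1000 bytes) and a 50% margin gives 300 words (1200 bytes).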

Worst-Case Stack Depth (WCSD) Analysis

In safety-critical systems (automotive, medical, aerospace), you need to calculate Worst-Case Stack Depth (WCSD).

Call Graph Analysis

Build call graph, calculate deepest path:

main()           [64 bytes]
  ├── init()     [32 bytes]
  │     └── hal_init()  [128 bytes]
  └── loop()     [48 bytes]
        ├── read_sensor()  [96 bytes]
        │     └── spi_transfer()  [64 bytes]
        └── process()  [256 bytes]
              └── filter()  [128 bytes]

Path 1: main → init → hal_init
        = 64 + 32 + 128 = 224 bytes

Path 2: main → loop → read_sensor → spi_transfer
        = 64 + 48 + 96 + 64 = 272 bytes

Path 3: main → loop → process → filter
        = 64 + 48 + 256 + 128 = 496 bytes

WCSD = max(224, 272, 496) = 496 bytes
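
The deepest-path calculation above can be automated with a depth-first walk over the call graph. The sketch below is a host-side illustration only: the node layout, `MAX_CALLEES`, and `wcsd()` are hypothetical, and the graph is assumed to be acyclic (no recursion). The walk itself is recursive, which is acceptable for an analysis tool running on a workstation rather than on the target.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* One node per function: its own frame size and the functions it calls. */
typedef struct {
    const char *name;
    size_t frame_bytes;
    int callees[MAX_CALLEES];  /* indices into the graph array, -1 = unused */
} cg_node_t;

/* Worst-case stack depth starting at node idx: own frame plus the
 * deepest callee chain. Assumes the graph contains no cycles. */
size_t wcsd(const cg_node_t *graph, int idx)
{
    assert(graph != NULL && idx >= 0);
    size_t deepest_callee = 0;
    for (int i = 0; i < MAX_CALLEES; i++) {
        int callee = graph[idx].callees[i];
        if (callee < 0) continue;
        size_t d = wcsd(graph, callee);
        if (d > deepest_callee) deepest_callee = d;
    }
    return graph[idx].frame_bytes + deepest_callee;
}

/* The call graph from the text, with the same frame sizes. */
const cg_node_t example_graph[] = {
    /* 0 */ {"main",         64,  { 1,  3, -1, -1}},
    /* 1 */ {"init",         32,  { 2, -1, -1, -1}},
    /* 2 */ {"hal_init",     128, {-1, -1, -1, -1}},
    /* 3 */ {"loop",         48,  { 4,  6, -1, -1}},
    /* 4 */ {"read_sensor",  96,  { 5, -1, -1, -1}},
    /* 5 */ {"spi_transfer", 64,  {-1, -1, -1, -1}},
    /* 6 */ {"process",      256, { 7, -1, -1, -1}},
    /* 7 */ {"filter",       128, {-1, -1, -1, -1}},
};
```

Calling `wcsd(example_graph, 0)` yields 496 bytes, matching the hand calculation.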

Interrupt Impact

Must consider interrupts when calculating WCSD:

Application stack usage:        496 bytes
+ Interrupt handler:            128 bytes
+ Nested interrupt (if any):    128 bytes
────────────────────────────────────────────
Total WCSD:                     752 bytes

Tool Support

WCSD analysis tools:
──────────────────────────────────────────────────────────
Tool                    License       Features
Polyspace               Commercial    Formal verification
PC-lint Plus            Commercial    Static analysis
StackAnalyzer (AbsInt)  Commercial    Professional WCSD analysis
GCC -fstack-usage       Open source   Basic, requires manual call chain calc

Common Pitfalls

1. Large Local Variables

void process_packet(void) {
    char buffer[4096];  // 4 KB local variable!
    // ...
}

Solution: Use a static buffer or dynamic allocation

// Use static (moves to .bss, doesn't use stack; the function is no longer reentrant)
static char buffer[4096];

// Or use heap (if system allows); check for NULL and free() when done
char *buffer = malloc(4096);

2. Recursive Functions

int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);  // Depth = n
}

// factorial(1000) nests 1000 calls: at ~64 bytes per frame that is ~64 KB of stack!

Solution: Convert to loop

int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}

3. Complex Logic in Interrupts

void __attribute__((interrupt)) timer_isr(void) {
    char log_buffer[512];  // Large buffer in ISR
    sprintf(log_buffer, "...");
    process_complex_data();  // Call complex function
}

Solution: ISRs should be as short as possible; defer complex logic to main loop or task

volatile bool timer_flag = false;

void __attribute__((interrupt)) timer_isr(void) {
    timer_flag = true;  // Just set flag
}

void main_loop(void) {
    if (timer_flag) {
        timer_flag = false;
        process_complex_data();  // Handle in normal context
    }
}

4. VLA and alloca

void process(size_t n) {
    int data[n];  // VLA - stack usage depends on runtime value
    // ...
}

void another(size_t size) {
    void *buf = alloca(size);  // Same problem
    // ...
}

Solution: Avoid VLA and alloca; use static buffers or heap


Summary

  • Stack overflow is one of the hardest bugs to detect in embedded systems: silent corruption, random crashes
  • Static analysis:
    • GCC -fstack-usage: Generates per-function stack usage report
    • Limitation: Cannot handle recursion, function pointers, interrupt nesting
  • Dynamic measurement:
    • Stack painting: Fill with pattern, check high water mark later
    • Runtime monitoring: Check stack pointer at function entry
  • RTOS environments:
    • Each task has independent stack, estimate separately
    • FreeRTOS provides uxTaskGetStackHighWaterMark()
  • WCSD analysis: Safety-critical systems need worst-case stack depth calculation
  • Best practices:
    • Avoid large local variables
    • Avoid recursion
    • Keep ISRs short
    • Configure safety margin (25-50%)

Chapter 22: RTOS Footprint Case Study

Part VI: Embedded Constraints


"The best RTOS is the one that fits your constraints—memory, CPU, and development time." — Colin Walls

The Real Challenge of Choosing an RTOS

"Our MCU only has 64 KB flash and 8 KB RAM. Which RTOS should we choose?"

This is one of the most common questions embedded developers ask. The internet is full of comparison articles, but few answer this question in a data-driven way.

This chapter skips the marketing speak and feature lists. We measure with tools. We speak with data.

We'll analyze the footprint of three mainstream RTOSes:

  • FreeRTOS: The most popular open-source RTOS
  • Zephyr: A modern IoT RTOS
  • RT-Thread: A highly popular RTOS from China

Measurement Methodology

Before comparing, we need to establish fair measurement conditions.

Test Platform

Hardware: QEMU emulator (to avoid hardware differences)
Target: ARM Cortex-M4 (no FPU)
Compiler: GCC 13.2 (arm-none-eabi)
Optimization: -Os -flto -ffunction-sections -fdata-sections -Wl,--gc-sections

Test Scenarios

We define three test scenarios:

Scenario 1: Minimal
- Kernel scheduler only
- 1 task
- No other features

Scenario 2: Basic
- Kernel + semaphore + queue
- 3 tasks
- Timer service

Scenario 3: Typical
- Scenario 2 + shell/console
- 5 tasks
- Dynamic memory allocation

Measurement Method

# For each RTOS and scenario:
$ arm-none-eabi-size firmware.elf
$ arm-none-eabi-nm -S --size-sort firmware.elf > symbols.txt
$ bloaty firmware.elf -d compileunits > modules.txt

FreeRTOS Analysis

Minimal Configuration

// FreeRTOS minimal configuration
#define configUSE_PREEMPTION              1
#define configUSE_IDLE_HOOK               0
#define configUSE_TICK_HOOK               0
#define configMINIMAL_STACK_SIZE          64
#define configTOTAL_HEAP_SIZE             1024
#define configMAX_PRIORITIES              4
#define configUSE_MUTEXES                 0
#define configUSE_COUNTING_SEMAPHORES     0
#define configUSE_TIMERS                  0
#define configUSE_QUEUE_SETS              0

Measurement results:

$ arm-none-eabi-size freertos_minimal.elf
   text    data     bss     dec     hex filename
   3584     120    1152    4856    12f8 freertos_minimal.elf

Section breakdown:
  .text   = 3,584 bytes (kernel code)
  .data   = 120 bytes (initialized data)
  .bss    = 1,152 bytes (heap + TCB)

Main components:

$ arm-none-eabi-nm -S --size-sort -r freertos_minimal.elf | head -10
20000100 00000400 B ucHeap           # 1024 bytes heap
08000a40 00000280 T xTaskCreate
08000cc0 00000200 T vTaskSwitchContext
08000ec0 00000180 T xTaskIncrementTick
08001040 00000140 T prvIdleTask
...

Feature vs Footprint Table

FreeRTOS feature impact on footprint:

Feature                     .text increase    .bss increase
────────────────────────────────────────────────────────────
Basic kernel (1 task)       3,584             1,152
+ Semaphores                +320              +0
+ Mutexes                   +480              +0
+ Queues                    +640              +0
+ Timers                    +1,024            +256
+ Task notifications        +256              +0
+ Event groups              +512              +64
────────────────────────────────────────────────────────────
Typical configuration       ~6,500            ~1,500

Zephyr Analysis

Zephyr uses Kconfig for fine-grained configuration.

Minimal Configuration

# prj.conf for Zephyr minimal
CONFIG_KERNEL=y
CONFIG_MAIN_THREAD_PRIORITY=0
CONFIG_MAIN_STACK_SIZE=512
CONFIG_IDLE_STACK_SIZE=256
CONFIG_HEAP_MEM_POOL_SIZE=0
CONFIG_MINIMAL_LIBC=y

# Disable unneeded features
CONFIG_PRINTK=n
CONFIG_LOG=n
CONFIG_SHELL=n

Measurement results:

$ arm-none-eabi-size zephyr_minimal.elf
   text    data     bss     dec     hex filename
   5120     256    1280    6656    1a00 zephyr_minimal.elf

Analysis: Zephyr minimal is about 1.5 KB larger than FreeRTOS (.text). This is because Zephyr has a more complete abstraction layer and device model.

Feature vs Footprint Table

Zephyr feature impact on footprint:

Feature                     .text increase    .bss increase
────────────────────────────────────────────────────────────
Basic kernel                5,120             1,280
+ Semaphores                +128              +0
+ Mutexes                   +256              +0
+ Queues (k_msgq)           +384              +0
+ Timers                    +512              +128
+ Shell                     +12,000+          +2,000+
+ Logging                   +4,000+           +1,000+
+ Networking                +50,000+          +10,000+
────────────────────────────────────────────────────────────
Typical (no shell)          ~7,000            ~1,500
Typical (with shell)        ~20,000           ~4,000

RT-Thread Analysis

RT-Thread has a nano version specifically targeting minimal footprint.

Minimal Configuration (RT-Thread Nano)

// rtconfig.h for RT-Thread Nano
#define RT_THREAD_PRIORITY_MAX  8
#define RT_TICK_PER_SECOND      1000
#define RT_USING_OVERFLOW_CHECK
#define RT_USING_HOOK
#define RT_USING_IDLE_HOOK

// Disable most features
// #define RT_USING_SEMAPHORE
// #define RT_USING_MUTEX
// #define RT_USING_MAILBOX
// #define RT_USING_MESSAGEQUEUE

Measurement results:

$ arm-none-eabi-size rtthread_minimal.elf
   text    data     bss     dec     hex filename
   2816     96      896    3808    ee0 rtthread_minimal.elf

Analysis: RT-Thread Nano is the smallest of the three—only 2.8 KB .text.

Feature vs Footprint Table

RT-Thread feature impact on footprint:

Feature                     .text increase    .bss increase
────────────────────────────────────────────────────────────
Basic kernel (Nano)         2,816             896
+ Semaphores                +192              +0
+ Mutexes                   +256              +0
+ Mailbox                   +320              +0
+ Message queue             +384              +0
+ Timer                     +512              +128
+ FinSH shell               +10,000+          +2,000+
+ Device framework          +3,000+           +500+
────────────────────────────────────────────────────────────
Typical (no shell)          ~5,000            ~1,200
Typical (with shell)        ~15,000           ~3,500

Comparison Summary

Minimal Configuration Comparison

RTOS              .text      .data      .bss       Total
────────────────────────────────────────────────────────────
RT-Thread Nano    2,816      96         896        3,808
FreeRTOS          3,584      120        1,152      4,856
Zephyr            5,120      256        1,280      6,656

Visualization:

.text size (bytes):

RT-Thread Nano  ████████████████ 2,816
FreeRTOS        ████████████████████ 3,584
Zephyr          ████████████████████████████ 5,120
                0    1K   2K   3K   4K   5K   6K

Typical Configuration Comparison

RTOS              .text      .data      .bss       Total
────────────────────────────────────────────────────────────
RT-Thread         5,000      128        1,200      6,328
FreeRTOS          6,500      150        1,500      8,150
Zephyr            7,000      300        1,500      8,800

Feature Richness vs Footprint Trade-off

Footprint
    ▲
    │                              ● Zephyr          ← Most features, largest footprint
    │
    │                ● FreeRTOS                      ← Balanced
    │
    │  ● RT-Thread Nano                              ← Minimal footprint, fewer features
    │
    └──────────────────────────────────────────► Features

Selection Recommendations

When to Choose FreeRTOS

✅ Recommended when:
- Need mature, stable, widely-used solution
- Team already has FreeRTOS experience
- Need AWS IoT integration
- Memory constraints are moderate (32 KB+ flash)

❌ Not recommended when:
- Extremely tight memory (< 16 KB flash)
- Need advanced networking stack
- Need comprehensive device driver framework

When to Choose Zephyr

✅ Recommended when:
- Building IoT products with networking
- Need comprehensive device driver support
- Want modern build system (CMake + Kconfig)
- Have sufficient memory (64 KB+ flash)

❌ Not recommended when:
- Extremely tight memory constraints
- Simple bare-metal would suffice
- Team unfamiliar with Kconfig/devicetree

When to Choose RT-Thread

✅ Recommended when:
- Extremely tight memory (< 16 KB flash)
- Need minimal kernel (RT-Thread Nano)
- Chinese documentation/community preferred
- Need rich middleware (GUI, filesystem, etc.)

❌ Not recommended when:
- Need extensive English documentation
- Need AWS/Azure cloud integration
- Team unfamiliar with RT-Thread ecosystem

Optimization Techniques

Regardless of which RTOS you choose, these techniques help reduce footprint:

1. Disable Unused Features

// FreeRTOS example
#define configUSE_MUTEXES           0  // If not using mutexes
#define configUSE_RECURSIVE_MUTEXES 0
#define configUSE_COUNTING_SEMAPHORES 0
#define configUSE_QUEUE_SETS        0
#define configUSE_TASK_NOTIFICATIONS 0

2. Reduce Priority Levels

// Fewer priorities = smaller scheduler data structures
#define configMAX_PRIORITIES  4  // Instead of 32

3. Minimize Stack Sizes

// Measure actual usage, then add 25% margin
#define configMINIMAL_STACK_SIZE  64  // words
#define configTIMER_TASK_STACK_DEPTH 128

4. Use Static Allocation

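// FreeRTOS static allocation (no heap overhead)
#define configSUPPORT_STATIC_ALLOCATION 1
#define configSUPPORT_DYNAMIC_ALLOCATION 0
// Note: the application typically must also supply vApplicationGetIdleTaskMemory()

// Both buffers must outlive the task, so keep them static or global
static StaticTask_t xTaskBuffer;
static StackType_t xStack[128];
xTaskCreateStatic(task_func, "Task", 128, NULL, 1, xStack, &xTaskBuffer);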

5. Compiler Optimization

# Always use these for release builds
$ arm-none-eabi-gcc -Os -flto \
    -ffunction-sections -fdata-sections \
    -Wl,--gc-sections \
    --specs=nano.specs \
    ...

Case Study: Fitting into 32 KB Flash

Requirement: IoT sensor node with:

  • 3 tasks (sensor, communication, LED)
  • UART driver
  • Simple protocol parsing
  • 32 KB flash, 8 KB RAM

Initial attempt with FreeRTOS:

$ arm-none-eabi-size firmware.elf
   text    data     bss     dec     hex filename
  38912     512    4096   43520    aa00 firmware.elf

Problem: 38 KB > 32 KB limit!

Optimization steps:

Step 1: Disable unused features
        configUSE_TIMERS = 0
        configUSE_MUTEXES = 0
        Result: 35,840 bytes (-3 KB)

Step 2: Use newlib-nano
        --specs=nano.specs
        Result: 28,672 bytes (-7 KB)

Step 3: Replace printf with custom
        Custom uart_print_int()
        Result: 26,624 bytes (-2 KB)

Step 4: Enable LTO
        -flto
        Result: 24,576 bytes (-2 KB)

Step 5: Static allocation
        configSUPPORT_DYNAMIC_ALLOCATION = 0
        Result: 23,552 bytes (-1 KB)

Final: 23 KB (23,552 bytes) < 32 KB ✓
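
When checking a build against the limits, note that the `text` column alone is not the whole flash footprint: the initial values of `.data` are stored in flash too, and `.data` plus `.bss` count against RAM. The sketch below makes that check explicit. The type and function names are hypothetical, and stacks or heaps placed outside `.bss` by the linker script would need to be added to the RAM figure separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Section sizes as reported by arm-none-eabi-size, in bytes. */
typedef struct {
    size_t text;
    size_t data;
    size_t bss;
} elf_sizes_t;

/* Flash holds the code plus the initial values of .data. */
size_t flash_bytes(elf_sizes_t s) { return s.text + s.data; }

/* RAM holds .data (copied at startup) plus zero-initialized .bss. */
size_t ram_bytes(elf_sizes_t s) { return s.data + s.bss; }

/* True if the image fits both the flash and the RAM limit. */
bool fits(elf_sizes_t s, size_t flash_limit, size_t ram_limit)
{
    assert(flash_limit > 0 && ram_limit > 0);
    return flash_bytes(s) <= flash_limit && ram_bytes(s) <= ram_limit;
}
```

For the initial build above (text 38,912, data 512, bss 4,096), the flash footprint is 39,424 bytes, so it misses the 32 KB limit by slightly more than the `text` column suggests.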

Summary

  • Measurement methodology: Fair comparison requires identical conditions (compiler, optimization, platform)
  • Minimal footprint ranking: RT-Thread Nano (2.8 KB) < FreeRTOS (3.6 KB) < Zephyr (5.1 KB)
  • Feature trade-off: More features = larger footprint; choose based on actual needs
  • Selection criteria:
    • Extremely constrained: RT-Thread Nano
    • Balanced: FreeRTOS
    • Feature-rich IoT: Zephyr
  • Optimization techniques:
    • Disable unused features
    • Reduce priority levels and stack sizes
    • Use static allocation
    • Compiler optimization (-Os, LTO, gc-sections)
    • Use newlib-nano
  • Key principle: Measure, compare, then decide—don't rely on marketing claims

Chapter 23: Evolution of Performance Metrics

Part VII: AI/HPC


"The metrics you optimize for determine the systems you build." — Anonymous

The Presentation That Fell Flat

Marcus had spent three weeks benchmarking the company's new AI accelerator. His presentation to the executive team was packed with data: cache hit rates, branch prediction accuracy, instructions per cycle, memory bandwidth utilization. He was proud of the thoroughness.

The VP of Engineering interrupted five minutes in. "Marcus, this is great detail, but I need one number. How does this compare to the A100 we're currently using?"

Marcus pulled up his IPC comparison chart. "As you can see, our chip achieves 4.2 IPC compared to—"

"IPC?" The VP frowned. "Nobody talks about IPC for AI workloads. What's our TFLOPS? What's the tokens per second for Llama inference?"

Marcus stared at his slides. He'd spent weeks measuring the wrong things.

That evening, Marcus called his former colleague Sarah, now at a leading AI chip startup. "I feel like an idiot," he admitted. "I've been doing CPU performance analysis for fifteen years. When did everything change?"

Sarah laughed sympathetically. "It's not just you. The entire industry went through a metrics revolution. The numbers that mattered for compiling code and running databases are almost irrelevant for training transformers. Let me walk you through what happened."


From IPC to TOPS

Twenty years ago, the core metric for evaluating CPU performance was IPC (Instructions Per Cycle). Engineers used it to compare efficiency across different microarchitectures. A higher IPC meant the processor could execute more instructions in the same amount of time—a clear indicator of "better."

But Marcus discovered what many engineers learn the hard way: metrics that worked for one era can be meaningless in another.

Today, if you ask an AI engineer "what's the IPC of this GPU," they'd look at you the way a race car driver would look at someone asking about their vehicle's cup holder capacity. It's not wrong, exactly—it's just irrelevant.

Modern AI/HPC uses completely different metrics. Where CPUs were measured in instructions, AI accelerators are measured in operations:

Era            Primary Metric                  What It Measures
──────────────────────────────────────────────────────────────────────────────────────
1990s-2000s    IPC (Instructions Per Cycle)    CPU efficiency for general-purpose code
2010s          GFLOPS (Billion FP ops/sec)     GPU compute for graphics and early ML
2020s          TFLOPS/TOPS                     AI accelerator throughput at various precisions

TFLOPS (Tera Floating-point Operations Per Second) has replaced IPC as the primary metric. Furthermore, TOPS (Tera Operations Per Second) describes performance for low-precision operations (INT8, INT4) that dominate AI inference.

Why Do Metrics Evolve?

This change reflects several fundamental shifts in how we compute:

1. Vectorization of Compute Units

Think about what happens when you execute an ADD instruction on a traditional CPU: you add two numbers and get one result. One instruction, one operation.

Now consider a Tensor Core on an NVIDIA H100. A single matrix-multiply-accumulate (MMA) instruction performs 256 FP16 multiply-add operations simultaneously. If you measured this in "instructions per cycle," you'd get a small number—maybe 1 or 2. But in terms of actual useful work for AI, it's doing 256 times more than a CPU instruction.

Traditional CPU:
  ADD r1, r2, r3    →  1 addition

Tensor Core (NVIDIA H100):
  MMA.F16           →  256 FP16 multiply-add operations

Comparing IPC between these two architectures would be like comparing delivery trucks by "trips per hour" when one truck carries 1 box per trip and the other carries 256. The metric doesn't capture what matters.

2. The Memory Wall Changed the Game

When Sarah explained this to Marcus, she drew a simple graph on a napkin. "Compute capability has been growing at roughly 2x every two years," she said. "Memory bandwidth? Maybe 1.3x. After two decades, compute is 1000x faster while memory is only about 14x faster."

This means that for many workloads, the processor spends most of its time waiting for data. Measuring "instructions per second" becomes meaningless when the bottleneck is "bytes per second." The Roofline model, which we'll explore shortly, captures this reality.
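
The compounding behind Sarah's napkin sketch is easy to check. The helper below is a minimal sketch with a hypothetical name. It multiplies a per-period growth rate over a number of periods, where one period is taken to be two years.

```c
#include <assert.h>

/* Cumulative growth after `periods` steps, each multiplying by
 * `rate_per_period` (for example 2.0 for "2x every two years"). */
double compound_growth(double rate_per_period, unsigned periods)
{
    assert(rate_per_period > 0.0);
    double total = 1.0;
    for (unsigned i = 0; i < periods; i++) {
        total *= rate_per_period;
    }
    return total;
}
```

Over ten two-year periods, a rate of 2.0 compounds to 1024x while a rate of 1.3 compounds to roughly 13.8x, so the gap between compute and memory bandwidth widens by about 74x in twenty years under these assumed rates.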

3. Workloads Became More Homogeneous

Traditional CPU workloads are diverse: branching code, pointer chasing, string manipulation, system calls. Every program is different, so a general metric like IPC made sense.

AI workloads are remarkably similar at their core: they're dominated by matrix multiplication (GEMM). Whether you're training a vision model, a language model, or a recommendation system, 80-95% of the compute is matrix multiply. For such homogeneous workloads, measuring FLOPS directly is far more meaningful than counting abstract "instructions."

The New Metrics Landscape

Here's how the transition from traditional to modern metrics looks across different dimensions:

Traditional Metric        Modern Metric                   Why It Changed
──────────────────────────────────────────────────────────────────────────────────────────────────
CPI / IPC                 FLOPS / TOPS                    Single instruction now does hundreds of ops
Memory Bandwidth          Roofline Model                  Compute/bandwidth ratio determines bottleneck
Amdahl's Law              Comm-to-Compute Ratio           Network, not serial code, limits scaling
Latency vs. Throughput    TTFT vs. TPS                    LLM streaming creates new user experience
Power (W)                 Energy Efficiency (GFLOPS/W)    Electricity is now a major cost center

Let's explore each evolution in depth, starting with the most fundamental: the shift from counting instructions to counting operations.

CPI/IPC → FLOPS/TOPS

Traditional: CPI and IPC

CPI (Cycles Per Instruction) and IPC (Instructions Per Cycle) are reciprocal metrics:

CPI = Total_Cycles / Total_Instructions
IPC = Total_Instructions / Total_Cycles = 1 / CPI

These metrics assume:

  • Each instruction has roughly equal "value"
  • Program performance is proportional to instruction count
  • Microarchitecture improvements are reflected in IPC gains

Modern: FLOPS and TOPS

FLOPS (Floating-point Operations Per Second) directly measures compute capability:

GFLOPS = Floating_Point_Operations / Time / 10^9
TFLOPS = GFLOPS / 1000

TOPS (Tera Operations Per Second) is used for integer operations, especially low-precision AI inference:

TOPS = Operations / Time / 10^12
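
As a worked instance of these formulas, the sketch below computes achieved TFLOPS for a dense matrix multiply from a measured run time. It uses the common convention that one multiply-add counts as two floating-point operations. The function names are hypothetical.

```c
#include <assert.h>

/* A dense M x N x K matrix multiply performs M*N*K multiply-adds,
 * conventionally counted as 2*M*N*K floating-point operations. */
double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Achieved TFLOPS: operations / time / 10^12, time in seconds. */
double achieved_tflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e12;
}
```

A 4096 x 4096 x 4096 GEMM is about 137 billion operations, so finishing it in 1 ms would correspond to roughly 137 TFLOPS.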

TOPS at Different Precisions

Modern AI accelerators support multiple precisions, each with different TOPS:

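NVIDIA H100 SXM Theoretical Peak:
┌──────────────────────┬──────────────┐
│  Precision           │  Peak        │
├──────────────────────┼──────────────┤
│  FP64                │   34 TFLOPS  │
│  FP64 (Tensor Core)  │   67 TFLOPS  │
│  FP32                │   67 TFLOPS  │
│  TF32 (Tensor Core)  │  989 TFLOPS* │
│  FP16 (Tensor Core)  │ 1979 TFLOPS* │
│  FP8  (Tensor Core)  │ 3958 TFLOPS* │
│  INT8 (Tensor Core)  │ 3958 TOPS*   │
└──────────────────────┴──────────────┘
* With structured sparsity; dense throughput is half these values.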

Note: These are theoretical peaks. Actual performance is typically 50-80% of peak, depending on workload characteristics.

Peak vs Sustained

When reporting FLOPS, you must distinguish:

  • Peak FLOPS: Theoretical maximum, assuming 100% utilization
  • Sustained FLOPS: Performance maintainable under actual workloads

Typical ratios:
  Matrix multiply (GEMM):   70-95% of peak
  Convolution (Conv):       50-80% of peak
  Attention mechanism:      30-70% of peak
  Memory-intensive ops:     10-30% of peak

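Memory Bandwidth → Roofline Model

The Roofline model plots attainable performance against arithmetic intensity (AI), the number of FLOPs performed per byte moved to or from memory. Attainable performance is the smaller of peak compute and memory bandwidth × AI.

Roofline diagram construction: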

Performance (GFLOPS)
     ^
     |                 Ridge Point
     |                      \
     |                       ________________________ Peak Compute (roof)
     |                      /
     |                     /     ◄─── Compute-Bound Region (right of the ridge point):
     |                    /           performance is flat at the compute ceiling
     |                   /
     |                  /
     |                 /  ◄─── Memory-Bound Region (left of the ridge point):
     |                /        performance scales linearly with arithmetic intensity (AI)
     |               /         (slope = memory bandwidth)
     |              /
     |             /
     |            /
     |───────────┴──────────────────────────────────────> Arithmetic Intensity
                                                          (FLOPs/Byte)

Interpreting Roofline

Two "roofs":

  1. Horizontal line: Peak Compute (compute ceiling)
  2. Diagonal line: Memory Bandwidth × AI (memory ceiling)

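An application's bottleneck is whichever roof lies directly above it at its arithmetic intensity. The ranges below are illustrative; the real boundary is the hardware's ridge point (peak compute divided by memory bandwidth):

AI (FLOPs/Byte)  Bottleneck        Typical Applications
───────────────────────────────────────────────────────
    < 10         Memory            Vector add, STREAM
   10-50         Boundary          Sparse matrix, Conv2D
   50-200        Boundary/Compute  Dense matrix multiply
    > 200        Compute           Highly optimized GEMM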
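
The two roofs reduce to a one-line formula: attainable performance is the minimum of peak compute and bandwidth times arithmetic intensity. The sketch below encodes that, along with the ridge point where the roofs meet. The function names are hypothetical, and the units are assumed to be GFLOPS, GB/s, and FLOPs/byte.

```c
#include <assert.h>

/* Ridge point: the arithmetic intensity (FLOPs/byte) at which the
 * memory roof reaches the compute roof. */
double ridge_point(double peak_gflops, double bandwidth_gbs)
{
    assert(bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}

/* Attainable performance = min(peak compute, bandwidth * AI). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double ai)
{
    assert(ai >= 0.0);
    double memory_roof = bandwidth_gbs * ai;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

For a hypothetical device with 10,000 GFLOPS peak and 500 GB/s of bandwidth, the ridge point is 20 FLOPs/byte. A kernel at 2 FLOPs/byte is capped at 1,000 GFLOPS by memory, while a kernel at 100 FLOPs/byte can reach the full 10,000 GFLOPS.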

Amdahl's Law → Communication-to-Computation

When Marcus first learned parallel programming, Amdahl's Law was gospel. "If 10% of your code is sequential," his professor had said, "you can never get more than 10x speedup, no matter how many processors you throw at it."

That mental model worked fine for multi-core CPUs. But when Marcus started working on distributed AI training across hundreds of GPUs, he discovered a new bottleneck that Amdahl never considered: the network.

Traditional: Amdahl's Law

Amdahl's Law describes the theoretical speedup limit of parallelization:

Speedup = 1 / ((1 - P) + P/N)

Where:
  P = parallelizable fraction
  N = number of processors

If 95% of your code is parallelizable (P = 0.95), then even with infinite processors, your speedup is limited to 1 / 0.05 = 20x. The sequential 5% becomes the ceiling.
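
The formula and its limit translate directly into code. This is a minimal sketch with hypothetical function names. `p` is the parallelizable fraction and `n` the processor count, as defined above.

```c
#include <assert.h>

/* Amdahl's Law: speedup for parallel fraction p on n processors. */
double amdahl_speedup(double p, double n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound as n grows without limit: 1 / (1 - p). Requires p < 1. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.9 and 10 processors the speedup is about 5.3x, well short of 10x. With p = 0.95, even a billion processors stay just under the 20x ceiling.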

This law assumes that parallel work is truly parallel—that processors can work independently without coordination. In a shared-memory multi-core system, this is approximately true.

Modern: Communication-to-Computation Ratio

In distributed AI training, a new bottleneck emerges: communication.

Typical data-parallel training:

GPU 0 ─┬── Forward ──┬── Backward ──┬── AllReduce ──┐
GPU 1 ─┤             │              │               │
GPU 2 ─┤             │              │               │
GPU 3 ─┘             │              │               │
                     ▼              ▼               ▼
                 Compute         Compute       Communication

AllReduce operations need to synchronize gradients across all GPUs—this is the main communication bottleneck.

Communication-to-Computation Ratio

C2C Ratio = Communication_Time / Computation_Time

Ideal: C2C << 1 (communication time much less than compute time)
Reality: C2C worsens as GPU count increases

Influencing factors:

  1. Model size: Larger gradients mean more data to transfer
  2. Batch size: Larger batches increase compute time, improving C2C
  3. Network bandwidth: InfiniBand vs Ethernet makes huge difference
  4. AllReduce algorithm: Ring AllReduce, Hierarchical AllReduce
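
To see how these factors interact, the sketch below estimates the C2C ratio for Ring AllReduce using the standard bandwidth-only cost model, in which each of N GPUs sends and receives 2(N-1)/N times the gradient size. Per-message latency and any overlap of communication with the backward pass are ignored here, and the function names are hypothetical.

```c
#include <assert.h>

/* Bandwidth-only estimate of Ring AllReduce time in seconds.
 * Each GPU transfers 2*(n-1)/n of the gradient buffer over its link. */
double ring_allreduce_seconds(double gradient_bytes, unsigned n_gpus,
                              double link_bytes_per_sec)
{
    assert(n_gpus >= 1 && link_bytes_per_sec > 0.0);
    double n = (double)n_gpus;
    return 2.0 * (n - 1.0) / n * gradient_bytes / link_bytes_per_sec;
}

/* C2C ratio = communication time / computation time. */
double c2c_ratio(double comm_seconds, double compute_seconds)
{
    assert(compute_seconds > 0.0);
    return comm_seconds / compute_seconds;
}
```

With 1 GB of gradients and a 1 GB/s link, 8 GPUs need about 1.75 s per AllReduce. If a training step computes for 7 s, the C2C ratio is 0.25. Halving the compute time doubles the ratio to 0.5, so shrinking the per-GPU batch to speed up each step makes communication relatively more expensive.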

Latency vs. Throughput → TTFT vs. TPS

The traditional latency/throughput trade-off still exists in AI, but LLMs introduced something new: the user experience of streaming.

When you chat with an LLM, you don't wait for the entire response to be generated before seeing anything. The words appear one by one, like watching someone type in real-time. This streaming experience created entirely new metrics that capture what users actually care about.

Traditional: Latency and Throughput

For a traditional web service, performance is simple:

  • Latency: How long until the response is complete?
  • Throughput: How many requests can we handle per second?

You optimize for one, the other, or some balance. A user either sees the result or doesn't.

Modern: LLM's TTFT and TPS

LLMs broke this model because users experience the response progressively. A 10-second response feels fast if words start appearing immediately. The same 10-second response feels slow if there's a 3-second pause before anything appears.

This led to two new metrics that capture different aspects of user experience:

TTFT (Time To First Token): How long until the user sees something?

This is the "perceived responsiveness" metric. Users are more tolerant of slow generation if the response starts quickly. TTFT is dominated by the Prefill phase—processing the entire input prompt before any output can begin.

TPS (Tokens Per Second): How fast do subsequent words appear?

TPS = Tokens generated per second (Decode phase)

Factors affecting TPS:
1. KV Cache size
2. Batch size
3. Memory bandwidth (usually the bottleneck)

Two Phases of LLM Inference

Request processing flow:

[Input Prompt] ──► [Prefill] ──► [Decode × N] ──► [Complete]
                    ▼               ▼
                 TTFT            TPOT × N

Where:
  TTFT = Time To First Token
  TPOT = Time Per Output Token
  N = number of output tokens

Total latency = TTFT + (N × TPOT)
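
The latency formula can be sketched as follows. The function names are hypothetical, times are in seconds, and the decode-phase TPS for a single request is taken to be the reciprocal of TPOT.

```c
#include <assert.h>

/* Total latency = TTFT + N * TPOT, as in the formula above. */
double total_latency_s(double ttft_s, double tpot_s, unsigned n_tokens)
{
    assert(ttft_s >= 0.0 && tpot_s >= 0.0);
    return ttft_s + (double)n_tokens * tpot_s;
}

/* Decode-phase tokens per second for one request: 1 / TPOT. */
double decode_tps(double tpot_s)
{
    assert(tpot_s > 0.0);
    return 1.0 / tpot_s;
}
```

With an assumed TTFT of 0.5 s and a TPOT of 0.25 s, a 4-token reply takes 1.5 s in total and streams at 4 tokens per second once decoding starts.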

Prefill vs Decode Characteristics

Phase       Characteristics                    Bottleneck
──────────────────────────────────────────────────────────────────────
Prefill     Processes the prompt in parallel   Compute-bound
            (many positions at once)           (uses Tensor Cores)

Decode      Autoregressive generation          Memory-bound
            (1 token at a time)                (must read the KV Cache)

This explains why:

  • TTFT increases with prompt length
  • TPS depends only weakly on prompt length (it degrades gradually as the KV Cache grows)
  • Batch processing can significantly improve overall throughput

Power → Energy Efficiency

Traditional: Power (Watts)

Early system evaluation treated power as a "constraint" rather than a "metric":

Traditional thinking:
  "This CPU draws 100W, ensure adequate cooling"
  "Server room power capacity is XX kW"

Power was seen as a problem to "handle," not a target to "optimize."

Modern: Energy Efficiency

In two extreme scenarios, energy efficiency becomes a core metric:

1. Hyperscale Data Centers

Training GPT-4 scale models:
  - Thousands of GPUs
  - Megawatts of power consumption
  - Electricity becomes major cost

Efficiency metrics: GFLOPS/W, TOPS/W

2. Edge Devices

AI inference on phones/IoT:
  - Limited battery capacity
  - Thermal Design Power (TDP) limits
  - User experience affected by heat

Efficiency metrics: Inferences/mAh, TOPS/W

Epilogue: Marcus's Second Presentation

Three weeks later, Marcus gave his second presentation to the executive team. This time, his slides told a different story:

"Our chip achieves 847 TFLOPS at FP16, putting it between the A100 and H100. For Llama-70B inference at batch size 1, we measure 23 tokens per second—competitive with the A100."

He showed a Roofline diagram. "We're currently memory-bound for decode-heavy workloads, achieving 78% of theoretical bandwidth. For prefill-heavy workloads, we hit 65% of peak compute."

The VP nodded. "Now I understand what we're buying. Good work."

After the meeting, Sarah texted him: "Heard it went well. What changed?"

Marcus replied: "I stopped measuring what's easy and started measuring what matters."

That's the first lesson of AI/HPC performance analysis: know which metrics matter for your workload, because the right metric is worth more than a thousand benchmarks.


Summary

Performance metric evolution reflects fundamental changes in computing workloads:

From Single to Multi-dimensional

  • Traditional: Single IPC or MHz could explain performance
  • Modern: Need combination of metrics (FLOPS, bandwidth, efficiency, latency)

From Absolute to Relative

  • Traditional: Pursue highest absolute performance
  • Modern: Pursue best "ratios" (Roofline, efficiency)

From Hardware to Application

  • Traditional: Hardware specs determine performance
  • Modern: Application characteristics (arithmetic intensity, C2C, TTFT) determine evaluation approach

Key Metric Evolution

  • IPC → FLOPS/TOPS: Vectorized computation
  • Bandwidth → Roofline: Compute/bandwidth ratio
  • Amdahl → C2C: Communication becomes new bottleneck
  • Latency → TTFT: LLM-specific metrics
  • Power → Efficiency: Performance per watt
  • Code Size → Quantization: Reduce data movement

Chapter 24: AI/ML Benchmarks

Part VII: AI/HPC


"In machine learning, the only benchmark that matters is your production workload." — Unknown

The Benchmark That Proved Nothing

Linda was evaluating AI accelerator chips for her company's inference infrastructure. Vendor A claimed "500 TOPS," Vendor B claimed "450 TOPS." Easy decision, right? Go with the bigger number.

She ran her actual workload—a BERT-based text classifier—on both chips. Vendor B, with the "smaller" TOPS number, was 40% faster.

"How is this possible?" she asked Vendor A's sales engineer.

He shifted uncomfortably. "Well, our 500 TOPS is at INT4 precision. Your model uses FP16. And our number is for batch size 256—you're running batch size 1. Also, that's peak theoretical throughput. Sustained performance depends on..."

Linda cut him off. "So that number on your datasheet is essentially meaningless for my use case?"

"I wouldn't say meaningless..."

This is the fundamental problem with AI benchmarks. Unlike traditional software where "faster" has a clear meaning, AI performance depends on model architecture, precision, batch size, optimization level, and a dozen other factors. Two chips with identical specs can have 3x performance differences on real workloads.

Why AI Benchmarking Is Different

AI performance evaluation is more complex than traditional software for three fundamental reasons:

1. "Correct" Is Not Binary

When a sorting algorithm produces [1, 3, 2, 4, 5], it's wrong. Period. But when an image classifier achieves 76.3% accuracy instead of 78.1%, is that acceptable? It depends on the application, the cost savings, the latency requirements. Traditional benchmarks measure speed at fixed correctness. AI benchmarks must navigate a speed-accuracy trade-off.

2. Performance and Accuracy Are Intertwined

Quantization—running a model at lower precision—makes inference faster but slightly less accurate:

Precision    Relative Speed    Typical Accuracy Loss
─────────────────────────────────────────────────────
FP32         1.0x              0% (baseline)
FP16         ~2.0x             -0.1%
INT8         ~2.5x             -0.5%
INT4         ~4.0x             -2.0%

A benchmark that only reports speed is incomplete. A benchmark that only reports accuracy ignores practical constraints.
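
One way to make the trade-off concrete is to fix an accuracy budget and pick the fastest precision that stays within it. The sketch below uses the illustrative figures from the table above (losses written as positive percentage points); the function name and the budgets are assumptions, and real numbers must come from measuring your own model.

```python
# (precision, relative speed, accuracy loss in percentage points), taken from the
# illustrative table above. These are not measurements of any specific model.
PRECISIONS = [
    ("FP32", 1.0, 0.0),
    ("FP16", 2.0, 0.1),
    ("INT8", 2.5, 0.5),
    ("INT4", 4.0, 2.0),
]


def fastest_within_budget(max_loss_pct: float) -> str:
    """Return the fastest precision whose accuracy loss fits the budget."""
    candidates = [p for p in PRECISIONS if p[2] <= max_loss_pct]
    if not candidates:
        raise ValueError("no precision satisfies the accuracy budget")
    return max(candidates, key=lambda p: p[1])[0]


print(fastest_within_budget(0.5))   # INT8
print(fastest_within_budget(0.05))  # FP32
```

A benchmark report that states both the budget and the chosen precision lets readers see which side of the trade-off was prioritized.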

3. Hardware Diversity Creates Comparison Nightmares

The same PyTorch model can run on CPUs, NVIDIA GPUs, AMD GPUs, Google TPUs, Apple Neural Engine, Qualcomm NPUs, Intel Gaudi, and dozens of custom ASICs. Each platform has different optimization paths, different supported operations, different precision formats. Comparing "apples to apples" requires carefully controlled methodology.

MLPerf: The Industry Standard

The industry's answer to the AI benchmarking problem is MLPerf, maintained by MLCommons—a consortium including NVIDIA, Google, Intel, AMD, Meta, Microsoft, and dozens of other companies. MLPerf attempts to create standardized, reproducible benchmarks that allow meaningful comparisons across different hardware platforms.

Think of MLPerf as the SPEC CPU of the AI world: a standardized suite with strict rules about what you can and cannot change.

The MLPerf Family

MLPerf isn't a single benchmark—it's a family of benchmarks targeting different scenarios:

Benchmark          What It Measures                    Typical Submitters
──────────────────────────────────────────────────────────────────────────────────────────
MLPerf Training    Time to train to target accuracy    NVIDIA, Google, Intel, hyperscalers
MLPerf Inference   Inference latency and throughput    AI chip startups, cloud providers
MLPerf HPC         ML on supercomputers                National labs, research institutions
MLPerf Tiny        Performance on microcontrollers     Embedded chip vendors
MLPerf Mobile      Performance on phones/tablets       Qualcomm, MediaTek, Samsung
MLPerf Storage     Data pipeline performance           Storage vendors

For most readers of this book, Training and Inference are the benchmarks you'll encounter most often.

MLPerf Training: Racing to Accuracy

The Training benchmark measures one thing: how long does it take to train a model from random initialization to a specified target accuracy?

Benchmark        Model             Dataset         Target Accuracy
─────────────────────────────────────────────────────────────────
ResNet           ResNet-50 v1.5    ImageNet        75.9% Top-1
RetinaNet        RetinaNet         Open Images     34.0% mAP
BERT             BERT-Large        Wikipedia       0.72 Mask-LM accuracy
DLRM             DLRM              Criteo          0.8025 AUC
3D U-Net         3D U-Net          KiTS19          0.908 Mean Dice
GPT-3            GPT-3 175B        C4              2.69 log perplexity
Stable Diffusion Stable Diffusion  LAION-400M      FID ≤ 90, CLIP ≥ 0.15

Result Format

Typical result:

System: 8x NVIDIA H100 SXM5
Benchmark: BERT-Large Training
Time: 2.3 minutes (to reach 0.72 Mask-LM accuracy)

Comparison:
  DGX H100 (8 GPU):     2.3 min
  DGX A100 (8 GPU):     5.1 min
  Cloud TPU v4 (16):    3.8 min

Closed vs Open Division

Closed Division:
  - Must use specified model architecture
  - Must achieve specified accuracy
  - Can only adjust batch size, learning rate, etc.
  - Purpose: Fair hardware comparison

Open Division:
  - Can modify model architecture
  - Can use different optimization techniques
  - Purpose: Showcase innovative methods

MLPerf Inference

MLPerf Inference measures inference performance under realistic deployment scenarios.

Scenario Definitions

Scenario      Description                 Primary Metric
─────────────────────────────────────────────────────────
Server        Concurrent requests,        QPS (within latency SLO)
              latency constraints
Offline       Batch processing,           Throughput (samples/sec)
              no latency constraints
SingleStream  One request at a time       Latency (ms)
MultiStream   Multiple independent        Number of streams
              streams

Server Scenario Details

Server scenario simulates real services:

Request arrives → Queue → Process → Response
                    ↑
              Latency constraint

SLO (Service Level Objective):
  - Example: 99% of requests must complete within 15ms
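
A latency SLO of this form is a statement about a percentile of the per-request latency distribution. The following minimal Python sketch checks such a constraint on recorded latencies; the helper names and the synthetic latency lists are assumptions made for illustration, and this is not the MLPerf load generator.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, pct=99.0, bound_ms=15.0):
    """True if the pct-th percentile latency is within the bound."""
    return percentile(latencies_ms, pct) <= bound_ms


# Synthetic run 1: 1% of requests are slow, so the 99th percentile is still fast.
run_ok = [8.0] * 990 + [40.0] * 10
print(percentile(run_ok, 99), meets_slo(run_ok))    # 8.0 True

# Synthetic run 2: 1.5% of requests are slow, so the 99th percentile is a slow one.
run_bad = [8.0] * 985 + [40.0] * 15
print(percentile(run_bad, 99), meets_slo(run_bad))  # 40.0 False
```

The two runs have nearly identical mean latency, which is why the Server scenario is defined on a tail percentile rather than on the mean.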

DeepBench

DeepBench benchmarks the core operations underlying deep learning rather than complete models.

Measurement Items

1. GEMM (General Matrix Multiplication)
   - Matrix sizes: from small (256×256) to large (4096×4096)
   - Precision: FP32, FP16, INT8
   - Simulates: Fully connected layers, Attention

2. Convolution
   - Various kernel sizes (1×1, 3×3, 5×5)
   - Stride, padding variations
   - Simulates: CNN convolution layers

3. RNN (Recurrent Neural Networks)
   - LSTM, GRU
   - Different hidden sizes and sequence lengths
   - Simulates: Sequence models

4. All-Reduce
   - Measures distributed training communication performance
   - Different data sizes and GPU counts

Typical Results

NVIDIA A100 DeepBench Results:

GEMM (4096×4096, FP16):
  Peak:      312 TFLOPS
  Achieved:  285 TFLOPS (91.3%)

Conv2D (3×3, 256 channels):
  Peak:      312 TFLOPS
  Achieved:  198 TFLOPS (63.5%)

Reason: Convolution requires more memory access, cannot achieve pure GEMM efficiency
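
The "achieved" figures come from dividing the operation count by the measured time. For a dense GEMM that multiplies an m×k matrix by a k×n matrix, the count is 2·m·n·k (one multiply and one add per term). A small Python sketch of the arithmetic follows; the function names are illustrative, and the peak and achieved values are the ones quoted above.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in a dense (m x k) @ (k x n) product."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput implied by a measured GEMM time."""
    return gemm_flops(m, n, k) / seconds / 1e12


def efficiency_pct(achieved: float, peak: float) -> float:
    """Achieved throughput as a percentage of peak (same units for both)."""
    return 100.0 * achieved / peak


print(gemm_flops(4096, 4096, 4096))            # 137438953472
print(f"{efficiency_pct(285.0, 312.0):.1f}%")  # 91.3%
print(f"{efficiency_pct(198.0, 312.0):.1f}%")  # 63.5%
```

At 285 TFLOPS a single 4096×4096×4096 GEMM takes under half a millisecond, so real measurements average over many repetitions.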

Other AI Benchmarks

AI Benchmark (ETH Zürich)

AI performance testing designed specifically for mobile devices:

Features:
- Targets phone NPUs and GPUs
- Covers multiple AI tasks
- Has Android App for direct testing

Test items:
1. Image Classification (MobileNet, EfficientNet)
2. Object Detection (YOLO, SSD)
3. Image Segmentation
4. Face Recognition
5. Super Resolution
6. Language Models

Result format:
  Total score + individual scores
  Comparable with other devices

DAWNBench

Developed by Stanford, focuses on "cost to train to target accuracy":

Core metrics:
  Time-to-Accuracy
  Cost-to-Accuracy ($)

Example:
  "How much time/money to reach 93% Top-5 accuracy on ImageNet?"

This metric is more practical:
  - Not just speed, but also cost
  - Considers cloud computing pricing

Geekbench ML

Cross-platform consumer AI benchmark:

Pros:
- Easy to run (download app)
- Cross-platform (Windows, macOS, iOS, Android)
- Result database for easy comparison

Cons:
- Not transparent (doesn't disclose all details)
- May be targeted for optimization
- Not suitable for serious performance analysis

Running MLPerf: Practical Guide

Environment Setup

# 1. Get MLPerf code
git clone https://github.com/mlcommons/inference.git
cd inference

# 2. Choose benchmark (ResNet-50 as example)
cd vision/classification_and_detection

# 3. Prepare dataset
# ImageNet validation set (50,000 images)
# Need to download from official source

# 4. Install dependencies
pip install -r requirements.txt

Running Inference Benchmark

# SingleStream scenario (measure latency)
python3 main.py --backend onnxruntime \
    --model resnet50 \
    --scenario SingleStream \
    --accuracy

# Server scenario (measure QPS)
python3 main.py --backend onnxruntime \
    --model resnet50 \
    --scenario Server \
    --qps 100

Interpreting Results

MLPerf Inference Result Example:

TestScenario.SingleStream:
  qps: 156.25
  latency (ns): 6400000 (6.4 ms)

result summary:
  samples processed: 50000
  accuracy: 76.15%
  reference accuracy (FP32): 76.46%
  target accuracy (99% of reference): 75.70%

Result: 6.4ms latency, 76.15% accuracy (above the 99%-of-reference target, so the run is valid)

Building Your Own AI Benchmark

For specific applications, you may need custom benchmarks:

Design Principles

1. Define clear metrics
   - Latency (P50, P95, P99)
   - Throughput (samples/sec)
   - Accuracy (specific definition)
   - Resource usage (memory, power)

2. Reproducibility
   - Fix random seeds
   - Record complete environment
   - Use version control

3. Reflect real workload
   - Use actual data distribution
   - Simulate real request patterns
   - Consider batch mixing
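
The latency and throughput metrics from principle 1 can be collected with a very small harness. The sketch below is one possible shape for it, not a definitive implementation: `benchmark` and `fake_inference` are hypothetical names, and the stand-in workload is plain CPU arithmetic. When the workload runs on a GPU, the timed call must block until the device has finished (for example by synchronizing), otherwise only the launch is timed.

```python
import math
import statistics
import time


def benchmark(fn, warmup=10, iterations=1000):
    """Time fn() repeatedly; return latency statistics in milliseconds."""
    for _ in range(warmup):  # untimed runs to warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()

    def pct(p):
        # Nearest-rank percentile on the sorted samples.
        rank = max(1, math.ceil(p * len(samples) / 100))
        return samples[rank - 1]

    return {
        "mean": statistics.fmean(samples),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "std": statistics.stdev(samples),
        "throughput_per_s": len(samples) / (sum(samples) / 1e3),
    }


def fake_inference():
    # Hypothetical stand-in for a real model call.
    sum(i * i for i in range(2000))


stats = benchmark(fake_inference, warmup=5, iterations=200)
for name, value in stats.items():
    print(f"{name:>18}: {value:.3f}")
```

Reporting the percentiles alongside the mean, as in the report format that follows, exposes tail behavior that a single average hides.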

Report Format Suggestion

AI Benchmark Report
═══════════════════════════════════════════════════════════

Model: YOLOv8-Medium
Hardware: NVIDIA RTX 4090
Precision: FP16
Batch Size: 1

Latency Results (1000 iterations):
  Mean:   4.2 ms
  P50:    4.1 ms
  P95:    5.8 ms
  P99:    7.2 ms
  Std:    0.9 ms

Throughput (60 seconds):
  238 images/second

Accuracy (on validation set):
  mAP@0.5: 45.2%
  mAP@0.5:0.95: 33.1%

GPU Utilization: 85%
GPU Memory: 4.2 GB / 24 GB
Power: 280W average

Environment:
  CUDA: 12.2
  cuDNN: 8.9
  TensorRT: 8.6
  Driver: 535.86
  OS: Ubuntu 22.04

What Linda Learned

Linda eventually made her chip decision. Neither Vendor A nor Vendor B.

She built her own benchmark using her actual production model, with her actual batch sizes, at her actual precision requirements. She tested latency at P50, P95, and P99. She measured power consumption under load. She calculated cost per inference.

Vendor C, which she'd initially dismissed because of lower published TOPS numbers, turned out to be the best fit for her specific workload—40% lower cost per inference than either A or B.

"The published specs weren't wrong," she explained to her team. "They just weren't measuring what mattered to us. TOPS at INT4 with batch size 256 is a valid metric—it's just not our metric."

The lesson: standardized benchmarks like MLPerf provide valuable apples-to-apples comparisons, but the most important benchmark is always the one that reflects your actual production workload.


Summary

AI/ML benchmarking has its own challenges and methods:

Major Benchmarks

  • MLPerf: Industry standard, complete models, strict rules
  • DeepBench: Core operations, low-level performance analysis
  • AI Benchmark: Mobile devices, consumer-oriented
  • DAWNBench: Cost-oriented, time/money

Key Considerations

  • Performance and accuracy trade off against each other
  • Hardware diversity makes comparison difficult
  • Need clear measurement conditions

Choosing a Benchmark

  • Hardware procurement evaluation: MLPerf
  • Low-level optimization analysis: DeepBench
  • Consumer comparison: Geekbench ML / AI Benchmark
  • Cost-sensitive scenarios: DAWNBench

Custom Benchmarks

  • Define clear metrics
  • Ensure reproducibility
  • Reflect real workload
  • Report complete environment

Chapter 25: HPC Benchmarks

Part VII: AI/HPC


"The TOP500 list is not about who has the biggest computer, but about who can solve the biggest problems." — Jack Dongarra

When Your Supercomputer Ranks #1 But Can't Run Your Code

Dr. Zhang's research group had just gotten access to their national lab's newest supercomputer—ranked in the top 20 of the TOP500 list. Exciting times. Their climate simulation code, which took 48 hours on the previous system, should scream on this new machine.

It didn't. The simulation took 52 hours. Slower than before.

"How is this possible?" Dr. Zhang asked the system administrator. "This machine has 10x the LINPACK score of our old system."

The admin nodded sympathetically. "LINPACK measures dense linear algebra. Your code is sparse matrix-heavy with irregular memory access patterns. This new system has great compute, but the memory bandwidth per core actually went down. Your code is memory-bound, not compute-bound."

Dr. Zhang had just learned one of the fundamental lessons of HPC benchmarking: the benchmark that ranks supercomputers tells you almost nothing about how those supercomputers will perform on your specific workload.

This chapter explores the benchmarks used to evaluate HPC systems, their strengths, their limitations, and how to interpret them for real applications.


A Brief History of HPC Performance Measurement

Since 1993, the TOP500 list has been published twice yearly, ranking the world's most powerful supercomputers. It's become the definitive measure of "who has the biggest computer"—and also a cautionary tale about what benchmarks can and cannot tell you.

The performance growth has been staggering:

Year    Milestone              FLOPS
───────────────────────────────────────────────────────────────
1993    First TOP500 list      59.7 GFLOPS (CM-5, Los Alamos)
1997    First teraFLOPS        1.0 TFLOPS (ASCI Red, Intel)
2008    First petaFLOPS        1.0 PFLOPS (Roadrunner, IBM)
2022    First exaFLOPS         1.1 EFLOPS (Frontier, HPE/AMD)

Over roughly 30 years, the LINPACK performance of the top-ranked system has grown about 18 million-fold (from 59.7 GFLOPS to 1.1 EFLOPS). But how is this measured, and what does it actually mean?

LINPACK: The Benchmark That Defines the TOP500

LINPACK (and its modern parallel version, HPL—High Performance LINPACK) is the benchmark used for TOP500 rankings. It measures performance on a specific mathematical operation: solving a dense system of linear equations.

What LINPACK Actually Computes

The problem is deceptively simple:

Solve Ax = b for x

Where:
  A is an n×n dense matrix
  b is a known vector
  x is the unknown we're solving for

The standard approach uses LU decomposition: factorize A into lower and upper triangular matrices (L and U), then solve through forward and backward substitution. The computational complexity is O(n³), dominated by floating-point multiply-add operations.

Why This Particular Problem?

When Jack Dongarra and colleagues created LINPACK in the 1970s, they needed a benchmark that was:

  1. Mathematically rigorous: The correct answer can be verified
  2. Scalable: You can always use a bigger matrix
  3. Compute-intensive: Limited by arithmetic, not I/O
  4. Representative: Linear algebra was central to many scientific codes

For the computing of that era, these were reasonable choices. Dense linear algebra was a dominant workload. Memory bandwidth was relatively fast compared to compute.

HPL (High Performance LINPACK)

HPL is the parallel version of LINPACK for distributed systems:

HPL Parameters:
  N:  Problem size (matrix dimension)
  NB: Block size
  P×Q: Processor grid

Performance calculation:
  FLOPS = (2/3 × N³ + 2 × N²) / Time
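
The performance formula is easy to evaluate by hand or in a few lines of code. The sketch below applies it to a hypothetical run (N = 200,000 finishing in 100 seconds, numbers chosen only for illustration) and also reports how much memory the N×N double-precision matrix needs, since that is what usually limits N in practice. The function names are assumptions.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """GFLOPS from the operation count (2/3 N^3 + 2 N^2) and the run time."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9


def hpl_matrix_gib(n: int) -> float:
    """Memory for an N x N matrix of 8-byte doubles, in GiB."""
    return 8 * n * n / 2**30


n, seconds = 200_000, 100.0  # hypothetical run
print(f"{hpl_gflops(n, seconds) / 1000:.1f} TFLOPS")  # 53.3 TFLOPS
print(f"{hpl_matrix_gib(n):.1f} GiB for the matrix")  # 298.0 GiB for the matrix
```

Because the work grows as N³ while the memory grows as N², larger problems spend proportionally more time in arithmetic, which is one reason larger N tends to give higher efficiency.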

Result Interpretation

Typical HPL output:

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8     8            1234.56              5.40e+02
--------------------------------------------------------------------------------

Interpretation:
  N = 100000 (matrix size)
  P×Q = 8 × 8 = 64 processes
  Time = 1234.56 seconds
  Gflops = 540 GFLOPS = 0.54 TFLOPS (≈ 2/3 × N³ / Time)

Efficiency Calculation

HPL Efficiency = Achieved FLOPS / Theoretical Peak FLOPS

Typical efficiency:
  Well-optimized systems: 70-85%
  Average systems:        50-70%
  Unoptimized:           30-50%

Factors affecting efficiency:
  - Problem size (larger is better)
  - Network bandwidth and latency
  - Memory bandwidth
  - Software optimization level

HPCG

HPCG (High Performance Conjugate Gradients) was designed to complement LINPACK by covering the workloads that LINPACK represents poorly.

Design Motivation

LINPACK's problems:
  - Compute-intensive, regular memory access
  - Modern applications are often memory-intensive
  - High LINPACK efficiency doesn't mean high application efficiency

HPCG's goals:
  - Access patterns closer to real applications
  - Measure memory system performance
  - Reflect sparse matrix operations

Mathematical Background

HPCG uses the conjugate gradient method to solve sparse linear systems:

Ax = b

Where A is a sparse matrix (from 3D 27-point stencil)

Main operations:
1. SpMV (Sparse Matrix-Vector multiply)
2. Vector operations (AXPY, dot product)
3. Multigrid preconditioning
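
To show how the first two operations fit together, here is a deliberately simplified conjugate gradient solver in pure Python. It is a sketch under stated assumptions, not HPCG itself: it uses a 1-D three-point Poisson matrix instead of the 3-D 27-point stencil, omits the multigrid preconditioner, and all function names are illustrative. Each iteration performs one SpMV, two dot products, and three AXPY-style updates, all of which stream through memory with very little arithmetic per byte.

```python
def spmv(x):
    """y = A @ x for the 1-D Poisson matrix (2 on the diagonal, -1 beside it)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for A x = b, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    p = list(r)  # search direction
    rs_old = dot(r, r)
    for iteration in range(1, max_iter + 1):
        ap = spmv(p)
        alpha = rs_old / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            return x, iteration
        p = axpy(rs_new / rs_old, p, r)
        rs_old = rs_new
    return x, max_iter


b = [1.0] * 50
x, iters = conjugate_gradient(b)
residual = max(abs(bi - yi) for bi, yi in zip(b, spmv(x)))
print(f"converged in {iters} iterations, max residual {residual:.2e}")
```

The SpMV here does about five floating-point operations per row while touching several vector elements, which is the low arithmetic intensity that makes this kind of solver memory-bound.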

Graph500

Graph500 measures graph analysis performance, reflecting data-intensive applications.

Why Graph500 Is Needed

Many important applications are graph-oriented:
  - Social network analysis
  - Web page ranking
  - Bioinformatics
  - Cybersecurity

Characteristics of these applications:
  - Irregular data access
  - Low computational density
  - High memory bandwidth requirements

Benchmark Content

Graph500 contains three kernels:

1. Graph Construction
   - Build graph data structure from edge list
   - Measures data processing capability

2. BFS (Breadth-First Search)
   - Breadth-first search from random starting point
   - Primary performance metric

3. SSSP (Single-Source Shortest Path)
   - Calculate shortest paths
   - More complex graph algorithm

Performance Metric: GTEPS

GTEPS = Giga Traversed Edges Per Second
      = Billions of edges traversed per second

Calculation:
  GTEPS = Total_Edges / Time / 10^9

Typical values:
  Top systems:    10,000+ GTEPS
  General HPC:    100-1000 GTEPS
  Single node:    1-10 GTEPS
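
The metric can be illustrated with a toy breadth-first search that counts the edges it scans and divides by the elapsed time. This is a simplified sketch: the function name and the small ring graph are assumptions, it counts every adjacency entry scanned rather than following the official edge-counting rules, and a pure-Python BFS on a tiny graph says nothing about real Graph500 performance.

```python
import time
from collections import deque


def bfs_teps(adjacency, source):
    """Run BFS from source; return (vertices visited, edges scanned, TEPS)."""
    visited = {source}
    queue = deque([source])
    edges = 0
    start = time.perf_counter()
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            edges += 1  # every adjacency entry scanned counts as one traversal
            if w not in visited:
                visited.add(w)
                queue.append(w)
    elapsed = time.perf_counter() - start
    teps = edges / elapsed if elapsed > 0 else float("inf")
    return len(visited), edges, teps


# Toy input: a ring of 1,000 vertices, each linked to its two neighbours.
n = 1000
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
visited_count, edges_scanned, teps = bfs_teps(ring, 0)
print(f"visited {visited_count} vertices, scanned {edges_scanned} edges, "
      f"{teps / 1e9:.6f} GTEPS")
```

Almost all of the time goes into the set lookups and queue operations, that is, into data access rather than arithmetic, which is the behavior the benchmark is designed to expose.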

Other HPC Benchmarks

STREAM

Classic benchmark for measuring memory bandwidth:

Four kernels:

Copy:   a[i] = b[i]
Scale:  a[i] = q * b[i]
Add:    a[i] = b[i] + c[i]
Triad:  a[i] = b[i] + q * c[i]

Result units: GB/s or MB/s

// STREAM Triad core code
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + scalar * c[i];
}
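
Turning a kernel's run time into a bandwidth figure only requires counting the arrays it reads and writes. The sketch below does that arithmetic in Python, assuming 8-byte double-precision elements and counting only the explicit reads and writes of each kernel; the function name and the sample timing are hypothetical.

```python
BYTES_PER_ELEMENT = 8  # assumption: double-precision arrays

# Arrays read plus arrays written per loop iteration for each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_gb_per_s(kernel: str, n: int, seconds: float) -> float:
    """Bandwidth implied by one pass of a STREAM kernel over n-element arrays."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n
    return bytes_moved / seconds / 1e9


# Hypothetical measurement: 80M-element arrays, one Triad pass in 12 ms.
print(f"{stream_gb_per_s('Triad', 80_000_000, 0.012):.0f} GB/s")  # 160 GB/s
```

The arrays must be much larger than the last-level cache, otherwise the result measures cache bandwidth rather than memory bandwidth.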

NAS Parallel Benchmarks (NPB)

Parallel computing benchmark suite developed by NASA:

Benchmark    Description                 Characteristics
─────────────────────────────────────────────────────────
EP           Embarrassingly Parallel     No communication
MG           Multigrid                   Long/short range comm
CG           Conjugate Gradient          Irregular access
FT           FFT                         All-to-all comm
IS           Integer Sort                Random access
LU           Lower-Upper Gauss-Seidel    Regular comm
SP           Scalar Pentadiagonal        Regular comm
BT           Block Tridiagonal           Regular comm

OSU Micro-Benchmarks

Measures MPI communication performance:

Measurement items:
  - Point-to-point latency
  - Point-to-point bandwidth
  - Collective operations (AllReduce, Broadcast, etc.)
  - One-sided operations

Typical results:
  InfiniBand HDR:
    Latency: ~1 μs
    Bandwidth: ~200 Gb/s

  Ethernet 100G:
    Latency: ~5 μs
    Bandwidth: ~100 Gb/s

Result Analysis and Reporting

Performance Analysis

When analyzing HPC benchmark results, consider:

1. Efficiency
   Achieved performance / Theoretical peak

2. Scalability
   How performance changes with node count

3. Bottleneck identification
   - High HPL but low HPCG → Memory bottleneck
   - Low Graph500 → Memory latency issues
   - High OSU latency → Network problems

4. Energy efficiency
   GFLOPS/W or GTEPS/W

Common Issues

1. Low HPL efficiency
   - Problem size not large enough
   - Block size not suitable
   - Network becoming bottleneck
   - BLAS library not optimized

2. Very low HPCG efficiency
   - This is normal (typically 1-5%)
   - Reflects memory system limitations
   - Can try optimizing memory configuration

3. Poor Graph500 performance
   - Memory latency is key
   - NUMA configuration matters
   - Consider using huge pages

Future of HPC Benchmarks

Emerging Benchmarks

1. HPL-MxP (Mixed Precision)
   - Uses mixed precision
   - Reflects AI hardware capabilities
   - Introduced as HPL-AI in 2019; results published since 2020

2. MLPerf HPC
   - AI applications on HPC
   - Scientific computing + machine learning

3. IO500
   - Storage system performance
   - Important for data-intensive applications

Trends

1. From FLOPS to multi-dimensional metrics
   - Performance isn't just compute speed
   - Memory, communication, energy efficiency all matter

2. Application-oriented
   - Real applications more meaningful than synthetic benchmarks
   - Mini-apps becoming trend

3. Energy efficiency first
   - Green500 importance increasing
   - Power becoming design constraint

The End of Dr. Zhang's Story

Six months later, Dr. Zhang's research group completed their analysis. The "slower" new supercomputer wasn't slower after all—it just required different optimization.

The old system had high memory bandwidth per core, which matched their original code. The new system had more compute per core but less bandwidth, requiring them to restructure their algorithms to improve arithmetic intensity.

After optimization, the new system ran their climate simulation in 12 hours instead of 48—a 4× improvement. But they had to earn that improvement through months of algorithm work.

"LINPACK told us nothing about this," Dr. Zhang said at a department meeting. "The new machine had 10× the LINPACK score, but our speedup was 4× after significant effort. Someone using a different code might get 8×. Someone with an even more memory-bound code might get 0.5×."

"So what's the point of LINPACK?" a student asked.

"It's a common yardstick," Dr. Zhang replied. "It tells you something about the system, just not everything. The real lesson is that no single benchmark captures all aspects of performance. The TOP500 ranks supercomputers, but it doesn't rank how well they'll run your code."


Summary

HPC Benchmarks provide standard methods for evaluating supercomputer performance:

Major Benchmarks

  • HPL/LINPACK: Dense linear algebra, TOP500 foundation
  • HPCG: Sparse operations, closer to real applications
  • Graph500: Graph analysis, data-intensive
  • STREAM: Memory bandwidth

Key Insights

  • High HPL efficiency doesn't mean high application efficiency
  • HPCG efficiency is typically only 1-5%
  • Different benchmarks measure different aspects

Practical Recommendations

  • Use multiple benchmarks to evaluate systems
  • Focus on efficiency rather than absolute performance
  • Consider energy efficiency (Green500)
  • Combine with application benchmarks for evaluation

Chapter 26: GPU Benchmarking

Part VII: AI/HPC


"The GPU is the new CPU." — Jensen Huang

The First Time Everything Looks Wrong

Carlos had been a CPU performance engineer for eight years. When his company pivoted to AI, he was assigned to optimize their training infrastructure. "How different can GPUs be?" he thought. "Cores are cores. Memory is memory."

His first profiling session was humbling.

He ran Intel VTune out of habit—it showed barely any useful data. He tried perf—the GPU was invisible. He looked at nvidia-smi and saw "GPU Utilization: 73%." Was that good? Bad? What was the other 27% doing?

When Carlos finally got Nsight Compute working, the metrics were alien. "SM Occupancy: 42%." "Warp Stall: Memory." "L2 Hit Rate: 31%." Nothing mapped to his mental model built on branch prediction, instruction-level parallelism, and cache hierarchies.

"I feel like I'm starting from scratch," he told his new teammate, Wei, a GPU specialist.

"You kind of are," Wei admitted. "But the good news is, the fundamentals transfer. You still care about memory access patterns, instruction throughput, and utilization. It's just that everything is scaled up by a factor of 1000, and the vocabulary is different."

This chapter will help you navigate that transition.


Why GPU Profiling Is Different

GPU performance analysis differs from CPU analysis in three fundamental ways:

1. Parallelism at Unprecedented Scale

When Carlos worked on CPUs, "high parallelism" meant 64 threads across 32 cores. A GPU like the NVIDIA H100 can keep over 270,000 threads resident at once (132 SMs × 2,048 threads per SM). The optimization strategies that work at 64-thread scale, such as careful load balancing and avoiding contention, become both more critical and harder to reason about at 270,000-thread scale.

System        Parallel Units    Threads            Ratio
──────────────────────────────────────────────────────────
Xeon 8490H    60 cores          120 threads        1x
H100 SXM5     132 SMs           270,336 threads    2,250x

2. Memory Bandwidth Dominates

CPUs are designed for low-latency access to small amounts of data. GPUs are designed for high-bandwidth access to large amounts of data. The H100 SXM5 delivers roughly 3 TB/s of HBM bandwidth, about 10x that of a server CPU with eight DDR5-4800 channels (roughly 300 GB/s). But that bandwidth is shared across all 270,000 threads, so per-thread bandwidth is actually lower.

3. Different Execution Model

CPUs execute threads independently. GPUs execute threads in groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads). All threads in a warp execute the same instruction at the same time. When threads in a warp diverge (take different branches), the hardware runs each path in turn with the other threads masked off, so throughput can drop sharply.

Memory Hierarchy

GPU Memory Hierarchy:

┌─────────────────────────────────────────────────────────┐
│                    HBM (80 GB)                          │
│                    ~3 TB/s                              │
├─────────────────────────────────────────────────────────┤
│                    L2 Cache (50 MB)                     │
│                    ~12 TB/s                             │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │ Shared Mem  │  │ Shared Mem  │  │ Shared Mem  │ ... │
│  │  (228 KB)   │  │  (228 KB)   │  │  (228 KB)   │     │
│  │  ~20 TB/s   │  │  ~20 TB/s   │  │  ~20 TB/s   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  Registers  │  │  Registers  │  │  Registers  │ ... │
│  │  (256 KB)   │  │  (256 KB)   │  │  (256 KB)   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────┘

Execution Model

CUDA Execution Model:

Grid
 └── Block (up to 1024 threads)
      └── Warp (32 threads, SIMT)
           └── Thread

Key concepts:
  - Warp is the minimum scheduling unit
  - Threads in same warp execute same instruction
  - Warp divergence reduces efficiency

CUDA Performance Analysis Tools

nvidia-smi

The most basic GPU monitoring tool:

# Real-time monitoring
nvidia-smi

# Continuous monitoring (update every second)
nvidia-smi -l 1

# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
           --format=csv -l 1

Output example:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80GB    On   | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0   115W / 700W |   1024MiB / 81559MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+

Nsight Compute

Modern CUDA kernel analysis tool:

# Basic analysis
ncu ./my_cuda_app

# Detailed analysis (all sections)
ncu --set full ./my_cuda_app

# Analyze specific kernel
ncu --kernel-name "myKernel" ./my_cuda_app

# Output report
ncu -o report ./my_cuda_app

Nsight Systems

System-level profiling:

# Basic trace
nsys profile ./my_cuda_app

# Include CUDA API and kernels
nsys profile --trace=cuda,nvtx ./my_cuda_app

# Output report
nsys profile -o timeline ./my_cuda_app

Key Performance Metrics

SM Occupancy

Occupancy = Active Warps / Maximum Warps per SM

Influencing factors:
  - Registers used per thread
  - Shared memory used per block
  - Block size
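
The way these three factors interact can be sketched as a small calculator: each resource caps how many blocks fit on one SM, and the tightest cap sets the theoretical occupancy. The default limits below (64 warps, 32 blocks, 65,536 registers, and 228 KB of shared memory per SM) are assumptions matching an H100-class SM, and the model ignores allocation granularity, so treat it as an approximation rather than a replacement for the profiler's own figure.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps=64, max_blocks=32,
                          regs_per_sm=65536, smem_per_sm=228 * 1024):
    """Return (occupancy, limiting resource) for one SM under a simplified model."""
    warps_per_block = -(-threads_per_block // 32)  # ceiling division
    limits = {
        "warps": max_warps // warps_per_block,
        "blocks": max_blocks,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * 32),
        "shared memory": (smem_per_sm // smem_per_block) if smem_per_block else max_blocks,
    }
    limiter = min(limits, key=limits.get)  # resource allowing the fewest blocks
    active_warps = limits[limiter] * warps_per_block
    return active_warps / max_warps, limiter


# 256-thread blocks, 64 registers per thread, no shared memory.
print(theoretical_occupancy(256, 64, 0))          # (0.5, 'registers')

# 256-thread blocks, 32 registers per thread, 48 KB shared memory per block.
print(theoretical_occupancy(256, 32, 48 * 1024))  # (0.5, 'shared memory')
```

In both examples only 32 of 64 possible warps are resident, but the remedy differs: the first kernel needs fewer registers per thread, the second a smaller shared-memory footprint.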

Tensor Core Generations

Generation   GPU          Supported Precision       Matrix Size
─────────────────────────────────────────────────────────────────
V1 (Volta)   V100         FP16                      4×4×4
V2 (Turing)  RTX 20xx     FP16, INT8, INT4          8×8×4
V3 (Ampere)  A100         FP16, BF16, TF32,         8×8×4
                          FP64, INT8
V4 (Hopper)  H100         FP16, BF16, TF32,         16×8×16
                          FP8, FP64, INT8

Tensor Core Efficiency

Conditions for high Tensor Core efficiency:

1. Matrix dimension alignment
   - m, n, k should be multiples of 8 or 16
   - Depends on precision and GPU generation

2. Memory alignment
   - Matrix start address aligned to 16 bytes
   - Leading dimension aligned

3. Sufficient parallelism
   - Need enough tiles to fill GPU
   - Small matrices have low efficiency

Typical efficiency:
  Large matrices (4096+):  80-95% of peak
  Medium matrices (1024):  50-80% of peak
  Small matrices (256):    20-50% of peak

Practical Profiling Workflow

Step 1: System-Level Analysis

# Use Nsight Systems for overall view
nsys profile -o overview ./my_app

# View report
nsys-ui overview.nsys-rep

Identify:

  • CPU vs GPU time distribution
  • Kernel execution time
  • Data transfer overhead
  • Synchronization waits

Step 2: Identify Hotspots

From Nsight Systems report, find:

1. Longest-running kernels
2. Frequently called kernels
3. Data transfer bottlenecks
4. Unnecessary synchronization

Step 3: Deep Analysis

# Detailed analysis of specific kernel
ncu --set full \
    --kernel-name "hotKernel" \
    -o detailed_report \
    ./my_app

Step 4: Bottleneck Diagnosis

Nsight Compute provides bottleneck analysis:

Memory Bound:
  - High memory throughput
  - Low compute throughput
  - Solution: Optimize access patterns, use shared memory

Compute Bound:
  - High compute throughput
  - Low memory throughput
  - Solution: Algorithm optimization, use Tensor Cores

Latency Bound:
  - Low occupancy
  - High stall percentage
  - Solution: Increase parallelism, reduce dependencies

Instruction Bound:
  - Instruction issue becomes bottleneck
  - Solution: Reduce instruction count, use vectorization

Common Performance Issues

Warp Divergence

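// Problematic code: even and odd threads of every warp take different branches.
// expensive_function_a/b stand in for any costly __device__ functions.
__global__ void divergent(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Skip threads past the end of the array
    if (idx % 2 == 0) {
        // Even threads take this path
        data[idx] = expensive_function_a(data[idx]);
    } else {
        // Odd threads take this path
        data[idx] = expensive_function_b(data[idx]);
    }
}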

// Threads in same warp take different branches
// Leads to serialized execution
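
The cost of divergence can be modeled without a GPU by counting, for each warp, how many distinct branch paths its 32 threads take, since each path is executed in turn. The Python sketch below compares the per-thread condition from the kernel above with a condition that depends only on the warp index; it is a counting model with illustrative names, not a measurement.

```python
WARP_SIZE = 32


def serialized_passes(branch_of_thread, num_threads):
    """For each warp, count the distinct branch paths it must execute in turn."""
    passes = []
    for warp_start in range(0, num_threads, WARP_SIZE):
        lanes = range(warp_start, min(warp_start + WARP_SIZE, num_threads))
        passes.append(len({branch_of_thread(t) for t in lanes}))
    return passes


# Branch on thread parity, as in the kernel above: both paths run in every warp.
print(serialized_passes(lambda t: t % 2, 128))                 # [2, 2, 2, 2]

# Branch on warp parity instead: each warp follows a single path.
print(serialized_passes(lambda t: (t // WARP_SIZE) % 2, 128))  # [1, 1, 1, 1]
```

Switching the condition to the warp index changes which elements receive which function, so it is only a valid fix when the data can be reordered to match.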

Memory Coalescing

// Problem: Non-coalesced access
__global__ void strided(float* data, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    data[idx] = 1.0f;  // Strided access, low efficiency
}

// Solution: Coalesced access
__global__ void coalesced(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = 1.0f;  // Contiguous access, high efficiency
}
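
The cost of striding can be estimated by counting how many aligned memory segments one warp touches. This sketch assumes 4-byte elements, a 128-byte segment size, and an aligned base address; actual transaction granularity varies by GPU generation.

```python
def segments_touched(stride, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Count distinct aligned memory segments one warp touches for a given stride."""
    addresses = [lane * stride * elem_bytes for lane in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})

for stride in (1, 2, 32):
    print(stride, segments_touched(stride))
```

With stride 1 the whole warp is served from one segment; with stride 32 every lane needs its own.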

Bank Conflicts

// Problem: Shared memory bank conflict
__shared__ float smem[32][32];

// Threads in same warp access same bank
float val = smem[threadIdx.x][0];  // 32-way bank conflict

// Solution: Padding
__shared__ float smem[32][33];  // Add one column padding
float val = smem[threadIdx.x][0];  // No conflict
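
The effect of the padding column can be checked by computing which bank each lane hits. The sketch assumes 32 banks of 4-byte words, which is the common configuration; `conflict_degree` is a hypothetical helper.

```python
from collections import Counter

def conflict_degree(row_width, col=0, warp_size=32, num_banks=32):
    """Worst-case number of lanes hitting one bank when lane t reads smem[t][col]."""
    banks = [(t * row_width + col) % num_banks for t in range(warp_size)]
    return max(Counter(banks).values())

print(conflict_degree(32))  # 32: every lane maps to the same bank
print(conflict_degree(33))  # 1: padding spreads lanes across all banks
```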

Low Occupancy

Diagnosis:
  ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app

Causes and solutions:
  1. Too many registers used
     - Reduce per-thread variables
     - Use __launch_bounds__

  2. Too much shared memory used
     - Reduce shared memory size
     - Use dynamic allocation to size it per launch

  3. Unsuitable block size
     - Adjust block dimensions
     - Ensure multiple of 32
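
How the three causes interact can be sketched with a simplified occupancy estimate. The per-SM limits below are assumed example values, and real hardware allocates registers and shared memory in coarser units, so treat the result as an upper bound rather than what the profiler will report.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_threads=2048, max_regs=65536,
                          max_smem=100 * 1024, max_blocks=32):
    """Estimate occupancy as resident threads / max threads per SM.

    All limits are assumed example values; allocation granularity is ignored.
    """
    limits = [
        max_blocks,
        max_threads // threads_per_block,
        max_regs // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        limits.append(max_smem // smem_per_block)
    blocks = min(limits)               # the tightest resource decides
    return blocks * threads_per_block / max_threads

print(theoretical_occupancy(256, 32, 0))          # registers are not limiting
print(theoretical_occupancy(256, 64, 0))          # register pressure halves occupancy
print(theoretical_occupancy(256, 32, 48 * 1024))  # shared memory is the limit
```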

Carlos's First Optimization Win

Three months into his GPU journey, Carlos achieved his first major optimization win.

The team's attention kernel was running at 35% of theoretical peak. After careful analysis with Nsight Compute, he identified the bottleneck: bank conflicts in shared memory during the softmax computation.

"In CPU terms," he explained to the team, "it's like having eight cores all trying to access the same cache line. On a GPU, that's 32 threads fighting for the same memory bank, and they have to take turns."

He restructured the data layout to eliminate the conflicts. The kernel jumped to 68% of peak—nearly doubling throughput.

"The weird thing," Carlos admitted, "is that once you understand the GPU execution model, this stuff becomes obvious. The tools show you exactly where the problem is. The hard part is learning to read what they're telling you."

Wei nodded. "Welcome to GPU performance. The tools are powerful. The bottlenecks are visible. The optimization strategies are well-documented. You just have to unlearn your CPU instincts first."

GPU benchmarking isn't harder than CPU benchmarking—it's different. The same principles apply: measure before optimizing, understand the hardware model, and let data guide your decisions. The vocabulary and tools are new, but the discipline is the same.


Summary

GPU benchmarking requires understanding the GPU's unique architecture:

Key Tools

  • nvidia-smi: Basic monitoring
  • Nsight Systems: System-level analysis
  • Nsight Compute: Kernel-level analysis
  • rocprof/Omniperf: AMD GPU profiling

Key Metrics

  • Occupancy: Warp utilization
  • Memory Throughput: Memory bandwidth utilization
  • Compute Throughput: Compute unit utilization
  • Tensor Core Utilization: Matrix operation efficiency

Common Bottlenecks

  • Warp divergence
  • Non-coalesced memory access
  • Bank conflicts
  • Low occupancy

Best Practices

  • First use Nsight Systems for overall view
  • Identify hotspot kernels
  • Use Nsight Compute for deep analysis
  • Choose optimization strategy based on bottleneck type

Chapter 27: LLM Performance Analysis

Part VII: AI/HPC


"The best way to predict the future is to invent it." — Alan Kay

The Mystery of the Stuttering Chatbot

Priya's team had deployed their LLM-powered customer service chatbot. The initial demo went great—responses were fast and coherent. But in production, users started complaining.

"Sometimes it takes forever to start responding," one support ticket read. "And then when it does start, the words come out in bursts—fast for a bit, then pause, then fast again."

Priya looked at the metrics dashboard. Average response latency: 2.3 seconds. That seemed fine. But the P99 was 8.7 seconds, and users were experiencing something the dashboard couldn't capture: the feeling of waiting.

She started instrumenting more carefully. The "slow start" issue was easy to identify: long prompts (users pasting entire documents) caused extended prefill times. But the "stuttering" was mysterious. The tokens-per-second metric looked stable.

After a week of investigation, she found the cause: garbage collection in the KV cache management code. Every 50 tokens or so, the system would pause to clean up old cache entries. Each pause was only 100ms, but users noticed it as unnatural hesitation.

"Traditional latency metrics don't capture this," she realized. "LLM inference isn't like a web request. Users experience it as a conversation, and conversations have rhythm."

This chapter explores the unique performance characteristics of LLM inference—characteristics that traditional benchmarking approaches fail to capture.


Why LLM Inference Is Different

LLM inference differs from traditional AI inference in fundamental ways that affect every aspect of performance analysis.

Autoregressive Generation: One Token at a Time

When you ask a vision model to classify an image, it produces the answer in one forward pass. But when you ask an LLM to write a paragraph, it generates that paragraph one token at a time—each token requiring a complete forward pass through the model.

Traditional AI:  Input → Model → Complete Output (one pass)

LLM Inference:   Input → Model → Token 1
                         Model → Token 2
                         Model → Token 3
                          ...
                         Model → Token N

A 100-token response requires 100 forward passes. This fundamentally changes the performance characteristics.
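
The loop structure can be sketched in a few lines. The "model" here is a stand-in function, not a real LLM; the point is only that each generated token costs one call that sees the whole sequence so far.

```python
def generate(next_token, prompt, max_new_tokens):
    """Greedy autoregressive loop: one model call per generated token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))   # the model sees everything so far
        passes += 1
    return tokens[len(prompt):], passes

# Stand-in "model" (hypothetical): the next token is the current sequence length
out, passes = generate(lambda toks: len(toks), prompt=[7, 8, 9], max_new_tokens=100)
print(passes)  # 100
```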

Two Distinct Phases with Different Bottlenecks

LLM inference has two phases with completely different performance profiles:

Prefill Phase (processing the input prompt):

  • Processes all input tokens in parallel
  • Compute-intensive: GPU tensor cores are busy
  • Latency scales with prompt length
  • Generates the initial KV cache

Decode Phase (generating the output):

  • Generates one token at a time
  • Memory-intensive: waiting for data transfer
  • Latency is relatively constant per token
  • Reads and updates the KV cache

This split creates a key insight: TTFT (Time To First Token) and TPS (Tokens Per Second) are largely independent metrics because they're dominated by different phases.

Why Decode Is Memory-Bound

During the decode phase, something counterintuitive happens: a 70-billion parameter model, running on hardware capable of 2000 TFLOPS, achieves only a tiny fraction of its theoretical compute performance.

The reason is arithmetic intensity. To generate one token, you must:

  1. Read 70 billion parameters from memory (~140 GB at FP16)
  2. Read the KV cache (potentially tens of GB more)
  3. Perform matrix operations for exactly 1 token
  4. Write the new KV cache entry

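The computation is minimal compared to the data movement. Even at roughly 3 TB/s of memory bandwidth (H100-class HBM), streaming 140 GB of weights takes about 47 ms (140 GB ÷ 3 TB/s). That is the per-token latency floor for single-user inference, before the KV cache is counted. In practice 140 GB does not fit in a single 80 GB H100, so the weights are split across several GPUs and each one streams its own share, but the same bandwidth reasoning applies to each device.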
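
The arithmetic behind that floor is simple enough to keep as a helper. This sketch counts only the model weights and ignores the KV cache and any compute time; the function name is hypothetical.

```python
def decode_latency_floor_ms(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency when every weight is read once per token."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1e3

floor = decode_latency_floor_ms(70e9, 2, 3e12)   # 70B parameters, FP16, 3 TB/s
print(f"{floor:.1f} ms/token, at most {1e3 / floor:.1f} tokens/s")
```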

Core Performance Metrics

TTFT (Time To First Token)

TTFT = Time from receiving request to outputting first token

Components:
  TTFT = Network latency + Queue time + Prefill time

Influencing factors:
  - Prompt length (primary)
  - Model size
  - Hardware performance
  - System load

Typical values (7B model, single GPU):
  Prompt 100 tokens:   50-100 ms
  Prompt 1000 tokens:  200-500 ms
  Prompt 4000 tokens:  500-2000 ms

TPS / Throughput

TPS = Tokens Per Second (tokens generated per second)

Two definitions:
  1. Single-request TPS: Generation speed for one request
  2. System TPS: Total throughput of entire system

Single-request TPS (7B model, single GPU):
  FP16:  30-50 tokens/sec
  INT8:  50-80 tokens/sec
  INT4:  80-120 tokens/sec

System TPS (depends on batch size):
  Batch 1:   30-50 tokens/sec
  Batch 32:  500-1000 tokens/sec
  Batch 128: 1500-3000 tokens/sec

TPOT (Time Per Output Token)

TPOT = Time to generate each output token
     = 1 / Single-request TPS

TPOT determines user-perceived "typing speed"

Typical values:
  Fast (good experience):    < 50 ms/token
  Acceptable:                50-100 ms/token
  Slow (poor experience):    > 100 ms/token

End-to-End Latency

Total Latency = TTFT + (N × TPOT)

Where N = number of output tokens

Example:
  TTFT = 100 ms
  TPOT = 30 ms
  N = 100 tokens

  Total = 100 + (100 × 30) = 3100 ms = 3.1 seconds
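
When these metrics are measured from a real token stream, they come from arrival timestamps rather than from the formula. A minimal sketch, assuming timestamps in seconds and a hypothetical helper name:

```python
def stream_metrics(request_time, token_times):
    """Derive TTFT, mean TPOT, and total latency (seconds) from token arrival times."""
    ttft = token_times[0] - request_time
    total = token_times[-1] - request_time
    gaps = len(token_times) - 1
    tpot = (total - ttft) / gaps if gaps else 0.0
    return ttft, tpot, total

# Synthetic stream matching the example: first token at 100 ms, then one every 30 ms
token_times = [0.100 + 0.030 * i for i in range(100)]
ttft, tpot, total = stream_metrics(0.0, token_times)
print(f"TTFT={ttft * 1e3:.0f} ms  TPOT={tpot * 1e3:.0f} ms  total={total:.2f} s")
```

Measured this way, the first token is already covered by TTFT, so the total is TTFT + (N - 1) × TPOT, about 3.07 s here. The formula above rounds that to N × TPOT, which is a close approximation for long outputs.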

KV Cache

What Is KV Cache

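KV Cache is the most important optimization technique in LLM inference. It stores the key and value tensors already computed for earlier tokens, so each decode step only processes the newest token instead of recomputing the whole sequence.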
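
Caching keys and values trades computation for memory, and the memory cost is easy to estimate. The sketch below assumes a 7B-class shape (32 layers, 32 KV heads, head dimension 128) stored in FP16; substitute the values for your own model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Keys and values (the factor of 2) for every layer, head, and cached position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128, FP16
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token // 1024, "KiB per token;", full / 2**30, "GiB at 4096 tokens")
```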

vLLM and PagedAttention

PagedAttention Principle

PagedAttention, introduced by vLLM, addresses the KV Cache fragmentation problem:

Traditional approach:
  Pre-allocate maximum sequence length KV Cache for each request
  Causes significant waste

PagedAttention:
  Divide KV Cache into fixed-size "pages"
  Allocate pages on demand
  Similar to OS virtual memory

Effect:
  - Memory utilization rises from ~50% to ~95%
  - Can serve more concurrent requests
  - Supports longer sequences
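
The difference between the two allocation schemes can be illustrated with a small calculation. This is a toy model of slot accounting only, not vLLM's allocator, and the request lengths are made up.

```python
import math

def utilization(seq_lens, max_seq_len, block_size=16):
    """Fraction of reserved KV slots actually used under two allocation schemes."""
    used = sum(seq_lens)
    preallocated = max_seq_len * len(seq_lens)        # reserve the maximum per request
    paged = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return used / preallocated, used / paged

# Hypothetical mix of request lengths against a 2048-token maximum
static_util, paged_util = utilization([100, 350, 900, 1500], max_seq_len=2048)
print(f"pre-allocated: {static_util:.0%}, paged: {paged_util:.0%}")
```

With paging, the only waste is the partly filled last block of each sequence.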

Using vLLM

# Install
pip install vllm

# Start API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 1

# Use OpenAI-compatible API
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "Hello, world!",
        "max_tokens": 100
    }'

vLLM Performance Tuning

Key parameters:

--gpu-memory-utilization 0.9
  GPU memory usage ratio (default 0.9)

--max-num-seqs 256
  Maximum concurrent sequences

--max-num-batched-tokens 8192
  Maximum tokens per iteration

--block-size 16
  KV Cache page size

--swap-space 4
  CPU swap space (GB)

Other LLM Inference Frameworks

TensorRT-LLM

NVIDIA's high-performance LLM inference framework:

Features:
  - Deeply optimized CUDA kernels
  - Supports Tensor Parallelism
  - Supports In-flight Batching
  - Integrates with Triton Inference Server

Performance:
  Usually 10-30% faster than vLLM (scenario dependent)

Text Generation Inference (TGI)

Hugging Face's inference framework:

Features:
  - Easy to use
  - Supports multiple models
  - Built-in Continuous Batching
  - Docker deployment friendly

llama.cpp

LLM inference for CPU and edge devices:

Features:
  - Pure C/C++ implementation
  - Supports quantization (GGUF format)
  - Runs on CPU, Apple Silicon
  - Low memory footprint

Performance Optimization Strategies

Batching Strategies

1. Static Batching
   - Fixed batch size
   - Wait for batch to fill or timeout
   - Simple but inefficient

2. Continuous Batching
   - Dynamically add/remove requests
   - Don't wait for batch to fill
   - Higher GPU utilization

3. In-flight Batching
   - NVIDIA's name (in TensorRT-LLM) for continuous batching
   - Add new requests during decode
   - Maximize throughput
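
A toy simulation shows why refilling slots matters. Each request is reduced to the number of decode steps it needs; the functions are hypothetical and ignore prefill, memory limits, and scheduling overhead.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue on the next step."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        active = [n - 1 for n in active if n > 1]   # requests with 1 step left finish now
        steps += 1
    return steps

lengths = [10, 100, 10, 100, 10, 10, 10, 10]        # decode steps per request (made up)
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```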

Quantization

Quantization impact on LLM:

Precision  Model Size  Memory BW   TPS Improvement
─────────────────────────────────────────────────
FP16       1.0x        1.0x        1.0x
INT8       0.5x        0.5x        ~1.5-2x
INT4       0.25x       0.25x       ~2-3x

Note:
  - Decode is memory-bound
  - Reducing data directly improves performance
  - But may affect output quality
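
The quality caveat comes from rounding error. The sketch below uses symmetric per-tensor INT8 quantization, one of several possible schemes, on a handful of made-up weights to show how large the error can get.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns integer codes and the scale."""
    scale = (max(abs(v) for v in values) / 127.0) or 1.0   # avoid a zero scale
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.02, -0.5, 0.31, 1.27, -0.004]                 # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"max error {max_error:.4f}")
```

The error per weight is at most half the scale, so a single large outlier widens the scale and costs precision for every other weight.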

Speculative Decoding

Principle:
  Use small model to "guess" multiple tokens
  Verify with large model
  If correct, accept multiple tokens at once

Effect:
  - Can improve TPS by 2-3x
  - Doesn't affect output quality
  - Requires additional small model
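
A greedy version of the accept/verify loop can be sketched as follows. Both "models" are stand-in functions, and production systems use a probabilistic acceptance rule and verify all guesses in one batched pass; this sketch only shows why the output matches what the large model would have produced alone.

```python
def speculative_step(draft_next, target_next, tokens, k):
    """One round of greedy speculative decoding; returns the tokens accepted."""
    proposal, ctx = [], list(tokens)
    for _ in range(k):                    # cheap model guesses k tokens
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(tokens)
    for t in proposal:                    # a real system checks these in one target pass
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)     # replace the first wrong guess and stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))     # all guesses right: target adds a bonus token
    return accepted

# Stand-in models (hypothetical): the draft is wrong only at context length 6
target = lambda ctx: len(ctx) % 10
draft = lambda ctx: -1 if len(ctx) == 6 else len(ctx) % 10
print(speculative_step(draft, target, [0, 1, 2], k=4))  # [3, 4, 5, 6]
```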

Priya's Dashboard, Revisited

After her investigation, Priya rebuilt the monitoring dashboard for the chatbot. The new version tracked metrics that actually mattered for user experience:

  • TTFT distribution (not just average, but P50/P95/P99)
  • ITL (inter-token latency) histogram (to catch stuttering)
  • KV cache memory pressure (to predict when GC would trigger)
  • Prefill queue depth (to predict TTFT spikes)

She also added a "smoothness score"—a custom metric that penalized high ITL variance, even when average TPS looked fine.
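
The text does not define the smoothness score, so the following is one possible definition, assumed for illustration: it is based on the coefficient of variation of the inter-token latencies.

```python
import statistics

def inter_token_latencies(token_times):
    """Gaps between consecutive token arrival times."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

def smoothness_score(itl):
    """Hypothetical score in (0, 1]: 1.0 means perfectly even token spacing."""
    mean = statistics.mean(itl)
    cv = statistics.pstdev(itl) / mean    # coefficient of variation
    return 1.0 / (1.0 + cv)

steady = [0.030] * 100                        # 30 ms between every token
stutter = ([0.030] * 49 + [0.130]) * 2        # an extra 100 ms pause every 50 tokens
print(smoothness_score(steady), round(smoothness_score(stutter), 3))
```

The two streams differ by only 2 ms in mean ITL (30 ms vs 32 ms), which is why an average-based dashboard would miss the stutter while this score drops noticeably.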

"The old dashboard said everything was fine," she told her manager. "The new one would have caught the stuttering issue on day one."

The lesson she learned applies beyond LLMs: the right metrics depend on the user experience you're trying to deliver. For a batch processing system, average throughput might be enough. For an interactive chatbot, you need to measure what users actually feel—and that means understanding the unique characteristics of your workload.

LLM inference isn't just "AI inference with more parameters." It's a fundamentally different workload with its own performance model, its own bottlenecks, and its own metrics. Master those, and you can build systems that don't just perform well on benchmarks—they feel fast to users.


Summary

LLM performance analysis has its own metrics and challenges:

Core Metrics

  • TTFT: First token latency
  • TPS: Generation speed
  • TPOT: Time per token
  • ITL: Inter-token latency

Key Technologies

  • KV Cache: Avoid redundant computation
  • PagedAttention: Solve memory fragmentation
  • Continuous Batching: Improve throughput
  • Quantization: Reduce memory requirements

Inference Frameworks

  • vLLM: PagedAttention, high throughput
  • TensorRT-LLM: NVIDIA optimized
  • TGI: Hugging Face, easy to use
  • llama.cpp: CPU/edge devices

Optimization Directions

  • Prefill: Compute optimization (Tensor Cores)
  • Decode: Memory optimization (quantization, KV Cache)
  • System: Batching strategies

Chapter 28: ML Compilers and Runtime

Part VII: AI/HPC


"A compiler is a program that translates a program written in one language into a program written in another language." — Alfred Aho

The 50x Speedup That Came from Nowhere

Raj was benchmarking different inference backends for his computer vision model. The model was straightforward—a ResNet-50 for image classification. He expected small differences between backends, maybe 10-20%.

The first run with vanilla PyTorch: 15ms per image.

With TorchScript: 12ms. A modest 20% improvement.

Then he tried TensorRT: 0.3ms.

Raj stared at the numbers. Fifty times faster? He ran it again. Same result. He checked the accuracy—identical to within floating-point tolerance.

"How is this possible?" he asked his colleague Ming, who had experience with ML compilers. "It's the same model. Same GPU. Same input."

Ming smiled. "You just discovered why ML compilers exist. PyTorch is designed for flexibility and debugging. It executes operations one at a time, with Python overhead between each one. TensorRT analyzes the entire graph, fuses operations together, chooses optimal kernel implementations, and preallocates all memory. The math is identical, but the execution is completely different."

"But why doesn't everyone just use TensorRT then?"

"Trade-offs. TensorRT compilation can take 20 minutes. It doesn't support every operation. It's harder to debug. And if your model changes frequently, recompiling is painful. ML compilers are powerful, but they're not free."

This chapter explores the world of ML compilers—how they achieve dramatic speedups, what trade-offs they involve, and how to benchmark systems that use them.


Why ML Compilers Exist

When you train a model in PyTorch or TensorFlow, the framework prioritizes flexibility. Each operation is dispatched independently. Gradient computation is tracked automatically. You can stop, inspect, and modify execution at any point.

This flexibility is essential for research but disastrous for production inference. Every layer of abstraction adds overhead. Every dynamic dispatch adds latency. Every flexibility feature you're not using is still costing you.

ML compilers bridge this gap: they take a high-level model description and produce optimized code for specific hardware, eliminating the flexibility overhead in exchange for performance.

The Complexity They Hide

Consider what's required to run a simple convolutional neural network efficiently:

  1. Operation Fusion: A Conv → BatchNorm → ReLU sequence should be executed as a single fused kernel, not three separate operations
  2. Memory Layout: Should the tensor be stored as NCHW or NHWC? Different hardware prefers different layouts
  3. Precision Selection: Which layers can use FP16? Which need FP32? Where should quantization happen?
  4. Kernel Selection: For a 3×3 convolution with batch size 32, which of the 47 available kernel implementations is fastest?
  5. Memory Planning: How should intermediate activations be allocated to minimize fragmentation?

Multiply this by dozens of hardware targets, hundreds of possible operators, and millions of possible configurations. No human can manually optimize this. ML compilers make it tractable.

What ML Compilers Do

Input: High-level model description (PyTorch, TensorFlow, ONNX)
      ↓
┌─────────────────────────────────────────────────────────┐
│                    Frontend                             │
│  - Parse model                                          │
│  - Build computation graph                              │
│  - Type inference                                       │
└─────────────────────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────────────────────┐
│                    Graph Optimization                   │
│  - Operator fusion                                      │
│  - Constant folding                                     │
│  - Dead code elimination                                │
│  - Layout transformation                                │
└─────────────────────────────────────────────────────────┘
      ↓
┌─────────────────────────────────────────────────────────┐
│                    Backend                              │
│  - Hardware-specific optimization                       │
│  - Memory planning                                      │
│  - Code generation                                      │
└─────────────────────────────────────────────────────────┘
      ↓
Output: Optimized executable program

TVM (Apache TVM)

TVM is one of the best-known open-source ML compilers.

TVM Architecture

┌─────────────────────────────────────────────────────────┐
│                    Relay (High-level IR)                │
│  - Functional IR                                        │
│  - Supports dynamic shapes                              │
│  - Graph-level optimization                             │
├─────────────────────────────────────────────────────────┤
│                    TIR (Tensor IR)                      │
│  - Low-level IR                                         │
│  - Loop representation                                  │
│  - Hardware mapping                                     │
├─────────────────────────────────────────────────────────┤
│                    Runtime                              │
│  - Cross-platform execution                             │
│  - Memory management                                    │
│  - Device abstraction                                   │
└─────────────────────────────────────────────────────────┘

Using TVM

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor  # not imported by "import tvm" alone

# 1. Load ONNX model
onnx_model = onnx.load("model.onnx")

# 2. Convert to Relay IR
mod, params = relay.frontend.from_onnx(onnx_model)

# 3. Set target hardware
target = tvm.target.Target("cuda")

# 4. Compile
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# 5. Execute (input name and shape are placeholders; match them to your model)
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
input_data = np.random.rand(1, 3, 224, 224).astype("float32")
module.set_input("input", input_data)
module.run()
output = module.get_output(0)

ONNX Runtime

ONNX Runtime is a cross-platform inference engine developed by Microsoft.

ONNX Runtime Features

Advantages:
  - Wide hardware support
  - Mature and stable
  - Easy to integrate
  - Supports multiple Execution Providers

Execution Providers:
  - CPU (default)
  - CUDA
  - TensorRT
  - DirectML
  - OpenVINO
  - CoreML
  - NNAPI

Using ONNX Runtime

import onnxruntime as ort
import numpy as np

# Create session
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

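# Prepare input (shape and dtype are placeholders; match them to your model)
input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
outputs = session.run(None, {input_name: input_data})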

XLA (Accelerated Linear Algebra)

XLA is an ML Compiler developed by Google, primarily used for TensorFlow and JAX.

XLA Features

Design goals:
  - Automatically optimize TensorFlow/JAX programs
  - Support TPU
  - JIT and AOT compilation

Main optimizations:
  - Operator fusion
  - Memory optimization
  - Parallelization

Using XLA in TensorFlow

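import tensorflow as tf

# Example weights (shapes are illustrative)
w = tf.random.normal([128, 64])
b = tf.zeros([64])

# Method 1: Use jit_compile
@tf.function(jit_compile=True)
def model_fn(x):
    return tf.nn.relu(tf.matmul(x, w) + b)

# Method 2: Enable auto-clustering globally
tf.config.optimizer.set_jit(True)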

Using XLA in JAX

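import jax
import jax.numpy as jnp

# JAX uses XLA by default
@jax.jit
def model_fn(x, w, b):
    return jax.nn.relu(jnp.dot(x, w) + b)

# Example inputs (shapes are illustrative)
x = jnp.ones((8, 128))
w = jnp.ones((128, 64))
b = jnp.zeros((64,))

# Execute (compiled on the first call, reused for later calls with the same shapes)
result = model_fn(x, w, b)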

Performance Comparison

Benchmark Setup

Model: ResNet-50
Hardware: NVIDIA A100
Batch Size: 1, 8, 32
Precision: FP32, FP16

Typical Results

Framework/Compiler    Batch=1    Batch=8    Batch=32
─────────────────────────────────────────────────────
PyTorch (eager)       5.2 ms     8.1 ms     18.5 ms
PyTorch (compile)     3.8 ms     5.2 ms     12.1 ms
ONNX Runtime (CUDA)   3.5 ms     4.8 ms     11.2 ms
TensorRT              2.1 ms     3.2 ms     7.8 ms
TVM (tuned)           2.4 ms     3.5 ms     8.5 ms

Note: Actual values depend on specific configuration
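
Numbers like these are only comparable when every backend is timed the same way. The sketch below is a minimal harness with warm-up runs and percentile reporting. It assumes `fn` blocks until the work is finished, so for GPU backends the callable has to synchronize the device before returning.

```python
import statistics
import time

def benchmark(fn, warmup=10, runs=100):
    """Time fn() after warm-up; returns (median_ms, p99_ms)."""
    for _ in range(warmup):            # let JIT compilation, caches, and clocks settle
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    return statistics.median(samples), p99

# Stand-in workload; replace the lambda with a call into the backend under test
median_ms, p99_ms = benchmark(lambda: sum(range(10_000)), warmup=5, runs=50)
print(f"median {median_ms:.3f} ms, p99 {p99_ms:.3f} ms")
```

Compilation time is deliberately excluded by the warm-up; for compilers with long build steps it is worth reporting separately.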

Selection Guide

Scenario                Recommendation
─────────────────────────────────────────────────────
Rapid prototyping       PyTorch eager
Production (NVIDIA)     TensorRT
Cross-platform          ONNX Runtime
Edge devices            TVM, IREE
TPU                     XLA (JAX)
Research/experiments    PyTorch compile

Common Optimization Techniques

Operator Fusion

Before optimization:
  x = Conv(input)
  y = BatchNorm(x)
  z = ReLU(y)

  3 memory read/writes

After optimization:
  z = FusedConvBNReLU(input)

  1 memory read/write

Effect: Reduces memory bandwidth requirements
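
The memory-traffic argument can be illustrated with an elementwise stand-in for the Conv → BatchNorm → ReLU chain. The scale and shift below replace the real convolution and normalization, which is a simplification for illustration only.

```python
def unfused(xs, scale, shift):
    a = [x * scale for x in xs]          # pass 1: write intermediate a
    b = [v + shift for v in a]           # pass 2: read a, write intermediate b
    return [max(v, 0.0) for v in b]      # pass 3: read b, write the result

def fused(xs, scale, shift):
    # One pass over the data, no intermediate buffers
    return [max(x * scale + shift, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5, 3.0]
print(fused(data, 2.0, 1.0))
```

Both versions compute the same values; the fused one simply never writes the intermediates to memory.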

Constant Folding

Before optimization:
  a = Constant(2)
  b = Constant(3)
  c = Add(a, b)
  y = Mul(x, c)

After optimization:
  y = Mul(x, 5)

Effect: Reduces runtime computation
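
The same transformation can be written as a small recursive pass. The tuple-based expression format here is a made-up representation for illustration, not any compiler's IR.

```python
def fold(expr):
    """Fold constants in a tiny expression tree of ('add' | 'mul', lhs, rhs) tuples.

    Leaves are numbers (constants) or strings (runtime inputs).
    """
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr[0], fold(expr[1]), fold(expr[2])
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return lhs + rhs if op == "add" else lhs * rhs
    return (op, lhs, rhs)

print(fold(("mul", "x", ("add", 2, 3))))  # ('mul', 'x', 5)
```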

Layout Transformation

Different hardware prefers different data layouts:

CPU:  NCHW (batch, channel, height, width)
GPU:  NCHW or NHWC
TPU:  NHWC

ML Compilers automatically insert necessary transformations
and try to minimize the number of conversions

Memory Planning

Problem:
  Intermediate results need memory
  How to minimize total memory usage?

Solution:
  Analyze tensor lifetimes
  Reuse memory that's no longer needed
  Similar to compiler register allocation
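
A greedy version of lifetime-based reuse is sketched below. Real planners use more sophisticated packing; the tensor names, sizes, and step numbers are made up.

```python
def plan_memory(tensors):
    """Greedy buffer reuse.

    tensors: list of (name, size, first_use_step, last_use_step).
    Returns (assignment of tensor name -> buffer index, list of buffer sizes).
    """
    buffers = []          # each entry: [size, last step at which it is still in use]
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for idx, buf in enumerate(buffers):
            if buf[1] < first and buf[0] >= size:   # free again and big enough
                buf[1] = last
                assignment[name] = idx
                break
        else:
            buffers.append([size, last])
            assignment[name] = len(buffers) - 1
    return assignment, [b[0] for b in buffers]

# A simple chain: each tensor is produced at one step and consumed at the next
chain = [("a", 100, 0, 1), ("b", 100, 1, 2), ("c", 100, 2, 3), ("d", 50, 3, 4)]
assignment, sizes = plan_memory(chain)
print(assignment, sum(sizes))
```

Four tensors totalling 350 units fit in two buffers of 100 units each, because at most two of them are live at the same time.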

The Compiler That Saved the Project

Remember Aisha's edge deployment problem? After two weeks of manual optimization, she was still 40% short of her latency target.

Then she tried TVM with auto-tuning. She let it run overnight on a representative workload.

The next morning, she had a model that met her latency target with room to spare. The auto-tuner had found optimizations she never would have discovered manually—unusual tile sizes, unexpected operator fusion patterns, memory layouts that seemed counterintuitive but worked perfectly for her specific hardware.

"I spent two weeks doing what the compiler did in eight hours," she admitted. "And it did it better."

But she also learned the limits. When she tried to deploy the same model on a slightly different chip variant, the auto-tuned schedule performed poorly. She had to re-tune for the new target.

"ML compilers aren't magic," she concluded. "They're tools that trade tuning time for performance. For production deployment on known hardware, they're invaluable. For rapid prototyping across many targets, they might slow you down."

The lesson: ML compilers represent a fundamental shift in how we think about optimization—from hand-crafted expertise to automated search. But like any tool, knowing when to use them is as important as knowing how.


Summary

ML compilers are a key technology for modern AI deployment:

Main Tools

  • TVM: Open source, auto-tuning, cross-platform
  • IREE: Lightweight, suitable for edge devices
  • ONNX Runtime: Mature, stable, easy to integrate
  • XLA: Backend for TensorFlow/JAX

Core Optimizations

  • Operator fusion
  • Constant folding
  • Layout transformation
  • Memory planning

Selection Considerations

  • Target hardware
  • Performance requirements
  • Development efficiency
  • Maintenance cost

Performance Analysis

  • Use each framework's profiling tools
  • Compare performance across compilers
  • Consider tuning time vs performance gain

Chapter 29: Edge AI Performance

Part VII: AI/HPC


"The future of AI is at the edge." — Pete Warden

When the Cloud Isn't an Option

Elena was developing a hearing aid that used AI to separate speech from background noise. The algorithm worked beautifully in the lab—running on a workstation with an RTX 4090.

Now she had to make it run on a device the size of a fingernail, powered by a battery smaller than a watch battery, with less computing power than a 1990s calculator.

"This model needs 50 GFLOPS," her signal processing colleague said, reviewing the specs. "The chip provides 0.5 GFLOPS. That's a 100x gap."

"And it needs to run in real-time," Elena added. "15ms latency max, or users will notice the delay between lip movement and audio. Oh, and the battery needs to last 16 hours."

This is edge AI in a nutshell: the same intelligence that runs on data center GPUs, compressed into devices that run on milliwatts. The constraints seem impossible until you realize that millions of such devices ship every month.

This chapter explores performance analysis for edge AI—where the rules are different, the constraints are brutal, and traditional GPU-centric thinking will lead you astray.


A Different World of Constraints

Edge AI operates under constraints that would seem absurd to cloud engineers:

Resource   Cloud (A100 GPU)   Edge (Typical MCU)   Ratio
──────────────────────────────────────────────────────────────────
Memory     80 GB              256 KB - 1 MB        80,000 - 320,000x
Compute    312 TFLOPS         1-100 MOPS           3M - 300M x
Power      400W               10mW - 1W            400 - 40,000x
Cost       $10,000+           $1 - $10             1,000 - 10,000x

Depending on the dimension, that is a gap of roughly three to eight orders of magnitude. Techniques that work in the cloud (larger batch sizes, more parameters, higher precision) are simply impossible.

The Four Constraints of Edge AI

Memory: Your model must fit. Not "mostly fit" or "fit with swapping." The entire model, plus activations, plus input/output buffers, must fit in available RAM and Flash. There's no cloud to offload to.

Power: Battery life matters more than speed. A model that runs 2x faster but consumes 3x the energy is worse, not better. Thermal limits cap sustained performance.

Latency: Real-time means real-time. A hearing aid with 100ms delay is unusable. An autonomous vehicle with 200ms perception delay is dangerous. You need both low latency and consistent latency.

Cost: When you're shipping millions of units, every cent matters. A $2 chip that needs a $3 NPU accelerator is twice as expensive as a $2.50 chip that doesn't.

TensorFlow Lite

TensorFlow Lite is Google's lightweight inference framework for mobile and embedded devices.

Model Conversion

import tensorflow as tf

# Load SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")

# Basic conversion
tflite_model = converter.convert()

# Save
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Quantization Options

import numpy as np

# Dynamic range quantization (simplest)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization (requires representative dataset)
# Random data is only a placeholder; calibrate with real input samples
def representative_dataset():
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Float16 quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

Running Inference

import numpy as np
import tensorflow as tf

# Load model
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Get input/output info
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Set input
input_data = np.random.randn(1, 224, 224, 3).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)

# Execute
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])

TFLite Benchmark Tool

# Download benchmark tool
# https://www.tensorflow.org/lite/performance/measurement

# Run benchmark
./benchmark_model \
    --graph=model.tflite \
    --num_threads=4 \
    --warmup_runs=10 \
    --num_runs=100

# Example output:
# Inference (avg): 15.2 ms
# Inference (std): 1.3 ms

TensorFlow Lite Micro

TFLite Micro is an ultra-lightweight inference framework for microcontrollers.

Design Goals

TFLite Micro features:

1. Minimal binary size

Using TFLite Micro

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Model data (typically loaded from Flash)
extern const unsigned char model_data[];

// Tensor Arena (size depends on model)
constexpr int kTensorArenaSize = 10 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

void setup() {
    // Load model
    const tflite::Model* model = tflite::GetModel(model_data);

    // Set up operators
    static tflite::MicroMutableOpResolver<5> resolver;
    resolver.AddConv2D();
    resolver.AddMaxPool2D();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    resolver.AddReshape();

    // Create interpreter
    static tflite::MicroInterpreter interpreter(
        model, resolver, tensor_arena, kTensorArenaSize);

    // Allocate tensors
    interpreter.AllocateTensors();

    // Get input tensor
    TfLiteTensor* input = interpreter.input(0);

    // Set input data...

    // Run inference
    interpreter.Invoke();

    // Get output
    TfLiteTensor* output = interpreter.output(0);
}

MLPerf Tiny

MLPerf Tiny is the AI benchmark standard designed for microcontrollers.

Benchmark Content

MLPerf Tiny Benchmarks:

Benchmark        Model             Task              Input
─────────────────────────────────────────────────────────────
Visual Wake      MobileNet v1      Image class       96×96 RGB
                 (0.25)            (person detection)

Keyword Spot     DS-CNN            Speech recog      49×10 MFCC
                                   (keyword detect)

Anomaly Detect   FC AutoEncoder    Anomaly detect    128 features
                                   (machine sound)

Image Class      ResNet v1         Image class       32×32 RGB
                                   (CIFAR-10)

Performance Metrics

MLPerf Tiny metrics:

1. Latency
   - Single inference time
   - Unit: milliseconds

2. Throughput
   - Inferences per second
   - Unit: inferences/second

3. Energy
   - Energy per inference
   - Unit: μJ/inference

4. Accuracy
   - Must meet target accuracy
   - Example: Visual Wake Words > 80%

Typical Results

MLPerf Tiny v1.0 example results:

Hardware                   VWW Latency    KWS Latency    Energy
─────────────────────────────────────────────────────────────
STM32L4R5 (Cortex-M4)      250 ms         50 ms          1.2 mJ
MAX78000 (dedicated NPU)   2.5 ms         0.5 ms         12 μJ
GAP9 (RISC-V + NPU)        5 ms           1 ms           25 μJ

In this example, a dedicated NPU cuts latency by about 100x (250 ms vs 2.5 ms on VWW)

Edge AI Performance Analysis

Measurement Methods

1. Latency measurement
   - Use high-precision timer
   - Multiple runs for average
   - Note warm-up effects

2. Power measurement
   - Hardware power meter
   - Current sensor
   - Software estimation (imprecise)

3. Memory measurement
   - Static analysis (model size)
   - Runtime monitoring (peak RAM)
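
The current-sensor method in item 2 boils down to integrating current over the duration of one inference and multiplying by the supply voltage. The sketch below assumes evenly spaced current samples captured during a single inference. The function name, the sample values, and the 3.3 V supply are all hypothetical, chosen only to make the arithmetic visible.

```python
def energy_per_inference_uj(current_ma, supply_v, sample_period_s):
    """Integrate current samples (mA) over time; return energy in microjoules."""
    if len(current_ma) < 2:
        raise ValueError("need at least two samples")
    # Trapezoidal rule gives charge in mA*s
    charge_ma_s = sum(
        (a + b) / 2.0 * sample_period_s
        for a, b in zip(current_ma, current_ma[1:])
    )
    # mA*s times volts is millijoules; convert to microjoules
    return charge_ma_s * supply_v * 1000.0


# Hypothetical trace: 5 samples, 1 ms apart, on a 3.3 V rail
samples_ma = [2.0, 10.0, 10.0, 10.0, 2.0]
energy_uj = energy_per_inference_uj(samples_ma, supply_v=3.3, sample_period_s=0.001)
print(f"{energy_uj:.1f} uJ per inference")
```

In practice the sample rate has to be high enough to capture short current spikes, which is one reason a dedicated power analyzer tends to give better numbers than a slow on-board sensor.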

Latency Analysis

// Latency measurement on ARM Cortex-M
#include "arm_math.h"

volatile uint32_t start_cycles, end_cycles;

// Use DWT Cycle Counter
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

// Measure
start_cycles = DWT->CYCCNT;
interpreter.Invoke();
end_cycles = DWT->CYCCNT;

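// CYCCNT is 32 bits wide: unsigned subtraction stays correct across one wrap,
// but a measurement longer than 2^32 cycles cannot be captured this way
uint32_t cycles = end_cycles - start_cycles;
float latency_ms = (float)cycles / (float)SystemCoreClock * 1000.0f;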

Optimization Strategies

Model Optimization

1. Quantization
   - INT8 quantization (4x smaller than FP32)
   - INT4 quantization (8x smaller than FP32)
   - Mixed precision

2. Pruning
   - Remove unimportant weights
   - Structured pruning better for hardware

3. Knowledge Distillation
   - Train small model with large model
   - Maintain accuracy while reducing size

4. Architecture Search
   - NAS to find optimal architecture
   - Optimize for target hardware
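
To make the quantization item concrete, here is a minimal sketch of affine INT8 quantization: each float is mapped to an 8-bit integer through a scale and a zero point, which is where the 4x size reduction relative to FP32 comes from. This is an illustration in plain Python, not the TensorFlow Lite converter's implementation. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8; return (q, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps exactly
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f} zero_point={zp} max error={max_err:.5f}")
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is why post-quantization accuracy has to be re-measured rather than assumed.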

Hardware Acceleration

Edge AI accelerators:

1. NPU (Neural Processing Unit)
   - Dedicated matrix operation units
   - Examples: Apple Neural Engine, Google Edge TPU

2. DSP (Digital Signal Processor)
   - Vector operations
   - Example: Qualcomm Hexagon

3. GPU (Mobile)
   - General parallel computing
   - Examples: Adreno, Mali

4. FPGA
   - Programmable hardware
   - Suitable for custom requirements

Kenji's Shipping Day

Six months after that first failed demo, Kenji's team shipped their product.

The final model was nothing like what they'd started with. The original MobileNetV2 had been replaced with a custom architecture—smaller, faster, and specifically designed for their hardware. They'd used knowledge distillation to train it, quantization to shrink it, and careful profiling to optimize every layer.

The result: 15 FPS inference on a $3 MCU, with 94% accuracy on their target task. Battery life: 18 months on a coin cell.

"The cloud version was 99% accurate," Kenji admitted. "We lost 5 percentage points. But we gained something more important: we can actually ship."

At the launch party, his manager asked what he'd learned.

"Edge AI isn't about making cloud AI smaller," Kenji said. "It's a different discipline entirely. Different constraints, different tools, different trade-offs. You can't just shrink a model and hope it works. You have to design for the edge from the beginning."

He paused. "Also, buy a good current meter. You'll need it."

Edge AI performance is where all the constraints collide: compute, memory, power, latency, accuracy, and cost. Mastering it requires understanding not just ML, but embedded systems, power electronics, and the art of making hard trade-offs. It's challenging—but when you ship a product that runs AI on a device that costs less than a cup of coffee, it's deeply satisfying.


Summary

Edge AI performance analysis requires considering unique constraints:

Key Frameworks

  • TensorFlow Lite: Mobile device standard
  • TFLite Micro: Microcontrollers
  • MLPerf Tiny: Standard benchmark
  • Core ML / NNAPI: Platform-specific acceleration

Core Performance Metrics

  • Latency: Inference time
  • Energy: Energy consumption
  • Memory: RAM/Flash usage
  • Accuracy: Post-quantization accuracy

Main Optimization Strategies

  • Quantization (INT8, INT4)
  • Pruning
  • Knowledge distillation
  • Hardware acceleration

Practical Measurement Methods

  • Cycle counter (latency)
  • Current sensor (power)
  • Static analysis + runtime monitoring (memory)

Chapter 30: Case Study: Web Server Optimization

Part VIII: Case Studies


"Premature optimization is the root of all evil. But premature pessimization is the root of all slowness." — Adapted from Donald Knuth

The Story of "Fast SSD" That Was Still Slow

Our API server handled static file requests. We had the latest NVMe SSD—rated at 7 GB/s read speed, 1 million IOPS.

But measured: average response time 50ms, peak throughput only 2,000 req/s.

"The SSD is so fast, how can it be this slow?"

After a week of debugging, we found the problem wasn't the SSD, but:

  1. Sync I/O: Each request blocked waiting for I/O completion
  2. Small files: Lots of 4KB requests, IOPS was the bottleneck
  3. Syscall overhead: Every read() is a syscall
  4. Context switches: Thread-per-request model

This chapter walks through analyzing a web server's performance using all the tools we've learned.

Scenario Setup

System Specs

Server:
- CPU: AMD EPYC 7543 (32 cores, 64 threads)
- RAM: 256 GB DDR4-3200
- Storage: Samsung PM9A3 NVMe SSD (7.68 TB)
  - Sequential Read: 6.9 GB/s
  - Random Read IOPS: 1,000,000 (4KB)
- Network: Mellanox ConnectX-6 (100 Gbps)
- OS: Ubuntu 22.04, Kernel 5.15

Application:
- Nginx + upstream API server
- Main workload: static files + JSON API
- Target: 50,000 req/s, P99 < 10ms

Initial State

# Benchmark with wrk
wrk -t12 -c400 -d30s http://server/api/users

Running 30s test @ http://server/api/users
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    45.23ms   67.89ms 523.12ms   87.65%
    Req/Sec   912.34    234.56    1.89k    72.34%
  328234 requests in 30.01s, 1.23GB read
Requests/sec:  10937.23
Transfer/sec:     42.01MB

Problems:

  • Throughput: 10,937 req/s (target 50,000)
  • P99 latency: estimated > 200ms (target < 10ms); wrk prints percentiles only when run with --latency
  • Gap: about 4.6× in throughput

Step 1: Find the Bottleneck

CPU or I/O?

# Check CPU usage
mpstat -P ALL 1

# Result
CPU    %usr   %sys   %iowait   %idle
all    12.3   18.7     3.2     65.8

# CPU only ~31% used, lots of idle
# This is not CPU-bound

# Check I/O
iostat -x 1

Device         r/s     rkB/s   await  %util
nvme0n1     8234.00  32936.00    0.12   8.2%

# SSD only 8.2% utilized, not I/O-bound either

Conclusion: Neither the CPU nor the SSD is saturated, and at 42 MB/s of traffic the 100 Gbps NIC is nowhere near its limit. The problem is in the "software layer."

Use perf to Find Where CPU Time Goes

perf record -g -p $(pgrep -f "api_server") -- sleep 30
perf report

# Result
  35.2%  api_server  libc.so.6       [.] __GI___poll
  18.7%  api_server  [kernel]        [k] system_call_fastpath
  12.3%  api_server  libc.so.6       [.] malloc
   8.9%  api_server  api_server      [.] json_serialize
   6.5%  api_server  libc.so.6       [.] __GI___read
   ...

Findings:

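  • 35% of sampled CPU time in poll(): perf counts on-CPU cycles, so this is the cost of calling poll() over and over, not time spent blocked
  • 18.7% in the syscall entry path: too many system calls
  • 12.3% in malloc: frequent memory allocation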

Use strace to See Syscall Pattern

strace -c -p $(pgrep -f "api_server") -f

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.15    1.234567         2.1    587654           poll
 24.89    0.956789         1.8    531234           read
 18.76    0.721234         2.3    313567           write
 12.34    0.474123         3.2    148234           open
  8.21    0.315678         2.9    108876           close
  3.65    0.140234         1.5     93489           fstat

Each request approximately:

  • 1 poll
  • 1 open
  • 1+ read
  • 1+ write
  • 1 close
  • 1 fstat

At least 6 syscalls per request, so 10,000 req/s means 60,000 syscalls/s, and the 50,000 req/s target would mean 300,000 syscalls/s.

Step 2: Optimize One by One

Optimization 1: Reduce Syscalls (sendfile)

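Original flow: read() copies the file from the page cache into a user-space buffer, then write() copies that buffer back into the kernel's socket buffer. That is two syscalls and two data copies per chunk. With sendfile(), the kernel moves the data from the page cache to the socket in a single syscall, and the file contents never enter user space.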
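
A minimal sketch of the two ways to serve a file, written in Python for brevity rather than in the server's own language. `send_with_copy` is the read-then-send loop; `send_with_sendfile` uses `socket.sendfile()`, which calls `os.sendfile()` where the platform supports it and falls back to plain sends otherwise. The function names, the local socket pair, and the temporary file are assumptions for the demo only.

```python
import os
import socket
import tempfile


def send_with_copy(sock, path, chunk=512):
    """read() into user space, then send(): two syscalls and two copies per chunk."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                return sent
            sock.sendall(data)
            sent += len(data)


def send_with_sendfile(sock, path):
    """Let the kernel move file data to the socket without a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Demo over a local socket pair with a small temporary file
payload = b"x" * 2000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

a, b = socket.socketpair()
n_copy = send_with_copy(a, tmp.name)
n_sendfile = send_with_sendfile(a, tmp.name)
a.close()

received = b""
while chunk := b.recv(65536):
    received += chunk
b.close()
os.unlink(tmp.name)

print(n_copy, n_sendfile, len(received))
```

Both paths deliver the same bytes. The difference shows up in `strace -c`: the copy loop issues a read and a send per chunk, while the sendfile path needs far fewer calls for the same file.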

Optimization 2: io_uring Instead of epoll

Traditional epoll pattern:

while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN) {
            int fd = events[i].data.fd;  // Socket that became readable
            read(fd, buf, size);   // syscall
            process(buf);
            write(fd, response, len);  // syscall
        }
    }
}

Each I/O operation is a separate syscall.

Using io_uring:

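#include <liburing.h>

// Setup io_uring
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);

// Batch submit multiple I/O
struct io_uring_sqe *sqe;
for (int i = 0; i < batch_size; i++) {
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) break;  // Submission queue full: submit, then retry the rest
    io_uring_prep_read(sqe, fds[i], bufs[i], sizes[i], 0);
}

// One syscall submits all I/O
io_uring_submit(&ring);  // 1 syscall for N operations!

// Completions are reaped later with io_uring_wait_cqe() / io_uring_cqe_seen()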

io_uring advantages:

  • Batch submission, fewer syscalls
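  • Submission and completion queues live in memory shared with the kernel, so requests and results are not copied across the boundary
  • Supports zero-copy send (IORING_OP_SEND_ZC, Linux 6.0+, so not available on this system's 5.15 kernel)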

Result

Before: 18,234 req/s
After:  32,456 req/s (+78%)

Optimization 3: Memory Pool (Reduce malloc)

Original: malloc/free for each request:

void handle_request(int fd) {
    char *buffer = malloc(4096);  // malloc every time
    read(fd, buffer, 4096);
    char *response = malloc(response_size);  // malloc again
    build_response(buffer, response);
    write(fd, response, response_size);
    free(response);
    free(buffer);
}

Using memory pool:

// Thread-local buffer pool
static __thread struct {
    char request_buf[4096];
    char response_buf[65536];
} buffers;

void handle_request(int fd) {
    read(fd, buffers.request_buf, 4096);  // Reuse buffer
    build_response(buffers.request_buf, buffers.response_buf);
    write(fd, buffers.response_buf, response_size);
    // No free needed
}

Result

Before: 32,456 req/s
After:  41,234 req/s (+27%)

Optimization 4: Connection Pooling and Keep-Alive

Cost of each new connection:

TCP three-way handshake:  ~1 RTT
TLS handshake:            ~2 RTT (TLS 1.2) or 1 RTT (TLS 1.3)
Connection setup:         ~100μs

For short requests, this overhead can be longer than processing itself

Enable HTTP Keep-Alive:

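# nginx.conf
http {
    keepalive_timeout 65;
    keepalive_requests 1000;  # Max 1000 requests per connection

    upstream backend {
        server 127.0.0.1:8080;
        keepalive 128;  # Idle upstream connections cached per worker
    }

    server {
        location /api/ {
            proxy_pass http://backend;
            proxy_http_version 1.1;          # Upstream keep-alive needs HTTP/1.1
            proxy_set_header Connection "";  # Do not forward "Connection: close"
        }
    }
}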

Result

Before: 41,234 req/s
After:  48,567 req/s (+18%)

Step 3: System-Level Tuning

TCP Tuning

# /etc/sysctl.conf

# Increase socket buffer
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Increase connection backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535

# Reuse TIME_WAIT sockets for new outgoing connections (e.g., Nginx to upstream)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15

# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3

File Descriptor Limits

# /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000

# Or in systemd service
[Service]
LimitNOFILE=1000000

Final Results

Optimization Stage              Throughput      Improvement (vs. previous stage)
────────────────────────────────────────────────────────────
Initial state                   10,937 req/s    baseline
+ sendfile                      18,234 req/s    +67%
+ io_uring                      32,456 req/s    +78%
+ memory pool                   41,234 req/s    +27%
+ keep-alive                    48,567 req/s    +18%
+ system tuning                 56,789 req/s    +17%
────────────────────────────────────────────────────────────
Total                           56,789 req/s    +419%
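
The per-stage percentages in the table are each relative to the previous stage, so they multiply rather than add. A short check of the arithmetic, using only the throughput figures from the table:

```python
# Throughput after each stage (req/s), taken from the table above
stages = [
    ("initial", 10937),
    ("sendfile", 18234),
    ("io_uring", 32456),
    ("memory pool", 41234),
    ("keep-alive", 48567),
    ("system tuning", 56789),
]

total_speedup = 1.0
for (_, prev), (name, cur) in zip(stages, stages[1:]):
    step = cur / prev            # speedup relative to the previous stage
    total_speedup *= step        # per-stage speedups compound
    print(f"{name:14s} x{step:.2f}  (+{(step - 1) * 100:.0f}%)")

print(f"{'total':14s} x{total_speedup:.2f}  (+{(total_speedup - 1) * 100:.0f}%)")
```

Summing the percentages would give +207%, which understates the compounded result of roughly 5.2x.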

P99 latency: from 200+ms down to 8ms.

Exceeded target (50,000 req/s, P99 < 10ms).

Key Lessons

1. Bottleneck May Not Be Where You Think

Initial assumption: SSD too slow
Actual problem: syscall overhead, sync I/O, frequent malloc

Tools matter: perf, strace, bpftrace

2. Syscalls Are Expensive

One syscall ~100-1000 cycles
High throughput systems must reduce syscalls:
- Batch processing (io_uring)
- Zero-copy (sendfile)
- Avoid unnecessary calls (keep fd open)

3. Memory Allocation Is Hidden Cost

malloc/free itself isn't slow
But causes:
- Lock contention (multi-threaded)
- Cache pollution
- Memory fragmentation

Solution: memory pool, arena allocator

4. SSD Is Not Magic

SSD is fast, but:
- Each I/O has fixed overhead
- Queue depth matters
- Small I/O is inefficient
- Needs alignment

To fully utilize SSD performance:
- Async I/O
- High queue depth
- I/O coalescing
- Direct I/O (some scenarios)

Summary

Diagnostic Flow

  1. Identify bottleneck type (CPU/IO/Network)
  2. Use perf to find where CPU time goes
  3. Use strace to analyze syscall patterns
  4. Use bpftrace to see latency distribution

Optimization Techniques

Problem                 Solution
──────────────────────────────────────────────────────────
Too many syscalls       sendfile, io_uring, batching
Sync I/O                io_uring, async I/O
Frequent malloc         Memory pool, arena allocator
Connection overhead     Keep-alive, connection pool
Low SSD efficiency      High queue depth, I/O coalescing

System Tuning

  • TCP buffer size
  • File descriptor limits
  • CPU affinity
  • NUMA awareness

Remember

"Fast" SSD + slow software = slow system
"Slow" HDD + good software = acceptable system

Software architecture determines whether hardware potential is realized

Chapter 31: Case Study: Database Query Optimization

Part VIII: Case Studies


"The fastest query is the one you don't have to make." — Unknown DBA

The Story of "Cross-Datacenter Query" That Could Only Run 4 Times Per Second

Our service needed to read data from MySQL in another datacenter. Network latency was about 20ms (speed of light limitation).

Simple code:

def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders

A user has 10 orders, each order has 5 items.

Total: 1 + 1 + 10 = 12 queries.

Each query 20ms RTT → total latency 240ms.

Can only handle 4 requests per second (single thread).

This is the classic N+1 query problem, amplified by network latency.

Network Latency: The Underestimated Killer

Speed of Light Limits

Location              Distance    Fiber Latency (one-way)  RTT
───────────────────────────────────────────────────────────────
Same datacenter       < 1 km      ~0.005 ms               ~0.01 ms
Same city             ~50 km      ~0.25 ms                ~0.5 ms
Cross-city            ~350 km     ~1.75 ms                ~3.5 ms
Cross-country         ~2000 km    ~10 ms                  ~20 ms
Cross-continent       ~10000 km   ~50 ms                  ~100 ms

This is a physical limit that no tuning can remove. The only remedies are to make fewer round trips, to overlap them, or to avoid them with a cache.

Little's Law Again

Throughput = Concurrency / Latency

If RTT = 20ms, single connection:
Throughput = 1 / 0.02 = 50 queries/sec

To reach 1000 queries/sec:
Concurrency = 1000 × 0.02 = 20 parallel connections
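
The same arithmetic as a pair of helper functions, which can be handy when sizing a connection pool. The function names are illustrative. The only inputs are the round-trip time and the target rate.

```python
import math


def max_throughput(concurrency, latency_s):
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / latency_s


def required_concurrency(target_qps, latency_s):
    """Connections in flight needed to sustain target_qps at the given latency."""
    # Round first so floating-point noise does not push ceil() up by one
    return math.ceil(round(target_qps * latency_s, 9))


rtt_s = 0.020  # 20 ms round trip
print(max_throughput(1, rtt_s))           # single connection
print(required_concurrency(1000, rtt_s))  # connections for 1000 queries/sec
```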

Problem Analysis

Original Code Problems

# Classic N+1 problem
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    # One extra query per user
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
    user.orders = orders

# Total 101 queries!

Using EXPLAIN to Analyze

EXPLAIN SELECT * FROM orders WHERE user_id = 12345;

+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | orders | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+

type = ALL → Full table scan! No index used.

Optimization Strategies

Optimization 1: Add Index

Most basic but most effective:

-- Check existing indexes
SHOW INDEX FROM orders;

-- Add missing index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Verify
EXPLAIN SELECT * FROM orders WHERE user_id = 12345;

type = ref → Using index, scanning 10 rows (not 1M rows)
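
The before-and-after effect of the index can be reproduced locally. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL, so the plan text differs (SQLite's `EXPLAIN QUERY PLAN` reports SCAN versus SEARCH ... USING INDEX rather than `type = ALL` versus `type = ref`), but the principle is the same. The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10000)],
)


def plan(sql, params=()):
    """Return SQLite's query plan for the statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the plan detail


query = "SELECT * FROM orders WHERE user_id = ?"
plan_before = plan(query, (123,))

conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
plan_after = plan(query, (123,))

print("before:", plan_before)
print("after: ", plan_after)
```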

Optimization 2: Solve N+1 Problem

Method A: JOIN

-- Original: N+1 queries
SELECT * FROM users WHERE id = ?;
SELECT * FROM orders WHERE user_id = ?;

-- Optimized: 1 JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.id = ?;

Method B: Batch Query

# Original: N+1
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)

# Optimized: 2 queries
users = db.query("SELECT * FROM users LIMIT 100")
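user_ids = [u.id for u in users]
# Expand one placeholder per id; a single "?" cannot bind a whole list
placeholders = ", ".join(["?"] * len(user_ids))
orders = db.query(f"SELECT * FROM orders WHERE user_id IN ({placeholders})", *user_ids)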

# Combine in application layer
orders_by_user = group_by(orders, 'user_id')
for user in users:
    user.orders = orders_by_user.get(user.id, [])
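
A runnable version of the comparison, assuming SQLite in place of the remote MySQL server and a counter in place of real network round trips. The `query` wrapper, table contents, and function names are hypothetical. What matters is the number of calls each approach makes.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(100)]
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, 9.99) for i in range(500)],
)

query_count = 0


def query(sql, params=()):
    """Run a query and count it; each call stands for one network round trip."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    result = {}
    for (user_id,) in query("SELECT id FROM users LIMIT 100"):
        rows = query("SELECT id FROM orders WHERE user_id = ?", (user_id,))
        result[user_id] = sorted(r[0] for r in rows)
    return result


def load_batched():
    user_ids = [row[0] for row in query("SELECT id FROM users LIMIT 100")]
    placeholders = ", ".join("?" * len(user_ids))
    rows = query(
        f"SELECT user_id, id FROM orders WHERE user_id IN ({placeholders})",
        user_ids,
    )
    grouped = defaultdict(list)
    for user_id, order_id in rows:
        grouped[user_id].append(order_id)
    return {uid: sorted(grouped.get(uid, [])) for uid in user_ids}


query_count = 0
slow_result = load_n_plus_one()
n_plus_one_queries = query_count

query_count = 0
fast_result = load_batched()
batched_queries = query_count

print(n_plus_one_queries, "queries vs", batched_queries, "queries")
```

At a 20 ms round trip, the first version would spend about two seconds on the network alone and the second about 40 ms, for identical results.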

Method C: ORM Eager Loading

Optimization 4: Connection Pool

# Original: create connection for each query
def query(sql):
    conn = mysql.connect(host='db.server.com', ...)  # TCP + TLS handshake
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchall()
    conn.close()
    return result

# Optimized: connection pool
from sqlalchemy import create_engine
engine = create_engine(
    'mysql://user:pass@db.server.com/mydb',
    pool_size=20,           # Keep 20 connections
    max_overflow=10,        # Up to 10 extra
    pool_recycle=3600,      # Recycle after 1 hour
    pool_pre_ping=True      # Test connection before use
)

Connection establishment cost:

TCP three-way handshake:  1 RTT (~20ms)
TLS handshake:            2 RTT (~40ms) for TLS 1.2
MySQL authentication:     1 RTT (~20ms)
──────────────────────────────────────────
Total:                    ~80ms per new connection

With a connection pool, this cost is paid once per pooled connection instead of once per query.

Caching Strategy

Multi-Layer Cache Architecture

┌─────────────┐
│ Application │
│   Cache     │ ← L1: In-process cache (fastest, small capacity)
└──────┬──────┘
       │
┌──────▼──────┐
│   Redis /   │ ← L2: Distributed cache (fast, medium capacity)
│  Memcached  │
└──────┬──────┘
       │
┌──────▼──────┐
│   Database  │ ← L3: Database (slow, large capacity)
│ Buffer Pool │
└─────────────┘

Cache-Aside Pattern

def get_user(user_id):
    # 1. Check cache first
    cache_key = f"user:{user_id}"
    cached = redis.get(cache_key)
    if cached:
        return deserialize(cached)

    # 2. Cache miss, query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # 3. Write to cache
    redis.setex(cache_key, 3600, serialize(user))  # 1 hour expiry

    return user

Cache Invalidation

# Invalidate on update
def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.delete(f"user:{user_id}")  # Delete cache

# Or: update cache on update
def update_user_with_cache(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, serialize(user))

Practical Example: Optimizing Cross-Datacenter Query

Back to the original problem:

# Original: 12 queries, 240ms
def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders

Optimized Version

def get_user_orders_optimized(user_id):
    # 1. Check cache first
    cache_key = f"user_orders:{user_id}"
    cached = redis.get(cache_key)
    if cached:
        return deserialize(cached)

    # 2. Single query to get all data
    result = db.execute("""
        SELECT u.*, o.*, i.*
        FROM users u
        LEFT JOIN orders o ON u.id = o.user_id
        LEFT JOIN items i ON o.id = i.order_id
        WHERE u.id = ?
    """, user_id)

    # 3. Assemble in application layer
    user, orders = assemble_result(result)

    # 4. Write to cache
    redis.setex(cache_key, 1800, serialize((user, orders)))

    return user, orders

Results

Original:
- 12 queries × 20ms = 240ms
- Throughput: 4 req/s (single thread)

Optimized:
- Cache hit: < 1ms (Redis in same datacenter)
- Cache miss: 1 query × 20ms = 20ms
- Throughput: 50+ req/s (single thread)
- With 90% cache hit rate: average ~3ms
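
The "average ~3ms" figure is a weighted average of the hit and miss latencies. A small sketch shows how sensitive the average is to the hit rate, using the 1 ms hit and 20 ms miss figures above (the function name is illustrative):

```python
def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Average latency for a cache with the given hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


for hit_rate in (0.0, 0.5, 0.9, 0.99):
    avg = expected_latency_ms(hit_rate, hit_ms=1.0, miss_ms=20.0)
    print(f"hit rate {hit_rate:4.0%}: avg {avg:5.2f} ms, "
          f"~{1000.0 / avg:.0f} req/s per thread")
```

The average hides the tail: with a 90% hit rate, one request in ten still pays the full cross-datacenter round trip, so P99 latency stays near 20 ms.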

Summary

Network Latency Is a Hard Limit

20ms RTT = max 50 queries/sec (single connection)
Solution: reduce round trips, increase parallelism

N+1 Problem

Symptom: N+1 queries
Solution: JOIN, batch query, ORM eager loading

Optimization Layers

Layer         Optimization Method
──────────────────────────────────────────────────────
Query         Index, JOIN, batching
Connection    Connection pool, multiplexing
Protocol      Pipeline, compression
Cache         L1/L2 cache, Cache-Aside
Storage       Buffer pool, partitioning, SSD tuning
Network       TCP tuning, BBR

Caching Strategies

Cache-Aside: Fill on read
Write-Through: Update on write
Write-Behind: Async write

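Watch out for: cache penetration (repeated lookups for keys that do not exist), breakdown (a hot key expires and concurrent requests all hit the database), avalanche (many keys expire at the same time)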

Remember

Query count × RTT = minimum latency

1 good query > 10 simple queries
On high-latency networks, this difference is even more pronounced

Chapter 32: Case Study: ML Inference Optimization

Part VIII: Case Studies


"Training is science. Inference is engineering." — An ML Engineer

The Story of "GPU Utilization at Only 0.13%"

We deployed an image classification service using ResNet-50. The hardware was an NVIDIA A100 GPU, with a theoretical peak of 312 TFLOPS (dense FP16 Tensor Core).

But measured: only 50 images per second.

ResNet-50 needs about 8 GFLOPs to process one image. Theoretically:

312 TFLOPS / 8 GFLOPs = 39,000 images/sec

We achieved only 50 images/sec.

GPU utilization: 0.13%

Where's the problem?

1. Image transfer from CPU to GPU: ~5ms
2. GPU computation: ~0.2ms
3. Result transfer from GPU to CPU: ~0.1ms
4. Python overhead: ~10ms
5. Image decoding (CPU): ~5ms
────────────────────────────────────
Total: ~20ms per image

GPU computation is only 1%. The other 99% is spent "feeding data."
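
The breakdown can be turned into a small script that ranks the stages and relates the measured rate to the theoretical peak. It uses only the numbers quoted above, and the stage labels are shorthand.

```python
# Per-image stage times from the breakdown above (milliseconds)
stages_ms = {
    "image decode (CPU)": 5.0,
    "Python overhead": 10.0,
    "H2D copy": 5.0,
    "GPU compute": 0.2,
    "D2H copy": 0.1,
}

total_ms = sum(stages_ms.values())
throughput = 1000.0 / total_ms                  # images/s when stages run serially
gpu_share = stages_ms["GPU compute"] / total_ms

peak_images_per_s = 312e12 / 8e9                # peak FLOPS / FLOPs per image
fraction_of_peak = throughput / peak_images_per_s

for name, ms in sorted(stages_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {ms:5.1f} ms  {ms / total_ms:6.1%}")
print(f"throughput {throughput:.0f} img/s, GPU busy {gpu_share:.1%}, "
      f"{fraction_of_peak:.2%} of theoretical peak")
```

Sorting by time makes the priority obvious: Python overhead, decoding, and the host-to-device copy account for nearly all of the 20 ms, so optimizing the GPU kernel first would change almost nothing.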

This is the core challenge of ML inference optimization: keeping the GPU busy.

ML Inference Characteristics

Training vs Inference

Characteristic          Training            Inference
──────────────────────────────────────────────────────────────
Batch size              Large (32-4096)     Small (1-64)
Latency requirement     Not important       Critical
Precision requirement   FP32/BF16           Can be lower (INT8)
Frequency               Once                Continuous
Optimization goal       Throughput          Latency + Throughput

Common Bottlenecks

┌─────────────────────────────────────────────────────────────────┐
│                    Inference Pipeline                           │
├─────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│  Input  │ Preproc  │ H2D Copy │  Compute │ D2H Copy │ Postproc │
│  (I/O)  │  (CPU)   │  (PCIe)  │  (GPU)   │  (PCIe)  │  (CPU)   │
└─────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

Common bottlenecks:
1. I/O: Reading data
2. CPU preprocessing: Decode, resize, normalize
3. PCIe transfer: CPU ↔ GPU
4. GPU compute: Model inference
5. CPU postprocessing: NMS, decoding

Measurement Tools

NVIDIA Nsight Systems

# Record profile
nsys profile -o report python inference.py

# View report
nsys-ui report.nsys-rep

Nsight Systems shows:

  • GPU kernel execution time
  • CPU/GPU synchronization points
  • Memory copy (H2D/D2H)
  • CUDA API calls

PyTorch Profiler

import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(100):
        with record_function("inference"):
            output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Optimization Strategies

Optimization 1: Batching

Single inference cannot fully utilize GPU's parallel capability:

# Original: process one by one
for image in images:
    result = model(image.unsqueeze(0))  # batch_size = 1

# Optimized: batch processing
batch = torch.stack(images)  # batch_size = 32
results = model(batch)

Batching effect:

Batch Size    Latency (ms)    Throughput (img/s)    GPU Util
─────────────────────────────────────────────────────────────
    1              5.2              192               15%
    8              8.1              988               45%
   32             18.5             1730               78%
  128             62.3             2055               92%
  256            118.7             2157               95%

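As batch size increases, throughput rises but so does latency, and the returns diminish: going from batch 32 to 256 raises throughput by only about 25% (1730 → 2157 img/s) while latency grows 6.4× (18.5 → 118.7 ms).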

Trade-off: Latency vs Throughput
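
One way to resolve the trade-off is to fix a latency budget and pick the batch size with the highest throughput that still fits. The sketch below does this with the measurements from the table. The `best_batch` helper and the budgets are illustrative.

```python
# (batch size, latency in ms) from the batching table above
measurements = [(1, 5.2), (8, 8.1), (32, 18.5), (128, 62.3), (256, 118.7)]


def best_batch(latency_budget_ms):
    """Return the (batch, latency) pair with the highest throughput within budget."""
    feasible = [(b, lat) for b, lat in measurements if lat <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m[0] / m[1])  # images per millisecond


for budget in (10, 20, 100):
    batch, latency = best_batch(budget)
    print(f"budget {budget:3d} ms -> batch {batch:3d}, "
          f"{batch / latency * 1000:.0f} img/s at {latency} ms")
```

These latencies cover only the model call. A serving system also waits for the batch to fill, so the usable budget for the batch itself is smaller than the end-to-end target.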

Optimization 2: Reduce CPU-GPU Data Transfer

Use Pinned Memory

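# Normal (pageable) memory → GPU: requires an extra copy through a staging buffer
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = tensor.to('cuda')  # slow
# Pinned (page-locked) memory → GPU: direct DMA, and the copy can be asynchronous
tensor_pinned = torch.randn(batch_size, 3, 224, 224).pin_memory()
tensor_gpu = tensor_pinned.to('cuda', non_blocking=True)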


Optimization 4: Quantization

FP32 → FP16

# Full FP16 cast (automatic mixed precision would use torch.autocast instead)
model = model.half()  # Convert to FP16
input_tensor = input_tensor.half()
output = model(input_tensor)

FP32 → INT8 (requires calibration)

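import torch.quantization as quant

# Eager-mode post-training quantization. The 'fbgemm' backend targets x86 CPUs;
# INT8 inference on NVIDIA GPUs normally goes through TensorRT instead.
model.eval()

# Prepare for quantization (inserts observers)
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)

# Calibrate (with representative data)
with torch.no_grad():
    for data in calibration_loader:
        model(data)

# Convert
quant.convert(model, inplace=True)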

Quantization effect:

Precision    Model Size    Latency    Accuracy Drop
───────────────────────────────────────────────────
FP32         98 MB         5.2 ms     baseline
FP16         49 MB         2.8 ms     ~0%
INT8         25 MB         1.5 ms     0.5-1%

Optimization 5: Preprocessing Optimization

CPU preprocessing is often the bottleneck:

# Original: PIL + torchvision (slow)
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg")
tensor = transform(image)

Use NVIDIA DALI (GPU preprocessing)

from nvidia.dali import pipeline_def, fn
import nvidia.dali.types as types

@pipeline_def
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed")  # GPU decode
    images = fn.resize(images, resize_shorter=256)     # Match transforms.Resize(256)
    # Center crop, normalize, and convert HWC to CHW in one fused GPU operator
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        crop=(224, 224),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels

pipe = image_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()

LLM Inference Special Optimizations

Large language models have unique challenges:

Memory-bound Problem

LLaMA-7B parameters: 7B × 2 bytes (FP16) = 14 GB
A100 memory bandwidth: 2 TB/s
Each token generation requires reading all parameters

Theoretical max speed (batch size 1): 2000 GB/s ÷ 14 GB ≈ 143 tokens/sec
Plus KV cache read/write overhead
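
The bound generalizes to other model sizes and precisions. A minimal sketch, assuming every generated token requires streaming all weights from GPU memory once and ignoring KV cache traffic (the function name is illustrative):

```python
def max_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-sequence decode speed for a memory-bound model."""
    model_gb = params_billion * bytes_per_param   # weights read per token
    return bandwidth_gb_s / model_gb


for label, bytes_per_param in (("FP16", 2), ("INT8", 1), ("INT4", 0.5)):
    rate = max_tokens_per_s(7, bytes_per_param, 2000)
    print(f"7B {label}: at most {rate:.0f} tokens/sec per sequence")
```

This is why weight quantization speeds up LLM decoding even when compute is not the limit: halving the bytes per parameter roughly doubles the bandwidth-bound ceiling.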

KV Cache Optimization

# KV cache uses significant memory
# Sequence length 2048, batch 16, 32 layers, hidden size 4096
# KV cache size = 2 (K and V) × 16 × 32 × 2048 × 4096 × 2 bytes ≈ 17.2 GB (16 GiB)

# Use PagedAttention (vLLM)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
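
The KV cache formula is easy to evaluate directly, which helps when checking whether a given batch size and context length will fit in GPU memory. The function below is a sketch under the simplest assumption: full multi-head attention with one key and one value vector of size `hidden` per token per layer, stored in FP16. The function name is illustrative.

```python
def kv_cache_bytes(batch, layers, seq_len, hidden, bytes_per_elem=2):
    """Size of the key and value caches for all layers, in bytes."""
    return 2 * batch * layers * seq_len * hidden * bytes_per_elem  # 2 = K and V


size = kv_cache_bytes(batch=16, layers=32, seq_len=2048, hidden=4096)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.0f} GiB)")

per_token = kv_cache_bytes(batch=1, layers=32, seq_len=1, hidden=4096)
print(f"{per_token / 2**20:.1f} MiB per token")
```

For this configuration the cache is larger than the 14 GB of FP16 weights, which is why allocating it in pages on demand, as PagedAttention does, matters so much for batch capacity.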

Speculative Decoding

Use small model to "guess" multiple tokens, large model verifies:

Traditional: Large model generates 1 token → generates the next → ... (one large-model pass per token)

Speculative:
1. Small model quickly generates 4 tokens
2. Large model verifies all 4 in a single forward pass
3. If the first 3 are accepted, one large-model pass has produced 3 tokens, saving 2 large-model calls
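
The accept-or-correct loop can be shown with toy stand-ins for the two models. This is a sketch of greedy speculative decoding only, not any library's implementation. `target_next` and `draft_next` are hypothetical functions over integer tokens, with the draft model deliberately wrong at every fourth position.

```python
def target_next(prefix):
    """Hypothetical 'large' model: next token is the prefix sum modulo 10."""
    return sum(prefix) % 10


def draft_next(prefix):
    """Hypothetical 'small' model: agrees with the target except at some positions."""
    guess = sum(prefix) % 10
    return (guess + 1) % 10 if len(prefix) % 4 == 3 else guess


def greedy_generate(prompt, n_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt, n_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target model checks all k positions (one batched pass in a real system)
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the accepted prefix plus the target's own token
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)  # all k draft tokens accepted
    return tokens[: len(prompt) + n_tokens], target_passes


spec_tokens, target_passes = speculative_generate([1, 2], 12)
greedy_tokens = greedy_generate([1, 2], 12)
print(spec_tokens == greedy_tokens, target_passes, "target passes instead of 12")
```

The output is identical to plain greedy decoding. Only the number of large-model passes changes, and the saving depends entirely on how often the draft model agrees with the target.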

Summary

ML Inference Challenges

GPU computation is fast, but:
1. CPU-GPU data transfer is slow
2. CPU preprocessing is slow
3. Small batches can't fully utilize GPU

Optimization Strategies

Problem                     Solution
───────────────────────────────────────────────────────────────────
Low GPU utilization         Batching, dynamic batching
Slow data transfer          Pinned memory, CUDA streams
Model too large             Quantization (FP16/INT8), distillation
Preprocessing bottleneck    DALI (GPU preprocessing)
Framework overhead          TensorRT, ONNX Runtime

LLM Special Optimizations

  • KV cache management (PagedAttention)
  • Speculative decoding
  • Continuous batching

Tools

  • Profiling: Nsight Systems, PyTorch Profiler
  • Runtime: TensorRT, ONNX Runtime, vLLM
  • Serving: Triton Inference Server

Remember

Theoretical TFLOPS ≠ actual performance

Real bottlenecks are usually:
1. Data movement
2. Memory bandwidth
3. Software overhead

Optimization order:
Pipeline → Batching → Quantization → Model architecture

Chapter 33: How to Benchmark

Part IX: Synthesis


"The only thing worse than no data is bad data." — W. Edwards Deming (attributed)

The Perfect Checklist

The story happened on a Friday afternoon.

My colleague Emily had just joined the performance analysis team. She spent an entire week running benchmarks and prepared a professional-looking report: beautiful charts, detailed data, clear conclusions.

"This is the performance analysis of the new version," she said confidently. "It's 23% faster than the old version."

I looked at her report and asked one question: "How many times did you run it?"

"Once," she said. "The results were stable, and the charts look nice."

"Did you do warm-up?"

"What's warm-up?"

I sighed. This is a mistake every newcomer makes. Not because they're not smart, but because benchmarking looks too simple—run a program once, record the time, done.

But in reality, correct benchmarking is a science that requires rigorous methodology.

This chapter consolidates everything we've learned in the previous chapters into a complete "How to Benchmark" guide.

The Benchmarking Checklist

Years of experience tell me that good benchmarks need to answer these questions:

┌─────────────────────────────────────────────────────────────────┐
│                   Benchmarking Checklist                        │
├─────────────────────────────────────────────────────────────────┤
│ □ 1. What are you measuring? (clearly define the metric)        │
│ □ 2. Is the environment controlled? (fixed freq, no turbo)      │
│ □ 3. Did you warm up? (let cache, branch predictor stabilize)   │
│ □ 4. How many runs? (N ≥ 10, preferably ≥ 30)                   │
│ □ 5. Are statistics complete? (median, stddev, CI)              │
│ □ 6. Are results reproducible? (can someone else get same data) │
│ □ 7. Is comparison fair? (same env, same load, same method)     │
└─────────────────────────────────────────────────────────────────┘

Let's analyze each one.

Step 1: Clearly Define What You're Measuring

This sounds obvious, but it's the most commonly overlooked step.

Wrong example:

"I want to test how fast my program is."

This sentence is meaningless. What does "fast" mean?

Correct example:

"I want to measure the average latency (in nanoseconds)
 of a single hash table lookup with 10,000 key-value pairs."

A clear metric definition should include:

Element      Example
──────────────────────────────────────────────────
Operation    Hash table lookup
Scale        10,000 entries
Unit         Nanoseconds per lookup
Statistic    Median with 95% confidence interval

Step 2: Control the Test Environment

Environmental variation is the main source of unstable benchmark results.

Linux Environment Setup

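# 1. Fix CPU frequency
sudo cpupower frequency-set -g performance
# Disable turbo (Intel intel_pstate; with acpi-cpufreq, e.g. on AMD, write 0 to
# /sys/devices/system/cpu/cpufreq/boost instead)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo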

# 2. Isolate CPU cores (avoid scheduler interference)
# In grub config: isolcpus=2,3
taskset -c 2 ./benchmark  # Bind to isolated core

# 3. Disable ASLR (reduce variance from address randomization)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 4. Clear page cache (if testing I/O)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Environment Checklist

□ CPU frequency: Fixed (not powersave/ondemand)
□ Turbo boost: Disabled
□ Hyperthreading: Decide based on test purpose
□ Background processes: Minimized
□ NUMA: Confirm memory affinity
□ Temperature: Stable (avoid thermal throttling)

Step 3: Warm-up — The Most Forgotten Step

The first time any code executes, there's a lot of "cold start" overhead:

  • Instruction cache miss: Code not yet loaded into cache
  • Data cache miss: Data not yet loaded into cache
  • Branch predictor: Hasn't learned branch patterns yet
  • Page fault: Memory pages not yet mapped
  • JIT compilation: If VM language, first run needs compilation

If you measure first execution time, you're measuring "cold start performance," not "steady state performance."

Warm-up Strategy

#define WARMUP_ITERATIONS 1000
#define MEASURED_ITERATIONS 10000

void benchmark(void) {
    // Phase 1: Warm-up (discard these results)
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        operation_under_test();
    }

    // Phase 2: Measurement
    uint64_t times[MEASURED_ITERATIONS];
    for (int i = 0; i < MEASURED_ITERATIONS; i++) {
        uint64_t start = get_cycles();
        operation_under_test();
        uint64_t end = get_cycles();
        times[i] = end - start;
    }
    // times[] now holds one cycle count per measured iteration
}

## Step 4: Statistics — Running Once Is Not Enough

This is the mistake Emily made: running only once.

**Why isn't once enough?**

Even in a perfectly controlled environment, measurements still have variance:

- Minor OS scheduler interference
- Cache state differences
- Hardware timer precision limits
- Power management adjustments

### Minimum Sample Size

| Purpose | Recommended Sample Size (N) |
|---------|----------------------------|
| Quick check | N ≥ 10 |
| Formal report | N ≥ 30 |
| Publication | N ≥ 100 |

### Choose the Right Statistics

```text
❌ Only report mean
   "Average latency: 150 ns"

✅ Report complete statistics
   "Latency: median = 145 ns, mean = 152 ns
    stddev = 23 ns, 95% CI = [141, 163] ns
    min = 120 ns, max = 310 ns"
```

Why use median instead of mean?

Outliers can distort the mean dramatically. If 99 measurements are 100 ns but one is 10,000 ns (because of a context switch), the mean jumps to 199 ns, almost double the typical value, while the median stays at 100 ns. The median is robust to outliers of this kind.
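
The outlier example is easy to reproduce. The sketch below uses only the Python standard library; `summarize` is a hypothetical helper, and its confidence interval is a normal approximation around the mean, which is an assumption that only holds reasonably well for N ≥ 30.

```python
import math
import statistics

def summarize(samples):
    """Return the summary statistics for a list of measurements."""
    n = len(samples)
    mean = statistics.fmean(samples)
    stddev = statistics.stdev(samples)          # sample standard deviation
    half_width = 1.96 * stddev / math.sqrt(n)   # normal approximation for the mean
    return {
        "n": n,
        "median": statistics.median(samples),
        "mean": mean,
        "stddev": stddev,
        "ci95_mean": (mean - half_width, mean + half_width),
        "min": min(samples),
        "max": max(samples),
    }

# 99 measurements of 100 ns and one 10,000 ns outlier.
samples = [100] * 99 + [10_000]
summary = summarize(samples)
print(f"mean = {summary['mean']:.0f} ns, median = {summary['median']:.0f} ns")
# mean = 199 ns, median = 100 ns
```

A single slow sample out of a hundred nearly doubles the mean and leaves the median untouched, which is why both belong in the report.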

Step 5: What If Variance Is Too High?

If your coefficient of variation (CV = stddev / mean) exceeds 5%, results may be unreliable.
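
As a quick check, the CV can be computed directly from the samples. This is a minimal sketch; `coefficient_of_variation` and `is_stable` are hypothetical names, and the sample data is made up for illustration.

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples)

def is_stable(samples, max_cv=0.05):
    """True when the CV is at or below the 5% guideline."""
    return coefficient_of_variation(samples) <= max_cv

quiet = [100, 101, 99, 100, 102, 98]    # illustrative data, CV about 1.4%
noisy = [100, 140, 90, 100, 160, 80]    # illustrative data, CV well above 5%
print(is_stable(quiet), is_stable(noisy))  # True False
```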

Diagnostic Steps

1. Check environment
   - Is CPU frequency changing?
   - Are background processes running?
   - Is thermal throttling occurring?

2. Check program
   - Is there dynamic memory allocation? (malloc/free has high variance)
   - Are there I/O operations?
   - Are there system calls?

3. Check measurement method
   - Is timer resolution sufficient?
   - Is there timer wrap-around?

Techniques to Reduce Variance

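#define _GNU_SOURCE  // Needed for CPU_SET and pthread_setaffinity_np (must precede includes)
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

// 1. Use inline assembly barrier to prevent compiler reordering
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

// 2. Use the CPU timestamp counter instead of wall clock (x86-64 only).
//    rdtscp waits for earlier instructions to complete before reading the counter.
//    On modern CPUs the counter ticks at a constant rate, so fix the core
//    frequency if you want to interpret ticks as core cycles.
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
    return ((uint64_t)hi << 32) | lo;
}

// 3. Pin the calling thread to a specific CPU, e.g. pin_to_cpu(2); returns 0 on success
static int pin_to_cpu(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}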

Step 6: Fair Comparison

When comparing two systems or algorithms, you must ensure "all else being equal."

Common Unfair Comparisons

| Trap | Problem |
|------|---------|
| Different compilers | GCC vs Clang optimize differently |
| Different optimization levels | -O0 vs -O3 huge difference |
| Different data sizes | Small data fits in cache, large doesn't |
| Different hardware | Comparing different CPUs needs normalization |
| Different warm-up | One has warm-up, one doesn't |

Correct Approach

# Ensure same compilation environment
gcc --version  # Record version
CFLAGS="-O3 -march=native"  # Same optimization options

# Ensure same execution environment
uname -r       # Record kernel version
cat /proc/cpuinfo | grep "model name"  # Record CPU

# Ensure same test data
sha256sum test_data.bin  # Verify data integrity
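
The same information can be captured in machine-readable form and stored next to the results, so that two runs can be compared field by field. This is a sketch using the Python standard library; the function names and the dictionary layout are assumptions for illustration, not a standard format.

```python
import hashlib
import platform
import subprocess

def sha256_of_file(path):
    """Hash the test data so both runs provably use identical input."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compiler_version(compiler="gcc"):
    """First line of `<compiler> --version`, or None if it is not installed."""
    try:
        out = subprocess.run([compiler, "--version"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.splitlines()[0] if out else None

def environment_fingerprint(data_path, cflags="-O3 -march=native"):
    """Collect the facts that must match between two compared runs."""
    return {
        "kernel": platform.release(),      # same value as `uname -r`
        "machine": platform.machine(),
        "compiler": compiler_version(),
        "cflags": cflags,
        "data_sha256": sha256_of_file(data_path),
    }
```

If two fingerprints differ in any field, the comparison is not "all else being equal" and the difference should be explained in the report.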

Step 7: Document Everything

Your report should allow another person to reproduce your results.

Report Template

## Test Environment

### Hardware
- CPU: Intel Core i7-12700K @ 3.6 GHz (fixed, turbo disabled)
- Memory: 32 GB DDR5-4800
- Storage: Samsung 980 Pro NVMe

### Software
- OS: Ubuntu 22.04 LTS (kernel 6.5.0)
- Compiler: GCC 12.3.0
- Flags: -O3 -march=native -flto

### Environment Settings
- CPU governor: performance
- Turbo boost: disabled
- Hyperthreading: disabled
- ASLR: disabled
- Isolated cores: 2-3

## Methodology
- Warm-up: 1,000 iterations
- Measured: 10,000 iterations
- Repetitions: 30 independent runs
- Statistics: median with 95% CI

## Results

| Metric | Value | 95% CI |
|--------|-------|--------|
| Latency | 145 ns | ±8 ns |
| Throughput | 6.9 M ops/s | ±0.3 M |

## Reproduction Steps
git clone <repo> && cd benchmark
./setup_env.sh    # Setup environment
./run_benchmark.sh  # Run tests

The Anti-Patterns — Things to Never Do

1. Cherry-picking

❌ Ran 10 times, only report the best one
✅ Report statistical summary of all results

2. Hiding Variance

❌ "5% faster" (when variance is actually 20%)
✅ "5% ± 3% faster, statistically significant"

3. Unfair Baseline

❌ Compare your optimized program vs competitor's default config
✅ Both use default config, or both are optimized

4. Ignoring Cold Start

❌ Measurement includes first execution (cache miss, page fault)
✅ Clearly distinguish cold start and steady state performance

Emily's Story Ending

I helped Emily redesign her benchmark:

  1. Added warm-up: 1000 iterations
  2. Increased sample size: From 1 to 30 runs
  3. Fixed environment: Fixed CPU frequency, disabled turbo
  4. Calculated statistics: median, stddev, CI

New results:

Old conclusion: "New version is 23% faster"

New conclusion: "New version median latency reduced by 18%
                (95% CI: 15% - 21%)
                All results consistent across N=30 measurements
                Performance improvement statistically significant (p < 0.001)"

The number got smaller, but more credible.
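
One way to obtain an interval like the one in the new conclusion is a percentile bootstrap over the two sets of runs. The text does not say which method was used for Emily's numbers, so treat this as one possible approach; `bootstrap_improvement_ci` is a hypothetical helper and the data below is synthetic.

```python
import random
import statistics

def bootstrap_improvement_ci(old, new, resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap CI for the relative reduction in median latency.

    Returns (point_estimate, lower, upper) as fractions, e.g. 0.18 for 18%.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    def reduction(a, b):
        return 1.0 - statistics.median(b) / statistics.median(a)

    # Resample each group with replacement and recompute the reduction each time.
    estimates = sorted(
        reduction(rng.choices(old, k=len(old)), rng.choices(new, k=len(new)))
        for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * resamples)]
    upper = estimates[int((1.0 - tail) * resamples) - 1]
    return reduction(old, new), lower, upper

# Synthetic latencies (ns) for 30 runs of each version, for illustration only.
gen = random.Random(1)
old_runs = [gen.gauss(100, 5) for _ in range(30)]
new_runs = [gen.gauss(82, 5) for _ in range(30)]
point, low, high = bootstrap_improvement_ci(old_runs, new_runs)
print(f"median latency reduced by {point:.0%} (95% CI: {low:.0%} to {high:.0%})")
```

The bootstrap makes no assumption about the shape of the latency distribution, which suits benchmark data with long tails.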

Summary

The Benchmarking Checklist:

  1. Define clearly: What metric, what scale, what unit
  2. Control environment: Fixed frequency, isolated cores, minimal noise
  3. Warm up: Don't measure cold start unless that's what you want
  4. Run enough times: N ≥ 30 for serious work
  5. Report statistics: Median, stddev, confidence interval
  6. Compare fairly: Same environment, same workload, same methodology
  7. Document everything: Make it reproducible

The Golden Rule:

If someone cannot reproduce your benchmark results, your benchmark is worthless.

Next chapter, we'll discuss how to systematically perform performance optimization once you have correct data.

Chapter 34: How to Optimize

Part IX: Synthesis


"Premature optimization is the root of all evil." — Donald Knuth

"But so is no optimization at all." — Anonymous engineer

The 3 AM Flame Graph

It was the night before a deadline.

Our API server suddenly slowed down—P99 latency spiked from 50ms to 500ms. The service was about to violate SLA.

The team lead looked at me and said: "You're the performance expert. Figure it out."

I opened the Flame Graph and stared at that "volcano" for five minutes.

One function took 47% of CPU time: json_parse().

"Found it," I said. "JSON parsing is the bottleneck."

"Great!" The lead relaxed. "Then optimize it."

I shook my head: "No. We shouldn't optimize it."

The First Rule of Optimization: Ask "Why" First

Before optimizing anything, always ask yourself three questions:

┌─────────────────────────────────────────────────────────────────┐
│                   Before Optimizing, Ask:                       │
├─────────────────────────────────────────────────────────────────┤
│ 1. Does this operation really need to exist?                    │
│    (The fastest code is the code that never runs)               │
│                                                                 │
│ 2. Can this operation be done fewer times?                      │
│    (Caching, batching, lazy evaluation)                         │
│                                                                 │
│ 3. Can this operation be done in a simpler way?                 │
│    (Simpler algorithm, different data structure)                │
└─────────────────────────────────────────────────────────────────┘

Back to the JSON parsing case:

Why was json_parse() so slow? Because we were parsing the same config file for every request.

The solution wasn't to optimize parsing—the solution was to cache the config, parse it only once.

# ❌ Before: parse for every request
def handle_request():
    config = json_parse(open("config.json").read())
    # ...

# ✅ After: parse once at startup
CONFIG = json_parse(open("config.json").read())

def handle_request():
    config = CONFIG
    # ...

This change reduced P99 latency from about 520 ms to 48 ms. We didn't make any code faster; we just stopped doing unnecessary work.

The Optimization Workflow

After years of experience, I've summarized a systematic optimization process:

┌───────────────────────────────────────────────────────────────┐
│                   Optimization Workflow                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   1. Measure                                                  │
│      └── Establish baseline, quantify "how slow is it now"   │
│                                                               │
│   2. Profile                                                  │
│      └── Find bottleneck, understand "why is it slow"        │
│                                                               │
│   3. Analyze                                                  │
│      └── Understand root cause, ask "can we avoid it?"       │
│                                                               │
│   4. Hypothesize                                              │
│      └── Make prediction, estimate "how much faster"         │
│                                                               │
│   5. Implement                                                │
│      └── Implement the optimization                          │
│                                                               │
│   6. Measure Again                                            │
│      └── Verify effect, confirm "did it actually get faster" │
│                                                               │
│   7. Document                                                 │
│      └── Record changes, explain "why we did this"           │
│                                                               │
└───────────────────────────────────────────────────────────────┘

This isn't linear—you usually cycle through multiple times.

Step 1: Measure — Establish Baseline

Before optimizing, you must know "how slow is it now." Without a baseline, you can't judge if optimization worked.

Key Points for Recording Baseline

Baseline Report (Example)
=========================
Date: 2025-12-18
Commit: abc123

Metric              | Value      | Notes
--------------------|------------|------------------
P50 latency         | 45 ms      |
P99 latency         | 520 ms     | ← This is the problem
Throughput          | 1,200 req/s|
CPU usage           | 78%        |
Memory usage        | 2.1 GB     |

Important: Record the git commit hash. Later you'll need to compare "before vs after optimization," and you must ensure you're comparing specific versions.
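
A baseline like this can be generated rather than typed by hand. The sketch below assumes raw per-request latencies are available as a list; `build_baseline` and `current_commit` are hypothetical helpers, and the field names are illustrative.

```python
import json
import statistics
import subprocess

def current_commit():
    """Short hash of HEAD; requires running inside a git repository."""
    return subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                          capture_output=True, text=True, check=True).stdout.strip()

def build_baseline(latencies_ms, commit, date):
    """Summarize raw request latencies into a baseline record."""
    # quantiles(n=100) returns the 99 cut points P1..P99
    percentiles = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "date": date,
        "commit": commit,
        "p50_latency_ms": percentiles[49],
        "p99_latency_ms": percentiles[98],
        "samples": len(latencies_ms),
    }

# Synthetic latencies for illustration: 1 ms, 2 ms, ..., 101 ms.
record = build_baseline(list(range(1, 102)), commit="abc123", date="2025-12-18")
print(json.dumps(record, indent=2))
```

In a real run, `commit=current_commit()` ties the numbers to the exact code that produced them.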

Step 2: Profile — Find the Bottleneck

Profiling is the "microscope" for finding bottlenecks. Different types of bottlenecks need different tools.

Bottleneck Classification and Tool Selection

| Bottleneck Type | Symptoms | Recommended Tools |
|-----------------|----------|-------------------|
| CPU-bound | High CPU usage, high latency | perf, Flame Graph |
| Memory-bound | High cache miss, low IPC | perf stat, Cachegrind |
| I/O-bound | High CPU idle, high I/O wait | iostat, strace |
| Lock contention | Low CPU usage but high latency | perf lock, Off-CPU Flame Graph |

Flame Graph Reading Tips

Step 3: Analyze — Understand Root Cause

After finding the hotspot, the next step is understanding "why is it slow."

Common Bottleneck Patterns

| Pattern | Signs | Root Cause |
|---------|-------|------------|
| Repeated computation | Same function called too many times | Missing caching |
| Inefficient algorithm | O(n²) slow on large data | Need better algorithm |
| Cache miss | High L3 miss rate | Data structure not cache-friendly |
| Branch misprediction | High branch-miss rate | Data-dependent branches |
| Lock contention | Multiple threads waiting for same lock | Lock granularity too coarse |

Diagnostic perf Commands

# View overall performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./program

# Example output interpretation
#   3,000,000,000 cycles
#   1,500,000,000 instructions  # IPC = 0.5 (low, possibly memory-bound)
#      50,000,000 cache-misses  # ~33 per 1,000 instructions (might be a problem)
#       5,000,000 branch-misses # ~3.3 per 1,000 instructions (normal)
# Note: a true miss rate divides by cache-references or branches, not by instructions.

IPC (Instructions Per Cycle) Meaning:

| IPC | Possible Cause |
|-----|----------------|
| < 0.5 | Severely memory-bound, CPU waiting for data |
| 0.5-1.0 | Possibly cache miss or branch miss |
| 1.0-2.0 | Reasonably normal |
| > 2.0 | Good, fully utilizing hardware |
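
These bands are easy to apply to raw counter values. The sketch below encodes the table as a rough heuristic; the function names are hypothetical, and the thresholds depend on the microarchitecture, so treat the labels as hints rather than diagnoses.

```python
def ipc(cycles, instructions):
    """Instructions retired per cycle."""
    return instructions / cycles

def per_kilo_instruction(events, instructions):
    """Events per 1,000 instructions (e.g. MPKI for cache misses)."""
    return events / instructions * 1000

def classify_ipc(value):
    """Map an IPC value onto the rough bands from the table."""
    if value < 0.5:
        return "severely memory-bound"
    if value < 1.0:
        return "possible cache or branch misses"
    if value <= 2.0:
        return "reasonably normal"
    return "good hardware utilization"

# Counter values from the perf stat example.
cycles, instructions = 3_000_000_000, 1_500_000_000
value = ipc(cycles, instructions)
print(f"IPC = {value:.2f} ({classify_ipc(value)})")
# IPC = 0.50 (possible cache or branch misses)
print(f"cache misses per 1k instructions = {per_kilo_instruction(50_000_000, instructions):.1f}")
# cache misses per 1k instructions = 33.3
```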

Step 4: Hypothesize — Predict the Effect

Before implementing optimization, predict "how much will this change improve performance."

This is important because:

  1. Avoid wasting time: If the predicted improvement is only 1%, the change may not be worth making
  2. Verify understanding: If the actual effect differs greatly from the prediction, the analysis was wrong
  3. Apply Amdahl's Law: Overall speedup is limited by the fraction of time spent in the optimized part

Amdahl's Law

                    1
Speedup = ────────────────────────
          (1 - p) + p/s

p = proportion of time spent in optimized part
s = speedup factor for that part

Example:

If json_parse() takes 40% of time (p = 0.4), and we make it 10x faster (s = 10):

Speedup = 1 / (0.6 + 0.4/10)
        = 1 / (0.6 + 0.04)
        = 1 / 0.64
        = 1.56x

Even if we make JSON parsing 10x faster, the program as a whole is only 56% faster. This is the harsh reality of Amdahl's Law.
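
The formula is short enough to keep as a helper for estimating a change before implementing it. A minimal sketch, with `amdahl_speedup` as a hypothetical function name:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is accelerated by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)

print(f"{amdahl_speedup(0.4, 10):.2f}x")    # 1.56x, the worked example
print(f"{amdahl_speedup(0.4, 1e9):.2f}x")   # 1.67x, the ceiling 1 / (1 - p)
```

The second line shows the limit: even an infinitely fast parser cannot push the overall speedup past 1 / (1 - p), which is about 1.67x when p = 0.4.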

Step 5: Implement — Optimization Strategies

Hierarchical Optimization Strategy

Optimization should proceed from "high level" to "low level." High-level optimizations usually have the most significant effects:

╔═══════════════════════════════════════════════════════════════╗
║                 Optimization Hierarchy                         ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Level 1: Architecture / Design                        10-100x║
║  ├─ Better algorithm (O(n²) → O(n log n))                     ║
║  ├─ Eliminate unnecessary operations (caching, lazy eval)     ║
║  └─ More suitable data structure                              ║
║                                                               ║
║  Level 2: Algorithm / Data Structure                    2-10x ║
║  ├─ Cache-friendly data layout                                ║
║  ├─ Reduce memory allocation                                  ║
║  └─ Batch processing                                          ║
║                                                               ║
║  Level 3: Implementation                                1-3x  ║
║  ├─ Avoid unnecessary copies                                  ║
║  ├─ Use faster library                                        ║
║  └─ Reduce branches                                           ║
║                                                               ║
║  Level 4: Low-level                                     1-2x  ║
║  ├─ SIMD vectorization                                        ║
║  ├─ Reduce cache miss                                         ║
║  └─ Compiler optimization options                             ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝

Common Optimization Techniques

Caching / Memoization

// ❌ Before: compute every time
int fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);  // O(2^n)
}

// ✅ After: remember computed results
// Valid for 0 <= n <= 46; fib(47) overflows a 32-bit int
int cache[47] = {0};
int fib(int n) {
    if (n <= 1) return n;
    if (cache[n] != 0) return cache[n];  // 0 means "not computed yet"
    cache[n] = fib(n-1) + fib(n-2);
    return cache[n];  // O(n)
}

Batching

// ❌ Before: write one at a time
for (int i = 0; i < 1000; i++) {
    write_to_disk(data[i]);  // 1000 syscalls
}

// ✅ After: batch write
buffer_add(data, 1000);
write_to_disk(buffer);  // 1 syscall
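
The effect of batching can be demonstrated without touching a real disk by counting calls to a stand-in sink. This is an illustrative sketch in Python; `CountingSink`, `write_unbatched`, and `write_batched` are hypothetical names, and a real implementation would write to a file descriptor or socket.

```python
class CountingSink:
    """Hypothetical stand-in for a disk or socket that counts write calls."""
    def __init__(self):
        self.calls = 0
        self.bytes_written = 0

    def write(self, payload):
        self.calls += 1
        self.bytes_written += len(payload)

def write_unbatched(sink, records):
    for record in records:
        sink.write(record)  # one call per record

def write_batched(sink, records, batch_size=1000):
    for start in range(0, len(records), batch_size):
        # Join one batch in memory, then hand it over in a single call.
        sink.write(b"".join(records[start:start + batch_size]))

records = [f"record {i}\n".encode() for i in range(1000)]
slow, fast = CountingSink(), CountingSink()
write_unbatched(slow, records)
write_batched(fast, records)
print(slow.calls, fast.calls)  # 1000 1
```

Both sinks receive the same bytes. Only the number of calls changes, and with a real device each call carries a fixed cost.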

Step 6: Measure Again — Verify the Effect

This is the most critical step. Don't assume "it should be faster"—prove it with data.

Comparison Report Example

Optimization: Cache config.json parsing
Commit: def456 (after) vs abc123 (before)

Metric              | Before     | After      | Change
--------------------|------------|------------|--------
P50 latency         | 45 ms      | 42 ms      | -6.7%
P99 latency         | 520 ms     | 48 ms      | -90.8% ✓
Throughput          | 1,200 req/s| 1,850 req/s| +54.2% ✓
CPU usage           | 78%        | 45%        | -42.3% ✓

If the effect is not as expected, go back to Step 3 and re-analyze. Your hypothesis might be wrong.
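
The Change column is worth computing mechanically so that the sign convention stays consistent across reports. A small sketch, using the numbers from the comparison report; `percent_change` is a hypothetical helper.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

before = {"P50 latency (ms)": 45, "P99 latency (ms)": 520,
          "Throughput (req/s)": 1200, "CPU usage (%)": 78}
after = {"P50 latency (ms)": 42, "P99 latency (ms)": 48,
         "Throughput (req/s)": 1850, "CPU usage (%)": 45}

for metric, old in before.items():
    new = after[metric]
    print(f"{metric:<20} {old:>6} {new:>6} {percent_change(old, new):+6.1f}%")
```

Negative values are improvements for latency and CPU usage, while a positive value is an improvement for throughput, so each metric needs its own direction when deciding what counts as a regression.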

Step 7: Document — Leave a Record

Optimization knowledge must be passed on. The next person (including yourself six months later) needs to understand this change.

The Anti-Patterns — Don't Do This

1. Optimizing Without Measuring

❌ "I feel this function is slow, let me optimize it first"
✅ "Profiler shows this function takes 30% CPU, worth optimizing"

2. Optimizing the Wrong Place

❌ Spend three days optimizing a function that only takes 2% of time
✅ Focus on the hotspot that takes 80% of time

3. Over-optimizing

❌ Write unmaintainable code to save 1% time
✅ Balance between "performance" and "maintainability"

4. Forgetting to Verify

❌ "I added cache, it should be faster" (no measurement)
✅ "After adding cache, P99 dropped from 520ms to 48ms" (with data)

Summary

The Optimization Workflow:

  1. Measure: Establish baseline with concrete numbers
  2. Profile: Find the bottleneck with proper tools
  3. Analyze: Understand the root cause (ask "why?")
  4. Hypothesize: Predict the improvement using Amdahl's Law
  5. Implement: Start from architecture, then algorithm, then low-level
  6. Measure Again: Verify with data, not assumptions
  7. Document: Leave knowledge for future engineers

The Golden Rules:

"Don't guess. Measure."

"The fastest code is the code that never runs."

"Optimize the bottleneck, not the convenient thing."

Next chapter, we'll discuss how to automate these performance tests and integrate them into CI/CD pipelines.

Chapter 35: CI/CD for Performance

Part IX: Synthesis


"What gets measured gets managed." — Peter Drucker

"What gets automated gets repeated." — DevOps wisdom

The "Nobody Noticed" Performance Regression

Six months ago, our API latency was 50ms.

Today, it's 150ms.

Nobody noticed. No alerts. No tickets.

How did this happen?

I ran git log and found 847 commits in those six months. Each commit degraded performance by an average of 0.12 ms—a difference imperceptible to humans.

But accumulated: 847 × 0.12 ms ≈ 100 ms.

This is the horror of "Gradual Performance Regression." Like boiling a frog slowly, it gets a little slower each day, until one day customers start complaining.

The solution? Integrate performance testing into the CI/CD pipeline, so every commit goes through performance checks.
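
A small simulation shows why the choice of reference matters for such a check. It assumes, as a simplification, that every commit adds exactly 0.12 ms and that the gate fails on a regression of more than 5%; all names here are hypothetical.

```python
BASELINE_MS = 50.0
DRIFT_PER_COMMIT_MS = 0.12
COMMITS = 847
THRESHOLD = 0.05  # fail when latency regresses by more than 5%

def latency_after(commits):
    """Simulated latency after a number of commits, assuming uniform drift."""
    return BASELINE_MS + DRIFT_PER_COMMIT_MS * commits

# Check 1: compare each commit with its immediate predecessor.
caught_by_previous = [
    n for n in range(1, COMMITS + 1)
    if latency_after(n) / latency_after(n - 1) - 1 > THRESHOLD
]

# Check 2: compare each commit with a pinned baseline.
caught_by_baseline = [
    n for n in range(1, COMMITS + 1)
    if latency_after(n) / BASELINE_MS - 1 > THRESHOLD
]

print(len(caught_by_previous))              # 0: no single commit exceeds 5%
print(caught_by_baseline[0])                # 21: first commit flagged against the baseline
print(f"{latency_after(COMMITS):.2f} ms")   # 151.64 ms
```

Comparing only against the previous commit never fires, because each step is about 0.24%. Comparing against a pinned baseline flags the drift at commit 21, long before customers notice.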

Why Performance CI/CD Is Needed

┌─────────────────────────────────────────────────────────────────┐
│                     Why Automate Performance Testing?           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Catch regressions early                                     │
│     └── Find problems before PR merge, not in production        │
│                                                                 │
│  2. Track trends over time                                      │
│     └── Historical data reveals gradual degradation             │
│                                                                 │
│  3. Reproducible measurements                                   │
│     └── Fixed environment eliminates "fast on my machine"       │
│                                                                 │
│  4. Shift left                                                  │
│     └── Find early, fix early, lower cost                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Performance CI Pipeline

A complete performance CI pipeline includes these stages:

┌──────────────────────────────────────────────────────────────────┐
│                    Performance CI Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│   │ Trigger │───▶│ Setup   │───▶│ Run     │───▶│ Compare │      │
│   │ (PR)    │    │ Env     │    │ Bench   │    │ Results │      │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                        │              │          │
│                                        ▼              ▼          │
│                                  ┌─────────┐    ┌─────────┐      │
│                                  │ Store   │    │ Report  │      │
│                                  │ Data    │    │ (PR)    │      │
│                                  └─────────┘    └─────────┘      │
│                                        │                         │
│                                        ▼                         │
│                                  ┌─────────┐                     │
│                                  │ Alert   │                     │
│                                  │ (Slack) │                     │
│                                  └─────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Step 1: Dedicated Test Environment

This is the most important step.

Why Not Use GitHub-hosted Runners?

┌─────────────────────────────────────────────────────────────────┐
│                    Shared vs Dedicated Infrastructure           │
├────────────────────────────────────┬────────────────────────────┤
│         Shared Cloud Runner        │      Dedicated Machine     │
├────────────────────────────────────┼────────────────────────────┤
│ ❌ CPU "steal time" uncontrollable │ ✅ Full control of hardware│
│ ❌ Interference from other tenants │ ✅ No external interference│
│ ❌ VM config may differ each time  │ ✅ Completely consistent   │
│ ❌ Cannot fix CPU frequency        │ ✅ Can lock turbo, governor│
│ ❌ Variance up to 20-50%           │ ✅ Variance controlled 1-3%│
└────────────────────────────────────┴────────────────────────────┘

Environment Setup Script

#!/bin/bash
# setup_perf_env.sh - Setup performance test environment

# 1. Fix CPU frequency
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 2. Disable ASLR
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 3. Clear page cache
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 4. Set CPU affinity (isolate cores 2-3 for testing)
echo "Benchmark will run on isolated CPUs 2-3"

# 5. Verify settings
echo "Environment configured:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /proc/sys/kernel/randomize_va_space

Step 2: Benchmark Suite Design

Not every benchmark belongs in every CI run. Balance run time against coverage:

| Type | Time | Purpose |
|------|------|---------|
| Smoke tests | < 1 min | Every commit, quick feedback |
| Core benchmarks | 5-15 min | Every PR, critical paths |
| Full suite | 30-60 min | Nightly, complete coverage |
| Soak tests | Hours | Weekly, find memory leaks |

Benchmark Code Example (Go)

// benchmark_test.go
func BenchmarkHashLookup(b *testing.B) {
    table := buildHashTable(10000)
    keys := generateRandomKeys(1000)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for _, key := range keys {

### Setting Thresholds

```yaml
# .github/perf-thresholds.yml
thresholds:
  # Change relative to baseline
  regression_threshold: 5%    # Fail if regression exceeds 5%
  improvement_threshold: 10%  # Manual review if improvement exceeds 10%

  # Absolute limits
  max_latency_p99: 100ms
  min_throughput: 1000 req/s

  # Statistical significance
  min_samples: 30
  confidence_level: 0.95
```
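
How a comparison script might apply these thresholds is sketched below. The dictionary mirrors the YAML values with units folded into the key names; the `evaluate` function, its arguments, and the three verdicts are assumptions for illustration, not the behavior of any particular tool.

```python
THRESHOLDS = {
    "regression_threshold_pct": 5.0,
    "improvement_threshold_pct": 10.0,
    "max_latency_p99_ms": 100.0,
    "min_throughput_rps": 1000.0,
    "min_samples": 30,
}

def evaluate(baseline_p99_ms, current_p99_ms, throughput_rps, samples, t=THRESHOLDS):
    """Return ('fail' | 'review' | 'pass', reasons) for one benchmark result."""
    change_pct = (current_p99_ms - baseline_p99_ms) / baseline_p99_ms * 100.0
    failures, reviews = [], []
    if samples < t["min_samples"]:
        failures.append(f"only {samples} samples")
    if change_pct > t["regression_threshold_pct"]:
        failures.append(f"p99 regressed {change_pct:.1f}%")
    if current_p99_ms > t["max_latency_p99_ms"]:
        failures.append("p99 above absolute limit")
    if throughput_rps < t["min_throughput_rps"]:
        failures.append("throughput below absolute limit")
    if change_pct < -t["improvement_threshold_pct"]:
        # Large improvements are suspicious too: the benchmark may have stopped doing work.
        reviews.append(f"p99 improved {-change_pct:.1f}%, verify manually")
    if failures:
        return "fail", failures
    return ("review", reviews) if reviews else ("pass", [])

print(evaluate(50.0, 56.0, 1200.0, 30))  # ('fail', ['p99 regressed 12.0%'])
```

Flagging large improvements for manual review guards against a benchmark that became faster only because it silently stopped exercising the code.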

Statistical Significance Testing

Don't just compare means! Use statistical tests to determine if differences are significant:

import numpy as np
from scipy import stats

def is_significant_regression(baseline, current, threshold=0.05):
    """
    Use Mann-Whitney U test to determine if there's significant regression
    """
    # Mann-Whitney U test (non-parametric)
    statistic, p_value = stats.mannwhitneyu(
        baseline, current,
        alternative='less'  # Test if current > baseline (regression)
    )

    if p_value < threshold:
        # Calculate effect size
        median_diff = np.median(current) - np.median(baseline)
        pct_diff = median_diff / np.median(baseline) * 100
        return True, pct_diff, p_value

    return False, 0, p_value

Step 4: GitHub Actions Integration

Complete Workflow Example

# .github/workflows/performance.yml
name: Performance Tests

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Nightly at 2 AM

jobs:
  benchmark:
    runs-on: [self-hosted, perf-runner]  # Dedicated runner

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for comparison

      - name: Setup environment
        run: |
          sudo ./scripts/setup_perf_env.sh

      - name: Build
        run: |
          make build-release

      - name: Run benchmarks
        run: |
          ./scripts/run_benchmarks.sh --output results.json

      - name: Compare with baseline
        id: compare
        run: |
          python scripts/compare_results.py \
            --current results.json \
            --baseline benchmarks/baseline.json \
            --threshold 5 \
            --output comparison.md

      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('comparison.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

      - name: Fail on regression
        if: steps.compare.outputs.regression == 'true'
        run: exit 1

Step 5: Long-term Tracking and Visualization

Trend Chart

HashLookup Latency (ns) - Last 30 Days

160 ┤
    │
150 ┤     ╭─╮
    │    ╭╯ ╰╮
140 ┤───╯    ╰──────────────────────────────────── baseline
    │
130 ┤                    ╭──────────────────────╮
    │                   ╭╯                      ╰─
120 ┤                  ╭╯
    │                 ╭╯
110 ┼─────────────────┴─────────────────────────────
    └──────────────────────────────────────────────▶
     Day 1                                    Day 30

Common Pitfalls and Solutions

1. Flaky Benchmarks

Problem: Same commit, benchmark results differ each time

Solutions:
- Increase warm-up iterations
- Increase sample size
- Use median instead of mean
- Set variance threshold, re-run if exceeded

2. Environment Drift

Problem: Runner's OS update invalidates baseline

Solutions:
- Use Docker containers to fix environment
- Periodically rebuild baseline
- Record environment fingerprint

3. Over-sensitivity

Problem: 1% change triggers alert, too many false positives

Solutions:
- Raise threshold (5% is reasonable starting point)
- Use statistical significance testing
- Set cooldown period

4. Test Time Too Long

Problem: Full benchmark suite takes 2 hours

Solutions:
- Layer: smoke test (every commit) + full suite (nightly)
- Only run affected benchmarks
- Parallelize execution

Summary

Performance CI/CD Checklist:

  1. Dedicated environment: Use self-hosted runners with fixed configuration
  2. Layered testing: Smoke tests for every commit, full suite nightly
  3. Statistical comparison: Don't just compare means, use proper tests
  4. Automated reporting: Comment on PRs with clear results
  5. Historical tracking: Store data for trend analysis
  6. Smart alerting: Balance sensitivity with noise

The Key Insight:

Performance is not a feature you add at the end. It's a property you maintain continuously.

The Goal:

Every commit is performance-tested.
Every regression is caught before merge.
Every trend is visible to the team.

This is the final chapter of the book. The appendices that follow collect exercises and reference material, beginning with Appendix A on benchmark automation.

Appendix A: Benchmark Automation


"If you can't automate it, you can't scale it." — DevOps Proverb

Why Automation Is Needed

Manual benchmark execution has several problems:

Problems:
1. Human error (forgetting to set environment, wrong parameters)
2. Non-reproducible (different conditions each run)
3. Time-consuming (requires manual waiting and recording)
4. Hard to track (results scattered everywhere)

Solution:
  Automate benchmark workflow
  Integrate into CI/CD pipeline
  Automatically detect performance regressions

CI/CD Integration

GitHub Actions Example

# .github/workflows/benchmark.yml
name: Performance Benchmark

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  benchmark:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Setup environment
      run: |
        # Lock CPU frequency (if possible)
        sudo cpupower frequency-set -g performance || true

        # Disable turbo boost
        echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo || true

    - name: Build
      run: |
        mkdir build && cd build
        cmake -DCMAKE_BUILD_TYPE=Release ..
        make -j$(nproc)

    - name: Run benchmarks
      run: |
        cd build
        ./benchmark --benchmark_format=json > benchmark_results.json

    - name: Upload results
      uses: actions/upload-artifact@v4
      with:
        name: benchmark-results
        path: build/benchmark_results.json

    - name: Compare with baseline
      run: |
        python scripts/compare_benchmarks.py \
          --baseline baseline.json \
          --current build/benchmark_results.json \
          --threshold 5

Dedicated Benchmark Runner

For serious performance testing, use a dedicated machine:

# Using self-hosted runner
jobs:
  benchmark:
    runs-on: [self-hosted, benchmark-machine]

    steps:
    - name: Ensure isolation
      run: |
        # Ensure no other programs running
        sudo systemctl stop cron
        sudo systemctl stop unattended-upgrades

        # Set CPU affinity
        taskset -c 0-3 ./benchmark

Google Benchmark Integration

Basic Setup

#include <benchmark/benchmark.h>
#include <vector>

static void BM_VectorPushBack(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < state.range(0); ++i) {
            v.push_back(i);
        }
    }
    state.SetComplexityN(state.range(0));
}

BENCHMARK(BM_VectorPushBack)
    ->Range(8, 8<<10)
    ->Complexity(benchmark::oN);

BENCHMARK_MAIN();

JSON Output

# Output JSON format
./benchmark --benchmark_format=json --benchmark_out=results.json

# Example output
{
  "context": {
    "date": "2024-01-15T10:30:00+08:00",
    "host_name": "benchmark-server",
    "executable": "./benchmark",
    "num_cpus": 8,
    "mhz_per_cpu": 3600,
    "cpu_scaling_enabled": false
  },
  "benchmarks": [
    {
      "name": "BM_VectorPushBack/8",
      "real_time": 45.2,
      "cpu_time": 44.8,
      "iterations": 15234567
    }
  ]
}

Comparison Tool

# Use Google Benchmark's comparison tool
# compare.py ships in the tools/ directory of the Google Benchmark repository
pip3 install -r tools/requirements.txt

# Compare two runs
tools/compare.py benchmarks baseline.json current.json

# Output
Benchmark                    Time       CPU
--------------------------------------------
BM_VectorPushBack/8        -0.05     -0.04
BM_VectorPushBack/64       +0.12     +0.11  # Warning: regression
BM_VectorPushBack/512      -0.02     -0.03

Regression Detection

Statistical Method

import numpy as np
from scipy import stats

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Use statistical test to detect performance regression

    Args:
        baseline: List of baseline measurements
        current: List of current measurements
        threshold: Acceptable performance change ratio
        alpha: Significance level

    Returns:
        (is_regression, p_value, change_percent)
    """
    # Calculate percent change
    baseline_mean = np.mean(baseline)
    current_mean = np.mean(current)
    change_percent = (current_mean - baseline_mean) / baseline_mean * 100

    # Perform Welch's t-test (does not assume equal variances)
    t_stat, p_value = stats.ttest_ind(baseline, current, equal_var=False)

    # Determine if significant regression
    is_regression = (
        p_value < alpha and
        change_percent > threshold * 100
    )

    return is_regression, p_value, change_percent


# Usage example
baseline = [100.2, 101.5, 99.8, 100.1, 100.9]
current = [108.3, 109.1, 107.5, 108.8, 109.2]

is_reg, p, change = detect_regression(baseline, current)
print(f"Regression: {is_reg}, p-value: {p:.4f}, change: {change:.1f}%")
# Regression: True, p-value: 0.0000, change: 8.0%

Automation Script

#!/usr/bin/env python3
"""benchmark_compare.py - Compare benchmark results and detect regressions"""

import json
import sys

def load_results(filepath):
    """Load Google Benchmark JSON results, keyed by benchmark name"""
    with open(filepath) as f:
        data = json.load(f)
    # Skip entries without timings (e.g. the BigO/RMS rows emitted by Complexity())
    return {b['name']: b for b in data['benchmarks'] if 'real_time' in b}

def compare_results(baseline_path, current_path, threshold=5.0):
    """Compare two benchmark results"""
    baseline = load_results(baseline_path)
    current = load_results(current_path)

    regressions = []
    improvements = []

    for name, curr in current.items():
        if name not in baseline:
            continue

        base = baseline[name]
        change = (curr['real_time'] - base['real_time']) / base['real_time'] * 100

        if change > threshold:
            regressions.append((name, change))
        elif change < -threshold:
            improvements.append((name, change))

    return regressions, improvements

def main():
    if len(sys.argv) != 4:
        print("Usage: benchmark_compare.py <baseline> <current> <threshold>")
        sys.exit(1)

    baseline_path = sys.argv[1]
    current_path = sys.argv[2]
    threshold = float(sys.argv[3])

    regressions, improvements = compare_results(
        baseline_path, current_path, threshold
    )

    if regressions:
        print("❌ Performance Regressions Detected:")
        for name, change in regressions:
            print(f"  {name}: +{change:.1f}%")
        sys.exit(1)

    if improvements:
        print("✅ Performance Improvements:")
        for name, change in improvements:
            print(f"  {name}: {change:.1f}%")

    print("✅ No regressions detected")
    sys.exit(0)

if __name__ == "__main__":
    main()

Result Storage and Tracking

Database Storage

import sqlite3
from datetime import datetime

def init_db(db_path):
    """Initialize benchmark database"""
    conn = sqlite3.connect(db_path)
    conn.execute('''
        CREATE TABLE IF NOT EXISTS benchmarks (
            id INTEGER PRIMARY KEY,
            timestamp TEXT,
            commit_hash TEXT,
            benchmark_name TEXT,
            real_time REAL,
            cpu_time REAL,
            iterations INTEGER,
            UNIQUE(commit_hash, benchmark_name)
        )
    ''')
    conn.commit()
    return conn

def store_results(conn, commit_hash, results):
    """Store benchmark results"""
    timestamp = datetime.now().isoformat()

    for benchmark in results['benchmarks']:
        conn.execute('''
            INSERT OR REPLACE INTO benchmarks
            (timestamp, commit_hash, benchmark_name, real_time, cpu_time, iterations)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            timestamp,
            commit_hash,
            benchmark['name'],
            benchmark['real_time'],
            benchmark['cpu_time'],
            benchmark['iterations']
        ))

    conn.commit()

def get_history(conn, benchmark_name, limit=100):
    """Get benchmark history"""
    cursor = conn.execute('''
        SELECT timestamp, commit_hash, real_time
        FROM benchmarks
        WHERE benchmark_name = ?
        ORDER BY timestamp DESC
        LIMIT ?
    ''', (benchmark_name, limit))

    return cursor.fetchall()
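
Once results are stored per commit, the history itself can serve as the baseline. The sketch below compares the newest stored time with the median of the runs before it. It is an illustration under stated assumptions: the function name latest_vs_baseline, the window of 10 runs, and the 5% threshold are hypothetical choices, and the demo uses an in-memory table with only the columns the query needs.

```python
import sqlite3
from statistics import median

def latest_vs_baseline(conn, benchmark_name, window=10, threshold_pct=5.0):
    """Compare the newest real_time with the median of the previous runs.

    Returns (change_percent, is_regression), or None if there is not
    enough history. Hypothetical helper; thresholds are illustrative.
    """
    rows = conn.execute(
        "SELECT real_time FROM benchmarks WHERE benchmark_name = ? "
        "ORDER BY timestamp DESC LIMIT ?",
        (benchmark_name, window + 1),
    ).fetchall()
    if len(rows) < 2:
        return None
    latest = rows[0][0]
    baseline = median(r[0] for r in rows[1:])
    change = (latest - baseline) / baseline * 100.0
    return change, change > threshold_pct

# Demo with made-up measurements in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE benchmarks (timestamp TEXT, commit_hash TEXT, "
    "benchmark_name TEXT, real_time REAL)"
)
demo_rows = [
    ("2024-01-01T00:00:00", "a1", 100.0),
    ("2024-01-02T00:00:00", "b2", 102.0),
    ("2024-01-03T00:00:00", "c3", 98.0),
    ("2024-01-04T00:00:00", "d4", 110.0),
]
for ts, commit, t in demo_rows:
    conn.execute("INSERT INTO benchmarks VALUES (?, ?, ?, ?)",
                 (ts, commit, "BM_Demo", t))

change, is_regression = latest_vs_baseline(conn, "BM_Demo")
print(f"{change:+.1f}% regression={is_regression}")  # +10.0% regression=True
```

Using the median of several earlier runs rather than only the previous commit makes the check less sensitive to a single noisy measurement.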

Visualization

import matplotlib.pyplot as plt
import pandas as pd

def plot_benchmark_history(history, benchmark_name):
    """Plot benchmark history trend"""
    df = pd.DataFrame(history, columns=['timestamp', 'commit', 'time'])
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values('timestamp')  # get_history() returns newest first

    plt.figure(figsize=(12, 6))
    plt.plot(df['timestamp'], df['time'], marker='o')
    plt.xlabel('Date')
    plt.ylabel('Time (ns)')
    plt.title(f'Benchmark History: {benchmark_name}')
    plt.xticks(rotation=45)
    plt.tight_layout()

    # Names such as "BM_VectorPushBack/8" contain '/', which cannot appear in a file name
    safe_name = benchmark_name.replace('/', '_')
    plt.savefig(f'{safe_name}_history.png')
    plt.close()

Advanced Techniques

Environment Consistency Check

#!/bin/bash
# check_environment.sh - Check benchmark environment

echo "=== Environment Check ==="

# CPU frequency
echo "CPU Frequency:"
cat /proc/cpuinfo | grep "MHz" | head -1

# CPU Governor
echo "CPU Governor:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Turbo Boost
echo "Turbo Boost:"
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || echo "N/A"

# System load
echo "System Load:"
uptime

# Memory
echo "Memory:"
free -h

# Background processes
echo "Background Processes:"
ps aux | wc -l

Multiple Runs with Statistics

# Multiple runs in CI
- name: Run benchmarks (multiple iterations)
  run: |
    for i in {1..5}; do
      ./benchmark --benchmark_format=json > results_$i.json
    done

    # Merge results
    python scripts/merge_results.py results_*.json > final_results.json
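
The merge script referenced in the CI step is not shown in this appendix. One possible shape for it is sketched below, assuming the goal is to keep the median time per benchmark across runs; the function name merge_reports and the extra "runs" field are illustrative choices, not a Google Benchmark format.

```python
import json
from statistics import median

def merge_reports(reports):
    """Merge parsed Google Benchmark reports, keeping the median per benchmark.

    Hypothetical helper: `reports` is a list of dicts loaded from the
    per-run JSON files. Entries without timings are skipped.
    """
    samples = {}  # benchmark name -> list of entries from each run
    for report in reports:
        for entry in report.get("benchmarks", []):
            if "real_time" not in entry:
                continue
            samples.setdefault(entry["name"], []).append(entry)

    merged = []
    for name, entries in samples.items():
        merged.append({
            "name": name,
            "real_time": median(e["real_time"] for e in entries),
            "cpu_time": median(e["cpu_time"] for e in entries),
            "runs": len(entries),  # illustrative extra field
        })

    # Keep the context of the first run as representative
    context = reports[0].get("context", {}) if reports else {}
    return {"context": context, "benchmarks": merged}

# Demo with three made-up runs of one benchmark
runs = [
    {"context": {"num_cpus": 8},
     "benchmarks": [{"name": "BM_A", "real_time": t, "cpu_time": t - 1.0}]}
    for t in (45.0, 47.0, 44.0)
]
merged = merge_reports(runs)
print(json.dumps(merged, indent=2))
```

As an alternative to running the binary several times, Google Benchmark can repeat each benchmark itself with --benchmark_repetitions=5 and report mean, median, and standard deviation as aggregate entries.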

Performance Budget

# performance_budget.yaml
benchmarks:
  BM_VectorPushBack/1024:
    max_time_ns: 50000
    max_memory_kb: 100

  BM_HashTableLookup:
    max_time_ns: 100

  BM_SortArray/10000:
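    max_time_ns: 1000000

# Check parsed results against the budget loaded from performance_budget.yaml
def check_budget(results, budget):
    """Return budget violations (only max_time_ns is checked here)"""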
    violations = []

    for name, limits in budget['benchmarks'].items():
        if name not in results:
            continue

        result = results[name]

        if 'max_time_ns' in limits:
            if result['real_time'] > limits['max_time_ns']:
                violations.append(
                    f"{name}: {result['real_time']:.0f}ns > {limits['max_time_ns']}ns"
                )

    return violations

Summary

Key elements of benchmark automation:

CI/CD Integration

  • GitHub Actions / GitLab CI
  • Dedicated benchmark runner
  • Environment consistency

Automatic Regression Detection

  • Statistical testing
  • Threshold configuration
  • Automatic alerts

Result Management

  • Database storage
  • Historical tracking
  • Visualization

Best Practices

  • Multiple runs for statistics
  • Fixed environment settings
  • Performance budgets
  • Automated reporting

Appendix B: Embedded and RTOS Implementation


"In embedded systems, every cycle counts." — Embedded Systems Proverb

Simulator Environment Setup

Since this book doesn't assume readers have physical hardware, all exercises use simulators.

QEMU ARM Setup

# Install QEMU
sudo apt install qemu-system-arm

# Install ARM toolchain
sudo apt install gcc-arm-none-eabi

# Test QEMU
qemu-system-arm -M help | grep lm3s
# lm3s6965evb    Stellaris LM3S6965EVB (Cortex-M3)

QEMU RISC-V Setup

# Install QEMU (on Debian/Ubuntu the RISC-V system emulators ship in qemu-system-misc;
# some newer releases package them separately)
sudo apt install qemu-system-misc

# Install RISC-V toolchain
sudo apt install gcc-riscv64-unknown-elf

# Test QEMU
qemu-system-riscv32 -M help | grep sifive
# sifive_e       RISC-V Board compatible with SiFive E SDK
# sifive_u       RISC-V Board compatible with SiFive U SDK

Renode Setup

# Download Renode
wget https://github.com/renode/renode/releases/download/v1.14.0/renode_1.14.0_amd64.deb
sudo dpkg -i renode_1.14.0_amd64.deb

# Test
renode --version

Exercise 1: ARM Cortex-M Cycle Counting

Goal

Use DWT (Data Watchpoint and Trace) cycle counter to measure function execution time.

Code

// cycle_count.c - ARM Cortex-M3 cycle counting example

#include <stdint.h>

// DWT register definitions
#define DWT_CTRL   (*(volatile uint32_t*)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004)
#define DEMCR      (*(volatile uint32_t*)0xE000EDFC)

// Enable DWT
void dwt_init(void) {
    DEMCR |= (1 << 24);  // TRCENA
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1;       // CYCCNTENA
}

// Measure function
uint32_t measure_cycles(void (*func)(void)) {
    uint32_t start = DWT_CYCCNT;
    func();
    uint32_t end = DWT_CYCCNT;
    return end - start;
}

// Test function
void test_function(void) {
    volatile int sum = 0;
    for (int i = 0; i < 1000; i++) {
        sum += i;
    }
}

int main(void) {
    dwt_init();

    uint32_t cycles = measure_cycles(test_function);

    // Use semihosting for output
    // printf("Cycles: %u\n", cycles);

    while (1);
    return 0;
}

Compile and Run

# Compile
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb \
    -specs=nosys.specs -specs=nano.specs \
    -T linker.ld -o cycle_count.elf cycle_count.c

# Run in QEMU
qemu-system-arm -M lm3s6965evb -nographic \
    -kernel cycle_count.elf -semihosting

Exercise 2: RISC-V mcycle/minstret

Goal

Use RISC-V CSRs to read cycle count and instruction count.

Code

// riscv_counters.c - RISC-V performance counter example

#include <stdint.h>

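// Read the 64-bit mcycle counter on RV32 (machine mode only).
// The high half is read twice to detect a carry out of the low half
// between the reads. On RV64 a single csrr of mcycle is sufficient.
static inline uint64_t read_mcycle(void) {
    uint32_t lo, hi, hi2;
    do {
        asm volatile (
            "csrr %0, mcycleh\n"
            "csrr %1, mcycle\n"
            "csrr %2, mcycleh\n"
            : "=r"(hi), "=r"(lo), "=r"(hi2)
        );
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

// Read the 64-bit minstret counter on RV32 (same hi/lo/hi pattern)
static inline uint64_t read_minstret(void) {
    uint32_t lo, hi, hi2;
    do {
        asm volatile (
            "csrr %0, minstreth\n"
            "csrr %1, minstret\n"
            "csrr %2, minstreth\n"
            : "=r"(hi), "=r"(lo), "=r"(hi2)
        );
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}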

// Calculate CPI
void measure_cpi(void (*func)(void)) {
    uint64_t cycles_start = read_mcycle();
    uint64_t instrs_start = read_minstret();

    func();

    uint64_t cycles_end = read_mcycle();
    uint64_t instrs_end = read_minstret();

    uint64_t cycles = cycles_end - cycles_start;
    uint64_t instrs = instrs_end - instrs_start;

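    // CPI = cycles / instructions, as fixed-point with two decimals
    // (integer arithmetic only; guard against a zero instruction count)
    if (instrs == 0) return;
    uint32_t cpi_int = cycles / instrs;
    uint32_t cpi_frac = (cycles * 100 / instrs) % 100;

    // Output: CPI = cpi_int.cpi_frac (print cpi_frac with two digits, e.g. %02u)
    (void)cpi_int; (void)cpi_frac;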
}

int main(void) {
    // Test...
    return 0;
}

Exercise 3: FreeRTOS Context Switch Measurement

Goal

Measure FreeRTOS context switch time.

Method

Measurement method:

1. Create two tasks
2. Task A records time, then yields
3. Task B records time, then yields
4. When Task A resumes, subtract Task B's timestamp from the current time

Time difference ≈ one context switch (B → A), including the cost of taskYIELD()

FreeRTOS Code

// context_switch.c - FreeRTOS context switch measurement

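#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

// DWT_CYCCNT and dwt_init() are defined as in Exercise 1
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004)
void dwt_init(void);

volatile uint32_t timestamp_a, timestamp_b;
volatile uint32_t switch_time;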

void TaskA(void *pvParameters) {
    while (1) {
        timestamp_a = DWT_CYCCNT;
        taskYIELD();

        // Calculate time switching back from B
        switch_time = DWT_CYCCNT - timestamp_b;
    }
}

void TaskB(void *pvParameters) {
    while (1) {
        timestamp_b = DWT_CYCCNT;
        taskYIELD();
    }
}

int main(void) {
    dwt_init();

    xTaskCreate(TaskA, "TaskA", 128, NULL, 1, NULL);
    xTaskCreate(TaskB, "TaskB", 128, NULL, 1, NULL);

    vTaskStartScheduler();

    while (1);
}

Running on Renode

# Create Renode script
cat > freertos_test.resc << 'EOF'
mach create
machine LoadPlatformDescription @platforms/cpus/stm32f4.repl
sysbus LoadELF @context_switch.elf
showAnalyzer sysbus.uart1
start
EOF

# Run
renode freertos_test.resc

Exercise 4: Interrupt Latency Measurement

Measure time from interrupt trigger to ISR execution start.

Measurement Method Description

Measurement steps:

1. Record time before triggering interrupt
2. Record time at ISR start
3. Calculate difference

Notes:
  - Need to consider interrupt priority
  - Need to consider impact of other interrupts
  - Multiple measurements for statistics

Code

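// interrupt_latency.c
// Assumes the CMSIS device header (SysTick, SCB, __DSB, __ISB) and the
// DWT_CYCCNT macro from Exercise 1 are available.

volatile uint32_t trigger_time;
volatile uint32_t isr_start_time;
volatile uint32_t latency;

void SysTick_Handler(void) {
    isr_start_time = DWT_CYCCNT;
    latency = isr_start_time - trigger_time;
}

void measure_interrupt_latency(void) {
    // Keep the SysTick counter stopped so the only SysTick exception
    // is the one pended by software below.
    SysTick->CTRL = 0;

    // Record the time, then pend the SysTick exception immediately.
    // (Timing a timer countdown instead would add the timer period
    // to the measured latency.)
    trigger_time = DWT_CYCCNT;
    SCB->ICSR = SCB_ICSR_PENDSTSET_Msk;
    __DSB();
    __ISB();  // The exception is taken here unless it is masked

    // latency now contains the interrupt latency plus a few cycles of
    // measurement overhead (the store to ICSR and the handler prologue)
}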

Exercise 5: Memory Access Pattern Analysis

Analyze the impact of different memory access patterns on performance.

Memory Access Code

// memory_access.c

#define ARRAY_SIZE 1024

volatile uint32_t array[ARRAY_SIZE];

// Sequential access
uint32_t sequential_access(void) {
    uint32_t start = DWT_CYCCNT;

    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i;
    }

    return DWT_CYCCNT - start;
}

// Strided access (stride = 16)
uint32_t strided_access(void) {
    uint32_t start = DWT_CYCCNT;

    for (int s = 0; s < 16; s++) {
        for (int i = s; i < ARRAY_SIZE; i += 16) {
            array[i] = i;
        }
    }

    return DWT_CYCCNT - start;
}

// Random access
uint32_t random_access(uint32_t *indices) {
    uint32_t start = DWT_CYCCNT;

    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[indices[i]] = i;
    }

    return DWT_CYCCNT - start;
}

Power Measurement (Theory)

Without physical hardware, power measurement can only be discussed theoretically.

Measurement Equipment

Power measurement equipment:

1. Current Probe
   - Connected in series with power line
   - Measures current waveform
   - Example: Keysight N2820A

2. Power Analyzer
   - High precision power measurement
   - Example: Keysight N6705C

3. Built-in on Dev Boards
   - STM32 Nucleo IDD jumper
   - Nordic PPK2

Measurement Method

Power measurement steps:

1. Baseline measurement
   - Measure idle state power
   - Measure various sleep mode power

2. Dynamic measurement
   - Run benchmark
   - Record power waveform
   - Calculate average power

3. Energy calculation
   Energy = ∫ Power(t) dt
         ≈ Σ Power[i] × Δt
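
The energy sum in step 3 can be sketched in a few lines. The example below assumes evenly spaced power samples; the sample values and the 1 ms interval are made up for illustration, and the trapezoid variant is an optional refinement rather than something the steps above require.

```python
def energy_joules(power_w, dt_s):
    """Rectangle rule: E ≈ Σ P[i] × Δt for evenly spaced samples."""
    return sum(power_w) * dt_s

def energy_joules_trapezoid(power_w, dt_s):
    """Trapezoid rule: averages neighbouring samples before summing."""
    if len(power_w) < 2:
        return 0.0
    return sum((a + b) / 2.0 * dt_s for a, b in zip(power_w, power_w[1:]))

# Hypothetical readings in watts, sampled every 1 ms
samples_w = [0.010, 0.012, 0.050, 0.048, 0.011]
dt = 1e-3

energy = energy_joules(samples_w, dt)
average_power = energy / (len(samples_w) * dt)
print(f"Energy: {energy * 1e6:.1f} uJ, average power: {average_power * 1e3:.1f} mW")
```

For a benchmark, dividing the energy by the number of completed operations gives energy per operation, which is often easier to compare across clock settings than average power.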

Simulated Power Estimation

# Simplified power model
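def estimate_power(cycles, frequency_mhz, voltage_v):
    """
    Estimate average power in watts
    P = α × C × V² × f + P_static

    Assumptions:
    - Switched capacitance C ≈ 10 pF
    - Activity factor α ≈ 0.3

    Note: `cycles` is not used by this model; multiply the returned
    power by cycles / f to estimate energy.
    """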
    C = 10e-12  # 10 pF
    alpha = 0.3
    f = frequency_mhz * 1e6

    dynamic_power = alpha * C * (voltage_v ** 2) * f

    # Add static power (assume 1 mW)
    static_power = 1e-3

    return dynamic_power + static_power

# Example
power = estimate_power(1000000, 100, 1.8)
print(f"Estimated power: {power * 1000:.2f} mW")

Summary

Key techniques for embedded performance measurement:

Timing Measurement

  • ARM: DWT Cycle Counter
  • RISC-V: mcycle/minstret CSR
  • General: SysTick, hardware timers

RTOS Measurement

  • Context switch time
  • Interrupt latency
  • Task switching overhead

Memory Analysis

  • Access pattern impact
  • Cache effects (if available)
  • Alignment impact

Power Measurement

  • Requires dedicated equipment
  • Dynamic vs static power
  • Energy efficiency metrics

Appendix C: I/O and Storage Performance


"Storage is the new memory." — Jim Gray

Storage Performance Fundamentals

Key Metrics

Storage performance metrics:

1. Bandwidth (Throughput)
   - Unit: MB/s, GB/s
   - Maximum sequential read/write speed

2. IOPS (I/O Operations Per Second)
   - Unit: ops/s
   - Random read/write operations

3. Latency
   - Unit: μs, ms
   - Time for single I/O operation

4. Queue Depth
   - Concurrent I/O requests
   - Affects IOPS and latency
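
These four metrics are linked by simple arithmetic: bandwidth is IOPS times block size, and by Little's law the sustained queue depth is roughly IOPS times mean latency. The sketch below works through both relations; the function names and the example numbers are illustrative.

```python
def bandwidth_mb_s(iops, block_size_bytes):
    """Bandwidth in MB/s implied by an IOPS figure and a block size."""
    return iops * block_size_bytes / 1e6

def iops_from_queue(queue_depth, mean_latency_s):
    """Little's law: queue depth = IOPS × latency, solved for IOPS."""
    return queue_depth / mean_latency_s

# 100,000 random 4 KiB reads per second
print(f"{bandwidth_mb_s(100_000, 4096):.1f} MB/s")         # 409.6 MB/s

# 32 outstanding requests at 100 microseconds mean latency
print(f"{iops_from_queue(32, 100e-6):,.0f} IOPS")          # 320,000 IOPS
```

The second relation explains why a device tested at queue depth 1 can look far slower than its rated IOPS: with a single request in flight, throughput is capped at 1 / latency.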

Storage Hierarchy

Storage hierarchy and typical performance:

Level         Latency        Bandwidth
─────────────────────────────────────────────
CPU Cache     1-10 ns        100+ GB/s
DRAM          50-100 ns      50-100 GB/s
NVMe SSD      10-100 μs      3-7 GB/s
SATA SSD      50-200 μs      500-600 MB/s
HDD           5-10 ms        100-200 MB/s
Network       1-100 ms       1-10 GB/s

fio (Flexible I/O Tester)

fio is one of the most widely used storage benchmark tools.

Installation

# Ubuntu/Debian
sudo apt install fio

# macOS
brew install fio

# From source
git clone https://github.com/axboe/fio.git
cd fio && ./configure && make && sudo make install

Basic Usage

# Sequential write test
fio --name=seq_write \
    --ioengine=libaio \
    --direct=1 \
    --bs=1M \
    --size=1G \
    --numjobs=1 \
    --rw=write \
    --filename=/tmp/fio_test

# Random read test
fio --name=rand_read \
    --ioengine=libaio \
    --direct=1 \
    --bs=4K \
    --size=1G \
    --numjobs=4 \
    --iodepth=32 \
    --rw=randread \
    --filename=/tmp/fio_test

Common Parameters

fio parameters:

--ioengine    I/O engine (libaio, io_uring, sync)
--direct      Bypass page cache (1=yes)
--bs          Block size (4K, 1M, etc.)
--size        Test file size
--numjobs     Parallel jobs
--iodepth     Queue depth
--rw          Read/write mode (read, write, randread, randwrite, randrw)
--runtime     Runtime (seconds)
--time_based  Time-based instead of size-based

Job File

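; fio_test.fio - Complete test configuration
; Jobs in one file start in parallel by default; 'stonewall' makes each job
; wait for the previous one so the workloads do not interfere.

[global]
ioengine=libaio
direct=1
size=1G
runtime=60
time_based
group_reporting

[seq_read]
stonewall
rw=read
bs=1M
numjobs=1

[seq_write]
stonewall
rw=write
bs=1M
numjobs=1

[rand_read]
stonewall
rw=randread
bs=4K
numjobs=4
iodepth=32

[rand_write]
stonewall
rw=randwrite
bs=4K
numjobs=4
iodepth=32

[mixed]
stonewall
rw=randrw
rwmixread=70
bs=4K
numjobs=4
iodepth=32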

Running and Output

# Run job file
fio fio_test.fio

# JSON output
fio fio_test.fio --output-format=json --output=results.json

# Output example
seq_read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB
  read: IOPS=3245, BW=3245MiB/s (3403MB/s)
    slat (usec): min=2, max=45, avg=5.2
    clat (usec): min=280, max=1234, avg=302.5
     lat (usec): min=285, max=1240, avg=307.7
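
The JSON output is easier to post-process than the text report. The following is a minimal sketch, assuming the fio 3.x JSON layout (a "jobs" list whose "read" and "write" objects carry "iops", "bw" in KiB/s, and "lat_ns"); the helper name summarize_fio is illustrative, and the sample dictionary is hand-written to mirror the text output above rather than captured from a real run.

```python
def summarize_fio(report):
    """Extract IOPS, bandwidth, and mean latency per job and direction.

    Hypothetical helper; field names assume fio 3.x JSON output.
    """
    rows = []
    for job in report.get("jobs", []):
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            if not stats.get("io_bytes"):
                continue  # this direction was not exercised by the job
            rows.append({
                "job": job["jobname"],
                "direction": direction,
                "iops": stats["iops"],
                "bw_mib_s": stats["bw"] / 1024.0,           # fio reports KiB/s
                "lat_mean_us": stats["lat_ns"]["mean"] / 1000.0,
            })
    return rows

# Hand-written sample mirroring the seq_read text output above
sample_report = {
    "jobs": [{
        "jobname": "seq_read",
        "read": {"io_bytes": 204157747200, "bw": 3322880, "iops": 3245.0,
                 "lat_ns": {"min": 285000, "max": 1240000, "mean": 307700.0}},
        "write": {"io_bytes": 0, "bw": 0, "iops": 0.0,
                  "lat_ns": {"min": 0, "max": 0, "mean": 0.0}},
    }]
}

rows = summarize_fio(sample_report)
for row in rows:
    print(f"{row['job']} {row['direction']}: {row['iops']:.0f} IOPS, "
          f"{row['bw_mib_s']:.0f} MiB/s, {row['lat_mean_us']:.1f} us")
```

In practice the dictionary would come from json.load() on the file written with --output=results.json.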

ioping

ioping measures I/O latency, similar to ping.

Installation and Usage

# Install
sudo apt install ioping

# Measure latency
ioping -c 10 /tmp

# Output example
4 KiB <<< /tmp (ext4 /dev/sda1): request=1 time=234.5 us
4 KiB <<< /tmp (ext4 /dev/sda1): request=2 time=198.3 us
...
--- /tmp (ext4 /dev/sda1) ioping statistics ---
10 requests completed in 2.15 ms, 40 KiB read, 4.65 k iops, 18.2 MiB/s
min/avg/max/mdev = 156.2 us / 215.0 us / 312.4 us / 45.2 us

Advanced Usage

# Direct I/O (bypass cache)
ioping -D /dev/sda

# Specify size
ioping -s 1M /tmp

# Continuous test
ioping -c 100 -i 0 /tmp

Network I/O

iperf3

iperf3 is the standard tool for network bandwidth testing.

# Install
sudo apt install iperf3

# Server side
iperf3 -s

# Client side
iperf3 -c server_ip

# Output example
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  11.0 GBytes  9.42 Gbits/sec

iperf3 Advanced Usage

# UDP test
iperf3 -c server_ip -u -b 1G

# Multiple connections
iperf3 -c server_ip -P 4

# Bidirectional test
iperf3 -c server_ip --bidir

# JSON output
iperf3 -c server_ip -J > results.json

netperf

netperf measures both bulk throughput and request/response performance; it is used here for latency testing.

# Install
sudo apt install netperf

# Server side
netserver

# TCP request/response latency
netperf -H server_ip -t TCP_RR

# Output example
TCP REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  bytes  bytes    bytes   secs.    per sec
16384  131072 1        1       10.00    45678.90
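
TCP_RR reports a transaction rate, not a latency. Because the default test keeps one transaction outstanding at a time, the mean round-trip time is the reciprocal of that rate. A small sketch of the conversion, applied to the example figure above (the function name is illustrative):

```python
def mean_round_trip_us(transactions_per_sec):
    """Mean round-trip time in microseconds for a one-at-a-time TCP_RR test."""
    return 1e6 / transactions_per_sec

rtt = mean_round_trip_us(45678.90)
print(f"Mean round trip: {rtt:.1f} us")  # Mean round trip: 21.9 us
```

This gives only the mean; it says nothing about tail latency, which needs per-transaction timing.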

dd Test

dd is the simplest I/O testing method.

# Write test
dd if=/dev/zero of=/tmp/test bs=1M count=1024 conv=fdatasync

# Read test
dd if=/tmp/test of=/dev/null bs=1M

# Read after clearing cache
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/tmp/test of=/dev/null bs=1M

Limitations of dd

Problems with dd:

1. Single-threaded
   - Cannot test parallel performance

2. No queue depth control
   - Cannot test NVMe's true performance

3. No statistics
   - Only average, no latency distribution

Recommendation:
  - Use dd for quick tests
  - Use fio for formal testing

File System Performance

Comparing Different File Systems

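# Create test environment
# WARNING: this destroys all data on /dev/sdb1; run as root on a scratch device
for fs in ext4 xfs btrfs; do
    wipefs -a /dev/sdb1    # Clear old signatures so mkfs does not prompt or refuse
    mkfs.$fs /dev/sdb1
    mount /dev/sdb1 /mnt/test

    # iodepth only takes effect with an asynchronous engine such as libaio
    fio --name=test --filename=/mnt/test/file \
        --ioengine=libaio --direct=1 \
        --size=10G --bs=4K --rw=randread \
        --iodepth=32 --numjobs=4 --group_reporting \
        --output-format=json > ${fs}_results.json

    umount /mnt/test
done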

Mount Options Impact

# Default mount
mount /dev/sdb1 /mnt/test

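# Performance-oriented mount
# (noatime already implies nodiratime; online discard can slow writes on some
#  SSDs, in which case periodic fstrim is the usual alternative)
mount -o noatime,nodiratime,discard /dev/sdb1 /mnt/test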

# Compare performance difference

I/O Scheduler

Viewing and Setting

# View current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none

# Set scheduler
echo "none" | sudo tee /sys/block/sda/queue/scheduler

Scheduler Comparison

I/O scheduler characteristics:

Scheduler   Use Case              Features
─────────────────────────────────────────────────────
none        NVMe SSD              Lowest latency
mq-deadline General               Balance latency and throughput
kyber       Low latency needs     Auto-adjusting
bfq         Desktop/interactive   Fairness priority

Performance Analysis Tools

iostat

# Install
sudo apt install sysstat

# Basic usage
iostat -x 1

# Example output
Device   r/s     w/s     rkB/s   wkB/s   await  %util
sda      125.00  45.00   5000.0  1800.0  0.85   12.5
nvme0n1  3500.0  1200.0  14000.0 4800.0  0.12   45.2

blktrace

# Trace I/O
sudo blktrace -d /dev/sda -o trace

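# Analyze (blkparse merges the per-CPU trace.blktrace.* files)
blkparse -i trace -d trace.bin

# Summarize latencies with btt (reads the binary dump written by blkparse -d)
btt -i trace.bin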

Summary

Key tools for I/O performance testing:

Storage Testing

  • fio: Complete storage benchmark
  • ioping: I/O latency testing
  • dd: Quick simple test

Network Testing

  • iperf3: Bandwidth testing
  • netperf: Latency testing

Analysis Tools

  • iostat: Real-time monitoring
  • blktrace: Detailed tracing

Testing Tips

  • Use direct I/O to bypass cache
  • Test different block sizes
  • Test different queue depths
  • Multiple runs for statistics

Appendix D: Power and Performance


"Performance per watt is the new performance." — Intel

Power Performance Fundamentals

Why Power Matters

Why power matters:

1. Data Centers
   - Electricity is major operational cost
   - Cooling requirements scale with power
   - Power supply limits

2. Mobile Devices
   - Battery life
   - Thermal design limits
   - User experience

3. Embedded Systems
   - Battery powered
   - Fanless design
   - Environmental constraints

4. Environmental
   - Carbon footprint
   - Energy efficiency regulations

Power Components

Processor power components:

1. Dynamic Power
   P_dynamic = α × C × V² × f

   α = activity factor
   C = capacitance
   V = voltage
   f = frequency

2. Static Power (Leakage)
   P_static = I_leak × V

   Temperature dependent
   Smaller process = more leakage

3. Total Power
   P_total = P_dynamic + P_static

RAPL (Running Average Power Limit)

RAPL is Intel's interface for monitoring energy consumption and enforcing power limits.

Reading RAPL

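# Using powercap interface (energy_uj is readable only by root on recent kernels)
sudo cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj

# Calculate average power over one second
# (the counter wraps at max_energy_range_uj; a negative result means it wrapped)
E1=$(sudo cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
sleep 1
E2=$(sudo cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
echo "Power: $(( (E2 - E1) / 1000000 )) W"   # Integer watts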

RAPL Domains

RAPL power domains:

Domain              Description
─────────────────────────────────────────────
Package (PKG)       Entire CPU package
Core (PP0)          CPU cores
Uncore (PP1)        Uncore device, typically the integrated GPU on client parts
DRAM                Attached memory (not available on all parts)

Using perf

# Measure energy with perf (RAPL events are system-wide, so -a is required)
sudo perf stat -a -e power/energy-pkg/,power/energy-cores/,power/energy-ram/ \
    ./benchmark

# Example output
Performance counter stats for 'system wide':
    45.23 Joules  power/energy-pkg/
    32.15 Joules  power/energy-cores/
    12.34 Joules  power/energy-ram/

    10.002345678 seconds time elapsed

Python RAPL Reader

import time
from pathlib import Path

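class RAPLReader:
    def __init__(self):
        self.rapl_path = Path("/sys/class/powercap/intel-rapl")
        self.domains = self._find_domains()

    def _find_domains(self):
        # Top-level zones only (e.g. package-0); core, uncore, and dram are
        # subzones such as intel-rapl:0:0 and are not matched here.
        domains = {}
        for d in self.rapl_path.glob("intel-rapl:*"):
            name = (d / "name").read_text().strip()
            domains[name] = d
        return domains

    def read_energy(self):
        """Read energy per discovered domain (microjoules; needs root on recent kernels)"""
        return {
            name: int((d / "energy_uj").read_text())
            for name, d in self.domains.items()
        }

    def measure_power(self, duration=1.0):
        """Measure average power (watts) over `duration` seconds"""
        e1 = self.read_energy()
        time.sleep(duration)
        e2 = self.read_energy()

        power = {}
        for name, d in self.domains.items():
            delta = e2[name] - e1[name]
            if delta < 0:  # Counter wrapped around during the interval
                delta += int((d / "max_energy_range_uj").read_text())
            power[name] = delta / duration / 1e6
        return power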

# Usage
rapl = RAPLReader()
power = rapl.measure_power(1.0)
print(f"Package power: {power.get('package-0', 0):.2f} W")

Performance per Watt

GFLOPS/W

GFLOPS/W calculation:

Performance/Power ratio = GFLOPS / Power (W)

Example:
  Performance = 100 GFLOPS
  Power = 50 W
  Efficiency = 100 / 50 = 2 GFLOPS/W

Measurement Method

import subprocess
import time

def measure_efficiency(benchmark_cmd):
    """Measure performance per watt (uses RAPLReader from the previous listing)"""
    rapl = RAPLReader()

    # Start measurement
    e1 = rapl.read_energy()
    t1 = time.time()

    # Run benchmark
    result = subprocess.run(
        benchmark_cmd,
        capture_output=True,
        text=True
    )

    # End measurement
    t2 = time.time()
    e2 = rapl.read_energy()

    # Calculate
    elapsed = t2 - t1
    energy_j = (e2['package-0'] - e1['package-0']) / 1e6
    power_w = energy_j / elapsed

    # Parse GFLOPS from benchmark output
    # (parse_gflops is a placeholder; implement it for your benchmark's output format)
    gflops = parse_gflops(result.stdout)

    efficiency = gflops / power_w

    return {
        'gflops': gflops,
        'power_w': power_w,
        'efficiency': efficiency
    }

Thermal Throttling

What is Thermal Throttling

Thermal throttling mechanism:

When temperature exceeds threshold, processor will:
1. Reduce frequency
2. Reduce voltage
3. Skip clock cycles

Result:
  - Performance decreases
  - Power consumption decreases
  - Temperature stabilizes

Problem:
  - Benchmark results become unstable
  - Cannot achieve rated performance

Temperature Monitoring

# Using sensors
sudo apt install lm-sensors
sensors

# Output example
coretemp-isa-0000
Core 0:       +65.0°C  (high = +100.0°C, crit = +110.0°C)
Core 1:       +67.0°C  (high = +100.0°C, crit = +110.0°C)

# Using /sys
cat /sys/class/thermal/thermal_zone*/temp
# 65000 (millidegrees)

Detecting Throttling

# Using turbostat
sudo turbostat --interval 1

# Output example
Core  CPU  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  IRQ  SMI  POLL  C1  C1E  C6
-     -    2345     45.2   3600     3600     1234 0    0     55  0   0
0     0    2400     48.5   3600     3600     456  0    0     52  0   0

# Bzy_MHz < rated frequency = possible throttling

Python Monitoring

from pathlib import Path
import time

def monitor_thermal(duration=60, interval=1):
    """Monitor temperature and frequency"""
    results = []

    for _ in range(int(duration / interval)):
        # Read temperature
        temps = []
        for zone in Path("/sys/class/thermal").glob("thermal_zone*"):
            temp = int((zone / "temp").read_text()) / 1000
            temps.append(temp)

        # Read frequency
        freqs = []
        for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
            freq_file = cpu / "cpufreq/scaling_cur_freq"
            if freq_file.exists():
                freq = int(freq_file.read_text()) / 1000  # MHz
                freqs.append(freq)

        results.append({
            'time': time.time(),
            'max_temp': max(temps, default=0.0),  # 0.0 if no thermal zone is exposed
            'avg_freq': sum(freqs) / len(freqs) if freqs else 0
        })

        time.sleep(interval)

    return results
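
The samples returned by monitor_thermal() can be scanned for likely throttling: intervals where the average frequency falls below the rated value while the temperature is high. The sketch below is one way to do that. The function name find_throttling, the 90 °C limit, and the 5% tolerance are assumptions to adjust per platform, and the sample data is made up.

```python
def find_throttling(samples, rated_mhz, temp_limit_c=90.0, tolerance=0.05):
    """Return samples that look throttled: low frequency at high temperature.

    Hypothetical helper; `samples` uses the keys produced by monitor_thermal().
    """
    floor_mhz = rated_mhz * (1.0 - tolerance)
    return [
        s for s in samples
        if s["avg_freq"] < floor_mhz and s["max_temp"] >= temp_limit_c
    ]

# Made-up samples in the same shape as monitor_thermal() output
samples = [
    {"time": 0.0, "max_temp": 65.0, "avg_freq": 3600.0},
    {"time": 1.0, "max_temp": 92.0, "avg_freq": 3550.0},  # hot, still at speed
    {"time": 2.0, "max_temp": 97.0, "avg_freq": 2900.0},  # hot and slowed down
]

throttled = find_throttling(samples, rated_mhz=3600)
print(f"{len(throttled)} of {len(samples)} samples look throttled")
```

A low frequency alone is not proof of throttling, since an idle core also clocks down; the check is meaningful only while the benchmark keeps the cores busy.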

DVFS (Dynamic Voltage and Frequency Scaling)

CPU Frequency Control

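# View available frequencies (exposed by acpi-cpufreq; not available with intel_pstate)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

# View current frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# Set frequency (requires root and the userspace governor)
echo userspace | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 2400000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed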

Governor Settings

# View available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# performance powersave ondemand conservative schedutil

# Set governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Recommended to use 'performance' for benchmarks

Frequency Impact on Performance

Frequency vs Performance vs Power:

Frequency   Rel. Perf   Rel. Power  Efficiency
─────────────────────────────────────────────
2.0 GHz     1.0x        1.0x        1.0x
2.5 GHz     1.25x       1.56x       0.80x
3.0 GHz     1.50x       2.25x       0.67x
3.5 GHz     1.75x       3.06x       0.57x

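The table models Power ∝ f² (voltage rising more slowly than frequency).
If voltage scaled linearly with frequency, P = α × C × V² × f would give Power ∝ f³.
Either way, efficiency decreases with frequency.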
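
The table can be regenerated from a power law P ∝ f^k, which also shows how sensitive the efficiency column is to the exponent. This is a minimal sketch, assuming performance scales linearly with frequency; an exponent of 2 reproduces the relative-power column above, and an exponent of 3 corresponds to voltage rising in proportion to frequency. The function name scaling_table is illustrative.

```python
def scaling_table(freqs_ghz, base_ghz, exponent):
    """Return (freq, rel_perf, rel_power, efficiency) rows for P ∝ f**exponent."""
    rows = []
    for f in freqs_ghz:
        rel_perf = f / base_ghz            # assumes perf scales linearly with f
        rel_power = rel_perf ** exponent
        rows.append((f, rel_perf, rel_power, rel_perf / rel_power))
    return rows

freqs = [2.0, 2.5, 3.0, 3.5]
for exponent in (2, 3):
    print(f"Power ∝ f^{exponent}")
    for f, perf, power, eff in scaling_table(freqs, 2.0, exponent):
        print(f"  {f:.1f} GHz  {perf:.2f}x  {power:.2f}x  {eff:.2f}x")
```

Real processors fall somewhere between these curves, and static power adds a floor that neither model includes, so measured data should replace the model wherever it is available.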

GPU Power

NVIDIA GPU

# Using nvidia-smi
nvidia-smi --query-gpu=power.draw --format=csv -l 1

# Detailed info
nvidia-smi dmon -s p

# Example output
# gpu   pwr  gtemp  mtemp
    0   125W    65C    55C

Using NVML

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Read power
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
print(f"GPU Power: {power:.1f} W")

# Read temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print(f"GPU Temp: {temp}°C")

pynvml.nvmlShutdown()

Power Benchmark Best Practices

Environment Preparation

#!/bin/bash
# prepare_power_benchmark.sh

# 1. Set performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 2. Disable turbo boost (optional, for stability)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 3. Ensure adequate cooling
# (wait for temperature to stabilize)

# 4. Stop unnecessary services
sudo systemctl stop cron
sudo systemctl stop unattended-upgrades

Measurement Flow

Power measurement flow:

1. Warm-up
   - Run benchmark until temperature stabilizes
   - Usually takes 1-5 minutes

2. Baseline measurement
   - Measure idle power
   - As baseline

3. Load measurement
   - Run benchmark
   - Record power simultaneously

4. Multiple repeats
   - At least 3-5 times
   - Calculate mean and standard deviation

Summary

Key points for power performance analysis:

Measurement Tools

  • RAPL: Intel CPU power
  • nvidia-smi: NVIDIA GPU power
  • sensors: Temperature monitoring

Key Metrics

  • GFLOPS/W: Performance per watt
  • Energy: Total energy consumption
  • Thermal headroom: Temperature margin

Influencing Factors

  • Frequency and voltage
  • Thermal throttling
  • Workload characteristics

Best Practices

  • Fixed frequency testing
  • Wait for temperature to stabilize
  • Multiple measurements for statistics
  • Record environmental conditions

Appendix E: Exercises and Solutions


"The best way to learn is by doing." — Richard Feynman

This appendix provides hands-on exercises for each chapter with detailed problem descriptions, solution approaches, and key code snippets.


Part I: Foundations (Chapters 1-4)

Exercise 1.1: Array Traversal with Statistical Analysis

Difficulty: Easy | Language: C

Problem: Run a simple array traversal 100 times and calculate statistical metrics to understand measurement variability.

Objectives:

  1. Measure execution time using clock_gettime(CLOCK_MONOTONIC)
  2. Calculate mean, standard deviation, and 95% confidence interval
  3. Evaluate result stability using coefficient of variation (CV)

Key Code:

#include <stdint.h>
#include <stddef.h>
#include <time.h>
#include <math.h>

#define ARRAY_SIZE (1024 * 1024)  // 1M elements
#define NUM_RUNS 100

static inline uint64_t get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

volatile int sink;  // Prevent dead code elimination

void traverse_array(int *arr, size_t size) {
    int sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum += arr[i];
    }
    sink = sum;
}

Statistical Calculation:

// Mean
double mean = sum / NUM_RUNS;

// Standard deviation
double std_dev = sqrt(sq_diff_sum / (NUM_RUNS - 1));

// 95% CI (t-value ≈ 1.984 for df=99)
double margin = 1.984 * std_dev / sqrt(NUM_RUNS);

// Coefficient of variation
double cv = (std_dev / mean) * 100.0;

Expected Output:

Array size: 1048576 elements (4 MB)
Number of runs: 100

Results:
  Mean:              0.892 ms
  Std Dev:           0.045 ms
  95% CI:            [0.883, 0.901] ms
  CV (variability):  5.04%

Interpretation:

  • CV < 5%: Stable results
  • CV 5-15%: Some fluctuation, acceptable
  • CV > 15%: High variability, stabilize environment
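For cross-checking the C results, the same statistics can be sketched in Python. This is a minimal illustration rather than the book's reference solution; the helper names summarize and classify are hypothetical, and the default t-value is only valid for 100 samples.

```python
import math
import statistics

def summarize(samples, t_value=1.984):
    """Return mean, sample std dev, 95% CI, and CV (%) for timing samples.

    t_value defaults to the two-sided 95% critical value for df = 99
    (100 samples); look up the right value for other sample counts.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    std_dev = statistics.stdev(samples)        # divides by n - 1
    margin = t_value * std_dev / math.sqrt(n)  # half-width of the interval
    cv = std_dev / mean * 100.0
    return mean, std_dev, (mean - margin, mean + margin), cv

def classify(cv):
    """Map a CV (%) to the interpretation bands listed above."""
    if cv < 5.0:
        return "stable"
    if cv <= 15.0:
        return "acceptable"
    return "high variability"

mean, std_dev, ci, cv = summarize([1.0, 2.0, 3.0, 4.0])
print(f"mean={mean:.3f} std={std_dev:.3f} ci=[{ci[0]:.3f}, {ci[1]:.3f}] cv={cv:.2f}%")
```

Plugging in the expected output above gives a margin of 1.984 × 0.045 / 10 ≈ 0.009 ms, which matches the reported interval.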

Exercise 2.2: Timer Overhead Measurement

Difficulty: Easy | Language: C

Problem: Measure the overhead of clock_gettime by timing 1000 back-to-back pairs of calls.

Objectives:

  1. Understand timer resolution limits
  2. Determine minimum measurable duration
  3. Analyze overhead distribution

Key Code:

#define NUM_SAMPLES 1000

// Measure timer overhead
for (int i = 0; i < NUM_SAMPLES; i++) {
    uint64_t t1 = get_time_ns();
    uint64_t t2 = get_time_ns();
    overhead[i] = t2 - t1;
}

// Sort for percentiles
qsort(overhead, NUM_SAMPLES, sizeof(uint64_t), compare_uint64);
uint64_t p50 = overhead[NUM_SAMPLES / 2];
uint64_t p95 = overhead[(int)(NUM_SAMPLES * 0.95)];

Expected Output:

Statistics (nanoseconds):
  Mean:    25.3 ns
  Std Dev: 8.2 ns
  Min:     18 ns
  Max:     156 ns

Percentiles:
  P50 (median): 23 ns
  P95:          42 ns
  P99:          78 ns

Practical Implication: With a timer overhead of about 25 ns, time operations that last at least 2500 ns (100x the overhead) so that the overhead contributes no more than about 1% error.
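The rule of thumb generalizes to any overhead and error budget. The sketch below is illustrative only; the function names are hypothetical, and the percentile helper simply mirrors the index arithmetic used in the C snippet.

```python
def min_measurable_ns(timer_overhead_ns, max_relative_error):
    """Shortest duration for which timer overhead stays within the error budget."""
    return timer_overhead_ns / max_relative_error

def percentile(sorted_samples, fraction):
    """Index-based percentile lookup on pre-sorted samples."""
    index = min(int(len(sorted_samples) * fraction), len(sorted_samples) - 1)
    return sorted_samples[index]

# A 25 ns overhead with a 1% error budget
print(min_measurable_ns(25, 0.01))
```

If the operation of interest is shorter than this threshold, time a loop of many iterations and divide by the iteration count.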


Exercise 3.3: T-Test for Algorithm Comparison

Difficulty: Medium | Language: C

Problem: Use Welch's t-test to determine if two algorithm implementations have statistically significant performance differences.

Objectives:

  1. Collect benchmark samples for two algorithms
  2. Implement Welch's t-test (handles unequal variances)
  3. Interpret p-value for significance

Key Code:

// Algorithm A: Simple sum
void algo_a(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}

// Algorithm B: Unrolled sum (4x)
void algo_b(int *arr, int n) {
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;
    for (i = 0; i <= n - 4; i += 4) {
        sum0 += arr[i];
        sum1 += arr[i + 1];
        sum2 += arr[i + 2];
        sum3 += arr[i + 3];
    }
    // Handle the remaining 0-3 elements when n is not a multiple of 4
    for (; i < n; i++) {
        sum0 += arr[i];
    }
    sink = sum0 + sum1 + sum2 + sum3;
}

Welch's T-Test:

// Variance of each group's mean (sample variance / sample count)
double v1 = var_a / na;
double v2 = var_b / nb;

// t-statistic
double se = sqrt(v1 + v2);
double t = (mean_a - mean_b) / se;

// Degrees of freedom (Welch-Satterthwaite)
double df = (v1 + v2) * (v1 + v2) /
            (v1 * v1 / (na - 1) + v2 * v2 / (nb - 1));

Interpretation:

  • p < 0.05: Statistically significant difference at the 5% level
  • p >= 0.05: Difference not statistically significant (could be noise, or too few samples to tell)
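A Python version of the same formulas is handy for checking the C implementation against known inputs. This is a minimal sketch with a hypothetical function name; it returns only the t-statistic and the degrees of freedom. The p-value then comes from the t-distribution with df degrees of freedom; if SciPy is installed, scipy.stats.ttest_ind(a, b, equal_var=False) runs the same test and reports it directly.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    v1 = statistics.variance(a) / na  # sample variance / n for group A
    v2 = statistics.variance(b) / nb  # sample variance / n for group B
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (na - 1) + v2 ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.4f}, df = {df:.3f}")
```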

Exercise 4.2: Box Plot Visualization

Difficulty: Easy | Language: Python

Problem: Create box plots to visualize benchmark result distributions and identify outliers.

Key Code:

import matplotlib.pyplot as plt
import numpy as np

def create_boxplot(data_dict, title, ylabel):
    fig, ax = plt.subplots(figsize=(10, 6))

    labels = list(data_dict.keys())
    data = [data_dict[k] for k in labels]

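    bp = ax.boxplot(data, patch_artist=True)
    # Set tick labels separately; the boxplot keyword for labels was renamed in newer Matplotlib releases
    ax.set_xticklabels(labels)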

    # Color boxes
    colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightcoral']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)

    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('boxplot.png', dpi=150)

What Box Plots Show:

  • Box: Q1 to Q3 (50% of data)
  • Line in box: Median
  • Whiskers: Extend to the most extreme data points within 1.5×IQR of the box edges
  • Points outside: Outliers
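The quantities a box plot draws can also be computed directly, which is useful when you want to list the outliers rather than just see them. The sketch below assumes linearly interpolated quartiles, a common convention; the function names are hypothetical.

```python
def quantile(sorted_data, q):
    """Linearly interpolated quantile (q in [0, 1]) of pre-sorted data."""
    pos = q * (len(sorted_data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (pos - lo) * (sorted_data[hi] - sorted_data[lo])

def box_stats(data):
    """Return quartiles, whisker ends, and outliers for one data series."""
    s = sorted(data)
    q1, median, q3 = quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0],    # most extreme points within the fences
        "whisker_high": inside[-1],
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```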

Part II: Benchmark Tools (Chapters 5-9)

Exercise 5.1: CoreMark Testing

Difficulty: Easy | Language: Shell

Problem: Build and run CoreMark with different compiler optimization levels.

Key Code:

#!/bin/bash

git clone https://github.com/eembc/coremark.git
cd coremark

for opt in O0 O1 O2 O3 Ofast; do
    echo "=== Testing -$opt ==="
    make clean
    make PORT_DIR=linux XCFLAGS="-$opt"
    ./coremark.exe 2>&1 | grep -E "CoreMark|Iterations"
done

Expected Results:

Optimization  CoreMark Score  Improvement
─────────────────────────────────────────
-O0           ~5,000          baseline
-O1           ~15,000         3x
-O2           ~22,000         4.4x
-O3           ~24,000         4.8x
-Ofast        ~25,000         5x

Exercise 6.1: STREAM Memory Bandwidth Analysis

Difficulty: Easy | Language: C

Problem: Measure memory bandwidth at different array sizes to observe cache hierarchy effects.

Key Code:

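// Vary array size to hit different cache levels.
// Cache sizes differ between CPUs, so the level comments are typical, not exact.
// Each test touches two arrays, so the working set is 2x the size listed.
size_t sizes[] = {
    8 * 1024,        // 8 KB  - L1
    64 * 1024,       // 64 KB - L2
    512 * 1024,      // 512 KB - L3
    8 * 1024 * 1024, // 8 MB  - L3/DRAM
    64 * 1024 * 1024 // 64 MB - DRAM
};

for (int s = 0; s < 5; s++) {
    size_t n = sizes[s] / sizeof(double);
    double *a = aligned_alloc(64, sizes[s]);
    double *b = aligned_alloc(64, sizes[s]);

    // Initialize both arrays so the timed loop reads defined values
    // and does not pay first-touch page-fault costs
    for (size_t i = 0; i < n; i++) {
        a[i] = 1.0;
        b[i] = 0.0;
    }

    // STREAM Copy: b[i] = a[i]
    // For the smallest sizes, repeat the copy many times and divide;
    // a single pass is too short to time reliably
    uint64_t start = get_time_ns();
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i];
    }
    uint64_t elapsed = get_time_ns() - start;

    // Bytes read plus bytes written; bytes per ns equals GB/s
    double bw = (2.0 * sizes[s]) / elapsed;
    printf("Size: %6zu KB, Bandwidth: %.2f GB/s\n",
           sizes[s] / 1024, bw);

    free(a);
    free(b);
}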
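The bandwidth arithmetic in the loop is easy to get wrong by a factor of two or a factor of 1000, so it is worth checking by hand. A minimal sketch with hypothetical numbers (the function name is illustrative):

```python
def copy_bandwidth_gbs(array_bytes, elapsed_ns):
    """Bandwidth of a copy kernel: each byte is read once and written once."""
    bytes_moved = 2 * array_bytes
    return bytes_moved / elapsed_ns  # bytes per ns equals 10^9 bytes per second

# Hypothetical: copying a 64 MB array in 10 ms
print(f"{copy_bandwidth_gbs(64 * 1024 * 1024, 10_000_000):.2f} GB/s")
```

Note that the result is in decimal gigabytes per second (10^9 bytes), which is the convention STREAM uses.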

Exercise 9.2: CPU Cycle Counter

Difficulty: Medium | Language: C

Problem: Use architecture-specific cycle counters for precise timing.

Key Code (x86):

#include <x86intrin.h>

static inline uint64_t rdtsc(void) {
    return __rdtsc();
}

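// RDTSCP waits for all earlier instructions to execute before reading the
// counter, but it is not a fully serializing instruction: later
// instructions may still begin before the read completes
static inline uint64_t rdtscp(void) {
    unsigned int aux;
    return __rdtscp(&aux);
}

// Usage
uint64_t start = rdtscp();
// ... operation ...
uint64_t end = rdtscp();
// On CPUs with an invariant TSC this counts reference ticks at a constant
// rate, which can differ from core clock cycles under frequency scaling
uint64_t cycles = end - start;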

Key Code (ARM):

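// Note: cntvct_el0 is the generic timer's virtual counter. It ticks at a
// fixed frequency (readable from cntfrq_el0), not at the core clock, so the
// result is timer ticks rather than CPU cycles.
static inline uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}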

Part III: Analysis Theory (Chapters 10-12)

Exercise 10.1: Roofline Model

Difficulty: Medium | Language: Python

Problem: Generate a roofline model plot for your system.

Key Code:

import matplotlib.pyplot as plt
import numpy as np

# System parameters (measure these!)
peak_flops = 100  # GFLOPS
peak_bw = 50      # GB/s

# Calculate ridge point
ridge_point = peak_flops / peak_bw  # FLOP/byte

# Operational intensity range
oi = np.logspace(-2, 2, 100)

# Roofline
memory_bound = peak_bw * oi
compute_bound = np.full_like(oi, peak_flops)
roofline = np.minimum(memory_bound, compute_bound)

plt.figure(figsize=(10, 6))
plt.loglog(oi, roofline, 'b-', linewidth=2, label='Roofline')
plt.axvline(x=ridge_point, color='r', linestyle='--',
            label=f'Ridge Point ({ridge_point:.1f})')

plt.xlabel('Operational Intensity (FLOP/byte)')
plt.ylabel('Performance (GFLOPS)')
plt.title('Roofline Model')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roofline.png', dpi=150)

Exercise 12.3: Branch Prediction Experiment

Difficulty: Medium | Language: C

Problem: Compare sorted vs unsorted array processing to observe branch prediction effects.

Key Code:

#define ARRAY_SIZE (32 * 1024)

// Test with sorted and unsorted data
void process_array(int *arr, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] < threshold) {  // Branch!
            count++;
        }
    }
    sink = count;
}

int main(void) {
    int *arr = malloc(ARRAY_SIZE * sizeof(int));

    // Fill with random data
    for (int i = 0; i < ARRAY_SIZE; i++) {
        arr[i] = rand() % 256;
    }

    // Test unsorted
    uint64_t start = rdtscp();
    for (int iter = 0; iter < 1000; iter++) {
        process_array(arr, ARRAY_SIZE, 128);
    }
    uint64_t unsorted_cycles = rdtscp() - start;

    // Sort the array
    qsort(arr, ARRAY_SIZE, sizeof(int), compare_int);

    // Test sorted
    start = rdtscp();
    for (int iter = 0; iter < 1000; iter++) {
        process_array(arr, ARRAY_SIZE, 128);
    }
    uint64_t sorted_cycles = rdtscp() - start;

    printf("Unsorted: %llu cycles\n", (unsigned long long)unsorted_cycles);
    printf("Sorted:   %llu cycles\n", (unsigned long long)sorted_cycles);
    printf("Speedup:  %.2fx\n",
           (double)unsorted_cycles / sorted_cycles);

    free(arr);
    return 0;
}

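Expected Result: The sorted array is typically 2-5x faster because the branch becomes predictable. At higher optimization levels the compiler may replace the branch with branchless or vectorized code, in which case the gap largely disappears; check the generated assembly if you see no difference.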


Part IV: Data Structures (Chapters 13-15)

Exercise 13.1: Array vs Linked List

Difficulty: Easy | Language: C

Problem: Compare sequential access performance of arrays vs linked lists.

Key Code:

// Array traversal
void sum_array(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}

// Linked list traversal
struct Node {
    int value;
    struct Node *next;
};

void sum_list(struct Node *head) {
    int sum = 0;
    while (head) {
        sum += head->value;
        head = head->next;
    }
    sink = sum;
}

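Expected Result: Array is 5-20x faster due to cache-friendly sequential access, assuming the list nodes are scattered in memory (for example, allocated individually and then shuffled). Nodes allocated back-to-back narrow the gap.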


Exercise 14.1: Hash Table vs Tree

Difficulty: Medium | Language: C++

Problem: Compare lookup performance of hash tables vs balanced trees.

Key Code:

#include <unordered_map>
#include <map>

void benchmark_hash(int n) {
    std::unordered_map<int, int> hash;
    for (int i = 0; i < n; i++) hash[i] = i;

    auto start = now();
    for (int i = 0; i < n; i++) {
        sink = hash[rand() % n];
    }
    auto elapsed = now() - start;
}

void benchmark_tree(int n) {
    std::map<int, int> tree;
    for (int i = 0; i < n; i++) tree[i] = i;

    auto start = now();
    for (int i = 0; i < n; i++) {
        sink = tree[rand() % n];
    }
    auto elapsed = now() - start;
}

Expected Result:

  • Hash: O(1) average, ~50-100ns per lookup
  • Tree: O(log n), ~200-500ns per lookup for n=1M

Exercise 15.1: Sorting Algorithm Comparison

Difficulty: Medium | Language: C++

Problem: Compare quicksort, mergesort, and heapsort performance.

Key Code:

#include <algorithm>
#include <vector>

// Takes data by value so every run sorts a fresh, unsorted copy
void benchmark_sort(std::vector<int> data,
                    void (*sort_fn)(int*, int*)) {
    auto start = now();
    sort_fn(data.data(), data.data() + data.size());
    auto elapsed = now() - start;
}

// Test with different data patterns:
// 1. Random
// 2. Nearly sorted
// 3. Reverse sorted
// 4. Many duplicates
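The four input patterns can be generated once and written to files so that every algorithm sorts identical data. The following Python sketch is one possible generator; the function name, the 1% perturbation rate for "nearly sorted", and the ten distinct values for "many duplicates" are all illustrative assumptions.

```python
import random

def make_patterns(n, seed=42):
    """Build the four test inputs from one seeded random source."""
    rng = random.Random(seed)  # fixed seed for repeatable inputs
    base = [rng.randrange(n * 10) for _ in range(n)]
    nearly = sorted(base)
    # Perturb roughly 1% of positions with random swaps (at least one swap)
    for _ in range(max(1, n // 100)):
        i, j = rng.randrange(n), rng.randrange(n)
        nearly[i], nearly[j] = nearly[j], nearly[i]
    return {
        "random": base,
        "nearly_sorted": nearly,
        "reverse_sorted": sorted(base, reverse=True),
        "many_duplicates": [rng.randrange(10) for _ in range(n)],
    }

patterns = make_patterns(1000)
for name, values in patterns.items():
    print(name, values[:5])
```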

Part V: Parallelization (Chapters 16-18)

Exercise 16.1: SIMD Vector Addition

Difficulty: Advanced | Language: C

Problem: Implement vector addition using SIMD intrinsics.

Key Code (AVX2):

#include <immintrin.h>

void vector_add_scalar(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void vector_add_avx(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

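Expected Speedup: 4-8x for float (AVX2 processes 8 floats per instruction) and 8-16x for int8, when the data fits in cache and the scalar baseline is not auto-vectorized. For arrays much larger than the cache, memory bandwidth usually caps the gain well below these figures.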


Exercise 17.2: False Sharing Detection

Difficulty: Advanced | Language: C

Problem: Demonstrate and fix false sharing in multi-threaded code.

Key Code:

// BAD: False sharing
struct Counter {
    int count;  // 4 bytes, multiple counters in same cache line
};

struct Counter counters[NUM_THREADS];

// GOOD: Padded to avoid false sharing
struct PaddedCounter {
    int count;
    char padding[60];  // Pad to 64-byte cache line
};

struct PaddedCounter padded_counters[NUM_THREADS];

void *thread_func(void *arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        counters[id].count++;  // False sharing!
    }
    return NULL;
}

Expected Result: Padded version 5-10x faster with multiple threads.


Part VI: Embedded Constraints (Chapters 19-22)

Exercise 19.1: Binary Footprint Analysis

Difficulty: Medium | Language: C/Shell

Problem: Analyze binary size and identify largest contributors.

Key Code:

#!/bin/bash

# Compile with different options
gcc -Os -o prog_Os prog.c
gcc -O2 -o prog_O2 prog.c
gcc -O3 -o prog_O3 prog.c

# Size comparison
size prog_Os prog_O2 prog_O3

# Symbol size analysis
nm --size-sort -S prog_Os | tail -20

# Section breakdown
objdump -h prog_Os | grep -E "\.text|\.data|\.bss|\.rodata"

Exercise 21.1: Stack Usage Analysis

Difficulty: Medium | Language: C

Problem: Measure and analyze function stack usage.

Key Code:

# Compile with stack usage info
gcc -fstack-usage -O2 -c program.c

# Output: program.su file
# Format: file:line:col:function_name  size  qualifier

// GCC inline assembly (x86-64 only): read the current stack pointer at runtime
void check_stack_usage(void) {
    void *sp;
    asm volatile("mov %%rsp, %0" : "=r"(sp));
    printf("Current SP: %p\n", sp);
}
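Once the .su file exists, a short script can rank functions by stack usage. The sketch below assumes the tab-separated layout described above and plain C function names (no colons in the name); verify both against your own compiler's output. The function name parse_su and the sample text are hypothetical.

```python
def parse_su(text):
    """Parse GCC .su lines into (function, bytes, qualifier), largest first."""
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        location, size, qualifier = fields[0], fields[1], fields[2]
        function = location.split(":")[-1]  # location is file:line:col:function
        entries.append((function, int(size), qualifier))
    return sorted(entries, key=lambda e: e[1], reverse=True)

# Hypothetical .su content for illustration
sample = "program.c:10:5:main\t48\tstatic\nprogram.c:3:6:helper\t16\tstatic\n"
for function, size, qualifier in parse_su(sample):
    print(f"{function:10s} {size:5d} bytes  {qualifier}")
```

Entries marked dynamic deserve a closer look, since their reported size covers only the fixed part of the frame.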

Part VII: AI/HPC (Chapters 23-29)

Exercise 26.1: CUDA Vector Addition

Difficulty: Advanced | Language: CUDA

Problem: Implement parallel vector addition on GPU.

Key Code:

__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main(void) {
    int n = 1 << 20;  // 1M elements
    size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Exercise 27.1: LLM Inference Benchmark

Difficulty: Advanced | Language: Python

Problem: Measure LLM inference throughput and latency.

Key Code:

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark_inference(model, tokenizer, prompt, num_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm up
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=num_tokens)

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    tokens_generated = outputs.shape[1] - inputs['input_ids'].shape[1]
    throughput = tokens_generated / elapsed

    print(f"Tokens: {tokens_generated}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {throughput:.1f} tokens/s")


Additional Exercises (No Solutions Provided)

The following exercises are designed for independent exploration. They are intentionally open-ended to encourage deeper investigation.

Part I: Foundations

Exercise 1.2: Environment Variability Study

Run the same benchmark on your system with different background conditions:

  • During a virus scan or system backup
  • With browser tabs open vs closed
  • On battery vs plugged in (laptop)
  • After fresh boot vs after hours of use

Document how much measurement variability each condition introduces. What's your system's "noise floor"?

Exercise 2.3: Timer Comparison

Compare the overhead and resolution of different timing methods on your system:

  • clock_gettime(CLOCK_MONOTONIC)
  • clock_gettime(CLOCK_MONOTONIC_RAW)
  • gettimeofday()
  • rdtsc/rdtscp (x86) or cntvct_el0 (ARM)

Which one is best for sub-microsecond measurements? Which is most portable?

Exercise 4.3: Outlier Investigation

Take a benchmark that produces outliers. Instead of removing them, investigate:

  • When do outliers occur? (First run? Every N runs?)
  • What system events correlate with outliers?
  • Can you predict when outliers will happen?

Part II: Benchmark Tools

Exercise 7.1: Geekbench Deep Dive

Run Geekbench 6 on your system and analyze:

  • Which sub-benchmarks score highest/lowest relative to the reference?
  • How does your single-core to multi-core scaling compare to the reference?
  • Run it 5 times - what's the CV of each sub-benchmark?

Exercise 8.1: SPEC CPU Proxy

Without access to SPEC CPU, create a "proxy benchmark suite" using open-source alternatives:

  • Find open-source equivalents for 5 SPEC workloads
  • Measure correlation with published SPEC scores (if available)
  • Document limitations of your proxy approach

Exercise 9.3: Custom Microbenchmark

Design a microbenchmark to measure a specific CPU feature:

  • L1 cache latency (not bandwidth)
  • TLB miss penalty
  • Memory prefetcher effectiveness

The benchmark must isolate the feature from other effects.


Part III: Analysis Theory

Exercise 10.2: Your Application's Roofline Position

Take a computationally intensive application you work with:

  • Calculate its theoretical operational intensity
  • Measure actual achieved FLOPS
  • Plot it on a roofline - is it memory or compute bound?
  • What optimization would help most?
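As a starting point for the plot, the roofline position of a single kernel follows from its FLOP count, its memory traffic, and the two machine peaks. This is a sketch with hypothetical numbers; the function name is illustrative, and the peaks should be replaced with values measured on your own system.

```python
def roofline_position(flops, bytes_moved, peak_gflops, peak_bw_gbs):
    """Return (operational intensity, attainable GFLOPS, limiting resource)."""
    oi = flops / bytes_moved  # FLOP per byte
    attainable = min(peak_gflops, peak_bw_gbs * oi)
    ridge = peak_gflops / peak_bw_gbs
    bound = "memory" if oi < ridge else "compute"
    return oi, attainable, bound

# Hypothetical kernel: 2 GFLOP of work over 4 GB of traffic,
# on a machine with 100 GFLOPS peak and 50 GB/s bandwidth
print(roofline_position(2e9, 4e9, 100, 50))
```

Comparing the attainable figure with the measured FLOPS shows how much headroom remains before the roof.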

Exercise 11.1: Amdahl's Law Reality Check

Profile a real application and identify:

  • The serial fraction (code that can't be parallelized)
  • Calculate theoretical speedup with 4, 8, 16 cores
  • Measure actual speedup
  • Explain the gap between theory and practice
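The theoretical side of this exercise is a one-line formula: speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N the core count. A minimal sketch, using a hypothetical serial fraction of 10%:

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup for a given serial fraction and core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (4, 8, 16):
    print(f"{cores:2d} cores: {amdahl_speedup(0.10, cores):.2f}x")
```

Measured speedup falling below these bounds is expected; the exercise is to account for the difference.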

Exercise 12.4: Prefetch Pattern Discovery

Experiment with different memory access patterns:

  • Sequential forward
  • Sequential backward
  • Stride-2, Stride-4, Stride-8, etc.
  • Random

At what stride does the hardware prefetcher stop helping? How does this vary by CPU generation?


Part IV: Data Structures

Exercise 13.2: Cache-Oblivious vs Cache-Aware

Implement matrix transpose three ways:

  • Naive (row-by-row)
  • Cache-oblivious (recursive)
  • Cache-aware (blocked, tuned for your L1)

Compare performance at different matrix sizes. At what size does blocking matter?

Exercise 14.2: Real-World Hash Table Benchmarking

Compare hash table implementations with realistic workloads:

  • String keys of varying length (5-100 chars)
  • Mixed read/write (90/10, 50/50, 10/90)
  • With and without deletions
  • Measure memory overhead, not just speed

Exercise 15.2: Sorting Stability Under Pressure

Benchmark sorting algorithms with adversarial inputs:

  • Quicksort killer sequences
  • Data that triggers worst-case for each algorithm
  • Measure not just average case, but worst case in practice

Part V: Parallelization

Exercise 16.2: Auto-vectorization Investigation

Take a loop that should vectorize:

  • Compile with -O3 -march=native and check if it vectorized
  • If not, identify what prevented vectorization
  • Fix the issue without using intrinsics
  • Measure the speedup

Exercise 17.3: Lock-Free vs Locking

Implement a concurrent counter three ways:

  • Mutex-protected
  • Spinlock
  • Atomic (lock-free)

Measure throughput at different thread counts (1, 2, 4, 8, 16, 32). At what point does each approach break down?

Exercise 18.1: OpenMP Scheduling Experiment

Take a parallel loop with varying work per iteration. Compare:

  • static scheduling
  • dynamic scheduling (chunk=1, 10, 100)
  • guided scheduling

How does optimal scheduling depend on work distribution and iteration count?


Part VI: Embedded Constraints

Exercise 20.1: Power State Characterization

If you have access to an embedded board with power measurement:

  • Measure power in different sleep states
  • Measure wake-up latency from each state
  • Calculate the break-even time for each state
  • Design a sleep strategy for a periodic workload

Exercise 21.2: Stack High-Water Mark

Implement a stack painting technique:

  • Fill stack with a known pattern at startup
  • Run your application under various loads
  • Check how much of the pattern was overwritten
  • Report peak stack usage

Exercise 22.1: Memory-Constrained Algorithm Design

Take an algorithm that requires O(n) auxiliary space. Modify it to run in O(1) space:

  • What's the time penalty?
  • Is there a time-space trade-off curve?
  • At what memory budget does the in-place version become faster?

Part VII: AI/HPC

Exercise 23.1: Metrics That Matter

Profile an AI inference workload and report:

  • FLOPS achieved vs theoretical peak
  • Memory bandwidth achieved vs theoretical peak
  • Which is the bottleneck?
  • What metric best predicts user-perceived performance?

Exercise 24.1: MLPerf Result Analysis

Download MLPerf results for a specific benchmark:

  • Plot performance vs power across submissions
  • Identify the Pareto frontier
  • What hardware characteristics predict placement on the frontier?
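Identifying the frontier is mechanical once the submissions are in a list. The sketch below treats lower power and higher performance as better; the function name and the data points are hypothetical and not real MLPerf results.

```python
def pareto_frontier(points):
    """points: list of (name, power_w, performance).

    Returns the entries not dominated by any other entry, where lower
    power and higher performance are both preferred.
    """
    frontier = []
    best_perf = float("-inf")
    # Sort by power ascending; break ties by performance descending
    for name, power, perf in sorted(points, key=lambda p: (p[1], -p[2])):
        if perf > best_perf:
            frontier.append((name, power, perf))
            best_perf = perf
    return frontier

# Hypothetical submissions: (name, power in watts, throughput)
submissions = [("A", 100, 50), ("B", 200, 120), ("C", 150, 40),
               ("D", 300, 110), ("E", 250, 200)]
print(pareto_frontier(submissions))
```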

Exercise 25.1: HPCG vs HPL Correlation

Using published data from the TOP500 and HPCG rankings:

  • Plot HPCG efficiency vs HPL efficiency for the same systems
  • What's the correlation?
  • Can you identify systems that are outliers in one benchmark but not the other?

Exercise 28.1: ML Compiler Comparison

Take a model (e.g., ResNet-50) and compile it with:

  • PyTorch (eager mode)
  • TorchScript
  • ONNX Runtime
  • TensorRT (if NVIDIA GPU available)
  • TVM with auto-tuning

Report latency, throughput, and compilation time. Which is best for your use case?

Exercise 29.1: Quantization Impact Study

Take a model and quantize to INT8:

  • Measure accuracy drop on your validation set
  • Measure speedup on your target hardware
  • Try quantization-aware training - does it help?
  • Find the accuracy-speed Pareto frontier

Challenge Problems

These are open-ended research-level problems:

Challenge 1: The Perfect Benchmark

Design a benchmark that:

  • Runs in under 10 seconds
  • Has <1% CV on commodity hardware
  • Correlates with "real application" performance (define this)
  • Is resistant to gaming/optimization

Document your design decisions and trade-offs.

Challenge 2: Automatic Bottleneck Detection

Build a tool that:

  • Takes any program as input
  • Profiles it automatically
  • Identifies the top 3 performance bottlenecks
  • Suggests specific optimizations

Test it on 5 different programs. How often is it right?

Challenge 3: Performance Regression Detection

Design a CI/CD performance testing system that:

  • Detects 2% regressions reliably
  • Minimizes false positives
  • Runs in under 5 minutes
  • Works on noisy cloud VMs

What's the minimum number of runs needed? What statistical tests work best?
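One way to get a first estimate of the run count is the standard normal-approximation sample-size formula for comparing two means: n per group ≈ 2 × ((z_α/2 + z_β) × CV / δ)², where δ is the regression to detect. The sketch below is a rough planning aid under that approximation, not a substitute for validating on real data; the function name and defaults are assumptions.

```python
import math
from statistics import NormalDist

def runs_per_group(cv_percent, delta_percent, alpha=0.05, power=0.8):
    """Approximate runs needed per version to detect a relative change.

    cv_percent:    run-to-run coefficient of variation, in percent
    delta_percent: smallest regression to detect, in percent
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * cv_percent / delta_percent) ** 2
    return math.ceil(n)

for cv in (1, 3, 10):
    print(f"CV {cv:2d}%: {runs_per_group(cv, 2)} runs per version")
```

The estimate grows with the square of the noise level, which is why reducing variance on cloud VMs usually pays off more than adding runs.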


Summary

Exercise Difficulty Guide

Level       Exercises                            Prerequisites
───────────────────────────────────────────────────────────────────
Easy        1.1, 2.2, 4.2, 5.1, 6.1, 13.1        Basic C, Python
Medium      3.3, 7.1-9.2, 10.1-15.1, 18-22, 29   Stats, Linux tools
Advanced    16.1, 17.2, 26.1-28.1, 30-34         SIMD, CUDA, ML
Open-ended  Additional exercises                 Varies

Key Takeaways

  1. Always measure: Never assume performance characteristics
  2. Statistics matter: Single measurements are meaningless
  3. Understand variance: CV tells you if results are stable
  4. Hardware awareness: Cache, branch prediction, memory matter
  5. Reproducibility: Document environment, automate tests

Appendix F: Environment Setup Guide


"A good setup is half the battle." — Engineering Proverb

Platform Overview

The examples and exercises in this book support multiple platforms.

Supported Platforms

Platform          Architecture  Notes
─────────────────────────────────────────────────────
Linux x86-64      x86-64        Most complete tool support
macOS             aarch64       Apple Silicon (M1/M2/M3)
Windows           x86-64        Recommend WSL2
RISC-V hardware   riscv64       SiFive, StarFive, Milk-V
RISC-V emulator   riscv64       QEMU, Spike
ARM dev boards    aarch64       Raspberry Pi, Jetson

Linux x86-64 Setup

Basic Tools

# Update system
sudo apt update && sudo apt upgrade -y

# Compilation tools
sudo apt install -y build-essential cmake ninja-build

# Performance tools (on Ubuntu, perf ships in the linux-tools packages;
# on Debian the package is linux-perf)
sudo apt install -y linux-tools-common linux-tools-generic

# Other tools
sudo apt install -y git wget curl htop

Performance Analysis Tools

# perf
sudo apt install -y linux-tools-$(uname -r)

# Valgrind
sudo apt install -y valgrind

# FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git

# sysstat (iostat, mpstat)
sudo apt install -y sysstat

Benchmark Tools

# fio
sudo apt install -y fio

# iperf3
sudo apt install -y iperf3

# stress-ng
sudo apt install -y stress-ng

# sysbench
sudo apt install -y sysbench

Permission Settings

# Allow non-root to use perf
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Permanent setting
echo 'kernel.perf_event_paranoid = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Linux Baseline (Benchmarking)

# 1. Stop unnecessary services
sudo systemctl stop cron snapd unattended-upgrades

# 2. Set CPU frequency (disable scaling)
sudo cpupower frequency-set -g performance

# 3. Disable Turbo Boost
# Intel:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# 4. Disable ASLR
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# 5. Run with CPU pinning
sudo nice -n -20 taskset -c 2 ./benchmark

Memory Configuration

# Enable huge pages
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

# NUMA binding
numactl --membind=0 --cpunodebind=0 ./benchmark

Restore System Settings

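# Use "powersave" instead of "ondemand" when the intel_pstate driver is active
sudo cpupower frequency-set -g ondemand
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo   # Intel
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost           # AMD
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space
sudo systemctl start cron snapd unattended-upgrades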

macOS Setup

Homebrew Installation

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Basic tools
brew install cmake ninja git wget

# Benchmark tools
brew install fio iperf3 stress-ng

Xcode Command Line Tools

xcode-select --install

Performance Analysis

# Instruments (installed with Xcode)
# Use GUI or command line

# Using sample
sudo sample <pid> 10 -file output.txt

# Using dtrace (requires disabling SIP)
sudo dtrace -n 'profile-997 { @[ustack()] = count(); }'

Notes

macOS limitations:

1. No perf
   - Use Instruments instead
   - Or use dtrace

2. SIP (System Integrity Protection)
   - Some tools require disabling
   - Not recommended for production

3. Apple Silicon
   - Some tools not yet supported
   - Use Rosetta 2 for x86 tools

Windows Setup

WSL2 Installation

# Run PowerShell as Administrator
wsl --install

# Install Ubuntu
wsl --install -d Ubuntu-22.04

# Enter WSL
wsl

Inside WSL2

# Same as Linux
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git

# Note: Some kernel features are limited
# perf requires special configuration

Native Windows Tools

Windows Performance Tools:

1. Windows Performance Toolkit
   - xperf
   - Windows Performance Analyzer

2. Visual Studio Profiler
   - CPU usage
   - Memory analysis

3. Intel VTune
   - Windows support
   - Detailed CPU analysis

Cross-Platform Tool Mapping

Linux → Windows

Linux Tool/Command  Windows Equivalent                Description
─────────────────────────────────────────────────────────────────────────────
taskset             start /affinity                   CPU affinity
nice                Process priority in Task Manager  Priority setting
cpupower            Power Options / ThrottleStop      CPU frequency control
perf                Windows Performance Analyzer      Profiling
top / htop          Task Manager / Process Explorer   Process monitoring
clock_gettime       QueryPerformanceCounter           High-precision timing

Linux → macOS

Linux Tool/Command  macOS Equivalent                    Description
───────────────────────────────────────────────────────────────────────────────
taskset             Not directly supported              CPU affinity
cpupower            Not supported (macOS auto-manages)  CPU frequency control
perf                Instruments / sample                Profiling
/proc/cpuinfo       sysctl -a / system_profiler         System info
clock_gettime       mach_absolute_time                  High-precision timing

RISC-V Environment Setup

QEMU Emulator

# Install QEMU (on Ubuntu 22.04, qemu-system-riscv64 is provided by qemu-system-misc)
sudo apt install -y qemu-system-misc qemu-user

# Test
qemu-system-riscv64 --version
qemu-riscv64 --version

Toolchain

# Pre-built toolchain
sudo apt install -y gcc-riscv64-linux-gnu

# Or build from source
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git
cd riscv-gnu-toolchain
./configure --prefix=/opt/riscv
make linux -j$(nproc)

Spike Simulator

# Install dependencies
sudo apt install -y device-tree-compiler

# Build Spike
git clone https://github.com/riscv-software-src/riscv-isa-sim.git
cd riscv-isa-sim
mkdir build && cd build
../configure --prefix=/opt/riscv
make -j$(nproc)
sudo make install

Running Examples

# Compile program
riscv64-linux-gnu-gcc -static -o hello hello.c

# Run on QEMU
qemu-riscv64 ./hello

# Run on Spike (requires pk)
spike pk ./hello

ARM Environment Setup

Cross Compilation

# Install toolchain
sudo apt install -y gcc-aarch64-linux-gnu
sudo apt install -y gcc-arm-none-eabi  # Bare metal

# Install QEMU
sudo apt install -y qemu-system-arm qemu-user

Raspberry Pi Setup

# On Raspberry Pi
sudo apt update && sudo apt upgrade -y

# Performance tools (package names vary by distribution)
sudo apt install -y linux-tools-generic   # Ubuntu
sudo apt install -y linux-perf            # Raspberry Pi OS / Debian

# Note: Some tools may need to be built from source

Docker Environment

Using Docker for Isolated Environment

# Dockerfile
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    linux-tools-generic \
    fio \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

Running

# Build image
docker build -t benchmark-env .

# Run (requires privileged mode for perf)
docker run --privileged -it benchmark-env

Environment Verification

Verification Script

#!/bin/bash
# verify_environment.sh

echo "=== Environment Verification ==="

# Compiler
echo -n "GCC: "
gcc --version | head -1

# perf
echo -n "perf: "
perf --version 2>/dev/null || echo "Not available"

# fio
echo -n "fio: "
fio --version 2>/dev/null || echo "Not available"

# Python
echo -n "Python: "
python3 --version

# Kernel version
echo -n "Kernel: "
uname -r

# CPU info
echo "CPU:"
lscpu | grep "Model name"
lscpu | grep "CPU(s):"

echo "=== Verification Complete ==="

Common Issues

perf Permission Issues

# Problem: perf requires root permission
# Solution:
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Or use sudo
sudo perf stat ./benchmark

Frequency Instability

# Problem: CPU frequency changes affect results
# Solution: Fix frequency
sudo cpupower frequency-set -g performance

# Or disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

Insufficient Memory

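# Problem: Large benchmark runs out of memory
# Solution: Add swap (this avoids out-of-memory kills, but any run that
# actually swaps is measuring disk I/O; reduce the problem size if timing matters)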
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Summary

Environment setup key points:

Platform Selection

  • Linux x86-64: Most complete support
  • macOS: Need alternative tools
  • Windows: Recommend WSL2
  • RISC-V/ARM: Use emulator or physical hardware

Required Tools

  • Compiler (GCC/Clang)
  • Performance analysis (perf/Instruments)
  • Benchmark tools (fio, iperf3)

Environment Preparation

  • Fix CPU frequency
  • Set appropriate permissions
  • Verify tool availability

Isolation

  • Docker containers
  • Virtual machines
  • Dedicated benchmark machine

Cross-Platform Timer

// cross_platform_timer.h
#ifndef CROSS_PLATFORM_TIMER_H
#define CROSS_PLATFORM_TIMER_H

#include <stdint.h>

#if defined(_WIN32)
    #include <windows.h>
#elif defined(__APPLE__)
    #include <mach/mach_time.h>
#else
    #include <time.h>
#endif

static inline uint64_t get_time_ns(void) {
#if defined(_WIN32)
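    static LARGE_INTEGER freq = {0};
    if (freq.QuadPart == 0) QueryPerformanceFrequency(&freq);
    LARGE_INTEGER counter;
    QueryPerformanceCounter(&counter);
    // Split into whole seconds and remainder so the multiplication
    // cannot overflow 64 bits on long-running systems
    uint64_t ticks = (uint64_t)counter.QuadPart;
    uint64_t f = (uint64_t)freq.QuadPart;
    return (ticks / f) * 1000000000ULL + (ticks % f) * 1000000000ULL / f;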
#elif defined(__APPLE__)
    static mach_timebase_info_data_t timebase = {0};
    if (timebase.denom == 0) mach_timebase_info(&timebase);
    return mach_absolute_time() * timebase.numer / timebase.denom;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
#endif
}

#endif // CROSS_PLATFORM_TIMER_H

Benchmark Environment Checklist

Required

  • Record hardware specs (CPU, RAM, storage)
  • Record OS and kernel version
  • Record compiler version and flags
  • Close unnecessary background programs
  • Ensure sufficient available memory
  • Ensure power supply (laptop plugged in)
  • Fix CPU frequency
  • Disable Turbo Boost
  • Disable ASLR
  • Use CPU isolation
  • Set real-time priority
  • Disable antivirus real-time scanning
  • Disable Windows Update / macOS auto-updates
  • Disable Spotlight indexing (macOS)
  • Use "High Performance" power plan (Windows)
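The first three items (hardware, OS and kernel, compiler version) can be captured automatically and stored next to the results. The following is a minimal sketch using only the Python standard library; the function name `record_environment`, the default compiler command `gcc`, and the choice of fields are illustrative assumptions, and compiler flags still have to be recorded from your build system.

```python
import json
import os
import platform
import shutil
import subprocess


def record_environment(compiler="gcc"):
    """Collect basic machine and toolchain facts to store next to results.

    `compiler` is whichever compiler command you build with (assumed name).
    """
    env = {
        "machine": platform.machine(),        # e.g. x86_64, arm64
        "system": platform.system(),          # e.g. Linux, Darwin, Windows
        "kernel": platform.release(),         # kernel / OS release string
        "python": platform.python_version(),
        "logical_cpus": os.cpu_count(),
        "compiler_version": None,
    }
    # Ask the compiler for its version only if it is actually installed.
    if shutil.which(compiler):
        out = subprocess.run([compiler, "--version"],
                             capture_output=True, text=True, check=False)
        if out.stdout:
            env["compiler_version"] = out.stdout.splitlines()[0]
    return env


if __name__ == "__main__":
    print(json.dumps(record_environment(), indent=2))
```

Saving this JSON alongside every result file makes it possible to explain later why two runs differ.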

At Runtime

  • Run sufficient warm-up iterations
  • Run enough iterations for statistics
  • Record all raw data
  • Monitor system state (CPU temp, frequency)
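The runtime items above can be sketched as a small harness: warm up, time many iterations, and keep every raw sample. This is an illustrative sketch rather than a replacement for a real benchmark framework; `run_benchmark`, `summarize`, and the default iteration counts are assumptions chosen for the example.

```python
import statistics
import time


def run_benchmark(fn, warmup=5, iterations=30):
    """Time `fn` after warm-up and return every raw sample in nanoseconds."""
    for _ in range(warmup):              # warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Summary statistics; the raw samples should be stored as well."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    raw = run_benchmark(lambda: sum(range(100_000)))
    print(summarize(raw))
```

Returning the raw list instead of a single number keeps later analysis (outlier inspection, confidence intervals) possible.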

Appendix G: Further Reading


"Standing on the shoulders of giants."

This appendix collects books, papers, and online resources that shaped how this book thinks about performance engineering and benchmarking. Treat it as a map: dip in when a topic from the main chapters sparks your curiosity.

Editor's note: If you're in the middle of a real performance incident, start with Systems Performance, the Roofline paper, and Drepper's memory article. Come back to the rest when things are calm.

Reading Guide (Inside This Book)

Reading Paths by Role

Different readers can take different paths through the main chapters. These are suggested starting points rather than strict sequences.

| Reader type | Goal | Suggested chapters (main text) |
| --- | --- | --- |
| System / embedded engineer | Understand system bottlenecks | Ch 1–4, 5–8, 9, 16–18, 19–22, 30, 33–35 |
| ML / AI engineer | Focus on AI/ML and LLM performance | Ch 1–4, 5, 8, 19, 20, 23–27, 30, 32–35 |
| HPC / perf researcher | Connect theory, hardware, and models | Ch 1–4, 5–7, 10–12, 16–18, 23–27, 30–32, 33–35 |

Within each path, you can always jump to appendices for hands-on exercises and environment setup when you are ready to run real benchmarks.

Topic Map (Concept → Chapters)

Use this as a quick index when you want to revisit a concept from the main text.

  • Benchmarking methodology and statistics: Ch 1–4, 10
  • Profiling tools and observability: Ch 5–8, 30–32
  • Cache, memory, and locality: Ch 2, 6, 12–15, 18, Appendix C, Appendix E
  • Data structures and algorithms in practice: Ch 13–15, 30, 31
  • Parallelism and multi-core scaling: Ch 16–18, 23, 30–32
  • Embedded and footprint constraints: Ch 9, 19–22, Appendix B, Appendix E
  • AI/ML and LLM performance: Ch 20, 23–27, 29, 32
  • End-to-end practice (how to benchmark / optimize / ship): Ch 33–35, Appendix A

When the structure evolves in future versions, this topic map is the single place that should be updated.

Books

Systems Background

Computer Systems: A Programmer's Perspective (3rd Edition) - Randal E. Bryant and David R. O'Hallaron, Pearson, 2015. A comprehensive introduction to how modern computer systems work, useful background for understanding performance bottlenecks across hardware and software.

Performance Engineering

Systems Performance: Enterprise and the Cloud (2nd Edition) - Brendan Gregg, Addison-Wesley, 2020. A broad, practical reference for performance methodology, Linux observability tools, and real production case studies.

Key chapters:

  • Chapter 2: Methodologies
  • Chapter 6: CPUs
  • Chapter 7: Memory
  • Chapter 13: perf

BPF Performance Tools - Brendan Gregg, Addison-Wesley, 2019. A modern guide to Linux observability with eBPF, useful once basic tools like perf feel natural.

Key chapters:

  • Chapter 4: BCC
  • Chapter 5: bpftrace
  • Chapters 6-15: Subsystem analysis

The Art of Writing Efficient Programs - Fedor G. Pikus, Packt, 2021. Focuses on high-performance C++ and shows how algorithms interact with modern CPUs and memory systems.

Key chapters:

  • Chapter 2: Performance Measurements
  • Chapter 3: CPU Architecture
  • Chapter 4: Memory Architecture
  • Chapter 9: High-Performance C++

Computer Architecture

Computer Architecture: A Quantitative Approach (6th Edition) - John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. The classic reference for processors, memory hierarchies, and quantitative evaluation.

Key chapters:

  • Chapter 1: Fundamentals
  • Chapter 2: Memory Hierarchy
  • Appendix A: Instruction Set Principles

Modern Processor Design - John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013. A deeper treatment of superscalar and out-of-order processors that explains many microarchitectural effects seen in benchmarks.


Benchmarking

Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software - Connie U. Smith and Lloyd G. Williams, Addison-Wesley, 2001. A foundational text on software performance engineering and workload design.

Every Computer Performance Book - Bob Wescott, 2013. A short, very practical book full of rules of thumb for real-world performance work.

Papers

Benchmarking Methodology

How Not to Measure Computer System Performance - David J. Lilja, IEEE Computer, 2005. A concise overview of common benchmarking mistakes.

Producing Wrong Data Without Doing Anything Obviously Wrong! - Todd Mytkowicz et al., ASPLOS 2009. Shows how environment size, link order, and other details can silently corrupt results.

Key findings:

  • UNIX environment size affects performance
  • Link order matters
  • Measurement bias is pervasive

Rigorous Benchmarking in Reasonable Time - Tomas Kalibera and Richard Jones, ISMM 2013. Explains how to design statistically sound experiments without burning weeks of CPU time.

Stabilizer: Statistically Sound Performance Evaluation - Charlie Curtsinger and Emery D. Berger, ASPLOS 2013. Uses randomization to make performance measurements more robust and statistically sound.

Roofline Model

Roofline: An Insightful Visual Performance Model for Multicore Architectures - Samuel Williams et al., Communications of the ACM, 2009. Introduces the Roofline model used throughout this book.


Cache-aware Roofline model: Upgrading the loft - Aleksandar Ilic, Frederico Pratas, and Leonel Sousa, IEEE Computer Architecture Letters, 2014. Extends Roofline to account for multiple cache levels.


AI/ML Benchmarks

MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance - Peter Mattson et al., IEEE Micro, 2020. Describes the design and goals of the MLPerf benchmark suite.

Measuring the Algorithmic Efficiency of Neural Networks - Danny Hernandez and Tom Brown, arXiv 2020. Studies trends in the algorithmic efficiency of neural networks over time.

Online Resources

Optimization Manuals

Agner Fog's Optimization Resources - https://www.agner.org/optimize/. A comprehensive collection of optimization manuals, instruction tables, and microarchitecture notes for x86/x64.

Intel 64 and IA-32 Architectures Optimization Reference Manual - https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html. Intel's official optimization guide for their processors.

ARM Performance Analysis Guides - https://developer.arm.com/documentation/. Official documentation and tuning guides for ARM CPUs.

Memory & Cache

What Every Programmer Should Know About Memory - Ulrich Drepper, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf. A long but rewarding deep dive into modern memory hierarchies.

Gallery of Processor Cache Effects - Igor Ostrovsky, http://igoro.com/archive/gallery-of-processor-cache-effects/. An interactive tour of cache behavior.

Benchmarking Tools

SPEC CPU 2017 - https://www.spec.org/cpu2017/. The industry-standard CPU benchmark suite used in academia and industry.

Phoronix Test Suite - https://www.phoronix-test-suite.com/. A large collection of open-source benchmarks for Linux and other platforms.

Google Benchmark - https://github.com/google/benchmark. A C++ microbenchmarking framework that pairs well with the microbenchmark patterns in this book.

Courses

MIT 6.172: Performance Engineering of Software Systems - https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/

An MIT course on performance engineering. Covers profiling, cache optimization, parallelism, and systematic performance methodology.


Berkeley CS267: Applications of Parallel Computers - https://sites.google.com/lbl.gov/cs267-spr2024

An advanced course on parallel computing and high-performance computing (HPC).


CMU 15-418/618: Parallel Computer Architecture and Programming - http://www.cs.cmu.edu/~418/

Another classic course on parallel programming and computer architecture.


Blogs

Brendan Gregg's Blog - https://www.brendangregg.com/

Deep-dive articles on performance analysis and observability. Especially recommended:

  • "Linux Performance" (overview)
  • "Flame Graphs"
  • "CPU Flame Graphs"

Mechanical Sympathy - https://mechanical-sympathy.blogspot.com/

Discussions of hardware-aware programming and the interaction between code and modern CPUs.


Daniel Lemire's Blog - https://lemire.me/blog/

Regular posts on data-oriented design, SIMD optimization, and fast software techniques.


Travis Downs' Blog - https://travisdowns.github.io/

Low-level CPU performance analysis, microbenchmarks, and deep dives into instruction behavior.


Tools

Profiling

| Tool | Platform | Description |
| --- | --- | --- |
| perf | Linux | Built-in Linux profiler |
| VTune | x86 | Intel's advanced profiler |
| Instruments | macOS | Apple's profiling suite |
| Tracy | Cross | Real-time profiler popular in game development |

Benchmarking

| Tool | Language | Description |
| --- | --- | --- |
| Google Benchmark | C++ | Microbenchmark library |
| Criterion | Rust | Rust benchmark library |
| pytest-benchmark | Python | Python benchmark plugin |
| JMH | Java | Java microbenchmark harness |

Visualization

| Tool | Description |
| --- | --- |
| FlameGraph | Stack trace and sample visualization |
| Perfetto | Chrome trace-style viewer for traces |
| Hotspot | GUI for visualizing perf data |

Suggested Reading Paths

Different readers will care about different parts of this appendix. Here are a few short routes.

| Reader type | Core book | Key paper / resource | Course |
| --- | --- | --- | --- |
| System / embedded engineer | Systems Performance | Drepper, "What Every Programmer Should Know About Memory" | MIT 6.172 |
| ML / AI engineer | Systems Performance | MLPerf papers; "Measuring the Algorithmic Efficiency of Neural Networks" | CS267 (selected lectures) |
| HPC / performance researcher | Computer Architecture: A Quantitative Approach | Roofline and Cache-Aware Roofline papers | CS267 or 15-418/618 |

System / Embedded Engineers

  • Start with Systems Performance for methodology, tools, and mental models.
  • Skim Computer Architecture: A Quantitative Approach (CAQA) Ch. 1-2 and the Roofline paper when you need hardware intuition.
  • Keep Drepper's memory paper and Agner Fog's manuals nearby for tricky cache/latency behavior.

ML / AI Engineers

  • Read the MLPerf paper and the algorithmic efficiency paper alongside this book's AI/ML chapters.
  • Use Systems Performance for general methodology and system-level bottlenecks.
  • Pair this with CS267 lectures focused on dense linear algebra and GPU performance.

HPC / Research-Oriented Readers

  • Start from CAQA and Modern Processor Design for architecture depth.
  • Study the Roofline and Cache-Aware Roofline papers, then apply them to your own kernels.
  • Use CS267 or 15-418/618 as a structured path through parallel architectures and performance case studies.

Most importantly, keep connecting what you read back to real measurements on systems you control. Reading without measurement becomes trivia; measurement without theory becomes blind trial-and-error.

Appendix H: Performance Models Deep Dive


"In theory, theory and practice are the same. In practice, they are not." — Yogi Berra

This appendix provides detailed mathematical foundations, proofs, and advanced applications for the performance models introduced in Chapter 10.

Little's Law: Mathematical Foundation

Rigorous Derivation

Little's Law states: L = λ × W

Where:

  • L = average number of items in system
  • λ = arrival rate (throughput)
  • W = average time in system (latency)

Intuitive Analogy

Imagine a highway toll booth:
- Vehicle arrival rate (λ) = 100 vehicles/hour
- Time to pass through (W) = 0.1 hour/vehicle
- Vehicles at booth (L) = 100 × 0.1 = 10 vehicles

This makes sense: if each vehicle needs 0.1 hours to pass,
and 100 vehicles arrive per hour, then at any moment,
there are on average 10 vehicles in the system.

Formal Proof Sketch

Consider a time interval T.
During T:
- Tasks arriving = N = λ × T
- Each task spends average time W in system
- Task j contributes time W_j to system occupancy

Average items in system (L):
L = (1/T) × Σ [time each task spent in system]
  = (1/T) × [total time contribution from all tasks]

Since each task contributes W on average:
L ≈ (1/T) × (N × W)
  = (1/T) × (λ × T × W)
  = λ × W

Why It Works in Computer Systems

1. Conservation Principle

Tasks enter → [ Processing System ] → Tasks leave

Conservation: tasks_in = tasks_out (steady state)
Items in system = tasks that entered but haven't left yet

2. Time Average = Space Average

Observing system for time T:
Time-averaged concurrency = (1/T) × ∫[0,T] items_in_system(t) dt

This equals the task-centric view:
The same area can be counted per task as (1/T) × Σ W_j,
because task j adds 1 to items_in_system(t) for exactly W_j
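The equality between the time-averaged view and the per-task view can be checked numerically on a small trace. The sketch below integrates the number of items in the system over time and compares it with λ × W; the function name `littles_law_check` and the four-task trace are made up for illustration, and all tasks are assumed to arrive and depart inside the observation window.

```python
def littles_law_check(intervals, horizon):
    """Compare time-averaged concurrency with lambda * W.

    intervals: list of (arrival, departure) times, all within [0, horizon].
    """
    # Sweep over arrival (+1) and departure (-1) events to integrate N(t).
    events = sorted([(a, 1) for a, _ in intervals] +
                    [(d, -1) for _, d in intervals])
    area, in_system, last_t = 0.0, 0, 0.0
    for t, delta in events:
        area += in_system * (t - last_t)
        in_system += delta
        last_t = t
    avg_items = area / horizon                       # time average of N(t)

    lam = len(intervals) / horizon                   # arrival rate
    mean_w = sum(d - a for a, d in intervals) / len(intervals)
    return avg_items, lam * mean_w


tasks = [(0.0, 2.0), (1.0, 4.0), (3.0, 5.0), (6.0, 9.0)]
avg_items, lam_w = littles_law_check(tasks, horizon=10.0)
print(f"L = {avg_items:.2f}, lambda*W = {lam_w:.2f}")
# prints: L = 1.00, lambda*W = 1.00
```

Both numbers are the same area (total task-seconds spent in the system) divided by the same window, which is the core of the proof sketch.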

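3. Valid for All Queuing Models

Holds for any stable system, regardless of arrival distribution,
service distribution, number of servers, or queue discipline:
M/M/1, M/M/c, M/G/1, G/G/c, and more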

Practical Verification

def verify_littles_law(throughput, latency, measured_concurrency):
    """Verify Little's Law with measured data."""
    expected = throughput * latency
    error = abs(measured_concurrency - expected) / expected
    
    print(f"Throughput: {throughput:.1f}/s")
    print(f"Latency: {latency:.3f}s")
    print(f"Expected concurrency: {expected:.1f}")
    print(f"Measured concurrency: {measured_concurrency:.1f}")
    print(f"Error: {error:.1%}")
    
    # Error < 5% indicates stable system
    return error < 0.05

# Example
verify_littles_law(150, 0.08, 11.8)
# Output: Error: 1.7% ✓

Key Prerequisites

| Prerequisite | Description |
| --- | --- |
| Stable System | Input rate ≈ output rate (no unbounded queue growth) |
| Long-term Average | Short-term may violate; converges over time |
| Task Conservation | Every task that enters eventually leaves |

Counter-Examples (When It Fails)

Counter-Example 1: Task Loss

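throughput_in = 100/s
throughput_out = 80/s  (20% never complete: lost inside the system)
latency = 0.1s  (measured on completed tasks only)

Calculation: 100 × 0.1 = 10
Reality: 20 unfinished tasks pile up every second, so concurrency
grows without bound and no steady-state average exists → formula fails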

Counter-Example 2: Burst Arrival

1000 tasks arrive instantly
throughput = 1000/s (instantaneous)
latency = 1s
Calculation: 1000 × 1 = 1000

But this is instantaneous, not steady-state average

Diagnostic Value

When measured values don't match expectations:

| Situation | Possible Causes |
| --- | --- |
| Actual > Expected | Latency underestimated, tasks stuck, memory leak |
| Actual < Expected | Throughput overestimated, parallel processing, measurement too short |

Amdahl's Law: Extended Analysis

Mathematical Derivation

Let T be the total execution time on a single core. Let p be the parallelizable fraction.

  1. Single-core execution time: T_1 = T
  2. On N cores, parallel portion shrinks to (p×T)/N, serial portion remains (1-p)×T
  3. N-core execution time: T_N = (1-p)×T + (p×T)/N
  4. Speedup S = T_1 / T_N = 1 / ((1-p) + p/N)
  5. As N → ∞: S_max = 1 / (1-p)

If p = 0.9 (90% parallelizable): S_max = 10×
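To see how quickly the curve flattens, the formula from step 4 can be evaluated directly. A minimal sketch; the function name `amdahl_speedup` and the list of core counts are chosen for illustration.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)


for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"N={n:5d}  speedup={amdahl_speedup(0.9, n):5.2f}x")
```

With p = 0.9, 16 cores give a 6.4× speedup, and even 1024 cores stay below the 10× limit.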

Measuring the Parallel Fraction p

Since p is difficult to calculate from code inspection, use empirical fitting:

import numpy as np
from scipy.optimize import curve_fit

def amdahl_model(n, p):
    """Amdahl's Law model."""
    return 1 / ((1 - p) + p / n)

# Measured data: cores and corresponding speedup
n_data = np.array([1, 2, 4, 8, 16])
speedup_data = np.array([1.0, 1.85, 3.20, 4.80, 5.60])

# Fit p, constrained to the valid range [0, 1]
params, _ = curve_fit(amdahl_model, n_data, speedup_data, p0=[0.9], bounds=(0, 1))
p_fitted = params[0]

print(f"Fitted parallel fraction p = {p_fitted:.2%}")
print(f"Theoretical maximum speedup = {1/(1-p_fitted):.2f}x")

Real-World Serial Bottleneck Examples

| Bottleneck Type | Example | Mitigation |
| --- | --- | --- |
| Lock Contention | Global mutex for logging | Lock-free queues, per-thread buffers |
| I/O Serialization | Sequential file reads | Async I/O, memory-mapped files |
| Memory Allocator | malloc global lock | jemalloc, tcmalloc, arena allocators |
| Kernel Bottleneck | syscall serialization | io_uring, batched operations |

Gustafson's Law: Scaled Speedup

Mathematical Derivation

Unlike Amdahl, Gustafson starts from the parallel execution result and works backward.

  1. On N cores, normalized execution time T_N = 1
  2. Let s be the serial fraction of this time, so parallel fraction = 1-s
  3. On single core, serial part still takes s, but parallel part takes N×(1-s)
  4. Single-core time: T_1 = s + N×(1-s)
  5. Scaled Speedup: S(N) = T_1 / T_N = s + N×(1-s) = N - s×(N-1)

Conclusion: Speedup grows linearly with N at slope (1-s), so it stays close to N as long as s is small.
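The contrast with Amdahl's Law is easiest to see side by side. The sketch below evaluates both formulas; the function names are illustrative. Note that the two fractions are measured on different runs (s on the N-core run, p on the single-core run), so the columns describe two different experiments rather than two predictions for the same workload.

```python
def gustafson_speedup(s, n):
    """Scaled speedup on n cores; s is the serial fraction of the n-core run."""
    return n - s * (n - 1)


def amdahl_speedup(p, n):
    """Fixed-size speedup; p is the parallel fraction of the single-core run."""
    return 1.0 / ((1.0 - p) + p / n)


for n in (1, 4, 16, 64, 256):
    print(f"N={n:4d}  Gustafson (s=0.1): {gustafson_speedup(0.1, n):6.1f}x"
          f"   Amdahl (p=0.9): {amdahl_speedup(0.9, n):5.2f}x")
```

At N = 16 the scaled speedup is 14.5×, while the fixed-size speedup with p = 0.9 is 6.4×.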

Strong vs Weak Scaling

| Scaling Type | Definition | Model | Metric |
| --- | --- | --- | --- |
| Strong Scaling | Fixed problem size, add resources | Amdahl | Time reduction |
| Weak Scaling | Problem grows with resources | Gustafson | Constant time, larger problem |
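Both kinds of scaling are usually reported as an efficiency between 0 and 1. A minimal sketch of the two common definitions follows; the function names and the timings in the example are hypothetical.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed problem size: the ideal is tn == t1 / n, giving efficiency 1.0."""
    return t1 / (n * tn)


def weak_scaling_efficiency(t1, tn):
    """Problem size grows with n: the ideal is tn == t1, giving efficiency 1.0."""
    return t1 / tn


# Hypothetical timings in seconds.
print(f"Strong: {strong_scaling_efficiency(t1=100.0, tn=15.0, n=8):.1%}")
print(f"Weak:   {weak_scaling_efficiency(t1=100.0, tn=125.0):.1%}")
```

In the strong-scaling case, 8 cores finishing in 15 s instead of the ideal 12.5 s gives about 83% efficiency; in the weak-scaling case, a run that takes 125 s on the scaled problem instead of 100 s gives 80%.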

When to Use Each

Amdahl scenarios:

  • User-facing latency requirements
  • Real-time constraints
  • Interactive applications

Gustafson scenarios:

  • Scientific computing (higher resolution simulations)
  • Big data processing (process more data in same time)
  • Machine learning training (larger batches)

Universal Scalability Law (USL)

Formula and Parameters

C(N) = N / (1 + σ(N-1) + κN(N-1))

σ (sigma): Contention/Serialization coefficient
κ (kappa): Coherence/Crosstalk coefficient

Physical Meaning of Parameters

σ (Contention):

  • Represents queuing for shared resources
  • Like Amdahl's serial fraction
  • Effect: Linear degradation as N increases
  • Examples: mutex waits, database locks

κ (Coherence):

  • Represents pairwise communication overhead
  • Each node must communicate with others for consistency
  • Effect: Quadratic degradation (N×(N-1) pairs)
  • Examples: cache coherence traffic, distributed consensus

Three Scaling Regimes

| Condition | Behavior | Curve Shape |
| --- | --- | --- |
| σ=0, κ=0 | Linear | Straight diagonal |
| σ>0, κ=0 | Amdahl | Approaches horizontal asymptote |
| σ>0, κ>0 | USL | Peak then decline (retrograde) |

Optimal Parallelism

The peak of C(N) occurs at:

N* = sqrt((1 - σ) / κ)

Beyond N*, adding more processors hurts performance.
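The closed-form peak can be cross-checked against a brute-force search over integer N. A minimal sketch with illustrative coefficients (σ = 0.02, κ = 0.0001); the function names are assumptions for this example, and the closed form requires κ > 0.

```python
import math


def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak(sigma, kappa):
    """Closed-form N* = sqrt((1 - sigma) / kappa); requires kappa > 0."""
    return math.sqrt((1 - sigma) / kappa)


sigma, kappa = 0.02, 0.0001   # illustrative values, not measured data
n_star = usl_peak(sigma, kappa)
best_n = max(range(1, 501), key=lambda n: usl_capacity(n, sigma, kappa))

print(f"Closed-form N* = {n_star:.1f}")
print(f"Best integer N = {best_n}, C(N) = {usl_capacity(best_n, sigma, kappa):.2f}")
```

For these coefficients the closed form gives N* ≈ 99.0 and the search also lands on N = 99, where capacity is only about 25× the single-processor value.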

Python Fitting Example

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def usl_model(n, sigma, kappa):
    """Universal Scalability Law model."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Measured data
n_data = np.array([1, 2, 4, 8, 16, 32, 64])
throughput_data = np.array([100, 185, 340, 580, 820, 900, 750])

# Normalize to relative capacity
relative_capacity = throughput_data / throughput_data[0]

# Fit USL parameters
params, covariance = curve_fit(usl_model, n_data, relative_capacity,
                                p0=[0.01, 0.001], bounds=(0, 1))
sigma, kappa = params

print(f"Contention (σ): {sigma:.4f}")
print(f"Coherence (κ): {kappa:.6f}")

# Optimal parallelism
n_optimal = np.sqrt((1 - sigma) / kappa)
print(f"Optimal parallelism N*: {n_optimal:.1f}")

# Predict and plot
n_pred = np.linspace(1, 128, 100)
c_pred = usl_model(n_pred, sigma, kappa)

plt.figure(figsize=(10, 6))
plt.scatter(n_data, relative_capacity, label='Measured', s=100)
plt.plot(n_pred, c_pred, 'r-', label=f'USL fit (σ={sigma:.3f}, κ={kappa:.5f})')
plt.axvline(n_optimal, color='g', linestyle='--', label=f'N* = {n_optimal:.1f}')
plt.xlabel('Number of Processors (N)')
plt.ylabel('Relative Capacity C(N)')
plt.title('USL Analysis: Identifying Scalability Limits')
plt.legend()
plt.grid(True)
plt.savefig('usl_analysis.png', dpi=150)

Case Study: Database Connection Pooling

Observations:
- 10 connections: 1000 TPS
- 20 connections: 1800 TPS
- 40 connections: 2800 TPS
- 80 connections: 3200 TPS (peak!)
- 160 connections: 2400 TPS (retrograde)

Fitted: σ=0.02, κ=0.0001
N* = sqrt(0.98/0.0001) ≈ 99 connections

Diagnosis: κ > 0 indicates coherence issues
- Connection pool management overhead
- Distributed lock contention
- Context switch costs

Roofline Model: Cache-Aware Extensions

Cache-Aware Roofline Model (CARM)

Traditional Roofline considers only DRAM bandwidth. CARM extends to multiple memory hierarchy levels.

Multiple Rooflines

Performance (GFLOPS)
    ^
    |  Peak Compute ─────────────────────────
    |  /  /  /  /  /
    | / L1 Roofline (highest slope)
    |/  / L2 Roofline
    |  /  / L3 Roofline
    | /  /  / DRAM Roofline (lowest slope)
    |/  /  /  /
    └──────────────────────────────────────> Arithmetic Intensity

Calculating AI for Each Level

| Level | Arithmetic Intensity | When to Use |
| --- | --- | --- |
| AI_L1 | FLOPs / L1_traffic | Data fits in L1 |
| AI_L2 | FLOPs / L2_traffic | Data fits in L2 |
| AI_L3 | FLOPs / L3_traffic | Data fits in L3 |
| AI_DRAM | FLOPs / DRAM_traffic | Streaming from memory |
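Given a FLOP count and the bytes moved at each level, the per-level intensity and the bound it implies follow from min(peak, AI × bandwidth). The sketch below is illustrative only: the function name, the traffic figures, the bandwidths, and the peak are all hypothetical numbers, not measurements of any real machine.

```python
def per_level_roofline(flops, traffic_bytes, bandwidth_gbs, peak_gflops):
    """Return {level: (arithmetic_intensity, bound_in_gflops)}."""
    result = {}
    for level, bytes_moved in traffic_bytes.items():
        ai = flops / bytes_moved                     # FLOPs per byte at this level
        bound = min(peak_gflops, ai * bandwidth_gbs[level])
        result[level] = (ai, bound)
    return result


# Hypothetical kernel and machine.
traffic = {"L1": 8e9, "L2": 4e9, "L3": 2e9, "DRAM": 1e9}            # bytes
bandwidth = {"L1": 1000.0, "L2": 400.0, "L3": 200.0, "DRAM": 50.0}  # GB/s
table = per_level_roofline(flops=1e9, traffic_bytes=traffic,
                           bandwidth_gbs=bandwidth, peak_gflops=100.0)

for level, (ai, bound) in table.items():
    print(f"{level:5s} AI = {ai:5.3f} FLOP/B   bound = {bound:6.1f} GFLOPS")

limiting = min(table, key=lambda lvl: table[lvl][1])
print(f"Tightest bound comes from: {limiting}")
```

In this made-up case every cache level reaches the 100 GFLOPS compute ceiling, while DRAM caps the kernel at 50 GFLOPS, so DRAM traffic is the place to optimize.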

Measurement with perf

# Measure floating-point operations
perf stat -e fp_arith_inst_retired.scalar_double,\
fp_arith_inst_retired.128b_packed_double ./my_program

# Measure memory traffic (cache misses × cache line size)
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,\
LLC-load-misses,LLC-store-misses ./my_program
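Turning the raw counts into an arithmetic intensity is simple arithmetic. A minimal sketch, assuming a 64-byte cache line and that only the two FP events in the command above were collected (wider packed events would need their own lane counts). The function names are illustrative and the counter values in the example are made up.

```python
CACHE_LINE_BYTES = 64  # assumption: common x86-64 line size; check your CPU


def flops_from_counters(scalar_double, packed_128_double):
    """Scalar instructions do 1 operation; 128-bit packed doubles hold 2 lanes."""
    return scalar_double + 2 * packed_128_double


def bytes_from_misses(load_misses, store_misses, line_bytes=CACHE_LINE_BYTES):
    """Approximate traffic below a cache level as misses × line size."""
    return (load_misses + store_misses) * line_bytes


def arithmetic_intensity(flops, traffic_bytes):
    """FLOPs per byte moved."""
    return flops / traffic_bytes


# Made-up counter values for illustration.
flops = flops_from_counters(scalar_double=1_000, packed_128_double=500)
traffic = bytes_from_misses(load_misses=10, store_misses=6)
print(f"FLOPs = {flops}, bytes = {traffic}, "
      f"AI = {arithmetic_intensity(flops, traffic):.2f} FLOP/byte")
```

Using LLC misses in place of L1 misses gives an estimate of DRAM traffic instead of L1-to-L2 traffic.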

Optimization Strategy by Location

| Point Location | Diagnosis | Optimization |
| --- | --- | --- |
| Below DRAM slope | DRAM bandwidth limited | Prefetching, streaming stores |
| Between L3 and DRAM | L3 miss issues | Improve data locality, blocking |
| Near L1 slope, below compute | Low arithmetic intensity | Loop fusion, vectorization |
| Near compute ceiling | Compute limited | Better algorithms, SIMD |

Integer Roofline

For non-FP workloads (compilers, databases, encryption):

  • Y-axis: GOPS (Giga-Operations Per Second)
  • Integer ops often faster than FP
  • More sensitive to cache latency than FP

Energy Roofline (Green Computing)

  • Y-axis: GFLOPS/Watt
  • Finds energy-efficient operating points
  • Important for HPC and data centers

Queuing Theory Fundamentals

Kendall's Notation: A/S/c/K/N/D

| Symbol | Meaning | Common Values |
| --- | --- | --- |
| A | Arrival distribution | M (Poisson), D (Deterministic), G (General) |
| S | Service distribution | M (Exponential), D, G |
| c | Number of servers | 1, c, ∞ |
| K | System capacity | ∞ (unbounded), finite |
| N | Population size | ∞, finite |
| D | Queue discipline | FIFO, LIFO, Priority |

M/M/1 Model: Complete Analysis

Assumptions:

  • Poisson arrivals (rate λ)
  • Exponential service times (rate μ)
  • Single server
  • FIFO queue
  • Infinite capacity

Key Formulas:

Utilization: ρ = λ/μ (must be < 1 for stability)

Probability of n items in system: P_n = (1-ρ) × ρ^n

Average items in system: L = ρ/(1-ρ)

Average items in queue: L_q = ρ²/(1-ρ)

Average time in system: W = 1/(μ-λ)

Average time in queue: W_q = ρ/(μ-λ)
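The formulas above translate directly into code. A minimal sketch; the function name `mm1_metrics` and the example rates (8 arrivals/s against a service rate of 10/s) are chosen for illustration.

```python
def mm1_metrics(lam, mu):
    """Steady-state M/M/1 metrics for arrival rate lam and service rate mu."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate must be below service rate")
    rho = lam / mu
    return {
        "rho": rho,                      # utilization
        "L": rho / (1 - rho),            # average items in system
        "Lq": rho ** 2 / (1 - rho),      # average items in queue
        "W": 1 / (mu - lam),             # average time in system
        "Wq": rho / (mu - lam),          # average time in queue
    }


m = mm1_metrics(lam=8.0, mu=10.0)
for name, value in m.items():
    print(f"{name:3s} = {value:.3f}")

# Little's Law ties the pairs together: L = lam * W and Lq = lam * Wq.
assert abs(m["L"] - 8.0 * m["W"]) < 1e-9
assert abs(m["Lq"] - 8.0 * m["Wq"]) < 1e-9
```

At 80% utilization this gives L = 4 items in the system and W = 0.5 s, five times the 0.1 s service time.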

The Hockey Stick Effect:

| Utilization (ρ) | Avg Items in System (L) | Time in System (× service time) |
| --- | --- | --- |
| 50% | 1.0 | 2× |
| 70% | 2.3 | 3.3× |
| 80% | 4.0 | 5× |
| 90% | 9.0 | 10× |
| 95% | 19.0 | 20× |
| 99% | 99.0 | 100× |

M/M/c Model: Multiple Servers

Erlang C Formula (probability of waiting):

import math

def erlang_c(c, rho):
    """Calculate probability of queuing in M/M/c system."""
    a = c * rho  # Offered load

    # Erlang C formula
    sum_term = sum((a**k) / math.factorial(k) for k in range(c))
    last_term = (a**c) / (math.factorial(c) * (1 - rho))

    return last_term / (sum_term + last_term)

# Example: 10 servers, 80% average utilization
prob_wait = erlang_c(10, 0.8)
print(f"Probability of queuing: {prob_wait:.1%}")
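The probability of queuing becomes a mean waiting time through the standard M/M/c result W_q = P(wait) / (cμ − λ). The sketch below repeats the Erlang C function so that it runs on its own; `mmc_mean_wait` and the example rates are illustrative assumptions.

```python
import math


def erlang_c(c, rho):
    """Probability that an arrival has to queue in an M/M/c system."""
    a = c * rho  # offered load
    sum_term = sum((a ** k) / math.factorial(k) for k in range(c))
    last_term = (a ** c) / (math.factorial(c) * (1 - rho))
    return last_term / (sum_term + last_term)


def mmc_mean_wait(lam, mu, c):
    """Mean time in queue W_q for arrival rate lam, per-server rate mu, c servers."""
    rho = lam / (c * mu)                 # per-server utilization
    if rho >= 1:
        raise ValueError("unstable: lam must be below c * mu")
    return erlang_c(c, rho) / (c * mu - lam)


# With one server the result matches the M/M/1 formula W_q = rho / (mu - lam).
print(f"c=1: W_q = {mmc_mean_wait(8.0, 10.0, 1):.3f} s")
print(f"c=2: W_q = {mmc_mean_wait(8.0, 10.0, 2):.3f} s")
```

The single-server case gives 0.4 s, matching the M/M/1 formula; adding a second server at the same arrival rate cuts the wait sharply.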

Capacity Planning Guidelines

  1. 70% Rule: Keep utilization ≤ 70% for latency-sensitive systems
  2. Target Wait Probability: For SLA, target < 5% probability of waiting
  3. Headroom for Bursts: Leave 30% headroom for traffic spikes

Connection Pool Sizing Formula

def optimal_pool_size(arrival_rate, service_time, target_wait_prob=0.05):
    """Find the smallest pool size whose wait probability meets the target."""
    offered_load = arrival_rate * service_time  # a = λ × S, in Erlangs

    # Linear search for the minimum c where P(wait) < target
    for c in range(1, 1000):
        if offered_load >= c:  # Per-server utilization >= 1: unstable
            continue
        if erlang_c(c, offered_load / c) < target_wait_prob:
            return c
    return None

# Example: 100 requests/sec, 50ms average service time
pool_size = optimal_pool_size(100, 0.05)
print(f"Recommended pool size: {pool_size} connections")

Tools and Measurement

Performance Counter Events

| Tool | Best For | Key Events |
| --- | --- | --- |
| perf | Linux, general | cycles, instructions, cache-misses |
| Intel VTune | Intel CPUs | vectorization, memory bandwidth |
| Intel Advisor | Roofline analysis | FLOPS, memory traffic |
| Likwid | HPC, multi-arch | configurable groups |
| PAPI | Cross-platform | portable API |

Measurement Best Practices

  1. Warm up: Run several iterations before measuring
  2. Steady state: Ensure system reaches equilibrium
  3. Multiple runs: Report mean and standard deviation
  4. Control variables: Pin cores, disable frequency scaling
  5. Representative load: Use realistic workloads

References

  1. Amdahl, G.M. (1967). "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities"
  2. Gustafson, J.L. (1988). "Reevaluating Amdahl's Law"
  3. Gunther, N.J. (1998). "The Practical Performance Analyst"
  4. Williams, S. et al. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures"
  5. Little, J.D.C. (1961). "A Proof for the Queuing Formula: L = λW"
  6. Kleinrock, L. (1975). "Queueing Systems, Volume 1: Theory"
  7. Ilic, A. et al. (2014). "Cache-aware Roofline model: Upgrading the loft"