title: "Performance and Benchmarking" subtitle: "Beyond the Bottleneck: From Classic Systems to Modern AI and HPC" author: "Danny Jiang" version: "Draft v0p9" date: "January 2026"
Performance and Benchmarking
Beyond the Bottleneck
From Classic Systems to Modern AI and HPC
Danny Jiang
Draft v0p9 - January 2026
Complete Book Contents:
- 35 chapters in 9 parts
- 8 appendices with exercises and reference materials
- Comprehensive coverage from benchmarking basics to AI/HPC and embedded systems
Licensed under CC BY 4.0
Copyright and License
Performance and Benchmarking
Beyond the Bottleneck: From Classic Systems to Modern AI and HPC
Copyright © 2025-2026 Danny Jiang
- Version: Draft v0p9
- Published: January 2026
- Author: Danny Jiang
- Contact: djiang.tw@gmail.com
License
This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to:
- Share: Copy and redistribute the material in any medium or format, even for commercial purposes
- Adapt: Remix, transform, and build upon the material, even for commercial purposes
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Full License Terms: https://creativecommons.org/licenses/by/4.0/
Trademarks
- RISC-V is a trademark of RISC-V International
- ARM is a trademark of Arm Limited
- Intel, x86, and VTune are trademarks of Intel Corporation
- NVIDIA, CUDA, and Nsight are trademarks of NVIDIA Corporation
- Linux is a trademark of Linus Torvalds
- Other product and company names mentioned may be trademarks of their respective owners
Disclaimer
This book is provided "as is" without warranty of any kind, express or implied. The author and publisher disclaim all warranties, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement.
The information in this book is based on publicly available documentation, specifications, and the author's professional experience. While every effort has been made to ensure accuracy, hardware and software continue to evolve. Readers should verify information against current documentation and test thoroughly in their specific environments.
Performance measurements and benchmarks in this book are for the specific hardware and software configurations described. Results may vary on different systems.
About This Book
This is the complete book "Performance and Benchmarking." This book contains:
- 35 chapters in 9 parts
- 8 appendices with exercises, tools, and reference materials
- Comprehensive coverage from benchmarking basics to AI/HPC and embedded footprint analysis
Author GitHub: https://github.com/djiangtw
Updates and Errata: To be announced
Preface
Who This Book Is For
This book is written for software engineers who need to perform performance measurement and analysis. You might be:
- A developer who needs to evaluate different algorithms or systems
- A QA engineer responsible for performance testing
- An embedded systems engineer evaluating hardware platforms
- A technical lead who needs to present performance data to customers
- A student interested in performance engineering
Regardless of your background, this book will help you avoid common benchmarking pitfalls and produce reliable, reproducible, and meaningful performance data.
Book Structure
This book is organized into nine parts:
Part I: Foundations - Benchmarking Methodology (Chapters 1-4) Establishes correct benchmarking mindset and methodology. Covers measurement environment, statistical methods, and result presentation.
Part II: Tools - Classic Benchmarks & Profiling (Chapters 5-9) Introduces various benchmark tools including CPU, memory, system-level, profiling, and embedded benchmarks.
Part III: Theory - Performance Modeling (Chapters 10-12) Deep dive into performance modeling including Roofline Model, Amdahl's Law, cache, and branch prediction.
Part IV: Practice - Data Structures & Algorithms (Chapters 13-15) Practical data structure performance analysis including array vs linked list, hash table vs tree, and sorting algorithms.
Part V: Advanced - Parallelism & Vectorization (Chapters 16-18) Advanced topics: SIMD, multi-core performance, and memory allocators.
Part VI: Embedded Constraints (Chapters 19-22) Embedded system footprint analysis: static/dynamic analysis, compiler optimization, stack analysis, and RTOS case study.
Part VII: AI/HPC Performance (Chapters 23-29) Modern performance domains: AI/ML benchmarks, HPC, GPU, LLM, ML compiler, and Edge AI.
Part VIII: Case Studies (Chapters 30-32) Real-world optimization case studies: web server, database query, and ML inference.
Part IX: Synthesis (Chapters 33-35) Bringing it all together: how to benchmark, how to optimize, and CI/CD for performance.
About Code and Commands
Code examples in this book are primarily in C, the most common language for performance measurement. Concepts apply to any language.
Commands and tools in this book focus on Linux. Linux is chosen because:
- Linux is the mainstream platform for servers and embedded systems
- Linux provides the most complete performance measurement tools
- Linux behavior is most predictable and controllable
Most concepts and methodologies apply to all operating systems. Appendices provide corresponding tools and commands for Windows and macOS users.
How to Use This Book
You can read sequentially or jump to chapters that interest you. However, I recommend:
- Read Part I first: Even if experienced, the methodology here helps avoid common mistakes
- Parts II-III are selectively readable: Choose relevant chapters based on what you need to measure
- Part IV is for data structures: When you need to understand how data structures perform in practice
- Part V is for low-level optimization: SIMD, multi-core, and memory allocators
- Part VI is for embedded: When you need to analyze footprint in resource-constrained systems
- Part VII is for AI/HPC: When you need to handle modern AI and HPC workloads
- Part VIII contains case studies: Real-world optimization examples
- Part IX provides synthesis: How to benchmark, optimize, and integrate into CI/CD
Each chapter ends with a Summary, and appendices contain exercises for practice.
Acknowledgments
This book exists thanks to the inspiration and support of many people.
First, I want to thank Gavin Guo and Jim Huang (jserv), two former colleagues from whom I learned a great deal—both through direct collaboration and through their publications, talks, and open-source contributions to performance analysis tools and methodology. Their work in the public domain continues to benefit engineers everywhere.
I thank the open-source community for creating the tools that make this book possible—perf, Valgrind, GCC, LLVM, and countless others. The transparency of open-source software allows us to understand performance at the deepest levels.
Thanks to engineers who share knowledge through blogs, papers, and conference talks. The work of Brendan Gregg on performance analysis, Fedor Pikus on C++ optimization, Ulrich Drepper on memory systems, and Agner Fog on x86 optimization has shaped my understanding and influenced this book.
I thank colleagues at SiFive, MIPS, Andes Technology, Broadcom, Western Digital, and SiS. Performance analysis and benchmarking has been my primary focus at SiFive and a significant part of my responsibilities at other companies. The practical experience gained from these teams—debugging real performance issues, building measurement infrastructure, and optimizing production systems—forms the foundation of the examples and case studies throughout this book.
Thanks to early reviewers who provided feedback on draft chapters. Your suggestions improved the technical accuracy and clarity of the material.
Finally, thanks to my family for their patience and support during many evenings and weekends of writing.
Feedback
If you find errors or have suggestions, please contact: djiang.tw@gmail.com
Let's begin.
Table of Contents
Part I: Foundations - Benchmarking Methodology
- Chapter 1: Why Benchmarking Is Hard
- Chapter 2: Setting Up Your Measurement Environment
- Chapter 3: Measurement Methodology
- Chapter 4: Presenting Results
Part II: Tools - Classic Benchmarks & Profiling
- Chapter 5: CPU Benchmarks
- Chapter 6: Memory Benchmarks
- Chapter 7: System-Level Benchmarks
- Chapter 8: Profiling Tools
- Chapter 9: Embedded & RTOS Benchmarks
Part III: Theory - Performance Modeling
- Chapter 10: Performance Modeling
- Chapter 11: Galactic Algorithms
- Chapter 12: Cache & Branch Prediction
Part IV: Data Structures & Algorithms
- Chapter 13: Array vs Linked List
- Chapter 14: Hash Table vs Tree
- Chapter 15: Sorting Algorithms
Part V: Parallelism & Low-Level Optimization
- Chapter 16: SIMD & Vectorization
- Chapter 17: Multi-core Performance
- Chapter 18: Memory Allocators
Part VI: Embedded Constraints
- Chapter 19: Footprint Analysis Fundamentals
- Chapter 20: Compiler Size Optimization
- Chapter 21: Stack Analysis and Estimation
- Chapter 22: RTOS Footprint Case Study
Part VII: AI/HPC Performance
- Chapter 23: Evolution of Performance Metrics
- Chapter 24: AI/ML Benchmarks
- Chapter 25: HPC Benchmarks
- Chapter 26: GPU Benchmarking
- Chapter 27: LLM Performance Analysis
- Chapter 28: ML Compilers and Runtime
- Chapter 29: Edge AI Performance
Part VIII: Case Studies
- Chapter 30: Case Study: Web Server Optimization
- Chapter 31: Case Study: Database Query Optimization
- Chapter 32: Case Study: ML Inference Optimization
Part IX: Synthesis
- Chapter 33: How to Benchmark
- Chapter 34: How to Optimize
- Chapter 35: CI/CD for Performance
Appendices
- Appendix A: Benchmark Automation
- Appendix B: Embedded and RTOS Implementation
- Appendix C: I/O and Storage Performance
- Appendix D: Power and Performance
- Appendix E: Exercises and Solutions
- Appendix F: Environment Setup Guide
- Appendix G: Further Reading
- Appendix H: Performance Models Deep Dive
Chapter 1: Why Benchmarking Is Hard
Part I: Foundations
"There are three kinds of lies: lies, damned lies, and benchmarks." — Adapted from Benjamin Disraeli
The Meeting Before Launch
It was a Monday morning, two weeks before product launch. I sat in a conference room listening to Kevin, our marketing manager, describe what he needed.
"We need some performance numbers," he said, "for the datasheet and press release. Customers want to know how much faster our new chip is compared to the previous generation."
Fair enough. I was the performance engineer—this was literally my job.
"No problem," I said. "I can run some benchmarks. Should take about a week for a proper analysis."
Kevin frowned. "A week? We just need a few numbers. Tony got us the data in one day last time."
Tony was the engineer who handled the previous generation. He'd since left the company. I found his benchmark scripts and decided to run them first.
The results shocked me.
Those "Perfect" Numbers
Tony's scripts produced this output:
New Chip vs Old Chip Performance Comparison
============================================
Integer Operations: +47%
Floating Point: +62%
Memory Bandwidth: +35%
Overall Score: +48%
Too perfect. Every number showed improvement, and the gains were right in that "impressive but believable" range.
But I noticed a few oddities:
- No variance data — Each test had exactly one number, no standard deviation
- No environment description — No mention of test conditions
- No raw data — Only final conclusions, no underlying measurements
I decided to dig deeper.
Running It Again
I re-ran the same benchmark ten times. The results:
Run 1: +52%
Run 2: +31%
Run 3: +47%
Run 4: +68%
Run 5: +29%
Run 6: +41%
Run 7: +55%
Run 8: +33%
Run 9: +44%
Run 10: +38%
The variance was enormous. The best run showed +68%, the worst +29%—more than a 2× difference.
Tony's reported +47% was within this range, but he'd only run it once and happened to hit a favorable number. This wasn't fraud, but it wasn't accurate either.
Worse, when I checked the test environments:
- The new chip was tested in a 25°C air-conditioned lab
- The old chip was tested in a 35°C regular office
- The new chip had been freshly rebooted before testing
- The old chip had been running for three days before testing
This wasn't a fair comparison at all.
I Can Make the Numbers Say Anything
That evening, I ran an experiment. I wanted to know: if I deliberately manipulated test conditions, how wide could I make the performance gap?
Best case (conditions favoring the new chip):
- New chip: cold start, fresh reboot, all background processes killed, CPU frequency locked to maximum
- Old chip: warm, running for days, multiple background processes, CPU in power-saving mode
Result: +89%
Worst case (conditions reversed):
Result: +12%
Same hardware, same benchmark program, and the performance difference ranged from +12% to +89% depending purely on how I set up the test environment.
This is what makes benchmarking terrifying: numbers don't lie, but numbers can be manipulated.
I Told Kevin the Truth
The next day, I scheduled a meeting with Kevin.
"I have good news and bad news," I said.
"Bad news first."
"The +47% figure isn't reliable. The test environments were inconsistent, and there was no statistical analysis. If we publish that number, tech journalists will tear it apart."
Kevin's face fell. "And the good news?"
"The good news is the new chip really is faster. Under controlled, fair testing conditions, the performance improvement is somewhere between +25% and +35%, with 95% confidence. That's a number we can defend."
Kevin was quiet for a moment. "+25% doesn't sound as impressive as +47%."
"But +25% is real. +47% was a lucky single run."
In the end, we used +30% (the middle of our confidence interval) and added a footnote to the datasheet describing our test methodology.
That decision taught me a lesson: honest benchmarks may not look as impressive, but at least they won't blow up in your face later.
Why Benchmarking Is So Hard
This experience taught me the fundamental challenge of benchmarking: too many factors affect measurement results, and our intuition consistently overlooks them.
Let me walk through some of the major factors that influence benchmark results:
1. System Noise
Your computer never does just one thing. Background processes, kernel threads, and interrupt handlers are all competing for CPU time.
$ perf stat -r 10 ./my_benchmark
Performance counter stats for './my_benchmark' (10 runs):
1,234,567 cycles ( +- 15.2% )
System noise alone can cause 15% variance—and that's on a "quiet" system.
2. CPU Frequency Scaling
Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle.
Run 1 (cold): 1,000 μs @ 4.2 GHz
Run 2 (warm): 1,150 μs @ 3.8 GHz
Run 3 (hot): 1,400 μs @ 3.2 GHz
Statistics 101: Three Things You Must Know
After seeing these factors, you understand why a single measurement isn't enough. Let me introduce three statistical concepts every performance engineer should know.
Mean: The Most Common Lie
Mean is the most commonly reported statistic—and often the most misleading.
Consider these two benchmark results:
Benchmark A: 100, 100, 100, 100, 100
Mean: 100 μs
Benchmark B: 50, 50, 50, 50, 300
Mean: 100 μs
Same mean, completely different behavior. Benchmark B has a tail latency problem, but the mean hides it.
Lesson: Never report just the mean. Always include variance or percentiles.
Variance and Standard Deviation
Variance measures how spread out your data is. Standard deviation (σ) is the square root of variance, with the same units as your measurement.
Benchmark A: σ = 0 μs (perfectly consistent)
Benchmark B: σ = 100 μs (high variance)
Rule of thumb: if σ exceeds 5% of the mean, your measurements are too noisy.
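To make the two benchmarks above concrete, here is a minimal sketch that computes their mean and variance. The function names (mean_of, variance_of, too_noisy) are my own for illustration, and I use the population variance (divide by n), which is what yields σ = 100 μs for Benchmark B; the sample variance (divide by n - 1) would give about 111.8 μs instead.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
double mean_of(const double *x, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Population variance: average squared deviation from the mean. */
double variance_of(const double *x, size_t n) {
    double m = mean_of(x, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - m;
        acc += d * d;
    }
    return acc / (double)n;
}

/* Rule-of-thumb check: does sigma exceed `limit` (e.g. 0.05) of the mean?
 * Compares squared values, so no square root is needed. */
int too_noisy(const double *x, size_t n, double limit) {
    double bound = limit * mean_of(x, n);
    return variance_of(x, n) > bound * bound;
}
```

For Benchmark B (50, 50, 50, 50, 300) the variance comes out as 10,000 μs², so σ is its square root, 100 μs. In real code you would take that root with sqrt() from <math.h>; the sketch avoids it only to stay dependency-free.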
Confidence Intervals
When you say "my optimization is 15% faster," how sure are you?
A confidence interval tells you where the true value likely falls. A 95% confidence interval means: if you repeated the whole experiment 100 times and computed an interval each time, about 95 of those intervals would contain the true value.
Performance improvement: 15% (95% CI: 8% to 22%)
This says: "I'm 95% confident the real improvement is between 8% and 22%."
If your confidence interval crosses zero, you can't claim any improvement at all:
Performance improvement: 5% (95% CI: -3% to 13%)
That might just be noise, not signal.
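For readers who want to see where such intervals come from, here is a short worked sketch. It assumes the usual normal approximation for the mean of n runs (sample standard deviation s, z = 1.96 for 95%); that assumption is mine, and the two examples above do not say how their intervals were computed.

```latex
\[
\text{95\% CI} = \bar{x} \pm z\,\frac{s}{\sqrt{n}}, \qquad z = 1.96
\]
% First example: 15% with CI [8%, 22%]
\[
\text{half-width} = \frac{22 - 8}{2} = 7
\quad\Rightarrow\quad
\frac{s}{\sqrt{n}} = \frac{7}{1.96} \approx 3.6 \text{ percentage points}
\]
% Second example: 5% with CI [-3%, 13%]
\[
\text{half-width} = \frac{13 - (-3)}{2} = 8 > 5
\quad\Rightarrow\quad
\text{the interval contains } 0
\]
```

Read this way, the second result fails for a simple reason: the measured improvement (5 points) is smaller than the half-width of its own interval (8 points).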
How Many Runs Do I Need?
One of the most common questions: how many times should I run my benchmark?
The answer depends on variance. Here's a practical approach:
Step 1: Run 10 times, calculate the standard deviation.
Step 2: Use this formula to estimate the required sample size:
n = (z × σ / E)²
where:
n = required sample size
z = 1.96 (for 95% confidence)
σ = standard deviation
E = acceptable margin of error
Example: Your benchmark has σ = 100 μs, and you want the error within ±10 μs:
n = (1.96 × 100 / 10)² = 19.6² ≈ 384.2, rounded up to 385 samples
You need roughly 400 runs to get reliable results.
Step 3: If you can't run that many, either relax your error margin or reduce variance by controlling the test environment.
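The three steps above can be sketched as a small helper. The name required_runs and its parameter names are hypothetical, and the sketch assumes σ comes from your pilot runs in Step 1. It rounds up, because a fractional run still costs a whole run.

```c
#include <assert.h>
#include <stddef.h>

/* Runs needed so the margin of error is at most `margin`.
 * z: critical value (1.96 for 95% confidence)
 * sigma: standard deviation estimated from a pilot run
 * margin: acceptable error E, in the same units as sigma */
size_t required_runs(double z, double sigma, double margin) {
    assert(z > 0.0 && sigma >= 0.0 && margin > 0.0);
    double root = z * sigma / margin;
    double exact = root * root;
    size_t n = (size_t)exact;      /* truncate toward zero */
    if ((double)n < exact) n++;    /* round any fraction up */
    return n;
}
```

With z = 1.96, σ = 100 μs, and E = 10 μs this returns 385. Relaxing E to 20 μs drops the requirement to 97 runs, which shows how quickly a looser margin pays off.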
Warm-up: The Hidden Requirement
Watch what happens when I run a benchmark 100 times consecutively and plot the results:
Run Time (μs)
1 5,234 ← cold start
2 3,891
3 2,456
4 1,234
5 1,198
...
50 1,201
100 1,199 ← steady state
The first few runs are outliers—JIT compilation, cache warming, branch predictor training. These don't represent steady-state performance.
Solution: Always include warm-up runs, then discard that data:
// Warm-up phase (discard these)
for (int i = 0; i < WARMUP_RUNS; i++) {
run_benchmark();
}
// Measurement phase (keep these)
for (int i = 0; i < MEASURED_RUNS; i++) {
times[i] = run_benchmark();
}
How many warm-up runs? Enough that subsequent results stabilize. Plot the data—you'll see when it converges.
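If you would rather not eyeball a plot, one possible heuristic is to look for the first window of consecutive runs that agree with each other. The sketch below is only that, a heuristic of my own: first_stable_run, the window size, and the tolerance are all assumptions you should tune for your workload.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first run that starts a "stable" window:
 * `window` consecutive samples whose max and min differ by at most
 * `tol` (e.g. 0.05 = 5%) relative to the min. Returns n if none found. */
size_t first_stable_run(const double *t, size_t n, size_t window, double tol) {
    assert(window > 0);
    for (size_t i = 0; i + window <= n; i++) {
        double lo = t[i], hi = t[i];
        for (size_t j = i + 1; j < i + window; j++) {
            if (t[j] < lo) lo = t[j];
            if (t[j] > hi) hi = t[j];
        }
        if (hi - lo <= tol * lo) return i;
    }
    return n;
}
```

Fed the rows shown in the table above (5234, 3891, 2456, 1234, 1198, 1201, 1199, with the elided runs left out), a window of 3 and a tolerance of 5% returns index 3. In other words, discard the first three runs and start measuring from the fourth.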
Back to That Meeting
Looking back at the pre-launch meeting, here's what proper analysis of Tony's data would have shown:
Tony's method (1 run):
Result: +47%
Confidence: Unknown
Reproducibility: Unverified
Proper method (100 runs, controlled environment):
Mean: +30%
σ: 4.2%
95% CI: [+25%, +35%]
Reproducibility: ✓
+47% became +30%. Less impressive, but true.
More importantly, this number was defensible. When tech journalists or competitors challenged us, we could produce complete methodology and raw data.
That's the value of proper benchmarking: not prettier numbers, but trustworthy numbers.
Guidelines for Reliable Benchmarking
Based on these experiences, here are my guidelines:
Guideline 1: Control the Environment
# Disable CPU frequency scaling
sudo cpupower frequency-set -g performance
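# Disable turbo boost (generic cpufreq knob; not present with the intel_pstate driver)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# On systems using intel_pstate, use this instead:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo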
# Pin to a specific CPU core
taskset -c 2 ./benchmark # bind to core 2
Guideline 2: Warm Up Before Measuring
Always discard the first runs. How many depends on your workload—measure until results stabilize.
Guideline 3: Report Variance, Not Just Mean
✗ Bad: "Latency: 1.2 ms"
✓ Good: "Latency: 1.2 ms (σ = 0.1 ms, n = 1000)"
Guideline 4: Use Confidence Intervals When Comparing
Don't say "A is 15% faster than B." Say "A is 15% faster than B (95% CI: 10% to 20%)."
Guideline 5: Be Suspicious of Large Improvements
If your optimization shows a 10× improvement, check everything three times. You've probably made a measurement error.
Summary
Benchmarking is hard because reality is messy. CPUs change frequency, caches retain old data, background processes steal cycles, and our intuition—trained on Big-O analysis—consistently fails us.
To benchmark correctly:
- Measure multiple times and compute variance
- Warm up before measuring
- Report uncertainty using confidence intervals
- Control the environment to reduce noise
- Stay skeptical of your own results
Chapter 2: Setting Up Your Measurement Environment
Part I: Foundations
"Measure what is measurable, and make measurable what is not so." — Galileo Galilei
The Unreproducible Bug
"I ran it ten times, and I got a different number each time."
This was the first thing Jason, a new engineer I was mentoring, said during his first performance analysis task. He was measuring a sorting algorithm, and his results varied by 40% between runs.
"What's your measurement environment?" I asked.
"Just my laptop."
I glanced at his screen—Slack was open, Chrome had twenty-something tabs, Spotify was playing music, and a Docker container was running in the background.
"That's your problem," I said. "You're not measuring your program. You're measuring your program plus Slack, plus Chrome, plus Spotify, plus Docker, plus whatever mood your laptop is in."
System Noise: The Invisible Enemy
Modern operating systems are time-shared. Even when you're running a single program, the OS is doing many things in the background:
- Kernel threads: Memory management, I/O scheduling, network processing
- Interrupts: Hardware interrupts from network cards, USB, timers
- Background daemons: Cron jobs, logging, file indexing
- Power management: CPU frequency scaling, thermal throttling
This "noise" steals CPU cycles from your program—unpredictably.
Let me show you a simple experiment. This is the same benchmark run 100 times on a "busy" system:
Run Time (μs) Notes
1 1,234
2 1,198
3 5,678 ← Context switch?
4 1,201
5 1,245
...
47 8,901 ← Interrupt storm?
...
100 1,199
Mean: 1,456 μs, but median: only 1,215 μs. Those outliers completely distort the average.
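To see how differently the mean and median react to outliers, here is a minimal median helper. The name median_of is mine, and the sketch sorts a private copy so the caller keeps the original run order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts a private copy so the caller's order is kept.
 * Returns -1.0 if the temporary buffer cannot be allocated. */
double median_of(const double *x, size_t n) {
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1.0;
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double med = (n % 2) ? tmp[n / 2]
                         : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return med;
}
```

Using only the seven rows visible in the table (1234, 1198, 5678, 1201, 1245, 8901, 1199), the median is 1,234 μs while the mean is roughly 2,951 μs. Those figures differ from the 100-run numbers quoted above because the elided runs are not included, but the pattern is the same: two outliers drag the mean far from what a typical run looks like.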
Step 1: Reduce System Noise
To get reproducible measurements, you must first reduce noise sources.
Kill Unnecessary Programs
This is basic, but often overlooked:
# Check what's running
ps aux | head -20
# Kill the usual CPU hogs
pkill chrome
pkill slack
pkill spotify
sudo systemctl stop docker # the daemon runs as root, so an unprivileged pkill cannot stop it
Disable Background Services
# Stop cron
sudo systemctl stop cron
# Stop logging daemon (careful—this stops system logs)
sudo systemctl stop rsyslog
# Stop indexing services
sudo systemctl stop tracker-miner-fs # GNOME
sudo systemctl stop mlocate # updatedb
Set Up CPU Isolation
Linux can reserve certain CPU cores exclusively for your benchmark, preventing the kernel from scheduling other work on them:
# Add to boot parameters (requires reboot)
isolcpus=2,3
# Or use cgroups for dynamic isolation
sudo cset shield -c 2,3 -k on
Then pin your benchmark to these isolated cores:
taskset -c 2 ./my_benchmark
CPU Frequency: The Hidden Variable
Modern CPUs don't run at fixed frequencies. They boost when cold, throttle when hot, and save power when idle. Great for laptops, terrible for benchmarking.
Turbo Boost: Friend or Foe?
When there is thermal and power headroom, typically with only a few cores busy, the processor can temporarily run above its base frequency. But as temperature rises, frequency drops back down.
Time (s) Frequency Benchmark Time
0 4.2 GHz 952 μs
10 4.0 GHz 1,000 μs
30 3.6 GHz 1,111 μs
60 3.2 GHz 1,250 μs
Same benchmark, but later runs are 30% slower due to thermal throttling.
Solution: Lock the CPU Frequency
# Check current frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance mode (maximum frequency)
sudo cpupower frequency-set -g performance
# Or set a specific frequency (needs the userspace governor, which not every cpufreq driver offers)
sudo cpupower frequency-set -f 2.0GHz
# Disable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Or for Intel:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Important: Locking to "maximum frequency" isn't always best. If your benchmark runs long, the CPU may thermal-throttle anyway. Choose a frequency the system can sustain indefinitely.
Verify Frequency Stability
# Monitor CPU frequency in real-time
watch -n 0.5 "grep MHz /proc/cpuinfo"
# Or use turbostat for detailed stats
sudo turbostat --interval 1
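It can also help to record the frequency from inside the benchmark, next to each measurement. A minimal sketch, assuming the cpufreq sysfs interface is available: read_long_file and cpu_khz are hypothetical helper names, and scaling_cur_freq reports kHz.

```c
#include <assert.h>
#include <stdio.h>

/* Reads one integer from a sysfs-style text file.
 * Returns -1 if the file is missing or cannot be parsed. */
long read_long_file(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long v = -1;
    if (fscanf(f, "%ld", &v) != 1) v = -1;
    fclose(f);
    return v;
}

/* Current frequency of a core in kHz as reported by cpufreq,
 * or -1 when cpufreq is not exposed (for example inside some VMs). */
long cpu_khz(int cpu) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    return read_long_file(path);
}
```

Calling cpu_khz(2) before and after each measured run, and logging both values, lets you spot runs where the frequency moved. Treat a return value of -1 as "unknown" rather than as an error in your benchmark.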
Cache State: Cold vs. Warm
CPU cache is another hidden variable. The same program can be 10× slower with a cold cache than a warm one.
The Problem
Consider this simple array sum:
long sum_array(int *arr, size_t n) {
long sum = 0;
for (size_t i = 0; i < n; i++) {
sum += arr[i];
}
return sum;
}
On first execution, the data isn't in cache. Every access goes to main memory:
First run (cold cache): 5,234 μs
Second run (warm cache): 523 μs
A 10× difference! If you only measure once, which result did you capture?
Solution: Explicitly Choose Cold or Warm
Option 1: Measure warm cache (steady state)
This is usually what you want—performance during normal operation:
// Warm-up runs (discard results)
for (int i = 0; i < WARMUP; i++) {
sum_array(arr, n);
}
// Measurement runs (record results)
for (int i = 0; i < RUNS; i++) {
times[i] = measure(sum_array, arr, n);
}
Option 2: Measure cold cache (worst case)
Sometimes you need first-run performance, like startup latency:
for (int i = 0; i < RUNS; i++) {
// Flush cache
flush_cache();
// Measure cold cache performance
times[i] = measure(sum_array, arr, n);
}
How to flush the cache:
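// Method 1: Use the clflush instruction (x86 only; need to know addresses)
#include <emmintrin.h> // _mm_clflush, _mm_mfence (SSE2)
#include <stddef.h>
#include <stdint.h>
void flush_array(void *ptr, size_t size) {
    // Start at the cache line containing ptr so an unaligned tail is not missed
    char *p = (char *)((uintptr_t)ptr & ~(uintptr_t)63);
    char *end = (char *)ptr + size;
    for (; p < end; p += 64) { // 64 = cache line size
        _mm_clflush(p);
    }
    _mm_mfence();
}
// Method 2: Walk a large "trash" array to evict old data
void evict_cache(void) {
    static char trash[32 * 1024 * 1024]; // 32MB; must exceed your L3 cache size
    // Write, not just read: on Linux, untouched zero-filled pages can all map
    // to one shared physical page, so reads alone would evict almost nothing
    for (size_t i = 0; i < sizeof(trash); i += 64) {
        ((volatile char *)trash)[i]++;
    }
}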
Key principle: Whatever you choose, be explicit and consistent. Don't mix approaches between runs.
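The loops above call a measure() helper that the text does not define. One possible shape, as a sketch under my own assumptions: it times a single call with CLOCK_MONOTONIC, returns microseconds, and stores the result in a volatile sink so the compiler cannot discard the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* The function under test, as shown earlier in this section. */
long sum_array(int *arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

/* Keeps the computed result observable so the call is not optimized away. */
static volatile long sink;

/* Hypothetical measure(): elapsed time of one call, in microseconds. */
double measure(long (*fn)(int *, size_t), int *arr, size_t n) {
    struct timespec t0, t1;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink = fn(arr, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

For very small arrays the two clock reads dominate the result, so this helper is only meaningful when the call takes far longer than the timer itself.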
ASLR and Memory Layout
Address Space Layout Randomization (ASLR) is a security feature. Each time you run a program, the addresses of the stack, heap, and shared libraries are randomized.
How does this affect performance measurements?
Cache Conflicts
CPU caches use certain address bits to determine which cache set data belongs to. If your data structures happen to land at addresses that conflict, cache efficiency drops dramatically.
Because of ASLR, the same program may have different cache behavior on each run.
Solutions
Option 1: Disable ASLR
# Temporarily disable system-wide (lasts until reboot; write 2 to re-enable)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# Disable for a specific program only
setarch $(uname -m) -R ./my_benchmark
Option 2: Run enough iterations to average it out
If you run enough times, ASLR's effects average out. But this requires more runs and produces higher variance.
My recommendation: Disable ASLR for benchmarking. We want reproducibility, not security.
NUMA: The Multi-Socket Trap
If you're benchmarking on a multi-socket server, there's another pitfall: NUMA (Non-Uniform Memory Access).
In NUMA systems, each CPU socket has its own "local" memory. Accessing local memory is fast; accessing remote memory is slow.
CPU 0 accessing local memory: 100 ns
CPU 0 accessing remote memory: 300 ns
If your program runs on socket 0 but its data is allocated in socket 1's memory, performance tanks.
Solution: Pin Both CPU and Memory
# Bind to node 0's CPUs and memory
numactl --cpunodebind=0 --membind=0 ./my_benchmark
# Or use interleave mode (distribute evenly)
numactl --interleave=all ./my_benchmark
Putting It All Together: Benchmark Environment Checklist
Based on everything above, here's my benchmark environment setup script:
#!/bin/bash
# benchmark_setup.sh - Create a reproducible benchmark environment
echo "=== Setting up benchmark environment ==="
# 1. Check for root privileges
if [ "$EUID" -ne 0 ]; then
echo "Please run as root"
exit 1
fi
# 2. Stop background services
echo "Stopping background services..."
systemctl stop cron
systemctl stop rsyslog
systemctl stop NetworkManager # if networking not needed
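# 3. Set CPU frequency
echo "Setting CPU frequency..."
cpupower frequency-set -g performance
# Turbo control lives in different files depending on the cpufreq driver
if [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
elif [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 0 > /sys/devices/system/cpu/cpufreq/boost
fi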
# 4. Disable ASLR
echo "Disabling ASLR..."
echo 0 > /proc/sys/kernel/randomize_va_space
# 5. Show CPU isolation status
echo "CPU isolation: $(cat /sys/devices/system/cpu/isolated)"
# 6. Display current status
echo ""
echo "=== Current status ==="
echo "CPU frequency: $(cat /proc/cpuinfo | grep MHz | head -1)"
echo "Turbo boost: $(cat /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || echo 'N/A')"
echo "ASLR: $(cat /proc/sys/kernel/randomize_va_space)"
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"
echo ""
echo "Ready for benchmarking!"
echo "Remember to run: taskset -c <isolated_cpu> ./your_benchmark"
Back to Jason's Problem
Remember Jason's 40% variance?
I had him make these changes:
- Close all unnecessary programs
- Lock CPU frequency
- Add warm-up runs
- Pin to a specific CPU core
Results:
Before:
Mean: 1,234 μs
Std Dev: 512 μs (41.5%)
After:
Mean: 1,198 μs
Std Dev: 12 μs (1.0%)
Relative standard deviation dropped from 41.5% to 1.0%. Now that's a measurement you can trust.
"So there was nothing wrong with my program," Jason said. "It was my measurement environment."
"Exactly," I said. "You've just learned the most important lesson in performance analysis: Before you measure your program, measure your measurement environment."
Summary
A reliable measurement environment is the foundation of correct benchmarking. This chapter covered:
System Noise
- Kill unnecessary programs and background services
- Use CPU isolation to reduce kernel interference
- Pin benchmarks to fixed CPU cores with taskset
CPU Frequency
- Lock frequency to avoid turbo boost and thermal throttling
- Choose a frequency the system can sustain
- Verify stability throughout the entire test
Cache State
- Explicitly choose cold cache or warm cache measurement
- Be consistent—don't mix approaches
ASLR and NUMA
- Disable ASLR for reproducibility
- On NUMA systems, bind both CPU and memory to the same node
Chapter 3: Measurement Methodology
Part I: Foundations
"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron
A Tale of Two Timers
It was a simple enough task: measure how long a function takes to execute.
My colleague Lisa used gettimeofday():
struct timeval start, end;
gettimeofday(&start, NULL);
do_something();
gettimeofday(&end, NULL);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000 +
(end.tv_usec - start.tv_usec);
printf("Time: %ld μs\n", elapsed);
I used clock_gettime(CLOCK_MONOTONIC):
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_something();
clock_gettime(CLOCK_MONOTONIC, &end);
long elapsed = (end.tv_sec - start.tv_sec) * 1000000000L +
(end.tv_nsec - start.tv_nsec);
printf("Time: %ld ns\n", elapsed);
We measured the same function but got different results.
Lisa's results bounced between 1,000 and 1,200 microseconds. Mine stayed stable around 1,050 microseconds.
Stranger still: after a sysadmin adjusted NTP (Network Time Protocol), Lisa's measurement came back negative—the function ended before it started.
"That's impossible," she said. "Time doesn't run backwards."
But the timer she was using can.
Wall Clock vs. Monotonic Clock
This is the first lesson in understanding timers: not all clocks are suitable for measuring elapsed time.
Wall Clock (gettimeofday, CLOCK_REALTIME)
A wall clock represents "what time is it right now?" It can be stepped by NTP, changed manually by users, or adjusted for leap seconds. (Daylight saving time changes only the local-time display; the underlying clock counts UTC.)
12:00:00.000 - Start measurement
12:00:01.000 - NTP adjusts time back to 11:59:59.500
11:59:59.600 - End measurement
Elapsed time = -0.4 seconds (negative!)
Wall clocks are for recording when something happened, not how long it took.
Monotonic Clock (CLOCK_MONOTONIC)
A monotonic clock is guaranteed to only move forward. NTP never steps it. On Linux, CLOCK_MONOTONIC may still be slewed (its rate adjusted gradually), while CLOCK_MONOTONIC_RAW is not adjusted at all.
Monotonic: 1000.000 - Start measurement
Monotonic: 1001.050 - End measurement
Elapsed time = 1.050 seconds (always positive)
Rule: When measuring elapsed time, always use a monotonic clock.
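Since the subtraction is easy to get subtly wrong, it can be worth wrapping in a helper. A minimal sketch, with timespec_diff_ns and monotonic_now as names of my own choosing:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds from `start` to `end`; both must come from the same clock. */
int64_t timespec_diff_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
           (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Current CLOCK_MONOTONIC reading. */
struct timespec monotonic_now(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */
    return ts;
}
```

Applied to the monotonic example above (1000.000 to 1001.050), the helper returns 1,050,000,000 ns. Applied to the wall-clock example (12:00:00.000 to 11:59:59.600), the same arithmetic returns -400,000,000 ns, which is why the clock you pass in matters more than the subtraction.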
Timer Precision vs. Timer Resolution
Another common confusion: precision versus resolution.
Resolution: Smallest Reportable Unit
struct timespec res;
clock_getres(CLOCK_MONOTONIC, &res);
printf("Resolution: %ld ns\n", res.tv_nsec);
// Typically outputs: 1 ns
This tells you the timer can report values in 1-nanosecond increments. But this does not mean you can accurately measure 1-nanosecond intervals.
Precision: What You Can Actually Trust
Calling the timer itself takes time:
// Measure timer overhead
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
clock_gettime(CLOCK_MONOTONIC, &t2);
printf("Timer overhead: %ld ns\n",
(t2.tv_sec - t1.tv_sec) * 1000000000L + t2.tv_nsec - t1.tv_nsec);
On my system, this overhead is about 20-50 nanoseconds. This means:
- Measuring a 1 μs event: 2-5% error
- Measuring a 100 ns event: 20-50% error
- Measuring a 10 ns event: completely unreliable
Rule: The interval you're measuring should be at least 100× the timer overhead for less than 1% error.
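The 100× rule is simple arithmetic, and it can be turned into a quick check. The sketch below assumes Python's `time.perf_counter_ns()` as the timer; the helper names are hypothetical, and the overhead it reports will usually be higher than the C figure quoted above because of interpreter cost:

```python
import time


def timer_overhead_ns(samples=1000):
    """Estimate the cost of a back-to-back timer call pair (minimum observed)."""
    best = None
    for _ in range(samples):
        t1 = time.perf_counter_ns()
        t2 = time.perf_counter_ns()
        delta = t2 - t1
        if best is None or delta < best:
            best = delta
    return best


def min_interval_ns(overhead_ns, max_error=0.01):
    """Shortest interval whose relative error from timer overhead stays <= max_error."""
    return overhead_ns / max_error


overhead = timer_overhead_ns()
print(f"Timer overhead: ~{overhead} ns")
print(f"Measure intervals of at least {min_interval_ns(overhead):.0f} ns for <= 1% error")
```

With a 50 ns overhead, for example, the function asks for intervals of at least 5,000 ns (5 μs).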
CPU Cycles: The Most Precise Timer
When you need to measure very short intervals—tens to hundreds of CPU cycles—read the cycle counter directly.
x86: RDTSC
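#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
// On modern x86 the TSC ticks at a constant reference rate, not the current core clock
uint64_t start = __rdtsc();
do_something_very_fast();
uint64_t end = __rdtsc();
printf("Cycles: %" PRIu64 "\n", end - start);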
ARM: CNTVCT_EL0
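// CNTVCT_EL0 is the generic timer's virtual counter. It ticks at the fixed
// frequency reported by CNTFRQ_EL0, not at the core clock, so the result is
// timer ticks rather than true CPU cycles.
uint64_t read_cycles(void) {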
uint64_t val;
asm volatile("mrs %0, cntvct_el0" : "=r" (val));
return val;
}
RISC-V: rdcycle
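// On RV64 this returns the full 64-bit counter. On RV32, rdcycle yields only
// the low 32 bits; the high half is read separately with rdcycleh.
uint64_t read_cycles(void) {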
uint64_t val;
asm volatile("rdcycle %0" : "=r" (val));
return val;
}
Handling Outliers
Real benchmark data will have outliers—extreme high or low values. The question is: should you remove them?
Why Outliers Happen
- Interrupts: System interrupts preempt your program
- Context switches: OS swaps your program out
- Page faults: First access to a new memory page
- GC pauses: If your language has garbage collection
- Thermal throttling: CPU overheats and slows down
Two Approaches
Conservative: Keep all data
If your goal is understanding "real world" behavior, outliers are part of reality.
import numpy as np
# Report full percentiles
p50 = np.percentile(data, 50)
p90 = np.percentile(data, 90)
p99 = np.percentile(data, 99)
p999 = np.percentile(data, 99.9)
This approach is common for latency SLAs ("99% of requests complete within 10ms").
Aggressive: Remove outliers
If your goal is understanding algorithm performance, outliers are noise.
# Remove outliers beyond 3 standard deviations
mean = np.mean(data)
std = np.std(data)
filtered = [x for x in data if abs(x - mean) < 3 * std]
Or use a more robust method:
# Use IQR (Interquartile Range)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
filtered = [x for x in data if q1 - 1.5*iqr <= x <= q3 + 1.5*iqr]
My recommendation: Do both. Report filtered results as the primary data, but also report full percentiles so readers can see the outlier situation.
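One way to "do both" is sketched below. It uses only the standard library; the helper names and the sample latencies are hypothetical, and the percentile helper uses plain linear interpolation between ranks:

```python
import statistics


def percentile(sorted_data, p):
    """Linear-interpolation percentile (p in 0..100) of already-sorted data."""
    if not sorted_data:
        raise ValueError("no data")
    rank = (len(sorted_data) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_data) - 1)
    frac = rank - lo
    return sorted_data[lo] + (sorted_data[hi] - sorted_data[lo]) * frac


def summarize(samples):
    """Report IQR-filtered statistics alongside percentiles of the full data."""
    data = sorted(samples)
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    kept = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return {
        "filtered_mean": statistics.mean(kept),
        "removed": len(data) - len(kept),
        "p50": percentile(data, 50),
        "p99": percentile(data, 99),
        "max": data[-1],
    }


# Hypothetical latencies in microseconds; 5000 is a single interrupted run
latencies = [1042, 1050, 1045, 1055, 1048, 1052, 1047, 1051, 1049, 5000]
report = summarize(latencies)
print(report)
```

The filtered mean describes the algorithm, while the unfiltered `p99` and `max` keep the interrupted run visible to the reader.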
Statistical Significance: Is Your Improvement Real?
You made an optimization. Before: 1,000 μs. After: 980 μs. Is that a 2% improvement?
Not necessarily. It might just be noise.
Hypothesis Testing
The formal approach uses hypothesis testing:
- Null hypothesis (H0): No difference between groups
- Alternative hypothesis (H1): There is a difference
- Run a statistical test (e.g., t-test)
- Check p-value: If p < 0.05, reject H0
from scipy import stats
before = [1000, 1020, 980, 1010, 990, ...]
after = [980, 970, 990, 960, 985, ...]
# Welch's t-test: does not assume the two groups have equal variance
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"p-value: {p_value}")
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Difference might be noise")
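Benchmark timings are often skewed, which strains the t-test's normality assumption. A permutation test is a different technique that avoids that assumption and needs no SciPy. The sketch below is one possible version, not the method used above; the function name, the fixed seed, and the sample values are all assumptions for illustration:

```python
import random
import statistics


def permutation_test(before, after, rounds=5000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the fraction of random relabelings whose mean difference is at
    least as extreme as the observed one (an estimate of the p-value).
    """
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(statistics.mean(after) - statistics.mean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[n_before:]) -
                   statistics.mean(pooled[:n_before]))
        if diff >= observed:
            extreme += 1
    return extreme / rounds


# Hypothetical samples in microseconds
before = [1000, 1020, 980, 1010, 990, 1005, 995, 1015]
after = [980, 970, 990, 960, 985, 975, 965, 988]
p_value = permutation_test(before, after)
print(f"p-value: {p_value:.4f}")
```

The idea: if the optimization made no difference, shuffling the "before" and "after" labels should produce gaps as large as the observed one fairly often. If it almost never does, the difference is unlikely to be noise.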
Effect Size
Even if a difference is "statistically significant," it doesn't mean it's "important." With 10,000 samples, even a 0.1% difference can be "significant."
Effect size tells you how big the difference is:
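# Cohen's d
import numpy as np
def cohens_d(before, after):
    # ddof=1 gives the sample standard deviation
    pooled_std = np.sqrt((np.std(before, ddof=1)**2 + np.std(after, ddof=1)**2) / 2)
    return (np.mean(after) - np.mean(before)) / pooled_std
d = cohens_d(before, after)
# Conventional thresholds (Cohen):
# |d| < 0.2: negligible
# 0.2 <= |d| < 0.5: small
# 0.5 <= |d| < 0.8: medium
# |d| >= 0.8: large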
Practical Significance
Finally, ask yourself: does this improvement matter to users?
- 1,000 μs → 980 μs (2%): Probably unnoticeable
- 100 ms → 50 ms (50%): Users will feel it
- 10 s → 5 s (50%): Users will thank you
Rule: Statistical significance is necessary but not sufficient. You also need practical significance.
Warm-up Revisited
In Chapter 1 we mentioned warm-up. Here's a deeper look at determining how many warm-up iterations you need.
Visual Method: Plot It
The most intuitive approach is plotting each run's time:
Run# Time (μs)
1 5,234 **********************
2 3,891 ****************
3 2,456 **********
4 1,334 *****
5 1,256 *****
6 1,243 *****
7 1,238 *****
...
50 1,241 *****
You can see runs 1-4 are warm-up; from run 5 onward, results stabilize.
Automatic Method: CV Convergence
Calculate the rolling coefficient of variation (CV). When it stabilizes, warm-up is complete:
import numpy as np
def find_warmup(times, threshold=0.05, window=10):
    """
    Find where warm-up ends.
    When rolling CV drops below threshold, consider it stable.
    """
    # len(times) + 1 so the final window is also checked
    for i in range(window, len(times) + 1):
        recent = times[i-window:i]
        cv = np.std(recent) / np.mean(recent)
        if cv < threshold:
            return i - window
    return len(times) // 2 # fallback
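To see the rolling-CV idea in action, here is a standard-library variant of the function applied to the first seven run times from the table above. The small window and the 2% threshold are assumptions chosen to suit such a short series; real data would normally use a longer window:

```python
import statistics


def find_warmup(times, threshold=0.05, window=10):
    """Start index of the first window whose coefficient of variation is below threshold."""
    for i in range(window, len(times) + 1):
        recent = times[i - window:i]
        cv = statistics.pstdev(recent) / statistics.mean(recent)
        if cv < threshold:
            return i - window
    return len(times) // 2  # fallback: no stable window found


# The first seven run times from the table above, in microseconds
times = [5234, 3891, 2456, 1334, 1256, 1243, 1238]
warmup_runs = find_warmup(times, threshold=0.02, window=3)
print(f"Discard the first {warmup_runs} runs")
```

With these settings the function discards the first four runs, which matches the visual reading of the plot.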
Rules of Thumb
- CPU-bound computation: Usually 3-5 warm-up runs (cache and branch-predictor warming)
- Memory-intensive workload: May need 10-20 runs (page tables, TLB)
- JIT languages (Java, JavaScript): May need hundreds to thousands
Important: Always verify your assumptions. Plot the data—don't blindly trust rules of thumb.
A Simple Measurement Framework
Let's put all these concepts together into a reusable framework:
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
typedef struct {
double *samples;
size_t count;
double mean;
double std;
double median;
double p95;
double p99;
} BenchmarkResult;
double get_time_ns(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1e9 + ts.tv_nsec;
}
int compare_double(const void *a, const void *b) {
double da = *(const double *)a;
double db = *(const double *)b;
return (da > db) - (da < db);
}
void analyze_result(BenchmarkResult *r) {
qsort(r->samples, r->count, sizeof(double), compare_double);
// Mean
double sum = 0;
for (size_t i = 0; i < r->count; i++) sum += r->samples[i];
r->mean = sum / r->count;
// Standard deviation
double sq_sum = 0;
for (size_t i = 0; i < r->count; i++) {
sq_sum += (r->samples[i] - r->mean) * (r->samples[i] - r->mean);
}
r->std = sqrt(sq_sum / r->count);
// Percentiles
r->median = r->samples[r->count / 2];
r->p95 = r->samples[(size_t)(r->count * 0.95)];
r->p99 = r->samples[(size_t)(r->count * 0.99)];
}
void print_result(BenchmarkResult *r, const char *name) {
printf("%s:\n", name);
printf(" Mean: %.2f ns (σ = %.2f, CV = %.2f%%)\n",
r->mean, r->std, (r->std / r->mean) * 100);
printf(" Median: %.2f ns\n", r->median);
printf(" P95: %.2f ns, P99: %.2f ns\n", r->p95, r->p99);
}
Usage:
#define WARMUP 100
#define RUNS 1000
void benchmark_example(void) {
BenchmarkResult result;
result.samples = malloc(RUNS * sizeof(double));
if (result.samples == NULL) return; // allocation failed, nothing to measure
result.count = RUNS;
// Warm-up (discard)
for (int i = 0; i < WARMUP; i++) do_something();
// Measurement
for (int i = 0; i < RUNS; i++) {
double start = get_time_ns();
do_something();
result.samples[i] = get_time_ns() - start;
}
analyze_result(&result);
print_result(&result, "do_something()");
free(result.samples);
}
Back to Lisa's Problem
Remember Lisa getting negative elapsed time with gettimeofday()?
That day she learned:
- Use the right timer: CLOCK_MONOTONIC won't go backwards due to NTP
- Understand timer overhead: Don't try to measure intervals that are too short
- CPU time vs. wall time: Choose based on your goal
A week later, she wrote a complete benchmark framework that became our team's standard tool.
"I had no idea timers were this complicated," she said.
"They are," I said. "But once you understand the pitfalls, you can write benchmarks that anyone can trust."
Summary
Correct measurement methodology is the core of reliable benchmarking. This chapter covered:
Timer Selection
- Use CLOCK_MONOTONIC, not wall clocks
- Understand timer overhead; don't measure intervals that are too short
- For cycle-level precision, use CPU cycle counters
CPU Time vs. Wall Time
- Wall time: user-perceived latency
- CPU time: actual CPU usage
- Their ratio tells you CPU utilization
Outlier Handling
- Conservative: keep all data, report percentiles
- Aggressive: remove outliers, but document your method
- Recommendation: do both
Statistical Significance
- Use hypothesis testing to confirm differences are real
- Check effect size to confirm differences matter
- Consider practical significance
Warm-up
- Discard unstable initial runs
- Use visual or automatic methods to determine warm-up count
- Always verify your assumptions
Chapter 4: Presenting Results
Part I: Foundations
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey
The Chart That Lost Us the Contract
This happened a few years ago. Our team spent three months optimizing a critical module, achieving a 35% performance improvement. We were excited to present the results to the client.
My colleague Mark prepared the presentation. He made a bar chart in Excel:
Performance Comparison
Old System ████████████████████████████████████████ 1000 ms
New System █████ 650 ms
Looks great, right? The new system's bar is clearly much shorter.
That was exactly the problem.
The client's CTO frowned. "Wait, where does the Y-axis start?"
"At 600," Mark said. "It makes the difference more visible."
The room went silent for a few seconds.
"So actually," the CTO said, "you went from 1000 to 650. If the Y-axis started at zero, the difference would look like this:"
Performance Comparison (Y-axis from 0)
Old ████████████████████████████████████████████████████████████████████████████████ 1000 ms
New █████████████████████████████████████████████████████████████████ 650 ms
"Your 35% improvement is real," he said. "But your chart made it look like 80%. That makes me wonder if your other data has similar problems."
We didn't get that contract.
Common Misleading Chart Techniques
Mark's mistake is common. Here are techniques frequently used (intentionally or not) to exaggerate results:
1. Truncated Y-Axis
The most common trick. Start the Y-axis at a non-zero value, and small differences look huge.
Misleading (Y starts just below 95):
A █ 95
B ████████████ 100
Honest (Y starts at 0):
A ███████████████████████████████████████████████ 95
B ██████████████████████████████████████████████████ 100
In the truncated version, a 5% difference looks more than ten times larger.
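The size of the distortion is easy to compute: a bar's drawn length is its value minus the axis origin. The sketch below uses a hypothetical helper, `visual_ratio`, with the numbers from the opening story:

```python
def visual_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar lengths for values a and b when the axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


# Values from the opening story: 1000 ms vs 650 ms
honest = visual_ratio(1000, 650)                      # axis from 0
truncated = visual_ratio(1000, 650, axis_start=600)   # axis from 600
print(f"True ratio: {honest:.2f}x, drawn ratio: {truncated:.1f}x")
```

Starting the axis at 600 turns a true ratio of about 1.5 into a drawn ratio of 8.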
2. Selective Data Range
Show only the data points that favor you.
"Our product has been ahead for the last three months!"
(But we were behind for the previous six months—not shown)
3. Dual Y-Axis Abuse
Use two Y-axes with different scales to make unrelated trends appear correlated.
Left Y-axis: Sales (0-1000)
Right Y-axis: Temperature (20-25°C)
"Look! Temperature and sales are perfectly correlated!"
(It's just a coincidence from scale manipulation)
4. 3D Effects
3D charts look fancy but distort visual proportions.
5. Area vs. Length
When using circle sizes to represent values, people confuse area with diameter.
A = 100, B = 200
If using circle area:
A radius = 10
B radius = 14.14 (√2 times)
Visually, B looks only slightly larger, but it's actually 2×.
How to Present Benchmark Results Correctly
Rule 1: Start Y-Axis at Zero (Unless You Have Good Reason)
The only exception is when the data range is truly narrow, and you explicitly label it.
OK: "Note: Y-axis starts at 950 to show fine differences"
NOT OK: Sneakily starting at non-zero, hoping nobody notices
Rule 2: Show Uncertainty
Always include error bars. A bar chart without error bars is incomplete.
Performance (ms)
Mean [95% CI]
Algorithm A: 100 ████████████████████├──┤
Algorithm B: 95 ███████████████████├────┤
If error bars overlap, the difference may not be significant.
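As a rough sketch of where such error bars come from, the code below computes a mean and an approximate 95% confidence interval. It assumes the normal approximation (z = 1.96), which is only reasonable for larger samples; with as few as eight runs, a t-distribution critical value would give a wider interval. The function names and data are hypothetical:

```python
import math
import statistics


def mean_ci95(samples):
    """Mean and approximate 95% confidence interval (normal approximation)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)   # standard error of the mean
    half_width = 1.96 * sem                          # z value for 95%; assumes large n
    return mean, mean - half_width, mean + half_width


def intervals_overlap(ci_a, ci_b):
    """True if two (mean, low, high) intervals overlap."""
    return ci_a[1] <= ci_b[2] and ci_b[1] <= ci_a[2]


# Hypothetical latency samples in ms
algo_a = [100, 102, 98, 105, 97, 101, 99, 103]
algo_b = [95, 93, 97, 94, 96, 95, 92, 98]
ci_a, ci_b = mean_ci95(algo_a), mean_ci95(algo_b)
print(f"A: {ci_a[0]:.1f} [{ci_a[1]:.1f}, {ci_a[2]:.1f}]")
print(f"B: {ci_b[0]:.1f} [{ci_b[1]:.1f}, {ci_b[2]:.1f}]")
print("Overlap:", intervals_overlap(ci_a, ci_b))
```

Overlap is only a quick visual screen. Intervals that overlap slightly can still hide a real difference, so a formal test remains the better check.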
Rule 3: State Sample Size
"N = 1000 runs, 95% confidence interval"
A chart could come from 10 tests or 10,000 tests—the meaning is completely different.
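Rule 4: Provide Raw Data or Distribution
A mean with error bars still hides the shape of the data. Publish the raw samples, or at least a histogram or box plot, so readers can check for long tails and multiple modes themselves.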
Choosing the Right Chart Type
Different data needs different visualization.
Bar Chart: Compare Discrete Categories
Good for: comparing algorithms, systems, configurations
Throughput (ops/sec)
Algorithm A ████████████████████████ 2400
Algorithm B ██████████████████ 1800
Algorithm C ██████████████████████████████ 3000
Note: Bar charts are for independent categories, not trends.
Line Chart: Show Trends
Good for: change over time, change with parameters
Latency vs Data Size
Latency
(ms)
│ ●
│ ●
│ ●
│ ●
│ ●
│ ●
│ ●
│ ●
└────────────────────────────────
1KB 10KB 100KB 1MB 10MB
Data Size
Note: Line charts imply continuity. If your data is discrete, use bars.
Scatter Plot: Show Correlation or Distribution
Good for: relationships between variables, individual run results
Latency vs Throughput
Latency
│ ●
│ ● ●
│ ●●●
│ ●●●●
│ ●●●●
│ ●●●
│ ●●
│ ●
└──────────────────────
Throughput
Box Plot: Compare Distributions
Good for: comparing spread across multiple groups
Latency by Configuration
Config A Config B Config C
│ │ │
○ │ │ ← outlier
│ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐
│ │ │ │ │ │
├─────┤ ├─────┤ ├─────┤ ← median
│ │ │ │ │ │
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
│ ○ │ ← outlier
Box plots show: median (center line), quartiles (box), range (whiskers), outliers (dots).
Heatmap: Multi-Dimensional Data
Good for: effect of two parameters on performance
Throughput Heatmap
Thread Count
1 2 4 8 16
┌────┬────┬────┬────┬────┐
1KB │ ░░ │ ▒▒ │ ▓▓ │ ▓▓ │ ▒▒ │
├────┼────┼────┼────┼────┤
10KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ▓▓ │
├────┼────┼────┼────┼────┤
100KB │ ░░ │ ▒▒ │ ▓▓ │ ██ │ ██ │
└────┴────┴────┴────┴────┘
Buffer
Size ░ Low ▒ Med ▓ High █ Best
Log Scale: When to Use It
When data spans multiple orders of magnitude, linear scale makes small values invisible.
Linear Scale:
Algorithm A █ 1 ms
Algorithm B ██████████████████████████████████████████████████ 1000 ms
Log Scale:
Algorithm A ██████████ 1 ms
Algorithm B ██████████████████████████████████████████████████ 1000 ms
(3 orders of magnitude difference)
When to use log scale:
- Data spans 2+ orders of magnitude
- You care about "ratios" rather than "absolute differences"
- Comparing different scales (like latency percentiles: p50, p99, p99.9)
Caution: Log scale makes large differences look smaller. Ensure readers understand it's logarithmic.
Fair Comparisons
Same Conditions
Comparisons must use identical conditions. If you change multiple variables, you don't know what caused the difference.
Bad:
"Algorithm A on Intel Xeon vs Algorithm B on AMD EPYC"
Good:
"Algorithm A vs Algorithm B, both on Intel Xeon E5-2690"
Baseline Choice
Your choice of baseline affects interpretation.
Scenario 1: A is baseline
A: 1.00× (baseline)
B: 1.35× faster
Scenario 2: B is baseline
A: 0.74× (26% slower)
B: 1.00× (baseline)
Same data, different narrative. Choose a reasonable baseline (usually "current system" or "industry standard") and be consistent.
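The two scenarios are the same ratio read in opposite directions. A small sketch, with hypothetical run times chosen to reproduce the 1.35× figure:

```python
def relative_to(baseline_time, other_time):
    """Speed of `other` relative to `baseline` (>1 means other is faster)."""
    return baseline_time / other_time


# Hypothetical run times chosen so that B is 1.35x faster than A
time_a, time_b = 135.0, 100.0

b_vs_a = relative_to(time_a, time_b)   # A as baseline
a_vs_b = relative_to(time_b, time_a)   # B as baseline
print(f"A baseline: B = {b_vs_a:.2f}x")
print(f"B baseline: A = {a_vs_b:.2f}x ({(1 - a_vs_b) * 100:.0f}% slower)")
```

Note that "35% faster" and "26% slower" describe the same pair of measurements, which is why a report should state its baseline explicitly.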
Avoid Cherry-Picking
Don't only show favorable test cases.
Bad:
"Our system is 3× faster!" (on one specific workload we optimized for)
Good:
"Our system is 3× faster on workload A, 1.2× faster on workload B,
and 0.9× (10% slower) on workload C"
Report all results honestly, including where you perform worse.
Structure of a Benchmark Report
A complete benchmark report should include:
1. Executive Summary
One paragraph summarizing the key findings. For people who don't have time to read the full report.
2. Test Environment
## Test Environment
- **Hardware**: Intel Xeon E5-2690 v4 @ 2.6GHz, 128GB RAM
- **OS**: Ubuntu 22.04 LTS, kernel 5.15.0
- **Compiler**: GCC 11.2 with -O3
- **Date**: 2024-01-15
3. Methodology
- How many runs?
- How many warm-up iterations?
- How were outliers handled?
- What statistical methods were used?
4. Results
Charts + data tables. Charts give visual impression; tables give precise numbers.
5. Analysis
Explain the results. Why is A faster than B? Where are the bottlenecks?
6. Limitations
Honestly state test limitations.
## Limitations
- Tests performed on a single machine; results may vary on different hardware
- Only tested with synthetic workloads; real-world performance may differ
- Memory-bound workloads not covered in this benchmark
7. Raw Data
Provide raw data for readers to analyze themselves (in appendix or via link).
Practical Tools
Simple Charts: gnuplot
set terminal png size 800,600
set output 'benchmark.png'
set title 'Algorithm Performance'
set xlabel 'Data Size'
set ylabel 'Time (ms)'
set style data linespoints
plot 'data.txt' using 1:2 title 'Algorithm A', \
'data.txt' using 1:3 title 'Algorithm B'
Statistical Charts: Python + matplotlib
import matplotlib.pyplot as plt
import numpy as np
data_a = [100, 102, 98, 105, 97, ...]
data_b = [95, 93, 97, 94, 96, ...]
fig, ax = plt.subplots()
bp = ax.boxplot([data_a, data_b])
ax.set_xticklabels(['Algorithm A', 'Algorithm B'])  # the labels= argument of boxplot is deprecated since Matplotlib 3.9
ax.set_ylabel('Latency (μs)')
ax.set_title('Latency Comparison')
plt.savefig('comparison.png', dpi=150)
Interactive: Jupyter Notebook
Jupyter Notebooks let you combine code, data, charts, and analysis text in one place—easy to reproduce and share.
Back to Mark's Story
After that failure, our team established visualization standards:
- Y-axis starts at zero (unless explicitly labeled)
- Always include error bars
- State sample size and test environment
- Provide raw data
- Report all results honestly, including bad ones
Six months later, we had another chance to present to the same client. This time our charts didn't look as "impressive," but the CTO said:
"This is how I want to see data presented. Your improvement is 35%, and your chart clearly shows 35%—no more, no less. That makes me trust your other data too."
We got that contract.
Summary
Presenting benchmark results correctly is as important as measuring correctly. This chapter covered:
Avoiding Misleading Charts
- Don't truncate Y-axis (unless clearly labeled)
- Don't cherry-pick data ranges
- Avoid 3D effects and misleading area comparisons
Correct Visualization
- Always show error bars
- State sample size and test conditions
- Provide raw data or distributions
Choosing the Right Chart
- Bar chart: compare discrete categories
- Line chart: show trends
- Scatter plot: correlation
- Box plot: compare distributions
- Heatmap: multi-dimensional data
Fair Comparisons
- Compare under identical conditions
- Choose a reasonable baseline
- Report all results, not just favorable ones
Complete Reports
- Executive summary
- Test environment and methodology
- Results and analysis
- Limitations
- Raw data
Chapter 5: CPU Benchmarks
Part II: Tools
"Benchmarks are like statistics: you can prove anything with them if you try hard enough." — Unknown
The Dhrystone Revelation
In 1984, Reinhold Weicker released the Dhrystone benchmark. It's a short C program designed to measure CPU integer performance. More than forty years later, it's still widely used.
But Dhrystone has a fundamental problem. Let me start with a story.
A few years ago, I was evaluating two embedded processors. Vendor A claimed 3.0 DMIPS/MHz; Vendor B claimed 2.8 DMIPS/MHz. A looked faster, right?
We bought two development boards and ran Dhrystone:
Chip A: 3.1 DMIPS/MHz (matches spec)
Chip B: 2.9 DMIPS/MHz (matches spec)
Great, specs are accurate. Then we ran our actual application—an image processing pipeline:
Chip A: 45 fps
Chip B: 62 fps
Wait, Chip B is 38% faster? But A has higher DMIPS!
This is Dhrystone's problem.
Why Dhrystone Is Unreliable
Problem 1: Too Small, Fits in Cache
The entire Dhrystone program is only a few KB. On modern processors, it fits entirely in L1 instruction cache. This means it measures "best case," not real-world performance.
Dhrystone code size: ~4 KB
L1 I-cache size: 32-64 KB
Result: 100% cache hit rate (unrealistic)
Problem 2: Compilers Can "Cheat"
Dhrystone's source code has computations that can be optimized away. Smart compilers can dramatically boost scores.
// A piece of Dhrystone code
Proc_1(Ptr_Val_Par)
{
// This function's result might not be used
// Compiler might optimize the entire function away
}
This is why DMIPS numbers sometimes include compiler versions:
"3.0 DMIPS/MHz (GCC 4.8, -O2)"
"4.2 DMIPS/MHz (Commercial Compiler X, -O3)"
Same chip, different compilers, 40% score difference. Are we measuring CPU or compiler?
Problem 3: Doesn't Represent Real Workloads
Dhrystone was designed in 1984, based on "typical" instruction distributions of that era. Modern programs are completely different:
- More memory access
- More complex control flow
- Larger working sets
- More SIMD and floating-point operations
Using Dhrystone to predict modern application performance is like using 1984 traffic data to predict today's congestion.
CoreMark: The Modern Alternative
EEMBC (Embedded Microprocessor Benchmark Consortium) released CoreMark in 2009 as a Dhrystone replacement.
CoreMark's Improvements
1. Prevents Compiler Cheating
CoreMark results are validated. If the compiler optimizes away computations, validation fails.
// CoreMark uses CRC to validate results
crc = crc_calc(result);
if (crc != EXPECTED_CRC) {
// Compiler cheated, result invalid
}
2. Larger Code Footprint
CoreMark is about 16-32 KB—larger than Dhrystone, but may still fit in L1 cache.
3. More Modern Workload Mix
Includes list processing, matrix operations, state machines—closer to modern applications.
CoreMark's Limitations
CoreMark is better than Dhrystone, but still has limits:
- Still synthetic — not a real application
- Still small — mainly measures cache-hot performance
- Single score — can't distinguish different workload types
SPEC CPU: The Industry Gold Standard
For serious CPU performance evaluation, SPEC CPU is the industry standard.
What Is SPEC CPU
SPEC (Standard Performance Evaluation Corporation) maintains several benchmark suites. SPEC CPU includes:
- SPECint: Integer operations (compilers, compression, database engines, etc.)
- SPECfp: Floating-point operations (scientific computing, simulation, etc.)
Each suite contains a dozen real applications, not synthetic code.
SPEC CPU 2006 Composition
SPECint 2006 (Integer)
----------------------
400.perlbench Perl interpreter
401.bzip2 Compression
403.gcc C compiler
429.mcf Combinatorial optimization
445.gobmk AI: Go game
456.hmmer Search gene sequence
458.sjeng AI: Chess
462.libquantum Quantum computing simulation
464.h264ref Video compression
471.omnetpp Network simulation
473.astar Path-finding
483.xalancbmk XML processing
SPECfp 2006 (Floating Point)
----------------------------
410.bwaves Fluid dynamics
416.gamess Quantum chemistry
433.milc Physics: QCD
434.zeusmp Physics: CFD
... (and more)
SPEC 2006 is still widely used in academia because:
- Many published papers use 2006 as baseline
- Rich historical data for comparison
- Some benchmarks (like mcf, gcc) are classic memory-bound and compute-bound representatives
SPEC CPU 2017 Composition
SPECint 2017 Rate (Integer)
----------------------------
500.perlbench_r Perl interpreter
502.gcc_r C compiler
505.mcf_r Route planning
520.omnetpp_r Network simulation
523.xalancbmk_r XML processing
525.x264_r Video compression
531.deepsjeng_r AI game playing
541.leela_r Monte Carlo Go
548.exchange2_r AI puzzle solving
557.xz_r Data compression
SPECfp 2017 Rate (Floating Point)
---------------------------------
503.bwaves_r Fluid dynamics
507.cactuBSSN_r Physics
508.namd_r Molecular dynamics
510.parest_r Biomedical imaging
511.povray_r Ray tracing
... (and more)
2017 version improvements:
- Larger working sets (reflecting modern applications)
- More multi-threaded workloads (rate and speed versions)
- Removed some outdated benchmarks
- Added AI/ML-related workloads (like leela)
Why SPEC Is More Trustworthy
1. Real Applications
These aren't synthetic code written for benchmarking. They're actually used software.
2. Strict Execution Rules
- Must run complete workloads (no cherry-picking)
- Must report complete environment configuration
- Results must be reviewed by SPEC before publication
3. Composite Score from Multiple Workloads
A single workload can be specifically optimized. But optimizing a dozen different applications simultaneously requires genuine architectural improvements.
SPEC's Downsides
- Expensive — Commercial licensing isn't cheap
- Time-consuming — Running the full suite can take days
- Complex — Requires expertise to set up and interpret correctly
For embedded systems and everyday comparisons, SPEC may be overkill.
Whetstone: The Floating-Point Veteran
Whetstone is a floating-point benchmark released in 1972—even older than Dhrystone. It measures MWIPS (Millions of Whetstone Instructions Per Second).
Why People Still Use It
- Historical data — Decades of data for comparison
- Simple — Runs in minutes
- Floating-point focus — If you only care about FP performance
Why You Shouldn't Use It
Same problems as Dhrystone: too old, too small, too easy to optimize.
Modern alternatives are LINPACK (for HPC rankings) or SPECfp.
How to Use CPU Benchmarks Correctly
Rule 1: Know What You're Measuring
Each benchmark measures different things:
| Benchmark | Primary Measurement | Use Case |
|---|---|---|
| Dhrystone | Integer ops (small program) | Quick comparison, embedded |
| CoreMark | Integer ops (more modern) | Embedded, MCU |
| SPEC CPU | Real application performance | Servers, desktops |
| Whetstone | Floating-point (old) | Historical comparison |
| LINPACK | Linear algebra | HPC |
Rule 2: Don't Just Look at a Single Number
Bad: "Chip A: 5000 CoreMark"
Good: "Chip A: 5000 CoreMark @ 1GHz
- CPU: ARM Cortex-A72, 32KB L1-I, 32KB L1-D, 1MB L2
- Compiler: GCC 11.2 -O3 -mcpu=cortex-a72
- CoreMark/MHz: 5.0"
A single number hides too much information. Reports should include hardware specs, compiler version, and flags.
Rule 3: Ensure Identical Conditions When Comparing
Bad: "Chip A (3.0 GHz): 15000 CoreMark
Chip B (2.5 GHz): 12000 CoreMark
Conclusion: A is faster"
Good: "Chip A: 5000 CoreMark/GHz
Chip B: 4800 CoreMark/GHz
Conclusion: At same frequency, A is 4% faster"
Normalize by clock frequency (per-MHz or per-GHz) or by power (per-watt) for a fair comparison.
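The normalization in the example above is a single division per chip. As a sketch (`per_ghz` is a hypothetical helper name):

```python
def per_ghz(score, freq_ghz):
    """Normalize a benchmark score by clock frequency."""
    return score / freq_ghz


# Figures from the example above
chip_a = per_ghz(15000, 3.0)
chip_b = per_ghz(12000, 2.5)
advantage = (chip_a / chip_b - 1) * 100
print(f"Chip A: {chip_a:.0f} CoreMark/GHz")
print(f"Chip B: {chip_b:.0f} CoreMark/GHz")
print(f"A leads by {advantage:.1f}% at equal frequency")
```

The raw scores suggest a 25% lead for Chip A; per GHz the lead shrinks to about 4%, and the rest comes from clock speed alone.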
Rule 4: Cross-Validate with Multiple Benchmarks
Back to my opening story—Chip A had higher DMIPS, but Chip B was faster in practice.
If we had run more benchmarks:
Chip A:
Dhrystone: 3.1 DMIPS/MHz
CoreMark: 3.2 CM/MHz
Memory BW: 1.5 GB/s
Chip B:
Dhrystone: 2.9 DMIPS/MHz
CoreMark: 3.0 CM/MHz
Memory BW: 3.2 GB/s ← Big difference here!
Chip B's memory bandwidth was more than 2× that of Chip A. Our image processing pipeline was memory-bound, so B was faster.
A single benchmark is never enough.
Rule 5: Be Careful with Cross-Architecture Comparisons
Dhrystone/CoreMark scores across different CPU architectures can't be directly compared:
Typical DMIPS/MHz Reference Values (varies with compiler and optimization)
──────────────────────────────────────────────────────────────────────────
Architecture DMIPS/MHz CoreMark/MHz
──────────────────────────────────────────────────────────────────────────
ARM Cortex-M0+ 0.95 2.4
ARM Cortex-M3 1.25 3.3
ARM Cortex-M4 1.25 3.4
ARM Cortex-M7 2.14 5.0
ARM Cortex-A53 2.3 5.5
ARM Cortex-A72 4.7 8.0
──────────────────────────────────────────────────────────────────────────
RISC-V RV32IMC 1.2-1.8 2.5-3.5
SiFive E31 (RV32IMAC) 1.61 3.1
SiFive E76 (RV32IMAFC) 2.36 4.5
SiFive U74 (RV64GC) 2.5 5.0
──────────────────────────────────────────────────────────────────────────
x86 Skylake ~5.0 ~8.0
x86 Zen 3 ~5.5 ~9.0
──────────────────────────────────────────────────────────────────────────
Note: These numbers are highly dependent on compiler version, optimization level, and ISA extensions. The same RISC-V core can vary 30% across different compilers.
Practical Tips for Running Benchmarks
Setting Up the Environment
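# 1. Lock CPU frequency: pin min and max to the same value
#    (-f only takes effect with the userspace governor, so use -d/-u here)
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 2.0GHz -u 2.0GHz
# 2. Disable turbo (the path depends on the cpufreq driver)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# With the intel_pstate driver, use instead:
# echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 3. Pin to one CPU and raise priority in the same invocation
sudo nice -n -20 taskset -c 0 ./coremark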
Record Complete Environment
## Benchmark Environment
- **CPU**: Intel Core i7-10700 @ 2.9 GHz (locked)
- **Memory**: 32GB DDR4-3200
- **OS**: Ubuntu 22.04, kernel 5.15.0
- **Compiler**: GCC 11.2.0
- **Flags**: -O3 -march=native
- **CoreMark Version**: 1.01
- **Iterations**: 250000 (runtime ~10 seconds)
- **Date**: 2024-01-15
Run Multiple Times, Report Statistics
CoreMark Results (10 runs):
Mean: 24567.3 iterations/sec
StdDev: 123.4 (0.5%)
Min: 24312
Max: 24789
Back to That Image Processing Project
When we discovered Chip A had higher Dhrystone scores but worse real performance, I learned an important lesson:
Benchmarks are tools, not answers.
We ultimately chose Chip B because our application was memory-bound. If our application had been compute-bound, we might have chosen Chip A.
The correct approach is:
- First understand your workload characteristics (CPU-bound? Memory-bound? I/O-bound?)
- Choose appropriate benchmarks to evaluate
- Cross-validate with multiple benchmarks
- Finally, test on your actual application
No benchmark can replace testing on your actual application.
Summary
CPU benchmarks are tools for evaluating processor performance, but each has limitations:
Dhrystone
- Pros: Fast, universal, lots of historical data
- Cons: Too small, can be compiler-optimized, doesn't represent modern workloads
- Use for: Quick embedded system comparisons
CoreMark
- Pros: More modern than Dhrystone, anti-cheat design
- Cons: Still synthetic, still small
- Use for: Embedded systems, MCU evaluation
SPEC CPU
- Pros: Real applications, strict rules, industry standard
- Cons: Expensive, time-consuming, complex
- Use for: Formal server/desktop system evaluation
Correct Usage
- Know what each benchmark measures
- Don't just look at a single number
- Ensure identical comparison conditions
- Cross-validate with multiple benchmarks
- Ultimately test on your actual application
Chapter 6: Memory Benchmarks
Part II: Tools
"Memory is the new disk, and disk is the new tape." — Jim Gray
The Afternoon When O(1) Was Slower Than O(n)
"This can't be right."
I stared at the numbers on my screen, making sure I wasn't misreading them. Our hash table lookup (theoretically O(1)) was slower than linear search (O(n)). On an array of just 64 elements.
This was a few years ago. I was optimizing a lookup table in an embedded system. The original implementation was simple linear search; I "improved" it to a hash table, expecting massive performance gains.
The results:
Linear search (64 elements): 180 cycles
Hash table lookup: 340 cycles
The hash table was almost twice as slow.
I was confused at the time. Later I learned to use memory benchmarks to understand this phenomenon. The problem wasn't the algorithm—it was the memory access pattern.
Memory Is the Bottleneck in Modern Systems
Let's look at some numbers. Here's a typical modern processor's memory hierarchy:
Level Size Latency Bandwidth
─────────────────────────────────────────────────
Register ~1 KB 0 cycles N/A
L1 Cache 32-64 KB 3-4 cycles ~1 TB/s
L2 Cache 256-512 KB 10-14 cycles ~500 GB/s
L3 Cache 8-32 MB 30-50 cycles ~200 GB/s
DRAM 16-128 GB 100-300 cycles ~50 GB/s
─────────────────────────────────────────────────
From L1 cache to DRAM, latency grows by roughly 25-100×. This means:
- A single cache miss can waste 100 CPU cycles
- In those 100 cycles, the CPU could execute hundreds of instructions
- If your algorithm has poor memory access patterns, the CPU spends most of its time waiting for memory
Back to my hash table problem. Linear search operates on 64 contiguous elements, all in L1 cache. Hash table accesses are effectively random, so each lookup may miss the cache.
Linear search: 64 × 3 cycles (L1 hit) = 192 cycles
Hash table: 1 hash + 1-2 random access × 150 cycles = 300+ cycles
This is why understanding memory performance is so important.
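The back-of-the-envelope estimate above can be written as a tiny cost model. This is a sketch under stated assumptions: the cycle counts are the rough figures from the text, the 20-cycle hash cost is a made-up placeholder, and real hardware will deviate from all of them:

```python
def linear_search_cycles(n, l1_hit_cycles=3):
    """Worst case: every one of n contiguous elements is read from L1."""
    return n * l1_hit_cycles


def hash_lookup_cycles(hash_cycles=20, accesses=2, miss_cycles=150):
    """One hash computation plus a few dependent accesses that miss the cache."""
    return hash_cycles + accesses * miss_cycles


def break_even_elements(l1_hit_cycles=3, **hash_kwargs):
    """Smallest n at which linear search costs more than the hash lookup."""
    return hash_lookup_cycles(**hash_kwargs) // l1_hit_cycles + 1


print("Linear, 64 elements:", linear_search_cycles(64), "cycles")
print("Hash lookup:", hash_lookup_cycles(), "cycles")
print("Linear search wins below about", break_even_elements(), "elements")
```

Under these assumptions the hash table only pays off once the array grows past roughly a hundred elements, which is why the 64-element case went the "wrong" way.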
LMbench: The Classic Memory Benchmark
LMbench is a benchmark suite developed by Larry McVoy in the 1990s. Though old, the fundamental concepts it measures remain important.
Memory Latency Measurement
LMbench uses pointer chasing to measure memory latency:
// Pointer chasing: each node points to the next (random location)
struct node {
struct node *next;
char padding[STRIDE - sizeof(void*)];
};
// Measure latency
void measure_latency(struct node *head, int count) {
struct node *p = head;
for (int i = 0; i < count; i++) {
p = p->next; // Must wait for previous access to complete
}
}
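The key to this technique: each load's address comes from the result of the previous load, so the CPU cannot overlap the accesses. And because the nodes are linked in random order, the hardware prefetcher cannot guess the next address. This measures true latency, not bandwidth.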
Memory Bandwidth Measurement
Bandwidth is measured differently:
// Sequential read - measures bandwidth
void measure_bandwidth(char *buffer, size_t size) {
volatile int sum = 0;
for (size_t i = 0; i < size; i += 64) { // Each cache line
sum += buffer[i];
}
}
Here accesses are sequential; the CPU can prefetch. We're measuring how fast the system can "feed" the CPU.
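To turn a loop like this into a bandwidth figure you also need a timer and a byte count. Here is a minimal sketch, assuming a POSIX system with clock_gettime(). The helper names timed_sequential_read and bandwidth_gbps are made up for this example; this is not LMbench's code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std=c11 */
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// Convert a byte count and an elapsed time into GB/s (1 GB = 1e9 bytes).
static double bandwidth_gbps(size_t bytes, double seconds) {
    return (double)bytes / seconds / 1e9;
}

// Sum every 64-bit word so that all bytes of the buffer are really read.
// The sum is handed back through *checksum so the loop cannot be
// discarded as dead code. Returns the elapsed wall-clock time in seconds.
static double timed_sequential_read(const uint64_t *buf, size_t words,
                                    uint64_t *checksum) {
    struct timespec t0, t1;
    uint64_t sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < words; i++) {
        sum += buf[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *checksum = sum;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Pass the measured seconds and words * 8 bytes to bandwidth_gbps. Use a buffer much larger than the last-level cache and repeat the measurement a few times, because the first pass also pays for page faults.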
STREAM Benchmark
STREAM is a memory bandwidth benchmark developed by John McCalpin, specifically measuring sustained memory bandwidth.
Four Core Tests
// STREAM's four operations
// Assume a, b, c are large arrays, scalar is a constant
// 1. Copy: c = a
for (int i = 0; i < N; i++)
c[i] = a[i];
// 2. Scale: b = scalar * c
for (int i = 0; i < N; i++)
b[i] = scalar * c[i];
// 3. Add: c = a + b
for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];
// 4. Triad: a = b + scalar * c
for (int i = 0; i < N; i++)
a[i] = b[i] + scalar * c[i];
Why These Four?
Each operation stresses a different mix of reads, writes, and arithmetic (byte counts assume 8-byte double elements):
| Operation | Reads | Writes | Bytes/Element |
|---|---|---|---|
| Copy | 1 | 1 | 16 (read 8 + write 8) |
| Scale | 1 | 1 | 16 |
| Add | 2 | 1 | 24 |
| Triad | 2 | 1 | 24 |
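The Bytes/Element column is what converts a measured time into a bandwidth number. A small sketch of that bookkeeping, assuming 8-byte doubles and ignoring write-allocate traffic. The names stream_bytes and stream_gbps are hypothetical, not part of the official STREAM source.

```c
#include <stddef.h>

// STREAM kernels, in the order used by the table above.
enum stream_kernel { STREAM_COPY, STREAM_SCALE, STREAM_ADD, STREAM_TRIAD };

// Bytes moved by one pass over n elements: Copy and Scale touch two
// arrays (one read, one written), Add and Triad touch three.
static size_t stream_bytes(enum stream_kernel k, size_t n) {
    size_t arrays = (k == STREAM_COPY || k == STREAM_SCALE) ? 2 : 3;
    return arrays * sizeof(double) * n;
}

// Triad: a = b + scalar * c
static void stream_triad(double *a, const double *b, const double *c,
                         double scalar, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}

// Bandwidth in GB/s (1 GB = 1e9 bytes) for one timed pass.
static double stream_gbps(enum stream_kernel k, size_t n, double seconds) {
    return (double)stream_bytes(k, n) / seconds / 1e9;
}
```

Time one call to stream_triad over arrays far larger than the last-level cache, then pass the elapsed seconds to stream_gbps(STREAM_TRIAD, n, seconds).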
Pointer Chasing: Measuring True Latency
STREAM measures bandwidth, but sometimes you need to know latency. Pointer chasing is the standard method.
Basic Principle
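// Create a randomly linked array that forms ONE cycle through every slot
void setup_pointer_chase(void **array, size_t count) {
if (count == 0) return;
// Start with every slot pointing at itself
for (size_t i = 0; i < count; i++) {
array[i] = &array[i];
}
// Then shuffle with Sattolo's variant of Fisher-Yates (j < i, never j == i).
// A plain Fisher-Yates shuffle of the links can split the chain into
// several short cycles, and the chase may then stay inside the cache.
for (size_t i = count - 1; i > 0; i--) {
size_t j = rand() % i;
void *temp = array[i];
array[i] = array[j];
array[j] = temp;
}
}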
// Measure
uint64_t measure_latency(void **array, size_t iterations) {
void **p = array;
uint64_t start = rdtsc();
for (size_t i = 0; i < iterations; i++) {
p = (void **)*p; // Depends on previous result
}
uint64_t end = rdtsc();
// Prevent optimization
volatile void *sink = p;
(void)sink;
return (end - start) / iterations;
}
Why Shuffle?
Without shuffling, the access pattern is sequential, and the CPU prefetcher will preload the next location. After shuffling, access is random—each is a true memory access.
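One subtlety is easy to miss: the chase only measures the whole working set if the shuffled links form a single cycle covering every slot. If the chain breaks into short loops, you end up timing a few cache-resident lines. The sketch below links the slots with Sattolo's algorithm, which always produces one cycle, and then checks the result. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

// Link the slots into one random cycle (Sattolo's algorithm): start with
// every slot pointing at itself, then swap slot i with a slot j chosen
// strictly below i.
static void link_random_cycle(void **array, size_t count) {
    if (count == 0) {
        return;
    }
    for (size_t i = 0; i < count; i++) {
        array[i] = &array[i];
    }
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   // j < i, never j == i
        void *tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}

// Follow the chain from slot 0 and count hops until it returns to slot 0.
// Stops after `count` hops, so it terminates even on a malformed chain.
static size_t chase_cycle_length(void **array, size_t count) {
    void **p = array;
    size_t hops = 0;
    do {
        p = (void **)*p;
        hops++;
    } while (p != array && hops < count);
    return hops;
}
```

A result equal to count means the chase touches every slot before repeating. Note that rand() % i is slightly biased and limited by RAND_MAX, so very large arrays call for a better random number generator.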
Measuring Different Working Set Sizes
// Measure L1, L2, L3, DRAM latency
size_t sizes[] = {
8 * 1024, // 8 KB - should be in L1
64 * 1024, // 64 KB - probably in L2
512 * 1024, // 512 KB - should be in L2/L3
4 * 1024 * 1024, // 4 MB - should be in L3
64 * 1024 * 1024 // 64 MB - should be in DRAM
};
for (int i = 0; i < 5; i++) {
size_t count = sizes[i] / sizeof(void*);
void **array = malloc(sizes[i]);
setup_pointer_chase(array, count);
uint64_t latency = measure_latency(array, 10000000);
printf("%8zu KB: %3llu cycles\n", sizes[i] / 1024, (unsigned long long)latency);
free(array);
}
Typical results:
Working Set Latency
8 KB: 4 cycles (L1)
64 KB: 12 cycles (L2)
512 KB: 35 cycles (L3)
4096 KB: 45 cycles (L3)
65536 KB: 150 cycles (DRAM)
This is a visualization of the memory hierarchy. You can clearly see the latency difference at each level.
TLB Effects
Memory access isn't only affected by cache—there's also the TLB (Translation Lookaside Buffer).
What Is TLB
Modern systems use virtual memory; every memory access requires address translation. TLB is a cache for the page table:
Virtual Address → TLB lookup → Physical Address → Cache/Memory
↓
TLB miss?
↓
Page table walk
(slow, may require multiple memory accesses)
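What the TLB actually looks up is the virtual page number, the upper bits of the address. The low bits (the offset within the page) pass through translation unchanged. A minimal sketch for 4 KiB pages; the macro and function names are illustrative.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u   /* 4 KiB pages: 2^12 bytes */
#define PAGE_OFFSET_MASK ((UINT64_C(1) << PAGE_SHIFT) - 1)

// Virtual page number: the part of the address the TLB translates.
static uint64_t virtual_page_number(uint64_t vaddr) {
    return vaddr >> PAGE_SHIFT;
}

// Offset within the page: identical in the virtual and physical address.
static uint64_t page_offset(uint64_t vaddr) {
    return vaddr & PAGE_OFFSET_MASK;
}

// Two addresses can share a TLB entry only if they lie in the same page.
static int same_page(uint64_t a, uint64_t b) {
    return virtual_page_number(a) == virtual_page_number(b);
}
```

For example, address 0x12345678 splits into page number 0x12345 and offset 0x678.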
Measuring TLB Effects
// Access once per page
#define PAGE_SIZE 4096
void measure_tlb(char *buffer, size_t pages) {
volatile int sum = 0;
for (size_t i = 0; i < pages; i++) {
sum += buffer[i * PAGE_SIZE]; // One access per page
}
}
If pages exceeds the number of TLB entries, you'll start seeing TLB miss costs.
Typical results:
Pages accessed Latency per access
16: 4 cycles (TLB hit)
64: 4 cycles (TLB hit)
256: 5 cycles (TLB misses starting)
1024: 25 cycles (heavy TLB misses)
4096: 35 cycles (TLB thrashing)
Huge Pages
One solution to TLB problems is using huge pages (2MB or 1GB):
# Check huge pages
cat /proc/meminfo | grep Huge
# Allocate huge pages
echo 100 | sudo tee /proc/sys/vm/nr_hugepages
# Use in program
#include <sys/mman.h>
void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
With 2MB huge pages, you need only 1/512 the TLB entries to cover the same memory range.
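The 1/512 figure is just the ratio of the page sizes. A short sketch of the arithmetic; the function names are mine, and any entry count you pass to tlb_reach_bytes is an assumption you should check against your CPU's documentation.

```c
#include <stdint.h>

// Number of TLB entries needed to map `bytes` of memory (rounded up).
static uint64_t tlb_entries_needed(uint64_t bytes, uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

// "TLB reach": how much memory a TLB with `entries` entries maps at once.
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size) {
    return entries * page_size;
}
```

Mapping 1 GiB takes 262,144 entries with 4 KiB pages but only 512 with 2 MiB pages. Seen the other way, a hypothetical 64-entry TLB reaches 256 KiB with 4 KiB pages and 128 MiB with 2 MiB pages.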
Common Pitfalls
Pitfall 1: Array Too Small
If your test array fits entirely in L1 cache, you're not measuring memory performance:
// Bad: only measures L1 cache
char buffer[4096];
measure_bandwidth(buffer, 4096);
// Good: ensure it exceeds cache size
char *buffer = malloc(100 * 1024 * 1024); // 100 MB
measure_bandwidth(buffer, 100 * 1024 * 1024);
Pitfall 2: Compiler Optimization
The compiler might optimize away your memory accesses:
// Bad: compiler might optimize away
int sum = 0;
for (int i = 0; i < N; i++) {
sum += array[i];
}
// Good: use volatile to prevent optimization
volatile int sum = 0;
for (int i = 0; i < N; i++) {
sum += array[i];
}
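A volatile accumulator works, but it forces a load and a store of sum on every iteration, which can distort the very loop you are timing. A common alternative is to keep the accumulator in a register and mark only the final value as "used". The sketch below does this with an empty inline-asm statement, which is a GCC/Clang extension. The helper name keep_alive is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Tell the compiler that `value` is observed, without emitting any
// instruction. GCC/Clang only; other compilers need another mechanism.
static inline void keep_alive(uint64_t value) {
    __asm__ volatile ("" : : "r"(value) : "memory");
}

// Sum the array with an ordinary accumulator, then mark the result as
// used so the loop cannot be removed as dead code.
static uint64_t sum_array(const uint32_t *array, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += array[i];
    }
    keep_alive(sum);
    return sum;
}
```

Either way, check the generated assembly once to confirm the loads you meant to measure are still there.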
Pitfall 3: Not Considering Prefetcher
Modern CPU prefetchers are smart. Sequential access gets prefetched, showing lower latency:
// This gets prefetched, not true memory latency
for (int i = 0; i < N; i++) {
sum += array[i];
}
// Use pointer chasing to measure true latency
p = *p; // Must wait for previous access to complete
Pitfall 4: NUMA Effects
On multi-socket systems, memory location matters:
# Check NUMA topology
numactl --hardware
# Bind to specific node
numactl --cpunodebind=0 --membind=0 ./benchmark
Back to That Hash Table
Now let's re-analyze my hash table problem with this knowledge.
Linear search (64 elements):
- Array size: 64 × 8 bytes = 512 bytes
- Entirely in L1 cache
- Sequential access, prefetcher effective
- Each comparison: ~3 cycles
- Average 32 comparisons per successful lookup: 32 × 3 = 96 cycles
- Plus some overhead: ~180 cycles ✓
Hash table lookup:
- Hash computation: ~20 cycles
- Bucket access: possible cache miss, ~100 cycles
- If collision, another random access
- Total: ~200-400 cycles ✓
Lesson: On small datasets, cache-friendly simple algorithms are often faster than "clever" algorithms.
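The same back-of-envelope model can estimate where the crossover lies. This is only a sketch built on the rough per-operation costs used above. The function names are made up, and real numbers have to come from measurement.

```c
#include <stddef.h>

// Expected cost of a successful linear search: on average half of the
// n elements are compared before the match is found.
static double linear_search_cycles(size_t n, double cycles_per_compare) {
    return (double)n / 2.0 * cycles_per_compare;
}

// Cost of one hash lookup: hashing plus `probes` dependent memory accesses.
static double hash_lookup_cycles(double hash_cycles, double probes,
                                 double cycles_per_probe) {
    return hash_cycles + probes * cycles_per_probe;
}

// Smallest n at which linear search is expected to cost more than the
// hash lookup. cycles_per_compare must be positive.
static size_t break_even_elements(double cycles_per_compare,
                                  double hash_total_cycles) {
    size_t n = 1;
    while (linear_search_cycles(n, cycles_per_compare) <= hash_total_cycles) {
        n++;
    }
    return n;
}
```

With 3 cycles per comparison, 20 cycles of hashing, and one 100-cycle miss, break_even_elements(3.0, 120.0) returns 81: below roughly 80 elements, the linear scan is expected to win.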
Summary
Memory performance is often the bottleneck in modern systems:
Memory Hierarchy
- L1 → L2 → L3 → DRAM, latency can differ by 100×
- Understanding which level your working set is in matters
Benchmark Tools
- LMbench: Classic latency and bandwidth measurement
- STREAM: Standard bandwidth benchmark
- Pointer chasing: Measures true random access latency
Measurement Techniques
- Use large enough arrays to avoid measuring only cache
- Use volatile to prevent compiler optimization
- Use pointer chasing to measure latency
- Consider TLB and NUMA effects
Practical Advice
- On small datasets, cache locality matters more than Big-O
- For working sets that do not fit in cache, random access can be 10-100× slower than sequential access
- Profile first, then optimize—don't guess
Chapter 7: System-Level Benchmarks
Part II: Tools
"The best benchmark is the actual workload." — Anonymous sysadmin
The "New Server Is Slower" Ticket
"This new server has problems. It's slower than the old one."
That was the trouble ticket I received. A freshly racked server—faster CPU, more memory, newer SSD—but users complained it "felt slower."
"Felt" is a hard thing to debug.
I ran CPU benchmarks—new server was 40% faster. Memory benchmarks—30% faster. Disk I/O—3× faster. Every individual test showed the new server was better.
But users insisted: "It's just slower."
Finally I found the problem: network latency. The new server was in a different rack, adding 2ms latency to the database. For this database-heavy application, each request accessed the database dozens of times. The accumulated latency was noticeable.
This is why we need system-level benchmarks—measuring CPU, memory, and disk separately isn't enough. We need to measure how the entire system works together.
Micro-benchmarks vs System-level Benchmarks
Let's clarify the difference between these two types:
| Type | Measures | Examples |
|---|---|---|
| Micro-benchmark | Single component | CoreMark (CPU), STREAM (memory) |
| System-level | Entire system | UnixBench, Sysbench, Phoronix |
The problem with micro-benchmarks:
CPU score: 100 points
Memory score: 100 points
Disk score: 100 points
─────────────────────────
System score: ???
(Not 300—might be 50)
System performance isn't a simple sum of component performance. The bottleneck could be anywhere—CPU, memory, disk, network, or even the OS kernel.
UnixBench: The Classic System Benchmark
UnixBench is one of the oldest system benchmarks, first released in 1984. Though dated, the fundamental concepts it measures remain important.
UnixBench Test Items
Dhrystone CPU integer operations
Whetstone CPU floating-point operations
Execl Throughput Process creation (execl)
File Copy Disk I/O
Pipe Throughput IPC performance
Pipe-based Switching Context switch
Process Creation fork() performance
Shell Scripts Shell script execution
System Call syscall overhead
Running UnixBench
# Download and compile
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench
make
# Run (single-threaded and multi-threaded)
./Run
Typical Output
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: myserver
OS: GNU/Linux -- 5.15.0-generic -- #1 SMP
Machine: x86_64 (x86_64)
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
------------------------------------------------------------------------
Benchmark Run: Wed Dec 18 2024 10:00:00
1 parallel copy of tests:
Dhrystone 2 using register variables 45000000.0 lps (10.0 s, 7 samples)
Double-Precision Whetstone 8500.0 MWIPS (10.0 s, 7 samples)
Execl Throughput 5000.0 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 900000.0 KBps (30.0 s, 2 samples)
...
System Benchmarks Index Values:
BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 45000000.0 3855.8
Double-Precision Whetstone 55.0 8500.0 1545.5
...
========
System Benchmarks Index Score: 2150.3
UnixBench Limitations
- Too old — Many tests designed in the 1980-90s
- Doesn't represent modern workloads — No web server, database tests
- Controversial index calculation — Geometric mean may hide individual weaknesses
- Single-machine only — Doesn't test network performance
Sysbench: A More Modern Choice
Sysbench is a more modern benchmark tool, particularly suited for testing database servers.
Sysbench Test Types
# CPU test
sysbench cpu --cpu-max-prime=20000 run
# Memory test
sysbench memory --memory-block-size=1K --memory-total-size=10G run
# Disk I/O test
sysbench fileio --file-total-size=10G prepare
sysbench fileio --file-total-size=10G --file-test-mode=rndrw run
sysbench fileio --file-total-size=10G cleanup
# MySQL test
sysbench oltp_read_write --mysql-host=localhost --mysql-user=root \
--mysql-db=test --tables=10 --table-size=100000 prepare
sysbench oltp_read_write --mysql-host=localhost --mysql-user=root \
--mysql-db=test --tables=10 --table-size=100000 --threads=16 run
Sysbench CPU Test Analysis
sysbench cpu --cpu-max-prime=20000 --threads=4 run
Sysbench Advantages
- Database-ready — Built-in MySQL/PostgreSQL support
- Scriptable — Can write custom Lua scripts
- Modern metrics — Provides latency percentiles
- Active maintenance — Continuously updated
Phoronix Test Suite: The Most Comprehensive Option
Phoronix Test Suite (PTS) is currently the most comprehensive open-source benchmark suite, containing hundreds of tests.
Installing Phoronix Test Suite
# Ubuntu/Debian
sudo apt install phoronix-test-suite
# Or download from official site
wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
tar xvf phoronix-test-suite-10.8.4.tar.gz
cd phoronix-test-suite
sudo ./install-sh
Common Commands
# List all available tests
phoronix-test-suite list-available-tests
# Install a test
phoronix-test-suite install pts/compress-7zip
# Run a single test
phoronix-test-suite run pts/compress-7zip
# Run a test suite
phoronix-test-suite run pts/disk
# Compare two results
phoronix-test-suite merge-results result1 result2
Common Test Suites
| Suite | Contents |
|---|---|
| pts/disk | Disk I/O (fio, iozone, bonnie++) |
| pts/cpu | CPU (compress, encode, compile) |
| pts/memory | Memory bandwidth and latency |
| pts/network | Network throughput |
| pts/compilation | Kernel compile, GCC compile |
Example Run
$ phoronix-test-suite run pts/compress-7zip
Phoronix Test Suite v10.8.4
7-Zip Compression 16.02
Test: Compression Rating
Processor: Intel Core i7-10700 @ 4.80GHz (8 Cores / 16 Threads)
Memory: 32GB DDR4-3200
OS: Ubuntu 22.04
Compression Rating:
58234 MIPS
Decompression Rating:
71823 MIPS
OpenBenchmarking.org Integration
Phoronix's unique feature is integration with OpenBenchmarking.org, where you can:
- Upload results — Share to the cloud
- Compare results — Compare with other users' results
- Track history — Observe performance trends over time
# Upload results
phoronix-test-suite upload-result my-result
# Compare results to baseline
phoronix-test-suite compare-results-to-baseline my-result baseline-id
Choosing the Right System Benchmark
Different scenarios call for different tools:
| Scenario | Recommended Tool | Reason |
|---|---|---|
| Quick system health check | UnixBench | Simple, fast, covers basics |
| Database server | Sysbench | Dedicated OLTP tests |
| Comprehensive analysis | Phoronix | Hundreds of tests, customizable |
| CI/CD automation | Sysbench + custom scripts | Scriptable, easy to integrate |
| Hardware purchase decisions | Phoronix + public comparisons | Lots of public data |
Practical Advice
1. Define "Performance" First
Before running benchmarks, ask yourself:
- For this system, what is "performance"?
- Do users care about throughput or latency?
- Which component is most likely the bottleneck?
2. Tests Should Simulate Real Usage
# Bad: test CPU alone
sysbench cpu run
# Better: simulate real database workload
sysbench oltp_read_write --tables=10 --table-size=1000000 \
--threads=32 --time=300 run
3. Run Multiple Times, Report Statistics
# Run 5 times and collect the output (compute the median from results.txt afterwards)
for i in {1..5}; do
sysbench cpu run >> results.txt
done
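The loop only collects raw output; the median still has to be computed. A minimal sketch, assuming you have already extracted one number per run (for example, events per second) into an array:

```c
#include <stddef.h>
#include <stdlib.h>

// qsort comparator for doubles, ascending.
static int compare_doubles(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

// Median of n samples. Sorts the array in place; n must be at least 1.
static double median(double *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], compare_doubles);
    if (n % 2 == 1) {
        return samples[n / 2];
    }
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}
```

For the five runs {10.2, 9.8, 30.5, 10.0, 10.1} the median is 10.1. The one outlier (30.5) would have pulled the arithmetic mean up to about 14.1.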
4. Record Complete Environment
# Record environment before benchmark
echo "=== System Info ===" > env.txt
uname -a >> env.txt
cat /proc/cpuinfo | grep "model name" | head -1 >> env.txt
free -h >> env.txt
df -h >> env.txt
Back to That Ticket
Now you know why individual CPU/memory/disk benchmarks didn't find the problem. The real bottleneck was network latency, which wasn't in the tests I ran.
If I had used Sysbench's OLTP test to directly measure "the complete path from application to database," I would have found the problem.
Lesson: Choose benchmarks that represent your real workload.
Summary
System-level benchmarks measure overall system performance, not individual components:
Tool Selection
- UnixBench: Classic but dated, good for quick checks
- Sysbench: Modern, scriptable, good for database workloads
- Phoronix: Most comprehensive, good for deep analysis
Best Practices
- Define what "performance" means first
- Choose tests that represent real workloads
- Run multiple times, report statistics
- Record complete environment information
Common Pitfalls
- Testing components individually, ignoring system integration
- Using tests that don't represent real workloads
- Looking at only one metric, ignoring latency distribution
Chapter 8: Profiling Tools
Part II: Tools
"Premature optimization is the root of all evil." — Donald Knuth
"But you can't optimize what you don't measure." — Also Donald Knuth (paraphrased)
The Three Weeks I Optimized the Wrong Thing
I once spent three weeks optimizing a function.
It was an image processing pipeline that ran too slowly. I looked at the code and decided the bottleneck must be in the pixel processing loop—after all, it had millions of iterations.
So I started optimizing: SIMD vectorization, loop unrolling, cache blocking... Three weeks later, that loop was 5× faster.
Overall performance improvement? 3%.
Because the real bottleneck wasn't there. It was I/O. File reading time. The part I never looked at.
If I had run a profiler first, I would have found the real bottleneck in five minutes. Three weeks of work could have been avoided with five minutes of profiling.
This is the value of profiling.
Basic Profiling Concepts
Sampling vs Instrumentation
Profilers use two main approaches:
Sampling
Every N milliseconds, pause the program, record current location
↓
Count which functions appear most often
↓
More appearances = more execution time
Pros: Low overhead, doesn't affect program behavior
Cons: Statistical approximation, may miss short functions
Instrumentation
Insert timing code at every function entry/exit
↓
Precisely record time for each call
↓
Completely accurate call counts and times
Pros: Precise
Cons: High overhead, may change program behavior
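To make the instrumentation idea concrete, here is a hand-rolled sketch of entry/exit hooks. Real tools insert such hooks automatically. Everything here (struct probe, probe_enter, probe_exit, do_work) is hypothetical and uses only the standard clock() function.

```c
#include <stdint.h>
#include <time.h>

// Per-function record kept by the instrumentation.
struct probe {
    const char *name;
    uint64_t calls;        // exact number of calls
    clock_t total_ticks;   // CPU time spent inside, in clock() ticks
};

// Entry hook: count the call and return the timestamp for the exit hook.
static clock_t probe_enter(struct probe *p) {
    p->calls++;
    return clock();
}

// Exit hook: add the time since entry to the running total.
static void probe_exit(struct probe *p, clock_t entered) {
    p->total_ticks += clock() - entered;
}

static struct probe work_probe = { "do_work", 0, 0 };

// An instrumented function; the multiply is a stand-in for real work.
static unsigned do_work(unsigned x) {
    clock_t t = probe_enter(&work_probe);
    unsigned result = x * 2654435761u;
    probe_exit(&work_probe, t);
    return result;
}
```

The call count is exact, which is the strength of instrumentation. The weakness is visible too: for a function this small, the two clock() calls cost far more than the work being timed.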
Common Profilers
| Tool | Type | Platform | Features |
|---|---|---|---|
| perf | Sampling | Linux | Most powerful, hardware PMU support |
| gprof | Instrumentation | Unix | Classic, but outdated |
| Valgrind | Simulation | Linux/macOS | Precise but very slow |
| VTune | Sampling | Linux/Windows | Intel official, GUI |
| Instruments | Sampling | macOS | Apple official |
gprof: The Classic Instrumentation Profiler
gprof is a product of 1980s BSD Unix, still widely used in teaching. Understanding its workings and limitations helps appreciate modern tools.
Basic Usage
# 1. Compile with -pg flag
gcc -pg -O2 -o my_program my_program.c
# 2. Run program (generates gmon.out)
./my_program
# 3. Analyze results
gprof my_program gmon.out > analysis.txt
Reading gprof Output
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.23 1.23 1.23 100000 0.01 0.02 process_pixel
23.45 1.87 0.64 100000 0.01 0.01 apply_filter
12.34 2.21 0.34 1 340.00 450.00 main
8.76 2.45 0.24 1000000 0.00 0.00 get_value
- % time: Percentage of total time in this function
- self seconds: Time in function itself (excluding callees)
- calls: Number of calls
- self ms/call: Time per call
gprof's Serious Limitations
1. Can only profile instrumented code
# These won't appear in gprof reports:
# - System call time
# - Dynamic libraries (unless also compiled with -pg)
# - I/O wait time
2. Instrumentation changes performance characteristics
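3. Sampling resolution too low: gprof counts calls through instrumentation but attributes time by sampling the program counter, typically at 10 ms intervals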
4. Poor multi-threading support
5. Can't see inlined functions
gprof vs perf
| Aspect | gprof | perf |
|---|---|---|
| Mechanism | Instrumentation | Sampling (PMU) |
| Overhead | High (10-30%) | Low (<5%) |
| System calls | Invisible | Visible |
| Shared libraries | Need recompile | Direct support |
| Multi-threading | Poor | Good |
Conclusion: Unless there's a special reason, prefer perf on modern Linux.
perf: Linux Performance Analysis Powerhouse
perf is the performance analysis tool that ships with the Linux kernel source, and it is extremely powerful.
Basic Usage
# Record performance data
perf record ./my_program
# View report
perf report
perf record + perf report
# Record (-g means record call stack)
perf record -g ./my_program
# Interactive report
perf report
Overhead Command Shared Object Symbol
45.23% my_program my_program [.] process_pixel
23.45% my_program libc.so.6 [.] memcpy
12.34% my_program my_program [.] read_file
8.76% my_program libm.so.6 [.] exp
...
Now you know process_pixel takes 45% of the time. That's where to optimize.
Common perf Events
# CPU cycles
perf stat -e cycles,instructions ./program
# Cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program
# Branch prediction
perf stat -e branch-misses,branches ./program
# List all available events
perf list
perf Advanced: annotate
# See source-level hotspots
perf annotate process_pixel
│ void process_pixel(int* data, int n) {
│ for (int i = 0; i < n; i++) {
45.23 │ data[i] = expensive_calc(data[i]);
│ }
│ }
Now you know the expensive_calc line is slowest.
Flame Graphs: Visualizing Call Stacks
Flame Graph is a visualization method invented by Brendan Gregg that lets you see program hotspots at a glance.
What Is a Flame Graph
┌─────────────────────────────────────────────────────────────┐
│ main │
├─────────────────────────────┬───────────────────────────────┤
│ process_image │ read_file │
├─────────────┬───────────────┤ │
│ process_row │ apply_filter │ │
├─────────────┴───────────────┤ │
│ process_pixel │ │
└─────────────────────────────┴───────────────────────────────┘
Width = time proportion
Height = call stack depth
- X-axis: Not time order—alphabetically sorted function names
- Y-axis: Call stack depth
- Width: CPU time proportion of function (including children)
Generating Flame Graphs
# 1. Record with perf
perf record -g ./my_program
# 2. Generate readable stack trace
perf script > out.perf
# 3. Convert with FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
# 4. Open SVG in browser
firefox flamegraph.svg
Reading Flame Graphs
- Find the widest "plateaus" — These are where most time is spent
- Read bottom to top — Bottom is caller, top is callee
- Click to zoom in — SVG is interactive
Common Patterns
Pattern 1: Single Hotspot
┌──────────────────────────────────────────┐
│ main │
├──────────────────────────────────────────┤
│ hot_function (90%) │
└──────────────────────────────────────────┘
→ Optimize hot_function
Pattern 2: Wide Base
┌──────────────────────────────────────────┐
│ main │
├────┬────┬────┬────┬────┬────┬────┬───────┤
│ f1 │ f2 │ f3 │ f4 │ f5 │ f6 │ f7 │ ... │
└────┴────┴────┴────┴────┴────┴────┴───────┘
→ No single hotspot; need overall optimization or algorithm change
Pattern 3: Deep Narrow Stack
┌───┐
│ a │
├───┤
│ b │
├───┤
│ c │
├───┤
│ d │ ← Actual work here
└───┘
→ Consider reducing call stack depth or inlining
Valgrind: Memory and Cache Analysis
Valgrind is an instrumentation framework. The most commonly used tools are Memcheck (memory error detection) and Cachegrind (cache analysis).
Cachegrind: Cache Behavior Analysis
valgrind --tool=cachegrind ./my_program
==12345== Cachegrind, a cache and branch-prediction profiler
==12345==
==12345== I refs: 1,234,567,890
==12345== I1 misses: 123,456
==12345== LLi misses: 12,345
==12345== I1 miss rate: 0.01%
==12345==
==12345== D refs: 456,789,012 (234,567,890 rd + 222,221,122 wr)
==12345== D1 misses: 12,345,678 ( 8,765,432 rd + 3,580,246 wr)
==12345== LLd misses: 1,234,567 ( 876,543 rd + 358,024 wr)
==12345== D1 miss rate: 2.7% ( 3.7% + 1.6%)
- I refs: Instruction fetches
- D refs: Data reads/writes
- D1 misses: L1 data cache misses
- LLd misses: Last-level cache misses (goes to DRAM)
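The miss rates in this output are simply misses divided by references. A small helper (the name is mine) makes that explicit and is handy when you post-process counter values yourself:

```c
#include <stdint.h>

// Miss rate as a percentage of references; 0 if there were no references.
static double miss_rate_percent(uint64_t misses, uint64_t refs) {
    if (refs == 0) {
        return 0.0;
    }
    return 100.0 * (double)misses / (double)refs;
}
```

Plugging in the numbers above, miss_rate_percent(12345678, 456789012) is about 2.7, matching the D1 miss rate line. The read and write columns work out to about 3.7 and 1.6.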
Callgrind: Call Graph Analysis
valgrind --tool=callgrind ./my_program
# Visualize
kcachegrind callgrind.out.12345
KCachegrind provides interactive call graph visualization showing:
- Inclusive/exclusive time per function
- Call graph
- Source code annotation
Valgrind Limitations
- Very slow — Usually 10-50× slower
- Not real execution — Runs on virtual CPU
- No GPU support — Only analyzes CPU code
Practical Workflow
Step 1: Quick Look with perf stat
perf stat ./my_program
Check whether IPC, cache misses, or branch misses look abnormal.
Step 2: Find Hotspots with perf record
perf record -g ./my_program
perf report
Find the hottest functions.
Step 3: Visualize with Flame Graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg
See the overall structure at a glance.
Step 4: Deep Analysis
Choose tools based on problem type:
- Cache issues → Cachegrind or VTune Memory Access
- Branch issues → perf branch-misses events
- Multi-threading issues → VTune Threading
Step 5: Profile Again After Optimization
# Before
perf stat ./my_program_v1
# After
perf stat ./my_program_v2
# Compare
Confirm optimization was effective.
Common Pitfalls
Pitfall 1: Debug Build Overhead
# Bad: debug build
gcc -g -O0 program.c
perf record ./a.out # Results don't represent production
# Good: release build with debug info
gcc -g -O3 program.c
perf record ./a.out # Closer to real performance
Pitfall 2: Compiler Optimization Changes Behavior
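Optimized builds often omit the frame pointer, which can break the stack walks behind perf record -g. Compile with -fno-omit-frame-pointer to preserve the frame pointer and get more accurate call stacks.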
Pitfall 3: Short Execution Time Noise
# Bad: too short, too much noise
perf stat ./quick_program # 0.01 seconds
# Good: run long enough
perf stat ./quick_program --iterations=10000 # 10 seconds
Back to That Three-Week Story
If I had done this:
$ perf record -g ./image_pipeline
$ perf report
Overhead Symbol
65.23% read_file ← This was the bottleneck!
23.45% write_file
8.76% process_pixel ← Where I spent three weeks
...
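Five minutes would have revealed I/O was the bottleneck. Three weeks of optimization could have been spent on the right thing. One caveat: perf record samples on-CPU time by default, so time spent blocked waiting for the disk shows up more clearly when you compare elapsed time with CPU time in perf stat.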
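The gap between a 5× faster loop and a tiny overall gain is plain arithmetic on run-time fractions, usually called Amdahl's law. A small sketch; the function names are mine, and the fractions in the note below are the illustrative percentages from the report above.

```c
// Overall speedup when a fraction p (0..1) of the run time is made
// s times faster and the rest is unchanged (Amdahl's law).
static double overall_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound on the overall speedup if that fraction took no time at all.
// p must be less than 1.
static double max_speedup(double p) {
    return 1.0 / (1.0 - p);
}
```

With those percentages, overall_speedup(0.0876, 5.0) comes out near 1.075: making process_pixel five times faster buys about 7.5% at best. By contrast, max_speedup(0.6523) says that eliminating read_file entirely could give almost 2.9×.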
Summary
Profiling is the first step in performance optimization:
Tool Selection
- perf: Linux first choice, free and powerful
- Flame Graph: Visualize call stacks
- Valgrind/Cachegrind: Cache behavior analysis
- VTune: Intel CPU deep analysis
- Instruments: macOS development
Workflow
- perf stat for quick overview
- perf record + perf report to find hotspots
- Flame Graph for visualization
- Deep analysis (cache/branch/threading)
- Profile again after optimization
Core Principles
- Profile first, then optimize
- Don't guess bottlenecks before profiling
- Profile with release builds
- Ensure execution time is long enough
Chapter 9: Embedded & RTOS Benchmarks
Part II: Tools
"In embedded systems, the worst case is the only case that matters." — Jack Ganssle
The "Average 1ms, But Sometimes 100ms" Disaster
"Average latency 1ms, fully meets spec."
That was the vendor's benchmark report. We were using this MCU for motor control, with a requirement to update PWM output every 1ms. Average 1ms? Perfect.
After the system went live, the motor started stuttering. Not every time—just "occasionally."
We spent three days debugging. Finally we discovered: behind that "average 1ms," there was a 0.1% chance of jumping to 50-100ms. In typical benchmark reports, these outliers get averaged away—invisible.
But for motor control, 0.1% of 100ms delays = stuttering once per second.
This is the fundamental difference between embedded/RTOS benchmarking and GPOS benchmarking: we care about worst case, not average case.
GPOS vs RTOS vs Bare-metal
Let's clarify the differences between these three environments:
| Feature | GPOS | RTOS | Bare-metal |
|---|---|---|---|
| Examples | Linux, Windows, macOS | FreeRTOS, Zephyr, RT-Linux | Running directly on hardware |
| Scheduling | Time-slicing, variable priority | Fixed priority, preemptive | None (or super loop) |
| Memory | Virtual memory, paging | Usually flat memory | Flat memory |
| Interrupt latency | Not guaranteed (may be ms) | Guaranteed upper bound (usually μs) | Minimal (cycles) |
| Jitter | High (background processes) | Low (deterministic) | Lowest |
| Tool support | Rich (perf, VTune) | Medium (trace, SEGGER) | Basic (GPIO toggle) |
Why This Matters
On GPOS, if an operation is "usually" 1ms, "occasionally" 10ms, most applications can tolerate it.
On RTOS/bare-metal:
- Motor control: 100ms delay = motor loses control
- Automotive ABS: 10ms delay = brake failure
- Medical devices: delay = potentially fatal
RTOS benchmarks must report worst-case, not just average.
Time Measurement: What If There's No OS?
On GPOS, we use clock_gettime() or rdtsc. On bare-metal, these APIs don't exist.
ARM Cortex-M: DWT Cycle Counter
The Data Watchpoint and Trace (DWT) unit provides a cycle counter:
// Enable DWT cycle counter (need to enable trace first)
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
// Read cycle count
static inline uint32_t get_cycles(void) {
return DWT->CYCCNT;
}
// Usage
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
uint32_t elapsed = end - start; // cycles
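Note: DWT->CYCCNT is 32 bits wide and wraps around quickly on fast MCUs (about 25 seconds at 168 MHz). The unsigned subtraction above still gives the right answer across a single wrap, so keep each measured interval shorter than that.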
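The wraparound behavior is easy to check on a desktop machine, since it depends only on unsigned 32-bit arithmetic. A minimal sketch with hypothetical helper names:

```c
#include <stdint.h>

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Unsigned subtraction is modulo 2^32, so the result is still correct
// if the counter wrapped once between the two reads.
static uint32_t elapsed_cycles(uint32_t start, uint32_t end) {
    return (uint32_t)(end - start);
}

// Seconds until a 32-bit cycle counter wraps at a given core clock.
static double wrap_period_seconds(uint32_t core_hz) {
    return 4294967296.0 / (double)core_hz;
}
```

For example, elapsed_cycles(0xFFFFFFF0, 0x00000010) returns 32 even though the counter wrapped in between, and wrap_period_seconds(168000000) is about 25.6.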
RISC-V: mcycle/minstret CSRs
RISC-V has standard cycle and instruction counters:
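// Read cycle counter. rdcycle reads the cycle CSR, the unprivileged
// view of mcycle. The register is XLEN bits wide, so on RV32 this
// returns only the low 32 bits (the high half is read with rdcycleh);
// keep measured intervals short there or combine both halves.
static inline uint64_t get_mcycle(void) {
unsigned long cycle; // XLEN-wide, matches the "=r" register operand
asm volatile ("rdcycle %0" : "=r"(cycle));
return cycle;
}
// Read instruction counter (same caveat: rdinstreth holds the high half on RV32)
static inline uint64_t get_minstret(void) {
unsigned long instret;
asm volatile ("rdinstret %0" : "=r"(instret));
return instret;
}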
// Calculate CPI
uint64_t cycles_start = get_mcycle();
uint64_t instr_start = get_minstret();
my_function();
uint64_t cycles = get_mcycle() - cycles_start;
uint64_t instrs = get_minstret() - instr_start;
double cpi = (double)cycles / instrs;
Running on QEMU
# ARM Cortex-M3 (lm3s6965evb)
qemu-system-arm -M lm3s6965evb -nographic -kernel firmware.elf
# RISC-V (sifive_e - FE310)
qemu-system-riscv32 -M sifive_e -nographic -kernel firmware.elf
QEMU's cycle counter is "functional," not cycle-accurate. Numbers can verify program logic but don't represent real hardware cycle counts.
Porting Open-Source Benchmarks to Bare-metal
Good news: most CPU/memory benchmarks port easily:
| Benchmark | Porting Difficulty | Dependencies | Notes |
|---|---|---|---|
| Dhrystone | Easy | libc only | Need to remove time() calls |
| CoreMark | Easy | libc only | Official bare-metal support |
| Embench | Easy | None | Designed for embedded |
| Whetstone | Easy | libm | Needs floating-point support |
| STREAM | Medium | None | Needs enough memory |
| lmbench | Hard | POSIX | Core algorithms portable |
CoreMark Bare-metal Port
CoreMark officially supports bare-metal; just implement a few porting functions:
// core_portme.c - ARM Cortex-M implementation
// 1. Timing start/end
void start_time(void) {
start_cycles = DWT->CYCCNT;
}
void stop_time(void) {
end_cycles = DWT->CYCCNT;
}
CORE_TICKS get_time(void) {
// Elapsed ticks between start_time() and stop_time()
return (CORE_TICKS)(end_cycles - start_cycles);
}
Compile and Run
# Cross-compile for ARM Cortex-M4
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 \
-DITERATIONS=10000 \
core_main.c core_list_join.c core_matrix.c \
core_state.c core_util.c core_portme.c \
-T linker.ld -o coremark_arm.elf
# Cross-compile for RISC-V (RV32IMAC)
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -O3 \
-DITERATIONS=10000 \
core_main.c core_list_join.c core_matrix.c \
core_state.c core_util.c core_portme.c \
-T linker.ld -o coremark_riscv.elf
# Run on QEMU ARM
qemu-system-arm -M lm3s6965evb -nographic \
-semihosting -kernel coremark_arm.elf
# Run on QEMU RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
-kernel coremark_riscv.elf
Embench: Designed for Embedded
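Embench is a modern, free and open-source embedded benchmark suite, developed by a group from academia and industry as an alternative to older suites such as Dhrystone and CoreMark: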
# Download
git clone https://github.com/embench/embench-iot.git
cd embench-iot
# Build ARM version
python3 build_all.py --arch arm --chip cortex-m4 \
--board qemu-arm
# Run (needs appropriate runner)
python3 benchmark_speed.py --target-module run_qemu
Embench includes 19 real-application kernels:
aha-mont64 Montgomery multiplication
crc32 CRC calculation
cubic Cubic root solver
edn FIR filter
huffbench Huffman encoding
matmult-int Integer matrix multiply
md5sum MD5 hash
minver Matrix inversion
nbody N-body simulation
nettle-aes AES encryption
...
RTOS Benchmarks: Measuring the OS Itself
When using an RTOS, besides application performance, you need to measure OS overhead.
Context Switch Time
// FreeRTOS context switch benchmark
static TaskHandle_t task1, task2;
static volatile uint32_t switch_start, switch_end;
void Task1(void *pvParameters) {
for (;;) {
switch_start = get_cycles();
xTaskNotifyGive(task2); // Wake Task2
ulTaskNotifyTake(pdTRUE, portMAX_DELAY); // Wait
}
}
void Task2(void *pvParameters) {
for (;;) {
ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
switch_end = get_cycles();
uint32_t elapsed = switch_end - switch_start;
// Record or accumulate elapsed
xTaskNotifyGive(task1);
}
}
Typical results (depends on MCU and RTOS):
RTOS MCU Context Switch
─────────────────────────────────────────────────
FreeRTOS Cortex-M4@168MHz ~200 cycles
Zephyr Cortex-M4@168MHz ~300 cycles
RT-Thread Cortex-M4@168MHz ~250 cycles
Interrupt Latency
Time from interrupt trigger to ISR execution start:
// Set up GPIO interrupt (STM32)
static volatile uint32_t trigger_time; // Written in main, read in the ISR
void EXTI0_IRQHandler(void) {
uint32_t entry_time = get_cycles(); // First line of ISR
// Calculate latency
uint32_t latency = entry_time - trigger_time;
record_latency(latency);
// Clear interrupt flag
EXTI->PR = EXTI_PR_PR0;
}
// Trigger in main program
trigger_time = get_cycles();
// Trigger via software or external GPIO
EXTI->SWIER = EXTI_SWIER_SWIER0;
Important: Measure multiple times, report distribution!
Interrupt Latency Distribution (10000 samples):
Min: 12 cycles
Max: 89 cycles
Avg: 15 cycles
P99: 45 cycles
P99.9: 78 cycles
Design against the tail (P99.9 = 78 cycles, max = 89 cycles), not the 15-cycle average.
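Once the samples have been dumped from the target (over semihosting or a UART, for example), the distribution can be summarized offline. The following is a minimal sketch, assuming the samples arrive as a plain list of cycle counts; the `latency_summary` helper and the sample data are hypothetical, and the nearest-rank percentile method is one common choice among several.

```python
import math

def latency_summary(samples):
    """Summarize latency samples (in cycles) using nearest-rank percentiles."""
    ordered = sorted(samples)
    n = len(ordered)

    def percentile(q):
        # Nearest-rank: smallest sample with at least q% of samples at or below it.
        rank = max(1, math.ceil(q * n / 100))
        return ordered[rank - 1]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / n,
        "p99": percentile(99),
        "p99.9": percentile(99.9),
    }

# Hypothetical capture: mostly fast ISR entries plus a tail of slow ones.
samples = [12] * 9800 + [45] * 150 + [78] * 45 + [89] * 5
summary = latency_summary(samples)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

With this made-up data the average stays under 13 cycles while P99.9 sits at 78, which is the kind of gap that an average-only report hides.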
Semaphore/Mutex Overhead
static SemaphoreHandle_t sem;
void measure_semaphore_overhead(void) {
uint32_t total = 0;
for (int i = 0; i < 10000; i++) {
uint32_t start = get_cycles();
xSemaphoreTake(sem, portMAX_DELAY);
xSemaphoreGive(sem);
uint32_t end = get_cycles();
total += (end - start);
}
printf("Semaphore take+give: %lu cycles avg\n", total / 10000);
}
Determinism Measurement
A key RTOS characteristic is determinism. How do we quantify it?
Jitter Measurement
#define SAMPLES 10000
static uint32_t latencies[SAMPLES];
// Periodic task
void PeriodicTask(void *pvParameters) {
TickType_t last_wake = xTaskGetTickCount();
int idx = 0;
for (;;) {
uint32_t expected = last_wake * CYCLES_PER_TICK;
uint32_t actual = get_cycles();
if (idx < SAMPLES) {
latencies[idx++] = actual - expected;
}
vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(1));
}
}
// Analyze jitter
void analyze_jitter(void) {
uint32_t min = UINT32_MAX, max = 0;
uint64_t sum = 0;
for (int i = 0; i < SAMPLES; i++) {
if (latencies[i] < min) min = latencies[i];
if (latencies[i] > max) max = latencies[i];
sum += latencies[i];
}
printf("Jitter: min=%lu, max=%lu, avg=%lu, range=%lu\n",
min, max, (uint32_t)(sum/SAMPLES), max-min);
}
WCET Estimation
Worst-Case Execution Time (WCET) is critical for real-time system design:
#define WCET_SAMPLES 100000
uint32_t measure_wcet(void (*func)(void)) {
uint32_t max_time = 0;
for (int i = 0; i < WCET_SAMPLES; i++) {
uint32_t start = get_cycles();
func();
uint32_t elapsed = get_cycles() - start;
if (elapsed > max_time) {
max_time = elapsed;
}
}
return max_time;
}
Warning: Measured WCET is only the observed maximum; true WCET may be larger. Rigorous WCET analysis requires static analysis tools (like aiT, Bound-T).
Running on Simulators
QEMU + Semihosting
Semihosting lets bare-metal programs use host I/O:
// ARM semihosting
static inline void semihosting_write(const char *s) {
asm volatile (
"mov r0, #0x04\n" // SYS_WRITE0
"mov r1, %0\n"
"bkpt #0xAB\n"
:
: "r"(s)
: "r0", "r1"
);
}
# ARM
qemu-system-arm -M lm3s6965evb -nographic \
-semihosting-config enable=on,target=native \
-kernel firmware_arm.elf
# RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
-semihosting-config enable=on,target=native \
-kernel firmware_riscv.elf
Common Pitfalls
Pitfall 1: Only Reporting Averages
Bad: "Average latency 1ms"
Good: "Latency: avg=1ms, max=15ms, P99=3ms, P99.9=12ms"
Pitfall 2: Ignoring Interrupt Effects
During measurement, other interrupts can pollute results:
// Disable interrupts during measurement
__disable_irq();
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
__enable_irq();
But this doesn't represent reality. Real systems have interrupts—measure both "with interrupts" and "without interrupts" scenarios.
Pitfall 3: Simulator ≠ Real Hardware
QEMU cycle count: 1000 cycles
Real hardware: 3500 cycles
QEMU is a functional simulator, not cycle-accurate. Use it to verify program correctness, not for performance evaluation.
Pitfall 4: Cache Matters in Embedded Too
Many assume MCUs don't have cache. Wrong:
- Cortex-M7 has I-cache and D-cache
- Modern RISC-V MCUs may have cache
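- Code executing from flash (with wait states) runs at a different speed than code in RAM, and many MCUs add flash accelerators or prefetch buffers that behave like small caches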
// Cortex-M7 cache control
SCB_EnableICache();
SCB_EnableDCache();
// Clean and invalidate before measurement (invalidate alone would discard dirty lines)
SCB_CleanInvalidateDCache();
Summary
Embedded/RTOS benchmarking differs fundamentally from GPOS:
Core Differences
- GPOS cares about average case
- RTOS/bare-metal cares about worst case
- Determinism matters more than throughput
Time Measurement
- ARM: DWT cycle counter, SysTick
- RISC-V: mcycle/minstret CSRs
- Handle overflow carefully
Portable Benchmarks
- CoreMark, Dhrystone, Embench: Easy to port
- STREAM: Needs enough memory
- lmbench: Core algorithms portable
RTOS Measurements
- Context switch time
- Interrupt latency (report distribution!)
- Semaphore/mutex overhead
- Jitter and WCET
Simulator Usage
- QEMU: Functional verification, not performance evaluation
- Renode: Better peripheral and RTOS support
- Simulators cannot measure power consumption
Chapter 10: Performance Modeling
Part III: Theory
"All models are wrong, but some are useful." — George Box
The "Theoretically Impossible" Optimization
"That's impossible. Your optimization violates Amdahl's Law."
That's what a senior engineer said during code review. I claimed to have sped up a program by 10×, but according to his analysis, only 50% was parallelizable—so by Amdahl's Law, the theoretical limit was 2×.
He was right—if I had just added more threads.
But I didn't parallelize. I changed the algorithm from O(n²) to O(n log n). That's outside Amdahl's Law's scope.
This experience taught me two things:
- Performance models are useful, but know their applicable scope
- Sometimes, breaking out of the model's framework is where real breakthroughs happen
This chapter covers the most important performance models in performance engineering:
- Amdahl's Law: The parallelization ceiling
- Gustafson's Law: The scaling horizon
- Universal Scalability Law: Real-world gravity
- Roofline Model: Compute or Memory bound?
- Little's Law: System's physical conservation
- Queuing Theory: Why 90% utilization causes collapse
Amdahl's Law: The Parallelization Ceiling
Historical Background
In 1967, legendary computer architect Gene Amdahl presented a profoundly influential paper at the AFIPS Spring Joint Computer Conference. The industry was debating whether to invest in "single extremely fast processors" or research connecting multiple "relatively slower processors" for parallel computing.
Amdahl pointed out that regardless of hardware scaling, programs always contain serial portions that cannot be parallelized—such as I/O initialization, memory allocation, or specific logical dependencies. These serial portions become the "ceiling" of overall system performance.
Basic Formula
Assume a program has a parallelizable portion (fraction p) and a serial portion (fraction 1-p):
Speedup = 1 / ((1 - p) + p/n)
Where:
- p = parallelizable fraction
- n = number of processors
- 1-p = serial fraction
Visualization
Original program (single thread):
┌──────────────────────────────────────────┐
│ Serial (20%) │ Parallel (80%) │
└──────────────────────────────────────────┘
Total: 100 time units
4 threads:
┌───────────┬──────────┐
│ Serial │ Parallel │ Thread 1
│ (20%) │ (20%) │
└───────────┴──────────┘
│ (20%) │ Thread 2
├──────────┤
│ (20%) │ Thread 3
├──────────┤
│ (20%) │ Thread 4
└──────────┘
Total: 20 + 20 = 40 time units
Speedup: 100/40 = 2.5×
Practical Calculation
def amdahl_speedup(p, n):
"""
p: parallelizable fraction (0 to 1)
n: number of processors
"""
return 1 / ((1 - p) + p / n)
# 80% parallelizable
p = 0.8
print(f"1 processor: {amdahl_speedup(p, 1):.2f}x")
print(f"2 processors: {amdahl_speedup(p, 2):.2f}x")
print(f"4 processors: {amdahl_speedup(p, 4):.2f}x")
print(f"8 processors: {amdahl_speedup(p, 8):.2f}x")
print(f"16 processors: {amdahl_speedup(p, 16):.2f}x")
print(f"∞ processors: {amdahl_speedup(p, 1000000):.2f}x")
1 processor: 1.00x
2 processors: 1.67x
4 processors: 2.50x
8 processors: 3.33x
16 processors: 4.00x
∞ processors: 5.00x ← This is the ceiling
The Harsh Reality
| Parallelizable Fraction | Theoretical Max Speedup |
|---|---|
| 50% | 2× |
| 75% | 4× |
| 90% | 10× |
| 95% | 20× |
| 99% | 100× |
Even if 99% of code is parallelizable, maximum speedup is only 100×. That 1% serial portion determines the ceiling.
Amdahl's Law Limitations
1. Assumes fixed workload
Amdahl's Law assumes total work is constant. In reality, more resources might mean processing larger problems (Gustafson's Law).
2. Ignores parallelization overhead
In practice, adding threads brings:
- Thread creation/destruction costs
- Synchronization costs (mutex, barrier)
- Cache coherence overhead
- False sharing
3. Only considers CPU
Doesn't account for memory bandwidth, I/O, network, or other bottlenecks.
Real-World Serial Bottlenecks
In practice, serial bottlenecks often hide in details:
- Lock Contention: Even with 128 threads, if they all compete for the same mutex, lock-waiting time is serial
- I/O Operations: Reading disk or network packets is typically sequential
- Memory Allocation: Frequent malloc calls may cause global lock contention in the allocator
How to measure the parallel fraction p? Typically through empirical measurement: measure execution time at different core counts and fit backwards to find p. See Appendix H for detailed measurement methods and Python fitting code.
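As a quick first pass before a full curve fit, Amdahl's formula can be inverted for a single measurement: from Speedup = 1 / ((1 - p) + p/n) it follows that p = (1 - 1/Speedup) / (1 - 1/n). The sketch below applies this to each measured core count and averages the results; the timings are hypothetical, and `parallel_fraction` is an illustrative helper rather than the method in Appendix H.

```python
def parallel_fraction(t1, tn, n):
    """Invert Amdahl's Law: estimate p from runtimes on 1 and n cores (n > 1)."""
    speedup = t1 / tn
    return (1 - 1 / speedup) / (1 - 1 / n)

# Hypothetical wall-clock times in seconds, keyed by core count.
timings = {1: 100.0, 2: 60.0, 4: 40.0, 8: 30.0}
t1 = timings[1]

estimates = [parallel_fraction(t1, t, n) for n, t in timings.items() if n > 1]
p_hat = sum(estimates) / len(estimates)

print("per-point estimates:", [round(p, 3) for p in estimates])
print(f"mean p = {p_hat:.3f}")
```

If the per-point estimates drift downward as n grows, something beyond the fixed serial fraction (synchronization, memory bandwidth) is probably eating into the speedup, which is a hint to look at a richer model.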
Gustafson's Law: A Different Perspective
Gustafson proposed a different assumption: more processors means solving larger problems, not solving the same problem faster.
Formula
Speedup = (1 - p) + p × n
Where:
- p = parallel portion fraction
- n = number of processors
Comparison
def gustafson_speedup(p, n):
return (1 - p) + p * n
p = 0.8 # 80% parallel
print("Processors | Amdahl | Gustafson")
print("-" * 35)
for n in [1, 2, 4, 8, 16, 64]:
a = amdahl_speedup(p, n)
g = gustafson_speedup(p, n)
print(f"{n:10} | {a:6.2f} | {g:6.2f}")
Processors | Amdahl | Gustafson
-----------------------------------
1 | 1.00 | 1.00
2 | 1.67 | 1.80
4 | 2.50 | 3.40
8 | 3.33 | 6.60
16 | 4.00 | 13.00
64 | 4.71 | 51.40
Gustafson's view: 64 processors can process 51× larger problems.
When to Use Which?
| Scenario | Applicable Law | Key Metric |
|---|---|---|
| Fixed-size problem, finish faster | Amdahl | Strong Scaling |
| Fixed time, process larger problem | Gustafson | Weak Scaling |
| Real-time systems (fixed deadline) | Amdahl | Latency |
| Scientific computing (bigger is better) | Gustafson | Throughput |
| UI response, single function optimization | Amdahl | Response Time |
| Big data processing, AI training | Gustafson | Data Volume |
Practical advice: If customers complain "App starts too slowly," use Amdahl to find serial bottlenecks; if they want "more transactions in the same time," use Gustafson to think about scaling.
Universal Scalability Law: Real-World Gravity
If Amdahl's Law is the performance "ceiling" and Gustafson's Law is the "distant horizon," then the Universal Scalability Law (USL) is real-world "gravity"—it explains why some systems' performance not only plateaus but actually declines when adding more cores.
Why Amdahl Isn't Enough
In 1993, Neil Gunther proposed USL to address "Negative Scaling" phenomena. Amdahl assumes performance eventually approaches a constant, but in real distributed systems, we often see performance reach a peak then decline.
Amdahl only considers "work being serialized" overhead, ignoring the "communication and coordination" between cores needed to maintain data consistency.
Formula and Parameters
C(N) = N / (1 + σ(N-1) + κN(N-1))
Where:
- N = number of processors
- σ (sigma) = contention coefficient - overhead from waiting for same resource
- κ (kappa) = coherence coefficient - communication overhead for consistency
Key insights:
- σ is linear: Like Amdahl's serial portion, growth slows then plateaus
- κ is quadratic: N(N-1) represents pairwise node communication, overhead grows explosively
Three Scaling Behaviors
Performance
^
| Linear (σ=0, κ=0)
| /
| / Amdahl (κ=0)
| / _______________
| / /
| / / USL (σ>0, κ>0)
| / / /\
| / / / \ ← Retrograde!
|/ / / \
└────────────────────────> N (cores)
When κ > 0, systems exhibit "Retrograde Behavior"—beyond a critical point, communication overhead exceeds computation gains, and performance declines.
Identifying System Bottlenecks
- Contention-bound (σ dominant): Curve gradually flattens like a slope. Optimization: reduce lock contention, shrink critical sections
- Coherence-bound (κ dominant): Curve like a mountain with clear peak then rapid drop. Optimization: reduce cross-node communication, avoid false sharing
Practical Application
USL's greatest power: with just a few measurement points (e.g., N=1,2,4,8), you can predict system behavior at N=64.
# Optimal parallelism
N_optimal = sqrt((1 - sigma) / kappa)
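To make the formula concrete, here is a small sketch that evaluates C(N) and the peak location for a given σ and κ. The parameter values are purely illustrative (not taken from any measurement), and the function names are hypothetical.

```python
import math

def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_optimal_n(sigma, kappa):
    """Processor count at which C(N) peaks (requires kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)

# Illustrative coefficients only: 2% contention, 0.1% coherence.
sigma, kappa = 0.02, 0.001

n_opt = usl_optimal_n(sigma, kappa)
print(f"N_optimal ~ {n_opt:.1f}")
for n in [1, 2, 4, 8, 16, 32, 64, 128]:
    print(f"N={n:3d}  C(N)={usl_capacity(n, sigma, kappa):6.2f}")
```

With these coefficients the curve peaks near N = 31 and then falls, so doubling from 32 to 64 processors reduces capacity rather than adding to it.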
Real-World Example: Web Server Scaling
A team benchmarked their API server at different instance counts:
Instances │ Throughput (req/s) │ Speedup
──────────┼────────────────────┼─────────
1 │ 1,000 │ 1.0×
2 │ 1,850 │ 1.85×
4 │ 3,200 │ 3.2×
8 │ 4,800 │ 4.8×
16 │ 5,600 │ 5.6×
32 │ 4,200 │ 4.2× ← Degradation!
After USL fitting: σ = 0.05, κ = 0.008
Diagnosis: High κ indicates coherence-bound behavior—each request hits a shared database, causing cross-instance coordination overhead. The optimal instance count is √((1-0.05)/0.008) ≈ 11 instances.
Solution: Add read replicas and cache layer to reduce database roundtrips.
USL Limitations
- Assumes homogeneous nodes: All processors/nodes must be identical
- Steady-state assumption: Doesn't capture transient behavior during ramp-up
- Single resource model: Real systems have multiple bottlenecks (CPU, memory, network)
- Fitting sensitivity: Results depend on measurement quality; outliers can skew σ and κ
See Appendix H for detailed Python fitting code and case studies.
Roofline Model: Finding the Bottleneck
The Roofline Model is a visualization tool proposed by UC Berkeley in 2008 for analyzing whether a program is compute-bound or memory-bound.
Core Concepts
Every program has two characteristics:
- Operational Intensity (OI): How many operations per byte of memory access
OI = FLOPs / Bytes moved
- Attainable Performance: Actual achievable FLOPS
The system has two limits:
- Peak Compute: Maximum CPU FLOPS (horizontal line)
- Peak Memory Bandwidth: Memory bandwidth limit (sloped line)
The Roofline Diagram
Performance (GFLOPS)
^
| __________________ Peak Compute (roof)
| /← Ridge Point
| /
| /
| / ← Memory Bandwidth (slope)
| /
| /
| /
| /
| /
|──────────────────────────────────────> Operational Intensity
(FLOPs/Byte)
Calculation Example
Assume a system with:
- Peak Compute: 100 GFLOPS
- Memory Bandwidth: 50 GB/s
def roofline_performance(oi, peak_compute, bandwidth):
"""
oi: Operational Intensity (FLOPs/Byte)
peak_compute: Peak GFLOPS
bandwidth: Memory bandwidth (GB/s)
"""
memory_bound = oi * bandwidth # GFLOPS
return min(memory_bound, peak_compute)
peak = 100 # GFLOPS
bw = 50 # GB/s
ridge_point = peak / bw # 2 FLOPs/Byte
print("Operational Intensity | Attainable GFLOPS | Bound")
print("-" * 55)
for oi in [0.1, 0.5, 1, 2, 4, 8, 16]:
perf = roofline_performance(oi, peak, bw)
bound = "Memory" if oi < ridge_point else "Compute"
print(f"{oi:21.1f} | {perf:17.1f} | {bound}")
Operational Intensity | Attainable GFLOPS | Bound
-------------------------------------------------------
0.1 | 5.0 | Memory
0.5 | 25.0 | Memory
1.0 | 50.0 | Memory
2.0 | 100.0 | Compute ← Ridge Point
4.0 | 100.0 | Compute
8.0 | 100.0 | Compute
16.0 | 100.0 | Compute
Real-World Examples
Operational Intensity of different algorithms:
| Algorithm | OI (FLOPs/Byte) | Usually... |
|---|---|---|
| STREAM copy | 0 | Memory-bound |
| SpMV (sparse) | 0.25 | Memory-bound |
| BLAS Level 1 | 0.25-0.5 | Memory-bound |
| Stencil | 0.5-1 | Memory-bound |
| BLAS Level 2 | 1-2 | Borderline |
| Dense GEMM | High | Compute-bound |
| FFT | Medium | Depends on implementation |
Cache-Aware Roofline Model (CARM)
Traditional Roofline only considers DRAM bandwidth, but modern processors have multiple cache levels. CARM draws different rooflines for each memory level:
Performance
^
| __________________________ L1 Peak (highest slope)
| /_________________________ L2 Peak
|//________________________ L3 Peak
|||_______________________ DRAM Peak (lowest slope)
|||/
||/
|/
└──────────────────────────> Operational Intensity
Diagnostic logic: If your point falls below the DRAM slope, the problem is cache misses (optimize prefetching, data locality); if near L1 but below compute peak, arithmetic intensity is insufficient (consider loop fusion).
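The per-level ceilings can be tabulated with the same min(peak, OI × bandwidth) rule used for the basic roofline, applied once per memory level. This is a minimal sketch: the bandwidth figures are placeholders for a hypothetical machine, not measurements, and `carm_attainable` is an illustrative name.

```python
def carm_attainable(oi, peak_gflops, bandwidths_gbs):
    """Attainable GFLOPS per memory level: min(peak, OI * bandwidth)."""
    return {level: min(peak_gflops, oi * bw) for level, bw in bandwidths_gbs.items()}

# Hypothetical machine; replace with measured bandwidths for a real analysis.
peak = 100.0  # GFLOPS
bandwidths = {"L1": 800.0, "L2": 400.0, "L3": 200.0, "DRAM": 50.0}  # GB/s

for oi in [0.1, 0.5, 2.0]:
    roofs = carm_attainable(oi, peak, bandwidths)
    row = "  ".join(f"{level}={gflops:6.1f}" for level, gflops in roofs.items())
    print(f"OI={oi:4.1f}  {row}")
```

Comparing a kernel's measured GFLOPS against this row of ceilings shows which memory level it is effectively being served from.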
Multi-core Roofline Considerations
- Shared Bandwidth: Multiple cores share DRAM bus, total bandwidth saturates faster
- Ridge Point shifts right: Compute peak increases linearly with cores, but bandwidth doesn't, making programs more likely to be memory-bound
- NUMA effects: Local DRAM bandwidth is much higher than Remote DRAM; label separately
Roofline Limitations
- Static view: Doesn't capture phase behavior—a program may be memory-bound in one phase, compute-bound in another
- Assumes perfect overlap: Ignores latency hiding and out-of-order execution limitations
- Single bottleneck model: Real programs may have mixed OI across different kernels
- Measurement challenges: Accurately counting FLOPs and bytes moved requires careful instrumentation
See Appendix H for detailed CARM analysis and tool usage.
Little's Law: System's Physical Conservation
In performance engineering, some laws transcend algorithms and hardware architectures. Little's Law is like conservation of energy in physics—it defines the fundamental boundaries of system operation.
In 1961, John Little proved that in stable systems, three core metrics have an invariant relationship.
Formula
L = λ × W
Where:
- L = average number of items in system (in-flight requests)
- λ = arrival rate (throughput)
- W = average time an item spends in the system (latency, including any waiting)
Intuitive Understanding
Imagine a restaurant:
30 people in restaurant (L)
10 people arrive per minute (λ)
Each person stays 3 minutes on average (W)
L = λ × W
30 = 10 × 3 ✓
Applications in Computer Systems
1. Memory System
Outstanding memory requests = Bandwidth × Latency
Example:
- Memory latency: 100 ns
- Required bandwidth: 50 GB/s
Outstanding requests = 50 GB/s × 100 ns = 5000 bytes = 78 cache lines
If CPU can only maintain 16 outstanding requests,
Actual bandwidth = 16 × 64 bytes / 100 ns = 10.24 GB/s
This is why modern CPUs need deep memory hierarchies and prefetchers.
2. Network System
Bandwidth-Delay Product (BDP) = Bandwidth × RTT
Example trans-Pacific connection:
- Bandwidth: 10 Gbps
- RTT: 150 ms
BDP = 10 Gbps × 150 ms = 1.5 Gb = 187.5 MB
TCP window needs at least 187.5 MB to fill the pipe
3. Concurrent System
Throughput = Concurrency / Latency
Example web server:
- Each request latency: 50 ms
- Want 1000 requests/sec throughput
Required concurrency = 1000 × 0.05 = 50 concurrent requests
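The three calculations above are the same multiplication with different units. A short sketch that reproduces them, using only the figures already given in the examples (the `in_flight` helper name is an illustrative choice):

```python
def in_flight(rate, latency):
    """Little's Law: L = lambda * W (rate and latency must use consistent units)."""
    return rate * latency

# 1. Memory: bytes in flight needed to sustain 50 GB/s at 100 ns latency.
mem_bytes = in_flight(50e9, 100e-9)
print(f"memory: {mem_bytes:.0f} bytes = {mem_bytes / 64:.0f} cache lines")

# Bandwidth actually reachable with only 16 outstanding 64-byte requests.
limited_bw = 16 * 64 / 100e-9
print(f"16 outstanding lines: {limited_bw / 1e9:.2f} GB/s")

# 2. Network: bandwidth-delay product for 10 Gbps at 150 ms RTT.
bdp_bits = in_flight(10e9, 150e-3)
print(f"BDP: {bdp_bits / 8 / 1e6:.1f} MB")

# 3. Concurrency: requests in flight for 1000 req/s at 50 ms each.
concurrency = in_flight(1000, 0.050)
print(f"concurrency: {concurrency:.0f} requests")
```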
Key Prerequisites
Little's Law is powerful because it makes no assumptions about task distribution or service order. But three prerequisites must hold:
- System must be stable: Arrival rate = Departure rate. If tasks keep accumulating, the formula fails
- Long-term average: Describes equilibrium state, not instantaneous bursts
- Task conservation: Tasks don't disappear (like dropped packets) or self-replicate
Dimensional analysis verification: Throughput [tasks/time] × Latency [time] = [tasks]
Diagnostic Value When Formula "Fails"
When measured data doesn't match L = λW, this typically indicates:
- Actual concurrency > Expected: Long-tail requests inflating average, or resource leaks (unclosed connections)
- Actual concurrency < Expected: Throughput overestimated, or tasks being batched
Little's Law is the performance engineer's "Sanity Check." See Appendix H for detailed mathematical proof, verification methods, and architectural applications.
Queuing Theory Fundamentals
If Little's Law tells us system's physical conservation, queuing theory reveals the dynamic changes as requests wait to be processed. Understanding queuing theory explains why systems "suddenly collapse" at 90% load.
M/M/1 Model
This is the most basic queuing model, assuming Poisson arrivals, exponential service times, and a single server.
Key formulas:
- ρ = λ/μ (utilization, λ=arrival rate, μ=service rate)
- L = ρ/(1-ρ) (average number of requests in the system, waiting plus in service)
- W = 1/(μ-λ) (average time in the system, waiting plus service)
Why does latency explode at 90% utilization?
When ρ → 1, (1-ρ) → 0, causing W → ∞. Any small arrival fluctuation causes queue accumulation, and subsequent requests' wait times grow in a chain reaction—this is the "Hockey Stick Effect."
Practical Rules of Thumb
- 70% Rule: For latency-sensitive systems, keep Utilization ≤ 70%
- Latency multiplier: Response time is 1/(1-ρ) × service time, so at 50% utilization it is 2× the pure processing time; at 90% it's 10×
- Separate processing from waiting: If response time increases but Service Time hasn't changed, the problem is "queuing"
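The multiplier follows directly from the M/M/1 formula: W = 1/(μ-λ) = (1/μ) / (1-ρ), so mean response time is the service time divided by (1-ρ). A short sketch that tabulates it (the function name is illustrative):

```python
def mm1_response_multiplier(rho):
    """Mean response time divided by service time for M/M/1: 1 / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return 1 / (1 - rho)

for rho in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    multiplier = mm1_response_multiplier(rho)
    print(f"utilization {rho:4.0%}: response time = {multiplier:6.1f}x service time")
```

The jump between 90% and 99% utilization is an order of magnitude, which is the "hockey stick" in numeric form.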
Real-World Example: Database Connection Pool Sizing
A service connects to PostgreSQL with a connection pool. Current setup:
- Arrival rate: 500 queries/sec
- Average query time: 10ms
- Current pool size: 10 connections
Analysis using M/M/c:
ρ = λ / (c × μ) = 500 / (10 × 100) = 0.5 per connection
With 10 connections at 50% average utilization, queuing probability is acceptable.
But during peak hours:
Peak arrival: 800 queries/sec
ρ = 800 / (10 × 100) = 0.8 per connection
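At 80% utilization, the single-server rule of thumb (wait ≈ ρ/(1-ρ) × service time) suggests 4× service time = 40ms of added latency. That is a pessimistic estimate, as if each connection had its own queue; with all 10 connections draining one shared queue, the Erlang C formula gives about a 41% chance of waiting and roughly 2ms of average added latency.
Solution: Increase pool to 15 connections:
ρ = 800 / (15 × 100) = 0.53 per connection
The rule-of-thumb wait drops to ~1.1× service time = 11ms; the Erlang C estimate drops to about a 2% chance of waiting and well under 0.1ms of average added latency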
See Appendix H for detailed M/M/c model, Erlang C formula, and capacity planning.
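A pool whose connections all draw from one shared queue behaves as a single M/M/c system rather than c independent M/M/1 queues, so the per-connection rule of thumb overstates the wait. The sketch below computes the Erlang C waiting probability and the mean queueing delay for the peak-hour numbers above. It assumes Poisson arrivals and exponential query times, and the function names are illustrative.

```python
def erlang_c(servers, offered_load):
    """Probability that an arrival has to wait in an M/M/c queue."""
    # Erlang B via the standard recurrence, then convert to Erlang C.
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    rho = offered_load / servers
    return b / (1 - rho * (1 - b))

def mean_queue_wait(arrival_rate, service_time, servers):
    """Mean time spent waiting for a free server (service time excluded)."""
    offered_load = arrival_rate * service_time  # a = lambda / mu
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below the server count")
    p_wait = erlang_c(servers, offered_load)
    return p_wait * service_time / (servers - offered_load)

# Peak hour from the example: 800 queries/sec, 10 ms per query.
for pool in (10, 15):
    p_wait = erlang_c(pool, 800 * 0.010)
    wait = mean_queue_wait(800, 0.010, pool)
    print(f"pool={pool}: P(wait)={p_wait:.1%}, mean queueing delay={wait * 1e3:.2f} ms")
```

Under these assumptions the mean queueing delay is about 2 ms with 10 connections and a few hundredths of a millisecond with 15. The mean hides the tail, so it is still worth checking P99 against the latency budget.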
Integrated Application: Finding the Real Bottleneck
Let's use an example to apply these models together.
Problem
You have an image processing pipeline:
- Read image (I/O)
- Apply filter (compute)
- Write result (I/O)
Current performance: 100 images/sec
Target: 500 images/sec
Analysis
Step 1: Amdahl Analysis
First, measure each stage's time:
Read: 2 ms (20%)
Filter: 6 ms (60%)
Write: 2 ms (20%)
Total: 10 ms
If we only optimize Filter (parallelize):
# Filter is 60%, even with infinite parallelism
max_speedup = amdahl_speedup(0.6, float('inf'))
print(f"Max speedup: {max_speedup:.2f}x") # 2.5x
2.5x only gets us to 250 images/sec—not enough.
Step 2: Roofline Analysis of Filter
Filter characteristics:
- Per pixel: 20 FLOPs
- Per pixel: read 4 bytes, write 4 bytes = 8 bytes
- OI = 20/8 = 2.5 FLOPs/Byte
System:
- Peak: 200 GFLOPS
- Bandwidth: 50 GB/s
- Ridge point: 4 FLOPs/Byte
OI = 2.5 < 4 → Memory-bound
Optimization direction: not more threads, but improve memory access pattern.
Step 3: Little's Law Analysis of I/O
Read throughput: 500 images/sec × 4 MB/image = 2 GB/s
Disk latency: assume SSD, 0.1 ms
Required queue depth = 2 GB/s × 0.1 ms = 200 KB in flight ≈ 50 outstanding 4 KB requests
But we're using sync I/O (queue depth = 1)
Problem found: I/O isn't pipelined, need async I/O.
Solution
- Use async I/O with queue depth = 64
- Improve filter's cache locality (loop tiling)
- Now filter is compute-bound, can apply SIMD optimization
Result: Achieved 600 images/sec.
Common Pitfalls
Pitfall 1: Only Looking at Averages
Bad: "Average latency 10ms, throughput 100/sec"
Little's Law: L = 100 × 0.01 = 1
Good: "Average latency 10ms, P99 latency 500ms"
That P99 might cause problems
Pitfall 2: Ignoring Model Assumptions
Amdahl's Law assumes:
- Fixed workload
- No parallelization overhead
- Perfect parallelism (no synchronization waits)
None of these hold in reality.
Pitfall 3: Over-trusting Roofline
Roofline says your program is memory-bound, but:
- Might be due to poor cache access patterns
- Might be because prefetcher can't predict
- Might be due to false sharing
Need deeper analysis (perf, VTune).
Pitfall 4: Confusing Throughput and Latency
System A: 1000 req/s, 100ms latency
System B: 800 req/s, 10ms latency
Which is better? Depends on your needs.
Little's Law:
A: 1000 × 0.1 = 100 concurrent
B: 800 × 0.01 = 8 concurrent
A needs more resources to maintain that throughput
Which Model to Use: Decision Guide
┌─────────────────────────────────┐
│ What's your performance │
│ question? │
└────────────────┬────────────────┘
│
┌────────────────────────────┼────────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ "Will more cores │ │ "Is my code │ │ "Why is latency │
│ help?" │ │ compute or │ │ so high?" │
└─────────┬─────────┘ │ memory bound?" │ └─────────┬─────────┘
│ └─────────┬─────────┘ │
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ Amdahl's Law │ │ Roofline Model │ │ Little's Law │
│ (fixed workload) │ │ │ │ + Queuing Theory │
└─────────┬─────────┘ └───────────────────┘ └───────────────────┘
│
▼ Performance degrades at high N?
│
┌─────────┴─────────┐
│ Yes │ No
│ ▼ │ ▼
│ USL │ Gustafson
│ (find σ, κ) │ (scale workload)
└───────────────────┘
Quick Reference:
| Symptom | Model | Key Metric |
|---|---|---|
| Adding cores doesn't help | Amdahl | Serial fraction (1-p) |
| Performance drops at high N | USL | σ (contention), κ (coherence) |
| Slow despite high CPU | Roofline | Operational Intensity |
| Latency spikes at peak load | Queuing | ρ (utilization) |
| Throughput × Latency mismatch | Little's Law | L = λ × W |
Summary
Six Performance Models, six perspectives:
Amdahl's Law
- Theoretical upper limit for parallelization
- Serial portion determines the ceiling
- Use to judge "will adding more cores help"
Gustafson's Law
- More resources → process larger problems
- More optimistic than Amdahl
- Applies to scalable workloads
Universal Scalability Law (USL)
- Captures contention (σ) and coherence (κ) overhead
- Predicts performance degradation with more nodes
- Find optimal parallelism: N_opt = √((1-σ)/κ)
Roofline Model
- Compute-bound vs Memory-bound
- Visualize performance bottlenecks
- Guide optimization direction
Little's Law
- L = λ × W
- Connects throughput, latency, concurrency
- Diagnose queuing and resource utilization
Queuing Theory (M/M/1)
- Explains "Hockey Stick Effect" at high utilization
- 70% rule for latency-sensitive systems
- Response time = Service time / (1 - utilization)
Usage Recommendations
- First use Amdahl/Gustafson to evaluate parallelization potential
- Use USL when scaling shows degradation—identify σ vs κ bottlenecks
- Use Roofline to determine compute vs memory bound
- Use Little's Law to analyze throughput/latency relationships
- Use queuing theory to understand utilization vs latency tradeoffs
- Remember: all models are simplifications; actual measurement is always the final answer
Chapter 11: Galactic Algorithms
Part III: Theory
"In theory, there is no difference between theory and practice. In practice, there is." — Yogi Berra
The Story of an O(n) Algorithm Losing to O(n²)
I once encountered a classic problem in an interview: find two numbers in an array that sum to a target value.
The candidate gave me an O(n²) brute-force solution. I smugly said, "This can be optimized to O(n) with a hash table."
The candidate asked, "What if the array only has 10 elements?"
"Uh... O(n) is still faster, right?"
He shook his head. "Let's measure."
We measured. The O(n²) version was 3× faster.
Why? Because 10 elements in a nested loop means only 45 comparisons, while the hash table needs:
- Hash computation (possibly more expensive than comparison)
- Hash collision handling
- Memory allocation
- Cache misses (hash table isn't contiguous)
This is Big-O's blind spot: it only tells you behavior as n approaches infinity, but in reality n is often finite.
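The "let's measure" step is easy to reproduce. The sketch below times both approaches on a 10-element list with `timeit`. It is only a harness: in CPython, interpreter overhead and the C-implemented `dict` shift the crossover point compared with a compiled language, so the outcome on any given machine should be read off the measurement, not assumed. The input data is made up.

```python
import timeit

def two_sum_brute(nums, target):
    """O(n^2): check every pair."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return i, j
    return None

def two_sum_hash(nums, target):
    """O(n): remember previously seen values in a dict."""
    seen = {}
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return i, j
        seen[value] = j
    return None

nums = [3, 9, 14, 20, 27, 31, 38, 44, 52, 61]  # n = 10
target = 52 + 61  # the last pair: worst case for the brute-force scan

runs = 20_000
for fn in (two_sum_brute, two_sum_hash):
    seconds = timeit.timeit(lambda: fn(nums, target), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e9:.0f} ns per call")
```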
What Big-O Really Means
What Is Big-O
Big-O describes asymptotic behavior: the growth rate as n approaches infinity.
T(n) = 5n² + 3n + 100
As n → ∞:
- 100 can be ignored
- 3n can be ignored
- Coefficient 5 can be ignored
Conclusion: T(n) = O(n²)
What Big-O Ignores
1. Constant Factors
Algorithm A: T(n) = 1000n → O(n)
Algorithm B: T(n) = n² → O(n²)
Crossover point: 1000n = n² → n = 1000
When n < 1000, B is faster than A!
2. Lower-Order Terms
T(n) = n² + 1000000n
Big-O says this is O(n²), but when n < 1000000,
the linear term dominates execution time.
3. Actual Hardware Characteristics
- Cache behavior
- Branch prediction
- SIMD friendliness
- Memory allocation
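When cost models like the ones in the constant-factor example are available, the crossover point can be found mechanically. This is a small sketch under the assumption that each algorithm's cost is given as a function of n; the `crossover` helper and the two cost functions are illustrative, mirroring Algorithm A and Algorithm B above.

```python
def crossover(cost_a, cost_b, max_n=10**7):
    """Smallest n >= 1 at which cost_a(n) < cost_b(n), or None if not found."""
    for n in range(1, max_n + 1):
        if cost_a(n) < cost_b(n):
            return n
    return None

def cost_linear(n):
    return 1000 * n  # Algorithm A: O(n) with a large constant

def cost_quadratic(n):
    return n * n     # Algorithm B: O(n^2) with a small constant

n_cross = crossover(cost_linear, cost_quadratic)
print(f"A becomes cheaper than B at n = {n_cross}")
```

In practice the cost functions would come from measurements at a few sizes rather than from formulas, but the question being asked is the same: is the real n above or below the crossover?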
Classic Example: Insertion Sort vs Merge Sort
Insertion Sort: O(n²)
Merge Sort: O(n log n)
But:
- Insertion sort has small constant factors
- Merge sort needs extra space and copying
- Insertion sort is cache-friendly
Measured: n < 32, insertion sort is usually faster
This is why std::sort and Arrays.sort are hybrid sorts—quicksort/mergesort for large arrays, switching to insertion sort for small ones.
Galactic Algorithms: Theoretically Optimal Monsters
Some algorithms are "optimal" in Big-O terms but nobody actually uses them. These are called Galactic Algorithms—only faster at galactic-scale input sizes.
Example 1: Matrix Multiplication
Basic algorithm: O(n³)
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
C[i][j] += A[i][k] * B[k][j];
Strassen's algorithm: O(n^2.807)
In 1969, Strassen discovered you could use 7 multiplications instead of 8, recursively.
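The seven products are easiest to see on a 2×2 case. The sketch below uses the standard formulation of Strassen's identities on scalars; in the recursive algorithm each entry is an n/2 × n/2 block, which gives T(n) = 7T(n/2) + O(n²) and hence O(n^log₂7) ≈ O(n^2.807).

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices using 7 multiplications instead of 8."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b

    # Strassen's seven products
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine with additions and subtractions only
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The price of saving one multiplication is 18 additions and subtractions instead of 4, which is one reason the method only pays off for large blocks.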
Faster algorithms:
| Year | Algorithm | Complexity |
|---|---|---|
| 1969 | Strassen | O(n^2.807) |
| 1990 | Coppersmith-Winograd | O(n^2.376) |
| 2014 | Le Gall | O(n^2.3728639) |
| 2020 | Alman-Williams | O(n^2.3728596) |
What's actually used: Strassen (sometimes) + highly optimized O(n³)
Why not use O(n^2.37) algorithms?
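Coppersmith-Winograd's constant factor is so large that no practical implementation exists; informal estimates put it beyond 10^20
By that estimate, n would need to exceed 10^20 before it beat Strassen
A 10^20 × 10^20 matrix has 10^40 entries, far more than all the digital storage ever built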
Example 2: Integer Multiplication
Grade school algorithm: O(n²)
Karatsuba: O(n^1.585) - 1960
Schönhage-Strassen: O(n log n log log n) - FFT-based
Constant Factors: The Ignored Elephant
A Tale of Cycles
Suppose we compare two algorithms:
Algorithm A: 10n operations, each takes 1 cycle
Algorithm B: n operations, each takes 100 cycles
A total: 10n cycles
B total: 100n cycles
A is 10× faster, even though it has 10× more "operations"
What operations take 100 cycles?
| Operation | Approximate cycles |
|---|---|
| Integer addition | 1 |
| Integer multiplication | 3-4 |
| Integer division | 20-80 |
| L1 cache hit | 4 |
| L2 cache hit | 12 |
| L3 cache hit | 40 |
| DRAM access | 100-300 |
| Branch misprediction | 15-20 |
| System call | 1000+ |
Real Example: Hash Table vs Array
// Version 1: Hash table lookup O(1)
int find_hash(hash_table *ht, int key) {
int hash = compute_hash(key); // ~10 cycles
int idx = hash % ht->size; // ~20 cycles (division!)
// Possible collision handling...
return ht->buckets[idx]; // Possible cache miss
}
// Version 2: Linear search O(n)
int find_linear(int *arr, int n, int key) {
for (int i = 0; i < n; i++) // Sequential access, cache-friendly
if (arr[i] == key) return i;
return -1;
}
Crossover point is around n = 10-50, depending on implementation details.
Cache Impact
// Version 1: Random access
int sum_random(int *arr, int *indices, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
sum += arr[indices[i]]; // Possible cache miss every time
return sum;
}
// Version 2: Sequential access
int sum_sequential(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
sum += arr[i]; // Prefetcher can predict
return sum;
}
Measured difference can be 10-50×, even though Big-O is O(n) for both.
Why Theory and Practice Diverge
Reason 1: Memory Hierarchy
Big-O assumes all memory accesses cost the same.
Theoretical model:
┌──────────────────────────────────┐
│ Memory │
│ (uniform cost) │
└──────────────────────────────────┘
Reality:
┌────────┐
│ L1 (4) │ ← 4 cycles
├────────┤
│ L2 (12)│ ← 12 cycles
├────────┤
│ L3 (40)│ ← 40 cycles
├────────┤
│RAM(200)│ ← 200 cycles
└────────┘
Memory hierarchy-friendly algorithms can beat theoretically faster ones.
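To put numbers on this, the sketch below computes an average cost per access from the latencies in the diagram above. The function name and the served fractions in the usage note are illustrative assumptions, not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate latencies in cycles, taken from the diagram above. */
static const double LEVEL_CYCLES[4] = { 4.0, 12.0, 40.0, 200.0 }; /* L1, L2, L3, RAM */

/*
 * Average cost per access, given the fraction of accesses served by each
 * level. The fractions are workload-specific assumptions and must sum to 1.
 */
double average_access_cycles(const double served_fraction[4]) {
    double total_fraction = 0.0;
    double cycles = 0.0;
    for (size_t level = 0; level < 4; level++) {
        assert(served_fraction[level] >= 0.0);
        total_fraction += served_fraction[level];
        cycles += served_fraction[level] * LEVEL_CYCLES[level];
    }
    assert(total_fraction > 0.999 && total_fraction < 1.001);
    return cycles;
}
```

With 95% / 3% / 1.5% / 0.5% of accesses served by L1 / L2 / L3 / RAM, the average is about 5.8 cycles per access. With 50% / 10% / 10% / 30% it is about 67 cycles, more than 10× worse for the same number of accesses.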
Reason 2: Branch Prediction
// Version 1: With branches
int sum_positive(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++)
if (arr[i] > 0)
sum += arr[i];
return sum;
}
// Version 2: Branchless
int sum_positive_branchless(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
int mask = arr[i] >> 31; // 0 if non-negative, -1 if negative (assumes 32-bit int and arithmetic right shift)
sum += arr[i] & ~mask;
}
return sum;
}
If positive/negative distribution is random, version 2 can be 2-3× faster (avoiding branch misprediction).
Reason 3: SIMD
// Scalar
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i];
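// SIMD (AVX, from <immintrin.h>); assumes float arrays, n a multiple of 8,
// and 32-byte-aligned pointers (use _mm256_loadu_ps/_mm256_storeu_ps otherwise)
for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);   // load 8 floats
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in one instruction
    _mm256_store_ps(&c[i], vc);
}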
SIMD version can be 4-8× faster, but Big-O is O(n) for both.
Practical Advice
1. Don't Prematurely Optimize Complexity
Bad: "This is O(n²), must change to O(n log n)!"
Good: "This is O(n²), what's the max n? 100? Doesn't matter."
2. Profile First, Then Optimize
Bad: Spend a week switching to theoretically faster algorithm
Good: Use perf to find bottleneck is cache misses, adjust data layout
3. Know Your n
| n Range | Optimization Strategy |
|---|---|
| n < 10 | Write whatever, prioritize readability |
| n < 100 | Simple algorithms, watch constant factors |
| n < 10000 | Algorithm starts to matter |
| n > 100000 | Algorithm very important |
| n > 10^9 | Algorithm + parallelization + distributed |
4. Consider Hybrid Methods
def smart_sort(arr):
if len(arr) < 20:
return insertion_sort(arr)
else:
return quicksort(arr)
5. Benchmark Real Workloads
Don't just test worst case or best case. Test:
- Typical input sizes
- Typical input distributions
- On real hardware
Summary
Big-O Limitations
- Ignores constant factors
- Ignores lower-order terms
- Assumes n → ∞
- Doesn't consider hardware characteristics
Galactic Algorithms
- Theoretically optimal
- Practically unusable
- Constant factors too large for the universe
Constant Factor Sources
- Cache miss vs hit (10-100×)
- Branch misprediction (15-20 cycles)
- Division vs multiplication (10-20×)
- System calls (1000+ cycles)
Practical Advice
- Know your n
- Profile first, then optimize
- Consider hybrid methods
- Test real workloads
- Don't blindly trust Big-O
Remember
O(n²) with cache hits can beat O(n) with cache misses
...when n is small enough
O(n log n) Quicksort can beat O(n) Radix Sort
...when n is small enough
Theory is the map, measurement is the territory.
When map and territory disagree, trust the territory.
Chapter 12: Cache & Branch Prediction
Part III: Theory
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
The Story of "One Line Change" That Made It 10× Faster
I inherited an image processing codebase. Processing a 4K image took 800ms—too slow.
I spent two days profiling and found the hotspot in a nested loop. The code looked normal:
// Original version
for (int x = 0; x < width; x++) {
for (int y = 0; y < height; y++) {
output[y][x] = process(input[y][x]);
}
}
I changed one thing:
// Modified version
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
output[y][x] = process(input[y][x]);
}
}
Just swapped the loop order of x and y.
Result: 800ms → 80ms. 10× faster.
Why? Because C's 2D arrays are row-major, and [y][x] access should have x in the inner loop (sequential access). The original version jumped an entire row each access, causing massive cache misses.
This is the power (or curse) of cache.
Cache Basics
Why We Need Cache
CPU speed vs Memory speed (approximate):
- CPU: 1 cycle = 0.3 ns (3 GHz)
- L1: ~4 cycles = 1.2 ns
- L2: ~12 cycles = 4 ns
- L3: ~40 cycles = 12 ns
- DRAM: ~200 cycles = 60 ns
If every access went to DRAM, the CPU would spend most of its time waiting.
Cache Hierarchy
┌─────────────┐
│ CPU │
│ Registers │ ← Few bytes, < 1 cycle
└──────┬──────┘
│
┌──────▼──────┐
│ L1 Cache │ ← 32-64 KB, ~4 cycles
│ (per core) │
└──────┬──────┘
│
┌──────▼──────┐
│ L2 Cache │ ← 256 KB - 1 MB, ~12 cycles
│ (per core) │
└──────┬──────┘
│
┌──────▼──────┐
│ L3 Cache │ ← 8-64 MB, ~40 cycles (shared)
│ (shared) │
└──────┬──────┘
│
┌──────▼──────┐
│ DRAM │ ← GB scale, ~200 cycles
└─────────────┘
Cache Line
Cache doesn't operate byte-by-byte, but in cache lines (typically 64 bytes):
When you access address 0x1000:
- CPU doesn't read just 1 byte
- It reads the entire cache line: 0x1000 - 0x103F (64 bytes)
- Subsequent accesses to 0x1001, 0x1002... will hit
This is the basis of spatial locality.
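The address arithmetic behind this can be sketched in a few lines. The 64-byte line size is the typical value quoted above, and the function names are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* typical size; an assumption, not a guarantee */

static_assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1u)) == 0u,
              "line size is expected to be a power of two");

/* Index of the cache line that contains the given address. */
uint64_t cache_line_index(uint64_t address) {
    return address / CACHE_LINE_BYTES;
}

/* Nonzero when both addresses fall in the same cache line. */
int same_cache_line(uint64_t a, uint64_t b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

Addresses 0x1000 and 0x103F share a line, so the second access hits. Address 0x1040 starts the next line and needs its own fetch.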
Types of Cache Misses
The 3C Model
| Miss type | Cause | Solution |
|---|---|---|
| Compulsory (cold start) | First access to a line | Prefetching |
| Capacity | Cache too small for the working set | Larger cache, better locality |
| Conflict | Multiple addresses map to the same location | Higher associativity |
(Sometimes a fourth C is added: Coherence, in multi-core systems)
Measuring Cache Misses
# Using perf
perf stat -e L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses ./my_program
# Example output (illustrative numbers):
1,234,567,890 L1-dcache-loads
12,345,678 L1-dcache-load-misses # 1% miss rate
123,456,789 LLC-loads
12,345,678 LLC-load-misses # 10% miss rate
Common Cache Problems
Problem 1: Wrong Access Order
Row-major vs Column-major:
// C/C++ is row-major: arr[row][col]
// Fortran is column-major: arr(row, col)
// Correct order for row-major (C)
for (int row = 0; row < N; row++)
for (int col = 0; col < N; col++)
sum += arr[row][col]; // Sequential access ✓
// Wrong order in C (column-major style)
for (int col = 0; col < N; col++)
for (int row = 0; row < N; row++)
sum += arr[row][col]; // Strided access in C ✗
Problem 2: Stride Access
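Strided access reads one element, skips several, and reads the next, so only part of each loaded cache line is used. A minimal sketch, assuming 64-byte lines and a line-aligned array, counts how many distinct lines a strided read touches. The helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size */

/*
 * Number of distinct cache lines touched when reading `count` elements of
 * `elem_bytes` each, `stride` elements apart, starting at a line-aligned base.
 * Assumes each element lies within a single line (elem_bytes divides 64).
 */
size_t lines_touched(size_t count, size_t stride, size_t elem_bytes) {
    assert(stride > 0 && elem_bytes > 0 && CACHE_LINE_BYTES % elem_bytes == 0);
    size_t lines = 0;
    size_t last_line = (size_t)-1;  /* sentinel: no line visited yet */
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * stride * elem_bytes) / CACHE_LINE_BYTES;
        if (line != last_line) {    /* offsets only increase, so a line never recurs */
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

Reading 1024 four-byte integers with stride 1 touches 64 lines. The same 1024 reads with stride 16 touch 1024 lines, one per element, so 16× as much data moves through the cache for the same amount of useful work.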
Cache Optimization Techniques
Technique 1: Loop Tiling (Blocking)
Split large loops into small blocks that fit in cache:
// Original: matrix multiplication
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
// Tiled version
#define BLOCK 64
#define MIN(a, b) ((a) < (b) ? (a) : (b)) // C has no built-in min
for (int ii = 0; ii < N; ii += BLOCK)
  for (int jj = 0; jj < N; jj += BLOCK)
    for (int kk = 0; kk < N; kk += BLOCK)
      // Work on one BLOCK x BLOCK tile of each matrix at a time
      for (int i = ii; i < MIN(ii+BLOCK, N); i++)
        for (int j = jj; j < MIN(jj+BLOCK, N); j++)
          for (int k = kk; k < MIN(kk+BLOCK, N); k++)
            C[i][j] += A[i][k] * B[k][j];
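Each BLOCK × BLOCK sub-matrix can fit in L1/L2: with BLOCK = 64 and 4-byte elements, one tile is 16 KB, so the three tiles in use at once take 48 KB.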
Technique 2: Data Layout Optimization
Array of Structures (AoS) vs Structure of Arrays (SoA):
// AoS: if you only need x, you load useless y, z too
struct Particle {
float x, y, z;
float vx, vy, vz;
float mass;
};
struct Particle particles[N];
// SoA: only access the fields you need
struct Particles {
float x[N], y[N], z[N];
float vx[N], vy[N], vz[N];
float mass[N];
};
struct Particles particles;
// When only accessing x, SoA is more cache-friendly
for (int i = 0; i < N; i++)
sum += particles.x[i]; // Sequential access
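How much bandwidth the AoS layout wastes can be estimated from the struct size alone. The sketch below repeats the particle struct under a different name (ParticleAoS, chosen here for illustration) and compares the bytes streamed by an x-only pass in each layout.

```c
#include <assert.h>
#include <stddef.h>

struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

static_assert(sizeof(struct ParticleAoS) == 7 * sizeof(float),
              "seven floats, no padding expected");

/* Fraction of each loaded AoS record that a pass over `x` alone actually uses. */
double aos_x_utilization(void) {
    return (double)sizeof(float) / (double)sizeof(struct ParticleAoS);
}

/* Bytes streamed from memory to read `x` for n particles in each layout. */
size_t aos_bytes_streamed(size_t n) { return n * sizeof(struct ParticleAoS); }
size_t soa_bytes_streamed(size_t n) { return n * sizeof(float); }
```

An x-only pass over the AoS layout uses one float in seven, roughly 14% of the data it pulls through the cache. The SoA layout streams only the x array.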
Technique 3: Prefetching
Tell CPU to load data ahead of time:
#include <xmmintrin.h> // for _mm_prefetch
for (int i = 0; i < N; i++) {
    // Prefetch data we'll need soon (64 ints = 4 cache lines ahead), staying in bounds
    if (i + 64 < N)
        _mm_prefetch((const char *)&arr[i + 64], _MM_HINT_T0);
    sum += arr[i];
}
Modern CPUs have hardware prefetchers that work well for sequential access. Random access may need software prefetch.
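For the random-access case, a software prefetch can be issued as soon as the future address is known. The sketch below uses the GCC/Clang builtin `__builtin_prefetch` on an index-driven sum. The function name and the prefetch distance of 16 are assumptions to be tuned by measurement, not recommended values.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* hypothetical starting point; tune by measurement */

/* Sum arr[indices[i]] while prefetching the element needed a few iterations ahead. */
long sum_indexed_prefetch(const int *arr, const size_t *indices, size_t n) {
    assert(arr != NULL && indices != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* The index stream is sequential, so indices[i + D] is cheap to read;
               the prefetch starts the scattered arr[] load early.
               Arguments: address, 0 = read, 3 = keep in all cache levels. */
            __builtin_prefetch(&arr[indices[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += arr[indices[i]];
    }
    return sum;
}
```

Whether this helps depends on the hardware and on how much work each iteration does. Measure with and without the prefetch before keeping it.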
Branch Prediction Basics
Why We Need Branch Prediction
Modern CPUs are pipelined:
Instruction 1: Fetch → Decode → Execute → Memory → Writeback
Instruction 2:         Fetch → Decode → Execute → Memory → Writeback
Instruction 3:                  Fetch → Decode → Execute → Memory → Writeback
...
When encountering a branch (if/else):
if (condition) {
// path A
} else {
// path B
}
CPU doesn't know whether to fetch A or B's instructions. It must guess (predict), then continue.
If wrong (misprediction) → flush pipeline, restart → waste 15-20 cycles.
Typical Branch Prediction Accuracy
| Pattern | Accuracy |
|---|---|
| Always-taken loop | ~99% |
| Always not-taken | ~99% |
| Predictable pattern (TTNTTN...) | ~95%+ |
| Random (50/50) | ~50% |
Measuring Branch Misprediction
perf stat -e branches,branch-misses ./my_program
500,000,000 branches
2,500,000 branch-misses # 0.5% miss rate (good)
Over 1-2% branch miss rate is worth investigating.
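The counters translate into lost cycles with simple arithmetic. The helpers below are a sketch with hypothetical names. The 15-20 cycle penalty is the approximate figure used in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles lost to mispredictions: misses multiplied by the per-miss penalty. */
uint64_t misprediction_cycles(uint64_t branch_misses, uint64_t penalty_cycles) {
    return branch_misses * penalty_cycles;
}

/* Miss rate in percent, as perf prints it next to branch-misses. */
double branch_miss_percent(uint64_t branches, uint64_t branch_misses) {
    assert(branch_misses <= branches);
    return branches ? 100.0 * (double)branch_misses / (double)branches : 0.0;
}
```

For the sample output above, 2,500,000 misses at 15-20 cycles each cost 37.5-50 million cycles, roughly 12-17 ms on a 3 GHz core.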
Branch Optimization Techniques
Technique 1: Branchless Programming
// With branch
int max1(int a, int b) {
if (a > b) return a;
else return b;
}
// Branchless (using conditional move)
int max2(int a, int b) {
return a > b ? a : b; // Compiler may use cmov
}
// Manual branchless (assumes 32-bit int, arithmetic right shift, and no overflow in a - b)
int max3(int a, int b) {
int diff = a - b;
int mask = diff >> 31; // All 0s or all 1s
return a - (diff & mask);
}
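The subtraction in max3 overflows when a and b are far apart (for example, a large negative and a large positive value). A variant that avoids the subtraction is sketched below. It builds the mask from the comparison result instead, and assumes the usual two's-complement conversion between unsigned and int. The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

static_assert(UINT_MAX / 2 == INT_MAX, "assumes int has no padding bits");

/* Branchless max that never computes a - b, so it cannot overflow. */
int max_branchless(int a, int b) {
    /* (a < b) is 0 or 1; negating it as unsigned gives all-zero or all-one bits. */
    unsigned mask = 0u - (unsigned)(a < b);
    unsigned ua = (unsigned)a, ub = (unsigned)b;
    /* mask == 0: result is a.  mask == all ones: a ^ (a ^ b) == b. */
    return (int)(ua ^ ((ua ^ ub) & mask));
}
```

The comparison typically compiles to a flag-setting instruction rather than a jump, but the compiler is free to choose. Check the generated assembly if the branch-free property matters.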
Technique 2: Sort to Improve Branch Prediction
// Processing positive numbers in array
// If array is random, branch is hard to predict
// Original (assume 50% positive)
for (int i = 0; i < N; i++)
if (arr[i] > 0)
process(arr[i]);
// If sorted first
std::sort(arr, arr + N);
// Now all negatives first, positives last
// Branch becomes: N/2 consecutive not-taken, N/2 consecutive taken
// Much more predictable!
Of course, sorting has its own cost. Only worth it if used multiple times.
Technique 3: Profile-Guided Optimization (PGO)
Let the compiler optimize branch layout based on actual execution profile:
# 1. Compile with instrumentation
gcc -fprofile-generate -o program program.c
# 2. Run typical workload
./program typical_input
# 3. Recompile with profile
gcc -fprofile-use -o program_optimized program.c
PGO can:
- Put common paths together (better icache)
- Adjust default branch predictions
- Inline frequently-called functions
Measurement and Diagnosis
Using perf to Find Cache/Branch Problems
# Comprehensive analysis
perf stat -e cycles,instructions,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
branches,branch-misses \
./my_program
Summary
Cache Core Concepts
- Cache line (64 bytes) is the minimum unit
- 3C Model: Compulsory, Capacity, Conflict
- Spatial locality + temporal locality
Common Cache Problems
- Wrong access order (row vs column major)
- Stride access
- False sharing
- Cache thrashing
Cache Optimization Techniques
- Loop tiling/blocking
- AoS → SoA transformation
- Prefetching
- Hot/cold splitting
Branch Prediction
- Misprediction cost: 15-20 cycles
- Predictable patterns: ~95-99% accurate
- Random patterns: ~50% accurate
Branch Optimization Techniques
- Branchless programming
- Sort to improve patterns
- Profile-Guided Optimization (PGO)
Diagnostic Tools
perf stat -e L1-dcache-load-misses,branch-misses ./prog
Rules of Thumb
- Cache miss rate > 5%: worth investigating
- Branch miss rate > 2%: worth investigating
- Optimization order: algorithm → data structure → cache → branch
Chapter 13: Array vs Linked List
Part IV: Data Structures & Algorithms
"In theory, there is no difference between theory and practice. But in practice, there is." — Jan L. A. van de Snepscheut
The Story of the "Optimal" Data Structure
A senior engineer once joined a team working on a high-frequency trading system. The codebase had a critical hot path that maintained an order book—a sorted collection of pending orders. The previous developer had implemented it using a doubly-linked list, reasoning that O(1) insertion and deletion would be optimal for the frequent updates.
The senior engineer ran the profiler and found something surprising: the linked list implementation was spending 70% of its time on cache misses. The "O(1)" insertions were indeed fast in terms of pointer operations, but each node access triggered a cache miss that cost 100+ cycles.
She rewrote the hot path using a sorted array with binary search and memmove for insertions. Despite the O(n) insertion complexity, the new implementation was 8x faster. The lesson: Big-O notation hides constants, and those constants are dominated by memory access patterns.
Why This Comparison Matters
In algorithm textbooks, we learn:
- Arrays: O(1) random access, O(n) insertion/deletion
- Linked Lists: O(n) random access, O(1) insertion/deletion
This theoretical analysis assumes uniform memory access cost—an assumption that hasn't been true since the 1980s. Modern CPUs have multi-level cache hierarchies where L1 cache access takes ~4 cycles, while main memory access takes 100-300 cycles.
Cache Locality: The Decisive Victory
Although linked lists have O(1) theoretical insertion/deletion advantage, arrays (or std::vector) are overwhelmingly superior in practice.
Cache Line Utilization
Arrays occupy contiguous memory. When the CPU reads one element, it loads an entire cache line (typically 64 bytes), bringing multiple adjacent elements into cache for "free":
Array (contiguous memory):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ A0 │ A1 │ A2 │ A3 │ A4 │ A5 │ A6 │ A7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
↑ One cache line fetch (64B) gets sixteen 4-byte integers (8 shown)
Linked list nodes are scattered across the heap. Each node access potentially triggers a cache miss:
Linked List (scattered memory):
┌────┬────┐ ┌────┬────┐ ┌────┬────┐
│ D0 │ *──┼────→│ D1 │ *──┼────→│ D2 │ *──┼→...
└────┴────┘ └────┴────┘ └────┴────┘
0x1000 0x5F20 0x3A80
↑ Each access = potential cache miss (100+ cycles)
Hardware Prefetching
Modern CPUs have hardware prefetchers that detect sequential access patterns and proactively load upcoming data. This works beautifully for arrays but offers little help for linked lists.
The "pointer chasing" pattern of linked lists has data dependencies: you must load node N to discover the address of node N+1. This serializes memory accesses and prevents the CPU pipeline from hiding latency.
The Crossover Point
Empirical measurements on modern desktop CPUs show that even for middle insertion, arrays beat linked lists when N < several thousand elements. The memmove operation is highly optimized (often vectorized) and benefits from contiguous memory.
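The sorted-array insertion discussed here can be sketched as a binary search followed by one memmove. The function names are illustrative, and the caller is assumed to have reserved room for one more element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First position whose element is >= key (binary search over a sorted array). */
size_t lower_bound_int(const int *arr, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/*
 * Insert key into a sorted array holding n elements; the caller guarantees
 * room for n + 1. Returns the new element count.
 */
size_t sorted_insert(int *arr, size_t n, int key) {
    assert(arr != NULL);
    size_t pos = lower_bound_int(arr, n, key);
    /* Shift the tail right by one slot in a single contiguous block move. */
    memmove(&arr[pos + 1], &arr[pos], (n - pos) * sizeof arr[0]);
    arr[pos] = key;
    return n + 1;
}
```

The insertion is O(n) in elements moved, but the move is one sequential pass over contiguous memory, which is the access pattern caches and prefetchers handle best.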
Memory Overhead
On a 64-bit system, each linked list node requires two 8-byte pointers (next, prev) plus alignment padding. For a list of integers (4 bytes each), the pointer overhead can exceed 400% of actual data.
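The overhead can be read directly from `sizeof`. The node type below is a hypothetical minimal doubly-linked node. Allocator bookkeeping, which adds further per-node cost, is not counted.

```c
#include <assert.h>
#include <stddef.h>

struct IntNode {
    int value;
    struct IntNode *next;
    struct IntNode *prev;
};

static_assert(sizeof(struct IntNode) >= sizeof(int) + 2 * sizeof(void *),
              "node holds the payload plus two pointers");

/* Bytes in one node that are not payload (pointers plus alignment padding). */
size_t node_overhead_bytes(void) {
    return sizeof(struct IntNode) - sizeof(int);
}

/* Overhead as a percentage of the payload size. */
double node_overhead_percent(void) {
    return 100.0 * (double)node_overhead_bytes() / (double)sizeof(int);
}
```

On a typical 64-bit ABI the node is 24 bytes: 4 for the int, 4 of padding, and 16 for the two pointers. That is 20 bytes of overhead for 4 bytes of data, or 500%.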
Benchmarking Methodology
When benchmarking data structures, you must avoid measuring the allocator rather than the data structure itself. As with benchmarking in general, the goal is to control confounding factors so that you are actually measuring the behavior of the data structure, not side effects from the allocator or environment.
Fair Comparison Requires
Same allocation strategy: Linked list performance heavily depends on node memory layout. Using a pool allocator that places nodes contiguously dramatically improves linked list performance. Note whether custom allocators are used.
Warm-up consideration: First access triggers page faults and cold cache misses. Run multiple iterations and distinguish cold start from warm state.
Background memory pressure: In production, memory isn't pristine. Add background memory activity to simulate realistic cache pollution.
Sample Benchmark Structure
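#include <stdio.h>
#include <time.h>
// Prevent compiler from optimizing away the computation
volatile long sink;
void benchmark_sequential_access(int* array, int n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += array[i];
    }
    sink = sum; // Prevent dead code elimination
    clock_gettime(CLOCK_MONOTONIC, &end);
    // Report elapsed time in nanoseconds
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%d elements: %.0f ns (%.2f ns/element)\n", n, ns, ns / n);
}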
Measuring Cache Effects Directly
To truly understand the array vs. linked list difference, let's measure cache behavior using hardware performance counters:
# Using perf to measure cache misses during sequential traversal
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./array_traverse 1000000
# Typical output for array:
# 12,543,210 cache-references
# 62,891 cache-misses # 0.50%
# Typical output for linked list:
# 12,892,344 cache-references
# 9,234,567 cache-misses # 71.6%
The linked list shows roughly 147x more cache misses (9,234,567 vs. 62,891) for the same logical operation. Each cache miss costs approximately 100 cycles on modern hardware, explaining why the linked list is 30-50x slower despite doing the "same" work.
Measuring with perf to Confirm the Theory
# Cache miss comparison at L1 and last-level cache (LLC)
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
./benchmark_both_structures
# Array traversal:
# 1,024,567 L1-dcache-loads
# 12,345 L1-dcache-load-misses # 1.2%
# 2,345 LLC-loads
# 234 LLC-load-misses # 10.0%
# Linked list traversal:
# 1,089,234 L1-dcache-loads
# 987,654 L1-dcache-load-misses # 90.7%
# 876,543 LLC-loads
# 765,432 LLC-load-misses # 87.3%
Small Size Optimization (SSO)
An important insight that surprises many developers: when N is small, simple linear search often beats complex data structures.
O(n) vs O(1): When N < 10-20, linear array search is typically faster than hash table lookup. Array operations fit entirely in L1 cache with no hash computation overhead.
Branch prediction: Binary search on sorted arrays causes ~50% branch misprediction rate on random queries. Linear search has trivial prediction patterns.
Here's a demonstration:
// Linear search: trivial, but cache-friendly
int linear_search(int* arr, int n, int target) {
for (int i = 0; i < n; i++) {
if (arr[i] == target) return i;
}
return -1;
}
// Measured performance for N=16, random queries:
// Linear search: ~15 cycles average
// Binary search: ~45 cycles average (branch misprediction penalty)
// Hash table: ~60 cycles average (hash computation + indirection)
Many high-performance libraries and standard containers implement SSO-style optimizations: when the element count is below a threshold, they fall back to simple arrays or small in-object buffers internally. The std::string small-string optimization and boost::container::small_vector are canonical examples.
When to Use Each
Use Arrays (std::vector) When
- Sequential or random access patterns dominate
- Size is known or changes infrequently
- Cache performance is critical
- Elements are small (pointers, integers, small structs)
Use Linked Lists When
- Frequent insertions/deletions at known positions (with held iterators)
- Need stable pointers/references after insertion
- Elements are very large (moving is expensive)
- Memory fragmentation is acceptable
The Modern Reality
In C++, Bjarne Stroustrup and others have empirically shown that std::vector outperforms std::list for almost all use cases. The exceptions are rare and specific.
Practical Measurements
Typical results on modern hardware (Intel Core i7, DDR4):
| Operation | Array (N=1000) | Linked List (N=1000) | Ratio |
|---|---|---|---|
| Sequential traverse | 0.5 μs | 15 μs | 30x |
| Random access | 0.02 μs | 8 μs | 400x |
| Middle insertion | 0.3 μs | 0.05 μs* | 0.17x |
| Memory per element | 4 bytes | 24 bytes | 6x |
*Linked list insertion assumes you already have the iterator—finding the position is O(n).
These results come from one specific CPU, compiler, and standard library implementation. On a different machine or implementation the absolute timings will change, but the qualitative trend is robust: contiguous arrays dominate linked lists for traversal and random access on modern hardware.
Summary
- Cache locality often matters more than algorithmic complexity. Main memory is roughly 25-75x slower than L1 cache (about 4 cycles vs. 100-300 cycles).
- Measure, don't assume. Your intuition from algorithm class may be wrong for modern hardware.
- Consider your actual access patterns. If you traverse more than you insert, prefer arrays.
- When in doubt, use std::vector. It's the default choice for a reason.
Chapter 14: Hash Table vs Tree
Part IV: Data Structures & Algorithms
"O(1) is just O(n) that got lucky." — Anonymous
The O(1) That Wasn't
A startup was building a real-time bidding system that needed to look up advertiser data by ID. The initial implementation used a hash table—the obvious choice for O(1) lookups. Everything worked perfectly in testing.
In production, disaster struck. Under high load, the system would occasionally freeze for hundreds of milliseconds. Debugging revealed that when the hash table needed to resize, it triggered a massive rehashing operation that blocked the main thread. Worse, during a targeted attack, malicious requests with crafted IDs caused hash collisions, degrading the "O(1)" lookups to O(n).
The team switched to a Red-Black tree. The lookups were now O(log n)—technically "slower"—but performance became predictable. The worst-case latency dropped from 500ms to 2ms. The lesson: average-case complexity isn't the whole story.
Theoretical Complexity Review
This is the classic O(1) vs O(log n) battle, but the "constant" in O(1) can be surprisingly expensive.
| Operation | Hash Table (avg) | Hash Table (worst) | Balanced Tree |
|---|---|---|---|
| Search | O(1) | O(n) | O(log n) |
| Insert | O(1) | O(n) | O(log n) |
| Delete | O(1) | O(n) | O(log n) |
| Range query | O(n) | O(n) | O(log n + k) |
| Ordered iteration | O(n log n) | O(n log n) | O(n) |
Hash Table Deep Dive
The Hidden Costs of O(1)
The O(1) of hash tables includes:
- Computing the hash code
- Handling collisions
- (Occasionally) Resizing the entire table
Hash function complexity: For simple integer keys, hashing is trivial. For long strings, computing a good hash can require examining every character—potentially hundreds of operations before the "O(1)" lookup even begins.
Collision Handling
Chaining (separate chaining): Each bucket contains a linked list of colliding elements. When collisions are frequent, the hash table degrades into multiple linked lists—losing both O(1) complexity and cache locality.
Open addressing (linear probing, quadratic probing): Collisions are resolved by probing subsequent slots. Better cache locality than chaining, but performance degrades rapidly as load factor approaches 1.0.
Load Factor and Rehashing
Hash tables maintain a load factor (elements / buckets). When this exceeds a threshold (0.75 in Java's HashMap, for example; std::unordered_map defaults to a maximum of 1.0), the table must:
- Allocate a new, larger backing array
- Recompute hashes for ALL existing elements
- Insert everything into the new array
This operation is O(n) and causes latency spikes. In latency-sensitive applications, these spikes can be catastrophic.
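A small simulation makes the spike visible. The sketch below counts resize events for a table that doubles at a 3/4 load factor. The struct, the function name, and the initial bucket count are illustrative assumptions, and no real hashing is performed.

```c
#include <assert.h>
#include <stddef.h>

struct GrowthStats {
    size_t rehashes;        /* number of resize events */
    size_t elements_moved;  /* total elements re-inserted across all resizes */
    size_t largest_rehash;  /* elements moved by the single worst resize */
};

/*
 * Simulate n insertions into a table that doubles its bucket count whenever
 * the next insertion would push the load factor above 3/4.
 */
struct GrowthStats simulate_growth(size_t n, size_t initial_buckets) {
    assert(initial_buckets > 0);
    struct GrowthStats stats = { 0, 0, 0 };
    size_t buckets = initial_buckets;
    for (size_t count = 0; count < n; count++) {
        /* (count + 1) / buckets > 3/4, written without floating point */
        if (4 * (count + 1) > 3 * buckets) {
            buckets *= 2;
            stats.rehashes++;
            stats.elements_moved += count;   /* every existing element moves */
            if (count > stats.largest_rehash) stats.largest_rehash = count;
        }
    }
    return stats;
}
```

For 1000 insertions starting from 16 buckets, the table resizes 7 times and moves 1524 elements in total, which is cheap on average. The last resize alone moves 768 elements inside a single insert call, and that one call is the latency spike.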
Hash Collision Attacks
Balanced trees (like Red-Black trees) provide stable O(log n) performance regardless of data distribution. Hash tables are vulnerable to algorithmic complexity attacks: an attacker who knows the hash function can craft inputs that all collide, turning O(1) into O(n).
This vulnerability led to security incidents in web frameworks (PHP, Python, Ruby) that used predictable hash functions for request parameters. Many modern languages and libraries mitigate this by using randomized hash functions or stronger default hashers, which makes these attacks harder in practice. The underlying worst-case behavior of hash tables, however, has not changed.
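The mechanism is easy to reproduce with a deliberately weak hash. The sketch below uses "key modulo bucket count" as a stand-in for a predictable hash function. Both function names are hypothetical, and real attacks target real hash functions in the same way.

```c
#include <assert.h>
#include <stddef.h>

/* Deliberately weak hash: the key itself, reduced modulo the bucket count. */
size_t weak_bucket(unsigned key, size_t bucket_count) {
    assert(bucket_count > 0);
    return key % bucket_count;
}

/*
 * Length of the longest chain after placing keys[0..n) into bucket_count
 * buckets. chain_len is caller-provided scratch space with bucket_count entries.
 */
size_t longest_chain(const unsigned *keys, size_t n, size_t bucket_count,
                     size_t *chain_len) {
    size_t longest = 0;
    for (size_t b = 0; b < bucket_count; b++) chain_len[b] = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = weak_bucket(keys[i], bucket_count);
        if (++chain_len[b] > longest) longest = chain_len[b];
    }
    return longest;
}
```

With 64 buckets, the keys 0..63 give a longest chain of 1. The crafted keys 0, 64, 128, ... all land in bucket 0 and form one chain of 64, so every lookup becomes a linear scan.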
Tree Structures
Balanced Trees (Red-Black, AVL)
Self-balancing binary search trees guarantee O(log n) operations by maintaining height invariants.
Advantages:
- Predictable worst-case performance
- Natural ordering (in-order traversal gives sorted sequence)
- Efficient range queries
Cache behavior: Traditional BSTs have poor cache locality—each node comparison may trigger a cache miss. For N = 1 million elements, log₂(N) ≈ 20 comparisons, potentially 20 cache misses.
B-Trees: Cache-Friendly Trees
B-Trees store multiple keys per node, designed for systems where node access is expensive (originally disk, but now relevant for cache):
B-Tree node (order 4):
┌─────┬─────┬─────┐
│ K1 │ K2 │ K3 │ ← Multiple keys in one cache line
└──┬──┴──┬──┴──┬──┘
↓ ↓ ↓
children...
A B-Tree with 64-byte nodes can store ~15 integer keys per node. For 1 million elements, height ≈ 5, meaning only 5 cache misses worst case (vs. 20 for a binary tree).
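The height comparison is just a logarithm with a different base. A minimal sketch, with a hypothetical function name, assuming a full tree and moderate n (the loop does not guard against 64-bit overflow):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest number of levels h such that fanout^h >= n, i.e. ceil(log_fanout(n)).
 * Approximates the node visits per lookup in a full tree with that fanout.
 */
unsigned levels_needed(uint64_t n, uint64_t fanout) {
    assert(fanout >= 2);
    unsigned levels = 0;
    uint64_t reachable = 1;   /* fanout^levels */
    while (reachable < n) {
        reachable *= fanout;
        levels++;
    }
    return levels;
}
```

For one million keys, a fanout of 2 needs 20 levels and a fanout of 16 needs 5, matching the figures above.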
Benchmarking Comparison
Key Type Matters
Integer keys: Hash tables shine. Computing a hash is trivial (often just a modulo or multiplication), and comparison is a single instruction.
String keys: The gap narrows. Hash computation requires examining the entire string. Trees only compare until a difference is found (often early).
Data Distribution Effects
Hash table performance depends heavily on key distribution:
| Distribution | Hash Table | Tree |
|---|---|---|
| Random | Excellent | Good |
| Sequential | Good | Good |
| Adversarial | Catastrophic | Good |
| Clustered | Degraded | Good |
Typical Results (N = 100,000, random keys)
| Operation | std::unordered_map | std::map | Ratio (unordered_map / map) |
|---|---|---|---|
| Lookup (int key) | 25 ns | 180 ns | 0.14x |
| Lookup (string key) | 80 ns | 150 ns | 0.53x |
| Insert | 45 ns | 200 ns | 0.23x |
| Range query (1%) | 2500 μs | 50 μs | 50x |
| Iteration (sorted) | 3000 μs | 1500 μs | 2x |
These measurements are representative of a particular hardware and standard-library implementation. Treat them as directional guidance rather than universal constants: the exact numbers will vary, but the trade-offs they illustrate are robust.
Measuring Hash vs Tree Performance
Here's how to benchmark the key operations:
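#include <algorithm>  // std::shuffle
#include <chrono>
#include <iostream>
#include <map>
#include <numeric>    // std::iota
#include <random>
#include <string>
#include <unordered_map>
#include <vector>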
template<typename Map>
void benchmark_lookup(Map& m, const std::vector<int>& keys,
const std::string& name) {
volatile int sink = 0; // Prevent optimization
auto start = std::chrono::high_resolution_clock::now();
for (int key : keys) {
auto it = m.find(key);
if (it != m.end()) sink = it->second;
}
auto end = std::chrono::high_resolution_clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
end - start).count();
std::cout << name << ": " << ns / keys.size() << " ns/lookup\n";
}
int main() {
constexpr int N = 100000;
std::vector<int> keys(N);
std::iota(keys.begin(), keys.end(), 0);
std::shuffle(keys.begin(), keys.end(), std::mt19937{42});
std::unordered_map<int, int> hash_map;
std::map<int, int> tree_map;
for (int k : keys) {
hash_map[k] = k;
tree_map[k] = k;
}
// Measure (for stable numbers, add a warm-up pass and repeat each measurement)
benchmark_lookup(hash_map, keys, "unordered_map");
benchmark_lookup(tree_map, keys, "map");
}
Measuring Latency Distribution
Average latency hides the story. For latency-sensitive applications, measure the distribution:
# perf shows where the cycles go (a profile), not per-operation latency
perf record -e cycles:u ./hash_vs_tree_benchmark
perf report
# For the latency distribution itself, capture a timing histogram in code:
# Track min, max, P50, P99, P99.9 per operation
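One way to get those percentiles is to record a latency sample per operation, sort once, and read values off by rank. The sketch below uses the nearest-rank definition and expresses the percentile in per-mille (500 for P50, 990 for P99, 999 for P99.9) to keep the arithmetic exact. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int compare_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the per-operation latencies in place so percentiles can be read off. */
void sort_samples(uint64_t *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], compare_u64);
}

/*
 * Nearest-rank percentile of already-sorted samples.
 * permille is in 1..1000: 500 = P50, 990 = P99, 999 = P99.9, 1000 = max.
 */
uint64_t percentile_sorted(const uint64_t *sorted, size_t n, unsigned permille) {
    assert(n > 0 && permille >= 1 && permille <= 1000);
    size_t rank = (n * permille + 999) / 1000;  /* ceil(n * permille / 1000) */
    return sorted[rank - 1];
}
```

Reporting P50 next to P99 and P99.9 exposes rehashing spikes that an average would hide. For very long runs, a fixed-bucket histogram avoids storing every sample.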
When to Use Each
Use Hash Tables When
- Average-case performance matters more than worst-case
- Keys have good hash distribution (or you control the hash function)
- Order is irrelevant
- Range queries are not needed
- Latency spikes from rehashing are acceptable
Use Trees When
- Ordered traversal is needed (logs, time-series data)
- Latency stability is critical (avoid rehashing spikes)
- Range queries are common (find all items between X and Y)
- Data comes from untrusted sources (security)
- Memory allocation must be predictable
The std::map vs std::unordered_map Decision
In C++, this is a common choice:
- Default to std::unordered_map for simple key-value lookups
- Use std::map when you need ordering or predictable performance
- Consider a sorted std::vector for read-heavy workloads (binary search with excellent cache locality). For small to medium, mostly-read tables, a sorted vector often outperforms both std::map and std::unordered_map thanks to tiny constant factors and contiguous storage.
Real-World Implementation Notes
Why std::unordered_map Can Be Slow
The C++ standard's std::unordered_map is often criticized for performance. Reasons include:
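- Effectively required to use chaining (the bucket interface and reference-stability guarantees rule out open addressing)
- References and pointers to elements must survive rehashing, which limits layout optimization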
- Node-based allocation (each element is separately allocated)
Alternatives like absl::flat_hash_map (Google) or robin_hood::unordered_map use open addressing with better cache behavior.
Summary
- Hash tables trade predictability for average-case speed. The O(1) comes with hidden costs and worst-case risks.
- Trees provide guarantees. O(log n) may be "slower" but is always O(log n).
- Consider your requirements: Need ordering? Use trees. Need range queries? Use trees. Need raw speed with controlled inputs? Use hash tables.
- Measure with realistic data. Adversarial or clustered data can devastate hash table performance.
Chapter 15: Sorting Algorithms
Part IV: Data Structures & Algorithms
"Premature optimization is the root of all evil, but late optimization is the root of all frustration." — Anonymous
The Hybrid Sort Revolution
Why doesn't std::sort just use Quicksort—the algorithm with the best average-case performance? Why does it fall back to Heapsort sometimes? And why does it switch to Insertion Sort for small arrays?
A developer was benchmarking sorting algorithms for a data processing pipeline. She implemented a "pure" Quicksort, confident it would match or beat the standard library. On random data, it was competitive. On nearly-sorted data, it was actually faster. Then she tested on adversarial input—data specifically crafted to trigger O(n²) behavior—and watched her algorithm crawl.
The standard library's std::sort finished in the same time as always. She discovered that modern library sorts are hybrid algorithms: they combine multiple sorting strategies, switching between them based on data characteristics. There is no single "best" sorting algorithm—only best combinations.
From Theory to Hardware Reality
In modern processor architecture, sorting algorithm performance depends not just on comparison count, but on branch prediction success rate, cache hit rate, and memory access patterns.
Algorithm Complexity Overview
| Algorithm | Best | Average | Worst | Space | Stable | Hardware Friendliness |
|---|---|---|---|---|---|---|
| Quicksort | O(n log n) | O(n log n) | O(n²) | O(log n) | No | Excellent locality |
| Mergesort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | Extra memory copy |
| Heapsort | O(n log n) | O(n log n) | O(n log n) | O(1) | No | Cache-unfriendly |
| Insertion | O(n) | O(n²) | O(n²) | O(1) | Yes | Excellent for small N |
| Timsort | O(n) | O(n log n) | O(n log n) | O(n) | Yes | Optimized for real data |
Classic Algorithms: Theory vs Practice
Quicksort:
- Practical advantage: Excellent locality. The partition operation scans contiguous memory, which is extremely cache-friendly.
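- Optimization: Modern implementations use median-of-three pivot selection to make the O(n²) worst case unlikely. It does not eliminate it; adversarial inputs still exist, which is why Introsort adds a depth limit.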
Mergesort:
- Characteristic: Stable, guaranteed O(n log n).
- Disadvantage: Requires O(n) extra space; merge phase involves heavy data copying, pressuring memory bandwidth.
Heapsort:
- Theory: O(n log n) in-place.
- Practical bottleneck: Cache-unfriendly. Heap access constantly jumps between indices i and 2i+1; when N is large, nearly every comparison triggers a cache miss.
Small Array Optimization
When N shrinks, algorithm overhead (constant factors) dominates performance.
Why Insertion Sort Wins for Small Arrays
Insertion Sort has extremely simple code with minimal instruction overhead. When n < 20 or so, its O(n²) computation produces less latency than Quicksort's recursive call overhead.
Cutoff threshold: Typically 16 to 32. libstdc++'s std::sort, for example, stops Quicksort recursion at 16 elements, then does a single Insertion Sort pass over the nearly-sorted array.
The crossover exists because Insertion Sort has no recursive calls and no partition logic, just simple element shifts, so below the cutoff its low constant factor outweighs its O(n²) growth. Above the cutoff, Quicksort's O(n log n) growth takes over, which is why hybrid algorithms switch strategies at this threshold.
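A minimal sketch of such a hybrid follows: Quicksort with a median-of-three pivot and a Hoare-style partition, handing partitions at or below the cutoff to Insertion Sort. The cutoff of 16 and all names are illustrative choices, and this is not the code of any particular standard library. It has no depth limit, so its worst case is still O(n²).

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_CUTOFF 16  /* within the 16-32 range discussed above; tune by measurement */

static void insertion_sort(int *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {  /* shift larger elements right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Quicksort that hands small partitions to insertion sort. */
void hybrid_sort(int *a, size_t n) {
    assert(a != NULL || n == 0);
    while (n > SMALL_CUTOFF) {
        /* Median-of-three: order a[0] <= a[mid] <= a[n-1], pivot on a[mid]. */
        size_t mid = (n - 1) / 2;
        if (a[mid] < a[0]) swap_int(&a[mid], &a[0]);
        if (a[n - 1] < a[0]) swap_int(&a[n - 1], &a[0]);
        if (a[n - 1] < a[mid]) swap_int(&a[n - 1], &a[mid]);
        int pivot = a[mid];

        /* Hoare partition: afterwards a[0..j] <= pivot <= a[j+1..n-1]. */
        ptrdiff_t i = -1, j = (ptrdiff_t)n;
        for (;;) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) break;
            swap_int(&a[i], &a[j]);
        }
        size_t left = (size_t)j + 1;
        size_t right = n - left;

        /* Recurse into the smaller side, loop on the larger: O(log n) stack. */
        if (left < right) {
            hybrid_sort(a, left);
            a += left;
            n = right;
        } else {
            hybrid_sort(a + left, right);
            n = left;
        }
    }
    insertion_sort(a, n);
}
```

The right cutoff depends on element size, comparison cost, and hardware, so treat 16 as a starting point for measurement rather than a constant.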
Branch Prediction Friendly Algorithms
Sorting is fundamentally about comparisons. With random data, the branch predictor has ~50% failure rate.
Optimization: Some implementations use Conditional Move (CMOV) instructions to replace if branches, avoiding misprediction penalties.
Cache-Aware Sorting
Cache-oblivious Algorithms: Designed to work efficiently without knowing exact cache size. These typically use recursive divide-and-conquer, ensuring that at some recursion level, the working set fits in cache.
Block-based strategies: Process data in cache-sized blocks to maximize temporal locality.
Input Patterns Matter
Sorting performance varies dramatically with input characteristics:
| Input Pattern | Quicksort | Mergesort | Timsort |
|---|---|---|---|
| Random | Excellent | Good | Good |
| Nearly sorted | Good | Good | Excellent (O(n)) |
| Reverse sorted | Degraded* | Good | Good |
| Many duplicates | Degraded* | Good | Good |
*Without proper pivot selection or 3-way partitioning.
Parallel Sorting
Parallel MergeSort: Classic divide-and-conquer. Dispatch subtasks to different threads early, merge at the end.
Sorting Networks (e.g., Bitonic Sort): Branch-free comparisons, ideal for GPU SIMD execution.
GPU Sorting: Leverages GPU's massive memory bandwidth. Usually uses Radix Sort, which can be transformed into parallel Prefix Sum (Scan) operations.
For small or medium-sized arrays, or on systems with high synchronization overhead, the thread-management and merge costs of parallel sorting can outweigh any speedup. Parallel sort is a tool for genuinely large problems with enough work to amortize its overhead, not a free performance knob to turn on by default.
Stability Considerations
When needed? When sorting by multiple keys (e.g., first by "date", then by "amount").
Cost: To maintain relative order of equal elements, algorithms typically cannot swap non-adjacent elements freely. This excludes efficient algorithms like Quicksort and Heapsort.
Compromise: Stable sorts usually need O(n) or O(log n) auxiliary space, or use complex hybrid algorithms like TimSort.
Real-World Implementations
std::sort (C++): Introsort
Introsort = Quicksort + Heapsort + Insertion Sort:
- Start with Quicksort
- If recursion depth exceeds 2·log₂(n), switch to Heapsort (guarantees O(n log n) worst case)
- For small partitions, use Insertion Sort
Java's Arrays.sort()
- Primitives: Dual-Pivot Quicksort
- Objects: Timsort (for stability)
Python's sorted(): Timsort
Core idea: Real-world data is often partially sorted. Timsort finds existing ascending runs and merges them. For nearly-sorted data, it approaches O(n).
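To see why run detection pays off, the sketch below counts the maximal non-decreasing runs in an array. It is a simplification: real Timsort also detects strictly descending runs and reverses them, and extends short runs to a minimum length. The function name `count_ascending_runs` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Count maximal non-decreasing runs in one linear pass.
// Fewer runs means less merging work for a run-based merge sort.
std::size_t count_ascending_runs(const std::vector<int>& v) {
    if (v.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] < v[i - 1]) ++runs;  // order breaks here: a new run starts
    }
    return runs;
}
```

A fully sorted array is a single run, so one pass finds it and nothing is left to merge. A shuffled array has many short runs, and the algorithm falls back to ordinary O(n log n) merging.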
Radix Sort and Linear Time
When applicable, non-comparison sorts break the O(n log n) barrier:
Radix Sort / Counting Sort: When N >> value range, complexity is O(n·k) where k is digit count. Dominant in specific domains such as image processing, fixed-width integer IDs, and log/event UIDs.
Requirements:
- Keys must be decomposable into digits/characters
- Value range should be bounded
- Often requires O(n) extra space
Outside of these constrained settings, general-purpose comparison-based sorts are usually a better default: they handle arbitrary key types, integrate well with existing libraries, and have predictable performance on a wide range of workloads.
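For fixed-width integer keys, the O(n·k) behavior can be made concrete with a least-significant-digit radix sort. The sketch below assumes 32-bit unsigned keys and 8-bit digits, so k = 4 passes; the function name `radix_sort_u32` is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort: four stable counting-sort passes, one per byte.
void radix_sort_u32(std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> scratch(keys.size());  // the O(n) extra space
    for (int shift = 0; shift < 32; shift += 8) {
        // Histogram of the current byte, stored one slot to the right
        std::array<std::size_t, 257> count{};
        for (std::uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
        // Prefix sum: count[d] becomes the first output index for digit d
        for (std::size_t d = 0; d < 256; ++d) count[d + 1] += count[d];
        // Stable scatter into the scratch buffer
        for (std::uint32_t k : keys) scratch[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(scratch);
    }
}
```

No key is ever compared with another key, which is how the sort avoids the comparison lower bound. The scatter step writes to up to 256 different locations per pass, so memory behavior, not instruction count, tends to decide its real speed.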
Benchmarking Sort Algorithms
Methodology
- Test multiple input patterns: Random, sorted, reverse, duplicates
- Warm up: First iteration has cold cache effects
- Multiple runs: Report median, not mean (avoids outlier skew)
- Realistic sizes: Test at N = 100, 1K, 10K, 100K, 1M
A Complete Benchmarking Example
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric> // std::iota
#include <random>
#include <vector>
enum class Pattern { RANDOM, SORTED, REVERSE, NEARLY_SORTED, DUPLICATES };
std::vector<int> generate_data(int n, Pattern pattern) {
std::vector<int> data(n);
std::mt19937 gen(42);
switch (pattern) {
case Pattern::RANDOM:
std::iota(data.begin(), data.end(), 0);
std::shuffle(data.begin(), data.end(), gen);
break;
case Pattern::SORTED:
std::iota(data.begin(), data.end(), 0);
break;
case Pattern::REVERSE:
std::iota(data.rbegin(), data.rend(), 0);
break;
case Pattern::NEARLY_SORTED:
std::iota(data.begin(), data.end(), 0);
// Perform n/20 random swaps (displaces at most ~10% of elements)
for (int i = 0; i < n / 20; i++) {
std::swap(data[gen() % n], data[gen() % n]);
}
break;
case Pattern::DUPLICATES:
for (int& x : data) x = gen() % 100; // Only 100 unique values
break;
}
return data;
}
// Takes data by value so each call sorts a fresh copy; the copy is made before timing starts
double benchmark_sort(std::vector<int> data) {
auto start = std::chrono::steady_clock::now(); // monotonic clock, suited to measuring intervals
std::sort(data.begin(), data.end());
auto end = std::chrono::steady_clock::now();
return std::chrono::duration<double, std::micro>(end - start).count();
}
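The warm-up and median rules from the methodology can be wrapped in a small harness. The sketch below is one possible shape, not part of the original example: the name `median_time_us` and the default of 11 runs are assumptions, and `runs` must be at least 1.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run sort_fn on a fresh copy of input `runs` times (after one untimed
// warm-up run) and return the median time in microseconds.
template <typename SortFn>
double median_time_us(const std::vector<int>& input, SortFn sort_fn, int runs = 11) {
    std::vector<double> times;
    times.reserve(static_cast<std::size_t>(runs));
    for (int r = -1; r < runs; ++r) {    // r == -1 is the warm-up iteration
        std::vector<int> data = input;    // copy outside the timed region
        auto start = std::chrono::steady_clock::now();
        sort_fn(data);
        auto end = std::chrono::steady_clock::now();
        if (r >= 0) {
            times.push_back(std::chrono::duration<double, std::micro>(end - start).count());
        }
    }
    // The median is robust against a few slow outlier runs
    auto mid = times.begin() + static_cast<std::ptrdiff_t>(times.size() / 2);
    std::nth_element(times.begin(), mid, times.end());
    return *mid;
}
```

A call might look like `median_time_us(data, [](std::vector<int>& d) { std::sort(d.begin(), d.end()); })`, repeated for each input pattern and each size from the methodology list.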
Measuring Branch Mispredictions
# Compare branch behavior across input patterns
perf stat -e branches,branch-misses ./sort_benchmark random
perf stat -e branches,branch-misses ./sort_benchmark sorted
# Illustrative results (exact numbers depend on CPU, compiler, and library version):
# Random: high misprediction rate (each comparison is close to 50/50)
# Sorted: low misprediction rate, typically a few percent (predictable patterns)
Measuring Cache Effects
# Compare cache behavior for different algorithms
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./quicksort_benchmark
perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
./heapsort_benchmark
# For arrays larger than the cache, Heapsort typically shows a 3-5x higher cache miss rate
This reveals why some algorithms suffer on large inputs despite good complexity.
Summary Comparison Table
| Algorithm | Average | Best Case | Space | Stable | Characteristics |
|---|---|---|---|---|---|
| std::sort | O(n log n) | O(n log n) | O(log n) | No | Cache-friendly, Introsort guarantees O(n log n) |
| Timsort | O(n log n) | O(n) | O(n) | Yes | Excellent for real data, complex implementation |
| Quicksort | O(n log n) | O(n log n) | O(log n) | No | Small constant factors, branch-heavy |
| Radix Sort | O(n·k) | O(n·k) | O(n+k) | Yes | Non-comparison, GPU-friendly |
Summary
- No single "best" sorting algorithm. The winner depends on data characteristics and hardware.
- Hybrid approaches dominate in practice. Modern library sorts combine multiple strategies.
- Input characteristics matter as much as algorithm choice. Nearly-sorted data is very different from random.
- Treat algorithm choice as a hypothesis and benchmark it on your real workload. Branch mispredictions and cache behavior often dominate theoretical complexity.
Chapter 16: SIMD & Vectorization
Part V: Parallelism & Low-Level Optimization
"The free lunch is over." — Herb Sutter
The 8x Speedup That Cost Nothing
A game studio was optimizing their physics engine. The collision detection code was clean, well-structured, and algorithmically optimal—yet it consumed 40% of frame time. A senior engineer spent an afternoon adding SIMD intrinsics to the inner loop. The result: 8x speedup, bringing collision detection down to 5% of frame time.
No algorithmic changes. No architectural redesign. Just telling the CPU to do 8 operations at once instead of 1.
This is the promise of SIMD: massive speedups for data-parallel workloads, often with minimal code changes. But it's also a minefield of alignment requirements, instruction set variations, and subtle correctness issues.
What is SIMD?
Single Instruction, Multiple Data: process multiple data elements with one instruction.
Scalar Operation:
A[0] + B[0] = C[0]
A[1] + B[1] = C[1]
A[2] + B[2] = C[2]
A[3] + B[3] = C[3]
→ 4 instructions
SIMD Operation (256-bit AVX):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│A[0]│A[1]│A[2]│A[3]│A[4]│A[5]│A[6]│A[7]│ (8 floats)
└────┴────┴────┴────┴────┴────┴────┴────┘
+
┌────┬────┬────┬────┬────┬────┬────┬────┐
│B[0]│B[1]│B[2]│B[3]│B[4]│B[5]│B[6]│B[7]│
└────┴────┴────┴────┴────┴────┴────┴────┘
↓
┌────┬────┬────┬────┬────┬────┬────┬────┐
│C[0]│C[1]│C[2]│C[3]│C[4]│C[5]│C[6]│C[7]│
└────┴────┴────┴────┴────┴────┴────┴────┘
→ 1 instruction
Theoretical vs Practical Speedup
With 256-bit registers processing 8 floats, you might expect 8x speedup. Reality is more nuanced:
| Factor | Impact |
|---|---|
| Memory bandwidth | Often the bottleneck, not compute |
| Alignment overhead | Unaligned loads are slower |
| Remainder handling | N not divisible by vector width |
| Register pressure | Limited SIMD registers |
| Instruction latency | Some SIMD ops have higher latency |
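Typical real-world speedup with 8-wide vectors: 4-8x for compute-bound kernels, 2-6x when the loop is partly limited by memory traffic. A loop that already saturates memory bandwidth gains little from wider vectors.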
SIMD Instruction Sets
x86/x64 Evolution
| Generation | Width | Year | Key Features |
|---|---|---|---|
| SSE | 128-bit | 1999 | 4x float |
| SSE2 | 128-bit | 2000 | 2x double, integer ops |
| AVX | 256-bit | 2011 | 8x float, 3-operand |
| AVX2 | 256-bit | 2013 | 256-bit integer |
| AVX-512 | 512-bit | 2017 | Masking, scatter/gather |
ARM
- NEON: 128-bit, ubiquitous on ARM (phones, tablets, M-series Macs)
- SVE/SVE2: Scalable Vector Extension (128-2048 bit), write-once-run-anywhere
RISC-V
- RVV (Vector Extension): Variable-length VLEN (128-65536 bits), vector-length agnostic (VLA) programming model. One codebase can run on different hardware widths without recompilation.
The Portability Challenge
Code written for AVX won't run on ARM. Code for AVX-512 won't run on older x86. Solutions:
- Runtime dispatch: Detect CPU features, call appropriate implementation
- Portable libraries: Highway, xsimd, std::experimental::simd (Parallelism TS v2; standardized as std::simd in C++26)
- Compiler auto-vectorization: Let the compiler handle it
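Runtime dispatch can be sketched with a function pointer that is chosen once at startup. The example below assumes GCC or Clang on x86 for the vector path (it relies on the `target` attribute and `__builtin_cpu_supports`) and falls back to scalar code everywhere else. The names `add_scalar`, `add_avx`, `select_add`, `add_arrays_dispatch`, and the macro `SKETCH_X86_DISPATCH` are all hypothetical.

```cpp
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1
#endif

// Portable fallback, always available.
static void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

#ifdef SKETCH_X86_DISPATCH
// Compiled with AVX enabled even if the rest of the file targets baseline x86.
__attribute__((target("avx")))
static void add_avx(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];  // scalar remainder
}
#endif

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Ask the running CPU what it supports and pick an implementation.
static AddFn select_add() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) return add_avx;
#endif
    return add_scalar;
}

void add_arrays_dispatch(const float* a, const float* b, float* c, std::size_t n) {
    static const AddFn impl = select_add();  // selected once, on first call
    impl(a, b, c, n);
}
```

The indirect call costs a few cycles, so dispatch belongs at the level of whole loops or kernels rather than individual operations. Portable libraries such as Highway package the same pattern behind their own macros.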
Auto-Vectorization
Modern compilers can automatically vectorize simple loops. This is the easiest path to SIMD benefits.
Helping the Compiler
// ✓ Good candidate: simple loop body, no loop-carried dependencies
void add_arrays(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✗ Same loop, seen from the compiler's side: potential aliasing
void add_arrays_bad(float* a, float* b, float* c, int n) {
// Compiler doesn't know if a, b, c overlap (c == a + 1 would create a
// dependency), so it must add runtime overlap checks or skip vectorization
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✓ Fixed with restrict (C) or __restrict (C++)
void add_arrays_good(float* __restrict a,
float* __restrict b,
float* __restrict c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Compiler Flags
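# GCC/Clang: Enable vectorization
-O3 -march=native
# Optional: -ffast-math also allows vectorizing floating-point reductions,
# but it relaxes IEEE 754 semantics and can change results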
# Check what got vectorized
-fopt-info-vec-optimized # GCC: show successful vectorization
-fopt-info-vec-missed # GCC: show failed attempts
-Rpass=loop-vectorize # Clang: show successful
-Rpass-missed=loop-vectorize # Clang: show failures
When Auto-Vectorization Fails
Common blockers:
- Pointer aliasing: Use restrict/__restrict
- Loop-carried dependencies: a[i] = a[i-1] + 1 can't parallelize
- Function calls in loop: Unless inlined
- Complex control flow: if statements inside loops
- Non-contiguous access: Strided or indirect indexing
Manual SIMD Programming
When auto-vectorization isn't enough, you have options:
Intel Intrinsics Example
#include <immintrin.h>
void add_arrays_avx(float* a, float* b, float* c, int n) {
int i = 0;
// Process 8 floats at a time
for (; i + 7 < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
// Handle remainder
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Portable SIMD with Highway
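#include "hwy/highway.h"
void add_arrays_highway(float* a, float* b, float* c, int n) {
    namespace hn = hwy::HWY_NAMESPACE;
    const hn::ScalableTag<float> d;
    const int lanes = static_cast<int>(hn::Lanes(d));
    int i = 0;
    // Process full vectors only, so loads and stores never run past n
    for (; i + lanes <= n; i += lanes) {
        auto va = hn::LoadU(d, a + i);
        auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::Add(va, vb), d, c + i);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}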
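Highway maps these calls to the best SIMD instruction set enabled for the compilation target. Selecting among several instruction sets at runtime uses its dynamic dispatch support.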
Memory Alignment
SIMD performance is sensitive to memory alignment:
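// Aligned allocation (C++17): size must be a multiple of the alignment,
// and the memory is released with std::free
std::size_t bytes = (n * sizeof(float) + 31) / 32 * 32; // round up to 32
float* data = static_cast<float*>(std::aligned_alloc(32, bytes)); // 32-byte for AVX
// Aligned load: faults if the address is not 32-byte aligned
__m256 va = _mm256_load_ps(aligned_ptr);
// Unaligned load: works for any address; on recent CPUs the cost is similar
// unless the load crosses a cache-line boundary
__m256 vu = _mm256_loadu_ps(any_ptr);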
Rule of thumb: Align to vector width (16 bytes for SSE, 32 for AVX, 64 for AVX-512).
Measuring SIMD Performance
Verify Vectorization Happened
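# Check assembly for packed vector instructions. The "ps"/"pd" suffix means packed;
# "ss"/"sd" forms such as vaddss are scalar even though they use the AVX encoding.
objdump -d binary | grep -E 'v(add|mul)p[sd].*ymm' # AVX: packed ops on 256-bit registers
objdump -d binary | grep -E '(add|mul)p[sd].*xmm' # SSE or 128-bit AVX: packed ops on 128-bit registers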
Benchmark Methodology
- Warm up: First iterations may have cold cache/code effects
- Large enough N: Small arrays may not show SIMD benefit
- Measure throughput: Elements processed per second
- Check memory bandwidth: SIMD can't help if memory-bound
Common Pitfalls
1. Assuming Linear Speedup
8-wide SIMD ≠ 8x speedup. Memory bandwidth, alignment, and remainder handling all reduce gains.
2. Ignoring Horizontal Operations
Summing a vector requires "horizontal" operations that are slower:
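// Horizontal sum of 8 floats in AVX: a chain of dependent shuffles and adds
float hsum(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);    // lower 4 floats
    __m128 hi = _mm256_extractf128_ps(v, 1);  // upper 4 floats
    lo = _mm_add_ps(lo, hi);                  // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(lo);        // copy odd lanes onto even lanes
    __m128 sums = _mm_add_ps(lo, shuf);       // lanes 0 and 2 hold pair sums
    shuf = _mm_movehl_ps(shuf, sums);         // bring lane 2 down to lane 0
    sums = _mm_add_ss(sums, shuf);            // total in lane 0
    return _mm_cvtss_f32(sums);
}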
3. AVX-512 Frequency Throttling
On some Intel CPUs, heavy AVX-512 use causes frequency reduction (up to 20%). The wider operations may not compensate for the lower clock speed.
4. Forgetting the Scalar Remainder
When N isn't divisible by vector width, you need scalar cleanup code.
Summary
- SIMD provides 2-8x speedup for data-parallel workloads
- Start with auto-vectorization: Use -O3 -march=native, check compiler output
- Help the compiler: Use restrict, avoid aliasing, keep loops simple
- Manual SIMD when needed: Intrinsics for maximum control, portable libraries for cross-platform
- Measure carefully: Verify vectorization happened, account for memory bandwidth
Chapter 17: Multi-core Performance
Part V: Parallelism & Low-Level Optimization
"More threads doesn't always mean more performance." — Every performance engineer
The Thread That Made It Slower
A team was optimizing a data processing pipeline. The single-threaded version processed 100,000 records per second. "Easy win," they thought—just parallelize it across 8 cores.
The 8-threaded version processed... 60,000 records per second.
After a week of investigation, they found the culprit: false sharing. Each thread had its own counter variable, but all counters happened to be allocated on the same cache line. Every increment by any thread invalidated the cache for all other threads. The "parallel" code was actually serialized by cache coherence traffic.
Adding 56 bytes of padding between counters brought performance to 750,000 records per second—7.5x the original, finally matching expectations.
Parallelism Fundamentals
Amdahl's Law: The Ceiling You Can't Break
Speedup = 1 / ((1-P) + P/N)
Where:
- P = parallelizable fraction
- N = number of processors
The brutal math: If 10% of your code is sequential (P = 0.9), maximum speedup with infinite processors is 10x. With 8 cores, you get 4.7x. With 64 cores, you get 8.8x.
Speedup vs Cores (P = 0.9):
│
10 │ ─────────── Theoretical max
│ ╱
8 │ ╱
│ ╱
4 │ ╱
│╱
1 └────────────────────────────
1 4 8 16 32 64 Cores
Gustafson's Law: Scale the Problem
Amdahl assumes fixed problem size. Gustafson observed that in practice, we often scale the problem with the hardware:
Scaled Speedup = N + (1-N) × s
Where s = sequential fraction
If you have 8 cores and 5% sequential work, scaled speedup = 8 + (1-8) × 0.05 = 7.65x.
Key insight: Larger problems often have proportionally less sequential work.
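Both laws are one-line formulas, so it is easy to tabulate them for your own sequential fraction before buying hardware. The sketch below transcribes the two formulas directly; the function names `amdahl_speedup` and `gustafson_speedup` are hypothetical.

```cpp
// Amdahl: fixed problem size.
// p = parallelizable fraction (0..1), n = number of processors
constexpr double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson: problem size grows with the machine.
// s = sequential fraction (0..1), n = number of processors
constexpr double gustafson_speedup(double s, double n) {
    return n + (1.0 - n) * s;
}
```

With these, `amdahl_speedup(0.9, 8)` is about 4.7 and `gustafson_speedup(0.05, 8)` is 7.65, matching the worked numbers in the text. The gap between the two models is a reminder to state which kind of scaling a benchmark result refers to.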
Thread Overhead
Thread Creation Cost
| Operation | Typical Cost |
|---|---|
| Thread creation (OS) | 10-100 μs |
| Thread pool task dispatch | 0.1-1 μs |
| Context switch | 1-10 μs |
| Mutex lock (uncontended) | 10-50 ns |
| Mutex lock (contended) | 1-100 μs |
| Atomic increment | 5-50 ns |
Rule of thumb: Don't create threads for tasks shorter than 10-100 μs. Use thread pools.
Context Switch Overhead
When threads exceed available cores, the OS must context switch. Each switch:
- Saves/restores registers
- May flush TLB entries
- Pollutes caches with new thread's data
Symptom: Performance degrades when thread count >> core count.
Synchronization Costs
Lock Contention
Uncontended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2: [acquire]────[work]────[release]
Contended lock:
Thread 1: [acquire]────[work]────[release]
Thread 2: [spin/wait............][acquire]────[work]────[release]
↑ Wasted time
Contention scales poorly: With N threads competing for one lock, average wait time grows as O(N).
Atomic Operations
Atomics avoid locks but aren't free:
std::atomic<int> counter{0};
counter++; // On x86: a LOCK-prefixed read-modify-write (e.g., LOCK XADD or LOCK ADD)
Cost: 5-50 ns per atomic operation, depending on contention. Under high contention, atomics can be as slow as locks.
False Sharing: The Silent Killer
Cache Line (64 bytes):
┌────────────────────────────────────────────────────────────────┐
│ counter_1 │ counter_2 │ counter_3 │ counter_4 │ ... padding ... │
│ (4 bytes) │ (4 bytes) │ (4 bytes) │ (4 bytes) │ │
└────────────────────────────────────────────────────────────────┘
↑ ↑ ↑ ↑
Thread 1 Thread 2 Thread 3 Thread 4
Every write invalidates the entire line for all other threads!
Solution: Pad to cache line boundaries:
struct alignas(64) PaddedCounter {
std::atomic<int> value;
char padding[60]; // Fill to 64 bytes
};
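A minimal way to try the padding fix is to give each thread its own cache-line-sized counter and sum them at the end. The sketch below assumes C++17 (for over-aligned allocation inside `std::vector`) and a 64-byte cache line, which is typical but not universal. The function name `count_in_parallel` is hypothetical, and this version of the struct lets `alignas` supply the padding instead of an explicit array.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// 64 is an assumed cache-line size; alignas rounds sizeof up to 64 as well.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

// Each thread increments only its own counter; returns the combined total.
long count_in_parallel(std::size_t num_threads, long increments_per_thread) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counters) total += c.value.load();
    return total;
}
```

To observe the effect, time this against a variant where the struct has no `alignas`, so several counters share a line. The totals are identical; only the elapsed time and the `perf c2c` output should differ.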
Cache Coherence: MESI Protocol
When multiple cores share memory, hardware maintains coherence via MESI:
| State | Meaning |
|---|---|
| Modified | This cache has the only valid copy (dirty) |
| Exclusive | This cache has the only copy (clean) |
| Shared | Multiple caches have this line |
| Invalid | This cache line is stale |
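Cache line bouncing: When two cores repeatedly write to the same line, ownership ping-pongs between them. Each write moves the line to Modified in the writer's cache and invalidates the other core's copy, generating coherence traffic on every access.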
Measuring Parallel Performance
Scalability Metrics
Strong scaling: Fixed problem size, increase cores.
- Ideal: 2x cores → 2x speedup
- Reality: Diminishing returns due to Amdahl's Law
Weak scaling: Proportional problem size (2x cores → 2x data).
- Ideal: Constant time regardless of core count
- Reality: Communication overhead grows
Profiling Tools
# Linux perf for cache and synchronization events
perf stat -e cache-misses,cache-references,\
context-switches,cpu-migrations ./program
# Intel VTune for detailed threading analysis
vtune -collect threading ./program
# Detect false sharing
perf c2c record ./program
perf c2c report
Patterns for Good Parallelism
Embarrassingly Parallel
No communication between tasks. Maximum scalability.
// Perfect parallelism: each iteration independent
#pragma omp parallel for
for (int i = 0; i < n; i++) {
result[i] = expensive_computation(input[i]);
}
Data Parallelism with Reduction
// Reduction: each thread accumulates locally, then combine
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
sum += data[i];
}
Pipeline Parallelism
Stage 1 (Parse) → Stage 2 (Process) → Stage 3 (Write)
↓ ↓ ↓
Thread 1 Thread 2 Thread 3
Good when stages have similar duration. Bottleneck is the slowest stage.
Task Parallelism with Work Stealing
// Task-based parallelism (TBB, OpenMP tasks)
tbb::parallel_for(0, n, [&](int i) {
process(items[i]);
});
Work stealing balances load dynamically: idle threads steal tasks from busy threads' queues.
Common Anti-Patterns
1. Over-Synchronization
// Bad: Lock held during I/O
mutex.lock();
result = expensive_network_call(); // Other threads blocked!
mutex.unlock();
// Better: Minimize critical section
auto data = expensive_network_call();
mutex.lock();
result = data;
mutex.unlock();
2. Thread-per-Request
Creating a thread for each request doesn't scale. Use thread pools with bounded concurrency.
3. Ignoring NUMA
On multi-socket systems, memory access time varies by location:
Socket 0 Socket 1
┌─────────┐ ┌─────────┐
│ Core 0-7│ │Core 8-15│
│ Memory A│ ←──→ │Memory B │
└─────────┘ └─────────┘
↑ Local: 80ns ↑ Remote: 150ns
Solution: Pin threads to cores, allocate memory locally.
Summary
- Amdahl's Law sets hard limits on parallel speedup. Know your sequential fraction.
- Synchronization is expensive. Minimize critical sections, prefer lock-free when possible.
- False sharing is a silent killer. Pad data structures to cache line boundaries.
- Measure scalability with both strong and weak scaling tests.
- Use appropriate patterns: Embarrassingly parallel > reduction > pipeline > fine-grained locking.
Chapter 18: Memory Allocators
Part V: Parallelism & Low-Level Optimization
"Memory allocation is the dark matter of performance." — Anonymous
The malloc That Wasn't Free
A high-frequency trading firm was debugging mysterious latency spikes. Their code was tight—no unnecessary copies, no blocking I/O, carefully tuned algorithms. Yet every few seconds, a trade would take 10x longer than expected.
The culprit: malloc. Under heavy allocation load, the default glibc allocator would occasionally need to request memory from the kernel via mmap, causing a 50-100 μs stall. In a system where microseconds mattered, this was catastrophic.
Switching to jemalloc with pre-warmed arenas eliminated the spikes entirely. The fastest allocation is the one you don't make—but when you must allocate, the allocator matters enormously.
How malloc Works
The Allocation Problem
Memory allocation seems simple: give me N bytes, return a pointer. But the allocator must solve several hard problems:
- Speed: Allocation should be O(1) or close to it
- Fragmentation: Avoid wasting memory on gaps
- Thread safety: Multiple threads allocating simultaneously
- Cache efficiency: Keep related allocations together
Traditional malloc (glibc ptmalloc2)
Heap Structure:
┌─────────────────────────────────────────────────┐
│ Chunk │ Chunk │ Free │ Chunk │ Free │ Chunk │...│
│ 16B │ 32B │ 64B │ 128B │ 48B │ 24B │ │
└─────────────────────────────────────────────────┘
↑
Free list links free chunks together
Free list: Freed memory is linked together. Allocation searches for a suitable chunk.
Coalescing: Adjacent free chunks are merged to reduce fragmentation.
Size classes: Small allocations use bins of fixed sizes (16, 32, 48, 64... bytes).
System Call Overhead
User Space: Kernel Space:
┌──────────────┐ ┌──────────────┐
│ malloc() │ ──brk──→│ Extend heap │ (small allocs)
│ │ │ │
│ malloc() │ ─mmap──→│ Map pages │ (large allocs, >128KB)
│ │ │ │
│ free() │ ─munmap→│ Unmap pages │ (large frees)
└──────────────┘ └──────────────┘
The problem: System calls are expensive (1-10 μs). Good allocators minimize them by:
- Caching freed memory for reuse
- Requesting memory in large chunks
- Delaying returns to the OS
Modern Allocators
jemalloc (Facebook/Meta)
Originally developed for FreeBSD, now widely used (Firefox, Redis, Facebook services).
Key innovations:
- Thread-local caches: Each thread has its own cache, avoiding lock contention
- Arenas: Multiple independent heaps reduce contention
- Size classes: Carefully chosen to minimize internal fragmentation
- Huge page support: Reduces TLB misses for large allocations
tcmalloc (Google)
Google's thread-caching malloc, used in most Google services.
Architecture:
┌─────────────────────────────────────────────────┐
│ Thread 1 Cache │ Thread 2 Cache │ Thread N Cache│
└───────┬────────┴───────┬────────┴───────┬───────┘
│ │ │
└────────────────┼────────────────┘
↓
┌─────────────────────┐
│ Central Free List │
└─────────────────────┘
↓
┌─────────────────────┐
│ Page Heap │
└─────────────────────┘
Small objects (< 256KB): Served from thread-local cache, no locking.
Large objects: Served from central page heap with locking.
mimalloc (Microsoft)
Microsoft's allocator, designed for maximum performance.
Key features:
- Free list sharding: Multiple free lists per size class reduce contention
- Eager page reuse: Freed pages are immediately available for reuse
- Segment-based: Memory organized in segments for better locality
Performance Comparison
| Allocator | Single-thread | Multi-thread | Memory Overhead | Latency Consistency |
|---|---|---|---|---|
| glibc ptmalloc2 | Baseline | Poor (lock contention) | Medium | Variable |
| jemalloc | 1.1-1.3x | 2-5x | Low | Good |
| tcmalloc | 1.1-1.2x | 2-4x | Low | Good |
| mimalloc | 1.2-1.5x | 3-6x | Very Low | Excellent |
Speedup relative to glibc on typical workloads
Memory Pools and Custom Allocators
When general-purpose allocators aren't enough:
Object Pools
Pre-allocate a fixed number of same-sized objects:
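#include <array>
#include <cstddef>
template<typename T, std::size_t N>
class ObjectPool {
    std::array<T, N> storage;     // Objects are default-constructed up front
    std::array<T*, N> free_list;  // Stack of pointers to unused slots
    std::size_t free_count = N;
public:
    ObjectPool() {
        // Initially every slot is free
        for (std::size_t i = 0; i < N; i++) free_list[i] = &storage[i];
    }
    T* allocate() {
        if (free_count == 0) return nullptr;  // Pool exhausted
        return free_list[--free_count];
    }
    void deallocate(T* ptr) {
        // ptr must come from allocate() and be returned exactly once
        free_list[free_count++] = ptr;
    }
};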
Use case: Game entities, network connections, fixed-size messages.
Arena/Bump Allocators
Allocate sequentially, free all at once:
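#include <cstddef>
class Arena {
    char* buffer;
    std::size_t offset = 0;
    std::size_t capacity;
    // Round size up to a multiple of align (align must be a power of two)
    static std::size_t align_up(std::size_t size, std::size_t align) {
        return (size + align - 1) & ~(align - 1);
    }
public:
    // The arena does not own the buffer; the caller provides it (8-byte aligned)
    Arena(char* buf, std::size_t cap) : buffer(buf), capacity(cap) {}
    void* allocate(std::size_t size) {
        size = align_up(size, 8);
        if (offset + size > capacity) return nullptr;
        void* ptr = buffer + offset;
        offset += size;
        return ptr;
    }
    void reset() { offset = 0; } // "Free" everything
};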
Use case: Per-frame game allocations, request-scoped web server data, compiler passes.
Stack Allocators
LIFO allocation pattern:
class StackAllocator {
char* buffer;
size_t offset = 0;
public:
void* allocate(size_t size);
void deallocate(void* ptr); // Must be most recent allocation
};
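The interface above leaves the bodies open. One simple way to fill them in is shown below, as a sketch under stated assumptions: the class name `StackAllocatorSketch`, the caller-provided buffer, the 8-byte rounding, and the `used()` helper are all illustrative choices, and the LIFO rule is trusted rather than checked.

```cpp
#include <cstddef>

class StackAllocatorSketch {
    char* buffer;            // caller-provided storage, not owned
    std::size_t capacity;
    std::size_t offset = 0;  // top of the stack
public:
    StackAllocatorSketch(char* buf, std::size_t cap) : buffer(buf), capacity(cap) {}

    void* allocate(std::size_t size) {
        size = (size + 7) & ~static_cast<std::size_t>(7);  // round up to 8 bytes
        if (size > capacity - offset) return nullptr;       // out of space
        void* ptr = buffer + offset;
        offset += size;
        return ptr;
    }

    // LIFO rule: ptr must be the most recent live allocation.
    // Rewinding the top of the stack to ptr releases it.
    void deallocate(void* ptr) {
        offset = static_cast<std::size_t>(static_cast<char*>(ptr) - buffer);
    }

    std::size_t used() const { return offset; }
};
```

Both operations are a handful of instructions with no search and no locking, which is where the speed comes from. The cost is the ordering constraint: freeing out of order silently releases every later allocation too.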
Benchmarking Allocators
Methodology
- Use realistic workloads: Synthetic benchmarks often don't reflect real patterns
- Measure latency distribution: Mean is less important than P99/P999
- Test under contention: Single-threaded benchmarks miss the point
- Monitor memory usage: Fast but memory-hungry isn't always better
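The latency-distribution point can be illustrated with a small probe that times individual `malloc` calls and reports percentiles. This is a rough sketch, not a rigorous benchmark: it is single-threaded, uses one fixed allocation size, and includes timer overhead in every sample. The names `LatencyStats` and `measure_malloc_latency` are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct LatencyStats {
    double p50_ns;
    double p99_ns;
    double max_ns;
};

// Time `samples` individual malloc calls of alloc_size bytes each.
LatencyStats measure_malloc_latency(std::size_t samples, std::size_t alloc_size) {
    if (samples == 0) return {0.0, 0.0, 0.0};
    std::vector<double> ns;
    ns.reserve(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        auto start = std::chrono::steady_clock::now();
        void* p = std::malloc(alloc_size);
        auto end = std::chrono::steady_clock::now();
        // Touch the memory so the compiler cannot drop the allocation
        if (p) static_cast<volatile char*>(p)[0] = 1;
        std::free(p);
        ns.push_back(std::chrono::duration<double, std::nano>(end - start).count());
    }
    std::sort(ns.begin(), ns.end());
    // Nearest-rank style percentile lookup on the sorted samples
    auto at = [&](double q) {
        return ns[static_cast<std::size_t>(q * static_cast<double>(ns.size() - 1))];
    };
    return {at(0.50), at(0.99), ns.back()};
}
```

Running the same binary under different allocators (for example via LD_PRELOAD) and comparing the p99 and max columns shows tail behavior that a mean would hide. A realistic test would replay the application's own allocation sizes and run several threads at once.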
Tools
# heaptrack: Allocation profiling
heaptrack ./program
heaptrack_gui heaptrack.program.*.gz
# Valgrind massif: Heap profiling
valgrind --tool=massif ./program
ms_print massif.out.*
# perf for allocation-related events
perf stat -e page-faults,minor-faults,major-faults ./program
Switching Allocators
Most modern allocators can be used via LD_PRELOAD, with no recompilation (library paths vary by distribution):
# Use jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so ./program
# Use tcmalloc
LD_PRELOAD=/usr/lib/libtcmalloc.so ./program
# Use mimalloc
LD_PRELOAD=/usr/lib/libmimalloc.so ./program
Embedded and Real-Time Considerations
Deterministic Allocation
Real-time systems need bounded allocation time. Solutions:
- Static allocation: Allocate everything at startup
- Pool allocators: Fixed-size pools with O(1) allocation
- TLSF (Two-Level Segregated Fit): O(1) general-purpose allocator
Memory Budget
Embedded systems have fixed memory. Strategies:
- Pre-calculate maximum memory needs
- Use compile-time allocation where possible
- Implement memory monitoring and alerts
Summary
- Default malloc is often a bottleneck in multi-threaded applications
- Modern allocators (jemalloc, tcmalloc, mimalloc) provide 2-6x better multi-threaded performance
- Custom allocators (pools, arenas) can provide 10-100x speedup for specific patterns
- Measure before switching: Profile allocation patterns to understand your needs
- Consider latency, not just throughput: P99 latency often matters more than average
Chapter 19: Footprint Analysis Fundamentals
Part VI: Embedded Constraints
"In embedded systems, every byte counts—literally." — Jack Ganssle
The Missing 2 KB
Two weeks before mass production, our project war room reeked of espresso and anxiety.
The development team had just merged the final security patch, but the automated build server flashed an angry red warning:
region 'FLASH' overflowed by 2048 bytes
This was a 128 KB flash system, and our binary had grown to 130 KB. Those extra 2 KB stood like an insurmountable wall between us and product shipment.
Senior engineer Zhang immediately shouted: "Quick! Strip out all the debug strings from printf, and disable those unnecessary assert statements!"
Everyone scrambled through the source code, hunting for strings to delete. An hour later, the second build result arrived: only 400 bytes saved.
The team fell silent. Blind "intuition-based optimization" proved utterly powerless against hard memory constraints.
"We need data, not guesses." Junior performance engineer Ming broke the chaos.
Instead of rushing to delete code, he calmly ran size and nm --size-sort. In the detailed linker map file, he discovered the real "space killer" wasn't printf—it was a newly introduced third-party sensor driver.
That driver had inadvertently pulled in the floating-point emulation library, all because of a calibration routine that mistakenly used double for fewer than ten lines of data processing.
Through systematic analysis tools, the team fixed just two lines of code, converting floating-point to fixed-point arithmetic. The binary instantly shrank by 15 KB.
Optimizing footprint isn't a guessing game of "deleting code"—it's a precise science of measurement.
What is Footprint?
In embedded systems, footprint refers to the memory space a program occupies. Unlike desktop systems, embedded memory is a hard constraint—your firmware must fit into fixed-size flash and RAM.
Static vs Dynamic Footprint
Footprint can be categorized into two types:
┌─────────────────────────────────────────────────────────┐
│ Static Footprint (determined at compile time) │
├─────────────────────────────────────────────────────────┤
│ .text │ Machine code, instructions │ Stored in Flash │
│ .rodata │ Constants, string literals │ Stored in Flash │
│ .data │ Initialized globals │ Flash → RAM │
│ .bss │ Uninitialized globals │ RAM (zeroed) │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Dynamic Footprint (changes at runtime) │
├─────────────────────────────────────────────────────────┤
│ Stack │ Local variables, call frames │ RAM │
│ Heap │ Dynamically allocated memory │ RAM │
└─────────────────────────────────────────────────────────┘
Flash vs RAM Occupancy
Understanding the mapping between sections and memory is crucial:
Flash usage = .text + .rodata + .data (initial values)
RAM usage = .data + .bss + stack + heap
Note that .data occupies both Flash (storing initial values) and RAM (runtime storage). This detail trips up many engineers.
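The section mapping is easier to remember with one declaration per section. The sketch below shows the usual placement with GCC-style toolchains; the exact result depends on the compiler, its options, and the linker script, and all identifiers are hypothetical.

```cpp
#include <cstdint>

// .rodata: read-only constant, stays in Flash
const char kBanner[] = "boot ok";

// .data: non-zero initial value is stored in Flash and copied to RAM at startup
std::uint32_t g_retry_limit = 3;

// .bss: no initializer, occupies RAM only and is zeroed at startup
std::uint8_t g_rx_buffer[512];

// .text: the machine code for this function lives in Flash
std::uint32_t checksum(const std::uint8_t* p, std::uint32_t n) {
    std::uint32_t sum = 0;  // local variable: stack (RAM) or a register
    for (std::uint32_t i = 0; i < n; ++i) sum += p[i];
    return sum;
}
```

Here `g_retry_limit` is the variable that costs twice: four bytes of Flash for the initial value and four bytes of RAM at runtime. Running `nm -S` on the resulting object file shows the type letter assigned to each symbol.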
Why "Won't Fit in Flash" Is So Common
Typical embedded system memory constraints:
Device Type Flash RAM
────────────────────────────────
Low-end MCU 32 KB 4 KB
Mid-range MCU 256 KB 64 KB
High-end MCU 1 MB 256 KB
Application CPU Unlimited* 512 MB+
* Has file system and virtual memory
As features accumulate, code size easily grows unnoticed. A single "harmless" library reference might bring in tens of KB of hidden dependencies.
The Toolbox
Just as performance analysis needs profilers, footprint analysis requires specialized tools. Here are four essential tools for systems software engineers.
1. The size Command: Quick Overview
size is the most basic tool for quickly grasping a binary's overall structure:
$ riscv64-unknown-elf-size -A firmware.elf
section size addr
.text 0x4500 0x80000000
.rodata 0x0800 0x80004500
.data 0x0100 0x80004d00
.bss 0x0200 0x20000000
.stack 0x1000 0x20000200
Total 0x6000
Key metrics:
- Flash usage: .text + .rodata + .data = 19,968 bytes
- RAM usage: .data + .bss + .stack = 4,864 bytes
Pro tip: Use -A (System V format) instead of the default Berkeley format for more detailed section breakdown.
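The Flash and RAM sums are simple enough to automate, for example in a host-side check that fails the build when a budget is exceeded. The sketch below hard-codes the section sizes from the listing above; the `SectionSizes` struct and the function names are hypothetical, and a real script would parse the output of `size` instead.

```cpp
#include <cstdint>

struct SectionSizes {
    std::uint32_t text;
    std::uint32_t rodata;
    std::uint32_t data;
    std::uint32_t bss;
    std::uint32_t stack;
};

// Flash holds code, constants, and the initial values of .data
constexpr std::uint32_t flash_usage(const SectionSizes& s) {
    return s.text + s.rodata + s.data;
}

// RAM holds .data at runtime, .bss, and the stack (heap not listed by size)
constexpr std::uint32_t ram_usage(const SectionSizes& s) {
    return s.data + s.bss + s.stack;
}

// Section sizes copied from the size -A listing above
constexpr SectionSizes kFirmware{0x4500, 0x0800, 0x0100, 0x0200, 0x1000};

// True when the image fits the given Flash and RAM capacities (in bytes)
constexpr bool fits(const SectionSizes& s, std::uint32_t flash_cap, std::uint32_t ram_cap) {
    return flash_usage(s) <= flash_cap && ram_usage(s) <= ram_cap;
}
```

Note that `.data` appears in both sums, which is the double counting the previous section warned about.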
2. The nm Command: Symbol-Level Analysis
When you find a section is too large, dig into symbol level to find the culprit:
$ riscv64-unknown-elf-nm -S --size-sort -r firmware.elf | head -10
80001a20 000005d4 T core_process_loop
20000040 00000400 B network_buffer
800021f4 00000210 t parse_json_string
80002404 000001c8 T uart_send_buffer
...
Output interpretation:
- Column 1: Symbol address
- Column 2: Symbol size (bytes)
- Column 3: Symbol type (T=text, B=bss, D=data; a lowercase letter such as t marks a local symbol)
- Column 4: Symbol name
Pro tip: Filter by type, e.g., finding only large variables in RAM:
$ nm -S --size-sort -r firmware.elf | grep -E ' [BbDd] '
3. bloaty: Modern Footprint Analyzer
Bloaty McBloatface is an advanced footprint analysis tool from Google. It displays space distribution hierarchically and supports diff comparison between versions.
# Analyze by compile unit (source file)
$ bloaty firmware.elf -d compileunits
VM SIZE FILE SIZE
-------------- --------------
62.5% 5.15Ki tasks.c 62.5% 5.15Ki
21.2% 1.75Ki queue.c 21.2% 1.75Ki
8.5% 712B list.c 8.5% 712B
7.8% 650B port.c 7.8% 650B
Version comparison (Diff)—bloaty's most powerful feature:
$ bloaty new_firmware.elf -- old_firmware.elf
VM SIZE FILE SIZE
-------------- --------------
+15.2% +2.1Ki .text +15.2% +2.1Ki
[ = ] 0 .rodata [ = ] 0
+8.3% +128B .bss +8.3% +128B
-------------- --------------
+12.1% +2.2Ki TOTAL +12.1% +2.2Ki
This diff capability is especially useful in CI/CD—automatically compare footprint changes after each commit.
4. Linker Map File: The Ultimate Truth
The linker map file records how the linker combines all object files into the final binary. It's the ultimate weapon for solving "where did the space go?" mysteries.
Generating a map file:
$ riscv64-unknown-elf-gcc main.o lib.o -Wl,-Map=output.map -o firmware.elf
Map file example:
.text.core_init
0x0000000080000100 0x48 main.o
.text.uart_send
0x0000000080000148 0x20 uart.o
*fill* 0x0000000080000168 0x08
.text.process_data
0x0000000080000170 0x120 process.o
Key observations:
- *fill* indicates padding (alignment), a hidden source of wasted space
- You can trace each symbol back to its source object file
- You can discover libraries that were accidentally linked in
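The size of a *fill* entry follows directly from the alignment of the next section. As a sketch, assuming .text.process_data requires 16-byte alignment (the map excerpt does not state this), the 8 bytes of padding can be reproduced like this; both helper names are illustrative.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align) {
    return (addr + align - 1) & ~(align - 1);
}

/* Bytes of *fill* inserted before a section with the given alignment. */
static uint64_t fill_bytes(uint64_t end_of_previous, uint64_t align) {
    return align_up(end_of_previous, align) - end_of_previous;
}
```

Summing the *fill* entries of a map file gives a quick estimate of how much space alignment costs overall.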
Analysis Workflow
Establish a systematic analysis process instead of guessing by intuition:
Step 1: Baseline Measurement
↓
$ size firmware.elf
Record .text, .data, .bss sizes
↓
Step 2: Identify Heavy Hitters
↓
$ nm -S --size-sort -r firmware.elf | head -20
Find symbols consuming the most space
↓
Step 3: Trace Origins
↓
Check linker map file
Confirm which object files these symbols come from
↓
Step 4: Analyze Causes
↓
- Is there an accidentally included library?
- Are there unnecessary features being compiled in?
- Are there oversized static buffers?
↓
Step 5: Verify Changes
↓
$ bloaty new.elf -- old.elf
Confirm changes actually reduced footprint
Common "Space Killers"
1. Floating-Point Library
- Using float/double on MCUs without FPU
- Even a single printf("%f") pulls in the entire float formatting library
2. Standard Library Functions
- printf family: 10-20 KB
- malloc/free: 1-5 KB
- Consider newlib-nano or custom minimal versions
3. Oversized Static Buffers
- char log_buffer[4096]; // Do you really need it this big?
4. Unused Features
- Referencing a library but only using a small part
- Not enabling --gc-sections to remove dead code
Case Study: Tracing an Accidental Library Reference
Let's return to the opening story and reconstruct Ming's analysis process with tools.
Step 1: Discover the problem
$ size firmware_before.elf
text data bss dec hex filename
133120 256 4096 137472 21900 firmware_before.elf
Flash usage is 133,376 bytes (.text + .data), exceeding the 128 KB limit.
Step 2: Find the heavy hitters
$ nm -S --size-sort -r firmware_before.elf | head -5
80010000 00003a00 T __aeabi_ddiv
8000c600 00002800 T __aeabi_dmul
80009e00 00001c00 T __aeabi_dadd
80008200 00001c00 T __aeabi_dsub
80006600 00001400 T __aeabi_d2iz
These __aeabi_d* functions are software-emulation routines for double-precision floating-point operations! The five shown here already account for about 43.5 KB; the whole group totals about 50 KB.
Step 3: Trace the origin
Search for these symbols' source in the linker map file:
$ grep -A1 "__aeabi_ddiv" output.map
__aeabi_ddiv
0x80010000 0x3a00 libgcc.a(dp-bit.o)
It's libgcc's double-precision floating-point emulation.
Step 4: Find the caller
$ grep -rn "double\|float" src/
src/drivers/sensor.c:42: double calibrated = raw_value * 0.0125;
There it is! A simple calibration operation pulled in 50 KB of floating-point library.
Step 5: Fix it
Convert double operations to fixed-point:
// Before: pulls in 50 KB floating-point library
double calibrated = raw_value * 0.0125;
// After: fixed-point, 0 KB overhead
int32_t calibrated = (raw_value * 125) / 10000;
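One caveat: the integer version above truncates the fractional part that the double kept. If that resolution matters, a common approach is to return the result in a smaller unit. The sketch below returns thousandths; it assumes raw_value is a non-negative reading small enough that raw * 25 fits in 32 bits, and the function name is hypothetical.

```c
#include <stdint.h>

/* 0.0125 * 1000 = 12.5 = 25 / 2, so the result in thousandths is raw * 25 / 2.
   Adding 1 before the division rounds to nearest instead of truncating.
   Assumes 0 <= raw <= about 85 million, so the multiply cannot overflow. */
static int32_t calibrate_milli(int32_t raw) {
    return (raw * 25 + 1) / 2;
}
```

Reducing the fraction first keeps everything in 32-bit arithmetic, which avoids pulling in 64-bit division helpers on a 32-bit MCU.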
Step 6: Verify
$ bloaty firmware_after.elf -- firmware_before.elf
VM SIZE FILE SIZE
-------------- --------------
-37.5% -50.0Ki .text -37.5% -50.0Ki
[ = ] 0 .data [ = ] 0
[ = ] 0 .bss [ = ] 0
-------------- --------------
-37.5% -50.0Ki TOTAL -37.5% -50.0Ki
Success—50 KB saved!
Summary
- Footprint = memory space a program occupies, including code size (Flash) and data size (RAM)
- Measurement tools:
- size: Quick overview of section sizes
- nm --size-sort: Find the largest symbols
- bloaty: Hierarchical analysis and version comparison
- Linker map file: Trace symbol origins
- Analysis workflow: Baseline measurement → Identify heavy hitters → Trace origins → Analyze causes → Verify changes
- Common pitfalls: Floating-point library, standard library functions, oversized static buffers, unremoved dead code
- Core principle: Measure, don't guess
Chapter 20: Compiler Size Optimization
Part VI: Embedded Constraints
"Premature optimization is the root of all evil, but so is premature pessimization." — Unknown
The Optimization Level Myth
Every C/C++ developer knows -O2 and -O3 make programs run faster. But in the embedded world, there are two often-overlooked friends: -Os and -Oz.
In the previous chapter's story, we used systematic analysis to identify the floating-point library as the "space killer." But that was just the beginning—true size optimization requires understanding what the compiler does behind the scenes.
This chapter answers a core question with experimental data: How much code size difference do different compiler options actually produce?
Optimization Level Comparison
GCC Optimization Level Definitions
Level Goal Characteristics
────────────────────────────────────────────────────────────
-O0 Debug No optimization, max debuggability
-O1 Basic Reduce code size, moderate optimization
-O2 Standard Balance speed and size
-O3 Aggressive Maximum speed, may increase code size
-Os Size optimization Based on -O2, but prefer smaller code
-Oz Minimum size More aggressive than -Os (Clang; GCC 12 and later)
-Og Debug optimization Suitable for use with debugger
Experiment: Compiling the Same Program
We use a typical embedded application (RTOS task management + UART driver) as our test baseline:
# Compile same source with different optimization levels
$ riscv64-unknown-elf-gcc -O0 -o test_O0.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O1 -o test_O1.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O2 -o test_O2.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O3 -o test_O3.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -Os -o test_Os.elf main.c drivers/*.c
Measurement results (.text section size):
Opt Level .text (bytes) vs -O0 vs -Os
────────────────────────────────────────────────────────
-O0 28,672 100.0% +78.1%
-O1 18,432 64.3% +14.5%
-O2 20,480 71.4% +27.2%
-O3 24,576 85.7% +52.7%
-Os 16,096 56.1% baseline
-Oz* 14,848 51.8% -7.8%
* Measured with Clang; GCC 12 and later also accept -Oz
Key observations:
- -O3 is larger than -O2: Aggressive inlining and loop unrolling increase code size
- -Os is ~20% smaller than -O2: Significant for memory-constrained systems
- -O0 is largest: No optimization, each statement generates separate instructions
Speed vs Size Trade-off
Code Size
▲
│ ● -O3 (fastest, but usually largest)
│
│ ● -O0 (slow and large, no optimization)
│
│ ● -O2 (fast, balanced size)
│
│ ● -O1 (basic optimization)
│
│ ● -Os (small size, good speed)
│ ● -Oz (smallest size, sacrifices speed)
│
└───────────────────────────────────────────────────────► Speed
Rules of thumb:
- Development/debug: -Og
- Release (speed priority): -O2 or -O3
- Release (space priority): -Os or -Oz
- Extremely constrained: -Oz + LTO + --gc-sections
Advanced Compiler Options
Basic optimization levels are just the beginning. Here are advanced size optimization techniques.
1. Dead Code Elimination
Use -ffunction-sections, -fdata-sections with linker's --gc-sections:
# Compile: place each function/data in its own section
$ gcc -ffunction-sections -fdata-sections -Os -c main.c -o main.o
# Link: remove unused sections
$ gcc -Wl,--gc-sections main.o -o firmware.elf
Effect example:
Options .text size
──────────────────────────────────────────────
Without gc-sections 24,576 bytes
With gc-sections 18,432 bytes
Savings 6,144 bytes (25%)
This is especially effective when using large libraries (like newlib)—you might only use memcpy and strlen, but without gc-sections, malloc, printf, and more get linked in.
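To see what the linker has to work with, consider a source file with one used and one unused function. This is an illustrative sketch with hypothetical function names: with -ffunction-sections, GCC places each function in its own section named after it, so the unreferenced one can be discarded by --gc-sections.

```c
#include <stddef.h>
#include <stdint.h>

/* Called from the application: its section (.text.checksum8) is kept. */
uint8_t checksum8(const uint8_t *data, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/* If nothing references this function, its section
   (.text.factory_self_test) is a candidate for removal at link time. */
uint8_t factory_self_test(void) {
    static const uint8_t pattern[4] = {0x12, 0x34, 0x56, 0x78};
    return checksum8(pattern, sizeof pattern);
}
```

Without -ffunction-sections both functions share one .text section per object file, and the linker can only keep or drop that section as a whole.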
2. Link-Time Optimization (LTO)
LTO allows the compiler to perform cross-compile-unit optimization at link time:
# Both compile and link need -flto
$ gcc -flto -Os -c main.c -o main.o
$ gcc -flto -Os -c uart.c -o uart.o
$ gcc -flto -Os main.o uart.o -o firmware.elf
LTO advantages:
- Cross-file inlining decisions
- More precise dead code elimination
- Better constant propagation
LTO costs:
- Significantly increased compile time (possibly 2-5x)
- Debug information may be harder to trace
- Some linker scripts need adjustment
Effect example:
Options .text size Compile time
───────────────────────────────────────────────
-Os 16,096 1.2s
-Os -flto 14,336 3.8s
Savings 1,760 (11%) +217%
3. Inlining Control
Inlining is the most critical speed vs size trade-off:
// Force inline (may increase code size)
static inline __attribute__((always_inline))
void critical_function(void) { ... }
// Prevent inline (ensure minimum code size)
__attribute__((noinline))
void large_function(void) { ... }
When to force inline:
- Very small helper functions (1-3 lines)
- Functions on hot paths
- When call overhead exceeds the function itself
When to prevent inline:
- Large functions (over 20-30 lines)
- Functions with multiple call sites
- Error handling paths
Standard Library Selection
The standard C library is one of the biggest "hidden costs" in embedded systems.
newlib vs newlib-nano
Library printf support .text increase
───────────────────────────────────────────────────────
newlib Full ~50-80 KB
newlib-nano Basic (no float) ~8-15 KB
Custom minimal Integer only ~1-2 KB
Using newlib-nano:
# GCC's --specs option
$ arm-none-eabi-gcc --specs=nano.specs -Os main.c -o firmware.elf
Verifying the effect:
$ size firmware_newlib.elf
text data bss dec hex filename
52480 256 4096 56832 de00 firmware_newlib.elf
$ size firmware_nano.elf
text data bss dec hex filename
12288 256 4096 16640 4100 firmware_nano.elf
Difference: 40 KB. Huge on 64 KB or 128 KB flash systems.
Avoiding printf
If you only need to output integers or simple strings, consider a lightweight custom version:
// Full printf: ~15-50 KB
printf("Value: %d\n", value);
// Custom lightweight version: ~200 bytes
void print_int(const char* prefix, int value) {
uart_puts(prefix);
char buf[12];
itoa(value, buf, 10);
uart_puts(buf);
uart_puts("\n");
}
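Note that itoa is not part of standard C, so not every C library provides it. A self-contained conversion is short enough to carry along. The sketch below is one possible version; the function name is an assumption, and the result would be handed to a UART output routine such as the uart_puts used above.

```c
#include <stdint.h>

/* Writes the decimal form of value into buf (at least 12 bytes:
   sign + 10 digits + terminator) and returns buf. */
static char *format_int32(int32_t value, char buf[12]) {
    char tmp[10];
    int n = 0;
    /* Work on the magnitude as unsigned so INT32_MIN is handled correctly. */
    uint32_t mag = (value < 0) ? 0u - (uint32_t)value : (uint32_t)value;
    do {
        tmp[n++] = (char)('0' + mag % 10u);   /* digits come out in reverse */
        mag /= 10u;
    } while (mag != 0u);
    char *p = buf;
    if (value < 0) {
        *p++ = '-';
    }
    while (n > 0) {
        *p++ = tmp[--n];
    }
    *p = '\0';
    return buf;
}
```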
Size Impact of Specific Features
Certain C/C++ features have significant impact on code size:
Floating-Point Operations
Feature Extra size on MCU without FPU
────────────────────────────────────────────────────────────────
float arithmetic +10-15 KB (software emulation)
double arithmetic +25-50 KB (software emulation)
printf("%f") +15-25 KB (formatting)
math.h (sin, cos, etc.) +10-30 KB (depends on usage)
Best practices:
- Use fixed-point arithmetic whenever possible
- If float is necessary, avoid double
- Never use printf("%f") on MCUs without FPU
C++ Features
Feature Typical size impact
────────────────────────────────────────────────────────────────
Virtual functions +8-16 bytes vtable per class
RTTI (typeid, dynamic_cast) +2-10 KB
Exception handling +10-50 KB
STL containers Depends on usage, can be tens of KB
Recommended embedded C++ compile options:
$ g++ -fno-rtti -fno-exceptions -Os ...
Experiment: Complete Optimization Workflow
Let's demonstrate the optimization workflow with a complete example:
Initial state:
$ riscv64-unknown-elf-gcc -O2 main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
45056 512 8192 53760 d200 firmware.elf
Flash usage (.text + .data): 45,568 bytes, about 45.6 KB; target: 32 KB.
Step 1: Switch to -Os
$ riscv64-unknown-elf-gcc -Os main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
36864 512 8192 45568 b200 firmware.elf
Saved: 8.2 KB (-18%). Still about 4.6 KB over (37,376 bytes of .text + .data against a 32,768-byte limit).
Step 2: Add gc-sections
$ riscv64-unknown-elf-gcc -Os -ffunction-sections -fdata-sections \
-Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
30720 512 8192 39424 9a00 firmware.elf
Saved: 6.1 KB (-17%). Target achieved! But let's continue.
Step 3: Add LTO
$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
-Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
28672 512 8192 37376 9200 firmware.elf
Saved: 2 KB (-7%).
Step 4: Use newlib-nano
$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
-Wl,--gc-sections --specs=nano.specs main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
18432 512 8192 27136 6a00 firmware.elf
Saved: 10.2 KB (-36%).
Optimization summary:
Phase .text size Cumulative savings
───────────────────────────────────────────────────────────
Original (-O2) 45,056 baseline
Step 1: -Os 36,864 -18%
Step 2: gc-sections 30,720 -32%
Step 3: LTO 28,672 -36%
Step 4: newlib-nano 18,432 -59%
From 45 KB to 18 KB: 59% of flash space saved!
Common Pitfalls
1. Over-Inlining
// This function gets inlined 20 times by -O3 = 20x original size
inline void update_display(int x, int y, int color) {
// 50 lines of drawing logic
...
}
Solution: Use -Os or manually mark __attribute__((noinline)).
2. Forgetting gc-sections
# Wrong: added -ffunction-sections at compile, forgot --gc-sections at link
$ gcc -ffunction-sections -fdata-sections -c file.c
$ gcc file.o -o output # Forgot -Wl,--gc-sections!
3. Debug Symbols Impact
Debug symbols don't increase Flash usage, but make the ELF file larger:
$ riscv64-unknown-elf-gcc -g -Os main.c -o debug.elf
$ riscv64-unknown-elf-gcc -Os main.c -o release.elf
$ ls -lh *.elf
-rwxr-xr-x 1 user user 245K debug.elf
-rwxr-xr-x 1 user user 35K release.elf
# But .text size is the same:
$ size debug.elf release.elf
text data bss dec hex filename
18432 512 8192 27136 6a00 debug.elf
18432 512 8192 27136 6a00 release.elf
Summary
- Optimization levels: -Os is usually the best choice for embedded, 15-25% smaller than -O2
- Dead code elimination: -ffunction-sections -fdata-sections -Wl,--gc-sections saves 10-30%
- LTO: Cross-compile-unit optimization, additional 5-15% savings (but increases compile time)
- Standard library: newlib-nano can save 30-50 KB compared to newlib
- Avoid: Floating-point operations, printf("%f"), C++ exceptions/RTTI
- Optimization order: First analyze with tools (previous chapter), then choose appropriate compiler options (this chapter)
Chapter 21: Stack Analysis and Estimation
Part VI: Embedded Constraints
"Stack overflow is the most insidious bug in embedded systems—it corrupts silently and strikes randomly." — Miro Samek
The Unpredictable Crash
This is a classic embedded nightmare story.
The product had shipped. It was running smoothly. Customers were happy. Then suddenly, one customer reported: "The system randomly reboots. No pattern whatsoever."
The team checked all logs—no error messages. Checked power—stable. Checked temperature—normal range. The problem was like a ghost, impossible to reproduce.
Three weeks later, the issue was narrowed down to a specific usage scenario: when a user rapidly pressed three buttons in sequence, the system had a 30% chance of rebooting.
An engineer finally captured a crash dump and found the program counter (PC) pointing to a completely meaningless address.
The answer: Stack overflow.
The button interrupt handler had nested calls three levels deep into decoding logic, combined with a 512-byte local variable buffer. This combination only occurred under a specific sequence of operations that was never tested during development.
Stack overflow doesn't throw an exception. It doesn't leave an error message. It silently overwrites other memory regions, causing the system to crash at unpredictable times.
This chapter teaches you how to analyze, estimate, and monitor stack usage—before the problem becomes a field failure.
Stack Basics
What is the Stack?
The stack is a contiguous memory region used to store:
Stack contents:
────────────────────────────────────────
1. Function return addresses
2. Local variables
3. Function parameters (some calling conventions)
4. Saved register values (caller/callee saved)
5. Interrupt handler context
Stack Growth Direction
Most architectures have stacks that grow downward:
High addr ┌───────────────────────┐
│ Stack start │ ← Initial SP value
├───────────────────────┤
│ main() frame │
├───────────────────────┤
│ func_a() frame │
├───────────────────────┤
│ func_b() frame │ ← Current SP
├───────────────────────┤
│ │
│ Unused stack space │
│ │
├───────────────────────┤
│ Stack guard/bottom │
Low addr └───────────────────────┘
Why Stack Size Is Hard to Estimate
Static footprint problem:
─────────────────────────────────────────────────────────
.text, .data, .bss → Fully determined at compile time
Stack → Only know actual usage at runtime
Difficulties:
1. Call depth varies dynamically
2. Recursion depth
3. Function pointer calls
4. Interrupt nesting
5. Conditional local variables (different paths use different sizes)
Static Analysis Tools
GCC's -fstack-usage
GCC can generate stack usage reports for each function:
$ riscv64-unknown-elf-gcc -fstack-usage -Os -c main.c
$ cat main.su
main.c:10:5:main 64 static
main.c:25:6:process_data 256 static
main.c:50:6:handle_interrupt 128 static
Output format:
file:line:column:function_name stack_usage(bytes) type
Type descriptions:
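- static: Fixed frame size, fully determined at compile time
- dynamic: The function also adjusts the stack at run time (VLA, alloca); the reported size covers only the fixed part
- bounded: Appears as "dynamic,bounded"; the run-time adjustments have a compile-time limit, and the reported size is an upper bound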
Pro tip: Find the largest stack users
$ cat *.su | sort -t$'\t' -k2 -nr | head -10
drivers/usb.c:120:6:usb_parse_descriptor 1024 static
app/json.c:45:6:parse_json_object 512 static
app/protocol.c:80:6:decode_frame 384 static
...
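For anything beyond a one-off sort, the .su lines can be parsed into fields. This is a minimal sketch, assuming the tab-separated layout shown above and plain C function names (no "::" in the name); the struct and helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One line of a .su file (illustrative struct). */
struct su_entry {
    char location[128];   /* file:line:column:function */
    unsigned bytes;       /* reported stack usage */
    char qualifier[24];   /* static, dynamic, or dynamic,bounded */
};

/* Returns 1 if the line has all three tab-separated fields, else 0. */
static int parse_su_line(const char *line, struct su_entry *out) {
    return sscanf(line, "%127[^\t]\t%u\t%23s",
                  out->location, &out->bytes, out->qualifier) == 3;
}

/* The function name is whatever follows the last ':' of the first field. */
static const char *su_function_name(const struct su_entry *e) {
    const char *colon = strrchr(e->location, ':');
    return colon ? colon + 1 : e->location;
}
```

Entries whose qualifier is exactly "dynamic" deserve a manual look, since their reported size is not an upper bound.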
Checkstack Script
The Linux kernel provides a checkstack script that analyzes stack usage directly from assembly:
# Generate assembly and analyze
$ riscv64-unknown-elf-objdump -d firmware.elf | ./checkstack.pl riscv
0x80001234 usb_parse_descriptor [firmware.elf]: 1024
0x80002468 parse_json_object [firmware.elf]: 512
0x80003690 decode_frame [firmware.elf]: 384
Limitations
Static analysis has its limits:
Static analysis can handle:
✅ Fixed-size local variables
✅ Direct function call chains
✅ Call depth known at compile time
Static analysis cannot handle:
❌ Recursion depth (unless explicitly bounded)
❌ Function pointer calls
❌ Interrupt nesting levels
❌ VLA (Variable Length Array)
❌ alloca()
Dynamic Measurement Methods
When static analysis is insufficient, we need dynamic measurement.
Stack Painting
This is the classic stack usage measurement method:
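#include <stddef.h>
#include <stdint.h>

#define STACK_PATTERN 0xDEADBEEFu

// Defined in the linker script: _stack_start is the lowest stack address,
// _stack_end is the highest (the initial SP value)
extern uint32_t _stack_start;
extern uint32_t _stack_end;

// Fill the unused part of the stack with the pattern at startup.
// Painting must stop below the live frames: writing all the way up to
// _stack_end would overwrite the caller's saved registers and return addresses.
void stack_paint(void) {
    uint32_t *p = &_stack_start;
    // Approximate the current SP and keep a 32-word margin below it
    uint32_t *limit = (uint32_t *)__builtin_frame_address(0) - 32;
    while (p < limit) {
        *p++ = STACK_PATTERN;
    }
}

// Measure stack high water mark (bytes used) at any time
size_t stack_get_high_water_mark(void) {
    const uint32_t *p = &_stack_start;
    // Scan upward from the lowest address until the pattern has been overwritten
    while (p < &_stack_end && *p == STACK_PATTERN) {
        p++;
    }
    return (size_t)((uintptr_t)&_stack_end - (uintptr_t)p);
}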
Usage:
int main(void) {
stack_paint(); // Paint at startup
// ... run application ...
// Check periodically
size_t used = stack_get_high_water_mark();
printf("Stack high water mark: %zu bytes\n", used);
}
Important: This method only measures "maximum ever used"—it cannot guarantee future usage won't exceed this value.
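The same logic can be exercised on a host machine by treating an ordinary array as the stack region. This is a simplified sketch for experimentation, not the target code: the array stands in for the memory between the linker symbols, and the function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define PAINT 0xDEADBEEFu

/* Fill a stack region (lowest address first) with the paint pattern. */
static void paint_region(uint32_t *low, size_t words) {
    for (size_t i = 0; i < words; i++) {
        low[i] = PAINT;
    }
}

/* Bytes used, assuming a downward-growing stack: scan upward from the
   lowest address until the first word that no longer holds the pattern. */
static size_t high_water_bytes(const uint32_t *low, size_t words) {
    size_t untouched = 0;
    while (untouched < words && low[untouched] == PAINT) {
        untouched++;
    }
    return (words - untouched) * sizeof(uint32_t);
}
```

Overwriting the top ten words of a 64-word region yields 40 bytes; a single deeper write moves the mark down to that word, which is exactly the "maximum ever used" behavior described above.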
Runtime Stack Monitoring
Add stack overflow detection in debug builds:
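#define STACK_GUARD_SIZE 64  // bytes of headroom to keep free (example value)

// Check stack at each function entry
void __attribute__((no_instrument_function))
stack_check(void) {
    extern uint32_t _stack_start;
    // uintptr_t matches the pointer width (32-bit on RV32, 64-bit on RV64)
    register uintptr_t sp asm("sp");
    if (sp < (uintptr_t)&_stack_start + STACK_GUARD_SIZE) {
        // Stack overflow detected! panic() stands for your fatal-error handler
        panic("Stack overflow!");
    }
}
// GCC's -finstrument-functions can auto-insert
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *this_fn, void *call_site) {
    stack_check();
}
// Instrumented code also calls the exit hook, so define it even if empty
void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *this_fn, void *call_site) {
}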
GCC compile option:
# Auto-insert hooks at function entry/exit
$ gcc -finstrument-functions -Os main.c -o firmware.elf
Note: This adds runtime overhead and code size—use only in debug builds.
RTOS Task Stack Sizing
In RTOS environments, each task has its own stack, making the problem more complex.
FreeRTOS Stack Configuration
// FreeRTOS task creation
#define TASK_STACK_SIZE 512 // words (2048 bytes on 32-bit)
xTaskCreate(
task_function,
"TaskName",
TASK_STACK_SIZE, // ← How to determine this?
NULL,
tskIDLE_PRIORITY + 1,
&task_handle
);
Methods for Estimating Task Stack Size
Task Stack requirement =
Context save size (RTOS framework)
+ Task function's own stack usage
+ Stack usage of all possible called functions
+ Interrupt handling (if sharing stack)
+ Safety margin (typically 25-50%)
FreeRTOS Context Save Size (ARM Cortex-M example):
Architecture Context Size (bytes)
──────────────────────────────────────────
ARM Cortex-M0 64
ARM Cortex-M3/M4 64 (no FPU) / 200 (with FPU)
ARM Cortex-M7 64 (no FPU) / 232 (with FPU)
RISC-V RV32 64-128 (depends on ABI)
FreeRTOS Stack Monitoring
FreeRTOS provides built-in stack high water mark functionality:
// Enable stack overflow detection (FreeRTOSConfig.h)
#define configCHECK_FOR_STACK_OVERFLOW 2
// Required for uxTaskGetStackHighWaterMark()
#define INCLUDE_uxTaskGetStackHighWaterMark 1
// Get the task's minimum remaining stack since it started (in words)
UBaseType_t stack_remaining = uxTaskGetStackHighWaterMark(task_handle);
printf("Task stack remaining: %u words\n", (unsigned)stack_remaining);
// Stack overflow hook (configCHECK_FOR_STACK_OVERFLOW >= 1)
void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName) {
    // Handle stack overflow
    panic("Stack overflow in task: %s\n", pcTaskName);
}
Practical Recommendations
Task Stack Sizing Best Practices:
──────────────────────────────────────────────────────────
1. Initial configuration: Start with generous space (e.g., 2 KB)
2. Measure actual usage:
- Run system, trigger all possible execution paths
- Use uxTaskGetStackHighWaterMark() to check
3. Adjust configuration:
- Actual usage + 25-50% safety margin
- Example: Measured 800 bytes → Configure 1024-1200 bytes
4. Continuous monitoring:
- Keep stack checking in production builds
- Periodically log stack usage statistics
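The sizing rule in step 3 is simple arithmetic, but the unit conversions are easy to get wrong because FreeRTOS reports and configures stack depth in words. This is a small sketch, assuming 4-byte stack words; all function names are illustrative.

```c
#include <stdint.h>

/* Peak usage in bytes, from the configured depth and the remaining
   high water mark (both in 4-byte words). */
static uint32_t stack_used_bytes(uint32_t configured_words, uint32_t remaining_words) {
    return (configured_words - remaining_words) * 4u;
}

/* Measured peak plus a percentage margin, rounded up to whole words. */
static uint32_t stack_bytes_with_margin(uint32_t measured_bytes, uint32_t margin_pct) {
    uint32_t bytes = measured_bytes + (measured_bytes * margin_pct + 99u) / 100u;
    return (bytes + 3u) & ~3u;
}

/* Convert bytes to the word count expected by xTaskCreate(). */
static uint32_t stack_words(uint32_t bytes) {
    return (bytes + 3u) / 4u;
}
```

For the 800-byte measurement in the example, margins of 25% and 50% give 1,000 and 1,200 bytes, or 250 and 300 words.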
Worst-Case Stack Depth (WCSD) Analysis
In safety-critical systems (automotive, medical, aerospace), you need to calculate Worst-Case Stack Depth (WCSD).
Call Graph Analysis
Build call graph, calculate deepest path:
main() [64 bytes]
├── init() [32 bytes]
│ └── hal_init() [128 bytes]
└── loop() [48 bytes]
├── read_sensor() [96 bytes]
│ └── spi_transfer() [64 bytes]
└── process() [256 bytes]
└── filter() [128 bytes]
Path 1: main → init → hal_init
= 64 + 32 + 128 = 224 bytes
Path 2: main → loop → read_sensor → spi_transfer
= 64 + 48 + 96 + 64 = 272 bytes
Path 3: main → loop → process → filter
= 64 + 48 + 256 + 128 = 496 bytes
WCSD = max(224, 272, 496) = 496 bytes
Interrupt Impact
Must consider interrupts when calculating WCSD:
Application stack usage: 496 bytes
+ Interrupt handler: 128 bytes
+ Nested interrupt (if any): 128 bytes
────────────────────────────────────────────
Total WCSD: 752 bytes
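The deepest-path calculation can be written as a short depth-first traversal. The sketch below is meant as a host-side helper, assuming a call graph without recursion or function pointers; the struct and function names are hypothetical, and the frame sizes would come from -fstack-usage output.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CALLEES 4

/* One node of an acyclic call graph (illustrative struct). */
struct func_node {
    uint32_t frame_bytes;
    const struct func_node *callees[MAX_CALLEES];  /* NULL-terminated if fewer */
};

/* Worst-case stack depth: own frame plus the deepest callee chain. */
static uint32_t worst_case_depth(const struct func_node *f) {
    uint32_t deepest = 0;
    for (size_t i = 0; i < MAX_CALLEES && f->callees[i] != NULL; i++) {
        uint32_t d = worst_case_depth(f->callees[i]);
        if (d > deepest) {
            deepest = d;
        }
    }
    return f->frame_bytes + deepest;
}
```

The traversal itself is recursive, which is acceptable on a development machine but is exactly what the analyzed firmware should avoid. Interrupt frames are added on top of the result, as in the calculation above.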
Tool Support
WCSD analysis tools:
──────────────────────────────────────────────────────────
Tool License Features
Polyspace Commercial Formal verification
PC-lint Plus Commercial Static analysis
StackAnalyzer Commercial Professional WCSD analysis
(AbsInt)
GCC -fstack-usage Open source Basic, requires manual call chain calc
Common Pitfalls
1. Large Local Variables
void process_packet(void) {
char buffer[4096]; // 4 KB local variable!
// ...
}
Solution: Use static buffer or dynamic allocation
// Use static (moves to .bss, doesn't use stack)
static char buffer[4096];
// Or use heap (if system allows)
char *buffer = malloc(4096);
2. Recursive Functions
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n - 1); // Depth = n
}
// factorial(1000) needs ~64 KB stack!
Solution: Convert to loop
int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; i++) {
result *= i;
}
return result;
}
3. Complex Logic in Interrupts
void __attribute__((interrupt)) timer_isr(void) {
char log_buffer[512]; // Large buffer in ISR
sprintf(log_buffer, "...");
process_complex_data(); // Call complex function
}
Solution: ISRs should be as short as possible; defer complex logic to main loop or task
volatile bool timer_flag = false;
void __attribute__((interrupt)) timer_isr(void) {
timer_flag = true; // Just set flag
}
void main_loop(void) {
if (timer_flag) {
timer_flag = false;
process_complex_data(); // Handle in normal context
}
}
4. VLA and alloca
void process(size_t n) {
int data[n]; // VLA - stack usage depends on runtime value
// ...
}
void another(size_t size) {
void *buf = alloca(size); // Same problem
// ...
}
Solution: Avoid VLA and alloca; use static buffers or heap
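One way to apply this is to pick an upper bound at design time and reject larger requests instead of letting the stack grow. This is an illustrative sketch; MAX_SAMPLES, the function name, and the doubling step are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 64   /* hypothetical bound chosen at design time */

/* Doubles each sample into a fixed-size working buffer and sums the result.
   Returns 0 on success, -1 if n exceeds the bound (nothing is read then). */
static int scale_samples(const int16_t *in, size_t n, int32_t *sum_out) {
    static int32_t work[MAX_SAMPLES];   /* lives in .bss, not on the stack */
    if (n > MAX_SAMPLES) {
        return -1;
    }
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        work[i] = in[i] * 2;
        sum += work[i];
    }
    *sum_out = sum;
    return 0;
}
```

The trade-off is that a static buffer makes the function non-reentrant, so it should not be shared between tasks or called from an ISR without protection.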
Summary
- Stack overflow is the hardest bug to detect in embedded systems—silent corruption, random crashes
- Static analysis:
- GCC -fstack-usage: Generates per-function stack usage report
- Limitation: Cannot handle recursion, function pointers, interrupt nesting
- Dynamic measurement:
- Stack painting: Fill with pattern, check high water mark later
- Runtime monitoring: Check stack pointer at function entry
- RTOS environments:
- Each task has independent stack, estimate separately
- FreeRTOS provides uxTaskGetStackHighWaterMark()
- WCSD analysis: Safety-critical systems need worst-case stack depth calculation
- Best practices:
- Avoid large local variables
- Avoid recursion
- Keep ISRs short
- Configure safety margin (25-50%)
Chapter 22: RTOS Footprint Case Study
Part VI: Embedded Constraints
"The best RTOS is the one that fits your constraints—memory, CPU, and development time." — Colin Walls
The Real Challenge of Choosing an RTOS
"Our MCU only has 64 KB flash and 8 KB RAM. Which RTOS should we choose?"
This is one of the most common questions embedded developers ask. The internet is full of comparison articles, but few answer this question in a data-driven way.
This chapter skips the marketing speak and feature lists. We measure with tools. We speak with data.
We'll analyze the footprint of three mainstream RTOSes:
- FreeRTOS: The most popular open-source RTOS
- Zephyr: A modern IoT RTOS
- RT-Thread: A highly popular RTOS from China
Measurement Methodology
Before comparing, we need to establish fair measurement conditions.
Test Platform
Hardware: QEMU emulator (to avoid hardware differences)
Target: ARM Cortex-M4 (no FPU)
Compiler: GCC 13.2 (arm-none-eabi)
Optimization: -Os -flto -ffunction-sections -fdata-sections -Wl,--gc-sections
Test Scenarios
We define three test scenarios:
Scenario 1: Minimal
- Kernel scheduler only
- 1 task
- No other features
Scenario 2: Basic
- Kernel + semaphore + queue
- 3 tasks
- Timer service
Scenario 3: Typical
- Scenario 2 + shell/console
- 5 tasks
- Dynamic memory allocation
Measurement Method
# For each RTOS and scenario:
$ arm-none-eabi-size firmware.elf
$ arm-none-eabi-nm -S --size-sort firmware.elf > symbols.txt
$ bloaty firmware.elf -d compileunits > modules.txt
FreeRTOS Analysis
Minimal Configuration
// FreeRTOS minimal configuration
#define configUSE_PREEMPTION 1
#define configUSE_IDLE_HOOK 0
#define configUSE_TICK_HOOK 0
#define configMINIMAL_STACK_SIZE 64
#define configTOTAL_HEAP_SIZE 1024
#define configMAX_PRIORITIES 4
#define configUSE_MUTEXES 0
#define configUSE_COUNTING_SEMAPHORES 0
#define configUSE_TIMERS 0
#define configUSE_QUEUE_SETS 0
Measurement results:
$ arm-none-eabi-size freertos_minimal.elf
text data bss dec hex filename
3584 120 1152 4856 12f8 freertos_minimal.elf
Section breakdown:
.text = 3,584 bytes (kernel code)
.data = 120 bytes (initialized data)
.bss = 1,152 bytes (heap + TCB)
Main components:
$ arm-none-eabi-nm -S --size-sort -r freertos_minimal.elf | head -10
20000100 00000400 B ucHeap # 1024 bytes heap
08000a40 00000280 T xTaskCreate
08000cc0 00000200 T vTaskSwitchContext
08000ec0 00000180 T xTaskIncrementTick
08001040 00000140 T prvIdleTask
...
Feature vs Footprint Table
FreeRTOS feature impact on footprint:
Feature .text increase .bss increase
────────────────────────────────────────────────────────────
Basic kernel (1 task) 3,584 1,152
+ Semaphores +320 +0
+ Mutexes +480 +0
+ Queues +640 +0
+ Timers +1,024 +256
+ Task notifications +256 +0
+ Event groups +512 +64
────────────────────────────────────────────────────────────
Typical configuration ~6,500 ~1,500
Zephyr Analysis
Zephyr uses Kconfig for fine-grained configuration.
Minimal Configuration
# prj.conf for Zephyr minimal
CONFIG_KERNEL=y
CONFIG_MAIN_THREAD_PRIORITY=0
CONFIG_MAIN_STACK_SIZE=512
CONFIG_IDLE_STACK_SIZE=256
CONFIG_HEAP_MEM_POOL_SIZE=0
CONFIG_MINIMAL_LIBC=y
# Disable unneeded features
CONFIG_PRINTK=n
CONFIG_LOG=n
CONFIG_SHELL=n
Measurement results:
$ arm-none-eabi-size zephyr_minimal.elf
text data bss dec hex filename
5120 256 1280 6656 1a00 zephyr_minimal.elf
Analysis: Zephyr minimal is about 1.5 KB larger than FreeRTOS (.text). This is because Zephyr has a more complete abstraction layer and device model.
Feature vs Footprint Table
Zephyr feature impact on footprint:
Feature .text increase .bss increase
────────────────────────────────────────────────────────────
Basic kernel 5,120 1,280
+ Semaphores +128 +0
+ Mutexes +256 +0
+ Queues (k_msgq) +384 +0
+ Timers +512 +128
+ Shell +12,000+ +2,000+
+ Logging +4,000+ +1,000+
+ Networking +50,000+ +10,000+
────────────────────────────────────────────────────────────
Typical (no shell) ~7,000 ~1,500
Typical (with shell) ~20,000 ~4,000
RT-Thread Analysis
RT-Thread has a nano version specifically targeting minimal footprint.
Minimal Configuration (RT-Thread Nano)
// rtconfig.h for RT-Thread Nano
#define RT_THREAD_PRIORITY_MAX 8
#define RT_TICK_PER_SECOND 1000
#define RT_USING_OVERFLOW_CHECK
#define RT_USING_HOOK
#define RT_USING_IDLE_HOOK
// Disable most features
// #define RT_USING_SEMAPHORE
// #define RT_USING_MUTEX
// #define RT_USING_MAILBOX
// #define RT_USING_MESSAGEQUEUE
Measurement results:
$ arm-none-eabi-size rtthread_minimal.elf
text data bss dec hex filename
2816 96 896 3808 ee0 rtthread_minimal.elf
Analysis: RT-Thread Nano is the smallest of the three—only 2.8 KB .text.
Feature vs Footprint Table
RT-Thread feature impact on footprint:
Feature .text increase .bss increase
────────────────────────────────────────────────────────────
Basic kernel (Nano) 2,816 896
+ Semaphores +192 +0
+ Mutexes +256 +0
+ Mailbox +320 +0
+ Message queue +384 +0
+ Timer +512 +128
+ FinSH shell +10,000+ +2,000+
+ Device framework +3,000+ +500+
────────────────────────────────────────────────────────────
Typical (no shell) ~5,000 ~1,200
Typical (with shell) ~15,000 ~3,500
Comparison Summary
Minimal Configuration Comparison
RTOS .text .data .bss Total
────────────────────────────────────────────────────────────
RT-Thread Nano 2,816 96 896 3,808
FreeRTOS 3,584 120 1,152 4,856
Zephyr 5,120 256 1,280 6,656
Visualization:
.text size (bytes):
RT-Thread Nano ████████████████ 2,816
FreeRTOS ████████████████████ 3,584
Zephyr ████████████████████████████ 5,120
0 1K 2K 3K 4K 5K 6K
Typical Configuration Comparison
RTOS .text .data .bss Total
────────────────────────────────────────────────────────────
RT-Thread 5,000 128 1,200 6,328
FreeRTOS 6,500 150 1,500 8,150
Zephyr 7,000 300 1,500 8,800
Feature Richness vs Footprint Trade-off
Footprint
▲
│
Zephyr ● │ ← Most features, largest footprint
│
FreeRTOS ● │ ← Balanced
│
RT-Thread Nano ● ← Minimal footprint, fewer features
│
────────────────┼──────────────► Features
│
Selection Recommendations
When to Choose FreeRTOS
✅ Recommended when:
- Need mature, stable, widely-used solution
- Team already has FreeRTOS experience
- Need AWS IoT integration
- Memory constraints are moderate (32 KB+ flash)
❌ Not recommended when:
- Extremely tight memory (< 16 KB flash)
- Need advanced networking stack
- Need comprehensive device driver framework
When to Choose Zephyr
✅ Recommended when:
- Building IoT products with networking
- Need comprehensive device driver support
- Want modern build system (CMake + Kconfig)
- Have sufficient memory (64 KB+ flash)
❌ Not recommended when:
- Extremely tight memory constraints
- Simple bare-metal would suffice
- Team unfamiliar with Kconfig/devicetree
When to Choose RT-Thread
✅ Recommended when:
- Extremely tight memory (< 16 KB flash)
- Need minimal kernel (RT-Thread Nano)
- Chinese documentation/community preferred
- Need rich middleware (GUI, filesystem, etc.)
❌ Not recommended when:
- Need extensive English documentation
- Need AWS/Azure cloud integration
- Team unfamiliar with RT-Thread ecosystem
Optimization Techniques
Regardless of which RTOS you choose, these techniques help reduce footprint:
1. Disable Unused Features
// FreeRTOS example
#define configUSE_MUTEXES 0 // If not using mutexes
#define configUSE_RECURSIVE_MUTEXES 0
#define configUSE_COUNTING_SEMAPHORES 0
#define configUSE_QUEUE_SETS 0
#define configUSE_TASK_NOTIFICATIONS 0
2. Reduce Priority Levels
// Fewer priorities = smaller scheduler data structures
#define configMAX_PRIORITIES 4 // Instead of 32
3. Minimize Stack Sizes
// Measure actual usage, then add 25% margin
#define configMINIMAL_STACK_SIZE 64 // words
#define configTIMER_TASK_STACK_DEPTH 128
4. Use Static Allocation
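// FreeRTOS static allocation (no heap overhead)
// FreeRTOSConfig.h:
#define configSUPPORT_STATIC_ALLOCATION 1
#define configSUPPORT_DYNAMIC_ALLOCATION 0
// Application code: the TCB and stack must outlive the task, so give them static storage
static StaticTask_t xTaskBuffer;
static StackType_t xStack[128]; // depth in words, not bytes
xTaskCreateStatic(task_func, "Task", 128, NULL, 1, xStack, &xTaskBuffer);
// The application must also supply vApplicationGetIdleTaskMemory() for the idle task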
5. Compiler Optimization
# Always use these for release builds
$ arm-none-eabi-gcc -Os -flto \
-ffunction-sections -fdata-sections \
-Wl,--gc-sections \
--specs=nano.specs \
...
Case Study: Fitting into 32 KB Flash
Requirement: IoT sensor node with:
- 3 tasks (sensor, communication, LED)
- UART driver
- Simple protocol parsing
- 32 KB flash, 8 KB RAM
Initial attempt with FreeRTOS:
$ arm-none-eabi-size firmware.elf
text data bss dec hex filename
38912 512 4096 43520 aa00 firmware.elf
Problem: 38 KB > 32 KB limit!
Optimization steps:
Step 1: Disable unused features
configUSE_TIMERS = 0
configUSE_MUTEXES = 0
Result: 35,840 bytes (-3 KB)
Step 2: Use newlib-nano
--specs=nano.specs
Result: 28,672 bytes (-7 KB)
Step 3: Replace printf with custom
Custom uart_print_int()
Result: 26,624 bytes (-2 KB)
Step 4: Enable LTO
-flto
Result: 24,576 bytes (-2 KB)
Step 5: Static allocation
configSUPPORT_DYNAMIC_ALLOCATION = 0
Result: 23,552 bytes (-1 KB)
Final: 23,552 bytes (23 KB) < 32 KB ✓
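The step-by-step arithmetic above can be checked with a short script. This is a minimal sketch that tracks only the `text` column, as the case study does; the helper name `report` is hypothetical. Strictly speaking, initialized `.data` also occupies flash, so a real budget check would add it.

```python
# Sizes in bytes, copied from the case study steps.
FLASH_LIMIT = 32 * 1024
INITIAL_TEXT = 38_912
STEPS = [
    ("Disable unused features", 35_840),
    ("Use newlib-nano", 28_672),
    ("Replace printf", 26_624),
    ("Enable LTO", 24_576),
    ("Static allocation", 23_552),
]

def report(initial, steps, limit):
    """Return (name, size, bytes_saved, fits_in_flash) for each step."""
    rows = []
    previous = initial
    for name, size in steps:
        rows.append((name, size, previous - size, size <= limit))
        previous = size
    return rows

rows = report(INITIAL_TEXT, STEPS, FLASH_LIMIT)
for name, size, saved, fits in rows:
    print(f"{name:<26} {size:>6} B  saved {saved:>5} B  fits={fits}")
print(f"Final: {rows[-1][1] / 1024:.1f} KiB of {FLASH_LIMIT // 1024} KiB")
```

Keeping a table like this in the repository makes it easy to see which step actually bought the headroom: here the build first fits after the newlib-nano switch.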
Summary
- Measurement methodology: Fair comparison requires identical conditions (compiler, optimization, platform)
- Minimal footprint ranking (.text): RT-Thread Nano (2,816 bytes) < FreeRTOS (3,584 bytes) < Zephyr (5,120 bytes)
- Feature trade-off: More features = larger footprint; choose based on actual needs
- Selection criteria:
- Extremely constrained: RT-Thread Nano
- Balanced: FreeRTOS
- Feature-rich IoT: Zephyr
- Optimization techniques:
- Disable unused features
- Reduce priority levels and stack sizes
- Use static allocation
- Compiler optimization (-Os, LTO, gc-sections)
- Use newlib-nano
- Key principle: Measure, compare, then decide—don't rely on marketing claims
Chapter 23: Evolution of Performance Metrics
Part VII: AI/HPC
"The metrics you optimize for determine the systems you build." — Anonymous
The Presentation That Fell Flat
Marcus had spent three weeks benchmarking the company's new AI accelerator. His presentation to the executive team was packed with data: cache hit rates, branch prediction accuracy, instructions per cycle, memory bandwidth utilization. He was proud of the thoroughness.
The VP of Engineering interrupted five minutes in. "Marcus, this is great detail, but I need one number. How does this compare to the A100 we're currently using?"
Marcus pulled up his IPC comparison chart. "As you can see, our chip achieves 4.2 IPC compared to—"
"IPC?" The VP frowned. "Nobody talks about IPC for AI workloads. What's our TFLOPS? What's the tokens per second for Llama inference?"
Marcus stared at his slides. He'd spent weeks measuring the wrong things.
That evening, Marcus called his former colleague Sarah, now at a leading AI chip startup. "I feel like an idiot," he admitted. "I've been doing CPU performance analysis for fifteen years. When did everything change?"
Sarah laughed sympathetically. "It's not just you. The entire industry went through a metrics revolution. The numbers that mattered for compiling code and running databases are almost irrelevant for training transformers. Let me walk you through what happened."
From IPC to TOPS
Twenty years ago, the core metric for evaluating CPU performance was IPC (Instructions Per Cycle). Engineers used it to compare efficiency across different microarchitectures. A higher IPC meant the processor could execute more instructions in the same amount of time—a clear indicator of "better."
But Marcus discovered what many engineers learn the hard way: metrics that worked for one era can be meaningless in another.
Today, if you ask an AI engineer "what's the IPC of this GPU," they'd look at you the way a race car driver would look at someone asking about their vehicle's cup holder capacity. It's not wrong, exactly—it's just irrelevant.
Modern AI/HPC uses completely different metrics. Where CPUs were measured in instructions, AI accelerators are measured in operations:
| Era | Primary Metric | What It Measures |
|---|---|---|
| 1990s-2000s | IPC (Instructions Per Cycle) | CPU efficiency for general-purpose code |
| 2010s | GFLOPS (Billion FP ops/sec) | GPU compute for graphics and early ML |
| 2020s | TFLOPS/TOPS | AI accelerator throughput at various precisions |
TFLOPS (Tera Floating-point Operations Per Second) has replaced IPC as the primary metric. Furthermore, TOPS (Tera Operations Per Second) describes performance for low-precision operations (INT8, INT4) that dominate AI inference.
Why Do Metrics Evolve?
This change reflects several fundamental shifts in how we compute:
1. Vectorization of Compute Units
Think about what happens when you execute an ADD instruction on a traditional CPU: you add two numbers and get one result. One instruction, one operation.
Now consider a Tensor Core on an NVIDIA H100. A single matrix-multiply-accumulate (MMA) instruction performs 256 FP16 multiply-add operations simultaneously. If you measured this in "instructions per cycle," you'd get a small number—maybe 1 or 2. But in terms of actual useful work for AI, it's doing 256 times more than a CPU instruction.
Traditional CPU:
ADD r1, r2, r3 → 1 addition
Tensor Core (NVIDIA H100):
MMA.F16 → 256 FP16 multiply-add operations
Comparing IPC between these two architectures would be like comparing a delivery truck's "packages per trip" when one truck carries 1 box and another carries 256. The metric doesn't capture what matters.
2. The Memory Wall Changed the Game
When Sarah explained this to Marcus, she drew a simple graph on a napkin. "Compute capability has been growing at roughly 2x every two years," she said. "Memory bandwidth? Maybe 1.3x. After two decades, compute is about 1000x faster while memory is only about 10x faster."
This means that for many workloads, the processor spends most of its time waiting for data. Measuring "instructions per second" becomes meaningless when the bottleneck is "bytes per second." The Roofline model, which we'll explore shortly, captures this reality.
3. Workloads Became More Homogeneous
Traditional CPU workloads are diverse: branching code, pointer chasing, string manipulation, system calls. Every program is different, so a general metric like IPC made sense.
AI workloads are remarkably similar at their core: they're dominated by matrix multiplication (GEMM). Whether you're training a vision model, a language model, or a recommendation system, 80-95% of the compute is matrix multiply. For such homogeneous workloads, measuring FLOPS directly is far more meaningful than counting abstract "instructions."
The New Metrics Landscape
Here's how the transition from traditional to modern metrics looks across different dimensions:
| Traditional Metric | Modern Metric | Why It Changed |
|---|---|---|
| CPI / IPC | FLOPS / TOPS | Single instruction now does hundreds of ops |
| Memory Bandwidth | Roofline Model | Compute/bandwidth ratio determines bottleneck |
| Amdahl's Law | Comm-to-Compute Ratio | Network, not serial code, limits scaling |
| Latency vs. Throughput | TTFT vs. TPS | LLM streaming creates new user experience |
| Power (W) | Energy Efficiency (GFLOPS/W) | Electricity is now a major cost center |
Let's explore each evolution in depth, starting with the most fundamental: the shift from counting instructions to counting operations.
CPI/IPC → FLOPS/TOPS
Traditional: CPI and IPC
CPI (Cycles Per Instruction) and IPC (Instructions Per Cycle) are reciprocal metrics:
CPI = Total_Cycles / Total_Instructions
IPC = Total_Instructions / Total_Cycles = 1 / CPI
These metrics assume:
- Each instruction has roughly equal "value"
- Program performance is proportional to instruction count
- Microarchitecture improvements are reflected in IPC gains
Modern: FLOPS and TOPS
FLOPS (Floating-point Operations Per Second) directly measures compute capability:
GFLOPS = Floating_Point_Operations / Time / 10^9
TFLOPS = GFLOPS / 1000
TOPS (Tera Operations Per Second) is used for integer operations, especially low-precision AI inference:
TOPS = Operations / Time / 10^12
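As a concrete instance of these formulas, the sketch below counts the operations in a matrix multiply and converts a timing into TFLOPS. It assumes the usual convention of counting each multiply and each add as one operation (2·M·N·K for a GEMM); the 0.5 ms timing is a made-up value for illustration, not a measurement.

```python
def gemm_flop_count(m, n, k):
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per term."""
    return 2 * m * n * k

def tflops(flop_count, seconds):
    """Convert an operation count and elapsed time into TFLOPS."""
    return flop_count / seconds / 1e12

flops = gemm_flop_count(4096, 4096, 4096)
measured_seconds = 0.0005  # hypothetical timing of one 4096^3 GEMM
achieved = tflops(flops, measured_seconds)
print(f"{flops:,} FLOPs in {measured_seconds * 1e3:.1f} ms -> {achieved:.1f} TFLOPS")
```

The same function works for TOPS if the operations are integer multiply-adds; only the label changes.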
TOPS at Different Precisions
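Modern AI accelerators support multiple precisions, each with a different peak throughput:
NVIDIA H100 SXM Theoretical Peak:
┌─────────────────────┬──────────────┐
│ Precision           │ Peak         │
├─────────────────────┼──────────────┤
│ FP64 (Tensor Core)  │ 67 TFLOPS    │
│ FP32                │ 67 TFLOPS    │
│ TF32 (Tensor Core)  │ 989 TFLOPS   │
│ FP16 (Tensor Core)  │ 1979 TFLOPS  │
│ FP8 (Tensor Core)   │ 3958 TFLOPS  │
│ INT8 (Tensor Core)  │ 3958 TOPS    │
└─────────────────────┴──────────────┘
Note: These are theoretical peaks from the vendor datasheet. The TF32, FP16, FP8, and INT8 figures assume structured sparsity; dense throughput is half of each. Actual performance is typically 50-80% of peak, depending on workload characteristics.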
Peak vs Sustained
When reporting FLOPS, you must distinguish:
- Peak FLOPS: Theoretical maximum, assuming 100% utilization
- Sustained FLOPS: Performance maintainable under actual workloads
Typical ratios:
Matrix multiply (GEMM): 70-95% of peak
Convolution (Conv): 50-80% of peak
Attention mechanism: 30-70% of peak
Memory-intensive ops: 10-30% of peak
Roofline diagram construction:
Performance (GFLOPS)
    ^
    |                   __________________  Peak Compute (roof)
    |                  /
    |                 /   ◄─── Compute Bound Region
    |                /         (Horizontal = at compute ceiling)
    |               /
    |              /   Ridge Point
    |             /
    |            /
    |           /   ◄─── Memory Bound Region
    |          /         Performance scales linearly with AI
    |         /          (Slope = memory bandwidth)
    |        /
    |       /
    |      /
    |     /
    |    /
    |────┴─────────────────────────────────────────> Arithmetic Intensity
                                                     (FLOPs/Byte)
Interpreting Roofline
Two "roofs":
- Horizontal line: Peak Compute (compute ceiling)
- Diagonal line: Memory Bandwidth × AI (memory ceiling)
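An application's arithmetic intensity (AI) relative to the ridge point determines its bottleneck: to the left of the ridge it sits under the diagonal and is memory-bound, to the right it sits under the horizontal roof and is compute-bound. The thresholds depend on the hardware's compute-to-bandwidth ratio; as a rough guide:
AI (FLOPs/Byte)    Bottleneck          Typical Applications
──────────────────────────────────────────────────────────────
< 10               Memory              Vector add, STREAM
10-50              Boundary            Sparse matrix, Conv2D
50-200             Boundary/Compute    Dense matrix multiply
> 200              Compute             Highly optimized GEMM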
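The two roofs reduce to a single `min()`. The sketch below is one way to compute attainable performance and the ridge point; the machine numbers (10,000 GFLOPS peak, 1,000 GB/s bandwidth) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_gflops / bandwidth_gbs

def bottleneck(peak_gflops, bandwidth_gbs, intensity):
    """Classify a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"

PEAK, BW = 10_000.0, 1_000.0          # hypothetical machine
# FP32 vector add c[i] = a[i] + b[i]: 1 FLOP per 12 bytes moved.
vector_add_ai = 1 / 12
for name, ai in [("vector add", vector_add_ai), ("dense GEMM", 50.0)]:
    perf = attainable_gflops(PEAK, BW, ai)
    print(f"{name:<11} AI={ai:6.3f}  {perf:8.1f} GFLOPS  {bottleneck(PEAK, BW, ai)}")
```

On this hypothetical machine the ridge point is 10 FLOPs/byte, so vector add can never exceed about 83 GFLOPS no matter how well the kernel is tuned.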
Amdahl's Law → Communication-to-Computation
When Marcus first learned parallel programming, Amdahl's Law was gospel. "If 10% of your code is sequential," his professor had said, "you can never get more than 10x speedup, no matter how many processors you throw at it."
That mental model worked fine for multi-core CPUs. But when Marcus started working on distributed AI training across hundreds of GPUs, he discovered a new bottleneck that Amdahl never considered: the network.
Traditional: Amdahl's Law
Amdahl's Law describes the theoretical speedup limit of parallelization:
Speedup = 1 / ((1 - P) + P/N)
Where:
P = parallelizable fraction
N = number of processors
If 95% of your code is parallelizable (P = 0.95), then even with infinite processors, your speedup is limited to 1 / 0.05 = 20x. The sequential 5% becomes the ceiling.
This law assumes that parallel work is truly parallel—that processors can work independently without coordination. In a shared-memory multi-core system, this is approximately true.
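The formula is simple enough to tabulate directly. This short sketch (function names are illustrative) shows how quickly the speedup flattens toward the 1 / (1 - P) ceiling as processors are added.

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

def amdahl_limit(p):
    """Upper bound on speedup as n grows without limit."""
    return 1.0 / (1.0 - p)

p = 0.95
for n in (8, 64, 512, 4096):
    print(f"N={n:>5}: speedup {amdahl_speedup(p, n):6.2f}x")
print(f"Limit for P={p}: {amdahl_limit(p):.0f}x")
```

With P = 0.95, going from 512 to 4096 processors buys almost nothing: both land just under the 20x ceiling.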
Modern: Communication-to-Computation Ratio
In distributed AI training, a new bottleneck emerges: communication.
Typical data-parallel training:
GPU 0 ─┬── Forward ──┬── Backward ──┬── AllReduce ──┐
GPU 1 ─┤             │              │               │
GPU 2 ─┤             │              │               │
GPU 3 ─┘             │              │               │
                     ▼              ▼               ▼
                  Compute        Compute      Communication
AllReduce operations need to synchronize gradients across all GPUs—this is the main communication bottleneck.
Communication-to-Computation Ratio
C2C Ratio = Communication_Time / Computation_Time
Ideal: C2C << 1 (communication time much less than compute time)
Reality: C2C worsens as GPU count increases
Influencing factors:
- Model size: Larger gradients mean more data to transfer
- Batch size: Larger batches increase compute time, improving C2C
- Network bandwidth: InfiniBand vs Ethernet makes a huge difference
- AllReduce algorithm: Ring AllReduce, Hierarchical AllReduce
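To get a feel for these factors, the sketch below estimates the C2C ratio under a bandwidth-only model of Ring AllReduce, in which each GPU transfers 2(N-1)/N times the gradient size. It ignores per-message latency and any overlap of communication with the backward pass, and all numbers (1 billion FP16 gradients, 50 GB/s links, 0.35 s of compute per step) are hypothetical.

```python
def ring_allreduce_seconds(gradient_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only estimate: each GPU moves 2*(N-1)/N of the gradient."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes / link_bytes_per_s

def c2c_ratio(communication_s, computation_s):
    """Communication-to-computation ratio; lower is better."""
    return communication_s / computation_s

GRADIENT_BYTES = 1_000_000_000 * 2   # 1e9 parameters, FP16 gradients
LINK_BW = 50e9                       # bytes/s per link (assumed)
COMPUTE_S = 0.35                     # forward + backward per step (assumed)

for gpus in (2, 8, 64):
    comm = ring_allreduce_seconds(GRADIENT_BYTES, gpus, LINK_BW)
    print(f"{gpus:>3} GPUs: comm {comm * 1e3:5.1f} ms, C2C = {c2c_ratio(comm, COMPUTE_S):.3f}")
```

Doubling the per-GPU batch size roughly doubles `COMPUTE_S` while leaving the gradient size unchanged, which is why larger batches improve the ratio.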
Latency vs. Throughput → TTFT vs. TPS
The traditional latency/throughput trade-off still exists in AI, but LLMs introduced something new: the user experience of streaming.
When you chat with an LLM, you don't wait for the entire response to be generated before seeing anything. The words appear one by one, like watching someone type in real-time. This streaming experience created entirely new metrics that capture what users actually care about.
Traditional: Latency and Throughput
For a traditional web service, performance is simple:
- Latency: How long until the response is complete?
- Throughput: How many requests can we handle per second?
You optimize for one, the other, or some balance. A user either sees the result or doesn't.
Modern: LLM's TTFT and TPS
LLMs broke this model because users experience the response progressively. A 10-second response feels fast if words start appearing immediately. The same 10-second response feels slow if there's a 3-second pause before anything appears.
This led to two new metrics that capture different aspects of user experience:
TTFT (Time To First Token): How long until the user sees something?
This is the "perceived responsiveness" metric. Users are more tolerant of slow generation if the response starts quickly. TTFT is dominated by the Prefill phase—processing the entire input prompt before any output can begin.
TPS (Tokens Per Second): How fast do subsequent words appear?
TPS = Tokens generated per second (Decode phase)
Factors affecting TPS:
1. KV Cache size
2. Batch size
3. Memory bandwidth (usually the bottleneck)
Two Phases of LLM Inference
Request processing flow:
[Input Prompt] ──► [Prefill] ──► [Decode × N] ──► [Complete]
                       ▼               ▼
                     TTFT          TPOT × N
Where:
TTFT = Time To First Token
TPOT = Time Per Output Token
N = number of output tokens
Total latency = TTFT + (N × TPOT)
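A quick calculation makes the split between the two phases tangible. The sketch below applies the total-latency formula as written above; the TTFT and TPOT values are hypothetical.

```python
def total_latency_s(ttft_s, tpot_s, n_tokens):
    """End-to-end latency: prefill time plus per-token decode time."""
    return ttft_s + n_tokens * tpot_s

def decode_tps(tpot_s):
    """Tokens per second during decode is the reciprocal of TPOT."""
    return 1.0 / tpot_s

ttft, tpot, n = 0.5, 0.04, 200   # hypothetical: 500 ms TTFT, 40 ms/token
total = total_latency_s(ttft, tpot, n)
print(f"TPS during decode: {decode_tps(tpot):.0f} tokens/s")
print(f"Total latency:     {total:.1f} s ({ttft / total:.0%} spent before the first token)")
```

In this example prefill is under 6% of the total time, yet it is the only part the user experiences as a blank screen.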
Prefill vs Decode Characteristics
| Phase | Characteristics | Bottleneck |
|---|---|---|
| Prefill | Processes the prompt in parallel | Compute-bound: uses Tensor Cores, computes many positions at once |
| Decode | Autoregressive generation | Memory-bound: needs to read the KV Cache, processes 1 token at a time |
This explains why:
- TTFT increases with prompt length
- TPS depends far less on prompt length than TTFT does (it degrades only gradually as the KV Cache grows)
- Batch processing can significantly improve overall throughput
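The memory-bound nature of decode suggests a simple back-of-the-envelope bound. The sketch below assumes every decode step streams all model weights from memory once and that this traffic dominates; it ignores KV Cache reads and compute time, so it is an optimistic ceiling, not a prediction. The model size and bandwidth are hypothetical.

```python
def decode_tps_upper_bound(param_count, bytes_per_param,
                           bandwidth_bytes_per_s, batch_size=1):
    """Bandwidth ceiling on decode throughput, in tokens per second.

    One pass over the weights produces one token per sequence in the
    batch, so throughput scales with batch size until compute saturates.
    """
    weight_bytes = param_count * bytes_per_param
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return batch_size * steps_per_second

PARAMS = 70e9        # hypothetical 70B-parameter model
FP16_BYTES = 2
BANDWIDTH = 2e12     # hypothetical 2 TB/s of memory bandwidth

for batch in (1, 8):
    bound = decode_tps_upper_bound(PARAMS, FP16_BYTES, BANDWIDTH, batch)
    print(f"batch {batch}: at most {bound:.1f} tokens/s")
```

Note that prompt length does not appear in this model at all, and that batching multiplies throughput because the weight traffic is shared across sequences.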
Power → Energy Efficiency
Traditional: Power (Watts)
Early system evaluation treated power as a "constraint" rather than a "metric":
Traditional thinking:
"This CPU draws 100W, ensure adequate cooling"
"Server room power capacity is XX kW"
Power was seen as a problem to "handle," not a target to "optimize."
Modern: Energy Efficiency
In two extreme scenarios, energy efficiency becomes a core metric:
1. Hyperscale Data Centers
Training GPT-4 scale models:
- Thousands of GPUs
- Megawatts of power consumption
- Electricity becomes major cost
Efficiency metrics: GFLOPS/W, TOPS/W
2. Edge Devices
AI inference on phones/IoT:
- Limited battery capacity
- Thermal Design Power (TDP) limits
- User experience affected by heat
Efficiency metrics: Inferences/mAh, TOPS/W
Epilogue: Marcus's Second Presentation
Three weeks later, Marcus gave his second presentation to the executive team. This time, his slides told a different story:
"Our chip achieves 847 TFLOPS at FP16, putting it between the A100 and H100. For Llama-70B inference at batch size 1, we measure 23 tokens per second—competitive with the A100."
He showed a Roofline diagram. "We're currently memory-bound for decode-heavy workloads, achieving 78% of theoretical bandwidth. For prefill-heavy workloads, we hit 65% of peak compute."
The VP nodded. "Now I understand what we're buying. Good work."
After the meeting, Sarah texted him: "Heard it went well. What changed?"
Marcus replied: "I stopped measuring what's easy and started measuring what matters."
That's the first lesson of AI/HPC performance analysis: know which metrics matter for your workload, because the right metric is worth more than a thousand benchmarks.
Summary
Performance metric evolution reflects fundamental changes in computing workloads:
From Single to Multi-dimensional
- Traditional: Single IPC or MHz could explain performance
- Modern: Need combination of metrics (FLOPS, bandwidth, efficiency, latency)
From Absolute to Relative
- Traditional: Pursue highest absolute performance
- Modern: Pursue best "ratios" (Roofline, efficiency)
From Hardware to Application
- Traditional: Hardware specs determine performance
- Modern: Application characteristics (arithmetic intensity, C2C, TTFT) determine evaluation approach
Key Metric Evolution
- IPC → FLOPS/TOPS: Vectorized computation
- Bandwidth → Roofline: Compute/bandwidth ratio
- Amdahl → C2C: Communication becomes new bottleneck
- Latency → TTFT: LLM-specific metrics
- Power → Efficiency: Performance per watt
- Code Size → Quantization: Reduce data movement
Chapter 24: AI/ML Benchmarks
Part VII: AI/HPC
"In machine learning, the only benchmark that matters is your production workload." — Unknown
The Benchmark That Proved Nothing
Linda was evaluating AI accelerator chips for her company's inference infrastructure. Vendor A claimed "500 TOPS," Vendor B claimed "450 TOPS." Easy decision, right? Go with the bigger number.
She ran her actual workload—a BERT-based text classifier—on both chips. Vendor B, with the "smaller" TOPS number, was 40% faster.
"How is this possible?" she asked Vendor A's sales engineer.
He shifted uncomfortably. "Well, our 500 TOPS is at INT4 precision. Your model uses FP16. And our number is for batch size 256—you're running batch size 1. Also, that's peak theoretical throughput. Sustained performance depends on..."
Linda cut him off. "So that number on your datasheet is essentially meaningless for my use case?"
"I wouldn't say meaningless..."
This is the fundamental problem with AI benchmarks. Unlike traditional software where "faster" has a clear meaning, AI performance depends on model architecture, precision, batch size, optimization level, and a dozen other factors. Two chips with identical specs can have 3x performance differences on real workloads.
Why AI Benchmarking Is Different
AI performance evaluation is more complex than traditional software for three fundamental reasons:
1. "Correct" Is Not Binary
When a sorting algorithm produces [1, 3, 2, 4, 5], it's wrong. Period. But when an image classifier achieves 76.3% accuracy instead of 78.1%, is that acceptable? It depends on the application, the cost savings, the latency requirements. Traditional benchmarks measure speed at fixed correctness. AI benchmarks must navigate a speed-accuracy trade-off.
2. Performance and Accuracy Are Intertwined
Quantization—running a model at lower precision—makes inference faster but slightly less accurate:
| Precision | Relative Speed | Typical Accuracy Loss |
|---|---|---|
| FP32 | 1.0x | 0% (baseline) |
| FP16 | ~2.0x | -0.1% |
| INT8 | ~2.5x | -0.5% |
| INT4 | ~4.0x | -2.0% |
A benchmark that only reports speed is incomplete. A benchmark that only reports accuracy ignores practical constraints.
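One way to make the trade-off explicit is to state an accuracy budget first and then pick the fastest option that fits. The sketch below does this with the illustrative figures from the table above; the function name and the budget values are assumptions for the example.

```python
# (precision, relative speed, typical accuracy loss in percentage points)
PRECISIONS = [
    ("FP32", 1.0, 0.0),
    ("FP16", 2.0, 0.1),
    ("INT8", 2.5, 0.5),
    ("INT4", 4.0, 2.0),
]

def fastest_within_budget(options, max_accuracy_loss):
    """Return the fastest precision whose accuracy loss fits the budget."""
    eligible = [opt for opt in options if opt[2] <= max_accuracy_loss]
    if not eligible:
        return None
    return max(eligible, key=lambda opt: opt[1])

for budget in (0.05, 1.0, 5.0):
    name, speed, loss = fastest_within_budget(PRECISIONS, budget)
    print(f"budget {budget:>4}%: use {name} ({speed}x, -{loss}%)")
```

A report built this way always carries both numbers together, which is exactly what a speed-only or accuracy-only benchmark fails to do.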
3. Hardware Diversity Creates Comparison Nightmares
The same PyTorch model can run on CPUs, NVIDIA GPUs, AMD GPUs, Google TPUs, Apple Neural Engine, Qualcomm NPUs, Intel Gaudi, and dozens of custom ASICs. Each platform has different optimization paths, different supported operations, different precision formats. Comparing "apples to apples" requires carefully controlled methodology.
MLPerf: The Industry Standard
The industry's answer to the AI benchmarking problem is MLPerf, maintained by MLCommons—a consortium including NVIDIA, Google, Intel, AMD, Meta, Microsoft, and dozens of other companies. MLPerf attempts to create standardized, reproducible benchmarks that allow meaningful comparisons across different hardware platforms.
Think of MLPerf as the SPEC CPU of the AI world: a standardized suite with strict rules about what you can and cannot change.
The MLPerf Family
MLPerf isn't a single benchmark—it's a family of benchmarks targeting different scenarios:
| Benchmark | What It Measures | Typical Submitters |
|---|---|---|
| MLPerf Training | Time to train to target accuracy | NVIDIA, Google, Intel, hyperscalers |
| MLPerf Inference | Inference latency and throughput | AI chip startups, cloud providers |
| MLPerf HPC | ML on supercomputers | National labs, research institutions |
| MLPerf Tiny | Performance on microcontrollers | Embedded chip vendors |
| MLPerf Mobile | Performance on phones/tablets | Qualcomm, MediaTek, Apple |
| MLPerf Storage | Data pipeline performance | Storage vendors |
For most readers of this book, Training and Inference are the benchmarks you'll encounter most often.
MLPerf Training: Racing to Accuracy
The Training benchmark measures one thing: how long does it take to train a model from random initialization to a specified target accuracy?
| Benchmark | Model | Dataset | Target Accuracy |
|---|---|---|---|
| ResNet | ResNet-50 v1.5 | ImageNet | 75.9% Top-1 |
| RetinaNet | RetinaNet | Open Images (subset) | 34.0% mAP |
| BERT | BERT-Large | Wikipedia | 0.72 Mask-LM accuracy |
| DLRM | DLRM | Criteo | 0.8025 AUC |
| 3D U-Net | 3D U-Net | KiTS19 | 0.908 Mean Dice |
| GPT-3 | GPT-3 175B | C4 | 2.69 log perplexity |
| Stable Diffusion | Stable Diffusion | LAION-400M | FID ≤ 90 and CLIP ≥ 0.15 |
Result Format
Typical result:
System: 8x NVIDIA H100 SXM5
Benchmark: BERT-Large Training
Time: 2.3 minutes (to reach 0.72 Mask-LM accuracy)
Comparison:
DGX H100 (8 GPU): 2.3 min
DGX A100 (8 GPU): 5.1 min
Cloud TPU v4 (16): 3.8 min
Closed vs Open Division
Closed Division:
- Must use specified model architecture
- Must achieve specified accuracy
- Can only adjust batch size, learning rate, etc.
- Purpose: Fair hardware comparison
Open Division:
- Can modify model architecture
- Can use different optimization techniques
- Purpose: Showcase innovative methods
MLPerf Inference
MLPerf Inference measures inference performance in realistic deployment scenarios.
Scenario Definitions
| Scenario | Description | Primary Metric |
|---|---|---|
| Server | Concurrent requests, latency constraints | QPS (within latency SLO) |
| Offline | Batch processing, no latency constraints | Throughput (samples/sec) |
| SingleStream | One request at a time | Latency (ms) |
| MultiStream | Multiple independent streams | Number of streams |
Server Scenario Details
Server scenario simulates real services:
Request arrives → Queue → Process → Response
↑
Latency constraint
SLO (Service Level Objective):
- Example: 99% of requests must complete within 15ms
Measurement Items
1. GEMM (General Matrix Multiplication)
- Matrix sizes: from small (256×256) to large (4096×4096)
- Precision: FP32, FP16, INT8
- Simulates: Fully connected layers, Attention
2. Convolution
- Various kernel sizes (1×1, 3×3, 5×5)
- Stride, padding variations
- Simulates: CNN convolution layers
3. RNN (Recurrent Neural Networks)
- LSTM, GRU
- Different hidden sizes and sequence lengths
- Simulates: Sequence models
4. All-Reduce
- Measures distributed training communication performance
- Different data sizes and GPU counts
Typical Results
NVIDIA A100 DeepBench Results:
GEMM (4096×4096, FP16):
Peak: 312 TFLOPS
Achieved: 285 TFLOPS (91.3%)
Conv2D (3×3, 256 channels):
Peak: 312 TFLOPS
Achieved: 198 TFLOPS (63.5%)
Reason: Convolution requires more memory access, cannot achieve pure GEMM efficiency
Other AI Benchmarks
AI Benchmark (ETH Zürich)
AI performance testing designed specifically for mobile devices:
Features:
- Targets phone NPUs and GPUs
- Covers multiple AI tasks
- Has Android App for direct testing
Test items:
1. Image Classification (MobileNet, EfficientNet)
2. Object Detection (YOLO, SSD)
3. Image Segmentation
4. Face Recognition
5. Super Resolution
6. Language Models
Result format:
Total score + individual scores
Comparable with other devices
DAWNBench
Developed by Stanford, focuses on "cost to train to target accuracy":
Core metrics:
Time-to-Accuracy
Cost-to-Accuracy ($)
Example:
"How much time/money to reach 93% Top-5 accuracy on ImageNet?"
This metric is more practical:
- Not just speed, but also cost
- Considers cloud computing pricing
Geekbench ML
Cross-platform consumer AI benchmark:
Pros:
- Easy to run (download app)
- Cross-platform (Windows, macOS, iOS, Android)
- Result database for easy comparison
Cons:
- Not transparent (doesn't disclose all details)
- May be targeted for optimization
- Not suitable for serious performance analysis
Running MLPerf: Practical Guide
Environment Setup
# 1. Get MLPerf code
git clone https://github.com/mlcommons/inference.git
cd inference
# 2. Choose benchmark (ResNet-50 as example)
cd vision/classification_and_detection
# 3. Prepare dataset
# ImageNet validation set (50,000 images)
# Need to download from official source
# 4. Install dependencies
pip install -r requirements.txt
Running Inference Benchmark
# SingleStream scenario (measure latency)
python3 main.py --backend onnxruntime \
--model resnet50 \
--scenario SingleStream \
--accuracy
# Server scenario (measure QPS)
python3 main.py --backend onnxruntime \
--model resnet50 \
--scenario Server \
--qps 100
Interpreting Results
MLPerf Inference Result Example:
TestScenario.SingleStream:
qps: 156.25
latency (ns): 6400000 (6.4 ms)
result summary:
samples processed: 50000
accuracy: 76.15%
reference accuracy (FP32): 76.46%
target accuracy (99% of reference): 75.70%
Result: 6.4 ms latency, 76.15% accuracy (meets the 99%-of-reference target)
Building Your Own AI Benchmark
For specific applications, you may need custom benchmarks:
Design Principles
1. Define clear metrics
- Latency (P50, P95, P99)
- Throughput (samples/sec)
- Accuracy (specific definition)
- Resource usage (memory, power)
2. Reproducibility
- Fix random seeds
- Record complete environment
- Use version control
3. Reflect real workload
- Use actual data distribution
- Simulate real request patterns
- Consider batch mixing
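These principles translate into a small measurement harness. The sketch below is one possible shape, using only the standard library: it warms up, times each call with `time.perf_counter`, and reports nearest-rank percentiles. `fake_inference` is a hypothetical stand-in for a real model call, and resource usage and accuracy are left out for brevity.

```python
import math
import random
import statistics
import time

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def benchmark(fn, iterations=1000, warmup=50):
    """Time fn() and return latency statistics in milliseconds."""
    for _ in range(warmup):            # warm caches before measuring
        fn()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    latencies_ms.sort()
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "std": statistics.stdev(latencies_ms),
    }

def fake_inference():
    """Hypothetical stand-in for a real model call."""
    return sum(random.random() for _ in range(1_000))

random.seed(0)   # fixed seed, per the reproducibility principle
stats = benchmark(fake_inference, iterations=200, warmup=20)
for name, value in stats.items():
    print(f"{name:>4}: {value:.3f} ms")
```

Reporting P95 and P99 alongside the mean matters because tail latency, not average latency, is usually what violates an SLO.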
Report Format Suggestion
AI Benchmark Report
═══════════════════════════════════════════════════════════
Model: YOLOv8-Medium
Hardware: NVIDIA RTX 4090
Precision: FP16
Batch Size: 1
Latency Results (1000 iterations):
Mean: 4.2 ms
P50: 4.1 ms
P95: 5.8 ms
P99: 7.2 ms
Std: 0.9 ms
Throughput (60 seconds):
238 images/second
Accuracy (on validation set):
mAP@0.5: 45.2%
mAP@0.5:0.95: 33.1%
GPU Utilization: 85%
GPU Memory: 4.2 GB / 24 GB
Power: 280W average
Environment:
CUDA: 12.2
cuDNN: 8.9
TensorRT: 8.6
Driver: 535.86
OS: Ubuntu 22.04
What Linda Learned
Linda eventually made her chip decision. Neither Vendor A nor Vendor B.
She built her own benchmark using her actual production model, with her actual batch sizes, at her actual precision requirements. She tested latency at P50, P95, and P99. She measured power consumption under load. She calculated cost per inference.
Vendor C, which she'd initially dismissed because of lower published TOPS numbers, turned out to be the best fit for her specific workload—40% lower cost per inference than either A or B.
"The published specs weren't wrong," she explained to her team. "They just weren't measuring what mattered to us. TOPS at INT4 with batch size 256 is a valid metric—it's just not our metric."
The lesson: standardized benchmarks like MLPerf provide valuable apples-to-apples comparisons, but the most important benchmark is always the one that reflects your actual production workload.
Summary
AI/ML benchmarking has its own challenges and methods:
Major Benchmarks
- MLPerf: Industry standard, complete models, strict rules
- DeepBench: Core operations, low-level performance analysis
- AI Benchmark: Mobile devices, consumer-oriented
- DAWNBench: Cost-oriented, time/money
Key Considerations
- Performance and accuracy trade off against each other
- Hardware diversity makes comparison difficult
- Need clear measurement conditions
Choosing a Benchmark
- Hardware procurement evaluation: MLPerf
- Low-level optimization analysis: DeepBench
- Consumer comparison: Geekbench ML / AI Benchmark
- Cost-sensitive scenarios: DAWNBench
Custom Benchmarks
- Define clear metrics
- Ensure reproducibility
- Reflect real workload
- Report complete environment
Chapter 25: HPC Benchmarks
Part VII: AI/HPC
"The TOP500 list is not about who has the biggest computer, but about who can solve the biggest problems." — Jack Dongarra
When Your Supercomputer Ranks #1 But Can't Run Your Code
Dr. Zhang's research group had just gotten access to their national lab's newest supercomputer—ranked in the top 20 of the TOP500 list. Exciting times. Their climate simulation code, which took 48 hours on the previous system, should scream on this new machine.
It didn't. The simulation took 52 hours. Slower than before.
"How is this possible?" Dr. Zhang asked the system administrator. "This machine has 10x the LINPACK score of our old system."
The admin nodded sympathetically. "LINPACK measures dense linear algebra. Your code is sparse matrix-heavy with irregular memory access patterns. This new system has great compute, but the memory bandwidth per core actually went down. Your code is memory-bound, not compute-bound."
Dr. Zhang had just learned one of the fundamental lessons of HPC benchmarking: the benchmark that ranks supercomputers tells you almost nothing about how those supercomputers will perform on your specific workload.
This chapter explores the benchmarks used to evaluate HPC systems, their strengths, their limitations, and how to interpret them for real applications.
A Brief History of HPC Performance Measurement
Since 1993, the TOP500 list has been published twice yearly, ranking the world's most powerful supercomputers. It's become the definitive measure of "who has the biggest computer"—and also a cautionary tale about what benchmarks can and cannot tell you.
The performance growth has been staggering:
| Year | Milestone | FLOPS |
|---|---|---|
| 1993 | First TOP500 list | 59.7 GFLOPS (CM-5, Los Alamos) |
| 1997 | First teraFLOPS | 1.0 TFLOPS (ASCI Red, Intel) |
| 2008 | First petaFLOPS | 1.0 PFLOPS (Roadrunner, IBM) |
| 2022 | First exaFLOPS | 1.1 EFLOPS (Frontier, AMD) |
Over 30 years, the top-ranked system's HPL performance has grown roughly 18 million times (from 59.7 GFLOPS to 1.1 EFLOPS). But how is this measured, and what does it actually mean?
LINPACK: The Benchmark That Defines the TOP500
LINPACK (and its modern parallel version, HPL—High Performance LINPACK) is the benchmark used for TOP500 rankings. It measures performance on a specific mathematical operation: solving a dense system of linear equations.
What LINPACK Actually Computes
The problem is deceptively simple:
Solve Ax = b for x
Where:
A is an n×n dense matrix
b is a known vector
x is the unknown we're solving for
The standard approach uses LU decomposition: factorize A into lower and upper triangular matrices (L and U), then solve through forward and backward substitution. The computational complexity is O(n³), dominated by floating-point multiply-add operations.
Why This Particular Problem?
When Jack Dongarra and colleagues created LINPACK in the 1970s, they needed a benchmark that was:
- Mathematically rigorous: The correct answer can be verified
- Scalable: You can always use a bigger matrix
- Compute-intensive: Limited by arithmetic, not I/O
- Representative: Linear algebra was central to many scientific codes
For the computing of that era, these were reasonable choices. Dense linear algebra was a dominant workload. Memory bandwidth was relatively fast compared to compute.
HPL (High Performance LINPACK)
HPL is the parallel version of LINPACK for distributed systems:
HPL Parameters:
N: Problem size (matrix dimension)
NB: Block size
P×Q: Processor grid
Performance calculation:
FLOPS = (2/3 × N³ + 2 × N²) / Time
Result Interpretation
Typical HPL output:
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 100000 256 8 8 1234.56 5.40e+02
--------------------------------------------------------------------------------
Interpretation:
N = 100000 (matrix dimension)
P×Q = 8 × 8 = 64 MPI processes
Time = 1234.56 seconds
Gflops = 540 GFLOPS (2/3 × N³ ≈ 6.67 × 10¹⁴ operations in 1234.56 s)
Efficiency Calculation
HPL Efficiency = Achieved FLOPS / Theoretical Peak FLOPS
Typical efficiency:
Well-optimized systems: 70-85%
Average systems: 50-70%
Unoptimized: 30-50%
Factors affecting efficiency:
- Problem size (larger is better)
- Network bandwidth and latency
- Memory bandwidth
- Software optimization level
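The performance formula and the efficiency ratio above can be sketched in a few lines. This is a minimal illustration, not part of HPL itself: the function names, the run size, the run time, and the 1,200 GFLOPS peak are all hypothetical values chosen for the example.

```python
def hpl_flop_count(n: int) -> float:
    """Operations credited for an n x n solve, using the formula from the text."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def hpl_gflops(n: int, seconds: float) -> float:
    """Achieved rate in GFLOPS for a run of size n that took `seconds`."""
    return hpl_flop_count(n) / seconds / 1e9


def hpl_efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak that was actually achieved."""
    return achieved_gflops / peak_gflops


# Hypothetical run: N = 50,000 solved in 100 s on a machine with 1,200 GFLOPS peak.
achieved = hpl_gflops(50_000, 100.0)
print(f"{achieved:.1f} GFLOPS, {hpl_efficiency(achieved, 1200.0):.1%} of peak")
```

Because the operation count grows as N³ while the N² term stays small, the second term barely affects the result for realistic problem sizes.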
HPCG
HPCG (High Performance Conjugate Gradients) was designed to complement LINPACK by exercising the parts of a system that LINPACK barely touches.
Design Motivation
LINPACK's problems:
- Compute-intensive, regular memory access
- Modern applications are often memory-intensive
- High LINPACK efficiency doesn't mean high application efficiency
HPCG's goals:
- Access patterns closer to real applications
- Measure memory system performance
- Reflect sparse matrix operations
Mathematical Background
HPCG uses the preconditioned conjugate gradient method to solve sparse linear systems:
Ax = b
Where A is a sparse matrix (generated from a 27-point stencil on a 3D grid)
Main operations:
1. SpMV (Sparse Matrix-Vector multiply)
2. Vector operations (AXPY, dot product)
3. Multigrid preconditioning
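To make the first two operations concrete, here is a minimal pure-Python sketch of conjugate gradient built from SpMV, dot products, and AXPY. It is a simplification under stated assumptions: the matrix is a small 1D Laplacian in CSR form rather than HPCG's 3D 27-point stencil, there is no multigrid preconditioner, and all function names are illustrative.

```python
def spmv(indptr, indices, values, x):
    """y = A @ x for a matrix stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += values[k] * x[indices[k]]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(indptr, indices, values, b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    p = list(r)                 # search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, iteration
        ap = spmv(indptr, indices, values, p)
        alpha = rs_old / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)
        rs_old = rs_new
    return x, max_iter


def laplacian_1d(n):
    """CSR arrays for the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    indptr, indices, values = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                values.append(v)
        indptr.append(len(indices))
    return indptr, indices, values


indptr, indices, values = laplacian_1d(50)
b = [1.0] * 50
x, iterations = conjugate_gradient(indptr, indices, values, b)
residual = axpy(-1.0, spmv(indptr, indices, values, x), b)
print(iterations, max(abs(v) for v in residual))
```

Every inner loop performs one multiply-add per value loaded, and the `x[indices[k]]` lookup is an indirect access. That low arithmetic intensity is why this kind of kernel is limited by the memory system rather than by floating-point units.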
## Graph500
**Graph500** measures graph analysis performance, reflecting data-intensive applications.
### Why Graph500 Is Needed
```text
Many important applications are graph-oriented:
- Social network analysis
- Web page ranking
- Bioinformatics
- Cybersecurity
Characteristics of these applications:
- Irregular data access
- Low computational density
- High memory bandwidth requirements
```
Benchmark Content
Graph500 contains three kernels:
1. Graph Construction
- Build graph data structure from edge list
- Measures data processing capability
2. BFS (Breadth-First Search)
- Breadth-first search from random starting point
- Primary performance metric
3. SSSP (Single-Source Shortest Path)
- Calculate shortest paths
- More complex graph algorithm
Performance Metric: GTEPS
GTEPS = Giga Traversed Edges Per Second
= Billions of edges traversed per second
Calculation:
GTEPS = Total_Edges / Time / 10^9
Typical values:
Top systems: 10,000+ GTEPS
General HPC: 100-1000 GTEPS
Single node: 1-10 GTEPS
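The metric can be illustrated with a toy breadth-first search. This is a sketch, not the Graph500 reference code: the graph is a tiny hand-made edge list, the function names are made up, and the edge count follows the convention of counting input edges inside the component reached from the root.

```python
import time
from collections import deque


def bfs_parents(adjacency, root):
    """Breadth-first search returning the parent of every reached vertex."""
    parent = {root: root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def traversed_edges(edge_list, parent):
    """Count input edges whose endpoints both lie in the searched component."""
    return sum(1 for u, v in edge_list if u in parent and v in parent)


def teps(edge_count, seconds):
    """Traversed edges per second; divide by 1e9 for GTEPS."""
    return edge_count / seconds


# Tiny undirected graph; vertices 5 and 6 are not reachable from the root.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 6)]
adjacency = {v: [] for v in range(7)}
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

start = time.perf_counter()
parent = bfs_parents(adjacency, root=0)
elapsed = max(time.perf_counter() - start, 1e-9)   # guard against a zero reading
count = traversed_edges(edges, parent)
print(count, "edges,", teps(count, elapsed) / 1e9, "GTEPS")
```

The `v not in parent` check and the neighbor lookups are data-dependent accesses with almost no arithmetic between them, which is the access pattern the benchmark is designed to stress.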
Other HPC Benchmarks
STREAM
Classic benchmark for measuring memory bandwidth:
Four kernels:
Copy: a[i] = b[i]
Scale: a[i] = q * b[i]
Add: a[i] = b[i] + c[i]
Triad: a[i] = b[i] + q * c[i]
Result units: GB/s or MB/s
// STREAM Triad core code
#pragma omp parallel for
for (int i = 0; i < N; i++) {
a[i] = b[i] + scalar * c[i];
}
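STREAM turns a measured time into bandwidth by counting the bytes each kernel must move. A minimal sketch of that arithmetic follows, assuming 8-byte double-precision elements and counting one array per read or write; the array length and timing in the example are hypothetical.

```python
BYTES_PER_DOUBLE = 8

# Arrays touched per loop iteration (reads plus writes) for each kernel.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_bandwidth_gbs(kernel: str, n: int, seconds: float) -> float:
    """Bandwidth in GB/s for one pass of `kernel` over arrays of length n."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * n
    return bytes_moved / seconds / 1e9


# Hypothetical measurement: Triad over 80 million elements in 12 ms.
print(f"{stream_bandwidth_gbs('triad', 80_000_000, 0.012):.1f} GB/s")
```

Triad performs two floating-point operations per 24 bytes moved, so its result reflects the memory system almost entirely.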
NAS Parallel Benchmarks (NPB)
Parallel computing benchmark suite developed by NASA:
Benchmark Description Characteristics
─────────────────────────────────────────────────────────
EP Embarrassingly Parallel No communication
MG Multigrid Long/short range comm
CG Conjugate Gradient Irregular access
FT FFT All-to-all comm
IS Integer Sort Random access
LU LU decomposition Regular comm
SP Scalar Pentadiagonal Regular comm
BT Block Tridiagonal Regular comm
OSU Micro-Benchmarks
Measures MPI communication performance:
Measurement items:
- Point-to-point latency
- Point-to-point bandwidth
- Collective operations (AllReduce, Broadcast, etc.)
- One-sided operations
Typical results:
InfiniBand HDR:
Latency: ~1 μs
Bandwidth: ~200 Gb/s
Ethernet 100G:
Latency: ~5 μs
Bandwidth: ~100 Gb/s
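Latency and bandwidth figures like these are often combined into a simple estimate of message transfer time: latency plus message size divided by bandwidth. The sketch below uses that model, which is an assumption of this example rather than something the OSU suite reports, together with the approximate figures listed above.

```python
def transfer_time_us(message_bytes: int, latency_us: float, bandwidth_gbps: float) -> float:
    """Latency plus serialization time in microseconds (bandwidth in Gb/s)."""
    serialization_us = message_bytes * 8 / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + serialization_us


for size in (8, 1024 * 1024):
    infiniband = transfer_time_us(size, latency_us=1.0, bandwidth_gbps=200.0)
    ethernet = transfer_time_us(size, latency_us=5.0, bandwidth_gbps=100.0)
    print(f"{size:>8} bytes: InfiniBand {infiniband:.2f} us, Ethernet {ethernet:.2f} us")
```

Under this model an 8-byte message costs almost exactly the latency, while a 1 MiB message is dominated by bandwidth. That is why the benchmark suite measures both ends of the size range.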
Result Analysis and Reporting
Performance Analysis
When analyzing HPC benchmark results, consider:
1. Efficiency
Achieved performance / Theoretical peak
2. Scalability
How performance changes with node count
3. Bottleneck identification
- High HPL but low HPCG → Memory bottleneck
- Low Graph500 → Memory latency issues
- High OSU latency → Network problems
4. Energy efficiency
GFLOPS/W or GTEPS/W
Common Issues
1. Low HPL efficiency
- Problem size not large enough
- Block size not suitable
- Network becoming bottleneck
- BLAS library not optimized
2. Very low HPCG efficiency
- This is normal (typically 1-5%)
- Reflects memory system limitations
- Can try optimizing memory configuration
3. Poor Graph500 performance
- Memory latency is key
- NUMA configuration matters
- Consider using huge pages
Future of HPC Benchmarks
Emerging Benchmarks
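1. HPL-MxP (Mixed Precision)
- Solves the same dense system in low precision, then uses iterative refinement to recover FP64 accuracy
- Reflects AI hardware capabilities
- Introduced in 2019 as HPL-AI and later renamed HPL-MxP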
2. MLPerf HPC
- AI applications on HPC
- Scientific computing + machine learning
3. IO500
- Storage system performance
- Important for data-intensive applications
Trends
1. From FLOPS to multi-dimensional metrics
- Performance isn't just compute speed
- Memory, communication, energy efficiency all matter
2. Application-oriented
- Real applications more meaningful than synthetic benchmarks
- Mini-apps becoming trend
3. Energy efficiency first
- The Green500 list is growing in importance
- Power becoming design constraint
The End of Dr. Zhang's Story
Six months later, Dr. Zhang's research group completed their analysis. The "slower" new supercomputer wasn't slower after all—it just required different optimization.
The old system had high memory bandwidth per core, which matched their original code. The new system had more compute per core but less bandwidth, requiring them to restructure their algorithms to improve arithmetic intensity.
After optimization, the new system ran their climate simulation in 12 hours instead of 48—a 4× improvement. But they had to earn that improvement through months of algorithm work.
"LINPACK told us nothing about this," Dr. Zhang said at a department meeting. "The new machine had 10× the LINPACK score, but our speedup was 4× after significant effort. Someone using a different code might get 8×. Someone with an even more memory-bound code might get 0.5×."
"So what's the point of LINPACK?" a student asked.
"It's a common yardstick," Dr. Zhang replied. "It tells you something about the system, just not everything. The real lesson is that no single benchmark captures all aspects of performance. The TOP500 ranks supercomputers, but it doesn't rank how well they'll run your code."
Summary
HPC Benchmarks provide standard methods for evaluating supercomputer performance:
Major Benchmarks
- HPL/LINPACK: Dense linear algebra, TOP500 foundation
- HPCG: Sparse operations, closer to real applications
- Graph500: Graph analysis, data-intensive
- STREAM: Memory bandwidth
Key Insights
- High HPL efficiency doesn't mean high application efficiency
- HPCG efficiency is typically only 1-5%
- Different benchmarks measure different aspects
Practical Recommendations
- Use multiple benchmarks to evaluate systems
- Focus on efficiency rather than absolute performance
- Consider energy efficiency (Green500)
- Combine with application benchmarks for evaluation
Chapter 26: GPU Benchmarking
Part VII: AI/HPC
"The GPU is the new CPU." — Jensen Huang
The First Time Everything Looks Wrong
Carlos had been a CPU performance engineer for eight years. When his company pivoted to AI, he was assigned to optimize their training infrastructure. "How different can GPUs be?" he thought. "Cores are cores. Memory is memory."
His first profiling session was humbling.
He ran Intel VTune out of habit—it showed barely any useful data. He tried perf—the GPU was invisible. He looked at nvidia-smi and saw "GPU Utilization: 73%." Was that good? Bad? What was the other 27% doing?
When Carlos finally got Nsight Compute working, the metrics were alien. "SM Occupancy: 42%." "Warp Stall: Memory." "L2 Hit Rate: 31%." Nothing mapped to his mental model built on branch prediction, instruction-level parallelism, and cache hierarchies.
"I feel like I'm starting from scratch," he told his new teammate, Wei, a GPU specialist.
"You kind of are," Wei admitted. "But the good news is, the fundamentals transfer. You still care about memory access patterns, instruction throughput, and utilization. It's just that everything is scaled up by a factor of 1000, and the vocabulary is different."
This chapter will help you navigate that transition.
Why GPU Profiling Is Different
GPU performance analysis differs from CPU analysis in three fundamental ways:
1. Parallelism at Unprecedented Scale
When Carlos worked on CPUs, "high parallelism" meant 64 threads across 32 cores. A GPU like the NVIDIA H100 can keep over 270,000 threads resident at once (132 SMs × 2,048 threads per SM). The optimization strategies that work at 64-thread scale, such as careful load balancing and avoiding contention, become both more critical and harder to reason about at 270,000-thread scale.
| System | Parallel Units | Threads | Ratio |
|---|---|---|---|
| Xeon 8490H | 60 cores | 120 threads | 1x |
| H100 SXM5 | 132 SMs | 270,336 threads | 2,250x |
2. Memory Bandwidth Dominates
CPUs are designed for low-latency access to small amounts of data. GPUs are designed for high-bandwidth access to large amounts of data. The H100 delivers roughly 3 TB/s of memory bandwidth, about 10x that of a high-end server CPU with eight DDR5 channels (around 300 GB/s). But that bandwidth is shared across all 270,000 threads, so per-thread bandwidth is actually lower: about 11 MB/s per GPU thread versus about 2.5 GB/s per CPU hardware thread.
3. Different Execution Model
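CPUs execute threads independently. GPUs execute threads in groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads depending on the architecture). All threads in a warp execute the same instruction at the same time. When threads in a warp diverge (take different branches), the hardware runs each path in turn with the non-participating threads masked off, so throughput can drop sharply.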
Memory Hierarchy
GPU Memory Hierarchy:
┌─────────────────────────────────────────────────────────┐
│ HBM (80 GB) │
│ ~3 TB/s │
├─────────────────────────────────────────────────────────┤
│ L2 Cache (50 MB) │
│ ~12 TB/s │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Shared Mem │ │ Shared Mem │ │ Shared Mem │ ... │
│ │ (228 KB) │ │ (228 KB) │ │ (228 KB) │ │
│ │ ~20 TB/s │ │ ~20 TB/s │ │ ~20 TB/s │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Registers │ │ Registers │ │ Registers │ ... │
│ │ (256 KB) │ │ (256 KB) │ │ (256 KB) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
Execution Model
CUDA Execution Model:
Grid
└── Block (up to 1024 threads)
└── Warp (32 threads, SIMT)
└── Thread
Key concepts:
- Warp is the minimum scheduling unit
- Threads in same warp execute same instruction
- Warp divergence reduces efficiency
CUDA Performance Analysis Tools
nvidia-smi
The most basic GPU monitoring tool:
# Real-time monitoring
nvidia-smi
# Continuous monitoring (update every second)
nvidia-smi -l 1
# Query specific metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
--format=csv -l 1
Output example:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA H100 80GB On | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 115W / 700W | 1024MiB / 81559MiB | 45% Default |
+-------------------------------+----------------------+----------------------+
Nsight Compute
Modern CUDA kernel analysis tool:
# Basic analysis
ncu ./my_cuda_app
# Detailed analysis (all sections)
ncu --set full ./my_cuda_app
# Analyze specific kernel
ncu --kernel-name "myKernel" ./my_cuda_app
# Output report
ncu -o report ./my_cuda_app
Nsight Systems
System-level profiling:
# Basic trace
nsys profile ./my_cuda_app
# Include CUDA API and kernels
nsys profile --trace=cuda,nvtx ./my_cuda_app
# Output report
nsys profile -o timeline ./my_cuda_app
Key Performance Metrics
SM Occupancy
Occupancy = Active Warps / Maximum Warps per SM
Influencing factors:
- Registers used per thread
- Shared memory used per block
- Block size
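The way these three factors limit occupancy can be sketched as a small calculator. The per-SM limits below are assumptions roughly matching an H100-class GPU, and the sketch ignores register and shared-memory allocation granularity, so treat its output as an upper bound rather than what Nsight Compute would report.

```python
import math

# Assumed per-SM limits (H100-class); check the values for your own device.
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 32
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 228 * 1024
WARP_SIZE = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Resident warps per SM divided by the maximum, limited by the scarcest resource."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    limits = [
        MAX_BLOCKS_PER_SM,
        MAX_WARPS_PER_SM // warps_per_block,
        REGISTERS_PER_SM // (regs_per_thread * warps_per_block * WARP_SIZE),
    ]
    if smem_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)
    blocks_per_sm = min(limits)
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM


print(theoretical_occupancy(256, 32, 0))          # registers and warps both allow 8 blocks
print(theoretical_occupancy(256, 64, 0))          # doubling registers halves resident blocks
print(theoretical_occupancy(256, 32, 48 * 1024))  # shared memory becomes the limit
```

Whichever resource runs out first sets the number of resident blocks, which is why reducing register use only helps when registers are the binding limit.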
### Tensor Core Generations
```text
Generation GPU Supported Precision Matrix Size
─────────────────────────────────────────────────────────────────
V1 (Volta) V100 FP16 4×4×4
V2 (Turing) RTX 20xx FP16, INT8, INT4 8×8×4
V3 (Ampere) A100 FP16, BF16, TF32, 8×8×4
FP64, INT8
V4 (Hopper) H100 FP16, BF16, TF32, 16×8×16
FP8, FP64, INT8
```
Tensor Core Efficiency
Conditions for high Tensor Core efficiency:
1. Matrix dimension alignment
- m, n, k should be multiples of 8 or 16
- Depends on precision and GPU generation
2. Memory alignment
- Matrix start address aligned to 16 bytes
- Leading dimension aligned
3. Sufficient parallelism
- Need enough tiles to fill GPU
- Small matrices have low efficiency
Typical efficiency:
Large matrices (4096+): 80-95% of peak
Medium matrices (1024): 50-80% of peak
Small matrices (256): 20-50% of peak
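Two small helpers make the alignment and efficiency ideas concrete: one pads a dimension up to the next multiple, the other converts a measured GEMM time into a fraction of peak. This is a sketch with hypothetical names; the 0.2 ms timing and the 1,000 TFLOPS peak are made-up inputs, not measurements.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def gemm_efficiency(m: int, n: int, k: int, seconds: float, peak_tflops: float) -> float:
    """Fraction of peak achieved by an (m x k) @ (k x n) multiply."""
    flops = 2.0 * m * n * k          # one multiply and one add per inner-product term
    return flops / seconds / 1e12 / peak_tflops


# Pad an awkward dimension to the next multiple of 8.
print(round_up(1001, 8))

# Hypothetical timing: a 4096^3 GEMM in 0.2 ms on hardware with 1,000 TFLOPS peak.
print(f"{gemm_efficiency(4096, 4096, 4096, 2e-4, 1000.0):.1%}")
```

Padding wastes a little arithmetic on zeros, but it is usually cheaper than falling off the Tensor Core path.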
Practical Profiling Workflow
Step 1: System-Level Analysis
# Use Nsight Systems for overall view
nsys profile -o overview ./my_app
# View report
nsys-ui overview.nsys-rep
Identify:
- CPU vs GPU time distribution
- Kernel execution time
- Data transfer overhead
- Synchronization waits
Step 2: Identify Hotspots
From Nsight Systems report, find:
1. Longest-running kernels
2. Frequently called kernels
3. Data transfer bottlenecks
4. Unnecessary synchronization
Step 3: Deep Analysis
# Detailed analysis of specific kernel
ncu --set full \
--kernel-name "hotKernel" \
-o detailed_report \
./my_app
Step 4: Bottleneck Diagnosis
Nsight Compute provides bottleneck analysis:
Memory Bound:
- High memory throughput
- Low compute throughput
- Solution: Optimize access patterns, use shared memory
Compute Bound:
- High compute throughput
- Low memory throughput
- Solution: Algorithm optimization, use Tensor Cores
Latency Bound:
- Low occupancy
- High stall percentage
- Solution: Increase parallelism, reduce dependencies
Instruction Bound:
- Instruction issue becomes bottleneck
- Solution: Reduce instruction count, use vectorization
Common Performance Issues
Warp Divergence
// Problematic code: expensive_function_a/b stand for any two costly __device__ functions
__global__ void divergent(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // Guard: the last block may extend past the array
    if (idx % 2 == 0) {
        // Even threads take this path
        data[idx] = expensive_function_a(data[idx]);
    } else {
        // Odd threads take this path
        data[idx] = expensive_function_b(data[idx]);
    }
}
// Even and odd threads of the same warp take different branches,
// so the warp executes both paths one after the other
Memory Coalescing
// Problem: Non-coalesced access
__global__ void strided(float* data, int stride) {
int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
data[idx] = 1.0f; // Strided access, low efficiency
}
// Solution: Coalesced access
__global__ void coalesced(float* data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = 1.0f; // Contiguous access, high efficiency
}
Bank Conflicts
// Problem: Shared memory bank conflict
__shared__ float smem[32][32];
// Each thread reads column 0 of its own row: addresses are 32 floats apart,
// so all 32 threads of the warp hit the same bank
float val = smem[threadIdx.x][0]; // 32-way bank conflict
// Solution: Padding
__shared__ float smem_padded[32][33]; // Add one column of padding
float val_padded = smem_padded[threadIdx.x][0]; // Rows are 33 floats apart: no conflict
Low Occupancy
Diagnosis:
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app
Causes and solutions:
1. Too many registers used
- Reduce per-thread variables
- Use __launch_bounds__
2. Too much shared memory used
- Reduce shared memory size
- Use dynamic allocation
3. Unsuitable block size
- Adjust block dimensions
- Ensure multiple of 32
Carlos's First Optimization Win
Three months into his GPU journey, Carlos achieved his first major optimization win.
The team's attention kernel was running at 35% of theoretical peak. After careful analysis with Nsight Compute, he identified the bottleneck: bank conflicts in shared memory during the softmax computation.
"In CPU terms," he explained to the team, "it's like having eight cores all trying to access the same cache line. On a GPU, that's 32 threads fighting for the same memory bank, and they have to take turns."
He restructured the data layout to eliminate the conflicts. The kernel jumped to 68% of peak—nearly doubling throughput.
"The weird thing," Carlos admitted, "is that once you understand the GPU execution model, this stuff becomes obvious. The tools show you exactly where the problem is. The hard part is learning to read what they're telling you."
Wei nodded. "Welcome to GPU performance. The tools are powerful. The bottlenecks are visible. The optimization strategies are well-documented. You just have to unlearn your CPU instincts first."
GPU benchmarking isn't harder than CPU benchmarking—it's different. The same principles apply: measure before optimizing, understand the hardware model, and let data guide your decisions. The vocabulary and tools are new, but the discipline is the same.
Summary
GPU benchmarking requires understanding the GPU's architecture:
Key Tools
- nvidia-smi: Basic monitoring
- Nsight Systems: System-level analysis
- Nsight Compute: Kernel-level analysis
- rocprof/Omniperf: AMD GPU
Key Metrics
- Occupancy: Warp utilization
- Memory Throughput: Memory bandwidth utilization
- Compute Throughput: Compute unit utilization
- Tensor Core Utilization: Matrix operation efficiency
Common Bottlenecks
- Warp divergence
- Non-coalesced memory access
- Bank conflicts
- Low occupancy
Best Practices
- First use Nsight Systems for overall view
- Identify hotspot kernels
- Use Nsight Compute for deep analysis
- Choose optimization strategy based on bottleneck type
Chapter 27: LLM Performance Analysis
Part VII: AI/HPC
"The best way to predict the future is to invent it." — Alan Kay
The Mystery of the Stuttering Chatbot
Priya's team had deployed their LLM-powered customer service chatbot. The initial demo went great—responses were fast and coherent. But in production, users started complaining.
"Sometimes it takes forever to start responding," one support ticket read. "And then when it does start, the words come out in bursts—fast for a bit, then pause, then fast again."
Priya looked at the metrics dashboard. Average response latency: 2.3 seconds. That seemed fine. But the P99 was 8.7 seconds, and users were experiencing something the dashboard couldn't capture: the feeling of waiting.
She started instrumenting more carefully. The "slow start" issue was easy to identify: long prompts (users pasting entire documents) caused extended prefill times. But the "stuttering" was mysterious. The tokens-per-second metric looked stable.
After a week of investigation, she found the cause: garbage collection in the KV cache management code. Every 50 tokens or so, the system would pause to clean up old cache entries. Each pause was only 100ms, but users noticed it as unnatural hesitation.
"Traditional latency metrics don't capture this," she realized. "LLM inference isn't like a web request. Users experience it as a conversation, and conversations have rhythm."
This chapter explores the unique performance characteristics of LLM inference—characteristics that traditional benchmarking approaches fail to capture.
Why LLM Inference Is Different
LLM inference differs from traditional AI inference in fundamental ways that affect every aspect of performance analysis.
Autoregressive Generation: One Token at a Time
When you ask a vision model to classify an image, it produces the answer in one forward pass. But when you ask an LLM to write a paragraph, it generates that paragraph one token at a time—each token requiring a complete forward pass through the model.
Traditional AI: Input → Model → Complete Output (one pass)
LLM Inference: Input → Model → Token 1
Model → Token 2
Model → Token 3
...
Model → Token N
A 100-token response requires 100 forward passes. This fundamentally changes the performance characteristics.
Two Distinct Phases with Different Bottlenecks
LLM inference has two phases with completely different performance profiles:
Prefill Phase (processing the input prompt):
- Processes all input tokens in parallel
- Compute-intensive: GPU tensor cores are busy
- Latency scales with prompt length
- Generates the initial KV cache
Decode Phase (generating the output):
- Generates one token at a time
- Memory-intensive: waiting for data transfer
- Latency is relatively constant per token
- Reads and updates the KV cache
This split creates a key insight: TTFT (Time To First Token) and TPS (Tokens Per Second) are largely independent metrics because they're dominated by different phases.
Why Decode Is Memory-Bound
During the decode phase, something counterintuitive happens: a 70-billion parameter model, running on hardware capable of 2000 TFLOPS, achieves only a tiny fraction of its theoretical compute performance.
The reason is arithmetic intensity. To generate one token, you must:
- Read 70 billion parameters from memory (~140 GB at FP16)
- Read the KV cache (potentially tens of GB more)
- Perform matrix operations for exactly 1 token
- Write the new KV cache entry
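The computation is minimal compared to the data movement. Even with 3 TB/s memory bandwidth (H100), reading 140 GB takes ~47 ms. That's your per-token latency floor for single-user inference. (A 140 GB model does not fit in one 80 GB H100; sharding it across several GPUs raises the aggregate bandwidth and lowers the floor proportionally, but the decode step remains bandwidth-bound.)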
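The latency floor is a one-line calculation. The sketch below reproduces it and shows how the floor moves with bytes per parameter; it counts only weight traffic and ignores KV cache reads, so real latency will be higher. The function name and the precision table are illustrative.

```python
def decode_latency_floor_ms(params_billion: float, bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Time to stream all weights once from memory, in milliseconds."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_tb_s * 1e12) * 1e3


for label, bytes_per_param in (("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)):
    floor_ms = decode_latency_floor_ms(70, bytes_per_param, 3.0)
    print(f"{label}: {floor_ms:.1f} ms/token, at most {1000 / floor_ms:.1f} tokens/s")
```

The compute capability of the GPU does not appear anywhere in this estimate, which is the point: for single-request decode, bandwidth sets the limit.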
Core Performance Metrics
TTFT (Time To First Token)
TTFT = Time from receiving request to outputting first token
Components:
TTFT = Network latency + Queue time + Prefill time
Influencing factors:
- Prompt length (primary)
- Model size
- Hardware performance
- System load
Typical values (7B model, single GPU):
Prompt 100 tokens: 50-100 ms
Prompt 1000 tokens: 200-500 ms
Prompt 4000 tokens: 500-2000 ms
TPS / Throughput
TPS = Tokens Per Second (tokens generated per second)
Two definitions:
1. Single-request TPS: Generation speed for one request
2. System TPS: Total throughput of entire system
Single-request TPS (7B model, single GPU):
FP16: 30-50 tokens/sec
INT8: 50-80 tokens/sec
INT4: 80-120 tokens/sec
System TPS (depends on batch size):
Batch 1: 30-50 tokens/sec
Batch 32: 500-1000 tokens/sec
Batch 128: 1500-3000 tokens/sec
TPOT (Time Per Output Token)
TPOT = Time to generate each output token
= 1 / Single-request TPS
TPOT determines user-perceived "typing speed"
Typical values:
Fast (good experience): < 50 ms/token
Acceptable: 50-100 ms/token
Slow (poor experience): > 100 ms/token
End-to-End Latency
Total Latency = TTFT + (N × TPOT)
Where N = number of output tokens
Example:
TTFT = 100 ms
TPOT = 30 ms
N = 100 tokens
Total = 100 + (100 × 30) = 3100 ms = 3.1 seconds
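In practice these metrics are derived from timestamps recorded as tokens stream out. A minimal sketch follows, assuming you have the request arrival time and one timestamp per emitted token; the function name and the sample timestamps are hypothetical. It attributes the first token to TTFT, so TPOT is averaged over the remaining N - 1 tokens, which differs negligibly from the formula above for long outputs.

```python
def summarize_request(request_time, token_times):
    """Derive TTFT, mean TPOT, and total latency (all in seconds) for one request."""
    ttft = token_times[0] - request_time
    total = token_times[-1] - request_time
    later_tokens = len(token_times) - 1
    tpot = (total - ttft) / later_tokens if later_tokens else 0.0
    return {"ttft": ttft, "tpot": tpot, "total": total}


# Hypothetical trace: first token after 100 ms, then one token every 30 ms.
stats = summarize_request(0.0, [0.10, 0.13, 0.16, 0.19])
print({name: round(value * 1000) for name, value in stats.items()})
```

Recording per-token timestamps rather than only the start and end of a response is what makes the later analysis of pauses between tokens possible.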
KV Cache
What Is KV Cache
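The KV cache is the most important optimization technique in LLM inference: it stores the attention keys and values already computed for earlier tokens, so each decode step computes them only for the newest token instead of recomputing the whole sequence.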
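The cache is not free: every token in the context keeps one key vector and one value vector per layer. A rough size estimate can be sketched as below. The configuration in the example (32 layers, 32 KV heads, head dimension 128, FP16) is what a Llama-2-7B-class model uses; treat it as an assumption and substitute your model's values.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added for each token in the context."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token / 1024, "KiB per token")
print(per_token * 4096 / 2**30, "GiB for a 4096-token sequence")
```

At half a mebibyte per token, a few dozen concurrent long conversations can consume more GPU memory than the model weights, which is why cache management dominates serving design.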
vLLM and PagedAttention
PagedAttention Principle
PagedAttention, introduced by vLLM, addresses the KV cache fragmentation problem:
Traditional approach:
Pre-allocate maximum sequence length KV Cache for each request
Causes significant waste
PagedAttention:
Divide KV Cache into fixed-size "pages"
Allocate pages on demand
Similar to OS virtual memory
Effect:
- KV cache memory utilization rises from ~50% to ~95%
- Can serve more concurrent requests
- Supports longer sequences
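The on-demand allocation idea can be sketched as a toy block allocator. This is not vLLM's implementation: the class and method names are hypothetical, blocks hold only a count rather than real tensors, and an exhausted pool simply raises an error instead of swapping or preempting.

```python
class PagedKVCache:
    """Toy allocator: each request grows one fixed-size block at a time."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:   # no room left in the last block
            table.append(self.free_blocks.pop())     # raises IndexError when out of blocks
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")      # 5 tokens need 2 blocks of 4
cache.append_token("b")          # 1 token needs 1 block
print(cache.block_tables, len(cache.free_blocks))
cache.release("a")
print(len(cache.free_blocks))
```

A request never holds more than one partially filled block, so waste is bounded by the block size rather than by the maximum sequence length.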
Using vLLM
# Install
pip install vllm
# Start API server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--tensor-parallel-size 1
# Use OpenAI-compatible API
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "Hello, world!",
"max_tokens": 100
}'
vLLM Performance Tuning
Key parameters:
--gpu-memory-utilization 0.9
GPU memory usage ratio (default 0.9)
--max-num-seqs 256
Maximum concurrent sequences
--max-num-batched-tokens 8192
Maximum tokens per iteration
--block-size 16
KV Cache page size
--swap-space 4
CPU swap space (GB)
Other LLM Inference Frameworks
TensorRT-LLM
NVIDIA's high-performance LLM inference framework:
Features:
- Deeply optimized CUDA kernels
- Supports Tensor Parallelism
- Supports In-flight Batching
- Integrates with Triton Inference Server
Performance:
Usually 10-30% faster than vLLM (scenario dependent)
Text Generation Inference (TGI)
Hugging Face's inference framework:
Features:
- Easy to use
- Supports multiple models
- Built-in Continuous Batching
- Docker deployment friendly
llama.cpp
LLM inference for CPU and edge devices:
Features:
- Pure C/C++ implementation
- Supports quantization (GGUF format)
- Runs on CPU, Apple Silicon
- Low memory footprint
Performance Optimization Strategies
Batching Strategies
1. Static Batching
- Fixed batch size
- Wait for batch to fill or timeout
- Simple but inefficient
2. Continuous Batching
- Dynamically add/remove requests
- Don't wait for batch to fill
- Higher GPU utilization
3. In-flight Batching
- NVIDIA's name for continuous batching in TensorRT-LLM
- New requests join the running batch between decode iterations
- Maximize throughput
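The difference between static and continuous batching shows up clearly in a small simulation that counts decode iterations. This is a simplified model with assumptions stated up front: all requests are waiting at time zero, prefill cost is ignored, and each iteration produces one token for every running request. The function names and output lengths are made up.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(output_lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                                        # one decode iteration
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [10, 200, 15, 20, 180, 12, 25, 30]
print("static:", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With static batching, short requests sit idle in their slots while the longest one finishes. Continuous batching refills those slots immediately, so the total is set by the longest request rather than by the sum of per-batch maxima.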
Quantization
Quantization impact on LLM:
Precision Model Size Memory Traffic TPS Improvement
─────────────────────────────────────────────────
FP16 1.0x 1.0x 1.0x
INT8 0.5x 0.5x ~1.5-2x
INT4 0.25x 0.25x ~2-3x
Note:
- Decode is memory-bound
- Reducing data directly improves performance
- But may affect output quality
Speculative Decoding
Principle:
Use small model to "guess" multiple tokens
Verify with large model
If correct, accept multiple tokens at once
Effect:
- Can improve TPS by 2-3x
- Doesn't affect output quality
- Requires additional small model
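The draft-and-verify loop can be sketched with toy stand-ins for the two models. Everything here is an assumption made for illustration: `target_next` and `draft_next` are simple arithmetic functions rather than neural networks, and verification uses greedy token matching, whereas production systems verify sampled tokens with a rejection-sampling rule.

```python
def target_next(tokens):
    """Stand-in for the large model: next token is the running sum modulo 7."""
    return sum(tokens) % 7


def draft_next(tokens):
    """Stand-in for the small model: disagrees whenever the sum is divisible by 5."""
    total = sum(tokens)
    return (total + 1) % 7 if total % 5 == 0 else total % 7


def greedy_decode(prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, num_tokens, draft_len=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The draft model proposes draft_len tokens, one after another.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each position; a real system does this
        #    for all positions in a single batched forward pass.
        target_passes += 1
        for guess in proposal:
            correct = target_next(tokens)
            tokens.append(correct)        # always keep the target's token
            if guess != correct:
                break                     # discard the rest of the proposal
    return tokens[len(prompt):len(prompt) + num_tokens], target_passes


output, passes = speculative_decode([1, 2], 20)
print(output == greedy_decode([1, 2], 20), passes)
```

Because the target's token is always the one kept, the output matches ordinary decoding exactly; the gain is that several tokens can be accepted per expensive target pass when the draft guesses well.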
Priya's Dashboard, Revisited
After her investigation, Priya rebuilt the monitoring dashboard for the chatbot. The new version tracked metrics that actually mattered for user experience:
- TTFT distribution (not just average, but P50/P95/P99)
- ITL (inter-token latency) histogram (to catch stuttering)
- KV cache memory pressure (to predict when GC would trigger)
- Prefill queue depth (to predict TTFT spikes)
She also added a "smoothness score"—a custom metric that penalized high ITL variance, even when average TPS looked fine.
"The old dashboard said everything was fine," she told her manager. "The new one would have caught the stuttering issue on day one."
The lesson she learned applies beyond LLMs: the right metrics depend on the user experience you're trying to deliver. For a batch processing system, average throughput might be enough. For an interactive chatbot, you need to measure what users actually feel—and that means understanding the unique characteristics of your workload.
LLM inference isn't just "AI inference with more parameters." It's a fundamentally different workload with its own performance model, its own bottlenecks, and its own metrics. Master those, and you can build systems that don't just perform well on benchmarks—they feel fast to users.
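The story does not specify how Priya's smoothness score was computed. One hypothetical way to build such a metric is to penalize the coefficient of variation of inter-token latency, as sketched below; the formula, the function names, and the synthetic traces are all assumptions made for this example.

```python
import statistics


def inter_token_latencies(token_times):
    """Gaps between consecutive token timestamps."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def smoothness_score(token_times):
    """1.0 for perfectly even spacing, falling toward 0 as spacing becomes erratic."""
    itl = inter_token_latencies(token_times)
    mean = statistics.fmean(itl)
    cv = statistics.pstdev(itl) / mean      # coefficient of variation
    return 1.0 / (1.0 + cv)


# Timestamps in milliseconds. Steady stream: one token every 30 ms.
steady = list(range(0, 1500, 30))

# Stuttering stream: same 30 ms pace, plus a 100 ms pause at every tenth token.
stutter = [0]
for i in range(1, 51):
    stutter.append(stutter[-1] + (130 if i % 10 == 0 else 30))

print(smoothness_score(steady), round(smoothness_score(stutter), 3))
```

The stuttering trace averages 40 ms per token, which a mean-based dashboard would report as fast, yet its score drops well below the steady trace because the variance captures the pauses.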
Summary
LLM performance analysis has its own metrics and challenges:
Core Metrics
- TTFT: First token latency
- TPS: Generation speed
- TPOT: Time per token
- ITL: Inter-token latency
Key Technologies
- KV Cache: Avoid redundant computation
- PagedAttention: Solve memory fragmentation
- Continuous Batching: Improve throughput
- Quantization: Reduce memory requirements
Inference Frameworks
- vLLM: PagedAttention, high throughput
- TensorRT-LLM: NVIDIA optimized
- TGI: Hugging Face, easy to use
- llama.cpp: CPU/edge devices
Optimization Directions
- Prefill: Compute optimization (Tensor Cores)
- Decode: Memory optimization (quantization, KV Cache)
- System: Batching strategies
Chapter 28: ML Compilers and Runtime
Part VII: AI/HPC
"A compiler is a program that translates a program written in one language into a program written in another language." — Alfred Aho
The 50x Speedup That Came from Nowhere
Raj was benchmarking different inference backends for his computer vision model. The model was straightforward—a ResNet-50 for image classification. He expected small differences between backends, maybe 10-20%.
The first run with vanilla PyTorch: 15ms per image.
With TorchScript: 12ms. A modest 20% improvement.
Then he tried TensorRT: 0.3ms.
Raj stared at the numbers. Fifty times faster? He ran it again. Same result. He checked the accuracy—identical to within floating-point tolerance.
"How is this possible?" he asked his colleague Ming, who had experience with ML compilers. "It's the same model. Same GPU. Same input."
Ming smiled. "You just discovered why ML compilers exist. PyTorch is designed for flexibility and debugging. It executes operations one at a time, with Python overhead between each one. TensorRT analyzes the entire graph, fuses operations together, chooses optimal kernel implementations, and preallocates all memory. The math is identical, but the execution is completely different."
"But why doesn't everyone just use TensorRT then?"
"Trade-offs. TensorRT compilation can take 20 minutes. It doesn't support every operation. It's harder to debug. And if your model changes frequently, recompiling is painful. ML compilers are powerful, but they're not free."
This chapter explores the world of ML compilers—how they achieve dramatic speedups, what trade-offs they involve, and how to benchmark systems that use them.
Why ML Compilers Exist
When you train a model in PyTorch or TensorFlow, the framework prioritizes flexibility. Each operation is dispatched independently. Gradient computation is tracked automatically. You can stop, inspect, and modify execution at any point.
This flexibility is essential for research but disastrous for production inference. Every layer of abstraction adds overhead. Every dynamic dispatch adds latency. Every flexibility feature you're not using is still costing you.
ML compilers bridge this gap: they take a high-level model description and produce optimized code for specific hardware, eliminating the flexibility overhead in exchange for performance.
The Complexity They Hide
Consider what's required to run a simple convolutional neural network efficiently:
- Operation Fusion: A Conv → BatchNorm → ReLU sequence should be executed as a single fused kernel, not three separate operations
- Memory Layout: Should the tensor be stored as NCHW or NHWC? Different hardware prefers different layouts
- Precision Selection: Which layers can use FP16? Which need FP32? Where should quantization happen?
- Kernel Selection: For a 3×3 convolution with batch size 32, which of the 47 available kernel implementations is fastest?
- Memory Planning: How should intermediate activations be allocated to minimize fragmentation?
Multiply this by dozens of hardware targets, hundreds of possible operators, and millions of possible configurations. No human can manually optimize this. ML compilers make it tractable.
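Operation fusion, the first item in the list above, can be shown on a scalar stand-in for Conv → BatchNorm → ReLU. At inference time batch normalization is a fixed affine transform, so it can be folded into the preceding layer's weight and bias ahead of time. The sketch below uses a single scalar weight in place of a convolution, and all parameter values are arbitrary.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (z - mean) / sqrt(var + eps) + beta into z = weight * x + bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    z = weight * x + bias                                   # "convolution" (scalar stand-in)
    z = gamma * (z - mean) / math.sqrt(var + eps) + beta    # batch norm in inference mode
    return max(z, 0.0)                                      # ReLU


def fused(x, folded_weight, folded_bias):
    """One multiply-add and one max: the three operations collapsed into one."""
    return max(folded_weight * x + folded_bias, 0.0)


layer = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
folded_weight, folded_bias = fold_batchnorm(**layer)
for x in (-2.0, 0.0, 1.5, 3.0):
    print(x, unfused(x, **layer), fused(x, folded_weight, folded_bias))
```

The fused version produces the same values to within floating-point rounding while reading and writing the activation once instead of three times, which is where the speedup comes from on real hardware.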
What ML Compilers Do
Input: High-level model description (PyTorch, TensorFlow, ONNX)
↓
┌─────────────────────────────────────────────────────────┐
│ Frontend │
│ - Parse model │
│ - Build computation graph │
│ - Type inference │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Graph Optimization │
│ - Operator fusion │
│ - Constant folding │
│ - Dead code elimination │
│ - Layout transformation │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Backend │
│ - Hardware-specific optimization │
│ - Memory planning │
│ - Code generation │
└─────────────────────────────────────────────────────────┘
↓
Output: Optimized executable program
TVM (Apache TVM)
TVM is one of the best-known open-source ML compilers.
TVM Architecture
┌─────────────────────────────────────────────────────────┐
│ Relay (High-level IR) │
│ - Functional IR │
│ - Supports dynamic shapes │
│ - Graph-level optimization │
├─────────────────────────────────────────────────────────┤
│ TIR (Tensor IR) │
│ - Low-level IR │
│ - Loop representation │
│ - Hardware mapping │
├─────────────────────────────────────────────────────────┤
│ Runtime │
│ - Cross-platform execution │
│ - Memory management │
│ - Device abstraction │
└─────────────────────────────────────────────────────────┘
Using TVM
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor
# 1. Load ONNX model
onnx_model = onnx.load("model.onnx")
# 2. Convert to Relay IR
mod, params = relay.frontend.from_onnx(onnx_model)
# 3. Set target hardware
target = tvm.target.Target("cuda")
# 4. Compile
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
# 5. Execute
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
# Assumption: the model has one input named "input" with shape (1, 3, 224, 224)
input_data = np.random.rand(1, 3, 224, 224).astype("float32")
module.set_input("input", input_data)
module.run()
output = module.get_output(0)
ONNX Runtime
ONNX Runtime is a cross-platform inference engine developed by Microsoft.
ONNX Runtime Features
Advantages:
- Wide hardware support
- Mature and stable
- Easy to integrate
- Supports multiple Execution Providers
Execution Providers:
- CPU (default)
- CUDA
- TensorRT
- DirectML
- OpenVINO
- CoreML
- NNAPI
Using ONNX Runtime
import onnxruntime as ort
import numpy as np
# Create session
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
# Prepare input
XLA (Accelerated Linear Algebra)
XLA is an ML compiler developed by Google, primarily used for TensorFlow and JAX.
XLA Features
Design goals:
- Automatically optimize TensorFlow/JAX programs
- Support TPU
- JIT and AOT compilation
Main optimizations:
- Operator fusion
- Memory optimization
- Parallelization
Using XLA in TensorFlow
import tensorflow as tf
# Method 1: Use jit_compile (w and b are tf.Variable weights defined elsewhere)
@tf.function(jit_compile=True)
def model_fn(x):
    return tf.nn.relu(tf.matmul(x, w) + b)
# Method 2: Enable globally
tf.config.optimizer.set_jit(True)
Using XLA in JAX
import jax
import jax.numpy as jnp
# JAX uses XLA by default
@jax.jit
def model_fn(x, w, b):
    return jax.nn.relu(jnp.dot(x, w) + b)
# Execute (automatically compiled)
result = model_fn(x, w, b)
Performance Comparison
Benchmark Setup
Model: ResNet-50
Hardware: NVIDIA A100
Batch Size: 1, 8, 32
Precision: FP32, FP16
Typical Results
Framework/Compiler Batch=1 Batch=8 Batch=32
─────────────────────────────────────────────────────
PyTorch (eager) 5.2 ms 8.1 ms 18.5 ms
PyTorch (compile) 3.8 ms 5.2 ms 12.1 ms
ONNX Runtime (CUDA) 3.5 ms 4.8 ms 11.2 ms
TensorRT 2.1 ms 3.2 ms 7.8 ms
TVM (tuned) 2.4 ms 3.5 ms 8.5 ms
Note: Actual values depend on specific configuration
Selection Guide
Scenario Recommendation
─────────────────────────────────────────────────────
Rapid prototyping PyTorch eager
Production (NVIDIA) TensorRT
Cross-platform ONNX Runtime
Edge devices TVM, IREE
TPU XLA (JAX)
Research/experiments PyTorch compile
Common Optimization Techniques
Operator Fusion
Before optimization:
x = Conv(input)
y = BatchNorm(x)
z = ReLU(y)
3 memory read/writes
After optimization:
z = FusedConvBNReLU(input)
1 memory read/write
Effect: Reduces memory bandwidth requirements
Constant Folding
Before optimization:
a = Constant(2)
b = Constant(3)
c = Add(a, b)
y = Mul(x, c)
After optimization:
y = Mul(x, 5)
Effect: Reduces runtime computation
Layout Transformation
Different hardware prefers different data layouts:
CPU: NCHW (batch, channel, height, width)
GPU: NCHW or NHWC
TPU: NHWC
ML Compilers automatically insert necessary transformations
and try to minimize the number of conversions
Memory Planning
Problem:
Intermediate results need memory
How to minimize total memory usage?
Solution:
Analyze tensor lifetimes
Reuse memory that's no longer needed
Similar to compiler register allocation
The Compiler That Saved the Project
Remember Aisha's edge deployment problem? After two weeks of manual optimization, she was still 40% short of her latency target.
Then she tried TVM with auto-tuning. She let it run overnight on a representative workload.
The next morning, she had a model that met her latency target with room to spare. The auto-tuner had found optimizations she never would have discovered manually—unusual tile sizes, unexpected operator fusion patterns, memory layouts that seemed counterintuitive but worked perfectly for her specific hardware.
"I spent two weeks doing what the compiler did in eight hours," she admitted. "And it did it better."
But she also learned the limits. When she tried to deploy the same model on a slightly different chip variant, the auto-tuned schedule performed poorly. She had to re-tune for the new target.
"ML compilers aren't magic," she concluded. "They're tools that trade tuning time for performance. For production deployment on known hardware, they're invaluable. For rapid prototyping across many targets, they might slow you down."
The lesson: ML compilers represent a fundamental shift in how we think about optimization—from hand-crafted expertise to automated search. But like any tool, knowing when to use them is as important as knowing how.
Summary
ML Compilers are key technology for modern AI deployment:
Main Tools
- TVM: Open source, auto-tuning, cross-platform
- IREE: Lightweight, suitable for edge devices
- ONNX Runtime: Mature, stable, easy to integrate
- XLA: Backend for TensorFlow/JAX
Core Optimizations
- Operator fusion
- Constant folding
- Layout transformation
- Memory planning
Selection Considerations
- Target hardware
- Performance requirements
- Development efficiency
- Maintenance cost
Performance Analysis
- Use each framework's profiling tools
- Compare performance across compilers
- Consider tuning time vs performance gain
Chapter 29: Edge AI Performance
Part VII: AI/HPC
"The future of AI is at the edge." — Pete Warden
When the Cloud Isn't an Option
Elena was developing a hearing aid that used AI to separate speech from background noise. The algorithm worked beautifully in the lab—running on a workstation with an RTX 4090.
Now she had to make it run on a device the size of a fingernail, powered by a battery smaller than a watch battery, with less computing power than a 1990s calculator.
"This model needs 50 GFLOPS," her signal processing colleague said, reviewing the specs. "The chip provides 0.5 GFLOPS. That's a 100x gap."
"And it needs to run in real-time," Elena added. "15ms latency max, or users will notice the delay between lip movement and audio. Oh, and the battery needs to last 16 hours."
This is edge AI in a nutshell: the same intelligence that runs on data center GPUs, compressed into devices that run on milliwatts. The constraints seem impossible until you realize that millions of such devices ship every month.
This chapter explores performance analysis for edge AI—where the rules are different, the constraints are brutal, and traditional GPU-centric thinking will lead you astray.
A Different World of Constraints
Edge AI operates under constraints that would seem absurd to cloud engineers:
| Resource | Cloud (A100 GPU) | Edge (Typical MCU) | Ratio |
|---|---|---|---|
| Memory | 80 GB | 256 KB - 1 MB | 80,000 - 320,000x |
| Compute | 312 TFLOPS | 1-100 MOPS | 3M - 300M x |
| Power | 400W | 10mW - 1W | 400 - 40,000x |
| Cost | $10,000+ | $1 - $10 | 1,000 - 10,000x |
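That's a gap of roughly 3 to 8 orders of magnitude, depending on the dimension: hundreds-fold for power at the low end, hundreds-of-millions-fold for compute at the high end. Techniques that work in the cloud (larger batch sizes, more parameters, higher precision) are simply not available.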
The Four Constraints of Edge AI
Memory: Your model must fit. Not "mostly fit" or "fit with swapping." The entire model, plus activations, plus input/output buffers, must fit in available RAM and Flash. There's no cloud to offload to.
Power: Battery life matters more than speed. A model that runs 2x faster but consumes 3x the energy is worse, not better. Thermal limits cap sustained performance.
Latency: Real-time means real-time. A hearing aid with 100ms delay is unusable. An autonomous vehicle with 200ms perception delay is dangerous. You need both low latency and consistent latency.
Cost: When you're shipping millions of units, every cent matters. A $2 chip that needs a $3 NPU accelerator is twice as expensive as a $2.50 chip that doesn't.
TensorFlow Lite
TensorFlow Lite is Google's lightweight inference framework for mobile and embedded devices.
Model Conversion
import tensorflow as tf
# Load SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
# Basic conversion
tflite_model = converter.convert()
# Save
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
Quantization Options
# Dynamic range quantization (simplest)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
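# Full integer quantization (requires representative dataset)
# Random data only keeps this example short; use real input samples so the
# calibrated ranges match production data (needs `import numpy as np`)
def representative_dataset():
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]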
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Float16 quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
Running Inference
import numpy as np
import tensorflow as tf
# Load model
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
# Get input/output info
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Set input
input_data = np.random.randn(1, 224, 224, 3).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
# Execute
interpreter.invoke()
# Get output
output = interpreter.get_tensor(output_details[0]['index'])
TFLite Benchmark Tool
# Download benchmark tool
# https://www.tensorflow.org/lite/performance/measurement
# Run benchmark
./benchmark_model \
--graph=model.tflite \
--num_threads=4 \
--warmup_runs=10 \
--num_runs=100
# Example output:
# Inference (avg): 15.2 ms
# Inference (std): 1.3 ms
TensorFlow Lite Micro
TFLite Micro is an ultra-lightweight inference framework for microcontrollers.
Design Goals
TFLite Micro features:
1. Minimal binary size
Using TFLite Micro
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"
// Model data (typically loaded from Flash)
extern const unsigned char model_data[];
// Tensor Arena (size depends on model)
constexpr int kTensorArenaSize = 10 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
void setup() {
    // Load model
    const tflite::Model* model = tflite::GetModel(model_data);
    // Set up operators (register only the ops the model uses)
    static tflite::MicroMutableOpResolver<5> resolver;
    resolver.AddConv2D();
    resolver.AddMaxPool2D();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    resolver.AddReshape();
    // Create interpreter
    static tflite::MicroInterpreter interpreter(
        model, resolver, tensor_arena, kTensorArenaSize);
    // Allocate tensors; this fails if kTensorArenaSize is too small
    if (interpreter.AllocateTensors() != kTfLiteOk) {
        return;
    }
    // Get input tensor
    TfLiteTensor* input = interpreter.input(0);
    // Set input data...
    // Run inference
    if (interpreter.Invoke() != kTfLiteOk) {
        return;
    }
    // Get output
    TfLiteTensor* output = interpreter.output(0);
}
MLPerf Tiny
MLPerf Tiny is the AI benchmark standard designed for microcontrollers.
Benchmark Content
MLPerf Tiny Benchmarks:
Benchmark             Model                Task                                       Input
──────────────────────────────────────────────────────────────────────────────────────────────────
Visual Wake Words     MobileNet v1 (0.25)  Image classification (person detection)    96×96 RGB
Keyword Spotting      DS-CNN               Speech recognition (keyword detection)     49×10 MFCC
Anomaly Detection     FC AutoEncoder       Anomaly detection (machine sound)          128 features
Image Classification  ResNet v1            Image classification (CIFAR-10)            32×32 RGB
Performance Metrics
MLPerf Tiny metrics:
1. Latency
- Single inference time
- Unit: milliseconds
2. Throughput
- Inferences per second
- Unit: inferences/second
3. Energy
- Energy per inference
- Unit: μJ/inference
4. Accuracy
- Must meet target accuracy
- Example: Visual Wake Words > 80%
Typical Results
MLPerf Tiny v1.0 example results:
Hardware VWW Latency KWS Latency Energy
─────────────────────────────────────────────────────────────
STM32L4R5 (Cortex-M4) 250 ms 50 ms 1.2 mJ
MAX78000 (dedicated NPU) 2.5 ms 0.5 ms 12 μJ
GAP9 (RISC-V + NPU) 5 ms 1 ms 25 μJ
In these examples, a dedicated NPU delivers roughly 100x lower latency and energy per inference than the Cortex-M4
Edge AI Performance Analysis
Measurement Methods
1. Latency measurement
- Use high-precision timer
- Multiple runs for average
- Note warm-up effects
2. Power measurement
- Hardware power meter
- Current sensor
- Software estimation (imprecise)
3. Memory measurement
- Static analysis (model size)
- Runtime monitoring (peak RAM)
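The latency checklist above (precise timer, warm-up, multiple runs) can be sketched as a small host-side harness. This is a minimal illustration, not a replacement for `benchmark_model`. `fake_inference` is a hypothetical stand-in for a real `interpreter.invoke()` call, and the default of 10 warm-up runs and 100 timed runs simply mirrors the flags used earlier.

```python
import statistics
import time

def measure_latency(fn, warmup=10, runs=100, timer=time.perf_counter):
    """Return latency statistics in milliseconds for calling fn()."""
    for _ in range(warmup):
        fn()  # discard warm-up runs (cold caches, lazy initialization)
    samples = []
    for _ in range(runs):
        start = timer()
        fn()
        samples.append((timer() - start) * 1000.0)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "stdev_ms": statistics.stdev(samples),
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
        "max_ms": samples[-1],
    }

def fake_inference():
    """Hypothetical stand-in for interpreter.invoke()."""
    sum(i * i for i in range(10_000))

stats = measure_latency(fake_inference)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

Reporting the tail (p99, max) alongside the mean matters on edge devices, where consistent latency is part of the requirement.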
Latency Analysis
// Latency measurement on ARM Cortex-M
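#include "arm_math.h"  // assumes this pulls in the CMSIS core definitions (DWT, CoreDebug); otherwise include your device header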
volatile uint32_t start_cycles, end_cycles;
// Use DWT Cycle Counter
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
// Measure
start_cycles = DWT->CYCCNT;
interpreter.Invoke();
end_cycles = DWT->CYCCNT;
uint32_t cycles = end_cycles - start_cycles;
float latency_ms = (float)cycles / SystemCoreClock * 1000.0f;
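One detail in the cycle arithmetic is easy to miss: the counter is 32 bits wide and wraps. Unsigned subtraction in C still gives the right answer across a single wrap, but only for intervals shorter than one full counter period. The sketch below reproduces the arithmetic in Python for post-processing logged counter values. The 120 MHz and 80 MHz clock values are assumptions chosen for illustration.

```python
def cycles_to_ms(start_cycles, end_cycles, core_clock_hz):
    """Convert two 32-bit cycle-counter readings to elapsed milliseconds."""
    # Masking reproduces uint32_t subtraction, so one wraparound is handled.
    elapsed = (end_cycles - start_cycles) & 0xFFFFFFFF
    return elapsed / core_clock_hz * 1000.0

def max_measurable_ms(core_clock_hz):
    """Longest interval a 32-bit counter can time before results alias."""
    return (2**32) / core_clock_hz * 1000.0

print(cycles_to_ms(0, 120_000, 120_000_000))        # no wrap
print(cycles_to_ms(0xFFFFFF00, 0x100, 80_000_000))  # counter wrapped once
print(max_measurable_ms(120_000_000))
```

At 120 MHz the counter wraps after about 36 seconds, so longer measurements need a lower-rate timer or explicit overflow counting.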
Optimization Strategies
Model Optimization
1. Quantization
- INT8 quantization (4x smaller than FP32)
- INT4 quantization (8x smaller than FP32)
- Mixed precision
2. Pruning
- Remove unimportant weights
- Structured pruning better for hardware
3. Knowledge Distillation
- Train small model with large model
- Maintain accuracy while reducing size
4. Architecture Search
- NAS to find optimal architecture
- Optimize for target hardware
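To make the quantization item concrete, here is a minimal sketch of per-tensor affine INT8 quantization in plain Python. It illustrates the general scheme (a scale plus a zero point), not the exact algorithm of any particular converter. The function names and sample weights are made up for this example.

```python
def quantize_int8(values):
    """Affine per-tensor quantization of floats to int8.

    Returns (quantized_values, scale, zero_point).
    """
    lo = min(min(values), 0.0)  # the representable range must include 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each weight now occupies 1 byte instead of 4, and the reconstruction error stays within one quantization step. Whether that error is acceptable is what the post-quantization accuracy check decides.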
Hardware Acceleration
Edge AI accelerators:
1. NPU (Neural Processing Unit)
- Dedicated matrix operation units
- Examples: Apple Neural Engine, Google Edge TPU
2. DSP (Digital Signal Processor)
- Vector operations
- Example: Qualcomm Hexagon
3. GPU (Mobile)
- General parallel computing
- Examples: Adreno, Mali
4. FPGA
- Programmable hardware
- Suitable for custom requirements
Kenji's Shipping Day
Six months after that first failed demo, Kenji's team shipped their product.
The final model was nothing like what they'd started with. The original MobileNetV2 had been replaced with a custom architecture—smaller, faster, and specifically designed for their hardware. They'd used knowledge distillation to train it, quantization to shrink it, and careful profiling to optimize every layer.
The result: 15 FPS inference on a $3 MCU, with 94% accuracy on their target task. Battery life: 18 months on a coin cell.
"The cloud version was 99% accurate," Kenji admitted. "We lost 5 percentage points. But we gained something more important: we can actually ship."
At the launch party, his manager asked what he'd learned.
"Edge AI isn't about making cloud AI smaller," Kenji said. "It's a different discipline entirely. Different constraints, different tools, different trade-offs. You can't just shrink a model and hope it works. You have to design for the edge from the beginning."
He paused. "Also, buy a good current meter. You'll need it."
Edge AI performance is where all the constraints collide: compute, memory, power, latency, accuracy, and cost. Mastering it requires understanding not just ML, but embedded systems, power electronics, and the art of making hard trade-offs. It's challenging—but when you ship a product that runs AI on a device that costs less than a cup of coffee, it's deeply satisfying.
Summary
Edge AI performance analysis requires considering unique constraints:
Key Frameworks
- TensorFlow Lite: Mobile device standard
- TFLite Micro: Microcontrollers
- MLPerf Tiny: Standard benchmark
- Core ML / NNAPI: Platform-specific acceleration
Core Performance Metrics
- Latency: Inference time
- Energy: Energy consumption
- Memory: RAM/Flash usage
- Accuracy: Post-quantization accuracy
Main Optimization Strategies
- Quantization (INT8, INT4)
- Pruning
- Knowledge distillation
- Hardware acceleration
Practical Measurement Methods
- Cycle counter (latency)
- Current sensor (power)
- Static analysis + runtime monitoring (memory)
Chapter 30: Case Study: Web Server Optimization
Part VIII: Case Studies
"Premature optimization is the root of all evil. But premature pessimization is the root of all slowness." — Adapted from Donald Knuth
The Story of "Fast SSD" That Was Still Slow
Our API server handled static file requests. We had the latest NVMe SSD—rated at 7 GB/s read speed, 1 million IOPS.
But measured: average response time 50ms, peak throughput only 2,000 req/s.
"The SSD is so fast, how can it be this slow?"
After a week of debugging, we found the problem wasn't the SSD, but:
- Sync I/O: Each request blocked waiting for I/O completion
- Small files: Lots of 4KB requests, IOPS was the bottleneck
- Syscall overhead: Every read() is a syscall
- Context switches: Thread-per-request model
This chapter walks through analyzing a web server's performance using all the tools we've learned.
Scenario Setup
System Specs
Server:
- CPU: AMD EPYC 7543 (32 cores, 64 threads)
- RAM: 256 GB DDR4-3200
- Storage: Samsung PM9A3 NVMe SSD (7.68 TB)
- Sequential Read: 6.9 GB/s
- Random Read IOPS: 1,000,000 (4KB)
- Network: Mellanox ConnectX-6 (100 Gbps)
- OS: Ubuntu 22.04, Kernel 5.15
Application:
- Nginx + upstream API server
- Main workload: static files + JSON API
- Target: 50,000 req/s, P99 < 10ms
Initial State
# Benchmark with wrk
wrk -t12 -c400 -d30s http://server/api/users
Running 30s test @ http://server/api/users
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 45.23ms 67.89ms 523.12ms 87.65%
Req/Sec 912.34 234.56 1.89k 72.34%
328234 requests in 30.01s, 1.23GB read
Requests/sec: 10937.23
Transfer/sec: 42.01MB
Problems:
- Throughput: 10,937 req/s (target 50,000)
- P99 latency: estimated > 200ms (target < 10ms)
- Gap: about 4.6× in throughput, more than 20× in P99 latency
Step 1: Find the Bottleneck
CPU or I/O?
# Check CPU usage
mpstat -P ALL 1
# Result
CPU %usr %sys %iowait %idle
all 12.3 18.7 3.2 65.8
# CPU only ~31% used, lots of idle
# This is not CPU-bound
# Check I/O
iostat -x 1
Device r/s rkB/s await %util
nvme0n1 8234.00 32936.00 0.12 8.2%
# SSD only 8.2% utilized, not I/O-bound either
Conclusion: neither CPU nor disk is saturated, and 42 MB/s of responses is far below what a 100 Gbps NIC can carry. The problem is in the software layer.
Use perf to Find Where CPU Time Goes
perf record -g -p $(pgrep -f "api_server") -- sleep 30
perf report
# Result
35.2% api_server libc.so.6 [.] __GI___poll
18.7% api_server [kernel] [k] system_call_fastpath
12.3% api_server libc.so.6 [.] malloc
8.9% api_server api_server [.] json_serialize
6.5% api_server libc.so.6 [.] __GI___read
...
Findings:
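- 35% of CPU samples are in poll(): perf counts on-CPU time only, so this is the cost of calling poll() over and over, not time spent blocked
- 18.7% in the syscall entry path: too many system calls
- 12.3% in malloc: frequent memory allocation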
Use strace to See Syscall Pattern
strace -c -p $(pgrep -f "api_server") -f
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.15 1.234567 2.1 587654 poll
24.89 0.956789 1.8 531234 read
18.76 0.721234 2.3 313567 write
12.34 0.474123 3.2 148234 open
8.21 0.315678 2.9 108876 close
3.65 0.140234 1.5 93489 fstat
Each request approximately:
- 1 poll
- 1 open
- 1+ read
- 1+ write
- 1 close
- 1 fstat
At least 6 syscalls per request. 10,000 req/s = 60,000 syscall/s.
Step 2: Optimize One by One
Optimization 1: Reduce Syscalls (sendfile)
Original flow:
Optimization 2: io_uring Instead of epoll
Traditional epoll pattern:
while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        if (events[i].events & EPOLLIN) {
            int fd = events[i].data.fd;   // the socket that became readable
            read(fd, buf, size);           // syscall
            process(buf);
            write(fd, response, len);      // syscall
        }
    }
}
Each I/O operation is a separate syscall.
Using io_uring:
#include <liburing.h>
// Setup io_uring
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
// Batch submit multiple I/O
struct io_uring_sqe *sqe;
for (int i = 0; i < batch_size; i++) {
    sqe = io_uring_get_sqe(&ring);  // returns NULL if the submission queue is full
    io_uring_prep_read(sqe, fds[i], bufs[i], sizes[i], 0);
}
// One syscall submits all I/O
io_uring_submit(&ring);  // 1 syscall for N operations!
io_uring advantages:
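- Batch submission, fewer syscalls
- Submission and completion rings live in memory shared with the kernel, so requests are queued without a syscall per operation
- Supports zero-copy send (IORING_OP_SEND_ZC), which needs Linux 6.0 or later, newer than the 5.15 kernel in this scenario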
Result
Before: 18,234 req/s
After: 32,456 req/s (+78%)
Optimization 3: Memory Pool (Reduce malloc)
Original: malloc/free for each request:
void handle_request(int fd) {
char *buffer = malloc(4096); // malloc every time
read(fd, buffer, 4096);
char *response = malloc(response_size); // malloc again
build_response(buffer, response);
write(fd, response, response_size);
free(response);
free(buffer);
}
Using memory pool:
// Thread-local buffer pool
static __thread struct {
    char request_buf[4096];
    char response_buf[65536];
} buffers;
void handle_request(int fd) {
    read(fd, buffers.request_buf, sizeof(buffers.request_buf));  // Reuse buffer
    // build_response is assumed to return the number of response bytes written
    size_t response_size = build_response(buffers.request_buf, buffers.response_buf);
    write(fd, buffers.response_buf, response_size);
    // No free needed
}
Result
Before: 32,456 req/s
After: 41,234 req/s (+27%)
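The same reuse idea carries over to higher-level languages. Here is a minimal sketch in Python, assuming a hypothetical `BufferPool` class and request handler: buffers are allocated once, and `readinto` fills an existing buffer instead of creating a new bytes object per request. The pool is not thread-safe. A real server would use one pool per thread or add locking.

```python
import io

class BufferPool:
    """Fixed set of preallocated bytearrays, handed out and returned per request."""

    def __init__(self, count, size):
        self._size = size
        self._free = [bytearray(size) for _ in range(count)]
        self.allocations = count  # total buffers ever created

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.allocations += 1  # pool exhausted: fall back to a fresh buffer
        return bytearray(self._size)

    def release(self, buf):
        self._free.append(buf)

def handle_request(stream, pool):
    """Read one request into a pooled buffer and return its length."""
    buf = pool.acquire()
    try:
        return stream.readinto(buf)  # fills the existing buffer in place
    finally:
        pool.release(buf)

pool = BufferPool(count=4, size=4096)
for _ in range(1000):
    handle_request(io.BytesIO(b"GET /api/users HTTP/1.1\r\n\r\n"), pool)
print("buffers allocated for 1000 requests:", pool.allocations)
```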
Optimization 4: Connection Pooling and Keep-Alive
Cost of each new connection:
TCP three-way handshake: ~1 RTT
TLS handshake: ~2 RTT (TLS 1.2) or 1 RTT (TLS 1.3)
Connection setup: ~100μs
For short requests, this overhead can be longer than processing itself
Enable HTTP Keep-Alive:
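# nginx.conf
http {
    keepalive_timeout 65;
    keepalive_requests 1000;  # Max 1000 requests per connection
    upstream backend {
        server 127.0.0.1:8080;
        keepalive 128;  # Idle upstream connections kept open per worker
    }
    server {
        location / {
            proxy_pass http://backend;
            # Upstream keep-alive needs HTTP/1.1 and an empty Connection header
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}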
Result
Before: 41,234 req/s
After: 48,567 req/s (+18%)
Step 3: System-Level Tuning
TCP Tuning
# /etc/sysctl.conf
# Increase socket buffer
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase connection backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Allow reuse of TIME_WAIT sockets for new outgoing connections
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3
File Descriptor Limits
# /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000
# Or in systemd service
[Service]
LimitNOFILE=1000000
Final Results
Optimization Stage Throughput Improvement
────────────────────────────────────────────────────────────
Initial state 10,937 req/s baseline
+ sendfile 18,234 req/s +67%
+ io_uring 32,456 req/s +78%
+ memory pool 41,234 req/s +27%
+ keep-alive 48,567 req/s +18%
+ system tuning 56,789 req/s +17%
────────────────────────────────────────────────────────────
Total 56,789 req/s +419%
P99 latency: from 200+ms down to 8ms.
Exceeded target (50,000 req/s, P99 < 10ms).
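The per-stage percentages are each relative to the previous stage, so they compound rather than add. A short check of the table's arithmetic, using only the throughput numbers listed above:

```python
stages = [
    ("initial", 10_937),
    ("sendfile", 18_234),
    ("io_uring", 32_456),
    ("memory pool", 41_234),
    ("keep-alive", 48_567),
    ("system tuning", 56_789),
]

def stage_gains(stages):
    """Percent gain of each stage over the stage before it."""
    return [
        (name, round((cur / prev - 1) * 100))
        for (_, prev), (name, cur) in zip(stages, stages[1:])
    ]

def total_gain(stages):
    """Percent gain of the final stage over the baseline."""
    return round((stages[-1][1] / stages[0][1] - 1) * 100)

for name, gain in stage_gains(stages):
    print(f"+ {name}: +{gain}%")
print(f"total: +{total_gain(stages)}%")
```

Summing the individual gains (67 + 78 + 27 + 18 + 17 = 207) would badly understate the result. The total is the product of the stage factors, about 5.2x.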
Key Lessons
1. Bottleneck May Not Be Where You Think
Initial assumption: SSD too slow
Actual problem: syscall overhead, sync I/O, frequent malloc
Tools matter: perf, strace, bpftrace
2. Syscalls Are Expensive
One syscall ~100-1000 cycles
High throughput systems must reduce syscalls:
- Batch processing (io_uring)
- Zero-copy (sendfile)
- Avoid unnecessary calls (keep fd open)
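The zero-copy item can be illustrated from Python, whose `socket.sendfile` uses `os.sendfile` where the platform supports it. The sketch below contrasts it with a read-then-send loop over a local socket pair. The helper names and the 4 KB temporary file are made up for the demo. It checks that both paths deliver the same bytes and makes no performance claim.

```python
import os
import socket
import tempfile

def serve_with_copy(sock, path, chunk=65536):
    """read() + sendall(): data goes kernel -> user buffer -> kernel."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                return sent
            sock.sendall(data)
            sent += len(data)

def serve_with_sendfile(sock, path):
    """socket.sendfile(): file data can stay inside the kernel."""
    with open(path, "rb") as f:
        return sock.sendfile(f)

def roundtrip(serve, path):
    """Send the file through a local socket pair and return what arrived."""
    server, client = socket.socketpair()
    try:
        sent = serve(server, path)
        received = b""
        while len(received) < sent:
            received += client.recv(65536)
        return received
    finally:
        server.close()
        client.close()

payload = os.urandom(4096)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    via_copy = roundtrip(serve_with_copy, tmp.name)
    via_sendfile = roundtrip(serve_with_sendfile, tmp.name)
finally:
    os.unlink(tmp.name)
print(via_copy == payload, via_sendfile == payload)
```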
3. Memory Allocation Is Hidden Cost
malloc/free itself isn't slow
But causes:
- Lock contention (multi-threaded)
- Cache pollution
- Memory fragmentation
Solution: memory pool, arena allocator
4. SSD Is Not Magic
SSD is fast, but:
- Each I/O has fixed overhead
- Queue depth matters
- Small I/O is inefficient
- Needs alignment
To fully utilize SSD performance:
- Async I/O
- High queue depth
- I/O coalescing
- Direct I/O (some scenarios)
Summary
Diagnostic Flow
- Identify bottleneck type (CPU/IO/Network)
- Use perf to find where CPU time goes
- Use strace to analyze syscall patterns
- Use bpftrace to see latency distribution
Optimization Techniques
| Problem | Solution |
|---|---|
| Too many syscalls | sendfile, io_uring, batching |
| Sync I/O | io_uring, async I/O |
| Frequent malloc | Memory pool, arena allocator |
| Connection overhead | Keep-alive, connection pool |
| Low SSD efficiency | High queue depth, I/O coalescing |
System Tuning
- TCP buffer size
- File descriptor limits
- CPU affinity
- NUMA awareness
Remember
"Fast" SSD + slow software = slow system
"Slow" HDD + good software = acceptable system
Software architecture determines whether hardware potential is realized
Chapter 31: Case Study: Database Query Optimization
Part VIII: Case Studies
"The fastest query is the one you don't have to make." — Unknown DBA
The Story of "Cross-Datacenter Query" That Could Only Run 4 Times Per Second
Our service needed to read data from MySQL in another datacenter. Network latency was about 20ms (speed of light limitation).
Simple code:
def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders
A user has 10 orders, each order has 5 items.
Total: 1 + 1 + 10 = 12 queries.
Each query 20ms RTT → total latency 240ms.
Can only handle 4 requests per second (single thread).
This is the classic N+1 query problem, amplified by network latency.
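The arithmetic generalizes directly: with sequential queries, latency grows linearly with the number of orders. A small sketch with helper names invented for this example:

```python
def n_plus_one_queries(num_orders):
    """1 query for the user, 1 for the order list, then 1 per order for its items."""
    return 1 + 1 + num_orders

def sequential_latency_ms(num_queries, rtt_ms):
    """Each query waits for the previous one, so round trips add up."""
    return num_queries * rtt_ms

queries = n_plus_one_queries(10)
latency_ms = sequential_latency_ms(queries, rtt_ms=20)
max_requests_per_sec = 1000 / latency_ms
print(queries, latency_ms, round(max_requests_per_sec, 2))

# A user with 100 orders makes the same code ten times slower.
print(sequential_latency_ms(n_plus_one_queries(100), rtt_ms=20))
```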
Network Latency: The Underestimated Killer
Speed of Light Limits
Location Distance Fiber Latency (one-way) RTT
───────────────────────────────────────────────────────────────
Same datacenter < 1 km ~0.005 ms ~0.01 ms
Same city ~50 km ~0.25 ms ~0.5 ms
Cross-city ~350 km ~1.75 ms ~3.5 ms
Cross-country ~2000 km ~10 ms ~20 ms
Cross-continent ~10000 km ~50 ms ~100 ms
This is a physical limit and cannot be tuned away. The only remedies are to make fewer round trips or to keep more of them in flight at once.
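The latency column follows from the speed of light in glass. The sketch below assumes a refractive index of 1.5 for the fiber (a typical textbook value, giving about 200,000 km/s) and a straight-line path. Real routes are longer and add switching delay, so treat the results as lower bounds.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.5  # assumption: typical silica fiber

def fiber_one_way_ms(distance_km):
    """Minimum one-way propagation delay over a straight fiber run."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return distance_km / speed_in_fiber * 1000.0

def fiber_rtt_ms(distance_km):
    """Minimum round-trip time: there and back."""
    return 2 * fiber_one_way_ms(distance_km)

for label, km in [("same city", 50), ("cross-city", 350),
                  ("cross-country", 2000), ("cross-continent", 10000)]:
    print(f"{label}: one-way {fiber_one_way_ms(km):.2f} ms, RTT {fiber_rtt_ms(km):.2f} ms")
```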
Little's Law Again
Throughput = Concurrency / Latency
If RTT = 20ms, single connection:
Throughput = 1 / 0.02 = 50 queries/sec
To reach 1000 queries/sec:
Concurrency = 1000 × 0.02 = 20 parallel connections
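Both directions of the formula fit in two small functions. The names are invented for this sketch. Rounding before `ceil` guards against floating-point noise in the product.

```python
import math

def max_throughput(concurrency, latency_s):
    """Little's Law: throughput = concurrency / latency."""
    return concurrency / latency_s

def required_concurrency(target_qps, latency_s):
    """Connections needed in flight to sustain target_qps at this latency."""
    return math.ceil(round(target_qps * latency_s, 9))

print(max_throughput(1, 0.020))           # one connection, 20 ms RTT
print(required_concurrency(1000, 0.020))  # connections for 1000 queries/sec
print(required_concurrency(1000, 0.100))  # same target across a continent
```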
Problem Analysis
Original Code Problems
# Classic N+1 problem
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    # One extra query per user
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
    user.orders = orders
# Total 101 queries!
Using EXPLAIN to Analyze
EXPLAIN SELECT * FROM orders WHERE user_id = 12345;
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | orders | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+---------+-------------+
type = ALL → Full table scan! No index used.
Optimization Strategies
Optimization 1: Add Index
Most basic but most effective:
-- Check existing indexes
SHOW INDEX FROM orders;
-- Add missing index
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Verify
EXPLAIN SELECT * FROM orders WHERE user_id = 12345;
type = ref → Using index, scanning 10 rows (not 1M rows)
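The before-and-after check can be reproduced without a MySQL server. The sketch below uses SQLite from the Python standard library as a stand-in. Its `EXPLAIN QUERY PLAN` output is formatted differently from MySQL's `EXPLAIN`, but it shows the same switch from a full scan to an index search. The table contents are synthetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

def query_plan(conn, sql, params=()):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)

sql = "SELECT * FROM orders WHERE user_id = ?"
plan_before = query_plan(conn, sql, (123,))
conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
plan_after = query_plan(conn, sql, (123,))

print("before:", plan_before)
print("after: ", plan_after)
```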
Optimization 2: Solve N+1 Problem
Method A: JOIN
-- Original: N+1 queries
SELECT * FROM users WHERE id = ?;
SELECT * FROM orders WHERE user_id = ?;
-- Optimized: 1 JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.id = ?;
Method B: Batch Query
# Original: N+1
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
# Optimized: 2 queries
users = db.query("SELECT * FROM users LIMIT 100")
user_ids = [u.id for u in users]
# Many drivers will not expand a list bound to a single "?";
# in that case generate one placeholder per id
orders = db.query("SELECT * FROM orders WHERE user_id IN (?)", user_ids)
# Combine in application layer
orders_by_user = group_by(orders, 'user_id')
for user in users:
    user.orders = orders_by_user.get(user.id, [])
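A runnable version of the batch approach, again using SQLite as a stand-in for MySQL with made-up sample rows. It shows the two details the pseudocode glosses over: building one placeholder per id for the `IN` clause, and grouping the rows in the application.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 3, 5.0);
""")

def orders_by_user(conn, user_ids):
    """Fetch orders for many users in one query and group them by user_id."""
    if not user_ids:
        return {}
    placeholders = ", ".join("?" for _ in user_ids)  # one "?" per id
    rows = conn.execute(
        f"SELECT id, user_id, total FROM orders "
        f"WHERE user_id IN ({placeholders}) ORDER BY id",
        list(user_ids),
    ).fetchall()
    grouped = defaultdict(list)
    for order_id, user_id, total in rows:
        grouped[user_id].append((order_id, total))
    return grouped

users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
grouped = orders_by_user(conn, [uid for uid, _ in users])
result = {name: grouped.get(uid, []) for uid, name in users}
print(result)
```

Only the placeholders are interpolated into the SQL string. The ids themselves are still passed as bound parameters.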
Method C: ORM Eager Loading
Optimization 4: Connection Pool
# Original: create connection for each query
def query(sql):
    conn = mysql.connect(host='db.server.com', ...)  # TCP + TLS handshake
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchall()
    conn.close()
    return result
# Optimized: connection pool
from sqlalchemy import create_engine
engine = create_engine(
    'mysql://user:pass@db.server.com/mydb',
    pool_size=20,        # Keep 20 connections
    max_overflow=10,     # Up to 10 extra
    pool_recycle=3600,   # Recycle after 1 hour
    pool_pre_ping=True   # Test connection before use
)
Connection establishment cost:
TCP three-way handshake: 1 RTT (~20ms)
TLS handshake: 2 RTT (~40ms) for TLS 1.2
MySQL authentication: 1 RTT (~20ms)
──────────────────────────────────────────
Total: ~80ms per new connection
With connection pool, this cost is paid only once.
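What a pool does internally can be sketched in a few lines. This is a toy illustration of the mechanism, not how SQLAlchemy implements it. `ConnectionPool` and `FakeConnection` are hypothetical names, and the 80 ms figure is the per-connection estimate from the breakdown above.

```python
import queue

class ConnectionPool:
    """Minimal blocking pool: connections are created once and reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    def run(self, fn, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # waits while all are busy
        try:
            return fn(conn)
        finally:
            self._idle.put(conn)  # hand the connection back for reuse

class FakeConnection:
    """Hypothetical stand-in that counts how many connections were opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, sql):
        return f"result of {sql}"

pool = ConnectionPool(FakeConnection, size=3)
results = [pool.run(lambda c: c.query("SELECT 1")) for _ in range(100)]

setup_cost_pooled_ms = FakeConnection.opened * 80
setup_cost_unpooled_ms = len(results) * 80
print(FakeConnection.opened, setup_cost_pooled_ms, setup_cost_unpooled_ms)
```

For 100 queries the handshake cost drops from 8,000 ms to 240 ms, paid once per pooled connection rather than once per query.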
Caching Strategy
Multi-Layer Cache Architecture
┌─────────────┐
│ Application │
│ Cache │ ← L1: In-process cache (fastest, small capacity)
└──────┬──────┘
│
┌──────▼──────┐
│ Redis / │ ← L2: Distributed cache (fast, medium capacity)
│ Memcached │
└──────┬──────┘
│
┌──────▼──────┐
│ Database │ ← L3: Database (slow, large capacity)
│ Buffer Pool │
└─────────────┘
Cache-Aside Pattern
def get_user(user_id):
    # 1. Check cache first
    cache_key = f"user:{user_id}"
    cached = redis.get(cache_key)
    if cached is not None:
        return deserialize(cached)
    # 2. Cache miss, query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # 3. Write to cache
    redis.setex(cache_key, 3600, serialize(user))  # 1 hour expiry
    return user
Cache Invalidation
# Invalidate on update
def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.delete(f"user:{user_id}")  # Delete cache
# Or: update cache on update
def update_user_with_cache(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, serialize(user))
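The read path and the invalidate-on-write path can be exercised together without Redis. In this sketch, `TTLCache` is a hypothetical in-process stand-in that mimics only `get`, `setex`, and `delete`, and a dictionary plays the role of the users table.

```python
import time

class TTLCache:
    """In-process stand-in for the Redis calls used above (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # missing or expired
            return None
        return entry[0]

    def setex(self, key, ttl_s, value):
        self._data[key] = (value, self._clock() + ttl_s)

    def delete(self, key):
        self._data.pop(key, None)

db = {1: {"name": "ana"}}  # stand-in for the users table
stats = {"db_reads": 0}
cache = TTLCache()

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    stats["db_reads"] += 1  # cache miss: go to the database
    user = dict(db[user_id])
    cache.setex(key, 3600, user)
    return user

def update_user(user_id, **fields):
    db[user_id].update(fields)
    cache.delete(f"user:{user_id}")  # invalidate after the write

get_user(1)
get_user(1)                  # served from cache
update_user(1, name="bo")
print(get_user(1), stats)    # refetched after invalidation
```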
Practical Example: Optimizing Cross-Datacenter Query
Back to the original problem:
# Original: 12 queries, 240ms
def get_user_orders(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user_id)
    for order in orders:
        items = db.query("SELECT * FROM items WHERE order_id = ?", order.id)
        order.items = items
    return user, orders
Optimized Version
def get_user_orders_optimized(user_id):
    # 1. Check cache first
    cache_key = f"user_orders:{user_id}"
    cached = redis.get(cache_key)
    if cached is not None:
        return deserialize(cached)
    # 2. Single query to get all data
    #    (in real code, alias the columns: u.id, o.id and i.id collide)
    result = db.execute("""
        SELECT u.*, o.*, i.*
        FROM users u
        LEFT JOIN orders o ON u.id = o.user_id
        LEFT JOIN items i ON o.id = i.order_id
        WHERE u.id = ?
    """, user_id)
    # 3. Assemble in application layer
    user, orders = assemble_result(result)
    # 4. Write to cache
    redis.setex(cache_key, 1800, serialize((user, orders)))
    return user, orders
Results
Original:
- 12 queries × 20ms = 240ms
- Throughput: 4 req/s (single thread)
Optimized:
- Cache hit: < 1ms (Redis in same datacenter)
- Cache miss: 1 query × 20ms = 20ms
- Throughput: 50+ req/s (single thread)
- With 90% cache hit rate: average ~3ms
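The average is a weighted sum of the hit and miss paths, which makes it easy to see how sensitive the result is to the hit rate. The 1 ms hit cost below is the upper bound quoted above.

```python
def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Average latency as a weighted sum of cache hits and misses."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

for hit_rate in (0.50, 0.90, 0.99):
    avg = expected_latency_ms(hit_rate, hit_ms=1.0, miss_ms=20.0)
    print(f"hit rate {hit_rate:.0%}: average {avg:.2f} ms")
```

The average hides the tail: at a 90% hit rate, one request in ten still pays the full 20 ms round trip.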
Summary
Network Latency Is a Hard Limit
20ms RTT = max 50 queries/sec (single connection)
Solution: reduce round trips, increase parallelism
N+1 Problem
Symptom: N+1 queries
Solution: JOIN, batch query, ORM eager loading
Optimization Layers
| Layer | Optimization Method |
|---|---|
| Query | Index, JOIN, batching |
| Connection | Connection pool, multiplexing |
| Protocol | Pipeline, compression |
| Cache | L1/L2 cache, Cache-Aside |
| Storage | Buffer pool, partitioning, SSD tuning |
| Network | TCP tuning, BBR |
Caching Strategies
Cache-Aside: Fill on read
Write-Through: Update on write
Write-Behind: Async write
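Watch out for: cache penetration (lookups for keys that do not exist, which always fall through to the database), cache breakdown (a hot key expiring under load), and cache avalanche (many keys expiring at the same time)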
Remember
Query count × RTT = minimum latency
1 good query > 10 simple queries
On high-latency networks, this difference is even more pronounced
Chapter 32: Case Study: ML Inference Optimization
Part VIII: Case Studies
"Training is science. Inference is engineering." — An ML Engineer
The Story of "GPU Utilization at Only 1%"
We deployed an image classification service using ResNet-50. The hardware was an NVIDIA A100 GPU, with a theoretical peak of 312 TFLOPS (dense FP16 on Tensor Cores).
But measured: only 50 images per second.
ResNet-50 needs about 8 GFLOPs to process one image. Theoretically:
312 TFLOPS / 8 GFLOPs = 39,000 images/sec
We achieved only 50 images/sec.
Fraction of peak compute used: 0.13%
Where's the problem?
1. Image transfer from CPU to GPU: ~5ms
2. GPU computation: ~0.2ms
3. Result transfer from GPU to CPU: ~0.1ms
4. Python overhead: ~10ms
5. Image decoding (CPU): ~5ms
────────────────────────────────────
Total: ~20ms per image
GPU computation is only 1%. The other 99% is spent "feeding data."
This is the core challenge of ML inference optimization: keeping the GPU busy.
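The breakdown can be turned into a small model that shows both where the time goes and what overlapping the stages would buy. The stage times are the ones listed above. The pipelined figure is an idealized bound that assumes every stage runs concurrently with the others, which a real service will only approximate.

```python
stages_ms = {
    "image decode (CPU)": 5.0,
    "H2D copy": 5.0,
    "GPU compute": 0.2,
    "D2H copy": 0.1,
    "Python overhead": 10.0,
}

total_ms = sum(stages_ms.values())
images_per_sec = 1000.0 / total_ms                     # stages run back to back
gpu_share = stages_ms["GPU compute"] / total_ms        # fraction of time on the GPU

peak_images_per_sec = 312e12 / 8e9                     # peak FLOPS / FLOPs per image
fraction_of_peak = images_per_sec / peak_images_per_sec

# Idealized: if all stages overlapped, the slowest stage sets the pace.
pipelined_images_per_sec = 1000.0 / max(stages_ms.values())

print(f"serial: {images_per_sec:.1f} img/s, GPU busy {gpu_share:.1%}")
print(f"fraction of peak: {fraction_of_peak:.2%}")
print(f"fully pipelined bound: {pipelined_images_per_sec:.0f} img/s")
```

Even perfect overlap only doubles throughput here, because the 10 ms of Python overhead dominates. Removing or amortizing that stage matters more than anything on the GPU.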
ML Inference Characteristics
Training vs Inference
| Characteristic | Training | Inference |
|---|---|---|
| Batch size | Large (32-4096) | Small (1-64) |
| Latency requirement | Not important | Critical |
| Precision requirement | FP32/BF16 | Can be lower (INT8) |
| Frequency | Once | Continuous |
| Optimization goal | Throughput | Latency + Throughput |
Common Bottlenecks
┌─────────────────────────────────────────────────────────────────┐
│ Inference Pipeline │
├─────────┬──────────┬──────────┬──────────┬──────────┬──────────┤
│ Input │ Preproc │ H2D Copy │ Compute │ D2H Copy │ Postproc │
│ (I/O) │ (CPU) │ (PCIe) │ (GPU) │ (PCIe) │ (CPU) │
└─────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Common bottlenecks:
1. I/O: Reading data
2. CPU preprocessing: Decode, resize, normalize
3. PCIe transfer: CPU ↔ GPU
4. GPU compute: Model inference
5. CPU postprocessing: NMS, decoding
Measurement Tools
NVIDIA Nsight Systems
# Record profile
nsys profile -o report python inference.py
# View report
nsys-ui report.nsys-rep
Nsight Systems shows:
- GPU kernel execution time
- CPU/GPU synchronization points
- Memory copy (H2D/D2H)
- CUDA API calls
PyTorch Profiler
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(100):
        with record_function("inference"):
            output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Optimization Strategies
Optimization 1: Batching
A single-image inference cannot fully use the GPU's parallelism:
# Original: process one by one
for image in images:
    result = model(image.unsqueeze(0)) # batch_size = 1
# Optimized: batch processing
batch = torch.stack(images) # batch_size = len(images), e.g. 32
results = model(batch)
Batching effect:
Batch Size Latency (ms) Throughput (img/s) GPU Util
─────────────────────────────────────────────────────────────
1 5.2 192 15%
8 8.1 988 45%
32 18.5 1730 78%
128 62.3 2055 92%
256 118.7 2157 95%
As batch size grows, throughput rises with diminishing returns (going from 128 to 256 adds only about 5%), while latency grows almost linearly.
Trade-off: Latency vs Throughput
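One way to resolve the trade-off is to fix a latency budget and take the highest-throughput batch size that fits. The sketch below does this with the measurements from the table; the budget values and function names are assumptions for illustration:

```python
# (batch size, latency in ms) pairs copied from the table above
MEASUREMENTS = [(1, 5.2), (8, 8.1), (32, 18.5), (128, 62.3), (256, 118.7)]

def throughput(batch_size, latency_ms):
    """Images per second for one batch completing in latency_ms."""
    return batch_size / (latency_ms / 1000.0)

def pick_batch_size(measurements, latency_budget_ms):
    """Highest-throughput batch size whose latency fits the budget, or None."""
    candidates = [(throughput(b, ms), b) for b, ms in measurements
                  if ms <= latency_budget_ms]
    return max(candidates)[1] if candidates else None

for budget in (5, 10, 20, 100):
    print(budget, "ms budget ->", pick_batch_size(MEASUREMENTS, budget))
```

A budget below the batch-1 latency returns `None`, which signals that batching alone cannot meet the target.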
Optimization 2: Reduce CPU-GPU Data Transfer
Use Pinned Memory
# Normal memory → GPU: requires extra copy
tensor = torch.randn(batch_size, 3, 224, 224)
tensor_gpu = tensor.to('cuda') # slow
Optimization 4: Quantization
FP32 → FP16
# Plain FP16 conversion (this is not autocast-based automatic mixed precision)
model = model.half() # Convert weights to FP16
input_tensor = input_tensor.half() # Inputs must match the weight dtype
output = model(input_tensor)
FP32 → INT8 (requires calibration)
import torch
import torch.quantization as quant
# Post-training static quantization; the 'fbgemm' backend targets x86 CPUs
model.eval()
# Prepare for quantization
model.qconfig = quant.get_default_qconfig('fbgemm')
quant.prepare(model, inplace=True)
# Calibrate (with representative data)
with torch.no_grad():
    for data in calibration_loader:
        model(data)
# Convert
quant.convert(model, inplace=True)
Quantization effect:
Precision Model Size Latency Accuracy Drop
───────────────────────────────────────────────────
FP32 98 MB 5.2 ms baseline
FP16 49 MB 2.8 ms ~0%
INT8 25 MB 1.5 ms 0.5-1%
Optimization 5: Preprocessing Optimization
CPU preprocessing is often the bottleneck:
# Original: PIL + torchvision (slow)
from PIL import Image
from torchvision import transforms
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
image = Image.open("image.jpg")
tensor = transform(image)
Use NVIDIA DALI (GPU preprocessing)
from nvidia.dali import pipeline_def, fn
import nvidia.dali.types as types

@pipeline_def
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed") # GPU decode
    images = fn.resize(images, resize_shorter=256) # same as Resize(256)
    # Center crop, normalize, and convert HWC to CHW in one fused operator
    images = fn.crop_mirror_normalize(
        images,
        crop=(224, 224),
        mean=[0.485*255, 0.456*255, 0.406*255],
        std=[0.229*255, 0.224*255, 0.225*255],
        dtype=types.FLOAT,
        output_layout="CHW")
    return images, labels

pipe = image_pipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()
LLM Inference Special Optimizations
Large language models have unique challenges:
Memory-bound Problem
LLaMA-7B parameters: 7B × 2 bytes (FP16) = 14 GB
A100 memory bandwidth: 2 TB/s
Each token generation requires reading all parameters
Theoretical max speed: 2000 GB/s ÷ 14 GB = 143 tokens/sec
Plus KV cache read/write overhead
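The bound above is just bandwidth divided by model size. A minimal sketch, assuming batch size 1 and ignoring KV cache traffic; the function name is hypothetical, and the INT8 line only illustrates that the same formula doubles the bound when each parameter takes one byte:

```python
def max_decode_tokens_per_sec(n_params, bytes_per_param, bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed when every token
    must stream all weights from GPU memory once."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes

fp16 = max_decode_tokens_per_sec(7e9, 2, 2e12)
int8 = max_decode_tokens_per_sec(7e9, 1, 2e12)
print(f"FP16: {fp16:.0f} tokens/sec, INT8: {int8:.0f} tokens/sec")
```

Because the weights are read once per step regardless of batch size, serving several sequences in the same step raises total tokens per second without raising this per-sequence bound.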
KV Cache Optimization
# KV cache uses significant memory
# Sequence length 2048, batch 16, 32 layers, hidden 4096 each
# KV cache size = 2 (K and V) × 16 × 32 × 2048 × 4096 × 2 bytes ≈ 17.2 GB
# Use PagedAttention (vLLM)
from vllm import LLM, SamplingParams
prompts = ["What is a KV cache?"] # example input
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)
Speculative Decoding
Use small model to "guess" multiple tokens, large model verifies:
Traditional: Large model generates 1 token → generates the next → ... (one large-model call per token)
Speculative:
1. Small model quickly generates 4 tokens
2. Large model verifies all 4 at once
3. If 3 are correct, saves 2 large model calls
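The draft-then-verify loop can be sketched with two toy "models". Everything here is hypothetical: `target_next` and `draft_next` are deterministic stand-ins, not real LLMs, and the verification pass is simulated by a loop and counted as one large-model call. The sketch also assumes the common greedy variant in which the verification pass contributes the large model's own token at the first mismatch (or one bonus token if every draft is accepted), which goes slightly beyond the three steps above:

```python
def target_next(prefix):
    """Hypothetical 'large model': a deterministic toy next-token rule."""
    return (sum(prefix) + len(prefix)) % 7

def draft_next(prefix):
    """Hypothetical 'small model': agrees with the target except at every fifth position."""
    token = target_next(prefix)
    return token if len(prefix) % 5 != 4 else (token + 1) % 7

def greedy_decode(prompt, n_tokens):
    """Baseline: one large-model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):], n_tokens

def speculative_decode(prompt, n_tokens, k=4):
    """Draft k tokens, then verify them in one simulated large-model pass per round."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        draft, ctx = [], list(out)
        for _ in range(k):                  # 1. small model guesses k tokens
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        passes += 1                         # 2. one verification pass
        ctx = list(out)
        for guess in draft:
            expected = target_next(ctx)
            ctx.append(expected)            # keep the target's token either way
            if guess != expected:           # 3. first mismatch ends the round
                break
        else:
            ctx.append(target_next(ctx))    # all accepted: pass yields a bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n_tokens], passes

baseline, base_calls = greedy_decode([1, 2, 3], 20)
tokens, spec_calls = speculative_decode([1, 2, 3], 20)
print(tokens == baseline, base_calls, spec_calls)
```

The output sequence is identical to plain greedy decoding; only the number of large-model passes changes, and the saving depends entirely on how often the draft agrees with the target.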
Summary
ML Inference Challenges
GPU computation is fast, but:
1. CPU-GPU data transfer is slow
2. CPU preprocessing is slow
3. Small batches can't fully utilize GPU
Optimization Strategies
| Problem | Solution |
|---|---|
| Low GPU utilization | Batching, dynamic batching |
| Slow data transfer | Pinned memory, CUDA streams |
| Model too large | Quantization (FP16/INT8), distillation |
| Preprocessing bottleneck | DALI (GPU preprocessing) |
| Framework overhead | TensorRT, ONNX Runtime |
LLM Special Optimizations
- KV cache management (PagedAttention)
- Speculative decoding
- Continuous batching
Tools
- Profiling: Nsight Systems, PyTorch Profiler
- Runtime: TensorRT, ONNX Runtime, vLLM
- Serving: Triton Inference Server
Remember
Theoretical TFLOPS ≠ actual performance
Real bottlenecks are usually:
1. Data movement
2. Memory bandwidth
3. Software overhead
Optimization order:
Pipeline → Batching → Quantization → Model architecture
Chapter 33: How to Benchmark
Part IX: Synthesis
"The only thing worse than no data is bad data." — W. Edwards Deming (attributed)
The Perfect Checklist
The story happened on a Friday afternoon.
My colleague Emily had just joined the performance analysis team. She spent an entire week running benchmarks and prepared a professional-looking report: beautiful charts, detailed data, clear conclusions.
"This is the performance analysis of the new version," she said confidently. "It's 23% faster than the old version."
I looked at her report and asked one question: "How many times did you run it?"
"Once," she said. "The results were stable, and the charts look nice."
"Did you do warm-up?"
"What's warm-up?"
I sighed. This is a mistake every newcomer makes. Not because they're not smart, but because benchmarking looks too simple—run a program once, record the time, done.
But in reality, correct benchmarking is a science that requires rigorous methodology.
This chapter consolidates everything we've learned in the previous chapters into a complete "How to Benchmark" guide.
The Benchmarking Checklist
Years of experience tell me that good benchmarks need to answer these questions:
┌─────────────────────────────────────────────────────────────────┐
│ Benchmarking Checklist │
├─────────────────────────────────────────────────────────────────┤
│ □ 1. What are you measuring? (clearly define the metric) │
│ □ 2. Is the environment controlled? (fixed freq, no turbo) │
│ □ 3. Did you warm up? (let cache, branch predictor stabilize) │
│ □ 4. How many runs? (N ≥ 10, preferably ≥ 30) │
│ □ 5. Are statistics complete? (median, stddev, CI) │
│ □ 6. Are results reproducible? (can someone else get same data) │
│ □ 7. Is comparison fair? (same env, same load, same method) │
└─────────────────────────────────────────────────────────────────┘
Let's analyze each one.
Step 1: Clearly Define What You're Measuring
This sounds obvious, but it's the most commonly overlooked step.
Wrong example:
"I want to test how fast my program is."
This sentence is meaningless. What does "fast" mean?
Correct example:
"I want to measure the average latency (in nanoseconds)
of a single hash table lookup with 10,000 key-value pairs."
A clear metric definition should include:
| Element | Example |
|---|---|
| Operation | Hash table lookup |
| Scale | 10,000 entries |
| Unit | Nanoseconds per lookup |
| Statistic | Median with 95% confidence interval |
Step 2: Control the Test Environment
Environmental variation is the main source of unstable benchmark results.
Linux Environment Setup
# 1. Fix CPU frequency
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 2. Isolate CPU cores (avoid scheduler interference)
# In grub config: isolcpus=2,3
taskset -c 2 ./benchmark # Bind to isolated core
# 3. Disable ASLR (reduce variance from address randomization)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# 4. Clear page cache (if testing I/O)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
Environment Checklist
□ CPU frequency: Fixed (not powersave/ondemand)
□ Turbo boost: Disabled
□ Hyperthreading: Decide based on test purpose
□ Background processes: Minimized
□ NUMA: Confirm memory affinity
□ Temperature: Stable (avoid thermal throttling)
Step 3: Warm-up — The Most Forgotten Step
The first time any code executes, there's a lot of "cold start" overhead:
- Instruction cache miss: Code not yet loaded into cache
- Data cache miss: Data not yet loaded into cache
- Branch predictor: Hasn't learned branch patterns yet
- Page fault: Memory pages not yet mapped
- JIT compilation: If VM language, first run needs compilation
If you measure first execution time, you're measuring "cold start performance," not "steady state performance."
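The separation between cold-start and steady-state runs can be sketched as a small harness. This is an illustrative Python version under stated assumptions: the function name and default counts are hypothetical, and at nanosecond scale the interpreter and timer overhead will dominate, so treat it as a pattern rather than a precise instrument:

```python
import time

def run_benchmark(op, warmup=1000, measured=10000):
    """Call op() `warmup` times untimed, then time `measured` calls (ns each)."""
    for _ in range(warmup):          # Phase 1: warm caches, predictors, JIT
        op()
    samples = []
    for _ in range(measured):        # Phase 2: timed runs
        start = time.perf_counter_ns()
        op()
        samples.append(time.perf_counter_ns() - start)
    return samples

lookup_table = {i: i * i for i in range(10_000)}
samples = run_benchmark(lambda: lookup_table[4242], warmup=100, measured=1000)
print(len(samples), "samples, min", min(samples), "ns")
```

To measure cold-start performance deliberately, set `warmup=0` and report that result separately from the steady-state numbers.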
Warm-up Strategy
#include <stdint.h>
#define WARMUP_ITERATIONS 1000
#define MEASURED_ITERATIONS 10000
void benchmark(void) {
    // Phase 1: Warm-up (discard these results)
    for (int i = 0; i < WARMUP_ITERATIONS; i++) {
        operation_under_test();
    }
    // Phase 2: Measurement
    uint64_t times[MEASURED_ITERATIONS];
    for (int i = 0; i < MEASURED_ITERATIONS; i++) {
        uint64_t start = get_cycles();
        operation_under_test();
        uint64_t end = get_cycles();
        times[i] = end - start;
    }
}
Step 4: Statistics — Running Once Is Not Enough
This is the mistake Emily made: running only once.
Why isn't once enough?
Even in a perfectly controlled environment, measurements still have variance:
- Minor OS scheduler interference
- Cache state differences
- Hardware timer precision limits
- Power management adjustments
Minimum Sample Size
| Purpose | Recommended Sample Size (N) |
|---|---|
| Quick check | N ≥ 10 |
| Formal report | N ≥ 30 |
| Publication | N ≥ 100 |
Choose the Right Statistics
❌ Only report mean
"Average latency: 150 ns"
✅ Report complete statistics
"Latency: median = 145 ns, mean = 152 ns
stddev = 23 ns, 95% CI = [141, 163] ns
min = 120 ns, max = 310 ns"
Why use median instead of mean?
Outliers distort the mean. If 99 measurements are 100 ns but one is 10,000 ns (say, because of a context switch), the mean jumps to 199 ns, nearly double the typical value, while the median stays at 100 ns. The median is robust to such outliers.
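The 99-plus-1 example can be checked directly. The sketch below computes the full set of statistics recommended above with the standard library; the function name is hypothetical, and the confidence interval uses a normal approximation for the mean, which is only one of several reasonable choices:

```python
import math
import statistics

def summarize(samples):
    """Median, mean, sample stddev, and a normal-approximation 95% CI of the mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    half_width = 1.96 * stdev / math.sqrt(n)
    return {
        "n": n,
        "median": statistics.median(samples),
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
        "min": min(samples),
        "max": max(samples),
    }

# 99 clean measurements plus one context-switch outlier
samples = [100] * 99 + [10_000]
stats = summarize(samples)
print(stats["median"], stats["mean"])
```

Reporting `min` and `max` next to the median keeps the outlier visible instead of hiding it.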
Step 5: What If Variance Is Too High?
If your coefficient of variation (CV = stddev / mean) exceeds 5%, results may be unreliable.
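The 5% rule of thumb is easy to automate. A minimal sketch with hypothetical function names and made-up sample values, using the sample standard deviation:

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample stddev / mean, as a fraction (0.05 means 5%)."""
    return statistics.stdev(samples) / statistics.fmean(samples)

def is_stable(samples, max_cv=0.05):
    """True when run-to-run spread is within the 5% rule of thumb."""
    return coefficient_of_variation(samples) <= max_cv

quiet = [145, 147, 143, 146, 144, 148, 145, 146, 144, 147]
noisy = [145, 147, 143, 210, 144, 148, 145, 260, 144, 147]
print(is_stable(quiet), is_stable(noisy))
```

Two slow readings out of ten are enough to push the second series far past the threshold.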
Diagnostic Steps
1. Check environment
- Is CPU frequency changing?
- Are background processes running?
- Is thermal throttling occurring?
2. Check program
- Is there dynamic memory allocation? (malloc/free has high variance)
- Are there I/O operations?
- Are there system calls?
3. Check measurement method
- Is timer resolution sufficient?
- Is there timer wrap-around?
Techniques to Reduce Variance
// 1. Use inline assembly barrier to prevent compiler reordering
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
// 2. Use CPU cycle counter instead of wall clock
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
asm volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
return ((uint64_t)hi << 32) | lo;
}
// 3. Pin thread to specific CPU
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(2, &cpuset); // Use CPU 2
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
Step 6: Fair Comparison
When comparing two systems or algorithms, you must ensure "all else being equal."
Common Unfair Comparisons
| Trap | Problem |
|---|---|
| Different compilers | GCC vs Clang optimize differently |
| Different optimization levels | -O0 vs -O3 huge difference |
| Different data sizes | Small data fits in cache, large doesn't |
| Different hardware | Comparing different CPUs needs normalization |
| Different warm-up | One has warm-up, one doesn't |
Correct Approach
# Ensure same compilation environment
gcc --version # Record version
CFLAGS="-O3 -march=native" # Same optimization options
# Ensure same execution environment
uname -r # Record kernel version
cat /proc/cpuinfo | grep "model name" # Record CPU
# Ensure same test data
sha256sum test_data.bin # Verify data integrity
Step 7: Document Everything
Your report should allow another person to reproduce your results.
Report Template
## Test Environment
### Hardware
- CPU: Intel Core i7-12700K @ 3.6 GHz (fixed, turbo disabled)
- Memory: 32 GB DDR5-4800
- Storage: Samsung 980 Pro NVMe
### Software
- OS: Ubuntu 22.04 LTS (kernel 6.5.0)
- Compiler: GCC 12.3.0
- Flags: -O3 -march=native -flto
### Environment Settings
- CPU governor: performance
- Turbo boost: disabled
- Hyperthreading: disabled
- ASLR: disabled
- Isolated cores: 2-3
## Methodology
- Warm-up: 1,000 iterations
- Measured: 10,000 iterations
- Repetitions: 30 independent runs
- Statistics: median with 95% CI
## Results
| Metric | Value | 95% CI |
|--------|-------|--------|
| Latency | 145 ns | ±8 ns |
| Throughput | 6.9 M ops/s | ±0.3 M |
## Reproduction Steps
git clone <repo> && cd benchmark
./setup_env.sh # Setup environment
./run_benchmark.sh # Run tests
The Anti-Patterns — Things to Never Do
1. Cherry-picking
❌ Ran 10 times, only report the best one
✅ Report statistical summary of all results
2. Hiding Variance
❌ "5% faster" (when variance is actually 20%)
✅ "5% ± 3% faster, statistically significant"
3. Unfair Baseline
❌ Compare your optimized program vs competitor's default config
✅ Both use default config, or both are optimized
4. Ignoring Cold Start
❌ Measurement includes first execution (cache miss, page fault)
✅ Clearly distinguish cold start and steady state performance
Emily's Story Ending
I helped Emily redesign her benchmark:
- Added warm-up: 1000 iterations
- Increased sample size: From 1 to 30 runs
- Fixed environment: Fixed CPU frequency, disabled turbo
- Calculated statistics: median, stddev, CI
New results:
Old conclusion: "New version is 23% faster"
New conclusion: "New version median latency reduced by 18%
(95% CI: 15% - 21%)
All results consistent across N=30 measurements
Performance improvement statistically significant (p < 0.001)"
The number got smaller, but more credible.
Summary
The Benchmarking Checklist:
- Define clearly: What metric, what scale, what unit
- Control environment: Fixed frequency, isolated cores, minimal noise
- Warm up: Don't measure cold start unless that's what you want
- Run enough times: N ≥ 30 for serious work
- Report statistics: Median, stddev, confidence interval
- Compare fairly: Same environment, same workload, same methodology
- Document everything: Make it reproducible
The Golden Rule:
If someone cannot reproduce your benchmark results, your benchmark is worthless.
Next chapter, we'll discuss how to systematically perform performance optimization once you have correct data.
Chapter 34: How to Optimize
Part IX: Synthesis
"Premature optimization is the root of all evil." — Donald Knuth
"But so is no optimization at all." — Anonymous engineer
The 3 AM Flame Graph
It was the night before a deadline.
Our API server suddenly slowed down—P99 latency spiked from 50ms to 500ms. The service was about to violate SLA.
The team lead looked at me and said: "You're the performance expert. Figure it out."
I opened the Flame Graph and stared at that "volcano" for five minutes.
One function took 47% of CPU time: json_parse().
"Found it," I said. "JSON parsing is the bottleneck."
"Great!" The lead relaxed. "Then optimize it."
I shook my head: "No. We shouldn't optimize it."
The First Rule of Optimization: Ask "Why" First
Before optimizing anything, always ask yourself three questions:
┌─────────────────────────────────────────────────────────────────┐
│ Before Optimizing, Ask: │
├─────────────────────────────────────────────────────────────────┤
│ 1. Does this operation really need to exist? │
│ (The fastest code is the code that never runs) │
│ │
│ 2. Can this operation be done fewer times? │
│ (Caching, batching, lazy evaluation) │
│ │
│ 3. Can this operation be done in a simpler way? │
│ (Simpler algorithm, different data structure) │
└─────────────────────────────────────────────────────────────────┘
Back to the JSON parsing case:
Why was json_parse() so slow? Because we were parsing the same config file for every request.
The solution wasn't to optimize parsing—the solution was to cache the config, parse it only once.
# ❌ Before: parse for every request
def handle_request():
    config = json_parse(open("config.json").read())
    # ...

# ✅ After: parse once at startup
CONFIG = json_parse(open("config.json").read())

def handle_request():
    config = CONFIG
    # ...
This change reduced P99 latency from about 500 ms to under 50 ms. We didn't make any code faster; we just stopped doing unnecessary work.
The Optimization Workflow
After years of experience, I've summarized a systematic optimization process:
┌───────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├───────────────────────────────────────────────────────────────┤
│ │
│ 1. Measure │
│ └── Establish baseline, quantify "how slow is it now" │
│ │
│ 2. Profile │
│ └── Find bottleneck, understand "why is it slow" │
│ │
│ 3. Analyze │
│ └── Understand root cause, ask "can we avoid it?" │
│ │
│ 4. Hypothesize │
│ └── Make prediction, estimate "how much faster" │
│ │
│ 5. Implement │
│ └── Implement the optimization │
│ │
│ 6. Measure Again │
│ └── Verify effect, confirm "did it actually get faster" │
│ │
│ 7. Document │
│ └── Record changes, explain "why we did this" │
│ │
└───────────────────────────────────────────────────────────────┘
This isn't linear—you usually cycle through multiple times.
Step 1: Measure — Establish Baseline
Before optimizing, you must know "how slow is it now." Without a baseline, you can't judge if optimization worked.
Key Points for Recording Baseline
Baseline Report (Example)
=========================
Date: 2025-12-18
Commit: abc123
Metric | Value | Notes
--------------------|------------|------------------
P50 latency | 45 ms |
P99 latency | 520 ms | ← This is the problem
Throughput | 1,200 req/s|
CPU usage | 78% |
Memory usage | 2.1 GB |
Important: Record the git commit hash. Later you'll need to compare "before vs after optimization," and you must ensure you're comparing specific versions.
Step 2: Profile — Find the Bottleneck
Profiling is the "microscope" for finding bottlenecks. Different types of bottlenecks need different tools.
Bottleneck Classification and Tool Selection
| Bottleneck Type | Symptoms | Recommended Tools |
|---|---|---|
| CPU-bound | High CPU usage, high latency | perf, Flame Graph |
| Memory-bound | High cache miss, low IPC | perf stat, Cachegrind |
| I/O-bound | High CPU idle, high I/O wait | iostat, strace |
| Lock contention | Low CPU usage but high latency | perf lock, Off-CPU Flame Graph |
Flame Graph Reading Tips
Step 3: Analyze — Understand Root Cause
After finding the hotspot, the next step is understanding "why is it slow."
Common Bottleneck Patterns
| Pattern | Signs | Root Cause |
|---|---|---|
| Repeated computation | Same function called too many times | Missing caching |
| Inefficient algorithm | O(n²) slow on large data | Need better algorithm |
| Cache miss | High L3 miss rate | Data structure not cache-friendly |
| Branch misprediction | High branch-miss rate | Data-dependent branches |
| Lock contention | Multiple threads waiting for same lock | Lock granularity too coarse |
Diagnostic perf Commands
# View overall performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./program
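# Example output interpretation
# 3,000,000,000 cycles
# 1,500,000,000 instructions # IPC = 0.5 (low, possibly memory-bound)
# 50,000,000 cache-misses # 3.3% of instructions (might be a problem)
# 5,000,000 branch-misses # 0.3% of instructions (normal)
# Note: these ratios are misses per instruction. A true miss rate needs
# cache-references and branches as the denominators.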
IPC (Instructions Per Cycle) Meaning:
| IPC | Possible Cause |
|---|---|
| < 0.5 | Severely memory-bound, CPU waiting for data |
| 0.5-1.0 | Possibly cache miss or branch miss |
| 1.0-2.0 | Reasonably normal |
| > 2.0 | Good, fully utilizing hardware |
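The derived metrics can be computed from the raw counters in a few lines. This sketch uses the example counts from the perf output above; the function names are hypothetical, the misses are normalized per thousand instructions (MPKI) since no reference counts are available, and the IPC bands are the rough ones from the table, not universal thresholds:

```python
def derived_metrics(cycles, instructions, cache_misses, branch_misses):
    """IPC plus misses per thousand instructions (MPKI) from raw perf counts."""
    return {
        "ipc": instructions / cycles,
        "cache_mpki": cache_misses / instructions * 1000,
        "branch_mpki": branch_misses / instructions * 1000,
    }

def classify_ipc(ipc):
    """Map an IPC value onto the rough bands in the table above."""
    if ipc < 0.5:
        return "severely memory-bound"
    if ipc < 1.0:
        return "possible cache or branch misses"
    if ipc <= 2.0:
        return "reasonably normal"
    return "good"

m = derived_metrics(3_000_000_000, 1_500_000_000, 50_000_000, 5_000_000)
print(m, classify_ipc(m["ipc"]))
```

The bands depend on the microarchitecture's issue width, so they are best used to compare runs on the same machine.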
Step 4: Hypothesize — Predict the Effect
Before implementing optimization, predict "how much will this change improve performance."
This is important because:
- Avoid wasting time: If prediction is only 1% improvement, might not be worth it
- Verify understanding: If actual effect differs greatly from prediction, analysis was wrong
- Amdahl's Law: Overall speedup is limited by the proportion of the optimized part
Amdahl's Law
1
Speedup = ────────────────────────
(1 - p) + p/s
p = proportion of time spent in optimized part
s = speedup factor for that part
Example:
If json_parse() takes 40% of time (p = 0.4), and we make it 10x faster (s = 10):
Speedup = 1 / (0.6 + 0.4/10)
= 1 / (0.6 + 0.04)
= 1 / 0.64
= 1.56x
Even if we make JSON parsing 10x faster, overall is only 56% faster. This is the harsh reality of Amdahl's Law.
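Amdahl's Law is worth keeping as a one-line helper so predictions are cheap to make. A minimal sketch with a hypothetical function name, evaluated for the p = 0.4 case above at several speedup factors:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is made s times faster."""
    if not 0.0 <= p <= 1.0 or s <= 0:
        raise ValueError("need 0 <= p <= 1 and s > 0")
    return 1.0 / ((1.0 - p) + p / s)

for s in (2, 10, 100, 1_000_000):
    print(f"s = {s:>9}: {amdahl_speedup(0.4, s):.2f}x")
# As s grows the result approaches 1 / (1 - p), about 1.67x for p = 0.4
```

The limit matters as much as the formula: with p = 0.4, no amount of work on that part can deliver more than about 1.67x overall.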
Step 5: Implement — Optimization Strategies
Hierarchical Optimization Strategy
Optimization should proceed from "high level" to "low level." High-level optimizations usually have the most significant effects:
╔═══════════════════════════════════════════════════════════════╗
║ Optimization Hierarchy ║
╠═══════════════════════════════════════════════════════════════╣
║ ║
║ Level 1: Architecture / Design 10-100x║
║ ├─ Better algorithm (O(n²) → O(n log n)) ║
║ ├─ Eliminate unnecessary operations (caching, lazy eval) ║
║ └─ More suitable data structure ║
║ ║
║ Level 2: Algorithm / Data Structure 2-10x ║
║ ├─ Cache-friendly data layout ║
║ ├─ Reduce memory allocation ║
║ └─ Batch processing ║
║ ║
║ Level 3: Implementation 1-3x ║
║ ├─ Avoid unnecessary copies ║
║ ├─ Use faster library ║
║ └─ Reduce branches ║
║ ║
║ Level 4: Low-level 1-2x ║
║ ├─ SIMD vectorization ║
║ ├─ Reduce cache miss ║
║ └─ Compiler optimization options ║
║ ║
╚═══════════════════════════════════════════════════════════════╝
Common Optimization Techniques
Caching / Memoization
// ❌ Before: compute every time
int fib(int n) {
if (n <= 1) return n;
return fib(n-1) + fib(n-2); // O(2^n)
}
// ✅ After: remember computed results
int cache[100] = {0};
int fib(int n) {
if (n <= 1) return n;
if (cache[n] != 0) return cache[n];
cache[n] = fib(n-1) + fib(n-2);
return cache[n]; // O(n)
}
Batching
// ❌ Before: write one at a time
for (int i = 0; i < 1000; i++) {
write_to_disk(data[i]); // 1000 syscalls
}
// ✅ After: batch write
buffer_add(data, 1000);
write_to_disk(buffer); // 1 syscall
Step 6: Measure Again — Verify the Effect
This is the most critical step. Don't assume "it should be faster"—prove it with data.
Comparison Report Example
Optimization: Cache config.json parsing
Commit: def456 (after) vs abc123 (before)
Metric | Before | After | Change
--------------------|------------|------------|--------
P50 latency | 45 ms | 42 ms | -6.7%
P99 latency | 520 ms | 48 ms | -90.8% ✓
Throughput | 1,200 req/s| 1,850 req/s| +54.2% ✓
CPU usage | 78% | 45% | -42.3% ✓
If the effect is not as expected, go back to Step 3 and re-analyze. Your hypothesis might be wrong.
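The "Change" column in the report is a plain relative difference, and computing it in code avoids sign mistakes. A short sketch with hypothetical names, using the before/after values from the report:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# (metric, before, after, lower_is_better) from the report above
ROWS = [
    ("P50 latency (ms)", 45, 42, True),
    ("P99 latency (ms)", 520, 48, True),
    ("Throughput (req/s)", 1200, 1850, False),
    ("CPU usage (%)", 78, 45, True),
]

for name, before, after, lower_is_better in ROWS:
    change = pct_change(before, after)
    improved = change < 0 if lower_is_better else change > 0
    print(f"{name:20s} {change:+6.1f}%  {'improved' if improved else 'regressed'}")
```

Carrying a `lower_is_better` flag per metric prevents a throughput gain from being flagged as a regression.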
Step 7: Document — Leave a Record
Optimization knowledge must be passed on. The next person (including yourself six months later) needs to understand this change.
The Anti-Patterns — Don't Do This
1. Optimizing Without Measuring
❌ "I feel this function is slow, let me optimize it first"
✅ "Profiler shows this function takes 30% CPU, worth optimizing"
2. Optimizing the Wrong Place
❌ Spend three days optimizing a function that only takes 2% of time
✅ Focus on the hotspot that takes 80% of time
3. Over-optimizing
❌ Write unmaintainable code to save 1% time
✅ Balance between "performance" and "maintainability"
4. Forgetting to Verify
❌ "I added cache, it should be faster" (no measurement)
✅ "After adding cache, P99 dropped from 520ms to 48ms" (with data)
Summary
The Optimization Workflow:
- Measure: Establish baseline with concrete numbers
- Profile: Find the bottleneck with proper tools
- Analyze: Understand the root cause (ask "why?")
- Hypothesize: Predict the improvement using Amdahl's Law
- Implement: Start from architecture, then algorithm, then low-level
- Measure Again: Verify with data, not assumptions
- Document: Leave knowledge for future engineers
The Golden Rules:
"Don't guess. Measure."
"The fastest code is the code that never runs."
"Optimize the bottleneck, not the convenient thing."
Next chapter, we'll discuss how to automate these performance tests and integrate them into CI/CD pipelines.
Chapter 35: CI/CD for Performance
Part IX: Synthesis
"What gets measured gets managed." — Peter Drucker
"What gets automated gets repeated." — DevOps wisdom
The "Nobody Noticed" Performance Regression
Six months ago, our API latency was 50ms.
Today, it's 150ms.
Nobody noticed. No alerts. No tickets.
How did this happen?
I ran git log and found 847 commits in those six months. Each commit degraded performance by an average of 0.12 ms—a difference imperceptible to humans.
But accumulated: 847 × 0.12 ms ≈ 100 ms.
This is the horror of "Gradual Performance Regression." Like boiling a frog slowly, it gets a little slower each day, until one day customers start complaining.
The solution? Integrate performance testing into the CI/CD pipeline, so every commit goes through performance checks.
Why Performance CI/CD Is Needed
┌─────────────────────────────────────────────────────────────────┐
│ Why Automate Performance Testing? │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Catch regressions early │
│ └── Find problems before PR merge, not in production │
│ │
│ 2. Track trends over time │
│ └── Historical data reveals gradual degradation │
│ │
│ 3. Reproducible measurements │
│ └── Fixed environment eliminates "fast on my machine" │
│ │
│ 4. Shift left │
│ └── Find early, fix early, lower cost │
│ │
└─────────────────────────────────────────────────────────────────┘
The Performance CI Pipeline
A complete performance CI pipeline includes these stages:
┌──────────────────────────────────────────────────────────────────┐
│ Performance CI Pipeline │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Trigger │───▶│ Setup │───▶│ Run │───▶│ Compare │ │
│ │ (PR) │ │ Env │ │ Bench │ │ Results │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Store │ │ Report │ │
│ │ Data │ │ (PR) │ │
│ └─────────┘ └─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Alert │ │
│ │ (Slack) │ │
│ └─────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Step 1: Dedicated Test Environment
This is the most important step.
Why Not Use GitHub-hosted Runners?
┌─────────────────────────────────────────────────────────────────┐
│ Shared vs Dedicated Infrastructure │
├────────────────────────────────────┬────────────────────────────┤
│ Shared Cloud Runner │ Dedicated Machine │
├────────────────────────────────────┼────────────────────────────┤
│ ❌ CPU "steal time" uncontrollable │ ✅ Full control of hardware│
│ ❌ Interference from other tenants │ ✅ No external interference│
│ ❌ VM config may differ each time │ ✅ Completely consistent │
│ ❌ Cannot fix CPU frequency │ ✅ Can lock turbo, governor│
│ ❌ Variance up to 20-50% │ ✅ Variance controlled 1-3%│
└────────────────────────────────────┴────────────────────────────┘
Environment Setup Script
#!/bin/bash
# setup_perf_env.sh - Setup performance test environment
# 1. Fix CPU frequency
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 2. Disable ASLR
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# 3. Clear page cache
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# 4. Set CPU affinity (isolate cores 2-3 for testing)
echo "Benchmark will run on isolated CPUs 2-3"
# 5. Verify settings
echo "Environment configured:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /proc/sys/kernel/randomize_va_space
Step 2: Benchmark Suite Design
Not all tests are suitable for CI. Balance run time against coverage:
| Type | Time | Purpose |
|---|---|---|
| Smoke tests | < 1 min | Every commit, quick feedback |
| Core benchmarks | 5-15 min | Every PR, critical paths |
| Full suite | 30-60 min | Nightly, complete coverage |
| Soak tests | Hours | Weekly, find memory leaks |
Benchmark Code Example (Go)
// benchmark_test.go
func BenchmarkHashLookup(b *testing.B) {
table := buildHashTable(10000)
keys := generateRandomKeys(1000)
b.ResetTimer()
for i := 0; i < b.N; i++ {
for _, key := range keys {
Setting Thresholds
# .github/perf-thresholds.yml
thresholds:
  # Change relative to baseline
  regression_threshold: 5% # Fail if regression exceeds 5%
  improvement_threshold: 10% # Manual review if improvement exceeds 10%
  # Absolute limits
  max_latency_p99: 100ms
  min_throughput: 1000 req/s
  # Statistical significance
  min_samples: 30
  confidence_level: 0.95
Statistical Significance Testing
Don't just compare means! Use statistical tests to determine if differences are significant:
import numpy as np
from scipy import stats

def is_significant_regression(baseline, current, threshold=0.05):
    """
    Use a one-sided Mann-Whitney U test to detect a significant regression.
    Assumes lower is better (for example, latency samples).
    """
    # Mann-Whitney U test (non-parametric).
    # alternative='less' tests whether baseline is stochastically less than
    # current, that is, whether current got slower (regression).
    statistic, p_value = stats.mannwhitneyu(
        baseline, current,
        alternative='less'
    )
    if p_value < threshold:
        # Calculate effect size
        median_diff = np.median(current) - np.median(baseline)
        pct_diff = median_diff / np.median(baseline) * 100
        return True, pct_diff, p_value
    return False, 0, p_value
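The relative thresholds from the configuration file can be applied with a small gate that runs alongside the significance test. This is a sketch under stated assumptions: the function name and return labels are hypothetical, the metric is a latency (lower is better), and the defaults mirror the 5% regression, 10% improvement, and 30-sample settings shown earlier:

```python
import statistics

def gate(baseline, current, regression_pct=5.0, improvement_pct=10.0, min_samples=30):
    """Classify a latency change as 'insufficient-data', 'fail', 'review', or 'pass'."""
    if len(baseline) < min_samples or len(current) < min_samples:
        return "insufficient-data"
    base = statistics.median(baseline)
    change_pct = (statistics.median(current) - base) / base * 100.0
    if change_pct > regression_pct:
        return "fail"            # slower than the regression threshold allows
    if change_pct < -improvement_pct:
        return "review"          # suspiciously large improvement: check by hand
    return "pass"

baseline = [100.0] * 30
for label, current in [("+7%", [107.0] * 30), ("+2%", [102.0] * 30),
                       ("-15%", [85.0] * 30), ("few samples", [100.0] * 5)]:
    print(label, "->", gate(baseline, current))
```

Flagging large improvements for review catches benchmarks that got "faster" because they stopped doing the work.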
Step 4: GitHub Actions Integration
Complete Workflow Example
# .github/workflows/performance.yml
name: Performance Tests
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * *' # Nightly at 2 AM (UTC)
jobs:
  benchmark:
    runs-on: [self-hosted, perf-runner] # Dedicated runner
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history for comparison
      - name: Setup environment
        run: |
          sudo ./scripts/setup_perf_env.sh
      - name: Build
        run: |
          make build-release
      - name: Run benchmarks
        run: |
          ./scripts/run_benchmarks.sh --output results.json
      - name: Compare with baseline
        id: compare
        run: |
          python scripts/compare_results.py \
            --current results.json \
            --baseline benchmarks/baseline.json \
            --threshold 5 \
            --output comparison.md
      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('comparison.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
      - name: Fail on regression
        if: steps.compare.outputs.regression == 'true'
        run: exit 1
Step 5: Long-term Tracking and Visualization
Trend Chart
HashLookup Latency (ns) - Last 30 Days
160 ┤
│
150 ┤ ╭─╮
│ ╭╯ ╰╮
140 ┤───╯ ╰──────────────────────────────────── baseline
│
130 ┤ ╭──────────────────────╮
│ ╭╯ ╰─
120 ┤ ╭╯
│ ╭╯
110 ┼─────────────────┴─────────────────────────────
└──────────────────────────────────────────────▶
Day 1 Day 30
Common Pitfalls and Solutions
1. Flaky Benchmarks
Problem: Same commit, benchmark results differ each time
Solutions:
- Increase warm-up iterations
- Increase sample size
- Use median instead of mean
- Set variance threshold, re-run if exceeded
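The last item, re-running when variance is too high, can be sketched as follows. This is an illustrative helper, not a standard API: `measure_until_stable`, its parameters, and the 5% coefficient-of-variation limit are all assumptions chosen for the example.

```python
import statistics

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def measure_until_stable(run_once, samples_per_round=10, max_cv=0.05, max_rounds=3):
    """Collect rounds of samples until one round has CV <= max_cv.

    run_once is a zero-argument callable returning one timing.
    Returns (median, cv, stable); stable is False if every round was noisy.
    """
    for _ in range(max_rounds):
        samples = [run_once() for _ in range(samples_per_round)]
        cv = coefficient_of_variation(samples)
        if cv <= max_cv:
            return statistics.median(samples), cv, True
    # Report the last round, flagged as unstable, instead of hiding the noise
    return statistics.median(samples), cv, False
```

Returning the `stable` flag instead of raising lets the CI job decide whether an unstable measurement should fail the build or only produce a warning.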
2. Environment Drift
Problem: Runner's OS update invalidates baseline
Solutions:
- Use Docker containers to fix environment
- Periodically rebuild baseline
- Record environment fingerprint
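An environment fingerprint can be as simple as a small dictionary of host properties stored next to each result. The sketch below uses only the Python standard library. The choice of fields and the function names are assumptions for illustration; a real setup would probably add compiler version, CPU model, and governor settings.

```python
import hashlib
import json
import os
import platform

def environment_fingerprint():
    """Collect a few host properties that commonly affect benchmark results."""
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "kernel": platform.release(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
    }

def fingerprint_digest(fingerprint):
    """Stable short hash of a fingerprint dict, suitable for storing with results."""
    canonical = json.dumps(fingerprint, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def fingerprint_diff(baseline_fp, current_fp):
    """Return the keys whose values differ (an empty list means comparable)."""
    keys = sorted(set(baseline_fp) | set(current_fp))
    return [k for k in keys if baseline_fp.get(k) != current_fp.get(k)]
```

When `fingerprint_diff` returns a non-empty list, the baseline was probably recorded in a different environment and should be rebuilt before any comparison is trusted.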
3. Over-sensitivity
Problem: 1% change triggers alert, too many false positives
Solutions:
- Raise threshold (5% is reasonable starting point)
- Use statistical significance testing
- Set cooldown period
4. Test Time Too Long
Problem: Full benchmark suite takes 2 hours
Solutions:
- Layer: smoke test (every commit) + full suite (nightly)
- Only run affected benchmarks
- Parallelize execution
Summary
Performance CI/CD Checklist:
- Dedicated environment: Use self-hosted runners with fixed configuration
- Layered testing: Smoke tests for every commit, full suite nightly
- Statistical comparison: Don't just compare means, use proper tests
- Automated reporting: Comment on PRs with clear results
- Historical tracking: Store data for trend analysis
- Smart alerting: Balance sensitivity with noise
The Key Insight:
Performance is not a feature you add at the end. It's a property you maintain continuously.
The Goal:
Every commit is performance-tested.
Every regression is caught before merge.
Every trend is visible to the team.
Next chapter, we'll enter Part VI and explore future trends and emerging technologies in performance analysis.
Appendix A: Benchmark Automation
"If you can't automate it, you can't scale it." — DevOps Proverb
Why Automation Is Needed
Manual benchmark execution has several problems:
Problems:
1. Human error (forgetting to set environment, wrong parameters)
2. Non-reproducible (different conditions each run)
3. Time-consuming (requires manual waiting and recording)
4. Hard to track (results scattered everywhere)
Solution:
Automate benchmark workflow
Integrate into CI/CD pipeline
Automatically detect performance regressions
CI/CD Integration
GitHub Actions Example
# .github/workflows/benchmark.yml
name: Performance Benchmark
on:
push:
branches: [main]
pull_request:
branches: [main]
schedule:
- cron: '0 2 * * *' # Daily at 2 AM
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup environment
run: |
# Lock CPU frequency (if possible)
sudo cpupower frequency-set -g performance || true
# Disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo || true
- name: Build
run: |
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
- name: Run benchmarks
run: |
cd build
./benchmark --benchmark_format=json > benchmark_results.json
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: build/benchmark_results.json
- name: Compare with baseline
run: |
python scripts/compare_benchmarks.py \
--baseline baseline.json \
--current build/benchmark_results.json \
--threshold 5
Dedicated Benchmark Runner
For serious performance testing, use a dedicated machine:
# Using self-hosted runner
jobs:
benchmark:
runs-on: [self-hosted, benchmark-machine]
steps:
- name: Ensure isolation
run: |
# Ensure no other programs running
sudo systemctl stop cron
sudo systemctl stop unattended-upgrades
# Set CPU affinity
taskset -c 0-3 ./benchmark
Google Benchmark Integration
Basic Setup
#include <benchmark/benchmark.h>
#include <vector>
static void BM_VectorPushBack(benchmark::State& state) {
for (auto _ : state) {
std::vector<int> v;
for (int i = 0; i < state.range(0); ++i) {
v.push_back(i);
}
}
state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_VectorPushBack)
->Range(8, 8<<10)
->Complexity(benchmark::oN);
BENCHMARK_MAIN();
JSON Output
# Output JSON format
./benchmark --benchmark_format=json --benchmark_out=results.json
# Example output
{
"context": {
"date": "2024-01-15T10:30:00+08:00",
"host_name": "benchmark-server",
"executable": "./benchmark",
"num_cpus": 8,
"mhz_per_cpu": 3600,
"cpu_scaling_enabled": false
},
"benchmarks": [
{
"name": "BM_VectorPushBack/8",
"real_time": 45.2,
"cpu_time": 44.8,
"iterations": 15234567
}
]
}
Comparison Tool
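# Use Google Benchmark's comparison tool (tools/compare.py in its source tree)
# Assumes the Google Benchmark repository is checked out in ./benchmark
pip3 install -r benchmark/tools/requirements.txt
# Compare two runs
python3 benchmark/tools/compare.py benchmarks baseline.json current.json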
# Output
Benchmark Time CPU
--------------------------------------------
BM_VectorPushBack/8 -0.05 -0.04
BM_VectorPushBack/64 +0.12 +0.11 # Warning: regression
BM_VectorPushBack/512 -0.02 -0.03
Regression Detection
Statistical Method
import numpy as np
from scipy import stats
def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
"""
Use statistical test to detect performance regression
Args:
baseline: List of baseline measurements
current: List of current measurements
threshold: Acceptable performance change ratio
alpha: Significance level
Returns:
(is_regression, p_value, change_percent)
"""
# Calculate percent change
baseline_mean = np.mean(baseline)
current_mean = np.mean(current)
change_percent = (current_mean - baseline_mean) / baseline_mean * 100
# Perform t-test
t_stat, p_value = stats.ttest_ind(baseline, current)
# Determine if significant regression
is_regression = (
p_value < alpha and
change_percent > threshold * 100
)
return is_regression, p_value, change_percent
# Usage example
baseline = [100.2, 101.5, 99.8, 100.1, 100.9]
current = [108.3, 109.1, 107.5, 108.8, 109.2]
is_reg, p, change = detect_regression(baseline, current)
print(f"Regression: {is_reg}, p-value: {p:.4f}, change: {change:.1f}%")
# Regression: True, p-value: 0.0000, change: 8.0%
Automation Script
#!/usr/bin/env python3
"""benchmark_compare.py - Compare benchmark results and detect regressions"""
import json
import sys
from pathlib import Path
def load_results(filepath):
"""Load Google Benchmark JSON results"""
with open(filepath) as f:
data = json.load(f)
return {b['name']: b for b in data['benchmarks']}
def compare_results(baseline_path, current_path, threshold=5.0):
"""Compare two benchmark results"""
baseline = load_results(baseline_path)
current = load_results(current_path)
regressions = []
improvements = []
for name, curr in current.items():
if name not in baseline:
continue
base = baseline[name]
change = (curr['real_time'] - base['real_time']) / base['real_time'] * 100
if change > threshold:
regressions.append((name, change))
elif change < -threshold:
improvements.append((name, change))
return regressions, improvements
def main():
if len(sys.argv) != 4:
print("Usage: benchmark_compare.py <baseline> <current> <threshold>")
sys.exit(1)
baseline_path = sys.argv[1]
current_path = sys.argv[2]
threshold = float(sys.argv[3])
regressions, improvements = compare_results(
baseline_path, current_path, threshold
)
if regressions:
print("❌ Performance Regressions Detected:")
for name, change in regressions:
print(f" {name}: +{change:.1f}%")
sys.exit(1)
if improvements:
print("✅ Performance Improvements:")
for name, change in improvements:
print(f" {name}: {change:.1f}%")
print("✅ No regressions detected")
sys.exit(0)
if __name__ == "__main__":
main()
Result Storage and Tracking
Database Storage
import sqlite3
from datetime import datetime
def init_db(db_path):
"""Initialize benchmark database"""
conn = sqlite3.connect(db_path)
conn.execute('''
CREATE TABLE IF NOT EXISTS benchmarks (
id INTEGER PRIMARY KEY,
timestamp TEXT,
commit_hash TEXT,
benchmark_name TEXT,
real_time REAL,
cpu_time REAL,
iterations INTEGER,
UNIQUE(commit_hash, benchmark_name)
)
''')
conn.commit()
return conn
def store_results(conn, commit_hash, results):
"""Store benchmark results"""
timestamp = datetime.now().isoformat()
for benchmark in results['benchmarks']:
conn.execute('''
INSERT OR REPLACE INTO benchmarks
(timestamp, commit_hash, benchmark_name, real_time, cpu_time, iterations)
VALUES (?, ?, ?, ?, ?, ?)
''', (
timestamp,
commit_hash,
benchmark['name'],
benchmark['real_time'],
benchmark['cpu_time'],
benchmark['iterations']
))
conn.commit()
def get_history(conn, benchmark_name, limit=100):
"""Get benchmark history"""
cursor = conn.execute('''
SELECT timestamp, commit_hash, real_time
FROM benchmarks
WHERE benchmark_name = ?
ORDER BY timestamp DESC
LIMIT ?
''', (benchmark_name, limit))
return cursor.fetchall()
Visualization
import matplotlib.pyplot as plt
import pandas as pd
def plot_benchmark_history(history, benchmark_name):
    """Plot benchmark history trend"""
    df = pd.DataFrame(history, columns=['timestamp', 'commit', 'time'])
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values('timestamp')  # get_history() returns newest first
    plt.figure(figsize=(12, 6))
    plt.plot(df['timestamp'], df['time'], marker='o')
    plt.xlabel('Date')
    plt.ylabel('Time (ns)')
    plt.title(f'Benchmark History: {benchmark_name}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    # Names such as "BM_VectorPushBack/8" contain '/', so sanitize the file name
    safe_name = benchmark_name.replace('/', '_')
    plt.savefig(f'{safe_name}_history.png')
    plt.close()
Advanced Techniques
Environment Consistency Check
#!/bin/bash
# check_environment.sh - Check benchmark environment
echo "=== Environment Check ==="
# CPU frequency
echo "CPU Frequency:"
cat /proc/cpuinfo | grep "MHz" | head -1
# CPU Governor
echo "CPU Governor:"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Turbo Boost
echo "Turbo Boost:"
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || echo "N/A"
# System load
echo "System Load:"
uptime
# Memory
echo "Memory:"
free -h
# Background processes
echo "Background Processes:"
ps aux | wc -l
Multiple Runs with Statistics
# Multiple runs in CI
- name: Run benchmarks (multiple iterations)
run: |
for i in {1..5}; do
./benchmark --benchmark_format=json > results_$i.json
done
# Merge results
python scripts/merge_results.py results_*.json > final_results.json
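The workflow step above calls `scripts/merge_results.py`, which is not shown. One possible shape for it is sketched below, assuming the Google Benchmark JSON layout shown earlier (`context` plus a `benchmarks` list with `name`, `real_time`, and `cpu_time`). Taking the per-benchmark median across runs is a design choice for this sketch, and the `runs_merged` field is an addition that is not part of Google Benchmark's schema.

```python
import json
import statistics
import sys

def merge_runs(runs):
    """Merge parsed Google Benchmark JSON documents into one.

    For every benchmark name, the median real_time and cpu_time across
    runs are kept; the context block is copied from the first run.
    """
    samples = {}  # name -> list of benchmark entries, in first-seen order
    for run in runs:
        for entry in run["benchmarks"]:
            samples.setdefault(entry["name"], []).append(entry)
    merged = []
    for name, entries in samples.items():
        combined = dict(entries[0])  # keep other fields such as iterations
        combined["real_time"] = statistics.median(e["real_time"] for e in entries)
        combined["cpu_time"] = statistics.median(e["cpu_time"] for e in entries)
        combined["runs_merged"] = len(entries)  # extra field added by this sketch
        merged.append(combined)
    return {"context": runs[0].get("context", {}), "benchmarks": merged}

def main(paths):
    """Load each result file and print the merged document to stdout."""
    runs = []
    for path in paths:
        with open(path) as f:
            runs.append(json.load(f))
    json.dump(merge_runs(runs), sys.stdout, indent=2)

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

The median is less sensitive than the mean to a single slow run, which matches the advice on flaky benchmarks in the previous chapter.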
Performance Budget
# performance_budget.yaml
benchmarks:
BM_VectorPushBack/1024:
max_time_ns: 50000
max_memory_kb: 100
BM_HashTableLookup:
max_time_ns: 100
BM_SortArray/10000:
max_time_ns: 1000000
def check_budget(results, budget):
"""Check if performance exceeds budget"""
violations = []
for name, limits in budget['benchmarks'].items():
if name not in results:
continue
result = results[name]
if 'max_time_ns' in limits:
if result['real_time'] > limits['max_time_ns']:
violations.append(
f"{name}: {result['real_time']:.0f}ns > {limits['max_time_ns']}ns"
)
return violations
Summary
Key elements of benchmark automation:
CI/CD Integration
- GitHub Actions / GitLab CI
- Dedicated benchmark runner
- Environment consistency
Automatic Regression Detection
- Statistical testing
- Threshold configuration
- Automatic alerts
Result Management
- Database storage
- Historical tracking
- Visualization
Best Practices
- Multiple runs for statistics
- Fixed environment settings
- Performance budgets
- Automated reporting
Appendix B: Embedded and RTOS Implementation
"In embedded systems, every cycle counts." — Embedded Systems Proverb
Simulator Environment Setup
Since this book doesn't assume readers have physical hardware, all exercises use simulators.
QEMU ARM Setup
# Install QEMU
sudo apt install qemu-system-arm
# Install ARM toolchain
sudo apt install gcc-arm-none-eabi
# Test QEMU
qemu-system-arm -M help | grep lm3s
# lm3s6965evb Stellaris LM3S6965EVB (Cortex-M3)
QEMU RISC-V Setup
# Install QEMU
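# On Debian/Ubuntu, qemu-system-riscv32 and qemu-system-riscv64 are shipped in
# qemu-system-misc (newer releases may split them into qemu-system-riscv)
sudo apt install qemu-system-misc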
# Install RISC-V toolchain
sudo apt install gcc-riscv64-unknown-elf
# Test QEMU
qemu-system-riscv32 -M help | grep sifive
# sifive_e RISC-V Board compatible with SiFive E SDK
# sifive_u RISC-V Board compatible with SiFive U SDK
Renode Setup
# Download Renode
wget https://github.com/renode/renode/releases/download/v1.14.0/renode_1.14.0_amd64.deb
sudo dpkg -i renode_1.14.0_amd64.deb
# Test
renode --version
Exercise 1: ARM Cortex-M Cycle Counting
Goal
Use DWT (Data Watchpoint and Trace) cycle counter to measure function execution time.
Code
// cycle_count.c - ARM Cortex-M3 cycle counting example
#include <stdint.h>
// DWT register definitions
#define DWT_CTRL (*(volatile uint32_t*)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004)
#define DEMCR (*(volatile uint32_t*)0xE000EDFC)
// Enable DWT
void dwt_init(void) {
DEMCR |= (1 << 24); // TRCENA
DWT_CYCCNT = 0;
DWT_CTRL |= 1; // CYCCNTENA
}
// Measure function
uint32_t measure_cycles(void (*func)(void)) {
uint32_t start = DWT_CYCCNT;
func();
uint32_t end = DWT_CYCCNT;
return end - start;
}
// Test function
void test_function(void) {
volatile int sum = 0;
for (int i = 0; i < 1000; i++) {
sum += i;
}
}
int main(void) {
dwt_init();
uint32_t cycles = measure_cycles(test_function);
// Use semihosting for output
// printf("Cycles: %u\n", cycles);
while (1);
return 0;
}
Compile and Run
# Compile
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb \
-specs=nosys.specs -specs=nano.specs \
-T linker.ld -o cycle_count.elf cycle_count.c
# Run in QEMU
qemu-system-arm -M lm3s6965evb -nographic \
-kernel cycle_count.elf -semihosting
Exercise 2: RISC-V mcycle/minstret
Goal
Use RISC-V CSRs to read cycle count and instruction count.
Code
// riscv_counters.c - RISC-V performance counter example
#include <stdint.h>
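// Read mcycle. On RV32 the 64-bit counter is split across mcycle/mcycleh,
// so re-read the high half until it is stable to avoid a torn read.
static inline uint64_t read_mcycle(void) {
    uint32_t lo, hi, hi2;
    do {
        asm volatile ("csrr %0, mcycleh" : "=r"(hi));
        asm volatile ("csrr %0, mcycle" : "=r"(lo));
        asm volatile ("csrr %0, mcycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
// Read minstret (same high/low/high pattern as read_mcycle)
static inline uint64_t read_minstret(void) {
    uint32_t lo, hi, hi2;
    do {
        asm volatile ("csrr %0, minstreth" : "=r"(hi));
        asm volatile ("csrr %0, minstret" : "=r"(lo));
        asm volatile ("csrr %0, minstreth" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}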
// Calculate CPI
void measure_cpi(void (*func)(void)) {
uint64_t cycles_start = read_mcycle();
uint64_t instrs_start = read_minstret();
func();
uint64_t cycles_end = read_mcycle();
uint64_t instrs_end = read_minstret();
uint64_t cycles = cycles_end - cycles_start;
uint64_t instrs = instrs_end - instrs_start;
// CPI = cycles / instructions
// Using integer division
uint32_t cpi_int = cycles / instrs;
uint32_t cpi_frac = (cycles * 100 / instrs) % 100;
// Output: CPI = cpi_int.cpi_frac
}
int main(void) {
// Test...
return 0;
}
Exercise 3: FreeRTOS Context Switch Measurement
Goal
Measure FreeRTOS context switch time.
Method
Measurement method:
1. Create two tasks
2. Task A records time, then yields
3. Task B records time, then yields
4. Calculate time difference
Time difference = Context switch time
FreeRTOS Code
// context_switch.c - FreeRTOS context switch measurement
// Reuses the DWT_CYCCNT definition and dwt_init() from Exercise 1
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
volatile uint32_t timestamp_a, timestamp_b;
volatile uint32_t switch_time;
void TaskA(void *pvParameters) {
while (1) {
timestamp_a = DWT_CYCCNT;
taskYIELD();
// Calculate time switching back from B
switch_time = DWT_CYCCNT - timestamp_b;
}
}
void TaskB(void *pvParameters) {
while (1) {
timestamp_b = DWT_CYCCNT;
taskYIELD();
}
}
int main(void) {
dwt_init();
xTaskCreate(TaskA, "TaskA", 128, NULL, 1, NULL);
xTaskCreate(TaskB, "TaskB", 128, NULL, 1, NULL);
vTaskStartScheduler();
while (1);
}
Running on Renode
# Create Renode script
cat > freertos_test.resc << 'EOF'
mach create
machine LoadPlatformDescription @platforms/cpus/stm32f4.repl
sysbus LoadELF @context_switch.elf
showAnalyzer sysbus.uart1
start
EOF
# Run
renode freertos_test.resc
Exercise 4: Interrupt Latency Measurement
Measure time from interrupt trigger to ISR execution start.
Measurement Method Description
Measurement steps:
1. Record time before triggering interrupt
2. Record time at ISR start
3. Calculate difference
Notes:
- Need to consider interrupt priority
- Need to consider impact of other interrupts
- Multiple measurements for statistics
Code
// interrupt_latency.c
// Assumes the CMSIS device header (SCB, __DSB, __ISB) and DWT_CYCCNT from Exercise 1
volatile uint32_t trigger_time;
volatile uint32_t isr_start_time;
volatile uint32_t latency;
void SysTick_Handler(void) {
    isr_start_time = DWT_CYCCNT;
    latency = isr_start_time - trigger_time;
}
void measure_interrupt_latency(void) {
    // Record the time, then pend the SysTick exception in software so the
    // trigger instant is known. Waiting for a free-running timer instead
    // would add the remaining countdown to the measured value.
    trigger_time = DWT_CYCCNT;
    SCB->ICSR = SCB_ICSR_PENDSTSET_Msk;
    __DSB();
    __ISB(); // The exception is taken here
    // latency now contains the interrupt latency in cycles
}
Exercise 5: Memory Access Pattern Analysis
Analyze the impact of different memory access patterns on performance.
Memory Access Code
// memory_access.c
#define ARRAY_SIZE 1024
volatile uint32_t array[ARRAY_SIZE];
// Sequential access
uint32_t sequential_access(void) {
uint32_t start = DWT_CYCCNT;
for (int i = 0; i < ARRAY_SIZE; i++) {
array[i] = i;
}
return DWT_CYCCNT - start;
}
// Strided access (stride = 16)
uint32_t strided_access(void) {
uint32_t start = DWT_CYCCNT;
for (int s = 0; s < 16; s++) {
for (int i = s; i < ARRAY_SIZE; i += 16) {
array[i] = i;
}
}
return DWT_CYCCNT - start;
}
// Random access
uint32_t random_access(uint32_t *indices) {
uint32_t start = DWT_CYCCNT;
for (int i = 0; i < ARRAY_SIZE; i++) {
array[indices[i]] = i;
}
return DWT_CYCCNT - start;
}
Power Measurement (Theory)
Without physical hardware, power measurement can only be discussed theoretically.
Measurement Equipment
Power measurement equipment:
1. Current Probe
- Connected in series with power line
- Measures current waveform
- Example: Keysight N2820A
2. Power Analyzer
- High precision power measurement
- Example: Keysight N6705C
3. Built-in on Dev Boards
- STM32 Nucleo IDD jumper
- Nordic PPK2
Measurement Method
Power measurement steps:
1. Baseline measurement
- Measure idle state power
- Measure various sleep mode power
2. Dynamic measurement
- Run benchmark
- Record power waveform
- Calculate average power
3. Energy calculation
Energy = ∫ Power(t) dt
≈ Σ Power[i] × Δt
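The energy calculation in step 3 can be sketched numerically. The example below uses the trapezoidal rule rather than the plain rectangle sum written above, which is a small substitution that handles unevenly spaced samples. The function names and the sample values are illustrative assumptions.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power over time with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy

def average_power_w(timestamps_s, power_w):
    """Average power is total energy divided by the measurement duration."""
    duration = timestamps_s[-1] - timestamps_s[0]
    return energy_joules(timestamps_s, power_w) / duration

# Four samples, one second apart: 1 W, 1 W, 3 W, 3 W
t = [0.0, 1.0, 2.0, 3.0]
p = [1.0, 1.0, 3.0, 3.0]
print(energy_joules(t, p), average_power_w(t, p))
```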
Simulated Power Estimation
# Simplified power model
def estimate_power(cycles, frequency_mhz, voltage_v):
    """
    Estimate power (dynamic + static) in watts
    P_dynamic = α × C × V² × f
    Assumptions:
    - Switched capacitance C ≈ 10 pF
    - Activity factor α ≈ 0.3
    - cycles is not used by this simplified model
    """
    C = 10e-12  # 10 pF
    alpha = 0.3
    f = frequency_mhz * 1e6
    dynamic_power = alpha * C * (voltage_v ** 2) * f
    # Add static power (assume 1 mW)
    static_power = 1e-3
    return dynamic_power + static_power
# Example
power = estimate_power(1000000, 100, 1.8)
print(f"Estimated power: {power * 1000:.2f} mW")
Summary
Key techniques for embedded performance measurement:
Timing Measurement
- ARM: DWT Cycle Counter
- RISC-V: mcycle/minstret CSR
- General: SysTick, hardware timers
RTOS Measurement
- Context switch time
- Interrupt latency
- Task switching overhead
Memory Analysis
- Access pattern impact
- Cache effects (if available)
- Alignment impact
Power Measurement
- Requires dedicated equipment
- Dynamic vs static power
- Energy efficiency metrics
Appendix C: I/O and Storage Performance
"Storage is the new memory." — Jim Gray
Storage Performance Fundamentals
Key Metrics
Storage performance metrics:
1. Bandwidth (Throughput)
- Unit: MB/s, GB/s
- Maximum sequential read/write speed
2. IOPS (I/O Operations Per Second)
- Unit: ops/s
- Random read/write operations
3. Latency
- Unit: μs, ms
- Time for single I/O operation
4. Queue Depth
- Concurrent I/O requests
- Affects IOPS and latency
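How queue depth, IOPS, and latency relate follows from Little's Law: the number of requests in flight equals throughput times latency. The sketch below applies that relation. The function names are illustrative, and the estimate only holds while the device can actually sustain the given concurrency, since latency rises once it saturates.

```python
def expected_iops(queue_depth, mean_latency_s):
    """Little's Law: in-flight requests = IOPS x latency, solved for IOPS."""
    return queue_depth / mean_latency_s

def bandwidth_bytes_per_s(iops, block_size_bytes):
    """Bandwidth is the operation rate times the size of each operation."""
    return iops * block_size_bytes

# Queue depth 32 at 100 microseconds mean latency with 4 KiB blocks
iops = expected_iops(32, 100e-6)
print(f"{iops:.0f} IOPS, {bandwidth_bytes_per_s(iops, 4096) / 1e9:.2f} GB/s")
```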
Storage Hierarchy
Storage hierarchy and typical performance:
Level        Latency       Bandwidth
─────────────────────────────────────────────
CPU Cache    1-10 ns       100+ GB/s
DRAM         50-100 ns     50-100 GB/s
NVMe SSD     10-100 μs     3-7 GB/s
SATA SSD     50-200 μs     500-600 MB/s
HDD          5-10 ms       100-200 MB/s
Network      1-100 ms      1-10 GB/s
fio (Flexible I/O Tester)
fio is the most commonly used storage benchmark tool.
Installation
# Ubuntu/Debian
sudo apt install fio
# macOS
brew install fio
# From source
git clone https://github.com/axboe/fio.git
cd fio && ./configure && make && sudo make install
Basic Usage
# Sequential write test
fio --name=seq_write \
--ioengine=libaio \
--direct=1 \
--bs=1M \
--size=1G \
--numjobs=1 \
--rw=write \
--filename=/tmp/fio_test
# Random read test
fio --name=rand_read \
--ioengine=libaio \
--direct=1 \
--bs=4K \
--size=1G \
--numjobs=4 \
--iodepth=32 \
--rw=randread \
--filename=/tmp/fio_test
Common Parameters
fio parameters:
--ioengine I/O engine (libaio, io_uring, sync)
--direct Bypass page cache (1=yes)
--bs Block size (4K, 1M, etc.)
--size Test file size
--numjobs Parallel jobs
--iodepth Queue depth
--rw Read/write mode (read, write, randread, randwrite, randrw)
--runtime Runtime (seconds)
--time_based Time-based instead of size-based
Job File
; fio_test.fio - Complete test configuration
[global]
ioengine=libaio
direct=1
size=1G
runtime=60
time_based
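group_reporting
; Run the jobs below one after another; without stonewall, fio starts them all concurrently
stonewall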
[seq_read]
rw=read
bs=1M
numjobs=1
[seq_write]
rw=write
bs=1M
numjobs=1
[rand_read]
rw=randread
bs=4K
numjobs=4
iodepth=32
[rand_write]
rw=randwrite
bs=4K
numjobs=4
iodepth=32
[mixed]
rw=randrw
rwmixread=70
bs=4K
numjobs=4
iodepth=32
Running and Output
# Run job file
fio fio_test.fio
# JSON output
fio fio_test.fio --output-format=json --output=results.json
# Output example
seq_read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB
read: IOPS=3245, BW=3245MiB/s (3403MB/s)
slat (usec): min=2, max=45, avg=5.2
clat (usec): min=280, max=1234, avg=302.5
lat (usec): min=285, max=1240, avg=307.7
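The JSON output is easier to post-process than the text report. The sketch below extracts a few headline numbers per job. The field names (`jobs`, `jobname`, `read`, `iops`, `bw` in KiB/s, `lat_ns.mean`) are what recent fio 3.x releases are understood to emit and should be checked against the fio version in use; `summarize_fio` itself is a hypothetical helper.

```python
import json

def summarize_fio(result, direction="read"):
    """Return {jobname: metrics} for one direction ('read' or 'write').

    result is the parsed fio JSON document (a dict).
    """
    summary = {}
    for job in result["jobs"]:
        stats = job[direction]
        summary[job["jobname"]] = {
            "iops": stats["iops"],
            "bw_mib_s": stats["bw"] / 1024,             # bw assumed to be in KiB/s
            "mean_lat_us": stats["lat_ns"]["mean"] / 1000,  # nanoseconds to microseconds
        }
    return summary

def load_fio_results(path):
    """Parse a file written with --output-format=json --output=<path>."""
    with open(path) as f:
        return json.load(f)
```

Typical use would be `summarize_fio(load_fio_results("results.json"))`, with the resulting dictionary stored alongside other benchmark results.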
ioping
ioping measures I/O latency, similar to ping.
Installation and Usage
# Install
sudo apt install ioping
# Measure latency
ioping -c 10 /tmp
# Output example
4 KiB <<< /tmp (ext4 /dev/sda1): request=1 time=234.5 us
4 KiB <<< /tmp (ext4 /dev/sda1): request=2 time=198.3 us
...
--- /tmp (ext4 /dev/sda1) ioping statistics ---
10 requests completed in 2.15 ms, 40 KiB read, 4.65 k iops, 18.2 MiB/s
min/avg/max/mdev = 156.2 us / 215.0 us / 312.4 us / 45.2 us
Advanced Usage
# Direct I/O (bypass cache)
ioping -D /dev/sda
# Specify size
ioping -s 1M /tmp
# Continuous test
ioping -c 100 -i 0 /tmp
Network I/O
iperf3
iperf3 is the standard tool for network bandwidth testing.
# Install
sudo apt install iperf3
# Server side
iperf3 -s
# Client side
iperf3 -c server_ip
# Output example
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 11.0 GBytes 9.42 Gbits/sec
iperf3 Advanced Usage
# UDP test
iperf3 -c server_ip -u -b 1G
# Multiple connections
iperf3 -c server_ip -P 4
# Bidirectional test
iperf3 -c server_ip --bidir
# JSON output
iperf3 -c server_ip -J > results.json
netperf
netperf focuses on latency testing.
# Install
sudo apt install netperf
# Server side
netserver
# TCP request/response latency
netperf -H server_ip -t TCP_RR
# Output example
TCP REQUEST/RESPONSE TEST
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes bytes bytes bytes secs. per sec
16384 131072 1 1 10.00 45678.90
dd Test
dd is the simplest I/O testing method.
# Write test
dd if=/dev/zero of=/tmp/test bs=1M count=1024 conv=fdatasync
# Read test
dd if=/tmp/test of=/dev/null bs=1M
# Read after clearing cache
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/tmp/test of=/dev/null bs=1M
Limitations of dd
Problems with dd:
1. Single-threaded
- Cannot test parallel performance
2. No queue depth control
- Cannot test NVMe's true performance
3. No statistics
- Only average, no latency distribution
Recommendation:
- Use dd for quick tests
- Use fio for formal testing
File System Performance
Comparing Different File Systems
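# Create test environment (run as root; WARNING: destroys all data on /dev/sdb1)
for fs in ext4 xfs btrfs; do
    # Force flags differ: -F for ext4, -f for xfs and btrfs
    case $fs in ext4) force=-F ;; *) force=-f ;; esac
    mkfs.$fs $force /dev/sdb1
    mount /dev/sdb1 /mnt/test
    # libaio with direct I/O is needed for iodepth to take effect
    fio --name=test --filename=/mnt/test/file \
        --ioengine=libaio --direct=1 \
        --size=10G --bs=4K --rw=randread \
        --iodepth=32 --numjobs=4 \
        --output-format=json > ${fs}_results.json
    umount /mnt/test
done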
Mount Options Impact
# Default mount
mount /dev/sdb1 /mnt/test
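# Performance-oriented mount (noatime already implies nodiratime; online discard
# can add latency on some SSDs, so benchmark it against periodic fstrim)
mount -o noatime,discard /dev/sdb1 /mnt/test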
# Compare performance difference
I/O Scheduler
Viewing and Setting
# View current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Set scheduler
echo "none" | sudo tee /sys/block/sda/queue/scheduler
Scheduler Comparison
I/O scheduler characteristics:
Scheduler     Use Case               Features
─────────────────────────────────────────────────────
none          NVMe SSD               Lowest latency
mq-deadline   General                Balance latency and throughput
kyber         Low latency needs      Auto-adjusting
bfq           Desktop/interactive    Fairness priority
Performance Analysis Tools
iostat
# Install
sudo apt install sysstat
# Basic usage
iostat -x 1
# Example output
Device r/s w/s rkB/s wkB/s await %util
sda 125.00 45.00 5000.0 1800.0 0.85 12.5
nvme0n1 3500.0 1200.0 14000.0 4800.0 0.12 45.2
blktrace
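# Trace I/O (stop with Ctrl-C; writes one trace.blktrace.<cpu> file per CPU)
sudo blktrace -d /dev/sda -o trace
# Analyze, and dump a binary stream for btt
blkparse -i trace -d trace.bin
# Summarize per-stage latencies
btt -i trace.bin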
Summary
Key tools for I/O performance testing:
Storage Testing
- fio: Complete storage benchmark
- ioping: I/O latency testing
- dd: Quick simple test
Network Testing
- iperf3: Bandwidth testing
- netperf: Latency testing
Analysis Tools
- iostat: Real-time monitoring
- blktrace: Detailed tracing
Testing Tips
- Use direct I/O to bypass cache
- Test different block sizes
- Test different queue depths
- Multiple runs for statistics
Appendix D: Power and Performance
"Performance per watt is the new performance." — Intel
Power Performance Fundamentals
Why Power Matters
Why power matters:
1. Data Centers
- Electricity is major operational cost
- Cooling requirements scale with power
- Power supply limits
2. Mobile Devices
- Battery life
- Thermal design limits
- User experience
3. Embedded Systems
- Battery powered
- Fanless design
- Environmental constraints
4. Environmental
- Carbon footprint
- Energy efficiency regulations
Power Components
Processor power components:
1. Dynamic Power
P_dynamic = α × C × V² × f
α = activity factor
C = capacitance
V = voltage
f = frequency
2. Static Power (Leakage)
P_static = I_leak × V
Temperature dependent
Smaller process = more leakage
3. Total Power
P_total = P_dynamic + P_static
RAPL (Running Average Power Limit)
RAPL is Intel's power monitoring interface.
Reading RAPL
# Using powercap interface
cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
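# Calculate power (energy_uj is readable only by root on recent kernels)
E1=$(sudo cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
sleep 1
E2=$(sudo cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
# Integer arithmetic: the result is truncated to whole watts
echo "Power: $(( (E2 - E1) / 1000000 )) W"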
RAPL Domains
RAPL power domains:
Domain              Description
─────────────────────────────────────────────
Package (PKG)       Entire CPU package
Core                CPU cores
Uncore              Non-core parts (L3 cache, etc.)
DRAM                Memory (DIMMs)
GPU (if present)    Integrated graphics
Using perf
# Measure energy with perf (RAPL events are counted system-wide, so -a is required)
sudo perf stat -a -e power/energy-pkg/,power/energy-cores/,power/energy-ram/ \
./benchmark
# Example output
Performance counter stats for 'system wide':
45.23 Joules power/energy-pkg/
32.15 Joules power/energy-cores/
12.34 Joules power/energy-ram/
10.002345678 seconds time elapsed
Python RAPL Reader
import time
from pathlib import Path
class RAPLReader:
def __init__(self):
self.rapl_path = Path("/sys/class/powercap/intel-rapl")
self.domains = self._find_domains()
def _find_domains(self):
domains = {}
for d in self.rapl_path.glob("intel-rapl:*"):
name = (d / "name").read_text().strip()
domains[name] = d / "energy_uj"
return domains
def read_energy(self):
"""Read energy for all domains (microjoules)"""
return {
name: int(path.read_text())
for name, path in self.domains.items()
}
def measure_power(self, duration=1.0):
"""Measure power (watts)"""
e1 = self.read_energy()
time.sleep(duration)
e2 = self.read_energy()
return {
name: (e2[name] - e1[name]) / duration / 1e6
for name in e1
}
# Usage
rapl = RAPLReader()
power = rapl.measure_power(1.0)
print(f"Package power: {power.get('package-0', 0):.2f} W")
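One detail the reader above does not handle is counter wraparound: `energy_uj` is a finite counter that restarts near zero after reaching the limit published in the neighboring `max_energy_range_uj` file. Below is a minimal sketch of a wrap-safe difference, assuming at most one wrap between the two readings. The function names are hypothetical.

```python
def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Energy consumed between two counter readings, in microjoules.

    Handles a single wraparound of the counter.
    """
    if after_uj >= before_uj:
        return after_uj - before_uj
    # The counter passed max_range_uj and restarted near zero
    return (max_range_uj - before_uj) + after_uj

def rapl_average_power_w(before_uj, after_uj, max_range_uj, elapsed_s):
    """Average power in watts over the measured interval."""
    return energy_delta_uj(before_uj, after_uj, max_range_uj) / 1e6 / elapsed_s
```

Passing the measured elapsed time, rather than the nominal sleep duration, also removes the small error introduced by scheduling delays around `time.sleep`.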
Performance per Watt
GFLOPS/W
GFLOPS/W calculation:
Performance/Power ratio = GFLOPS / Power (W)
Example:
Performance = 100 GFLOPS
Power = 50 W
Efficiency = 100 / 50 = 2 GFLOPS/W
Measurement Method
import subprocess
import time
def measure_efficiency(benchmark_cmd, duration=10):
"""Measure performance per watt"""
rapl = RAPLReader()
# Start measurement
e1 = rapl.read_energy()
t1 = time.time()
# Run benchmark
result = subprocess.run(
benchmark_cmd,
capture_output=True,
text=True
)
# End measurement
t2 = time.time()
e2 = rapl.read_energy()
# Calculate
elapsed = t2 - t1
energy_j = (e2['package-0'] - e1['package-0']) / 1e6
power_w = energy_j / elapsed
# Parse GFLOPS from benchmark output
# (depends on specific benchmark)
gflops = parse_gflops(result.stdout)
efficiency = gflops / power_w
return {
'gflops': gflops,
'power_w': power_w,
'efficiency': efficiency
}
Thermal Throttling
What is Thermal Throttling
Thermal throttling mechanism:
When temperature exceeds threshold, processor will:
1. Reduce frequency
2. Reduce voltage
3. Skip clock cycles
Result:
- Performance decreases
- Power consumption decreases
- Temperature stabilizes
Problem:
- Benchmark results become unstable
- Cannot achieve rated performance
Temperature Monitoring
# Using sensors
sudo apt install lm-sensors
sensors
# Output example
coretemp-isa-0000
Core 0: +65.0°C (high = +100.0°C, crit = +110.0°C)
Core 1: +67.0°C (high = +100.0°C, crit = +110.0°C)
# Using /sys
cat /sys/class/thermal/thermal_zone*/temp
# 65000 (millidegrees)
Detecting Throttling
# Using turbostat
sudo turbostat --interval 1
# Output example
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C6
- - 2345 45.2 3600 3600 1234 0 0 55 0 0
0 0 2400 48.5 3600 3600 456 0 0 52 0 0
# Bzy_MHz < rated frequency = possible throttling
Python Monitoring
from pathlib import Path
import time
def monitor_thermal(duration=60, interval=1):
"""Monitor temperature and frequency"""
results = []
for _ in range(int(duration / interval)):
# Read temperature
temps = []
for zone in Path("/sys/class/thermal").glob("thermal_zone*"):
temp = int((zone / "temp").read_text()) / 1000
temps.append(temp)
# Read frequency
freqs = []
for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
freq_file = cpu / "cpufreq/scaling_cur_freq"
if freq_file.exists():
freq = int(freq_file.read_text()) / 1000 # MHz
freqs.append(freq)
results.append({
'time': time.time(),
'max_temp': max(temps),
'avg_freq': sum(freqs) / len(freqs) if freqs else 0
})
time.sleep(interval)
return results
DVFS (Dynamic Voltage and Frequency Scaling)
CPU Frequency Control
# View available frequencies
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
# View current frequency
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Set frequency (requires root and the 'userspace' governor)
echo 2400000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed
Governor Settings
# View available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# performance powersave ondemand conservative schedutil
# Set governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Recommended to use 'performance' for benchmarks
Frequency Impact on Performance
Frequency vs Performance vs Power (idealized model):
Frequency   Rel. Perf   Rel. Power   Efficiency
─────────────────────────────────────────────────
2.0 GHz     1.00x       1.00x        1.00x
2.5 GHz     1.25x       1.95x        0.64x
3.0 GHz     1.50x       3.38x        0.44x
3.5 GHz     1.75x       5.36x        0.33x
Power ∝ f³ (P_dynamic ∝ V² × f, and V rises roughly linearly with f)
Efficiency (Rel. Perf / Rel. Power) decreases with frequency
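The cubic relation can be explored with a few lines of code. This sketch rests on idealized assumptions: performance scales linearly with frequency, voltage scales linearly with frequency, and static power is ignored. Real processors deviate from all three, so the numbers show only the direction of the trade-off. `dvfs_row` is a hypothetical helper.

```python
def dvfs_row(freq_ghz, base_ghz=2.0):
    """Relative performance, power, and efficiency versus a base frequency.

    Idealized model: performance scales with f and voltage scales with f,
    so dynamic power (alpha * C * V^2 * f) scales with f^3.
    """
    ratio = freq_ghz / base_ghz
    rel_perf = ratio
    rel_power = ratio ** 3
    return rel_perf, rel_power, rel_perf / rel_power

for f in (2.0, 2.5, 3.0, 3.5):
    perf, power, eff = dvfs_row(f)
    print(f"{f:.1f} GHz  perf {perf:.2f}x  power {power:.2f}x  efficiency {eff:.2f}x")
```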
GPU Power
NVIDIA GPU
# Using nvidia-smi
nvidia-smi --query-gpu=power.draw --format=csv -l 1
# Detailed info
nvidia-smi dmon -s p
# Example output
# gpu pwr gtemp mtemp
0 125W 65C 55C
Using NVML
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Read power
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # mW -> W
print(f"GPU Power: {power:.1f} W")
# Read temperature
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print(f"GPU Temp: {temp}°C")
pynvml.nvmlShutdown()
Power Benchmark Best Practices
Environment Preparation
#!/bin/bash
# prepare_power_benchmark.sh
# 1. Set performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# 2. Disable turbo boost (optional, for stability)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 3. Ensure adequate cooling
# (wait for temperature to stabilize)
# 4. Stop unnecessary services
sudo systemctl stop cron
sudo systemctl stop unattended-upgrades
Measurement Flow
Power measurement flow:
1. Warm-up
- Run benchmark until temperature stabilizes
- Usually takes 1-5 minutes
2. Baseline measurement
- Measure idle power
- As baseline
3. Load measurement
- Run benchmark
- Record power simultaneously
4. Multiple repeats
- At least 3-5 times
- Calculate mean and standard deviation
Summary
Key points for power performance analysis:
Measurement Tools
- RAPL: Intel CPU power
- nvidia-smi: NVIDIA GPU power
- sensors: Temperature monitoring
Key Metrics
- GFLOPS/W: Performance per watt
- Energy: Total energy consumption
- Thermal headroom: Temperature margin
Influencing Factors
- Frequency and voltage
- Thermal throttling
- Workload characteristics
Best Practices
- Fixed frequency testing
- Wait for temperature to stabilize
- Multiple measurements for statistics
- Record environmental conditions
Appendix E: Exercises and Solutions
"The best way to learn is by doing." — Richard Feynman
This appendix provides hands-on exercises for each chapter with detailed problem descriptions, solution approaches, and key code snippets.
Part I: Foundations (Chapters 1-4)
Exercise 1.1: Array Traversal with Statistical Analysis
Difficulty: Easy | Language: C
Problem: Run a simple array traversal 100 times and calculate statistical metrics to understand measurement variability.
Objectives:
- Measure execution time using clock_gettime(CLOCK_MONOTONIC)
- Calculate mean, standard deviation, and 95% confidence interval
- Evaluate result stability using coefficient of variation (CV)
Key Code:
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#define ARRAY_SIZE (1024 * 1024) // 1M elements
#define NUM_RUNS 100
static inline uint64_t get_time_ns(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
volatile int sink; // Prevent dead code elimination
void traverse_array(int *arr, size_t size) {
int sum = 0;
for (size_t i = 0; i < size; i++) {
sum += arr[i];
}
sink = sum;
}
Statistical Calculation:
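// times[] (illustrative name) holds the NUM_RUNS measured durations in ms
double sum = 0.0, sq_diff_sum = 0.0;
for (int i = 0; i < NUM_RUNS; i++) sum += times[i];
// Mean
double mean = sum / NUM_RUNS;
// Sample standard deviation (n - 1 in the denominator)
for (int i = 0; i < NUM_RUNS; i++) sq_diff_sum += (times[i] - mean) * (times[i] - mean);
double std_dev = sqrt(sq_diff_sum / (NUM_RUNS - 1));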
// 95% CI (t-value ≈ 1.984 for df=99)
double margin = 1.984 * std_dev / sqrt(NUM_RUNS);
// Coefficient of variation
double cv = (std_dev / mean) * 100.0;
Expected Output:
Array size: 1048576 elements (4 MB)
Number of runs: 100
Results:
Mean: 0.892 ms
Std Dev: 0.045 ms
95% CI: [0.883, 0.901] ms
CV (variability): 5.04%
Interpretation:
- CV < 5%: Stable results
- CV 5-15%: Some fluctuation, acceptable
- CV > 15%: High variability, stabilize environment
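If you log the raw timings to a file, the same statistics and the CV interpretation above can be cross-checked offline. This is a minimal sketch using only the Python standard library; the function name `summarize` is illustrative, and the default t-value of 1.984 is only appropriate for 100 samples (df=99), as in the exercise.

```python
import math
import statistics


def summarize(samples, t_value=1.984):
    """Return mean, sample std dev, 95% CI, CV (%), and a stability label."""
    n = len(samples)
    mean = statistics.fmean(samples)
    std_dev = statistics.stdev(samples)        # n - 1 in the denominator
    margin = t_value * std_dev / math.sqrt(n)  # half-width of the CI
    cv = std_dev / mean * 100.0
    if cv < 5.0:
        label = "stable"
    elif cv <= 15.0:
        label = "some fluctuation, acceptable"
    else:
        label = "high variability, stabilize environment"
    return {"mean": mean, "std_dev": std_dev,
            "ci": (mean - margin, mean + margin), "cv": cv, "label": label}


# Hypothetical timings in milliseconds
print(summarize([0.89, 0.90, 0.88, 0.91, 0.89]))
```

For sample counts other than 100, pass the matching t-value from a t-table rather than relying on the default.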
Exercise 2.2: Timer Overhead Measurement
Difficulty: Easy | Language: C
Problem: Measure the overhead of clock_gettime by calling it consecutively 1000 times.
Objectives:
- Understand timer resolution limits
- Determine minimum measurable duration
- Analyze overhead distribution
Key Code:
#define NUM_SAMPLES 1000
// Measure timer overhead
for (int i = 0; i < NUM_SAMPLES; i++) {
uint64_t t1 = get_time_ns();
uint64_t t2 = get_time_ns();
overhead[i] = t2 - t1;
}
// Sort for percentiles
qsort(overhead, NUM_SAMPLES, sizeof(uint64_t), compare_uint64);
uint64_t p50 = overhead[NUM_SAMPLES / 2];
uint64_t p95 = overhead[(int)(NUM_SAMPLES * 0.95)];
Expected Output:
Statistics (nanoseconds):
Mean: 25.3 ns
Std Dev: 8.2 ns
Min: 18 ns
Max: 156 ns
Percentiles:
P50 (median): 23 ns
P95: 42 ns
P99: 78 ns
Practical Implication: With a timer overhead of about 25 ns, time operations that last at least 2500 ns (100x the overhead) so that the overhead contributes no more than about 1% error.
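The same rule of thumb applies to any timer. As a rough sketch, the snippet below measures back-to-back calls to Python's `time.perf_counter_ns()` and derives the shortest duration worth timing for a chosen error budget; the helper names are illustrative, and Python's call overhead will be noticeably higher than the C figures shown above.

```python
import time


def measure_timer_overhead_ns(samples=1000):
    """Call the timer back to back and return the sorted deltas in ns."""
    deltas = []
    for _ in range(samples):
        t1 = time.perf_counter_ns()
        t2 = time.perf_counter_ns()
        deltas.append(t2 - t1)
    deltas.sort()
    return deltas


def min_measurable_ns(overhead_ns, max_relative_error=0.01):
    """Shortest duration for which the overhead stays within the error budget."""
    return overhead_ns / max_relative_error


deltas = measure_timer_overhead_ns()
median = deltas[len(deltas) // 2]
print(f"median overhead: {median} ns, "
      f"time at least {min_measurable_ns(median):.0f} ns per measurement")
```

Using the median rather than the mean keeps a few interrupt-induced outliers from inflating the estimate.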
Exercise 3.3: T-Test for Algorithm Comparison
Difficulty: Medium | Language: C
Problem: Use Welch's t-test to determine if two algorithm implementations have statistically significant performance differences.
Objectives:
- Collect benchmark samples for two algorithms
- Implement Welch's t-test (handles unequal variances)
- Interpret p-value for significance
Key Code:
// Algorithm A: Simple sum
void algo_a(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i];
}
sink = sum;
}
// Algorithm B: Unrolled sum (4x)
void algo_b(int *arr, int n) {
int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
int i;
for (i = 0; i <= n - 4; i += 4) {
sum0 += arr[i];
sum1 += arr[i + 1];
sum2 += arr[i + 2];
sum3 += arr[i + 3];
}
// Handle the remaining 0-3 elements when n is not a multiple of 4
for (; i < n; i++) {
sum0 += arr[i];
}
sink = sum0 + sum1 + sum2 + sum3;
}
Welch's T-Test:
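// Variance of each sample mean
double v1 = var_a / na;
double v2 = var_b / nb;
// t-statistic
double se = sqrt(v1 + v2);
double t = (mean_a - mean_b) / se;
// Degrees of freedom (Welch-Satterthwaite)
double df = (v1 + v2) * (v1 + v2) /
(v1 * v1 / (na - 1) + v2 * v2 / (nb - 1));
// The two-sided p-value comes from the t-distribution CDF with df degrees of freedom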
Interpretation:
- p < 0.05: Statistically significant difference
- p >= 0.05: No significant difference (could be noise)
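To check a C implementation, it can help to compute the same quantities in a few lines of Python. The sketch below follows the Welch formulas used in this exercise; the p-value is a normal approximation (reasonable only when the degrees of freedom are large, roughly 30 or more), not the exact t-distribution value, and the function name is illustrative.

```python
import math
import statistics


def welch_t_test(a, b):
    """Return (t, df, approximate two-sided p-value) for samples a and b."""
    na, nb = len(a), len(b)
    v1 = statistics.variance(a) / na   # variance of the mean of a
    v2 = statistics.variance(b) / nb   # variance of the mean of b
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (na - 1) + v2 ** 2 / (nb - 1))
    # Normal approximation to the two-sided p-value (assumption: large df)
    p_approx = math.erfc(abs(t) / math.sqrt(2.0))
    return t, df, p_approx


# Hypothetical timings (ms) for two implementations
t, df, p = welch_t_test([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(f"t = {t:.3f}, df = {df:.2f}, p ~ {p:.3f}")
```

For small samples, use an exact t-distribution CDF (for instance from a statistics package) instead of the approximation.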
Exercise 4.2: Box Plot Visualization
Difficulty: Easy | Language: Python
Problem: Create box plots to visualize benchmark result distributions and identify outliers.
Key Code:
import matplotlib.pyplot as plt
import numpy as np
def create_boxplot(data_dict, title, ylabel):
    fig, ax = plt.subplots(figsize=(10, 6))
    labels = list(data_dict.keys())
    data = [data_dict[k] for k in labels]
    bp = ax.boxplot(data, patch_artist=True)
    # Set tick labels separately (the boxplot 'labels' keyword is deprecated in newer Matplotlib)
    ax.set_xticklabels(labels)
    # Color boxes
    colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightcoral']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('boxplot.png', dpi=150)
What Box Plots Show:
- Box: Q1 to Q3 (50% of data)
- Line in box: Median
- Whiskers: Extend to the most extreme data points within 1.5×IQR of the box edges
- Points outside: Outliers
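The quantities a box plot draws can also be computed directly, which is useful when you want to list the outliers rather than just see them. This is a minimal sketch using the standard library; the function name is illustrative, and the "inclusive" quantile method is chosen here because it uses linear interpolation between data points.

```python
import statistics


def boxplot_summary(samples, whis=1.5):
    """Return quartiles, whisker ends, and outliers for one data series."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    inside = [x for x in samples if lo_fence <= x <= hi_fence]
    outliers = sorted(x for x in samples if x < lo_fence or x > hi_fence)
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers end at the most extreme points still inside the fences
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical timings with one slow run
print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Listing the outlier values alongside their run index is often the first step toward explaining them.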
Part II: Benchmark Tools (Chapters 5-9)
Exercise 5.1: CoreMark Testing
Difficulty: Easy | Language: Shell
Problem: Build and run CoreMark with different compiler optimization levels.
Key Code:
#!/bin/bash
git clone https://github.com/eembc/coremark.git
cd coremark
for opt in O0 O1 O2 O3 Ofast; do
echo "=== Testing -$opt ==="
make clean
make PORT_DIR=linux XCFLAGS="-$opt"
./coremark.exe 2>&1 | grep -E "CoreMark|Iterations"
done
Expected Results:
| Optimization | CoreMark Score | Improvement |
|---|---|---|
| -O0 | ~5,000 | baseline |
| -O1 | ~15,000 | 3x |
| -O2 | ~22,000 | 4.4x |
| -O3 | ~24,000 | 4.8x |
| -Ofast | ~25,000 | 5x |
Exercise 6.1: STREAM Memory Bandwidth Analysis
Difficulty: Easy | Language: C
Problem: Measure memory bandwidth at different array sizes to observe cache hierarchy effects.
Key Code:
// Vary array size to hit different cache levels
size_t sizes[] = {
8 * 1024, // 8 KB - L1
64 * 1024, // 64 KB - L2
512 * 1024, // 512 KB - L3
8 * 1024 * 1024, // 8 MB - L3/DRAM
64 * 1024 * 1024 // 64 MB - DRAM
};
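for (int s = 0; s < 5; s++) {
size_t n = sizes[s] / sizeof(double);
double *a = aligned_alloc(64, sizes[s]);
double *b = aligned_alloc(64, sizes[s]);
// Initialize both arrays so the pages are mapped before timing starts
for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 0.0; }
// STREAM Copy: b[i] = a[i]
// For the smaller sizes, repeat the copy many times so the timed
// interval is well above the timer overhead
uint64_t start = get_time_ns();
for (size_t i = 0; i < n; i++) {
b[i] = a[i];
}
uint64_t elapsed = get_time_ns() - start;
// Bytes read plus bytes written, divided by nanoseconds, gives GB/s
double bw = (2.0 * sizes[s]) / elapsed;
printf("Size: %6zu KB, Bandwidth: %.2f GB/s\n",
sizes[s] / 1024, bw);
free(a);
free(b);
}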
Exercise 9.2: CPU Cycle Counter
Difficulty: Medium | Language: C
Problem: Use architecture-specific cycle counters for precise timing.
Key Code (x86):
#include <stdint.h>
#include <x86intrin.h>
static inline uint64_t rdtsc(void) {
return __rdtsc();
}
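// More precise: RDTSCP waits for earlier instructions to finish before reading
// the counter, but it is not fully serializing (later instructions may start early)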
static inline uint64_t rdtscp(void) {
unsigned int aux;
return __rdtscp(&aux);
}
// Usage
uint64_t start = rdtscp();
// ... operation ...
uint64_t end = rdtscp();
uint64_t cycles = end - start; // TSC ticks: a constant-rate reference clock on modern CPUs, not core clock cycles
Key Code (ARM):
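// Note: cntvct_el0 is the generic timer's virtual count. It ticks at the fixed
// frequency reported by cntfrq_el0, not at the core clock. The true cycle
// counter (pmccntr_el0) usually needs kernel support for user-space access.
static inline uint64_t read_cycles(void) {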
uint64_t val;
asm volatile("mrs %0, cntvct_el0" : "=r"(val));
return val;
}
Part III: Analysis Theory (Chapters 10-12)
Exercise 10.1: Roofline Model
Difficulty: Medium | Language: Python
Problem: Generate a roofline model plot for your system.
Key Code:
import matplotlib.pyplot as plt
import numpy as np
# System parameters (measure these!)
peak_flops = 100 # GFLOPS
peak_bw = 50 # GB/s
# Calculate ridge point
ridge_point = peak_flops / peak_bw # FLOP/byte
# Operational intensity range
oi = np.logspace(-2, 2, 100)
# Roofline
memory_bound = peak_bw * oi
compute_bound = np.full_like(oi, peak_flops)
roofline = np.minimum(memory_bound, compute_bound)
plt.figure(figsize=(10, 6))
plt.loglog(oi, roofline, 'b-', linewidth=2, label='Roofline')
plt.axvline(x=ridge_point, color='r', linestyle='--',
label=f'Ridge Point ({ridge_point:.1f})')
plt.xlabel('Operational Intensity (FLOP/byte)')
plt.ylabel('Performance (GFLOPS)')
plt.title('Roofline Model')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roofline.png', dpi=150)
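Once the roofline is drawn, the usual next step is to place a kernel on it. The sketch below applies the same `min(peak_flops, peak_bw * oi)` rule used in the plot to a single operational intensity and reports which roof limits it; the function names and the example numbers are illustrative assumptions.

```python
def attainable_gflops(oi, peak_gflops, peak_bw_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, peak_bw_gbs * oi)


def classify(oi, peak_gflops, peak_bw_gbs):
    """Label a kernel by comparing its intensity with the ridge point."""
    ridge_point = peak_gflops / peak_bw_gbs  # FLOP/byte
    return "memory-bound" if oi < ridge_point else "compute-bound"


# Same example system as above: 100 GFLOPS peak, 50 GB/s peak bandwidth
for oi in (0.5, 2.0, 8.0):
    bound = attainable_gflops(oi, 100.0, 50.0)
    print(f"OI={oi}: at most {bound} GFLOPS, {classify(oi, 100.0, 50.0)}")
```

Comparing measured GFLOPS against `attainable_gflops` shows how much headroom remains under the relevant roof.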
Exercise 12.3: Branch Prediction Experiment
Difficulty: Medium | Language: C
Problem: Compare sorted vs unsorted array processing to observe branch prediction effects.
Key Code:
#define ARRAY_SIZE (32 * 1024)
// Test with sorted and unsorted data
void process_array(int *arr, int n, int threshold) {
int count = 0;
for (int i = 0; i < n; i++) {
if (arr[i] < threshold) { // Branch!
count++;
}
}
sink = count;
}
int main(void) {
int *arr = malloc(ARRAY_SIZE * sizeof(int));
// Fill with random data
for (int i = 0; i < ARRAY_SIZE; i++) {
arr[i] = rand() % 256;
}
// Test unsorted
uint64_t start = rdtscp();
for (int iter = 0; iter < 1000; iter++) {
process_array(arr, ARRAY_SIZE, 128);
}
uint64_t unsorted_cycles = rdtscp() - start;
// Sort the array
qsort(arr, ARRAY_SIZE, sizeof(int), compare_int);
// Test sorted
start = rdtscp();
for (int iter = 0; iter < 1000; iter++) {
process_array(arr, ARRAY_SIZE, 128);
}
uint64_t sorted_cycles = rdtscp() - start;
printf("Unsorted: %llu cycles\n", (unsigned long long)unsorted_cycles);
printf("Sorted: %llu cycles\n", (unsigned long long)sorted_cycles);
printf("Speedup: %.2fx\n",
(double)unsorted_cycles / sorted_cycles);
free(arr);
return 0;
}
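Expected Result: Sorted array is typically 2-5x faster due to better branch prediction, provided the compiler keeps the conditional branch. At -O2 or -O3, compilers often emit branchless or vectorized code for this loop, which hides the effect; inspect the generated assembly or build at a lower optimization level.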
Part IV: Data Structures (Chapters 13-15)
Exercise 13.1: Array vs Linked List
Difficulty: Easy | Language: C
Problem: Compare sequential access performance of arrays vs linked lists.
Key Code:
// Array traversal
void sum_array(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i];
}
sink = sum;
}
// Linked list traversal
struct Node {
int value;
struct Node *next;
};
void sum_list(struct Node *head) {
int sum = 0;
while (head) {
sum += head->value;
head = head->next;
}
sink = sum;
}
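Expected Result: Array is typically 5-20x faster due to cache-friendly sequential access, assuming the list nodes are scattered in memory. Nodes allocated back-to-back end up nearly contiguous and narrow the gap, so randomize the node order when building the list.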
Exercise 14.1: Hash Table vs Tree
Difficulty: Medium | Language: C++
Problem: Compare lookup performance of hash tables vs balanced trees.
Key Code:
#include <unordered_map>
#include <map>
void benchmark_hash(int n) {
std::unordered_map<int, int> hash;
for (int i = 0; i < n; i++) hash[i] = i;
auto start = now();
for (int i = 0; i < n; i++) {
sink = hash[rand() % n];
}
auto elapsed = now() - start;
}
void benchmark_tree(int n) {
std::map<int, int> tree;
for (int i = 0; i < n; i++) tree[i] = i;
auto start = now();
for (int i = 0; i < n; i++) {
sink = tree[rand() % n];
}
auto elapsed = now() - start;
}
Expected Result:
- Hash: O(1) average, ~50-100ns per lookup
- Tree: O(log n), ~200-500ns per lookup for n=1M
Exercise 15.1: Sorting Algorithm Comparison
Difficulty: Medium | Language: C++
Problem: Compare quicksort, mergesort, and heapsort performance.
Key Code:
#include <algorithm>
void benchmark_sort(std::vector<int>& data,
void (*sort_fn)(int*, int*)) {
auto start = now();
sort_fn(data.data(), data.data() + data.size());
auto elapsed = now() - start;
}
// Test with different data patterns:
// 1. Random
// 2. Nearly sorted
// 3. Reverse sorted
// 4. Many duplicates
Part V: Parallelization (Chapters 16-18)
Exercise 16.1: SIMD Vector Addition
Difficulty: Advanced | Language: C
Problem: Implement vector addition using SIMD intrinsics.
Key Code (AVX; the 256-bit float intrinsics used here do not require AVX2):
#include <immintrin.h>
void vector_add_scalar(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
void vector_add_avx(float *a, float *b, float *c, int n) {
int i;
for (i = 0; i <= n - 8; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
// Handle remainder
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
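Expected Speedup: Up to 4-8x for float when the data fits in cache; narrower integer types such as int8 (which need AVX2) can gain more. For arrays much larger than the cache, vector addition is memory-bound and the gain is far smaller. For a fair baseline, compile the scalar version with auto-vectorization disabled (for example -fno-tree-vectorize on GCC).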
Exercise 17.2: False Sharing Detection
Difficulty: Advanced | Language: C
Problem: Demonstrate and fix false sharing in multi-threaded code.
Key Code:
// BAD: False sharing
struct Counter {
int count; // 4 bytes, multiple counters in same cache line
};
struct Counter counters[NUM_THREADS];
// GOOD: Padded to avoid false sharing
struct PaddedCounter {
int count;
char padding[60]; // Pad to 64-byte cache line
};
struct PaddedCounter padded_counters[NUM_THREADS];
void *thread_func(void *arg) {
int id = *(int*)arg;
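// Note: with optimization enabled the compiler may collapse this loop into a
// single addition; declare count as volatile (or use atomics) to keep every store
for (int i = 0; i < ITERATIONS; i++) {
counters[id].count++; // False sharing!
}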
return NULL;
}
Expected Result: Padded version 5-10x faster with multiple threads.
Part VI: Embedded Constraints (Chapters 19-22)
Exercise 19.1: Binary Footprint Analysis
Difficulty: Medium | Language: C/Shell
Problem: Analyze binary size and identify largest contributors.
Key Code:
#!/bin/bash
# Compile with different options
gcc -Os -o prog_Os prog.c
gcc -O2 -o prog_O2 prog.c
gcc -O3 -o prog_O3 prog.c
# Size comparison
size prog_Os prog_O2 prog_O3
# Symbol size analysis
nm --size-sort -S prog_Os | tail -20
# Section breakdown
objdump -h prog_Os | grep -E "\.text|\.data|\.bss|\.rodata"
Exercise 21.1: Stack Usage Analysis
Difficulty: Medium | Language: C
Problem: Measure and analyze function stack usage.
Key Code:
# Compile with stack usage info
gcc -fstack-usage -O2 -c program.c
# Output: program.su file
# Format: file:line:col:function_name size_in_bytes qualifier (static, dynamic, or bounded)
// Read the current stack pointer (GCC inline assembly, x86-64 only)
void check_stack_usage(void) {
void *sp;
asm volatile("mov %%rsp, %0" : "=r"(sp));
printf("Current SP: %p\n", sp);
}
Part VII: AI/HPC (Chapters 23-29)
Exercise 26.1: CUDA Vector Addition
Difficulty: Advanced | Language: CUDA
Problem: Implement parallel vector addition on GPU.
Key Code:
__global__ void vector_add(float *a, float *b, float *c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main(void) {
int n = 1 << 20; // 1M elements
size_t bytes = n * sizeof(float);
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
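// Host buffers, input initialization, and cudaMemcpy to d_a/d_b are omitted here;
// a real benchmark must fill the inputs and copy d_c back to verify the result
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
// Catch launch errors, then wait for the kernel to finish
if (cudaGetLastError() != cudaSuccess) return 1;
cudaDeviceSynchronize();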
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Exercise 27.1: LLM Inference Benchmark
Difficulty: Advanced | Language: Python
Problem: Measure LLM inference throughput and latency.
Key Code:
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
def benchmark_inference(model, tokenizer, prompt, num_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    use_cuda = torch.cuda.is_available()
    # Warm up
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)
    # Benchmark (synchronize so queued GPU work stays inside the timed region)
    if use_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=num_tokens)
    if use_cuda:
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Generation can stop early at end-of-sequence, so count the tokens actually produced
    tokens_generated = outputs.shape[1] - inputs['input_ids'].shape[1]
    throughput = tokens_generated / elapsed
    print(f"Tokens: {tokens_generated}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {throughput:.1f} tokens/s")
Additional Exercises (No Solutions Provided)
The following exercises are designed for independent exploration. They are intentionally open-ended to encourage deeper investigation.
Part I: Foundations
Exercise 1.2: Environment Variability Study
Run the same benchmark on your system with different background conditions:
- During a virus scan or system backup
- With browser tabs open vs closed
- On battery vs plugged in (laptop)
- After fresh boot vs after hours of use
Document how much measurement variability each condition introduces. What's your system's "noise floor"?
Exercise 2.3: Timer Comparison
Compare the overhead and resolution of different timing methods on your system:
- clock_gettime(CLOCK_MONOTONIC)
- clock_gettime(CLOCK_MONOTONIC_RAW)
- gettimeofday()
- rdtsc / rdtscp (x86) or cntvct_el0 (ARM)
Which one is best for sub-microsecond measurements? Which is most portable?
Exercise 4.3: Outlier Investigation
Take a benchmark that produces outliers. Instead of removing them, investigate:
- When do outliers occur? (First run? Every N runs?)
- What system events correlate with outliers?
- Can you predict when outliers will happen?
Part II: Benchmark Tools
Exercise 7.1: Geekbench Deep Dive
Run Geekbench 6 on your system and analyze:
- Which sub-benchmarks score highest/lowest relative to the reference?
- How does your single-core to multi-core scaling compare to the reference?
- Run it 5 times - what's the CV of each sub-benchmark?
Exercise 8.1: SPEC CPU Proxy
Without access to SPEC CPU, create a "proxy benchmark suite" using open-source alternatives:
- Find open-source equivalents for 5 SPEC workloads
- Measure correlation with published SPEC scores (if available)
- Document limitations of your proxy approach
Exercise 9.3: Custom Microbenchmark
Design a microbenchmark to measure a specific CPU feature:
- L1 cache latency (not bandwidth)
- TLB miss penalty
- Memory prefetcher effectiveness
The benchmark must isolate the feature from other effects.
Part III: Analysis Theory
Exercise 10.2: Your Application's Roofline Position
Take a computationally intensive application you work with:
- Calculate its theoretical operational intensity
- Measure actual achieved FLOPS
- Plot it on a roofline - is it memory or compute bound?
- What optimization would help most?
Exercise 11.1: Amdahl's Law Reality Check
Profile a real application and identify:
- The serial fraction (code that can't be parallelized)
- Calculate theoretical speedup with 4, 8, 16 cores
- Measure actual speedup
- Explain the gap between theory and practice
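For the theoretical-speedup step, Amdahl's Law gives speedup = 1 / (s + (1 - s) / N) for serial fraction s and N cores. The snippet below is a small sketch of that formula; the function name and the 10% serial fraction are illustrative placeholders for your own profiling result.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when only the parallel part scales with cores."""
    if not 0.0 <= serial_fraction <= 1.0 or cores < 1:
        raise ValueError("serial_fraction must be in [0, 1] and cores >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical profile: 10% of the runtime is serial
for n in (4, 8, 16):
    print(f"{n} cores: {amdahl_speedup(0.10, n):.2f}x")
```

The gap between these bounds and your measured speedup is what the last bullet asks you to explain (synchronization, memory bandwidth, load imbalance, and so on).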
Exercise 12.4: Prefetch Pattern Discovery
Experiment with different memory access patterns:
- Sequential forward
- Sequential backward
- Stride-2, Stride-4, Stride-8, etc.
- Random
At what stride does the hardware prefetcher stop helping? How does this vary by CPU generation?
Part IV: Data Structures
Exercise 13.2: Cache-Oblivious vs Cache-Aware
Implement matrix transpose three ways:
- Naive (row-by-row)
- Cache-oblivious (recursive)
- Cache-aware (blocked, tuned for your L1)
Compare performance at different matrix sizes. At what size does blocking matter?
Exercise 14.2: Real-World Hash Table Benchmarking
Compare hash table implementations with realistic workloads:
- String keys of varying length (5-100 chars)
- Mixed read/write (90/10, 50/50, 10/90)
- With and without deletions
- Measure memory overhead, not just speed
Exercise 15.2: Sorting Stability Under Pressure
Benchmark sorting algorithms with adversarial inputs:
- Quicksort killer sequences
- Data that triggers worst-case for each algorithm
- Measure not just average case, but worst case in practice
Part V: Parallelization
Exercise 16.2: Auto-vectorization Investigation
Take a loop that should vectorize:
- Compile with -O3 -march=native and check if it vectorized
- If not, identify what prevented vectorization
- Fix the issue without using intrinsics
- Measure the speedup
Exercise 17.3: Lock-Free vs Locking
Implement a concurrent counter three ways:
- Mutex-protected
- Spinlock
- Atomic (lock-free)
Measure throughput at different thread counts (1, 2, 4, 8, 16, 32). At what point does each approach break down?
Exercise 18.1: OpenMP Scheduling Experiment
Take a parallel loop with varying work per iteration. Compare:
- static scheduling
- dynamic scheduling (chunk=1, 10, 100)
- guided scheduling
How does optimal scheduling depend on work distribution and iteration count?
Part VI: Embedded Constraints
Exercise 20.1: Power State Characterization
If you have access to an embedded board with power measurement:
- Measure power in different sleep states
- Measure wake-up latency from each state
- Calculate the break-even time for each state
- Design a sleep strategy for a periodic workload
Exercise 21.2: Stack High-Water Mark
Implement a stack painting technique:
- Fill stack with a known pattern at startup
- Run your application under various loads
- Check how much of the pattern was overwritten
- Report peak stack usage
Exercise 22.1: Memory-Constrained Algorithm Design
Take an algorithm that requires O(n) auxiliary space. Modify it to run in O(1) space:
- What's the time penalty?
- Is there a time-space trade-off curve?
- At what memory budget does the in-place version become faster?
Part VII: AI/HPC
Exercise 23.1: Metrics That Matter
Profile an AI inference workload and report:
- FLOPS achieved vs theoretical peak
- Memory bandwidth achieved vs theoretical peak
- Which is the bottleneck?
- What metric best predicts user-perceived performance?
Exercise 24.1: MLPerf Result Analysis
Download MLPerf results for a specific benchmark:
- Plot performance vs power across submissions
- Identify the Pareto frontier
- What hardware characteristics predict placement on the frontier?
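One way to identify the Pareto frontier is to sort submissions by power and keep each one that beats every lower-power submission on performance. The sketch below assumes lower power and higher performance are both preferred; the function name and the sample submissions are hypothetical and not taken from any MLPerf result.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, power_w, perf) tuples.

    A point is kept when no other point has lower-or-equal power
    and strictly higher performance.
    """
    frontier = []
    best_perf = float("-inf")
    # Ascending power; on equal power, consider the faster submission first
    for name, power_w, perf in sorted(points, key=lambda p: (p[1], -p[2])):
        if perf > best_perf:
            frontier.append((name, power_w, perf))
            best_perf = perf
    return frontier


# Hypothetical submissions: (name, power in W, samples/s)
subs = [("A", 100, 10), ("B", 200, 15), ("C", 150, 8), ("D", 300, 30)]
print(pareto_frontier(subs))
```

Submission "C" drops out because "A" delivers more performance at lower power.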
Exercise 25.1: HPCG vs HPL Correlation
Using published data from the TOP500 and HPCG rankings:
- Plot HPCG efficiency vs HPL efficiency for the same systems
- What's the correlation?
- Can you identify systems that are outliers in one benchmark but not the other?
Exercise 28.1: ML Compiler Comparison
Take a model (e.g., ResNet-50) and compile it with:
- PyTorch (eager mode)
- TorchScript
- ONNX Runtime
- TensorRT (if NVIDIA GPU available)
- TVM with auto-tuning
Report latency, throughput, and compilation time. Which is best for your use case?
Exercise 29.1: Quantization Impact Study
Take a model and quantize to INT8:
- Measure accuracy drop on your validation set
- Measure speedup on your target hardware
- Try quantization-aware training - does it help?
- Find the accuracy-speed Pareto frontier
Challenge Problems
These are open-ended research-level problems:
Challenge 1: The Perfect Benchmark
Design a benchmark that:
- Runs in under 10 seconds
- Has <1% CV on commodity hardware
- Correlates with "real application" performance (define this)
- Is resistant to gaming/optimization
Document your design decisions and trade-offs.
Challenge 2: Automatic Bottleneck Detection
Build a tool that:
- Takes any program as input
- Profiles it automatically
- Identifies the top 3 performance bottlenecks
- Suggests specific optimizations
Test it on 5 different programs. How often is it right?
Challenge 3: Performance Regression Detection
Design a CI/CD performance testing system that:
- Detects 2% regressions reliably
- Minimizes false positives
- Runs in under 5 minutes
- Works on noisy cloud VMs
What's the minimum number of runs needed? What statistical tests work best?
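As a starting point for the "minimum number of runs" question, a standard two-sample power calculation gives a rough estimate. This sketch assumes roughly normal, independent measurements with the same coefficient of variation before and after the change; noisy cloud VMs often violate these assumptions, so treat the result as a lower bound. The function name is illustrative.

```python
import math
from statistics import NormalDist


def runs_needed(cv, min_detectable, alpha=0.05, power=0.80):
    """Runs per version needed to detect a relative change of min_detectable.

    cv:             coefficient of variation as a fraction (0.02 = 2%)
    min_detectable: smallest relative regression to detect (0.02 = 2%)
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = 2.0 * ((z_alpha + z_beta) * cv / min_detectable) ** 2
    return math.ceil(n)


for cv in (0.01, 0.02, 0.05):
    print(f"CV {cv:.0%}: about {runs_needed(cv, 0.02)} runs per version")
```

The quadratic dependence on CV is the key design lesson: halving the noise cuts the required runs by a factor of four, which is usually cheaper than running longer.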
Summary
Exercise Difficulty Guide
| Level | Exercises | Prerequisites |
|---|---|---|
| Easy | 1.1, 2.2, 4.2, 5.1, 6.1, 13.1 | Basic C, Python |
| Medium | 3.3, 9.2, 10.1, 12.3, 14.1, 15.1, 19.1, 21.1 | Stats, Linux tools |
| Advanced | 16.1, 17.2, 26.1, 27.1 | SIMD, CUDA, ML |
| Open-ended | Additional exercises | Varies |
Key Takeaways
- Always measure: Never assume performance characteristics
- Statistics matter: Single measurements are meaningless
- Understand variance: CV tells you if results are stable
- Hardware awareness: Cache, branch prediction, memory matter
- Reproducibility: Document environment, automate tests
Appendix F: Environment Setup Guide
"A good setup is half the battle." — Engineering Proverb
Platform Overview
The examples and exercises in this book support multiple platforms.
Supported Platforms
| Platform | Architecture | Notes |
|---|---|---|
| Linux x86-64 | x86-64 | Most complete tool support |
| macOS | aarch64 | Apple Silicon (M1/M2/M3) |
| Windows | x86-64 | Recommend WSL2 |
| RISC-V hardware | riscv64 | SiFive, StarFive, Milk-V |
| RISC-V emulator | riscv64 | QEMU, Spike |
| ARM dev boards | aarch64 | Raspberry Pi, Jetson |
Linux x86-64 Setup
Basic Tools
# Update system
sudo apt update && sudo apt upgrade -y
# Compilation tools
sudo apt install -y build-essential cmake ninja-build
# Performance tools
sudo apt install -y linux-tools-common linux-tools-generic
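# perf is provided by the linux-tools packages on Ubuntu; install the one
# matching the running kernel with linux-tools-$(uname -r)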
# Other tools
sudo apt install -y git wget curl htop
Performance Analysis Tools
# perf
sudo apt install -y linux-tools-$(uname -r)
# Valgrind
sudo apt install -y valgrind
# FlameGraph
git clone https://github.com/brendangregg/FlameGraph.git
# sysstat (iostat, mpstat)
sudo apt install -y sysstat
Benchmark Tools
# fio
sudo apt install -y fio
# iperf3
sudo apt install -y iperf3
# stress-ng
sudo apt install -y stress-ng
# sysbench
sudo apt install -y sysbench
Permission Settings
# Allow non-root to use perf
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
# Permanent setting
echo 'kernel.perf_event_paranoid = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Linux Baseline (Benchmarking)
# 1. Stop unnecessary services
sudo systemctl stop cron snapd unattended-upgrades
# 2. Set CPU frequency (disable scaling)
sudo cpupower frequency-set -g performance
# 3. Disable Turbo Boost
# Intel:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# 4. Disable ASLR
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
# 5. Run with CPU pinning
sudo nice -n -20 taskset -c 2 ./benchmark
Memory Configuration
# Enable huge pages
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
# NUMA binding
numactl --membind=0 --cpunodebind=0 ./benchmark
Restore System Settings
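# Restore the governor that was active before benchmarking (ondemand shown here;
# check scaling_available_governors, since many systems default to schedutil or powersave)
sudo cpupower frequency-set -g ondemand
# Re-enable Turbo Boost (Intel)
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Re-enable boost (AMD)
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space
sudo systemctl start cron snapd unattended-upgrades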
macOS Setup
Homebrew Installation
# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Basic tools
brew install cmake ninja git wget
# Benchmark tools
brew install fio iperf3 stress-ng
Xcode Command Line Tools
xcode-select --install
Performance Analysis
# Instruments (installed with Xcode)
# Use GUI or command line
# Using sample
sudo sample <pid> 10 -file output.txt
# Using dtrace (requires disabling SIP)
sudo dtrace -n 'profile-997 { @[ustack()] = count(); }'
Notes
macOS limitations:
1. No perf
- Use Instruments instead
- Or use dtrace
2. SIP (System Integrity Protection)
- Some tools require disabling
- Not recommended for production
3. Apple Silicon
- Some tools not yet supported
- Use Rosetta 2 for x86 tools
Windows Setup
WSL2 Installation
# Run PowerShell as Administrator
wsl --install
# Install Ubuntu
wsl --install -d Ubuntu-22.04
# Enter WSL
wsl
Inside WSL2
# Same as Linux
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git
# Note: Some kernel features are limited
# perf requires special configuration
Native Windows Tools
Windows Performance Tools:
1. Windows Performance Toolkit
- xperf
- Windows Performance Analyzer
2. Visual Studio Profiler
- CPU usage
- Memory analysis
3. Intel VTune
- Windows support
- Detailed CPU analysis
Cross-Platform Tool Mapping
Linux → Windows
| Linux Tool/Command | Windows Equivalent | Description |
|---|---|---|
| taskset | start /affinity | CPU affinity |
| nice | Process priority in Task Manager | Priority setting |
| cpupower | Power Options / ThrottleStop | CPU frequency control |
| perf | Windows Performance Analyzer | Profiling |
| top / htop | Task Manager / Process Explorer | Process monitoring |
| clock_gettime | QueryPerformanceCounter | High-precision timing |
Linux → macOS
| Linux Tool/Command | macOS Equivalent | Description |
|---|---|---|
| taskset | Not directly supported | CPU affinity |
| cpupower | Not supported (macOS auto-manages) | CPU frequency control |
| perf | Instruments / sample | Profiling |
| /proc/cpuinfo | sysctl -a / system_profiler | System info |
| clock_gettime | mach_absolute_time | High-precision timing |
RISC-V Environment Setup
QEMU Emulator
# Install QEMU
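# (package names vary by release; on Ubuntu 22.04 the RISC-V system emulator ships in qemu-system-misc)
sudo apt install -y qemu-system-misc qemu-user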
# Test
qemu-system-riscv64 --version
qemu-riscv64 --version
Toolchain
# Pre-built toolchain
sudo apt install -y gcc-riscv64-linux-gnu
# Or build from source
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git
cd riscv-gnu-toolchain
./configure --prefix=/opt/riscv
make linux -j$(nproc)
Spike Simulator
# Install dependencies
sudo apt install -y device-tree-compiler
# Build Spike
git clone https://github.com/riscv-software-src/riscv-isa-sim.git
cd riscv-isa-sim
mkdir build && cd build
../configure --prefix=/opt/riscv
make -j$(nproc)
sudo make install
Running Examples
# Compile program
riscv64-linux-gnu-gcc -static -o hello hello.c
# Run on QEMU
qemu-riscv64 ./hello
# Run on Spike (requires pk)
spike pk ./hello
ARM Environment Setup
Cross Compilation
# Install toolchain
sudo apt install -y gcc-aarch64-linux-gnu
sudo apt install -y gcc-arm-none-eabi # Bare metal
# Install QEMU
sudo apt install -y qemu-system-arm qemu-user
Raspberry Pi Setup
# On Raspberry Pi
sudo apt update && sudo apt upgrade -y
# Performance tools
# (package name depends on the OS image: linux-perf on Debian-based
# Raspberry Pi OS, linux-tools-* on Ubuntu)
sudo apt install -y linux-perf
# Note: Some tools may need to be built from source
Docker Environment
Using Docker for Isolated Environment
# Dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
linux-tools-generic \
fio \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
Running
# Build image
docker build -t benchmark-env .
# Run (requires privileged mode for perf)
docker run --privileged -it benchmark-env
Environment Verification
Verification Script
#!/bin/bash
# verify_environment.sh
echo "=== Environment Verification ==="
# Compiler
echo -n "GCC: "
gcc --version | head -1
# perf
echo -n "perf: "
perf --version 2>/dev/null || echo "Not available"
# fio
echo -n "fio: "
fio --version 2>/dev/null || echo "Not available"
# Python
echo -n "Python: "
python3 --version
# Kernel version
echo -n "Kernel: "
uname -r
# CPU info
echo "CPU:"
lscpu | grep "Model name"
lscpu | grep "^CPU(s):"
echo "=== Verification Complete ==="
Common Issues
perf Permission Issues
# Problem: perf requires root permission
# Solution:
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
# Or use sudo
sudo perf stat ./benchmark
Frequency Instability
# Problem: CPU frequency changes affect results
# Solution: Fix frequency
sudo cpupower frequency-set -g performance
# Or disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Insufficient Memory
# Problem: Large benchmark runs out of memory
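# Solution: Add swap so the run can complete. Timings are unreliable once the
# benchmark actually swaps, so prefer a smaller dataset or more RAM when possible.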
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Summary
Environment setup key points:
Platform Selection
- Linux x86-64: Most complete support
- macOS: Need alternative tools
- Windows: Recommend WSL2
- RISC-V/ARM: Use emulator or physical hardware
Required Tools
- Compiler (GCC/Clang)
- Performance analysis (perf/Instruments)
- Benchmark tools (fio, iperf3)
Environment Preparation
- Fix CPU frequency
- Set appropriate permissions
- Verify tool availability
Isolation
- Docker containers
- Virtual machines
- Dedicated benchmark machine
Cross-Platform Timer
// cross_platform_timer.h
#ifndef CROSS_PLATFORM_TIMER_H
#define CROSS_PLATFORM_TIMER_H
#include <stdint.h>
#if defined(_WIN32)
#include <windows.h>
#elif defined(__APPLE__)
#include <mach/mach_time.h>
#else
#include <time.h>
#endif
static inline uint64_t get_time_ns(void) {
#if defined(_WIN32)
static LARGE_INTEGER freq = {0};
if (freq.QuadPart == 0) QueryPerformanceFrequency(&freq);
LARGE_INTEGER counter;
QueryPerformanceCounter(&counter);
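// Split into whole seconds and remainder so the multiplication cannot overflow 64 bits
uint64_t ticks = (uint64_t)counter.QuadPart, hz = (uint64_t)freq.QuadPart;
return (ticks / hz) * 1000000000ULL + (ticks % hz) * 1000000000ULL / hz;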
#elif defined(__APPLE__)
static mach_timebase_info_data_t timebase = {0};
if (timebase.denom == 0) mach_timebase_info(&timebase);
return mach_absolute_time() * timebase.numer / timebase.denom;
#else
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
#endif
}
#endif // CROSS_PLATFORM_TIMER_H
Benchmark Environment Checklist
Required
- Record hardware specs (CPU, RAM, storage)
- Record OS and kernel version
- Record compiler version and flags
- Close unnecessary background programs
- Ensure sufficient available memory
- Ensure power supply (laptop plugged in)
Recommended (Linux)
- Fix CPU frequency
- Disable Turbo Boost
- Disable ASLR
- Use CPU isolation
- Set real-time priority
Recommended (Windows/macOS)
- Disable antivirus real-time scanning
- Disable Windows Update / macOS auto-updates
- Disable Spotlight indexing (macOS)
- Use "High Performance" power plan (Windows)
At Runtime
- Run sufficient warm-up iterations
- Run enough iterations for statistics
- Record all raw data
- Monitor system state (CPU temp, frequency)
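The "At Runtime" items can be sketched as a small harness. This is a minimal illustration, not a replacement for a real benchmarking framework; the names `run_benchmark` and `summarize` and the default counts (5 warm-up runs, 30 measured runs) are assumptions chosen for the example.

```python
import statistics
import time

def run_benchmark(fn, warmup=5, iterations=30):
    """Time fn() after warm-up and return every raw sample in nanoseconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation; these timings are discarded
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples  # keep the raw data, not only the summary

def summarize(samples):
    """Summary statistics computed from the raw samples."""
    return {
        "n": len(samples),
        "mean_ns": statistics.mean(samples),
        "median_ns": statistics.median(samples),
        "stdev_ns": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min_ns": min(samples),
        "max_ns": max(samples),
    }

# Example workload: summing a range (a stand-in for the code under test)
samples = run_benchmark(lambda: sum(range(100_000)))
print(summarize(samples))
```

Storing `samples` alongside the summary makes it possible to re-analyze outliers later without rerunning the benchmark.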
Appendix G: Further Reading
"Standing on the shoulders of giants."
This appendix collects books, papers, and online resources that shaped how this book thinks about performance engineering and benchmarking. Treat it as a map: dip in when a topic from the main chapters sparks your curiosity.
Editor's note: If you're in the middle of a real performance incident, start with Systems Performance, the Roofline paper, and Drepper's memory article. Come back to the rest when things are calm.
Reading Guide (Inside This Book)
Reading Paths by Role
Different readers can take different paths through the main chapters. These are suggested starting points rather than strict sequences.
| Reader type | Goal | Suggested chapters (main text) |
|---|---|---|
| System / embedded engineer | Understand system bottlenecks | Ch 1–4, 5–8, 9, 16–18, 19–22, 30, 33–35 |
| ML / AI engineer | Focus on AI/ML and LLM performance | Ch 1–4, 5, 8, 19, 20, 23–27, 30, 32–35 |
| HPC / perf researcher | Connect theory, hardware, and models | Ch 1–4, 5–7, 10–12, 16–18, 23–27, 30–32, 33–35 |
Within each path, you can always jump to appendices for hands-on exercises and environment setup when you are ready to run real benchmarks.
Topic Map (Concept → Chapters)
Use this as a quick index when you want to revisit a concept from the main text.
- Benchmarking methodology and statistics: Ch 1–4, 10
- Profiling tools and observability: Ch 5–8, 30–32
- Cache, memory, and locality: Ch 2, 6, 12–15, 18, Appendix C, Appendix E
- Data structures and algorithms in practice: Ch 13–15, 30, 31
- Parallelism and multi-core scaling: Ch 16–18, 23, 30–32
- Embedded and footprint constraints: Ch 9, 19–22, Appendix B, Appendix E
- AI/ML and LLM performance: Ch 20, 23–27, 29, 32
- End-to-end practice (how to benchmark / optimize / ship): Ch 33–35, Appendix A
When the structure evolves in future versions, this topic map is the single place that should be updated.
Books
Systems Background
Computer Systems: A Programmer's Perspective (3rd Edition) - Randal E. Bryant and David R. O'Hallaron, Pearson, 2015. A comprehensive introduction to how modern computer systems work, useful background for understanding performance bottlenecks across hardware and software.
Performance Engineering
Systems Performance: Enterprise and the Cloud (2nd Edition) - Brendan Gregg, Addison-Wesley, 2020. A broad, practical reference for performance methodology, Linux observability tools, and real production case studies.
Key chapters:
- Chapter 2: Methodologies
- Chapter 6: CPUs
- Chapter 7: Memory
- Chapter 13: perf
BPF Performance Tools - Brendan Gregg, Addison-Wesley, 2019. A modern guide to Linux observability with eBPF, useful once basic tools like perf feel natural.
Key chapters:
- Chapter 4: BCC
- Chapter 5: bpftrace
- Chapters 6-15: Subsystem analysis
The Art of Writing Efficient Programs - Fedor G. Pikus, Packt, 2021. Focuses on high-performance C++ and shows how algorithms interact with modern CPUs and memory systems.
Key chapters:
- Chapter 2: Performance Measurements
- Chapter 3: CPU Architecture
- Chapter 4: Memory Architecture
- Chapter 9: High-Performance C++
Computer Architecture
Computer Architecture: A Quantitative Approach (6th Edition) - John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. The classic reference for processors, memory hierarchies, and quantitative evaluation.
Key chapters:
- Chapter 1: Fundamentals
- Chapter 2: Memory Hierarchy
- Appendix A: Instruction Set Principles
Modern Processor Design - John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013. A deeper treatment of superscalar and out-of-order processors that explains many microarchitectural effects seen in benchmarks.
Benchmarking
Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software - Connie U. Smith and Lloyd G. Williams, Addison-Wesley, 2001. A foundational text on software performance engineering and workload design.
Every Computer Performance Book - Bob Wescott, 2013. A short, very practical book full of rules of thumb for real-world performance work.
Papers
Benchmarking Methodology
How Not to Measure Computer System Performance - David J. Lilja, IEEE Computer, 2005. A concise overview of common benchmarking mistakes.
Producing Wrong Data Without Doing Anything Obviously Wrong! - Todd Mytkowicz et al., ASPLOS 2009. Shows how environment size, link order, and other details can silently corrupt results.
Key findings:
- UNIX environment size affects performance
- Link order matters
- Measurement bias is pervasive
Rigorous Benchmarking in Reasonable Time - Tomas Kalibera and Richard Jones, ISMM 2013. Explains how to design statistically sound experiments without burning weeks of CPU time.
Stabilizer: Statistically Sound Performance Evaluation - Charlie Curtsinger and Emery D. Berger, ASPLOS 2013. Uses randomization to make performance measurements more robust and statistically sound.
Roofline Model
Roofline: An Insightful Visual Performance Model for Multicore Architectures - Samuel Williams et al., Communications of the ACM, 2009. Introduces the Roofline model used throughout this book.
Cache-Aware Roofline Model: Upgrading the Loft - Aleksandar Ilic et al., IEEE Computer Architecture Letters, 2014. Extends Roofline to account for multiple cache levels.
AI/ML Benchmarks
MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance - Peter Mattson et al., IEEE Micro, 2020. Describes the design and goals of the MLPerf benchmark suite.
Measuring the Algorithmic Efficiency of Neural Networks - Danny Hernandez and Tom Brown, arXiv 2020. Studies trends in the algorithmic efficiency of neural networks over time.
Online Resources
Optimization Manuals
Agner Fog's Optimization Resources - https://www.agner.org/optimize/. A comprehensive collection of optimization manuals, instruction tables, and microarchitecture notes for x86/x64.
Intel 64 and IA-32 Architectures Optimization Reference Manual - https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html. Intel's official optimization guide for their processors.
ARM Performance Analysis Guides - https://developer.arm.com/documentation/. Official documentation and tuning guides for ARM CPUs.
Memory & Cache
What Every Programmer Should Know About Memory - Ulrich Drepper, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf. A long but rewarding deep dive into modern memory hierarchies.
Gallery of Processor Cache Effects - Igor Ostrovsky, http://igoro.com/archive/gallery-of-processor-cache-effects/. An interactive tour of cache behavior.
Benchmarking Tools
SPEC CPU 2017 - https://www.spec.org/cpu2017/. The industry-standard CPU benchmark suite used in academia and industry.
Phoronix Test Suite - https://www.phoronix-test-suite.com/. A large collection of open-source benchmarks for Linux and other platforms.
Google Benchmark - https://github.com/google/benchmark. A C++ microbenchmarking framework that pairs well with the microbenchmark patterns in this book.
Courses
MIT 6.172: Performance Engineering of Software Systems https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/
An MIT course on performance engineering. Covers profiling, cache optimization, parallelism, and systematic performance methodology.
Berkeley CS267: Applications of Parallel Computers https://sites.google.com/lbl.gov/cs267-spr2024
An advanced course on parallel computing and high-performance computing (HPC).
CMU 15-418/618: Parallel Computer Architecture and Programming http://www.cs.cmu.edu/~418/
Another classic course on parallel programming and computer architecture.
Blogs
Brendan Gregg's Blog https://www.brendangregg.com/
Deep-dive articles on performance analysis and observability. Especially recommended:
- "Linux Performance" (overview)
- "Flame Graphs"
- "CPU Flame Graphs"
Mechanical Sympathy https://mechanical-sympathy.blogspot.com/
Discussions of hardware-aware programming and the interaction between code and modern CPUs.
Daniel Lemire's Blog https://lemire.me/blog/
Regular posts on data-oriented design, SIMD optimization, and fast software techniques.
Travis Downs' Blog https://travisdowns.github.io/
Low-level CPU performance analysis, microbenchmarks, and deep dives into instruction behavior.
Tools
Profiling
| Tool | Platform | Description |
|---|---|---|
| perf | Linux | Built-in Linux profiler |
| VTune | x86 | Intel's advanced profiler |
| Instruments | macOS | Apple's profiling suite |
| Tracy | Cross | Real-time profiler popular in game development |
Benchmarking
| Tool | Language | Description |
|---|---|---|
| Google Benchmark | C++ | Microbenchmark library |
| Criterion | Rust | Rust benchmark library |
| pytest-benchmark | Python | Python benchmark plugin |
| JMH | Java | Java microbenchmark harness |
Visualization
| Tool | Description |
|---|---|
| FlameGraph | Stack trace and sample visualization |
| Perfetto | Chrome trace-style viewer for traces |
| Hotspot | GUI for visualizing perf data |
Suggested Reading Paths
Different readers will care about different parts of this appendix. Here are a few short routes.
| Reader type | Core book | Key paper / resource | Course |
|---|---|---|---|
| System / embedded engineer | Systems Performance | Drepper, "What Every Programmer Should Know About Memory" | MIT 6.172 |
| ML / AI engineer | Systems Performance | MLPerf papers; "Measuring the Algorithmic Efficiency of Neural Networks" | CS267 (selected lectures) |
| HPC / performance researcher | Computer Architecture: A Quantitative Approach | Roofline and Cache-Aware Roofline papers | CS267 or 15-418/618 |
System / Embedded Engineers
- Start with Systems Performance for methodology, tools, and mental models.
- Skim CAQA Ch. 1-2 and the Roofline paper when you need hardware intuition.
- Keep Drepper's memory paper and Agner Fog's manuals nearby for tricky cache/latency behaviour.
ML / AI Engineers
- Read the MLPerf paper and the algorithmic efficiency paper alongside this book's AI/ML chapters.
- Use Systems Performance for general methodology and system-level bottlenecks.
- Pair this with CS267 lectures focused on dense linear algebra and GPU performance.
HPC / Research-Oriented Readers
- Start from CAQA and Modern Processor Design for architecture depth.
- Study the Roofline and Cache-Aware Roofline papers, then apply them to your own kernels.
- Use CS267 or 15-418/618 as a structured path through parallel architectures and performance case studies.
Most importantly, keep connecting what you read back to real measurements on systems you control. Reading without measurement becomes trivia; measurement without theory becomes blind trial-and-error.
Appendix H: Performance Models Deep Dive
"In theory, theory and practice are the same. In practice, they are not." — Yogi Berra
This appendix provides detailed mathematical foundations, proofs, and advanced applications for the performance models introduced in Chapter 10.
Little's Law: Mathematical Foundation
Rigorous Derivation
Little's Law states: L = λ × W
Where:
- L = average number of items in system
- λ = arrival rate (throughput)
- W = average time in system (latency)
Intuitive Analogy
Imagine a highway toll booth:
- Vehicle arrival rate (λ) = 100 vehicles/hour
- Time to pass through (W) = 0.1 hour/vehicle
- Vehicles at booth (L) = 100 × 0.1 = 10 vehicles
This makes sense: if each vehicle needs 0.1 hours to pass,
and 100 vehicles arrive per hour, then at any moment,
there are on average 10 vehicles in the system.
Formal Proof Sketch
Consider a time interval T.
During T:
- Tasks arriving = N = λ × T
- Each task spends average time W in system
- Task j contributes time W_j to system occupancy
Average items in system (L):
L = (1/T) × Σ [time each task spent in system]
= (1/T) × [total time contribution from all tasks]
Since each task contributes W on average:
L ≈ (1/T) × (N × W)
= (1/T) × (λ × T × W)
= λ × W
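The proof sketch can be checked numerically. The following is a minimal sketch that simulates a single-server FIFO queue, integrates items_in_system(t) with an event sweep, and compares the result with λ × W. The function names, the arrival and service rates, and the seed are assumptions chosen for illustration.

```python
import random

def simulate_fifo_queue(n_tasks=10_000, arrival_rate=100.0, service_rate=125.0, seed=1):
    """Single-server FIFO queue; returns a list of (arrival, departure) times."""
    rng = random.Random(seed)
    t = 0.0
    server_free_at = 0.0
    tasks = []
    for _ in range(n_tasks):
        t += rng.expovariate(arrival_rate)      # next arrival time
        start = max(t, server_free_at)          # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        tasks.append((t, server_free_at))
    return tasks

def time_average_occupancy(tasks, horizon):
    """(1/T) x integral of items_in_system(t) dt, computed by an event sweep."""
    events = [(a, +1) for a, _ in tasks] + [(d, -1) for _, d in tasks]
    events.sort()
    area = 0.0
    in_system = 0
    last_t = 0.0
    for t, delta in events:
        area += in_system * (t - last_t)
        in_system += delta
        last_t = t
    return area / horizon

tasks = simulate_fifo_queue()
T = max(d for _, d in tasks)                    # observe until the last task leaves
lam = len(tasks) / T                            # arrival rate over the interval
W = sum(d - a for a, d in tasks) / len(tasks)   # average time in system
L = time_average_occupancy(tasks, T)            # time-averaged items in system
print(f"lambda={lam:.2f}/s  W={W * 1000:.2f} ms  lambda*W={lam * W:.3f}  L={L:.3f}")
```

Because the observation window ends only after every task has left, the area under items_in_system(t) equals the sum of all per-task times, so the two sides agree up to floating-point rounding. With a window that cuts off tasks still in flight, the match is only approximate.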
Why It Works in Computer Systems
1. Conservation Principle
Tasks enter → [ Processing System ] → Tasks leave
Conservation: tasks_in = tasks_out (steady state)
Items in system = tasks that entered but haven't left yet
2. Time Average = Space Average
Observing system for time T:
Time-averaged concurrency = (1/T) × ∫[0,T] items_in_system(t) dt
This equals the task-centric view:
(1/T) × Σ W_j, the total time all tasks spent in the system divided by T
3. Valid for All Queuing Models
Proven mathematically for:
M/M/1, M/M/c, M/G/1, G/G/c, and more
Practical Verification
def verify_littles_law(throughput, latency, measured_concurrency):
    """Verify Little's Law with measured data."""
    expected = throughput * latency
    error = abs(measured_concurrency - expected) / expected
    print(f"Throughput: {throughput:.1f}/s")
    print(f"Latency: {latency:.3f}s")
    print(f"Expected concurrency: {expected:.1f}")
    print(f"Measured concurrency: {measured_concurrency:.1f}")
    print(f"Error: {error:.1%}")
    # Error < 5% indicates stable system
    return error < 0.05

# Example
verify_littles_law(150, 0.08, 11.8)
# Output: Error: 1.7% ✓
Key Prerequisites
| Prerequisite | Description |
|---|---|
| Stable System | Input rate ≈ output rate (no unbounded queue growth) |
| Long-term Average | Short-term may violate; converges over time |
| Task Conservation | Every task that enters eventually leaves |
Counter-Examples (When It Fails)
Counter-Example 1: Task Loss
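throughput_in = 100/s
throughput_out = 80/s (20% dropped)
latency = 0.1s
Naive calculation: 100 × 0.1 = 10
Reality: assuming tasks are dropped on arrival, they never occupy the system, so the offered rate overstates concurrency; using the completed rate gives 80 × 0.1 = 8
If the excess 20 tasks/s were queued instead of dropped, the backlog would grow without bound and no steady-state average would exist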
Counter-Example 2: Burst Arrival
1000 tasks arrive instantly
throughput = 1000/s (instantaneous)
latency = 1s
Calculation: 1000 × 1 = 1000
But this is instantaneous, not steady-state average
Diagnostic Value
When measured values don't match expectations:
| Situation | Possible Causes |
|---|---|
| Actual > Expected | Latency underestimated, tasks stuck, memory leak |
| Actual < Expected | Throughput overestimated, parallel processing, measurement too short |
Amdahl's Law: Extended Analysis
Mathematical Derivation
Let T be the total execution time on a single core. Let p be the parallelizable fraction.
- Single-core execution time: T_1 = T
- On N cores, parallel portion shrinks to (p×T)/N, serial portion remains (1-p)×T
- N-core execution time: T_N = (1-p)×T + (p×T)/N
- Speedup S = T_1 / T_N = 1 / ((1-p) + p/N)
- As N → ∞: S_max = 1 / (1-p)
If p = 0.9 (90% parallelizable): S_max = 10×
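To make the asymptote concrete, a minimal sketch that evaluates the speedup formula above for p = 0.9 at several core counts. The function name `amdahl_speedup` is an assumption chosen for illustration.

```python
def amdahl_speedup(p, n):
    """Speedup S = 1 / ((1 - p) + p / N) for parallel fraction p on N cores."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.9
for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"N={n:5d}  speedup={amdahl_speedup(p, n):6.2f}x")
print(f"limit    speedup={1.0 / (1.0 - p):6.2f}x")
```

The printed values climb quickly at first and then flatten toward the 1 / (1 - p) limit, which is why adding cores beyond a few dozen buys little for this workload.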
Measuring the Parallel Fraction p
Since p is difficult to calculate from code inspection, use empirical fitting:
import numpy as np
from scipy.optimize import curve_fit
def amdahl_model(n, p):
    """Amdahl's Law model."""
    return 1 / ((1 - p) + p / n)
# Measured data: cores and corresponding speedup
n_data = np.array([1, 2, 4, 8, 16])
speedup_data = np.array([1.0, 1.85, 3.20, 4.80, 5.60])
# Fit p
params, _ = curve_fit(amdahl_model, n_data, speedup_data)
p_fitted = params[0]
print(f"Fitted parallel fraction p = {p_fitted:.2%}")
print(f"Theoretical maximum speedup = {1/(1-p_fitted):.2f}x")
Real-World Serial Bottleneck Examples
| Bottleneck Type | Example | Mitigation |
|---|---|---|
| Lock Contention | Global mutex for logging | Lock-free queues, per-thread buffers |
| I/O Serialization | Sequential file reads | Async I/O, memory-mapped files |
| Memory Allocator | malloc global lock | jemalloc, tcmalloc, arena allocators |
| Kernel Bottleneck | syscall serialization | io_uring, batched operations |
Gustafson's Law: Scaled Speedup
Mathematical Derivation
Unlike Amdahl, Gustafson starts from the parallel execution result and works backward.
- On N cores, normalized execution time T_N = 1
- Let s be the serial fraction of this time, so parallel fraction = 1-s
- On single core, serial part still takes s, but parallel part takes N×(1-s)
- Single-core time: T_1 = s + N×(1-s)
- Scaled Speedup: S(N) = s + N×(1-s) = N - s×(N-1)
Conclusion: Speedup grows linearly with N as long as s is small.
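The derivation can be sketched in a few lines. Here s is the serial fraction of the N-core run time, as in the steps above, and the value s = 0.1 is an assumption chosen for illustration, as are the function names.

```python
def gustafson_speedup(s, n):
    """Scaled speedup S(N) = N - s * (N - 1); s is measured on the N-core run."""
    return n - s * (n - 1)

def scaled_workload_times(s, n):
    """Return (T_1, T_N) with the N-core time normalized to 1."""
    t_n = 1.0
    t_1 = s + n * (1.0 - s)  # serial part unchanged, parallel part takes N times longer
    return t_1, t_n

s = 0.1
for n in (1, 4, 16, 64, 256):
    t_1, t_n = scaled_workload_times(s, n)
    print(f"N={n:4d}  T_1={t_1:7.1f}  scaled speedup={gustafson_speedup(s, n):7.1f}x")
```

The ratio T_1 / T_N reproduces the closed form, and the speedup keeps growing with N because the problem size grows with the machine.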
Strong vs Weak Scaling
| Scaling Type | Definition | Model | Metric |
|---|---|---|---|
| Strong Scaling | Fixed problem size, add resources | Amdahl | Time reduction |
| Weak Scaling | Problem grows with resources | Gustafson | Constant time, larger problem |
When to Use Each
Amdahl scenarios:
- User-facing latency requirements
- Real-time constraints
- Interactive applications
Gustafson scenarios:
- Scientific computing (higher resolution simulations)
- Big data processing (process more data in same time)
- Machine learning training (larger batches)
Universal Scalability Law (USL)
Formula and Parameters
C(N) = N / (1 + σ(N-1) + κN(N-1))
σ (sigma): Contention/Serialization coefficient
κ (kappa): Coherence/Crosstalk coefficient
Physical Meaning of Parameters
σ (Contention):
- Represents queuing for shared resources
- Like Amdahl's serial fraction
- Effect: Linear degradation as N increases
- Examples: mutex waits, database locks
κ (Coherence):
- Represents pairwise communication overhead
- Each node must communicate with others for consistency
- Effect: Quadratic degradation (N×(N-1) pairs)
- Examples: cache coherence traffic, distributed consensus
Three Scaling Regimes
| Condition | Behavior | Curve Shape |
|---|---|---|
| σ=0, κ=0 | Linear | Straight diagonal |
| σ>0, κ=0 | Amdahl | Approaches horizontal asymptote |
| σ>0, κ>0 | USL | Peak then decline (retrograde) |
Optimal Parallelism
The peak of C(N) occurs at:
N* = sqrt((1 - σ) / κ)
Beyond N*, adding more processors hurts performance.
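The peak follows from setting dC/dN = 0, which reduces to 1 - σ - κN² = 0. A minimal sketch that compares this closed form with a brute-force search over integer N is shown below. The coefficients σ = 0.02 and κ = 0.0001 are illustrative values only, and the function names are assumptions.

```python
import math

def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_peak(sigma, kappa):
    """Closed-form N* = sqrt((1 - sigma) / kappa); requires kappa > 0."""
    if kappa <= 0:
        raise ValueError("no finite peak when kappa <= 0")
    return math.sqrt((1 - sigma) / kappa)

sigma, kappa = 0.02, 0.0001  # illustrative coefficients
n_star = usl_peak(sigma, kappa)
best_n = max(range(1, 501), key=lambda n: usl_capacity(n, sigma, kappa))
print(f"closed form N* = {n_star:.2f}, best integer N = {best_n}")
```

When κ is zero the curve has no interior peak and flattens toward an asymptote instead, which is why the helper rejects that case rather than dividing by zero.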
Python Fitting Example
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def usl_model(n, sigma, kappa):
    """Universal Scalability Law model."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Measured data
n_data = np.array([1, 2, 4, 8, 16, 32, 64])
throughput_data = np.array([100, 185, 340, 580, 820, 900, 750])
# Normalize to relative capacity
relative_capacity = throughput_data / throughput_data[0]
# Fit USL parameters
params, covariance = curve_fit(usl_model, n_data, relative_capacity,
                               p0=[0.01, 0.001], bounds=(0, 1))
sigma, kappa = params
print(f"Contention (σ): {sigma:.4f}")
print(f"Coherence (κ): {kappa:.6f}")
# Optimal parallelism
n_optimal = np.sqrt((1 - sigma) / kappa)
print(f"Optimal parallelism N*: {n_optimal:.1f}")
# Predict and plot
n_pred = np.linspace(1, 128, 100)
c_pred = usl_model(n_pred, sigma, kappa)
plt.figure(figsize=(10, 6))
plt.scatter(n_data, relative_capacity, label='Measured', s=100)
plt.plot(n_pred, c_pred, 'r-', label=f'USL fit (σ={sigma:.3f}, κ={kappa:.5f})')
plt.axvline(n_optimal, color='g', linestyle='--', label=f'N* = {n_optimal:.1f}')
plt.xlabel('Number of Processors (N)')
plt.ylabel('Relative Capacity C(N)')
plt.title('USL Analysis: Identifying Scalability Limits')
plt.legend()
plt.grid(True)
plt.savefig('usl_analysis.png', dpi=150)
Case Study: Database Connection Pooling
Observations:
- 10 connections: 1000 TPS
- 20 connections: 1800 TPS
- 40 connections: 2800 TPS
- 80 connections: 3200 TPS (peak!)
- 160 connections: 2400 TPS (retrograde)
Fitted: σ=0.02, κ=0.0001
N* = sqrt(0.98/0.0001) ≈ 99 connections
Diagnosis: κ > 0 indicates coherence issues
- Connection pool management overhead
- Distributed lock contention
- Context switch costs
Roofline Model: Cache-Aware Extensions
Cache-Aware Roofline Model (CARM)
Traditional Roofline considers only DRAM bandwidth. CARM extends to multiple memory hierarchy levels.
Multiple Rooflines
[Figure: Performance (GFLOPS) versus Arithmetic Intensity. A horizontal Peak Compute ceiling caps four slanted rooflines: L1 (highest slope), L2, L3, and DRAM (lowest slope).]
Calculating AI for Each Level
| Level | Arithmetic Intensity | When to Use |
|---|---|---|
| AI_L1 | FLOPs / L1_traffic | Data fits in L1 |
| AI_L2 | FLOPs / L2_traffic | Data fits in L2 |
| AI_L3 | FLOPs / L3_traffic | Data fits in L3 |
| AI_DRAM | FLOPs / DRAM_traffic | Streaming from memory |
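As a rough sketch of how the per-level intensities turn into performance bounds, the following computes AI and min(peak, AI × bandwidth) for each level. The peak rate, the bandwidths, and the traffic figures are hypothetical numbers for an imaginary machine and kernel, not measurements, and the function name is an assumption.

```python
def level_bounds(flops, traffic_bytes, peak_gflops, bandwidth_gb_s):
    """Per-level (arithmetic intensity, attainable GFLOPS) using min(peak, AI x BW)."""
    bounds = {}
    for level, bw in bandwidth_gb_s.items():
        ai = flops / traffic_bytes[level]          # FLOPs per byte at this level
        bounds[level] = (ai, min(peak_gflops, ai * bw))
    return bounds

# Hypothetical machine: peak compute and per-level bandwidth
PEAK_GFLOPS = 100.0
BANDWIDTH_GB_S = {"L1": 1000.0, "L2": 500.0, "L3": 200.0, "DRAM": 50.0}

# Hypothetical kernel: 2 GFLOP of work and the bytes moved at each level
FLOPS = 2e9
TRAFFIC_BYTES = {"L1": 32e9, "L2": 8e9, "L3": 2e9, "DRAM": 2e9}

bounds = level_bounds(FLOPS, TRAFFIC_BYTES, PEAK_GFLOPS, BANDWIDTH_GB_S)
for level, (ai, gflops) in bounds.items():
    print(f"{level:5s} AI={ai:7.4f} FLOP/byte  bound={gflops:6.1f} GFLOPS")
tightest = min(bounds, key=lambda level: bounds[level][1])
print(f"Tightest bound: {tightest}")
```

Under these made-up numbers the DRAM roofline gives the lowest bound, so that level would be the first place to look. Real traffic figures would come from hardware counters such as those in the measurement section.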
Measurement with perf
# Measure floating-point operations
perf stat -e fp_arith_inst_retired.scalar_double,\
fp_arith_inst_retired.128b_packed_double ./my_program
# Measure memory traffic (cache misses × cache line size)
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,\
LLC-load-misses,LLC-store-misses ./my_program
Optimization Strategy by Location
| Point Location | Diagnosis | Optimization |
|---|---|---|
| Below DRAM slope | DRAM bandwidth limited | Prefetching, streaming stores |
| Between L3 and DRAM | L3 miss issues | Improve data locality, blocking |
| Near L1 slope, below compute | Low arithmetic intensity | Loop fusion, vectorization |
| Near compute ceiling | Compute limited | Better algorithms, SIMD |
Integer Roofline
For non-FP workloads (compilers, databases, encryption):
- Y-axis: GOPS (Giga-Operations Per Second)
- Simple integer operations typically have lower latency than floating-point operations
- Such workloads tend to be more sensitive to cache latency than FP kernels
Energy Roofline (Green Computing)
- Y-axis: GFLOPS/Watt
- Finds energy-efficient operating points
- Important for HPC and data centers
Queuing Theory Fundamentals
Kendall's Notation: A/S/c/K/N/D
| Symbol | Meaning | Common Values |
|---|---|---|
| A | Arrival distribution | M (Poisson), D (Deterministic), G (General) |
| S | Service distribution | M (Exponential), D, G |
| c | Number of servers | 1, c, ∞ |
| K | System capacity | ∞ (unbounded), finite |
| N | Population size | ∞, finite |
| D | Queue discipline | FIFO, LIFO, Priority |
M/M/1 Model: Complete Analysis
Assumptions:
- Poisson arrivals (rate λ)
- Exponential service times (rate μ)
- Single server
- FIFO queue
- Infinite capacity
Key Formulas:
Utilization: ρ = λ/μ (must be < 1 for stability)
Probability of n items in system: P_n = (1-ρ) × ρ^n
Average items in system: L = ρ/(1-ρ)
Average items in queue: L_q = ρ²/(1-ρ)
Average time in system: W = 1/(μ-λ)
Average time in queue: W_q = ρ/(μ-λ)
The Hockey Stick Effect:
| Utilization (ρ) | Avg Items in System (L) | Time in System (W) |
|---|---|---|
| 50% | 1.0 | 2× service time |
| 70% | 2.3 | 3.3× |
| 80% | 4.0 | 5× |
| 90% | 9.0 | 10× |
| 95% | 19.0 | 20× |
| 99% | 99.0 | 100× |
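The table rows follow directly from the M/M/1 formulas. A minimal sketch that recomputes them, with the service rate normalized to 1 so that W is expressed in multiples of the service time; the function name `mm1_metrics` is an assumption.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 metrics; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: need arrival_rate < service_rate")
    rho = arrival_rate / service_rate
    return {
        "rho": rho,
        "L": rho / (1 - rho),                        # average items in system
        "Lq": rho ** 2 / (1 - rho),                  # average items in queue
        "W": 1 / (service_rate - arrival_rate),      # average time in system
        "Wq": rho / (service_rate - arrival_rate),   # average time in queue
    }

service_rate = 1.0  # one task per unit time, so W is in multiples of service time
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    m = mm1_metrics(rho * service_rate, service_rate)
    print(f"rho={rho:.2f}  L={m['L']:6.1f}  W={m['W']:6.1f}x service time")
```

Both pairs satisfy Little's Law (L = λ × W and L_q = λ × W_q), which is a quick sanity check when adapting the formulas.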
M/M/c Model: Multiple Servers
Erlang C Formula (probability of waiting):
import math
def erlang_c(c, rho):
    """Calculate probability of queuing in M/M/c system."""
    a = c * rho  # Offered load
    # Erlang C formula
    sum_term = sum((a**k) / math.factorial(k) for k in range(c))
    last_term = (a**c) / (math.factorial(c) * (1 - rho))
    return last_term / (sum_term + last_term)
# Example: 10 servers, 80% average utilization
prob_wait = erlang_c(10, 0.8)
print(f"Probability of queuing: {prob_wait:.1%}")
Capacity Planning Guidelines
- 70% Rule: Keep utilization ≤ 70% for latency-sensitive systems
- Target Wait Probability: For SLA, target < 5% probability of waiting
- Headroom for Bursts: Leave 30% headroom for traffic spikes
Connection Pool Sizing Formula
def optimal_pool_size(arrival_rate, service_time, target_wait_prob=0.05):
    """Calculate the smallest pool size that meets the wait-probability target."""
    offered_load = arrival_rate * service_time  # Average number of busy connections
    # Linear search for minimum c where P(wait) < target
    for c in range(1, 1000):
        if offered_load >= c:  # Unstable: per-connection utilization would be >= 1
            continue
        if erlang_c(c, offered_load / c) < target_wait_prob:
            return c
    return None
# Example: 100 requests/sec, 50ms average service time
pool_size = optimal_pool_size(100, 0.05)
print(f"Recommended pool size: {pool_size} connections")
Tools and Measurement
Performance Counter Events
| Tool | Best For | Key Events |
|---|---|---|
| perf | Linux, general | cycles, instructions, cache-misses |
| Intel VTune | Intel CPUs | vectorization, memory bandwidth |
| Intel Advisor | Roofline analysis | FLOPS, memory traffic |
| Likwid | HPC, multi-arch | configurable groups |
| PAPI | Cross-platform | portable API |
Measurement Best Practices
- Warm up: Run several iterations before measuring
- Steady state: Ensure system reaches equilibrium
- Multiple runs: Report mean and standard deviation
- Control variables: Pin cores, disable frequency scaling
- Representative load: Use realistic workloads
References
- Amdahl, G.M. (1967). "Validity of the single processor approach"
- Gustafson, J.L. (1988). "Reevaluating Amdahl's Law"
- Gunther, N.J. (1998). "The Practical Performance Analyst"
- Williams, S. et al. (2009). "Roofline: An Insightful Visual Performance Model"
- Little, J.D.C. (1961). "A Proof for the Queuing Formula: L = λW"
- Kleinrock, L. (1975). "Queueing Systems, Volume 1: Theory"
- Ilic, A. et al. (2014). "Cache-Aware Roofline Model: Upgrading the Loft"