Appendix G: Further Reading
"Standing on the shoulders of giants."
This appendix collects books, papers, and online resources that shaped how this book thinks about performance engineering and benchmarking. Treat it as a map: dip in when a topic from the main chapters sparks your curiosity.
Editor's note: If you're in the middle of a real performance incident, start with Systems Performance, the Roofline paper, and Drepper's memory article. Come back to the rest when things are calm.
Reading Guide (Inside This Book)
Reading Paths by Role
Different readers can take different paths through the main chapters. These are suggested starting points rather than strict sequences.
| Reader type | Goal | Suggested chapters (main text) |
|---|---|---|
| System / embedded engineer | Understand system bottlenecks | Ch 1–4, 5–8, 9, 16–18, 19–22, 30, 33–35 |
| ML / AI engineer | Focus on AI/ML and LLM performance | Ch 1–4, 5, 8, 19, 20, 23–27, 30, 32–35 |
| HPC / perf researcher | Connect theory, hardware, and models | Ch 1–4, 5–7, 10–12, 16–18, 23–27, 30–32, 33–35 |
Within each path, you can always jump to appendices for hands-on exercises and environment setup when you are ready to run real benchmarks.
Topic Map (Concept → Chapters)
Use this as a quick index when you want to revisit a concept from the main text.
- Benchmarking methodology and statistics: Ch 1–4, 10
- Profiling tools and observability: Ch 5–8, 30–32
- Cache, memory, and locality: Ch 2, 6, 12–15, 18, Appendix C, Appendix E
- Data structures and algorithms in practice: Ch 13–15, 30, 31
- Parallelism and multi-core scaling: Ch 16–18, 23, 30–32
- Embedded and footprint constraints: Ch 9, 19–22, Appendix B, Appendix E
- AI/ML and LLM performance: Ch 20, 23–27, 29, 32
- End-to-end practice (how to benchmark / optimize / ship): Ch 33–35, Appendix A
If the chapter structure changes in a future edition, this topic map is the one place that needs to be updated.
Books
Systems Background
Computer Systems: A Programmer's Perspective (3rd Edition) - Randal E. Bryant and David R. O'Hallaron, Pearson, 2015. A comprehensive introduction to how modern computer systems work, useful background for understanding performance bottlenecks across hardware and software.
Performance Engineering
Systems Performance: Enterprise and the Cloud (2nd Edition) - Brendan Gregg, Addison-Wesley, 2020. A broad, practical reference for performance methodology, Linux observability tools, and real production case studies.
Key chapters:
- Chapter 2: Methodologies
- Chapter 6: CPUs
- Chapter 7: Memory
- Chapter 13: perf
BPF Performance Tools - Brendan Gregg, Addison-Wesley, 2019. A modern guide to Linux observability with eBPF, useful once basic tools like perf feel natural.
Key chapters:
- Chapter 4: BCC
- Chapter 5: bpftrace
- Chapters 6-15: Subsystem analysis
The Art of Writing Efficient Programs - Fedor G. Pikus, Packt, 2021. Focuses on high-performance C++ and shows how algorithms interact with modern CPUs and memory systems.
Key chapters:
- Chapter 2: Performance Measurements
- Chapter 3: CPU Architecture
- Chapter 4: Memory Architecture
- Chapter 9: High-Performance C++
Computer Architecture
Computer Architecture: A Quantitative Approach (6th Edition) - John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. The classic reference for processors, memory hierarchies, and quantitative evaluation.
Key chapters:
- Chapter 1: Fundamentals
- Chapter 2: Memory Hierarchy
- Appendix A: Instruction Set Principles
Modern Processor Design - John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013. A deeper treatment of superscalar and out-of-order processors that explains many microarchitectural effects seen in benchmarks.
Benchmarking
Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software - Connie U. Smith and Lloyd G. Williams, Addison-Wesley, 2001. A foundational text on software performance engineering and workload design.
Every Computer Performance Book - Bob Wescott, 2013. A short, very practical book full of rules of thumb for real-world performance work.
Papers
Benchmarking Methodology
How Not to Measure Computer System Performance - David J. Lilja, IEEE Computer, 2005. A concise overview of common benchmarking mistakes.
Producing Wrong Data Without Doing Anything Obviously Wrong! - Todd Mytkowicz et al., ASPLOS 2009. Shows how environment size, link order, and other details can silently corrupt results.
Key findings:
- UNIX environment size affects performance
- Link order matters
- Measurement bias is pervasive
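The first finding can be probed on your own machine with a small harness. The sketch below is not the authors' experimental setup: it simply reruns one command while padding the process environment and reports the median wall-clock time per padding size. The variable name `PERF_ENV_PADDING`, the padding sizes, and the toy workload are all assumptions chosen for illustration. An interpreted workload like this one may show nothing but noise, so substitute a compiled benchmark of your own.

```python
import os
import statistics
import subprocess
import sys
import time


def time_command(cmd, env_padding_bytes, repeats=5):
    """Median wall-clock seconds for cmd, run with a padded environment."""
    env = dict(os.environ)
    # PERF_ENV_PADDING is a hypothetical variable; it exists only to grow
    # the environment block that the child process inherits.
    env["PERF_ENV_PADDING"] = "x" * env_padding_bytes
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def sweep_environment_sizes(cmd, paddings=(0, 1024, 4096), repeats=5):
    """Map each padding size in bytes to the median run time of cmd."""
    return {n: time_command(cmd, n, repeats) for n in paddings}


if __name__ == "__main__":
    # Toy workload (an assumption): replace with your own benchmark binary.
    workload = [sys.executable, "-c", "sum(i * i for i in range(100_000))"]
    for padding, seconds in sweep_environment_sizes(workload).items():
        print(f"padding={padding:5d} B  median={seconds * 1e3:8.2f} ms")
```

If the medians differ by more than the run-to-run spread, the environment size is a factor your experiment setup should randomize or hold fixed.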
Rigorous Benchmarking in Reasonable Time - Tomas Kalibera and Richard Jones, ISMM 2013. Explains how to design statistically sound experiments without burning weeks of CPU time.
Stabilizer: Statistically Sound Performance Evaluation - Charlie Curtsinger and Emery D. Berger, ASPLOS 2013. Uses randomization to make performance measurements more robust and statistically sound.
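A common thread in these papers is to report an interval for the speedup rather than a single number. As a minimal sketch of that idea, and not the exact procedure from either paper (Kalibera and Jones, for example, also model several levels of repetition), the function below computes a percentile-bootstrap confidence interval for the ratio of mean run times. The function name, the resample count, and the sample timings are assumptions for illustration.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline, candidate, resamples=10_000,
                         confidence=0.95, seed=0):
    """Percentile-bootstrap interval for mean(baseline) / mean(candidate)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    ratios = []
    for _ in range(resamples):
        # Resample each set of timings with replacement, keeping its size.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        ratios.append(statistics.fmean(b) / statistics.fmean(c))
    ratios.sort()
    tail = (1.0 - confidence) / 2.0
    lower = ratios[int(tail * resamples)]
    upper = ratios[int((1.0 - tail) * resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up timings in milliseconds, for illustration only.
    baseline_ms = [10.2, 10.5, 9.9, 10.1, 10.4, 10.3]
    candidate_ms = [9.1, 9.4, 9.0, 9.3, 9.2, 9.5]
    low, high = bootstrap_speedup_ci(baseline_ms, candidate_ms)
    print(f"speedup 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that excludes 1.0 suggests the change has a real effect. An interval that straddles 1.0 means the data cannot distinguish a speedup from a slowdown.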
Roofline Model
Roofline: An Insightful Visual Performance Model for Multicore Architectures - Samuel Williams et al., Communications of the ACM, 2009. Introduces the Roofline model used throughout this book.
Cache-Aware Roofline Model: Upgrading the Loft - Aleksandar Ilic et al., IEEE Computer Architecture Letters, 2014. Extends Roofline to account for multiple cache levels.
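As a quick reminder of what the basic Roofline model computes, attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below encodes that bound. The machine numbers (1000 GFLOP/s, 100 GB/s) and the kernel's intensity are hypothetical and are not measurements from this book.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: min(compute peak, intensity * memory bandwidth).

    arithmetic_intensity is in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_gflops / peak_bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s DRAM bandwidth.
    peak, bandwidth = 1000.0, 100.0

    # Triad-like kernel a[i] = b[i] + s * c[i] on doubles:
    # 2 FLOP per element over 24 bytes of traffic (ignoring write-allocate).
    triad_intensity = 2 / 24

    print(f"ridge point: {ridge_point(peak, bandwidth):.1f} FLOP/byte")
    print(f"triad bound: {attainable_gflops(triad_intensity, peak, bandwidth):.2f} GFLOP/s")
```

A kernel whose intensity sits left of the ridge point is limited by memory traffic, so reducing bytes moved helps more than adding vector width.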
AI/ML Benchmarks
MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance - Peter Mattson et al., IEEE Micro, 2020. Describes the design and goals of the MLPerf benchmark suite.
Measuring the Algorithmic Efficiency of Neural Networks - Danny Hernandez and Tom Brown, arXiv 2020. Studies trends in the algorithmic efficiency of neural networks over time.
Online Resources
Optimization Manuals
Agner Fog's Optimization Resources - https://www.agner.org/optimize/. A comprehensive collection of optimization manuals, instruction tables, and microarchitecture notes for x86/x64.
Intel 64 and IA-32 Architectures Optimization Reference Manual - https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html. Intel's official optimization guide for their processors.
ARM Performance Analysis Guides - https://developer.arm.com/documentation/. Official documentation and tuning guides for ARM CPUs.
Memory & Cache
What Every Programmer Should Know About Memory - Ulrich Drepper, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf. A long but rewarding deep dive into modern memory hierarchies.
Gallery of Processor Cache Effects - Igor Ostrovsky, http://igoro.com/archive/gallery-of-processor-cache-effects/. A tour of cache behavior through short code experiments and timing graphs.
Benchmarking Tools
SPEC CPU 2017 - https://www.spec.org/cpu2017/. The standard CPU benchmark suite in both academia and industry.
Phoronix Test Suite - https://www.phoronix-test-suite.com/. A large collection of open-source benchmarks for Linux and other platforms.
Google Benchmark - https://github.com/google/benchmark. A C++ microbenchmarking framework that pairs well with the microbenchmark patterns in this book.
Courses
MIT 6.172: Performance Engineering of Software Systems https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/
Covers profiling, cache optimization, parallelism, and systematic performance methodology.
Berkeley CS267: Applications of Parallel Computers https://sites.google.com/lbl.gov/cs267-spr2024
An advanced course on parallel computing and high-performance computing (HPC).
CMU 15-418/618: Parallel Computer Architecture and Programming http://www.cs.cmu.edu/~418/
Another classic course on parallel programming and computer architecture.
Blogs
Brendan Gregg's Blog https://www.brendangregg.com/
Deep-dive articles on performance analysis and observability. Especially recommended:
- "Linux Performance" (overview)
- "Flame Graphs"
- "CPU Flame Graphs"
Mechanical Sympathy https://mechanical-sympathy.blogspot.com/
Discussions of hardware-aware programming and the interaction between code and modern CPUs.
Daniel Lemire's Blog https://lemire.me/blog/
Regular posts on data-oriented design, SIMD optimization, and fast software techniques.
Travis Downs' Blog https://travisdowns.github.io/
Low-level CPU performance analysis, microbenchmarks, and deep dives into instruction behavior.
Tools
Profiling
| Tool | Platform | Description |
|---|---|---|
| perf | Linux | Built-in Linux profiler |
| VTune | Linux, Windows (x86) | Intel's profiler with hardware-counter and microarchitecture analysis |
| Instruments | macOS | Apple's profiling suite |
| Tracy | Cross-platform | Real-time profiler popular in game development |
Benchmarking
| Tool | Language | Description |
|---|---|---|
| Google Benchmark | C++ | Microbenchmark library |
| Criterion | Rust | Rust benchmark library |
| pytest-benchmark | Python | Python benchmark plugin |
| JMH | Java | Java microbenchmark harness |
Visualization
| Tool | Description |
|---|---|
| FlameGraph | Stack trace and sample visualization |
| Perfetto | Browser-based trace viewer in the style of Chrome's tracing UI |
| Hotspot | GUI for visualizing perf data |
Suggested Reading Paths
Different readers will care about different parts of this appendix. Here are a few short routes.
| Reader type | Core book | Key paper / resource | Course |
|---|---|---|---|
| System / embedded engineer | Systems Performance | Drepper, "What Every Programmer Should Know About Memory" | MIT 6.172 |
| ML / AI engineer | Systems Performance | MLPerf paper; "Measuring the Algorithmic Efficiency of Neural Networks" | CS267 (selected lectures) |
| HPC / performance researcher | Computer Architecture: A Quantitative Approach | Roofline and Cache-Aware Roofline papers | CS267 or 15-418/618 |
System / Embedded Engineers
- Start with Systems Performance for methodology, tools, and mental models.
- Skim Chapters 1-2 of Computer Architecture: A Quantitative Approach and the Roofline paper when you need hardware intuition.
- Keep Drepper's memory article and Agner Fog's manuals nearby for tricky cache and latency behavior.
ML / AI Engineers
- Read the MLPerf paper and the algorithmic efficiency paper alongside this book's AI/ML chapters.
- Use Systems Performance for general methodology and system-level bottlenecks.
- Pair both with the CS267 lectures on dense linear algebra and GPU performance.
HPC / Research-Oriented Readers
- Start from Computer Architecture: A Quantitative Approach and Modern Processor Design for architecture depth.
- Study the Roofline and Cache-Aware Roofline papers, then apply them to your own kernels.
- Use CS267 or 15-418/618 as a structured path through parallel architectures and performance case studies.
Most importantly, keep connecting what you read back to real measurements on systems you control. Reading without measurement becomes trivia; measurement without theory becomes blind trial-and-error.