Appendix G: Further Reading


"Standing on the shoulders of giants."

This appendix collects books, papers, and online resources that shaped how this book thinks about performance engineering and benchmarking. Treat it as a map: dip in when a topic from the main chapters sparks your curiosity.

Editor's note: If you're in the middle of a real performance incident, start with Systems Performance, the Roofline paper, and Drepper's memory article. Come back to the rest when things are calm.

Reading Guide (Inside This Book)

Reading Paths by Role

Different readers can take different paths through the main chapters. These are suggested starting points rather than strict sequences.

Reader type | Goal | Suggested chapters (main text)
System / embedded engineer | Understand system bottlenecks | Ch 1–4, 5–8, 9, 16–18, 19–22, 30, 33–35
ML / AI engineer | Focus on AI/ML and LLM performance | Ch 1–4, 5, 8, 19, 20, 23–27, 30, 32–35
HPC / perf researcher | Connect theory, hardware, and models | Ch 1–4, 5–7, 10–12, 16–18, 23–27, 30–32, 33–35

Within each path, you can always jump to appendices for hands-on exercises and environment setup when you are ready to run real benchmarks.

Topic Map (Concept → Chapters)

Use this as a quick index when you want to revisit a concept from the main text.

  • Benchmarking methodology and statistics: Ch 1–4, 10
  • Profiling tools and observability: Ch 5–8, 30–32
  • Cache, memory, and locality: Ch 2, 6, 12–15, 18, Appendix C, Appendix E
  • Data structures and algorithms in practice: Ch 13–15, 30, 31
  • Parallelism and multi-core scaling: Ch 16–18, 23, 30–32
  • Embedded and footprint constraints: Ch 9, 19–22, Appendix B, Appendix E
  • AI/ML and LLM performance: Ch 20, 23–27, 29, 32
  • End-to-end practice (how to benchmark / optimize / ship): Ch 33–35, Appendix A

When the structure evolves in future versions, this topic map is the single place that should be updated.

Books

Systems Background

Computer Systems: A Programmer's Perspective (3rd Edition) - Randal E. Bryant and David R. O'Hallaron, Pearson, 2015. A comprehensive introduction to how modern computer systems work, useful background for understanding performance bottlenecks across hardware and software.

Performance Engineering

Systems Performance: Enterprise and the Cloud (2nd Edition) - Brendan Gregg, Addison-Wesley, 2020. A broad, practical reference for performance methodology, Linux observability tools, and real production case studies.

Key chapters:

  • Chapter 2: Methodologies
  • Chapter 6: CPUs
  • Chapter 7: Memory
  • Chapter 13: perf

BPF Performance Tools - Brendan Gregg, Addison-Wesley, 2019. A modern guide to Linux observability with eBPF, useful once basic tools like perf feel natural.

Key chapters:

  • Chapter 4: BCC
  • Chapter 5: bpftrace
  • Chapters 6–15: Subsystem analysis

The Art of Writing Efficient Programs - Fedor G. Pikus, Packt, 2021. Focuses on high-performance C++ and shows how algorithms interact with modern CPUs and memory systems.

Key chapters:

  • Chapter 2: Performance Measurements
  • Chapter 3: CPU Architecture
  • Chapter 4: Memory Architecture
  • Chapter 9: High-Performance C++

Computer Architecture

Computer Architecture: A Quantitative Approach (6th Edition) - John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. The classic reference for processors, memory hierarchies, and quantitative evaluation.

Key chapters:

  • Chapter 1: Fundamentals
  • Chapter 2: Memory Hierarchy
  • Appendix A: Instruction Set Principles

Modern Processor Design - John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013. A deeper treatment of superscalar and out-of-order processors that explains many microarchitectural effects seen in benchmarks.


Benchmarking

Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software - Connie U. Smith and Lloyd G. Williams, Addison-Wesley, 2001. A foundational text on software performance engineering and workload design.

Every Computer Performance Book - Bob Wescott, 2013. A short, very practical book full of rules of thumb for real-world performance work.

Papers

Benchmarking Methodology

How Not to Measure Computer System Performance - David J. Lilja, IEEE Computer, 2005. A concise overview of common benchmarking mistakes.

Producing Wrong Data Without Doing Anything Obviously Wrong! - Todd Mytkowicz et al., ASPLOS 2009. Shows how environment size, link order, and other details can silently corrupt results.

Key findings:

  • UNIX environment size affects performance
  • Link order matters
  • Measurement bias is pervasive

Rigorous Benchmarking in Reasonable Time - Tomas Kalibera and Richard Jones, ISMM 2013. Explains how to design statistically sound experiments without burning weeks of CPU time.

Stabilizer: Statistically Sound Performance Evaluation - Charlie Curtsinger and Emery D. Berger, ASPLOS 2013. Uses randomization to make performance measurements more robust and statistically sound.
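
A common thread in the papers above is that a single timing is not a result; a set of repeated runs with a stated uncertainty is. As a minimal sketch of that idea, the snippet below computes a percentile-bootstrap confidence interval for the mean of repeated run times. This is a generic statistical technique, not the specific procedure from Kalibera and Jones or from Stabilizer, and the function name `bootstrap_ci` and the sample timings are invented for illustration.

```python
import random
import statistics


def bootstrap_ci(samples, confidence=0.95, resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`.

    Hypothetical helper for illustration; a fixed seed keeps it repeatable.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = means[int(tail * resamples)]
    hi = means[int((1.0 - tail) * resamples) - 1]
    return lo, hi


# Made-up wall-clock times (seconds) from ten runs of the same benchmark.
run_times = [10.2, 10.4, 9.9, 10.1, 10.8, 10.0, 10.3, 10.2, 9.8, 10.5]

lo, hi = bootstrap_ci(run_times)
print(f"mean = {statistics.fmean(run_times):.2f} s, "
      f"95% CI = [{lo:.2f}, {hi:.2f}] s")
```

If two configurations produce heavily overlapping intervals, more runs or a better-controlled environment are probably needed before claiming a difference.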

Roofline Model

Roofline: An Insightful Visual Performance Model for Multicore Architectures - Samuel Williams et al., Communications of the ACM, 2009. Introduces the Roofline model used throughout this book.


Cache-Aware Roofline Model: Upgrading the Loft - Aleksandar Ilic et al., IEEE Computer Architecture Letters, 2014. Extends Roofline to account for multiple cache levels.
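
For readers who want the core idea before opening the papers: the basic Roofline bound says attainable performance is the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below encodes only that bound. The `Machine` class and its numbers are hypothetical and do not describe any real processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical machine description; the numbers used below are made up."""
    peak_gflops: float     # peak compute throughput, GFLOP/s
    peak_bandwidth: float  # peak memory bandwidth, GB/s


def attainable_gflops(machine: Machine, arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * FLOPs per byte)."""
    return min(machine.peak_gflops,
               machine.peak_bandwidth * arithmetic_intensity)


def ridge_point(machine: Machine) -> float:
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return machine.peak_gflops / machine.peak_bandwidth


machine = Machine(peak_gflops=1000.0, peak_bandwidth=100.0)

print(attainable_gflops(machine, 0.25))  # 25.0   (bandwidth-bound)
print(attainable_gflops(machine, 64.0))  # 1000.0 (compute-bound)
print(ridge_point(machine))              # 10.0 FLOP/byte
```

Kernels to the left of the ridge point benefit most from reducing memory traffic; kernels to the right benefit from better use of the compute units.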


AI/ML Benchmarks

MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance - Peter Mattson et al., IEEE Micro, 2020. Describes the design and goals of the MLPerf benchmark suite.

Measuring the Algorithmic Efficiency of Neural Networks - Danny Hernandez and Tom Brown, arXiv 2020. Studies trends in the algorithmic efficiency of neural networks over time.

Online Resources

Optimization Manuals

Agner Fog's Optimization Resources - https://www.agner.org/optimize/. A comprehensive collection of optimization manuals, instruction tables, and microarchitecture notes for x86/x64.

Intel 64 and IA-32 Architectures Optimization Reference Manual - https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html. Intel's official optimization guide for its processors.

ARM Performance Analysis Guides - https://developer.arm.com/documentation/. Official documentation and tuning guides for ARM CPUs.

Memory & Cache

What Every Programmer Should Know About Memory - Ulrich Drepper, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf. A long but rewarding deep dive into modern memory hierarchies.

Gallery of Processor Cache Effects - Igor Ostrovsky, http://igoro.com/archive/gallery-of-processor-cache-effects/. A tour of cache behavior through short code examples and timing charts.

Benchmarking Tools

SPEC CPU 2017 - https://www.spec.org/cpu2017/. The industry-standard CPU benchmark suite used in academia and industry.

Phoronix Test Suite - https://www.phoronix-test-suite.com/. A large collection of open-source benchmarks for Linux and other platforms.

Google Benchmark - https://github.com/google/benchmark. A C++ microbenchmarking framework that pairs well with the microbenchmark patterns in this book.

Courses

MIT 6.172: Performance Engineering of Software Systems - https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/

Covers profiling, cache optimization, parallelism, and systematic performance methodology.


Berkeley CS267: Applications of Parallel Computers - https://sites.google.com/lbl.gov/cs267-spr2024

An advanced course on parallel computing and high-performance computing (HPC).


CMU 15-418/618: Parallel Computer Architecture and Programming - http://www.cs.cmu.edu/~418/

Another classic course on parallel programming and computer architecture.


Blogs

Brendan Gregg's Blog - https://www.brendangregg.com/

Deep-dive articles on performance analysis and observability. Especially recommended:

  • "Linux Performance" (overview)
  • "Flame Graphs"
  • "CPU Flame Graphs"

Mechanical Sympathy - https://mechanical-sympathy.blogspot.com/

Discussions of hardware-aware programming and the interaction between code and modern CPUs.


Daniel Lemire's Blog - https://lemire.me/blog/

Regular posts on data-oriented design, SIMD optimization, and fast software techniques.


Travis Downs' Blog - https://travisdowns.github.io/

Low-level CPU performance analysis, microbenchmarks, and deep dives into instruction behavior.


Tools

Profiling

Tool | Platform | Description
perf | Linux | Built-in Linux profiler
VTune | x86 | Intel's advanced profiler
Instruments | macOS | Apple's profiling suite
Tracy | Cross | Real-time profiler popular in game development

Benchmarking

Tool | Language | Description
Google Benchmark | C++ | Microbenchmark library
Criterion | Rust | Rust benchmark library
pytest-benchmark | Python | Python benchmark plugin
JMH | Java | Java microbenchmark harness

Visualization

Tool | Description
FlameGraph | Stack trace and sample visualization
Perfetto | Chrome trace-style viewer for traces
Hotspot | GUI for visualizing perf data

Suggested Reading Paths

Different readers will care about different parts of this appendix. Here are a few short routes.

Reader type | Core book | Key paper / resource | Course
System / embedded engineer | Systems Performance | Drepper, "What Every Programmer Should Know About Memory" | MIT 6.172
ML / AI engineer | Systems Performance | MLPerf papers; "Measuring the Algorithmic Efficiency of Neural Networks" | CS267 (selected lectures)
HPC / performance researcher | Computer Architecture: A Quantitative Approach | Roofline and Cache-Aware Roofline papers | CS267 or 15-418/618

System / Embedded Engineers

  • Start with Systems Performance for methodology, tools, and mental models.
  • Skim Computer Architecture: A Quantitative Approach (CAQA) Ch. 1–2 and the Roofline paper when you need hardware intuition.
  • Keep Drepper's memory paper and Agner Fog's manuals nearby for tricky cache and latency behavior.

ML / AI Engineers

  • Read the MLPerf paper and the algorithmic efficiency paper alongside this book's AI/ML chapters.
  • Use Systems Performance for general methodology and system-level bottlenecks.
  • Pair this with CS267 lectures focused on dense linear algebra and GPU performance.

HPC / Research-Oriented Readers

  • Start from CAQA and Modern Processor Design for architecture depth.
  • Study the Roofline and Cache-Aware Roofline papers, then apply them to your own kernels.
  • Use CS267 or 15-418/618 as a structured path through parallel architectures and performance case studies.

Most importantly, keep connecting what you read back to real measurements on systems you control. Reading without measurement becomes trivia; measurement without theory becomes blind trial-and-error.