Chapter 8: Profiling Tools
Part II: Tools
"Premature optimization is the root of all evil." — Donald Knuth
"But you can't optimize what you don't measure." — Also Donald Knuth (paraphrased)
The Three Weeks I Optimized the Wrong Thing
I once spent three weeks optimizing a function.
It was an image processing pipeline that ran too slowly. I looked at the code and decided the bottleneck must be in the pixel processing loop—after all, it had millions of iterations.
So I started optimizing: SIMD vectorization, loop unrolling, cache blocking... Three weeks later, that loop was 5× faster.
Overall performance improvement? 3%.
Because the real bottleneck wasn't there. It was I/O. File reading time. The part I never looked at.
If I had run a profiler first, I would have found the real bottleneck in five minutes. Three weeks of work could have been avoided with five minutes of profiling.
This is the value of profiling.
Basic Profiling Concepts
Sampling vs Instrumentation
Profilers use two main approaches:
Sampling
Every N milliseconds, pause the program, record current location
↓
Count which functions appear most often
↓
More appearances = more execution time
Pros: Low overhead, doesn't affect program behavior Cons: Statistical approximation, may miss short functions
Instrumentation
Insert timing code at every function entry/exit
↓
Precisely record time for each call
↓
Completely accurate call counts and times
Pros: Precise Cons: High overhead, may change program behavior
Common Profilers
| Tool | Type | Platform | Features |
|---|---|---|---|
| perf | Sampling | Linux | Most powerful, hardware PMU support |
| gprof | Instrumentation | Unix | Classic, but outdated |
| Valgrind | Simulation | Linux/macOS | Precise but very slow |
| VTune | Sampling | Linux/Windows | Intel official, GUI |
| Instruments | Sampling | macOS | Apple official |
gprof: The Classic Instrumentation Profiler
gprof is a product of 1980s BSD Unix, still widely used in teaching. Understanding its workings and limitations helps appreciate modern tools.
Basic Usage
# 1. Compile with -pg flag
gcc -pg -O2 -o my_program my_program.c
# 2. Run program (generates gmon.out)
./my_program
# 3. Analyze results
gprof my_program gmon.out > analysis.txt
Reading gprof Output
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.23 1.23 1.23 100000 0.01 0.02 process_pixel
23.45 1.87 0.64 100000 0.01 0.01 apply_filter
12.34 2.21 0.34 1 340.00 450.00 main
8.76 2.45 0.24 1000000 0.00 0.00 get_value
- % time: Percentage of total time in this function
- self seconds: Time in function itself (excluding callees)
- calls: Number of calls
- self ms/call: Time per call
gprof's Serious Limitations
1. Can only profile instrumented code
# These won't appear in gprof reports:
# - System call time
# - Dynamic libraries (unless also compiled with -pg)
# - I/O wait time
2. Instrumentation changes performance characteristics
3. Sampling resolution too low — 10ms intervals
4. Poor multi-threading support
5. Can't see inlined functions
gprof vs perf
| Aspect | gprof | perf |
|---|---|---|
| Mechanism | Instrumentation | Sampling (PMU) |
| Overhead | High (10-30%) | Low (<5%) |
| System calls | Invisible | Visible |
| Shared libraries | Need recompile | Direct support |
| Multi-threading | Poor | Good |
Conclusion: Unless there's a special reason, prefer perf on modern Linux.
perf: Linux Performance Analysis Powerhouse
perf is a performance analysis tool built into the Linux kernel, extremely powerful.
Basic Usage
# Record performance data
perf record ./my_program
# View report
### perf record + perf report
```bash
# Record (-g means record call stack)
perf record -g ./my_program
# Interactive report
perf report
Overhead Command Shared Object Symbol
45.23% my_program my_program [.] process_pixel
23.45% my_program libc.so.6 [.] memcpy
12.34% my_program my_program [.] read_file
8.76% my_program libm.so.6 [.] exp
...
Now you know process_pixel takes 45% of the time. That's where to optimize.
Common perf Events
# CPU cycles
perf stat -e cycles,instructions ./program
# Cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program
# Branch prediction
perf stat -e branch-misses,branches ./program
# List all available events
perf list
perf Advanced: annotate
# See source-level hotspots
perf annotate process_pixel
│ void process_pixel(int* data, int n) {
│ for (int i = 0; i < n; i++) {
45.23 │ data[i] = expensive_calc(data[i]);
│ }
│ }
Now you know the expensive_calc line is slowest.
Flame Graphs: Visualizing Call Stacks
Flame Graph is a visualization method invented by Brendan Gregg that lets you see program hotspots at a glance.
What Is a Flame Graph
┌─────────────────────────────────────────────────────────────┐
│ main │
├─────────────────────────────┬───────────────────────────────┤
│ process_image │ read_file │
├─────────────┬───────────────┤ │
│ process_row │ apply_filter │ │
├─────────────┴───────────────┤ │
│ process_pixel │ │
└─────────────────────────────┴───────────────────────────────┘
Width = time proportion
Height = call stack depth
- X-axis: Not time order—alphabetically sorted function names
- Y-axis: Call stack depth
- Width: CPU time proportion of function (including children)
Generating Flame Graphs
# 1. Record with perf
perf record -g ./my_program
# 2. Generate readable stack trace
perf script > out.perf
# 3. Convert with FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
# 4. Open SVG in browser
firefox flamegraph.svg
Reading Flame Graphs
- Find the widest "plateaus" — These are where most time is spent
- Read bottom to top — Bottom is caller, top is callee
- Click to zoom in — SVG is interactive
Common Patterns
Pattern 1: Single Hotspot
┌──────────────────────────────────────────┐
│ main │
├──────────────────────────────────────────┤
│ hot_function (90%) │
└──────────────────────────────────────────┘
→ Optimize hot_function
Pattern 2: Wide Base
┌──────────────────────────────────────────┐
│ main │
├────┬────┬────┬────┬────┬────┬────┬───────┤
│ f1 │ f2 │ f3 │ f4 │ f5 │ f6 │ f7 │ ... │
└────┴────┴────┴────┴────┴────┴────┴───────┘
→ No single hotspot; need overall optimization or algorithm change
Pattern 3: Deep Narrow Stack
┌───┐
│ a │
├───┤
│ b │
├───┤
│ c │
├───┤
│ d │ ← Actual work here
└───┘
→ Consider reducing call stack depth or inlining
Valgrind: Memory and Cache Analysis
Valgrind is an instrumentation framework. The most commonly used tools are Memcheck (memory error detection) and Cachegrind (cache analysis).
Cachegrind: Cache Behavior Analysis
valgrind --tool=cachegrind ./my_program
==12345== Cachegrind, a cache and branch-prediction profiler
==12345==
==12345== I refs: 1,234,567,890
==12345== I1 misses: 123,456
==12345== LLi misses: 12,345
==12345== I1 miss rate: 0.01%
==12345==
==12345== D refs: 456,789,012 (234,567,890 rd + 222,221,122 wr)
==12345== D1 misses: 12,345,678 ( 8,765,432 rd + 3,580,246 wr)
==12345== LLd misses: 1,234,567 ( 876,543 rd + 358,024 wr)
==12345== D1 miss rate: 2.7% ( 3.7% + 1.6%)
- I refs: Instruction fetches
- D refs: Data reads/writes
- D1 misses: L1 data cache misses
- LLd misses: Last-level cache misses (goes to DRAM)
Callgrind: Call Graph Analysis
valgrind --tool=callgrind ./my_program
# Visualize
kcachegrind callgrind.out.12345
KCachegrind provides interactive call graph visualization showing:
- Inclusive/exclusive time per function
- Call graph
- Source code annotation
Valgrind Limitations
- Very slow — Usually 10-50× slower
- Not real execution — Runs on virtual CPU
- No GPU support — Only analyzes CPU code
Practical Workflow
Step 1: Quick Look with perf stat
perf stat ./my_program
Check if IPC, cache misses, branch misses are abnormal.
Step 2: Find Hotspots with perf record
perf record -g ./my_program
perf report
Find the hottest functions.
Step 3: Visualize with Flame Graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg
See the overall structure at a glance.
Step 4: Deep Analysis
Choose tools based on problem type:
- Cache issues → Cachegrind or VTune Memory Access
- Branch issues → perf branch-misses events
- Multi-threading issues → VTune Threading
Step 5: Profile Again After Optimization
# Before
perf stat ./my_program_v1
# After
perf stat ./my_program_v2
# Compare
Confirm optimization was effective.
Common Pitfalls
Pitfall 1: Debug Build Overhead
# Bad: debug build
gcc -g -O0 program.c
perf record ./a.out # Results don't represent production
# Good: release build with debug info
gcc -g -O3 program.c
perf record ./a.out # Closer to real performance
Pitfall 2: Compiler Optimization Changes Behavior
Use -fno-omit-frame-pointer to preserve frame pointer for more accurate call stacks.
Pitfall 3: Short Execution Time Noise
# Bad: too short, too much noise
perf stat ./quick_program # 0.01 seconds
# Good: run long enough
perf stat ./quick_program --iterations=10000 # 10 seconds
Back to That Three-Week Story
If I had done this:
$ perf record -g ./image_pipeline
$ perf report
Overhead Symbol
65.23% read_file ← This was the bottleneck!
23.45% write_file
8.76% process_pixel ← Where I spent three weeks
...
Five minutes would have revealed I/O was the bottleneck. Three weeks of optimization could have been spent on the right thing.
Summary
Profiling is the first step in performance optimization:
Tool Selection
- perf: Linux first choice, free and powerful
- Flame Graph: Visualize call stacks
- Valgrind/Cachegrind: Cache behavior analysis
- VTune: Intel CPU deep analysis
- Instruments: macOS development
Workflow
perf statfor quick overviewperf record+perf reportto find hotspots- Flame Graph for visualization
- Deep analysis (cache/branch/threading)
- Profile again after optimization
Core Principles
- Profile first, then optimize
- Don't guess bottlenecks before profiling
- Profile with release builds
- Ensure execution time is long enough