Chapter 8: Profiling Tools

Part II: Tools

"Premature optimization is the root of all evil." — Donald Knuth

"But you can't optimize what you don't measure." — Also Donald Knuth (paraphrased)

The Three Weeks I Optimized the Wrong Thing

I once spent three weeks optimizing a function.

It was an image processing pipeline that ran too slowly. I looked at the code and decided the bottleneck must be in the pixel processing loop—after all, it had millions of iterations.

So I started optimizing: SIMD vectorization, loop unrolling, cache blocking... Three weeks later, that loop was 5× faster.

Overall performance improvement? 3%.

Because the real bottleneck wasn't there. It was I/O. File reading time. The part I never looked at.

If I had run a profiler first, I would have found the real bottleneck in five minutes. Three weeks of work could have been avoided with five minutes of profiling.

This is the value of profiling.

Basic Profiling Concepts

Sampling vs Instrumentation

Profilers use two main approaches:

Sampling

Every N milliseconds, pause the program, record current location
     ↓
Count which functions appear most often
     ↓
More appearances = more execution time

Pros: Low overhead, doesn't affect program behavior Cons: Statistical approximation, may miss short functions

Instrumentation

Insert timing code at every function entry/exit
     ↓
Precisely record time for each call
     ↓
Completely accurate call counts and times

Pros: Precise Cons: High overhead, may change program behavior

Common Profilers

Tool	Type	Platform	Features
perf	Sampling	Linux	Most powerful, hardware PMU support
gprof	Instrumentation	Unix	Classic, but outdated
Valgrind	Simulation	Linux/macOS	Precise but very slow
VTune	Sampling	Linux/Windows	Intel official, GUI
Instruments	Sampling	macOS	Apple official

gprof: The Classic Instrumentation Profiler

gprof is a product of 1980s BSD Unix, still widely used in teaching. Understanding its workings and limitations helps appreciate modern tools.

Basic Usage

# 1. Compile with -pg flag
gcc -pg -O2 -o my_program my_program.c

# 2. Run program (generates gmon.out)
./my_program

# 3. Analyze results
gprof my_program gmon.out > analysis.txt

Reading gprof Output

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.23      1.23     1.23   100000     0.01     0.02  process_pixel
 23.45      1.87     0.64   100000     0.01     0.01  apply_filter
 12.34      2.21     0.34        1   340.00   450.00  main
  8.76      2.45     0.24  1000000     0.00     0.00  get_value

% time: Percentage of total time in this function
self seconds: Time in function itself (excluding callees)
calls: Number of calls
self ms/call: Time per call

gprof's Serious Limitations

1. Can only profile instrumented code

# These won't appear in gprof reports:
# - System call time
# - Dynamic libraries (unless also compiled with -pg)
# - I/O wait time

2. Instrumentation changes performance characteristics

3. Sampling resolution too low — 10ms intervals

4. Poor multi-threading support

5. Can't see inlined functions

gprof vs perf

Aspect	gprof	perf
Mechanism	Instrumentation	Sampling (PMU)
Overhead	High (10-30%)	Low (<5%)
System calls	Invisible	Visible
Shared libraries	Need recompile	Direct support
Multi-threading	Poor	Good

Conclusion: Unless there's a special reason, prefer perf on modern Linux.

perf: Linux Performance Analysis Powerhouse

perf is a performance analysis tool built into the Linux kernel, extremely powerful.

Basic Usage

# Record performance data
perf record ./my_program

# View report

### perf record + perf report

```bash
# Record (-g means record call stack)
perf record -g ./my_program

# Interactive report
perf report

Overhead  Command      Shared Object        Symbol
  45.23%  my_program   my_program           [.] process_pixel
  23.45%  my_program   libc.so.6            [.] memcpy
  12.34%  my_program   my_program           [.] read_file
   8.76%  my_program   libm.so.6            [.] exp
   ...

Now you know process_pixel takes 45% of the time. That's where to optimize.

Common perf Events

# CPU cycles
perf stat -e cycles,instructions ./program

# Cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Branch prediction
perf stat -e branch-misses,branches ./program

# List all available events
perf list

perf Advanced: annotate

# See source-level hotspots
perf annotate process_pixel

       │    void process_pixel(int* data, int n) {
       │        for (int i = 0; i < n; i++) {
 45.23 │            data[i] = expensive_calc(data[i]);
       │        }
       │    }

Now you know the expensive_calc line is slowest.

Flame Graphs: Visualizing Call Stacks

Flame Graph is a visualization method invented by Brendan Gregg that lets you see program hotspots at a glance.

What Is a Flame Graph

┌─────────────────────────────────────────────────────────────┐
│                          main                                │
├─────────────────────────────┬───────────────────────────────┤
│       process_image         │          read_file            │
├─────────────┬───────────────┤                               │
│ process_row │  apply_filter │                               │
├─────────────┴───────────────┤                               │
│        process_pixel        │                               │
└─────────────────────────────┴───────────────────────────────┘

Width = time proportion
Height = call stack depth

X-axis: Not time order—alphabetically sorted function names
Y-axis: Call stack depth
Width: CPU time proportion of function (including children)

Generating Flame Graphs

# 1. Record with perf
perf record -g ./my_program

# 2. Generate readable stack trace
perf script > out.perf

# 3. Convert with FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# 4. Open SVG in browser
firefox flamegraph.svg

Reading Flame Graphs

Find the widest "plateaus" — These are where most time is spent
Read bottom to top — Bottom is caller, top is callee
Click to zoom in — SVG is interactive

Common Patterns

Pattern 1: Single Hotspot

┌──────────────────────────────────────────┐
│                  main                     │
├──────────────────────────────────────────┤
│              hot_function (90%)           │
└──────────────────────────────────────────┘

→ Optimize hot_function

Pattern 2: Wide Base

┌──────────────────────────────────────────┐
│                  main                     │
├────┬────┬────┬────┬────┬────┬────┬───────┤
│ f1 │ f2 │ f3 │ f4 │ f5 │ f6 │ f7 │ ...   │
└────┴────┴────┴────┴────┴────┴────┴───────┘

→ No single hotspot; need overall optimization or algorithm change

Pattern 3: Deep Narrow Stack

┌───┐
│ a │
├───┤
│ b │
├───┤
│ c │
├───┤
│ d │ ← Actual work here
└───┘

→ Consider reducing call stack depth or inlining

Valgrind: Memory and Cache Analysis

Valgrind is an instrumentation framework. The most commonly used tools are Memcheck (memory error detection) and Cachegrind (cache analysis).

Cachegrind: Cache Behavior Analysis

valgrind --tool=cachegrind ./my_program

==12345== Cachegrind, a cache and branch-prediction profiler
==12345==
==12345== I   refs:      1,234,567,890
==12345== I1  misses:          123,456
==12345== LLi misses:           12,345
==12345== I1  miss rate:          0.01%
==12345==
==12345== D   refs:        456,789,012  (234,567,890 rd + 222,221,122 wr)
==12345== D1  misses:       12,345,678  (  8,765,432 rd +   3,580,246 wr)
==12345== LLd misses:        1,234,567  (    876,543 rd +     358,024 wr)
==12345== D1  miss rate:           2.7% (        3.7%   +         1.6%)

I refs: Instruction fetches
D refs: Data reads/writes
D1 misses: L1 data cache misses
LLd misses: Last-level cache misses (goes to DRAM)

Callgrind: Call Graph Analysis

valgrind --tool=callgrind ./my_program

# Visualize
kcachegrind callgrind.out.12345

KCachegrind provides interactive call graph visualization showing:

Inclusive/exclusive time per function
Call graph
Source code annotation

Valgrind Limitations

Very slow — Usually 10-50× slower
Not real execution — Runs on virtual CPU
No GPU support — Only analyzes CPU code

Practical Workflow

Step 1: Quick Look with perf stat

perf stat ./my_program

Check if IPC, cache misses, branch misses are abnormal.

Step 2: Find Hotspots with perf record

perf record -g ./my_program
perf report

Find the hottest functions.

Step 3: Visualize with Flame Graph

perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg

See the overall structure at a glance.

Step 4: Deep Analysis

Choose tools based on problem type:

Cache issues → Cachegrind or VTune Memory Access
Branch issues → perf branch-misses events
Multi-threading issues → VTune Threading

Step 5: Profile Again After Optimization

# Before
perf stat ./my_program_v1

# After
perf stat ./my_program_v2

# Compare

Confirm optimization was effective.

Common Pitfalls

Pitfall 1: Debug Build Overhead

# Bad: debug build
gcc -g -O0 program.c
perf record ./a.out  # Results don't represent production

# Good: release build with debug info
gcc -g -O3 program.c
perf record ./a.out  # Closer to real performance

Pitfall 2: Compiler Optimization Changes Behavior

Use -fno-omit-frame-pointer to preserve frame pointer for more accurate call stacks.

Pitfall 3: Short Execution Time Noise

# Bad: too short, too much noise
perf stat ./quick_program  # 0.01 seconds

# Good: run long enough
perf stat ./quick_program --iterations=10000  # 10 seconds

Back to That Three-Week Story

If I had done this:

$ perf record -g ./image_pipeline
$ perf report

Overhead  Symbol
  65.23%  read_file      ← This was the bottleneck!
  23.45%  write_file
   8.76%  process_pixel  ← Where I spent three weeks
   ...

Five minutes would have revealed I/O was the bottleneck. Three weeks of optimization could have been spent on the right thing.

Summary

Profiling is the first step in performance optimization:

Tool Selection

perf: Linux first choice, free and powerful
Flame Graph: Visualize call stacks
Valgrind/Cachegrind: Cache behavior analysis
VTune: Intel CPU deep analysis
Instruments: macOS development

Workflow

perf stat for quick overview
perf record + perf report to find hotspots
Flame Graph for visualization
Deep analysis (cache/branch/threading)
Profile again after optimization

Core Principles

Profile first, then optimize
Don't guess bottlenecks before profiling
Profile with release builds
Ensure execution time is long enough

Performance and Benchmarking