Chapter 34: How to Optimize
Part IX: Synthesis
"Premature optimization is the root of all evil." — Donald Knuth
"But so is no optimization at all." — Anonymous engineer
The 3 AM Flame Graph
It was the night before a deadline.
Our API server suddenly slowed down—P99 latency spiked from 50ms to 500ms. The service was about to violate SLA.
The team lead looked at me and said: "You're the performance expert. Figure it out."
I opened the Flame Graph and stared at that "volcano" for five minutes.
One function took 47% of CPU time: json_parse().
"Found it," I said. "JSON parsing is the bottleneck."
"Great!" The lead relaxed. "Then optimize it."
I shook my head: "No. We shouldn't optimize it."
The First Rule of Optimization: Ask "Why" First
Before optimizing anything, always ask yourself three questions:
┌─────────────────────────────────────────────────────────────────┐
│ Before Optimizing, Ask: │
├─────────────────────────────────────────────────────────────────┤
│ 1. Does this operation really need to exist? │
│ (The fastest code is the code that never runs) │
│ │
│ 2. Can this operation be done fewer times? │
│ (Caching, batching, lazy evaluation) │
│ │
│ 3. Can this operation be done in a simpler way? │
│ (Simpler algorithm, different data structure) │
└─────────────────────────────────────────────────────────────────┘
Back to the JSON parsing case:
Why was json_parse() so slow? Because we were parsing the same config file for every request.
The solution wasn't to optimize parsing—the solution was to cache the config, parse it only once.
# ❌ Before: parse for every request
def handle_request():
config = json_parse(open("config.json").read())
# ...
# ✅ After: parse once at startup
CONFIG = json_parse(open("config.json").read())
def handle_request():
config = CONFIG
# ...
This change reduced P99 latency from 500ms to 45ms. We didn't optimize any code—we just stopped doing unnecessary work.
The Optimization Workflow
After years of experience, I've summarized a systematic optimization process:
┌───────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├───────────────────────────────────────────────────────────────┤
│ │
│ 1. Measure │
│ └── Establish baseline, quantify "how slow is it now" │
│ │
│ 2. Profile │
│ └── Find bottleneck, understand "why is it slow" │
│ │
│ 3. Analyze │
│ └── Understand root cause, ask "can we avoid it?" │
│ │
│ 4. Hypothesize │
│ └── Make prediction, estimate "how much faster" │
│ │
│ 5. Implement │
│ └── Implement the optimization │
│ │
│ 6. Measure Again │
│ └── Verify effect, confirm "did it actually get faster" │
│ │
│ 7. Document │
│ └── Record changes, explain "why we did this" │
│ │
└───────────────────────────────────────────────────────────────┘
This isn't linear—you usually cycle through multiple times.
Step 1: Measure — Establish Baseline
Before optimizing, you must know "how slow is it now." Without a baseline, you can't judge if optimization worked.
Key Points for Recording Baseline
Baseline Report (Example)
=========================
Date: 2025-12-18
Commit: abc123
Metric | Value | Notes
--------------------|------------|------------------
P50 latency | 45 ms |
P99 latency | 520 ms | ← This is the problem
Throughput | 1,200 req/s|
CPU usage | 78% |
Memory usage | 2.1 GB |
Important: Record the git commit hash. Later you'll need to compare "before vs after optimization," and you must ensure you're comparing specific versions.
Step 2: Profile — Find the Bottleneck
Profiling is the "microscope" for finding bottlenecks. Different types of bottlenecks need different tools.
Bottleneck Classification and Tool Selection
| Bottleneck Type | Symptoms | Recommended Tools |
|---|---|---|
| CPU-bound | High CPU usage, high latency | perf, Flame Graph |
| Memory-bound | High cache miss, low IPC | perf stat, Cachegrind |
| I/O-bound | High CPU idle, high I/O wait | iostat, strace |
| Lock contention | Low CPU usage but high latency | perf lock, Off-CPU Flame Graph |
Flame Graph Reading Tips
Step 3: Analyze — Understand Root Cause
After finding the hotspot, the next step is understanding "why is it slow."
Common Bottleneck Patterns
| Pattern | Signs | Root Cause |
|---|---|---|
| Repeated computation | Same function called too many times | Missing caching |
| Inefficient algorithm | O(n²) slow on large data | Need better algorithm |
| Cache miss | High L3 miss rate | Data structure not cache-friendly |
| Branch misprediction | High branch-miss rate | Data-dependent branches |
| Lock contention | Multiple threads waiting for same lock | Lock granularity too coarse |
Diagnostic perf Commands
# View overall performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./program
# Example output interpretation
# 3,000,000,000 cycles
# 1,500,000,000 instructions # IPC = 0.5 (low, possibly memory-bound)
# 50,000,000 cache-misses # 3.3% miss rate (might be a problem)
# 5,000,000 branch-misses # 0.3% miss rate (normal)
IPC (Instructions Per Cycle) Meaning:
| IPC | Possible Cause |
|---|---|
| < 0.5 | Severely memory-bound, CPU waiting for data |
| 0.5-1.0 | Possibly cache miss or branch miss |
| 1.0-2.0 | Reasonably normal |
| > 2.0 | Good, fully utilizing hardware |
Step 4: Hypothesize — Predict the Effect
Before implementing optimization, predict "how much will this change improve performance."
This is important because:
- Avoid wasting time: If prediction is only 1% improvement, might not be worth it
- Verify understanding: If actual effect differs greatly from prediction, analysis was wrong
- Amdahl's Law: Overall speedup is limited by the proportion of the optimized part
Amdahl's Law
1
Speedup = ────────────────────────
(1 - p) + p/s
p = proportion of time spent in optimized part
s = speedup factor for that part
Example:
If json_parse() takes 40% of time (p = 0.4), and we make it 10x faster (s = 10):
Speedup = 1 / (0.6 + 0.4/10)
= 1 / (0.6 + 0.04)
= 1 / 0.64
= 1.56x
Even if we make JSON parsing 10x faster, overall is only 56% faster. This is the harsh reality of Amdahl's Law.
Step 5: Implement — Optimization Strategies
Hierarchical Optimization Strategy
Optimization should proceed from "high level" to "low level." High-level optimizations usually have the most significant effects:
╔═══════════════════════════════════════════════════════════════╗
║ Optimization Hierarchy ║
╠═══════════════════════════════════════════════════════════════╣
║ ║
║ Level 1: Architecture / Design 10-100x║
║ ├─ Better algorithm (O(n²) → O(n log n)) ║
║ ├─ Eliminate unnecessary operations (caching, lazy eval) ║
║ └─ More suitable data structure ║
║ ║
║ Level 2: Algorithm / Data Structure 2-10x ║
║ ├─ Cache-friendly data layout ║
║ ├─ Reduce memory allocation ║
║ └─ Batch processing ║
║ ║
║ Level 3: Implementation 1-3x ║
║ ├─ Avoid unnecessary copies ║
║ ├─ Use faster library ║
║ └─ Reduce branches ║
║ ║
║ Level 4: Low-level 1-2x ║
║ ├─ SIMD vectorization ║
║ ├─ Reduce cache miss ║
║ └─ Compiler optimization options ║
║ ║
╚═══════════════════════════════════════════════════════════════╝
Common Optimization Techniques
Caching / Memoization
// ❌ Before: compute every time
int fib(int n) {
if (n <= 1) return n;
return fib(n-1) + fib(n-2); // O(2^n)
}
// ✅ After: remember computed results
int cache[100] = {0};
int fib(int n) {
if (n <= 1) return n;
if (cache[n] != 0) return cache[n];
cache[n] = fib(n-1) + fib(n-2);
return cache[n]; // O(n)
}
Batching
// ❌ Before: write one at a time
for (int i = 0; i < 1000; i++) {
write_to_disk(data[i]); // 1000 syscalls
}
// ✅ After: batch write
buffer_add(data, 1000);
write_to_disk(buffer); // 1 syscall
Step 6: Measure Again — Verify the Effect
This is the most critical step. Don't assume "it should be faster"—prove it with data.
Comparison Report Example
Optimization: Cache config.json parsing
Commit: def456 (after) vs abc123 (before)
Metric | Before | After | Change
--------------------|------------|------------|--------
P50 latency | 45 ms | 42 ms | -6.7%
P99 latency | 520 ms | 48 ms | -90.8% ✓
Throughput | 1,200 req/s| 1,850 req/s| +54.2% ✓
CPU usage | 78% | 45% | -42.3% ✓
If the effect is not as expected, go back to Step 3 and re-analyze. Your hypothesis might be wrong.
Step 7: Document — Leave a Record
Optimization knowledge must be passed on. The next person (including yourself six months later) needs to understand this change.
The Anti-Patterns — Don't Do This
1. Optimizing Without Measuring
❌ "I feel this function is slow, let me optimize it first"
✅ "Profiler shows this function takes 30% CPU, worth optimizing"
2. Optimizing the Wrong Place
❌ Spend three days optimizing a function that only takes 2% of time
✅ Focus on the hotspot that takes 80% of time
3. Over-optimizing
❌ Write unmaintainable code to save 1% time
✅ Balance between "performance" and "maintainability"
4. Forgetting to Verify
❌ "I added cache, it should be faster" (no measurement)
✅ "After adding cache, P99 dropped from 520ms to 48ms" (with data)
Summary
The Optimization Workflow:
- Measure: Establish baseline with concrete numbers
- Profile: Find the bottleneck with proper tools
- Analyze: Understand the root cause (ask "why?")
- Hypothesize: Predict the improvement using Amdahl's Law
- Implement: Start from architecture, then algorithm, then low-level
- Measure Again: Verify with data, not assumptions
- Document: Leave knowledge for future engineers
The Golden Rules:
"Don't guess. Measure."
"The fastest code is the code that never runs."
"Optimize the bottleneck, not the convenient thing."
Next chapter, we'll discuss how to automate these performance tests and integrate them into CI/CD pipelines.