Chapter 34: How to Optimize

Part IX: Synthesis

"Premature optimization is the root of all evil." — Donald Knuth

"But so is no optimization at all." — Anonymous engineer

The 3 AM Flame Graph

It was the night before a deadline.

Our API server suddenly slowed down—P99 latency spiked from 50ms to 500ms. The service was about to violate SLA.

The team lead looked at me and said: "You're the performance expert. Figure it out."

I opened the Flame Graph and stared at that "volcano" for five minutes.

One function took 47% of CPU time: json_parse().

"Found it," I said. "JSON parsing is the bottleneck."

"Great!" The lead relaxed. "Then optimize it."

I shook my head: "No. We shouldn't optimize it."

The First Rule of Optimization: Ask "Why" First

Before optimizing anything, always ask yourself three questions:

┌─────────────────────────────────────────────────────────────────┐
│                   Before Optimizing, Ask:                       │
├─────────────────────────────────────────────────────────────────┤
│ 1. Does this operation really need to exist?                    │
│    (The fastest code is the code that never runs)               │
│                                                                 │
│ 2. Can this operation be done fewer times?                      │
│    (Caching, batching, lazy evaluation)                         │
│                                                                 │
│ 3. Can this operation be done in a simpler way?                 │
│    (Simpler algorithm, different data structure)                │
└─────────────────────────────────────────────────────────────────┘

Back to the JSON parsing case:

Why was json_parse() so slow? Because we were parsing the same config file for every request.

The solution wasn't to optimize parsing—the solution was to cache the config, parse it only once.

# ❌ Before: parse for every request
def handle_request():
    config = json_parse(open("config.json").read())
    # ...

# ✅ After: parse once at startup
CONFIG = json_parse(open("config.json").read())

def handle_request():
    config = CONFIG
    # ...

This change reduced P99 latency from 500ms to 45ms. We didn't optimize any code—we just stopped doing unnecessary work.

The Optimization Workflow

After years of experience, I've summarized a systematic optimization process:

┌───────────────────────────────────────────────────────────────┐
│                   Optimization Workflow                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   1. Measure                                                  │
│      └── Establish baseline, quantify "how slow is it now"   │
│                                                               │
│   2. Profile                                                  │
│      └── Find bottleneck, understand "why is it slow"        │
│                                                               │
│   3. Analyze                                                  │
│      └── Understand root cause, ask "can we avoid it?"       │
│                                                               │
│   4. Hypothesize                                              │
│      └── Make prediction, estimate "how much faster"         │
│                                                               │
│   5. Implement                                                │
│      └── Implement the optimization                          │
│                                                               │
│   6. Measure Again                                            │
│      └── Verify effect, confirm "did it actually get faster" │
│                                                               │
│   7. Document                                                 │
│      └── Record changes, explain "why we did this"           │
│                                                               │
└───────────────────────────────────────────────────────────────┘

This isn't linear—you usually cycle through multiple times.

Step 1: Measure — Establish Baseline

Before optimizing, you must know "how slow is it now." Without a baseline, you can't judge if optimization worked.

Key Points for Recording Baseline

Baseline Report (Example)
=========================
Date: 2025-12-18
Commit: abc123

Metric              | Value      | Notes
--------------------|------------|------------------
P50 latency         | 45 ms      |
P99 latency         | 520 ms     | ← This is the problem
Throughput          | 1,200 req/s|
CPU usage           | 78%        |
Memory usage        | 2.1 GB     |

Important: Record the git commit hash. Later you'll need to compare "before vs after optimization," and you must ensure you're comparing specific versions.

Step 2: Profile — Find the Bottleneck

Profiling is the "microscope" for finding bottlenecks. Different types of bottlenecks need different tools.

Bottleneck Classification and Tool Selection

Bottleneck Type	Symptoms	Recommended Tools
CPU-bound	High CPU usage, high latency	perf, Flame Graph
Memory-bound	High cache miss, low IPC	perf stat, Cachegrind
I/O-bound	High CPU idle, high I/O wait	iostat, strace
Lock contention	Low CPU usage but high latency	perf lock, Off-CPU Flame Graph

Flame Graph Reading Tips

Step 3: Analyze — Understand Root Cause

After finding the hotspot, the next step is understanding "why is it slow."

Common Bottleneck Patterns

Pattern	Signs	Root Cause
Repeated computation	Same function called too many times	Missing caching
Inefficient algorithm	O(n²) slow on large data	Need better algorithm
Cache miss	High L3 miss rate	Data structure not cache-friendly
Branch misprediction	High branch-miss rate	Data-dependent branches
Lock contention	Multiple threads waiting for same lock	Lock granularity too coarse

Diagnostic perf Commands

# View overall performance counters
perf stat -e cycles,instructions,cache-misses,branch-misses ./program

# Example output interpretation
#   3,000,000,000 cycles
#   1,500,000,000 instructions  # IPC = 0.5 (low, possibly memory-bound)
#      50,000,000 cache-misses  # 3.3% miss rate (might be a problem)
#       5,000,000 branch-misses # 0.3% miss rate (normal)

IPC (Instructions Per Cycle) Meaning:

IPC	Possible Cause
< 0.5	Severely memory-bound, CPU waiting for data
0.5-1.0	Possibly cache miss or branch miss
1.0-2.0	Reasonably normal
> 2.0	Good, fully utilizing hardware

Step 4: Hypothesize — Predict the Effect

Before implementing optimization, predict "how much will this change improve performance."

This is important because:

Avoid wasting time: If prediction is only 1% improvement, might not be worth it
Verify understanding: If actual effect differs greatly from prediction, analysis was wrong
Amdahl's Law: Overall speedup is limited by the proportion of the optimized part

Amdahl's Law

                    1
Speedup = ────────────────────────
          (1 - p) + p/s

p = proportion of time spent in optimized part
s = speedup factor for that part

Example:

If json_parse() takes 40% of time (p = 0.4), and we make it 10x faster (s = 10):

Speedup = 1 / (0.6 + 0.4/10)
        = 1 / (0.6 + 0.04)
        = 1 / 0.64
        = 1.56x

Even if we make JSON parsing 10x faster, overall is only 56% faster. This is the harsh reality of Amdahl's Law.

Step 5: Implement — Optimization Strategies

Hierarchical Optimization Strategy

Optimization should proceed from "high level" to "low level." High-level optimizations usually have the most significant effects:

╔═══════════════════════════════════════════════════════════════╗
║                 Optimization Hierarchy                         ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Level 1: Architecture / Design                        10-100x║
║  ├─ Better algorithm (O(n²) → O(n log n))                     ║
║  ├─ Eliminate unnecessary operations (caching, lazy eval)     ║
║  └─ More suitable data structure                              ║
║                                                               ║
║  Level 2: Algorithm / Data Structure                    2-10x ║
║  ├─ Cache-friendly data layout                                ║
║  ├─ Reduce memory allocation                                  ║
║  └─ Batch processing                                          ║
║                                                               ║
║  Level 3: Implementation                                1-3x  ║
║  ├─ Avoid unnecessary copies                                  ║
║  ├─ Use faster library                                        ║
║  └─ Reduce branches                                           ║
║                                                               ║
║  Level 4: Low-level                                     1-2x  ║
║  ├─ SIMD vectorization                                        ║
║  ├─ Reduce cache miss                                         ║
║  └─ Compiler optimization options                             ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝

Common Optimization Techniques

Caching / Memoization

// ❌ Before: compute every time
int fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);  // O(2^n)
}

// ✅ After: remember computed results
int cache[100] = {0};
int fib(int n) {
    if (n <= 1) return n;
    if (cache[n] != 0) return cache[n];
    cache[n] = fib(n-1) + fib(n-2);
    return cache[n];  // O(n)
}

Batching

// ❌ Before: write one at a time
for (int i = 0; i < 1000; i++) {
    write_to_disk(data[i]);  // 1000 syscalls
}

// ✅ After: batch write
buffer_add(data, 1000);
write_to_disk(buffer);  // 1 syscall

Step 6: Measure Again — Verify the Effect

This is the most critical step. Don't assume "it should be faster"—prove it with data.

Comparison Report Example

Optimization: Cache config.json parsing
Commit: def456 (after) vs abc123 (before)

Metric              | Before     | After      | Change
--------------------|------------|------------|--------
P50 latency         | 45 ms      | 42 ms      | -6.7%
P99 latency         | 520 ms     | 48 ms      | -90.8% ✓
Throughput          | 1,200 req/s| 1,850 req/s| +54.2% ✓
CPU usage           | 78%        | 45%        | -42.3% ✓

If the effect is not as expected, go back to Step 3 and re-analyze. Your hypothesis might be wrong.

❌ "I feel this function is slow, let me optimize it first"
✅ "Profiler shows this function takes 30% CPU, worth optimizing"

2. Optimizing the Wrong Place

❌ Spend three days optimizing a function that only takes 2% of time
✅ Focus on the hotspot that takes 80% of time

3. Over-optimizing

❌ Write unmaintainable code to save 1% time
✅ Balance between "performance" and "maintainability"

4. Forgetting to Verify

❌ "I added cache, it should be faster" (no measurement)
✅ "After adding cache, P99 dropped from 520ms to 48ms" (with data)

Summary

The Optimization Workflow:

Measure: Establish baseline with concrete numbers
Profile: Find the bottleneck with proper tools
Analyze: Understand the root cause (ask "why?")
Hypothesize: Predict the improvement using Amdahl's Law
Implement: Start from architecture, then algorithm, then low-level
Measure Again: Verify with data, not assumptions
Document: Leave knowledge for future engineers

The Golden Rules:

"Don't guess. Measure."

"The fastest code is the code that never runs."

"Optimize the bottleneck, not the convenient thing."

Next chapter, we'll discuss how to automate these performance tests and integrate them into CI/CD pipelines.

Performance and Benchmarking