Appendix E: Exercises and Solutions
"The best way to learn is by doing." — Richard Feynman
This appendix provides hands-on exercises for each chapter with detailed problem descriptions, solution approaches, and key code snippets.
Part I: Foundations (Chapters 1-4)
Exercise 1.1: Array Traversal with Statistical Analysis
Difficulty: Easy | Language: C
Problem: Run a simple array traversal 100 times and calculate statistical metrics to understand measurement variability.
Objectives:
- Measure execution time using
clock_gettime(CLOCK_MONOTONIC) - Calculate mean, standard deviation, and 95% confidence interval
- Evaluate result stability using coefficient of variation (CV)
Key Code:
#include <time.h>
#include <math.h>
#define ARRAY_SIZE (1024 * 1024) // 1M elements
#define NUM_RUNS 100
static inline uint64_t get_time_ns(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
volatile int sink; // Prevent dead code elimination
void traverse_array(int *arr, size_t size) {
int sum = 0;
for (size_t i = 0; i < size; i++) {
sum += arr[i];
}
sink = sum;
}
Statistical Calculation:
// Mean
double mean = sum / NUM_RUNS;
// Standard deviation
double std_dev = sqrt(sq_diff_sum / (NUM_RUNS - 1));
// 95% CI (t-value ≈ 1.984 for df=99)
double margin = 1.984 * std_dev / sqrt(NUM_RUNS);
// Coefficient of variation
double cv = (std_dev / mean) * 100.0;
Expected Output:
Array size: 1048576 elements (4 MB)
Number of runs: 100
Results:
Mean: 0.892 ms
Std Dev: 0.045 ms
95% CI: [0.883, 0.901] ms
CV (variability): 5.04%
Interpretation:
- CV < 5%: Stable results
- CV 5-15%: Some fluctuation, acceptable
- CV > 15%: High variability, stabilize environment
Exercise 2.2: Timer Overhead Measurement
Difficulty: Easy | Language: C
Problem: Measure the overhead of clock_gettime by calling it consecutively 1000 times.
Objectives:
- Understand timer resolution limits
- Determine minimum measurable duration
- Analyze overhead distribution
Key Code:
#define NUM_SAMPLES 1000
// Measure timer overhead
for (int i = 0; i < NUM_SAMPLES; i++) {
uint64_t t1 = get_time_ns();
uint64_t t2 = get_time_ns();
overhead[i] = t2 - t1;
}
// Sort for percentiles
qsort(overhead, NUM_SAMPLES, sizeof(uint64_t), compare_uint64);
uint64_t p50 = overhead[NUM_SAMPLES / 2];
uint64_t p95 = overhead[(int)(NUM_SAMPLES * 0.95)];
Expected Output:
Statistics (nanoseconds):
Mean: 25.3 ns
Std Dev: 8.2 ns
Min: 18 ns
Max: 156 ns
Percentiles:
P50 (median): 23 ns
P95: 42 ns
P99: 78 ns
Practical Implication: For a 25ns timer overhead, measure operations of at least 2500ns (100x overhead) for <1% error.
Exercise 3.3: T-Test for Algorithm Comparison
Difficulty: Medium | Language: C
Problem: Use Welch's t-test to determine if two algorithm implementations have statistically significant performance differences.
Objectives:
- Collect benchmark samples for two algorithms
- Implement Welch's t-test (handles unequal variances)
- Interpret p-value for significance
Key Code:
// Algorithm A: Simple sum
void algo_a(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i];
}
sink = sum;
}
// Algorithm B: Unrolled sum (4x)
void algo_b(int *arr, int n) {
int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i <= n - 4; i += 4) {
sum0 += arr[i];
sum1 += arr[i + 1];
sum2 += arr[i + 2];
sum3 += arr[i + 3];
}
sink = sum0 + sum1 + sum2 + sum3;
}
Welch's T-Test:
// t-statistic
double se = sqrt(var_a / na + var_b / nb);
double t = (mean_a - mean_b) / se;
// Degrees of freedom (Welch-Satterthwaite)
double df = (v1 + v2) * (v1 + v2) /
(v1 * v1 / (na - 1) + v2 * v2 / (nb - 1));
Interpretation:
- p < 0.05: Statistically significant difference
- p >= 0.05: No significant difference (could be noise)
Exercise 4.2: Box Plot Visualization
Difficulty: Easy | Language: Python
Problem: Create box plots to visualize benchmark result distributions and identify outliers.
Key Code:
import matplotlib.pyplot as plt
import numpy as np
def create_boxplot(data_dict, title, ylabel):
fig, ax = plt.subplots(figsize=(10, 6))
labels = list(data_dict.keys())
data = [data_dict[k] for k in labels]
bp = ax.boxplot(data, labels=labels, patch_artist=True)
# Color boxes
colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightcoral']
for patch, color in zip(bp['boxes'], colors):
patch.set_facecolor(color)
ax.set_ylabel(ylabel)
ax.set_title(title)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('boxplot.png', dpi=150)
What Box Plots Show:
- Box: Q1 to Q3 (50% of data)
- Line in box: Median
- Whiskers: 1.5×IQR from box edges
- Points outside: Outliers
Part II: Benchmark Tools (Chapters 5-9)
Exercise 5.1: CoreMark Testing
Difficulty: Easy | Language: Shell
Problem: Build and run CoreMark with different compiler optimization levels.
Key Code:
#!/bin/bash
git clone https://github.com/eembc/coremark.git
cd coremark
for opt in O0 O1 O2 O3 Ofast; do
echo "=== Testing -$opt ==="
make clean
make PORT_DIR=linux XCFLAGS="-$opt"
./coremark.exe 2>&1 | grep -E "CoreMark|Iterations"
done
Expected Results:
| Optimization | CoreMark Score | Improvement |
|---|---|---|
| -O0 | ~5,000 | baseline |
| -O1 | ~15,000 | 3x |
| -O2 | ~22,000 | 4.4x |
| -O3 | ~24,000 | 4.8x |
| -Ofast | ~25,000 | 5x |
Exercise 6.1: STREAM Memory Bandwidth Analysis
Difficulty: Easy | Language: C
Problem: Measure memory bandwidth at different array sizes to observe cache hierarchy effects.
Key Code:
// Vary array size to hit different cache levels
size_t sizes[] = {
8 * 1024, // 8 KB - L1
64 * 1024, // 64 KB - L2
512 * 1024, // 512 KB - L3
8 * 1024 * 1024, // 8 MB - L3/DRAM
64 * 1024 * 1024 // 64 MB - DRAM
};
for (int s = 0; s < 5; s++) {
double *a = aligned_alloc(64, sizes[s]);
double *b = aligned_alloc(64, sizes[s]);
// STREAM Copy: b[i] = a[i]
uint64_t start = get_time_ns();
for (size_t i = 0; i < sizes[s] / sizeof(double); i++) {
b[i] = a[i];
}
uint64_t elapsed = get_time_ns() - start;
double bw = (2.0 * sizes[s]) / elapsed; // GB/s
printf("Size: %6zu KB, Bandwidth: %.2f GB/s\n",
sizes[s] / 1024, bw);
}
Exercise 9.2: CPU Cycle Counter
Difficulty: Medium | Language: C
Problem: Use architecture-specific cycle counters for precise timing.
Key Code (x86):
#include <x86intrin.h>
static inline uint64_t rdtsc(void) {
return __rdtsc();
}
// More precise: RDTSCP serializes
static inline uint64_t rdtscp(void) {
unsigned int aux;
return __rdtscp(&aux);
}
// Usage
uint64_t start = rdtscp();
// ... operation ...
uint64_t end = rdtscp();
uint64_t cycles = end - start;
Key Code (ARM):
static inline uint64_t read_cycles(void) {
uint64_t val;
asm volatile("mrs %0, cntvct_el0" : "=r"(val));
return val;
}
Part III: Analysis Theory (Chapters 10-12)
Exercise 10.1: Roofline Model
Difficulty: Medium | Language: Python
Problem: Generate a roofline model plot for your system.
Key Code:
import matplotlib.pyplot as plt
import numpy as np
# System parameters (measure these!)
peak_flops = 100 # GFLOPS
peak_bw = 50 # GB/s
# Calculate ridge point
ridge_point = peak_flops / peak_bw # FLOP/byte
# Operational intensity range
oi = np.logspace(-2, 2, 100)
# Roofline
memory_bound = peak_bw * oi
compute_bound = np.full_like(oi, peak_flops)
roofline = np.minimum(memory_bound, compute_bound)
plt.figure(figsize=(10, 6))
plt.loglog(oi, roofline, 'b-', linewidth=2, label='Roofline')
plt.axvline(x=ridge_point, color='r', linestyle='--',
label=f'Ridge Point ({ridge_point:.1f})')
plt.xlabel('Operational Intensity (FLOP/byte)')
plt.ylabel('Performance (GFLOPS)')
plt.title('Roofline Model')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roofline.png', dpi=150)
Exercise 12.3: Branch Prediction Experiment
Difficulty: Medium | Language: C
Problem: Compare sorted vs unsorted array processing to observe branch prediction effects.
Key Code:
#define ARRAY_SIZE (32 * 1024)
// Test with sorted and unsorted data
void process_array(int *arr, int n, int threshold) {
int count = 0;
for (int i = 0; i < n; i++) {
if (arr[i] < threshold) { // Branch!
count++;
}
}
sink = count;
}
int main(void) {
int *arr = malloc(ARRAY_SIZE * sizeof(int));
// Fill with random data
for (int i = 0; i < ARRAY_SIZE; i++) {
arr[i] = rand() % 256;
}
// Test unsorted
uint64_t start = rdtscp();
for (int iter = 0; iter < 1000; iter++) {
process_array(arr, ARRAY_SIZE, 128);
}
uint64_t unsorted_cycles = rdtscp() - start;
// Sort the array
qsort(arr, ARRAY_SIZE, sizeof(int), compare_int);
// Test sorted
start = rdtscp();
for (int iter = 0; iter < 1000; iter++) {
process_array(arr, ARRAY_SIZE, 128);
}
uint64_t sorted_cycles = rdtscp() - start;
printf("Unsorted: %lu cycles\n", unsorted_cycles);
printf("Sorted: %lu cycles\n", sorted_cycles);
printf("Speedup: %.2fx\n",
(double)unsorted_cycles / sorted_cycles);
}
Expected Result: Sorted array is 2-5x faster due to better branch prediction.
Part IV: Data Structures (Chapters 13-15)
Exercise 13.1: Array vs Linked List
Difficulty: Easy | Language: C
Problem: Compare sequential access performance of arrays vs linked lists.
Key Code:
// Array traversal
void sum_array(int *arr, int n) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i];
}
sink = sum;
}
// Linked list traversal
struct Node {
int value;
struct Node *next;
};
void sum_list(struct Node *head) {
int sum = 0;
while (head) {
sum += head->value;
head = head->next;
}
sink = sum;
}
Expected Result: Array is 5-20x faster due to cache-friendly sequential access.
Exercise 14.1: Hash Table vs Tree
Difficulty: Medium | Language: C++
Problem: Compare lookup performance of hash tables vs balanced trees.
Key Code:
#include <unordered_map>
#include <map>
void benchmark_hash(int n) {
std::unordered_map<int, int> hash;
for (int i = 0; i < n; i++) hash[i] = i;
auto start = now();
for (int i = 0; i < n; i++) {
sink = hash[rand() % n];
}
auto elapsed = now() - start;
}
void benchmark_tree(int n) {
std::map<int, int> tree;
for (int i = 0; i < n; i++) tree[i] = i;
auto start = now();
for (int i = 0; i < n; i++) {
sink = tree[rand() % n];
}
auto elapsed = now() - start;
}
Expected Result:
- Hash: O(1) average, ~50-100ns per lookup
- Tree: O(log n), ~200-500ns per lookup for n=1M
Exercise 15.1: Sorting Algorithm Comparison
Difficulty: Medium | Language: C++
Problem: Compare quicksort, mergesort, and heapsort performance.
Key Code:
#include <algorithm>
void benchmark_sort(std::vector<int>& data,
void (*sort_fn)(int*, int*)) {
auto start = now();
sort_fn(data.data(), data.data() + data.size());
auto elapsed = now() - start;
}
// Test with different data patterns:
// 1. Random
// 2. Nearly sorted
// 3. Reverse sorted
// 4. Many duplicates
Part V: Parallelization (Chapters 16-18)
Exercise 16.1: SIMD Vector Addition
Difficulty: Advanced | Language: C
Problem: Implement vector addition using SIMD intrinsics.
Key Code (AVX2):
#include <immintrin.h>
void vector_add_scalar(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
void vector_add_avx(float *a, float *b, float *c, int n) {
int i;
for (i = 0; i <= n - 8; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
// Handle remainder
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Expected Speedup: 4-8x for float, 8-16x for int8.
Exercise 17.2: False Sharing Detection
Difficulty: Advanced | Language: C
Problem: Demonstrate and fix false sharing in multi-threaded code.
Key Code:
// BAD: False sharing
struct Counter {
int count; // 4 bytes, multiple counters in same cache line
};
struct Counter counters[NUM_THREADS];
// GOOD: Padded to avoid false sharing
struct PaddedCounter {
int count;
char padding[60]; // Pad to 64-byte cache line
};
struct PaddedCounter padded_counters[NUM_THREADS];
void *thread_func(void *arg) {
int id = *(int*)arg;
for (int i = 0; i < ITERATIONS; i++) {
counters[id].count++; // False sharing!
}
return NULL;
}
Expected Result: Padded version 5-10x faster with multiple threads.
Part VI: Embedded Constraints (Chapters 19-22)
Exercise 19.1: Binary Footprint Analysis
Difficulty: Medium | Language: C/Shell
Problem: Analyze binary size and identify largest contributors.
Key Code:
#!/bin/bash
# Compile with different options
gcc -Os -o prog_Os prog.c
gcc -O2 -o prog_O2 prog.c
gcc -O3 -o prog_O3 prog.c
# Size comparison
size prog_Os prog_O2 prog_O3
# Symbol size analysis
nm --size-sort -S prog_Os | tail -20
# Section breakdown
objdump -h prog_Os | grep -E "\.text|\.data|\.bss|\.rodata"
Exercise 21.1: Stack Usage Analysis
Difficulty: Medium | Language: C
Problem: Measure and analyze function stack usage.
Key Code:
# Compile with stack usage info
gcc -fstack-usage -O2 -c program.c
# Output: program.su file
# Format: function_name:line:col:size qualifier
// GCC extension for runtime stack check
void check_stack_usage(void) {
void *sp;
asm volatile("mov %%rsp, %0" : "=r"(sp));
printf("Current SP: %p\n", sp);
}
Part VII: AI/HPC (Chapters 23-29)
Exercise 26.1: CUDA Vector Addition
Difficulty: Advanced | Language: CUDA
Problem: Implement parallel vector addition on GPU.
Key Code:
__global__ void vector_add(float *a, float *b, float *c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main(void) {
int n = 1 << 20; // 1M elements
size_t bytes = n * sizeof(float);
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Exercise 27.1: LLM Inference Benchmark
Difficulty: Advanced | Language: Python
Problem: Measure LLM inference throughput and latency.
Key Code:
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
def benchmark_inference(model, tokenizer, prompt, num_tokens=100):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Warm up
with torch.no_grad():
model.generate(**inputs, max_new_tokens=10)
# Benchmark
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=num_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
tokens_generated = outputs.shape[1] - inputs['input_ids'].shape[1]
throughput = tokens_generated / elapsed
print(f"Tokens: {tokens_generated}")
print(f"Time: {elapsed:.2f}s")
print(f"Throughput: {throughput:.1f} tokens/s")
Additional Exercises (No Solutions Provided)
The following exercises are designed for independent exploration. They are intentionally open-ended to encourage deeper investigation.
Part I: Foundations
Exercise 1.2: Environment Variability Study
Run the same benchmark on your system with different background conditions:
- During a virus scan or system backup
- With browser tabs open vs closed
- On battery vs plugged in (laptop)
- After fresh boot vs after hours of use
Document how much measurement variability each condition introduces. What's your system's "noise floor"?
Exercise 2.3: Timer Comparison
Compare the overhead and resolution of different timing methods on your system:
clock_gettime(CLOCK_MONOTONIC)clock_gettime(CLOCK_MONOTONIC_RAW)gettimeofday()rdtsc/rdtscp(x86) orcntvct_el0(ARM)
Which one is best for sub-microsecond measurements? Which is most portable?
Exercise 4.3: Outlier Investigation
Take a benchmark that produces outliers. Instead of removing them, investigate:
- When do outliers occur? (First run? Every N runs?)
- What system events correlate with outliers?
- Can you predict when outliers will happen?
Part II: Benchmark Tools
Exercise 7.1: Geekbench Deep Dive
Run Geekbench 6 on your system and analyze:
- Which sub-benchmarks score highest/lowest relative to the reference?
- How does your single-core to multi-core scaling compare to the reference?
- Run it 5 times - what's the CV of each sub-benchmark?
Exercise 8.1: SPEC CPU Proxy
Without access to SPEC CPU, create a "proxy benchmark suite" using open-source alternatives:
- Find open-source equivalents for 5 SPEC workloads
- Measure correlation with published SPEC scores (if available)
- Document limitations of your proxy approach
Exercise 9.3: Custom Microbenchmark
Design a microbenchmark to measure a specific CPU feature:
- L1 cache latency (not bandwidth)
- TLB miss penalty
- Memory prefetcher effectiveness
The benchmark must isolate the feature from other effects.
Part III: Analysis Theory
Exercise 10.2: Your Application's Roofline Position
Take a computationally intensive application you work with:
- Calculate its theoretical operational intensity
- Measure actual achieved FLOPS
- Plot it on a roofline - is it memory or compute bound?
- What optimization would help most?
Exercise 11.1: Amdahl's Law Reality Check
Profile a real application and identify:
- The serial fraction (code that can't be parallelized)
- Calculate theoretical speedup with 4, 8, 16 cores
- Measure actual speedup
- Explain the gap between theory and practice
Exercise 12.4: Prefetch Pattern Discovery
Experiment with different memory access patterns:
- Sequential forward
- Sequential backward
- Stride-2, Stride-4, Stride-8, etc.
- Random
At what stride does the hardware prefetcher stop helping? How does this vary by CPU generation?
Part IV: Data Structures
Exercise 13.2: Cache-Oblivious vs Cache-Aware
Implement matrix transpose two ways:
- Naive (row-by-row)
- Cache-oblivious (recursive)
- Cache-aware (blocked, tuned for your L1)
Compare performance at different matrix sizes. At what size does blocking matter?
Exercise 14.2: Real-World Hash Table Benchmarking
Compare hash table implementations with realistic workloads:
- String keys of varying length (5-100 chars)
- Mixed read/write (90/10, 50/50, 10/90)
- With and without deletions
- Measure memory overhead, not just speed
Exercise 15.2: Sorting Stability Under Pressure
Benchmark sorting algorithms with adversarial inputs:
- Quicksort killer sequences
- Data that triggers worst-case for each algorithm
- Measure not just average case, but worst case in practice
Part V: Parallelization
Exercise 16.2: Auto-vectorization Investigation
Take a loop that should vectorize:
- Compile with
-O3 -march=nativeand check if it vectorized - If not, identify what prevented vectorization
- Fix the issue without using intrinsics
- Measure the speedup
Exercise 17.3: Lock-Free vs Locking
Implement a concurrent counter three ways:
- Mutex-protected
- Spinlock
- Atomic (lock-free)
Measure throughput at different thread counts (1, 2, 4, 8, 16, 32). At what point does each approach break down?
Exercise 18.1: OpenMP Scheduling Experiment
Take a parallel loop with varying work per iteration. Compare:
- static scheduling
- dynamic scheduling (chunk=1, 10, 100)
- guided scheduling
How does optimal scheduling depend on work distribution and iteration count?
Part VI: Embedded Constraints
Exercise 20.1: Power State Characterization
If you have access to an embedded board with power measurement:
- Measure power in different sleep states
- Measure wake-up latency from each state
- Calculate the break-even time for each state
- Design a sleep strategy for a periodic workload
Exercise 21.2: Stack High-Water Mark
Implement a stack painting technique:
- Fill stack with a known pattern at startup
- Run your application under various loads
- Check how much of the pattern was overwritten
- Report peak stack usage
Exercise 22.1: Memory-Constrained Algorithm Design
Take an algorithm that requires O(n) auxiliary space. Modify it to run in O(1) space:
- What's the time penalty?
- Is there a time-space trade-off curve?
- At what memory budget does the in-place version become faster?
Part VII: AI/HPC
Exercise 23.1: Metrics That Matter
Profile an AI inference workload and report:
- FLOPS achieved vs theoretical peak
- Memory bandwidth achieved vs theoretical peak
- Which is the bottleneck?
- What metric best predicts user-perceived performance?
Exercise 24.1: MLPerf Result Analysis
Download MLPerf results for a specific benchmark:
- Plot performance vs power across submissions
- Identify the Pareto frontier
- What hardware characteristics predict placement on the frontier?
Exercise 25.1: HPCG vs HPL Correlation
Using published data from the TOP500 and HPCG rankings:
- Plot HPCG efficiency vs HPL efficiency for the same systems
- What's the correlation?
- Can you identify systems that are outliers in one benchmark but not the other?
Exercise 28.1: ML Compiler Comparison
Take a model (e.g., ResNet-50) and compile it with:
- PyTorch (eager mode)
- TorchScript
- ONNX Runtime
- TensorRT (if NVIDIA GPU available)
- TVM with auto-tuning
Report latency, throughput, and compilation time. Which is best for your use case?
Exercise 29.1: Quantization Impact Study
Take a model and quantize to INT8:
- Measure accuracy drop on your validation set
- Measure speedup on your target hardware
- Try quantization-aware training - does it help?
- Find the accuracy-speed Pareto frontier
Challenge Problems
These are open-ended research-level problems:
Challenge 1: The Perfect Benchmark
Design a benchmark that:
- Runs in under 10 seconds
- Has <1% CV on commodity hardware
- Correlates with "real application" performance (define this)
- Is resistant to gaming/optimization
Document your design decisions and trade-offs.
Challenge 2: Automatic Bottleneck Detection
Build a tool that:
- Takes any program as input
- Profiles it automatically
- Identifies the top 3 performance bottlenecks
- Suggests specific optimizations
Test it on 5 different programs. How often is it right?
Challenge 3: Performance Regression Detection
Design a CI/CD performance testing system that:
- Detects 2% regressions reliably
- Minimizes false positives
- Runs in under 5 minutes
- Works on noisy cloud VMs
What's the minimum number of runs needed? What statistical tests work best?
Summary
Exercise Difficulty Guide
| Level | Exercises | Prerequisites |
|---|---|---|
| Easy | 1.1, 2.2, 4.2, 5.1, 6.1, 13.1 | Basic C, Python |
| Medium | 3.3, 7.1-9.2, 10.1-15.1, 18-22, 29 | Stats, Linux tools |
| Advanced | 16.1, 17.2, 26.1-28.1, 30-34 | SIMD, CUDA, ML |
| Open-ended | Additional exercises | Varies |
Key Takeaways
- Always measure: Never assume performance characteristics
- Statistics matter: Single measurements are meaningless
- Understand variance: CV tells you if results are stable
- Hardware awareness: Cache, branch prediction, memory matter
- Reproducibility: Document environment, automate tests