Appendix E: Exercises and Solutions

"The best way to learn is by doing." — Richard Feynman

This appendix provides hands-on exercises for each chapter with detailed problem descriptions, solution approaches, and key code snippets.

Part I: Foundations (Chapters 1-4)

Exercise 1.1: Array Traversal with Statistical Analysis

Difficulty: Easy | Language: C

Problem: Run a simple array traversal 100 times and calculate statistical metrics to understand measurement variability.

Objectives:

Measure execution time using clock_gettime(CLOCK_MONOTONIC)
Calculate mean, standard deviation, and 95% confidence interval
Evaluate result stability using coefficient of variation (CV)

Key Code:

#include <time.h>
#include <math.h>

#define ARRAY_SIZE (1024 * 1024)  // 1M elements
#define NUM_RUNS 100

static inline uint64_t get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

volatile int sink;  // Prevent dead code elimination

void traverse_array(int *arr, size_t size) {
    int sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum += arr[i];
    }
    sink = sum;
}

Statistical Calculation:

// Mean
double mean = sum / NUM_RUNS;

// Standard deviation
double std_dev = sqrt(sq_diff_sum / (NUM_RUNS - 1));

// 95% CI (t-value ≈ 1.984 for df=99)
double margin = 1.984 * std_dev / sqrt(NUM_RUNS);

// Coefficient of variation
double cv = (std_dev / mean) * 100.0;

Expected Output:

Array size: 1048576 elements (4 MB)
Number of runs: 100

Results:
  Mean:              0.892 ms
  Std Dev:           0.045 ms
  95% CI:            [0.883, 0.901] ms
  CV (variability):  5.04%

Interpretation:

CV < 5%: Stable results
CV 5-15%: Some fluctuation, acceptable
CV > 15%: High variability, stabilize environment

Exercise 2.2: Timer Overhead Measurement

Difficulty: Easy | Language: C

Problem: Measure the overhead of clock_gettime by calling it consecutively 1000 times.

Objectives:

Understand timer resolution limits
Determine minimum measurable duration
Analyze overhead distribution

Key Code:

#define NUM_SAMPLES 1000

// Measure timer overhead
for (int i = 0; i < NUM_SAMPLES; i++) {
    uint64_t t1 = get_time_ns();
    uint64_t t2 = get_time_ns();
    overhead[i] = t2 - t1;
}

// Sort for percentiles
qsort(overhead, NUM_SAMPLES, sizeof(uint64_t), compare_uint64);
uint64_t p50 = overhead[NUM_SAMPLES / 2];
uint64_t p95 = overhead[(int)(NUM_SAMPLES * 0.95)];

Expected Output:

Statistics (nanoseconds):
  Mean:    25.3 ns
  Std Dev: 8.2 ns
  Min:     18 ns
  Max:     156 ns

Percentiles:
  P50 (median): 23 ns
  P95:          42 ns
  P99:          78 ns

Practical Implication: For a 25ns timer overhead, measure operations of at least 2500ns (100x overhead) for <1% error.

Exercise 3.3: T-Test for Algorithm Comparison

Difficulty: Medium | Language: C

Problem: Use Welch's t-test to determine if two algorithm implementations have statistically significant performance differences.

Objectives:

Collect benchmark samples for two algorithms
Implement Welch's t-test (handles unequal variances)
Interpret p-value for significance

Key Code:

// Algorithm A: Simple sum
void algo_a(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}

// Algorithm B: Unrolled sum (4x)
void algo_b(int *arr, int n) {
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i <= n - 4; i += 4) {
        sum0 += arr[i];
        sum1 += arr[i + 1];
        sum2 += arr[i + 2];
        sum3 += arr[i + 3];
    }
    sink = sum0 + sum1 + sum2 + sum3;
}

Welch's T-Test:

// t-statistic
double se = sqrt(var_a / na + var_b / nb);
double t = (mean_a - mean_b) / se;

// Degrees of freedom (Welch-Satterthwaite)
double df = (v1 + v2) * (v1 + v2) /
            (v1 * v1 / (na - 1) + v2 * v2 / (nb - 1));

Interpretation:

p < 0.05: Statistically significant difference
p >= 0.05: No significant difference (could be noise)

Exercise 4.2: Box Plot Visualization

Difficulty: Easy | Language: Python

Problem: Create box plots to visualize benchmark result distributions and identify outliers.

Key Code:

import matplotlib.pyplot as plt
import numpy as np

def create_boxplot(data_dict, title, ylabel):
    fig, ax = plt.subplots(figsize=(10, 6))

    labels = list(data_dict.keys())
    data = [data_dict[k] for k in labels]

    bp = ax.boxplot(data, labels=labels, patch_artist=True)

    # Color boxes
    colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightcoral']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)

    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('boxplot.png', dpi=150)

What Box Plots Show:

Box: Q1 to Q3 (50% of data)
Line in box: Median
Whiskers: 1.5×IQR from box edges
Points outside: Outliers

Part II: Benchmark Tools (Chapters 5-9)

Exercise 5.1: CoreMark Testing

Difficulty: Easy | Language: Shell

Problem: Build and run CoreMark with different compiler optimization levels.

Key Code:

#!/bin/bash

git clone https://github.com/eembc/coremark.git
cd coremark

for opt in O0 O1 O2 O3 Ofast; do
    echo "=== Testing -$opt ==="
    make clean
    make PORT_DIR=linux XCFLAGS="-$opt"
    ./coremark.exe 2>&1 | grep -E "CoreMark|Iterations"
done

Expected Results:

Optimization	CoreMark Score	Improvement
-O0	~5,000	baseline
-O1	~15,000	3x
-O2	~22,000	4.4x
-O3	~24,000	4.8x
-Ofast	~25,000	5x

Exercise 6.1: STREAM Memory Bandwidth Analysis

Difficulty: Easy | Language: C

Problem: Measure memory bandwidth at different array sizes to observe cache hierarchy effects.

Key Code:

// Vary array size to hit different cache levels
size_t sizes[] = {
    8 * 1024,        // 8 KB  - L1
    64 * 1024,       // 64 KB - L2
    512 * 1024,      // 512 KB - L3
    8 * 1024 * 1024, // 8 MB  - L3/DRAM
    64 * 1024 * 1024 // 64 MB - DRAM
};

for (int s = 0; s < 5; s++) {
    double *a = aligned_alloc(64, sizes[s]);
    double *b = aligned_alloc(64, sizes[s]);

    // STREAM Copy: b[i] = a[i]
    uint64_t start = get_time_ns();
    for (size_t i = 0; i < sizes[s] / sizeof(double); i++) {
        b[i] = a[i];
    }
    uint64_t elapsed = get_time_ns() - start;

    double bw = (2.0 * sizes[s]) / elapsed;  // GB/s
    printf("Size: %6zu KB, Bandwidth: %.2f GB/s\n",
           sizes[s] / 1024, bw);
}

Exercise 9.2: CPU Cycle Counter

Difficulty: Medium | Language: C

Problem: Use architecture-specific cycle counters for precise timing.

Key Code (x86):

#include <x86intrin.h>

static inline uint64_t rdtsc(void) {
    return __rdtsc();
}

// More precise: RDTSCP serializes
static inline uint64_t rdtscp(void) {
    unsigned int aux;
    return __rdtscp(&aux);
}

// Usage
uint64_t start = rdtscp();
// ... operation ...
uint64_t end = rdtscp();
uint64_t cycles = end - start;

Key Code (ARM):

static inline uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
}

Part III: Analysis Theory (Chapters 10-12)

Exercise 10.1: Roofline Model

Difficulty: Medium | Language: Python

Problem: Generate a roofline model plot for your system.

Key Code:

import matplotlib.pyplot as plt
import numpy as np

# System parameters (measure these!)
peak_flops = 100  # GFLOPS
peak_bw = 50      # GB/s

# Calculate ridge point
ridge_point = peak_flops / peak_bw  # FLOP/byte

# Operational intensity range
oi = np.logspace(-2, 2, 100)

# Roofline
memory_bound = peak_bw * oi
compute_bound = np.full_like(oi, peak_flops)
roofline = np.minimum(memory_bound, compute_bound)

plt.figure(figsize=(10, 6))
plt.loglog(oi, roofline, 'b-', linewidth=2, label='Roofline')
plt.axvline(x=ridge_point, color='r', linestyle='--',
            label=f'Ridge Point ({ridge_point:.1f})')

plt.xlabel('Operational Intensity (FLOP/byte)')
plt.ylabel('Performance (GFLOPS)')
plt.title('Roofline Model')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roofline.png', dpi=150)

Exercise 12.3: Branch Prediction Experiment

Difficulty: Medium | Language: C

Problem: Compare sorted vs unsorted array processing to observe branch prediction effects.

Key Code:

#define ARRAY_SIZE (32 * 1024)

// Test with sorted and unsorted data
void process_array(int *arr, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] < threshold) {  // Branch!
            count++;
        }
    }
    sink = count;
}

int main(void) {
    int *arr = malloc(ARRAY_SIZE * sizeof(int));

    // Fill with random data
    for (int i = 0; i < ARRAY_SIZE; i++) {
        arr[i] = rand() % 256;
    }

    // Test unsorted
    uint64_t start = rdtscp();
    for (int iter = 0; iter < 1000; iter++) {
        process_array(arr, ARRAY_SIZE, 128);
    }
    uint64_t unsorted_cycles = rdtscp() - start;

    // Sort the array
    qsort(arr, ARRAY_SIZE, sizeof(int), compare_int);

    // Test sorted
    start = rdtscp();
    for (int iter = 0; iter < 1000; iter++) {
        process_array(arr, ARRAY_SIZE, 128);
    }
    uint64_t sorted_cycles = rdtscp() - start;

    printf("Unsorted: %lu cycles\n", unsorted_cycles);
    printf("Sorted:   %lu cycles\n", sorted_cycles);
    printf("Speedup:  %.2fx\n",
           (double)unsorted_cycles / sorted_cycles);
}

Expected Result: Sorted array is 2-5x faster due to better branch prediction.

Part IV: Data Structures (Chapters 13-15)

Exercise 13.1: Array vs Linked List

Difficulty: Easy | Language: C

Problem: Compare sequential access performance of arrays vs linked lists.

Key Code:

// Array traversal
void sum_array(int *arr, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}

// Linked list traversal
struct Node {
    int value;
    struct Node *next;
};

void sum_list(struct Node *head) {
    int sum = 0;
    while (head) {
        sum += head->value;
        head = head->next;
    }
    sink = sum;
}

Expected Result: Array is 5-20x faster due to cache-friendly sequential access.

Exercise 14.1: Hash Table vs Tree

Difficulty: Medium | Language: C++

Problem: Compare lookup performance of hash tables vs balanced trees.

Key Code:

#include <unordered_map>
#include <map>

void benchmark_hash(int n) {
    std::unordered_map<int, int> hash;
    for (int i = 0; i < n; i++) hash[i] = i;

    auto start = now();
    for (int i = 0; i < n; i++) {
        sink = hash[rand() % n];
    }
    auto elapsed = now() - start;
}

void benchmark_tree(int n) {
    std::map<int, int> tree;
    for (int i = 0; i < n; i++) tree[i] = i;

    auto start = now();
    for (int i = 0; i < n; i++) {
        sink = tree[rand() % n];
    }
    auto elapsed = now() - start;
}

Expected Result:

Hash: O(1) average, ~50-100ns per lookup
Tree: O(log n), ~200-500ns per lookup for n=1M

Exercise 15.1: Sorting Algorithm Comparison

Difficulty: Medium | Language: C++

Problem: Compare quicksort, mergesort, and heapsort performance.

Key Code:

#include <algorithm>

void benchmark_sort(std::vector<int>& data,
                    void (*sort_fn)(int*, int*)) {
    auto start = now();
    sort_fn(data.data(), data.data() + data.size());
    auto elapsed = now() - start;
}

// Test with different data patterns:
// 1. Random
// 2. Nearly sorted
// 3. Reverse sorted
// 4. Many duplicates

Part V: Parallelization (Chapters 16-18)

Exercise 16.1: SIMD Vector Addition

Difficulty: Advanced | Language: C

Problem: Implement vector addition using SIMD intrinsics.

Key Code (AVX2):

#include <immintrin.h>

void vector_add_scalar(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void vector_add_avx(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Handle remainder
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Expected Speedup: 4-8x for float, 8-16x for int8.

Difficulty: Advanced | Language: C

Problem: Demonstrate and fix false sharing in multi-threaded code.

Key Code:

// BAD: False sharing
struct Counter {
    int count;  // 4 bytes, multiple counters in same cache line
};

struct Counter counters[NUM_THREADS];

// GOOD: Padded to avoid false sharing
struct PaddedCounter {
    int count;
    char padding[60];  // Pad to 64-byte cache line
};

struct PaddedCounter padded_counters[NUM_THREADS];

void *thread_func(void *arg) {
    int id = *(int*)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        counters[id].count++;  // False sharing!
    }
    return NULL;
}

Expected Result: Padded version 5-10x faster with multiple threads.

Part VI: Embedded Constraints (Chapters 19-22)

Exercise 19.1: Binary Footprint Analysis

Difficulty: Medium | Language: C/Shell

Problem: Analyze binary size and identify largest contributors.

Key Code:

#!/bin/bash

# Compile with different options
gcc -Os -o prog_Os prog.c
gcc -O2 -o prog_O2 prog.c
gcc -O3 -o prog_O3 prog.c

# Size comparison
size prog_Os prog_O2 prog_O3

# Symbol size analysis
nm --size-sort -S prog_Os | tail -20

# Section breakdown
objdump -h prog_Os | grep -E "\.text|\.data|\.bss|\.rodata"

Exercise 21.1: Stack Usage Analysis

Difficulty: Medium | Language: C

Problem: Measure and analyze function stack usage.

Key Code:

# Compile with stack usage info
gcc -fstack-usage -O2 -c program.c

# Output: program.su file
# Format: function_name:line:col:size qualifier

// GCC extension for runtime stack check
void check_stack_usage(void) {
    void *sp;
    asm volatile("mov %%rsp, %0" : "=r"(sp));
    printf("Current SP: %p\n", sp);
}

Part VII: AI/HPC (Chapters 23-29)

Exercise 26.1: CUDA Vector Addition

Difficulty: Advanced | Language: CUDA

Problem: Implement parallel vector addition on GPU.

Key Code:

__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main(void) {
    int n = 1 << 20;  // 1M elements
    size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Exercise 27.1: LLM Inference Benchmark

Difficulty: Advanced | Language: Python

Problem: Measure LLM inference throughput and latency.

Key Code:

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark_inference(model, tokenizer, prompt, num_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm up
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=10)

    # Benchmark
    torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=num_tokens)

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    tokens_generated = outputs.shape[1] - inputs['input_ids'].shape[1]
    throughput = tokens_generated / elapsed

    print(f"Tokens: {tokens_generated}")
    print(f"Time: {elapsed:.2f}s")
    print(f"Throughput: {throughput:.1f} tokens/s")

Additional Exercises (No Solutions Provided)

The following exercises are designed for independent exploration. They are intentionally open-ended to encourage deeper investigation.

Part I: Foundations

Exercise 1.2: Environment Variability Study

Run the same benchmark on your system with different background conditions:

During a virus scan or system backup
With browser tabs open vs closed
On battery vs plugged in (laptop)
After fresh boot vs after hours of use

Document how much measurement variability each condition introduces. What's your system's "noise floor"?

Exercise 2.3: Timer Comparison

Compare the overhead and resolution of different timing methods on your system:

clock_gettime(CLOCK_MONOTONIC)
clock_gettime(CLOCK_MONOTONIC_RAW)
gettimeofday()
rdtsc/rdtscp (x86) or cntvct_el0 (ARM)

Which one is best for sub-microsecond measurements? Which is most portable?

Exercise 4.3: Outlier Investigation

Take a benchmark that produces outliers. Instead of removing them, investigate:

When do outliers occur? (First run? Every N runs?)
What system events correlate with outliers?
Can you predict when outliers will happen?

Part II: Benchmark Tools

Exercise 7.1: Geekbench Deep Dive

Run Geekbench 6 on your system and analyze:

Which sub-benchmarks score highest/lowest relative to the reference?
How does your single-core to multi-core scaling compare to the reference?
Run it 5 times - what's the CV of each sub-benchmark?

Exercise 8.1: SPEC CPU Proxy

Without access to SPEC CPU, create a "proxy benchmark suite" using open-source alternatives:

Find open-source equivalents for 5 SPEC workloads
Measure correlation with published SPEC scores (if available)
Document limitations of your proxy approach

Exercise 9.3: Custom Microbenchmark

Design a microbenchmark to measure a specific CPU feature:

L1 cache latency (not bandwidth)
TLB miss penalty
Memory prefetcher effectiveness

The benchmark must isolate the feature from other effects.

Part III: Analysis Theory

Exercise 10.2: Your Application's Roofline Position

Take a computationally intensive application you work with:

Calculate its theoretical operational intensity
Measure actual achieved FLOPS
Plot it on a roofline - is it memory or compute bound?
What optimization would help most?

Exercise 11.1: Amdahl's Law Reality Check

Profile a real application and identify:

The serial fraction (code that can't be parallelized)
Calculate theoretical speedup with 4, 8, 16 cores
Measure actual speedup
Explain the gap between theory and practice

Exercise 12.4: Prefetch Pattern Discovery

Experiment with different memory access patterns:

Sequential forward
Sequential backward
Stride-2, Stride-4, Stride-8, etc.
Random

At what stride does the hardware prefetcher stop helping? How does this vary by CPU generation?

Part IV: Data Structures

Exercise 13.2: Cache-Oblivious vs Cache-Aware

Implement matrix transpose two ways:

Naive (row-by-row)
Cache-oblivious (recursive)
Cache-aware (blocked, tuned for your L1)

Compare performance at different matrix sizes. At what size does blocking matter?

Exercise 14.2: Real-World Hash Table Benchmarking

Compare hash table implementations with realistic workloads:

String keys of varying length (5-100 chars)
Mixed read/write (90/10, 50/50, 10/90)
With and without deletions
Measure memory overhead, not just speed

Exercise 15.2: Sorting Stability Under Pressure

Benchmark sorting algorithms with adversarial inputs:

Quicksort killer sequences
Data that triggers worst-case for each algorithm
Measure not just average case, but worst case in practice

Part V: Parallelization

Exercise 16.2: Auto-vectorization Investigation

Take a loop that should vectorize:

Compile with -O3 -march=native and check if it vectorized
If not, identify what prevented vectorization
Fix the issue without using intrinsics
Measure the speedup

Exercise 17.3: Lock-Free vs Locking

Implement a concurrent counter three ways:

Mutex-protected
Spinlock
Atomic (lock-free)

Measure throughput at different thread counts (1, 2, 4, 8, 16, 32). At what point does each approach break down?

Exercise 18.1: OpenMP Scheduling Experiment

Take a parallel loop with varying work per iteration. Compare:

static scheduling
dynamic scheduling (chunk=1, 10, 100)
guided scheduling

How does optimal scheduling depend on work distribution and iteration count?

Part VI: Embedded Constraints

Exercise 20.1: Power State Characterization

If you have access to an embedded board with power measurement:

Measure power in different sleep states
Measure wake-up latency from each state
Calculate the break-even time for each state
Design a sleep strategy for a periodic workload

Exercise 21.2: Stack High-Water Mark

Implement a stack painting technique:

Fill stack with a known pattern at startup
Run your application under various loads
Check how much of the pattern was overwritten
Report peak stack usage

Exercise 22.1: Memory-Constrained Algorithm Design

Take an algorithm that requires O(n) auxiliary space. Modify it to run in O(1) space:

What's the time penalty?
Is there a time-space trade-off curve?
At what memory budget does the in-place version become faster?

Part VII: AI/HPC

Exercise 23.1: Metrics That Matter

Profile an AI inference workload and report:

FLOPS achieved vs theoretical peak
Memory bandwidth achieved vs theoretical peak
Which is the bottleneck?
What metric best predicts user-perceived performance?

Exercise 24.1: MLPerf Result Analysis

Download MLPerf results for a specific benchmark:

Plot performance vs power across submissions
Identify the Pareto frontier
What hardware characteristics predict placement on the frontier?

Exercise 25.1: HPCG vs HPL Correlation

Using published data from the TOP500 and HPCG rankings:

Plot HPCG efficiency vs HPL efficiency for the same systems
What's the correlation?
Can you identify systems that are outliers in one benchmark but not the other?

Exercise 28.1: ML Compiler Comparison

Take a model (e.g., ResNet-50) and compile it with:

PyTorch (eager mode)
TorchScript
ONNX Runtime
TensorRT (if NVIDIA GPU available)
TVM with auto-tuning

Report latency, throughput, and compilation time. Which is best for your use case?

Exercise 29.1: Quantization Impact Study

Take a model and quantize to INT8:

Measure accuracy drop on your validation set
Measure speedup on your target hardware
Try quantization-aware training - does it help?
Find the accuracy-speed Pareto frontier

Challenge Problems

These are open-ended research-level problems:

Challenge 1: The Perfect Benchmark

Design a benchmark that:

Runs in under 10 seconds
Has <1% CV on commodity hardware
Correlates with "real application" performance (define this)
Is resistant to gaming/optimization

Document your design decisions and trade-offs.

Challenge 2: Automatic Bottleneck Detection

Build a tool that:

Takes any program as input
Profiles it automatically
Identifies the top 3 performance bottlenecks
Suggests specific optimizations

Test it on 5 different programs. How often is it right?

Challenge 3: Performance Regression Detection

Design a CI/CD performance testing system that:

Detects 2% regressions reliably
Minimizes false positives
Runs in under 5 minutes
Works on noisy cloud VMs

What's the minimum number of runs needed? What statistical tests work best?

Summary

Exercise Difficulty Guide

Level	Exercises	Prerequisites
Easy	1.1, 2.2, 4.2, 5.1, 6.1, 13.1	Basic C, Python
Medium	3.3, 7.1-9.2, 10.1-15.1, 18-22, 29	Stats, Linux tools
Advanced	16.1, 17.2, 26.1-28.1, 30-34	SIMD, CUDA, ML
Open-ended	Additional exercises	Varies

Key Takeaways

Always measure: Never assume performance characteristics
Statistics matter: Single measurements are meaningless
Understand variance: CV tells you if results are stable
Hardware awareness: Cache, branch prediction, memory matter
Reproducibility: Document environment, automate tests

Performance and Benchmarking