Appendix E: Exercises

This appendix provides hands-on exercises to reinforce the concepts covered throughout the book. Each exercise is designed to help you gain practical experience with hardware-aware data structure implementation and performance analysis.

Chapter 5: Linked Lists - The Cache Killer

Exercise 1: Benchmark Challenge

Objective: Compare array-based and linked list implementations of a stack.

Task:

Implement both array and linked list versions of a stack
Measure push/pop performance for 10,000 operations
Use the benchmark framework from Chapter 3
Measure both execution time and cache misses

Questions:

Which implementation is faster? By how much?
What is the cache miss rate for each?
How does performance change with different stack sizes (100, 1000, 10000 elements)?

Exercise 2: Memory Pool

Objective: Understand the impact of allocation overhead on linked list performance.

Task:

Implement a memory pool allocator for linked list nodes
Compare performance with malloc-based allocation
Measure allocation time and fragmentation

Questions:

How much faster is the memory pool?
What is the memory overhead of the pool?
How does pool size affect performance?

Exercise 3: Unrolled List

Objective: Explore cache-friendly variations of linked lists.

Task:

Implement an unrolled linked list with 16 elements per node
Benchmark against standard linked list and array
Measure cache behavior for sequential traversal

Questions:

How does the unrolled list compare to standard linked list?
What is the optimal number of elements per node?
When would you choose an unrolled list over an array?

Exercise 4: Cache Analysis

Objective: Analyze cache behavior at different data sizes.

Task:

Use perf to measure cache misses for array vs linked list traversal
Vary the data size from 1 KB to 1 MB
Plot cache miss rate vs data size

Questions:

At what size does the linked list become completely cache-hostile?
How does the cache miss rate change as data exceeds L1, L2, L3 cache sizes?
Can you identify the cache size thresholds from the data?

Exercise 5: Real-Time Analysis

Objective: Understand predictability requirements for real-time systems.

Task:

Measure worst-case execution time for linked list operations
Run 10,000 iterations and record min, max, median, P99
Compare variance between array and linked list

Questions:

How much variance do you see in linked list operations?
Is the variance acceptable for a 1 kHz control loop?
What causes the worst-case execution times?

Chapter 1: The Performance Gap

Exercise 1: Hash Table vs Binary Search

Objective: Reproduce the Chapter 1 experiment comparing hash tables and binary search.

Task:

Implement a hash table with 1024 buckets for 500 device configurations
Implement binary search on a sorted array for the same data
Measure cache misses and execution time for 10,000 lookups
Vary the number of entries (100, 500, 1000, 5000)

Questions:

At what size does the hash table become slower than binary search?
What is the cache miss rate for each implementation?
How does the hash table size (number of buckets) affect performance?

Exercise 2: Cache Miss Analysis

Objective: Understand the relationship between cache misses and execution time.

Task:

Write a simple array traversal program
Use perf to measure cache misses and cycles
Calculate the cost per cache miss
Compare with the theoretical 100-cycle penalty

Questions:

What is the actual cache miss penalty on your hardware?
How does it vary with L1, L2, and L3 cache misses?
Can you identify the cache sizes from the performance data?

Chapter 2: Memory Hierarchy

Exercise 1: Cache Line Size Detection

Objective: Experimentally determine your CPU’s cache line size.

Task:

Create an array and access elements with varying strides (1, 2, 4, 8, 16, 32, 64, 128 bytes)
Measure cache misses for each stride
Plot cache misses vs stride
Identify the cache line size from the inflection point

Questions:

What is your CPU’s cache line size?
How does performance change when stride equals cache line size?
What happens when stride is larger than cache line size?

Objective: Demonstrate the performance impact of false sharing.

Task:

Create a multi-threaded program where each thread updates a separate counter
Version 1: Pack counters tightly in an array
Version 2: Pad counters to separate cache lines
Measure throughput for both versions

Questions:

How much slower is the packed version?
How many cache line bounces occur in the packed version?
What is the optimal padding size?

Chapter 3: Benchmarking and Profiling

Exercise 1: Build a Microbenchmark Framework

Objective: Create a reusable benchmarking framework.

Task:

Implement high-precision timing using RDTSC or clock_gettime
Add statistical analysis (mean, median, stddev, percentiles)
Implement warmup runs and outlier detection
Add cache miss measurement using perf_event_open

Questions:

How many iterations are needed for stable results?
What is the overhead of your timing mechanism?
How do you detect and handle outliers?

Exercise 2: Profiling with perf

Objective: Master the perf profiling tool.

Task:

Write a program with an obvious performance bottleneck
Use perf record and perf report to find the hotspot
Use perf stat to measure cache misses, branch mispredictions
Use perf annotate to see assembly-level performance

Questions:

What percentage of time is spent in the hotspot?
What is the cache miss rate in the hotspot?
Can you identify the exact instruction causing cache misses?

Chapter 4: Arrays and Cache Locality

Exercise 1: Row-Major vs Column-Major

Objective: Measure the performance impact of access patterns.

Task:

Create a 1000×1000 matrix
Sum all elements using row-major order
Sum all elements using column-major order
Measure cache misses and execution time

Questions:

How much slower is column-major access?
What is the cache miss rate for each?
How does matrix size affect the performance gap?

Exercise 2: Structure of Arrays vs Array of Structures

Objective: Compare SoA and AoS layouts for cache efficiency.

Task:

Implement particle simulation with AoS layout
Implement the same simulation with SoA layout
Measure performance for position updates only
Measure performance when accessing all fields

Questions:

Which layout is faster for position-only updates?
Which layout is faster when accessing all fields?
How does the number of fields affect the trade-off?

Chapter 6: Stacks and Queues

Exercise 1: Ring Buffer Implementation

Objective: Implement a cache-friendly ring buffer queue.

Task:

Implement a ring buffer with power-of-2 size
Compare with a linked list queue
Measure performance for enqueue/dequeue operations
Test with different buffer sizes (64, 256, 1024, 4096)

Questions:

How much faster is the ring buffer?
What happens when the buffer is full?
How does buffer size affect cache performance?

Exercise 2: Stack Overflow Detection

Objective: Understand stack memory layout and overflow detection.

Task:

Write a recursive function that overflows the stack
Add canary values to detect overflow
Measure the stack size on your system
Implement a custom stack with overflow protection

Questions:

What is the default stack size on your system?
How can you detect stack overflow before it crashes?
What is the performance overhead of canary checks?

Chapter 7: Hash Tables and Cache Conflicts

Exercise 1: Hash Function Quality

Objective: Compare different hash functions for cache behavior.

Task:

Implement three hash functions: simple sum, FNV-1a, MurmurHash
Measure distribution quality (bucket occupancy variance)
Measure cache miss rate for lookups
Test with real-world string data (e.g., dictionary words)

Questions:

Which hash function has the best distribution?
Which has the best cache behavior?
Is there a trade-off between distribution and cache performance?

Exercise 2: Open Addressing vs Chaining

Objective: Compare collision resolution strategies.

Task:

Implement hash table with chaining
Implement hash table with linear probing
Measure performance at different load factors (0.5, 0.7, 0.9)
Measure cache misses for both implementations

Questions:

Which is faster at low load factors?
Which is faster at high load factors?
What is the cache miss rate for each?

Chapter 8: Dynamic Arrays and Memory Management

Exercise 1: Growth Factor Comparison

Objective: Compare different growth strategies for dynamic arrays.

Task:

Implement dynamic array with 1.5× growth
Implement dynamic array with 2× growth
Implement dynamic array with φ (1.618) growth
Measure total reallocations and memory waste for growing to 1M elements

Questions:

Which growth factor minimizes reallocations?
Which minimizes memory waste?
Which has the best overall performance?

Exercise 2: Custom Allocator

Objective: Implement a simple memory allocator.

Task:

Implement a bump allocator (arena)
Implement a free list allocator
Compare with malloc for small allocations
Measure fragmentation over time

Questions:

How much faster is the bump allocator?
When does the free list allocator fragment?
What is the memory overhead of each allocator?

Chapter 9: Binary Search Trees

Exercise 1: BST vs Sorted Array

Objective: Compare tree-based and array-based search structures.

Task:

Implement Red-Black tree
Implement sorted array with binary search
Measure lookup performance for 10,000 elements
Measure cache misses for both

Questions:

Which is faster for lookups?
Which is faster for insertions?
At what size does the tree become slower?

Exercise 2: Tree Layout Optimization

Objective: Explore cache-friendly tree layouts.

Task:

Implement standard pointer-based BST
Implement array-based BST (implicit pointers)
Implement van Emde Boas layout
Measure cache misses for tree traversal

Questions:

Which layout has the fewest cache misses?
How does tree depth affect the performance gap?
What is the memory overhead of each layout?

Chapter 10: B-Trees and Cache-Conscious Trees

Exercise 1: Optimal Node Size

Objective: Find the optimal B-tree node size for your hardware.

Task:

Implement B-tree with configurable node size
Test node sizes: 16, 32, 64, 128, 256 bytes
Measure lookup performance for 100,000 elements
Measure cache misses for each node size

Questions:

What is the optimal node size?
How does it relate to your cache line size?
What happens when node size exceeds cache line size?

Exercise 2: B-Tree vs Hash Table

Objective: Compare B-trees and hash tables for in-memory databases.

Task:

Implement B-tree with optimal node size
Implement cache-friendly hash table
Measure performance for random lookups
Measure performance for range queries

Questions:

Which is faster for point queries?
Which is faster for range queries?
How does dataset size affect the trade-off?

Chapter 11: Tries and Radix Trees

Exercise 1: Trie Memory Optimization

Objective: Reduce trie memory consumption.

Task:

Implement standard trie (26 pointers per node)
Implement compressed trie (radix tree)
Implement array-mapped trie (bitmap + compact array)
Measure memory usage and lookup performance

Questions:

How much memory does each implementation use?
Which has the best lookup performance?
What is the trade-off between memory and speed?

Exercise 2: Autocomplete Performance

Objective: Compare data structures for autocomplete.

Task:

Implement autocomplete with trie
Implement autocomplete with sorted array + binary search
Implement autocomplete with hash table
Test with 50,000 words from a dictionary

Questions:

Which is fastest for prefix search?
Which uses the least memory?
How does prefix length affect performance?

Chapter 12: Heaps and Priority Queues

Exercise 1: Heap Implementations

Objective: Compare different heap implementations.

Task:

Implement binary heap (array-based)
Implement d-ary heap (d=4, d=8)
Implement Fibonacci heap
Measure insert and extract-min performance

Questions:

Which heap has the best cache behavior?
What is the optimal d for d-ary heap?
When is Fibonacci heap worth the complexity?

Exercise 2: Priority Queue for Task Scheduling

Objective: Build a real-time task scheduler.

Task:

Implement priority queue with binary heap
Add tasks with different priorities
Measure worst-case extract-min time
Ensure deterministic timing for real-time use

Questions:

What is the worst-case execution time?
Is it acceptable for a 1 kHz control loop?
How can you reduce worst-case time?

Chapter 13: Lock-Free Data Structures

Exercise 1: Lock-Free Queue

Objective: Implement a lock-free queue using CAS.

Task:

Implement Michael-Scott lock-free queue
Implement mutex-based queue for comparison
Measure throughput with 1, 2, 4, 8 threads
Measure contention using perf

Questions:

At what thread count does lock-free win?
What is the overhead of CAS operations?
How do you handle the ABA problem?

Exercise 2: Lock-Free Stack

Objective: Build a simpler lock-free data structure.

Task:

Implement lock-free stack using CAS
Test with multiple producer/consumer threads
Measure performance vs mutex-based stack
Identify and fix ABA problem

Questions:

Is the lock-free stack faster than mutex-based?
How many CAS retries occur under contention?
What is the memory ordering requirement?

Chapter 14: String Processing and Cache Efficiency

Exercise 1: String Search Optimization

Objective: Optimize string search for cache efficiency.

Task:

Implement naive string search
Implement Boyer-Moore algorithm
Implement SIMD-based search (if available)
Measure cache misses for each

Questions:

Which algorithm has the fewest cache misses?
How does string length affect performance?
When is SIMD worth the complexity?

Exercise 2: Log Parser Optimization

Objective: Build a high-performance log parser.

Task:

Parse log lines using strchr/strncpy
Optimize using manual parsing (avoid string functions)
Add SIMD optimization for timestamp parsing
Measure throughput (lines per second)

Questions:

How much faster is manual parsing?
What is the cache miss rate for each approach?
Can you achieve 3M lines/second?

Chapter 15: Graphs and Cache-Efficient Traversal

Exercise 1: Graph Representations

Objective: Compare graph representations for cache efficiency.

Task:

Implement adjacency list (array of pointers)
Implement adjacency array (CSR format)
Implement adjacency matrix
Measure BFS performance for each

Questions:

Which has the fewest cache misses?
Which is fastest for sparse graphs?
Which is fastest for dense graphs?

Exercise 2: Graph Traversal Optimization

Objective: Optimize BFS for cache efficiency.

Task:

Implement standard BFS with adjacency list
Optimize using CSR format
Add prefetching hints
Measure cache misses and execution time

Questions:

How much does CSR format improve performance?
Does prefetching help?
What is the optimal prefetch distance?

Chapter 16: Bloom Filters and Probabilistic Data Structures

Exercise 1: Bloom Filter Implementation

Objective: Build and tune a Bloom filter.

Task:

Implement Bloom filter with configurable size and hash count
Test false positive rate with different parameters
Compare memory usage with hash table
Measure lookup performance

Questions:

What is the optimal number of hash functions?
How does filter size affect false positive rate?
What is the memory savings vs hash table?

Exercise 2: Counting Bloom Filter

Objective: Extend Bloom filter to support deletions.

Task:

Implement counting Bloom filter
Test with insertions and deletions
Measure memory overhead vs standard Bloom filter
Measure false positive rate

Questions:

How much more memory does counting require?
Does deletion increase false positive rate?
When is counting Bloom filter worth it?

Chapter 17: Bootloader Data Structures

Exercise 1: Bootloader Optimization

Objective: Minimize bootloader execution time.

Task:

Implement device tree parser with linked lists
Optimize using fixed-size arrays
Measure boot time for both implementations
Profile to find remaining bottlenecks

Questions:

How much faster is the array-based version?
What is the largest bottleneck in boot time?
Can you boot in under 500ms?

Exercise 2: Memory-Constrained Data Structures

Objective: Design data structures for bootloader constraints.

Task:

Implement symbol table with minimal memory
Avoid dynamic allocation entirely
Measure memory usage and lookup performance
Compare with standard implementations

Questions:

How much memory can you save?
What is the performance trade-off?
Is the complexity worth it?

Chapter 18: Device Driver Queues

Exercise 1: DMA Ring Buffer

Objective: Implement a high-performance DMA ring buffer.

Task:

Implement ring buffer for packet reception
Add overflow detection and handling
Measure packet loss rate at line rate
Optimize for cache efficiency

Questions:

What buffer size minimizes packet loss?
How do you handle buffer overflow?
What is the cache miss rate?

Exercise 2: Interrupt Handler Optimization

Objective: Minimize interrupt handler execution time.

Task:

Implement interrupt handler with linked list queue
Optimize using lock-free ring buffer
Measure interrupt latency
Measure worst-case execution time

Questions:

How much faster is the ring buffer?
What is the worst-case interrupt latency?
Is it acceptable for real-time requirements?

Chapter 19: Firmware Memory Management

Exercise 1: Memory Pool Allocator

Objective: Eliminate fragmentation in firmware.

Task:

Implement fixed-size memory pools
Implement slab allocator for multiple sizes
Measure fragmentation over 72 hours
Compare with malloc

Questions:

Does fragmentation occur with memory pools?
What is the memory overhead?
How many pool sizes do you need?

Exercise 2: Long-Running Firmware Test

Objective: Ensure firmware stability over time.

Task:

Implement firmware with your memory allocator
Run continuous operation test for 72 hours
Monitor memory usage and fragmentation
Identify and fix any memory leaks

Questions:

Does the firmware run for 72 hours without crashing?
What is the memory usage trend over time?
Are there any memory leaks?

Chapter 20: Benchmark Case Studies

Exercise 1: Dhrystone Analysis

Objective: Understand how compiler optimization affects Dhrystone scores.

Task:

Download Dhrystone 2.1 source code
Compile with different optimization levels: -O0, -O1, -O2, -O3
Compile with different compilers: GCC, Clang (if available)
Measure DMIPS/MHz for each configuration
Use objdump -d to examine the generated assembly code

Questions:

How much do scores vary between optimization levels?
How much do scores vary between compilers?
Can you identify specific optimizations that inflate the score?
Look at the assembly: is the compiler eliminating dead code?
Why is Dhrystone considered obsolete?

Expected Results:

5-10× speedup from -O0 to -O3
20-50% variance between compilers
Evidence of constant propagation and dead code elimination

Exercise 2: Coremark Implementation and Analysis

Objective: Run Coremark and understand what it measures.

Task:

Clone Coremark from GitHub: https://github.com/eembc/coremark
Compile for your platform (native x86/ARM or RISC-V QEMU)
Run with at least 10 seconds of iterations
Analyze the four workloads:
- Linked list operations (core_list_join.c)
- Matrix operations (core_matrix.c)
- State machine (core_state.c)
- CRC calculation (core_util.c)
Use perf to measure cache misses for each workload
Compile with different flags and compare scores

Questions:

What is your CoreMark/MHz score?
Which workload has the highest cache miss rate?
Which workload takes the most time?
How do compiler flags affect the score?
Why can’t the compiler optimize away Coremark like it does Dhrystone?

Expected Results:

CoreMark/MHz between 2.5-5.5 (depending on processor)
Linked list workload has highest cache miss rate
Matrix workload takes most time
-O3 gives 10-30% improvement over -O2

Advanced:

Modify Coremark to use different list sizes
Measure how cache size affects performance
Compare performance on different architectures (x86 vs ARM vs RISC-V)

Exercise 3: Design Your Own Benchmark (Optional Challenge)

Objective: Apply benchmark design principles to create a domain-specific benchmark.

Task:

Choose a specific workload (e.g., packet processing, image filtering, crypto)
Identify the key operations in that workload
Design a benchmark that:
- Uses runtime-determined inputs
- Resists compiler optimization
- Validates results
- Represents realistic data sizes
Implement the benchmark
Test with different compilers and optimization levels
Document your methodology

Questions:

What operations does your benchmark measure?
How do you prevent dead code elimination?
How do you validate correctness?
What are the limitations of your benchmark?
How does it compare to existing benchmarks?

Example Workloads:

Packet processing: Parse headers, checksum, routing table lookup
Image filtering: Convolution, color space conversion
Crypto: AES encryption, SHA hashing
JSON parsing: Tokenization, validation, tree building

Deliverables:

Source code with clear documentation
Run rules (iterations, validation, reporting)
Benchmark results on at least one platform
Analysis of what the benchmark measures and doesn’t measure

Submission Guidelines

For readers who want feedback on their implementations:

Code: Share your implementation on GitHub
Benchmarks: Include benchmark results with hardware specifications
Analysis: Write a brief analysis of your findings
Discussion: Join the book’s discussion forum (URL TBD)

Resources

Benchmark framework: See Appendix A
Hardware specifications: See Appendix B
Profiling tools: See Appendix C
Further reading: See Appendix D

Keyboard shortcuts

Data Structures in Practice