Appendix E: Exercises
This appendix provides hands-on exercises to reinforce the concepts covered throughout the book. Each exercise is designed to help you gain practical experience with hardware-aware data structure implementation and performance analysis.
Chapter 5: Linked Lists - The Cache Killer
Exercise 1: Benchmark Challenge
Objective: Compare array-based and linked list implementations of a stack.
Task:
- Implement both array and linked list versions of a stack
- Measure push/pop performance for 10,000 operations
- Use the benchmark framework from Chapter 3
- Measure both execution time and cache misses
Questions:
- Which implementation is faster? By how much?
- What is the cache miss rate for each?
- How does performance change with different stack sizes (100, 1000, 10000 elements)?
Exercise 2: Memory Pool
Objective: Understand the impact of allocation overhead on linked list performance.
Task:
- Implement a memory pool allocator for linked list nodes
- Compare performance with malloc-based allocation
- Measure allocation time and fragmentation
Questions:
- How much faster is the memory pool?
- What is the memory overhead of the pool?
- How does pool size affect performance?
Exercise 3: Unrolled List
Objective: Explore cache-friendly variations of linked lists.
Task:
- Implement an unrolled linked list with 16 elements per node
- Benchmark against standard linked list and array
- Measure cache behavior for sequential traversal
Questions:
- How does the unrolled list compare to standard linked list?
- What is the optimal number of elements per node?
- When would you choose an unrolled list over an array?
Exercise 4: Cache Analysis
Objective: Analyze cache behavior at different data sizes.
Task:
- Use
perfto measure cache misses for array vs linked list traversal - Vary the data size from 1 KB to 1 MB
- Plot cache miss rate vs data size
Questions:
- At what size does the linked list become completely cache-hostile?
- How does the cache miss rate change as data exceeds L1, L2, L3 cache sizes?
- Can you identify the cache size thresholds from the data?
Exercise 5: Real-Time Analysis
Objective: Understand predictability requirements for real-time systems.
Task:
- Measure worst-case execution time for linked list operations
- Run 10,000 iterations and record min, max, median, P99
- Compare variance between array and linked list
Questions:
- How much variance do you see in linked list operations?
- Is the variance acceptable for a 1 kHz control loop?
- What causes the worst-case execution times?
Chapter 1: The Performance Gap
Exercise 1: Hash Table vs Binary Search
Objective: Reproduce the Chapter 1 experiment comparing hash tables and binary search.
Task:
- Implement a hash table with 1024 buckets for 500 device configurations
- Implement binary search on a sorted array for the same data
- Measure cache misses and execution time for 10,000 lookups
- Vary the number of entries (100, 500, 1000, 5000)
Questions:
- At what size does the hash table become slower than binary search?
- What is the cache miss rate for each implementation?
- How does the hash table size (number of buckets) affect performance?
Exercise 2: Cache Miss Analysis
Objective: Understand the relationship between cache misses and execution time.
Task:
- Write a simple array traversal program
- Use
perfto measure cache misses and cycles - Calculate the cost per cache miss
- Compare with the theoretical 100-cycle penalty
Questions:
- What is the actual cache miss penalty on your hardware?
- How does it vary with L1, L2, and L3 cache misses?
- Can you identify the cache sizes from the performance data?
Chapter 2: Memory Hierarchy
Exercise 1: Cache Line Size Detection
Objective: Experimentally determine your CPU’s cache line size.
Task:
- Create an array and access elements with varying strides (1, 2, 4, 8, 16, 32, 64, 128 bytes)
- Measure cache misses for each stride
- Plot cache misses vs stride
- Identify the cache line size from the inflection point
Questions:
- What is your CPU’s cache line size?
- How does performance change when stride equals cache line size?
- What happens when stride is larger than cache line size?
Exercise 2: False Sharing
Objective: Demonstrate the performance impact of false sharing.
Task:
- Create a multi-threaded program where each thread updates a separate counter
- Version 1: Pack counters tightly in an array
- Version 2: Pad counters to separate cache lines
- Measure throughput for both versions
Questions:
- How much slower is the packed version?
- How many cache line bounces occur in the packed version?
- What is the optimal padding size?
Chapter 3: Benchmarking and Profiling
Exercise 1: Build a Microbenchmark Framework
Objective: Create a reusable benchmarking framework.
Task:
- Implement high-precision timing using RDTSC or clock_gettime
- Add statistical analysis (mean, median, stddev, percentiles)
- Implement warmup runs and outlier detection
- Add cache miss measurement using perf_event_open
Questions:
- How many iterations are needed for stable results?
- What is the overhead of your timing mechanism?
- How do you detect and handle outliers?
Exercise 2: Profiling with perf
Objective: Master the perf profiling tool.
Task:
- Write a program with an obvious performance bottleneck
- Use
perf recordandperf reportto find the hotspot - Use
perf statto measure cache misses, branch mispredictions - Use
perf annotateto see assembly-level performance
Questions:
- What percentage of time is spent in the hotspot?
- What is the cache miss rate in the hotspot?
- Can you identify the exact instruction causing cache misses?
Chapter 4: Arrays and Cache Locality
Exercise 1: Row-Major vs Column-Major
Objective: Measure the performance impact of access patterns.
Task:
- Create a 1000×1000 matrix
- Sum all elements using row-major order
- Sum all elements using column-major order
- Measure cache misses and execution time
Questions:
- How much slower is column-major access?
- What is the cache miss rate for each?
- How does matrix size affect the performance gap?
Exercise 2: Structure of Arrays vs Array of Structures
Objective: Compare SoA and AoS layouts for cache efficiency.
Task:
- Implement particle simulation with AoS layout
- Implement the same simulation with SoA layout
- Measure performance for position updates only
- Measure performance when accessing all fields
Questions:
- Which layout is faster for position-only updates?
- Which layout is faster when accessing all fields?
- How does the number of fields affect the trade-off?
Chapter 6: Stacks and Queues
Exercise 1: Ring Buffer Implementation
Objective: Implement a cache-friendly ring buffer queue.
Task:
- Implement a ring buffer with power-of-2 size
- Compare with a linked list queue
- Measure performance for enqueue/dequeue operations
- Test with different buffer sizes (64, 256, 1024, 4096)
Questions:
- How much faster is the ring buffer?
- What happens when the buffer is full?
- How does buffer size affect cache performance?
Exercise 2: Stack Overflow Detection
Objective: Understand stack memory layout and overflow detection.
Task:
- Write a recursive function that overflows the stack
- Add canary values to detect overflow
- Measure the stack size on your system
- Implement a custom stack with overflow protection
Questions:
- What is the default stack size on your system?
- How can you detect stack overflow before it crashes?
- What is the performance overhead of canary checks?
Chapter 7: Hash Tables and Cache Conflicts
Exercise 1: Hash Function Quality
Objective: Compare different hash functions for cache behavior.
Task:
- Implement three hash functions: simple sum, FNV-1a, MurmurHash
- Measure distribution quality (bucket occupancy variance)
- Measure cache miss rate for lookups
- Test with real-world string data (e.g., dictionary words)
Questions:
- Which hash function has the best distribution?
- Which has the best cache behavior?
- Is there a trade-off between distribution and cache performance?
Exercise 2: Open Addressing vs Chaining
Objective: Compare collision resolution strategies.
Task:
- Implement hash table with chaining
- Implement hash table with linear probing
- Measure performance at different load factors (0.5, 0.7, 0.9)
- Measure cache misses for both implementations
Questions:
- Which is faster at low load factors?
- Which is faster at high load factors?
- What is the cache miss rate for each?
Chapter 8: Dynamic Arrays and Memory Management
Exercise 1: Growth Factor Comparison
Objective: Compare different growth strategies for dynamic arrays.
Task:
- Implement dynamic array with 1.5× growth
- Implement dynamic array with 2× growth
- Implement dynamic array with φ (1.618) growth
- Measure total reallocations and memory waste for growing to 1M elements
Questions:
- Which growth factor minimizes reallocations?
- Which minimizes memory waste?
- Which has the best overall performance?
Exercise 2: Custom Allocator
Objective: Implement a simple memory allocator.
Task:
- Implement a bump allocator (arena)
- Implement a free list allocator
- Compare with malloc for small allocations
- Measure fragmentation over time
Questions:
- How much faster is the bump allocator?
- When does the free list allocator fragment?
- What is the memory overhead of each allocator?
Chapter 9: Binary Search Trees
Exercise 1: BST vs Sorted Array
Objective: Compare tree-based and array-based search structures.
Task:
- Implement Red-Black tree
- Implement sorted array with binary search
- Measure lookup performance for 10,000 elements
- Measure cache misses for both
Questions:
- Which is faster for lookups?
- Which is faster for insertions?
- At what size does the tree become slower?
Exercise 2: Tree Layout Optimization
Objective: Explore cache-friendly tree layouts.
Task:
- Implement standard pointer-based BST
- Implement array-based BST (implicit pointers)
- Implement van Emde Boas layout
- Measure cache misses for tree traversal
Questions:
- Which layout has the fewest cache misses?
- How does tree depth affect the performance gap?
- What is the memory overhead of each layout?
Chapter 10: B-Trees and Cache-Conscious Trees
Exercise 1: Optimal Node Size
Objective: Find the optimal B-tree node size for your hardware.
Task:
- Implement B-tree with configurable node size
- Test node sizes: 16, 32, 64, 128, 256 bytes
- Measure lookup performance for 100,000 elements
- Measure cache misses for each node size
Questions:
- What is the optimal node size?
- How does it relate to your cache line size?
- What happens when node size exceeds cache line size?
Exercise 2: B-Tree vs Hash Table
Objective: Compare B-trees and hash tables for in-memory databases.
Task:
- Implement B-tree with optimal node size
- Implement cache-friendly hash table
- Measure performance for random lookups
- Measure performance for range queries
Questions:
- Which is faster for point queries?
- Which is faster for range queries?
- How does dataset size affect the trade-off?
Chapter 11: Tries and Radix Trees
Exercise 1: Trie Memory Optimization
Objective: Reduce trie memory consumption.
Task:
- Implement standard trie (26 pointers per node)
- Implement compressed trie (radix tree)
- Implement array-mapped trie (bitmap + compact array)
- Measure memory usage and lookup performance
Questions:
- How much memory does each implementation use?
- Which has the best lookup performance?
- What is the trade-off between memory and speed?
Exercise 2: Autocomplete Performance
Objective: Compare data structures for autocomplete.
Task:
- Implement autocomplete with trie
- Implement autocomplete with sorted array + binary search
- Implement autocomplete with hash table
- Test with 50,000 words from a dictionary
Questions:
- Which is fastest for prefix search?
- Which uses the least memory?
- How does prefix length affect performance?
Chapter 12: Heaps and Priority Queues
Exercise 1: Heap Implementations
Objective: Compare different heap implementations.
Task:
- Implement binary heap (array-based)
- Implement d-ary heap (d=4, d=8)
- Implement Fibonacci heap
- Measure insert and extract-min performance
Questions:
- Which heap has the best cache behavior?
- What is the optimal d for d-ary heap?
- When is Fibonacci heap worth the complexity?
Exercise 2: Priority Queue for Task Scheduling
Objective: Build a real-time task scheduler.
Task:
- Implement priority queue with binary heap
- Add tasks with different priorities
- Measure worst-case extract-min time
- Ensure deterministic timing for real-time use
Questions:
- What is the worst-case execution time?
- Is it acceptable for a 1 kHz control loop?
- How can you reduce worst-case time?
Chapter 13: Lock-Free Data Structures
Exercise 1: Lock-Free Queue
Objective: Implement a lock-free queue using CAS.
Task:
- Implement Michael-Scott lock-free queue
- Implement mutex-based queue for comparison
- Measure throughput with 1, 2, 4, 8 threads
- Measure contention using perf
Questions:
- At what thread count does lock-free win?
- What is the overhead of CAS operations?
- How do you handle the ABA problem?
Exercise 2: Lock-Free Stack
Objective: Build a simpler lock-free data structure.
Task:
- Implement lock-free stack using CAS
- Test with multiple producer/consumer threads
- Measure performance vs mutex-based stack
- Identify and fix ABA problem
Questions:
- Is the lock-free stack faster than mutex-based?
- How many CAS retries occur under contention?
- What is the memory ordering requirement?
Chapter 14: String Processing and Cache Efficiency
Exercise 1: String Search Optimization
Objective: Optimize string search for cache efficiency.
Task:
- Implement naive string search
- Implement Boyer-Moore algorithm
- Implement SIMD-based search (if available)
- Measure cache misses for each
Questions:
- Which algorithm has the fewest cache misses?
- How does string length affect performance?
- When is SIMD worth the complexity?
Exercise 2: Log Parser Optimization
Objective: Build a high-performance log parser.
Task:
- Parse log lines using strchr/strncpy
- Optimize using manual parsing (avoid string functions)
- Add SIMD optimization for timestamp parsing
- Measure throughput (lines per second)
Questions:
- How much faster is manual parsing?
- What is the cache miss rate for each approach?
- Can you achieve 3M lines/second?
Chapter 15: Graphs and Cache-Efficient Traversal
Exercise 1: Graph Representations
Objective: Compare graph representations for cache efficiency.
Task:
- Implement adjacency list (array of pointers)
- Implement adjacency array (CSR format)
- Implement adjacency matrix
- Measure BFS performance for each
Questions:
- Which has the fewest cache misses?
- Which is fastest for sparse graphs?
- Which is fastest for dense graphs?
Exercise 2: Graph Traversal Optimization
Objective: Optimize BFS for cache efficiency.
Task:
- Implement standard BFS with adjacency list
- Optimize using CSR format
- Add prefetching hints
- Measure cache misses and execution time
Questions:
- How much does CSR format improve performance?
- Does prefetching help?
- What is the optimal prefetch distance?
Chapter 16: Bloom Filters and Probabilistic Data Structures
Exercise 1: Bloom Filter Implementation
Objective: Build and tune a Bloom filter.
Task:
- Implement Bloom filter with configurable size and hash count
- Test false positive rate with different parameters
- Compare memory usage with hash table
- Measure lookup performance
Questions:
- What is the optimal number of hash functions?
- How does filter size affect false positive rate?
- What is the memory savings vs hash table?
Exercise 2: Counting Bloom Filter
Objective: Extend Bloom filter to support deletions.
Task:
- Implement counting Bloom filter
- Test with insertions and deletions
- Measure memory overhead vs standard Bloom filter
- Measure false positive rate
Questions:
- How much more memory does counting require?
- Does deletion increase false positive rate?
- When is counting Bloom filter worth it?
Chapter 17: Bootloader Data Structures
Exercise 1: Bootloader Optimization
Objective: Minimize bootloader execution time.
Task:
- Implement device tree parser with linked lists
- Optimize using fixed-size arrays
- Measure boot time for both implementations
- Profile to find remaining bottlenecks
Questions:
- How much faster is the array-based version?
- What is the largest bottleneck in boot time?
- Can you boot in under 500ms?
Exercise 2: Memory-Constrained Data Structures
Objective: Design data structures for bootloader constraints.
Task:
- Implement symbol table with minimal memory
- Avoid dynamic allocation entirely
- Measure memory usage and lookup performance
- Compare with standard implementations
Questions:
- How much memory can you save?
- What is the performance trade-off?
- Is the complexity worth it?
Chapter 18: Device Driver Queues
Exercise 1: DMA Ring Buffer
Objective: Implement a high-performance DMA ring buffer.
Task:
- Implement ring buffer for packet reception
- Add overflow detection and handling
- Measure packet loss rate at line rate
- Optimize for cache efficiency
Questions:
- What buffer size minimizes packet loss?
- How do you handle buffer overflow?
- What is the cache miss rate?
Exercise 2: Interrupt Handler Optimization
Objective: Minimize interrupt handler execution time.
Task:
- Implement interrupt handler with linked list queue
- Optimize using lock-free ring buffer
- Measure interrupt latency
- Measure worst-case execution time
Questions:
- How much faster is the ring buffer?
- What is the worst-case interrupt latency?
- Is it acceptable for real-time requirements?
Chapter 19: Firmware Memory Management
Exercise 1: Memory Pool Allocator
Objective: Eliminate fragmentation in firmware.
Task:
- Implement fixed-size memory pools
- Implement slab allocator for multiple sizes
- Measure fragmentation over 72 hours
- Compare with malloc
Questions:
- Does fragmentation occur with memory pools?
- What is the memory overhead?
- How many pool sizes do you need?
Exercise 2: Long-Running Firmware Test
Objective: Ensure firmware stability over time.
Task:
- Implement firmware with your memory allocator
- Run continuous operation test for 72 hours
- Monitor memory usage and fragmentation
- Identify and fix any memory leaks
Questions:
- Does the firmware run for 72 hours without crashing?
- What is the memory usage trend over time?
- Are there any memory leaks?
Chapter 20: Benchmark Case Studies
Exercise 1: Dhrystone Analysis
Objective: Understand how compiler optimization affects Dhrystone scores.
Task:
- Download Dhrystone 2.1 source code
- Compile with different optimization levels:
-O0,-O1,-O2,-O3 - Compile with different compilers: GCC, Clang (if available)
- Measure DMIPS/MHz for each configuration
- Use
objdump -dto examine the generated assembly code
Questions:
- How much do scores vary between optimization levels?
- How much do scores vary between compilers?
- Can you identify specific optimizations that inflate the score?
- Look at the assembly: is the compiler eliminating dead code?
- Why is Dhrystone considered obsolete?
Expected Results:
- 5-10× speedup from
-O0to-O3 - 20-50% variance between compilers
- Evidence of constant propagation and dead code elimination
Exercise 2: Coremark Implementation and Analysis
Objective: Run Coremark and understand what it measures.
Task:
- Clone Coremark from GitHub: https://github.com/eembc/coremark
- Compile for your platform (native x86/ARM or RISC-V QEMU)
- Run with at least 10 seconds of iterations
- Analyze the four workloads:
- Linked list operations (core_list_join.c)
- Matrix operations (core_matrix.c)
- State machine (core_state.c)
- CRC calculation (core_util.c)
- Use
perfto measure cache misses for each workload - Compile with different flags and compare scores
Questions:
- What is your CoreMark/MHz score?
- Which workload has the highest cache miss rate?
- Which workload takes the most time?
- How do compiler flags affect the score?
- Why can’t the compiler optimize away Coremark like it does Dhrystone?
Expected Results:
- CoreMark/MHz between 2.5-5.5 (depending on processor)
- Linked list workload has highest cache miss rate
- Matrix workload takes most time
-O3gives 10-30% improvement over-O2
Advanced:
- Modify Coremark to use different list sizes
- Measure how cache size affects performance
- Compare performance on different architectures (x86 vs ARM vs RISC-V)
Exercise 3: Design Your Own Benchmark (Optional Challenge)
Objective: Apply benchmark design principles to create a domain-specific benchmark.
Task:
- Choose a specific workload (e.g., packet processing, image filtering, crypto)
- Identify the key operations in that workload
- Design a benchmark that:
- Uses runtime-determined inputs
- Resists compiler optimization
- Validates results
- Represents realistic data sizes
- Implement the benchmark
- Test with different compilers and optimization levels
- Document your methodology
Questions:
- What operations does your benchmark measure?
- How do you prevent dead code elimination?
- How do you validate correctness?
- What are the limitations of your benchmark?
- How does it compare to existing benchmarks?
Example Workloads:
- Packet processing: Parse headers, checksum, routing table lookup
- Image filtering: Convolution, color space conversion
- Crypto: AES encryption, SHA hashing
- JSON parsing: Tokenization, validation, tree building
Deliverables:
- Source code with clear documentation
- Run rules (iterations, validation, reporting)
- Benchmark results on at least one platform
- Analysis of what the benchmark measures and doesn’t measure
Submission Guidelines
For readers who want feedback on their implementations:
- Code: Share your implementation on GitHub
- Benchmarks: Include benchmark results with hardware specifications
- Analysis: Write a brief analysis of your findings
- Discussion: Join the book’s discussion forum (URL TBD)
Resources
- Benchmark framework: See Appendix A
- Hardware specifications: See Appendix B
- Profiling tools: See Appendix C
- Further reading: See Appendix D