Chapter 8. Microarchitecture Variations
Part V — Pipeline & Microarchitecture
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Distinguish In-Order from Out-of-Order: Understand the performance differences and use cases of both execution models
- Understand Register Renaming: Know how Physical Registers eliminate False Dependencies (WAW/WAR)
- Master ROB Mechanism: Understand how the Reorder Buffer guarantees “out-of-order execution, in-order commit”
- Recognize Speculative Execution: Understand the principles and security risks (Spectre/Meltdown)
- Use Performance Counters: Calculate CPI using
mcycleandminstret
💡 Scenario: The Michelin Restaurant Kitchen Philosophy
Scene: Junior is comparing benchmark results of two RISC-V cores and finds that despite similar frequencies, performance differs by a factor of two.
Junior: “Professor, I looked at specs for two RISC-V cores. SiFive U74 and Alibaba C910 both run at about 1.5GHz, but the CoreMark scores differ by almost 2x! How is this possible?”
Professor: “Have you ever been to a Michelin-star restaurant?”
Junior: “Sure, but what does that have to do with CPUs?”
Professor: “Imagine two restaurants.
The first one (In-Order): The chef strictly follows the order of tickets. If the first dish needs 10 minutes for ingredients to thaw, all other dishes must wait—even if the second dish’s ingredients are already ready.
The second one (Out-of-Order): The chef looks at which dish has ingredients ready first and starts with that. While waiting for ingredients to thaw, the chef has already finished three other dishes.“
Junior: “So Out-of-Order means the CPU doesn’t wait idly?”
Professor: “Exactly. But this requires a smart ‘restaurant manager’ to coordinate:
- Reservation Station: Tracks what ingredients each dish needs; starts cooking when ingredients arrive.
- Reorder Buffer: Even though cooking order is scrambled, dishes must still be served in the order customers placed them—otherwise chaos ensues.
- Register Renaming: If two dishes both need ‘eggs,’ but they’re actually different eggs, label them differently to avoid confusion.“
Junior: “Sounds complex. What’s the cost?”
Professor: “Transistor count explodes, and power consumption goes up with it. That’s why phone ‘big cores’ are power-hungry while ‘little cores’ are efficient—big cores are usually Out-of-Order, little cores are In-Order.”
Junior: “Let’s measure the actual performance difference!”
In Chapter 7, we explored the classic five-stage in-order pipeline. But modern processors go far beyond this simple model. Out-of-order (OOO) execution allows processors to dynamically reorder instructions to extract more parallelism, dramatically improving performance. While in-order processors execute instructions in program order and stall on dependencies, out-of-order processors can execute independent instructions while waiting for slow operations to complete.
This chapter explores the microarchitectural techniques that enable high-performance RISC-V processors: register renaming to eliminate false dependencies, reorder buffers to maintain precise exceptions, speculative execution to execute beyond branches, and advanced branch prediction to minimize misprediction penalties. We’ll examine the cache hierarchy that hides memory latency, and cache coherence protocols that maintain consistency across multiple cores. Understanding these techniques is essential for anyone designing high-performance RISC-V systems or optimizing code for modern processors.
8.1 Out-of-Order Execution Basics
In-Order vs Out-of-Order
In-order processors execute instructions in the exact order they appear in the program. If an instruction stalls (e.g., waiting for a cache miss), all subsequent instructions must wait, even if they’re independent and could execute.
Out-of-order (OOO) processors can execute instructions in a different order than the program specifies, as long as the final result is the same. This allows the processor to work around stalls and dependencies, keeping execution units busy.
Example:
lw x1, 0(x2) # I1: load (cache miss, 100 cycles)
add x3, x4, x5 # I2: independent of I1
sub x6, x7, x8 # I3: independent of I1
add x9, x1, x10 # I4: depends on I1
In-order execution: I2 and I3 must wait for I1 to complete (100 cycles), even though they’re independent.
Out-of-order execution: I2 and I3 can execute immediately while I1 is waiting for the cache miss. Only I4 must wait for I1.
This simple reordering can dramatically improve performance, especially when memory latency is high.
Figure 8.1: In-Order vs Out-of-Order Execution
Cycle: 1 2 3 ... 103 104 105 106 107
lw x1 (miss): IF ID EX MEM(100 cycles) WB
add x3,x4,x5: IF ID -------- stall -------- EX MEM WB
sub x6,x7,x8: IF -------- stall -------- ID EX MEM WB
Cycle: 1 2 3 4 5 6 7 ... 103 104 105 106
lw x1 (miss): IF ID EX MEM (100 cycles) -------- WB
add x3,x4,x5: IF ID EX MEM WB
sub x6,x7,x8: IF ID EX MEM WB
add x9,x1,x10: IF ID ---- wait ---- EX MEM WB
Out-of-order execution allows independent instructions to proceed while the load is waiting, dramatically reducing wasted cycles.
Dynamic Scheduling
Dynamic scheduling is the hardware mechanism that enables out-of-order execution. The processor analyzes dependencies at runtime and schedules instructions to execution units when their operands are ready.
Two classic algorithms for dynamic scheduling:
Scoreboarding (CDC 6600, 1964): A centralized control unit tracks which registers are being written and which instructions are waiting for them. When all operands are ready, the instruction is issued to an execution unit.
Tomasulo’s algorithm (IBM 360/91, 1967): Uses reservation stations to buffer instructions waiting for operands. When an operand is produced, it’s broadcast to all waiting instructions. This eliminates the need for a centralized scoreboard and enables register renaming (Section 8.2).
Modern OOO processors use variations of Tomasulo’s algorithm with additional structures like the Reorder Buffer (ROB) to ensure precise exceptions.
8.2 Register Renaming
The Problem: False Dependencies
Consider this code:
add x1, x2, x3 # I1: x1 = x2 + x3
sub x4, x1, x5 # I2: x4 = x1 - x5 (RAW dependency on x1)
add x1, x6, x7 # I3: x1 = x6 + x7 (WAW dependency on x1)
mul x8, x1, x9 # I4: x8 = x1 * x9 (RAW dependency on x1 from I3)
I2 has a true dependency (RAW) on I1 — it must wait for I1 to produce x1.
But I3 has a false dependency (WAW) on I1 — both write x1, but I3’s write doesn’t actually depend on I1’s value. Similarly, I4 depends on I3’s x1, not I1’s.
False dependencies (WAR and WAW) limit parallelism because the processor must serialize instructions that could otherwise execute in parallel.
Physical vs Architectural Registers
Register renaming eliminates false dependencies by mapping architectural registers (the 32 registers visible to the programmer) to a larger set of physical registers (hidden from the programmer).
RISC-V has 32 architectural registers (x0-x31). A high-performance OOO processor might have 128 or 256 physical registers.
Register Alias Table (RAT): Maps each architectural register to a physical register. When an instruction writes an architectural register, it’s allocated a new physical register.
Example with renaming:
add P10, P2, P3 # I1: x1 -> P10
sub P11, P10, P5 # I2: x4 -> P11, reads P10 (I1's result)
add P12, P6, P7 # I3: x1 -> P12 (new physical register!)
mul P13, P12, P9 # I4: x8 -> P13, reads P12 (I3's result)
Now I3 and I4 use P12 for x1, while I1 and I2 use P10. The WAW dependency is eliminated — I3 can execute as soon as P6 and P7 are ready, without waiting for I1.
Figure 8.2: Register Renaming Example
graph TB
subgraph "Architectural Registers (Programmer View)"
x1[x1]
x2[x2]
x3[x3]
x4[x4]
end
subgraph "Physical Registers (Hardware)"
P2[P2]
P3[P3]
P5[P5]
P6[P6]
P7[P7]
P9[P9]
P10[P10]
P11[P11]
P12[P12]
P13[P13]
end
subgraph "Register Alias Table (RAT)"
RAT["x1 → P12<br/>x2 → P2<br/>x3 → P3<br/>x4 → P11"]
end
x1 -.->|mapped to| P12
x2 -.->|mapped to| P2
x3 -.->|mapped to| P3
x4 -.->|mapped to| P11
style P10 fill:#ffcccc
style P12 fill:#ccffcc
In this example, x1 was previously mapped to P10 (shown in red, now free), and is currently mapped to P12 (shown in green).
Free List Management
Physical registers must be recycled when they’re no longer needed. A free list tracks which physical registers are available for allocation.
When an instruction commits (Section 8.3), its old physical register mapping can be freed:
I3 commits: x1 was mapped to P10, now mapped to P12
-> P10 can be freed (added to free list)
Register renaming eliminates WAR and WAW hazards, leaving only true RAW dependencies. This dramatically increases instruction-level parallelism.
8.3 Reorder Buffer (ROB) and Issue Queue
Reorder Buffer (ROB)
The ROB ensures that instructions commit in program order, even though they execute out-of-order. This is essential for precise exceptions (Chapter 7).
The ROB is a circular buffer that holds all in-flight instructions in program order. Each entry contains:
- Instruction PC
- Destination register (architectural and physical)
- Result value (when execution completes)
- Exception status
- Ready bit
Instruction flow through the ROB:
- Dispatch: Instruction is allocated a ROB entry and issued to an execution unit.
- Execute: Instruction executes out-of-order when operands are ready.
- Complete: Result is written to the ROB entry and broadcast to waiting instructions.
- Commit: When the instruction reaches the head of the ROB and is complete, it commits (updates architectural state).
Commit is in-order: Instructions commit from the head of the ROB one at a time (or in small groups). This ensures that if an exception occurs, all earlier instructions have committed and all later instructions can be discarded.
Figure 8.3: Reorder Buffer (ROB) Structure
graph TB
subgraph "Reorder Buffer (Circular Queue)"
direction TB
ROB1["ROB Entry 1<br/>PC: 0x1000<br/>Dest: x1→P10<br/>Value: 42<br/>Ready: ✓"]
ROB2["ROB Entry 2<br/>PC: 0x1004<br/>Dest: x2→P11<br/>Value: -<br/>Ready: ✗"]
ROB3["ROB Entry 3<br/>PC: 0x1008<br/>Dest: x3→P12<br/>Value: 100<br/>Ready: ✓"]
ROB4["ROB Entry 4<br/>PC: 0x100C<br/>Dest: x4→P13<br/>Value: -<br/>Ready: ✗"]
DOTS["..."]
end
HEAD[Head Pointer<br/>Commit from here] --> ROB1
TAIL[Tail Pointer<br/>Dispatch to here] --> DOTS
ROB1 -->|Commit| COMMIT[Update<br/>Architectural<br/>State]
style ROB1 fill:#ccffcc
style ROB3 fill:#ccffcc
style ROB2 fill:#ffcccc
style ROB4 fill:#ffcccc
Green entries are ready to commit (when they reach the head). Red entries are still executing.
ROB Example Code:
// ROB entry structure
struct ROB_Entry {
uint64_t PC; // Instruction address
uint8_t arch_reg; // Architectural register (x0-x31)
uint8_t phys_reg; // Physical register (P0-P127)
uint64_t value; // Result value
bool ready; // Execution complete?
bool exception; // Exception occurred?
uint8_t exception_code; // Exception type
};
// Commit logic (simplified)
void commit_instruction() {
ROB_Entry *entry = &ROB[head];
if (!entry->ready) {
return; // Can't commit yet
}
if (entry->exception) {
// Handle exception: flush pipeline, jump to handler
flush_pipeline();
PC = trap_handler_address;
return;
}
// Update architectural state
if (entry->arch_reg != 0) { // x0 is always zero
RAT[entry->arch_reg] = entry->phys_reg;
free_old_physical_register(entry->arch_reg);
}
// Advance head pointer
head = (head + 1) % ROB_SIZE;
}
Issue Queue and Reservation Stations
The issue queue (or reservation stations) holds instructions waiting for operands. When an instruction is dispatched, it’s placed in the issue queue. When all its operands are ready, it’s issued to an execution unit.
Wakeup and select:
- Wakeup: When a result is produced, it’s broadcast to all issue queue entries. Entries waiting for that result mark the operand as ready.
- Select: Among all ready instructions, the scheduler selects which ones to issue to execution units (based on priority, age, or other policies).
This is the heart of dynamic scheduling. The issue queue decouples instruction dispatch from execution, allowing the processor to find parallelism dynamically.
8.4 Load/Store Queue
Memory operations are particularly challenging in OOO processors because they must respect memory ordering (Chapter 6) while still allowing reordering for performance.
Load Queue and Store Queue
The load queue holds all in-flight loads. The store queue holds all in-flight stores. These queues track memory addresses and data, and enforce ordering constraints.
Store-to-load forwarding: If a load reads from the same address as an earlier store, the load can get the data directly from the store queue without waiting for the store to commit to memory.
sw x1, 0(x2) # Store x1 to address in x2
lw x3, 0(x2) # Load from same address
The load can forward the data from the store queue, avoiding a memory access.
Figure 8.4: Load/Store Queue Structure
graph TB
subgraph "Store Queue"
SQ1["Store 1<br/>Addr: 0x2000<br/>Data: 42<br/>Committed: ✗"]
SQ2["Store 2<br/>Addr: 0x2008<br/>Data: 100<br/>Committed: ✗"]
SQ3["Store 3<br/>Addr: ?<br/>Data: 55<br/>Committed: ✗"]
end
subgraph "Load Queue"
LQ1["Load 1<br/>Addr: 0x2000<br/>Data: ?"]
LQ2["Load 2<br/>Addr: 0x2010<br/>Data: ?"]
end
LQ1 -.->|Address Match<br/>Forward Data| SQ1
LQ2 -.->|No Match<br/>Go to Cache| CACHE[Data Cache]
SQ3 -.->|Address Unknown<br/>Must Wait| WAIT[Stall Load]
style SQ1 fill:#ccffcc
style SQ3 fill:#ffcccc
Store-to-load forwarding allows loads to get data from earlier stores without waiting for them to commit to memory.
Memory Disambiguation
Memory disambiguation is the problem of determining whether two memory operations access the same address. This is difficult because addresses are often computed dynamically.
sw x1, 0(x2) # Store to address A
lw x3, 0(x4) # Load from address B — is B == A?
If x2 and x4 contain the same value, the load depends on the store. But the processor doesn’t know this until the addresses are computed.
Conservative approach: Assume all loads depend on all earlier stores. This is safe but limits parallelism.
Speculative approach: Assume loads don’t depend on earlier stores and execute them speculatively. If a dependency is later detected (address match), squash the load and re-execute.
Modern processors use memory dependence prediction to guess which loads depend on which stores, improving speculation accuracy.
8.5 Advanced Branch Prediction
Branch prediction is even more critical in OOO processors because mispredictions waste more work (all the speculatively executed instructions must be discarded).
Two-Level Adaptive Predictors
Two-level predictors use both local and global branch history to make predictions. They can capture complex patterns like:
if (a > 0) { // Branch B1
if (b > 0) { // Branch B2
...
}
}
If B1 is taken, B2 is more likely to be taken. A global history predictor can learn this correlation.
Structure: A global history register (GHR) tracks the last N branch outcomes (taken/not-taken). This is used to index into a pattern history table (PHT) that contains 2-bit counters.
TAGE (Tagged Geometric History Length)
TAGE is a state-of-the-art branch predictor used in modern high-performance processors. It uses multiple predictor tables with different history lengths (e.g., 4, 8, 16, 32, 64 branches).
Each table is indexed by a hash of the PC and history. The predictor uses the longest matching history to make a prediction, falling back to shorter histories if there’s no match.
TAGE achieves very high accuracy (98-99%) on most workloads.
Figure 8.5: TAGE Predictor Structure
graph TB
PC[Program Counter] --> HASH1[Hash Function 1]
PC --> HASH2[Hash Function 2]
PC --> HASH3[Hash Function 3]
PC --> BASE[Base Predictor]
GHR[Global History<br/>Register] --> HASH1
GHR --> HASH2
GHR --> HASH3
HASH1 --> T1[Table 1<br/>History: 4]
HASH2 --> T2[Table 2<br/>History: 16]
HASH3 --> T3[Table 3<br/>History: 64]
T1 --> SELECT[Selector<br/>Choose Longest<br/>Matching History]
T2 --> SELECT
T3 --> SELECT
BASE --> SELECT
SELECT --> PRED[Prediction<br/>Taken/Not-Taken]
style T3 fill:#ccffcc
style SELECT fill:#ffffcc
TAGE uses multiple tables with different history lengths. The longest matching history provides the prediction.
Return Address Stack (RAS)
Function returns are predicted using a hardware stack (mentioned in Chapter 7). When a function call is detected (jal or jalr with rd != x0), the return address is pushed onto the RAS. When a return is detected (jalr x0, 0(x1)), the top of the RAS is popped and used as the prediction.
The RAS is very accurate (>99%) because function calls and returns are well-structured.
Indirect Branch Prediction
Indirect branches (jalr with a computed target) are harder to predict than direct branches. The target can vary widely depending on the value in the register.
Indirect branch target buffer (iBTB): A cache indexed by the branch PC that stores recently seen targets. For virtual function calls or switch statements, the iBTB can achieve good accuracy.
Advanced techniques: Some processors use the call path (sequence of recent function calls) to predict indirect branch targets, improving accuracy for polymorphic code.
8.6 Cache Hierarchy
Modern processors have multiple levels of cache to hide memory latency.
L1 Instruction and Data Caches
L1 caches are small (32-64 KB), fast (1-2 cycle latency), and split into separate instruction (I-cache) and data (D-cache) caches.
Split I/D caches allow simultaneous instruction fetch and data access, avoiding structural hazards. They’re also optimized differently: I-caches are read-only and can use simpler replacement policies.
Virtually indexed, physically tagged (VIPT): L1 caches often use virtual addresses for indexing (to avoid TLB lookup latency) but physical addresses for tags (to avoid aliasing issues).
L2 Unified Cache
L2 cache is larger (256 KB - 1 MB), slower (10-20 cycles), and unified (holds both instructions and data).
L2 is the victim cache for L1: when data is evicted from L1, it’s placed in L2. This creates an inclusive hierarchy (L2 contains everything in L1).
L3 Shared Cache
L3 cache (if present) is even larger (4-32 MB), slower (30-50 cycles), and shared among all cores in a multi-core processor.
L3 reduces traffic to main memory and provides a large shared working set for all cores.
Figure 8.6: Cache Hierarchy
graph TB
subgraph "Core 0"
CPU0[CPU Core 0]
L1I0[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
L1D0[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
end
subgraph "Core 1"
CPU1[CPU Core 1]
L1I1[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
L1D1[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
end
CPU0 --> L1I0
CPU0 --> L1D0
CPU1 --> L1I1
CPU1 --> L1D1
L1I0 --> L2_0[L2 Unified<br/>256 KB<br/>10-20 cycles]
L1D0 --> L2_0
L1I1 --> L2_1[L2 Unified<br/>256 KB<br/>10-20 cycles]
L1D1 --> L2_1
L2_0 --> L3[L3 Shared<br/>8 MB<br/>30-50 cycles]
L2_1 --> L3
L3 --> MEM[Main Memory<br/>DDR4/DDR5<br/>100-300 cycles]
style L1I0 fill:#e1f5ff
style L1D0 fill:#e1f5ff
style L1I1 fill:#e1f5ff
style L1D1 fill:#e1f5ff
style L2_0 fill:#fff4e1
style L2_1 fill:#fff4e1
style L3 fill:#ffe1e1
style MEM fill:#f0e1ff
Each level is larger and slower. L1 is private per core, L2 may be private or shared, L3 is shared among all cores.
Cache Replacement Policies
LRU (Least Recently Used): Evict the cache line that hasn’t been accessed for the longest time. This is effective but expensive to implement for high associativity.
PLRU (Pseudo-LRU): An approximation of LRU that’s cheaper to implement. Uses a tree of bits to track approximate recency.
Random: Evict a random cache line. Surprisingly effective and very simple.
Modern caches often use PLRU or adaptive policies that combine multiple strategies.
8.7 Cache Coherence
In multi-core systems, each core has its own L1 cache. Cache coherence ensures that all cores see a consistent view of memory.
MESI Protocol
MESI is the most common coherence protocol. Each cache line is in one of four states:
- M (Modified): This cache has the only valid copy, and it’s been modified (dirty).
- E (Exclusive): This cache has the only valid copy, and it’s clean (matches memory).
- S (Shared): Multiple caches have valid copies, all clean.
- I (Invalid): This cache line is not valid.
State transitions:
- Read miss: If another cache has the line in M, it writes back to memory and transitions to S. The requesting cache loads the line in S.
- Write: If the line is in S, all other caches invalidate their copies. The writing cache transitions to M.
Figure 8.7: MESI Protocol State Diagram
stateDiagram-v2
[*] --> I
I: Invalid<br/>(No valid copy)
E: Exclusive<br/>(Only copy, clean)
S: Shared<br/>(Multiple copies, clean)
M: Modified<br/>(Only copy, dirty)
I --> E: Read Miss<br/>(No other copy)
I --> S: Read Miss<br/>(Other copies exist)
I --> M: Write Miss
E --> M: Write
E --> S: Other Read
E --> I: Evict
S --> M: Write<br/>(Invalidate others)
S --> I: Evict or<br/>Other Write
M --> S: Other Read<br/>(Write back)
M --> I: Evict<br/>(Write back)
MESI Example:
// Core 0 and Core 1 both access the same cache line
// Initial state: Both caches Invalid (I)
// Core 0: Read from address 0x1000
// Core 0 cache: I → E (exclusive, no other copy)
// Core 1: Read from address 0x1000
// Core 0 cache: E → S (shared)
// Core 1 cache: I → S (shared)
// Core 0: Write to address 0x1000
// Core 0 cache: S → M (modified, dirty)
// Core 1 cache: S → I (invalidated)
// Core 1: Read from address 0x1000
// Core 0 cache: M → S (write back to memory)
// Core 1 cache: I → S (load from memory)
MOESI Protocol
MOESI adds an O (Owned) state: the cache has a dirty copy, but other caches may have shared copies. This reduces write-backs to memory.
Snooping vs Directory-Based Coherence
Snooping: All caches monitor (snoop) a shared bus for memory transactions. When a cache sees a transaction that affects its data, it responds appropriately (invalidate, write-back, etc.).
Snooping is simple but doesn’t scale well beyond ~8-16 cores because bus bandwidth becomes a bottleneck.
Directory-based: A centralized directory tracks which caches have copies of each cache line. Coherence messages are sent point-to-point rather than broadcast.
Directory-based coherence scales better to many cores (64+) and is used in large multi-core processors.
RISC-V Coherence Considerations
RISC-V doesn’t mandate a specific coherence protocol. The RVWMO memory model (Chapter 6) defines the ordering guarantees, but the coherence mechanism is implementation-defined.
Most RISC-V multi-core systems use MESI or MOESI with snooping (for small core counts) or directory-based coherence (for large core counts).
8.8 Comparison with ARM and MIPS OOO Cores
Let’s compare RISC-V OOO implementations with other architectures.
RISC-V BOOM vs ARM Cortex-A76/A78
BOOM (Berkeley Out-of-Order Machine) is an open-source RISC-V OOO core. It has:
- 3-4 issue width
- 128-entry ROB
- 64-entry issue queue
- Advanced branch prediction (TAGE)
- L1 I/D caches, L2 unified cache
ARM Cortex-A76 (2018) is a high-performance mobile core:
- 4-issue width
- 128-entry ROB
- Sophisticated branch prediction
- 64 KB L1 I/D, 256-512 KB L2
Comparison: BOOM and Cortex-A76 are similar in structure. Both use register renaming, ROB, and advanced branch prediction. The main differences are in implementation details (pipeline depth, cache sizes, power optimization).
RISC-V’s simpler ISA (no complex addressing modes, no condition codes) makes the OOO logic slightly simpler than ARM’s, but the difference is small in modern designs.
RISC-V vs MIPS R10000 Pipeline
MIPS R10000 (1996) was one of the first commercial OOO processors:
- 4-issue superscalar
- 32-entry active list (similar to ROB)
- Register renaming with 64 physical registers
- Speculative execution
The R10000 pioneered many techniques still used today. Modern RISC-V OOO cores like BOOM are evolutionary descendants of the R10000 design philosophy.
Key difference: RISC-V has no branch delay slots, making branch misprediction recovery simpler than MIPS.
Microarchitecture Trade-offs
Complexity vs Performance: OOO execution provides 2-3x performance improvement over in-order for general-purpose workloads, but at the cost of 3-5x more transistors and power.
When to use OOO:
- High-performance applications (servers, desktops, high-end mobile)
- Workloads with irregular memory access patterns
- Code with many branches and dependencies
When to use in-order:
- Embedded systems with power/area constraints
- Predictable real-time workloads
- Simple control-dominated code
RISC-V’s flexibility allows both in-order (Rocket, SiFive E/U-series) and OOO (BOOM, SiFive P-series) implementations, making it suitable for a wide range of applications.
🛠️ Hands-on Lab: Lab 8.1 — The Truth Behind Performance Counters
This lab guides you through using RISC-V’s hardware performance counters to measure CPI (Cycles Per Instruction) of the same code under different conditions.
Lab Objectives
- Read
mcycle(Machine Cycle Counter) andminstret(Machine Instructions Retired) - Calculate CPI = Cycles / Instructions
- Observe how different code patterns affect CPI
Concept Explanation
RISC-V provides two key performance counter CSRs:
| CSR | Name | Description |
|---|---|---|
mcycle | Machine Cycle Counter | Clock cycles elapsed since reset |
minstret | Machine Instructions Retired | Instructions completed since reset |
CPI (Cycles Per Instruction) = mcycle / minstret
- CPI = 1.0: Ideal case, one instruction completes per cycle
- CPI > 1.0: Stalls present (cache miss, hazard, etc.)
- CPI < 1.0: Superscalar processor, multiple instructions complete per cycle
Code
Create lab8_perf.c:
// lab8_perf.c - Performance Counter Measurement
#include <stdio.h>
#include <stdint.h>
// Read mcycle
static inline uint64_t read_mcycle(void) {
uint64_t val;
asm volatile("csrr %0, mcycle" : "=r"(val));
return val;
}
// Read minstret
static inline uint64_t read_minstret(void) {
uint64_t val;
asm volatile("csrr %0, minstret" : "=r"(val));
return val;
}
// Test function 1: Simple addition loop (no dependencies)
volatile int result1;
void test_independent(int n) {
int a = 0, b = 0, c = 0, d = 0;
for (int i = 0; i < n; i++) {
a += 1;
b += 2;
c += 3;
d += 4;
}
result1 = a + b + c + d;
}
// Test function 2: Addition loop with dependencies
volatile int result2;
void test_dependent(int n) {
int a = 0;
for (int i = 0; i < n; i++) {
a += 1;
a += a; // depends on previous line's result
a += a; // depends on previous line's result
a += a; // depends on previous line's result
}
result2 = a;
}
void measure(const char *name, void (*func)(int), int n) {
uint64_t cycle_start = read_mcycle();
uint64_t instr_start = read_minstret();
func(n);
uint64_t cycle_end = read_mcycle();
uint64_t instr_end = read_minstret();
uint64_t cycles = cycle_end - cycle_start;
uint64_t instrs = instr_end - instr_start;
// Calculate CPI (multiply by 100 to avoid floating point)
uint64_t cpi_x100 = (cycles * 100) / instrs;
printf("%s:\n", name);
printf(" Cycles: %lu\n", cycles);
printf(" Instructions: %lu\n", instrs);
printf(" CPI: %lu.%02lu\n", cpi_x100 / 100, cpi_x100 % 100);
}
int main() {
int n = 100000;
printf("=== Performance Counter Lab ===\n\n");
measure("Independent Operations", test_independent, n);
printf("\n");
measure("Dependent Operations", test_dependent, n);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -O2 -o lab8_perf lab8_perf.c
# Run (requires M-mode access to mcycle/minstret)
qemu-riscv64 lab8_perf
Expected Output (values vary by implementation):
=== Performance Counter Lab ===
Independent Operations:
Cycles: 500123
Instructions: 700045
CPI: 0.71
Dependent Operations:
Cycles: 1200456
Instructions: 800089
CPI: 1.50
What You Just Did
- Independent Operations: Four parallel additions can be executed simultaneously by OOO processors, resulting in CPI < 1
- Dependent Operations: Each addition depends on the previous result, forcing sequential execution, resulting in CPI > 1
danieRTOS Reference: The scheduler uses similar performance measurement techniques to profile task execution time and optimize scheduling decisions.
⚠️ Common Pitfalls
Pitfall 1: Thinking Out-of-Order Is Always Better
Misconception: “OoO processors are always faster than In-Order!”
Correct Understanding:
- OoO has heavier penalties on Branch Misprediction (more instructions to flush)
- OoO may not be optimal in power-constrained scenarios (phones, IoT)
- For highly parallel workloads (GPU-like), simple In-Order cores are often more efficient
Pitfall 2: Ignoring Spectre/Meltdown Risks
Misconception: “Speculative Execution is just a performance optimization with no side effects.”
Correct Understanding:
- Spectre and Meltdown attacks exploit side-channel effects of Speculative Execution
- Even when mispredicted instructions are cancelled, their effects on Cache remain
- Attackers can infer secret data by measuring Cache access timing
// Simplified Spectre attack concept
if (x < array1_size) { // Bounds check
y = array2[array1[x] * 256]; // If x is out of bounds, this shouldn't execute
}
// But Speculative Execution may "execute first, ask questions later"
// Even after discovering x is out of bounds and cancelling,
// array2's cache state has leaked array1[x]'s value
Pitfall 3: Confusing mcycle with time
Error Scenario: Using mcycle to measure “real time.”
Correct Understanding:
mcycle: CPU clock cycles, tied to CPU frequencytime: Real time (usually from RTC or Timer)- If CPU frequency changes dynamically (DVFS),
mcyclecannot be directly converted to time
// ❌ Wrong: Assuming mcycle equals time
uint64_t start = read_mcycle();
do_something();
uint64_t elapsed_ns = (read_mcycle() - start) * 1000000000 / CPU_FREQ;
// If CPU frequency changed, this calculation is wrong
// ✅ Correct: Use time CSR or SBI timer
uint64_t start = read_time();
do_something();
uint64_t elapsed_ns = (read_time() - start) * 1000000000 / TIMER_FREQ;
Summary
In this chapter, we explored advanced microarchitecture techniques for high-performance RISC-V processors:
- Out-of-order execution: Dynamic scheduling to extract instruction-level parallelism.
- Register renaming: Eliminating false dependencies (WAR, WAW) with physical registers.
- Reorder buffer: Ensuring in-order commit for precise exceptions.
- Load/store queues: Memory disambiguation and store-to-load forwarding.
- Advanced branch prediction: TAGE, RAS, indirect branch prediction.
- Cache hierarchy: L1/L2/L3 caches with LRU/PLRU replacement.
- Cache coherence: MESI/MOESI protocols for multi-core consistency.
- Comparisons: RISC-V BOOM vs ARM Cortex-A76 and MIPS R10000.
RISC-V’s clean ISA makes it an excellent target for both simple in-order cores and complex OOO designs. The architecture doesn’t impose unnecessary constraints, allowing microarchitects to innovate freely.
In the next chapter, we’ll shift focus from hardware to software, exploring how RISC-V systems boot and how firmware and operating systems interact with the hardware.