Chapter 16. Performance Counters & PMU
Part IX — Performance, Debug & Tools
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Read Performance Counters: Use csrr to read the cycle and instret CSRs
- Calculate IPC Metrics: Understand the meaning and formula for Instructions Per Cycle
- Identify Performance Bottlenecks: Distinguish characteristics of Compute-bound vs Memory-bound programs
💡 Scenario: The CPU’s Indigestion
Scene: Junior comes running to Senior with a data sheet.
Junior: “Senior, look! I unrolled this loop, which made more instructions, but the execution time actually got shorter. That doesn’t make sense! More instructions should mean slower, right?”
Senior: “That’s a common rookie mistake—only looking at ‘food quantity’ (Instruction Count), not ‘digestion speed’ (IPC).
A CPU is like a hot dog eating contest competitor:
| Concept | Analogy |
|---|---|
| Cycle | Contest time (seconds) |
| Instret (Instructions) | Number of hot dogs eaten |
| IPC (Instructions Per Cycle) | Swallowing speed |
Formula: IPC = Instret / Cycle”
Junior: “So after I unrolled the loop, even though there are more hot dogs, they’re being swallowed faster?”
Senior: “Exactly. Your previous code probably had ‘Data Dependencies’—the previous bite wasn’t swallowed yet, so the next bite couldn’t go in, causing the competitor to just stand there dazed (Pipeline Stall), resulting in low IPC.
After loop unrolling, instructions don’t interfere with each other, so the CPU can swallow several at once (Pipeline filled), and IPC goes up. So even though total instruction count increased, because swallowing is fast enough, total time actually decreased.”
Junior: “I see! So higher IPC is always better?”
Senior: “Not necessarily. If you just have them drink water (execute nop), they can swallow super fast (high IPC), but they’re not actually eating anything (no useful work). So when looking at performance, we must look at Cycle Count and IPC together.”
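Junior's observation can be checked with simple arithmetic, since cycles = instructions / IPC. The sketch below is only an illustration: the helper name cycles_needed and the numbers in the usage note are made up for this example and are not measurements.

```c
#include <stdint.h>

/* Cycles needed to retire `instret` instructions at an average IPC.
 * Returns 0.0 when ipc is not positive (hypothetical helper). */
static double cycles_needed(uint64_t instret, double ipc)
{
    if (ipc <= 0.0) {
        return 0.0;
    }
    return (double)instret / ipc;
}
```

For instance, a rolled loop retiring 1000 instructions at IPC 0.5 needs cycles_needed(1000, 0.5) = 2000 cycles, while an unrolled version retiring 1200 instructions at IPC 1.0 needs only 1200 cycles: more hot dogs, less contest time.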
Performance optimization requires measurement. When a program runs slowly, we need to know why: cache misses may dominate execution time, branch mispredictions may cause pipeline stalls, or memory bandwidth may limit throughput. Performance counters turn vague slowness into quantifiable bottlenecks.
RISC-V provides performance monitoring through a set of hardware performance counters, collectively referred to here as the Performance Monitoring Unit (PMU). These counters track events like cycles executed, instructions retired, cache hits and misses, branch predictions, and TLB accesses. The basic counters (cycle, instret, time) provide fundamental metrics; recent versions of the ISA specification group them into the Zicntr extension, while the privileged specification requires mcycle and minstret in M-mode. Hardware performance counters (mhpmcounter3-31) track implementation-specific events, and an implementation may hardwire any of them to read-only zero. Together, these counters enable profiling, bottleneck identification, and performance analysis.
This chapter explores RISC-V performance counters and the PMU. We’ll examine the counter architecture, basic counters, hardware performance counters, performance events, profiling techniques, and how RISC-V compares to ARM’s PMU.
16.1 Performance Counter Architecture
Performance Monitoring Overview
Performance monitoring answers questions like:
- How many cycles did this function take?
- What is the IPC (instructions per cycle)?
- How many cache misses occurred?
- How many branches were mispredicted?
- Where is the performance bottleneck?
RISC-V performance counters provide hardware-based measurement with minimal overhead. Counters increment automatically on specific events, allowing precise measurement without software instrumentation.
Counter CSRs
RISC-V defines three groups of performance counter CSRs:
Machine-mode counters (M-mode only):
- mcycle: Machine cycle counter
- minstret: Machine instructions-retired counter
- mhpmcounter3-31: Machine hardware performance counters (29 counters)
Supervisor/User-mode counters (readable from S/U-mode):
- cycle: Cycle counter (shadow of mcycle)
- instret: Instructions-retired counter (shadow of minstret)
- hpmcounter3-31: Hardware performance counters (shadow of mhpmcounter)
Time counter:
time: Real-time counter (wall-clock time)
For RV32, each counter has a high-word CSR (e.g., mcycleh, cycleh) for 64-bit values.
Counter Privilege Levels
Counters are accessible based on privilege:
M-mode: Can read/write all counters (mcycle, minstret, mhpmcounter)
S-mode: Can read cycle, instret, hpmcounter (if enabled in mcounteren)
U-mode: Can read cycle, instret, hpmcounter (if enabled in both mcounteren and scounteren)
Access control via mcounteren and scounteren:
// M-mode: let the next-lower privilege level (S-mode) read cycle and instret
uint32_t mcounteren_val = (1u << 0) | (1u << 2); // CY, IR
write_csr(mcounteren, mcounteren_val);
// S-mode: additionally let U-mode read cycle and instret
uint32_t scounteren_val = (1u << 0) | (1u << 2); // CY, IR
write_csr(scounteren, scounteren_val);
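The two-level gating can be modeled without touching any CSR. The following is a minimal sketch, assuming a hart that implements S-mode; the macro and function names (COUNTEREN_CY, counteren_hpm, readable_in_u_mode, and so on) are hypothetical helpers for this illustration, not part of any standard header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions shared by mcounteren and scounteren. */
#define COUNTEREN_CY (UINT32_C(1) << 0) /* cycle   */
#define COUNTEREN_TM (UINT32_C(1) << 1) /* time    */
#define COUNTEREN_IR (UINT32_C(1) << 2) /* instret */

/* Enable bit for hpmcounterN; valid for N in 3..31, otherwise 0. */
static uint32_t counteren_hpm(unsigned n)
{
    return (n >= 3 && n <= 31) ? (UINT32_C(1) << n) : 0;
}

/* S-mode may read a counter only if its bit is set in mcounteren. */
static bool readable_in_s_mode(uint32_t mcounteren, uint32_t bit)
{
    return (mcounteren & bit) != 0;
}

/* U-mode may read a counter only if its bit is set in BOTH registers
 * (assumption: the hart implements S-mode). */
static bool readable_in_u_mode(uint32_t mcounteren, uint32_t scounteren,
                               uint32_t bit)
{
    return (mcounteren & scounteren & bit) != 0;
}
```

With mcounteren = CY | IR but scounteren = 0, readable_in_s_mode reports true for cycle while readable_in_u_mode reports false, which is why a user program can still trap on a counter read after firmware has enabled the counter.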
Counter Inhibit
Counters can be inhibited (stopped) via mcountinhibit:
// Stop cycle and instret counters (bit 1 is reserved: time cannot be inhibited)
uint32_t inhibit = (1u << 0) | (1u << 2); // CY, IR
write_csr(mcountinhibit, inhibit);
// Resume counters
write_csr(mcountinhibit, 0);
This is useful for:
- Measuring specific code regions
- Reducing power consumption
- Preventing counter overflow
16.2 Basic Performance Counters
mcycle / cycle (Cycle Counter)
The cycle counter tracks the number of clock cycles executed by the hart:
// Read cycle counter
uint64_t start = read_csr(cycle);
// ... code to measure ...
uint64_t end = read_csr(cycle);
uint64_t cycles = end - start;
printf("Cycles: %llu\n", (unsigned long long)cycles); // cast: uint64_t is unsigned long on RV64
For RV32, use cycleh for the high 32 bits:
// RV32: Read 64-bit cycle counter
uint64_t read_cycle_rv32(void) {
uint32_t hi, lo, hi2;
do {
hi = read_csr(cycleh);
lo = read_csr(cycle);
hi2 = read_csr(cycleh);
} while (hi != hi2); // Retry if high word changed
return ((uint64_t)hi << 32) | lo;
}
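Why the retry loop matters can be seen in a simulation that runs on any machine. The sketch below is not hardware access: sim_counter and the sim_* functions are hypothetical stand-ins for the two CSR halves, and the simulated counter advances by a fixed step on every half-word read to mimic a counter that keeps running between reads.

```c
#include <stdint.h>

/* Simulated 64-bit counter exposed as two 32-bit halves. */
struct sim_counter {
    uint64_t value; /* current count */
    uint64_t step;  /* ticks that elapse per half-word read */
};

static uint32_t sim_read_lo(struct sim_counter *c)
{
    uint32_t lo = (uint32_t)c->value;
    c->value += c->step;
    return lo;
}

static uint32_t sim_read_hi(struct sim_counter *c)
{
    uint32_t hi = (uint32_t)(c->value >> 32);
    c->value += c->step;
    return hi;
}

/* Same hi/lo/hi retry pattern as read_cycle_rv32(). */
static uint64_t sim_safe_read(uint64_t start, uint64_t step)
{
    struct sim_counter c = { start, step };
    uint32_t hi, lo, hi2;
    do {
        hi  = sim_read_hi(&c);
        lo  = sim_read_lo(&c);
        hi2 = sim_read_hi(&c);
    } while (hi != hi2); /* high word changed: low word may be stale */
    return ((uint64_t)hi << 32) | lo;
}

/* Naive hi-then-lo read with no retry, for comparison. */
static uint64_t sim_naive_read(uint64_t start, uint64_t step)
{
    struct sim_counter c = { start, step };
    uint32_t hi = sim_read_hi(&c);
    uint32_t lo = sim_read_lo(&c);
    return ((uint64_t)hi << 32) | lo;
}
```

Starting at 0xFFFFFFFF with a step of 1, the naive read samples the high word before the carry and the low word after it, returning 0 (off by about 2^32). The retry version notices the changed high word, reads again, and returns 0x100000003.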
minstret / instret (Instructions Retired Counter)
The instructions-retired counter tracks the number of instructions completed:
// Read instret counter
uint64_t start = read_csr(instret);
// ... code to measure ...
uint64_t end = read_csr(instret);
uint64_t instructions = end - start;
printf("Instructions: %llu\n", (unsigned long long)instructions);
IPC Calculation
Combining cycle and instret gives IPC (instructions per cycle):
// Measure IPC
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);
// ... code to measure ...
uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);
uint64_t cycles = cycles_end - cycles_start;
uint64_t instructions = instret_end - instret_start;
double ipc = (double)instructions / cycles;
printf("IPC: %.2f\n", ipc);
IPC interpretation:
- IPC close to 1: Good utilization on a single-issue in-order core
- IPC > 1: Superscalar execution (possible on any multi-issue core, in-order or out-of-order)
- IPC < 1: Pipeline stalls (cache misses, branch mispredicts, etc.)
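When a measured region is very short, or the counters are inhibited, one of the deltas can be zero. A small sketch of IPC and its reciprocal CPI with that case guarded is shown below; the helper names ipc_of and cpi_of are assumptions made for this example.

```c
#include <stdint.h>

/* IPC over a measured interval; 0.0 when no cycles elapsed. */
static double ipc_of(uint64_t instret_delta, uint64_t cycle_delta)
{
    return cycle_delta ? (double)instret_delta / (double)cycle_delta : 0.0;
}

/* CPI (cycles per instruction); 0.0 when nothing retired. */
static double cpi_of(uint64_t instret_delta, uint64_t cycle_delta)
{
    return instret_delta ? (double)cycle_delta / (double)instret_delta : 0.0;
}
```

For 500,000 instructions retired in 1,000,000 cycles, ipc_of gives 0.5 and cpi_of gives 2.0. CPI is often the more convenient form when attributing lost cycles to individual stall sources.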
time (Real-Time Counter)
The time counter provides wall-clock time:
// Read time counter
uint64_t start_time = read_csr(time);
// ... code to measure ...
uint64_t end_time = read_csr(time);
uint64_t elapsed = end_time - start_time;
// With a 1 MHz time counter (an assumption), one tick equals one microsecond,
// so the raw delta can be printed directly
printf("Elapsed time: %llu us\n", (unsigned long long)elapsed);
The time counter frequency is platform-specific (typically 1 MHz or 10 MHz; Linux platforms report it through the device tree timebase-frequency property). It’s useful for:
- Wall-clock timing
- Timeout implementation
- Real-time scheduling
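Because the tick rate differs between platforms, a time delta usually has to be scaled before it is reported. Here is a minimal conversion sketch; the function name ticks_to_us is hypothetical, and the timebase is passed in by the caller since it must be obtained from the platform.

```c
#include <stdint.h>

/* Convert a `time` CSR delta to microseconds (truncating).
 * Splitting into whole seconds plus remainder keeps the multiply by
 * 1,000,000 from overflowing for realistic timebase values.
 * Returns 0 when timebase_hz is 0. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t timebase_hz)
{
    if (timebase_hz == 0) {
        return 0;
    }
    uint64_t secs = ticks / timebase_hz;
    uint64_t rem  = ticks % timebase_hz;
    return secs * UINT64_C(1000000) + (rem * UINT64_C(1000000)) / timebase_hz;
}
```

At a 10 MHz timebase, 10,000,000 ticks convert to 1,000,000 us, and 25 ticks (2.5 us) truncate to 2 us.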
Difference: cycle vs time
- cycle: Counts CPU clock cycles (may stop while the hart sleeps in WFI, and its rate varies with frequency scaling)
- time: Counts real time (continues during sleep, constant frequency)
// Example: Measure sleep overhead
uint64_t cycles_before = read_csr(cycle);
uint64_t time_before = read_csr(time);
wfi(); // Sleep until interrupt
uint64_t cycles_after = read_csr(cycle);
uint64_t time_after = read_csr(time);
printf("Cycles during sleep: %llu\n", (unsigned long long)(cycles_after - cycles_before)); // ~0 if the core gates its clock in WFI
printf("Time during sleep: %llu\n", (unsigned long long)(time_after - time_before)); // > 0
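Reading both counters over the same interval also gives an estimate of the average core clock, which makes frequency scaling or clock gating visible. The sketch below assumes the caller knows the timebase frequency; avg_core_hz is a hypothetical helper name.

```c
#include <stdint.h>

/* Average core clock in Hz over an interval:
 * cycles elapsed per second of `time`. Returns 0.0 if no time elapsed. */
static double avg_core_hz(uint64_t cycle_delta, uint64_t time_delta,
                          uint64_t timebase_hz)
{
    if (time_delta == 0) {
        return 0.0;
    }
    return (double)cycle_delta * (double)timebase_hz / (double)time_delta;
}
```

If 1,200,000,000 cycles elapse during 10,000,000 ticks of a 10 MHz timebase (one second), the estimate is about 1.2 GHz. A result far below the nominal clock suggests that the core spent part of the interval asleep or at a reduced frequency.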
16.3 Hardware Performance Counters
mhpmcounter3-31 (Hardware Performance Counters)
RISC-V provides up to 29 hardware performance counters (HPM counters) for tracking implementation-specific events. These counters are optional—implementations may provide 0 to 29 counters.
Counter CSRs:
- mhpmcounter3-31: M-mode counters (29 counters)
- hpmcounter3-31: S/U-mode readable counters (shadows of mhpmcounter)
- mhpmevent3-31: Event selection registers
Event Selection (mhpmevent CSRs)
Each HPM counter has an associated event selector:
// Configure mhpmcounter3 to count L1 I-cache misses
write_csr(mhpmevent3, EVENT_L1_ICACHE_MISS);
// Reset counter
write_csr(mhpmcounter3, 0);
// ... code to measure ...
// Read counter
uint64_t icache_misses = read_csr(mhpmcounter3);
printf("L1 I-cache misses: %llu\n", (unsigned long long)icache_misses);
Event codes are implementation-specific. Common events include:
- Cache events (L1/L2 hits, misses)
- Branch events (taken, not-taken, mispredicted)
- Pipeline events (stalls, flushes)
- Memory events (loads, stores, TLB misses)
Counter Overflow Handling
Counters are 64-bit and rarely overflow. If overflow is a concern, note that unsigned subtraction already handles a single wraparound:
// Delta across a possible wraparound
uint64_t start = read_csr(mhpmcounter3);
// ... code ...
uint64_t end = read_csr(mhpmcounter3);
// Unsigned arithmetic is modulo 2^64, so this is correct even when
// end < start because the counter wrapped once in between
uint64_t count = end - start;
Overflow interrupts are standardized by the optional Sscofpmf extension; on implementations without it, overflow interrupt support is implementation-specific.
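The wraparound arithmetic can be verified with concrete values on any machine. This is a minimal sketch; counter_delta is a hypothetical helper name.

```c
#include <stdint.h>

/* Events between two reads of a 64-bit counter.
 * Unsigned subtraction is modulo 2^64, so the result is correct
 * even if the counter wrapped once between the reads. */
static uint64_t counter_delta(uint64_t start, uint64_t end)
{
    return end - start;
}
```

A counter read at UINT64_MAX - 4 and later at 5 has advanced by 10 events, and counter_delta returns exactly that without any special case. Only a second full wrap between the two reads would go undetected.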
Example: Multi-Counter Measurement
Measuring multiple events simultaneously:
// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
write_csr(mhpmevent4, EVENT_L2_CACHE_MISS);
write_csr(mhpmevent5, EVENT_BRANCH_MISPREDICT);
// Reset counters
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
write_csr(mhpmcounter5, 0);
// Measure code
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);
// ... code to measure ...
uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);
uint64_t l1_misses = read_csr(mhpmcounter3);
uint64_t l2_misses = read_csr(mhpmcounter4);
uint64_t branch_mispredicts = read_csr(mhpmcounter5);
// Report
printf("Cycles: %llu\n", (unsigned long long)(cycles_end - cycles_start));
printf("Instructions: %llu\n", (unsigned long long)(instret_end - instret_start));
printf("L1 D-cache misses: %llu\n", (unsigned long long)l1_misses);
printf("L2 cache misses: %llu\n", (unsigned long long)l2_misses);
printf("Branch mispredicts: %llu\n", (unsigned long long)branch_mispredicts);
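Raw event counts are hard to compare between runs of different length. One common normalization is events per thousand retired instructions (for cache misses, often called MPKI). The helper name per_kilo_instr below is an assumption for this sketch.

```c
#include <stdint.h>

/* Events per 1000 retired instructions; 0.0 when nothing retired. */
static double per_kilo_instr(uint64_t events, uint64_t instret_delta)
{
    if (instret_delta == 0) {
        return 0.0;
    }
    return 1000.0 * (double)events / (double)instret_delta;
}
```

For example, 50,000 L1 D-cache misses over 500,000 instructions is 100 misses per kilo-instruction. The same function applies unchanged to the L2 miss and branch mispredict counts collected above.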
16.4 Performance Events
Cache Events
Cache events track memory hierarchy performance:
L1 Instruction Cache:
- L1 I-cache access
- L1 I-cache miss
- L1 I-cache hit
L1 Data Cache:
- L1 D-cache access
- L1 D-cache miss
- L1 D-cache hit
- L1 D-cache writeback
L2 Cache:
- L2 cache access
- L2 cache miss
- L2 cache hit
Example: Measure cache miss rate:
// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_ACCESS);
write_csr(mhpmevent4, EVENT_L1_DCACHE_MISS);
// Reset and measure
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code ...
uint64_t accesses = read_csr(mhpmcounter3);
uint64_t misses = read_csr(mhpmcounter4);
double miss_rate = accesses ? (double)misses / accesses * 100.0 : 0.0; // guard against zero accesses
printf("L1 D-cache miss rate: %.2f%%\n", miss_rate);
Branch Events
Branch events track control flow performance:
Branch Types:
- Branch instructions executed
- Branch taken
- Branch not taken
Branch Prediction:
- Branch mispredicted
- Branch correctly predicted
Example: Measure branch prediction accuracy:
write_csr(mhpmevent3, EVENT_BRANCH_EXECUTED);
write_csr(mhpmevent4, EVENT_BRANCH_MISPREDICT);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code with branches ...
uint64_t branches = read_csr(mhpmcounter3);
uint64_t mispredicts = read_csr(mhpmcounter4);
double accuracy = (1.0 - (double)mispredicts / branches) * 100.0;
printf("Branch prediction accuracy: %.2f%%\n", accuracy);
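Prediction accuracy alone does not say how much time the mispredictions cost. A rough estimate multiplies the mispredict count by a per-mispredict penalty. In the sketch below the penalty is an assumption supplied by the caller (it depends on pipeline depth and is not reported by the counters), and mispredict_cycle_share is a hypothetical helper name.

```c
#include <stdint.h>

/* Rough percentage of cycles lost to branch mispredictions.
 * penalty_cycles is an assumed flush cost per mispredict.
 * Returns 0.0 when no cycles elapsed. */
static double mispredict_cycle_share(uint64_t mispredicts,
                                     unsigned penalty_cycles,
                                     uint64_t cycle_delta)
{
    if (cycle_delta == 0) {
        return 0.0;
    }
    return 100.0 * (double)mispredicts * (double)penalty_cycles
           / (double)cycle_delta;
}
```

With 10,000 mispredicts, an assumed 10-cycle penalty, and 1,000,000 measured cycles, the estimate is 10% of execution time. Treat this as an upper-level hint only, since overlapping stalls can make the real cost smaller.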
Pipeline Events
Pipeline events track execution efficiency:
Stalls:
- Pipeline stall cycles
- Load-use stall
- Store buffer full stall
Flushes:
- Pipeline flush (branch mispredict, exception)
- I-cache flush
- D-cache flush
Example: Identify stall sources:
write_csr(mhpmevent3, EVENT_PIPELINE_STALL);
write_csr(mhpmevent4, EVENT_LOAD_USE_STALL);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code ...
uint64_t total_stalls = read_csr(mhpmcounter3);
uint64_t load_use_stalls = read_csr(mhpmcounter4);
printf("Total stall cycles: %llu\n", (unsigned long long)total_stalls);
printf("Load-use stalls: %llu (%.1f%%)\n",
       (unsigned long long)load_use_stalls,
       total_stalls ? (double)load_use_stalls / total_stalls * 100.0 : 0.0);
Memory Events
Memory events track memory system activity:
Memory Operations:
- Load instructions
- Store instructions
- Atomic instructions
TLB Events:
- TLB access
- TLB miss (I-TLB, D-TLB)
- Page table walk
Example: Measure TLB performance:
write_csr(mhpmevent3, EVENT_DTLB_ACCESS);
write_csr(mhpmevent4, EVENT_DTLB_MISS);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code with memory accesses ...
uint64_t tlb_accesses = read_csr(mhpmcounter3);
uint64_t tlb_misses = read_csr(mhpmcounter4);
double tlb_miss_rate = (double)tlb_misses / tlb_accesses * 100.0;
printf("D-TLB miss rate: %.2f%%\n", tlb_miss_rate);
16.5 Profiling and Analysis
perf Tool for RISC-V
The Linux perf tool supports RISC-V performance counters:
# Count cycles and instructions
perf stat -e cycles,instructions ./my_program
# Sample on cycles (profiling)
perf record -e cycles ./my_program
perf report
# Count cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program
# Count branch mispredictions
perf stat -e branch-misses,branches ./my_program
PMU Programming
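Machine-mode PMU programming (the mcounteren, mhpmevent, and mhpmcounter CSRs are accessible only in M-mode, so an S-mode kernel such as Linux requests this through the SBI PMU extension rather than writing them directly):
// M-mode firmware: configure PMU for profiling
void setup_pmu_profiling(void) {
    // Allow lower privilege levels to read cycle, time, and instret
    write_csr(mcounteren, 0x7); // CY, TM, IR
    // Configure HPM counter for L1 D-cache misses
    write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
    write_csr(mhpmcounter3, 0);
    // Also expose hpmcounter3 (U-mode access additionally needs scounteren)
    uint32_t counteren = read_csr(mcounteren);
    counteren |= (1u << 3); // HPM3
    write_csr(mcounteren, counteren);
}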
Event Sampling
Sampling-based profiling collects periodic samples:
// Pseudo-code: Sample-based profiling
void pmu_interrupt_handler(void) {
// Read PC where interrupt occurred
uint64_t pc = read_csr(mepc);
// Record sample
record_sample(pc);
// Reset counter for next sample
write_csr(mhpmcounter3, -SAMPLE_PERIOD);
}
// Setup sampling
void setup_sampling(void) {
// Configure counter to overflow after SAMPLE_PERIOD events
write_csr(mhpmevent3, EVENT_CYCLES);
write_csr(mhpmcounter3, -SAMPLE_PERIOD);
// Enable overflow interrupt (implementation-specific)
enable_pmu_interrupt();
}
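The value written by write_csr(mhpmcounter3, -SAMPLE_PERIOD) relies on two's-complement wraparound: the counter is preloaded so that it reaches zero after exactly one sample period. A small sketch of that arithmetic follows; sample_preload and events_until_overflow are hypothetical helper names.

```c
#include <stdint.h>

/* Value to preload so the counter overflows after `period` more events.
 * Equivalent to 2^64 - period in unsigned arithmetic. */
static uint64_t sample_preload(uint64_t period)
{
    return (uint64_t)0 - period;
}

/* Events remaining until a counter currently at `value` wraps to 0.
 * Note: a counter at exactly 0 needs a full 2^64 events, which this
 * function cannot represent and reports as 0. */
static uint64_t events_until_overflow(uint64_t value)
{
    return (uint64_t)0 - value;
}
```

For a period of 1,000,000 the preload is UINT64_MAX - 999,999, and events_until_overflow of that value gives back 1,000,000.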
Performance Analysis Techniques
Top-down analysis:
- Measure overall IPC
- If IPC is low, identify bottleneck:
  - Cache misses? → Optimize data layout
  - Branch mispredicts? → Improve branch predictability
  - Pipeline stalls? → Reduce dependencies
Hotspot analysis:
- Use sampling to find hot functions
- Measure counters for hot functions
- Optimize based on counter data
Comparative analysis:
- Measure before optimization
- Apply optimization
- Measure after optimization
- Compare counter values
Example workflow:
# Before optimization
perf stat -e cycles,instructions,L1-dcache-load-misses ./program
# Cycles: 1000000, Instructions: 500000, IPC: 0.5, Misses: 50000
# After optimization (improved data locality)
perf stat -e cycles,instructions,L1-dcache-load-misses ./program_opt
# Cycles: 600000, Instructions: 500000, IPC: 0.83, Misses: 10000
# Result: 40% fewer cycles (about 1.67x speedup), 80% reduction in cache misses
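When comparing before and after numbers, it helps to keep "percent reduction" and "speedup factor" apart, since they are different quantities. The following sketch computes both from the counter values in the workflow; the helper names speedup and percent_reduction are assumptions for this example.

```c
#include <stdint.h>

/* Speedup factor: how many times faster the optimized run is.
 * Returns 0.0 when cycles_after is 0. */
static double speedup(uint64_t cycles_before, uint64_t cycles_after)
{
    return cycles_after ? (double)cycles_before / (double)cycles_after : 0.0;
}

/* Percent reduction of a counter value relative to the baseline.
 * Returns 0.0 when the baseline is 0. */
static double percent_reduction(uint64_t before, uint64_t after)
{
    if (before == 0) {
        return 0.0;
    }
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

Going from 1,000,000 to 600,000 cycles is a 40% reduction in cycles but a speedup of roughly 1.67x, and going from 50,000 to 10,000 misses is an 80% reduction.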
16.6 Comparison with ARM PMU
RISC-V Counters vs ARM PMU
ARM provides a Performance Monitoring Unit (PMU) with similar capabilities. Comparison:
| Feature | RISC-V PMU | ARM PMU |
|---|---|---|
| Basic Counters | cycle, instret, time | PMCCNTR (cycle); instructions retired only via an event counter |
| HPM Counters | mhpmcounter3-31 (up to 29) | PMEVCNTRn (typically 6-8) |
| Event Selection | mhpmevent3-31 | PMEVTYPER (event type) |
| Counter Width | 64-bit | 32-bit or 64-bit (ARMv8) |
| Overflow | Optional Sscofpmf extension, otherwise implementation-specific | Overflow interrupt (PMOVSCLR) |
| Access Control | mcounteren, scounteren | PMUSERENR (user enable) |
| Counter Inhibit | mcountinhibit | PMCNTENSET/CLR (enable/disable) |
| Privilege Levels | M/S/U modes | EL0/EL1/EL2/EL3 |
Event Mapping
Common events mapped between architectures:
| Event | RISC-V | ARM |
|---|---|---|
| Cycles | cycle CSR | PMCCNTR |
| Instructions | instret CSR | 0x08 (INST_RETIRED, on an event counter) |
| L1 I-cache miss | Implementation-specific | 0x01 |
| L1 D-cache miss | Implementation-specific | 0x03 |
| L2 cache miss | Implementation-specific | 0x17 |
| Branch mispredict | Implementation-specific | 0x10 |
| Branch executed | Implementation-specific | 0x0C |
| TLB miss | Implementation-specific | 0x02 (I-TLB), 0x05 (D-TLB) |
ARM event codes are standardized (ARM Architecture Reference Manual), while RISC-V event codes are implementation-specific.
Profiling Tool Comparison
Both architectures support standard profiling tools:
RISC-V:
# perf on RISC-V Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report
ARM:
# perf on ARM Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report
The perf tool abstracts architecture differences, providing a consistent interface.
Practical Differences
RISC-V advantages:
- 64-bit counters (no overflow on long runs)
- Dedicated instret counter (ARM uses a general event counter programmed with INST_RETIRED)
- Up to 29 HPM counters (ARM typically 6-8)
- Simpler privilege model
ARM advantages:
- Standardized event codes (portable across implementations)
- Mature PMU infrastructure
- Overflow interrupts (standard)
- Extensive tool support
Example: Measuring IPC
RISC-V:
uint64_t cycles = read_csr(cycle);
uint64_t instret = read_csr(instret);
double ipc = (double)instret / cycles;
ARM (requires configuring an event counter):
uint64_t cycles = read_pmccntr();
// No dedicated instret counter: program a PMU event counter with INST_RETIRED (0x08)
uint64_t instret = read_pmevcntr(0); // Configured for instruction count
double ipc = (double)instret / cycles;
RISC-V’s dedicated instret counter simplifies IPC measurement.
Implementation Examples
RISC-V:
- SiFive U74: 2 HPM counters (L1 cache events)
- SiFive P550: 6 HPM counters (cache, branch, TLB events)
- Alibaba XuanTie C910: 4 HPM counters
ARM:
- Cortex-A53: 6 PMU counters
- Cortex-A72: 6 PMU counters
- Cortex-A76: 6 PMU counters
- Neoverse N1: 6 PMU counters
RISC-V implementations vary widely in HPM counter count. ARM implementations are more consistent (typically 6 counters).
🛠️ Hands-on Lab: Lab 16.1 — The CPU’s EKG (Measuring IPC)
This lab demonstrates how to read hardware performance counters and calculate IPC.
⚠️ Important Warning: In QEMU TCG mode or Spike, cycle usually just follows instret (IPC ≈ 1), which doesn’t reflect real hardware pipeline behavior. Run on real hardware to observe significant differences.
Lab Objectives
- Implement C functions to read cycle and instret
- Design two workloads: High dependency (low IPC) vs High parallelism (high IPC)
- Calculate and print IPC
Code (pmu_lab.c)
#include <stdio.h>
#include <stdint.h>
// ---------------------------------------------------------
// Helper Functions: Read CSRs
// ---------------------------------------------------------
static inline uint64_t read_cycle() {
uint64_t val;
asm volatile ("csrr %0, cycle" : "=r" (val));
return val;
}
static inline uint64_t read_instret() {
uint64_t val;
asm volatile ("csrr %0, instret" : "=r" (val));
return val;
}
// ---------------------------------------------------------
// Workload 1: High Dependency (Low IPC)
// ---------------------------------------------------------
void workload_dependency(int iters) {
volatile int a = 1;
for (int i = 0; i < iters; i++) {
// Each add must wait for previous to complete
asm volatile (
"add %0, %0, %0 \n"
"add %0, %0, %0 \n"
"add %0, %0, %0 \n"
: "+r" (a)
);
}
}
// ---------------------------------------------------------
// Workload 2: Independent (High IPC)
// ---------------------------------------------------------
void workload_independent(int iters) {
volatile int a = 1, b = 2, c = 3;
for (int i = 0; i < iters; i++) {
// Instructions are independent, CPU can issue simultaneously
asm volatile (
"add %0, %0, %0 \n"
"add %1, %1, %1 \n"
"add %2, %2, %2 \n"
: "+r" (a), "+r" (b), "+r" (c)
);
}
}
// ---------------------------------------------------------
// Measurement Function
// ---------------------------------------------------------
void measure(const char* name, void (*func)(int), int iters) {
uint64_t start_c = read_cycle();
uint64_t start_i = read_instret();
func(iters);
uint64_t end_c = read_cycle();
uint64_t end_i = read_instret();
uint64_t delta_c = end_c - start_c;
uint64_t delta_i = end_i - start_i;
double ipc = (double)delta_i / delta_c;
printf("[%s]\n", name);
printf(" Cycles : %lu\n", delta_c);
printf(" Instrs : %lu\n", delta_i);
printf(" IPC : %.2f\n\n", ipc);
}
int main() {
printf("=== RISC-V PMU Demo ===\n");
printf("Warning: On QEMU/Spike, IPC is simulated as ~1.0\n");
printf("Run on real hardware for accurate results.\n\n");
int iters = 100000;
measure("Dependent Workload", workload_dependency, iters);
measure("Independent Workload", workload_independent, iters);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -O0 -o pmu_lab pmu_lab.c
# Run on Spike (simulated, IPC ≈ 1)
spike pk pmu_lab
# On real hardware, expect roughly (actual values depend on the microarchitecture):
# - Dependent Workload: IPC ≈ 0.3-0.5 (stalls)
# - Independent Workload: IPC ≈ 1.5-2.0 (parallel)
Expected Output (Real Hardware)
=== RISC-V PMU Demo ===
Warning: On QEMU/Spike, IPC is simulated as ~1.0
Run on real hardware for accurate results.
[Dependent Workload]
Cycles : 1200000
Instrs : 400000
IPC : 0.33
[Independent Workload]
Cycles : 240000
Instrs : 400000
IPC : 1.67
danieRTOS Reference: danieRTOS uses cycle counters in its scheduler to measure context switch overhead and task execution time.
⚠️ Common Pitfalls
Pitfall 1: Higher IPC = Faster Program?
Misconception: The optimization goal is to maximize IPC.
Truth: High IPC doesn’t necessarily mean fast programs.
// Super high IPC, but does no useful work
for (int i = 0; i < 1000000; i++) {
asm volatile ("nop"); // IPC might approach 4.0!
}
// Lower IPC, but actually doing computation
for (int i = 0; i < 1000000; i++) {
result += array[i]; // IPC might only be 0.5
}
💡 Correct Understanding: Performance = Instret / Time or Instret / Cycle, but only if those instructions do useful work.
Pitfall 2: Ignoring Counter Overflow
Error Scenario: After long execution, counter overflows causing negative results.
Solution: Use 64-bit counters (RV64) or correctly handle 32-bit counter overflow.
// RV32: Need to read cycleh (high 32 bits)
uint64_t read_cycle_rv32() {
uint32_t lo, hi1, hi2;
do {
hi1 = read_csr(cycleh);
lo = read_csr(cycle);
hi2 = read_csr(cycleh);
} while (hi1 != hi2); // Guard against overflow during read
return ((uint64_t)hi1 << 32) | lo;
}
Pitfall 3: Confusing cycle and time
Error Scenario: Using cycle to measure sleep time.
Truth:
| CSR | Behavior |
|---|---|
| cycle | Tracks CPU execution cycles; typically stops during WFI |
| time | Tracks real time, continues during WFI |
// ❌ Wrong: cycle doesn't increment during WFI
start = read_cycle();
wfi(); // Wait for interrupt
end = read_cycle();
sleep_time = end - start; // Result is nearly 0!
// ✅ Correct: Use time for sleep measurement
start = read_time();
wfi();
end = read_time();
sleep_time = end - start; // Correctly reflects wait time
Summary
Performance counters and the Performance Monitoring Unit enable quantitative performance analysis. This chapter explored RISC-V’s counter architecture and how it compares to ARM’s mature PMU infrastructure.
Performance counter architecture provides hardware-based measurement with minimal overhead. Counter CSRs exist at multiple privilege levels—machine-mode counters (mcycle, minstret, mhpmcounter) and supervisor/user-mode readable shadows (cycle, instret, hpmcounter). Access control through mcounteren and scounteren enables selective counter exposure to lower privilege levels. Counter inhibit via mcountinhibit allows stopping counters to measure specific code regions or reduce power consumption.
Basic performance counters provide fundamental metrics. The cycle counter tracks clock cycles executed by the hart. The instret counter tracks instructions retired (completed). The time counter provides wall-clock time at a constant frequency. Together, cycle and instret enable IPC calculation, a key performance metric. The difference between cycle (stops during sleep) and time (continues during sleep) enables measuring sleep overhead and real-time intervals.
Hardware performance counters track implementation-specific events through mhpmcounter3-31 (up to 29 counters). Event selection via mhpmevent CSRs configures what each counter tracks. Counters are 64-bit, minimizing overflow concerns. Multiple counters can measure different events simultaneously, enabling comprehensive performance characterization. Counter overflow handling is implementation-specific, with some implementations supporting overflow interrupts.
Performance events cover the full spectrum of microarchitectural activity. Cache events track L1 instruction cache, L1 data cache, and L2 cache hits and misses, revealing memory hierarchy performance. Branch events track branch execution and prediction accuracy, identifying control flow bottlenecks. Pipeline events track stalls and flushes, showing execution efficiency. Memory events track loads, stores, and TLB performance, revealing memory system behavior.
Profiling and analysis leverage performance counters for optimization. The Linux perf tool provides a standard interface to RISC-V counters for counting events and sampling-based profiling. PMU programming in machine mode (which an S-mode kernel reaches through SBI) configures counters and enables access from lower privilege levels. Event sampling collects periodic samples to identify hot code regions. Performance analysis techniques include top-down analysis (identify bottleneck category), hotspot analysis (find hot functions), and comparative analysis (measure optimization impact).
Comparison with ARM shows both similarities and differences. ARM’s PMU provides similar functionality with a cycle counter and multiple event counters. ARM standardizes event codes across implementations, while RISC-V leaves them implementation-specific. RISC-V provides 64-bit counters and a dedicated instret counter, simplifying IPC measurement. ARM provides standardized overflow interrupts. Both architectures support the perf tool, providing a consistent user experience. RISC-V allows up to 29 HPM counters, while ARM implementations typically provide 6-8 counters.
Together, RISC-V’s performance counters enable effective performance measurement, profiling, and optimization across the full range from embedded systems to high-performance processors.