Front Page
title: “See RISC-V Run: Fundamentals” subtitle: “A Comprehensive Guide to RISC-V Architecture” author: “Danny Jiang” version: “Draft v0p11” date: “January 2026”
See RISC-V Run: Fundamentals
A Comprehensive Guide to RISC-V Architecture
From ISA Fundamentals to System Design
Danny Jiang
Draft v0p11 - January 2026
Complete Book:
- 17 Chapters organized into 10 Parts
- 6 Appendices with quick reference materials
- ~100,000+ words (~400 pages)
- Comprehensive coverage from ISA fundamentals to system design
- Enhanced with Learning Objectives, Scenario Dialogues, Hands-on Labs, and Common Pitfalls
Licensed under CC BY 4.0
Copyright and License
See RISC-V Run: Fundamentals
A Comprehensive Guide to RISC-V Architecture
Copyright © 2025-2026 Danny Jiang
- Version: Draft v0p11 (Enhanced Edition)
- Published: January 2026
- Author: Danny Jiang
- Contact: djiang.tw@gmail.com
License
This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to:
- Share: Copy and redistribute the material in any medium or format for any purpose, even commercially
- Adapt: Remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Full license text: https://creativecommons.org/licenses/by/4.0/
Trademarks
RISC-V is a trademark of RISC-V International. ARM is a trademark of Arm Limited. MIPS is a trademark of MIPS Technologies, Inc. All other trademarks are the property of their respective owners.
Disclaimer
This book is provided “as is” without warranty of any kind, either expressed or implied. The author and publisher shall not be liable for any damages arising from the use of this book.
The information in this book is based on publicly available specifications and documentation. While every effort has been made to ensure accuracy, the RISC-V specifications continue to evolve. Readers should consult the official RISC-V specifications for the most current information.
About This Book
This is the complete book “See RISC-V Run: Fundamentals”. The book contains:
- 17 Chapters organized into 10 Parts
- 6 Appendices with quick reference materials
- ~100,000+ words (~400 pages)
- Comprehensive coverage from ISA fundamentals to system design
- v0p11 Enhanced: Learning Objectives, Scenario Dialogues, Hands-on Labs, Common Pitfalls
Author’s GitHub: https://github.com/djiangtw
For updates and errata: To be announced
Enhanced Edition, January 2026
Preface
Why This Book?
RISC-V represents a fundamental shift in computer architecture. Unlike proprietary instruction set architectures (ISAs) that dominated the industry for decades, RISC-V is open, modular, and designed for the modern era. It’s not just another ISA—it’s a new way of thinking about processor design, enabling innovation from embedded microcontrollers to high-performance supercomputers.
When I set out to write this book, I wanted to create something that didn’t exist: a comprehensive, systematic guide to RISC-V that combines the depth of official specifications with the clarity of a well-written textbook. I was inspired by Dominic Sweetman’s classic “See MIPS Run”, which masterfully explained MIPS architecture in a way that was both rigorous and accessible. This book aims to do the same for RISC-V—but rather than presenting dry specifications, it teaches you how to think like a system designer through dialogues between “Junior” and “Senior” engineers, using real-world metaphors like “The Gourmet Kitchen” (ISA modularity), “The Museum’s Red Barrier Poles” (PMP), and “The Building’s Privilege Hierarchy” (M/S/U modes) to explain complex concepts.
Who Should Read This Book?
This book is written for anyone who wants to understand RISC-V deeply:
System Software Developers: If you’re writing operating systems, bootloaders, firmware, or low-level drivers for RISC-V, this book provides the architectural knowledge you need. You’ll learn how exceptions work, how virtual memory is implemented, how to use SBI calls, and how to write correct concurrent code under RISC-V’s weak memory model.
Hardware Engineers: If you’re designing RISC-V processors or SoCs, this book explains the ISA from an implementation perspective. You’ll understand pipeline hazards, memory ordering requirements, interrupt controller integration, and the trade-offs between different microarchitectural choices.
Computer Architecture Students: If you’re learning computer architecture, RISC-V is an excellent teaching vehicle. This book provides a complete picture of a modern ISA, from instruction encoding to system-level features, with comparisons to ARM and MIPS to build your architectural intuition.
Engineers Transitioning from ARM or MIPS: If you’re experienced with other architectures and moving to RISC-V, this book highlights the similarities and differences. You’ll find detailed comparisons of instruction sets, exception models, memory models, and calling conventions to accelerate your learning.
What Makes This Book Different?
Comprehensive Coverage: This book covers the entire RISC-V ecosystem—not just the base ISA, but also extensions (M, A, F, D, C, V, B, H), privileged architecture (M/S/U modes), system software interfaces (SBI, ABI), platform specifications (PLIC, CLIC), and real-world implementation considerations.
Systematic Organization: The book is organized into 10 parts that build progressively from fundamentals to advanced topics. Each chapter is self-contained but connects to the broader narrative, making it suitable both for cover-to-cover reading and as a reference.
Practical Focus: Every concept is illustrated with code examples, diagrams, and real-world use cases. You’ll see how to implement spinlocks, how to handle page faults, how to configure PMPs, and how to debug memory ordering issues—not just abstract theory.
Comparative Analysis: Throughout the book, I compare RISC-V with ARM and MIPS. These comparisons help you understand RISC-V’s design philosophy and make informed decisions when porting code or designing systems.
Modern Perspective: RISC-V is a modern ISA designed with lessons learned from decades of processor evolution. This book emphasizes modern features like weak memory ordering, modular extensions, and formal specifications that distinguish RISC-V from older architectures.
How to Use This Book
For Cover-to-Cover Reading: The chapters are designed to be read in order, building from basic concepts to advanced topics. Start with Part I (Introduction) and work through to Part X (RISC-V vs Other Architectures).
As a Reference: Each chapter is self-contained with clear section headings and summaries. The appendices provide quick reference tables for CSRs, extensions, SBI calls, and instruction comparisons. Use the table of contents and index to find specific topics.
For Hands-On Learning: The book includes numerous code examples in assembly and C. I encourage you to run these examples on RISC-V simulators (QEMU, Spike) or real hardware. Experimenting with the code will deepen your understanding.
For Teaching: This book is suitable for undergraduate or graduate courses in computer architecture or systems programming. Each chapter includes learning objectives and summaries that can guide course structure.
What’s in This Book?
This book contains 17 chapters organized into 10 parts, plus 6 comprehensive appendices:
Part I — Introduction introduces the RISC-V ISA, its history, design philosophy, and ecosystem.
Part II — Programmer’s Model covers the fundamental programming model—registers, privilege levels, calling conventions, and execution environment.
Part III — Traps & Exceptions explains the unified trap handling mechanism for exceptions and interrupts.
Part IV — Memory & Addressing covers virtual memory, paging (Sv39/Sv48), memory ordering, and synchronization.
Part V — Pipeline & Microarchitecture explores pipeline fundamentals, hazards, and microarchitecture variations.
Part VI — Booting & System Software details reset, boot flow, firmware, SBI, and OS integration.
Part VII — ISA Extensions covers standard extensions (M, A, F, D, C) and the Vector extension.
Part VIII — System Design, Platform Spec & SoC Integration explains SoC integration, interrupt controllers, and platform profiles.
Part IX — Performance, Debug & Tools covers debugging, trace, and performance monitoring.
Part X — RISC-V vs Other Architectures provides a systematic comparison of RISC-V vs ARM vs MIPS.
Appendices provide quick reference for CSRs, extensions, bootloaders, SBI, RISC-V vs ARM comparison, and memory model.
Total: ~100,000+ words, ~400 pages
Acknowledgments
This book would not have been possible without the RISC-V community’s commitment to open specifications and transparent development. I’m grateful to RISC-V International and all the engineers who have contributed to the RISC-V specifications.
I also want to thank the authors of the classic architecture books that inspired this work, particularly Dominic Sweetman’s “See MIPS Run” and the ARM architecture reference manuals that set the standard for technical documentation.
Finally, thank you to the early readers and reviewers who provided feedback on draft chapters. Your insights have made this book better.
Feedback and Errata
This is a living book. RISC-V continues to evolve, and I’m committed to keeping this book current. If you find errors, have suggestions, or want to share how you’re using the book, please reach out:
Email: djiang.tw@gmail.com
GitHub: https://github.com/djiangtw
Errata: To be announced
Danny Jiang
January 2026
Table of Contents
Front Matter
- Cover
- Copyright and License
- Preface
- Table of Contents
Part I — Introduction
Chapter 1: What Is RISC-V?
1.1 The Birth of RISC-V
1.2 RISC-V Design Philosophy
1.3 RISC-V ISA Overview
1.4 RISC-V Ecosystem
1.5 Why Learn RISC-V?
1.6 How to Use This Book
Part II — Programmer’s Model
Chapter 2: Programmer’s Model & Register Set
2.1 Integer Register File
2.2 Floating-Point Register File
2.3 Control and Status Registers (CSRs)
2.4 Privilege Modes
2.5 Memory Model
2.6 Comparison with ARM and MIPS
Chapter 3: Privilege Levels & Execution Environment
3.1 RISC-V Privilege Architecture
3.2 Machine Mode (M-mode)
3.3 Supervisor Mode (S-mode)
3.4 User Mode (U-mode)
Part III — Traps & Exceptions
Chapter 4: Trap, Exception, Interrupt
4.1 Trap Handling Overview
4.2 Exception Types
4.3 Interrupt Types
4.4 Trap Delegation
4.5 Trap Vector Table
4.6 Trap Entry and Exit
4.7 Nested Traps
4.8 Comparison with ARM and MIPS
Part IV — Memory & Addressing
Chapter 5: Virtual Memory & Paging (Sv39/Sv48)
5.1 Virtual Memory Overview
5.2 Page Table Structure
5.3 Address Translation
5.4 TLB Management
Chapter 6: Memory Ordering & Synchronization
6.1 Memory Consistency Model
6.2 RISC-V Memory Model (RVWMO)
6.3 Fence Instructions
6.4 Atomic Operations
6.5 Load-Reserved/Store-Conditional
6.6 Comparison with ARM and x86
Part V — Pipeline & Microarchitecture
Chapter 7: RISC-V Pipeline Fundamentals
7.1 Classic Five-Stage Pipeline
7.2 Pipeline Hazards
7.3 Hazard Detection and Resolution
7.4 Branch Prediction
7.5 Pipeline Performance
7.6 Comparison with ARM and MIPS
Chapter 8: Microarchitecture Variations
8.1 In-Order vs Out-of-Order
8.2 Superscalar Execution
8.3 Cache Hierarchy
8.4 Memory Subsystem
8.5 RISC-V Core Examples
8.6 Rocket Core
8.7 BOOM (Berkeley Out-of-Order Machine)
8.8 Performance Comparison
Part VI — Booting & System Software
Chapter 9: Reset, Boot Flow & Firmware
9.1 Reset and Initialization
9.2 Boot ROM
9.3 Bootloader Stages
9.4 Firmware Components
9.5 Device Tree
9.6 Boot Flow Examples
9.7 Comparison with ARM
Chapter 10: Machine Mode, SBI & Supervisor Mode
10.1 Machine Mode Overview
10.2 Supervisor Binary Interface (SBI)
10.3 SBI Implementation (OpenSBI)
10.4 Supervisor Mode Software
10.5 OS Kernel Integration
10.6 Comparison with ARM
Part VII — ISA Extensions
Chapter 11: RISC-V Standard Extensions
11.1 Extension Naming Convention
11.2 M Extension (Integer Multiply/Divide)
11.3 A Extension (Atomic Instructions)
11.4 F Extension (Single-Precision Floating-Point)
11.5 D Extension (Double-Precision Floating-Point)
11.6 C Extension (Compressed Instructions)
11.7 Zicsr and Zifencei
11.8 Other Standard Extensions
Chapter 12: Vector Processing & SIMD Comparison
12.1 Vector Extension Overview
12.2 Vector Register File
12.3 Vector Instructions
12.4 Vector Length Agnostic Programming
12.5 Vector Memory Operations
12.6 Vector Performance
12.7 Comparison with ARM SVE/SVE2
12.8 Comparison with x86 AVX
Part VIII — System Design, Platform Spec & SoC Integration
Chapter 13: SoC Integration
13.1 SoC Architecture Overview
13.2 Interrupt Controllers (PLIC, CLIC, AIA)
13.3 Memory-Mapped I/O
13.4 DMA and Bus Protocols
13.5 Power Management Integration
Chapter 14: RISC-V Platform Profiles & Embedded Systems
14.1 Platform Specifications
14.2 RVA Profiles (RVA22, RVA23)
14.3 Embedded Profiles
14.4 Certification and Compliance
Part IX — Performance, Debug & Tools
Chapter 15: Debugging & Trace
15.1 Debug Architecture Overview
15.2 Debug Module
15.3 Trigger Module
15.4 Trace Architecture
15.5 JTAG and OpenOCD
15.6 GDB Integration
Chapter 16: Performance Counters & PMU
16.1 Performance Monitoring Overview
16.2 Hardware Performance Counters
16.3 Event Selection
16.4 PMU Programming
16.5 Performance Analysis Tools
Part X — RISC-V vs Other Architectures
Chapter 17: RISC-V vs ARM vs MIPS — A Systematic Comparison
17.1 Historical Context
17.2 ISA Design Philosophy
17.3 Instruction Encoding
17.4 Privilege Architecture
17.5 Memory Model
17.6 Ecosystem and Adoption
17.7 Future Outlook
Appendices
Appendix A: CSR Reference
Appendix B: Extension Reference
Appendix C: Bootloader Reference
Appendix D: SBI Reference
Appendix E: RISC-V vs ARM Comparison
Appendix F: Memory Model Reference
Back Matter
- About the Author
- Bibliography
Chapter 1. What Is RISC-V?
Part I — Introduction
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand the ISA’s Role: Grasp how the Instruction Set Architecture (ISA) serves as the critical interface between software and hardware
- Master RISC-V Core Philosophy: Understand how the three pillars—modularity, simplicity, and openness—influence design decisions
- Distinguish Business Model Differences: Recognize the fundamental licensing and ecosystem differences between RISC-V and x86/ARM
- Decode ISA Naming: Read and understand ISA strings like RV64GC and RV32IMC
- Set Up Development Environment: Successfully install the RISC-V toolchain and run your first program
💡 Scenario: The Break Room Decision
Scene: Monday morning in the company break room. The coffee machine hums quietly. The aroma of fresh coffee fills the air, but Junior’s expression is anything but relaxed.
Junior: (sighing) “Hey Senior, the PM just dropped a new project on me. Says we need to evaluate RISC-V for the MCU selection instead of our usual ARM Cortex-M. My head is spinning—what’s the difference anyway? Don’t they both just run C code? Why make things complicated?”
Senior: (accepting a fresh cup of coffee with a smile) “Don’t panic. Think of it this way: you’ve been eating at a ‘franchise fast-food restaurant’ all this time, and now the boss wants you to try a ‘build-your-own gourmet kitchen.’ Writing C code might feel similar, but the underlying rules of the game have changed.”
Junior: “Rules of the game? You mean the instruction set architecture?”
Senior: “Exactly. Think about it—when we use ARM, the ecosystem is strong, but it’s still one company’s product. To use their IP, your company pays licensing fees, and you get whatever features they decide to give you. If we wanted to add special hardware acceleration for an AI algorithm, could we modify ARM’s core?”
Junior: “No way. The vendor would never let us mess with their core.”
Senior: “And that’s RISC-V’s biggest value. It’s an open standard. Like TCP/IP or Linux, no single company ‘owns’ it. If we need special acceleration, under RISC-V, we can design our own custom instructions and add them in—no begging the vendor for permission.”
Junior: “Sounds flexible, but wouldn’t it be chaotic? Without standards, would programs even run?”
Senior: “That’s the most common newbie question. The clever thing about RISC-V is its modular design. There’s a mandatory ‘base model’ that everyone must follow, and additional extensions are like LEGO blocks—add them when you need them. That’s why everyone from NVIDIA to university labs is playing with it.”
Junior: (eyes lighting up) “Interesting… So we’re not just ‘users’ anymore—we can become ‘designers’?”
Senior: “Bingo! Come on, let’s head back to our desks. I’ll help you set up the environment, and we’ll start from the foundation of these ‘LEGO blocks.’”
RISC-V represents a fundamental shift in how we think about processor architectures. Unlike proprietary instruction sets that require licenses and royalties, RISC-V is completely open and free. Unlike monolithic architectures that bundle everything together, RISC-V is modular, allowing implementations from tiny microcontrollers to high-performance servers. Unlike architectures burdened by decades of legacy decisions, RISC-V was designed from scratch in 2010, learning from 30 years of RISC evolution.
This chapter introduces RISC-V by exploring its historical context, design philosophy, and place in the architecture landscape. We’ll trace the RISC revolution from the 1980s through RISC-V’s creation at UC Berkeley. We’ll examine why an open ISA matters and how RISC-V’s modular design enables unprecedented flexibility. We’ll compare RISC-V with ARM and MIPS to understand its competitive position. By the end, you’ll understand not just what RISC-V is, but why it matters and why it’s rapidly gaining adoption across the industry.
1.1 The RISC Revolution
Origins of RISC
The story of RISC-V begins not in 2010, but in the early 1980s, when computer architects began questioning the prevailing wisdom of complex instruction sets. Two landmark projects emerged almost simultaneously: the Berkeley RISC project led by David Patterson and Carlo Séquin, and the Stanford MIPS project led by John Hennessy. Both teams arrived at a radical conclusion: simpler is better.
The Berkeley RISC project, starting in 1980, challenged the conventional wisdom that complex instructions were necessary for high performance. Patterson’s team demonstrated that a processor with a small, carefully chosen set of simple instructions could outperform contemporary CISC (Complex Instruction Set Computer) processors. The key insight was that compilers, not hardware, should handle complexity.
Meanwhile, at Stanford, John Hennessy’s MIPS (Microprocessor without Interlocked Pipeline Stages) project pursued similar goals with a slightly different approach. The MIPS design emphasized pipeline efficiency and compiler optimization, creating an architecture that would eventually power Silicon Graphics workstations and countless embedded systems.
IBM’s 801 project, though less publicized, also contributed crucial ideas to the RISC philosophy. Started in the late 1970s, it demonstrated that a load-store architecture with a large register file could achieve excellent performance.
RISC Design Philosophy
At the heart of RISC is a deceptively simple idea: make the common case fast. RISC architectures achieve this through several key principles:
Load-Store Architecture: Only load and store instructions access memory. All computation happens in registers. This clean separation simplifies pipeline design and enables higher clock frequencies.
Fixed-Length Instructions: All instructions are the same size (typically 32 bits). This allows the processor to fetch and decode instructions in parallel, simplifying the instruction fetch stage and enabling efficient pipelining.
Simple Addressing Modes: RISC architectures typically support only a few addressing modes, often just base+offset for memory access. Complex address calculations are performed using explicit arithmetic instructions.
Large Register File: With 32 or more general-purpose registers, RISC processors can keep frequently used data in fast registers rather than slow memory. This reduces memory traffic and improves performance.
RISC vs CISC
The RISC vs CISC debate dominated computer architecture discussions throughout the 1980s and 1990s. CISC architectures like the Intel x86 and Motorola 68000 featured hundreds of complex instructions, variable-length instruction encoding, and numerous addressing modes. The philosophy was to provide rich instruction sets that closely matched high-level language constructs.
RISC took the opposite approach. By keeping instructions simple and uniform, RISC processors could achieve higher clock frequencies and more efficient pipelining. The burden of generating efficient code shifted from hardware to compilers, but this proved to be a winning strategy as compiler technology matured.
Performance comparisons initially favored RISC. A RISC processor might execute more instructions to accomplish the same task, but it could execute each instruction faster and with better pipeline efficiency. The result was often superior overall performance, especially for compute-intensive workloads.
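The trade-off described above is commonly written as: execution time = instruction count × cycles per instruction (CPI) ÷ clock rate. The following is a minimal sketch of that arithmetic. The function name `execution_time` and every number in it are illustrative assumptions, not measurements of any real processor.

```python
def execution_time(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Return run time in seconds: instructions x cycles-per-instruction / clock rate."""
    return instruction_count * cpi / clock_hz

# Hypothetical workload: the RISC version needs 30% more instructions,
# but each instruction takes fewer cycles and the clock runs faster.
cisc_time = execution_time(1_000_000, cpi=4.0, clock_hz=25e6)
risc_time = execution_time(1_300_000, cpi=1.5, clock_hz=33e6)

print(f"CISC (assumed numbers): {cisc_time * 1e3:.1f} ms")
print(f"RISC (assumed numbers): {risc_time * 1e3:.1f} ms")
print(f"Speedup: {cisc_time / risc_time:.2f}x")
```

With these assumed inputs the larger instruction count is more than offset by the lower CPI and higher clock rate. Different inputs can reverse the result, which is why the comparison was debated for so long.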
Historical Impact
The RISC revolution spawned several influential architectures that shaped the computing landscape:
MIPS: Commercialized by MIPS Computer Systems in 1985, the MIPS architecture powered Silicon Graphics workstations and became ubiquitous in embedded systems. Its clean design made it a favorite for teaching computer architecture. Even today, MIPS processors are found in routers, set-top boxes, and other embedded devices.
SPARC: Sun Microsystems’ Scalable Processor Architecture (1987) dominated the workstation and server market in the 1990s. SPARC’s register windows and clean architecture made it popular for Unix systems and scientific computing.
ARM: Perhaps the most successful RISC architecture, ARM (Acorn RISC Machine, later Advanced RISC Machine) started in 1985 as a processor for the Acorn Archimedes computer. Its focus on power efficiency made it the dominant architecture for mobile devices. Today, ARM processors power virtually every smartphone and tablet.
PowerPC: The alliance between Apple, IBM, and Motorola produced PowerPC in 1991, combining ideas from IBM’s POWER architecture with RISC principles. PowerPC powered Apple Macintosh computers from 1994 to 2006 and remains important in embedded and high-performance computing.
These architectures proved that RISC principles worked in practice. They demonstrated that simple, regular instruction sets could deliver excellent performance while simplifying processor design. This legacy directly influenced RISC-V’s design philosophy.
Figure 1.1: RISC Architecture Evolution Timeline (1980-2010)
| Year | Architecture | Significance |
|---|---|---|
| 1980 | Berkeley RISC-I | First RISC processor, demonstrated RISC principles |
| 1980 | IBM 801 | Early RISC design, influenced later architectures |
| 1981 | Berkeley RISC-II | Refined RISC design, register windows |
| 1983 | Stanford MIPS | Emphasized pipeline efficiency |
| 1985 | MIPS R2000 | First commercial RISC processor |
| 1985 | ARM1 (Acorn) | Low-power RISC for personal computers |
| 1987 | Sun SPARC | Workstation and server market |
| 1991 | PowerPC | Apple/IBM/Motorola alliance |
| 1994 | ARMv4 (ARM7TDMI) | Thumb instruction set, embedded dominance |
| 2001 | ARMv6 (ARM11) | Advanced features, multimedia support |
| 2010 | RISC-V | Open, modular, extensible ISA |
This timeline shows how RISC-V builds on 30 years of RISC architecture evolution, learning from both successes and mistakes of its predecessors.
1.2 Why RISC-V? The Open ISA Movement
The Need for Open ISA
By 2010, the computer architecture landscape had consolidated around a few proprietary instruction set architectures. Intel’s x86 dominated PCs and servers. ARM dominated mobile and embedded systems. MIPS and PowerPC served niche markets. All shared one characteristic: they were proprietary, requiring licenses and royalty payments.
This created several problems for researchers, educators, and innovators:
Licensing Costs: Even for academic research, obtaining architecture licenses could be expensive and time-consuming. Companies faced significant royalty payments for each chip manufactured.
Restrictions on Modification: Proprietary ISAs typically prohibited modifications or extensions without explicit permission. This stifled innovation and made it difficult to explore new architectural ideas.
Fragmentation: Each vendor’s extensions and modifications created incompatible variants. ARM alone had numerous profiles and extensions, making it challenging to write portable software.
Long-Term Uncertainty: Companies building products around a proprietary ISA faced uncertainty about future licensing terms, support, and the architecture’s longevity.
The open-source software movement had demonstrated the power of collaborative development and unrestricted access. Why not apply the same principles to hardware instruction sets?
Birth of RISC-V
In 2010, a team at UC Berkeley led by Krste Asanović, David Patterson, Yunsup Lee, and Andrew Waterman set out to create a new instruction set architecture for research and education. They needed an ISA that was:
- Free and open, with no licensing fees or restrictions
- Clean and simple, suitable for teaching
- Practical and complete, capable of running real operating systems
- Extensible, allowing custom instructions for specialized applications
- Stable, with a frozen base that would never change
The team initially considered using an existing ISA but found none that met all their requirements. MIPS was proprietary at the time and carried legacy baggage such as branch delay slots. OpenRISC existed but lacked industry momentum. SPARC V8 was available but complex.
So they designed RISC-V from scratch, learning from 30 years of RISC architecture evolution. The name “RISC-V” (pronounced “risk-five”) represents the fifth generation of RISC architectures developed at Berkeley, following RISC-I through RISC-IV.
The initial RISC-V specification, released in 2011, defined a minimal 32-bit integer instruction set (RV32I) with just 47 instructions. This base ISA was deliberately kept small and frozen, ensuring that software written for RISC-V would run forever.
RISC-V International
As RISC-V gained traction, it became clear that a formal organization was needed to manage the specification and ensure consistency. In 2015, the RISC-V Foundation was established as a non-profit organization to standardize and promote the architecture.
In 2020, the foundation relocated to Switzerland and became RISC-V International, reflecting its global nature and ensuring neutrality from any single country’s export controls or political considerations.
RISC-V International operates through a collaborative governance model:
- Technical committees develop and ratify specifications
- Members include companies, universities, and individuals
- All specifications are freely available
- Anyone can implement RISC-V without fees or licenses
- Members contribute to development but don’t control the ISA
This open governance ensures that RISC-V remains truly open and vendor-neutral, unlike proprietary ISAs controlled by single companies.
Industry Ecosystem
What started as an academic project has grown into a thriving industry ecosystem. By 2025, RISC-V International had grown to over 3,000 members from more than 70 countries, including major technology companies, startups, universities, and government organizations.
Hardware implementations range from tiny microcontrollers to high-performance application processors:
- SiFive produces commercial RISC-V cores for embedded and application processors
- Western Digital has shipped billions of RISC-V cores in storage controllers
- NVIDIA uses RISC-V for GPU microcontrollers
- Alibaba’s T-Head develops high-performance RISC-V processors
- Numerous startups are building RISC-V-based products
The software ecosystem has matured rapidly. The GNU toolchain (GCC, binutils) and LLVM support RISC-V. Major operating systems including Linux, FreeBSD, and real-time operating systems run on RISC-V. Language runtimes for Java, Python, JavaScript, and others have been ported.
Academic adoption has been particularly strong. RISC-V’s clean design and open nature make it ideal for teaching computer architecture. Universities worldwide use RISC-V in courses, and researchers use it to explore new architectural ideas without licensing barriers.
Comparison with Proprietary ISAs
The contrast between RISC-V’s open model and proprietary ISAs is stark:
ARM Licensing: ARM Holdings licenses its architecture and core designs. Licensees pay upfront fees and per-chip royalties. Architecture licenses (allowing custom core design) are expensive and restricted. This model has been profitable for ARM but creates barriers for innovation and education.
x86 Duopoly: Intel and AMD control the x86 architecture through patents and cross-licensing agreements. In practice, almost no other company can legally implement x86 processors. This near-duopoly limits competition and innovation in the PC and server markets.
RISC-V Open Model: Anyone can implement RISC-V without fees, licenses, or royalties. The specification is freely available. Custom extensions are permitted. This openness enables innovation, reduces costs, and eliminates vendor lock-in.
The open model doesn’t mean RISC-V is “free” in the sense of zero cost. Designing and manufacturing processors still requires significant investment. But it removes the artificial barriers of licensing fees and restrictions, allowing competition based on implementation quality rather than ISA access.
1.3 RISC-V Design Philosophy
Simplicity
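RISC-V embraces simplicity as a core principle. The base integer instruction set (RV32I) contains just 40 instructions in the current ratified specification (earlier versions counted 47, before the CSR and FENCE.I instructions moved into the separate Zicsr and Zifencei extensions). That is enough to serve as a compiler target for complete applications, yet small enough to understand completely in a few hours.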
This simplicity manifests in several ways:
Orthogonal Instruction Set: Instructions don’t have special cases or exceptions. Load and store instructions work the same way regardless of data type. Arithmetic instructions operate uniformly on registers.
Regular Encoding: Instruction formats are consistent and predictable. The opcode is always in the same position. Source and destination registers occupy fixed fields. This regularity simplifies decoding and enables efficient implementation.
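No Implicit Operations: RISC-V instructions do exactly what they say, nothing more. There are no hidden side effects or implicit register updates, and there is no condition-code (flags) register at all. Comparison instructions such as SLT write their result to an ordinary general-purpose register, and conditional branches compare two registers directly.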
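The regular encoding described above can be illustrated with a small decoder. This is a minimal sketch that assumes the standard 32-bit R-type layout. The helper name `decode_r_type` is hypothetical and not part of any toolchain.

```python
def decode_r_type(instr: int) -> dict:
    """Split a 32-bit R-type instruction word into its fixed-position fields."""
    return {
        "opcode": instr & 0x7F,          # bits 6..0
        "rd":     (instr >> 7) & 0x1F,   # bits 11..7  (destination register)
        "funct3": (instr >> 12) & 0x7,   # bits 14..12
        "rs1":    (instr >> 15) & 0x1F,  # bits 19..15 (first source register)
        "rs2":    (instr >> 20) & 0x1F,  # bits 24..20 (second source register)
        "funct7": (instr >> 25) & 0x7F,  # bits 31..25
    }

# 0x002081B3 is the encoding of `add x3, x1, x2`
fields = decode_r_type(0x002081B3)
print(fields)
# {'opcode': 51, 'rd': 3, 'funct3': 0, 'rs1': 1, 'rs2': 2, 'funct7': 0}
```

Because the register fields sit in the same bit positions across instruction formats, a hardware decoder can start reading the register file before it has finished identifying the instruction.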
The simplicity extends to the privilege architecture. RISC-V defines three privilege levels (Machine, Supervisor, User) with clean separation of responsibilities. There are no complex security states or trust zones in the base specification—those can be added as extensions if needed.
Modularity
Perhaps RISC-V’s most distinctive feature is its modular design. Rather than defining one monolithic ISA, RISC-V separates functionality into a small base ISA plus optional standard extensions.
The base ISA comes in three widths: RV32I, RV64I, and RV128I. RV32I and RV64I are ratified and frozen, so they will never change; RV128I is still a draft. The base ISA provides:
- Integer arithmetic and logical operations
- Load and store instructions
- Control flow (branches and jumps)
- System instructions (environment calls, fences)
Standard extensions add functionality:
- M: Integer multiplication and division
- A: Atomic instructions for synchronization
- F: Single-precision floating-point
- D: Double-precision floating-point
- C: Compressed 16-bit instructions for code density
- V: Vector operations for data parallelism
- B: Bit manipulation
A processor implements only the extensions it needs. An embedded microcontroller might implement just RV32I. A Linux-capable application processor would typically implement RV64IMAFDC (usually abbreviated as RV64GC, where G = IMAFD). A high-performance processor might add V for vector processing.
This modularity provides several benefits:
- Implementations can be tailored to specific applications
- New extensions can be added without affecting existing code
- The ISA can evolve without breaking compatibility
- Educational and research projects can start simple and add complexity as needed
Figure 1.2: RISC-V Modular Architecture
RISC-V ISA Modular Structure
| Category | Component | Description | Status |
|---|---|---|---|
| Base ISA | RV32I | 32-bit integer base | FROZEN |
| | RV64I | 64-bit integer base | FROZEN |
| | RV128I | 128-bit integer base | Draft (not frozen) |
| Standard Extensions | M | Multiply/Divide | Standard |
| | A | Atomics | Standard |
| | F | Single-precision floating-point | Standard |
| | D | Double-precision floating-point | Standard |
| | C | Compressed instructions (16-bit) | Standard |
| | V | Vector operations | Standard |
| | B | Bit manipulation | Standard |
| Custom Extensions | Custom | Domain-specific instructions | Vendor-defined |
The modular design allows implementations to include only the extensions they need, from minimal embedded systems (RV32I only) to high-performance processors (RV64GCV).
Extensibility
Beyond standard extensions, RISC-V explicitly supports custom extensions. The instruction encoding reserves space for custom opcodes, allowing vendors to add specialized instructions without conflicting with standard ones.
Custom extensions enable:
- Domain-specific accelerators (cryptography, AI, signal processing)
- Proprietary features for competitive advantage
- Research into new instruction types
- Rapid prototyping of architectural ideas
The key is that custom extensions don’t break compatibility. Software that doesn’t use custom instructions runs unchanged. Compilers can generate code that uses custom instructions when available and falls back to standard instructions otherwise.
This extensibility has proven valuable in practice. Companies have added custom instructions for encryption, machine learning, and other specialized tasks. Researchers have explored new architectural concepts without forking the ISA.
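To illustrate the reserved opcode space, here is a small sketch in C. It assumes the major opcodes named custom-0 (0x0B) and custom-1 (0x2B) in the base opcode map; the helper names and the choice of an R-type layout are illustrative assumptions, not part of any vendor's actual extension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Major opcodes (bits [6:0]) set aside for custom extensions. */
enum { OPC_CUSTOM_0 = 0x0b, OPC_CUSTOM_1 = 0x2b };

/* True if the instruction word uses one of the custom major opcodes above. */
bool is_custom_insn(uint32_t insn) {
    uint32_t opcode = insn & 0x7f;
    return opcode == OPC_CUSTOM_0 || opcode == OPC_CUSTOM_1;
}

/* Hypothetical encoder: build an R-type word in the custom-0 space. */
uint32_t encode_custom0_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                          uint32_t funct3, uint32_t rd) {
    assert(funct7 < 128 && rs2 < 32 && rs1 < 32 && funct3 < 8 && rd < 32);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | OPC_CUSTOM_0;
}
```

Because a custom instruction can reuse the standard field layout, a decoder can route it to an accelerator by looking at the opcode alone, while standard instructions are unaffected.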
Stability
RISC-V makes a strong commitment to stability. The base ISA is frozen and will never change: software written for the ratified RV32I base will still run on RISC-V processors in 2050 and beyond.
Standard extensions follow a rigorous ratification process. Once ratified, an extension is frozen. New versions may add features but must maintain backward compatibility.
This stability provides confidence for long-term investments. Companies can build products knowing the ISA won’t change underneath them. Software developers can write code that will run on future processors.
The stability commitment distinguishes RISC-V from some other open ISAs that have undergone incompatible changes. It also contrasts with proprietary ISAs where vendors can deprecate features or change behavior in new versions.
1.4 RISC-V ISA Overview
Base Integer ISAs
RISC-V defines three base integer ISAs, differing primarily in register width:
RV32I: 32-bit registers and addresses. Suitable for embedded systems, microcontrollers, and 32-bit applications. The base RV32I instruction set is frozen and contains 47 instructions.
RV64I: 64-bit registers and addresses. Designed for application processors, servers, and systems requiring large address spaces. RV64I extends RV32I with additional instructions for 64-bit operations (like ADDW for 32-bit addition with sign extension).
RV128I: 128-bit registers and addresses. Reserved for future systems requiring very large address spaces. The specification is preliminary and not yet frozen.
All three ISAs share the same basic instruction formats and philosophy. Code written for RV32I can often be recompiled for RV64I with minimal changes. The transition from 32-bit to 64-bit is cleaner than in some other architectures.
There’s also RV32E, a reduced variant with only 16 registers instead of 32, designed for extremely small embedded systems where chip area is critical.
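The ADDW behaviour mentioned above (add the low 32 bits, then sign-extend) can be modelled in a few lines of C. This is a sketch of the instruction's semantics only; the function names are made up for illustration, and the narrowing cast relies on the two's-complement behaviour that mainstream compilers provide.

```c
#include <stdint.h>

/* Model of ADD on RV64: full 64-bit wrapping addition. */
int64_t rv64_add(int64_t rs1, int64_t rs2) {
    return (int64_t)((uint64_t)rs1 + (uint64_t)rs2);
}

/* Model of ADDW on RV64: add the low 32 bits, then sign-extend to 64 bits. */
int64_t rv64_addw(int64_t rs1, int64_t rs2) {
    uint32_t low = (uint32_t)rs1 + (uint32_t)rs2;  /* wraps modulo 2^32 */
    return (int64_t)(int32_t)low;                  /* copy bit 31 upward */
}
```

Adding 1 to 0x7FFFFFFF shows the difference: rv64_add gives 0x80000000, while rv64_addw gives -2147483648, exactly what 32-bit C int arithmetic compiled for RV64 needs.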
Standard Extensions
The standard extensions add functionality to the base ISA:
M Extension - Multiplication and Division: Adds integer multiply, divide, and remainder instructions. Essential for most applications but optional for the simplest embedded systems. The M extension adds 8 instructions in RV32 and 13 in RV64.
A Extension - Atomic Instructions: Provides atomic memory operations for synchronization in multi-processor systems. Includes load-reserved/store-conditional (LR/SC) for lock-free algorithms and atomic memory operations (AMO) such as atomic swap, add, AND, OR, XOR, min, and max. Compare-and-swap is not an AMO in the A extension; it is built from an LR/SC loop.
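From C, these instructions are normally reached through the standard atomics header rather than written by hand. The sketch below uses C11 stdatomic.h; the function names are illustrative, and the comments about which instructions a compiler emits describe typical behaviour for targets with the A extension, not a guarantee.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Typically compiled to a single AMOADD.W; returns the previous value. */
int32_t counter_increment(_Atomic int32_t *counter) {
    return atomic_fetch_add(counter, 1);
}

/* Raise *target to at least 'candidate' using a compare-and-swap retry
 * loop, which compilers typically lower to an LR/SC sequence.
 * Returns the value observed before any update. */
int32_t atomic_store_max(_Atomic int32_t *target, int32_t candidate) {
    int32_t seen = atomic_load(target);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* a failed exchange refreshed 'seen'; loop and try again */
    }
    return seen;
}
```

The retry loop is the general pattern: any read-modify-write that has no dedicated AMO can be expressed this way.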
F and D Extensions - Floating-Point: F adds single-precision (32-bit) floating-point, while D adds double-precision (64-bit). These extensions include arithmetic operations, comparisons, conversions, and a separate register file for floating-point values. They follow the IEEE 754 standard.
C Extension - Compressed Instructions: Adds 16-bit instruction encodings for common operations, improving code density by 25-30%. The C extension is particularly valuable for embedded systems with limited memory. Compressed instructions can be freely mixed with standard 32-bit instructions.
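Mixing 16-bit and 32-bit instructions works because the length is visible in the first two bits of each instruction: 32-bit encodings end in binary 11, and compressed encodings use the other three patterns. A minimal sketch in C follows, assuming only these two standard lengths (longer encodings are ignored) and using illustrative function names.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the instruction that starts with this 16-bit parcel. */
unsigned insn_length(uint16_t first_parcel) {
    return (first_parcel & 0x3) == 0x3 ? 4 : 2;
}

/* Count the instructions in a stream of little-endian 16-bit parcels. */
size_t count_insns(const uint16_t *parcels, size_t n_parcels) {
    size_t count = 0;
    for (size_t i = 0; i < n_parcels; i += insn_length(parcels[i]) / 2)
        count++;
    return count;
}
```

For the parcel stream {0x8533, 0x00C5, 0x8082}, the first two parcels form the 32-bit word 0x00C58533 and the third is a compressed instruction, so count_insns reports two instructions.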
V Extension - Vector Processing: Provides vector operations for data parallelism. Unlike fixed-width SIMD (like ARM NEON or x86 AVX), RISC-V vectors are length-agnostic, allowing the same code to run efficiently on different vector lengths. This is similar to ARM’s SVE but with a cleaner design.
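The length-agnostic idea can be sketched in plain C as a strip-mined loop. Here grant_vl stands in for the vsetvli instruction, under the simplifying assumption that the hardware always grants min(remaining, VLMAX) elements; the names and the scalar inner loop are illustrative, not real vector code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for vsetvli: how many elements the hardware processes this pass. */
size_t grant_vl(size_t remaining, size_t vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* y[i] += a * x[i] in hardware-sized chunks; returns the number of chunks. */
size_t axpy_strip_mined(size_t n, int a, const int *x, int *y, size_t vlmax) {
    assert(vlmax > 0);
    size_t chunks = 0;
    for (size_t done = 0; done < n; chunks++) {
        size_t vl = grant_vl(n - done, vlmax);
        for (size_t i = 0; i < vl; i++)   /* one vector instruction's worth */
            y[done + i] += a * x[done + i];
        done += vl;
    }
    return chunks;
}
```

The same source handles ten elements in three passes when vlmax is 4 and in one pass when vlmax is 16. Nothing in the loop depends on the vector width, which is the property the V extension provides in hardware.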
B Extension - Bit Manipulation: Adds instructions for common bit manipulation operations like count leading zeros, rotate, and single-bit set, clear, and extract. These operations are common in cryptography, compression, and other algorithms.
The combination of base ISA plus extensions is denoted by a string like “RV64IMAFD” or “RV32IMC”. The letter “G” is shorthand for IMAFD (general-purpose), so “RV64GC” means RV64IMAFD plus compressed instructions.
ISA Naming Convention
RISC-V uses a systematic naming convention to specify which features a processor implements:
- RV32 or RV64 or RV128: Base integer ISA width
- I: Base integer instruction set (always present)
- M: Multiplication and division
- A: Atomic instructions
- F: Single-precision floating-point
- D: Double-precision floating-point
- C: Compressed instructions
- V: Vector extension
- G: Shorthand for IMAFD (general-purpose)
Additional letters indicate other extensions. The order of letters follows a standard sequence defined in the specification.
Examples:
- RV32I: Minimal 32-bit processor, base ISA only
- RV32IMC: 32-bit with multiply/divide and compressed instructions (common for microcontrollers)
- RV64GC: 64-bit general-purpose processor with compressed instructions (common for application processors)
- RV64GCV: 64-bit general-purpose with compressed and vector extensions
This naming convention makes it immediately clear what capabilities a processor has.
Figure 1.3: ISA Naming Convention Examples
Common ISA String Breakdown
| ISA String | Components | Meaning |
|---|---|---|
| RV64GC | RV64 | 64-bit base integer ISA |
| | G | General-purpose = IMAFD |
| | - I | Base integer instructions |
| | - M | Multiply/Divide |
| | - A | Atomics |
| | - F | Single-precision float |
| | - D | Double-precision float |
| | C | Compressed instructions |
| RV32IMC | RV32 | 32-bit base integer ISA |
| | I | Base integer instructions |
| | M | Multiply/Divide |
| | C | Compressed instructions |
Note: “G” is a shorthand for IMAFD, representing a general-purpose processor configuration.
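The naming convention is regular enough to parse mechanically. The following sketch in C handles only upper-case strings made of single-letter extensions, such as the examples above; multi-letter names like Zicsr, version numbers, and lower-case spellings are deliberately out of scope, and all identifiers are illustrative.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define EXT_BIT(letter) (1u << ((letter) - 'A'))

typedef struct {
    int xlen;        /* 32, 64, or 128 */
    uint32_t exts;   /* one bit per extension letter, A = bit 0 */
} isa_t;

/* Parse strings such as "RV64GC"; returns 0 on success, -1 if malformed. */
int parse_isa(const char *s, isa_t *out) {
    if (strncmp(s, "RV", 2) != 0) return -1;
    s += 2;
    int xlen = 0;
    while (isdigit((unsigned char)*s) && xlen < 1000)
        xlen = xlen * 10 + (*s++ - '0');
    if (xlen != 32 && xlen != 64 && xlen != 128) return -1;
    uint32_t exts = 0;
    for (; *s; s++) {
        if (!isupper((unsigned char)*s)) return -1;
        if (*s == 'G')   /* G expands to IMAFD */
            exts |= EXT_BIT('I') | EXT_BIT('M') | EXT_BIT('A') |
                    EXT_BIT('F') | EXT_BIT('D');
        else
            exts |= EXT_BIT(*s);
    }
    out->xlen = xlen;
    out->exts = exts;
    return 0;
}
```

Parsing "RV64GC" gives xlen 64 with the I, M, A, F, D, and C bits set, matching the breakdown in Figure 1.3.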
1.5 RISC-V Profiles
The Profile Concept
As RISC-V matured, the flexibility of optional extensions created a challenge: how do software developers know what features they can rely on? A processor implementing just RV64I is very different from one implementing RV64GCV.
RISC-V Profiles solve this problem by defining standard combinations of extensions for specific use cases. A profile specifies:
- Mandatory extensions that must be present
- Optional extensions that may be present
- Specific versions of each extension
- Additional requirements (like privilege modes)
Software targeting a profile can assume all mandatory features are available, simplifying development and ensuring portability.
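In code, a profile check reduces to a mask comparison. The sketch below is illustrative only: the profile_t type and the demo_profile contents are invented for this example, and a real profile lists many more (mostly multi-letter) extensions than a single-letter mask can hold.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT_BIT(letter) (1u << ((letter) - 'A'))

/* Illustrative profile description: a name plus a mask of required extensions. */
typedef struct {
    const char *name;
    uint32_t mandatory;
} profile_t;

/* A made-up profile for demonstration purposes. */
const profile_t demo_profile = {
    "DEMO-GC",
    EXT_BIT('I') | EXT_BIT('M') | EXT_BIT('A') |
    EXT_BIT('F') | EXT_BIT('D') | EXT_BIT('C')
};

/* Mandatory extensions that the implementation lacks (0 if none). */
uint32_t missing_extensions(uint32_t implemented, const profile_t *p) {
    return p->mandatory & ~implemented;
}

bool meets_profile(uint32_t implemented, const profile_t *p) {
    return missing_extensions(implemented, p) == 0;
}
```

An RV64IMAC core fails this demo profile with F and D reported as missing, while optional extensions beyond the mandatory set never cause a failure.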
RVA22 Profile
The RVA22 profile (named for 2022; ratified in early 2023) targets application processors capable of running rich operating systems like Linux. It comes in two variants:
RVA22U (Unprivileged; formally RVA22U64): Specifies the user-mode ISA. Mandatory extensions include:
- RV64I base ISA
- M, A, F, D, C extensions (i.e., RV64GC)
- Zicsr (CSR instructions)
- Zicntr and Zihpm (base counters and hardware performance counters)
- Zba, Zbb, and Zbs (bit manipulation), plus various other Z extensions for specific functionality
RVA22S (Supervisor; formally RVA22S64): Adds supervisor-mode requirements for OS support:
- Sv39 virtual memory (39-bit virtual addresses)
- Supervisor mode and required CSRs
- Zifencei (instruction fence)
- Additional privilege-related extensions
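A processor claiming RVA22S compliance provides the ISA features that standard Linux distributions and other Unix-like operating systems expect, although booting a particular distribution also depends on platform firmware and device support.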
RVA23 Profile
RVA23 (ratified in 2024) builds on RVA22 with additional features:
- Vector extension (V) is mandatory
- Additional bit manipulation instructions
- Hypervisor extension for virtualization
- Enhanced memory ordering features
RVA23 represents the next generation of application processors, with vector processing as a standard feature rather than an option.
Embedded Profiles
While RVA profiles target application processors, separate profile families are intended for embedded systems, some of which are still under development:
Microcontroller Profiles: Specify minimal feature sets for resource-constrained devices. These might require only RV32IMC with machine mode, omitting supervisor mode and virtual memory.
Real-Time Profiles: Add requirements for deterministic interrupt handling and timing, important for real-time operating systems.
The profile system allows RISC-V to serve diverse markets while maintaining clear compatibility boundaries. Software developers can target a profile rather than trying to support every possible combination of extensions.
1.6 RISC-V vs ARM vs MIPS
Historical Context
To understand RISC-V’s place in the architecture landscape, it’s helpful to compare it with its RISC predecessors:
MIPS (1985): The pioneer of commercial RISC, MIPS established many principles that RISC-V follows. Its clean design made it popular for education and embedded systems. However, MIPS fragmented into multiple incompatible variants (MIPS I through MIPS V, plus MIPS32/MIPS64), and its proprietary nature limited adoption. In 2019, MIPS became open-source, but by then RISC-V had captured the momentum.
ARM (1985): Starting as a simple RISC design, ARM evolved into a complex architecture with numerous extensions and profiles. Its focus on power efficiency made it dominant in mobile devices. However, ARM remains proprietary, requiring licenses and royalties. The architecture has accumulated considerable complexity over 40 years of evolution.
RISC-V (2010): Learning from both predecessors, RISC-V combines MIPS’s clean design philosophy with ARM’s practical focus on real-world applications, while adding the crucial element of openness. It avoids the fragmentation that plagued MIPS and the complexity that accumulated in ARM.
ISA Complexity Comparison
Instruction Count: RISC-V’s base RV32I has 47 instructions. MIPS32 has about 60 base instructions. ARM’s instruction count is harder to pin down due to numerous variants, but ARMv8-A has hundreds of instructions when including all extensions.
Encoding Formats: RISC-V uses 6 basic instruction formats, all 32 bits (plus 16-bit compressed formats). MIPS uses 3 formats, all 32 bits. ARM uses variable-length encoding in Thumb mode and fixed 32-bit in ARM mode, with complex encoding rules.
Addressing Modes: RISC-V supports base+offset for memory access, with offsets computed explicitly for complex addressing. MIPS is similar. ARM supports more complex addressing modes including auto-increment and scaled indexing.
The trend is clear: RISC-V is the simplest, MIPS is moderately complex, and ARM is the most complex. This simplicity makes RISC-V easier to implement, verify, and teach.
Licensing and Ecosystem
RISC-V: Completely open and free. No licenses required, no royalties, no restrictions. Anyone can implement, modify, or extend RISC-V. The specification is publicly available.
ARM: Proprietary and licensed. Architecture licenses cost millions of dollars. Per-chip royalties apply. Modifications require permission. However, ARM offers extensive IP, tools, and ecosystem support.
MIPS: Historically proprietary, became open-source in 2019 under Wave Computing. However, the transition was rocky, and MIPS lacks the momentum and ecosystem of RISC-V.
The licensing difference is fundamental. RISC-V enables innovation and competition that proprietary ISAs cannot match.
Use Cases and Adoption
Embedded Systems: RISC-V is rapidly gaining share in microcontrollers and embedded processors. Its modularity allows implementations tailored to specific needs. Western Digital, for example, has publicly committed to moving its storage controllers, more than a billion cores per year, to RISC-V.
Application Processors: ARM dominates mobile devices, but RISC-V is emerging in this space. SiFive and others are developing high-performance RISC-V cores. The open nature appeals to companies wanting to avoid ARM licensing.
High-Performance Computing: RISC-V is being explored for HPC accelerators and specialized processors. The vector extension makes it competitive for data-parallel workloads.
Education and Research: RISC-V has become the architecture of choice for teaching and research, displacing MIPS in many universities.
Future Outlook
RISC-V’s trajectory is clear: rapid growth driven by openness, simplicity, and industry support. It won’t displace ARM in smartphones overnight, but it’s becoming the default choice for new designs where licensing costs and flexibility matter.
The comparison with MIPS is particularly instructive. MIPS had technical merit but remained proprietary too long. By the time it opened, RISC-V had captured the open ISA mindshare. ARM’s technical excellence and ecosystem are formidable, but its proprietary nature creates opportunities for RISC-V.
RISC-V represents not just a new ISA, but a new model for processor architecture: open, collaborative, and free from vendor lock-in. This model is proving compelling for the next generation of computing.
🛠️ Hands-on Lab: Lab 1.1 — Hello RISC-V World
Now let’s get our hands dirty! In this lab, you’ll install the RISC-V toolchain and run your first program.
Lab Objectives
- Understand what a Cross-Compiler is
- Install the riscv64-unknown-elf-gcc toolchain
- Successfully compile and emulate your first C program
Environment Setup
Choose one of the following methods to install the RISC-V toolchain:
Option A: Using xPack Pre-built Packages (Recommended for Beginners)
# Download xPack RISC-V GNU Toolchain
# https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases
# Linux x64 example (adjust version and platform as needed)
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
tar xzf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
export PATH=$PWD/xpack-riscv-none-elf-gcc-14.2.0-3/bin:$PATH
# Verify installation
riscv-none-elf-gcc --version
Option B: Using Docker (Cross-platform Consistency)
# Use pre-built Docker image
docker pull riscv/riscv-gnu-toolchain
docker run -it -v $(pwd):/work riscv/riscv-gnu-toolchain bash
Option C: Ubuntu/Debian Package Manager
sudo apt update
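sudo apt install gcc-riscv64-unknown-elf qemu-user qemu-system-misc
# Note: the distribution compiler may ship without a C library;
# if stdio.h is not found when compiling, use Option A or B instead
# Verify installation (qemu-user provides qemu-riscv64, used later in this lab)
riscv64-unknown-elf-gcc --version
qemu-riscv64 --version
qemu-system-riscv64 --version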
Write the Program
Create a simple hello.c:
// hello.c
#include <stdio.h>
int main() {
printf("Hello, RISC-V World!\n");
printf("This is my first RISC-V program!\n");
return 0;
}
Compile and Run
# Compile (using ELF format supported by Proxy Kernel)
riscv64-unknown-elf-gcc -o hello hello.c
# Check file format
file hello
# Output should be: hello: ELF 64-bit LSB executable, UCB RISC-V, ...
# Run using QEMU User Mode (requires qemu-riscv64)
qemu-riscv64 hello
# Or run using Spike + Proxy Kernel
spike pk hello
Expected Output:
Hello, RISC-V World!
This is my first RISC-V program!
What You Just Did
Congratulations! You’ve completed three important steps:
- Cross-Compilation: Used a compiler running on x86 to generate code for RISC-V
- Emulation: Used QEMU or Spike to execute RISC-V instructions on your x86 machine
- Toolchain Validation: Confirmed that the complete toolchain (compiler, linker, emulator) is working
danieRTOS Reference: This lab establishes the foundation for all subsequent labs. The same toolchain setup is used in danieRTOS development.
Extended Challenge (Optional)
Try using objdump to view the compiled assembly:
riscv64-unknown-elf-objdump -d hello | head -50
You’ll see output similar to this:
0000000000010000 <_start>:
10000: 00001197 auipc gp,0x1
10004: 80018193 addi gp,gp,-2048
...
These are RISC-V instructions! In subsequent chapters, we’ll learn how to read these instructions.
💡 Note on Cross-Compilation:
- Host: The computer you’re using (e.g., x86 Linux)
- Target: The platform you’re compiling for (e.g., RISC-V)
- The Cross-Compiler’s job is to generate Target code on the Host
Summary
RISC-V builds on 30 years of RISC architecture evolution, learning from the successes and mistakes of MIPS, SPARC, ARM, and PowerPC. The RISC revolution of the 1980s demonstrated that simple, regular instruction sets could outperform complex ones, establishing principles that RISC-V follows today: load-store architecture, fixed-length instructions, simple addressing modes, and large register files.
The open ISA movement addresses fundamental problems with proprietary architectures: licensing costs, restrictions on modification, fragmentation, and vendor lock-in. RISC-V provides a completely open and free instruction set architecture, governed by RISC-V International through a collaborative model that ensures vendor neutrality. Anyone can implement, modify, or extend RISC-V without fees or licenses.
RISC-V’s design philosophy emphasizes simplicity (just 47 instructions in the base ISA), modularity (optional extensions for specific needs), extensibility (custom instructions without breaking compatibility), and stability (frozen base ISA that will never change). This modular approach allows implementations tailored to specific applications, from RV32I-only microcontrollers to RV64GCV high-performance processors.
The standard extensions add functionality as needed: M for multiplication and division, A for atomic operations, F and D for floating-point, C for compressed instructions, V for vector processing, and B for bit manipulation. Profiles like RVA22 and RVA23 define standard combinations of extensions for specific use cases, ensuring software portability while preserving flexibility.
Compared to ARM and MIPS, RISC-V is simpler (fewer instructions, cleaner encoding), more modern (no legacy baggage), and completely open (no licensing fees or restrictions). While ARM dominates mobile devices and MIPS has declined, RISC-V is rapidly gaining adoption in embedded systems, emerging in application processors, and becoming the architecture of choice for education and research.
RISC-V represents not just a new ISA, but a new model for processor architecture: open, collaborative, and free from vendor lock-in. This openness, combined with technical excellence and industry support, positions RISC-V as the architecture for the next generation of computing.
Chapter 2. Programmer’s Model & Register Set
Part II — Programmer’s Model
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Memorize the 32 General-Purpose Registers: Know x0-x31 and their ABI aliases (a0, s0, sp, ra…)
- Understand the Calling Convention: Master the responsibilities of Caller-saved vs Callee-saved registers
- Mixed Programming Ability: Write mixed C and Assembly code and understand how they interact
- CSR Basics: Know the purpose and access methods of Control and Status Registers
- Privilege Level Concepts: Understand the permission differences between M/S/U modes
💡 Scenario: Sticky Notes and the Warehouse
Scene: Junior’s screen displays a dense wall of objdump output.
Junior: “Senior, I’m losing my mind. The documentation says arguments go in a0, a1, but the disassembled code shows x10, x11. And are x1 and ra even the same thing?”
Senior: “Ha, this is a rite of passage for every newbie. The hardware only knows x0 through x31—like street addresses. But to make it easier for us humans to write programs, we’ve established a set of ‘rules’ called the ABI (Application Binary Interface) that gives them nicknames.”
Junior: “Nicknames?”
Senior: “Yes. Imagine you’re repairing a watch. Registers are like the workbench right in front of you—limited space, only 32 parts can fit, but you can grab them instantly, super fast. Memory is like the big warehouse behind you—huge capacity but slow to fetch things.”
Junior: “So what about names like a0?”
Senior: “Those are the workbench’s designated zones.
- a0-a7 (Arguments): The ‘mail room.’ Others put parts here for you to work on, and you put finished parts back here for them.
- t0-t6 (Temporaries): The ‘scratch area.’ Use it however you want, toss things around; nobody cares.
- s0-s11 (Saved): The ‘reserved zone.’ If you borrow this space, you must first save whatever was there, then restore it when you’re done. Otherwise the previous user won’t find their stuff.”
Junior: “I see! What about ra (Return Address)?”
Senior: “That’s ‘the way home.’ When a function finishes, it needs to know which line of code to jump back to. Come on, let’s write a program and trace the changes on this ‘workbench’ using a simulator.”
Understanding a processor architecture begins with understanding its programmer’s model: the registers, instructions, and conventions that software uses to interact with hardware. RISC-V’s programmer’s model is clean, regular, and designed for both simplicity and efficiency.
This chapter explores the fundamental elements that every RISC-V programmer must know. We’ll examine the 32 general-purpose registers and their conventional uses, the Control and Status Registers (CSRs) that manage processor state, the privilege levels that separate user code from operating system code, and the calling convention that enables functions to work together. We’ll see how RISC-V’s design choices—like the zero register, separate CSR address space, and clean privilege model—simplify both hardware implementation and software development.
2.1 General-Purpose Registers
The Register File
RISC-V provides 32 general-purpose registers, numbered x0 through x31. Each register is XLEN bits wide, where XLEN is 32 for RV32, 64 for RV64, and 128 for RV128. This discussion focuses on RV64, the most common variant for application processors.
The 32-register design follows RISC tradition. It’s large enough to keep frequently used values in fast registers rather than slow memory, but small enough to implement efficiently in hardware. The register file needs multiple read and write ports to support instruction execution, and size directly impacts chip area and access time.
Unlike some architectures, RISC-V’s registers are truly general-purpose. There are no special restrictions on most registers—any register can be used as a source or destination for most instructions. This orthogonality simplifies both hardware implementation and compiler design.
Register x0: The Zero Register
Register x0 is special: it always reads as zero, and writes to it are discarded. This might seem wasteful, but it’s remarkably useful.
The zero register enables several common operations without dedicated instructions:
- NOP (no operation): ADDI x0, x0, 0 adds zero to zero and stores the result in x0 (which discards it)
- Move: ADDI x1, x2, 0 adds zero to x2 and stores the result in x1, effectively copying x2 to x1
- Load immediate: ADDI x1, x0, 42 adds 42 to zero, loading the constant 42 into x1
- Unconditional branch: BEQ x0, x0, target branches if x0 equals x0 (always true)
The zero register also simplifies the instruction set itself. Comparisons against zero need no immediate operand (BEQ x1, x0, target branches when x1 is zero), and naming x0 as the destination discards a result, which is how a plain jump is built from jump-and-link (JAL x0, target).
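A simulator-style model makes the zero-register behaviour easy to see. The sketch below is a minimal illustration in C, assuming RV64 and using invented type and function names; it is not taken from any particular emulator.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal RV64 integer register file model. */
typedef struct {
    uint64_t x[32];
} regfile_t;

uint64_t reg_read(const regfile_t *rf, unsigned idx) {
    assert(idx < 32);
    return idx == 0 ? 0 : rf->x[idx];   /* x0 always reads as zero */
}

void reg_write(regfile_t *rf, unsigned idx, uint64_t value) {
    assert(idx < 32);
    if (idx != 0)                        /* writes to x0 are discarded */
        rf->x[idx] = value;
}

/* ADDI rd, rs1, imm */
void exec_addi(regfile_t *rf, unsigned rd, unsigned rs1, int64_t imm) {
    reg_write(rf, rd, reg_read(rf, rs1) + (uint64_t)imm);
}
```

With this model, exec_addi(&rf, 1, 0, 42) loads 42 into x1, exec_addi(&rf, 2, 1, 0) copies x1 to x2, and exec_addi(&rf, 0, 1, 5) changes nothing, mirroring the load-immediate, move, and NOP idioms listed above.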
Standard Register Names (ABI)
While the hardware knows registers as x0-x31, software uses symbolic names defined by the Application Binary Interface (ABI). These names indicate each register’s conventional use:
| Register | ABI Name | Description | Saved by |
|---|---|---|---|
| x0 | zero | Hard-wired zero | — |
| x1 | ra | Return address | Caller |
| x2 | sp | Stack pointer | Callee |
| x3 | gp | Global pointer | — |
| x4 | tp | Thread pointer | — |
| x5-x7 | t0-t2 | Temporaries | Caller |
| x8 | s0/fp | Saved register / Frame pointer | Callee |
| x9 | s1 | Saved register | Callee |
| x10-x11 | a0-a1 | Function arguments / Return values | Caller |
| x12-x17 | a2-a7 | Function arguments | Caller |
| x18-x27 | s2-s11 | Saved registers | Callee |
| x28-x31 | t3-t6 | Temporaries | Caller |
These names are conventions, not hardware requirements. The processor doesn’t enforce them—you could use sp for arithmetic if you wanted (though your program would likely crash). But following the ABI ensures that code from different compilers and libraries can interoperate.
The 32 general-purpose registers are organized with their ABI names and conventional usage. Registers are categorized as: zero register (x0), special-purpose registers (ra, sp, gp, tp), caller-saved temporaries and arguments (t0-t6, a0-a7), and callee-saved registers (s0-s11). The complete register file organization with ABI names is shown in the table above.
Caller-Saved vs Callee-Saved
The ABI divides registers into two categories based on who preserves their values across function calls:
Caller-Saved Registers (ra, t0-t6, a0-a7): The calling function must save these if it needs their values after the call. The called function can freely modify them. These are used for the return address, temporary values, and function arguments.
Callee-Saved Registers (s0-s11, sp): The called function must preserve these. If it uses them, it must save their values on entry and restore them before returning. These are used for values that must survive across function calls.
This division optimizes the common case. Temporary values don’t need to be saved if they’re not used after a call. Long-lived values are automatically preserved across calls.
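The register table and the caller/callee split can be captured in a small lookup, which is handy when reading disassembly that mixes x-numbers and ABI names. This is a sketch in C with illustrative function names; the data comes directly from the ABI table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* ABI names indexed by x-register number. */
static const char *const abi_names[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0",   "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6",   "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8",   "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

const char *abi_name(unsigned x) {
    assert(x < 32);
    return abi_names[x];
}

/* Returns the x-register number for an ABI name, or -1 if unknown. */
int reg_number(const char *name) {
    if (strcmp(name, "fp") == 0) return 8;   /* fp is an alias for s0 */
    for (int i = 0; i < 32; i++)
        if (strcmp(name, abi_names[i]) == 0) return i;
    return -1;
}

/* sp, s0-s1, and s2-s11 must be preserved by the called function. */
bool is_callee_saved(unsigned x) {
    assert(x < 32);
    return x == 2 || x == 8 || x == 9 || (x >= 18 && x <= 27);
}
```

So abi_name(10) is "a0" and reg_number("ra") is 1, which answers Junior's question from the opening scenario: x10 and a0, like x1 and ra, are two names for the same register.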
Special-Purpose Registers
Several registers have special conventional uses:
ra (x1) - Return Address: Stores the return address for function calls. The JAL and JALR instructions (jump-and-link) write the return address to their destination register, which by convention is ra for calls. The function returns by jumping to the address in ra.
sp (x2) - Stack Pointer: Points to the top of the stack. The stack grows downward (toward lower addresses) by convention. Functions allocate stack space by subtracting from sp and deallocate by adding to sp.
gp (x3) - Global Pointer: Points to the middle of a 4KB region of global variables. This allows accessing globals with a single load/store instruction using a 12-bit signed offset (±2KB from gp). The linker sets up gp, and it remains constant during execution.
tp (x4) - Thread Pointer: Points to thread-local storage in multi-threaded programs. Each thread has its own tp value, allowing efficient access to thread-specific data.
fp (x8) - Frame Pointer: An alias for s0, used to point to the current stack frame. Some code uses fp to access local variables and function arguments, while sp may change during function execution. Other code omits the frame pointer to free up another register.
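The ±2KB window around gp follows directly from the 12-bit signed offset of load and store instructions. As a rough illustration, the C sketch below checks whether an address could be reached from gp with a single instruction; the function name and the example addresses are assumptions for demonstration, and a real linker makes this decision during relaxation.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr fits in a 12-bit signed offset (-2048..2047) from gp. */
bool gp_reachable(uint64_t gp, uint64_t addr) {
    int64_t offset = (int64_t)(addr - gp);
    return offset >= -2048 && offset <= 2047;
}
```

With gp set to 0x12800, addresses from 0x12000 through 0x12FFF are reachable, which is the 4KB region with gp in the middle; 0x13000 is one byte too far and needs a two-instruction sequence instead.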
Register Usage in Practice
Understanding register usage is crucial for reading assembly code and understanding compiler output. Here’s a typical function call sequence:
# Caller prepares arguments
li a0, 10 # First argument in a0
li a1, 20 # Second argument in a1
# Caller saves any needed temporaries
addi sp, sp, -16 # Reserve stack space (keeps sp 16-byte aligned)
sd t0, 0(sp) # Save t0 if needed after call
# Call function
jal ra, my_func # Jump to my_func, save return address in ra
# Caller restores temporaries
ld t0, 0(sp) # Restore t0
addi sp, sp, 16 # Release the stack space
# Result is in a0
mv s0, a0 # Save result to callee-saved register
Inside the called function:
my_func:
# Prologue: allocate stack frame
addi sp, sp, -32 # Allocate 32 bytes
sd ra, 24(sp) # Save return address
sd s0, 16(sp) # Save s0 if we'll use it
# Function body uses a0, a1 (arguments)
add s0, a0, a1 # Use s0 for computation
# Prepare return value
mv a0, s0 # Return value in a0
# Epilogue: restore and return
ld s0, 16(sp) # Restore s0
ld ra, 24(sp) # Restore return address
addi sp, sp, 32 # Deallocate stack frame
ret # Return (pseudo-instruction for jalr x0, 0(ra))
This pattern—prologue, body, epilogue—is standard for RISC-V functions. The prologue saves registers and allocates stack space. The body performs the computation. The epilogue restores registers and returns.
2.2 Control and Status Registers (CSRs)
CSR Overview
Beyond the 32 general-purpose registers, RISC-V defines a separate address space for Control and Status Registers (CSRs). These registers control processor behavior, report status, and provide access to privileged functionality.
CSRs are accessed using dedicated instructions (CSRRW, CSRRS, CSRRC, and their immediate variants) rather than normal load/store instructions. Each CSR has a 12-bit address, allowing up to 4,096 CSRs, though only a fraction are currently defined.
The CSR address space is partitioned by privilege level and read/write access:
- Bits [11:10] indicate whether the register is read/write (00, 01, or 10) or read-only (11)
- Bits [9:8] encode the lowest privilege level that can access the CSR
- Bits [7:0] identify the specific register
This encoding allows the hardware to quickly check access permissions. Attempting to access a CSR from insufficient privilege level or writing to a read-only CSR causes an illegal instruction exception.
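The permission check can be sketched in a few lines of C. This follows the privileged specification's convention, in which bits [11:10] of the CSR address mark read-only registers (value 11) and bits [9:8] give the lowest privilege level allowed to access the register. The names are illustrative, and a real implementation applies further rules (for example, counter-enable registers) that this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { PRIV_U = 0, PRIV_S = 1, PRIV_H = 2, PRIV_M = 3 } priv_t;

/* Lowest privilege level that may access the CSR: address bits [9:8]. */
priv_t csr_min_priv(uint16_t addr) {
    assert(addr < 0x1000);               /* CSR addresses are 12 bits */
    return (priv_t)((addr >> 8) & 0x3);
}

/* Read-only CSRs have address bits [11:10] equal to binary 11. */
bool csr_read_only(uint16_t addr) {
    return ((addr >> 10) & 0x3) == 0x3;
}

/* False means the access would raise an illegal instruction exception. */
bool csr_access_ok(uint16_t addr, priv_t current, bool is_write) {
    if (current < csr_min_priv(addr)) return false;
    if (is_write && csr_read_only(addr)) return false;
    return true;
}
```

For example, mstatus (0x300) requires machine mode, sstatus (0x100) requires at least supervisor mode, and cycle (0xC00) is readable from user mode but can never be written.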
Figure 2.1: CSR Address Space Organization
graph TB
subgraph "CSR Address Space (12-bit)"
subgraph "Bits [9:8]: Privilege Level"
M[00: User<br/>01: Supervisor<br/>10: Hypervisor<br/>11: Machine]
end
subgraph "Bits [11:10]: Read/Write"
RW[00: Read/Write<br/>01: Read/Write<br/>10: Read/Write<br/>11: Read-Only]
end
subgraph "Bits [7:0]: Register ID"
ID[256 possible registers<br/>per privilege level]
end
end
subgraph "Example CSRs"
MSTATUS[mstatus: 0x300<br/>Machine Status]
SSTATUS[sstatus: 0x100<br/>Supervisor Status]
CYCLE[cycle: 0xC00<br/>Cycle Counter Read-Only]
end
M --> MSTATUS
M --> SSTATUS
RW --> CYCLE
style M fill:#FFB6C1
style RW fill:#87CEEB
style ID fill:#90EE90
Machine-Level CSRs
Machine mode is the highest privilege level in RISC-V, with access to all CSRs. Key machine-level CSRs include:
mstatus (Machine Status): Controls and reports various aspects of processor state:
- MIE: Machine Interrupt Enable (global interrupt enable for M-mode)
- MPIE: Previous MIE value (saved when taking a trap)
- MPP: Previous privilege mode (saved when taking a trap)
- MPRV: Modify Privilege (affects memory access privilege)
- Various extension enable bits (FS for floating-point, VS for vector)
misa (Machine ISA): Indicates which extensions are implemented. Each bit corresponds to an extension (bit 0 = A extension, bit 12 = M extension, etc.). This register allows software to detect available features. On some implementations, misa is read-only; on others, it can be written to enable/disable extensions dynamically.
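Feature detection through misa is a one-line bit test, since extension letter A maps to bit 0, B to bit 1, and so on up to Z at bit 25. The helper below is an illustrative sketch in C that takes an already-read misa value as a parameter; actually reading the CSR requires machine mode and a CSR instruction, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the bit for the given extension letter ('A'..'Z') is set in misa. */
bool misa_has(uint64_t misa, char ext) {
    assert(ext >= 'A' && ext <= 'Z');
    return ((misa >> (ext - 'A')) & 1) != 0;
}
```

A misa value whose low bits are 0x1105 has bits 0, 2, 8, and 12 set, so it reports the A, C, I, and M extensions and nothing else.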
mie (Machine Interrupt Enable): Controls which interrupts are enabled. Each bit corresponds to an interrupt source (software interrupt, timer interrupt, external interrupt). Even if a bit is set in mie, interrupts are only taken if MIE in mstatus is also set.
mip (Machine Interrupt Pending): Indicates which interrupts are pending. Hardware sets bits when interrupts arrive; software can read mip to determine which interrupts are waiting.
mtvec (Machine Trap Vector): Specifies the address of the trap handler. The low 2 bits select the mode:
- 0 (Direct): All traps jump to the same address (mtvec & ~0x3)
- 1 (Vectored): Interrupts jump to (mtvec & ~0x3) + 4×cause, exceptions jump to (mtvec & ~0x3)
mepc (Machine Exception PC): Stores the program counter of the instruction that caused the trap (for exceptions) or the instruction to resume after handling an interrupt. The trap handler returns by executing MRET, which copies mepc back to the PC.
mcause (Machine Cause): Indicates what caused the trap. The high bit distinguishes interrupts (1) from exceptions (0). The low bits encode the specific cause (e.g., 2 = illegal instruction, 11 = environment call from M-mode).
mtval (Machine Trap Value): Provides additional information about the trap. For address-related exceptions (like page faults), mtval contains the faulting address. For illegal instruction exceptions, it may contain the instruction itself.
mscratch (Machine Scratch): A general-purpose register for machine-mode software. Typically used to save a register temporarily when entering a trap handler, before the handler has set up its stack.
Supervisor-Level CSRs
Supervisor mode is intended for operating systems. It has its own set of CSRs, analogous to the machine-level ones:
sstatus: A restricted view of mstatus, showing only fields relevant to supervisor mode (SIE, SPIE, SPP, etc.). Writing sstatus actually modifies the corresponding fields in mstatus.
sie, sip: Supervisor interrupt enable and pending registers, similar to mie/mip but for supervisor-level interrupts.
stvec, sepc, scause, stval, sscratch: Supervisor versions of the trap-handling CSRs, used when traps are delegated to supervisor mode.
satp (Supervisor Address Translation and Protection): Controls virtual memory:
- MODE: Selects the address translation scheme (Bare, Sv39, Sv48, etc.)
- ASID: Address Space Identifier for TLB tagging
- PPN: Physical page number of the root page table
The satp register is crucial for virtual memory. Writing to satp can change the address translation mode or switch to a different page table, enabling context switches between processes.
User-Level CSRs
User mode has access to a limited set of CSRs, primarily for performance monitoring and floating-point control:
fflags, frm, fcsr: Floating-point exception flags, rounding mode, and combined control/status register. These allow user code to control floating-point behavior and detect exceptions.
cycle, time, instret: Performance counters accessible from user mode (if not disabled by supervisor/machine mode). These provide the number of cycles elapsed, current time, and instructions retired, useful for profiling and timing.
CSR Instructions
RISC-V provides six CSR manipulation instructions:
CSRRW rd, csr, rs1 (CSR Read-Write): Atomically swap the value in csr with the value in rs1, writing the old CSR value to rd. If rd is x0, the read is suppressed (useful for write-only access).
CSRRS rd, csr, rs1 (CSR Read-Set): Read csr into rd, then set bits in csr corresponding to 1 bits in rs1. If rs1 is x0, this is a read-only operation.
CSRRC rd, csr, rs1 (CSR Read-Clear): Read csr into rd, then clear bits in csr corresponding to 1 bits in rs1.
CSRRWI, CSRRSI, CSRRCI: Immediate variants that use a 5-bit immediate value instead of a register.
These instructions are atomic, ensuring that CSR modifications aren’t interrupted. The read-set and read-clear operations are particularly useful for manipulating individual bits without affecting others.
Example: Enabling machine-mode interrupts:
# Set MIE bit in mstatus
li t0, 0x8 # MIE is bit 3
csrrs zero, mstatus, t0 # Set bit 3, discard old value
Example: Saving and modifying a CSR:
# Save current mstatus and disable interrupts
csrrci t0, mstatus, 0x8 # Clear MIE, save old value in t0
# ... critical section ...
csrw mstatus, t0 # Restore original mstatus
The CSR instructions provide controlled access to privileged state, enabling operating systems and firmware to manage the processor while preventing user code from interfering with system operation.
2.3 Program State and Privilege Levels
Privilege Modes
RISC-V defines three privilege levels, from lowest to highest:
User Mode (U-mode): The least privileged level, intended for application code. User mode cannot access most CSRs or execute privileged instructions. It typically runs with virtual memory enabled, isolating processes from each other and from the OS.
Supervisor Mode (S-mode): Intended for operating systems. Supervisor mode can manage virtual memory, handle traps delegated from machine mode, and access supervisor-level CSRs. It cannot access machine-level CSRs or certain privileged operations reserved for firmware.
Machine Mode (M-mode): The highest privilege level, intended for firmware and bootloaders. Machine mode has unrestricted access to all hardware resources. It can access all CSRs, execute all instructions, and delegate traps to lower privilege levels.
Not all implementations support all modes. A simple embedded system might implement only M-mode. A microcontroller might implement M-mode and U-mode. A full application processor implements all three modes.
The current privilege level is not stored in a dedicated register. Instead, it’s implicit in the processor state and can be inferred from CSRs like mstatus.MPP (previous privilege) after a trap.
Privilege Level Transitions
Transitions between privilege levels occur through well-defined mechanisms:
Trap to Higher Privilege: When an exception occurs or an interrupt arrives, the processor traps to a higher privilege level (or stays at the same level). The trap handler is determined by the xtvec CSR (mtvec for M-mode, stvec for S-mode). The processor saves the current PC in xepc, the cause in xcause, and additional information in xtval.
Return from Trap: The MRET and SRET instructions return from a trap, restoring the privilege level from xstatus.xPP and the PC from xepc. These instructions are privileged: MRET can only be executed in M-mode, SRET in S-mode or higher. (URET, defined by the draft N extension for user-level interrupts, was removed from the specification before ratification.)
Environment Call: The ECALL instruction explicitly requests a trap to a higher privilege level. User code uses ECALL to invoke OS services (system calls). OS code uses ECALL to invoke firmware services (SBI calls). The trap handler examines the calling context to determine which service was requested.
This controlled transition mechanism ensures that privilege escalation only occurs through defined entry points, maintaining system security.
Figure 2.2: Privilege Level Transitions
stateDiagram-v2
[*] --> MMode: Reset/Boot
MMode: Machine Mode (M-mode)
SMode: Supervisor Mode (S-mode)
UMode: User Mode (U-mode)
MMode --> SMode: MRET (return from M-trap)
MMode --> UMode: MRET (return from M-trap)
SMode --> UMode: SRET (return from S-trap)
SMode --> MMode: Exception/Interrupt<br/>(not delegated)
UMode --> SMode: ECALL (system call)<br/>Exception/Interrupt<br/>(delegated to S-mode)
UMode --> MMode: Exception/Interrupt<br/>(not delegated)
SMode --> SMode: Exception/Interrupt<br/>(delegated to S-mode)
note right of MMode
Highest privilege
Full hardware access
Firmware/Bootloader
end note
note right of SMode
OS kernel
Virtual memory control
Delegated traps
end note
note right of UMode
Application code
Restricted access
Virtual memory enabled
end note
The state diagram shows how privilege levels transition through traps (upward) and return instructions (downward). ECALL explicitly requests higher privilege, while exceptions and interrupts cause automatic transitions.
2.4 Calling Convention and ABI
The RISC-V Calling Convention
The calling convention defines how functions call each other: how arguments are passed, how return values are communicated, and which registers must be preserved. RISC-V's calling convention is specified in the RISC-V ELF psABI, a processor-specific supplement in the tradition of the System V ABI (Application Binary Interface) that many other architectures also follow.
The calling convention is a software convention, not enforced by hardware. The processor doesn’t care which registers you use for arguments. But following the convention ensures that code from different compilers and libraries can interoperate.
Argument Passing
Function arguments are passed in registers a0 through a7 (x10-x17). The first argument goes in a0, the second in a1, and so on:
int add(int x, int y, int z) {
return x + y + z;
}
Compiles to:
add:
add a0, a0, a1 # a0 = x + y
add a0, a0, a2 # a0 = (x + y) + z
ret
Arguments x, y, and z arrive in a0, a1, and a2. The result is returned in a0.
If a function has more than 8 arguments, the additional arguments are passed on the stack:
int sum9(int a, int b, int c, int d, int e, int f, int g, int h, int i) {
return a + b + c + d + e + f + g + h + i;
}
Arguments a through h are in a0-a7. Argument i is passed on the stack; on entry to sum9 it is at sp+0.
Return Values
Return values are passed in a0 and a1:
- Single return value (up to XLEN bits): a0
- Two return values or 128-bit value on RV64: a0 (low) and a1 (high)
For example, a function returning a 128-bit integer on RV64:
__int128 multiply(__int128 x, __int128 y);
Returns the low 64 bits in a0 and the high 64 bits in a1.
Structures and unions are handled specially:
- Small structs (≤ 2×XLEN bits) are returned in a0 and a1
- Larger structs are returned via a pointer: the caller allocates space and passes a pointer in a0; the function writes the result there and returns the pointer in a0
Caller-Saved vs Callee-Saved Registers
The calling convention divides registers into two categories:
Caller-saved (temporary registers): The caller must save these if it needs their values preserved across a function call. The called function is free to modify them.
- t0-t6 (x5-x7, x28-x31): Temporaries
- a0-a7 (x10-x17): Arguments/return values
- ra (x1): Return address (modified by call)
Callee-saved (saved registers): The called function must preserve these. If it uses them, it must save them on entry and restore them before returning.
- s0-s11 (x8-x9, x18-x27): Saved registers
- sp (x2): Stack pointer
Example:
function:
# Prologue: save callee-saved registers
addi sp, sp, -16
sd s0, 0(sp)
sd s1, 8(sp)
# Function body: can use s0, s1 freely
mv s0, a0
mv s1, a1
# ... computation ...
# Epilogue: restore callee-saved registers
ld s0, 0(sp)
ld s1, 8(sp)
addi sp, sp, 16
ret
Special Registers
Several registers have special roles:
Stack Pointer (sp): Points to the top of the stack. Must be preserved by callees. The stack grows downward (toward lower addresses).
Return Address (ra): Holds the return address for the current function. Set by call (or jal), used by ret (which is jalr zero, 0(ra)).
Frame Pointer (fp/s0): Optionally points to the base of the current stack frame. This is the same register as s0. Using a frame pointer simplifies debugging and stack unwinding, but costs a register.
Global Pointer (gp): Points to global data. Used for relaxation optimization—the linker can replace absolute addresses with gp-relative addresses, saving instructions. Typically set once at program startup and never changed.
Thread Pointer (tp): Points to thread-local storage (TLS). Each thread has its own TLS area. The OS sets tp when creating a thread.
2.5 Stack Frame Structure
Stack Layout
The stack is a region of memory used for:
- Local variables
- Saved registers
- Function arguments (beyond the first 8)
- Return addresses (for nested calls)
The stack grows downward. The stack pointer (sp) points to the top (lowest address) of the stack. Allocating stack space means subtracting from sp; deallocating means adding to sp.
A typical stack frame looks like:
Higher addresses
+------------------+
| Caller's frame |
+------------------+
| Arguments 9+ | ← Passed on stack
+------------------+
| Return address | ← Saved by this function if it makes calls
+------------------+
| Saved registers | ← Callee-saved (s0-s11)
+------------------+
| Local variables |
+------------------+
| Outgoing args | ← For functions this function calls
+------------------+ ← sp (stack pointer)
Lower addresses
Function Prologue and Epilogue
The prologue is code at the start of a function that sets up the stack frame:
function:
# Prologue
addi sp, sp, -32 # Allocate 32 bytes
sd ra, 24(sp) # Save return address
sd s0, 16(sp) # Save s0
sd s1, 8(sp) # Save s1
# (Local variables use sp+0 to sp+7)
# Function body
# ...
# Epilogue
ld ra, 24(sp) # Restore return address
ld s0, 16(sp) # Restore s0
ld s1, 8(sp) # Restore s1
addi sp, sp, 32 # Deallocate stack frame
ret
The epilogue is code at the end that tears down the stack frame and returns.
Frame Pointer
Some functions use a frame pointer (fp, which is s0). The frame pointer points to a fixed location in the stack frame, making it easier to access local variables and arguments:
function:
# Prologue with frame pointer
addi sp, sp, -32
sd ra, 24(sp)
sd s0, 16(sp) # Save old frame pointer
addi s0, sp, 32 # Set frame pointer to old sp
# Now can access locals relative to fp:
# Local var at fp-8, fp-16, etc.
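# Epilogue
addi sp, s0, -32 # Recover sp from fp before s0 is overwritten
ld ra, 24(sp) # Restore return address
ld s0, 16(sp) # Restore caller's frame pointer
addi sp, sp, 32 # Deallocate stack frame
ret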
Frame pointers are optional. They simplify debugging (debuggers can walk the stack) and exception handling, but cost a register.
Leaf Functions
A leaf function is one that doesn’t call any other functions. Leaf functions can often avoid saving ra and allocating a stack frame:
leaf_function:
# No prologue needed
add a0, a0, a1
ret
# No epilogue needed
This is more efficient but only works if the function doesn’t call anything and doesn’t need to save registers.
2.6 Comparison with ARM64 and MIPS
Register Count and Usage
All three architectures encode 32 integer register numbers, but they use them differently:
RISC-V:
- 32 registers (x0-x31)
- x0 is hardwired zero
- 31 usable registers
- Clear caller/callee-saved distinction
ARM64:
- 31 general-purpose registers (x0-x30)
- x31 is special: zero register or stack pointer depending on context
- 30 fully general registers
- Link register (x30) holds return address
MIPS:
- 32 registers ($0-$31)
- $0 is hardwired zero
- $31 is return address (ra)
- 30 usable general registers
Calling Conventions
RISC-V:
- Arguments: a0-a7 (8 registers)
- Return: a0-a1
- Caller-saved: t0-t6, a0-a7, ra
- Callee-saved: s0-s11
ARM64:
- Arguments: x0-x7 (8 registers)
- Return: x0-x1
- Caller-saved: x0-x18
- Callee-saved: x19-x28
MIPS:
- Arguments: $a0-$a3 (4 registers, fewer than RISC-V/ARM)
- Return: $v0-$v1
- Caller-saved: $t0-$t9
- Callee-saved: $s0-$s7
RISC-V and ARM64 are similar, both providing 8 argument registers. MIPS is older and provides only 4, which means more stack usage for functions with many arguments.
Special Registers
RISC-V:
- sp (x2): Stack pointer
- ra (x1): Return address
- gp (x3): Global pointer
- tp (x4): Thread pointer
ARM64:
- sp (x31 in some contexts): Stack pointer
- lr (x30): Link register (return address)
- No global pointer equivalent
- Thread pointer for TLS kept in a system register (TPIDR_EL0) rather than a general-purpose register
MIPS:
- sp ($29): Stack pointer
- ra ($31): Return address
- gp ($28): Global pointer
- No standard thread pointer
Zero Register
Both RISC-V and MIPS have a hardwired zero register (x0 / $0). ARM64’s x31 can act as zero in some contexts but is also used as the stack pointer, which is more complex.
The zero register is surprisingly useful:
- mv rd, rs is addi rd, rs, 0 (or add rd, rs, zero)
- li rd, imm is addi rd, zero, imm (when imm fits in 12 bits)
- Discarding results: add zero, a0, a1 (compute but discard)
🛠️ Hands-on Lab: Lab 2.1 — Your First RISC-V Function
This lab guides you through implementing a simple addition function, experiencing mixed C and Assembly programming.
Lab Objectives
- Understand how C passes arguments to an Assembly function (a0, a1)
- Understand how Assembly returns results to C (a0)
- Observe the Calling Convention in action
Code
Create a folder lab2 and create the following two files:
File 1: add_func.S (Assembly Implementation)
# add_func.S
.section .text
.global my_add # Declare my_add as global symbol for the linker
# Function prototype: int my_add(int a, int b);
# Input: a in a0, b in a1
# Output: result goes in a0
my_add:
# Observation point 1: a0, a1 are already filled by the caller
add a0, a0, a1 # Perform addition: a0 = a0 + a1
# Observation point 2: result is now in a0
ret # Return instruction (actually jalr x0, 0(ra))
# It jumps to the address stored in ra
File 2: main.c (C Driver Program)
// main.c
#include <stdio.h>
// Declare external assembly function
extern int my_add(int a, int b);
int main() {
int val1 = 10;
int val2 = 20;
int sum;
printf("About to call assembly function...\n");
// Call assembly function
// The compiler automatically puts val1 in a0, val2 in a1
sum = my_add(val1, val2);
printf("Result: %d + %d = %d\n", val1, val2, sum);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -o lab2_add main.c add_func.S
# Run (using QEMU User Mode)
qemu-riscv64 lab2_add
# Or use Spike + PK
spike pk lab2_add
Expected Output:
About to call assembly function...
Result: 10 + 20 = 30
🛠️ Hands-on Lab: Lab 2.2 — Analyzing Compiler-Generated Assembly
This lab lets you “reverse engineer” what C code looks like after compilation, verifying your understanding against the ABI specification.
Lab Objectives
- Learn to use objdump for disassembly
- Match the ABI to identify argument and return value register usage
- Identify Prologue and Epilogue structures
Code
Create test_abi.c:
// test_abi.c
int calculate(int a, int b, int c) {
int temp = a + b;
return temp * c;
}
int main() {
return calculate(2, 3, 4);
}
Compile and Analyze
# Compile (-O1 keeps the output short and readable; add -g if you also want debug info)
riscv64-unknown-elf-gcc -O1 -c test_abi.c -o test_abi.o
# Disassemble
riscv64-unknown-elf-objdump -d test_abi.o
What to Observe
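You should see output similar to the following. Exact instructions depend on the compiler version and flags; for example, on RV64 GCC typically emits the 32-bit forms addw and mulw for int arithmetic: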
0000000000000000 <calculate>:
0: 00b50533 add a0,a0,a1 # a0 = a + b (temp)
4: 02c50533 mul a0,a0,a2 # a0 = temp * c
8: 00008067 ret
000000000000000c <main>:
c: ff010113 addi sp,sp,-16 # Prologue: allocate stack
10: 00113423 sd ra,8(sp) # Save return address
14: 00200513 li a0,2 # First argument
18: 00300593 li a1,3 # Second argument
1c: 00400613 li a2,4 # Third argument
20: 00000097 auipc ra,0x0 # Prepare call
24: 000080e7 jalr ra # Call calculate
28: 00813083 ld ra,8(sp) # Epilogue: restore ra
2c: 01010113 addi sp,sp,16 # Free stack
30: 00008067 ret
Analysis Exercises
Answer the following questions (answers are in the code comments):
- Argument Passing: Which registers hold the arguments 2, 3, 4?
- Return Value: Which register holds calculate’s result?
- Prologue Purpose: Why does main start with addi sp, sp, -16?
- Why Save ra: main saves ra, but calculate doesn’t. Why?
💡 Hint: Because calculate is a Leaf Function (doesn’t call other functions), it doesn’t need to save ra. But main calls calculate, so it must save ra; otherwise jalr would overwrite it.
⚠️ Common Pitfalls
Pitfall 1: Misusing Saved Registers
Error Scenario: Freely modifying s0-s11 within a function without saving them first, causing crashes when returning to the caller.
# ❌ Wrong
my_bad_func:
mv s0, a0 # Directly use s0 without saving!
call another_func # Call another function
mv a0, s0 # Expect s0 to be unchanged...
ret # But caller's s0 has been corrupted!
# ✅ Correct
my_good_func:
addi sp, sp, -16
sd s0, 0(sp) # Save s0 first
sd ra, 8(sp) # Save ra (since we're calling)
mv s0, a0
call another_func
mv a0, s0
ld ra, 8(sp) # Restore ra
ld s0, 0(sp) # Restore s0
addi sp, sp, 16
ret
Pitfall 2: Forgetting to Save ra in Non-Leaf Functions
Error Scenario: A function calls another function but forgets to save ra, losing the return address.
# ❌ Wrong
my_func:
call helper # jalr writes to ra, overwriting original!
ret # Now ra points to wrong location
# ✅ Correct
my_func:
addi sp, sp, -16
sd ra, 8(sp) # Save ra before calling
call helper
ld ra, 8(sp) # Restore ra
addi sp, sp, 16
ret
Pitfall 3: Stack Misalignment
Error Scenario: The RISC-V calling convention requires 16-byte stack alignment. Violating this can cause crashes or subtle bugs.
# ❌ Wrong: 8-byte allocation (misaligned!)
addi sp, sp, -8
# ✅ Correct: Always use multiples of 16
addi sp, sp, -16
danieRTOS Reference: The Context Switch implementation in danieRTOS carefully follows these conventions, saving all callee-saved registers before switching tasks.
Summary
RISC-V’s programmer’s model provides a clean, regular interface between software and hardware. The 32 general-purpose registers (x0-x31) follow RISC tradition, with x0 hardwired to zero—a simple feature that enables many common operations without dedicated instructions. The Application Binary Interface (ABI) assigns conventional roles to registers: a0-a7 for arguments, t0-t6 for temporaries, s0-s11 for saved registers, and ra for the return address.
The calling convention balances efficiency and simplicity. Caller-saved registers (temporaries and arguments) allow callees to use them freely without saving. Callee-saved registers (s0-s11) preserve values across calls, enabling long-lived variables. The stack pointer (sp) and frame pointer (s0/fp) support stack frames for local variables and nested calls. This convention enables separate compilation and efficient function calls.
Control and Status Registers (CSRs) manage processor state and configuration. Unlike general-purpose registers, CSRs use a separate 12-bit address space and dedicated instructions (CSRRW, CSRRS, CSRRC). CSRs are partitioned by privilege level through their address bits: for example, machine-mode CSRs such as mstatus sit at 0x300-0x3FF, supervisor-mode CSRs such as sstatus at 0x100-0x1FF, and unprivileged CSRs such as the floating-point control registers at 0x000-0x0FF, with the read-only user counters (cycle, time, instret) starting at 0xC00.
RISC-V defines three privilege levels: Machine mode (M-mode) has full hardware access and handles initialization and low-level exceptions. Supervisor mode (S-mode) runs operating systems with virtual memory and controlled hardware access. User mode (U-mode) runs applications with restricted privileges. This clean separation enables secure, efficient systems from embedded microcontrollers (M-mode only) to full operating systems (M+S+U).
Compared to ARM64 and MIPS, RISC-V’s programmer’s model is cleaner and more consistent. ARM64 has similar register conventions but with quirks like x31’s dual role as zero register and stack pointer. MIPS shows its age with fewer argument registers and less consistent naming. RISC-V’s separate CSR address space is cleaner than ARM’s system register encoding or MIPS’s coprocessor 0 model.
The programmer’s model reflects RISC-V’s design philosophy: simplicity, regularity, and clean separation of concerns. These principles make RISC-V easier to learn, implement, and optimize than more complex architectures.
Chapter 3. Privilege Levels & Execution Environment
Part II — The RISC-V Execution Model
Modern processors must balance two competing needs: applications require isolation and protection, while operating systems need controlled access to hardware. RISC-V addresses this through a clean privilege architecture with three levels—Machine, Supervisor, and User—each with well-defined responsibilities and capabilities.
This chapter explores how RISC-V implements privilege separation, from the mandatory Machine mode that controls all hardware to the optional Supervisor and User modes that enable operating systems and applications. We’ll examine the Supervisor Binary Interface (SBI) that abstracts platform differences, the execution environment interface that defines how programs interact with their environment, and how RISC-V’s privilege model compares to ARM’s exception levels. Understanding these concepts is essential for anyone working with RISC-V system software, from firmware developers to OS kernel engineers.
🎯 Learning Objectives
After completing this chapter, you will be able to:
- Understand why layered protection is necessary: Grasp the core concepts of Isolation & Protection, and why modern processors must restrict applications from directly accessing hardware
- Master the privilege differences between M/S/U modes: Clearly distinguish the capabilities and limitations of Machine Mode (building manager), Supervisor Mode (corporate tenant), and User Mode (regular employee)
- Understand how SBI serves as M-mode’s service window: Grasp how the Supervisor Binary Interface acts as the standard communication bridge between S-mode and M-mode
- Use ecall to request services: Understand how applications “make a phone call” to the operating system or firmware via the ecall instruction to obtain needed services
💡 Scenario: The Smart Building’s Access Card — Understanding Privilege Levels
Scene: In front of the lab whiteboard. Junior points at the screen showing “Illegal Instruction Exception” with a puzzled look. Architect sets down his coffee cup and adjusts his glasses.
Junior: “Architect, I just wanted to temporarily disable interrupts in my application to get more precise timing. Why did the CPU throw an error and kick me out? I bought this board myself!”
Architect: “Junior, you’re thinking about the CPU too simply. Modern processor design is like a ‘smart office building’ — for security, there must be strict access levels.”
Junior: “Access levels?”
Architect: “Think about it this way:
- M-mode (Machine Mode) is the ‘Building Manager’: This is the highest authority. They have the master key to the entire building and can directly control the main power switches (hardware reset, clock configuration). Only they can talk directly to the building’s infrastructure.
- S-mode (Supervisor Mode) is the ‘Corporate Tenant’: This is our operating system (OS). It rents several floors and can decide how to arrange the desks (memory management) and who sits where (scheduling), but it can’t cut power to the entire building or interfere with other companies’ floors.
- U-mode (User Mode) is the ‘Regular Employee’: This is your application. You can only work within your assigned desk area (allocated memory). Want to adjust the AC? No way. Want to flip the main power switch? Not a chance.”
Junior: “So what if I’m actually cold and want to adjust the AC? (meaning: need hardware resources)”
Architect: “You have to ‘call the front desk.’ In RISC-V, this is called ecall (Environment Call). You make a request, the OS (S-mode) checks if you’re authorized, and if reasonable, the OS does it for you. If it involves lower-level hardware, the OS has to request M-mode’s help.”
Junior: “I see! So when I tried to disable interrupts, it was like an intern trying to run to the server room and pull the main breaker?”
Architect: “Exactly. Security (the CPU’s hardware exception mechanism) immediately stopped you. This layered protection is the key reason why the system doesn’t crash completely because of one bad program.”
┌─────────────────────────────────────────┐
│ M-mode (Building Manager) │
│ ┌─────────────────────────────────┐ │
│ │ S-mode (Corporate Tenant/OS) │ │
│ │ ┌─────────────────────────┐ │ │
│ │ │ U-mode (Employee/App) │ │ │
│ │ │ │ │ │
│ │ │ ecall ──────────────►│────┼───►│ Call the front desk
│ │ │ ◄─────────────────── │◄───┼────│ Response
│ │ └─────────────────────────┘ │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
💡 Key Insight: Privilege levels aren’t meant to restrict you — they’re meant to protect the entire system. Just like building access control isn’t meant to hassle employees, but to prevent one person’s mistake from causing a building-wide power outage.
3.1 RISC-V Privilege Architecture
The Privilege Model
RISC-V’s privilege architecture is elegantly simple compared to other modern processors. Where ARM defines four exception levels and x86 has rings 0-3 (though only 0 and 3 are commonly used), RISC-V defines just three privilege modes: Machine (M), Supervisor (S), and User (U).
This simplicity is intentional. The RISC-V designers observed that most systems need only two or three privilege levels: one for applications, one for the operating system, and one for firmware. Additional levels add complexity without proportional benefit for most use cases.
The privilege modes form a hierarchy:
- Machine mode is the highest privilege level with unrestricted access to all hardware
- Supervisor mode is intended for operating systems, with controlled access to privileged operations
- User mode is the lowest privilege level, intended for applications with minimal privileges
Importantly, not all modes are mandatory. A simple embedded system might implement only M-mode. A microcontroller with basic memory protection might implement M-mode and U-mode. A full application processor running Linux implements all three modes.
Machine Mode (M-mode)
Machine mode is the only mandatory privilege level in RISC-V. Every RISC-V processor must implement M-mode, even if it implements no other privilege levels.
M-mode has complete and unrestricted access to the entire system. It can:
- Access all memory and I/O devices
- Read and write all CSRs
- Execute all instructions
- Configure and delegate traps to lower privilege levels
- Control physical memory protection (PMP)
When a RISC-V processor resets, it starts executing in M-mode. The first code to run—typically a bootloader or firmware—executes in M-mode. This code initializes the hardware, sets up memory protection, and may eventually transfer control to supervisor-mode software (like an operating system) or directly to user-mode applications.
In embedded systems without an operating system, all code may run in M-mode. There’s no requirement to use lower privilege levels if they’re not needed. This flexibility allows RISC-V to scale from simple microcontrollers to complex application processors.
M-mode software typically implements:
- Bootloader: Initial code that runs after reset
- Firmware: Low-level hardware initialization and runtime services
- SBI (Supervisor Binary Interface): Services for supervisor-mode software
- Trap handlers: For traps that aren’t delegated to lower privilege levels
Supervisor Mode (S-mode)
Supervisor mode is optional but nearly universal in application processors. It’s designed for operating system kernels that manage multiple user processes.
S-mode has more privileges than U-mode but less than M-mode. It can:
- Control virtual memory (page tables, TLB)
- Handle traps delegated from M-mode
- Access supervisor-level CSRs
- Execute privileged instructions (like SFENCE.VMA for TLB management)
S-mode cannot:
- Access machine-level CSRs
- Directly control physical memory protection
- Handle traps not delegated to it
- Access I/O devices not mapped into its address space
This restricted access is intentional. M-mode firmware retains ultimate control over the hardware, while S-mode software (the OS kernel) manages virtual memory and user processes. This separation allows the firmware to provide platform-specific services while the OS remains platform-independent.
Operating systems like Linux, FreeBSD, and real-time operating systems run in S-mode on RISC-V. They use S-mode privileges to:
- Manage virtual memory for process isolation
- Handle system calls from user applications
- Manage interrupts and exceptions
- Schedule processes and manage resources
User Mode (U-mode)
User mode is the lowest privilege level, intended for application code. Like S-mode, U-mode is optional, but it’s implemented in any system that needs to isolate applications from each other and from the OS.
U-mode has minimal privileges. It can:
- Execute unprivileged instructions
- Access memory mapped into its virtual address space
- Read a few user-accessible CSRs (like performance counters)
- Request services from higher privilege levels via ECALL
U-mode cannot:
- Access privileged CSRs
- Execute privileged instructions
- Directly access I/O devices
- Modify page tables or TLB
- Disable interrupts
When U-mode code needs a privileged operation (like file I/O or memory allocation), it uses the ECALL instruction to trap to S-mode. The OS kernel examines the trap, performs the requested operation if permitted, and returns to U-mode.
This isolation is fundamental to modern operating systems. Each user process runs in U-mode with its own virtual address space. Processes cannot interfere with each other or with the kernel. If a process crashes, it doesn’t affect other processes or the system.
Hypervisor Extension (H-mode)
The hypervisor extension adds support for virtualization, allowing multiple operating systems to run simultaneously on the same hardware. S-mode and U-mode are optional on paper but present in nearly every application processor; the hypervisor extension, by contrast, is needed only for virtualization use cases.
The H extension doesn’t add a new privilege level. Instead, it adds:
- VS-mode (Virtual Supervisor): Guest OS kernel mode
- VU-mode (Virtual User): Guest OS user mode
- Two-stage address translation
- Additional CSRs for virtualization control
A hypervisor runs in HS-mode (Hypervisor-extended Supervisor mode) and manages multiple guest operating systems. Each guest OS runs in VS-mode, believing it’s in S-mode. The hypervisor intercepts certain operations and provides virtualized hardware to each guest.
This extension is important for cloud computing and server virtualization, but most embedded systems and even many application processors don’t implement it.
Figure 3.1a: RISC-V Privilege Hierarchy
graph TB
M[Machine Mode M-mode<br/>Firmware, Bootloader, SBI<br/>Full hardware access]
S[Supervisor Mode S-mode<br/>OS Kernel<br/>Virtual memory, delegated traps]
U[User Mode U-mode<br/>Applications<br/>Minimal privileges]
M -->|MRET| S
M -->|MRET| U
S -->|SRET| U
U -->|ECALL, Exception| S
S -->|Exception not delegated| M
U -->|Exception not delegated| M
style M fill:#FFB6C1
style S fill:#87CEEB
style U fill:#90EE90
Figure 3.1b: Hypervisor Extension (Optional)
graph TB
HS[HS-mode<br/>Hypervisor<br/>Manages guest VMs]
VS[VS-mode<br/>Guest OS Kernel<br/>Virtualized supervisor]
VU[VU-mode<br/>Guest Applications<br/>Virtualized user mode]
HS -->|Return| VS
HS -->|Return| VU
VS -->|Return| VU
VU -->|Trap| VS
VS -->|Trap| HS
style HS fill:#DDA0DD
style VS fill:#F0E68C
style VU fill:#98FB98
3.2 Privilege Levels vs ARM Exception Levels
RISC-V’s Three-Level Model
RISC-V’s M/S/U privilege model is deliberately minimal. Three levels suffice for most systems:
- M-mode for firmware and platform-specific code
- S-mode for the OS kernel
- U-mode for applications
This simplicity has advantages:
- Easier to understand and implement
- Fewer privilege transitions mean less overhead
- Clear separation of concerns
The model is also flexible. Systems that don’t need all three levels can omit S-mode or U-mode. A bare-metal embedded system might use only M-mode. A simple RTOS might use M-mode and U-mode without S-mode.
ARM’s Four-Level Model
ARM takes a different approach with four exception levels (ELs):
- EL0: Applications (like RISC-V U-mode)
- EL1: OS kernel (like RISC-V S-mode)
- EL2: Hypervisor (for virtualization)
- EL3: Secure monitor (for TrustZone)
Additionally, ARM’s TrustZone creates two parallel worlds—Secure and Non-secure—each with its own set of exception levels. This creates considerable complexity:
- EL3 manages transitions between Secure and Non-secure worlds
- Each world has its own EL0 and EL1, and from Armv8.4-A onward the Secure world can also have its own EL2
- Different exception levels have different capabilities
The ARM model addresses real needs. EL3 and TrustZone provide hardware-enforced security isolation, important for mobile devices handling sensitive data like payment credentials. EL2 enables efficient virtualization for servers and cloud computing.
But this complexity comes at a cost:
- More complex privilege transitions
- More CSRs and state to manage
- Steeper learning curve
- More implementation complexity
Comparison of Privilege Transitions
The mechanisms for changing privilege levels differ between RISC-V and ARM, though the concepts are similar.
RISC-V Transitions:
- Upward (to higher privilege): Exception, interrupt, or ECALL instruction
- Downward (to lower privilege): MRET or SRET instruction
- Trap cause stored in xcause CSR
- Return address stored in xepc CSR
- Trap handler address from xtvec CSR
ARM Transitions:
- Upward: Exception, interrupt, or a call instruction (SVC to EL1, HVC to EL2, SMC to EL3)
- Downward: ERET (exception return) instruction
- Exception syndrome stored in ESR_ELx
- Return address stored in ELR_ELx
- Vector table base in VBAR_ELx
The concepts are nearly identical—both architectures save state, jump to a handler, and provide a return instruction. The main differences are naming and the number of levels involved.
RISC-V’s simpler model means fewer cases to handle. An exception in U-mode might go to S-mode (if delegated) or M-mode (if not). In ARM, an exception in EL0 might go to EL1, EL2, or EL3 depending on configuration, and might involve a world switch if TrustZone is involved.
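The RISC-V routing rule for exceptions can be sketched in a few lines of C. This is a simplified model rather than hardware behavior in full: `medeleg` is the machine exception delegation CSR, and the enum and function names are illustrative assumptions.

```c
#include <stdint.h>

/* Privilege encodings as used in mstatus.MPP: 0 = U, 1 = S, 3 = M. */
enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Return the mode that handles a synchronous exception.
 * cur:     mode that was executing when the exception occurred
 * code:    exception code (the low bits of xcause)
 * medeleg: delegation bitmask; bit N set means exception N goes to S-mode
 */
enum priv exception_target(enum priv cur, unsigned code, uint64_t medeleg)
{
    /* A trap never moves to a less-privileged mode, so exceptions raised
     * in M-mode are always handled in M-mode. */
    if (cur == PRIV_M)
        return PRIV_M;

    /* From S-mode or U-mode: delegated exceptions go to S-mode. */
    if (code < 64 && ((medeleg >> code) & 1))
        return PRIV_S;

    /* Everything else is handled by M-mode. */
    return PRIV_M;
}
```

For example, with bit 8 of `medeleg` set, `exception_target(PRIV_U, 8, medeleg)` yields `PRIV_S`; with the bit clear, the same ecall lands in M-mode.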
Security Model Comparison
Security is where the architectures diverge most significantly.
ARM TrustZone: ARM’s TrustZone creates two parallel execution environments—Secure and Non-secure worlds. Each world has its own:
- Memory regions (some memory is Secure-only)
- Peripherals (some devices are Secure-only)
- Exception levels (EL0 and EL1 in each world, plus EL2 where implemented; Secure EL2 requires Armv8.4-A or later)
The Secure world can access Non-secure resources, but not vice versa. EL3 (Secure monitor) manages transitions between worlds. This provides strong isolation for security-critical code like cryptographic operations, DRM, and payment processing.
TrustZone is available on virtually all ARM Cortex-A processors and widely used in mobile devices. It’s proven effective for protecting sensitive operations from compromised OS kernels.
RISC-V PMP and ePMP: RISC-V takes a different approach with Physical Memory Protection (PMP). PMP allows M-mode to define memory regions with specific access permissions for lower privilege levels.
PMP provides:
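- Up to 64 memory regions, depending on the implementation
- Per-region permissions (read, write, execute)
- Enforcement on S-mode and U-mode accesses, and on M-mode accesses for regions marked as locked
- Flexible region sizes and alignment
The enhanced PMP (ePMP) extension, ratified as Smepmp, adds:
- Machine Mode Lockdown, so that rules can deny M-mode access to memory belonging to lower privilege levels (base PMP can already lock a region until reset)
- An option to deny M-mode accesses that no rule explicitly permits
- Better support for security use cases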
PMP is simpler than TrustZone but less comprehensive. It doesn’t provide separate execution environments or world switching. For many use cases, PMP suffices. For high-security applications requiring strong isolation, additional mechanisms may be needed.
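To make the PMP permission check concrete, here is a deliberately simplified model in C. Real PMP entries are encoded in the pmpcfg and pmpaddr CSRs with dedicated address-matching modes; the explicit `base`/`size` fields, the struct, and the function name below are assumptions chosen for readability, and partially overlapping accesses are ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PMP_R = 1, PMP_W = 2, PMP_X = 4 };

struct pmp_region {
    uint64_t base;   /* first byte covered by the region */
    uint64_t size;   /* length of the region in bytes */
    unsigned perms;  /* OR of PMP_R, PMP_W, PMP_X */
    bool locked;     /* locked regions also constrain M-mode */
};

/* Decide whether an access of type `want` to `addr` is allowed.
 * The lowest-numbered matching region decides the outcome. */
bool pmp_allows(const struct pmp_region *r, size_t n,
                bool m_mode, uint64_t addr, unsigned want)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= r[i].base && addr - r[i].base < r[i].size) {
            /* M-mode ignores the permissions of unlocked regions. */
            if (m_mode && !r[i].locked)
                return true;
            return (r[i].perms & want) == want;
        }
    }
    /* No region matched: M-mode succeeds, S-mode and U-mode are denied. */
    return m_mode;
}
```

The default-deny result for S-mode and U-mode is the important design point: lower privilege levels see only the memory that M-mode has explicitly opened up.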
RISC-V’s approach is to keep the base architecture simple and allow extensions for specific security needs. Custom extensions can add TrustZone-like features if required. This flexibility allows implementations to choose the right security model for their use case without mandating complexity for systems that don’t need it.
Figure 3.2a: RISC-V Privilege Levels
graph TB
RV_M[M-mode<br/>Machine Mode<br/>Firmware, SBI]
RV_S[S-mode<br/>Supervisor Mode<br/>OS Kernel]
RV_U[U-mode<br/>User Mode<br/>Applications]
RV_M --> RV_S
RV_S --> RV_U
style RV_M fill:#FFB6C1
style RV_S fill:#87CEEB
style RV_U fill:#90EE90
Figure 3.2b: ARM Exception Levels
graph TB
ARM_EL3[EL3<br/>Secure Monitor<br/>TrustZone]
ARM_EL2[EL2<br/>Hypervisor<br/>Virtualization]
ARM_EL1[EL1<br/>OS Kernel<br/>Privileged OS]
ARM_EL0[EL0<br/>Applications<br/>User mode]
ARM_EL3 --> ARM_EL2
ARM_EL2 --> ARM_EL1
ARM_EL1 --> ARM_EL0
style ARM_EL3 fill:#FF6B6B
style ARM_EL2 fill:#FFA07A
style ARM_EL1 fill:#87CEEB
style ARM_EL0 fill:#90EE90
Figure 3.2c: ARM TrustZone Architecture
graph TB
MONITOR[EL3<br/>Secure Monitor<br/>World switching]
SECURE[Secure World<br/>EL0-EL2<br/>Trusted execution]
NONSECURE[Non-secure World<br/>EL0-EL2<br/>Normal execution]
MONITOR --> SECURE
MONITOR --> NONSECURE
style MONITOR fill:#FF6B6B
style SECURE fill:#98FB98
style NONSECURE fill:#FFD700
Trade-offs and Use Cases
Neither model is universally superior—each makes different trade-offs.
RISC-V advantages:
- Simpler to understand and implement
- Lower overhead for privilege transitions
- Flexible—implement only what you need
- Easier to verify and validate
ARM advantages:
- TrustZone provides strong security isolation
- Four exception levels enable finer-grained privilege separation
- Mature ecosystem with proven security solutions
- Hypervisor support is standard (EL2)
For embedded systems and microcontrollers, RISC-V’s simplicity is often preferable. For mobile devices handling sensitive data, ARM’s TrustZone has proven valuable. For servers and cloud computing, both architectures can provide adequate virtualization support (RISC-V via the H extension, ARM via EL2).
The key insight is that RISC-V’s modular approach allows adding complexity where needed (via extensions like H for virtualization) while keeping the base simple. ARM’s approach is to define comprehensive features in the core architecture, which ensures consistency but means system software and tooling must account for that complexity even on systems that don’t use every feature.
3.3 Execution Environment Interface (EEI)
What is an EEI?
The Execution Environment Interface (EEI) defines the interface between a program and its execution environment. It specifies:
- Which instructions are available
- How system calls are made
- How the program interacts with I/O
- Memory layout and addressing
- Interrupt and exception handling
Different privilege levels have different EEIs. A user-mode program has a different EEI than a supervisor-mode kernel, which has a different EEI than machine-mode firmware.
The EEI concept is important because it separates the ISA (which instructions exist) from the execution environment (how those instructions interact with the system). The same RISC-V ISA can support different EEIs for different use cases.
Application Execution Environment (AEE)
The Application Execution Environment is the EEI for user-mode programs. It defines what applications can do and how they interact with the operating system.
A typical AEE provides:
- Virtual memory with process isolation
- System calls via ECALL instruction
- Standard library functions
- File I/O, networking, and other OS services
- Signal handling for asynchronous events
The AEE is usually defined by the operating system and ABI (Application Binary Interface). For example, Linux on RISC-V defines a specific AEE that includes:
- System call numbers and calling convention
- Signal delivery mechanism
- Virtual memory layout
- Thread-local storage access
Applications written for this AEE can run on any RISC-V Linux system, regardless of the underlying hardware.
Supervisor Execution Environment (SEE)
The Supervisor Execution Environment is the EEI for OS kernels running in S-mode. It defines how the kernel interacts with the underlying firmware and hardware.
The SEE is typically provided by M-mode firmware through the Supervisor Binary Interface (SBI). The SBI defines services that M-mode provides to S-mode, such as:
- Timer management
- Inter-processor interrupts (IPI)
- Remote fence operations (TLB shootdown)
- System reset and shutdown
- Console I/O (for debugging)
By providing these services through SBI, the firmware abstracts platform-specific details. The OS kernel can be platform-independent, calling SBI functions instead of directly accessing hardware. This is similar to BIOS/UEFI on x86 or ARM’s PSCI (Power State Coordination Interface).
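The shape of an SBI call can be illustrated with a host-side mock. In the SBI calling convention the extension ID travels in a7, the function ID in a6, and the result comes back as an error code in a0 plus a value in a1. The sketch below replaces the real ecall with an ordinary C function, so `mock_sbi_ecall` and `mock_next_timer` are hypothetical stand-ins for M-mode firmware, not part of any real SBI implementation.

```c
#include <stdint.h>

/* Result pair returned by every SBI call: error in a0, value in a1. */
struct sbiret { long error; long value; };

#define SBI_SUCCESS            0
#define SBI_ERR_NOT_SUPPORTED (-2)
#define SBI_EXT_TIME           0x54494D45L  /* "TIME" in ASCII */

/* Hypothetical firmware state: the deadline requested by the kernel. */
uint64_t mock_next_timer;

/* Stand-in for the trap into M-mode. Real S-mode code would load a7/a6/a0
 * and execute ecall instead of calling a C function. */
struct sbiret mock_sbi_ecall(long eid, long fid, uint64_t arg0)
{
    struct sbiret ret = { SBI_ERR_NOT_SUPPORTED, 0 };

    if (eid == SBI_EXT_TIME && fid == 0) {  /* function 0: set timer */
        mock_next_timer = arg0;
        ret.error = SBI_SUCCESS;
    }
    return ret;
}

/* Kernel-side wrapper: the OS calls this instead of touching timer hardware. */
struct sbiret sbi_set_timer(uint64_t stime_value)
{
    return mock_sbi_ecall(SBI_EXT_TIME, 0, stime_value);
}
```

The kernel only ever sees `sbi_set_timer()`; how the firmware programs the platform timer stays hidden behind the interface, which is what makes the kernel portable.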
Bare-Metal Execution Environment
Not all RISC-V systems run operating systems. Embedded systems often run “bare-metal” code directly on the hardware without an OS.
In a bare-metal environment:
- Code runs in M-mode with full hardware access
- No virtual memory or process isolation
- Direct access to all peripherals
- Custom interrupt handlers
- Application-specific memory layout
The bare-metal EEI is defined by the hardware platform and any runtime library used. For example, a microcontroller might provide:
- Startup code that initializes the hardware
- Interrupt vector table
- Basic I/O functions
- Memory map documentation
Bare-metal programming is common in embedded systems, IoT devices, and real-time applications where the overhead of an OS is unacceptable or unnecessary.
🛠️ Lab 3.1: The Ecall Elevator
This lab lets you “see” a privilege mode transition as it happens. We’ll use QEMU to simulate a simple bare-metal environment and observe how a User Mode program “takes the elevator” to a higher privilege level via ecall.
Objectives
- Understand how the ecall instruction triggers an exception and traps to a higher privilege level
- Observe the mcause register value to confirm the exception type is “Environment call from U-mode”
- Observe the mepc register to confirm it points to the ecall instruction address
Environment Requirements
- QEMU RISC-V emulator (qemu-system-riscv64)
- RISC-V GCC toolchain (riscv64-unknown-elf-gcc)
- GDB debugger
Code
File: ecall_elevator.S
# Lab 3.1: The Ecall Elevator
# Observe how ecall transitions from U-mode to M-mode
.section .text
.global _start
# ============================================================
# M-mode Initialization and Trap Handler Setup
# ============================================================
_start:
# Set up Trap Handler
la t0, trap_handler
csrw mtvec, t0
# Set up User Stack (simplified: using fixed address)
li sp, 0x80010000
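# Grant U-mode access to memory through PMP entry 0.
# Without a matching PMP entry, the first U-mode instruction fetch
# raises an access fault (mcause = 1) and ecall is never reached.
li t0, 0x3fffffffffffff # TOR upper bound: all of physical memory
csrw pmpaddr0, t0
li t0, 0xf # A=TOR, R=W=X=1
csrw pmpcfg0, t0
# Prepare to switch to U-mode
# mstatus.MPP = 0 (U-mode), mstatus.MPIE = 1
li t0, (0 << 11) | (1 << 7) # MPP=0 (U-mode), MPIE=1
csrw mstatus, t0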
# Set return address (mepc = user_code)
la t0, user_code
csrw mepc, t0
# Switch to U-mode
mret
# ============================================================
# User Mode Code (U-mode)
# ============================================================
user_code:
# We're now executing in U-mode
# Prepare syscall arguments
li a7, 100 # syscall number = 100 (custom)
li a0, 42 # arg0 = 42
# Press the elevator button!
ecall # <-- Set breakpoint here to observe
# After ecall returns, a0 contains the return value
# (This example returns a0 + 1 = 43)
user_loop:
j user_loop # Infinite loop
# ============================================================
# M-mode Trap Handler
# ============================================================
.align 4
trap_handler:
# === Observation Point 1: Read exception cause ===
csrr t0, mcause # t0 = exception cause
# U-mode ecall: mcause = 8 (Environment call from U-mode)
# === Observation Point 2: Read exception address ===
csrr t1, mepc # t1 = address where ecall occurred
# === Observation Point 3: Read previous privilege state ===
csrr t2, mstatus # mstatus.MPP shows previous mode
# Simple syscall handling: return a0 + 1
addi a0, a0, 1 # return value = argument + 1
# Skip past ecall instruction (ecall is 4 bytes)
addi t1, t1, 4
csrw mepc, t1
# Return to U-mode
mret
Execution Steps
1. Compile the program
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -T linker.ld \
-o ecall_elevator.elf ecall_elevator.S
2. Start QEMU and connect GDB
# Terminal 1: Start QEMU
# -bios none keeps QEMU from loading its default firmware, so our code runs in M-mode
qemu-system-riscv64 -machine virt -nographic -bios none \
-kernel ecall_elevator.elf -S -gdb tcp::1234
# Terminal 2: Connect GDB
riscv64-unknown-elf-gdb ecall_elevator.elf
(gdb) target remote :1234
(gdb) break trap_handler
(gdb) continue
3. Observe key registers
When the breakpoint triggers, execute in GDB:
# Check exception cause
(gdb) print/x $mcause
# Expected: 0x8 (Environment call from U-mode)
# Check ecall instruction address
(gdb) print/x $mepc
# Expected: points to ecall address in user_code
# Check previous privilege state
(gdb) print/x $mstatus
# Check MPP bits (bit 12:11): 00 = U-mode
Key Observations
| Register | Value | Meaning |
|---|---|---|
| mcause | 0x8 | Exception Code 8 = Environment call from U-mode |
| mepc | ecall address | Instruction address when trap occurred |
| mstatus.MPP | 0b00 | Previously in U-mode (00=U, 01=S, 11=M) |
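GDB prints mstatus as a single hexadecimal number, so picking out MPP by eye is error-prone. A small helper like the following can decode the value on your host machine. The function names are illustrative; the bit positions (MIE at bit 3, MPIE at bit 7, MPP at bits 12:11) follow the privileged specification.

```c
#include <stdint.h>

/* Extract the fields of mstatus that matter for trap entry and return. */
unsigned mstatus_mie(uint64_t mstatus)  { return (mstatus >> 3) & 0x1; }
unsigned mstatus_mpie(uint64_t mstatus) { return (mstatus >> 7) & 0x1; }
unsigned mstatus_mpp(uint64_t mstatus)  { return (mstatus >> 11) & 0x3; }

/* Translate an MPP encoding into a mode name. */
const char *priv_name(unsigned mpp)
{
    switch (mpp) {
    case 0:  return "U";
    case 1:  return "S";
    case 3:  return "M";
    default: return "reserved";
    }
}
```

Paste the value that GDB shows into `mstatus_mpp()`; other bits of the register may also be set, which is why masking is safer than comparing the whole value.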
Food for Thought
💭 Question: Why does the Trap Handler need to add 4 to mepc before returning?
Answer: If we don’t skip past ecall, mret will return to the same ecall instruction, causing an infinite trap loop! This is why real-world syscall handlers (like in danieRTOS) include ctx[CTX_MEPC] += 4 to advance past the ecall.
⚠️ Common Pitfalls
Pitfall 1: Bare-Metal Mindset Carryover
Misconception: “I’m used to writing bare-metal code on embedded systems. RISC-V should just let me access CSRs directly.”
Reality: Many RISC-V development boards run Linux, and your program executes in U-mode where you simply cannot access M-mode CSRs.
// ❌ Wrong: Trying to read mstatus in Linux User Space
#include <stdio.h>
int main() {
unsigned long mstatus;
asm volatile ("csrr %0, mstatus" : "=r"(mstatus));
// Result: Illegal Instruction Exception, program killed by SIGILL
printf("mstatus = 0x%lx\n", mstatus);
return 0;
}
// ✅ Correct: Use system calls to get system information
#include <stdio.h>
#include <sys/utsname.h>
int main() {
struct utsname buf;
uname(&buf); // Request kernel to look it up via syscall
printf("Machine: %s\n", buf.machine);
return 0;
}
Diagnosis:
If your program mysteriously crashes, first check whether you’re using csrr/csrw to access M-mode or S-mode specific CSRs. In a Linux environment, only a few CSRs (like cycle, time) can be read from U-mode, and only when the firmware and kernel have enabled access to them.
Pitfall 2: Forgetting to Skip Past ecall
Symptom: After the Trap Handler finishes, the CPU enters an infinite trap loop.
Cause: mepc still points to the ecall instruction. After mret, it immediately executes ecall again, triggering another trap.
# ❌ Wrong: Not updating mepc
trap_handler:
csrr t0, mcause
# ... handle syscall ...
mret # Returns to same ecall, infinite loop!
# ✅ Correct: Skip past ecall instruction
trap_handler:
csrr t0, mcause
csrr t1, mepc
addi t1, t1, 4 # ecall is 4 bytes
csrw mepc, t1
# ... handle syscall ...
mret # Returns to instruction after ecall
Pitfall 3: Confusing M/S/U-Specific CSRs
Symptom: Want to read trap information but used the wrong CSR prefix.
Explanation: RISC-V CSRs have different prefixes based on privilege level:
| Prefix | Privilege Level | Examples |
|---|---|---|
| m | Machine | mstatus, mcause, mepc, mtvec |
| s | Supervisor | sstatus, scause, sepc, stvec |
| none | User (some readable) | cycle, time, instret |
# ❌ Wrong: Reading S-mode CSR in M-mode Trap Handler
trap_handler:
csrr t0, scause # Reads the cause of the last trap taken to S-mode, not this M-mode trap!
# ✅ Correct: Use M-mode CSRs in M-mode
trap_handler:
csrr t0, mcause # Read M-mode's exception cause
💡 Memory Tip: Use the CSRs for the level you’re in. M-mode uses m*, S-mode uses s*.
Summary
RISC-V’s privilege architecture provides a clean, flexible model for separating system software responsibilities. The three privilege levels—Machine (M-mode), Supervisor (S-mode), and User (U-mode)—form a hierarchy where each level has well-defined capabilities and restrictions. M-mode is mandatory and has unrestricted hardware access, making it suitable for firmware and bootloaders. S-mode is optional and designed for operating systems, with controlled access to privileged operations and virtual memory. U-mode is optional and intended for applications, with minimal privileges and strong isolation.
The privilege model is more flexible than it first appears. Simple embedded systems implement only M-mode. Microcontrollers with basic protection implement M-mode and U-mode. Full application processors running Linux implement all three modes. The Hypervisor extension adds VS-mode and VU-mode for virtualization, enabling multiple guest operating systems to run on a single processor.
The Supervisor Binary Interface (SBI) provides a standardized interface between M-mode firmware and S-mode operating systems. SBI abstracts platform-specific details, allowing OS kernels to be portable across different RISC-V implementations. Key SBI services include timer management, inter-processor interrupts, remote fence operations, and system reset. OpenSBI provides a reference implementation that supports numerous platforms.
The Execution Environment Interface (EEI) defines how programs interact with their execution environment. The Application Execution Environment (AEE) is what user programs see—system calls, standard library functions, and OS services. The Supervisor Execution Environment (SEE) is what the OS kernel sees—SBI calls, hardware access, and platform services. Bare-metal environments provide direct hardware access without an OS layer.
Compared to ARM’s four exception levels (EL0-EL3), RISC-V’s three privilege levels are simpler and more flexible. ARM’s EL3 (Secure Monitor) and EL2 (Hypervisor) are defined by the ARMv8-A architecture itself, although an implementation may omit either level. RISC-V makes S-mode and U-mode optional, and adds hypervisor support as an extension. This modularity allows RISC-V to scale from tiny microcontrollers to high-performance servers without carrying unnecessary complexity.
The privilege architecture reflects RISC-V’s design philosophy: provide minimal mandatory features, make everything else optional, and maintain clean separation of concerns. This approach enables efficient implementations across a wide range of applications while preserving the flexibility to add advanced features when needed.
Chapter 4. Trap, Exception, Interrupt
Part III — Control Transfer & Exception System
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Distinguish Exception from Interrupt: Understand the fundamental difference between synchronous and asynchronous traps
- Master the Trap Handling Flow: Understand the roles of mtvec, mepc, mcause, and mtval
- Write a Trap Handler: Implement a basic exception handler
- Understand Trap Delegation: Know how M-mode delegates traps to S-mode
- Recognize PLIC: Understand the basic architecture of the Platform-Level Interrupt Controller
💡 Scenario: When the CPU Hits the Pause Button
Scene: Junior stares at a completely black terminal screen, looking bewildered.
Junior: “Senior, I have a question. I deliberately stuffed some garbage data .word 0xFFFFFFFF into my program, and it just crashed. But in Linux, if a program goes bad, usually only that program gets killed (Segmentation Fault), and the system stays fine, right?”
Senior: “That’s because Linux has a powerful ‘emergency response center’—the Trap Handler. But in our bare-metal environment right now, you haven’t written a handler. When the CPU encounters an instruction it doesn’t understand, it doesn’t know what to do, so it just… gives up.”
Junior: “So I need to write this response center myself?”
Senior: “Exactly. Think of it this way:
- Trap Occurs: Like a robotic arm on the assembly line suddenly shows a red warning light and stops.
- Hardware Action: The CPU automatically saves the current progress (PC) in mepc (Machine Exception PC), then jumps to wherever mtvec (Trap Vector) points to for help.
- Software Takes Over (Handler): This is the code you need to write. You check mcause (Machine Cause) to see what triggered the alarm, handle the problem, then let the machine continue.”
Junior: “Sounds logical—handle it and go back to what you were doing?”
Senior: “Here’s the trap within the trap. If it’s an ‘Interrupt’ (like a timer going off), you handle it and return to ‘what you were doing.’ But if it’s an ‘Illegal Instruction Exception,’ returning to ‘what you were doing’ just hits the same illegal instruction again—infinite loop! So you have to manually ‘skip over’ that bad instruction.”
Junior: “I see! Let’s try it then.”
When a program encounters an error, receives an interrupt, or makes a system call, control must transfer from the current execution context to a handler that can deal with the event. RISC-V calls this mechanism a “trap”—a general term encompassing both synchronous exceptions (like page faults and illegal instructions) and asynchronous interrupts (like timer ticks and device signals).
Understanding traps is fundamental to system programming. Operating systems rely on traps to implement system calls, handle errors, and respond to hardware events. Firmware uses traps to manage low-level hardware and provide services to higher-level software. Even application programmers benefit from understanding how exceptions propagate and how interrupt latency affects real-time performance.
This chapter explores RISC-V’s trap mechanism in detail: how traps are triggered, how control transfers to handlers, how CSRs record trap information, and how the Platform-Level Interrupt Controller (PLIC) manages external interrupts. We’ll also examine the Advanced Interrupt Architecture (AIA) that extends RISC-V’s interrupt capabilities for high-performance systems, and compare RISC-V’s approach with ARM’s exception model.
4.1 Trap Fundamentals
What is a Trap?
In RISC-V terminology, a “trap” is any event that causes a control transfer to a trap handler. This includes both exceptions (synchronous events caused by instruction execution) and interrupts (asynchronous events from external sources).
The term “trap” is deliberately general. It encompasses:
- Illegal instructions
- Page faults
- System calls (ECALL)
- Breakpoints
- Timer interrupts
- External device interrupts
- Inter-processor interrupts
When a trap occurs, the processor:
- Saves the current program counter to xepc (where x is m or s, depending on the privilege level that handles the trap)
- Saves the trap cause to xcause
- Saves additional information to xtval (if applicable)
- Updates xstatus to record the previous privilege level and interrupt enable state
- Jumps to the trap handler address specified in xtvec
This mechanism is similar to exception handling in other architectures, but RISC-V’s terminology and implementation are particularly clean and consistent.
Exceptions vs Interrupts
The distinction between exceptions and interrupts is fundamental:
Exceptions are synchronous—they’re caused by the execution of a specific instruction. When you execute an instruction that causes an exception, the exception occurs at that point in the program. Examples include:
- Illegal instruction: The processor doesn’t recognize the opcode
- Page fault: A memory access violates page table permissions
- ECALL: The program explicitly requests a trap to higher privilege
- Breakpoint: A debugging breakpoint is hit
Exceptions are predictable and reproducible. If you execute the same instruction sequence with the same processor state, you’ll get the same exception at the same point.
Interrupts are asynchronous—they’re caused by events external to the currently executing instruction stream. An interrupt can occur between any two instructions (or even during instruction execution in some implementations). Examples include:
- Timer interrupt: A timer has expired
- External interrupt: A device needs attention
- Software interrupt: Another processor or software has signaled this processor
Interrupts are not predictable from the instruction stream alone. The same program might experience interrupts at different points on different runs, depending on external events.
This distinction affects how traps are handled. Exceptions typically require examining the faulting instruction or address (xtval supplies one of these for some exceptions). Interrupts require identifying which device or source caused the interrupt.
Synchronous vs Asynchronous Traps
The synchronous/asynchronous distinction is encoded in the xcause CSR. The high bit of xcause indicates the trap type:
- Bit 63 (in RV64) = 0: Exception (synchronous)
- Bit 63 (in RV64) = 1: Interrupt (asynchronous)
The low bits encode the specific cause. For example:
- xcause = 0x0000000000000002: Illegal instruction exception
- xcause = 0x8000000000000005: Supervisor timer interrupt
This encoding allows trap handlers to quickly distinguish interrupts from exceptions with a simple sign test.
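The sign test mentioned above looks like this in C. The sketch assumes RV64 (a 64-bit xcause), and the function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* The interrupt flag is the top bit, so reinterpreting the value as a
 * signed integer turns "is this an interrupt?" into "is this negative?". */
bool cause_is_interrupt(uint64_t cause)
{
    return (int64_t)cause < 0;
}

/* Strip the interrupt flag to get the exception or interrupt code. */
uint64_t cause_code(uint64_t cause)
{
    return cause & ~(1ULL << 63);
}
```

With the two examples from the text, `0x0000000000000002` decodes to exception code 2 and `0x8000000000000005` decodes to interrupt code 5.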
Trap Classification
RISC-V doesn’t formally classify traps beyond the exception/interrupt distinction, but it’s useful to think about exceptions in terms of their behavior:
Faults: Exceptions that can be corrected, after which the faulting instruction can be restarted. Page faults are the classic example. When a page fault occurs:
- The OS trap handler is invoked
- The handler loads the missing page from disk
- The handler updates the page table
- The handler returns, and the faulting instruction is re-executed
- This time, the instruction succeeds
The key is that xepc points to the faulting instruction, so returning from the trap re-executes it.
Traps (in the narrow sense): In some architectures, these are exceptions reported after the instruction completes, so the saved PC points to the next instruction. RISC-V does not work this way. For EBREAK and ECALL, xepc holds the address of the EBREAK or ECALL instruction itself, so the handler must advance xepc before returning.
Interrupts: Asynchronous events. The interrupted instruction may or may not have completed. For precise interrupts, xepc points to an instruction that hasn’t executed yet. For imprecise interrupts (rare in RISC-V), the exact point of interruption may be approximate.
Aborts: Unrecoverable errors. These are rare in RISC-V. Most errors that would be aborts in other architectures are either faults (if recoverable) or cause the processor to enter a failure state.
Understanding these classifications helps in writing correct trap handlers. Fault handlers must be idempotent (safe to execute multiple times) because the faulting instruction will be retried. Trap handlers for breakpoints must advance xepc before returning to avoid infinite loops.
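The difference between "retry" and "skip" comes down to what the handler writes back to xepc. The following C sketch shows one possible policy, under stated assumptions: the function names are illustrative, only the cause codes shown are handled, and `low_halfword` is the first 16 bits of the trapping instruction, which the handler would have to read from memory. Standard 32-bit instructions have both low bits set, while 16-bit compressed instructions (such as c.ebreak) do not.

```c
#include <stdint.h>

/* Length in bytes of a standard RISC-V instruction, judged from its
 * lowest two bits: 0b11 means 32-bit, anything else means 16-bit. */
uint64_t insn_length(uint16_t low_halfword)
{
    return (low_halfword & 0x3) == 0x3 ? 4 : 2;
}

/* Choose the address to resume at after handling an exception. */
uint64_t resume_pc(unsigned code, uint64_t epc, uint16_t low_halfword)
{
    switch (code) {
    case 12: case 13: case 15:
        return epc;                 /* page faults: retry the instruction */
    case 8: case 9: case 11:
        return epc + 4;             /* ecall is always a 32-bit instruction */
    case 3:
        return epc + insn_length(low_halfword);  /* ebreak or c.ebreak */
    default:
        return epc;                 /* other causes: policy left to the caller */
    }
}
```

A page fault at `0x1000` resumes at `0x1000`, an ecall there resumes at `0x1004`, and a compressed breakpoint resumes at `0x1002`.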
4.2 Trap Entry and Exit
Trap Entry Flow
When a trap occurs, the processor performs a well-defined sequence of operations. Understanding this sequence is crucial for writing trap handlers and debugging trap-related issues.
The trap entry flow differs slightly depending on the target privilege level (M-mode or S-mode), but the basic pattern is the same. Let’s consider a trap to M-mode:
1. Save PC: The current PC is saved to mepc. For exceptions, this is the PC of the faulting instruction. For interrupts, this is the PC of the instruction that would have executed next.
2. Update mcause: The trap cause is written to mcause. The high bit indicates interrupt (1) or exception (0). The low bits encode the specific cause.
3. Update mtval: Additional trap-specific information is written to mtval. For address-related exceptions (like page faults), mtval contains the faulting address. For illegal instruction exceptions, mtval may contain the instruction itself. For some traps, mtval is zero.
4. Update mstatus: Several fields in mstatus are updated:
   - MPP (previous privilege) is set to the current privilege level
   - MPIE (previous interrupt enable) is set to the current value of MIE
   - MIE (interrupt enable) is set to 0, disabling interrupts in M-mode
5. Set privilege to M-mode: The processor switches to M-mode.
6. Jump to handler: The PC is set to the trap handler address from mtvec. The exact address depends on the mtvec mode (direct or vectored).
This entire sequence is atomic—it cannot be interrupted. Once a trap begins, it completes before any other trap can occur.
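The sequence above can be written out as a small software model. This is a sketch for building intuition, not an emulator: `struct hart` and `trap_to_m` are hypothetical names, the model assumes mtvec is in direct mode, and `pc` is taken to already hold the address that should be saved (the faulting instruction for an exception, the next instruction for an interrupt).

```c
#include <stdint.h>

#define MSTATUS_MIE       (1ULL << 3)
#define MSTATUS_MPIE      (1ULL << 7)
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (3ULL << MSTATUS_MPP_SHIFT)

/* Minimal hart state needed to model a trap into M-mode. */
struct hart {
    uint64_t pc, mepc, mcause, mtval, mstatus, mtvec;
    unsigned priv;  /* 0 = U, 1 = S, 3 = M */
};

void trap_to_m(struct hart *h, uint64_t cause, uint64_t tval)
{
    uint64_t s = h->mstatus;

    h->mepc = h->pc;      /* 1. save PC */
    h->mcause = cause;    /* 2. record the cause */
    h->mtval = tval;      /* 3. record trap-specific information */

    /* 4. mstatus: MPP = old privilege, MPIE = old MIE, MIE = 0 */
    s = (s & ~MSTATUS_MPP_MASK) | ((uint64_t)h->priv << MSTATUS_MPP_SHIFT);
    if (s & MSTATUS_MIE)
        s |= MSTATUS_MPIE;
    else
        s &= ~MSTATUS_MPIE;
    s &= ~MSTATUS_MIE;
    h->mstatus = s;

    h->priv = 3;                 /* 5. switch to M-mode */
    h->pc = h->mtvec & ~3ULL;    /* 6. jump to BASE (direct mode only) */
}
```

Stepping through this function with the values from Lab 3.1 reproduces what GDB shows at the trap_handler breakpoint: mcause holds 8, mepc holds the ecall address, and MPP reads back as 0.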
Figure 4.1: Trap Entry and Exit Flow
sequenceDiagram
participant CPU as CPU Execution
participant CSR as CSRs
participant Handler as Trap Handler
Note over CPU: Trap occurs
CPU->>CSR: Save state:<br/>PC→xepc, cause→xcause,<br/>value→xtval, update xstatus
CPU->>CPU: Switch privilege level
CPU->>Handler: Jump to xtvec
Note over Handler: Process trap
Handler->>CPU: Execute xRET
CPU->>CSR: Restore state:<br/>xepc→PC, xstatus→privilege
Note over CPU: Resume execution
CSR Updates on Trap
Let’s examine each CSR update in detail:
xepc (Exception Program Counter): This register holds the address to return to after handling the trap. For exceptions, it’s the address of the instruction that caused the exception. For interrupts, it’s the address of the instruction that was about to execute when the interrupt occurred.
The trap handler can modify xepc before returning. This is useful for:
- Skipping over a faulting instruction that can’t be fixed
- Implementing single-stepping in a debugger
- Emulating instructions not supported by the hardware
xcause (Trap Cause): This register indicates why the trap occurred. The format is:
- Bit XLEN-1: Interrupt bit (1 = interrupt, 0 = exception)
- Bits XLEN-2:0: Exception code or interrupt code
Common exception codes include:
- 0: Instruction address misaligned
- 1: Instruction access fault
- 2: Illegal instruction
- 3: Breakpoint
- 5: Load access fault
- 7: Store/AMO access fault
- 8: Environment call from U-mode
- 9: Environment call from S-mode
- 11: Environment call from M-mode
- 12: Instruction page fault
- 13: Load page fault
- 15: Store/AMO page fault
Common interrupt codes include:
- 1: Supervisor software interrupt
- 3: Machine software interrupt
- 5: Supervisor timer interrupt
- 7: Machine timer interrupt
- 9: Supervisor external interrupt
- 11: Machine external interrupt
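The xcause layout above can be decoded with one shift and one mask. The following is a minimal C sketch, assuming RV64 (XLEN = 64); the names `trap_cause_t` and `decode_cause` are illustrative and do not come from any standard header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the two fields packed into xcause. */
typedef struct {
    bool     is_interrupt; /* bit XLEN-1 */
    uint64_t code;         /* bits XLEN-2:0 */
} trap_cause_t;

/* Split a raw RV64 xcause value into its interrupt flag and code. */
static trap_cause_t decode_cause(uint64_t xcause)
{
    trap_cause_t c;
    c.is_interrupt = (xcause >> 63) != 0;
    c.code         = xcause & ~(UINT64_C(1) << 63);
    return c;
}
```

For example, a raw value of 0x8000000000000007 decodes to an interrupt with code 7 (machine timer), while a raw value of 2 decodes to the illegal instruction exception.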
xtval (Trap Value): This register provides additional information about the trap. Its contents depend on the trap cause:
- For address misaligned or access fault exceptions: The faulting address
- For illegal instruction exceptions: The instruction itself (optional)
- For breakpoint exceptions: The address of the breakpoint instruction
- For page fault exceptions: The faulting virtual address
- For other traps: Zero or undefined
The trap handler uses xtval to determine what went wrong and how to fix it. For example, a page fault handler uses xtval to know which page to load from disk.
xstatus (Status Register): Several fields are updated:
- xPP (Previous Privilege): Set to the privilege level before the trap. This allows the xRET instruction to return to the correct privilege level.
- xPIE (Previous Interrupt Enable): Set to the value of xIE before the trap. This preserves the interrupt enable state.
- xIE (Interrupt Enable): Cleared to 0, disabling interrupts at the target privilege level. This prevents nested interrupts from immediately occurring.
These updates ensure that the trap handler knows where it came from and can return correctly.
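To make the three field updates concrete, here is a small C model of what the hardware does to mstatus when a trap is taken into M-mode. It is a sketch, not hardware-accurate code: it uses the standard mstatus bit positions (MIE = bit 3, MPIE = bit 7, MPP = bits 12:11), and the function name `mstatus_on_trap` is made up for illustration.

```c
#include <stdint.h>

#define MSTATUS_MIE       (UINT64_C(1) << 3)   /* machine interrupt enable */
#define MSTATUS_MPIE      (UINT64_C(1) << 7)   /* previous interrupt enable */
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP       (UINT64_C(3) << MSTATUS_MPP_SHIFT)

/* Privilege encodings as stored in MPP. */
enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Return the new mstatus after a trap into M-mode from privilege cur_priv. */
static uint64_t mstatus_on_trap(uint64_t mstatus, unsigned cur_priv)
{
    uint64_t old_mie = (mstatus & MSTATUS_MIE) ? 1 : 0;

    mstatus &= ~(MSTATUS_MIE | MSTATUS_MPIE | MSTATUS_MPP); /* MIE <- 0 */
    mstatus |= old_mie << 7;                                /* MPIE <- old MIE */
    mstatus |= (uint64_t)cur_priv << MSTATUS_MPP_SHIFT;     /* MPP <- old privilege */
    return mstatus;
}
```

Starting from mstatus = 0x8 (MIE set) in U-mode, the model yields 0x80: interrupts are now disabled, MPIE remembers that they were enabled, and MPP records U-mode.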
Trap Vector (xtvec)
The xtvec CSR specifies where trap handlers are located. It has two fields:
- BASE (bits XLEN-1:2): The base address of the trap handler(s), aligned to 4 bytes
- MODE (bits 1:0): The vectoring mode
Two modes are defined:
- Direct mode (MODE=0): All traps jump to BASE. A single trap handler must determine the cause by reading xcause and dispatch accordingly.
- Vectored mode (MODE=1): Exceptions jump to BASE. Interrupts jump to BASE + 4×cause. This allows separate handlers for each interrupt source.
Vectored mode is useful for performance. Instead of a single handler that must check xcause and dispatch, each interrupt can have its own handler. This reduces latency for interrupt handling.
Example xtvec values:
- 0x80000000: Direct mode, handler at 0x80000000
- 0x80000001: Vectored mode, base at 0x80000000
- Exceptions → 0x80000000
- Supervisor software interrupt (cause 1) → 0x80000004
- Supervisor timer interrupt (cause 5) → 0x80000014
- Supervisor external interrupt (cause 9) → 0x80000024
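The address calculation for both modes fits in a few lines of C. This sketch assumes the two MODE values described above and treats any other MODE value like direct mode, which real hardware does not promise; `trap_target` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute the PC a trap jumps to, given xtvec and the decoded cause. */
static uint64_t trap_target(uint64_t xtvec, bool is_interrupt, uint64_t code)
{
    uint64_t base = xtvec & ~UINT64_C(3); /* BASE field, 4-byte aligned */
    uint64_t mode = xtvec & 3;            /* MODE field */

    if (mode == 1 && is_interrupt)
        return base + 4 * code;           /* vectored: interrupts only */
    return base;                          /* direct mode, or any exception */
}
```

With xtvec = 0x80000001, a supervisor timer interrupt (code 5) lands at 0x80000014, matching the example values listed above, while every exception still lands at 0x80000000.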
Trap Return (xRET)
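Returning from a trap is accomplished with the xRET instruction (MRET for M-mode, SRET for S-mode; a URET for U-mode was defined only in the draft user-level interrupt extension, which has since been dropped from the specification). The xRET instruction performs the following steps: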
- Restore PC: Set PC to the value in xepc
- Restore privilege: Set the current privilege level to xstatus.xPP
- Restore interrupt enable: Set xIE to xstatus.xPIE
- Update xPIE: Set xPIE to 1 (enabled)
- Update xPP: Set xPP to U-mode (least privilege)
The last two steps leave the saved-state fields in a well-defined condition. The next trap overwrites both xPIE and xPP, so these values matter only if software executes xRET again without an intervening trap. In that case the hart returns to the least-privileged mode with interrupts enabled, so a stray xRET cannot accidentally escalate privilege.
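The MRET steps listed above can be modeled in C as follows. This is a simplified sketch: the `hart_t` structure is hypothetical, U-mode is assumed to be implemented, and side effects such as clearing MPRV are left out.

```c
#include <stdint.h>

/* Hypothetical minimal hart state for illustration. */
typedef struct {
    uint64_t pc;
    uint64_t mepc;
    uint64_t mstatus;
    unsigned priv;      /* 0 = U, 1 = S, 3 = M */
} hart_t;

/* Apply the architectural effects of MRET to the model. */
static void do_mret(hart_t *h)
{
    unsigned mpp  = (unsigned)((h->mstatus >> 11) & 3);
    uint64_t mpie = (h->mstatus >> 7) & 1;

    h->pc   = h->mepc;                                         /* restore PC */
    h->priv = mpp;                                             /* restore privilege */
    h->mstatus &= ~((UINT64_C(1) << 3) | (UINT64_C(3) << 11)); /* clear MIE, MPP <- U */
    h->mstatus |= mpie << 3;                                   /* MIE <- MPIE */
    h->mstatus |= UINT64_C(1) << 7;                            /* MPIE <- 1 */
}
```

Given mstatus = 0x1880 (MPP = M, MPIE = 1, MIE = 0), the model returns to M-mode at mepc and leaves mstatus = 0x88: MIE and MPIE are both set and MPP reads as U-mode.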
A typical trap handler epilogue looks like:
# Restore saved registers
ld t0, 0(sp)
ld t1, 8(sp)
# ... restore other registers ...
addi sp, sp, 256 # Deallocate stack frame
mret # Return from M-mode trap
The MRET instruction is privileged—it can only be executed in M-mode. Similarly, SRET can only be executed in S-mode or higher. Attempting to execute xRET from insufficient privilege causes an illegal instruction exception.
4.3 Exception Causes and Handling
Exception Cause Codes
RISC-V defines a standard set of exception codes. Understanding these codes is essential for writing trap handlers.
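Instruction Address Misaligned (0): A taken branch or jump targets an address that is not properly aligned for instruction fetch. In the base ISA, instructions must be 4-byte aligned. With the C extension, instructions need only be 2-byte aligned. Because branch and jump offsets are always multiples of 2 and JALR clears the lowest bit of its target, this exception cannot occur on implementations that support the C extension.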
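Instruction Access Fault (1): The instruction fetch failed a physical-memory check. This might occur because:
- The address is in a protected region (PMP violation)
- The address does not refer to executable memory (PMA violation)
- A bus error occurred
A failed virtual address translation (an unmapped page, or a page without execute permission) raises an instruction page fault (12) instead.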
Illegal Instruction (2): The processor doesn’t recognize the instruction. This occurs when:
- The opcode is invalid
- The instruction uses an unimplemented extension
- The instruction is privileged but executed from insufficient privilege
- Reserved fields have incorrect values
Illegal instruction exceptions are often used to emulate unimplemented instructions in software.
Breakpoint (3): The EBREAK instruction was executed. This is used by debuggers to set breakpoints. The trap handler can examine the program state and return control to the debugger.
Load Address Misaligned (4) and Store/AMO Address Misaligned (6): A load or store instruction used an improperly aligned address. For example, a 4-byte load (LW) from an address that’s not 4-byte aligned. Some implementations support misaligned accesses in hardware; others trap and require software emulation.
Load Access Fault (5) and Store/AMO Access Fault (7): A load or store failed a physical-memory check, for the same reasons as an instruction access fault: PMP violations, PMA violations, or bus errors.
Environment Call from U/S/M-mode (8/9/11): The ECALL instruction was executed. The exception code indicates which privilege level made the call. This is how system calls are implemented—user code executes ECALL, trapping to the kernel.
Instruction Page Fault (12), Load Page Fault (13), Store/AMO Page Fault (15): Virtual address translation failed, either because no valid page table entry exists for the address or because the entry doesn’t grant the required permission. These are distinct from access faults, which are raised by physical-memory checks (PMP, PMA) or bus errors. Page faults are typically handled by mapping the page (loading its contents from disk if necessary) and updating the page table.
Exception Handling Patterns
Different exceptions require different handling strategies:
Illegal Instruction Emulation: When an illegal instruction exception occurs, the handler can:
- Read the faulting instruction from memory (or from mtval if provided)
- Decode the instruction
- If it’s an instruction that can be emulated, perform the operation in software
- Update registers and memory as the instruction would have
- Advance mepc past the instruction
- Return with MRET
This technique is used to support optional extensions in software or to provide backward compatibility.
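Advancing mepc "past the instruction" requires knowing the instruction's length, which matters once the C extension is present. A minimal sketch in C, assuming only 16-bit and 32-bit encodings are in use (longer encodings, signalled by bits [4:2] == 0b111, are not handled); the function names are illustrative.

```c
#include <stdint.h>

/* Length in bytes of the instruction whose lowest 16 bits are low16.
 * In the standard encoding, low bits 0b11 mean a 32-bit instruction and
 * anything else means a 16-bit compressed instruction. */
static unsigned insn_length(uint16_t low16)
{
    return ((low16 & 0x3) == 0x3) ? 4 : 2;
}

/* Address of the instruction following the one at mepc. */
static uint64_t next_pc(uint64_t mepc, uint16_t low16)
{
    return mepc + insn_length(low16);
}
```

A handler would read the halfword at mepc, pass it to `next_pc`, and write the result back to mepc before executing MRET.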
Page Fault Handling: Page faults are more complex:
- Read the faulting address from xtval
- Check if the address is valid for the process
- If not, terminate the process (segmentation fault)
- If valid, allocate a physical page
- Load the page contents from disk (if it was swapped out)
- Update the page table to map the virtual address to the physical page
- Return with SRET (the faulting instruction will be retried)
Page fault handling is critical for virtual memory systems and can involve significant latency if disk I/O is required.
System Call Handling: ECALL exceptions implement system calls:
- Read the system call number (typically from register a7)
- Read arguments from registers (a0-a6)
- Validate arguments
- Perform the requested operation
- Write the result to a0
- Advance sepc past the ECALL instruction
- Return with SRET
The key is advancing sepc—otherwise, returning would re-execute the ECALL, creating an infinite loop.
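The steps above can be sketched as a C dispatch function operating on a saved register frame. Everything named here is hypothetical: the `trap_frame_t` layout, the `SYS_ADD` call number, and the error value (which borrows Linux's convention of returning -38 for an unknown call) are chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical trap frame: only the fields this sketch needs. */
typedef struct {
    uint64_t a[8];   /* saved a0..a7 */
    uint64_t sepc;   /* saved exception PC */
} trap_frame_t;

#define SYS_ADD        1                /* made-up syscall number */
#define ERR_NO_SYSCALL ((uint64_t)-38)  /* error code for unknown calls */

/* Handle an ECALL trap whose registers were saved into tf. */
static void handle_ecall(trap_frame_t *tf)
{
    switch (tf->a[7]) {                 /* syscall number in a7 */
    case SYS_ADD:
        tf->a[0] = tf->a[0] + tf->a[1]; /* result goes back in a0 */
        break;
    default:
        tf->a[0] = ERR_NO_SYSCALL;
        break;
    }
    tf->sepc += 4;                      /* ECALL is always a 32-bit instruction */
}
```

Because ECALL has no compressed form, adding 4 to sepc is always correct here, unlike the general illegal-instruction case.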
4.4 Interrupt Architecture
Interrupt Types
RISC-V defines three types of interrupts, each with machine-level and supervisor-level variants:
Software Interrupts:
- Machine software interrupt (cause code 3)
- Supervisor software interrupt (cause code 1)
- Triggered by writing to memory-mapped registers
- Used for inter-processor interrupts (IPI)
Timer Interrupts:
- Machine timer interrupt (cause code 7)
- Supervisor timer interrupt (cause code 5)
- Triggered when a timer reaches a threshold
- Used for scheduling and timekeeping
External Interrupts:
- Machine external interrupt (cause code 11)
- Supervisor external interrupt (cause code 9)
- Triggered by external devices via interrupt controller
- Used for I/O device interrupts
Each interrupt type has a corresponding bit in the xie (interrupt enable) and xip (interrupt pending) registers.
Interrupt Enable and Pending
Interrupts are controlled by two sets of CSRs:
xie (Interrupt Enable): Each bit enables a specific interrupt type.
- Bit 11: Machine external interrupt enable (MEIE)
- Bit 9: Supervisor external interrupt enable (SEIE)
- Bit 7: Machine timer interrupt enable (MTIE)
- Bit 5: Supervisor timer interrupt enable (STIE)
- Bit 3: Machine software interrupt enable (MSIE)
- Bit 1: Supervisor software interrupt enable (SSIE)
xip (Interrupt Pending): Each bit indicates if an interrupt is pending.
- Same bit positions as xie
- Read-only for most bits (set by hardware)
- Software interrupt bits can be written
For an interrupt to be taken:
- The interrupt must be pending (bit set in xip)
- The interrupt must be enabled (bit set in xie)
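- Global interrupts must be enabled (xIE bit in xstatus) when the hart is running in mode x; while it runs in a less-privileged mode, mode-x interrupts are always globally enabled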
- The interrupt must not be delegated to a lower privilege level
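The first three conditions combine into a simple predicate. The sketch below covers machine-level interrupts only and ignores delegation; it also reflects the rule that mstatus.MIE gates interrupts only while the hart is itself in M-mode. The function name is an assumption for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the machine-level interrupt with the given code can be taken.
 * priv is the current privilege level (0 = U, 1 = S, 3 = M). */
static bool m_interrupt_ready(uint64_t mip, uint64_t mie, uint64_t mstatus,
                              unsigned priv, unsigned code)
{
    bool pending = (mip >> code) & 1;                  /* bit set in mip */
    bool enabled = (mie >> code) & 1;                  /* bit set in mie */
    bool global  = (priv < 3) || ((mstatus >> 3) & 1); /* MIE gates M-mode only */

    return pending && enabled && global;
}
```

For instance, a pending and enabled machine timer interrupt (code 7) is held off in M-mode while MIE = 0, but is taken immediately if the hart is running in U-mode.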
Interrupt Priority
When multiple interrupts are pending, the hardware chooses one based on priority. All machine-level interrupts take priority over supervisor-level interrupts. Within each privilege level, the standard order is:
- External interrupts (highest priority)
- Software interrupts
- Timer interrupts (lowest priority)
The complete order, from highest to lowest priority, is therefore MEI, MSI, MTI, SEI, SSI, STI.
Within a privilege level, this ordering ensures that external device interrupts (which may be time-critical) are serviced before software-triggered interrupts.
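A short C sketch of the selection step, using the order given in the privileged specification (MEI, MSI, MTI, SEI, SSI, STI, that is, codes 11, 3, 7, 9, 1, 5). The input mask `ready` is assumed to already be the AND of the pending and enable bits; `pick_interrupt` is a hypothetical name.

```c
#include <stdint.h>

/* Standard interrupt codes in decreasing priority:
 * MEI, MSI, MTI, SEI, SSI, STI. */
static const unsigned prio_order[] = { 11, 3, 7, 9, 1, 5 };

/* Return the code of the highest-priority interrupt set in ready,
 * or -1 if no standard interrupt is ready. */
static int pick_interrupt(uint64_t ready)
{
    for (unsigned i = 0; i < sizeof prio_order / sizeof prio_order[0]; i++) {
        if ((ready >> prio_order[i]) & 1)
            return (int)prio_order[i];
    }
    return -1;
}
```

If both a machine timer interrupt (7) and a supervisor external interrupt (9) are ready, the function returns 7, because any machine-level interrupt outranks any supervisor-level one.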
Interrupt Nesting
By default, taking an interrupt disables further interrupts (xIE is cleared). This prevents nested interrupts, which simplifies handler code.
However, handlers can re-enable interrupts to allow nesting:
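interrupt_handler:
# Save context
csrrw sp, mscratch, sp # Swap sp with mscratch
addi sp, sp, -256
sd x1, 0(sp)
# ... save other registers (including t0 and t1) ...
# Save mepc and mstatus: a nested trap would overwrite both
csrr t0, mepc
csrr t1, mstatus
sd t0, 240(sp)
sd t1, 248(sp)
# Re-enable interrupts for nesting
csrsi mstatus, 0x8 # Set MIE bit
# Handle interrupt
# ...
# Disable interrupts before returning
csrci mstatus, 0x8 # Clear MIE bit
# Restore mepc and mstatus for this trap level
ld t0, 240(sp)
ld t1, 248(sp)
csrw mepc, t0
csrw mstatus, t1
# Restore context
ld x1, 0(sp)
# ... restore other registers ...
addi sp, sp, 256
csrrw sp, mscratch, sp
mret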
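Nested interrupts require careful stack management and reentrancy considerations. In particular, the sp/mscratch swap at handler entry is only safe for the outermost trap: once a handler is running, mscratch holds the interrupted code's stack pointer, so a nested handler that repeats the swap would switch back to that stack.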
4.5 Platform-Level Interrupt Controller (PLIC)
PLIC Architecture
The Platform-Level Interrupt Controller (PLIC) is the standard interrupt controller for RISC-V systems. It routes interrupts from external sources (devices) to harts and privilege levels.
Key features:
- Supports up to 1023 interrupt sources (source ID 0 is reserved to mean “no interrupt”)
- Routes interrupts to multiple targets (harts × privilege modes)
- Priority-based arbitration
- Memory-mapped configuration registers
The PLIC sits between interrupt sources (devices) and interrupt targets (harts):
Devices → PLIC → Harts (M-mode, S-mode)
Interrupt Routing
Each interrupt source can be routed to any combination of targets. A target is a (hart, privilege mode) pair. For example, in a 4-hart system with M-mode and S-mode:
- Hart 0 M-mode
- Hart 0 S-mode
- Hart 1 M-mode
- Hart 1 S-mode
- … (8 targets total)
Each target has an enable register with one bit per interrupt source. Setting bit N enables source N for that target.
This flexibility allows:
- Dedicating certain interrupts to specific harts
- Sharing interrupts across multiple harts
- Routing interrupts to different privilege levels
Priority and Threshold
Each interrupt source has a priority (typically 0-7, where 0 means “never interrupt”). Each target has a threshold. An interrupt is delivered only if its priority exceeds the target’s threshold.
This allows:
- Masking low-priority interrupts during critical sections
- Implementing priority-based preemption
- Temporarily disabling interrupts without modifying enable bits
Example: Set threshold to 5 to mask all interrupts with priority ≤ 5.
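The interaction of enable bits, priorities, and the threshold can be sketched as a selection function in C. This is a software model for illustration only: it handles source IDs 1 to 31 in a single 32-bit word, and it breaks priority ties in favor of the lowest source ID, as the PLIC specification does. The name `plic_select` is hypothetical.

```c
#include <stdint.h>

#define NSRC 32   /* this sketch handles source IDs 1..31 */

/* Return the source ID the PLIC would present to a target, or 0 if none.
 * pending and enabled are bit masks indexed by source ID. */
static unsigned plic_select(const uint32_t priority[NSRC], uint32_t pending,
                            uint32_t enabled, uint32_t threshold)
{
    unsigned best = 0;
    uint32_t best_prio = threshold;        /* must be strictly exceeded */

    for (unsigned id = 1; id < NSRC; id++) {
        if (!(((pending & enabled) >> id) & 1))
            continue;                      /* not pending or not enabled */
        if (priority[id] > best_prio) {    /* strict '>' keeps the lowest ID on ties */
            best = id;
            best_prio = priority[id];
        }
    }
    return best;
}
```

Because a priority of 0 can never exceed any threshold, sources with priority 0 are never selected, which matches the "never interrupt" meaning described above.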
PLIC Memory Map and Programming
The PLIC is configured through memory-mapped registers:
Base Address: 0x0C000000 (typical, platform-specific)
Priority registers: Base + 0x000000 + source_id * 4
Pending registers: Base + 0x001000 + (source_id / 32) * 4
Enable registers: Base + 0x002000 + context * 0x80 + (source_id / 32) * 4
Threshold registers: Base + 0x200000 + context * 0x1000
Claim/Complete: Base + 0x200004 + context * 0x1000
A “context” is a (hart, privilege mode) pair. For a system with M-mode and S-mode per hart:
- Context 0 = Hart 0, M-mode
- Context 1 = Hart 0, S-mode
- Context 2 = Hart 1, M-mode
- Context 3 = Hart 1, S-mode
- …
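As a worked example of the offset formulas, the following C helpers compute a context number and the matching register addresses. They assume the typical base address given above and two contexts (M-mode, S-mode) per hart; the function names are invented for this sketch.

```c
#include <stdint.h>

#define PLIC_BASE_ADDR UINT64_C(0x0C000000)  /* typical base; platform-specific */

/* Context number for a hart with M- and S-mode contexts (mode: 0 = M, 1 = S). */
static unsigned plic_context(unsigned hart, unsigned mode)
{
    return hart * 2 + mode;
}

/* Address of the 32-bit enable word that holds source for context. */
static uint64_t plic_enable_addr(unsigned context, unsigned source)
{
    return PLIC_BASE_ADDR + 0x2000 + (uint64_t)context * 0x80 + (source / 32) * 4;
}

/* Bit position of source inside its enable word. */
static unsigned plic_enable_bit(unsigned source)
{
    return source % 32;
}

/* Address of the claim/complete register for context. */
static uint64_t plic_claim_addr(unsigned context)
{
    return PLIC_BASE_ADDR + 0x200004 + (uint64_t)context * 0x1000;
}
```

For hart 1 in S-mode (context 3) and source 37, the enable bit is bit 5 of the word at 0x0C002184, and the claim/complete register sits at 0x0C203004.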
Example register definitions and initialization:
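// PLIC register definitions
// (set_csr() and handle_device_interrupt() are assumed to be provided elsewhere)
#include <stdint.h>
#include <stdio.h>
#define PLIC_BASE 0x0C000000UL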
#define PLIC_PRIORITY(id) (PLIC_BASE + (id) * 4)
#define PLIC_PENDING(id) (PLIC_BASE + 0x1000 + ((id) / 32) * 4)
#define PLIC_ENABLE(hart, mode, id) \
(PLIC_BASE + 0x2000 + (hart) * 0x100 + (mode) * 0x80 + ((id) / 32) * 4)
#define PLIC_THRESHOLD(hart, mode) \
(PLIC_BASE + 0x200000 + (hart) * 0x2000 + (mode) * 0x1000)
#define PLIC_CLAIM(hart, mode) \
(PLIC_BASE + 0x200004 + (hart) * 0x2000 + (mode) * 0x1000)
// Initialize PLIC
void plic_init(void) {
// Set priority of interrupt source 1 to 7 (highest)
*(volatile uint32_t *)PLIC_PRIORITY(1) = 7;
// Enable interrupt source 1 for hart 0 M-mode
volatile uint32_t *enable_reg = (volatile uint32_t *)PLIC_ENABLE(0, 0, 1); // mode 0 = M-mode
*enable_reg |= (1u << (1 % 32));
// Set priority threshold to 0 (accept all interrupts with priority > 0)
*(volatile uint32_t *)PLIC_THRESHOLD(0, 0) = 0;
// Enable M-mode external interrupt
set_csr(mie, 1 << 11); // MEIE
set_csr(mstatus, 1 << 3); // MIE
}
// PLIC interrupt handler
void plic_handler(void) {
// Claim interrupt (read source ID)
uint32_t source = *(volatile uint32_t *)PLIC_CLAIM(0, 0);
if (source == 0) {
// No pending interrupt (for example, another hart claimed it first)
return;
}
// Handle interrupt
printf("Handling PLIC interrupt from source %u\n", source);
handle_device_interrupt(source);
// Complete interrupt (write back source ID)
*(volatile uint32_t *)PLIC_CLAIM(0, 0) = source;
}
Claim and Completion
The PLIC uses a claim/complete protocol:
-
Claim: The handler reads the claim register. This atomically:
- Returns the ID of the highest-priority pending interrupt
- Marks that interrupt as “in service”
- Prevents other harts from claiming the same interrupt
-
Service: The handler services the interrupt
-
Complete: The handler writes the interrupt ID to the completion register. This:
- Marks the interrupt as no longer in service
- Allows the interrupt to be triggered again
Example:
void plic_handler(void) {
uint32_t irq = plic_claim(); // Claim interrupt
if (irq == UART_IRQ) {
uart_interrupt_handler();
} else if (irq == TIMER_IRQ) {
timer_interrupt_handler();
}
// ... handle other interrupts ...
plic_complete(irq); // Complete interrupt
}
The claim/complete protocol ensures that:
- Each interrupt is handled exactly once
- Multiple harts can share the PLIC without races
- Level-triggered interrupts don’t cause spurious re-triggers
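The protocol's state changes can be illustrated with a toy software model in C. It is deliberately simplified: one target, all sources at equal priority, and a source raised while in service is simply dropped rather than re-evaluated on completion. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Toy single-target PLIC: bit N of each mask refers to source N (1..31). */
typedef struct {
    uint32_t pending;     /* sources waiting to be claimed */
    uint32_t in_service;  /* sources claimed but not yet completed */
} toy_plic_t;

/* A device raises source id; ignored while that source is in service. */
static void toy_raise(toy_plic_t *p, unsigned id)
{
    if (!((p->in_service >> id) & 1))
        p->pending |= UINT32_C(1) << id;
}

/* Claim: return the lowest pending ID and mark it in service, or return 0. */
static unsigned toy_claim(toy_plic_t *p)
{
    for (unsigned id = 1; id < 32; id++) {
        if ((p->pending >> id) & 1) {
            p->pending    &= ~(UINT32_C(1) << id);
            p->in_service |=   UINT32_C(1) << id;
            return id;
        }
    }
    return 0;
}

/* Complete: source id may be forwarded again. */
static void toy_complete(toy_plic_t *p, unsigned id)
{
    p->in_service &= ~(UINT32_C(1) << id);
}
```

In this model a second claim returns 0 because the pending bit was cleared by the first claim, which is how two harts racing for the same interrupt are kept from both servicing it.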
4.6 Core-Local Interrupt Controller (CLIC)
CLIC vs PLIC
The Core-Local Interrupt Controller (CLIC) is an optional extension that provides lower-latency interrupt handling than the PLIC.
PLIC:
- Platform-level (shared across harts)
- Flexible routing
- Memory-mapped configuration
- Higher latency (claim/complete protocol)
- Good for general-purpose I/O
CLIC:
- Core-local (one per hart)
- Direct vectoring to handlers
- Lower latency
- More complex configuration
- Good for real-time, low-latency interrupts
PLIC and CLIC are complementary. A system might use CLIC for time-critical interrupts and PLIC for general I/O.
Vectored Interrupt Handling
CLIC supports vectored interrupts—each interrupt source can have its own handler address. When an interrupt occurs, the hardware jumps directly to the handler, avoiding software dispatch overhead.
The vector table is an array of handler addresses:
Vector Table:
[0]: Handler for interrupt 0
[1]: Handler for interrupt 1
[2]: Handler for interrupt 2
...
The hardware computes the address of the table entry, then loads the handler address stored there:
entry_address = vector_table_base + (interrupt_id × entry_size)
This eliminates the need for a dispatcher that reads the interrupt ID and jumps to the appropriate handler.
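In software terms, vectoring behaves like indexing an array of function pointers. The C sketch below imitates that behavior; the three demo handlers, the `last_irq` variable, and `dispatch_vectored` are all invented for illustration, and real hardware performs the lookup without executing any instructions.

```c
typedef void (*irq_handler_t)(void);

static int last_irq = -1;                  /* records which demo handler ran */

static void irq0(void) { last_irq = 0; }
static void irq1(void) { last_irq = 1; }
static void irq2(void) { last_irq = 2; }

/* The vector table: one handler address per interrupt ID. */
static const irq_handler_t vector_table[] = { irq0, irq1, irq2 };

#define VECTOR_COUNT (sizeof vector_table / sizeof vector_table[0])

/* Software stand-in for hardware vectoring: index the table by interrupt ID
 * and call the stored handler. Returns 0 on success, -1 for an unknown ID. */
static int dispatch_vectored(unsigned interrupt_id)
{
    if (interrupt_id >= VECTOR_COUNT)
        return -1;
    vector_table[interrupt_id]();
    return 0;
}
```

The expression `vector_table[interrupt_id]` performs the base-plus-scaled-index arithmetic described above, with the entry size equal to the size of one pointer.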
Interrupt Levels and Preemption
CLIC supports multiple interrupt levels (up to 256). Each interrupt has a level, and higher-level interrupts can preempt lower-level ones.
When an interrupt is taken:
- The current level is saved
- The new level is set to the interrupt’s level
- Only interrupts with higher levels can preempt
This provides fine-grained priority control for real-time systems.
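The preemption rule can be modeled in a few lines of C. This sketch tracks only the current level of one hart, with level 0 standing for ordinary non-interrupt code; the structure and function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-hart state: the level of the code currently running
 * (0 = ordinary, non-interrupt code). */
typedef struct {
    uint8_t current_level;
} clic_hart_t;

/* Only a strictly higher level may preempt. */
static bool clic_can_preempt(const clic_hart_t *h, uint8_t irq_level)
{
    return irq_level > h->current_level;
}

/* Take the interrupt if allowed. Returns the previous level, which the
 * handler restores on exit, or -1 if the interrupt must wait. */
static int clic_take(clic_hart_t *h, uint8_t irq_level)
{
    if (!clic_can_preempt(h, irq_level))
        return -1;
    int saved = h->current_level;
    h->current_level = irq_level;
    return saved;
}
```

An interrupt at level 100 preempts normal code, blocks later interrupts at level 100 or below, and is itself preempted by an interrupt at level 200.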
4.7 Advanced Interrupt Architecture (AIA)
AIA Overview
The Advanced Interrupt Architecture (AIA) is a newer interrupt specification that extends and improves upon PLIC and CLIC. It provides:
- Message-signaled interrupts (MSI)
- Interrupt virtualization support
- Improved scalability
- Better integration with PCIe
AIA is designed for modern systems with many cores and devices, particularly servers and data center applications.
Message-Signaled Interrupts (MSI)
Traditional interrupts use dedicated wires. MSI uses memory writes to signal interrupts:
- Device writes to a special address
- The write is intercepted by the interrupt controller
- An interrupt is triggered
MSI advantages:
- No dedicated interrupt wires needed
- Scales better (thousands of interrupt sources)
- Better for PCIe devices
- Supports interrupt remapping
Interrupt Virtualization
AIA includes support for virtualizing interrupts, essential for running multiple guest operating systems:
- Guest interrupts can be delivered directly to guest VMs
- Hypervisor can intercept and remap interrupts
- Reduces virtualization overhead
This is similar to ARM’s GIC virtualization extensions.
4.8 Comparison with ARM GIC
ARM Generic Interrupt Controller (GIC)
ARM’s GIC is the standard interrupt controller for ARM systems. Comparing with RISC-V:
Architecture:
- GIC: Centralized, hierarchical (distributor → CPU interfaces)
- PLIC: Flat, memory-mapped
- CLIC: Core-local, vectored
Interrupt Types:
- GIC: SGI (software), PPI (private peripheral), SPI (shared peripheral)
- RISC-V: Software, timer, external
Priority Levels:
- GIC: Up to 256 priority levels
- PLIC: Typically 8 levels
- CLIC: Up to 256 levels
Virtualization:
- GIC: Built-in virtualization support (GICv3+)
- RISC-V: AIA provides virtualization support
Similarities:
- Both support priority-based arbitration
- Both support routing interrupts to specific cores
- Both support message-signaled interrupts (GICv3 ITS, RISC-V AIA)
Differences:
- GIC is more complex and feature-rich
- PLIC is simpler and more flexible
- CLIC provides lower latency for real-time use cases
- AIA brings RISC-V closer to GIC’s capabilities
Trade-offs:
- RISC-V’s modular approach (PLIC/CLIC/AIA) allows implementations to choose the right complexity
- ARM’s unified GIC provides consistency but mandates more complexity
- RISC-V is catching up with AIA for advanced features
🛠️ Hands-on Lab: Lab 4.1 — Your First Trap Handler
This lab guides you through implementing a minimal Trap Handler that handles an Illegal Instruction and gracefully skips over it.
Lab Objectives
- Set mtvec to point to your Trap Handler
- Read mcause to determine the trap type
- Modify mepc to skip over the bad instruction
- Return using mret
Code
Create lab4_trap.S:
# lab4_trap.S - Minimal Trap Handler Implementation
.section .text
.global _start
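_start:
# Set up a stack first: the trap handler saves registers on it
# (assumes RAM at 0x80000000, as on the QEMU virt machine)
li sp, 0x80100000
# 1. Set up Trap Vector
la t0, trap_handler
csrw mtvec, t0 # Tell CPU: jump here when trap occurs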
# 2. Execute some normal instructions
li a0, 100
li a1, 200
add a2, a0, a1 # a2 = 300
# 3. Deliberately trigger an illegal instruction
.word 0xFFFFFFFF # This is NOT a valid RISC-V instruction!
# 4. If Handler is correct, we skip the above and continue here
li a3, 999 # Marker: we successfully skipped it!
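# 5. Exit program
# (On bare metal no OS services this call: the ecall itself traps
# with mcause = 11 and the handler parks in unknown_trap)
li a7, 93 # exit syscall
li a0, 0 # exit code
ecall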
# ============================================
# Trap Handler
# ============================================
.align 2 # 2^2 = 4-byte alignment, as mtvec requires
trap_handler:
# Save registers we'll use
addi sp, sp, -16
sd t0, 0(sp)
sd t1, 8(sp)
# Read trap cause
csrr t0, mcause
# Check if Illegal Instruction (cause = 2)
li t1, 2
bne t0, t1, unknown_trap
# It's an illegal instruction! Skip it
# Read mepc (address of trapping instruction)
csrr t0, mepc
# Assume 32-bit instruction, skip 4 bytes
# (Compressed instructions are 2 bytes; simplified here)
addi t0, t0, 4
csrw mepc, t0 # Update mepc
# Restore registers
ld t0, 0(sp)
ld t1, 8(sp)
addi sp, sp, 16
# Return! CPU will jump to new mepc
mret
unknown_trap:
# Unknown trap, halt
j unknown_trap
Compile and Run
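# Compile and link at 0x80000000, where QEMU virt RAM starts and execution begins with -bios none
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -Wl,-Ttext=0x80000000 -o lab4_trap lab4_trap.S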
# Run with QEMU (requires Machine Mode support)
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab4_trap
# Or use Spike
spike --isa=rv64gc lab4_trap
What to Observe
Use GDB to trace execution:
# Terminal 1: Start QEMU and wait for GDB
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab4_trap -s -S
# Terminal 2: Connect GDB
riscv64-unknown-elf-gdb lab4_trap
(gdb) target remote :1234
(gdb) break trap_handler
(gdb) continue
When the breakpoint hits, examine:
(gdb) info registers mcause # Should show 2 (Illegal Instruction)
(gdb) info registers mepc # Address of the .word 0xFFFFFFFF
(gdb) stepi # Step through handler
You should observe:
- mcause = 2 (Illegal Instruction)
- mepc points to the address of .word 0xFFFFFFFF
- mtval may contain the encoding of the illegal instruction
danieRTOS Reference: The context switch mechanism in danieRTOS builds on these trap handling fundamentals, using mret to switch between tasks.
Key Concept: mepc Adjustment
💭 Why doesn’t an Interrupt need to modify mepc, but an Exception does?
- Interrupt: Is “asynchronous”: it occurs between two instructions. mepc points to the “next instruction to execute.” After handling the interrupt, continuing from there is correct.
- Exception: Is “synchronous”: it is triggered by the current instruction. mepc points to the “instruction that triggered the exception.” Unless the handler has fixed the cause (as a page fault handler does), returning without modifying mepc makes mret re-execute the same instruction, triggering the exception again and forming an infinite loop!
⚠️ Common Pitfalls
Pitfall 1: Forgetting to Set mtvec
Error Scenario: A trap triggers at startup before mtvec is set, causing the CPU to jump to an uninitialized address (usually 0).
# ❌ Wrong: mtvec not set before potential trap
_start:
ecall # If mtvec isn't set, jumps to unknown address
# ✅ Correct: Set mtvec as the first thing
_start:
la t0, trap_handler
csrw mtvec, t0
# Now safe to execute instructions that might trap
Pitfall 2: Corrupting Caller’s Registers in Handler
Error Scenario: The Trap Handler uses a0, t0, etc. without saving/restoring them, causing the interrupted program’s data to be corrupted.
# ❌ Wrong: Directly using registers
trap_handler:
csrr t0, mcause # t0 is overwritten!
# ... handle ...
mret # Original t0 value is gone
# ✅ Correct: Save first, restore after
trap_handler:
addi sp, sp, -16 # Keep sp 16-byte aligned, as the ABI requires
sd t0, 0(sp) # Save t0
csrr t0, mcause
# ... handle ...
ld t0, 0(sp) # Restore t0
addi sp, sp, 16
mret
Pitfall 3: Confusing mret with ret
Error Scenario: Using ret instead of mret at the end of the Trap Handler.
# ❌ Wrong: ret is just jalr x0, 0(ra), doesn't restore privilege level
trap_handler:
# ...
ret # Jumps to ra, but privilege unchanged, mepc unused
# ✅ Correct: mret restores mstatus.MPP and jumps to mepc
trap_handler:
# ...
mret # Correct return
Summary
RISC-V’s trap mechanism provides a unified framework for handling both synchronous exceptions and asynchronous interrupts. When a trap occurs, the processor saves the current PC to xepc, records the cause in xcause, stores additional information in xtval, updates privilege and interrupt state in xstatus, and jumps to the handler address in xtvec. This clean, consistent mechanism works across all privilege levels (M-mode, S-mode, U-mode) with parallel CSR sets.
Exceptions are synchronous events caused by instruction execution: illegal instructions, misaligned accesses, page faults, breakpoints, and environment calls (ECALL). The xcause register encodes the exception type, while xtval provides context like the faulting address or illegal instruction. Exception handlers can fix the problem (like loading a swapped page) and resume execution, or terminate the offending program.
Interrupts are asynchronous events from external sources: software interrupts (for inter-processor communication), timer interrupts (for preemptive scheduling), and external interrupts (from devices). Each interrupt type has enable bits in xie and pending bits in xip. Interrupts are only taken when globally enabled (xstatus.xIE = 1) and individually enabled. Nested interrupts require careful management of the interrupt enable state.
The Platform-Level Interrupt Controller (PLIC) manages external interrupts from devices. It provides priority-based arbitration, per-hart interrupt routing, and a claim/complete protocol that ensures each interrupt is handled exactly once. The PLIC supports up to 1023 interrupt sources (ID 0 is reserved) with configurable priorities and thresholds. Memory-mapped registers are used both to configure the PLIC and to claim and complete interrupts.
The Core-Local Interrupt Controller (CLIC) extends RISC-V with vectored interrupts, preemptive priority levels, and hardware interrupt nesting. CLIC reduces interrupt latency by eliminating software dispatch and enabling direct jumps to interrupt-specific handlers. This makes CLIC suitable for real-time systems where deterministic, low-latency interrupt handling is critical.
The Advanced Interrupt Architecture (AIA) brings message-signaled interrupts (MSI), interrupt virtualization, and scalable interrupt delivery to RISC-V. The Incoming MSI Controller (IMSIC) receives MSI writes and signals interrupts to harts. The Advanced PLIC (APLIC) converts wired interrupts to MSIs. Together, they enable efficient interrupt handling in large multi-core systems and virtualized environments.
Compared to ARM’s Generic Interrupt Controller (GIC), RISC-V’s interrupt architecture is more modular. Simple systems use the basic PLIC. Real-time systems add CLIC for low latency. High-performance systems add AIA for MSI and virtualization. ARM’s GIC provides all features in one controller, which ensures consistency but mandates complexity even for simple systems. RISC-V’s approach allows implementations to choose the right level of complexity for their needs.
Chapter 5. Virtual Memory & Paging (Sv39 / Sv48)
Part IV — Memory & Addressing
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand VA to PA Translation: Grasp how Virtual Addresses are converted to Physical Addresses via Page Tables
- Master Sv39 Structure: Understand the three-level Page Table hierarchy (L2 → L1 → L0)
- Configure the satp CSR: Calculate and set satp to enable the MMU
- Understand TLB Mechanism: Know how TLB accelerates address translation and when to flush it
- Handle Page Faults: Analyze the causes of Page Faults and understand the handling flow
💡 Scenario: The Library’s Call Numbers
Scene: Junior is debugging a multi-process system, staring at GDB’s memory display, increasingly puzzled.
Junior: “Professor, I’ve encountered something really weird. I’m running two programs simultaneously, and when I look at their memory in GDB, both are using address 0x10000! But the data inside is completely different. How is this possible? Is the CPU a quantum computer?”
Professor: (laughing) “This isn’t quantum entanglement. Have you ever been to a library?”
Junior: “Sure, but what does a library have to do with this?”
Professor: “Imagine this: You’re at Library Branch A, and using call number Q123, you find a book called ‘Introduction to Quantum Mechanics.’ Your friend is at Branch B, uses the same call number Q123, and finds ‘Calculus Exercise Collection.’”
Junior: “Because each branch has its own shelf arrangement?”
Professor: “Exactly!
- Call Number (Virtual Address): The address the program sees, like a call number. Each program thinks it has an entire library to itself.
- Actual Shelf Location (Physical Address): Where the book really is.
- Catalog Index (Page Table): The lookup table that translates call numbers to actual shelf locations.
- Branch (Process): Each branch has its own catalog index.”
Junior: “So two programs using the same virtual address, through different Page Tables, get translated to different physical addresses?”
Professor: “You’ve got it! This is the essence of Virtual Memory. The operating system prepares a dedicated catalog (Page Table) for each process, making each think it has exclusive use of the entire library, when in reality everyone’s books are crammed into the same warehouse. The benefits are:
- Isolation: Program A messing up won’t affect Program B.
- Protection: Some shelves are marked ‘Staff Only’—you can’t touch them without permission.
- Flexibility: Books can be relocated anytime—just update the catalog.”
Junior: “So what’s the satp CSR for?”
Professor: “satp tells the CPU: ‘Use this particular catalog (Page Table), starting from this location.’ When the OS switches processes, it updates satp to point to a different catalog.”
Junior: “Got it! Let’s try building this catalog ourselves!”
Virtual memory is one of the most important abstractions in modern computing. It provides memory protection, isolating processes from each other and from the operating system. It provides address space abstraction, giving each process a simple, contiguous view of memory regardless of physical layout. It enables memory overcommitment, allowing systems to run more programs than would fit in physical RAM. And it supports shared memory, enabling efficient communication and resource sharing.
RISC-V implements virtual memory through a clean, flexible paging system. Sv39 provides 39-bit virtual addresses (512 GB address space) with three-level page tables, suitable for most application processors. Sv48 extends this to 48-bit virtual addresses (256 TB address space) with four-level page tables for systems requiring larger address spaces. Both modes support superpages (2 MB and 1 GB pages) for reduced TLB pressure and efficient large mappings.
This chapter explores RISC-V’s virtual memory system in detail: page table structures, address translation, TLB management, page faults, and the Physical Memory Protection (PMP) mechanism that provides memory protection even without virtual memory. Understanding these concepts is essential for operating system developers, hypervisor implementers, and anyone working with RISC-V system software.
5.1 Virtual Memory Overview
Why Virtual Memory?
Virtual memory solves several fundamental problems that would otherwise make operating systems nearly impossible to build.
First, virtual memory provides memory protection. Without it, any program could read or write any memory location, including the operating system’s code and data. A buggy program could crash the entire system. A malicious program could steal data from other programs or take control of the system. Virtual memory allows the OS to isolate each process in its own address space, preventing interference.
Second, virtual memory provides address space abstraction. Each process sees a simple, contiguous address space starting at address zero, regardless of where its memory is actually located in physical RAM. The process doesn’t need to know—or care—that its memory might be scattered across different physical locations, or that some of it might be swapped to disk. This abstraction simplifies programming and allows the OS to manage physical memory flexibly.
Third, virtual memory enables memory overcommitment. The OS can give each process a large virtual address space (512 GB in Sv39, 256 TB in Sv48) even if the system has far less physical RAM. Most of that virtual space is never actually used. The OS only allocates physical memory for the pages that are actually accessed. This allows running more programs than would fit in physical memory simultaneously.
Fourth, virtual memory supports shared memory. Multiple processes can map the same physical memory into their virtual address spaces. This is essential for shared libraries (like libc), which would otherwise waste memory by being loaded separately for each process. It’s also used for inter-process communication and memory-mapped files.
RISC-V Virtual Memory Modes
RISC-V defines several virtual memory modes, selected by the MODE field in the satp (Supervisor Address Translation and Protection) CSR:
-
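Bare (MODE=0): No address translation. Virtual addresses equal physical addresses. This is the mode used by systems that don’t need virtual memory. M-mode accesses are never translated regardless of satp (unless mstatus.MPRV is set, which makes loads and stores use the translation of the privilege mode held in MPP).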
-
Sv32 (MODE=1): 32-bit virtual addressing for RV32. Provides a 4 GB virtual address space with two-level page tables. Used in 32-bit embedded systems running operating systems.
-
Sv39 (MODE=8): 39-bit virtual addressing for RV64. Provides a 512 GB virtual address space with three-level page tables. This is the most common mode for 64-bit RISC-V systems running Linux or similar operating systems.
-
Sv48 (MODE=9): 48-bit virtual addressing for RV64. Provides a 256 TB virtual address space with four-level page tables. Used in systems that need larger address spaces, such as large servers or databases.
-
Sv57 (MODE=10): 57-bit virtual addressing for RV64. Provides a 128 PB virtual address space with five-level page tables. This mode is defined but rarely implemented, as 256 TB is sufficient for nearly all current applications.
The choice of mode is a trade-off. Larger address spaces require more levels of page table lookup, which increases the cost of TLB misses. Most systems use Sv39, which provides a good balance between address space size and performance.
The satp CSR
The satp (Supervisor Address Translation and Protection) register controls virtual memory. It’s a supervisor-level CSR that can only be accessed from S-mode or M-mode.
In RV64, satp has three fields:
63 60 59 44 43 0
+--------+--------------------+----------------------------------+
| MODE | ASID | PPN |
+--------+--------------------+----------------------------------+
- MODE (bits 63:60): Selects the address translation mode (0=Bare, 8=Sv39, 9=Sv48, 10=Sv57)
- ASID (bits 59:44): Address Space Identifier, a 16-bit tag used to distinguish TLB entries from different processes
- PPN (bits 43:0): Physical Page Number of the root page table
To enable virtual memory, the OS:
- Allocates a page table in physical memory
- Initializes the page table entries
- Writes satp with MODE=8 (for Sv39) or MODE=9 (for Sv48) and the physical page number (physical address >> 12) of the root page table
- Executes SFENCE.VMA to flush the TLB
After this, all memory accesses from S-mode and U-mode go through address translation.
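As a minimal sketch of how the three satp fields fit together, the helpers below pack and unpack an RV64 satp value using the bit positions from the diagram above. The names `satp_make`, `satp_mode`, `satp_asid`, and `satp_root_pa` are illustrative assumptions, not part of any standard header, and the code only manipulates integers, so it does not touch the real CSR.

```c
#include <stdint.h>

#define SATP_MODE_BARE 0ULL
#define SATP_MODE_SV39 8ULL
#define SATP_MODE_SV48 9ULL

/* Build an RV64 satp value from its three fields (hypothetical helper).
 * root_table_pa should be 4 KB aligned; only its page number is stored. */
uint64_t satp_make(uint64_t mode, uint64_t asid, uint64_t root_table_pa) {
    uint64_t ppn = (root_table_pa >> 12) & ((1ULL << 44) - 1); /* bits 43:0  */
    return (mode << 60)                                        /* bits 63:60 */
         | ((asid & 0xFFFF) << 44)                             /* bits 59:44 */
         | ppn;
}

/* Field extractors, mirroring the layout above. */
uint64_t satp_mode(uint64_t satp)    { return satp >> 60; }
uint64_t satp_asid(uint64_t satp)    { return (satp >> 44) & 0xFFFF; }
uint64_t satp_root_pa(uint64_t satp) { return (satp & ((1ULL << 44) - 1)) << 12; }
```

For example, `satp_make(SATP_MODE_SV39, 5, 0x80001000)` describes an Sv39 address space with ASID 5 whose root table lives at physical address 0x8000_1000.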
Address Space Identifiers (ASIDs)
The ASID field in satp is an optimization. When the OS switches between processes, it must change satp to point to the new process’s page table. This would normally require flushing the entire TLB, since TLB entries from the old process are no longer valid.
ASIDs avoid this cost. Each TLB entry is tagged with the ASID from satp when it was created. When looking up a TLB entry, the hardware checks that the ASID matches. This allows TLB entries from multiple processes to coexist. When switching processes, the OS just changes satp (including the ASID), and the TLB automatically filters entries.
If the OS runs out of ASIDs (there are only 2^16 = 65536 possible values), it can flush the TLB and reuse ASIDs. But in practice, 65536 is enough for most workloads.
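The recycling idea can be sketched with a small allocator that hands out ASIDs in order and reports when it has wrapped around, at which point the caller would need a full TLB flush before reusing old values. This is only one possible policy. The type `asid_allocator_t`, the function `asid_alloc`, the choice to skip ASID 0, and the generation counter are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define ASID_COUNT 65536u /* 2^16; an implementation may provide fewer ASID bits */

typedef struct {
    uint32_t next;       /* next unused ASID in the current generation */
    uint32_t generation; /* incremented each time the ASID space is recycled */
} asid_allocator_t;

/* Return a fresh ASID (hypothetical policy: ASID 0 is never handed out).
 * *need_full_flush is set when the ASID space wrapped, meaning the caller
 * must flush the whole TLB before any recycled ASID is used. */
uint16_t asid_alloc(asid_allocator_t *a, bool *need_full_flush) {
    *need_full_flush = false;
    if (a->next >= ASID_COUNT) {
        a->next = 1;
        a->generation++;
        *need_full_flush = true;
    }
    return (uint16_t)a->next++;
}
```

A real kernel would also record the generation in each process so that it can tell whether the ASID a process holds predates the last recycle.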
TLB Management
The Translation Lookaside Buffer (TLB) caches recent address translations. Without the TLB, every memory access would require walking the page table, which could take several memory accesses. The TLB makes virtual memory practical by caching translations.
RISC-V doesn’t specify the TLB implementation—it’s a microarchitectural detail. But it does provide instructions for managing the TLB:
- SFENCE.VMA: Fence for virtual memory. This instruction orders memory accesses and TLB updates. It’s used after modifying page tables to ensure the TLB is consistent.
SFENCE.VMA can take two optional arguments:
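- rs1: If rs1 is not x0, only flush TLB entries for the virtual address held in rs1
- rs2: If rs2 is not x0, only flush TLB entries for the ASID held in rs2
If both operands are x0, the entire TLB is flushed. If only rs1 is given, only entries for that virtual address are flushed (across all ASIDs). If only rs2 is given, only entries for that ASID are flushed; global mappings are not affected. Note that the test is on the register number, not on the value: a register other than x0 that happens to hold zero selects virtual address 0 or ASID 0.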
This flexibility allows the OS to minimize TLB flushes. For example, when unmapping a single page, the OS can flush just that page’s TLB entry instead of the entire TLB.
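The four operand combinations can be wrapped in small C helpers using GCC-style inline assembly. This is a sketch under assumptions: the function names are hypothetical, the code must run in S-mode (or M-mode) on real hardware, and the `#if defined(__riscv)` guards exist only so the file still compiles on other hosts, where the bodies do nothing.

```c
#include <stdint.h>

/* Flush every TLB entry: both operands are x0. */
void tlb_flush_all(void) {
#if defined(__riscv)
    __asm__ volatile("sfence.vma zero, zero" ::: "memory");
#endif
}

/* Flush entries for one virtual address, in all address spaces. */
void tlb_flush_page(uintptr_t va) {
#if defined(__riscv)
    __asm__ volatile("sfence.vma %0, zero" :: "r"(va) : "memory");
#else
    (void)va;
#endif
}

/* Flush all non-global entries belonging to one ASID. */
void tlb_flush_asid(uintptr_t asid) {
#if defined(__riscv)
    __asm__ volatile("sfence.vma zero, %0" :: "r"(asid) : "memory");
#else
    (void)asid;
#endif
}

/* Flush entries for one virtual address within one ASID. */
void tlb_flush_page_asid(uintptr_t va, uintptr_t asid) {
#if defined(__riscv)
    __asm__ volatile("sfence.vma %0, %1" :: "r"(va), "r"(asid) : "memory");
#else
    (void)va;
    (void)asid;
#endif
}
```

The `"memory"` clobber keeps the compiler from moving page table stores across the fence, which matters because the instruction is meant to order those stores before later translations.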
Figure 5.1: RISC-V Virtual Memory Modes
graph TB
subgraph "RV32 Modes"
BARE32[Bare Mode<br/>No translation<br/>VA = PA]
SV32[Sv32 Mode<br/>32-bit VA<br/>4 GB address space<br/>2-level page table]
end
subgraph "RV64 Modes"
BARE64[Bare Mode<br/>No translation<br/>VA = PA]
SV39[Sv39 Mode<br/>39-bit VA<br/>512 GB address space<br/>3-level page table]
SV48[Sv48 Mode<br/>48-bit VA<br/>256 TB address space<br/>4-level page table]
SV57[Sv57 Mode<br/>57-bit VA<br/>128 PB address space<br/>5-level page table]
end
style BARE32 fill:#FFB6C1
style SV32 fill:#87CEEB
style BARE64 fill:#FFB6C1
style SV39 fill:#90EE90
style SV48 fill:#FFD700
style SV57 fill:#DDA0DD
5.2 Sv39: 39-bit Virtual Address Space
Sv39 Overview
Sv39 is the most widely used virtual memory mode for 64-bit RISC-V systems. It provides a 512 GB virtual address space, which is sufficient for most applications while keeping page table walks reasonably fast.
The “39” in Sv39 refers to the number of bits in the virtual address. A 39-bit address can represent 2^39 = 512 GB of address space. This might seem small compared to the 64-bit registers in RV64, but it’s a practical choice. Most programs don’t need more than 512 GB of virtual memory, and using fewer bits means fewer levels of page table lookup.
Sv39 Address Format
An Sv39 virtual address is divided into four parts:
63 39 38 30 29 21 20 12 11 0
+------------+--------+--------+--------+--------------+
| Reserved | VPN[2] | VPN[1] | VPN[0] | Page Offset |
+------------+--------+--------+--------+--------------+
25 bits 9 bits 9 bits 9 bits 12 bits
-
Reserved (bits 63:39): Must be equal to bit 38 (sign extension). This ensures that valid addresses are either in the lower half (0x0000_0000_0000_0000 to 0x0000_003F_FFFF_FFFF) or the upper half (0xFFFF_FFC0_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF) of the 64-bit address space. Addresses that don’t follow this rule cause a page fault.
-
VPN[2] (bits 38:30): Virtual Page Number, level 2. This is the index into the root page table.
-
VPN[1] (bits 29:21): Virtual Page Number, level 1. This is the index into the second-level page table.
-
VPN[0] (bits 20:12): Virtual Page Number, level 0. This is the index into the third-level (leaf) page table.
-
Page Offset (bits 11:0): Offset within the 4 KB page. This is not translated—it’s copied directly to the physical address.
Each VPN field is 9 bits, which means each page table has 2^9 = 512 entries. The page offset is 12 bits, which means pages are 2^12 = 4096 bytes (4 KB).
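The field boundaries translate directly into shifts and masks. The following sketch, with hypothetical helper names, extracts each index from a 64-bit value and checks the sign-extension rule for bits 63:39.

```c
#include <stdint.h>
#include <stdbool.h>

/* Index into the page table at the given level (2 = root, 0 = last level). */
unsigned sv39_vpn(uint64_t va, int level) {
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

/* Byte offset within the 4 KB page. */
unsigned sv39_offset(uint64_t va) {
    return (unsigned)(va & 0xFFF);
}

/* Bits 63:39 must all equal bit 38, so bits 63:38 are all zeros or all ones. */
bool sv39_is_canonical(uint64_t va) {
    uint64_t upper = va >> 38; /* 26 bits: 63:38 */
    return upper == 0 || upper == 0x3FFFFFF;
}
```

For the address 0x4123_4678, these helpers give VPN[2] = 1, VPN[1] = 9, VPN[0] = 0x34, and offset 0x678. An address such as 0x0000_0040_0000_0000 has bit 38 set but bits 63:39 clear, so it fails the canonical check.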
Three-Level Page Table Walk
Address translation in Sv39 involves walking a three-level page table. Here’s the algorithm:
-
Start with the root page table. Its physical address is in satp.PPN.
-
Use VPN[2] as an index into the root page table. Read the Page Table Entry (PTE) at that index.
-
If the PTE is invalid (V=0) or uses the reserved encoding (R=0 with W=1), raise a page fault.
-
If the PTE is a leaf (R=1, W=1, or X=1), the translation is complete. The PTE contains the physical page number. Go to step 8.
-
Otherwise, the PTE points to the next level page table. Use VPN[1] as an index into that page table. Read the PTE at that index.
-
If the PTE is invalid (V=0) or uses the reserved encoding (R=0 with W=1), raise a page fault.
-
If the PTE is a leaf, the translation is complete. Otherwise, use VPN[0] as an index into the third-level page table. Read the PTE at that index. This must be a leaf.
-
Combine the physical page number from the PTE with the page offset from the virtual address to form the physical address.
This process can require up to three memory accesses (one per level). That’s why the TLB is so important—it caches the result, avoiding the page table walk for subsequent accesses to the same page.
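The walk can be modeled in software. The sketch below runs the same loop over a small simulated physical memory (`sim_mem`, a hypothetical array standing in for RAM) rather than real hardware. It covers the valid check, the leaf test, and superpage alignment, but deliberately omits permission checks and A/D handling.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)

/* Hypothetical simulated physical memory: SIM_FRAMES 4 KB frames from PA 0. */
#define SIM_FRAMES 8
uint64_t sim_mem[SIM_FRAMES][512];

/* Read one 8-byte PTE; out-of-range reads return 0 (an invalid PTE). */
uint64_t sim_read_pte(uint64_t pa) {
    uint64_t frame = pa >> 12;
    if (frame >= SIM_FRAMES) return 0;
    return sim_mem[frame][(pa & 0xFFF) >> 3];
}

/* Translate va starting from the root table at root_ppn.
 * Returns true and stores the physical address on success;
 * returns false where hardware would raise a page fault. */
bool sv39_walk(uint64_t root_ppn, uint64_t va, uint64_t *pa_out) {
    uint64_t table_ppn = root_ppn;
    for (int level = 2; level >= 0; level--) {
        uint64_t index = (va >> (12 + 9 * level)) & 0x1FF;
        uint64_t pte = sim_read_pte((table_ppn << 12) + index * 8);

        /* Invalid entry, or the reserved R=0/W=1 encoding. */
        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return false;

        uint64_t ppn = (pte >> 10) & ((1ULL << 44) - 1);
        if (pte & (PTE_R | PTE_X)) {
            /* Leaf: the bytes below this level's span come from the VA. */
            uint64_t span_mask = (1ULL << (12 + 9 * level)) - 1;
            if ((ppn << 12) & span_mask)
                return false; /* misaligned superpage */
            *pa_out = (ppn << 12) | (va & span_mask);
            return true;
        }
        table_ppn = ppn; /* pointer to the next-level table */
    }
    return false; /* a level-0 entry must be a leaf */
}
```

Because the loop exits as soon as it sees a leaf, the same function handles 4 KB pages and both superpage sizes.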
Page Table Entry (PTE) Format
Each PTE in Sv39 is 64 bits:
63 54 53 28 27 19 18 10 9 8 7 6 5 4 3 2 1 0
+----------+------------+------------+------------+-----+-+-+-+-+-+-+-+-+
| Reserved | PPN[2] | PPN[1] | PPN[0] | RSW |D|A|G|U|X|W|R|V|
+----------+------------+------------+------------+-----+-+-+-+-+-+-+-+-+
10 bits 26 bits 9 bits 9 bits 2 bits 8 flag bits
-
Reserved (bits 63:54): Reserved for future use. Must be zero unless an extension that defines some of these bits (such as Svnapot or Svpbmt) is implemented.
-
PPN[2:0] (bits 53:10): Physical Page Number. For a leaf PTE, this is the physical page number of the mapped page. For a non-leaf PTE, this is the physical page number of the next-level page table.
-
RSW (bits 9:8): Reserved for Software. The hardware ignores these bits. The OS can use them for any purpose (e.g., tracking page state).
-
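D (bit 7): Dirty. Indicates that the page has been written since D was last cleared. Depending on the implementation, hardware either sets D on a store or raises a page fault so that software can set it. Used by the OS to track which pages need to be written back to disk.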
-
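A (bit 6): Accessed. Indicates that the page has been read, written, or fetched from since A was last cleared. Depending on the implementation, hardware either sets A on access or raises a page fault so that software can set it. Used by the OS for page replacement algorithms.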
-
G (bit 5): Global. If set, this mapping is global and not associated with any ASID. Global mappings are never flushed by ASID-specific SFENCE.VMA.
-
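U (bit 4): User. If set, this page is accessible from U-mode. If clear, the page is only accessible from S-mode. S-mode can load from or store to U=1 pages only while sstatus.SUM is set, and can never execute code from them.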
-
X (bit 3): Execute. If set, the page can be executed.
-
W (bit 2): Write. If set, the page can be written.
-
R (bit 1): Read. If set, the page can be read.
-
V (bit 0): Valid. If clear, the PTE is invalid and any access causes a page fault.
PTE Flags and Permissions
The R, W, X, and U flags control access permissions. The hardware checks these flags during address translation:
- If V=0, the PTE is invalid. Page fault.
- If R=0 and W=1, the PTE is invalid (write-only pages are reserved). Page fault.
- If R=1, W=1, or X=1, the PTE is a leaf. The physical page number is in PPN[2:0].
- If R=0, W=0, and X=0, the PTE is a pointer to the next level. The physical page number of the next-level page table is in PPN[2:0].
For leaf PTEs, the permissions are checked:
- If the access is a read and R=0, page fault.
- If the access is a write and W=0, page fault.
- If the access is an instruction fetch and X=0, page fault.
- If the access is from U-mode and U=0, page fault.
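The A and D bits record that the page has been accessed or modified. Depending on the implementation, hardware either updates them itself or raises a page fault so that the OS can set them in software. The OS can clear these bits and use them to implement page replacement algorithms (e.g., approximations of LRU).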
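These rules are compact enough to express as two small functions. The sketch below classifies a PTE and then checks a leaf against an access. The enum and function names are hypothetical, and the check is simplified: it ignores the sstatus.SUM and MXR controls and the A/D bits.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PTE_U (1ULL << 4)

typedef enum { PTE_INVALID, PTE_POINTER, PTE_LEAF } pte_kind_t;
typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC } access_t;

/* Classify a PTE using the V, R, W, X rules. */
pte_kind_t pte_kind(uint64_t pte) {
    if (!(pte & PTE_V)) return PTE_INVALID;
    if (!(pte & PTE_R) && (pte & PTE_W)) return PTE_INVALID; /* reserved */
    if (pte & (PTE_R | PTE_X)) return PTE_LEAF;
    return PTE_POINTER; /* R=W=X=0: points to the next-level table */
}

/* Simplified permission check for a leaf PTE. */
bool leaf_allows(uint64_t pte, access_t access, bool from_user) {
    if (from_user && !(pte & PTE_U)) return false;
    switch (access) {
    case ACCESS_READ:  return (pte & PTE_R) != 0;
    case ACCESS_WRITE: return (pte & PTE_W) != 0;
    case ACCESS_EXEC:  return (pte & PTE_X) != 0;
    }
    return false;
}
```

A read-only kernel page (`PTE_V | PTE_R`) is therefore a leaf that rejects writes from any mode and rejects every access from U-mode.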
Superpages
Sv39 supports superpages—large pages that are multiples of the base 4 KB page size. A superpage is created by making a PTE at level 1 or level 2 a leaf (by setting R, W, or X).
-
A level 1 leaf PTE creates a 2 MB superpage (2^21 bytes). VPN[0] is not used; instead, bits 20:12 of the virtual address become part of the page offset.
-
A level 2 leaf PTE creates a 1 GB superpage (2^30 bytes). VPN[1] and VPN[0] are not used; instead, bits 29:12 of the virtual address become part of the page offset.
Superpages reduce TLB pressure by covering more memory with fewer TLB entries. They’re commonly used for large allocations like the kernel’s direct map of physical memory, or for large application heaps.
For a superpage PTE to be valid, the PPN must be properly aligned. For a 2 MB superpage, PPN[0] must be zero. For a 1 GB superpage, PPN[1:0] must be zero. If the alignment is incorrect, the PTE is considered invalid.
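The size and alignment arithmetic follows one pattern per level, which the sketch below makes explicit. The helper names are hypothetical; `level` is 0, 1, or 2 as in the text.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bytes mapped by a leaf PTE at the given level: 4 KB, 2 MB, or 1 GB. */
uint64_t sv39_leaf_size(int level) {
    return 1ULL << (12 + 9 * level);
}

/* A leaf at `level` needs the low 9*level bits of its PPN to be zero. */
bool sv39_leaf_aligned(uint64_t ppn, int level) {
    return (ppn & ((1ULL << (9 * level)) - 1)) == 0;
}

/* Physical address for a VA that hits an aligned leaf at `level`:
 * the enlarged page offset comes straight from the virtual address. */
uint64_t sv39_leaf_pa(uint64_t ppn, uint64_t va, int level) {
    uint64_t offset_mask = sv39_leaf_size(level) - 1;
    return (ppn << 12) | (va & offset_mask);
}
```

For instance, a level 1 leaf with PPN 0x200 maps the 2 MB region starting at physical address 0x20_0000, while PPN 0x201 would be rejected because PPN[0] is not zero.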
Figure 5.2: Sv39 Address Translation
graph LR
VA[Virtual Address<br/>39 bits]
VPN2[VPN2<br/>9 bits]
VPN1[VPN1<br/>9 bits]
VPN0[VPN0<br/>9 bits]
OFFSET[Offset<br/>12 bits]
SATP[satp.PPN<br/>Root Page Table]
L2[Level 2 PTE]
L1[Level 1 PTE]
L0[Level 0 PTE<br/>Leaf]
PPN[Physical Page Number]
PA[Physical Address]
VA --> VPN2
VA --> VPN1
VA --> VPN0
VA --> OFFSET
SATP --> L2
VPN2 -.index.-> L2
L2 --> L1
VPN1 -.index.-> L1
L1 --> L0
VPN0 -.index.-> L0
L0 --> PPN
PPN --> PA
OFFSET --> PA
style VA fill:#FFB6C1
style SATP fill:#87CEEB
style L0 fill:#90EE90
style PA fill:#FFD700
5.3 Sv48: 48-bit Virtual Address Space
Sv48 Overview
Sv48 extends Sv39 by adding one more level to the page table, increasing the virtual address space from 512 GB to 256 TB. This is useful for very large applications, such as databases that manage terabytes of data, or for systems that need to map large amounts of physical memory.
The trade-off is performance. Each additional level of page table adds one more memory access to the page table walk. For workloads with poor TLB hit rates, this can noticeably impact performance. Most systems use Sv39 unless they specifically need the larger address space.
Sv48 Address Format
An Sv48 virtual address has 48 bits of address and 16 bits of sign extension:
63 48 47 39 38 30 29 21 20 12 11 0
+------------+--------+--------+--------+--------+--------------+
| Reserved | VPN[3] | VPN[2] | VPN[1] | VPN[0] | Page Offset |
+------------+--------+--------+--------+--------+--------------+
16 bits 9 bits 9 bits 9 bits 9 bits 12 bits
The structure is similar to Sv39, but with an additional VPN[3] field for the fourth level of page table.
Four-Level Page Table Walk
The page table walk in Sv48 is similar to Sv39, but with an extra level:
- Start with the root page table at satp.PPN
- Use VPN[3] to index into the root page table
- If the PTE is a leaf, translation is complete
- Otherwise, use VPN[2] to index into the level 2 page table
- If the PTE is a leaf, translation is complete
- Otherwise, use VPN[1] to index into the level 1 page table
- If the PTE is a leaf, translation is complete
- Otherwise, use VPN[0] to index into the level 0 page table
- This must be a leaf PTE
- Combine the PPN from the PTE with the page offset to form the physical address
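Since every level uses a 9-bit index above a 12-bit page offset, the index arithmetic generalizes across modes. The sketch below, with hypothetical helper names, derives the number of levels from the virtual address width and checks the sign-extension rule for any of the RV64 modes.

```c
#include <stdint.h>
#include <stdbool.h>

/* Number of page table levels for a mode with va_bits of virtual address:
 * 39 -> 3 levels, 48 -> 4 levels, 57 -> 5 levels. */
int sv_levels(int va_bits) {
    return (va_bits - 12) / 9;
}

/* 9-bit index used at the given level (0 = last level). */
unsigned sv_vpn(uint64_t va, int level) {
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

/* Bits 63:va_bits must all equal bit va_bits-1. */
bool sv_is_canonical(uint64_t va, int va_bits) {
    uint64_t upper = va >> (va_bits - 1);
    uint64_t all_ones = (1ULL << (65 - va_bits)) - 1;
    return upper == 0 || upper == all_ones;
}
```

An address that uses VPN[3] can be canonical for Sv48 yet fail the Sv39 check, which is one way an OS can tell that a mapping request needs the larger mode.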
Sv48 Superpages
Sv48 supports the same superpages as Sv39, plus one additional size:
- 4 KB: Level 0 leaf (base page size)
- 2 MB: Level 1 leaf (2^21 bytes)
- 1 GB: Level 2 leaf (2^30 bytes)
- 512 GB: Level 3 leaf (2^39 bytes)
The 512 GB superpage is enormous—it’s the entire address space of Sv39! Such large pages are rarely used, but they could be useful for mapping very large regions of physical memory with minimal TLB overhead.
Sv48 vs Sv39 Trade-offs
Choosing between Sv39 and Sv48 involves several considerations:
Address Space:
- Sv39: 512 GB (sufficient for most applications)
- Sv48: 256 TB (needed for very large databases, in-memory computing)
Page Table Walk Cost:
- Sv39: Up to 3 memory accesses
- Sv48: Up to 4 memory accesses
- Impact depends on TLB hit rate
Memory Overhead:
- Sv48 requires more page table memory for sparse address spaces
- Sv48 adds one 4 KB root table above the Sv39 structure, and every populated 512 GB region needs its own 4 KB level-2 table
Compatibility:
- Sv39 is more widely supported
- Sv48 may not be implemented on all RISC-V processors
For most systems, Sv39 is the right choice. Sv48 should be used only when the larger address space is genuinely needed.
5.4 Page Faults and Exception Handling
Page Fault Types
RISC-V defines three types of page faults, distinguished by the type of access that caused the fault:
-
Instruction Page Fault (exception code 12): Occurs when fetching an instruction from a page that is not mapped, not executable, or not accessible at the current privilege level.
-
Load Page Fault (exception code 13): Occurs when loading from a page that is not mapped, not readable, or not accessible at the current privilege level.
-
Store/AMO Page Fault (exception code 15): Occurs when storing to a page that is not mapped, not writable, or not accessible at the current privilege level.
When a page fault occurs, the processor:
- Sets scause to the exception code (12, 13, or 15)
- Sets sepc to the PC of the faulting instruction
- Sets stval to the faulting virtual address
- Traps to S-mode (or M-mode if not delegated)
The OS page fault handler examines stval to determine which page caused the fault, then decides how to handle it.
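The first step of such a handler is decoding scause and stval. A minimal sketch follows, assuming the handler has already read both CSRs into ordinary integers. The enum and function names are hypothetical.

```c
#include <stdint.h>

typedef enum { FAULT_NONE, FAULT_FETCH, FAULT_LOAD, FAULT_STORE } fault_kind_t;

/* Map an RV64 scause value to the kind of page fault, if any. */
fault_kind_t page_fault_kind(uint64_t scause) {
    if (scause >> 63) return FAULT_NONE; /* top bit set: interrupt */
    switch (scause) {
    case 12: return FAULT_FETCH; /* instruction page fault */
    case 13: return FAULT_LOAD;  /* load page fault */
    case 15: return FAULT_STORE; /* store/AMO page fault */
    default: return FAULT_NONE;  /* some other exception */
    }
}

/* Base address of the 4 KB page containing the faulting address. */
uint64_t fault_page_base(uint64_t stval) {
    return stval & ~0xFFFULL;
}
```

The handler would typically use `fault_page_base(stval)` to look up its own record of the mapping before deciding what to do.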
Page Fault Handling
The OS can handle page faults in several ways:
Demand Paging: The page is valid but not currently in physical memory. The OS:
- Allocates a physical page
- Loads the page contents from disk (if it was swapped out) or zeros it (if it’s a new page)
- Updates the page table to map the virtual page to the physical page
- Executes SFENCE.VMA to flush the TLB
- Returns with SRET, which re-executes the faulting instruction
Copy-on-Write: The page is mapped read-only, but the process tries to write to it. This is used for fork() optimization. The OS:
- Allocates a new physical page
- Copies the contents from the old page to the new page
- Updates the page table to map the virtual page to the new page with write permission
- Executes SFENCE.VMA
- Returns with SRET
Invalid Access: The page is not mapped and should not be. The OS:
- Sends a SIGSEGV signal to the process (on Unix-like systems)
- The process typically terminates with a segmentation fault
The key is that sepc points to the faulting instruction, so returning from the trap re-executes it. This is essential for demand paging and copy-on-write to work correctly.
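The three outcomes can be sketched as a single decision function. Everything OS-specific here is an assumption for illustration: `region_valid` and `region_writable` stand for whatever bookkeeping the kernel keeps about the faulting address, and `PTE_COW` uses one of the RSW bits as a copy-on-write marker, which is a software convention rather than anything the hardware defines.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_V   (1ULL << 0)
#define PTE_W   (1ULL << 2)
#define PTE_COW (1ULL << 8) /* hypothetical: an RSW bit marking copy-on-write */

typedef enum { ACT_DEMAND_PAGE, ACT_COPY_ON_WRITE, ACT_SIGSEGV } fault_action_t;

/* Decide how to handle a page fault (a sketch, not a full policy). */
fault_action_t choose_fault_action(uint64_t scause, uint64_t pte,
                                   bool region_valid, bool region_writable) {
    bool is_store = (scause == 15);

    if (!region_valid) return ACT_SIGSEGV;                 /* not mapped, never will be */
    if (is_store && !region_writable) return ACT_SIGSEGV;  /* write to read-only region */
    if (!(pte & PTE_V)) return ACT_DEMAND_PAGE;            /* valid region, page absent */
    if (is_store && !(pte & PTE_W) && (pte & PTE_COW))
        return ACT_COPY_ON_WRITE;                          /* shared page, first write */
    return ACT_SIGSEGV;
}
```

After acting on the first two outcomes, the handler updates the PTE, executes SFENCE.VMA, and returns with SRET so that the faulting instruction runs again.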
🛠️ Hands-on Lab: Lab 5.1 — Putting on Magic Glasses (Enable Paging)
This lab guides you through building the simplest Page Table: Identity Mapping (Virtual Address = Physical Address), and enabling the MMU.
Lab Objectives
- Understand Sv39’s Page Table Entry (PTE) structure
- Build an Identity Mapping Page Table
- Configure the satp CSR and enable the MMU
- Understand the role of sfence.vma
Concept Explanation
In Sv39 mode, the Page Table has three levels:
Virtual Address (39-bit):
+--------+--------+--------+------------+
| VPN[2] | VPN[1] | VPN[0] | Offset |
| 9-bit | 9-bit | 9-bit | 12-bit |
+--------+--------+--------+------------+
Page Table Walk:
satp.PPN → Level 2 Table → Level 1 Table → Level 0 Table → Physical Page
Each Page Table Entry (PTE) is 64-bit:
PTE Format:
+----------+------------------------------------+--------------+
| Reserved | PPN (44-bit)                       | RSW + Flags  |
|          |                                    | DAGUXWRV     |
+----------+------------------------------------+--------------+
63       54 53                                10 9            0
Flags:
V (Valid) - bit 0: Entry is valid
R (Read) - bit 1: Readable
W (Write) - bit 2: Writable
X (Execute) - bit 3: Executable
U (User) - bit 4: User mode accessible
G (Global) - bit 5: Global mapping
A (Accessed) - bit 6: Has been accessed
D (Dirty) - bit 7: Has been written
Code
Create lab5_paging.c:
// lab5_paging.c - Minimal Identity Mapping Demo
#include <stdint.h>
// PTE Flag Definitions
#define PTE_V (1 << 0) // Valid
#define PTE_R (1 << 1) // Read
#define PTE_W (1 << 2) // Write
#define PTE_X (1 << 3) // Execute
#define PTE_U (1 << 4) // User
#define PTE_A (1 << 6) // Accessed
#define PTE_D (1 << 7) // Dirty
// Sv39: 512 entries per page table (9-bit index)
#define PAGE_SIZE 4096
#define PTE_PER_PAGE 512
// Page Table (must be 4KB aligned)
__attribute__((aligned(PAGE_SIZE)))
uint64_t root_page_table[PTE_PER_PAGE];
// Simplified: We use 1GB Gigapages for Identity Mapping
// VPN[2] = 0 → PA 0x0000_0000 ~ 0x3FFF_FFFF (1GB)
// VPN[2] = 1 → PA 0x4000_0000 ~ 0x7FFF_FFFF (1GB)
void setup_identity_mapping(void) {
// Clear Page Table
for (int i = 0; i < PTE_PER_PAGE; i++) {
root_page_table[i] = 0;
}
// Create Identity Mapping (first 4GB, using 1GB gigapages)
// This is a Leaf PTE: RWX bits are set, meaning this is the final mapping
for (int i = 0; i < 4; i++) {
uint64_t pa = (uint64_t)i << 30; // Each entry maps 1GB
uint64_t ppn = pa >> 12; // PPN = PA >> 12
root_page_table[i] = (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_X | PTE_A | PTE_D;
}
}
void enable_paging(void) {
uint64_t root_ppn = ((uint64_t)root_page_table) >> 12;
// satp format: MODE (4-bit) | ASID (16-bit) | PPN (44-bit)
// MODE = 8 (Sv39)
uint64_t satp_val = (8ULL << 60) | root_ppn;
// Set satp
asm volatile("csrw satp, %0" : : "r"(satp_val));
// Flush TLB - CRITICAL!
asm volatile("sfence.vma");
}
int main(void) {
setup_identity_mapping();
enable_paging();
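// If we reach here while running in S-mode, paging is working!
// (In M-mode satp does not affect translation, so startup.S must drop to S-mode first.)
// The program continues to run because VA == PA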
return 0;
}
Compile and Run
# Compile (for bare-metal S-mode)
riscv64-unknown-elf-gcc -march=rv64gc -mabi=lp64d -nostdlib \
-T linker.ld -o lab5_paging lab5_paging.c startup.S
# Run with QEMU
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab5_paging
What You Just Did
You’ve accomplished the fundamental MMU setup:
- Built a Page Table: Created entries that map VA to the same PA
- Configured satp: Told the CPU where the Page Table is and which mode to use
- Flushed TLB: Ensured the CPU uses the new mappings
danieRTOS Reference: The memory management in danieRTOS uses similar identity mapping for kernel space, with separate per-task mappings for user space.
Paper Exercise: Address Translation Drill
Given an Sv39 Virtual Address: 0x0000_0000_4123_4678
Manually extract:
- VPN[2] = bits 38-30 = ?
- VPN[1] = bits 29-21 = ?
- VPN[0] = bits 20-12 = ?
- Offset = bits 11-0 = ?
Answer:
VA = 0x0000_0000_4123_4678
= 0b 0000...0 000000001 000001001 000110100 011001111000
VPN[2] = bits 38-30 = 0x001 = 1
VPN[1] = bits 29-21 = 0x009 = 9
VPN[0] = bits 20-12 = 0x034 = 52
Offset = bits 11-0 = 0x678 = 1656
Translation Process:
- From satp.PPN, find the Root Table (Level 2)
- Use VPN[2]=1 as index, find Level 1 Table’s PPN
- Use VPN[1]=9 as index, find Level 0 Table’s PPN
- Use VPN[0]=52 as index, find the final Physical Page’s PPN
- Physical Address = (PPN << 12) | Offset
⚠️ Common Pitfalls
Pitfall 1: Page Table Not Aligned
Error Scenario: The Page Table is not 4 KB aligned, so shifting its address right by 12 to form satp.PPN discards nonzero low bits and the hardware walks from the wrong address.
// ❌ Wrong: Not aligned
uint64_t page_table[512]; // May not be 4KB aligned!
// ✅ Correct: Force 4KB alignment
__attribute__((aligned(4096)))
uint64_t page_table[512];
Pitfall 2: Forgetting to Flush TLB
Error Scenario: Modified the Page Table but didn’t execute sfence.vma, causing the CPU to continue using stale TLB cache.
// ❌ Wrong: Forgot to flush after modification
page_table[index] = new_pte;
// CPU may still use old mapping!
// ✅ Correct: Flush TLB after modification
page_table[index] = new_pte;
asm volatile("sfence.vma"); // Tell CPU: Page Table changed, clear cache
Pitfall 3: Confusing Leaf PTE with Non-Leaf PTE
Error Scenario: Setting RWX bits on an intermediate level, accidentally creating a gigapage mapping.
// PTE type determination rules:
// - RWX all 0: Non-Leaf (points to next level Page Table)
// - RWX at least one is 1: Leaf (final mapping)
// ❌ Wrong: Level 2 PTE has R bit set, becomes 1GB gigapage!
level2_pte = (next_table_ppn << 10) | PTE_V | PTE_R; // Accidentally becomes Leaf
// ✅ Correct: Non-Leaf PTE only sets V bit
level2_pte = (next_table_ppn << 10) | PTE_V; // Correct Non-Leaf
Summary
RISC-V’s virtual memory system provides memory protection, address space abstraction, and flexible memory management through a clean paging mechanism. The satp CSR controls address translation, selecting the translation mode (Bare, Sv32, Sv39, Sv48), specifying the Address Space Identifier (ASID) for TLB tagging, and pointing to the root page table.
Sv39 provides 39-bit virtual addresses with a 512 GB address space, using three-level page tables. Each level has 512 entries indexed by 9-bit VPN fields. Page Table Entries (PTEs) are 64 bits, containing a 44-bit physical page number and 8 flag bits (V, R, W, X, U, G, A, D). Leaf PTEs at level 0 map 4 KB pages. Leaf PTEs at higher levels create superpages: 2 MB at level 1, 1 GB at level 2.
Sv48 extends this to 48-bit virtual addresses with a 256 TB address space, using four-level page tables. The additional level provides more address space at the cost of one extra memory access per translation. Sv48 is needed for large databases, scientific computing, and systems requiring very large address spaces.
The Translation Lookaside Buffer (TLB) caches recent address translations, avoiding expensive page table walks. TLB entries are tagged with ASID to distinguish different address spaces. The SFENCE.VMA instruction flushes TLB entries, with optional parameters to flush specific virtual addresses or ASIDs. Efficient TLB management is critical for performance—unnecessary flushes cause expensive page table walks.
Page faults occur when the hardware cannot complete a translation: invalid PTEs (V=0), permission violations (accessing a page without R/W/X permission), or privilege violations (U-mode accessing a non-U page). The OS page fault handler can implement demand paging (allocate and load pages on first access), copy-on-write (share pages until written), and memory-mapped files (map file contents into address space).
Physical Memory Protection (PMP) provides memory protection without virtual memory, essential for embedded systems and M-mode firmware. PMP uses CSRs (pmpcfg0-15, pmpaddr0-63) to define up to 64 memory regions with access permissions (R, W, X) and address matching modes (OFF, TOR, NA4, NAPOT). PMP checks apply to the physical address of every S-mode and U-mode access, including the accesses made during page table walks, so they complement virtual memory translation and limit the damage from both user programs and S-mode OS bugs.
Compared to ARM’s translation system, RISC-V’s is simpler and more regular. ARM uses complex descriptor formats with multiple page sizes and attributes. RISC-V uses a single PTE format with clean flag bits. ARM’s ASID is 8 or 16 bits; RISC-V’s is up to 9 bits in Sv32 and up to 16 bits in the RV64 modes (Sv39, Sv48, Sv57). Both support superpages, but RISC-V’s approach is more uniform: any level can be a leaf.
RISC-V’s virtual memory design reflects its philosophy: provide a clean, minimal mechanism that’s easy to implement and understand, while supporting the features needed for modern operating systems. The result is a system that’s simpler than ARM’s but equally capable for most applications.
Chapter 6. Memory Ordering & Synchronization
Part IV — Memory & Addressing
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand Out-of-Order Execution: Know why CPUs reorder memory accesses
- Master RVWMO: Understand the basic rules of RISC-V Weak Memory Ordering
- Use Fence Instructions: Know when to use fence to enforce ordering
- Implement a Spinlock: Use amoswap or lr/sc to build a mutex lock
- Avoid Data Races: Identify and fix race conditions in multi-core programs
💡 Scenario: Shipping Logic at the Distribution Center
Scene: Junior wrote a dual-core program where Core 0 writes data and Core 1 reads it, but the results are always scrambled.
Junior: “Senior, I’m losing my mind! My program logic is correct, but the output looks like random numbers.”
Senior: “Show me the code.”
Junior: (showing screen)
// Core 0 // Core 1
data = 42; while (flag == 0) {}
flag = 1; print(data); // Expected: 42
“By logic, Core 1 should wait until flag becomes 1 before printing data, and by then data should already be 42, right? But sometimes it prints 0!”
Senior: “You’re imagining the CPU as too honest. Modern CPUs are like distribution center managers—for efficiency, they’ll ‘secretly reorder shipments.’”
Junior: “What do you mean?”
Senior: “Imagine you’re a logistics manager. You have two packages to ship:
- Package A: Ship to Taipei (far)
- Package B: Ship to Hsinchu (near)
Which do you ship first?“
Junior: “The Hsinchu one—it’s faster anyway.”
Senior: “Bingo! The CPU thinks the same way. It sees data = 42 and flag = 1 as two stores. It notices flag’s address is already in cache while data has to wait for memory, so it writes flag first.”
Junior: “So Core 1 sees flag == 1, but data hasn’t been written yet?”
Senior: “Exactly. This is Memory Reordering. The fix is to use a Fence instruction to tell the CPU: ‘No cutting in line! All preceding stores must complete before executing subsequent stores.’”
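// Core 0 (fixed version)
data = 42;
__sync_synchronize(); // Full fence (e.g. fence iorw, iorw; exact operands depend on the compiler)
flag = 1;
// Core 1 (fixed version): keep the load of data from being performed before the load of flag
while (flag == 0) {}
__sync_synchronize();
print(data); // Now guaranteed to print 42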
Junior: “I see! What about amoswap and those atomic instructions?”
Senior: “That’s a different problem: ‘How do you ensure two people don’t enter the bathroom at the same time?’ That’s what Spinlock solves.”
Modern processors execute instructions out of order, reorder memory accesses, and use caches that can delay when writes become visible to other processors. These optimizations are essential for performance, but they create a fundamental problem: what does a program actually mean when multiple processors access shared memory? Without careful synchronization, programs can observe impossible behaviors where effects appear to happen in the wrong order.
RISC-V addresses this through a memory consistency model that defines which memory access orderings are legal, and synchronization primitives that enforce ordering when needed. The RISC-V Weak Memory Ordering (RVWMO) model allows aggressive reordering for performance while providing fence instructions and atomic operations to enforce ordering where required. Understanding memory ordering is essential for anyone writing concurrent code, implementing synchronization primitives, or optimizing multi-threaded applications.
This chapter explores RISC-V’s memory model in detail: the RVWMO consistency model, fence instructions for enforcing ordering, atomic instructions for lock-free synchronization, the Total Store Ordering (RVTSO) extension for stronger ordering, and comparisons with ARM’s and x86’s memory models. We’ll see how to implement locks, barriers, and lock-free data structures correctly on RISC-V.
6.1 Memory Consistency Models
The Memory Ordering Problem
Modern processors don’t execute instructions in the order they appear in the program. They reorder loads and stores, execute instructions out of order, and use store buffers and caches that can delay when writes become visible to other processors. These optimizations are essential for performance, but they create a problem: what does a program actually mean when multiple processors are accessing shared memory?
Consider this simple example with two processors:
Processor 0: Processor 1:
x = 1 y = 1
r1 = y r2 = x
Initially, x = 0 and y = 0. After both processors execute, what are the possible values of r1 and r2?
In a sequentially consistent system, there are three possible outcomes:
- r1 = 0, r2 = 1 (P0 executes first)
- r1 = 1, r2 = 0 (P1 executes first)
- r1 = 1, r2 = 1 (stores happen before loads)
But in a weakly ordered system like RISC-V, there’s a fourth possibility:
- r1 = 0, r2 = 0 (both loads execute before both stores become visible)
This happens because each processor can reorder its own store after its load. The store to x might sit in P0’s store buffer while P0 executes the load from y. Similarly for P1. Both loads see the old values (0), even though both stores eventually complete.
This behavior is surprising and can lead to bugs if programmers aren’t careful. But it’s also essential for performance. Forcing sequential consistency would require stalling the processor on every memory operation, which would be unacceptably slow.
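The claim about the three sequentially consistent outcomes can be checked by brute force. The sketch below is a single-threaded model, not a real concurrent test: `enumerate_sc` tries every interleaving of the four operations that respects program order, and `run_store_buffered` models one execution in which both stores wait in store buffers while the loads read memory. Both function names are hypothetical.

```c
#include <stdbool.h>

/* Record in seen[r1][r2] every outcome reachable under sequential consistency.
 * The caller must pass an array initialized to false. */
void enumerate_sc(bool seen[2][2]) {
    /* Bit i of sched picks which processor runs at step i (0 = P0, 1 = P1). */
    for (int sched = 0; sched < 16; sched++) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        int pc0 = 0, pc1 = 0;
        bool ok = true;
        for (int step = 0; step < 4; step++) {
            if ((sched >> step) & 1) {        /* P1: y = 1; r2 = x */
                if (pc1 == 0) y = 1;
                else if (pc1 == 1) r2 = x;
                else ok = false;              /* P1 scheduled too often */
                pc1++;
            } else {                          /* P0: x = 1; r1 = y */
                if (pc0 == 0) x = 1;
                else if (pc0 == 1) r1 = y;
                else ok = false;              /* P0 scheduled too often */
                pc0++;
            }
        }
        if (ok) seen[r1][r2] = true;          /* both loads ran exactly once */
    }
}

/* One store-buffered execution: loads read memory before the buffers drain. */
void run_store_buffered(int *r1, int *r2) {
    int x = 0, y = 0;
    int buf0_x = 1, buf1_y = 1; /* pending stores in P0's and P1's buffers */
    *r1 = y;                    /* P0's load passes its buffered store to x */
    *r2 = x;                    /* P1's load passes its buffered store to y */
    x = buf0_x;                 /* buffers drain afterwards */
    y = buf1_y;
    (void)x;
    (void)y;
}
```

The enumeration never produces r1 = 0, r2 = 0, while the store-buffered model produces exactly that outcome, which is the gap between sequential consistency and a weaker model.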
Sequential Consistency (SC)
Sequential consistency is the simplest and most intuitive memory model. It was defined by Leslie Lamport in 1979: “the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”
In other words:
- All memory operations appear to execute in some total order
- Each processor’s operations appear in program order within that total order
SC is easy to reason about—programs behave as if they execute one instruction at a time, in order. But SC is also restrictive. It prohibits many optimizations:
- Store buffers (stores must be visible immediately)
- Out-of-order execution of loads
- Speculative loads
- Non-blocking caches
Modern processors don’t implement SC because the performance cost is too high.
Weak Memory Models
Weak memory models relax the ordering requirements to allow more optimization. They permit reordering of memory operations, with explicit synchronization instructions (like fences) to enforce ordering when needed.
The key insight is that most memory operations don’t need strict ordering. If a thread is computing on local data, it doesn’t matter if loads and stores are reordered—no other thread can observe the difference. Ordering only matters at synchronization points: when acquiring a lock, releasing a lock, or communicating between threads.
Weak memory models make these synchronization points explicit. The programmer (or compiler) inserts fence instructions or uses atomic operations with ordering semantics to enforce the necessary ordering. Between synchronization points, the hardware is free to reorder operations for performance.
RISC-V Weak Memory Ordering (RVWMO)
RISC-V uses a weak memory model called RVWMO (RISC-V Weak Memory Ordering). It’s similar to the memory models of ARM and Power, but with a formal specification that precisely defines what behaviors are allowed.
RVWMO allows extensive reordering:
- Loads can be reordered with other loads
- Stores can be reordered with other stores
- Loads can be reordered with stores
- Stores can be reordered with loads
The main exceptions are operations to overlapping addresses, operations linked by a dependency, operations separated by a fence instruction, and atomics that carry acquire or release annotations.
This might sound chaotic, but in practice, most programs don’t need to worry about it. Single-threaded programs behave as expected (the processor preserves the illusion of sequential execution within a thread). Multithreaded programs use locks, atomics, and other synchronization primitives that include the necessary fences.
RVWMO strikes a balance: it’s weak enough to allow aggressive optimization, but strong enough to support efficient synchronization primitives.
6.2 RISC-V Memory Model (RVWMO)
Program Order vs Memory Order
To understand RVWMO, we need to distinguish two concepts:
Program order is the order in which instructions appear in the program. If instruction A appears before instruction B in the program, we say A precedes B in program order.
Memory order is the order in which memory operations become visible to other harts (RISC-V’s term for hardware threads). This is the order that other harts observe when they read from memory.
In a sequentially consistent system, memory order always agrees with each hart’s program order. In RVWMO, memory order can differ from program order due to reordering.
Load and Store Ordering Rules
RVWMO allows the following reorderings:
- Load → Load: A later load can be performed before an earlier load (unless they have a dependency or are separated by a fence)
- Load → Store: A later store can become visible before an earlier load is performed (unless the store depends on the load, they overlap in address, or they are separated by a fence)
- Store → Store: A later store can become visible before an earlier store (unless they overlap in address or are separated by a fence)
- Store → Load: A later load can be performed before an earlier store becomes visible to other harts (unless they are separated by a fence). This is the store-buffering case that produces r1 = 0, r2 = 0.
The key exceptions are:
- Operations to overlapping addresses are not reordered
- Operations separated by a FENCE instruction are not reordered
- Operations with syntactic dependencies are not reordered
Preserved Program Order (PPO)
Not all program order is lost. RVWMO defines Preserved Program Order (PPO)—the subset of program order that is guaranteed to be respected in memory order.
PPO includes:
- Overlapping addresses: If two memory operations access overlapping addresses and the later one is a store, they are not reordered. A load that follows a store to the same address always returns that store’s value or a newer one, although the hart may forward the value from its own store buffer before other harts can see it.
- Explicit synchronization: Operations separated by a FENCE instruction maintain their order.
- Acquire/Release: Atomic operations with .aq (acquire) or .rl (release) suffixes enforce ordering.
- Syntactic dependencies: If a later instruction uses the result of an earlier instruction, they execute in order. For example:
ld a0, 0(a1) # Load from address in a1
ld a2, 0(a0) # Load from address in a0 (depends on first load)
The second load depends on the first, so they cannot be reordered.
- Control dependencies: If a store is control-dependent on an earlier load (for example, it follows a branch whose condition uses the loaded value), the store is ordered after that load. A control dependency alone does not order a later load.
PPO is the foundation of RVWMO. It defines the minimum ordering that the hardware must respect. Everything else can be reordered.
Global Memory Order
RVWMO requires that there exists a global memory order—a total order of all memory operations across all harts that is consistent with each hart’s PPO.
This doesn’t mean operations actually execute in this order. It means that the observable effects must be consistent with some such order. The hardware can reorder, buffer, and cache operations as long as the final result looks like they executed in some valid global order.
A consequence of this single global order is multi-copy atomicity. When a store becomes visible to one other hart, it becomes visible to all harts at the same point in the global memory order. There’s no state where hart A sees a store from hart C but hart B doesn’t (assuming both are reading the same address).
Multi-copy atomicity simplifies reasoning about concurrent programs. It means you don’t have to worry about stores propagating at different rates to different harts.
Figure 6.1: Memory Ordering Example
sequenceDiagram
participant H0 as Hart 0
participant Mem as Memory
participant H1 as Hart 1
Note over H0,H1: Initially: x=0, y=0
H0->>Mem: x = 1 (store)
Note over H0: Store may be buffered
H0->>Mem: r1 = y (load)
Note over H0: Load can execute early
H1->>Mem: y = 1 (store)
Note over H1: Store may be buffered
H1->>Mem: r2 = x (load)
Note over H1: Load can execute early
Note over H0,H1: Possible: r1=0, r2=0<br/>(both loads before both stores)
Note over H0,H1: FENCE prevents this reordering
6.3 Memory Ordering Instructions
The FENCE Instruction
The FENCE instruction enforces ordering between memory operations. It prevents reordering of operations before the fence with operations after the fence.
The basic syntax is:
fence pred, succ
Where pred (predecessor) and succ (successor) specify which types of operations are ordered:
- r: Reads (loads)
- w: Writes (stores)
- rw: Both reads and writes
Common fence variants:
FENCE rw, rw (full fence):
fence rw, rw
Orders all memory operations before the fence with all memory operations after the fence. This is the strongest fence—it prevents all reordering across the fence.
FENCE w, w (store-store fence):
fence w, w
Orders stores before the fence with stores after the fence. Loads can still be reordered. This is useful for ensuring that a sequence of stores becomes visible in order.
FENCE r, r (load-load fence):
fence r, r
Orders loads before the fence with loads after the fence. Stores can still be reordered.
FENCE r, rw (acquire fence):
fence r, rw
Orders loads before the fence with all operations after the fence. This is used for acquire semantics (e.g., after acquiring a lock).
FENCE rw, w (release fence):
fence rw, w
Orders all operations before the fence with stores after the fence. This is used for release semantics (e.g., before releasing a lock).
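The predecessor and successor sets are small bitmasks, which makes the variants above easy to model. The following C sketch encodes a FENCE instruction and answers whether a given fence orders a particular pair of operations. The bit positions follow the FENCE encoding in the unprivileged ISA manual (which also defines `i` and `o` bits for device input and output, not covered in this section); the helper names `encode_fence` and `fence_orders` are illustrative, not part of any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit flags for the pred/succ sets, in the order the encoding uses. */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };

typedef enum { OP_LOAD, OP_STORE } mem_op;

/* Encode "fence pred, succ" with fm = 0000 and rd = rs1 = x0. */
uint32_t encode_fence(unsigned pred, unsigned succ) {
    return ((uint32_t)(pred & 0xF) << 24) |   /* bits 27:24 = predecessor set  */
           ((uint32_t)(succ & 0xF) << 20) |   /* bits 23:20 = successor set    */
           0x0Fu;                             /* opcode MISC-MEM, funct3 = 000 */
}

/* Does "fence pred, succ" order an earlier op before a later op? */
bool fence_orders(unsigned pred, unsigned succ, mem_op earlier, mem_op later) {
    unsigned e = (earlier == OP_LOAD) ? FENCE_R : FENCE_W;
    unsigned l = (later == OP_LOAD) ? FENCE_R : FENCE_W;
    return (pred & e) && (succ & l);
}
```

For example, `fence_orders(FENCE_R | FENCE_W, FENCE_W, OP_STORE, OP_LOAD)` is false: a release fence (`fence rw, w`) does not stop a later load from passing an earlier store.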
Example: Message Passing
Consider a classic message-passing pattern:
# Producer (Hart 0)
sw a0, 0(s0) # Write data
fence w, w # Ensure data is written before flag
sw a1, 0(s1) # Write flag
# Consumer (Hart 1)
loop:
lw t0, 0(s1) # Read flag
beqz t0, loop # Wait for flag
fence r, r # Ensure flag is read before data
lw t1, 0(s0) # Read data
The producer writes data, then sets a flag. The consumer waits for the flag, then reads the data. The fences ensure that:
- The data write happens before the flag write (producer)
- The flag read happens before the data read (consumer)
Without the fences, the hardware could reorder the operations, and the consumer might read stale data.
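In C, the same handshake is usually written with C11 atomics rather than hand-placed fences. The sketch below is a minimal single-flag version; the names `payload`, `ready`, `publish`, and `try_consume` are illustrative, and which fence instructions a compiler emits for the release store and acquire load on RISC-V is an assumption that depends on the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data, written before the flag      */
static atomic_int ready;   /* flag; static storage starts at 0 (unset) */

/* Producer side: write the data, then publish it with a release store.
 * The release ordering keeps the payload write ahead of the flag write. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer side: returns true and fills *out once the flag is observed.
 * The acquire ordering keeps the payload read behind the flag read. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;      /* flag not set yet; the caller retries */
    *out = payload;
    return true;
}
```

A consumer would call `try_consume` in a loop until it returns true, which corresponds to the `beqz t0, loop` spin in the assembly version.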
FENCE.I: Instruction Fence
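FENCE.I synchronizes the instruction and data streams of the hart that executes it. It ensures that all previous stores by that hart to instruction memory are visible to its own subsequent instruction fetches. It does not affect other harts: on a multi-hart system, every hart that will run the new code must execute its own FENCE.I, which operating systems usually arrange on the program’s behalf.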
This is needed for:
- Self-modifying code: If a program writes new instructions to memory, FENCE.I ensures those instructions are fetched correctly.
- JIT compilation: A JIT compiler generates code at runtime. After writing the code to memory, it executes FENCE.I before jumping to the new code.
- Dynamic linking: Loading a shared library involves writing code to memory.
Example:
# Write new instruction to memory
sw a0, 0(s0)
# Ensure instruction cache sees the new instruction
fence.i
# Jump to new code
jalr s0
FENCE.I is relatively expensive—it may require flushing instruction caches. It should be used sparingly.
FENCE.TSO: Total Store Ordering
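FENCE.TSO provides the ordering that x86-style code expects at a barrier. It orders all earlier loads before all later memory operations, and all earlier stores before all later stores. Unlike FENCE rw, rw, it does not order earlier stores before later loads, so it is slightly weaker and may be implemented more efficiently on some microarchitectures.
TSO (Total Store Ordering) is the memory model used by x86 processors. It’s stronger than RVWMO:
- Loads are not reordered with loads
- Stores are not reordered with stores
- Stores are not reordered with earlier loads
- Loads can be reordered with earlier stores to a different address (the only relaxation, caused by the store buffer)
FENCE.TSO is useful for porting x86 code to RISC-V. Where the x86 code relied on the hardware’s implicit ordering around a synchronization point, a FENCE.TSO at that point restores it without paying for a full fence. Code that relies on TSO ordering between every pair of accesses is better served by hardware that implements the Ztso extension.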
Acquire and Release Semantics
Atomic operations (from the A extension) can have .aq (acquire) or .rl (release) suffixes that enforce ordering:
- Acquire (.aq): No memory operations after the atomic can be reordered before it. This is used when acquiring a lock: you want to ensure that accesses to protected data happen after the lock is acquired.
- Release (.rl): No memory operations before the atomic can be reordered after it. This is used when releasing a lock: you want to ensure that accesses to protected data happen before the lock is released.
Example:
# Acquire lock
acquire_loop:
lr.w.aq t0, 0(a0) # Load-reserved with acquire
bnez t0, acquire_loop # Wait if locked
li t1, 1
sc.w t1, t1, 0(a0) # Store-conditional
bnez t1, acquire_loop # Retry if failed
# Critical section - access protected data
lw t2, 0(s0)
addi t2, t2, 1
sw t2, 0(s0)
# Release lock
fence rw, w # Release fence: order the critical section before the unlock
sw zero, 0(a0) # Clear lock
The .aq on the load-reserved ensures that loads/stores in the critical section don’t move before the lock acquisition. The release fence, placed before the store that clears the lock, ensures they don’t move after the lock release.
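The same acquire/release pairing can be expressed in portable C11. This is a minimal sketch under stated assumptions: the type `toy_spinlock` and the function names are hypothetical, and the lock uses an atomic exchange (which compilers for RISC-V typically lower to an AMO) rather than the LR/SC loop shown above.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } toy_spinlock;   /* 0 = free, 1 = held */

/* One acquisition attempt; returns true if the lock was taken.
 * Acquire ordering keeps critical-section accesses after the exchange. */
bool spin_trylock(toy_spinlock *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is acquired. */
void spin_lock(toy_spinlock *l) {
    while (!spin_trylock(l)) {
        /* busy-wait */
    }
}

/* Release ordering keeps critical-section accesses before the store. */
void spin_unlock(toy_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Note that the release ordering is attached to the unlocking store itself, which plays the same role as placing a release fence before a plain store.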
6.4 Atomic Operations (A Extension)
The A Extension
The A (Atomic) extension provides atomic memory operations for synchronization. It includes:
- Load-Reserved / Store-Conditional (LR/SC)
- Atomic Memory Operations (AMO)
These operations are essential for implementing locks, semaphores, and lock-free data structures.
Load-Reserved / Store-Conditional (LR/SC)
LR/SC is a pair of instructions that together implement atomic read-modify-write:
LR.W / LR.D (Load-Reserved):
lr.w rd, (rs1) # Load word from address in rs1
lr.d rd, (rs1) # Load doubleword from address in rs1
LR loads a value from memory and establishes a reservation on that memory location. The reservation tracks whether any other hart has written to that location.
SC.W / SC.D (Store-Conditional):
sc.w rd, rs2, (rs1) # Store word to address in rs1
sc.d rd, rs2, (rs1) # Store doubleword to address in rs1
SC attempts to store a value to memory. It succeeds only if the reservation is still valid (no other hart has written to the location). SC writes 0 to rd on success, or a non-zero value on failure.
LR/SC Example: Atomic Increment
atomic_increment:
lr.w t0, 0(a0) # Load current value
addi t0, t0, 1 # Increment
sc.w t1, t0, 0(a0) # Try to store
bnez t1, atomic_increment # Retry if failed
This implements an atomic increment. If another hart modifies the location between the LR and SC, the SC fails, and we retry.
Reservation Set
The reservation is not necessarily on a single address. The hardware maintains a reservation set—a set of bytes that includes the address loaded by LR. The reservation is invalidated if any hart writes to any byte in the reservation set.
The reservation set is implementation-defined, but it must include at least the bytes loaded by LR. It might be as small as a single word, or as large as a cache line.
This means SC can fail spuriously—even if no other hart wrote to the exact address, SC might fail if another hart wrote to a nearby address in the same reservation set. Code using LR/SC must handle spurious failures by retrying.
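C11 exposes the same retry discipline through `atomic_compare_exchange_weak`, which is allowed to fail spuriously precisely so that it can map onto LR/SC-style hardware. The sketch below performs a bounded increment, an update that no single AMO expresses; the function name `increment_below` is illustrative, and relaxed ordering is an assumption that suits a plain counter but not a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically increment *counter unless it has already reached `limit`.
 * Returns true if the increment happened. */
bool increment_below(atomic_int *counter, int limit) {
    int seen = atomic_load_explicit(counter, memory_order_relaxed);
    while (seen < limit) {
        /* On failure (including spurious failure) `seen` is refreshed
         * with the current value, and the loop re-checks the limit. */
        if (atomic_compare_exchange_weak_explicit(counter, &seen, seen + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}
```

The loop structure mirrors the assembly: load, compute, attempt the conditional store, and branch back on failure.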
LR/SC Guidelines
For LR/SC to work correctly:
- Minimal code between LR and SC: The reservation can be broken by interrupts, context switches, or other harts’ stores. Keep the critical section small.
- No other memory operations: Some implementations invalidate the reservation if the hart performs any other memory operation between LR and SC. Avoid loads or stores between LR and SC.
- Always retry on failure: SC can fail spuriously. Always check the result and retry.
- Forward progress: The ISA guarantees eventual success only for constrained LR/SC loops: short sequences (at most 16 instructions) that use only base integer instructions between the LR and the SC, with no loads, stores, or backward jumps. Staying within these limits prevents livelock.
Atomic Memory Operations (AMO)
AMOs are single instructions that atomically read, modify, and write memory. They’re simpler than LR/SC for common operations like increment, swap, or bitwise operations.
The A extension defines these AMOs:
AMOSWAP: Atomic swap
amoswap.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] = rs2
amoswap.d rd, rs2, (rs1)
AMOADD: Atomic add
amoadd.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] += rs2
amoadd.d rd, rs2, (rs1)
AMOAND, AMOOR, AMOXOR: Atomic bitwise operations
amoand.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] &= rs2
amoor.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] |= rs2
amoxor.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] ^= rs2
AMOMIN, AMOMAX: Atomic min/max (signed)
amomin.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] = min(mem[rs1], rs2)
amomax.w rd, rs2, (rs1) # Atomically: rd = mem[rs1]; mem[rs1] = max(mem[rs1], rs2)
AMOMINU, AMOMAXU: Atomic min/max (unsigned)
amominu.w rd, rs2, (rs1) # Unsigned min
amomaxu.w rd, rs2, (rs1) # Unsigned max
All AMOs can have .aq and/or .rl suffixes for acquire/release semantics.
AMO Example: Atomic Increment
Using AMO, atomic increment is a single instruction (with t0 holding the amount to add, 1 for an increment):
amoadd.w zero, t0, 0(a0) # Atomically add t0 to mem[a0]
This is simpler and more efficient than the LR/SC version. The old value is discarded (written to zero).
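From C, these operations are reached through the C11 `atomic_fetch_*` family. The sketch below wraps two of them; the wrapper names are illustrative, and the claim that a compiler targeting the A extension emits `amoadd.w` and `amoor.w` for them is an assumption about typical toolchains rather than a guarantee.

```c
#include <stdatomic.h>

/* Add `delta` to *counter atomically and return the previous value,
 * like amoadd.w with rd receiving the old contents of memory. */
int counter_add(atomic_int *counter, int delta) {
    return atomic_fetch_add_explicit(counter, delta, memory_order_relaxed);
}

/* Set the bits in `mask` atomically and return the previous flags,
 * like amoor.w. */
unsigned flags_set(atomic_uint *flags, unsigned mask) {
    return atomic_fetch_or_explicit(flags, mask, memory_order_relaxed);
}
```

Ignoring the return value corresponds to using `zero` as the destination register in the assembly version.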
AMO vs LR/SC
When should you use AMO vs LR/SC?
Use AMO when:
- The operation matches one of the AMO instructions
- You need a simple, single-instruction atomic operation
- Performance is critical (AMOs are typically faster than LR/SC)
Use LR/SC when:
- The operation is complex (e.g., conditional update)
- You need to read the old value and make a decision
- The operation doesn’t match any AMO
Example: With only the A extension, atomic compare-and-swap (CAS) requires LR/SC:
cas:
lr.w t0, 0(a0) # Load current value
bne t0, a1, cas_fail # Compare with expected value
sc.w t1, a2, 0(a0) # Store new value
bnez t1, cas # Retry if failed
li a0, 1 # Success
ret
cas_fail:
li a0, 0 # Failure
ret
None of the A-extension AMOs can express this because it requires a conditional check. (The separate Zacas extension adds dedicated AMOCAS instructions.)
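For comparison, here is a C11 sketch with the same return convention as the assembly routine (1 on success, 0 on failure). The name `cas` is reused for illustration only. Relaxed ordering is chosen because the assembly above carries no .aq or .rl annotations; a lock or publication protocol would need stronger ordering.

```c
#include <stdatomic.h>

/* Store `desired` into *addr only if it currently holds `expected`.
 * Returns 1 if the swap happened, 0 otherwise. The strong variant
 * hides spurious failures by retrying internally. */
int cas(atomic_int *addr, int expected, int desired) {
    return atomic_compare_exchange_strong_explicit(addr, &expected, desired,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed)
               ? 1 : 0;
}
```

On failure the C11 call also writes the observed value back into `expected`; this wrapper discards it, matching the assembly version, which reports only success or failure.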
Figure 6.2a: LR/SC Pattern
graph TB
LR[LR: Load + Reserve]
COMPUTE[Compute new value]
SC[SC: Store if reservation valid]
CHECK{Success?}
RETRY[Retry]
DONE[Done]
LR --> COMPUTE
COMPUTE --> SC
SC --> CHECK
CHECK -->|No| RETRY
RETRY --> LR
CHECK -->|Yes| DONE
style LR fill:#FFB6C1
style SC fill:#87CEEB
style DONE fill:#90EE90
Figure 6.2b: AMO Pattern
graph TB
AMO[AMO: Atomic operation<br/>Single instruction]
DONE[Done]
AMO --> DONE
style AMO fill:#90EE90
style DONE fill:#87CEEB
6.5 Comparison with ARM and x86
ARM Memory Model
ARM uses a weak memory model similar to RISC-V. The ARMv8 architecture defines:
- Relaxed ordering: Like RVWMO, ARM allows extensive reordering
- DMB (Data Memory Barrier): Similar to RISC-V FENCE
- DSB (Data Synchronization Barrier): Stronger than DMB, waits for operations to complete
- ISB (Instruction Synchronization Barrier): Similar to RISC-V FENCE.I
ARM atomic operations:
- LDXR/STXR: Load-Exclusive / Store-Exclusive (similar to LR/SC)
- Atomic operations: LDADD, LDCLR, LDEOR, etc. (similar to AMOs)
The main difference is that ARM has more fence variants (DMB with different domains and types), while RISC-V has a simpler fence model.
x86 Memory Model
x86 uses a much stronger memory model called TSO (Total Store Ordering):
- Loads are not reordered with loads
- Stores are not reordered with stores
- Stores are not reordered with earlier loads
- Loads can be reordered with earlier stores to a different address (the only relaxation)
This is much closer to sequential consistency than RVWMO. Most x86 code doesn’t need explicit fences because the hardware provides strong ordering.
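The difference between the models can be summarized as a small decision function. The C sketch below is a deliberate simplification: it considers two operations from the same hart to different addresses, with no fences, dependencies, or atomics involved, and the names `mem_model` and `may_reorder` are illustrative.

```c
#include <stdbool.h>

typedef enum { MODEL_SC, MODEL_TSO, MODEL_RVWMO } mem_model;
typedef enum { OP_LOAD, OP_STORE } mem_op;

/* May a later operation to a different address be observed ahead of an
 * earlier one from the same hart, absent fences or dependencies? */
bool may_reorder(mem_model m, mem_op earlier, mem_op later) {
    switch (m) {
    case MODEL_SC:    return false;                    /* nothing moves      */
    case MODEL_TSO:   return earlier == OP_STORE &&
                             later == OP_LOAD;         /* store buffer only  */
    case MODEL_RVWMO: return true;                     /* all four pairs     */
    }
    return false;
}
```

Only the store-then-load pair is relaxed under TSO, which is why the store-buffering outcome r1 = 0, r2 = 0 is observable on x86 while most other surprising outcomes are not.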
x86 atomic operations:
- LOCK prefix: Makes an instruction atomic (e.g., LOCK ADD)
- CMPXCHG: Compare-and-swap
- XCHG: Atomic exchange (implicitly locked)
Porting x86 code that relies on TSO ordering to RISC-V requires adding fences or acquire/release annotations. The FENCE.TSO instruction helps by providing x86-like ordering.
Performance Implications
The choice of memory model affects performance:
Weak models (RISC-V, ARM):
- Allow aggressive reordering and optimization
- Better performance for well-synchronized code
- Require careful use of fences
- More complex for programmers
Strong models (x86 TSO):
- Simpler for programmers
- Less optimization opportunity
- Implicit ordering has performance cost
- Easier to port code
RISC-V’s weak model is a deliberate choice to maximize performance. The cost is that programmers must understand memory ordering and use synchronization correctly.
Figure 6.3: Memory Model Comparison
graph LR
%% Memory Model Spectrum (Left = Strong, Right = Weak)
SC[Sequential Consistency<br/>Strict ordering<br/>Simplest, slowest]
TSO[x86 TSO<br/>Store-Load reordering<br/>Strong model]
RVWMO[RISC-V RVWMO<br/>Extensive reordering<br/>Needs fences]
ARM[ARM Weak Order<br/>Similar to RVWMO<br/>More fence types]
%% Left → Right = Weaker memory ordering
SC -->|Weaker| TSO
TSO -->|Weaker| RVWMO
RVWMO -->|Comparable| ARM
%% Style the nodes like horizontal bars
style SC fill:#FFB6C1,stroke:#FF4500,stroke-width:2px
style TSO fill:#87CEEB,stroke:#1E90FF,stroke-width:2px
style RVWMO fill:#90EE90,stroke:#3CB371,stroke-width:2px
style ARM fill:#FFD700,stroke:#FFA500,stroke-width:2px
6.6 Programming with Weak Memory
Best Practices
Programming correctly with weak memory ordering requires discipline:
- Use high-level synchronization: Prefer mutexes, semaphores, and atomic types from your language’s standard library. These handle memory ordering correctly.
- Understand data races: A data race occurs when two harts access the same memory location without synchronization, and at least one access is a write. In C and C++, data races are undefined behavior; at the ISA level the outcome is defined by RVWMO but is rarely what the programmer intended.
- Use acquire/release: For custom synchronization, use acquire semantics when reading a synchronization variable and release semantics when writing it.
- Minimize critical sections: Keep the code between lock acquisition and release as short as possible.
- Test on real hardware: Memory ordering bugs may not appear in simulation or on strongly-ordered processors. Test on actual RISC-V hardware.
Common Patterns
Spinlock:
acquire:
li t0, 1
acquire_loop:
amoswap.w.aq t1, t0, 0(a0) # Atomic swap with acquire
bnez t1, acquire_loop # Retry if already locked
# Critical section
release:
amoswap.w.rl zero, zero, 0(a0) # Atomic swap with release
Message passing:
# Producer
sw a0, 0(s0) # Write data
fence rw, w # Release fence
sw a1, 0(s1) # Write flag
# Consumer
lw t0, 0(s1) # Read flag
fence r, rw # Acquire fence
lw t1, 0(s0) # Read data
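Dekker-style entry protocol (mutual exclusion without atomic operations; the wait path and the turn variable of the full algorithm are omitted):
# Hart 0
li t0, 1
sw t0, flag0, t2 # flag0 = 1 (t2 is a scratch register for the address)
fence w, rw # Order the flag0 store before the flag1 load
lw t1, flag1 # Read flag1
bnez t1, wait # If flag1 set, wait (wait path not shown)
fence r, rw # Keep critical-section accesses after the flag1 read
# Critical section
fence rw, w # Order the critical section before clearing flag0
sw zero, flag0, t2 # flag0 = 0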
These patterns rely on careful placement of fences to ensure correct ordering.
Debugging Memory Ordering Issues
Memory ordering bugs are notoriously difficult to debug:
- They may occur rarely and non-deterministically
- They may not appear on some hardware
- They may disappear when debugging code is added
Strategies:
- Use memory model checking tools (e.g., herd7, rmem)
- Add assertions to check invariants
- Use thread sanitizers (e.g., ThreadSanitizer)
- Test under high contention
- Review synchronization code carefully
The RISC-V memory model is formally specified, which allows using formal verification tools to prove correctness.
🛠️ Hands-on Lab: Lab 6.1 — The Bathroom Battle (Spinlock)
This lab guides you through implementing a Spinlock using the amoswap instruction to protect shared variables from concurrent access by multiple cores.
Lab Objectives
- Understand why naive read-modify-write operations cause Race Conditions
- Use amoswap (Atomic Memory Operation Swap) to implement a Spinlock
- Understand acquire/release semantics
Concept Explanation
Why Atomic Operations?
Consider this “naive” lock implementation:
// ❌ Wrong lock implementation
void lock_acquire(int *lock) {
while (*lock == 1) {} // (1) Read: check if lock is held
*lock = 1; // (2) Write: acquire lock
}
Problem: Steps (1) and (2) are not atomic!
Time →
Core 0: read lock=0 ──────────────────────── write lock=1
Core 1: ────────────────── read lock=0 ───── write lock=1
↑ Both cores think they acquired the lock!
The Role of amoswap
amoswap combines “read old value” and “write new value” into one atomic operation:
# amoswap.w.aq rd, rs2, (rs1)
# Atomically executes:
# temp = memory[rs1]
# memory[rs1] = rs2
# rd = temp
Code
Create lab6_spinlock.S:
# lab6_spinlock.S - Spinlock using amoswap
.section .text
.global spinlock_acquire
.global spinlock_release
# void spinlock_acquire(int *lock)
# a0 = address of lock
spinlock_acquire:
li t0, 1 # t0 = 1 (LOCKED state)
spin:
# amoswap.w.aq: Atomic swap with acquire semantics
# Atomically: old value → t1, new value 1 → memory[a0]
amoswap.w.aq t1, t0, (a0)
# Check if old value was 0 (UNLOCKED)
bnez t1, spin # If old value wasn't 0, keep spinning
ret # Successfully acquired lock!
# void spinlock_release(int *lock)
# a0 = address of lock
spinlock_release:
# amoswap.w.rl: Atomic swap with release semantics
# Write 0 (UNLOCKED) to lock
li t0, 0
amoswap.w.rl zero, t0, (a0) # Discard result (write to zero)
ret
C Driver Program main.c:
#include <stdio.h>
extern void spinlock_acquire(int *lock);
extern void spinlock_release(int *lock);
int shared_counter = 0;
int lock = 0;
void increment_safely(void) {
spinlock_acquire(&lock);
// Critical Section: protected region
shared_counter++;
spinlock_release(&lock);
}
int main() {
// Two back-to-back calls on one hart: this exercises the lock/unlock path, not real contention
increment_safely();
increment_safely();
printf("Counter: %d\n", shared_counter);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -o lab6_spinlock main.c lab6_spinlock.S
# Run
qemu-riscv64 lab6_spinlock
Expected Output:
Counter: 2
What You Just Did
You’ve implemented a correct spinlock:
- Atomicity: amoswap ensures read-and-write happens as one indivisible operation
- Acquire Semantics: .aq ensures subsequent operations in the critical section don’t move before the lock acquisition
- Release Semantics: .rl ensures previous operations in the critical section don’t move after the lock release
danieRTOS Reference: The scheduler in danieRTOS uses similar spinlocks to protect the task queue during context switches.
⚠️ Common Pitfalls
Pitfall 1: Confusing volatile with Memory Barriers
Error Scenario: Thinking volatile solves multi-core synchronization.
// ❌ Wrong: volatile only prevents compiler optimization, doesn't affect CPU reordering
volatile int flag = 0;
data = 42;
flag = 1; // CPU may still execute this first!
// ✅ Correct: Need a memory barrier
data = 42;
__sync_synchronize(); // Or: asm volatile("fence rw, rw" ::: "memory")
flag = 1;
Pitfall 2: Using Regular load/store in Spinlock
Error Scenario: Not using atomic instructions, allowing two cores into the critical section.
// ❌ Wrong: Non-atomic operations
void bad_lock(int *lock) {
while (*lock) {} // Read
*lock = 1; // Write — can be interrupted between these!
}
// ✅ Correct: Use atomic operations
void good_lock(int *lock) {
while (__sync_lock_test_and_set(lock, 1)) {} // Compiles to amoswap
}
Pitfall 3: Forgetting acquire/release Semantics
Error Scenario: Using atomic operations but without correct ordering.
# ❌ Wrong: No .aq, critical section reads may be moved earlier
amoswap.w t1, t0, (a0) # No .aq
# ❌ Wrong: No .rl, critical section writes may be delayed
amoswap.w zero, t0, (a0) # No .rl
# ✅ Correct: acquire for lock, release for unlock
amoswap.w.aq t1, t0, (a0) # acquire
amoswap.w.rl zero, t0, (a0) # release
Summary
Memory ordering is one of the most subtle and challenging aspects of concurrent programming. Modern processors reorder memory accesses for performance, creating behaviors that can seem impossible from a sequential perspective. RISC-V’s memory model defines which reorderings are legal and provides synchronization primitives to enforce ordering when needed.
The RISC-V Weak Memory Ordering (RVWMO) model allows aggressive reordering: absent dependencies, fences, or overlapping addresses, loads can be reordered with loads, stores with stores, loads with earlier stores, and stores with earlier loads. This weak model enables high performance but requires explicit synchronization. The model is formally specified in terms of preserved program order and a global memory order, allowing formal verification of concurrent algorithms.
Fence instructions enforce memory ordering. FENCE with predecessor and successor sets (r, w, rw) creates ordering between memory operations. FENCE RW, RW is a full barrier preventing all reordering. FENCE W, W orders stores (for publish). FENCE R, RW orders loads before subsequent accesses (for acquire). FENCE RW, W orders all accesses before stores (for release). FENCE.I synchronizes instruction and data caches after code modification.
Atomic instructions provide indivisible read-modify-write operations essential for synchronization. Load-Reserved/Store-Conditional (LR/SC) implements lock-free algorithms: LR loads and reserves, SC stores only if the reservation is still valid. Atomic Memory Operations (AMO) like AMOSWAP, AMOADD, and AMOAND perform atomic operations directly. Acquire and release annotations (.aq, .rl) provide ordering without separate fences.
The Total Store Ordering (RVTSO) model, provided by the Ztso extension, offers stronger ordering compatible with x86’s TSO model. RVTSO preserves all orderings except store-load (a later load may still pass an earlier store), making it easier to port x86 code but potentially reducing performance. Most RISC-V implementations use RVWMO for better performance, but RVTSO is available for compatibility.
Common synchronization patterns include spinlocks (using AMOSWAP or LR/SC), mutexes (with futex system calls for blocking), barriers (using atomic counters), and lock-free data structures (using LR/SC for ABA-safe updates). Each pattern requires careful fence placement to ensure correctness under RVWMO.
Compared to ARM and x86, RISC-V’s memory model is similar to ARM’s (both are weakly ordered) but simpler and more formally specified. x86’s TSO model is stronger, preserving more orderings by default, which simplifies programming but may reduce performance. RISC-V’s RVTSO extension provides x86 compatibility when needed.
Memory ordering bugs are notoriously difficult to debug—they occur rarely, non-deterministically, and may disappear when debugging code is added. Strategies include using memory model checking tools (herd7, rmem), thread sanitizers (ThreadSanitizer), formal verification, and careful code review. RISC-V’s formal memory model specification enables rigorous verification of concurrent algorithms.
Chapter 7. RISC-V Pipeline Fundamentals
Part V — Pipeline & Microarchitecture
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand the Pipeline Concept: Know why pipelining improves throughput
- Master the 5-Stage Pipeline: Be familiar with the functions of IF, ID, EX, MEM, WB stages
- Identify Hazard Types: Distinguish between Structural, Data, and Control Hazards
- Understand Solutions: Grasp the principles of Stalling, Forwarding, and Branch Prediction
- Analyze Pipeline Performance: Calculate factors affecting CPI (Cycles Per Instruction)
💡 Scenario: The Wisdom of the Factory Assembly Line
Scene: Junior visits Architect’s semiconductor factory, curious about how the production line works.
Junior: “Architect, I’ve always had a question. Textbooks say a CPU executes one instruction per cycle, but I see each instruction goes through five steps—fetch, decode, execute, memory access, writeback. How can it possibly complete in one cycle?”
Architect: “Great question. Come, let me show you the factory floor.”
(They walk to the production line)
Architect: “Look at this assembly line. Each station does only one thing:
- Station 1: Get parts (Fetch)
- Station 2: Check specifications (Decode)
- Station 3: Assemble (Execute)
- Station 4: Quality inspection (Memory)
- Station 5: Package (Writeback)
Each product goes through five stations to complete. If each station takes 1 minute, one product takes 5 minutes, right?”
Junior: “Right.”
Architect: “But look—how many products are being processed simultaneously on the line right now?”
Junior: “Five! There’s one at each station.”
Architect: “Exactly. Although each product takes 5 minutes to complete, the line outputs one finished product every minute. This is the power of Pipeline—Throughput is one instruction per cycle, even though Latency for a single instruction is five cycles.”
Junior: “I see! What if one station gets stuck?”
Architect: “That’s a Hazard. Imagine the screwdriver at station 3 breaks—
- Structural Hazard: Not enough tools—two products fighting for the same screwdriver.
- Data Hazard: Station 3 needs a part from station 5, but that product isn’t finished yet.
- Control Hazard: A phone call says ‘Stop! Switch to a different model!’ and all half-finished products are scrapped.”
Junior: “How do you solve these?”
Architect: “Three tricks:
- Stall: Stop the line and wait, but this reduces throughput.
- Forwarding: Station 3’s result doesn’t wait for station 5 to package—pass it directly from the side.
- Prediction: Guess what model the boss wants. If right, keep going. If wrong, tear it down and redo.”
Junior: “Got it! Let’s see how the CPU handles these situations.”
The pipeline is the heart of modern processor design. It’s the mechanism that allows a processor to work on multiple instructions simultaneously, dramatically improving throughput. In this chapter, we’ll explore how RISC-V processors implement pipelining, from the classic five-stage pipeline to advanced techniques for handling hazards and branches.
Understanding pipelines is crucial for anyone working with RISC-V, whether you’re designing hardware, writing compilers, or optimizing performance-critical code. The beauty of RISC-V’s design is that its clean, regular instruction set makes it particularly well-suited for efficient pipeline implementation. We’ll examine the classic five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback), the three types of hazards that disrupt pipeline flow (structural, data, control), and techniques for handling them (forwarding, stalling, branch prediction). We’ll also explore how pipeline depth affects performance and complexity.
7.1 Classic Five-Stage Pipeline
The classic five-stage pipeline is the foundation of most RISC processor designs. It divides instruction execution into five distinct stages, allowing up to five instructions to be in flight simultaneously. Let’s walk through each stage.
Figure 7.1: Five-Stage Pipeline Overview
graph LR
IF[IF<br/>Instruction<br/>Fetch] --> ID[ID<br/>Instruction<br/>Decode]
ID --> EX[EX<br/>Execute]
EX --> MEM[MEM<br/>Memory<br/>Access]
MEM --> WB[WB<br/>Write<br/>Back]
style IF fill:#e1f5ff
style ID fill:#fff4e1
style EX fill:#ffe1e1
style MEM fill:#e1ffe1
style WB fill:#f0e1ff
Figure 7.2: Pipeline Timing Diagram
Cycle: 1 2 3 4 5 6 7 8 9
I1: IF ID EX MEM WB
I2: IF ID EX MEM WB
I3: IF ID EX MEM WB
I4: IF ID EX MEM WB
I5: IF ID EX MEM WB
In steady state, all five stages are busy with different instructions, achieving a throughput of one instruction per cycle (IPC = 1).
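The timing diagram follows a simple rule: the first instruction needs one cycle per stage, and every later instruction finishes one cycle after its predecessor. The C sketch below captures that arithmetic for an ideal pipeline with no stalls; the function names are illustrative.

```c
/* Cycles to run n instructions on a pipeline with `stages` stages and
 * no stalls: the first instruction takes `stages` cycles, and each
 * later one completes one cycle after the previous. */
unsigned pipeline_cycles(unsigned stages, unsigned n) {
    return n == 0 ? 0 : stages + (n - 1);
}

/* Same work without pipelining: every instruction occupies the whole
 * datapath for `stages` cycles before the next one starts. */
unsigned unpipelined_cycles(unsigned stages, unsigned n) {
    return stages * n;
}
```

For the five instructions in Figure 7.2, `pipeline_cycles(5, 5)` gives 9 cycles, against 25 for `unpipelined_cycles(5, 5)`. As n grows, the pipelined cost per instruction approaches one cycle.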
Instruction Fetch (IF)
The first stage fetches the next instruction from memory. The program counter (PC) points to the address of the instruction to fetch. The instruction is read from the instruction cache (I-cache) or main memory if there’s a cache miss.
In RISC-V, all instructions are either 16-bit (compressed, with C extension) or 32-bit (standard). The fetch unit must handle both formats, though in a simple implementation without the C extension, all instructions are 32 bits wide and aligned on 4-byte boundaries.
IF Stage:
instruction = I-cache[PC]
next_PC = PC + 4 // or PC + 2 for compressed instructions
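As a rough sketch of how a fetch unit might choose between PC + 4 and PC + 2, the two lowest bits of an instruction are enough for the standard encodings: anything other than `11` marks a 16-bit compressed instruction. The function names are illustrative, and longer-than-32-bit encodings are deliberately not modeled.

```python
def instruction_length(low_halfword):
    """Return the instruction length in bytes, given its lowest 16 bits.

    Standard encodings only: low bits != 0b11 means a 16-bit (C extension)
    instruction, low bits == 0b11 means a 32-bit instruction.
    """
    return 4 if (low_halfword & 0b11) == 0b11 else 2

def next_pc(pc, low_halfword):
    """Sequential next PC (ignores branches and jumps)."""
    return pc + instruction_length(low_halfword)

print(hex(next_pc(0x1000, 0x0013)))  # low half of addi x0, x0, 0 (nop), a 32-bit encoding
print(hex(next_pc(0x1000, 0x0001)))  # c.nop, a 16-bit encoding
```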
Fetch bandwidth is critical for performance. A processor that can fetch multiple instructions per cycle (superscalar) needs wider fetch paths and more complex PC prediction logic.
Instruction Decode (ID)
The second stage decodes the instruction and reads operands from the register file. The decoder examines the opcode and function fields to determine what operation to perform and which registers to read.
RISC-V’s regular instruction format makes decoding straightforward. All instructions have the opcode in bits [6:0], and register specifiers are always in the same positions:
- rs1 (source register 1): bits [19:15]
- rs2 (source register 2): bits [24:20]
- rd (destination register): bits [11:7]
ID Stage:
opcode = instruction[6:0]
rs1_data = register_file[instruction[19:15]]
rs2_data = register_file[instruction[24:20]]
rd_addr = instruction[11:7]
immediate = decode_immediate(instruction)
Immediate generation is also part of this stage. RISC-V has several immediate formats (I-type, S-type, B-type, U-type, J-type), and the decoder must extract and sign-extend the immediate value correctly.
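The field extraction and sign extension can be sketched in Python. The helper names (`bits`, `sign_extend`, `decode_fields`, `imm_i`, `imm_s`) are hypothetical, and only the I-type and S-type immediates are shown; the other formats follow the same pattern with different bit shuffles.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret a width-bit value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def decode_fields(instr):
    """Fields that sit at the same position in every 32-bit format."""
    return {
        "opcode": bits(instr, 6, 0),
        "rd": bits(instr, 11, 7),
        "rs1": bits(instr, 19, 15),
        "rs2": bits(instr, 24, 20),
    }

def imm_i(instr):
    """I-type: imm[11:0] = instr[31:20]."""
    return sign_extend(bits(instr, 31, 20), 12)

def imm_s(instr):
    """S-type: imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]."""
    return sign_extend((bits(instr, 31, 25) << 5) | bits(instr, 11, 7), 12)

addi = 0xFFF10093  # addi x1, x2, -1
sw = 0x00552623    # sw x5, 12(x10)
print(decode_fields(addi), imm_i(addi))
print(decode_fields(sw), imm_s(sw))
```

Note that the decoder extracts every field unconditionally. For `addi`, the bits in the rs2 position are part of the immediate, so the decoded "rs2" value is simply ignored by later stages.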
Execute (EX)
The third stage performs the actual computation. This is where the ALU (Arithmetic Logic Unit) operates on the source operands to produce a result.
For arithmetic instructions like ADD, SUB, AND, the ALU performs the operation. For load/store instructions, the ALU calculates the memory address by adding the base register and offset. For branches, the ALU evaluates the branch condition.
EX Stage:
case opcode:
ADD: result = rs1_data + rs2_data
SUB: result = rs1_data - rs2_data
LOAD: address = rs1_data + immediate
BEQ: taken = (rs1_data == rs2_data)
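Branch condition evaluation happens here. If the outcome differs from what the fetch stage assumed (for example, a taken branch in a pipeline that keeps fetching sequentially), the wrong-path instructions must be flushed (more on this in Section 7.4).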
Memory Access (MEM)
The fourth stage accesses data memory for load and store instructions. For loads, data is read from the data cache (D-cache). For stores, data is written to the cache.
MEM Stage:
if LOAD:
load_data = D-cache[address]
if STORE:
D-cache[address] = rs2_data
Cache hit or miss is determined here. A cache miss can stall the pipeline for many cycles while data is fetched from main memory.
For non-memory instructions, this stage does nothing (or passes through the result from the EX stage).
Write Back (WB)
The fifth and final stage writes the result back to the register file. This is the commit point where the instruction’s effects become architecturally visible.
WB Stage:
if rd != x0: // x0 is hardwired to zero
register_file[rd] = result
RISC-V’s x0 register is always zero, so writes to x0 are discarded. This is checked in hardware to avoid unnecessary register file writes.
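A tiny model of this behavior, as a sketch only: the `RegisterFile` class below is a hypothetical software stand-in for the hardware, with a 32-bit register width assumed by default.

```python
class RegisterFile:
    """32 integer registers; x0 always reads as zero."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, index):
        return self.regs[index]

    def write(self, index, value, xlen=32):
        if index != 0:  # writes to x0 are discarded
            self.regs[index] = value & ((1 << xlen) - 1)

rf = RegisterFile()
rf.write(0, 123)  # discarded: x0 stays zero
rf.write(5, 42)
print(rf.read(0), rf.read(5))
```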
Pipeline Example: Executing a Simple Program
Let’s trace a simple RISC-V program through the pipeline:
# Example: Calculate sum = a + b + c
lw x1, 0(x10) # I1: Load a from memory
lw x2, 4(x10) # I2: Load b from memory
lw x3, 8(x10) # I3: Load c from memory
add x4, x1, x2 # I4: x4 = a + b
add x5, x4, x3 # I5: x5 = (a + b) + c
sw x5, 12(x10) # I6: Store sum to memory
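Cycle-by-cycle execution (assuming no cache misses, and with the forwarding described in Section 7.3 so that the dependent instructions do not stall):
Cycle:  1    2    3    4    5    6    7    8    9    10
I1:     IF   ID   EX   MEM  WB
I2:          IF   ID   EX   MEM  WB
I3:               IF   ID   EX   MEM  WB
I4:                    IF   ID   EX   MEM  WB
I5:                         IF   ID   EX   MEM  WB
I6:                              IF   ID   EX   MEM  WB
In this ideal case, 6 instructions complete in 10 cycles: 5 cycles for the first instruction to pass through the pipeline, then one more cycle for each of the remaining 5. After the pipeline fills, one instruction completes per cycle.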
7.2 Pipeline Hazards
Pipelining would be perfect if instructions were completely independent. Unfortunately, they’re not. Hazards are situations where the next instruction cannot execute in the next clock cycle. There are three types of hazards.
Structural Hazards
A structural hazard occurs when two instructions need the same hardware resource at the same time. For example, if the instruction fetch and memory access stages both need to access memory in the same cycle, there’s a conflict.
In a simple RISC-V implementation with a single memory port, you can’t fetch an instruction and perform a load/store simultaneously. The solution is either to stall one operation or to use separate instruction and data caches (Harvard architecture).
Register file port conflicts are another example. If the register file has only one write port, you can’t write back two results in the same cycle. Most RISC-V implementations avoid this by having enough ports or by carefully scheduling operations.
Data Hazards
Data hazards occur when an instruction depends on the result of a previous instruction that hasn’t completed yet. There are three types:
RAW (Read After Write) — The most common hazard. An instruction tries to read a register before a previous instruction writes it:
add x1, x2, x3 # x1 = x2 + x3
sub x4, x1, x5 # x4 = x1 - x5 (needs x1 from previous instruction)
The sub instruction needs the value of x1, but the add instruction hasn’t written it yet. This is a true dependency and must be handled carefully.
Figure 7.3: RAW Data Hazard
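Cycle:          1    2    3    4    5    6    7    8
add x1,x2,x3:   IF   ID   EX   MEM  WB
sub x4,x1,x5:        IF   ID   --   --   EX   MEM  WB
                               └ stall ┘
Without forwarding, the sub must stall until add writes x1 in cycle 5. This diagram assumes the register file is written in the first half of a cycle and read in the second half, so sub can read x1 in ID during cycle 5 and enter EX in cycle 6.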
WAR (Write After Read) — An instruction writes a register before a previous instruction reads it. This is an anti-dependency:
add x1, x2, x3 # reads x2
sub x2, x4, x5 # writes x2
In an in-order pipeline, WAR hazards don’t occur because instructions complete in order. But in out-of-order processors (Chapter 8), they can happen.
WAW (Write After Write) — Two instructions write the same register. This is an output dependency:
add x1, x2, x3 # writes x1
sub x1, x4, x5 # writes x1
Again, this is mainly a concern for out-of-order processors.
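The three definitions reduce to simple register comparisons. Here is a minimal sketch, assuming a made-up `Instr` record that holds only a destination register and a tuple of source registers:

```python
from collections import namedtuple

# Hypothetical minimal instruction record: destination and source register numbers.
Instr = namedtuple("Instr", ["rd", "srcs"])

def classify_hazards(first, second):
    """Return the dependency types of `second` on the earlier instruction `first`."""
    hazards = set()
    if first.rd != 0 and first.rd in second.srcs:
        hazards.add("RAW")  # second reads what first writes
    if second.rd != 0 and second.rd in first.srcs:
        hazards.add("WAR")  # second writes what first reads
    if first.rd != 0 and first.rd == second.rd:
        hazards.add("WAW")  # both write the same register
    return hazards

add = Instr(rd=1, srcs=(2, 3))                      # add x1, x2, x3
print(classify_hazards(add, Instr(4, (1, 5))))      # sub x4, x1, x5
print(classify_hazards(add, Instr(2, (4, 5))))      # sub x2, x4, x5
print(classify_hazards(add, Instr(1, (4, 5))))      # sub x1, x4, x5
```

The checks skip x0 because it can never carry a value from one instruction to another.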
Control Hazards
Control hazards occur when the pipeline doesn’t know which instruction to fetch next. This happens with branches and jumps.
Consider a conditional branch:
beq x1, x2, target # if x1 == x2, jump to target
add x3, x4, x5 # next instruction if not taken
...
target:
sub x6, x7, x8 # target instruction if taken
The pipeline doesn’t know whether to fetch the add or the sub until the branch condition is evaluated in the EX stage. By that time, the pipeline has already fetched the next instruction speculatively.
Branch misprediction causes pipeline bubbles (wasted cycles) because the speculatively fetched instructions must be discarded.
Figure 7.6: Branch Misprediction
Cycle:           1    2    3    4    5    6
beq (taken):     IF   ID   EX   MEM  WB
Wrong Path I1:        IF   ID   XX
Wrong Path I2:             IF   XX
Correct Path:                   IF   ID   EX
                      └ 2 cycles wasted ┘
When the branch is resolved in cycle 3 and found to be mispredicted, the two wrong-path instructions fetched in cycles 2 and 3 are squashed, and fetch restarts on the correct path in cycle 4. The penalty is 2 cycles.
7.3 Hazard Resolution
Processors use several techniques to handle hazards without stalling the pipeline too much.
Forwarding (Bypassing)
Forwarding (also called bypassing) allows a result to be used before it’s written back to the register file. This is the most important technique for reducing data hazard stalls.
Consider our earlier example:
add x1, x2, x3 # x1 = x2 + x3 (result available at end of EX stage)
sub x4, x1, x5 # x4 = x1 - x5 (needs x1 in EX stage)
Without forwarding, the sub would have to wait until the add writes x1 in the WB stage (two cycles after the add’s EX stage). With forwarding, the result from the add instruction’s EX stage can be forwarded directly to the sub instruction’s EX stage.
Forwarding paths are data paths that bypass the register file:
- EX-to-EX forwarding: Result from EX stage to EX stage (1 cycle later)
- MEM-to-EX forwarding: Result from MEM stage to EX stage (2 cycles later)
- WB-to-EX forwarding: Result from WB stage to EX stage (3 cycles later, but this is just normal register file read)
Figure 7.4: Forwarding Paths
graph TB
subgraph Pipeline Stages
IF[IF Stage]
ID[ID Stage<br/>Register Read]
EX[EX Stage<br/>ALU]
MEM[MEM Stage<br/>Data Cache]
WB[WB Stage<br/>Register Write]
end
IF --> ID
ID --> EX
EX --> MEM
MEM --> WB
EX -.->|EX-to-EX<br/>Forwarding| EX
MEM -.->|MEM-to-EX<br/>Forwarding| EX
WB -.->|Normal<br/>Register Read| ID
style EX fill:#ffe1e1
style MEM fill:#e1ffe1
style WB fill:#f0e1ff
Forwarding Logic (simplified):
// Forwarding unit logic
if (EX_MEM.RegWrite && (EX_MEM.rd != 0) && (EX_MEM.rd == ID_EX.rs1))
ForwardA = 01; // Forward from EX/MEM pipeline register
else if (MEM_WB.RegWrite && (MEM_WB.rd != 0) && (MEM_WB.rd == ID_EX.rs1))
ForwardA = 10; // Forward from MEM/WB pipeline register
else
ForwardA = 00; // No forwarding, use register file
// Similar logic for rs2 (ForwardB)
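The same priority logic can be modeled in Python to check individual cases. This is a sketch, not a hardware description: `forward_select` and the tuple layout are illustrative, and the select encodings simply follow the pseudocode above.

```python
# Mux select encodings, following the pseudocode above.
FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB = 0b00, 0b01, 0b10

def forward_select(src, ex_mem, mem_wb):
    """Pick the operand source for one EX-stage source register.

    ex_mem and mem_wb are (reg_write, rd) tuples describing the instructions
    currently held in the EX/MEM and MEM/WB pipeline registers.
    """
    ex_write, ex_rd = ex_mem
    wb_write, wb_rd = mem_wb
    if ex_write and ex_rd != 0 and ex_rd == src:
        return FROM_EX_MEM  # the younger producer holds the newest value
    if wb_write and wb_rd != 0 and wb_rd == src:
        return FROM_MEM_WB
    return FROM_REGFILE

# sub x4, x1, x5 is in EX while add x1, x2, x3 sits in EX/MEM:
forward_a = forward_select(1, ex_mem=(True, 1), mem_wb=(False, 0))
forward_b = forward_select(5, ex_mem=(True, 1), mem_wb=(False, 0))
print(forward_a, forward_b)
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the instruction closer to the consumer in program order has the most recent value.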
Example with forwarding:
add x1, x2, x3 # I1: x1 = x2 + x3 (result ready at end of EX)
sub x4, x1, x5 # I2: x4 = x1 - x5 (needs x1 at start of EX)
Cycle:  1    2    3    4    5    6
I1:     IF   ID   EX   MEM  WB
I2:          IF   ID   EX   MEM  WB
                       ^
                       |
             Forward from I1's EX stage
With forwarding, I2 can execute immediately after I1, with no stall!
Forwarding doesn’t solve all data hazards. The classic example is a load followed immediately by a use:
lw x1, 0(x2) # load x1 from memory
add x3, x1, x4 # use x1 immediately
The load data isn’t available until the end of the MEM stage, but the add needs it at the beginning of the EX stage. Even with forwarding, a one-cycle stall is required.
Figure 7.5: Load-Use Hazard
Cycle:          1    2    3    4    5    6    7
lw x1,0(x2):    IF   ID   EX   MEM  WB
add x3,x1,x4:        IF   ID   --   EX   MEM  WB
                               stall
The add must stall in cycle 4 because the load data isn’t ready until the end of that cycle. With MEM-to-EX forwarding, the add can then execute in cycle 5, so exactly one bubble is needed.
Pipeline Stalls
When forwarding isn’t enough, the pipeline must stall (insert bubbles). A stall freezes earlier pipeline stages while later stages continue.
For the load-use hazard above, the pipeline inserts a one-cycle stall:
Cycle:  1    2    3    4      5    6    7
lw      IF   ID   EX   MEM    WB
add          IF   ID   stall  EX   MEM  WB
The add instruction’s ID stage is held for an extra cycle, creating a bubble in the EX stage.
Stall detection logic monitors the pipeline for hazards:
// Hazard detection unit
bool load_use_hazard = (ID_EX.MemRead) && (ID_EX.rd != 0) &&
                       ((ID_EX.rd == IF_ID.rs1) ||
                        (ID_EX.rd == IF_ID.rs2));
if (load_use_hazard) {
// Stall the pipeline
PC_write = 0; // Don't update PC
IF_ID_write = 0; // Don't update IF/ID register
Control_signals = 0; // Insert bubble (nop) in EX stage
}
Performance impact: Each stall reduces IPC (Instructions Per Cycle). Compilers try to schedule instructions to avoid load-use hazards when possible.
Compiler scheduling example:
# Original code (has load-use hazard):
lw x1, 0(x2)
add x3, x1, x4 # Stall! (depends on x1)
sub x5, x6, x7
# Compiler-scheduled code (no hazard):
lw x1, 0(x2)
sub x5, x6, x7 # Independent instruction fills the slot
add x3, x1, x4 # No stall now (x1 is ready)
By reordering independent instructions, the compiler can hide load latency and avoid stalls.
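The effect of the reordering can be checked with a small stall counter. This is a sketch under simplifying assumptions: a five-stage pipeline with full forwarding, where the only stall is a one-cycle bubble when an instruction reads the destination of the load directly before it. The tuple format is made up for this example.

```python
def count_load_use_stalls(program):
    """Count one-cycle load-use stalls in a five-stage pipeline with forwarding.

    Each instruction is (mnemonic, rd, sources). A stall is charged when an
    instruction reads the destination of the load immediately before it.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_rd, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_rd != 0 and prev_rd in curr_srcs:
            stalls += 1
    return stalls

original = [
    ("lw", 1, (2,)),     # lw  x1, 0(x2)
    ("add", 3, (1, 4)),  # add x3, x1, x4
    ("sub", 5, (6, 7)),  # sub x5, x6, x7
]
scheduled = [original[0], original[2], original[1]]
print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```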
Compiler Scheduling
Compilers can reorder instructions to avoid hazards without changing program semantics. This is called instruction scheduling.
Example: Instead of this (with a load-use hazard):
lw x1, 0(x2)
add x3, x1, x4 # stall!
The compiler can insert an independent instruction:
lw x1, 0(x2)
sub x5, x6, x7 # independent instruction
add x3, x1, x4 # no stall now
Loop unrolling and software pipelining are advanced compiler techniques that expose more instruction-level parallelism and reduce hazards.
7.4 Branch Handling
Branches are the bane of pipelining. Every branch is a potential control hazard that can disrupt the smooth flow of instructions through the pipeline.
Branch Prediction Basics
Branch prediction tries to guess whether a branch will be taken or not-taken before the condition is evaluated. The pipeline speculatively fetches and executes instructions based on this prediction. If the prediction is correct, there’s no penalty. If it’s wrong, the pipeline must be flushed and restarted from the correct path.
Misprediction penalty is the number of cycles wasted when a branch is mispredicted. In a five-stage pipeline, if the branch is resolved in the EX stage (cycle 3), the misprediction penalty is 2 cycles (the IF and ID stages of the wrong-path instructions must be discarded).
Branch prediction accuracy is critical for performance. Modern processors achieve 95-99% accuracy on typical workloads. Even a 5% misprediction rate can significantly impact performance if branches are frequent (every 5-10 instructions in typical code).
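A back-of-the-envelope model shows why the penalty matters more as pipelines get deeper. The numbers below are assumptions chosen for illustration (20% branches, 5% mispredicted, and a hypothetical 15-cycle penalty for a deep pipeline), not measurements of any real core.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """CPI including the average cost of branch mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload: 20% of instructions are branches, 5% of those are mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=2)   # five-stage pipeline
deep = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=15)     # hypothetical deep pipeline
print(round(shallow, 3), round(deep, 3))
```

With the same predictor, the five-stage pipeline loses about 2% of its performance in this model, while the deep pipeline loses about 15%.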
Static Branch Prediction
Static prediction uses fixed rules that don’t change during execution. The simplest strategies are:
Always not-taken: Assume all branches are not taken. This works well for forward branches (like if statements that skip over error handling code).
Always taken: Assume all branches are taken. This works well for backward branches (like loop back-edges).
BTFNT (Backward Taken, Forward Not-Taken): A hybrid strategy that predicts backward branches as taken and forward branches as not-taken. This is surprisingly effective because loops (backward branches) are usually taken, and forward branches (error checks, early exits) are usually not taken.
Static Prediction:
if (branch_target < PC): // backward branch
predict_taken()
else: // forward branch
predict_not_taken()
Profile-guided prediction: The compiler can use profiling data to predict branches based on actual execution patterns. Hot paths are predicted as taken.
Dynamic Branch Prediction
Dynamic prediction learns from past branch behavior and adapts during execution. This is much more accurate than static prediction.
Branch History Table (BHT): A table indexed by the branch PC (or a hash of it) that stores prediction information. Each entry contains a two-bit saturating counter:
00: Strongly not-taken
01: Weakly not-taken
10: Weakly taken
11: Strongly taken
When a branch is taken, the counter increments (saturates at 11). When not taken, it decrements (saturates at 00). The prediction is “taken” if the counter is 10 or 11.
Why two bits? A single bit mispredicts twice for every execution of a loop: once at the exit (predicted taken, actually not taken, so the bit flips to not-taken) and once more on the first iteration of the next execution (predicted not-taken, actually taken, so the bit flips back). Two bits provide hysteresis: a single misprediction doesn’t immediately flip the prediction.
Figure 7.7: Two-Bit Saturating Counter State Machine
stateDiagram-v2
[*] --> SNT
SNT: 00<br/>Strongly<br/>Not-Taken
WNT: 01<br/>Weakly<br/>Not-Taken
WT: 10<br/>Weakly<br/>Taken
ST: 11<br/>Strongly<br/>Taken
SNT --> SNT: Not Taken
SNT --> WNT: Taken
WNT --> SNT: Not Taken
WNT --> WT: Taken
WT --> WNT: Not Taken
WT --> ST: Taken
ST --> WT: Not Taken
ST --> ST: Taken
Example: Loop prediction
for (int i = 0; i < 100; i++) {
// Loop body
}
The loop back-edge branch is taken 99 times and not-taken once (exit). With a 2-bit counter:
- After a few iterations, counter reaches 11 (strongly taken)
- From then on, predicts “taken” for every taken iteration (correct)
- Iteration 100: not-taken (misprediction), counter goes to 10
- Next loop: first iteration is taken (still predicted correctly, since 10 means “taken”), counter goes back to 11
- Result: Only 1 misprediction per 100 iterations in steady state (99% accuracy)
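The difference between one and two bits is easy to simulate. The sketch below runs the loop back-edge three times in a row, starting both predictors in a "taken" state; the function names and starting states are choices made for this example.

```python
def mispredictions_2bit(outcomes, counter=2):
    """Two-bit saturating counter: predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

def mispredictions_1bit(outcomes, last=True):
    """One-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken
    return misses

# Back-edge of a 100-iteration loop: taken 99 times, then not taken.
one_run = [True] * 99 + [False]
three_runs = one_run * 3
print(mispredictions_2bit(three_runs), mispredictions_1bit(three_runs))
```

The two-bit counter misses only at each loop exit. The one-bit predictor also misses on re-entry, because the exit flipped its single bit.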
Local vs global history:
- Local history: Each branch has its own history (pattern of taken/not-taken).
- Global history: All branches share a global history register that tracks the last N branch outcomes.
Global history can capture correlations between branches (e.g., if branch A is taken, branch B is likely taken too).
Branch Target Buffer (BTB)
The BTB is a cache that stores the target addresses of recently executed branches. When a branch is predicted taken, the BTB provides the target address so the fetch unit can immediately fetch from the correct location.
Without a BTB, even if a branch is correctly predicted as taken, the pipeline must wait until the branch target is calculated in the EX stage. The BTB eliminates this delay.
BTB Lookup:
if (PC in BTB) and (predict_taken):
next_PC = BTB[PC].target
else:
next_PC = PC + 4
Return Address Stack (RAS): Function returns (ret in RISC-V, which is jalr x0, 0(x1)) are a special case. The return address is pushed onto a hardware stack when a function is called (jal or jalr with a link register such as x1 as the destination), and popped when returning. This provides near-perfect prediction for function returns.
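A return address stack is small enough to sketch in a few lines. The class below is a hypothetical software model: the depth of 8 and the policy of dropping the oldest entry on overflow are assumptions, and real designs differ in both.

```python
class ReturnAddressStack:
    """Fixed-depth return address stack; the oldest entry is dropped when full."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, call_pc, instr_bytes=4):
        """Push the address of the instruction after the call."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(call_pc + instr_bytes)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack()
ras.on_call(0x1000)  # main calls f
ras.on_call(0x2010)  # f calls g
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

Because calls and returns nest, the most recent call is always the next one to return, which is why a stack predicts them so well until recursion exceeds its depth.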
RISC-V: No Branch Delay Slots
RISC-V does not have branch delay slots, unlike MIPS. In MIPS, the instruction immediately after a branch is always executed, regardless of whether the branch is taken. This is called a delay slot.
MIPS example:
beq $t0, $t1, target
add $t2, $t3, $t4 # delay slot: always executed
The add instruction executes even if the branch is taken. Compilers must fill the delay slot with a useful instruction or a nop.
RISC-V eliminated delay slots for several reasons:
- Simpler pipeline control: No need to track delay slot instructions.
- Cleaner ISA: The semantics are more intuitive.
- Better for superscalar: Delay slots complicate multi-issue pipelines.
- Compiler complexity: Filling delay slots is tricky and doesn’t always help.
This is a significant improvement over MIPS and makes RISC-V easier to implement and optimize.
7.5 Trap and Interrupt Handling in Pipeline
Traps and interrupts (covered in Chapter 4) have a significant impact on the pipeline. They require precise exception handling and pipeline flushing.
Precise Exceptions
A precise exception means the architectural state is consistent when the exception is taken. All instructions before the faulting instruction have completed, and no instructions after it have modified architectural state.
This requires in-order commit: even if instructions execute out-of-order (Chapter 8), they must commit (update architectural state) in program order.
For a five-stage in-order pipeline, precise exceptions are natural: instructions complete in order. But the pipeline must ensure that:
- All instructions before the exception have written back.
- The faulting instruction and all later instructions have not modified state.
Pipeline Flush on Trap
When a trap occurs, the pipeline must be flushed. All instructions in the pipeline that are younger than the trap are discarded (squashed).
Cycle:  1    2    3         4       5
I1:     IF   ID   EX(trap)  --      --
I2:          IF   ID        squash  --
I3:               IF        squash  --
I4:                         squash  --
Instructions I2, I3, I4 are squashed. The PC is redirected to the trap handler (the address held in xtvec, that is, mtvec or stvec depending on which privilege mode handles the trap), and execution resumes there.
Squashing means:
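- Clear pipeline registers (set to nop or invalid).
- Prevent any writes to architectural state (register file, memory, CSRs).
- Discard any speculative results of the squashed instructions (for example, pending register or store-buffer updates). Microarchitectural side effects such as cache fills and branch predictor updates are typically left in place.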
Performance Cost
Traps are expensive. A trap in a five-stage pipeline wastes 3-4 cycles (the instructions in the pipeline that must be squashed). In deeper pipelines or out-of-order processors, the cost is even higher.
This is why:
- Exception-free code is faster: Avoid page faults, misaligned accesses, illegal instructions.
- Interrupts should be infrequent: High interrupt rates can severely degrade performance.
- Trap handlers should be fast: The sooner you return from a trap, the sooner useful work resumes.
7.6 Simple In-Order Implementations
Let’s look at how real RISC-V processors implement pipelines.
Single-Issue vs Multi-Issue
Single-issue processors execute one instruction per cycle (at most). The classic five-stage pipeline is single-issue.
Multi-issue (superscalar) processors can execute multiple instructions per cycle. For example, a 2-issue processor can fetch, decode, and execute 2 instructions simultaneously.
Issue width is the maximum number of instructions that can be issued per cycle. Wider issue requires:
- Multiple fetch ports (or wider fetch)
- Multiple decode units
- Multiple execution units (ALUs, load/store units)
- More register file ports
- More complex hazard detection and forwarding logic
Scalar vs Superscalar
Scalar means single-issue, one instruction at a time.
Superscalar means multi-issue, exploiting instruction-level parallelism (ILP) by executing independent instructions in parallel.
Superscalar processors are more complex but can achieve higher IPC (Instructions Per Cycle). A 4-issue superscalar can theoretically execute 4 instructions per cycle, achieving IPC = 4 (though in practice, IPC is usually 1.5-2.5 due to hazards and dependencies).
RISC-V Implementation Examples
Rocket Core: An open-source, in-order, single-issue RISC-V core developed at UC Berkeley. It has a classic five-stage pipeline and is used in many academic and commercial projects. Rocket is simple, efficient, and easy to understand.
BOOM (Berkeley Out-of-Order Machine): An open-source, out-of-order, superscalar RISC-V core (also from UC Berkeley). BOOM is much more complex than Rocket but achieves higher performance. We’ll cover out-of-order execution in Chapter 8.
SiFive Cores:
- E-series (e.g., E20, E21): Small, low-power, in-order cores for embedded systems.
- U-series (e.g., U54, U74): Higher-performance, in-order cores with MMU for running Linux.
- P-series (e.g., P550, P670): High-performance, out-of-order cores for demanding applications.
Performance Characteristics
CPI (Cycles Per Instruction): The average number of cycles needed to execute one instruction. For an ideal five-stage pipeline with no hazards, CPI = 1. In practice, hazards increase CPI to 1.2-1.5 for in-order cores.
IPC (Instructions Per Cycle): The inverse of CPI. IPC = 1/CPI. Higher IPC means better performance.
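As a quick worked example, here is the arithmetic for a hypothetical measurement. The counts are invented for illustration: one million instructions that incur 250,000 stall cycles on a single-issue core.

```python
def cpi_and_ipc(cycles, instructions):
    """Return (CPI, IPC) for a measured interval."""
    cpi = cycles / instructions
    return cpi, 1.0 / cpi

# Hypothetical measurement: 1,000,000 instructions with 250,000 stall cycles.
instructions = 1_000_000
cycles = instructions + 250_000
cpi, ipc = cpi_and_ipc(cycles, instructions)
print(cpi, ipc)
```

A CPI of 1.25 falls inside the 1.2-1.5 range quoted above and corresponds to an IPC of 0.8.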
Pipeline depth trade-offs:
- Deeper pipelines (more stages) allow higher clock frequencies because each stage does less work. But they increase branch misprediction penalties and make hazard handling more complex.
- Shallow pipelines (fewer stages) have lower misprediction penalties and simpler control, but lower maximum frequency.
Modern processors balance these trade-offs. RISC-V cores range from 3-stage pipelines (simple embedded cores) to 10+ stage pipelines (high-performance cores).
🛠️ Hands-on Lab: Lab 7.1 — Pipeline Bubble Analysis
This lab is a pencil-and-paper exercise that guides you through analyzing Pipeline Hazards in real assembly code and drawing Pipeline Diagrams.
Lab Objectives
- Identify Data Hazards in code
- Draw Pipeline Timing Diagrams
- Calculate Stall Cycles and actual CPI
- Understand how Forwarding reduces Stalls
Analysis Example
Consider the following RISC-V assembly:
lw x1, 0(x2) # I1: Load x1 from memory
add x3, x1, x4 # I2: x3 = x1 + x4 (depends on I1's result)
sub x5, x3, x6 # I3: x5 = x3 - x6 (depends on I2's result)
and x7, x5, x8 # I4: x7 = x5 & x8 (depends on I3's result)
Exercise 1: Pipeline Without Forwarding
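Assume no Forwarding: results must be written back in WB before a later instruction can read them in ID. This exercise also assumes the register file cannot be written and read in the same cycle, so a dependent instruction reads its operands in the cycle after the producer’s WB.
Draw the Pipeline Diagram:
Cycle:     1    2    3    4    5    6    7    8    9    10   11   12
I1 (lw):   IF   ID   EX   MEM  WB
I2 (add):       IF   ID   --   --   --   EX   MEM  WB
                          ↑ stall 3 cycles (waiting for x1 ready)
I3 (sub):            IF   --   --   --   ID   --   --   --   EX   MEM  ...
I4 (and):                                IF   --   --   --   ID   --   ...
Calculation:
- Ideal case (no stalls): 5 cycles for the first instruction + 1 cycle for each of the other 3 = 8 cycles
- Actual case: Multiple stalls due to Hazards
- Real CPI > 1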
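You can check your diagram with a small model. The sketch below is a simplification with invented names: it tracks only the cycle in which each instruction reads its registers in ID, and assumes WB follows three cycles later. The `same_cycle_read` flag switches between the assumption used in this exercise (read in the cycle after WB) and a register file that can be written and read in the same cycle.

```python
def cycles_without_forwarding(program, same_cycle_read=False):
    """Total cycles for an in-order five-stage pipeline with no forwarding.

    program: list of (rd, sources). read_cycle[i] is the cycle in which
    instruction i reads its registers in ID; WB follows three cycles later.
    """
    if not program:
        return 0
    gap = 0 if same_cycle_read else 1
    read_cycle, writeback = [], {}
    for i, (rd, srcs) in enumerate(program):
        earliest = 2 if i == 0 else read_cycle[-1] + 1
        for src in srcs:
            if src in writeback:
                earliest = max(earliest, writeback[src] + gap)
        read_cycle.append(earliest)
        if rd != 0:
            writeback[rd] = earliest + 3
    return read_cycle[-1] + 3

program = [
    (1, (2,)),    # lw  x1, 0(x2)
    (3, (1, 4)),  # add x3, x1, x4
    (5, (3, 6)),  # sub x5, x3, x6
    (7, (5, 8)),  # and x7, x5, x8
]
total = cycles_without_forwarding(program)
print(total, total / len(program))
```

Dividing total cycles by the instruction count gives a rough CPI for this short sequence; the pipeline fill time is included, so the figure is pessimistic compared with a long-running program.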
Exercise 2: Pipeline With Forwarding
With Forwarding, results from EX or MEM stages can be “forwarded” directly to the next instruction.
Think About:
- When is lw’s result earliest available? (Hint: end of MEM stage)
- When does add need x1’s value? (Hint: start of EX stage)
- Even with Forwarding, does lw followed by add still need a stall?
Answer:
- lw’s result is available at end of MEM stage (memory read completes)
- add needs x1’s value at start of EX stage
- Yes, 1 cycle stall needed! lw has its result at the end of MEM, but add needs it at the start of EX. This is called a Load-Use Hazard
Cycle:     1    2    3    4    5    6    7    8    9
I1 (lw):   IF   ID   EX   MEM  WB
I2 (add):       IF   ID   --   EX   MEM  WB
                          ↑ 1 cycle stall (Load-Use Hazard)
I3 (sub):            IF   --   ID   EX   MEM  WB
                                    ↑ forwarding from I2.EX → I3.EX
I4 (and):                      IF   ID   EX   MEM  WB
                                         ↑ forwarding from I3.EX → I4.EX
Extended Exercise: Code Scheduling
Compilers can reduce stalls by reordering instructions. Try reordering this code:
# Original code (has Load-Use Hazard)
lw x1, 0(x2)
add x3, x1, x4 # depends on x1, must stall
lw x5, 4(x2)
add x6, x5, x7 # depends on x5, must stall
# Optimized code (interleave independent instructions)
lw x1, 0(x2)
lw x5, 4(x2) # independent, can execute during x1's MEM
add x3, x1, x4 # x1 now ready (forwarded from MEM)
add x6, x5, x7 # x5 now ready
danieRTOS Reference: Understanding pipeline behavior helps optimize context switch code, where minimizing stalls in the critical path improves task switching latency.
⚠️ Common Pitfalls
Pitfall 1: Confusing Throughput and Latency
Misconception: “5-stage pipeline means each instruction takes 5 cycles?”
Correct Understanding:
- Latency: A single instruction from IF to WB indeed takes 5 cycles
- Throughput: In steady state, 1 instruction completes per cycle
- More stages increase clock frequency but also increase Hazard penalty
Pitfall 2: Thinking Forwarding Solves All Data Hazards
Misconception: “With Forwarding, we never need to stall!”
Correct Understanding:
- Load-Use Hazard cannot be fully solved by Forwarding
- Load result is produced in MEM stage, but next instruction needs it in EX stage
- Must insert at least 1 cycle stall (bubble)
lw x1, 0(x2) # Result available at end of cycle 4 (MEM)
add x3, x1, x4 # Would need the value at start of cycle 4 (EX), before the load finishes
# Must stall 1 cycle
Pitfall 3: Ignoring Control Hazard Cost
Misconception: “A branch instruction is just a jump.”
Correct Understanding:
- Branch outcome is determined in EX stage (comparison operation)
- IF and ID stages have already fetched “wrong” subsequent instructions
- If branch taken, these instructions must be flushed
- This is called Branch Penalty
beq x1, x2, target # cycle 1: IF
add x3, x4, x5 # cycle 2: IF (guessed not taken)
sub x6, x7, x8 # cycle 3: IF
# cycle 3: discover branch taken!
# add, sub must be flushed, wasting 2 cycles
Solution: Branch Prediction
- Static Prediction: Guess backward branch taken, forward not taken
- Dynamic Prediction: Predict based on history (Branch History Table)
Summary
In this chapter, we explored the fundamentals of RISC-V pipelining:
- Five-stage pipeline: IF, ID, EX, MEM, WB — the classic RISC pipeline structure.
- Hazards: Structural, data (RAW, WAR, WAW), and control hazards that disrupt pipeline flow.
- Hazard resolution: Forwarding, stalls, and compiler scheduling to minimize performance loss.
- Branch handling: Static and dynamic prediction, BTB, and RISC-V’s elimination of delay slots.
- Trap handling: Precise exceptions, pipeline flushing, and performance costs.
- Implementations: Single-issue vs multi-issue, scalar vs superscalar, and real RISC-V cores.
RISC-V’s clean, regular ISA makes it ideal for efficient pipeline implementation. The absence of delay slots and complex addressing modes simplifies pipeline control compared to older architectures like MIPS.
In the next chapter, we’ll explore out-of-order execution, where processors dynamically reorder instructions to extract even more parallelism and performance.
Chapter 8. Microarchitecture Variations
Part V — Pipeline & Microarchitecture
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Distinguish In-Order from Out-of-Order: Understand the performance differences and use cases of both execution models
- Understand Register Renaming: Know how Physical Registers eliminate False Dependencies (WAW/WAR)
- Master ROB Mechanism: Understand how the Reorder Buffer guarantees “out-of-order execution, in-order commit”
- Recognize Speculative Execution: Understand the principles and security risks (Spectre/Meltdown)
- Use Performance Counters: Calculate CPI using mcycle and minstret
💡 Scenario: The Michelin Restaurant Kitchen Philosophy
Scene: Junior is comparing benchmark results of two RISC-V cores and finds that despite similar frequencies, performance differs by a factor of two.
Junior: “Professor, I looked at specs for two RISC-V cores. SiFive U74 and Alibaba C910 both run at about 1.5 GHz, but the CoreMark scores differ by almost 2x! How is this possible?”
Professor: “Have you ever been to a Michelin-star restaurant?”
Junior: “Sure, but what does that have to do with CPUs?”
Professor: “Imagine two restaurants.
The first one (In-Order): The chef strictly follows the order of tickets. If the first dish needs 10 minutes for ingredients to thaw, all other dishes must wait—even if the second dish’s ingredients are already ready.
The second one (Out-of-Order): The chef looks at which dish has ingredients ready first and starts with that. While waiting for ingredients to thaw, the chef has already finished three other dishes.”
Junior: “So Out-of-Order means the CPU doesn’t wait idly?”
Professor: “Exactly. But this requires a smart ‘restaurant manager’ to coordinate:
- Reservation Station: Tracks what ingredients each dish needs; starts cooking when ingredients arrive.
- Reorder Buffer: Even though cooking order is scrambled, dishes must still be served in the order customers placed them—otherwise chaos ensues.
- Register Renaming: If two dishes both need ‘eggs,’ but they’re actually different eggs, label them differently to avoid confusion.”
Junior: “Sounds complex. What’s the cost?”
Professor: “Transistor count explodes, and power consumption goes up with it. That’s why phone ‘big cores’ are power-hungry while ‘little cores’ are efficient—big cores are usually Out-of-Order, little cores are In-Order.”
Junior: “Let’s measure the actual performance difference!”
In Chapter 7, we explored the classic five-stage in-order pipeline. But modern processors go far beyond this simple model. Out-of-order (OOO) execution allows processors to dynamically reorder instructions to extract more parallelism, dramatically improving performance. While in-order processors execute instructions in program order and stall on dependencies, out-of-order processors can execute independent instructions while waiting for slow operations to complete.
This chapter explores the microarchitectural techniques that enable high-performance RISC-V processors: register renaming to eliminate false dependencies, reorder buffers to maintain precise exceptions, speculative execution to execute beyond branches, and advanced branch prediction to minimize misprediction penalties. We’ll examine the cache hierarchy that hides memory latency, and cache coherence protocols that maintain consistency across multiple cores. Understanding these techniques is essential for anyone designing high-performance RISC-V systems or optimizing code for modern processors.
8.1 Out-of-Order Execution Basics
In-Order vs Out-of-Order
In-order processors execute instructions in the exact order they appear in the program. If an instruction stalls (e.g., waiting for a cache miss), all subsequent instructions must wait, even if they’re independent and could execute.
Out-of-order (OOO) processors can execute instructions in a different order than the program specifies, as long as the final result is the same. This allows the processor to work around stalls and dependencies, keeping execution units busy.
Example:
lw x1, 0(x2) # I1: load (cache miss, 100 cycles)
add x3, x4, x5 # I2: independent of I1
sub x6, x7, x8 # I3: independent of I1
add x9, x1, x10 # I4: depends on I1
In-order execution: I2 and I3 must wait for I1 to complete (100 cycles), even though they’re independent.
Out-of-order execution: I2 and I3 can execute immediately while I1 is waiting for the cache miss. Only I4 must wait for I1.
This simple reordering can dramatically improve performance, especially when memory latency is high.
Figure 8.1: In-Order vs Out-of-Order Execution
In-order:
Cycle:           1    2    3    ...              103  104  105  106  107
lw x1 (miss):    IF   ID   EX   MEM (100 cycles)      WB
add x3,x4,x5:         IF   ID   ------ stall ------   EX   MEM  WB
sub x6,x7,x8:              IF   ------ stall ------   ID   EX   MEM  WB
Out-of-order:
Cycle:           1    2    3    4    5    6    7    ...  103  104  105  106
lw x1 (miss):    IF   ID   EX   MEM (100 cycles) ------------ WB
add x3,x4,x5:         IF   ID   EX   MEM  WB
sub x6,x7,x8:              IF   ID   EX   MEM  WB
add x9,x1,x10:                  IF   ID   ---- wait ----      EX   MEM  WB
Out-of-order execution allows independent instructions to proceed while the load is waiting, dramatically reducing wasted cycles.
Dynamic Scheduling
Dynamic scheduling is the hardware mechanism that enables out-of-order execution. The processor analyzes dependencies at runtime and schedules instructions to execution units when their operands are ready.
Two classic algorithms for dynamic scheduling:
Scoreboarding (CDC 6600, 1964): A centralized control unit tracks which registers are being written and which instructions are waiting for them. When all operands are ready, the instruction is issued to an execution unit.
Tomasulo’s algorithm (IBM 360/91, 1967): Uses reservation stations to buffer instructions waiting for operands. When an operand is produced, it’s broadcast to all waiting instructions. This eliminates the need for a centralized scoreboard and enables register renaming (Section 8.2).
Modern OOO processors use variations of Tomasulo’s algorithm with additional structures like the Reorder Buffer (ROB) to ensure precise exceptions.
8.2 Register Renaming
The Problem: False Dependencies
Consider this code:
add x1, x2, x3 # I1: x1 = x2 + x3
sub x4, x1, x5 # I2: x4 = x1 - x5 (RAW dependency on x1)
add x1, x6, x7 # I3: x1 = x6 + x7 (WAW dependency on x1)
mul x8, x1, x9 # I4: x8 = x1 * x9 (RAW dependency on x1 from I3)
I2 has a true dependency (RAW) on I1 — it must wait for I1 to produce x1.
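But I3 has a false dependency (WAW) on I1: both write x1, but I3’s write doesn’t actually depend on I1’s value. I3 also has a false dependency (WAR) on I2, because I2 must read the old x1 before I3 overwrites it. Similarly, I4 depends on I3’s x1, not I1’s.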
False dependencies (WAR and WAW) limit parallelism because the processor must serialize instructions that could otherwise execute in parallel.
Physical vs Architectural Registers
Register renaming eliminates false dependencies by mapping architectural registers (the 32 registers visible to the programmer) to a larger set of physical registers (hidden from the programmer).
RISC-V has 32 architectural registers (x0-x31). A high-performance OOO processor might have 128 or 256 physical registers.
Register Alias Table (RAT): Maps each architectural register to a physical register. When an instruction writes an architectural register, it’s allocated a new physical register.
Example with renaming:
add P10, P2, P3 # I1: x1 -> P10
sub P11, P10, P5 # I2: x4 -> P11, reads P10 (I1's result)
add P12, P6, P7 # I3: x1 -> P12 (new physical register!)
mul P13, P12, P9 # I4: x8 -> P13, reads P12 (I3's result)
Now I3 and I4 use P12 for x1, while I1 and I2 use P10. The WAW dependency is eliminated — I3 can execute as soon as P6 and P7 are ready, without waiting for I1.
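The renaming step above can be sketched in C. This is a toy model under stated assumptions, not how any particular core implements it: the names `RenameState`, `rename_init`, and `rename_op` are hypothetical, only x0-x9 are modeled so that the first free physical register is P10 as in the example, and physical registers are handed out by a simple counter instead of a real free list.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ARCH_REGS 10   // assumption: only x0-x9, to match the example

// Hypothetical rename state: the RAT plus a trivial bump allocator.
typedef struct {
    uint16_t rat[TOY_ARCH_REGS]; // architectural -> physical mapping
    uint16_t next_free;          // next unallocated physical register
} RenameState;

typedef struct {
    uint16_t dst, src1, src2;    // physical register numbers
} RenamedOp;

// Start with xN mapped to PN, so P10 is the first free register.
static void rename_init(RenameState *s) {
    for (int i = 0; i < TOY_ARCH_REGS; i++)
        s->rat[i] = (uint16_t)i;
    s->next_free = TOY_ARCH_REGS;
}

// Rename one instruction of the form "op rd, rs1, rs2".
static RenamedOp rename_op(RenameState *s, int rd, int rs1, int rs2) {
    assert(rd > 0 && rd < TOY_ARCH_REGS);    // x0 is never renamed
    assert(rs1 >= 0 && rs1 < TOY_ARCH_REGS);
    assert(rs2 >= 0 && rs2 < TOY_ARCH_REGS);
    RenamedOp out;
    out.src1 = s->rat[rs1];      // read sources before updating the RAT
    out.src2 = s->rat[rs2];
    out.dst = s->next_free++;    // every write gets a fresh register
    s->rat[rd] = out.dst;
    return out;
}
```

Calling `rename_op` for I1 through I4 in program order yields the P10-P13 assignments shown above. The sources are read before the destination is remapped, which is what lets an instruction such as `add x1, x1, x2` read the old x1 and write a new one.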
Figure 8.2: Register Renaming Example
graph TB
subgraph "Architectural Registers (Programmer View)"
x1[x1]
x2[x2]
x3[x3]
x4[x4]
end
subgraph "Physical Registers (Hardware)"
P2[P2]
P3[P3]
P5[P5]
P6[P6]
P7[P7]
P9[P9]
P10[P10]
P11[P11]
P12[P12]
P13[P13]
end
subgraph "Register Alias Table (RAT)"
RAT["x1 → P12<br/>x2 → P2<br/>x3 → P3<br/>x4 → P11"]
end
x1 -.->|mapped to| P12
x2 -.->|mapped to| P2
x3 -.->|mapped to| P3
x4 -.->|mapped to| P11
style P10 fill:#ffcccc
style P12 fill:#ccffcc
In this example, x1 was previously mapped to P10 (shown in red, now free), and is currently mapped to P12 (shown in green).
Free List Management
Physical registers must be recycled when they’re no longer needed. A free list tracks which physical registers are available for allocation.
When an instruction commits (Section 8.3), its old physical register mapping can be freed:
I3 commits: x1 was mapped to P10, now mapped to P12
-> P10 can be freed (added to free list)
Register renaming eliminates WAR and WAW hazards, leaving only true RAW dependencies. This dramatically increases instruction-level parallelism.
8.3 Reorder Buffer (ROB) and Issue Queue
Reorder Buffer (ROB)
The ROB ensures that instructions commit in program order, even though they execute out-of-order. This is essential for precise exceptions (Chapter 7).
The ROB is a circular buffer that holds all in-flight instructions in program order. Each entry contains:
- Instruction PC
- Destination register (architectural and physical)
- Result value (when execution completes)
- Exception status
- Ready bit
Instruction flow through the ROB:
- Dispatch: Instruction is allocated a ROB entry and placed in the issue queue.
- Execute: Instruction executes out-of-order when operands are ready.
- Complete: Result is written to the ROB entry and broadcast to waiting instructions.
- Commit: When the instruction reaches the head of the ROB and is complete, it commits (updates architectural state).
Commit is in-order: Instructions commit from the head of the ROB one at a time (or in small groups). This ensures that if an exception occurs, all earlier instructions have committed and all later instructions can be discarded.
Figure 8.3: Reorder Buffer (ROB) Structure
graph TB
subgraph "Reorder Buffer (Circular Queue)"
direction TB
ROB1["ROB Entry 1<br/>PC: 0x1000<br/>Dest: x1→P10<br/>Value: 42<br/>Ready: ✓"]
ROB2["ROB Entry 2<br/>PC: 0x1004<br/>Dest: x2→P11<br/>Value: -<br/>Ready: ✗"]
ROB3["ROB Entry 3<br/>PC: 0x1008<br/>Dest: x3→P12<br/>Value: 100<br/>Ready: ✓"]
ROB4["ROB Entry 4<br/>PC: 0x100C<br/>Dest: x4→P13<br/>Value: -<br/>Ready: ✗"]
DOTS["..."]
end
HEAD[Head Pointer<br/>Commit from here] --> ROB1
TAIL[Tail Pointer<br/>Dispatch to here] --> DOTS
ROB1 -->|Commit| COMMIT[Update<br/>Architectural<br/>State]
style ROB1 fill:#ccffcc
style ROB3 fill:#ccffcc
style ROB2 fill:#ffcccc
style ROB4 fill:#ffcccc
Green entries are ready to commit (when they reach the head). Red entries are still executing.
ROB Example Code:
// ROB entry structure
struct ROB_Entry {
uint64_t PC; // Instruction address
uint8_t arch_reg; // Architectural register (x0-x31)
uint8_t phys_reg; // Physical register (P0-P127)
uint64_t value; // Result value
bool ready; // Execution complete?
bool exception; // Exception occurred?
uint8_t exception_code; // Exception type
};
// Commit logic (simplified)
void commit_instruction() {
ROB_Entry *entry = &ROB[head];
if (!entry->ready) {
return; // Can't commit yet
}
if (entry->exception) {
// Handle exception: flush pipeline, jump to handler
flush_pipeline();
PC = trap_handler_address;
return;
}
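// Update architectural state
if (entry->arch_reg != 0) { // x0 is always zero
// Free the previous mapping first (assuming the helper looks it up in
// the RAT); once the RAT entry is overwritten, the old physical
// register number can no longer be found
free_old_physical_register(entry->arch_reg);
RAT[entry->arch_reg] = entry->phys_reg;
}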
// Advance head pointer
head = (head + 1) % ROB_SIZE;
}
Issue Queue and Reservation Stations
The issue queue (or reservation stations) holds instructions waiting for operands. When an instruction is dispatched, it’s placed in the issue queue. When all its operands are ready, it’s issued to an execution unit.
Wakeup and select:
- Wakeup: When a result is produced, it’s broadcast to all issue queue entries. Entries waiting for that result mark the operand as ready.
- Select: Among all ready instructions, the scheduler selects which ones to issue to execution units (based on priority, age, or other policies).
This is the heart of dynamic scheduling. The issue queue decouples instruction dispatch from execution, allowing the processor to find parallelism dynamically.
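A minimal sketch of wakeup and select may help make the two steps concrete. It assumes an oldest-first select policy, two source operands per instruction, and one issue per call; `IQEntry`, `iq_wakeup`, and `iq_select` are illustrative names, not taken from any real core.

```c
#include <stdbool.h>
#include <stdint.h>

#define IQ_SIZE 8

typedef struct {
    bool     valid;         // entry holds a waiting instruction
    uint32_t age;           // smaller = older in program order
    uint16_t src[2];        // physical source registers
    bool     src_ready[2];  // has each operand been produced?
} IQEntry;

// Wakeup: broadcast a produced physical register tag to every entry.
static void iq_wakeup(IQEntry iq[IQ_SIZE], uint16_t tag) {
    for (int i = 0; i < IQ_SIZE; i++) {
        if (!iq[i].valid)
            continue;
        for (int s = 0; s < 2; s++) {
            if (iq[i].src[s] == tag)
                iq[i].src_ready[s] = true;
        }
    }
}

// Select: pick the oldest entry whose operands are all ready.
// Returns the index issued, or -1 if nothing is ready.
static int iq_select(IQEntry iq[IQ_SIZE]) {
    int pick = -1;
    for (int i = 0; i < IQ_SIZE; i++) {
        if (!iq[i].valid || !iq[i].src_ready[0] || !iq[i].src_ready[1])
            continue;
        if (pick < 0 || iq[i].age < iq[pick].age)
            pick = i;
    }
    if (pick >= 0)
        iq[pick].valid = false;  // the entry leaves the queue on issue
    return pick;
}
```

In hardware both loops are parallel comparators rather than sequential scans, which is one reason large issue queues are expensive in area and power.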
8.4 Load/Store Queue
Memory operations are particularly challenging in OOO processors because they must respect memory ordering (Chapter 6) while still allowing reordering for performance.
Load Queue and Store Queue
The load queue holds all in-flight loads. The store queue holds all in-flight stores. These queues track memory addresses and data, and enforce ordering constraints.
Store-to-load forwarding: If a load reads from the same address as an earlier store, the load can get the data directly from the store queue without waiting for the store to commit to memory.
sw x1, 0(x2) # Store x1 to address in x2
lw x3, 0(x2) # Load from same address
The load can forward the data from the store queue, avoiding a memory access.
Figure 8.4: Load/Store Queue Structure
graph TB
subgraph "Store Queue"
SQ1["Store 1<br/>Addr: 0x2000<br/>Data: 42<br/>Committed: ✗"]
SQ2["Store 2<br/>Addr: 0x2008<br/>Data: 100<br/>Committed: ✗"]
SQ3["Store 3<br/>Addr: ?<br/>Data: 55<br/>Committed: ✗"]
end
subgraph "Load Queue"
LQ1["Load 1<br/>Addr: 0x2000<br/>Data: ?"]
LQ2["Load 2<br/>Addr: 0x2010<br/>Data: ?"]
end
LQ1 -.->|Address Match<br/>Forward Data| SQ1
LQ2 -.->|No Match<br/>Go to Cache| CACHE[Data Cache]
SQ3 -.->|Address Unknown<br/>Must Wait| WAIT[Stall Load]
style SQ1 fill:#ccffcc
style SQ3 fill:#ffcccc
Store-to-load forwarding allows loads to get data from earlier stores without waiting for them to commit to memory.
Memory Disambiguation
Memory disambiguation is the problem of determining whether two memory operations access the same address. This is difficult because addresses are often computed dynamically.
sw x1, 0(x2) # Store to address A
lw x3, 0(x4) # Load from address B — is B == A?
If x2 and x4 contain the same value, the load depends on the store. But the processor doesn’t know this until the addresses are computed.
Conservative approach: Assume all loads depend on all earlier stores. This is safe but limits parallelism.
Speculative approach: Assume loads don’t depend on earlier stores and execute them speculatively. If a dependency is later detected (address match), squash the load and re-execute.
Modern processors use memory dependence prediction to guess which loads depend on which stores, improving speculation accuracy.
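The conservative policy combined with store-to-load forwarding can be sketched as a search of the store queue. This is a simplified model: it assumes all accesses have the same size and alignment (so an address match means a full overlap), and the names `StoreEntry`, `LoadAction`, and `load_lookup` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { LOAD_FORWARD, LOAD_FROM_CACHE, LOAD_MUST_WAIT } LoadAction;

typedef struct {
    bool     addr_known;  // has the store's address been computed yet?
    uint64_t addr;
    uint64_t data;
} StoreEntry;

// sq[0..n-1] holds the stores OLDER than the load, oldest first.
static LoadAction load_lookup(const StoreEntry *sq, int n,
                              uint64_t load_addr, uint64_t *data_out) {
    assert(data_out != NULL);
    // Scan from youngest to oldest so the most recent matching store wins.
    for (int i = n - 1; i >= 0; i--) {
        if (!sq[i].addr_known)
            return LOAD_MUST_WAIT;   // conservative: it might alias
        if (sq[i].addr == load_addr) {
            *data_out = sq[i].data;  // store-to-load forwarding
            return LOAD_FORWARD;
        }
    }
    return LOAD_FROM_CACHE;          // no older store to this address
}
```

A speculative design would replace the `LOAD_MUST_WAIT` case with "go to the cache anyway" and check for a violation once the store address resolves.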
8.5 Advanced Branch Prediction
Branch prediction is even more critical in OOO processors because mispredictions waste more work (all the speculatively executed instructions must be discarded).
Two-Level Adaptive Predictors
Two-level predictors record branch history (per-branch local history, a shared global history, or both) in a first level and use it to select a counter in a second-level table. They can capture complex patterns like:
if (a > 0) { // Branch B1
if (b > 0) { // Branch B2
...
}
}
If B1 is taken, B2 is more likely to be taken. A global history predictor can learn this correlation.
Structure: A global history register (GHR) tracks the last N branch outcomes (taken/not-taken). This is used to index into a pattern history table (PHT) that contains 2-bit counters.
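The GHR and PHT structure can be sketched as follows. This is a deliberately small model in which the PHT is indexed by the global history alone, with 4 history bits; real predictors usually also hash in the branch PC. The names `TwoLevel`, `tl_predict`, and `tl_update` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 4
#define PHT_SIZE  (1u << HIST_BITS)

typedef struct {
    uint8_t ghr;            // last HIST_BITS outcomes, 1 = taken
    uint8_t pht[PHT_SIZE];  // 2-bit saturating counters (0..3)
} TwoLevel;

static void tl_init(TwoLevel *p) {
    p->ghr = 0;
    for (unsigned i = 0; i < PHT_SIZE; i++)
        p->pht[i] = 1;      // start at "weakly not-taken"
}

// Predict taken when the selected counter is in its upper half.
static bool tl_predict(const TwoLevel *p) {
    return p->pht[p->ghr] >= 2;
}

// Train the selected counter, then shift the outcome into the history.
static void tl_update(TwoLevel *p, bool taken) {
    uint8_t *c = &p->pht[p->ghr];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    p->ghr = (uint8_t)(((p->ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1));
}
```

A strictly alternating branch (taken, not-taken, taken, ...) defeats a single 2-bit counter, but this predictor learns it: the two history patterns select different counters, and each counter saturates in its own direction.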
TAGE (Tagged Geometric History Length)
TAGE is a state-of-the-art branch predictor used in modern high-performance processors. It uses multiple predictor tables with different history lengths (e.g., 4, 8, 16, 32, 64 branches).
Each table is indexed by a hash of the PC and history. The predictor uses the longest matching history to make a prediction, falling back to shorter histories if there’s no match.
TAGE achieves very high accuracy (98-99%) on most workloads.
Figure 8.5: TAGE Predictor Structure
graph TB
PC[Program Counter] --> HASH1[Hash Function 1]
PC --> HASH2[Hash Function 2]
PC --> HASH3[Hash Function 3]
PC --> BASE[Base Predictor]
GHR[Global History<br/>Register] --> HASH1
GHR --> HASH2
GHR --> HASH3
HASH1 --> T1[Table 1<br/>History: 4]
HASH2 --> T2[Table 2<br/>History: 16]
HASH3 --> T3[Table 3<br/>History: 64]
T1 --> SELECT[Selector<br/>Choose Longest<br/>Matching History]
T2 --> SELECT
T3 --> SELECT
BASE --> SELECT
SELECT --> PRED[Prediction<br/>Taken/Not-Taken]
style T3 fill:#ccffcc
style SELECT fill:#ffffcc
TAGE uses multiple tables with different history lengths. The longest matching history provides the prediction.
Return Address Stack (RAS)
Function returns are predicted using a hardware stack (mentioned in Chapter 7). When a function call is detected (jal or jalr whose destination is a link register, x1 or x5 under the ISA’s return-address-stack hints), the return address is pushed onto the RAS. When a return is detected (typically jalr x0, 0(x1)), the top of the RAS is popped and used as the prediction.
The RAS is very accurate (>99%) because function calls and returns are well-structured.
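A return address stack is small enough to sketch in a few lines. The version below is an assumption-laden toy: it uses a fixed depth of 8, wraps around on overflow (overwriting the oldest entry), and does nothing special on underflow. `RAS`, `ras_push`, and `ras_pop` are hypothetical names.

```c
#include <stdint.h>

#define RAS_DEPTH 8   // assumed depth; must be a power of two here

typedef struct {
    uint64_t entry[RAS_DEPTH];
    unsigned top;     // number of pushes minus pops, used modulo RAS_DEPTH
} RAS;

// On a call: remember where the matching return should land.
static void ras_push(RAS *r, uint64_t return_addr) {
    r->entry[r->top % RAS_DEPTH] = return_addr;
    r->top++;
}

// On a return: the most recently pushed address is the prediction.
static uint64_t ras_pop(RAS *r) {
    r->top--;
    return r->entry[r->top % RAS_DEPTH];
}
```

Deep recursion overflows the stack and loses the oldest return addresses, so the outermost returns mispredict. That is acceptable in practice because a misprediction only costs time and never affects correctness.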
Indirect Branch Prediction
Indirect branches (jalr with a computed target) are harder to predict than direct branches. The target can vary widely depending on the value in the register.
Indirect branch target buffer (iBTB): A cache indexed by the branch PC that stores recently seen targets. For virtual function calls or switch statements, the iBTB can achieve good accuracy.
Advanced techniques: Some processors use the call path (sequence of recent function calls) to predict indirect branch targets, improving accuracy for polymorphic code.
8.6 Cache Hierarchy
Modern processors have multiple levels of cache to hide memory latency.
L1 Instruction and Data Caches
L1 caches are small (32-64 KB), fast (1-2 cycle latency), and split into separate instruction (I-cache) and data (D-cache) caches.
Split I/D caches allow simultaneous instruction fetch and data access, avoiding structural hazards. They’re also optimized differently: I-caches are read-only and can use simpler replacement policies.
Virtually indexed, physically tagged (VIPT): L1 caches often use virtual addresses for indexing (to avoid TLB lookup latency) but physical addresses for tags (to avoid aliasing issues).
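One consequence of VIPT worth checking with arithmetic: if the bytes covered by one way do not exceed the page size, all index bits come from the page offset, which is identical in the virtual and physical address, so the same physical line can never land in two different sets. The helper below is a sketch of that check; the function name and the example sizes are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// True when the set index lies entirely within the page offset,
// i.e. (cache size / associativity) <= page size.
static bool vipt_alias_free(uint32_t cache_bytes, uint32_t ways,
                            uint32_t page_bytes) {
    assert(ways > 0);
    return cache_bytes / ways <= page_bytes;
}
```

With 4 KB pages, a 32 KB 8-way cache has 4 KB per way and passes the check, while a 32 KB 4-way cache (8 KB per way) does not and would need extra hardware or OS support to handle aliases. This is one reason L1 caches tend to grow in associativity rather than in sets.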
L2 Unified Cache
L2 cache is larger (256 KB - 1 MB), slower (10-20 cycles), and unified (holds both instructions and data).
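How L2 relates to L1 depends on the inclusion policy. In an inclusive hierarchy, L2 holds a copy of everything in L1, so a clean line evicted from L1 can simply be dropped. In an exclusive hierarchy, L2 acts as a victim cache for L1: a line lives in only one of the two levels, and a line evicted from L1 is placed in L2.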
L3 Shared Cache
L3 cache (if present) is even larger (4-32 MB), slower (30-50 cycles), and shared among all cores in a multi-core processor.
L3 reduces traffic to main memory and provides a large shared working set for all cores.
Figure 8.6: Cache Hierarchy
graph TB
subgraph "Core 0"
CPU0[CPU Core 0]
L1I0[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
L1D0[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
end
subgraph "Core 1"
CPU1[CPU Core 1]
L1I1[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
L1D1[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
end
CPU0 --> L1I0
CPU0 --> L1D0
CPU1 --> L1I1
CPU1 --> L1D1
L1I0 --> L2_0[L2 Unified<br/>256 KB<br/>10-20 cycles]
L1D0 --> L2_0
L1I1 --> L2_1[L2 Unified<br/>256 KB<br/>10-20 cycles]
L1D1 --> L2_1
L2_0 --> L3[L3 Shared<br/>8 MB<br/>30-50 cycles]
L2_1 --> L3
L3 --> MEM[Main Memory<br/>DDR4/DDR5<br/>100-300 cycles]
style L1I0 fill:#e1f5ff
style L1D0 fill:#e1f5ff
style L1I1 fill:#e1f5ff
style L1D1 fill:#e1f5ff
style L2_0 fill:#fff4e1
style L2_1 fill:#fff4e1
style L3 fill:#ffe1e1
style MEM fill:#f0e1ff
Each level is larger and slower. L1 is private per core, L2 may be private or shared, L3 is shared among all cores.
Cache Replacement Policies
LRU (Least Recently Used): Evict the cache line that hasn’t been accessed for the longest time. This is effective but expensive to implement for high associativity.
PLRU (Pseudo-LRU): An approximation of LRU that’s cheaper to implement. Uses a tree of bits to track approximate recency.
Random: Evict a random cache line. Surprisingly effective and very simple.
Modern caches often use PLRU or adaptive policies that combine multiple strategies.
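To illustrate the "tree of bits", here is a sketch of tree-PLRU for a single 4-way set. It uses three bits, where each bit points toward the half that should be evicted next. The bit convention and the names `Plru4`, `plru_touch`, and `plru_victim` are choices made for this example; real designs vary.

```c
#include <assert.h>
#include <stdint.h>

// One 4-way set: b0 picks a pair (0 = ways 0/1, 1 = ways 2/3),
// b1 picks within ways 0/1, b2 picks within ways 2/3.
typedef struct {
    uint8_t b0, b1, b2;
} Plru4;

// On an access, make every bit on the path point AWAY from `way`.
static void plru_touch(Plru4 *p, int way) {
    assert(way >= 0 && way < 4);
    if (way < 2) {
        p->b0 = 1;                    // next victim is in the other pair
        p->b1 = (way == 0) ? 1 : 0;   // and, within this pair, the other way
    } else {
        p->b0 = 0;
        p->b2 = (way == 2) ? 1 : 0;
    }
}

// Follow the bits from the root to find the victim way.
static int plru_victim(const Plru4 *p) {
    if (p->b0 == 0)
        return (p->b1 == 0) ? 0 : 1;
    return (p->b2 == 0) ? 2 : 3;
}
```

After touching ways 0, 1, 2, 3 in order, the victim is way 0, the same as true LRU. If way 0 is then touched again, true LRU would evict way 1, but this tree evicts way 2: the approximation costs 3 bits per set instead of tracking a full ordering of the four ways.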
8.7 Cache Coherence
In multi-core systems, each core has its own L1 cache. Cache coherence ensures that all cores see a consistent view of memory.
MESI Protocol
MESI is the most common coherence protocol. Each cache line is in one of four states:
- M (Modified): This cache has the only valid copy, and it’s been modified (dirty).
- E (Exclusive): This cache has the only valid copy, and it’s clean (matches memory).
- S (Shared): Multiple caches have valid copies, all clean.
- I (Invalid): This cache line is not valid.
State transitions:
- Read miss: If another cache has the line in M, it writes back to memory and transitions to S. The requesting cache loads the line in S.
- Write: If the line is in S, all other caches invalidate their copies. The writing cache transitions to M.
Figure 8.7: MESI Protocol State Diagram
stateDiagram-v2
[*] --> I
I: Invalid<br/>(No valid copy)
E: Exclusive<br/>(Only copy, clean)
S: Shared<br/>(Multiple copies, clean)
M: Modified<br/>(Only copy, dirty)
I --> E: Read Miss<br/>(No other copy)
I --> S: Read Miss<br/>(Other copies exist)
I --> M: Write Miss
E --> M: Write
E --> S: Other Read
E --> I: Evict
S --> M: Write<br/>(Invalidate others)
S --> I: Evict or<br/>Other Write
M --> S: Other Read<br/>(Write back)
M --> I: Evict<br/>(Write back)
MESI Example:
// Core 0 and Core 1 both access the same cache line
// Initial state: Both caches Invalid (I)
// Core 0: Read from address 0x1000
// Core 0 cache: I → E (exclusive, no other copy)
// Core 1: Read from address 0x1000
// Core 0 cache: E → S (shared)
// Core 1 cache: I → S (shared)
// Core 0: Write to address 0x1000
// Core 0 cache: S → M (modified, dirty)
// Core 1 cache: S → I (invalidated)
// Core 1: Read from address 0x1000
// Core 0 cache: M → S (write back to memory)
// Core 1 cache: I → S (load from memory)
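The walkthrough above can be expressed as a per-cache next-state function. This sketch covers state changes only (no data movement or bus messages), and it assumes the standard MESI behavior that a write by another core invalidates a line in any state, including E and M. The enum and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MesiState;

typedef enum {
    EV_LOCAL_READ,    // this core reads the line
    EV_LOCAL_WRITE,   // this core writes the line
    EV_REMOTE_READ,   // another core reads the line
    EV_REMOTE_WRITE,  // another core writes the line
    EV_EVICT          // this cache evicts the line
} MesiEvent;

// others_have_copy is only consulted for a local read miss (state I).
static MesiState mesi_next(MesiState s, MesiEvent e, bool others_have_copy) {
    switch (e) {
    case EV_LOCAL_READ:
        if (s == MESI_I)
            return others_have_copy ? MESI_S : MESI_E;
        return s;                    // read hit: no state change
    case EV_LOCAL_WRITE:
        return MESI_M;               // other copies are invalidated first
    case EV_REMOTE_READ:
        if (s == MESI_M || s == MESI_E)
            return MESI_S;           // M also writes the dirty data back
        return s;
    case EV_REMOTE_WRITE:
    case EV_EVICT:
        return MESI_I;               // M writes back before giving up the line
    }
    return s;
}
```

Feeding the four steps of the example through `mesi_next` for each core reproduces the I → E, E → S, S → M, and M → S transitions listed above.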
MOESI Protocol
MOESI adds an O (Owned) state: the cache has a dirty copy, but other caches may have shared copies. This reduces write-backs to memory.
Snooping vs Directory-Based Coherence
Snooping: All caches monitor (snoop) a shared bus for memory transactions. When a cache sees a transaction that affects its data, it responds appropriately (invalidate, write-back, etc.).
Snooping is simple but doesn’t scale well beyond ~8-16 cores because bus bandwidth becomes a bottleneck.
Directory-based: A directory (centralized, or distributed alongside the shared cache) tracks which caches have copies of each cache line. Coherence messages are sent point-to-point rather than broadcast.
Directory-based coherence scales better to many cores (64+) and is used in large multi-core processors.
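A directory entry can be as simple as one bit per core. The sketch below assumes a full bit-vector directory for up to 64 cores and tracks only the sharer set, not the line's state; `DirEntry`, `dir_read`, and `dir_write` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CORES 64

typedef struct {
    uint64_t sharers;   // bit i set = core i holds a copy of the line
} DirEntry;

// Core `core` reads the line: record it as a sharer.
static void dir_read(DirEntry *d, int core) {
    assert(core >= 0 && core < MAX_CORES);
    d->sharers |= (uint64_t)1 << core;
}

// Core `core` writes the line. Returns the set of cores that must be
// sent an invalidation; afterwards only the writer holds a copy.
static uint64_t dir_write(DirEntry *d, int core) {
    assert(core >= 0 && core < MAX_CORES);
    uint64_t self = (uint64_t)1 << core;
    uint64_t to_invalidate = d->sharers & ~self;
    d->sharers = self;
    return to_invalidate;
}
```

The returned mask is what makes the protocol point-to-point: only the cores named in it receive a message, whereas a snooping bus would disturb every cache.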
RISC-V Coherence Considerations
RISC-V doesn’t mandate a specific coherence protocol. The RVWMO memory model (Chapter 6) defines the ordering guarantees, but the coherence mechanism is implementation-defined.
Most RISC-V multi-core systems use MESI or MOESI with snooping (for small core counts) or directory-based coherence (for large core counts).
8.8 Comparison with ARM and MIPS OOO Cores
Let’s compare RISC-V OOO implementations with other architectures.
RISC-V BOOM vs ARM Cortex-A76/A78
BOOM (Berkeley Out-of-Order Machine) is an open-source RISC-V OOO core. It has:
- 3-4 issue width
- 128-entry ROB
- 64-entry issue queue
- Advanced branch prediction (TAGE)
- L1 I/D caches, L2 unified cache
ARM Cortex-A76 (2018) is a high-performance mobile core:
- 4-issue width
- 128-entry ROB
- Sophisticated branch prediction
- 64 KB L1 I/D, 256-512 KB L2
Comparison: BOOM and Cortex-A76 are similar in structure. Both use register renaming, ROB, and advanced branch prediction. The main differences are in implementation details (pipeline depth, cache sizes, power optimization).
RISC-V’s simpler ISA (no complex addressing modes, no condition codes) makes the OOO logic slightly simpler than ARM’s, but the difference is small in modern designs.
RISC-V vs MIPS R10000 Pipeline
MIPS R10000 (1996) was one of the first commercial OOO processors:
- 4-issue superscalar
- 32-entry active list (similar to ROB)
- Register renaming with 64 physical registers
- Speculative execution
The R10000 pioneered many techniques still used today. Modern RISC-V OOO cores like BOOM are evolutionary descendants of the R10000 design philosophy.
Key difference: RISC-V has no branch delay slots, making branch misprediction recovery simpler than MIPS.
Microarchitecture Trade-offs
Complexity vs Performance: OOO execution provides 2-3x performance improvement over in-order for general-purpose workloads, but at the cost of 3-5x more transistors and power.
When to use OOO:
- High-performance applications (servers, desktops, high-end mobile)
- Workloads with irregular memory access patterns
- Code with many branches and dependencies
When to use in-order:
- Embedded systems with power/area constraints
- Predictable real-time workloads
- Simple control-dominated code
RISC-V’s flexibility allows both in-order (Rocket, SiFive E/U-series) and OOO (BOOM, SiFive P-series) implementations, making it suitable for a wide range of applications.
🛠️ Hands-on Lab: Lab 8.1 — The Truth Behind Performance Counters
This lab guides you through using RISC-V’s hardware performance counters to measure CPI (Cycles Per Instruction) of the same code under different conditions.
Lab Objectives
- Read mcycle (Machine Cycle Counter) and minstret (Machine Instructions Retired)
- Calculate CPI = Cycles / Instructions
- Observe how different code patterns affect CPI
Concept Explanation
RISC-V provides two key performance counter CSRs:
| CSR | Name | Description |
|---|---|---|
| mcycle | Machine Cycle Counter | Clock cycles elapsed since reset |
| minstret | Machine Instructions Retired | Instructions completed since reset |
CPI (Cycles Per Instruction) = (change in mcycle) / (change in minstret), measured across the code region of interest
- CPI = 1.0: Ideal case, one instruction completes per cycle
- CPI > 1.0: Stalls present (cache miss, hazard, etc.)
- CPI < 1.0: Superscalar processor, multiple instructions complete per cycle
Code
Create lab8_perf.c:
// lab8_perf.c - Performance Counter Measurement
#include <stdio.h>
#include <stdint.h>
// Read mcycle
static inline uint64_t read_mcycle(void) {
uint64_t val;
asm volatile("csrr %0, mcycle" : "=r"(val));
return val;
}
// Read minstret
static inline uint64_t read_minstret(void) {
uint64_t val;
asm volatile("csrr %0, minstret" : "=r"(val));
return val;
}
// Test function 1: Simple addition loop (no dependencies)
volatile int result1;
void test_independent(int n) {
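int a = 0, b = 0, c = 0, d = 0;
for (int i = 0; i < n; i++) {
a += 1;
b += 2;
c += 3;
d += 4;
// Empty asm that "uses" the accumulators, so that -O2 keeps the loop
// instead of replacing it with a closed-form result
asm volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
}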
result1 = a + b + c + d;
}
// Test function 2: Addition loop with dependencies
volatile int result2;
void test_dependent(int n) {
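unsigned int a = 0; // unsigned: wraps around instead of signed overflow (undefined)
for (int i = 0; i < n; i++) {
a += 1;
a += a; // depends on previous line's result
a += a; // depends on previous line's result
a += a; // depends on previous line's result
asm volatile("" : "+r"(a)); // keep the loop from being optimized away at -O2
}
result2 = (int)a;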
}
void measure(const char *name, void (*func)(int), int n) {
uint64_t cycle_start = read_mcycle();
uint64_t instr_start = read_minstret();
func(n);
uint64_t cycle_end = read_mcycle();
uint64_t instr_end = read_minstret();
uint64_t cycles = cycle_end - cycle_start;
uint64_t instrs = instr_end - instr_start;
// Calculate CPI (multiply by 100 to avoid floating point)
uint64_t cpi_x100 = (cycles * 100) / instrs;
printf("%s:\n", name);
printf(" Cycles: %lu\n", cycles);
printf(" Instructions: %lu\n", instrs);
printf(" CPI: %lu.%02lu\n", cpi_x100 / 100, cpi_x100 % 100);
}
int main() {
int n = 100000;
printf("=== Performance Counter Lab ===\n\n");
measure("Independent Operations", test_independent, n);
printf("\n");
measure("Dependent Operations", test_dependent, n);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -O2 -o lab8_perf lab8_perf.c
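# Run (requires M-mode access to mcycle/minstret).
# A user-mode emulator such as qemu-riscv64 runs the program in U-mode,
# where these CSR reads typically trap as illegal instructions; there,
# read the unprivileged counters with rdcycle/rdinstret instead, if the
# platform exposes them.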
qemu-riscv64 lab8_perf
Expected Output (values vary by implementation):
=== Performance Counter Lab ===
Independent Operations:
Cycles: 500123
Instructions: 700045
CPI: 0.71
Dependent Operations:
Cycles: 1200456
Instructions: 800089
CPI: 1.50
What You Just Did
- Independent Operations: Four parallel additions can be executed simultaneously by OOO processors, resulting in CPI < 1
- Dependent Operations: Each addition depends on the previous result, forcing sequential execution, resulting in CPI > 1
danieRTOS Reference: The scheduler uses similar performance measurement techniques to profile task execution time and optimize scheduling decisions.
⚠️ Common Pitfalls
Pitfall 1: Thinking Out-of-Order Is Always Better
Misconception: “OoO processors are always faster than In-Order!”
Correct Understanding:
- OoO has heavier penalties on Branch Misprediction (more instructions to flush)
- OoO may not be optimal in power-constrained scenarios (IoT devices, or the efficiency cores of a phone)
- For highly parallel workloads (GPU-like), simple In-Order cores are often more efficient
Pitfall 2: Ignoring Spectre/Meltdown Risks
Misconception: “Speculative Execution is just a performance optimization with no side effects.”
Correct Understanding:
- Spectre and Meltdown attacks exploit side-channel effects of Speculative Execution
- Even when mispredicted instructions are cancelled, their effects on Cache remain
- Attackers can infer secret data by measuring Cache access timing
// Simplified Spectre attack concept
if (x < array1_size) { // Bounds check
y = array2[array1[x] * 256]; // If x is out of bounds, this shouldn't execute
}
// But Speculative Execution may "execute first, ask questions later"
// Even after discovering x is out of bounds and cancelling,
// array2's cache state has leaked array1[x]'s value
Pitfall 3: Confusing mcycle with time
Error Scenario: Using mcycle to measure “real time.”
Correct Understanding:
- mcycle: CPU clock cycles, tied to CPU frequency
- time: Real time (usually from RTC or Timer)
- If CPU frequency changes dynamically (DVFS), mcycle cannot be directly converted to time
// ❌ Wrong: Assuming mcycle equals time
uint64_t start = read_mcycle();
do_something();
uint64_t elapsed_ns = (read_mcycle() - start) * 1000000000 / CPU_FREQ;
// If CPU frequency changed, this calculation is wrong
// ✅ Correct: Use time CSR or SBI timer
uint64_t start = read_time();
do_something();
uint64_t elapsed_ns = (read_time() - start) * 1000000000 / TIMER_FREQ;
Summary
In this chapter, we explored advanced microarchitecture techniques for high-performance RISC-V processors:
- Out-of-order execution: Dynamic scheduling to extract instruction-level parallelism.
- Register renaming: Eliminating false dependencies (WAR, WAW) with physical registers.
- Reorder buffer: Ensuring in-order commit for precise exceptions.
- Load/store queues: Memory disambiguation and store-to-load forwarding.
- Advanced branch prediction: TAGE, RAS, indirect branch prediction.
- Cache hierarchy: L1/L2/L3 caches with LRU/PLRU replacement.
- Cache coherence: MESI/MOESI protocols for multi-core consistency.
- Comparisons: RISC-V BOOM vs ARM Cortex-A76 and MIPS R10000.
RISC-V’s clean ISA makes it an excellent target for both simple in-order cores and complex OOO designs. The architecture doesn’t impose unnecessary constraints, allowing microarchitects to innovate freely.
In the next chapter, we’ll shift focus from hardware to software, exploring how RISC-V systems boot and how firmware and operating systems interact with the hardware.
Chapter 9: Reset, Boot Flow & Firmware
Part VI — Booting & System Software
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand the Reset Vector: Know where the first instruction executes after RISC-V powers on
- Master the Boot Flow: Understand the relay from BootROM → Loader → Firmware → OS
- Write Linker Scripts: Define your program’s memory layout
- Implement Bare-metal Programs: Control hardware directly without an OS
- Understand UART MMIO: Output characters through Memory-Mapped I/O
💡 Scenario: The First Leg of the Relay Race
Scene: Junior presses the Reset button on the development board, watching logs scroll across the terminal.
Junior: “Architect, when we write C, we always start from main(). But when the CPU just powers on, RAM should be empty, right? Where does the CPU get its first instruction?”
Architect: “Great question. It’s like a relay race—main() is actually the third or fourth leg.
- First Leg (Reset Vector): The CPU hardware is designed so that after power-on, the PC (Program Counter) automatically points to a fixed location (usually ROM). There, a small hardcoded program (BootROM) lives.
- Second Leg (Loader): BootROM’s job is simple: copy the next program (like OpenSBI or U-Boot) from storage (Flash/SD card) into RAM, then jump to it.
- Third Leg (Firmware/OS): Now the environment is more comfortable: we have RAM available, and we can finally prepare to run your main().”
Junior: “Can we skip all that complex OS stuff and directly be that ‘second leg of the relay,’ controlling the hardware ourselves?”
Architect: “Absolutely—that’s called Bare-metal programming. In this world, there’s no printf, no malloc, not even a Stack unless you set it up yourself. Let’s try sending our first shout into this ‘wilderness.’”
Junior: “Sounds exciting!”
Architect: “But here’s the key: before entering C code, you must set up the Stack Pointer (SP). C function calls depend on the Stack—jumping into C without setting SP will crash immediately.”
What happens when you power on a RISC-V system? Unlike application software that runs in a well-prepared environment, the boot process starts from nothing—no operating system, no memory initialization, not even a stack. This chapter explores how RISC-V systems bootstrap themselves from power-on reset to a running operating system.
The boot process is a carefully orchestrated sequence of firmware stages, each preparing the environment for the next. We’ll trace this journey from the reset vector through machine-mode firmware (ZSBL, FSBL, OpenSBI), bootloaders (U-Boot, GRUB), and finally to the operating system handoff. Understanding this process is essential for firmware developers, system integrators, and anyone debugging boot issues.
9.1 Reset and Boot Sequence
Power-On Reset
When power is applied to a RISC-V processor, hardware reset logic initializes the core to a known state. All harts (hardware threads) begin execution in Machine mode (M-mode), the highest privilege level with full access to all hardware resources.
Reset state (defined by the RISC-V Privileged Specification):
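- PC (Program Counter): Set to the reset vector address (implementation-defined, often 0x1000 or 0x80000000)
- Privilege mode: M-mode
- Interrupts: Globally disabled (mstatus.MIE = 0)
- Virtual memory: Not in effect (M-mode accesses are untranslated while mstatus.MPRV = 0, which reset guarantees; satp itself should still be initialized by software)
- Most other CSRs: Unspecified, so firmware must initialize them before relying on them
- General-purpose registers: Unspecified (except x0, which is always zero)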
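Multi-hart boot is platform-specific. On many systems a single boot hart (usually hart 0) starts executing from the reset vector, while the other harts are held in a wait state until explicitly started by the boot hart. On others, every hart leaves reset together, and early firmware selects one boot hart (for example, by checking mhartid) and parks the rest.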
Reset Vector
The reset vector is the first instruction address executed after reset. This address is implementation-defined and typically points to:
- ROM (Read-Only Memory): Contains first-stage bootloader (FSBL)
- Flash memory: Contains firmware image
- RAM: Pre-loaded by JTAG debugger (for development)
Example reset vectors:
- SiFive FU540: 0x1000 (ROM)
- SiFive FU740: 0x1000 (ROM)
- QEMU virt machine: 0x1000 (ROM)
- Rocket Chip: 0x10000 (configurable)
Figure 9.1: RISC-V Boot Sequence Overview
graph TB
RESET[Power-On Reset] --> RV[Reset Vector<br/>0x1000]
RV --> ROM[ROM Code<br/>M-mode]
ROM --> FSBL[First-Stage<br/>Bootloader<br/>ZSBL/FSBL]
FSBL --> SSBL[Second-Stage<br/>Bootloader<br/>U-Boot/OpenSBI]
SSBL --> OS[Operating System<br/>Linux/FreeBSD<br/>S-mode]
ROM -.->|Initialize| HW[Hardware<br/>DRAM, Clocks,<br/>Peripherals]
FSBL -.->|Load from| STORAGE[Storage<br/>Flash/SD/Network]
SSBL -.->|SBI Services| RUNTIME[M-mode Runtime<br/>OpenSBI]
style RESET fill:#ffcccc
style ROM fill:#ffe1e1
style FSBL fill:#fff4e1
style SSBL fill:#e1ffe1
style OS fill:#e1f5ff
Early Initialization
The first code executed at the reset vector must be extremely careful — it runs with no stack, no initialized data, and minimal hardware setup. Typical early initialization:
# Reset vector entry point (M-mode)
_start:
# Disable interrupts (already disabled by reset, but be explicit)
csrw mie, zero
csrw mip, zero
# Initialize global pointer (gp) for data access
.option push
.option norelax
la gp, __global_pointer$
.option pop
# Set up stack pointer (sp)
la sp, __stack_top
# Clear BSS (uninitialized data)
la t0, __bss_start
la t1, __bss_end
1: bge t0, t1, 2f
sd zero, 0(t0)
addi t0, t0, 8
j 1b
2:
# Jump to C code
call boot_main
# boot_main should never return; park the hart if it does
3: wfi
j 3b
Key steps:
- Disable interrupts: Ensure no interrupts occur during initialization
- Set up gp (global pointer): Enables efficient access to global variables
- Set up sp (stack pointer): Enables function calls and local variables
- Clear BSS: Zero-initialize uninitialized global variables
- Jump to C code: Now safe to run higher-level code
9.2 Machine Mode Initialization
CSR Initialization
Machine-mode firmware must initialize critical CSRs before proceeding. These control interrupt handling, memory protection, and hardware features.
Essential CSR initialization:
// Initialize machine-mode CSRs
void init_machine_mode(void) {
// 1. Set up trap vector
write_csr(mtvec, (uintptr_t)&m_trap_vector);
// 2. Enable machine-mode interrupts (but keep global IE off for now)
write_csr(mie, MIE_MSIE | MIE_MTIE | MIE_MEIE);
// 3. Initialize mstatus
uintptr_t mstatus = read_csr(mstatus);
mstatus &= ~MSTATUS_MIE; // Keep interrupts disabled
mstatus |= MSTATUS_FS_INITIAL; // Enable FPU (if present)
write_csr(mstatus, mstatus);
// 4. Clear pending interrupts
write_csr(mip, 0);
// 5. Initialize performance counters (if present)
write_csr(mcounteren, 0x7); // Enable cycle, time, instret for S-mode
}
Physical Memory Protection (PMP) Setup
PMP is RISC-V’s mechanism for isolating memory regions and enforcing access permissions. M-mode firmware configures PMP entries to protect firmware code, restrict device access, and define memory regions for S-mode.
PMP configuration example:
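// Configure PMP to allow S-mode access to RAM
void setup_pmp(void) {
// A TOR entry i matches pmpaddr[i-1] <= addr < pmpaddr[i], so each region
// below uses two address registers and the config byte of the upper one.
// Region 0 (entries 0-1): M-mode firmware (0x80000000 - 0x80100000)
write_csr(pmpaddr0, 0x80000000UL >> 2);
write_csr(pmpaddr1, 0x80100000UL >> 2);
// Region 1 (entries 2-3): S-mode full access to RAM (0x80100000 - 0x88000000)
write_csr(pmpaddr2, 0x80100000UL >> 2);
write_csr(pmpaddr3, 0x88000000UL >> 2);
// Region 2 (entries 4-5): UART (0x10000000 - 0x10001000)
write_csr(pmpaddr4, 0x10000000UL >> 2);
write_csr(pmpaddr5, 0x10001000UL >> 2);
// Build all config bytes first and write pmpcfg0 once: a separate write per
// region would overwrite the earlier ones. Entries 0, 2, 4 stay OFF.
// (RV64 layout, where pmpcfg0 holds entries 0-7.)
uintptr_t cfg = 0;
cfg |= (uintptr_t)(PMP_R | PMP_X | PMP_L | PMP_TOR) << 8; // Entry 1: R-X, locked
cfg |= (uintptr_t)(PMP_R | PMP_W | PMP_X | PMP_TOR) << 24; // Entry 3: RWX
cfg |= (uintptr_t)(PMP_R | PMP_W | PMP_TOR) << 40; // Entry 5: RW
write_csr(pmpcfg0, cfg);
}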
PMP addressing modes:
- OFF: Entry disabled
- TOR (Top-Of-Range): Region from pmpaddr[i-1] to pmpaddr[i]
- NA4: Naturally aligned 4-byte region
- NAPOT: Naturally aligned power-of-2 region
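The register encodings behind these modes can be checked on a host machine before they are written to real CSRs. The sketch below uses the bit values from the privileged specification. The helper names (`pmp_napot_addr`, `pmp_tor_addr`, `pmpcfg_set`) are hypothetical, and the config packing assumes the RV64 layout in which `pmpcfg0` holds entries 0-7.

```c
#include <assert.h>
#include <stdint.h>

/* pmpcfg bit layout from the RISC-V privileged specification. */
#define PMP_R     0x01u
#define PMP_W     0x02u
#define PMP_X     0x04u
#define PMP_TOR   0x08u   /* A field = 1 */
#define PMP_NA4   0x10u   /* A field = 2 */
#define PMP_NAPOT 0x18u   /* A field = 3 */
#define PMP_L     0x80u

/* Encode a naturally aligned power-of-two region for pmpaddr (NAPOT).
 * size must be a power of two >= 8 and base must be aligned to size.
 * The trailing one bits of the result encode the region size. */
static uint64_t pmp_napot_addr(uint64_t base, uint64_t size)
{
    assert(size >= 8 && (size & (size - 1)) == 0);
    assert((base & (size - 1)) == 0);
    return (base | (size / 2 - 1)) >> 2;
}

/* Encode one bound of a TOR region: the byte address shifted right by 2. */
static uint64_t pmp_tor_addr(uint64_t addr)
{
    return addr >> 2;
}

/* Place one 8-bit config value into an image of pmpcfg0 (RV64, entries
 * 0-7), leaving the config bytes of all other entries untouched. */
static uint64_t pmpcfg_set(uint64_t cfg, unsigned entry, uint8_t value)
{
    assert(entry < 8);
    cfg &= ~(0xFFull << (8 * entry));
    return cfg | ((uint64_t)value << (8 * entry));
}
```

For a 1 MiB firmware region at 0x80000000, `pmp_napot_addr` yields 0x2001FFFF, so a single NAPOT entry can cover what TOR needs two address registers for.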
Figure 9.2: PMP Memory Protection
graph TB
subgraph "Physical Memory"
ROM[ROM<br/>0x1000-0x10000<br/>M-mode only]
MRAM[M-mode RAM<br/>0x80000000-0x80100000<br/>Locked, R-X]
SRAM[S-mode RAM<br/>0x80100000-0x88000000<br/>RWX]
UART[UART<br/>0x10000000-0x10001000<br/>RW]
FLASH[Flash<br/>0x20000000-0x24000000<br/>R-X]
end
PMP0[PMP Entry 0<br/>M-mode Firmware] --> MRAM
PMP1[PMP Entry 1<br/>S-mode Memory] --> SRAM
PMP2[PMP Entry 2<br/>UART Device] --> UART
style MRAM fill:#ffcccc
style SRAM fill:#ccffcc
style UART fill:#ffffcc
Memory Configuration
M-mode firmware must initialize DRAM controllers and configure memory timing. This is highly platform-specific and often the most complex part of early boot.
Typical DRAM initialization:
- Configure DRAM controller registers (timing, refresh rate)
- Perform DRAM training (calibrate delays)
- Test memory (optional, but recommended)
- Set up memory map (base address, size)
Example (simplified):
void init_dram(void) {
volatile uint32_t *dram_ctrl = (uint32_t *)0x10000000;
// Configure DRAM timing (platform-specific)
dram_ctrl[0] = 0x12345678; // Timing register
dram_ctrl[1] = 0x9ABCDEF0; // Refresh register
// Wait for DRAM ready
while (!(dram_ctrl[2] & 0x1));
// Simple memory test
volatile uint64_t *mem = (uint64_t *)0x80000000;
mem[0] = 0xDEADBEEFCAFEBABE;
if (mem[0] != 0xDEADBEEFCAFEBABE) {
// Memory test failed
while (1);
}
}
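A single-location check like the one in `init_dram` confirms that the controller responds, but it cannot detect stuck or aliased address lines. One slightly stronger option, shown here as a sketch rather than a production memory test, writes an index-derived pattern to a range and reads it back. `mem_test_words` is a hypothetical name, and the pattern constant is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a distinct value to every word, then read all of them back.
 * Returns the index of the first mismatching word, or -1 if all match.
 * Because each word holds a different value, two addresses that alias
 * to the same cell show up as a mismatch. */
static long mem_test_words(volatile uint64_t *base, size_t nwords)
{
    assert(base != NULL);
    for (size_t i = 0; i < nwords; i++)
        base[i] = 0xA5A5A5A5A5A5A5A5ull ^ (uint64_t)i;
    for (size_t i = 0; i < nwords; i++)
        if (base[i] != (0xA5A5A5A5A5A5A5A5ull ^ (uint64_t)i))
            return (long)i;
    return -1;
}
```

On real hardware the range would typically be a small window of DRAM, since testing every word at boot can take a noticeable amount of time.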
9.3 Firmware and Bootloader
Firmware Stages
RISC-V boot firmware is typically organized into multiple stages, each with specific responsibilities:
- ZSBL (Zeroth-Stage Bootloader): Minimal ROM code, initializes DRAM
- FSBL (First-Stage Bootloader): Loads SSBL from storage
- SSBL (Second-Stage Bootloader): Full-featured bootloader (U-Boot)
- Runtime firmware: M-mode services (OpenSBI)
Why multiple stages?
- ROM size constraints: ZSBL must fit in small on-chip ROM
- Flexibility: FSBL/SSBL can be updated without hardware changes
- Feature richness: Later stages can use DRAM and have more code space
First-Stage Bootloader (FSBL)
FSBL’s primary job is to load the second-stage bootloader from non-volatile storage (Flash, SD card, network).
FSBL responsibilities:
- Initialize storage controller (SPI, SD, eMMC)
- Load SSBL image from storage to DRAM
- Verify SSBL integrity (checksum, signature)
- Jump to SSBL entry point
Example FSBL flow:
void fsbl_main(void) {
// 1. Initialize storage
spi_flash_init();
// 2. Load SSBL from flash to DRAM
uint8_t *ssbl_dest = (uint8_t *)0x80200000;
uint32_t ssbl_size = 512 * 1024; // 512 KB
spi_flash_read(0x100000, ssbl_dest, ssbl_size);
// 3. Verify checksum
if (!verify_checksum(ssbl_dest, ssbl_size)) {
panic("SSBL checksum failed");
}
// 4. Jump to SSBL
void (*ssbl_entry)(void) = (void (*)(void))ssbl_dest;
ssbl_entry();
}
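The FSBL flow calls `verify_checksum` without showing what it might compute. One common choice is CRC-32, sketched below in a table-free form that suits a small bootloader. This is an assumption for illustration, not the check any particular FSBL uses: `verify_crc32` is a hypothetical variant that takes the expected value as a parameter, where a real image would usually carry it in a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and
 * Ethernet). Slower than a table-driven version but needs no lookup table. */
static uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical variant of verify_checksum(): the expected CRC would come
 * from an image header, whose layout is an assumption of this sketch. */
static int verify_crc32(const uint8_t *image, size_t len, uint32_t expected)
{
    assert(image != NULL);
    return crc32_compute(image, len) == expected;
}
```

A CRC only detects accidental corruption. Protecting against deliberate tampering requires a cryptographic signature, which is the other option the responsibilities list mentions.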
U-Boot for RISC-V
U-Boot is the most common second-stage bootloader for RISC-V Linux systems. It provides a rich environment for loading and booting operating systems.
U-Boot features:
- Multiple boot sources: Flash, SD, USB, network (TFTP, NFS)
- File system support: FAT, ext2/3/4, SquashFS
- Network stack: DHCP, TFTP, NFS
- Scripting: Boot scripts for automation
- Device tree: Passes hardware description to OS
- Interactive shell: For debugging and manual boot
U-Boot boot flow:
U-Boot SPL (if used) → U-Boot proper → Load kernel → Load device tree → Boot kernel
9.4 OpenSBI: Supervisor Binary Interface
OpenSBI is the reference implementation of the RISC-V Supervisor Binary Interface (SBI). It provides a standard interface between M-mode firmware and S-mode operating systems.
OpenSBI Architecture
OpenSBI runs in M-mode and provides runtime services to S-mode software (OS kernels, hypervisors). It acts as a thin firmware layer that abstracts platform-specific details.
Figure 9.3: OpenSBI Architecture
graph TB
subgraph "S-mode (Supervisor)"
LINUX[Linux Kernel]
FREEBSD[FreeBSD Kernel]
HV[Hypervisor]
end
subgraph "M-mode (Machine)"
OPENSBI[OpenSBI Runtime]
PLATFORM[Platform Code<br/>UART, Timer, IPI]
end
subgraph "Hardware"
CPU[RISC-V Core]
CLINT[CLINT<br/>Timer, IPI]
PLIC[PLIC<br/>Interrupts]
UART_HW[UART]
end
LINUX -->|ecall| OPENSBI
FREEBSD -->|ecall| OPENSBI
HV -->|ecall| OPENSBI
OPENSBI --> PLATFORM
PLATFORM --> CLINT
PLATFORM --> PLIC
PLATFORM --> UART_HW
style OPENSBI fill:#e1ffe1
style PLATFORM fill:#fff4e1
OpenSBI provides:
- Timer services: Set timer interrupts
- IPI (Inter-Processor Interrupt): Send IPIs to other harts
- RFENCE: Remote fence operations (TLB flush, I-cache flush)
- Hart state management: Start/stop harts
- System reset: Reboot or shutdown
- Console I/O: Early debug output
Platform Initialization
OpenSBI initializes platform-specific hardware during boot:
// OpenSBI platform initialization (simplified)
int sbi_platform_init(void) {
// 1. Initialize console (UART)
uart_init();
sbi_printf("OpenSBI v1.0\n");
// 2. Initialize CLINT (timer and IPI)
clint_init();
// 3. Initialize PLIC (interrupt controller)
plic_init();
// 4. Set up PMP for S-mode
setup_pmp();
// 5. Initialize other harts
for (int i = 1; i < num_harts; i++) {
sbi_hsm_hart_start(i, smode_entry, 0);
}
return 0;
}
SBI Runtime Services
S-mode software invokes SBI services using the ecall instruction. The SBI call convention uses registers to pass function ID and parameters:
- a7: SBI extension ID (EID)
- a6: SBI function ID (FID)
- a0-a5: Parameters
- a0: Error code on return (0 = success, negative = error)
- a1: Return value (function-specific)
Example: Setting a timer
# S-mode code: Set timer for 1 second from now
li a7, 0x54494D45 # EID_TIME = 0x54494D45 ("TIME")
li a6, 0 # FID_SET_TIMER = 0
rdtime a0 # Read current time
li t0, 10000000 # 1 second at 10 MHz
add a0, a0, t0 # Target time
ecall # Call OpenSBI
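The constant 10000000 in the example is tied to a 10 MHz time base, which is what QEMU's virt machine uses. Other platforms differ, and Linux systems publish the value in the device tree as `timebase-frequency`. A small sketch of the conversion, with `timer_deadline` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a delay in milliseconds to an absolute deadline in timer ticks.
 * timebase_hz is the frequency of the time counter read by rdtime; the
 * 10 MHz used in the text is only one possible value. */
static uint64_t timer_deadline(uint64_t now, uint64_t timebase_hz,
                               uint64_t delay_ms)
{
    assert(timebase_hz > 0);
    return now + (delay_ms * timebase_hz) / 1000u;
}
```

The result is the value that would be placed in a0 before the `ecall`.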
OpenSBI handles the ecall:
// OpenSBI trap handler
void sbi_trap_handler(struct sbi_trap_regs *regs) {
if (regs->cause == CAUSE_SUPERVISOR_ECALL) {
ulong eid = regs->a7;
ulong fid = regs->a6;
if (eid == SBI_EXT_TIME && fid == SBI_EXT_TIME_SET_TIMER) {
// Set timer
uint64_t next_time = regs->a0;
clint_set_timer(current_hart(), next_time);
regs->a0 = SBI_SUCCESS;
}
// Advance mepc past ecall (the trap was taken to M-mode, so mret resumes at mepc)
regs->mepc += 4;
}
}
9.5 Supervisor Mode Handoff
M-mode to S-mode Transition
After OpenSBI initialization, control is transferred to S-mode (the operating system). This transition involves:
- Set up S-mode entry point: mepc = OS entry address (the address mret returns to)
- Configure mstatus: Set MPP = 1 (S-mode) and clear MPIE so interrupts stay disabled at entry
- Delegate interrupts/exceptions: Configure mideleg and medeleg
- Pass parameters: Hart ID in a0, device tree address in a1
- Execute mret: Return to S-mode
OpenSBI handoff code:
void sbi_boot_hart(ulong next_addr, ulong next_mode, ulong fdt_addr) {
// 1. Set S-mode entry point (mret resumes at mepc)
csr_write(CSR_MEPC, next_addr);
// 2. Configure mstatus for S-mode
ulong mstatus = csr_read(CSR_MSTATUS);
mstatus = INSERT_FIELD(mstatus, MSTATUS_MPP, PRV_S); // Return to S-mode
mstatus = INSERT_FIELD(mstatus, MSTATUS_MPIE, 0); // Disable interrupts initially
mstatus = INSERT_FIELD(mstatus, MSTATUS_SPP, 0); // S-mode came from U-mode
csr_write(CSR_MSTATUS, mstatus);
// 3. Delegate interrupts to S-mode
csr_write(CSR_MIDELEG, MIP_SSIP | MIP_STIP | MIP_SEIP);
// 4. Delegate exceptions to S-mode
csr_write(CSR_MEDELEG, (1 << CAUSE_MISALIGNED_FETCH) |
(1 << CAUSE_FETCH_PAGE_FAULT) |
(1 << CAUSE_LOAD_PAGE_FAULT) |
(1 << CAUSE_STORE_PAGE_FAULT));
// 5. Pass device tree address in a1
register ulong a0 asm("a0") = current_hartid();
register ulong a1 asm("a1") = fdt_addr;
// 6. Jump to S-mode
asm volatile("mret" : : "r"(a0), "r"(a1));
}
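The `INSERT_FIELD` updates to mstatus can be tried on a plain integer to see exactly which bits change. The sketch below uses the field positions from the privileged specification. `insert_field` and `mstatus_for_smode_entry` are hypothetical stand-ins for illustration, not OpenSBI's actual macro or function.

```c
#include <assert.h>
#include <stdint.h>

/* mstatus field positions from the RISC-V privileged specification. */
#define MSTATUS_MIE   (1ull << 3)
#define MSTATUS_MPIE  (1ull << 7)
#define MSTATUS_MPP   (3ull << 11)
#define PRV_U 0ull
#define PRV_S 1ull
#define PRV_M 3ull

/* Replace a multi-bit field: clear it, then insert val at the field's
 * lowest set bit. (mask & (~mask + 1)) isolates that lowest bit. */
static uint64_t insert_field(uint64_t reg, uint64_t mask, uint64_t val)
{
    assert(mask != 0);
    return (reg & ~mask) | ((val * (mask & (~mask + 1))) & mask);
}

/* Compute the mstatus value to write before mret so that the hart lands
 * in S-mode. mret copies MPIE into MIE, so MPIE = 0 leaves MIE clear. */
static uint64_t mstatus_for_smode_entry(uint64_t mstatus)
{
    mstatus = insert_field(mstatus, MSTATUS_MPP, PRV_S);
    mstatus = insert_field(mstatus, MSTATUS_MPIE, 0);
    return mstatus;
}
```

Starting from 0x1880 (MPP = M-mode, MPIE set), the result is 0x0800: only MPP and MPIE change, and every other bit passes through unmodified.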
Device Tree Passing
The device tree (DTB) describes the hardware platform to the OS. OpenSBI passes the DTB address to the kernel in register a1.
Device tree structure (simplified):
/dts-v1/;
/ {
#address-cells = <2>;
#size-cells = <2>;
compatible = "sifive,fu740", "sifive,fu540";
model = "SiFive HiFive Unmatched";
cpus {
#address-cells = <1>;
#size-cells = <0>;
cpu@0 {
device_type = "cpu";
reg = <0>;
compatible = "sifive,u74", "riscv";
riscv,isa = "rv64imafdc";
mmu-type = "riscv,sv39";
};
// More CPUs...
};
memory@80000000 {
device_type = "memory";
reg = <0x0 0x80000000 0x2 0x00000000>; // 8 GB at 0x80000000
};
soc {
uart@10010000 {
compatible = "sifive,uart0";
reg = <0x0 0x10010000 0x0 0x1000>;
interrupts = <4>;
};
// More devices...
};
};
Kernel receives DTB:
// Linux kernel entry point (arch/riscv/kernel/head.S)
_start:
// a0 = hartid
// a1 = DTB address
// Save DTB address
la t0, dtb_early_pa
sd a1, 0(t0)
// Continue boot...
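Before trusting the pointer received in a1, early boot code typically validates the blob's header. A flattened device tree starts with the big-endian magic number 0xD00DFEED followed by the total size. The sketch below checks only those two fields. The function names are hypothetical, and real kernels use libfdt (for example `fdt_check_header`) for a complete check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xD00DFEEDu

/* Read a big-endian 32-bit value; DTB header fields are big-endian
 * regardless of the CPU's byte order. */
static uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the blob's total size in bytes, or 0 if the magic is wrong.
 * Header layout: magic at offset 0, totalsize at offset 4. */
static uint32_t fdt_total_size(const void *dtb)
{
    const uint8_t *p = dtb;
    assert(p != NULL);
    if (be32_read(p) != FDT_MAGIC)
        return 0;
    return be32_read(p + 4);
}
```

Reading the fields byte by byte avoids both endianness mistakes and unaligned word loads.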
9.6 Linux Boot on RISC-V
Linux Kernel Entry Point
The Linux kernel for RISC-V starts in arch/riscv/kernel/head.S with the following state:
- Privilege mode: S-mode
- MMU: Disabled (satp = 0)
- Interrupts: Disabled
- a0: Hart ID
- a1: Device tree physical address
Early kernel initialization:
# arch/riscv/kernel/head.S (simplified)
_start:
# Disable interrupts
csrw sie, zero
csrw sip, zero
# Save hart ID and DTB address
mv s0, a0 # s0 = hartid
mv s1, a1 # s1 = DTB address
# Set up temporary stack
la sp, init_thread_union + THREAD_SIZE
# Clear BSS
la t0, __bss_start
la t1, __bss_stop
1: sd zero, 0(t0)
addi t0, t0, 8
blt t0, t1, 1b
# Set up early page tables
call setup_vm
# Enable MMU
la t0, early_pg_dir
srli t0, t0, 12
li t1, SATP_MODE_SV39
or t0, t0, t1
csrw satp, t0
sfence.vma
# Jump to virtual address space
la t0, .Lvirtual
jr t0
.Lvirtual:
# Now running with MMU enabled
call start_kernel
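The `srli`/`or` sequence that builds the satp value is easier to follow as a C expression. On RV64, satp holds the mode in bits 63:60, the ASID in bits 59:44, and the root page table's physical page number in bits 43:0. The sketch below assumes Sv39 (mode value 8) and 4 KiB pages. `make_satp_sv39` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_SV39_VAL 8ull   /* MODE field encoding for Sv39 */
#define PAGE_SHIFT_4K      12

/* Build an RV64 satp value from the physical address of the root page
 * table and an address space ID. */
static uint64_t make_satp_sv39(uint64_t root_table_pa, uint16_t asid)
{
    /* The root table must be page aligned. */
    assert((root_table_pa & ((1ull << PAGE_SHIFT_4K) - 1)) == 0);
    uint64_t ppn = root_table_pa >> PAGE_SHIFT_4K;
    assert(ppn < (1ull << 44));
    return (SATP_MODE_SV39_VAL << 60) | ((uint64_t)asid << 44) | ppn;
}
```

For a root table at 0x80200000 and ASID 0, the value is 0x8000000000080200, which matches what the assembly computes when `SATP_MODE_SV39` is the pre-shifted mode constant.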
Device Tree Parsing
The kernel parses the device tree to discover hardware:
// Simplified device tree parsing
void __init setup_arch(char **cmdline_p) {
// 1. Unflatten device tree
unflatten_device_tree();
// 2. Parse memory nodes
early_init_dt_scan_memory();
// 3. Parse CPU nodes
for_each_of_cpu_node(node) {
parse_cpu_node(node);
}
// 4. Parse chosen node (bootargs, initrd)
early_init_dt_scan_chosen(cmdline_p);
// 5. Set up memory management
setup_bootmem();
paging_init();
}
Figure 9.4: Linux Boot Sequence
OpenSBI (M-mode)
|
| mret (a0=hartid, a1=DTB address)
v
_start (S-mode, arch/riscv/kernel/head.S)
|
+---> Setup early page tables
+---> Enable MMU (write satp, sfence.vma)
|
v
start_kernel() (init/main.c)
|
+---> parse_early_param()
+---> setup_arch() ← Parse device tree, setup memory
+---> mm_init() ← Memory management init
+---> sched_init() ← Scheduler init
+---> rest_init()
|
+---> kernel_init() → Init process (PID 1)
9.7 Comparison with ARM Trusted Firmware
RISC-V’s boot architecture is simpler than ARM’s, but serves similar purposes.
Boot Flow Comparison
ARM Trusted Firmware (TF-A) uses multiple boot stages:
- BL1: ROM code (EL3)
- BL2: Trusted boot firmware (S-EL1 by default)
- BL31: Runtime firmware (EL3)
- BL32: Secure OS (S-EL1, optional)
- BL33: Non-secure bootloader (EL2/EL1) → OS
RISC-V OpenSBI is simpler:
- ZSBL/FSBL: ROM code (M-mode)
- OpenSBI: Runtime firmware (M-mode)
- U-Boot: Bootloader (S-mode)
- OS: Linux/FreeBSD (S-mode)
Key differences:
| Feature | ARM TF-A | RISC-V OpenSBI |
|---|---|---|
| Privilege levels | EL0-EL3 (4 levels) | U/S/M (3 levels) |
| Secure world | TrustZone (S-EL0/1) | PMP-based isolation |
| Runtime firmware | BL31 (EL3) | OpenSBI (M-mode) |
| Hypervisor | EL2 (built-in) | H-extension (optional) |
| Boot stages | BL1→BL2→BL31→BL33 | ZSBL→FSBL→OpenSBI→U-Boot |
| Complexity | High (many stages) | Lower (fewer stages) |
M-mode vs EL3
Both M-mode and EL3 are the highest privilege levels, but differ in scope:
M-mode (RISC-V):
- Minimal, focused on essential services
- Delegates most exceptions/interrupts to S-mode
- Thin runtime layer (OpenSBI ~50 KB)
- No built-in secure world (use PMP)
EL3 (ARM):
- Rich feature set (TrustZone, secure monitor)
- Handles all secure world transitions
- Larger runtime (TF-A ~200 KB+)
- Built-in secure/non-secure separation
RISC-V’s philosophy: Keep M-mode minimal, push complexity to S-mode. ARM’s philosophy: Rich firmware layer with extensive security features.
🛠️ Hands-on Lab: Lab 9.2 — Survival in the Wilderness (Bare-metal Hello World)
This is the most “pure” programming experience of your career. We’ll strip away all OS protections and talk directly to hardware.
Lab Objectives
- Write a Linker Script to define memory layout
- Write Assembly startup code to set the Stack Pointer
- Use MMIO to directly control UART output
- Run on QEMU
Project Structure
Create a folder lab9 with three files:
lab9/
├── link.ld # Map: defines memory layout
├── entry.S # Startup key: set SP and jump to main
└── main.c # Logic brain: UART driver
Code
File 1: link.ld (Linker Script)
Tell the Linker: our program starts at RAM’s beginning (0x80000000).
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
SECTIONS
{
/* QEMU virt machine RAM starts at 0x80000000 */
. = 0x80000000;
/* Text section: put startup code first */
.text : {
*(.text.boot)
*(.text)
}
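/* Read-only data (string literals) */
.rodata : { *(.rodata*) }
/* Data section (including small data) */
.data : { *(.data*) *(.sdata*) }
/* Uninitialized data section (BSS) */
.bss : { *(.bss*) *(.sbss*) *(COMMON) }
/* Define Stack Top: reserve 4KB, keep sp 16-byte aligned per the ABI */
. = ALIGN(16);
. = . + 0x1000;
_stack_top = .;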
}
File 2: entry.S (Startup Code)
This is the first code the CPU executes.
.section .text.boot
.global _start
_start:
# 1. Disable interrupts (good practice, usually off at boot anyway)
csrw mie, zero
# 2. Set Stack Pointer (SP)
# C function calls depend on Stack—jumping into C without SP crashes
la sp, _stack_top
# 3. Jump to C main function
call main
# 4. If main returns (shouldn't happen), loop forever
loop:
j loop
File 3: main.c (UART Driver)
QEMU virt machine UART0 is fixed at 0x10000000.
#include <stdint.h>
// QEMU virt UART base address
#define UART0_BASE 0x10000000
// Define a pointer to this address
// volatile is crucial! Tells compiler not to optimize away reads/writes
volatile uint8_t *uart0 = (uint8_t *)(UART0_BASE);
void put_char(char c) {
// Write character directly to memory address
// UART controller transmits it
*uart0 = c;
}
void print_str(const char *s) {
while (*s) {
put_char(*s++);
}
}
int main(void) {
print_str("Hello from Bare-metal RISC-V!\n");
// Loop forever (no OS to return to)
while (1) {}
return 0;
}
Compile and Run
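# Compile (-mcmodel=medany is needed because the code is linked at 0x80000000,
# outside the range the default medlow code model can address on RV64)
riscv64-unknown-elf-gcc -mcmodel=medany -ffreestanding -nostdlib -nostartfiles \
-T link.ld -o bare_hello entry.S main.c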
# Run on QEMU (M-mode, no firmware)
qemu-system-riscv64 -machine virt -nographic -bios none -kernel bare_hello
Expected Output:
Hello from Bare-metal RISC-V!
(Press Ctrl+A, then X to exit QEMU)
What You Just Did
You’ve written a complete bare-metal program:
- Linker Script: Defined where code and stack live in memory
- Startup Code: Set SP and jumped to C—the essential bootstrap
- UART MMIO: Talked directly to hardware using memory-mapped I/O
danieRTOS Reference: The danieRTOS entry point follows the same pattern: entry.S sets up SP and calls kernel_main().
Deep Dive: QEMU Memory Map
Why 0x10000000 and 0x80000000? This is the QEMU virt machine’s Memory Map:
| Base Address | Purpose |
|---|---|
| 0x0000_1000 | BootROM (Reset Vector) |
| 0x1000_0000 | UART (we write characters here) |
| 0x8000_0000 | DRAM (our program runs here) |
This is the magic of MMIO (Memory Mapped I/O): to the CPU, it’s just writing to memory, but to the system, it’s controlling peripheral devices.
Extended Challenge
💭 Try making the program print “Booting…” then do a simple count loop before printing “Done!”
This will give you a taste of how “primitive” the world is without a sleep() function.
⚠️ Common Pitfalls
Pitfall 1: Forgetting the Stack Pointer
Error Scenario: Jumping directly from Assembly to C’s main() without setting sp.
Consequence: Program crashes as soon as it enters a C function, because C needs the Stack for local variables and return addresses.
# ❌ Wrong: Forgot to set SP
_start:
call main # Jump into C with garbage sp—certain death
# ✅ Correct: Set SP first
_start:
la sp, _stack_top
call main
Pitfall 2: Linker Script Section Order
Error Scenario: Not placing .text.boot at the beginning.
Consequence: CPU starts executing at 0x80000000, but that’s not _start—it’s some other function’s code, causing unpredictable behavior.
/* ❌ Wrong: .text.boot not first */
.text : {
*(.text) /* Other code comes in first */
*(.text.boot) /* _start pushed to later */
}
/* ✅ Correct: Ensure .text.boot is first */
.text : {
*(.text.boot) /* Startup code goes first */
*(.text)
}
Pitfall 3: Hardcoding UART Address
Error Scenario: Hardcoding 0x10000000 in your program, then running it on a different board.
Consequence: Different hardware has different memory maps—UART address may be completely different.
// ❌ Problem: Hardcoded address, not portable
#define UART0_BASE 0x10000000
// ✅ Better: Use Device Tree or header definitions
// Or use SBI (next chapter)
💡 Tip: This is why we need SBI (Supervisor Binary Interface)—it provides a standardized interface so you don’t have to worry about hardware differences. Next chapter, we’ll learn how to output characters through SBI instead of directly manipulating UART.
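Even without SBI, a driver becomes more portable when the base address is data rather than a compile-time constant. The sketch below passes the base in a small struct and also polls the line status register before writing, which real 16550-style hardware needs even though QEMU tolerates its absence. The `struct uart` type is hypothetical, and the offsets assume byte-wide, byte-spaced registers as on QEMU's virt machine; many SoCs space them 4 bytes apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16550-compatible register offsets (byte-spaced, as on QEMU virt). */
#define UART_THR      0      /* transmit holding register */
#define UART_LSR      5      /* line status register */
#define UART_LSR_THRE 0x20u  /* transmit holding register empty */

/* Hypothetical driver state: the base address is supplied by board
 * setup code or a device tree parser instead of a #define. */
struct uart {
    volatile uint8_t *base;
};

static void uart_putc(const struct uart *u, char c)
{
    assert(u != NULL && u->base != NULL);
    /* Wait until the transmitter can accept another byte. */
    while ((u->base[UART_LSR] & UART_LSR_THRE) == 0)
        ;
    u->base[UART_THR] = (uint8_t)c;
}
```

Because the base is just a pointer, the same function can be exercised on a host machine against an ordinary byte array that stands in for the device registers.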
Summary
The RISC-V boot process is a carefully orchestrated sequence:
- Reset: Hart starts in M-mode at reset vector
- ZSBL/FSBL: Initialize DRAM, load bootloader
- OpenSBI: Provide SBI runtime services
- U-Boot: Load kernel and device tree
- Linux: Parse device tree, initialize hardware, start init
Key takeaways:
- ✅ M-mode firmware is minimal and platform-specific
- ✅ OpenSBI provides standard SBI interface
- ✅ Device tree describes hardware to OS
- ✅ PMP protects M-mode firmware from S-mode
- ✅ Simpler than ARM’s multi-stage boot flow
In the next chapter, we’ll dive deeper into M-mode firmware design, SBI call interface, and the Hypervisor extension.
Chapter 10: Machine Mode, SBI & Supervisor Mode
Part VI — Booting & System Software
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand SBI Architecture: Grasp the layered design and value of the Supervisor Binary Interface
- Master SBI Calling Convention: Know the roles of a7 (EID), a6 (FID), a0-a5 (Args)
- Implement SBI Calls: Make ecall requests to OpenSBI services
- Understand Exception Delegation: Know how M-mode delegates traps to S-mode
- Distinguish M-mode and S-mode Responsibilities: Understand why RISC-V encourages thin M-mode
💡 Scenario: Please Get the Manager
Scene: Junior is pulling his hair at the screen—the UART driver from the last chapter doesn’t work on the new board.
Junior: “Senior, I’m going crazy. Remember the UART driver we wrote last chapter? I just switched to a different board to try it out, and after digging through the datasheet, I found this board’s UART address is 0x54000000, not 0x10000000. Do I have to modify the code every time I switch boards?”
Senior: “That’s exactly why we need SBI (Supervisor Binary Interface). What you’re doing now is like going to a restaurant and rushing into the kitchen to cook your own food. Change restaurants (hardware), the kitchen layout is different, and you don’t know how to cook anymore.”
Junior: “So what should I do?”
Senior: “You need to learn to ‘call the manager.’ In RISC-V, M-mode (OpenSBI) is that manager.
You (S-mode Kernel) just need to sit at your table and use the standard format (SBI Call) to shout: ‘Manager, please print a character for me!’
The manager receives your request, looks up this restaurant’s kitchen layout, and prints the character for you. This way, no matter which restaurant you go to, as long as you know how to call the manager, you’re fine.”
Junior: “Sounds much easier! How do I call?”
Senior: “Use the ecall instruction. But before calling, you need to write your request on specific ‘sticky notes’ (Registers):
| Register | Purpose | Analogy |
|---|---|---|
| a7 | Extension ID (EID) | Which department? |
| a6 | Function ID (FID) | What service? |
| a0-a5 | Arguments | Service parameters |
| a0, a1 | Return Values | Manager’s reply |
Come on, let’s try it.”
Machine mode is RISC-V’s highest privilege level, with unrestricted access to all hardware resources. But with great power comes great responsibility—M-mode firmware must be minimal, robust, and provide essential services to supervisor mode software. This chapter explores how to design M-mode firmware, implement SBI services, and support advanced features like virtualization and security.
Unlike monolithic firmware architectures, RISC-V encourages a thin M-mode layer that delegates most functionality to S-mode. This design philosophy keeps M-mode simple and portable while allowing rich OS features in S-mode. We’ll examine M-mode firmware design patterns, the Supervisor Binary Interface (SBI) specification, hypervisor support through the H extension, and security features like Physical Memory Protection (PMP) and the WorldGuard extension.
10.1 Machine Mode Firmware Design
Minimal M-mode Firmware
The RISC-V philosophy is to keep M-mode firmware as small as possible. A minimal M-mode firmware might be only a few kilobytes, providing just enough functionality to boot S-mode software.
Minimal M-mode responsibilities:
- Early hardware initialization: DRAM, clocks, reset
- Platform-specific setup: Configure peripherals
- SBI runtime services: Timer, IPI, RFENCE, console
- Exception delegation: Pass most traps to S-mode
- Boot S-mode software: Set up and jump to OS
Example minimal M-mode firmware structure:
// Minimal M-mode firmware
void m_mode_main(unsigned long hartid, void *fdt) {
// 1. Initialize platform
platform_init();
// 2. Set up trap handler
write_csr(mtvec, (uintptr_t)&m_trap_entry);
// 3. Configure PMP
setup_pmp();
// 4. Delegate exceptions and interrupts
write_csr(medeleg, 0xb1ff); // Delegate most exceptions
write_csr(mideleg, 0x0222); // Delegate S-mode interrupts
// 5. Start other harts (if multi-core)
if (hartid == 0) {
for (int i = 1; i < num_harts; i++) {
start_hart(i);
}
}
// 6. Boot S-mode payload
boot_next_stage(hartid, fdt);
}
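The constants 0xb1ff and 0x0222 are bit masks indexed by cause number, so they can be decoded mechanically. The sketch below uses cause numbers from the privileged specification. `is_delegated` is a hypothetical helper for inspecting a mask on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Exception cause numbers from the RISC-V privileged specification. */
enum {
    CAUSE_MISALIGNED_FETCH = 0,
    CAUSE_ILLEGAL_INSN     = 2,
    CAUSE_BREAKPOINT       = 3,
    CAUSE_USER_ECALL       = 8,
    CAUSE_SUPERVISOR_ECALL = 9,
    CAUSE_FETCH_PAGE_FAULT = 12,
    CAUSE_LOAD_PAGE_FAULT  = 13,
    CAUSE_STORE_PAGE_FAULT = 15
};

/* True if the delegation mask routes the given cause to S-mode.
 * Works for medeleg (exception causes) and mideleg (interrupt causes). */
static int is_delegated(uint64_t deleg, unsigned cause)
{
    assert(cause < 64);
    return (int)((deleg >> cause) & 1u);
}
```

With 0xb1ff, an ecall from U-mode (cause 8) goes straight to the S-mode kernel, while an ecall from S-mode (cause 9) is not delegated and therefore reaches M-mode, which is what makes SBI calls work. In 0x0222, bits 1, 5, and 9 are the supervisor software, timer, and external interrupts.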
Platform-Specific Initialization
Each RISC-V platform has unique hardware that must be initialized. M-mode firmware abstracts these details through a platform layer.
Platform initialization example:
// Platform-specific initialization
struct platform_ops {
int (*early_init)(void);
int (*final_init)(void);
void (*console_putc)(char c);
int (*console_getc)(void);
void (*timer_init)(void);
void (*ipi_send)(unsigned long hartid);
void (*system_reset)(void);
};
// SiFive FU740 platform
static struct platform_ops fu740_ops = {
.early_init = fu740_early_init,
.final_init = fu740_final_init,
.console_putc = uart_putc,
.console_getc = uart_getc,
.timer_init = clint_timer_init,
.ipi_send = clint_ipi_send,
.system_reset = fu740_system_reset,
};
int platform_init(void) {
struct platform_ops *ops = &fu740_ops;
// Early initialization
if (ops->early_init)
ops->early_init();
// Initialize console
if (ops->console_putc)
sbi_console_init(ops->console_putc, ops->console_getc);
// Initialize timer
if (ops->timer_init)
ops->timer_init();
// Final initialization
if (ops->final_init)
ops->final_init();
return 0;
}
Runtime Services
M-mode firmware provides runtime services to S-mode through the SBI interface. These services remain active after S-mode boots.
Core runtime services:
- Timer: Set timer interrupts (sbi_set_timer)
- IPI: Send inter-processor interrupts (sbi_send_ipi)
- RFENCE: Remote fence operations (sbi_remote_fence_i, sbi_remote_sfence_vma)
- Console: Debug output (sbi_console_putchar, sbi_console_getchar)
- Hart management: Start/stop harts (sbi_hart_start, sbi_hart_stop)
- System reset: Reboot/shutdown (sbi_system_reset)
10.2 SBI Call Interface
SBI Call Mechanism
S-mode software invokes SBI services using the ecall instruction. This traps to M-mode, which handles the request and returns to S-mode.
Figure 10.1: SBI Call Flow
sequenceDiagram
participant S as S-mode<br/>(Linux)
participant M as M-mode<br/>(OpenSBI)
participant HW as Hardware<br/>(CLINT/PLIC)
S->>S: Prepare SBI call<br/>(a7=EID, a6=FID, a0-a5=params)
S->>M: ecall
Note over M: Trap to M-mode<br/>mcause = 9 (ecall from S-mode)
M->>M: Decode EID/FID
M->>HW: Perform operation<br/>(e.g., set timer)
HW-->>M: Operation complete
M->>M: Set return value (a0)
M->>S: mret
Note over S: Resume S-mode<br/>Check a0 for result
SBI Calling Convention
SBI calls use a standard register convention:
Input registers:
- a7: Extension ID (EID) — Identifies the SBI extension
- a6: Function ID (FID) — Identifies the function within the extension
- a0-a5: Function parameters (up to 6 parameters)
Output registers:
- a0: Error code (0 = success, negative = error)
- a1: Return value (optional, function-specific)
Preserved registers: All registers except a0-a1 are preserved across SBI calls.
Example: Send IPI
// S-mode code: Send IPI to hart 1
static inline long sbi_send_ipi(unsigned long hart_mask,
unsigned long hart_mask_base) {
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = hart_mask_base;
register unsigned long a6 asm("a6") = SBI_EXT_IPI_SEND_IPI;
register unsigned long a7 asm("a7") = SBI_EXT_IPI;
asm volatile("ecall"
: "+r"(a0), "+r"(a1)
: "r"(a6), "r"(a7)
: "memory");
return a0; // Return error code
}
// Usage
sbi_send_ipi(1 << 1, 0); // Send IPI to hart 1
SBI Error Codes
SBI functions return standard error codes:
#define SBI_SUCCESS 0
#define SBI_ERR_FAILED -1
#define SBI_ERR_NOT_SUPPORTED -2
#define SBI_ERR_INVALID_PARAM -3
#define SBI_ERR_DENIED -4
#define SBI_ERR_INVALID_ADDRESS -5
#define SBI_ERR_ALREADY_AVAILABLE -6
#define SBI_ERR_ALREADY_STARTED -7
#define SBI_ERR_ALREADY_STOPPED -8
Error handling:
long ret = sbi_send_ipi(hart_mask, 0);
if (ret < 0) {
switch (ret) {
case SBI_ERR_INVALID_PARAM:
pr_err("Invalid hart mask\n");
break;
case SBI_ERR_FAILED:
pr_err("IPI send failed\n");
break;
default:
pr_err("Unknown error: %ld\n", ret);
}
}
10.3 SBI Standard Extensions
SBI defines multiple extensions, each providing related functionality. Extensions are identified by EID (Extension ID).
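The EIDs used in this section are not arbitrary numbers: each is the ASCII encoding of a short name, with the first character in the most significant byte. The Base extension (0x10) and the legacy extensions are exceptions to this pattern. A small sketch that reproduces the values, with `sbi_eid_from_name` as a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Pack up to four ASCII characters into an extension ID, first
 * character in the most significant position ("TIME" -> 0x54494D45). */
static uint32_t sbi_eid_from_name(const char *name)
{
    assert(name != 0);
    uint32_t eid = 0;
    int n = 0;
    while (name[n] != '\0') {
        assert(n < 4);   /* more than four characters would not fit */
        eid = (eid << 8) | (uint8_t)name[n];
        n++;
    }
    return eid;
}
```

Seeing 0x52464E43 in a trap log and reading it as "RFNC" is often the quickest way to identify which extension a failing call targeted.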
Timer Extension (EID = 0x54494D45)
The Timer extension provides timer interrupt services.
Function: sbi_set_timer (FID = 0)
Sets the timer to fire at a specific time value.
// Set timer to fire in 1 second
uint64_t current_time = rdtime();
uint64_t next_time = current_time + 10000000; // 10 MHz clock
register unsigned long a0 asm("a0") = next_time;
register unsigned long a6 asm("a6") = 0; // FID_SET_TIMER
register unsigned long a7 asm("a7") = 0x54494D45; // EID_TIME
asm volatile("ecall" : "+r"(a0) : "r"(a6), "r"(a7) : "a1", "memory"); // SBI also writes a1
M-mode implementation:
void sbi_set_timer(uint64_t stime_value) {
unsigned long hartid = current_hartid();
// Write to CLINT mtimecmp register
volatile uint64_t *mtimecmp = (uint64_t *)(CLINT_BASE + 0x4000 + hartid * 8);
*mtimecmp = stime_value;
// Clear pending timer interrupt
csr_clear(CSR_MIP, MIP_STIP);
}
IPI Extension (EID = 0x735049)
The IPI extension sends inter-processor interrupts.
Function: sbi_send_ipi (FID = 0)
Sends IPI to a set of harts specified by a hart mask.
// Send IPI to harts 1, 2, 3
unsigned long hart_mask = 0b1110; // Bits 1, 2, 3 set
sbi_send_ipi(hart_mask, 0);
M-mode implementation:
int sbi_send_ipi(unsigned long hart_mask, unsigned long hart_mask_base) {
for (int i = 0; i < 64; i++) {
if (hart_mask & (1UL << i)) {
unsigned long hartid = hart_mask_base + i;
// Write to CLINT MSIP register
volatile uint32_t *msip = (uint32_t *)(CLINT_BASE + hartid * 4);
*msip = 1;
}
}
return SBI_SUCCESS;
}
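The `hart_mask_base` parameter exists because a single mask can describe only XLEN consecutive harts. A caller that targets hart IDs above 63 on RV64 must choose a base and express the IDs relative to it. The sketch below builds such a mask from a list of hart IDs. `build_hart_mask` is a hypothetical helper, and the window width follows the host's `unsigned long` so the code can be tried off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <limits.h>

#define HART_MASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Build the hart_mask argument for a list of hart IDs, relative to
 * hart_mask_base. Returns 0 on success, or -1 if an ID falls outside
 * the window [base, base + XLEN) that a single call can address. */
static int build_hart_mask(const unsigned long *ids, size_t n,
                           unsigned long base, unsigned long *mask_out)
{
    assert(ids != NULL && mask_out != NULL);
    unsigned long mask = 0;
    for (size_t i = 0; i < n; i++) {
        if (ids[i] < base || ids[i] - base >= HART_MASK_BITS)
            return -1;
        mask |= 1UL << (ids[i] - base);
    }
    *mask_out = mask;
    return 0;
}
```

When the target harts span more than one window, the caller issues one SBI call per window, each with its own base.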
RFENCE Extension (EID = 0x52464E43)
The RFENCE extension performs remote fence operations (TLB flush, I-cache flush) on other harts.
Functions:
- sbi_remote_fence_i (FID = 0): Flush instruction cache
- sbi_remote_sfence_vma (FID = 1): Flush TLB entries
- sbi_remote_sfence_vma_asid (FID = 2): Flush TLB entries for specific ASID
Example: Remote TLB flush
// Flush TLB on harts 1-3 for address range 0x80000000-0x80001000
unsigned long hart_mask = 0b1110;
unsigned long start_addr = 0x80000000;
unsigned long size = 0x1000;
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = 0; // hart_mask_base
register unsigned long a2 asm("a2") = start_addr;
register unsigned long a3 asm("a3") = size;
register unsigned long a6 asm("a6") = 1; // FID_REMOTE_SFENCE_VMA
register unsigned long a7 asm("a7") = 0x52464E43; // EID_RFENCE
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a2), "r"(a3), "r"(a6), "r"(a7) : "memory");
M-mode implementation:
int sbi_remote_sfence_vma(unsigned long hart_mask, unsigned long hart_mask_base,
unsigned long start_addr, unsigned long size) {
// Send IPI to target harts
for (int i = 0; i < 64; i++) {
if (hart_mask & (1UL << i)) {
unsigned long hartid = hart_mask_base + i;
// Store fence parameters for target hart
remote_fence_info[hartid].start = start_addr;
remote_fence_info[hartid].size = size;
remote_fence_info[hartid].type = FENCE_SFENCE_VMA;
// Send IPI
clint_send_ipi(hartid);
}
}
// Wait for completion (optional, depends on implementation)
return SBI_SUCCESS;
}
// IPI handler on target hart
void handle_remote_fence_ipi(void) {
struct remote_fence_info *info = &remote_fence_info[current_hartid()];
if (info->type == FENCE_SFENCE_VMA) {
// Perform sfence.vma
if (info->size == 0) {
asm volatile("sfence.vma" ::: "memory");
} else {
// Flush specific range (implementation-specific)
for (unsigned long addr = info->start;
addr < info->start + info->size;
addr += PAGE_SIZE) {
asm volatile("sfence.vma %0" :: "r"(addr) : "memory");
}
}
}
}
HSM Extension (EID = 0x48534D)
The Hart State Management (HSM) extension controls hart lifecycle.
Functions:
- sbi_hart_start (FID = 0): Start a hart
- sbi_hart_stop (FID = 1): Stop current hart
- sbi_hart_get_status (FID = 2): Get hart status
Example: Start a hart
// Start hart 1 at address 0x80200000 with argument 0x12345678
unsigned long hartid = 1;
unsigned long start_addr = 0x80200000;
unsigned long opaque = 0x12345678;
register unsigned long a0 asm("a0") = hartid;
register unsigned long a1 asm("a1") = start_addr;
register unsigned long a2 asm("a2") = opaque;
register unsigned long a6 asm("a6") = 0; // FID_HART_START
register unsigned long a7 asm("a7") = 0x48534D; // EID_HSM
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a2), "r"(a6), "r"(a7) : "memory");
M-mode implementation:
int sbi_hart_start(unsigned long hartid, unsigned long start_addr, unsigned long opaque) {
if (hartid >= num_harts)
return SBI_ERR_INVALID_PARAM;
if (hart_state[hartid] != HART_STOPPED)
return SBI_ERR_ALREADY_AVAILABLE; // Per the SBI spec: hart is already started
// Set up hart entry point
hart_entry_addr[hartid] = start_addr;
hart_entry_arg[hartid] = opaque;
// Wake up hart (platform-specific)
platform_hart_start(hartid);
hart_state[hartid] = HART_STARTED;
return SBI_SUCCESS;
}
System Reset Extension (EID = 0x53525354)
The System Reset extension provides system-wide reset and shutdown.
Function: sbi_system_reset (FID = 0)
// Reboot the system
#define SBI_RESET_TYPE_SHUTDOWN 0
#define SBI_RESET_TYPE_COLD_REBOOT 1
#define SBI_RESET_TYPE_WARM_REBOOT 2
register unsigned long a0 asm("a0") = SBI_RESET_TYPE_COLD_REBOOT;
register unsigned long a1 asm("a1") = 0; // Reset reason
register unsigned long a6 asm("a6") = 0; // FID_SYSTEM_RESET
register unsigned long a7 asm("a7") = 0x53525354; // EID_SRST
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
// This call does not return
Figure 10.2: SBI Extensions Overview
graph TB
subgraph "SBI Extensions"
BASE[Base Extension<br/>0x10<br/>Version, Features]
TIME[Timer Extension<br/>0x54494D45<br/>Set Timer]
IPI[IPI Extension<br/>0x735049<br/>Send IPI]
RFENCE[RFENCE Extension<br/>0x52464E43<br/>Remote Fence]
HSM[HSM Extension<br/>0x48534D<br/>Hart Management]
SRST[System Reset<br/>0x53525354<br/>Reset/Shutdown]
PMU[PMU Extension<br/>0x504D55<br/>Performance Counters]
end
SMODE[S-mode Software<br/>Linux, FreeBSD] -->|ecall| BASE
SMODE -->|ecall| TIME
SMODE -->|ecall| IPI
SMODE -->|ecall| RFENCE
SMODE -->|ecall| HSM
SMODE -->|ecall| SRST
SMODE -->|ecall| PMU
style SMODE fill:#e1f5ff
style TIME fill:#ccffcc
style IPI fill:#ffffcc
style RFENCE fill:#ffcccc
10.4 Console and Debug Output
Console I/O via SBI
SBI provides simple console I/O for early debugging before full UART drivers are available.
Legacy console functions (deprecated but widely used):
- sbi_console_putchar (EID = 0x01): Output one character
- sbi_console_getchar (EID = 0x02): Input one character
Example: Early printk
void sbi_putchar(char c) {
register unsigned long a0 asm("a0") = c;
register unsigned long a7 asm("a7") = 0x01; // Legacy console putchar
asm volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
}
void early_printk(const char *str) {
while (*str) {
if (*str == '\n')
sbi_putchar('\r');
sbi_putchar(*str++);
}
}
// Usage
early_printk("Hello from S-mode!\n");
M-mode implementation:
void sbi_console_putchar(int ch) {
// Platform-specific UART output
uart_putc(ch);
}
int sbi_console_getchar(void) {
// Platform-specific UART input
return uart_getc();
}
Modern approach: Use the Debug Console Extension (DBCN, EID = 0x4442434E), which replaces the legacy calls with buffer-based write and read functions plus a single-byte write.
10.5 Hypervisor Extension (H Extension)
Virtualization Support in RISC-V
The Hypervisor extension (H) adds virtualization support to RISC-V, enabling a hypervisor to run multiple guest operating systems. Unlike ARM’s built-in EL2, RISC-V virtualization is an optional extension.
Key features:
- VS-mode and VU-mode: Virtualized supervisor and user modes
- Two-stage address translation: Guest virtual → Guest physical → Host physical
- Virtual interrupts: Virtualized interrupt delivery
- Hypervisor CSRs: Control virtualization features
Privilege modes with H extension:
- M-mode: Machine mode (firmware)
- HS-mode: Hypervisor-extended supervisor mode (hypervisor)
- VS-mode: Virtual supervisor mode (guest OS)
- U-mode: User mode (applications)
- VU-mode: Virtual user mode (guest applications)
Figure 10.3: RISC-V Virtualization Architecture
graph TB
subgraph "M-mode"
OPENSBI[OpenSBI<br/>Firmware]
end
subgraph "HS-mode (Hypervisor)"
KVM[KVM/Xen<br/>Hypervisor]
end
subgraph "VS-mode (Guest OS)"
GUEST1[Guest Linux 1]
GUEST2[Guest Linux 2]
end
subgraph "VU-mode (Guest Apps)"
APP1[App 1]
APP2[App 2]
APP3[App 3]
end
OPENSBI -->|SBI calls| KVM
KVM -->|VM Entry/Exit| GUEST1
KVM -->|VM Entry/Exit| GUEST2
GUEST1 --> APP1
GUEST1 --> APP2
GUEST2 --> APP3
style OPENSBI fill:#ffe1e1
style KVM fill:#fff4e1
style GUEST1 fill:#e1ffe1
style GUEST2 fill:#e1ffe1
style APP1 fill:#e1f5ff
style APP2 fill:#e1f5ff
style APP3 fill:#e1f5ff
Two-Stage Address Translation
With the H extension, address translation happens in two stages:
- First stage (VS-stage): Guest virtual address (GVA) → Guest physical address (GPA)
  - Controlled by guest OS (vsatp CSR)
  - Guest thinks it’s managing physical memory
- Second stage (G-stage): Guest physical address (GPA) → Host physical address (HPA)
  - Controlled by hypervisor (hgatp CSR)
  - Translates guest “physical” addresses to real physical addresses
Figure 10.4: Two-Stage Address Translation
graph LR
GVA[Guest Virtual<br/>Address<br/>GVA] -->|VS-stage<br/>vsatp| GPA[Guest Physical<br/>Address<br/>GPA]
GPA -->|G-stage<br/>hgatp| HPA[Host Physical<br/>Address<br/>HPA]
HPA --> MEM[Physical<br/>Memory]
style GVA fill:#e1f5ff
style GPA fill:#fff4e1
style HPA fill:#e1ffe1
style MEM fill:#ffcccc
Example:
- Guest OS maps virtual address 0x1000 to guest physical address 0x80001000 (using vsatp)
- Hypervisor maps guest physical 0x80001000 to host physical 0x90001000 (using hgatp)
- Final access: 0x1000 (GVA) → 0x80001000 (GPA) → 0x90001000 (HPA)
Virtual Interrupt Handling
The H extension virtualizes interrupts, allowing the hypervisor to inject interrupts into guest VMs.
Hypervisor interrupt CSRs:
- hvip: Hypervisor virtual interrupt pending
- hie: Hypervisor interrupt enable
- hgeip: Hypervisor guest external interrupt pending
Injecting a virtual interrupt:
// Hypervisor code: Inject timer interrupt into guest
void inject_guest_timer_interrupt(void) {
// Set virtual supervisor timer interrupt pending
csr_set(CSR_HVIP, HVIP_VSTIP);
// When guest resumes, it will see a timer interrupt
}
Guest handling:
// Guest OS sees the interrupt as a normal S-mode interrupt
void guest_timer_handler(void) {
// Handle timer interrupt
// Guest doesn't know it's virtualized
}
VM Entry and Exit
The hypervisor switches between HS-mode and VS-mode using the sret instruction, traps, and CSR manipulation.
VM Entry (HS-mode → VS-mode):
void vm_enter(struct vcpu *vcpu) {
// 1. Load guest state
write_csr(CSR_VSSTATUS, vcpu->vsstatus);
write_csr(CSR_VSIE, vcpu->vsie);
write_csr(CSR_VSTVEC, vcpu->vstvec);
write_csr(CSR_VSSCRATCH, vcpu->vsscratch);
write_csr(CSR_VSEPC, vcpu->vsepc);
write_csr(CSR_VSCAUSE, vcpu->vscause);
write_csr(CSR_VSTVAL, vcpu->vstval);
write_csr(CSR_VSATP, vcpu->vsatp);
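// 2. Set hstatus.SPV = 1 so that sret returns with virtualization on (V = 1)
csr_set(CSR_HSTATUS, HSTATUS_SPV);
// Set sstatus.SPP = 1 so that sret lands in VS-mode rather than VU-mode
csr_set(CSR_SSTATUS, SSTATUS_SPP);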
// 3. Set sepc to guest entry point
write_csr(CSR_SEPC, vcpu->pc);
// 4. Enter VS-mode
asm volatile("sret"); // Return to VS-mode
}
VM Exit (VS-mode → HS-mode):
When the guest executes certain instructions (ecall, WFI, privileged CSR access) or takes a trap, control returns to the hypervisor.
void vm_exit_handler(struct vcpu *vcpu) {
// Save guest state
vcpu->vsstatus = read_csr(CSR_VSSTATUS);
vcpu->vsepc = read_csr(CSR_VSEPC);
vcpu->vscause = read_csr(CSR_VSCAUSE);
vcpu->vstval = read_csr(CSR_VSTVAL);
vcpu->pc = read_csr(CSR_SEPC);
// Handle exit reason
unsigned long cause = read_csr(CSR_SCAUSE);
switch (cause) {
case CAUSE_VIRTUAL_SUPERVISOR_ECALL:
// Guest made hypercall
handle_hypercall(vcpu);
break;
case CAUSE_GUEST_PAGE_FAULT:
// Guest page fault (the hgatp-controlled GPA → HPA translation failed)
handle_guest_page_fault(vcpu);
break;
case CAUSE_VIRTUAL_INSTRUCTION:
// Guest tried to execute privileged instruction
emulate_instruction(vcpu);
break;
default:
// Other traps
inject_exception_to_guest(vcpu, cause);
break;
}
}
10.6 Security Model
Physical Memory Protection (PMP)
PMP is RISC-V’s primary memory protection mechanism. M-mode configures it, and the hardware then checks every S-mode and U-mode access (and any M-mode access that hits a locked entry) against the configured regions and permissions.
PMP use cases:
- Protect M-mode firmware from S-mode
- Isolate security-critical regions
- Enforce memory access policies
- Implement basic TEE (Trusted Execution Environment)
PMP configuration registers:
- pmpcfg0-pmpcfg15: Configuration for PMP entries, one byte per entry (4 entries per register on RV32; 8 per register on RV64, where only the even-numbered registers exist)
- pmpaddr0-pmpaddr63: Address registers (up to 64 entries)
PMP entry format (pmpcfg):
Bits [7:0] for each entry:
[7]: L (Lock) - Entry cannot be modified until reset
[6:5]: Reserved
[4:3]: A (Address matching mode)
00 = OFF, 01 = TOR, 10 = NA4, 11 = NAPOT
[2]: X (Execute permission)
[1]: W (Write permission)
[0]: R (Read permission)
Example: Protect M-mode firmware
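void protect_m_mode_firmware(void) {
// Protect 0x80000000 - 0x80100000 (1 MB M-mode firmware)
// Use TOR (Top-Of-Range) mode: entry 1 matches pmpaddr0 <= addr < pmpaddr1
// pmpaddr registers hold the address shifted right by 2
// Entry 0: Start address (0x80000000)
write_csr(pmpaddr0, 0x80000000UL >> 2);
// Entry 1: End address (0x80100000)
write_csr(pmpaddr1, 0x80100000UL >> 2);
// Configure entry 1: TOR mode, no R/W/X permissions, L clear.
// S/U-mode accesses to the range fault; M-mode ignores unlocked entries.
// (Setting L here would deny M-mode access to its own firmware as well.)
unsigned long cfg = PMP_TOR;
write_csr(pmpcfg0, cfg << 8); // Entry 1 config lives in bits [15:8]
// Now S-mode cannot access 0x80000000 - 0x80100000.
// A later entry must still grant S-mode access to the rest of memory.
}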
Enhanced PMP (ePMP)
ePMP extends PMP with additional security features:
- Rule locking: Prevent modification of PMP entries
- M-mode lockdown: Restrict M-mode access to specific regions
- Whitelist mode: Default deny, explicit allow
ePMP adds mseccfg CSR:
Bits:
[2]: RLB (Rule Locking Bypass) - Allow M-mode to modify locked entries
[1]: MMWP (Machine Mode Whitelist Policy) - Enforce whitelist for M-mode
[0]: MML (Machine Mode Lockdown) - Restrict M-mode access
Example: M-mode lockdown
void enable_m_mode_lockdown(void) {
// Set MML bit (sticky until reset): L=1 entries become M-mode-only rules,
// L=0 entries apply only to S/U-mode
write_csr(CSR_MSECCFG, MSECCFG_MML);
// Now M-mode can no longer execute from memory without a matching M-mode rule
}
Comparison with ARM TrustZone
RISC-V PMP vs ARM TrustZone:
| Feature | RISC-V PMP | ARM TrustZone |
|---|---|---|
| Isolation | Region-based (up to 64 regions) | World-based (Secure/Non-secure) |
| Granularity | 4 bytes to 2^64 bytes | 4 KB minimum |
| Privilege | M-mode enforced | EL3 enforced |
| Secure world | No built-in secure world | Dedicated S-EL0/S-EL1 |
| Complexity | Simple, flexible | Complex, rich features |
| Use case | Firmware protection, basic TEE | Full TEE, secure boot, DRM |
RISC-V security philosophy: Provide minimal hardware mechanisms (PMP), build rich security features in software (TEE frameworks like Keystone, Penglai).
ARM TrustZone philosophy: Provide rich hardware support for secure world, standardize TEE architecture.
🛠️ Hands-on Lab: Lab 10.1 — Saying Hello Through the Counter (SBI Call)
This lab demonstrates the standard SBI call flow. We’ll use OpenSBI (bundled with QEMU) and place our kernel at 0x80200000 (the default payload address OpenSBI jumps to).
Lab Objectives
- Wrap an sbi_call Assembly function
- Call Legacy Extension (EID=1) to output characters
- Call Base Extension (EID=0x10) to query SBI version
Project Structure
lab10/
├── link.ld # Memory layout (note the different start address)
├── sbi.S # SBI Wrapper and entry point
└── kernel.c # Main program
Code
File 1: link.ld
Note: When using QEMU -bios default (OpenSBI), it loads itself at 0x80000000 and jumps to 0x80200000 to execute our kernel.
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
SECTIONS
{
/* OpenSBI default jump address */
. = 0x80200000;
.text : {
*(.text.boot)
*(.text)
}
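.rodata : { *(.rodata*) }
.data : { *(.data) }
.bss : { *(.bss) }
/* Keep the stack 16-byte aligned, as the RISC-V calling convention expects */
. = ALIGN(16);
. = . + 0x1000;
_stack_top = .;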
}
File 2: sbi.S (SBI Wrapper)
.section .text.boot
.global _start
.global sbi_call
# Program entry
_start:
la sp, _stack_top
call kernel_main
loop:
j loop
# sbi_call(ext, fid, arg0, arg1, arg2)
# C calling: a0=ext, a1=fid, a2=arg0, a3=arg1, a4=arg2
sbi_call:
mv a7, a0 # ext -> a7 (EID)
mv a6, a1 # fid -> a6 (FID)
mv a0, a2 # arg0 -> a0
mv a1, a3 # arg1 -> a1
mv a2, a4 # arg2 -> a2
ecall # Trigger Environment Call (trap to M-mode)
ret # Return with a0 = error code (or legacy return value), a1 = value
File 3: kernel.c
#include <stdint.h>
// SBI Extension IDs
#define SBI_EID_CONSOLE_PUTCHAR 0x01 // Legacy Console
#define SBI_EID_BASE 0x10 // Base Extension
// Base Extension Function IDs
#define SBI_FID_GET_SPEC_VERSION 0
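// SBI calls return a pair in a0/a1: an error code and a value.
// The wrapper leaves both registers untouched after ecall, so under the
// RISC-V calling convention it can be declared as returning this struct.
struct sbiret {
long error;
long value;
};
// External Assembly function
struct sbiret sbi_call(long ext, long fid, long arg0, long arg1, long arg2);
// Character output (via SBI)
void putchar(char c) {
sbi_call(SBI_EID_CONSOLE_PUTCHAR, 0, c, 0, 0);
}
void print_str(const char *s) {
while (*s) {
putchar(*s++);
}
}
// Query SBI version
void print_sbi_version(void) {
// The version comes back in a1 (value); a0 only carries the error code
long version = sbi_call(SBI_EID_BASE, SBI_FID_GET_SPEC_VERSION, 0, 0, 0).value;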
long major = (version >> 24) & 0x7f;
long minor = version & 0xffffff;
print_str("SBI Spec Version: ");
putchar('0' + major);
putchar('.');
putchar('0' + minor);
putchar('\n');
}
void kernel_main(void) {
print_str("Hello from S-mode via SBI!\n");
print_sbi_version();
while (1) {}
}
Compile and Run
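# Compile (-mcmodel=medany lets RV64 code address symbols linked at 0x80200000)
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -ffreestanding -mcmodel=medany -T link.ld \
-o kernel sbi.S kernel.c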
# Run (QEMU with OpenSBI)
qemu-system-riscv64 -machine virt -nographic -bios default -kernel kernel
Expected Output:
OpenSBI v1.2
...
Hello from S-mode via SBI!
SBI Spec Version: 1.0
Comparison: Lab 9 vs Lab 10
| Item | Lab 9 (Bare-metal) | Lab 10 (SBI) |
|---|---|---|
| Start Address | 0x80000000 | 0x80200000 |
| UART Access | Direct MMIO write | Via SBI ecall |
| Portability | ❌ Hardware-dependent | ✅ Cross-platform |
| Complexity | Simple but fragile | Requires SBI spec knowledge |
danieRTOS Reference: The danieRTOS console uses SBI calls for portable output across different RISC-V platforms.
⚠️ Common Pitfalls
Pitfall 1: Confusing EID and FID
Error Scenario: Putting Extension ID in a6 and Function ID in a7.
Consequence: Calls wrong service, may crash or hang the system.
# ❌ Wrong: EID and FID positions swapped
li a6, 0x10 # This should be EID, but it's in a6
li a7, 0 # This should be FID, but it's in a7
ecall
# ✅ Correct
li a7, 0x10 # a7 = EID (Extension ID)
li a6, 0 # a6 = FID (Function ID)
ecall
Pitfall 2: Forgetting OpenSBI Occupies 0x80000000
Error Scenario: When using OpenSBI, still placing kernel at 0x80000000.
Consequence: Kernel overwrites OpenSBI’s memory, causing SBI calls to fail.
/* ❌ Wrong: Conflicts with OpenSBI */
. = 0x80000000;
/* ✅ Correct: Use OpenSBI's default payload address */
. = 0x80200000;
Pitfall 3: Misusing Legacy Extensions
Error Scenario: Using deprecated Legacy Extensions that newer OpenSBI may not support.
Consequence: SBI call returns -2 (SBI_ERR_NOT_SUPPORTED).
// ⚠️ Legacy Extensions (EID 0-8) are marked deprecated
// Newer OpenSBI may not support them
sbi_call(0x01, 0, 'A', 0, 0); // console_putchar
// ✅ Recommended: Use newer Extensions
// Debug Console Extension (EID = 0x4442434E)
#define SBI_EID_DBCN 0x4442434E
#define SBI_FID_DBCN_WRITE_BYTE 2
sbi_call(SBI_EID_DBCN, SBI_FID_DBCN_WRITE_BYTE, 'A', 0, 0);
💡 Tip: In production, check if an SBI Extension is available:
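// Use Base Extension's probe_extension (EID = 0x10, FID = 3)
register unsigned long a0 asm("a0") = target_eid;
register unsigned long a1 asm("a1");
register unsigned long a6 asm("a6") = 3; // SBI_FID_PROBE_EXTENSION
register unsigned long a7 asm("a7") = 0x10; // SBI_EID_BASE
asm volatile("ecall" : "+r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
// a0 = error code; a1 = probe result, non-zero means the Extension is available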
Summary
Machine mode and SBI provide the foundation for RISC-V system software. This chapter covered five key areas:
M-mode firmware is designed to be minimal and platform-specific. It initializes hardware, sets up memory protection, and provides runtime services through the SBI interface. Unlike ARM’s extensive EL3 firmware, RISC-V keeps M-mode simple and delegates most functionality to S-mode.
SBI interface provides a standard ecall-based interface between M-mode firmware and S-mode operating systems. The calling convention passes arguments in a0-a5, the Function ID in a6, and the Extension ID in a7, and returns an error code in a0 and a value in a1. This standardization ensures that OS kernels can run on any RISC-V platform without modification.
SBI extensions cover essential system services: Timer extension for scheduling, IPI extension for inter-processor communication, RFENCE extension for TLB synchronization, HSM extension for hart lifecycle management, and System Reset extension for reboot and shutdown. Each extension is identified by an Extension ID (EID) and provides multiple functions.
Hypervisor extension adds virtualization support through VS-mode and VU-mode, enabling guest operating systems to run under a hypervisor. Two-stage address translation (GVA → GPA → HPA) isolates guest memory, while virtual interrupt injection allows the hypervisor to deliver interrupts to guests. VM entry and exit are managed through CSR manipulation and the SRET instruction.
PMP and ePMP provide memory protection by defining access permissions for memory regions. PMP is simpler than ARM TrustZone but sufficient for protecting M-mode firmware and implementing basic trusted execution environments. ePMP adds enhanced security features like M-mode lockdown and rule locking.
In the next chapter, we’ll explore RISC-V ISA extensions, starting with the standard extensions (M, A, F, D, C) and moving to advanced features like Vector and Bit Manipulation.
Chapter 11. RISC-V Standard Extensions
Part VII — ISA Extensions
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Decode ISA Naming: Parse the meaning of ISA strings like RV64GC, RV32IM
- Understand the G Package: Know that G = IMAFD and its historical background
- Master Z/X Extension Logic: Understand the naming rules for new-style Extensions
- Compare Hardware vs Software Implementation: Understand the performance difference between hardware instructions and software emulation
- Detect Extension Presence: Query CPU-supported features via the misa CSR
💡 Scenario: Skill Trees and DLCs
Scene: Junior is reading Linux Kernel compile options and gets intimidated by a long ISA string.
Junior: “Senior, is this MARCH variable a Wi-Fi password? RV64IMAFDC_Zicsr… who can read this?”
Senior: (laughs) “This is RISC-V’s ID card. Don’t worry—let’s treat it like RPG character stats. Breaking it down makes it simple.”
Junior: “RPG stats?”
Senior: “Look at the prefix RV64. This means it’s a 64-bit character that can wield two-handed swords (64-bit registers). RV32 would be 32-bit.”
Junior: “I get that part. What about that string of letters?”
Senior: “Those are ‘skills it has learned.’ Each letter represents an ability:
| Letter | Name | Analogy | Function |
|---|---|---|---|
| I | Integer | Basic Training | Addition, subtraction, logic ops—essential skill |
| M | Multiply | Multiplication Skill | Hardware multiply/divide—without it, you do N additions |
| A | Atomic | Locking Skill | Atomic operations—the foundation of spinlocks |
| F | Float | Single-Precision Magic | 32-bit floating-point math |
| D | Double | Double-Precision Magic | 64-bit floating-point math |
| C | Compressed | Contortion Skill | 16-bit compressed instructions—saves space |
”
Junior: “Makes sense. But what about RV64GC that everyone talks about? There’s no G in that table!”
Senior: “G (General) is a ‘value bundle.’ Since IMAFD are so commonly used together, the spec defines G = I + M + A + F + D. So RV64GC is really shorthand for RV64IMAFDC—the baseline for running Linux.”
Junior: “Got it! What about that Z prefix at the end? Hidden skill?”
Senior: “Pretty much. Since 26 letters aren’t enough anymore, newer features (or features split out from I) use Z prefix plus a name. Think of them as DLC expansions.
For example, Zicsr means CSR operations are supported, Zifencei means instruction fence is supported. This ‘password’ just tells the compiler: ‘This CPU bought these DLCs, feel free to use these instructions’!”
Junior: “Ha! So it’s just a skill list—that makes it much clearer!”
RISC-V’s modular design is one of its most distinctive features. Unlike monolithic instruction set architectures that bundle everything together, RISC-V separates functionality into a minimal base ISA plus optional extensions. This approach allows implementations to include only the features they need, from tiny microcontrollers to high-performance servers.
The base integer ISA (RV32I or RV64I) is deliberately small: RV32I defines just 40 instructions (the older count of 47 included the CSR and FENCE.I instructions, which have since been split out as Zicsr and Zifencei). That is enough to serve as a compiler target for general-purpose code. But most practical systems need more: multiplication and division, atomic operations for synchronization, floating-point arithmetic, and compressed instructions for code density. These capabilities come from standard extensions, each identified by a single letter.
Understanding these extensions is crucial for anyone working with RISC-V. Compiler writers need to know which instructions are available. Hardware designers must decide which extensions to implement. Software developers need to understand the performance implications of using extension instructions versus emulating them in software.
In this chapter, we’ll explore the standard extensions that form the foundation of most RISC-V systems: M for multiplication, A for atomics, F and D for floating-point, C for compressed instructions, and B for bit manipulation. We’ll see how these extensions integrate with the base ISA and compare them with similar features in ARM and x86.
11.1 Extension Overview
The Extension Model
RISC-V extensions follow a carefully designed model. The base ISA (I) is frozen and will never change. Extensions add functionality without modifying the base. Once an extension is ratified, it too is frozen, ensuring long-term stability.
Extensions are identified by single letters: M, A, F, D, C, V, B, and so on. A processor’s capabilities are described by concatenating these letters: RV64IMAFD means a 64-bit processor with integer, multiplication, atomic, single-precision float, and double-precision float extensions. The letter G is shorthand for IMAFD (general-purpose), so RV64GC means RV64IMAFD plus compressed instructions.
Standard vs Non-Standard Extensions
Standard extensions are defined by RISC-V International and ratified through a formal process. They have reserved letter codes and are guaranteed to be compatible across implementations. Non-standard extensions use the X prefix (like Xvendor) and are vendor-specific.
Custom extensions can add specialized instructions without conflicting with standard ones. The instruction encoding reserves opcode space for custom instructions, allowing vendors to innovate while maintaining compatibility with standard software.
Extension Discovery
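M-mode software can detect which extensions are present by reading the misa CSR (Machine ISA register), which is accessible only from M-mode and may read as zero on implementations that do not provide it. Each of the low 26 bits in misa corresponds to a single-letter extension: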
// Read misa to detect extensions
unsigned long misa = read_csr(CSR_MISA);
bool has_M = (misa & (1 << ('M' - 'A'))); // Bit 12
bool has_A = (misa & (1 << ('A' - 'A'))); // Bit 0
bool has_F = (misa & (1 << ('F' - 'A'))); // Bit 5
bool has_D = (misa & (1 << ('D' - 'A'))); // Bit 3
bool has_C = (misa & (1 << ('C' - 'A'))); // Bit 2
On some implementations, misa is read-only. On others, writing to misa can enable or disable extensions dynamically, though this is rare in practice.
Figure 11.1: RISC-V Extension Ecosystem
graph TB
subgraph "Base ISA (Frozen)"
RV32I[RV32I<br/>32-bit Integer]
RV64I[RV64I<br/>64-bit Integer]
RV128I[RV128I<br/>128-bit Integer]
end
subgraph "Standard Extensions"
M[M: Multiply/Divide]
A[A: Atomics]
F[F: Float 32-bit]
D[D: Float 64-bit]
C[C: Compressed]
V[V: Vector]
B[B: Bit Manipulation]
end
subgraph "Profiles"
RVA22[RVA22 Profile<br/>Application Processors]
RVA23[RVA23 Profile<br/>+ Vector]
end
RV64I --> M
RV64I --> A
RV64I --> F
RV64I --> D
RV64I --> C
RV64I --> V
RV64I --> B
M --> RVA22
A --> RVA22
F --> RVA22
D --> RVA22
C --> RVA22
RVA22 --> RVA23
V --> RVA23
style RV64I fill:#90EE90
style M fill:#87CEEB
style A fill:#87CEEB
style F fill:#87CEEB
style D fill:#87CEEB
style C fill:#87CEEB
style V fill:#FFD700
style B fill:#FFD700
style RVA22 fill:#FFB6C1
style RVA23 fill:#FFB6C1
The diagram shows how extensions build on the base ISA and combine into profiles for specific use cases.
11.2 M Extension: Integer Multiplication and Division
Why M is Optional
The M extension adds integer multiplication and division instructions. You might wonder why these fundamental operations aren’t in the base ISA. The answer is simplicity and flexibility.
Tiny embedded systems (like IoT sensors) may never need multiplication or division. Making M optional allows these systems to save chip area and power. Software can emulate multiplication using shifts and adds if needed, though much slower than hardware.
For most systems, M is essential. It’s part of the G (general-purpose) bundle and required by the RVA22 profile for application processors.
Multiplication Instructions
The M extension provides four multiplication instructions for RV32 and RV64:
MUL rd, rs1, rs2: Multiply rs1 by rs2, store the lower XLEN bits in rd. This is the most common multiplication, used when you only need the low-order result.
# Example: Multiply two 32-bit numbers
li a0, 100
li a1, 200
mul a2, a0, a1 # a2 = 100 * 200 = 20000
MULH rd, rs1, rs2: Multiply signed rs1 by signed rs2, store the upper XLEN bits in rd. Used for detecting overflow or implementing multi-word multiplication.
MULHU rd, rs1, rs2: Multiply unsigned rs1 by unsigned rs2, store the upper XLEN bits in rd.
MULHSU rd, rs1, rs2: Multiply signed rs1 by unsigned rs2, store the upper XLEN bits in rd. This asymmetric variant is useful for certain algorithms.
Why separate high and low multiply? A full multiplication of two XLEN-bit numbers produces a 2×XLEN-bit result. MUL gives you the low half, MULH/MULHU/MULHSU give you the high half. To get the full result, you execute both:
# 64-bit × 64-bit = 128-bit signed multiplication
mulh a3, a0, a1 # High 64 bits (issued first so cores that fuse the pair can do so)
mul a2, a0, a1 # Low 64 bits
# Result is in a3:a2 (128 bits)
Division and Remainder
The M extension also provides division and remainder instructions:
DIV rd, rs1, rs2: Signed division, rd = rs1 / rs2 (truncated toward zero).
DIVU rd, rs1, rs2: Unsigned division, rd = rs1 / rs2.
REM rd, rs1, rs2: Signed remainder, rd = rs1 % rs2.
REMU rd, rs1, rs2: Unsigned remainder, rd = rs1 % rs2.
# Example: Divide 100 by 7
li a0, 100
li a1, 7
div a2, a0, a1 # a2 = 100 / 7 = 14
rem a3, a0, a1 # a3 = 100 % 7 = 2
Division by zero does not trap in RISC-V. Instead, it returns defined values: division by zero returns -1 (all bits set), and remainder by zero returns the dividend. This allows software to check for zero explicitly if needed, without the overhead of trap handling.
RV64 Word Operations
On RV64, the M extension adds word-sized (32-bit) variants that operate on the lower 32 bits and sign-extend the result to 64 bits:
MULW, DIVW, DIVUW, REMW, REMUW
# RV64: 32-bit multiplication with sign extension
li a0, 0x40000000 # 1073741824
li a1, 2
mulw a2, a0, a1 # Low 32 bits = 0x80000000, so a2 = 0xFFFFFFFF80000000 (sign-extended)
These are essential for efficiently handling 32-bit data on 64-bit processors.
Performance Characteristics
Multiplication and division are slower than addition and logic operations. Typical latencies:
- MUL: 2-4 cycles (pipelined, throughput 1/cycle)
- DIV: 10-40 cycles (not pipelined, variable latency)
Division is particularly expensive. Compilers optimize division by constants into multiplication by the reciprocal when possible.
11.3 A Extension: Atomic Instructions
The Need for Atomics
In multi-processor systems, multiple harts (hardware threads) may access shared memory simultaneously. Without atomic operations, race conditions can corrupt data. Consider incrementing a shared counter:
// Non-atomic increment (WRONG for multi-threaded code)
int counter = 0;
void increment() {
counter++; // Read-modify-write: NOT atomic!
}
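This compiles to three separate steps (load, add, store):
lw a0, counter # Load the current value
addi a0, a0, 1 # Increment in a register
sw a0, counter, t0 # Store it back (the store pseudo-instruction needs a scratch register)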
If two harts execute this simultaneously, both might read the same value, increment it, and write back the same result—losing one increment. Atomic instructions solve this problem.
Load-Reserved / Store-Conditional
The A extension provides two fundamental primitives for building atomic operations:
LR.W rd, (rs1): Load-Reserved Word. Loads a word from memory and registers a reservation on that address.
SC.W rd, rs2, (rs1): Store-Conditional Word. Stores rs2 to the address in rs1 only if the reservation is still valid. Returns 0 in rd on success, non-zero on failure.
The reservation is invalidated if another hart writes to the reserved address or if certain events occur (context switch, cache eviction, etc.).
Atomic increment using LR/SC:
# Atomic increment of counter
retry:
lr.w a0, (a1) # Load counter, set reservation
addi a0, a0, 1 # Increment
sc.w a2, a0, (a1) # Store if reservation valid
bnez a2, retry # Retry if SC failed
If another hart modifies the counter between LR and SC, the SC fails and the loop retries. This ensures atomicity.
Atomic Memory Operations (AMO)
For common atomic operations, the A extension provides dedicated AMO instructions that are more efficient than LR/SC loops:
AMOSWAP.W rd, rs2, (rs1): Atomically swap memory[rs1] with rs2, return old value in rd.
AMOADD.W rd, rs2, (rs1): Atomically add rs2 to memory[rs1], return old value in rd.
AMOAND.W, AMOOR.W, AMOXOR.W: Atomic AND, OR, XOR.
AMOMIN.W, AMOMAX.W, AMOMINU.W, AMOMAXU.W: Atomic min/max (signed and unsigned).
# Atomic increment using AMO (simpler than LR/SC)
li a0, 1
amoadd.w zero, a0, (a1) # Atomically add a0 (1) to memory[a1]; old value is discarded
Atomic Ordering Annotations
AMO and LR/SC instructions can have ordering annotations:
.aq (acquire): Subsequent memory operations cannot be reordered before this instruction.
.rl (release): Previous memory operations cannot be reordered after this instruction.
.aqrl: Both acquire and release.
# Atomic swap with acquire-release semantics
amoswap.w.aqrl a0, a1, (a2)
These annotations are crucial for implementing lock-free data structures and memory barriers (see Chapter 6 on memory ordering).
RV64 Variants
On RV64, the A extension provides both word (32-bit) and doubleword (64-bit) variants:
- LR.W / SC.W: 32-bit load-reserved / store-conditional
- LR.D / SC.D: 64-bit load-reserved / store-conditional
- AMOADD.W / AMOADD.D: 32-bit / 64-bit atomic add (and similarly for other AMO operations)
Comparison with ARM and x86
ARM: Uses load-exclusive / store-exclusive pairs (LDREX/STREX on AArch32, LDXR/STXR on AArch64), similar to RISC-V’s LR/SC. ARMv8.1 added atomic instructions (LDADD, SWP, etc.) similar to RISC-V’s AMO.
x86: Uses LOCK prefix with normal instructions (LOCK ADD, LOCK XCHG, etc.) and dedicated atomic instructions (CMPXCHG). x86’s model is more complex but provides strong ordering by default.
RISC-V’s approach is cleaner: LR/SC for flexibility, AMO for common cases, explicit ordering annotations for performance.
11.4 F and D Extensions: Floating-Point
Floating-Point in RISC-V
The F extension adds single-precision (32-bit) floating-point, and the D extension adds double-precision (64-bit). Both follow the IEEE 754 standard, ensuring compatibility with other architectures and programming languages.
Floating-point is optional because not all systems need it. Embedded controllers often work with integers only. But for scientific computing, graphics, and many applications, floating-point is essential.
Floating-Point Register File
F and D extensions add a separate register file with 32 floating-point registers, f0 through f31. Each register is FLEN bits wide, where FLEN is 32 for F-only, 64 for D, and 128 for Q (quad-precision).
The separation of integer and floating-point registers simplifies hardware design and allows both to be accessed simultaneously. It also follows the tradition of RISC architectures like MIPS and SPARC.
Floating-Point CSRs
Three CSRs control floating-point behavior:
fcsr (Floating-Point Control and Status Register): Combined control and status.
frm (Floating-Point Rounding Mode): Bits [7:5] of fcsr, selects rounding mode:
- 000: Round to nearest, ties to even (default)
- 001: Round toward zero (truncate)
- 010: Round down (toward -∞)
- 011: Round up (toward +∞)
- 100: Round to nearest, ties to max magnitude
fflags (Floating-Point Exception Flags): Bits [4:0] of fcsr, records exceptions:
- NV: Invalid operation
- DZ: Divide by zero
- OF: Overflow
- UF: Underflow
- NX: Inexact
// Set rounding mode to round toward zero
write_csr(CSR_FRM, 0b001);
// Check for floating-point exceptions
unsigned int flags = read_csr(CSR_FFLAGS);
if (flags & 0x10) {
// Invalid operation occurred
}
F Extension Instructions
The F extension provides arithmetic, comparison, conversion, and move instructions for single-precision floats:
Arithmetic: FADD.S, FSUB.S, FMUL.S, FDIV.S, FSQRT.S
Fused multiply-add: FMADD.S, FMSUB.S, FNMADD.S, FNMSUB.S
Comparison: FEQ.S, FLT.S, FLE.S
Conversion: FCVT.W.S (float to int), FCVT.S.W (int to float), and variants
Move: FMV.X.W (float reg to int reg), FMV.W.X (int reg to float reg)
Load/Store: FLW, FSW
# Example: Compute (a * b) + c using fused multiply-add
flw fa0, 0(a0) # Load a
flw fa1, 4(a0) # Load b
flw fa2, 8(a0) # Load c
fmadd.s fa3, fa0, fa1, fa2 # fa3 = (a * b) + c
fsw fa3, 12(a0) # Store result
D Extension Instructions
The D extension extends F with double-precision operations. All F instructions have D equivalents (FADD.D, FMUL.D, etc.). Additionally, D provides conversions between single and double precision:
FCVT.S.D: Convert double to single (with rounding)
FCVT.D.S: Convert single to double (exact)
# Convert single to double
flw fa0, 0(a0) # Load single-precision
fcvt.d.s fa1, fa0 # Convert to double-precision
fsd fa1, 0(a1) # Store double-precision
NaN Boxing
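When the D extension is implemented, the floating-point registers are 64 bits wide (FLEN=64), so a single-precision value occupies only the lower half of a register. Such values must be NaN-boxed: the upper 32 bits are set to all 1s. This lets hardware distinguish a valid single-precision value from other data; a value that is not properly boxed is treated as a canonical NaN when read as single-precision. NaN boxing depends on FLEN, not on whether the integer registers are 32 or 64 bits wide.
Valid single-precision in 64-bit register:
[63:32] = 0xFFFFFFFF
[31:0] = single-precision value
With only the F extension (FLEN=32), boxing is not needed because a single-precision value fills the entire register.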
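The boxing rule can be modeled in a few lines of C. This is a host-side sketch of what the hardware does when a 32-bit value sits in a 64-bit floating-point register; the function names (nan_box, is_nan_boxed, unbox_bits) are hypothetical, and the canonical single-precision NaN pattern 0x7FC00000 is what the F specification substitutes for an improperly boxed input.

```c
#include <stdint.h>
#include <string.h>

/* Place a float in the low 32 bits of a 64-bit register image,
 * with the upper 32 bits set to all 1s. */
static inline uint64_t nan_box(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret without aliasing issues */
    return 0xFFFFFFFF00000000ull | bits;
}

/* A register image holds a valid boxed single if bits [63:32] are all 1s. */
static inline int is_nan_boxed(uint64_t reg) {
    return (reg >> 32) == 0xFFFFFFFFu;
}

/* Read the register as single-precision: properly boxed values pass
 * through, anything else reads as the canonical NaN. */
static inline uint32_t unbox_bits(uint64_t reg) {
    return is_nan_boxed(reg) ? (uint32_t)reg : 0x7FC00000u;
}
```

The boxed pattern itself is a negative quiet NaN when viewed as a double, which is where the name comes from.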
Performance
Floating-point operations are typically slower than integer operations. Exact latencies depend on the microarchitecture; representative figures are:
- FADD/FSUB: 3-5 cycles latency
- FMUL: 4-6 cycles latency
- FDIV: 10-20 cycles latency (not pipelined)
- FSQRT: 15-30 cycles latency
Fused multiply-add (FMADD) is particularly valuable: it computes (a × b) + c in one instruction with a single rounding, faster and more accurate than separate multiply and add.
11.5 C Extension: Compressed Instructions
The Code Density Problem
RISC architectures traditionally use fixed-length 32-bit instructions. This simplifies decoding and pipelining but wastes memory and instruction cache space. Many common operations (like “add register to register” or “load from stack”) don’t need 32 bits to encode.
The C extension addresses this by adding 16-bit compressed instructions that can be freely mixed with standard 32-bit instructions. This improves code density by 25-30% with minimal hardware complexity.
How Compressed Instructions Work
Compressed instructions are 16 bits and aligned on 16-bit boundaries. The processor’s fetch unit automatically expands them to equivalent 32-bit instructions before decoding. This expansion is transparent to software—compressed instructions are just a more compact encoding.
The low 2 bits of an instruction indicate its length:
- xx00, xx01, xx10: 16-bit compressed instruction (C extension)
- xx11: 32-bit standard instruction (or longer, for future extensions)
This encoding allows the processor to determine instruction boundaries without pre-decoding.
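The length rule is simple enough to express as a small C function. The sketch below inspects the first 16-bit parcel of an instruction; the name insn_length is hypothetical, and returning 0 for encodings longer than 32 bits (low five bits all 1s) is a simplification chosen for this example.

```c
#include <stdint.h>

/* Return the instruction length in bytes given its first 16-bit parcel:
 *   2 for a compressed instruction (low two bits != 11),
 *   4 for a standard 32-bit instruction,
 *   0 for longer encodings, which this sketch does not decode. */
static inline int insn_length(uint16_t parcel) {
    if ((parcel & 0x3u) != 0x3u) {
        return 2;                      /* xx00, xx01, xx10 */
    }
    if ((parcel & 0x1Cu) != 0x1Cu) {
        return 4;                      /* xx11 with bits [4:2] != 111 */
    }
    return 0;                          /* reserved for >32-bit encodings */
}
```

A fetch unit applies the same test to each parcel in turn, advancing by 2 or 4 bytes to find the next instruction boundary.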
Common Compressed Instructions
The C extension provides compressed forms of the most frequent operations:
C.ADD rd, rs2: Add register to register (expands to ADD rd, rd, rs2)
C.ADDI rd, imm: Add immediate (expands to ADDI rd, rd, imm)
C.LW rd’, offset(rs1’): Load word (expands to LW rd, offset(rs1))
C.SW rs2’, offset(rs1’): Store word (expands to SW rs2, offset(rs1))
C.J offset: Jump (expands to JAL x0, offset)
C.JALR rs1: Jump and link register (expands to JALR x1, 0(rs1))
C.MV rd, rs2: Move register (expands to ADD rd, x0, rs2)
C.LI rd, imm: Load immediate (expands to ADDI rd, x0, imm)
# Standard 32-bit instructions (8 bytes total)
addi sp, sp, -16
sw ra, 12(sp)
# Compressed equivalents (4 bytes total)
c.addi16sp sp, -16
c.swsp ra, 12(sp)
Register Encoding Restrictions
To fit in 16 bits, compressed instructions have restrictions:
- Many use only registers x8-x15 (s0-s1, a0-a5), encoded in 3 bits
- Immediates are smaller (6-bit instead of 12-bit)
- Offsets are scaled (e.g., word loads use offset×4)
The compiler and assembler handle these restrictions automatically, using compressed instructions when possible and falling back to 32-bit instructions when necessary.
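As an illustration of the kind of check an assembler performs, the C sketch below decides whether a load of the form lw rd, offset(rs1) could be encoded as C.LW. It assumes the C.LW constraints from the C extension specification: both registers in x8-x15 and an unsigned offset that is a multiple of 4 in the range 0 to 124. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Registers x8-x15 are the only ones reachable through a 3-bit field. */
static inline bool is_creg(unsigned x) {
    return x >= 8 && x <= 15;
}

/* The 3-bit encoding of a compressed register is its number minus 8. */
static inline unsigned creg_field(unsigned x) {
    assert(is_creg(x));
    return x - 8;
}

/* True if `lw rd, offset(rs1)` fits the C.LW encoding. */
static inline bool can_use_c_lw(unsigned rd, unsigned rs1, int offset) {
    return is_creg(rd) && is_creg(rs1) &&
           offset >= 0 && offset <= 124 && (offset % 4) == 0;
}
```

A load through ra or sp fails this test, which is why stack accesses use the separate C.LWSP and C.SWSP forms.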
Code Density Improvement
Typical programs see 25-30% code size reduction with the C extension. This translates to:
- Better instruction cache utilization
- Reduced memory bandwidth
- Lower power consumption (fewer instruction fetches)
For embedded systems with limited flash memory, this can be the difference between fitting the program or not.
Mixing 16-bit and 32-bit Instructions
Compressed and standard instructions can be freely mixed in the same program. The processor handles alignment automatically:
Address Instruction
0x1000: c.addi sp, -16 (16-bit)
0x1002: c.swsp ra, 12(sp) (16-bit)
0x1004: jal ra, function (32-bit)
0x1008: c.lwsp ra, 12(sp) (16-bit)
Branch targets and jump addresses can be any 16-bit aligned address, not just 32-bit aligned.
11.6 B Extension: Bit Manipulation
Why Bit Manipulation Matters
Bit manipulation operations—counting leading zeros, rotating bits, extracting bit fields—are common in cryptography, compression, hashing, and low-level systems programming. Without dedicated instructions, these operations require multiple instructions and are slow.
The B extension adds efficient bit manipulation instructions. Unlike M, A, F, D, and C, the B extension is modular, divided into several sub-extensions that can be implemented independently.
B Extension Sub-Extensions
Zba: Address generation instructions (shift-add for array indexing)
Zbb: Basic bit manipulation (count leading zeros, rotate, min/max, sign-extend)
Zbc: Carry-less multiplication (for cryptography)
Zbs: Single-bit operations (set, clear, invert, extract)
A processor might implement Zba and Zbb for general use, while omitting Zbc if cryptography isn’t needed.
Zba: Address Generation
Zba provides shift-add instructions for efficient array indexing:
SH1ADD rd, rs1, rs2: rd = (rs1 << 1) + rs2
SH2ADD rd, rs1, rs2: rd = (rs1 << 2) + rs2
SH3ADD rd, rs1, rs2: rd = (rs1 << 3) + rs2
// Array indexing: address = base + (index * sizeof(element))
int array[100];
int index = 10;
// Without Zba (3 instructions):
// slli t0, index, 2 # t0 = index * 4
// add t0, t0, base # t0 = base + (index * 4)
// lw a0, 0(t0)
// With Zba (2 instructions):
// sh2add t0, index, base # t0 = base + (index * 4)
// lw a0, 0(t0)
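The semantics of the three shift-add instructions can be written directly in C. This is a host-side model for RV64 (64-bit registers); the function names mirror the mnemonics, and elem_addr32 is a hypothetical helper added to show the array-indexing use.

```c
#include <stdint.h>

/* rd = (rs1 << 1) + rs2 */
static inline uint64_t sh1add(uint64_t rs1, uint64_t rs2) {
    return (rs1 << 1) + rs2;
}

/* rd = (rs1 << 2) + rs2 */
static inline uint64_t sh2add(uint64_t rs1, uint64_t rs2) {
    return (rs1 << 2) + rs2;
}

/* rd = (rs1 << 3) + rs2 */
static inline uint64_t sh3add(uint64_t rs1, uint64_t rs2) {
    return (rs1 << 3) + rs2;
}

/* Address of element `index` in an array of 4-byte elements at `base`. */
static inline uint64_t elem_addr32(uint64_t base, uint64_t index) {
    return sh2add(index, base);
}
```

Note the operand order: the index is the shifted operand (rs1) and the base is added unshifted (rs2).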
Zbb: Basic Bit Manipulation
Zbb provides commonly used bit operations:
CLZ rd, rs: Count leading zeros
CTZ rd, rs: Count trailing zeros
CPOP rd, rs: Count population (number of 1 bits)
ROL rd, rs1, rs2: Rotate left
ROR rd, rs1, rs2: Rotate right
MIN rd, rs1, rs2: Signed minimum
MAX rd, rs1, rs2: Signed maximum
MINU, MAXU: Unsigned variants
SEXT.B, SEXT.H: Sign-extend byte/halfword to XLEN
# Count leading zeros (useful for finding highest set bit)
li a0, 0x00001000
clz a1, a0 # a1 = 51 (on RV64)
# Rotate right by 4 bits
li a0, 0x12345678
rori a1, a0, 4 # a1 = 0x8000000001234567 on RV64 (0x81234567 on RV32)
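A portable C model helps when checking results like these by hand. The sketch below implements count-leading-zeros, count-trailing-zeros, population count, and rotate-right for 64-bit and 32-bit operands. The function names are hypothetical; the loops favor clarity over speed and are not how a compiler would lower these operations.

```c
#include <stdint.h>

/* Count leading zeros; returns 64 for an input of 0, as CLZ does on RV64. */
static inline unsigned clz64(uint64_t x) {
    unsigned n = 0;
    if (x == 0) return 64;
    while ((x & (1ull << 63)) == 0) { x <<= 1; n++; }
    return n;
}

/* Count trailing zeros; returns 64 for an input of 0. */
static inline unsigned ctz64(uint64_t x) {
    unsigned n = 0;
    if (x == 0) return 64;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

/* Population count: number of 1 bits. */
static inline unsigned cpop64(uint64_t x) {
    unsigned n = 0;
    for (; x != 0; x >>= 1) n += (unsigned)(x & 1u);
    return n;
}

/* Rotate right across the full 64-bit register (ROR/RORI on RV64). */
static inline uint64_t ror64(uint64_t x, unsigned sh) {
    sh &= 63;
    return sh ? (x >> sh) | (x << (64 - sh)) : x;
}

/* Rotate right within 32 bits (the RV32 result; the W-suffixed forms on
 * RV64 also rotate 32 bits and then sign-extend, which is not modeled). */
static inline uint32_t ror32(uint32_t x, unsigned sh) {
    sh &= 31;
    return sh ? (x >> sh) | (x << (32 - sh)) : x;
}
```

The two rotate widths give different answers for the same input, which is worth keeping in mind when porting code between RV32 and RV64.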
Zbc: Carry-Less Multiplication
Zbc provides carry-less multiplication, used in cryptographic algorithms like AES-GCM:
CLMUL rd, rs1, rs2: Carry-less multiply (low half)
CLMULH rd, rs1, rs2: Carry-less multiply (high half)
Carry-less multiplication is like normal multiplication but without carries between bit positions—essentially XOR instead of ADD.
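The "XOR instead of ADD" description translates directly into a shift-and-XOR loop. The following C sketch models CLMUL and CLMULH for 64-bit operands; the names clmul64 and clmulh64 are hypothetical, and the bit-serial loop is for illustration only.

```c
#include <stdint.h>

/* Low 64 bits of the carry-less product of a and b. */
static inline uint64_t clmul64(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (unsigned i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r ^= a << i;          /* XOR in the shifted partial product */
        }
    }
    return r;
}

/* High 64 bits of the carry-less product of a and b. */
static inline uint64_t clmulh64(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (unsigned i = 1; i < 64; i++) {
        if ((b >> i) & 1u) {
            r ^= a >> (64 - i);   /* bits shifted out of the low half */
        }
    }
    return r;
}
```

For instance, 3 carry-less-times 3 is 5 (0b11 XOR 0b110 = 0b101), whereas ordinary multiplication gives 9.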
Zbs: Single-Bit Operations
Zbs provides instructions for manipulating individual bits:
BSET rd, rs1, rs2: Set bit rs2 in rs1
BCLR rd, rs1, rs2: Clear bit rs2 in rs1
BINV rd, rs1, rs2: Invert bit rs2 in rs1
BEXT rd, rs1, rs2: Extract bit rs2 from rs1
# Set bit 5 in register a0
bseti a0, a0, 5 # a0 |= (1 << 5)
# Extract bit 3 from register a1
bexti a2, a1, 3 # a2 = (a1 >> 3) & 1
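For reference, here is a C model of the four Zbs operations on 64-bit registers. The function names mirror the mnemonics; masking the index with 63 reflects that only the low six bits of the index are used on RV64.

```c
#include <stdint.h>

/* Set bit `idx` of rs1. */
static inline uint64_t bset(uint64_t rs1, unsigned idx) {
    return rs1 | (1ull << (idx & 63));
}

/* Clear bit `idx` of rs1. */
static inline uint64_t bclr(uint64_t rs1, unsigned idx) {
    return rs1 & ~(1ull << (idx & 63));
}

/* Invert bit `idx` of rs1. */
static inline uint64_t binv(uint64_t rs1, unsigned idx) {
    return rs1 ^ (1ull << (idx & 63));
}

/* Extract bit `idx` of rs1 as 0 or 1. */
static inline uint64_t bext(uint64_t rs1, unsigned idx) {
    return (rs1 >> (idx & 63)) & 1u;
}
```

Each function corresponds to what would otherwise be a two- or three-instruction sequence of shift, mask, and logical operations.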
Performance Impact
Bit manipulation instructions typically execute in 1 cycle, same as basic ALU operations. Without them, equivalent operations might take 3-10 instructions. For cryptography and compression, this can mean 2-5× speedup.
11.7 Zicsr and Zifencei
Zicsr: CSR Instructions
The Zicsr extension defines the CSR (Control and Status Register) instructions we’ve used throughout this book: CSRRW, CSRRS, CSRRC, and their immediate variants.
Historically, these were part of the base I extension. But to keep the base ISA truly minimal, they were separated into Zicsr. Any system that needs to access CSRs (which is almost all systems) implements Zicsr.
Zifencei: Instruction Fence
The Zifencei extension provides the FENCE.I instruction, which synchronizes instruction and data caches. This is necessary when code is modified at runtime (self-modifying code, JIT compilation, dynamic linking).
FENCE.I: Ensures that all previous stores to instruction memory are visible to subsequent instruction fetches.
# Example: JIT compiler writes new code to memory
sw a0, 0(a1) # Write instruction to memory
sw a2, 4(a1) # Write another instruction
fence.i # Synchronize I-cache and D-cache
jalr a1 # Jump to newly written code
Without FENCE.I, the processor might execute stale instructions from the I-cache instead of the newly written code.
Like Zicsr, Zifencei was separated from the base ISA to keep it minimal. Systems that don’t modify code at runtime can omit it.
11.8 RVA22 Profile
The Need for Profiles
With so many optional extensions, how do software developers know what features they can rely on? A processor implementing just RV64I is very different from one implementing RV64IMAFDCV.
Profiles solve this problem by defining standard combinations of extensions for specific use cases. Software targeting a profile can assume all mandatory features are present.
RVA22 Profile
The RVA22 profile (named for the 2022 ISA roadmap year; the profile specification was ratified in 2023) targets application processors capable of running rich operating systems like Linux. It comes in two variants:
RVA22U (Unprivileged): Specifies the user-mode ISA. Mandatory extensions include:
- RV64I base ISA
- M, A, F, D, C extensions (i.e., RV64GC)
- Zicsr (Zifencei is mandated by the supervisor-mode profile, RVA22S, rather than by RVA22U)
- Zba, Zbb, Zbs (address generation and basic bit manipulation)
- Various other Z-extensions for specific functionality
RVA22S (Supervisor): Adds supervisor-mode requirements for OS support:
- Sv39 virtual memory (39-bit virtual addresses)
- Supervisor mode and required CSRs
- SBI (Supervisor Binary Interface) support
- Additional privilege-related extensions
Profile Compliance
A processor claiming RVA22S compliance guarantees it can run standard Linux distributions and other Unix-like operating systems without modification. This is crucial for software portability.
Later profiles add more features. RVA23 makes the Vector extension (V) mandatory, recognizing the importance of SIMD for modern applications.
Embedded Profiles
Separate profiles exist for embedded systems:
- Microcontroller profiles (RVM): Minimal feature sets for resource-constrained devices
- Real-time profiles: Add requirements for deterministic interrupt handling
These profiles ensure that embedded software can target well-defined platforms.
🛠️ Hands-on Lab: Lab 11.1 — The Power of Hardware Acceleration (Soft vs Hard Mul)
This lab demonstrates the performance difference between “having the M Extension” and “not having the M Extension” through compiler options.
Lab Objectives
- Use the same C code (multiplication operations)
- Compile for RV64I (no multiply instruction) and RV64IM (with multiply instruction)
- Observe Assembly differences
- Compare execution cycles
Code
Create mul_test.c:
// mul_test.c - Compare software vs hardware multiply
#include <stdint.h>
// Read cycle counter
static inline uint64_t read_cycles(void) {
uint64_t val;
asm volatile("csrr %0, mcycle" : "=r"(val));
return val;
}
// Simple multiply function
long multiply(long a, long b) {
return a * b;
}
// Multiple multiplication test
volatile long result;
void bench_multiply(int iterations) {
long a = 123456;
long b = 789012;
for (int i = 0; i < iterations; i++) {
result = multiply(a, b);
a++;
}
}
int main(void) {
int iterations = 10000;
uint64_t start = read_cycles();
bench_multiply(iterations);
uint64_t end = read_cycles();
// Elapsed cycles; print this with whatever output routine your platform provides
volatile uint64_t cycles = end - start;
(void)cycles;
return 0;
}
Experiment Steps
Step A: Compile as RV64IM (with multiply instruction)
# Tell compiler it can use multiply instructions (mul, mulw, etc.)
riscv64-unknown-elf-gcc -O2 -march=rv64im -mabi=lp64 \
-c mul_test.c -o mul_hard.o
# View Assembly
riscv64-unknown-elf-objdump -d mul_hard.o
Observe Assembly:
multiply:
mul a0, a0, a1 # Direct hardware multiply, typically 1-4 cycles
ret
Step B: Compile as RV64I (no multiply instruction)
# Tell compiler "this CPU doesn't know multiplication"
riscv64-unknown-elf-gcc -O2 -march=rv64i -mabi=lp64 \
-c mul_test.c -o mul_soft.o
# View Assembly
riscv64-unknown-elf-objdump -d mul_soft.o
Observe Assembly:
multiply:
call __muldi3 # Call software emulation library (libgcc)
Analysis
__muldi3 is libgcc’s software multiply implementation, internally composed of dozens of add, shift, branch instructions:
# Simplified logic of __muldi3 (Shift-and-Add algorithm)
__muldi3:
li t0, 0 # result = 0
loop:
andi t1, a1, 1 # if (b & 1)
beqz t1, skip
add t0, t0, a0 # result += a
skip:
slli a0, a0, 1 # a <<= 1
srli a1, a1, 1 # b >>= 1
bnez a1, loop # while (b != 0)
mv a0, t0
ret
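The same shift-and-add idea reads more easily in C. The sketch below is a rendering of the simplified loop above, not the actual libgcc source; the function name mul_shift_add is hypothetical. Working on unsigned 64-bit values gives the correct low 64 bits for signed operands as well, because two's-complement multiplication is identical modulo 2^64.

```c
#include <stdint.h>

/* Multiply using only shifts, adds, and branches (no multiply instruction). */
static inline uint64_t mul_shift_add(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the current partial product */
        }
        a <<= 1;           /* next bit of b is worth twice as much */
        b >>= 1;
    }
    return result;
}
```

The loop runs once per significant bit of b, which is why the software path costs tens of cycles while a hardware multiplier needs only a few.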
Expected Results
| Config | Single Multiply Cycles | 10000 Multiplies |
|---|---|---|
| RV64IM | ~1-4 cycles | ~10,000-40,000 cycles |
| RV64I | ~30-60 cycles | ~300,000-600,000 cycles |
Conclusion: Hardware M extension can be 10-50x faster than software emulation!
danieRTOS Reference: The danieRTOS Makefile uses -march=rv64gc to ensure all standard extensions are available.
Design Trade-off
💭 Why isn’t the M Extension mandatory?
In extremely resource-constrained embedded systems (such as 8-bit compatible microcontrollers), chip designers may choose not to implement a hardware multiplier to save transistors. In such cases, the compiler automatically uses software emulation. RISC-V’s modularity makes this trade-off possible—you only pay for what you need.
⚠️ Common Pitfalls
Pitfall 1: Misunderstanding G’s Composition
Misconception: “RV64G includes compressed instructions (C)”
Correct Understanding:
- G = IMAFD, does NOT include C
- GC = IMAFD + C
- Linux typically requires RV64GC because most distributions default to C extension for space savings
# ❌ Wrong: Thinking G includes C
riscv64-linux-gnu-gcc -march=rv64g # Actually doesn't have C
# ✅ Correct: Explicitly specify GC
riscv64-linux-gnu-gcc -march=rv64gc
Pitfall 2: misa Detection Trap
Error Scenario: Reading misa in S-mode or U-mode.
Consequence: Illegal Instruction Exception, because misa is an M-mode CSR.
// ❌ Wrong: Reading misa directly in S-mode
unsigned long misa;
asm volatile("csrr %0, misa" : "=r"(misa)); // Exception!
// ✅ Correct: Get info via SBI or Device Tree
// Or use try-catch mechanism to test specific instructions
Pitfall 3: Ignoring Zicsr and Zifencei
Error Scenario: In newer spec versions, CSR operations and FENCE.I have been separated from I.
Consequence: Using old -march=rv64i may cause problems.
# Old version (pre-2019)
# CSR operations were part of I
# New version (post-2019)
# CSR operations require explicit Zicsr
riscv64-unknown-elf-gcc -march=rv64i_zicsr_zifencei ...
# Simpler approach: Use G (it implies Zicsr and Zifencei)
riscv64-unknown-elf-gcc -march=rv64gc ...
💡 Tip: In practice, use standard combinations like rv64gc, or spell out rv64imac_zicsr_zifencei for targets without floating-point, to avoid missing essential extensions.
Summary
RISC-V’s modular extension system is one of its greatest strengths. The base ISA provides a minimal foundation, while standard extensions add functionality as needed. This chapter covered the core extensions that form the basis of most RISC-V systems.
M extension adds integer multiplication and division, essential for most applications beyond the simplest embedded systems. The separate high and low multiply instructions efficiently handle multi-word arithmetic, while division provides defined behavior even for division by zero.
A extension provides atomic operations for multi-processor synchronization. Load-reserved and store-conditional offer flexible primitives for building lock-free algorithms, while atomic memory operations provide efficient implementations of common patterns. Explicit ordering annotations give programmers fine-grained control over memory consistency.
F and D extensions add IEEE 754 floating-point arithmetic with a separate register file and dedicated CSRs for rounding modes and exception flags. Fused multiply-add instructions provide both performance and accuracy benefits. The separation of single and double precision allows implementations to include only what they need.
C extension improves code density by 25-30% through 16-bit compressed instructions that expand transparently to 32-bit equivalents. This reduces memory usage and improves cache efficiency with minimal hardware complexity, making it valuable for both embedded systems and high-performance processors.
B extension adds efficient bit manipulation through modular sub-extensions. Address generation instructions accelerate array indexing, basic bit manipulation provides common operations like count-leading-zeros and rotate, and specialized instructions support cryptography and other domains.
Zicsr and Zifencei complete the picture by providing CSR access and instruction cache synchronization. Though separated from the base ISA for minimality, they’re essential for almost all practical systems.
RVA22 profile ties these extensions together into a coherent platform for application processors. By mandating specific extensions and versions, profiles ensure software portability while preserving RISC-V’s flexibility.
In the next chapter, we’ll explore the Vector extension, RISC-V’s approach to SIMD and data-parallel processing.
Chapter 12. Vector Processing & SIMD Comparison
Part VII — ISA Extensions
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand SISD vs SIMD: Grasp the difference between scalar and vector operations
- Master VLA Core Concepts: Understand the value of Vector-Length Agnostic design
- Use the vsetvli Instruction: Perform Strip-mining (chunked processing)
- Write Vector Loops: Master the standard VLA loop structure
- Compare Different SIMD Architectures: Understand RISC-V V vs ARM SVE vs x86 AVX differences
💡 Scenario: The Flexible Noodle Cutter
Scene: Junior is staring at SIMD code on the screen, frustrated.
Junior: “Architect, this is painful. I wrote an optimized version for 128-bit hardware before. Now the company switched to a new chip that supports 512-bit. My original loop only cuts 4 floats at a time, but now I could cut 16. I have to rewrite the entire loop logic, including the ‘tail’ handling.”
Architect: “That’s the downside of Fixed-width SIMD. It’s like a fixed-size cookie cutter. Using a small cutter on a big dough is inefficient; switch to a bigger cutter, and your old recipe (code) doesn’t work anymore.”
Junior: “How does RISC-V solve this?”
Architect: “RISC-V’s V (Vector) Extension uses a design called VLA (Vector-Length Agnostic).
Imagine you have a ‘smart noodle cutting machine’. You don’t need to tell it ‘I want 4 pieces’ or ‘I want 16 pieces.’ You just say: ‘Here’s a big lump of dough (total length N), please cut it with your maximum capacity.’”
Junior: “Maximum capacity?”
Architect: “Right. The hardware responds: ‘Report: my blade can cut 8 pieces at a time (vl).’
Then you cut 8 pieces, push the dough forward, and ask again.
When you’re down to the last 3 pieces of dough, you don’t need to change cutters—the hardware automatically tells you: ‘This time I’ll only cut 3.’
Code written this way runs on 128-bit or 1024-bit machines without changing a single line, automatically achieving maximum performance.”
Junior: “Wow, even the tail is handled automatically? Teach me this instruction!”
Architect: “This is the legendary artifact: vsetvli.”
Modern applications increasingly demand data-parallel processing. Image processing applies the same filter to millions of pixels. Machine learning performs matrix operations on thousands of elements. Scientific simulations compute physics equations across vast grids. These workloads share a common pattern: the same operation repeated on different data.
Traditional scalar processors handle one operation at a time. To process 1000 elements, they execute 1000 separate instructions. Vector processors, by contrast, operate on multiple elements simultaneously with a single instruction—Single Instruction, Multiple Data (SIMD). This can provide 4×, 8×, or even greater speedups for data-parallel code.
Every major architecture offers SIMD extensions: x86 has SSE and AVX, ARM has NEON and SVE. RISC-V’s answer is the V extension (Vector), ratified in 2021. But RISC-V takes a different approach from its predecessors. Instead of fixed-width vectors that become obsolete as hardware improves, RISC-V uses vector-length agnostic programming—code that adapts automatically to different hardware implementations.
This chapter explores the V extension’s design, compares it with ARM and x86 SIMD, and shows how to write efficient vector code. We’ll see why RISC-V’s approach offers better long-term scalability than traditional SIMD architectures.
12.1 Vector Extension Overview
The SIMD Evolution
SIMD extensions have evolved through multiple generations, each adding wider vectors:
- x86: MMX (64-bit) → SSE (128-bit) → AVX (256-bit) → AVX-512 (512-bit)
- ARM: NEON (128-bit) → SVE (128-2048 bits, scalable)
Each generation requires new instructions and software rewrites. Code optimized for 128-bit vectors doesn’t automatically benefit from 256-bit hardware. This creates a dilemma: should compilers target narrow vectors for compatibility or wide vectors for performance?
Vector-Length Agnostic Programming
RISC-V’s V extension solves this with vector-length agnostic (VLA) programming. Instead of specifying exact vector widths, programs specify operations on abstract vectors. The hardware determines the actual vector length at runtime based on its capabilities.
A program written for V extension runs on any implementation, from embedded processors with 128-bit vectors to supercomputers with 4096-bit vectors, automatically using the available width. This future-proofs software and simplifies compiler design.
Key Concepts
VLEN: Vector register length in bits, implementation-defined (must be power of 2, minimum 128, maximum 65536). A processor might have VLEN=256 (256-bit vectors) or VLEN=512 (512-bit vectors).
ELEN: Maximum element width in bits, implementation-defined (minimum 32, maximum 64). Determines the largest element type (e.g., ELEN=64 supports 64-bit integers and doubles).
SEW: Selected element width in bits, chosen by software (8, 16, 32, or 64). Determines how many elements fit in a vector register.
LMUL: Vector register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8). Allows using multiple registers as a single logical vector for larger operations.
AVL: Application vector length, the number of elements the application wants to process.
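VL: Vector length, the number of elements actually processed by an instruction (VL ≤ AVL, VL ≤ VLMAX).
VLMAX: The maximum number of elements one instruction can process under the current configuration: VLMAX = (VLEN/SEW) × LMUL.
The relationship is: VL = min(AVL, VLMAX) = min(AVL, (VLEN/SEW) × LMUL)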
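The relationship can be checked with a few lines of C. In this sketch LMUL is passed as a fraction (numerator and denominator) so that fractional values such as 1/8 can be expressed; the function names are hypothetical. It uses the simple min() rule stated above, although the V specification also permits implementations to return other values of VL when AVL lies between VLMAX and 2×VLMAX.

```c
#include <assert.h>

/* Largest element count per instruction: (VLEN / SEW) * LMUL,
 * with LMUL given as lmul_num / lmul_den (e.g. 1/8, 1/1, 8/1). */
static inline unsigned vlmax(unsigned vlen, unsigned sew,
                             unsigned lmul_num, unsigned lmul_den) {
    assert(sew != 0 && lmul_den != 0);
    return (vlen / sew) * lmul_num / lmul_den;
}

/* VL granted for a request of `avl` elements, using the min() rule. */
static inline unsigned compute_vl(unsigned avl, unsigned max_elems) {
    return avl < max_elems ? avl : max_elems;
}
```

With VLEN=256 and SEW=32, vlmax(256, 32, 1, 1) is 8, so a request for 100 elements is granted 8 and a request for 3 is granted 3.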
Figure 12.1: Vector Register Organization
Vector Register File (32 registers: v0-v31)
Each register: VLEN bits (implementation-defined)
Element Width (SEW) - determines elements per register:
VLEN = 256 bits (example)
├─ SEW=8: 32 elements (bytes)
├─ SEW=16: 16 elements (halfwords)
├─ SEW=32: 8 elements (words)
└─ SEW=64: 4 elements (doublewords)
Register Grouping (LMUL) - use multiple registers as one:
├─ LMUL=1: 1 register (e.g., v0)
├─ LMUL=2: 2 registers (e.g., v0-v1)
├─ LMUL=4: 4 registers (e.g., v0-v3)
└─ LMUL=8: 8 registers (e.g., v0-v7)
A vector register can be interpreted with different element widths (SEW), and multiple consecutive registers can be grouped (LMUL) for larger operations.
12.2 Vector Register Organization
Vector Register File
The V extension adds 32 vector registers, v0 through v31. Each register is VLEN bits wide, where VLEN is implementation-defined. Unlike scalar registers which are always 32 or 64 bits, vector registers can be 128, 256, 512, or even larger.
Register v0 has a special role: it’s used as the mask register for predicated operations (more on this later).
Element Width and Capacity
A vector register holds multiple elements. The number depends on the selected element width (SEW):
Number of elements = VLEN / SEW
For VLEN=256:
- SEW=8 (byte): 32 elements
- SEW=16 (halfword): 16 elements
- SEW=32 (word): 8 elements
- SEW=64 (doubleword): 4 elements
Software selects SEW based on the data type being processed.
Register Grouping (LMUL)
Sometimes you need to process more elements than fit in one register. LMUL (register group multiplier) allows treating multiple consecutive registers as a single logical vector:
- LMUL=1: Use 1 register (default)
- LMUL=2: Use 2 consecutive registers (e.g., v0-v1)
- LMUL=4: Use 4 consecutive registers (e.g., v0-v3)
- LMUL=8: Use 8 consecutive registers (e.g., v0-v7)
With LMUL=2 and SEW=32 on VLEN=256, you get 16 elements (8 per register × 2 registers).
LMUL can also be fractional (1/2, 1/4, 1/8) to use only part of a register, leaving more registers available for other operations.
Register Alignment
When LMUL > 1, register numbers must be aligned:
- LMUL=2: Use v0, v2, v4, … (even registers)
- LMUL=4: Use v0, v4, v8, … (multiples of 4)
- LMUL=8: Use v0, v8, v16, v24 (multiples of 8)
This simplifies hardware implementation.
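The alignment rule amounts to a single modulo test. The C sketch below checks whether a register number is a legal base for a group of the given integer LMUL; the function name vreg_group_ok is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True if vector register `vreg` may start a group of `lmul` registers.
 * `lmul` must be 1, 2, 4, or 8. */
static inline bool vreg_group_ok(unsigned vreg, unsigned lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return vreg < 32 && (vreg % lmul) == 0;
}
```

Because the base is always a multiple of LMUL, hardware can form the register numbers of a group by filling in the low bits of the base number.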
12.3 Vector Configuration
The vtype CSR
Vector operations are configured through the vtype CSR (vector type register), which specifies:
- SEW: Selected element width (8, 16, 32, or 64 bits)
- LMUL: Register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8)
- vta: Vector tail agnostic (how to handle elements beyond VL)
- vma: Vector mask agnostic (how to handle masked-off elements)
The vsetvl Instruction
Before executing vector instructions, software must configure vtype and set VL using one of the vsetvl-family instructions (vsetvli, vsetivli, or vsetvl). The most common form is vsetvli:
vsetvli rd, rs1, vtypei: Set VL and vtype. rs1 contains AVL (requested vector length), vtypei encodes SEW and LMUL, rd receives the actual VL.
# Configure for 32-bit elements, LMUL=1
li a0, 100 # AVL = 100 elements to process
vsetvli t0, a0, e32, m1 # Set SEW=32, LMUL=1, VL = min(AVL, VLEN/32)
# t0 now contains actual VL
The hardware sets VL to the smaller of:
- AVL (what the application requested)
- VLEN/SEW × LMUL (what the hardware can handle)
If AVL=100 but the hardware can only process 8 elements at a time (VLEN=256, SEW=32, LMUL=1), then VL=8. The application must loop to process all 100 elements.
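The looping behavior can be simulated on any host. The C sketch below models each vsetvli as a min() of the remaining element count and the hardware maximum, then counts how many passes are needed; the function name strip_mine and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate a strip-mined loop over `n` elements on hardware that handles
 * at most `max_elems` per pass. Returns the number of passes and stores
 * the VL of the final pass in *last_vl (if non-NULL). */
static inline unsigned strip_mine(unsigned n, unsigned max_elems,
                                  unsigned *last_vl) {
    assert(max_elems != 0);
    unsigned passes = 0;
    unsigned vl = 0;
    for (unsigned done = 0; done < n; done += vl) {
        unsigned remaining = n - done;
        vl = remaining < max_elems ? remaining : max_elems; /* models vsetvli */
        passes++;
    }
    if (last_vl != NULL) {
        *last_vl = vl;
    }
    return passes;
}
```

For 100 elements at 8 per pass, the simulation takes 13 passes: twelve full passes of 8 and a final pass with VL=4, with no separate tail-handling code.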
Vector-Length Agnostic Loop
Here’s the standard pattern for processing an array:
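#include <riscv_vector.h> // RVV C intrinsics (v1.0 naming, __riscv_ prefix)
#include <stddef.h>
#include <stdint.h>
void vadd_vv(int32_t *dst, const int32_t *src1, const int32_t *src2, size_t n) {
size_t vl;
for (size_t i = 0; i < n; i += vl) {
vl = __riscv_vsetvl_e32m1(n - i); // Set VL for remaining elements
vint32m1_t v1 = __riscv_vle32_v_i32m1(&src1[i], vl); // Load src1[i:i+vl]
vint32m1_t v2 = __riscv_vle32_v_i32m1(&src2[i], vl); // Load src2[i:i+vl]
vint32m1_t v3 = __riscv_vadd_vv_i32m1(v1, v2, vl); // v3 = v1 + v2
__riscv_vse32_v_i32m1(&dst[i], v3, vl); // Store dst[i:i+vl]
}
}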
This code works on any VLEN. On VLEN=128, it processes 4 elements per iteration. On VLEN=512, it processes 16 elements per iteration. No code changes needed.
Encoding vtype
The vtypei immediate in vsetvli encodes SEW and LMUL:
vtypei[2:0] = LMUL encoding:
000 = LMUL=1, 001 = LMUL=2, 010 = LMUL=4, 011 = LMUL=8
101 = LMUL=1/8, 110 = LMUL=1/4, 111 = LMUL=1/2
vtypei[5:3] = SEW encoding:
000 = SEW=8, 001 = SEW=16, 010 = SEW=32, 011 = SEW=64
vtypei[6] = vta (tail agnostic)
vtypei[7] = vma (mask agnostic)
The assembler provides convenient mnemonics: e32, m1 means SEW=32, LMUL=1.
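To show what the assembler does with those mnemonics, the C sketch below builds the vtypei immediate from the field layout above. The enum and function names are hypothetical; the encodings are the ones listed in this section.

```c
#include <assert.h>
#include <stdint.h>

/* vlmul field encodings, vtypei[2:0]. Value 4 (100) is reserved. */
enum vlmul {
    LMUL_1 = 0, LMUL_2 = 1, LMUL_4 = 2, LMUL_8 = 3,
    LMUL_F8 = 5, LMUL_F4 = 6, LMUL_F2 = 7
};

/* Encode SEW (8, 16, 32, or 64), LMUL, and the two agnostic bits. */
static inline uint32_t encode_vtypei(unsigned sew, enum vlmul lmul,
                                     int vta, int vma) {
    uint32_t vsew = 0;
    switch (sew) {
    case 8:  vsew = 0; break;
    case 16: vsew = 1; break;
    case 32: vsew = 2; break;
    case 64: vsew = 3; break;
    default: assert(0 && "unsupported SEW"); break;
    }
    return (uint32_t)lmul            /* vtypei[2:0] */
         | (vsew << 3)               /* vtypei[5:3] */
         | ((vta ? 1u : 0u) << 6)    /* vtypei[6]   */
         | ((vma ? 1u : 0u) << 7);   /* vtypei[7]   */
}
```

Under this layout, e32, m1 with both agnostic bits clear encodes as 0x10, and e32, m1, ta, ma encodes as 0xD0.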
12.4 Vector Arithmetic and Logic
Vector-Vector Operations
Vector arithmetic instructions operate on corresponding elements from two vector registers:
vadd.vv vd, vs2, vs1: vd[i] = vs2[i] + vs1[i] for i = 0 to VL-1
vsub.vv, vmul.vv, vdiv.vv: Subtraction, multiplication, division
vand.vv, vor.vv, vxor.vv: Bitwise AND, OR, XOR
# Vector addition: v3 = v1 + v2
vsetvli t0, a0, e32, m1
vle32.v v1, (a1) # Load first vector
vle32.v v2, (a2) # Load second vector
vadd.vv v3, v1, v2 # Add element-wise
vse32.v v3, (a3) # Store result
Vector-Scalar Operations
Often you need to add the same scalar to all vector elements. Vector-scalar instructions use a scalar register (x register) as the second operand:
vadd.vx vd, vs2, rs1: vd[i] = vs2[i] + rs1 for all i
# Add constant 10 to all elements
li a0, 10
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vadd.vx v2, v1, a0 # v2[i] = v1[i] + 10
vse32.v v2, (a3)
Vector-Immediate Operations
For small constants, vector-immediate instructions avoid loading into a scalar register:
vadd.vi vd, vs2, imm: vd[i] = vs2[i] + imm (imm is 5-bit signed)
# Increment all elements by 1
vadd.vi v2, v1, 1 # v2[i] = v1[i] + 1
Widening and Narrowing Operations
Widening operations produce results twice as wide as the inputs:
vwaddu.vv vd, vs2, vs1: Widening unsigned add (e.g., 32-bit inputs → 64-bit results)
vwadd.vv: Widening signed add
Narrowing operations reduce width:
vnsrl.wv vd, vs2, vs1: Narrowing shift right logical (e.g., 64-bit inputs → 32-bit results)
These are essential for avoiding overflow in accumulations or reducing precision after computation.
Fused Multiply-Add
Vector fused multiply-add computes (a × b) + c in one instruction:
vfmadd.vv vd, vs1, vs2: vd[i] = (vd[i] × vs1[i]) + vs2[i]
This is crucial for matrix multiplication and other linear algebra operations.
12.5 Vector Memory Operations
Unit-Stride Loads and Stores
The most common memory access pattern is unit-stride: consecutive elements in memory.
vle32.v vd, (rs1): Load VL elements of 32-bit width from address rs1
vse32.v vs3, (rs1): Store VL elements of 32-bit width to address rs1
# Load 32-bit integers from array
vsetvli t0, a0, e32, m1
vle32.v v1, (a1) # Load v1[0:VL-1] from memory[a1]
The number of bytes loaded is VL × SEW/8. For VL=8 and SEW=32, this loads 32 bytes.
Strided Loads and Stores
Strided access loads elements separated by a constant stride:
vlse32.v vd, (rs1), rs2: Load elements from rs1, rs1+rs2, rs1+2×rs2, …
vsse32.v vs3, (rs1), rs2: Store with stride
# Load every other element (stride = 8 bytes for 32-bit elements)
li t1, 8 # The byte stride must be supplied in a register
vlse32.v v1, (a1), t1 # Load a1[0], a1[2], a1[4], ...
This is useful for accessing matrix columns or interleaved data.
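A scalar C model makes the addressing explicit. In this sketch the stride is given in bytes, as it is for vlse32.v, and the function name strided_load32 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 32-bit strided load: element i comes from
 * base + i * stride_bytes. */
static inline void strided_load32(int32_t *vd, const void *base,
                                  ptrdiff_t stride_bytes, size_t vl) {
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++) {
        memcpy(&vd[i], p + (ptrdiff_t)i * stride_bytes, sizeof(int32_t));
    }
}
```

For a row-major matrix of 32-bit values with C columns, a stride of 4×C bytes walks down a single column.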
Indexed (Scatter/Gather) Loads and Stores
Indexed access uses a vector of indices to load/store non-contiguous elements. This is also called “gather” (for loads) and “scatter” (for stores).
vluxei32.v vd, (rs1), vs2: Load elements from rs1+vs2[i] for each i (unordered)
vsuxei32.v vs3, (rs1), vs2: Store with indices (unordered)
# Example: Gather operation
# Suppose we have an array a[] of 32-bit elements and want to load a[1], a[3], a[5], a[2]
# First, load the element indices [1, 3, 5, 2], then scale them to byte offsets
vle32.v v1, (a1) # Load index vector: v1 = [1, 3, 5, 2]
vsll.vi v1, v1, 2 # Multiply by 4 bytes per element: v1 = [4, 12, 20, 8]
vluxei32.v v2, (a2), v1 # Gather: v2[0]=a[1], v2[1]=a[3], v2[2]=a[5], v2[3]=a[2]
The index vector holds byte offsets, not element indices: for each element i, the instruction loads from address base + offset[i], with no implicit scaling by the element size. That is why the example shifts the indices left by 2 before the gather. With v1 = [4, 12, 20, 8], the gather operation loads:
- v2[0] = memory[a2 + 4] (element at index 1)
- v2[1] = memory[a2 + 12] (element at index 3)
- v2[2] = memory[a2 + 20] (element at index 5)
- v2[3] = memory[a2 + 8] (element at index 2)
This is essential for sparse matrix operations, indirect addressing, and accessing non-contiguous data.
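The following scalar C model shows a gather step by step. It follows the ratified V specification, in which indexed loads treat each index element as a byte offset from the base address, so element indices are scaled by the element size first. The function names gather32 and scale_indices are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert element indices to byte offsets: idx[i] <<= log2_elem_bytes. */
static inline void scale_indices(uint32_t *idx, size_t vl,
                                 unsigned log2_elem_bytes) {
    for (size_t i = 0; i < vl; i++) {
        idx[i] <<= log2_elem_bytes;
    }
}

/* Scalar model of a 32-bit indexed load: element i comes from
 * base + byte_offsets[i]. */
static inline void gather32(int32_t *vd, const void *base,
                            const uint32_t *byte_offsets, size_t vl) {
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++) {
        memcpy(&vd[i], p + byte_offsets[i], sizeof(int32_t));
    }
}
```

Starting from indices [1, 3, 5, 2] and 4-byte elements, scale_indices produces [4, 12, 20, 8], and gather32 then picks up a[1], a[3], a[5], and a[2].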
Segment Loads and Stores
Segment operations load/store groups of elements (like struct fields):
vlseg2e32.v vd, (rs1): Load 2-field segments (e.g., {x, y} pairs)
vsseg2e32.v vs3, (rs1): Store 2-field segments
// Load array of {x, y} pairs
struct point { int x, y; };
struct point points[100];
vlseg2e32.v v1, (a0) # v1 = all x values, v2 = all y values
This efficiently handles structure-of-arrays (SoA) and array-of-structures (AoS) conversions.
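In scalar terms, a two-field segment load de-interleaves an array of structures into one vector per field. The C sketch below models that for the {x, y} example; the function name seg2_load32 is hypothetical, and the struct uses int32_t so the field width is explicit.

```c
#include <stddef.h>
#include <stdint.h>

struct point { int32_t x, y; };

/* Scalar model of a 2-field, 32-bit segment load: field 0 of every
 * segment goes to vx, field 1 goes to vy. */
static inline void seg2_load32(int32_t *vx, int32_t *vy,
                               const struct point *pts, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vx[i] = pts[i].x;
        vy[i] = pts[i].y;
    }
}
```

After the load, arithmetic on all x values or all y values proceeds with ordinary unit-stride vector instructions.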
Figure 12.2a: Unit-Stride Access
Unit-Stride (consecutive elements):
Memory: [0] [1] [2] [3] [4] [5] [6] [7]
↓ ↓ ↓ ↓
Vector: [0] [1] [2] [3]
Figure 12.2b: Strided Access
Strided (every 2nd element, stride=2):
Memory: [0] [1] [2] [3] [4] [5] [6] [7]
↓ ↓ ↓ ↓
Vector: [0] [2] [4] [6]
Figure 12.2c: Indexed (Gather) Access
graph TB
subgraph "Index Vector"
idx0["indices[0] = 1"]
idx1["indices[1] = 3"]
idx2["indices[2] = 5"]
idx3["indices[3] = 2"]
end
subgraph "Memory Array"
m0["mem[0] = 0"]
m1["mem[1] = 1"]
m2["mem[2] = 2"]
m3["mem[3] = 3"]
m4["mem[4] = 4"]
m5["mem[5] = 5"]
m6["mem[6] = 6"]
m7["mem[7] = 7"]
end
subgraph "Result Vector"
v0["vector[0] = 1"]
v1["vector[1] = 3"]
v2["vector[2] = 5"]
v3["vector[3] = 2"]
end
idx0 --> m1
m1 --> v0
idx1 --> m3
m3 --> v1
idx2 --> m5
m5 --> v2
idx3 --> m2
m2 --> v3
style m1 fill:#90EE90
style m2 fill:#FFB6C1
style m3 fill:#87CEEB
style m5 fill:#FFD700
Each index points to a memory location, and the value at that location is loaded into the corresponding vector position.
12.6 Vector Masking
Predicated Execution
Not all elements in a vector may need processing. Masking allows selectively enabling or disabling operations on individual elements.
The mask is stored in vector register v0, with one bit per element. If v0[i] = 1, element i is processed; if v0[i] = 0, element i is skipped (or handled according to vma setting).
Masked Operations
Most vector instructions have a masked form, selected by appending the v0.t operand:
vadd.vv vd, vs2, vs1, v0.t: Add only where v0[i] = 1
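# Conditional add: dst[i] = (mask[i]) ? src1[i] + src2[i] : dst[i]
# Assumes vtype selects the mask-undisturbed policy (mu); under ma,
# masked-off elements of v3 may be overwritten with all 1s
vlm.v v0, (a0) # Load mask into v0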
vle32.v v1, (a1) # Load src1
vle32.v v2, (a2) # Load src2
vle32.v v3, (a3) # Load dst (for masked-off elements)
vadd.vv v3, v1, v2, v0.t # Add where mask is 1, keep v3 where mask is 0
vse32.v v3, (a3) # Store result
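The semantics of a masked add under the mask-undisturbed policy can be sketched in scalar C. This is a model only: `mask_bit` and `masked_add_mu` are hypothetical names, and the mask is assumed to be packed one bit per element, least significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask bit for element i: bit (i % 8) of byte (i / 8). */
static int mask_bit(const uint8_t *mask, size_t i)
{
    return (mask[i / 8] >> (i % 8)) & 1;
}

/* Model of a masked vector add with mask-undisturbed behaviour:
 * active elements get vs2[i] + vs1[i], inactive elements keep their old value. */
static void masked_add_mu(int32_t *vd, const int32_t *vs2, const int32_t *vs1,
                          const uint8_t *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        if (mask_bit(mask, i))
            vd[i] = vs2[i] + vs1[i];
        /* else: vd[i] is left undisturbed */
    }
}
```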
Comparison and Mask Generation
Comparison instructions generate masks:
vmseq.vv vd, vs2, vs1: vd[i] = (vs2[i] == vs1[i]) ? 1 : 0
vmslt.vv, vmsle.vv: Less than, less or equal
vmsgt.vx, vmsgt.vi: Greater than (scalar and immediate forms only; to compare two vectors, swap the operands of vmslt.vv)
# Find elements greater than 100
li a0, 100
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vmsgt.vx v0, v1, a0 # v0[i] = (v1[i] > 100) ? 1 : 0
Mask Logical Operations
Masks can be combined with logical operations:
vmand.mm vd, vs2, vs1: Mask AND
vmor.mm, vmxor.mm, vmnand.mm: Mask OR, XOR, NAND
# Combine two conditions: (a > 100) AND (a < 200)
vmsgt.vx v1, v2, a0 # v1 = (v2 > 100)
vmslt.vx v3, v2, a1 # v3 = (v2 < 200)
vmand.mm v0, v1, v3 # v0 = v1 AND v3
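The same range test can be modelled in C with one bit per element, which makes it easy to check what the combined mask should contain. The names `mask_gt_vx` and `mask_lt_vx` are hypothetical, and the sketch assumes at most 64 elements so that a mask fits in a `uint64_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a compare-greater-than against a scalar: bit i = (v[i] > x). */
static uint64_t mask_gt_vx(const int32_t *v, size_t vl, int32_t x)
{
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (v[i] > x)
            m |= 1ULL << i;
    return m;
}

/* Model of a compare-less-than against a scalar: bit i = (v[i] < x). */
static uint64_t mask_lt_vx(const int32_t *v, size_t vl, int32_t x)
{
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (v[i] < x)
            m |= 1ULL << i;
    return m;
}
```

A mask AND then reduces to `mask_gt_vx(v, vl, 100) & mask_lt_vx(v, vl, 200)`.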
Use Cases
Masking is essential for:
- Conditional operations (if-then-else in vector code)
- Handling loop tails (when array size isn’t a multiple of VL)
- Sparse computations (skip zero elements)
- Implementing reductions with conditions
12.7 Vector Reductions
What is a Reduction?
A reduction combines all elements of a vector into a single scalar result. Common examples: sum all elements, find maximum, count non-zero elements.
Reduction Instructions
vredsum.vs vd, vs2, vs1: Sum all elements of vs2, add to vs1[0], store in vd[0]
vredmax.vs, vredmin.vs: Find maximum or minimum
vredand.vs, vredor.vs, vredxor.vs: Bitwise AND, OR, XOR of all elements
# Sum all elements of an array
vsetvli t0, a0, e32, m1
vmv.v.i v2, 0 # Initialize accumulator to 0
vle32.v v1, (a1) # Load vector
vredsum.vs v2, v1, v2 # v2[0] = sum(v1[0:VL-1]) + v2[0]
vmv.x.s a2, v2 # Move result to scalar register
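For arrays larger than VL, loop and accumulate. The version below uses the RVV C intrinsics from <riscv_vector.h> (v1.0 naming with the __riscv_ prefix):
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

int32_t sum_array(const int32_t *arr, size_t n) {
    // Element 0 of acc carries the running sum across iterations
    vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, 1);
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);                   // elements in this pass
        vint32m1_t v = __riscv_vle32_v_i32m1(&arr[i], vl);  // load vl elements
        acc = __riscv_vredsum_vs_i32m1_i32m1(v, acc, vl);   // acc[0] += sum(v)
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);
}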
Masked Reductions
Reductions can be masked to sum only selected elements:
# Sum elements where mask is 1
vredsum.vs v2, v1, v2, v0.t
This is useful for conditional sums (e.g., sum all positive elements).
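A scalar model shows what a masked sum computes. This is a sketch with hypothetical helper names (`masked_redsum`, `positive_mask`); it assumes 32-bit elements and at most 64 of them.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a masked sum reduction: adds the active elements of vs2 to init.
 * Bit i of mask selects element i. */
static int32_t masked_redsum(const int32_t *vs2, uint64_t mask, size_t vl,
                             int32_t init)
{
    uint32_t acc = (uint32_t)init;      /* unsigned arithmetic wraps on overflow */
    for (size_t i = 0; i < vl; i++)
        if ((mask >> i) & 1)
            acc += (uint32_t)vs2[i];
    return (int32_t)acc;
}

/* Mask of strictly positive elements (a greater-than-zero compare). */
static uint64_t positive_mask(const int32_t *v, size_t vl)
{
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (v[i] > 0)
            m |= 1ULL << i;
    return m;
}
```

Summing the positive elements of an array is then `masked_redsum(v, positive_mask(v, vl), vl, 0)`.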
12.8 Comparison with ARM NEON and x86 AVX
ARM NEON
ARM NEON provides 128-bit SIMD with 32 vector registers (v0-v31 in AArch64). Each register can hold:
- 16 × 8-bit elements
- 8 × 16-bit elements
- 4 × 32-bit elements
- 2 × 64-bit elements
NEON instructions specify the element width explicitly:
# ARM NEON: Add two vectors of 4 × 32-bit integers
ld1 {v0.4s}, [x0] // Load 4 × 32-bit
ld1 {v1.4s}, [x1]
add v2.4s, v0.4s, v1.4s // Add element-wise
st1 {v2.4s}, [x2]
Limitations of NEON:
- Fixed 128-bit width (no scalability)
- Code must be rewritten for wider vectors
- No predication (masking) in base NEON
ARM SVE (Scalable Vector Extension)
SVE addresses NEON’s limitations with scalable vectors (128-2048 bits). Like RISC-V V, SVE uses vector-length agnostic programming:
# ARM SVE: Vector add (works on any vector length)
ld1w z0.s, p0/z, [x0] // Load with predication
ld1w z1.s, p0/z, [x1]
add z2.s, z0.s, z1.s // Add
st1w z2.s, p0, [x2] // Store with predication
SVE and RISC-V V share similar philosophies: scalable vectors, predication, and VLA programming. However, SVE is more complex with more instruction variants and addressing modes.
x86 AVX
x86’s SIMD evolved through multiple generations:
- SSE: 128-bit (16 registers: xmm0-xmm15)
- AVX: 256-bit (16 registers: ymm0-ymm15)
- AVX-512: 512-bit (32 registers: zmm0-zmm31)
Each generation added new instructions:
# x86 AVX: Add two vectors of 8 × 32-bit integers
vmovdqu ymm0, [rax] ; Load 256 bits
vmovdqu ymm1, [rbx]
vpaddd ymm2, ymm0, ymm1 ; Add 8 × 32-bit
vmovdqu [rcx], ymm2 ; Store
Limitations of x86 SIMD:
- Fixed widths (128, 256, 512 bits)
- Code must be rewritten for each generation
- AVX-512 has many variants (AVX-512F, AVX-512BW, AVX-512DQ, etc.)
- Complexity: thousands of SIMD instructions
RISC-V V Advantages
Compared to NEON and AVX, RISC-V V offers:
- Scalability: One codebase works on any VLEN (128 to 65536 bits)
- Simplicity: Fewer instruction variants, consistent naming
- Predication: Built-in masking for all operations
- Flexibility: Fractional LMUL, widening/narrowing operations
- Future-proof: No need to rewrite code for wider vectors
Trade-offs:
- RISC-V V is newer (less mature tooling and libraries)
- x86 AVX has extensive optimization for specific workloads
- ARM NEON is simpler for fixed-width use cases
Figure 12.3: SIMD Architecture Comparison
| Feature | x86 SSE/AVX | ARM NEON | ARM SVE | RISC-V V |
|---|---|---|---|---|
| Vector Width | Fixed: 128/256/512 bits | Fixed: 128 bits | Scalable: 128-2048 bits | Scalable: 128-65536 bits |
| Registers | 16 (SSE/AVX), 32 (AVX-512) | 32 | 32 | 32 |
| Scalability | No (fixed per generation) | No (fixed) | Yes (scalable) | Yes (scalable) |
| Code Portability | No (rewrite per generation) | Within NEON only (one fixed width) | Yes (single codebase) | Yes (single codebase) |
| Predication | Partial (AVX-512 only) | No (base NEON) | Yes | Yes |
| Instruction Count | ~1000s (across generations) | ~200 | ~400 | ~300 |
| Complexity | High (many variants) | Low | Medium | Low |
| Ratification | 1999 (SSE), 2011 (AVX), 2016 (AVX-512) | 2005 | 2016 | 2021 |
| Key Advantage | Mature ecosystem | Simple, widely deployed | Scalable, predication | Scalable, simple, future-proof |
| Key Limitation | Fixed widths, complexity | Fixed 128-bit only | Complex instruction set | Newer, less mature tooling |
🛠️ Hands-on Lab: Lab 12.1 — Vector Addition
This lab demonstrates the classic VLA loop structure—the foundational pattern for RISC-V Vector programming.
Lab Objectives
- Write RISC-V Vector Assembly to implement C[i] = A[i] + B[i]
- Understand the meaning of vsetvli’s return value vl
- Compare the structure of Scalar Loop vs Vector Loop
Strip-mining Loop Structure
This is the core pattern of VLA programming:
while (n > 0) {
vl = vsetvli(n); // Ask hardware: how many can you handle?
load(vl elements); // Load vl elements
compute(); // Execute operation
store(vl elements); // Store vl elements
n -= vl; // Decrease remaining count
pointers += vl; // Advance pointers
}
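The loop structure can be tried out on any machine with a small C simulation. This is a sketch under stated assumptions: `model_vsetvl` is a hypothetical stand-in that returns min(AVL, VLMAX), which is the simplest behaviour the hardware may choose, and `vlmax` is passed in as a parameter rather than derived from real hardware.

```c
#include <stddef.h>

/* Simplest legal model of the vector-length request: vl = min(avl, vlmax). */
static size_t model_vsetvl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined c[i] = a[i] + b[i]. Returns the number of loop iterations and
 * reports the vl used by the final iteration through last_vl. */
static size_t strip_mine_add(const int *a, const int *b, int *c, size_t n,
                             size_t vlmax, size_t *last_vl)
{
    size_t iterations = 0;
    *last_vl = 0;
    while (n > 0) {
        size_t vl = model_vsetvl(n, vlmax);   /* how many this pass? */
        for (size_t i = 0; i < vl; i++)       /* stands in for load/add/store */
            c[i] = a[i] + b[i];
        a += vl; b += vl; c += vl;            /* advance pointers */
        n -= vl;                              /* decrease remaining count */
        *last_vl = vl;
        iterations++;
    }
    return iterations;
}
```

With n = 100 and vlmax = 8, the model runs 13 iterations and the last one handles 4 elements, without any separate tail loop.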
Code
File 1: vector_add.S
# vector_add.S - Vector Addition (VLA version)
.section .text
.global vec_add
# void vec_add(int *a, int *b, int *c, int n)
# a0 = pointer to A
# a1 = pointer to B
# a2 = pointer to C
# a3 = n (element count)
vec_add:
# --- Strip-mining Loop ---
loop:
# 1. Set vector length
# vsetvli rd, rs1, vtype
# t0: hardware returns actual elements it can process (vl)
# a3: remaining elements (AVL)
# e32: element size 32-bit
# m1: LMUL=1 (use 1 vector register)
# ta, ma: Tail/Mask Agnostic
vsetvli t0, a3, e32, m1, ta, ma
# 2. Load data
vle32.v v0, (a0) # v0 = A[0:vl]
vle32.v v1, (a1) # v1 = B[0:vl]
# 3. Execute addition
vadd.vv v2, v0, v1 # v2 = v0 + v1
# 4. Write back data
vse32.v v2, (a2) # C[0:vl] = v2
# 5. Advance pointers (int32 = 4 bytes)
slli t1, t0, 2 # t1 = vl * 4
add a0, a0, t1 # A pointer advances
add a1, a1, t1 # B pointer advances
add a2, a2, t1 # C pointer advances
# 6. Update remaining count
sub a3, a3, t0 # n = n - vl
# 7. Continue loop
bnez a3, loop
ret
File 2: main.c
#include <stdio.h>
extern void vec_add(int *a, int *b, int *c, int n);
#define N 100 // Intentionally not power of 2, to test tail handling
int main(void) {
int a[N], b[N], c[N];
// Initialize
for (int i = 0; i < N; i++) {
a[i] = i;
b[i] = 100;
c[i] = 0;
}
printf("Starting Vector Add...\n");
vec_add(a, b, c, N);
// Verify
int error = 0;
for (int i = 0; i < N; i++) {
if (c[i] != a[i] + 100) {
error++;
}
}
if (error == 0) {
printf("SUCCESS: All %d elements correct!\n", N);
} else {
printf("FAILED: %d errors\n", error);
}
return 0;
}
Compile and Run
# Compile (requires V extension support)
riscv64-unknown-elf-gcc -march=rv64gcv -o vec_add main.c vector_add.S
# Run on QEMU with V extension
qemu-riscv64 -cpu rv64,v=true vec_add
Expected Output:
Starting Vector Add...
SUCCESS: All 100 elements correct!
What You Just Did
- vsetvli: Asked hardware “how many elements can you process?” and got vl back
- Automatic Tail Handling: When N=100 and VLEN allows 8 elements per iteration, the last iteration automatically processes only 4 elements
- Portable Code: This same code runs on any VLEN (128-bit, 256-bit, 1024-bit) without modification
danieRTOS Reference: While danieRTOS doesn’t use vector operations directly, understanding VLA patterns helps when optimizing memory copy operations in the kernel.
⚠️ Common Pitfalls
Pitfall 1: Unnecessarily Handling the Tail
Error Scenario: Habituated to traditional SIMD, writing extra tail-handling loops.
Consequence: Wasted effort, and may introduce bugs.
// ❌ Wrong: No need to handle tail yourself
void vec_add_wrong(int *a, int *b, int *c, int n) {
int i;
// Vector part
for (i = 0; i + 4 <= n; i += 4) {
// vector_add_4(a+i, b+i, c+i);
}
// Tail part (this is redundant in VLA!)
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✅ Correct: vsetvli handles tail automatically
// See Assembly example above
Pitfall 2: Assuming Fixed VLEN
Error Scenario: Hardcoding assumptions like VLEN=256 or other specific values.
Consequence: Program behaves incorrectly or performs poorly on different hardware.
# ❌ Wrong: Assuming 8 elements per iteration
loop:
li t0, 8 # Hardcoded!
vsetvli zero, t0, e32, m1, ta, ma
...
# ✅ Correct: Let hardware decide
loop:
vsetvli t0, a3, e32, m1, ta, ma # a3 = remaining count
...
Pitfall 3: Forgetting LMUL’s Impact
Error Scenario: Not understanding LMUL (Vector Register Group Multiplier).
Explanation:
- LMUL=1: Use 1 vector register (v0-v31)
- LMUL=2: Use 2 registers as a group (v0-v1, v2-v3, …)
- LMUL=4/8: Larger groups
# LMUL=1: Can use v0-v31 (32 independent registers)
vsetvli t0, a3, e32, m1, ta, ma
# LMUL=2: Can use v0, v2, v4, ... (16 groups, 2 each)
vsetvli t0, a3, e32, m2, ta, ma
# Now v0 and v1 are the same group, cannot be used separately
# LMUL=8: Can use v0, v8, v16, v24 (4 groups, 8 each)
vsetvli t0, a3, e32, m8, ta, ma
💡 Tip: LMUL > 1 can process more elements but reduces available register count. For simple vector addition, LMUL=1 is usually sufficient.
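The trade-off follows from the relation VLMAX = LMUL × VLEN / SEW. The short C sketch below evaluates it for integer LMUL values; the function names are hypothetical and fractional LMUL is left out for brevity.

```c
/* VLMAX = LMUL * VLEN / SEW, for integer LMUL in {1, 2, 4, 8}. */
static unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
    return lmul * vlen_bits / sew_bits;
}

/* Number of usable register groups: the 32 vector registers divided by LMUL. */
static unsigned register_groups(unsigned lmul)
{
    return 32u / lmul;
}
```

For VLEN = 256 and SEW = 32, going from LMUL=1 to LMUL=8 raises VLMAX from 8 to 64 elements while the number of register groups drops from 32 to 4.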
Summary
The RISC-V Vector extension represents a modern approach to SIMD processing, learning from decades of experience with x86 and ARM SIMD architectures. Its vector-length agnostic design ensures that code written today will automatically benefit from wider vectors in future hardware.
Vector-length agnostic programming is the V extension’s defining feature. By abstracting away the physical vector width, RISC-V allows a single binary to run efficiently on implementations ranging from tiny embedded processors to supercomputers. This eliminates the need to maintain multiple code paths for different vector widths, simplifying both compiler and application development.
Vector configuration through the vsetvl instruction and vtype CSR provides fine-grained control over element width (SEW), register grouping (LMUL), and vector length (VL). The hardware automatically determines the optimal VL based on the application’s request (AVL) and the implementation’s capabilities (VLEN), making it easy to write portable high-performance code.
Vector operations cover the full spectrum of data-parallel computation: arithmetic and logic operations with vector-vector, vector-scalar, and vector-immediate variants; widening and narrowing operations for precision management; and fused multiply-add for efficient linear algebra. The consistent instruction naming and behavior make the V extension easier to learn than x86’s sprawling SIMD instruction set.
Vector memory operations support diverse access patterns: unit-stride for contiguous data, strided for matrix columns and interleaved data, indexed for sparse matrices and indirect addressing, and segment operations for structure-of-arrays conversions. This flexibility enables efficient vectorization of a wide range of algorithms.
Vector masking provides predicated execution, allowing conditional operations on individual vector elements. Comparison instructions generate masks, mask logical operations combine conditions, and masked operations selectively process elements. This is essential for handling loop tails, implementing conditional logic in vector code, and optimizing sparse computations.
Vector reductions efficiently combine all elements of a vector into a scalar result, supporting operations like sum, maximum, minimum, and bitwise reductions. Masked reductions enable conditional aggregations, crucial for many algorithms.
Compared to ARM NEON and x86 AVX, RISC-V V offers superior scalability and simplicity. While NEON is limited to 128-bit vectors and AVX requires separate code for each generation (128, 256, 512 bits), RISC-V V code automatically adapts to any vector width. ARM SVE shares RISC-V’s scalable philosophy but with greater complexity. x86’s SIMD has evolved into thousands of instructions across multiple incompatible extensions, while RISC-V V maintains a clean, orthogonal design.
The V extension positions RISC-V well for future data-parallel workloads in machine learning, scientific computing, multimedia processing, and other domains where SIMD performance is critical.
Chapter 13. SoC Integration
Part VIII — System Design, Platform Spec & SoC Integration
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand PMP’s Role: Grasp how Physical Memory Protection limits access at the hardware level
- Distinguish TOR vs NAPOT: Understand the configuration differences and use cases for each Address Matching Mode
- Configure PMP Entries: Set up read-only regions and intercept illegal writes
- Understand PLIC Architecture: Grasp how the Platform-Level Interrupt Controller operates
- Integrate SoC Components: Understand how CPU, Memory, and Peripherals connect via Interconnect
💡 Scenario: The Museum’s Red Barrier Poles
Scene: Junior stares at a “Store Access Fault” exception code on the screen, looking confused.
Junior: “Architect, this is so strange. I already turned off the MMU (virtual memory) and I’m using physical addresses directly to write to this variable. Why is the CPU still blocking me? Is the board broken?”
Architect: “The board is fine. You just hit PMP (Physical Memory Protection)’s ‘red barrier poles.’
Imagine memory is a museum:
| Mechanism | Analogy | Function |
|---|---|---|
| MMU (Page Table) | Tour map | Tells you where exhibits are (VA → PA) |
| PMP | Red barrier poles + bulletproof glass | Hardware security, limits who can touch what |
Even if you bypass the tour guide (turn off MMU) and rush straight to the Mona Lisa, the hardware security (PMP Checker) will still stop you, because your ID (Privilege Mode) says you’re just an ordinary visitor (S-mode/U-mode), and this area is only accessible to the museum director (M-mode).”
Junior: “So how do I set up these barrier poles? Do I need start and end addresses?”
Architect: “There are two common ways to set up the barriers:
- TOR (Top of Range): Like stretching a rope. You need two poles (two PMP Entries), and the area between them is the controlled region. Good for arbitrary-sized regions.
- NAPOT (Naturally Aligned Power of Two): Like placing a fixed-size dome (4KB, 2MB…) over exhibits. You only need to set the center point and dome size, which is more resource-efficient (uses only one Entry).
Today let’s try using NAPOT to cover a 4KB region and make it ‘read-only,’ and see what happens to your program.”
A RISC-V processor core doesn’t operate in isolation. To build a complete system-on-chip (SoC), the core must integrate with memory controllers, interrupt controllers, I/O devices, and system interconnects. This integration determines how software accesses hardware, how devices communicate, and how the system maintains security and performance.
RISC-V provides a modular approach to SoC design. Unlike monolithic architectures that prescribe specific peripheral implementations, RISC-V defines standard interfaces while allowing flexibility in implementation. The Physical Memory Protection (PMP) unit is configured from machine mode and restricts which physical addresses lower-privilege code can access. The Platform-Level Interrupt Controller (PLIC) routes interrupts from devices to cores. Memory-mapped I/O (MMIO) provides a uniform mechanism for device access. System interconnects like TileLink and AXI connect components together.
This chapter explores how RISC-V cores integrate into complete SoCs. We’ll examine the essential components—PMP, IOMMU, PLIC, MMIO, memory maps, interconnects, and DMA—and see how they work together to create functional systems. Understanding SoC integration is crucial for system designers, firmware developers, and anyone working with RISC-V hardware platforms.
13.1 Physical Memory Protection (PMP)
The Need for Memory Isolation
In systems without virtual memory, how do we prevent untrusted code from accessing sensitive memory regions? A bare-metal application might need to protect its firmware from buggy drivers. An embedded RTOS might need to isolate tasks from each other. Machine-mode firmware must protect itself from supervisor-mode operating systems.
Physical Memory Protection (PMP) provides hardware-enforced memory access control using physical addresses. Unlike virtual memory’s page tables, which the S-mode operating system manages, PMP is programmed from M-mode and constrains all lower privilege levels (and M-mode itself for locked entries). This makes PMP essential for systems without MMUs and useful for protecting M-mode resources even in systems with MMUs.
PMP Architecture
PMP uses a set of configuration registers to define protected memory regions:
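pmpcfg0-pmpcfg15: Configuration registers, one byte per entry (RV32 uses all 16, holding 4 entries each; RV64 uses only the 8 even-numbered registers, holding 8 entries each)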
pmpaddr0-pmpaddr63: Address registers (up to 64 regions)
Each PMP entry consists of:
- An address register (pmpaddr) defining the region
- A configuration byte (in pmpcfg) specifying permissions and matching mode
PMP Configuration Format
pmpcfg format (8 bits per entry):
Bit:    7    6:5   4:3   2   1   0
Field:  L    0     A     X   W   R
L: Lock bit (prevents further modification)
A: Address matching mode (OFF, TOR, NA4, NAPOT)
X: Execute permission
W: Write permission
R: Read permission
Address Matching Modes
PMP supports four address matching modes:
OFF (A=0): Region is disabled, no protection applied
TOR (A=1): Top-of-Range. Region is [pmpaddr[i-1], pmpaddr[i])
NA4 (A=2): Naturally Aligned 4-byte region
NAPOT (A=3): Naturally Aligned Power-Of-Two region
The most commonly used modes are TOR and NAPOT:
// Example: Protect 64KB region at 0x80000000 using NAPOT
// NAPOT encoding: address = base | (size/2 - 1)
// For 64KB: 0x80000000 | 0x7FFF = 0x80007FFF
pmpaddr0 = 0x80007FFF >> 2; // Right shift by 2 (PMP addresses are >> 2)
pmpcfg0 = 0x1F; // L=0, A=3 (NAPOT), X=1, W=1, R=1
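// Example: Protect range [0x80000000, 0x80010000) using TOR
// The TOR entry is the one holding the TOP address (entry 1);
// entry 0 stays OFF and only supplies the bottom address
pmpaddr0 = 0x80000000 >> 2; // Start address
pmpaddr1 = 0x80010000 >> 2; // End address
pmpcfg0 = 0x09 << 8; // Entry 1 (bits 15:8): A=1 (TOR), X=0, W=0, R=1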
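The NAPOT arithmetic is easy to get wrong by a shift, so a small C sketch of the encoding and its inverse may help. The function names are hypothetical; the encoder assumes `size` is a power of two of at least 8 bytes and that `base` is aligned to `size`.

```c
#include <stdint.h>

/* Encode a NAPOT region into a pmpaddr value:
 * pmpaddr = (base >> 2) | ((size >> 3) - 1). */
static uint64_t pmp_napot_encode(uint64_t base, uint64_t size)
{
    return (base >> 2) | ((size >> 3) - 1);
}

/* Decode a NAPOT pmpaddr value: k trailing one bits mean a region of
 * 2^(k+3) bytes. The loop bound keeps the shifts within 64 bits. */
static void pmp_napot_decode(uint64_t pmpaddr, uint64_t *base, uint64_t *size)
{
    unsigned ones = 0;
    while (ones < 60 && ((pmpaddr >> ones) & 1))
        ones++;
    *size = 1ULL << (ones + 3);
    *base = (pmpaddr & ~((1ULL << ones) - 1)) << 2;
}
```

For the 64KB region at 0x80000000, the encoder yields 0x20001FFF, the same value as `0x80007FFF >> 2`.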
PMP Priority and Matching
When an access occurs, PMP checks entries from lowest to highest index. The first matching entry determines the access permissions. If no entry matches, an M-mode access succeeds, while an S-mode or U-mode access fails whenever at least one PMP entry is implemented. M-mode accesses are checked only against locked entries (L=1).
Access check algorithm:
1. For i = 0 to N-1:
- If address matches pmpaddr[i] region:
- Check permissions in pmpcfg[i]
- If allowed: grant access
- If denied: raise access fault
2. If no match: allow the access in M-mode; deny it in S-mode and U-mode
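The priority rule can be expressed as a compact software model. This is a simplified sketch, not a description of any particular hardware: all names are hypothetical, it checks a single byte address (real hardware checks every byte of the access), and it treats unlocked entries as not binding M-mode.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PMP_R = 1, PMP_W = 2, PMP_X = 4, PMP_L = 0x80 };
enum { PMP_OFF = 0, PMP_TOR = 1, PMP_NA4 = 2, PMP_NAPOT = 3 };

struct pmp_entry { uint8_t cfg; uint64_t addr; /* pmpaddr = byte address >> 2 */ };

/* Compute the byte range [lo, hi) of entry i; returns false if the entry is OFF. */
static bool pmp_range(const struct pmp_entry *e, size_t i,
                      uint64_t *lo, uint64_t *hi)
{
    switch ((e[i].cfg >> 3) & 3) {
    case PMP_TOR:
        *lo = (i == 0) ? 0 : e[i - 1].addr << 2;
        *hi = e[i].addr << 2;
        return true;
    case PMP_NA4:
        *lo = e[i].addr << 2;
        *hi = *lo + 4;
        return true;
    case PMP_NAPOT: {
        unsigned ones = 0;
        while (ones < 60 && ((e[i].addr >> ones) & 1))
            ones++;
        *lo = (e[i].addr & ~((1ULL << ones) - 1)) << 2;
        *hi = *lo + (1ULL << (ones + 3));
        return true;
    }
    default:
        return false;
    }
}

/* access is one of PMP_R, PMP_W, PMP_X. The lowest-numbered match decides. */
static bool pmp_check(const struct pmp_entry *e, size_t n, uint64_t addr,
                      unsigned access, bool m_mode)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t lo, hi;
        if (!pmp_range(e, i, &lo, &hi) || addr < lo || addr >= hi)
            continue;
        if (m_mode && !(e[i].cfg & PMP_L))
            return true;                 /* unlocked entries do not bind M-mode */
        return (e[i].cfg & access) != 0;
    }
    return m_mode;  /* no match: M-mode succeeds, S/U-mode fails */
}
```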
Lock Bit
The lock bit (L) prevents further modification of a PMP entry until the next reset. A locked entry is also enforced against M-mode accesses. This is crucial for protecting M-mode firmware from being disabled by compromised S-mode code:
// Lock the firmware region
pmpaddr0 = 0x80000000 >> 2;
pmpaddr1 = 0x80010000 >> 2;
pmpcfg0 = 0x89 << 8; // Entry 1: L=1, A=1 (TOR), X=0, W=0, R=1 (read-only for every mode, including M-mode)
// Any subsequent write to entry 1's config byte or to pmpaddr0/1 is ignored
pmpcfg0 = 0x00; // This write has no effect on entry 1!
PMP Use Cases
- Firmware Protection: M-mode firmware protects itself from S-mode OS
- Device Memory Protection: Prevent unauthorized access to MMIO regions
- Task Isolation: Embedded RTOS isolates tasks without MMU
- Secure Boot: Protect boot ROM and secure storage
13.2 IOMMU for RISC-V
The DMA Problem
Direct Memory Access (DMA) allows devices to access memory without CPU intervention, improving performance for I/O-intensive workloads. But DMA creates a security problem: devices use physical addresses and bypass the CPU’s virtual memory protection. A malicious or buggy device could read sensitive data or corrupt kernel memory.
An Input-Output Memory Management Unit (IOMMU) solves this by providing address translation and access control for devices. Just as the MMU translates virtual addresses for the CPU, the IOMMU translates device addresses for peripherals. This enables:
- Device isolation: Each device sees only its own memory
- Virtualization: Virtual machines can safely pass through devices
- Large address spaces: 32-bit devices can access >4GB memory
RISC-V IOMMU Architecture
The RISC-V IOMMU specification defines a standard interface for device address translation. The IOMMU sits between devices and memory, intercepting device memory requests and translating them through device page tables.
Device → IOMMU → Memory
↓
Device Context
Device Page Tables
Key components:
Device Context: Per-device configuration (page table pointer, permissions)
Device Directory Table (DDT): Maps device IDs to device contexts
I/O Page Tables: Similar to CPU page tables, but for device addresses
Command Queue: Software sends commands to IOMMU (invalidate TLB, etc.)
Fault Queue: IOMMU reports translation faults to software
Device Address Translation
When a device issues a memory request:
- IOMMU extracts device ID from the request
- Looks up device context in DDT
- Walks device page tables to translate address
- Checks permissions (read/write/execute)
- Forwards translated request to memory or reports fault
// Simplified IOMMU translation
struct device_context {
uint64_t page_table_root; // Root page table address
uint32_t permissions; // Device permissions
uint32_t address_width; // Device address width
};
// Device issues read from device address 0x1000
device_addr = 0x1000;
device_id = 0x42;
// IOMMU lookup
ctx = ddt[device_id];
physical_addr = walk_page_table(ctx.page_table_root, device_addr);
if (physical_addr && (ctx.permissions & READ)) {
forward_to_memory(physical_addr);
} else {
report_fault(device_id, device_addr);
}
IOMMU Page Table Formats
RISC-V IOMMU supports multiple page table formats:
- Sv39/Sv48/Sv57: Same format as CPU page tables (for simplicity)
- MSI Page Tables: Special format for Message Signaled Interrupts
Using the same format as CPU page tables simplifies software—the OS can reuse existing page table code for device mappings.
IOMMU vs ARM SMMU
ARM’s System Memory Management Unit (SMMU) provides similar functionality:
| Feature | RISC-V IOMMU | ARM SMMU |
|---|---|---|
| Page Table Format | Sv39/Sv48/Sv57 | LPAE (Long descriptor) |
| Device Identification | Device ID (configurable) | Stream ID |
| Command Interface | Command queue | CMDQ (Command Queue) |
| Fault Reporting | Fault queue | Event queue |
| Virtualization | Two-stage translation | Stage 1 + Stage 2 |
| Complexity | Simpler, modular | More complex, feature-rich |
RISC-V IOMMU emphasizes simplicity and reuse of existing CPU MMU concepts, while ARM SMMU has evolved through multiple generations with extensive features.
13.3 Platform-Level Interrupt Integration
Interrupt Routing in SoCs
A typical RISC-V SoC has dozens of interrupt sources: UARTs, timers, GPIOs, network controllers, storage devices. Each device needs to signal the CPU when it requires attention. The Platform-Level Interrupt Controller (PLIC) manages this complexity by:
- Collecting interrupts from all devices
- Routing interrupts to appropriate cores
- Managing interrupt priorities
- Providing claim/complete mechanism
PLIC Architecture
Interrupt Sources (1-1023)
↓
PLIC Gateway (per source)
↓
PLIC Core (priority arbitration)
↓
Interrupt Targets (cores × contexts)
Key concepts:
Interrupt Source: A device that can generate interrupts (numbered 1-1023, source 0 is reserved)
Interrupt Gateway: Converts device interrupt signal to PLIC internal format
Interrupt Target: A CPU context (M-mode or S-mode on each core)
Priority: Each source has a priority (0 = never interrupt; higher values mean higher priority, commonly 1-7; the number of levels is implementation-defined)
Threshold: Each target has a threshold (only interrupts with priority > threshold are delivered)
PLIC Memory Map
The PLIC is accessed through memory-mapped registers:
Base Address: 0x0C000000 (typical)
Interrupt Priorities:
0x0C000000 + 4*source_id: Priority for source (0-7)
Interrupt Pending:
0x0C001000: Pending bits (1024 bits, read-only)
Interrupt Enable:
0x0C002000 + 0x80*context: Enable bits for context
Priority Threshold:
0x0C200000 + 0x1000*context: Threshold for context
Claim/Complete:
0x0C200004 + 0x1000*context: Claim and complete register
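The offsets above translate directly into address helpers. The following C sketch computes register addresses only and performs no MMIO; the function names are hypothetical, and `PLIC_BASE` is the typical value quoted above and will differ between platforms.

```c
#include <stdint.h>

#define PLIC_BASE 0x0C000000u  /* typical base; platform-specific */

/* Priority register of one interrupt source. */
static uint32_t plic_priority_addr(uint32_t source)
{
    return PLIC_BASE + 4u * source;
}

/* 32-bit enable word that holds the bit for (context, source). */
static uint32_t plic_enable_addr(uint32_t context, uint32_t source)
{
    return PLIC_BASE + 0x2000u + 0x80u * context + 4u * (source / 32u);
}

/* Bit position of a source inside its pending or enable word. */
static uint32_t plic_source_bit(uint32_t source)
{
    return 1u << (source % 32u);
}

/* Priority threshold register of one context. */
static uint32_t plic_threshold_addr(uint32_t context)
{
    return PLIC_BASE + 0x200000u + 0x1000u * context;
}

/* Claim/complete register: the word right after the threshold register. */
static uint32_t plic_claim_addr(uint32_t context)
{
    return plic_threshold_addr(context) + 4u;
}
```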
Interrupt Handling Flow
- Device asserts interrupt: Device raises interrupt line
- PLIC gateway captures: Gateway sets pending bit
- PLIC arbitration: PLIC selects highest priority pending interrupt for each target
- CPU notification: PLIC asserts external interrupt to CPU
- Software claim: Interrupt handler reads claim register (returns source ID, clears pending)
- Software handling: Handler services the device
- Software complete: Handler writes source ID to complete register
// PLIC claim/complete helpers (context 0; other contexts add 0x1000 * context)
uint32_t plic_claim(void) {
    volatile uint32_t *claim = (volatile uint32_t *)(PLIC_BASE + 0x200004);
    return *claim; // Reading claim atomically claims the interrupt
}
void plic_complete(uint32_t source) {
    volatile uint32_t *complete = (volatile uint32_t *)(PLIC_BASE + 0x200004);
    *complete = source; // Writing complete releases the interrupt
}
// PLIC interrupt handler
void plic_handler(void) {
    uint32_t source = plic_claim(); // Read claim register
    if (source == 0) {
        return; // 0 means nothing was pending, so there is nothing to complete
    }
    if (source == UART0_IRQ) {
        uart0_interrupt_handler();
    } else if (source == TIMER_IRQ) {
        timer_interrupt_handler();
    }
    // ... handle other sources
    plic_complete(source); // Write to complete register
}
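The arbitration and claim steps can be modelled in software for a single target. This is a sketch with hypothetical names (`plic_model`, `plic_notify`, `plic_model_claim`) limited to 32 sources. It assumes the PLIC specification's rules that ties go to the lowest source ID and that a claim is not affected by the threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSRC 32

struct plic_model {
    uint32_t priority[NSRC];  /* per-source priority, 0 = never interrupt */
    uint32_t pending;         /* bit s set: source s is pending */
    uint32_t enable;          /* bit s set: source s enabled for this target */
    uint32_t threshold;       /* target's priority threshold */
};

/* Highest-priority pending and enabled source with priority > floor.
 * Strict comparison makes the lowest source ID win ties. Returns 0 if none. */
static uint32_t plic_best(const struct plic_model *p, uint32_t floor)
{
    uint32_t best = 0, best_prio = floor;
    for (uint32_t s = 1; s < NSRC; s++) {
        uint32_t bit = 1u << s;
        if ((p->pending & bit) && (p->enable & bit) && p->priority[s] > best_prio) {
            best = s;
            best_prio = p->priority[s];
        }
    }
    return best;
}

/* Would the PLIC assert the external interrupt line for this target? */
static bool plic_notify(const struct plic_model *p)
{
    return plic_best(p, p->threshold) != 0;
}

/* Claim: return the best source regardless of threshold and clear its pending bit. */
static uint32_t plic_model_claim(struct plic_model *p)
{
    uint32_t s = plic_best(p, 0);
    if (s)
        p->pending &= ~(1u << s);
    return s;
}
```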
Multi-Core Interrupt Routing
In a multi-core system, each core has separate M-mode and S-mode contexts. The PLIC can route interrupts to specific cores:
// Configure UART interrupt to route to core 0 S-mode
#define PLIC_ENABLE_BASE 0x0C002000
#define UART0_IRQ 10
#define CORE0_S_MODE_CONTEXT 1
// Enable UART interrupt for core 0 S-mode
volatile uint32_t *enable = (volatile uint32_t *)(PLIC_ENABLE_BASE + 0x80 * CORE0_S_MODE_CONTEXT);
enable[UART0_IRQ / 32] |= (1u << (UART0_IRQ % 32));
// Set priority
volatile uint32_t *priority = (volatile uint32_t *)(PLIC_BASE + 4 * UART0_IRQ);
*priority = 5; // Priority 5
// Set threshold
volatile uint32_t *threshold = (volatile uint32_t *)(PLIC_BASE + 0x200000 + 0x1000 * CORE0_S_MODE_CONTEXT);
*threshold = 0; // Accept all priorities > 0
13.4 Memory-Mapped I/O (MMIO)
Unified Address Space
RISC-V uses memory-mapped I/O: devices appear as memory locations. Reading from a device register uses the same load instruction as reading from RAM. Writing to a device register uses the same store instruction. This unifies the programming model—no special I/O instructions needed.
# Read UART status register
li a0, 0x10000000 # UART base address
lw a1, 0(a0) # Read status register
# Write to UART data register
li a2, 'A' # Character to send
sw a2, 4(a0) # Write to data register
MMIO Address Regions
A typical RISC-V SoC memory map divides address space into regions:
0x00000000 - 0x0FFFFFFF: Debug/Boot ROM, CLINT, PLIC
0x10000000 - 0x1FFFFFFF: Peripherals (UART, SPI, GPIO, etc.)
0x20000000 - 0x3FFFFFFF: Reserved
0x40000000 - 0x7FFFFFFF: More peripherals
0x80000000 - 0xFFFFFFFF: DRAM
Each peripheral gets a block of addresses for its registers:
UART0: 0x10000000 - 0x10000FFF
0x10000000: Status register
0x10000004: Data register
0x10000008: Control register
0x1000000C: Baud rate register
MMIO Ordering Requirements
MMIO accesses have ordering requirements that differ from normal memory:
- Device registers may have side effects: Reading a status register might clear an interrupt flag
- Write order matters: Writing control registers in wrong order can cause device malfunction
- Read/write dependencies: A write must complete before a subsequent read sees the result
RISC-V provides fence instructions to enforce ordering:
# Order an MMIO write before a later MMIO read
li a0, 0x10000000
li a1, 0x1
sw a1, 0(a0) # Write to control register
fence iorw, iorw # Order the write before everything that follows
lw a2, 4(a0) # Read status register
The FENCE instruction takes two operands specifying predecessor and successor operations:
- i: Device input (MMIO read)
- o: Device output (MMIO write)
- r: Memory read
- w: Memory write
Common patterns:
- fence iorw, iorw: Full fence (all operations)
- fence ow, ow: Keep device outputs and memory writes in program order across the fence
- fence ir, ir: Keep device inputs and memory reads in program order across the fence
Uncached vs Cached MMIO
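MMIO regions must not be cached. On RISC-V, cacheability is normally a physical memory attribute (PMA) that the platform fixes per address range; where the Svpbmt extension is implemented, a page table entry can also select an uncached I/O memory type. Caching device registers would cause: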
- Stale data: Cache might return old value instead of current device state
- Lost writes: Write to cached location might not reach device
- Side effect loss: Reading cached value doesn’t trigger device side effects
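Page table entries for MMIO carry the ordinary permission bits; the memory type comes from the PMA or, with Svpbmt, from the PBMT field:
// Map an MMIO page (leaf PTE)
pte = (physical_addr >> 12) << 10; // PPN
pte |= PTE_V | PTE_R | PTE_W; // Valid, readable, writable
pte |= PTE_A | PTE_D; // Accessed, dirty
// The base PTE format has no "cacheable" bit. With Svpbmt, also set
// PBMT (bits 62:61) to 2 (IO); otherwise the PMA decides.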
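For a version that can be compiled and checked, the sketch below builds a leaf PTE for an MMIO page in C. The macro and function names are hypothetical, the flag values follow the standard RV64 PTE bit positions, and the I/O memory type is applied only when the caller says the Svpbmt extension is available.

```c
#include <stdint.h>

#define PTE_V (1ULL << 0)   /* valid */
#define PTE_R (1ULL << 1)   /* readable */
#define PTE_W (1ULL << 2)   /* writable */
#define PTE_A (1ULL << 6)   /* accessed */
#define PTE_D (1ULL << 7)   /* dirty */
#define PTE_PBMT_IO (2ULL << 61)  /* Svpbmt: PBMT field (bits 62:61) = IO */

/* Build a read/write leaf PTE for the 4KB page containing pa. */
static uint64_t make_mmio_pte(uint64_t pa, int have_svpbmt)
{
    uint64_t pte = (pa >> 12) << 10;              /* PPN starts at bit 10 */
    pte |= PTE_V | PTE_R | PTE_W | PTE_A | PTE_D;
    if (have_svpbmt)
        pte |= PTE_PBMT_IO;                       /* uncached, strongly ordered I/O */
    return pte;
}

/* Recover the page's physical base address from a PTE (PPN is bits 53:10). */
static uint64_t pte_to_pa(uint64_t pte)
{
    return ((pte >> 10) & ((1ULL << 44) - 1)) << 12;
}
```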
13.5 SoC Memory Map
Typical RISC-V SoC Layout
A complete RISC-V SoC memory map includes ROM, RAM, peripherals, and reserved regions. Here’s a typical layout for a 32-bit SoC:
Address Range Size Description
0x00000000-0x00000FFF 4 KB Debug ROM
0x00001000-0x0000FFFF 60 KB Reserved
0x00010000-0x0001FFFF 64 KB Boot ROM (mask ROM)
0x00020000-0x00FFFFFF ~16 MB Reserved
0x01000000-0x01FFFFFF 16 MB CLINT (Core-Local Interruptor)
0x02000000-0x0BFFFFFF 160 MB Reserved
0x0C000000-0x0FFFFFFF 64 MB PLIC
0x10000000-0x1000FFFF 64 KB UART0
0x10010000-0x1001FFFF 64 KB SPI0
0x10020000-0x1002FFFF 64 KB GPIO
0x10030000-0x1FFFFFFF ~256 MB Other peripherals
0x20000000-0x3FFFFFFF 512 MB Reserved
0x40000000-0x7FFFFFFF 1 GB External devices
0x80000000-0xFFFFFFFF 2 GB DRAM
Address Decode and Routing
The SoC interconnect decodes addresses and routes requests to appropriate components:
CPU issues load/store
↓
Address decode
↓
├─ 0x00000000-0x0FFFFFFF → Boot ROM / CLINT / PLIC
├─ 0x10000000-0x1FFFFFFF → Peripheral bus
├─ 0x20000000-0x7FFFFFFF → Reserved / External
└─ 0x80000000-0xFFFFFFFF → DRAM controller
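In software terms, the decoder is a chain of range comparisons. The C sketch below mirrors the four branches of the diagram for a 32-bit address; the enum and function names are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

enum region {
    REGION_LOW,       /* 0x00000000-0x0FFFFFFF: Boot ROM / CLINT / PLIC */
    REGION_PERIPH,    /* 0x10000000-0x1FFFFFFF: peripheral bus */
    REGION_EXTERNAL,  /* 0x20000000-0x7FFFFFFF: reserved / external */
    REGION_DRAM       /* 0x80000000-0xFFFFFFFF: DRAM controller */
};

/* Route a physical address to the component that owns it. */
static enum region decode(uint32_t addr)
{
    if (addr < 0x10000000u) return REGION_LOW;
    if (addr < 0x20000000u) return REGION_PERIPH;
    if (addr < 0x80000000u) return REGION_EXTERNAL;
    return REGION_DRAM;
}
```

Hardware typically makes the same decision by comparing only the upper address bits, which is why region boundaries are usually aligned to powers of two.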
Memory Map Examples
Different RISC-V platforms use different memory maps:
SiFive FU540 (HiFive Unleashed):
0x00001000: Boot ROM
0x02000000: CLINT
0x0C000000: PLIC
0x10010000: UART0
0x10040000: QSPI0
0x10060000: GPIO
0x80000000: DDR (8 GB)
Kendryte K210:
0x02000000: CLINT
0x0C000000: PLIC
0x40800000: AI accelerator (KPU)
0x50000000: Peripherals
0x80000000: SRAM (6 MB general-purpose + 2 MB AI)
0x88000000: Boot ROM
QEMU virt machine:
0x00001000: Boot ROM
0x02000000: CLINT
0x0C000000: PLIC
0x10000000: UART0
0x10001000: VirtIO devices
0x80000000: DRAM (configurable)
13.6 System Interconnects
The Need for Interconnects
A modern SoC has multiple masters (CPU cores, DMA controllers, GPUs) and multiple slaves (memory, peripherals, accelerators). An interconnect fabric connects these components, handling:
- Address routing: Directing requests to correct destination
- Arbitration: Managing concurrent accesses
- Data width conversion: Connecting 32-bit devices to 64-bit buses
- Clock domain crossing: Bridging different clock frequencies
AXI (Advanced eXtensible Interface)
ARM’s AMBA AXI is widely used in RISC-V SoCs due to its maturity and IP availability. AXI4 provides:
- Separate read/write channels: Independent read and write transactions
- Burst transfers: Efficient multi-beat transfers
- Out-of-order completion: Transactions with different IDs can complete in any order
- Quality of Service (QoS): Priority-based arbitration
AXI signals:
Write Address Channel: AWADDR, AWLEN, AWSIZE, AWVALID, AWREADY
Write Data Channel: WDATA, WSTRB, WLAST, WVALID, WREADY
Write Response: BRESP, BVALID, BREADY
Read Address Channel: ARADDR, ARLEN, ARSIZE, ARVALID, ARREADY
Read Data Channel: RDATA, RRESP, RLAST, RVALID, RREADY
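The AWLEN/ARLEN and AWSIZE/ARSIZE signals encode the burst shape: the length field holds the number of beats minus one, and the size field holds log2 of the bytes per beat. The sketch below works through that arithmetic, plus the AXI rule that a burst must not cross a 4 KB boundary. The function names are hypothetical, and the boundary check assumes an incrementing burst whose start address is aligned to the beat size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Beats in a burst: AxLEN encodes (beats - 1). */
uint32_t axi_beats(uint8_t axlen) {
    return (uint32_t)axlen + 1u;
}

/* Total bytes moved: beats * 2^AxSIZE. */
uint32_t axi_burst_bytes(uint8_t axlen, uint8_t axsize) {
    return axi_beats(axlen) << axsize;
}

/* True if an aligned incrementing burst starting at addr stays within one 4 KB region. */
bool axi_burst_within_4k(uint64_t addr, uint8_t axlen, uint8_t axsize) {
    uint64_t last = addr + axi_burst_bytes(axlen, axsize) - 1u;
    return (addr >> 12) == (last >> 12);
}
```

For example, AWLEN = 15 with AWSIZE = 3 describes 16 beats of 8 bytes, or 128 bytes in total.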
AHB (Advanced High-performance Bus)
AHB is simpler than AXI, suitable for lower-performance peripherals:
- Single shared bus: One transfer in flight at a time, with no independent read and write channels
- Pipelined: Two-stage pipeline (address, data)
- Simpler protocol: Easier to implement
- Lower performance: No out-of-order, limited bursts
TileLink
TileLink is a RISC-V-native interconnect developed at UC Berkeley:
- Designed for RISC-V: Matches RISC-V memory model
- Scalable: From simple embedded to complex multi-core
- Cache coherence: Built-in support for coherent caches
- Three conformance levels:
- TL-UL: Uncached Lightweight (simple peripherals)
- TL-UH: Uncached Heavyweight (DMA, accelerators)
- TL-C: Cached (coherent caches)
TileLink advantages for RISC-V:
- Native support for RISC-V atomics (LR/SC, AMO)
- Efficient cache coherence protocol
- Open specification (no licensing)
Interconnect Comparison
| Feature | AXI4 | AHB | TileLink |
|---|---|---|---|
| Channels | 5 independent | 1 shared | 2 (A, D); 5 with TL-C (A to E) |
| Burst Support | Yes (up to 256 beats) | Yes (limited) | Yes |
| Out-of-Order | Yes | No | Yes |
| Cache Coherence | No (needs ACE) | No | Yes (TL-C) |
| Complexity | High | Low | Medium |
| Performance | High | Medium | High |
| RISC-V Atomics | Requires extensions | Requires extensions | Native support |
| Licensing | ARM (free for use) | ARM (free for use) | Open (BSD) |
| Ecosystem | Mature, extensive IP | Mature, simple IP | Growing, RISC-V focused |
Choosing an Interconnect
- AXI: Best for high-performance SoCs, extensive IP ecosystem, industry standard
- AHB: Best for simple embedded systems, low-cost peripherals
- TileLink: Best for RISC-V-native designs, cache coherence, open ecosystem
13.7 DMA and Coherency
DMA Controller Integration
A DMA controller transfers data between memory and peripherals without CPU intervention. This frees the CPU for other tasks while large data transfers proceed in the background.
Typical DMA use cases:
- Disk I/O: Transfer data between storage and memory
- Network I/O: Move packets between NIC and memory
- Audio/Video: Stream data to/from media devices
- Memory-to-memory: Fast memory copy operations
DMA controller architecture:
CPU configures DMA
↓
DMA reads source (memory or device)
↓
DMA writes destination (device or memory)
↓
DMA signals completion (interrupt)
Cache Coherency Considerations
DMA creates coherency problems when the CPU has caches:
Problem 1: Stale memory data
1. CPU writes data to memory (data in cache, not yet in RAM)
2. DMA reads from memory (gets old data, not cached data)
3. DMA sends wrong data to device
Problem 2: Stale cache data
1. DMA writes data to memory
2. CPU reads data (gets old cached data, not new DMA data)
3. CPU processes wrong data
Solutions:
- Software cache management (simple, lower performance):
// DMA read (device → memory)
cache_invalidate(buffer, size); // Drop cached lines so none is written back over the DMA data
dma_start(device, buffer, size);
dma_wait_complete();
cache_invalidate(buffer, size); // Discard stale or prefetched lines
// Now CPU can read fresh data
// DMA write (memory → device)
cache_flush(buffer, size); // Write cached data to memory
dma_start(buffer, device, size);
dma_wait_complete();
- Hardware cache coherence (complex, higher performance):
- DMA controller participates in cache coherence protocol
- DMA snoops CPU caches or uses coherent interconnect
- Requires coherent interconnect (ACE for AXI, TL-C for TileLink)
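Cache maintenance works on whole cache lines, so a software-managed DMA buffer that shares a line with unrelated data can corrupt that data on invalidate. The sketch below computes which lines a buffer touches and whether it is safe to invalidate. It assumes 64-byte lines, and the helper names are hypothetical rather than part of any real driver API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

struct line_range {
    uintptr_t start;  /* first byte of the first line touched */
    uintptr_t end;    /* one past the last byte of the last line touched */
};

/* Round a buffer out to the cache lines it overlaps. */
struct line_range cache_line_span(uintptr_t addr, size_t size) {
    struct line_range r;
    r.start = addr & ~(uintptr_t)(CACHE_LINE - 1u);
    r.end   = (addr + size + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u);
    return r;
}

/* Number of lines a flush or invalidate over the buffer must visit. */
size_t cache_line_count(uintptr_t addr, size_t size) {
    struct line_range r = cache_line_span(addr, size);
    return (size_t)((r.end - r.start) / CACHE_LINE);
}

/* A buffer is safe to invalidate only if it owns every line it touches. */
int dma_buffer_is_line_aligned(uintptr_t addr, size_t size) {
    return (addr % CACHE_LINE == 0u) && (size % CACHE_LINE == 0u);
}
```

Allocating DMA buffers with line-sized alignment and padding avoids the partial-line case entirely.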
DMA and Virtual Memory
DMA controllers typically use physical addresses, but software uses virtual addresses. This creates challenges:
Problem: Virtual address buffer might span non-contiguous physical pages
Virtual: [0x1000-0x2FFF] (8 KB contiguous)
Physical: [0x80000000-0x80000FFF] + [0x85000000-0x85000FFF] (non-contiguous!)
Solutions:
- Scatter-Gather DMA: DMA controller accepts list of physical address/length pairs
struct sg_entry {
uint64_t addr; // Physical address
uint32_t len; // Length in bytes
};
struct sg_entry sg_list[] = {
{0x80000000, 4096},
{0x85000000, 4096},
};
dma_start_sg(device, sg_list, 2);
- IOMMU: Translate device addresses to physical addresses (see Section 13.2)
- Physically contiguous buffers: Allocate DMA buffers from reserved physical memory
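To show how a scatter-gather list could be derived from a virtual buffer, here is a minimal sketch. It assumes 4 KB pages and a caller-supplied translation callback; `virt_to_phys_fn`, `build_sg_list()`, and `demo_v2p()` are hypothetical names, and `demo_v2p()` hard-codes the two-page mapping from the example above.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumption: 4 KB pages */

struct sg_entry {
    uint64_t addr;  /* Physical address */
    uint32_t len;   /* Length in bytes */
};

/* Hypothetical callback: translate one virtual address to a physical address. */
typedef uint64_t (*virt_to_phys_fn)(uint64_t vaddr);

/* Split [vaddr, vaddr + size) at page boundaries and merge physically adjacent
 * chunks. Returns the number of entries written, or -1 if sg is too small. */
int build_sg_list(uint64_t vaddr, uint32_t size, virt_to_phys_fn v2p,
                  struct sg_entry *sg, int max_entries) {
    int n = 0;
    while (size > 0) {
        uint32_t in_page = PAGE_SIZE - (uint32_t)(vaddr & (PAGE_SIZE - 1u));
        uint32_t chunk = size < in_page ? size : in_page;
        uint64_t paddr = v2p(vaddr);
        if (n > 0 && sg[n - 1].addr + sg[n - 1].len == paddr) {
            sg[n - 1].len += chunk;          /* extends the previous entry */
        } else {
            if (n == max_entries) return -1; /* list is full */
            sg[n].addr = paddr;
            sg[n].len = chunk;
            n++;
        }
        vaddr += chunk;
        size -= chunk;
    }
    return n;
}

/* Demo mapping: virtual page 0x1000 -> 0x80000000, every other page -> 0x85000000. */
uint64_t demo_v2p(uint64_t vaddr) {
    uint64_t off = vaddr & (PAGE_SIZE - 1u);
    return ((vaddr >> 12) == 1u ? 0x80000000ull : 0x85000000ull) + off;
}
```

Merging adjacent chunks keeps the list short when the allocator happens to return contiguous physical pages.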
🛠️ Hands-on Lab: Lab 13.1 — Memory Firewall (PMP Shield)
This lab demonstrates PMP’s core functionality: setting protection rules in M-mode, then switching to S-mode to attempt a violation.
Lab Objectives
- Configure PMP Entry 0 as Read-Only (R=1, W=0) to protect target variable
- Configure PMP Entry 1 as Allow-All (R=1, W=1, X=1) to let other code run normally
- Switch to S-mode and attempt a write to trigger Store Access Fault
NAPOT Encoding Principle
NAPOT encoding can be abstract for beginners. The key formula is:
pmpaddr = (base_addr >> 2) | ((size >> 3) - 1)
Example: Protect a 4KB region starting at 0x80200000
- Base = 0x80200000, Size = 4KB (0x1000)
- 0x80200000 >> 2 = 0x20080000
- (0x1000 >> 3) - 1 = 0x1FF
- pmpaddr = 0x20080000 | 0x1FF = 0x200801FF
Encoding Rules:
| pmpaddr low bits | Corresponding region size |
|---|---|
| ...aaaaa0 | 8 bytes |
| ...aaaa01 | 16 bytes |
| ...aaa011 | 32 bytes |
| ...a01111 | 128 bytes |
| ...0111111111 | 4KB (what we use) |
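The formula and the encoding table can be checked with a few lines of C. This is a sketch with hypothetical helper names; it assumes the size is a power of two of at least 8 bytes and that the base is aligned to the size.

```c
#include <stdint.h>

/* Encode a NAPOT region: base must be size-aligned, size a power of two >= 8. */
uint64_t napot_encode(uint64_t base, uint64_t size) {
    return (base >> 2) | ((size >> 3) - 1u);
}

/* Decode the region size: n trailing one bits mean 2^(n + 3) bytes. */
uint64_t napot_size(uint64_t pmpaddr) {
    unsigned ones = 0;
    while (ones < 60u && ((pmpaddr >> ones) & 1u))
        ones++;
    return 1ull << (ones + 3u);
}

/* Decode the region base: clear the trailing ones, then restore the shift. */
uint64_t napot_base(uint64_t pmpaddr) {
    uint64_t mask = (napot_size(pmpaddr) >> 3) - 1u;
    return (pmpaddr & ~mask) << 2;
}
```

Calling `napot_encode(0x80200000, 0x1000)` reproduces the 0x200801FF value worked out by hand above.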
Code (pmp_lab.S)
.section .text
.global _start
_start:
# ---------------------------------------------------
# 1. Set up Trap Handler (catch Access Fault later)
# ---------------------------------------------------
la t0, trap_handler
csrw mtvec, t0
# ---------------------------------------------------
# 2. Configure PMP (in M-mode)
# ---------------------------------------------------
# [Target] Protect a 4KB region at 0x80200000
# Using NAPOT mode
li t0, 0x200801FF
csrw pmpaddr0, t0
# Entry 0: Enable + NAPOT + Read Only (R=1, W=0, X=0)
# PMP_R(1) | PMP_A_NAPOT(0x18) = 0x19
# Entry 1: Open other memory (Allow All)
# pmpaddr1 set to all 1s with NAPOT mode = entire address space
# (TOR would not work here: entry 1's TOR range starts at pmpaddr0,
# so code located below the protected region would match no entry)
# PMP_R(1) | PMP_W(2) | PMP_X(4) | PMP_A_NAPOT(0x18) = 0x1F
# pmpcfg0 = (pmp1cfg << 8) | pmp0cfg = 0x1F19
li t0, -1
csrw pmpaddr1, t0
li t0, 0x1F19
csrw pmpcfg0, t0 # Firewall activated!
# ---------------------------------------------------
# 3. Drop to S-mode
# ---------------------------------------------------
# Set mstatus.MPP = 01 (Supervisor)
li t0, (3 << 11)
csrc mstatus, t0 # Clear MPP
li t0, (1 << 11)
csrs mstatus, t0 # Set MPP to 01 (S-mode)
la t0, s_mode_entry
csrw mepc, t0
mret # Jump! Identity becomes Supervisor
s_mode_entry:
# ---------------------------------------------------
# 4. Trigger Attack (S-mode Attempt)
# ---------------------------------------------------
li a0, 0x80200000 # Address protected by PMP0
li t1, 0xDEADBEEF
# Attempt write! Should trigger Exception 7 (Store Access Fault)
sw t1, 0(a0)
# If we survive, experiment failed
li a0, 0
j stop
stop:
j stop
trap_handler:
# Read mcause to check exception type
csrr t0, mcause
# Exception 7 = Store Access Fault
li t1, 7
bne t0, t1, unexpected
# SUCCESS: PMP blocked the illegal write!
li a0, 1 # Return success code
j stop
unexpected:
li a0, -1 # Unexpected exception
j stop
Compile and Run
# Assemble
riscv64-unknown-elf-as -march=rv64g -o pmp_lab.o pmp_lab.S
# Link (ensure _start is entry point)
riscv64-unknown-elf-ld -T link.ld -o pmp_lab.elf pmp_lab.o
# Run on QEMU
qemu-system-riscv64 -machine virt -nographic -bios pmp_lab.elf
Expected Behavior
- M-mode: PMP entries configured, firewall activated
- mret: Privilege drops to S-mode
- sw instruction: Triggers Store Access Fault (Exception 7)
- Trap handler: Confirms PMP did its job
danieRTOS Reference: A real RTOS would use PMP to isolate kernel data from user tasks, preventing task corruption.
⚠️ Common Pitfalls
Pitfall 1: PMP Priority Order Error
Error Scenario: Put “deny rule” in pmp15, put “allow rule” in pmp0.
Consequence: pmp0 matches first, allowing all access. The deny rule never takes effect.
// ❌ Wrong: Order reversed
pmp0: Allow All (RWX) // Matches first, permits everything
pmp1: Deny 0x80200000 // Never gets checked
// ✅ Correct: Write specific Deny first, generic Allow last
pmp0: Read-Only 0x80200000 // Check sensitive region first
pmp1: Allow All // Allow other regions
💡 Memory aid: Like firewall rules, write exceptions first, default last.
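The first-match rule can be demonstrated with a small software model. This sketch is deliberately simplified: each entry stores an explicit base and size instead of the real pmpaddr/pmpcfg encoding, only a single byte address is checked, and `pmp_entry` and `pmp_allows()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PMP_R 0x01u
#define PMP_W 0x02u
#define PMP_X 0x04u

/* Simplified entry: explicit range instead of the pmpaddr/pmpcfg encoding. */
struct pmp_entry {
    uint64_t base;  /* first byte of the region */
    uint64_t size;  /* region length in bytes */
    uint8_t  perm;  /* combination of PMP_R / PMP_W / PMP_X */
};

/* Model an S/U-mode access check: the lowest-numbered matching entry decides. */
bool pmp_allows(const struct pmp_entry *e, size_t n, uint64_t addr, uint8_t need) {
    for (size_t i = 0; i < n; i++) {
        if (addr >= e[i].base && addr - e[i].base < e[i].size)
            return (e[i].perm & need) == need;  /* first match wins */
    }
    return false;  /* no entry matched: S/U-mode access is denied */
}
```

With the read-only entry first, a write to 0x80200000 is rejected; with the allow-all entry first, the same write passes and the read-only entry is never consulted.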
Pitfall 2: Forgetting the Default Deny Rule
Error Scenario: Only set one PMP Entry to protect the key area, forgot to open other memory.
Consequence: Code region not matched by any PMP Entry, S/U-mode can’t even fetch the next instruction.
# ❌ Wrong: Only one Entry
csrw pmpaddr0, t0 # Protect key
li t2, 0x19
csrw pmpcfg0, t2 # Read-Only
mret # Jump to S-mode then immediately Crash!
# ✅ Correct: Add Allow All Entry
csrw pmpaddr0, t0 # Protect key
csrw pmpaddr1, t1 # All 1s (NAPOT: entire address space)
li t2, 0x1F19
csrw pmpcfg0, t2 # pmp0=RO, pmp1=RWX (NAPOT)
Pitfall 3: Lock Bit Irreversibility
Error Scenario: Setting Lock bit (L=1) during development.
Consequence: PMP Entry locked. Only hardware Reset can unlock. Cannot modify rules during debug.
// ❌ Dangerous: Setting Lock during development
pmpcfg0 = 0x99; // L=1, A=NAPOT, R=1
// ✅ Recommended: Only set Lock in Production
#ifdef PRODUCTION
pmpcfg0 = 0x99; // Locked
#else
pmpcfg0 = 0x19; // Dev mode, not locked
#endif
💡 Reminder: Lock bit is meant to prevent malicious modification of M-mode Firmware. Don’t use it during development.
Summary
SoC integration connects RISC-V cores with the rest of the system. This chapter covered seven essential components that make a complete RISC-V system-on-chip.
Physical Memory Protection (PMP) provides hardware-enforced memory access control using physical addresses. PMP is configured from M-mode and restricts S-mode and U-mode accesses (and M-mode accesses through locked entries), protecting memory regions from untrusted code. With up to 64 configurable regions, four address matching modes (OFF, TOR, NA4, NAPOT), and lockable entries, PMP enables firmware protection, device memory isolation, and task separation in systems without MMUs.
IOMMU extends memory protection to devices by translating device addresses and enforcing access control. The RISC-V IOMMU uses the same page table format as the CPU MMU, simplifying software implementation. This enables device isolation, safe device passthrough for virtualization, and protection against malicious or buggy devices.
Platform-Level Interrupt Controller (PLIC) manages interrupt routing in multi-core systems. The PLIC collects interrupts from up to 1023 sources, arbitrates by priority, and routes them to appropriate CPU contexts. The claim/complete mechanism ensures atomic interrupt handling, while per-context enable masks and thresholds provide flexible interrupt management.
Memory-Mapped I/O (MMIO) provides a uniform mechanism for device access using standard load and store instructions. MMIO regions must be treated as uncached (through PMAs or Svpbmt page attributes), and fence instructions ensure proper ordering of device accesses. This unified address space simplifies the programming model compared to architectures with separate I/O instructions.
SoC memory maps organize address space into regions for ROM, RAM, peripherals, and reserved areas. Different RISC-V platforms use different layouts, but all follow the principle of address decode and routing through the interconnect fabric. Understanding the memory map is essential for firmware development and device driver programming.
System interconnects connect multiple masters and slaves in the SoC. AXI provides high performance with extensive IP ecosystem, AHB offers simplicity for embedded systems, and TileLink provides RISC-V-native features including cache coherence. The choice depends on performance requirements, IP availability, and coherence needs.
DMA and coherency enable efficient data transfers but require careful management of cache coherence. Software can use cache flush and invalidate operations, or hardware can provide coherent DMA through snooping or coherent interconnects. IOMMU or scatter-gather DMA solves the virtual-to-physical address translation problem for DMA transfers.
Together, these components form the foundation of RISC-V SoC design, enabling everything from simple microcontrollers to complex multi-core application processors.
Chapter 14. RISC-V Platform Profiles & Embedded Systems
Part VIII — System Design, Platform Spec & SoC Integration
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand the Fragmentation Problem: Grasp how ISA Fragmentation harms software ecosystems
- Distinguish Profile Types: Understand the use-case differences between RVA (Application) and RVM (Microcontroller)
- Decode Profile Names: Interpret the meaning of names like RVA22S, RVM23
💡 Scenario: À La Carte or Set Menu?
Scene: Junior has multiple chip vendor spec sheets spread across the desk, looking bewildered.
Junior: “Senior, I’m going cross-eyed looking at these spec sheets. This vendor says they support RV64GC, that one says RVA22, and another says RVM23. I just want to run Linux—which one do I pick? Can’t it be as simple as buying an x86 computer?”
Senior: “That’s the downside of RISC-V being ‘too free’—Fragmentation. Before, everyone assembled their own instruction sets: you want M, he wants F, I want C. Then you write software only to find that this CPU is missing an instruction, instant Crash.”
Junior: “So Profiles are meant to solve this?”
Senior: “Exactly. Think of it as officially certified set menus:
| Before (À la carte) | Now (Profile) |
|---|---|
| Pick your own instruction set | Official pre-configured menu |
| Forgot to order FPU = Linux won’t boot | RVA22 label = guaranteed to run |
| Verify compatibility for each product | One ISO runs all compliant hardware |
RVA (Application Profile) is the ‘deluxe menu’ for rich operating systems like Linux/Android. If a vendor dares to slap the RVA22 label on their chip, they guarantee it has MMU, atomic instructions, floating-point, and everything else needed to run Linux.”
Junior: “What about RVM?”
Senior: “RVM (Microcontroller Profile) is the ‘economy menu’ for RTOS or bare-metal. It drops heavy equipment like MMU, focusing on low power and real-time control. If you’re just making a smart rice cooker, RVM is enough; but if you’re making a smartphone, you definitely need RVA.”
Junior: “What about the numbers 20, 22, 23? Performance levels?”
Senior: “Not performance—year. Like a ‘2022 model year’ car. RVA22 represents standards defined in 2022, RVA23 adds some new features (like stronger vector instructions). Newer Linux distributions typically require newer Profile years.”
Junior: “Got it! So when choosing a CPU, I don’t need to check instructions one by one anymore—just confirm it meets the Profile menu I need!”
RISC-V’s modularity is both a strength and a challenge. The base ISA is minimal, and implementations choose which extensions to include. This flexibility enables optimization for specific use cases—a microcontroller might include only RV32IMC, while a server processor might implement RV64IMAFDCV. But this variability creates a problem: how does software know what features are available?
Platform profiles solve this by defining standard combinations of extensions for specific use cases. A profile specifies mandatory extensions, optional extensions, and implementation requirements. Software targeting a profile can assume certain features exist, simplifying development and improving portability. The RVA22 profile defines requirements for application processors running rich operating systems. The RVA23 profile adds newer extensions and stricter requirements. Embedded profiles target microcontrollers and real-time systems.
This chapter explores RISC-V platform profiles and embedded system design. We’ll examine the RVA22 and RVA23 profiles for application processors, embedded profiles for microcontrollers, the differences between MMU and no-MMU systems, and how RISC-V compares to ARM Cortex-M in the embedded space.
14.1 Platform Profiles Overview
The Fragmentation Problem
RISC-V’s modularity creates potential fragmentation. Consider these valid RISC-V implementations:
- RV32I: Minimal 32-bit core (base ISA only)
- RV32IMC: Embedded core (multiply, compressed)
- RV64IMAFDC: Application processor (full general-purpose)
- RV64IMAFDCV: High-performance with vectors
Software written for RV64IMAFDC won’t run on RV32IMC. Even within the same base (RV64), different extension combinations create incompatibilities. This makes it difficult to distribute binary software or develop portable operating systems.
Profiles as a Solution
A platform profile defines:
- Mandatory extensions: Must be implemented
- Optional extensions: May be implemented
- Mandatory features: Specific implementation requirements (e.g., minimum PMP entries)
- Discovery mechanism: How software detects features
Profiles create standard targets for software development. An OS can target “RVA22 profile” and know exactly what features are available. Hardware vendors can claim “RVA22 compliant” and guarantee compatibility with RVA22 software.
Profile Naming Convention
RISC-V profiles use a naming scheme:
- RV: RISC-V
- A/M/E: Application / Microcontroller / Embedded
- 22/23/…: Profile year (2022, 2023, etc.); formal ratification can follow later
- S/U: Supervisor mode / User mode (for application profiles)
- 64/32: Base XLEN, appended in the full ratified names (e.g., RVA22S64); this chapter uses the short forms
Examples:
- RVA22S: Application profile, 2022, Supervisor mode (full name RVA22S64)
- RVA22U: Application profile, 2022, User mode (full name RVA22U64)
- RVA23S: Application profile, 2023, Supervisor mode (full name RVA23S64)
- RVM23: Microcontroller profile, 2023
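Decoding a profile name is mechanical enough to sketch in code. The parser below accepts the short forms used in this chapter (such as RVA22S) as well as names that carry an XLEN suffix (such as RVA22U64). The `profile_name` struct and `parse_profile()` are hypothetical, and the sketch does not check whether a given family and year combination actually exists.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

struct profile_name {
    char family;  /* 'A', 'M', ... */
    int  year;    /* two-digit year, e.g. 22 */
    char mode;    /* 'U', 'S', or 0 when absent */
    int  xlen;    /* 32 or 64, or 0 when absent */
};

/* Parse "RV" + family letter + two-digit year + optional mode + optional XLEN. */
bool parse_profile(const char *s, struct profile_name *out) {
    if (strncmp(s, "RV", 2) != 0) return false;
    s += 2;
    if (!isupper((unsigned char)*s)) return false;
    out->family = *s++;
    if (!isdigit((unsigned char)s[0]) || !isdigit((unsigned char)s[1]))
        return false;
    out->year = (s[0] - '0') * 10 + (s[1] - '0');
    s += 2;
    out->mode = 0;
    out->xlen = 0;
    if (*s == 'U' || *s == 'S') out->mode = *s++;
    if (strcmp(s, "32") == 0) out->xlen = 32;
    else if (strcmp(s, "64") == 0) out->xlen = 64;
    else if (*s != '\0') return false;  /* unexpected trailing text */
    return true;
}
```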
Profile Versioning
Profiles evolve over time:
- Major version: New mandatory features (e.g., RVA22 → RVA23); software built for the newer profile may not run on older hardware
- Minor version: Backward-compatible additions
- Errata: Bug fixes, no functional changes
A profile version guarantees:
- Backward compatibility: RVA22 software runs on RVA23 hardware
- Feature stability: Mandatory extensions don’t change within a version
14.2 RVA22 Profile (Application Processor)
Target Use Case
RVA22 targets application processors running rich operating systems like Linux, FreeBSD, or commercial RTOSes. These systems need:
- Virtual memory (MMU)
- Privilege levels (M, S, U modes)
- Standard extensions for general-purpose computing
- Sufficient performance for application workloads
RVA22S (Supervisor Mode)
RVA22S defines requirements for systems running supervisor-mode operating systems.
Mandatory ISA Extensions:
- RV64I: 64-bit base integer ISA
- M: Integer multiplication and division
- A: Atomic instructions
- F: Single-precision floating-point
- D: Double-precision floating-point
- C: Compressed instructions
- Zicsr: CSR instructions
- Zifencei: Instruction fence
- Zicntr: Base counters (cycle, time, instret)
- Zihpm: Hardware performance counters
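- Ziccif: Main memory supports instruction fetch with atomicity
- Ziccrse: Main memory supports forward progress on LR/SC sequences
- Ziccamoa: Main memory supports all atomics in the A extension
- Zicclsm: Main memory supports misaligned loads/stores
- Za64rs: Reservation sets are naturally aligned and at most 64 bytes
- Zihintpause: Pause hint instruction
- Zba: Address generation (bit manipulation)
- Zbb: Basic bit manipulation
- Zbs: Single-bit instructions
- Zkt: Data-independent execution time (timing side-channel protection)
- Zic64b: Cache blocks are 64 bytes
- Zicbom / Zicbop / Zicboz: Cache-block management, prefetch, and zero instructions
- Zfhmin: Minimal half-precision floating-point support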
Mandatory Privileged Features:
- Sv39: Page-based virtual memory (39-bit virtual address)
- Svpbmt: Page-based memory types
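- Svade: Page fault on first access or first write to a page (software manages the A/D bits); hardware A/D updating (Svadu) is not required
- Svinval: Fine-grained address-translation cache invalidation
- Sstc: Supervisor-mode timer interrupts (optional in RVA22S64, mandatory from RVA23S64)
- Sscofpmf: Count overflow and privilege mode filtering (optional in RVA22S64, mandatory from RVA23S64)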
Mandatory Implementation Requirements:
- At least 8 PMP entries
- At least 29 hardware performance counters
- Misaligned loads/stores supported in main memory
- LR/SC reservation set size of 64 bytes
RVA22U (User Mode)
RVA22U defines requirements for user-mode applications. It’s a subset of RVA22S:
- Same ISA extensions as RVA22S
- No privileged features (no Sv39, no PMP requirements)
- Targets user-space applications running on RVA22S systems
Example: Checking RVA22 Compliance
// Check if system is RVA22S compliant
bool is_rva22s_compliant(void) {
// Check ISA extensions via misa
uint64_t misa = read_csr(misa);
if ((misa & MISA_I) == 0) return false; // RV64I
if ((misa & MISA_M) == 0) return false; // M extension
if ((misa & MISA_A) == 0) return false; // A extension
if ((misa & MISA_F) == 0) return false; // F extension
if ((misa & MISA_D) == 0) return false; // D extension
if ((misa & MISA_C) == 0) return false; // C extension
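// Check Sv39 support (run this in M-mode: misa is an M-mode CSR, and
// translation is not yet applied to M-mode accesses). satp.MODE is WARL,
// so a write selecting an unsupported mode does not take effect.
uint64_t satp = read_csr(satp);
write_csr(satp, (uint64_t)SATP_MODE_SV39 << 60);
bool has_sv39 = (read_csr(satp) >> 60) == SATP_MODE_SV39;
write_csr(satp, satp); // Restore on both the success and failure paths
if (!has_sv39) return false;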
// Check PMP entries (at least 8)
int pmp_count = count_pmp_entries();
if (pmp_count < 8) return false;
// Check other features via device tree or ACPI
// ...
return true;
}
14.3 RVA23 Profile
Improvements Over RVA22
RVA23, ratified in October 2024, builds on RVA22 with additional extensions and stricter requirements:
New Mandatory Extensions:
- V: Vector extension
- Zicond: Integer conditional operations
- Zimop: May-be-operations (reserved for future extensions)
- Zcmop: Compressed may-be-operations
- Zcb: Additional compressed instructions
- Zfa: Additional floating-point instructions
- Zawrs: Wait-on-reservation-set instructions
- Supm: Pointer masking for user-mode software
Enhanced Requirements:
- Minimum 16 PMP entries (up from 8)
- Hypervisor extension (H) mandatory in RVA23S64
- Sstc, Sscofpmf, and Svnapot are mandatory in RVA23S64; Sv48 and Sv57 remain optional
- Zvkt: Data-independent execution time extended to vector instructions
Backward Compatibility
RVA23 is backward-compatible with RVA22:
- All RVA22 mandatory extensions remain mandatory in RVA23
- RVA22 software runs unmodified on RVA23 hardware
- RVA23 adds features but doesn’t remove or change existing ones
Migration Path
Hardware vendors can support both profiles:
RVA22 hardware → RVA23 hardware
(2022-2024) (2024+)
↓ ↓
RVA22 software runs on both
RVA23 software runs only on RVA23+
14.4 Embedded Profiles
Embedded System Requirements
Embedded systems have different constraints than application processors:
- Limited resources: Small memory, low power, cost-sensitive
- Real-time requirements: Deterministic interrupt latency, predictable timing
- No virtual memory: Many embedded systems run without MMU
- Simpler software: Bare-metal or RTOS, not full OS
RISC-V embedded profiles target these use cases with minimal mandatory extensions and flexible configurations.
RV32E Base ISA
RV32E is a reduced version of RV32I with only 16 registers (x0-x15) instead of 32. This saves:
- Silicon area: Smaller register file
- Power: Fewer registers to manage
- Context-switch cost: Fewer registers to save and restore (instruction encodings are unchanged, so code size does not shrink)
RV32E is suitable for ultra-low-cost microcontrollers where every gate counts.
# RV32E example: Only x0-x15 available
addi x10, x0, 42 # OK: x10 is in range
addi x20, x0, 42 # ERROR: x20 doesn't exist in RV32E
Microcontroller-Oriented Features
Embedded profiles emphasize:
- Compressed instructions (C): Reduce code size by 25-30%
- Multiply (M): Essential for many embedded algorithms
- Bit manipulation (B): Efficient for embedded control tasks
- Fast interrupts: Low-latency interrupt handling
Interrupt Handling for Embedded
Embedded systems need fast, deterministic interrupt response. RISC-V provides two interrupt architectures:
- CLINT (Core-Local Interruptor): Basic timer and software interrupts
- CLIC (Core-Local Interrupt Controller): Advanced interrupt controller for embedded
CLIC features:
- Vectored interrupts: Jump directly to handler (no dispatch overhead)
- Nested interrupts: Higher-priority interrupts preempt lower-priority
- Tail-chaining: Back-to-back interrupts without full context save/restore
- Configurable levels: Up to 256 priority levels
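// CLIC interrupt handler (vectored)
// The interrupt attribute makes the compiler save/restore registers and return with mret
__attribute__((interrupt)) void uart_irq_handler(void) {
// Directly entered, no dispatch needed
char c = UART->DATA;
buffer[head] = c;
head = (head + 1) % BUFFER_SIZE; // BUFFER_SIZE: assumed capacity of buffer; wrap instead of overrunning
// Tail-chaining: if another interrupt pending, jump directly to it
}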
Low-Power Considerations
Embedded systems prioritize power efficiency:
- Clock gating: Disable unused modules
- Power domains: Shut down inactive regions
- Sleep modes: WFI (Wait For Interrupt) instruction
- Dynamic voltage/frequency scaling: Adjust performance vs power
# Enter low-power mode
wfi # Wait for interrupt (CPU sleeps)
# CPU wakes on interrupt, resumes here
14.5 MMU vs No-MMU Systems
MMU-Based Systems
Systems with Memory Management Units (MMUs) provide:
- Virtual memory: Each process has its own address space
- Memory protection: Processes can’t access each other’s memory
- Demand paging: Load pages from disk on demand
- Large address spaces: 39- to 57-bit virtual addresses (Sv39/Sv48/Sv57)
MMU-based systems run full operating systems like Linux:
Process A: 0x00000000-0xFFFFFFFF (virtual)
↓ MMU translation
Physical: 0x80000000-0x80FFFFFF
Process B: 0x00000000-0xFFFFFFFF (virtual)
↓ MMU translation
Physical: 0x81000000-0x81FFFFFF
Requirements:
- Sv39/Sv48/Sv57 page tables
- TLB for translation caching
- Page fault handling
- Sufficient memory for page tables
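To make the page-table requirement concrete, the sketch below splits an Sv39 virtual address into its three 9-bit virtual page numbers and 12-bit page offset, and checks that bits 63 to 39 all equal bit 38 as Sv39 requires. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct sv39_va {
    uint32_t vpn2;    /* bits 38:30, index into the root page table */
    uint32_t vpn1;    /* bits 29:21, index into the second-level table */
    uint32_t vpn0;    /* bits 20:12, index into the leaf table */
    uint32_t offset;  /* bits 11:0, byte offset within the 4 KB page */
};

/* Split a virtual address into the fields used by the Sv39 page walk. */
struct sv39_va sv39_split(uint64_t va) {
    struct sv39_va f;
    f.offset = (uint32_t)(va & 0xFFFu);
    f.vpn0   = (uint32_t)((va >> 12) & 0x1FFu);
    f.vpn1   = (uint32_t)((va >> 21) & 0x1FFu);
    f.vpn2   = (uint32_t)((va >> 30) & 0x1FFu);
    return f;
}

/* Sv39 requires bits 63:39 to equal bit 38; otherwise the access page-faults. */
int sv39_is_canonical(uint64_t va) {
    uint64_t upper = va >> 38;  /* bits 63:38, 26 bits in total */
    return upper == 0u || upper == 0x3FFFFFFu;
}
```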
No-MMU Systems
Systems without MMUs use physical addresses directly:
- Simpler hardware: No TLB, no page table walker
- Lower cost: Fewer gates, less power
- Faster context switch: No TLB flush needed
- Deterministic: No TLB miss latency
No-MMU systems run:
- Bare-metal: Single application, no OS
- RTOS: Real-time OS with static memory allocation
- Embedded Linux: uClinux (Linux without MMU)
Memory Protection Without MMU
No-MMU systems can still provide memory protection using PMP (Physical Memory Protection):
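// Protect firmware region (0x80000000-0x80010000)
// TOR entry i covers [pmpaddr(i-1), pmpaddr(i)), so the permissions go in entry 1
pmpaddr0 = 0x80000000 >> 2;
pmpaddr1 = 0x80010000 >> 2;
pmpcfg0 = (0x8DUL << 8); // Entry 1: L=1, A=1 (TOR), X=1, W=0, R=1 (locked entries also bind M-mode, so firmware keeps execute permission)
// Protect peripheral region (0x10000000-0x20000000)
pmpaddr2 = 0x10000000 >> 2;
pmpaddr3 = 0x20000000 >> 2;
pmpcfg0 |= (0x8BUL << 24); // Entry 3: L=1, A=1 (TOR), X=0, W=1, R=1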
PMP provides:
- Region-based protection (not page-based)
- M-mode enforcement
- Locked regions (can’t be changed until reset)
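Because each PMP entry owns one byte of a pmpcfg register, placing a configuration in the wrong byte silently protects the wrong range. The helper below is a sketch for RV64, where pmpcfg0 holds entries 0 to 7; the function names are hypothetical, while the bit values follow the PMP configuration layout (R, W, X, A, L).

```c
#include <stdint.h>

#define PMP_R     0x01u
#define PMP_W     0x02u
#define PMP_X     0x04u
#define PMP_A_TOR 0x08u
#define PMP_L     0x80u

/* pmpaddr registers hold the byte address shifted right by 2. */
uint64_t pmp_tor_addr(uint64_t byte_addr) {
    return byte_addr >> 2;
}

/* Return pmpcfg with the byte for `entry` (0 to 7) replaced by cfg. */
uint64_t pmpcfg_set_entry(uint64_t pmpcfg, unsigned entry, uint8_t cfg) {
    unsigned shift = 8u * (entry & 7u);
    pmpcfg &= ~(0xFFull << shift);          /* clear the old byte */
    return pmpcfg | ((uint64_t)cfg << shift);
}
```

For a TOR range bounded by pmpaddr0 and pmpaddr1, the configuration byte belongs to entry 1, so it lands in bits 15:8 of pmpcfg0.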
Comparison
| Feature | MMU System | No-MMU System |
|---|---|---|
| Address Space | Virtual (per-process) | Physical (shared) |
| Memory Protection | Page-based (4KB granularity) | Region-based (PMP) |
| OS Support | Linux, FreeBSD, etc. | RTOS, bare-metal, uClinux |
| Context Switch | Slow (TLB flush) | Fast (no TLB) |
| Memory Overhead | Page tables (~1% of RAM) | None |
| Complexity | High | Low |
| Determinism | Lower (TLB misses) | Higher (no TLB) |
| Use Cases | Servers, desktops, phones | Microcontrollers, embedded |
14.6 Comparison with ARM Cortex-M
ARM Cortex-M Overview
ARM Cortex-M is the dominant architecture for microcontrollers. The Cortex-M family includes:
- Cortex-M0/M0+: Ultra-low-cost, minimal features
- Cortex-M3: Mainstream, good performance/cost
- Cortex-M4: DSP extensions, floating-point
- Cortex-M7: High performance, cache, double-precision FP
- Cortex-M33/M55: ARMv8-M, TrustZone, vector extensions
RISC-V Embedded vs ARM Cortex-M
| Feature | RISC-V Embedded | ARM Cortex-M |
|---|---|---|
| ISA | Open, modular | Proprietary, fixed |
| Registers | 32 (or 16 for RV32E) | 16 (13 general-purpose) |
| Instruction Set | RISC, load-store | Thumb-2 (mixed 16/32-bit) |
| Compressed Instructions | Optional (C extension) | Standard (Thumb) |
| Multiply/Divide | Optional (M extension) | Standard (M3+) |
| Floating-Point | Optional (F/D extensions) | Optional (M4+) |
| Vector/SIMD | Optional (V extension) | Optional (M55 Helium) |
| Privilege Levels | M/S/U modes | Handler/Thread modes |
| Memory Protection | PMP (region-based) | MPU (region-based) |
| Interrupt Controller | CLIC (vectored, nested) | NVIC (vectored, nested) |
| Interrupt Priorities | Up to 256 levels | 8-256 levels (implementation) |
| Licensing | Open (no fees) | Proprietary (licensing fees) |
| Ecosystem | Growing | Mature, extensive |
Interrupt Model Comparison (CLIC vs NVIC)
ARM NVIC (Nested Vectored Interrupt Controller):
- Vectored interrupts (direct jump to handler)
- Nested interrupts with priority levels
- Tail-chaining for back-to-back interrupts
- Automatic context save/restore
RISC-V CLIC (Core-Local Interrupt Controller):
- Similar features to NVIC
- More flexible priority levels (up to 256)
- Configurable interrupt modes
- Compatible with RISC-V privilege model
Both provide comparable functionality for embedded interrupt handling.
Ecosystem Comparison
ARM Cortex-M Advantages:
- Mature ecosystem (20+ years)
- Extensive vendor support (ST, NXP, TI, etc.)
- Rich middleware (CMSIS, Mbed, etc.)
- Large developer community
- Proven in billions of devices
RISC-V Embedded Advantages:
- No licensing fees (lower cost)
- Open ISA (customizable)
- Modern design (cleaner than ARM legacy)
- Growing ecosystem (SiFive, Espressif, etc.)
- Future-proof (community-driven evolution)
Use Case Recommendations
Choose ARM Cortex-M when:
- Mature ecosystem is critical
- Extensive middleware needed
- Proven reliability required
- Time-to-market is tight
Choose RISC-V Embedded when:
- Cost optimization is important
- Customization is needed
- Open-source ecosystem preferred
- Long-term flexibility valued
Example: ESP32-C3 (RISC-V) vs ESP32 (Xtensa)
Espressif’s ESP32-C3 demonstrates RISC-V in embedded:
- RV32IMC core (32-bit, multiply, compressed)
- 160 MHz, single-core
- Wi-Fi + Bluetooth LE
- 400 KB SRAM, 4 MB flash
- Arduino, ESP-IDF support
Compared to ESP32 (Xtensa):
- Similar performance
- Better toolchain (GCC, LLVM)
- Open ISA (vs proprietary Xtensa)
- Growing ecosystem
🛠️ Hands-on Lab: Lab 14.1 — Profile Detector
This lab demonstrates how to detect which Extensions the current hardware supports—key to understanding Profile practical applications.
Lab Objectives
- Read the misa CSR to see supported Extensions
- Check if critical CSRs exist
- Determine which Profile level the current system meets
Code (profile_detect.c)
#include <stdio.h>
#include <stdint.h>
// Read misa CSR (requires M-mode privilege)
static inline uint64_t read_misa() {
uint64_t val;
asm volatile ("csrr %0, misa" : "=r" (val));
return val;
}
// Extension check (misa bit mapping: A=0, B=1, ..., Z=25)
#define HAS_EXT(misa, letter) ((int)(((misa) >> ((letter) - 'A')) & 1))
void check_profile(uint64_t misa) {
printf("=== RISC-V Profile Detector ===\n\n");
// Check XLEN (MXL field in misa[63:62])
int xlen = (misa >> 62) == 2 ? 64 : 32;
printf("Base: RV%d\n", xlen);
// List supported Extensions
printf("Extensions: ");
for (char c = 'A'; c <= 'Z'; c++) {
if (HAS_EXT(misa, c)) {
printf("%c", c);
}
}
printf("\n\n");
// RVA22 minimum requirements check
int has_m = HAS_EXT(misa, 'M'); // Integer Multiply
int has_a = HAS_EXT(misa, 'A'); // Atomics
int has_f = HAS_EXT(misa, 'F'); // Single-precision Float
int has_d = HAS_EXT(misa, 'D'); // Double-precision Float
int has_c = HAS_EXT(misa, 'C'); // Compressed
printf("--- Profile Compatibility ---\n");
// RVA22 requires: RV64IMAFDC + more
if (xlen == 64 && has_m && has_a && has_f && has_d && has_c) {
printf("[✓] RVA22 basic requirements: PASS\n");
printf(" (Additional checks needed: Zba, Zbb, Zbs, Sv39, PMP>=8)\n");
} else {
printf("[✗] RVA22 basic requirements: FAIL\n");
if (xlen != 64) printf(" Missing: 64-bit base\n");
if (!has_m) printf(" Missing: M (Multiply)\n");
if (!has_a) printf(" Missing: A (Atomics)\n");
if (!has_f) printf(" Missing: F (Float)\n");
if (!has_d) printf(" Missing: D (Double)\n");
if (!has_c) printf(" Missing: C (Compressed)\n");
}
// RVM (Microcontroller) compatibility is more relaxed
if (has_m && has_c) {
printf("[✓] RVM basic requirements: PASS (RV32/64 + M + C)\n");
}
}
int main() {
uint64_t misa = read_misa();
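if (misa == 0) {
// A zero value means this hart does not implement misa; a read from below M-mode traps instead
printf("Error: misa reads as zero (CSR not implemented on this hart)\n");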
return 1;
}
check_profile(misa);
return 0;
}
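The decoding logic above can be exercised on any host by passing in a captured misa value instead of reading the CSR. The following is a minimal sketch, not part of the lab code: the helper names `misa_ext_string` and `misa_has_all` are hypothetical, and the sample value in the usage note is an assumed RV64 misa with the letters ACDFIMSU set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill `out` (at least 27 bytes) with the letters whose misa bits are set. */
void misa_ext_string(uint64_t misa, char out[27]) {
    int n = 0;
    for (int bit = 0; bit < 26; bit++) {
        if (misa & (1ULL << bit)) {
            out[n++] = (char)('A' + bit);   /* bit 0 = 'A', bit 25 = 'Z' */
        }
    }
    out[n] = '\0';
}

/* Return 1 if every letter in `required` (e.g. "IMAFDC") is set in misa. */
int misa_has_all(uint64_t misa, const char *required) {
    for (const char *p = required; *p != '\0'; p++) {
        assert(*p >= 'A' && *p <= 'Z');
        if (!(misa & (1ULL << (*p - 'A')))) {
            return 0;
        }
    }
    return 1;
}
```

For example, `misa_has_all(0x800000000014112DULL, "IMAFDC")` returns 1, while asking for "V" returns 0. Only single-letter extensions can be checked this way.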
Compile and Run
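# Compile (the binary must execute in M-mode to read misa)
riscv64-unknown-elf-gcc -march=rv64gc -o profile_detect profile_detect.c
# Run on Spike. Note: pk runs the program in U-mode, where `csrr misa`
# raises an illegal-instruction exception; use a bare-metal (M-mode)
# environment to read misa directly.
spike pk profile_detect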
Expected Output (QEMU virt machine)
=== RISC-V Profile Detector ===
Base: RV64
Extensions: ACDFIMSU
--- Profile Compatibility ---
[✓] RVA22 basic requirements: PASS
(Additional checks needed: Zba, Zbb, Zbs, Sv39, PMP>=8)
[✓] RVM basic requirements: PASS (RV32/64 + M + C)
Key Takeaways
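- misa CSR: A single read reveals the single-letter standard extensions; multi-letter extensions such as Zba or Zbb do not appear in misa and need other discovery mechanisms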
- Profile != Full Compliance: Even if IMAFDC is present, RVA22 still needs additional features like Zba, Zbb, Zbs, Sv39, and 8+ PMP entries
- Runtime Detection: Production code should check features at boot, not assume
danieRTOS Reference: danieRTOS checks for M extension at startup to decide whether to use hardware or software multiply.
⚠️ Common Pitfalls
Pitfall 1: Profile Version ≠ Performance
Misconception: “RVA23 is 10% faster than RVA22”
Truth: Profile version only represents the feature set’s year, not clock speed or hardware performance.
RVA22 @ 2.0 GHz could be much faster than RVA23 @ 1.0 GHz!
Profiles define "software compatibility", not "hardware performance".
Pitfall 2: Thinking RV64GC = RVA22
Error Scenario: Seeing a vendor claim RV64GC support and assuming the latest Fedora will run.
Truth: RVA22 requires additional Extensions and hardware features:
| Requirement | RV64GC | RVA22 |
|---|---|---|
| IMAFDCSU | ✓ | ✓ |
| Zba, Zbb, Zbs | ✗ | ✓ (Mandatory) |
| Sv39 (MMU) | Not specified | ✓ (Mandatory) |
| PMP ≥ 8 entries | Not specified | ✓ (Mandatory) |
Pitfall 3: Ignoring Profile Backward Compatibility
Error Scenario: Assuming programs compiled for RVA23 will also run on RVA22 hardware.
Correct Understanding:
RVA22 software → RVA23 hardware ✓ (Forward compatible)
RVA23 software → RVA22 hardware ✗ (May use new instructions)
💡 Recommendation: For maximum compatibility, specify an older Profile as your compilation target.
Summary
Platform profiles and embedded system design define how RISC-V adapts to different use cases. This chapter covered five key areas that enable RISC-V to serve both high-performance application processors and resource-constrained embedded systems.
Platform profiles solve the fragmentation problem by defining standard combinations of extensions. Profiles specify mandatory extensions, optional features, and implementation requirements. This creates standard targets for software development and guarantees compatibility. The naming convention (RVA22S, RVA23U, etc.) clearly indicates the profile’s purpose and version.
RVA22 profile targets application processors running rich operating systems. RVA22S requires 64-bit base ISA, standard extensions (M, A, F, D, C), bit manipulation (Zba, Zbb, Zbs), Sv39 virtual memory, and at least 8 PMP entries. RVA22U provides the same ISA extensions for user-mode applications. This profile enables Linux, FreeBSD, and other full operating systems to run portably across RISC-V implementations.
RVA23 profile builds on RVA22 with additional extensions and stricter requirements. New mandatory extensions include Zicond (conditional operations), Zfa (additional floating-point), and Zawrs (wait-on-reservation-set). Enhanced requirements include 16 PMP entries and Sv48/Sv57 support. RVA23 maintains forward compatibility—RVA22 software runs unmodified on RVA23 hardware.
Embedded profiles target microcontrollers and real-time systems with different constraints. RV32E reduces the register file to 16 registers for ultra-low-cost applications. Embedded systems emphasize compressed instructions for code size, fast interrupt handling through CLIC, and low-power features like WFI and clock gating. These profiles enable RISC-V to compete in the microcontroller market.
MMU vs no-MMU systems represent two different approaches to memory management. MMU-based systems provide virtual memory, per-process address spaces, and page-based protection, enabling full operating systems like Linux. No-MMU systems use physical addresses directly, offering simpler hardware, lower cost, and better determinism. PMP provides region-based memory protection for no-MMU systems, enabling task isolation without the complexity of virtual memory.
Comparison with ARM Cortex-M shows RISC-V’s competitive position in embedded systems. Both architectures provide similar features—vectored interrupts, nested interrupt handling, memory protection, and optional floating-point. RISC-V offers advantages in licensing (open, no fees), customization (modular ISA), and modern design. ARM Cortex-M leads in ecosystem maturity, vendor support, and proven deployment. The choice depends on priorities: cost and flexibility favor RISC-V, while ecosystem maturity favors ARM.
Together, platform profiles and embedded system features enable RISC-V to serve the full spectrum from tiny microcontrollers to powerful application processors, with clear standards for software portability and hardware compliance.
Chapter 15. Debugging & Trace
Part IX — Performance, Debug & Tools
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Master GDB Stub Usage: Use QEMU -s -S to start a GDB Server for remote debugging
- Know Core Debug Commands: break, si, info reg, x/, and other GDB commands
- Build a Debug Mindset: Adopt a systematic “Observe → Hypothesize → Verify” debugging workflow
💡 Scenario: The Pocket Watch That Stops Time
Scene: Junior is pointing at garbled results on the screen, about to lose it.
Junior: “I can’t take this anymore. I just wanted to add 1 to 5, why is the result 34821? I’ve been staring at these five lines of Assembly for an hour!”
Senior: “Looking with your eyes isn’t enough. Junior, program execution is like a speeding train—how can you see if there’s a crack in the wheels while sitting by the tracks?”
Junior: “So what do I do? Add printf?”
Senior: “printf is fine, but in bare-metal situations or when the program crashes, you can’t even print anything. You need a ‘pocket watch’ that can pause time—GDB.”
Junior: “Pause time?”
Senior: “Exactly. Through a JTAG hardware interface (or the QEMU emulator we’re using now), we can force the CPU into Debug Mode.
| Debug Mode Feature | Analogy |
|---|---|
| Check Registers | Peeking in a wallet |
| View Memory | Searching through drawers |
| Single Step | Slow-motion replay |
| Breakpoint | Setting a trap |
In this mode, the CPU is like someone pressed the pause button. We can execute just one instruction at a time and see where things went wrong.
Come on, give me that broken program. Let’s go bug hunting.”
Software development requires debugging. A program crashes, and we need to know why. A function returns the wrong value, and we need to step through its execution. A performance bottleneck appears, and we need to trace instruction flow. Debugging transforms opaque failures into understandable problems.
RISC-V provides a comprehensive debug architecture that supports both halting debug (stop the processor, examine state) and non-intrusive trace (record execution without stopping). The Debug Module allows external debuggers to control the processor through JTAG or other interfaces. Hardware breakpoints and triggers enable precise control over when to halt execution. Debug mode provides a special execution environment for debug operations. Trace support captures instruction and data flow for post-mortem analysis.
This chapter explores RISC-V debugging and trace capabilities. We’ll examine the debug architecture, debug interfaces, hardware breakpoints, debug mode operation, trace support, and how RISC-V compares to ARM’s CoreSight debug infrastructure.
15.1 RISC-V Debug Architecture
Debug Requirements
A debug system must provide:
- Halt and resume: Stop processor execution, examine state, continue
- Register access: Read and write CPU registers
- Memory access: Read and write system memory
- Breakpoints: Stop execution at specific instructions or data accesses
- Single-step: Execute one instruction at a time
- Reset control: Reset the processor or system
RISC-V’s debug architecture separates concerns into distinct modules, allowing flexible implementation while maintaining standard interfaces.
Debug Components
External Debugger (GDB, OpenOCD)
↓
Debug Transport Module (DTM) - JTAG, USB, etc.
↓
Debug Module Interface (DMI)
↓
Debug Module (DM)
↓
RISC-V Core (enters Debug Mode)
Key components:
Debug Transport Module (DTM): Provides physical connection (JTAG, USB, etc.)
Debug Module (DM): Controls the core, implements debug operations
Debug Module Interface (DMI): Standard interface between DTM and DM
Debug Mode: Special execution mode for debug operations
Debug Module (DM)
The Debug Module is the central component that:
- Halts and resumes the core
- Provides abstract commands for register/memory access
- Manages hardware breakpoints (triggers)
- Controls reset
The debugger reaches the DM's registers through the DMI address space (via the DTM), not through the system memory map:
DM base address on the DMI: 0x00000000 (implementation-defined)
Key registers:
dmcontrol: Control register (halt, resume, reset)
dmstatus: Status register (halted, running, etc.)
hartinfo: Hart information
abstractcs: Abstract command status
command: Abstract command register
data0-11: Data transfer registers
progbuf0-15: Program buffer
Debug Mode vs Machine Mode
Debug mode is a special execution mode distinct from M/S/U modes:
- Higher privilege than M-mode
- Can access all system resources
- Uses separate CSRs (dcsr, dpc, dscratch0/1)
- Executes from Debug ROM or Program Buffer
Privilege Hierarchy:
Debug Mode (highest)
↓
M-mode
↓
S-mode
↓
U-mode (lowest)
15.2 Debug Interface
JTAG Interface
JTAG (Joint Test Action Group) is the standard debug interface for RISC-V. It provides:
- 4-wire interface (TDI, TDO, TCK, TMS)
- Boundary scan for testing
- Debug access to the core
JTAG signals:
TDI: Test Data In (serial data input)
TDO: Test Data Out (serial data output)
TCK: Test Clock
TMS: Test Mode Select (state machine control)
TRST: Test Reset (optional)
JTAG state machine:
Test-Logic-Reset
↓
Run-Test/Idle
↓
Select-DR-Scan → Capture-DR → Shift-DR → Exit1-DR → Update-DR
↓
Select-IR-Scan → Capture-IR → Shift-IR → Exit1-IR → Update-IR
Debug Module Interface (DMI)
DMI is a standard register interface between DTM and DM. It provides:
- 32-bit register access (DM registers are 32 bits wide)
- Address space for DM registers
- Status and error reporting
DMI operations:
// Read DM register
uint32_t dmi_read(uint32_t addr) {
// DTM shifts address into JTAG
// DTM reads data from DM
// Returns data
}
// Write DM register
void dmi_write(uint32_t addr, uint32_t data) {
// DTM shifts address and data into JTAG
// DM performs write
}
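To make the "shift address and data into JTAG" step concrete, here is a minimal host-side sketch of how a JTAG DTM packs one DMI request into its `dmi` scan register. It assumes the layout from the RISC-V External Debug Support specification (op in bits [1:0], data in bits [33:2], address above that) and an address width of 7 bits; real hardware reports its width in `dtmcs.abits`. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DMI_ABITS 7u  /* assumed address width; read dtmcs.abits on real hardware */

enum dmi_op { DMI_OP_NOP = 0, DMI_OP_READ = 1, DMI_OP_WRITE = 2 };

/* Build the value that is shifted into the DTM's dmi register. */
uint64_t dmi_pack(uint32_t addr, uint32_t data, enum dmi_op op) {
    assert(addr < (1u << DMI_ABITS));
    return ((uint64_t)addr << 34) | ((uint64_t)data << 2) | (uint64_t)op;
}

/* Extract the 32-bit data field from a value shifted out of dmi. */
uint32_t dmi_unpack_data(uint64_t scan) {
    return (uint32_t)(scan >> 2);
}

/* Extract the 2-bit status field (0 = success, 2 = failed, 3 = busy). */
uint32_t dmi_unpack_status(uint64_t scan) {
    return (uint32_t)(scan & 0x3u);
}
```

A write of 0x80000001 to dmcontrol (DMI address 0x10) packs to 0x4200000006. A read needs two scans: one to issue the request and a second to shift the result out.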
Abstract Commands
Abstract commands let the debugger perform common operations on a halted hart without supplying its own code in the Program Buffer:
- Access Register: Read/write CPU registers (GPRs, CSRs, FPRs)
- Access Memory: Read/write system memory
- Quick Access: Halt the hart, run the Program Buffer, then resume
// Example: Read register x10 using abstract command
void read_register_x10(uint32_t *value) {
// Write abstract command
dm_write(command, 0x0022100A); // regno=0x100A (x10), transfer=1, aarsize=2 (32-bit)
// Wait for completion
while (dm_read(abstractcs) & ABSTRACTCS_BUSY);
// Read result
*value = dm_read(data0);
}
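The magic number in an Access Register command is easier to audit when built from its fields. The sketch below assumes the field layout from the RISC-V External Debug Support specification (cmdtype in [31:24], aarsize in [22:20], transfer at bit 17, write at bit 16, regno in [15:0], with GPRs x0..x31 at regno 0x1000..0x101f). The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define AAR_SIZE_32    2u
#define AAR_SIZE_64    3u
#define REGNO_GPR_BASE 0x1000u  /* x0..x31 map to 0x1000..0x101f */

/* Encode an Access Register abstract command (cmdtype = 0). */
uint32_t access_register_cmd(uint32_t regno, uint32_t aarsize,
                             int transfer, int write) {
    assert(regno <= 0xFFFFu && aarsize <= 7u);
    return (0u << 24)                         /* cmdtype = 0: Access Register */
         | (aarsize << 20)                    /* access width */
         | ((transfer ? 1u : 0u) << 17)       /* actually move the data */
         | ((write ? 1u : 0u) << 16)          /* 0 = read into data0, 1 = write */
         | regno;
}
```

Reading x10 as a 32-bit value is `access_register_cmd(REGNO_GPR_BASE + 10, AAR_SIZE_32, 1, 0)`, which evaluates to 0x0022100A.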
System Bus Access
The DM can access system memory directly through the system bus, bypassing the core:
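// Read a 32-bit word at a physical address (e.g., 0x80000000) via the system bus
uint32_t read_memory(uint64_t addr) {
// Configure first: 32-bit accesses, and start a read whenever sbaddress0 is written
dm_write(sbcs, SBCS_SBREADONADDR | SBCS_SBACCESS32);
dm_write(sbaddress1, addr >> 32); // Upper address bits
dm_write(sbaddress0, addr & 0xFFFFFFFF); // Written last: this write triggers the read
while (dm_read(sbcs) & SBCS_SBBUSY);
return dm_read(sbdata0);
}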
15.3 Hardware Breakpoints and Triggers
Trigger Module
RISC-V provides a flexible trigger system for hardware breakpoints and watchpoints. Triggers can:
- Break on instruction execution (instruction breakpoint)
- Break on data access (data watchpoint)
- Break on exceptions
- Chain multiple conditions
Triggers are configured through CSRs:
- tselect: Select trigger register
- tdata1: Trigger configuration
- tdata2: Trigger match value
- tdata3: Additional trigger data (optional)
Trigger Types
RISC-V defines several trigger types:
Type 2 (mcontrol): Address/data match trigger
- Match on instruction fetch, load, or store
- Configurable match conditions (equal, greater, less, mask)
- Action: enter debug mode, raise exception, or trace
Type 3 (icount): Instruction count trigger
- Break after N instructions
- Useful for single-stepping
Type 4 (itrigger): Interrupt trigger
- Break on specific interrupts
Type 5 (etrigger): Exception trigger
- Break on specific exceptions
Breakpoint Configuration
Setting an instruction breakpoint:
// Set breakpoint at address 0x80000100
void set_breakpoint(uint64_t addr) {
// Select trigger 0
write_csr(tselect, 0);
// Configure mcontrol trigger
uint64_t tdata1 = 0;
tdata1 |= (2ULL << 60); // type = 2 (mcontrol)
tdata1 |= (1ULL << 6); // m = 1 (match in M-mode)
tdata1 |= (1ULL << 2); // execute = 1 (match on instruction fetch)
tdata1 |= (1ULL << 59); // dmode = 1 (needed for action = 1; only Debug Mode can set it)
tdata1 |= (1ULL << 12); // action = 1 (enter debug mode)
write_csr(tdata1, tdata1);
write_csr(tdata2, addr); // Match address
}
Setting a data watchpoint:
// Set watchpoint on store to address 0x80001000
void set_watchpoint(uint64_t addr) {
write_csr(tselect, 1);
uint64_t tdata1 = 0;
tdata1 |= (2ULL << 60); // type = 2 (mcontrol)
tdata1 |= (1ULL << 6); // m = 1 (match in M-mode)
tdata1 |= (1ULL << 1); // store = 1 (match on store)
tdata1 |= (1ULL << 59); // dmode = 1 (needed for action = 1; only Debug Mode can set it)
tdata1 |= (1ULL << 12); // action = 1 (enter debug mode)
write_csr(tdata1, tdata1);
write_csr(tdata2, addr);
}
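Both functions above build tdata1 from the same handful of bit fields, so a small builder keeps the shifts in one place. This is a sketch for RV64 only, assuming the mcontrol layout from the RISC-V External Debug Support specification (type in [63:60], dmode at bit 59, action in [15:12]). The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MC_LOAD    (1ULL << 0)   /* match on load */
#define MC_STORE   (1ULL << 1)   /* match on store */
#define MC_EXECUTE (1ULL << 2)   /* match on instruction fetch */
#define MC_U       (1ULL << 3)   /* match in U-mode */
#define MC_S       (1ULL << 4)   /* match in S-mode */
#define MC_M       (1ULL << 6)   /* match in M-mode */
#define MC_CHAIN   (1ULL << 11)  /* chain with the next trigger */

/* Compose an RV64 mcontrol (type 2) tdata1 value. */
uint64_t mcontrol_tdata1(uint64_t match_flags, unsigned action, int dmode) {
    assert(action <= 0xFu);
    uint64_t v = 2ULL << 60;             /* type = 2 (mcontrol) */
    if (dmode) {
        v |= 1ULL << 59;                 /* trigger owned by the debugger */
    }
    v |= (uint64_t)action << 12;         /* 0 = breakpoint exception, 1 = debug mode */
    v |= match_flags;
    return v;
}
```

An M-mode instruction breakpoint that enters debug mode is `mcontrol_tdata1(MC_M | MC_EXECUTE, 1, 1)`; a store watchpoint that raises a breakpoint exception instead is `mcontrol_tdata1(MC_M | MC_STORE, 0, 0)`.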
Trigger Chaining
Multiple triggers can be chained to create complex conditions:
// Break when PC = 0x80000100 AND x10 = 42
void set_conditional_breakpoint(void) {
// Trigger 0: Match PC
write_csr(tselect, 0);
uint64_t tdata1_0 = (2ULL << 60) | (1ULL << 6) | (1ULL << 2) | (1ULL << 11);
// chain = 1 (bit 11)
write_csr(tdata1, tdata1_0);
write_csr(tdata2, 0x80000100);
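// Trigger 1: second condition. Standard mcontrol triggers match addresses or
// load/store data, not register contents, so testing x10 == 42 directly
// requires an implementation-specific trigger.
write_csr(tselect, 1);
uint64_t tdata1_1 = (2ULL << 60) | (1ULL << 6) | (1ULL << 12);
// action = 1 (enter debug mode)
write_csr(tdata1, tdata1_1);
// Implementation-specific: configure the register-value match here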
}
Trigger Actions
When a trigger fires, it can:
- Enter debug mode (action = 1)
- Raise breakpoint exception (action = 0)
- Generate trace event (implementation-specific)
15.4 Debug Mode
Entering Debug Mode
The core enters debug mode when:
- External debugger requests halt (via dmcontrol.haltreq)
- Hardware breakpoint fires (trigger with action = 1)
- Single-step completes (dcsr.step = 1)
- An ebreak instruction executes while the matching dcsr.ebreakm/ebreaks/ebreaku bit is set
Upon entering debug mode:
- PC is saved to dpc (Debug PC)
- Cause is saved to dcsr.cause
- Core halts execution
- PC jumps to Debug ROM or Program Buffer
Debug CSRs
Debug mode uses three special CSRs:
dcsr (Debug Control and Status):
Bits:
[31:28] xdebugver: Debug spec version
[15] ebreakm: ebreak enters debug mode in M-mode
[13] ebreaks: ebreak enters debug mode in S-mode
[12] ebreaku: ebreak enters debug mode in U-mode
[8:6] cause: Why debug mode was entered
[2] step: Single-step mode
[1:0] prv: Privilege mode before debug
dpc (Debug PC): Saved PC when entering debug mode
dscratch0/1: Scratch registers for debug code
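A debugger typically reads dcsr right after a halt to learn why the hart stopped. The sketch below decodes a few of the fields listed above from a raw 32-bit value. The cause names follow version 0.13 of the RISC-V External Debug Support specification, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct dcsr_fields {
    unsigned xdebugver;  /* bits [31:28] */
    unsigned ebreakm;    /* bit 15 */
    unsigned cause;      /* bits [8:6] */
    unsigned step;       /* bit 2 */
    unsigned prv;        /* bits [1:0] */
};

/* Split a raw dcsr value into the fields a debugger usually inspects. */
struct dcsr_fields dcsr_decode(uint32_t dcsr) {
    struct dcsr_fields f;
    f.xdebugver = (dcsr >> 28) & 0xFu;
    f.ebreakm   = (dcsr >> 15) & 0x1u;
    f.cause     = (dcsr >> 6) & 0x7u;
    f.step      = (dcsr >> 2) & 0x1u;
    f.prv       = dcsr & 0x3u;
    return f;
}

/* Map dcsr.cause to a short human-readable name. */
const char *dcsr_cause_name(unsigned cause) {
    switch (cause) {
    case 1: return "ebreak";
    case 2: return "trigger";
    case 3: return "haltreq";
    case 4: return "step";
    case 5: return "resethaltreq";
    default: return "reserved";
    }
}
```

A hart halted from M-mode by an external halt request reports cause 3 and prv 3.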
Debug ROM and Program Buffer
When entering debug mode, the core executes code from:
Debug ROM: Small ROM containing debug entry code
- Saves context
- Waits for debugger commands
- Restores context on resume
Program Buffer: RAM for debugger-supplied code
- Debugger writes instructions here
- Core executes them in debug mode
- Used for complex operations (e.g., memory copy)
Example debug ROM code:
# Debug ROM entry point
debug_rom_entry:
# Save x10 to dscratch0
csrw dscratch0, x10
# Load program buffer address
lui x10, %hi(progbuf)
addi x10, x10, %lo(progbuf)
# Jump to program buffer
jr x10
# Program buffer (written by debugger)
progbuf:
# Debugger writes instructions here
# Example: copy x5 into dscratch1 so the debugger can read it
csrw dscratch1, x5
# ... more instructions ...
ebreak # Return to debug ROM
Resuming from Debug Mode
To resume execution:
- Debugger writes to dmcontrol.resumereq
- Debug ROM restores context
- Core executes dret instruction
- PC restored from dpc
- Privilege mode restored from dcsr.prv
# Resume from debug mode
debug_resume:
# Restore x10 from dscratch0
csrr x10, dscratch0
# Return from debug mode
dret # PC ← dpc, privilege ← dcsr.prv
Single-Stepping
Single-step mode executes one instruction then re-enters debug mode:
// Enable single-step
void enable_single_step(void) {
uint64_t dcsr = read_csr(dcsr);
dcsr |= (1 << 2); // step = 1
write_csr(dcsr, dcsr);
}
// Debugger workflow:
// 1. Halt core
// 2. Enable single-step
// 3. Resume (executes one instruction)
// 4. Core re-enters debug mode
// 5. Repeat
15.5 Trace Support
RISC-V Trace Specification
Trace captures program execution for analysis without halting the core. RISC-V trace provides:
- Instruction trace: Record executed instructions
- Data trace: Record memory accesses
- Trace compression: Reduce trace bandwidth
Trace is non-intrusive—it doesn’t affect program execution or timing.
Instruction Trace
Instruction trace records:
- Executed instructions (PC values)
- Branch outcomes (taken/not taken)
- Exceptions and interrupts
- Context changes (privilege mode, ASID)
Trace packets encode this information efficiently:
Trace packet types:
Format 0: Uncompressed address (full PC)
Format 1: Differential address (PC delta)
Format 2: Address with branch map
Format 3: Synchronization packet
Trace Compression
Full instruction trace is expensive (bandwidth, storage). RISC-V trace uses compression:
Branch map: Encode multiple branch outcomes in one packet
Example: 8 branches, outcomes = 10110010
Packet: [type=2, branches=10110010, address=...]
Differential encoding: Encode PC delta instead of full PC
Previous PC: 0x80000100
Current PC: 0x80000104
Packet: [type=1, delta=+4]
Implicit sequences: Don’t trace sequential instructions
PC sequence: 0x100, 0x104, 0x108, 0x10c
Trace: [0x100, count=4] # Implicit +4 increments
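Two of the ideas above, implicit sequences and branch maps, can be illustrated with a toy encoder that runs on any host. This is a simplified sketch, not the packet format of the RISC-V trace specification: it assumes every instruction is 4 bytes (no compressed instructions), and the names `pc_run`, `compress_pc_runs`, and `pack_branch_map` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pc_run {
    uint64_t start;   /* first PC of the sequential run */
    uint32_t count;   /* number of instructions in the run */
};

/* Collapse a PC stream into runs of sequential 4-byte instructions. */
size_t compress_pc_runs(const uint64_t *pcs, size_t n,
                        struct pc_run *out, size_t out_cap) {
    size_t runs = 0;
    for (size_t i = 0; i < n; i++) {
        if (runs > 0 &&
            pcs[i] == out[runs - 1].start + 4ULL * out[runs - 1].count) {
            out[runs - 1].count++;          /* PC continues the current run */
        } else {
            assert(runs < out_cap);
            out[runs].start = pcs[i];       /* discontinuity: start a new run */
            out[runs].count = 1;
            runs++;
        }
    }
    return runs;
}

/* Pack up to 8 branch outcomes (1 = taken) into one byte, first branch in the MSB. */
uint8_t pack_branch_map(const int *taken, size_t n) {
    assert(n <= 8);
    uint8_t map = 0;
    for (size_t i = 0; i < n; i++) {
        map = (uint8_t)((map << 1) | (taken[i] ? 1 : 0));
    }
    return map;
}
```

The sequence 0x100, 0x104, 0x108, 0x10c collapses to a single run with count 4, and the eight outcomes 1,0,1,1,0,0,1,0 pack to the byte 0xB2 (binary 10110010).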
Data Trace
Data trace records memory accesses:
- Load/store addresses
- Data values
- Access size (byte, halfword, word, doubleword)
Data trace packet:
[type] [address] [data] [size]
Example:
SW x10, 0(x5) # Store word
Packet: [type=store, addr=0x80001000, data=0x12345678, size=4]
Trace Filtering
Trace can be filtered to reduce bandwidth:
- Address range filtering (trace only specific code regions)
- Privilege filtering (trace only M-mode, S-mode, etc.)
- Event filtering (trace only branches, exceptions, etc.)
// Configure trace filtering (implementation-specific)
void configure_trace_filter(uint64_t start, uint64_t end) {
// Enable trace for address range [start, end]
trace_write(TRACE_ADDR_START, start);
trace_write(TRACE_ADDR_END, end);
trace_write(TRACE_CONTROL, TRACE_ENABLE | TRACE_FILTER_ADDR);
}
15.6 Comparison with ARM Debug
RISC-V Debug vs ARM CoreSight
ARM CoreSight is a comprehensive debug and trace infrastructure. Comparison:
| Feature | RISC-V Debug | ARM CoreSight |
|---|---|---|
| Debug Interface | JTAG, DTM/DMI | JTAG, SWD (Serial Wire Debug) |
| Debug Module | Debug Module (DM) | Debug Access Port (DAP) |
| Halt/Resume | dmcontrol register | DHCSR register |
| Breakpoints | Trigger module (flexible) | FPB (Flash Patch and Breakpoint) |
| Watchpoints | Trigger module | DWT (Data Watchpoint and Trace) |
| Trace | RISC-V Trace spec | ETM (Embedded Trace Macrocell) |
| Trace Compression | Branch map, differential | Branch broadcast, compression |
| System Access | System bus access | AHB-AP, APB-AP |
| Complexity | Modular, simple | Comprehensive, complex |
JTAG vs SWD
RISC-V debug typically uses JTAG (4 or 5 wires); ARM supports both JTAG and SWD (2 wires):
JTAG (RISC-V, ARM):
TDI, TDO, TCK, TMS, (TRST)
5 pins, standard interface
SWD (ARM only):
SWDIO (bidirectional data)
SWCLK (clock)
2 pins, lower pin count
SWD advantages:
- Fewer pins (important for small packages)
- Faster than JTAG in some cases
- ARM-specific optimization
JTAG advantages:
- Industry standard (IEEE 1149.1)
- Widely supported tools
- Boundary scan capability
Trace Comparison
RISC-V Trace vs ARM ETM (Embedded Trace Macrocell):
| Feature | RISC-V Trace | ARM ETM |
|---|---|---|
| Instruction Trace | Yes | Yes |
| Data Trace | Yes | Yes (ETMv4+) |
| Compression | Branch map, differential | Branch broadcast, Q elements |
| Bandwidth | Configurable | Configurable |
| Filtering | Address, privilege, event | Address, context ID, VMID |
| Timestamps | Optional | Yes |
| Trace Port | Implementation-specific | TPIU (Trace Port Interface Unit) |
| Trace Buffer | Implementation-specific | ETB (Embedded Trace Buffer) |
ARM ETM is mature and widely deployed. RISC-V Trace is newer but follows similar principles with simpler encoding.
Debug Tools
Both architectures support standard debug tools:
RISC-V:
- GDB (GNU Debugger)
- OpenOCD (Open On-Chip Debugger)
- SEGGER J-Link
- Lauterbach TRACE32
ARM:
- GDB
- Keil MDK
- ARM DS-5 / Arm Development Studio
- SEGGER J-Link
- Lauterbach TRACE32
Practical Differences
RISC-V advantages:
- Simpler, more modular design
- Open specification (no licensing)
- Flexible trigger system
- Easier to implement
ARM advantages:
- Mature ecosystem
- SWD reduces pin count
- Comprehensive trace infrastructure
- Extensive tool support
For embedded systems, SWD’s 2-pin interface is attractive. For complex SoCs, both architectures provide comparable debug capabilities.
🛠️ Hands-on Lab: Lab 15.1 — The Vanishing Values (Bug Hunting with GDB)
This lab features a classic Pointer Stride Error—the most common mistake for RISC-V beginners: assuming an int pointer +1 moves 4 bytes, but in Assembly addi x, x, 1 really does add just 1 byte.
Lab Objectives
- Launch QEMU’s GDB Server feature
- Connect GDB and load symbols
- Use layout asm to view assembly
- Find the two bugs in the program
Buggy Code (buggy_sum.S)
.section .data
# Define an array: 10, 20, 30, 40, 50
# Expected result: 10+20+30+40+50 = 150 (Hex: 0x96)
nums: .word 10, 20, 30, 40, 50
.section .text
.global _start
_start:
la t0, nums # t0 points to array start
li t1, 5 # t1 is loop counter (Count = 5)
# BUG 1: Forgot to initialize accumulator a0
# We assume a0 is 0, but it might be garbage
loop:
lw t2, 0(t0) # Load current number into t2
add a0, a0, t2 # Accumulate: a0 = a0 + t2
# BUG 2: Pointer stride error!
# We're reading words (4 bytes), but here we only add 1
addi t0, t0, 1 # ❌ Should be addi t0, t0, 4
addi t1, t1, -1 # Decrement counter
bnez t1, loop # If not done, continue loop
stop:
j stop
Debug Workflow
Step A: Compile (with debug info)
# -g is key! It tells the toolchain to emit debug info (source lines, symbol details) for GDB
riscv64-unknown-elf-gcc -g -nostdlib -o buggy_sum.elf buggy_sum.S
Step B: Start QEMU (as Target)
# -S: Pause CPU immediately after startup
# -s: Enable GDB Server, default Port 1234
qemu-system-riscv64 -machine virt -nographic \
-kernel buggy_sum.elf -S -s
(Terminal will hang—open another terminal for GDB)
Step C: Start GDB (as Host)
riscv64-unknown-elf-gdb buggy_sum.elf
Step D: GDB Interactive Investigation
(gdb) target remote :1234 # Connect to QEMU
(gdb) layout asm # Open Assembly view
(gdb) break loop # Set breakpoint at loop label
(gdb) continue # Run until breakpoint
# After entering the loop...
(gdb) info reg a0 # Observe accumulator → not 0!
(gdb) info reg t0 # Observe pointer
(gdb) si # Single Step one instruction
(gdb) info reg t0 # Look at pointer again → only moved 1 byte!
# After finding the problem...
(gdb) x/5xw &nums # View array memory contents
Expected Findings
- Bug 1 (a0 uninitialized): First time entering loop, info reg a0 shows garbage value
- Bug 2 (pointer stride): Each addi t0, t0, 1 only increases t0 by 1, causing misaligned data reads
Fixed Code
_start:
la t0, nums
li t1, 5
li a0, 0 # ✅ FIX 1: Initialize accumulator
loop:
lw t2, 0(t0)
add a0, a0, t2
addi t0, t0, 4 # ✅ FIX 2: Stride = 4 bytes (word size)
addi t1, t1, -1
bnez t1, loop
stop:
j stop
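For contrast, here is the same loop written in C, where the compiler applies the stride for you. This is an illustrative sketch rather than part of the lab files; the function names are hypothetical. Incrementing an `int32_t` pointer by 1 advances it by 4 bytes, which is exactly the scaling the assembly version has to do by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `count` 32-bit words, mirroring the fixed assembly loop. */
int32_t sum_words(const int32_t *nums, size_t count) {
    assert(nums != NULL || count == 0);
    int32_t sum = 0;                 /* FIX 1: accumulator starts at zero */
    for (size_t i = 0; i < count; i++) {
        sum += *nums;                /* lw t2, 0(t0); add a0, a0, t2 */
        nums++;                      /* FIX 2: compiles to a +4 byte step */
    }
    return sum;
}

/* Byte distance covered by one pointer increment on an int32_t array. */
ptrdiff_t word_stride_bytes(const int32_t *p) {
    return (const char *)(p + 1) - (const char *)p;
}
```

With the array 10, 20, 30, 40, 50 the sum is 150, and `word_stride_bytes` returns 4.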
danieRTOS Reference: The danieRTOS context switch code carefully uses word-aligned offsets when saving/restoring registers to the stack.
⚠️ Common Pitfalls
Pitfall 1: Compiler Optimization Interferes with Debugging
Error Scenario: After compiling with -O2, line numbers in GDB don’t match source code, variables “disappear”.
Cause: The optimizer reorders instructions, eliminates variables (or keeps them only in registers), and inlines functions.
# ❌ Don't use high optimization when debugging
riscv64-unknown-elf-gcc -O2 -o program.elf program.c
# ✅ Use -O0 -g when debugging
riscv64-unknown-elf-gcc -O0 -g -o program.elf program.c
Pitfall 2: Forgetting the -g Flag
Error Scenario: GDB shows “No symbol table is loaded”.
Cause: Compiled without -g, symbol info was discarded.
# ❌ No debug info
riscv64-unknown-elf-gcc -o program.elf program.c
# ✅ Keep debug info
riscv64-unknown-elf-gcc -g -o program.elf program.c
Pitfall 3: QEMU Not Paused with -S
Error Scenario: Program already finished or crashed by the time GDB connects.
Solution: Always add -S to make QEMU pause after startup, waiting for GDB.
# ❌ Program starts executing immediately after launch
qemu-system-riscv64 -machine virt -kernel program.elf -s
# ✅ Program pauses after launch, waiting for GDB
qemu-system-riscv64 -machine virt -kernel program.elf -S -s
💡 Tip: -s = GDB Server on port 1234, -S = Stop at startup. These are often used together.
Summary
Debugging and trace are essential for software development and system analysis. This chapter explored RISC-V’s debug architecture and how it compares to ARM’s mature CoreSight infrastructure.
Debug architecture separates concerns into modular components. The Debug Transport Module provides physical connectivity through JTAG or other interfaces. The Debug Module controls the core through a standard Debug Module Interface. Debug mode provides a privileged execution environment for debug operations. This separation allows flexible implementations while maintaining standard interfaces for debugger tools.
Debug interface uses JTAG as the standard physical layer, providing four-wire connectivity for debug access. The Debug Module Interface defines register-level operations for controlling the core. Abstract commands enable high-level operations like register and memory access without requiring the debugger to supply Program Buffer code. System bus access allows the debugger to read and write memory directly, bypassing the core entirely.
Hardware breakpoints and triggers provide flexible mechanisms for halting execution. The trigger module supports multiple trigger types including address match, instruction count, interrupt, and exception triggers. Triggers can match on instruction fetch, data load, or data store. Trigger chaining enables complex conditional breakpoints. Actions include entering debug mode, raising exceptions, or generating trace events.
Debug mode is a special execution environment with higher privilege than M-mode. The core enters debug mode on external halt requests, breakpoint hits, or single-step completion. Debug CSRs (dcsr, dpc, dscratch) manage debug state. The Debug ROM provides entry code, while the Program Buffer allows debuggers to execute custom instruction sequences. Single-stepping executes one instruction then re-enters debug mode, enabling step-through debugging.
Trace support captures program execution non-intrusively. Instruction trace records executed instructions, branch outcomes, and control flow changes. Data trace records memory accesses and data values. Trace compression reduces bandwidth through branch maps, differential encoding, and implicit sequences. Trace filtering limits capture to specific address ranges, privilege levels, or events, reducing trace data volume.
Comparison with ARM shows both similarities and differences. ARM CoreSight provides comprehensive debug and trace infrastructure with mature tool support. SWD offers a 2-pin alternative to JTAG, reducing pin count for embedded systems. ARM ETM provides extensive trace capabilities with sophisticated compression. RISC-V’s debug architecture is simpler and more modular, with an open specification and flexible trigger system. Both architectures support standard tools like GDB and OpenOCD, ensuring practical usability.
Together, RISC-V’s debug and trace capabilities enable effective software development, system analysis, and problem diagnosis across the full range from embedded microcontrollers to high-performance application processors.
Chapter 16. Performance Counters & PMU
Part IX — Performance, Debug & Tools
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Read Performance Counters: Use csrr to read cycle and instret CSRs
- Calculate IPC Metrics: Understand the meaning and formula for Instructions Per Cycle
- Identify Performance Bottlenecks: Distinguish characteristics of Compute-bound vs Memory-bound programs
💡 Scenario: The CPU’s Indigestion
Scene: Junior comes running to Senior with a data sheet.
Junior: “Senior, look! I unrolled this loop, which made more instructions, but the execution time actually got shorter. That doesn’t make sense! More instructions should mean slower, right?”
Senior: “That’s a common rookie mistake—only looking at ‘food quantity’ (Instruction Count), not ‘digestion speed’ (IPC).
A CPU is like a hot dog eating contest competitor:
| Concept | Analogy |
|---|---|
| Cycle | Contest time (seconds) |
| Instret (Instructions) | Number of hot dogs eaten |
| IPC (Instructions Per Cycle) | Swallowing speed |
Formula: IPC = Instret / Cycle
”
Junior: “So after I unrolled the loop, even though there are more hot dogs, they’re being swallowed faster?”
Senior: “Exactly. Your previous code probably had ‘Data Dependencies’—the previous bite wasn’t swallowed yet, so the next bite couldn’t go in, causing the competitor to just stand there dazed (Pipeline Stall), resulting in low IPC.
After loop unrolling, instructions don’t interfere with each other, so the CPU can swallow several at once (Pipeline filled), and IPC goes up. So even though total instruction count increased, because swallowing is fast enough, total time actually decreased.”
Junior: “I see! So higher IPC is always better?”
Senior: “Not necessarily. If you just have them drink water (execute nop), they can swallow super fast (high IPC), but they’re not actually eating anything (no useful work). So when looking at performance, we must look at Cycle Count and IPC together.”
Performance optimization requires measurement. A program runs slowly, and we need to know why. Cache misses dominate execution time, or branch mispredictions cause pipeline stalls, or memory bandwidth limits throughput. Performance counters transform vague slowness into quantifiable bottlenecks.
RISC-V provides a Performance Monitoring Unit (PMU) through a set of hardware performance counters. These counters track events like cycles executed, instructions retired, cache hits and misses, branch predictions, and TLB accesses. The basic counters (cycle, instret, time) are defined by the Zicntr extension and provide fundamental metrics. Hardware performance counters (mhpmcounter3-31) track implementation-specific events, and an implementation may hardwire any of them to zero. Together, these counters enable profiling, bottleneck identification, and performance analysis.
This chapter explores RISC-V performance counters and the PMU. We’ll examine the counter architecture, basic counters, hardware performance counters, performance events, profiling techniques, and how RISC-V compares to ARM’s PMU.
16.1 Performance Counter Architecture
Performance Monitoring Overview
Performance monitoring answers questions like:
- How many cycles did this function take?
- What is the IPC (instructions per cycle)?
- How many cache misses occurred?
- How many branches were mispredicted?
- Where is the performance bottleneck?
RISC-V performance counters provide hardware-based measurement with minimal overhead. Counters increment automatically on specific events, allowing precise measurement without software instrumentation.
Counter CSRs
RISC-V defines three groups of performance counter CSRs:
Machine-mode counters (M-mode only):
- mcycle: Machine cycle counter
- minstret: Machine instructions-retired counter
- mhpmcounter3-31: Machine hardware performance counters (29 counters)
Supervisor/User-mode counters (readable from S/U-mode):
- cycle: Cycle counter (shadow of mcycle)
- instret: Instructions-retired counter (shadow of minstret)
- hpmcounter3-31: Hardware performance counters (shadow of mhpmcounter)
Time counter:
- time: Real-time counter (wall-clock time)
For RV32, each counter has a high-word CSR (e.g., mcycleh, cycleh) for 64-bit values.
Counter Privilege Levels
Counters are accessible based on privilege:
M-mode: Can read/write all counters (mcycle, minstret, mhpmcounter)
S-mode: Can read cycle, instret, hpmcounter (if enabled in mcounteren)
U-mode: Can read cycle, instret, hpmcounter (if enabled in mcounteren and, when S-mode is implemented, also in scounteren)
Access control via mcounteren and scounteren:
// M-mode: allow the next lower privilege mode to read cycle and instret
uint64_t mcounteren_bits = (1 << 0) | (1 << 2); // CY, IR
write_csr(mcounteren, mcounteren_bits);
// S-mode: pass cycle and instret on to U-mode
uint64_t scounteren_bits = (1 << 0) | (1 << 2); // CY, IR
write_csr(scounteren, scounteren_bits);
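The two enable registers act as a chain: S-mode can read a counter only if M-mode set its bit in mcounteren, and U-mode additionally needs the same bit in scounteren. The following is a minimal host-side sketch of that rule, assuming a hart that implements S-mode. The names priv_mode_t and counter_readable are illustrative and do not come from any RISC-V header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the privilege modes, for illustration only. */
typedef enum { MODE_U, MODE_S, MODE_M } priv_mode_t;

/*
 * Bit index in mcounteren/scounteren:
 *   0 = cycle (CY), 1 = time (TM), 2 = instret (IR), 3..31 = hpmcounterN.
 * Assumes S-mode is implemented, so U-mode access needs both registers.
 */
static bool counter_readable(priv_mode_t mode, unsigned idx,
                             uint32_t mcounteren, uint32_t scounteren)
{
    assert(idx < 32);
    uint32_t bit = UINT32_C(1) << idx;

    switch (mode) {
    case MODE_M:
        return true;                       /* M-mode is never restricted */
    case MODE_S:
        return (mcounteren & bit) != 0;    /* gated by M-mode only */
    case MODE_U:
        return (mcounteren & bit) != 0 &&  /* gated by M-mode ... */
               (scounteren & bit) != 0;    /* ... and by S-mode */
    }
    return false;
}
```

With mcounteren = 0x5 and scounteren = 0, for example, S-mode can read cycle and instret but U-mode cannot; a read from a mode without permission raises an illegal-instruction exception on real hardware.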
Counter Inhibit
Counters can be inhibited (stopped) via mcountinhibit:
// Stop cycle and instret counters
uint64_t mcountinhibit = (1 << 0) | (1 << 2); // CY, IR
write_csr(mcountinhibit, mcountinhibit);
// Resume counters
write_csr(mcountinhibit, 0);
This is useful for:
- Measuring specific code regions
- Reducing power consumption
- Preventing counter overflow
16.2 Basic Performance Counters
mcycle / cycle (Cycle Counter)
The cycle counter tracks the number of clock cycles executed by the hart:
// Read cycle counter
uint64_t start = read_csr(cycle);
// ... code to measure ...
uint64_t end = read_csr(cycle);
uint64_t cycles = end - start;
printf("Cycles: %llu\n", cycles);
For RV32, use cycleh for the high 32 bits:
// RV32: Read 64-bit cycle counter
uint64_t read_cycle_rv32(void) {
uint32_t hi, lo, hi2;
do {
hi = read_csr(cycleh);
lo = read_csr(cycle);
hi2 = read_csr(cycleh);
} while (hi != hi2); // Retry if high word changed
return ((uint64_t)hi << 32) | lo;
}
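To see why the retry loop matters, it helps to replay the race on a host machine. The sketch below replaces the real CSRs with a simulated counter that advances on every read. All sim_* names are hypothetical stand-ins, and the step size is an assumption chosen to force a carry between the two halves.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 64-bit counter standing in for cycle/cycleh (not real CSRs). */
static uint64_t sim_counter;
static uint64_t sim_step;   /* ticks that elapse on every simulated read */

static void sim_reset(uint64_t start, uint64_t step)
{
    /* A step of 2^32 or more would change the high word on every read,
     * and the retry loop below would never settle. */
    assert(step < (UINT64_C(1) << 32));
    sim_counter = start;
    sim_step = step;
}

static uint32_t sim_read_cycle(void)
{
    uint32_t lo = (uint32_t)sim_counter;
    sim_counter += sim_step;
    return lo;
}

static uint32_t sim_read_cycleh(void)
{
    uint32_t hi = (uint32_t)(sim_counter >> 32);
    sim_counter += sim_step;
    return hi;
}

/* Naive read: high word, then low word, with no consistency check. */
static uint64_t read_naive(void)
{
    uint32_t hi = sim_read_cycleh();
    uint32_t lo = sim_read_cycle();
    return ((uint64_t)hi << 32) | lo;
}

/* Same hi/lo/hi pattern as read_cycle_rv32() in the text. */
static uint64_t read_retry(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = sim_read_cycleh();
        lo  = sim_read_cycle();
        hi2 = sim_read_cycleh();
    } while (hi != hi2);   /* retry if a carry happened mid-read */
    return ((uint64_t)hi << 32) | lo;
}
```

Starting the simulated counter at 0x1FFFFFFFF with a step of 1, read_naive() pairs the old high word with the new low word and returns 0x100000000, almost 2^32 ticks behind the true value. read_retry() notices the changed high word, loops once more, and returns a consistent value.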
minstret / instret (Instructions Retired Counter)
The instructions-retired counter tracks the number of instructions completed:
// Read instret counter
uint64_t start = read_csr(instret);
// ... code to measure ...
uint64_t end = read_csr(instret);
uint64_t instructions = end - start;
printf("Instructions: %llu\n", instructions);
IPC Calculation
Combining cycle and instret gives IPC (instructions per cycle):
// Measure IPC
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);
// ... code to measure ...
uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);
uint64_t cycles = cycles_end - cycles_start;
uint64_t instructions = instret_end - instret_start;
double ipc = (double)instructions / cycles;
printf("IPC: %.2f\n", ipc);
IPC interpretation:
- IPC close to 1: Good utilization on a single-issue in-order core
- IPC > 1: Superscalar execution (several instructions issued per cycle, on an in-order or out-of-order core)
- IPC < 1: Pipeline stalls (cache misses, branch mispredicts, etc.)
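The arithmetic behind these numbers is small enough to isolate in two helpers. This is a minimal sketch that takes counter deltas as plain integers, so it also runs off-target; the function names are illustrative, and CPI (cycles per instruction) is simply the reciprocal view of IPC.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions per cycle from two counter deltas. */
static double ipc_from_deltas(uint64_t cycles, uint64_t instructions)
{
    assert(cycles > 0);   /* an empty measurement window has no IPC */
    return (double)instructions / (double)cycles;
}

/* Cycles per instruction: the reciprocal metric. */
static double cpi_from_deltas(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}
```

For instance, 400000 instructions in 1200000 cycles is an IPC of about 0.33 (a CPI of 3), while the same instruction count in 240000 cycles is an IPC of about 1.67.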
time (Real-Time Counter)
The time counter provides wall-clock time:
// Read time counter
uint64_t start_time = read_csr(time);
// ... code to measure ...
uint64_t end_time = read_csr(time);
uint64_t elapsed = end_time - start_time;
// Convert to microseconds (assuming 1 MHz time counter)
printf("Elapsed time: %llu us\n", elapsed);
The time counter frequency is platform-specific (typically 1 MHz or 10 MHz). It’s useful for:
- Wall-clock timing
- Timeout implementation
- Real-time scheduling
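Because the tick rate differs between platforms, elapsed ticks should be converted with the actual timebase frequency rather than a hard-coded 1 MHz. A minimal sketch follows, assuming the frequency is already known to the caller (on Linux systems it is commonly published as the timebase-frequency device tree property). The function name ticks_to_us is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert time-counter ticks to whole microseconds (rounding down).
 * Splitting into seconds and remainder avoids overflowing the
 * intermediate product for large tick counts.
 */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t timebase_hz)
{
    assert(timebase_hz > 0);
    uint64_t seconds   = ticks / timebase_hz;
    uint64_t remainder = ticks % timebase_hz;
    return seconds * UINT64_C(1000000) +
           (remainder * UINT64_C(1000000)) / timebase_hz;
}
```

With a 10 MHz timebase, 25 ticks come out as 2 microseconds; with a 1 MHz timebase, ticks and microseconds are identical.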
Difference: cycle vs time
- cycle: Counts CPU cycles (typically stops during sleep, varies with frequency scaling)
- time: Counts real time (continues during sleep, constant frequency)
// Example: Measure sleep overhead
uint64_t cycles_before = read_csr(cycle);
uint64_t time_before = read_csr(time);
wfi(); // Sleep until interrupt
uint64_t cycles_after = read_csr(cycle);
uint64_t time_after = read_csr(time);
printf("Cycles during sleep: %llu\n", cycles_after - cycles_before); // ~0
printf("Time during sleep: %llu\n", time_after - time_before); // > 0
16.3 Hardware Performance Counters
mhpmcounter3-31 (Hardware Performance Counters)
RISC-V provides up to 29 hardware performance counters (HPM counters) for tracking implementation-specific events. These counters are optional—implementations may provide 0 to 29 counters.
Counter CSRs:
- mhpmcounter3-31: M-mode counters (29 counters)
- hpmcounter3-31: S/U-mode readable counters (shadows of mhpmcounter)
- mhpmevent3-31: Event selection registers
Event Selection (mhpmevent CSRs)
Each HPM counter has an associated event selector:
// Configure mhpmcounter3 to count L1 I-cache misses
write_csr(mhpmevent3, EVENT_L1_ICACHE_MISS);
// Reset counter
write_csr(mhpmcounter3, 0);
// ... code to measure ...
// Read counter
uint64_t icache_misses = read_csr(mhpmcounter3);
printf("L1 I-cache misses: %llu\n", icache_misses);
Event codes are implementation-specific. Common events include:
- Cache events (L1/L2 hits, misses)
- Branch events (taken, not-taken, mispredicted)
- Pipeline events (stalls, flushes)
- Memory events (loads, stores, TLB misses)
Counter Overflow Handling
Counters are 64-bit and rarely overflow. If a counter does wrap once between two reads, plain unsigned subtraction still yields the correct count, because the arithmetic is modulo 2^64:
// Elapsed events, correct even if the counter wrapped once
uint64_t start = read_csr(mhpmcounter3);
// ... code ...
uint64_t end = read_csr(mhpmcounter3);
uint64_t count = end - start; // modulo 2^64, no explicit overflow branch needed
Some implementations support overflow interrupts, either through the standard Sscofpmf extension or through implementation-specific mechanisms.
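Wraparound becomes a practical concern when only part of a counter is read, for example the low 32-bit word on RV32. The sketch below computes an elapsed count for a counter of a given width; the width parameter and the function name are assumptions added for illustration, and the result is only valid if the counter wrapped at most once between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Elapsed events between two reads of a counter that is `width` bits wide.
 * Unsigned subtraction wraps modulo 2^64; masking reduces it to 2^width.
 */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    return (end - start) & mask;
}
```

A 64-bit counter read as UINT64_MAX - 4 and then as 5 has advanced by 10 events. A 32-bit view read as 0xFFFFFFF0 and then as 0x10 has advanced by 0x20.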
Example: Multi-Counter Measurement
Measuring multiple events simultaneously:
// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
write_csr(mhpmevent4, EVENT_L2_CACHE_MISS);
write_csr(mhpmevent5, EVENT_BRANCH_MISPREDICT);
// Reset counters
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
write_csr(mhpmcounter5, 0);
// Measure code
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);
// ... code to measure ...
uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);
uint64_t l1_misses = read_csr(mhpmcounter3);
uint64_t l2_misses = read_csr(mhpmcounter4);
uint64_t branch_mispredicts = read_csr(mhpmcounter5);
// Report
printf("Cycles: %llu\n", cycles_end - cycles_start);
printf("Instructions: %llu\n", instret_end - instret_start);
printf("L1 D-cache misses: %llu\n", l1_misses);
printf("L2 cache misses: %llu\n", l2_misses);
printf("Branch mispredicts: %llu\n", branch_mispredicts);
16.4 Performance Events
Cache Events
Cache events track memory hierarchy performance:
L1 Instruction Cache:
- L1 I-cache access
- L1 I-cache miss
- L1 I-cache hit
L1 Data Cache:
- L1 D-cache access
- L1 D-cache miss
- L1 D-cache hit
- L1 D-cache writeback
L2 Cache:
- L2 cache access
- L2 cache miss
- L2 cache hit
Example: Measure cache miss rate:
// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_ACCESS);
write_csr(mhpmevent4, EVENT_L1_DCACHE_MISS);
// Reset and measure
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code ...
uint64_t accesses = read_csr(mhpmcounter3);
uint64_t misses = read_csr(mhpmcounter4);
double miss_rate = (double)misses / accesses * 100.0;
printf("L1 D-cache miss rate: %.2f%%\n", miss_rate);
Branch Events
Branch events track control flow performance:
Branch Types:
- Branch instructions executed
- Branch taken
- Branch not taken
Branch Prediction:
- Branch mispredicted
- Branch correctly predicted
Example: Measure branch prediction accuracy:
write_csr(mhpmevent3, EVENT_BRANCH_EXECUTED);
write_csr(mhpmevent4, EVENT_BRANCH_MISPREDICT);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code with branches ...
uint64_t branches = read_csr(mhpmcounter3);
uint64_t mispredicts = read_csr(mhpmcounter4);
double accuracy = (1.0 - (double)mispredicts / branches) * 100.0;
printf("Branch prediction accuracy: %.2f%%\n", accuracy);
Pipeline Events
Pipeline events track execution efficiency:
Stalls:
- Pipeline stall cycles
- Load-use stall
- Store buffer full stall
Flushes:
- Pipeline flush (branch mispredict, exception)
- I-cache flush
- D-cache flush
Example: Identify stall sources:
write_csr(mhpmevent3, EVENT_PIPELINE_STALL);
write_csr(mhpmevent4, EVENT_LOAD_USE_STALL);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code ...
uint64_t total_stalls = read_csr(mhpmcounter3);
uint64_t load_use_stalls = read_csr(mhpmcounter4);
printf("Total stall cycles: %llu\n", total_stalls);
printf("Load-use stalls: %llu (%.1f%%)\n",
load_use_stalls,
(double)load_use_stalls / total_stalls * 100.0);
Memory Events
Memory events track memory system activity:
Memory Operations:
- Load instructions
- Store instructions
- Atomic instructions
TLB Events:
- TLB access
- TLB miss (I-TLB, D-TLB)
- Page table walk
Example: Measure TLB performance:
write_csr(mhpmevent3, EVENT_DTLB_ACCESS);
write_csr(mhpmevent4, EVENT_DTLB_MISS);
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
// ... code with memory accesses ...
uint64_t tlb_accesses = read_csr(mhpmcounter3);
uint64_t tlb_misses = read_csr(mhpmcounter4);
double tlb_miss_rate = (double)tlb_misses / tlb_accesses * 100.0;
printf("D-TLB miss rate: %.2f%%\n", tlb_miss_rate);
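Every example in this section divides one counter by another, which misbehaves when the denominator is zero (for instance, a region that performs no D-cache accesses). A small pair of helpers can centralize that guard. This is a sketch with illustrative names; the per-thousand-instructions form is a common way to normalize miss counts and is included here as an assumption about what the reader may want to report.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of `part` in `total`; defined as 0 when nothing was counted. */
static double rate_percent(uint64_t part, uint64_t total)
{
    if (total == 0) {
        return 0.0;
    }
    return 100.0 * (double)part / (double)total;
}

/* Events per 1000 retired instructions (e.g., misses per kilo-instruction). */
static double per_kilo_instructions(uint64_t events, uint64_t instructions)
{
    assert(instructions > 0);
    return 1000.0 * (double)events / (double)instructions;
}
```

For example, 50 misses in 1000 accesses is a 5% miss rate, and 50000 misses over 500000 instructions is 100 misses per thousand instructions.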
16.5 Profiling and Analysis
perf Tool for RISC-V
The Linux perf tool supports RISC-V performance counters:
# Count cycles and instructions
perf stat -e cycles,instructions ./my_program
# Sample on cycles (profiling)
perf record -e cycles ./my_program
perf report
# Count cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program
# Count branch mispredictions
perf stat -e branch-misses,branches ./my_program
PMU Programming
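Firmware-level PMU programming. The mhpmevent, mhpmcounter, and mcounteren CSRs are machine-mode registers, so this code belongs in M-mode firmware; a Linux kernel running in S-mode reaches them indirectly, through the SBI PMU extension:
// M-mode firmware: Configure PMU for profiling
void setup_pmu_profiling(void) {
// Let lower privilege modes read cycle, time, and instret
write_csr(mcounteren, 0x7); // CY, TM, IR
// Configure HPM counter for L1 D-cache misses
write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
write_csr(mhpmcounter3, 0);
// Also expose hpmcounter3 to lower privilege modes
uint64_t enable = read_csr(mcounteren);
enable |= (1 << 3); // HPM3
write_csr(mcounteren, enable);
}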
Event Sampling
Sampling-based profiling collects periodic samples:
// Pseudo-code: Sample-based profiling
void pmu_interrupt_handler(void) {
// Read PC where interrupt occurred
uint64_t pc = read_csr(mepc);
// Record sample
record_sample(pc);
// Reset counter for next sample
write_csr(mhpmcounter3, -SAMPLE_PERIOD);
}
// Setup sampling
void setup_sampling(void) {
// Configure counter to overflow after SAMPLE_PERIOD events
write_csr(mhpmevent3, EVENT_CYCLES);
write_csr(mhpmcounter3, -SAMPLE_PERIOD);
// Enable overflow interrupt (implementation-specific)
enable_pmu_interrupt();
}
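Writing -SAMPLE_PERIOD works because the counter counts upward and wraps to zero: preloading 2^64 minus the period leaves exactly one period of events before the overflow. The host-side sketch below checks that reasoning by counting events until the wrap. The function names are illustrative, and the simulation loop is only practical for small periods.

```c
#include <assert.h>
#include <stdint.h>

/* Preload value that makes a 64-bit up-counter overflow after `period` events. */
static uint64_t sampling_preload(uint64_t period)
{
    assert(period > 0);
    return (uint64_t)0 - period;   /* 2^64 - period, via unsigned wraparound */
}

/* Count simulated events until the counter wraps to zero. */
static uint64_t events_until_overflow(uint64_t counter)
{
    uint64_t events = 0;
    do {
        counter++;
        events++;
    } while (counter != 0);
    return events;
}
```

With a period of 10000, the preload value is UINT64_MAX - 9999, and the simulated counter wraps after exactly 10000 increments.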
Performance Analysis Techniques
Top-down analysis:
- Measure overall IPC
- If IPC is low, identify bottleneck:
  - Cache misses? → Optimize data layout
  - Branch mispredicts? → Improve branch predictability
  - Pipeline stalls? → Reduce dependencies
Hotspot analysis:
- Use sampling to find hot functions
- Measure counters for hot functions
- Optimize based on counter data
Comparative analysis:
- Measure before optimization
- Apply optimization
- Measure after optimization
- Compare counter values
Example workflow:
# Before optimization
perf stat -e cycles,instructions,L1-dcache-load-misses ./program
# Cycles: 1000000, Instructions: 500000, IPC: 0.5, Misses: 50000
# After optimization (improved data locality)
perf stat -e cycles,instructions,L1-dcache-load-misses ./program_opt
# Cycles: 600000, Instructions: 500000, IPC: 0.83, Misses: 10000
# Result: 40% fewer cycles (about 1.67x speedup), 80% reduction in cache misses
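When comparing before and after runs, it is worth keeping two numbers apart: the speedup factor (old cycles divided by new cycles) and the percentage reduction of a counter. A minimal sketch of both calculations follows, using illustrative function names and the cycle and miss counts from the workflow above.

```c
#include <assert.h>
#include <stdint.h>

/* Speedup factor: how many times faster the optimized run is. */
static double speedup_factor(uint64_t cycles_before, uint64_t cycles_after)
{
    assert(cycles_after > 0);
    return (double)cycles_before / (double)cycles_after;
}

/* Percentage by which a counter value dropped between two runs. */
static double percent_reduction(uint64_t before, uint64_t after)
{
    assert(before > 0);
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

Going from 1000000 to 600000 cycles removes 40% of the cycles, which corresponds to a speedup factor of roughly 1.67. Going from 50000 to 10000 misses is an 80% reduction.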
16.6 Comparison with ARM PMU
RISC-V Counters vs ARM PMU
ARM provides a Performance Monitoring Unit (PMU) with similar capabilities. Comparison:
| Feature | RISC-V PMU | ARM PMU |
|---|---|---|
| Basic Counters | cycle, instret, time | PMCCNTR (cycle), no dedicated instret counter |
| HPM Counters | mhpmcounter3-31 (up to 29) | PMEVCNTRn (typically 6-8) |
| Event Selection | mhpmevent3-31 | PMEVTYPER (event type) |
| Counter Width | 64-bit | 32-bit or 64-bit (ARMv8) |
| Overflow | Implementation-specific | Overflow interrupt (PMOVSCLR) |
| Access Control | mcounteren, scounteren | PMUSERENR (user enable) |
| Counter Inhibit | mcountinhibit | PMCNTENSET/CLR (enable/disable) |
| Privilege Levels | M/S/U modes | EL0/EL1/EL2/EL3 |
Event Mapping
Common events mapped between architectures:
| Event | RISC-V | ARM |
|---|---|---|
| Cycles | cycle CSR | PMCCNTR |
| Instructions | instret CSR | 0x08 (INST_RETIRED, on an event counter) |
| L1 I-cache miss | Implementation-specific | 0x01 |
| L1 D-cache miss | Implementation-specific | 0x03 |
| L2 cache miss | Implementation-specific | 0x17 |
| Branch mispredict | Implementation-specific | 0x10 |
| Branch executed | Implementation-specific | 0x0C |
| TLB miss | Implementation-specific | 0x02 (I-TLB), 0x05 (D-TLB) |
ARM event codes are standardized (ARM Architecture Reference Manual), while RISC-V event codes are implementation-specific.
Profiling Tool Comparison
Both architectures support standard profiling tools:
RISC-V:
# perf on RISC-V Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report
ARM:
# perf on ARM Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report
The perf tool abstracts architecture differences, providing a consistent interface.
Practical Differences
RISC-V advantages:
- 64-bit counters (no overflow on long runs)
- Dedicated instret counter (ARM counts retired instructions on a general event counter)
- Up to 29 HPM counters (ARM typically 6-8)
- Simpler privilege model
ARM advantages:
- Standardized event codes (portable across implementations)
- Mature PMU infrastructure
- Overflow interrupts (standard)
- Extensive tool support
Example: Measuring IPC
RISC-V:
uint64_t cycles = read_csr(cycle);
uint64_t instret = read_csr(instret);
double ipc = (double)instret / cycles;
ARM (requires configuring an event counter):
uint64_t cycles = read_pmccntr();
// No dedicated instret counter: use a PMU event counter instead
uint64_t instret = read_pmevcntr(0); // Configured for instruction count
double ipc = (double)instret / cycles;
RISC-V’s dedicated instret counter simplifies IPC measurement.
Implementation Examples
RISC-V:
- SiFive U74: 2 HPM counters (L1 cache events)
- SiFive P550: 6 HPM counters (cache, branch, TLB events)
- Alibaba XuanTie C910: 4 HPM counters
ARM:
- Cortex-A53: 6 PMU counters
- Cortex-A72: 6 PMU counters
- Cortex-A76: 6 PMU counters
- Neoverse N1: 6 PMU counters
RISC-V implementations vary widely in HPM counter count. ARM implementations are more consistent (typically 6 counters).
🛠️ Hands-on Lab: Lab 16.1 — The CPU’s EKG (Measuring IPC)
This lab demonstrates how to read hardware performance counters and calculate IPC.
⚠️ Important Warning: In QEMU TCG mode or Spike, cycle usually just follows instret (IPC ≈ 1), which doesn’t reflect real hardware pipeline behavior. Run on real hardware to observe significant differences.
Lab Objectives
- Implement C functions to read cycle and instret
- Design two workloads: High dependency (low IPC) vs High parallelism (high IPC)
- Calculate and print IPC
Code (pmu_lab.c)
#include <stdio.h>
#include <stdint.h>
// ---------------------------------------------------------
// Helper Functions: Read CSRs
// ---------------------------------------------------------
static inline uint64_t read_cycle() {
uint64_t val;
asm volatile ("csrr %0, cycle" : "=r" (val));
return val;
}
static inline uint64_t read_instret() {
uint64_t val;
asm volatile ("csrr %0, instret" : "=r" (val));
return val;
}
// ---------------------------------------------------------
// Workload 1: High Dependency (Low IPC)
// ---------------------------------------------------------
void workload_dependency(int iters) {
volatile int a = 1;
for (int i = 0; i < iters; i++) {
// Each add must wait for previous to complete
asm volatile (
"add %0, %0, %0 \n"
"add %0, %0, %0 \n"
"add %0, %0, %0 \n"
: "+r" (a)
);
}
}
// ---------------------------------------------------------
// Workload 2: Independent (High IPC)
// ---------------------------------------------------------
void workload_independent(int iters) {
volatile int a = 1, b = 2, c = 3;
for (int i = 0; i < iters; i++) {
// Instructions are independent, CPU can issue simultaneously
asm volatile (
"add %0, %0, %0 \n"
"add %1, %1, %1 \n"
"add %2, %2, %2 \n"
: "+r" (a), "+r" (b), "+r" (c)
);
}
}
// ---------------------------------------------------------
// Measurement Function
// ---------------------------------------------------------
void measure(const char* name, void (*func)(int), int iters) {
uint64_t start_c = read_cycle();
uint64_t start_i = read_instret();
func(iters);
uint64_t end_c = read_cycle();
uint64_t end_i = read_instret();
uint64_t delta_c = end_c - start_c;
uint64_t delta_i = end_i - start_i;
double ipc = (double)delta_i / delta_c;
printf("[%s]\n", name);
printf(" Cycles : %lu\n", delta_c);
printf(" Instrs : %lu\n", delta_i);
printf(" IPC : %.2f\n\n", ipc);
}
int main() {
printf("=== RISC-V PMU Demo ===\n");
printf("Warning: On QEMU/Spike, IPC is simulated as ~1.0\n");
printf("Run on real hardware for accurate results.\n\n");
int iters = 100000;
measure("Dependent Workload", workload_dependency, iters);
measure("Independent Workload", workload_independent, iters);
return 0;
}
Compile and Run
# Compile
riscv64-unknown-elf-gcc -O0 -o pmu_lab pmu_lab.c
# Run on Spike (simulated, IPC ≈ 1)
spike pk pmu_lab
# On real hardware, expect:
# - Dependent Workload: IPC ≈ 0.3-0.5 (stalls)
# - Independent Workload: IPC ≈ 1.5-2.0 (parallel)
Expected Output (Real Hardware, illustrative; exact counts depend on the core and compiler flags)
=== RISC-V PMU Demo ===
Warning: On QEMU/Spike, IPC is simulated as ~1.0
Run on real hardware for accurate results.
[Dependent Workload]
Cycles : 1200000
Instrs : 400000
IPC : 0.33
[Independent Workload]
Cycles : 240000
Instrs : 400000
IPC : 1.67
danieRTOS Reference: danieRTOS uses cycle counters in its scheduler to measure context switch overhead and task execution time.
⚠️ Common Pitfalls
Pitfall 1: Higher IPC = Faster Program?
Misconception: The optimization goal is to maximize IPC.
Truth: High IPC doesn’t necessarily mean fast programs.
// Super high IPC, but does no useful work
for (int i = 0; i < 1000000; i++) {
asm volatile ("nop"); // IPC might approach 4.0!
}
// Lower IPC, but actually doing computation
for (int i = 0; i < 1000000; i++) {
result += array[i]; // IPC might only be 0.5
}
💡 Correct Understanding: Performance = Instret / Time or Instret / Cycle, but only if those instructions do useful work.
Pitfall 2: Ignoring Counter Overflow
Error Scenario: After long execution, counter overflows causing negative results.
Solution: Use 64-bit counters (RV64) or correctly handle 32-bit counter overflow.
// RV32: Need to read cycleh (high 32 bits)
uint64_t read_cycle_rv32() {
uint32_t lo, hi1, hi2;
do {
hi1 = read_csr(cycleh);
lo = read_csr(cycle);
hi2 = read_csr(cycleh);
} while (hi1 != hi2); // Guard against overflow during read
return ((uint64_t)hi1 << 32) | lo;
}
Pitfall 3: Confusing cycle and time
Error Scenario: Using cycle to measure sleep time.
Truth:
| CSR | Behavior |
|---|---|
| cycle | Tracks CPU execution cycles, stops during WFI |
| time | Tracks real time, continues during WFI |
// ❌ Wrong: cycle doesn't increment during WFI
start = read_cycle();
wfi(); // Wait for interrupt
end = read_cycle();
sleep_time = end - start; // Result is nearly 0!
// ✅ Correct: Use time for sleep measurement
start = read_time();
wfi();
end = read_time();
sleep_time = end - start; // Correctly reflects wait time
Summary
Performance counters and the Performance Monitoring Unit enable quantitative performance analysis. This chapter explored RISC-V’s counter architecture and how it compares to ARM’s mature PMU infrastructure.
Performance counter architecture provides hardware-based measurement with minimal overhead. Counter CSRs exist at multiple privilege levels—machine-mode counters (mcycle, minstret, mhpmcounter) and supervisor/user-mode readable shadows (cycle, instret, hpmcounter). Access control through mcounteren and scounteren enables selective counter exposure to lower privilege levels. Counter inhibit via mcountinhibit allows stopping counters to measure specific code regions or reduce power consumption.
Basic performance counters provide fundamental metrics. The cycle counter tracks clock cycles executed by the hart. The instret counter tracks instructions retired (completed). The time counter provides wall-clock time at a constant frequency. Together, cycle and instret enable IPC calculation, a key performance metric. The difference between cycle (stops during sleep) and time (continues during sleep) enables measuring sleep overhead and real-time intervals.
Hardware performance counters track implementation-specific events through mhpmcounter3-31 (up to 29 counters). Event selection via mhpmevent CSRs configures what each counter tracks. Counters are 64-bit, minimizing overflow concerns. Multiple counters can measure different events simultaneously, enabling comprehensive performance characterization. Counter overflow handling is implementation-specific, with some implementations supporting overflow interrupts.
Performance events cover the full spectrum of microarchitectural activity. Cache events track L1 instruction cache, L1 data cache, and L2 cache hits and misses, revealing memory hierarchy performance. Branch events track branch execution and prediction accuracy, identifying control flow bottlenecks. Pipeline events track stalls and flushes, showing execution efficiency. Memory events track loads, stores, and TLB performance, revealing memory system behavior.
Profiling and analysis leverage performance counters for optimization. The Linux perf tool provides a standard interface to RISC-V counters for counting events and sampling-based profiling. PMU programming configures counters (from M-mode firmware, or from the kernel through SBI calls) and enables user-mode access. Event sampling collects periodic samples to identify hot code regions. Performance analysis techniques include top-down analysis (identify bottleneck category), hotspot analysis (find hot functions), and comparative analysis (measure optimization impact).
Comparison with ARM shows both similarities and differences. ARM’s PMU provides similar functionality with a cycle counter and multiple event counters. ARM standardizes event codes across implementations, while RISC-V leaves them implementation-specific. RISC-V provides 64-bit counters and a dedicated instret counter, simplifying IPC measurement. ARM provides standardized overflow interrupts. Both architectures support the perf tool, providing a consistent user experience. RISC-V allows up to 29 HPM counters, while ARM implementations typically provide 6-8 counters.
Together, RISC-V’s performance counters enable effective performance measurement, profiling, and optimization across the full range from embedded systems to high-performance processors.
Chapter 17. RISC-V vs ARM vs MIPS — A Systematic Comparison
Part X — RISC-V vs Other Architectures
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Compare Architectures Multi-dimensionally: Analyze architectural differences from licensing, ecosystem, and technical debt perspectives
- Understand RISC-V’s Rise: Grasp why RISC-V is called “MIPS 2.0” yet succeeded where MIPS struggled
- Make Technology Choices: Select appropriate architecture based on project needs (IoT vs Mobile vs Server)
💡 Scenario: War at the Round Table
Scene: The whiteboard in the conference room displays three large words: ARM vs RISC-V vs MIPS. The atmosphere is tense.
Architect: “Everyone, the specs for our new product line ‘Project X’ are finalized. We need this chip to run AI acceleration, with extremely low power consumption, and most importantly—the BOM cost is being squeezed hard. Today we must decide on the core architecture.”
Senior (ARM Advocate): “Architect, for safety’s sake, I still recommend the ARM Cortex-M series. Although licensing fees are expensive and we pay per-chip royalties, the toolchain is mature—everyone knows Keil and IAR. If we choose a new architecture to save money and the software team goes crazy debugging, delaying time-to-market, the losses will be even greater.”
Junior (Newcomer): “But Senior, I heard RISC-V is free? Wouldn’t our profit margins be much higher?”
Architect: “Junior, to be precise, the RISC-V ISA (specification) is free, but good IP (design) still costs money (like SiFive or Andes), though usually without royalties. For high-volume products like ours, this can indeed save a huge amount in ‘toll fees.’”
Professor (Consultant): “And it’s not just about money. Senior, have you considered technical flexibility? ARM’s instruction set is closed. If we want to add a few special instructions for our AI algorithm, will ARM listen to us? But with RISC-V, we can use Custom Extensions to add our own instructions—performance might improve tenfold.”
Senior: “Professor makes a good point, but Custom Extensions have risks too. If we add random instructions ourselves, GCC and LLVM won’t recognize them. Wouldn’t we need to maintain our own compiler team?”
Architect: “That’s the trade-off. Let me summarize:”
| Consideration | ARM | RISC-V | Decision Impact |
|---|---|---|---|
| Licensing Cost | High (Per-chip Royalty) | Low (No Royalty) | RISC-V saves money at high volume |
| Ecosystem Maturity | High (20+ years) | Medium (Growing fast) | ARM safer short-term, RISC-V has long-term potential |
| Customization Flexibility | Low (Requires negotiation) | High (Custom Ext.) | RISC-V advantage for AI/crypto acceleration |
| Software Tools | Mature | Improving | ARM temporarily leads in debug experience |
Junior: “What about MIPS? I remember university textbooks all taught MIPS?”
Professor: “MIPS is a classic and contributed greatly to education. But its business model had problems: licensing was too expensive, the IP company changed hands multiple times, and the ecosystem withered. RISC-V inherits MIPS’s spirit in many ways (Clean RISC Design), but it learned the lesson: use an open model to avoid patent hell.”
Architect: “Alright, our conclusion: for short-term projects requiring stable time-to-market, choose ARM; for long-term strategy needing customization and cost-consciousness, choose RISC-V. As for MIPS, it is not recommended for new projects unless we are maintaining legacy product lines.”
Architecture choice shapes everything. The instruction set determines how software expresses computation, how hardware implements execution, and how ecosystems develop around the platform. RISC-V enters a landscape dominated by ARM in mobile and embedded systems, and historically influenced by MIPS in education and networking. Understanding how these architectures compare reveals RISC-V’s design decisions, trade-offs, and competitive position.
This chapter provides a systematic comparison of RISC-V, ARM, and MIPS across eleven dimensions: ISA design philosophy, instruction set complexity, register architecture, exception and interrupt models, memory models, virtual memory, interrupt architecture, calling conventions, pipeline and microarchitecture, ecosystem and licensing, and future directions. Each section examines how the architectures approach the same problem, highlighting similarities, differences, and the implications for software and hardware.
RISC-V represents a modern, modular, open approach. ARM represents a comprehensive, evolving, commercial approach. MIPS represents classic RISC simplicity with historical commercial roots. Together, they illustrate the spectrum of architectural design choices.
17.1 ISA Design Philosophy
RISC-V: Modular and Extensible
RISC-V’s philosophy emphasizes modularity and extensibility:
- Base ISA: Minimal, frozen foundation (RV32I, RV64I; RV128I is still a draft)
- Standard extensions: Optional, composable modules (M, A, F, D, C, V, etc.)
- Custom extensions: Vendor-specific additions without ISA fragmentation
- Clean slate: No legacy baggage, designed from first principles
This modular approach allows:
- Tiny microcontrollers (RV32I only)
- Application processors (RV64IMAFDCV)
- Custom accelerators (base + custom extensions)
ARM: Comprehensive and Evolving
ARM’s philosophy emphasizes comprehensiveness and evolution:
- Comprehensive ISA: Rich instruction set covering many use cases
- Profiles: Different ISA subsets (A-profile, R-profile, M-profile)
- Backward compatibility: New versions extend, rarely remove features
- Market-driven: Features added based on market needs
ARM profiles:
- A-profile: Application processors (ARMv8-A, ARMv9-A)
- R-profile: Real-time processors (ARMv8-R)
- M-profile: Microcontrollers (ARMv6-M, ARMv7-M, ARMv8-M)
MIPS: Classic RISC Simplicity
MIPS’s philosophy emphasizes classic RISC principles:
- Simple, regular ISA: Load-store architecture, fixed-length instructions
- Delayed branches: Expose pipeline to software (MIPS I-IV)
- Coprocessor model: Floating-point and other functions as coprocessors
- Minimal complexity: Keep hardware simple, let software handle complexity
MIPS influenced RISC-V’s design but RISC-V modernized many aspects (no delayed branches, cleaner privilege model).
Design Trade-offs
| Aspect | RISC-V | ARM | MIPS |
|---|---|---|---|
| Modularity | High (base + extensions) | Medium (profiles) | Low (monolithic) |
| Extensibility | High (custom extensions) | Low (vendor-specific) | Low (proprietary) |
| Backward Compatibility | High (frozen base) | High (evolutionary) | Medium (versions) |
| Complexity | Low to medium | Medium to high | Low |
| Flexibility | High | Medium | Low |
17.2 Instruction Set Complexity
Instruction Count Comparison
Approximate instruction counts:
| Architecture | Base Instructions | With Extensions | Total (typical) |
|---|---|---|---|
| RISC-V RV32I | 47 | +M(8), +A(11), +F(26), +D(26), +C(~40) | ~150-200 |
| RISC-V RV64I | 59 | +M(8), +A(11), +F(26), +D(26), +C(~40) | ~170-220 |
| ARM ARMv8-A | ~500 base | +NEON, +SVE, +crypto | ~1000+ |
| MIPS32 | ~100 base | +FPU, +DSP | ~200-300 |
RISC-V has the smallest base ISA. ARM has the largest instruction set. MIPS falls in between.
Encoding Formats
RISC-V: 6 base formats (R, I, S, B, U, J) + compressed (CR, CI, CSS, CIW, CL, CS, CA, CB, CJ)
R-type: [funct7|rs2|rs1|funct3|rd|opcode]
I-type: [imm[11:0]|rs1|funct3|rd|opcode]
S-type: [imm[11:5]|rs2|rs1|funct3|imm[4:0]|opcode]
ARM: Multiple formats (data processing, load/store, branch, etc.)
Data processing: [cond|00|I|opcode|S|Rn|Rd|operand2]
Load/store: [cond|01|I|P|U|B|W|L|Rn|Rd|offset]
MIPS: 3 formats (R, I, J)
R-type: [opcode|rs|rt|rd|shamt|funct]
I-type: [opcode|rs|rt|immediate]
J-type: [opcode|address]
RISC-V and MIPS have simpler, more regular encodings than ARM.
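The regularity of the RISC-V formats shows up directly in decoder code: rd, rs1, rs2, and funct3 sit at the same bit positions in every format that uses them. The following is a minimal sketch of field extraction for the R-type and I-type layouts listed above; the struct and function names are illustrative and do not come from any particular toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Fields shared by the 32-bit R-type and I-type formats (illustrative). */
typedef struct {
    uint32_t opcode;  /* bits  6:0  */
    uint32_t rd;      /* bits 11:7  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    uint32_t rs2;     /* bits 24:20 (R-type) */
    uint32_t funct7;  /* bits 31:25 (R-type) */
    int32_t  imm_i;   /* bits 31:20, sign-extended (I-type) */
} rv_fields_t;

static rv_fields_t rv_decode_fields(uint32_t insn)
{
    /* Standard 32-bit encodings always have the two lowest bits set. */
    assert((insn & 0x3u) == 0x3u);

    rv_fields_t f;
    f.opcode = insn & 0x7Fu;
    f.rd     = (insn >> 7)  & 0x1Fu;
    f.funct3 = (insn >> 12) & 0x7u;
    f.rs1    = (insn >> 15) & 0x1Fu;
    f.rs2    = (insn >> 20) & 0x1Fu;
    f.funct7 = (insn >> 25) & 0x7Fu;
    /* Portable sign extension of the 12-bit immediate: flip the sign bit,
     * then subtract its weight. */
    f.imm_i  = (int32_t)((insn >> 20) ^ 0x800u) - 0x800;
    return f;
}
```

Decoding 0x00B50533 (add a0, a0, a1) gives opcode 0x33, rd = 10, rs1 = 10, rs2 = 11. Decoding 0xFFF50513 (addi a0, a0, -1) gives opcode 0x13 and an immediate of -1. Which fields are meaningful depends on the format selected by the opcode.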
Addressing Modes
RISC-V:
- Register: add rd, rs1, rs2
- Immediate: addi rd, rs1, imm
- Base+offset: lw rd, offset(rs1)
ARM:
- Register: ADD Rd, Rn, Rm
- Immediate: ADD Rd, Rn, #imm
- Base+offset: LDR Rd, [Rn, #offset]
- Base+register: LDR Rd, [Rn, Rm]
- Pre/post-indexed: LDR Rd, [Rn, #offset]! or LDR Rd, [Rn], #offset
MIPS:
- Register: add $rd, $rs, $rt
- Immediate: addi $rt, $rs, imm
- Base+offset: lw $rt, offset($rs)
ARM has the most addressing modes. RISC-V and MIPS are simpler.
ISA Complexity Metrics
| Metric | RISC-V | ARM | MIPS |
|---|---|---|---|
| Instruction formats | 6 base + 9 compressed | 10+ | 3 |
| Addressing modes | 3 | 10+ | 3 |
| Conditional execution | Branch only | Most instructions (ARMv7), limited (ARMv8) | Branch only |
| Instruction length | 32-bit (16-bit compressed) | 32-bit (16-bit Thumb) | 32-bit |
| Regularity | High | Medium | High |
RISC-V and MIPS prioritize regularity. ARM prioritizes expressiveness.
17.3 Register Architecture
RISC-V: x0-x31 (x0 = zero)
RISC-V provides 32 general-purpose registers:
- x0: Hardwired zero (reads as 0, writes ignored)
- x1-x31: General-purpose registers
- f0-f31: Floating-point registers (F/D extensions)
Special conventions:
- x1 (ra): Return address
- x2 (sp): Stack pointer
- x8 (s0/fp): Frame pointer
ARM: X0-X30 + XZR + SP
ARM ARMv8-A provides 31 general-purpose registers + special registers:
- X0-X30: General-purpose registers (64-bit)
- W0-W30: 32-bit views of X0-X30
- XZR (X31): Zero register (reads as 0, writes ignored)
- SP: Stack pointer (separate from general registers)
- PC: Program counter (not directly accessible in ARMv8)
ARM has 31 general registers vs RISC-V’s 32 (including zero).
MIPS: $0-$31 ($0 = zero)
MIPS provides 32 general-purpose registers:
- $0 ($zero): Hardwired zero
- $1-$31: General-purpose registers
Special conventions:
- $31 ($ra): Return address
- $29 ($sp): Stack pointer
- $30 ($fp): Frame pointer
MIPS and RISC-V have identical register counts and zero register concept.
Special-Purpose Registers
RISC-V: CSRs (Control and Status Registers)
- Accessed via csrr, csrw, csrrw, etc.
- Examples: mstatus, mtvec, mepc, mcause
ARM: System registers
- Accessed via MRS, MSR instructions
- Examples: SCTLR_EL1, VBAR_EL1, ELR_EL1, ESR_EL1
MIPS: Coprocessor 0 registers
- Accessed via mfc0, mtc0 instructions
- Examples: Status, Cause, EPC, BadVAddr
All three use separate namespaces for system registers.
17.4 Exception and Interrupt Models
RISC-V: Trap Model (M/S/U Modes)
RISC-V uses a unified trap model:
- Privilege modes: M (Machine), S (Supervisor), U (User)
- Trap types: Exceptions (synchronous), Interrupts (asynchronous)
- Trap vector: mtvec (M-mode), stvec (S-mode)
- Trap cause: mcause (M-mode), scause (S-mode)
- Trap PC: mepc (M-mode), sepc (S-mode)
Trap handling:
# M-mode trap handler
trap_handler:
csrr t0, mcause # Read cause
csrr t1, mepc # Read PC
# Handle trap
mret # Return from trap
ARM: Exception Levels (EL0-EL3)
ARM uses exception levels:
- Exception levels: EL0 (User), EL1 (OS), EL2 (Hypervisor), EL3 (Secure Monitor)
- Exception types: Synchronous, IRQ, FIQ, SError
- Exception vector: VBAR_EL1, VBAR_EL2, VBAR_EL3
- Exception syndrome: ESR_EL1, ESR_EL2, ESR_EL3
- Exception link: ELR_EL1, ELR_EL2, ELR_EL3
Exception handling:
# EL1 exception handler
exception_handler:
mrs x0, ESR_EL1 # Read syndrome
mrs x1, ELR_EL1 # Read PC
# Handle exception
eret # Return from exception
MIPS: Exception Handling
MIPS uses a simpler exception model:
- Modes: User, Kernel
- Exception vector: Fixed addresses: 0x80000180 (general exceptions), 0x80000000 (TLB refill), 0xBFC00000 (reset/NMI)
- Cause register: Encodes exception type
- EPC: Exception PC
Exception handling:
# MIPS exception handler
exception_handler:
mfc0 $k0, $13 # Read Cause (CP0 register 13)
mfc0 $k1, $14 # Read EPC (CP0 register 14)
# Handle exception
eret # Return from exception
Comparison Table
| Feature | RISC-V | ARM | MIPS |
|---|---|---|---|
| Privilege Levels | 3 (M/S/U) + optional H | 4 (EL0-EL3) | 2 (User/Kernel) |
| Trap/Exception Types | Unified (trap) | 4 types (Sync, IRQ, FIQ, SError) | Unified (exception) |
| Vector Table | mtvec, stvec | VBAR_ELn | Fixed address |
| Cause Register | mcause, scause | ESR_ELn | Cause |
| Return Instruction | mret, sret | eret | eret |
| Nested Interrupts | Software-managed | Hardware-managed | Software-managed |
ARM has the most sophisticated exception model. RISC-V is modular. MIPS is simplest.
17.5 Memory Models
RISC-V: RVWMO (Weak Memory Ordering)
RISC-V uses a weak memory model (RVWMO):
- Ordering: Relaxed by default
- Fences: fence instruction for ordering
- Atomics: LR/SC and AMO instructions
- TSO extension: Optional total store ordering (Ztso)
Memory ordering:
# RISC-V: Ensure store visible before load
sw x10, 0(x5)
fence w, r # Write-to-read fence
lw x11, 0(x6)
ARM: Weak Memory Model
ARM uses a weak memory model:
- Ordering: Relaxed by default
- Barriers: DMB (data memory barrier), DSB (data synchronization barrier), ISB (instruction synchronization barrier)
- Atomics: LDXR/STXR (exclusive) and atomic instructions (ARMv8.1+)
Memory ordering:
# ARM: Ensure store visible before load
STR X0, [X1]
DMB SY # Data memory barrier (system)
LDR X2, [X3]
MIPS: Sequential Consistency Variants
MIPS traditionally used sequential consistency, but modern MIPS supports weak ordering:
- Ordering: Sequential consistency (MIPS I-III), weak ordering (MIPS IV+)
- Barriers: sync instruction
- Atomics: LL/SC (load-linked/store-conditional)
Memory ordering:
# MIPS: Ensure store visible before load
sw $t0, 0($a0)
sync # Synchronization barrier
lw $t1, 0($a1)
Memory Barrier Instructions
| Architecture | Barrier Instruction | Purpose |
|---|---|---|
| RISC-V | fence r, w | Order reads before writes |
| RISC-V | fence w, r | Order writes before reads |
| RISC-V | fence rw, rw | Full fence |
| ARM | DMB | Data memory barrier |
| ARM | DSB | Data synchronization barrier |
| ARM | ISB | Instruction synchronization barrier |
| MIPS | sync | Synchronization barrier |
RISC-V’s fence is more fine-grained (specify predecessor/successor). ARM has multiple barrier types. MIPS has a single sync instruction.
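In portable code these barriers are usually reached through language-level atomics rather than written by hand. The sketch below uses C11 `atomic_thread_fence` for a publish/consume handshake; `payload`, `ready`, `publish`, and `try_consume` are hypothetical names chosen for illustration. Compilers typically lower the release fence to `fence rw, w` on RISC-V, a `DMB` variant on ARM, and `sync` on MIPS, though the exact instruction selection depends on the toolchain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;        /* plain data being published */
static atomic_bool ready;  /* flag guarding the payload; starts out false */

/* Producer: write the data, then publish the flag. */
static inline void publish(int value) {
    payload = value;
    /* Order the payload write before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: returns true and fills *out once the flag has been observed. */
static inline bool try_consume(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed)) {
        return false;
    }
    /* Order the flag load before the payload read. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

This release/acquire pairing needs weaker barriers than the store-then-load examples above, which require a full barrier on all three architectures.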
17.6 Virtual Memory
RISC-V: Sv39/Sv48 Page Tables
RISC-V virtual memory:
- Sv39: 39-bit virtual address, 3-level page table
- Sv48: 48-bit virtual address, 4-level page table
- Sv57: 57-bit virtual address, 5-level page table
- Page sizes: 4 KB, 2 MB (megapage), 1 GB (gigapage)
- TLB: Implementation-specific
Page table entry (Sv39):
[63:54] Reserved
[53:28] PPN[2] (26 bits)
[27:19] PPN[1] (9 bits)
[18:10] PPN[0] (9 bits)
[9:8] RSW (reserved for software)
[7:0] Flags (D, A, G, U, X, W, R, V)
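As a rough illustration of how a three-level Sv39 walk consumes an address, the C sketch below splits a virtual address into its three 9-bit VPN indices and 12-bit page offset, then rebuilds a physical address from a leaf PTE that maps a 4 KB page. The type and function names (`sv39_va`, `sv39_split`, `sv39_pte_to_pa`) are assumptions made for this example; megapage and gigapage leaves, permission checks, and the walk loop itself are left out.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)  /* valid bit, lowest of the eight flag bits */

typedef struct {
    uint32_t vpn[3];  /* vpn[2] indexes the root table, vpn[0] the last level */
    uint32_t offset;  /* byte offset within the 4 KB page */
} sv39_va;

/* Split a 39-bit virtual address into table indices and page offset. */
static inline sv39_va sv39_split(uint64_t va) {
    sv39_va out;
    out.offset = (uint32_t)(va & 0xFFF);
    out.vpn[0] = (uint32_t)((va >> 12) & 0x1FF);
    out.vpn[1] = (uint32_t)((va >> 21) & 0x1FF);
    out.vpn[2] = (uint32_t)((va >> 30) & 0x1FF);
    return out;
}

/* Physical address for a 4 KB leaf PTE: the 44-bit PPN sits in bits [53:10]. */
static inline uint64_t sv39_pte_to_pa(uint64_t pte, uint32_t offset) {
    assert(pte & PTE_V);
    assert(offset < 4096);
    uint64_t ppn = (pte >> 10) & ((1ULL << 44) - 1);
    return (ppn << 12) | offset;
}
```

For example, the address 0x40403456 splits into VPN[2]=1, VPN[1]=2, VPN[0]=3 with offset 0x456.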
ARM: 48-bit VA with TTBR0/1
ARM virtual memory (ARMv8-A):
- VA size: 48-bit (configurable 36-48 bits)
- Page table levels: 4 levels (configurable)
- Page sizes: 4 KB, 2 MB, 1 GB (with the 4 KB granule; 16 KB and 64 KB granules are also defined)
- TTBR0/TTBR1: Separate page tables for user/kernel
Page table entry (4 KB granule):
[63:48] Ignored/SW use
[47:12] Output address
[11:2] Attributes (AF, SH, AP, NS, etc.)
[1] Table/Block
[0] Valid
MIPS: TLB-Based MMU
MIPS virtual memory:
- TLB: Software-managed TLB (no hardware page table walk)
- VA size: 32-bit (MIPS32), 64-bit (MIPS64)
- Page sizes: 4 KB (typical), configurable
- TLB entries: 16-64 entries (implementation-specific)
TLB entry:
EntryHi: [VPN | ASID]
EntryLo0: [PFN | C | D | V | G]
EntryLo1: [PFN | C | D | V | G]
PageMask: Page size
Page Table Walk Comparison
| Feature | RISC-V | ARM | MIPS |
|---|---|---|---|
| Hardware Walk | Yes | Yes | No (software TLB refill) |
| Page Table Levels | 3-5 (Sv39-Sv57) | 4 (configurable) | N/A (TLB only) |
| Page Sizes | 4 KB, 2 MB, 1 GB | 4 KB, 2 MB, 1 GB | 4 KB (configurable) |
| TLB Refill | Hardware | Hardware | Software (exception) |
| ASID | Yes (satp.ASID) | Yes (TTBR.ASID) | Yes (EntryHi.ASID) |
RISC-V and ARM use hardware page table walks. MIPS uses software TLB refill, giving more flexibility but requiring software overhead.
17.7 Interrupt Architecture
RISC-V: PLIC/CLIC/AIA
RISC-V interrupt architecture has evolved:
- CLINT: Core-Local Interruptor (timer, software interrupts)
- PLIC: Platform-Level Interrupt Controller (external interrupts)
- CLIC: Core-Local Interrupt Controller (vectored, nested interrupts for embedded)
- AIA: Advanced Interrupt Architecture (MSI, IMSIC for servers)
PLIC example:
// PLIC interrupt handler
void plic_handler(void) {
    uint32_t source = plic_claim(); // Claim interrupt
    if (source == UART_IRQ) {
        uart_handler();
    }
    plic_complete(source); // Complete interrupt
}
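The claim and complete steps both go through one memory-mapped register per interrupt context. A minimal sketch of such helpers follows, assuming the register layout from the RISC-V PLIC specification (claim/complete at offset 0x200004, with a 0x1000 stride per context). The `_ctx` function names, the `plic_base` parameter, and the context numbering are hypothetical; the real base address and the mapping from hart to context are platform-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLIC_CLAIM_OFFSET   0x200004u  /* claim/complete register of context 0 */
#define PLIC_CONTEXT_STRIDE 0x1000u    /* distance between per-context blocks */

/* Address of the claim/complete register for one context. */
static inline volatile uint32_t *plic_claim_reg(volatile uint8_t *plic_base,
                                                unsigned context) {
    assert(plic_base != NULL);
    return (volatile uint32_t *)(plic_base + PLIC_CLAIM_OFFSET +
                                 (size_t)context * PLIC_CONTEXT_STRIDE);
}

/* Reading the register claims the highest-priority pending source (0 = none). */
static inline uint32_t plic_claim_ctx(volatile uint8_t *plic_base,
                                      unsigned context) {
    return *plic_claim_reg(plic_base, context);
}

/* Writing the source ID back signals completion to the PLIC. */
static inline void plic_complete_ctx(volatile uint8_t *plic_base,
                                     unsigned context, uint32_t source) {
    *plic_claim_reg(plic_base, context) = source;
}
```

Passing the base address as a parameter keeps the helpers independent of any particular SoC memory map and lets them be exercised against an ordinary buffer.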
ARM: GIC (GICv3/GICv4)
ARM uses the Generic Interrupt Controller (GIC):
- GICv2: Legacy, supports up to 8 cores
- GICv3: Modern, supports many cores, message-based
- GICv4: Adds direct injection of virtual interrupts
- Interrupt types: SGI (software), PPI (private peripheral), SPI (shared peripheral)
GIC example:
// GIC interrupt handler
void gic_handler(void) {
    uint32_t intid = gic_acknowledge(); // Acknowledge interrupt
    if (intid == UART_INTID) {
        uart_handler();
    }
    gic_end_of_interrupt(intid); // End of interrupt
}
MIPS: Simple IRQ Model
MIPS uses a simple interrupt model:
- Interrupt lines: 8 interrupt pending bits (IP0-IP7); IP0-IP1 are software interrupts, IP2-IP7 are hardware interrupt lines
- Interrupt mask: Controlled by Status register
- Interrupt pending: Indicated by Cause register
- External controller: Optional external interrupt controller
MIPS example:
// MIPS interrupt handler
void mips_interrupt_handler(void) {
    uint32_t cause = read_c0_cause();
    uint32_t pending = (cause >> 8) & 0xFF; // IP bits
    if (pending & (1 << 2)) { // IP2
        uart_handler();
    }
}
Interrupt Routing and Priority
| Feature | RISC-V (PLIC) | ARM (GIC) | MIPS |
|---|---|---|---|
| Interrupt Sources | 1-1023 | 32-1020 | 8 lines |
| Priority Levels | 0-255 | 0-255 | None (software) |
| Routing | Per-hart enable | Affinity routing | Fixed |
| Vectoring | Optional (CLIC) | Yes | No |
| Nesting | Software-managed | Hardware-managed | Software-managed |
ARM GIC is the most sophisticated. RISC-V PLIC is flexible. MIPS is simplest.
17.8 Calling Conventions
RISC-V: RV64 SysV ABI
RISC-V calling convention (RV64):
- Arguments: a0-a7 (x10-x17)
- Return values: a0-a1 (x10-x11)
- Saved registers: s0-s11 (x8-x9, x18-x27)
- Temporary registers: t0-t6 (x5-x7, x28-x31)
- Stack pointer: sp (x2)
- Return address: ra (x1)
Function call:
# Caller
addi sp, sp, -16
sd ra, 8(sp)
call function # ra ← PC+4, PC ← function
ld ra, 8(sp)
addi sp, sp, 16
# Callee
function:
addi sp, sp, -16
sd s0, 8(sp)
# ... function body ...
ld s0, 8(sp)
addi sp, sp, 16
ret # PC ← ra
ARM: AAPCS64
ARM calling convention (AAPCS64):
- Arguments: X0-X7
- Return values: X0-X1
- Saved registers: X19-X28
- Temporary registers: X9-X15
- Stack pointer: SP
- Return address: X30 (LR)
- Frame pointer: X29 (FP)
Function call:
# Caller
STP X29, X30, [SP, #-16]!
BL function # LR ← PC+4, PC ← function
LDP X29, X30, [SP], #16
# Callee
function:
STP X29, X30, [SP, #-16]!
MOV X29, SP
# ... function body ...
LDP X29, X30, [SP], #16
RET # PC ← LR
MIPS: O32/N32/N64 ABIs
MIPS has multiple ABIs:
- O32: 32-bit, 4 argument registers
- N32: 32-bit pointers, 64-bit registers
- N64: 64-bit
MIPS O32 calling convention:
- Arguments: $a0-$a3 ($4-$7)
- Return values: $v0-$v1 ($2-$3)
- Saved registers: $s0-$s7 ($16-$23)
- Temporary registers: $t0-$t9 ($8-$15, $24-$25)
- Stack pointer: $sp ($29)
- Return address: $ra ($31)
Function call:
# Caller
addiu $sp, $sp, -8
sw $ra, 4($sp)
jal function # $ra ← PC+8, PC ← function
lw $ra, 4($sp)
addiu $sp, $sp, 8
# Callee
function:
addiu $sp, $sp, -8
sw $s0, 4($sp)
# ... function body ...
lw $s0, 4($sp)
addiu $sp, $sp, 8
jr $ra # PC ← $ra
ABI Comparison
| Feature | RISC-V | ARM | MIPS (O32) |
|---|---|---|---|
| Argument Registers | 8 (a0-a7) | 8 (X0-X7) | 4 ($a0-$a3) |
| Return Registers | 2 (a0-a1) | 2 (X0-X1) | 2 ($v0-$v1) |
| Saved Registers | 12 (s0-s11) | 10 (X19-X28) | 8 ($s0-$s7) |
| Temporary Registers | 7 (t0-t6) | 7 (X9-X15) | 10 ($t0-$t9) |
| Stack Growth | Downward | Downward | Downward |
| Alignment | 16 bytes | 16 bytes | 8 bytes (O32) |
RISC-V and ARM have more argument registers than MIPS O32, reducing stack usage.
17.9 Pipeline and Microarchitecture
In-Order vs Out-of-Order
All three architectures support both in-order and out-of-order implementations:
RISC-V:
- In-order: SiFive E-series, Rocket
- Out-of-order: SiFive P-series, BOOM (Berkeley Out-of-Order Machine)
ARM:
- In-order: Cortex-A5, Cortex-A7, Cortex-A53
- Out-of-order: Cortex-A72, Cortex-A76, Neoverse N1/V1
MIPS:
- In-order: MIPS 24K, 34K
- Out-of-order: MIPS 74K, R10000
Branch Prediction Strategies
Modern implementations use sophisticated branch prediction:
| Implementation | Branch Predictor | Accuracy |
|---|---|---|
| SiFive U74 | 2-level adaptive | ~90% |
| SiFive P550 | TAGE predictor | ~95% |
| ARM Cortex-A76 | Multi-level predictor | ~95%+ |
| MIPS 74K | 2-level adaptive | ~90% |
Out-of-order cores require better branch prediction to maintain performance.
Cache Hierarchies
Typical cache configurations:
RISC-V (SiFive P550):
- L1 I-cache: 32 KB, 4-way
- L1 D-cache: 32 KB, 8-way
- L2 cache: 512 KB - 2 MB, 8-way
ARM (Cortex-A76):
- L1 I-cache: 64 KB, 4-way
- L1 D-cache: 64 KB, 4-way
- L2 cache: 256 KB - 512 KB, 8-way
MIPS (74K):
- L1 I-cache: 32 KB, 4-way
- L1 D-cache: 32 KB, 4-way
- L2 cache: Optional, implementation-specific
Implementation Examples
| Implementation | Type | Pipeline Stages | Issue Width | IPC (peak) |
|---|---|---|---|---|
| SiFive E76 | In-order | 8 | 2 | 2.0 |
| SiFive P550 | Out-of-order | 13 | 3 | 3.0 |
| ARM Cortex-A53 | In-order | 8 | 2 | 2.0 |
| ARM Cortex-A76 | Out-of-order | 13 | 4 | 4.0 |
| MIPS 24K | In-order | 8 | 1 | 1.0 |
| MIPS 74K | Out-of-order | 15 | 2 | 2.0 |
ARM has the most aggressive out-of-order implementations. RISC-V is catching up. MIPS development has slowed.
17.10 Ecosystem and Licensing
RISC-V: Open and Free ISA
RISC-V ecosystem:
- ISA: Open, royalty-free
- Licensing: No licensing fees
- Governance: RISC-V International (non-profit)
- Implementations: Open-source (Rocket, BOOM) and commercial (SiFive, Andes, etc.)
- Software: GCC, LLVM, Linux, FreeBSD, Zephyr, FreeRTOS
Advantages:
- No licensing costs
- Customizable (add custom extensions)
- Growing ecosystem
- Academic and research-friendly
Challenges:
- Younger ecosystem (less mature than ARM)
- Fewer commercial implementations
- Software ecosystem still developing
ARM: Commercial Licensing
ARM ecosystem:
- ISA: Proprietary
- Licensing: Architecture license (design own core) or implementation license (use ARM core)
- Governance: ARM Holdings (commercial company)
- Implementations: ARM Cortex series, vendor designs (Apple, Qualcomm, Samsung)
- Software: Mature ecosystem (GCC, LLVM, Linux, Android, iOS)
Advantages:
- Mature, proven ecosystem
- Extensive software support
- Wide industry adoption
- Strong performance
Challenges:
- Licensing fees
- Less customizable
- Vendor lock-in
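MIPS: Historical Commercial, Briefly Open
MIPS ecosystem:
- ISA: Proprietary; briefly opened under the MIPS Open initiative (announced 2018, discontinued 2019)
- Licensing: Commercial, apart from the short-lived MIPS Open program
- Governance: MIPS (under Wave Computing during the MIPS Open period); the company has since shifted to RISC-V-based designs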
- Implementations: Historical (MIPS Technologies), now limited
- Software: GCC, LLVM, Linux (legacy support)
Status:
- Historical importance (education, networking)
- Declining commercial adoption
- Open initiative came too late
- Largely superseded by ARM and RISC-V
Industry Adoption and Ecosystem
| Aspect | RISC-V | ARM | MIPS |
|---|---|---|---|
| Market Share | Growing (embedded, IoT) | Dominant (mobile, embedded) | Declining |
| Vendors | SiFive, Andes, Alibaba, etc. | ARM, Apple, Qualcomm, etc. | Limited |
| Software Ecosystem | Growing | Mature | Legacy |
| Tool Support | Good (GCC, LLVM) | Excellent | Good (legacy) |
| Community | Active, growing | Mature, large | Small |
| Education | Increasing | Common | Historical |
ARM dominates commercially. RISC-V is growing rapidly. MIPS is declining.
17.11 Future Directions
RISC-V Roadmap
RISC-V is actively evolving:
- Ratified extensions: V (vector), B (bit manipulation), Zicond (conditional ops), Zc (code size reduction), scalar and vector crypto
- In progress: J (JIT support), P (packed SIMD)
- Proposed: Memory tagging, capabilities
- Platform profiles: RVA22, RVA23 (standardize extension combinations)
Focus areas:
- Performance (vector, out-of-order)
- Security (crypto, memory tagging)
- Code density (compressed, Zc)
- Ecosystem maturity (tools, software)
ARM Roadmap
ARM continues to evolve:
- ARMv9-A: SVE2 (scalable vector), TME (transactional memory), MTE (memory tagging)
- Confidential Compute: Realm Management Extension (RME)
- AI/ML: Matrix extensions, neural processing
- Automotive: ASIL-D safety, real-time
Focus areas:
- AI/ML acceleration
- Security (confidential computing)
- Automotive and safety
- Performance scaling
Emerging Extensions
RISC-V:
- Crypto: AES, SHA, scalar crypto
- Vector: V extension (ratified), improvements ongoing
- Hypervisor: H extension (ratified)
- Bit manipulation: B extension (ratified)
ARM:
- SVE2: Scalable vector (successor to NEON)
- MTE: Memory Tagging Extension (security)
- TME: Transactional Memory Extension
- SME: Scalable Matrix Extension (AI/ML)
Security Features
Both architectures are adding security features:
RISC-V:
- PMP (Physical Memory Protection)
- sPMP (Supervisor PMP)
- Crypto extensions
- Proposed: Memory tagging, capabilities (CHERI-like)
ARM:
- TrustZone (secure world)
- MTE (Memory Tagging Extension)
- PAC (Pointer Authentication Codes)
- BTI (Branch Target Identification)
Industry Trends
RISC-V trends:
- Rapid adoption in China (Alibaba, Huawei)
- Growing in IoT and embedded
- Increasing in data center (SiFive, Ventana)
- Strong in research and education
ARM trends:
- Continued dominance in mobile
- Growing in data center (AWS Graviton, Ampere)
- Strong in automotive
- Expanding in AI/ML
MIPS trends:
- Declining market share
- Legacy support only
- Some niche applications (networking)
The future favors RISC-V (open, growing) and ARM (mature, dominant). MIPS is largely historical.
Summary
Comparing RISC-V, ARM, and MIPS reveals different architectural philosophies and trade-offs. This chapter examined eleven dimensions of comparison, showing how each architecture approaches the same problems.
ISA design philosophy distinguishes the three architectures fundamentally. RISC-V emphasizes modularity and extensibility with a frozen base ISA and composable extensions. ARM emphasizes comprehensiveness and evolution with rich instruction sets and market-driven features. MIPS emphasizes classic RISC simplicity with regular encodings and minimal complexity. These philosophies shape all subsequent design decisions.
Instruction set complexity varies significantly. RISC-V has the smallest base ISA (47 instructions for RV32I) with optional extensions. ARM has the largest instruction set (1000+ instructions) covering many use cases. MIPS falls in between with classic RISC simplicity. Encoding formats reflect this—RISC-V and MIPS use regular formats, while ARM uses more complex encodings for expressiveness.
Register architecture shows convergence. All three provide 31-32 general-purpose registers with a hardwired zero register (RISC-V x0, ARM XZR, MIPS $0). ARM separates the stack pointer from general registers. All three use separate namespaces for system registers accessed through special instructions.
Exception and interrupt models reflect different privilege architectures. RISC-V uses a unified trap model with M/S/U modes. ARM uses exception levels (EL0-EL3) with four exception types. MIPS uses a simpler two-mode model. ARM provides the most sophisticated exception handling, while MIPS is simplest.
Memory models all use weak ordering for performance. RISC-V uses RVWMO with fine-grained fence instructions. ARM uses a weak model with multiple barrier types (DMB, DSB, ISB). MIPS evolved from sequential consistency to weak ordering with sync barriers. All three provide atomic instructions for synchronization.
Virtual memory shows architectural maturity. RISC-V and ARM use hardware page table walks with multi-level page tables (Sv39/Sv48 for RISC-V, 4-level for ARM). MIPS uses software-managed TLB refill, providing flexibility at the cost of software overhead. All three support multiple page sizes and ASID for efficient context switching.
Interrupt architecture ranges from simple to sophisticated. RISC-V uses PLIC for platform interrupts with CLIC for embedded and AIA for servers. ARM uses GIC (Generic Interrupt Controller) with mature multi-core support. MIPS uses a simple 8-line interrupt model. ARM GIC is most mature, RISC-V is evolving, MIPS is simplest.
Calling conventions show practical similarities. All three use similar register allocation strategies with 4-8 argument registers, 2 return registers, and saved/temporary register sets. RISC-V and ARM provide 8 argument registers, while MIPS O32 provides only 4. All use downward-growing stacks with 8-16 byte alignment.
Pipeline and microarchitecture demonstrate implementation diversity. All three architectures support both in-order and out-of-order implementations. ARM has the most aggressive out-of-order cores (Cortex-A76, Neoverse). RISC-V is catching up with competitive designs (SiFive P550, BOOM). MIPS development has slowed. Branch prediction and cache hierarchies are similar across modern implementations.
Ecosystem and licensing represent the fundamental business difference. RISC-V is open and royalty-free, enabling customization and eliminating licensing costs. ARM is proprietary with licensing fees but offers a mature, proven ecosystem. MIPS was commercial but opened too late, now declining. ARM dominates commercially, RISC-V is growing rapidly, MIPS is historical.
Future directions show active evolution. RISC-V is adding vector, crypto, and security extensions while standardizing platform profiles. ARM is advancing AI/ML, security (MTE), and confidential computing. Both are adding memory tagging and enhanced security features. Industry trends favor RISC-V (open, growing) and ARM (mature, dominant), while MIPS remains largely historical.
Together, these comparisons show RISC-V as a modern, modular, open alternative to ARM’s mature, comprehensive, commercial approach, with MIPS representing classic RISC principles now largely superseded. The choice depends on priorities: openness and customization favor RISC-V, ecosystem maturity favors ARM.
Appendix A. CSR Reference
Control and Status Register Quick Reference
💡 Usage Guide: This appendix is your “dashboard” during development. When you need to look up a CSR’s bit positions or operation methods, flip right here.
🛠️ Common CSR Quick Reference
mstatus (Machine Status) Bit Map
This is the most frequently used CSR, controlling interrupts, privilege modes, and other core functions.
RV64 layout (bit positions above, field names below):
63   62:38 37  36  35:34 33:32 31:23 22  21 20  19  18  17
SD   WPRI  MBE SBE SXL   UXL   WPRI  TSR TW TVM MXR SUM MPRV
16:15 14:13 12:11 10:9 8   7    6   5    4    3   2    1   0
XS    FS    MPP   VS   SPP MPIE UBE SPIE WPRI MIE WPRI SIE WPRI
Most frequently used fields: MIE (bit 3), MPIE (bit 7), MPP (bits 11-12).
Key Bit Descriptions:
| Bit | Name | Description |
|---|---|---|
| 3 | MIE | Machine Interrupt Enable (global interrupt switch) |
| 7 | MPIE | Previous MIE (MIE value before entering trap) |
| 11-12 | MPP | Previous Privilege (00=U, 01=S, 11=M) |
| 17 | MPRV | Modify Privilege (Load/Store use MPP privilege) |
Common Operation Snippets
# 1. Enable global interrupts
csrsi mstatus, 8 # Set MIE (bit 3); the mask fits the 5-bit immediate
# 2. Disable global interrupts
csrci mstatus, 8 # Clear MIE (bit 3)
# 3. Set next mode to S-mode (preparing for mret)
li t0, (3 << 11)
csrc mstatus, t0 # Clear MPP
li t0, (1 << 11)
csrs mstatus, t0 # Set MPP = 01 (S-mode)
# 4. Read current MPP value
csrr t0, mstatus
srli t0, t0, 11
andi t0, t0, 3 # t0 = MPP (0=U, 1=S, 3=M)
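When mstatus is manipulated from C (for example on a value already read with `csrr`), the same clear-then-set sequence for MPP can be written as pure functions on the register value. The macro and function names below are illustrative assumptions; the bit positions are those listed in the table above.

```c
#include <assert.h>
#include <stdint.h>

#define MSTATUS_MIE        (1ULL << 3)
#define MSTATUS_MPIE       (1ULL << 7)
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP_MASK   (3ULL << MSTATUS_MPP_SHIFT)

enum { PRV_U = 0, PRV_S = 1, PRV_M = 3 };

/* Extract the two-bit MPP field. */
static inline unsigned mstatus_get_mpp(uint64_t mstatus) {
    return (unsigned)((mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);
}

/* Return mstatus with MPP replaced; all other bits are preserved. */
static inline uint64_t mstatus_set_mpp(uint64_t mstatus, unsigned prv) {
    assert(prv == PRV_U || prv == PRV_S || prv == PRV_M);
    return (mstatus & ~MSTATUS_MPP_MASK) |
           ((uint64_t)prv << MSTATUS_MPP_SHIFT);
}
```

Clearing the field before setting it matters because MPP is two bits wide: OR-ing 01 into a field that already holds 11 would leave it at 11.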
mie / mip (Interrupt Enable/Pending) Reference
These two CSRs control interrupt enable and pending status.
Bit Position Reference:
┌─────────────────────────────────────────────────────────────┐
│ 11 │ 9 │ 7 │ 5 │ 3 │ 1 │ │
│ MEIE │ SEIE │ MTIE │ STIE │ MSIE │ SSIE │ │
│ M │ S │ M │ S │ M │ S │ │
│ Ext │ Ext │Timer │Timer │ Soft │ Soft │ │
└─────────────────────────────────────────────────────────────┘
Common Operations:
# Enable Machine Timer Interrupt
li t0, (1 << 7)
csrs mie, t0 # Set MTIE (mask too wide for csrsi)
# Enable Machine External Interrupt
li t0, (1 << 11)
csrs mie, t0 # Set MEIE
# Check if Timer Interrupt is Pending
csrr t0, mip
andi t0, t0, (1 << 7) # t0 != 0 means Timer interrupt pending
mcause (Machine Cause) Decode Table
When a trap occurs, mcause tells you the reason.
Interrupt (mcause[63] = 1):
| Code | Name | Description |
|---|---|---|
| 1 | Supervisor Software Interrupt | S-mode software interrupt |
| 3 | Machine Software Interrupt | M-mode software interrupt |
| 5 | Supervisor Timer Interrupt | S-mode timer interrupt |
| 7 | Machine Timer Interrupt | M-mode timer interrupt |
| 9 | Supervisor External Interrupt | S-mode external interrupt |
| 11 | Machine External Interrupt | M-mode external interrupt |
Exception (mcause[63] = 0):
| Code | Name | Description |
|---|---|---|
| 0 | Instruction Address Misaligned | Instruction address not aligned |
| 1 | Instruction Access Fault | Instruction access error |
| 2 | Illegal Instruction | Invalid instruction |
| 3 | Breakpoint | Breakpoint (ebreak) |
| 4 | Load Address Misaligned | Load address not aligned |
| 5 | Load Access Fault | Load access error |
| 6 | Store Address Misaligned | Store address not aligned |
| 7 | Store Access Fault | Store access error |
| 8 | Environment Call from U-mode | U-mode ecall |
| 9 | Environment Call from S-mode | S-mode ecall |
| 11 | Environment Call from M-mode | M-mode ecall |
| 12 | Instruction Page Fault | Instruction page fault |
| 13 | Load Page Fault | Load page fault |
| 15 | Store Page Fault | Store page fault |
Trap Handler Example:
void trap_handler() {
    uint64_t cause;
    asm volatile ("csrr %0, mcause" : "=r" (cause));
    if (cause & (1UL << 63)) {
        // Interrupt
        uint64_t code = cause & 0x7FF;
        switch (code) {
            case 7: handle_timer_interrupt(); break;
            case 11: handle_external_interrupt(); break;
        }
    } else {
        // Exception
        switch (cause) {
            case 2: handle_illegal_instruction(); break;
            case 7: handle_store_access_fault(); break;
            case 8: handle_ecall_from_umode(); break;
        }
    }
}
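The handler above hardcodes bit 63, which only holds on RV64; on RV32 the interrupt flag is bit 31. One way to keep the decode step independent of register width is sketched below. `mcause_info` and `mcause_decode` are hypothetical names, and the `xlen` parameter is an assumption of this example rather than something read from hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     is_interrupt;  /* top bit of mcause */
    uint64_t code;          /* remaining bits: interrupt or exception code */
} mcause_info;

/* Split an mcause value into its interrupt flag and cause code. */
static inline mcause_info mcause_decode(uint64_t mcause, unsigned xlen) {
    assert(xlen == 32 || xlen == 64);
    uint64_t irq_bit = 1ULL << (xlen - 1);
    mcause_info info;
    info.is_interrupt = (mcause & irq_bit) != 0;
    info.code = mcause & (irq_bit - 1);
    return info;
}
```

The resulting `code` can then be matched against the interrupt or exception table, depending on `is_interrupt`.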
This appendix provides a comprehensive reference for RISC-V Control and Status Registers (CSRs). CSRs control processor behavior, report status, and provide access to privileged functionality. Each CSR has a 12-bit address and is accessed using dedicated CSR instructions (CSRRW, CSRRS, CSRRC, and their immediate variants).
A.1 CSR Address Space Organization
CSR addresses are 12 bits, organized as follows:
Bits [11:10]: Read/Write Access
00 = Read/Write
01 = Read/Write
10 = Read/Write
11 = Read-Only
Bits [9:8]: Lowest Privilege Level
00 = User/Unprivileged
01 = Supervisor
10 = Hypervisor
11 = Machine
Bits [7:0]: Register Number
Access Rules:
- Accessing a CSR from insufficient privilege level causes an illegal instruction exception
- Writing to a read-only CSR (bits [11:10] = 11) causes an illegal instruction exception
- Unimplemented CSRs may read as zero or cause an exception (implementation-defined)
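The access rules can be checked mechanically from the address alone. The sketch below follows the privileged specification's convention that bits [11:10] equal to 11 mark a read-only CSR and bits [9:8] give the lowest privilege level allowed to access it; the `csr_addr_info` type and `csr_decode` function are illustrative names, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     read_only;  /* bits [11:10] == 0b11 */
    unsigned min_priv;   /* bits [9:8]: 0=U, 1=S, 2=H, 3=M */
    unsigned number;     /* bits [7:0] */
} csr_addr_info;

/* Decode the access and privilege fields of a 12-bit CSR address. */
static inline csr_addr_info csr_decode(uint16_t addr) {
    assert(addr < 0x1000);  /* CSR addresses are 12 bits wide */
    csr_addr_info info;
    info.read_only = ((addr >> 10) & 3u) == 3u;
    info.min_priv  = (addr >> 8) & 3u;
    info.number    = addr & 0xFFu;
    return info;
}
```

For instance, mvendorid (0xF11) decodes as read-only and machine-level, while cycle (0xC00) decodes as read-only and accessible from user level.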
A.2 Machine-Level CSRs (M-mode)
Machine Information Registers
| CSR | Address | R/W | Description |
|---|---|---|---|
| mvendorid | 0xF11 | RO | Vendor ID (JEDEC manufacturer ID) |
| marchid | 0xF12 | RO | Architecture ID (implementation-specific) |
| mimpid | 0xF13 | RO | Implementation ID (version number) |
| mhartid | 0xF14 | RO | Hardware thread ID (unique per hart) |
| mconfigptr | 0xF15 | RO | Pointer to configuration data structure |
Usage: These read-only CSRs identify the processor implementation. Software can use them to detect features, apply workarounds, or report system information.
Example:
csrr t0, mhartid # Read hart ID
csrr t1, mvendorid # Read vendor ID
Machine Trap Setup
| CSR | Address | R/W | Description |
|---|---|---|---|
| mstatus | 0x300 | RW | Machine status register |
| misa | 0x301 | RW | ISA and extensions (may be read-only) |
| medeleg | 0x302 | RW | Exception delegation to S-mode |
| mideleg | 0x303 | RW | Interrupt delegation to S-mode |
| mie | 0x304 | RW | Machine interrupt enable |
| mtvec | 0x305 | RW | Machine trap-handler base address |
| mcounteren | 0x306 | RW | Counter enable for S-mode |
| mstatush | 0x310 | RW | Additional machine status (RV32 only) |
Machine Trap Handling
| CSR | Address | R/W | Description |
|---|---|---|---|
| mscratch | 0x340 | RW | Scratch register for M-mode trap handlers |
| mepc | 0x341 | RW | Machine exception program counter |
| mcause | 0x342 | RW | Machine trap cause |
| mtval | 0x343 | RW | Machine bad address or instruction |
| mip | 0x344 | RW | Machine interrupt pending |
| mtinst | 0x34A | RW | Machine trap instruction (transformed) |
| mtval2 | 0x34B | RW | Machine bad guest physical address |
Machine Memory Protection
| CSR | Address | R/W | Description |
|---|---|---|---|
| pmpcfg0 | 0x3A0 | RW | PMP configuration register 0 |
| pmpcfg1 | 0x3A1 | RW | PMP configuration register 1 (RV32 only) |
| pmpcfg2 | 0x3A2 | RW | PMP configuration register 2 |
| pmpcfg3 | 0x3A3 | RW | PMP configuration register 3 (RV32 only) |
| pmpcfg4-15 | 0x3A4-0x3AF | RW | PMP configuration registers 4-15 |
| pmpaddr0-15 | 0x3B0-0x3BF | RW | PMP address registers 0-15 |
| pmpaddr16-63 | 0x3C0-0x3EF | RW | PMP address registers 16-63 |
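Note: RV32 uses all sixteen registers pmpcfg0-pmpcfg15, each holding 4 configuration bytes. RV64 uses only the even-numbered registers (pmpcfg0, pmpcfg2, pmpcfg4, etc.), each holding 8 configuration bytes; the odd-numbered registers do not exist on RV64.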
Machine Counters and Timers
| CSR | Address | R/W | Description |
|---|---|---|---|
| mcycle | 0xB00 | RW | Machine cycle counter (lower 32/64 bits) |
| minstret | 0xB02 | RW | Machine instructions retired counter |
| mhpmcounter3-31 | 0xB03-0xB1F | RW | Machine performance monitoring counters |
| mcycleh | 0xB80 | RW | Upper 32 bits of mcycle (RV32 only) |
| minstreth | 0xB82 | RW | Upper 32 bits of minstret (RV32 only) |
| mhpmcounter3h-31h | 0xB83-0xB9F | RW | Upper 32 bits of mhpmcounter (RV32 only) |
Machine Counter Setup
| CSR | Address | R/W | Description |
|---|---|---|---|
| mcountinhibit | 0x320 | RW | Machine counter-inhibit register |
| mhpmevent3-31 | 0x323-0x33F | RW | Machine performance monitoring event selectors |
Usage: mcountinhibit controls which counters are active. Setting bit N stops counter N from incrementing, saving power.
A.3 Supervisor-Level CSRs (S-mode)
Supervisor Trap Setup
| CSR | Address | R/W | Description |
|---|---|---|---|
| sstatus | 0x100 | RW | Supervisor status register (subset of mstatus) |
| sie | 0x104 | RW | Supervisor interrupt enable |
| stvec | 0x105 | RW | Supervisor trap-handler base address |
| scounteren | 0x106 | RW | Counter enable for U-mode |
Supervisor Trap Handling
| CSR | Address | R/W | Description |
|---|---|---|---|
| sscratch | 0x140 | RW | Scratch register for S-mode trap handlers |
| sepc | 0x141 | RW | Supervisor exception program counter |
| scause | 0x142 | RW | Supervisor trap cause |
| stval | 0x143 | RW | Supervisor bad address or instruction |
| sip | 0x144 | RW | Supervisor interrupt pending |
Supervisor Address Translation and Protection
| CSR | Address | R/W | Description |
|---|---|---|---|
| satp | 0x180 | RW | Supervisor address translation and protection |
satp Format (RV64):
Bits [63:60]: Mode (0=Bare, 8=Sv39, 9=Sv48, 10=Sv57)
Bits [59:44]: ASID (Address Space Identifier)
Bits [43:0]: PPN (Physical Page Number of root page table)
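A value for satp can be assembled from these three fields as follows. This is a sketch under the RV64 layout above; `satp_make` and the `SATP_MODE_*` constants are names invented for the example, and writing the result to the CSR (plus the `sfence.vma` that normally follows) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_BARE 0u
#define SATP_MODE_SV39 8u
#define SATP_MODE_SV48 9u
#define SATP_MODE_SV57 10u

/* Build an RV64 satp value from mode, ASID, and root page table address. */
static inline uint64_t satp_make(unsigned mode, unsigned asid, uint64_t root_pa) {
    assert(mode < 16);                 /* 4-bit field  */
    assert(asid < (1u << 16));         /* 16-bit field */
    assert((root_pa & 0xFFF) == 0);    /* root table must be 4 KB aligned */
    uint64_t ppn = root_pa >> 12;
    assert(ppn < (1ULL << 44));        /* 44-bit field */
    return ((uint64_t)mode << 60) | ((uint64_t)asid << 44) | ppn;
}

/* Read the mode field back out of a satp value. */
static inline unsigned satp_mode(uint64_t satp) {
    return (unsigned)(satp >> 60);
}
```

Note that satp stores a page number, not a byte address, so the root table's physical address is shifted right by 12 before being placed in the PPN field.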
A.4 User-Level CSRs (U-mode)
Floating-Point Control and Status
| CSR | Address | R/W | Description |
|---|---|---|---|
| fflags | 0x001 | RW | Floating-point accrued exceptions |
| frm | 0x002 | RW | Floating-point rounding mode |
| fcsr | 0x003 | RW | Floating-point control and status (fflags + frm) |
fflags Bits:
- Bit 0: NX (Inexact)
- Bit 1: UF (Underflow)
- Bit 2: OF (Overflow)
- Bit 3: DZ (Divide by Zero)
- Bit 4: NV (Invalid Operation)
frm Values:
- 0: RNE (Round to Nearest, ties to Even)
- 1: RTZ (Round towards Zero)
- 2: RDN (Round Down, towards -∞)
- 3: RUP (Round Up, towards +∞)
- 4: RMM (Round to Nearest, ties to Max Magnitude)
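Since fcsr is the concatenation of frm (bits [7:5]) and fflags (bits [4:0]), both fields can be recovered with a shift and a mask. The sketch below uses the F extension's flag order (NX in bit 0 up to NV in bit 4); the macro and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Accrued exception flags, fcsr bits [4:0]. */
#define FFLAGS_NX (1u << 0)  /* inexact */
#define FFLAGS_UF (1u << 1)  /* underflow */
#define FFLAGS_OF (1u << 2)  /* overflow */
#define FFLAGS_DZ (1u << 3)  /* divide by zero */
#define FFLAGS_NV (1u << 4)  /* invalid operation */

/* Rounding modes, fcsr bits [7:5]. */
enum { FRM_RNE = 0, FRM_RTZ = 1, FRM_RDN = 2, FRM_RUP = 3, FRM_RMM = 4 };

static inline unsigned fcsr_fflags(uint32_t fcsr) { return fcsr & 0x1Fu; }
static inline unsigned fcsr_frm(uint32_t fcsr)    { return (fcsr >> 5) & 0x7u; }

/* Combine a rounding mode and a flag set into an fcsr value. */
static inline uint32_t fcsr_make(unsigned frm, unsigned fflags) {
    assert(frm < 8 && fflags < 32);
    return ((uint32_t)frm << 5) | fflags;
}
```

A typical use is to read fcsr after a computation and test `fcsr_fflags(value) & FFLAGS_NV` to detect an invalid operation.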
User Counters and Timers
| CSR | Address | R/W | Description |
|---|---|---|---|
| cycle | 0xC00 | RO | Cycle counter (lower 32/64 bits) |
| time | 0xC01 | RO | Timer (lower 32/64 bits) |
| instret | 0xC02 | RO | Instructions retired counter |
| hpmcounter3-31 | 0xC03-0xC1F | RO | Performance monitoring counters |
| cycleh | 0xC80 | RO | Upper 32 bits of cycle (RV32 only) |
| timeh | 0xC81 | RO | Upper 32 bits of time (RV32 only) |
| instreth | 0xC82 | RO | Upper 32 bits of instret (RV32 only) |
| hpmcounter3h-31h | 0xC83-0xC9F | RO | Upper 32 bits of hpmcounter (RV32 only) |
Note: These are read-only shadows of the machine-level counters. Access can be disabled by mcounteren (for S-mode) or scounteren (for U-mode).
A.5 Debug CSRs
Debug CSRs are accessible only in Debug Mode (entered via debugger or trigger).
| CSR | Address | R/W | Description |
|---|---|---|---|
| dcsr | 0x7B0 | RW | Debug control and status register |
| dpc | 0x7B1 | RW | Debug program counter |
| dscratch0 | 0x7B2 | RW | Debug scratch register 0 |
| dscratch1 | 0x7B3 | RW | Debug scratch register 1 |
dcsr Bit Fields:
- Bits [31:28]: xdebugver (Debug specification version)
- Bits [8:6]: cause (Reason for entering debug mode)
- 1: ebreak instruction
- 2: Trigger module
- 3: Debugger halt request
- 4: Single step
- 5: Reset halt
- Bit [2]: step (Single-step mode enable)
- Bits [1:0]: prv (Privilege level before entering debug mode)
A.6 Trigger/Debug Module CSRs
| CSR | Address | R/W | Description |
|---|---|---|---|
| tselect | 0x7A0 | RW | Trigger select register |
| tdata1 | 0x7A1 | RW | Trigger data register 1 (type and config) |
| tdata2 | 0x7A2 | RW | Trigger data register 2 (match value) |
| tdata3 | 0x7A3 | RW | Trigger data register 3 (additional data) |
| tinfo | 0x7A4 | RO | Trigger info (supported types) |
| tcontrol | 0x7A5 | RW | Trigger control |
| mcontext | 0x7A8 | RW | Machine context register |
| scontext | 0x7AA | RW | Supervisor context register |
Usage: Triggers enable hardware breakpoints and watchpoints. tselect chooses which trigger to configure, tdata1-3 configure the selected trigger.
A.7 Key CSR Bit Fields
mstatus (Machine Status Register)
RV64 Format:
Bit 63: SD (State Dirty - summary of FS/XS)
Bits 34-35: SXL (S-mode XLEN)
Bits 32-33: UXL (U-mode XLEN)
Bit 22: TSR (Trap SRET)
Bit 21: TW (Timeout Wait - trap WFI)
Bit 20: TVM (Trap Virtual Memory - trap SATP writes)
Bit 19: MXR (Make eXecutable Readable)
Bit 18: SUM (permit Supervisor User Memory access)
Bit 17: MPRV (Modify PRiVilege)
Bits 15-16: XS (user eXtension State)
Bits 13-14: FS (Floating-point State)
Bits 11-12: MPP (Machine Previous Privilege)
Bit 8: SPP (Supervisor Previous Privilege)
Bit 7: MPIE (Machine Previous Interrupt Enable)
Bit 5: SPIE (Supervisor Previous Interrupt Enable)
Bit 3: MIE (Machine Interrupt Enable)
Bit 1: SIE (Supervisor Interrupt Enable)
FS/XS Values:
- 0: Off (all off)
- 1: Initial (none dirty or clean, some on)
- 2: Clean (none dirty, some clean)
- 3: Dirty (some dirty)
MPP/SPP Values:
- 0: User mode
- 1: Supervisor mode
- 3: Machine mode (MPP only; SPP is a single bit and holds 0 or 1)
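The "previous" fields are easiest to understand by modeling what hardware does to mstatus on a trap into M-mode: MIE is copied into MPIE, MIE is cleared, and the privilege level before the trap is recorded in MPP. The following is a minimal C sketch of that update for the RV64 layout above; the macro and function names are illustrative assumptions, not an existing API.

```c
#include <stdint.h>

#define MSTATUS_MIE        (1ULL << 3)
#define MSTATUS_MPIE       (1ULL << 7)
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP_MASK   (3ULL << MSTATUS_MPP_SHIFT)
#define MSTATUS_FS_SHIFT   13
#define MSTATUS_FS_MASK    (3ULL << MSTATUS_FS_SHIFT)

enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Extract the 2-bit MPP field */
static unsigned mstatus_mpp(uint64_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);
}

/* Extract the 2-bit FS field (0=Off, 1=Initial, 2=Clean, 3=Dirty) */
static unsigned mstatus_fs(uint64_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_FS_MASK) >> MSTATUS_FS_SHIFT);
}

/* Model of the mstatus update performed on a trap taken into M-mode */
static uint64_t mstatus_on_trap(uint64_t mstatus, unsigned prev_priv)
{
    int mie_was_set = (mstatus & MSTATUS_MIE) != 0;

    mstatus &= ~(MSTATUS_MIE | MSTATUS_MPIE | MSTATUS_MPP_MASK);
    if (mie_was_set)
        mstatus |= MSTATUS_MPIE;                                  /* MPIE <- MIE */
    mstatus |= ((uint64_t)(prev_priv & 3u)) << MSTATUS_MPP_SHIFT; /* MPP <- old privilege */
    return mstatus;                                               /* MIE stays 0 */
}
```

For example, a trap from S-mode with interrupts enabled turns `MSTATUS_MIE` into `MSTATUS_MPIE` with MPP = 1; `mret` later reverses the copy.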
mtvec (Machine Trap Vector)
Format:
Bits [XLEN-1:2]: BASE (trap handler base address, 4-byte aligned)
Bits [1:0]: MODE
0 = Direct (all traps to BASE)
1 = Vectored (interrupts to BASE + 4*cause, exceptions to BASE)
Example:
la t0, trap_handler
csrw mtvec, t0 # Direct mode (MODE=0)
la t0, trap_handler
ori t0, t0, 1 # Set MODE=1
csrw mtvec, t0 # Vectored mode
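To check a vector table layout on paper, the target address can be computed from mtvec and mcause. This is a small C sketch of the rule stated above (RV64 assumed, so the interrupt flag is bit 63); `trap_target` is a hypothetical helper name.

```c
#include <stdint.h>

/* Compute the address the hart jumps to for a given mtvec and mcause (RV64) */
static uint64_t trap_target(uint64_t mtvec, uint64_t mcause)
{
    uint64_t base = mtvec & ~0x3ULL;          /* BASE: bits [XLEN-1:2] */
    uint64_t mode = mtvec & 0x3ULL;           /* MODE: bits [1:0] */
    int is_interrupt = (int)(mcause >> 63);   /* bit XLEN-1 */
    uint64_t code = mcause & ~(1ULL << 63);

    if (mode == 1 && is_interrupt)
        return base + 4 * code;               /* vectored: one 4-byte slot per cause */
    return base;                              /* direct mode, or any exception */
}
```

With BASE = 0x80000100 in vectored mode, a machine timer interrupt (code 7) lands at 0x8000011C, while an illegal instruction exception still lands at 0x80000100.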
mcause (Machine Cause Register)
Format:
- Bit [XLEN-1]: Interrupt (1=interrupt, 0=exception)
- Bits [XLEN-2:0]: Exception Code
Exception Codes (Interrupt=0):
| Code | Exception |
|---|---|
| 0 | Instruction address misaligned |
| 1 | Instruction access fault |
| 2 | Illegal instruction |
| 3 | Breakpoint |
| 4 | Load address misaligned |
| 5 | Load access fault |
| 6 | Store/AMO address misaligned |
| 7 | Store/AMO access fault |
| 8 | Environment call from U-mode |
| 9 | Environment call from S-mode |
| 11 | Environment call from M-mode |
| 12 | Instruction page fault |
| 13 | Load page fault |
| 15 | Store/AMO page fault |
Interrupt Codes (Interrupt=1):
| Code | Interrupt |
|---|---|
| 0 | Reserved (user software interrupt in the withdrawn N extension) |
| 1 | Supervisor software interrupt |
| 3 | Machine software interrupt |
| 4 | Reserved (user timer interrupt in the withdrawn N extension) |
| 5 | Supervisor timer interrupt |
| 7 | Machine timer interrupt |
| 8 | Reserved (user external interrupt in the withdrawn N extension) |
| 9 | Supervisor external interrupt |
| 11 | Machine external interrupt |
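A trap handler written in C usually starts by splitting mcause into its two fields and dispatching on the code. The sketch below assumes RV64 and covers only the synchronous exception codes from the table; the names `mcause_is_interrupt`, `mcause_code`, and `exception_name` are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define MCAUSE_INTERRUPT (1ULL << 63)   /* RV64: bit XLEN-1 */

/* Names for exception codes 0..15; reserved codes stay NULL */
static const char *const exception_names[16] = {
    [0]  = "Instruction address misaligned",
    [1]  = "Instruction access fault",
    [2]  = "Illegal instruction",
    [3]  = "Breakpoint",
    [4]  = "Load address misaligned",
    [5]  = "Load access fault",
    [6]  = "Store/AMO address misaligned",
    [7]  = "Store/AMO access fault",
    [8]  = "Environment call from U-mode",
    [9]  = "Environment call from S-mode",
    [11] = "Environment call from M-mode",
    [12] = "Instruction page fault",
    [13] = "Load page fault",
    [15] = "Store/AMO page fault",
};

static int mcause_is_interrupt(uint64_t mcause)
{
    return (mcause & MCAUSE_INTERRUPT) != 0;
}

static uint64_t mcause_code(uint64_t mcause)
{
    return mcause & ~MCAUSE_INTERRUPT;
}

/* Returns the exception name, or NULL for interrupts and reserved codes */
static const char *exception_name(uint64_t mcause)
{
    uint64_t code = mcause_code(mcause);

    if (mcause_is_interrupt(mcause) || code >= 16)
        return NULL;
    return exception_names[code];
}
```

A handler would typically test `mcause_is_interrupt` first, then switch on `mcause_code` in separate interrupt and exception paths.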
satp (Supervisor Address Translation and Protection)
RV64 Format:
Bits [63:60]: MODE
0 = Bare (no translation)
8 = Sv39 (39-bit virtual address)
9 = Sv48 (48-bit virtual address)
10 = Sv57 (57-bit virtual address)
Bits [59:44]: ASID (Address Space Identifier, 16 bits)
Bits [43:0]: PPN (Physical Page Number of root page table, 44 bits)
RV32 Format:
Bit [31]: MODE (0=Bare, 1=Sv32)
Bits [30:22]: ASID (9 bits)
Bits [21:0]: PPN (22 bits)
Example:
# Switch to Sv39 mode with ASID=1, root page table at 0x80200000
li t0, 0x8000100000080200 # MODE=8, ASID=1, PPN=0x80200 (0x80200000 >> 12)
csrw satp, t0
sfence.vma # Flush TLB
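Hand-assembling a satp constant is error-prone, so kernels usually build it from its fields. The following C sketch packs the RV64 layout shown above; `make_satp` and the `SATP_*` names are hypothetical, and the root page table is assumed to be 4 KiB aligned.

```c
#include <stdint.h>

#define SATP_MODE_SHIFT 60
#define SATP_ASID_SHIFT 44
#define SATP_ASID_MASK  0xFFFFULL          /* 16 bits */
#define SATP_PPN_MASK   0xFFFFFFFFFFFULL   /* 44 bits */

enum {
    SATP_MODE_BARE = 0,
    SATP_MODE_SV39 = 8,
    SATP_MODE_SV48 = 9,
    SATP_MODE_SV57 = 10
};

/* Pack translation mode, ASID, and root page table physical address into satp */
static uint64_t make_satp(uint64_t mode, uint64_t asid, uint64_t root_pa)
{
    uint64_t ppn = root_pa >> 12;   /* PPN = physical address / 4096 */

    return (mode << SATP_MODE_SHIFT)
         | ((asid & SATP_ASID_MASK) << SATP_ASID_SHIFT)
         | (ppn & SATP_PPN_MASK);
}
```

For Sv39 with ASID 1 and a root table at 0x80200000, the helper yields 0x8000100000080200; with ASID 0 it yields 0x8000000000080200.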
A.8 CSR Instructions Quick Reference
| Instruction | Format | Operation |
|---|---|---|
| CSRRW | csrrw rd, csr, rs1 | t = CSR; CSR = rs1; rd = t |
| CSRRS | csrrs rd, csr, rs1 | t = CSR; CSR = t \| rs1; rd = t |
| CSRRC | csrrc rd, csr, rs1 | t = CSR; CSR = t & ~rs1; rd = t |
| CSRRWI | csrrwi rd, csr, imm | t = CSR; CSR = imm; rd = t |
| CSRRSI | csrrsi rd, csr, imm | t = CSR; CSR = t \| imm; rd = t |
| CSRRCI | csrrci rd, csr, imm | t = CSR; CSR = t & ~imm; rd = t |
Pseudo-instructions:
csrr rd, csr # Read CSR (csrrs rd, csr, x0)
csrw csr, rs1 # Write CSR (csrrw x0, csr, rs1)
csrs csr, rs1 # Set bits (csrrs x0, csr, rs1)
csrc csr, rs1 # Clear bits (csrrc x0, csr, rs1)
csrwi csr, imm # Write immediate (csrrwi x0, csr, imm)
csrsi csr, imm # Set bits immediate (csrrsi x0, csr, imm)
csrci csr, imm # Clear bits immediate (csrrci x0, csr, imm)
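The read-modify-write semantics in the table can be modeled in a few lines of C, which is handy when writing a simulator or reasoning about what ends up in rd. This is a simplified sketch: `csr_exec` and `csr_op_t` are made-up names, and side-effect rules (such as csrrs with rs1 = x0 not writing the CSR at all) are not modeled.

```c
#include <stdint.h>

typedef enum { CSR_OP_RW, CSR_OP_RS, CSR_OP_RC } csr_op_t;

/*
 * Apply one CSR instruction to *csr using source operand src
 * (rs1 or the zero-extended immediate). Returns the old CSR
 * value, which is what the instruction writes to rd.
 */
static uint64_t csr_exec(uint64_t *csr, csr_op_t op, uint64_t src)
{
    uint64_t old = *csr;

    switch (op) {
    case CSR_OP_RW: *csr = src;        break;  /* csrrw / csrrwi */
    case CSR_OP_RS: *csr = old | src;  break;  /* csrrs / csrrsi */
    case CSR_OP_RC: *csr = old & ~src; break;  /* csrrc / csrrci */
    }
    return old;
}
```

Each pseudo-instruction is one of these operations with either rd or the source operand tied to x0.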
A.9 Common CSR Usage Patterns
Enable Machine-Mode Interrupts
# Enable machine timer and external interrupts
li t0, 0x880 # MTIE (bit 7) + MEIE (bit 11)
csrs mie, t0 # Set bits in mie
# Enable global interrupts
li t0, 0x8 # MIE (bit 3)
csrs mstatus, t0 # Set MIE in mstatus
Trap Handler Entry
trap_handler:
# Save context
csrrw sp, mscratch, sp # Swap sp with mscratch
# Save registers on stack
addi sp, sp, -32*8
sd x1, 0(sp)
sd x2, 8(sp)
# ... save all registers ...
# Read trap cause
csrr t0, mcause
csrr t1, mepc
csrr t2, mtval
# Handle trap...
Context Switch (Change satp)
# Switch to new process page table
# a0 = new satp value
csrw satp, a0
sfence.vma # Flush TLB
Disable Interrupts for Critical Section
# Save and disable interrupts
csrrci t0, mstatus, 0x8 # Clear MIE, save old mstatus
# Critical section...
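# Restore interrupts (only MIE, so other mstatus fields changed meanwhile are kept)
andi t0, t0, 0x8 # Keep only the saved MIE bit
csrs mstatus, t0 # Re-enable interrupts if they were enabled before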
A.10 CSR Access Permissions
Privilege Level Check:
- CSR address bits [9:8] encode minimum privilege level (00=U, 01=S, 10=H/VS, 11=M)
- Accessing CSR from lower privilege → illegal instruction exception
Read-Only Check:
- CSR address bits [11:10] = 11 → read-only
- Writing to read-only CSR → illegal instruction exception
Implementation-Defined Behavior:
- Unimplemented CSRs may:
- Read as zero, writes ignored (typical for hardwired WARL fields; WARL = Write Any values, Reads Legal values)
- Cause illegal instruction exception
- Implementation must document behavior
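The address-based checks can be expressed directly in C. The sketch below assumes the standard CSR address convention (bits [9:8] give the lowest privilege level that may access the CSR, and bits [11:10] equal to 11 mark it read-only). The function names are illustrative, and additional gates such as mcounteren, mstatus.TVM, or Debug Mode are deliberately not modeled.

```c
#include <stdint.h>

/* Lowest privilege level allowed to access the CSR: 0=U, 1=S, 3=M */
static unsigned csr_min_priv(uint16_t addr)
{
    return (addr >> 8) & 0x3u;
}

/* CSRs whose address bits [11:10] are 0b11 are read-only */
static int csr_is_read_only(uint16_t addr)
{
    return ((addr >> 10) & 0x3u) == 0x3u;
}

/* Returns 1 if the access passes both address-encoded checks, else 0 */
static int csr_access_ok(uint16_t addr, unsigned cur_priv, int is_write)
{
    if (cur_priv < csr_min_priv(addr))
        return 0;   /* would raise illegal instruction */
    if (is_write && csr_is_read_only(addr))
        return 0;   /* would raise illegal instruction */
    return 1;
}
```

For instance, satp (0x180) decodes to S-mode minimum privilege, and cycle (0xC00) decodes to user-accessible but read-only.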
A.11 References
- RISC-V Privileged Specification: Complete CSR definitions and bit fields
- RISC-V Debug Specification: Debug CSRs (dcsr, dpc, tselect, tdata)
- RISC-V ISA Manual: CSR instructions and access rules
Appendix B. Extension Reference
RISC-V ISA Extensions Quick Reference
💡 Usage Guide: This appendix is your “menu” during project planning. When deciding which extensions your project needs, reference the decision guide here.
🧩 Extension Selection Guide (Decision Guide)
Quick Decision Table
| Extension | Full Name | When to Use? | Dependencies | Recommendation |
|---|---|---|---|---|
| M | Multiply/Divide | Almost all projects need it | None | ✅ Strongly recommended |
| A | Atomic | Multi-core, OS, Lock-free | None | ✅ Required for OS |
| F | Single Float | Floating-point (games, scientific) | Zicsr | As needed |
| D | Double Float | High-precision floating-point | F | As needed |
| C | Compressed | Reduce code size 25-30% | None | ✅ Strongly recommended |
| V | Vector | AI/DSP/Matrix operations | D | Required for HPC |
| Zba | Address Gen | Heavy array access a[i*4] | None | Performance optimization |
| Zbb | Bit Manipulation | Bit operations (popcount, clz) | None | Performance optimization |
| Zbs | Single-bit | Single-bit operations | None | Performance optimization |
| Zicsr | CSR Access | CSR access (separated from I) | None | ✅ Required for system code |
| Zifencei | Fence.I | Instruction cache sync (JIT) | None | Required for self-modifying code |
Common Combinations (Profiles)
Minimal Embedded: RV32IMC (Multiply + Compressed)
Standard Embedded: RV32IMAC (+ Atomic operations)
Application Proc: RV64IMAFDC (= RV64GC, full general-purpose)
High-Performance: RV64GCV (+ Vector)
Dependency Graph
      ┌─────┐
      │  I  │ (Base ISA)
      └──┬──┘
         │
  ┌──────┼──────┬──────┬──────┐
  ▼      ▼      ▼      ▼      ▼
┌───┐  ┌───┐  ┌───┐  ┌───┐ ┌─────┐
│ M │  │ A │  │ C │  │ F │ │Zicsr│
└───┘  └───┘  └───┘  └─┬─┘ └─────┘
                       │
                       ▼
                     ┌───┐
                     │ D │
                     └─┬─┘
                       │
                       ▼
                     ┌───┐
                     │ V │
                     └───┘
⚠️ Common Pitfalls
Pitfall 1: Thinking G Includes C
Misconception: RV64G includes compressed instructions.
Truth: G = IMAFD (plus Zicsr and Zifencei) and does NOT include C. For compressed instructions, explicitly write RV64GC.
# ❌ Wrong: Assuming G has compression
riscv64-unknown-elf-gcc -march=rv64g ...
# ✅ Correct: Explicitly add C
riscv64-unknown-elf-gcc -march=rv64gc ...
Pitfall 2: Forgetting Zicsr and Zifencei
Background: Starting with version 2.1 of the base I instruction set, CSR instructions and FENCE.I were moved out of I into the separate Zicsr and Zifencei extensions.
Impact: Some toolchains require explicit specification.
# If compiler complains about missing csrr/csrw
riscv64-unknown-elf-gcc -march=rv64gc_zicsr_zifencei ...
Pitfall 3: misa Can Only Be Read in M-mode
Error Scenario: Trying to read misa to detect extensions in S-mode or U-mode.
Solution: Use SBI query, or record during M-mode initialization.
// ❌ In S-mode, this triggers Illegal Instruction
uint64_t misa;
asm volatile ("csrr %0, misa" : "=r" (misa));
// ✅ Query via SBI or Device Tree
// Or save misa to global variable during M-mode boot
This appendix provides a comprehensive reference for RISC-V ISA extensions. RISC-V’s modular design allows implementations to include only the extensions they need, from minimal embedded systems to high-performance application processors.
B.1 Base ISAs
| Base ISA | Description | Register Width | Address Space |
|---|---|---|---|
| RV32I | 32-bit integer base | 32 bits | 32-bit (4 GB) |
| RV64I | 64-bit integer base | 64 bits | 64-bit (16 EB) |
| RV128I | 128-bit integer base (future) | 128 bits | 128-bit |
| RV32E | Embedded variant (16 registers) | 32 bits | 32-bit (4 GB) |
RV32I: The base 32-bit integer instruction set. Includes 32 general-purpose registers (x0-x31), integer arithmetic, logical operations, loads/stores, branches, and jumps. Sufficient for simple embedded systems.
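RV64I: Extends RV32I to 64-bit. Registers are 64 bits wide and the base arithmetic instructions operate on all 64 bits. Adds 32-bit "word" arithmetic operations (ADDW, SUBW, etc.) that compute on the lower 32 bits and sign-extend the result, plus 64-bit loads/stores (LD, SD). Used for application processors and servers.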
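The behavior of the RV64I word instructions is easy to misread, so here is a small C model of ADDW: the addition wraps at 32 bits and bit 31 of the result is sign-extended into the upper half of the register. The function name `addw` is just a label for this sketch, and the final cast relies on the two's-complement conversion that mainstream compilers provide.

```c
#include <stdint.h>

/* Model of RV64I ADDW: 32-bit add, result sign-extended to 64 bits */
static int64_t addw(int64_t rs1, int64_t rs2)
{
    uint32_t sum = (uint32_t)rs1 + (uint32_t)rs2;  /* upper 32 input bits ignored, wraps mod 2^32 */
    return (int64_t)(int32_t)sum;                  /* sign-extend from bit 31 */
}
```

This is why a C `int` held in a 64-bit register stays in canonical sign-extended form after 32-bit arithmetic.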
RV32E: Reduced version of RV32I with only 16 registers (x0-x15). Designed for ultra-low-cost embedded systems where area is critical. Reduces register file size by 50%.
B.2 Standard Extensions
M Extension: Integer Multiplication and Division
Status: Ratified
Description: Adds integer multiply, divide, and remainder instructions.
| Instruction | Description |
|---|---|
| MUL | Multiply (lower XLEN bits) |
| MULH | Multiply high (signed × signed) |
| MULHSU | Multiply high (signed × unsigned) |
| MULHU | Multiply high (unsigned × unsigned) |
| DIV | Divide (signed) |
| DIVU | Divide (unsigned) |
| REM | Remainder (signed) |
| REMU | Remainder (unsigned) |
RV64 Additions: MULW, DIVW, DIVUW, REMW, REMUW (32-bit variants)
Usage: Essential for most applications. Division is expensive in hardware, so minimal systems may omit M extension and use software division.
A Extension: Atomic Instructions
Status: Ratified
Description: Adds atomic memory operations for synchronization.
Load-Reserved/Store-Conditional:
| Instruction | Description |
|---|---|
| LR.W/D | Load-Reserved Word/Doubleword |
| SC.W/D | Store-Conditional Word/Doubleword |
Atomic Memory Operations (AMO):
| Instruction | Description |
|---|---|
| AMOSWAP.W/D | Atomic swap |
| AMOADD.W/D | Atomic add |
| AMOAND.W/D | Atomic AND |
| AMOOR.W/D | Atomic OR |
| AMOXOR.W/D | Atomic XOR |
| AMOMAX.W/D | Atomic maximum (signed) |
| AMOMAXU.W/D | Atomic maximum (unsigned) |
| AMOMIN.W/D | Atomic minimum (signed) |
| AMOMINU.W/D | Atomic minimum (unsigned) |
Ordering Annotations: .aq (acquire), .rl (release), .aqrl (both)
Usage: Required for multi-core systems and lock-free algorithms.
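In C, these instructions are normally reached through C11 atomics instead of hand-written assembly. The sketch below shows a simple spinlock and a shared counter; compilers targeting the A extension generally lower such operations to AMO or LR/SC instructions with matching ordering annotations, although the exact instruction selection is toolchain-dependent. The type and function names are illustrative.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} spinlock_t;

/* Acquire: atomically swap in 1 until the previous value was 0 */
static void spin_lock(spinlock_t *lock)
{
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Release: store 0 with release ordering */
static void spin_unlock(spinlock_t *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

static atomic_int event_count;

/* Atomically increment the shared counter and return the new value */
static int count_event(void)
{
    return atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed) + 1;
}
```

The acquire and release orderings here correspond to the .aq and .rl annotations listed above.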
F Extension: Single-Precision Floating-Point
Status: Ratified
Description: Adds 32 floating-point registers (f0-f31) and single-precision (32-bit) floating-point operations.
Registers: 32 × 32-bit floating-point registers (f0-f31)
Instructions:
- Arithmetic: FADD.S, FSUB.S, FMUL.S, FDIV.S, FSQRT.S
- Fused Multiply-Add: FMADD.S, FMSUB.S, FNMADD.S, FNMSUB.S
- Comparison: FEQ.S, FLT.S, FLE.S
- Conversion: FCVT.W.S, FCVT.S.W, FCVT.L.S, FCVT.S.L
- Move: FMV.X.W, FMV.W.X
- Load/Store: FLW, FSW
- Sign Injection: FSGNJ.S, FSGNJN.S, FSGNJX.S
- Min/Max: FMIN.S, FMAX.S
- Classification: FCLASS.S
CSRs: fflags, frm, fcsr (floating-point control and status)
D Extension: Double-Precision Floating-Point
Status: Ratified
Description: Extends F extension to support double-precision (64-bit) floating-point.
Requires: F extension
Registers: Extends f0-f31 to 64 bits each
Instructions: Same as F extension but with .D suffix (FADD.D, FMUL.D, etc.)
Additional Conversions: FCVT.S.D, FCVT.D.S (convert between single and double)
Load/Store: FLD, FSD (64-bit loads/stores)
C Extension: Compressed Instructions
Status: Ratified
Description: Adds 16-bit compressed instructions to reduce code size.
Encoding: 16-bit instructions (bits [1:0] ≠ 11) intermixed with 32-bit instructions
Instruction Categories:
- Loads/Stores: C.LW, C.LD, C.SW, C.SD, C.LWSP, C.LDSP, C.SWSP, C.SDSP
- Arithmetic: C.ADDI, C.ADDIW, C.ADDI16SP, C.ADDI4SPN, C.LI, C.LUI
- Logical: C.ANDI, C.SLLI, C.SRLI, C.SRAI
- Branches: C.BEQZ, C.BNEZ
- Jumps: C.J, C.JAL, C.JR, C.JALR
- Register Move: C.MV, C.ADD
- Special: C.NOP, C.EBREAK
Code Size Reduction: Typically 25-30% smaller code compared to RV32I/RV64I alone
Usage: Highly recommended for all systems. Minimal hardware cost, significant code density improvement.
V Extension: Vector Operations
Status: Ratified (v1.0)
Description: Adds vector processing with variable-length vectors.
Registers: 32 vector registers (v0-v31), each with configurable element width and length
CSRs:
- vtype: Vector type (element width, LMUL)
- vl: Vector length
- vstart: Vector start index (for resuming after exception)
- vxrm: Vector fixed-point rounding mode
- vxsat: Vector fixed-point saturation flag
Configuration: vsetvli, vsetivli, vsetvl (set vector length and type)
Instruction Categories:
- Arithmetic: VADD, VSUB, VMUL, VDIV, VREM
- Logical: VAND, VOR, VXOR
- Shift: VSLL, VSRL, VSRA
- Comparison: VMSEQ, VMSNE, VMSLT, VMSLE, VMSGTU
- Load/Store: VLE, VSE (unit-stride), VLSE, VSSE (strided), VLUXEI/VLOXEI, VSUXEI/VSOXEI (indexed, unordered/ordered)
- Reduction: VREDSUM, VREDMAX, VREDMIN
- Mask: VMAND, VMOR, VMXOR, VMNOT
- Permutation: VSLIDEUP, VSLIDEDOWN, VRGATHER
- Floating-Point: VFADD, VFMUL, VFDIV, VFSQRT, VFMADD
Usage: High-performance computing, DSP, machine learning
B.3 Bit Manipulation Extensions
Zba: Address Generation
Status: Ratified
Description: Instructions for address calculation.
| Instruction | Description |
|---|---|
| SH1ADD | Shift left by 1 and add (rs1 << 1) + rs2 |
| SH2ADD | Shift left by 2 and add (rs1 << 2) + rs2 |
| SH3ADD | Shift left by 3 and add (rs1 << 3) + rs2 |
Usage: Efficient array indexing (e.g., a[i] where elements are 2, 4, or 8 bytes)
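As a quick illustration, the three instructions correspond to the following C expressions, where rs1 is the element index and rs2 is the array base address. The function names simply mirror the mnemonics; this is a behavioral sketch, not an intrinsic API.

```c
#include <stdint.h>

/* rs1 = index, rs2 = base address; result = address of element */
static uint64_t sh1add(uint64_t rs1, uint64_t rs2) { return (rs1 << 1) + rs2; }  /* 2-byte elements */
static uint64_t sh2add(uint64_t rs1, uint64_t rs2) { return (rs1 << 2) + rs2; }  /* 4-byte elements */
static uint64_t sh3add(uint64_t rs1, uint64_t rs2) { return (rs1 << 3) + rs2; }  /* 8-byte elements */
```

Without Zba, each of these takes a separate shift and add.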
Zbb: Basic Bit Manipulation
Status: Ratified
Description: Common bit manipulation operations.
| Instruction | Description |
|---|---|
| ANDN | AND with inverted operand |
| ORN | OR with inverted operand |
| XNOR | XOR with inverted operand |
| CLZ | Count leading zeros |
| CTZ | Count trailing zeros |
| CPOP | Count population (number of 1 bits) |
| MAX | Maximum (signed) |
| MAXU | Maximum (unsigned) |
| MIN | Minimum (signed) |
| MINU | Minimum (unsigned) |
| SEXT.B | Sign-extend byte |
| SEXT.H | Sign-extend halfword |
| ZEXT.H | Zero-extend halfword |
| ROL | Rotate left |
| ROR | Rotate right |
| RORI | Rotate right immediate |
| ORC.B | OR-combine bytes |
| REV8 | Byte-reverse (endian swap) |
Usage: Cryptography, compression, bit-field manipulation
Zbc: Carry-Less Multiplication
Status: Ratified
Description: Carry-less multiplication for cryptography.
| Instruction | Description |
|---|---|
| CLMUL | Carry-less multiply (lower half) |
| CLMULH | Carry-less multiply (upper half) |
| CLMULR | Carry-less multiply (reversed) |
Usage: AES-GCM, CRC calculation
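Carry-less multiplication is ordinary shift-and-add multiplication with XOR in place of addition. A portable C reference for the lower and upper halves (XLEN = 64 assumed) might look like the following; it is meant for checking results, not as a fast implementation.

```c
#include <stdint.h>

/* Lower 64 bits of the carry-less product (CLMUL) */
static uint64_t clmul(uint64_t rs1, uint64_t rs2)
{
    uint64_t result = 0;

    for (int i = 0; i < 64; i++)
        if ((rs2 >> i) & 1)
            result ^= rs1 << i;          /* XOR in place of add: no carries */
    return result;
}

/* Upper 64 bits of the carry-less product (CLMULH) */
static uint64_t clmulh(uint64_t rs1, uint64_t rs2)
{
    uint64_t result = 0;

    for (int i = 1; i < 64; i++)
        if ((rs2 >> i) & 1)
            result ^= rs1 >> (64 - i);   /* bits shifted out of the low half */
    return result;
}
```

For example, 0b11 times 0b11 is 0b101 in carry-less arithmetic, whereas the ordinary integer product is 0b1001.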
Zbs: Single-Bit Instructions
Status: Ratified
Description: Single-bit set, clear, invert, extract.
| Instruction | Description |
|---|---|
| BCLR | Bit clear |
| BCLRI | Bit clear immediate |
| BEXT | Bit extract |
| BEXTI | Bit extract immediate |
| BINV | Bit invert |
| BINVI | Bit invert immediate |
| BSET | Bit set |
| BSETI | Bit set immediate |
Usage: Bit-field manipulation, flag management
B.4 Compressed Extensions (Zc*)
Zcb: Code Size Reduction (16-bit)
Status: Ratified
Description: Additional 16-bit instructions for code density.
Instructions: C.LBU, C.LHU, C.LH, C.SB, C.SH, C.ZEXT.B, C.SEXT.B, C.SEXT.H, C.ZEXT.H, C.ZEXT.W, C.NOT, C.MUL
Usage: Further code size reduction beyond C extension
Zcmp: Push/Pop and Move
Status: Ratified
Description: Push/pop multiple registers, double move.
Instructions:
- CM.PUSH: Push registers to stack
- CM.POP: Pop registers from stack
- CM.POPRET: Pop and return
- CM.POPRETZ: Pop, return, and zero a0
- CM.MVA01S: Move two registers to a0/a1
- CM.MVSA01: Move a0/a1 to two registers
Usage: Function prologue/epilogue optimization
Zcmt: Table Jump
Status: Ratified
Description: Indirect jump via table for switch statements.
Instructions: CM.JT, CM.JALT (jump via table)
Usage: Efficient switch/case implementation
B.5 Cache Management Extensions
Zicbom: Cache Block Management
Status: Ratified
Description: Cache block clean, flush, and invalidate.
Instructions:
- CBO.CLEAN: Clean cache block (write back if dirty)
- CBO.FLUSH: Flush cache block (write back and invalidate)
- CBO.INVAL: Invalidate cache block
Usage: DMA coherence, cache maintenance
Zicbop: Cache Block Prefetch
Status: Ratified
Description: Prefetch hints for cache optimization.
Instructions:
- PREFETCH.R: Prefetch for read
- PREFETCH.W: Prefetch for write
- PREFETCH.I: Prefetch for instruction
Usage: Performance optimization, prefetching
Zicboz: Cache Block Zero
Status: Ratified
Description: Zero a cache block efficiently.
Instructions: CBO.ZERO (zero cache block)
Usage: Fast memory initialization
B.6 Privileged Extensions
Zicsr: CSR Instructions
Status: Ratified (originally part of base I, now a separate extension)
Description: Control and Status Register access instructions.
Instructions: CSRRW, CSRRS, CSRRC, CSRRWI, CSRRSI, CSRRCI
Usage: Required for privileged software (OS, firmware)
Zifencei: Instruction Fetch Fence
Status: Ratified (originally part of base I, now a separate extension)
Description: Synchronize instruction and data caches.
Instructions: FENCE.I
Usage: Self-modifying code, JIT compilation, code loading
Zihintpause: Pause Hint
Status: Ratified
Description: Hint for spin-wait loops.
Instructions: PAUSE (encoded as FENCE with specific operands)
Usage: Reduce power in spin-locks
B.7 Hypervisor Extension (H)
Status: Ratified
Description: Support for virtualization.
Features:
- Two-stage address translation (VS-stage and G-stage)
- Virtual supervisor mode (VS-mode)
- Hypervisor CSRs (hstatus, hedeleg, hideleg, hgatp, etc.)
- Virtual interrupt management
Instructions: HLV, HSV (hypervisor load/store), HFENCE.VVMA, HFENCE.GVMA
Usage: Virtual machines, hypervisors (KVM, Xen)
B.8 Cryptography Extensions
Zk: Scalar Cryptography
Status: Ratified
Description: Cryptographic instructions for AES, SHA, SM3, SM4.
Sub-extensions:
- Zkn: NIST algorithms (AES, SHA-256, SHA-512)
- Zks: ShangMi algorithms (SM3, SM4)
- Zbkb, Zbkc, Zbkx: Bit manipulation for crypto
- Zkr: Entropy source (seed CSR)
Instructions:
- AES: AES32ESI, AES32ESMI, AES32DSI, AES32DSMI, AES64ES, AES64DS, etc.
- SHA-256: SHA256SIG0, SHA256SIG1, SHA256SUM0, SHA256SUM1
- SHA-512: SHA512SIG0, SHA512SIG1, SHA512SUM0, SHA512SUM1
- SM3: SM3P0, SM3P1
- SM4: SM4ED, SM4KS
Usage: Secure boot, TLS, disk encryption
B.9 Extension Combinations
Common Combinations
| Combination | Name | Description |
|---|---|---|
| RV32I | Base | Minimal 32-bit system |
| RV32IM | - | Base + multiply/divide |
| RV32IMC | - | Base + multiply + compressed |
| RV32IMAC | - | Base + multiply + atomic + compressed |
| RV32IMAFC | - | Base + M + A + F + C |
| RV32IMAFDC | - | Base + M + A + F + D + C |
| RV32GC | General | RV32IMAFD_Zicsr_Zifencei + C |
| RV64GC | General | RV64IMAFD_Zicsr_Zifencei + C |
RV32G / RV64G: “General-purpose” configuration = IMAFD + Zicsr + Zifencei
B.10 Platform Profiles
RVA22 Profile (Application Processors)
Base: RV64I
Mandatory Extensions:
- M, A, F, D, C (IMAFD + C)
- Zicsr, Zifencei
- Zba, Zbb, Zbs (bit manipulation)
- Zihintpause
- Zicbom, Zicbop, Zicboz (cache management)
- Sv39 (virtual memory)
- Privileged spec v1.12+
Optional Extensions: V, H, Zk
Usage: Linux-capable application processors
RVA23 Profile (Next-generation)
Adds to RVA22:
- Sv48 and Sv57 as optional larger virtual address spaces (Sv39 remains the mandatory baseline)
- Zihintntl (non-temporal locality hints)
- Zicond (conditional operations)
- Zawrs (wait-on-reservation-set)
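- Zcb (additional compressed); Zcmp and Zcmt reuse encodings of the compressed double-precision loads/stores and therefore target microcontroller-class cores without D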
- Vector extension (V) mandatory
Usage: High-performance servers, HPC
RVM23 Profile (Microcontrollers)
Base: RV32I or RV64I
Mandatory Extensions:
- M, C
- Zicsr, Zifencei
- Zba, Zbb, Zbs
- Zicbop, Zicboz
- PMP (Physical Memory Protection)
Optional: A, F, D, V
Usage: Embedded microcontrollers
B.11 Extension Naming Convention
Format: RV[32|64|128][I|E][Extensions]
Examples:
- RV32I: 32-bit base integer
- RV64IMAC: 64-bit with M, A, C extensions
- RV32GC: 32-bit general-purpose with compressed
- RV64GCV: 64-bit general-purpose with compressed and vector
Ordering: Extensions listed in canonical order (IMAFDQCV…)
B.12 Extension Detection
Runtime Detection (misa CSR)
csrr t0, misa
andi t1, t0, (1 << 0) # Check 'A' extension (bit 0)
bnez t1, has_atomic
misa Bit Assignments:
- Bit 0: A (Atomic)
- Bit 2: C (Compressed)
- Bit 3: D (Double-precision FP)
- Bit 4: E (Embedded - RV32E)
- Bit 5: F (Single-precision FP)
- Bit 7: H (Hypervisor)
- Bit 8: I (Base integer ISA)
- Bit 12: M (Multiply/Divide)
- Bit 20: U (User mode)
- Bit 21: V (Vector)
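The bit assignments follow a simple rule: each single-letter extension occupies the bit whose index is its position in the alphabet (A = 0, B = 1, ..., Z = 25). M-mode firmware can use that to turn a saved misa value into a readable string, as in this C sketch. The helper names are hypothetical, and the output is in alphabetical order, not the canonical ISA-string order.

```c
#include <stdint.h>
#include <stddef.h>

/* Test whether a single-letter extension ('A'..'Z') is set in misa */
static int misa_has(uint64_t misa, char ext)
{
    return (int)((misa >> (ext - 'A')) & 1);
}

/* Write the letters of all present extensions into out, NUL-terminated */
static void misa_to_string(uint64_t misa, char out[27])
{
    size_t n = 0;

    for (char c = 'A'; c <= 'Z'; c++)
        if (misa_has(misa, c))
            out[n++] = c;
    out[n] = '\0';
}
```

A value with bits 0, 2, 8, 12, 18, and 20 set (0x141105 in the low bits) decodes to "ACIMSU". The MXL field in the top two bits is outside the scanned range and does not affect the result.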
B.13 References
- RISC-V ISA Manual: Complete extension specifications
- RISC-V Profiles: RVA22, RVA23, RVM23 specifications
- Extension Specifications: Individual ratified extension documents
Appendix C. Boot Loader Reference Implementation
Minimal RISC-V Bootloader Example
💡 Usage Guide: This appendix is your “boot disk” for starting projects. When you need to write bare-metal code from scratch, copy templates directly from here.
🚀 Minimal Viable Boot Template (Copy-Paste Ready)
Minimal Linker Script (link.ld)
This is the most frequently copy-pasted file in bare-metal projects:
/* link.ld - For QEMU virt machine */
OUTPUT_ARCH(riscv)
ENTRY(_start)
MEMORY {
RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}
SECTIONS {
. = 0x80000000;
.text : {
*(.text.boot) /* Ensure boot code comes first */
*(.text .text.*)
} > RAM
.rodata : {
*(.rodata .rodata.*)
} > RAM
.data : {
*(.data .data.*)
} > RAM
.bss : {
. = ALIGN(8); /* entry.S clears BSS 8 bytes at a time */
_bss_start = .;
*(.bss .bss.*)
*(COMMON)
. = ALIGN(8);
_bss_end = .;
} > RAM
. = ALIGN(16);
. = . + 0x4000; /* Reserve 16KB Stack */
_stack_top = .;
}
Minimal Entry Point (entry.S)
# entry.S - Minimal boot code
.section .text.boot
.global _start
_start:
# 1. Set Stack Pointer
la sp, _stack_top
# 2. Clear BSS section
la t0, _bss_start
la t1, _bss_end
clear_bss:
bge t0, t1, bss_done
sd zero, 0(t0)
addi t0, t0, 8
j clear_bss
bss_done:
# 3. Jump to C main
call main
# 4. Halt after main returns
halt:
wfi
j halt
Minimal Main (main.c)
// main.c - Minimal Hello World (UART)
#define UART_BASE 0x10000000 // QEMU virt UART address
void uart_putc(char c) {
volatile char *uart = (volatile char *)UART_BASE;
*uart = c;
}
void uart_puts(const char *s) {
while (*s) uart_putc(*s++);
}
int main(void) {
uart_puts("Hello, RISC-V!\n");
return 0;
}
Compile and Run
# Compile (-mcmodel=medany is needed because the code is linked at 0x80000000)
riscv64-unknown-elf-gcc -nostdlib -mcmodel=medany -T link.ld \
-o hello.elf entry.S main.c
# Run (-bios none keeps QEMU's default firmware from being loaded at 0x80000000)
qemu-system-riscv64 -machine virt -nographic -bios none \
-kernel hello.elf
📊 Typical Boot Flow Diagram
┌─────────────────────────────────────────────────────────────┐
│ Power-On Reset │
└─────────────────────────┬───────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ ZSBL (Zeroth-Stage Bootloader) - ROM │
│ • PC = Reset Vector (0x1000 or implementation-defined) │
│ • Initialize clock, DRAM Controller │
│ • Jump to FSBL │
└─────────────────────────┬───────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ FSBL (First-Stage Bootloader) - Flash/ROM │
│ • Initialize SPI/SD storage device │
│ • Load OpenSBI to DRAM │
│ • Jump to OpenSBI │
└─────────────────────────┬───────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ OpenSBI (M-mode Firmware) │
│ • Set up PMP to protect M-mode memory │
│ • Initialize SBI services │
│ • Set medeleg/mideleg to delegate traps │
│ • Jump to S-mode Kernel │
└─────────────────────────┬───────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Linux Kernel (S-mode) │
│ • Initialize virtual memory │
│ • Start init process │
└─────────────────────────────────────────────────────────────┘
This appendix provides a reference implementation of a minimal RISC-V bootloader. This code demonstrates the essential steps required to boot a RISC-V system from reset to loading an operating system. While production bootloaders like U-Boot are much more complex, this example illustrates the core concepts.
C.1 Boot Sequence Overview
Power-On Reset
↓
Reset Vector (0x1000 or implementation-defined)
↓
ZSBL (Zeroth-Stage Bootloader) - ROM code
├─ Initialize clocks
├─ Initialize DRAM
└─ Jump to FSBL
↓
FSBL (First-Stage Bootloader) - Flash/ROM
├─ Initialize storage (SPI/SD)
├─ Load SSBL to DRAM
├─ Verify SSBL (optional)
└─ Jump to SSBL
↓
SSBL (Second-Stage Bootloader) - U-Boot/GRUB
├─ Initialize devices
├─ Load kernel and device tree
├─ Set up boot arguments
└─ Jump to kernel
↓
Operating System (Linux/FreeBSD)
C.2 ZSBL: Zeroth-Stage Bootloader
Purpose: Minimal ROM code to initialize DRAM and load FSBL.
Constraints:
- Must fit in small on-chip ROM (typically 16-64 KB)
- No DRAM available initially
- Must run from ROM or tightly-coupled memory (TCM)
ZSBL Entry Point
# zsbl_start.S - ZSBL entry point
# Runs in M-mode immediately after reset
.section .text.init
.global _start
_start:
# Disable interrupts
csrw mie, zero
csrw mip, zero
# Set up trap vector (point to error handler)
la t0, trap_handler
csrw mtvec, t0
# Initialize stack pointer
# Use on-chip SRAM (e.g., 0x08000000 + 16KB)
la sp, _stack_top
# Clear BSS section
la t0, _bss_start
la t1, _bss_end
1:
bge t0, t1, 2f
sd zero, 0(t0)
addi t0, t0, 8
j 1b
2:
# Jump to C code
call zsbl_main
# Should never return
j .
trap_handler:
# Minimal trap handler - just hang
j .
ZSBL Main Function
// zsbl_main.c - ZSBL main logic
#include <stddef.h> // size_t
#include <stdint.h>
// Hardware addresses (example for SiFive FU540)
#define DRAM_BASE 0x80000000
#define DRAM_SIZE (8 * 1024 * 1024 * 1024UL) // 8 GB
#define FSBL_LOAD_ADDR 0x80000000
#define FSBL_SIZE (128 * 1024) // 128 KB
#define SPI_FLASH_BASE 0x20000000
// DRAM controller registers (simplified)
#define DRAM_CTRL_BASE 0x10000000
#define DRAM_INIT_REG (DRAM_CTRL_BASE + 0x00)
#define DRAM_STATUS_REG (DRAM_CTRL_BASE + 0x04)
void dram_init(void) {
volatile uint32_t *init_reg = (uint32_t *)DRAM_INIT_REG;
volatile uint32_t *status_reg = (uint32_t *)DRAM_STATUS_REG;
// Trigger DRAM initialization
*init_reg = 0x1;
// Wait for DRAM ready
while ((*status_reg & 0x1) == 0) {
// Busy wait
}
}
void load_fsbl(void) {
uint8_t *src = (uint8_t *)SPI_FLASH_BASE;
uint8_t *dst = (uint8_t *)FSBL_LOAD_ADDR;
// Simple memcpy from SPI flash to DRAM
for (size_t i = 0; i < FSBL_SIZE; i++) {
dst[i] = src[i];
}
}
void zsbl_main(void) {
// 1. Initialize DRAM
dram_init();
// 2. Load FSBL from SPI flash to DRAM
load_fsbl();
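// 3. Make the freshly copied code visible to instruction fetch, then jump to FSBL
__asm__ volatile ("fence.i" ::: "memory");
void (*fsbl_entry)(void) = (void (*)(void))FSBL_LOAD_ADDR;
fsbl_entry();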
// Should never reach here
while (1);
}
C.3 FSBL: First-Stage Bootloader
Purpose: Load second-stage bootloader (U-Boot) from storage.
Features:
- Initialize storage controller (SPI, SD, eMMC)
- Load SSBL image from storage
- Verify SSBL (checksum or signature)
- Jump to SSBL
FSBL Main Function
// fsbl_main.c - FSBL main logic
#include <stddef.h> // size_t
#include <stdint.h>
#define SSBL_LOAD_ADDR 0x80200000 // Load U-Boot at 2MB offset
#define SSBL_SIZE (512 * 1024) // 512 KB
#define SSBL_FLASH_OFFSET 0x40000 // Offset in SPI flash
// SPI controller registers (simplified)
#define SPI_BASE 0x10040000
#define SPI_CTRL_REG (SPI_BASE + 0x00)
#define SPI_DATA_REG (SPI_BASE + 0x04)
#define SPI_STATUS_REG (SPI_BASE + 0x08)
void spi_init(void) {
volatile uint32_t *ctrl_reg = (uint32_t *)SPI_CTRL_REG;
// Configure SPI: 8-bit mode, clock divider = 4
*ctrl_reg = 0x04;
}
void spi_read(uint32_t offset, uint8_t *buf, size_t len) {
volatile uint32_t *data_reg = (uint32_t *)SPI_DATA_REG;
volatile uint32_t *status_reg = (uint32_t *)SPI_STATUS_REG;
// Send read command (0x03) + 24-bit address
*data_reg = 0x03;
*data_reg = (offset >> 16) & 0xFF;
*data_reg = (offset >> 8) & 0xFF;
*data_reg = offset & 0xFF;
// Read data
for (size_t i = 0; i < len; i++) {
// Wait for data ready
while ((*status_reg & 0x1) == 0);
buf[i] = *data_reg & 0xFF;
}
}
uint32_t calculate_checksum(uint8_t *data, size_t len) {
uint32_t sum = 0;
for (size_t i = 0; i < len; i++) {
sum += data[i];
}
return sum;
}
void fsbl_main(void) {
uint8_t *ssbl_addr = (uint8_t *)SSBL_LOAD_ADDR;
// 1. Initialize SPI controller
spi_init();
// 2. Load SSBL from SPI flash
spi_read(SSBL_FLASH_OFFSET, ssbl_addr, SSBL_SIZE);
// 3. Verify SSBL (simple checksum)
uint32_t *checksum_ptr = (uint32_t *)(ssbl_addr + SSBL_SIZE - 4);
uint32_t expected_checksum = *checksum_ptr;
uint32_t actual_checksum = calculate_checksum(ssbl_addr, SSBL_SIZE - 4);
if (actual_checksum != expected_checksum) {
// Checksum failed - hang
while (1);
}
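// 4. Make the freshly loaded code visible to instruction fetch, then jump to SSBL
__asm__ volatile ("fence.i" ::: "memory");
void (*ssbl_entry)(void) = (void (*)(void))SSBL_LOAD_ADDR;
ssbl_entry();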
// Should never reach here
while (1);
}
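The FSBL above expects the last four bytes of the SSBL image to hold the byte-sum of everything before them, so the build has to append that trailer. A host-side helper for this could look like the following sketch. The names `image_checksum` and `seal_image` are hypothetical, and the trailer is written little-endian to match how a RISC-V hart reads the `uint32_t` in `fsbl_main`.

```c
#include <stddef.h>
#include <stdint.h>

/* Same algorithm as calculate_checksum in the FSBL: plain byte sum */
static uint32_t image_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/*
 * Fill the last 4 bytes of image[0..total_len) with the checksum of
 * the preceding bytes. Returns 0 on success, -1 if the buffer is too
 * small to hold a trailer.
 */
static int seal_image(uint8_t *image, size_t total_len)
{
    if (total_len < 4)
        return -1;

    uint32_t sum = image_checksum(image, total_len - 4);
    uint8_t *tail = image + total_len - 4;

    tail[0] = (uint8_t)(sum & 0xFF);          /* little-endian byte order */
    tail[1] = (uint8_t)((sum >> 8) & 0xFF);
    tail[2] = (uint8_t)((sum >> 16) & 0xFF);
    tail[3] = (uint8_t)((sum >> 24) & 0xFF);
    return 0;
}
```

A build script would pad the SSBL binary to SSBL_SIZE, call `seal_image` on the padded buffer, and write the result to flash at SSBL_FLASH_OFFSET. A plain sum only catches accidental corruption; it offers no protection against deliberate tampering.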
C.4 Minimal SSBL: Second-Stage Bootloader
Purpose: Load kernel and device tree, set up boot environment.
Features:
- Parse device tree
- Load kernel image
- Set up boot arguments
- Jump to kernel in S-mode
SSBL Main Function
// ssbl_main.c - Minimal second-stage bootloader
#include <stdint.h>
#define KERNEL_LOAD_ADDR 0x80400000 // Load kernel at 4MB offset
#define DTB_LOAD_ADDR 0x82000000 // Load device tree at 32MB
#define KERNEL_SIZE (8 * 1024 * 1024) // 8 MB
#define DTB_SIZE (64 * 1024) // 64 KB
// Boot arguments for kernel
struct boot_args {
uint64_t hartid;
uint64_t dtb_addr;
};
void uart_putc(char c) {
volatile uint32_t *uart_tx = (uint32_t *)0x10010000;
*uart_tx = c;
}
void uart_puts(const char *s) {
while (*s) {
uart_putc(*s++);
}
}
void load_kernel_and_dtb(void) {
// In real bootloader, this would load from storage
// For this example, assume kernel and DTB are already in memory
uart_puts("Loading kernel...\n");
// ... load kernel to KERNEL_LOAD_ADDR ...
uart_puts("Loading device tree...\n");
// ... load DTB to DTB_LOAD_ADDR ...
}
void jump_to_kernel(uint64_t hartid, uint64_t dtb_addr, uint64_t kernel_addr) {
// Set up registers for kernel entry
// a0 = hartid
// a1 = dtb_addr
__asm__ volatile (
"mv a0, %0\n"
"mv a1, %1\n"
"jr %2\n"
:
: "r"(hartid), "r"(dtb_addr), "r"(kernel_addr)
: "a0", "a1"
);
}
void ssbl_main(void) {
uint64_t hartid;
// Read hart ID
__asm__ volatile ("csrr %0, mhartid" : "=r"(hartid));
uart_puts("SSBL: Second-Stage Bootloader\n");
// 1. Load kernel and device tree
load_kernel_and_dtb();
// 2. Set up boot arguments
uart_puts("Booting kernel...\n");
// 3. Jump to kernel
// Note: This sketch jumps while still in M-mode. A real bootloader would set
// mstatus.MPP to S-mode, write the kernel address to mepc, and execute mret.
jump_to_kernel(hartid, DTB_LOAD_ADDR, KERNEL_LOAD_ADDR);
// Should never reach here
while (1);
}
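The feature list mentions parsing the device tree, which the listing above leaves out. As a minimal sketch (the helper names `fdt_be32` and `fdt_header_ok` are hypothetical, not part of any library), a bootloader could at least confirm that the blob starts with the flattened device tree magic number before handing it to the kernel. The header fields are stored big-endian, so they are assembled byte by byte:

```c
#include <stdint.h>
#include <stddef.h>

#define FDT_MAGIC 0xd00dfeedu  /* Magic number from the Devicetree Specification */

/* Read a big-endian 32-bit value one byte at a time (no alignment assumptions). */
static uint32_t fdt_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: returns the blob's total size in bytes, or 0 if the
 * header looks implausible. max_size is the space reserved for the DTB. */
static uint32_t fdt_header_ok(const void *blob, uint32_t max_size) {
    const uint8_t *p = (const uint8_t *)blob;
    if (fdt_be32(p) != FDT_MAGIC)
        return 0;
    uint32_t totalsize = fdt_be32(p + 4);        /* second header field */
    if (totalsize < 40 || totalsize > max_size)  /* the header alone is 40 bytes */
        return 0;
    return totalsize;
}
```

In ssbl_main this could be called as fdt_header_ok((const void *)DTB_LOAD_ADDR, DTB_SIZE) just before jump_to_kernel, printing an error and halting if it returns 0.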
C.5 Linker Script
Purpose: Define memory layout for bootloader.
/* bootloader.ld - Linker script for RISC-V bootloader */
OUTPUT_ARCH("riscv")
ENTRY(_start)
MEMORY
{
ROM (rx) : ORIGIN = 0x00001000, LENGTH = 64K
SRAM (rwx) : ORIGIN = 0x08000000, LENGTH = 16K
DRAM (rwx) : ORIGIN = 0x80000000, LENGTH = 8G
}
SECTIONS
{
/* Code section in ROM */
.text : {
*(.text.init)
*(.text*)
} > ROM
/* Read-only data in ROM */
.rodata : {
*(.rodata*)
} > ROM
/* Data section in SRAM */
.data : {
_data_start = .;
*(.data*)
_data_end = .;
} > SRAM AT> ROM
/* BSS section in SRAM */
.bss : {
_bss_start = .;
*(.bss*)
*(COMMON)
_bss_end = .;
} > SRAM
/* Stack in SRAM */
.stack : {
. = ALIGN(16);
. += 8K;
_stack_top = .;
} > SRAM
}
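The script exports _data_start, _data_end, _bss_start, and _bss_end, but startup code still has to use them before any C code relies on globals. The sketch below shows that work in plain C: copy the initialized data from its load address in ROM to SRAM, then zero .bss. The function name `init_ram_sections` is hypothetical, and the load address would need an extra symbol that this script does not define (for example `_data_load = LOADADDR(.data);`).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical startup helper. In a real bootloader the arguments would be
 * the linker symbols (&_data_start, &_data_end, &_bss_start, &_bss_end)
 * plus a load-address symbol such as _data_load (an assumption here). */
static void init_ram_sections(uint8_t *data_start, uint8_t *data_end,
                              const uint8_t *data_load,
                              uint8_t *bss_start, uint8_t *bss_end) {
    /* Copy .data from its ROM image to its run-time address in SRAM. */
    size_t data_len = (size_t)(data_end - data_start);
    for (size_t i = 0; i < data_len; i++)
        data_start[i] = data_load[i];

    /* Zero .bss so that uninitialized globals start at 0, as C requires. */
    for (uint8_t *p = bss_start; p < bss_end; p++)
        *p = 0;
}
```

Real startup code often does this in assembly before the stack is usable; the C form is shown only because it is easier to read and to test on a host.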
C.6 Makefile
# Makefile for RISC-V bootloader
CROSS_COMPILE = riscv64-unknown-elf-
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJCOPY = $(CROSS_COMPILE)objcopy
CFLAGS = -march=rv64imac -mabi=lp64 -mcmodel=medany \
-O2 -Wall -Wextra -nostdlib -nostartfiles \
-fno-builtin -fno-common
LDFLAGS = -T bootloader.ld -nostdlib
ZSBL_OBJS = zsbl_start.o zsbl_main.o
FSBL_OBJS = fsbl_start.o fsbl_main.o
SSBL_OBJS = ssbl_start.o ssbl_main.o
all: zsbl.bin fsbl.bin ssbl.bin
zsbl.elf: $(ZSBL_OBJS)
$(LD) $(LDFLAGS) -o $@ $^
fsbl.elf: $(FSBL_OBJS)
$(LD) $(LDFLAGS) -o $@ $^
ssbl.elf: $(SSBL_OBJS)
$(LD) $(LDFLAGS) -o $@ $^
%.bin: %.elf
$(OBJCOPY) -O binary $< $@
%.o: %.S
$(CC) $(CFLAGS) -c -o $@ $<
%.o: %.c
$(CC) $(CFLAGS) -c -o $@ $<
clean:
rm -f *.o *.elf *.bin
.PHONY: all clean
C.7 Common Boot Issues and Solutions
Issue 1: Hart Hangs at Reset
Symptoms: System doesn’t boot, no output.
Possible Causes:
- Reset vector pointing to invalid address
- ROM not mapped correctly
- Clock not initialized
Debug Steps:
# Add debug output at very first instruction
_start:
li t0, 0x10010000 # UART base
li t1, 'A'
sw t1, 0(t0) # Write 'A' to UART
# ... rest of code ...
Issue 2: DRAM Initialization Fails
Symptoms: System hangs after DRAM init, or data corruption.
Possible Causes:
- Incorrect DRAM controller configuration
- Clock frequency mismatch
- Timing parameters wrong
Debug Steps:
void dram_test(void) {
volatile uint32_t *test_addr = (uint32_t *)0x80000000;
// Write test pattern
*test_addr = 0xDEADBEEF;
// Read back
if (*test_addr != 0xDEADBEEF) {
uart_puts("DRAM test failed!\n");
while (1);
}
}
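A single-word check catches a dead controller but not stuck address lines or partial corruption. A slightly broader sketch, using a hypothetical helper named `dram_test_range`, writes an index-derived pattern across a region and verifies it only after the whole region has been written:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: returns 0 on success, or 1 + the index of the first
 * word that read back with the wrong value. */
static size_t dram_test_range(volatile uint32_t *base, size_t words) {
    /* Pass 1: give every word a distinct value derived from its index,
     * so aliased address lines show up as mismatches in pass 2. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;

    /* Pass 2: verify the whole region. */
    for (size_t i = 0; i < words; i++) {
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            return i + 1;
    }
    return 0;
}
```

On hardware this might be called on a small window at the DRAM base, keeping the region short so boot time is not noticeably affected.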
Issue 3: Bootloader Doesn’t Load
Symptoms: ZSBL runs but FSBL doesn’t start.
Possible Causes:
- SPI flash not initialized
- Wrong flash offset
- Corrupted image
Debug Steps:
void fsbl_main(void) {
uart_puts("FSBL starting...\n");
// Verify first few bytes of SSBL
uint8_t *ssbl = (uint8_t *)SSBL_LOAD_ADDR;
uart_puts("First bytes: ");
for (int i = 0; i < 16; i++) {
uart_puthex(ssbl[i]);
uart_putc(' ');
}
uart_putc('\n');
}
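The listing calls uart_puthex, which is not defined in this appendix. One possible shape for it, assuming it prints a byte as two hex digits, keeps the formatting separate from the UART write so that the formatting can be checked on a host. `byte_to_hex` is a hypothetical helper name.

```c
#include <stdint.h>

/* Format one byte as two uppercase hex digits plus a terminating NUL. */
static void byte_to_hex(uint8_t value, char out[3]) {
    static const char digits[] = "0123456789ABCDEF";
    out[0] = digits[value >> 4];    /* high nibble */
    out[1] = digits[value & 0x0F];  /* low nibble */
    out[2] = '\0';
}
```

With this in place, uart_puthex(b) could format into a three-byte buffer and pass it to uart_puts.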
C.8 References
- U-Boot Documentation: https://u-boot.readthedocs.io/
- OpenSBI Documentation: https://github.com/riscv-software-src/opensbi
- RISC-V Boot Flow: RISC-V Platform Specification
- Device Tree Specification: https://devicetree.org/
Appendix D. SBI Call Reference
Supervisor Binary Interface (SBI) Quick Reference
💡 Usage Guide: This appendix is your “API manual” for S-mode calling M-mode services. When you forget whether a7 is EID or FID, flip right here.
🎯 SBI Calling Convention Diagram
┌─────────────────────────────────────────────────────────────────┐
│ S-mode (Kernel/OS) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ a7 │ │ a6 │ │ a0-a5 │ │ ecall │ │
│ │ EID │ │ FID │ │ Args │ │ ──────► │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
├───────┼───────────┼───────────┼─────────────────────────────────┤
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Trap to M-mode (OpenSBI) │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │ a0 │ │ a1 │ │
│ │ Error │ │ Value │ │
│ │ Code │ │(Return) │ │
│ └─────────┘ └─────────┘ │
│ │
│ M-mode (OpenSBI Firmware) │
└─────────────────────────────────────────────────────────────────┘
Memory Aid: a7 = EID (Which Extension), a6 = FID (Which Function)
📋 Common EID Quick Reference
| EID (Hex) | EID (ASCII) | Name | Purpose | Common FID |
|---|---|---|---|---|
| 0x10 | (none) | Base | Query SBI version/vendor | 0=version, 3=probe extension |
| 0x54494D45 | “TIME” | Timer | Set timer interrupt | 0=set_timer |
| 0x735049 | “sPI” | IPI | Cross-core interrupt | 0=send_ipi |
| 0x52464E43 | “RFNC” | RFENCE | Remote FENCE.I / TLB flush | 0-6 (various fences) |
| 0x4442434E | “DBCN” | Debug Console | Debug output | 0=write, 1=read, 2=write_byte |
| 0x48534D | “HSM” | Hart State Mgmt | Start/stop hart | 0=start, 1=stop |
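The ASCII column is more than a mnemonic: each named EID is the extension's short name packed into an integer, one byte per character, with the first character in the most significant position. A small sketch (the function name `sbi_eid_from_name` is hypothetical) makes the encoding explicit:

```c
#include <stdint.h>

/* Pack an ASCII extension name such as "TIME" into its numeric EID. */
static uint32_t sbi_eid_from_name(const char *name) {
    uint32_t eid = 0;
    while (*name)
        eid = (eid << 8) | (uint8_t)*name++;  /* shift in one character */
    return eid;
}
```

This is handy for sanity-checking a constant typed by hand; kernel code normally uses the precomputed hex values directly.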
🛠️ Common SBI Wrappers (Copy-Paste Ready)
sbi_call Universal Interface
struct sbiret {
long error;
long value;
};
static inline struct sbiret sbi_call(long eid, long fid,
long a0, long a1, long a2, long a3, long a4, long a5)
{
struct sbiret ret;
register long r_a0 asm("a0") = a0;
register long r_a1 asm("a1") = a1;
register long r_a2 asm("a2") = a2;
register long r_a3 asm("a3") = a3;
register long r_a4 asm("a4") = a4;
register long r_a5 asm("a5") = a5;
register long r_a6 asm("a6") = fid;
register long r_a7 asm("a7") = eid;
asm volatile("ecall"
: "+r"(r_a0), "+r"(r_a1)
: "r"(r_a2), "r"(r_a3), "r"(r_a4), "r"(r_a5), "r"(r_a6), "r"(r_a7)
: "memory");
ret.error = r_a0;
ret.value = r_a1;
return ret;
}
Common Function Wrappers
// 1. Set Timer Interrupt (Most commonly used!)
static inline void sbi_set_timer(uint64_t stime_value) {
sbi_call(0x54494D45, 0, stime_value, 0, 0, 0, 0, 0);
}
// 2. Write a character (Debug Console)
static inline void sbi_debug_console_write_byte(char c) {
sbi_call(0x4442434E, 2, c, 0, 0, 0, 0, 0);
}
// 3. Query SBI version
static inline long sbi_get_spec_version(void) {
struct sbiret ret = sbi_call(0x10, 0, 0, 0, 0, 0, 0, 0);
return ret.value;
}
// 4. Probe if Extension is supported
static inline long sbi_probe_extension(long eid) {
struct sbiret ret = sbi_call(0x10, 3, eid, 0, 0, 0, 0, 0);
return ret.value; // 0 = not supported, non-0 = supported
}
⚠️ Common Pitfalls
Pitfall 1: EID and FID Order Confused
Symptom: SBI call returns SBI_ERR_NOT_SUPPORTED.
Cause: a7 and a6 are swapped.
// ❌ Wrong: EID and FID swapped
register long a6 asm("a6") = 0x10; // Should be FID
register long a7 asm("a7") = 0; // Should be EID
// ✅ Correct: a7=EID, a6=FID
register long a6 asm("a6") = 0; // FID = 0 (get_spec_version)
register long a7 asm("a7") = 0x10; // EID = 0x10 (Base Extension)
Pitfall 2: Forgetting to Check Error Code
Symptom: SBI call fails but program continues, causing hard-to-trace subsequent errors.
// ❌ Wrong: Ignoring error code
sbi_set_timer(next_time);
// ✅ Correct: Check error
struct sbiret ret = sbi_call(0x54494D45, 0, next_time, 0, 0, 0, 0, 0);
if (ret.error != 0) {
panic("sbi_set_timer failed: %ld", ret.error);
}
This appendix provides a comprehensive reference for RISC-V SBI (Supervisor Binary Interface) calls. SBI defines the standard interface between supervisor mode (S-mode) software and machine mode (M-mode) firmware, enabling portable operating systems across different RISC-V platforms.
D.1 SBI Calling Convention
Register Usage
Input Registers:
| Register | Purpose |
|---|---|
| a7 | Extension ID (EID) |
| a6 | Function ID (FID) |
| a0 | Parameter 0 (overwritten with the error code on return) |
| a1 | Parameter 1 (overwritten with the return value on return) |
| a2 | Parameter 2 |
| a3 | Parameter 3 |
| a4 | Parameter 4 |
| a5 | Parameter 5 |
Output Registers:
| Register | Purpose |
|---|---|
| a0 | Error code (0 = success, negative = error) |
| a1 | Return value (function-specific) |
Preserved Registers: All registers except a0 and a1 are preserved across SBI calls.
Invocation
// S-mode code invokes SBI using ecall instruction
register unsigned long a0 asm("a0") = param0;
register unsigned long a1 asm("a1") = param1;
register unsigned long a6 asm("a6") = function_id;
register unsigned long a7 asm("a7") = extension_id;
asm volatile("ecall"
: "+r"(a0), "+r"(a1)
: "r"(a6), "r"(a7)
: "memory");
// a0 contains error code, a1 contains return value
D.2 SBI Error Codes
| Code | Name | Description |
|---|---|---|
| 0 | SBI_SUCCESS | Operation completed successfully |
| -1 | SBI_ERR_FAILED | Operation failed |
| -2 | SBI_ERR_NOT_SUPPORTED | Function not supported |
| -3 | SBI_ERR_INVALID_PARAM | Invalid parameter |
| -4 | SBI_ERR_DENIED | Permission denied |
| -5 | SBI_ERR_INVALID_ADDRESS | Invalid address |
| -6 | SBI_ERR_ALREADY_AVAILABLE | Resource already available |
| -7 | SBI_ERR_ALREADY_STARTED | Already started |
| -8 | SBI_ERR_ALREADY_STOPPED | Already stopped |
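When an SBI call fails during bring-up, a symbolic name in the log is easier to act on than a bare negative number. The sketch below maps the codes in the table to their names; `sbi_error_name` is a hypothetical helper, and the fallback string is a placeholder rather than a name from the specification.

```c
/* Map an SBI error code (the a0 return value) to its symbolic name. */
static const char *sbi_error_name(long error) {
    switch (error) {
    case 0:  return "SBI_SUCCESS";
    case -1: return "SBI_ERR_FAILED";
    case -2: return "SBI_ERR_NOT_SUPPORTED";
    case -3: return "SBI_ERR_INVALID_PARAM";
    case -4: return "SBI_ERR_DENIED";
    case -5: return "SBI_ERR_INVALID_ADDRESS";
    case -6: return "SBI_ERR_ALREADY_AVAILABLE";
    case -7: return "SBI_ERR_ALREADY_STARTED";
    case -8: return "SBI_ERR_ALREADY_STOPPED";
    default: return "SBI_ERR_UNKNOWN";  /* placeholder, not a spec name */
    }
}
```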
D.3 Base Extension (EID = 0x10)
Purpose: Query SBI implementation details and supported extensions.
D.3.1 Get SBI Specification Version (FID = 0)
Returns: SBI specification version.
- a1[30:24]: Major version (bit 31 is reserved and must be 0)
- a1[23:0]: Minor version
long sbi_get_spec_version(void) {
register unsigned long a0 asm("a0");
register unsigned long a1 asm("a1");
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x10;
asm volatile("ecall" : "=r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
return a1; // Version in a1
}
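The raw value is rarely useful on its own, so callers usually split it into its two fields. A minimal sketch, with hypothetical helper names, assuming the layout from the SBI specification (24-bit minor number, 7-bit major number above it, top bit zero):

```c
/* Major version: the 7 bits above the minor field (bits 30..24). */
static unsigned sbi_spec_major(unsigned long version) {
    return (unsigned)((version >> 24) & 0x7F);
}

/* Minor version: the low 24 bits (bits 23..0). */
static unsigned sbi_spec_minor(unsigned long version) {
    return (unsigned)(version & 0xFFFFFF);
}
```

For example, a return value of 0x02000000 decodes to major 2, minor 0.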
D.3.2 Get SBI Implementation ID (FID = 1)
Returns: Implementation ID.
| ID | Implementation |
|---|---|
| 0 | Berkeley Boot Loader (BBL) |
| 1 | OpenSBI |
| 2 | Xvisor |
| 3 | KVM |
| 4 | RustSBI |
| 5 | Diosix |
D.3.3 Get SBI Implementation Version (FID = 2)
Returns: Implementation-specific version number.
D.3.4 Probe SBI Extension (FID = 3)
Parameters:
- a0: Extension ID to probe
Returns:
- a1: 0 = not available; 1 (or another implementation-defined nonzero value) = available
long sbi_probe_extension(long extension_id) {
register unsigned long a0 asm("a0") = extension_id;
register unsigned long a1 asm("a1");
register unsigned long a6 asm("a6") = 3;
register unsigned long a7 asm("a7") = 0x10;
asm volatile("ecall" : "+r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
return a1;
}
D.3.5 Get Machine Vendor ID (FID = 4)
Returns: mvendorid CSR value.
D.3.6 Get Machine Architecture ID (FID = 5)
Returns: marchid CSR value.
D.3.7 Get Machine Implementation ID (FID = 6)
Returns: mimpid CSR value.
D.4 Timer Extension (EID = 0x54494D45 “TIME”)
Purpose: Program timer interrupts.
D.4.1 Set Timer (FID = 0)
Parameters:
- a0: Timer value (absolute time in ticks)
Description: Programs the timer to fire at the specified time. Clears pending timer interrupt.
void sbi_set_timer(uint64_t stime_value) {
register unsigned long a0 asm("a0") = stime_value;
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x54494D45;
asm volatile("ecall" : "+r"(a0) : "r"(a6), "r"(a7) : "a1", "memory"); // a1 is also overwritten on return
}
// Usage: Set timer to fire in 1 second (assuming 10 MHz timebase)
uint64_t current_time;
asm volatile("rdtime %0" : "=r"(current_time));
sbi_set_timer(current_time + 10000000);
D.5 IPI Extension (EID = 0x735049 “sPI”)
Purpose: Send inter-processor interrupts.
D.5.1 Send IPI (FID = 0)
Parameters:
- a0: Hart mask (bitmap of target harts)
- a1: Hart mask base (base hart ID for the mask)
Description: Sends supervisor software interrupt to specified harts.
long sbi_send_ipi(unsigned long hart_mask, unsigned long hart_mask_base) {
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = hart_mask_base;
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x735049;
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
return a0;
}
// Usage: Send IPI to harts 1, 2, 3
sbi_send_ipi(0b1110, 0); // Bits 1, 2, 3 set, base = 0
// Send IPI to hart 65 (bit 1 of mask, base = 64)
sbi_send_ipi(0b10, 64);
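The mask and base pair in the usage above can be derived mechanically from a hart ID. The sketch below is one way to do it on RV64, where a mask covers 64 harts; the struct and function names are hypothetical. Rounding the base down to a multiple of 64 is a convenient choice for this sketch, not something the interface requires.

```c
#include <stdint.h>

struct hart_target {
    uint64_t mask;  /* bit i selects hart (base + i) */
    uint64_t base;  /* hart ID that corresponds to bit 0 of the mask */
};

/* Hypothetical helper: build the (mask, base) pair that targets one hart. */
static struct hart_target hart_target_for(uint64_t hartid) {
    struct hart_target t;
    t.base = hartid & ~(uint64_t)63;        /* round down to a multiple of 64 */
    t.mask = (uint64_t)1 << (hartid & 63);  /* position within that window */
    return t;
}
```

For hart 65 this produces base 64 and mask 0b10, matching the second call in the usage example.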
D.6 RFENCE Extension (EID = 0x52464E43 “RFNC”)
Purpose: Remote fence operations for TLB and instruction cache synchronization.
D.6.1 Remote FENCE.I (FID = 0)
Parameters:
- a0: Hart mask
- a1: Hart mask base
Description: Execute FENCE.I on remote harts.
long sbi_remote_fence_i(unsigned long hart_mask, unsigned long hart_mask_base) {
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = hart_mask_base;
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x52464E43;
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
return a0;
}
D.6.2 Remote SFENCE.VMA (FID = 1)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Start address (virtual address)
- a3: Size of the address range in bytes
Description: Execute SFENCE.VMA on remote harts for specified address range.
long sbi_remote_sfence_vma(unsigned long hart_mask, unsigned long hart_mask_base,
unsigned long start_addr, unsigned long size) {
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = hart_mask_base;
register unsigned long a2 asm("a2") = start_addr;
register unsigned long a3 asm("a3") = size;
register unsigned long a6 asm("a6") = 1;
register unsigned long a7 asm("a7") = 0x52464E43;
asm volatile("ecall" : "+r"(a0), "+r"(a1)
: "r"(a2), "r"(a3), "r"(a6), "r"(a7) : "memory");
return a0;
}
// Usage: Flush TLB for one 4 KiB page on all harts
// hart_mask_base = -1 tells the SBI to ignore hart_mask and target every available hart
sbi_remote_sfence_vma(0, -1UL, 0x80000000, 4096); // size is in bytes
D.6.3 Remote SFENCE.VMA with ASID (FID = 2)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Start address
- a3: Size
- a4: ASID
Description: Execute SFENCE.VMA with ASID on remote harts.
long sbi_remote_sfence_vma_asid(unsigned long hart_mask, unsigned long hart_mask_base,
unsigned long start_addr, unsigned long size,
unsigned long asid) {
register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = hart_mask_base;
register unsigned long a2 asm("a2") = start_addr;
register unsigned long a3 asm("a3") = size;
register unsigned long a4 asm("a4") = asid;
register unsigned long a6 asm("a6") = 2;
register unsigned long a7 asm("a7") = 0x52464E43;
asm volatile("ecall" : "+r"(a0), "+r"(a1)
: "r"(a2), "r"(a3), "r"(a4), "r"(a6), "r"(a7) : "memory");
return a0;
}
D.6.4 Remote HFENCE.GVMA (FID = 3)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Guest physical address
- a3: Size
Description: Execute HFENCE.GVMA on remote harts (hypervisor extension).
D.6.5 Remote HFENCE.GVMA with VMID (FID = 4)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Guest physical address
- a3: Size
- a4: VMID
Description: Execute HFENCE.GVMA with VMID on remote harts.
D.6.6 Remote HFENCE.VVMA (FID = 5)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Guest virtual address
- a3: Size
Description: Execute HFENCE.VVMA on remote harts.
D.6.7 Remote HFENCE.VVMA with ASID (FID = 6)
Parameters:
- a0: Hart mask
- a1: Hart mask base
- a2: Guest virtual address
- a3: Size
- a4: ASID
Description: Execute HFENCE.VVMA with ASID on remote harts.
D.7 Hart State Management Extension (EID = 0x48534D “HSM”)
Purpose: Manage hart lifecycle (start, stop, suspend).
D.7.1 Hart Start (FID = 0)
Parameters:
- a0: Hart ID
- a1: Start address (physical address)
- a2: Opaque parameter (passed to hart in a1)
Returns: SBI_SUCCESS or error code
Description: Start the specified hart at the given address.
long sbi_hart_start(unsigned long hartid, unsigned long start_addr,
unsigned long opaque) {
register unsigned long a0 asm("a0") = hartid;
register unsigned long a1 asm("a1") = start_addr;
register unsigned long a2 asm("a2") = opaque;
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x48534D;
asm volatile("ecall" : "+r"(a0), "+r"(a1)
: "r"(a2), "r"(a6), "r"(a7) : "memory");
return a0;
}
// Usage: Start hart 1 at address 0x80200000
sbi_hart_start(1, 0x80200000, 0);
D.7.2 Hart Stop (FID = 1)
Parameters: None
Returns: Does not return on success
Description: Stop the current hart. Hart enters stopped state.
void sbi_hart_stop(void) {
register unsigned long a6 asm("a6") = 1;
register unsigned long a7 asm("a7") = 0x48534D;
asm volatile("ecall" : : "r"(a6), "r"(a7) : "a0", "a1", "memory"); // a0/a1 are overwritten if the call fails and returns
}
D.7.3 Hart Get Status (FID = 2)
Parameters:
- a0: Hart ID
Returns:
- a1: Hart status
Hart Status Values:
| Value | Status |
|---|---|
| 0 | STARTED |
| 1 | STOPPED |
| 2 | START_PENDING |
| 3 | STOP_PENDING |
| 4 | SUSPENDED |
| 5 | SUSPEND_PENDING |
| 6 | RESUME_PENDING |
long sbi_hart_get_status(unsigned long hartid) {
register unsigned long a0 asm("a0") = hartid;
register unsigned long a1 asm("a1");
register unsigned long a6 asm("a6") = 2;
register unsigned long a7 asm("a7") = 0x48534D;
asm volatile("ecall" : "+r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
return a1;
}
D.7.4 Hart Suspend (FID = 3)
Parameters:
- a0: Suspend type
- a1: Resume address
- a2: Opaque parameter
Suspend Types:
| Value | Type |
|---|---|
| 0x00000000 | RETENTIVE (retain state, low latency) |
| 0x80000000 | NON_RETENTIVE (lose state, save/restore required) |
D.8 System Reset Extension (EID = 0x53525354 “SRST”)
Purpose: System-wide reset and shutdown.
D.8.1 System Reset (FID = 0)
Parameters:
- a0: Reset type
- a1: Reset reason
Returns: Does not return on success
Reset Types:
| Value | Type |
|---|---|
| 0x00000000 | SHUTDOWN |
| 0x00000001 | COLD_REBOOT |
| 0x00000002 | WARM_REBOOT |
Reset Reasons:
| Value | Reason |
|---|---|
| 0x00000000 | NO_REASON |
| 0x00000001 | SYSTEM_FAILURE |
void sbi_system_reset(unsigned long reset_type, unsigned long reset_reason) {
register unsigned long a0 asm("a0") = reset_type;
register unsigned long a1 asm("a1") = reset_reason;
register unsigned long a6 asm("a6") = 0;
register unsigned long a7 asm("a7") = 0x53525354;
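asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
// The call returns only if the request failed (for example SBI_ERR_NOT_SUPPORTED),
// so spin here rather than marking the path unreachable.
while (1);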
}
// Usage: Reboot the system
#define SBI_RESET_TYPE_COLD_REBOOT 1
#define SBI_RESET_REASON_NO_REASON 0
sbi_system_reset(SBI_RESET_TYPE_COLD_REBOOT, SBI_RESET_REASON_NO_REASON);
D.9 Performance Monitoring Unit Extension (EID = 0x504D55 “PMU”)
Purpose: Configure and read performance counters.
D.9.1 Get Number of Counters (FID = 0)
Returns:
- a1: Number of counters
D.9.2 Get Counter Info (FID = 1)
Parameters:
- a0: Counter index
Returns:
- a1: Counter info
D.9.3 Configure Matching Counters (FID = 2)
Parameters:
- a0: Counter index base
- a1: Counter mask
- a2: Config flags
- a3: Event index
- a4: Event data
Returns:
- a1: Index of the counter that was selected and configured
D.9.4 Start Counters (FID = 3)
Parameters:
- a0: Counter index base
- a1: Counter mask
- a2: Start flags
- a3: Initial value
D.9.5 Stop Counters (FID = 4)
Parameters:
- a0: Counter index base
- a1: Counter mask
- a2: Stop flags
D.9.6 Read Firmware Counter (FID = 5)
Parameters:
- a0: Counter index
Returns:
- a1: Counter value
D.10 Legacy Extensions (Deprecated)
Note: These extensions are deprecated but still widely used for compatibility.
D.10.1 Console Putchar (EID = 0x01)
Parameters:
- a0: Character to output
void sbi_console_putchar(int ch) {
register unsigned long a0 asm("a0") = ch;
register unsigned long a7 asm("a7") = 0x01;
asm volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
}
D.10.2 Console Getchar (EID = 0x02)
Returns:
- a0: Character read, or -1 if no character available
int sbi_console_getchar(void) {
register unsigned long a0 asm("a0");
register unsigned long a7 asm("a7") = 0x02;
asm volatile("ecall" : "=r"(a0) : "r"(a7) : "memory");
return a0;
}
D.10.3 Legacy Set Timer (EID = 0x00)
Parameters:
- a0: Timer value
Note: Use Timer Extension (0x54494D45) instead.
D.10.4 Legacy Clear IPI (EID = 0x03)
Note: Deprecated. Clear sip.SSIP bit directly.
D.10.5 Legacy Send IPI (EID = 0x04)
Parameters:
- a0: Hart mask pointer
Note: Use IPI Extension (0x735049) instead.
D.10.6 Legacy Remote FENCE.I (EID = 0x05)
Parameters:
- a0: Hart mask pointer
Note: Use RFENCE Extension (0x52464E43) instead.
D.10.7 Legacy Remote SFENCE.VMA (EID = 0x06)
Parameters:
- a0: Hart mask pointer
- a1: Start address
- a2: Size
Note: Use RFENCE Extension instead.
D.10.8 Legacy Remote SFENCE.VMA with ASID (EID = 0x07)
Parameters:
- a0: Hart mask pointer
- a1: Start address
- a2: Size
- a3: ASID
Note: Use RFENCE Extension instead.
D.10.9 Legacy System Shutdown (EID = 0x08)
Note: Use System Reset Extension (0x53525354) instead.
D.11 Extension ID Summary
| EID | Name | Description |
|---|---|---|
| 0x10 | BASE | Base extension (version, probe) |
| 0x54494D45 | TIME | Timer programming |
| 0x735049 | sPI | Inter-processor interrupts |
| 0x52464E43 | RFNC | Remote fence operations |
| 0x48534D | HSM | Hart state management |
| 0x53525354 | SRST | System reset |
| 0x504D55 | PMU | Performance monitoring |
| 0x4442434E | DBCN | Debug console |
| 0x53555350 | SUSP | System suspend |
| 0x43505043 | CPPC | Collaborative Processor Performance Control |
| 0x4E41434C | NACL | Nested Acceleration |
D.12 Common Usage Patterns
Early Boot Console Output
void early_printk(const char *str) {
while (*str) {
if (*str == '\n')
sbi_console_putchar('\r');
sbi_console_putchar(*str++);
}
}
Timer-based Scheduling
void setup_timer_interrupt(uint64_t interval_us) {
uint64_t current_time;
asm volatile("rdtime %0" : "=r"(current_time));
// Assuming 10 MHz timebase (100 ns per tick)
uint64_t ticks = interval_us * 10;
sbi_set_timer(current_time + ticks);
// Enable supervisor timer interrupt
csr_set(sie, SIE_STIE);
}
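The factor of 10 above hard-codes a 10 MHz timebase. On real platforms the frequency varies and is typically taken from the device tree (the timebase-frequency property), so a conversion helper that takes the frequency as a parameter is safer. The function name `us_to_ticks` below is hypothetical.

```c
#include <stdint.h>

/* Convert an interval in microseconds to timer ticks for a given timebase. */
static uint64_t us_to_ticks(uint64_t interval_us, uint64_t timebase_hz) {
    /* Handle whole seconds separately so that the multiplication below
     * stays small and does not overflow for long intervals. */
    uint64_t whole_seconds = interval_us / 1000000u;
    uint64_t remainder_us  = interval_us % 1000000u;
    return whole_seconds * timebase_hz +
           (remainder_us * timebase_hz) / 1000000u;
}
```

With a 10 MHz timebase, us_to_ticks(1000, 10000000) yields 10000 ticks, the same result as the multiply-by-10 shortcut.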
Multi-core Synchronization
void flush_tlb_all_harts(void) {
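// Flush the entire TLB on all harts
// hart_mask_base = -1 selects every available hart; size = ~0UL covers the whole address space
sbi_remote_sfence_vma(0, -1UL, 0, ~0UL);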
}
void wake_up_secondary_harts(void) {
for (int i = 1; i < num_harts; i++) {
sbi_hart_start(i, (unsigned long)secondary_start, 0);
}
}
System Shutdown
void system_poweroff(void) {
sbi_system_reset(0, 0); // SHUTDOWN, NO_REASON
while (1); // Should never reach here
}
void system_reboot(void) {
sbi_system_reset(1, 0); // COLD_REBOOT, NO_REASON
while (1);
}
D.13 References
- SBI Specification: https://github.com/riscv-non-isa/riscv-sbi-doc
- OpenSBI Documentation: https://github.com/riscv-software-src/opensbi
- Linux RISC-V SBI Implementation: arch/riscv/kernel/sbi.c
Appendix E. RISC-V vs ARM Instruction Comparison
Quick Reference for Porting Between RISC-V and ARM
💡 Usage Guide: This appendix is your “translator” for architecture porting. When you need to port ARM code to RISC-V (or vice versa), check the comparison tables here.
This appendix provides a side-by-side comparison of common instructions between RISC-V and ARM (ARMv8-A AArch64). This reference is designed to help developers porting code between the two architectures.
E.1 Arithmetic and Logical Instructions
Integer Arithmetic
| Operation | RISC-V | ARM |
|---|---|---|
| Add | add rd, rs1, rs2 | ADD Xd, Xn, Xm |
| Add immediate | addi rd, rs1, imm | ADD Xd, Xn, #imm |
| Subtract | sub rd, rs1, rs2 | SUB Xd, Xn, Xm |
| Subtract immediate | addi rd, rs1, -imm | SUB Xd, Xn, #imm |
| Negate | sub rd, x0, rs | NEG Xd, Xm |
| Multiply | mul rd, rs1, rs2 | MUL Xd, Xn, Xm |
| Multiply high (signed) | mulh rd, rs1, rs2 | SMULH Xd, Xn, Xm |
| Multiply high (unsigned) | mulhu rd, rs1, rs2 | UMULH Xd, Xn, Xm |
| Divide (signed) | div rd, rs1, rs2 | SDIV Xd, Xn, Xm |
| Divide (unsigned) | divu rd, rs1, rs2 | UDIV Xd, Xn, Xm |
| Remainder (signed) | rem rd, rs1, rs2 | No direct equivalent (use SDIV + MSUB) |
| Remainder (unsigned) | remu rd, rs1, rs2 | No direct equivalent (use UDIV + MSUB) |
Note: ARM has no remainder instruction. Compute the quotient first, then multiply-subtract: SDIV Xq, Xn, Xm followed by MSUB Xd, Xq, Xm, Xn (Xd = Xn - Xq * Xm).
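To see why the divide plus multiply-subtract pair yields the remainder, the same arithmetic can be written in C. This is only an illustration of the identity r = n - (n / m) * m with truncating division; the function name is hypothetical.

```c
#include <stdint.h>

/* Mirrors the AArch64 sequence:
 *   SDIV q, n, m      q = n / m (truncating toward zero)
 *   MSUB r, q, m, n   r = n - q * m
 * In C the divisor must be nonzero and the INT64_MIN / -1 case must be
 * avoided, because both are undefined behavior in the language. */
static int64_t rem_via_msub(int64_t n, int64_t m) {
    int64_t q = n / m;
    return n - q * m;
}
```

The result carries the sign of the dividend, which matches both the RISC-V rem instruction and the C % operator.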
Logical Operations
| Operation | RISC-V | ARM |
|---|---|---|
| AND | and rd, rs1, rs2 | AND Xd, Xn, Xm |
| AND immediate | andi rd, rs1, imm | AND Xd, Xn, #imm |
| OR | or rd, rs1, rs2 | ORR Xd, Xn, Xm |
| OR immediate | ori rd, rs1, imm | ORR Xd, Xn, #imm |
| XOR | xor rd, rs1, rs2 | EOR Xd, Xn, Xm |
| XOR immediate | xori rd, rs1, imm | EOR Xd, Xn, #imm |
| NOT | xori rd, rs, -1 | MVN Xd, Xm |
| AND NOT | andn rd, rs1, rs2 (Zbb) | BIC Xd, Xn, Xm |
| OR NOT | orn rd, rs1, rs2 (Zbb) | ORN Xd, Xn, Xm |
Shift Operations
| Operation | RISC-V | ARM |
|---|---|---|
| Shift left logical | sll rd, rs1, rs2 | LSL Xd, Xn, Xm |
| Shift left immediate | slli rd, rs1, shamt | LSL Xd, Xn, #imm |
| Shift right logical | srl rd, rs1, rs2 | LSR Xd, Xn, Xm |
| Shift right immediate | srli rd, rs1, shamt | LSR Xd, Xn, #imm |
| Shift right arithmetic | sra rd, rs1, rs2 | ASR Xd, Xn, Xm |
| Shift right arith imm | srai rd, rs1, shamt | ASR Xd, Xn, #imm |
| Rotate right | ror rd, rs1, rs2 (Zbb) | ROR Xd, Xn, Xm |
| Rotate right immediate | rori rd, rs1, shamt (Zbb) | ROR Xd, Xn, #imm |
E.2 Load and Store Instructions
Basic Loads
| Operation | RISC-V | ARM |
|---|---|---|
| Load byte (signed) | lb rd, offset(rs1) | LDRSB Xd, [Xn, #offset] |
| Load byte (unsigned) | lbu rd, offset(rs1) | LDRB Wd, [Xn, #offset] |
| Load halfword (signed) | lh rd, offset(rs1) | LDRSH Xd, [Xn, #offset] |
| Load halfword (unsigned) | lhu rd, offset(rs1) | LDRH Wd, [Xn, #offset] |
| Load word (signed) | lw rd, offset(rs1) | LDRSW Xd, [Xn, #offset] |
| Load word (unsigned) | lwu rd, offset(rs1) | LDR Wd, [Xn, #offset] |
| Load doubleword | ld rd, offset(rs1) | LDR Xd, [Xn, #offset] |
Basic Stores
| Operation | RISC-V | ARM |
|---|---|---|
| Store byte | sb rs2, offset(rs1) | STRB Wd, [Xn, #offset] |
| Store halfword | sh rs2, offset(rs1) | STRH Wd, [Xn, #offset] |
| Store word | sw rs2, offset(rs1) | STR Wd, [Xn, #offset] |
| Store doubleword | sd rs2, offset(rs1) | STR Xd, [Xn, #offset] |
Addressing Modes
RISC-V: Only base+offset
lw t0, 8(sp) # Load from sp + 8
ARM: Multiple modes
LDR X0, [SP, #8] // Base + offset
LDR X0, [SP, #8]! // Pre-indexed (update SP)
LDR X0, [SP], #8 // Post-indexed (update SP after)
LDR X0, [SP, X1] // Base + register
LDR X0, [SP, X1, LSL #3] // Base + shifted register
Porting Note: RISC-V requires separate add/sub for pre/post-indexed addressing:
# ARM: LDR X0, [SP], #8
# RISC-V equivalent:
ld t0, 0(sp)
addi sp, sp, 8
E.3 Branch and Jump Instructions
Conditional Branches
| Operation | RISC-V | ARM |
|---|---|---|
| Branch if equal | beq rs1, rs2, label | CMP Xn, Xm + B.EQ label |
| Branch if not equal | bne rs1, rs2, label | CMP Xn, Xm + B.NE label |
| Branch if less than | blt rs1, rs2, label | CMP Xn, Xm + B.LT label |
| Branch if >= (signed) | bge rs1, rs2, label | CMP Xn, Xm + B.GE label |
| Branch if < (unsigned) | bltu rs1, rs2, label | CMP Xn, Xm + B.LO label |
| Branch if >= (unsigned) | bgeu rs1, rs2, label | CMP Xn, Xm + B.HS label |
Key Difference: RISC-V compares and branches in one instruction. ARM requires separate compare.
Unconditional Jumps
| Operation | RISC-V | ARM |
|---|---|---|
| Jump | jal x0, label or j label | B label |
| Jump and link | jal ra, label | BL label |
| Jump register | jalr x0, 0(rs1) or jr rs1 | BR Xn |
| Jump and link register | jalr ra, 0(rs1) | BLR Xn |
| Return | jalr x0, 0(ra) or ret | RET |
E.4 Compare and Set Instructions
Comparisons
| Operation | RISC-V | ARM |
|---|---|---|
| Set if less than | slt rd, rs1, rs2 | CMP Xn, Xm + CSET Xd, LT |
| Set if less (unsigned) | sltu rd, rs1, rs2 | CMP Xn, Xm + CSET Xd, LO |
| Set if less than imm | slti rd, rs1, imm | CMP Xn, #imm + CSET Xd, LT |
| Set if less imm (uns) | sltiu rd, rs1, imm | CMP Xn, #imm + CSET Xd, LO |
ARM Condition Codes:
| RISC-V | ARM Condition |
|---|---|
| beq | B.EQ (equal) |
| bne | B.NE (not equal) |
| blt | B.LT (less than, signed) |
| bge | B.GE (greater or equal, signed) |
| bltu | B.LO (lower, unsigned) |
| bgeu | B.HS (higher or same, unsigned) |
E.5 Atomic Instructions
Load-Reserved / Store-Conditional
| Operation | RISC-V | ARM |
|---|---|---|
| Load-reserved word | lr.w rd, (rs1) | LDXR Wd, [Xn] |
| Load-reserved dword | lr.d rd, (rs1) | LDXR Xd, [Xn] |
| Store-conditional word | sc.w rd, rs2, (rs1) | STXR Ws, Wd, [Xn] |
| Store-conditional dword | sc.d rd, rs2, (rs1) | STXR Ws, Xd, [Xn] |
Example: Atomic Increment
RISC-V:
retry:
lr.w t0, (a0)
addi t0, t0, 1
sc.w t1, t0, (a0)
bnez t1, retry
ARM:
retry:
LDXR W0, [X1]
ADD W0, W0, #1
STXR W2, W0, [X1]
CBNZ W2, retry
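In portable C the same retry pattern is usually written with C11 atomics and a weak compare-exchange loop. Compilers commonly lower such a loop to an LR/SC or LDXR/STXR sequence, or to a single atomic instruction where one is available, but that mapping is up to the toolchain. The sketch below uses a hypothetical function name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment *counter and return the previous value. */
static int32_t atomic_increment(_Atomic int32_t *counter) {
    int32_t old = atomic_load_explicit(counter, memory_order_relaxed);
    /* The weak form may fail spuriously, much like sc.w or STXR, so it is
     * retried in a loop. On failure, old is refreshed with the current value. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* retry with the updated old value */
    }
    return old;
}
```

Relaxed ordering is enough for a plain counter; code that uses the counter to publish other data would need acquire/release ordering instead.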
Atomic Memory Operations (AMO)
| Operation | RISC-V (A extension) | ARM (LSE atomics, ARMv8.1-A and later) |
|---|---|---|
| Atomic swap | amoswap.w rd, rs2, (rs1) | SWP Wd, Wm, [Xn] |
| Atomic add | amoadd.w rd, rs2, (rs1) | LDADD Ws, Wt, [Xn] |
| Atomic AND | amoand.w rd, rs2, (rs1) | LDCLR Ws, Wt, [Xn] (clears the bits set in Ws, so pass the inverted mask) |
| Atomic OR | amoor.w rd, rs2, (rs1) | LDSET Ws, Wt, [Xn] |
| Atomic XOR | amoxor.w rd, rs2, (rs1) | LDEOR Ws, Wt, [Xn] |
| Atomic max (signed) | amomax.w rd, rs2, (rs1) | LDSMAX Ws, Wt, [Xn] |
| Atomic max (unsigned) | amomaxu.w rd, rs2, (rs1) | LDUMAX Ws, Wt, [Xn] |
| Atomic min (signed) | amomin.w rd, rs2, (rs1) | LDSMIN Ws, Wt, [Xn] |
| Atomic min (unsigned) | amominu.w rd, rs2, (rs1) | LDUMIN Ws, Wt, [Xn] |
Ordering Annotations:
- RISC-V: .aq (acquire), .rl (release), .aqrl (both)
- ARM: LDADD vs LDADDA vs LDADDL vs LDADDAL
E.6 Memory Barriers
| Operation | RISC-V | ARM |
|---|---|---|
| Full fence | fence rw, rw | DMB SY |
| Read fence | fence r, r | DMB LD |
| Write fence | fence w, w | DMB ST |
| Acquire fence | fence r, rw | DMB LD |
| Release fence | fence rw, w | DMB SY (no exact match; DMB ST orders only stores against stores) |
| Instruction fence | fence.i | ISB |
| TLB fence | sfence.vma | TLBI + DSB + ISB |
RISC-V FENCE Format: fence pred, succ
- pred: Predecessor operations (r=read, w=write, rw=both)
- succ: Successor operations (r=read, w=write, rw=both)
ARM Barrier Types:
- SY: Full system
- ST: Store only
- LD: Load only
- ISH: Inner shareable
- OSH: Outer shareable
E.7 System Instructions
CSR / System Register Access
| Operation | RISC-V | ARM |
|---|---|---|
| Read CSR/sysreg | csrr rd, csr | MRS Xd, sysreg |
| Write CSR/sysreg | csrw csr, rs | MSR sysreg, Xn |
| Atomic swap (read old value, write new) | csrrw rd, csr, rs | MRS + MSR |
| Set bits | csrrs rd, csr, rs | MRS + ORR + MSR |
| Clear bits | csrrc rd, csr, rs | MRS + BIC + MSR |
Exception and Privilege
| Operation | RISC-V | ARM |
|---|---|---|
| System call | ecall | SVC #imm |
| Breakpoint | ebreak | BRK #imm |
| Return from exception | mret / sret | ERET |
| Wait for interrupt | wfi | WFI |
| Supervisor call | ecall (from U-mode) | SVC #imm |
| Hypervisor call | ecall (from VS-mode) | HVC #imm |
E.8 Bit Manipulation (Zbb vs ARM)
| Operation | RISC-V (Zbb) | ARM |
|---|---|---|
| Count leading zeros | clz rd, rs | CLZ Xd, Xn |
| Count trailing zeros | ctz rd, rs | No direct (use RBIT + CLZ) |
| Count population | cpop rd, rs | No direct (use CNT in NEON) |
| Byte reverse | rev8 rd, rs | REV Xd, Xn |
| Sign-extend byte | sext.b rd, rs | SXTB Xd, Wn |
| Sign-extend halfword | sext.h rd, rs | SXTH Xd, Wn |
| Zero-extend halfword | zext.h rd, rs | UXTH Wd, Wn |
| Min (signed) | min rd, rs1, rs2 | No direct (use CMP + CSEL) |
| Max (signed) | max rd, rs1, rs2 | No direct (use CMP + CSEL) |
| Rotate right | ror rd, rs1, rs2 | ROR Xd, Xn, Xm |
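The RBIT + CLZ substitution for count-trailing-zeros works because reversing the bits turns trailing zeros into leading zeros. A portable C sketch of that identity follows; the loops stand in for the single instructions, and all three function names are hypothetical.

```c
#include <stdint.h>

/* Reverse the bit order of a 64-bit value (the effect of AArch64 RBIT). */
static uint64_t bit_reverse64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Count leading zeros; returns 64 for an input of 0, like CLZ. */
static unsigned clz64(uint64_t x) {
    unsigned n = 0;
    for (uint64_t bit = (uint64_t)1 << 63; bit != 0 && !(x & bit); bit >>= 1)
        n++;
    return n;
}

/* Trailing zeros of x = leading zeros of the bit-reversed x. */
static unsigned ctz64_via_rbit(uint64_t x) {
    return clz64(bit_reverse64(x));
}
```

An input of 0 yields 64, which is also what the RISC-V ctz instruction returns on RV64.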
E.9 Calling Convention (ABI)
Register Usage
| Purpose | RISC-V | ARM |
|---|---|---|
| Arguments | a0-a7 (x10-x17) | X0-X7 |
| Return value | a0-a1 (x10-x11) | X0-X1 |
| Saved registers | s0-s11 (x8-x9, x18-x27) | X19-X28 |
| Temporary registers | t0-t6 (x5-x7, x28-x31) | X9-X15 |
| Stack pointer | sp (x2) | SP |
| Frame pointer | fp/s0 (x8) | X29 (FP) |
| Return address | ra (x1) | X30 (LR) |
| Zero register | x0 (zero) | XZR (register encoding 31, shared with SP) |
Function Prologue/Epilogue
RISC-V:
function:
addi sp, sp, -16
sd ra, 8(sp)
sd s0, 0(sp)
# ... function body ...
ld s0, 0(sp)
ld ra, 8(sp)
addi sp, sp, 16
ret
ARM:
function:
STP X29, X30, [SP, #-16]!
MOV X29, SP
# ... function body ...
LDP X29, X30, [SP], #16
RET
Key Differences:
- ARM has STP/LDP (store/load pair) for efficient stack operations
- RISC-V uses separate sd/ld instructions
- ARM uses X30 (LR) for return address; RISC-V uses x1 (ra)
E.10 Common Code Patterns
Loop Example
RISC-V:
li t0, 0 # i = 0
li t1, 10 # limit = 10
loop:
# ... loop body ...
addi t0, t0, 1 # i++
blt t0, t1, loop # if (i < 10) goto loop
ARM:
MOV X0, #0 // i = 0
MOV X1, #10 // limit = 10
loop:
// ... loop body ...
ADD X0, X0, #1 // i++
CMP X0, X1
B.LT loop // if (i < 10) goto loop
Switch Statement
RISC-V:
# Assume a0 = switch value
li t0, 3
bgtu a0, t0, default
slli t0, a0, 3 # t0 = a0 * 8 (8-byte table entries on RV64)
la t1, jump_table
add t0, t0, t1
ld t0, 0(t0) # Load the full 64-bit target address
jr t0
jump_table:
.dword case0
.dword case1
.dword case2
.dword case3
ARM:
# Assume X0 = switch value
CMP X0, #3
B.HI default
ADR X1, jump_table
LDR X2, [X1, X0, LSL #3]
BR X2
jump_table:
.quad case0
.quad case1
.quad case2
.quad case3
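Both listings implement the same idea: bounds-check the value with an unsigned compare, then use it as an index into a table of code addresses. In C that corresponds to an array of function pointers. The sketch below uses made-up handlers that return distinct numbers so the dispatch can be observed.

```c
/* One handler per case; the return values are arbitrary markers. */
static int case0(void) { return 10; }
static int case1(void) { return 11; }
static int case2(void) { return 12; }
static int case3(void) { return 13; }
static int default_case(void) { return -1; }

static int dispatch(unsigned long value) {
    static int (*const jump_table[])(void) = { case0, case1, case2, case3 };
    /* Unsigned bounds check, the counterpart of bgtu / B.HI. */
    if (value > 3)
        return default_case();
    /* Indexed load of the target address followed by an indirect call. */
    return jump_table[value]();
}
```

Compilers may generate a similar table for a dense switch statement, although they are free to choose a compare chain instead.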
E.11 Porting Checklist
Syntax Differences
| Aspect | RISC-V | ARM |
|---|---|---|
| Register prefix | x, f, a, t, s | X, W, V, Q |
| Immediate prefix | None | # |
| Memory syntax | offset(base) | [base, #offset] |
| Comment | # | // or ; |
| Directive prefix | . | . |
Common Pitfalls
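- Zero Register: RISC-V x0 vs ARM XZR (XZR shares register encoding 31 with SP, so it is not available in every operand position)
- Stack Pointer: RISC-V sp is x2, an ordinary general-purpose register; ARM SP is a separate register
- Return Address: RISC-V stores in ra (x1); ARM uses LR (X30)
- Addressing Modes: ARM has more complex modes (pre/post-indexed, register offset); RISC-V has only base + 12-bit immediate
- Conditional Execution: AArch64 has conditional select and compare instructions (CSEL, CCMP); base RISC-V uses branches
- Remainder: RISC-V has rem/remu; ARM requires division + multiply-subtract (SDIV/UDIV + MSUB)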
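The remainder pitfall is easiest to see as arithmetic. A minimal sketch of the divide plus multiply-subtract sequence, assuming a nonzero divisor and excluding the `INT64_MIN / -1` overflow case (the function name is hypothetical):

```c
#include <stdint.h>

/* Remainder without a remainder instruction: compute the quotient
   first (SDIV on AArch64), then a - q * b in one multiply-subtract
   (MSUB). Assumes b != 0 and no INT64_MIN / -1 overflow. */
int64_t rem_via_msub(int64_t a, int64_t b)
{
    int64_t q = a / b;    /* truncates toward zero */
    return a - q * b;     /* sign follows the dividend */
}
```

For these inputs the result matches C's `%` operator, which also truncates toward zero.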
Performance Considerations
- Code Density: RISC-V compressed (C extension) mixes 16- and 32-bit instructions; AArch64 uses fixed 32-bit instructions (Thumb-2 exists only on 32-bit ARM)
- Instruction Fusion: Both support micro-op fusion (implementation-dependent)
- Branch Prediction: Similar capabilities (implementation-dependent)
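- Memory Ordering: RISC-V RVWMO and the ARMv8 model are both weak; neither orders ordinary accesses to different addresses without fences/barriers or acquire/release operations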
E.12 References
- RISC-V ISA Manual: https://riscv.org/technical/specifications/
- ARM Architecture Reference Manual: ARMv8-A
- RISC-V ABI Specification: https://github.com/riscv-non-isa/riscv-elf-psabi-doc
- ARM Procedure Call Standard: AAPCS64
Appendix F. Memory Model Quick Reference
RISC-V Weak Memory Ordering (RVWMO) Quick Reference
💡 Usage Guide: This appendix is your “safety manual” for multi-core synchronization. When you encounter mysterious bugs in lock-free code, check Memory Ordering here first.
🔄 Producer-Consumer Synchronization Pattern (Copy-Paste Ready)
This is the classic multi-core synchronization pattern: it guarantees that the Consumer sees the complete data written by the Producer.
Producer (Core 0) - Write Side
# Producer: Write data, then set Flag
# s0 = data address, s1 = Flag address, t0 = data, t1 = Flag value
sw t0, 0(s0) # 1. Write Data
fence w, w # 2. Store-Store Fence: Ensure data written first
sw t1, 0(s1) # 3. Write Flag (Ready = 1)
Explanation: fence w,w ensures “data write” is visible to other cores before “Flag write”.
Consumer (Core 1) - Read Side
# Consumer: Wait for Flag, then read data
# s0 = data address, s1 = Flag address
wait_flag:
lw t1, 0(s1) # 1. Read Flag
beqz t1, wait_flag # Wait for Flag to become Ready
fence r, r # 2. Load-Load Fence: Ensure Flag seen before reading Data
lw t0, 0(s0) # 3. Read Data
Explanation: fence r,r ensures “Flag read” completes before “Data read”.
Complete C Example
// Shared variables
volatile int data = 0;
volatile int flag = 0;
// Producer (Core 0)
void producer(void) {
data = 42; // Write data
asm volatile ("fence w, w" ::: "memory"); // Store-Store Fence
flag = 1; // Set Flag
}
// Consumer (Core 1)
int consumer(void) {
while (flag == 0) { } // Wait for Flag
asm volatile ("fence r, r" ::: "memory"); // Load-Load Fence
return data; // Read data (guaranteed to be 42)
}
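The same pattern can be written portably with C11 atomics, leaving the choice of fence to the compiler. This is a sketch rather than a drop-in replacement: `payload`, `ready`, `publish`, and `try_consume` are hypothetical names. On RISC-V, compilers typically implement the release store and acquire load below with `fence rw, w` and `fence r, rw` around ordinary accesses.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* plain data, published through `ready` */
static atomic_int ready;     /* 0 = empty, 1 = payload is valid */

void publish(int value)
{
    payload = value;
    /* Release store: the payload write cannot become visible after it. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

bool try_consume(int *out)
{
    /* Acquire load: the payload read cannot be performed before it. */
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return false;
    *out = payload;
    return true;
}
```

A consumer that has to block would call `try_consume` in a loop, which corresponds to the `wait_flag` loop in the assembly version.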
📋 FENCE Usage Quick Reference
| Scenario | FENCE Type | Description |
|---|---|---|
| Publish Data | fence w, w | Ensure data visible before Flag |
| Consume Data | fence r, r | Ensure Flag read before data |
| Release Lock | fence rw, w | Ensure Critical Section ops complete before Unlock |
| Acquire Lock | fence r, rw | Ensure ops after Lock don’t execute early |
| Full Barrier | fence rw, rw | Strongest Fence, no ops can cross |
| Self-Modifying Code | fence.i | After modifying instructions, synchronize instruction fetch on the executing hart |
🔐 Spinlock Example (Using Atomics)
# acquire_lock: Use amoswap.w.aq to acquire lock
# a0 = lock address, t0 = 1 (locked), t1 = result
acquire_lock:
li t0, 1
retry:
amoswap.w.aq t1, t0, (a0) # Atomic swap with Acquire
bnez t1, retry # If was 1 (locked), retry
ret # Successfully acquired lock
# release_lock: Use amoswap.w.rl to release lock
# a0 = lock address
release_lock:
amoswap.w.rl zero, zero, (a0) # Atomic write 0 with Release
ret
Explanation:
- .aq (Acquire): Subsequent ops won’t be moved before Lock
- .rl (Release): Previous ops won’t be moved after Unlock
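For C code, the same lock can be sketched with C11 atomics. An exchange with acquire ordering plays the role of `amoswap.w.aq`, and a release store plays the role of the unlocking write. The type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } demo_spinlock;   /* 0 = free, 1 = held */

bool spin_trylock(demo_spinlock *l)
{
    /* Swap in 1 with acquire ordering; the lock is ours if it held 0. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

void spin_lock(demo_spinlock *l)
{
    while (!spin_trylock(l)) {
        /* spin until the holder releases the lock */
    }
}

void spin_unlock(demo_spinlock *l)
{
    /* Release store: critical-section accesses stay before the unlock. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```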
⚠️ Common Pitfalls
Pitfall 1: Thinking volatile Is Enough
Misconception: C’s volatile guarantees Memory Ordering.
Truth: volatile only stops the compiler from removing or reordering the volatile accesses themselves; it does not constrain the order in which the CPU makes them visible to other harts.
// ❌ Wrong: Only volatile, may read stale data on multi-core
volatile int data = 0;
volatile int flag = 0;
// Producer
data = 42;
flag = 1; // CPU may reorder so flag is visible first!
// ✅ Correct: Add fence
data = 42;
asm volatile ("fence w, w" ::: "memory");
flag = 1;
Pitfall 2: Fence in Wrong Position
Symptom: Spinlock looks correct, but still has Race Condition.
// ❌ Wrong: fence after unlock
critical_section();
unlock();
asm volatile ("fence rw, w" ::: "memory"); // Too late!
// ✅ Correct: fence before unlock (or use .rl)
critical_section();
asm volatile ("fence rw, w" ::: "memory");
unlock();
Pitfall 3: Forgetting fence.i
Symptom: JIT or Self-modifying code executes old instructions.
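Cause: After the instructions are modified, the I-Cache (or the already-fetched instruction stream) may still hold the old content. Note that fence.i synchronizes only the hart that executes it; other harts need their own fence.i or an OS service.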
// ❌ Wrong: No fence.i after code modification
memcpy(code_buffer, new_code, size);
((void (*)(void))code_buffer)(); // May execute old instructions!
// ✅ Correct: fence.i after modification
memcpy(code_buffer, new_code, size);
asm volatile ("fence.i" ::: "memory");
((void (*)(void))code_buffer)(); // Now executes new instructions
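Because `fence.i` covers only the executing hart, hosted code often delegates the job to the toolchain instead. The sketch below wraps the GCC/Clang builtin `__builtin___clear_cache`; the wrapper name `sync_code` is hypothetical, and what the builtin expands to (an inline instruction, a library call, or nothing) depends on the target and OS.

```c
#include <stddef.h>

/* Hypothetical helper: make freshly written instructions in
   [buf, buf + len) safe to execute. __builtin___clear_cache is a
   GCC/Clang builtin whose expansion is target- and OS-specific. */
void sync_code(void *buf, size_t len)
{
    char *begin = (char *)buf;
    __builtin___clear_cache(begin, begin + len);
}
```

In the JIT example above, the call would sit between the `memcpy` and the indirect call, in place of the inline `fence.i`.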
This appendix provides a quick reference for RISC-V’s memory model (RVWMO). Understanding memory ordering is essential for writing correct concurrent code on RISC-V.
F.1 Memory Ordering Basics
What Can Be Reordered?
RISC-V Weak Memory Ordering (RVWMO) allows extensive reordering:
| Reordering | Allowed? | Exception |
|---|---|---|
| Load → Load | ✓ Yes | Same address, or FENCE |
| Load → Store | ✓ Yes | Same address, or FENCE |
| Store → Store | ✓ Yes | Same address, or FENCE |
| Store → Load | ✓ Yes | Same address, or FENCE |
Key Point: Almost everything can be reordered unless:
- Operations access the same address (overlapping)
- Operations are separated by a FENCE instruction
- Operations have data/control dependencies
- Operations use acquire/release atomics
Preserved Program Order (PPO)
Preserved Program Order is the subset of program order that MUST be respected:
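- Overlapping addresses: SW to X, then LW from X → the load always observes that store (or a newer one)
- Explicit fences: Operations separated by FENCE → always in order (for the access types the fence names)
- Acquire/Release: Atomic operations with .aq or .rl → enforce ordering
- Dependencies: Address and data dependencies (e.g., LW then use the result as an address or as store data) → always in order
- Control dependencies: Branch then dependent operation → only a later store is ordered after the load feeding the branch; a later load is not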
F.2 FENCE Instruction Reference
FENCE Syntax
fence pred, succ
- pred (predecessor): Operations before fence (r, w, or rw; i and o additionally select device input/output)
- succ (successor): Operations after fence (r, w, or rw; i and o as above)
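The two operands are simply bit sets in the instruction word. As a sketch (the helper and enum names are hypothetical), a plain FENCE with the fence-mode field left at zero can be assembled by hand like this: the predecessor set sits in bits 27:24, the successor set in bits 23:20, and the opcode is MISC-MEM (0x0F).

```c
#include <stdint.h>

/* Ordering-set bits as they appear inside the pred and succ nibbles. */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };

/* Build a FENCE with fm = 0000, rd = rs1 = x0:
   pred in bits 27:24, succ in bits 23:20, opcode 0x0F. */
uint32_t encode_fence(unsigned pred, unsigned succ)
{
    return ((uint32_t)(pred & 0xFu) << 24) |
           ((uint32_t)(succ & 0xFu) << 20) |
           0x0Fu;
}
```

For example, `encode_fence(FENCE_R | FENCE_W, FENCE_R | FENCE_W)` yields `0x0330000F`, the encoding of `fence rw, rw`.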
Common FENCE Variants
| FENCE | Meaning | Use Case |
|---|---|---|
| fence rw, rw | Full fence | Strongest barrier, orders all memory loads and stores |
| fence w, w | Store-store fence | Ensure stores visible in order |
| fence r, r | Load-load fence | Ensure loads happen in order |
| fence r, rw | Acquire fence | After acquiring lock |
| fence rw, w | Release fence | Before releasing lock |
| fence.i | Instruction fence | After code modification (JIT, self-modifying code) |
| fence.tso | TSO fence | Orders everything except store → load (x86-style TSO ordering) |
FENCE Examples
Full Fence (strongest):
sw a0, 0(s0) # Store 1
fence rw, rw # Full fence
lw t0, 0(s1) # Load 1
All operations before fence complete before any operation after fence.
Store-Store Fence (publish pattern):
sw a0, 0(s0) # Write data
fence w, w # Ensure data written first
sw a1, 0(s1) # Write flag
Ensures stores become visible in order.
Load-Load Fence (consume pattern):
lw t0, 0(s1) # Read flag
fence r, r # Ensure flag read first
lw t1, 0(s0) # Read data
Ensures loads happen in order.
Acquire Fence (after lock acquisition):
lr.w.aq t0, (a0) # Acquire lock (with .aq)
# OR
lw t0, 0(a0) # Read lock
fence r, rw # Acquire fence
# ... critical section ...
Prevents operations in critical section from moving before lock acquisition.
Release Fence (before lock release):
# ... critical section ...
fence rw, w # Release fence
sw zero, 0(a0) # Release lock
Prevents operations in critical section from moving after lock release.
F.3 Atomic Instructions
Load-Reserved / Store-Conditional (LR/SC)
Syntax:
lr.w rd, (rs1) # Load-reserved word
lr.d rd, (rs1) # Load-reserved doubleword
sc.w rd, rs2, (rs1) # Store-conditional word
sc.d rd, rs2, (rs1) # Store-conditional doubleword
Ordering Annotations:
- .aq (acquire): No later operations can move before this
- .rl (release): No earlier operations can move after this
- .aqrl (both): Full ordering
Example: Atomic Increment
retry:
lr.w t0, (a0) # Load current value
addi t0, t0, 1 # Increment
sc.w t1, t0, (a0) # Try to store
bnez t1, retry # Retry if failed (t1 != 0)
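In C, the usual portable counterpart of this retry loop is a compare-exchange loop, which compilers for RISC-V commonly lower to an LR/SC sequence. A minimal sketch with a hypothetical function name:

```c
#include <stdatomic.h>

/* Returns the value the counter held before the increment. */
int atomic_increment(atomic_int *counter)
{
    int old = atomic_load_explicit(counter, memory_order_relaxed);
    /* On failure, `old` is refreshed with the current value, so the
       loop just retries, like branching back after a failed sc.w. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* retry */
    }
    return old;
}
```

The weak form is allowed to fail spuriously, which matches the way a store-conditional may fail even without a conflicting write.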
Example: Spinlock Acquire
acquire_lock:
lr.w.aq t0, (a0) # Load-reserved with acquire
bnez t0, acquire_lock # If locked, retry
li t1, 1
sc.w t2, t1, (a0) # Try to acquire (ordering comes from lr.w.aq; .aq alone on SC adds none)
bnez t2, acquire_lock # Retry if failed
Atomic Memory Operations (AMO)
Syntax:
amoswap.w rd, rs2, (rs1) # Atomic swap
amoadd.w rd, rs2, (rs1) # Atomic add
amoand.w rd, rs2, (rs1) # Atomic AND
amoor.w rd, rs2, (rs1) # Atomic OR
amoxor.w rd, rs2, (rs1) # Atomic XOR
amomax.w rd, rs2, (rs1) # Atomic max (signed)
amomaxu.w rd, rs2, (rs1) # Atomic max (unsigned)
amomin.w rd, rs2, (rs1) # Atomic min (signed)
amominu.w rd, rs2, (rs1) # Atomic min (unsigned)
Ordering Annotations: Same as LR/SC (.aq, .rl, .aqrl)
Example: Spinlock with AMOSWAP
acquire_lock:
li t0, 1
amoswap.w.aq t1, t0, (a0) # Swap 1 into lock, get old value
bnez t1, acquire_lock # If old value != 0, retry
release_lock:
amoswap.w.rl zero, zero, (a0) # Swap 0 into lock (release)
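Several of the AMO mnemonics listed under Syntax above have close C11 counterparts, each returning the old memory value just as `rd` does. The wrappers below are a sketch with hypothetical names; whether a compiler really emits a single AMO for each depends on the target and its enabled extensions. C11 has no direct counterpart for the min/max forms, which usually fall back to a compare-exchange loop.

```c
#include <stdatomic.h>

/* Each wrapper returns the value that was in memory before the update. */
unsigned amo_swap(atomic_uint *p, unsigned v)   /* like amoswap.w */
{
    return atomic_exchange_explicit(p, v, memory_order_relaxed);
}

unsigned amo_add(atomic_uint *p, unsigned v)    /* like amoadd.w */
{
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}

unsigned amo_and(atomic_uint *p, unsigned v)    /* like amoand.w */
{
    return atomic_fetch_and_explicit(p, v, memory_order_relaxed);
}

unsigned amo_or(atomic_uint *p, unsigned v)     /* like amoor.w */
{
    return atomic_fetch_or_explicit(p, v, memory_order_relaxed);
}

unsigned amo_xor(atomic_uint *p, unsigned v)    /* like amoxor.w */
{
    return atomic_fetch_xor_explicit(p, v, memory_order_relaxed);
}
```

Replacing `memory_order_relaxed` with acquire, release, or `memory_order_acq_rel` corresponds to the `.aq`, `.rl`, and `.aqrl` annotations.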
F.4 Common Synchronization Patterns
Pattern 1: Message Passing
Problem: Producer writes data, then sets flag. Consumer waits for flag, then reads data.
Solution:
# Producer (Hart 0)
sw a0, 0(s0) # Write data
fence w, w # Ensure data written before flag
sw a1, 0(s1) # Write flag = 1
# Consumer (Hart 1)
loop:
lw t0, 0(s1) # Read flag
beqz t0, loop # Wait for flag
fence r, r # Ensure flag read before data
lw t1, 0(s0) # Read data
Why fences are needed:
- Without fence w, w: Flag might be visible before data
- Without fence r, r: Data might be read before flag is checked
Pattern 2: Spinlock (LR/SC)
Acquire:
acquire_lock:
lr.w.aq t0, (a0) # Load-reserved with acquire
bnez t0, acquire_lock # If locked, retry
li t1, 1
sc.w t2, t1, (a0) # Try to set lock (ordering comes from lr.w.aq)
bnez t2, acquire_lock # Retry if SC failed
# Lock acquired, critical section follows
Release:
# Critical section
amoswap.w.rl zero, zero, (a0) # Release lock
# OR
fence rw, w
sw zero, 0(a0)
Pattern 3: Spinlock (AMOSWAP)
Acquire:
acquire_lock:
li t0, 1
amoswap.w.aq t1, t0, (a0) # Atomic swap
bnez t1, acquire_lock # If old value != 0, retry
# Lock acquired
Release:
amoswap.w.rl zero, zero, (a0) # Release lock
Pattern 4: Dekker’s Algorithm (Mutual Exclusion)
Hart 0:
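li t0, 1
sw t0, flag0, t2 # flag0 = 1 (t2 = scratch register for the address)
fence w, rw # Ensure flag0 visible before reading flag1
lw t1, flag1 # Read flag1
bnez t1, wait # If flag1 set, take the back-off path (not shown)
fence r, rw # Keep critical-section accesses after the flag1 check
# Critical section
fence rw, w # Ensure critical section done
sw zero, flag0, t2 # flag0 = 0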
Hart 1: (symmetric, swap flag0 and flag1)
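The listing shows only the flag handshake; the full Dekker algorithm adds a turn variable to decide who retries first. A C11 sketch of the handshake alone, with hypothetical names, makes the key requirement explicit: the store to one's own flag must be ordered before the load of the other flag, which takes a full fence.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flag[2];   /* flag[i] = 1: hart i wants to enter */

bool try_enter(int me)
{
    int other = 1 - me;
    atomic_store_explicit(&flag[me], 1, memory_order_relaxed);
    /* Full fence: order the store above before the load below
       (store -> load is not ordered by weaker fences). */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&flag[other], memory_order_relaxed) != 0) {
        /* Contended: withdraw so the other side can make progress. */
        atomic_store_explicit(&flag[me], 0, memory_order_release);
        return false;
    }
    /* Pairs with the release store in leave(): critical-section
       accesses stay after the flag check. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}

void leave(int me)
{
    atomic_store_explicit(&flag[me], 0, memory_order_release);
}
```

If both harts call `try_enter` at the same moment, both may back off; that liveness gap is exactly what the turn variable in the complete algorithm closes.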
Pattern 5: Producer-Consumer Queue
Producer:
# Write data to queue[tail] (index scaling and wraparound omitted)
sw a0, 0(s0)
fence w, w # Ensure data written before tail update
# Increment tail and publish it
addi s1, s1, 1
sw s1, tail_ptr, t2 # t2 = scratch register for the address
Consumer:
# Read tail and head
lw t0, tail_ptr
lw t1, head_ptr
beq t0, t1, empty # If tail == head, queue empty
# Read data from queue[head]
fence r, r # Ensure tail read before data
lw a0, 0(s2)
# Increment head and publish it
fence r, w # Ensure data read before the slot is handed back
addi s2, s2, 1
sw s2, head_ptr, t2
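A complete single-producer, single-consumer ring buffer also needs index wraparound and a full check, which the assembly outline leaves out. The C11 sketch below fills those in under stated assumptions: exactly one producer thread and one consumer thread, a power-of-two capacity (`QUEUE_SLOTS`, chosen arbitrarily here), and hypothetical function names.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QUEUE_SLOTS 4u        /* hypothetical capacity, power of two */

static int slots[QUEUE_SLOTS];
static atomic_uint head;      /* next index to read; written by consumer */
static atomic_uint tail;      /* next index to write; written by producer */

bool queue_push(int value)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t - h == QUEUE_SLOTS)
        return false;                       /* full */
    slots[t % QUEUE_SLOTS] = value;
    /* Release: the slot write is visible before the new tail. */
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}

bool queue_pop(int *out)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h == t)
        return false;                       /* empty */
    *out = slots[h % QUEUE_SLOTS];
    /* Release: the slot read finishes before the producer may reuse it. */
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}
```

The indices grow without bound and are reduced modulo the capacity only when indexing, so `tail - head` is always the number of queued items.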
Pattern 6: Barrier (N threads)
Barrier Wait:
barrier_wait:
# Single-use barrier: counter starts at 0 and is not reset
# (resetting it here would race with harts still spinning on counter == N)
li t0, 1
li t2, N # N = number of threads
# Increment counter atomically
amoadd.w.aqrl t1, t0, (a0) # counter++, get old value
addi t1, t1, 1 # t1 = new counter value
# Check if all threads arrived
beq t1, t2, done # Last arrival: nobody left to wait for
spin:
lw t3, 0(a0)
bne t3, t2, spin # Wait until counter == N
fence r, rw # Acquire: later accesses stay after the wait
done:
ret
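A barrier that is reused across rounds has to cope with a fast thread re-entering before slower ones have left the previous round. One common answer is a sense-reversing barrier, sketched here in C11 with hypothetical names: waiters spin on a flag that flips once per round rather than on the counter itself.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint count;      /* arrivals in the current round */
    atomic_uint sense;      /* flips each time a round completes */
    unsigned    nthreads;   /* number of participating threads */
} barrier_sketch;

void barrier_wait_sketch(barrier_sketch *b)
{
    unsigned my_sense = atomic_load_explicit(&b->sense, memory_order_relaxed);
    unsigned arrived =
        atomic_fetch_add_explicit(&b->count, 1u, memory_order_acq_rel) + 1u;
    if (arrived == b->nthreads) {
        /* Last arrival: reset the counter, then release everyone. */
        atomic_store_explicit(&b->count, 0u, memory_order_relaxed);
        atomic_store_explicit(&b->sense, my_sense ^ 1u, memory_order_release);
        return;
    }
    while (atomic_load_explicit(&b->sense, memory_order_acquire) == my_sense) {
        /* spin until this round's sense flips */
    }
}
```

Because the counter is reset before the sense flips, a thread that races ahead into the next round always starts counting from zero.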
F.5 Memory Model Comparison
RVWMO vs Other Models
| Model | Strength | Reordering Allowed | Fence Overhead |
|---|---|---|---|
| Sequential Consistency | Strongest | None | N/A (no reordering) |
| x86 TSO | Strong | Store→Load only | Low (implicit) |
| ARM | Weak | Extensive | Medium |
| RISC-V RVWMO | Weak | Extensive | Medium |
| RISC-V RVTSO | Strong | Store→Load only | Low |
Ordering Guarantees
| Operation Pair | RISC-V RVWMO | x86 TSO | ARM | SC |
|---|---|---|---|---|
| Load → Load | ✗ | ✓ | ✗ | ✓ |
| Load → Store | ✗ | ✓ | ✗ | ✓ |
| Store → Store | ✗ | ✓ | ✗ | ✓ |
| Store → Load | ✗ | ✗ | ✗ | ✓ |
✓ = Ordered by default ✗ = Can be reordered (need fence)
F.6 FENCE Equivalents Across Architectures
| RISC-V | x86 | ARM | Purpose |
|---|---|---|---|
| fence rw, rw | MFENCE | DMB SY | Full barrier |
| fence w, w | (implicit for ordinary stores; SFENCE for non-temporal stores) | DMB ST | Store barrier |
| fence r, r | (implicit for ordinary loads) | DMB LD | Load barrier |
| fence r, rw | (implicit) | DMB LD | Acquire |
| fence rw, w | (implicit) | DMB SY (DMB ST orders only stores) | Release |
| fence.i | (implicit) | ISB (after cache maintenance) | Instruction sync |
| fence.tso | (implicit) | - | TSO ordering |
F.7 Acquire/Release Semantics
Acquire Semantics
Meaning: No memory operations after the acquire can move before it.
Use Case: After acquiring a lock, before accessing protected data.
Implementation:
# Option 1: Atomic with .aq
lr.w.aq t0, (a0)
# Option 2: Load + fence
lw t0, 0(a0)
fence r, rw
Release Semantics
Meaning: No memory operations before the release can move after it.
Use Case: After accessing protected data, before releasing a lock.
Implementation:
# Option 1: Atomic with .rl
amoswap.w.rl zero, zero, (a0)
# Option 2: Fence + store
fence rw, w
sw zero, 0(a0)
Acquire-Release Pair
Complete Lock Example:
# Acquire
acquire:
lr.w.aq t0, (a0)
bnez t0, acquire
li t1, 1
sc.w t2, t1, (a0)
bnez t2, acquire
# Critical section
# ... protected operations ...
# Release
amoswap.w.rl zero, zero, (a0)
F.8 Common Pitfalls
Pitfall 1: Missing Fences
Wrong:
# Producer
sw a0, 0(s0) # Write data
sw a1, 0(s1) # Write flag
# Consumer
lw t0, 0(s1) # Read flag
lw t1, 0(s0) # Read data (might be stale!)
Correct:
# Producer
sw a0, 0(s0)
fence w, w # Add fence!
sw a1, 0(s1)
# Consumer
lw t0, 0(s1)
fence r, r # Add fence!
lw t1, 0(s0)
Pitfall 2: Wrong Fence Type
Wrong (using load-load fence for release):
# Critical section
fence r, r # Wrong! Doesn't order stores
sw zero, 0(a0) # Release lock
Correct:
# Critical section
fence rw, w # Correct! Orders all ops before stores
sw zero, 0(a0)
Pitfall 3: Forgetting .aq/.rl on Atomics
Wrong:
lr.w t0, (a0) # Missing .aq!
# Critical section
amoswap.w zero, zero, (a0) # Missing .rl!
Correct:
lr.w.aq t0, (a0)
# Critical section
amoswap.w.rl zero, zero, (a0)
Pitfall 4: Data Race
Wrong (no synchronization):
# Hart 0
sw a0, 0(s0) # Write shared variable
# Hart 1
lw t0, 0(s0) # Read shared variable (DATA RACE!)
Correct (use lock or atomic):
# Hart 0
# ... acquire lock ...
sw a0, 0(s0)
# ... release lock ...
# Hart 1
# ... acquire lock ...
lw t0, 0(s0)
# ... release lock ...
F.9 Quick Decision Tree
Do I need a fence?
Are multiple harts accessing shared memory?
├─ No → No fence needed
└─ Yes → Continue
Are the accesses synchronized (locks, atomics)?
├─ Yes → Ordering provided by the lock or by .aq/.rl atomics
└─ No → Continue
Are the accesses to the same address?
├─ Yes → No fence needed (hardware preserves order)
└─ No → FENCE REQUIRED!
What type of fence?
├─ Publishing data? → fence w, w
├─ Consuming data? → fence r, r
├─ Acquiring lock? → fence r, rw (or .aq)
├─ Releasing lock? → fence rw, w (or .rl)
└─ Not sure? → fence rw, rw (full fence)
F.10 References
- RISC-V Memory Model Specification: Chapter 14 of RISC-V ISA Manual
- RVWMO Formal Specification: https://github.com/riscv/riscv-isa-manual
- Memory Model Tools: herd7, rmem (for verification)
- Linux Kernel Memory Barriers: Documentation/memory-barriers.txt
About the Author
Danny Jiang is a seasoned system software engineer and technical lead with over 20 years of hands-on experience in firmware development, CPU/SoC architecture, and system validation. Currently serving as a Benchmarking/Application Engineer at SiFive, Danny has built his career working with leading semiconductor and processor companies, including MIPS (under Imagination Technologies, MIPS LLC, and Wave Computing), Broadcom, Western Digital, Andes Technology, and Silicon Integrated Systems (SiS).
Throughout his career, Danny has contributed to the development and deployment of millions of chips across diverse domains—from RISC-V and MIPS processors to SSD controllers, Bluetooth/IoT chipsets, and x86 chipset BIOS. His expertise spans the entire system software stack, from low-level bootloaders and device drivers to ASIC/FPGA validation and system integration.
Professional Expertise
Danny specializes in:
- Processor Architecture: RISC-V, MIPS, ARM, x86
- System Software: Bootloaders, firmware, device drivers, RTOS porting
- Validation & Verification: ASIC/FPGA bring-up, silicon validation, system integration
- Embedded Systems: IoT, SSD, wireless connectivity (Bluetooth, 802.15.x)
- Performance Engineering: Benchmarking (CoreMark, Dhrystone), optimization
- Customer Support: Technical troubleshooting, toolchain customization, training
Connect with Danny:
- Email: djiang.tw@gmail.com
- LinkedIn: linkedin.com/in/danny-jiang-26359644
- GitHub: https://github.com/djiangtw
Other Works:
- See RISC-V Run: Fundamentals (this book)
- Various open-source contributions to RISC-V ecosystem
Acknowledgments
The author would like to thank:
- RISC-V International and all contributors to the RISC-V specifications for creating an open, well-documented ISA
- The RISC-V community for their collaborative spirit and commitment to open standards
- Colleagues and mentors at SiFive, MIPS, Andes, Broadcom, Western Digital, and SiS for their insights and expertise
- Early reviewers who provided valuable feedback on draft chapters
- Family and friends for their unwavering support during the writing process
About the Book
“See RISC-V Run: Fundamentals” is inspired by Dominic Sweetman’s classic “See MIPS Run” and aims to provide the same level of comprehensive, systematic coverage for the RISC-V architecture. This book combines:
- Rigorous technical accuracy based on official RISC-V specifications
- Practical insights from real-world implementation experience across multiple processor families
- Clear explanations suitable for students, engineers, and researchers
- Comparative analysis with ARM and MIPS to build architectural intuition
This volume focuses on fundamental concepts—from ISA basics and programmer’s model to pipeline design, system software, and platform integration. Future volumes, including “See RISC-V Run: Advanced”, will explore microarchitecture optimizations, advanced extensions, and cutting-edge implementations.
The book is licensed under CC BY 4.0, reflecting the author’s commitment to open knowledge sharing, consistent with the RISC-V philosophy.
January 2026
Bibliography and References
This book is based on publicly available specifications and documentation. All information is derived from open-source materials and official RISC-V specifications.
RISC-V Official Specifications
ISA Specifications
- RISC-V Instruction Set Manual, Volume I: Unprivileged ISA
  RISC-V International
  https://github.com/riscv/riscv-isa-manual
  Latest version: Ratified 2019, with ongoing updates
- RISC-V Instruction Set Manual, Volume II: Privileged Architecture
  RISC-V International
  https://github.com/riscv/riscv-isa-manual
  Latest version: Ratified 2021, with ongoing updates
Extension Specifications
- RISC-V “V” Vector Extension
  RISC-V International
  https://github.com/riscv/riscv-v-spec
  Version 1.0, Ratified 2021
- RISC-V Bit Manipulation Extension
  RISC-V International
  https://github.com/riscv/riscv-bitmanip
  Version 1.0, Ratified 2021
- RISC-V Cryptography Extensions
  RISC-V International
  https://github.com/riscv/riscv-crypto
  Version 1.0, Ratified 2021
- RISC-V Hypervisor Extension
  RISC-V International
  Included in Privileged Architecture Specification
Platform Specifications
- RISC-V Platform-Level Interrupt Controller (PLIC) Specification
  RISC-V International
  https://github.com/riscv/riscv-plic-spec
- RISC-V Core-Local Interrupt Controller (CLIC) Specification
  RISC-V International
  https://github.com/riscv/riscv-fast-interrupt
- RISC-V Supervisor Binary Interface (SBI) Specification
  RISC-V International
  https://github.com/riscv-non-isa/riscv-sbi-doc
  Version 1.0, Ratified 2020
- RISC-V ELF psABI Specification
  RISC-V International
  https://github.com/riscv-non-isa/riscv-elf-psabi-doc
RISC-V Software and Tools
- RISC-V GNU Compiler Toolchain
  https://github.com/riscv-collab/riscv-gnu-toolchain
- RISC-V LLVM
  https://github.com/llvm/llvm-project
- QEMU RISC-V Emulator
  https://www.qemu.org/docs/master/system/target-riscv.html
- Spike RISC-V ISA Simulator
  https://github.com/riscv-software-src/riscv-isa-sim
- OpenSBI (Open Source Supervisor Binary Interface)
  https://github.com/riscv-software-src/opensbi
Companion Projects
- danieRTOS - A Minimal RISC-V RTOS for Learning
  Danny Jiang
  https://github.com/djiangtw/djiang-oss-public/tree/main/daniertos
  A minimal RTOS implementation designed for learning RISC-V system programming. Lab examples in this book reference logic from this project.
- Building danieRTOS - Technical Column Series
  Danny Jiang
  https://github.com/djiangtw/tech-column-public/tree/main/topics/building-daniertos
  A technical article series documenting the danieRTOS development process, covering Context Switch, Interrupt Handling, Timer, Scheduler, and other core topics.
Classic Architecture Books
- Sweetman, Dominic. See MIPS Run, Second Edition.
  Morgan Kaufmann, 2006.
  ISBN: 978-0120884216
- Patterson, David A., and John L. Hennessy. Computer Organization and Design RISC-V Edition: The Hardware Software Interface.
  Morgan Kaufmann, 2017.
  ISBN: 978-0128122754
- Patterson, David, and Andrew Waterman. The RISC-V Reader: An Open Architecture Atlas.
  Strawberry Canyon, 2017.
  ISBN: 978-0999249109
ARM Architecture References
- ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile
  ARM Limited
  https://developer.arm.com/documentation/
- ARM Cortex-A Series Programmer’s Guide
  ARM Limited
  https://developer.arm.com/documentation/
Memory Model and Concurrency
- RISC-V Memory Consistency Model
  Included in RISC-V Unprivileged ISA Specification, Chapter 14
  https://github.com/riscv/riscv-isa-manual
- Alglave, Jade, et al. “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.2 - Memory Model”
  RISC-V Foundation, 2017
Online Resources
- RISC-V International Website
  https://riscv.org/
- RISC-V Technical Specifications
  https://riscv.org/technical/specifications/
- RISC-V GitHub Organization
  https://github.com/riscv
- RISC-V Software Collaboration
  https://github.com/riscv-collab
- RISC-V Wiki
  https://wiki.riscv.org/
Academic Papers
- Asanović, Krste, and David A. Patterson. “Instruction Sets Should Be Free: The Case for RISC-V.”
  EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2014-146, 2014.
- Waterman, Andrew. “Design of the RISC-V Instruction Set Architecture.”
  PhD Thesis, University of California, Berkeley, 2016.
Notes
- All RISC-V specifications are available under open licenses (Creative Commons or similar)
- This book does not use any proprietary or confidential information
- All code examples are original or based on publicly available documentation
- Readers should consult the official RISC-V specifications for the most current information
Last Updated: January 2026 (v0p11 Enhancement)