Front Page

title: “See RISC-V Run: Fundamentals”
subtitle: “A Comprehensive Guide to RISC-V Architecture”
author: “Danny Jiang”
version: “Draft v0p11”
date: “January 2026”

See RISC-V Run: Fundamentals

A Comprehensive Guide to RISC-V Architecture

From ISA Fundamentals to System Design

Danny Jiang

Draft v0p11 - January 2026

Complete Book:

17 Chapters organized into 10 Parts
6 Appendices with quick reference materials
~100,000+ words (~400 pages)
Comprehensive coverage from ISA fundamentals to system design
Enhanced with Learning Objectives, Scenario Dialogues, Hands-on Labs, and Common Pitfalls

Licensed under CC BY 4.0

Copyright and License

See RISC-V Run: Fundamentals

A Comprehensive Guide to RISC-V Architecture

Version: Draft v0p11 (Enhanced Edition)
Published: January 2026
Author: Danny Jiang
Contact: djiang.tw@gmail.com

License

This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

Share
Copy and redistribute the material in any medium or format for any purpose, even commercially
Adapt
Remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

Attribution
You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Full license text: https://creativecommons.org/licenses/by/4.0/

Trademarks

RISC-V is a trademark of RISC-V International. ARM is a trademark of Arm Limited. MIPS is a trademark of MIPS Technologies, Inc. All other trademarks are the property of their respective owners.

Disclaimer

This book is provided “as is” without warranty of any kind, either expressed or implied. The author and publisher shall not be liable for any damages arising from the use of this book.

The information in this book is based on publicly available specifications and documentation. While every effort has been made to ensure accuracy, the RISC-V specifications continue to evolve. Readers should consult the official RISC-V specifications for the most current information.

About This Book

This is the complete book “See RISC-V Run: Fundamentals”. The book contains:

17 Chapters organized into 10 Parts
6 Appendices with quick reference materials
~100,000+ words (~400 pages)
Comprehensive coverage from ISA fundamentals to system design
v0p11 Enhanced: Learning Objectives, Scenario Dialogues, Hands-on Labs, Common Pitfalls

Author’s GitHub: https://github.com/djiangtw

For updates and errata: To be announced

Enhanced Edition, January 2026

Preface

Why This Book?

RISC-V represents a fundamental shift in computer architecture. Unlike proprietary instruction set architectures (ISAs) that dominated the industry for decades, RISC-V is open, modular, and designed for the modern era. It’s not just another ISA—it’s a new way of thinking about processor design, enabling innovation from embedded microcontrollers to high-performance supercomputers.

When I set out to write this book, I wanted to create something that didn’t exist: a comprehensive, systematic guide to RISC-V that combines the depth of official specifications with the clarity of a well-written textbook. I was inspired by Dominic Sweetman’s classic “See MIPS Run”, which masterfully explained MIPS architecture in a way that was both rigorous and accessible. This book aims to do the same for RISC-V. Rather than presenting dry specifications, it teaches you how to think like a system designer through dialogues between “Junior” and “Senior” engineers. Real-world metaphors such as “The Gourmet Kitchen” (ISA modularity), “The Museum’s Red Barrier Poles” (PMP), and “The Building’s Privilege Hierarchy” (M/S/U modes) explain the complex concepts.

Who Should Read This Book?

This book is written for anyone who wants to understand RISC-V deeply:

System Software Developers: If you’re writing operating systems, bootloaders, firmware, or low-level drivers for RISC-V, this book provides the architectural knowledge you need. You’ll learn how exceptions work, how virtual memory is implemented, how to use SBI calls, and how to write correct concurrent code under RISC-V’s weak memory model.

Hardware Engineers: If you’re designing RISC-V processors or SoCs, this book explains the ISA from an implementation perspective. You’ll understand pipeline hazards, memory ordering requirements, interrupt controller integration, and the trade-offs between different microarchitectural choices.

Computer Architecture Students: If you’re learning computer architecture, RISC-V is an excellent teaching vehicle. This book provides a complete picture of a modern ISA, from instruction encoding to system-level features, with comparisons to ARM and MIPS to build your architectural intuition.

Engineers Transitioning from ARM or MIPS: If you’re experienced with other architectures and moving to RISC-V, this book highlights the similarities and differences. You’ll find detailed comparisons of instruction sets, exception models, memory models, and calling conventions to accelerate your learning.

What Makes This Book Different?

Comprehensive Coverage: This book covers the entire RISC-V ecosystem—not just the base ISA, but also extensions (M, A, F, D, C, V, B, H), privileged architecture (M/S/U modes), system software interfaces (SBI, ABI), platform specifications (PLIC, CLIC), and real-world implementation considerations.

Systematic Organization: The book is organized into 10 parts that build progressively from fundamentals to advanced topics. Each chapter is self-contained but connects to the broader narrative, making it suitable both for cover-to-cover reading and as a reference.

Practical Focus: Every concept is illustrated with code examples, diagrams, and real-world use cases. You’ll see how to implement spinlocks, how to handle page faults, how to configure PMPs, and how to debug memory ordering issues—not just abstract theory.

Comparative Analysis: Throughout the book, I compare RISC-V with ARM and MIPS. These comparisons help you understand RISC-V’s design philosophy and make informed decisions when porting code or designing systems.

Modern Perspective: RISC-V is a modern ISA designed with lessons learned from decades of processor evolution. This book emphasizes modern features like weak memory ordering, modular extensions, and formal specifications that distinguish RISC-V from older architectures.

How to Use This Book

For Cover-to-Cover Reading: The chapters are designed to be read in order, building from basic concepts to advanced topics. Start with Part I (Introduction) and work through to Part X (RISC-V vs Other Architectures).

As a Reference: Each chapter is self-contained with clear section headings and summaries. The appendices provide quick reference tables for CSRs, extensions, SBI calls, and instruction comparisons. Use the table of contents and index to find specific topics.

For Hands-On Learning: The book includes numerous code examples in assembly and C. I encourage you to run these examples on RISC-V simulators (QEMU, Spike) or real hardware. Experimenting with the code will deepen your understanding.

For Teaching: This book is suitable for undergraduate or graduate courses in computer architecture or systems programming. Each chapter includes learning objectives and summaries that can guide course structure.

What’s in This Book?

This book contains 17 chapters organized into 10 parts, plus 6 comprehensive appendices:

Part I — Introduction introduces the RISC-V ISA, its history, design philosophy, and ecosystem.

Part II — Programmer’s Model covers the fundamental programming model—registers, privilege levels, calling conventions, and execution environment.

Part III — Traps & Exceptions explains the unified trap handling mechanism for exceptions and interrupts.

Part IV — Memory & Addressing covers virtual memory, paging (Sv39/Sv48), memory ordering, and synchronization.

Part V — Pipeline & Microarchitecture explores pipeline fundamentals, hazards, and microarchitecture variations.

Part VI — Booting & System Software details reset, boot flow, firmware, SBI, and OS integration.

Part VII — ISA Extensions covers standard extensions (M, A, F, D, C) and the Vector extension.

Part VIII — System Design, Platform Spec & SoC Integration explains SoC integration, interrupt controllers, and platform profiles.

Part IX — Performance, Debug & Tools covers debugging, trace, and performance monitoring.

Part X — RISC-V vs Other Architectures provides a systematic comparison of RISC-V vs ARM vs MIPS.

Appendices provide quick reference for CSRs, extensions, bootloaders, SBI, RISC-V vs ARM comparison, and memory model.

Total: ~100,000+ words, ~400 pages

Acknowledgments

This book would not have been possible without the RISC-V community’s commitment to open specifications and transparent development. I’m grateful to RISC-V International and all the engineers who have contributed to the RISC-V specifications.

I also want to thank the authors of the classic architecture books that inspired this work, particularly Dominic Sweetman’s “See MIPS Run” and the ARM architecture reference manuals that set the standard for technical documentation.

Finally, thank you to the early readers and reviewers who provided feedback on draft chapters. Your insights have made this book better.

Feedback and Errata

This is a living book. RISC-V continues to evolve, and I’m committed to keeping this book current. If you find errors, have suggestions, or want to share how you’re using the book, please reach out:

Email: djiang.tw@gmail.com
GitHub: https://github.com/djiangtw
Errata: To be announced

Danny Jiang
January 2026

Front Matter

Cover
Copyright and License
Preface
Table of Contents

Part I — Introduction

Chapter 1: What Is RISC-V?

1.1 The Birth of RISC-V
1.2 RISC-V Design Philosophy
1.3 RISC-V ISA Overview
1.4 RISC-V Ecosystem
1.5 Why Learn RISC-V?
1.6 How to Use This Book

Part II — Programmer’s Model

Chapter 2: Programmer’s Model & Register Set

2.1 Integer Register File
2.2 Floating-Point Register File
2.3 Control and Status Registers (CSRs)
2.4 Privilege Modes
2.5 Memory Model
2.6 Comparison with ARM and MIPS

Chapter 3: Privilege Levels & Execution Environment

3.1 RISC-V Privilege Architecture
3.2 Machine Mode (M-mode)
3.3 Supervisor Mode (S-mode)
3.4 User Mode (U-mode)

Part III — Traps & Exceptions

Chapter 4: Trap, Exception, Interrupt

4.1 Trap Handling Overview
4.2 Exception Types
4.3 Interrupt Types
4.4 Trap Delegation
4.5 Trap Vector Table
4.6 Trap Entry and Exit
4.7 Nested Traps
4.8 Comparison with ARM and MIPS

Part IV — Memory & Addressing

Chapter 5: Virtual Memory & Paging (Sv39/Sv48)

5.1 Virtual Memory Overview
5.2 Page Table Structure
5.3 Address Translation
5.4 TLB Management

Chapter 6: Memory Ordering & Synchronization

6.1 Memory Consistency Model
6.2 RISC-V Memory Model (RVWMO)
6.3 Fence Instructions
6.4 Atomic Operations
6.5 Load-Reserved/Store-Conditional
6.6 Comparison with ARM and x86

Part V — Pipeline & Microarchitecture

Chapter 7: RISC-V Pipeline Fundamentals

7.1 Classic Five-Stage Pipeline
7.2 Pipeline Hazards
7.3 Hazard Detection and Resolution
7.4 Branch Prediction
7.5 Pipeline Performance
7.6 Comparison with ARM and MIPS

Chapter 8: Microarchitecture Variations

8.1 In-Order vs Out-of-Order
8.2 Superscalar Execution
8.3 Cache Hierarchy
8.4 Memory Subsystem
8.5 RISC-V Core Examples
8.6 Rocket Core
8.7 BOOM (Berkeley Out-of-Order Machine)
8.8 Performance Comparison

Part VI — Booting & System Software

Chapter 9: Reset, Boot Flow & Firmware

9.1 Reset and Initialization
9.2 Boot ROM
9.3 Bootloader Stages
9.4 Firmware Components
9.5 Device Tree
9.6 Boot Flow Examples
9.7 Comparison with ARM

Chapter 10: Machine Mode, SBI & Supervisor Mode

10.1 Machine Mode Overview
10.2 Supervisor Binary Interface (SBI)
10.3 SBI Implementation (OpenSBI)
10.4 Supervisor Mode Software
10.5 OS Kernel Integration
10.6 Comparison with ARM

Part VII — ISA Extensions

Chapter 11: RISC-V Standard Extensions

11.1 Extension Naming Convention
11.2 M Extension (Integer Multiply/Divide)
11.3 A Extension (Atomic Instructions)
11.4 F Extension (Single-Precision Floating-Point)
11.5 D Extension (Double-Precision Floating-Point)
11.6 C Extension (Compressed Instructions)
11.7 Zicsr and Zifencei
11.8 Other Standard Extensions

Chapter 12: Vector Processing & SIMD Comparison

12.1 Vector Extension Overview
12.2 Vector Register File
12.3 Vector Instructions
12.4 Vector Length Agnostic Programming
12.5 Vector Memory Operations
12.6 Vector Performance
12.7 Comparison with ARM SVE/SVE2
12.8 Comparison with x86 AVX

Part VIII — System Design, Platform Spec & SoC Integration

Chapter 13: SoC Integration

13.1 SoC Architecture Overview
13.2 Interrupt Controllers (PLIC, CLIC, AIA)
13.3 Memory-Mapped I/O
13.4 DMA and Bus Protocols
13.5 Power Management Integration

Chapter 14: RISC-V Platform Profiles & Embedded Systems

14.1 Platform Specifications
14.2 RVA Profiles (RVA22, RVA23)
14.3 Embedded Profiles
14.4 Certification and Compliance

Part IX — Performance, Debug & Tools

Chapter 15: Debugging & Trace

15.1 Debug Architecture Overview
15.2 Debug Module
15.3 Trigger Module
15.4 Trace Architecture
15.5 JTAG and OpenOCD
15.6 GDB Integration

Chapter 16: Performance Counters & PMU

16.1 Performance Monitoring Overview
16.2 Hardware Performance Counters
16.3 Event Selection
16.4 PMU Programming
16.5 Performance Analysis Tools

Part X — RISC-V vs Other Architectures

Chapter 17: RISC-V vs ARM vs MIPS — A Systematic Comparison

17.1 Historical Context
17.2 ISA Design Philosophy
17.3 Instruction Encoding
17.4 Privilege Architecture
17.5 Memory Model
17.6 Ecosystem and Adoption
17.7 Future Outlook

Appendices

Appendix A: CSR Reference
Appendix B: Extension Reference
Appendix C: Bootloader Reference
Appendix D: SBI Reference
Appendix E: RISC-V vs ARM Comparison
Appendix F: Memory Model Reference

Back Matter

About the Author
Bibliography

Chapter 1. What Is RISC-V?

Part I — Introduction

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand the ISA’s Role: Grasp how the Instruction Set Architecture (ISA) serves as the critical interface between software and hardware
Master RISC-V Core Philosophy: Understand how the three pillars—modularity, simplicity, and openness—influence design decisions
Distinguish Business Model Differences: Recognize the fundamental licensing and ecosystem differences between RISC-V and x86/ARM
Decode ISA Naming: Read and understand ISA strings like RV64GC and RV32IMC
Set Up Development Environment: Successfully install the RISC-V toolchain and run your first program

💡 Scenario: The Break Room Decision

Scene: Monday morning in the company break room. The coffee machine hums quietly. The aroma of fresh coffee fills the air, but Junior’s expression is anything but relaxed.

Junior: (sighing) “Hey Senior, the PM just dropped a new project on me. Says we need to evaluate RISC-V for the MCU selection instead of our usual ARM Cortex-M. My head is spinning—what’s the difference anyway? Don’t they both just run C code? Why make things complicated?”

Senior: (accepting a fresh cup of coffee with a smile) “Don’t panic. Think of it this way: you’ve been eating at a ‘franchise fast-food restaurant’ all this time, and now the boss wants you to try a ‘build-your-own gourmet kitchen.’ Writing C code might feel similar, but the underlying rules of the game have changed.”

Junior: “Rules of the game? You mean the instruction set architecture?”

Senior: “Exactly. Think about it—when we use ARM, the ecosystem is strong, but it’s still one company’s product. To use their IP, your company pays licensing fees, and you get whatever features they decide to give you. If we wanted to add special hardware acceleration for an AI algorithm, could we modify ARM’s core?”

Junior: “No way. The vendor would never let us mess with their core.”

Senior: “And that’s RISC-V’s biggest value. It’s an open standard. Like TCP/IP or Linux, no single company ‘owns’ it. If we need special acceleration, under RISC-V, we can design our own custom instructions and add them in—no begging the vendor for permission.”

Junior: “Sounds flexible, but wouldn’t it be chaotic? Without standards, would programs even run?”

Senior: “That’s the most common newbie question. The clever thing about RISC-V is its modular design. There’s a mandatory ‘base model’ that everyone must follow, and additional extensions are like LEGO blocks—add them when you need them. That’s why everyone from NVIDIA to university labs is playing with it.”

Junior: (eyes lighting up) “Interesting… So we’re not just ‘users’ anymore—we can become ‘designers’?”

Senior: “Bingo! Come on, let’s head back to our desks. I’ll help you set up the environment, and we’ll start from the foundation of these ‘LEGO blocks.’”

RISC-V represents a fundamental shift in how we think about processor architectures. Unlike proprietary instruction sets that require licenses and royalties, RISC-V is completely open and free. Unlike monolithic architectures that bundle everything together, RISC-V is modular, allowing implementations from tiny microcontrollers to high-performance servers. Unlike architectures burdened by decades of legacy decisions, RISC-V was designed from scratch in 2010, learning from 30 years of RISC evolution.

This chapter introduces RISC-V by exploring its historical context, design philosophy, and place in the architecture landscape. We’ll trace the RISC revolution from the 1980s through RISC-V’s creation at UC Berkeley. We’ll examine why an open ISA matters and how RISC-V’s modular design enables unprecedented flexibility. We’ll compare RISC-V with ARM and MIPS to understand its competitive position. By the end, you’ll understand not just what RISC-V is, but why it matters and why it’s rapidly gaining adoption across the industry.

1.1 The RISC Revolution

Origins of RISC

The story of RISC-V begins not in 2010, but in the early 1980s, when computer architects began questioning the prevailing wisdom of complex instruction sets. Two landmark projects emerged almost simultaneously: the Berkeley RISC project led by David Patterson and Carlo Séquin, and the Stanford MIPS project led by John Hennessy. Both teams arrived at a radical conclusion: simpler is better.

The Berkeley RISC project, starting in 1980, challenged the conventional wisdom that complex instructions were necessary for high performance. Patterson’s team demonstrated that a processor with a small, carefully chosen set of simple instructions could outperform contemporary CISC (Complex Instruction Set Computer) processors. The key insight was that compilers, not hardware, should handle complexity.

Meanwhile, at Stanford, John Hennessy’s MIPS (Microprocessor without Interlocked Pipeline Stages) project pursued similar goals with a slightly different approach. The MIPS design emphasized pipeline efficiency and compiler optimization, creating an architecture that would eventually power Silicon Graphics workstations and countless embedded systems.

IBM’s 801 project, though less publicized, also contributed crucial ideas to the RISC philosophy. Started in the late 1970s, it demonstrated that a load-store architecture with a large register file could achieve excellent performance.

RISC Design Philosophy

At the heart of RISC is a deceptively simple idea: make the common case fast. RISC architectures achieve this through several key principles:

Load-Store Architecture: Only load and store instructions access memory. All computation happens in registers. This clean separation simplifies pipeline design and enables higher clock frequencies.

Fixed-Length Instructions: All instructions are the same size (typically 32 bits). This allows the processor to fetch and decode instructions in parallel, simplifying the instruction fetch stage and enabling efficient pipelining.

Simple Addressing Modes: RISC architectures typically support only a few addressing modes, often just base+offset for memory access. Complex address calculations are performed using explicit arithmetic instructions.

Large Register File: With 32 or more general-purpose registers, RISC processors can keep frequently used data in fast registers rather than slow memory. This reduces memory traffic and improves performance.

RISC vs CISC

The RISC vs CISC debate dominated computer architecture discussions throughout the 1980s and 1990s. CISC architectures like the Intel x86 and Motorola 68000 featured hundreds of complex instructions, variable-length instruction encoding, and numerous addressing modes. The philosophy was to provide rich instruction sets that closely matched high-level language constructs.

RISC took the opposite approach. By keeping instructions simple and uniform, RISC processors could achieve higher clock frequencies and more efficient pipelining. The burden of generating efficient code shifted from hardware to compilers, but this proved to be a winning strategy as compiler technology matured.

Performance comparisons initially favored RISC. A RISC processor might execute more instructions to accomplish the same task, but it could execute each instruction faster and with better pipeline efficiency. The result was often superior overall performance, especially for compute-intensive workloads.
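
The trade-off described above can be made concrete with the classic performance equation: execution time = instruction count × cycles per instruction × clock period. The sketch below uses made-up numbers purely for illustration (they are not measurements of any real processor) to show how a higher instruction count can still give a shorter run time.

```python
def execution_time(instruction_count, cycles_per_instruction, clock_period_s):
    """Classic performance equation: time = instructions x CPI x clock period."""
    return instruction_count * cycles_per_instruction * clock_period_s

# Hypothetical workloads: illustrative values only, not benchmark data.
cisc_time = execution_time(1_000_000, 4.0, 50e-9)  # fewer, slower instructions
risc_time = execution_time(1_300_000, 1.5, 40e-9)  # 30% more, faster instructions

print(f"CISC-style: {cisc_time * 1e3:.1f} ms")
print(f"RISC-style: {risc_time * 1e3:.1f} ms")
```

Changing any one of the three factors shifts the balance, which is why the comparison depends heavily on the workload and the implementation.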

Historical Impact

The RISC revolution spawned several influential architectures that shaped the computing landscape:

MIPS: Commercialized by MIPS Computer Systems in 1985, the MIPS architecture powered Silicon Graphics workstations and became ubiquitous in embedded systems. Its clean design made it a favorite for teaching computer architecture. Even today, MIPS processors are found in routers, set-top boxes, and other embedded devices.

SPARC: Sun Microsystems’ Scalable Processor Architecture (1987) dominated the workstation and server market in the 1990s. SPARC’s register windows and clean architecture made it popular for Unix systems and scientific computing.

ARM: Perhaps the most successful RISC architecture, ARM (Acorn RISC Machine, later Advanced RISC Machine) started in 1985 as a processor for the Acorn Archimedes computer. Its focus on power efficiency made it the dominant architecture for mobile devices. Today, ARM processors power virtually every smartphone and tablet.

PowerPC: The alliance between Apple, IBM, and Motorola produced PowerPC in 1991, combining ideas from IBM’s POWER architecture with RISC principles. PowerPC powered Apple Macintosh computers from 1994 to 2006 and remains important in embedded and high-performance computing.

These architectures proved that RISC principles worked in practice. They demonstrated that simple, regular instruction sets could deliver excellent performance while simplifying processor design. This legacy directly influenced RISC-V’s design philosophy.

Figure 1.1: RISC Architecture Evolution Timeline

RISC Architecture Evolution Timeline (1980-2010)

Year	Architecture	Significance
1980	Berkeley RISC-I	First RISC processor, demonstrated RISC principles
1980	IBM 801	Early RISC design, influenced later architectures
1981	Berkeley RISC-II	Refined RISC design, register windows
1983	Stanford MIPS	Emphasized pipeline efficiency
1985	MIPS R2000	One of the first commercial RISC processors
1985	ARM1 (Acorn)	Low-power RISC for personal computers
1987	Sun SPARC	Workstation and server market
1991	PowerPC	Apple/IBM/Motorola alliance
1994	ARMv4T (ARM7TDMI)	Thumb instruction set, embedded dominance
2001	ARMv6 (ARM11)	Advanced features, multimedia support
2010	RISC-V	Open, modular, extensible ISA

This timeline shows how RISC-V builds on 30 years of RISC architecture evolution, learning from both successes and mistakes of its predecessors.

1.2 Why RISC-V? The Open ISA Movement

The Need for Open ISA

By 2010, the computer architecture landscape had consolidated around a few proprietary instruction set architectures. Intel’s x86 dominated PCs and servers. ARM dominated mobile and embedded systems. MIPS and PowerPC served niche markets. All shared one characteristic: they were proprietary, requiring licenses and royalty payments.

This created several problems for researchers, educators, and innovators:

Licensing Costs: Even for academic research, obtaining architecture licenses could be expensive and time-consuming. Companies faced significant royalty payments for each chip manufactured.

Restrictions on Modification: Proprietary ISAs typically prohibited modifications or extensions without explicit permission. This stifled innovation and made it difficult to explore new architectural ideas.

Fragmentation: Each vendor’s extensions and modifications created incompatible variants. ARM alone had numerous profiles and extensions, making it challenging to write portable software.

Long-Term Uncertainty: Companies building products around a proprietary ISA faced uncertainty about future licensing terms, support, and the architecture’s longevity.

The open-source software movement had demonstrated the power of collaborative development and unrestricted access. Why not apply the same principles to hardware instruction sets?

Birth of RISC-V

In 2010, a team at UC Berkeley led by Krste Asanović, David Patterson, Yunsup Lee, and Andrew Waterman set out to create a new instruction set architecture for research and education. They needed an ISA that was:

Free and open, with no licensing fees or restrictions
Clean and simple, suitable for teaching
Practical and complete, capable of running real operating systems
Extensible, allowing custom instructions for specialized applications
Stable, with a frozen base that would never change

The team initially considered using an existing ISA but found none that met all their requirements. MIPS was proprietary and carried legacy baggage. OpenRISC existed but lacked industry momentum. SPARC V8 was available but complex.

So they designed RISC-V from scratch, learning from 30 years of RISC architecture evolution. The name “RISC-V” (pronounced “risk-five”) marks it as the fifth major RISC ISA design from UC Berkeley, following RISC-I, RISC-II, SOAR, and SPUR (the latter two are sometimes counted as RISC-III and RISC-IV).

The initial RISC-V specification, released in 2011, defined a minimal 32-bit integer instruction set (RV32I) with just 47 instructions. This base ISA was deliberately kept small and was later frozen, so that software written for it will keep running on future RISC-V implementations.

RISC-V International

As RISC-V gained traction, it became clear that a formal organization was needed to manage the specification and ensure consistency. In 2015, the RISC-V Foundation was established as a non-profit organization to standardize and promote the architecture.

In 2020, the foundation relocated to Switzerland and became RISC-V International, reflecting its global nature and ensuring neutrality from any single country’s export controls or political considerations.

RISC-V International operates through a collaborative governance model:

Technical committees develop and ratify specifications
Members include companies, universities, and individuals
All specifications are freely available
Anyone can implement RISC-V without fees or licenses
Members contribute to development but don’t control the ISA

This open governance ensures that RISC-V remains truly open and vendor-neutral, unlike proprietary ISAs controlled by single companies.

Industry Ecosystem

What started as an academic project has grown into a thriving industry ecosystem. As of 2025, RISC-V International has over 3,000 members from more than 70 countries, including major technology companies, startups, universities, and government organizations.

Hardware implementations range from tiny microcontrollers to high-performance application processors:

SiFive produces commercial RISC-V cores for embedded and application processors
Western Digital has committed to moving its storage controllers, which account for more than a billion cores per year, to RISC-V
NVIDIA uses RISC-V for GPU microcontrollers
Alibaba’s T-Head develops high-performance RISC-V processors
Numerous startups are building RISC-V-based products

The software ecosystem has matured rapidly. The GNU toolchain (GCC, binutils) and LLVM support RISC-V. Major operating systems including Linux, FreeBSD, and real-time operating systems run on RISC-V. Language runtimes for Java, Python, JavaScript, and others have been ported.

Academic adoption has been particularly strong. RISC-V’s clean design and open nature make it ideal for teaching computer architecture. Universities worldwide use RISC-V in courses, and researchers use it to explore new architectural ideas without licensing barriers.

Comparison with Proprietary ISAs

The contrast between RISC-V’s open model and proprietary ISAs is stark:

ARM Licensing: ARM Holdings licenses its architecture and core designs. Licensees pay upfront fees and per-chip royalties. Architecture licenses (allowing custom core design) are expensive and restricted. This model has been profitable for ARM but creates barriers for innovation and education.

x86 Duopoly: Intel and AMD control the x86 architecture through patents and cross-licensing agreements. Apart from a few legacy licensees such as VIA, no other company can legally build x86 processors. This effective duopoly limits competition and innovation in the PC and server markets.

RISC-V Open Model: Anyone can implement RISC-V without fees, licenses, or royalties. The specification is freely available. Custom extensions are permitted. This openness enables innovation, reduces costs, and eliminates vendor lock-in.

The open model doesn’t mean RISC-V is “free” in the sense of zero cost. Designing and manufacturing processors still requires significant investment. But it removes the artificial barriers of licensing fees and restrictions, allowing competition based on implementation quality rather than ISA access.

1.3 RISC-V Design Philosophy

Simplicity

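RISC-V embraces simplicity as a core principle. The base integer instruction set (RV32I) contains just 47 instructions (40 in current versions of the specification, which move the CSR and FENCE.I instructions into the separate Zicsr and Zifencei extensions). That is enough to run a complete operating system and applications, but small enough to understand completely in a few hours.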

This simplicity manifests in several ways:

Orthogonal Instruction Set: Instructions don’t have special cases or exceptions. Load and store instructions work the same way regardless of data type. Arithmetic instructions operate uniformly on registers.

Regular Encoding: Instruction formats are consistent and predictable. The opcode is always in the same position. Source and destination registers occupy fixed fields. This regularity simplifies decoding and enables efficient implementation.

No Implicit Operations: RISC-V instructions do exactly what they say, nothing more. The base ISA has no hidden side effects, no implicit register updates, and no condition-code register: comparison instructions such as slt write their result to an ordinary register, and conditional branches compare two registers directly.
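
To illustrate the regular encoding described above, here is a minimal sketch that pulls the fixed-position fields out of a 32-bit instruction word. The function name decode_fields and the two sample words are chosen for this illustration; the bit positions follow the standard 32-bit base formats. Not every format uses every field (in an S-type store, for example, bits 11:7 hold part of the immediate rather than a destination register), but a field that is present always sits in the same place.

```python
def decode_fields(word):
    """Split a 32-bit RISC-V instruction word into its fixed-position fields."""
    return {
        "opcode": word & 0x7F,          # bits 6:0
        "rd":     (word >> 7) & 0x1F,   # bits 11:7 (immediate bits in S-type)
        "funct3": (word >> 12) & 0x7,   # bits 14:12
        "rs1":    (word >> 15) & 0x1F,  # bits 19:15
        "rs2":    (word >> 20) & 0x1F,  # bits 24:20
        "funct7": (word >> 25) & 0x7F,  # bits 31:25
    }

ADD_X3_X1_X2 = 0x002081B3  # add x3, x1, x2  (R-type)
SW_X2_8_X1 = 0x0020A423    # sw  x2, 8(x1)   (S-type)

for name, word in (("add", ADD_X3_X1_X2), ("sw", SW_X2_8_X1)):
    fields = decode_fields(word)
    print(name, {key: hex(value) for key, value in fields.items()})
```

In both words the source registers come out of the same bit ranges, even though the two instructions use different formats. This is what lets a hardware decoder read the register file before it has finished working out which instruction it is looking at.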

The simplicity extends to the privilege architecture. RISC-V defines three privilege levels (Machine, Supervisor, User) with clean separation of responsibilities. There are no complex security states or trust zones in the base specification—those can be added as extensions if needed.

Modularity

Perhaps RISC-V’s most distinctive feature is its modular design. Rather than defining one monolithic ISA, RISC-V separates functionality into a small base ISA plus optional standard extensions.

The ratified base ISAs (RV32I and RV64I) are frozen and will never change; RV128I is defined but is still a draft. The base ISA provides:

Integer arithmetic and logical operations
Load and store instructions
Control flow (branches and jumps)
System instructions (environment calls, fences)

Standard extensions add functionality:

M: Integer multiplication and division
A: Atomic instructions for synchronization
F: Single-precision floating-point
D: Double-precision floating-point
C: Compressed 16-bit instructions for code density
V: Vector operations for data parallelism
B: Bit manipulation

A processor implements only the extensions it needs. An embedded microcontroller might implement just RV32I. A Linux-capable application processor would typically implement RV64IMAFDC (abbreviated as RV64GC, where G = IMAFD). A high-performance processor might add V for vector processing.

This modularity provides several benefits:

Implementations can be tailored to specific applications
New extensions can be added without affecting existing code
The ISA can evolve without breaking compatibility
Educational and research projects can start simple and add complexity as needed

Figure 1.2: RISC-V Modular Architecture

RISC-V ISA Modular Structure

Category	Component	Description	Status
Base ISA	RV32I	32-bit integer base	FROZEN
	RV64I	64-bit integer base	FROZEN
	RV128I	128-bit integer base	Draft (not frozen)
Standard Extensions	M	Multiply/Divide	Standard
	A	Atomics	Standard
	F	Single-precision floating-point	Standard
	D	Double-precision floating-point	Standard
	C	Compressed instructions (16-bit)	Standard
	V	Vector operations	Standard
	B	Bit manipulation	Standard
Custom Extensions	Custom	Domain-specific instructions	Vendor-defined

The modular design allows implementations to include only the extensions they need, from minimal embedded systems (RV32I only) to high-performance processors (RV64GCV).

Extensibility

Beyond standard extensions, RISC-V explicitly supports custom extensions. The instruction encoding reserves space for custom opcodes, allowing vendors to add specialized instructions without conflicting with standard ones.

Custom extensions enable:

Domain-specific accelerators (cryptography, AI, signal processing)
Proprietary features for competitive advantage
Research into new instruction types
Rapid prototyping of architectural ideas

The key is that custom extensions don’t break compatibility. Software that doesn’t use custom instructions runs unchanged. Compilers can generate code that uses custom instructions when available and falls back to standard instructions otherwise.

This extensibility has proven valuable in practice. Companies have added custom instructions for encryption, machine learning, and other specialized tasks. Researchers have explored new architectural concepts without forking the ISA.

Stability

RISC-V makes a strong commitment to stability. The ratified base ISA is frozen and will never change, so software written for RV32I today should still run on RISC-V processors in 2050 and beyond.

Standard extensions follow a rigorous ratification process. Once ratified, an extension is frozen. New versions may add features but must maintain backward compatibility.

This stability provides confidence for long-term investments. Companies can build products knowing the ISA won’t change underneath them. Software developers can write code that will run on future processors.

The stability commitment distinguishes RISC-V from some other open ISAs that have undergone incompatible changes. It also contrasts with proprietary ISAs where vendors can deprecate features or change behavior in new versions.

1.4 RISC-V ISA Overview

Base Integer ISAs

RISC-V defines three base integer ISAs, differing mainly in register width:

RV32I: 32-bit registers and addresses. Suitable for embedded systems, microcontrollers, and 32-bit applications. The base RV32I instruction set is frozen and contains 47 instructions.

RV64I: 64-bit registers and addresses. Designed for application processors, servers, and systems requiring large address spaces. RV64I extends RV32I with additional instructions for 64-bit operations (like ADDW for 32-bit addition with sign extension).

RV128I: 128-bit registers and addresses. Reserved for future systems requiring very large address spaces. The specification is preliminary and not yet frozen.

All three ISAs share the same basic instruction formats and philosophy. Code written for RV32I can often be recompiled for RV64I with minimal changes. The transition from 32-bit to 64-bit is cleaner than in some other architectures.
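The ADDW behavior mentioned above (add the low 32 bits, then sign-extend the result to 64 bits) can be modeled in a few lines of C. This is only a sketch of the arithmetic, and `rv64_addw` is a name invented for this example. It assumes the usual two's-complement behavior when converting an out-of-range value to `int32_t`, which is what mainstream compilers provide.

```c
#include <stdint.h>

/* Models RV64I ADDW: add the low 32 bits of both operands,
   then sign-extend the 32-bit result to 64 bits. */
int64_t rv64_addw(int64_t rs1, int64_t rs2)
{
    uint32_t sum = (uint32_t)rs1 + (uint32_t)rs2;  /* wraps modulo 2^32 */
    return (int64_t)(int32_t)sum;                  /* copy bit 31 into bits 63..32 */
}
```

With this model, adding 1 to 0x7FFFFFFF produces -2147483648 rather than +2147483648, which is exactly what C code using a 32-bit `int` expects. That is why a compiler targeting RV64 uses the W-suffixed instructions for `int` arithmetic.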

There’s also RV32E, a reduced variant with only 16 registers instead of 32, designed for extremely small embedded systems where chip area is critical.

Standard Extensions

The standard extensions add functionality to the base ISA:

M Extension - Multiplication and Division: Adds integer multiply, divide, and remainder instructions. Essential for most applications but optional for the simplest embedded systems. The M extension adds 8 instructions in RV32 and 13 in RV64.

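A Extension - Atomic Instructions: Provides atomic memory operations for synchronization in multi-processor systems. Includes load-reserved/store-conditional (LR/SC) for lock-free algorithms and atomic memory operations (AMOs) such as atomic swap, add, AND, OR, XOR, min, and max. Compare-and-swap is not part of the A extension itself; it is built from an LR/SC loop or provided by the separate Zacas extension.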

F and D Extensions - Floating-Point: F adds single-precision (32-bit) floating-point, while D adds double-precision (64-bit). These extensions include arithmetic operations, comparisons, conversions, and a separate register file for floating-point values. They follow the IEEE 754 standard.

C Extension - Compressed Instructions: Adds 16-bit instruction encodings for common operations, improving code density by 25-30%. The C extension is particularly valuable for embedded systems with limited memory. Compressed instructions can be freely mixed with standard 32-bit instructions.

V Extension - Vector Processing: Provides vector operations for data parallelism. Unlike fixed-width SIMD (like ARM NEON or x86 AVX), RISC-V vectors are length-agnostic, allowing the same code to run efficiently on different vector lengths. This is similar to ARM’s SVE but with a cleaner design.

B Extension - Bit Manipulation: Adds instructions for common bit manipulation operations such as count leading zeros, rotate, and single-bit set, clear, and extract. These operations are common in cryptography, compression, and other algorithms.
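To see why bit manipulation instructions are worth adding, consider what count-leading-zeros and rotate look like when written in portable C. The sketch below uses the invented names `clz32` and `rol32`. On a core without the B extension, each call expands into a sequence of shifts, tests, and ORs; with it, a compiler can typically emit a single instruction for each.

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value.
   Returns 32 when x is 0 (no bit is set at all). */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {  /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}

/* Rotate left by sh positions; only the low 5 bits of sh are used. */
uint32_t rol32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;  /* avoid a shift by 32 */
}
```

For instance, `clz32(1)` is 31 and `rol32(0x12345678, 8)` is 0x34567812. The loop in `clz32` can run up to 31 iterations, which illustrates how much work a dedicated instruction saves in hash functions, compression codecs, and similar code.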

The combination of base ISA plus extensions is denoted by a string like “RV64IMAFD” or “RV32IMC”. The letter “G” is shorthand for IMAFD (general-purpose), so “RV64GC” means RV64IMAFD plus compressed instructions.

ISA Naming Convention

RISC-V uses a systematic naming convention to specify which features a processor implements:

RV32 or RV64 or RV128: Base integer ISA width
I: Base integer instruction set (always present)
M: Multiplication and division
A: Atomic instructions
F: Single-precision floating-point
D: Double-precision floating-point
C: Compressed instructions
V: Vector extension
G: Shorthand for IMAFD (general-purpose)

Additional letters indicate other extensions. The order of letters follows a standard sequence defined in the specification.

Examples:

RV32I: Minimal 32-bit processor, base ISA only
RV32IMC: 32-bit with multiply/divide and compressed instructions (common for microcontrollers)
RV64GC: 64-bit general-purpose processor with compressed instructions (common for application processors)
RV64GCV: 64-bit general-purpose with compressed and vector extensions

This naming convention makes it immediately clear what capabilities a processor has.
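The naming rules are regular enough to parse mechanically. The following sketch, with the hypothetical names `EXT_BIT` and `parse_isa_string`, turns a string such as "RV64GC" into a base width plus a bitmask of single-letter extensions. It is deliberately simplified: multi-letter extensions (Zicsr, Zba, and so on), underscores, and version numbers are not handled, and G is expanded to IMAFD as described in this chapter.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* One bit per single-letter extension: bit 0 = 'A', bit 8 = 'I', ... */
#define EXT_BIT(letter) (1u << ((letter) - 'A'))

/* Hypothetical helper: parse strings such as "RV64GC" or "rv32imc".
   Stores the base width in *xlen and returns the extension bitmask,
   or 0 if the string is malformed or names no extensions. */
uint32_t parse_isa_string(const char *s, int *xlen)
{
    uint32_t mask = 0;

    if (toupper((unsigned char)s[0]) != 'R' || toupper((unsigned char)s[1]) != 'V')
        return 0;
    if (strncmp(s + 2, "32", 2) == 0)       { *xlen = 32;  s += 4; }
    else if (strncmp(s + 2, "64", 2) == 0)  { *xlen = 64;  s += 4; }
    else if (strncmp(s + 2, "128", 3) == 0) { *xlen = 128; s += 5; }
    else return 0;

    for (; *s != '\0'; s++) {
        int c = toupper((unsigned char)*s);
        if (c < 'A' || c > 'Z')
            return 0;                        /* not a single-letter extension */
        if (c == 'G')                        /* G is shorthand for IMAFD */
            mask |= EXT_BIT('I') | EXT_BIT('M') | EXT_BIT('A')
                  | EXT_BIT('F') | EXT_BIT('D');
        else
            mask |= EXT_BIT(c);
    }
    return mask;
}
```

Parsing "RV64GC" sets the width to 64 and returns a mask with the I, M, A, F, D, and C bits set. A caller can then test a capability with an expression such as `mask & EXT_BIT('C')`.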

Figure 1.3: ISA Naming Convention Examples

Common ISA String Breakdown

ISA String	Components	Meaning
RV64GC	RV64	64-bit base integer ISA
	G	General-purpose = IMAFD
	- I	Base integer instructions
	- M	Multiply/Divide
	- A	Atomics
	- F	Single-precision float
	- D	Double-precision float
	C	Compressed instructions
RV32IMC	RV32	32-bit base integer ISA
	I	Base integer instructions
	M	Multiply/Divide
	C	Compressed instructions

Note: “G” is a shorthand for IMAFD, representing a general-purpose processor configuration.

1.5 RISC-V Profiles

The Profile Concept

As RISC-V matured, the flexibility of optional extensions created a challenge: how do software developers know what features they can rely on? A processor implementing just RV64I is very different from one implementing RV64GCV.

RISC-V Profiles solve this problem by defining standard combinations of extensions for specific use cases. A profile specifies:

Mandatory extensions that must be present
Optional extensions that may be present
Specific versions of each extension
Additional requirements (like privilege modes)

Software targeting a profile can assume all mandatory features are available, simplifying development and ensuring portability.
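In code, "the processor satisfies the profile" boils down to a subset test: every mandatory feature must be present, while optional features must still be probed before use. The sketch below illustrates the idea with a toy feature set. The `FEAT_*` flags, the `profile` struct, and `satisfies_profile` are all invented for this example; a real profile lists many more extensions, each with a specific version.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature flags only; not an official encoding. */
enum {
    FEAT_I = 1 << 0, FEAT_M = 1 << 1, FEAT_A = 1 << 2, FEAT_F = 1 << 3,
    FEAT_D = 1 << 4, FEAT_C = 1 << 5, FEAT_V = 1 << 6
};

typedef struct {
    const char *name;
    uint32_t mandatory;  /* every one of these must be implemented */
    uint32_t optional;   /* may be implemented; software must check first */
} profile;

/* A processor satisfies a profile when its feature set
   contains every mandatory feature of that profile. */
bool satisfies_profile(uint32_t implemented, const profile *p)
{
    return (implemented & p->mandatory) == p->mandatory;
}
```

A toy profile whose mandatory set is I, M, A, F, D, and C would accept a core that implements those six features with or without V, and reject a core that implements only I, M, and C. Software built for the profile can then use the mandatory features unconditionally.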

RVA22 Profile

The RVA22 profile (the 2022 edition of the application-processor profile family) targets processors capable of running rich operating systems like Linux. It comes in two variants:

RVA22U (Unprivileged): Specifies the user-mode ISA. Mandatory extensions include:

RV64I base ISA
M, A, F, D, C extensions (i.e., RV64GC)
Zicsr (CSR instructions)
Zicntr and Zihpm (counters and timers)
Various other Zextensions for specific functionality

RVA22S (Supervisor): Adds supervisor-mode requirements for OS support:

Sv39 virtual memory (39-bit virtual addresses)
Supervisor mode and required CSRs
SBI (Supervisor Binary Interface) support
Additional privilege-related extensions

A processor that complies with RVA22S provides the ISA features that standard Linux distributions and other Unix-like operating systems expect.

RVA23 Profile

RVA23 (ratified in October 2024) builds on RVA22 with additional features:

Vector extension (V) is mandatory
Additional bit manipulation instructions
Hypervisor extension for virtualization
Enhanced memory ordering features

RVA23 represents the next generation of application processors, with vector processing as a standard feature rather than an option.

Embedded Profiles

While RVA profiles target application processors, separate profiles have been proposed for embedded systems:

Microcontroller Profiles: Specify minimal feature sets for resource-constrained devices. These might require only RV32IMC with machine mode, omitting supervisor mode and virtual memory.

Real-Time Profiles: Add requirements for deterministic interrupt handling and timing, important for real-time operating systems.

The profile system allows RISC-V to serve diverse markets while maintaining clear compatibility boundaries. Software developers can target a profile rather than trying to support every possible combination of extensions.

1.6 RISC-V vs ARM vs MIPS

Historical Context

To understand RISC-V’s place in the architecture landscape, it’s helpful to compare it with its RISC predecessors:

MIPS (1985): The pioneer of commercial RISC, MIPS established many principles that RISC-V follows. Its clean design made it popular for education and embedded systems. However, MIPS fragmented into multiple variants (MIPS I through MIPS V, plus MIPS32/MIPS64), and its proprietary nature limited adoption. The MIPS Open initiative, announced in late 2018, made the ISA available without licensing fees, but the program was discontinued within about a year, and by then RISC-V had captured the momentum.

ARM (1985): Starting as a simple RISC design, ARM evolved into a complex architecture with numerous extensions and profiles. Its focus on power efficiency made it dominant in mobile devices. However, ARM remains proprietary, requiring licenses and royalties. The architecture has accumulated considerable complexity over roughly four decades of evolution.

RISC-V (2010): Learning from both predecessors, RISC-V combines MIPS’s clean design philosophy with ARM’s practical focus on real-world applications, while adding the crucial element of openness. It avoids the fragmentation that plagued MIPS and the complexity that accumulated in ARM.

ISA Complexity Comparison

Instruction Count: RISC-V’s base RV32I has 47 instructions. MIPS32 has about 60 base instructions. ARM’s instruction count is harder to pin down due to numerous variants, but ARMv8-A has hundreds of instructions when including all extensions.

Encoding Formats: RISC-V uses 6 basic instruction formats, all 32 bits (plus 16-bit compressed formats). MIPS uses 3 formats, all 32 bits. ARM uses variable-length encoding in Thumb mode and fixed 32-bit in ARM mode, with complex encoding rules.

Addressing Modes: RISC-V supports base+offset for memory access, with offsets computed explicitly for complex addressing. MIPS is similar. ARM supports more complex addressing modes including auto-increment and scaled indexing.

The trend is clear: RISC-V is the simplest, MIPS is moderately complex, and ARM is the most complex. This simplicity makes RISC-V easier to implement, verify, and teach.

Licensing and Ecosystem

RISC-V: Completely open and free. No licenses required, no royalties, no restrictions. Anyone can implement, modify, or extend RISC-V. The specification is publicly available.

ARM: Proprietary and licensed. Architecture licenses cost millions of dollars. Per-chip royalties apply. Modifications require permission. However, ARM offers extensive IP, tools, and ecosystem support.

MIPS: Historically proprietary. Wave Computing launched the MIPS Open program in 2019 but discontinued it later that year, and MIPS lacks the momentum and ecosystem of RISC-V.

The licensing difference is fundamental. RISC-V enables innovation and competition that proprietary ISAs cannot match.

Use Cases and Adoption

Embedded Systems: RISC-V is rapidly gaining share in microcontrollers and embedded processors. Its modularity allows implementations tailored to specific needs. Western Digital, for example, has publicly committed to moving its storage controllers, which ship in volumes of more than a billion cores per year, to RISC-V.

Application Processors: ARM dominates mobile devices, but RISC-V is emerging in this space. SiFive and others are developing high-performance RISC-V cores. The open nature appeals to companies wanting to avoid ARM licensing.

High-Performance Computing: RISC-V is being explored for HPC accelerators and specialized processors. The vector extension makes it competitive for data-parallel workloads.

Education and Research: RISC-V has become the architecture of choice for teaching and research, displacing MIPS in many universities.

Future Outlook

RISC-V’s trajectory is clear: rapid growth driven by openness, simplicity, and industry support. It won’t displace ARM in smartphones overnight, but it’s becoming the default choice for new designs where licensing costs and flexibility matter.

The comparison with MIPS is particularly instructive. MIPS had technical merit but remained proprietary too long. By the time it opened, RISC-V had captured the open ISA mindshare. ARM’s technical excellence and ecosystem are formidable, but its proprietary nature creates opportunities for RISC-V.

RISC-V represents not just a new ISA, but a new model for processor architecture: open, collaborative, and free from vendor lock-in. This model is proving compelling for the next generation of computing.

🛠️ Hands-on Lab: Lab 1.1 — Hello RISC-V World

Now let’s get our hands dirty! In this lab, you’ll install the RISC-V toolchain and run your first program.

Lab Objectives

Understand what a Cross-Compiler is
Install the riscv64-unknown-elf-gcc toolchain
Successfully compile and emulate your first C program

Environment Setup

Choose one of the following methods to install the RISC-V toolchain:

Option A: Using xPack Pre-built Packages (Recommended for Beginners)

# Download xPack RISC-V GNU Toolchain
# https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases

# Linux x64 example (adjust version and platform as needed)
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
tar xzf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
export PATH=$PWD/xpack-riscv-none-elf-gcc-14.2.0-3/bin:$PATH

# Verify installation
riscv-none-elf-gcc --version

Option B: Using Docker (Cross-platform Consistency)

# Use pre-built Docker image
docker pull riscv/riscv-gnu-toolchain
docker run -it -v $(pwd):/work riscv/riscv-gnu-toolchain bash

Option C: Ubuntu/Debian Package Manager

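sudo apt update
# qemu-user provides qemu-riscv64, the user-mode emulator used later in this lab
sudo apt install gcc-riscv64-unknown-elf qemu-user

# Verify installation
riscv64-unknown-elf-gcc --version
qemu-riscv64 --version
# Note: some distributions package this compiler without a C library.
# If <stdio.h> is not found when compiling, use Option A or B instead.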

Write the Program

Create a simple hello.c:

// hello.c
#include <stdio.h>

int main() {
    printf("Hello, RISC-V World!\n");
    printf("This is my first RISC-V program!\n");
    return 0;
}

Compile and Run

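# Compile (using ELF format supported by Proxy Kernel)
# With the xPack toolchain from Option A, use the riscv-none-elf-gcc prefix instead.
# If its default target is 32-bit, add: -march=rv64gc -mabi=lp64d
riscv64-unknown-elf-gcc -o hello hello.c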

# Check file format
file hello
# Output should be: hello: ELF 64-bit LSB executable, UCB RISC-V, ...

# Run using QEMU User Mode (requires qemu-riscv64)
qemu-riscv64 hello

# Or run using Spike + Proxy Kernel
spike pk hello

Expected Output:

Hello, RISC-V World!
This is my first RISC-V program!

What You Just Did

Congratulations! You’ve completed three important steps:

Cross-Compilation: Used a compiler running on x86 to generate code for RISC-V
Emulation: Used QEMU or Spike to execute RISC-V instructions on your x86 machine
Toolchain Validation: Confirmed that the complete toolchain (compiler, linker, emulator) is working

danieRTOS Reference: This lab establishes the foundation for all subsequent labs. The same toolchain setup is used in danieRTOS development.

Extended Challenge (Optional)

Try using objdump to view the compiled assembly:

riscv64-unknown-elf-objdump -d hello | head -50

You’ll see output similar to this:

0000000000010000 <_start>:
   10000:       00001197                auipc   gp,0x1
   10004:       80018193                addi    gp,gp,-2048
   ...

These are RISC-V instructions! In subsequent chapters, we’ll learn how to read these instructions.

💡 Note on Cross-Compilation:

Host: The computer you’re using (e.g., x86 Linux)

Target: The platform you’re compiling for (e.g., RISC-V)

The Cross-Compiler’s job is to generate Target code on the Host
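One way to see the Host/Target distinction from inside a program is to ask the compiler which target it is generating code for. RISC-V compilers predefine the macros `__riscv` and `__riscv_xlen`; the small function below (the name `target_xlen` is made up for this sketch) reports the register width when built by a RISC-V cross-compiler and 0 when built by an ordinary host compiler.

```c
/* Returns the register width (XLEN) of the compilation target,
   or 0 when the target is not RISC-V, for example when this file
   is built with the host's native compiler. */
int target_xlen(void)
{
#if defined(__riscv) && defined(__riscv_xlen)
    return __riscv_xlen;   /* 32 or 64, chosen by -march / -mabi */
#else
    return 0;              /* compiled for some other architecture */
#endif
}
```

If you print the result from `main`, the same source file gives different answers depending on which compiler built it: the native gcc on an x86 machine yields 0, while riscv64-unknown-elf-gcc yields 64 for a 64-bit target. The decision is made at compile time on the Host, not at run time on the Target.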

Summary

RISC-V builds on 30 years of RISC architecture evolution, learning from the successes and mistakes of MIPS, SPARC, ARM, and PowerPC. The RISC revolution of the 1980s demonstrated that simple, regular instruction sets could outperform complex ones, establishing principles that RISC-V follows today: load-store architecture, fixed-length instructions, simple addressing modes, and large register files.

The open ISA movement addresses fundamental problems with proprietary architectures: licensing costs, restrictions on modification, fragmentation, and vendor lock-in. RISC-V provides a completely open and free instruction set architecture, governed by RISC-V International through a collaborative model that ensures vendor neutrality. Anyone can implement, modify, or extend RISC-V without fees or licenses.

RISC-V’s design philosophy emphasizes simplicity (just 47 instructions in the base ISA), modularity (optional extensions for specific needs), extensibility (custom instructions without breaking compatibility), and stability (frozen base ISA that will never change). This modular approach allows implementations tailored to specific applications, from RV32I-only microcontrollers to RV64GCV high-performance processors.

The standard extensions add functionality as needed: M for multiplication and division, A for atomic operations, F and D for floating-point, C for compressed instructions, V for vector processing, and B for bit manipulation. Profiles like RVA22 and RVA23 define standard combinations of extensions for specific use cases, ensuring software portability while preserving flexibility.

Compared to ARM and MIPS, RISC-V is simpler (fewer instructions, cleaner encoding), more modern (no legacy baggage), and completely open (no licensing fees or restrictions). While ARM dominates mobile devices and MIPS has declined, RISC-V is rapidly gaining adoption in embedded systems, emerging in application processors, and becoming the architecture of choice for education and research.

RISC-V represents not just a new ISA, but a new model for processor architecture: open, collaborative, and free from vendor lock-in. This openness, combined with technical excellence and industry support, positions RISC-V as the architecture for the next generation of computing.

Chapter 2. Programmer’s Model & Register Set

Part II — Programmer’s Model

🎯 Learning Objectives

After reading this chapter, you will be able to:

Memorize the 32 General-Purpose Registers: Know x0-x31 and their ABI aliases (a0, s0, sp, ra…)
Understand the Calling Convention: Master the responsibilities of Caller-saved vs Callee-saved registers
Mixed Programming Ability: Write mixed C and Assembly code and understand how they interact
CSR Basics: Know the purpose and access methods of Control and Status Registers
Privilege Level Concepts: Understand the permission differences between M/S/U modes

💡 Scenario: Sticky Notes and the Warehouse

Scene: Junior’s screen displays a dense wall of objdump output.

Junior: “Senior, I’m losing my mind. The documentation says arguments go in a0, a1, but the disassembled code shows x10, x11. And are x1 and ra even the same thing?”

Senior: “Ha, this is a rite of passage for every newbie. The hardware only knows x0 through x31—like street addresses. But to make it easier for us humans to write programs, we’ve established a set of ‘rules’ called the ABI (Application Binary Interface) that gives them nicknames.”

Junior: “Nicknames?”

Senior: “Yes. Imagine you’re repairing a watch. Registers are like the workbench right in front of you—limited space, only 32 parts can fit, but you can grab them instantly, super fast. Memory is like the big warehouse behind you—huge capacity but slow to fetch things.”

Junior: “So what about names like a0?”

Senior: “Those are the workbench’s designated zones.

a0-a7 (Arguments): The ‘mail room.’ Others put parts here for you to work on, and you put finished parts back here for them.
t0-t6 (Temporaries): The ‘scratch area.’ Use it however you want, toss things around—nobody cares.
s0-s11 (Saved): The ‘reserved zone.’ If you borrow this space, you must first save whatever was there, then restore it when you’re done; otherwise the previous user won’t find their stuff.”

Junior: “I see! What about ra (Return Address)?”

Senior: “That’s ‘the way home.’ When a function finishes, it needs to know which line of code to jump back to. Come on, let’s write a program and trace the changes on this ‘workbench’ using a simulator.”

Understanding a processor architecture begins with understanding its programmer’s model: the registers, instructions, and conventions that software uses to interact with hardware. RISC-V’s programmer’s model is clean, regular, and designed for both simplicity and efficiency.

This chapter explores the fundamental elements that every RISC-V programmer must know. We’ll examine the 32 general-purpose registers and their conventional uses, the Control and Status Registers (CSRs) that manage processor state, the privilege levels that separate user code from operating system code, and the calling convention that enables functions to work together. We’ll see how RISC-V’s design choices—like the zero register, separate CSR address space, and clean privilege model—simplify both hardware implementation and software development.

2.1 General-Purpose Registers

The Register File

RISC-V provides 32 general-purpose registers, numbered x0 through x31. Each register is XLEN bits wide, where XLEN is 32 for RV32, 64 for RV64, and 128 for RV128. This discussion focuses on RV64, the most common variant for application processors.

The 32-register design follows RISC tradition. It’s large enough to keep frequently used values in fast registers rather than slow memory, but small enough to implement efficiently in hardware. The register file needs multiple read and write ports to support instruction execution, and size directly impacts chip area and access time.

Unlike some architectures, RISC-V’s registers are truly general-purpose. There are no special restrictions on most registers—any register can be used as a source or destination for most instructions. This orthogonality simplifies both hardware implementation and compiler design.

Register x0: The Zero Register

Register x0 is special: it always reads as zero, and writes to it are discarded. This might seem wasteful, but it’s remarkably useful.

The zero register enables several common operations without dedicated instructions:

NOP (no operation): ADDI x0, x0, 0 adds zero to zero and stores in x0 (which discards the result)
Move: ADDI x1, x2, 0 adds zero to x2 and stores in x1, effectively copying x2 to x1
Load immediate: ADDI x1, x0, 42 adds 42 to zero, loading the constant 42 into x1
Unconditional branch: BEQ x0, x0, target branches if x0 equals x0 (always true)

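The zero register also simplifies the instruction set and the hardware. Because a constant zero is always available as an operand, and x0 can absorb any result that should be thrown away, operations such as comparing against zero, negating a value, or jumping without saving a return address need no dedicated opcodes. The assembler provides them as pseudo-instructions (nop, mv, li, j, and others) that expand to ordinary instructions using x0.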
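A tiny software model shows how little machinery the zero register needs. The sketch below is a simplified register file for an RV64 simulator; `regfile`, `reg_read`, `reg_write`, and `exec_addi` are hypothetical names for this illustration, not code from an existing emulator.

```c
#include <stdint.h>

typedef struct {
    uint64_t x[32];   /* storage for x0..x31; slot 0 is never used */
} regfile;

/* Reading x0 always yields zero, whatever was "written" to it. */
uint64_t reg_read(const regfile *rf, unsigned n)
{
    n &= 31u;
    return n == 0 ? 0 : rf->x[n];
}

/* Writes to x0 are silently discarded. */
void reg_write(regfile *rf, unsigned n, uint64_t v)
{
    n &= 31u;
    if (n != 0)
        rf->x[n] = v;
}

/* ADDI rd, rs1, imm: one instruction that also serves as NOP, MV, and LI. */
void exec_addi(regfile *rf, unsigned rd, unsigned rs1, int64_t imm)
{
    reg_write(rf, rd, reg_read(rf, rs1) + (uint64_t)imm);
}
```

With this model, `exec_addi(&rf, 1, 0, 42)` behaves like li x1, 42, `exec_addi(&rf, 2, 1, 0)` behaves like mv x2, x1, and `exec_addi(&rf, 0, 1, 7)` changes nothing that can be observed. All three are handled by the same code path.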

Standard Register Names (ABI)

While the hardware knows registers as x0-x31, software uses symbolic names defined by the Application Binary Interface (ABI). These names indicate each register’s conventional use:

Register	ABI Name	Description	Saved by
x0	zero	Hard-wired zero	—
x1	ra	Return address	Caller
x2	sp	Stack pointer	Callee
x3	gp	Global pointer	—
x4	tp	Thread pointer	—
x5-x7	t0-t2	Temporaries	Caller
x8	s0/fp	Saved register / Frame pointer	Callee
x9	s1	Saved register	Callee
x10-x11	a0-a1	Function arguments / Return values	Caller
x12-x17	a2-a7	Function arguments	Caller
x18-x27	s2-s11	Saved registers	Callee
x28-x31	t3-t6	Temporaries	Caller

These names are conventions, not hardware requirements. The processor doesn’t enforce them—you could use sp for arithmetic if you wanted (though your program would likely crash). But following the ABI ensures that code from different compilers and libraries can interoperate.

The 32 general-purpose registers are organized with their ABI names and conventional usage. Registers are categorized as: zero register (x0), special-purpose registers (ra, sp, gp, tp), caller-saved temporaries and arguments (t0-t6, a0-a7), and callee-saved registers (s0-s11). The complete register file organization with ABI names is shown in the table above.
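When reading disassembly it helps to have the number-to-name mapping in executable form. The lookup table below mirrors the ABI table in this section; the identifiers `abi_names` and `abi_name` are made up for this sketch.

```c
#include <stddef.h>

/* ABI names indexed by register number, x0 through x31. */
static const char *const abi_names[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0",   "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6",   "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8",   "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

/* Returns the ABI name of register x<n>, or NULL if n is out of range. */
const char *abi_name(unsigned n)
{
    return n < 32 ? abi_names[n] : NULL;
}
```

So `abi_name(10)` is "a0" and `abi_name(1)` is "ra", which answers Junior's question from the opening scene: x10 and a0, or x1 and ra, are two spellings of the same physical register.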

Caller-Saved vs Callee-Saved

The ABI divides registers into two categories based on who preserves their values across function calls:

Caller-Saved Registers (ra, t0-t6, a0-a7): The calling function must save these if it needs their values after the call. The called function can freely modify them. These are used for the return address, temporary values, and function arguments.

Callee-Saved Registers (s0-s11, sp): The called function must preserve these. If it uses them, it must save their values on entry and restore them before returning. These are used for values that must survive across function calls.

This division optimizes the common case. Temporary values don’t need to be saved if they’re not used after a call. Long-lived values are automatically preserved across calls.
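The split can be written down as two small predicates over register numbers, following the "Saved by" column of the ABI table. The function names `is_callee_saved` and `is_caller_saved` are invented for this sketch. Note that x0, gp, and tp belong to neither group, because ordinary functions do not modify them.

```c
#include <stdbool.h>

/* True when a called function must preserve register x<n>:
   sp (x2), s0-s1 (x8-x9), and s2-s11 (x18-x27). */
bool is_callee_saved(unsigned n)
{
    return n == 2 || n == 8 || n == 9 || (n >= 18 && n <= 27);
}

/* True when the caller must save register x<n> itself if it still
   needs the value after a call: ra (x1), t0-t2 (x5-x7),
   a0-a7 (x10-x17), and t3-t6 (x28-x31). */
bool is_caller_saved(unsigned n)
{
    return n == 1 || (n >= 5 && n <= 7) || (n >= 10 && n <= 17)
        || (n >= 28 && n <= 31);
}
```

A tool that checks hand-written assembly could use predicates like these to flag a function that overwrites s2 (x18) without saving it first, while accepting one that overwrites t0 (x5) freely.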

Special-Purpose Registers

Several registers have special conventional uses:

ra (x1) - Return Address: Stores the return address for function calls. The JAL and JALR instructions (jump-and-link) write the address of the following instruction to their destination register, which by convention is ra for a function call. The function returns by jumping to the address in ra.

sp (x2) - Stack Pointer: Points to the top of the stack. The stack grows downward (toward lower addresses) by convention. Functions allocate stack space by subtracting from sp and deallocate by adding to sp.

gp (x3) - Global Pointer: Points to the middle of a 4KB region of global variables. This allows accessing globals with a single load/store instruction using a 12-bit signed offset (±2KB from gp). The linker sets up gp, and it remains constant during execution.

tp (x4) - Thread Pointer: Points to thread-local storage in multi-threaded programs. Each thread has its own tp value, allowing efficient access to thread-specific data.

fp (x8) - Frame Pointer: An alias for s0, used to point to the current stack frame. Some code uses fp to access local variables and function arguments, while sp may change during function execution. Other code omits the frame pointer to free up another register.
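The ±2KB reach of gp follows directly from the 12-bit signed offset in load and store instructions. As a rough sketch of the check a linker has to make before using gp-relative addressing, the function below (with the invented name `gp_reachable`) tests whether an address lies within that window.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr can be reached from gp with a single load or store,
   i.e. when the displacement fits the 12-bit signed immediate,
   whose range is -2048 to +2047 bytes. */
bool gp_reachable(uint64_t gp, uint64_t addr)
{
    int64_t offset = (int64_t)(addr - gp);
    return offset >= -2048 && offset <= 2047;
}
```

With gp set to 0x11800, every address from 0x11000 through 0x11FFF is reachable, which is the 4KB region described above. A variable at 0x12000 is one byte too far and needs a longer address-forming sequence.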

Register Usage in Practice

Understanding register usage is crucial for reading assembly code and understanding compiler output. Here’s a typical function call sequence:

# Caller prepares arguments
li a0, 10          # First argument in a0
li a1, 20          # Second argument in a1

# Caller saves any needed temporaries
addi sp, sp, -16   # Allocate stack space (keeps sp 16-byte aligned)
sd t0, 0(sp)       # Save t0 if needed after call

# Call function
jal ra, my_func    # Jump to my_func, save return address in ra

# Caller restores temporaries
ld t0, 0(sp)       # Restore t0
addi sp, sp, 16    # Release the stack space

# Result is in a0
mv s0, a0          # Save result to callee-saved register

Inside the called function:

my_func:
    # Prologue: allocate stack frame
    addi sp, sp, -32   # Allocate 32 bytes
    sd ra, 24(sp)      # Save return address
    sd s0, 16(sp)      # Save s0 if we'll use it
    
    # Function body uses a0, a1 (arguments)
    add s0, a0, a1     # Use s0 for computation
    
    # Prepare return value
    mv a0, s0          # Return value in a0
    
    # Epilogue: restore and return
    ld s0, 16(sp)      # Restore s0
    ld ra, 24(sp)      # Restore return address
    addi sp, sp, 32    # Deallocate stack frame
    ret                # Return (pseudo-instruction for jalr x0, 0(ra))

This pattern—prologue, body, epilogue—is standard for RISC-V functions. The prologue saves registers and allocates stack space. The body performs the computation. The epilogue restores registers and returns.
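For comparison, here is roughly the C source that the assembly above corresponds to. This is a sketch for orientation only: the function name `caller` is made up, and an optimizing compiler would probably produce something much shorter than the hand-written assembly (for a leaf function this small it may skip the stack frame entirely, or inline the call).

```c
/* Two integer arguments arrive in a0 and a1; the result goes back in a0. */
long my_func(long a, long b)
{
    return a + b;
}

long caller(void)
{
    /* Corresponds to: li a0, 10 ; li a1, 20 ; jal ra, my_func */
    long result = my_func(10, 20);

    /* The value is read back from a0 after the call returns. */
    return result;
}
```

Compiling this with the -S option of riscv64-unknown-elf-gcc and comparing the output at -O0 and -O2 is a good way to see which parts of the prologue and epilogue are actually required and which the compiler can remove.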

2.2 Control and Status Registers (CSRs)

CSR Overview

Beyond the 32 general-purpose registers, RISC-V defines a separate address space for Control and Status Registers (CSRs). These registers control processor behavior, report status, and provide access to privileged functionality.

CSRs are accessed using dedicated instructions (CSRRW, CSRRS, CSRRC, and their immediate variants) rather than normal load/store instructions. Each CSR has a 12-bit address, allowing up to 4,096 CSRs, though only a fraction are currently defined.

The CSR address space is partitioned by privilege level and read/write access:

Bits [11:10] indicate whether the CSR is read/write (00, 01, or 10) or read-only (11)
Bits [9:8] encode the lowest privilege level that can access the CSR
Bits [7:0] identify the specific register

This encoding allows the hardware to quickly check access permissions. Attempting to access a CSR from insufficient privilege level or writing to a read-only CSR causes an illegal instruction exception.
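The permission check can be sketched in C. The code follows the address layout in the RISC-V privileged specification, where bits [11:10] equal to 11 mark a read-only CSR and bits [9:8] give the lowest privilege level allowed to access it. The names `csr_priv`, `csr_min_priv`, `csr_is_read_only`, and `csr_access_ok` are invented for this illustration, and real hardware applies further checks (for example, counter-enable bits) that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Privilege encodings as used in CSR address bits [9:8]. */
typedef enum { PRIV_U = 0, PRIV_S = 1, PRIV_H = 2, PRIV_M = 3 } csr_priv;

/* Lowest privilege level that may access the CSR: address bits [9:8]. */
csr_priv csr_min_priv(uint16_t addr)
{
    return (csr_priv)((addr >> 8) & 0x3u);
}

/* A CSR is read-only when address bits [11:10] are both 1. */
bool csr_is_read_only(uint16_t addr)
{
    return ((addr >> 10) & 0x3u) == 0x3u;
}

/* Simplified access check; a false result corresponds to an
   illegal instruction exception. */
bool csr_access_ok(uint16_t addr, csr_priv current, bool is_write)
{
    if ((unsigned)current < (unsigned)csr_min_priv(addr))
        return false;   /* insufficient privilege */
    if (is_write && csr_is_read_only(addr))
        return false;   /* write to a read-only CSR */
    return true;
}
```

Applied to the CSRs in this chapter: mstatus (0x300) requires Machine mode and is writable, sstatus (0x100) requires at least Supervisor mode, and cycle (0xC00) is readable from User mode but can never be written. Both checks need only four address bits, so the hardware can evaluate them without looking up the register itself.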

Figure 2.1: CSR Address Space Organization

graph TB
    subgraph "CSR Address Space (12-bit)"
        subgraph "Bits [11:10]: Read/Write"
            RW[00: Read/Write<br/>01: Read/Write<br/>10: Read/Write<br/>11: Read-Only]
        end

        subgraph "Bits [9:8]: Lowest Privilege Level"
            M[00: User<br/>01: Supervisor<br/>10: Hypervisor<br/>11: Machine]
        end

        subgraph "Bits [7:0]: Register ID"
            ID[256 possible registers<br/>per privilege level]
        end
    end

    subgraph "Example CSRs"
        MSTATUS[mstatus: 0x300<br/>Machine Status]
        SSTATUS[sstatus: 0x100<br/>Supervisor Status]
        CYCLE[cycle: 0xC00<br/>Cycle Counter Read-Only]
    end

    M --> MSTATUS
    M --> SSTATUS
    RW --> CYCLE

    style M fill:#FFB6C1
    style RW fill:#87CEEB
    style ID fill:#90EE90

Machine-Level CSRs

Machine mode is the highest privilege level in RISC-V, with access to all CSRs. Key machine-level CSRs include:

mstatus (Machine Status): Controls and reports various aspects of processor state:

MIE: Machine Interrupt Enable (global interrupt enable for M-mode)
MPIE: Previous MIE value (saved when taking a trap)
MPP: Previous privilege mode (saved when taking a trap)
MPRV: Modify Privilege (affects memory access privilege)
Various extension enable bits (FS for floating-point, VS for vector)

misa (Machine ISA): Indicates which extensions are implemented. Each bit corresponds to an extension (bit 0 = A extension, bit 12 = M extension, etc.). This register allows software to detect available features. On some implementations, misa is read-only; on others, it can be written to enable/disable extensions dynamically.

mie (Machine Interrupt Enable): Controls which interrupts are enabled. Each bit corresponds to an interrupt source (software interrupt, timer interrupt, external interrupt). Even if a bit is set in mie, interrupts are only taken if MIE in mstatus is also set.

mip (Machine Interrupt Pending): Indicates which interrupts are pending. Hardware sets bits when interrupts arrive; software can read mip to determine which interrupts are waiting.

mtvec (Machine Trap Vector): Specifies the address of the trap handler. The low 2 bits select the mode:

0 (Direct): All traps jump to the same address (mtvec & ~0x3)
1 (Vectored): Interrupts jump to (mtvec & ~0x3) + 4×cause, exceptions jump to (mtvec & ~0x3)

mepc (Machine Exception PC): Stores the program counter of the instruction that caused the trap (for exceptions) or the instruction to resume after handling an interrupt. The trap handler returns with MRET, which copies mepc back into the PC.

mcause (Machine Cause): Indicates what caused the trap. The high bit distinguishes interrupts (1) from exceptions (0). The low bits encode the specific cause (e.g., 2 = illegal instruction, 11 = environment call from M-mode).
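
Taken together, mtvec and mcause determine where a trap lands. The sketch below computes that address for RV64, assuming the interrupt flag is bit 63 of mcause; the function name trap_target is illustrative, and the reserved mtvec modes (2 and 3) are simply treated like direct mode here.

```c
#include <stdint.h>

#define MCAUSE_INTERRUPT (UINT64_C(1) << 63)   /* RV64: top bit of mcause */

/* Hypothetical helper: address the hart jumps to when taking a trap into M-mode. */
static inline uint64_t trap_target(uint64_t mtvec, uint64_t mcause)
{
    uint64_t base = mtvec & ~UINT64_C(0x3);       /* strip the mode bits */
    uint64_t mode = mtvec & UINT64_C(0x3);
    uint64_t code = mcause & ~MCAUSE_INTERRUPT;   /* cause code without the flag */

    if (mode == 1 && (mcause & MCAUSE_INTERRUPT))
        return base + 4 * code;   /* vectored: one 4-byte slot per interrupt */
    return base;                  /* direct mode, or any exception */
}
```

With mtvec = 0x80000001 (vectored), a machine timer interrupt (code 7) lands at 0x8000001C, while an illegal instruction exception (code 2) still lands at 0x80000000.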

mtval (Machine Trap Value): Provides additional information about the trap. For address-related exceptions (like page faults), mtval contains the faulting address. For illegal instruction exceptions, it may contain the instruction itself.

mscratch (Machine Scratch): A general-purpose register for machine-mode software. Typically used to save a register temporarily when entering a trap handler, before the handler has set up its stack.

Supervisor-Level CSRs

Supervisor mode is intended for operating systems. It has its own set of CSRs, analogous to the machine-level ones:

sstatus: A restricted view of mstatus, showing only fields relevant to supervisor mode (SIE, SPIE, SPP, etc.). Writing sstatus actually modifies the corresponding fields in mstatus.

sie, sip: Supervisor interrupt enable and pending registers, similar to mie/mip but for supervisor-level interrupts.

stvec, sepc, scause, stval, sscratch: Supervisor versions of the trap-handling CSRs, used when traps are delegated to supervisor mode.

satp (Supervisor Address Translation and Protection): Controls virtual memory:

MODE: Selects the address translation scheme (Bare, Sv39, Sv48, etc.)
ASID: Address Space Identifier for TLB tagging
PPN: Physical page number of the root page table

The satp register is crucial for virtual memory. Writing to satp can change the address translation mode or switch to a different page table, enabling context switches between processes.
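
To make the three fields concrete, here is a minimal C sketch that packs an RV64 satp value. It assumes the RV64 layout from the privileged specification (MODE in bits [63:60], ASID in bits [59:44], PPN in bits [43:0]) and 4 KiB pages; the helper name make_satp is an assumption for this example.

```c
#include <stdint.h>

/* MODE field encodings on RV64. */
#define SATP_MODE_BARE UINT64_C(0)
#define SATP_MODE_SV39 UINT64_C(8)
#define SATP_MODE_SV48 UINT64_C(9)

/* Hypothetical helper: build a satp value from a mode, an ASID, and the
 * physical address of the root page table (assumed 4 KiB aligned). */
static inline uint64_t make_satp(uint64_t mode, uint64_t asid, uint64_t root_table_pa)
{
    uint64_t ppn = root_table_pa >> 12;                  /* physical page number */
    return (mode << 60)
         | ((asid & UINT64_C(0xFFFF)) << 44)
         | (ppn & ((UINT64_C(1) << 44) - 1));
}
```

A context switch in a kernel would typically write such a value to satp and then issue SFENCE.VMA so that stale translations are not used.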

User-Level CSRs

User mode has access to a limited set of CSRs, primarily for performance monitoring and floating-point control:

fflags, frm, fcsr: Floating-point exception flags, rounding mode, and combined control/status register. These allow user code to control floating-point behavior and detect exceptions.

cycle, time, instret: Performance counters accessible from user mode (if not disabled by supervisor/machine mode). These provide the number of cycles elapsed, current time, and instructions retired, useful for profiling and timing.

CSR Instructions

RISC-V provides six CSR manipulation instructions:

CSRRW rd, csr, rs1 (CSR Read-Write): Atomically swap the value in csr with the value in rs1, writing the old CSR value to rd. If rd is x0, the read is suppressed (useful for write-only access).

CSRRS rd, csr, rs1 (CSR Read-Set): Read csr into rd, then set bits in csr corresponding to 1 bits in rs1. If rs1 is x0, this is a read-only operation.

CSRRC rd, csr, rs1 (CSR Read-Clear): Read csr into rd, then clear bits in csr corresponding to 1 bits in rs1.

CSRRWI, CSRRSI, CSRRCI: Immediate variants that use a 5-bit zero-extended immediate (0-31) instead of a register, so set and clear operations can only reach the low five bits of a CSR.

These instructions are atomic, ensuring that CSR modifications aren’t interrupted. The read-set and read-clear operations are particularly useful for manipulating individual bits without affecting others.
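
The read-modify-write behavior of the three register forms can be modeled in C. This is a simplified sketch that treats a CSR as a plain 64-bit variable; it ignores read-only fields, side effects, and the x0 special cases, and the function names merely mirror the mnemonics.

```c
#include <stdint.h>

/* Each helper returns the old CSR value, which is what the instruction
 * writes to rd. The CSR itself is modeled as an ordinary variable. */

static inline uint64_t csrrw(uint64_t *csr, uint64_t rs1)
{
    uint64_t old = *csr;
    *csr = rs1;               /* replace the whole register */
    return old;
}

static inline uint64_t csrrs(uint64_t *csr, uint64_t rs1)
{
    uint64_t old = *csr;
    *csr = old | rs1;         /* set the bits that are 1 in rs1 */
    return old;
}

static inline uint64_t csrrc(uint64_t *csr, uint64_t rs1)
{
    uint64_t old = *csr;
    *csr = old & ~rs1;        /* clear the bits that are 1 in rs1 */
    return old;
}
```

Because each operation reads and writes in one step, setting or clearing a single bit never disturbs the neighboring fields.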

Example: Enabling machine-mode interrupts:

# Set MIE bit in mstatus
li t0, 0x8              # MIE is bit 3
csrrs zero, mstatus, t0 # Set bit 3, discard old value

Example: Saving and modifying a CSR:

# Save current mstatus and disable interrupts
csrrci t0, mstatus, 0x8 # Clear MIE, save old value in t0
# ... critical section ...
csrw mstatus, t0        # Restore original mstatus

The CSR instructions provide controlled access to privileged state, enabling operating systems and firmware to manage the processor while preventing user code from interfering with system operation.

2.3 Program State and Privilege Levels

Privilege Modes

RISC-V defines three privilege levels, from lowest to highest:

User Mode (U-mode): The least privileged level, intended for application code. User mode cannot access most CSRs or execute privileged instructions. It typically runs with virtual memory enabled, isolating processes from each other and from the OS.

Supervisor Mode (S-mode): Intended for operating systems. Supervisor mode can manage virtual memory, handle traps delegated from machine mode, and access supervisor-level CSRs. It cannot access machine-level CSRs or certain privileged operations reserved for firmware.

Machine Mode (M-mode): The highest privilege level, intended for firmware and bootloaders. Machine mode has unrestricted access to all hardware resources. It can access all CSRs, execute all instructions, and delegate traps to lower privilege levels.

Not all implementations support all modes. A simple embedded system might implement only M-mode. A microcontroller might implement M-mode and U-mode. A full application processor implements all three modes.

The current privilege level is not stored in a dedicated register. Instead, it’s implicit in the processor state and can be inferred from CSRs like mstatus.MPP (previous privilege) after a trap.

Privilege Level Transitions

Transitions between privilege levels occur through well-defined mechanisms:

Trap to Higher Privilege: When an exception occurs or an interrupt arrives, the processor traps to a higher privilege level (or stays at the same level). The trap handler is determined by the xtvec CSR (mtvec for M-mode, stvec for S-mode). The processor saves the current PC in xepc, the cause in xcause, and additional information in xtval.

Return from Trap: The MRET and SRET instructions return from a trap, restoring the privilege level from xstatus.xPP and the PC from xepc. These instructions are privileged: MRET can only be executed in M-mode, SRET in S-mode or higher. (URET belonged to the user-level interrupt "N" extension, which has since been removed from the specification.)

Environment Call: The ECALL instruction explicitly requests a trap to a higher privilege level. User code uses ECALL to invoke OS services (system calls). OS code uses ECALL to invoke firmware services (SBI calls). The trap handler examines the calling context to determine which service was requested.

This controlled transition mechanism ensures that privilege escalation only occurs through defined entry points, maintaining system security.
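
The state changes on trap entry and on MRET can be sketched as a small C model. It assumes direct mtvec mode and a hart that implements U-mode, and it leaves out mtval and MPRV handling; the struct and function names are illustrative. The mstatus bit positions (MIE = 3, MPIE = 7, MPP = [12:11]) follow the privileged specification.

```c
#include <stdint.h>

enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

#define MSTATUS_MIE       (UINT64_C(1) << 3)
#define MSTATUS_MPIE      (UINT64_C(1) << 7)
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (UINT64_C(3) << MSTATUS_MPP_SHIFT)

/* Hypothetical, heavily reduced view of one hart's state. */
struct hart {
    enum priv priv;
    uint64_t pc, mstatus, mepc, mcause, mtvec;
};

/* Take a trap into M-mode. */
static void trap_to_m(struct hart *h, uint64_t cause)
{
    uint64_t s = h->mstatus & ~(MSTATUS_MPIE | MSTATUS_MPP_MASK);

    h->mepc = h->pc;                               /* remember where we were */
    h->mcause = cause;
    if (s & MSTATUS_MIE)
        s |= MSTATUS_MPIE;                         /* MPIE <- MIE */
    s &= ~MSTATUS_MIE;                             /* MIE  <- 0   */
    s |= (uint64_t)h->priv << MSTATUS_MPP_SHIFT;   /* MPP  <- current mode */
    h->mstatus = s;
    h->priv = PRIV_M;
    h->pc = h->mtvec & ~UINT64_C(3);               /* direct mode assumed */
}

/* Execute MRET. */
static void mret(struct hart *h)
{
    uint64_t s = h->mstatus;
    enum priv prev = (enum priv)((s & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);

    if (s & MSTATUS_MPIE)
        s |= MSTATUS_MIE;                          /* MIE <- MPIE */
    else
        s &= ~MSTATUS_MIE;
    s |= MSTATUS_MPIE;                             /* MPIE <- 1 */
    s &= ~MSTATUS_MPP_MASK;                        /* MPP <- U (assumes U-mode exists) */
    h->mstatus = s;
    h->priv = prev;
    h->pc = h->mepc;
}
```

For an ECALL, mepc points at the ECALL itself, so a handler normally adds 4 to mepc before executing MRET; otherwise the same instruction would trap again.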

Figure 2.2: Privilege Level Transitions

stateDiagram-v2
    [*] --> MMode: Reset/Boot

    MMode: Machine Mode (M-mode)
    SMode: Supervisor Mode (S-mode)
    UMode: User Mode (U-mode)

    MMode --> SMode: MRET (return from M-trap)
    MMode --> UMode: MRET (return from M-trap)

    SMode --> UMode: SRET (return from S-trap)
    SMode --> MMode: Exception/Interrupt<br/>(not delegated)

    UMode --> SMode: ECALL (system call)<br/>Exception/Interrupt<br/>(delegated to S-mode)
    UMode --> MMode: Exception/Interrupt<br/>(not delegated)

    SMode --> SMode: Exception/Interrupt<br/>(delegated to S-mode)

    note right of MMode
        Highest privilege
        Full hardware access
        Firmware/Bootloader
    end note

    note right of SMode
        OS kernel
        Virtual memory control
        Delegated traps
    end note

    note right of UMode
        Application code
        Restricted access
        Virtual memory enabled
    end note

The state diagram shows how privilege levels transition through traps (upward) and return instructions (downward). ECALL explicitly requests higher privilege, while exceptions and interrupts cause automatic transitions.
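
The "delegated" and "not delegated" edges in the diagram are decided by two machine-level CSRs, medeleg (exceptions) and mideleg (interrupts), which this section does not otherwise cover. The following C sketch shows the decision under the assumption that S-mode is implemented; the function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical helper: which mode handles a trap with cause code `code`
 * that occurs while the hart runs at privilege level `cur`?
 * Assumes S-mode exists; medeleg/mideleg hold one delegation bit per cause. */
static inline enum priv trap_handler_mode(enum priv cur, bool is_interrupt,
                                          unsigned code,
                                          uint64_t medeleg, uint64_t mideleg)
{
    uint64_t deleg = is_interrupt ? mideleg : medeleg;
    bool delegated = code < 64 && ((deleg >> code) & 1u);

    /* A trap never moves to a less-privileged mode, so anything raised
     * in M-mode is handled in M-mode regardless of the delegation bits. */
    if (cur != PRIV_M && delegated)
        return PRIV_S;
    return PRIV_M;
}
```

An ECALL from U-mode (cause 8) goes to S-mode only if bit 8 of medeleg is set; an ECALL from S-mode (cause 9) with that bit clear goes to M-mode, which is how SBI calls reach the firmware.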

2.4 Calling Convention and ABI

The RISC-V Calling Convention

The calling convention defines how functions call each other: how arguments are passed, how return values are communicated, and which registers must be preserved. RISC-V's standard calling convention is specified in the RISC-V ELF psABI, a processor-specific supplement in the tradition of the System V ABI (Application Binary Interface) that many other architectures also follow.

The calling convention is a software convention, not enforced by hardware. The processor doesn’t care which registers you use for arguments. But following the convention ensures that code from different compilers and libraries can interoperate.

Argument Passing

Function arguments are passed in registers a0 through a7 (x10-x17). The first argument goes in a0, the second in a1, and so on:

int add(int x, int y, int z) {
    return x + y + z;
}

Compiles to:

add:
    add a0, a0, a1    # a0 = x + y
    add a0, a0, a2    # a0 = (x + y) + z
    ret

Arguments x, y, and z arrive in a0, a1, and a2. The result is returned in a0.

If a function has more than 8 arguments, the additional arguments are passed on the stack:

int sum9(int a, int b, int c, int d, int e, int f, int g, int h, int i) {
    return a + b + c + d + e + f + g + h + i;
}

Arguments a through h are in a0-a7. Argument i is on the stack at sp+0.
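
The rule can be captured in a few lines of C. This sketch covers only integer and pointer arguments no wider than XLEN; floating-point, variadic, and aggregate arguments follow additional rules that are not modeled. The struct and function names are assumptions for the example.

```c
#include <stdbool.h>

/* Where one integer argument lives at function entry. */
struct arg_loc {
    bool in_register;   /* true: register aN, false: on the stack        */
    int  index;         /* N of aN, or byte offset from sp at entry      */
};

/* Hypothetical helper. arg_index counts from 0; xlen_bytes is 4 on RV32
 * and 8 on RV64 (each stack slot is XLEN bits wide). */
static inline struct arg_loc int_arg_location(int arg_index, int xlen_bytes)
{
    struct arg_loc loc;

    if (arg_index < 8) {
        loc.in_register = true;
        loc.index = arg_index;                     /* a0..a7 */
    } else {
        loc.in_register = false;
        loc.index = (arg_index - 8) * xlen_bytes;  /* sp+0, sp+8, ... */
    }
    return loc;
}
```

For sum9 on RV64, argument i has index 8, so it is reported at stack offset 0; a hypothetical tenth argument would sit at sp+8.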

Return Values

Return values are passed in a0 and a1:

Single return value (up to XLEN bits): a0
Value of up to 2×XLEN bits (e.g., a 128-bit integer on RV64): a0 (low) and a1 (high)

For example, a function returning a 128-bit integer on RV64:

__int128 multiply(__int128 x, __int128 y);

Returns the low 64 bits in a0 and the high 64 bits in a1.

Structures and unions are handled specially:

Small structs (≤ 2×XLEN bits) are returned in a0 and a1
Larger structs are returned via a pointer: the caller allocates space and passes a pointer in a0; the function writes the result there and returns the pointer in a0

Caller-Saved vs Callee-Saved Registers

The calling convention divides registers into two categories:

Caller-saved (temporary registers): The caller must save these if it needs their values preserved across a function call. The called function is free to modify them.

t0-t6 (x5-x7, x28-x31): Temporaries
a0-a7 (x10-x17): Arguments/return values
ra (x1): Return address (modified by call)

Callee-saved (saved registers): The called function must preserve these. If it uses them, it must save them on entry and restore them before returning.

s0-s11 (x8-x9, x18-x27): Saved registers
sp (x2): Stack pointer

Example:

function:
    # Prologue: save callee-saved registers
    addi sp, sp, -16
    sd s0, 0(sp)
    sd s1, 8(sp)

    # Function body: can use s0, s1 freely
    mv s0, a0
    mv s1, a1
    # ... computation ...

    # Epilogue: restore callee-saved registers
    ld s0, 0(sp)
    ld s1, 8(sp)
    addi sp, sp, 16
    ret

Special Registers

Several registers have special roles:

Stack Pointer (sp): Points to the top of the stack. Must be preserved by callees. The stack grows downward (toward lower addresses).

Return Address (ra): Holds the return address for the current function. Set by call (or jal), used by ret (which is jalr zero, 0(ra)).

Frame Pointer (fp/s0): Optionally points to the base of the current stack frame. This is the same register as s0. Using a frame pointer simplifies debugging and stack unwinding, but costs a register.

Global Pointer (gp): Points to global data. Used for relaxation optimization—the linker can replace absolute addresses with gp-relative addresses, saving instructions. Typically set once at program startup and never changed.

Thread Pointer (tp): Points to thread-local storage (TLS). Each thread has its own TLS area. The OS sets tp when creating a thread.

2.5 Stack Frame Structure

Stack Layout

The stack is a region of memory used for:

Local variables
Saved registers
Function arguments (beyond the first 8)
Return addresses (for nested calls)

The stack grows downward. The stack pointer (sp) points to the top (lowest address) of the stack. Allocating stack space means subtracting from sp; deallocating means adding to sp.

A typical stack frame looks like:

Higher addresses
+------------------+
| Caller's frame   |
+------------------+
| Arguments 9+     | ← Passed on stack
+------------------+
| Return address   | ← Saved by this function if it calls others
+------------------+
| Saved registers  | ← Callee-saved (s0-s11)
+------------------+
| Local variables  |
+------------------+
| Outgoing args    | ← For functions this function calls
+------------------+ ← sp (stack pointer)
Lower addresses

Function Prologue and Epilogue

The prologue is code at the start of a function that sets up the stack frame:

function:
    # Prologue
    addi sp, sp, -32      # Allocate 32 bytes
    sd ra, 24(sp)         # Save return address
    sd s0, 16(sp)         # Save s0
    sd s1, 8(sp)          # Save s1
    # (Local variables use sp+0 to sp+7)

    # Function body
    # ...

    # Epilogue
    ld ra, 24(sp)         # Restore return address
    ld s0, 16(sp)         # Restore s0
    ld s1, 8(sp)          # Restore s1
    addi sp, sp, 32       # Deallocate stack frame
    ret

The epilogue is code at the end that tears down the stack frame and returns.
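
The 32 bytes in the example are not arbitrary: three 8-byte saved registers plus 8 bytes of locals. A compiler-style calculation of the frame size can be sketched in C as follows, assuming RV64 (8-byte register slots) and the 16-byte stack alignment of the standard ABI; the function names are illustrative.

```c
#include <stddef.h>

#define STACK_ALIGN 16u   /* standard RISC-V ABI stack alignment in bytes */

/* Round n up to the next multiple of align (align must be a power of two). */
static inline size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: bytes to subtract from sp in the prologue for
 * `saved_regs` 8-byte register slots plus `local_bytes` of locals. */
static inline size_t frame_size_rv64(size_t saved_regs, size_t local_bytes)
{
    return align_up(saved_regs * 8 + local_bytes, STACK_ALIGN);
}
```

Saving only ra needs 8 bytes, but the frame is still rounded up to 16 so that sp stays aligned.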

Frame Pointer

Some functions use a frame pointer (fp, which is s0). The frame pointer points to a fixed location in the stack frame, making it easier to access local variables and arguments:

function:
    # Prologue with frame pointer
    addi sp, sp, -32
    sd ra, 24(sp)
    sd s0, 16(sp)         # Save old frame pointer
    addi s0, sp, 32       # Set frame pointer to old sp

    # Now can access locals relative to fp:
    # Local var at fp-8, fp-16, etc.

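    # Epilogue
    addi sp, s0, -32      # Recompute sp from fp (s0 still holds this frame's fp)
    ld ra, 24(sp)         # Restore return address
    ld s0, 16(sp)         # Restore caller's frame pointer last
    addi sp, sp, 32       # Deallocate stack frame
    ret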

Frame pointers are optional. They simplify debugging (debuggers can walk the stack) and exception handling, but cost a register.

Leaf Functions

A leaf function is one that doesn’t call any other functions. Leaf functions can often avoid saving ra and allocating a stack frame:

leaf_function:
    # No prologue needed
    add a0, a0, a1
    ret
    # No epilogue needed

This is more efficient but only works if the function doesn’t call anything and doesn’t need to save registers.

2.6 Comparison with ARM64 and MIPS

Register Count and Usage

All three architectures use 5-bit register numbers, giving 32 encodable integer registers, but they use them differently:

RISC-V:

32 registers (x0-x31)
x0 is hardwired zero
31 usable registers
Clear caller/callee-saved distinction

ARM64:

31 general-purpose registers (x0-x30)
Register number 31 is special: it encodes the zero register (xzr) or the stack pointer (sp), depending on the instruction
30 fully general registers
Link register (x30) holds return address

MIPS:

32 registers ($0-$31)
$0 is hardwired zero
$31 is return address (ra)
30 usable general registers

Calling Conventions

RISC-V:

Arguments: a0-a7 (8 registers)
Return: a0-a1
Caller-saved: t0-t6, a0-a7, ra
Callee-saved: s0-s11, sp

ARM64:

Arguments: x0-x7 (8 registers)
Return: x0-x1
Caller-saved: x0-x17 (x18 is reserved as the platform register)
Callee-saved: x19-x28

MIPS:

Arguments: $a0-$a3 (4 registers under the classic O32 ABI, fewer than RISC-V/ARM64)
Return: $v0-$v1
Caller-saved: $t0-$t9
Callee-saved: $s0-$s7

RISC-V and ARM64 are similar, both providing 8 argument registers. MIPS is older: its classic O32 ABI provides only 4, which means more stack usage for functions with many arguments (the later N32/N64 ABIs raised this to 8).

Special Registers

RISC-V:

sp (x2): Stack pointer
ra (x1): Return address
gp (x3): Global pointer
tp (x4): Thread pointer

ARM64:

sp (register number 31 in some encodings): Stack pointer
lr (x30): Link register (return address)
No global pointer equivalent
Thread pointer for TLS in a system register (TPIDR_EL0); x18 is a platform register on some OSes

MIPS:

sp ($29): Stack pointer
ra ($31): Return address
gp ($28): Global pointer
No standard thread pointer

Zero Register

Both RISC-V and MIPS have a hardwired zero register (x0 / $0). On ARM64, register number 31 reads as zero (xzr) in most instructions but selects the stack pointer in others, which is more complex.

The zero register is surprisingly useful:

mv rd, rs is addi rd, rs, 0 or add rd, rs, zero
li rd, imm is addi rd, zero, imm
Discarding results: add zero, a0, a1 (compute but discard)
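
Because these pseudo-instructions are all ADDI with a particular choice of operands, one encoder covers them. The C sketch below builds the 32-bit I-type word for ADDI (opcode 0x13, funct3 0); the function and enum names are assumptions for illustration, and the immediate is assumed to fit in 12 signed bits.

```c
#include <stdint.h>

/* A few integer register numbers (x-register indices). */
enum { X_ZERO = 0, X_SP = 2, X_A0 = 10, X_A1 = 11 };

/* I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode. */
static inline uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm)
{
    return (((uint32_t)imm & 0xFFFu) << 20)   /* 12-bit immediate */
         | ((rs1 & 31u) << 15)
         | (0u << 12)                         /* funct3 = 000 for ADDI */
         | ((rd & 31u) << 7)
         | 0x13u;                             /* OP-IMM opcode */
}

/* Pseudo-instructions expressed through the zero register. */
static inline uint32_t encode_mv(unsigned rd, unsigned rs) { return encode_addi(rd, rs, 0); }
static inline uint32_t encode_li(unsigned rd, int32_t imm) { return encode_addi(rd, X_ZERO, imm); }
static inline uint32_t encode_nop(void)                    { return encode_addi(X_ZERO, X_ZERO, 0); }
```

The results can be checked against a disassembler: encode_li(X_A0, 2) yields 0x00200513 and encode_addi(X_SP, X_SP, -16) yields 0xff010113, the same words that appear in the Lab 2.2 listing.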

🛠️ Hands-on Lab: Lab 2.1 — Your First RISC-V Function

This lab guides you through implementing a simple addition function, experiencing mixed C and Assembly programming.

Lab Objectives

Understand how C passes arguments to an Assembly function (a0, a1)
Understand how Assembly returns results to C (a0)
Observe the Calling Convention in action

Code

Create a folder lab2 and create the following two files:

File 1: add_func.S (Assembly Implementation)

# add_func.S
.section .text
.global my_add          # Declare my_add as global symbol for the linker

# Function prototype: int my_add(int a, int b);
# Input: a in a0, b in a1
# Output: result goes in a0
my_add:
    # Observation point 1: a0, a1 are already filled by the caller
    add a0, a0, a1      # Perform addition: a0 = a0 + a1

    # Observation point 2: result is now in a0
    ret                 # Return instruction (actually jalr x0, 0(ra))
                        # It jumps to the address stored in ra

File 2: main.c (C Driver Program)

// main.c
#include <stdio.h>

// Declare external assembly function
extern int my_add(int a, int b);

int main() {
    int val1 = 10;
    int val2 = 20;
    int sum;

    printf("About to call assembly function...\n");

    // Call assembly function
    // The compiler automatically puts val1 in a0, val2 in a1
    sum = my_add(val1, val2);

    printf("Result: %d + %d = %d\n", val1, val2, sum);

    return 0;
}

Compile and Run

# Compile
riscv64-unknown-elf-gcc -o lab2_add main.c add_func.S

# Run (using QEMU User Mode)
qemu-riscv64 lab2_add

# Or use Spike + PK
spike pk lab2_add

Expected Output:

About to call assembly function...
Result: 10 + 20 = 30

🛠️ Hands-on Lab: Lab 2.2 — Analyzing Compiler-Generated Assembly

This lab lets you “reverse engineer” what C code looks like after compilation, verifying your understanding against the ABI specification.

Lab Objectives

Learn to use objdump for disassembly
Match the ABI to identify argument and return value register usage
Identify Prologue and Epilogue structures

Code

Create test_abi.c:

// test_abi.c
int calculate(int a, int b, int c) {
    int temp = a + b;
    return temp * c;
}

int main() {
    return calculate(2, 3, 4);
}

Compile and Analyze

# Compile (-O1 for readable output; -fno-inline keeps the call to calculate visible)
riscv64-unknown-elf-gcc -O1 -fno-inline -c test_abi.c -o test_abi.o

# Disassemble
riscv64-unknown-elf-objdump -d test_abi.o

What to Observe

You should see output similar to:

0000000000000000 <calculate>:
   0:   00b50533                add     a0,a0,a1    # a0 = a + b (temp)
   4:   02c50533                mul     a0,a0,a2    # a0 = temp * c
   8:   00008067                ret

000000000000000c <main>:
   c:   ff010113                addi    sp,sp,-16   # Prologue: allocate stack
  10:   00113423                sd      ra,8(sp)    # Save return address
  14:   00200513                li      a0,2        # First argument
  18:   00300593                li      a1,3        # Second argument
  1c:   00400613                li      a2,4        # Third argument
  20:   00000097                auipc   ra,0x0      # Prepare call
  24:   000080e7                jalr    ra          # Call calculate
  28:   00813083                ld      ra,8(sp)    # Epilogue: restore ra
  2c:   01010113                addi    sp,sp,16    # Free stack
  30:   00008067                ret

Analysis Exercises

Answer the following questions (answers are in the code comments):

Argument Passing: Which registers hold the arguments 2, 3, 4?
Return Value: Which register holds calculate’s result?
Prologue Purpose: Why does main start with addi sp, sp, -16?
Why Save ra: main saves ra, but calculate doesn’t—why?

💡 Hint: Because calculate is a Leaf Function (doesn’t call other functions), it doesn’t need to save ra. But main calls calculate, so it must save ra—otherwise jalr would overwrite it.

⚠️ Common Pitfalls

Pitfall 1: Misusing Saved Registers

Error Scenario: Freely modifying s0-s11 within a function without saving them first, causing crashes when returning to the caller.

# ❌ Wrong
my_bad_func:
    mv s0, a0           # Directly use s0 without saving!
    call another_func   # Call another function
    mv a0, s0           # Expect s0 to be unchanged...
    ret                 # But caller's s0 has been corrupted!

# ✅ Correct
my_good_func:
    addi sp, sp, -16
    sd s0, 0(sp)        # Save s0 first
    sd ra, 8(sp)        # Save ra (since we're calling)

    mv s0, a0
    call another_func
    mv a0, s0

    ld ra, 8(sp)        # Restore ra
    ld s0, 0(sp)        # Restore s0
    addi sp, sp, 16
    ret

Pitfall 2: Forgetting to Save ra in Non-Leaf Functions

Error Scenario: A function calls another function but forgets to save ra, losing the return address.

# ❌ Wrong
my_func:
    call helper         # jalr writes to ra, overwriting original!
    ret                 # ra now points at this ret itself: infinite loop

# ✅ Correct
my_func:
    addi sp, sp, -16
    sd ra, 8(sp)        # Save ra before calling
    call helper
    ld ra, 8(sp)        # Restore ra
    addi sp, sp, 16
    ret

Pitfall 3: Stack Misalignment

Error Scenario: The RISC-V calling convention requires 16-byte stack alignment. Violating this can cause crashes or subtle bugs.

# ❌ Wrong: 8-byte allocation (misaligned!)
    addi sp, sp, -8

# ✅ Correct: Always use multiples of 16
    addi sp, sp, -16

danieRTOS Reference: The Context Switch implementation in danieRTOS carefully follows these conventions, saving all callee-saved registers before switching tasks.

Summary

RISC-V’s programmer’s model provides a clean, regular interface between software and hardware. The 32 general-purpose registers (x0-x31) follow RISC tradition, with x0 hardwired to zero—a simple feature that enables many common operations without dedicated instructions. The Application Binary Interface (ABI) assigns conventional roles to registers: a0-a7 for arguments, t0-t6 for temporaries, s0-s11 for saved registers, and ra for the return address.

The calling convention balances efficiency and simplicity. Caller-saved registers (temporaries and arguments) allow callees to use them freely without saving. Callee-saved registers (s0-s11) preserve values across calls, enabling long-lived variables. The stack pointer (sp) and frame pointer (s0/fp) support stack frames for local variables and nested calls. This convention enables separate compilation and efficient function calls.

Control and Status Registers (CSRs) manage processor state and configuration. Unlike general-purpose registers, CSRs use a separate 12-bit address space and dedicated instructions (CSRRW, CSRRS, CSRRC). CSRs are grouped by the lowest privilege level that may access them, which is encoded in the address: for example, machine-mode CSRs such as mstatus sit in 0x300-0x3FF, supervisor-mode CSRs such as sstatus sit in 0x100-0x1FF, and user-mode CSRs include the floating-point state (0x001-0x003) and the read-only counters starting at 0xC00.

RISC-V defines three privilege levels: Machine mode (M-mode) has full hardware access and handles initialization and low-level exceptions. Supervisor mode (S-mode) runs operating systems with virtual memory and controlled hardware access. User mode (U-mode) runs applications with restricted privileges. This clean separation enables secure, efficient systems from embedded microcontrollers (M-mode only) to full operating systems (M+S+U).

Compared to ARM64 and MIPS, RISC-V’s programmer’s model is cleaner and more consistent. ARM64 has similar register conventions but with quirks like register number 31’s dual role as zero register and stack pointer. MIPS shows its age with fewer argument registers and less consistent naming. RISC-V’s separate CSR address space is cleaner than ARM’s system register encoding or MIPS’s coprocessor 0 model.

The programmer’s model reflects RISC-V’s design philosophy: simplicity, regularity, and clean separation of concerns. These principles make RISC-V easier to learn, implement, and optimize than more complex architectures.

Chapter 3. Privilege Levels & Execution Environment

Part II — The RISC-V Execution Model

Modern processors must balance two competing needs: applications require isolation and protection, while operating systems need controlled access to hardware. RISC-V addresses this through a clean privilege architecture with three levels—Machine, Supervisor, and User—each with well-defined responsibilities and capabilities.

This chapter explores how RISC-V implements privilege separation, from the mandatory Machine mode that controls all hardware to the optional Supervisor and User modes that enable operating systems and applications. We’ll examine the Supervisor Binary Interface (SBI) that abstracts platform differences, the execution environment interface that defines how programs interact with their environment, and how RISC-V’s privilege model compares to ARM’s exception levels. Understanding these concepts is essential for anyone working with RISC-V system software, from firmware developers to OS kernel engineers.

🎯 Learning Objectives

After completing this chapter, you will be able to:

Understand why layered protection is necessary: Grasp the core concepts of Isolation & Protection, and why modern processors must restrict applications from directly accessing hardware
Master the privilege differences between M/S/U modes: Clearly distinguish the capabilities and limitations of Machine Mode (building manager), Supervisor Mode (corporate tenant), and User Mode (regular employee)
Understand how SBI serves as M-mode’s service window: Grasp how the Supervisor Binary Interface acts as the standard communication bridge between S-mode and M-mode
Use ecall to request services: Understand how applications “make a phone call” to the operating system or firmware via the ecall instruction to obtain needed services

💡 Scenario: The Smart Building’s Access Card — Understanding Privilege Levels

Scene: In front of the lab whiteboard. Junior points at the screen showing “Illegal Instruction Exception” with a puzzled look. Architect sets down his coffee cup and adjusts his glasses.

Junior: “Architect, I just wanted to temporarily disable interrupts in my application to get more precise timing. Why did the CPU throw an error and kick me out? I bought this board myself!”

Architect: “Junior, you’re thinking about the CPU too simply. Modern processor design is like a ‘smart office building’ — for security, there must be strict access levels.”

Junior: “Access levels?”

Architect: “Think about it this way:

M-mode (Machine Mode) is the ‘Building Manager’: This is the highest authority. They have the master key to the entire building and can directly control the main power switches (hardware reset, clock configuration). Only they can talk directly to the building’s infrastructure.
S-mode (Supervisor Mode) is the ‘Corporate Tenant’: This is our operating system (OS). It rents several floors and can decide how to arrange the desks (memory management) and who sits where (scheduling), but it can’t cut power to the entire building or interfere with other companies’ floors.
U-mode (User Mode) is the ‘Regular Employee’: This is your application. You can only work within your assigned desk area (allocated memory). Want to adjust the AC? No way. Want to flip the main power switch? Not a chance.”

Junior: “So what if I’m actually cold and want to adjust the AC? (meaning: need hardware resources)”

Architect: “You have to ‘call the front desk.’ In RISC-V, this is called ecall (Environment Call). You make a request, the OS (S-mode) checks if you’re authorized, and if reasonable, the OS does it for you. If it involves lower-level hardware, the OS has to request M-mode’s help.”

Junior: “I see! So when I tried to disable interrupts, it was like an intern trying to run to the server room and pull the main breaker?”

Architect: “Exactly. Security (the CPU’s hardware exception mechanism) immediately stopped you. This layered protection is the key reason why the system doesn’t crash completely because of one bad program.”

        ┌─────────────────────────────────────────┐
        │         M-mode (Building Manager)        │
        │  ┌─────────────────────────────────┐    │
        │  │    S-mode (Corporate Tenant/OS)  │    │
        │  │  ┌─────────────────────────┐    │    │
        │  │  │   U-mode (Employee/App)  │    │    │
        │  │  │                         │    │    │
        │  │  │    ecall ──────────────►│────┼───►│ Call the front desk
        │  │  │    ◄─────────────────── │◄───┼────│ Response
        │  │  └─────────────────────────┘    │    │
        │  └─────────────────────────────────┘    │
        └─────────────────────────────────────────┘

💡 Key Insight: Privilege levels aren’t meant to restrict you — they’re meant to protect the entire system. Just like building access control isn’t meant to hassle employees, but to prevent one person’s mistake from causing a building-wide power outage.

3.1 RISC-V Privilege Architecture

The Privilege Model

RISC-V’s privilege architecture is elegantly simple compared to other modern processors. Where ARM defines four exception levels and x86 has rings 0-3 (though only 0 and 3 are commonly used), RISC-V defines just three privilege modes: Machine (M), Supervisor (S), and User (U).

This simplicity is intentional. The RISC-V designers observed that most systems need only two or three privilege levels: one for applications, one for the operating system, and one for firmware. Additional levels add complexity without proportional benefit for most use cases.

The privilege modes form a hierarchy:

Machine mode is the highest privilege level with unrestricted access to all hardware
Supervisor mode is intended for operating systems, with controlled access to privileged operations
User mode is the lowest privilege level, intended for applications with minimal privileges

Importantly, not all modes are mandatory. A simple embedded system might implement only M-mode. A microcontroller with basic memory protection might implement M-mode and U-mode. A full application processor running Linux implements all three modes.

Machine Mode (M-mode)

Machine mode is the only mandatory privilege level in RISC-V. Every RISC-V processor must implement M-mode, even if it implements no other privilege levels.

M-mode has complete and unrestricted access to the entire system. It can:

Access all memory and I/O devices
Read and write all CSRs
Execute all instructions
Configure and delegate traps to lower privilege levels
Control physical memory protection (PMP)

When a RISC-V processor resets, it starts executing in M-mode. The first code to run—typically a bootloader or firmware—executes in M-mode. This code initializes the hardware, sets up memory protection, and may eventually transfer control to supervisor-mode software (like an operating system) or directly to user-mode applications.

In embedded systems without an operating system, all code may run in M-mode. There’s no requirement to use lower privilege levels if they’re not needed. This flexibility allows RISC-V to scale from simple microcontrollers to complex application processors.

M-mode software typically implements:

Bootloader: Initial code that runs after reset
Firmware: Low-level hardware initialization and runtime services
SBI (Supervisor Binary Interface): Services for supervisor-mode software
Trap handlers: For traps that aren’t delegated to lower privilege levels

Supervisor Mode (S-mode)

Supervisor mode is optional but nearly universal in application processors. It’s designed for operating system kernels that manage multiple user processes.

S-mode has more privileges than U-mode but less than M-mode. It can:

Control virtual memory (page tables, TLB)
Handle traps delegated from M-mode
Access supervisor-level CSRs
Execute privileged instructions (like SFENCE.VMA for TLB management)

S-mode cannot:

Access machine-level CSRs
Directly control physical memory protection
Handle traps not delegated to it
Access I/O devices not mapped into its address space

This restricted access is intentional. M-mode firmware retains ultimate control over the hardware, while S-mode software (the OS kernel) manages virtual memory and user processes. This separation allows the firmware to provide platform-specific services while the OS remains platform-independent.

Operating systems like Linux, FreeBSD, and real-time operating systems run in S-mode on RISC-V. They use S-mode privileges to:

Manage virtual memory for process isolation
Handle system calls from user applications
Manage interrupts and exceptions
Schedule processes and manage resources

User Mode (U-mode)

User mode is the lowest privilege level, intended for application code. Like S-mode, U-mode is optional, but it’s implemented in any system that needs to isolate applications from each other and from the OS.

U-mode has minimal privileges. It can:

Execute unprivileged instructions
Access memory mapped into its virtual address space
Read a few user-accessible CSRs (like performance counters)
Request services from higher privilege levels via ECALL

U-mode cannot:

Access privileged CSRs
Execute privileged instructions
Directly access I/O devices
Modify page tables or TLB
Disable interrupts

When U-mode code needs a privileged operation (like file I/O or memory allocation), it executes the ECALL instruction. On a system running an OS, M-mode firmware delegates this exception to S-mode, so the trap lands in the OS kernel. The kernel examines the trap, performs the requested operation if permitted, and returns to U-mode.

This isolation is fundamental to modern operating systems. Each user process runs in U-mode with its own virtual address space. Processes cannot interfere with each other or with the kernel. If a process crashes, it doesn’t affect other processes or the system.

Hypervisor Extension (H Extension)

The hypervisor extension adds support for virtualization, allowing multiple operating systems to run simultaneously on the same hardware. Like S-mode and U-mode, it is optional, and it is only needed for virtualization use cases.

The H extension doesn’t add a new privilege level. Instead, it adds:

VS-mode (Virtual Supervisor): Guest OS kernel mode
VU-mode (Virtual User): Guest OS user mode
Two-stage address translation
Additional CSRs for virtualization control

A hypervisor runs in HS-mode (Hypervisor-extended Supervisor mode) and manages multiple guest operating systems. Each guest OS runs in VS-mode, believing it’s in S-mode. The hypervisor intercepts certain operations and provides virtualized hardware to each guest.
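The mode names above follow from two pieces of state: the nominal privilege level and whether a guest is currently running. The following C sketch models that mapping. The `mode_name` helper and its `v` parameter are names invented for this illustration, not part of any RISC-V header.

```c
#include <stdbool.h>

/* Nominal privilege encodings used by RISC-V (U=0, S=1, M=3). */
enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical helper: name the effective mode on a hart that
 * implements the H extension. `v` models the virtualization mode
 * (true while a guest is running). M-mode is never virtualized. */
static const char *mode_name(enum priv p, bool v)
{
    switch (p) {
    case PRIV_M: return "M";
    case PRIV_S: return v ? "VS" : "HS";
    case PRIV_U: return v ? "VU" : "U";
    }
    return "reserved";
}
```

For example, `mode_name(PRIV_S, true)` yields `"VS"`: the guest kernel sees ordinary supervisor privilege, while the hardware knows it is virtualized.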

This extension is important for cloud computing and server virtualization, but most embedded systems and even many application processors don’t implement it.

Figure 3.1a: RISC-V Privilege Hierarchy

graph TB
    M[Machine Mode M-mode<br/>Firmware, Bootloader, SBI<br/>Full hardware access]
    S[Supervisor Mode S-mode<br/>OS Kernel<br/>Virtual memory, delegated traps]
    U[User Mode U-mode<br/>Applications<br/>Minimal privileges]

    M -->|MRET| S
    M -->|MRET| U
    S -->|SRET| U
    U -->|ECALL, Exception| S
    S -->|Exception not delegated| M
    U -->|Exception not delegated| M

    style M fill:#FFB6C1
    style S fill:#87CEEB
    style U fill:#90EE90

Figure 3.1b: Hypervisor Extension (Optional)

graph TB
    HS[HS-mode<br/>Hypervisor<br/>Manages guest VMs]
    VS[VS-mode<br/>Guest OS Kernel<br/>Virtualized supervisor]
    VU[VU-mode<br/>Guest Applications<br/>Virtualized user mode]

    HS -->|Return| VS
    HS -->|Return| VU
    VS -->|Return| VU
    VU -->|Trap| VS
    VS -->|Trap| HS

    style HS fill:#DDA0DD
    style VS fill:#F0E68C
    style VU fill:#98FB98

3.2 Privilege Levels vs ARM Exception Levels

RISC-V’s Three-Level Model

RISC-V’s M/S/U privilege model is deliberately minimal. Three levels suffice for most systems:

M-mode for firmware and platform-specific code
S-mode for the OS kernel
U-mode for applications

This simplicity has advantages:

Easier to understand and implement
Fewer privilege transitions mean less overhead
Clear separation of concerns

The model is also flexible. Systems that don’t need all three levels can implement a subset: M-mode only, M-mode with U-mode, or all three. A bare-metal embedded system might use only M-mode. A simple RTOS might use M-mode and U-mode without S-mode.

ARM’s Four-Level Model

ARM takes a different approach with four exception levels (ELs):

EL0: Applications (like RISC-V U-mode)
EL1: OS kernel (like RISC-V S-mode)
EL2: Hypervisor (for virtualization)
EL3: Secure monitor (for TrustZone)

Additionally, ARM’s TrustZone creates two parallel worlds—Secure and Non-secure—each with its own set of exception levels. This creates considerable complexity:

EL3 manages transitions between Secure and Non-secure worlds
Each world has its own EL0, EL1, and EL2 (Secure EL2 was added in Armv8.4-A)
Different exception levels have different capabilities

The ARM model addresses real needs. EL3 and TrustZone provide hardware-enforced security isolation, important for mobile devices handling sensitive data like payment credentials. EL2 enables efficient virtualization for servers and cloud computing.

But this complexity comes at a cost:

More complex privilege transitions
More CSRs and state to manage
Steeper learning curve
More implementation complexity

Comparison of Privilege Transitions

The mechanisms for changing privilege levels differ between RISC-V and ARM, though the concepts are similar.

RISC-V Transitions:

Upward (to higher privilege): Exception, interrupt, or ECALL instruction
Downward (to lower privilege): MRET or SRET instruction
Trap cause stored in xcause CSR
Return address stored in xepc CSR
Trap handler address from xtvec CSR

ARM Transitions:

Upward: Exception, interrupt, or an SVC, HVC, or SMC instruction (calls to EL1, EL2, and EL3 respectively)
Downward: ERET (exception return) instruction
Exception syndrome stored in ESR_ELx
Return address stored in ELR_ELx
Vector table base in VBAR_ELx

The concepts are nearly identical—both architectures save state, jump to a handler, and provide a return instruction. The main differences are naming and the number of levels involved.

RISC-V’s simpler model means fewer cases to handle. An exception in U-mode might go to S-mode (if delegated) or M-mode (if not). In ARM, an exception in EL0 might go to EL1, EL2, or EL3 depending on configuration, and might involve a world switch if TrustZone is involved.
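The RISC-V routing decision described above is small enough to model in a few lines of C. This is a sketch under stated assumptions: it covers synchronous exceptions only, assumes S-mode is implemented, and uses the illustrative name `exception_target`. The `medeleg` parameter models the machine exception delegation CSR, in which bit N delegates exception code N.

```c
#include <stdint.h>

enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Sketch: pick the privilege level that handles a synchronous
 * exception with the given code (assumed to be below 64).
 * A trap never moves to a less privileged level, so exceptions
 * raised in M-mode always stay in M-mode, whatever medeleg says. */
static enum priv exception_target(enum priv current, unsigned code,
                                  uint64_t medeleg)
{
    if (current != PRIV_M && ((medeleg >> code) & 1))
        return PRIV_S;
    return PRIV_M;
}
```

With bit 8 of `medeleg` set, an ECALL from U-mode is handled in S-mode; with it clear, the same ECALL goes to M-mode, which is the situation in a bare-metal setup with no delegation.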

Security Model Comparison

Security is where the architectures diverge most significantly.

ARM TrustZone: ARM’s TrustZone creates two parallel execution environments—Secure and Non-secure worlds. Each world has its own:

Memory regions (some memory is Secure-only)
Peripherals (some devices are Secure-only)
Exception levels (EL0-EL2 in each world)

The Secure world can access Non-secure resources, but not vice versa. EL3 (Secure monitor) manages transitions between worlds. This provides strong isolation for security-critical code like cryptographic operations, DRM, and payment processing.

TrustZone is available on ARM Cortex-A processors and widely used in mobile devices. It’s proven effective for protecting sensitive operations from compromised OS kernels.

RISC-V PMP and ePMP: RISC-V takes a different approach with Physical Memory Protection (PMP). PMP allows M-mode to define memory regions with specific access permissions for lower privilege levels.

PMP provides:

Up to 64 memory regions (many implementations provide 16 or fewer)
Per-region permissions (read, write, execute)
Protection for S-mode and U-mode
Flexible region sizes and alignment
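To make the region and permission model concrete, here is a minimal C sketch of how one PMP entry could be encoded using the naturally aligned power-of-two (NAPOT) address mode. The macro and function names are chosen for this example. The bit positions follow the privileged specification, but treat the code as an illustration rather than a drop-in driver.

```c
#include <stdint.h>

/* pmpcfg bits (one configuration byte per PMP entry). */
#define PMP_R     0x01u   /* read permission    */
#define PMP_W     0x02u   /* write permission   */
#define PMP_X     0x04u   /* execute permission */
#define PMP_NAPOT 0x18u   /* A field = 3: naturally aligned power of two */
#define PMP_L     0x80u   /* lock bit */

/* Encode the pmpaddr value for a NAPOT region. `size` must be a
 * power of two >= 8 and `base` must be aligned to `size`; this
 * sketch does not check either condition. */
static uint64_t pmp_napot_addr(uint64_t base, uint64_t size)
{
    return (base >> 2) | ((size >> 3) - 1);
}
```

A 64 KiB read/execute region at 0x80000000 would use `pmp_napot_addr(0x80000000, 0x10000)` for the address register and `PMP_NAPOT | PMP_R | PMP_X` for its configuration byte. The run of trailing one bits in the address value is what encodes the region size.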

The enhanced PMP (ePMP) extension adds:

A Machine Mode Lockdown setting that makes PMP rules constrain M-mode itself, building on the per-entry lock bit that base PMP already provides
More flexible permission models
Better support for security use cases

PMP is simpler than TrustZone but less comprehensive. It doesn’t provide separate execution environments or world switching. For many use cases, PMP suffices. For high-security applications requiring strong isolation, additional mechanisms may be needed.

RISC-V’s approach is to keep the base architecture simple and allow extensions for specific security needs. Custom extensions can add TrustZone-like features if required. This flexibility allows implementations to choose the right security model for their use case without mandating complexity for systems that don’t need it.

Figure 3.2a: RISC-V Privilege Levels

graph TB
    RV_M[M-mode<br/>Machine Mode<br/>Firmware, SBI]
    RV_S[S-mode<br/>Supervisor Mode<br/>OS Kernel]
    RV_U[U-mode<br/>User Mode<br/>Applications]

    RV_M --> RV_S
    RV_S --> RV_U

    style RV_M fill:#FFB6C1
    style RV_S fill:#87CEEB
    style RV_U fill:#90EE90

Figure 3.2b: ARM Exception Levels

graph TB
    ARM_EL3[EL3<br/>Secure Monitor<br/>TrustZone]
    ARM_EL2[EL2<br/>Hypervisor<br/>Virtualization]
    ARM_EL1[EL1<br/>OS Kernel<br/>Privileged OS]
    ARM_EL0[EL0<br/>Applications<br/>User mode]

    ARM_EL3 --> ARM_EL2
    ARM_EL2 --> ARM_EL1
    ARM_EL1 --> ARM_EL0

    style ARM_EL3 fill:#FF6B6B
    style ARM_EL2 fill:#FFA07A
    style ARM_EL1 fill:#87CEEB
    style ARM_EL0 fill:#90EE90

Figure 3.2c: ARM TrustZone Architecture

graph TB
    MONITOR[EL3<br/>Secure Monitor<br/>World switching]
    SECURE[Secure World<br/>EL0-EL2<br/>Trusted execution]
    NONSECURE[Non-secure World<br/>EL0-EL2<br/>Normal execution]

    MONITOR --> SECURE
    MONITOR --> NONSECURE

    style MONITOR fill:#FF6B6B
    style SECURE fill:#98FB98
    style NONSECURE fill:#FFD700

Trade-offs and Use Cases

Neither model is universally superior—each makes different trade-offs.

RISC-V advantages:

Simpler to understand and implement
Lower overhead for privilege transitions
Flexible—implement only what you need
Easier to verify and validate

ARM advantages:

TrustZone provides strong security isolation
Four exception levels enable finer-grained privilege separation
Mature ecosystem with proven security solutions
Hypervisor support is standard (EL2)

For embedded systems and microcontrollers, RISC-V’s simplicity is often preferable. For mobile devices handling sensitive data, ARM’s TrustZone has proven valuable. For servers and cloud computing, both architectures can provide adequate virtualization support (RISC-V via the H extension, ARM via EL2).

The key insight is that RISC-V’s modular approach allows adding complexity where needed (via extensions like H for virtualization) while keeping the base simple. ARM’s approach is to define these features in the core architecture, which ensures consistency across implementations but gives system software a larger model to understand, even on systems that don’t use every feature.

3.3 Execution Environment Interface (EEI)

What is an EEI?

The Execution Environment Interface (EEI) defines the interface between a program and its execution environment. It specifies:

Which instructions are available
How system calls are made
How the program interacts with I/O
Memory layout and addressing
Interrupt and exception handling

Different privilege levels have different EEIs. A user-mode program has a different EEI than a supervisor-mode kernel, which has a different EEI than machine-mode firmware.

The EEI concept is important because it separates the ISA (which instructions exist) from the execution environment (how those instructions interact with the system). The same RISC-V ISA can support different EEIs for different use cases.

Application Execution Environment (AEE)

The Application Execution Environment is the EEI for user-mode programs. It defines what applications can do and how they interact with the operating system.

A typical AEE provides:

Virtual memory with process isolation
System calls via ECALL instruction
Standard library functions
File I/O, networking, and other OS services
Signal handling for asynchronous events

The AEE is usually defined by the operating system and ABI (Application Binary Interface). For example, Linux on RISC-V defines a specific AEE that includes:

System call numbers and calling convention
Signal delivery mechanism
Virtual memory layout
Thread-local storage access

Applications written for this AEE can run on any RISC-V Linux system, regardless of the underlying hardware.

Supervisor Execution Environment (SEE)

The Supervisor Execution Environment is the EEI for OS kernels running in S-mode. It defines how the kernel interacts with the underlying firmware and hardware.

The SEE is typically provided by M-mode firmware through the Supervisor Binary Interface (SBI). The SBI defines services that M-mode provides to S-mode, such as:

Timer management
Inter-processor interrupts (IPI)
Remote fence operations (TLB shootdown)
System reset and shutdown
Console I/O (for debugging)

By providing these services through SBI, the firmware abstracts platform-specific details. The OS kernel can be platform-independent, calling SBI functions instead of directly accessing hardware. This is similar to BIOS/UEFI on x86 or ARM’s PSCI (Power State Coordination Interface).

Bare-Metal Execution Environment

Not all RISC-V systems run operating systems. Embedded systems often run “bare-metal” code directly on the hardware without an OS.

In a bare-metal environment:

Code runs in M-mode with full hardware access
No virtual memory or process isolation
Direct access to all peripherals
Custom interrupt handlers
Application-specific memory layout

The bare-metal EEI is defined by the hardware platform and any runtime library used. For example, a microcontroller might provide:

Startup code that initializes the hardware
Interrupt vector table
Basic I/O functions
Memory map documentation

Bare-metal programming is common in embedded systems, IoT devices, and real-time applications where the overhead of an OS is unacceptable or unnecessary.

🛠️ Lab 3.1: The Ecall Elevator

This lab lets you “see” a privilege mode transition. We’ll use QEMU to simulate a simple bare-metal environment and observe how a U-mode program “takes the elevator” to a higher privilege level via ecall.

Objectives

Understand how the ecall instruction triggers an exception and traps to a higher privilege level
Observe the mcause register value to confirm the exception type is “Environment call from U-mode”
Observe the mepc register to confirm it points to the ecall instruction address

Environment Requirements

QEMU RISC-V emulator (qemu-system-riscv64)
RISC-V GCC toolchain (riscv64-unknown-elf-gcc)
GDB debugger

Code

File: ecall_elevator.S

# Lab 3.1: The Ecall Elevator
# Observe how ecall transitions from U-mode to M-mode

.section .text
.global _start

# ============================================================
# M-mode Initialization and Trap Handler Setup
# ============================================================
_start:
    # Set up Trap Handler
    la      t0, trap_handler
    csrw    mtvec, t0

    # Set up User Stack (simplified: using fixed address)
    li      sp, 0x80010000

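    # Allow U-mode to access memory. Without a matching PMP entry,
    # the first U-mode instruction fetch raises an access fault
    # (mcause = 1) instead of reaching the ecall.
    li      t0, -1
    csrw    pmpaddr0, t0                # one region covering all addresses
    li      t0, 0x1f                    # NAPOT | R | W | X
    csrw    pmpcfg0, t0

    # Prepare to switch to U-mode
    # mstatus.MPP = 0 (U-mode), mstatus.MPIE = 1
    li      t0, (0 << 11) | (1 << 7)    # MPP=0 (U-mode), MPIE=1
    csrw    mstatus, t0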

    # Set return address (mepc = user_code)
    la      t0, user_code
    csrw    mepc, t0

    # Switch to U-mode
    mret

# ============================================================
# User Mode Code (U-mode)
# ============================================================
user_code:
    # We're now executing in U-mode

    # Prepare syscall arguments
    li      a7, 100         # syscall number = 100 (custom)
    li      a0, 42          # arg0 = 42

    # Press the elevator button!
    ecall                   # <-- Set breakpoint here to observe

    # After ecall returns, a0 contains the return value
    # (This example returns a0 + 1 = 43)

user_loop:
    j       user_loop       # Infinite loop

# ============================================================
# M-mode Trap Handler
# ============================================================
.align 4
trap_handler:
    # === Observation Point 1: Read exception cause ===
    csrr    t0, mcause      # t0 = exception cause
    # U-mode ecall: mcause = 8 (Environment call from U-mode)

    # === Observation Point 2: Read exception address ===
    csrr    t1, mepc        # t1 = address where ecall occurred

    # === Observation Point 3: Read previous privilege state ===
    csrr    t2, mstatus     # mstatus.MPP shows previous mode

    # Simple syscall handling: return a0 + 1
    addi    a0, a0, 1       # return value = argument + 1

    # Skip past ecall instruction (ecall is 4 bytes)
    addi    t1, t1, 4
    csrw    mepc, t1

    # Return to U-mode
    mret

Execution Steps

1. Compile the program

riscv64-unknown-elf-gcc -nostdlib -nostartfiles -T linker.ld \
    -o ecall_elevator.elf ecall_elevator.S

2. Start QEMU and connect GDB

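# Terminal 1: Start QEMU (-bios none skips the default firmware,
# so our code is the first thing to run in M-mode)
qemu-system-riscv64 -machine virt -nographic -bios none \
    -kernel ecall_elevator.elf -S -gdb tcp::1234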

# Terminal 2: Connect GDB
riscv64-unknown-elf-gdb ecall_elevator.elf
(gdb) target remote :1234
(gdb) break trap_handler
(gdb) continue

3. Observe key registers

When the breakpoint triggers, execute in GDB:

# Check exception cause
(gdb) print/x $mcause
# Expected: 0x8 (Environment call from U-mode)

# Check ecall instruction address
(gdb) print/x $mepc
# Expected: points to ecall address in user_code

# Check previous privilege state
(gdb) print/x $mstatus
# Check MPP bits (bit 12:11): 00 = U-mode

Key Observations

Register	Value	Meaning
`mcause`	`0x8`	Exception Code 8 = Environment call from U-mode
`mepc`	`ecall address`	Instruction address when trap occurred
`mstatus.MPP`	`0b00`	Previously in U-mode (00=U, 01=S, 11=M)
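The MPP field in the table above sits in bits 12:11 of mstatus, so reading it from the raw value GDB prints is a shift and a mask. The helper below is a small host-side sketch for checking your reading of the register. The macro and function names are chosen for this example.

```c
#include <stdint.h>

#define MSTATUS_MIE        (1ULL << 3)    /* machine interrupt enable  */
#define MSTATUS_MPIE       (1ULL << 7)    /* previous interrupt enable */
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP_MASK   (3ULL << MSTATUS_MPP_SHIFT)

/* Return the previous privilege level recorded in mstatus.MPP
 * (0 = U, 1 = S, 3 = M). */
static unsigned mstatus_mpp(uint64_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);
}
```

The value the lab writes before mret, `(0 << 11) | (1 << 7)`, decodes to MPP = 0 (U-mode). A raw value of 0x1800 would decode to 3 (M-mode).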

Food for Thought

💭 Question: Why does the Trap Handler need to add 4 to mepc before returning?

Answer: If we don’t skip past ecall, mret will return to the same ecall instruction, causing an infinite trap loop! This is why real-world syscall handlers (like in danieRTOS) include ctx[CTX_MEPC] += 4 to advance past the ecall.

⚠️ Common Pitfalls

Pitfall 1: Bare-Metal Mindset Carryover

Misconception: “I’m used to writing bare-metal code on embedded systems. RISC-V should just let me access CSRs directly.”

Reality: Many RISC-V development boards run Linux, and your program executes in U-mode where you simply cannot access M-mode CSRs.

// ❌ Wrong: Trying to read mstatus in Linux User Space
#include <stdio.h>

int main() {
    unsigned long mstatus;
    asm volatile ("csrr %0, mstatus" : "=r"(mstatus));
    // Result: Illegal Instruction Exception, program killed by SIGILL
    printf("mstatus = 0x%lx\n", mstatus);
    return 0;
}

// ✅ Correct: Use system calls to get system information
#include <stdio.h>
#include <sys/utsname.h>

int main() {
    struct utsname buf;
    uname(&buf);  // Request kernel to look it up via syscall
    printf("Machine: %s\n", buf.machine);
    return 0;
}

Diagnosis:

If your program mysteriously crashes, first check whether you’re using csrr/csrw to access M-mode or S-mode specific CSRs. In a Linux environment, only a few counter CSRs (like time, and cycle or instret where the kernel permits it) can be read from U-mode.

Pitfall 2: Forgetting to Skip Past ecall

Symptom: After the Trap Handler finishes, the CPU enters an infinite trap loop.

Cause: mepc still points to the ecall instruction. After mret, it immediately executes ecall again, triggering another trap.

# ❌ Wrong: Not updating mepc
trap_handler:
    csrr    t0, mcause
    # ... handle syscall ...
    mret                # Returns to same ecall, infinite loop!

# ✅ Correct: Skip past ecall instruction
trap_handler:
    csrr    t0, mcause
    csrr    t1, mepc
    addi    t1, t1, 4   # ecall is 4 bytes
    csrw    mepc, t1
    # ... handle syscall ...
    mret                # Returns to instruction after ecall

Pitfall 3: Confusing M/S/U-Specific CSRs

Symptom: Want to read trap information but used the wrong CSR prefix.

Explanation: RISC-V CSRs have different prefixes based on privilege level:

Prefix	Privilege Level	Examples
`m`	Machine	`mstatus`, `mcause`, `mepc`, `mtvec`
`s`	Supervisor	`sstatus`, `scause`, `sepc`, `stvec`
none	User (some readable)	`cycle`, `time`, `instret`

# ❌ Wrong: Reading S-mode CSR in M-mode Trap Handler
trap_handler:
    csrr    t0, scause      # This reads S-mode's exception cause, not current!

# ✅ Correct: Use M-mode CSRs in M-mode
trap_handler:
    csrr    t0, mcause      # Read M-mode's exception cause

💡 Memory Tip: Use the CSRs for the level you’re in. M-mode uses m*, S-mode uses s*.
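The prefix convention is backed by the CSR address encoding itself: bits 9:8 of the 12-bit CSR address give the lowest privilege level allowed to access the register, and bits 11:10 equal to 0b11 mark it read-only. The sketch below checks an access against that encoding. The helper names are illustrative, and the check deliberately ignores additional gating such as the counter-enable registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* A few CSR addresses from the privileged specification. */
#define CSR_SSTATUS 0x100u
#define CSR_MSTATUS 0x300u
#define CSR_MCAUSE  0x342u
#define CSR_CYCLE   0xC00u

/* Lowest privilege level allowed to access the CSR (0 = U, 1 = S, 3 = M). */
static unsigned csr_min_priv(uint16_t addr)
{
    return (addr >> 8) & 3u;
}

/* Address bits [11:10] equal to 0b11 mark the CSR as read-only. */
static bool csr_is_read_only(uint16_t addr)
{
    return ((addr >> 10) & 3u) == 3u;
}

/* Would the address encoding alone permit this access? */
static bool csr_access_ok(uint16_t addr, unsigned priv, bool is_write)
{
    if (priv < csr_min_priv(addr))
        return false;               /* raises an illegal instruction exception */
    return !(is_write && csr_is_read_only(addr));
}
```

This explains both pitfalls at once: `csr_access_ok(CSR_MSTATUS, 0, false)` is false, which is the SIGILL from Pitfall 1, while reading an s-prefixed CSR from M-mode is permitted, which is why the mistake in Pitfall 3 fails silently instead of trapping.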

Summary

RISC-V’s privilege architecture provides a clean, flexible model for separating system software responsibilities. The three privilege levels—Machine (M-mode), Supervisor (S-mode), and User (U-mode)—form a hierarchy where each level has well-defined capabilities and restrictions. M-mode is mandatory and has unrestricted hardware access, making it suitable for firmware and bootloaders. S-mode is optional and designed for operating systems, with controlled access to privileged operations and virtual memory. U-mode is optional and intended for applications, with minimal privileges and strong isolation.

The privilege model is more flexible than it first appears. Simple embedded systems implement only M-mode. Microcontrollers with basic protection implement M-mode and U-mode. Full application processors running Linux implement all three modes. The Hypervisor extension adds VS-mode and VU-mode for virtualization, enabling multiple guest operating systems to run on a single processor.

The Supervisor Binary Interface (SBI) provides a standardized interface between M-mode firmware and S-mode operating systems. SBI abstracts platform-specific details, allowing OS kernels to be portable across different RISC-V implementations. Key SBI services include timer management, inter-processor interrupts, remote fence operations, and system reset. OpenSBI provides a reference implementation that supports numerous platforms.

The Execution Environment Interface (EEI) defines how programs interact with their execution environment. The Application Execution Environment (AEE) is what user programs see—system calls, standard library functions, and OS services. The Supervisor Execution Environment (SEE) is what the OS kernel sees—SBI calls, hardware access, and platform services. Bare-metal environments provide direct hardware access without an OS layer.

Compared to ARM’s four exception levels (EL0-EL3), RISC-V’s three privilege levels are simpler. ARMv8-A requires EL0 and EL1 and makes EL2 (Hypervisor) and EL3 (Secure Monitor) optional, but the architecture still defines all four levels plus the Secure/Non-secure split. RISC-V makes S-mode and U-mode optional, and adds hypervisor support as an extension. This modularity allows RISC-V to scale from tiny microcontrollers to high-performance servers without carrying unnecessary complexity.

The privilege architecture reflects RISC-V’s design philosophy: provide minimal mandatory features, make everything else optional, and maintain clean separation of concerns. This approach enables efficient implementations across a wide range of applications while preserving the flexibility to add advanced features when needed.

Chapter 4. Trap, Exception, Interrupt

Part III — Control Transfer & Exception System

🎯 Learning Objectives

After reading this chapter, you will be able to:

Distinguish Exception from Interrupt: Understand the fundamental difference between synchronous and asynchronous traps
Master the Trap Handling Flow: Understand the roles of mtvec, mepc, mcause, and mtval
Write a Trap Handler: Implement a basic exception handler
Understand Trap Delegation: Know how M-mode delegates traps to S-mode
Recognize PLIC: Understand the basic architecture of the Platform-Level Interrupt Controller

💡 Scenario: When the CPU Hits the Pause Button

Scene: Junior stares at a completely black terminal screen, looking bewildered.

Junior: “Senior, I have a question. I deliberately stuffed some garbage data .word 0xFFFFFFFF into my program, and it just crashed. But in Linux, if a program goes bad, usually only that program gets killed (Segmentation Fault), and the system stays fine, right?”

Senior: “That’s because Linux has a powerful ‘emergency response center’—the Trap Handler. But in our bare-metal environment right now, you haven’t written a handler. When the CPU encounters an instruction it doesn’t understand, it doesn’t know what to do, so it just… gives up.”

Junior: “So I need to write this response center myself?”

Senior: “Exactly. Think of it this way:

Trap Occurs: Like a robotic arm on the assembly line suddenly shows a red warning light and stops.
Hardware Action: The CPU automatically saves the current progress (PC) in mepc (Machine Exception PC), then jumps to wherever mtvec (Trap Vector) points to for help.
Software Takes Over (Handler): This is the code you need to write. You check mcause (Machine Cause) to see what triggered the alarm, handle the problem, then let the machine continue.”

Junior: “Sounds logical—handle it and go back to what you were doing?”

Senior: “Here’s the trap within the trap. If it’s an ‘Interrupt’ (like a timer going off), you handle it and return to ‘what you were doing.’ But if it’s an ‘Illegal Instruction Exception,’ returning to ‘what you were doing’ just hits the same illegal instruction again—infinite loop! So you have to manually ‘skip over’ that bad instruction.”

Junior: “I see! Let’s try it then.”

When a program encounters an error, receives an interrupt, or makes a system call, control must transfer from the current execution context to a handler that can deal with the event. RISC-V calls this mechanism a “trap”—a general term encompassing both synchronous exceptions (like page faults and illegal instructions) and asynchronous interrupts (like timer ticks and device signals).

Understanding traps is fundamental to system programming. Operating systems rely on traps to implement system calls, handle errors, and respond to hardware events. Firmware uses traps to manage low-level hardware and provide services to higher-level software. Even application programmers benefit from understanding how exceptions propagate and how interrupt latency affects real-time performance.

This chapter explores RISC-V’s trap mechanism in detail: how traps are triggered, how control transfers to handlers, how CSRs record trap information, and how the Platform-Level Interrupt Controller (PLIC) manages external interrupts. We’ll also examine the Advanced Interrupt Architecture (AIA) that extends RISC-V’s interrupt capabilities for high-performance systems, and compare RISC-V’s approach with ARM’s exception model.

4.1 Trap Fundamentals

What is a Trap?

In RISC-V terminology, a “trap” is any event that causes a control transfer to a trap handler. This includes both exceptions (synchronous events caused by instruction execution) and interrupts (asynchronous events from external sources).

The term “trap” is deliberately general. It encompasses:

Illegal instructions
Page faults
System calls (ECALL)
Breakpoints
Timer interrupts
External device interrupts
Inter-processor interrupts

When a trap occurs, the processor:

Saves the current program counter to xepc (where x is m or s depending on the target privilege level)
Saves the trap cause to xcause
Saves additional information to xtval (if applicable)
Updates xstatus to record the previous privilege level and interrupt enable state
Jumps to the trap handler address specified in xtvec

This mechanism is similar to exception handling in other architectures, but RISC-V’s terminology and implementation are particularly clean and consistent.

Exceptions vs Interrupts

The distinction between exceptions and interrupts is fundamental:

Exceptions are synchronous—they’re caused by the execution of a specific instruction. When you execute an instruction that causes an exception, the exception occurs at that point in the program. Examples include:

Illegal instruction: The processor doesn’t recognize the opcode
Page fault: A memory access violates page table permissions
ECALL: The program explicitly requests a trap to higher privilege
Breakpoint: A debugging breakpoint is hit

Exceptions are predictable and reproducible. If you execute the same instruction sequence with the same processor state, you’ll get the same exception at the same point.

Interrupts are asynchronous—they’re caused by events external to the currently executing instruction stream. An interrupt can occur between any two instructions (or even during instruction execution in some implementations). Examples include:

Timer interrupt: A timer has expired
External interrupt: A device needs attention
Software interrupt: Another processor or software has signaled this processor

Interrupts are not predictable from the instruction stream alone. The same program might experience interrupts at different points on different runs, depending on external events.

This distinction affects how traps are handled. Exceptions typically require examining the faulting instruction or address (xtval holds one or the other for some exceptions). Interrupts require identifying which device or source caused the interrupt.

Synchronous vs Asynchronous Traps

The synchronous/asynchronous distinction is encoded in the xcause CSR. The high bit of xcause indicates the trap type:

Bit 63 (in RV64) = 0: Exception (synchronous)
Bit 63 (in RV64) = 1: Interrupt (asynchronous)

The low bits encode the specific cause. For example:

xcause = 0x0000000000000002: Illegal instruction exception
xcause = 0x8000000000000005: Supervisor timer interrupt

This encoding allows trap handlers to quickly distinguish interrupts from exceptions with a simple sign test.
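As a minimal illustration of that sign test, the following C helpers split an RV64 xcause value into its interrupt flag and cause code. The function names are chosen for this example. On RV32 the flag would be bit 31 and the types would be 32 bits wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* RV64: the interrupt flag is bit 63, so the value is negative
 * when viewed as a signed 64-bit integer. */
static bool xcause_is_interrupt(uint64_t xcause)
{
    return (int64_t)xcause < 0;
}

/* Strip the interrupt flag to leave the exception or interrupt code. */
static uint64_t xcause_code(uint64_t xcause)
{
    return xcause & ~(1ULL << 63);
}
```

Applied to the two values above, 0x0000000000000002 decodes as an exception with code 2, and 0x8000000000000005 decodes as an interrupt with code 5. An assembly handler gets the same effect from a single branch-if-negative on the register holding xcause.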

Trap Classification

RISC-V doesn’t formally classify traps beyond the exception/interrupt distinction, but it’s useful to think about exceptions in terms of their behavior:

Faults: Exceptions that can be corrected, after which the faulting instruction can be restarted. Page faults are the classic example. When a page fault occurs:

The OS trap handler is invoked
The handler loads the missing page from disk
The handler updates the page table
The handler returns, and the faulting instruction is re-executed
This time, the instruction succeeds

The key is that xepc points to the faulting instruction, so returning from the trap re-executes it.

Traps (in the narrow sense): In some architectures, these are exceptions reported after the instruction completes, with the saved PC pointing to the next instruction. RISC-V does not work this way. For ECALL and EBREAK, xepc points to the ECALL or EBREAK instruction itself, so the handler must advance xepc past it before returning.

Interrupts: Asynchronous events. The interrupted instruction may or may not have completed. For precise interrupts, xepc points to an instruction that hasn’t executed yet. For imprecise interrupts (rare in RISC-V), the exact point of interruption may be approximate.

Aborts: Unrecoverable errors. These are rare in RISC-V. Most errors that would be aborts in other architectures are either faults (if recoverable) or cause the processor to enter a failure state.

Understanding these classifications helps in writing correct trap handlers. Fault handlers must be idempotent (safe to execute multiple times) because the faulting instruction will be retried. Trap handlers for breakpoints must advance xepc before returning to avoid infinite loops.
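One way to capture the retry-versus-skip decision is a small policy function. The sketch below is illustrative only: `resume_pc` is a hypothetical name, it treats every exception other than ECALL and EBREAK as a retryable fault, and it assumes the handler has already fetched the low 16 bits of the trapping instruction. The length check matters because a compressed `c.ebreak` is 2 bytes, not 4.

```c
#include <stdint.h>

/* Exception codes for the instructions that must be stepped over. */
#define CAUSE_BREAKPOINT 3u
#define CAUSE_ECALL_U    8u
#define CAUSE_ECALL_S    9u
#define CAUSE_ECALL_M    11u

/* Instruction length from the low 16 bits of its encoding: low two
 * bits 0b11 mean a 32-bit instruction, anything else is a 16-bit
 * compressed instruction (longer encodings are ignored here). */
static unsigned insn_length(uint16_t low_half)
{
    return (low_half & 0x3u) == 0x3u ? 4u : 2u;
}

/* Hypothetical policy: the address the handler should return to. */
static uint64_t resume_pc(uint64_t xepc, uint64_t code, uint16_t low_half)
{
    switch (code) {
    case CAUSE_BREAKPOINT:
    case CAUSE_ECALL_U:
    case CAUSE_ECALL_S:
    case CAUSE_ECALL_M:
        return xepc + insn_length(low_half);   /* step over the instruction */
    default:
        return xepc;                           /* retry the faulting instruction */
    }
}
```

A real kernel makes a richer decision than this (an unfixable fault ends with the process being killed, for instance), but the two outcomes for the return address are the same: leave xepc alone to retry, or advance it to skip.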

4.2 Trap Entry and Exit

Trap Entry Flow

When a trap occurs, the processor performs a well-defined sequence of operations. Understanding this sequence is crucial for writing trap handlers and debugging trap-related issues.

The trap entry flow differs slightly depending on the target privilege level (M-mode or S-mode), but the basic pattern is the same. Let’s consider a trap to M-mode:

Save PC: The current PC is saved to mepc. For exceptions, this is the PC of the faulting instruction. For interrupts, this is the PC of the instruction that would have executed next.
Update mcause: The trap cause is written to mcause. The high bit indicates interrupt (1) or exception (0). The low bits encode the specific cause.
Update mtval: Additional trap-specific information is written to mtval. For address-related exceptions (like page faults), mtval contains the faulting address. For illegal instruction exceptions, mtval may contain the instruction itself. For some traps, mtval is zero.
Update mstatus: Several fields in mstatus are updated:
- MPP (previous privilege) is set to the current privilege level
- MPIE (previous interrupt enable) is set to the current value of MIE
- MIE (interrupt enable) is set to 0, disabling interrupts in M-mode
Set privilege to M-mode: The processor switches to M-mode.
Jump to handler: The PC is set to the trap handler address from mtvec. The exact address depends on the mtvec mode (direct or vectored).

This entire sequence is atomic—it cannot be interrupted. Once a trap begins, it completes before any other trap can occur.
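The entry steps above can be modeled in a few lines of C. This is a minimal sketch, not a real implementation: the `struct hart` type and the `trap_enter_m` function are hypothetical names, direct mode is assumed for mtvec, and only the mstatus fields discussed here are modeled.

```c
#include <stdint.h>

/* mstatus fields used by trap entry (bit positions per the privileged spec) */
#define MSTATUS_MIE        (1ULL << 3)
#define MSTATUS_MPIE       (1ULL << 7)
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP_MASK   (3ULL << MSTATUS_MPP_SHIFT)

enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical toy model of one hart's trap-related state */
struct hart {
    uint64_t pc, mepc, mcause, mtval, mstatus, mtvec;
    enum priv priv;
};

/* Apply the M-mode trap entry sequence. `epc` is the address to record
   in mepc (faulting instruction for exceptions, next instruction for
   interrupts). */
static void trap_enter_m(struct hart *h, uint64_t epc,
                         uint64_t cause, uint64_t tval)
{
    h->mepc = epc;
    h->mcause = cause;
    h->mtval = tval;

    uint64_t s = h->mstatus;
    /* MPIE <- MIE */
    s = (s & MSTATUS_MIE) ? (s | MSTATUS_MPIE) : (s & ~MSTATUS_MPIE);
    /* MIE <- 0 */
    s &= ~MSTATUS_MIE;
    /* MPP <- current privilege level */
    s = (s & ~MSTATUS_MPP_MASK) | ((uint64_t)h->priv << MSTATUS_MPP_SHIFT);
    h->mstatus = s;

    h->priv = PRIV_M;
    h->pc = h->mtvec & ~3ULL;   /* direct mode assumed: jump to BASE */
}
```

Calling `trap_enter_m` on a hart running in U-mode with MIE set leaves MIE clear, MPIE set, and MPP equal to U, which is exactly the state the handler later relies on when it executes MRET.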

Figure 4.1: Trap Entry and Exit Flow

sequenceDiagram
    participant CPU as CPU Execution
    participant CSR as CSRs
    participant Handler as Trap Handler

    Note over CPU: Trap occurs

    CPU->>CSR: Save state:<br/>PC→xepc, cause→xcause,<br/>value→xtval, update xstatus
    CPU->>CPU: Switch privilege level
    CPU->>Handler: Jump to xtvec

    Note over Handler: Process trap

    Handler->>CPU: Execute xRET
    CPU->>CSR: Restore state:<br/>xepc→PC, xstatus→privilege

    Note over CPU: Resume execution

CSR Updates on Trap

Let’s examine each CSR update in detail:

xepc (Exception Program Counter): This register holds the address to return to after handling the trap. For exceptions, it’s the address of the instruction that caused the exception. For interrupts, it’s the address of the instruction that was about to execute when the interrupt occurred.

The trap handler can modify xepc before returning. This is useful for:

Skipping over a faulting instruction that can’t be fixed
Implementing single-stepping in a debugger
Emulating instructions not supported by the hardware

xcause (Trap Cause): This register indicates why the trap occurred. The format is:

Bit XLEN-1: Interrupt bit (1 = interrupt, 0 = exception)
Bits XLEN-2:0: Exception code or interrupt code

Common exception codes include:

0: Instruction address misaligned
1: Instruction access fault
2: Illegal instruction
3: Breakpoint
4: Load address misaligned
5: Load access fault
6: Store/AMO address misaligned
7: Store/AMO access fault
8: Environment call from U-mode
9: Environment call from S-mode
11: Environment call from M-mode
12: Instruction page fault
13: Load page fault
15: Store/AMO page fault

Common interrupt codes include:

1: Supervisor software interrupt
3: Machine software interrupt
5: Supervisor timer interrupt
7: Machine timer interrupt
9: Supervisor external interrupt
11: Machine external interrupt
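A handler typically starts by splitting xcause into its two fields. The sketch below assumes RV64 (XLEN = 64); the helper names are illustrative, and the name tables simply restate the codes listed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLEN 64  /* assumption: RV64 */

/* Top bit of xcause: 1 = interrupt, 0 = exception */
static bool cause_is_interrupt(uint64_t cause)
{
    return (cause >> (XLEN - 1)) & 1u;
}

/* Remaining bits: exception code or interrupt code */
static uint64_t cause_code(uint64_t cause)
{
    return cause & ~(1ULL << (XLEN - 1));
}

/* Map a raw xcause value to a human-readable name */
static const char *cause_name(uint64_t cause)
{
    static const char *const exceptions[16] = {
        [0]  = "Instruction address misaligned",
        [1]  = "Instruction access fault",
        [2]  = "Illegal instruction",
        [3]  = "Breakpoint",
        [5]  = "Load access fault",
        [7]  = "Store/AMO access fault",
        [8]  = "Environment call from U-mode",
        [9]  = "Environment call from S-mode",
        [11] = "Environment call from M-mode",
        [12] = "Instruction page fault",
        [13] = "Load page fault",
        [15] = "Store/AMO page fault",
    };
    static const char *const interrupts[12] = {
        [1]  = "Supervisor software interrupt",
        [3]  = "Machine software interrupt",
        [5]  = "Supervisor timer interrupt",
        [7]  = "Machine timer interrupt",
        [9]  = "Supervisor external interrupt",
        [11] = "Machine external interrupt",
    };
    uint64_t code = cause_code(cause);
    const char *name = NULL;

    if (cause_is_interrupt(cause)) {
        if (code < 12)
            name = interrupts[code];
    } else if (code < 16) {
        name = exceptions[code];
    }
    return name ? name : "Reserved or unknown";
}
```

For example, a raw value of 0x8000000000000007 decodes as an interrupt with code 7 (machine timer), while a raw value of 2 decodes as the illegal instruction exception.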

xtval (Trap Value): This register provides additional information about the trap. Its contents depend on the trap cause:

For address misaligned or access fault exceptions: The faulting address
For illegal instruction exceptions: The instruction itself (optional)
For breakpoint exceptions: The address of the breakpoint instruction
For page fault exceptions: The faulting virtual address
For other traps: Zero or undefined

The trap handler uses xtval to determine what went wrong and how to fix it. For example, a page fault handler uses xtval to know which page to load from disk.

xstatus (Status Register): Several fields are updated:

xPP (Previous Privilege): Set to the privilege level before the trap. This allows the xRET instruction to return to the correct privilege level.
xPIE (Previous Interrupt Enable): Set to the value of xIE before the trap. This preserves the interrupt enable state.
xIE (Interrupt Enable): Cleared to 0, disabling interrupts at the target privilege level. This prevents nested interrupts from immediately occurring.

These updates ensure that the trap handler knows where it came from and can return correctly.

Trap Vector (xtvec)

The xtvec CSR specifies where trap handlers are located. It has two fields:

BASE (bits XLEN-1:2): The base address of the trap handler(s), aligned to 4 bytes
MODE (bits 1:0): The vectoring mode

Two modes are defined:

Direct mode (MODE=0): All traps jump to BASE. A single trap handler must determine the cause by reading xcause and dispatch accordingly.
Vectored mode (MODE=1): Exceptions jump to BASE. Interrupts jump to BASE + 4×cause. This allows separate handlers for each interrupt source.

Vectored mode is useful for performance. Instead of a single handler that must check xcause and dispatch, each interrupt can have its own handler. This reduces latency for interrupt handling.

Example xtvec values:

0x80000000: Direct mode, handler at 0x80000000
0x80000001: Vectored mode, base at 0x80000000
- Exceptions → 0x80000000
- Supervisor software interrupt (cause 1) → 0x80000004
- Supervisor timer interrupt (cause 5) → 0x80000014
- Supervisor external interrupt (cause 9) → 0x80000024
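The address arithmetic in these examples can be checked with a small helper. This is a sketch with a hypothetical function name; reserved MODE values (2 and 3) are treated like direct mode here purely to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute where the PC lands for a given xtvec value and trap cause.
   MODE = 0: always BASE. MODE = 1: BASE for exceptions,
   BASE + 4 * cause for interrupts. */
static uint64_t trap_target(uint64_t tvec, uint64_t code, bool is_interrupt)
{
    uint64_t base = tvec & ~3ULL;   /* BASE field, low two bits cleared */
    uint64_t mode = tvec & 3ULL;    /* MODE field */

    if (mode == 1 && is_interrupt)
        return base + 4 * code;
    return base;
}
```

With xtvec = 0x80000001, `trap_target(0x80000001, 5, true)` yields 0x80000014, matching the supervisor timer entry in the list above, while any exception lands on 0x80000000.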

Trap Return (xRET)

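Returning from a trap is accomplished with the xRET instruction (MRET for M-mode, SRET for S-mode; URET belonged to the user-level interrupt extension, which has since been removed from the specification). The xRET instruction performs these steps: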

Restore PC: Set PC to the value in xepc
Restore privilege: Set the current privilege level to xstatus.xPP
Restore interrupt enable: Set xIE to xstatus.xPIE
Update xPIE: Set xPIE to 1 (enabled)
Update xPP: Set xPP to the least-privileged supported mode (U-mode, or M-mode if U-mode is not implemented)

The last two steps leave the status fields in a known state. They do not carry over to the next trap, because trap entry overwrites xPIE with the current xIE and xPP with the current privilege level. Resetting xPP to the least-privileged mode means that a stray xRET executed without a preceding trap cannot return to a more privileged mode than intended.
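The effect of MRET on mstatus can be sketched as a pure function. The names below are hypothetical, U-mode is assumed to be implemented, and side effects outside the fields discussed here (such as MPRV handling) are left out.

```c
#include <stdint.h>

#define MSTATUS_MIE        (1ULL << 3)
#define MSTATUS_MPIE       (1ULL << 7)
#define MSTATUS_MPP_SHIFT  11
#define MSTATUS_MPP_MASK   (3ULL << MSTATUS_MPP_SHIFT)
#define PRIV_U             0ULL

/* State after MRET: new PC, new privilege level, updated mstatus */
struct mret_result {
    uint64_t pc;
    uint64_t priv;
    uint64_t mstatus;
};

static struct mret_result model_mret(uint64_t mstatus, uint64_t mepc)
{
    struct mret_result r;

    r.pc = mepc;                                             /* PC <- mepc */
    r.priv = (mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT; /* priv <- MPP */

    if (mstatus & MSTATUS_MPIE)                              /* MIE <- MPIE */
        mstatus |= MSTATUS_MIE;
    else
        mstatus &= ~MSTATUS_MIE;

    mstatus |= MSTATUS_MPIE;                                 /* MPIE <- 1 */
    mstatus = (mstatus & ~MSTATUS_MPP_MASK)
            | (PRIV_U << MSTATUS_MPP_SHIFT);                 /* MPP <- U */

    r.mstatus = mstatus;
    return r;
}
```

Feeding in an mstatus with MPIE set and MPP = S returns privilege level 1 (S-mode) with MIE and MPIE both set and MPP reset to U.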

A typical trap handler epilogue looks like:

    # Restore saved registers
    ld t0, 0(sp)
    ld t1, 8(sp)
    # ... restore other registers ...
    addi sp, sp, 256    # Deallocate stack frame

    mret                # Return from M-mode trap

The MRET instruction is privileged—it can only be executed in M-mode. Similarly, SRET can only be executed in S-mode or higher. Attempting to execute xRET from insufficient privilege causes an illegal instruction exception.

4.3 Exception Causes and Handling

Exception Cause Codes

RISC-V defines a standard set of exception codes. Understanding these codes is essential for writing trap handlers.

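Instruction Address Misaligned (0): A taken branch or jump targets an address that is not properly aligned for instruction fetch. In the base ISA, instructions must be 4-byte aligned. With the C extension, instructions need only be 2-byte aligned, and this exception cannot occur: branch and JAL offsets are always multiples of 2, and JALR clears the lowest bit of its target, so an odd target address is never generated.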

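Instruction Access Fault (1): The instruction fetch failed a physical-address check. This might occur because:

The address is in a protected region (PMP violation)
The address does not correspond to executable memory (PMA violation, such as an unmapped physical address)
A bus error occurred

Failures of virtual address translation, such as an unmapped page or a page without execute permission, raise an instruction page fault (12) instead.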

Illegal Instruction (2): The processor doesn’t recognize the instruction. This occurs when:

The opcode is invalid
The instruction uses an unimplemented extension
The instruction is privileged but executed from insufficient privilege
Reserved fields have incorrect values

Illegal instruction exceptions are often used to emulate unimplemented instructions in software.

Breakpoint (3): The EBREAK instruction was executed. This is used by debuggers to set breakpoints. The trap handler can examine the program state and return control to the debugger.

Load Address Misaligned (4) and Store/AMO Address Misaligned (6): A load or store instruction used an improperly aligned address. For example, a 4-byte load (LW) from an address that’s not 4-byte aligned. Some implementations support misaligned accesses in hardware; others trap and require software emulation.

Load Access Fault (5) and Store/AMO Access Fault (7): A load or store failed for reasons similar to instruction access faults: PMP violations, PMA violations, or bus errors.

Environment Call from U/S/M-mode (8/9/11): The ECALL instruction was executed. The exception code indicates which privilege level made the call. This is how system calls are implemented—user code executes ECALL, trapping to the kernel.

Instruction Page Fault (12), Load Page Fault (13), Store/AMO Page Fault (15): Virtual address translation failed for the access. Either the page table entry is invalid (no mapping exists), or it is valid but doesn’t grant the required permission. These are distinct from access faults, which are raised by physical-address checks (PMP, PMA, or bus errors). Page faults are typically handled by allocating the page or loading it from disk, then updating the page table.

Exception Handling Patterns

Different exceptions require different handling strategies:

Illegal Instruction Emulation: When an illegal instruction exception occurs, the handler can:

Read the faulting instruction from memory (or from mtval if provided)
Decode the instruction
If it’s an instruction that can be emulated, perform the operation in software
Update registers and memory as the instruction would have
Advance mepc past the instruction
Return with MRET

This technique is used to support optional extensions in software or to provide backward compatibility.
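Advancing mepc "past the instruction" requires knowing the instruction’s length. A minimal sketch, assuming only 16-bit (compressed) and 32-bit encodings are in use; the function names are illustrative, and longer encodings are deliberately ignored.

```c
#include <stdint.h>

/* Length in bytes of the instruction whose lowest 16 bits are
   `first_parcel`. In the standard encoding, the two lowest bits are
   11 for 32-bit instructions and anything else for compressed ones. */
static unsigned insn_length(uint16_t first_parcel)
{
    return (first_parcel & 0x3u) == 0x3u ? 4u : 2u;
}

/* Address of the instruction that follows the one at `epc` */
static uint64_t next_pc(uint64_t epc, uint16_t first_parcel)
{
    return epc + insn_length(first_parcel);
}
```

An emulation handler would read the 16-bit parcel at mepc, emulate the instruction, and then write `next_pc(mepc, parcel)` back to mepc before executing MRET.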

Page Fault Handling: Page faults are more complex:

Read the faulting address from xtval
Check if the address is valid for the process
If not, terminate the process (segmentation fault)
If valid, allocate a physical page
Load the page contents from disk (if it was swapped out)
Update the page table to map the virtual address to the physical page
Return with SRET (the faulting instruction will be retried)

Page fault handling is critical for virtual memory systems and can involve significant latency if disk I/O is required.

System Call Handling: ECALL exceptions implement system calls:

Read the system call number (typically from register a7)
Read arguments from registers (a0-a5 in the Linux convention)
Validate arguments
Perform the requested operation
Write the result to a0
Advance sepc past the ECALL instruction
Return with SRET

The key is advancing sepc—otherwise, returning would re-execute the ECALL, creating an infinite loop.
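The steps above map naturally onto a dispatch function operating on a saved register frame. Everything named here is hypothetical: the `trap_frame` layout, the `SYS_ADD_DEMO` call number, and the demo system call itself exist only for illustration. The error value 38 mirrors Linux’s ENOSYS.

```c
#include <stdint.h>

/* Hypothetical saved-register frame: a[0]..a[7] hold a0..a7 */
struct trap_frame {
    uint64_t a[8];
    uint64_t sepc;
};

#define SYS_ADD_DEMO  1    /* made-up system call number */
#define DEMO_ENOSYS   38   /* "function not implemented" (Linux value) */

/* Made-up system call: returns the sum of its two arguments */
static int64_t sys_add_demo(uint64_t x, uint64_t y)
{
    return (int64_t)(x + y);
}

static void handle_ecall(struct trap_frame *tf)
{
    uint64_t nr = tf->a[7];          /* system call number from a7 */
    int64_t ret;

    switch (nr) {
    case SYS_ADD_DEMO:
        ret = sys_add_demo(tf->a[0], tf->a[1]);
        break;
    default:
        ret = -DEMO_ENOSYS;          /* unknown call: report an error */
        break;
    }

    tf->a[0] = (uint64_t)ret;        /* result goes back in a0 */
    tf->sepc += 4;                   /* ECALL has no compressed form,
                                        so it is always 4 bytes long */
}
```

The unconditional `sepc += 4` at the end is the step that prevents the infinite loop: without it, SRET would land on the same ECALL again.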

4.4 Interrupt Architecture

Interrupt Types

RISC-V defines three types of interrupts, each with machine-level and supervisor-level variants:

Software Interrupts:

Machine software interrupt (cause code 3)
Supervisor software interrupt (cause code 1)
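Triggered by software: a memory-mapped register write for the machine level (typically MSIP in the CLINT), or setting the SSIP bit in sip for the supervisor level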
Used for inter-processor interrupts (IPI)

Timer Interrupts:

Machine timer interrupt (cause code 7)
Supervisor timer interrupt (cause code 5)
Triggered when a timer reaches a threshold
Used for scheduling and timekeeping

External Interrupts:

Machine external interrupt (cause code 11)
Supervisor external interrupt (cause code 9)
Triggered by external devices via interrupt controller
Used for I/O device interrupts

Each interrupt type has a corresponding bit in the xie (interrupt enable) and xip (interrupt pending) registers.

Interrupt Enable and Pending

Interrupts are controlled by two sets of CSRs:

xie (Interrupt Enable): Each bit enables a specific interrupt type.

Bit 11: Machine external interrupt enable (MEIE)
Bit 9: Supervisor external interrupt enable (SEIE)
Bit 7: Machine timer interrupt enable (MTIE)
Bit 5: Supervisor timer interrupt enable (STIE)
Bit 3: Machine software interrupt enable (MSIE)
Bit 1: Supervisor software interrupt enable (SSIE)

xip (Interrupt Pending): Each bit indicates if an interrupt is pending.

Same bit positions as xie
Read-only for most bits (set by hardware)
Software interrupt bits can be written

For an interrupt to be taken:

The interrupt must be pending (bit set in xip)
The interrupt must be enabled (bit set in xie)
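Global interrupts must be enabled (xIE bit in xstatus) if the hart is running in mode x; interrupts for a more privileged mode are always globally enabled while the hart runs in a less privileged mode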
The interrupt must not be delegated to a lower privilege level
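For interrupts that target M-mode, the four conditions combine into a single bit-mask expression. This is a sketch with a hypothetical function name; it models only the M-mode case and takes the delegation register mideleg as an input.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Return the set of interrupts (in mip/mie bit layout) that may trap
   to M-mode right now. An empty result means no M-mode interrupt is
   taken. */
static uint64_t m_mode_candidates(uint64_t mip, uint64_t mie,
                                  uint64_t mideleg,
                                  bool mstatus_mie, int priv)
{
    /* Below M-mode, M-mode interrupts are always globally enabled;
       in M-mode, mstatus.MIE decides. */
    bool globally_enabled = (priv < PRIV_M) || mstatus_mie;

    if (!globally_enabled)
        return 0;

    /* pending AND enabled AND not delegated to a lower level */
    return mip & mie & ~mideleg;
}
```

A pending, enabled machine timer interrupt is therefore blocked while the hart runs in M-mode with MIE clear, but is taken immediately once the hart drops to U-mode or S-mode.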

Interrupt Priority

When multiple interrupts are pending, the hardware chooses one based on priority. The standard priority order is:

Machine external interrupt (highest priority)
Machine software interrupt
Machine timer interrupt
Supervisor external interrupt
Supervisor software interrupt
Supervisor timer interrupt (lowest priority)

All machine-level interrupts have higher priority than supervisor-level interrupts. Within each privilege level, the order is external, then software, then timer.

This ordering ensures that external device interrupts (which may be time-critical) are serviced before software and timer interrupts at the same privilege level.
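Selecting among simultaneously pending interrupts can be sketched as a walk over a fixed priority list. The order used below is the one the privileged specification gives for the standard interrupts: MEI, MSI, MTI, SEI, SSI, STI. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Pick the interrupt code to service from a mask of pending-and-enabled
   interrupts (mip/mie bit layout). Returns -1 if the mask is empty or
   contains only non-standard bits. */
static int highest_priority_interrupt(uint64_t candidates)
{
    /* Cause codes in decreasing priority:
       MEI(11), MSI(3), MTI(7), SEI(9), SSI(1), STI(5) */
    static const int order[] = { 11, 3, 7, 9, 1, 5 };

    for (size_t i = 0; i < sizeof order / sizeof order[0]; i++) {
        if (candidates & (1ULL << order[i]))
            return order[i];
    }
    return -1;
}
```

Note that a machine timer interrupt (7) wins over a supervisor external interrupt (9): privilege level is compared before interrupt type.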

Interrupt Nesting

By default, taking an interrupt disables further interrupts (xIE is cleared). This prevents nested interrupts, which simplifies handler code.

However, handlers can re-enable interrupts to allow nesting:

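interrupt_handler:
    # Save context
    csrrw sp, mscratch, sp    # Swap sp with mscratch
    addi sp, sp, -256
    sd x1, 0(sp)
    sd t0, 8(sp)
    # ... save other registers ...

    # Save mepc and mstatus: a nested trap overwrites both
    csrr t0, mepc
    sd t0, 16(sp)
    csrr t0, mstatus
    sd t0, 24(sp)

    # Re-enable interrupts for nesting
    csrsi mstatus, 0x8        # Set MIE bit

    # Handle interrupt
    # ...

    # Disable interrupts before returning
    csrci mstatus, 0x8        # Clear MIE bit

    # Restore mstatus and mepc saved above
    ld t0, 24(sp)
    csrw mstatus, t0
    ld t0, 16(sp)
    csrw mepc, t0

    # Restore context
    ld x1, 0(sp)
    ld t0, 8(sp)
    # ... restore other registers ...
    addi sp, sp, 256
    csrrw sp, mscratch, sp
    mret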

Nested interrupts require careful stack management and reentrancy considerations.

4.5 Platform-Level Interrupt Controller (PLIC)

PLIC Architecture

The Platform-Level Interrupt Controller (PLIC) is the standard interrupt controller for RISC-V systems. It routes interrupts from external sources (devices) to harts and privilege levels.

Key features:

Supports up to 1023 interrupt sources (IDs 1 to 1023; ID 0 is reserved to mean “no interrupt”)
Routes interrupts to multiple targets (harts × privilege modes)
Priority-based arbitration
Memory-mapped configuration registers

The PLIC sits between interrupt sources (devices) and interrupt targets (harts):

Devices → PLIC → Harts (M-mode, S-mode)

Interrupt Routing

Each interrupt source can be routed to any combination of targets. A target is a (hart, privilege mode) pair. For example, in a 4-hart system with M-mode and S-mode:

Hart 0 M-mode
Hart 0 S-mode
Hart 1 M-mode
Hart 1 S-mode
… (8 targets total)

Each target has an enable register with one bit per interrupt source. Setting bit N enables source N for that target.

This flexibility allows:

Dedicating certain interrupts to specific harts
Sharing interrupts across multiple harts
Routing interrupts to different privilege levels

Priority and Threshold

Each interrupt source has a priority (typically 0-7, where 0 means “never interrupt”). Each target has a threshold. An interrupt is delivered only if its priority exceeds the target’s threshold.

This allows:

Masking low-priority interrupts during critical sections
Implementing priority-based preemption
Temporarily disabling interrupts without modifying enable bits

Example: Set threshold to 5 to mask all interrupts with priority ≤ 5.
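The interaction of pending bits, enable bits, priority, and threshold can be sketched as a selection loop. This is a toy model for one target with eight sources, not PLIC driver code; the function name and `NUM_SOURCES` are assumptions. Ties go to the lowest source ID, as the PLIC specification describes.

```c
#include <stdint.h>

#define NUM_SOURCES 8   /* toy size; source 0 is reserved */

/* Return the ID of the source the PLIC would present to a target, or 0
   if nothing qualifies. `pending` and `enabled` hold one bit per
   source ID. */
static uint32_t plic_best_source(const uint32_t priority[NUM_SOURCES],
                                 uint32_t pending, uint32_t enabled,
                                 uint32_t threshold)
{
    uint32_t best = 0;
    uint32_t best_prio = threshold;   /* must strictly exceed threshold */

    for (uint32_t id = 1; id < NUM_SOURCES; id++) {
        uint32_t bit = 1u << id;

        if ((pending & bit) && (enabled & bit) && priority[id] > best_prio) {
            best = id;                /* strict '>' keeps the lowest ID on ties */
            best_prio = priority[id];
        }
    }
    return best;
}
```

With priorities 5, 6, and 6 on sources 1 to 3, a threshold of 5 masks source 1 and selects source 2; raising the threshold to 6 masks all three.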

PLIC Memory Map and Programming

The PLIC is configured through memory-mapped registers:

Base Address: 0x0C000000 (typical, platform-specific)

Priority registers:     Base + 0x000000 + source_id * 4
Pending registers:      Base + 0x001000 + (source_id / 32) * 4
Enable registers:       Base + 0x002000 + context * 0x80 + (source_id / 32) * 4
Threshold registers:    Base + 0x200000 + context * 0x1000
Claim/Complete:         Base + 0x200004 + context * 0x1000

A “context” is a (hart, privilege mode) pair. For a system with M-mode and S-mode per hart:

Context 0 = Hart 0, M-mode
Context 1 = Hart 0, S-mode
Context 2 = Hart 1, M-mode
Context 3 = Hart 1, S-mode
…

Example register definitions and initialization:

#include <stdint.h>
#include <stdio.h>

// PLIC register definitions
#define PLIC_BASE           0x0C000000
#define PLIC_PRIORITY(id)   (PLIC_BASE + (id) * 4)
#define PLIC_PENDING(id)    (PLIC_BASE + 0x1000 + ((id) / 32) * 4)
#define PLIC_ENABLE(hart, mode, id) \
    (PLIC_BASE + 0x2000 + (hart) * 0x100 + (mode) * 0x80 + ((id) / 32) * 4)
#define PLIC_THRESHOLD(hart, mode) \
    (PLIC_BASE + 0x200000 + (hart) * 0x2000 + (mode) * 0x1000)
#define PLIC_CLAIM(hart, mode) \
    (PLIC_BASE + 0x200004 + (hart) * 0x2000 + (mode) * 0x1000)

// Initialize PLIC
void plic_init(void) {
    // Set priority of interrupt source 1 to 7 (highest)
    *(volatile uint32_t *)PLIC_PRIORITY(1) = 7;

    // Enable interrupt source 1 for hart 0 M-mode
    volatile uint32_t *enable_reg = (volatile uint32_t *)PLIC_ENABLE(0, 0, 1);  // mode 0 = M-mode
    *enable_reg |= (1u << (1 % 32));

    // Set priority threshold to 0 (accept all interrupts with priority > 0)
    *(volatile uint32_t *)PLIC_THRESHOLD(0, 0) = 0;

    // Enable M-mode external interrupts. set_csr() is assumed to be a
    // csrrs-based helper macro defined elsewhere.
    set_csr(mie, 1 << 11);  // MEIE
    set_csr(mstatus, 1 << 3);  // MIE
}

// PLIC interrupt handler
void plic_handler(void) {
    // Claim interrupt (read source ID)
    uint32_t source = *(volatile uint32_t *)PLIC_CLAIM(0, 0);

    if (source == 0) {
        // No pending interrupt (should not happen)
        return;
    }

    // Handle interrupt
    printf("Handling PLIC interrupt from source %u\n", source);
    handle_device_interrupt(source);

    // Complete interrupt (write back source ID)
    *(volatile uint32_t *)PLIC_CLAIM(0, 0) = source;
}

Claim and Completion

The PLIC uses a claim/complete protocol:

Claim: The handler reads the claim register. This atomically:
- Returns the ID of the highest-priority pending interrupt
- Marks that interrupt as “in service”
- Prevents other harts from claiming the same interrupt
Service: The handler services the interrupt
Complete: The handler writes the interrupt ID to the completion register. This:
- Marks the interrupt as no longer in service
- Allows the interrupt to be triggered again

Example:

void plic_handler(void) {
    uint32_t irq = plic_claim();  // Claim interrupt
    if (irq == UART_IRQ) {
        uart_interrupt_handler();
    } else if (irq == TIMER_IRQ) {
        timer_interrupt_handler();
    }
    // ... handle other interrupts ...
    plic_complete(irq);           // Complete interrupt
}

The claim/complete protocol ensures that:

Each interrupt is handled exactly once
Multiple harts can share the PLIC without races
Level-triggered interrupts don’t cause spurious re-triggers
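The `plic_claim()` and `plic_complete()` helpers used in the example are not defined in the text. One possible shape is sketched below, using the claim/complete offset from the memory map given earlier. Unlike the zero-argument calls in the example, this version takes the PLIC base address and context number as parameters, which is an assumption made so the helpers can be exercised against ordinary memory in a test.

```c
#include <stdint.h>

#define PLIC_CLAIM_OFFSET    0x200004u   /* claim/complete, context 0 */
#define PLIC_CONTEXT_STRIDE  0x1000u     /* distance between contexts */

/* Address of the claim/complete register for one context */
static volatile uint32_t *plic_claim_reg(uintptr_t base, unsigned context)
{
    return (volatile uint32_t *)(base + PLIC_CLAIM_OFFSET
                                 + (uintptr_t)context * PLIC_CONTEXT_STRIDE);
}

/* Claim: reading the register returns the highest-priority pending
   source ID (0 if none) */
static uint32_t plic_claim(uintptr_t base, unsigned context)
{
    return *plic_claim_reg(base, context);
}

/* Complete: writing the same ID back signals that service is finished */
static void plic_complete(uintptr_t base, unsigned context, uint32_t irq)
{
    *plic_claim_reg(base, context) = irq;
}
```

On hardware, `base` would be the platform’s PLIC base address (0x0C000000 in the earlier example). A handler should also check for a claimed ID of 0 and skip the completion write in that case.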

4.6 Core-Local Interrupt Controller (CLIC)

CLIC vs PLIC

The Core-Local Interrupt Controller (CLIC) is an optional extension that provides lower-latency interrupt handling than the PLIC.

PLIC:

Platform-level (shared across harts)
Flexible routing
Memory-mapped configuration
Higher latency (claim/complete protocol)
Good for general-purpose I/O

CLIC:

Core-local (one per hart)
Direct vectoring to handlers
Lower latency
More complex configuration
Good for real-time, low-latency interrupts

PLIC and CLIC are complementary. A system might use CLIC for time-critical interrupts and PLIC for general I/O.

Vectored Interrupt Handling

CLIC supports vectored interrupts—each interrupt source can have its own handler address. When an interrupt occurs, the hardware jumps directly to the handler, avoiding software dispatch overhead.

The vector table is an array of handler addresses:

Vector Table:
  [0]: Handler for interrupt 0
  [1]: Handler for interrupt 1
  [2]: Handler for interrupt 2
  ...

The hardware computes the address of the table entry and loads the handler address from it:

entry_address   = vector_table_base + (interrupt_id × entry_size)
handler_address = memory[entry_address]

This eliminates the need for a dispatcher that reads the interrupt ID and jumps to the appropriate handler.
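In C terms, the vector table is an array of function pointers, and the hardware’s lookup corresponds to an indexed load followed by an indirect call. The sketch below imitates that lookup in software; the handler names and the `last_handled` variable are purely illustrative.

```c
#include <stdint.h>

typedef void (*irq_handler_t)(void);

/* Records which handler ran, so the effect is observable */
static int last_handled = -1;

static void irq0_handler(void) { last_handled = 0; }
static void irq1_handler(void) { last_handled = 1; }
static void irq2_handler(void) { last_handled = 2; }

/* The vector table: one handler address per interrupt ID */
static irq_handler_t vector_table[] = {
    irq0_handler, irq1_handler, irq2_handler,
};

/* Address of the table entry for a given interrupt ID */
static uintptr_t vector_entry_address(uintptr_t table_base,
                                      unsigned interrupt_id)
{
    return table_base + interrupt_id * sizeof(irq_handler_t);
}

/* Software imitation of the hardware lookup: load the entry, jump to it */
static void vectored_dispatch(unsigned interrupt_id)
{
    uintptr_t entry = vector_entry_address((uintptr_t)vector_table,
                                           interrupt_id);
    irq_handler_t handler = *(irq_handler_t *)entry;

    handler();
}
```

The entry size here is `sizeof(irq_handler_t)`, which is one pointer: 8 bytes on RV64 and 4 bytes on RV32.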

Interrupt Levels and Preemption

CLIC supports multiple interrupt levels (up to 256). Each interrupt has a level, and higher-level interrupts can preempt lower-level ones.

When an interrupt is taken:

The current level is saved
The new level is set to the interrupt’s level
Only interrupts with higher levels can preempt

This provides fine-grained priority control for real-time systems.
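The preemption rule behaves like a small stack of levels. The following is a toy model with hypothetical names, not a description of the CLIC’s actual register interface: it only shows when an incoming interrupt is allowed to preempt and how the previous level comes back on return.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NESTING 8   /* arbitrary depth limit for this toy model */

struct level_state {
    uint8_t current;              /* level of the code running now */
    uint8_t saved[MAX_NESTING];   /* levels of preempted handlers */
    unsigned depth;
};

/* Try to take an interrupt at `irq_level`. Returns true if it preempts. */
static bool take_interrupt(struct level_state *s, uint8_t irq_level)
{
    if (irq_level <= s->current || s->depth == MAX_NESTING)
        return false;                     /* not high enough to preempt */

    s->saved[s->depth++] = s->current;    /* save the current level */
    s->current = irq_level;               /* run at the interrupt's level */
    return true;
}

/* Return from the current handler, restoring the preempted level */
static void finish_interrupt(struct level_state *s)
{
    if (s->depth > 0)
        s->current = s->saved[--s->depth];
}
```

Starting from level 0, an interrupt at level 100 is taken, a later one at level 50 is held off, and one at level 200 preempts the level-100 handler.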

4.7 Advanced Interrupt Architecture (AIA)

AIA Overview

The Advanced Interrupt Architecture (AIA) is a newer interrupt specification that builds on the basic RISC-V interrupt model and provides a successor to the PLIC. It provides:

Message-signaled interrupts (MSI)
Interrupt virtualization support
Improved scalability
Better integration with PCIe

AIA is designed for modern systems with many cores and devices, particularly servers and data center applications.

Message-Signaled Interrupts (MSI)

Traditional interrupts use dedicated wires. MSI uses memory writes to signal interrupts:

Device writes to a special address
The write is intercepted by the interrupt controller
An interrupt is triggered

MSI advantages:

No dedicated interrupt wires needed
Scales better (thousands of interrupt sources)
Better for PCIe devices
Supports interrupt remapping
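The core idea of MSI, that an ordinary memory write carries the interrupt number, can be shown with a toy model. Nothing below reflects the real register layout of an AIA interrupt controller; the `msi_file` structure, the 63-interrupt limit, and the lowest-number-first claim order are all assumptions made for illustration.

```c
#include <stdint.h>

#define MSI_MAX_ID 63u   /* toy limit: IDs 1..63 fit in one 64-bit word */

/* Toy stand-in for the controller state behind an MSI target address */
struct msi_file {
    uint64_t pending;    /* bit N set = interrupt N is pending */
};

/* What the controller does when a device writes `id` to the MSI address */
static void msi_write(struct msi_file *f, uint32_t id)
{
    if (id >= 1 && id <= MSI_MAX_ID)
        f->pending |= 1ULL << id;
}

/* What the hart does to pick up a pending interrupt: take the lowest
   pending ID and clear it. Returns 0 if nothing is pending. */
static uint32_t msi_claim(struct msi_file *f)
{
    for (uint32_t id = 1; id <= MSI_MAX_ID; id++) {
        if (f->pending & (1ULL << id)) {
            f->pending &= ~(1ULL << id);
            return id;
        }
    }
    return 0;
}
```

Because the signal is just a write, adding an interrupt source requires no new wire: a device only needs to know the target address and the number to write.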

Interrupt Virtualization

AIA includes support for virtualizing interrupts, essential for running multiple guest operating systems:

Guest interrupts can be delivered directly to guest VMs
Hypervisor can intercept and remap interrupts
Reduces virtualization overhead

This is similar to ARM’s GIC virtualization extensions.

4.8 Comparison with ARM GIC

ARM Generic Interrupt Controller (GIC)

ARM’s GIC is the standard interrupt controller for ARM systems. Comparing with RISC-V:

Architecture:

GIC: Centralized, hierarchical (distributor → CPU interfaces)
PLIC: Flat, memory-mapped
CLIC: Core-local, vectored

Interrupt Types:

GIC: SGI (software), PPI (private peripheral), SPI (shared peripheral)
RISC-V: Software, timer, external

Priority Levels:

GIC: Up to 256 priority levels
PLIC: Typically 8 levels
CLIC: Up to 256 levels

Virtualization:

GIC: Built-in virtualization support (GICv2 and later)
RISC-V: AIA provides virtualization support

Similarities:

Both support priority-based arbitration
Both support routing interrupts to specific cores
Both support message-signaled interrupts (GICv3 ITS, RISC-V AIA)

Differences:

GIC is more complex and feature-rich
PLIC is simpler, with a smaller register interface and fewer features
CLIC provides lower latency for real-time use cases
AIA brings RISC-V closer to GIC’s capabilities

Trade-offs:

RISC-V’s modular approach (PLIC/CLIC/AIA) allows implementations to choose the right complexity
ARM’s unified GIC provides consistency but mandates more complexity
RISC-V is catching up with AIA for advanced features

🛠️ Hands-on Lab: Lab 4.1 — Your First Trap Handler

This lab guides you through implementing a minimal Trap Handler that handles an Illegal Instruction and gracefully skips over it.

Lab Objectives

Set mtvec to point to your Trap Handler
Read mcause to determine the trap type
Modify mepc to skip over the bad instruction
Return using mret

Code

Create lab4_trap.S:

# lab4_trap.S - Minimal Trap Handler Implementation
.section .text
.global _start

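_start:
    # 0. Set up a stack: the handler below pushes registers onto it
    #    (address assumes RAM at 0x80000000, as on QEMU virt and Spike)
    li sp, 0x80100000

    # 1. Set up Trap Vector
    la t0, trap_handler
    csrw mtvec, t0          # Tell CPU: jump here when trap occurs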

    # 2. Execute some normal instructions
    li a0, 100
    li a1, 200
    add a2, a0, a1          # a2 = 300

    # 3. Deliberately trigger an illegal instruction
    .word 0xFFFFFFFF        # This is NOT a valid RISC-V instruction!

    # 4. If Handler is correct, we skip the above and continue here
    li a3, 999              # Marker: we successfully skipped it!

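    # 5. Request exit (on bare metal no OS answers this ecall: it traps
    #    with mcause = 11 and parks in unknown_trap)
    li a7, 93               # exit syscall number (Linux convention)
    li a0, 0                # exit code
    ecall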

# ============================================
# Trap Handler
# ============================================
.balign 4                   # mtvec BASE requires 4-byte alignment
trap_handler:
    # Save registers we'll use
    addi sp, sp, -16
    sd t0, 0(sp)
    sd t1, 8(sp)

    # Read trap cause
    csrr t0, mcause

    # Check if Illegal Instruction (cause = 2)
    li t1, 2
    bne t0, t1, unknown_trap

    # It's an illegal instruction! Skip it
    # Read mepc (address of trapping instruction)
    csrr t0, mepc

    # Assume 32-bit instruction, skip 4 bytes
    # (Compressed instructions are 2 bytes; simplified here)
    addi t0, t0, 4
    csrw mepc, t0           # Update mepc

    # Restore registers
    ld t0, 0(sp)
    ld t1, 8(sp)
    addi sp, sp, 16

    # Return! CPU will jump to new mepc
    mret

unknown_trap:
    # Unknown trap, halt
    j unknown_trap

Compile and Run

# Compile (link at 0x80000000, where QEMU virt and Spike place RAM)
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -Wl,-Ttext=0x80000000 -o lab4_trap lab4_trap.S

# Run with QEMU (requires Machine Mode support)
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab4_trap

# Or use Spike
spike --isa=rv64gc lab4_trap

What to Observe

Use GDB to trace execution:

# Terminal 1: Start QEMU and wait for GDB
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab4_trap -s -S

# Terminal 2: Connect GDB
riscv64-unknown-elf-gdb lab4_trap
(gdb) target remote :1234
(gdb) break trap_handler
(gdb) continue

When the breakpoint hits, examine:

(gdb) info registers mcause    # Should show 2 (Illegal Instruction)
(gdb) info registers mepc      # Address of the .word 0xFFFFFFFF
(gdb) stepi                    # Step through handler

You should observe:

mcause = 2 (Illegal Instruction)
mepc points to the address of .word 0xFFFFFFFF
mtval may contain the encoding of the illegal instruction

danieRTOS Reference: The context switch mechanism in danieRTOS builds on these trap handling fundamentals, using mret to switch between tasks.

Key Concept: mepc Adjustment

💭 Why doesn’t an Interrupt need to modify mepc, but an Exception does?

Interrupt: Is “asynchronous”—it occurs between two instructions. mepc points to the “next instruction to execute.” After handling the interrupt, continuing from there is correct.

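Exception: Is “synchronous”: triggered by the current instruction. mepc points to the “instruction that triggered the exception.” If the handler fixes the cause (as a page fault handler does), re-executing that instruction is exactly what you want. If it doesn’t, and you don’t modify mepc, mret will re-execute the same instruction, triggering the exception again, forming an infinite loop!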

⚠️ Common Pitfalls

Pitfall 1: Forgetting to Set `mtvec`

Error Scenario: A trap triggers at startup before mtvec is set, causing the CPU to jump to an uninitialized address (usually 0).

# ❌ Wrong: mtvec not set before potential trap
_start:
    ecall                   # If mtvec isn't set, jumps to unknown address

# ✅ Correct: Set mtvec as the first thing
_start:
    la t0, trap_handler
    csrw mtvec, t0
    # Now safe to execute instructions that might trap

Pitfall 2: Corrupting Caller’s Registers in Handler

Error Scenario: The Trap Handler uses a0, t0, etc. without saving/restoring them, causing the interrupted program’s data to be corrupted.

# ❌ Wrong: Directly using registers
trap_handler:
    csrr t0, mcause         # t0 is overwritten!
    # ... handle ...
    mret                    # Original t0 value is gone

# ✅ Correct: Save first, restore after
trap_handler:
    addi sp, sp, -16        # Keep sp 16-byte aligned, as the ABI expects
    sd t0, 0(sp)            # Save t0

    csrr t0, mcause
    # ... handle ...

    ld t0, 0(sp)            # Restore t0
    addi sp, sp, 16
    mret

Pitfall 3: Confusing `mret` with `ret`

Error Scenario: Using ret instead of mret at the end of the Trap Handler.

# ❌ Wrong: ret is just jalr x0, 0(ra), doesn't restore privilege level
trap_handler:
    # ...
    ret                     # Jumps to ra, but privilege unchanged, mepc unused

# ✅ Correct: mret restores mstatus.MPP and jumps to mepc
trap_handler:
    # ...
    mret                    # Correct return

Summary

RISC-V’s trap mechanism provides a unified framework for handling both synchronous exceptions and asynchronous interrupts. When a trap occurs, the processor saves the current PC to xepc, records the cause in xcause, stores additional information in xtval, updates privilege and interrupt state in xstatus, and jumps to the handler address in xtvec. This clean, consistent mechanism works the same way at each privilege level that can handle traps (M-mode and S-mode), with parallel CSR sets.

Exceptions are synchronous events caused by instruction execution: illegal instructions, misaligned accesses, page faults, breakpoints, and environment calls (ECALL). The xcause register encodes the exception type, while xtval provides context like the faulting address or illegal instruction. Exception handlers can fix the problem (like loading a swapped page) and resume execution, or terminate the offending program.

Interrupts are asynchronous events from external sources: software interrupts (for inter-processor communication), timer interrupts (for preemptive scheduling), and external interrupts (from devices). Each interrupt type has enable bits in xie and pending bits in xip. Interrupts are only taken when globally enabled (xstatus.xIE = 1) and individually enabled. Nested interrupts require careful management of the interrupt enable state.

The Platform-Level Interrupt Controller (PLIC) manages external interrupts from devices. It provides priority-based arbitration, per-hart interrupt routing, and a claim/complete protocol that ensures each interrupt is handled exactly once. The PLIC supports up to 1023 interrupt sources (ID 0 is reserved) with configurable priorities and thresholds. Memory-mapped registers control configuration and claim interrupts for handling.

The Core-Local Interrupt Controller (CLIC) extends RISC-V with vectored interrupts, preemptive priority levels, and hardware interrupt nesting. CLIC reduces interrupt latency by eliminating software dispatch and enabling direct jumps to interrupt-specific handlers. This makes CLIC suitable for real-time systems where deterministic, low-latency interrupt handling is critical.

The Advanced Interrupt Architecture (AIA) brings message-signaled interrupts (MSI), interrupt virtualization, and scalable interrupt delivery to RISC-V. The Incoming MSI Controller (IMSIC) receives MSI writes and signals interrupts to harts. The Advanced PLIC (APLIC) converts wired interrupts to MSIs. Together, they enable efficient interrupt handling in large multi-core systems and virtualized environments.

Compared to ARM’s Generic Interrupt Controller (GIC), RISC-V’s interrupt architecture is more modular. Simple systems use the basic PLIC. Real-time systems add CLIC for low latency. High-performance systems add AIA for MSI and virtualization. ARM’s GIC provides all features in one controller, which ensures consistency but mandates complexity even for simple systems. RISC-V’s approach allows implementations to choose the right level of complexity for their needs.

Chapter 5. Virtual Memory & Paging (Sv39 / Sv48)

Part IV — Memory & Addressing

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand VA to PA Translation: Grasp how Virtual Addresses are converted to Physical Addresses via Page Tables
Master Sv39 Structure: Understand the three-level Page Table hierarchy (L2 → L1 → L0)
Configure the satp CSR: Calculate and set satp to enable the MMU
Understand TLB Mechanism: Know how TLB accelerates address translation and when to flush it
Handle Page Faults: Analyze the causes of Page Faults and understand the handling flow

💡 Scenario: The Library’s Call Numbers

Scene: Junior is debugging a multi-process system, staring at GDB’s memory display, increasingly puzzled.

Junior: “Professor, I’ve encountered something really weird. I’m running two programs simultaneously, and when I look at their memory in GDB, both are using address 0x10000! But the data inside is completely different. How is this possible? Is the CPU a quantum computer?”

Professor: (laughing) “This isn’t quantum entanglement. Have you ever been to a library?”

Junior: “Sure, but what does a library have to do with this?”

Professor: “Imagine this: You’re at Library Branch A, and using call number Q123, you find a book called ‘Introduction to Quantum Mechanics.’ Your friend is at Branch B, uses the same call number Q123, and finds ‘Calculus Exercise Collection.’”

Junior: “Because each branch has its own shelf arrangement?”

Professor: “Exactly!

Call Number (Virtual Address): The address the program sees, like a call number. Each program thinks it has an entire library to itself.
Actual Shelf Location (Physical Address): Where the book really is.
Catalog Index (Page Table): The lookup table that translates call numbers to actual shelf locations.
Branch (Process): Each branch has its own catalog index.”

Junior: “So two programs using the same virtual address, through different Page Tables, get translated to different physical addresses?”

Professor: “You’ve got it! This is the essence of Virtual Memory. The operating system prepares a dedicated catalog (Page Table) for each process, making each think it has exclusive use of the entire library, when in reality everyone’s books are crammed into the same warehouse. The benefits are:

Isolation: Program A messing up won’t affect Program B.
Protection: Some shelves are marked ‘Staff Only’—you can’t touch them without permission.
Flexibility: Books can be relocated anytime—just update the catalog.”

Junior: “So what’s the satp CSR for?”

Professor: “satp tells the CPU: ‘Use this particular catalog (Page Table), starting from this location.’ When the OS switches processes, it updates satp to point to a different catalog.”

Junior: “Got it! Let’s try building this catalog ourselves!”

Virtual memory is one of the most important abstractions in modern computing. It provides memory protection, isolating processes from each other and from the operating system. It provides address space abstraction, giving each process a simple, contiguous view of memory regardless of physical layout. It enables memory overcommitment, allowing systems to run more programs than would fit in physical RAM. And it supports shared memory, enabling efficient communication and resource sharing.

RISC-V implements virtual memory through a clean, flexible paging system. Sv39 provides 39-bit virtual addresses (512 GB address space) with three-level page tables, suitable for most application processors. Sv48 extends this to 48-bit virtual addresses (256 TB address space) with four-level page tables for systems requiring larger address spaces. Both modes support superpages (2 MB and 1 GB pages) for reduced TLB pressure and efficient large mappings.

This chapter explores RISC-V’s virtual memory system in detail: page table structures, address translation, TLB management, page faults, and the Physical Memory Protection (PMP) mechanism that provides memory protection even without virtual memory. Understanding these concepts is essential for operating system developers, hypervisor implementers, and anyone working with RISC-V system software.

5.1 Virtual Memory Overview

Why Virtual Memory?

Virtual memory solves several fundamental problems that would otherwise make operating systems nearly impossible to build.

First, virtual memory provides memory protection. Without it, any program could read or write any memory location, including the operating system’s code and data. A buggy program could crash the entire system. A malicious program could steal data from other programs or take control of the system. Virtual memory allows the OS to isolate each process in its own address space, preventing interference.

Second, virtual memory provides address space abstraction. Each process sees a simple, contiguous address space starting at address zero, regardless of where its memory is actually located in physical RAM. The process doesn’t need to know—or care—that its memory might be scattered across different physical locations, or that some of it might be swapped to disk. This abstraction simplifies programming and allows the OS to manage physical memory flexibly.

Third, virtual memory enables memory overcommitment. The OS can give each process a large virtual address space (512 GB in Sv39, 256 TB in Sv48) even if the system has far less physical RAM. Most of that virtual space is never actually used. The OS only allocates physical memory for the pages that are actually accessed. This allows running more programs than would fit in physical memory simultaneously.

Fourth, virtual memory supports shared memory. Multiple processes can map the same physical memory into their virtual address spaces. This is essential for shared libraries (like libc), which would otherwise waste memory by being loaded separately for each process. It’s also used for inter-process communication and memory-mapped files.

RISC-V Virtual Memory Modes

RISC-V defines several virtual memory modes, selected by the MODE field in the satp (Supervisor Address Translation and Protection) CSR:

Bare (MODE=0): No address translation. Virtual addresses equal physical addresses. M-mode always runs without translation regardless of satp; Bare is also the mode for S-mode systems that don’t need virtual memory.
Sv32 (MODE=1): 32-bit virtual addressing for RV32. Provides a 4 GB virtual address space with two-level page tables. Used in 32-bit embedded systems running operating systems.
Sv39 (MODE=8): 39-bit virtual addressing for RV64. Provides a 512 GB virtual address space with three-level page tables. This is the most common mode for 64-bit RISC-V systems running Linux or similar operating systems.
Sv48 (MODE=9): 48-bit virtual addressing for RV64. Provides a 256 TB virtual address space with four-level page tables. Used in systems that need larger address spaces, such as large servers or databases.
Sv57 (MODE=10): 57-bit virtual addressing for RV64. Provides a 128 PB virtual address space with five-level page tables. This mode is defined but rarely implemented, as 256 TB is sufficient for nearly all current applications.

The choice of mode is a trade-off. Larger address spaces require more levels of page table lookup, which increases the cost of TLB misses. Most systems use Sv39, which provides a good balance between address space size and performance.

The satp CSR

The satp (Supervisor Address Translation and Protection) register controls virtual memory. It’s a supervisor-level CSR that can only be accessed from S-mode or M-mode.

In RV64, satp has three fields:

 63    60 59                44 43                                0
+--------+--------------------+----------------------------------+
|  MODE  |        ASID        |               PPN                |
+--------+--------------------+----------------------------------+

MODE (bits 63:60): Selects the address translation mode (0=Bare, 8=Sv39, 9=Sv48, 10=Sv57)
ASID (bits 59:44): Address Space Identifier, a 16-bit tag used to distinguish TLB entries from different processes
PPN (bits 43:0): Physical Page Number of the root page table

To enable virtual memory, the OS:

Allocates a page table in physical memory
Initializes the page table entries
Writes satp with MODE=8 (for Sv39) or MODE=9 (for Sv48) and the physical page number of the root page table (its physical address shifted right by 12)
Executes SFENCE.VMA to flush the TLB

After this, all memory accesses from S-mode and U-mode go through address translation.
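As a rough illustration of the field layout above, the satp value can be assembled with a few shifts and masks. This is a minimal sketch: the helper names (satp_make, satp_mode, satp_asid, satp_root_pa) and the SATP_* macros are our own, not part of any standard header, and the code only computes the value. Writing it to the CSR still needs a csrw instruction on real hardware.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_BARE  0ULL
#define SATP_MODE_SV39  8ULL
#define SATP_MODE_SV48  9ULL

#define SATP_MODE_SHIFT 60
#define SATP_ASID_SHIFT 44
#define SATP_ASID_MASK  0xFFFFULL
#define SATP_PPN_MASK   ((1ULL << 44) - 1)

// Build an RV64 satp value from a mode, an ASID, and the physical
// address of the root page table (which must be 4 KB aligned).
static inline uint64_t satp_make(uint64_t mode, uint64_t asid, uint64_t root_pa)
{
    assert((root_pa & 0xFFF) == 0);            // root table must be page aligned
    uint64_t ppn = (root_pa >> 12) & SATP_PPN_MASK;
    return (mode << SATP_MODE_SHIFT)
         | ((asid & SATP_ASID_MASK) << SATP_ASID_SHIFT)
         | ppn;
}

// Field accessors, useful when inspecting satp in a debugger or trap handler.
static inline uint64_t satp_mode(uint64_t satp)    { return satp >> SATP_MODE_SHIFT; }
static inline uint64_t satp_asid(uint64_t satp)    { return (satp >> SATP_ASID_SHIFT) & SATP_ASID_MASK; }
static inline uint64_t satp_root_pa(uint64_t satp) { return (satp & SATP_PPN_MASK) << 12; }
```

For example, Sv39 with ASID 5 and a root table at physical address 0x8020_0000 gives 0x8000_5000_0008_0200: MODE=8 in the top nibble, ASID=5 at bit 44, and PPN=0x80200 in the low bits.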

Address Space Identifiers (ASIDs)

The ASID field in satp is an optimization. When the OS switches between processes, it must change satp to point to the new process’s page table. This would normally require flushing the entire TLB, since TLB entries from the old process are no longer valid.

ASIDs avoid this cost. Each TLB entry is tagged with the ASID from satp when it was created. When looking up a TLB entry, the hardware checks that the ASID matches. This allows TLB entries from multiple processes to coexist. When switching processes, the OS just changes satp (including the ASID), and the TLB automatically filters entries.

If the OS runs out of ASIDs (the field is 16 bits wide, so there are at most 2^16 = 65536 values, and an implementation may support fewer), it can flush the TLB and reuse ASIDs. In practice, 65536 is enough for most workloads.

TLB Management

The Translation Lookaside Buffer (TLB) caches recent address translations. Without the TLB, every memory access would require walking the page table, which could take several memory accesses. The TLB makes virtual memory practical by caching translations.

RISC-V doesn’t specify the TLB implementation—it’s a microarchitectural detail. But it does provide instructions for managing the TLB:

SFENCE.VMA: Fence for virtual memory. This instruction orders memory accesses and TLB updates. It’s used after modifying page tables to ensure the TLB is consistent.

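SFENCE.VMA takes two register operands that narrow its scope:

rs1: If rs1 is not x0, only flush TLB entries for the virtual address held in rs1
rs2: If rs2 is not x0, only flush TLB entries for the ASID held in rs2

The test is on the register number, not the register value: a register other than x0 that happens to contain zero still selects address 0 or ASID 0. If both operands are x0, the entire TLB is flushed. If only rs1 is given, only entries for that virtual address are flushed (across all ASIDs). If only rs2 is given, only entries for that ASID are flushed, and global mappings are left in place.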

This flexibility allows the OS to minimize TLB flushes. For example, when unmapping a single page, the OS can flush just that page’s TLB entry instead of the entire TLB.
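To make the ASID tagging and the SFENCE.VMA variants concrete, here is a toy software model of a TLB. It is only a sketch under stated assumptions: real TLBs are microarchitectural and may flush more than this model does, and the names tlb_entry, tlb_lookup, and tlb_sfence_vma are hypothetical. A NULL pointer stands in for "the operand is x0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// One entry of a toy, fully associative TLB.
typedef struct {
    bool     valid;
    bool     global;   // copied from the PTE's G bit
    uint16_t asid;     // satp.ASID at the time the entry was filled
    uint64_t vpn;      // virtual page number (va >> 12)
    uint64_t ppn;      // physical page number
} tlb_entry;

// Lookup: a hit needs a matching page and either a global entry or a matching ASID.
static const tlb_entry *tlb_lookup(const tlb_entry *tlb, size_t n,
                                   uint64_t va, uint16_t asid)
{
    for (size_t i = 0; i < n; i++) {
        if (tlb[i].valid && tlb[i].vpn == (va >> 12) &&
            (tlb[i].global || tlb[i].asid == asid))
            return &tlb[i];
    }
    return NULL;
}

// Model of "sfence.vma rs1, rs2". Pass NULL for an operand that is x0.
static void tlb_sfence_vma(tlb_entry *tlb, size_t n,
                           const uint64_t *va, const uint16_t *asid)
{
    for (size_t i = 0; i < n; i++) {
        bool va_match = (va == NULL) || tlb[i].vpn == (*va >> 12);
        // An ASID-specific fence leaves global mappings alone.
        bool asid_match = (asid == NULL) ||
                          (!tlb[i].global && tlb[i].asid == *asid);
        if (va_match && asid_match)
            tlb[i].valid = false;
    }
}
```

In this model, two processes can hold entries for the same virtual page at once, and a fence for ASID 1 removes only the first process's entries while kernel (global) entries survive.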

Figure 5.1: RISC-V Virtual Memory Modes

graph TB
    subgraph "RV32 Modes"
        BARE32[Bare Mode<br/>No translation<br/>VA = PA]
        SV32[Sv32 Mode<br/>32-bit VA<br/>4 GB address space<br/>2-level page table]
    end

    subgraph "RV64 Modes"
        BARE64[Bare Mode<br/>No translation<br/>VA = PA]
        SV39[Sv39 Mode<br/>39-bit VA<br/>512 GB address space<br/>3-level page table]
        SV48[Sv48 Mode<br/>48-bit VA<br/>256 TB address space<br/>4-level page table]
        SV57[Sv57 Mode<br/>57-bit VA<br/>128 PB address space<br/>5-level page table]
    end

    style BARE32 fill:#FFB6C1
    style SV32 fill:#87CEEB
    style BARE64 fill:#FFB6C1
    style SV39 fill:#90EE90
    style SV48 fill:#FFD700
    style SV57 fill:#DDA0DD

5.2 Sv39: 39-bit Virtual Address Space

Sv39 Overview

Sv39 is the most widely used virtual memory mode for 64-bit RISC-V systems. It provides a 512 GB virtual address space, which is sufficient for most applications while keeping page table walks reasonably fast.

The “39” in Sv39 refers to the number of bits in the virtual address. A 39-bit address can represent 2^39 = 512 GB of address space. This might seem small compared to the 64-bit registers in RV64, but it’s a practical choice. Most programs don’t need more than 512 GB of virtual memory, and using fewer bits means fewer levels of page table lookup.

Sv39 Address Format

An Sv39 virtual address is divided into four parts:

 63        39 38    30 29    21 20    12 11           0
+------------+--------+--------+--------+--------------+
|  Reserved  | VPN[2] | VPN[1] | VPN[0] | Page Offset  |
+------------+--------+--------+--------+--------------+
     25 bits   9 bits   9 bits   9 bits    12 bits

Reserved (bits 63:39): Must be equal to bit 38 (sign extension). This ensures that valid addresses are either in the lower half (0x0000_0000_0000_0000 to 0x0000_003F_FFFF_FFFF) or the upper half (0xFFFF_FFC0_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF) of the 64-bit address space. Addresses that don’t follow this rule cause a page fault.
VPN[2] (bits 38:30): Virtual Page Number, level 2. This is the index into the root page table.
VPN[1] (bits 29:21): Virtual Page Number, level 1. This is the index into the second-level page table.
VPN[0] (bits 20:12): Virtual Page Number, level 0. This is the index into the third-level (leaf) page table.
Page Offset (bits 11:0): Offset within the 4 KB page. This is not translated—it’s copied directly to the physical address.

Each VPN field is 9 bits, which means each page table has 2^9 = 512 entries. The page offset is 12 bits, which means pages are 2^12 = 4096 bytes (4 KB).
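The field boundaries above translate directly into shifts and masks. The following sketch (helper names sv39_vpn, sv39_offset, and sv39_is_canonical are our own) splits a virtual address into its parts and checks the sign-extension rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Extract VPN[level] (level 0, 1, or 2) from an Sv39 virtual address.
static inline unsigned sv39_vpn(uint64_t va, int level)
{
    assert(level >= 0 && level <= 2);
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

// Extract the 12-bit page offset.
static inline unsigned sv39_offset(uint64_t va)
{
    return (unsigned)(va & 0xFFF);
}

// An address is canonical when bits 63:39 all equal bit 38, i.e. the
// 26 bits from 63 down to 38 are either all zeros or all ones.
static inline bool sv39_is_canonical(uint64_t va)
{
    uint64_t top = va >> 38;
    return top == 0 || top == 0x3FFFFFFULL;
}
```

For the address 0x0000_0020_1234_5678 these helpers yield VPN[2]=128, VPN[1]=145, VPN[0]=325, and offset 0x678. The address 0x0000_0040_0000_0000 has bit 38 set but zeros above it, so it fails the canonical check.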

Three-Level Page Table Walk

Address translation in Sv39 involves walking a three-level page table. Here’s the algorithm:

Start with the root page table. Its physical page number is in satp.PPN.
Use VPN[2] as an index into the root page table. Read the Page Table Entry (PTE) at that index.
If the PTE is invalid (V=0) or uses the reserved encoding (R=0 with W=1), raise a page fault.
If the PTE is a leaf (R=1 or X=1), the translation is complete. The PTE contains the physical page number. Go to the final step.
Otherwise, the PTE points to the next level page table. Use VPN[1] as an index into that page table. Read the PTE at that index.
If the PTE is invalid or uses the reserved encoding, raise a page fault.
If the PTE is a leaf, the translation is complete. Otherwise, use VPN[0] as an index into the third-level page table. Read the PTE at that index. This must be a leaf; if it is not, raise a page fault.
Combine the physical page number from the PTE with the page offset from the virtual address to form the physical address. For a leaf found above level 0 (a superpage), the unused VPN fields are treated as part of the offset.

This process can require up to three memory accesses (one per level). That’s why the TLB is so important—it caches the result, avoiding the page table walk for subsequent accesses to the same page.
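The walk can be expressed as a short loop. The sketch below runs against a small simulated physical memory so it can be tried on any host; sim_mem, sim_read_pte, and sv39_walk are hypothetical names, and the permission, U-bit, and A/D checks are deliberately left out so the walk itself stays visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V 0x01ULL
#define PTE_R 0x02ULL
#define PTE_W 0x04ULL
#define PTE_X 0x08ULL

// Simulated physical memory: four 4 KB pages starting at physical address 0.
// (A test fixture only; a real walker reads actual RAM.)
#define SIM_PAGES 4
static uint64_t sim_mem[SIM_PAGES * 512];

static uint64_t sim_read_pte(uint64_t pa)
{
    assert(pa % 8 == 0 && pa / 8 < SIM_PAGES * 512);
    return sim_mem[pa / 8];
}

// Walk an Sv39 page table for va. Returns true and stores the physical
// address in *pa_out on success; returns false where hardware would
// raise a page fault.
static bool sv39_walk(uint64_t root_ppn, uint64_t va, uint64_t *pa_out)
{
    uint64_t table_pa = root_ppn << 12;
    for (int level = 2; level >= 0; level--) {
        unsigned shift = 12 + 9 * (unsigned)level;
        uint64_t index = (va >> shift) & 0x1FF;
        uint64_t pte   = sim_read_pte(table_pa + index * 8);

        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return false;                      // invalid or reserved encoding

        uint64_t ppn = (pte >> 10) & ((1ULL << 44) - 1);
        if (pte & (PTE_R | PTE_X)) {           // leaf PTE
            uint64_t page_mask = (1ULL << shift) - 1;
            if ((ppn << 12) & page_mask)
                return false;                  // misaligned superpage
            *pa_out = (ppn << 12) | (va & page_mask);
            return true;
        }
        table_pa = ppn << 12;                  // pointer to the next level
    }
    return false;                              // level-0 PTE was not a leaf
}
```

Because the loop computes the offset mask from the level at which the leaf was found, the same code handles 4 KB pages, 2 MB superpages, and 1 GB superpages.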

Page Table Entry (PTE) Format

Each PTE in Sv39 is 64 bits:

 63      54 53        28 27        19 18        10 9  8 7 6 5 4 3 2 1 0
+----------+------------+------------+------------+-----+-+-+-+-+-+-+-+-+
| Reserved |   PPN[2]   |   PPN[1]   |   PPN[0]   | RSW |D|A|G|U|X|W|R|V|
+----------+------------+------------+------------+-----+-+-+-+-+-+-+-+-+
  10 bits     26 bits      9 bits       9 bits    2 bits  8 flag bits

Reserved (bits 63:54): Reserved for future use. Software must write zero unless an extension that defines these bits (such as Svnapot or Svpbmt) is in use.
PPN[2:0] (bits 53:10): Physical Page Number. For a leaf PTE, this is the physical page number of the mapped page. For a non-leaf PTE, this is the physical page number of the next-level page table.
RSW (bits 9:8): Reserved for Software. The hardware ignores these bits. The OS can use them for any purpose (e.g., tracking page state).
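D (bit 7): Dirty. Records that the page has been written. Depending on the implementation, hardware sets it on a write, or a write to a page with D=0 raises a page fault so that software can set it. Used by the OS to track which pages need to be written back to disk.
A (bit 6): Accessed. Records that the page has been read, written, or fetched from. As with D, hardware either sets it or raises a page fault when A=0. Used by the OS for page replacement algorithms.
G (bit 5): Global. If set, this mapping is global and not associated with any ASID. Global mappings are never flushed by ASID-specific SFENCE.VMA.
U (bit 4): User. If set, this page is accessible from U-mode. If clear, the page is only accessible from S-mode. S-mode can load from or store to a U=1 page only when sstatus.SUM is set, and can never execute from one.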
X (bit 3): Execute. If set, the page can be executed.
W (bit 2): Write. If set, the page can be written.
R (bit 1): Read. If set, the page can be read.
V (bit 0): Valid. If clear, the PTE is invalid and any access causes a page fault.
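The PTE layout can be captured in a handful of macros and helpers. This is a small sketch with names of our own choosing (pte_make, pte_pa, pte_flags); it packs a physical address and flags into a PTE and unpacks them again.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)   // Valid
#define PTE_R (1ULL << 1)   // Read
#define PTE_W (1ULL << 2)   // Write
#define PTE_X (1ULL << 3)   // Execute
#define PTE_U (1ULL << 4)   // User
#define PTE_G (1ULL << 5)   // Global
#define PTE_A (1ULL << 6)   // Accessed
#define PTE_D (1ULL << 7)   // Dirty

#define PTE_PPN_SHIFT 10
#define PTE_PPN_MASK  ((1ULL << 44) - 1)
#define PTE_FLAG_MASK 0x3FFULL              // 8 flag bits plus the 2 RSW bits

// Pack a page-aligned physical address and flag bits into a PTE.
static inline uint64_t pte_make(uint64_t pa, uint64_t flags)
{
    assert((pa & 0xFFF) == 0);              // PPN drops the low 12 bits
    assert((flags & ~PTE_FLAG_MASK) == 0);  // flags must fit in bits 9:0
    return (((pa >> 12) & PTE_PPN_MASK) << PTE_PPN_SHIFT) | flags;
}

// Recover the physical address (of the page or of the next-level table).
static inline uint64_t pte_pa(uint64_t pte)
{
    return ((pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK) << 12;
}

// Recover the low 10 bits (flags and RSW).
static inline uint64_t pte_flags(uint64_t pte)
{
    return pte & PTE_FLAG_MASK;
}
```

A readable, writable, executable mapping of physical address 0x8000_0000 with A and D preset encodes as 0x2000_00CF: the PPN 0x80000 shifted left by 10, plus flag bits 0xCF.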

PTE Flags and Permissions

The R, W, X, and U flags control access permissions. The hardware checks these flags during address translation:

If V=0, the PTE is invalid. Page fault.
If R=0 and W=1, the PTE is invalid (write-only pages are reserved). Page fault.
If R=1, W=1, or X=1, the PTE is a leaf. The physical page number is in PPN[2:0].
If R=0, W=0, and X=0, the PTE is a pointer to the next level. The physical page number of the next-level page table is in PPN[2:0].

For leaf PTEs, the permissions are checked:

If the access is a read and R=0, page fault.
If the access is a write and W=0, page fault.
If the access is an instruction fetch and X=0, page fault.
If the access is from U-mode and U=0, page fault.

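The A and D bits record accesses and writes. They are set by hardware, or by the OS in response to a page fault on implementations that do not update them automatically. The OS can clear these bits and use them to implement page replacement algorithms (e.g., approximations of LRU).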
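The classification and permission rules above fit in two small functions. This is a simplified sketch with hypothetical names (pte_classify, leaf_allows); it implements only the rules listed in this section and omits the sstatus.SUM and sstatus.MXR modifiers as well as the A/D handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PTE_U (1ULL << 4)

typedef enum { KIND_INVALID, KIND_POINTER, KIND_LEAF } pte_kind;
typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC } access_type;

// Decide what a PTE is, following the four structural rules.
static pte_kind pte_classify(uint64_t pte)
{
    if (!(pte & PTE_V))
        return KIND_INVALID;                 // V=0
    if (!(pte & PTE_R) && (pte & PTE_W))
        return KIND_INVALID;                 // write-only is reserved
    if (pte & (PTE_R | PTE_X))
        return KIND_LEAF;                    // W=1 implies R=1 at this point
    return KIND_POINTER;                     // R=W=X=0
}

// Check whether a leaf PTE permits the access. user_mode is true for U-mode.
static bool leaf_allows(uint64_t pte, access_type access, bool user_mode)
{
    if (user_mode && !(pte & PTE_U))
        return false;                        // U-mode needs U=1
    switch (access) {
    case ACCESS_READ:  return (pte & PTE_R) != 0;
    case ACCESS_WRITE: return (pte & PTE_W) != 0;
    case ACCESS_EXEC:  return (pte & PTE_X) != 0;
    }
    return false;
}
```

When leaf_allows returns false, hardware would raise the page fault matching the access type (instruction, load, or store/AMO).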

Superpages

Sv39 supports superpages—large pages that are multiples of the base 4 KB page size. A superpage is created by making a PTE at level 1 or level 2 a leaf (by setting R, W, or X).

A level 1 leaf PTE creates a 2 MB superpage (2^21 bytes). VPN[0] is not used; instead, bits 20:12 of the virtual address become part of the page offset.
A level 2 leaf PTE creates a 1 GB superpage (2^30 bytes). VPN[1] and VPN[0] are not used; instead, bits 29:12 of the virtual address become part of the page offset.

Superpages reduce TLB pressure by covering more memory with fewer TLB entries. They’re commonly used for large allocations like the kernel’s direct map of physical memory, or for large application heaps.

For a superpage PTE to be valid, the PPN must be properly aligned. For a 2 MB superpage, PPN[0] must be zero. For a 1 GB superpage, PPN[1:0] must be zero. A misaligned superpage PTE causes a page fault when it is used.
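The alignment rule and the wider page offset can be checked with a little arithmetic. In this sketch the helper names (leaf_size, leaf_aligned, leaf_translate) are ours; level is the page table level at which the leaf PTE was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Bytes mapped by a leaf at the given level: 4 KB, 2 MB, or 1 GB in Sv39.
static inline uint64_t leaf_size(int level)
{
    assert(level >= 0 && level <= 2);
    return 1ULL << (12 + 9 * level);
}

// A leaf's physical base must be aligned to the size it maps, which is the
// same as requiring the low PPN fields to be zero.
static inline bool leaf_aligned(uint64_t ppn, int level)
{
    return ((ppn << 12) & (leaf_size(level) - 1)) == 0;
}

// Physical address = base from the PTE | low bits of the virtual address.
// For superpages the "offset" absorbs the unused VPN fields.
static inline uint64_t leaf_translate(uint64_t ppn, int level, uint64_t va)
{
    assert(leaf_aligned(ppn, level));
    return (ppn << 12) | (va & (leaf_size(level) - 1));
}
```

For instance, a 2 MB superpage based at 0x8020_0000 maps virtual address 0x1234_5678 to 0x8034_5678: the low 21 bits (0x14_5678) pass through untranslated.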

Figure 5.2: Sv39 Address Translation

graph LR
    VA[Virtual Address<br/>39 bits]
    VPN2[VPN2<br/>9 bits]
    VPN1[VPN1<br/>9 bits]
    VPN0[VPN0<br/>9 bits]
    OFFSET[Offset<br/>12 bits]

    SATP[satp.PPN<br/>Root Page Table]
    L2[Level 2 PTE]
    L1[Level 1 PTE]
    L0[Level 0 PTE<br/>Leaf]

    PPN[Physical Page Number]
    PA[Physical Address]

    VA --> VPN2
    VA --> VPN1
    VA --> VPN0
    VA --> OFFSET

    SATP --> L2
    VPN2 -.index.-> L2
    L2 --> L1
    VPN1 -.index.-> L1
    L1 --> L0
    VPN0 -.index.-> L0
    L0 --> PPN
    PPN --> PA
    OFFSET --> PA

    style VA fill:#FFB6C1
    style SATP fill:#87CEEB
    style L0 fill:#90EE90
    style PA fill:#FFD700

5.3 Sv48: 48-bit Virtual Address Space

Sv48 Overview

Sv48 extends Sv39 by adding one more level to the page table, increasing the virtual address space from 512 GB to 256 TB. This is useful for very large applications, such as databases that manage terabytes of data, or for systems that need to map large amounts of physical memory.

The trade-off is performance. Each additional level of page table adds one more memory access to the page table walk. For workloads with poor TLB hit rates, this can noticeably impact performance. Most systems use Sv39 unless they specifically need the larger address space.

Sv48 Address Format

An Sv48 virtual address has 48 bits of address and 16 bits of sign extension:

 63        48 47    39 38    30 29    21 20    12 11           0
+------------+--------+--------+--------+--------+--------------+
|  Reserved  | VPN[3] | VPN[2] | VPN[1] | VPN[0] | Page Offset  |
+------------+--------+--------+--------+--------+--------------+
    16 bits    9 bits   9 bits   9 bits   9 bits    12 bits

The structure is similar to Sv39, but with an additional VPN[3] field for the fourth level of page table.
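Since Sv39 and Sv48 differ only in the number of 9-bit VPN fields, one set of helpers can serve both. The sketch below is parameterized by a small descriptor; sv_mode, SV39, SV48, vpn_field, and is_canonical are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Describes a paging mode: width of the virtual address and number of levels.
typedef struct {
    int va_bits;
    int levels;
} sv_mode;

static const sv_mode SV39 = { 39, 3 };
static const sv_mode SV48 = { 48, 4 };

// Extract VPN[level]; valid levels are 0 .. mode.levels - 1.
static inline unsigned vpn_field(uint64_t va, sv_mode mode, int level)
{
    assert(level >= 0 && level < mode.levels);
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

// Canonical: every bit above the top address bit is a copy of that bit.
static inline bool is_canonical(uint64_t va, sv_mode mode)
{
    uint64_t top  = va >> (mode.va_bits - 1);
    uint64_t ones = (1ULL << (64 - mode.va_bits + 1)) - 1;
    return top == 0 || top == ones;
}
```

The address 0x0000_7FFF_FFFF_F000 is canonical in Sv48 (VPN[3]=255) but not in Sv39, where bits 63:39 would have to match bit 38.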

Four-Level Page Table Walk

The page table walk in Sv48 is similar to Sv39, but with an extra level:

Start with the root page table at satp.PPN
Use VPN[3] to index into the root page table
If the PTE is a leaf, translation is complete
Otherwise, use VPN[2] to index into the level 2 page table
If the PTE is a leaf, translation is complete
Otherwise, use VPN[1] to index into the level 1 page table
If the PTE is a leaf, translation is complete
Otherwise, use VPN[0] to index into the level 0 page table
This must be a leaf PTE
Combine the PPN from the PTE with the page offset to form the physical address

Sv48 Superpages

Sv48 supports the same superpages as Sv39, plus one additional size:

4 KB: Level 0 leaf (base page size)
2 MB: Level 1 leaf (2^21 bytes)
1 GB: Level 2 leaf (2^30 bytes)
512 GB: Level 3 leaf (2^39 bytes)

The 512 GB superpage is enormous—it’s the entire address space of Sv39! Such large pages are rarely used, but they could be useful for mapping very large regions of physical memory with minimal TLB overhead.
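A quick calculation shows why larger pages reduce TLB pressure. The sketch below (function names are ours) computes the page size at each level and how many TLB entries it would take to cover a region, assuming the region is aligned to the page size and each page occupies one TLB entry.

```c
#include <assert.h>
#include <stdint.h>

// Page size for a leaf at the given Sv48 level (0 = 4 KB ... 3 = 512 GB).
static inline uint64_t page_size_at_level(int level)
{
    assert(level >= 0 && level <= 3);
    return 1ULL << (12 + 9 * level);
}

// Number of pages (and so TLB entries) needed to cover `bytes` of memory.
static inline uint64_t tlb_entries_needed(uint64_t bytes, int level)
{
    uint64_t size = page_size_at_level(level);
    return (bytes + size - 1) / size;       // round up
}
```

Covering 512 GB takes 2^27 (about 134 million) entries with 4 KB pages, 262,144 with 2 MB pages, 512 with 1 GB pages, and a single entry with a 512 GB page.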

Sv48 vs Sv39 Trade-offs

Choosing between Sv39 and Sv48 involves several considerations:

Address Space:

Sv39: 512 GB (sufficient for most applications)
Sv48: 256 TB (needed for very large databases, in-memory computing)

Page Table Walk Cost:

Sv39: Up to 3 memory accesses
Sv48: Up to 4 memory accesses
Impact depends on TLB hit rate

Memory Overhead:

Sv48 requires more page table memory for sparse address spaces
For a process that fits in one 512 GB region, the extra level costs one additional 4 KB table; every further 512 GB region it touches needs its own 4 KB level-2 table

Compatibility:

Sv39 is more widely supported
Sv48 may not be implemented on all RISC-V processors

For most systems, Sv39 is the right choice. Sv48 should be used only when the larger address space is genuinely needed.

5.4 Page Faults and Exception Handling

Page Fault Types

RISC-V defines three types of page faults, distinguished by the type of access that caused the fault:

Instruction Page Fault (exception code 12): Occurs when fetching an instruction from a page that is not mapped, not executable, or not accessible at the current privilege level.
Load Page Fault (exception code 13): Occurs when loading from a page that is not mapped, not readable, or not accessible at the current privilege level.
Store/AMO Page Fault (exception code 15): Occurs when storing to a page that is not mapped, not writable, or not accessible at the current privilege level.

When a page fault occurs, the processor:

Sets scause to the exception code (12, 13, or 15)
Sets sepc to the PC of the faulting instruction
Sets stval to the faulting virtual address
Traps to S-mode (or M-mode if not delegated)

The OS page fault handler examines stval to determine which page caused the fault, then decides how to handle it.
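A handler's first job is to decode scause and stval. The sketch below shows that decoding as plain C; fault_kind, page_fault_kind, and fault_page_base are hypothetical names, and reading the CSRs themselves is left to the surrounding trap code.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { FAULT_NONE, FAULT_FETCH, FAULT_LOAD, FAULT_STORE } fault_kind;

// Map an RV64 scause value to the kind of page fault, if it is one.
static fault_kind page_fault_kind(uint64_t scause)
{
    if (scause >> 63)
        return FAULT_NONE;          // top bit set: an interrupt, not an exception
    switch (scause) {
    case 12: return FAULT_FETCH;    // instruction page fault
    case 13: return FAULT_LOAD;     // load page fault
    case 15: return FAULT_STORE;    // store/AMO page fault
    default: return FAULT_NONE;     // some other exception
    }
}

// Base address of the 4 KB page that contains the faulting address in stval.
static inline uint64_t fault_page_base(uint64_t stval)
{
    return stval & ~0xFFFULL;
}
```

Exception code 14 is not a page fault, which is why the three codes are not contiguous.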

Page Fault Handling

The OS can handle page faults in several ways:

Demand Paging: The page is valid but not currently in physical memory. The OS:

Allocates a physical page
Loads the page contents from disk (if it was swapped out) or zeros it (if it’s a new page)
Updates the page table to map the virtual page to the physical page
Executes SFENCE.VMA to flush the TLB
Returns with SRET, which re-executes the faulting instruction

Copy-on-Write: The page is mapped read-only, but the process tries to write to it. This is used for fork() optimization. The OS:

Allocates a new physical page
Copies the contents from the old page to the new page
Updates the page table to map the virtual page to the new page with write permission
Executes SFENCE.VMA
Returns with SRET

Invalid Access: The page is not mapped and should not be. The OS:

Sends a SIGSEGV signal to the process (on Unix-like systems)
The process typically terminates with a segmentation fault

The key is that sepc points to the faulting instruction, so returning from the trap re-executes it. This is essential for demand paging and copy-on-write to work correctly.
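The three cases can be told apart from the faulting address, the access type, and the current PTE. The following sketch illustrates one possible decision procedure; the region structure and the function decide_fault are invented for this example and do not correspond to any particular kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)
#define PTE_W (1ULL << 2)

// Hypothetical OS bookkeeping: a range of virtual addresses the process
// is allowed to use, and whether it may write there.
typedef struct {
    uint64_t start;     // inclusive
    uint64_t end;       // exclusive
    bool     writable;
} region;

typedef enum { ACT_DEMAND_PAGE, ACT_COPY_ON_WRITE, ACT_SIGSEGV } fault_action;

static fault_action decide_fault(const region *regions, size_t n,
                                 uint64_t stval, bool is_store, uint64_t pte)
{
    const region *r = NULL;
    for (size_t i = 0; i < n; i++) {
        if (stval >= regions[i].start && stval < regions[i].end)
            r = &regions[i];
    }
    if (r == NULL)
        return ACT_SIGSEGV;                 // address not mapped at all
    if (is_store && !r->writable)
        return ACT_SIGSEGV;                 // write to a read-only region
    if (!(pte & PTE_V))
        return ACT_DEMAND_PAGE;             // legal address, page not present yet
    if (is_store && !(pte & PTE_W))
        return ACT_COPY_ON_WRITE;           // writable region, page shared read-only
    return ACT_SIGSEGV;                     // present but access not permitted
}
```

After a demand-paging or copy-on-write fix-up, the handler updates the PTE, executes SFENCE.VMA, and returns with SRET so the faulting instruction runs again.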

🛠️ Hands-on Lab: Lab 5.1 — Putting on Magic Glasses (Enable Paging)

This lab guides you through building the simplest Page Table: Identity Mapping (Virtual Address = Physical Address), and enabling the MMU.

Lab Objectives

Understand Sv39’s Page Table Entry (PTE) structure
Build an Identity Mapping Page Table
Configure the satp CSR and enable the MMU
Understand the role of sfence.vma

Concept Explanation

In Sv39 mode, the Page Table has three levels:

Virtual Address (39-bit):
+--------+--------+--------+------------+
| VPN[2] | VPN[1] | VPN[0] |   Offset   |
|  9-bit |  9-bit |  9-bit |   12-bit   |
+--------+--------+--------+------------+

Page Table Walk:
  satp.PPN → Level 2 Table → Level 1 Table → Level 0 Table → Physical Page

Each Page Table Entry (PTE) is 64-bit:

PTE Format:
+----------+------------------------------+-----+----------+
| Reserved |         PPN (44-bit)         | RSW | DAGUXWRV |
+----------+------------------------------+-----+----------+
  63    54   53                        10   9 8   7      0

Flags:
  V (Valid)     - bit 0: Entry is valid
  R (Read)      - bit 1: Readable
  W (Write)     - bit 2: Writable
  X (Execute)   - bit 3: Executable
  U (User)      - bit 4: User mode accessible
  G (Global)    - bit 5: Global mapping
  A (Accessed)  - bit 6: Has been accessed
  D (Dirty)     - bit 7: Has been written

Code

Create lab5_paging.c:

// lab5_paging.c - Minimal Identity Mapping Demo
#include <stdint.h>

// PTE Flag Definitions
#define PTE_V   (1 << 0)  // Valid
#define PTE_R   (1 << 1)  // Read
#define PTE_W   (1 << 2)  // Write
#define PTE_X   (1 << 3)  // Execute
#define PTE_U   (1 << 4)  // User
#define PTE_A   (1 << 6)  // Accessed
#define PTE_D   (1 << 7)  // Dirty

// Sv39: 512 entries per page table (9-bit index)
#define PAGE_SIZE     4096
#define PTE_PER_PAGE  512

// Page Table (must be 4KB aligned)
__attribute__((aligned(PAGE_SIZE)))
uint64_t root_page_table[PTE_PER_PAGE];

// Simplified: We use 1GB Gigapages for Identity Mapping
// VPN[2] = 0 → PA 0x0000_0000 ~ 0x3FFF_FFFF (1GB)
// VPN[2] = 1 → PA 0x4000_0000 ~ 0x7FFF_FFFF (1GB)

void setup_identity_mapping(void) {
    // Clear Page Table
    for (int i = 0; i < PTE_PER_PAGE; i++) {
        root_page_table[i] = 0;
    }

    // Create Identity Mapping (first 4GB, using 1GB gigapages)
    // This is a Leaf PTE: RWX bits are set, meaning this is the final mapping
    for (int i = 0; i < 4; i++) {
        uint64_t pa = (uint64_t)i << 30;  // Each entry maps 1GB
        uint64_t ppn = pa >> 12;          // PPN = PA >> 12
        root_page_table[i] = (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_X | PTE_A | PTE_D;
    }
}

void enable_paging(void) {
    uint64_t root_ppn = ((uint64_t)root_page_table) >> 12;

    // satp format: MODE (4-bit) | ASID (16-bit) | PPN (44-bit)
    // MODE = 8 (Sv39), ASID = 0
    uint64_t satp_val = (8ULL << 60) | root_ppn;

    // Set satp. The "memory" clobber keeps the compiler from moving
    // the page table stores past this point.
    asm volatile("csrw satp, %0" : : "r"(satp_val) : "memory");

    // Flush TLB - CRITICAL!
    asm volatile("sfence.vma" : : : "memory");
}

int main(void) {
    setup_identity_mapping();
    enable_paging();

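    // If we reach here, paging is working!
    // The program continues to run because VA == PA.
    // (Assumes startup.S has already dropped to S-mode: satp does not
    // affect M-mode, which always uses physical addresses.)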
    return 0;
}

Compile and Run

# Compile (for bare-metal S-mode)
riscv64-unknown-elf-gcc -march=rv64gc -mabi=lp64d -nostdlib \
    -T linker.ld -o lab5_paging lab5_paging.c startup.S

# Run with QEMU
qemu-system-riscv64 -machine virt -nographic -bios none -kernel lab5_paging

What You Just Did

You’ve accomplished the fundamental MMU setup:

Built a Page Table: Created entries that map VA to the same PA
Configured satp: Told the CPU where the Page Table is and which mode to use
Flushed TLB: Ensured the CPU uses the new mappings

danieRTOS Reference: The memory management in danieRTOS uses similar identity mapping for kernel space, with separate per-task mappings for user space.

Paper Exercise: Address Translation Drill

Given an Sv39 Virtual Address: 0x0000_0020_1234_5678

Manually extract:

VPN[2] = bits 38-30 = ?
VPN[1] = bits 29-21 = ?
VPN[0] = bits 20-12 = ?
Offset = bits 11-0 = ?

Answer:

VA        = 0x0000_0020_1234_5678
bits 38-0 = 0b 010000000 010010001 101000101 011001111000

VPN[2] = bits 38-30 = 0x080 = 128
VPN[1] = bits 29-21 = 0x091 = 145
VPN[0] = bits 20-12 = 0x145 = 325
Offset = bits 11-0  = 0x678 = 1656

Bits 63-39 are all 0 and match bit 38, so the address is canonical.

Translation Process:

From satp.PPN, find the Root Table (Level 2)
Use VPN[2]=128 as index, find Level 1 Table’s PPN
Use VPN[1]=145 as index, find Level 0 Table’s PPN
Use VPN[0]=325 as index, find the final Physical Page’s PPN
Physical Address = (PPN << 12) | Offset

⚠️ Common Pitfalls

Pitfall 1: Page Table Not Aligned

Error Scenario: Page Table is not 4KB aligned, so the low 12 bits of its address are lost when the PPN for satp is computed and the hardware walks the wrong memory.

// ❌ Wrong: Not aligned
uint64_t page_table[512];  // May not be 4KB aligned!

// ✅ Correct: Force 4KB alignment
__attribute__((aligned(4096)))
uint64_t page_table[512];

Pitfall 2: Forgetting to Flush TLB

Error Scenario: The Page Table was modified but sfence.vma was not executed, so the CPU may keep using stale TLB entries.

// ❌ Wrong: Forgot to flush after modification
page_table[index] = new_pte;
// CPU may still use old mapping!

// ✅ Correct: Flush TLB after modification
page_table[index] = new_pte;
asm volatile("sfence.vma");  // Tell CPU: Page Table changed, clear cache

Pitfall 3: Confusing Leaf PTE with Non-Leaf PTE

Error Scenario: Setting RWX bits on an intermediate level, accidentally creating a gigapage mapping.

// PTE type determination rules:
// - RWX all 0: Non-Leaf (points to next level Page Table)
// - RWX at least one is 1: Leaf (final mapping)

// ❌ Wrong: Level 2 PTE has R bit set, becomes 1GB gigapage!
level2_pte = (next_table_ppn << 10) | PTE_V | PTE_R;  // Accidentally becomes Leaf

// ✅ Correct: Non-Leaf PTE only sets V bit
level2_pte = (next_table_ppn << 10) | PTE_V;  // Correct Non-Leaf

Summary

RISC-V’s virtual memory system provides memory protection, address space abstraction, and flexible memory management through a clean paging mechanism. The satp CSR controls address translation, selecting the translation mode (Bare, Sv32, Sv39, Sv48, Sv57), specifying the Address Space Identifier (ASID) for TLB tagging, and pointing to the root page table.

Sv39 provides 39-bit virtual addresses with a 512 GB address space, using three-level page tables. Each level has 512 entries indexed by 9-bit VPN fields. Page Table Entries (PTEs) are 64 bits, containing a 44-bit physical page number and 8 flag bits (V, R, W, X, U, G, A, D). Leaf PTEs at level 0 map 4 KB pages. Leaf PTEs at higher levels create superpages: 2 MB at level 1, 1 GB at level 2.

Sv48 extends this to 48-bit virtual addresses with a 256 TB address space, using four-level page tables. The additional level provides more address space at the cost of one extra memory access per translation. Sv48 is needed for large databases, scientific computing, and systems requiring very large address spaces.

The Translation Lookaside Buffer (TLB) caches recent address translations, avoiding expensive page table walks. TLB entries are tagged with ASID to distinguish different address spaces. The SFENCE.VMA instruction flushes TLB entries, with optional parameters to flush specific virtual addresses or ASIDs. Efficient TLB management is critical for performance—unnecessary flushes cause expensive page table walks.

Page faults occur when the hardware cannot complete a translation: invalid PTEs (V=0), permission violations (accessing a page without R/W/X permission), or privilege violations (U-mode accessing a non-U page). The OS page fault handler can implement demand paging (allocate and load pages on first access), copy-on-write (share pages until written), and memory-mapped files (map file contents into address space).

Physical Memory Protection (PMP) provides memory protection without virtual memory, essential for embedded systems and M-mode firmware. PMP uses CSRs (pmpcfg0-15, pmpaddr0-63) to define up to 64 memory regions with access permissions (R, W, X) and address matching modes (OFF, TOR, NA4, NAPOT). PMP checks apply to the physical address of every S-mode and U-mode access, including the page table accesses made during translation, so they protect against both user programs and S-mode OS bugs.

Compared to ARM’s translation system, RISC-V’s is simpler and more regular. ARM uses complex descriptor formats with multiple page sizes and attributes. RISC-V uses a single PTE format with clean flag bits. ARM’s ASID is 8 or 16 bits; RISC-V’s is up to 16 bits on RV64 (Sv39, Sv48, Sv57) and up to 9 bits on RV32 (Sv32). Both support superpages, but RISC-V’s approach is more uniform: any level can be a leaf.

RISC-V’s virtual memory design reflects its philosophy: provide a clean, minimal mechanism that’s easy to implement and understand, while supporting the features needed for modern operating systems. The result is a system that’s simpler than ARM’s but equally capable for most applications.

Chapter 6. Memory Ordering & Synchronization

Part IV — Memory & Addressing

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand Out-of-Order Execution: Know why CPUs reorder memory accesses
Master RVWMO: Understand the basic rules of RISC-V Weak Memory Ordering
Use Fence Instructions: Know when to use fence to enforce ordering
Implement a Spinlock: Use amoswap or lr/sc to build a mutex lock
Avoid Data Races: Identify and fix race conditions in multi-core programs

💡 Scenario: Shipping Logic at the Distribution Center

Scene: Junior wrote a dual-core program where Core 0 writes data and Core 1 reads it, but the results are always scrambled.

Junior: “Senior, I’m losing my mind! My program logic is correct, but the output looks like random numbers.”

Senior: “Show me the code.”

Junior: (showing screen)

// Core 0                    // Core 1
data = 42;                   while (flag == 0) {}
flag = 1;                    print(data);  // Expected: 42

“By logic, Core 1 should wait until flag becomes 1 before printing data, and by then data should already be 42, right? But sometimes it prints 0!”

Senior: “You’re imagining the CPU as too honest. Modern CPUs are like distribution center managers—for efficiency, they’ll ‘secretly reorder shipments.’”

Junior: “What do you mean?”

Senior: “Imagine you’re a logistics manager. You have two packages to ship:

Package A: Ship to Taipei (far)
Package B: Ship to Hsinchu (near)

Which do you ship first?”

Junior: “The Hsinchu one—it’s faster anyway.”

Senior: “Bingo! The CPU thinks the same way. It sees data = 42 and flag = 1 as two stores. It notices flag’s address is already in cache while data has to wait for memory, so it writes flag first.”

Junior: “So Core 1 sees flag == 1, but data hasn’t been written yet?”

Senior: “Exactly. This is Memory Reordering. The fix is to use a Fence instruction to tell the CPU: ‘No cutting in line! All preceding stores must complete before executing subsequent stores.’”

// Core 0 (fixed version)
data = 42;
__sync_synchronize();  // Compiles to: fence iorw, iorw
flag = 1;

Junior: “I see! What about amoswap and those atomic instructions?”

Senior: “That’s a different problem: ‘How do you ensure two people don’t enter the bathroom at the same time?’ That’s what Spinlock solves.”
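The fix in the dialogue covers only the writer. RVWMO also allows Core 1's two loads to be reordered, and in plain C the unsynchronized access to flag is a data race that the compiler may optimize in surprising ways. One portable way to express the whole pattern is with C11 atomics, as in the sketch below. The function names producer and consumer are ours; the release store pairs with the acquire load so that data is visible once flag is seen as 1.

```c
#include <assert.h>
#include <stdatomic.h>

static int data;            // ordinary payload, protected by the flag protocol
static atomic_int flag;     // 0 = not ready, 1 = ready (zero-initialized)

// Core 0: write the payload, then publish it with release ordering.
// Nothing before the store may be reordered after it.
static void producer(void)
{
    data = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

// Core 1: wait for the flag with acquire ordering, then read the payload.
// Nothing after the load may be reordered before it.
static int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        // spin until the producer publishes
    }
    return data;
}
```

On RISC-V the compiler lowers these orderings to fence instructions or to acquire/release annotations on atomic instructions, so the programmer does not have to pick the fence operands by hand.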

Modern processors execute instructions out of order, reorder memory accesses, and use caches that can delay when writes become visible to other processors. These optimizations are essential for performance, but they create a fundamental problem: what does a program actually mean when multiple processors access shared memory? Without careful synchronization, programs can observe impossible behaviors where effects appear to happen in the wrong order.

RISC-V addresses this through a memory consistency model that defines which memory access orderings are legal, and synchronization primitives that enforce ordering when needed. The RISC-V Weak Memory Ordering (RVWMO) model allows aggressive reordering for performance while providing fence instructions and atomic operations to enforce ordering where required. Understanding memory ordering is essential for anyone writing concurrent code, implementing synchronization primitives, or optimizing multi-threaded applications.

This chapter explores RISC-V’s memory model in detail: the RVWMO consistency model, fence instructions for enforcing ordering, atomic instructions for lock-free synchronization, the Total Store Ordering (RVTSO) extension for stronger ordering, and comparisons with ARM’s and x86’s memory models. We’ll see how to implement locks, barriers, and lock-free data structures correctly on RISC-V.

6.1 Memory Consistency Models

The Memory Ordering Problem

Modern processors don’t execute instructions in the order they appear in the program. They reorder loads and stores, execute instructions out of order, and use store buffers and caches that can delay when writes become visible to other processors. These optimizations are essential for performance, but they create a problem: what does a program actually mean when multiple processors are accessing shared memory?

Consider this simple example with two processors:

Processor 0:          Processor 1:
x = 1                 y = 1
r1 = y                r2 = x

Initially, x = 0 and y = 0. After both processors execute, what are the possible values of r1 and r2?

In a sequentially consistent system, there are three possible outcomes:

r1 = 0, r2 = 1 (P0 executes first)
r1 = 1, r2 = 0 (P1 executes first)
r1 = 1, r2 = 1 (stores happen before loads)

But in a weakly ordered system like RISC-V, there’s a fourth possibility:

r1 = 0, r2 = 0 (both loads execute before both stores become visible)

This happens because each processor can reorder its own store after its load. The store to x might sit in P0’s store buffer while P0 executes the load from y. Similarly for P1. Both loads see the old values (0), even though both stores eventually complete.

This behavior is surprising and can lead to bugs if programmers aren’t careful. But it’s also essential for performance. Enforcing sequential consistency naively would mean stalling the processor on nearly every memory operation until it is globally visible, which would be unacceptably slow.
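The claim above can be checked mechanically. The following is a minimal sketch in C, not a model of any real hart: `sc_outcomes` enumerates every sequentially consistent interleaving of the two-processor example, and `store_buffer_outcome` replays one schedule in which each store waits in a private buffer. All function names here are illustrative assumptions.

```c
#include <assert.h>

/* Bit (r1 * 2 + r2) is set when the outcome (r1, r2) is reachable. */
static unsigned outcome_bit(int r1, int r2) { return 1u << (r1 * 2 + r2); }

/* Sequential consistency: run every interleaving of the four operations
 * that keeps each processor's two operations in program order. */
unsigned sc_outcomes(void)
{
    unsigned seen = 0;
    /* Bit i of `schedule` picks who runs step i: 0 = P0, 1 = P1. */
    for (unsigned schedule = 0; schedule < 16; schedule++) {
        int p1_steps = 0;
        for (int i = 0; i < 4; i++)
            p1_steps += (int)((schedule >> i) & 1u);
        if (p1_steps != 2)
            continue;                  /* each processor runs exactly two ops */

        int x = 0, y = 0, r1 = 0, r2 = 0, pc0 = 0, pc1 = 0;
        for (int i = 0; i < 4; i++) {
            if ((schedule >> i) & 1u) {
                if (pc1++ == 0) y = 1; else r2 = x;   /* P1: y = 1; r2 = x */
            } else {
                if (pc0++ == 0) x = 1; else r1 = y;   /* P0: x = 1; r1 = y */
            }
        }
        seen |= outcome_bit(r1, r2);
    }
    return seen;
}

/* Store-buffer model: both stores wait in private buffers, so both loads
 * read memory before either store has drained. */
unsigned store_buffer_outcome(void)
{
    int x = 0, y = 0;        /* shared memory */
    int buffered_x = 1;      /* P0's store x = 1, still in P0's buffer */
    int buffered_y = 1;      /* P1's store y = 1, still in P1's buffer */
    int r1 = y;              /* P0's load reads memory: 0 */
    int r2 = x;              /* P1's load reads memory: 0 */
    x = buffered_x;          /* the buffers drain afterwards */
    y = buffered_y;
    (void)x; (void)y;
    return outcome_bit(r1, r2);
}

/* Usage: SC never produces r1 = r2 = 0, the store-buffer schedule does. */
void litmus_self_check(void)
{
    assert((sc_outcomes() & outcome_bit(0, 0)) == 0);
    assert(store_buffer_outcome() == outcome_bit(0, 0));
}
```

The enumeration covers all six legal interleavings, so the absence of (0, 0) under SC is exhaustive for this example rather than a sampled result.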

Sequential Consistency (SC)

Sequential consistency is the simplest and most intuitive memory model. It was defined by Leslie Lamport in 1979: “the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

In other words:

All memory operations appear to execute in some total order
Each processor’s operations appear in program order within that total order

SC is easy to reason about—programs behave as if they execute one instruction at a time, in order. But SC is also restrictive. It prohibits many optimizations:

Store buffers (stores must be visible immediately)
Out-of-order execution of loads
Speculative loads
Non-blocking caches

Modern processors don’t adopt SC as their architectural memory model because enforcing it everywhere costs too much performance.

Weak Memory Models

Weak memory models relax the ordering requirements to allow more optimization. They permit reordering of memory operations, with explicit synchronization instructions (like fences) to enforce ordering when needed.

The key insight is that most memory operations don’t need strict ordering. If a thread is computing on local data, it doesn’t matter if loads and stores are reordered—no other thread can observe the difference. Ordering only matters at synchronization points: when acquiring a lock, releasing a lock, or communicating between threads.

Weak memory models make these synchronization points explicit. The programmer (or compiler) inserts fence instructions or uses atomic operations with ordering semantics to enforce the necessary ordering. Between synchronization points, the hardware is free to reorder operations for performance.

RISC-V Weak Memory Ordering (RVWMO)

RISC-V uses a weak memory model called RVWMO (RISC-V Weak Memory Ordering). It’s similar to the memory models of ARM and Power, but with a formal specification that precisely defines what behaviors are allowed.

RVWMO allows extensive reordering:

Loads can be reordered with other loads
Stores can be reordered with other stores
Loads can be reordered with stores
Stores can be reordered with loads

Ordering is preserved only where the model requires it: between accesses to overlapping addresses, across dependencies, across fence instructions, and around atomics with acquire or release annotations.

This might sound chaotic, but in practice, most programs don’t need to worry about it. Single-threaded programs behave as expected (the processor preserves the illusion of sequential execution within a thread). Multithreaded programs use locks, atomics, and other synchronization primitives that include the necessary fences.

RVWMO strikes a balance: it’s weak enough to allow aggressive optimization, but strong enough to support efficient synchronization primitives.

6.2 RISC-V Memory Model (RVWMO)

Program Order vs Memory Order

To understand RVWMO, we need to distinguish two concepts:

Program order is the order in which instructions appear in the program. If instruction A appears before instruction B in the program, we say A precedes B in program order.

Memory order is the order in which memory operations become visible to other harts (RISC-V’s term for hardware threads). This is the order that other harts observe when they read from memory.

In a sequentially consistent system, memory order always respects each hart’s program order. In RVWMO, memory order can differ from program order due to reordering.

Load and Store Ordering Rules

RVWMO allows the following reorderings:

Each pair below is written earlier → later in program order.
Load → Load: A later load can be performed before an earlier load (unless they have a dependency or are separated by a fence)
Load → Store: A later store can become visible before an earlier load is performed (unless the store depends on the load, they overlap in address, or they are separated by a fence)
Store → Store: A later store can become visible before an earlier store (unless they overlap in address or are separated by a fence)
Store → Load: A later load can be performed before an earlier store becomes visible (unless they are separated by a fence); a load from the same address still returns the value of that store or a newer one

The key exceptions are:

Operations to overlapping addresses are not reordered
Operations separated by a FENCE instruction are not reordered
Operations with syntactic dependencies are not reordered

Preserved Program Order (PPO)

Not all program order is lost. RVWMO defines Preserved Program Order (PPO)—the subset of program order that is guaranteed to be respected in memory order.

PPO includes:

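Overlapping addresses: If two memory operations access overlapping addresses and the later one is a store, they are not reordered. For example, a load from address X followed by a store to address X will execute in order. A load that follows a store to X always observes that store (or a newer one), although the value may be forwarded from the store buffer before the store is globally visible.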
Explicit synchronization: Operations separated by a FENCE instruction maintain their order.
Acquire/Release: Atomic operations with .aq (acquire) or .rl (release) suffixes enforce ordering.
Syntactic dependencies: If a later instruction uses the result of an earlier instruction, they execute in order. For example:
```
ld a0, 0(a1)      # Load from address in a1
ld a2, 0(a0)      # Load from address in a0 (depends on first load)
```
The second load depends on the first, so they cannot be reordered.
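Control dependencies: If a later store is control-dependent on an earlier load (e.g., it follows a branch that tests the loaded value), the store is ordered after the load. A control dependency alone does not order a later load, which is why the consumer in a message-passing pattern still needs a fence.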

PPO is the foundation of RVWMO. It defines the minimum ordering that the hardware must respect. Everything else can be reordered.

Global Memory Order

RVWMO requires that there exists a global memory order—a total order of all memory operations across all harts that is consistent with each hart’s PPO.

This doesn’t mean operations actually execute in this order. It means that the observable effects must be consistent with some such order. The hardware can reorder, buffer, and cache operations as long as the final result looks like they executed in some valid global order.

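This property is called multi-copy atomicity. When a store becomes visible to any other hart, it becomes visible to all other harts at the same point in the global memory order. There’s no state where hart A sees another hart’s store but hart B doesn’t (assuming both are reading the same address). The one exception is the storing hart itself, which may read its own store early through store-buffer forwarding.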

Multi-copy atomicity simplifies reasoning about concurrent programs. It means you don’t have to worry about stores propagating at different rates to different harts.

Figure 6.1: Memory Ordering Example

sequenceDiagram
    participant H0 as Hart 0
    participant Mem as Memory
    participant H1 as Hart 1

    Note over H0,H1: Initially: x=0, y=0

    H0->>Mem: x = 1 (store)
    Note over H0: Store may be buffered
    H0->>Mem: r1 = y (load)
    Note over H0: Load can execute early

    H1->>Mem: y = 1 (store)
    Note over H1: Store may be buffered
    H1->>Mem: r2 = x (load)
    Note over H1: Load can execute early

    Note over H0,H1: Possible: r1=0, r2=0<br/>(both loads before both stores)
    Note over H0,H1: FENCE prevents this reordering

6.3 Memory Ordering Instructions

The FENCE Instruction

The FENCE instruction enforces ordering between memory operations. It prevents reordering of operations before the fence with operations after the fence.

The basic syntax is:

fence pred, succ

Where pred (predecessor) and succ (successor) specify which types of operations are ordered:

r: Reads (loads)
w: Writes (stores)
rw: Both reads and writes
i, o: Device input and device output (memory-mapped I/O); these combine with r and w, as in iorw

Common fence variants:

FENCE rw, rw (full fence):

fence rw, rw

Orders all memory operations before the fence with all memory operations after the fence. This is the strongest fence—it prevents all reordering across the fence.

FENCE w, w (store-store fence):

fence w, w

Orders stores before the fence with stores after the fence. Loads can still be reordered. This is useful for ensuring that a sequence of stores becomes visible in order.

FENCE r, r (load-load fence):

fence r, r

Orders loads before the fence with loads after the fence. Stores can still be reordered.

FENCE r, rw (acquire fence):

fence r, rw

Orders loads before the fence with all operations after the fence. This is used for acquire semantics (e.g., after acquiring a lock).

FENCE rw, w (release fence):

fence rw, w

Orders all operations before the fence with stores after the fence. This is used for release semantics (e.g., before releasing a lock).
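To make the pred and succ sets concrete, here is a small sketch that assembles the 32-bit FENCE instruction word from its operand strings. It follows the base ISA layout (fm in bits 31:28, pred in 27:24, succ in 23:20, opcode 0001111). The helper names `fence_set`, `encode_fence`, and `encode_fence_tso` are assumptions made for this illustration and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an operand such as "rw" or "iorw" into the 4-bit set used by
 * the FENCE encoding: bit 3 = I, bit 2 = O, bit 1 = R, bit 0 = W. */
unsigned fence_set(const char *s)
{
    unsigned bits = 0;
    for (; *s != '\0'; s++) {
        switch (*s) {
        case 'i': bits |= 8u; break;
        case 'o': bits |= 4u; break;
        case 'r': bits |= 2u; break;
        case 'w': bits |= 1u; break;
        default:  return 0;   /* unknown letter: treat as the empty set */
        }
    }
    return bits;
}

/* fence pred, succ: fm = 0000, rs1 = 0, funct3 = 000, rd = 0, opcode = 0x0F */
uint32_t encode_fence(const char *pred, const char *succ)
{
    return ((uint32_t)fence_set(pred) << 24) |
           ((uint32_t)fence_set(succ) << 20) | 0x0Fu;
}

/* fence.tso: same layout with fm = 1000 and pred = succ = rw */
uint32_t encode_fence_tso(void)
{
    return 0x80000000u | encode_fence("rw", "rw");
}

/* Usage: the variants discussed above. */
void fence_encoding_self_check(void)
{
    assert(encode_fence("rw", "rw") == 0x0330000Fu);   /* full fence    */
    assert(encode_fence("r", "rw")  == 0x0230000Fu);   /* acquire fence */
    assert(encode_fence("rw", "w")  == 0x0310000Fu);   /* release fence */
    assert(encode_fence_tso()       == 0x8330000Fu);
}
```

Seeing the two 4-bit fields side by side makes it clear that a fence is nothing more than a pair of sets: which earlier accesses must be ordered, and which later accesses they must be ordered before.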

Example: Message Passing

Consider a classic message-passing pattern:

# Producer (Hart 0)
    sw a0, 0(s0)      # Write data
    fence w, w        # Ensure data is written before flag
    sw a1, 0(s1)      # Write flag

# Consumer (Hart 1)
loop:
    lw t0, 0(s1)      # Read flag
    beqz t0, loop     # Wait for flag
    fence r, r        # Ensure flag is read before data
    lw t1, 0(s0)      # Read data

The producer writes data, then sets a flag. The consumer waits for the flag, then reads the data. The fences ensure that:

The data write happens before the flag write (producer)
The flag read happens before the data read (consumer)

Without the fences, the hardware could reorder the operations, and the consumer might read stale data.
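In C, the same pattern is usually written with C11 atomics instead of hand-placed fences. The sketch below is one way to express it; the `mailbox` type and its functions are hypothetical names for this example. On RISC-V, compilers typically lower the release store to a fence followed by a store, and the acquire load to a load followed by a fence, but the exact instruction sequence depends on the compiler version.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int        payload;   /* plain data, published through `ready` */
    atomic_int ready;     /* 0 = empty, 1 = payload is valid */
} mailbox;

void mailbox_init(mailbox *m)
{
    m->payload = 0;
    atomic_init(&m->ready, 0);
}

/* Producer: write the data, then set the flag with release ordering. */
void mailbox_publish(mailbox *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer: read the flag with acquire ordering, then read the data.
 * Returns false when nothing has been published yet. */
bool mailbox_try_read(mailbox *m, int *out)
{
    if (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
        return false;
    *out = m->payload;
    return true;
}

/* Usage (single-threaded walk-through of the protocol). */
void mailbox_self_check(void)
{
    mailbox m;
    int value = -1;
    mailbox_init(&m);
    assert(!mailbox_try_read(&m, &value));          /* flag not set yet */
    mailbox_publish(&m, 42);
    assert(mailbox_try_read(&m, &value) && value == 42);
}
```

A real consumer would call `mailbox_try_read` in a loop from another thread; the release/acquire pair is what guarantees that a consumer that sees the flag also sees the payload.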

FENCE.I: Instruction Fence

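FENCE.I synchronizes the instruction and data streams. It ensures that all previous stores to instruction memory are visible to subsequent instruction fetches on the same hart. It does not affect other harts: to run modified code elsewhere, each of those harts must execute its own FENCE.I (operating systems usually provide a service for this).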

This is needed for:

Self-modifying code: If a program writes new instructions to memory, FENCE.I ensures those instructions are fetched correctly.
JIT compilation: A JIT compiler generates code at runtime. After writing the code to memory, it executes FENCE.I before jumping to the new code.
Dynamic linking: Loading a shared library involves writing code to memory.

Example:

    # Write new instruction to memory
    sw a0, 0(s0)

    # Ensure instruction cache sees the new instruction
    fence.i

    # Jump to new code
    jalr s0

FENCE.I is relatively expensive—it may require flushing instruction caches. It should be used sparingly.
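From C, a JIT or loader normally reaches this through the compiler rather than inline assembly. The sketch below uses the GCC/Clang builtin `__builtin___clear_cache`; the wrapper name `publish_code` is an assumption for this example, and how the builtin is implemented (a local FENCE.I, a kernel call that also covers other harts, or nothing at all on some architectures) depends on the target and should be checked for your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly generated machine code into `dst`, then ask the compiler's
 * runtime to synchronize the instruction stream for that address range. */
void publish_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}

/* Usage: stage one RISC-V NOP (addi x0, x0, 0 = 0x00000013, little-endian).
 * The buffer is only filled here, not executed; running it would also
 * require executable memory, which is outside the scope of this sketch. */
void publish_code_self_check(void)
{
    static const unsigned char nop[4] = { 0x13, 0x00, 0x00, 0x00 };
    unsigned char buffer[4] = { 0 };
    publish_code(buffer, nop, sizeof nop);
    assert(memcmp(buffer, nop, sizeof nop) == 0);
}
```

Calling the builtin once per batch of generated code, rather than after every instruction written, keeps the cost of the synchronization low.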

FENCE.TSO: Total Store Ordering

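FENCE.TSO provides x86-like memory ordering. It orders all earlier loads before all later loads and stores, and all earlier stores before all later stores. Unlike FENCE rw, rw, it does not order earlier stores before later loads, which can make it cheaper on some microarchitectures.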

TSO (Total Store Ordering) is the memory model used by x86 processors. It’s stronger than RVWMO:

Loads are not reordered with loads
Stores are not reordered with stores
Stores are not reordered with earlier loads
Loads can be reordered with earlier stores to different addresses (the only relaxation)

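FENCE.TSO is useful for porting x86 code to RISC-V. Instead of analyzing the code to determine which fences are needed, you can insert FENCE.TSO between the memory operations whose order matters and get x86-like behavior. Places where the x86 code relies on MFENCE or a LOCK-prefixed instruction to order a store before a later load still need FENCE rw, rw.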
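One way to keep the fence variants straight is to write down, as a predicate, which pairs of accesses each fence orders. The sketch below does that for ordinary `fence pred, succ` and for FENCE.TSO, following the definitions in the RISC-V unprivileged specification. The type and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_LOAD, OP_STORE } op_kind;
enum { SET_W = 1, SET_R = 2, SET_RW = 3 };   /* same bit layout as the encoding */

static bool in_set(unsigned set, op_kind kind)
{
    return (set & (kind == OP_LOAD ? SET_R : SET_W)) != 0;
}

/* fence pred, succ orders `before` ahead of `after` only when
 * `before` is in the predecessor set and `after` is in the successor set. */
bool fence_orders(unsigned pred, unsigned succ, op_kind before, op_kind after)
{
    return in_set(pred, before) && in_set(succ, after);
}

/* fence.tso orders every pair except an earlier store and a later load. */
bool fence_tso_orders(op_kind before, op_kind after)
{
    return !(before == OP_STORE && after == OP_LOAD);
}

/* Usage: the store-to-load case is what separates the two fences. */
void fence_model_self_check(void)
{
    assert(fence_orders(SET_RW, SET_RW, OP_STORE, OP_LOAD));
    assert(!fence_tso_orders(OP_STORE, OP_LOAD));
    assert(fence_tso_orders(OP_LOAD, OP_STORE));
    assert(!fence_orders(SET_W, SET_W, OP_LOAD, OP_LOAD));
}
```

The store-to-load pair is exactly the reordering that store buffers introduce, so a fence that leaves it unordered can let the store buffer keep draining in the background.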

Acquire and Release Semantics

Atomic operations (from the A extension) can have .aq (acquire) or .rl (release) suffixes that enforce ordering:

Acquire (.aq): No memory operations after the atomic can be reordered before it. This is used when acquiring a lock—you want to ensure that accesses to protected data happen after the lock is acquired.
Release (.rl): No memory operations before the atomic can be reordered after it. This is used when releasing a lock—you want to ensure that accesses to protected data happen before the lock is released.

Example:

# Acquire lock
acquire_loop:
    lr.w.aq t0, 0(a0)     # Load-reserved with acquire
    bnez t0, acquire_loop # Wait if locked
    li t1, 1
    sc.w t1, t1, 0(a0)    # Store-conditional
    bnez t1, acquire_loop # Retry if failed

    # Critical section - access protected data
    lw t2, 0(s0)
    addi t2, t2, 1
    sw t2, 0(s0)

# Release lock
    fence rw, w           # Release fence: critical section completes first
    sw zero, 0(a0)        # Clear lock

The .aq on the load-reserved ensures that loads/stores in the critical section don’t move before the lock acquisition. The release fence ensures they don’t move after the lock release.
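The same acquire/release pairing can be written portably with C11 atomics. This is a minimal sketch, assuming a lock word where 0 means free and 1 means held; the `c11_spinlock` type and function names are made up for this example. The acquire exchange plays the role of the acquiring atomic, and the release store plays the role of the release fence plus the store that clears the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } c11_spinlock;   /* 0 = free, 1 = held */

void spin_init(c11_spinlock *l) { atomic_init(&l->locked, 0); }

/* One attempt: swap in 1 and succeed only if the old value was 0. */
bool spin_trylock(c11_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

void spin_lock(c11_spinlock *l)
{
    while (!spin_trylock(l)) {
        /* spin until the holder releases the lock */
    }
}

/* Release ordering keeps critical-section accesses before the unlock. */
void spin_unlock(c11_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Usage (single-threaded walk-through). */
void spinlock_self_check(void)
{
    c11_spinlock l;
    spin_init(&l);
    assert(spin_trylock(&l));     /* free, so the first attempt succeeds */
    assert(!spin_trylock(&l));    /* already held */
    spin_unlock(&l);
    spin_lock(&l);                /* free again, returns immediately */
    assert(!spin_trylock(&l));
    spin_unlock(&l);
}
```

Note that the unlock comes after the critical section in program order and uses release ordering on the store itself, so there is no window in which the lock is visibly free while protected writes are still pending.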

6.4 Atomic Operations (A Extension)

The A Extension

The A (Atomic) extension provides atomic memory operations for synchronization. It includes:

Load-Reserved / Store-Conditional (LR/SC)
Atomic Memory Operations (AMO)

These operations are essential for implementing locks, semaphores, and lock-free data structures.

Load-Reserved / Store-Conditional (LR/SC)

LR/SC is a pair of instructions that together implement atomic read-modify-write:

LR.W / LR.D (Load-Reserved):

lr.w rd, (rs1)      # Load word from address in rs1
lr.d rd, (rs1)      # Load doubleword from address in rs1

LR loads a value from memory and establishes a reservation on that memory location. The reservation tracks whether any other hart has written to that location.

SC.W / SC.D (Store-Conditional):

sc.w rd, rs2, (rs1)  # Store word to address in rs1
sc.d rd, rs2, (rs1)  # Store doubleword to address in rs1

SC attempts to store a value to memory. It succeeds only if the reservation is still valid (no other hart has written to the location). SC writes 0 to rd on success, or a non-zero value on failure.

LR/SC Example: Atomic Increment

atomic_increment:
    lr.w t0, 0(a0)        # Load current value
    addi t0, t0, 1        # Increment
    sc.w t1, t0, 0(a0)    # Try to store
    bnez t1, atomic_increment  # Retry if failed

This implements an atomic increment. If another hart modifies the location between the LR and SC, the SC fails, and we retry.

Reservation Set

The reservation is not necessarily on a single address. The hardware maintains a reservation set—a set of bytes that includes the address loaded by LR. The reservation is invalidated if any hart writes to any byte in the reservation set.

The reservation set is implementation-defined, but it must include at least the bytes loaded by LR. It might be as small as a single word, or as large as a cache line.

This means SC can fail spuriously—even if no other hart wrote to the exact address, SC might fail if another hart wrote to a nearby address in the same reservation set. Code using LR/SC must handle spurious failures by retrying.
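A toy simulation makes the reservation set tangible. The sketch below is a single-threaded model, not real hardware: it assumes a hypothetical reservation granule of four words, and the `sim_` functions stand in for LR, SC, and a store issued by another hart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 16u
#define GRANULE    4u   /* assumed reservation set size: 4 words (16 bytes) */

static uint32_t mem[MEM_WORDS];

typedef struct { bool valid; unsigned granule; } reservation;

/* lr.w: load a word and reserve the granule that contains it. */
uint32_t sim_lr(reservation *r, unsigned idx)
{
    r->valid = true;
    r->granule = idx / GRANULE;
    return mem[idx];
}

/* A store from another hart: kills a reservation on the same granule. */
void sim_remote_store(reservation *r, unsigned idx, uint32_t value)
{
    mem[idx] = value;
    if (r->valid && r->granule == idx / GRANULE)
        r->valid = false;
}

/* sc.w: returns 0 on success and 1 on failure, like the value written to rd.
 * The reservation is consumed either way. */
uint32_t sim_sc(reservation *r, unsigned idx, uint32_t value)
{
    bool ok = r->valid && r->granule == idx / GRANULE;
    r->valid = false;
    if (ok)
        mem[idx] = value;
    return ok ? 0u : 1u;
}

/* Usage: a write to a neighbouring word makes SC fail; retrying succeeds. */
void reservation_self_check(void)
{
    reservation r = { false, 0 };
    mem[0] = 5;
    uint32_t old = sim_lr(&r, 0);
    sim_remote_store(&r, 1, 99);            /* different word, same granule */
    assert(sim_sc(&r, 0, old + 1) == 1);    /* spurious failure */
    assert(mem[0] == 5);
    old = sim_lr(&r, 0);                    /* retry */
    assert(sim_sc(&r, 0, old + 1) == 0);
    assert(mem[0] == 6);
}
```

The failure in the first attempt is spurious from the program's point of view, since nobody touched word 0, which is why every LR/SC sequence is written as a retry loop.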

LR/SC Guidelines

For LR/SC to work correctly:

Minimal code between LR and SC: The reservation can be broken by interrupts, context switches, or other harts’ stores. Keep the critical section small.
No other memory operations: Some implementations invalidate the reservation if the hart performs any other memory operation between LR and SC. Avoid loads or stores between LR and SC.
Always retry on failure: SC can fail spuriously. Always check the result and retry.
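Forward progress: For short, constrained LR/SC loops (in the unprivileged specification: at most 16 consecutive base-ISA instructions, with no loads, stores, or other disallowed instructions between the LR and the SC), implementations must guarantee that the loop eventually succeeds. This prevents livelock.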

Atomic Memory Operations (AMO)

AMOs are single instructions that atomically read, modify, and write memory. They’re simpler than LR/SC for common operations like increment, swap, or bitwise operations.

The A extension defines these AMOs:

AMOSWAP: Atomic swap

amoswap.w rd, rs2, (rs1)   # Atomically: rd = mem[rs1]; mem[rs1] = rs2
amoswap.d rd, rs2, (rs1)

AMOADD: Atomic add

amoadd.w rd, rs2, (rs1)    # Atomically: rd = mem[rs1]; mem[rs1] += rs2
amoadd.d rd, rs2, (rs1)

AMOAND, AMOOR, AMOXOR: Atomic bitwise operations

amoand.w rd, rs2, (rs1)    # Atomically: rd = mem[rs1]; mem[rs1] &= rs2
amoor.w rd, rs2, (rs1)     # Atomically: rd = mem[rs1]; mem[rs1] |= rs2
amoxor.w rd, rs2, (rs1)    # Atomically: rd = mem[rs1]; mem[rs1] ^= rs2

AMOMIN, AMOMAX: Atomic min/max (signed)

amomin.w rd, rs2, (rs1)    # Atomically: rd = mem[rs1]; mem[rs1] = min(mem[rs1], rs2)
amomax.w rd, rs2, (rs1)    # Atomically: rd = mem[rs1]; mem[rs1] = max(mem[rs1], rs2)

AMOMINU, AMOMAXU: Atomic min/max (unsigned)

amominu.w rd, rs2, (rs1)   # Unsigned min
amomaxu.w rd, rs2, (rs1)   # Unsigned max

All AMOs can have .aq and/or .rl suffixes for acquire/release semantics.

AMO Example: Atomic Increment

Using AMO, atomic increment needs just one atomic instruction (plus loading the constant 1):

li t0, 1                   # Increment amount
amoadd.w zero, t0, 0(a0)   # Atomically add t0 to mem[a0]

This is simpler and more efficient than the LR/SC version. The old value is discarded (written to zero).
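In C, these operations are reached through the C11 `atomic_fetch_*` family. The sketch below shows a few of them with relaxed ordering; the wrapper names are assumptions for this example. On targets with the A extension, compilers typically lower these calls to the matching AMO (for example amoadd.w or amoor.w), though that is a code-generation detail rather than a guarantee of the language.

```c
#include <assert.h>
#include <stdatomic.h>

/* Each helper returns the value that was in memory before the update,
 * just as an AMO returns the old value in rd. */
int counter_increment(atomic_int *c)
{
    return atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

unsigned flags_set(atomic_uint *f, unsigned mask)
{
    return atomic_fetch_or_explicit(f, mask, memory_order_relaxed);
}

unsigned flags_clear(atomic_uint *f, unsigned mask)
{
    return atomic_fetch_and_explicit(f, ~mask, memory_order_relaxed);
}

int value_swap(atomic_int *c, int new_value)
{
    return atomic_exchange_explicit(c, new_value, memory_order_relaxed);
}

/* Usage */
void amo_self_check(void)
{
    atomic_int counter;
    atomic_uint flags;
    atomic_init(&counter, 0);
    atomic_init(&flags, 0u);

    assert(counter_increment(&counter) == 0);    /* old value 0, now 1 */
    assert(value_swap(&counter, 10) == 1);       /* old value 1, now 10 */
    assert(flags_set(&flags, 0x5u) == 0x0u);     /* now 0x5 */
    assert(flags_clear(&flags, 0x1u) == 0x5u);   /* now 0x4 */
    assert(atomic_load(&flags) == 0x4u);
}
```

Relaxed ordering is enough for a statistics counter; when the atomic also guards other data, the acquire and release orderings correspond to the .aq and .rl suffixes.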

AMO vs LR/SC

When should you use AMO vs LR/SC?

Use AMO when:

The operation matches one of the AMO instructions
You need a simple, single-instruction atomic operation
Performance is critical (AMOs are typically faster than LR/SC)

Use LR/SC when:

The operation is complex (e.g., conditional update)
You need to read the old value and make a decision
The operation doesn’t match any AMO

Example: In the base A extension, atomic compare-and-swap (CAS) is built from LR/SC:

cas:
    lr.w t0, 0(a0)         # Load current value
    bne t0, a1, cas_fail   # Compare with expected value
    sc.w t1, a2, 0(a0)     # Store new value
    bnez t1, cas           # Retry if failed
    li a0, 1               # Success
    ret
cas_fail:
    li a0, 0               # Failure
    ret

This can’t be done with a single AMO from the base A extension because it requires a conditional check. (The later Zacas extension adds a dedicated AMOCAS instruction.)
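Conditional updates follow the same shape in C: load, decide, attempt, retry. The sketch below implements a bounded counter with `atomic_compare_exchange_weak_explicit`, which is allowed to fail spuriously just like SC. The function name `increment_if_below` is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *ctr only while it is below `limit`.
 * Returns true if this call performed the increment. */
bool increment_if_below(atomic_int *ctr, int limit)
{
    int current = atomic_load_explicit(ctr, memory_order_relaxed);
    while (current < limit) {
        /* On failure (including spurious failure) `current` is refreshed
         * with the value now in memory, and the condition is re-checked. */
        if (atomic_compare_exchange_weak_explicit(ctr, &current, current + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}

/* Usage: a counter limited to 2. */
void bounded_counter_self_check(void)
{
    atomic_int slots;
    atomic_init(&slots, 0);
    assert(increment_if_below(&slots, 2));
    assert(increment_if_below(&slots, 2));
    assert(!increment_if_below(&slots, 2));   /* limit reached */
    assert(atomic_load(&slots) == 2);
}
```

The weak variant is the natural fit inside a loop, since the loop already handles retries; the strong variant is meant for single-shot attempts.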

Figure 6.2a: LR/SC Pattern

graph TB
    LR[LR: Load + Reserve]
    COMPUTE[Compute new value]
    SC[SC: Store if reservation valid]
    CHECK{Success?}
    RETRY[Retry]
    DONE[Done]

    LR --> COMPUTE
    COMPUTE --> SC
    SC --> CHECK
    CHECK -->|No| RETRY
    RETRY --> LR
    CHECK -->|Yes| DONE

    style LR fill:#FFB6C1
    style SC fill:#87CEEB
    style DONE fill:#90EE90

Figure 6.2b: AMO Pattern

graph TB
    AMO[AMO: Atomic operation<br/>Single instruction]
    DONE[Done]

    AMO --> DONE

    style AMO fill:#90EE90
    style DONE fill:#87CEEB

6.5 Comparison with ARM and x86

ARM Memory Model

ARM uses a weak memory model similar to RISC-V. The ARMv8 architecture defines:

Relaxed ordering: Like RVWMO, ARM allows extensive reordering
DMB (Data Memory Barrier): Similar to RISC-V FENCE
DSB (Data Synchronization Barrier): Stronger than DMB, waits for operations to complete
ISB (Instruction Synchronization Barrier): Similar to RISC-V FENCE.I

ARM atomic operations:

LDXR/STXR: Load-Exclusive / Store-Exclusive (similar to LR/SC)
Atomic operations: LDADD, LDCLR, LDEOR, etc. (similar to AMOs)

The main difference is that ARM has more fence variants (DMB with different domains and types), while RISC-V has a simpler fence model.

x86 Memory Model

x86 uses a much stronger memory model called TSO (Total Store Ordering):

Loads are not reordered with loads
Stores are not reordered with stores
Stores are not reordered with earlier loads
Loads can be reordered with earlier stores to different addresses (the only relaxation)

This is much closer to sequential consistency than RVWMO. Most x86 code doesn’t need explicit fences because the hardware provides strong ordering.

x86 atomic operations:

LOCK prefix: Makes an instruction atomic (e.g., LOCK ADD)
CMPXCHG: Compare-and-swap
XCHG: Atomic exchange (implicitly locked)

Porting x86 code to RISC-V requires adding fences. The FENCE.TSO instruction helps by providing x86-like ordering.

Performance Implications

The choice of memory model affects performance:

Weak models (RISC-V, ARM):

Allow aggressive reordering and optimization
Better performance for well-synchronized code
Require careful use of fences
More complex for programmers

Strong models (x86 TSO):

Simpler for programmers
Less optimization opportunity
Implicit ordering has performance cost
Easier to port code

RISC-V’s weak model is a deliberate choice to maximize performance. The cost is that programmers must understand memory ordering and use synchronization correctly.

Figure 6.3: Memory Model Comparison

graph LR
    %% Memory Model Spectrum (Left = Strong, Right = Weak)

    SC[Sequential Consistency<br/>Strict ordering<br/>Simplest, slowest]
    TSO[x86 TSO<br/>Store-Load reordering<br/>Strong model]
    RVWMO[RISC-V RVWMO<br/>Extensive reordering<br/>Needs fences]
    ARM[ARM Weak Order<br/>Similar to RVWMO<br/>More fence types]

    %% Left → Right = Weaker memory ordering
    SC -->|Weaker| TSO
    TSO -->|Weaker| RVWMO
    RVWMO ---|Comparable| ARM

    %% Style the nodes like horizontal bars
    style SC fill:#FFB6C1,stroke:#FF4500,stroke-width:2px
    style TSO fill:#87CEEB,stroke:#1E90FF,stroke-width:2px
    style RVWMO fill:#90EE90,stroke:#3CB371,stroke-width:2px
    style ARM fill:#FFD700,stroke:#FFA500,stroke-width:2px

6.6 Programming with Weak Memory

Best Practices

Programming correctly with weak memory ordering requires discipline:

Use high-level synchronization: Prefer mutexes, semaphores, and atomic types from your language’s standard library. These handle memory ordering correctly.
Understand data races: A data race occurs when two harts access the same memory location without synchronization, and at least one access is a write. In C and C++, data races are undefined behavior.
Use acquire/release: For custom synchronization, use acquire semantics when reading a synchronization variable and release semantics when writing it.
Minimize critical sections: Keep the code between lock acquisition and release as short as possible.
Test on real hardware: Memory ordering bugs may not appear in simulation or on strongly-ordered processors. Test on actual RISC-V hardware.

Common Patterns

Spinlock:

acquire:
    li t0, 1
acquire_loop:
    amoswap.w.aq t1, t0, 0(a0)  # Atomic swap with acquire
    bnez t1, acquire_loop        # Retry if already locked
    # Critical section

release:
    amoswap.w.rl zero, zero, 0(a0)  # Atomic swap with release

Message passing:

# Producer
    sw a0, 0(s0)       # Write data
    fence rw, w        # Release fence
    sw a1, 0(s1)       # Write flag

# Consumer
    lw t0, 0(s1)       # Read flag
    fence r, rw        # Acquire fence
    lw t1, 0(s0)       # Read data

Dekker’s algorithm (mutual exclusion without atomic operations):

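# Hart 0 (entry and exit only; full Dekker also uses a turn variable)
    la t2, flag0
    la t3, flag1
    li t0, 1
    sw t0, 0(t2)       # flag0 = 1
    fence w, rw        # Order the flag0 store before the flag1 load
    lw t1, 0(t3)       # Read flag1
    bnez t1, wait      # If flag1 set, back off (wait path not shown)
    fence r, rw        # Acquire: critical section stays after the flag1 read
    # Critical section
    fence rw, w        # Release: critical section completes before unlock
    sw zero, 0(t2)     # flag0 = 0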

These patterns rely on careful placement of fences to ensure correct ordering.

Debugging Memory Ordering Issues

Memory ordering bugs are notoriously difficult to debug:

They may occur rarely and non-deterministically
They may not appear on some hardware
They may disappear when debugging code is added

Strategies:

Use memory model checking tools (e.g., herd7, rmem)
Add assertions to check invariants
Use thread sanitizers (e.g., ThreadSanitizer)
Test under high contention
Review synchronization code carefully

The RISC-V memory model is formally specified, which allows using formal verification tools to prove correctness.

🛠️ Hands-on Lab: Lab 6.1 — The Bathroom Battle (Spinlock)

This lab guides you through implementing a Spinlock using the amoswap instruction to protect shared variables from concurrent access by multiple cores.

Lab Objectives

Understand why naive read-modify-write operations cause Race Conditions
Use amoswap (Atomic Memory Operation Swap) to implement a Spinlock
Understand acquire/release semantics

Concept Explanation

Why Atomic Operations?

Consider this “naive” lock implementation:

// ❌ Wrong lock implementation
void lock_acquire(int *lock) {
    while (*lock == 1) {}  // (1) Read: check if lock is held
    *lock = 1;              // (2) Write: acquire lock
}

Problem: Steps (1) and (2) are not atomic!

Time →
Core 0: read lock=0 ──────────────────────── write lock=1
Core 1: ────────────────── read lock=0 ───── write lock=1
         ↑ Both cores think they acquired the lock!

The Role of amoswap

amoswap combines “read old value” and “write new value” into one atomic operation:

# amoswap.w.aq rd, rs2, (rs1)
# Atomically executes:
#   temp = memory[rs1]
#   memory[rs1] = rs2
#   rd = temp
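The difference between the two approaches can be replayed deterministically. The sketch below is a single-threaded simulation, not concurrent code: it steps two imaginary cores through the bad schedule from the timeline above, first with separate read and write steps and then with a one-step swap. All names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Naive lock: check and set are separate steps, so Core 1 can read the
 * lock after Core 0 has checked it but before Core 0 has written it. */
int naive_lock_holders(void)
{
    int lock = 0;
    int holders = 0;
    bool core0_saw_free = (lock == 0);              /* Core 0: (1) read  */
    bool core1_saw_free = (lock == 0);              /* Core 1: (1) read  */
    if (core0_saw_free) { lock = 1; holders++; }    /* Core 0: (2) write */
    if (core1_saw_free) { lock = 1; holders++; }    /* Core 1: (2) write */
    return holders;
}

/* Models amoswap in this simulation: read-old and write-new form one step,
 * so no other core's step can fall between them. */
static int swap_step(int *addr, int new_value)
{
    int old = *addr;
    *addr = new_value;
    return old;
}

int swap_lock_holders(void)
{
    int lock = 0;
    int holders = 0;
    if (swap_step(&lock, 1) == 0) holders++;   /* Core 0 gets old value 0 */
    if (swap_step(&lock, 1) == 0) holders++;   /* Core 1 gets old value 1 */
    return holders;
}

/* Usage */
void lock_race_self_check(void)
{
    assert(naive_lock_holders() == 2);   /* both cores enter: broken */
    assert(swap_lock_holders() == 1);    /* only one core enters */
}
```

Whichever core performs its swap first observes 0 and wins; every later swap observes 1, so there is no schedule in which two cores both see the lock as free.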

Code

Create lab6_spinlock.S:

# lab6_spinlock.S - Spinlock using amoswap
.section .text
.global spinlock_acquire
.global spinlock_release

# void spinlock_acquire(int *lock)
# a0 = address of lock
spinlock_acquire:
    li t0, 1                    # t0 = 1 (LOCKED state)
spin:
    # amoswap.w.aq: Atomic swap with acquire semantics
    # Atomically: old value → t1, new value 1 → memory[a0]
    amoswap.w.aq t1, t0, (a0)

    # Check if old value was 0 (UNLOCKED)
    bnez t1, spin               # If old value wasn't 0, keep spinning

    ret                         # Successfully acquired lock!

# void spinlock_release(int *lock)
# a0 = address of lock
spinlock_release:
    # amoswap.w.rl: Atomic swap with release semantics
    # Write 0 (UNLOCKED) to lock
    li t0, 0
    amoswap.w.rl zero, t0, (a0)  # Discard result (write to zero)

    ret

C Driver Program main.c:

#include <stdio.h>

extern void spinlock_acquire(int *lock);
extern void spinlock_release(int *lock);

int shared_counter = 0;
int lock = 0;

void increment_safely(void) {
    spinlock_acquire(&lock);

    // Critical Section: protected region
    shared_counter++;

    spinlock_release(&lock);
}

int main() {
    // Single-threaded smoke test: two back-to-back calls, no real contention
    increment_safely();
    increment_safely();
    printf("Counter: %d\n", shared_counter);
    return 0;
}

Compile and Run

# Compile
riscv64-unknown-elf-gcc -o lab6_spinlock main.c lab6_spinlock.S

# Run
qemu-riscv64 lab6_spinlock

Expected Output:

Counter: 2
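The driver above runs on a single thread, so it confirms that the lock can be taken and released but never exercises contention. A contended test needs real threads. The following is a sketch under stated assumptions: it uses POSIX threads and a C11-atomics version of the lock rather than the assembly routines, so it needs a hosted toolchain with pthread support (for example a Linux-targeted RISC-V GCC, or any desktop compiler), built with -pthread. The names `run_contended`, `lock_word`, and `shared_total` are made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 8

static atomic_int lock_word;   /* 0 = UNLOCKED, 1 = LOCKED */
static long shared_total;      /* protected by lock_word */

static void lock_acquire(void)
{
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) != 0) {
        /* spin */
    }
}

static void lock_release(void)
{
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void *worker(void *arg)
{
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++) {
        lock_acquire();
        shared_total++;            /* critical section */
        lock_release();
    }
    return NULL;
}

/* Start `nthreads` workers (capped at MAX_THREADS), wait for them,
 * and return the final counter value. */
long run_contended(int nthreads, long iterations)
{
    pthread_t tid[MAX_THREADS];
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    shared_total = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return shared_total;
}
```

With a correct lock, `run_contended(2, 20000)` returns exactly 40000. Replacing the lock with the naive read-then-write version from the Concept Explanation would typically lose updates under the same load, although a data race gives no guarantee about when the failure shows up.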

What You Just Did

You’ve implemented a correct spinlock:

Atomicity: amoswap ensures read-and-write happens as one indivisible operation
Acquire Semantics: .aq ensures subsequent operations in the critical section don’t move before the lock acquisition
Release Semantics: .rl ensures previous operations in the critical section don’t move after the lock release

danieRTOS Reference: The scheduler in danieRTOS uses similar spinlocks to protect the task queue during context switches.

⚠️ Common Pitfalls

Pitfall 1: Confusing `volatile` with Memory Barriers

Error Scenario: Thinking volatile solves multi-core synchronization.

// ❌ Wrong: volatile only prevents compiler optimization, doesn't affect CPU reordering
volatile int flag = 0;
data = 42;
flag = 1;  // CPU may still execute this first!

// ✅ Correct: Need a memory barrier
data = 42;
__sync_synchronize();  // Or: asm volatile("fence rw, rw" ::: "memory")
flag = 1;

Pitfall 2: Using Regular load/store in Spinlock

Error Scenario: Not using atomic instructions, allowing two cores into the critical section.

// ❌ Wrong: Non-atomic operations
void bad_lock(int *lock) {
    while (*lock) {}  // Read
    *lock = 1;        // Write — can be interrupted between these!
}

// ✅ Correct: Use atomic operations
void good_lock(int *lock) {
    while (__sync_lock_test_and_set(lock, 1)) {}  // Compiles to amoswap
}

Pitfall 3: Forgetting acquire/release Semantics

Error Scenario: Using atomic operations but without correct ordering.

# ❌ Wrong: No .aq, critical section reads may be moved earlier
amoswap.w t1, t0, (a0)      # No .aq

# ❌ Wrong: No .rl, critical section writes may be delayed
amoswap.w zero, t0, (a0)    # No .rl

# ✅ Correct: acquire for lock, release for unlock
amoswap.w.aq t1, t0, (a0)   # acquire
amoswap.w.rl zero, t0, (a0) # release

Summary

Memory ordering is one of the most subtle and challenging aspects of concurrent programming. Modern processors reorder memory accesses for performance, creating behaviors that can seem impossible from a sequential perspective. RISC-V’s memory model defines which reorderings are legal and provides synchronization primitives to enforce ordering when needed.

The RISC-V Weak Memory Ordering (RVWMO) model allows aggressive reordering: accesses to different addresses can be reordered in every combination (load-load, load-store, store-store, and store-load). Only the orderings in preserved program order are guaranteed by default. This weak model enables high performance but requires explicit synchronization. The model is formally specified using a global memory order and preserved program order, allowing formal verification of concurrent algorithms.

Fence instructions enforce memory ordering. FENCE with predecessor and successor sets (r, w, rw) creates ordering between memory operations. FENCE RW, RW is a full barrier preventing all reordering. FENCE W, W orders stores (for publish). FENCE R, RW orders loads before subsequent accesses (for acquire). FENCE RW, W orders all accesses before stores (for release). FENCE.I synchronizes instruction and data caches after code modification.

Atomic instructions provide indivisible read-modify-write operations essential for synchronization. Load-Reserved/Store-Conditional (LR/SC) implements lock-free algorithms: LR loads and reserves, SC stores only if the reservation is still valid. Atomic Memory Operations (AMO) like AMOSWAP, AMOADD, and AMOAND perform atomic operations directly. Acquire and release annotations (.aq, .rl) provide ordering without separate fences.

The Total Store Ordering (RVTSO) extension provides stronger ordering compatible with x86’s TSO model. RVTSO preserves all orderings except store-load, making it easier to port x86 code but potentially reducing performance. Most RISC-V implementations use RVWMO for better performance, but RVTSO is available for compatibility.

Common synchronization patterns include spinlocks (using AMOSWAP or LR/SC), message passing (a release fence in the producer paired with an acquire fence in the consumer), and flag-based mutual exclusion in the style of Dekker’s algorithm. Each pattern requires careful fence placement to ensure correctness under RVWMO.

Compared to ARM and x86, RISC-V’s memory model is similar to ARM’s (both are weakly ordered) but simpler and more formally specified. x86’s TSO model is stronger, preserving more orderings by default, which simplifies programming but may reduce performance. RISC-V’s RVTSO extension provides x86 compatibility when needed.

Memory ordering bugs are notoriously difficult to debug—they occur rarely, non-deterministically, and may disappear when debugging code is added. Strategies include using memory model checking tools (herd7, rmem), thread sanitizers (ThreadSanitizer), formal verification, and careful code review. RISC-V’s formal memory model specification enables rigorous verification of concurrent algorithms.

Chapter 7. RISC-V Pipeline Fundamentals

Part V — Pipeline & Microarchitecture

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand the Pipeline Concept: Know why pipelining improves throughput
Master the 5-Stage Pipeline: Be familiar with the functions of IF, ID, EX, MEM, WB stages
Identify Hazard Types: Distinguish between Structural, Data, and Control Hazards
Understand Solutions: Grasp the principles of Stalling, Forwarding, and Branch Prediction
Analyze Pipeline Performance: Calculate factors affecting CPI (Cycles Per Instruction)

💡 Scenario: The Wisdom of the Factory Assembly Line

Scene: Junior visits Architect’s semiconductor factory, curious about how the production line works.

Junior: “Architect, I’ve always had a question. Textbooks say a CPU executes one instruction per cycle, but I see each instruction goes through five steps—fetch, decode, execute, memory access, writeback. How can it possibly complete in one cycle?”

Architect: “Great question. Come, let me show you the factory floor.”

(They walk to the production line)

Architect: “Look at this assembly line. Each station does only one thing:

Station 1: Get parts (Fetch)
Station 2: Check specifications (Decode)
Station 3: Assemble (Execute)
Station 4: Quality inspection (Memory)
Station 5: Package (Writeback)

Each product goes through five stations to complete. If each station takes 1 minute, one product takes 5 minutes, right?”

Junior: “Right.”

Architect: “But look—how many products are being processed simultaneously on the line right now?”

Junior: “Five! There’s one at each station.”

Architect: “Exactly. Although each product takes 5 minutes to complete, the line outputs one finished product every minute. This is the power of Pipeline—Throughput is one instruction per cycle, even though Latency for a single instruction is five cycles.”

Junior: “I see! What if one station gets stuck?”

Architect: “That’s a Hazard. Imagine the screwdriver at station 3 breaks—

Structural Hazard: Not enough tools—two products fighting for the same screwdriver.
Data Hazard: Station 3 needs a part from station 5, but that product isn’t finished yet.
Control Hazard: A phone call says ‘Stop! Switch to a different model!’ and all half-finished products are scrapped.”

Junior: “How do you solve these?”

Architect: “Three tricks:

Stall: Stop the line and wait, but this reduces throughput.
Forwarding: Station 3’s result doesn’t wait for station 5 to package—pass it directly from the side.
Prediction: Guess what model the boss wants. If right, keep going. If wrong, tear it down and redo.”

Junior: “Got it! Let’s see how the CPU handles these situations.”

The pipeline is the heart of modern processor design. It’s the mechanism that allows a processor to work on multiple instructions simultaneously, dramatically improving throughput. In this chapter, we’ll explore how RISC-V processors implement pipelining, from the classic five-stage pipeline to advanced techniques for handling hazards and branches.

Understanding pipelines is crucial for anyone working with RISC-V, whether you’re designing hardware, writing compilers, or optimizing performance-critical code. The beauty of RISC-V’s design is that its clean, regular instruction set makes it particularly well-suited for efficient pipeline implementation. We’ll examine the classic five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback), the three types of hazards that disrupt pipeline flow (structural, data, control), and techniques for handling them (forwarding, stalling, branch prediction). We’ll also explore how pipeline depth affects performance and complexity.

7.1 Classic Five-Stage Pipeline

The classic five-stage pipeline is the foundation of most RISC processor designs. It divides instruction execution into five distinct stages, allowing up to five instructions to be in flight simultaneously. Let’s walk through each stage.

Figure 7.1: Five-Stage Pipeline Overview

graph LR
    IF[IF<br/>Instruction<br/>Fetch] --> ID[ID<br/>Instruction<br/>Decode]
    ID --> EX[EX<br/>Execute]
    EX --> MEM[MEM<br/>Memory<br/>Access]
    MEM --> WB[WB<br/>Write<br/>Back]

    style IF fill:#e1f5ff
    style ID fill:#fff4e1
    style EX fill:#ffe1e1
    style MEM fill:#e1ffe1
    style WB fill:#f0e1ff

Figure 7.2: Pipeline Timing Diagram

Cycle:  1    2    3    4    5    6    7    8    9
I1:     IF   ID   EX   MEM  WB
I2:          IF   ID   EX   MEM  WB
I3:               IF   ID   EX   MEM  WB
I4:                    IF   ID   EX   MEM  WB
I5:                         IF   ID   EX   MEM  WB

In steady state, all five stages are busy with different instructions, achieving a throughput of one instruction per cycle (IPC = 1).
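The timing in Figure 7.2 follows a simple rule: with no stalls, instruction i (counting from 1) is in stage s (counting from 0) during cycle i + s, so n instructions finish in 5 + n - 1 cycles. A minimal Python sketch of that rule is shown below; the function names are illustrative and not taken from any tool.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def ideal_schedule(num_instructions):
    """Map each instruction number (1-based) to {stage: cycle}, assuming no stalls."""
    return {
        i: {stage: i + s for s, stage in enumerate(STAGES)}
        for i in range(1, num_instructions + 1)
    }

def total_cycles(num_instructions):
    """Cycles until the last instruction leaves WB: fill time plus one per extra instruction."""
    return len(STAGES) + num_instructions - 1

schedule = ideal_schedule(5)
for i, stages in schedule.items():
    print(f"I{i}:", stages)
print("total cycles:", total_cycles(5))
```

For the five instructions of Figure 7.2 this reports that I5 writes back in cycle 9, which matches the diagram.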

Instruction Fetch (IF)

The first stage fetches the next instruction from memory. The program counter (PC) points to the address of the instruction to fetch. The instruction is read from the instruction cache (I-cache) or main memory if there’s a cache miss.

In RISC-V, standard instructions are either 16-bit (compressed, with the C extension) or 32-bit. The fetch unit must handle both formats when the C extension is present; in a simple implementation without it, every instruction is 32 bits wide and aligned on a 4-byte boundary.

IF Stage:
  instruction = I-cache[PC]
  next_PC = PC + 4  // or PC + 2 for compressed instructions

Fetch bandwidth is critical for performance. A processor that can fetch multiple instructions per cycle (superscalar) needs wider fetch paths and more complex PC prediction logic.

Instruction Decode (ID)

The second stage decodes the instruction and reads operands from the register file. The decoder examines the opcode and function fields to determine what operation to perform and which registers to read.

RISC-V’s regular instruction format makes decoding straightforward. All instructions have the opcode in bits [6:0], and register specifiers are always in the same positions:

rs1 (source register 1): bits [19:15]
rs2 (source register 2): bits [24:20]
rd (destination register): bits [11:7]

ID Stage:
  opcode = instruction[6:0]
  rs1_data = register_file[instruction[19:15]]
  rs2_data = register_file[instruction[24:20]]
  rd_addr = instruction[11:7]
  immediate = decode_immediate(instruction)

Immediate generation is also part of this stage. RISC-V has several immediate formats (I-type, S-type, B-type, U-type, J-type), and the decoder must extract and sign-extend the immediate value correctly.
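To make the field positions and immediate formats concrete, here is a small Python sketch of a decoder for 32-bit instructions. The helper names (bits, sign_extend, decode_fields, decode_immediate) are illustrative; the bit positions follow the standard RISC-V base instruction formats.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] (inclusive bit range) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit

def decode_fields(inst):
    """Fields that sit at fixed positions in the 32-bit formats."""
    return {
        "opcode": bits(inst, 6, 0),
        "rd": bits(inst, 11, 7),
        "rs1": bits(inst, 19, 15),
        "rs2": bits(inst, 24, 20),
    }

def decode_immediate(inst, fmt):
    """Reassemble and sign-extend the immediate for one of the five formats."""
    if fmt == "I":
        return sign_extend(bits(inst, 31, 20), 12)
    if fmt == "S":
        return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)
    if fmt == "B":  # bit 0 is implicitly zero
        imm = ((bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11)
               | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1))
        return sign_extend(imm, 13)
    if fmt == "U":
        return sign_extend(bits(inst, 31, 12) << 12, 32)
    if fmt == "J":  # bit 0 is implicitly zero
        imm = ((bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12)
               | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1))
        return sign_extend(imm, 21)
    raise ValueError(f"unknown format: {fmt}")

print(decode_fields(0x00052083))          # lw x1, 0(x10)
print(decode_immediate(0xFFF00093, "I"))  # addi x1, x0, -1
print(decode_immediate(0x00552623, "S"))  # sw x5, 12(x10)
print(decode_immediate(0xFE208EE3, "B"))  # beq x1, x2, -4
```

Note that the sign bit of every format is instruction bit 31, which lets hardware start sign extension before it knows the format.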

Execute (EX)

The third stage performs the actual computation. This is where the ALU (Arithmetic Logic Unit) operates on the source operands to produce a result.

For arithmetic instructions like ADD, SUB, AND, the ALU performs the operation. For load/store instructions, the ALU calculates the memory address by adding the base register and offset. For branches, the ALU evaluates the branch condition.

EX Stage:
  case opcode:
    ADD:  result = rs1_data + rs2_data
    SUB:  result = rs1_data - rs2_data
    LOAD: address = rs1_data + immediate
    BEQ:  taken = (rs1_data == rs2_data)

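Branch condition evaluation happens here. If the outcome differs from what the fetch stage assumed (for example, the branch is taken while fetch continued sequentially), the wrong-path instructions already in the pipeline must be flushed (more on this in Section 7.4).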

Memory Access (MEM)

The fourth stage accesses data memory for load and store instructions. For loads, data is read from the data cache (D-cache). For stores, data is written to the cache.

MEM Stage:
  if LOAD:
    load_data = D-cache[address]
  if STORE:
    D-cache[address] = rs2_data

Cache hit or miss is determined here. A cache miss can stall the pipeline for many cycles while data is fetched from main memory.

For non-memory instructions, this stage does nothing (or passes through the result from the EX stage).

Write Back (WB)

The fifth and final stage writes the result back to the register file. This is the commit point where the instruction’s effects become architecturally visible.

WB Stage:
  if rd != x0:  // x0 is hardwired to zero
    register_file[rd] = result

RISC-V’s x0 register is always zero, so writes to x0 are discarded. This is checked in hardware to avoid unnecessary register file writes.
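The x0 behaviour can be modelled in a few lines. The sketch below is a software model only, assuming a 32-bit (RV32) register width; the class name RegisterFile is illustrative.

```python
class RegisterFile:
    """32 integer registers; x0 always reads as zero and ignores writes."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, index):
        return 0 if index == 0 else self._regs[index]

    def write(self, index, value):
        if index != 0:                               # writes to x0 are discarded
            self._regs[index] = value & 0xFFFFFFFF   # RV32 width is an assumption

rf = RegisterFile()
rf.write(0, 123)   # no effect
rf.write(5, 7)
print(rf.read(0), rf.read(5))
```

This is also why the canonical NOP, addi x0, x0, 0, has no architectural effect: its result is written to x0 and dropped.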

Pipeline Example: Executing a Simple Program

Let’s trace a simple RISC-V program through the pipeline:

# Example: Calculate sum = a + b + c
    lw   x1, 0(x10)    # I1: Load a from memory
    lw   x2, 4(x10)    # I2: Load b from memory
    lw   x3, 8(x10)    # I3: Load c from memory
    add  x4, x1, x2    # I4: x4 = a + b
    add  x5, x4, x3    # I5: x5 = (a + b) + c
    sw   x5, 12(x10)   # I6: Store sum to memory

Cycle-by-cycle execution (assuming no cache misses):

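Cycle:  1    2    3    4    5    6    7    8    9    10
I1:     IF   ID   EX   MEM  WB
I2:          IF   ID   EX   MEM  WB
I3:               IF   ID   EX   MEM  WB
I4:                    IF   ID   EX   MEM  WB
I5:                         IF   ID   EX   MEM  WB
I6:                              IF   ID   EX   MEM  WB

In this ideal case, 6 instructions complete in 10 cycles (5 cycles for the first instruction, then one more cycle for each of the remaining 5). After the pipeline fills (first 5 cycles), we achieve 1 instruction per cycle. The diagram assumes forwarding (Section 7.3): each add receives its operands from the pipeline registers instead of waiting for the register file to be updated.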

7.2 Pipeline Hazards

Pipelining would be perfect if instructions were completely independent. Unfortunately, they’re not. Hazards are situations where the next instruction cannot execute in the next clock cycle. There are three types of hazards.

Structural Hazards

A structural hazard occurs when two instructions need the same hardware resource at the same time. For example, if the instruction fetch and memory access stages both need to access memory in the same cycle, there’s a conflict.

In a simple RISC-V implementation with a single memory port, you can’t fetch an instruction and perform a load/store simultaneously. The solution is either to stall one operation or to use separate instruction and data caches (Harvard architecture).

Register file port conflicts are another example. If the register file has only one write port, you can’t write back two results in the same cycle. Most RISC-V implementations avoid this by having enough ports or by carefully scheduling operations.

Data Hazards

Data hazards occur when an instruction depends on the result of a previous instruction that hasn’t completed yet. There are three types:

RAW (Read After Write) — The most common hazard. An instruction tries to read a register before a previous instruction writes it:

add  x1, x2, x3   # x1 = x2 + x3
sub  x4, x1, x5   # x4 = x1 - x5  (needs x1 from previous instruction)

The sub instruction needs the value of x1, but the add instruction hasn’t written it yet. This is a true dependency and must be handled carefully.

Figure 7.3: RAW Data Hazard

Cycle:           1    2    3    4    5    6    7    8
add x1,x2,x3:    IF   ID   EX   MEM  WB
sub x4,x1,x5:         IF   ID   --   --   EX   MEM  WB
                                └─ stall ─┘

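Without forwarding, the sub must wait in ID until add writes x1 in cycle 5, a two-cycle stall. This assumes the register file is written in the first half of a cycle and read in the second half, so sub can read the new x1 in cycle 5 itself.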

WAR (Write After Read) — An instruction writes a register before a previous instruction reads it. This is an anti-dependency:

add  x1, x2, x3   # reads x2
sub  x2, x4, x5   # writes x2

In an in-order pipeline, WAR hazards don’t occur because instructions complete in order. But in out-of-order processors (Chapter 8), they can happen.

WAW (Write After Write) — Two instructions write the same register. This is an output dependency:

add  x1, x2, x3   # writes x1
sub  x1, x4, x5   # writes x1

Again, this is mainly a concern for out-of-order processors.
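The three dependency types can be identified mechanically from the register numbers of two instructions. The following Python sketch classifies a pair; the Instr tuple and the classify function are illustrative names, and registers are given as plain integers (1 for x1, and so on).

```python
from collections import namedtuple

# op: mnemonic, rd: destination register (None if the instruction writes none),
# srcs: tuple of source registers
Instr = namedtuple("Instr", "op rd srcs")

def classify(first, second):
    """Return the dependencies of `second` on the earlier instruction `first`."""
    found = set()
    first_writes = first.rd not in (None, 0)    # x0 never carries a dependency
    second_writes = second.rd not in (None, 0)
    if first_writes and first.rd in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second_writes and second.rd in first.srcs:
        found.add("WAR")   # second overwrites what first reads
    if first_writes and second_writes and first.rd == second.rd:
        found.add("WAW")   # both write the same register
    return found

add = Instr("add", 1, (2, 3))                  # add x1, x2, x3
print(classify(add, Instr("sub", 4, (1, 5))))  # sub x4, x1, x5
print(classify(add, Instr("sub", 2, (4, 5))))  # sub x2, x4, x5
print(classify(add, Instr("sub", 1, (4, 5))))  # sub x1, x4, x5
```

The three calls reproduce the RAW, WAR, and WAW examples above, in that order.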

Control Hazards

Control hazards occur when the pipeline doesn’t know which instruction to fetch next. This happens with branches and jumps.

Consider a conditional branch:

beq  x1, x2, target   # if x1 == x2, jump to target
add  x3, x4, x5       # next instruction if not taken
...
target:
  sub  x6, x7, x8     # target instruction if taken

The pipeline doesn’t know whether to fetch the add or the sub until the branch condition is evaluated in the EX stage. By that time, the pipeline has already fetched the next two sequential instructions speculatively.

Branch misprediction causes pipeline bubbles (wasted cycles) because the speculatively fetched instructions must be discarded.

Figure 7.6: Branch Misprediction

Cycle:           1    2    3    4    5    6
beq (taken):     IF   ID   EX   MEM  WB
Wrong Path I1:        IF   ID   XX
Wrong Path I2:             IF   XX
Correct Path:                   IF   ID   EX
                      └─ 2 cycles wasted ─┘

When the branch is resolved in cycle 3 and found to be mispredicted, the two wrong-path instructions fetched in cycles 2 and 3 are squashed and the correct path is fetched in cycle 4, wasting 2 cycles.

7.3 Hazard Resolution

Processors use several techniques to handle hazards without stalling the pipeline too much.

Forwarding (Bypassing)

Forwarding (also called bypassing) allows a result to be used before it’s written back to the register file. This is the most important technique for reducing data hazard stalls.

Consider our earlier example:

add  x1, x2, x3   # x1 = x2 + x3 (result available at end of EX stage)
sub  x4, x1, x5   # x4 = x1 - x5 (needs x1 in EX stage)

Without forwarding, the sub would have to wait in ID until the add writes x1 in the WB stage (cycle 5 in Figure 7.3, a two-cycle stall). With forwarding, the result from the add instruction’s EX stage can be forwarded directly to the sub instruction’s EX stage.

Forwarding paths are data paths that bypass the register file:

EX-to-EX forwarding: Result from EX stage to EX stage (1 cycle later)
MEM-to-EX forwarding: Result from MEM stage to EX stage (2 cycles later)
WB-to-EX forwarding: Result from WB stage to EX stage (3 cycles later, but this is just normal register file read)

Figure 7.4: Forwarding Paths

graph TB
    subgraph Pipeline Stages
        IF[IF Stage]
        ID[ID Stage<br/>Register Read]
        EX[EX Stage<br/>ALU]
        MEM[MEM Stage<br/>Data Cache]
        WB[WB Stage<br/>Register Write]
    end

    IF --> ID
    ID --> EX
    EX --> MEM
    MEM --> WB

    EX -.->|EX-to-EX<br/>Forwarding| EX
    MEM -.->|MEM-to-EX<br/>Forwarding| EX
    WB -.->|Normal<br/>Register Read| ID

    style EX fill:#ffe1e1
    style MEM fill:#e1ffe1
    style WB fill:#f0e1ff

Forwarding Logic (simplified):

// Forwarding unit logic
if (EX_MEM.RegWrite && (EX_MEM.rd != 0) && (EX_MEM.rd == ID_EX.rs1))
    ForwardA = 01;  // Forward from EX/MEM pipeline register
else if (MEM_WB.RegWrite && (MEM_WB.rd != 0) && (MEM_WB.rd == ID_EX.rs1))
    ForwardA = 10;  // Forward from MEM/WB pipeline register
else
    ForwardA = 00;  // No forwarding, use register file

// Similar logic for rs2 (ForwardB)

Example with forwarding:

add  x1, x2, x3   # I1: x1 = x2 + x3 (result ready at end of EX)
sub  x4, x1, x5   # I2: x4 = x1 - x5 (needs x1 at start of EX)

Cycle:  1    2    3    4    5    6
I1:     IF   ID   EX   MEM  WB
I2:          IF   ID   EX   MEM  WB
                       ^
                       |
                Forward from I1's EX stage

With forwarding, I2 can execute immediately after I1, with no stall!
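The priority order in the forwarding logic matters when two older instructions both write the register being read: the younger producer (in EX/MEM) must win. The Python sketch below models the ForwardA selection for one source register; PipeReg and forward_select are illustrative names, and the return values mirror the 00/01/10 encoding used above.

```python
from dataclasses import dataclass

@dataclass
class PipeReg:
    """The two fields of a pipeline register that the forwarding unit inspects."""
    reg_write: bool
    rd: int

NO_FORWARD, FROM_EX_MEM, FROM_MEM_WB = 0b00, 0b01, 0b10

def forward_select(rs, ex_mem, mem_wb):
    """Choose the source for operand register `rs` of the instruction in EX."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == rs:
        return FROM_EX_MEM      # most recent producer takes priority
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == rs:
        return FROM_MEM_WB
    return NO_FORWARD           # value in the register file is current

# add x1, x2, x3 ; add x1, x1, x4 ; sub x5, x1, x6  (sub is now in EX)
older = PipeReg(reg_write=True, rd=1)   # first add, now in MEM/WB
newer = PipeReg(reg_write=True, rd=1)   # second add, now in EX/MEM
print(forward_select(1, newer, older))  # picks the EX/MEM value
```

Checking EX/MEM first is what guarantees that sub sees the result of the second add instead of the stale value from the first.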

Forwarding doesn’t solve all data hazards. The classic example is a load followed immediately by a use:

lw   x1, 0(x2)    # load x1 from memory
add  x3, x1, x4   # use x1 immediately

The load data isn’t available until the end of the MEM stage, but the add needs it at the beginning of the EX stage. Even with forwarding, a one-cycle stall is required.

Figure 7.5: Load-Use Hazard

Cycle:           1    2    3    4    5    6    7
lw x1,0(x2):     IF   ID   EX   MEM  WB
add x3,x1,x4:         IF   ID   --   EX   MEM  WB
                                └─ stall ─┘

The add is held in ID during cycle 4 because the load data isn’t available until the end of cycle 4. In cycle 5 the data is forwarded from the MEM/WB pipeline register to the add’s EX stage. Even with forwarding, we need one bubble.

Pipeline Stalls

When forwarding isn’t enough, the pipeline must stall (insert bubbles). A stall freezes earlier pipeline stages while later stages continue.

For the load-use hazard above, the pipeline inserts a one-cycle stall:

Cycle:  1    2    3    4      5    6    7
lw      IF   ID   EX   MEM    WB
add          IF   ID   stall  EX   MEM  WB

The add instruction’s ID stage is held for an extra cycle, creating a bubble in the EX stage.

Stall detection logic monitors the pipeline for hazards:

// Hazard detection unit
bool load_use_hazard = (ID_EX.MemRead) &&
                       (ID_EX.rd != 0) &&   // a load to x0 never creates a dependency
                       ((ID_EX.rd == IF_ID.rs1) ||
                        (ID_EX.rd == IF_ID.rs2));

if (load_use_hazard) {
    // Stall the pipeline
    PC_write = 0;        // Don't update PC
    IF_ID_write = 0;     // Don't update IF/ID register
    Control_signals = 0; // Insert bubble (nop) in EX stage
}

Performance impact: Each stall reduces IPC (Instructions Per Cycle). Compilers try to schedule instructions to avoid load-use hazards when possible.

Compiler scheduling example:

# Original code (has load-use hazard):
lw   x1, 0(x2)
add  x3, x1, x4    # Stall! (depends on x1)
sub  x5, x6, x7

# Compiler-scheduled code (no hazard):
lw   x1, 0(x2)
sub  x5, x6, x7    # Independent instruction fills the slot
add  x3, x1, x4    # No stall now (x1 is ready)

By reordering independent instructions, the compiler can hide load latency and avoid stalls.
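The effect of the reordering can be checked with a small script. This is a sketch assuming the five-stage pipeline with full forwarding described above, where the only remaining data stall is a load followed immediately by a consumer of its result; the tuple layout (op, rd, srcs) and the function name are illustrative.

```python
def count_load_use_stalls(program):
    """Count one-cycle bubbles: a load directly followed by a reader of its result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_rd, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_rd != 0 and prev_rd in cur_srcs:
            stalls += 1
    return stalls

original = [
    ("lw", 1, (2,)),      # lw  x1, 0(x2)
    ("add", 3, (1, 4)),   # add x3, x1, x4
    ("sub", 5, (6, 7)),   # sub x5, x6, x7
]
scheduled = [original[0], original[2], original[1]]

print(count_load_use_stalls(original))   # one bubble
print(count_load_use_stalls(scheduled))  # no bubbles
```

A real compiler scheduler must also prove that the moved instruction does not depend on, or get clobbered by, the instructions it passes; here sub touches only x5, x6, and x7, so the swap is safe.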

Compiler Scheduling

Compilers can reorder instructions to avoid hazards without changing program semantics. This is called instruction scheduling.

The load-use example in the previous section shows the idea: the compiler moves the independent sub between the lw and the dependent add, so the add no longer stalls.

Loop unrolling and software pipelining are advanced compiler techniques that expose more instruction-level parallelism and reduce hazards.

7.4 Branch Handling

Branches are the bane of pipelining. Every branch is a potential control hazard that can disrupt the smooth flow of instructions through the pipeline.

Branch Prediction Basics

Branch prediction tries to guess whether a branch will be taken or not-taken before the condition is evaluated. The pipeline speculatively fetches and executes instructions based on this prediction. If the prediction is correct, there’s no penalty. If it’s wrong, the pipeline must be flushed and restarted from the correct path.

Misprediction penalty is the number of cycles wasted when a branch is mispredicted. In a five-stage pipeline, if the branch is resolved in the EX stage (cycle 3), the misprediction penalty is 2 cycles (the IF and ID stages of the wrong-path instructions must be discarded).

Branch prediction accuracy is critical for performance. Modern processors achieve 95-99% accuracy on typical workloads. Even a 5% misprediction rate can significantly impact performance if branches are frequent (every 5-10 instructions in typical code).
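A rough way to quantify this is CPI = base CPI + (fraction of instructions that are branches) × (misprediction rate) × (penalty in cycles). The sketch below evaluates that formula; the 15-cycle penalty for a deeper pipeline is a hypothetical value chosen for illustration, not a figure for any specific core.

```python
def cpi_with_branches(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Average CPI when the only stalls are branch mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# One branch every 5 instructions, 5% of them mispredicted
five_stage = cpi_with_branches(0.20, 0.05, penalty_cycles=2)
deep = cpi_with_branches(0.20, 0.05, penalty_cycles=15)   # hypothetical deep pipeline

print(f"five-stage: CPI = {five_stage:.2f}")
print(f"deep:       CPI = {deep:.2f}")
```

With a 2-cycle penalty the slowdown is about 2%, but the same misprediction rate costs about 15% when the penalty is 15 cycles, which is why deeper pipelines invest so heavily in prediction accuracy.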

Static Branch Prediction

Static prediction uses fixed rules that don’t change during execution. The simplest strategies are:

Always not-taken: Assume all branches are not taken. This works well for forward branches (like if statements that skip over error handling code).

Always taken: Assume all branches are taken. This works well for backward branches (like loop back-edges).

BTFNT (Backward Taken, Forward Not-Taken): A hybrid strategy that predicts backward branches as taken and forward branches as not-taken. This is surprisingly effective because loops (backward branches) are usually taken, and forward branches (error checks, early exits) are usually not taken.

Static Prediction:
  if (branch_target < PC):  // backward branch
    predict_taken()
  else:                      // forward branch
    predict_not_taken()

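Profile-guided prediction: The compiler can use profiling data to find which way each branch usually goes. RISC-V branch instructions have no static hint bits, so the compiler acts on this through code layout: the hot path is placed on the fall-through (sequential) side and rarely executed code is moved out of line, which matches the static rules above.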

Dynamic Branch Prediction

Dynamic prediction learns from past branch behavior and adapts during execution. This is much more accurate than static prediction.

Branch History Table (BHT): A table indexed by the branch PC (or a hash of it) that stores prediction information. Each entry contains a two-bit saturating counter:

00: Strongly not-taken
01: Weakly not-taken
10: Weakly taken
11: Strongly taken

When a branch is taken, the counter increments (saturates at 11). When not taken, it decrements (saturates at 00). The prediction is “taken” if the counter is 10 or 11.

Why two bits? A single bit mispredicts twice each time a loop runs: once on the final iteration (predict taken, but the loop exits, so the bit flips to not-taken) and once more on the first iteration of the next run (predict not-taken, but the loop is taken, so the bit flips back). Two bits provide hysteresis: a single misprediction doesn’t immediately flip the prediction.

Figure 7.7: Two-Bit Saturating Counter State Machine

stateDiagram-v2
    [*] --> SNT
    SNT: 00<br/>Strongly<br/>Not-Taken
    WNT: 01<br/>Weakly<br/>Not-Taken
    WT: 10<br/>Weakly<br/>Taken
    ST: 11<br/>Strongly<br/>Taken

    SNT --> SNT: Not Taken
    SNT --> WNT: Taken
    WNT --> SNT: Not Taken
    WNT --> WT: Taken
    WT --> WNT: Not Taken
    WT --> ST: Taken
    ST --> WT: Not Taken
    ST --> ST: Taken

Example: Loop prediction

for (int i = 0; i < 100; i++) {
    // Loop body
}

The loop back-edge branch is taken 99 times and not-taken once (exit). With a 2-bit counter:

After the first few taken iterations, the counter saturates at 11 (strongly taken)
From then on it predicts “taken” for every remaining back-edge (correct)
Iteration 100: not-taken (misprediction), counter drops to 10
Next time the loop runs: counter 10 still predicts “taken”, the first iteration is taken (correct), and the counter returns to 11
Result: In steady state, only 1 misprediction per 100 iterations (99% accuracy)
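The difference between one-bit and two-bit prediction can be checked by simulating the loop branch. The sketch below runs the 100-iteration loop ten times in a row; the starting states (one-bit predictor predicting taken, two-bit counter at 11) are assumptions made so that both predictors start warmed up.

```python
def run_one_bit(outcomes, predict_taken=True):
    """One-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken
    return misses

def run_two_bit(outcomes, counter=3):
    """Two-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

one_loop = [True] * 99 + [False]   # back-edge taken 99 times, then loop exit
trace = one_loop * 10              # the loop is executed ten times

print("one-bit misses:", run_one_bit(trace))
print("two-bit misses:", run_two_bit(trace))
```

Over the 1000 branches in this trace, the one-bit predictor misses 19 times (once on the first exit, then twice per loop run) while the two-bit counter misses 10 times (once per loop exit).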

Local vs global history:

Local history: Each branch has its own history (pattern of taken/not-taken).
Global history: All branches share a global history register that tracks the last N branch outcomes.

Global history can capture correlations between branches (e.g., if branch A is taken, branch B is likely taken too).

Branch Target Buffer (BTB)

The BTB is a cache that stores the target addresses of recently executed branches. When a branch is predicted taken, the BTB provides the target address so the fetch unit can immediately fetch from the correct location.

Without a BTB, even if a branch is correctly predicted as taken, the pipeline must wait until the branch target is calculated in the EX stage. The BTB eliminates this delay.

BTB Lookup:
  if (PC in BTB) and (predict_taken):
    next_PC = BTB[PC].target
  else:
    next_PC = PC + 4

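Return Address Stack (RAS): Function returns (ret in RISC-V, which is jalr x0, 0(x1)) are a special case, because the same return instruction jumps to a different address for each caller, while a BTB entry only remembers the last target. The return address is pushed onto a hardware stack when a function is called (jal or jalr that writes the link register x1/ra), and popped when returning. This provides near-perfect prediction for function returns.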
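The push/pop behaviour of a return address stack can be sketched as follows. The depth of 8 entries, the overwrite-oldest policy on overflow, and the method names are assumptions made for illustration; real cores differ in these details.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):            # depth is a hypothetical value
        self.depth = depth
        self.entries = []

    def on_call(self, call_pc, instr_len=4):
        """A call at call_pc will return to the instruction after it."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)              # overflow: drop the oldest entry
        self.entries.append(call_pc + instr_len)

    def predict_return(self):
        """Predicted target for a ret, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack()
ras.on_call(0x1000)   # main calls f
ras.on_call(0x2000)   # f calls g
first = ras.predict_return()    # g returns into f
second = ras.predict_return()   # f returns into main
print(hex(first), hex(second))
```

Because calls and returns nest, the most recently pushed address is the right prediction for the next return, which a last-target BTB entry cannot provide when a function has several callers.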

RISC-V: No Branch Delay Slots

RISC-V does not have branch delay slots, unlike MIPS. In MIPS, the instruction immediately after a branch is always executed, regardless of whether the branch is taken. This is called a delay slot.

MIPS example:

beq  $t0, $t1, target
add  $t2, $t3, $t4      # delay slot: always executed

The add instruction executes even if the branch is taken. Compilers must fill the delay slot with a useful instruction or a nop.

RISC-V eliminated delay slots for several reasons:

Simpler pipeline control: No need to track delay slot instructions.
Cleaner ISA: The semantics are more intuitive.
Better for superscalar: Delay slots complicate multi-issue pipelines.
Compiler complexity: Filling delay slots is tricky and doesn’t always help.

This is a significant improvement over MIPS and makes RISC-V easier to implement and optimize.

7.5 Trap and Interrupt Handling in Pipeline

Traps and interrupts (covered in Chapter 4) have a significant impact on the pipeline. They require precise exception handling and pipeline flushing.

Precise Exceptions

A precise exception means the architectural state is consistent when the exception is taken. All instructions before the faulting instruction have completed, and no instructions after it have modified architectural state.

This requires in-order commit: even if instructions execute out-of-order (Chapter 8), they must commit (update architectural state) in program order.

For a five-stage in-order pipeline, precise exceptions are natural: instructions complete in order. But the pipeline must ensure that:

All instructions before the exception have written back.
The faulting instruction and all later instructions have not modified state.

Pipeline Flush on Trap

When a trap occurs, the pipeline must be flushed. All instructions in the pipeline that are younger than the trap are discarded (squashed).

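Cycle:     1    2    3         4       5
I1:        IF   ID   EX(trap)  squash
I2:             IF   ID        squash
I3:                  IF        squash
Handler:                       IF      ID

I1 raises the trap in EX; for a synchronous exception it does not commit, and the younger instructions I2 and I3 are squashed. (Had the trap been detected one stage later, in MEM, a third younger instruction would also be in flight and squashed.) The PC is redirected to the trap handler (the address in xtvec, that is, mtvec or stvec depending on which privilege mode takes the trap), and execution resumes there.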

Squashing means:

Clear pipeline registers (set to nop or invalid).
Prevent any writes to architectural state (register file, memory, CSRs).
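Discard speculative results produced by the squashed instructions. (Microarchitectural side effects such as branch predictor updates and cache fills are usually left in place, since they don’t affect architectural state.)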

Performance Cost

Traps are expensive. A trap in a five-stage pipeline wastes 3-4 cycles (the instructions in the pipeline that must be squashed). In deeper pipelines or out-of-order processors, the cost is even higher.

This is why:

Exception-free code is faster: Avoid page faults, misaligned accesses, illegal instructions.
Interrupts should be infrequent: High interrupt rates can severely degrade performance.
Trap handlers should be fast: The sooner you return from a trap, the sooner useful work resumes.

7.6 Simple In-Order Implementations

Let’s look at how real RISC-V processors implement pipelines.

Single-Issue vs Multi-Issue

Single-issue processors execute one instruction per cycle (at most). The classic five-stage pipeline is single-issue.

Multi-issue (superscalar) processors can execute multiple instructions per cycle. For example, a 2-issue processor can fetch, decode, and execute 2 instructions simultaneously.

Issue width is the maximum number of instructions that can be issued per cycle. Wider issue requires:

Multiple fetch ports (or wider fetch)
Multiple decode units
Multiple execution units (ALUs, load/store units)
More register file ports
More complex hazard detection and forwarding logic

Scalar vs Superscalar

Scalar means single-issue, one instruction at a time.

Superscalar means multi-issue, exploiting instruction-level parallelism (ILP) by executing independent instructions in parallel.

Superscalar processors are more complex but can achieve higher IPC (Instructions Per Cycle). A 4-issue superscalar can theoretically execute 4 instructions per cycle, achieving IPC = 4 (though in practice, IPC is usually 1.5-2.5 due to hazards and dependencies).

RISC-V Implementation Examples

Rocket Core: An open-source, in-order, single-issue RISC-V core developed at UC Berkeley. It has a classic five-stage pipeline and is used in many academic and commercial projects. Rocket is simple, efficient, and easy to understand.

BOOM (Berkeley Out-of-Order Machine): An open-source, out-of-order, superscalar RISC-V core (also from UC Berkeley). BOOM is much more complex than Rocket but achieves higher performance. We’ll cover out-of-order execution in Chapter 8.

SiFive Cores:

E-series (e.g., E20, E21): Small, low-power, in-order cores for embedded systems.
U-series (e.g., U54, U74): Higher-performance, in-order cores with MMU for running Linux.
P-series (e.g., P550, P670): High-performance, out-of-order cores for demanding applications.

Performance Characteristics

CPI (Cycles Per Instruction): The average number of cycles needed to execute one instruction. For an ideal five-stage pipeline with no hazards, CPI = 1. In practice, hazards increase CPI to 1.2-1.5 for in-order cores.

IPC (Instructions Per Cycle): The inverse of CPI. IPC = 1/CPI. Higher IPC means better performance.

Pipeline depth trade-offs:

Deeper pipelines (more stages) allow higher clock frequencies because each stage does less work. But they increase branch misprediction penalties and make hazard handling more complex.
Shallow pipelines (fewer stages) have lower misprediction penalties and simpler control, but lower maximum frequency.

Modern processors balance these trade-offs. RISC-V cores range from 3-stage pipelines (simple embedded cores) to 10+ stage pipelines (high-performance cores).

🛠️ Hands-on Lab: Lab 7.1 — Pipeline Bubble Analysis

This lab is a pencil-and-paper exercise that guides you through analyzing Pipeline Hazards in real assembly code and drawing Pipeline Diagrams.

Lab Objectives

Identify Data Hazards in code
Draw Pipeline Timing Diagrams
Calculate Stall Cycles and actual CPI
Understand how Forwarding reduces Stalls

Analysis Example

Consider the following RISC-V assembly:

    lw   x1, 0(x2)      # I1: Load x1 from memory
    add  x3, x1, x4     # I2: x3 = x1 + x4 (depends on I1's result)
    sub  x5, x3, x6     # I3: x5 = x3 - x6 (depends on I2's result)
    and  x7, x5, x8     # I4: x7 = x5 & x8 (depends on I3's result)

Exercise 1: Pipeline Without Forwarding

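Assume no Forwarding: a result must be written back in WB before a later instruction can read it in ID. As in Figure 7.3, the register file is written in the first half of a cycle and read in the second half, so a dependent instruction can leave ID in the same cycle its producer is in WB.

Draw the Pipeline Diagram:

Cycle:    1    2    3    4    5    6    7    8    9    10   11   12   13   14
I1 (lw):  IF   ID   EX   MEM  WB
I2 (add):      IF   ID   --   --   EX   MEM  WB
                         ↑ stall 2 cycles (waiting for x1 to be written in cycle 5)
I3 (sub):           IF   --   --   ID   --   --   EX   MEM  WB
I4 (and):                          IF   --   --   ID   --   --   EX   MEM  WB

Calculation:

Ideal case: 5 cycles to fill the pipeline + 3 more instructions = 8 cycles
Actual case: 14 cycles, because each of the three dependent instructions stalls for 2 cycles (6 stall cycles in total)
Real CPI > 1: over this short sequence, 14 cycles / 4 instructions = 3.5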

Exercise 2: Pipeline With Forwarding

With Forwarding, results from EX or MEM stages can be “forwarded” directly to the next instruction.

Think About:

When is lw’s result earliest available? (Hint: end of MEM stage)
When does add need x1’s value? (Hint: start of EX stage)
Even with Forwarding, does lw followed by add still need a stall?

Answer:

lw’s result is available at the end of the MEM stage (when the memory read completes)
add needs x1’s value at the start of its EX stage
Yes, a 1-cycle stall is still needed. lw produces its result at the end of MEM, but without a stall add would already be in EX during that same cycle. This is the Load-Use Hazard

Cycle:    1    2    3    4    5    6    7    8    9
I1 (lw):  IF   ID   EX   MEM  WB
I2 (add):      IF   ID   --   EX   MEM  WB
                         ↑ 1 cycle stall (Load-Use Hazard)
I3 (sub):           IF   --   ID   EX   MEM  WB
                                   ↑ forwarding from I2.EX → I3.EX
I4 (and):                     IF   ID   EX   MEM  WB
                                        ↑ forwarding from I3.EX → I4.EX
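Both exercises can be cross-checked with a short timing model. The sketch below computes the cycle in which each instruction is in EX for an in-order five-stage pipeline. It assumes the register file is written in the first half of a cycle and read in the second half (the convention of Figure 7.3), so without forwarding a consumer reaches EX three cycles after its producer; with forwarding the gap is one cycle for an ALU result and two for a load. The function names and the (op, rd, srcs) tuple layout are illustrative.

```python
def ex_cycles(program, forwarding):
    """Return the cycle in which each instruction occupies the EX stage."""
    ex = []
    for i, (_, _, srcs) in enumerate(program):
        # The first instruction reaches EX in cycle 3; later ones follow in order.
        earliest = 3 if i == 0 else ex[-1] + 1
        for j in range(i):
            p_op, p_rd, _ = program[j]
            if p_rd == 0 or p_rd not in srcs:
                continue                      # no RAW dependency on instruction j
            if forwarding:
                gap = 2 if p_op == "lw" else 1
            else:
                gap = 3                       # wait for the producer's WB
            earliest = max(earliest, ex[j] + gap)
        ex.append(earliest)
    return ex

def finish_cycle(program, forwarding):
    """Cycle in which the last instruction completes WB (EX + 2)."""
    return ex_cycles(program, forwarding)[-1] + 2

lab = [
    ("lw", 1, (2,)),      # I1: lw  x1, 0(x2)
    ("add", 3, (1, 4)),   # I2: add x3, x1, x4
    ("sub", 5, (3, 6)),   # I3: sub x5, x3, x6
    ("and", 7, (5, 8)),   # I4: and x7, x5, x8
]
print("no forwarding:", ex_cycles(lab, False), "done in", finish_cycle(lab, False))
print("forwarding:   ", ex_cycles(lab, True), "done in", finish_cycle(lab, True))
```

Under these assumptions the sequence takes 14 cycles without forwarding and 9 cycles with it, against an ideal of 8, so forwarding removes all but the single load-use bubble.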

Extended Exercise: Code Scheduling

Compilers can reduce stalls by reordering instructions. Try reordering this code:

# Original code (has Load-Use Hazard)
lw   x1, 0(x2)
add  x3, x1, x4     # depends on x1, must stall
lw   x5, 4(x2)
add  x6, x5, x7     # depends on x5, must stall

# Optimized code (interleave independent instructions)
lw   x1, 0(x2)
lw   x5, 4(x2)      # independent, can execute during x1's MEM
add  x3, x1, x4     # x1 now ready (forwarded from MEM)
add  x6, x5, x7     # x5 now ready

danieRTOS Reference: Understanding pipeline behavior helps optimize context switch code, where minimizing stalls in the critical path improves task switching latency.

⚠️ Common Pitfalls

Pitfall 1: Confusing Throughput and Latency

Misconception: “5-stage pipeline means each instruction takes 5 cycles?”

Correct Understanding:

Latency: A single instruction from IF to WB indeed takes 5 cycles
Throughput: In steady state, 1 instruction completes per cycle
More stages allow a higher clock frequency but also increase hazard penalties

Pitfall 2: Thinking Forwarding Solves All Data Hazards

Misconception: “With Forwarding, we never need to stall!”

Correct Understanding:

Load-Use Hazard cannot be fully solved by Forwarding
Load result is produced in MEM stage, but next instruction needs it in EX stage
Must insert at least 1 cycle stall (bubble)

lw   x1, 0(x2)      # Result available at end of cycle 4 (MEM)
add  x3, x1, x4     # Would need the value at start of cycle 4 (EX): the load data arrives too late!
                    # Must stall 1 cycle

Pitfall 3: Ignoring Control Hazard Cost

Misconception: “A branch instruction is just a jump.”

Correct Understanding:

Branch outcome (taken or not taken) is determined in the EX stage (comparison operation)
IF and ID stages have already fetched “wrong” subsequent instructions
If branch taken, these instructions must be flushed
This is called Branch Penalty

    beq  x1, x2, target   # cycle 1: IF
    add  x3, x4, x5       # cycle 2: IF (guessed not taken)
    sub  x6, x7, x8       # cycle 3: IF
                          # cycle 3: discover branch taken!
                          # add, sub must be flushed, wasting 2 cycles

Solution: Branch Prediction

Static Prediction: Guess backward branch taken, forward not taken
Dynamic Prediction: Predict based on history (Branch History Table)
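
As a rough illustration of static prediction and its payoff, the sketch below implements the "backward taken, forward not taken" rule and a simple CPI estimate. It assumes a base CPI of 1 and a fixed flush penalty per misprediction. Both function names and the parts-per-thousand encoding are assumptions chosen to keep the arithmetic in integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Static BTFN prediction: a branch whose target lies at a lower address
// is assumed to be a loop back-edge and is predicted taken.
bool predict_btfn(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}

// Estimated CPI * 1000, assuming base CPI = 1.
//   branch_per_1000:     branches per 1000 instructions
//   mispredict_per_1000: mispredictions per 1000 branches
//   penalty_cycles:      cycles lost per misprediction
uint32_t cpi_x1000(uint32_t branch_per_1000, uint32_t mispredict_per_1000,
                   uint32_t penalty_cycles) {
    assert(branch_per_1000 <= 1000 && mispredict_per_1000 <= 1000);
    return 1000 +
           (branch_per_1000 * mispredict_per_1000 * penalty_cycles) / 1000;
}
```

For example, with 20% branches, 10% of them mispredicted, and the 2-cycle penalty shown above, `cpi_x1000(200, 100, 2)` returns 1040, a CPI of 1.04.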

Summary

In this chapter, we explored the fundamentals of RISC-V pipelining:

Five-stage pipeline: IF, ID, EX, MEM, WB — the classic RISC pipeline structure.
Hazards: Structural, data (RAW, WAR, WAW), and control hazards that disrupt pipeline flow.
Hazard resolution: Forwarding, stalls, and compiler scheduling to minimize performance loss.
Branch handling: Static and dynamic prediction, BTB, and RISC-V’s elimination of delay slots.
Trap handling: Precise exceptions, pipeline flushing, and performance costs.
Implementations: Single-issue vs multi-issue, scalar vs superscalar, and real RISC-V cores.

RISC-V’s clean, regular ISA makes it ideal for efficient pipeline implementation. The absence of delay slots and complex addressing modes simplifies pipeline control compared to older architectures like MIPS.

In the next chapter, we’ll explore out-of-order execution, where processors dynamically reorder instructions to extract even more parallelism and performance.

Chapter 8. Microarchitecture Variations

Part V — Pipeline & Microarchitecture

🎯 Learning Objectives

After reading this chapter, you will be able to:

Distinguish In-Order from Out-of-Order: Understand the performance differences and use cases of both execution models
Understand Register Renaming: Know how Physical Registers eliminate False Dependencies (WAW/WAR)
Master ROB Mechanism: Understand how the Reorder Buffer guarantees “out-of-order execution, in-order commit”
Recognize Speculative Execution: Understand the principles and security risks (Spectre/Meltdown)
Use Performance Counters: Calculate CPI using mcycle and minstret

💡 Scenario: The Michelin Restaurant Kitchen Philosophy

Scene: Junior is comparing benchmark results of two RISC-V cores and finds that despite similar frequencies, performance differs by a factor of two.

Junior: “Professor, I looked at specs for two RISC-V cores. SiFive U74 and Alibaba C910 both run at about 1.5GHz, but the CoreMark scores differ by almost 2x! How is this possible?”

Professor: “Have you ever been to a Michelin-star restaurant?”

Junior: “Sure, but what does that have to do with CPUs?”

Professor: “Imagine two restaurants.

The first one (In-Order): The chef strictly follows the order of tickets. If the first dish needs 10 minutes for ingredients to thaw, all other dishes must wait—even if the second dish’s ingredients are already ready.

The second one (Out-of-Order): The chef looks at which dish has ingredients ready first and starts with that. While waiting for ingredients to thaw, the chef has already finished three other dishes.”

Junior: “So Out-of-Order means the CPU doesn’t wait idly?”

Professor: “Exactly. But this requires a smart ‘restaurant manager’ to coordinate:

Reservation Station: Tracks what ingredients each dish needs; starts cooking when ingredients arrive.
Reorder Buffer: Even though cooking order is scrambled, dishes must still be served in the order customers placed them—otherwise chaos ensues.
Register Renaming: If two dishes both need ‘eggs,’ but they’re actually different eggs, label them differently to avoid confusion.”

Junior: “Sounds complex. What’s the cost?”

Professor: “Transistor count explodes, and power consumption goes up with it. That’s why phone ‘big cores’ are power-hungry while ‘little cores’ are efficient—big cores are usually Out-of-Order, little cores are In-Order.”

Junior: “Let’s measure the actual performance difference!”

In Chapter 7, we explored the classic five-stage in-order pipeline. But modern processors go far beyond this simple model. Out-of-order (OOO) execution allows processors to dynamically reorder instructions to extract more parallelism, dramatically improving performance. While in-order processors execute instructions in program order and stall on dependencies, out-of-order processors can execute independent instructions while waiting for slow operations to complete.

This chapter explores the microarchitectural techniques that enable high-performance RISC-V processors: register renaming to eliminate false dependencies, reorder buffers to maintain precise exceptions, speculative execution to execute beyond branches, and advanced branch prediction to minimize misprediction penalties. We’ll examine the cache hierarchy that hides memory latency, and cache coherence protocols that maintain consistency across multiple cores. Understanding these techniques is essential for anyone designing high-performance RISC-V systems or optimizing code for modern processors.

8.1 Out-of-Order Execution Basics

In-Order vs Out-of-Order

In-order processors execute instructions in the exact order they appear in the program. If an instruction stalls (e.g., waiting for a cache miss), all subsequent instructions must wait, even if they’re independent and could execute.

Out-of-order (OOO) processors can execute instructions in a different order than the program specifies, as long as the final result is the same. This allows the processor to work around stalls and dependencies, keeping execution units busy.

Example:

lw   x1, 0(x2)     # I1: load (cache miss, 100 cycles)
add  x3, x4, x5    # I2: independent of I1
sub  x6, x7, x8    # I3: independent of I1
add  x9, x1, x10   # I4: depends on I1

In-order execution: I2 and I3 must wait for I1 to complete (100 cycles), even though they’re independent.

Out-of-order execution: I2 and I3 can execute immediately while I1 is waiting for the cache miss. Only I4 must wait for I1.

This simple reordering can dramatically improve performance, especially when memory latency is high.

Figure 8.1: In-Order vs Out-of-Order Execution

In-order (every instruction behind the stalled load waits):

Cycle:           1    2    3    4 ... 103         104  105  106  107
lw x1 (miss):    IF   ID   EX   MEM (100 cycles)  WB
add x3,x4,x5:         IF   ID   ----- stall ----- EX   MEM  WB
sub x6,x7,x8:              IF   ----- stall ----- ID   EX   MEM  WB

Out-of-order (independent instructions proceed during the miss):

Cycle:           1    2    3    4    5    6    7    ...  103  104  105  106
lw x1 (miss):    IF   ID   EX   MEM (100 cycles) -----------  WB
add x3,x4,x5:         IF   ID   EX   MEM  WB
sub x6,x7,x8:              IF   ID   EX   MEM  WB
add x9,x1,x10:                  IF   ID   ------ wait ------  EX   MEM  WB

Out-of-order execution allows independent instructions to proceed while the load is waiting, dramatically reducing wasted cycles.

Dynamic Scheduling

Dynamic scheduling is the hardware mechanism that enables out-of-order execution. The processor analyzes dependencies at runtime and schedules instructions to execution units when their operands are ready.

Two classic algorithms for dynamic scheduling:

Scoreboarding (CDC 6600, 1964): A centralized control unit tracks which registers are being written and which instructions are waiting for them. When all operands are ready, the instruction is issued to an execution unit.

Tomasulo’s algorithm (IBM 360/91, 1967): Uses reservation stations to buffer instructions waiting for operands. When an operand is produced, it’s broadcast to all waiting instructions. This eliminates the need for a centralized scoreboard and enables register renaming (Section 8.2).

Modern OOO processors use variations of Tomasulo’s algorithm with additional structures like the Reorder Buffer (ROB) to ensure precise exceptions.

8.2 Register Renaming

The Problem: False Dependencies

Consider this code:

add  x1, x2, x3    # I1: x1 = x2 + x3
sub  x4, x1, x5    # I2: x4 = x1 - x5 (RAW dependency on x1)
add  x1, x6, x7    # I3: x1 = x6 + x7 (WAW dependency on x1)
mul  x8, x1, x9    # I4: x8 = x1 * x9 (RAW dependency on x1 from I3)

I2 has a true dependency (RAW) on I1 — it must wait for I1 to produce x1.

But I3 has a false dependency (WAW) on I1: both write x1, but I3’s write doesn’t actually depend on I1’s value. I3 also has a false dependency (WAR) on I2: I3 must not overwrite x1 before I2 has read it. Similarly, I4 depends on I3’s x1, not I1’s.

False dependencies (WAR and WAW) limit parallelism because the processor must serialize instructions that could otherwise execute in parallel.

Physical vs Architectural Registers

Register renaming eliminates false dependencies by mapping architectural registers (the 32 registers visible to the programmer) to a larger set of physical registers (hidden from the programmer).

RISC-V has 32 architectural registers (x0-x31). A high-performance OOO processor might have 128 or 256 physical registers.

Register Alias Table (RAT): Maps each architectural register to a physical register. When an instruction writes an architectural register, it’s allocated a new physical register.

Example with renaming:

add  P10, P2, P3   # I1: x1 -> P10
sub  P11, P10, P5  # I2: x4 -> P11, reads P10 (I1's result)
add  P12, P6, P7   # I3: x1 -> P12 (new physical register!)
mul  P13, P12, P9  # I4: x8 -> P13, reads P12 (I3's result)

Now I3 and I4 use P12 for x1, while I1 and I2 use P10. The WAW dependency on I1 and the WAR dependency on I2 are both eliminated: I3 can execute as soon as P6 and P7 are ready, without waiting for I1 or I2.
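
The renaming step can be sketched in a few lines of C. This is a toy model under stated assumptions: only x0-x9 exist, the initial mapping is xN → PN, and fresh registers are handed out in order starting at P10, which reproduces the numbering in the example. All type and function names are hypothetical, and writes to x0 are not special-cased.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ARCH 10   // toy machine: x0-x9 only (assumption for this sketch)
#define NUM_PHYS 16   // P0-P15

typedef struct {
    uint8_t rat[NUM_ARCH];  // architectural -> physical mapping
    uint8_t next_free;      // simplistic free list: P10, P11, ... in order
} Renamer;

typedef struct { uint8_t rd, rs1, rs2; } ArchInsn;  // architectural operands
typedef struct { uint8_t pd, ps1, ps2; } PhysInsn;  // renamed operands

void renamer_init(Renamer *r) {
    for (uint8_t i = 0; i < NUM_ARCH; i++) {
        r->rat[i] = i;          // start with the identity mapping xN -> PN
    }
    r->next_free = NUM_ARCH;
}

// Rename one instruction: read the sources through the RAT first, then
// give the destination a fresh physical register and record it.
PhysInsn rename_insn(Renamer *r, ArchInsn in) {
    assert(in.rd < NUM_ARCH && in.rs1 < NUM_ARCH && in.rs2 < NUM_ARCH);
    assert(r->next_free < NUM_PHYS);  // a real core would stall here
    PhysInsn out;
    out.ps1 = r->rat[in.rs1];
    out.ps2 = r->rat[in.rs2];
    out.pd  = r->next_free++;
    r->rat[in.rd] = out.pd;
    return out;
}
```

Feeding I1 to I4 through `rename_insn` in program order produces the destinations P10, P11, P12, P13, and I4 reads P12 rather than P10. Reading the sources before updating the RAT matters for instructions such as `add x1, x1, x2`, which must read the old mapping of x1.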

Figure 8.2: Register Renaming Example

graph TB
    subgraph "Architectural Registers (Programmer View)"
        x1[x1]
        x2[x2]
        x3[x3]
        x4[x4]
    end

    subgraph "Physical Registers (Hardware)"
        P2[P2]
        P3[P3]
        P5[P5]
        P6[P6]
        P7[P7]
        P9[P9]
        P10[P10]
        P11[P11]
        P12[P12]
        P13[P13]
    end

    subgraph "Register Alias Table (RAT)"
        RAT["x1 → P12<br/>x2 → P2<br/>x3 → P3<br/>x4 → P11"]
    end

    x1 -.->|mapped to| P12
    x2 -.->|mapped to| P2
    x3 -.->|mapped to| P3
    x4 -.->|mapped to| P11

    style P10 fill:#ffcccc
    style P12 fill:#ccffcc

In this example, x1 was previously mapped to P10 (shown in red; it returns to the free list once I3 commits), and is currently mapped to P12 (shown in green).

Free List Management

Physical registers must be recycled when they’re no longer needed. A free list tracks which physical registers are available for allocation.

When an instruction commits (Section 8.3), its old physical register mapping can be freed:

I3 commits: x1 was mapped to P10, now mapped to P12
  -> P10 can be freed (added to free list)

Register renaming eliminates WAR and WAW hazards, leaving only true RAW dependencies. This dramatically increases instruction-level parallelism.

8.3 Reorder Buffer (ROB) and Issue Queue

Reorder Buffer (ROB)

The ROB ensures that instructions commit in program order, even though they execute out-of-order. This is essential for precise exceptions (Chapter 7).

The ROB is a circular buffer that holds all in-flight instructions in program order. Each entry contains:

Instruction PC
Destination register (architectural and physical)
Result value (when execution completes)
Exception status
Ready bit

Instruction flow through the ROB:

Dispatch: Instruction is allocated a ROB entry and placed in the issue queue.
Execute: Instruction executes out-of-order when operands are ready.
Complete: Result is written to the ROB entry and broadcast to waiting instructions.
Commit: When the instruction reaches the head of the ROB and is complete, it commits (updates architectural state).

Commit is in-order: Instructions commit from the head of the ROB one at a time (or in small groups). This ensures that if an exception occurs, all earlier instructions have committed and all later instructions can be discarded.

Figure 8.3: Reorder Buffer (ROB) Structure

graph TB
    subgraph "Reorder Buffer (Circular Queue)"
        direction TB
        ROB1["ROB Entry 1<br/>PC: 0x1000<br/>Dest: x1→P10<br/>Value: 42<br/>Ready: ✓"]
        ROB2["ROB Entry 2<br/>PC: 0x1004<br/>Dest: x2→P11<br/>Value: -<br/>Ready: ✗"]
        ROB3["ROB Entry 3<br/>PC: 0x1008<br/>Dest: x3→P12<br/>Value: 100<br/>Ready: ✓"]
        ROB4["ROB Entry 4<br/>PC: 0x100C<br/>Dest: x4→P13<br/>Value: -<br/>Ready: ✗"]
        DOTS["..."]
    end

    HEAD[Head Pointer<br/>Commit from here] --> ROB1
    TAIL[Tail Pointer<br/>Dispatch to here] --> DOTS

    ROB1 -->|Commit| COMMIT[Update<br/>Architectural<br/>State]

    style ROB1 fill:#ccffcc
    style ROB3 fill:#ccffcc
    style ROB2 fill:#ffcccc
    style ROB4 fill:#ffcccc

Green entries are ready to commit (when they reach the head). Red entries are still executing.

ROB Example Code:

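#include <stdbool.h>
#include <stdint.h>

// ROB entry structure
typedef struct {
    uint64_t pc;              // Instruction address
    uint8_t  arch_reg;        // Architectural destination register (x0-x31)
    uint8_t  phys_reg;        // Newly allocated physical register (P0-P127)
    uint64_t value;           // Result value
    bool     ready;           // Execution complete?
    bool     exception;       // Exception occurred?
    uint8_t  exception_code;  // Exception type
} ROB_Entry;

// Commit logic (simplified). ROB[], head, count, committed_RAT[], PC,
// trap_handler_address and the helper functions are assumed to be
// defined elsewhere in the model.
void commit_instruction(void) {
    if (count == 0) {
        return;  // ROB is empty, nothing to commit
    }

    ROB_Entry *entry = &ROB[head];

    if (!entry->ready) {
        return;  // Oldest instruction has not finished executing
    }

    if (entry->exception) {
        // Precise exception: all older instructions have committed, so
        // discard this instruction and everything younger, then trap.
        flush_pipeline();
        PC = trap_handler_address;
        return;
    }

    // Update architectural state
    if (entry->arch_reg != 0) {  // x0 is always zero
        // Save the previous mapping before overwriting it; that register
        // can no longer be read by any in-flight instruction.
        uint8_t old_phys = committed_RAT[entry->arch_reg];
        committed_RAT[entry->arch_reg] = entry->phys_reg;
        free_physical_register(old_phys);
    }

    // Advance head pointer
    head = (head + 1) % ROB_SIZE;
    count--;
}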

Issue Queue and Reservation Stations

The issue queue (or reservation stations) holds instructions waiting for operands. When an instruction is dispatched, it’s placed in the issue queue. When all its operands are ready, it’s issued to an execution unit.

Wakeup and select:

Wakeup: When a result is produced, it’s broadcast to all issue queue entries. Entries waiting for that result mark the operand as ready.
Select: Among all ready instructions, the scheduler selects which ones to issue to execution units (based on priority, age, or other policies).

This is the heart of dynamic scheduling. The issue queue decouples instruction dispatch from execution, allowing the processor to find parallelism dynamically.

8.4 Load/Store Queue

Memory operations are particularly challenging in OOO processors because they must respect memory ordering (Chapter 6) while still allowing reordering for performance.

Load Queue and Store Queue

The load queue holds all in-flight loads. The store queue holds all in-flight stores. These queues track memory addresses and data, and enforce ordering constraints.

Store-to-load forwarding: If a load reads from the same address as an earlier store, the load can get the data directly from the store queue without waiting for the store to commit to memory.

sw   x1, 0(x2)     # Store x1 to address in x2
lw   x3, 0(x2)     # Load from same address

The load can forward the data from the store queue, avoiding a memory access.
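
A store queue lookup for forwarding might look like the following sketch. It assumes all accesses are aligned 32-bit words (so addresses either match exactly or do not overlap) and takes the conservative route of waiting when an older store's address is still unknown. `StoreEntry`, `FwdResult`, and `sq_forward` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     addr_known;  // has the store computed its address yet?
    uint64_t addr;
    uint32_t data;
} StoreEntry;

typedef enum { FWD_HIT, FWD_MISS, FWD_WAIT } FwdResult;

// sq[0..n-1] holds the stores older than the load, oldest first.
// Scan from youngest to oldest so the most recent matching store wins.
FwdResult sq_forward(const StoreEntry *sq, size_t n, uint64_t load_addr,
                     uint32_t *data_out) {
    assert(data_out != NULL);
    for (size_t i = n; i-- > 0;) {
        if (!sq[i].addr_known) {
            return FWD_WAIT;   // might alias with the load, so wait
        }
        if (sq[i].addr == load_addr) {
            *data_out = sq[i].data;
            return FWD_HIT;    // forward directly from the store queue
        }
    }
    return FWD_MISS;           // no older store matches: read the cache
}
```

Scanning youngest-first is the important design choice: if two older stores wrote the same address, the load must see the later one.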

Figure 8.4: Load/Store Queue Structure

graph TB
    subgraph "Store Queue"
        SQ1["Store 1<br/>Addr: 0x2000<br/>Data: 42<br/>Committed: ✗"]
        SQ2["Store 2<br/>Addr: 0x2008<br/>Data: 100<br/>Committed: ✗"]
        SQ3["Store 3<br/>Addr: ?<br/>Data: 55<br/>Committed: ✗"]
    end

    subgraph "Load Queue"
        LQ1["Load 1<br/>Addr: 0x2000<br/>Data: ?"]
        LQ2["Load 2<br/>Addr: 0x2010<br/>Data: ?"]
    end

    LQ1 -.->|Address Match<br/>Forward Data| SQ1
    LQ2 -.->|No Match<br/>Go to Cache| CACHE[Data Cache]

    SQ3 -.->|Address Unknown<br/>Must Wait| WAIT[Stall Load]

    style SQ1 fill:#ccffcc
    style SQ3 fill:#ffcccc

Store-to-load forwarding allows loads to get data from earlier stores without waiting for them to commit to memory.

Memory Disambiguation

Memory disambiguation is the problem of determining whether two memory operations access the same address. This is difficult because addresses are often computed dynamically.

sw   x1, 0(x2)     # Store to address A
lw   x3, 0(x4)     # Load from address B — is B == A?

If x2 and x4 contain the same value, the load depends on the store. But the processor doesn’t know this until the addresses are computed.

Conservative approach: Assume all loads depend on all earlier stores. This is safe but limits parallelism.

Speculative approach: Assume loads don’t depend on earlier stores and execute them speculatively. If a dependency is later detected (address match), squash the load and re-execute.

Modern processors use memory dependence prediction to guess which loads depend on which stores, improving speculation accuracy.
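
For the speculative approach, the check that runs when a store finally resolves its address can be sketched as below. This is a simplified model with exact-address matching and program-order sequence numbers. `LoadEntry` and `find_ordering_violation` are hypothetical names, and the sketch ignores the case where the load had already been forwarded the right data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;       // program-order sequence number
    bool     executed;  // load already obtained its data speculatively
    uint64_t addr;
} LoadEntry;

// Called when the store with sequence number store_seq resolves its
// address. A younger load that already executed to the same address may
// have read stale data and must be squashed together with everything
// after it. Returns the index of the oldest such load, or -1 if the
// speculation was safe.
int find_ordering_violation(const LoadEntry *lq, size_t n,
                            uint32_t store_seq, uint64_t store_addr) {
    assert(lq != NULL || n == 0);
    int victim = -1;
    for (size_t i = 0; i < n; i++) {
        if (lq[i].executed && lq[i].seq > store_seq &&
            lq[i].addr == store_addr &&
            (victim < 0 || lq[i].seq < lq[victim].seq)) {
            victim = (int)i;
        }
    }
    return victim;
}
```

Returning the oldest offending load is enough, because recovery re-executes that load and all younger instructions.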

8.5 Advanced Branch Prediction

Branch prediction is even more critical in OOO processors because mispredictions waste more work (all the speculatively executed instructions must be discarded).

Two-Level Adaptive Predictors

Two-level predictors record recent branch outcomes in a history register (global, or local to each branch) and use that history to select a counter in a second-level table. They can capture complex patterns like:

if (a > 0) {        // Branch B1
  if (b > 0) {      // Branch B2
    ...
  }
}

If B1 is taken, B2 is more likely to be taken. A global history predictor can learn this correlation.

Structure: A global history register (GHR) tracks the last N branch outcomes (taken/not-taken). This is used to index into a pattern history table (PHT) that contains 2-bit counters.
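
The GHR and PHT structure can be sketched as follows. This minimal version indexes the PHT with the history alone, as described above; practical designs usually also mix in PC bits. The 4-bit history length and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 4
#define PHT_SIZE  (1u << HIST_BITS)

typedef struct {
    uint32_t ghr;            // last HIST_BITS outcomes, newest in bit 0
    uint8_t  pht[PHT_SIZE];  // 2-bit counters: 0,1 = not taken; 2,3 = taken
} GlobalPredictor;

void gp_init(GlobalPredictor *p) {
    p->ghr = 0;
    for (unsigned i = 0; i < PHT_SIZE; i++) {
        p->pht[i] = 1;       // start weakly not taken
    }
}

bool gp_predict(const GlobalPredictor *p) {
    return p->pht[p->ghr] >= 2;
}

// Train the counter that made the prediction, then shift in the outcome.
void gp_update(GlobalPredictor *p, bool taken) {
    assert(p->ghr < PHT_SIZE);
    uint8_t *ctr = &p->pht[p->ghr];
    if (taken && *ctr < 3) {
        (*ctr)++;
    }
    if (!taken && *ctr > 0) {
        (*ctr)--;
    }
    p->ghr = ((p->ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}
```

A strictly alternating taken/not-taken branch defeats a single 2-bit counter, but this predictor learns it after a few iterations because the two phases of the pattern map to different history values and therefore to different counters.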

TAGE (Tagged Geometric History Length)

TAGE is a state-of-the-art branch predictor used in modern high-performance processors. It uses multiple predictor tables with different history lengths (e.g., 4, 8, 16, 32, 64 branches).

Each table is indexed by a hash of the PC and history, and each entry carries a partial tag that is compared on lookup. The predictor uses the matching entry with the longest history, falling back to shorter histories (and finally to the base predictor) if there’s no match.

TAGE achieves very high accuracy (98-99%) on most workloads.

Figure 8.5: TAGE Predictor Structure

graph TB
    PC[Program Counter] --> HASH1[Hash Function 1]
    PC --> HASH2[Hash Function 2]
    PC --> HASH3[Hash Function 3]
    PC --> BASE[Base Predictor]

    GHR[Global History<br/>Register] --> HASH1
    GHR --> HASH2
    GHR --> HASH3

    HASH1 --> T1[Table 1<br/>History: 4]
    HASH2 --> T2[Table 2<br/>History: 16]
    HASH3 --> T3[Table 3<br/>History: 64]

    T1 --> SELECT[Selector<br/>Choose Longest<br/>Matching History]
    T2 --> SELECT
    T3 --> SELECT
    BASE --> SELECT

    SELECT --> PRED[Prediction<br/>Taken/Not-Taken]

    style T3 fill:#ccffcc
    style SELECT fill:#ffffcc

TAGE uses multiple tables with different history lengths. The longest matching history provides the prediction.

Return Address Stack (RAS)

Function returns are predicted using a hardware stack (mentioned in Chapter 7). When a function call is detected (jal or jalr whose rd is a link register, x1 or x5), the return address is pushed onto the RAS. When a return is detected (typically jalr x0, 0(x1)), the top of the RAS is popped and used as the prediction.

The RAS is very accurate (>99%) because function calls and returns are well-structured.

Indirect Branch Prediction

Indirect branches (jalr with a computed target) are harder to predict than direct branches. The target can vary widely depending on the value in the register.

Indirect branch target buffer (iBTB): A cache indexed by the branch PC that stores recently seen targets. For virtual function calls or switch statements, the iBTB can achieve good accuracy.

Advanced techniques: Some processors use the call path (sequence of recent function calls) to predict indirect branch targets, improving accuracy for polymorphic code.

8.6 Cache Hierarchy

Modern processors have multiple levels of cache to hide memory latency.

L1 Instruction and Data Caches

L1 caches are small (32-64 KB), fast (1-2 cycle latency), and split into separate instruction (I-cache) and data (D-cache) caches.

Split I/D caches allow simultaneous instruction fetch and data access, avoiding structural hazards. They’re also optimized differently: I-caches are read-only and can use simpler replacement policies.

Virtually indexed, physically tagged (VIPT): L1 caches often use virtual addresses for indexing (to avoid TLB lookup latency) but physical addresses for tags (to avoid aliasing issues).
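
One common design constraint for VIPT caches can be written as a one-line check: if the index and line-offset bits all fall inside the page offset, the virtual and physical index are identical and two virtual aliases of the same physical line cannot land in different sets. The helper below is a sketch with a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// True when one cache way is no larger than a page, so the set index
// is taken entirely from untranslated page-offset bits.
bool vipt_alias_free(uint32_t cache_bytes, uint32_t ways,
                     uint32_t page_bytes) {
    assert(ways > 0 && cache_bytes % ways == 0);
    return cache_bytes / ways <= page_bytes;
}
```

With 4 KB pages, a 32 KB 8-way cache passes (4 KB per way), while a 64 KB 4-way cache does not (16 KB per way) and needs extra hardware or OS support to handle aliases.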

L2 Unified Cache

L2 cache is larger (256 KB - 1 MB), slower (10-20 cycles), and unified (holds both instructions and data).

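L2 backs the L1 caches: an L1 miss is served from L2, and dirty lines evicted from L1 are written back to it. In an inclusive hierarchy, L2 contains everything in L1. In an exclusive hierarchy, L2 instead acts as a victim cache that holds only lines evicted from L1.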

L3 Shared Cache

L3 cache (if present) is even larger (4-32 MB), slower (30-50 cycles), and shared among all cores in a multi-core processor.

L3 reduces traffic to main memory and provides a large shared working set for all cores.

Figure 8.6: Cache Hierarchy

graph TB
    subgraph "Core 0"
        CPU0[CPU Core 0]
        L1I0[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
        L1D0[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
    end

    subgraph "Core 1"
        CPU1[CPU Core 1]
        L1I1[L1 I-Cache<br/>32 KB<br/>1-2 cycles]
        L1D1[L1 D-Cache<br/>32 KB<br/>1-2 cycles]
    end

    CPU0 --> L1I0
    CPU0 --> L1D0
    CPU1 --> L1I1
    CPU1 --> L1D1

    L1I0 --> L2_0[L2 Unified<br/>256 KB<br/>10-20 cycles]
    L1D0 --> L2_0
    L1I1 --> L2_1[L2 Unified<br/>256 KB<br/>10-20 cycles]
    L1D1 --> L2_1

    L2_0 --> L3[L3 Shared<br/>8 MB<br/>30-50 cycles]
    L2_1 --> L3

    L3 --> MEM[Main Memory<br/>DDR4/DDR5<br/>100-300 cycles]

    style L1I0 fill:#e1f5ff
    style L1D0 fill:#e1f5ff
    style L1I1 fill:#e1f5ff
    style L1D1 fill:#e1f5ff
    style L2_0 fill:#fff4e1
    style L2_1 fill:#fff4e1
    style L3 fill:#ffe1e1
    style MEM fill:#f0e1ff

Each level is larger and slower. L1 is private per core, L2 may be private or shared, L3 is shared among all cores.
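
The benefit of the hierarchy can be estimated with the standard average memory access time (AMAT) calculation. The function below is a sketch: its name and parameter layout are assumptions, and the miss rates are local ones (misses at a level divided by accesses that reach that level).

```c
#include <assert.h>

// Average memory access time in cycles for a three-level hierarchy.
// Each *_hit argument is that level's hit latency in cycles; each
// *_miss_rate is a local miss rate in the range 0.0 to 1.0.
double amat_cycles(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate,
                   double l3_hit, double l3_miss_rate,
                   double mem_latency) {
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);
    assert(l3_miss_rate >= 0.0 && l3_miss_rate <= 1.0);
    double l3_time = l3_hit + l3_miss_rate * mem_latency;
    double l2_time = l2_hit + l2_miss_rate * l3_time;
    return l1_hit + l1_miss_rate * l2_time;
}
```

As an example with made-up but plausible numbers, `amat_cycles(2, 0.05, 15, 0.2, 40, 0.5, 200)` evaluates to about 4.15 cycles, far below the 200-cycle memory latency.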

Cache Replacement Policies

LRU (Least Recently Used): Evict the cache line that hasn’t been accessed for the longest time. This is effective but expensive to implement for high associativity.

PLRU (Pseudo-LRU): An approximation of LRU that’s cheaper to implement. Uses a tree of bits to track approximate recency.

Random: Evict a random cache line. Surprisingly effective and very simple.

Modern caches often use PLRU or adaptive policies that combine multiple strategies.
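
Tree-PLRU for a 4-way set needs only three bits. The sketch below uses one possible bit assignment (documented in the comment) and hypothetical function names.

```c
#include <assert.h>
#include <stdint.h>

// Tree-PLRU state for one 4-way set: 3 bits that point toward the
// (approximately) least recently used way.
//   bit 0: 0 = victim is in ways {0,1}, 1 = victim is in ways {2,3}
//   bit 1: within {0,1}: 0 = way 0, 1 = way 1
//   bit 2: within {2,3}: 0 = way 2, 1 = way 3
typedef uint8_t Plru4;

// On a hit or fill of `way`, set the bits on its path to point away
// from it.
Plru4 plru4_touch(Plru4 s, unsigned way) {
    assert(way < 4);
    if (way < 2) {
        s |= 1u;                                // victim side: {2,3}
        s = (way == 0) ? (s | 2u) : (s & ~2u);  // point at the other way
    } else {
        s &= ~1u;                               // victim side: {0,1}
        s = (way == 2) ? (s | 4u) : (s & ~4u);
    }
    return s;
}

// Follow the bits from the root to the way that should be evicted next.
unsigned plru4_victim(Plru4 s) {
    if ((s & 1u) == 0) {
        return (s & 2u) ? 1u : 0u;
    }
    return (s & 4u) ? 3u : 2u;
}
```

After touching ways 0, 1, 2, 3 in order, the victim is way 0, the same as true LRU. Touching way 0 again makes the victim way 2, whereas true LRU would pick way 1, which shows where the approximation departs from exact recency.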

8.7 Cache Coherence

In multi-core systems, each core has its own L1 cache. Cache coherence ensures that all cores see a consistent view of memory.

MESI Protocol

MESI is one of the most widely used coherence protocols. Each cache line is in one of four states:

M (Modified): This cache has the only valid copy, and it’s been modified (dirty).
E (Exclusive): This cache has the only valid copy, and it’s clean (matches memory).
S (Shared): Multiple caches have valid copies, all clean.
I (Invalid): This cache line is not valid.

State transitions:

Read miss: If another cache has the line in M, it writes back to memory and transitions to S. The requesting cache loads the line in S.
Write: If the line is in S, all other caches invalidate their copies. The writing cache transitions to M.

Figure 8.7: MESI Protocol State Diagram

stateDiagram-v2
    [*] --> I
    I: Invalid<br/>(No valid copy)
    E: Exclusive<br/>(Only copy, clean)
    S: Shared<br/>(Multiple copies, clean)
    M: Modified<br/>(Only copy, dirty)

    I --> E: Read Miss<br/>(No other copy)
    I --> S: Read Miss<br/>(Other copies exist)
    I --> M: Write Miss

    E --> M: Write
    E --> S: Other Read
    E --> I: Evict or<br/>Other Write

    S --> M: Write<br/>(Invalidate others)
    S --> I: Evict or<br/>Other Write

    M --> S: Other Read<br/>(Write back)
    M --> I: Evict or<br/>Other Write<br/>(Write back)

MESI Example:

// Core 0 and Core 1 both access the same cache line

// Initial state: Both caches Invalid (I)

// Core 0: Read from address 0x1000
// Core 0 cache: I → E (exclusive, no other copy)

// Core 1: Read from address 0x1000
// Core 0 cache: E → S (shared)
// Core 1 cache: I → S (shared)

// Core 0: Write to address 0x1000
// Core 0 cache: S → M (modified, dirty)
// Core 1 cache: S → I (invalidated)

// Core 1: Read from address 0x1000
// Core 0 cache: M → S (write back to memory)
// Core 1 cache: I → S (load from memory)
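
The same transitions can be expressed as a next-state function for a single cache's copy of a line. This is a simplified sketch: evictions, bus messages, and write-back traffic are left out, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MesiState;

typedef enum {
    EV_LOCAL_READ,    // this core loads from the line
    EV_LOCAL_WRITE,   // this core stores to the line
    EV_REMOTE_READ,   // another core's read is observed
    EV_REMOTE_WRITE   // another core's write (or upgrade) is observed
} MesiEvent;

// Next state of one cache's copy of a line. `others_have_copy` only
// matters for a local read miss (I -> E or I -> S).
MesiState mesi_next(MesiState s, MesiEvent ev, bool others_have_copy) {
    switch (ev) {
    case EV_LOCAL_READ:
        if (s == MESI_I) {
            return others_have_copy ? MESI_S : MESI_E;
        }
        return s;               // read hit: state unchanged
    case EV_LOCAL_WRITE:
        return MESI_M;          // other copies are invalidated first
    case EV_REMOTE_READ:
        if (s == MESI_M || s == MESI_E) {
            return MESI_S;      // M also writes the line back
        }
        return s;
    case EV_REMOTE_WRITE:
        return MESI_I;          // M writes back before invalidating
    }
    assert(0);                  // unknown event
    return s;
}
```

Replaying the four steps of the example through `mesi_next` for both cores yields the same state sequence: E, then S/S, then M/I, then S/S.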

MOESI Protocol

MOESI adds an O (Owned) state: the cache has a dirty copy, but other caches may have shared copies. This reduces write-backs to memory.

Snooping vs Directory-Based Coherence

Snooping: All caches monitor (snoop) a shared bus for memory transactions. When a cache sees a transaction that affects its data, it responds appropriately (invalidate, write-back, etc.).

Snooping is simple but doesn’t scale well beyond ~8-16 cores because bus bandwidth becomes a bottleneck.

Directory-based: A centralized directory tracks which caches have copies of each cache line. Coherence messages are sent point-to-point rather than broadcast.

Directory-based coherence scales better to many cores (64+) and is used in large multi-core processors.

RISC-V Coherence Considerations

RISC-V doesn’t mandate a specific coherence protocol. The RVWMO memory model (Chapter 6) defines the ordering guarantees, but the coherence mechanism is implementation-defined.

Most RISC-V multi-core systems use MESI or MOESI with snooping (for small core counts) or directory-based coherence (for large core counts).

8.8 Comparison with ARM and MIPS OOO Cores

Let’s compare RISC-V OOO implementations with other architectures.

RISC-V BOOM vs ARM Cortex-A76/A78

BOOM (Berkeley Out-of-Order Machine) is an open-source, parameterizable RISC-V OOO core. A representative configuration has:

3-4 issue width
128-entry ROB
64-entry issue queue
Advanced branch prediction (TAGE)
L1 I/D caches, L2 unified cache

ARM Cortex-A76 (2018) is a high-performance mobile core:

4-issue width
128-entry ROB
Sophisticated branch prediction
64 KB L1 I/D, 256-512 KB L2

Comparison: BOOM and Cortex-A76 are similar in structure. Both use register renaming, ROB, and advanced branch prediction. The main differences are in implementation details (pipeline depth, cache sizes, power optimization).

RISC-V’s simpler ISA (no complex addressing modes, no condition codes) makes the OOO logic slightly simpler than ARM’s, but the difference is small in modern designs.

RISC-V vs MIPS R10000 Pipeline

MIPS R10000 (1996) was one of the first commercial OOO processors:

4-issue superscalar
32-entry active list (similar to ROB)
Register renaming with 64 physical registers
Speculative execution

The R10000 pioneered many techniques still used today. Modern RISC-V OOO cores like BOOM are evolutionary descendants of the R10000 design philosophy.

Key difference: RISC-V has no branch delay slots, making branch misprediction recovery simpler than MIPS.

Microarchitecture Trade-offs

Complexity vs Performance: OOO execution provides 2-3x performance improvement over in-order for general-purpose workloads, but at the cost of 3-5x more transistors and power.

When to use OOO:

High-performance applications (servers, desktops, high-end mobile)
Workloads with irregular memory access patterns
Code with many branches and dependencies

When to use in-order:

Embedded systems with power/area constraints
Predictable real-time workloads
Simple control-dominated code

RISC-V’s flexibility allows both in-order (Rocket, SiFive E/U-series) and OOO (BOOM, SiFive P-series) implementations, making it suitable for a wide range of applications.

🛠️ Hands-on Lab: Lab 8.1 — The Truth Behind Performance Counters

This lab guides you through using RISC-V’s hardware performance counters to measure CPI (Cycles Per Instruction) of the same code under different conditions.

Lab Objectives

Read mcycle (Machine Cycle Counter) and minstret (Machine Instructions Retired)
Calculate CPI = Cycles / Instructions
Observe how different code patterns affect CPI

Concept Explanation

RISC-V provides two key performance counter CSRs:

CSR	Name	Description
`mcycle`	Machine Cycle Counter	Clock cycles elapsed since reset
`minstret`	Machine Instructions Retired	Instructions completed since reset

CPI (Cycles Per Instruction) = Δmcycle / Δminstret, where each Δ is the counter’s change across the measured region

CPI = 1.0: Ideal case, one instruction completes per cycle
CPI > 1.0: Stalls present (cache miss, hazard, etc.)
CPI < 1.0: Superscalar processor, multiple instructions complete per cycle

Code

Create lab8_perf.c:

// lab8_perf.c - Performance Counter Measurement
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Read mcycle (accessible in M-mode only)
static inline uint64_t read_mcycle(void) {
    uint64_t val;
    asm volatile("csrr %0, mcycle" : "=r"(val));
    return val;
}

// Read minstret (accessible in M-mode only)
static inline uint64_t read_minstret(void) {
    uint64_t val;
    asm volatile("csrr %0, minstret" : "=r"(val));
    return val;
}

// Test function 1: Simple addition loop (a, b, c, d are independent)
volatile int result1;
void test_independent(int n) {
    int a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < n; i++) {
        a += 1;
        b += 2;
        c += 3;
        d += 4;
        // Empty asm: stops -O2 from folding the whole loop into constants
        asm volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    result1 = a + b + c + d;
}

// Test function 2: Addition loop with dependencies
volatile unsigned int result2;
void test_dependent(int n) {
    unsigned int a = 0;  // unsigned: the value wraps around without UB
    for (int i = 0; i < n; i++) {
        a += 1;
        a += a;  // depends on previous line's result
        a += a;  // depends on previous line's result
        a += a;  // depends on previous line's result
        asm volatile("" : "+r"(a));  // keep the loop from being optimized away
    }
    result2 = a;
}

void measure(const char *name, void (*func)(int), int n) {
    uint64_t cycle_start = read_mcycle();
    uint64_t instr_start = read_minstret();

    func(n);

    uint64_t cycle_end = read_mcycle();
    uint64_t instr_end = read_minstret();

    uint64_t cycles = cycle_end - cycle_start;
    uint64_t instrs = instr_end - instr_start;

    // Calculate CPI (multiply by 100 to avoid floating point);
    // guard against a zero instruction count
    uint64_t cpi_x100 = instrs ? (cycles * 100) / instrs : 0;

    printf("%s:\n", name);
    printf("  Cycles: %" PRIu64 "\n", cycles);
    printf("  Instructions: %" PRIu64 "\n", instrs);
    printf("  CPI: %" PRIu64 ".%02" PRIu64 "\n", cpi_x100 / 100, cpi_x100 % 100);
}

int main(void) {
    int n = 100000;

    printf("=== Performance Counter Lab ===\n\n");

    measure("Independent Operations", test_independent, n);
    printf("\n");
    measure("Dependent Operations", test_dependent, n);

    return 0;
}

Compile and Run

# Compile
riscv64-unknown-elf-gcc -O2 -o lab8_perf lab8_perf.c

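# Run. mcycle/minstret are M-mode CSRs, so the program must execute in
# M-mode (bare metal on hardware or a cycle-accurate simulator). In U-mode,
# for example under qemu-riscv64 user-mode emulation, reading them raises an
# illegal-instruction exception; use rdcycle/rdinstret there instead. QEMU
# also does not model pipeline timing, so CPI measured there is not meaningful.
qemu-riscv64 lab8_perf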

Expected Output (values vary by implementation):

=== Performance Counter Lab ===

Independent Operations:
  Cycles: 500123
  Instructions: 700045
  CPI: 0.71

Dependent Operations:
  Cycles: 1200456
  Instructions: 800089
  CPI: 1.50

What You Just Did

Independent Operations: The four additions do not depend on each other, so a superscalar (multi-issue) core can execute several per cycle, which can bring CPI below 1
Dependent Operations: Each addition depends on the previous result, forcing sequential execution, so extra issue width does not help and CPI stays at or above 1

danieRTOS Reference: The scheduler uses similar performance measurement techniques to profile task execution time and optimize scheduling decisions.

⚠️ Common Pitfalls

Pitfall 1: Thinking Out-of-Order Is Always Better

Misconception: “OoO processors are always faster than In-Order!”

Correct Understanding:

OoO has heavier penalties on Branch Misprediction (more instructions to flush)
OoO may not be optimal in power-constrained scenarios (phones, IoT)
For highly parallel workloads (GPU-like), simple In-Order cores are often more efficient

Pitfall 2: Ignoring Spectre/Meltdown Risks

Misconception: “Speculative Execution is just a performance optimization with no side effects.”

Correct Understanding:

Spectre and Meltdown attacks exploit side-channel effects of Speculative Execution
Even when mispredicted instructions are cancelled, their effects on Cache remain
Attackers can infer secret data by measuring Cache access timing

// Simplified Spectre attack concept
if (x < array1_size) {           // Bounds check
    y = array2[array1[x] * 256]; // If x is out of bounds, this shouldn't execute
}
// But Speculative Execution may "execute first, ask questions later"
// Even after discovering x is out of bounds and cancelling,
// array2's cache state has leaked array1[x]'s value

Pitfall 3: Confusing mcycle with time

Error Scenario: Using mcycle to measure “real time.”

Correct Understanding:

mcycle: CPU clock cycles, tied to CPU frequency
time: Real time (usually from RTC or Timer)
If CPU frequency changes dynamically (DVFS), mcycle cannot be directly converted to time

// ❌ Wrong: Assuming mcycle equals time
uint64_t start = read_mcycle();
do_something();
uint64_t elapsed_ns = (read_mcycle() - start) * 1000000000 / CPU_FREQ;
// If CPU frequency changed, this calculation is wrong

// ✅ Correct: Use time CSR or SBI timer
uint64_t start = read_time();
do_something();
uint64_t elapsed_ns = (read_time() - start) * 1000000000 / TIMER_FREQ;

Summary

In this chapter, we explored advanced microarchitecture techniques for high-performance RISC-V processors:

Out-of-order execution: Dynamic scheduling to extract instruction-level parallelism.
Register renaming: Eliminating false dependencies (WAR, WAW) with physical registers.
Reorder buffer: Ensuring in-order commit for precise exceptions.
Load/store queues: Memory disambiguation and store-to-load forwarding.
Advanced branch prediction: TAGE, RAS, indirect branch prediction.
Cache hierarchy: L1/L2/L3 caches with LRU/PLRU replacement.
Cache coherence: MESI/MOESI protocols for multi-core consistency.
Comparisons: RISC-V BOOM vs ARM Cortex-A76 and MIPS R10000.

RISC-V’s clean ISA makes it an excellent target for both simple in-order cores and complex OOO designs. The architecture doesn’t impose unnecessary constraints, allowing microarchitects to innovate freely.

In the next chapter, we’ll shift focus from hardware to software, exploring how RISC-V systems boot and how firmware and operating systems interact with the hardware.

Chapter 9: Reset, Boot Flow & Firmware

Part VI — Booting & System Software

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand the Reset Vector: Know where the first instruction executes after RISC-V powers on
Master the Boot Flow: Understand the relay from BootROM → Loader → Firmware → OS
Write Linker Scripts: Define your program’s memory layout
Implement Bare-metal Programs: Control hardware directly without an OS
Understand UART MMIO: Output characters through Memory-Mapped I/O

💡 Scenario: The First Leg of the Relay Race

Scene: Junior presses the Reset button on the development board, watching logs scroll across the terminal.

Junior: “Architect, when we write C, we always start from main(). But when the CPU just powers on, RAM should be empty, right? Where does the CPU get its first instruction?”

Architect: “Great question. It’s like a relay race—main() is actually the third or fourth leg.

First Leg (Reset Vector): The CPU hardware is designed so that after power-on, the PC (Program Counter) automatically points to a fixed location (usually ROM). There, a small hardcoded program (BootROM) lives.
Second Leg (Loader): BootROM’s job is simple—copy the next program (like OpenSBI or U-Boot) from storage (Flash/SD card) into RAM, then jump to it.
Third Leg (Firmware/OS): Now the environment is more comfortable: RAM is available, and we can finally prepare to run your main().”

Junior: “Can we skip all that complex OS stuff and directly be that ‘second leg of the relay,’ controlling the hardware ourselves?”

Architect: “Absolutely—that’s called Bare-metal programming. In this world, there’s no printf, no malloc, not even a Stack unless you set it up yourself. Let’s try sending our first shout into this ‘wilderness.’”

Junior: “Sounds exciting!”

Architect: “But here’s the key: before entering C code, you must set up the Stack Pointer (SP). C function calls depend on the Stack—jumping into C without setting SP will crash immediately.”

What happens when you power on a RISC-V system? Unlike application software that runs in a well-prepared environment, the boot process starts from nothing—no operating system, no memory initialization, not even a stack. This chapter explores how RISC-V systems bootstrap themselves from power-on reset to a running operating system.

The boot process is a carefully orchestrated sequence of firmware stages, each preparing the environment for the next. We’ll trace this journey from the reset vector through machine-mode firmware (ZSBL, FSBL, OpenSBI), bootloaders (U-Boot, GRUB), and finally to the operating system handoff. Understanding this process is essential for firmware developers, system integrators, and anyone debugging boot issues.

9.1 Reset and Boot Sequence

Power-On Reset

When power is applied to a RISC-V processor, hardware reset logic initializes the core to a known state. All harts (hardware threads) begin execution in Machine mode (M-mode), the highest privilege level with full access to all hardware resources.

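Reset state (the RISC-V Privileged Specification guarantees only a few of these items; the rest is implementation-defined):

PC (Program Counter): Set to the reset vector address (implementation-defined, often 0x1000 or 0x80000000)
Privilege mode: M-mode. This is the current mode of the hart; mstatus.MPP only records the previous mode and does not select it
Interrupts: Globally disabled (mstatus.MIE = 0). The mie register is not guaranteed to be cleared, so firmware should write it explicitly
Virtual memory: Not in effect, because M-mode accesses are untranslated while mstatus.MPRV = 0 (its reset value). satp is not guaranteed to read as zero
Most CSRs: Unspecified
General-purpose registers: Unspecified (except x0, which is always zero)

How many harts start running is platform-specific. On some systems only one boot hart (usually hart 0) is released from reset, while the others are held in a wait state until the boot hart starts them. On others, every hart begins at the reset vector and firmware picks one boot hart, parking the rest until they are needed.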

Reset Vector

The reset vector is the first instruction address executed after reset. This address is implementation-defined and typically points to:

ROM (Read-Only Memory): Contains the first boot code (the ZSBL described in Section 9.3)
Flash memory: Contains firmware image
RAM: Pre-loaded by JTAG debugger (for development)

Example reset vectors:

SiFive FU540: 0x1000 (ROM)
SiFive FU740: 0x1000 (ROM)
QEMU virt machine: 0x1000 (ROM)
Rocket Chip: 0x10000 (configurable)

Figure 9.1: RISC-V Boot Sequence Overview

graph TB
    RESET[Power-On Reset] --> RV[Reset Vector<br/>0x1000]
    RV --> ROM[ROM Code<br/>M-mode]
    ROM --> FSBL[First-Stage<br/>Bootloader<br/>ZSBL/FSBL]
    FSBL --> SSBL[Second-Stage<br/>Bootloader<br/>U-Boot/OpenSBI]
    SSBL --> OS[Operating System<br/>Linux/FreeBSD<br/>S-mode]
    
    ROM -.->|Initialize| HW[Hardware<br/>DRAM, Clocks,<br/>Peripherals]
    FSBL -.->|Load from| STORAGE[Storage<br/>Flash/SD/Network]
    SSBL -.->|SBI Services| RUNTIME[M-mode Runtime<br/>OpenSBI]
    
    style RESET fill:#ffcccc
    style ROM fill:#ffe1e1
    style FSBL fill:#fff4e1
    style SSBL fill:#e1ffe1
    style OS fill:#e1f5ff

Early Initialization

The first code executed at the reset vector must be extremely careful — it runs with no stack, no initialized data, and minimal hardware setup. Typical early initialization:

# Reset vector entry point (M-mode)
_start:
    # Disable interrupts (already disabled by reset, but be explicit)
    csrw    mie, zero
    csrw    mip, zero
    
    # Initialize global pointer (gp) for data access
    .option push
    .option norelax
    la      gp, __global_pointer$
    .option pop
    
    # Set up stack pointer (sp)
    la      sp, __stack_top
    
    # Clear BSS (uninitialized data)
    la      t0, __bss_start
    la      t1, __bss_end
1:  bge     t0, t1, 2f
    sd      zero, 0(t0)
    addi    t0, t0, 8
    j       1b
2:
    # Jump to C code
    call    boot_main

    # boot_main should never return; park the hart if it does
3:  wfi
    j       3b

Key steps:

Disable interrupts: Ensure no interrupts occur during initialization
Set up gp (global pointer): Enables efficient access to global variables
Set up sp (stack pointer): Enables function calls and local variables
Clear BSS: Zero-initialize uninitialized global variables
Jump to C code: Now safe to run higher-level code

9.2 Machine Mode Initialization

CSR Initialization

Machine-mode firmware must initialize critical CSRs before proceeding. These control interrupt handling, memory protection, and hardware features.

Essential CSR initialization:

// Initialize machine-mode CSRs
void init_machine_mode(void) {
    // 1. Set up trap vector
    write_csr(mtvec, (uintptr_t)&m_trap_vector);
    
    // 2. Enable machine-mode interrupts (but keep global IE off for now)
    write_csr(mie, MIE_MSIE | MIE_MTIE | MIE_MEIE);
    
    // 3. Initialize mstatus
    uintptr_t mstatus = read_csr(mstatus);
    mstatus &= ~MSTATUS_MIE;  // Keep interrupts disabled
    mstatus |= MSTATUS_FS_INITIAL;  // Enable FPU (if present)
    write_csr(mstatus, mstatus);
    
    // 4. Clear pending interrupts
    write_csr(mip, 0);
    
    // 5. Let S-mode read the basic counters (if present)
    write_csr(mcounteren, 0x7);  // Bits 0-2: cycle, time, instret readable from S-mode
}

Physical Memory Protection (PMP) Setup

PMP is RISC-V’s mechanism for isolating memory regions and enforcing access permissions. M-mode firmware configures PMP entries to protect firmware code, restrict device access, and define memory regions for S-mode.

PMP configuration example:

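// Configure PMP to allow S-mode access to RAM
// Each entry i pairs one pmpaddr[i] register with one 8-bit field in pmpcfg0
// (on RV64, pmpcfg0 holds the fields for entries 0-7). Lower-numbered
// entries take priority when regions overlap.
// NAPOT is used here because a TOR entry takes its lower bound from the
// previous pmpaddr register, which would cost extra entries.
// NAPOT encoding: pmpaddr = (base >> 2) | ((size >> 3) - 1)
// PMP_NAPOT is assumed to be defined alongside PMP_TOR.
#define PMP_NAPOT_ADDR(base, size) (((base) >> 2) | (((size) >> 3) - 1))

void setup_pmp(void) {
    // Entry 0: Protect M-mode firmware (0x80000000 - 0x80100000, 1 MiB)
    // The lock bit applies R-X to M-mode too, so firmware data that must
    // stay writable cannot live in this region.
    write_csr(pmpaddr0, PMP_NAPOT_ADDR(0x80000000UL, 0x00100000UL));

    // Entry 1: Allow S-mode full access to RAM (0x80100000 - 0x88000000)
    // NAPOT regions must be size-aligned, so cover the whole 128 MiB at
    // 0x80000000 and let entry 0 override the first 1 MiB.
    write_csr(pmpaddr1, PMP_NAPOT_ADDR(0x80000000UL, 0x08000000UL));

    // Entry 2: Allow access to UART (0x10000000 - 0x10001000, 4 KiB)
    write_csr(pmpaddr2, PMP_NAPOT_ADDR(0x10000000UL, 0x00001000UL));

    // Build all three config fields and write pmpcfg0 once; three separate
    // writes would leave only the last entry configured.
    unsigned long cfg = 0;
    cfg |= (unsigned long)(PMP_R | PMP_X | PMP_L | PMP_NAPOT);        // Entry 0: R-X, locked
    cfg |= (unsigned long)(PMP_R | PMP_W | PMP_X | PMP_NAPOT) << 8;   // Entry 1: RWX
    cfg |= (unsigned long)(PMP_R | PMP_W | PMP_NAPOT) << 16;          // Entry 2: RW
    write_csr(pmpcfg0, cfg);
}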

PMP addressing modes:

OFF: Entry disabled
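TOR (Top-Of-Range): Region from pmpaddr[i-1] (address 0 for entry 0) up to, but not including, pmpaddr[i]
NA4: Naturally aligned 4-byte region
NAPOT: Naturally aligned power-of-2 region of 8 bytes or more, with the size encoded in the low bits of pmpaddr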

Figure 9.2: PMP Memory Protection

graph TB
    subgraph "Physical Memory"
        ROM[ROM<br/>0x1000-0x10000<br/>M-mode only]
        MRAM[M-mode RAM<br/>0x80000000-0x80100000<br/>Locked, R-X]
        SRAM[S-mode RAM<br/>0x80100000-0x88000000<br/>RWX]
        UART[UART<br/>0x10000000-0x10001000<br/>RW]
        FLASH[Flash<br/>0x20000000-0x24000000<br/>R-X]
    end

    PMP0[PMP Entry 0<br/>M-mode Firmware] --> MRAM
    PMP1[PMP Entry 1<br/>S-mode Memory] --> SRAM
    PMP2[PMP Entry 2<br/>UART Device] --> UART

    style MRAM fill:#ffcccc
    style SRAM fill:#ccffcc
    style UART fill:#ffffcc

Memory Configuration

M-mode firmware must initialize DRAM controllers and configure memory timing. This is highly platform-specific and often the most complex part of early boot.

Typical DRAM initialization:

Configure DRAM controller registers (timing, refresh rate)
Perform DRAM training (calibrate delays)
Test memory (optional, but recommended)
Set up memory map (base address, size)

Example (simplified):

void init_dram(void) {
    volatile uint32_t *dram_ctrl = (volatile uint32_t *)0x10000000;  // Placeholder controller address

    // Configure DRAM timing (platform-specific)
    dram_ctrl[0] = 0x12345678;  // Timing register
    dram_ctrl[1] = 0x9ABCDEF0;  // Refresh register

    // Wait for DRAM ready
    while (!(dram_ctrl[2] & 0x1));

    // Simple memory test
    volatile uint64_t *mem = (uint64_t *)0x80000000;
    mem[0] = 0xDEADBEEFCAFEBABE;
    if (mem[0] != 0xDEADBEEFCAFEBABE) {
        // Memory test failed
        while (1);
    }
}

9.3 Firmware and Bootloader

Firmware Stages

RISC-V boot firmware is typically organized into multiple stages, each with specific responsibilities:

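ZSBL (Zeroth-Stage Bootloader): Minimal ROM code; on platforms such as the SiFive FU540 it loads the FSBL into on-chip memory
FSBL (First-Stage Bootloader): Initializes DRAM where the ROM code has not, then loads the SSBL from storage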
SSBL (Second-Stage Bootloader): Full-featured bootloader (U-Boot)
Runtime firmware: M-mode services (OpenSBI)

Why multiple stages?

ROM size constraints: ZSBL must fit in small on-chip ROM
Flexibility: FSBL/SSBL can be updated without hardware changes
Feature richness: Later stages can use DRAM and have more code space

First-Stage Bootloader (FSBL)

FSBL’s primary job is to load the second-stage bootloader from non-volatile storage (Flash, SD card, network).

FSBL responsibilities:

Initialize storage controller (SPI, SD, eMMC)
Load SSBL image from storage to DRAM
Verify SSBL integrity (checksum, signature)
Jump to SSBL entry point

Example FSBL flow:

void fsbl_main(void) {
    // 1. Initialize storage
    spi_flash_init();

    // 2. Load SSBL from flash to DRAM
    uint8_t *ssbl_dest = (uint8_t *)0x80200000;
    uint32_t ssbl_size = 512 * 1024;  // 512 KB
    spi_flash_read(0x100000, ssbl_dest, ssbl_size);

    // 3. Verify checksum
    if (!verify_checksum(ssbl_dest, ssbl_size)) {
        panic("SSBL checksum failed");
    }

    // 4. Jump to SSBL (real loaders execute fence.i first so that
    //    instruction fetch sees the freshly written code)
    void (*ssbl_entry)(void) = (void (*)(void))ssbl_dest;
    ssbl_entry();
}

U-Boot for RISC-V

U-Boot is the most common second-stage bootloader for RISC-V Linux systems. It provides a rich environment for loading and booting operating systems.

U-Boot features:

Multiple boot sources: Flash, SD, USB, network (TFTP, NFS)
File system support: FAT, ext2/3/4, SquashFS
Network stack: DHCP, TFTP, NFS
Scripting: Boot scripts for automation
Device tree: Passes hardware description to OS
Interactive shell: For debugging and manual boot

U-Boot boot flow:

U-Boot SPL (if used) → U-Boot proper → Load kernel → Load device tree → Boot kernel

9.4 OpenSBI: Supervisor Binary Interface

OpenSBI is the reference implementation of the RISC-V Supervisor Binary Interface (SBI). It provides a standard interface between M-mode firmware and S-mode operating systems.

OpenSBI Architecture

OpenSBI runs in M-mode and provides runtime services to S-mode software (OS kernels, hypervisors). It acts as a thin firmware layer that abstracts platform-specific details.

Figure 9.3: OpenSBI Architecture

graph TB
    subgraph "S-mode (Supervisor)"
        LINUX[Linux Kernel]
        FREEBSD[FreeBSD Kernel]
        HV[Hypervisor]
    end

    subgraph "M-mode (Machine)"
        OPENSBI[OpenSBI Runtime]
        PLATFORM[Platform Code<br/>UART, Timer, IPI]
    end

    subgraph "Hardware"
        CPU[RISC-V Core]
        CLINT[CLINT<br/>Timer, IPI]
        PLIC[PLIC<br/>Interrupts]
        UART_HW[UART]
    end

    LINUX -->|ecall| OPENSBI
    FREEBSD -->|ecall| OPENSBI
    HV -->|ecall| OPENSBI

    OPENSBI --> PLATFORM
    PLATFORM --> CLINT
    PLATFORM --> PLIC
    PLATFORM --> UART_HW

    style OPENSBI fill:#e1ffe1
    style PLATFORM fill:#fff4e1

OpenSBI provides:

Timer services: Set timer interrupts
IPI (Inter-Processor Interrupt): Send IPIs to other harts
RFENCE: Remote fence operations (TLB flush, I-cache flush)
Hart state management: Start/stop harts
System reset: Reboot or shutdown
Console I/O: Early debug output

Platform Initialization

OpenSBI initializes platform-specific hardware during boot:

// OpenSBI platform initialization (simplified)
int sbi_platform_init(void) {
    // 1. Initialize console (UART)
    uart_init();
    sbi_printf("OpenSBI v1.0\n");

    // 2. Initialize CLINT (timer and IPI)
    clint_init();

    // 3. Initialize PLIC (interrupt controller)
    plic_init();

    // 4. Set up PMP for S-mode
    setup_pmp();

    // 5. Initialize other harts
    for (int i = 1; i < num_harts; i++) {
        sbi_hsm_hart_start(i, smode_entry, 0);
    }

    return 0;
}

SBI Runtime Services

S-mode software invokes SBI services using the ecall instruction. The SBI call convention uses registers to pass function ID and parameters:

a7: SBI extension ID (EID)
a6: SBI function ID (FID)
a0-a5: Parameters
a0: Error code on return (0 = SBI_SUCCESS, negative = error)
a1: Return value of the call (meaningful only when a0 reports success)

Example: Setting a timer

# S-mode code: Set timer for 1 second from now
li      a7, 0x54494D45    # EID_TIME = 0x54494D45 ("TIME")
li      a6, 0             # FID_SET_TIMER = 0
rdtime  a0                # Read current time
li      t0, 10000000      # 1 second at 10 MHz
add     a0, a0, t0        # Target time
ecall                     # Call OpenSBI

OpenSBI handles the ecall:

// OpenSBI trap handler
void sbi_trap_handler(struct sbi_trap_regs *regs) {
    if (regs->cause == CAUSE_SUPERVISOR_ECALL) {
        ulong eid = regs->a7;
        ulong fid = regs->a6;

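        if (eid == SBI_EXT_TIME && fid == SBI_EXT_TIME_SET_TIMER) {
            // Set timer
            uint64_t next_time = regs->a0;
            clint_set_timer(current_hart(), next_time);
            regs->a0 = SBI_SUCCESS;
        } else {
            // Unknown extension or function
            regs->a0 = SBI_ERR_NOT_SUPPORTED;
        }

        // Advance mepc past the ecall: the trap was taken to M-mode, so
        // the return address lives in mepc, not sepc
        regs->mepc += 4;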
    }
}

9.5 Supervisor Mode Handoff

M-mode to S-mode Transition

After OpenSBI initialization, control is transferred to S-mode (the operating system). This transition involves:

Set up S-mode entry point: mepc = OS entry address (mret jumps to mepc, not sepc)
Configure mstatus: Set MPP = 1 (S-mode) and leave interrupts disabled (MPIE = 0)
Delegate interrupts/exceptions: Configure mideleg and medeleg
Pass parameters: Hart ID in a0, device tree address in a1
Execute mret: Return to S-mode

OpenSBI handoff code:

void sbi_boot_hart(ulong next_addr, ulong next_mode, ulong fdt_addr) {
    // 1. Set S-mode entry point (mret jumps to the address in mepc)
    csr_write(CSR_MEPC, next_addr);

    // 2. Configure mstatus for S-mode
    ulong mstatus = csr_read(CSR_MSTATUS);
    mstatus = INSERT_FIELD(mstatus, MSTATUS_MPP, PRV_S);  // Return to S-mode
    mstatus = INSERT_FIELD(mstatus, MSTATUS_MPIE, 0);     // Disable interrupts initially
    mstatus = INSERT_FIELD(mstatus, MSTATUS_SPP, 0);      // S-mode came from U-mode
    csr_write(CSR_MSTATUS, mstatus);

    // 3. Delegate interrupts to S-mode
    csr_write(CSR_MIDELEG, MIP_SSIP | MIP_STIP | MIP_SEIP);

    // 4. Delegate exceptions to S-mode
    csr_write(CSR_MEDELEG, (1 << CAUSE_MISALIGNED_FETCH) |
                           (1 << CAUSE_FETCH_PAGE_FAULT) |
                           (1 << CAUSE_LOAD_PAGE_FAULT) |
                           (1 << CAUSE_STORE_PAGE_FAULT));

    // 5. Pass device tree address in a1
    register ulong a0 asm("a0") = current_hartid();
    register ulong a1 asm("a1") = fdt_addr;

    // 6. Jump to S-mode
    asm volatile("mret" : : "r"(a0), "r"(a1));
}

Device Tree Passing

The device tree (DTB) describes the hardware platform to the OS. OpenSBI passes the DTB address to the kernel in register a1.

Device tree structure (simplified):

/dts-v1/;

/ {
    #address-cells = <2>;
    #size-cells = <2>;
    compatible = "sifive,fu740", "sifive,fu540";
    model = "SiFive HiFive Unmatched";

    cpus {
        #address-cells = <1>;
        #size-cells = <0>;

        cpu@0 {
            device_type = "cpu";
            reg = <0>;
            compatible = "sifive,u74", "riscv";
            riscv,isa = "rv64imafdc";
            mmu-type = "riscv,sv39";
        };
        // More CPUs...
    };

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x2 0x00000000>;  // 8 GB at 0x80000000
    };

    soc {
        uart@10010000 {
            compatible = "sifive,uart0";
            reg = <0x0 0x10010000 0x0 0x1000>;
            interrupts = <4>;
        };
        // More devices...
    };
};

Kernel receives DTB:

// Linux kernel entry point (arch/riscv/kernel/head.S)
_start:
    // a0 = hartid
    // a1 = DTB address

    // Save DTB address
    la      t0, dtb_early_pa
    sd      a1, 0(t0)

    // Continue boot...

9.6 Linux Boot on RISC-V

Linux Kernel Entry Point

The Linux kernel for RISC-V starts in arch/riscv/kernel/head.S with the following state:

Privilege mode: S-mode
MMU: Disabled (satp = 0)
Interrupts: Disabled
a0: Hart ID
a1: Device tree physical address

Early kernel initialization:

# arch/riscv/kernel/head.S (simplified)
_start:
    # Disable interrupts
    csrw    sie, zero
    csrw    sip, zero

    # Save hart ID and DTB address
    mv      s0, a0          # s0 = hartid
    mv      s1, a1          # s1 = DTB address

    # Set up temporary stack
    la      sp, init_thread_union + THREAD_SIZE

    # Clear BSS
    la      t0, __bss_start
    la      t1, __bss_stop
1:  sd      zero, 0(t0)
    addi    t0, t0, 8
    blt     t0, t1, 1b

    # Set up early page tables
    call    setup_vm

    # Enable MMU
    la      t0, early_pg_dir
    srli    t0, t0, 12
    li      t1, SATP_MODE_SV39
    or      t0, t0, t1
    csrw    satp, t0
    sfence.vma

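    # Jump to virtual address space
    # (simplified: la is PC-relative and yields a physical address at
    #  this point; the real kernel adds the virtual-to-physical offset
    #  and routes the jump through stvec)
    la      t0, .Lvirtual
    jr      t0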
.Lvirtual:
    # Now running with MMU enabled
    call    start_kernel

Device Tree Parsing

The kernel parses the device tree to discover hardware:

// Simplified device tree parsing
// The early scans read the flattened tree directly; unflattening needs a
// working allocator, so it comes after memory setup.
void __init setup_arch(char **cmdline_p) {
    struct device_node *node;

    // 1. Parse chosen node (bootargs, initrd) from the flattened tree
    early_init_dt_scan_chosen(cmdline_p);

    // 2. Parse memory nodes
    early_init_dt_scan_memory();

    // 3. Set up memory management
    setup_bootmem();
    paging_init();

    // 4. Unflatten device tree
    unflatten_device_tree();

    // 5. Parse CPU nodes
    for_each_of_cpu_node(node) {
        parse_cpu_node(node);
    }
}

Figure 9.4: Linux Boot Sequence

OpenSBI (M-mode)
    |
    | mret (a0=hartid, a1=DTB address)
    v
_start (S-mode, arch/riscv/kernel/head.S)
    |
    +---> Setup early page tables
    +---> Enable MMU (write satp, sfence.vma)
    |
    v
start_kernel() (init/main.c)
    |
    +---> parse_early_param()
    +---> setup_arch()  ← Parse device tree, setup memory
    +---> mm_init()     ← Memory management init
    +---> sched_init()  ← Scheduler init
    +---> rest_init()
          |
          +---> kernel_init() → Init process (PID 1)

9.7 Comparison with ARM Trusted Firmware

RISC-V’s boot architecture is simpler than ARM’s, but serves similar purposes.

Boot Flow Comparison

ARM Trusted Firmware (TF-A) uses multiple boot stages:

BL1: ROM code (EL3)
BL2: Trusted boot firmware (S-EL1 by default; EL3 in some configurations)
BL31: Runtime firmware (EL3)
BL32: Secure OS (S-EL1, optional)
BL33: Non-secure bootloader (EL2/EL1) → OS

RISC-V OpenSBI is simpler:

ZSBL/FSBL: ROM code and first-stage loader (M-mode)
OpenSBI: Runtime firmware (M-mode)
U-Boot: Bootloader (S-mode)
OS: Linux/FreeBSD (S-mode)

Key differences:

Feature	ARM TF-A	RISC-V OpenSBI
Privilege levels	EL0-EL3 (4 levels)	U/S/M (3 levels)
Secure world	TrustZone (S-EL0/1)	PMP-based isolation
Runtime firmware	BL31 (EL3)	OpenSBI (M-mode)
Hypervisor	EL2 (built-in)	H-extension (optional)
Boot stages	BL1→BL2→BL31→BL33	ZSBL→FSBL→OpenSBI→U-Boot
Complexity	High (many stages)	Lower (fewer stages)

M-mode vs EL3

Both M-mode and EL3 are the highest privilege levels, but differ in scope:

M-mode (RISC-V):

Minimal, focused on essential services
Delegates most exceptions/interrupts to S-mode
Thin runtime layer (OpenSBI ~50 KB)
No built-in secure world (use PMP)

EL3 (ARM):

Rich feature set (TrustZone, secure monitor)
Handles all secure world transitions
Larger runtime (TF-A ~200 KB+)
Built-in secure/non-secure separation

RISC-V’s philosophy: Keep M-mode minimal, push complexity to S-mode. ARM’s philosophy: Rich firmware layer with extensive security features.

🛠️ Hands-on Lab: Lab 9.2 — Survival in the Wilderness (Bare-metal Hello World)

This is the most “pure” programming experience of your career. We’ll strip away all OS protections and talk directly to hardware.

Lab Objectives

Write a Linker Script to define memory layout
Write Assembly startup code to set the Stack Pointer
Use MMIO to directly control UART output
Run on QEMU

Project Structure

Create a folder lab9 with three files:

lab9/
├── link.ld    # Map: defines memory layout
├── entry.S    # Startup key: set SP and jump to main
└── main.c     # Logic brain: UART driver

Code

File 1: link.ld (Linker Script)

Tell the Linker: our program starts at RAM’s beginning (0x80000000).

OUTPUT_ARCH( "riscv" )
ENTRY( _start )

SECTIONS
{
  /* QEMU virt machine RAM starts at 0x80000000 */
  . = 0x80000000;

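  /* Text section: put startup code first */
  .text : {
    *(.text.boot)
    *(.text .text.*)
  }

  /* Read-only data (string literals live here) */
  .rodata : { *(.rodata .rodata.*) }

  /* Data section (RISC-V GCC places small globals in .sdata) */
  .data : { *(.data .data.* .sdata .sdata.*) }

  /* Uninitialized data section (BSS) */
  .bss : { *(.bss .bss.* .sbss .sbss.* COMMON) }

  /* Define Stack Top: reserve 4KB and keep sp 16-byte aligned,
     as the RISC-V calling convention requires */
  . = ALIGN(16);
  . = . + 0x1000;
  _stack_top = .;
}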

File 2: entry.S (Startup Code)

This is the first code the CPU executes.

.section .text.boot
.global _start

_start:
    # 1. Disable interrupts (good practice, usually off at boot anyway)
    csrw mie, zero

    # 2. Set Stack Pointer (SP)
    #    C function calls depend on Stack—jumping into C without SP crashes
    la sp, _stack_top

    # 3. Jump to C main function
    call main

    # 4. If main returns (shouldn't happen), loop forever
loop:
    j loop

File 3: main.c (UART Driver)

QEMU virt machine UART0 is fixed at 0x10000000.

#include <stdint.h>

// QEMU virt UART base address
#define UART0_BASE 0x10000000

// Define a pointer to this address
// volatile is crucial! Tells compiler not to optimize away reads/writes
volatile uint8_t *uart0 = (uint8_t *)(UART0_BASE);

void put_char(char c) {
    // Write character directly to memory address
    // UART controller transmits it
    *uart0 = c;
}

void print_str(const char *s) {
    while (*s) {
        put_char(*s++);
    }
}

int main(void) {
    print_str("Hello from Bare-metal RISC-V!\n");

    // Loop forever (no OS to return to)
    while (1) {}
    return 0;
}

Compile and Run

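# Compile
# -mcmodel=medany: the default medlow code model on RV64 cannot reach
# symbols at 0x80000000, and the link fails with "relocation truncated to fit"
riscv64-unknown-elf-gcc -mcmodel=medany -nostdlib -nostartfiles -T link.ld \
    -o bare_hello entry.S main.c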

# Run on QEMU (M-mode, no firmware)
qemu-system-riscv64 -machine virt -nographic -bios none -kernel bare_hello

Expected Output:

Hello from Bare-metal RISC-V!

(Press Ctrl+A, then X to exit QEMU)

What You Just Did

You’ve written a complete bare-metal program:

Linker Script: Defined where code and stack live in memory
Startup Code: Set SP and jumped to C—the essential bootstrap
UART MMIO: Talked directly to hardware using memory-mapped I/O

danieRTOS Reference: The danieRTOS entry point follows the same pattern—entry.S sets up SP and calls kernel_main().

Deep Dive: QEMU Memory Map

Why 0x10000000 and 0x80000000? This is the QEMU virt machine’s Memory Map:

Base Address	Purpose
`0x0000_1000`	BootROM (Reset Vector)
`0x1000_0000`	UART (we write characters here)
`0x8000_0000`	DRAM (our program runs here)

This is the magic of MMIO (Memory Mapped I/O): to the CPU, it’s just writing to memory, but to the system, it’s controlling peripheral devices.

Extended Challenge

💭 Try making the program print “Booting…” then do a simple count loop before printing “Done!”

This will give you a taste of how “primitive” the world is without a sleep() function.

⚠️ Common Pitfalls

Pitfall 1: Forgetting the Stack Pointer

Error Scenario: Jumping directly from Assembly to C’s main() without setting sp.

Consequence: Program crashes as soon as it enters a C function, because C needs the Stack for local variables and return addresses.

# ❌ Wrong: Forgot to set SP
_start:
    call main    # Jump into C with garbage sp—certain death

# ✅ Correct: Set SP first
_start:
    la sp, _stack_top
    call main

Pitfall 2: Linker Script Section Order

Error Scenario: Not placing .text.boot at the beginning.

Consequence: CPU starts executing at 0x80000000, but that’s not _start—it’s some other function’s code, causing unpredictable behavior.

/* ❌ Wrong: .text.boot not first */
.text : {
    *(.text)       /* Other code comes in first */
    *(.text.boot)  /* _start pushed to later */
}

/* ✅ Correct: Ensure .text.boot is first */
.text : {
    *(.text.boot)  /* Startup code goes first */
    *(.text)
}

Pitfall 3: Hardcoding UART Address

Error Scenario: Hardcoding 0x10000000 in your program, then running it on a different board.

Consequence: Different hardware has different memory maps—UART address may be completely different.

// ❌ Problem: Hardcoded address, not portable
#define UART0_BASE 0x10000000

// ✅ Better: Use Device Tree or header definitions
// Or use SBI (next chapter)

💡 Tip: This is why we need SBI (Supervisor Binary Interface)—it provides a standardized interface so you don’t have to worry about hardware differences. Next chapter, we’ll learn how to output characters through SBI instead of directly manipulating UART.

Summary

The RISC-V boot process is a carefully orchestrated sequence:

Reset: Hart starts in M-mode at reset vector
ZSBL/FSBL: Initialize DRAM, load bootloader
OpenSBI: Provide SBI runtime services
U-Boot: Load kernel and device tree
Linux: Parse device tree, initialize hardware, start init

Key takeaways:

✅ M-mode firmware is minimal and platform-specific
✅ OpenSBI provides standard SBI interface
✅ Device tree describes hardware to OS
✅ PMP protects M-mode firmware from S-mode
✅ Simpler than ARM’s multi-stage boot flow

In the next chapter, we’ll dive deeper into M-mode firmware design, the SBI call interface, and the Hypervisor extension.

Chapter 10: Machine Mode, SBI & Supervisor Mode

Part VI — Booting & System Software

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand SBI Architecture: Grasp the layered design and value of the Supervisor Binary Interface
Master SBI Calling Convention: Know the roles of a7 (EID), a6 (FID), a0-a5 (Args)
Implement SBI Calls: Make ecall requests to OpenSBI services
Understand Exception Delegation: Know how M-mode delegates traps to S-mode
Distinguish M-mode and S-mode Responsibilities: Understand why RISC-V encourages thin M-mode

💡 Scenario: Please Get the Manager

Scene: Junior is pulling his hair out in front of the screen because the UART driver from the last chapter doesn’t work on the new board.

Junior: “Senior, I’m going crazy. Remember the UART driver we wrote last chapter? I just switched to a different board to try it out, and after digging through the datasheet, I found this board’s UART address is 0x54000000, not 0x10000000. Do I have to modify the code every time I switch boards?”

Senior: “That’s exactly why we need SBI (Supervisor Binary Interface). What you’re doing now is like going to a restaurant and rushing into the kitchen to cook your own food. Change restaurants (hardware), the kitchen layout is different, and you don’t know how to cook anymore.”

Junior: “So what should I do?”

Senior: “You need to learn to ‘call the manager.’ In RISC-V, M-mode (OpenSBI) is that manager.

You (S-mode Kernel) just need to sit at your table and use the standard format (SBI Call) to shout: ‘Manager, please print a character for me!’

The manager receives your request, looks up this restaurant’s kitchen layout, and prints the character for you. This way, no matter which restaurant you go to, as long as you know how to call the manager, you’re fine.”

Junior: “Sounds much easier! How do I call?”

Senior: “Use the ecall instruction. But before calling, you need to write your request on specific ‘sticky notes’ (Registers):

Register	Purpose	Analogy
`a7`	Extension ID (EID)	Which department?
`a6`	Function ID (FID)	What service?
`a0-a5`	Arguments	Service parameters
`a0, a1`	Return Values	Manager’s reply

Come on, let’s try it.”

Machine mode is RISC-V’s highest privilege level, with unrestricted access to all hardware resources. But with great power comes great responsibility—M-mode firmware must be minimal, robust, and provide essential services to supervisor mode software. This chapter explores how to design M-mode firmware, implement SBI services, and support advanced features like virtualization and security.

Unlike monolithic firmware architectures, RISC-V encourages a thin M-mode layer that delegates most functionality to S-mode. This design philosophy keeps M-mode simple and portable while allowing rich OS features in S-mode. We’ll examine M-mode firmware design patterns, the Supervisor Binary Interface (SBI) specification, hypervisor support through the H extension, and security features like Physical Memory Protection (PMP) and the WorldGuard extension.

10.1 Machine Mode Firmware Design

Minimal M-mode Firmware

The RISC-V philosophy is to keep M-mode firmware as small as possible. A minimal M-mode firmware might be only a few kilobytes, providing just enough functionality to boot S-mode software.

Minimal M-mode responsibilities:

Early hardware initialization: DRAM, clocks, reset
Platform-specific setup: Configure peripherals
SBI runtime services: Timer, IPI, RFENCE, console
Exception delegation: Pass most traps to S-mode
Boot S-mode software: Set up and jump to OS

Example minimal M-mode firmware structure:

// Minimal M-mode firmware
void m_mode_main(unsigned long hartid, void *fdt) {
    // 1. Initialize platform
    platform_init();
    
    // 2. Set up trap handler
    write_csr(mtvec, (uintptr_t)&m_trap_entry);
    
    // 3. Configure PMP
    setup_pmp();
    
    // 4. Delegate exceptions and interrupts
    write_csr(medeleg, 0xb1ff);  // Delegate most exceptions
    write_csr(mideleg, 0x0222);  // Delegate S-mode interrupts
    
    // 5. Start other harts (if multi-core)
    if (hartid == 0) {
        for (int i = 1; i < num_harts; i++) {
            start_hart(i);
        }
    }
    
    // 6. Boot S-mode payload
    boot_next_stage(hartid, fdt);
}
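To see what the two delegation masks in the listing above actually select, the following host-side sketch decodes them bit by bit. It is not part of the firmware: the helper names `exc_is_delegated`, `irq_is_delegated`, and `exc_name` are made up for illustration, while the cause numbers follow the privileged specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exception cause names, indexed by cause number (NULL = reserved). */
static const char *const exc_names[16] = {
    "instruction address misaligned", "instruction access fault",
    "illegal instruction",            "breakpoint",
    "load address misaligned",        "load access fault",
    "store/AMO address misaligned",   "store/AMO access fault",
    "ecall from U-mode",              "ecall from S-mode",
    NULL,                             "ecall from M-mode",
    "instruction page fault",         "load page fault",
    NULL,                             "store/AMO page fault",
};

/* Name of an exception cause, or NULL if the number is reserved/unknown. */
const char *exc_name(unsigned cause) {
    return cause < 16 ? exc_names[cause] : NULL;
}

/* 1 if medeleg hands the given exception cause to S-mode, else 0. */
int exc_is_delegated(uint64_t medeleg, unsigned cause) {
    assert(cause < 64);
    return (int)((medeleg >> cause) & 1u);
}

/* 1 if mideleg hands the given interrupt number to S-mode, else 0. */
int irq_is_delegated(uint64_t mideleg, unsigned irq) {
    assert(irq < 64);
    return (int)((mideleg >> irq) & 1u);
}
```

With `0xb1ff`, causes 0 through 8 plus the three page faults (12, 13, 15) go to S-mode, while cause 9 (ecall from S-mode) stays in M-mode so that SBI calls still reach the firmware. `0x0222` sets bits 1, 5, and 9: the supervisor software, timer, and external interrupts.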

Platform-Specific Initialization

Each RISC-V platform has unique hardware that must be initialized. M-mode firmware abstracts these details through a platform layer.

Platform initialization example:

// Platform-specific initialization
struct platform_ops {
    int (*early_init)(void);
    int (*final_init)(void);
    void (*console_putc)(char c);
    int (*console_getc)(void);
    void (*timer_init)(void);
    void (*ipi_send)(unsigned long hartid);
    void (*system_reset)(void);
};

// SiFive FU740 platform
static struct platform_ops fu740_ops = {
    .early_init = fu740_early_init,
    .final_init = fu740_final_init,
    .console_putc = uart_putc,
    .console_getc = uart_getc,
    .timer_init = clint_timer_init,
    .ipi_send = clint_ipi_send,
    .system_reset = fu740_system_reset,
};

int platform_init(void) {
    struct platform_ops *ops = &fu740_ops;
    
    // Early initialization
    if (ops->early_init)
        ops->early_init();
    
    // Initialize console
    if (ops->console_putc)
        sbi_console_init(ops->console_putc, ops->console_getc);
    
    // Initialize timer
    if (ops->timer_init)
        ops->timer_init();
    
    // Final initialization
    if (ops->final_init)
        ops->final_init();
    
    return 0;
}

Runtime Services

M-mode firmware provides runtime services to S-mode through the SBI interface. These services remain active after S-mode boots.

Core runtime services:

Timer: Set timer interrupts (sbi_set_timer)
IPI: Send inter-processor interrupts (sbi_send_ipi)
RFENCE: Remote fence operations (sbi_remote_fence_i, sbi_remote_sfence_vma)
Console: Debug output (sbi_console_putchar, sbi_console_getchar)
Hart management: Start/stop harts (sbi_hart_start, sbi_hart_stop)
System reset: Reboot/shutdown (sbi_system_reset)

10.2 SBI Call Interface

SBI Call Mechanism

S-mode software invokes SBI services using the ecall instruction. This traps to M-mode, which handles the request and returns to S-mode.

Figure 10.1: SBI Call Flow

sequenceDiagram
    participant S as S-mode<br/>(Linux)
    participant M as M-mode<br/>(OpenSBI)
    participant HW as Hardware<br/>(CLINT/PLIC)
    
    S->>S: Prepare SBI call<br/>(a7=EID, a6=FID, a0-a5=params)
    S->>M: ecall
    Note over M: Trap to M-mode<br/>mcause = 9 (ecall from S-mode)
    M->>M: Decode EID/FID
    M->>HW: Perform operation<br/>(e.g., set timer)
    HW-->>M: Operation complete
    M->>M: Set return value (a0)
    M->>S: mret
    Note over S: Resume S-mode<br/>Check a0 for result

SBI Calling Convention

SBI calls use a standard register convention:

Input registers:

a7: Extension ID (EID) — Identifies the SBI extension
a6: Function ID (FID) — Identifies the function within the extension
a0-a5: Function parameters (up to 6 parameters)

Output registers:

a0: Error code (0 = success, negative = error)
a1: Return value (optional, function-specific)

Preserved registers: All registers except a0-a1 are preserved across SBI calls.
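The convention above can be exercised without RISC-V hardware by modelling the saved argument registers as a struct. The sketch below is a hypothetical firmware-side dispatcher that implements only two Base extension functions; `struct sbi_regs` and `sbi_dispatch` are illustrative names, not OpenSBI's API, and the reported spec version (1.0) is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

#define SBI_SUCCESS               0
#define SBI_ERR_NOT_SUPPORTED    -2

#define SBI_EXT_BASE              0x10
#define SBI_BASE_GET_SPEC_VERSION 0
#define SBI_BASE_PROBE_EXTENSION  3

/* S-mode argument registers as saved by the M-mode trap entry code. */
struct sbi_regs {
    long a0, a1, a2, a3, a4, a5, a6, a7;
};

/* Decode EID (a7) and FID (a6), run the service, write error/value back. */
void sbi_dispatch(struct sbi_regs *r) {
    long error = SBI_ERR_NOT_SUPPORTED;
    long value = 0;

    assert(r != NULL);
    if (r->a7 == SBI_EXT_BASE) {
        switch (r->a6) {
        case SBI_BASE_GET_SPEC_VERSION:
            error = SBI_SUCCESS;
            value = 1L << 24;                  /* major 1, minor 0 */
            break;
        case SBI_BASE_PROBE_EXTENSION:
            error = SBI_SUCCESS;
            value = (r->a0 == SBI_EXT_BASE);   /* nonzero = available */
            break;
        default:
            break;                             /* unknown FID */
        }
    }

    /* Only a0 and a1 change; a2-a7 go back to S-mode untouched. */
    r->a0 = error;
    r->a1 = value;
}
```

Note that the useful result of both functions comes back in `a1`; `a0` only says whether the call itself succeeded.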

Example: Send IPI

// S-mode code: Send IPI to hart 1
static inline long sbi_send_ipi(unsigned long hart_mask,
                                unsigned long hart_mask_base) {
    register unsigned long a0 asm("a0") = hart_mask;
    register unsigned long a1 asm("a1") = hart_mask_base;
    register unsigned long a6 asm("a6") = SBI_EXT_IPI_SEND_IPI;
    register unsigned long a7 asm("a7") = SBI_EXT_IPI;
    
    asm volatile("ecall"
                 : "+r"(a0), "+r"(a1)
                 : "r"(a6), "r"(a7)
                 : "memory");
    
    return a0;  // Return error code
}

// Usage
sbi_send_ipi(1 << 1, 0);  // Send IPI to hart 1

SBI Error Codes

SBI functions return standard error codes:

#define SBI_SUCCESS                0
#define SBI_ERR_FAILED            -1
#define SBI_ERR_NOT_SUPPORTED     -2
#define SBI_ERR_INVALID_PARAM     -3
#define SBI_ERR_DENIED            -4
#define SBI_ERR_INVALID_ADDRESS   -5
#define SBI_ERR_ALREADY_AVAILABLE -6
#define SBI_ERR_ALREADY_STARTED   -7
#define SBI_ERR_ALREADY_STOPPED   -8

Error handling:

long ret = sbi_send_ipi(hart_mask, 0);
if (ret < 0) {
    switch (ret) {
    case SBI_ERR_INVALID_PARAM:
        pr_err("Invalid hart mask\n");
        break;
    case SBI_ERR_FAILED:
        pr_err("IPI send failed\n");
        break;
    default:
        pr_err("Unknown error: %ld\n", ret);
    }
}
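Kernels often translate SBI error codes into their own error numbers at the boundary so that the rest of the code never sees SBI-specific values. The sketch below shows one possible mapping onto POSIX `errno` values; the function name `sbi_err_to_errno` and the particular choices (for example `-EIO` for a generic failure) are assumptions of this example, not something the SBI specification prescribes.

```c
#include <assert.h>
#include <errno.h>

#define SBI_SUCCESS                0
#define SBI_ERR_FAILED            -1
#define SBI_ERR_NOT_SUPPORTED     -2
#define SBI_ERR_INVALID_PARAM     -3
#define SBI_ERR_DENIED            -4
#define SBI_ERR_INVALID_ADDRESS   -5
#define SBI_ERR_ALREADY_AVAILABLE -6
#define SBI_ERR_ALREADY_STARTED   -7
#define SBI_ERR_ALREADY_STOPPED   -8

/* Map an SBI error code (a0) to 0 or a negative errno value. */
int sbi_err_to_errno(long sbi_err) {
    assert(sbi_err <= 0);   /* SBI error codes are zero or negative */

    switch (sbi_err) {
    case SBI_SUCCESS:
        return 0;
    case SBI_ERR_NOT_SUPPORTED:
        return -ENOTSUP;
    case SBI_ERR_INVALID_PARAM:
        return -EINVAL;
    case SBI_ERR_DENIED:
        return -EPERM;
    case SBI_ERR_INVALID_ADDRESS:
        return -EFAULT;
    case SBI_ERR_ALREADY_AVAILABLE:
    case SBI_ERR_ALREADY_STARTED:
    case SBI_ERR_ALREADY_STOPPED:
        return -EALREADY;
    case SBI_ERR_FAILED:
    default:
        return -EIO;        /* also catches codes this sketch does not know */
    }
}
```

The `default` branch matters: newer SBI versions may add error codes, and an unknown negative value should still be reported as a failure rather than silently treated as success.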

10.3 SBI Standard Extensions

SBI defines multiple extensions, each providing related functionality. Extensions are identified by EID (Extension ID).
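The EIDs used in this section look arbitrary, but apart from the Base extension (0x10) and the legacy range they are simply the extension's short name written as ASCII bytes: 0x54494D45 is "TIME", 0x735049 is "sPI". A small host-side helper makes that visible; `sbi_eid_to_name` is a name invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode an ASCII-encoded SBI extension ID into its mnemonic.
 * out must have room for 5 bytes (4 characters plus the terminator).
 * Returns the name length, or 0 if the EID is not printable ASCII
 * (for example the Base extension 0x10 or a legacy EID).
 */
int sbi_eid_to_name(uint32_t eid, char out[5]) {
    int len = 0;

    assert(out != NULL);
    for (int shift = 24; shift >= 0; shift -= 8) {
        unsigned char c = (unsigned char)(eid >> shift);

        if (c == 0 && len == 0)
            continue;                /* skip leading zero bytes */
        if (c < 0x20 || c > 0x7e) {
            out[0] = '\0';
            return 0;                /* not an ASCII-encoded EID */
        }
        out[len++] = (char)c;
    }
    out[len] = '\0';
    return len;
}
```

Run against the EIDs in this chapter it yields TIME, sPI, RFNC, HSM, SRST, PMU, and DBCN, which is a handy sanity check when a constant has been typed by hand.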

Timer Extension (EID = 0x54494D45)

The Timer extension provides timer interrupt services.

Function: sbi_set_timer (FID = 0)

Sets the timer to fire at a specific time value.

// Set timer to fire in 1 second
uint64_t current_time = rdtime();
uint64_t next_time = current_time + 10000000;  // 10 MHz clock

register unsigned long a0 asm("a0") = next_time;
register unsigned long a6 asm("a6") = 0;  // FID_SET_TIMER
register unsigned long a7 asm("a7") = 0x54494D45;  // EID_TIME
asm volatile("ecall" : "+r"(a0) : "r"(a6), "r"(a7) : "a1", "memory");  // a1 is overwritten with the SBI return value
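The constant 10000000 in the example above is only correct for a 10 MHz timebase. On real platforms the tick rate is usually read from the `timebase-frequency` property of the device tree, so the deadline is better computed from it. The following is a minimal sketch of that arithmetic; `timer_deadline` is a hypothetical helper, and the caller is assumed to supply the current `time` value and the frequency.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Absolute timer deadline, in ticks, for a delay given in milliseconds.
 * now_ticks   : current value of the time counter
 * timebase_hz : ticks per second
 */
uint64_t timer_deadline(uint64_t now_ticks, uint64_t timebase_hz,
                        uint64_t delay_ms) {
    assert(timebase_hz > 0);

    /* Split into whole seconds and a remainder so that the
     * multiplication cannot overflow for long delays. */
    uint64_t whole_s = delay_ms / 1000;
    uint64_t rem_ms  = delay_ms % 1000;

    return now_ticks + whole_s * timebase_hz + (rem_ms * timebase_hz) / 1000;
}
```

For a 10 MHz timebase, `timer_deadline(now, 10000000, 1000)` reproduces the `current_time + 10000000` value used above.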

M-mode implementation:

void sbi_set_timer(uint64_t stime_value) {
    unsigned long hartid = current_hartid();

    // Write to CLINT mtimecmp register
    volatile uint64_t *mtimecmp = (uint64_t *)(CLINT_BASE + 0x4000 + hartid * 8);
    *mtimecmp = stime_value;

    // Clear pending timer interrupt
    csr_clear(CSR_MIP, MIP_STIP);
}

IPI Extension (EID = 0x735049)

The IPI extension sends inter-processor interrupts.

Function: sbi_send_ipi (FID = 0)

Sends IPI to a set of harts specified by a hart mask.

// Send IPI to harts 1, 2, 3
unsigned long hart_mask = 0b1110;  // Bits 1, 2, 3 set
sbi_send_ipi(hart_mask, 0);

M-mode implementation:

int sbi_send_ipi(unsigned long hart_mask, unsigned long hart_mask_base) {
    for (int i = 0; i < 64; i++) {
        if (hart_mask & (1UL << i)) {
            unsigned long hartid = hart_mask_base + i;

            // Write to CLINT MSIP register
            volatile uint32_t *msip = (uint32_t *)(CLINT_BASE + hartid * 4);
            *msip = 1;
        }
    }
    return SBI_SUCCESS;
}
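The `hart_mask_base` parameter exists so that harts with IDs beyond the width of one register can still be addressed: bit i of `hart_mask` refers to hart `hart_mask_base + i`. The sketch below builds such a pair from a list of hart IDs on the caller's side; `struct hart_mask` and `build_hart_mask` are illustrative names, not part of the SBI specification.

```c
#include <assert.h>
#include <stddef.h>

/* The (hart_mask, hart_mask_base) pair passed in a0 and a1. */
struct hart_mask {
    unsigned long mask;
    unsigned long base;
};

/*
 * Build a mask/base pair covering every hart ID in hartids[0..n-1].
 * Returns 0 on success, -1 if the list is empty or the IDs are spread
 * too far apart to fit into a single mask word.
 */
int build_hart_mask(const unsigned long *hartids, size_t n,
                    struct hart_mask *out) {
    const unsigned long bits = sizeof(unsigned long) * 8;

    assert(hartids != NULL && out != NULL);
    if (n == 0)
        return -1;

    /* Use the smallest hart ID as the base. */
    unsigned long base = hartids[0];
    for (size_t i = 1; i < n; i++)
        if (hartids[i] < base)
            base = hartids[i];

    unsigned long mask = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long off = hartids[i] - base;
        if (off >= bits)
            return -1;              /* needs a second SBI call */
        mask |= 1UL << off;
    }

    out->mask = mask;
    out->base = base;
    return 0;
}
```

For harts 1, 2, and 3 this produces base 1 with mask 0b111, which addresses the same harts as the base 0, mask 0b1110 pair used earlier.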

RFENCE Extension (EID = 0x52464E43)

The RFENCE extension performs remote fence operations (TLB flush, I-cache flush) on other harts.

Functions:

sbi_remote_fence_i (FID = 0): Flush instruction cache
sbi_remote_sfence_vma (FID = 1): Flush TLB entries
sbi_remote_sfence_vma_asid (FID = 2): Flush TLB entries for specific ASID

Example: Remote TLB flush

// Flush TLB on harts 1-3 for address range 0x80000000-0x80001000
unsigned long hart_mask = 0b1110;
unsigned long start_addr = 0x80000000;
unsigned long size = 0x1000;

register unsigned long a0 asm("a0") = hart_mask;
register unsigned long a1 asm("a1") = 0;  // hart_mask_base
register unsigned long a2 asm("a2") = start_addr;
register unsigned long a3 asm("a3") = size;
register unsigned long a6 asm("a6") = 1;  // FID_REMOTE_SFENCE_VMA
register unsigned long a7 asm("a7") = 0x52464E43;  // EID_RFENCE
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a2), "r"(a3), "r"(a6), "r"(a7) : "memory");  // a1 is both an input and the SBI return value

M-mode implementation:

int sbi_remote_sfence_vma(unsigned long hart_mask, unsigned long hart_mask_base,
                          unsigned long start_addr, unsigned long size) {
    // Send IPI to target harts
    for (int i = 0; i < 64; i++) {
        if (hart_mask & (1UL << i)) {
            unsigned long hartid = hart_mask_base + i;

            // Store fence parameters for target hart
            remote_fence_info[hartid].start = start_addr;
            remote_fence_info[hartid].size = size;
            remote_fence_info[hartid].type = FENCE_SFENCE_VMA;

            // Send IPI
            clint_send_ipi(hartid);
        }
    }

    // Wait for completion (optional, depends on implementation)
    return SBI_SUCCESS;
}

// IPI handler on target hart
void handle_remote_fence_ipi(void) {
    struct remote_fence_info *info = &remote_fence_info[current_hartid()];

    if (info->type == FENCE_SFENCE_VMA) {
        // Perform sfence.vma
        if (info->size == 0) {
            asm volatile("sfence.vma" ::: "memory");
        } else {
            // Flush specific range (implementation-specific)
            for (unsigned long addr = info->start;
                 addr < info->start + info->size;
                 addr += PAGE_SIZE) {
                asm volatile("sfence.vma %0" :: "r"(addr) : "memory");
            }
        }
    }
}

HSM Extension (EID = 0x48534D)

The Hart State Management (HSM) extension controls hart lifecycle.

Functions:

sbi_hart_start (FID = 0): Start a hart
sbi_hart_stop (FID = 1): Stop current hart
sbi_hart_get_status (FID = 2): Get hart status

Example: Start a hart

// Start hart 1 at address 0x80200000 with argument 0x12345678
unsigned long hartid = 1;
unsigned long start_addr = 0x80200000;
unsigned long opaque = 0x12345678;

register unsigned long a0 asm("a0") = hartid;
register unsigned long a1 asm("a1") = start_addr;
register unsigned long a2 asm("a2") = opaque;
register unsigned long a6 asm("a6") = 0;  // FID_HART_START
register unsigned long a7 asm("a7") = 0x48534D;  // EID_HSM
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a2), "r"(a6), "r"(a7) : "memory");  // a1 is both an input and the SBI return value

M-mode implementation:

int sbi_hart_start(unsigned long hartid, unsigned long start_addr, unsigned long opaque) {
    if (hartid >= num_harts)
        return SBI_ERR_INVALID_PARAM;

    if (hart_state[hartid] != HART_STOPPED)
        return SBI_ERR_ALREADY_AVAILABLE;  // SBI's code for "hart is already started"

    // Set up hart entry point
    hart_entry_addr[hartid] = start_addr;
    hart_entry_arg[hartid] = opaque;

    // Wake up hart (platform-specific)
    platform_hart_start(hartid);

    hart_state[hartid] = HART_STARTED;
    return SBI_SUCCESS;
}
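The implementation above flips a hart straight from stopped to started, but the HSM extension also defines intermediate states, because the target hart needs time to act on a request. The following host-side sketch models the four basic states and their transitions. The numeric state values match what `sbi_hart_get_status` reports; everything else (function names, the four-hart platform, returning `SBI_ERR_FAILED` for requests made in a pending state, and passing the calling hart's ID to the stop request) is a simplifying assumption.

```c
#include <assert.h>

#define SBI_SUCCESS                0
#define SBI_ERR_FAILED            -1
#define SBI_ERR_INVALID_PARAM     -3
#define SBI_ERR_ALREADY_AVAILABLE -6

/* Hart states with the values sbi_hart_get_status returns. */
enum hsm_state {
    HSM_STARTED       = 0,
    HSM_STOPPED       = 1,
    HSM_START_PENDING = 2,
    HSM_STOP_PENDING  = 3,
};

#define NUM_HARTS 4   /* hypothetical platform size */

/* Hart 0 boots the system; the others wait in STOPPED. */
static enum hsm_state hsm[NUM_HARTS] = {
    HSM_STARTED, HSM_STOPPED, HSM_STOPPED, HSM_STOPPED,
};

long hsm_get_status(unsigned long hartid) {
    if (hartid >= NUM_HARTS)
        return SBI_ERR_INVALID_PARAM;
    return hsm[hartid];
}

/* Request side of sbi_hart_start: STOPPED -> START_PENDING. */
long hsm_request_start(unsigned long hartid) {
    if (hartid >= NUM_HARTS)
        return SBI_ERR_INVALID_PARAM;
    if (hsm[hartid] == HSM_STARTED)
        return SBI_ERR_ALREADY_AVAILABLE;
    if (hsm[hartid] != HSM_STOPPED)
        return SBI_ERR_FAILED;          /* a transition is in flight */
    hsm[hartid] = HSM_START_PENDING;
    return SBI_SUCCESS;
}

/* Target hart reached its entry point: START_PENDING -> STARTED. */
void hsm_mark_started(unsigned long hartid) {
    assert(hartid < NUM_HARTS && hsm[hartid] == HSM_START_PENDING);
    hsm[hartid] = HSM_STARTED;
}

/* Request side of sbi_hart_stop, called by the hart itself. */
long hsm_request_stop(unsigned long hartid) {
    if (hartid >= NUM_HARTS)
        return SBI_ERR_INVALID_PARAM;
    if (hsm[hartid] != HSM_STARTED)
        return SBI_ERR_FAILED;
    hsm[hartid] = HSM_STOP_PENDING;
    return SBI_SUCCESS;
}

/* Hart is parked in firmware: STOP_PENDING -> STOPPED. */
void hsm_mark_stopped(unsigned long hartid) {
    assert(hartid < NUM_HARTS && hsm[hartid] == HSM_STOP_PENDING);
    hsm[hartid] = HSM_STOPPED;
}
```

A caller that needs to know when the target hart is really running polls the status until it leaves `START_PENDING`, rather than assuming the start call completes synchronously.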

System Reset Extension (EID = 0x53525354)

The System Reset extension provides system-wide reset and shutdown.

Function: sbi_system_reset (FID = 0)

// Reboot the system
#define SBI_RESET_TYPE_SHUTDOWN  0
#define SBI_RESET_TYPE_COLD_REBOOT  1
#define SBI_RESET_TYPE_WARM_REBOOT  2

register unsigned long a0 asm("a0") = SBI_RESET_TYPE_COLD_REBOOT;
register unsigned long a1 asm("a1") = 0;  // Reset reason
register unsigned long a6 asm("a6") = 0;  // FID_SYSTEM_RESET
register unsigned long a7 asm("a7") = 0x53525354;  // EID_SRST
asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");  // a1 is both an input and the SBI return value
// This call does not return

Figure 10.2: SBI Extensions Overview

graph TB
    subgraph "SBI Extensions"
        BASE[Base Extension<br/>0x10<br/>Version, Features]
        TIME[Timer Extension<br/>0x54494D45<br/>Set Timer]
        IPI[IPI Extension<br/>0x735049<br/>Send IPI]
        RFENCE[RFENCE Extension<br/>0x52464E43<br/>Remote Fence]
        HSM[HSM Extension<br/>0x48534D<br/>Hart Management]
        SRST[System Reset<br/>0x53525354<br/>Reset/Shutdown]
        PMU[PMU Extension<br/>0x504D55<br/>Performance Counters]
    end

    SMODE[S-mode Software<br/>Linux, FreeBSD] -->|ecall| BASE
    SMODE -->|ecall| TIME
    SMODE -->|ecall| IPI
    SMODE -->|ecall| RFENCE
    SMODE -->|ecall| HSM
    SMODE -->|ecall| SRST
    SMODE -->|ecall| PMU

    style SMODE fill:#e1f5ff
    style TIME fill:#ccffcc
    style IPI fill:#ffffcc
    style RFENCE fill:#ffcccc

10.4 Console and Debug Output

Console I/O via SBI

SBI provides simple console I/O for early debugging before full UART drivers are available.

Legacy console functions (deprecated but widely used):

sbi_console_putchar (EID = 0x01): Output one character
sbi_console_getchar (EID = 0x02): Input one character

Example: Early printk

void sbi_putchar(char c) {
    register unsigned long a0 asm("a0") = c;
    register unsigned long a7 asm("a7") = 0x01;  // Legacy console putchar
    asm volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
}

void early_printk(const char *str) {
    while (*str) {
        if (*str == '\n')
            sbi_putchar('\r');
        sbi_putchar(*str++);
    }
}

// Usage
early_printk("Hello from S-mode!\n");

M-mode implementation:

void sbi_console_putchar(int ch) {
    // Platform-specific UART output
    uart_putc(ch);
}

int sbi_console_getchar(void) {
    // Platform-specific UART input
    return uart_getc();
}

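Modern approach: Use the Debug Console Extension (DBCN, EID = 0x4442434E). It adds buffer-based write and read calls alongside a single-byte write, so a whole string can be sent with one ecall instead of one ecall per character.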

10.5 Hypervisor Extension (H Extension)

Virtualization Support in RISC-V

The Hypervisor extension (H) adds virtualization support to RISC-V, enabling a hypervisor to run multiple guest operating systems. Unlike ARM’s built-in EL2, RISC-V virtualization is an optional extension.

Key features:

VS-mode and VU-mode: Virtualized supervisor and user modes
Two-stage address translation: Guest virtual → Guest physical → Host physical
Virtual interrupts: Virtualized interrupt delivery
Hypervisor CSRs: Control virtualization features

Privilege modes with H extension:

M-mode: Machine mode (firmware)
HS-mode: Hypervisor-extended supervisor mode (hypervisor)
VS-mode: Virtual supervisor mode (guest OS)
U-mode: User mode (applications)
VU-mode: Virtual user mode (guest applications)

Figure 10.3: RISC-V Virtualization Architecture

graph TB
    subgraph "M-mode"
        OPENSBI[OpenSBI<br/>Firmware]
    end

    subgraph "HS-mode (Hypervisor)"
        KVM[KVM/Xen<br/>Hypervisor]
    end

    subgraph "VS-mode (Guest OS)"
        GUEST1[Guest Linux 1]
        GUEST2[Guest Linux 2]
    end

    subgraph "VU-mode (Guest Apps)"
        APP1[App 1]
        APP2[App 2]
        APP3[App 3]
    end

    OPENSBI -->|SBI calls| KVM
    KVM -->|VM Entry/Exit| GUEST1
    KVM -->|VM Entry/Exit| GUEST2
    GUEST1 --> APP1
    GUEST1 --> APP2
    GUEST2 --> APP3

    style OPENSBI fill:#ffe1e1
    style KVM fill:#fff4e1
    style GUEST1 fill:#e1ffe1
    style GUEST2 fill:#e1ffe1
    style APP1 fill:#e1f5ff
    style APP2 fill:#e1f5ff
    style APP3 fill:#e1f5ff

Two-Stage Address Translation

With the H extension, address translation happens in two stages:

First stage (VS-stage): Guest virtual address (GVA) → Guest physical address (GPA)
- Controlled by guest OS (vsatp CSR)
- Guest thinks it’s managing physical memory
Second stage (G-stage): Guest physical address (GPA) → Host physical address (HPA)
- Controlled by hypervisor (hgatp CSR)
- Translates guest “physical” addresses to real physical addresses

Figure 10.4: Two-Stage Address Translation

graph LR
    GVA[Guest Virtual<br/>Address<br/>GVA] -->|VS-stage<br/>vsatp| GPA[Guest Physical<br/>Address<br/>GPA]
    GPA -->|G-stage<br/>hgatp| HPA[Host Physical<br/>Address<br/>HPA]
    HPA --> MEM[Physical<br/>Memory]

    style GVA fill:#e1f5ff
    style GPA fill:#fff4e1
    style HPA fill:#e1ffe1
    style MEM fill:#ffcccc

Example:

Guest OS maps virtual address 0x1000 to guest physical address 0x80001000 (using vsatp)
Hypervisor maps guest physical 0x80001000 to host physical 0x90001000 (using hgatp)
Final access: 0x1000 (GVA) → 0x80001000 (GPA) → 0x90001000 (HPA)
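The worked example above can be reproduced with a toy model in which each stage is a flat list of page mappings instead of a real multi-level page table. All names here (`struct page_map`, `translate_two_stage`, the `XLATE_*` codes) are hypothetical, and the model ignores permissions as well as the fact that real hardware also translates the guest's own page-table accesses through the second stage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

/* One page-sized mapping: input page number -> output page number. */
struct page_map {
    uint64_t from_pfn;
    uint64_t to_pfn;
};

enum { XLATE_OK = 0, XLATE_VS_FAULT = -1, XLATE_G_FAULT = -2 };

/* Look up one address in a flat table; returns 0 on a hit. */
static int lookup(const struct page_map *tbl, size_t n,
                  uint64_t in, uint64_t *out) {
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].from_pfn == (in >> PAGE_SHIFT)) {
            *out = (tbl[i].to_pfn << PAGE_SHIFT) | (in & PAGE_MASK);
            return 0;
        }
    }
    return -1;
}

/*
 * GVA -> GPA through the guest's table (vsatp), then GPA -> HPA through
 * the hypervisor's table (hgatp). A miss in the first table is an
 * ordinary page fault for the guest; a miss in the second is a
 * guest-page fault that the hypervisor must handle.
 */
int translate_two_stage(const struct page_map *vs_tbl, size_t vs_n,
                        const struct page_map *g_tbl, size_t g_n,
                        uint64_t gva, uint64_t *hpa) {
    uint64_t gpa;

    assert(hpa != NULL);
    if (lookup(vs_tbl, vs_n, gva, &gpa) != 0)
        return XLATE_VS_FAULT;
    if (lookup(g_tbl, g_n, gpa, hpa) != 0)
        return XLATE_G_FAULT;
    return XLATE_OK;
}
```

The two distinct failure codes mirror the key property of the design: the guest only ever sees faults from its own stage, while faults in the hypervisor's stage are invisible to it.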

Virtual Interrupt Handling

The H extension virtualizes interrupts, allowing the hypervisor to inject interrupts into guest VMs.

Hypervisor interrupt CSRs:

hvip: Hypervisor virtual interrupt pending
hie: Hypervisor interrupt enable
hgeip: Hypervisor guest external interrupt pending

Injecting a virtual interrupt:

// Hypervisor code: Inject timer interrupt into guest
void inject_guest_timer_interrupt(void) {
    // Set virtual supervisor timer interrupt pending
    csr_set(CSR_HVIP, HVIP_VSTIP);

    // When guest resumes, it will see a timer interrupt
}

Guest handling:

// Guest OS sees the interrupt as a normal S-mode interrupt
void guest_timer_handler(void) {
    // Handle timer interrupt
    // Guest doesn't know it's virtualized
}

VM Entry and Exit

The hypervisor enters VS-mode by loading the guest’s CSR state and executing sret; a trap taken while the guest runs brings control back to HS-mode.

VM Entry (HS-mode → VS-mode):

void vm_enter(struct vcpu *vcpu) {
    // 1. Load guest state
    write_csr(CSR_VSSTATUS, vcpu->vsstatus);
    write_csr(CSR_VSIE, vcpu->vsie);
    write_csr(CSR_VSTVEC, vcpu->vstvec);
    write_csr(CSR_VSSCRATCH, vcpu->vsscratch);
    write_csr(CSR_VSEPC, vcpu->vsepc);
    write_csr(CSR_VSCAUSE, vcpu->vscause);
    write_csr(CSR_VSTVAL, vcpu->vstval);
    write_csr(CSR_VSATP, vcpu->vsatp);

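    // 2. Set hstatus.SPV = 1 so that sret resumes with virtualization on (V=1)
    csr_set(CSR_HSTATUS, HSTATUS_SPV);

    // 3. Set sstatus.SPP = 1 so that sret lands in VS-mode rather than VU-mode
    csr_set(CSR_SSTATUS, SSTATUS_SPP);

    // 4. Set sepc to guest entry point
    write_csr(CSR_SEPC, vcpu->pc);

    // 5. Enter VS-mode
    asm volatile("sret");  // Return to VS-mode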
}

VM Exit (VS-mode → HS-mode):

When the guest executes certain instructions (ecall, WFI, privileged CSR access) or takes a trap, control returns to the hypervisor.

void vm_exit_handler(struct vcpu *vcpu) {
    // Save guest state
    vcpu->vsstatus = read_csr(CSR_VSSTATUS);
    vcpu->vsepc = read_csr(CSR_VSEPC);
    vcpu->vscause = read_csr(CSR_VSCAUSE);
    vcpu->vstval = read_csr(CSR_VSTVAL);
    vcpu->pc = read_csr(CSR_SEPC);

    // Handle exit reason
    unsigned long cause = read_csr(CSR_SCAUSE);

    switch (cause) {
    case CAUSE_VIRTUAL_SUPERVISOR_ECALL:
        // Guest made hypercall
        handle_hypercall(vcpu);
        break;
    case CAUSE_GUEST_PAGE_FAULT:
        // Guest-page fault (the hypervisor-controlled G-stage translation failed)
        handle_guest_page_fault(vcpu);
        break;
    case CAUSE_VIRTUAL_INSTRUCTION:
        // Guest tried to execute privileged instruction
        emulate_instruction(vcpu);
        break;
    default:
        // Other traps
        inject_exception_to_guest(vcpu, cause);
    }
}
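The symbolic `CAUSE_*` names in the handler above stand for concrete scause values. The sketch below classifies an RV64 scause value on the host and, for guest-page faults, reconstructs the faulting guest physical address from htval and stval. The cause numbers follow the privileged specification; `struct trap_info`, `enum exit_kind`, and `classify_exit` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CSR values captured at VM exit. */
struct trap_info {
    uint64_t scause;
    uint64_t stval;
    uint64_t htval;
};

enum exit_kind {
    EXIT_INTERRUPT,
    EXIT_HYPERCALL,
    EXIT_GUEST_PAGE_FAULT,
    EXIT_VIRTUAL_INSTRUCTION,
    EXIT_OTHER,
};

/*
 * Classify a VM exit. If the exit is a guest-page fault and gpa is not
 * NULL, *gpa receives the faulting guest physical address.
 */
enum exit_kind classify_exit(const struct trap_info *t, uint64_t *gpa) {
    assert(t != NULL);

    if (t->scause & (1ULL << 63))      /* top bit set: interrupt (RV64) */
        return EXIT_INTERRUPT;

    switch (t->scause) {
    case 10:                           /* ecall from VS-mode */
        return EXIT_HYPERCALL;
    case 20:                           /* instruction guest-page fault */
    case 21:                           /* load guest-page fault */
    case 23:                           /* store/AMO guest-page fault */
        if (gpa != NULL)
            /* htval holds the GPA shifted right by 2; the low two
             * bits are taken from stval. */
            *gpa = (t->htval << 2) | (t->stval & 3);
        return EXIT_GUEST_PAGE_FAULT;
    case 22:                           /* virtual instruction */
        return EXIT_VIRTUAL_INSTRUCTION;
    default:
        return EXIT_OTHER;
    }
}
```

Checking the interrupt bit first matters, because interrupt numbers and exception codes share the same low bits of scause.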

10.6 Security Model

Physical Memory Protection (PMP)

PMP is RISC-V’s primary memory protection mechanism. M-mode firmware programs it, and the hardware then checks every S-mode and U-mode access (plus M-mode accesses that match a locked entry) against the configured regions and permissions.

PMP use cases:

Protect M-mode firmware from S-mode
Isolate security-critical regions
Enforce memory access policies
Implement basic TEE (Trusted Execution Environment)

PMP configuration registers:

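pmpcfg0-pmpcfg15: Configuration for PMP entries (RV32: 4 entries per register; RV64: 8 entries per register, and only the even-numbered registers pmpcfg0, pmpcfg2, ..., pmpcfg14 exist)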
pmpaddr0-pmpaddr63: Address registers (up to 64 entries)

PMP entry format (pmpcfg):

Bits [7:0] for each entry:
  [7]: L (Lock) - Entry cannot be modified until reset
  [6:5]: Reserved
  [4:3]: A (Address matching mode)
         00 = OFF, 01 = TOR, 10 = NA4, 11 = NAPOT
  [2]: X (Execute permission)
  [1]: W (Write permission)
  [0]: R (Read permission)
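The bit layout above translates directly into a few shifts and masks. The following host-side sketch pulls one entry's configuration byte out of an RV64 pmpcfg register value and splits it into fields; `struct pmp_cfg`, `pmpcfg_field`, and `pmp_decode` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Address-matching modes, encoded as in the A field. */
enum pmp_mode {
    PMP_MODE_OFF   = 0,
    PMP_MODE_TOR   = 1,
    PMP_MODE_NA4   = 2,
    PMP_MODE_NAPOT = 3,
};

/* Decoded view of one 8-bit PMP configuration field. */
struct pmp_cfg {
    int r, w, x;          /* permission bits */
    enum pmp_mode mode;   /* A field */
    int locked;           /* L bit */
};

/* Extract the config byte of entry idx (0-7) from an RV64 pmpcfg value. */
uint8_t pmpcfg_field(uint64_t pmpcfg_reg, unsigned idx) {
    assert(idx < 8);
    return (uint8_t)(pmpcfg_reg >> (8 * idx));
}

/* Split one config byte into its fields. */
struct pmp_cfg pmp_decode(uint8_t cfg) {
    struct pmp_cfg e;

    e.r      = cfg & 1;
    e.w      = (cfg >> 1) & 1;
    e.x      = (cfg >> 2) & 1;
    e.mode   = (enum pmp_mode)((cfg >> 3) & 3);
    e.locked = (cfg >> 7) & 1;
    return e;
}
```

For instance, the byte 0x9d decodes as locked, NAPOT, read and execute allowed, write denied.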

Example: Protect M-mode firmware

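void protect_m_mode_firmware(void) {
    // Protect 0x80000000 - 0x80100000 (1 MB M-mode firmware)
    // Use TOR (Top-Of-Range) mode

    // Entry 0: Start address (0x80000000), stored as address >> 2
    write_csr(pmpaddr0, 0x80000000UL >> 2);

    // Entry 1: End address (0x80100000), stored as address >> 2
    write_csr(pmpaddr1, 0x80100000UL >> 2);

    // Configure entry 1: TOR mode with R=W=X=0 and L=0.
    // S/U-mode accesses that match are denied. M-mode is unaffected because
    // an unlocked entry does not apply to it. (Setting L together with R+X
    // would instead grant read/execute to every mode, S-mode included.)
    unsigned long cfg = PMP_TOR;
    write_csr(pmpcfg0, cfg << 8);  // Entry 1 config lives in bits [15:8]

    // Now S-mode cannot access 0x80000000 - 0x80100000.
    // A later, lower-priority entry must still grant S-mode access to the
    // rest of memory, because S/U accesses that match no entry are denied.
}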
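TOR needs two address registers per region. For power-of-two regions, NAPOT mode gets by with one: the region size is encoded in the number of trailing one bits of pmpaddr. The sketch below shows the encoding and its inverse on the host; the function names are made up for this example, and the 4-byte case (which uses NA4 mode instead) is deliberately rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode a naturally aligned power-of-two region as a NAPOT pmpaddr
 * value. size must be a power of two >= 8 and base must be aligned to
 * size. Returns 0 on success, -1 on invalid input.
 */
int pmp_napot_encode(uint64_t base, uint64_t size, uint64_t *pmpaddr) {
    assert(pmpaddr != NULL);

    if (size < 8 || (size & (size - 1)) != 0)
        return -1;                     /* not a power of two >= 8 */
    if ((base & (size - 1)) != 0)
        return -1;                     /* base not aligned to size */

    /* Address bits [..:2] on top, (size/8 - 1) trailing ones below. */
    *pmpaddr = (base >> 2) | ((size >> 3) - 1);
    return 0;
}

/* Recover base and size from a NAPOT pmpaddr value. */
void pmp_napot_decode(uint64_t pmpaddr, uint64_t *base, uint64_t *size) {
    unsigned ones = 0;

    assert(base != NULL && size != NULL);
    while (ones < 60 && ((pmpaddr >> ones) & 1))
        ones++;                        /* count trailing one bits */

    *size = 1ULL << (ones + 3);
    *base = (pmpaddr & ~((1ULL << ones) - 1)) << 2;
}
```

The 1 MB firmware region from the example above (base 0x80000000, size 0x100000) encodes to the single value 0x2001FFFF.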

Enhanced PMP (ePMP)

ePMP extends PMP with additional security features:

Rule locking: Prevent modification of PMP entries
M-mode lockdown: Restrict M-mode access to specific regions
Whitelist mode: Default deny, explicit allow

ePMP adds mseccfg CSR:

Bits:
  [2]: RLB (Rule Locking Bypass) - Allow M-mode to modify locked entries
  [1]: MMWP (Machine Mode Whitelist Policy) - Enforce whitelist for M-mode
  [0]: MML (Machine Mode Lockdown) - Restrict M-mode access

Example: M-mode lockdown

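void enable_m_mode_lockdown(void) {
    // Set MML bit: locked (L=1) entries become M-mode-only rules and
    // unlocked (L=0) entries become S/U-mode-only rules. M-mode can no
    // longer execute from a region that lacks a matching M-mode rule.
    // MML is sticky: once set, it stays set until reset.
    csr_set(CSR_MSECCFG, MSECCFG_MML);

    // M-mode code must now run from explicitly allowed regions
    // (set MMWP as well to deny M-mode data accesses that match no rule)
}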

Comparison with ARM TrustZone

RISC-V PMP vs ARM TrustZone:

Feature	RISC-V PMP	ARM TrustZone
Isolation	Region-based (up to 64 regions)	World-based (Secure/Non-secure)
Granularity	4 bytes to 2^64 bytes	4 KB minimum
Privilege	M-mode enforced	EL3 enforced
Secure world	No built-in secure world	Dedicated S-EL0/S-EL1
Complexity	Simple, flexible	Complex, rich features
Use case	Firmware protection, basic TEE	Full TEE, secure boot, DRM

RISC-V security philosophy: Provide minimal hardware mechanisms (PMP), build rich security features in software (TEE frameworks like Keystone, Penglai).

ARM TrustZone philosophy: Provide rich hardware support for secure world, standardize TEE architecture.

🛠️ Hands-on Lab: Lab 10.1 — Saying Hello Through the Counter (SBI Call)

This lab demonstrates the standard SBI call flow. We’ll use OpenSBI (bundled with QEMU) and place our kernel at 0x80200000 (the default payload address OpenSBI jumps to).

Lab Objectives

Wrap an sbi_call Assembly function
Call Legacy Extension (EID=1) to output characters
Call Base Extension (EID=0x10) to query SBI version

Project Structure

lab10/
├── link.ld     # Memory layout (note the different start address)
├── sbi.S       # SBI Wrapper and entry point
└── kernel.c    # Main program

Code

File 1: link.ld

Note: When using QEMU -bios default (OpenSBI), it loads itself at 0x80000000 and jumps to 0x80200000 to execute our kernel.

OUTPUT_ARCH( "riscv" )
ENTRY( _start )

SECTIONS
{
  /* OpenSBI default jump address */
  . = 0x80200000;

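  .text : {
    *(.text.boot)
    *(.text .text.*)
  }
  .rodata : { *(.rodata .rodata.* .srodata .srodata.*) }
  .data : { *(.data .data.* .sdata .sdata.*) }
  .bss : { *(.bss .bss.* .sbss .sbss.* COMMON) }

  /* 4 KB stack; keep sp 16-byte aligned as the RISC-V ABI expects */
  . = ALIGN(16);
  . = . + 0x1000;
  _stack_top = .;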
}

File 2: sbi.S (SBI Wrapper)

.section .text.boot
.global _start
.global sbi_call

# Program entry
_start:
    la sp, _stack_top
    call kernel_main
loop:
    j loop

# sbi_call(ext, fid, arg0, arg1, arg2)
# C calling: a0=ext, a1=fid, a2=arg0, a3=arg1, a4=arg2
# Returns:   a0=SBI error code, a1=SBI value. A C prototype that returns a
#            two-long struct sees both; one that returns long sees only a0.
sbi_call:
    mv a7, a0       # ext -> a7 (EID)
    mv a6, a1       # fid -> a6 (FID)
    mv a0, a2       # arg0 -> a0
    mv a1, a3       # arg1 -> a1
    mv a2, a4       # arg2 -> a2

    ecall           # Trigger Environment Call (trap to M-mode)

    ret             # Return a0 (error) and a1 (value) unchanged

File 3: kernel.c

#include <stdint.h>

// SBI Extension IDs
#define SBI_EID_CONSOLE_PUTCHAR 0x01  // Legacy Console
#define SBI_EID_BASE            0x10  // Base Extension

// Base Extension Function IDs
#define SBI_FID_GET_SPEC_VERSION 0

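// Value pair returned by every non-legacy SBI call (a0 = error, a1 = value)
struct sbiret {
    long error;
    long value;
};

// External Assembly function. The RISC-V calling convention returns a
// two-long struct in a0/a1, which is exactly where SBI leaves its results.
struct sbiret sbi_call(long ext, long fid, long arg0, long arg1, long arg2);

// Character output (via SBI)
void putchar(char c) {
    sbi_call(SBI_EID_CONSOLE_PUTCHAR, 0, c, 0, 0);
}

void print_str(const char *s) {
    while (*s) {
        putchar(*s++);
    }
}

// Query SBI version
void print_sbi_version(void) {
    struct sbiret ret = sbi_call(SBI_EID_BASE, SBI_FID_GET_SPEC_VERSION, 0, 0, 0);
    long version = ret.value;  // The version arrives in a1; a0 is only the error code
    long major = (version >> 24) & 0x7f;
    long minor = version & 0xffffff;

    print_str("SBI Spec Version: ");
    putchar('0' + major);      // Single-digit output only
    putchar('.');
    putchar('0' + minor);
    putchar('\n');
}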

void kernel_main(void) {
    print_str("Hello from S-mode via SBI!\n");
    print_sbi_version();

    while (1) {}
}

Compile and Run

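# Compile
# -mcmodel=medany: required because the kernel is linked above 2 GiB (0x80200000)
# -ffreestanding:  no hosted C library, so our own putchar() does not clash with the built-in
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -ffreestanding -mcmodel=medany \
    -T link.ld -o kernel sbi.S kernel.c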

# Run (QEMU with OpenSBI)
qemu-system-riscv64 -machine virt -nographic -bios default -kernel kernel

Expected Output:

OpenSBI v1.2
   ...
Hello from S-mode via SBI!
SBI Spec Version: 1.0
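The lab prints each version component with `'0' + n`, which only works while the component is a single digit. The sketch below separates the two concerns: splitting the value returned by `get_spec_version` and formatting an arbitrary unsigned number as decimal without any C library support. Both function names are hypothetical; the bit layout (minor in bits 23:0, major in bits 30:24) is the one used in the lab code. The `assert` calls are for host-side testing and would be dropped in the freestanding kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Split an SBI spec version word into its major and minor parts. */
void sbi_version_split(unsigned long version,
                       unsigned long *major, unsigned long *minor) {
    assert(major != NULL && minor != NULL);
    *major = (version >> 24) & 0x7f;
    *minor = version & 0xffffff;
}

/*
 * Write v as a NUL-terminated decimal string into buf (capacity len).
 * Returns the number of digits written, or 0 if buf is too small.
 */
size_t u_to_dec(unsigned long v, char *buf, size_t len) {
    char tmp[20];                 /* enough for a 64-bit value */
    size_t n = 0;

    assert(buf != NULL);
    do {                          /* digits come out least significant first */
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);

    if (n + 1 > len)
        return 0;
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];  /* reverse into the output buffer */
    buf[n] = '\0';
    return n;
}
```

In the kernel, the resulting string can be passed straight to `print_str`, so a future spec version such as 2.12 would still print correctly.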

Comparison: Lab 9 vs Lab 10

Item	Lab 9 (Bare-metal)	Lab 10 (SBI)
Start Address	0x80000000	0x80200000
UART Access	Direct MMIO write	Via SBI ecall
Portability	❌ Hardware-dependent	✅ Cross-platform
Complexity	Simple but fragile	Requires SBI spec knowledge

danieRTOS Reference: The danieRTOS console uses SBI calls for portable output across different RISC-V platforms.

⚠️ Common Pitfalls

Pitfall 1: Confusing EID and FID

Error Scenario: Putting Extension ID in a6 and Function ID in a7.

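Consequence: The firmware dispatches on the wrong IDs. The call usually returns SBI_ERR_NOT_SUPPORTED, but it can also invoke an unintended service: in the wrong example below, a7 = 0 is the EID of the legacy set_timer call.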

# ❌ Wrong: EID and FID positions swapped
    li a6, 0x10      # This should be EID, but it's in a6
    li a7, 0         # This should be FID, but it's in a7
    ecall

# ✅ Correct
    li a7, 0x10      # a7 = EID (Extension ID)
    li a6, 0         # a6 = FID (Function ID)
    ecall

Pitfall 2: Forgetting OpenSBI Occupies 0x80000000

Error Scenario: When using OpenSBI, still placing kernel at 0x80000000.

Consequence: Kernel overwrites OpenSBI’s memory, causing SBI calls to fail.

/* ❌ Wrong: Conflicts with OpenSBI */
. = 0x80000000;

/* ✅ Correct: Use OpenSBI's default payload address */
. = 0x80200000;

Pitfall 3: Misusing Legacy Extensions

Error Scenario: Using deprecated Legacy Extensions that newer OpenSBI may not support.

Consequence: SBI call returns -2 (SBI_ERR_NOT_SUPPORTED).

// ⚠️ Legacy Extensions (EID 0-8) are marked deprecated
// Newer OpenSBI may not support them
sbi_call(0x01, 0, 'A', 0, 0);  // console_putchar

// ✅ Recommended: Use newer Extensions
// Debug Console Extension (EID = 0x4442434E)
#define SBI_EID_DBCN  0x4442434E
#define SBI_FID_DBCN_WRITE_BYTE 2
sbi_call(SBI_EID_DBCN, SBI_FID_DBCN_WRITE_BYTE, 'A', 0, 0);

💡 Tip: In production, check if an SBI Extension is available:

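// Use Base Extension's probe_extension.
// The probe result is the SBI return *value* (register a1), not the error
// code in a0, so this assumes sbi_call is declared to return
// struct sbiret { long error; long value; }.
#define SBI_FID_PROBE_EXTENSION 3
struct sbiret ret = sbi_call(SBI_EID_BASE, SBI_FID_PROBE_EXTENSION,
                             target_eid, 0, 0);
// ret.error == 0 and ret.value != 0 means the Extension is available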

Summary

Machine mode and SBI provide the foundation for RISC-V system software. This chapter covered five key areas:

M-mode firmware is designed to be minimal and platform-specific. It initializes hardware, sets up memory protection, and provides runtime services through the SBI interface. Unlike ARM’s extensive EL3 firmware, RISC-V keeps M-mode simple and delegates most functionality to S-mode.

SBI interface provides a standard ecall-based interface between M-mode firmware and S-mode operating systems. The calling convention passes the Extension ID in a7, the Function ID in a6, and arguments in a0-a5, and it returns an error code in a0 and a value in a1. This standardization lets an OS kernel reach timer, IPI, and console services the same way on any SBI-compliant platform.

SBI extensions cover essential system services: Timer extension for scheduling, IPI extension for inter-processor communication, RFENCE extension for TLB synchronization, HSM extension for hart lifecycle management, and System Reset extension for reboot and shutdown. Each extension is identified by an Extension ID (EID) and provides multiple functions.

Hypervisor extension adds virtualization support through VS-mode and VU-mode, enabling guest operating systems to run under a hypervisor. Two-stage address translation (GVA → GPA → HPA) isolates guest memory, while virtual interrupt injection allows the hypervisor to deliver interrupts to guests. VM entry and exit are managed through CSR manipulation and the SRET instruction.

PMP and ePMP provide memory protection by defining access permissions for memory regions. PMP is simpler than ARM TrustZone but sufficient for protecting M-mode firmware and implementing basic trusted execution environments. ePMP adds enhanced security features like M-mode lockdown and rule locking.

In the next chapter, we’ll explore RISC-V ISA extensions, starting with the standard extensions (M, A, F, D, C) and moving to advanced features like Vector and Bit Manipulation.

Chapter 11. RISC-V Standard Extensions

Part VII — ISA Extensions

🎯 Learning Objectives

After reading this chapter, you will be able to:

Decode ISA Naming: Parse the meaning of ISA strings like RV64GC, RV32IM
Understand the G Package: Know that G = IMAFD and its historical background
Master Z/X Extension Logic: Understand the naming rules for new-style Extensions
Compare Hardware vs Software Implementation: Understand the performance difference between hardware instructions and software emulation
Detect Extension Presence: Query CPU-supported features via the misa CSR

💡 Scenario: Skill Trees and DLCs

Scene: Junior is reading Linux Kernel compile options and gets intimidated by a long ISA string.

Junior: “Senior, is this MARCH variable a Wi-Fi password? RV64IMAFDC_Zicsr… who can read this?”

Senior: (laughs) “This is RISC-V’s ID card. Don’t worry—let’s treat it like RPG character stats. Breaking it down makes it simple.”

Junior: “RPG stats?”

Senior: “Look at the first two characters RV64—this means it’s a 64-bit character that can wield two-handed swords (64-bit registers). RV32 would be 32-bit.”

Junior: “I get that part. What about that string of letters?”

Senior: “Those are ‘skills it has learned.’ Each letter represents an ability:

Letter	Name	Analogy	Function
I	Integer	Basic Training	Addition, subtraction, logic ops—essential skill
M	Multiply	Multiplication Skill	Hardware multiply/divide—without it, you do N additions
A	Atomic	Locking Skill	Atomic operations—the foundation of spinlocks
F	Float	Single-Precision Magic	32-bit floating-point math
D	Double	Double-Precision Magic	64-bit floating-point math
C	Compressed	Contortion Skill	16-bit compressed instructions—saves space

”

Junior: “Makes sense. But what about RV64GC that everyone talks about? There’s no G in that table!”

Senior: “G (General) is a ‘value bundle.’ Since IMAFD are so commonly used together, the spec defines G = I + M + A + F + D (today it also pulls in Zicsr and Zifencei, which used to be part of I). So RV64GC is really shorthand for RV64IMAFDC, the baseline for running Linux.”

Junior: “Got it! What about that Z prefix at the end? Hidden skill?”

Senior: “Pretty much. Since 26 letters aren’t enough anymore, newer features (or features split out from I) use Z prefix plus a name. Think of them as DLC expansions.

For example, Zicsr means CSR operations are supported, Zifencei means instruction fence is supported. This ‘password’ just tells the compiler: ‘This CPU bought these DLCs, feel free to use these instructions’!”

Junior: “Ha! So it’s just a skill list—that makes it much clearer!”

RISC-V’s modular design is one of its most distinctive features. Unlike monolithic instruction set architectures that bundle everything together, RISC-V separates functionality into a minimal base ISA plus optional extensions. This approach allows implementations to include only the features they need, from tiny microcontrollers to high-performance servers.

The base integer ISA (RV32I or RV64I) is deliberately small: RV32I defines 40 instructions in the current specification (47 in earlier versions, before the CSR and FENCE.I instructions moved into Zicsr and Zifencei). That is enough to serve as a complete compiler target, but most practical systems need more: multiplication and division, atomic operations for synchronization, floating-point arithmetic, and compressed instructions for code density. These capabilities come from standard extensions, each identified by a single letter.

Understanding these extensions is crucial for anyone working with RISC-V. Compiler writers need to know which instructions are available. Hardware designers must decide which extensions to implement. Software developers need to understand the performance implications of using extension instructions versus emulating them in software.

In this chapter, we’ll explore the standard extensions that form the foundation of most RISC-V systems: M for multiplication, A for atomics, F and D for floating-point, C for compressed instructions, and B for bit manipulation. We’ll see how these extensions integrate with the base ISA and compare them with similar features in ARM and x86.

11.1 Extension Overview

The Extension Model

RISC-V extensions follow a carefully designed model. The base ISA (I) is frozen and will never change. Extensions add functionality without modifying the base. Once an extension is ratified, it too is frozen, ensuring long-term stability.

Extensions are identified by single letters: M, A, F, D, C, V, B, and so on. A processor’s capabilities are described by concatenating these letters: RV64IMAFD means a 64-bit processor with integer, multiplication, atomic, single-precision float, and double-precision float extensions. The letter G is shorthand for IMAFD (general-purpose), and in current versions of the specification it also implies Zicsr and Zifencei, so RV64GC means RV64IMAFD plus compressed instructions.
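The naming rule can be sketched as a small lookup over an ISA string. This is only an illustration: the function names isa_xlen and isa_has are hypothetical, the sketch accepts lowercase -march style strings only, and it looks only at the single-letter part of the string (it stops at the first underscore or multi-letter Z/X/S name).

```c
#include <stdbool.h>
#include <string.h>

/* Returns 32 or 64 for a recognised "rv32"/"rv64" prefix, 0 otherwise. */
int isa_xlen(const char *isa)
{
    if (strncmp(isa, "rv32", 4) == 0) return 32;
    if (strncmp(isa, "rv64", 4) == 0) return 64;
    return 0;
}

/* Hypothetical helper: is the single-letter extension `ext` enabled by
 * an ISA string such as "rv64gc"? 'g' is treated as i, m, a, f, d. */
bool isa_has(const char *isa, char ext)
{
    if (isa_xlen(isa) == 0 || ext == '\0')
        return false;
    for (const char *p = isa + 4; *p != '\0'; ++p) {
        if (*p == '_' || *p == 'z' || *p == 'x' || *p == 's')
            break;                  /* multi-letter names start here */
        if (*p == ext)
            return true;
        if (*p == 'g' && strchr("imafd", ext) != NULL)
            return true;            /* G expands to IMAFD */
    }
    return false;
}
```

With this sketch, isa_has("rv64gc", 'c') is true while isa_has("rv64g", 'c') is false, which matches the rule that G does not include C.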

Standard vs Non-Standard Extensions

Standard extensions are defined by RISC-V International and ratified through a formal process. They have reserved letter codes and are guaranteed to be compatible across implementations. Non-standard extensions use the X prefix (like Xvendor) and are vendor-specific.

Custom extensions can add specialized instructions without conflicting with standard ones. The instruction encoding reserves opcode space for custom instructions, allowing vendors to innovate while maintaining compatibility with standard software.

Extension Discovery

Software can detect which extensions are present by reading the misa CSR (Machine ISA register). Each bit in misa corresponds to an extension:

// Read misa to detect extensions
unsigned long misa = read_csr(CSR_MISA);

bool has_M = (misa & (1UL << ('M' - 'A'))) != 0;  // Bit 12
bool has_A = (misa & (1UL << ('A' - 'A'))) != 0;  // Bit 0
bool has_F = (misa & (1UL << ('F' - 'A'))) != 0;  // Bit 5
bool has_D = (misa & (1UL << ('D' - 'A'))) != 0;  // Bit 3
bool has_C = (misa & (1UL << ('C' - 'A'))) != 0;  // Bit 2

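On some implementations, misa is read-only. On others, writing to misa can enable or disable extensions dynamically, though this is rare in practice. The specification also allows misa to read as zero, meaning the register is not implemented; in that case software must discover extensions by other means, such as the device tree.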
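Because each of the low 26 bits of misa maps to one letter, the whole register can be decoded in a loop. The following is a minimal host-side sketch that takes the misa value as a plain integer; the function name misa_letters is an assumption for illustration, and the MXL field in the top two bits is simply ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the letter of every extension whose misa bit is set into `out`
 * (alphabetical order, NUL-terminated). `out` must hold 27 bytes.
 * Returns the number of letters written. */
size_t misa_letters(uint64_t misa, char out[27])
{
    size_t n = 0;
    for (int bit = 0; bit < 26; ++bit) {
        if (misa & (UINT64_C(1) << bit))
            out[n++] = (char)('A' + bit);   /* bit 0 = 'A', bit 12 = 'M', ... */
    }
    out[n] = '\0';
    return n;
}
```

A return value of 0 corresponds to the "misa reads as zero" case, where nothing can be concluded from the register itself.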

Figure 11.1: RISC-V Extension Ecosystem

graph TB
    subgraph "Base ISA (Frozen)"
        RV32I[RV32I<br/>32-bit Integer]
        RV64I[RV64I<br/>64-bit Integer]
        RV128I[RV128I<br/>128-bit Integer<br/>(draft)]
    end
    
    subgraph "Standard Extensions"
        M[M: Multiply/Divide]
        A[A: Atomics]
        F[F: Float 32-bit]
        D[D: Float 64-bit]
        C[C: Compressed]
        V[V: Vector]
        B[B: Bit Manipulation]
    end
    
    subgraph "Profiles"
        RVA22[RVA22 Profile<br/>Application Processors]
        RVA23[RVA23 Profile<br/>+ Vector]
    end
    
    RV64I --> M
    RV64I --> A
    RV64I --> F
    RV64I --> D
    RV64I --> C
    RV64I --> V
    RV64I --> B
    
    M --> RVA22
    A --> RVA22
    F --> RVA22
    D --> RVA22
    C --> RVA22
    
    RVA22 --> RVA23
    V --> RVA23
    
    style RV64I fill:#90EE90
    style M fill:#87CEEB
    style A fill:#87CEEB
    style F fill:#87CEEB
    style D fill:#87CEEB
    style C fill:#87CEEB
    style V fill:#FFD700
    style B fill:#FFD700
    style RVA22 fill:#FFB6C1
    style RVA23 fill:#FFB6C1

The diagram shows how extensions build on the base ISA and combine into profiles for specific use cases.

11.2 M Extension: Integer Multiplication and Division

Why M is Optional

The M extension adds integer multiplication and division instructions. You might wonder why these fundamental operations aren’t in the base ISA. The answer is simplicity and flexibility.

Tiny embedded systems (like IoT sensors) may never need multiplication or division. Making M optional allows these systems to save chip area and power. Software can emulate multiplication using shifts and adds if needed, although this is much slower than a hardware multiplier.

For most systems, M is essential. It’s part of the G (general-purpose) bundle and required by the RVA22 profile for application processors.

Multiplication Instructions

The M extension provides four multiplication instructions for RV32 and RV64:

MUL rd, rs1, rs2: Multiply rs1 by rs2, store the lower XLEN bits in rd. This is the most common multiplication, used when you only need the low-order result.

# Example: Multiply two 32-bit numbers
li a0, 100
li a1, 200
mul a2, a0, a1      # a2 = 100 * 200 = 20000

MULH rd, rs1, rs2: Multiply signed rs1 by signed rs2, store the upper XLEN bits in rd. Used for detecting overflow or implementing multi-word multiplication.

MULHU rd, rs1, rs2: Multiply unsigned rs1 by unsigned rs2, store the upper XLEN bits in rd.

MULHSU rd, rs1, rs2: Multiply signed rs1 by unsigned rs2, store the upper XLEN bits in rd. This asymmetric variant is useful for certain algorithms.

Why separate high and low multiply? A full multiplication of two XLEN-bit numbers produces a 2×XLEN-bit result. MUL gives you the low half, MULH/MULHU/MULHSU give you the high half. To get the full result, you execute both:

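# 64-bit × 64-bit = 128-bit signed multiplication
# (the ISA manual recommends issuing the high half first so that
#  implementations can fuse the pair; use mulhu for unsigned operands)
mulh a3, a0, a1     # High 64 bits
mul  a2, a0, a1     # Low 64 bits
# Result is in a3:a2 (128 bits)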
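To see what the high-half instructions return, here is a portable C model built from 32-bit partial products. It is a sketch for experimenting on a host machine, not a description of any hardware multiplier; the names mulhu64 and mulh64 are illustrative, and the final cast in mulh64 assumes the usual two's-complement conversion.

```c
#include <stdint.h>

/* Model of MULHU on RV64: upper 64 bits of the unsigned 128-bit product. */
uint64_t mulhu64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;      /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;      /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;      /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;      /* contributes to bits  64..127 */
    /* Carry out of bits 32..63 into the high half. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Model of MULH: signed high half, derived from the unsigned one. */
int64_t mulh64(int64_t a, int64_t b)
{
    uint64_t hi = mulhu64((uint64_t)a, (uint64_t)b);
    if (a < 0) hi -= (uint64_t)b;   /* correct for a's sign bit */
    if (b < 0) hi -= (uint64_t)a;   /* correct for b's sign bit */
    return (int64_t)hi;
}
```

The low half needs no model: in C, a * b on uint64_t already yields the same bits as MUL.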

Division and Remainder

The M extension also provides division and remainder instructions:

DIV rd, rs1, rs2: Signed division, rd = rs1 / rs2 (truncated toward zero).

DIVU rd, rs1, rs2: Unsigned division, rd = rs1 / rs2.

REM rd, rs1, rs2: Signed remainder, rd = rs1 % rs2.

REMU rd, rs1, rs2: Unsigned remainder, rd = rs1 % rs2.

# Example: Divide 100 by 7
li a0, 100
li a1, 7
div a2, a0, a1      # a2 = 100 / 7 = 14
rem a3, a0, a1      # a3 = 100 % 7 = 2

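Division by zero does not trap in RISC-V. Instead, it returns defined values: the quotient of a division by zero is -1 (all bits set, which for DIVU is the largest unsigned value), and the remainder is the dividend. Signed overflow (the most negative value divided by -1) does not trap either: the quotient is the dividend and the remainder is 0. Software can check for these cases explicitly if needed, without the overhead of trap handling.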
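These corner cases differ from C, where dividing by zero is undefined behavior, so a software model has to special-case them. The sketch below mirrors the RV64 results described above; the rv_ prefixed function names are made up for this example.

```c
#include <stdint.h>

/* Models of DIV / REM on RV64, including the non-trapping corner cases. */
int64_t rv_div(int64_t a, int64_t b)
{
    if (b == 0) return -1;                              /* all bits set */
    if (a == INT64_MIN && b == -1) return INT64_MIN;    /* signed overflow */
    return a / b;                                       /* truncates toward zero */
}

int64_t rv_rem(int64_t a, int64_t b)
{
    if (b == 0) return a;                               /* dividend */
    if (a == INT64_MIN && b == -1) return 0;
    return a % b;                                       /* sign follows dividend */
}

/* Models of DIVU / REMU. */
uint64_t rv_divu(uint64_t a, uint64_t b) { return b == 0 ? UINT64_MAX : a / b; }
uint64_t rv_remu(uint64_t a, uint64_t b) { return b == 0 ? a : a % b; }
```

A caller that cares about division by zero tests the divisor itself rather than relying on a trap.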

RV64 Word Operations

On RV64, the M extension adds word-sized (32-bit) variants that operate on the lower 32 bits and sign-extend the result to 64 bits:

MULW, DIVW, DIVUW, REMW, REMUW

# RV64: 32-bit multiplication with sign extension
li a0, 0x40000000   # 1073741824
li a1, 2
mulw a2, a0, a1     # 32-bit product 0x80000000 -> a2 = 0xFFFFFFFF80000000 (sign-extended)

These are essential for efficiently handling 32-bit data on 64-bit processors.
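The "truncate to 32 bits, then sign-extend" behavior of MULW can be modeled in C as follows. This is a sketch with an illustrative function name, and it assumes the common two's-complement conversion when casting to int32_t.

```c
#include <stdint.h>

/* Model of MULW: multiply the low 32 bits of each operand, keep the low
 * 32 bits of the product, and sign-extend that to 64 bits. */
int64_t rv_mulw(int64_t rs1, int64_t rs2)
{
    uint32_t low = (uint32_t)rs1 * (uint32_t)rs2;   /* wraps modulo 2^32 */
    return (int64_t)(int32_t)low;                   /* sign-extend bit 31 */
}
```

Because bit 31 of the 32-bit product decides the upper half, the result always has its upper 32 bits either all zeros or all ones.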

Performance Characteristics

Multiplication and division are slower than addition and logic operations. Typical latencies:

MUL: 2-4 cycles (pipelined, throughput 1/cycle)
DIV: 10-40 cycles (not pipelined, variable latency)

Division is particularly expensive. Compilers optimize division by constants into multiplication by the reciprocal when possible.
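As a concrete instance of this optimization, unsigned 32-bit division by 10 can be replaced by one multiplication and one shift. The constant 0xCCCCCCCD is ceil(2^35 / 10); the sketch below shows the idea in C and is not the output of any particular compiler.

```c
#include <stdint.h>

/* x / 10 for any 32-bit unsigned x, without a divide instruction:
 * multiply by ceil(2^35 / 10) and shift the 64-bit product right by 35. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * UINT64_C(0xCCCCCCCD)) >> 35);
}
```

On RV64 this becomes a MUL followed by a shift; on RV32 the same idea uses MULHU to obtain the upper half of the product.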

11.3 A Extension: Atomic Instructions

The Need for Atomics

In multi-processor systems, multiple harts (hardware threads) may access shared memory simultaneously. Without atomic operations, race conditions can corrupt data. Consider incrementing a shared counter:

// Non-atomic increment (WRONG for multi-threaded code)
int counter = 0;

void increment() {
    counter++;  // Read-modify-write: NOT atomic!
}

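This compiles to a separate load, add, and store (after the address of counter has been placed in a register):

la   a1, counter    # a1 = &counter
lw   a0, 0(a1)      # Read
addi a0, a0, 1      # Modify
sw   a0, 0(a1)      # Write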

If two harts execute this simultaneously, both might read the same value, increment it, and write back the same result—losing one increment. Atomic instructions solve this problem.

Load-Reserved / Store-Conditional

The A extension provides two fundamental primitives for building atomic operations:

LR.W rd, (rs1): Load-Reserved Word. Loads a word from memory and registers a reservation on that address.

SC.W rd, rs2, (rs1): Store-Conditional Word. Stores rs2 to the address in rs1 only if the reservation is still valid. Returns 0 in rd on success, non-zero on failure.

The reservation is invalidated if another hart writes to the reserved address or if certain events occur (context switch, cache eviction, etc.).

Atomic increment using LR/SC:

# Atomic increment of counter
retry:
    lr.w  a0, (a1)      # Load counter, set reservation
    addi  a0, a0, 1     # Increment
    sc.w  a2, a0, (a1)  # Store if reservation valid
    bnez  a2, retry     # Retry if SC failed

If another hart modifies the counter between LR and SC, the SC fails and the loop retries. This ensures atomicity.

Atomic Memory Operations (AMO)

For common atomic operations, the A extension provides dedicated AMO instructions that are more efficient than LR/SC loops:

AMOSWAP.W rd, rs2, (rs1): Atomically swap memory[rs1] with rs2, return old value in rd.

AMOADD.W rd, rs2, (rs1): Atomically add rs2 to memory[rs1], return old value in rd.

AMOAND.W, AMOOR.W, AMOXOR.W: Atomic AND, OR, XOR.

AMOMIN.W, AMOMAX.W, AMOMINU.W, AMOMAXU.W: Atomic min/max (signed and unsigned).

# Atomic increment using AMO (simpler than LR/SC)
li       a0, 1
amoadd.w zero, a0, (a1)  # Atomically add a0 (= 1) to memory[a1], discard old value
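In C, both styles are reachable through C11 atomics. The sketch below shows a retry loop in the spirit of LR/SC and a single fetch-and-add in the spirit of AMOADD; the function names are illustrative, and which instructions a compiler actually emits depends on the toolchain and target options.

```c
#include <stdatomic.h>

/* Retry-loop increment: read, compute, try to publish, retry on failure.
 * Returns the value observed before the increment. */
int increment_retry(atomic_int *counter)
{
    int old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* On failure `old` was refreshed with the current value. */
    }
    return old;
}

/* Single-operation increment: one atomic read-modify-write. */
int increment_fetch_add(atomic_int *counter)
{
    return atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}
```

The retry form generalizes to any update function, while the fetch-and-add form covers only the operations that have a dedicated atomic.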

Atomic Ordering Annotations

AMO and LR/SC instructions can have ordering annotations:

.aq (acquire): Subsequent memory operations cannot be reordered before this instruction.

.rl (release): Previous memory operations cannot be reordered after this instruction.

.aqrl: Both acquire and release.

# Atomic swap with acquire-release semantics
amoswap.w.aqrl a0, a1, (a2)

These annotations are crucial for implementing lock-free data structures and memory barriers (see Chapter 6 on memory ordering).
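A spinlock is a common place where acquire and release appear as a pair. The following C11 sketch uses an atomic exchange with acquire ordering to take the lock and a release store to drop it; the type and function names are hypothetical, and a production lock would add backoff or a wait hint.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } spinlock_t;

/* Try once. Acquire ordering: accesses inside the critical section
 * cannot be reordered before the swap that takes the lock. */
bool spin_trylock(spinlock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

void spin_lock(spinlock_t *l)
{
    while (!spin_trylock(l)) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Release ordering: accesses inside the critical section cannot be
 * reordered after the store that frees the lock. */
void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

On RISC-V the exchange is a natural fit for an AMOSWAP with the acquire annotation, though the exact instruction sequence is up to the compiler.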

RV64 Variants

On RV64, the A extension provides both word (32-bit) and doubleword (64-bit) variants:

LR.W / SC.W: 32-bit load-reserved / store-conditional
LR.D / SC.D: 64-bit load-reserved / store-conditional
AMOADD.W / AMOADD.D: 32-bit / 64-bit atomic add (and similarly for other AMO operations)

Comparison with ARM and x86

ARM: Uses LDREX/STREX (load-exclusive / store-exclusive; LDXR/STXR in AArch64), similar to RISC-V’s LR/SC. ARMv8.1 added atomic instructions (LDADD, SWP, etc.) similar to RISC-V’s AMO.

x86: Uses the LOCK prefix with ordinary read-modify-write instructions (LOCK ADD, LOCK CMPXCHG, etc.); XCHG with a memory operand is always locked. x86’s model is more complex but provides strong ordering by default.

RISC-V’s approach is cleaner: LR/SC for flexibility, AMO for common cases, explicit ordering annotations for performance.

11.4 F and D Extensions: Floating-Point

Floating-Point in RISC-V

The F extension adds single-precision (32-bit) floating-point, and the D extension adds double-precision (64-bit). Both follow the IEEE 754 standard, ensuring compatibility with other architectures and programming languages.

Floating-point is optional because not all systems need it. Embedded controllers often work with integers only. But for scientific computing, graphics, and many applications, floating-point is essential.

Floating-Point Register File

F and D extensions add a separate register file with 32 floating-point registers, f0 through f31. Each register is FLEN bits wide, where FLEN is 32 for F-only, 64 for D, and 128 for Q (quad-precision).

The separation of integer and floating-point registers simplifies hardware design and allows both to be accessed simultaneously. It also follows the tradition of RISC architectures like MIPS and SPARC.

Floating-Point CSRs

Three CSRs control floating-point behavior:

fcsr (Floating-Point Control and Status Register): Combined control and status.

frm (Floating-Point Rounding Mode): Bits [7:5] of fcsr, selects rounding mode:

000: Round to nearest, ties to even (default)
001: Round toward zero (truncate)
010: Round down (toward -∞)
011: Round up (toward +∞)
100: Round to nearest, ties to max magnitude

fflags (Floating-Point Exception Flags): Bits [4:0] of fcsr, records exceptions:

NV: Invalid operation
DZ: Divide by zero
OF: Overflow
UF: Underflow
NX: Inexact

// Set rounding mode to round toward zero
write_csr(CSR_FRM, 0b001);

// Check for floating-point exceptions
unsigned int flags = read_csr(CSR_FFLAGS);
if (flags & 0x10) {
    // Invalid operation occurred
}
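Since frm and fflags are just fields of fcsr, the layout can be captured with a few shifts and masks. This is a host-side sketch operating on plain integers; the helper names and the FFLAG_ constants are chosen here for illustration.

```c
#include <stdint.h>

/* fflags bit positions within fcsr[4:0]. */
enum {
    FFLAG_NX = 1 << 0,   /* inexact */
    FFLAG_UF = 1 << 1,   /* underflow */
    FFLAG_OF = 1 << 2,   /* overflow */
    FFLAG_DZ = 1 << 3,   /* divide by zero */
    FFLAG_NV = 1 << 4    /* invalid operation */
};

/* Extract the rounding mode, fcsr[7:5]. */
uint32_t fcsr_frm(uint32_t fcsr)    { return (fcsr >> 5) & 0x7u; }

/* Extract the accrued exception flags, fcsr[4:0]. */
uint32_t fcsr_fflags(uint32_t fcsr) { return fcsr & 0x1Fu; }

/* Build an fcsr value from its two fields. */
uint32_t fcsr_pack(uint32_t frm, uint32_t fflags)
{
    return ((frm & 0x7u) << 5) | (fflags & 0x1Fu);
}
```

For example, rounding mode 001 with NV and NX set packs to 0x31.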

F Extension Instructions

The F extension provides arithmetic, comparison, conversion, and move instructions for single-precision floats:

Arithmetic: FADD.S, FSUB.S, FMUL.S, FDIV.S, FSQRT.S

Fused multiply-add: FMADD.S, FMSUB.S, FNMADD.S, FNMSUB.S

Comparison: FEQ.S, FLT.S, FLE.S

Conversion: FCVT.W.S (float to int), FCVT.S.W (int to float), and variants

Move: FMV.X.W (float reg to int reg), FMV.W.X (int reg to float reg)

Load/Store: FLW, FSW

# Example: Compute (a * b) + c using fused multiply-add
flw  fa0, 0(a0)     # Load a
flw  fa1, 4(a0)     # Load b
flw  fa2, 8(a0)     # Load c
fmadd.s fa3, fa0, fa1, fa2  # fa3 = (a * b) + c
fsw  fa3, 12(a0)    # Store result

D Extension Instructions

The D extension extends F with double-precision operations. All F instructions have D equivalents (FADD.D, FMUL.D, etc.). Additionally, D provides conversions between single and double precision:

FCVT.S.D: Convert double to single (with rounding)
FCVT.D.S: Convert single to double (exact)

# Convert single to double
flw  fa0, 0(a0)     # Load single-precision
fcvt.d.s fa1, fa0   # Convert to double-precision
fsd  fa1, 0(a1)     # Store double-precision

NaN Boxing

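When FLEN is wider than the value being stored, for example a single-precision value in the 64-bit registers of a system with the D extension, the value must be NaN-boxed: the upper 32 bits are set to all 1s. This allows hardware to distinguish between valid single-precision values and invalid data. If an instruction expects a single-precision operand and the upper bits are not all 1s, the operand is treated as the canonical NaN.

Valid single-precision in 64-bit register:
[63:32] = 0xFFFFFFFF
[31:0]  = single-precision value

With only the F extension (FLEN = 32), boxing is not needed because the registers are exactly 32 bits wide, regardless of XLEN.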
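The boxing rule is easy to model on raw bit patterns. In this sketch the function names are illustrative, and 0x7FC00000 is the canonical single-precision NaN that a read of an improperly boxed value is treated as.

```c
#include <stdint.h>

/* Place a single-precision bit pattern into a 64-bit FP register image. */
uint64_t nan_box(uint32_t single_bits)
{
    return UINT64_C(0xFFFFFFFF00000000) | single_bits;
}

/* Read a single-precision operand from a 64-bit FP register image.
 * If the upper 32 bits are not all 1s, the value is not a valid box
 * and is treated as the canonical NaN. */
uint32_t nan_unbox(uint64_t reg)
{
    if ((reg >> 32) != 0xFFFFFFFFu)
        return 0x7FC00000u;
    return (uint32_t)reg;
}
```

For instance, 1.0f (0x3F800000) is held as 0xFFFFFFFF3F800000.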

Performance

Floating-point operations are typically slower than integer operations:

FADD/FSUB: 3-5 cycles latency
FMUL: 4-6 cycles latency
FDIV: 10-20 cycles latency (not pipelined)
FSQRT: 15-30 cycles latency

Fused multiply-add (FMADD) is particularly valuable: it computes (a × b) + c in one instruction with a single rounding, faster and more accurate than separate multiply and add.

11.5 C Extension: Compressed Instructions

The Code Density Problem

RISC architectures traditionally use fixed-length 32-bit instructions. This simplifies decoding and pipelining but wastes memory and instruction cache space. Many common operations (like “add register to register” or “load from stack”) don’t need 32 bits to encode.

The C extension addresses this by adding 16-bit compressed instructions that can be freely mixed with standard 32-bit instructions. This improves code density by 25-30% with minimal hardware complexity.

How Compressed Instructions Work

Compressed instructions are 16 bits and aligned on 16-bit boundaries. The processor’s fetch unit automatically expands them to equivalent 32-bit instructions before decoding. This expansion is transparent to software—compressed instructions are just a more compact encoding.

The low 2 bits of an instruction indicate its length:

00, 01, 10: 16-bit compressed instruction (C extension)
11: 32-bit standard instruction (or longer, when bits [4:2] are also all 1s)

This encoding allows the processor to determine instruction boundaries without pre-decoding.
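A fetch unit or disassembler only needs the first 16-bit parcel to make this decision. The sketch below shows the check in C; the function name is an assumption, and encodings longer than 32 bits are simply reported as 0 rather than decoded.

```c
#include <stdint.h>

/* Returns the instruction length in bytes given its first 16-bit parcel:
 * 2 for a compressed instruction, 4 for a standard 32-bit instruction,
 * 0 for the longer encodings, which this sketch does not handle. */
int insn_length(uint16_t parcel)
{
    if ((parcel & 0x3u) != 0x3u)
        return 2;               /* low bits 00, 01 or 10 */
    if ((parcel & 0x1Cu) != 0x1Cu)
        return 4;               /* low bits 11, bits [4:2] not all 1s */
    return 0;
}
```

Walking a code buffer is then a matter of advancing by the returned length at each step.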

Common Compressed Instructions

The C extension provides compressed forms of the most frequent operations:

C.ADD rd, rs2: Add register to register (expands to ADD rd, rd, rs2)

C.ADDI rd, imm: Add immediate (expands to ADDI rd, rd, imm)

C.LW rd’, offset(rs1’): Load word (expands to LW rd, offset(rs1))

C.SW rs2’, offset(rs1’): Store word (expands to SW rs2, offset(rs1))

C.J offset: Jump (expands to JAL x0, offset)

C.JALR rs1: Jump and link register (expands to JALR x1, 0(rs1))

C.MV rd, rs2: Move register (expands to ADD rd, x0, rs2)

C.LI rd, imm: Load immediate (expands to ADDI rd, x0, imm)

# Standard 32-bit instructions (8 bytes total)
addi sp, sp, -16
sw   ra, 12(sp)

# Compressed equivalents (4 bytes total)
c.addi16sp sp, -16
c.swsp ra, 12(sp)

Register Encoding Restrictions

To fit in 16 bits, compressed instructions have restrictions:

Many use only registers x8-x15 (s0-s1, a0-a5), encoded in 3 bits
Immediates are smaller (6-bit instead of 12-bit)
Offsets are scaled (e.g., word loads use offset×4)

The compiler and assembler handle these restrictions automatically, using compressed instructions when possible and falling back to 32-bit instructions when necessary.
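The kind of check an assembler performs can be illustrated for one instruction. This sketch tests whether lw rd, offset(rs1) fits the C.LW form, which needs both registers in x8-x15 and an unsigned 5-bit offset scaled by 4 (0 to 124); the function names are hypothetical.

```c
#include <stdbool.h>

/* Registers reachable through the 3-bit compressed register fields. */
static bool is_compressed_reg(unsigned reg)
{
    return reg >= 8 && reg <= 15;
}

/* Can `lw rd, offset(rs1)` be encoded as C.LW? */
bool can_use_c_lw(unsigned rd, unsigned rs1, int offset)
{
    return is_compressed_reg(rd)
        && is_compressed_reg(rs1)
        && offset >= 0 && offset <= 124
        && (offset % 4) == 0;
}
```

A load relative to sp fails this test, which is why stack accesses use the separate C.LWSP and C.SWSP forms.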

Code Density Improvement

Typical programs see 25-30% code size reduction with the C extension. This translates to:

Better instruction cache utilization
Reduced memory bandwidth
Lower power consumption (fewer instruction fetches)

For embedded systems with limited flash memory, this can be the difference between fitting the program or not.

Mixing 16-bit and 32-bit Instructions

Compressed and standard instructions can be freely mixed in the same program. The processor handles alignment automatically:

Address  Instruction
0x1000:  c.addi sp, -16     (16-bit)
0x1002:  c.swsp ra, 12(sp)  (16-bit)
0x1004:  jal ra, function   (32-bit)
0x1008:  c.lwsp ra, 12(sp)  (16-bit)

Branch targets and jump addresses can be any 16-bit aligned address, not just 32-bit aligned.

11.6 B Extension: Bit Manipulation

Why Bit Manipulation Matters

Bit manipulation operations—counting leading zeros, rotating bits, extracting bit fields—are common in cryptography, compression, hashing, and low-level systems programming. Without dedicated instructions, these operations require multiple instructions and are slow.

The B extension adds efficient bit manipulation instructions. Unlike M, A, F, D, and C, the B extension is modular, divided into several sub-extensions that can be implemented independently.

B Extension Sub-Extensions

Zba: Address generation instructions (shift-add for array indexing)

Zbb: Basic bit manipulation (count leading zeros, rotate, min/max, sign-extend)

Zbc: Carry-less multiplication (for cryptography)

Zbs: Single-bit operations (set, clear, invert, extract)

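A processor might implement Zba and Zbb for general use, while omitting Zbc if cryptography isn’t needed. Note that the single letter B, as ratified, stands for Zba, Zbb, and Zbs together; Zbc remains a separate option.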

Zba: Address Generation

Zba provides shift-add instructions for efficient array indexing:

SH1ADD rd, rs1, rs2: rd = (rs1 << 1) + rs2
SH2ADD rd, rs1, rs2: rd = (rs1 << 2) + rs2
SH3ADD rd, rs1, rs2: rd = (rs1 << 3) + rs2

// Array indexing: address = base + (index * sizeof(element))
int array[100];
int index = 10;

// Without Zba (3 instructions):
// slli t0, index, 2    # t0 = index * 4
// add  t0, t0, base    # t0 = base + (index * 4)
// lw   a0, 0(t0)

// With Zba (2 instructions):
// sh2add t0, index, base  # t0 = base + (index * 4)
// lw     a0, 0(t0)

Zbb: Basic Bit Manipulation

Zbb provides commonly used bit operations:

CLZ rd, rs: Count leading zeros
CTZ rd, rs: Count trailing zeros
CPOP rd, rs: Count population (number of 1 bits)

ROL rd, rs1, rs2: Rotate left
ROR rd, rs1, rs2: Rotate right

MIN rd, rs1, rs2: Signed minimum
MAX rd, rs1, rs2: Signed maximum
MINU, MAXU: Unsigned variants

SEXT.B, SEXT.H: Sign-extend byte/halfword to XLEN

# Count leading zeros (useful for finding highest set bit)
li   a0, 0x00001000
clz  a1, a0          # a1 = 51 (on RV64)

# Rotate right by 4 bits
li   a0, 0x12345678
rori a1, a0, 4       # RV32: a1 = 0x81234567; RV64: a1 = 0x8000000001234567 (use roriw for a 32-bit rotate)
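For comparison, these are the multi-step software fallbacks that a single Zbb instruction replaces. The C models below are a sketch for checking results on a host; the function names are illustrative, and the zero-input results follow the instruction definitions (CLZ and CTZ of zero return the register width).

```c
#include <stdint.h>

/* Model of CLZ on RV64: number of zero bits above the highest set bit. */
unsigned clz64(uint64_t x)
{
    unsigned n = 0;
    if (x == 0) return 64;
    while ((x >> 63) == 0) { x <<= 1; n++; }
    return n;
}

/* Model of CTZ on RV64: number of zero bits below the lowest set bit. */
unsigned ctz64(uint64_t x)
{
    unsigned n = 0;
    if (x == 0) return 64;
    while ((x & 1) == 0) { x >>= 1; n++; }
    return n;
}

/* Model of CPOP: clear the lowest set bit until none remain. */
unsigned cpop64(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) { x &= x - 1; n++; }
    return n;
}

/* Models of ROR (64-bit) and RORW (32-bit, before sign extension). */
uint64_t ror64(uint64_t x, unsigned s)
{
    s &= 63;
    return s ? (x >> s) | (x << (64 - s)) : x;
}

uint32_t ror32(uint32_t x, unsigned s)
{
    s &= 31;
    return s ? (x >> s) | (x << (32 - s)) : x;
}
```

The two rotate widths give different answers for the same input, which is why the register width matters when reading a rotate example.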

Zbc: Carry-Less Multiplication

Zbc provides carry-less multiplication, used in cryptographic algorithms like AES-GCM:

CLMUL rd, rs1, rs2: Carry-less multiply (low half)
CLMULH rd, rs1, rs2: Carry-less multiply (high half)

Carry-less multiplication is like normal multiplication but without carries between bit positions—essentially XOR instead of ADD.
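The "XOR instead of ADD" description translates directly into a shift-and-XOR loop. The following C model of the 64-bit operations is a sketch for checking small cases by hand; the function names mirror the mnemonics but are otherwise arbitrary.

```c
#include <stdint.h>

/* Model of CLMUL on RV64: low 64 bits of the carry-less product.
 * For every set bit i of b, XOR in a shifted left by i. */
uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Model of CLMULH: high 64 bits of the same product, i.e. the bits
 * that each shifted copy of a pushes past bit 63. */
uint64_t clmulh64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 1; i < 64; ++i)
        if ((b >> i) & 1)
            r ^= a >> (64 - i);
    return r;
}
```

As a hand check, binary 11 times 11 is 1001 with ordinary multiplication but 101 without carries, because the two middle partial-product bits cancel under XOR.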

Zbs: Single-Bit Operations

Zbs provides instructions for manipulating individual bits:

BSET rd, rs1, rs2: Set bit rs2 in rs1
BCLR rd, rs1, rs2: Clear bit rs2 in rs1
BINV rd, rs1, rs2: Invert bit rs2 in rs1
BEXT rd, rs1, rs2: Extract bit rs2 from rs1

# Set bit 5 in register a0
bseti a0, a0, 5      # a0 |= (1 << 5)

# Extract bit 3 from register a1
bexti a2, a1, 3      # a2 = (a1 >> 3) & 1

Performance Impact

Bit manipulation instructions typically execute in 1 cycle, same as basic ALU operations. Without them, equivalent operations might take 3-10 instructions. For cryptography and compression, this can mean 2-5× speedup.

11.7 Zicsr and Zifencei

Zicsr: CSR Instructions

The Zicsr extension defines the CSR (Control and Status Register) instructions we’ve used throughout this book: CSRRW, CSRRS, CSRRC, and their immediate variants.

Historically, these were part of the base I extension. But to keep the base ISA truly minimal, they were separated into Zicsr. Any system that needs to access CSRs (which is almost all systems) implements Zicsr.

Zifencei: Instruction Fence

The Zifencei extension provides the FENCE.I instruction, which synchronizes instruction and data caches. This is necessary when code is modified at runtime (self-modifying code, JIT compilation, dynamic linking).

FENCE.I: Ensures that all previous stores to instruction memory are visible to subsequent instruction fetches.

# Example: JIT compiler writes new code to memory
sw   a0, 0(a1)       # Write instruction to memory
sw   a2, 4(a1)       # Write another instruction
fence.i              # Synchronize I-cache and D-cache
jalr a1              # Jump to newly written code

Without FENCE.I, the processor might execute stale instructions from the I-cache instead of the newly written code.

Like Zicsr, Zifencei was separated from the base ISA to keep it minimal. Systems that don’t modify code at runtime can omit it.

11.8 RVA22 Profile

The Need for Profiles

With so many optional extensions, how do software developers know what features they can rely on? A processor implementing just RV64I is very different from one implementing RV64IMAFDCV.

Profiles solve this problem by defining standard combinations of extensions for specific use cases. Software targeting a profile can assume all mandatory features are present.

RVA22 Profile

The RVA22 profile (ratified in 2022) targets application processors capable of running rich operating systems like Linux. It comes in two variants:

RVA22U (Unprivileged): Specifies the user-mode ISA. Mandatory extensions include:

RV64I base ISA
M, A, F, D, C extensions (i.e., RV64GC)
Zicsr, Zifencei
Zba, Zbb, Zbs (address generation and basic bit manipulation)
Various other Z-extensions for specific functionality

RVA22S (Supervisor): Adds supervisor-mode requirements for OS support:

Sv39 virtual memory (39-bit virtual addresses)
Supervisor mode and required CSRs
SBI (Supervisor Binary Interface) support
Additional privilege-related extensions

Profile Compliance

A processor claiming RVA22S compliance guarantees it can run standard Linux distributions and other Unix-like operating systems without modification. This is crucial for software portability.

Later profiles add more features. RVA23 makes the Vector extension (V) mandatory, recognizing the importance of SIMD for modern applications.

Embedded Profiles

Separate profiles exist for embedded systems:

Microcontroller profiles (RVM): Minimal feature sets for resource-constrained devices
Real-time profiles: Add requirements for deterministic interrupt handling

These profiles ensure that embedded software can target well-defined platforms.

🛠️ Hands-on Lab: Lab 11.1 — The Power of Hardware Acceleration (Soft vs Hard Mul)

This lab demonstrates the performance difference between “having the M Extension” and “not having the M Extension” through compiler options.

Lab Objectives

Use the same C code (multiplication operations)
Compile for RV64I (no multiply instruction) and RV64IM (with multiply instruction)
Observe Assembly differences
Compare execution cycles

Code

Create mul_test.c:

// mul_test.c - Compare software vs hardware multiply
#include <stdint.h>

// Read cycle counter
static inline uint64_t read_cycles(void) {
    uint64_t val;
    asm volatile("csrr %0, mcycle" : "=r"(val));
    return val;
}

// Simple multiply function
long multiply(long a, long b) {
    return a * b;
}

// Multiple multiplication test
volatile long result;
void bench_multiply(int iterations) {
    long a = 123456;
    long b = 789012;
    for (int i = 0; i < iterations; i++) {
        result = multiply(a, b);
        a++;
    }
}

int main(void) {
    int iterations = 10000;

    uint64_t start = read_cycles();
    bench_multiply(iterations);
    uint64_t end = read_cycles();

    // Keep the measurement observable; print it or inspect it in a debugger
    volatile uint64_t cycles = end - start;
    (void)cycles;
    return 0;
}

Experiment Steps

Step A: Compile as RV64IM (with multiply instruction)

# Tell compiler it can use multiply instructions (mul, mulw, etc.)
# (_zicsr is needed by recent toolchains for the csrr in read_cycles)
riscv64-unknown-elf-gcc -O2 -march=rv64im_zicsr -mabi=lp64 \
    -c mul_test.c -o mul_hard.o

# View Assembly
riscv64-unknown-elf-objdump -d mul_hard.o

Observe Assembly:

multiply:
    mul     a0, a0, a1    # Direct hardware multiply, a single instruction
    ret

Step B: Compile as RV64I (no multiply instruction)

# Tell compiler "this CPU doesn't know multiplication"
# (_zicsr is needed by recent toolchains for the csrr in read_cycles)
riscv64-unknown-elf-gcc -O2 -march=rv64i_zicsr -mabi=lp64 \
    -c mul_test.c -o mul_soft.o

# View Assembly
riscv64-unknown-elf-objdump -d mul_soft.o

Observe Assembly:

multiply:
    call    __muldi3      # Call software emulation library (libgcc)

Analysis

__muldi3 is libgcc’s software multiply implementation, internally composed of dozens of add, shift, branch instructions:

# Simplified logic of __muldi3 (Shift-and-Add algorithm)
__muldi3:
    li t0, 0          # result = 0
loop:
    andi t1, a1, 1    # if (b & 1)
    beqz t1, skip
    add t0, t0, a0    #   result += a
skip:
    slli a0, a0, 1    # a <<= 1
    srli a1, a1, 1    # b >>= 1
    bnez a1, loop     # while (b != 0)
    mv a0, t0
    ret
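The same shift-and-add logic reads more easily in C. This is a model of the simplified listing above, useful for checking it on a host machine; it is not the actual libgcc source, and the function name soft_mul is made up.

```c
#include <stdint.h>

/* Shift-and-add multiply: for every set bit of b, add the matching
 * shifted copy of a. The result wraps modulo 2^64, like MUL on RV64. */
uint64_t soft_mul(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    while (b != 0) {
        if (b & 1)
            result += a;    /* this bit of b contributes a << position */
        a <<= 1;
        b >>= 1;
    }
    return result;
}
```

The loop runs once per significant bit of b, so the cost of software multiplication depends on the operand values, unlike a pipelined hardware multiplier.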

Expected Results

Config	Single Multiply Cycles	10000 Multiplies
RV64IM	~1-4 cycles	~10,000-40,000 cycles
RV64I	~30-60 cycles	~300,000-600,000 cycles

Conclusion: Hardware M extension can be 10-50x faster than software emulation!

danieRTOS Reference: The danieRTOS Makefile uses -march=rv64gc to ensure all standard extensions are available.

Design Trade-off

💭 Why isn’t the M Extension mandatory?

In extremely resource-constrained embedded systems (such as 8-bit compatible microcontrollers), chip designers may choose not to implement a hardware multiplier to save transistors. In such cases, the compiler automatically uses software emulation. RISC-V’s modularity makes this trade-off possible—you only pay for what you need.

⚠️ Common Pitfalls

Pitfall 1: Misunderstanding G’s Composition

Misconception: “RV64G includes compressed instructions (C)”

Correct Understanding:

G = IMAFD, does NOT include C
GC = IMAFD + C
Linux typically requires RV64GC because most distributions default to C extension for space savings

# ❌ Wrong: Thinking G includes C
riscv64-linux-gnu-gcc -march=rv64g  # Actually doesn't have C

# ✅ Correct: Explicitly specify GC
riscv64-linux-gnu-gcc -march=rv64gc

Pitfall 2: misa Detection Trap

Error Scenario: Reading misa in S-mode or U-mode.

Consequence: Illegal Instruction Exception, because misa is an M-mode CSR.

// ❌ Wrong: Reading misa directly in S-mode
unsigned long misa;
asm volatile("csrr %0, misa" : "=r"(misa));  // Exception!

// ✅ Correct: Get info via SBI or Device Tree
// Or probe a specific instruction with a trap handler installed to catch the illegal-instruction exception

Pitfall 3: Ignoring Zicsr and Zifencei

Error Scenario: In newer spec versions, CSR operations and FENCE.I have been separated from I.

Consequence: Using old -march=rv64i may cause problems.

# Old version (pre-2019)
# CSR operations were part of I

# New version (post-2019)
# CSR operations require explicit Zicsr
riscv64-unknown-elf-gcc -march=rv64i_zicsr_zifencei ...

# Simpler approach: Use G (it implies Zicsr and Zifencei)
riscv64-unknown-elf-gcc -march=rv64gc ...

💡 Tip: In practice, use standard combinations like rv64gc, or spell out rv64imac_zicsr_zifencei when targeting a core without floating-point, to avoid missing essential extensions.

Summary

RISC-V’s modular extension system is one of its greatest strengths. The base ISA provides a minimal foundation, while standard extensions add functionality as needed. This chapter covered the core extensions that form the basis of most RISC-V systems.

M extension adds integer multiplication and division, essential for most applications beyond the simplest embedded systems. The separate high and low multiply instructions efficiently handle multi-word arithmetic, while division provides defined behavior even for division by zero.

A extension provides atomic operations for multi-processor synchronization. Load-reserved and store-conditional offer flexible primitives for building lock-free algorithms, while atomic memory operations provide efficient implementations of common patterns. Explicit ordering annotations give programmers fine-grained control over memory consistency.

F and D extensions add IEEE 754 floating-point arithmetic with a separate register file and dedicated CSRs for rounding modes and exception flags. Fused multiply-add instructions provide both performance and accuracy benefits. The separation of single and double precision allows implementations to include only what they need.

C extension improves code density by 25-30% through 16-bit compressed instructions that expand transparently to 32-bit equivalents. This reduces memory usage and improves cache efficiency with minimal hardware complexity, making it valuable for both embedded systems and high-performance processors.

B extension adds efficient bit manipulation through modular sub-extensions. Address generation instructions accelerate array indexing, basic bit manipulation provides common operations like count-leading-zeros and rotate, and specialized instructions support cryptography and other domains.

Zicsr and Zifencei complete the picture by providing CSR access and instruction cache synchronization. Though separated from the base ISA for minimality, they’re essential for almost all practical systems.

RVA22 profile ties these extensions together into a coherent platform for application processors. By mandating specific extensions and versions, profiles ensure software portability while preserving RISC-V’s flexibility.

In the next chapter, we’ll explore the Vector extension, RISC-V’s approach to SIMD and data-parallel processing.

Chapter 12. Vector Processing & SIMD Comparison

Part VII — ISA Extensions

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand SISD vs SIMD: Grasp the difference between scalar and vector operations
Master VLA Core Concepts: Understand the value of Vector-Length Agnostic design
Use the vsetvli Instruction: Perform Strip-mining (chunked processing)
Write Vector Loops: Master the standard VLA loop structure
Compare Different SIMD Architectures: Understand RISC-V V vs ARM SVE vs x86 AVX differences

💡 Scenario: The Flexible Noodle Cutter

Scene: Junior is staring at SIMD code on the screen, frustrated.

Junior: “Architect, this is painful. I wrote an optimized version for 128-bit hardware before. Now the company switched to a new chip that supports 512-bit. My original loop only cuts 4 floats at a time, but now I could cut 16. I have to rewrite the entire loop logic, including the ‘tail’ handling.”

Architect: “That’s the downside of Fixed-width SIMD. It’s like a fixed-size cookie cutter. Using a small cutter on a big dough is inefficient; switch to a bigger cutter, and your old recipe (code) doesn’t work anymore.”

Junior: “How does RISC-V solve this?”

Architect: “RISC-V’s V (Vector) Extension uses a design called VLA (Vector-Length Agnostic).

Imagine you have a ‘smart noodle cutting machine’. You don’t need to tell it ‘I want 4 pieces’ or ‘I want 16 pieces.’ You just say: ‘Here’s a big lump of dough (total length N), please cut it with your maximum capacity.’”

Junior: “Maximum capacity?”

Architect: “Right. The hardware responds: ‘Report: my blade can cut 8 pieces at a time (vl).’

Then you cut 8 pieces, push the dough forward, and ask again.

When you’re down to the last 3 pieces of dough, you don’t need to change cutters—the hardware automatically tells you: ‘This time I’ll only cut 3.’

Code written this way runs on 128-bit or 1024-bit machines without changing a single line, automatically achieving maximum performance.”

Junior: “Wow, even the tail is handled automatically? Teach me this instruction!”

Architect: “This is the legendary artifact: vsetvli.”

Modern applications increasingly demand data-parallel processing. Image processing applies the same filter to millions of pixels. Machine learning performs matrix operations on thousands of elements. Scientific simulations compute physics equations across vast grids. These workloads share a common pattern: the same operation repeated on different data.

Traditional scalar processors handle one operation at a time. To process 1000 elements, they execute 1000 separate instructions. Vector processors, by contrast, operate on multiple elements simultaneously with a single instruction—Single Instruction, Multiple Data (SIMD). This can provide 4×, 8×, or even greater speedups for data-parallel code.

Every major architecture offers SIMD extensions: x86 has SSE and AVX, ARM has NEON and SVE. RISC-V’s answer is the V extension (Vector), ratified in 2021. But RISC-V takes a different approach from its predecessors. Instead of fixed-width vectors that become obsolete as hardware improves, RISC-V uses vector-length agnostic programming—code that adapts automatically to different hardware implementations.

This chapter explores the V extension’s design, compares it with ARM and x86 SIMD, and shows how to write efficient vector code. We’ll see why RISC-V’s approach offers better long-term scalability than traditional SIMD architectures.

12.1 Vector Extension Overview

The SIMD Evolution

SIMD extensions have evolved through multiple generations, each adding wider vectors:

x86: MMX (64-bit) → SSE (128-bit) → AVX (256-bit) → AVX-512 (512-bit)
ARM: NEON (128-bit) → SVE (128-2048 bits, scalable)

Each generation requires new instructions and software rewrites. Code optimized for 128-bit vectors doesn’t automatically benefit from 256-bit hardware. This creates a dilemma: should compilers target narrow vectors for compatibility or wide vectors for performance?

Vector-Length Agnostic Programming

RISC-V’s V extension solves this with vector-length agnostic (VLA) programming. Instead of specifying exact vector widths, programs specify operations on abstract vectors. The hardware determines the actual vector length at runtime based on its capabilities.

A program written for V extension runs on any implementation, from embedded processors with 128-bit vectors to supercomputers with 4096-bit vectors, automatically using the available width. This future-proofs software and simplifies compiler design.

Key Concepts

VLEN: Vector register length in bits, implementation-defined. It must be a power of 2 no greater than 65536 (2^16). The full V extension requires VLEN ≥ 128; the embedded Zve* subsets permit smaller values. A processor might have VLEN=256 (256-bit vectors) or VLEN=512 (512-bit vectors).

ELEN: Maximum element width in bits, implementation-defined (a power of 2, at least 8). The full V extension requires ELEN=64; the Zve32* subsets use ELEN=32. ELEN determines the largest element type (e.g., ELEN=64 supports 64-bit integers and doubles).

SEW: Selected element width in bits, chosen by software (8, 16, 32, or 64). Determines how many elements fit in a vector register.

LMUL: Vector register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8). Allows using multiple registers as a single logical vector for larger operations.

AVL: Application vector length, the number of elements the application wants to process.

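VL: Vector length, the number of elements actually processed by an instruction (VL ≤ AVL and VL ≤ VLMAX, where VLMAX = VLEN/SEW × LMUL).

In the common case the relationship is: VL = min(AVL, VLEN/SEW × LMUL). The specification guarantees VL = AVL when AVL ≤ VLMAX and VL = VLMAX when AVL ≥ 2 × VLMAX; in between, an implementation may return any value from ceil(AVL/2) to VLMAX, for example to balance the last two iterations.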
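The relationship can be checked with a few lines of portable C. This is a minimal sketch, not part of any RISC-V toolchain: the helper names vlmax_for and vl_for are hypothetical, LMUL is passed as a fraction (numerator and denominator), and the model assumes the hardware uses the simple min() rule.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL given as num/den (1/2 -> 1, 2). */
size_t vlmax_for(size_t vlen_bits, size_t sew_bits,
                 size_t lmul_num, size_t lmul_den)
{
    assert(sew_bits != 0 && lmul_den != 0);
    return (vlen_bits / sew_bits) * lmul_num / lmul_den;
}

/* Simplest legal choice of VL: the smaller of AVL and VLMAX. */
size_t vl_for(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}
```

For VLEN=256, SEW=32, and LMUL=1, vlmax_for(256, 32, 1, 1) is 8, so an AVL of 100 yields VL=8 and an AVL of 3 yields VL=3.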

Figure 12.1: Vector Register Organization

Vector Register File (32 registers: v0-v31)
    Each register: VLEN bits (implementation-defined)

Element Width (SEW) - determines elements per register:
    VLEN = 256 bits (example)
    ├─ SEW=8:  32 elements (bytes)
    ├─ SEW=16: 16 elements (halfwords)
    ├─ SEW=32:  8 elements (words)
    └─ SEW=64:  4 elements (doublewords)

Register Grouping (LMUL) - use multiple registers as one:
    ├─ LMUL=1: 1 register  (e.g., v0)
    ├─ LMUL=2: 2 registers (e.g., v0-v1)
    ├─ LMUL=4: 4 registers (e.g., v0-v3)
    └─ LMUL=8: 8 registers (e.g., v0-v7)

A vector register can be interpreted with different element widths (SEW), and multiple consecutive registers can be grouped (LMUL) for larger operations.

12.2 Vector Register Organization

Vector Register File

The V extension adds 32 vector registers, v0 through v31. Each register is VLEN bits wide, where VLEN is implementation-defined. Unlike scalar registers which are always 32 or 64 bits, vector registers can be 128, 256, 512, or even larger.

Element Width and Capacity

A vector register holds multiple elements. The number depends on the selected element width (SEW):

Number of elements = VLEN / SEW

For VLEN=256:

SEW=8 (byte): 32 elements
SEW=16 (halfword): 16 elements
SEW=32 (word): 8 elements
SEW=64 (doubleword): 4 elements

Software selects SEW based on the data type being processed.

Register Grouping (LMUL)

Sometimes you need to process more elements than fit in one register. LMUL (register group multiplier) allows treating multiple consecutive registers as a single logical vector:

LMUL=1: Use 1 register (default)
LMUL=2: Use 2 consecutive registers (e.g., v0-v1)
LMUL=4: Use 4 consecutive registers (e.g., v0-v3)
LMUL=8: Use 8 consecutive registers (e.g., v0-v7)

With LMUL=2 and SEW=32 on VLEN=256, you get 16 elements (8 per register × 2 registers).

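LMUL can also be fractional (1/2, 1/4, 1/8), in which case an operand occupies only part of a register. This matters mostly in mixed-width code: the narrow operands use a fractional LMUL so that the widest operands still fit in a single register instead of a register group, which keeps more registers free.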

Register Alignment

When LMUL > 1, register numbers must be aligned:

LMUL=2: Use v0, v2, v4, … (even registers)
LMUL=4: Use v0, v4, v8, … (multiples of 4)
LMUL=8: Use v0, v8, v16, v24 (multiples of 8)

This simplifies hardware implementation.
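The alignment rule and its cost in register count can be expressed as two small checks. This is an illustrative sketch only; the function names are hypothetical and it covers integer LMUL values (fractional LMUL uses single registers and has no alignment constraint).

```c
#include <assert.h>
#include <stdbool.h>

/* True if vector register number vreg may start a group of lmul registers. */
bool vreg_group_is_aligned(unsigned vreg, unsigned lmul)
{
    assert(vreg < 32);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return (vreg % lmul) == 0;
}

/* Number of independent register groups available at a given integer LMUL. */
unsigned vreg_group_count(unsigned lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return 32u / lmul;
}
```

With LMUL=8, only v0, v8, v16, and v24 are valid group bases, so vreg_group_count(8) is 4.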

12.3 Vector Configuration

The vtype CSR

Vector operations are configured through the vtype CSR (vector type register), which specifies:

SEW: Selected element width (8, 16, 32, or 64 bits)
LMUL: Register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8)
vta: Vector tail agnostic (how to handle elements beyond VL)
vma: Vector mask agnostic (how to handle masked-off elements)

The vsetvl Instruction

Before executing vector instructions, software must configure vtype and set VL using one of the vsetvl family of instructions (vsetvli, vsetivli, or vsetvl). The most common form is vsetvli:

vsetvli rd, rs1, vtypei: Set VL and vtype. rs1 contains AVL (requested vector length), vtypei encodes SEW, LMUL, vta, and vma, and rd receives the actual VL.

# Configure for 32-bit elements, LMUL=1
li a0, 100              # AVL = 100 elements to process
vsetvli t0, a0, e32, m1 # Set SEW=32, LMUL=1, VL = min(AVL, VLEN/32)
                        # t0 now contains actual VL

The hardware sets VL to the smaller of:

AVL (what the application requested)
VLEN/SEW × LMUL (what the hardware can handle)

If AVL=100 but the hardware can only process 8 elements at a time (VLEN=256, SEW=32, LMUL=1), then VL=8. The application must loop to process all 100 elements.

Vector-Length Agnostic Loop

Here’s the standard pattern for processing an array:

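#include <riscv_vector.h>   // RVV C intrinsics (v1.0 naming, __riscv_ prefix)
#include <stddef.h>
#include <stdint.h>

void vadd_vv(int32_t *dst, const int32_t *src1, const int32_t *src2, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);  // Set VL for remaining elements

        vint32m1_t v1 = __riscv_vle32_v_i32m1(&src1[i], vl);  // Load src1[i:i+vl]
        vint32m1_t v2 = __riscv_vle32_v_i32m1(&src2[i], vl);  // Load src2[i:i+vl]
        vint32m1_t v3 = __riscv_vadd_vv_i32m1(v1, v2, vl);    // v3 = v1 + v2
        __riscv_vse32_v_i32m1(&dst[i], v3, vl);               // Store dst[i:i+vl]
    }
}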

This code works on any VLEN. With VLEN=128 it processes up to 4 elements per iteration; with VLEN=512, up to 16. No code changes are needed.
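To see the strip-mining behaviour without vector hardware, the loop can be modelled in plain C. This is a sketch under stated assumptions: the function name vadd_strip_mined is hypothetical, the vlmax parameter stands in for the hardware's VLEN/SEW × LMUL, and VL is chosen with the simple min() rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the strip-mined loop. Returns the number of strips
   (loop iterations) that were needed to cover all n elements. */
size_t vadd_strip_mined(int32_t *dst, const int32_t *src1,
                        const int32_t *src2, size_t n, size_t vlmax)
{
    assert(vlmax > 0);
    size_t strips = 0;
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        size_t remaining = n - i;
        vl = remaining < vlmax ? remaining : vlmax;  /* models vsetvli */
        for (size_t j = 0; j < vl; j++)              /* models one vadd.vv */
            dst[i + j] = src1[i + j] + src2[i + j];
        strips++;
    }
    return strips;
}
```

For n=100, a vlmax of 4 takes 25 strips and a vlmax of 16 takes 7 (six full strips plus a final strip of 4 elements), with identical results in dst.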

Encoding vtype

The vtypei immediate in vsetvli encodes SEW and LMUL:

vtypei[2:0] = LMUL encoding:
  000 = LMUL=1, 001 = LMUL=2, 010 = LMUL=4, 011 = LMUL=8
  101 = LMUL=1/8, 110 = LMUL=1/4, 111 = LMUL=1/2

vtypei[5:3] = SEW encoding:
  000 = SEW=8, 001 = SEW=16, 010 = SEW=32, 011 = SEW=64

vtypei[6] = vta (tail agnostic)
vtypei[7] = vma (mask agnostic)

The assembler provides convenient mnemonics: e32, m1 means SEW=32, LMUL=1.
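The field layout above can be turned into a small encoder, which is roughly what the assembler does when it sees the mnemonics. This is a minimal sketch; the enum and the function name encode_vtypei are hypothetical and not part of any toolchain API.

```c
#include <assert.h>
#include <stdint.h>

/* vlmul field encodings from the table above (100 is reserved). */
enum vlmul { LMUL_1 = 0, LMUL_2 = 1, LMUL_4 = 2, LMUL_8 = 3,
             LMUL_F8 = 5, LMUL_F4 = 6, LMUL_F2 = 7 };

/* Build vtypei: vlmul in [2:0], vsew in [5:3], vta in [6], vma in [7]. */
uint32_t encode_vtypei(unsigned sew_bits, enum vlmul lmul,
                       unsigned vta, unsigned vma)
{
    unsigned vsew;
    switch (sew_bits) {
    case 8:  vsew = 0; break;
    case 16: vsew = 1; break;
    case 32: vsew = 2; break;
    case 64: vsew = 3; break;
    default: assert(!"unsupported SEW"); return 0;
    }
    return (uint32_t)lmul | (vsew << 3) | ((vta & 1u) << 6) | ((vma & 1u) << 7);
}
```

Under this layout, e32, m1, ta, ma encodes to 0xD0.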

12.4 Vector Arithmetic and Logic

Vector-Vector Operations

Vector arithmetic instructions operate on corresponding elements from two vector registers:

vadd.vv vd, vs2, vs1: vd[i] = vs2[i] + vs1[i] for i = 0 to VL-1

vsub.vv, vmul.vv, vdiv.vv: Subtraction, multiplication, division

vand.vv, vor.vv, vxor.vv: Bitwise AND, OR, XOR

# Vector addition: v3 = v1 + v2
vsetvli t0, a0, e32, m1
vle32.v v1, (a1)        # Load first vector
vle32.v v2, (a2)        # Load second vector
vadd.vv v3, v1, v2      # Add element-wise
vse32.v v3, (a3)        # Store result

Vector-Scalar Operations

Often you need to add the same scalar to all vector elements. Vector-scalar instructions use a scalar register (x register) as the second operand:

vadd.vx vd, vs2, rs1: vd[i] = vs2[i] + rs1 for all i

# Add constant 10 to all elements
li a0, 10
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vadd.vx v2, v1, a0      # v2[i] = v1[i] + 10
vse32.v v2, (a3)

Vector-Immediate Operations

For small constants, vector-immediate instructions avoid loading into a scalar register:

vadd.vi vd, vs2, imm: vd[i] = vs2[i] + imm (imm is 5-bit signed)

# Increment all elements by 1
vadd.vi v2, v1, 1       # v2[i] = v1[i] + 1

Widening and Narrowing Operations

Widening operations produce results twice as wide as the inputs:

vwaddu.vv vd, vs2, vs1: Widening unsigned add (e.g., 32-bit inputs → 64-bit results)

vwadd.vv: Widening signed add

Narrowing operations reduce width:

vnsrl.wv vd, vs2, vs1: Narrowing shift right logical (e.g., 64-bit inputs → 32-bit results)

These are essential for avoiding overflow in accumulations or reducing precision after computation.
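The overflow problem that widening solves is easy to reproduce in scalar C. The sketch below models an 8-bit vadd.vv next to an 8-bit vwaddu.vv; the function names are hypothetical and the loops stand in for a single vector instruction of length vl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models vadd.vv at SEW=8: results wrap modulo 256. */
void add_u8_same_width(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t vl)
{
    assert(dst && a && b);
    for (size_t i = 0; i < vl; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Models vwaddu.vv at SEW=8: each result is 16 bits wide, so no bits are lost. */
void add_u8_widening(uint16_t *dst, const uint8_t *a,
                     const uint8_t *b, size_t vl)
{
    assert(dst && a && b);
    for (size_t i = 0; i < vl; i++)
        dst[i] = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
}
```

Adding 200 and 100 gives 44 in the same-width version and 300 in the widening version.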

Fused Multiply-Add

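Vector fused multiply-add computes (a × b) + c in one instruction with a single rounding step:

vfmadd.vv vd, vs1, vs2: vd[i] = (vd[i] × vs1[i]) + vs2[i] (the destination is overwritten as a multiplicand)

vfmacc.vv vd, vs1, vs2: vd[i] = (vs1[i] × vs2[i]) + vd[i] (the destination is the accumulator)

This is crucial for matrix multiplication and other linear algebra operations, where the accumulating form vfmacc.vv is usually the natural fit.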

12.5 Vector Memory Operations

Unit-Stride Loads and Stores

The most common memory access pattern is unit-stride: consecutive elements in memory.

vle32.v vd, (rs1): Load VL elements of 32-bit width from address rs1

vse32.v vs3, (rs1): Store VL elements of 32-bit width to address rs1

# Load 32-bit integers from array
vsetvli t0, a0, e32, m1
vle32.v v1, (a1)        # Load v1[0:VL-1] from memory[a1]

The number of bytes loaded is VL × SEW/8. For VL=8 and SEW=32, this loads 32 bytes.

Strided Loads and Stores

Strided access loads elements separated by a constant stride:

vlse32.v vd, (rs1), rs2: Load elements from rs1, rs1+rs2, rs1+2×rs2, …

vsse32.v vs3, (rs1), rs2: Store with stride

# Load every other element (stride = 8 bytes for 32-bit elements)
li t1, 8                # The stride is a byte count held in a scalar register
vlse32.v v1, (a1), t1   # Load a1[0], a1[2], a1[4], ...

This is useful for accessing matrix columns or interleaved data.
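A scalar model makes the addressing rule concrete: element i comes from base + i × stride, with the stride measured in bytes. The sketch below is illustrative only; strided_load_i32 is a hypothetical name and the loop stands in for one vlse32.v.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vlse32.v: element i is read from base + i * stride_bytes. */
void strided_load_i32(int32_t *dst, const void *base,
                      ptrdiff_t stride_bytes, size_t vl)
{
    assert(dst && base);
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&dst[i], p + (ptrdiff_t)i * stride_bytes, sizeof dst[i]);
}
```

To read a column of a row-major matrix with 4 columns of 32-bit values, pass the address of the column's first element and a stride of 16 bytes (one full row).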

Indexed (Scatter/Gather) Loads and Stores

Indexed access uses a vector of indices to load/store non-contiguous elements. This is also called “gather” (for loads) and “scatter” (for stores).

vluxei32.v vd, (rs1), vs2: Load elements from rs1+vs2[i] for each i (unordered); vs2 holds 32-bit byte offsets

vsuxei32.v vs3, (rs1), vs2: Store with indices (unordered)

# Example: Gather operation
# Suppose we have an array a[] of 32-bit values and want a[1], a[3], a[5], a[2]
# The index vector holds element indices [1, 3, 5, 2]; scale them to byte offsets
vle32.v v1, (a1)        # Load index vector: v1 = [1, 3, 5, 2]
vsll.vi v1, v1, 2       # Multiply by 4: v1 = [4, 12, 20, 8] (byte offsets)
vluxei32.v v2, (a2), v1 # Gather: v2[0]=a[1], v2[1]=a[3], v2[2]=a[5], v2[3]=a[2]

The hardware treats each index as an unscaled byte offset: for each element i, the instruction loads from address base + offset[i]. Software therefore multiplies element indices by the element size first, as the vsll.vi above does. With v1 = [4, 12, 20, 8], the gather loads:

v2[0] = memory[a2 + 4] (element at index 1)
v2[1] = memory[a2 + 12] (element at index 3)
v2[2] = memory[a2 + 20] (element at index 5)
v2[3] = memory[a2 + 8] (element at index 2)

This is essential for sparse matrix operations, indirect addressing, and accessing non-contiguous data.
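A scalar model of the gather is sketched below. It assumes, as the ratified V specification defines, that each index is an unscaled byte offset, so element indices must be multiplied by the element size before use. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert element indices to byte offsets for 4-byte elements. */
void indices_to_offsets_i32(uint32_t *offsets, const uint32_t *indices,
                            size_t vl)
{
    assert(offsets && indices);
    for (size_t i = 0; i < vl; i++)
        offsets[i] = indices[i] * (uint32_t)sizeof(int32_t);
}

/* Scalar model of vluxei32.v with 32-bit data: dst[i] = *(base + offset[i]). */
void gather_i32(int32_t *dst, const void *base,
                const uint32_t *byte_offsets, size_t vl)
{
    assert(dst && base && byte_offsets);
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&dst[i], p + byte_offsets[i], sizeof dst[i]);
}
```

Element indices [1, 3, 5, 2] become byte offsets [4, 12, 20, 8], and the gather then returns a[1], a[3], a[5], a[2].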

Segment Loads and Stores

Segment operations load/store groups of elements (like struct fields):

vlseg2e32.v vd, (rs1): Load 2-field segments (e.g., {x, y} pairs)

vsseg2e32.v vs3, (rs1): Store 2-field segments

// Load array of {x, y} pairs
struct point { int x, y; };
struct point points[100];

vlseg2e32.v v1, (a0)    # v1 = all x values, v2 = all y values

This converts array-of-structures (AoS) data in memory into structure-of-arrays (SoA) form in registers on load, and back again on store, without separate shuffle instructions.

Figure 12.2a: Unit-Stride Access

Unit-Stride (consecutive elements):
Memory:   [0] [1] [2] [3] [4] [5] [6] [7]
           ↓   ↓   ↓   ↓
Vector:   [0] [1] [2] [3]

Figure 12.2b: Strided Access

Strided (every 2nd element; stride = 2 elements, given to the instruction as a byte count):
Memory:   [0] [1] [2] [3] [4] [5] [6] [7]
           ↓       ↓       ↓       ↓
Vector:   [0]     [2]     [4]     [6]

Figure 12.2c: Indexed (Gather) Access

graph TB
    subgraph "Index Vector"
        idx0["indices[0] = 1"]
        idx1["indices[1] = 3"]
        idx2["indices[2] = 5"]
        idx3["indices[3] = 2"]
    end

    subgraph "Memory Array"
        m0["mem[0] = 0"]
        m1["mem[1] = 1"]
        m2["mem[2] = 2"]
        m3["mem[3] = 3"]
        m4["mem[4] = 4"]
        m5["mem[5] = 5"]
        m6["mem[6] = 6"]
        m7["mem[7] = 7"]
    end

    subgraph "Result Vector"
        v0["vector[0] = 1"]
        v1["vector[1] = 3"]
        v2["vector[2] = 5"]
        v3["vector[3] = 2"]
    end

    idx0 --> m1
    m1 --> v0

    idx1 --> m3
    m3 --> v1

    idx2 --> m5
    m5 --> v2

    idx3 --> m2
    m2 --> v3

    style m1 fill:#90EE90
    style m2 fill:#FFB6C1
    style m3 fill:#87CEEB
    style m5 fill:#FFD700

Each index points to a memory location, and the value at that location is loaded into the corresponding vector position.

12.6 Vector Masking

Predicated Execution

Not all elements in a vector may need processing. Masking allows selectively enabling or disabling operations on individual elements.

The mask is stored in vector register v0, with one bit per element. If v0[i] = 1, element i is processed; if v0[i] = 0, element i is skipped (or handled according to vma setting).

Masked Operations

Most vector instructions have a masked variant, selected by appending v0.t as the final operand:

vadd.vv vd, vs2, vs1, v0.t: Add only where v0[i] = 1

# Conditional add: dst[i] = (mask[i]) ? src1[i] + src2[i] : dst[i]
vsetvli t0, a4, e32, m1, ta, mu # a4 = element count (assumed); mu keeps masked-off elements
vlm.v v0, (a0)          # Load mask bits into v0
vle32.v v1, (a1)        # Load src1
vle32.v v2, (a2)        # Load src2
vle32.v v3, (a3)        # Load dst (for masked-off elements)
vadd.vv v3, v1, v2, v0.t # Add where mask is 1, keep v3 where mask is 0
vse32.v v3, (a3)        # Store result
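The effect of a masked add can be modelled in scalar C. This sketch assumes a mask-undisturbed policy (inactive destination elements keep their old value) and packs the mask one bit per element, as v0 does; masked_add_i32 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "vadd.vv vd, vs2, vs1, v0.t" with mask-undisturbed policy.
   Bit (i % 8) of mask[i / 8] plays the role of v0's mask bit for element i. */
void masked_add_i32(int32_t *vd, const int32_t *vs2, const int32_t *vs1,
                    const uint8_t *mask, size_t vl)
{
    assert(vd && vs2 && vs1 && mask);
    for (size_t i = 0; i < vl; i++) {
        int active = (mask[i / 8] >> (i % 8)) & 1;
        if (active)
            vd[i] = vs2[i] + vs1[i];
        /* inactive elements keep their previous value */
    }
}
```

With a mask of 0x05 (elements 0 and 2 active), only those two destination elements are overwritten.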

Comparison and Mask Generation

Comparison instructions generate masks:

vmseq.vv vd, vs2, vs1: vd[i] = (vs2[i] == vs1[i]) ? 1 : 0

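vmslt.vv, vmsle.vv, vmsgt.vv: Less than, less or equal, greater than (vmsgt.vv is an assembler pseudoinstruction that swaps the operands of vmslt.vv; the .vx and .vi forms of vmsgt are real instructions)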

# Find elements greater than 100
li a0, 100
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vmsgt.vx v0, v1, a0     # v0[i] = (v1[i] > 100) ? 1 : 0

Mask Logical Operations

Masks can be combined with logical operations:

vmand.mm vd, vs2, vs1: Mask AND

vmor.mm, vmxor.mm, vmnand.mm: Mask OR, XOR, NAND

# Combine two conditions: (a > 100) AND (a < 200)
vmsgt.vx v1, v2, a0     # v1 = (v2 > 100)
vmslt.vx v3, v2, a1     # v3 = (v2 < 200)
vmand.mm v0, v1, v3     # v0 = v1 AND v3
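The same range test can be modelled with bitmasks in C, where bit i of a 64-bit word stands for mask element i. This is an illustrative sketch limited to 64 elements; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models vmsgt.vx: bit i is set when v[i] > x. */
uint64_t mask_gt(const int32_t *v, size_t vl, int32_t x)
{
    assert(v && vl <= 64);
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (v[i] > x)
            m |= (uint64_t)1 << i;
    return m;
}

/* Models vmslt.vx: bit i is set when v[i] < x. */
uint64_t mask_lt(const int32_t *v, size_t vl, int32_t x)
{
    assert(v && vl <= 64);
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (v[i] < x)
            m |= (uint64_t)1 << i;
    return m;
}

/* Models vmand.mm: element-wise AND of two masks. */
uint64_t mask_and(uint64_t a, uint64_t b)
{
    return a & b;
}
```

For the values {50, 150, 200, 101, 199}, the combined mask for (v > 100) AND (v < 200) has bits 1, 3, and 4 set.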

Use Cases

Masking is essential for:

Conditional operations (if-then-else in vector code)
Handling irregular element selection within a strip (plain loop tails are already covered by vsetvli and VL)
Sparse computations (skip zero elements)
Implementing reductions with conditions

12.7 Vector Reductions

What is a Reduction?

A reduction combines all elements of a vector into a single scalar result. Common examples: sum all elements, find maximum, count non-zero elements.

Reduction Instructions

vredsum.vs vd, vs2, vs1: Sum all elements of vs2, add to vs1[0], store in vd[0]

vredmax.vs, vredmin.vs: Find maximum or minimum

vredand.vs, vredor.vs, vredxor.vs: Bitwise AND, OR, XOR of all elements

# Sum all elements of an array
vsetvli t0, a0, e32, m1
vmv.v.i v2, 0           # Initialize accumulator to 0
vle32.v v1, (a1)        # Load vector
vredsum.vs v2, v1, v2   # v2[0] = sum(v1[0:VL-1]) + v2[0]
vmv.x.s a2, v2          # Move result to scalar register

For arrays larger than VL, loop and accumulate:

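#include <riscv_vector.h>   // RVV C intrinsics (v1.0 naming, __riscv_ prefix)
#include <stddef.h>
#include <stdint.h>

int32_t sum_array(const int32_t *arr, size_t n) {
    // Element 0 of acc carries the running sum between iterations
    vint32m1_t acc = __riscv_vmv_s_x_i32m1(0, 1);
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);
        vint32m1_t v1 = __riscv_vle32_v_i32m1(&arr[i], vl);
        acc = __riscv_vredsum_vs_i32m1_i32m1(v1, acc, vl);  // acc[0] += sum(v1[0:vl])
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);
}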

Masked Reductions

Reductions can be masked to sum only selected elements:

# Sum elements where mask is 1
vredsum.vs v2, v1, v2, v0.t

This is useful for conditional sums (e.g., sum all positive elements).
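The "sum all positive elements" case can be modelled in scalar C as a compare that builds the mask followed by a masked reduction. This is a sketch with hypothetical function names; the mask is packed one bit per element, and the seed parameter plays the role of vs1[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a compare such as vmsgt.vi v0, v1, 0: set bit i when v[i] > 0. */
void mask_positive_i32(uint8_t *mask, const int32_t *v, size_t vl)
{
    assert(mask && v);
    for (size_t i = 0; i < vl; i++) {
        if (i % 8 == 0)
            mask[i / 8] = 0;                 /* clear each byte on first use */
        if (v[i] > 0)
            mask[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}

/* Models "vredsum.vs vd, vs2, vs1, v0.t": seed plus the active elements. */
int32_t masked_redsum_i32(const int32_t *vs2, const uint8_t *mask,
                          size_t vl, int32_t seed)
{
    assert(vs2 && mask);
    int32_t sum = seed;
    for (size_t i = 0; i < vl; i++)
        if ((mask[i / 8] >> (i % 8)) & 1)
            sum += vs2[i];
    return sum;
}
```

For the values {5, -3, 7, -1, 0, 2}, the mask selects elements 0, 2, and 5, and the masked sum with a seed of 0 is 14.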

12.8 Comparison with ARM NEON and x86 AVX

ARM NEON

ARM NEON provides 128-bit SIMD with 32 vector registers (v0-v31 in AArch64). Each register can hold:

16 × 8-bit elements
8 × 16-bit elements
4 × 32-bit elements
2 × 64-bit elements

NEON instructions specify the element width explicitly:

# ARM NEON: Add two vectors of 4 × 32-bit integers
ld1 {v0.4s}, [x0]       // Load 4 × 32-bit
ld1 {v1.4s}, [x1]
add v2.4s, v0.4s, v1.4s // Add element-wise
st1 {v2.4s}, [x2]

Limitations of NEON:

Fixed 128-bit width (no scalability)
Code must be rewritten for wider vectors
No predication (masking) in base NEON

ARM SVE (Scalable Vector Extension)

SVE addresses NEON’s limitations with scalable vectors (128-2048 bits). Like RISC-V V, SVE uses vector-length agnostic programming:

# ARM SVE: Vector add (works on any vector length)
ld1w z0.s, p0/z, [x0]   // Load with predication
ld1w z1.s, p0/z, [x1]
add z2.s, z0.s, z1.s    // Add
st1w z2.s, p0, [x2]     // Store with predication

SVE and RISC-V V share similar philosophies: scalable vectors, predication, and VLA programming. However, SVE is more complex with more instruction variants and addressing modes.

x86 AVX

x86’s SIMD evolved through multiple generations:

SSE: 128-bit (16 registers: xmm0-xmm15)
AVX: 256-bit (16 registers: ymm0-ymm15)
AVX-512: 512-bit (32 registers: zmm0-zmm31)

Each generation added new instructions:

# x86 AVX: Add two vectors of 8 × 32-bit integers
vmovdqu ymm0, [rax]     ; Load 256 bits
vmovdqu ymm1, [rbx]
vpaddd ymm2, ymm0, ymm1 ; Add 8 × 32-bit
vmovdqu [rcx], ymm2     ; Store

Limitations of x86 SIMD:

Fixed widths (128, 256, 512 bits)
Code must be rewritten for each generation
AVX-512 has many variants (AVX-512F, AVX-512BW, AVX-512DQ, etc.)
Complexity: thousands of SIMD instructions

RISC-V V Advantages

Compared to NEON and AVX, RISC-V V offers:

Scalability: One codebase works on any VLEN (128 to 65536 bits)
Simplicity: Fewer instruction variants, consistent naming
Predication: Built-in masking for all operations
Flexibility: Fractional LMUL, widening/narrowing operations
Future-proof: No need to rewrite code for wider vectors

Trade-offs:

RISC-V V is newer (less mature tooling and libraries)
x86 AVX has extensive optimization for specific workloads
ARM NEON is simpler for fixed-width use cases

Figure 12.3: SIMD Architecture Comparison

Feature	x86 SSE/AVX	ARM NEON	ARM SVE	RISC-V V
Vector Width	Fixed: 128/256/512 bits	Fixed: 128 bits	Scalable: 128-2048 bits	Scalable: 128-65536 bits
Registers	16 (SSE/AVX), 32 (AVX-512)	32	32	32
Scalability	No (fixed per generation)	No (fixed)	Yes (scalable)	Yes (scalable)
Code Portability	No (rewrite per generation)	Across NEON cores only (no wider vectors)	Yes (single codebase)	Yes (single codebase)
Predication	Partial (AVX-512 only)	No (base NEON)	Yes	Yes
Instruction Count	~1000s (across generations)	~200	~400	~300
Complexity	High (many variants)	Low	Medium	Low
Introduced	1999 (SSE), 2011 (AVX), 2016 (AVX-512)	2005	2016	2021 (ratified)
Key Advantage	Mature ecosystem	Simple, widely deployed	Scalable, predication	Scalable, simple, future-proof
Key Limitation	Fixed widths, complexity	Fixed 128-bit only	Complex instruction set	Newer, less mature tooling

🛠️ Hands-on Lab: Lab 12.1 — Vector Addition

This lab demonstrates the classic VLA loop structure—the foundational pattern for RISC-V Vector programming.

Lab Objectives

Write RISC-V Vector Assembly to implement C[i] = A[i] + B[i]
Understand the meaning of vsetvli’s return value vl
Compare the structure of Scalar Loop vs Vector Loop

Strip-mining Loop Structure

This is the core pattern of VLA programming:

while (n > 0) {
    vl = vsetvli(n);    // Ask hardware: how many can you handle?
    load(vl elements);   // Load vl elements
    compute();           // Execute operation
    store(vl elements);  // Store vl elements
    n -= vl;             // Decrease remaining count
    pointers += vl;      // Advance pointers
}

Code

File 1: vector_add.S

# vector_add.S - Vector Addition (VLA version)
.section .text
.global vec_add

# void vec_add(int *a, int *b, int *c, int n)
# a0 = pointer to A
# a1 = pointer to B
# a2 = pointer to C
# a3 = n (element count)

vec_add:
    # --- Strip-mining Loop ---
loop:
    # 1. Set vector length
    # vsetvli rd, rs1, vtype
    # t0: hardware returns actual elements it can process (vl)
    # a3: remaining elements (AVL)
    # e32: element size 32-bit
    # m1: LMUL=1 (use 1 vector register)
    # ta, ma: Tail/Mask Agnostic
    vsetvli t0, a3, e32, m1, ta, ma

    # 2. Load data
    vle32.v v0, (a0)    # v0 = A[0:vl]
    vle32.v v1, (a1)    # v1 = B[0:vl]

    # 3. Execute addition
    vadd.vv v2, v0, v1  # v2 = v0 + v1

    # 4. Write back data
    vse32.v v2, (a2)    # C[0:vl] = v2

    # 5. Advance pointers (int32 = 4 bytes)
    slli t1, t0, 2      # t1 = vl * 4
    add a0, a0, t1      # A pointer advances
    add a1, a1, t1      # B pointer advances
    add a2, a2, t1      # C pointer advances

    # 6. Update remaining count
    sub a3, a3, t0      # n = n - vl

    # 7. Continue loop
    bnez a3, loop

    ret

File 2: main.c

#include <stdio.h>

extern void vec_add(int *a, int *b, int *c, int n);

#define N 100  // Not a multiple of 8 or 16, so wider VLENs exercise tail handling

int main(void) {
    int a[N], b[N], c[N];

    // Initialize
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 100;
        c[i] = 0;
    }

    printf("Starting Vector Add...\n");
    vec_add(a, b, c, N);

    // Verify
    int error = 0;
    for (int i = 0; i < N; i++) {
        if (c[i] != a[i] + 100) {
            error++;
        }
    }

    if (error == 0) {
        printf("SUCCESS: All %d elements correct!\n", N);
    } else {
        printf("FAILED: %d errors\n", error);
    }

    return 0;
}

Compile and Run

# Compile (requires V extension support)
riscv64-unknown-elf-gcc -march=rv64gcv -o vec_add main.c vector_add.S

# Run on QEMU with V extension
qemu-riscv64 -cpu rv64,v=true vec_add

Expected Output:

Starting Vector Add...
SUCCESS: All 100 elements correct!

What You Just Did

vsetvli: Asked hardware “how many elements can you process?” and got vl back
Automatic Tail Handling: When N=100 and VLEN allows 8 elements per iteration, the last iteration automatically processes only 4 elements
Portable Code: This same code runs on any VLEN (128-bit, 256-bit, 1024-bit) without modification

danieRTOS Reference: While danieRTOS doesn’t use vector operations directly, understanding VLA patterns helps when optimizing memory copy operations in the kernel.

⚠️ Common Pitfalls

Pitfall 1: Unnecessarily Handling the Tail

Error Scenario: Habituated to traditional SIMD, writing extra tail-handling loops.

Consequence: Wasted effort, and may introduce bugs.

// ❌ Wrong: No need to handle tail yourself
void vec_add_wrong(int *a, int *b, int *c, int n) {
    int i;
    // Vector part
    for (i = 0; i + 4 <= n; i += 4) {
        // vector_add_4(a+i, b+i, c+i);
    }
    // Tail part (this is redundant in VLA!)
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✅ Correct: vsetvli handles tail automatically
// See Assembly example above

Pitfall 2: Assuming Fixed VLEN

Error Scenario: Hardcoding assumptions like VLEN=256 or other specific values.

Consequence: Program behaves incorrectly or performs poorly on different hardware.

# ❌ Wrong: Assuming 8 elements per iteration
loop:
    li t0, 8              # Hardcoded!
    vsetvli zero, t0, e32, m1, ta, ma
    ...

# ✅ Correct: Let hardware decide
loop:
    vsetvli t0, a3, e32, m1, ta, ma  # a3 = remaining count
    ...

Pitfall 3: Forgetting LMUL’s Impact

Error Scenario: Not understanding LMUL (Vector Register Group Multiplier).

Explanation:

LMUL=1: Use 1 vector register (v0-v31)
LMUL=2: Use 2 registers as a group (v0-v1, v2-v3, …)
LMUL=4/8: Larger groups

# LMUL=1: Can use v0-v31 (32 independent registers)
vsetvli t0, a3, e32, m1, ta, ma

# LMUL=2: Can use v0, v2, v4, ... (16 groups, 2 each)
vsetvli t0, a3, e32, m2, ta, ma
# Now v0 and v1 are the same group, cannot be used separately

# LMUL=8: Can use v0, v8, v16, v24 (4 groups, 8 each)
vsetvli t0, a3, e32, m8, ta, ma

💡 Tip: LMUL > 1 can process more elements but reduces available register count. For simple vector addition, LMUL=1 is usually sufficient.

Summary

The RISC-V Vector extension represents a modern approach to SIMD processing, learning from decades of experience with x86 and ARM SIMD architectures. Its vector-length agnostic design ensures that code written today will automatically benefit from wider vectors in future hardware.

Vector-length agnostic programming is the V extension’s defining feature. By abstracting away the physical vector width, RISC-V allows a single binary to run efficiently on implementations ranging from tiny embedded processors to supercomputers. This eliminates the need to maintain multiple code paths for different vector widths, simplifying both compiler and application development.

Vector configuration through the vsetvl instruction and vtype CSR provides fine-grained control over element width (SEW), register grouping (LMUL), and vector length (VL). The hardware automatically determines the optimal VL based on the application’s request (AVL) and the implementation’s capabilities (VLEN), making it easy to write portable high-performance code.

Vector operations cover the full spectrum of data-parallel computation: arithmetic and logic operations with vector-vector, vector-scalar, and vector-immediate variants; widening and narrowing operations for precision management; and fused multiply-add for efficient linear algebra. The consistent instruction naming and behavior make the V extension easier to learn than x86’s sprawling SIMD instruction set.

Vector memory operations support diverse access patterns: unit-stride for contiguous data, strided for matrix columns and interleaved data, indexed for sparse matrices and indirect addressing, and segment operations for structure-of-arrays conversions. This flexibility enables efficient vectorization of a wide range of algorithms.

Vector masking provides predicated execution, allowing conditional operations on individual vector elements. Comparison instructions generate masks, mask logical operations combine conditions, and masked operations selectively process elements. This is essential for handling loop tails, implementing conditional logic in vector code, and optimizing sparse computations.

Vector reductions efficiently combine all elements of a vector into a scalar result, supporting operations like sum, maximum, minimum, and bitwise reductions. Masked reductions enable conditional aggregations, crucial for many algorithms.

Compared to ARM NEON and x86 AVX, RISC-V V offers superior scalability and simplicity. While NEON is limited to 128-bit vectors and AVX requires separate code for each generation (128, 256, 512 bits), RISC-V V code automatically adapts to any vector width. ARM SVE shares RISC-V’s scalable philosophy but with greater complexity. x86’s SIMD has evolved into thousands of instructions across multiple incompatible extensions, while RISC-V V maintains a clean, orthogonal design.

The V extension positions RISC-V well for future data-parallel workloads in machine learning, scientific computing, multimedia processing, and other domains where SIMD performance is critical.

Chapter 13. SoC Integration

Part VIII — System Design, Platform Spec & SoC Integration

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand PMP’s Role: Grasp how Physical Memory Protection limits access at the hardware level
Distinguish TOR vs NAPOT: Understand the configuration differences and use cases for each Address Matching Mode
Configure PMP Entries: Set up read-only regions and intercept illegal writes
Understand PLIC Architecture: Grasp how the Platform-Level Interrupt Controller operates
Integrate SoC Components: Understand how CPU, Memory, and Peripherals connect via Interconnect

💡 Scenario: The Museum’s Red Barrier Poles

Scene: Junior stares at a “Store Access Fault” exception code on the screen, looking confused.

Junior: “Architect, this is so strange. I already turned off the MMU (virtual memory) and I’m using physical addresses directly to write to this variable. Why is the CPU still blocking me? Is the board broken?”

Architect: “The board is fine. You just hit PMP (Physical Memory Protection)’s ‘red barrier poles.’

Imagine memory is a museum:

Mechanism	Analogy	Function
MMU (Page Table)	Tour map	Tells you where exhibits are (VA → PA)
PMP	Red barrier poles + bulletproof glass	Hardware security, limits who can touch what

Even if you bypass the tour guide (turn off MMU) and rush straight to the Mona Lisa, the hardware security (PMP Checker) will still stop you, because your ID (Privilege Mode) says you’re just an ordinary visitor (S-mode/U-mode), and this area is only accessible to the museum director (M-mode).”

Junior: “So how do I set up these barrier poles? Do I need start and end addresses?”

Architect: “There are two common ways to set up the barriers:

TOR (Top of Range): Like stretching a rope. You need two poles (two PMP Entries), and the area between them is the controlled region. Good for arbitrary-sized regions.
NAPOT (Naturally Aligned Power of Two): Like placing a fixed-size dome (4KB, 2MB…) over exhibits. You only need to set where the dome sits (a base address aligned to its size) and how big it is. It uses only one Entry, so it is more resource-efficient.

Today let’s try using NAPOT to cover a 4KB region and make it ‘read-only,’ and see what happens to your program.”

A RISC-V processor core doesn’t operate in isolation. To build a complete system-on-chip (SoC), the core must integrate with memory controllers, interrupt controllers, I/O devices, and system interconnects. This integration determines how software accesses hardware, how devices communicate, and how the system maintains security and performance.

RISC-V provides a modular approach to SoC design. Unlike monolithic architectures that prescribe specific peripheral implementations, RISC-V defines standard interfaces while allowing flexibility in implementation. The Physical Memory Protection (PMP) unit, configured from machine mode, restricts which physical addresses lower-privilege code may access. The Platform-Level Interrupt Controller (PLIC) routes interrupts from devices to cores. Memory-mapped I/O (MMIO) provides a uniform mechanism for device access. System interconnects like TileLink and AXI connect components together.

This chapter explores how RISC-V cores integrate into complete SoCs. We’ll examine the essential components—PMP, IOMMU, PLIC, MMIO, memory maps, interconnects, and DMA—and see how they work together to create functional systems. Understanding SoC integration is crucial for system designers, firmware developers, and anyone working with RISC-V hardware platforms.

13.1 Physical Memory Protection (PMP)

The Need for Memory Isolation

In systems without virtual memory, how do we prevent untrusted code from accessing sensitive memory regions? A bare-metal application might need to protect its firmware from buggy drivers. An embedded RTOS might need to isolate tasks from each other. Machine-mode firmware must protect itself from supervisor-mode operating systems.

Physical Memory Protection (PMP) provides hardware-enforced memory access control using physical addresses. Unlike virtual memory’s page tables (which are managed by S-mode software), PMP is configured only from M-mode and is checked on every S-mode and U-mode access; locked entries are enforced on M-mode accesses as well. This makes PMP essential for systems without MMUs and useful for protecting M-mode resources even in systems with MMUs.

PMP Architecture

PMP uses a set of configuration registers to define protected memory regions:

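pmpcfg0-pmpcfg15: Configuration registers, each packing several 8-bit entry configurations. With 64 entries, RV32 uses all 16 registers (4 entries each); RV64 uses only the 8 even-numbered registers pmpcfg0, pmpcfg2, …, pmpcfg14 (8 entries each)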

pmpaddr0-pmpaddr63: Address registers (up to 64 regions)

Each PMP entry consists of:

An address register (pmpaddr) defining the region
A configuration byte (in pmpcfg) specifying permissions and matching mode

PMP Configuration Format

pmpcfg format (8 bits per entry):
  7     6:5   4:3   2     1     0
  L     0 0   A     X     W     R

L: Lock bit (prevents further modification)
A: Address matching mode (OFF, TOR, NA4, NAPOT)
X: Execute permission
W: Write permission
R: Read permission
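The bit layout above can be captured in a few C constants. The following is a minimal sketch, not an official header: the macro names and the helper `pmpcfg_set_entry` are assumptions chosen for illustration. It shows how one 8-bit entry is built and placed into the slot it occupies inside a pmpcfg register value.

```c
#include <stdint.h>

/* Field positions within one 8-bit pmpcfg entry (names are illustrative). */
#define PMP_R       (1u << 0)   /* read permission */
#define PMP_W       (1u << 1)   /* write permission */
#define PMP_X       (1u << 2)   /* execute permission */
#define PMP_A_OFF   (0u << 3)   /* address matching mode, bits 4:3 */
#define PMP_A_TOR   (1u << 3)
#define PMP_A_NA4   (2u << 3)
#define PMP_A_NAPOT (3u << 3)
#define PMP_L       (1u << 7)   /* lock bit */

/* Return `reg` with the configuration byte in position `slot` replaced by
 * `cfg`. A pmpcfg register packs 8 entries on RV64 and 4 on RV32, so `slot`
 * is the entry index modulo that count. */
static uint64_t pmpcfg_set_entry(uint64_t reg, unsigned slot, uint8_t cfg)
{
    unsigned shift = 8u * slot;
    reg &= ~((uint64_t)0xFF << shift);   /* clear the old byte */
    reg |= (uint64_t)cfg << shift;       /* insert the new byte */
    return reg;
}
```

For example, `pmpcfg_set_entry(0, 1, PMP_A_TOR | PMP_R)` yields 0x0900: entry 1 is a read-only TOR entry and entry 0 stays OFF. Writing the result to the CSR still requires a `csrw` from M-mode.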

Address Matching Modes

PMP supports four address matching modes:

OFF (A=0): Region is disabled, no protection applied

TOR (A=1): Top-of-Range. Region is [pmpaddr[i-1], pmpaddr[i])

NA4 (A=2): Naturally Aligned 4-byte region

NAPOT (A=3): Naturally Aligned Power-Of-Two region

The most commonly used modes are TOR and NAPOT:

// Example: Protect 64KB region at 0x80000000 using NAPOT
// NAPOT encoding: address = base | (size/2 - 1)
// For 64KB: 0x80000000 | 0x7FFF = 0x80007FFF
pmpaddr0 = 0x80007FFF >> 2;  // Right shift by 2 (PMP addresses are >> 2)
pmpcfg0 = 0x1F;  // L=0, A=3 (NAPOT), X=1, W=1, R=1

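// Example: Protect range [0x80000000, 0x80010000) using TOR
// A TOR entry i covers [pmpaddr[i-1], pmpaddr[i]), so entry 1 carries the configuration
pmpaddr0 = 0x80000000 >> 2;  // Start address (entry 0 itself stays OFF)
pmpaddr1 = 0x80010000 >> 2;  // End address
pmpcfg0 = 0x09 << 8;  // Entry 1: A=1 (TOR), X=0, W=0, R=1; entry 0: A=0 (OFF)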
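The NAPOT arithmetic is easy to get wrong by hand, so it can help to wrap it in a checked helper. The sketch below is an illustration under the usual NAPOT rules (size is a power of two of at least 8 bytes, base is aligned to the size); the function name `pmp_napot_encode` is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Encode a NAPOT region as a pmpaddr value.
 * Returns false if size is not a power of two >= 8 bytes,
 * or if base is not aligned to size. */
static bool pmp_napot_encode(uint64_t base, uint64_t size, uint64_t *pmpaddr)
{
    if (size < 8 || (size & (size - 1)) != 0)
        return false;                      /* not a valid NAPOT size */
    if ((base & (size - 1)) != 0)
        return false;                      /* base not naturally aligned */
    /* Address bits [..:2] with the low bits replaced by a 0111...1 pattern. */
    *pmpaddr = (base >> 2) | ((size >> 3) - 1);
    return true;
}
```

For the 64 KB region at 0x80000000 this produces 0x20001FFF, the same value as `0x80007FFF >> 2` in the example above.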

PMP Priority and Matching

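When an access occurs, PMP checks entries from lowest to highest index. The lowest-numbered matching entry determines whether the access is allowed. For S-mode and U-mode, that entry’s R/W/X bits apply. For M-mode, the entry’s bits apply only if its lock bit is set; otherwise the access succeeds. If no entry matches, an M-mode access succeeds, while an S-mode or U-mode access fails whenever at least one PMP entry is implemented.

Access check algorithm:
1. For i = 0 to N-1:
   - If address matches pmpaddr[i] region:
     - If M-mode and L=0: grant access
     - Otherwise check R/W/X permissions in pmpcfg[i]
     - If allowed: grant access
     - If denied: raise access fault
2. If no match: grant access for M-mode, raise access fault for S/U-mode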
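A small software model makes the priority rule concrete. This is a simplified sketch, not hardware-accurate: it checks a single byte address, ignores accesses that straddle a region boundary, and all type and function names are assumptions. It follows the privileged specification's rule that unlocked entries do not bind M-mode and that unmatched S/U-mode accesses are denied.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

enum { PMP_R = 1, PMP_W = 2, PMP_X = 4, PMP_L = 0x80 };
enum { A_OFF = 0, A_TOR = 1, A_NA4 = 2, A_NAPOT = 3 };

struct pmp_entry { uint64_t addr; uint8_t cfg; };  /* pmpaddr value, pmpcfg byte */

/* Does entry e match byte address pa? prev is pmpaddr[i-1] (0 for entry 0). */
static bool pmp_match(const struct pmp_entry *e, uint64_t prev, uint64_t pa)
{
    unsigned a = (e->cfg >> 3) & 3;
    uint64_t word = pa >> 2;                       /* pmpaddr holds bits [..:2] */
    if (a == A_TOR)
        return word >= prev && word < e->addr;
    if (a == A_NA4)
        return word == e->addr;
    if (a == A_NAPOT) {
        /* Trailing ones plus the zero above them select the region size. */
        uint64_t mask = e->addr ^ (e->addr + 1);
        return (word & ~mask) == (e->addr & ~mask);
    }
    return false;                                  /* OFF never matches */
}

/* perm is PMP_R, PMP_W or PMP_X; m_mode is true for machine-mode accesses. */
static bool pmp_allows(const struct pmp_entry *e, size_t n,
                       uint64_t pa, uint8_t perm, bool m_mode)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        if (pmp_match(&e[i], prev, pa)) {
            if (m_mode && !(e[i].cfg & PMP_L))
                return true;                       /* unlocked: M-mode not bound */
            return (e[i].cfg & perm) != 0;         /* first match decides */
        }
        prev = e[i].addr;
    }
    return m_mode || n == 0;                       /* no match: only M-mode passes */
}
```

With a read-only NAPOT entry for 0x80200000 followed by an allow-all entry, an S-mode write to 0x80200000 is rejected by the first entry even though the second would permit it.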

Lock Bit

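The lock bit (L) does two things: it prevents further modification of a PMP entry until the next reset, and it makes the entry’s permissions apply to M-mode as well. S-mode code can never write PMP CSRs, so locking is aimed at later M-mode code: early boot firmware can lock a region so that nothing running afterwards, even compromised M-mode code, can reopen it:

// Lock a read-only region (entry 1 is the TOR entry covering [pmpaddr0, pmpaddr1))
// Note: a locked entry also binds M-mode, so M-mode cannot write or execute here either
pmpaddr0 = 0x80000000 >> 2;
pmpaddr1 = 0x80010000 >> 2;
pmpcfg0 = 0x89 << 8;  // Entry 1: L=1, A=1 (TOR), X=0, W=0, R=1

// Subsequent writes to entry 1's configuration byte, to pmpaddr1, and to pmpaddr0
// (the lower bound of a locked TOR entry) are ignored
pmpcfg0 = 0x00;  // Entry 1 keeps 0x89; this write has no effect on it!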

PMP Use Cases

Firmware Protection: M-mode firmware protects itself from S-mode OS
Device Memory Protection: Prevent unauthorized access to MMIO regions
Task Isolation: Embedded RTOS isolates tasks without MMU
Secure Boot: Protect boot ROM and secure storage

13.2 IOMMU for RISC-V

The DMA Problem

Direct Memory Access (DMA) allows devices to access memory without CPU intervention, improving performance for I/O-intensive workloads. But DMA creates a security problem: devices use physical addresses and bypass the CPU’s virtual memory protection. A malicious or buggy device could read sensitive data or corrupt kernel memory.

An Input-Output Memory Management Unit (IOMMU) solves this by providing address translation and access control for devices. Just as the MMU translates virtual addresses for the CPU, the IOMMU translates device addresses for peripherals. This enables:

Device isolation: Each device sees only its own memory
Virtualization: Virtual machines can safely pass through devices
Large address spaces: 32-bit devices can access >4GB memory

RISC-V IOMMU Architecture

The RISC-V IOMMU specification defines a standard interface for device address translation. The IOMMU sits between devices and memory, intercepting device memory requests and translating them through device page tables.

Device → IOMMU → Memory
         ↓
    Device Context
    Device Page Tables

Key components:

Device Context: Per-device configuration (page table pointer, permissions)

Device Directory Table (DDT): Maps device IDs to device contexts

I/O Page Tables: Similar to CPU page tables, but for device addresses

Command Queue: Software sends commands to IOMMU (invalidate TLB, etc.)

Fault Queue: IOMMU reports translation faults to software

Device Address Translation

When a device issues a memory request:

IOMMU extracts device ID from the request
Looks up device context in DDT
Walks device page tables to translate address
Checks permissions (read/write/execute)
Forwards translated request to memory or reports fault

// Simplified IOMMU translation
struct device_context {
    uint64_t page_table_root;  // Root page table address
    uint32_t permissions;       // Device permissions
    uint32_t address_width;     // Device address width
};

// Device issues read from device address 0x1000
device_addr = 0x1000;
device_id = 0x42;

// IOMMU lookup
ctx = ddt[device_id];
physical_addr = walk_page_table(ctx.page_table_root, device_addr);
if (physical_addr && (ctx.permissions & READ)) {
    forward_to_memory(physical_addr);
} else {
    report_fault(device_id, device_addr);
}
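The pseudocode above can be turned into a runnable toy model. The sketch below is deliberately simplified and is not the data layout of the RISC-V IOMMU specification: it uses a flat, single-level table per device, keeps permissions in the page table entries, and every type and function name is an assumption made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define IOPTE_V (1u << 0)   /* entry valid */
#define IOPTE_R (1u << 1)   /* device may read */
#define IOPTE_W (1u << 2)   /* device may write */

struct io_pte  { uint64_t ppn; uint32_t flags; };
struct dev_ctx { const struct io_pte *table; size_t n_pages; };

/* Translate (device_id, iova). Returns true and stores the physical address
 * in *pa on success; returns false where real hardware would queue a fault. */
static bool iommu_translate(const struct dev_ctx *ddt, size_t n_dev,
                            uint32_t device_id, uint64_t iova,
                            bool is_write, uint64_t *pa)
{
    if (device_id >= n_dev || ddt[device_id].table == NULL)
        return false;                               /* unknown device */
    const struct dev_ctx *ctx = &ddt[device_id];
    uint64_t vpn = iova >> PAGE_SHIFT;
    if (vpn >= ctx->n_pages)
        return false;                               /* outside the mapping */
    const struct io_pte *pte = &ctx->table[vpn];
    uint32_t need = IOPTE_V | (is_write ? IOPTE_W : IOPTE_R);
    if ((pte->flags & need) != need)
        return false;                               /* invalid or not permitted */
    *pa = (pte->ppn << PAGE_SHIFT) | (iova & ((1u << PAGE_SHIFT) - 1));
    return true;
}
```

The key property is isolation: a device whose ID has no context, or whose address falls outside its own table, cannot reach memory at all.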

IOMMU Page Table Formats

RISC-V IOMMU supports multiple page table formats:

Sv39/Sv48/Sv57: Same format as CPU page tables (for simplicity)
MSI Page Tables: Special format for Message Signaled Interrupts

Using the same format as CPU page tables simplifies software—the OS can reuse existing page table code for device mappings.

IOMMU vs ARM SMMU

ARM’s System Memory Management Unit (SMMU) provides similar functionality:

Feature	RISC-V IOMMU	ARM SMMU
Page Table Format	Sv39/Sv48/Sv57	LPAE (Long descriptor)
Device Identification	Device ID (configurable)	Stream ID
Command Interface	Command queue	CMDQ (Command Queue)
Fault Reporting	Fault queue	Event queue
Virtualization	Two-stage translation	Stage 1 + Stage 2
Complexity	Simpler, modular	More complex, feature-rich

RISC-V IOMMU emphasizes simplicity and reuse of existing CPU MMU concepts, while ARM SMMU has evolved through multiple generations with extensive features.

13.3 Platform-Level Interrupt Integration

Interrupt Routing in SoCs

A typical RISC-V SoC has dozens of interrupt sources: UARTs, timers, GPIOs, network controllers, storage devices. Each device needs to signal the CPU when it requires attention. The Platform-Level Interrupt Controller (PLIC) manages this complexity by:

Collecting interrupts from all devices
Routing interrupts to appropriate cores
Managing interrupt priorities
Providing claim/complete mechanism

PLIC Architecture

Interrupt Sources (1-1023)
    ↓
PLIC Gateway (per source)
    ↓
PLIC Core (priority arbitration)
    ↓
Interrupt Targets (cores × contexts)

Key concepts:

Interrupt Source: A device that can generate interrupts (numbered 1-1023, source 0 is reserved)

Interrupt Gateway: Converts device interrupt signal to PLIC internal format

Interrupt Target: A CPU context (M-mode or S-mode on each core)

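Priority: Each source has a priority (0 = never interrupt, higher values = higher priority). The number of levels is implementation-defined; a common choice, used in this chapter, is 1-7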

Threshold: Each target has a threshold (only interrupts with priority > threshold are delivered)

PLIC Memory Map

The PLIC is accessed through memory-mapped registers:

Base Address: 0x0C000000 (typical)

Interrupt Priorities:
  0x0C000000 + 4*source_id: Priority for source (0-7)

Interrupt Pending:
  0x0C001000: Pending bits (1024 bits, read-only)

Interrupt Enable:
  0x0C002000 + 0x80*context: Enable bits for context

Priority Threshold:
  0x0C200000 + 0x1000*context: Threshold for context

Claim/Complete:
  0x0C200004 + 0x1000*context: Claim and complete register
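The offsets in this map translate directly into address helpers. The following sketch assumes the typical base address shown above (it is platform-specific) and uses hypothetical function names; it only computes addresses and does not touch hardware.

```c
#include <stdint.h>

#define PLIC_BASE 0x0C000000u   /* assumption: typical base, check your SoC */

/* Priority register for one source (4 bytes per source). */
static uint32_t plic_priority_addr(uint32_t source)
{
    return PLIC_BASE + 4u * source;
}

/* 32-bit pending word that contains the bit for `source`. */
static uint32_t plic_pending_addr(uint32_t source)
{
    return PLIC_BASE + 0x1000u + 4u * (source / 32u);
}

/* 32-bit enable word for `source` within context `ctx` (0x80 bytes per context). */
static uint32_t plic_enable_addr(uint32_t ctx, uint32_t source)
{
    return PLIC_BASE + 0x2000u + 0x80u * ctx + 4u * (source / 32u);
}

/* Threshold and claim/complete registers (0x1000 bytes per context). */
static uint32_t plic_threshold_addr(uint32_t ctx)
{
    return PLIC_BASE + 0x200000u + 0x1000u * ctx;
}

static uint32_t plic_claim_addr(uint32_t ctx)
{
    return PLIC_BASE + 0x200004u + 0x1000u * ctx;
}
```

Within an enable or pending word, the bit for a source is `source % 32`.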

Interrupt Handling Flow

Device asserts interrupt: Device raises interrupt line
PLIC gateway captures: Gateway sets pending bit
PLIC arbitration: PLIC selects highest priority pending interrupt for each target
CPU notification: PLIC asserts external interrupt to CPU
Software claim: Interrupt handler reads claim register (returns source ID, clears pending)
Software handling: Handler services the device
Software complete: Handler writes source ID to complete register

// PLIC claim/complete register for context 0 (add 0x1000 * context for others)
#define PLIC_CLAIM_COMPLETE (PLIC_BASE + 0x200004)

uint32_t plic_claim(void) {
    volatile uint32_t *claim = (volatile uint32_t*)PLIC_CLAIM_COMPLETE;
    return *claim;  // Reading claim atomically claims the interrupt
}

void plic_complete(uint32_t source) {
    volatile uint32_t *complete = (volatile uint32_t*)PLIC_CLAIM_COMPLETE;
    *complete = source;  // Writing complete releases the interrupt
}

// PLIC interrupt handler
void plic_handler(void) {
    uint32_t source = plic_claim();  // Read claim register
    if (source == 0) {
        return;  // Nothing pending (for example, another hart claimed it first)
    }

    if (source == UART0_IRQ) {
        uart0_interrupt_handler();
    } else if (source == TIMER_IRQ) {
        timer_interrupt_handler();
    }
    // ... handle other sources

    plic_complete(source);  // Write to complete register
}
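The arbitration step is hidden inside the hardware, so a small software model may help. This is a toy sketch with 31 sources and a single context, and all names are assumptions. It follows two rules from the PLIC specification: ties go to the lowest source ID, and the threshold gates only the notification to the hart, not the claim itself.

```c
#include <stdint.h>

#define NSRC 32   /* toy model: sources 1..31, source 0 is reserved */

struct plic_model {
    uint32_t priority[NSRC];  /* per-source priority, 0 = never interrupt */
    uint32_t pending;         /* bit i set = source i pending */
    uint32_t enable;          /* bit i set = source i enabled for this context */
    uint32_t threshold;       /* per-context threshold */
};

/* Would the PLIC assert the external interrupt line for this context? */
static int plic_model_eip(const struct plic_model *p)
{
    for (uint32_t s = 1; s < NSRC; s++)
        if ((p->pending & p->enable & (1u << s)) && p->priority[s] > p->threshold)
            return 1;
    return 0;
}

/* Model of reading the claim register: returns the highest-priority pending,
 * enabled source (lowest ID wins ties) and clears its pending bit.
 * Returns 0 if nothing can be claimed. */
static uint32_t plic_model_claim(struct plic_model *p)
{
    uint32_t best = 0, best_prio = 0;
    for (uint32_t s = 1; s < NSRC; s++) {
        if ((p->pending & p->enable & (1u << s)) && p->priority[s] > best_prio) {
            best = s;
            best_prio = p->priority[s];
        }
    }
    if (best != 0)
        p->pending &= ~(1u << best);
    return best;
}
```

Because a claim can return 0, a real handler should treat source 0 as "nothing to do" rather than dispatching on it.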

Multi-Core Interrupt Routing

In a multi-core system, each core has separate M-mode and S-mode contexts. The PLIC can route interrupts to specific cores:

// Configure UART interrupt to route to core 0 S-mode
#define PLIC_ENABLE_BASE 0x0C002000
#define UART0_IRQ 10
#define CORE0_S_MODE_CONTEXT 1  // Platform-specific: context numbering varies between SoCs

// Enable UART interrupt for core 0 S-mode
volatile uint32_t *enable = (volatile uint32_t*)(PLIC_ENABLE_BASE + 0x80 * CORE0_S_MODE_CONTEXT);
enable[UART0_IRQ / 32] |= (1u << (UART0_IRQ % 32));

// Set priority
volatile uint32_t *priority = (volatile uint32_t*)(PLIC_BASE + 4 * UART0_IRQ);
*priority = 5;  // Priority 5

// Set threshold
volatile uint32_t *threshold = (volatile uint32_t*)(PLIC_BASE + 0x200000 + 0x1000 * CORE0_S_MODE_CONTEXT);
*threshold = 0;  // Accept all priorities > 0

13.4 Memory-Mapped I/O (MMIO)

Unified Address Space

RISC-V uses memory-mapped I/O: devices appear as memory locations. Reading from a device register uses the same load instruction as reading from RAM. Writing to a device register uses the same store instruction. This unifies the programming model—no special I/O instructions needed.

# Read UART status register
li   a0, 0x10000000      # UART base address
lw   a1, 0(a0)           # Read status register

# Write to UART data register
li   a2, 'A'             # Character to send
sw   a2, 4(a0)           # Write to data register

MMIO Address Regions

A typical RISC-V SoC memory map divides address space into regions:

0x00000000 - 0x0FFFFFFF: Debug/Boot ROM, CLINT, PLIC
0x10000000 - 0x1FFFFFFF: Peripherals (UART, SPI, GPIO, etc.)
0x20000000 - 0x3FFFFFFF: Reserved
0x40000000 - 0x7FFFFFFF: More peripherals
0x80000000 - 0xFFFFFFFF: DRAM

Each peripheral gets a block of addresses for its registers:

UART0: 0x10000000 - 0x10000FFF
  0x10000000: Status register
  0x10000004: Data register
  0x10000008: Control register
  0x1000000C: Baud rate register

MMIO Ordering Requirements

MMIO accesses have ordering requirements that differ from normal memory:

Device registers may have side effects: Reading a status register might clear an interrupt flag
Write order matters: Writing control registers in wrong order can cause device malfunction
Read/write dependencies: A write must complete before a subsequent read sees the result

RISC-V provides fence instructions to enforce ordering:

# Order the control-register write before the status-register read
li   a0, 0x10000000
li   a1, 0x1
sw   a1, 0(a0)           # Write to control register
fence iorw, iorw         # Earlier accesses are ordered before later ones
lw   a2, 4(a0)           # Read status register

The FENCE instruction takes two operands specifying predecessor and successor operations:

i: Device input (MMIO read)
o: Device output (MMIO write)
r: Memory read
w: Memory write

Common patterns:

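fence iorw, iorw: Full fence (all earlier accesses ordered before all later accesses)
fence ow, ow: Order earlier memory and MMIO writes before later ones (for example, fill a descriptor in RAM, then write a doorbell register)
fence ir, ir: Order earlier memory and MMIO reads before later ones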
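In C, drivers usually hide the fence inside small accessor functions. The sketch below is one conservative way to do it, assuming a GCC- or Clang-style compiler for the inline assembly; the function names are hypothetical, and on non-RISC-V hosts the fence falls back to a compiler barrier so the code still builds for testing.

```c
#include <stdint.h>

/* Full I/O and memory fence on RISC-V; compiler barrier elsewhere. */
static inline void io_fence(void)
{
#if defined(__riscv)
    __asm__ volatile ("fence iorw, iorw" ::: "memory");
#else
    __asm__ volatile ("" ::: "memory");
#endif
}

/* Write a device register, then order it before any later access. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;      /* volatile: the compiler must emit exactly this store */
    io_fence();
}

/* Read a device register, then order it before any later access. */
static inline uint32_t mmio_read32(volatile uint32_t *reg)
{
    uint32_t value = *reg;
    io_fence();
    return value;
}
```

A full fence after every access is simple but slow. Production code often uses narrower fences such as `fence o, o` where only write ordering matters.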

Uncached vs Cached MMIO

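MMIO regions must not be cached. In RISC-V, cacheability is a physical memory attribute (PMA) of the address range, normally fixed by the platform, so device regions are typically uncacheable regardless of page table contents. Caching device registers would cause:

Stale data: Cache might return old value instead of current device state
Lost writes: Write to cached location might not reach device
Side effect loss: Reading cached value doesn’t trigger device side effects

The base page table format has no cacheability bit. On systems that implement Svpbmt, a page table entry can override the PMA through its PBMT field (bits 62:61); the IO type marks a page as non-cacheable, strongly ordered I/O:

// Map an MMIO page; PTE_PBMT_IO requires the Svpbmt extension
pte = (physical_addr >> 12) << 10;  // PPN
pte |= PTE_V | PTE_R | PTE_W;       // Valid, readable, writable
pte |= PTE_A | PTE_D;                // Accessed, dirty
pte |= PTE_PBMT_IO;                  // PBMT = 2 in bits 62:61 (I/O, non-cacheable)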

13.5 SoC Memory Map

Typical RISC-V SoC Layout

A complete RISC-V SoC memory map includes ROM, RAM, peripherals, and reserved regions. Here’s a typical layout for a 32-bit SoC:

Address Range          Size      Description
0x00000000-0x00000FFF  4 KB      Debug ROM
0x00001000-0x0000FFFF  60 KB     Reserved
0x00010000-0x0001FFFF  64 KB     Boot ROM (mask ROM)
0x00020000-0x00FFFFFF  ~16 MB    Reserved
0x01000000-0x01FFFFFF  16 MB     CLINT (Core-Local Interruptor)
0x02000000-0x0BFFFFFF  160 MB    Reserved
0x0C000000-0x0FFFFFFF  64 MB     PLIC
0x10000000-0x1000FFFF  64 KB     UART0
0x10010000-0x1001FFFF  64 KB     SPI0
0x10020000-0x1002FFFF  64 KB     GPIO
0x10030000-0x1FFFFFFF  ~256 MB   Other peripherals
0x20000000-0x3FFFFFFF  512 MB    Reserved
0x40000000-0x7FFFFFFF  1 GB      External devices
0x80000000-0xFFFFFFFF  2 GB      DRAM

Address Decode and Routing

The SoC interconnect decodes addresses and routes requests to appropriate components:

CPU issues load/store
    ↓
Address decode
    ↓
    ├─ 0x00000000-0x0FFFFFFF → Boot ROM / CLINT / PLIC
    ├─ 0x10000000-0x1FFFFFFF → Peripheral bus
    ├─ 0x20000000-0x7FFFFFFF → Reserved / External
    └─ 0x80000000-0xFFFFFFFF → DRAM controller
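In hardware the decoder is combinational logic, but the same idea can be sketched as a table lookup. The region table below mirrors the four branches in the diagram; the struct, the target strings, and the function name `decode` are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

struct region {
    uint32_t base;        /* first address in the region */
    uint32_t last;        /* last address in the region (inclusive) */
    const char *target;   /* where the interconnect routes the request */
};

static const struct region soc_map[] = {
    { 0x00000000u, 0x0FFFFFFFu, "boot ROM / CLINT / PLIC" },
    { 0x10000000u, 0x1FFFFFFFu, "peripheral bus" },
    { 0x20000000u, 0x7FFFFFFFu, "reserved / external" },
    { 0x80000000u, 0xFFFFFFFFu, "DRAM controller" },
};

/* Return the routing target for addr, or NULL if no region claims it. */
static const char *decode(uint32_t addr)
{
    for (size_t i = 0; i < sizeof soc_map / sizeof soc_map[0]; i++)
        if (addr >= soc_map[i].base && addr <= soc_map[i].last)
            return soc_map[i].target;
    return NULL;
}
```

Using an inclusive `last` field avoids overflow for the region that ends at 0xFFFFFFFF. A real interconnect would respond to an unclaimed address with a bus error, which the core reports as an access fault.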

Memory Map Examples

Different RISC-V platforms use different memory maps:

SiFive FU540 (HiFive Unleashed):

0x00001000: Mode select (reset vector)
0x00010000: Mask ROM
0x02000000: CLINT
0x0C000000: PLIC
0x10010000: UART0
0x10040000: QSPI0
0x10060000: GPIO
0x80000000: DDR (8 GB)

Kendryte K210:

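0x02000000: CLINT
0x0C000000: PLIC
0x38000000: High-speed peripherals (UARTHS, GPIOHS)
0x40800000: KPU (AI accelerator)
0x50000000: DMA controller and APB peripherals
0x80000000: SRAM (6 MB general-purpose + 2 MB AI)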

QEMU virt machine:

0x00001000: Boot ROM
0x02000000: CLINT
0x0C000000: PLIC
0x10000000: UART0
0x10001000: VirtIO devices
0x80000000: DRAM (configurable)

13.6 System Interconnects

The Need for Interconnects

A modern SoC has multiple masters (CPU cores, DMA controllers, GPUs) and multiple slaves (memory, peripherals, accelerators). An interconnect fabric connects these components, handling:

Address routing: Directing requests to correct destination
Arbitration: Managing concurrent accesses
Data width conversion: Connecting 32-bit devices to 64-bit buses
Clock domain crossing: Bridging different clock frequencies

AXI (Advanced eXtensible Interface)

ARM’s AMBA AXI is widely used in RISC-V SoCs due to its maturity and IP availability. AXI4 provides:

Separate read/write channels: Independent read and write transactions
Burst transfers: Efficient multi-beat transfers
Out-of-order completion: Transactions with different IDs can complete in any order
Quality of Service (QoS): Priority-based arbitration

AXI signals:

Write Address Channel: AWADDR, AWLEN, AWSIZE, AWVALID, AWREADY
Write Data Channel:    WDATA, WSTRB, WLAST, WVALID, WREADY
Write Response:        BRESP, BVALID, BREADY
Read Address Channel:  ARADDR, ARLEN, ARSIZE, ARVALID, ARREADY
Read Data Channel:     RDATA, RRESP, RLAST, RVALID, RREADY
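The AxLEN and AxSIZE signals encode a burst compactly: the number of beats is AxLEN + 1 and each beat carries 2^AxSIZE bytes. The sketch below works through that arithmetic and checks the AXI rule that a burst must not cross a 4 KB boundary. The function names are hypothetical, and the boundary check assumes an incrementing burst with an aligned start address.

```c
#include <stdint.h>
#include <stdbool.h>

/* Total bytes moved by one burst: (AxLEN + 1) beats of 2^AxSIZE bytes. */
static uint32_t axi_burst_bytes(uint8_t axlen, uint8_t axsize)
{
    return ((uint32_t)axlen + 1u) << axsize;
}

/* Would an incrementing burst starting at addr cross a 4 KB boundary? */
static bool axi_crosses_4k(uint64_t addr, uint8_t axlen, uint8_t axsize)
{
    uint64_t last = addr + axi_burst_bytes(axlen, axsize) - 1;
    return (addr >> 12) != (last >> 12);
}
```

For example, AWLEN = 15 with AWSIZE = 3 describes 16 beats of 8 bytes, 128 bytes in total. A DMA engine that needs to move more than fits before the next 4 KB boundary has to split the transfer into several bursts.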

AHB (Advanced High-performance Bus)

AHB is simpler than AXI, suitable for lower-performance peripherals:

Single shared bus: Reads and writes share one address phase and one data phase, rather than using independent channels
Pipelined: Two-stage pipeline (address, data)
Simpler protocol: Easier to implement
Lower performance: No out-of-order, limited bursts

TileLink

TileLink is a RISC-V-native interconnect developed at UC Berkeley:

Designed for RISC-V: Matches RISC-V memory model
Scalable: From simple embedded to complex multi-core
Cache coherence: Built-in support for coherent caches
Three conformance levels:
- TL-UL: Uncached Lightweight (simple peripherals)
- TL-UH: Uncached Heavyweight (DMA, accelerators)
- TL-C: Cached (coherent caches)

TileLink advantages for RISC-V:

Native support for RISC-V atomics (LR/SC, AMO)
Efficient cache coherence protocol
Open specification (no licensing)

Interconnect Comparison

Feature	AXI4	AHB	TileLink
Channels	5 independent	1 shared	2 (A, D) for TL-UL/TL-UH; 5 (A-E) for TL-C
Burst Support	Yes (up to 256 beats)	Yes (limited)	Yes
Out-of-Order	Yes	No	Yes
Cache Coherence	No (needs ACE)	No	Yes (TL-C)
Complexity	High	Low	Medium
Performance	High	Medium	High
RISC-V Atomics	Requires extensions	Requires extensions	Native support
Licensing	ARM (free for use)	ARM (free for use)	Open (BSD)
Ecosystem	Mature, extensive IP	Mature, simple IP	Growing, RISC-V focused

Choosing an Interconnect

AXI: Best for high-performance SoCs, extensive IP ecosystem, industry standard
AHB: Best for simple embedded systems, low-cost peripherals
TileLink: Best for RISC-V-native designs, cache coherence, open ecosystem

13.7 DMA and Coherency

DMA Controller Integration

A DMA controller transfers data between memory and peripherals without CPU intervention. This frees the CPU for other tasks while large data transfers proceed in the background.

Typical DMA use cases:

Disk I/O: Transfer data between storage and memory
Network I/O: Move packets between NIC and memory
Audio/Video: Stream data to/from media devices
Memory-to-memory: Fast memory copy operations

DMA controller architecture:

CPU configures DMA
    ↓
DMA reads source (memory or device)
    ↓
DMA writes destination (device or memory)
    ↓
DMA signals completion (interrupt)

Cache Coherency Considerations

DMA creates coherency problems when the CPU has caches:

Problem 1: Stale cache data

1. CPU writes data to memory (data in cache, not yet in RAM)
2. DMA reads from memory (gets old data, not cached data)
3. DMA sends wrong data to device

Problem 2: Stale memory data

1. DMA writes data to memory
2. CPU reads data (gets old cached data, not new DMA data)
3. CPU processes wrong data

Solutions:

Software cache management (simple, lower performance):

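// DMA from device into memory (CPU reads the result afterwards)
cache_invalidate(buffer, size);  // Drop cached lines first so no dirty line is evicted over DMA data
dma_start(device, buffer, size);
dma_wait_complete();
cache_invalidate(buffer, size);  // Discard any lines fetched while the transfer was running
// Now CPU can read fresh data

// DMA from memory out to device (CPU wrote the data beforehand)
cache_flush(buffer, size);       // Write cached data to memory
dma_start(buffer, device, size);
dma_wait_complete();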
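Cache maintenance works on whole cache lines, not on bytes, so a flush or invalidate routine first has to round the buffer out to line boundaries. The sketch below shows that rounding; the 64-byte line size and all names are assumptions, and the actual maintenance instruction (for example, a Zicbom `cbo` operation or a vendor-specific one) is left out.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64u   /* assumption: line size in bytes, platform-specific */

struct line_range {
    uintptr_t first;     /* address of the first cache line touched */
    size_t count;        /* number of cache lines touched */
};

/* Compute which cache lines overlap the buffer [addr, addr + size). */
static struct line_range cache_lines_for(uintptr_t addr, size_t size)
{
    struct line_range r = { 0, 0 };
    if (size == 0)
        return r;
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t start = addr & mask;                            /* round down */
    uintptr_t end   = (addr + size + CACHE_LINE - 1) & mask;  /* round up */
    r.first = start;
    r.count = (end - start) / CACHE_LINE;
    return r;
}
```

If the count is larger than `size / CACHE_LINE`, the buffer shares a line with neighbouring data, and invalidating that line can throw away the neighbour's unwritten changes. This is why DMA buffers are normally aligned to, and padded to a multiple of, the cache line size.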

Hardware cache coherence (complex, higher performance):

DMA controller participates in cache coherence protocol
DMA snoops CPU caches or uses coherent interconnect
Requires coherent interconnect (ACE for AXI, TL-C for TileLink)

DMA and Virtual Memory

DMA controllers typically use physical addresses, but software uses virtual addresses. This creates challenges:

Problem: Virtual address buffer might span non-contiguous physical pages

Virtual:  [0x1000-0x2FFF] (8 KB contiguous)
Physical: [0x80000000-0x80000FFF] + [0x85000000-0x85000FFF] (non-contiguous!)

Solutions:

Scatter-Gather DMA: DMA controller accepts list of physical address/length pairs

struct sg_entry {
    uint64_t addr;   // Physical address
    uint32_t len;    // Length in bytes
};

struct sg_entry sg_list[] = {
    {0x80000000, 4096},
    {0x85000000, 4096},
};
dma_start_sg(device, sg_list, 2);

IOMMU: Translate device addresses to physical addresses (see Section 13.2)
Physically contiguous buffers: Allocate DMA buffers from reserved physical memory
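Building a scatter-gather list means walking the virtual buffer page by page and asking for each page's physical address. The sketch below shows one way to do it, merging pages that happen to be physically adjacent. The translation hook `virt_to_phys_fn` and the demo mapping are hypothetical stand-ins for whatever the OS provides; the demo reproduces the two-page example above.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

struct sg_entry { uint64_t addr; uint32_t len; };

/* Hypothetical hook: returns the physical address backing a virtual address. */
typedef uint64_t (*virt_to_phys_fn)(uint64_t vaddr);

/* Fill out[] with physical segments covering [vaddr, vaddr + len).
 * Returns the number of entries used, or 0 if max entries are not enough. */
static size_t sg_build(uint64_t vaddr, size_t len, virt_to_phys_fn v2p,
                       struct sg_entry *out, size_t max)
{
    size_t n = 0;
    while (len > 0) {
        size_t in_page = PAGE_SIZE - (size_t)(vaddr & (PAGE_SIZE - 1));
        size_t chunk = len < in_page ? len : in_page;   /* stay inside one page */
        uint64_t pa = v2p(vaddr);
        if (n > 0 && out[n - 1].addr + out[n - 1].len == pa) {
            out[n - 1].len += (uint32_t)chunk;          /* adjacent: extend */
        } else {
            if (n == max)
                return 0;                               /* list too small */
            out[n].addr = pa;
            out[n].len = (uint32_t)chunk;
            n++;
        }
        vaddr += chunk;
        len -= chunk;
    }
    return n;
}

/* Demo mapping from the text: VA page 0x1000 -> PA 0x80000000,
 * every other page -> PA 0x85000000. */
static uint64_t demo_v2p(uint64_t va)
{
    uint64_t off = va & (PAGE_SIZE - 1);
    return ((va >> 12) == 1 ? 0x80000000ull : 0x85000000ull) + off;
}
```

Calling `sg_build(0x1000, 8192, demo_v2p, list, 4)` produces the same two entries as the hand-written `sg_list` above.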

🛠️ Hands-on Lab: Lab 13.1 — Memory Firewall (PMP Shield)

This lab demonstrates PMP’s core functionality: setting protection rules in M-mode, then switching to S-mode to attempt a violation.

Lab Objectives

Configure PMP Entry 0 as Read-Only (R=1, W=0) to protect target variable
Configure PMP Entry 1 as Allow-All (R=1, W=1, X=1) to let other code run normally
Switch to S-mode and attempt a write to trigger Store Access Fault

NAPOT Encoding Principle

NAPOT encoding can be abstract for beginners. The key formula is:

pmpaddr = (base_addr >> 2) | ((size >> 3) - 1)

Example: Protect a 4KB region starting at 0x80200000
- Base = 0x80200000, Size = 4KB (0x1000)
- 0x80200000 >> 2 = 0x20080000
- (0x1000 >> 3) - 1 = 0x1FF
- pmpaddr = 0x20080000 | 0x1FF = 0x200801FF

Encoding Rules:

pmpaddr low bits	Corresponding region size
`...aaaaa0`	8 bytes
`...aaaa01`	16 bytes
`...aaa011`	32 bytes
`...a01111`	128 bytes
`...0111111111`	4KB (what we use)
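Reading a pmpaddr value back into a base and size is the reverse of the table: count the trailing one bits. The sketch below is a small decoder for checking values by hand; the struct and function names are assumptions, and the loop stops at 54 bits because that is the width of pmpaddr on RV64.

```c
#include <stdint.h>

struct pmp_region { uint64_t base; uint64_t size; };

/* Decode a NAPOT pmpaddr value into the byte region it covers. */
static struct pmp_region pmp_napot_decode(uint64_t pmpaddr)
{
    unsigned ones = 0;
    while (ones < 54 && ((pmpaddr >> ones) & 1u))
        ones++;                                   /* count trailing one bits */

    struct pmp_region r;
    r.size = (uint64_t)8 << ones;                 /* k ones -> 2^(k+3) bytes */
    r.base = (pmpaddr & ~(((uint64_t)1 << ones) - 1)) << 2;
    return r;
}
```

Decoding 0x200801FF gives nine trailing ones, so the region is 4096 bytes starting at 0x80200000, which matches the worked example.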

Code (pmp_lab.S)

.section .text
.global _start

_start:
    # ---------------------------------------------------
    # 1. Set up Trap Handler (catch Access Fault later)
    # ---------------------------------------------------
    la t0, trap_handler
    csrw mtvec, t0

    # ---------------------------------------------------
    # 2. Configure PMP (in M-mode)
    # ---------------------------------------------------

    # [Target] Protect a 4KB region at 0x80200000
    # Using NAPOT mode
    li t0, 0x200801FF
    csrw pmpaddr0, t0

    # Entry 0: Enable + NAPOT + Read Only (R=1, W=0, X=0)
    # PMP_R(1) | PMP_A_NAPOT(0x18) = 0x19

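    # Entry 1: Open other memory (Allow All)
    # pmpaddr1 set to all 1s in NAPOT mode covers the entire address space.
    # (TOR would not work here: entry 1 would start at pmpaddr0, leaving the
    # code below the protected region unmatched and inaccessible to S-mode.)
    # PMP_R(1) | PMP_W(2) | PMP_X(4) | PMP_A_NAPOT(0x18) = 0x1F

    # pmpcfg0 = (pmp1cfg << 8) | pmp0cfg = 0x1F19
    li t0, -1
    csrw pmpaddr1, t0

    li t0, 0x1F19
    csrw pmpcfg0, t0        # Firewall activated!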

    # ---------------------------------------------------
    # 3. Drop to S-mode
    # ---------------------------------------------------

    # Set mstatus.MPP = 01 (Supervisor)
    li t0, (3 << 11)
    csrc mstatus, t0        # Clear MPP
    li t0, (1 << 11)
    csrs mstatus, t0        # Set MPP to 01 (S-mode)

    la t0, s_mode_entry
    csrw mepc, t0
    mret                    # Jump! Identity becomes Supervisor

s_mode_entry:
    # ---------------------------------------------------
    # 4. Trigger Attack (S-mode Attempt)
    # ---------------------------------------------------
    li a0, 0x80200000       # Address protected by PMP0
    li t1, 0xDEADBEEF

    # Attempt write! Should trigger Exception 7 (Store Access Fault)
    sw t1, 0(a0)

    # If we survive, experiment failed
    li a0, 0
    j stop

stop:
    j stop

trap_handler:
    # Read mcause to check exception type
    csrr t0, mcause

    # Exception 7 = Store Access Fault
    li t1, 7
    bne t0, t1, unexpected

    # SUCCESS: PMP blocked the illegal write!
    li a0, 1                # Return success code
    j stop

unexpected:
    li a0, -1               # Unexpected exception
    j stop

Compile and Run

# Assemble
riscv64-unknown-elf-as -march=rv64g -o pmp_lab.o pmp_lab.S

# Link (ensure _start is entry point)
riscv64-unknown-elf-ld -T link.ld -o pmp_lab.elf pmp_lab.o

# Run on QEMU
qemu-system-riscv64 -machine virt -nographic -bios pmp_lab.elf

Expected Behavior

M-mode: PMP entries configured, firewall activated
mret: Privilege drops to S-mode
sw instruction: Triggers Store Access Fault (Exception 7)
Trap handler: Confirms PMP did its job
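Step 3 of the lab clears and then sets the two-bit mstatus.MPP field with a `csrc`/`csrs` pair. The same read-modify-write can be expressed in C, which is handy when the firmware is written in C and only the CSR access itself is assembly. This is a sketch with hypothetical names; it operates on a plain value and leaves the actual CSR read and write to the caller.

```c
#include <stdint.h>

#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  ((uint64_t)3 << MSTATUS_MPP_SHIFT)

/* Privilege mode encodings used in mstatus.MPP. */
enum priv_mode { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Return mstatus with MPP replaced by `mode`; all other bits are preserved. */
static uint64_t mstatus_set_mpp(uint64_t mstatus, enum priv_mode mode)
{
    mstatus &= ~MSTATUS_MPP_MASK;                      /* clear both MPP bits */
    mstatus |= (uint64_t)mode << MSTATUS_MPP_SHIFT;    /* insert the new mode */
    return mstatus;
}
```

Clearing before setting matters: MPP resets to M-mode (binary 11) on many implementations, so OR-ing in 01 without clearing would leave the field at 11 and `mret` would stay in M-mode.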

danieRTOS Reference: A real RTOS would use PMP to isolate kernel data from user tasks, preventing task corruption.

⚠️ Common Pitfalls

Pitfall 1: PMP Priority Order Error

Error Scenario: Put “deny rule” in pmp15, put “allow rule” in pmp0.

Consequence: pmp0 matches first, allowing all access. The deny rule never takes effect.

// ❌ Wrong: Order reversed
pmp0: Allow All (RWX)      // Matches first, permits everything
pmp1: Deny 0x80200000      // Never gets checked

// ✅ Correct: Write specific Deny first, generic Allow last
pmp0: Read-Only 0x80200000 // Check sensitive region first
pmp1: Allow All            // Allow other regions

💡 Memory aid: Like firewall rules, write exceptions first, default last.

Pitfall 2: Forgetting That Unmatched Accesses Are Denied

Error Scenario: Only set one PMP Entry to protect the key area, forgot to open other memory.

Consequence: Code region not matched by any PMP Entry, S/U-mode can’t even fetch the next instruction.

# ❌ Wrong: Only one Entry
csrw pmpaddr0, t0       # Protect key
li   t2, 0x19
csrw pmpcfg0, t2        # Read-Only
mret                    # Jump to S-mode then immediately Crash!

# ✅ Correct: Add Allow All Entry
csrw pmpaddr0, t0       # Protect key
li   t1, -1
csrw pmpaddr1, t1       # All ones: NAPOT region covering everything
li   t2, 0x1F19
csrw pmpcfg0, t2        # pmp0=RO (NAPOT), pmp1=RWX (NAPOT)

Pitfall 3: Lock Bit Irreversibility

Error Scenario: Setting Lock bit (L=1) during development.

Consequence: PMP Entry locked. Only hardware Reset can unlock. Cannot modify rules during debug.

// ❌ Dangerous: Setting Lock during development
pmpcfg0 = 0x99;  // L=1, A=NAPOT, R=1

// ✅ Recommended: Only set Lock in Production
#ifdef PRODUCTION
    pmpcfg0 = 0x99;  // Locked
#else
    pmpcfg0 = 0x19;  // Dev mode, not locked
#endif

💡 Reminder: The Lock bit exists to stop later code from loosening the protections set up by M-mode firmware, and a locked entry also restricts M-mode itself. Don’t use it during development.

Summary

SoC integration connects RISC-V cores with the rest of the system. This chapter covered seven essential components that make a complete RISC-V system-on-chip.

Physical Memory Protection (PMP) provides hardware-enforced memory access control using physical addresses. PMP is configured from M-mode and protects memory regions from untrusted S-mode and U-mode code. With up to 64 configurable regions, four address matching modes (OFF, TOR, NA4, NAPOT), and lockable entries, PMP enables firmware protection, device memory isolation, and task separation in systems without MMUs.

IOMMU extends memory protection to devices by translating device addresses and enforcing access control. The RISC-V IOMMU uses the same page table format as the CPU MMU, simplifying software implementation. This enables device isolation, safe device passthrough for virtualization, and protection against malicious or buggy devices.

Platform-Level Interrupt Controller (PLIC) manages interrupt routing in multi-core systems. The PLIC collects interrupts from up to 1023 sources, arbitrates by priority, and routes them to appropriate CPU contexts. The claim/complete mechanism ensures atomic interrupt handling, while per-context enable masks and thresholds provide flexible interrupt management.

Memory-Mapped I/O (MMIO) provides a uniform mechanism for device access using standard load and store instructions. MMIO regions must be uncached, which is normally guaranteed by the platform’s physical memory attributes (or by Svpbmt page attributes where implemented), and fence instructions ensure proper ordering of device accesses. This unified address space simplifies the programming model compared to architectures with separate I/O instructions.

SoC memory maps organize address space into regions for ROM, RAM, peripherals, and reserved areas. Different RISC-V platforms use different layouts, but all follow the principle of address decode and routing through the interconnect fabric. Understanding the memory map is essential for firmware development and device driver programming.

System interconnects connect multiple masters and slaves in the SoC. AXI provides high performance with extensive IP ecosystem, AHB offers simplicity for embedded systems, and TileLink provides RISC-V-native features including cache coherence. The choice depends on performance requirements, IP availability, and coherence needs.

DMA and coherency enable efficient data transfers but require careful management of cache coherence. Software can use cache flush and invalidate operations, or hardware can provide coherent DMA through snooping or coherent interconnects. IOMMU or scatter-gather DMA solves the virtual-to-physical address translation problem for DMA transfers.

Together, these components form the foundation of RISC-V SoC design, enabling everything from simple microcontrollers to complex multi-core application processors.

Chapter 14. RISC-V Platform Profiles & Embedded Systems

Part VIII — System Design, Platform Spec & SoC Integration

🎯 Learning Objectives

After reading this chapter, you will be able to:

Understand the Fragmentation Problem: Grasp how ISA Fragmentation harms software ecosystems
Distinguish Profile Types: Understand the use-case differences between RVA (Application) and RVM (Microcontroller)
Decode Profile Names: Interpret the meaning of names like RVA22S, RVM23

Scene: Junior has multiple chip vendor spec sheets spread across the desk, looking bewildered.

Junior: “Senior, I’m going cross-eyed looking at these spec sheets. This vendor says they support RV64GC, that one says RVA22, and another says RVM23. I just want to run Linux—which one do I pick? Can’t it be as simple as buying an x86 computer?”

Senior: “That’s the downside of RISC-V being ‘too free’—Fragmentation. Before, everyone assembled their own instruction sets: you want M, he wants F, I want C. Then you write software only to find that this CPU is missing an instruction, instant Crash.”

Junior: “So Profiles are meant to solve this?”

Senior: “Exactly. Think of it as officially certified set menus:

Before (À la carte)	Now (Profile)
Pick your own instruction set	Official pre-configured menu
Forgot to order FPU = Linux won’t boot	RVA22 label = guaranteed to run
Verify compatibility for each product	One ISO runs all compliant hardware

RVA (Application Profile) is the ‘deluxe menu’ for rich operating systems like Linux/Android. If a vendor dares to slap the RVA22 label on their chip, they guarantee it has MMU, atomic instructions, floating-point, and everything else needed to run Linux.“

Junior: “What about RVM?”

Senior: “RVM (Microcontroller Profile) is the ‘economy menu’ for RTOS or bare-metal. It drops heavy equipment like MMU, focusing on low power and real-time control. If you’re just making a smart rice cooker, RVM is enough; but if you’re making a smartphone, you definitely need RVA.”

Junior: “What about the numbers 20, 22, 23? Performance levels?”

Senior: “Not performance—year. Like a ‘2022 model year’ car. RVA22 represents standards defined in 2022, RVA23 adds some new features (like stronger vector instructions). Newer Linux distributions typically require newer Profile years.”

Junior: “Got it! So when choosing a CPU, I don’t need to check instructions one by one anymore—just confirm it meets the Profile menu I need!”

RISC-V’s modularity is both a strength and a challenge. The base ISA is minimal, and implementations choose which extensions to include. This flexibility enables optimization for specific use cases—a microcontroller might include only RV32IMC, while a server processor might implement RV64IMAFDCV. But this variability creates a problem: how does software know what features are available?

Platform profiles solve this by defining standard combinations of extensions for specific use cases. A profile specifies mandatory extensions, optional extensions, and implementation requirements. Software targeting a profile can assume certain features exist, simplifying development and improving portability. The RVA22 profile defines requirements for application processors running rich operating systems. The RVA23 profile adds newer extensions and stricter requirements. Embedded profiles target microcontrollers and real-time systems.

This chapter explores RISC-V platform profiles and embedded system design. We’ll examine the RVA22 and RVA23 profiles for application processors, embedded profiles for microcontrollers, the differences between MMU and no-MMU systems, and how RISC-V compares to ARM Cortex-M in the embedded space.

14.1 Platform Profiles Overview

The Fragmentation Problem

RISC-V’s modularity creates potential fragmentation. Consider these valid RISC-V implementations:

RV32I: Minimal 32-bit core (base ISA only)
RV32IMC: Embedded core (multiply, compressed)
RV64IMAFDC: Application processor (full general-purpose)
RV64IMAFDCV: High-performance with vectors

Software written for RV64IMAFDC won’t run on RV32IMC. Even within the same base (RV64), different extension combinations create incompatibilities. This makes it difficult to distribute binary software or develop portable operating systems.

Profiles as a Solution

A platform profile defines:

Mandatory extensions: Must be implemented
Optional extensions: May be implemented
Mandatory features: Specific implementation requirements (e.g., minimum PMP entries)
Discovery mechanism: How software detects features

Profiles create standard targets for software development. An OS can target “RVA22 profile” and know exactly what features are available. Hardware vendors can claim “RVA22 compliant” and guarantee compatibility with RVA22 software.
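In software terms, a profile check is a set comparison: the hardware's extension set must contain the profile's mandatory set. The sketch below illustrates this with bitmasks. The extension list is a small illustrative subset, not the full mandatory list of any ratified profile, and all names are assumptions.

```c
#include <stdint.h>

/* Illustrative subset of extensions; a real profile lists many more. */
enum ext {
    EXT_M, EXT_A, EXT_F, EXT_D, EXT_C,
    EXT_ZBA, EXT_ZBB, EXT_ZBS, EXT_SV39, EXT_V
};

#define EXT_BIT(e) (1u << (e))

/* Hypothetical "demo profile": general-purpose extensions, bit
 * manipulation, and Sv39 virtual memory. */
static const uint32_t DEMO_PROFILE_MANDATORY =
    EXT_BIT(EXT_M) | EXT_BIT(EXT_A) | EXT_BIT(EXT_F) | EXT_BIT(EXT_D) |
    EXT_BIT(EXT_C) | EXT_BIT(EXT_ZBA) | EXT_BIT(EXT_ZBB) | EXT_BIT(EXT_ZBS) |
    EXT_BIT(EXT_SV39);

/* Return the mandatory extensions that the hardware does not implement.
 * A result of 0 means the hardware satisfies the profile. */
static uint32_t missing_extensions(uint32_t implemented, uint32_t mandatory)
{
    return mandatory & ~implemented;
}
```

A core with M, A, F, D, C and Sv39 but no bit-manipulation extensions would fail this demo profile, and the returned mask names exactly the three missing extensions. Extra extensions such as V never cause a failure, because a profile only sets a minimum.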

Profile Naming Convention

RISC-V profiles use a naming scheme:

RV: RISC-V
A/M/E: Application / Microcontroller / Embedded
22/23/…: Year of ratification (2022, 2023, etc.)
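S/U: Supervisor mode / User mode (for application profiles). Ratified profile names also append the XLEN, as in RVA22S64 and RVA22U64; this chapter uses the short forms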

Examples:

RVA22S: Application profile, 2022, Supervisor mode
RVA22U: Application profile, 2022, User mode
RVA23S: Application profile, 2023, Supervisor mode
RVM23: Microcontroller profile, 2023
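The naming scheme is regular enough to parse mechanically. The sketch below splits a profile name into its parts; it is an illustration with hypothetical names, and it accepts both the short forms used in this chapter (RVA22S) and names with an XLEN suffix (RVA22S64).

```c
#include <stdbool.h>
#include <ctype.h>
#include <stdlib.h>

struct profile {
    char family;   /* 'A', 'M', ... */
    int  year;     /* e.g. 2022 */
    char mode;     /* 'S', 'U', or 0 if absent */
    int  xlen;     /* 32, 64, or 0 if absent */
};

/* Parse names of the form RV<family><yy>[S|U][32|64]. Returns false on error. */
static bool parse_profile(const char *s, struct profile *p)
{
    if (s[0] != 'R' || s[1] != 'V' || !isupper((unsigned char)s[2]))
        return false;
    p->family = s[2];
    if (!isdigit((unsigned char)s[3]) || !isdigit((unsigned char)s[4]))
        return false;
    p->year = 2000 + (s[3] - '0') * 10 + (s[4] - '0');
    s += 5;

    p->mode = 0;
    p->xlen = 0;
    if (*s == 'S' || *s == 'U')
        p->mode = *s++;                       /* optional privilege mode */
    if (*s != '\0') {                         /* optional XLEN suffix */
        if (!isdigit((unsigned char)*s))
            return false;
        char *end;
        long x = strtol(s, &end, 10);
        if (*end != '\0' || (x != 32 && x != 64))
            return false;
        p->xlen = (int)x;
    }
    return true;
}
```

An ISA string such as RV64GC is rejected because the character after "RV" is a digit rather than a family letter, which is a quick way to tell the two kinds of name apart.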

Profile Versioning

Profiles evolve over time:

New release: Adds mandatory extensions (e.g., RVA22 → RVA23); software built for the newer release may not run on older hardware
Minor version: Backward-compatible additions
Errata: Bug fixes, no functional changes

A profile version guarantees:

Backward compatibility: RVA22 software runs on RVA23 hardware
Feature stability: Mandatory extensions don’t change within a version

14.2 RVA22 Profile (Application Processor)

Target Use Case

RVA22 targets application processors running rich operating systems like Linux, FreeBSD, or commercial RTOSes. These systems need:

Virtual memory (MMU)
Privilege levels (M, S, U modes)
Standard extensions for general-purpose computing
Sufficient performance for application workloads

RVA22S (Supervisor Mode)

RVA22S defines requirements for systems running supervisor-mode operating systems.

Mandatory ISA Extensions:

RV64I: 64-bit base integer ISA
M: Integer multiplication and division
A: Atomic instructions
F: Single-precision floating-point
D: Double-precision floating-point
C: Compressed instructions
Zicsr: CSR instructions
Zifencei: Instruction fence
Zicntr: Base counters (cycle, time, instret)
Zihpm: Hardware performance counters
Ziccif: Main memory supports instruction fetch
Ziccrse: Main memory supports forward progress on LR/SC sequences
Ziccamoa: Main memory supports all atomics
Zicclsm: Main memory supports misaligned loads/stores
Za64rs: Reservation set size (at most 64 bytes)
Zihintpause: Pause hint instruction
Zba: Address generation (bit manipulation)
Zbb: Basic bit manipulation
Zbs: Single-bit instructions
Zkt: Data-independent execution time (timing side-channel protection)

Mandatory Privileged Features:

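Sv39: Page-based virtual memory (39-bit virtual address)
Svpbmt: Page-based memory types
Svade: Page-fault exception when an A or D bit needs updating (hardware updates via Svadu are not required)
Svinval: Fine-grained address-translation cache invalidation

Optional Privileged Features (mandatory from RVA23S onward):

Sstc: Supervisor-mode timer interrupts
Sscofpmf: Count overflow and privilege mode filtering

Implementation Requirements:

At least 8 PMP entries (assumed by this chapter at the platform level; PMP is an M-mode facility, so the S-mode profile itself does not size it)
29 hardware performance counter CSRs (hpmcounter3 through hpmcounter31, from Zihpm)
Misaligned loads/stores supported in main memory
LR/SC reservation set size of at most 64 bytes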

RVA22U (User Mode)

RVA22U defines requirements for user-mode applications. It’s a subset of RVA22S:

Same ISA extensions as RVA22S
No privileged features (no Sv39, no PMP requirements)
Targets user-space applications running on RVA22S systems

Example: Checking RVA22 Compliance

// Check if system is RVA22S compliant (run in M-mode, before paging is enabled)
bool is_rva22s_compliant(void) {
    // Check single-letter ISA extensions via misa
    uint64_t misa = read_csr(misa);
    if ((misa >> 62) != 2) return false;     // MXL = 2: RV64
    if ((misa & MISA_I) == 0) return false;  // I base ISA
    if ((misa & MISA_M) == 0) return false;  // M extension
    if ((misa & MISA_A) == 0) return false;  // A extension
    if ((misa & MISA_F) == 0) return false;  // F extension
    if ((misa & MISA_D) == 0) return false;  // D extension
    if ((misa & MISA_C) == 0) return false;  // C extension
    
    // Check Sv39 support: a write with an unsupported MODE leaves satp unchanged
    uint64_t old_satp = read_csr(satp);
    write_csr(satp, (uint64_t)SATP_MODE_SV39 << 60);  // Cast avoids int overflow
    bool has_sv39 = (read_csr(satp) >> 60) == SATP_MODE_SV39;
    write_csr(satp, old_satp);  // Restore on every path
    if (!has_sv39) return false;
    
    // Check PMP entries (at least 8, a platform-level requirement)
    int pmp_count = count_pmp_entries();
    if (pmp_count < 8) return false;
    
    // Zba, Zbb, Zbs and the other Z* extensions are not visible in misa;
    // check them via device tree or ACPI
    // ...
    
    return true;
}

14.3 RVA23 Profile

Improvements Over RVA22

RVA23, ratified in October 2024, builds on RVA22 with additional extensions and stricter requirements:

New Mandatory Extensions:

V: Vector extension
Zicond: Integer conditional operations
Zimop: May-be-operations (reserved for future extensions)
Zcmop: Compressed may-be-operations
Zcb: Additional compressed instructions
Zfa: Additional floating-point instructions
Zawrs: Wait-on-reservation-set instructions
Supm: Pointer masking for user-mode code (configured by the execution environment)

Enhanced Requirements:

Sstc, Sscofpmf, and Svnapot become mandatory for supervisor mode
Hypervisor extension (H) becomes mandatory for supervisor mode
Sv39 remains the mandatory baseline; Sv48 and Sv57 stay optional
Zvkt extends the Zkt data-independent timing guarantee to vector instructions

Forward Compatibility

RVA23 is forward-compatible with RVA22:

All RVA22 mandatory extensions remain mandatory in RVA23
RVA22 software runs unmodified on RVA23 hardware
RVA23 adds features but doesn’t remove or change existing ones
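The compatibility rule amounts to a subset test: software built for a profile runs wherever the hardware provides every extension that profile mandates. The sketch below illustrates this with bitmasks. The `EXT_*` bit assignments, the abbreviated mandatory sets, and the `runs_on` function are assumptions made for illustration and do not reflect any official encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative extension bits; the names and bit positions are arbitrary. */
enum {
    EXT_ZBA    = 1u << 0,
    EXT_ZBB    = 1u << 1,
    EXT_ZBS    = 1u << 2,
    EXT_ZICOND = 1u << 3,
    EXT_ZFA    = 1u << 4,
    EXT_ZAWRS  = 1u << 5
};

/* Mandatory sets, abbreviated to a few extensions named in this chapter. */
#define RVA22_MANDATORY (EXT_ZBA | EXT_ZBB | EXT_ZBS)
#define RVA23_MANDATORY (RVA22_MANDATORY | EXT_ZICOND | EXT_ZFA | EXT_ZAWRS)

/* Software runs when the hardware has every extension the software requires. */
bool runs_on(uint32_t software_requires, uint32_t hardware_provides) {
    return (software_requires & ~hardware_provides) == 0;
}
```

Because the RVA23 set is built as a superset of the RVA22 set, `runs_on(RVA22_MANDATORY, RVA23_MANDATORY)` holds while the reverse direction does not.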

Migration Path

Hardware vendors can support both profiles:

RVA22 hardware → RVA23 hardware
  (2022-2024)      (2024+)
       ↓               ↓
   RVA22 software runs on both
   RVA23 software runs only on RVA23+

14.4 Embedded Profiles

Embedded System Requirements

Embedded systems have different constraints than application processors:

Limited resources: Small memory, low power, cost-sensitive
Real-time requirements: Deterministic interrupt latency, predictable timing
No virtual memory: Many embedded systems run without MMU
Simpler software: Bare-metal or RTOS, not full OS

RISC-V embedded profiles target these use cases with minimal mandatory extensions and flexible configurations.

RV32E Base ISA

RV32E is a reduced version of RV32I with only 16 registers (x0-x15) instead of 32. This saves:

Silicon area: Smaller register file
Power: Fewer registers to manage
Context switching: Fewer registers to save and restore (instruction encodings are the same as RV32I)

RV32E is suitable for ultra-low-cost microcontrollers where every gate counts.

# RV32E example: Only x0-x15 available
addi x10, x0, 42     # OK: x10 is in range
addi x20, x0, 42     # ERROR: x20 doesn't exist in RV32E

Microcontroller-Oriented Features

Embedded profiles emphasize:

Compressed instructions (C): Reduce code size by 25-30%
Multiply (M): Essential for many embedded algorithms
Bit manipulation (B): Efficient for embedded control tasks
Fast interrupts: Low-latency interrupt handling

Interrupt Handling for Embedded

Embedded systems need fast, deterministic interrupt response. RISC-V provides two interrupt architectures:

CLINT (Core-Local Interruptor): Basic timer and software interrupts
CLIC (Core-Local Interrupt Controller): Advanced interrupt controller for embedded

CLIC features:

Vectored interrupts: Jump directly to handler (no dispatch overhead)
Nested interrupts: Higher-priority interrupts preempt lower-priority
Tail-chaining: Back-to-back interrupts without full context save/restore
Configurable levels: Up to 256 priority levels

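// CLIC interrupt handler (vectored)
// The interrupt attribute makes the compiler save/restore registers and return with mret
__attribute__((interrupt)) void uart_irq_handler(void) {
    // Directly entered, no dispatch needed
    char c = UART->DATA;
    buffer[head] = c;
    head = (head + 1) % BUFFER_SIZE;  // Wrap the index (BUFFER_SIZE: assumed length of buffer)
    // Tail-chaining: if another interrupt pending, jump directly to it
}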
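The nesting rule can be modeled in a few lines: an interrupt is taken only if it is pending, enabled, and at a higher level than the code currently running. This is a simplified software model, not the CLIC register interface. The `irq_t` type and `select_irq` function are hypothetical, and breaking exact ties by lowest index is an arbitrary choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    pending;
    bool    enabled;
    uint8_t level;     /* preemption level: higher preempts lower */
    uint8_t priority;  /* orders interrupts that share a level */
} irq_t;

/* Returns the index of the interrupt to take next, or -1 if none may preempt. */
int select_irq(const irq_t *irqs, int count, uint8_t current_level) {
    int best = -1;
    for (int i = 0; i < count; i++) {
        const irq_t *q = &irqs[i];
        if (!q->pending || !q->enabled) continue;
        if (q->level <= current_level) continue;   /* cannot preempt running code */
        if (best < 0 ||
            q->level > irqs[best].level ||
            (q->level == irqs[best].level && q->priority > irqs[best].priority)) {
            best = i;   /* exact ties keep the earlier (lower) index */
        }
    }
    return best;
}
```

A handler running at level 5 is not interrupted by another level-5 source, which is the property that keeps same-level handlers from preempting each other.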

Low-Power Considerations

Embedded systems prioritize power efficiency:

Clock gating: Disable unused modules
Power domains: Shut down inactive regions
Sleep modes: WFI (Wait For Interrupt) instruction
Dynamic voltage/frequency scaling: Adjust performance vs power

# Enter low-power mode
wfi              # Wait for interrupt (CPU sleeps)
# CPU wakes on interrupt, resumes here

14.5 MMU vs No-MMU Systems

MMU-Based Systems

Systems with Memory Management Units (MMUs) provide:

Virtual memory: Each process has its own address space
Memory protection: Processes can’t access each other’s memory
Demand paging: Load pages from disk on demand
Large address spaces: 39-bit to 57-bit virtual addresses (Sv39, Sv48, Sv57)

MMU-based systems run full operating systems like Linux:

Process A: 0x00000000-0xFFFFFFFF (virtual)
    ↓ MMU translation
Physical: 0x80000000-0x80FFFFFF

Process B: 0x00000000-0xFFFFFFFF (virtual)
    ↓ MMU translation
Physical: 0x81000000-0x81FFFFFF

Requirements:

Sv39/Sv48/Sv57 page tables
TLB for translation caching
Page fault handling
Sufficient memory for page tables
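To make the page-table requirement concrete, here is a minimal sketch of how an Sv39 virtual address splits into three 9-bit virtual page numbers and a 12-bit page offset. The `sv39_va_t` type and both function names are introduced here for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t vpn[3];   /* vpn[2] indexes the root table, vpn[0] the leaf table */
    uint16_t offset;   /* byte offset within the 4 KiB page */
} sv39_va_t;

/* Sv39 requires bits 63:39 to equal bit 38 (sign extension). */
bool sv39_is_canonical(uint64_t va) {
    uint64_t upper = va >> 38;   /* bits 63:38, 26 bits in total */
    return upper == 0 || upper == 0x3FFFFFF;
}

/* Split a virtual address into its page-table indices and page offset. */
sv39_va_t sv39_split(uint64_t va) {
    sv39_va_t r;
    r.offset = (uint16_t)(va & 0xFFF);
    r.vpn[0] = (uint16_t)((va >> 12) & 0x1FF);
    r.vpn[1] = (uint16_t)((va >> 21) & 0x1FF);
    r.vpn[2] = (uint16_t)((va >> 30) & 0x1FF);
    return r;
}
```

Each index selects one of 512 entries in a 4 KiB table, so a full walk touches up to three tables; Sv48 and Sv57 add a fourth and fifth 9-bit index in the same pattern.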

No-MMU Systems

Systems without MMUs use physical addresses directly:

Simpler hardware: No TLB, no page table walker
Lower cost: Fewer gates, less power
Faster context switch: No TLB flush needed
Deterministic: No TLB miss latency

No-MMU systems run:

Bare-metal: Single application, no OS
RTOS: Real-time OS with static memory allocation
Embedded Linux: uClinux (Linux without MMU)

Memory Protection Without MMU

No-MMU systems can still provide memory protection using PMP (Physical Memory Protection):

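// Protect firmware region [0x80000000, 0x80010000) with a TOR pair:
// entry 0 (A=OFF) supplies the lower bound, entry 1 carries the rule
write_csr(pmpaddr0, 0x80000000UL >> 2);
write_csr(pmpaddr1, 0x80010000UL >> 2);

// Protect peripheral region [0x10000000, 0x20000000) with entries 2 and 3
write_csr(pmpaddr2, 0x10000000UL >> 2);
write_csr(pmpaddr3, 0x20000000UL >> 2);

// pmp1cfg (bits 15:8)  = 0x89: L=1, A=1 (TOR), X=0, W=0, R=1
// pmp3cfg (bits 31:24) = 0x8B: L=1, A=1 (TOR), X=0, W=1, R=1
// L=1 makes each rule apply to M-mode as well
write_csr(pmpcfg0, (0x89UL << 8) | (0x8BUL << 24));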

PMP provides:

Region-based protection (not page-based)
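Hardware enforcement configured by M-mode: rules always apply to S-mode and U-mode, and to M-mode itself when the entry is locked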
Locked regions (can’t be changed until reset)
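The PMP address and configuration encodings are easy to get wrong by a shift, so a small helper can compute them. This is a sketch under stated assumptions: the `PMP_*` constants mirror the pmpcfg bit layout, while the helper names are hypothetical and only the first four entries of pmpcfg0 are handled.

```c
#include <stdint.h>

/* Bits of one 8-bit pmpNcfg field. */
enum { PMP_R = 0x01, PMP_W = 0x02, PMP_X = 0x04,
       PMP_A_TOR = 0x08, PMP_A_NA4 = 0x10, PMP_A_NAPOT = 0x18, PMP_L = 0x80 };

/* pmpaddr value for a TOR boundary: the byte address shifted right by 2. */
uint64_t pmp_tor_addr(uint64_t boundary) {
    return boundary >> 2;
}

/* pmpaddr value for a NAPOT region.
 * size must be a power of two >= 8, and base must be aligned to size. */
uint64_t pmp_napot_addr(uint64_t base, uint64_t size) {
    return (base | (size / 2 - 1)) >> 2;
}

/* Shift an 8-bit configuration into the pmpcfg0 slot for entries 0-3. */
uint32_t pmp_cfg_slot(unsigned entry, uint8_t cfg) {
    return (uint32_t)cfg << (8 * (entry & 3));
}
```

A single NAPOT entry can cover a power-of-two region that would otherwise need two TOR entries: a 64 KiB region at 0x80000000 encodes as 0x20001FFF, where the 13 trailing one bits express the size.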

Comparison

Feature	MMU System	No-MMU System
Address Space	Virtual (per-process)	Physical (shared)
Memory Protection	Page-based (4KB granularity)	Region-based (PMP)
OS Support	Linux, FreeBSD, etc.	RTOS, bare-metal, uClinux
Context Switch	Slow (TLB flush)	Fast (no TLB)
Memory Overhead	Page tables (~1% of RAM)	None
Complexity	High	Low
Determinism	Lower (TLB misses)	Higher (no TLB)
Use Cases	Servers, desktops, phones	Microcontrollers, embedded

14.6 Comparison with ARM Cortex-M

ARM Cortex-M Overview

ARM Cortex-M is the dominant architecture for microcontrollers. The Cortex-M family includes:

Cortex-M0/M0+: Ultra-low-cost, minimal features
Cortex-M3: Mainstream, good performance/cost
Cortex-M4: DSP extensions, floating-point
Cortex-M7: High performance, cache, double-precision FP
Cortex-M33/M55: ARMv8-M, TrustZone, vector extensions

RISC-V Embedded vs ARM Cortex-M

Feature	RISC-V Embedded	ARM Cortex-M
ISA	Open, modular	Proprietary, fixed
Registers	32 (or 16 for RV32E)	16 (13 general-purpose)
Instruction Set	RISC, load-store	Thumb-2 (mixed 16/32-bit)
Compressed Instructions	Optional (C extension)	Standard (Thumb)
Multiply/Divide	Optional (M extension)	Multiply standard; hardware divide on M3+
Floating-Point	Optional (F/D extensions)	Optional (M4+)
Vector/SIMD	Optional (V extension)	Optional (M55 Helium)
Privilege Levels	M/S/U modes	Handler/Thread modes
Memory Protection	PMP (region-based)	MPU (region-based)
Interrupt Controller	CLIC (vectored, nested)	NVIC (vectored, nested)
Interrupt Priorities	Up to 256 levels	8-256 levels (implementation)
Licensing	Open (no fees)	Proprietary (licensing fees)
Ecosystem	Growing	Mature, extensive

Interrupt Model Comparison (CLIC vs NVIC)

ARM NVIC (Nested Vectored Interrupt Controller):

Vectored interrupts (direct jump to handler)
Nested interrupts with priority levels
Tail-chaining for back-to-back interrupts
Automatic context save/restore

RISC-V CLIC (Core-Local Interrupt Controller):

Similar features to NVIC, except that context save/restore is done in software rather than by hardware stacking
More flexible priority levels (up to 256)
Configurable interrupt modes
Compatible with RISC-V privilege model

Both provide comparable functionality for embedded interrupt handling.

Ecosystem Comparison

ARM Cortex-M Advantages:

Mature ecosystem (20+ years)
Extensive vendor support (ST, NXP, TI, etc.)
Rich middleware (CMSIS, Mbed, etc.)
Large developer community
Proven in billions of devices

RISC-V Embedded Advantages:

No licensing fees (lower cost)
Open ISA (customizable)
Modern design (cleaner than ARM legacy)
Growing ecosystem (SiFive, Espressif, etc.)
Future-proof (community-driven evolution)

Use Case Recommendations

Choose ARM Cortex-M when:

Mature ecosystem is critical
Extensive middleware needed
Proven reliability required
Time-to-market is tight

Choose RISC-V Embedded when:

Cost optimization is important
Customization is needed
Open-source ecosystem preferred
Long-term flexibility valued

Example: ESP32-C3 (RISC-V) vs ESP32 (Xtensa)

Espressif’s ESP32-C3 demonstrates RISC-V in embedded:

RV32IMC core (32-bit, multiply, compressed)
160 MHz, single-core
Wi-Fi + Bluetooth LE
400 KB SRAM, 4 MB flash
Arduino, ESP-IDF support

Compared to ESP32 (Xtensa):

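Lower peak performance (one core at 160 MHz versus up to two Xtensa cores at up to 240 MHz), sufficient for many IoT workloads
Mainstream toolchain support (upstream GCC and LLVM)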
Open ISA (vs proprietary Xtensa)
Growing ecosystem

🛠️ Hands-on Lab: Lab 14.1 — Profile Detector

This lab demonstrates how to detect which extensions the current hardware supports, which is the first step in deciding whether a system meets a profile.

Lab Objectives

Read the misa CSR to see supported Extensions
Check if critical CSRs exist
Determine which Profile level the current system meets

Code (profile_detect.c)

#include <stdio.h>
#include <stdint.h>

// Read misa CSR (requires M-mode privilege; the read traps in S-mode or U-mode)
static inline uint64_t read_misa(void) {
    uint64_t val;
    asm volatile ("csrr %0, misa" : "=r" (val));
    return val;
}

// Extension check (misa bit mapping: A=0, B=1, ..., Z=25); evaluates to 0 or 1
#define HAS_EXT(misa, letter) (((misa) >> ((letter) - 'A')) & 1)

void check_profile(uint64_t misa) {
    printf("=== RISC-V Profile Detector ===\n\n");

    // Check XLEN (MXL field in misa[63:62])
    int xlen = (misa >> 62) == 2 ? 64 : 32;
    printf("Base: RV%d\n", xlen);

    // List supported Extensions
    printf("Extensions: ");
    for (char c = 'A'; c <= 'Z'; c++) {
        if (HAS_EXT(misa, c)) {
            printf("%c", c);
        }
    }
    printf("\n\n");

    // RVA22 minimum requirements check
    int has_m = HAS_EXT(misa, 'M');  // Integer Multiply
    int has_a = HAS_EXT(misa, 'A');  // Atomics
    int has_f = HAS_EXT(misa, 'F');  // Single-precision Float
    int has_d = HAS_EXT(misa, 'D');  // Double-precision Float
    int has_c = HAS_EXT(misa, 'C');  // Compressed

    printf("--- Profile Compatibility ---\n");

    // RVA22 requires: RV64IMAFDC + more
    if (xlen == 64 && has_m && has_a && has_f && has_d && has_c) {
        printf("[✓] RVA22 basic requirements: PASS\n");
        printf("    (Additional checks needed: Zba, Zbb, Zbs, Sv39, PMP>=8)\n");
    } else {
        printf("[✗] RVA22 basic requirements: FAIL\n");
        if (xlen != 64) printf("    Missing: 64-bit base\n");
        if (!has_m) printf("    Missing: M (Multiply)\n");
        if (!has_a) printf("    Missing: A (Atomics)\n");
        if (!has_f) printf("    Missing: F (Float)\n");
        if (!has_d) printf("    Missing: D (Double)\n");
        if (!has_c) printf("    Missing: C (Compressed)\n");
    }

    // RVM (Microcontroller) compatibility is more relaxed
    if (has_m && has_c) {
        printf("[✓] RVM basic requirements: PASS (RV32/64 + M + C)\n");
    }
}

int main() {
    uint64_t misa = read_misa();

    if (misa == 0) {
        // misa may legally read as zero when the CSR is not implemented
        printf("Error: misa reads as zero (not implemented on this hart)\n");
        return 1;
    }

    check_profile(misa);
    return 0;
}

Compile and Run

# Compile (requires M-mode execution environment)
riscv64-unknown-elf-gcc -march=rv64gc -o profile_detect profile_detect.c

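# Run on Spike. Note that pk starts the program in U-mode, where csrr misa
# raises an illegal-instruction exception; the read succeeds only when the
# program is built and started as an M-mode (bare-metal) image.
spike pk profile_detect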

Expected Output (RV64GC target with S and U modes)

=== RISC-V Profile Detector ===

Base: RV64
Extensions: ACDFIMSU

--- Profile Compatibility ---
[✓] RVA22 basic requirements: PASS
    (Additional checks needed: Zba, Zbb, Zbs, Sv39, PMP>=8)
[✓] RVM basic requirements: PASS (RV32/64 + M + C)

Key Takeaways

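misa CSR: Single read reveals the single-letter extensions only; Z* and S* extensions need another discovery mechanism such as the device tree
Profile != Full Compliance: Even if IMAFDC is present, RVA22 still needs additional features like Zba, Zbb, Zbs, and Sv39, none of which appear in misa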
Runtime Detection: Production code should check features at boot, not assume

danieRTOS Reference: danieRTOS checks for M extension at startup to decide whether to use hardware or software multiply.

⚠️ Common Pitfalls

Pitfall 1: Profile Version ≠ Performance

Misconception: “RVA23 is 10% faster than RVA22”

Truth: Profile version only represents the feature set’s year, not clock speed or hardware performance.

RVA22 @ 2.0 GHz could be much faster than RVA23 @ 1.0 GHz!

Profiles define "software compatibility", not "hardware performance".

Pitfall 2: Thinking RV64G = RVA22

Error Scenario: A vendor claims RV64GC support, and you assume the latest Fedora release will run.

Truth: RVA22 requires additional Extensions and hardware features:

Requirement	RV64GC	RVA22
IMAFDCSU	✓	✓
Zba, Zbb, Zbs	✗	✓ (Mandatory)
Sv39 (MMU)	Not specified	✓ (Mandatory)
PMP ≥ 8 entries	Not specified	Platform-level (outside the S-mode profile)

Pitfall 3: Ignoring Profile Backward Compatibility

Error Scenario: Compiling a program for RVA23 and expecting it to run on RVA22 hardware.

Correct Understanding:

RVA22 software → RVA23 hardware  ✓ (Forward compatible)
RVA23 software → RVA22 hardware  ✗ (May use new instructions)

💡 Recommendation: For maximum compatibility, specify an older Profile as your compilation target.

Summary

Platform profiles and embedded system design define how RISC-V adapts to different use cases. This chapter covered five key areas that enable RISC-V to serve both high-performance application processors and resource-constrained embedded systems.

Platform profiles solve the fragmentation problem by defining standard combinations of extensions. Profiles specify mandatory extensions, optional features, and implementation requirements. This creates standard targets for software development and guarantees compatibility. The naming convention (RVA22S, RVA23U, etc.) clearly indicates the profile’s purpose and version.

RVA22 profile targets application processors running rich operating systems. RVA22S requires 64-bit base ISA, standard extensions (M, A, F, D, C), bit manipulation (Zba, Zbb, Zbs), and Sv39 virtual memory. The examples in this chapter additionally assume at least 8 PMP entries, which is a platform-level choice rather than part of the S-mode profile. RVA22U provides the same ISA extensions for user-mode applications. This profile enables Linux, FreeBSD, and other full operating systems to run portably across RISC-V implementations.

RVA23 profile builds on RVA22 with additional extensions and stricter requirements. New mandatory extensions include the vector extension (V), Zicond (conditional operations), Zfa (additional floating-point), and Zawrs (wait-on-reservation-set). In supervisor mode, previously optional features such as Sstc and Sscofpmf become mandatory, while Sv48 and Sv57 stay optional. RVA23 maintains forward compatibility: RVA22 software runs unmodified on RVA23 hardware.

Embedded profiles target microcontrollers and real-time systems with different constraints. RV32E reduces the register file to 16 registers for ultra-low-cost applications. Embedded systems emphasize compressed instructions for code size, fast interrupt handling through CLIC, and low-power features like WFI and clock gating. These profiles enable RISC-V to compete in the microcontroller market.

MMU vs no-MMU systems represent two different approaches to memory management. MMU-based systems provide virtual memory, per-process address spaces, and page-based protection, enabling full operating systems like Linux. No-MMU systems use physical addresses directly, offering simpler hardware, lower cost, and better determinism. PMP provides region-based memory protection for no-MMU systems, enabling task isolation without the complexity of virtual memory.

Comparison with ARM Cortex-M shows RISC-V’s competitive position in embedded systems. Both architectures provide similar features—vectored interrupts, nested interrupt handling, memory protection, and optional floating-point. RISC-V offers advantages in licensing (open, no fees), customization (modular ISA), and modern design. ARM Cortex-M leads in ecosystem maturity, vendor support, and proven deployment. The choice depends on priorities: cost and flexibility favor RISC-V, while ecosystem maturity favors ARM.

Together, platform profiles and embedded system features enable RISC-V to serve the full spectrum from tiny microcontrollers to powerful application processors, with clear standards for software portability and hardware compliance.

Chapter 15. Debugging & Trace

Part IX — Performance, Debug & Tools

🎯 Learning Objectives

After reading this chapter, you will be able to:

Master GDB Stub Usage: Use QEMU -s -S to start a GDB Server for remote debugging
Know Core Debug Commands: break, si, info reg, x/, and other GDB commands
Build a Debug Mindset: Adopt a systematic “Observe → Hypothesize → Verify” debugging workflow

💡 Scenario: The Pocket Watch That Stops Time

Scene: Junior is pointing at garbled results on the screen, about to lose it.

Junior: “I can’t take this anymore. I just wanted to add 1 to 5, why is the result 34821? I’ve been staring at these five lines of Assembly for an hour!”

Senior: “Looking with your eyes isn’t enough. Junior, program execution is like a speeding train—how can you see if there’s a crack in the wheels while sitting by the tracks?”

Junior: “So what do I do? Add printf?”

Senior: “printf is fine, but in bare-metal situations or when the program crashes, you can’t even print anything. You need a ‘pocket watch’ that can pause time—GDB.”

Junior: “Pause time?”

Senior: “Exactly. Through a JTAG hardware interface (or the QEMU emulator we’re using now), we can force the CPU into Debug Mode.

Debug Mode Feature	Analogy
Check Registers	Peeking in a wallet
View Memory	Searching through drawers
Single Step	Slow-motion replay
Breakpoint	Setting a trap

In this mode, the CPU is like someone pressed the pause button. We can execute just one instruction at a time and see where things went wrong.

Come on, give me that broken program. Let’s go bug hunting.”

Software development requires debugging. A program crashes, and we need to know why. A function returns the wrong value, and we need to step through its execution. A performance bottleneck appears, and we need to trace instruction flow. Debugging transforms opaque failures into understandable problems.

RISC-V provides a comprehensive debug architecture that supports both halting debug (stop the processor, examine state) and non-intrusive trace (record execution without stopping). The Debug Module allows external debuggers to control the processor through JTAG or other interfaces. Hardware breakpoints and triggers enable precise control over when to halt execution. Debug mode provides a special execution environment for debug operations. Trace support captures instruction and data flow for post-mortem analysis.

This chapter explores RISC-V debugging and trace capabilities. We’ll examine the debug architecture, debug interfaces, hardware breakpoints, debug mode operation, trace support, and how RISC-V compares to ARM’s CoreSight debug infrastructure.

15.1 RISC-V Debug Architecture

Debug Requirements

A debug system must provide:

Halt and resume: Stop processor execution, examine state, continue
Register access: Read and write CPU registers
Memory access: Read and write system memory
Breakpoints: Stop execution at specific instructions or data accesses
Single-step: Execute one instruction at a time
Reset control: Reset the processor or system

RISC-V’s debug architecture separates concerns into distinct modules, allowing flexible implementation while maintaining standard interfaces.

Debug Components

External Debugger (GDB, OpenOCD)
    ↓
Debug Transport Module (DTM) - JTAG, USB, etc.
    ↓
Debug Module Interface (DMI)
    ↓
Debug Module (DM)
    ↓
RISC-V Core (enters Debug Mode)

Key components:

Debug Transport Module (DTM): Provides physical connection (JTAG, USB, etc.)

Debug Module (DM): Controls the core, implements debug operations

Debug Module Interface (DMI): Standard interface between DTM and DM

Debug Mode: Special execution mode for debug operations

Debug Module (DM)

The Debug Module is the central component that:

Halts and resumes the core
Provides abstract commands for register/memory access
Manages hardware breakpoints (triggers)
Controls reset

The debugger reaches the DM through registers in the DMI address space, which is separate from the system memory map:

Key registers (DMI address):
  data0-11 (0x04-0x0f):    Data transfer registers
  dmcontrol (0x10):        Control register (halt, resume, reset)
  dmstatus (0x11):         Status register (halted, running, etc.)
  hartinfo (0x12):         Hart information
  abstractcs (0x16):       Abstract command status
  command (0x17):          Abstract command register
  progbuf0-15 (0x20-0x2f): Program buffer
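As an illustration of how a debugger drives these registers, the sketch below builds dmcontrol values for halting and resuming a hart. The field positions (haltreq at bit 31, resumereq at bit 30, hartsello at bits 25:16, dmactive at bit 0) follow the debug specification; the constant and function names are hypothetical, and only the low ten bits of the hart index are handled.

```c
#include <stdint.h>

/* DMI addresses of selected Debug Module registers. */
enum { DM_DATA0 = 0x04, DM_DMCONTROL = 0x10, DM_DMSTATUS = 0x11,
       DM_ABSTRACTCS = 0x16, DM_COMMAND = 0x17, DM_PROGBUF0 = 0x20 };

#define DMCONTROL_HALTREQ    (1u << 31)
#define DMCONTROL_RESUMEREQ  (1u << 30)
#define DMCONTROL_DMACTIVE   (1u << 0)

/* hartsello occupies bits 25:16 and holds the low 10 bits of the hart index. */
static uint32_t dmcontrol_select(uint32_t hart) {
    return ((hart & 0x3FF) << 16) | DMCONTROL_DMACTIVE;
}

/* Value to write to dmcontrol to request a halt of the given hart. */
uint32_t dmcontrol_halt(uint32_t hart) {
    return dmcontrol_select(hart) | DMCONTROL_HALTREQ;
}

/* Value to write to dmcontrol to request a resume of the given hart. */
uint32_t dmcontrol_resume(uint32_t hart) {
    return dmcontrol_select(hart) | DMCONTROL_RESUMEREQ;
}
```

Keeping dmactive set in every write matters, because clearing it resets the Debug Module itself.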

Debug Mode vs Machine Mode

Debug mode is a special execution mode distinct from M/S/U modes:

Higher privilege than M-mode
Can access all system resources
Uses separate CSRs (dcsr, dpc, dscratch0/1)
Executes from Debug ROM or Program Buffer

Privilege Hierarchy:
  Debug Mode (highest)
    ↓
  M-mode
    ↓
  S-mode
    ↓
  U-mode (lowest)

15.2 Debug Interface

JTAG Interface

JTAG (Joint Test Action Group) is the standard debug interface for RISC-V. It provides:

4-wire interface (TDI, TDO, TCK, TMS)
Boundary scan for testing
Debug access to the core

JTAG signals:

TDI:  Test Data In (serial data input)
TDO:  Test Data Out (serial data output)
TCK:  Test Clock
TMS:  Test Mode Select (state machine control)
TRST: Test Reset (optional)

JTAG state machine:

Test-Logic-Reset
    ↓
Run-Test/Idle
    ↓
Select-DR-Scan → Capture-DR → Shift-DR → Exit1-DR → Update-DR
    ↓
Select-IR-Scan → Capture-IR → Shift-IR → Exit1-IR → Update-IR
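The diagram above shows only the main paths. The complete TAP controller has 16 states, and each transition depends on the TMS value sampled on a rising TCK edge. The table-driven model below is a sketch for experimenting with TMS sequences; the type and function names are introduced here for illustration.

```c
#include <stdint.h>

typedef enum {
    TLR, RTI,
    SEL_DR, CAP_DR, SHIFT_DR, EXIT1_DR, PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SEL_IR, CAP_IR, SHIFT_IR, EXIT1_IR, PAUSE_IR, EXIT2_IR, UPDATE_IR
} tap_state_t;

/* tap_next[state][tms]: next state for TMS = 0 and TMS = 1. */
static const tap_state_t tap_next[16][2] = {
    [TLR]       = { RTI,      TLR       },
    [RTI]       = { RTI,      SEL_DR    },
    [SEL_DR]    = { CAP_DR,   SEL_IR    },
    [CAP_DR]    = { SHIFT_DR, EXIT1_DR  },
    [SHIFT_DR]  = { SHIFT_DR, EXIT1_DR  },
    [EXIT1_DR]  = { PAUSE_DR, UPDATE_DR },
    [PAUSE_DR]  = { PAUSE_DR, EXIT2_DR  },
    [EXIT2_DR]  = { SHIFT_DR, UPDATE_DR },
    [UPDATE_DR] = { RTI,      SEL_DR    },
    [SEL_IR]    = { CAP_IR,   TLR       },
    [CAP_IR]    = { SHIFT_IR, EXIT1_IR  },
    [SHIFT_IR]  = { SHIFT_IR, EXIT1_IR  },
    [EXIT1_IR]  = { PAUSE_IR, UPDATE_IR },
    [PAUSE_IR]  = { PAUSE_IR, EXIT2_IR  },
    [EXIT2_IR]  = { SHIFT_IR, UPDATE_IR },
    [UPDATE_IR] = { RTI,      SEL_DR    }
};

/* Advance the controller by one TCK edge. */
tap_state_t tap_step(tap_state_t s, int tms) {
    return tap_next[s][tms ? 1 : 0];
}

/* Apply count TMS bits, least significant bit first. */
tap_state_t tap_run(tap_state_t s, uint32_t tms_bits, int count) {
    for (int i = 0; i < count; i++) {
        s = tap_step(s, (int)((tms_bits >> i) & 1));
    }
    return s;
}
```

One useful property falls out of the table: holding TMS high for five clocks reaches Test-Logic-Reset from any state, which is why the TRST pin is optional.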

Debug Module Interface (DMI)

DMI is a standard register interface between DTM and DM. It provides:

32-bit register access (DM registers are 32 bits wide)
Address space for DM registers
Status and error reporting

DMI operations:

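// Hypothetical transport helper: shifts one value through the JTAG DTM's dmi
// register (address | data[33:2] | op[1:0]) and returns the value shifted out
uint64_t dtm_scan_dmi(uint32_t addr, uint32_t data, unsigned op);

enum { DMI_OP_NOP = 0, DMI_OP_READ = 1, DMI_OP_WRITE = 2 };

// Read DM register
uint32_t dmi_read(uint32_t addr) {
    dtm_scan_dmi(addr, 0, DMI_OP_READ);             // DTM shifts address into JTAG
    uint64_t out = dtm_scan_dmi(0, 0, DMI_OP_NOP);  // Next scan returns the data
    return (uint32_t)(out >> 2);                    // Data field is bits [33:2]
}

// Write DM register
void dmi_write(uint32_t addr, uint32_t data) {
    dtm_scan_dmi(addr, data, DMI_OP_WRITE);         // DTM shifts address and data into JTAG
    dtm_scan_dmi(0, 0, DMI_OP_NOP);                 // DM performs write; op bits report status
}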

Abstract Commands

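Abstract commands provide high-level debug operations that the DM carries out on the debugger’s behalf, so the debugger does not have to supply instructions through the program buffer:

Access Register: Read/write CPU registers
Access Memory: Read/write system memory
Quick Access: Halt the hart, execute the program buffer, and resume in a single command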

// Example: Read register x10 using abstract command (hart must be halted)
void read_register_x10(uint32_t *value) {
    // Write abstract command: aarsize=2 (32-bit), transfer=1, regno=0x100a (x10)
    dm_write(command, 0x0022100a);
    
    // Wait for completion
    while (dm_read(abstractcs) & ABSTRACTCS_BUSY);
    
    // Read result (valid only if abstractcs.cmderr, bits 10:8, is 0)
    *value = dm_read(data0);
}
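Hand-assembled command words are error-prone, so a small encoder helps. The sketch below packs the Access Register fields (aarsize at bits 22:20, transfer at bit 17, write at bit 16, regno at bits 15:0, with general-purpose registers numbered from 0x1000). The function and constant names are hypothetical, introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum { AARSIZE_32 = 2, AARSIZE_64 = 3 };

/* Abstract register number of general-purpose register xN: 0x1000 + N. */
static inline uint16_t regno_gpr(unsigned n) {
    return (uint16_t)(0x1000 + (n & 31));
}

/* Build an Access Register command (cmdtype = 0) that transfers one register. */
uint32_t access_register_cmd(uint16_t regno, unsigned aarsize, bool write) {
    uint32_t cmd = 0;                       /* cmdtype [31:24] = 0 */
    cmd |= (uint32_t)(aarsize & 7) << 20;   /* aarsize [22:20] */
    cmd |= 1u << 17;                        /* transfer = 1 */
    if (write) cmd |= 1u << 16;             /* write: data0 -> register */
    cmd |= regno;                           /* regno [15:0] */
    return cmd;
}
```

For a 32-bit read of x10, `access_register_cmd(regno_gpr(10), AARSIZE_32, false)` yields 0x0022100a. On an RV64 hart, a debugger would normally use `AARSIZE_64` to capture the full register.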

System Bus Access

The DM can access system memory directly through the system bus, bypassing the core:

// Read memory at address 0x80000000
uint32_t read_memory(uint64_t addr) {
    // Configure first: 32-bit access, start a read when sbaddress0 is written
    dm_write(sbcs, SBCS_SBREADONADDR | SBCS_SBACCESS32);
    dm_write(sbaddress1, addr >> 32);          // High half first
    dm_write(sbaddress0, addr & 0xFFFFFFFF);   // This write triggers the bus read
    
    while (dm_read(sbcs) & SBCS_SBBUSY);
    
    return dm_read(sbdata0);
}

15.3 Hardware Breakpoints and Triggers

Trigger Module

RISC-V provides a flexible trigger system for hardware breakpoints and watchpoints. Triggers can:

Break on instruction execution (instruction breakpoint)
Break on data access (data watchpoint)
Break on exceptions
Chain multiple conditions

Triggers are configured through CSRs:

tselect: Select trigger register
tdata1: Trigger configuration
tdata2: Trigger match value
tdata3: Additional trigger data (optional)

Trigger Types

RISC-V defines several trigger types:

Type 2 (mcontrol): Address/data match trigger

Match on instruction fetch, load, or store
Configurable match conditions (equal, greater, less, mask)
Action: enter debug mode, raise exception, or trace

Type 3 (icount): Instruction count trigger

Break after N instructions
Useful for single-stepping

Type 4 (itrigger): Interrupt trigger

Break on specific interrupts

Type 5 (etrigger): Exception trigger

Break on specific exceptions

Breakpoint Configuration

Setting an instruction breakpoint:

// Set breakpoint at address 0x80000100
// (run from Debug Mode: the dmode bit is writable only there)
void set_breakpoint(uint64_t addr) {
    // Select trigger 0
    write_csr(tselect, 0);

    // Configure mcontrol trigger
    uint64_t tdata1 = 0;
    tdata1 |= (2ULL << 60);      // type = 2 (mcontrol)
    tdata1 |= (1ULL << 59);      // dmode = 1 (required for action = 1)
    tdata1 |= (1ULL << 6);       // m = 1 (match in M-mode)
    tdata1 |= (1ULL << 2);       // execute = 1 (match on instruction fetch)
    tdata1 |= (1ULL << 12);      // action = 1 (enter debug mode)

    write_csr(tdata1, tdata1);
    write_csr(tdata2, addr);     // Match address
}

Setting a data watchpoint:

// Set watchpoint on store to address 0x80001000
// (run from Debug Mode: the dmode bit is writable only there)
void set_watchpoint(uint64_t addr) {
    write_csr(tselect, 1);

    uint64_t tdata1 = 0;
    tdata1 |= (2ULL << 60);      // type = 2 (mcontrol)
    tdata1 |= (1ULL << 59);      // dmode = 1 (required for action = 1)
    tdata1 |= (1ULL << 6);       // m = 1 (match in M-mode)
    tdata1 |= (1ULL << 1);       // store = 1 (match on store)
    tdata1 |= (1ULL << 12);      // action = 1 (enter debug mode)

    write_csr(tdata1, tdata1);
    write_csr(tdata2, addr);
}
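The two functions above differ only in which bits they set, so the packing can be factored into one helper. This is a minimal sketch for XLEN = 64 using the mcontrol field positions from this section. The `mcontrol_cfg_t` type and `mcontrol_pack` function are hypothetical, and setting dmode automatically whenever the action is debug-mode entry is a design choice of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

enum { MATCH_EQUAL = 0, MATCH_GE = 2, MATCH_LT = 3 };
enum { ACTION_EXCEPTION = 0, ACTION_DEBUG_MODE = 1 };

typedef struct {
    bool execute, store, load;   /* access types to match */
    bool m, s, u;                /* privilege modes in which the trigger is active */
    bool chain;                  /* AND this trigger with the next one */
    unsigned match;              /* MATCH_* comparison */
    unsigned action;             /* ACTION_* taken when the trigger fires */
} mcontrol_cfg_t;

/* Pack a type-2 (mcontrol) tdata1 value for XLEN = 64. */
uint64_t mcontrol_pack(const mcontrol_cfg_t *c) {
    uint64_t v = 2ULL << 60;                               /* type = 2 */
    if (c->action == ACTION_DEBUG_MODE) v |= 1ULL << 59;   /* dmode */
    v |= (uint64_t)(c->action & 0xF) << 12;                /* action [15:12] */
    v |= (uint64_t)(c->match & 0xF) << 7;                  /* match [10:7] */
    if (c->chain)   v |= 1ULL << 11;
    if (c->m)       v |= 1ULL << 6;
    if (c->s)       v |= 1ULL << 4;
    if (c->u)       v |= 1ULL << 3;
    if (c->execute) v |= 1ULL << 2;
    if (c->store)   v |= 1ULL << 1;
    if (c->load)    v |= 1ULL << 0;
    return v;
}
```

An M-mode execute breakpoint that enters debug mode packs to 0x2800000000001044, and the equivalent store watchpoint to 0x2800000000001042.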

Trigger Chaining

Multiple triggers can be chained to create complex conditions:

// Break when 0x80000100 <= PC < 0x80000200 (address range via chained triggers)
// Note: mcontrol matches addresses or load/store data, not register values
void set_conditional_breakpoint(void) {
    // Common fields: type = 2, dmode = 1, action = 1, m = 1, execute = 1
    uint64_t base = (2ULL << 60) | (1ULL << 59) | (1ULL << 12) | (1ULL << 6) | (1ULL << 2);

    // Trigger 0: PC >= 0x80000100 (match = 2), chained to trigger 1
    write_csr(tselect, 0);
    write_csr(tdata1, base | (2ULL << 7) | (1ULL << 11));  // chain = 1 (bit 11)
    write_csr(tdata2, 0x80000100);

    // Trigger 1: PC < 0x80000200 (match = 3), end of chain
    write_csr(tselect, 1);
    write_csr(tdata1, base | (3ULL << 7));
    write_csr(tdata2, 0x80000200);
}

Trigger Actions

When a trigger fires, it can:

Enter debug mode (action = 1)
Raise breakpoint exception (action = 0)
Generate trace event (implementation-specific)

15.4 Debug Mode

Entering Debug Mode

The core enters debug mode when:

External debugger requests halt (via dmcontrol.haltreq)
Hardware breakpoint fires (trigger with action = 1)
Single-step completes (dcsr.step = 1)
ebreak executes while the matching dcsr.ebreakm/ebreaks/ebreaku bit is set

Upon entering debug mode:

PC is saved to dpc (Debug PC)
Cause is saved to dcsr.cause
Core halts execution
PC jumps to Debug ROM or Program Buffer

Debug CSRs

Debug mode uses four special CSRs:

dcsr (Debug Control and Status):

Bits:
  [31:28] xdebugver: Debug spec version
  [15]    ebreakm: ebreak enters debug mode in M-mode
  [13]    ebreaks: ebreak enters debug mode in S-mode
  [12]    ebreaku: ebreak enters debug mode in U-mode
  [8:6]   cause: Why debug mode was entered
  [2]     step: Single-step mode
  [1:0]   prv: Privilege mode before debug

dpc (Debug PC): Saved PC when entering debug mode

dscratch0/1: Scratch registers for debug code
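A debugger typically decodes dcsr right after a halt to report why the hart stopped. The sketch below extracts the cause, step, and prv fields and maps the cause codes to names. The cause numbering (1 = ebreak, 2 = trigger, 3 = haltreq, 4 = step, 5 = resethaltreq) follows the debug specification; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned cause;   /* dcsr[8:6]: why debug mode was entered */
    unsigned step;    /* dcsr[2]: single-step enabled */
    unsigned prv;     /* dcsr[1:0]: 0 = U, 1 = S, 3 = M */
} dcsr_fields_t;

/* Extract the fields a debugger usually inspects after a halt. */
dcsr_fields_t dcsr_decode(uint32_t dcsr) {
    dcsr_fields_t f;
    f.cause = (dcsr >> 6) & 0x7;
    f.step  = (dcsr >> 2) & 0x1;
    f.prv   = dcsr & 0x3;
    return f;
}

/* Human-readable name for a dcsr.cause value. */
const char *dcsr_cause_name(unsigned cause) {
    switch (cause) {
    case 1:  return "ebreak";
    case 2:  return "trigger";
    case 3:  return "haltreq";
    case 4:  return "step";
    case 5:  return "resethaltreq";
    default: return "reserved";
    }
}
```

A value of 0x400000C3, for example, decodes to cause 3 (haltreq) with the hart previously running in M-mode.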

Debug ROM and Program Buffer

When entering debug mode, the core executes code from:

Debug ROM: Small ROM containing debug entry code

Saves context
Waits for debugger commands
Restores context on resume

Program Buffer: RAM for debugger-supplied code

Debugger writes instructions here
Core executes them in debug mode
Used for complex operations (e.g., memory copy)

Example debug ROM code:

# Debug ROM entry point
debug_rom_entry:
    # Save x10 to dscratch0
    csrw dscratch0, x10

    # Load program buffer address
    lui x10, %hi(progbuf)
    addi x10, x10, %lo(progbuf)

    # Jump to program buffer
    jr x10

# Program buffer (written by debugger)
progbuf:
    # Debugger writes instructions here
    # Example: copy x5 into dscratch1 for the debugger to read
    csrw dscratch1, x5
    # ... more instructions ...
    ebreak  # Return to debug ROM

Resuming from Debug Mode

To resume execution:

Debugger writes to dmcontrol.resumereq
Debug ROM restores context
Core executes dret instruction
PC restored from dpc
Privilege mode restored from dcsr.prv

# Resume from debug mode
debug_resume:
    # Restore x10 from dscratch0
    csrr x10, dscratch0

    # Return from debug mode
    dret  # PC ← dpc, privilege ← dcsr.prv

Single-Stepping

Single-step mode executes one instruction then re-enters debug mode:

// Enable single-step (dcsr is accessible only in Debug Mode)
void enable_single_step(void) {
    uint64_t dcsr = read_csr(dcsr);
    dcsr |= (1 << 2);  // step = 1
    write_csr(dcsr, dcsr);
}

// Debugger workflow:
// 1. Halt core
// 2. Enable single-step
// 3. Resume (executes one instruction)
// 4. Core re-enters debug mode
// 5. Repeat

15.5 Trace Support

RISC-V Trace Specification

Trace captures program execution for analysis without halting the core. RISC-V trace provides:

Instruction trace: Record executed instructions
Data trace: Record memory accesses
Trace compression: Reduce trace bandwidth

Trace is non-intrusive—it doesn’t affect program execution or timing.

Instruction Trace

Instruction trace records:

Executed instructions (PC values)
Branch outcomes (taken/not taken)
Exceptions and interrupts
Context changes (privilege mode, ASID)

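Trace packets encode this information efficiently. The packet types below use a simplified, illustrative numbering rather than the exact encoding of a ratified trace specification: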

Trace packet types:
  Format 0: Uncompressed address (full PC)
  Format 1: Differential address (PC delta)
  Format 2: Address with branch map
  Format 3: Synchronization packet

Trace Compression

Full instruction trace is expensive (bandwidth, storage). RISC-V trace uses compression:

Branch map: Encode multiple branch outcomes in one packet

Example: 8 branches, outcomes = 10110010
  Packet: [type=2, branches=10110010, address=...]

Differential encoding: Encode PC delta instead of full PC

Last reported PC: 0x80000100
Branch target:    0x80000180
Packet: [type=1, delta=+0x80]

Implicit sequences: Don’t trace sequential instructions

PC sequence: 0x100, 0x104, 0x108, 0x10c
Trace: [0x100, count=4]  # Implicit +4 increments
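
The branch map and differential encoding ideas can be sketched in a few lines of C. This is a teaching model under stated assumptions, not the wire format of the ratified trace specification: the type branch_map_t, the function names, and the convention "bit 0 is the oldest branch, 1 means taken" are all hypothetical choices made for illustration.

```c
#include <stdint.h>

/* Illustrative branch map: bit 0 holds the oldest outcome, 1 = taken.
 * This is a teaching encoding, not the format of the ratified spec. */
typedef struct {
    uint32_t bits;   /* packed outcomes */
    unsigned count;  /* number of valid bits (0..32) */
} branch_map_t;

/* Append one outcome; returns 0 when the map is full and must be flushed. */
int branch_map_push(branch_map_t *m, int taken) {
    if (m->count >= 32) {
        return 0;
    }
    if (taken) {
        m->bits |= (uint32_t)1 << m->count;
    }
    m->count++;
    return 1;
}

/* Differential encoding: signed distance from the last reported address. */
int64_t pc_delta(uint64_t last_reported, uint64_t current) {
    return (int64_t)(current - last_reported);
}

/* Decoder side: rebuild the full address from the delta. */
uint64_t pc_from_delta(uint64_t last_reported, int64_t delta) {
    return last_reported + (uint64_t)delta;
}
```

The saving comes from size: a short delta and a handful of outcome bits usually need far fewer bytes than one full 64-bit address per branch.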

Data Trace

Data trace records memory accesses:

Load/store addresses
Data values
Access size (byte, halfword, word, doubleword)

A data trace packet carries the access type, address, data value, and size:

Data trace packet:
  [type] [address] [data] [size]

Example:
  SW x10, 0(x5)  # Store word
  Packet: [type=store, addr=0x80001000, data=0x12345678, size=4]

Trace Filtering

Trace can be filtered to reduce bandwidth:

Address range filtering (trace only specific code regions)
Privilege filtering (trace only M-mode, S-mode, etc.)
Event filtering (trace only branches, exceptions, etc.)

// Configure trace filtering (implementation-specific)
void configure_trace_filter(uint64_t start, uint64_t end) {
    // Enable trace for address range [start, end]
    trace_write(TRACE_ADDR_START, start);
    trace_write(TRACE_ADDR_END, end);
    trace_write(TRACE_CONTROL, TRACE_ENABLE | TRACE_FILTER_ADDR);
}
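
The decision a trace filter makes for each instruction can be modelled in software. The sketch below is an assumption-laden illustration: trace_filter_t and trace_filter_match are hypothetical names, and a real trace encoder performs this comparison in hardware using its own implementation-specific registers.

```c
#include <stdint.h>

/* Privilege encodings used by RISC-V: U = 0, S = 1, M = 3. */
enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical filter description; real trace encoders expose this
 * through implementation-specific registers. */
typedef struct {
    uint64_t addr_start;   /* inclusive lower bound */
    uint64_t addr_end;     /* inclusive upper bound */
    uint32_t priv_mask;    /* bit n set => trace privilege level n */
} trace_filter_t;

/* Returns 1 if an instruction at pc, executed at priv, should be traced. */
int trace_filter_match(const trace_filter_t *f, uint64_t pc, unsigned priv) {
    int in_range = (pc >= f->addr_start) && (pc <= f->addr_end);
    int priv_ok  = (int)((f->priv_mask >> priv) & 1u);
    return in_range && priv_ok;
}
```

Combining the address test with the privilege mask mirrors the first two filter kinds listed above; an event filter would add a third condition of the same shape.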

15.6 Comparison with ARM Debug

RISC-V Debug vs ARM CoreSight

ARM CoreSight is a comprehensive debug and trace infrastructure. Comparison:

Feature	RISC-V Debug	ARM CoreSight
Debug Interface	JTAG, DTM/DMI	JTAG, SWD (Serial Wire Debug)
Debug Module	Debug Module (DM)	Debug Access Port (DAP)
Halt/Resume	dmcontrol register	DHCSR register
Breakpoints	Trigger module (flexible)	FPB (Flash Patch and Breakpoint)
Watchpoints	Trigger module	DWT (Data Watchpoint and Trace)
Trace	RISC-V Trace spec	ETM (Embedded Trace Macrocell)
Trace Compression	Branch map, differential	Branch broadcast, compression
System Access	System bus access	AHB-AP, APB-AP
Complexity	Modular, simple	Comprehensive, complex

JTAG vs SWD

The RISC-V debug specification defines a JTAG transport (4 signals plus an optional TRST), while ARM supports both JTAG and SWD (2 signals):

JTAG (RISC-V, ARM):
  TDI, TDO, TCK, TMS, (TRST)
  4 pins (5 with optional TRST), standard interface

SWD (ARM only):
  SWDIO (bidirectional data)
  SWCLK (clock)
  2 pins, lower pin count

SWD advantages:

Fewer pins (important for small packages)
Faster than JTAG in some cases
ARM-specific optimization

JTAG advantages:

Industry standard (IEEE 1149.1)
Widely supported tools
Boundary scan capability

Trace Comparison

RISC-V Trace vs ARM ETM (Embedded Trace Macrocell):

Feature	RISC-V Trace	ARM ETM
Instruction Trace	Yes	Yes
Data Trace	Yes	Optional (depends on ETM version and core profile)
Compression	Branch map, differential	Branch broadcast, Q elements
Bandwidth	Configurable	Configurable
Filtering	Address, privilege, event	Address, context ID, VMID
Timestamps	Optional	Yes
Trace Port	Implementation-specific	TPIU (Trace Port Interface Unit)
Trace Buffer	Implementation-specific	ETB (Embedded Trace Buffer)

ARM ETM is mature and widely deployed. RISC-V Trace is newer but follows similar principles with simpler encoding.

Debug Tools

Both architectures support standard debug tools:

RISC-V:

GDB (GNU Debugger)
OpenOCD (Open On-Chip Debugger)
SEGGER J-Link
Lauterbach TRACE32

ARM:

GDB
Keil MDK
ARM DS-5 / Arm Development Studio
SEGGER J-Link
Lauterbach TRACE32

Practical Differences

RISC-V advantages:

Simpler, more modular design
Open specification (no licensing)
Flexible trigger system
Easier to implement

ARM advantages:

Mature ecosystem
SWD reduces pin count
Comprehensive trace infrastructure
Extensive tool support

For embedded systems, SWD’s 2-pin interface is attractive. For complex SoCs, both architectures provide comparable debug capabilities.

🛠️ Hands-on Lab: Lab 15.1 — The Vanishing Values (Bug Hunting with GDB)

This lab features a classic Pointer Stride Error, a common mistake for assembly beginners: in C, adding 1 to an int pointer advances it by 4 bytes, but in assembly addi x, x, 1 adds exactly 1 byte.

Lab Objectives

Launch QEMU’s GDB Server feature
Connect GDB and load symbols
Use layout asm to view assembly
Find the two bugs in the program

Buggy Code (buggy_sum.S)

.section .data
# Define an array: 10, 20, 30, 40, 50
# Expected result: 10+20+30+40+50 = 150 (Hex: 0x96)
nums: .word 10, 20, 30, 40, 50

.section .text
.global _start

_start:
    la  t0, nums        # t0 points to array start
    li  t1, 5           # t1 is loop counter (Count = 5)
    # BUG 1: Forgot to initialize accumulator a0
    # We assume a0 is 0, but it might be garbage

loop:
    lw  t2, 0(t0)       # Load current number into t2
    add a0, a0, t2      # Accumulate: a0 = a0 + t2

    # BUG 2: Pointer stride error!
    # We're reading words (4 bytes), but here we only add 1
    addi t0, t0, 1      # ❌ Should be addi t0, t0, 4

    addi t1, t1, -1     # Decrement counter
    bnez t1, loop       # If not done, continue loop

stop:
    j stop

Debug Workflow

Step A: Compile (with debug info)

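# -g adds debug information (source lines and symbols) for GDB
# -Wl,-Ttext=0x80000000 links the code at the start of RAM on QEMU's virt machine
riscv64-unknown-elf-gcc -g -nostdlib -Wl,-Ttext=0x80000000 \
    -o buggy_sum.elf buggy_sum.S

Step B: Start QEMU (as Target)

# -bios none: Skip the default firmware so execution starts in our ELF
# -S: Pause CPU immediately after startup
# -s: Enable GDB Server, default Port 1234
qemu-system-riscv64 -machine virt -nographic -bios none \
    -kernel buggy_sum.elf -S -s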

(Terminal will hang—open another terminal for GDB)

Step C: Start GDB (as Host)

riscv64-unknown-elf-gdb buggy_sum.elf

Step D: GDB Interactive Investigation

(gdb) target remote :1234     # Connect to QEMU
(gdb) layout asm              # Open Assembly view
(gdb) break loop              # Set breakpoint at loop label
(gdb) continue                # Run until breakpoint

# After entering the loop...
(gdb) info reg a0             # Observe accumulator → leftover value, not guaranteed 0
(gdb) info reg t0             # Observe pointer
(gdb) si                      # Single Step one instruction
(gdb) info reg t0             # Look at pointer again → only moved 1 byte!

# After finding the problem...
(gdb) x/5xw &nums             # View array memory contents

Expected Findings

Bug 1 (a0 uninitialized): First time entering loop, info reg a0 shows whatever earlier code left in the register, which is not guaranteed to be 0
Bug 2 (pointer stride): Each addi t0, t0, 1 only increases t0 by 1, causing misaligned data reads
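
The effect of the wrong stride can be reproduced on any machine with a small C model. This is a sketch under stated assumptions: load_le32, sum_with_stride, and nums_bytes are hypothetical names, and the array is written out as little-endian bytes (as RISC-V stores it) so the result does not depend on the host's byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a little-endian 32-bit value byte by byte, so the load is
 * well defined at any alignment and on any host. */
uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Sum `count` words starting at `base`, advancing `stride` bytes each time.
 * stride = 4 matches the fixed loop; stride = 1 reproduces Bug 2. */
uint32_t sum_with_stride(const uint8_t *base, size_t count, size_t stride) {
    uint32_t sum = 0;                 /* explicit init: the fix for Bug 1 */
    for (size_t i = 0; i < count; i++) {
        sum += load_le32(base + i * stride);
    }
    return sum;
}

/* The array from buggy_sum.S (10, 20, 30, 40, 50) as little-endian bytes. */
const uint8_t nums_bytes[20] = {
    10, 0, 0, 0,  20, 0, 0, 0,  30, 0, 0, 0,  40, 0, 0, 0,  50, 0, 0, 0
};
```

With a stride of 4, sum_with_stride(nums_bytes, 5, 4) returns 150. With a stride of 1 it returns 336860190, because the five loads overlap and keep rereading bytes of the first two elements at shifted positions.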

Fixed Code

_start:
    la  t0, nums
    li  t1, 5
    li  a0, 0           # ✅ FIX 1: Initialize accumulator

loop:
    lw  t2, 0(t0)
    add a0, a0, t2
    addi t0, t0, 4      # ✅ FIX 2: Stride = 4 bytes (word size)
    addi t1, t1, -1
    bnez t1, loop

stop:
    j stop

danieRTOS Reference: The danieRTOS context switch code carefully uses word-aligned offsets when saving/restoring registers to the stack.

⚠️ Common Pitfalls

Pitfall 1: Compiler Optimization Interferes with Debugging

Error Scenario: After compiling with -O2, line numbers in GDB don’t match source code, variables “disappear”.

Cause: The optimizer reorders instructions, keeps variables in registers or eliminates them entirely, and inlines functions.

# ❌ Don't use high optimization when debugging
riscv64-unknown-elf-gcc -O2 -o program.elf program.c

# ✅ Use -O0 -g when debugging
riscv64-unknown-elf-gcc -O0 -g -o program.elf program.c

Pitfall 2: Forgetting the `-g` Flag

Error Scenario: GDB shows “No symbol table is loaded”.

Cause: Compiled without -g, so no debug information was generated.

# ❌ No debug info
riscv64-unknown-elf-gcc -o program.elf program.c

# ✅ Keep debug info
riscv64-unknown-elf-gcc -g -o program.elf program.c

Pitfall 3: QEMU Not Paused with `-S`

Error Scenario: Program already finished or crashed by the time GDB connects.

Solution: Always add -S to make QEMU pause after startup, waiting for GDB.

# ❌ Program starts executing immediately after launch
qemu-system-riscv64 -machine virt -kernel program.elf -s

# ✅ Program pauses after launch, waiting for GDB
qemu-system-riscv64 -machine virt -kernel program.elf -S -s

💡 Tip: -s = GDB Server on port 1234, -S = Stop at startup. These are often used together.

Summary

Debugging and trace are essential for software development and system analysis. This chapter explored RISC-V’s debug architecture and how it compares to ARM’s mature CoreSight infrastructure.

Debug architecture separates concerns into modular components. The Debug Transport Module provides physical connectivity through JTAG or other interfaces. The Debug Module controls the core through a standard Debug Module Interface. Debug mode provides a privileged execution environment for debug operations. This separation allows flexible implementations while maintaining standard interfaces for debugger tools.

Debug interface uses JTAG as the standard physical layer, providing four-wire connectivity for debug access. The Debug Module Interface defines register-level operations for controlling the core. Abstract commands enable high-level operations like register and memory access without the debugger having to supply instructions through the Program Buffer. System bus access allows the debugger to read and write memory directly, bypassing the core entirely.

Hardware breakpoints and triggers provide flexible mechanisms for halting execution. The trigger module supports multiple trigger types including address match, instruction count, interrupt, and exception triggers. Triggers can match on instruction fetch, data load, or data store. Trigger chaining enables complex conditional breakpoints. Actions include entering debug mode, raising exceptions, or generating trace events.

Debug mode is a special execution environment with higher privilege than M-mode. The core enters debug mode on external halt requests, breakpoint hits, or single-step completion. Debug CSRs (dcsr, dpc, dscratch) manage debug state. The Debug ROM provides entry code, while the Program Buffer allows debuggers to execute custom instruction sequences. Single-stepping executes one instruction then re-enters debug mode, enabling step-through debugging.

Trace support captures program execution non-intrusively. Instruction trace records executed instructions, branch outcomes, and control flow changes. Data trace records memory accesses and data values. Trace compression reduces bandwidth through branch maps, differential encoding, and implicit sequences. Trace filtering limits capture to specific address ranges, privilege levels, or events, reducing trace data volume.

Comparison with ARM shows both similarities and differences. ARM CoreSight provides comprehensive debug and trace infrastructure with mature tool support. SWD offers a 2-pin alternative to JTAG, reducing pin count for embedded systems. ARM ETM provides extensive trace capabilities with sophisticated compression. RISC-V’s debug architecture is simpler and more modular, with an open specification and flexible trigger system. Both architectures support standard tools like GDB and OpenOCD, ensuring practical usability.

Together, RISC-V’s debug and trace capabilities enable effective software development, system analysis, and problem diagnosis across the full range from embedded microcontrollers to high-performance application processors.

Chapter 16. Performance Counters & PMU

Part IX — Performance, Debug & Tools

🎯 Learning Objectives

After reading this chapter, you will be able to:

Read Performance Counters: Use csrr to read cycle and instret CSRs
Calculate IPC Metrics: Understand the meaning and formula for Instructions Per Cycle
Identify Performance Bottlenecks: Distinguish characteristics of Compute-bound vs Memory-bound programs

💡 Scenario: The CPU’s Indigestion

Scene: Junior comes running to Senior with a data sheet.

Junior: “Senior, look! I unrolled this loop, which made more instructions, but the execution time actually got shorter. That doesn’t make sense! More instructions should mean slower, right?”

Senior: “That’s a common rookie mistake—only looking at ‘food quantity’ (Instruction Count), not ‘digestion speed’ (IPC).

A CPU is like a hot dog eating contest competitor:

Concept	Analogy
Cycle	Contest time (seconds)
Instret (Instructions)	Number of hot dogs eaten
IPC (Instructions Per Cycle)	Swallowing speed

Formula: IPC = Instret / Cycle”

Junior: “So after I unrolled the loop, even though there are more hot dogs, they’re being swallowed faster?”

Senior: “Exactly. Your previous code probably had ‘Data Dependencies’—the previous bite wasn’t swallowed yet, so the next bite couldn’t go in, causing the competitor to just stand there dazed (Pipeline Stall), resulting in low IPC.

After loop unrolling, instructions don’t interfere with each other, so the CPU can swallow several at once (Pipeline filled), and IPC goes up. So even though total instruction count increased, because swallowing is fast enough, total time actually decreased.”

Junior: “I see! So higher IPC is always better?”

Senior: “Not necessarily. If you just have them drink water (execute nop), they can swallow super fast (high IPC), but they’re not actually eating anything (no useful work). So when looking at performance, we must look at Cycle Count and IPC together.”

Performance optimization requires measurement. When a program runs slowly, we need to know why: cache misses may dominate execution time, branch mispredictions may cause pipeline stalls, or memory bandwidth may limit throughput. Performance counters transform vague slowness into quantifiable bottlenecks.

RISC-V provides a Performance Monitoring Unit (PMU) through a set of hardware performance counters. These counters track events like cycles executed, instructions retired, cache hits and misses, branch predictions, and TLB accesses. The basic counters (cycle, instret, time) belong to the Zicntr extension, which the RVA application profiles require, and provide fundamental metrics. Hardware performance counters (mhpmcounter3-31) are optional in practice (an implementation may hardwire them to zero) and track implementation-specific events. Together, these counters enable profiling, bottleneck identification, and performance analysis.

This chapter explores RISC-V performance counters and the PMU. We’ll examine the counter architecture, basic counters, hardware performance counters, performance events, profiling techniques, and how RISC-V compares to ARM’s PMU.

16.1 Performance Counter Architecture

Performance Monitoring Overview

Performance monitoring answers questions like:

How many cycles did this function take?
What is the IPC (instructions per cycle)?
How many cache misses occurred?
How many branches were mispredicted?
Where is the performance bottleneck?

RISC-V performance counters provide hardware-based measurement with minimal overhead. Counters increment automatically on specific events, allowing precise measurement without software instrumentation.

Counter CSRs

RISC-V defines performance counter CSRs in three groups:

Machine-mode counters (M-mode only):

mcycle: Machine cycle counter
minstret: Machine instructions-retired counter
mhpmcounter3-31: Machine hardware performance counters (29 counters)

Supervisor/User-mode counters (readable from S/U-mode):

cycle: Cycle counter (shadow of mcycle)
instret: Instructions-retired counter (shadow of minstret)
hpmcounter3-31: Hardware performance counters (shadow of mhpmcounter)

Time counter:

time: Real-time counter (wall-clock time)

For RV32, each counter has a high-word CSR (e.g., mcycleh, cycleh) for 64-bit values.

Counter Privilege Levels

Counters are accessible based on privilege:

M-mode: Can read/write all counters (mcycle, minstret, mhpmcounter)
S-mode: Can read cycle, instret, hpmcounter (if enabled)
U-mode: Can read cycle, instret, hpmcounter (if enabled)

Access control via mcounteren and scounteren:

// M-mode: allow the next-lower privilege level (S-mode) to read cycle and instret
uint64_t mcounteren = (1 << 0) | (1 << 2);  // CY, IR
write_csr(mcounteren, mcounteren);

// S-mode: allow U-mode to read cycle and instret
// (U-mode access needs the bit set in both mcounteren and scounteren)
uint64_t scounteren = (1 << 0) | (1 << 2);
write_csr(scounteren, scounteren);
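
The combined effect of the two enable registers can be captured in a small software model. This is a sketch, not hardware access code: counter_readable is a hypothetical helper, and it assumes a hart that implements S-mode, so a U-mode read needs the bit set in both mcounteren and scounteren.

```c
#include <stdint.h>

enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Bit positions shared by mcounteren and scounteren. */
enum { COUNTER_CY = 0, COUNTER_TM = 1, COUNTER_IR = 2 /* 3..31 = HPMn */ };

/* Software model (not hardware access): may code at `priv` read counter
 * number `bit`? Assumes a hart that implements S-mode and bit < 32. */
int counter_readable(unsigned priv, unsigned bit,
                     uint32_t mcounteren, uint32_t scounteren) {
    int m_ok = (int)((mcounteren >> bit) & 1u);
    int s_ok = (int)((scounteren >> bit) & 1u);

    if (priv == PRIV_M) return 1;      /* M-mode is never restricted */
    if (priv == PRIV_S) return m_ok;   /* gated by mcounteren only   */
    return m_ok && s_ok;               /* U-mode needs both bits set */
}
```

A read that this model rejects would raise an illegal-instruction exception on real hardware, which is why a user program can fault on csrr cycle when firmware or the kernel has not enabled access.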

Counter Inhibit

Counters can be inhibited (stopped) via mcountinhibit:

// Stop cycle and instret counters
uint64_t mcountinhibit = (1 << 0) | (1 << 2);  // CY, IR
write_csr(mcountinhibit, mcountinhibit);

// Resume counters
write_csr(mcountinhibit, 0);

This is useful for:

Measuring specific code regions
Reducing power consumption
Preventing counter overflow

16.2 Basic Performance Counters

mcycle / cycle (Cycle Counter)

The cycle counter tracks the number of clock cycles executed by the hart:

// Read cycle counter
uint64_t start = read_csr(cycle);
// ... code to measure ...
uint64_t end = read_csr(cycle);
uint64_t cycles = end - start;

printf("Cycles: %llu\n", (unsigned long long)cycles);

For RV32, use cycleh for the high 32 bits:

// RV32: Read 64-bit cycle counter
uint64_t read_cycle_rv32(void) {
    uint32_t hi, lo, hi2;
    do {
        hi = read_csr(cycleh);
        lo = read_csr(cycle);
        hi2 = read_csr(cycleh);
    } while (hi != hi2);  // Retry if high word changed
    
    return ((uint64_t)hi << 32) | lo;
}
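
To see why the retry loop matters, the same hi-lo-hi pattern can be exercised on a host machine against a simulated counter. This is a sketch under stated assumptions: the CSR reads are replaced by callbacks, and fake_counter_t, fake_read_hi, and fake_read_lo are hypothetical stand-ins that advance the counter on every access so a rollover can be forced between reads.

```c
#include <stdint.h>

typedef uint32_t (*read32_fn)(void *ctx);

/* Same hi-lo-hi retry loop, with the CSR reads abstracted as callbacks. */
uint64_t read_split_counter(read32_fn read_hi, read32_fn read_lo, void *ctx) {
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi(ctx);
        lo  = read_lo(ctx);
        hi2 = read_hi(ctx);
    } while (hi != hi2);              /* low word wrapped: try again */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical stand-in for the hardware: advances by `step` per access. */
typedef struct {
    uint64_t value;
    uint64_t step;
} fake_counter_t;

uint32_t fake_read_hi(void *ctx) {
    fake_counter_t *c = (fake_counter_t *)ctx;
    c->value += c->step;
    return (uint32_t)(c->value >> 32);
}

uint32_t fake_read_lo(void *ctx) {
    fake_counter_t *c = (fake_counter_t *)ctx;
    c->value += c->step;
    return (uint32_t)c->value;
}
```

Starting the fake counter at 0xFFFFFFFE with a step of 1 makes the low word wrap between the first two reads. The loop detects the mismatch, retries once, and returns 0x100000003, whereas a naive hi-then-lo read would have combined a stale high word of 0 with a low word of 0.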

minstret / instret (Instructions Retired Counter)

The instructions-retired counter tracks the number of instructions completed:

// Read instret counter
uint64_t start = read_csr(instret);
// ... code to measure ...
uint64_t end = read_csr(instret);
uint64_t instructions = end - start;

printf("Instructions: %llu\n", (unsigned long long)instructions);

IPC Calculation

Combining cycle and instret gives IPC (instructions per cycle):

// Measure IPC
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);

// ... code to measure ...

uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);

uint64_t cycles = cycles_end - cycles_start;
uint64_t instructions = instret_end - instret_start;

double ipc = (double)instructions / cycles;
printf("IPC: %.2f\n", ipc);

IPC interpretation:

IPC close to 1: Good utilization on a single-issue in-order core
IPC > 1: Possible only on superscalar (multi-issue) cores, whether in-order or out-of-order
IPC < 1: Pipeline stalls (cache misses, branch mispredicts, etc.)
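
The IPC calculation is worth wrapping in a helper that guards against an empty measurement window. The following is a minimal sketch with hypothetical function names; it takes the two counter deltas as plain integers, so it can be checked on any machine, and it also reports CPI (cycles per instruction), the reciprocal view of the same ratio.

```c
#include <stdint.h>

/* IPC from two counter deltas; returns 0.0 when no cycles elapsed so the
 * caller never divides by zero. */
double compute_ipc(uint64_t instret_delta, uint64_t cycle_delta) {
    if (cycle_delta == 0) {
        return 0.0;
    }
    return (double)instret_delta / (double)cycle_delta;
}

/* CPI is the reciprocal view: average cycles spent per instruction. */
double compute_cpi(uint64_t instret_delta, uint64_t cycle_delta) {
    if (instret_delta == 0) {
        return 0.0;
    }
    return (double)cycle_delta / (double)instret_delta;
}
```

For example, 500,000 instructions retired over 1,000,000 cycles gives an IPC of 0.5 and a CPI of 2.0, which falls in the "pipeline stalls" category above.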

time (Real-Time Counter)

The time counter provides wall-clock time:

// Read time counter
uint64_t start_time = read_csr(time);
// ... code to measure ...
uint64_t end_time = read_csr(time);
uint64_t elapsed = end_time - start_time;

// Convert to microseconds (assuming 1 MHz time counter)
printf("Elapsed time: %llu us\n", (unsigned long long)elapsed);

The time counter frequency is platform-specific (typically 1 MHz or 10 MHz). It’s useful for:

Wall-clock timing
Timeout implementation
Real-time scheduling
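
Because the tick rate differs between platforms, converting time deltas to real units needs the timebase frequency as an input. The helper below is a sketch with a hypothetical name; it assumes the caller already knows the frequency (on Linux systems it is commonly published in the device tree as timebase-frequency).

```c
#include <stdint.h>

/* Convert `time` CSR ticks to microseconds for a given timebase frequency.
 * Splitting into whole seconds and a remainder avoids the 64-bit overflow
 * that ticks * 1000000 could hit on long intervals. */
uint64_t ticks_to_us(uint64_t ticks, uint64_t timebase_hz) {
    if (timebase_hz == 0) {
        return 0;                      /* unknown frequency: nothing sensible */
    }
    uint64_t whole_seconds = ticks / timebase_hz;
    uint64_t remainder     = ticks % timebase_hz;
    return whole_seconds * 1000000u + (remainder * 1000000u) / timebase_hz;
}
```

With a 10 MHz timebase, 10,000,000 ticks convert to 1,000,000 microseconds; fractional microseconds are truncated.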

Difference: cycle vs time

cycle: Counts CPU clock cycles (may stop during sleep, depending on the implementation; varies with frequency scaling)
time: Counts real time (continues during sleep, constant frequency)

// Example: Measure sleep overhead
uint64_t cycles_before = read_csr(cycle);
uint64_t time_before = read_csr(time);

wfi();  // Sleep until interrupt

uint64_t cycles_after = read_csr(cycle);
uint64_t time_after = read_csr(time);

printf("Cycles during sleep: %llu\n",
       (unsigned long long)(cycles_after - cycles_before));  // small if the clock is gated in WFI
printf("Time during sleep: %llu\n",
       (unsigned long long)(time_after - time_before));      // > 0

16.3 Hardware Performance Counters

mhpmcounter3-31 (Hardware Performance Counters)

RISC-V provides up to 29 hardware performance counters (HPM counters) for tracking implementation-specific events. These counters are optional—implementations may provide 0 to 29 counters.

Counter CSRs:

mhpmcounter3-31: M-mode counters (29 counters)
hpmcounter3-31: S/U-mode readable counters (shadows of mhpmcounter)
mhpmevent3-31: Event selection registers

Event Selection (mhpmevent CSRs)

Each HPM counter has an associated event selector:

// Configure mhpmcounter3 to count L1 I-cache misses
write_csr(mhpmevent3, EVENT_L1_ICACHE_MISS);

// Reset counter
write_csr(mhpmcounter3, 0);

// ... code to measure ...

// Read counter
uint64_t icache_misses = read_csr(mhpmcounter3);
printf("L1 I-cache misses: %llu\n", (unsigned long long)icache_misses);

Event codes are implementation-specific. Common events include:

Cache events (L1/L2 hits, misses)
Branch events (taken, not-taken, mispredicted)
Pipeline events (stalls, flushes)
Memory events (loads, stores, TLB misses)

Counter Overflow Handling

Counters are 64-bit and rarely overflow. If overflow is a concern:

// Compute a delta that stays correct across one wraparound
uint64_t start = read_csr(mhpmcounter3);
// ... code ...
uint64_t end = read_csr(mhpmcounter3);

// Unsigned subtraction is modulo 2^64, so this is right even if end < start
uint64_t count = end - start;

Overflow interrupts are standardized by the optional Sscofpmf extension; cores without it may offer an implementation-specific mechanism or none at all.

Example: Multi-Counter Measurement

Measuring multiple events simultaneously:

// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
write_csr(mhpmevent4, EVENT_L2_CACHE_MISS);
write_csr(mhpmevent5, EVENT_BRANCH_MISPREDICT);

// Reset counters
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);
write_csr(mhpmcounter5, 0);

// Measure code
uint64_t cycles_start = read_csr(cycle);
uint64_t instret_start = read_csr(instret);

// ... code to measure ...

uint64_t cycles_end = read_csr(cycle);
uint64_t instret_end = read_csr(instret);
uint64_t l1_misses = read_csr(mhpmcounter3);
uint64_t l2_misses = read_csr(mhpmcounter4);
uint64_t branch_mispredicts = read_csr(mhpmcounter5);

// Report
printf("Cycles: %llu\n", (unsigned long long)(cycles_end - cycles_start));
printf("Instructions: %llu\n", (unsigned long long)(instret_end - instret_start));
printf("L1 D-cache misses: %llu\n", (unsigned long long)l1_misses);
printf("L2 cache misses: %llu\n", (unsigned long long)l2_misses);
printf("Branch mispredicts: %llu\n", (unsigned long long)branch_mispredicts);

16.4 Performance Events

Cache Events

Cache events track memory hierarchy performance:

L1 Instruction Cache:

L1 I-cache access
L1 I-cache miss
L1 I-cache hit

L1 Data Cache:

L1 D-cache access
L1 D-cache miss
L1 D-cache hit
L1 D-cache writeback

L2 Cache:

L2 cache access
L2 cache miss
L2 cache hit

Example: Measure cache miss rate:

// Configure counters
write_csr(mhpmevent3, EVENT_L1_DCACHE_ACCESS);
write_csr(mhpmevent4, EVENT_L1_DCACHE_MISS);

// Reset and measure
write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);

// ... code ...

uint64_t accesses = read_csr(mhpmcounter3);
uint64_t misses = read_csr(mhpmcounter4);
double miss_rate = (double)misses / accesses * 100.0;

printf("L1 D-cache miss rate: %.2f%%\n", miss_rate);
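
Rate calculations like this one recur throughout the chapter (miss rates, prediction accuracy, stall shares), and each divides by a counter that can legitimately be zero. A small helper, sketched here with a hypothetical name, keeps that case from producing NaN in reports.

```c
#include <stdint.h>

/* part/total as a percentage; 0.0 if the denominator counter is zero. */
double counter_percent(uint64_t part, uint64_t total) {
    if (total == 0) {
        return 0.0;
    }
    return 100.0 * (double)part / (double)total;
}
```

For instance, 50,000 misses out of 1,000,000 accesses is a 5% miss rate, and a region with no accesses at all reports 0% instead of an undefined value.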

Branch Events

Branch events track control flow performance:

Branch Types:

Branch instructions executed
Branch taken
Branch not taken

Branch Prediction:

Branch mispredicted
Branch correctly predicted

Example: Measure branch prediction accuracy:

write_csr(mhpmevent3, EVENT_BRANCH_EXECUTED);
write_csr(mhpmevent4, EVENT_BRANCH_MISPREDICT);

write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);

// ... code with branches ...

uint64_t branches = read_csr(mhpmcounter3);
uint64_t mispredicts = read_csr(mhpmcounter4);
double accuracy = (1.0 - (double)mispredicts / branches) * 100.0;

printf("Branch prediction accuracy: %.2f%%\n", accuracy);

Pipeline Events

Pipeline events track execution efficiency:

Stalls:

Pipeline stall cycles
Load-use stall
Store buffer full stall

Flushes:

Pipeline flush (branch mispredict, exception)
I-cache flush
D-cache flush

Example: Identify stall sources:

write_csr(mhpmevent3, EVENT_PIPELINE_STALL);
write_csr(mhpmevent4, EVENT_LOAD_USE_STALL);

write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);

// ... code ...

uint64_t total_stalls = read_csr(mhpmcounter3);
uint64_t load_use_stalls = read_csr(mhpmcounter4);

printf("Total stall cycles: %llu\n", (unsigned long long)total_stalls);
printf("Load-use stalls: %llu (%.1f%%)\n",
       (unsigned long long)load_use_stalls,
       (double)load_use_stalls / total_stalls * 100.0);

Memory Events

Memory events track memory system activity:

Memory Operations:

Load instructions
Store instructions
Atomic instructions

TLB Events:

TLB access
TLB miss (I-TLB, D-TLB)
Page table walk

Example: Measure TLB performance:

write_csr(mhpmevent3, EVENT_DTLB_ACCESS);
write_csr(mhpmevent4, EVENT_DTLB_MISS);

write_csr(mhpmcounter3, 0);
write_csr(mhpmcounter4, 0);

// ... code with memory accesses ...

uint64_t tlb_accesses = read_csr(mhpmcounter3);
uint64_t tlb_misses = read_csr(mhpmcounter4);
double tlb_miss_rate = (double)tlb_misses / tlb_accesses * 100.0;

printf("D-TLB miss rate: %.2f%%\n", tlb_miss_rate);

16.5 Profiling and Analysis

perf Tool for RISC-V

The Linux perf tool supports RISC-V performance counters:

# Count cycles and instructions
perf stat -e cycles,instructions ./my_program

# Sample on cycles (profiling)
perf record -e cycles ./my_program
perf report

# Count cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program

# Count branch mispredictions
perf stat -e branch-misses,branches ./my_program

PMU Programming

M-mode PMU programming (an S-mode kernel such as Linux reaches these CSRs through firmware, typically via the SBI PMU extension):

// M-mode firmware: configure PMU for profiling
void setup_pmu_profiling(void) {
    // Expose cycle, time, and instret to the next-lower privilege level
    write_csr(mcounteren, 0x7);  // CY, TM, IR

    // Configure HPM counter for L1 D-cache misses
    write_csr(mhpmevent3, EVENT_L1_DCACHE_MISS);
    write_csr(mhpmcounter3, 0);

    // Also expose HPM3 to the next-lower privilege level
    uint64_t mcounteren = read_csr(mcounteren);
    mcounteren |= (1 << 3);  // HPM3
    write_csr(mcounteren, mcounteren);
}

Event Sampling

Sampling-based profiling collects periodic samples:

// Pseudo-code: Sample-based profiling
void pmu_interrupt_handler(void) {
    // Read PC where interrupt occurred
    uint64_t pc = read_csr(mepc);

    // Record sample
    record_sample(pc);

    // Reset counter for next sample
    write_csr(mhpmcounter3, -SAMPLE_PERIOD);
}

// Setup sampling
void setup_sampling(void) {
    // Configure counter to overflow after SAMPLE_PERIOD events
    write_csr(mhpmevent3, EVENT_CYCLES);
    write_csr(mhpmcounter3, -SAMPLE_PERIOD);

    // Enable overflow interrupt (implementation-specific)
    enable_pmu_interrupt();
}
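
Two details of the sampling scheme can be checked in plain C: the preload value that makes the counter wrap after a chosen number of events, and a place to accumulate the sampled PCs. The sketch below is illustrative only. All names are hypothetical, and the histogram (1024 buckets of 64 bytes each) is an assumed layout standing in for whatever record_sample does in a real profiler.

```c
#include <stdint.h>

/* Value to preload so the counter wraps to zero after `period` events.
 * Unsigned negation is well defined: it yields 2^64 - period. */
uint64_t sample_preload(uint64_t period) {
    return (uint64_t)0 - period;
}

/* Events remaining before a counter holding `value` wraps to zero. */
uint64_t events_until_overflow(uint64_t value) {
    return (uint64_t)0 - value;
}

/* Hypothetical profile store: one hit count per 64-byte code bucket. */
#define PROFILE_BUCKETS 1024u
#define BUCKET_SHIFT    6

typedef struct {
    uint64_t base;                      /* lowest profiled address */
    uint32_t hits[PROFILE_BUCKETS];
} profile_t;

/* Record one sampled PC; samples outside the profiled window are dropped. */
int profile_record(profile_t *p, uint64_t pc) {
    uint64_t index;
    if (pc < p->base) {
        return 0;
    }
    index = (pc - p->base) >> BUCKET_SHIFT;
    if (index >= PROFILE_BUCKETS) {
        return 0;
    }
    p->hits[index]++;
    return 1;
}
```

Buckets with the highest hit counts mark the hot code; a shorter sample period gives finer resolution at the cost of more interrupt overhead.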

Performance Analysis Techniques

Top-down analysis:

Measure overall IPC
If IPC is low, identify bottleneck:
- Cache misses? → Optimize data layout
- Branch mispredicts? → Improve branch predictability
- Pipeline stalls? → Reduce dependencies

Hotspot analysis:

Use sampling to find hot functions
Measure counters for hot functions
Optimize based on counter data

Comparative analysis:

Measure before optimization
Apply optimization
Measure after optimization
Compare counter values

Example workflow:

# Before optimization
perf stat -e cycles,instructions,L1-dcache-load-misses ./program
# Cycles: 1000000, Instructions: 500000, IPC: 0.5, Misses: 50000

# After optimization (improved data locality)
perf stat -e cycles,instructions,L1-dcache-load-misses ./program_opt
# Cycles: 600000, Instructions: 500000, IPC: 0.83, Misses: 10000
# Result: 40% fewer cycles (1.67x speedup), 80% reduction in cache misses
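
When comparing two runs, it helps to keep "percentage reduction" and "speedup factor" apart, since they are different numbers for the same data. The helpers below are a sketch with hypothetical names, applied to the before and after counts from the workflow above.

```c
#include <stdint.h>

/* Speedup factor: how many times faster the optimized run is. */
double speedup_factor(uint64_t cycles_before, uint64_t cycles_after) {
    if (cycles_after == 0) {
        return 0.0;
    }
    return (double)cycles_before / (double)cycles_after;
}

/* Percentage by which a counter dropped between two runs. */
double percent_reduction(uint64_t before, uint64_t after) {
    if (before == 0 || after > before) {
        return 0.0;
    }
    return 100.0 * (double)(before - after) / (double)before;
}
```

Going from 1,000,000 to 600,000 cycles is a 40% reduction in cycles but a speedup factor of about 1.67, and going from 50,000 to 10,000 misses is an 80% reduction.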

16.6 Comparison with ARM PMU

RISC-V Counters vs ARM PMU

ARM provides a Performance Monitoring Unit (PMU) with similar capabilities. Comparison:

Feature	RISC-V PMU	ARM PMU
Basic Counters	cycle, instret, time	PMCCNTR (cycle), no instret
HPM Counters	mhpmcounter3-31 (up to 29)	PMEVCNTRn (typically 6-8)
Event Selection	mhpmevent3-31	PMEVTYPER (event type)
Counter Width	64-bit	32-bit or 64-bit (ARMv8)
Overflow	Sscofpmf extension (optional), else implementation-specific	Overflow interrupt (PMOVSCLR)
Access Control	mcounteren, scounteren	PMUSERENR (user enable)
Counter Inhibit	mcountinhibit	PMCNTENSET/CLR (enable/disable)
Privilege Levels	M/S/U modes	EL0/EL1/EL2/EL3

Event Mapping

Common events mapped between architectures:

Event	RISC-V	ARM
Cycles	cycle CSR	PMCCNTR
Instructions	instret CSR	0x08 (INST_RETIRED, via an event counter)
L1 I-cache miss	Implementation-specific	0x01
L1 D-cache miss	Implementation-specific	0x03
L2 cache miss	Implementation-specific	0x17
Branch mispredict	Implementation-specific	0x10
Branch executed	Implementation-specific	0x0C
TLB miss	Implementation-specific	0x02 (L1 I-TLB refill), 0x05 (L1 D-TLB refill)

ARM event codes are standardized (ARM Architecture Reference Manual), while RISC-V event codes are implementation-specific.

Profiling Tool Comparison

Both architectures support standard profiling tools:

RISC-V:

# perf on RISC-V Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report

ARM:

# perf on ARM Linux
perf stat -e cycles,instructions,cache-misses ./program
perf record -e cycles -g ./program
perf report

The perf tool abstracts architecture differences, providing a consistent interface.

Practical Differences

RISC-V advantages:

64-bit counters (no overflow on long runs)
Dedicated instret counter (ARM counts instructions with a programmable event counter)
Up to 29 HPM counters (ARM typically 6-8)
Simpler privilege model

ARM advantages:

Standardized event codes (portable across implementations)
Mature PMU infrastructure
Overflow interrupts (standard)
Extensive tool support

Example: Measuring IPC

RISC-V:

uint64_t cycles = read_csr(cycle);
uint64_t instret = read_csr(instret);
double ipc = (double)instret / cycles;

ARM (requires a programmable event counter):

uint64_t cycles = read_pmccntr();
// No instret equivalent—must use PMU event counter
uint64_t instret = read_pmevcntr(0);  // Configured for instruction count
double ipc = (double)instret / cycles;

RISC-V’s dedicated instret counter simplifies IPC measurement.

Implementation Examples

RISC-V:

SiFive U74: 2 HPM counters (L1 cache events)
SiFive P550: 6 HPM counters (cache, branch, TLB events)
Alibaba XuanTie C910: 4 HPM counters

ARM:

Cortex-A53: 6 PMU counters
Cortex-A72: 6 PMU counters
Cortex-A76: 6 PMU counters
Neoverse N1: 6 PMU counters

RISC-V implementations vary widely in HPM counter count. ARM implementations are more consistent (typically 6 counters).

🛠️ Hands-on Lab: Lab 16.1 — The CPU’s EKG (Measuring IPC)

This lab demonstrates how to read hardware performance counters and calculate IPC.

⚠️ Important Warning: In QEMU TCG mode or Spike, cycle usually just follows instret (IPC ≈ 1), which doesn’t reflect real hardware pipeline behavior. Run on real hardware to observe significant differences.

Lab Objectives

Implement C functions to read cycle and instret
Design two workloads: High dependency (low IPC) vs High parallelism (high IPC)
Calculate and print IPC

Code (pmu_lab.c)

#include <stdio.h>
#include <stdint.h>

// ---------------------------------------------------------
// Helper Functions: Read CSRs
// ---------------------------------------------------------
static inline uint64_t read_cycle() {
    uint64_t val;
    asm volatile ("csrr %0, cycle" : "=r" (val));
    return val;
}

static inline uint64_t read_instret() {
    uint64_t val;
    asm volatile ("csrr %0, instret" : "=r" (val));
    return val;
}

// ---------------------------------------------------------
// Workload 1: High Dependency (Low IPC)
// ---------------------------------------------------------
void workload_dependency(int iters) {
    volatile int a = 1;
    for (int i = 0; i < iters; i++) {
        // Each add must wait for previous to complete
        asm volatile (
            "add %0, %0, %0 \n"
            "add %0, %0, %0 \n"
            "add %0, %0, %0 \n"
            : "+r" (a)
        );
    }
}

// ---------------------------------------------------------
// Workload 2: Independent (High IPC)
// ---------------------------------------------------------
void workload_independent(int iters) {
    volatile int a = 1, b = 2, c = 3;
    for (int i = 0; i < iters; i++) {
        // Instructions are independent, CPU can issue simultaneously
        asm volatile (
            "add %0, %0, %0 \n"
            "add %1, %1, %1 \n"
            "add %2, %2, %2 \n"
            : "+r" (a), "+r" (b), "+r" (c)
        );
    }
}

// ---------------------------------------------------------
// Measurement Function
// ---------------------------------------------------------
void measure(const char* name, void (*func)(int), int iters) {
    uint64_t start_c = read_cycle();
    uint64_t start_i = read_instret();

    func(iters);

    uint64_t end_c = read_cycle();
    uint64_t end_i = read_instret();

    uint64_t delta_c = end_c - start_c;
    uint64_t delta_i = end_i - start_i;
    double ipc = (double)delta_i / delta_c;

    printf("[%s]\n", name);
    printf("  Cycles : %lu\n", delta_c);
    printf("  Instrs : %lu\n", delta_i);
    printf("  IPC    : %.2f\n\n", ipc);
}

int main() {
    printf("=== RISC-V PMU Demo ===\n");
    printf("Warning: On QEMU/Spike, IPC is simulated as ~1.0\n");
    printf("Run on real hardware for accurate results.\n\n");

    int iters = 100000;
    measure("Dependent Workload", workload_dependency, iters);
    measure("Independent Workload", workload_independent, iters);

    return 0;
}

Compile and Run

# Compile
riscv64-unknown-elf-gcc -O0 -o pmu_lab pmu_lab.c

# Run on Spike (simulated, IPC ≈ 1)
spike pk pmu_lab

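# On a real superscalar core, expect roughly (varies by microarchitecture):
# - Dependent Workload: IPC ≈ 0.3-0.5 (stalls)
# - Independent Workload: IPC ≈ 1.5-2.0 (parallel)
# Note: at -O0 the loop overhead and volatile loads/stores add instructions,
# so measured counts exceed three instructions per iteration.

Expected Output (Real Hardware, Illustrative)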

=== RISC-V PMU Demo ===
Warning: On QEMU/Spike, IPC is simulated as ~1.0
Run on real hardware for accurate results.

[Dependent Workload]
  Cycles : 1200000
  Instrs : 400000
  IPC    : 0.33

[Independent Workload]
  Cycles : 240000
  Instrs : 400000
  IPC    : 1.67

danieRTOS Reference: danieRTOS uses cycle counters in its scheduler to measure context switch overhead and task execution time.

⚠️ Common Pitfalls

Pitfall 1: Higher IPC = Faster Program?

Misconception: The optimization goal is to maximize IPC.

Truth: High IPC doesn’t necessarily mean fast programs.

// Super high IPC, but does no useful work
for (int i = 0; i < 1000000; i++) {
    asm volatile ("nop");  // IPC might approach 4.0!
}

// Lower IPC, but actually doing computation
for (int i = 0; i < 1000000; i++) {
    result += array[i];    // IPC might only be 0.5
}

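💡 Correct Understanding: What matters is useful work per unit of time. IPC (instret / cycle) and instruction throughput (instret / time) are only proxies for performance, and they are meaningful only when the retired instructions do useful work.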
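The relationship between instruction count, IPC, and run time can be made concrete with a small calculation. The sketch below is only an illustration: the function names are hypothetical, and it assumes a fixed clock frequency and an average IPC over the whole run.

```c
#include <stdint.h>

/* Execution time in seconds: cycles / frequency, where cycles = instret / IPC.
 * Hypothetical helper; assumes a constant clock and an average IPC. */
double exec_time_s(uint64_t instret, double ipc, double clock_hz) {
    double cycles = (double)instret / ipc;
    return cycles / clock_hz;
}

/* Returns 1 if variant A finishes sooner than variant B on the same clock. */
int a_is_faster(uint64_t instret_a, double ipc_a,
                uint64_t instret_b, double ipc_b, double clock_hz) {
    return exec_time_s(instret_a, ipc_a, clock_hz) <
           exec_time_s(instret_b, ipc_b, clock_hz);
}
```

For example, a variant that retires 4,000,000 instructions at IPC 2.0 needs 2,000,000 cycles, while a variant that retires 1,000,000 instructions at IPC 0.8 needs only 1,250,000 cycles. The lower-IPC variant wins because it executes fewer instructions.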

Pitfall 2: Ignoring Counter Overflow

Error Scenario: After long execution, a 32-bit counter wraps around, producing negative or nonsensical deltas.

Solution: Use 64-bit counters (RV64) or correctly handle 32-bit counter overflow.

// RV32: Need to read cycleh (high 32 bits)
uint64_t read_cycle_rv32() {
    uint32_t lo, hi1, hi2;
    do {
        hi1 = read_csr(cycleh);
        lo  = read_csr(cycle);
        hi2 = read_csr(cycleh);
    } while (hi1 != hi2);  // Guard against overflow during read
    return ((uint64_t)hi1 << 32) | lo;
}
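When only the low 32 bits are sampled, the delta can still be computed safely as long as fewer than 2^32 ticks elapse between samples. A minimal sketch, with hypothetical helper names, relying on C's modulo-2^32 unsigned arithmetic:

```c
#include <stdint.h>

/* Elapsed ticks between two 32-bit counter samples.
 * Unsigned subtraction wraps modulo 2^32, so a single wraparound
 * between the samples still yields the correct difference. */
uint32_t delta32(uint32_t start, uint32_t end) {
    return end - start;
}

/* Combine separately read high and low halves into one 64-bit value. */
uint64_t combine64(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}
```

The key design point is to keep the samples in unsigned types. Storing them in signed integers is what produces the "negative" results described above.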

Pitfall 3: Confusing cycle and time

Error Scenario: Using cycle to measure sleep time.

Truth:

CSR	Behavior
`cycle`	Tracks clock cycles of the hart; may stop during WFI if the clock is gated (implementation-dependent)
`time`	Tracks real time, continues during WFI

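// ❌ Wrong: cycle counts clock cycles, not wall-clock time, and it may
// stop entirely if the implementation gates the clock during WFI
start = read_cycle();
wfi();  // Wait for interrupt
end = read_cycle();
sleep_time = end - start;  // May be nearly 0 on clock-gating cores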

// ✅ Correct: Use time for sleep measurement
start = read_time();
wfi();
end = read_time();
sleep_time = end - start;  // Correctly reflects wait time

Summary

Performance counters and the Performance Monitoring Unit enable quantitative performance analysis. This chapter explored RISC-V’s counter architecture and how it compares to ARM’s mature PMU infrastructure.

Performance counter architecture provides hardware-based measurement with minimal overhead. Counter CSRs exist at multiple privilege levels—machine-mode counters (mcycle, minstret, mhpmcounter) and supervisor/user-mode readable shadows (cycle, instret, hpmcounter). Access control through mcounteren and scounteren enables selective counter exposure to lower privilege levels. Counter inhibit via mcountinhibit allows stopping counters to measure specific code regions or reduce power consumption.

Basic performance counters provide fundamental metrics. The cycle counter tracks clock cycles executed by the hart. The instret counter tracks instructions retired (completed). The time counter provides wall-clock time at a constant frequency. Together, cycle and instret enable IPC calculation, a key performance metric. The difference between cycle (which may stop while the hart sleeps, depending on the implementation) and time (which keeps running) enables measuring sleep overhead and real-time intervals.

Hardware performance counters track implementation-specific events through mhpmcounter3-31 (up to 29 counters). Event selection via mhpmevent CSRs configures what each counter tracks. Counters are 64-bit, minimizing overflow concerns. Multiple counters can measure different events simultaneously, enabling comprehensive performance characterization. Counter overflow handling is implementation-specific, with some implementations supporting overflow interrupts.

Performance events cover the full spectrum of microarchitectural activity. Cache events track L1 instruction cache, L1 data cache, and L2 cache hits and misses, revealing memory hierarchy performance. Branch events track branch execution and prediction accuracy, identifying control flow bottlenecks. Pipeline events track stalls and flushes, showing execution efficiency. Memory events track loads, stores, and TLB performance, revealing memory system behavior.

Profiling and analysis leverage performance counters for optimization. The Linux perf tool provides a standard interface to RISC-V counters for counting events and sampling-based profiling. PMU programming in the kernel configures counters and enables user-mode access. Event sampling collects periodic samples to identify hot code regions. Performance analysis techniques include top-down analysis (identify bottleneck category), hotspot analysis (find hot functions), and comparative analysis (measure optimization impact).

Comparison with ARM shows both similarities and differences. ARM’s PMU provides similar functionality with a cycle counter and multiple event counters. ARM standardizes event codes across implementations, while RISC-V leaves them implementation-specific. RISC-V provides 64-bit counters and a dedicated instret counter, simplifying IPC measurement. ARM provides standardized overflow interrupts. Both architectures support the perf tool, providing a consistent user experience. RISC-V allows up to 29 HPM counters, while ARM implementations typically provide 6-8 counters.

Together, RISC-V’s performance counters enable effective performance measurement, profiling, and optimization across the full range from embedded systems to high-performance processors.

Chapter 17. RISC-V vs ARM vs MIPS — A Systematic Comparison

Part X — RISC-V vs Other Architectures

🎯 Learning Objectives

After reading this chapter, you will be able to:

Compare Architectures Multi-dimensionally: Analyze architectural differences from licensing, ecosystem, and technical debt perspectives
Understand RISC-V’s Rise: Grasp why RISC-V is called “MIPS 2.0” yet succeeded where MIPS struggled
Make Technology Choices: Select appropriate architecture based on project needs (IoT vs Mobile vs Server)

💡 Scenario: War at the Round Table

Scene: The whiteboard in the conference room displays three large words: ARM vs RISC-V vs MIPS. The atmosphere is tense.

Architect: “Everyone, the specs for our new product line ‘Project X’ are finalized. We need this chip to run AI acceleration, with extremely low power consumption, and most importantly—the BOM cost is being squeezed hard. Today we must decide on the core architecture.”

Senior (ARM Advocate): “Architect, for safety’s sake, I still recommend the ARM Cortex-M series. Although licensing fees are expensive and we pay per-chip royalties, the toolchain is mature—everyone knows Keil and IAR. If we choose a new architecture to save money and the software team goes crazy debugging, delaying time-to-market, the losses will be even greater.”

Junior (Newcomer): “But Senior, I heard RISC-V is free? Wouldn’t our profit margins be much higher?”

Architect: “Junior, to be precise, the RISC-V ISA (specification) is free, but good IP (design) still costs money (like SiFive or Andes), though usually without royalties. For high-volume products like ours, this can indeed save a huge amount in ‘toll fees.’”

Professor (Consultant): “And it’s not just about money. Senior, have you considered technical flexibility? ARM’s instruction set is closed. If we want to add a few special instructions for our AI algorithm, will ARM listen to us? But with RISC-V, we can use Custom Extensions to add our own instructions—performance might improve tenfold.”

Senior: “Professor makes a good point, but Custom Extensions have risks too. If we add random instructions ourselves, GCC and LLVM won’t recognize them. Wouldn’t we need to maintain our own compiler team?”

Architect: “That’s the trade-off. Let me summarize:

Consideration	ARM	RISC-V	Decision Impact
Licensing Cost	High (Per-chip Royalty)	Low (No Royalty)	RISC-V saves money at high volume
Ecosystem Maturity	High (20+ years)	Medium (Growing fast)	ARM safer short-term, RISC-V has long-term potential
Customization Flexibility	Low (Requires negotiation)	High (Custom Ext.)	RISC-V advantage for AI/crypto acceleration
Software Tools	Mature	Improving	ARM temporarily leads in debug experience
”

Junior: “What about MIPS? I remember university textbooks all taught MIPS?”

Professor: “MIPS is a classic and contributed greatly to education. But its business model had problems: licensing was expensive, the IP company changed hands multiple times, and the ecosystem withered. RISC-V inherits MIPS’s spirit in many ways (clean RISC design) but learned the lesson: use an open, royalty-free standard to avoid patent hell.”

Architect: “Alright, our conclusion: for short-term projects requiring stable time-to-market, choose ARM; for long-term strategy needing customization and cost-consciousness, choose RISC-V. As for MIPS, unless maintaining legacy product lines, not recommended for new projects.”

Architecture choice shapes everything. The instruction set determines how software expresses computation, how hardware implements execution, and how ecosystems develop around the platform. RISC-V enters a landscape dominated by ARM in mobile and embedded systems, and historically influenced by MIPS in education and networking. Understanding how these architectures compare reveals RISC-V’s design decisions, trade-offs, and competitive position.

This chapter provides a systematic comparison of RISC-V, ARM, and MIPS across eleven dimensions: ISA design philosophy, instruction set complexity, register architecture, exception and interrupt models, memory models, virtual memory, interrupt architecture, calling conventions, pipeline and microarchitecture, ecosystem and licensing, and future directions. Each section examines how the architectures approach the same problem, highlighting similarities, differences, and the implications for software and hardware.

RISC-V represents a modern, modular, open approach. ARM represents a comprehensive, evolving, commercial approach. MIPS represents classic RISC simplicity with historical commercial roots. Together, they illustrate the spectrum of architectural design choices.

17.1 ISA Design Philosophy

RISC-V: Modular and Extensible

RISC-V’s philosophy emphasizes modularity and extensibility:

Base ISA: Minimal, frozen foundation (RV32I, RV64I; RV128I is defined but not yet frozen)
Standard extensions: Optional, composable modules (M, A, F, D, C, V, etc.)
Custom extensions: Vendor-specific additions without ISA fragmentation
Clean slate: No legacy baggage, designed from first principles

This modular approach allows:

Tiny microcontrollers (RV32I only)
Application processors (RV64IMAFDCV)
Custom accelerators (base + custom extensions)
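The composability of single-letter extensions can be illustrated by turning an ISA string into a bitmask, in the same spirit as the misa CSR, where bit (letter - 'A') marks an extension. This is a sketch with hypothetical function names. It assumes a lowercase or uppercase string without version numbers, multi-letter extensions separated by an underscore, and no expansion of the "g" shorthand.

```c
#include <stdint.h>
#include <ctype.h>
#include <string.h>

/* Build a misa-style mask: bit (letter - 'a') is set for each single-letter
 * extension. Parsing stops at the first '_' so multi-letter names
 * (assumed to follow an underscore) are ignored. */
uint32_t isa_ext_mask(const char *isa) {
    uint32_t mask = 0;
    if (strncmp(isa, "rv32", 4) != 0 && strncmp(isa, "rv64", 4) != 0)
        return 0;                       /* prefix not recognized by this sketch */
    for (const char *p = isa + 4; *p != '\0' && *p != '_'; p++) {
        int c = tolower((unsigned char)*p);
        if (c >= 'a' && c <= 'z')
            mask |= UINT32_C(1) << (c - 'a');
    }
    return mask;
}

/* Test whether a single-letter extension is present in the mask. */
int isa_has(uint32_t mask, char ext) {
    int c = tolower((unsigned char)ext);
    if (c < 'a' || c > 'z')
        return 0;
    return (int)((mask >> (c - 'a')) & 1u);
}
```

With this helper, "rv32i" yields a mask with only bit 8 set, while "rv64imafdcv" additionally sets the bits for M, A, F, D, C, and V.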

ARM: Comprehensive and Evolving

ARM’s philosophy emphasizes comprehensiveness and evolution:

Comprehensive ISA: Rich instruction set covering many use cases
Profiles: Different ISA subsets (A-profile, R-profile, M-profile)
Backward compatibility: New versions extend, rarely remove features
Market-driven: Features added based on market needs

ARM profiles:

A-profile: Application processors (ARMv8-A, ARMv9-A)
R-profile: Real-time processors (ARMv8-R)
M-profile: Microcontrollers (ARMv7-M, ARMv8-M, ARMv8.1-M)

MIPS: Classic RISC Simplicity

MIPS’s philosophy emphasizes classic RISC principles:

Simple, regular ISA: Load-store architecture, fixed-length instructions
Delayed branches: Expose pipeline to software (MIPS I-IV)
Coprocessor model: Floating-point and other functions as coprocessors
Minimal complexity: Keep hardware simple, let software handle complexity

MIPS influenced RISC-V’s design, but RISC-V modernized many aspects (no delayed branches, cleaner privilege model).

Design Trade-offs

Aspect	RISC-V	ARM	MIPS
Modularity	High (base + extensions)	Medium (profiles)	Low (monolithic)
Extensibility	High (custom extensions)	Low (vendor-specific)	Low (proprietary)
Backward Compatibility	High (frozen base)	High (evolutionary)	Medium (versions)
Complexity	Low to medium	Medium to high	Low
Flexibility	High	Medium	Low

17.2 Instruction Set Complexity

Instruction Count Comparison

Approximate instruction counts:

Architecture	Base Instructions	With Extensions	Total (typical)
RISC-V RV32I	47	+M(8), +A(11), +F(26), +D(26), +C(~40)	~150-200
RISC-V RV64I	59	+M(8), +A(11), +F(26), +D(26), +C(~40)	~170-220
ARM ARMv8-A	~500 base	+NEON, +SVE, +crypto	~1000+
MIPS32	~100 base	+FPU, +DSP	~200-300

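RISC-V has the smallest base ISA. ARM has the largest instruction set. MIPS falls in between. The count of 47 for RV32I follows the older convention that includes FENCE.I and the six CSR instructions; current specifications move those into Zifencei and Zicsr, leaving 40 base instructions.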

Encoding Formats

RISC-V: 6 base formats (R, I, S, B, U, J) + compressed (CR, CI, CSS, CIW, CL, CS, CA, CB, CJ)

R-type: [funct7|rs2|rs1|funct3|rd|opcode]
I-type: [imm[11:0]|rs1|funct3|rd|opcode]
S-type: [imm[11:5]|rs2|rs1|funct3|imm[4:0]|opcode]
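Because the register fields sit at fixed bit positions, decoding an R-type word takes only shifts and masks. A minimal sketch follows; the struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t opcode, rd, funct3, rs1, rs2, funct7;
} rv_rtype;

/* Split a 32-bit R-type instruction word into its fields. */
rv_rtype rv_decode_r(uint32_t insn) {
    rv_rtype f;
    f.opcode = insn & 0x7F;          /* bits  6:0  */
    f.rd     = (insn >> 7)  & 0x1F;  /* bits 11:7  */
    f.funct3 = (insn >> 12) & 0x07;  /* bits 14:12 */
    f.rs1    = (insn >> 15) & 0x1F;  /* bits 19:15 */
    f.rs2    = (insn >> 20) & 0x1F;  /* bits 24:20 */
    f.funct7 = (insn >> 25) & 0x7F;  /* bits 31:25 */
    return f;
}
```

For instance, the word 0x002081B3 encodes add x3, x1, x2: opcode 0x33, rd = 3, rs1 = 1, rs2 = 2, with funct3 and funct7 both zero.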

ARM (classic 32-bit A32 encodings shown): Multiple formats (data processing, load/store, branch, etc.)

Data processing: [cond|00|I|opcode|S|Rn|Rd|operand2]
Load/store:      [cond|01|I|P|U|B|W|L|Rn|Rd|offset]

MIPS: 3 formats (R, I, J)

R-type: [opcode|rs|rt|rd|shamt|funct]
I-type: [opcode|rs|rt|immediate]
J-type: [opcode|address]
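The J-type format is the least obvious of the three, because the 26-bit field is not a complete address. The sketch below (hypothetical helper names) shows how a j/jal target is formed from the field and the upper bits of the delay-slot address.

```c
#include <stdint.h>

/* Target of a MIPS J-type jump (j/jal): the 26-bit field is shifted left
 * by 2 and combined with the top 4 bits of the delay-slot address (pc + 4). */
uint32_t mips_jump_target(uint32_t pc, uint32_t insn) {
    uint32_t field = insn & 0x03FFFFFFu;
    return ((pc + 4u) & 0xF0000000u) | (field << 2);
}

/* The major opcode in bits 31:26 selects the format (0 means R-type). */
unsigned mips_opcode(uint32_t insn) {
    return insn >> 26;
}
```

A consequence of this encoding is that j and jal can only reach targets inside the current 256 MB region; longer jumps need jr or jalr with a full register address.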

RISC-V and MIPS have simpler, more regular encodings than ARM.

Addressing Modes

RISC-V:

Register: add rd, rs1, rs2
Immediate: addi rd, rs1, imm
Base+offset: lw rd, offset(rs1)

ARM:

Register: ADD Rd, Rn, Rm
Immediate: ADD Rd, Rn, #imm
Base+offset: LDR Rd, [Rn, #offset]
Base+register: LDR Rd, [Rn, Rm]
Pre/post-indexed: LDR Rd, [Rn, #offset]! or LDR Rd, [Rn], #offset

MIPS:

Register: add $rd, $rs, $rt
Immediate: addi $rt, $rs, imm
Base+offset: lw $rt, offset($rs)

ARM has the most addressing modes. RISC-V and MIPS are simpler.
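RISC-V's single base+offset mode can be sketched in a few lines: the 12-bit I-type immediate is sign-extended and added to the base register. The function names below are hypothetical, and the example covers only I-type loads.

```c
#include <stdint.h>

/* Sign-extend the 12-bit I-type immediate held in bits 31:20. */
int32_t rv_imm_i(uint32_t insn) {
    int32_t imm = (int32_t)(insn >> 20);      /* raw value 0..4095 */
    return (imm & 0x800) ? imm - 4096 : imm;  /* bit 11 is the sign bit */
}

/* Effective address of an I-type load: base register value + immediate. */
uint64_t rv_effective_addr(uint64_t base, uint32_t insn) {
    return base + (uint64_t)(int64_t)rv_imm_i(insn);
}
```

The word 0xFFC12503 encodes lw x10, -4(x2), so with x2 = 0x1000 the load reads address 0xFFC. Anything ARM would express with a register offset or pre/post-indexing takes an extra add instruction on RISC-V and MIPS.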

ISA Complexity Metrics

Metric	RISC-V	ARM	MIPS
Instruction formats	6 base + 9 compressed	10+	3
Addressing modes	3	10+	3
Conditional execution	Branch only	Most instructions (ARMv7), limited (ARMv8)	Branch only
Instruction length	32-bit (16-bit compressed)	32-bit (16-bit Thumb)	32-bit
Regularity	High	Medium	High

RISC-V and MIPS prioritize regularity. ARM prioritizes expressiveness.

17.3 Register Architecture

RISC-V: x0-x31 (x0 = zero)

RISC-V provides 32 general-purpose registers:

x0: Hardwired zero (reads as 0, writes ignored)
x1-x31: General-purpose registers
f0-f31: Floating-point registers (F/D extensions)

Special conventions:

x1 (ra): Return address
x2 (sp): Stack pointer
x8 (s0/fp): Frame pointer

ARM: X0-X30 + XZR + SP

ARM ARMv8-A provides 31 general-purpose registers + special registers:

X0-X30: General-purpose registers (64-bit)
W0-W30: 32-bit views of X0-X30
XZR: Zero register (register encoding 31; reads as 0, writes ignored)
SP: Stack pointer (not a general-purpose register; shares encoding 31 with XZR, selected by the instruction)
PC: Program counter (not directly accessible in ARMv8)

ARM has 31 general registers plus XZR, while RISC-V has 32 including x0, so both offer 31 writable general-purpose registers.

MIPS: $0-$31 ($0 = zero)

MIPS provides 32 general-purpose registers:

$0 ($zero): Hardwired zero
$1-$31: General-purpose registers

Special conventions:

$31 ($ra): Return address
$29 ($sp): Stack pointer
$30 ($fp): Frame pointer

MIPS and RISC-V have identical register counts and zero register concept.

Special-Purpose Registers

RISC-V: CSRs (Control and Status Registers)

Accessed via csrr, csrw, csrrw, etc.
Examples: mstatus, mtvec, mepc, mcause

ARM: System registers

Accessed via MRS, MSR instructions
Examples: SCTLR_EL1, VBAR_EL1, ELR_EL1, ESR_EL1

MIPS: Coprocessor 0 registers

Accessed via mfc0, mtc0 instructions
Examples: Status, Cause, EPC, BadVAddr

All three use separate namespaces for system registers.

17.4 Exception and Interrupt Models

RISC-V: Trap Model (M/S/U Modes)

RISC-V uses a unified trap model:

Privilege modes: M (Machine), S (Supervisor), U (User)
Trap types: Exceptions (synchronous), Interrupts (asynchronous)
Trap vector: mtvec (M-mode), stvec (S-mode)
Trap cause: mcause (M-mode), scause (S-mode)
Trap PC: mepc (M-mode), sepc (S-mode)

Trap handling:

# M-mode trap handler
trap_handler:
    csrr t0, mcause      # Read cause
    csrr t1, mepc        # Read PC
    # Handle trap
    mret                 # Return from trap

ARM: Exception Levels (EL0-EL3)

ARM uses exception levels:

Exception levels: EL0 (User), EL1 (OS), EL2 (Hypervisor), EL3 (Secure Monitor)
Exception types: Synchronous, IRQ, FIQ, SError
Exception vector: VBAR_EL1, VBAR_EL2, VBAR_EL3
Exception syndrome: ESR_EL1, ESR_EL2, ESR_EL3
Exception link: ELR_EL1, ELR_EL2, ELR_EL3

Exception handling:

// EL1 exception handler
exception_handler:
    mrs x0, ESR_EL1      // Read syndrome
    mrs x1, ELR_EL1      // Read PC
    // Handle exception
    eret                 // Return from exception

MIPS: Exception Handling

MIPS uses a simpler exception model:

Modes: User, Kernel
Exception vector: Fixed addresses by default: 0x80000180 (general), 0x80000000 (TLB refill), 0xBFC00000 (reset/NMI)
Cause register: Encodes exception type
EPC: Exception PC

Exception handling:

# MIPS exception handler
exception_handler:
    mfc0 $k0, $13        # Read Cause (CP0 register 13)
    mfc0 $k1, $14        # Read EPC (CP0 register 14)
    # Handle exception
    eret                 # Return from exception

Comparison Table

Feature	RISC-V	ARM	MIPS
Privilege Levels	3 (M/S/U) + optional H	4 (EL0-EL3)	2 (User/Kernel)
Trap/Exception Types	Unified (trap)	4 types (Sync, IRQ, FIQ, SError)	Unified (exception)
Vector Table	mtvec, stvec	VBAR_ELn	Fixed address
Cause Register	mcause, scause	ESR_ELn	Cause
Return Instruction	mret, sret	eret	eret
Nested Interrupts	Software-managed	Hardware-managed	Software-managed

ARM has the most sophisticated exception model. RISC-V is modular. MIPS is simplest.
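Once the cause register has been read, each architecture needs only a little bit manipulation to find out what happened. The following sketch places the three decoders side by side. The function names are hypothetical, and the RISC-V part assumes RV64, where the Interrupt flag is bit 63.

```c
#include <stdint.h>

/* RISC-V (RV64) mcause: bit 63 = Interrupt flag, remaining bits = cause code. */
int rv_mcause_is_interrupt(uint64_t mcause) { return (int)(mcause >> 63); }
uint64_t rv_mcause_code(uint64_t mcause)    { return mcause & ~(UINT64_C(1) << 63); }

/* AArch64 ESR_ELx: exception class (EC) in bits 31:26, ISS in bits 24:0. */
unsigned a64_esr_ec(uint64_t esr)  { return (unsigned)((esr >> 26) & 0x3F); }
unsigned a64_esr_iss(uint64_t esr) { return (unsigned)(esr & 0x1FFFFFF); }

/* MIPS32 Cause: ExcCode in bits 6:2, BD (branch delay slot) in bit 31. */
unsigned mips_cause_exccode(uint32_t cause) { return (cause >> 2) & 0x1F; }
unsigned mips_cause_bd(uint32_t cause)      { return cause >> 31; }
```

As reference points: mcause 0x8000000000000007 is a machine timer interrupt (code 7), ESR 0x56000000 is an SVC from AArch64 (EC 0x15), and a MIPS Cause value of 0x20 carries ExcCode 8 (syscall).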

17.5 Memory Models

RISC-V: RVWMO (Weak Memory Ordering)

RISC-V uses a weak memory model (RVWMO):

Ordering: Relaxed by default
Fences: fence instruction for ordering
Atomics: LR/SC and AMO instructions
TSO extension: Optional total store ordering (Ztso)

Memory ordering:

# RISC-V: Ensure store visible before load
sw x10, 0(x5)
fence w, r           # Write-to-read fence
lw x11, 0(x6)
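The predecessor and successor sets of a fence are plain bit fields in the instruction word, which is what makes the instruction so fine-grained. A minimal encoder sketch, with hypothetical names, assuming fm = 0 and rs1 = rd = x0:

```c
#include <stdint.h>

/* Access-type bits used in both the predecessor and successor sets. */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };

/* Encode "fence pred, succ": pred in bits 27:24, succ in bits 23:20,
 * opcode MISC-MEM (0x0F). Assumes fm = 0 and rs1 = rd = x0. */
uint32_t rv_encode_fence(unsigned pred, unsigned succ) {
    return ((uint32_t)(pred & 0xFu) << 24) |
           ((uint32_t)(succ & 0xFu) << 20) | 0x0Fu;
}
```

With this layout, fence w, r encodes as 0x0120000F, fence rw, rw as 0x0330000F, and the full fence iorw, iorw as 0x0FF0000F.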

ARM: Weak Memory Model

ARM uses a weak memory model:

Ordering: Relaxed by default
Barriers: DMB (data memory barrier), DSB (data synchronization barrier), ISB (instruction synchronization barrier)
Atomics: LDXR/STXR (exclusive) and atomic instructions (ARMv8.1+)

Memory ordering:

// ARM: Ensure store visible before load
STR X0, [X1]
DMB SY               // Data memory barrier (system)
LDR X2, [X3]

MIPS: Sequential Consistency Variants

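The MIPS architecture does not mandate a single memory model. Early in-order implementations behaved as sequentially consistent in practice, while later multiprocessor cores may reorder accesses:

Ordering: Implementation-dependent (effectively sequential on early cores, weaker on later multiprocessor cores)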
Barriers: sync instruction
Atomics: LL/SC (load-linked/store-conditional)

Memory ordering:

# MIPS: Ensure store visible before load
sw $t0, 0($a0)
sync                 # Synchronization barrier
lw $t1, 0($a1)

Memory Barrier Instructions

Architecture	Barrier Instruction	Purpose
RISC-V	`fence r, w`	Order reads before writes
RISC-V	`fence w, r`	Order writes before reads
RISC-V	`fence rw, rw`	Full fence
ARM	`DMB`	Data memory barrier
ARM	`DSB`	Data synchronization barrier
ARM	`ISB`	Instruction synchronization barrier
MIPS	`sync`	Synchronization barrier

RISC-V’s fence is more fine-grained (specify predecessor/successor). ARM has multiple barrier types. MIPS has a single sync instruction.
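In portable C, these barriers are usually reached through C11 atomics rather than written by hand. The sketch below shows a message-passing pattern with explicit fences; the variable and function names are hypothetical. Compilers typically lower the release fence to fence rw, w and the acquire fence to fence r, rw on RISC-V, to DMB variants on ARM, and to sync on MIPS, though the exact choice is up to the toolchain.

```c
#include <stdatomic.h>

static int payload;          /* ordinary data protected by the flag */
static atomic_int ready;     /* zero-initialized: nothing published yet */

/* Producer: write the data, then publish the flag. */
void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);  /* data before flag */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and fills *out once the flag has been observed. */
int try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);  /* flag before data */
    *out = payload;
    return 1;
}
```

Writing the ordering requirement once in C and letting the compiler pick the barrier keeps the same source correct on all three architectures.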

17.6 Virtual Memory

RISC-V: Sv39/Sv48 Page Tables

RISC-V virtual memory:

Sv39: 39-bit virtual address, 3-level page table
Sv48: 48-bit virtual address, 4-level page table
Sv57: 57-bit virtual address, 5-level page table (added in Privileged Architecture 1.12)
Page sizes: 4 KB, 2 MB (megapage), 1 GB (gigapage)
TLB: Implementation-specific

Page table entry (Sv39):

[63:54] Reserved
[53:28] PPN[2] (26 bits)
[27:19] PPN[1] (9 bits)
[18:10] PPN[0] (9 bits)
[9:8]   RSW (reserved for supervisor software)
[7:0]   Flags (D, A, G, U, X, W, R, V from bit 7 down to bit 0)
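The page table walk consumes the virtual address nine bits at a time and reads the physical page number from each entry. The helpers below sketch that bit slicing for Sv39; the names are hypothetical, and the leaf translation covers only 4 KB pages.

```c
#include <stdint.h>

/* Sv39 virtual address: three 9-bit VPN fields above a 12-bit page offset.
 * level is 0, 1, or 2 (VPN[2] is used first during the walk). */
unsigned sv39_vpn(uint64_t va, int level) {
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

/* Physical page number stored in PTE bits 53:10 (44 bits). */
uint64_t sv39_pte_ppn(uint64_t pte) {
    return (pte >> 10) & ((UINT64_C(1) << 44) - 1);
}

/* A valid PTE with R = W = X = 0 points to the next table level. */
int sv39_pte_is_pointer(uint64_t pte) {
    return (pte & 0x1) && !(pte & 0xE);
}

/* Physical address for a 4 KB leaf mapping. */
uint64_t sv39_leaf_pa_4k(uint64_t pte, uint64_t va) {
    return (sv39_pte_ppn(pte) << 12) | (va & 0xFFF);
}
```

For the address 0xC0A07123, the three indices are VPN[2] = 3, VPN[1] = 5, and VPN[0] = 7, with page offset 0x123.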

ARM: 48-bit VA with TTBR0/1

ARM virtual memory (ARMv8-A):

VA size: 48-bit (configurable 36-48 bits)
Page table levels: 4 levels (configurable)
Page sizes: 4 KB, 2 MB, 1 GB
TTBR0/TTBR1: Separate page tables for user/kernel

Page table entry (4 KB granule):

[63:48] Upper attributes (e.g., UXN, PXN, Contiguous), software-use bits, reserved
[47:12] Output address
[11:2]  Attributes (AF, SH, AP, NS, etc.)
[1]     Table/Block
[0]     Valid
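The split between TTBR0 and TTBR1 is decided by the upper bits of the virtual address, after which the walk indexes each level with nine bits. The sketch below uses hypothetical names and assumes a 4 KB granule, a 48-bit address space on both halves (T0SZ = T1SZ = 16), and no top-byte-ignore or tagging.

```c
#include <stdint.h>

/* Index into the translation table at level 0..3
 * (assumes 4 KB granule, 48-bit VA). */
unsigned a64_table_index(uint64_t va, int level) {
    return (unsigned)((va >> (39 - 9 * level)) & 0x1FF);
}

/* Returns 0 for TTBR0 (upper 16 bits all zero), 1 for TTBR1 (all ones),
 * and -1 for addresses in the unmapped hole between the two ranges. */
int a64_select_ttbr(uint64_t va) {
    uint64_t top = va >> 48;
    if (top == 0)      return 0;
    if (top == 0xFFFF) return 1;
    return -1;  /* would raise a translation fault */
}
```

This is why user space and kernel space can keep separate page tables without flushing: a context switch only needs to change TTBR0, while the kernel mapping under TTBR1 stays in place.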

MIPS: TLB-Based MMU

MIPS virtual memory:

TLB: Software-managed TLB (no hardware page table walk)
VA size: 32-bit (MIPS32), 64-bit (MIPS64)
Page sizes: 4 KB (typical), configurable
TLB entries: 16-64 entries (implementation-specific)

TLB entry:

EntryHi:  [VPN | ASID]
EntryLo0: [PFN | C | D | V | G]
EntryLo1: [PFN | C | D | V | G]
PageMask: Page size
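Each MIPS TLB entry maps a pair of adjacent pages, which is why there are two EntryLo registers. The following software model sketches the match logic for 32-bit addresses and 4 KB pages. The struct and function names are hypothetical, and the model simplifies the G bit to a single per-entry flag.

```c
#include <stdint.h>

typedef struct {
    uint32_t vpn2;     /* virtual page number / 2: one entry maps a page pair */
    uint8_t  asid;     /* address space ID from EntryHi */
    uint8_t  global;   /* G bit: ignore the ASID when set (simplified) */
    uint32_t pfn[2];   /* EntryLo0 (even page), EntryLo1 (odd page) */
    uint8_t  valid[2]; /* V bit of each half */
} mips_tlb_entry;

/* Look up a 32-bit VA with 4 KB pages. Returns 1 and sets *pa on a hit,
 * 0 otherwise (real hardware would raise a TLB refill or invalid exception). */
int mips_tlb_lookup(const mips_tlb_entry *tlb, int n, uint32_t va,
                    uint8_t asid, uint32_t *pa) {
    uint32_t vpn2 = va >> 13;        /* drop offset and even/odd bit */
    int odd = (int)((va >> 12) & 1); /* selects EntryLo0 or EntryLo1 */
    for (int i = 0; i < n; i++) {
        const mips_tlb_entry *e = &tlb[i];
        if (e->vpn2 == vpn2 && (e->global || e->asid == asid)) {
            if (!e->valid[odd])
                return 0;
            *pa = (e->pfn[odd] << 12) | (va & 0xFFF);
            return 1;
        }
    }
    return 0;
}
```

On a miss, the operating system's refill handler picks an entry to replace and writes it with the TLB write instructions, which is the software overhead this design trades for flexibility.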

Page Table Walk Comparison

Feature	RISC-V	ARM	MIPS
Hardware Walk	Yes	Yes	No (software TLB refill)
Page Table Levels	3-5 (Sv39-Sv57)	4 (configurable)	N/A (TLB only)
Page Sizes	4 KB, 2 MB, 1 GB	4 KB, 2 MB, 1 GB	4 KB (configurable)
TLB Refill	Hardware	Hardware	Software (exception)
ASID	Yes (satp.ASID)	Yes (TTBR.ASID)	Yes (EntryHi.ASID)

RISC-V and ARM use hardware page table walks. MIPS uses software TLB refill, giving more flexibility but requiring software overhead.

17.7 Interrupt Architecture

RISC-V: PLIC/CLIC/AIA

RISC-V interrupt architecture has evolved:

CLINT: Core-Local Interruptor (timer, software interrupts)
PLIC: Platform-Level Interrupt Controller (external interrupts)
CLIC: Core-Local Interrupt Controller (vectored, nested interrupts for embedded)
AIA: Advanced Interrupt Architecture (MSI, IMSIC for servers)

PLIC example:

// PLIC interrupt handler
void plic_handler(void) {
    uint32_t source = plic_claim();  // Claim interrupt

    if (source == UART_IRQ) {
        uart_handler();
    }

    plic_complete(source);  // Complete interrupt
}
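What plic_claim() returns is decided by the PLIC's arbitration rules: the highest-priority pending and enabled source wins, ties go to the lowest ID, and priority 0 means "never interrupt". The software model below sketches those rules. It is not a driver: the struct, the function names, and the source count of 8 are hypothetical. It also assumes, following the PLIC specification, that the threshold gates the interrupt signal to the hart but not the claim itself.

```c
#include <stdint.h>

#define NSRC 8   /* hypothetical number of sources; ID 0 means "no interrupt" */

typedef struct {
    uint32_t priority[NSRC];  /* 0 = never interrupt */
    uint32_t pending;         /* bit i set: source i is pending */
    uint32_t enable;          /* bit i set: source i enabled for this context */
    uint32_t threshold;       /* priorities <= threshold do not signal the hart */
} plic_model;

/* Highest-priority pending and enabled source; ties go to the lowest ID. */
static uint32_t plic_best_source(const plic_model *p) {
    uint32_t best = 0, best_prio = 0;
    for (uint32_t id = 1; id < NSRC; id++) {
        uint32_t bit = UINT32_C(1) << id;
        if ((p->pending & p->enable & bit) && p->priority[id] > best_prio) {
            best = id;
            best_prio = p->priority[id];
        }
    }
    return best;
}

/* External interrupt signal to the hart: asserted only above the threshold. */
int plic_model_irq_line(const plic_model *p) {
    uint32_t best = plic_best_source(p);
    return best != 0 && p->priority[best] > p->threshold;
}

/* Claim: return the winning source and clear its pending bit (0 if none). */
uint32_t plic_model_claim(plic_model *p) {
    uint32_t best = plic_best_source(p);
    if (best)
        p->pending &= ~(UINT32_C(1) << best);
    return best;
}
```

In the handler above, a real plic_claim() performs this selection in hardware with a single register read, and plic_complete() tells the PLIC that the source may interrupt again.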

ARM: GIC (GICv3/GICv4)

ARM uses the Generic Interrupt Controller (GIC):

GICv2: Legacy, supports up to 8 cores
GICv3: Modern, supports many cores, message-based
GICv4: Adds direct injection of virtual interrupts (virtualization itself is supported since GICv2)
Interrupt types: SGI (software), PPI (private peripheral), SPI (shared peripheral)

GIC example:

// GIC interrupt handler
void gic_handler(void) {
    uint32_t intid = gic_acknowledge();  // Acknowledge interrupt

    if (intid == UART_INTID) {
        uart_handler();
    }

    gic_end_of_interrupt(intid);  // End of interrupt
}

MIPS: Simple IRQ Model

MIPS uses a simple interrupt model:

Interrupt lines: 8 interrupt sources (IP0-IP1 software, IP2-IP7 hardware)
Interrupt mask: Controlled by Status register
Interrupt pending: Indicated by Cause register
External controller: Optional external interrupt controller

MIPS example:

// MIPS interrupt handler
void mips_interrupt_handler(void) {
    uint32_t cause = read_c0_cause();
    uint32_t pending = (cause >> 8) & 0xFF;  // IP bits

    if (pending & (1 << 2)) {  // IP2
        uart_handler();
    }
}
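A pending bit in Cause only leads to an interrupt if the matching mask bit in Status is set and interrupts are globally enabled. The helper below is a simplified sketch with a hypothetical name. It checks only Status.IE and the IM field, and ignores the EXL and ERL bits that also block interrupts on real hardware.

```c
#include <stdint.h>

/* Interrupts that are both pending (Cause.IP, bits 15:8) and unmasked
 * (Status.IM, bits 15:8). Returns 0 when Status.IE (bit 0) is clear.
 * Simplification: Status.EXL and Status.ERL are not considered. */
uint32_t mips_active_irqs(uint32_t cause, uint32_t status) {
    uint32_t ip = (cause  >> 8) & 0xFF;
    uint32_t im = (status >> 8) & 0xFF;
    return (status & 1u) ? (ip & im) : 0;
}
```

A handler can then dispatch on the returned mask instead of the raw Cause bits, which avoids servicing sources that the kernel has deliberately masked.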

Interrupt Routing and Priority

Feature	RISC-V (PLIC)	ARM (GIC)	MIPS
Interrupt Sources	1-1023	32-1020	8 lines
Priority Levels	0-255	0-255	None (software)
Routing	Per-hart enable	Affinity routing	Fixed
Vectoring	Optional (CLIC)	Yes	No
Nesting	Software-managed	Hardware-managed	Software-managed

ARM GIC is the most sophisticated. RISC-V PLIC is flexible. MIPS is simplest.

17.8 Calling Conventions

RISC-V: RV64 SysV ABI

RISC-V calling convention (RV64):

Arguments: a0-a7 (x10-x17)
Return values: a0-a1 (x10-x11)
Saved registers: s0-s11 (x8-x9, x18-x27)
Temporary registers: t0-t6 (x5-x7, x28-x31)
Stack pointer: sp (x2)
Return address: ra (x1)

Function call:

# Caller
addi sp, sp, -16
sd ra, 8(sp)
call function        # ra ← PC+4, PC ← function
ld ra, 8(sp)
addi sp, sp, 16

# Callee
function:
    addi sp, sp, -16
    sd s0, 8(sp)
    # ... function body ...
    ld s0, 8(sp)
    addi sp, sp, 16
    ret              # PC ← ra
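The mapping between architectural register numbers and ABI names is fixed, so it fits in a small lookup table. A sketch with hypothetical function names, useful for example when printing register dumps in a trap handler:

```c
#include <stddef.h>

/* ABI names of x0..x31 in the standard RISC-V calling convention. */
static const char *const rv_abi_names[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0",   "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6",   "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8",   "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

/* ABI name for register xN, or NULL if N is out of range. */
const char *rv_abi_name(int x) {
    return (x >= 0 && x < 32) ? rv_abi_names[x] : NULL;
}

/* Callee-saved registers: sp (x2), s0-s1 (x8-x9), s2-s11 (x18-x27). */
int rv_is_callee_saved(int x) {
    return x == 2 || x == 8 || x == 9 || (x >= 18 && x <= 27);
}
```

This also explains the two code fragments above: the caller saves ra because the call overwrites it, while the callee saves s0 because it is callee-saved.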

ARM: AAPCS64

ARM calling convention (AAPCS64):

Arguments: X0-X7
Return values: X0-X1
Saved registers: X19-X28
Temporary registers: X9-X15
Stack pointer: SP
Return address: X30 (LR)
Frame pointer: X29 (FP)

Function call:

// Caller
STP X29, X30, [SP, #-16]!
BL function          // LR ← PC+4, PC ← function
LDP X29, X30, [SP], #16

// Callee
function:
    STP X29, X30, [SP, #-16]!
    MOV X29, SP
    // ... function body ...
    LDP X29, X30, [SP], #16
    RET              // PC ← LR

MIPS: O32/N32/N64 ABIs

MIPS has multiple ABIs:

O32: 32-bit, 4 argument registers
N32: 32-bit pointers, 64-bit registers
N64: 64-bit

MIPS O32 calling convention:

Arguments: $a0-$a3 ($4-$7)
Return values: $v0-$v1 ($2-$3)
Saved registers: $s0-$s7 ($16-$23)
Temporary registers: $t0-$t9 ($8-$15, $24-$25)
Stack pointer: $sp ($29)
Return address: $ra ($31)

Function call:

# Caller
addiu $sp, $sp, -8
sw $ra, 4($sp)
jal function         # $ra ← PC+8, PC ← function
lw $ra, 4($sp)
addiu $sp, $sp, 8

# Callee
function:
    addiu $sp, $sp, -8
    sw $s0, 4($sp)
    # ... function body ...
    lw $s0, 4($sp)
    addiu $sp, $sp, 8
    jr $ra           # PC ← $ra

ABI Comparison

Feature	RISC-V	ARM	MIPS (O32)
Argument Registers	8 (a0-a7)	8 (X0-X7)	4 ($a0-$a3)
Return Registers	2 (a0-a1)	2 (X0-X1)	2 ($v0-$v1)
Saved Registers	12 (s0-s11)	10 (X19-X28)	8 ($s0-$s7)
Temporary Registers	7 (t0-t6)	7 (X9-X15)	10 ($t0-$t9)
Stack Growth	Downward	Downward	Downward
Alignment	16 bytes	16 bytes	8 bytes (O32)

RISC-V and ARM have more argument registers than MIPS O32, reducing stack usage.
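The effect of the register count and alignment rules can be sketched with two small helpers. The names are hypothetical, and the model considers only integer arguments that each fit in one register.

```c
#include <stddef.h>

/* Number of integer arguments that no longer fit in registers
 * and must be passed on the stack. */
size_t stack_args(size_t nargs, size_t arg_regs) {
    return nargs > arg_regs ? nargs - arg_regs : 0;
}

/* Round a frame size up to the ABI stack alignment (a power of two). */
size_t align_frame(size_t bytes, size_t align) {
    return (bytes + align - 1) & ~(align - 1);
}
```

A six-argument call passes everything in registers on RISC-V and ARM but spills two arguments on MIPS O32. A 24-byte frame grows to 32 bytes under 16-byte alignment and stays at 24 bytes under O32's 8-byte alignment.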

17.9 Pipeline and Microarchitecture

In-Order vs Out-of-Order

All three architectures support both in-order and out-of-order implementations:

RISC-V:

In-order: SiFive E-series, Rocket
Out-of-order: SiFive P-series, BOOM (Berkeley Out-of-Order Machine)

ARM:

In-order: Cortex-A5, Cortex-A7, Cortex-A53
Out-of-order: Cortex-A72, Cortex-A76, Neoverse N1/V1

MIPS:

In-order: MIPS 24K, 34K
Out-of-order: MIPS 74K, R10000

Branch Prediction Strategies

Modern implementations use sophisticated branch prediction (accuracy figures are approximate and workload-dependent):

Implementation	Branch Predictor	Accuracy
SiFive U74	2-level adaptive	~90%
SiFive P550	TAGE predictor	~95%
ARM Cortex-A76	Multi-level predictor	~95%+
MIPS 74K	2-level adaptive	~90%

Out-of-order cores require better branch prediction to maintain performance: deeper pipelines and more in-flight speculative work make each misprediction more expensive.
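To give a feel for where an accuracy near 90% comes from, the sketch below simulates a single 2-bit saturating counter. This is far simpler than the predictors in the table and is not a model of any listed core. It also includes the usual first-order formula for the cost of mispredictions. All names are hypothetical.

```c
#include <stddef.h>

/* Simulate one 2-bit saturating counter: states 0-1 predict not-taken,
 * states 2-3 predict taken. outcomes[i] is 1 for taken, 0 for not-taken.
 * Returns the number of correct predictions. */
size_t predict_2bit(const unsigned char *outcomes, size_t n) {
    unsigned state = 2;   /* start at "weakly taken" (an arbitrary choice) */
    size_t correct = 0;
    for (size_t i = 0; i < n; i++) {
        int predicted = state >= 2;
        int taken = outcomes[i] != 0;
        if (predicted == taken)
            correct++;
        if (taken) { if (state < 3) state++; }
        else       { if (state > 0) state--; }
    }
    return correct;
}

/* First-order CPI model: base CPI plus misprediction stall cycles. */
double cpi_with_branches(double base_cpi, double branch_frac,
                         double miss_rate, double penalty_cycles) {
    return base_cpi + branch_frac * miss_rate * penalty_cycles;
}
```

For a loop branch that is taken nine times and then falls through, the counter mispredicts only the exit, giving 90% accuracy. With 20% branches, a 10% miss rate, and a 13-cycle penalty, the formula adds 0.26 cycles per instruction, which shows why wide, deep cores invest in better predictors.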

Cache Hierarchies

Typical cache configurations:

RISC-V (SiFive P550):

L1 I-cache: 32 KB, 4-way
L1 D-cache: 32 KB, 8-way
L2 cache: 512 KB - 2 MB, 8-way

ARM (Cortex-A76):

L1 I-cache: 64 KB, 4-way
L1 D-cache: 64 KB, 4-way
L2 cache: 256 KB - 512 KB, 8-way

MIPS (74K):

L1 I-cache: 32 KB, 4-way
L1 D-cache: 32 KB, 4-way
L2 cache: Optional, implementation-specific

Implementation Examples

Implementation	Type	Pipeline Stages	Issue Width	IPC (peak)
SiFive E76	In-order	8	2	2.0
SiFive P550	Out-of-order	13	3	3.0
ARM Cortex-A53	In-order	8	2	2.0
ARM Cortex-A76	Out-of-order	13	4	4.0
MIPS 24K	In-order	8	1	1.0
MIPS 74K	Out-of-order	15	2	2.0

ARM has the most aggressive out-of-order implementations. RISC-V is catching up. MIPS development has slowed.

17.10 Ecosystem and Licensing

RISC-V: Open and Free ISA

RISC-V ecosystem:

ISA: Open, royalty-free
Licensing: No licensing fees
Governance: RISC-V International (non-profit)
Implementations: Open-source (Rocket, BOOM) and commercial (SiFive, Andes, etc.)
Software: GCC, LLVM, Linux, FreeBSD, Zephyr, FreeRTOS

Advantages:

No licensing costs
Customizable (add custom extensions)
Growing ecosystem
Academic and research-friendly

Challenges:

Younger ecosystem (less mature than ARM)
Fewer commercial implementations
Software ecosystem still developing

ARM: Commercial Licensing

ARM ecosystem:

ISA: Proprietary
Licensing: Architecture license (design own core) or implementation license (use ARM core)
Governance: ARM Holdings (commercial company)
Implementations: ARM Cortex series, vendor designs (Apple, Qualcomm, Samsung)
Software: Mature ecosystem (GCC, LLVM, Linux, Android, iOS)

Advantages:

Mature, proven ecosystem
Extensive software support
Wide industry adoption
Strong performance

Challenges:

Licensing fees
Less customizable
Vendor lock-in

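MIPS: Historical Commercial, Briefly Open

MIPS ecosystem:

ISA: Proprietary; briefly opened under the MIPS Open initiative (announced 2018, discontinued 2019)
Licensing: Commercial, apart from the short-lived MIPS Open program
Governance: MIPS (under Wave Computing during MIPS Open); the company has since shifted to RISC-V designs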
Implementations: Historical (MIPS Technologies), now limited
Software: GCC, LLVM, Linux (legacy support)

Status:

Historical importance (education, networking)
Declining commercial adoption
Open initiative came too late
Largely superseded by ARM and RISC-V

Industry Adoption and Ecosystem

Aspect	RISC-V	ARM	MIPS
Market Share	Growing (embedded, IoT)	Dominant (mobile, embedded)	Declining
Vendors	SiFive, Andes, Alibaba, etc.	ARM, Apple, Qualcomm, etc.	Limited
Software Ecosystem	Growing	Mature	Legacy
Tool Support	Good (GCC, LLVM)	Excellent	Good (legacy)
Community	Active, growing	Mature, large	Small
Education	Increasing	Common	Historical

ARM dominates commercially. RISC-V is growing rapidly. MIPS is declining.

17.11 Future Directions

RISC-V Roadmap

RISC-V is actively evolving:

Ratified extensions: V (vector), B (bit manipulation), Zicond (conditional ops), Zc (code size reduction), scalar and vector cryptography (Zk, Zvk)
In progress: J (JIT support), P (packed SIMD)
Proposed: Memory tagging, capabilities
Platform profiles: RVA22, RVA23 (standardize extension combinations)

Focus areas:

Performance (vector, out-of-order)
Security (crypto, memory tagging)
Code density (compressed, Zc)
Ecosystem maturity (tools, software)

ARM Roadmap

ARM continues to evolve:

ARMv9-A: SVE2 (scalable vector), TME (transactional memory), MTE (memory tagging)
Confidential Compute: Realm Management Extension (RME)
AI/ML: Matrix extensions, neural processing
Automotive: ASIL-D safety, real-time

Focus areas:

AI/ML acceleration
Security (confidential computing)
Automotive and safety
Performance scaling

Emerging Extensions

RISC-V:

Crypto: AES, SHA, scalar crypto
Vector: V extension (ratified), improvements ongoing
Hypervisor: H extension (ratified)
Bit manipulation: B extension (ratified)

ARM:

SVE2: Scalable vector (successor to NEON)
MTE: Memory Tagging Extension (security)
TME: Transactional Memory Extension
SME: Scalable Matrix Extension (AI/ML)

Security Features

Both architectures are adding security features:

RISC-V:

PMP (Physical Memory Protection)
sPMP (Supervisor PMP)
Crypto extensions
Proposed: Memory tagging, capabilities (CHERI-like)

ARM:

TrustZone (secure world)
MTE (Memory Tagging Extension)
PAC (Pointer Authentication Codes)
BTI (Branch Target Identification)

Industry Trends

RISC-V trends:

Rapid adoption in China (Alibaba, Huawei)
Growing in IoT and embedded
Increasing in data center (SiFive, Ventana)
Strong in research and education

ARM trends:

Continued dominance in mobile
Growing in data center (AWS Graviton, Ampere)
Strong in automotive
Expanding in AI/ML

MIPS trends:

Declining market share
Legacy support only
Some niche applications (networking)

The future favors RISC-V (open, growing) and ARM (mature, dominant). MIPS is largely historical.

Summary

Comparing RISC-V, ARM, and MIPS reveals different architectural philosophies and trade-offs. This chapter examined eleven dimensions of comparison, showing how each architecture approaches the same problems.

ISA design philosophy distinguishes the three architectures fundamentally. RISC-V emphasizes modularity and extensibility with a frozen base ISA and composable extensions. ARM emphasizes comprehensiveness and evolution with rich instruction sets and market-driven features. MIPS emphasizes classic RISC simplicity with regular encodings and minimal complexity. These philosophies shape all subsequent design decisions.

Instruction set complexity varies significantly. RISC-V has the smallest base ISA (47 instructions for RV32I) with optional extensions. ARM has the largest instruction set (1000+ instructions) covering many use cases. MIPS falls in between with classic RISC simplicity. Encoding formats reflect this—RISC-V and MIPS use regular formats, while ARM uses more complex encodings for expressiveness.

Register architecture shows convergence. All three provide 31-32 general-purpose registers with a hardwired zero register (RISC-V x0, ARM XZR, MIPS $0). ARM separates the stack pointer from general registers. All three use separate namespaces for system registers accessed through special instructions.

Exception and interrupt models reflect different privilege architectures. RISC-V uses a unified trap model with M/S/U modes. ARM uses exception levels (EL0-EL3) with four exception types. MIPS uses a simpler two-mode model. ARM provides the most sophisticated exception handling, while MIPS is simplest.

Memory models all use weak ordering for performance. RISC-V uses RVWMO with fine-grained fence instructions. ARM uses a weak model with multiple barrier types (DMB, DSB, ISB). MIPS evolved from sequential consistency to weak ordering with sync barriers. All three provide atomic instructions for synchronization.

Virtual memory shows architectural maturity. RISC-V and ARM use hardware page table walks with multi-level page tables (Sv39/Sv48 for RISC-V, 4-level for ARM). MIPS uses software-managed TLB refill, providing flexibility at the cost of software overhead. All three support multiple page sizes and ASID for efficient context switching.

Interrupt architecture ranges from simple to sophisticated. RISC-V uses PLIC for platform interrupts with CLIC for embedded and AIA for servers. ARM uses GIC (Generic Interrupt Controller) with mature multi-core support. MIPS uses a simple 8-line interrupt model. ARM GIC is the most mature, RISC-V is still evolving, and MIPS is the simplest.

Calling conventions show practical similarities. All three use similar register allocation strategies with 4-8 argument registers, 2 return registers, and saved/temporary register sets. RISC-V and ARM provide 8 argument registers, while MIPS O32 provides only 4. All use downward-growing stacks with 8-16 byte alignment.

Pipeline and microarchitecture demonstrate implementation diversity. All three architectures support both in-order and out-of-order implementations. ARM has the most aggressive out-of-order cores (Cortex-A76, Neoverse). RISC-V is catching up with competitive designs (SiFive P550, BOOM). MIPS development has slowed. Branch prediction and cache hierarchies are similar across modern implementations.

Ecosystem and licensing represent the fundamental business difference. RISC-V is open and royalty-free, enabling customization and eliminating licensing costs. ARM is proprietary with licensing fees but offers a mature, proven ecosystem. MIPS was commercial, opened too late, and is now declining. ARM dominates commercially, RISC-V is growing rapidly, and MIPS is largely historical.

Future directions show active evolution. RISC-V is adding vector, crypto, and security extensions while standardizing platform profiles. ARM is advancing AI/ML, security (MTE), and confidential computing. Both are adding memory tagging and enhanced security features. Industry trends favor RISC-V (open, growing) and ARM (mature, dominant), while MIPS remains largely historical.

Together, these comparisons show RISC-V as a modern, modular, open alternative to ARM’s mature, comprehensive, commercial approach, with MIPS representing classic RISC principles now largely superseded. The choice depends on priorities: openness and customization favor RISC-V, ecosystem maturity favors ARM.

Appendix A. CSR Reference

Control and Status Register Quick Reference

💡 Usage Guide: This appendix is your “dashboard” during development. When you need to look up a CSR’s bit positions or operation methods, flip right here.

🛠️ Common CSR Quick Reference

`mstatus` (Machine Status) Bit Map

This is the most frequently used CSR, controlling interrupts, privilege modes, and other core functions.

  63   62:38  37    36   35:34 33:32 31:23  22    21    20    19    18    17
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ SD  │WPRI │ MBE │ SBE │ SXL │ UXL │WPRI │ TSR │ TW  │ TVM │ MXR │ SUM │MPRV │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

 16:15 14:13 12:11 10:9    8     7     6     5     4     3     2     1     0
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ XS  │ FS  │ MPP │ VS  │ SPP │MPIE │ UBE │SPIE │WPRI │ MIE │WPRI │ SIE │WPRI │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Key Bit Descriptions:

Bit	Name	Description
3	MIE	Machine Interrupt Enable (global interrupt switch)
7	MPIE	Previous MIE (MIE value before entering trap)
11-12	MPP	Previous Privilege (00=U, 01=S, 11=M)
17	MPRV	Modify Privilege (Load/Store use MPP privilege)

Common Operation Snippets

# 1. Enable Global Interrupt
csrsi mstatus, 8          # Set MIE (bit 3); 8 fits the 5-bit immediate

# 2. Disable Global Interrupt
csrci mstatus, 8          # Clear MIE (bit 3)

# 3. Set next mode to S-mode (preparing for mret)
li   t0, (3 << 11)        # csrs/csrc take a register operand, not an immediate
csrc mstatus, t0          # Clear MPP
li   t0, (1 << 11)
csrs mstatus, t0          # Set MPP = 01 (S-mode)

# 4. Read current MPP value
csrr t0, mstatus
srli t0, t0, 11
andi t0, t0, 3            # t0 = MPP (0=U, 1=S, 3=M)
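When an mstatus value is handled from C (for example, after it has been read with csrr), the same MPP updates can be expressed as small helpers. The following is a minimal sketch; the names `mstatus_get_mpp`, `mstatus_set_mpp`, and `enum priv` are hypothetical and do not come from any standard header.

```c
#include <assert.h>
#include <stdint.h>

#define MSTATUS_MIE       (UINT64_C(1) << 3)
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (UINT64_C(3) << MSTATUS_MPP_SHIFT)

/* Privilege encodings used by MPP: 0 = U, 1 = S, 3 = M (2 is reserved). */
enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Extract the MPP field from an mstatus value. */
static inline unsigned mstatus_get_mpp(uint64_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);
}

/* Return a copy of mstatus with MPP replaced; all other bits are untouched. */
static inline uint64_t mstatus_set_mpp(uint64_t mstatus, enum priv p)
{
    assert(p == PRIV_U || p == PRIV_S || p == PRIV_M);
    return (mstatus & ~MSTATUS_MPP_MASK) | ((uint64_t)p << MSTATUS_MPP_SHIFT);
}
```

The clear-then-set order mirrors snippet 3: the old field is masked out before the new value is ORed in, so a stale bit can never survive.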

`mie` / `mip` (Interrupt Enable/Pending) Reference

These two CSRs control interrupt enable and pending status.

Bit Position Reference:
┌─────────────────────────────────────────────────────────────┐
│  11  │  9   │  7   │  5   │  3   │  1   │                   │
│ MEIE │ SEIE │ MTIE │ STIE │ MSIE │ SSIE │                   │
│  M   │  S   │  M   │  S   │  M   │  S   │                   │
│ Ext  │ Ext  │Timer │Timer │ Soft │ Soft │                   │
└─────────────────────────────────────────────────────────────┘

Common Operations:

# Enable Machine Timer Interrupt
li   t0, (1 << 7)
csrs mie, t0              # Set MTIE

# Enable Machine External Interrupt
li   t0, (1 << 11)
csrs mie, t0              # Set MEIE

# Check if Timer Interrupt is Pending
csrr t0, mip
andi t0, t0, (1 << 7)     # t0 != 0 means Timer interrupt pending
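An interrupt can be taken only when its bit is set in both mip and mie. The sketch below models that check in C and picks the winner using the M-mode priority order from the privileged specification (MEI, MSI, MTI, SEI, SSI, STI). It deliberately ignores mstatus.MIE and delegation, and the function name `next_interrupt` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupt cause numbers double as bit positions in mie and mip. */
enum { IRQ_SSI = 1, IRQ_MSI = 3, IRQ_STI = 5, IRQ_MTI = 7, IRQ_SEI = 9, IRQ_MEI = 11 };

/* Return the cause number of the highest-priority interrupt that is both
 * pending and enabled, or -1 if there is none. */
static inline int next_interrupt(uint64_t mip, uint64_t mie)
{
    /* Decreasing priority order for interrupts destined for M-mode. */
    static const int order[] = { IRQ_MEI, IRQ_MSI, IRQ_MTI, IRQ_SEI, IRQ_SSI, IRQ_STI };
    uint64_t ready = mip & mie;

    assert(sizeof order / sizeof order[0] == 6);
    for (unsigned i = 0; i < sizeof order / sizeof order[0]; i++) {
        if (ready & (UINT64_C(1) << order[i]))
            return order[i];
    }
    return -1;
}
```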

`mcause` (Machine Cause) Decode Table

When a trap occurs, mcause tells you the reason.

Interrupt (mcause[63] = 1):

Code	Name	Description
1	Supervisor Software Interrupt	S-mode software interrupt
3	Machine Software Interrupt	M-mode software interrupt
5	Supervisor Timer Interrupt	S-mode timer interrupt
7	Machine Timer Interrupt	M-mode timer interrupt
9	Supervisor External Interrupt	S-mode external interrupt
11	Machine External Interrupt	M-mode external interrupt

Exception (mcause[63] = 0):

Code	Name	Description
0	Instruction Address Misaligned	Instruction address not aligned
1	Instruction Access Fault	Instruction access error
2	Illegal Instruction	Invalid instruction
3	Breakpoint	Breakpoint (ebreak)
4	Load Address Misaligned	Load address not aligned
5	Load Access Fault	Load access error
6	Store Address Misaligned	Store address not aligned
7	Store Access Fault	Store access error
8	Environment Call from U-mode	U-mode ecall
9	Environment Call from S-mode	S-mode ecall
11	Environment Call from M-mode	M-mode ecall
12	Instruction Page Fault	Instruction page fault
13	Load Page Fault	Load page fault
15	Store Page Fault	Store page fault

Trap Handler Example:

#include <stdint.h>

void trap_handler(void) {
    uint64_t cause;
    asm volatile ("csrr %0, mcause" : "=r" (cause));

    if (cause & (1UL << 63)) {
        // Interrupt: strip the interrupt bit to get the cause code
        uint64_t code = cause & ~(1UL << 63);
        switch (code) {
            case 7:  handle_timer_interrupt(); break;
            case 11: handle_external_interrupt(); break;
            default: break;  // other interrupts are not handled in this example
        }
    } else {
        // Exception
        switch (cause) {
            case 2:  handle_illegal_instruction(); break;
            case 7:  handle_store_access_fault(); break;
            case 8:  handle_ecall_from_umode(); break;
            default: break;  // other exceptions are not handled in this example
        }
    }
}

This appendix provides a comprehensive reference for RISC-V Control and Status Registers (CSRs). CSRs control processor behavior, report status, and provide access to privileged functionality. Each CSR has a 12-bit address and is accessed using dedicated CSR instructions (CSRRW, CSRRS, CSRRC, and their immediate variants).

A.1 CSR Address Space Organization

CSR addresses are 12 bits, organized as follows:

Bits [11:10]: Read/Write Access
  00 = Read/Write
  01 = Read/Write
  10 = Read/Write
  11 = Read-Only

Bits [9:8]: Lowest Privilege Level That Can Access the CSR
  00 = User/Unprivileged
  01 = Supervisor
  10 = Hypervisor/VS
  11 = Machine

Bits [7:0]: Register Number

Access Rules:

Accessing a CSR from insufficient privilege level causes an illegal instruction exception
Writing to a read-only CSR (bits [11:10] = 11) causes an illegal instruction exception
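Accessing a CSR that does not exist causes an illegal instruction exception. Some CSRs that are required to exist (for example, unused mhpmcounter registers) may instead be hardwired to read as zero.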
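The access rules can be derived directly from the address. As a quick check, 0x300 (mstatus) has bits [11:10] = 00 (read/write) and bits [9:8] = 11 (machine), while 0xC00 (cycle) has bits [11:10] = 11 (read-only) and bits [9:8] = 00 (user). A minimal sketch of that decoding, with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lowest privilege level that may access the CSR: address bits [9:8]
 * (0 = U, 1 = S, 2 = hypervisor/VS, 3 = M). */
static inline unsigned csr_min_priv(uint16_t addr)
{
    assert(addr < 0x1000);            /* CSR addresses are 12 bits wide */
    return (addr >> 8) & 0x3;
}

/* A CSR is read-only when address bits [11:10] are both 1. */
static inline bool csr_is_read_only(uint16_t addr)
{
    assert(addr < 0x1000);
    return ((addr >> 10) & 0x3) == 0x3;
}

/* True if a write from privilege level `priv` would be legal by address alone. */
static inline bool csr_write_allowed(uint16_t addr, unsigned priv)
{
    return priv >= csr_min_priv(addr) && !csr_is_read_only(addr);
}
```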

A.2 Machine-Level CSRs (M-mode)

Machine Information Registers

CSR	Address	R/W	Description
mvendorid	0xF11	RO	Vendor ID (JEDEC manufacturer ID)
marchid	0xF12	RO	Architecture ID (implementation-specific)
mimpid	0xF13	RO	Implementation ID (version number)
mhartid	0xF14	RO	Hardware thread ID (unique per hart)
mconfigptr	0xF15	RO	Pointer to configuration data structure

Usage: These read-only CSRs identify the processor implementation. Software can use them to detect features, apply workarounds, or report system information.

Example:

csrr t0, mhartid        # Read hart ID
csrr t1, mvendorid      # Read vendor ID

Machine Trap Setup

CSR	Address	R/W	Description
mstatus	0x300	RW	Machine status register
misa	0x301	RW	ISA and extensions (may be read-only)
medeleg	0x302	RW	Exception delegation to S-mode
mideleg	0x303	RW	Interrupt delegation to S-mode
mie	0x304	RW	Machine interrupt enable
mtvec	0x305	RW	Machine trap-handler base address
mcounteren	0x306	RW	Counter enable for S-mode
mstatush	0x310	RW	Additional machine status (RV32 only)

Machine Trap Handling

CSR	Address	R/W	Description
mscratch	0x340	RW	Scratch register for M-mode trap handlers
mepc	0x341	RW	Machine exception program counter
mcause	0x342	RW	Machine trap cause
mtval	0x343	RW	Machine bad address or instruction
mip	0x344	RW	Machine interrupt pending
mtinst	0x34A	RW	Machine trap instruction (transformed)
mtval2	0x34B	RW	Machine bad guest physical address

Machine Memory Protection

CSR	Address	R/W	Description
pmpcfg0	0x3A0	RW	PMP configuration register 0
pmpcfg1	0x3A1	RW	PMP configuration register 1 (RV32 only)
pmpcfg2	0x3A2	RW	PMP configuration register 2
pmpcfg3	0x3A3	RW	PMP configuration register 3 (RV32 only)
pmpcfg4-15	0x3A4-0x3AF	RW	PMP configuration registers 4-15
pmpaddr0-15	0x3B0-0x3BF	RW	PMP address registers 0-15
pmpaddr16-63	0x3C0-0x3EF	RW	PMP address registers 16-63

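Note: RV32 uses all sixteen registers (pmpcfg0, pmpcfg1, pmpcfg2, etc.), each holding 4 configuration bytes. RV64 uses only the even-numbered registers (pmpcfg0, pmpcfg2, pmpcfg4, etc.), each holding 8 configuration bytes; the odd-numbered registers do not exist on RV64.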
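Because the packing differs between RV32 and RV64, locating the configuration byte for a given PMP entry takes a small calculation. The sketch below shows one way to do it; `struct pmpcfg_slot` and `pmpcfg_locate` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct pmpcfg_slot {
    uint16_t csr;    /* address of the pmpcfgN register */
    unsigned shift;  /* bit offset of the entry's 8-bit config field */
};

/* Locate the configuration byte for PMP entry `entry` (0..63).
 * xlen must be 32 or 64. */
static inline struct pmpcfg_slot pmpcfg_locate(unsigned entry, unsigned xlen)
{
    assert(entry < 64);
    assert(xlen == 32 || xlen == 64);

    unsigned per_reg = xlen / 8;        /* 4 or 8 entries per register */
    unsigned reg = entry / per_reg;
    if (xlen == 64)
        reg *= 2;                       /* RV64 uses only even-numbered pmpcfg */

    struct pmpcfg_slot slot = { (uint16_t)(0x3A0 + reg), 8 * (entry % per_reg) };
    return slot;
}
```

For example, entry 5 lives in byte 1 of pmpcfg1 on RV32 but in byte 5 of pmpcfg0 on RV64.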

Machine Counters and Timers

CSR	Address	R/W	Description
mcycle	0xB00	RW	Machine cycle counter (lower 32/64 bits)
minstret	0xB02	RW	Machine instructions retired counter
mhpmcounter3-31	0xB03-0xB1F	RW	Machine performance monitoring counters
mcycleh	0xB80	RW	Upper 32 bits of mcycle (RV32 only)
minstreth	0xB82	RW	Upper 32 bits of minstret (RV32 only)
mhpmcounter3h-31h	0xB83-0xB9F	RW	Upper 32 bits of mhpmcounter (RV32 only)
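On RV32 a 64-bit counter must be read as two halves, and the low half can wrap between the two reads. A common technique is to read high, low, then high again and retry if the high half changed. The sketch below models this on a host machine: the callback type and the simulated counter are assumptions for illustration, and on real hardware each callback would wrap a single csrr of the high or low CSR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor type: one call stands in for one csrr instruction. */
typedef uint32_t (*csr_read_fn)(void);

/* Read a split 64-bit counter without tearing. */
static inline uint64_t read_counter64(csr_read_fn read_hi, csr_read_fn read_lo)
{
    assert(read_hi && read_lo);
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi();
        lo  = read_lo();
        hi2 = read_hi();      /* differs if the low half wrapped in between */
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for host-side experiments: it starts just below a
 * 32-bit wrap and advances by one on every low-half read. */
static uint64_t sim_counter = UINT64_C(0x00000001FFFFFFFF);
static uint32_t sim_read_hi(void) { return (uint32_t)(sim_counter >> 32); }
static uint32_t sim_read_lo(void)
{
    uint32_t lo = (uint32_t)sim_counter;
    sim_counter += 1;
    return lo;
}
```

With the simulated counter, the first attempt observes a changed high half and is discarded; the retry returns a consistent value.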

Machine Counter Setup

CSR	Address	R/W	Description
mcountinhibit	0x320	RW	Machine counter-inhibit register
mhpmevent3-31	0x323-0x33F	RW	Machine performance monitoring event selectors

Usage: mcountinhibit controls which counters are active. Setting bit N stops counter N from incrementing, saving power.

A.3 Supervisor-Level CSRs (S-mode)

Supervisor Trap Setup

CSR	Address	R/W	Description
sstatus	0x100	RW	Supervisor status register (subset of mstatus)
sie	0x104	RW	Supervisor interrupt enable
stvec	0x105	RW	Supervisor trap-handler base address
scounteren	0x106	RW	Counter enable for U-mode

Supervisor Trap Handling

CSR	Address	R/W	Description
sscratch	0x140	RW	Scratch register for S-mode trap handlers
sepc	0x141	RW	Supervisor exception program counter
scause	0x142	RW	Supervisor trap cause
stval	0x143	RW	Supervisor bad address or instruction
sip	0x144	RW	Supervisor interrupt pending

Supervisor Address Translation and Protection

CSR	Address	R/W	Description
satp	0x180	RW	Supervisor address translation and protection

satp Format (RV64):

Bits [63:60]: Mode (0=Bare, 8=Sv39, 9=Sv48, 10=Sv57)
Bits [59:44]: ASID (Address Space Identifier)
Bits [43:0]:  PPN (Physical Page Number of root page table)

A.4 User-Level CSRs (U-mode)

Floating-Point Control and Status

CSR	Address	R/W	Description
fflags	0x001	RW	Floating-point accrued exceptions
frm	0x002	RW	Floating-point rounding mode
fcsr	0x003	RW	Floating-point control and status (fflags + frm)

fflags Bits:

Bit 0: NX (Inexact)
Bit 1: UF (Underflow)
Bit 2: OF (Overflow)
Bit 3: DZ (Divide by Zero)
Bit 4: NV (Invalid Operation)

frm Values:

0: RNE (Round to Nearest, ties to Even)
1: RTZ (Round towards Zero)
2: RDN (Round Down, towards -∞)
3: RUP (Round Up, towards +∞)
4: RMM (Round to Nearest, ties to Max Magnitude)
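Since fcsr packs both fields (fflags in bits [4:0], frm in bits [7:5]), a value read from fcsr can be split with shifts and masks. The following is a minimal sketch with hypothetical macro and function names:

```c
#include <assert.h>
#include <stdint.h>

/* Accrued exception flags, fcsr bits [4:0]. */
#define FFLAG_NX 0x01u  /* inexact */
#define FFLAG_UF 0x02u  /* underflow */
#define FFLAG_OF 0x04u  /* overflow */
#define FFLAG_DZ 0x08u  /* divide by zero */
#define FFLAG_NV 0x10u  /* invalid operation */

static inline unsigned fcsr_fflags(uint32_t fcsr) { return fcsr & 0x1Fu; }
static inline unsigned fcsr_frm(uint32_t fcsr)    { return (fcsr >> 5) & 0x7u; }

/* Compose an fcsr value from a rounding mode and a flag set. */
static inline uint32_t fcsr_make(unsigned frm, unsigned fflags)
{
    assert(frm <= 4);          /* only the five rounding modes listed above */
    assert(fflags <= 0x1Fu);
    return (frm << 5) | fflags;
}
```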

User Counters and Timers

CSR	Address	R/W	Description
cycle	0xC00	RO	Cycle counter (lower 32/64 bits)
time	0xC01	RO	Timer (lower 32/64 bits)
instret	0xC02	RO	Instructions retired counter
hpmcounter3-31	0xC03-0xC1F	RO	Performance monitoring counters
cycleh	0xC80	RO	Upper 32 bits of cycle (RV32 only)
timeh	0xC81	RO	Upper 32 bits of time (RV32 only)
instreth	0xC82	RO	Upper 32 bits of instret (RV32 only)
hpmcounter3h-31h	0xC83-0xC9F	RO	Upper 32 bits of hpmcounter (RV32 only)

Note: These are read-only shadows of the machine-level counters. Access can be disabled by mcounteren (for S-mode) or scounteren (for U-mode).

A.5 Debug CSRs

Debug CSRs are accessible only in Debug Mode (entered via debugger or trigger).

CSR	Address	R/W	Description
dcsr	0x7B0	RW	Debug control and status register
dpc	0x7B1	RW	Debug program counter
dscratch0	0x7B2	RW	Debug scratch register 0
dscratch1	0x7B3	RW	Debug scratch register 1

dcsr Bit Fields:

Bits [31:28]: xdebugver (Debug specification version)
Bits [8:6]: cause (Reason for entering debug mode)
- 1: ebreak instruction
- 2: Trigger module
- 3: Debugger halt request
- 4: Single step
- 5: Reset halt
Bit [2]: step (Single-step mode enable)
Bits [1:0]: prv (Privilege level before entering debug mode)

A.6 Trigger/Debug Module CSRs

CSR	Address	R/W	Description
tselect	0x7A0	RW	Trigger select register
tdata1	0x7A1	RW	Trigger data register 1 (type and config)
tdata2	0x7A2	RW	Trigger data register 2 (match value)
tdata3	0x7A3	RW	Trigger data register 3 (additional data)
tinfo	0x7A4	RO	Trigger info (supported types)
tcontrol	0x7A5	RW	Trigger control
mcontext	0x7A8	RW	Machine context register
scontext	0x7AA	RW	Supervisor context register

Usage: Triggers enable hardware breakpoints and watchpoints. tselect chooses which trigger to configure, tdata1-3 configure the selected trigger.

A.7 Key CSR Bit Fields

mstatus (Machine Status Register)

RV64 Format:

Bit  63: SD (State Dirty - summary of FS/XS)
Bits 36-37: SXL (S-mode XLEN)
Bits 34-35: UXL (U-mode XLEN)
Bit  22: TSR (Trap SRET)
Bit  21: TW (Timeout Wait - trap WFI)
Bit  20: TVM (Trap Virtual Memory - trap SATP writes)
Bit  19: MXR (Make eXecutable Readable)
Bit  18: SUM (permit Supervisor User Memory access)
Bit  17: MPRV (Modify PRiVilege)
Bits 15-16: XS (user eXtension State)
Bits 13-14: FS (Floating-point State)
Bits 11-12: MPP (Machine Previous Privilege)
Bit  8: SPP (Supervisor Previous Privilege)
Bit  7: MPIE (Machine Previous Interrupt Enable)
Bit  5: SPIE (Supervisor Previous Interrupt Enable)
Bit  3: MIE (Machine Interrupt Enable)
Bit  1: SIE (Supervisor Interrupt Enable)

FS/XS Values:

0: Off (XS: all off)
1: Initial (XS: none dirty or clean, some on)
2: Clean (XS: none dirty, some clean)
3: Dirty (XS: some dirty)

MPP/SPP Values:

0: User mode
1: Supervisor mode
3: Machine mode (MPP only; SPP is a single bit, so it can hold only 0 or 1)

mtvec (Machine Trap Vector)

Format:

Bits [XLEN-1:2]: BASE (trap handler base address, 4-byte aligned)
Bits [1:0]: MODE
  0 = Direct (all traps to BASE)
  1 = Vectored (interrupts to BASE + 4*cause, exceptions to BASE)

Example:

la t0, trap_handler
csrw mtvec, t0          # Direct mode (MODE=0)

la t0, trap_handler
ori t0, t0, 1           # Set MODE=1
csrw mtvec, t0          # Vectored mode
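The address the hart jumps to follows from the mtvec format: BASE for every trap in direct mode, and BASE + 4*cause for interrupts only in vectored mode. A minimal host-side model, where `trap_target` is a hypothetical name and the mtvec and mcause values are passed in as plain integers:

```c
#include <assert.h>
#include <stdint.h>

/* Compute the trap entry address from mtvec and mcause.
 * xlen selects where the interrupt bit of mcause sits (bit xlen-1). */
static inline uint64_t trap_target(uint64_t mtvec, uint64_t mcause, unsigned xlen)
{
    assert(xlen == 32 || xlen == 64);

    uint64_t base    = mtvec & ~UINT64_C(3);
    unsigned mode    = (unsigned)(mtvec & 3);
    uint64_t irq_bit = UINT64_C(1) << (xlen - 1);

    assert(mode <= 1);                      /* modes 2 and 3 are reserved */

    if (mode == 1 && (mcause & irq_bit))
        return base + 4 * (mcause & ~irq_bit);   /* vectored interrupt */
    return base;                                  /* direct, or any exception */
}
```

Note that in vectored mode the code at BASE is expected to be a table of jump instructions, one 4-byte slot per interrupt cause.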

mcause (Machine Cause Register)

Format:

Bit [XLEN-1]: Interrupt (1=interrupt, 0=exception)
Bits [XLEN-2:0]: Exception Code

Exception Codes (Interrupt=0):

Code	Exception
0	Instruction address misaligned
1	Instruction access fault
2	Illegal instruction
3	Breakpoint
4	Load address misaligned
5	Load access fault
6	Store/AMO address misaligned
7	Store/AMO access fault
8	Environment call from U-mode
9	Environment call from S-mode
11	Environment call from M-mode
12	Instruction page fault
13	Load page fault
15	Store/AMO page fault

Interrupt Codes (Interrupt=1):

Code	Interrupt
0	Reserved (user software interrupt in the removed N extension)
1	Supervisor software interrupt
3	Machine software interrupt
4	Reserved (user timer interrupt in the removed N extension)
5	Supervisor timer interrupt
7	Machine timer interrupt
8	Reserved (user external interrupt in the removed N extension)
9	Supervisor external interrupt
11	Machine external interrupt
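For logging or debugging, the two tables can be folded into a lookup function. The sketch below is one possible shape; `mcause_name` is a hypothetical helper, and any code not listed in the tables is reported as reserved or custom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map an mcause value to a human-readable name. xlen is 32 or 64 and
 * determines the position of the interrupt bit (bit xlen-1). */
static inline const char *mcause_name(uint64_t mcause, unsigned xlen)
{
    static const char *const exceptions[16] = {
        [0]  = "Instruction address misaligned",
        [1]  = "Instruction access fault",
        [2]  = "Illegal instruction",
        [3]  = "Breakpoint",
        [4]  = "Load address misaligned",
        [5]  = "Load access fault",
        [6]  = "Store/AMO address misaligned",
        [7]  = "Store/AMO access fault",
        [8]  = "Environment call from U-mode",
        [9]  = "Environment call from S-mode",
        [11] = "Environment call from M-mode",
        [12] = "Instruction page fault",
        [13] = "Load page fault",
        [15] = "Store/AMO page fault",
    };
    static const char *const interrupts[12] = {
        [1]  = "Supervisor software interrupt",
        [3]  = "Machine software interrupt",
        [5]  = "Supervisor timer interrupt",
        [7]  = "Machine timer interrupt",
        [9]  = "Supervisor external interrupt",
        [11] = "Machine external interrupt",
    };

    assert(xlen == 32 || xlen == 64);
    uint64_t irq_bit = UINT64_C(1) << (xlen - 1);
    uint64_t code = mcause & ~irq_bit;
    const char *name = NULL;

    if (mcause & irq_bit) {
        if (code < 12) name = interrupts[code];
    } else if (code < 16) {
        name = exceptions[code];
    }
    return name ? name : "Reserved or custom";
}
```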

satp (Supervisor Address Translation and Protection)

RV64 Format:

Bits [63:60]: MODE
  0 = Bare (no translation)
  8 = Sv39 (39-bit virtual address)
  9 = Sv48 (48-bit virtual address)
  10 = Sv57 (57-bit virtual address)
Bits [59:44]: ASID (Address Space Identifier, 16 bits)
Bits [43:0]: PPN (Physical Page Number of root page table, 44 bits)

RV32 Format:

Bit [31]: MODE (0=Bare, 1=Sv32)
Bits [30:22]: ASID (9 bits)
Bits [21:0]: PPN (22 bits)

Example:

# Switch to Sv39 mode with ASID=1, root page table at 0x80200000
li t0, 0x8000100000080200  # MODE=8, ASID=1, PPN=0x80200 (0x80200000 >> 12)
csrw satp, t0
sfence.vma                 # Flush TLB
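The constant loaded into satp can be computed rather than written by hand, which avoids mistakes in the field positions. A minimal sketch for the RV64 layout, with `make_satp` and `enum satp_mode` as hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

enum satp_mode { SATP_BARE = 0, SATP_SV39 = 8, SATP_SV48 = 9, SATP_SV57 = 10 };

/* Build an RV64 satp value from a mode, an ASID, and the physical address
 * of the root page table. */
static inline uint64_t make_satp(enum satp_mode mode, unsigned asid, uint64_t root_pa)
{
    assert((root_pa & 0xFFF) == 0);          /* root table is 4 KiB aligned */
    assert(asid <= 0xFFFF);                  /* ASID field is 16 bits */

    uint64_t ppn = root_pa >> 12;            /* physical page number */
    assert(ppn < (UINT64_C(1) << 44));       /* PPN field is 44 bits */

    return ((uint64_t)mode << 60) | ((uint64_t)asid << 44) | ppn;
}
```

Under these assumptions, `make_satp(SATP_SV39, 1, 0x80200000)` evaluates to 0x8000100000080200, and the same call with ASID 0 evaluates to 0x8000000000080200.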

A.8 CSR Instructions Quick Reference

Instruction	Format	Operation
CSRRW	csrrw rd, csr, rs1	t = CSR; CSR = rs1; rd = t
CSRRS	csrrs rd, csr, rs1	t = CSR; CSR = t | rs1; rd = t
CSRRC	csrrc rd, csr, rs1	t = CSR; CSR = t & ~rs1; rd = t
CSRRWI	csrrwi rd, csr, imm	t = CSR; CSR = imm; rd = t
CSRRSI	csrrsi rd, csr, imm	t = CSR; CSR = t | imm; rd = t
CSRRCI	csrrci rd, csr, imm	t = CSR; CSR = t & ~imm; rd = t

Pseudo-instructions:

csrr rd, csr        # Read CSR (csrrs rd, csr, x0)
csrw csr, rs1       # Write CSR (csrrw x0, csr, rs1)
csrs csr, rs1       # Set bits (csrrs x0, csr, rs1)
csrc csr, rs1       # Clear bits (csrrc x0, csr, rs1)
csrwi csr, imm      # Write immediate (csrrwi x0, csr, imm)
csrsi csr, imm      # Set bits immediate (csrrsi x0, csr, imm)
csrci csr, imm      # Clear bits immediate (csrrci x0, csr, imm)
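The read-modify-write behavior in the table above can be modeled in a few lines of C, which is handy for unit-testing code that manipulates CSR images on a host. This is a simplified sketch: `csr_apply` and `enum csr_op` are hypothetical names, and the model omits the rule that csrrs/csrrc with rs1 = x0 (or a zero immediate) perform no write at all.

```c
#include <assert.h>
#include <stdint.h>

enum csr_op { CSR_RW, CSR_RS, CSR_RC };

/* Apply one CSR instruction to *csr. `src` is the value of rs1, or the
 * zero-extended 5-bit immediate for the *I forms. Returns the old CSR
 * value, which is what the instruction writes to rd. */
static inline uint64_t csr_apply(uint64_t *csr, enum csr_op op, uint64_t src)
{
    assert(csr != 0);
    uint64_t old = *csr;

    switch (op) {
    case CSR_RW: *csr = src;        break;  /* csrrw / csrrwi */
    case CSR_RS: *csr = old | src;  break;  /* csrrs / csrrsi */
    case CSR_RC: *csr = old & ~src; break;  /* csrrc / csrrci */
    }
    return old;
}
```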

A.9 Common CSR Usage Patterns

Enable Machine-Mode Interrupts

# Enable machine timer and external interrupts
li t0, 0x880            # MTIE (bit 7, 0x080) + MEIE (bit 11, 0x800)
csrs mie, t0            # Set bits in mie

# Enable global interrupts
li t0, 0x8              # MIE (bit 3)
csrs mstatus, t0        # Set MIE in mstatus

Trap Handler Entry

trap_handler:
    # Save context
    csrrw sp, mscratch, sp  # Swap sp with mscratch

    # Save registers on stack
    addi sp, sp, -32*8
    sd x1, 0(sp)
    sd x2, 8(sp)
    # ... save all registers ...

    # Read trap cause
    csrr t0, mcause
    csrr t1, mepc
    csrr t2, mtval

    # Handle trap...

Context Switch (Change satp)

# Switch to new process page table
# a0 = new satp value
csrw satp, a0
sfence.vma              # Flush TLB

Disable Interrupts for Critical Section

# Save and disable interrupts
csrrci t0, mstatus, 0x8  # Clear MIE; t0 = old mstatus

# Critical section...

# Restore interrupts
andi t0, t0, 0x8         # Keep only the saved MIE bit
csrs mstatus, t0         # Re-enable interrupts only if they were enabled before

A.10 CSR Access Permissions

Privilege Level Check:

CSR address bits [9:8] encode minimum privilege level
Accessing CSR from lower privilege → illegal instruction exception

Read-Only Check:

CSR address bits [11:10] = 11 → read-only
Writing to read-only CSR → illegal instruction exception

Implementation-Defined Behavior:

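Accessing a CSR that does not exist causes an illegal instruction exception. CSRs and fields that exist but are not fully implemented may:
- Read as zero with writes ignored (for example, unused mhpmcounter registers)
- Accept any written value but always read back a legal one (WARL - Write Any, Read Legal)
- Implementation must document behavior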

A.11 References

RISC-V Privileged Specification: Complete CSR definitions and bit fields
RISC-V Debug Specification: Debug CSRs (dcsr, dpc, tselect, tdata)
RISC-V ISA Manual: CSR instructions and access rules

Appendix B. Extension Reference

RISC-V ISA Extensions Quick Reference

💡 Usage Guide: This appendix is your “menu” during project planning. When deciding which extensions your project needs, reference the decision guide here.

🧩 Extension Selection Guide (Decision Guide)

Quick Decision Table

Extension	Full Name	When to Use?	Dependencies	Recommendation
M	Multiply/Divide	Almost all projects need it	None	✅ Strongly recommended
A	Atomic	Multi-core, OS, Lock-free	None	✅ Required for OS
F	Single Float	Floating-point (games, scientific)	Zicsr	As needed
D	Double Float	High-precision floating-point	F	As needed
C	Compressed	Reduce code size 25-30%	None	✅ Strongly recommended
V	Vector	AI/DSP/Matrix operations	D	Required for HPC
Zba	Address Gen	Heavy array access `a[i*4]`	None	Performance optimization
Zbb	Bit Manipulation	Bit operations (popcount, clz)	None	Performance optimization
Zbs	Single-bit	Single-bit operations	None	Performance optimization
Zicsr	CSR Access	CSR access (separated from I)	None	✅ Required for system code
Zifencei	Fence.I	Instruction cache sync (JIT)	None	Required for self-modifying code

Common Combinations (Profiles)

Minimal Embedded:     RV32IMC      (Multiply + Compressed)
Standard Embedded:    RV32IMAC     (+ Atomic operations)
Application Proc:     RV64IMAFDC   (= RV64GC, full general-purpose)
High-Performance:     RV64GCV      (+ Vector)

Dependency Graph

        ┌─────┐
        │  I  │ (Base ISA)
        └──┬──┘
           │
    ┌──────┼──────┬──────┬──────┐
    ▼      ▼      ▼      ▼      ▼
  ┌───┐  ┌───┐  ┌───┐  ┌───┐ ┌─────┐
  │ M │  │ A │  │ C │  │ F │ │Zicsr│
  └───┘  └───┘  └───┘  └─┬─┘ └─────┘
                         │
                         ▼
                       ┌───┐
                       │ D │
                       └─┬─┘
                         │
                         ▼
                       ┌───┐
                       │ V │
                       └───┘

⚠️ Common Pitfalls

Pitfall 1: Thinking G Includes C

Misconception: RV64G includes compressed instructions.

Truth: G = IMAFD, does NOT include C. For compressed instructions, explicitly write RV64GC.

# ❌ Wrong: Assuming G has compression
riscv64-unknown-elf-gcc -march=rv64g ...

# ✅ Correct: Explicitly add C
riscv64-unknown-elf-gcc -march=rv64gc ...

Pitfall 2: Forgetting Zicsr and Zifencei

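Background: Starting with version 2.1 of the base I ISA (unprivileged specification 20190608 and later), the CSR instructions and FENCE.I were moved out of I into the separate Zicsr and Zifencei extensions.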

Impact: Some toolchains require explicit specification.

# If compiler complains about missing csrr/csrw
riscv64-unknown-elf-gcc -march=rv64gc_zicsr_zifencei ...

Pitfall 3: misa Can Only Be Read in M-mode

Error Scenario: Trying to read misa to detect extensions in S-mode or U-mode.

Solution: Use SBI query, or record during M-mode initialization.

// ❌ In S-mode, this triggers Illegal Instruction
uint64_t misa;
asm volatile ("csrr %0, misa" : "=r" (misa));

// ✅ Query via SBI or Device Tree
// Or save misa to global variable during M-mode boot
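Once a misa value has been captured during M-mode boot, later code can test it without touching the CSR. The sketch below assumes the saved value came from an RV64 hart (MXL in bits [63:62], one bit per extension letter in bits [25:0]); `misa_has` and `misa_xlen` are hypothetical helper names. A saved value of zero means misa is not implemented, in which case other discovery methods are needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test whether a single-letter extension is reported in a saved misa value. */
static inline bool misa_has(uint64_t misa, char ext)
{
    assert(ext >= 'A' && ext <= 'Z');
    return (misa >> (ext - 'A')) & 1;
}

/* Decode MXL (bits [63:62] of an RV64 misa): 1 = 32, 2 = 64, 3 = 128.
 * Returns 0 if the field is zero (misa not implemented). */
static inline unsigned misa_xlen(uint64_t misa)
{
    unsigned mxl = (unsigned)(misa >> 62);
    return mxl ? 16u << mxl : 0;
}
```

For example, an RV64IMAFDC hart would report bits A, C, D, F, I, and M, giving 0x800000000000112D.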

This appendix provides a comprehensive reference for RISC-V ISA extensions. RISC-V’s modular design allows implementations to include only the extensions they need, from minimal embedded systems to high-performance application processors.

B.1 Base ISAs

Base ISA	Description	Register Width	Address Space
RV32I	32-bit integer base	32 bits	32-bit (4 GB)
RV64I	64-bit integer base	64 bits	64-bit (16 EB)
RV128I	128-bit integer base (future)	128 bits	128-bit
RV32E	Embedded variant (16 registers)	32 bits	32-bit (4 GB)

RV32I: The base 32-bit integer instruction set. Includes 32 general-purpose registers (x0-x31), integer arithmetic, logical operations, loads/stores, branches, and jumps. Sufficient for simple embedded systems.

RV64I: Extends RV32I to 64-bit. Registers are 64 bits wide, so the base arithmetic instructions operate on 64-bit values. Adds 32-bit word operations (ADDW, SUBW, etc.) that sign-extend their results to 64 bits, and 64-bit loads/stores (LD, SD). Used for application processors and servers.
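The difference between ADD and ADDW on RV64 is easiest to see in a small model: ADDW keeps only the low 32 bits of the sum and sign-extends bit 31 into the upper half. The function names below are hypothetical, and the sketch relies on the compiler using two's-complement narrowing, which the static assertion checks.

```c
#include <assert.h>
#include <stdint.h>

/* The narrowing casts below assume two's-complement wraparound. */
static_assert((int32_t)0xFFFFFFFFu == -1, "two's-complement narrowing assumed");

/* Model of RV64 ADDW: 32-bit add, result sign-extended to 64 bits. */
static inline int64_t rv64_addw(int64_t a, int64_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps modulo 2^32 */
    return (int64_t)(int32_t)sum;               /* sign-extend bit 31 */
}

/* Model of RV64 ADD: full 64-bit add with wraparound. */
static inline int64_t rv64_add(int64_t a, int64_t b)
{
    return (int64_t)((uint64_t)a + (uint64_t)b);
}
```

Adding 1 to 0x7FFFFFFF illustrates the split: the 64-bit ADD produces 0x80000000, while ADDW produces the sign-extended 32-bit minimum.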

RV32E: Reduced version of RV32I with only 16 registers (x0-x15). Designed for ultra-low-cost embedded systems where area is critical. Reduces register file size by 50%.

B.2 Standard Extensions

M Extension: Integer Multiplication and Division

Status: Ratified
Description: Adds integer multiply, divide, and remainder instructions.

Instruction	Description
MUL	Multiply (lower XLEN bits)
MULH	Multiply high (signed × signed)
MULHSU	Multiply high (signed × unsigned)
MULHU	Multiply high (unsigned × unsigned)
DIV	Divide (signed)
DIVU	Divide (unsigned)
REM	Remainder (signed)
REMU	Remainder (unsigned)

RV64 Additions: MULW, DIVW, DIVUW, REMW, REMUW (32-bit variants)

Usage: Essential for most applications. Division is expensive in hardware, so minimal systems may omit M extension and use software division.
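As an illustration of what software division involves when M is omitted, the following sketch implements unsigned 32-bit division with the classic shift-and-subtract method. It is not the routine any particular toolchain library uses; `soft_udiv32` is a hypothetical name. The divide-by-zero case follows the DIVU/REMU convention (quotient all ones, remainder equal to the dividend).

```c
#include <assert.h>
#include <stdint.h>

struct udiv_result { uint32_t quot, rem; };

/* Unsigned 32-bit division by repeated shift and subtract (one bit per step). */
static inline struct udiv_result soft_udiv32(uint32_t n, uint32_t d)
{
    struct udiv_result r = { 0, 0 };

    if (d == 0) {                 /* mirror DIVU/REMU behavior */
        r.quot = UINT32_MAX;
        r.rem  = n;
        return r;
    }

    uint64_t rem = 0;             /* 64-bit so the shift cannot overflow */
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((n >> i) & 1);   /* bring down the next bit */
        if (rem >= d) {
            rem -= d;
            r.quot |= UINT32_C(1) << i;
        }
    }
    assert(rem < d);              /* invariant of the algorithm */
    r.rem = (uint32_t)rem;
    return r;
}
```

The loop always runs 32 iterations, which shows why a hardware divider is costly and why a software fallback is slow but small.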

A Extension: Atomic Instructions

Status: Ratified
Description: Adds atomic memory operations for synchronization.

Load-Reserved/Store-Conditional:

Instruction	Description
LR.W/D	Load-Reserved Word/Doubleword
SC.W/D	Store-Conditional Word/Doubleword

Atomic Memory Operations (AMO):

Instruction	Description
AMOSWAP.W/D	Atomic swap
AMOADD.W/D	Atomic add
AMOAND.W/D	Atomic AND
AMOOR.W/D	Atomic OR
AMOXOR.W/D	Atomic XOR
AMOMAX.W/D	Atomic maximum (signed)
AMOMAXU.W/D	Atomic maximum (unsigned)
AMOMIN.W/D	Atomic minimum (signed)
AMOMINU.W/D	Atomic minimum (unsigned)

Ordering Annotations: .aq (acquire), .rl (release), .aqrl (both)

Usage: Required for multi-core systems and lock-free algorithms.
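In C, these instructions are normally reached through the C11 atomics rather than written by hand. The sketch below is a minimal test-and-set spinlock using acquire and release ordering, which corresponds to the .aq and .rl annotations; the exact instruction sequence a compiler emits for it is the compiler's choice, and the `spinlock_t` type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock once; returns true on success. */
static inline bool spin_trylock(spinlock_t *l)
{
    assert(l != 0);
    /* test_and_set returns the previous value: false means we got the lock. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is acquired. */
static inline void spin_lock(spinlock_t *l)
{
    while (!spin_trylock(l)) {
        /* busy-wait */
    }
}

/* Release the lock; earlier writes become visible to the next owner. */
static inline void spin_unlock(spinlock_t *l)
{
    assert(l != 0);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```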

F Extension: Single-Precision Floating-Point

Status: Ratified
Description: Adds 32 floating-point registers (f0-f31) and single-precision (32-bit) floating-point operations.

Registers: 32 × 32-bit floating-point registers (f0-f31)

Instructions:

Arithmetic: FADD.S, FSUB.S, FMUL.S, FDIV.S, FSQRT.S
Fused Multiply-Add: FMADD.S, FMSUB.S, FNMADD.S, FNMSUB.S
Comparison: FEQ.S, FLT.S, FLE.S
Conversion: FCVT.W.S, FCVT.S.W, FCVT.L.S, FCVT.S.L
Move: FMV.X.W, FMV.W.X
Load/Store: FLW, FSW
Sign Injection: FSGNJ.S, FSGNJN.S, FSGNJX.S
Min/Max: FMIN.S, FMAX.S
Classification: FCLASS.S

CSRs: fflags, frm, fcsr (floating-point control and status)

D Extension: Double-Precision Floating-Point

Status: Ratified
Description: Extends F extension to support double-precision (64-bit) floating-point.

Requires: F extension

Registers: Extends f0-f31 to 64 bits each

Instructions: Same as F extension but with .D suffix (FADD.D, FMUL.D, etc.)

Additional Conversions: FCVT.S.D, FCVT.D.S (convert between single and double)

Load/Store: FLD, FSD (64-bit loads/stores)

C Extension: Compressed Instructions

Status: Ratified
Description: Adds 16-bit compressed instructions to reduce code size.

Encoding: 16-bit instructions (bits [1:0] ≠ 11) intermixed with 32-bit instructions

Instruction Categories:

Loads/Stores: C.LW, C.LD, C.SW, C.SD, C.LWSP, C.LDSP, C.SWSP, C.SDSP
Arithmetic: C.ADDI, C.ADDIW, C.ADDI16SP, C.ADDI4SPN, C.LI, C.LUI
Logical: C.ANDI, C.SLLI, C.SRLI, C.SRAI
Branches: C.BEQZ, C.BNEZ
Jumps: C.J, C.JAL, C.JR, C.JALR
Register Move: C.MV, C.ADD
Special: C.NOP, C.EBREAK

Code Size Reduction: Typically 25-30% smaller code compared to RV32I/RV64I alone

Usage: Highly recommended for all systems. Minimal hardware cost, significant code density improvement.
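Because 16-bit and 32-bit instructions are intermixed, a decoder has to determine each instruction's length from its first 16-bit parcel. The sketch below applies the encoding rule stated above (low two bits not equal to 11 means 16-bit) plus the standard rule that bits [4:2] not equal to 111 marks a 32-bit instruction; longer encodings are reported as unsupported. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the instruction that starts with `parcel`.
 * Returns 0 for encodings longer than 32 bits, which this sketch ignores. */
static inline unsigned insn_length(uint16_t parcel)
{
    if ((parcel & 0x3) != 0x3)
        return 2;                 /* compressed (C extension) */
    if ((parcel & 0x1C) != 0x1C)
        return 4;                 /* standard 32-bit instruction */
    return 0;
}

/* Count the instructions in a buffer of n 16-bit parcels. Stops at the
 * first unsupported or truncated instruction. */
static inline size_t count_insns(const uint16_t *parcels, size_t n)
{
    assert(parcels != NULL || n == 0);
    size_t count = 0, i = 0;

    while (i < n) {
        unsigned len = insn_length(parcels[i]);
        if (len == 0 || i + len / 2 > n)
            break;
        i += len / 2;
        count++;
    }
    return count;
}
```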

V Extension: Vector Operations

Status: Ratified (v1.0)
Description: Adds vector processing with variable-length vectors.

Registers: 32 vector registers (v0-v31), each with configurable element width and length

CSRs:

vtype: Vector type (element width, LMUL)
vl: Vector length
vstart: Vector start index (for resuming after exception)
vxrm: Vector fixed-point rounding mode
vxsat: Vector fixed-point saturation flag

Configuration: vsetvl, vsetvli (set vector length and type)

Instruction Categories:

Arithmetic: VADD, VSUB, VMUL, VDIV, VREM
Logical: VAND, VOR, VXOR
Shift: VSLL, VSRL, VSRA
Comparison: VMSEQ, VMSNE, VMSLT, VMSLE, VMSGTU
Load/Store: VLE, VSE (unit-stride), VLSE, VSSE (strided), VLUXEI, VLOXEI, VSUXEI, VSOXEI (indexed, unordered and ordered)
Reduction: VREDSUM, VREDMAX, VREDMIN
Mask: VMAND, VMOR, VMXOR, VMNOT
Permutation: VSLIDEUP, VSLIDEDOWN, VRGATHER
Floating-Point: VFADD, VFMUL, VFDIV, VFSQRT, VFMADD

Usage: High-performance computing, DSP, machine learning

B.3 Bit Manipulation Extensions

Zba: Address Generation

Status: Ratified
Description: Instructions for address calculation.

Instruction	Description
SH1ADD	Shift left by 1 and add (rs1 << 1) + rs2
SH2ADD	Shift left by 2 and add (rs1 << 2) + rs2
SH3ADD	Shift left by 3 and add (rs1 << 3) + rs2

Usage: Efficient array indexing (e.g., a[i] where elements are 2, 4, or 8 bytes)
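The three instructions share one pattern, shown here as a small C model. With rs1 holding the index and rs2 holding the array base, the result is the element address for 2-, 4-, or 8-byte elements. `shnadd` is a hypothetical name for the model, not an intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Model of SH1ADD/SH2ADD/SH3ADD: (rs1 << n) + rs2, for n = 1, 2, or 3. */
static inline uint64_t shnadd(unsigned n, uint64_t rs1, uint64_t rs2)
{
    assert(n >= 1 && n <= 3);
    return (rs1 << n) + rs2;
}
```

For a `uint32_t` array, `shnadd(2, i, base)` computes the same address as `&a[i]`, replacing a separate shift and add with one instruction.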

Zbb: Basic Bit Manipulation

Status: Ratified
Description: Common bit manipulation operations.

Instruction	Description
ANDN	AND with inverted operand
ORN	OR with inverted operand
XNOR	XOR with inverted operand
CLZ	Count leading zeros
CTZ	Count trailing zeros
CPOP	Count population (number of 1 bits)
MAX	Maximum (signed)
MAXU	Maximum (unsigned)
MIN	Minimum (signed)
MINU	Minimum (unsigned)
SEXT.B	Sign-extend byte
SEXT.H	Sign-extend halfword
ZEXT.H	Zero-extend halfword
ROL	Rotate left
ROR	Rotate right
RORI	Rotate right immediate
ORC.B	OR-combine bytes
REV8	Byte-reverse (endian swap)

Usage: Cryptography, compression, bit-field manipulation

Zbc: Carry-Less Multiplication

Status: Ratified
Description: Carry-less multiplication for cryptography.

Instruction	Description
CLMUL	Carry-less multiply (lower half)
CLMULH	Carry-less multiply (upper half)
CLMULR	Carry-less multiply (reversed)

Usage: AES-GCM, CRC calculation
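Carry-less multiplication is ordinary binary long multiplication with XOR in place of addition, so partial products never carry into each other. The sketch below models CLMUL and CLMULH bit by bit; it is a reference model for checking results, not an efficient implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of CLMUL: low xlen bits of the carry-less product. */
static inline uint64_t clmul(uint64_t a, uint64_t b, unsigned xlen)
{
    assert(xlen == 32 || xlen == 64);
    uint64_t r = 0;
    for (unsigned i = 0; i < xlen; i++)
        if ((b >> i) & 1)
            r ^= a << i;              /* XOR instead of add: no carries */
    return xlen == 64 ? r : (r & 0xFFFFFFFFu);
}

/* Model of CLMULH: high xlen bits of the carry-less product. */
static inline uint64_t clmulh(uint64_t a, uint64_t b, unsigned xlen)
{
    assert(xlen == 32 || xlen == 64);
    if (xlen == 32)
        a &= 0xFFFFFFFFu;             /* operate on the 32-bit operand only */
    uint64_t r = 0;
    for (unsigned i = 1; i < xlen; i++)
        if ((b >> i) & 1)
            r ^= a >> (xlen - i);     /* bits shifted out of the low half */
    return r;
}
```

For example, multiplying binary 11 by itself gives 101 (5) rather than 1001 (9), because the two middle partial-product bits cancel under XOR.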

Zbs: Single-Bit Instructions

Status: Ratified
Description: Single-bit set, clear, invert, extract.

Instruction	Description
BCLR	Bit clear
BCLRI	Bit clear immediate
BEXT	Bit extract
BEXTI	Bit extract immediate
BINV	Bit invert
BINVI	Bit invert immediate
BSET	Bit set
BSETI	Bit set immediate

Usage: Bit-field manipulation, flag management

B.4 Compressed Extensions (Zc*)

Zcb: Code Size Reduction (16-bit)

Status: Ratified
Description: Additional 16-bit instructions for code density.

Instructions: C.LBU, C.LHU, C.LH, C.SB, C.SH, C.ZEXT.B, C.SEXT.B, C.SEXT.H, C.ZEXT.H, C.ZEXT.W, C.NOT, C.MUL

Usage: Further code size reduction beyond C extension

Zcmp: Push/Pop and Move

Status: Ratified
Description: Push/pop multiple registers, double move.

Instructions:

CM.PUSH: Push registers to stack
CM.POP: Pop registers from stack
CM.POPRET: Pop and return
CM.POPRETZ: Pop, return, and zero a0
CM.MVA01S: Move two saved registers (s0-s7) into a0 and a1
CM.MVSA01: Move a0 and a1 into two saved registers (s0-s7)

Usage: Function prologue/epilogue optimization

Zcmt: Table Jump

Status: Ratified
Description: Indirect jump via table for switch statements.

Instructions: CM.JT, CM.JALT (jump via table)

Usage: Efficient switch/case implementation

B.5 Cache Management Extensions

Zicbom: Cache Block Management

Status: Ratified
Description: Cache block clean, flush, and invalidate.

Instructions:

CBO.CLEAN: Clean cache block (write back if dirty)
CBO.FLUSH: Flush cache block (write back and invalidate)
CBO.INVAL: Invalidate cache block

Usage: DMA coherence, cache maintenance

Zicbop: Cache Block Prefetch

Status: Ratified
Description: Prefetch hints for cache optimization.

Instructions:

PREFETCH.R: Prefetch for read
PREFETCH.W: Prefetch for write
PREFETCH.I: Prefetch for instruction

Usage: Performance optimization, prefetching

Zicboz: Cache Block Zero

Status: Ratified
Description: Zero a cache block efficiently.

Instructions: CBO.ZERO (zero cache block)

Usage: Fast memory initialization

B.6 Privileged Extensions

Zicsr: CSR Instructions

Status: Ratified (split out of the base I ISA; included in G)
Description: Control and Status Register access instructions.

Instructions: CSRRW, CSRRS, CSRRC, CSRRWI, CSRRSI, CSRRCI

Usage: Required for privileged software (OS, firmware)

Zifencei: Instruction Fetch Fence

Status: Ratified (split out of the base I ISA; included in G)
Description: Synchronize instruction and data caches.

Instructions: FENCE.I

Usage: Self-modifying code, JIT compilation, code loading

Zihintpause: Pause Hint

Status: Ratified
Description: Hint for spin-wait loops.

Instructions: PAUSE (encoded as FENCE with specific operands)

Usage: Reduce power in spin-locks

B.7 Hypervisor Extension (H)

Status: Ratified
Description: Support for virtualization.

Features:

Two-stage address translation (VS-stage and G-stage)
Virtual supervisor mode (VS-mode)
Hypervisor CSRs (hstatus, hedeleg, hideleg, hgatp, etc.)
Virtual interrupt management

Instructions: HLV, HSV (hypervisor load/store), HFENCE.VVMA, HFENCE.GVMA

Usage: Virtual machines, hypervisors (KVM, Xen)

B.8 Cryptography Extensions

Zk: Scalar Cryptography

Status: Ratified

Description: Cryptographic instructions for AES, SHA, SM3, SM4.

Sub-extensions:

Zkn: NIST algorithms (AES, SHA-256, SHA-512)
Zks: ShangMi algorithms (SM3, SM4)
Zbkb, Zbkc, Zbkx: Bit manipulation for crypto (bitwise operations, carry-less multiply, crossbar permutations)
Zkr: Entropy source (seed CSR)

Instructions:

AES: AES32ESI, AES32ESMI, AES32DSI, AES32DSMI, AES64ES, AES64DS, etc.
SHA-256: SHA256SIG0, SHA256SIG1, SHA256SUM0, SHA256SUM1
SHA-512: SHA512SIG0, SHA512SIG1, SHA512SUM0, SHA512SUM1
SM3: SM3P0, SM3P1
SM4: SM4ED, SM4KS

Usage: Secure boot, TLS, disk encryption

B.9 Extension Combinations

Common Combinations

Combination	Name	Description
RV32I	Base	Minimal 32-bit system
RV32IM	-	Base + multiply/divide
RV32IMC	-	Base + multiply + compressed
RV32IMAC	-	Base + multiply + atomic + compressed
RV32IMAFC	-	Base + M + A + F + C
RV32IMAFDC	-	Base + M + A + F + D + C
RV32GC	General	RV32IMAFD_Zicsr_Zifencei + C
RV64GC	General	RV64IMAFD_Zicsr_Zifencei + C

RV32G / RV64G: “General-purpose” configuration = IMAFD + Zicsr + Zifencei

B.10 Platform Profiles

RVA22 Profile (Application Processors)

Base: RV64I

Mandatory Extensions:

M, A, F, D, C (IMAFD + C)
Zicsr, Zifencei
Zba, Zbb, Zbs (bit manipulation)
Zihintpause
Zicbom, Zicbop, Zicboz (cache management)
Sv39 (virtual memory)
Privileged spec v1.12+

Optional Extensions: V, H, Zk

Usage: Linux-capable application processors

RVA23 Profile (Next-generation)

Adds to RVA22:

Sv39 remains the mandatory virtual memory baseline; Sv48 and Sv57 (larger virtual address spaces) stay optional
Zihintntl (non-temporal locality hints)
Zicond (conditional operations)
Zawrs (wait-on-reservation-set)
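Zcb (additional compressed instructions); Zcmp and Zcmt are not included because their encodings conflict with the compressed double-precision loads and stores that C + D provides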
Vector extension (V) mandatory

Usage: High-performance servers, HPC

RVM23 Profile (Microcontrollers)

Base: RV32I or RV64I

Mandatory Extensions:

M, C
Zicsr, Zifencei
Zba, Zbb, Zbs
Zicbop, Zicboz
PMP (Physical Memory Protection)

Optional: A, F, D, V

Usage: Embedded microcontrollers

B.11 Extension Naming Convention

Format: RV[32|64|128][I|E][Extensions]

Examples:

RV32I: 32-bit base integer
RV64IMAC: 64-bit with M, A, C extensions
RV32GC: 32-bit general-purpose with compressed
RV64GCV: 64-bit general-purpose with compressed and vector

Ordering: Extensions listed in canonical order (IMAFDQCV…)
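
The naming rules can be exercised with a small parser. The sketch below checks whether an ISA string names a given single-letter extension, treating G as shorthand for I, M, A, F, D. The function name `isa_has_letter` is hypothetical, and multi-letter extensions (anything after an underscore or starting with Z, S, or X) are deliberately not examined.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `isa` (e.g. "rv64gc") names the single-letter extension
 * `ext`. Matching is case-insensitive. Only the single-letter part of the
 * string is scanned. */
static bool isa_has_letter(const char *isa, char ext)
{
    if (ext == '\0')
        return false;
    ext = (char)tolower((unsigned char)ext);

    if (strncmp(isa, "rv", 2) != 0 && strncmp(isa, "RV", 2) != 0)
        return false;

    const char *p = isa + 2;
    while (isdigit((unsigned char)*p))      /* skip 32 / 64 / 128 */
        p++;

    for (; *p != '\0' && *p != '_'; p++) {
        char c = (char)tolower((unsigned char)*p);
        if (c == 'z' || c == 's' || c == 'x')   /* multi-letter name starts */
            break;
        if (c == ext)
            return true;
        if (c == 'g' && strchr("imafd", ext) != NULL)
            return true;                        /* G implies IMAFD */
    }
    return false;
}
```

For example, `isa_has_letter("rv64gc", 'm')` is true because G expands to include M, while `isa_has_letter("rv32imac", 'f')` is false.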

B.12 Extension Detection

Runtime Detection (misa CSR)

csrr t0, misa
andi t1, t0, (1 << 0)   # Check 'A' extension (bit 0)
bnez t1, has_atomic

misa Bit Assignments:

Bit 0: A (Atomic)
Bit 2: C (Compressed)
Bit 3: D (Double-precision FP)
Bit 4: E (Embedded - RV32E)
Bit 5: F (Single-precision FP)
Bit 7: H (Hypervisor)
Bit 8: I (Base integer ISA)
Bit 12: M (Multiply/Divide)
Bit 20: U (User mode)
Bit 21: V (Vector)
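
The bit positions follow the alphabet: the bit index is the extension letter minus 'A'. The `andi` pattern only reaches bits 0 through 10 because its immediate is a 12-bit signed value, so testing M (bit 12), U, or V needs a shift instead. The sketch below decodes a misa value that has already been read in M-mode; `misa_has_ext` and `MISA_EXAMPLE_RV64GC` are illustrative names, and the example value is constructed by hand rather than taken from real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if the single-letter extension `letter` ('A'..'Z') is set
 * in a misa value. Bit position = letter - 'A'. */
static bool misa_has_ext(uint64_t misa, char letter)
{
    if (letter < 'A' || letter > 'Z')
        return false;
    return ((misa >> (letter - 'A')) & 1u) != 0;
}

/* Hand-built example: RV64 (MXL = 2 in the top two bits) with I, M, A, F, D, C. */
static const uint64_t MISA_EXAMPLE_RV64GC =
    (2ULL << 62) |
    (1ULL << 8)  |   /* I */
    (1ULL << 12) |   /* M */
    (1ULL << 0)  |   /* A */
    (1ULL << 5)  |   /* F */
    (1ULL << 3)  |   /* D */
    (1ULL << 2);     /* C */
```

A misa value of zero means the CSR is not implemented, in which case software has to discover extensions another way (for example from the device tree).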

B.13 References

RISC-V ISA Manual: Complete extension specifications
RISC-V Profiles: RVA22, RVA23, RVM23 specifications
Extension Specifications: Individual ratified extension documents

Appendix C. Boot Loader Reference Implementation

Minimal RISC-V Bootloader Example

💡 Usage Guide: This appendix is your “boot disk” for starting projects. When you need to write bare-metal code from scratch, copy templates directly from here.

🚀 Minimal Viable Boot Template (Copy-Paste Ready)

Minimal Linker Script (`link.ld`)

This is the most frequently copy-pasted file in bare-metal projects:

/* link.ld - For QEMU virt machine */
OUTPUT_ARCH(riscv)
ENTRY(_start)

MEMORY {
    RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 128M
}

SECTIONS {
    . = 0x80000000;

    .text : {
        *(.text.boot)       /* Ensure boot code comes first */
        *(.text .text.*)
    } > RAM

    .rodata : {
        *(.rodata .rodata.*)
    } > RAM

    .data : {
        *(.data .data.*)
    } > RAM

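    .bss : {
        . = ALIGN(8);       /* entry.S clears BSS 8 bytes at a time */
        _bss_start = .;
        *(.sbss .sbss.*)    /* small zero-initialized data */
        *(.bss .bss.*)
        *(COMMON)
        . = ALIGN(8);
        _bss_end = .;
    } > RAM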

    . = ALIGN(16);
    . = . + 0x4000;         /* Reserve 16KB Stack */
    _stack_top = .;
}

Minimal Entry Point (`entry.S`)

# entry.S - Minimal boot code
.section .text.boot
.global _start

_start:
    # 1. Set Stack Pointer
    la sp, _stack_top

    # 2. Clear BSS section
    la t0, _bss_start
    la t1, _bss_end
clear_bss:
    bge t0, t1, bss_done
    sd zero, 0(t0)
    addi t0, t0, 8
    j clear_bss
bss_done:

    # 3. Jump to C main
    call main

    # 4. Halt after main returns
halt:
    wfi
    j halt

Minimal Main (`main.c`)

// main.c - Minimal Hello World (UART)
#define UART_BASE 0x10000000  // QEMU virt UART address

void uart_putc(char c) {
    volatile char *uart = (volatile char *)UART_BASE;
    *uart = c;
}

void uart_puts(const char *s) {
    while (*s) uart_putc(*s++);
}

int main(void) {
    uart_puts("Hello, RISC-V!\n");
    return 0;
}

Compile and Run

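# Compile (-mcmodel=medany is needed because the image links at 0x80000000,
# which is outside the range the default medlow code model can address)
riscv64-unknown-elf-gcc -mcmodel=medany -nostdlib -T link.ld \
    -o hello.elf entry.S main.c

# Run (-bios none starts our code directly in M-mode at 0x80000000
# instead of loading the default OpenSBI firmware at the same address)
qemu-system-riscv64 -machine virt -nographic -bios none \
    -kernel hello.elf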

📊 Typical Boot Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Power-On Reset                          │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  ZSBL (Zeroth-Stage Bootloader) - ROM                       │
│  • PC = Reset Vector (0x1000 or implementation-defined)     │
│  • Initialize clock, DRAM Controller                        │
│  • Jump to FSBL                                             │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  FSBL (First-Stage Bootloader) - Flash/ROM                  │
│  • Initialize SPI/SD storage device                         │
│  • Load OpenSBI to DRAM                                     │
│  • Jump to OpenSBI                                          │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  OpenSBI (M-mode Firmware)                                  │
│  • Set up PMP to protect M-mode memory                      │
│  • Initialize SBI services                                  │
│  • Set medeleg/mideleg to delegate traps                    │
│  • Jump to S-mode Kernel                                    │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  Linux Kernel (S-mode)                                      │
│  • Initialize virtual memory                                │
│  • Start init process                                       │
└─────────────────────────────────────────────────────────────┘

This appendix provides a reference implementation of a minimal RISC-V bootloader. This code demonstrates the essential steps required to boot a RISC-V system from reset to loading an operating system. While production bootloaders like U-Boot are much more complex, this example illustrates the core concepts.

C.1 Boot Sequence Overview

Power-On Reset
    ↓
Reset Vector (0x1000 or implementation-defined)
    ↓
ZSBL (Zeroth-Stage Bootloader) - ROM code
    ├─ Initialize clocks
    ├─ Initialize DRAM
    └─ Jump to FSBL
        ↓
FSBL (First-Stage Bootloader) - Flash/ROM
    ├─ Initialize storage (SPI/SD)
    ├─ Load SSBL to DRAM
    ├─ Verify SSBL (optional)
    └─ Jump to SSBL
        ↓
SSBL (Second-Stage Bootloader) - U-Boot/GRUB
    ├─ Initialize devices
    ├─ Load kernel and device tree
    ├─ Set up boot arguments
    └─ Jump to kernel
        ↓
Operating System (Linux/FreeBSD)

C.2 ZSBL: Zeroth-Stage Bootloader

Purpose: Minimal ROM code to initialize DRAM and load FSBL.

Constraints:

Must fit in small on-chip ROM (typically 16-64 KB)
No DRAM available initially
Must run from ROM or tightly-coupled memory (TCM)

ZSBL Entry Point

# zsbl_start.S - ZSBL entry point
# Runs in M-mode immediately after reset

.section .text.init
.global _start

_start:
    # Disable interrupts
    csrw mie, zero
    csrw mip, zero
    
    # Set up trap vector (point to error handler)
    la t0, trap_handler
    csrw mtvec, t0
    
    # Initialize stack pointer
    # Use on-chip SRAM (e.g., 0x08000000 + 16KB)
    la sp, _stack_top
    
    # Clear BSS section
    la t0, _bss_start
    la t1, _bss_end
1:
    bge t0, t1, 2f
    sd zero, 0(t0)
    addi t0, t0, 8
    j 1b
2:
    
    # Jump to C code
    call zsbl_main
    
    # Should never return
    j .

.align 2                # mtvec base must be 4-byte aligned
trap_handler:
    # Minimal trap handler - just hang
    j .

ZSBL Main Function

// zsbl_main.c - ZSBL main logic

#include <stddef.h>  // size_t
#include <stdint.h>

// Hardware addresses (example for SiFive FU540)
#define DRAM_BASE       0x80000000
#define DRAM_SIZE       (8 * 1024 * 1024 * 1024UL)  // 8 GB
#define FSBL_LOAD_ADDR  0x80000000
#define FSBL_SIZE       (128 * 1024)  // 128 KB
#define SPI_FLASH_BASE  0x20000000

// DRAM controller registers (simplified)
#define DRAM_CTRL_BASE  0x10000000
#define DRAM_INIT_REG   (DRAM_CTRL_BASE + 0x00)
#define DRAM_STATUS_REG (DRAM_CTRL_BASE + 0x04)

void dram_init(void) {
    volatile uint32_t *init_reg = (uint32_t *)DRAM_INIT_REG;
    volatile uint32_t *status_reg = (uint32_t *)DRAM_STATUS_REG;
    
    // Trigger DRAM initialization
    *init_reg = 0x1;
    
    // Wait for DRAM ready
    while ((*status_reg & 0x1) == 0) {
        // Busy wait
    }
}

void load_fsbl(void) {
    uint8_t *src = (uint8_t *)SPI_FLASH_BASE;
    uint8_t *dst = (uint8_t *)FSBL_LOAD_ADDR;
    
    // Simple memcpy from SPI flash to DRAM
    for (size_t i = 0; i < FSBL_SIZE; i++) {
        dst[i] = src[i];
    }
}

void zsbl_main(void) {
    // 1. Initialize DRAM
    dram_init();
    
    // 2. Load FSBL from SPI flash to DRAM
    load_fsbl();
    
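    // 3. Jump to FSBL
    // fence.i makes the freshly copied instructions visible to instruction fetch
    __asm__ volatile ("fence.i" ::: "memory");
    void (*fsbl_entry)(void) = (void (*)(void))FSBL_LOAD_ADDR;
    fsbl_entry();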
    
    // Should never reach here
    while (1);
}

C.3 FSBL: First-Stage Bootloader

Purpose: Load second-stage bootloader (U-Boot) from storage.

Features:

Initialize storage controller (SPI, SD, eMMC)
Load SSBL image from storage
Verify SSBL (checksum or signature)
Jump to SSBL

FSBL Main Function

// fsbl_main.c - FSBL main logic

#include <stddef.h>  // size_t
#include <stdint.h>

#define SSBL_LOAD_ADDR  0x80200000  // Load U-Boot at 2MB offset
#define SSBL_SIZE       (512 * 1024)  // 512 KB
#define SSBL_FLASH_OFFSET 0x40000     // Offset in SPI flash

// SPI controller registers (simplified)
#define SPI_BASE        0x10040000
#define SPI_CTRL_REG    (SPI_BASE + 0x00)
#define SPI_DATA_REG    (SPI_BASE + 0x04)
#define SPI_STATUS_REG  (SPI_BASE + 0x08)

void spi_init(void) {
    volatile uint32_t *ctrl_reg = (uint32_t *)SPI_CTRL_REG;
    
    // Configure SPI: 8-bit mode, clock divider = 4
    *ctrl_reg = 0x04;
}

void spi_read(uint32_t offset, uint8_t *buf, size_t len) {
    volatile uint32_t *data_reg = (uint32_t *)SPI_DATA_REG;
    volatile uint32_t *status_reg = (uint32_t *)SPI_STATUS_REG;
    
    // Send read command (0x03) + 24-bit address
    *data_reg = 0x03;
    *data_reg = (offset >> 16) & 0xFF;
    *data_reg = (offset >> 8) & 0xFF;
    *data_reg = offset & 0xFF;
    
    // Read data
    for (size_t i = 0; i < len; i++) {
        // Wait for data ready
        while ((*status_reg & 0x1) == 0);
        buf[i] = *data_reg & 0xFF;
    }
}

uint32_t calculate_checksum(uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

void fsbl_main(void) {
    uint8_t *ssbl_addr = (uint8_t *)SSBL_LOAD_ADDR;
    
    // 1. Initialize SPI controller
    spi_init();
    
    // 2. Load SSBL from SPI flash
    spi_read(SSBL_FLASH_OFFSET, ssbl_addr, SSBL_SIZE);
    
    // 3. Verify SSBL (simple checksum)
    uint32_t *checksum_ptr = (uint32_t *)(ssbl_addr + SSBL_SIZE - 4);
    uint32_t expected_checksum = *checksum_ptr;
    uint32_t actual_checksum = calculate_checksum(ssbl_addr, SSBL_SIZE - 4);
    
    if (actual_checksum != expected_checksum) {
        // Checksum failed - hang
        while (1);
    }
    
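    // 4. Jump to SSBL
    // fence.i makes the freshly loaded instructions visible to instruction fetch
    __asm__ volatile ("fence.i" ::: "memory");
    void (*ssbl_entry)(void) = (void (*)(void))SSBL_LOAD_ADDR;
    ssbl_entry();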
    
    // Should never reach here
    while (1);
}
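
The FSBL expects the last four bytes of the SSBL image to hold the byte sum of everything before them. A build-side sketch of that layout follows, so the verification step can be tried on a host. The names `sum_bytes`, `image_seal`, and `image_verify` are hypothetical, and the checksum is stored in the host's native byte order, which matches the FSBL only when the build host is little-endian like the RISC-V target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same additive checksum the FSBL computes. */
static uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Write the checksum of image[0 .. size-5] into the last 4 bytes. */
static void image_seal(uint8_t *image, size_t size)
{
    if (size < 4)
        return;
    uint32_t sum = sum_bytes(image, size - 4);
    memcpy(image + size - 4, &sum, sizeof sum);
}

/* Recompute the checksum and compare it with the stored trailer. */
static bool image_verify(const uint8_t *image, size_t size)
{
    uint32_t expected;
    if (size < 4)
        return false;
    memcpy(&expected, image + size - 4, sizeof expected);
    return sum_bytes(image, size - 4) == expected;
}
```

An additive checksum catches flipped bytes but not reordered ones (swapping two bytes leaves the sum unchanged), so it guards against a bad flash read rather than tampering. A signature check is the usual choice when the image must be authenticated.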

C.4 Minimal SSBL: Second-Stage Bootloader

Purpose: Load kernel and device tree, set up boot environment.

Features:

Parse device tree
Load kernel image
Set up boot arguments
Jump to kernel in S-mode

SSBL Main Function

// ssbl_main.c - Minimal second-stage bootloader

#include <stdint.h>

#define KERNEL_LOAD_ADDR  0x80400000  // Load kernel at 4MB offset
#define DTB_LOAD_ADDR     0x82000000  // Load device tree at 32MB
#define KERNEL_SIZE       (8 * 1024 * 1024)  // 8 MB
#define DTB_SIZE          (64 * 1024)  // 64 KB

// Boot arguments for kernel
struct boot_args {
    uint64_t hartid;
    uint64_t dtb_addr;
};

void uart_putc(char c) {
    volatile uint32_t *uart_tx = (uint32_t *)0x10010000;
    *uart_tx = c;
}

void uart_puts(const char *s) {
    while (*s) {
        uart_putc(*s++);
    }
}

void load_kernel_and_dtb(void) {
    // In real bootloader, this would load from storage
    // For this example, assume kernel and DTB are already in memory
    uart_puts("Loading kernel...\n");
    // ... load kernel to KERNEL_LOAD_ADDR ...

    uart_puts("Loading device tree...\n");
    // ... load DTB to DTB_LOAD_ADDR ...
}

void jump_to_kernel(uint64_t hartid, uint64_t dtb_addr, uint64_t kernel_addr) {
    // Set up registers for kernel entry
    // a0 = hartid
    // a1 = dtb_addr

    __asm__ volatile (
        "mv a0, %0\n"
        "mv a1, %1\n"
        "jr %2\n"
        :
        : "r"(hartid), "r"(dtb_addr), "r"(kernel_addr)
        : "a0", "a1"
    );
}

void ssbl_main(void) {
    uint64_t hartid;

    // Read hart ID
    __asm__ volatile ("csrr %0, mhartid" : "=r"(hartid));

    uart_puts("SSBL: Second-Stage Bootloader\n");

    // 1. Load kernel and device tree
    load_kernel_and_dtb();

    // 2. Set up boot arguments
    uart_puts("Booting kernel...\n");

    // 3. Jump to kernel
    // Note: this sketch jumps while still in M-mode. A real bootloader would
    // first switch to S-mode (set mstatus.MPP, load mepc, execute mret).
    jump_to_kernel(hartid, DTB_LOAD_ADDR, KERNEL_LOAD_ADDR);

    // Should never reach here
    while (1);
}
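
The SSBL above jumps to the kernel without leaving M-mode. Entering S-mode is done by writing the target privilege into mstatus.MPP (bits 12:11), loading mepc with the entry address, and executing mret. The sketch below separates the bit manipulation, which can be checked on any host, from the target-only part. The names `mstatus_with_mpp` and `enter_s_mode` are illustrative, and PMP configuration and trap delegation, which a real firmware needs before S-mode can run, are omitted.

```c
#include <stdint.h>

#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (3ULL << MSTATUS_MPP_SHIFT)
#define PRV_U 0ULL   /* user mode       */
#define PRV_S 1ULL   /* supervisor mode */
#define PRV_M 3ULL   /* machine mode    */

/* Return `mstatus` with the MPP field replaced by `prv`. */
static uint64_t mstatus_with_mpp(uint64_t mstatus, uint64_t prv)
{
    return (mstatus & ~MSTATUS_MPP_MASK) |
           ((prv << MSTATUS_MPP_SHIFT) & MSTATUS_MPP_MASK);
}

#if defined(__riscv) && (__riscv_xlen == 64)
/* Target-only: drop to S-mode at `entry` with a0 = hartid, a1 = dtb_addr. */
static void enter_s_mode(uint64_t hartid, uint64_t dtb_addr, uint64_t entry)
{
    uint64_t mstatus;
    __asm__ volatile ("csrr %0, mstatus" : "=r"(mstatus));
    mstatus = mstatus_with_mpp(mstatus, PRV_S);

    register uint64_t a0 __asm__("a0") = hartid;
    register uint64_t a1 __asm__("a1") = dtb_addr;
    __asm__ volatile (
        "csrw mstatus, %2\n"
        "csrw mepc, %3\n"
        "mret\n"
        :
        : "r"(a0), "r"(a1), "r"(mstatus), "r"(entry)
        : "memory");
    __builtin_unreachable();
}
#endif
```

In the boot flow shown earlier in this appendix, this transition is handled by OpenSBI, which is why a second-stage loader such as U-Boot normally starts in S-mode already.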

C.5 Linker Script

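Purpose: Define the memory layout for the ROM-resident ZSBL. The FSBL and SSBL run from DRAM, so each needs its own variant of this script with its code placed at its load address (0x80000000 and 0x80200000 in this appendix).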

/* bootloader.ld - Linker script for RISC-V bootloader */

OUTPUT_ARCH("riscv")
ENTRY(_start)

MEMORY
{
    ROM   (rx)  : ORIGIN = 0x00001000, LENGTH = 64K
    SRAM  (rwx) : ORIGIN = 0x08000000, LENGTH = 16K
    DRAM  (rwx) : ORIGIN = 0x80000000, LENGTH = 8192M  /* 8 GB; GNU ld documents only the K and M suffixes */
}

SECTIONS
{
    /* Code section in ROM */
    .text : {
        *(.text.init)
        *(.text*)
    } > ROM

    /* Read-only data in ROM */
    .rodata : {
        *(.rodata*)
    } > ROM

    /* Data section in SRAM */
    .data : {
        _data_start = .;
        *(.data*)
        _data_end = .;
    } > SRAM AT> ROM

    /* BSS section in SRAM */
    .bss : {
        _bss_start = .;
        *(.bss*)
        *(COMMON)
        _bss_end = .;
    } > SRAM

    /* Stack in SRAM */
    .stack : {
        . = ALIGN(16);
        . += 8K;
        _stack_top = .;
    } > SRAM
}
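
Because `.data` is linked to run in SRAM but stored in ROM (`> SRAM AT> ROM`), startup code has to copy it before any C code reads an initialized global. The ZSBL entry code in this appendix only clears BSS. The sketch below shows both steps as plain C so the logic can be checked on a host. The function name `init_sections` is hypothetical, and on target the arguments would be linker symbols; the load address in particular would need a symbol the script does not yet export, for example `_data_load = LOADADDR(.data);`.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy initialized data from its load address (ROM) to its run address
 * (SRAM), then zero the BSS. Byte loops are used because no C library
 * (memcpy/memset) is assumed to be available this early in boot. */
static void init_sections(const uint8_t *data_load,
                          uint8_t *data_start, uint8_t *data_end,
                          uint8_t *bss_start, uint8_t *bss_end)
{
    for (uint8_t *d = data_start; d < data_end; d++)
        *d = *data_load++;

    for (uint8_t *b = bss_start; b < bss_end; b++)
        *b = 0;
}
```

On target, a call would look like `init_sections(_data_load, _data_start, _data_end, _bss_start, _bss_end)` with each symbol declared as `extern uint8_t name[];`, and it must run before any function that touches global variables.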

C.6 Makefile

# Makefile for RISC-V bootloader

CROSS_COMPILE = riscv64-unknown-elf-
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJCOPY = $(CROSS_COMPILE)objcopy

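# Recent GNU toolchains require Zicsr and Zifencei to be named explicitly
# for the CSR and fence.i instructions used by the boot code.
CFLAGS = -march=rv64imac_zicsr_zifencei -mabi=lp64 -mcmodel=medany \
         -O2 -Wall -Wextra -nostdlib -nostartfiles \
         -fno-builtin -fno-common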

LDFLAGS = -T bootloader.ld -nostdlib

ZSBL_OBJS = zsbl_start.o zsbl_main.o
FSBL_OBJS = fsbl_start.o fsbl_main.o
SSBL_OBJS = ssbl_start.o ssbl_main.o

all: zsbl.bin fsbl.bin ssbl.bin

zsbl.elf: $(ZSBL_OBJS)
	$(LD) $(LDFLAGS) -o $@ $^

fsbl.elf: $(FSBL_OBJS)
	$(LD) $(LDFLAGS) -o $@ $^

ssbl.elf: $(SSBL_OBJS)
	$(LD) $(LDFLAGS) -o $@ $^

%.bin: %.elf
	$(OBJCOPY) -O binary $< $@

%.o: %.S
	$(CC) $(CFLAGS) -c -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

clean:
	rm -f *.o *.elf *.bin

.PHONY: all clean

C.7 Common Boot Issues and Solutions

Issue 1: Hart Hangs at Reset

Symptoms: System doesn’t boot, no output.

Possible Causes:

Reset vector pointing to invalid address
ROM not mapped correctly
Clock not initialized

Debug Steps:

# Add debug output at very first instruction
_start:
    li t0, 0x10010000  # UART base
    li t1, 'A'
    sw t1, 0(t0)       # Write 'A' to UART
    # ... rest of code ...

Issue 2: DRAM Initialization Fails

Symptoms: System hangs after DRAM init, or data corruption.

Possible Causes:

Incorrect DRAM controller configuration
Clock frequency mismatch
Timing parameters wrong

Debug Steps:

void dram_test(void) {
    volatile uint32_t *test_addr = (uint32_t *)0x80000000;

    // Write test pattern
    *test_addr = 0xDEADBEEF;

    // Read back
    if (*test_addr != 0xDEADBEEF) {
        uart_puts("DRAM test failed!\n");
        while (1);
    }
}
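
A single pattern at one address misses most wiring and timing faults. A slightly broader sketch follows: a walking-one pass on the first word to exercise each data line, then an address-derived pattern across the whole range to expose address aliasing. The function name `mem_test` and its return convention are hypothetical. On real hardware, data caches can hide DRAM faults, so such a test is normally run with caching disabled or over a range larger than the cache.

```c
#include <stddef.h>
#include <stdint.h>

/* Test `words` 32-bit words starting at `base`.
 * Returns 0 on success, or the 1-based index of the first failing word. */
static size_t mem_test(volatile uint32_t *base, size_t words)
{
    if (words == 0)
        return 0;

    /* Pass 1: walking one on word 0 (one data line high at a time). */
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        base[0] = bit;
        if (base[0] != bit)
            return 1;
    }

    /* Pass 2: write a pattern derived from each index, then read it all
     * back. Two addresses that alias each other will disagree here. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;
    for (size_t i = 0; i < words; i++) {
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            return i + 1;
    }
    return 0;
}
```

The test overwrites the range it checks, so it belongs before anything is loaded into DRAM.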

Issue 3: Bootloader Doesn’t Load

Symptoms: ZSBL runs but FSBL doesn’t start.

Possible Causes:

SPI flash not initialized
Wrong flash offset
Corrupted image

Debug Steps:

void fsbl_main(void) {
    uart_puts("FSBL starting...\n");

    // Verify first few bytes of SSBL
    uint8_t *ssbl = (uint8_t *)SSBL_LOAD_ADDR;
    uart_puts("First bytes: ");
    for (int i = 0; i < 16; i++) {
        uart_puthex(ssbl[i]);
        uart_putc(' ');
    }
    uart_putc('\n');
}
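
The debug code above calls `uart_puthex`, which is not defined anywhere in this appendix. One possible implementation is sketched here. The names `byte_to_hex` and `put_hex_byte` are hypothetical, and the output routine is passed in as a function pointer so the formatting can be tested without a UART.

```c
#include <stdint.h>

/* Convert one byte to two uppercase hex digits (no terminator written). */
static void byte_to_hex(uint8_t value, char out[2])
{
    static const char digits[] = "0123456789ABCDEF";
    out[0] = digits[value >> 4];
    out[1] = digits[value & 0x0F];
}

/* Print one byte as two hex digits through `putc_fn`. */
static void put_hex_byte(uint8_t value, void (*putc_fn)(char))
{
    char hex[2];
    byte_to_hex(value, hex);
    putc_fn(hex[0]);
    putc_fn(hex[1]);
}
```

With the `uart_putc` from this appendix, `uart_puthex(b)` could then be defined as `put_hex_byte(b, uart_putc)`.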

C.8 References

U-Boot Documentation: https://u-boot.readthedocs.io/
OpenSBI Documentation: https://github.com/riscv-software-src/opensbi
RISC-V Boot Flow: RISC-V Platform Specification
Device Tree Specification: https://devicetree.org/

Appendix D. SBI Call Reference

Supervisor Binary Interface (SBI) Quick Reference

💡 Usage Guide: This appendix is your “API manual” for S-mode calling M-mode services. When you forget whether a7 is EID or FID, flip right here.

🎯 SBI Calling Convention Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    S-mode (Kernel/OS)                           │
│                                                                 │
│    ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐             │
│    │   a7    │ │   a6    │ │ a0-a5   │ │  ecall  │             │
│    │  EID    │ │  FID    │ │  Args   │ │ ──────► │             │
│    └─────────┘ └─────────┘ └─────────┘ └─────────┘             │
│       │           │           │                                 │
├───────┼───────────┼───────────┼─────────────────────────────────┤
│       ▼           ▼           ▼                                 │
│    ┌─────────────────────────────────────────┐                 │
│    │         Trap to M-mode (OpenSBI)        │                 │
│    └─────────────────────────────────────────┘                 │
│       │                                                         │
│       ▼                                                         │
│    ┌─────────┐ ┌─────────┐                                     │
│    │   a0    │ │   a1    │                                     │
│    │ Error   │ │ Value   │                                     │
│    │  Code   │ │(Return) │                                     │
│    └─────────┘ └─────────┘                                     │
│                                                                 │
│                    M-mode (OpenSBI Firmware)                    │
└─────────────────────────────────────────────────────────────────┘

Memory Aid: a7 = EID (Which Extension), a6 = FID (Which Function)

📋 Common EID Quick Reference

EID (Hex)	EID (ASCII)	Name	Purpose	Common FID
`0x10`	—	Base	Query SBI version/vendor	0=version, 3=probe Extension
`0x54494D45`	“TIME”	Timer	Set Timer interrupt	0=set_timer
`0x735049`	“sPI”	IPI	Cross-core interrupt	0=send_ipi
`0x52464E43`	“RFNC”	RFENCE	Remote TLB flush	0-6 (various fences)
`0x4442434E`	“DBCN”	Debug Console	Debug output	0=write, 1=read
`0x48534D`	“HSM”	Hart State Mgmt	Start/stop Hart	0=start, 1=stop
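
The hexadecimal EIDs in the table are simply the ASCII codes of the name packed big-endian into an integer, which makes them easy to reconstruct. A small sketch of that packing follows; `sbi_eid_from_name` is a hypothetical helper, not part of any SBI library.

```c
#include <stdint.h>

/* Pack up to four ASCII characters into an EID, first character in the
 * most significant position. Case matters: the IPI extension is "sPI". */
static uint32_t sbi_eid_from_name(const char *name)
{
    uint32_t eid = 0;
    for (int i = 0; i < 4 && name[i] != '\0'; i++)
        eid = (eid << 8) | (uint8_t)name[i];
    return eid;
}
```

For example, `sbi_eid_from_name("TIME")` yields `0x54494D45` and `sbi_eid_from_name("HSM")` yields `0x48534D`. The Base extension (`0x10`) predates this scheme and has no ASCII form.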

🛠️ Common SBI Wrappers (Copy-Paste Ready)

sbi_call Universal Interface

struct sbiret {
    long error;
    long value;
};

static inline struct sbiret sbi_call(long eid, long fid,
    long a0, long a1, long a2, long a3, long a4, long a5)
{
    struct sbiret ret;
    register long r_a0 asm("a0") = a0;
    register long r_a1 asm("a1") = a1;
    register long r_a2 asm("a2") = a2;
    register long r_a3 asm("a3") = a3;
    register long r_a4 asm("a4") = a4;
    register long r_a5 asm("a5") = a5;
    register long r_a6 asm("a6") = fid;
    register long r_a7 asm("a7") = eid;

    asm volatile("ecall"
        : "+r"(r_a0), "+r"(r_a1)
        : "r"(r_a2), "r"(r_a3), "r"(r_a4), "r"(r_a5), "r"(r_a6), "r"(r_a7)
        : "memory");

    ret.error = r_a0;
    ret.value = r_a1;
    return ret;
}

Common Function Wrappers

// 1. Set Timer Interrupt (Most commonly used!)
static inline void sbi_set_timer(uint64_t stime_value) {
    sbi_call(0x54494D45, 0, stime_value, 0, 0, 0, 0, 0);
}

// 2. Write a character (Debug Console)
static inline void sbi_debug_console_write_byte(char c) {
    sbi_call(0x4442434E, 2, c, 0, 0, 0, 0, 0);
}

// 3. Query SBI version
static inline long sbi_get_spec_version(void) {
    struct sbiret ret = sbi_call(0x10, 0, 0, 0, 0, 0, 0, 0);
    return ret.value;
}

// 4. Probe if Extension is supported
static inline long sbi_probe_extension(long eid) {
    struct sbiret ret = sbi_call(0x10, 3, eid, 0, 0, 0, 0, 0);
    return ret.value;  // 0 = not supported, non-0 = supported
}

⚠️ Common Pitfalls

Pitfall 1: EID and FID Order Confused

Symptom: SBI call returns SBI_ERR_NOT_SUPPORTED.

Cause: a7 and a6 are swapped.

// ❌ Wrong: EID and FID swapped
register long a6 asm("a6") = 0x10;        // Should be FID
register long a7 asm("a7") = 0;           // Should be EID

// ✅ Correct: a7=EID, a6=FID
register long a6 asm("a6") = 0;           // FID = 0 (get_spec_version)
register long a7 asm("a7") = 0x10;        // EID = 0x10 (Base Extension)

Pitfall 2: Forgetting to Check Error Code

Symptom: SBI call fails but program continues, causing hard-to-trace subsequent errors.

// ❌ Wrong: Ignoring error code
sbi_set_timer(next_time);

// ✅ Correct: Check error
struct sbiret ret = sbi_call(0x54494D45, 0, next_time, 0, 0, 0, 0, 0);
if (ret.error != 0) {
    panic("sbi_set_timer failed: %ld", ret.error);
}

This appendix provides a comprehensive reference for RISC-V SBI (Supervisor Binary Interface) calls. SBI defines the standard interface between supervisor mode (S-mode) software and machine mode (M-mode) firmware, enabling portable operating systems across different RISC-V platforms.

D.1 SBI Calling Convention

Register Usage

Input Registers:

Register	Purpose
a7	Extension ID (EID)
a6	Function ID (FID)
a0	Parameter 0 / Return value
a1	Parameter 1 / Return value (optional)
a2	Parameter 2
a3	Parameter 3
a4	Parameter 4
a5	Parameter 5

Output Registers:

Register	Purpose
a0	Error code (0 = success, negative = error)
a1	Return value (function-specific)

Preserved Registers: All registers except a0 and a1 are preserved across SBI calls.

Invocation

// S-mode code invokes SBI using ecall instruction
register unsigned long a0 asm("a0") = param0;
register unsigned long a1 asm("a1") = param1;
register unsigned long a6 asm("a6") = function_id;
register unsigned long a7 asm("a7") = extension_id;

asm volatile("ecall"
             : "+r"(a0), "+r"(a1)
             : "r"(a6), "r"(a7)
             : "memory");

// a0 contains error code, a1 contains return value

D.2 SBI Error Codes

Code	Name	Description
0	SBI_SUCCESS	Operation completed successfully
-1	SBI_ERR_FAILED	Operation failed
-2	SBI_ERR_NOT_SUPPORTED	Function not supported
-3	SBI_ERR_INVALID_PARAM	Invalid parameter
-4	SBI_ERR_DENIED	Permission denied
-5	SBI_ERR_INVALID_ADDRESS	Invalid address
-6	SBI_ERR_ALREADY_AVAILABLE	Resource already available
-7	SBI_ERR_ALREADY_STARTED	Already started
-8	SBI_ERR_ALREADY_STOPPED	Already stopped
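
When debugging, printing the symbolic name is more useful than the raw number. A sketch of a lookup built directly from the table above follows; `sbi_strerror` is a hypothetical name, and `"SBI_ERR_UNKNOWN"` is a fallback string invented for this example rather than a code defined by the SBI specification.

```c
/* Map an SBI error code (the value returned in a0) to its name. */
static const char *sbi_strerror(long err)
{
    switch (err) {
    case 0:  return "SBI_SUCCESS";
    case -1: return "SBI_ERR_FAILED";
    case -2: return "SBI_ERR_NOT_SUPPORTED";
    case -3: return "SBI_ERR_INVALID_PARAM";
    case -4: return "SBI_ERR_DENIED";
    case -5: return "SBI_ERR_INVALID_ADDRESS";
    case -6: return "SBI_ERR_ALREADY_AVAILABLE";
    case -7: return "SBI_ERR_ALREADY_STARTED";
    case -8: return "SBI_ERR_ALREADY_STOPPED";
    default: return "SBI_ERR_UNKNOWN";   /* not an SBI-defined name */
    }
}
```

A typical use is in the failure path of a wrapper, for example passing `sbi_strerror(ret.error)` to the kernel's panic or log routine.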

D.3 Base Extension (EID = 0x10)

Purpose: Query SBI implementation details and supported extensions.

D.3.1 Get SBI Specification Version (FID = 0)

Returns: SBI specification version.

a1[30:24]: Major version (bit 31 is reserved and must be 0)
a1[23:0]: Minor version

long sbi_get_spec_version(void) {
    register unsigned long a0 asm("a0");
    register unsigned long a1 asm("a1");
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x10;
    
    asm volatile("ecall" : "=r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
    return a1;  // Version in a1
}
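
The value returned by `sbi_get_spec_version` packs both numbers into one word, so callers usually split it before comparing. A minimal sketch, with hypothetical helper names, that masks the major number to 7 bits because the SBI specification reserves bit 31 as zero:

```c
/* Major version: bits 30:24 of the returned value. */
static unsigned long sbi_spec_major(unsigned long version)
{
    return (version >> 24) & 0x7F;
}

/* Minor version: bits 23:0 of the returned value. */
static unsigned long sbi_spec_minor(unsigned long version)
{
    return version & 0xFFFFFF;
}
```

For example, a return value of `0x02000000` decodes as version 2.0, and `0x01000000` as version 1.0. Checking the version first is a reasonable way to decide whether newer extensions are worth probing at all.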

D.3.2 Get SBI Implementation ID (FID = 1)

Returns: Implementation ID.

ID	Implementation
0	Berkeley Boot Loader (BBL)
1	OpenSBI
2	Xvisor
3	KVM
4	RustSBI
5	Diosix

D.3.3 Get SBI Implementation Version (FID = 2)

Returns: Implementation-specific version number.

D.3.4 Probe SBI Extension (FID = 3)

Parameters:

a0: Extension ID to probe

Returns:

a1: 0 = not available, nonzero (typically 1) = available

long sbi_probe_extension(long extension_id) {
    register unsigned long a0 asm("a0") = extension_id;
    register unsigned long a1 asm("a1");
    register unsigned long a6 asm("a6") = 3;
    register unsigned long a7 asm("a7") = 0x10;
    
    asm volatile("ecall" : "+r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
    return a1;
}

D.3.5 Get Machine Vendor ID (FID = 4)

Returns: mvendorid CSR value.

D.3.6 Get Machine Architecture ID (FID = 5)

Returns: marchid CSR value.

D.3.7 Get Machine Implementation ID (FID = 6)

Returns: mimpid CSR value.

D.4 Timer Extension (EID = 0x54494D45 “TIME”)

Purpose: Program timer interrupts.

D.4.1 Set Timer (FID = 0)

Parameters:

a0: Timer value (absolute time in ticks)

Description: Programs the timer to fire at the specified time. Clears pending timer interrupt.

void sbi_set_timer(uint64_t stime_value) {
    register unsigned long a0 asm("a0") = stime_value;
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x54494D45;
    
    // a1 is overwritten by the SBI return value, so it must be declared as clobbered
    asm volatile("ecall" : "+r"(a0) : "r"(a6), "r"(a7) : "a1", "memory");
}

// Usage: Set timer to fire in 1 second (assuming 10 MHz timebase)
uint64_t current_time;
asm volatile("rdtime %0" : "=r"(current_time));
sbi_set_timer(current_time + 10000000);
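
Hard-coding the tick count ties the code to one timebase. A small conversion helper keeps the frequency as a parameter; on most platforms it is read from the device tree (the `timebase-frequency` property), which is an assumption about the platform rather than something SBI provides. The name `ms_to_ticks` is hypothetical.

```c
#include <stdint.h>

/* Convert milliseconds to timer ticks for a given timebase frequency.
 * The whole and fractional kHz parts are scaled separately so the
 * intermediate product stays small for realistic frequencies. */
static uint64_t ms_to_ticks(uint64_t timebase_hz, uint64_t ms)
{
    return (timebase_hz / 1000) * ms + (timebase_hz % 1000) * ms / 1000;
}
```

With a 10 MHz timebase, `ms_to_ticks(10000000, 1000)` gives the same 10,000,000 ticks used in the example above, and a 10 ms scheduler tick would be `current_time + ms_to_ticks(10000000, 10)`.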

D.5 IPI Extension (EID = 0x735049 “sPI”)

Purpose: Send inter-processor interrupts.

D.5.1 Send IPI (FID = 0)

Parameters:

a0: Hart mask (bitmap of target harts)
a1: Hart mask base (base hart ID for the mask)

Description: Sends supervisor software interrupt to specified harts.

long sbi_send_ipi(unsigned long hart_mask, unsigned long hart_mask_base) {
    register unsigned long a0 asm("a0") = hart_mask;
    register unsigned long a1 asm("a1") = hart_mask_base;
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x735049;

    asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
    return a0;
}

// Usage: Send IPI to harts 1, 2, 3
sbi_send_ipi(0b1110, 0);  // Bits 1, 2, 3 set, base = 0

// Send IPI to hart 65 (bit 1 of mask, base = 64)
sbi_send_ipi(0b10, 64);
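
The mask and base pair can be derived mechanically from a hart ID. A sketch for a single target hart follows, assuming a 64-bit mask (XLEN = 64); `struct hart_target` and `hart_target_for` are hypothetical names.

```c
#include <stdint.h>

/* Mask/base pair as passed in a0/a1 to hart-mask based SBI calls. */
struct hart_target {
    uint64_t mask;   /* bit (hartid - base) set */
    uint64_t base;   /* hartid rounded down to a multiple of 64 */
};

/* Build the pair that selects exactly one hart. */
static struct hart_target hart_target_for(uint64_t hartid)
{
    struct hart_target t;
    t.base = hartid - (hartid % 64);
    t.mask = (uint64_t)1 << (hartid % 64);
    return t;
}
```

For hart 65 this gives base 64 and mask `0b10`, matching the second usage example above. Harts that fall in different 64-hart windows need separate calls, one per window.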

D.6 RFENCE Extension (EID = 0x52464E43 “RFNC”)

Purpose: Remote fence operations for TLB and instruction cache synchronization.

D.6.1 Remote FENCE.I (FID = 0)

Parameters:

a0: Hart mask
a1: Hart mask base

Description: Execute FENCE.I on remote harts.

long sbi_remote_fence_i(unsigned long hart_mask, unsigned long hart_mask_base) {
    register unsigned long a0 asm("a0") = hart_mask;
    register unsigned long a1 asm("a1") = hart_mask_base;
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x52464E43;

    asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
    return a0;
}

D.6.2 Remote SFENCE.VMA (FID = 1)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Start address (virtual address)
a3: Size of the range in bytes

Description: Execute SFENCE.VMA on remote harts for specified address range.

long sbi_remote_sfence_vma(unsigned long hart_mask, unsigned long hart_mask_base,
                           unsigned long start_addr, unsigned long size) {
    register unsigned long a0 asm("a0") = hart_mask;
    register unsigned long a1 asm("a1") = hart_mask_base;
    register unsigned long a2 asm("a2") = start_addr;
    register unsigned long a3 asm("a3") = size;
    register unsigned long a6 asm("a6") = 1;
    register unsigned long a7 asm("a7") = 0x52464E43;

    asm volatile("ecall" : "+r"(a0), "+r"(a1)
                 : "r"(a2), "r"(a3), "r"(a6), "r"(a7) : "memory");
    return a0;
}

// Usage: Flush the TLB for one page on all harts
// (hart_mask_base = -1 selects all available harts; hart_mask is then ignored)
sbi_remote_sfence_vma(0, -1UL, 0x80000000, 4096);  // size is in bytes: one 4 KiB page

D.6.3 Remote SFENCE.VMA with ASID (FID = 2)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Start address
a3: Size
a4: ASID

Description: Execute SFENCE.VMA with ASID on remote harts.

long sbi_remote_sfence_vma_asid(unsigned long hart_mask, unsigned long hart_mask_base,
                                unsigned long start_addr, unsigned long size,
                                unsigned long asid) {
    register unsigned long a0 asm("a0") = hart_mask;
    register unsigned long a1 asm("a1") = hart_mask_base;
    register unsigned long a2 asm("a2") = start_addr;
    register unsigned long a3 asm("a3") = size;
    register unsigned long a4 asm("a4") = asid;
    register unsigned long a6 asm("a6") = 2;
    register unsigned long a7 asm("a7") = 0x52464E43;

    asm volatile("ecall" : "+r"(a0), "+r"(a1)
                 : "r"(a2), "r"(a3), "r"(a4), "r"(a6), "r"(a7) : "memory");
    return a0;
}

D.6.4 Remote HFENCE.GVMA (FID = 3)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Guest physical address
a3: Size

Description: Execute HFENCE.GVMA on remote harts (hypervisor extension).

D.6.5 Remote HFENCE.GVMA with VMID (FID = 4)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Guest physical address
a3: Size
a4: VMID

Description: Execute HFENCE.GVMA with VMID on remote harts.

D.6.6 Remote HFENCE.VVMA (FID = 5)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Guest virtual address
a3: Size

Description: Execute HFENCE.VVMA on remote harts.

D.6.7 Remote HFENCE.VVMA with ASID (FID = 6)

Parameters:

a0: Hart mask
a1: Hart mask base
a2: Guest virtual address
a3: Size
a4: ASID

Description: Execute HFENCE.VVMA with ASID on remote harts.

D.7 Hart State Management Extension (EID = 0x48534D “HSM”)

Purpose: Manage hart lifecycle (start, stop, suspend).

D.7.1 Hart Start (FID = 0)

Parameters:

a0: Hart ID
a1: Start address (physical address)
a2: Opaque parameter (passed to hart in a1)

Returns: SBI_SUCCESS or error code

Description: Start the specified hart at the given address.

long sbi_hart_start(unsigned long hartid, unsigned long start_addr,
                    unsigned long opaque) {
    register unsigned long a0 asm("a0") = hartid;
    register unsigned long a1 asm("a1") = start_addr;
    register unsigned long a2 asm("a2") = opaque;
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x48534D;

    asm volatile("ecall" : "+r"(a0), "+r"(a1)
                 : "r"(a2), "r"(a6), "r"(a7) : "memory");
    return a0;
}

// Usage: Start hart 1 at address 0x80200000
sbi_hart_start(1, 0x80200000, 0);

D.7.2 Hart Stop (FID = 1)

Parameters: None

Returns: Does not return on success

Description: Stop the current hart. Hart enters stopped state.

void sbi_hart_stop(void) {
    register unsigned long a6 asm("a6") = 1;
    register unsigned long a7 asm("a7") = 0x48534D;

    // On failure the call returns with a0/a1 overwritten, so declare them clobbered
    asm volatile("ecall" : : "r"(a6), "r"(a7) : "a0", "a1", "memory");
}

D.7.3 Hart Get Status (FID = 2)

Parameters:

a0: Hart ID

Returns:

a1: Hart status

Hart Status Values:

Value	Status
0	STARTED
1	STOPPED
2	START_PENDING
3	STOP_PENDING
4	SUSPENDED
5	SUSPEND_PENDING
6	RESUME_PENDING

long sbi_hart_get_status(unsigned long hartid) {
    register unsigned long a0 asm("a0") = hartid;
    register unsigned long a1 asm("a1");
    register unsigned long a6 asm("a6") = 2;
    register unsigned long a7 asm("a7") = 0x48534D;

    asm volatile("ecall" : "+r"(a0), "=r"(a1) : "r"(a6), "r"(a7) : "memory");
    return a1;
}
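
When logging hart state during bring-up, it can help to print the status name rather than the raw number. The helper below is a minimal sketch: `hsm_status_name` is a hypothetical name that is not part of the SBI specification, and its strings simply mirror the status table above.

```c
#include <stddef.h>

/* Hypothetical helper: map an HSM status value (see the table above)
   to its name. Returns "UNKNOWN" for values outside the table. */
const char *hsm_status_name(long status) {
    static const char *const names[] = {
        "STARTED", "STOPPED", "START_PENDING", "STOP_PENDING",
        "SUSPENDED", "SUSPEND_PENDING", "RESUME_PENDING",
    };
    if (status < 0 || (size_t)status >= sizeof(names) / sizeof(names[0]))
        return "UNKNOWN";
    return names[status];
}
```

A caller would typically pass the value returned by sbi_hart_get_status() straight into this helper.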

D.7.4 Hart Suspend (FID = 3)

Parameters:

a0: Suspend type
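a1: Resume address (physical address; used only for non-retentive suspend)
a2: Opaque parameter (passed to the hart in a1 when it resumes from non-retentive suspend)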

Suspend Types:

Value	Type
0x00000000	RETENTIVE (retain state, low latency)
0x80000000	NON_RETENTIVE (lose state, save/restore required)

D.8 System Reset Extension (EID = 0x53525354 “SRST”)

Purpose: System-wide reset and shutdown.

D.8.1 System Reset (FID = 0)

Parameters:

a0: Reset type
a1: Reset reason

Returns: Does not return on success

Reset Types:

Value	Type
0x00000000	SHUTDOWN
0x00000001	COLD_REBOOT
0x00000002	WARM_REBOOT

Reset Reasons:

Value	Reason
0x00000000	NO_REASON
0x00000001	SYSTEM_FAILURE

void sbi_system_reset(unsigned long reset_type, unsigned long reset_reason) {
    register unsigned long a0 asm("a0") = reset_type;
    register unsigned long a1 asm("a1") = reset_reason;
    register unsigned long a6 asm("a6") = 0;
    register unsigned long a7 asm("a7") = 0x53525354;

    // a0/a1 are read-write: on failure the SBI returns an error code in a0
    asm volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");

    // Reached only if the request failed (e.g., unsupported reset type)
    while (1) { }
}

// Usage: Reboot the system
#define SBI_RESET_TYPE_COLD_REBOOT 1
#define SBI_RESET_REASON_NO_REASON 0
sbi_system_reset(SBI_RESET_TYPE_COLD_REBOOT, SBI_RESET_REASON_NO_REASON);

D.9 Performance Monitoring Unit Extension (EID = 0x504D55 “PMU”)

Purpose: Configure and read performance counters.

D.9.1 Get Number of Counters (FID = 0)

Returns:

a1: Number of counters

D.9.2 Get Counter Info (FID = 1)

Parameters:

a0: Counter index

Returns:

a1: Counter info

D.9.3 Configure Matching Counters (FID = 2)

Parameters:

a0: Counter index base
a1: Counter mask
a2: Config flags
a3: Event index
a4: Event data

Returns:

a1: Index of the counter that was configured

D.9.4 Start Counters (FID = 3)

Parameters:

a0: Counter index base
a1: Counter mask
a2: Start flags
a3: Initial value

D.9.5 Stop Counters (FID = 4)

Parameters:

a0: Counter index base
a1: Counter mask
a2: Stop flags

D.9.6 Read Firmware Counter (FID = 5)

Parameters:

a0: Counter index

Returns:

a1: Counter value

D.10 Legacy Extensions (Deprecated)

Note: These extensions are deprecated but still widely used for compatibility.

D.10.1 Console Putchar (EID = 0x01)

Parameters:

a0: Character to output

void sbi_console_putchar(int ch) {
    register unsigned long a0 asm("a0") = ch;
    register unsigned long a7 asm("a7") = 0x01;

    asm volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
}

D.10.2 Console Getchar (EID = 0x02)

Returns:

a0: Character read, or -1 if no character available

int sbi_console_getchar(void) {
    register unsigned long a0 asm("a0");
    register unsigned long a7 asm("a7") = 0x02;

    asm volatile("ecall" : "=r"(a0) : "r"(a7) : "memory");
    return a0;
}

D.10.3 Legacy Set Timer (EID = 0x00)

Parameters:

a0: Timer value

Note: Use Timer Extension (0x54494D45) instead.

D.10.4 Legacy Clear IPI (EID = 0x03)

Note: Deprecated. Clear sip.SSIP bit directly.

D.10.5 Legacy Send IPI (EID = 0x04)

Parameters:

a0: Hart mask pointer

Note: Use IPI Extension (0x735049) instead.

D.10.6 Legacy Remote FENCE.I (EID = 0x05)

Parameters:

a0: Hart mask pointer

Note: Use RFENCE Extension (0x52464E43) instead.

D.10.7 Legacy Remote SFENCE.VMA (EID = 0x06)

Parameters:

a0: Hart mask pointer
a1: Start address
a2: Size

Note: Use RFENCE Extension instead.

D.10.8 Legacy Remote SFENCE.VMA with ASID (EID = 0x07)

Parameters:

a0: Hart mask pointer
a1: Start address
a2: Size
a3: ASID

Note: Use RFENCE Extension instead.

D.10.9 Legacy System Shutdown (EID = 0x08)

Note: Use System Reset Extension (0x53525354) instead.

D.11 Extension ID Summary

EID	Name	Description
0x10	BASE	Base extension (version, probe)
0x54494D45	TIME	Timer programming
0x735049	sPI	Inter-processor interrupts
0x52464E43	RFNC	Remote fence operations
0x48534D	HSM	Hart state management
0x53525354	SRST	System reset
0x504D55	PMU	Performance monitoring
0x4442434E	DBCN	Debug console
0x53555350	SUSP	System suspend
0x43505043	CPPC	Collaborative Processor Performance Control
0x4E41434C	NACL	Nested Acceleration
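
Apart from BASE (0x10) and the legacy calls, every EID in this table is the ASCII encoding of the extension name, with the first character in the most significant byte. The sketch below shows that packing; `sbi_eid_from_name` is a hypothetical helper used only for illustration and is not part of any SBI library.

```c
#include <stdint.h>

/* Pack an extension name of up to four ASCII characters into an EID,
   first character in the most significant position ("HSM" -> 0x48534D). */
uint32_t sbi_eid_from_name(const char *name) {
    uint32_t eid = 0;
    for (int i = 0; i < 4 && name[i] != '\0'; i++)
        eid = (eid << 8) | (uint8_t)name[i];
    return eid;
}
```

This makes the table easy to sanity-check: "TIME" packs to 0x54494D45 and the three-character "sPI" packs to 0x735049.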

D.12 Common Usage Patterns

Early Boot Console Output

void early_printk(const char *str) {
    while (*str) {
        if (*str == '\n')
            sbi_console_putchar('\r');
        sbi_console_putchar(*str++);
    }
}

Timer-based Scheduling

void setup_timer_interrupt(uint64_t interval_us) {
    uint64_t current_time;
    asm volatile("rdtime %0" : "=r"(current_time));

    // Assuming 10 MHz timebase (100 ns per tick)
    uint64_t ticks = interval_us * 10;
    sbi_set_timer(current_time + ticks);

    // Enable supervisor timer interrupt
    csr_set(sie, SIE_STIE);
}
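
The 10 MHz timebase above is only an assumption; real platforms usually report the frequency through firmware or the device tree. The following sketch generalizes the conversion. `us_to_ticks` and its `timebase_hz` parameter are hypothetical names introduced here, and how the frequency is discovered is left to the platform.

```c
#include <stdint.h>

/* Convert microseconds to timer ticks for a given timebase frequency.
   Splitting the interval into whole seconds and a remainder keeps the
   intermediate product from overflowing for realistic frequencies. */
uint64_t us_to_ticks(uint64_t interval_us, uint64_t timebase_hz) {
    uint64_t whole_s = interval_us / 1000000u;
    uint64_t rem_us  = interval_us % 1000000u;
    return whole_s * timebase_hz + (rem_us * timebase_hz) / 1000000u;
}
```

With a 10 MHz timebase this reproduces the interval_us * 10 rule used above; with slower timebases, intervals shorter than one tick round down to zero.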

Multi-core Synchronization

void flush_tlb_all_harts(void) {
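    // Flush TLB on all harts: hart_mask_base = -1 selects every available hart
    // (hart_mask is then ignored); size = ~0 requests a full address-space flush
    sbi_remote_sfence_vma(0, -1UL, 0, ~0UL);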
}

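void wake_up_secondary_harts(void) {
    // Assumes hart 0 is the boot hart and hart IDs are contiguous (0..num_harts-1);
    // real platforms should take the hart IDs from the device tree
    for (int i = 1; i < num_harts; i++) {
        sbi_hart_start(i, (unsigned long)secondary_start, 0);
    }
}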

System Shutdown

void system_poweroff(void) {
    sbi_system_reset(0, 0);  // SHUTDOWN, NO_REASON
    while (1);  // Should never reach here
}

void system_reboot(void) {
    sbi_system_reset(1, 0);  // COLD_REBOOT, NO_REASON
    while (1);
}

D.13 References

SBI Specification: https://github.com/riscv-non-isa/riscv-sbi-doc
OpenSBI Documentation: https://github.com/riscv-software-src/opensbi
Linux RISC-V SBI Implementation: arch/riscv/kernel/sbi.c

Appendix E. RISC-V vs ARM Instruction Comparison

Quick Reference for Porting Between RISC-V and ARM

💡 Usage Guide: This appendix is your “translator” for architecture porting. When you need to port ARM code to RISC-V (or vice versa), check the comparison tables here.

This appendix provides a side-by-side comparison of common instructions between RISC-V and ARM (ARMv8-A AArch64). This reference is designed to help developers porting code between the two architectures.

E.1 Arithmetic and Logical Instructions

Integer Arithmetic

Operation	RISC-V	ARM
Add	`add rd, rs1, rs2`	`ADD Xd, Xn, Xm`
Add immediate	`addi rd, rs1, imm`	`ADD Xd, Xn, #imm`
Subtract	`sub rd, rs1, rs2`	`SUB Xd, Xn, Xm`
Subtract immediate	`addi rd, rs1, -imm`	`SUB Xd, Xn, #imm`
Negate	`sub rd, x0, rs`	`NEG Xd, Xm`
Multiply	`mul rd, rs1, rs2`	`MUL Xd, Xn, Xm`
Multiply high (signed)	`mulh rd, rs1, rs2`	`SMULH Xd, Xn, Xm`
Multiply high (unsigned)	`mulhu rd, rs1, rs2`	`UMULH Xd, Xn, Xm`
Divide (signed)	`div rd, rs1, rs2`	`SDIV Xd, Xn, Xm`
Divide (unsigned)	`divu rd, rs1, rs2`	`UDIV Xd, Xn, Xm`
Remainder (signed)	`rem rd, rs1, rs2`	No direct equivalent (use MSUB)
Remainder (unsigned)	`remu rd, rs1, rs2`	No direct equivalent (use MSUB)

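Note: ARM has no remainder instruction. Compute the quotient first, then use MSUB Xd, Xn, Xm, Xa (Xd = Xa - Xn * Xm). For example, SDIV Xt, Xn, Xm followed by MSUB Xd, Xt, Xm, Xn yields Xd = Xn - (Xn / Xm) * Xm. Use UDIV instead of SDIV for the unsigned remainder.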
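
To make the divide-then-multiply-subtract idea concrete, here is a small C model of the AArch64 sequence. The function names are hypothetical and the code is only a sketch of the arithmetic, not of any compiler's actual output; it assumes a non-zero divisor and excludes the INT64_MIN / -1 overflow case.

```c
#include <stdint.h>

/* MSUB semantics: result = addend - (a * b). */
static int64_t msub(int64_t a, int64_t b, int64_t addend) {
    return addend - a * b;
}

/* Signed remainder computed the AArch64 way: SDIV, then MSUB.
   Assumes d != 0 and not (n == INT64_MIN && d == -1). */
int64_t rem_via_msub(int64_t n, int64_t d) {
    int64_t q = n / d;        /* SDIV Xt, Xn, Xm (truncates toward zero) */
    return msub(q, d, n);     /* MSUB Xd, Xt, Xm, Xn */
}
```

Because the quotient truncates toward zero, the result takes the sign of the dividend, which matches the behavior of the RISC-V rem instruction for these inputs.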

Logical Operations

Operation	RISC-V	ARM
AND	`and rd, rs1, rs2`	`AND Xd, Xn, Xm`
AND immediate	`andi rd, rs1, imm`	`AND Xd, Xn, #imm`
OR	`or rd, rs1, rs2`	`ORR Xd, Xn, Xm`
OR immediate	`ori rd, rs1, imm`	`ORR Xd, Xn, #imm`
XOR	`xor rd, rs1, rs2`	`EOR Xd, Xn, Xm`
XOR immediate	`xori rd, rs1, imm`	`EOR Xd, Xn, #imm`
NOT	`xori rd, rs, -1`	`MVN Xd, Xm`
AND NOT	`andn rd, rs1, rs2` (Zbb)	`BIC Xd, Xn, Xm`
OR NOT	`orn rd, rs1, rs2` (Zbb)	`ORN Xd, Xn, Xm`

Shift Operations

Operation	RISC-V	ARM
Shift left logical	`sll rd, rs1, rs2`	`LSL Xd, Xn, Xm`
Shift left immediate	`slli rd, rs1, shamt`	`LSL Xd, Xn, #imm`
Shift right logical	`srl rd, rs1, rs2`	`LSR Xd, Xn, Xm`
Shift right immediate	`srli rd, rs1, shamt`	`LSR Xd, Xn, #imm`
Shift right arithmetic	`sra rd, rs1, rs2`	`ASR Xd, Xn, Xm`
Shift right arith imm	`srai rd, rs1, shamt`	`ASR Xd, Xn, #imm`
Rotate right	`ror rd, rs1, rs2` (Zbb)	`ROR Xd, Xn, Xm`
Rotate right immediate	`rori rd, rs1, shamt` (Zbb)	`ROR Xd, Xn, #imm`

E.2 Load and Store Instructions

Basic Loads

Operation	RISC-V	ARM
Load byte (signed)	`lb rd, offset(rs1)`	`LDRSB Xd, [Xn, #offset]`
Load byte (unsigned)	`lbu rd, offset(rs1)`	`LDRB Wd, [Xn, #offset]`
Load halfword (signed)	`lh rd, offset(rs1)`	`LDRSH Xd, [Xn, #offset]`
Load halfword (unsigned)	`lhu rd, offset(rs1)`	`LDRH Wd, [Xn, #offset]`
Load word (signed)	`lw rd, offset(rs1)`	`LDRSW Xd, [Xn, #offset]`
Load word (unsigned)	`lwu rd, offset(rs1)`	`LDR Wd, [Xn, #offset]`
Load doubleword	`ld rd, offset(rs1)`	`LDR Xd, [Xn, #offset]`

Basic Stores

Operation	RISC-V	ARM
Store byte	`sb rs2, offset(rs1)`	`STRB Wd, [Xn, #offset]`
Store halfword	`sh rs2, offset(rs1)`	`STRH Wd, [Xn, #offset]`
Store word	`sw rs2, offset(rs1)`	`STR Wd, [Xn, #offset]`
Store doubleword	`sd rs2, offset(rs1)`	`STR Xd, [Xn, #offset]`

Addressing Modes

RISC-V: Only base+offset

lw t0, 8(sp)      # Load from sp + 8

ARM: Multiple modes

LDR X0, [SP, #8]       // Base + offset
LDR X0, [SP, #8]!      // Pre-indexed (update SP, then load)
LDR X0, [SP], #8       // Post-indexed (load, then update SP)
LDR X0, [SP, X1]       // Base + register
LDR X0, [SP, X1, LSL #3]  // Base + shifted register

Porting Note: RISC-V requires separate add/sub for pre/post-indexed addressing:

# ARM: LDR X0, [SP], #8
# RISC-V equivalent:
ld t0, 0(sp)
addi sp, sp, 8

E.3 Branch and Jump Instructions

Conditional Branches

Operation	RISC-V	ARM
Branch if equal	`beq rs1, rs2, label`	`CMP Xn, Xm` + `B.EQ label`
Branch if not equal	`bne rs1, rs2, label`	`CMP Xn, Xm` + `B.NE label`
Branch if less than	`blt rs1, rs2, label`	`CMP Xn, Xm` + `B.LT label`
Branch if >= (signed)	`bge rs1, rs2, label`	`CMP Xn, Xm` + `B.GE label`
Branch if < (unsigned)	`bltu rs1, rs2, label`	`CMP Xn, Xm` + `B.LO label`
Branch if >= (unsigned)	`bgeu rs1, rs2, label`	`CMP Xn, Xm` + `B.HS label`

Key Difference: RISC-V compares and branches in one instruction. ARM requires separate compare.

Unconditional Jumps

Operation	RISC-V	ARM
Jump	`jal x0, label` or `j label`	`B label`
Jump and link	`jal ra, label`	`BL label`
Jump register	`jalr x0, 0(rs1)` or `jr rs1`	`BR Xn`
Jump and link register	`jalr ra, 0(rs1)`	`BLR Xn`
Return	`jalr x0, 0(ra)` or `ret`	`RET`

E.4 Compare and Set Instructions

Comparisons

Operation	RISC-V	ARM
Set if less than	`slt rd, rs1, rs2`	`CMP Xn, Xm` + `CSET Xd, LT`
Set if less (unsigned)	`sltu rd, rs1, rs2`	`CMP Xn, Xm` + `CSET Xd, LO`
Set if less than imm	`slti rd, rs1, imm`	`CMP Xn, #imm` + `CSET Xd, LT`
Set if less imm (uns)	`sltiu rd, rs1, imm`	`CMP Xn, #imm` + `CSET Xd, LO`

ARM Condition Codes:

RISC-V	ARM Condition
`beq`	`B.EQ` (equal)
`bne`	`B.NE` (not equal)
`blt`	`B.LT` (less than, signed)
`bge`	`B.GE` (greater or equal, signed)
`bltu`	`B.LO` (lower, unsigned)
`bgeu`	`B.HS` (higher or same, unsigned)

E.5 Atomic Instructions

Load-Reserved / Store-Conditional

Operation	RISC-V	ARM
Load-reserved word	`lr.w rd, (rs1)`	`LDXR Wd, [Xn]`
Load-reserved dword	`lr.d rd, (rs1)`	`LDXR Xd, [Xn]`
Store-conditional word	`sc.w rd, rs2, (rs1)`	`STXR Ws, Wd, [Xn]`
Store-conditional dword	`sc.d rd, rs2, (rs1)`	`STXR Ws, Xd, [Xn]`

Example: Atomic Increment

RISC-V:

retry:
    lr.w t0, (a0)
    addi t0, t0, 1
    sc.w t1, t0, (a0)
    bnez t1, retry

ARM:

retry:
    LDXR W0, [X1]
    ADD W0, W0, #1
    STXR W2, W0, [X1]
    CBNZ W2, retry

Atomic Memory Operations (AMO)

Operation	RISC-V	ARM
Atomic swap	`amoswap.w rd, rs2, (rs1)`	`SWP Ws, Wt, [Xn]`
Atomic add	`amoadd.w rd, rs2, (rs1)`	`LDADD Ws, Wt, [Xn]`
Atomic AND	`amoand.w rd, rs2, (rs1)`	`LDCLR Ws, Wt, [Xn]` (inverted)
Atomic OR	`amoor.w rd, rs2, (rs1)`	`LDSET Ws, Wt, [Xn]`
Atomic XOR	`amoxor.w rd, rs2, (rs1)`	`LDEOR Ws, Wt, [Xn]`
Atomic max (signed)	`amomax.w rd, rs2, (rs1)`	`LDSMAX Ws, Wt, [Xn]`
Atomic max (unsigned)	`amomaxu.w rd, rs2, (rs1)`	`LDUMAX Ws, Wt, [Xn]`
Atomic min (signed)	`amomin.w rd, rs2, (rs1)`	`LDSMIN Ws, Wt, [Xn]`
Atomic min (unsigned)	`amominu.w rd, rs2, (rs1)`	`LDUMIN Ws, Wt, [Xn]`

Ordering Annotations:

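RISC-V: .aq (acquire), .rl (release), .aqrl (both)
ARM: LDADD (no ordering) vs LDADDA (acquire) vs LDADDL (release) vs LDADDAL (both). These single-instruction atomics require the LSE extension (ARMv8.1-A); older cores must use an LDXR/STXR loop.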

E.6 Memory Barriers

Operation	RISC-V	ARM
Full fence	`fence rw, rw`	`DMB SY`
Read fence	`fence r, r`	`DMB LD`
Write fence	`fence w, w`	`DMB ST`
Acquire fence	`fence r, rw`	`DMB LD`
Release fence	`fence rw, w`	`DMB ISH` (`DMB ST` orders only stores; or use `STLR`)
Instruction fence	`fence.i`	`ISB` (after `IC IVAU` + `DSB` cache maintenance)
TLB fence	`sfence.vma`	`TLBI` + `DSB` + `ISB`

RISC-V FENCE Format: fence pred, succ

pred: Predecessor operations (r=read, w=write, rw=both)
succ: Successor operations (r=read, w=write, rw=both)

ARM Barrier Types:

SY: Full system
ST: Store only
LD: Load only
ISH: Inner shareable
OSH: Outer shareable

E.7 System Instructions

CSR / System Register Access

Operation	RISC-V	ARM
Read CSR/sysreg	`csrr rd, csr`	`MRS Xd, sysreg`
Write CSR/sysreg	`csrw csr, rs`	`MSR sysreg, Xn`
Atomic swap (read old, write new)	`csrrw rd, csr, rs`	`MRS` + `MSR`
Set bits	`csrrs rd, csr, rs`	`MRS` + `ORR` + `MSR`
Clear bits	`csrrc rd, csr, rs`	`MRS` + `BIC` + `MSR`

Exception and Privilege

Operation	RISC-V	ARM
System call	`ecall`	`SVC #imm`
Breakpoint	`ebreak`	`BRK #imm`
Return from exception	`mret` / `sret`	`ERET`
Wait for interrupt	`wfi`	`WFI`
Supervisor call	`ecall` (from U-mode)	`SVC #imm`
Hypervisor call	`ecall` (from VS-mode)	`HVC #imm`

E.8 Bit Manipulation (Zbb vs ARM)

Operation	RISC-V (Zbb)	ARM
Count leading zeros	`clz rd, rs`	`CLZ Xd, Xn`
Count trailing zeros	`ctz rd, rs`	No direct (use RBIT + CLZ)
Count population	`cpop rd, rs`	No direct (use CNT in NEON)
Byte reverse	`rev8 rd, rs`	`REV Xd, Xn`
Sign-extend byte	`sext.b rd, rs`	`SXTB Xd, Wn`
Sign-extend halfword	`sext.h rd, rs`	`SXTH Xd, Wn`
Zero-extend halfword	`zext.h rd, rs`	`UXTH Wd, Wn`
Min (signed)	`min rd, rs1, rs2`	No direct (use CMP + CSEL)
Max (signed)	`max rd, rs1, rs2`	No direct (use CMP + CSEL)
Rotate right	`ror rd, rs1, rs2`	`ROR Xd, Xn, Xm`
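
The RBIT + CLZ substitution for a trailing-zero count can be checked with a small C model. The helpers below are hypothetical, written with plain loops so that each one mirrors a single instruction; they are a sketch of the idea rather than production code.

```c
#include <stdint.h>

/* Reverse the bit order of a 64-bit value (what RBIT Xd, Xn does). */
static uint64_t rbit64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Count leading zeros (what CLZ Xd, Xn does); returns 64 for x == 0. */
static unsigned clz64(uint64_t x) {
    unsigned n = 0;
    for (uint64_t mask = 1ULL << 63; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* Count trailing zeros as RBIT followed by CLZ; returns 64 for x == 0. */
unsigned ctz64_via_rbit_clz(uint64_t x) {
    return clz64(rbit64(x));
}
```

Reversing the bits turns trailing zeros into leading zeros, so the two-instruction sequence gives the same answer as a single ctz, including the all-zero input.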

E.9 Calling Convention (ABI)

Register Usage

Purpose	RISC-V	ARM
Arguments	a0-a7 (x10-x17)	X0-X7
Return value	a0-a1 (x10-x11)	X0-X1
Saved registers	s0-s11 (x8-x9, x18-x27)	X19-X28
Temporary registers	t0-t6 (x5-x7, x28-x31)	X9-X15
Stack pointer	sp (x2)	SP
Frame pointer	fp/s0 (x8)	X29 (FP)
Return address	ra (x1)	X30 (LR)
Zero register	x0 (zero)	XZR (register encoding 31, shared with SP)

Function Prologue/Epilogue

RISC-V:

function:
    addi sp, sp, -16
    sd ra, 8(sp)
    sd s0, 0(sp)
    # ... function body ...
    ld s0, 0(sp)
    ld ra, 8(sp)
    addi sp, sp, 16
    ret

ARM:

function:
    STP X29, X30, [SP, #-16]!
    MOV X29, SP
    // ... function body ...
    LDP X29, X30, [SP], #16
    RET

Key Differences:

ARM has STP/LDP (store/load pair) for efficient stack operations
RISC-V uses separate sd/ld instructions
ARM uses X30 (LR) for return address; RISC-V uses x1 (ra)

E.10 Common Code Patterns

Loop Example

RISC-V:

    li t0, 0          # i = 0
    li t1, 10         # limit = 10
loop:
    # ... loop body ...
    addi t0, t0, 1    # i++
    blt t0, t1, loop  # if (i < 10) goto loop

ARM:

    MOV X0, #0        // i = 0
    MOV X1, #10       // limit = 10
loop:
    // ... loop body ...
    ADD X0, X0, #1    // i++
    CMP X0, X1
    B.LT loop         // if (i < 10) goto loop

Switch Statement

RISC-V:

    # Assume a0 = switch value (RV64: 8-byte table entries)
    li t0, 3
    bgtu a0, t0, default
    slli t0, a0, 3    # t0 = a0 * 8
    la t1, jump_table
    add t0, t0, t1
    ld t0, 0(t0)      # Load the 64-bit case address
    jr t0

jump_table:
    .dword case0
    .dword case1
    .dword case2
    .dword case3

ARM:

    // Assume X0 = switch value
    CMP X0, #3
    B.HI default
    ADR X1, jump_table
    LDR X2, [X1, X0, LSL #3]
    BR X2

jump_table:
    .quad case0
    .quad case1
    .quad case2
    .quad case3

E.11 Porting Checklist

Syntax Differences

Aspect	RISC-V	ARM
Register prefix	`x`, `f`, `a`, `t`, `s`	`X`, `W`, `V`, `Q`
Immediate prefix	None	`#`
Memory syntax	`offset(base)`	`[base, #offset]`
Comment	`#`	`//` or `;`
Directive prefix	`.`	`.`

Common Pitfalls

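Zero Register: RISC-V x0 is an ordinary register number that always reads as zero; ARM XZR shares encoding 31 with SP, so whether 31 means XZR or SP depends on the instruction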
Stack Pointer: RISC-V sp is x2; ARM SP is separate
Return Address: RISC-V stores in ra; ARM uses LR (X30)
Addressing Modes: ARM has more complex modes (pre/post-indexed)
Conditional Execution: AArch64 has conditional select and compare instructions (CSEL, CSINC, CCMP) but no general predication; base RISC-V uses branches
Remainder: RISC-V has rem/remu; ARM requires division + multiply-subtract

Performance Considerations

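Code Density: AArch64 uses only fixed 32-bit instructions (Thumb-2 exists only in AArch32); RISC-V offers 16-bit compressed instructions (C extension)
Instruction Fusion: Both can benefit from macro-op fusion (implementation-dependent)
Branch Prediction: Similar capabilities (implementation-dependent)
Memory Ordering: RISC-V RVWMO and the ARMv8 memory model are both weakly ordered and permit broadly similar reordering; port barriers explicitly instead of relying on default ordering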

E.12 References

RISC-V ISA Manual: https://riscv.org/technical/specifications/
ARM Architecture Reference Manual: ARMv8-A
RISC-V ABI Specification: https://github.com/riscv-non-isa/riscv-elf-psabi-doc
ARM Procedure Call Standard: AAPCS64

Appendix F. Memory Model Quick Reference

RISC-V Weak Memory Ordering (RVWMO) Quick Reference

💡 Usage Guide: This appendix is your “safety manual” for multi-core synchronization. When you encounter mysterious bugs in lock-free code, check Memory Ordering here first.

🔄 Producer-Consumer Synchronization Pattern (Copy-Paste Ready)

This is the classic multi-core synchronization pattern. It guarantees that the Consumer sees all of the data written by the Producer.

Producer (Core 0) - Write Side

# Producer: Write data, then set Flag
# s0 = data address, s1 = Flag address, t0 = data, t1 = Flag value

    sw      t0, 0(s0)       # 1. Write Data
    fence   w, w            # 2. Store-Store Fence: Ensure data written first
    sw      t1, 0(s1)       # 3. Write Flag (Ready = 1)

Explanation: fence w,w ensures “data write” is visible to other cores before “Flag write”.

Consumer (Core 1) - Read Side

# Consumer: Wait for Flag, then read data
# s0 = data address, s1 = Flag address

wait_flag:
    lw      t1, 0(s1)       # 1. Read Flag
    beqz    t1, wait_flag   #    Wait for Flag to become Ready
    fence   r, r            # 2. Load-Load Fence: Ensure Flag seen before reading Data
    lw      t0, 0(s0)       # 3. Read Data

Explanation: fence r,r ensures “Flag read” completes before “Data read”.

Complete C Example

// Shared variables
volatile int data = 0;
volatile int flag = 0;

// Producer (Core 0)
void producer(void) {
    data = 42;                          // Write data
    asm volatile ("fence w, w" ::: "memory");  // Store-Store Fence
    flag = 1;                           // Set Flag
}

// Consumer (Core 1)
int consumer(void) {
    while (flag == 0) { }               // Wait for Flag
    asm volatile ("fence r, r" ::: "memory");  // Load-Load Fence
    return data;                        // Read data (guaranteed to be 42)
}
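
The same pattern can also be written with C11 atomics, which avoids inline assembly and lets the compiler choose the fences. The sketch below is one possible portable version; `payload`, `ready`, `producer_c11`, and `consumer_c11` are hypothetical names. On RISC-V, compilers typically lower the release store to a fence followed by a store and the acquire load to a load followed by a fence, but the exact instructions depend on the toolchain.

```c
#include <stdatomic.h>

/* Shared variables: the payload is a plain int, the flag is atomic. */
static int payload = 0;
static atomic_int ready = 0;

/* Producer: write the payload, then publish it with a release store. */
void producer_c11(int value) {
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: spin on an acquire load, then read the payload. */
int consumer_c11(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    return payload;
}
```

The release/acquire pair plays the role of the two explicit fences: everything written before the release store is visible to a thread that observes the flag through the acquire load.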

📋 FENCE Usage Quick Reference

Scenario	FENCE Type	Description
Publish Data	`fence w, w`	Ensure data visible before Flag
Consume Data	`fence r, r`	Ensure Flag read before data
Release Lock	`fence rw, w`	Ensure Critical Section ops complete before Unlock
Acquire Lock	`fence r, rw`	Ensure ops after Lock don’t execute early
Full Barrier	`fence rw, rw`	Strongest Fence, no ops can cross
Self-Modify Code	`fence.i`	After modifying instructions, flush I-cache

🔐 Spinlock Example (Using Atomics)

# acquire_lock: Use amoswap.w.aq to acquire lock
# a0 = lock address, t0 = 1 (locked), t1 = result
acquire_lock:
    li      t0, 1
retry:
    amoswap.w.aq t1, t0, (a0)   # Atomic swap with Acquire
    bnez    t1, retry           # If was 1 (locked), retry
    ret                         # Successfully acquired lock

# release_lock: Use amoswap.w.rl to release lock
# a0 = lock address
release_lock:
    amoswap.w.rl zero, zero, (a0)  # Atomic write 0 with Release
    ret

Explanation:

.aq (Acquire): Subsequent ops won’t be moved before Lock
.rl (Release): Previous ops won’t be moved after Unlock
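
For C code, the same lock can be sketched with C11 atomics. The type and function names below (`spinlock_t`, `spin_trylock`, `spin_lock`, `spin_unlock`) are hypothetical. An acquire exchange corresponds to the amoswap.w.aq above and a release store corresponds to the unlock, although the instructions a given compiler emits may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

/* One acquisition attempt: swap in 1 and report whether the lock was free. */
bool spin_trylock(spinlock_t *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is acquired. */
void spin_lock(spinlock_t *l) {
    while (!spin_trylock(l)) { }
}

/* Release the lock; earlier accesses cannot move past this store. */
void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A real kernel lock would usually add a back-off or a read-only spin between attempts to reduce cache-line contention; that is omitted here to keep the correspondence with the assembly obvious.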

⚠️ Common Pitfalls

Pitfall 1: Thinking volatile Is Enough

Misconception: C’s volatile guarantees Memory Ordering.

Truth: volatile only stops the compiler from optimizing away or reordering the volatile accesses themselves; it does not constrain how the CPU orders memory operations.

// ❌ Wrong: Only volatile, may read stale data on multi-core
volatile int data = 0;
volatile int flag = 0;

// Producer
data = 42;
flag = 1;  // CPU may reorder so flag is visible first!

// ✅ Correct: Add fence
data = 42;
asm volatile ("fence w, w" ::: "memory");
flag = 1;

Pitfall 2: Fence in Wrong Position

Symptom: Spinlock looks correct, but still has Race Condition.

// ❌ Wrong: fence after unlock
critical_section();
unlock();
asm volatile ("fence rw, w" ::: "memory");  // Too late!

// ✅ Correct: fence before unlock (or use .rl)
critical_section();
asm volatile ("fence rw, w" ::: "memory");
unlock();

Pitfall 3: Forgetting fence.i

Symptom: JIT or Self-modifying code executes old instructions.

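Cause: After instructions are modified, the I-cache may still hold the old content. FENCE.I synchronizes only the hart that executes it; other harts need their own FENCE.I (for example, via the SBI remote FENCE.I call).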

// ❌ Wrong: No fence.i after code modification
memcpy(code_buffer, new_code, size);
((void (*)(void))code_buffer)();  // May execute old instructions!

// ✅ Correct: fence.i after modification
memcpy(code_buffer, new_code, size);
asm volatile ("fence.i" ::: "memory");
((void (*)(void))code_buffer)();  // Now executes new instructions

This appendix provides a quick reference for RISC-V’s memory model (RVWMO). Understanding memory ordering is essential for writing correct concurrent code on RISC-V.

F.1 Memory Ordering Basics

What Can Be Reordered?

RISC-V Weak Memory Ordering (RVWMO) allows extensive reordering:

Reordering	Allowed?	Exception
Load → Load	✓ Yes	Same address, or FENCE
Load → Store	✓ Yes	Same address, or FENCE
Store → Store	✓ Yes	Same address, or FENCE
Store → Load	✓ Yes	Same address, or FENCE

Key Point: Almost everything can be reordered unless:

Operations access the same address (overlapping)
Operations are separated by a FENCE instruction
Operations have data/control dependencies
Operations use acquire/release atomics

Preserved Program Order (PPO)

Preserved Program Order is the subset of program order that MUST be respected:

Overlapping addresses: SW to X, then LW from X → always in order
Explicit fences: Operations separated by FENCE → always in order
Acquire/Release: Atomic operations with .aq or .rl → enforce ordering
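Dependencies: Address and data dependencies (e.g., LW, then a memory access that uses the loaded value as its address or store data) → always in order
Control dependencies: A store that is control-dependent on an earlier load is ordered after that load; a control-dependent load is not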

F.2 FENCE Instruction Reference

FENCE Syntax

fence pred, succ

pred (predecessor): Operations before fence (r, w, or rw)
succ (successor): Operations after fence (r, w, or rw)
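
The pred and succ sets are small bit fields inside the instruction word, which is why any combination of r and w (plus i and o for device I/O) can be expressed. As a sketch of that layout, the hypothetical `encode_fence` helper below builds the 32-bit encoding of a normal FENCE (fm = 0); it is meant for understanding disassembly, not as a replacement for the assembler.

```c
#include <stdint.h>

/* Access-type bits used in the pred and succ fields of FENCE. */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };

/* Encode "fence pred, succ": opcode MISC-MEM (0x0F), rd = rs1 = 0,
   funct3 = 0, succ in bits 23:20, pred in bits 27:24, fm = 0. */
uint32_t encode_fence(unsigned pred, unsigned succ) {
    return ((uint32_t)(pred & 0xF) << 24) | ((uint32_t)(succ & 0xF) << 20) | 0x0Fu;
}
```

For instance, `encode_fence(FENCE_R | FENCE_W, FENCE_R | FENCE_W)` gives 0x0330000F, the word a disassembler shows for fence rw, rw.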

Common FENCE Variants

FENCE	Meaning	Use Case
`fence rw, rw`	Full fence	Strongest barrier, orders everything
`fence w, w`	Store-store fence	Ensure stores visible in order
`fence r, r`	Load-load fence	Ensure loads happen in order
`fence r, rw`	Acquire fence	After acquiring lock
`fence rw, w`	Release fence	Before releasing lock
`fence.i`	Instruction fence	After code modification (JIT, self-modifying code)
`fence.tso`	TSO fence	Orders everything except earlier stores against later loads (TSO-style ordering)

FENCE Examples

Full Fence (strongest):

sw a0, 0(s0)      # Store 1
fence rw, rw      # Full fence
lw t0, 0(s1)      # Load 1

All operations before fence complete before any operation after fence.

Store-Store Fence (publish pattern):

sw a0, 0(s0)      # Write data
fence w, w        # Ensure data written first
sw a1, 0(s1)      # Write flag

Ensures stores become visible in order.

Load-Load Fence (consume pattern):

lw t0, 0(s1)      # Read flag
fence r, r        # Ensure flag read first
lw t1, 0(s0)      # Read data

Ensures loads happen in order.

Acquire Fence (after lock acquisition):

lr.w.aq t0, (a0)  # Acquire lock (with .aq)
# OR
lw t0, 0(a0)      # Read lock
fence r, rw       # Acquire fence
# ... critical section ...

Prevents operations in critical section from moving before lock acquisition.

Release Fence (before lock release):

# ... critical section ...
fence rw, w       # Release fence
sw zero, 0(a0)    # Release lock

Prevents operations in critical section from moving after lock release.

F.3 Atomic Instructions

Load-Reserved / Store-Conditional (LR/SC)

Syntax:

lr.w rd, (rs1)           # Load-reserved word
lr.d rd, (rs1)           # Load-reserved doubleword
sc.w rd, rs2, (rs1)      # Store-conditional word
sc.d rd, rs2, (rs1)      # Store-conditional doubleword

Ordering Annotations:

.aq (acquire): No later operations can move before this
.rl (release): No earlier operations can move after this
.aqrl (both): Full ordering

Example: Atomic Increment

retry:
    lr.w t0, (a0)         # Load current value
    addi t0, t0, 1        # Increment
    sc.w t1, t0, (a0)     # Try to store
    bnez t1, retry        # Retry if failed (t1 != 0)
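
In C, the closest portable equivalent of this retry loop is a compare-and-swap loop. The sketch below uses a hypothetical `atomic_increment` function; toolchains targeting the A extension commonly expand a weak compare-exchange into an LR/SC sequence, although that lowering is a compiler decision and not guaranteed.

```c
#include <stdatomic.h>

/* Atomically increment *counter and return the previous value. */
int atomic_increment(atomic_int *counter) {
    int old = atomic_load_explicit(counter, memory_order_relaxed);
    /* On failure, 'old' is refreshed with the current value and the loop
       retries, mirroring the bnez-retry branch of the LR/SC sequence. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
    }
    return old;
}
```

For a plain increment, atomic_fetch_add is simpler and can map to a single AMO; the CAS form is shown because it generalizes to updates that no AMO can express.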

Example: Spinlock Acquire

acquire_lock:
    lr.w.aq t0, (a0)      # Load-reserved with acquire
    bnez t0, acquire_lock # If locked, retry
    li t1, 1
    sc.w t2, t1, (a0)     # Try to acquire (acquire ordering comes from lr.w.aq; .aq alone on SC adds none)
    bnez t2, acquire_lock # Retry if failed

Atomic Memory Operations (AMO)

Syntax:

amoswap.w rd, rs2, (rs1)   # Atomic swap
amoadd.w rd, rs2, (rs1)    # Atomic add
amoand.w rd, rs2, (rs1)    # Atomic AND
amoor.w rd, rs2, (rs1)     # Atomic OR
amoxor.w rd, rs2, (rs1)    # Atomic XOR
amomax.w rd, rs2, (rs1)    # Atomic max (signed)
amomaxu.w rd, rs2, (rs1)   # Atomic max (unsigned)
amomin.w rd, rs2, (rs1)    # Atomic min (signed)
amominu.w rd, rs2, (rs1)   # Atomic min (unsigned)

Ordering Annotations: Same as LR/SC (.aq, .rl, .aqrl)

Example: Spinlock with AMOSWAP

acquire_lock:
    li t0, 1
    amoswap.w.aq t1, t0, (a0)  # Swap 1 into lock, get old value
    bnez t1, acquire_lock       # If old value != 0, retry

release_lock:
    amoswap.w.rl zero, zero, (a0)  # Swap 0 into lock (release)

F.4 Common Synchronization Patterns

Pattern 1: Message Passing

Problem: Producer writes data, then sets flag. Consumer waits for flag, then reads data.

Solution:

# Producer (Hart 0)
    sw a0, 0(s0)      # Write data
    fence w, w        # Ensure data written before flag
    sw a1, 0(s1)      # Write flag = 1

# Consumer (Hart 1)
loop:
    lw t0, 0(s1)      # Read flag
    beqz t0, loop     # Wait for flag
    fence r, r        # Ensure flag read before data
    lw t1, 0(s0)      # Read data

Why fences are needed:

Without fence w, w: Flag might be visible before data
Without fence r, r: Data might be read before flag is checked

Pattern 2: Spinlock (LR/SC)

Acquire:

acquire_lock:
    lr.w.aq t0, (a0)      # Load-reserved with acquire
    bnez t0, acquire_lock # If locked, retry
    li t1, 1
    sc.w t2, t1, (a0)     # Try to set lock (acquire ordering comes from lr.w.aq)
    bnez t2, acquire_lock # Retry if SC failed
    # Lock acquired, critical section follows

Release:

    # Critical section
    amoswap.w.rl zero, zero, (a0)  # Release lock
    # OR
    fence rw, w
    sw zero, 0(a0)

Pattern 3: Spinlock (AMOSWAP)

Acquire:

acquire_lock:
    li t0, 1
    amoswap.w.aq t1, t0, (a0)  # Atomic swap
    bnez t1, acquire_lock       # If old value != 0, retry
    # Lock acquired

Release:

    amoswap.w.rl zero, zero, (a0)  # Release lock

Pattern 4: Dekker’s Algorithm (Mutual Exclusion)

Hart 0:

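    li t0, 1
    sw t0, flag0, t2   # flag0 = 1 (t2 is scratch for the symbol address)
    fence w, rw        # Ensure flag0 visible before reading flag1
    lw t1, flag1       # Read flag1
    bnez t1, wait      # If flag1 set, wait (back-off/retry path not shown)
    fence r, rw        # Acquire: keep the critical section after the flag1 read
    # Critical section
    fence rw, w        # Ensure critical section done
    sw zero, flag0, t2 # flag0 = 0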

Hart 1: (symmetric, swap flag0 and flag1)

Pattern 5: Producer-Consumer Queue

Producer:

    # Write data to queue[tail]
    sw a0, 0(s0)

    # Increment tail
    fence w, w         # Ensure data written before tail update
    addi s1, s1, 1     # s1 = tail index
    sw s1, tail_ptr, t2   # Store to a symbol needs a scratch register (t2)

Consumer:

    # Read tail
    lw t0, tail_ptr
    lw t1, head_ptr
    beq t0, t1, empty  # If tail == head, queue empty

    # Read data from queue[head]
    fence r, r         # Ensure tail read before data
    lw a0, 0(s2)

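    # Increment head (t1 = head index loaded above; s2 = address of queue[head])
    fence rw, w        # Ensure the slot is read before head update frees it
    addi t1, t1, 1
    sw t1, head_ptr, t2   # Store to a symbol needs a scratch register (t2)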
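
The assembly above leaves out index wrap-around and the full-queue check. A fuller single-producer, single-consumer ring buffer can be sketched in C11 as follows. All names (`spsc_queue_t`, `spsc_push`, `spsc_pop`, `QUEUE_SIZE`) are hypothetical, and the design assumes exactly one producer thread and one consumer thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8u                      /* Must be a power of two */

typedef struct {
    uint32_t    slots[QUEUE_SIZE];
    atomic_uint head;                      /* Next index to read (consumer-owned) */
    atomic_uint tail;                      /* Next index to write (producer-owned) */
} spsc_queue_t;

/* Producer side: returns false if the queue is full. */
bool spsc_push(spsc_queue_t *q, uint32_t value) {
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SIZE)
        return false;
    q->slots[tail % QUEUE_SIZE] = value;   /* Write the data first... */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release); /* ...then publish */
    return true;
}

/* Consumer side: returns false if the queue is empty. */
bool spsc_pop(spsc_queue_t *q, uint32_t *out) {
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->slots[head % QUEUE_SIZE];    /* Read the data after observing tail */
    atomic_store_explicit(&q->head, head + 1, memory_order_release); /* Free the slot */
    return true;
}
```

The indices run freely and are reduced modulo the queue size only when indexing, so full and empty states stay distinguishable without wasting a slot.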

Pattern 6: Barrier (N threads)

Barrier Wait:

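# Assumed layout (not in the original): 0(a0) = arrival counter, 4(a0) = generation.
# Spinning on the counter itself can hang, because the last arrival resets it
# before the waiters observe N; waiters spin on the generation word instead.
barrier_wait:
    lw t3, 4(a0)                  # Snapshot the current generation
    li t0, 1
    amoadd.w.aqrl t1, t0, (a0)    # counter++, get old value
    addi t1, t1, 1                # t1 = new counter value

    # Check if all threads arrived
    li t2, N                      # N = number of threads
    bne t1, t2, spin              # If not all arrived, spin

    # Last arrival: reset counter, then publish the next generation
    sw zero, 0(a0)
    addi t3, t3, 1
    addi t4, a0, 4
    amoswap.w.rl zero, t3, (t4)   # Release: reset is visible before the new generation
    ret

spin:
    lw t4, 4(a0)
    beq t4, t3, spin              # Wait until the generation changes
    fence r, rw                   # Acquire: order later accesses after the barrier
    ret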

F.5 Memory Model Comparison

RVWMO vs Other Models

Model	Strength	Reordering Allowed	Fence Overhead
Sequential Consistency	Strongest	None	N/A (no reordering)
x86 TSO	Strong	Store→Load only	Low (implicit)
ARM	Weak	Extensive	Medium
RISC-V RVWMO	Weak	Extensive	Medium
RISC-V RVTSO	Strong	Store→Load only	Low

Ordering Guarantees

Operation Pair	RISC-V RVWMO	x86 TSO	ARM	SC
Load → Load	✗	✓	✗	✓
Load → Store	✗	✓	✗	✓
Store → Store	✗	✓	✗	✓
Store → Load	✗	✗	✗	✓

✓ = Ordered by default ✗ = Can be reordered (need fence)

F.6 FENCE Equivalents Across Architectures

RISC-V	x86	ARM	Purpose
`fence rw, rw`	`MFENCE`	`DMB SY`	Full barrier
`fence w, w`	`SFENCE` (implicit for ordinary stores)	`DMB ST`	Store barrier
`fence r, r`	`LFENCE` (implicit for ordinary loads)	`DMB LD`	Load barrier
`fence r, rw`	(implicit)	`DMB LD`	Acquire
`fence rw, w`	(implicit)	`DMB ISH` (or `STLR`)	Release
`fence.i`	(implicit)	`ISB`	Instruction sync
`fence.tso`	(implicit)	-	TSO ordering

F.7 Acquire/Release Semantics

Acquire Semantics

Meaning: No memory operations after the acquire can move before it.

Use Case: After acquiring a lock, before accessing protected data.

Implementation:

# Option 1: Atomic with .aq
lr.w.aq t0, (a0)

# Option 2: Load + fence
lw t0, 0(a0)
fence r, rw

Release Semantics

Meaning: No memory operations before the release can move after it.

Use Case: After accessing protected data, before releasing a lock.

Implementation:

# Option 1: Atomic with .rl
amoswap.w.rl zero, zero, (a0)

# Option 2: Fence + store
fence rw, w
sw zero, 0(a0)

Acquire-Release Pair

Complete Lock Example:

# Acquire
acquire:
    lr.w.aq t0, (a0)
    bnez t0, acquire
    li t1, 1
    sc.w t2, t1, (a0)         # Acquire ordering comes from lr.w.aq above
    bnez t2, acquire

# Critical section
    # ... protected operations ...

# Release
    amoswap.w.rl zero, zero, (a0)

F.8 Common Pitfalls

Pitfall 1: Missing Fences

Wrong:

# Producer
sw a0, 0(s0)      # Write data
sw a1, 0(s1)      # Write flag

# Consumer
lw t0, 0(s1)      # Read flag
lw t1, 0(s0)      # Read data (might be stale!)

Correct:

# Producer
sw a0, 0(s0)
fence w, w        # Add fence!
sw a1, 0(s1)

# Consumer
lw t0, 0(s1)
fence r, r        # Add fence!
lw t1, 0(s0)

Pitfall 2: Wrong Fence Type

Wrong (using load-load fence for release):

# Critical section
fence r, r        # Wrong! Doesn't order stores
sw zero, 0(a0)    # Release lock

Correct:

# Critical section
fence rw, w       # Correct! Orders all ops before stores
sw zero, 0(a0)

Pitfall 3: Forgetting .aq/.rl on Atomics

Wrong:

lr.w t0, (a0)     # Missing .aq!
# Critical section
amoswap.w zero, zero, (a0)  # Missing .rl!

Correct:

lr.w.aq t0, (a0)
# Critical section
amoswap.w.rl zero, zero, (a0)

Pitfall 4: Data Race

Wrong (no synchronization):

# Hart 0
sw a0, 0(s0)      # Write shared variable

# Hart 1
lw t0, 0(s0)      # Read shared variable (DATA RACE!)

Correct (use lock or atomic):

# Hart 0
# ... acquire lock ...
sw a0, 0(s0)
# ... release lock ...

# Hart 1
# ... acquire lock ...
lw t0, 0(s0)
# ... release lock ...

F.9 Quick Decision Tree

Do I need a fence?

Are multiple harts accessing shared memory?
├─ No → No fence needed
└─ Yes → Continue

Are the accesses synchronized (locks, atomics)?
├─ Yes → Ordering comes from the lock or from .aq/.rl atomics (plain AMOs without .aq/.rl do not order other accesses)
└─ No → Continue

Are the accesses to the same address?
├─ Yes → No fence needed (hardware preserves order)
└─ No → FENCE REQUIRED!

What type of fence?
├─ Publishing data? → fence w, w
├─ Consuming data? → fence r, r
├─ Acquiring lock? → fence r, rw (or .aq)
├─ Releasing lock? → fence rw, w (or .rl)
└─ Not sure? → fence rw, rw (full fence)

F.10 References

RISC-V Memory Model Specification: Chapter 14 of RISC-V ISA Manual
RVWMO Formal Specification: https://github.com/riscv/riscv-isa-manual
Memory Model Tools: herd7, rmem (for verification)
Linux Kernel Memory Barriers: Documentation/memory-barriers.txt

About the Author

Danny Jiang is a seasoned system software engineer and technical lead with over 20 years of hands-on experience in firmware development, CPU/SoC architecture, and system validation. Currently serving as a Benchmarking/Application Engineer at SiFive, Danny has built his career working with leading semiconductor and processor companies, including MIPS (under Imagination Technologies, MIPS LLC, and Wave Computing), Broadcom, Western Digital, Andes Technology, and Silicon Integrated Systems (SiS).

Throughout his career, Danny has contributed to the development and deployment of millions of chips across diverse domains—from RISC-V and MIPS processors to SSD controllers, Bluetooth/IoT chipsets, and x86 chipset BIOS. His expertise spans the entire system software stack, from low-level bootloaders and device drivers to ASIC/FPGA validation and system integration.

Professional Expertise

Danny specializes in:

Processor Architecture: RISC-V, MIPS, ARM, x86
System Software: Bootloaders, firmware, device drivers, RTOS porting
Validation & Verification: ASIC/FPGA bring-up, silicon validation, system integration
Embedded Systems: IoT, SSD, wireless connectivity (Bluetooth, 802.15.x)
Performance Engineering: Benchmarking (CoreMark, Dhrystone), optimization
Customer Support: Technical troubleshooting, toolchain customization, training

Connect with Danny:

Email: djiang.tw@gmail.com
LinkedIn: linkedin.com/in/danny-jiang-26359644
GitHub: https://github.com/djiangtw

Other Works:

See RISC-V Run: Fundamentals (this book)
Various open-source contributions to the RISC-V ecosystem

Acknowledgments

The author would like to thank:

RISC-V International and all contributors to the RISC-V specifications for creating an open, well-documented ISA
The RISC-V community for their collaborative spirit and commitment to open standards
Colleagues and mentors at SiFive, MIPS, Andes, Broadcom, Western Digital, and SiS for their insights and expertise
Early reviewers who provided valuable feedback on draft chapters
Family and friends for their unwavering support during the writing process

About the Book

“See RISC-V Run: Fundamentals” is inspired by Dominic Sweetman’s classic “See MIPS Run” and aims to provide the same level of comprehensive, systematic coverage for the RISC-V architecture. This book combines:

Rigorous technical accuracy based on official RISC-V specifications
Practical insights from real-world implementation experience across multiple processor families
Clear explanations suitable for students, engineers, and researchers
Comparative analysis with ARM and MIPS to build architectural intuition

This volume focuses on fundamental concepts—from ISA basics and programmer’s model to pipeline design, system software, and platform integration. Future volumes, including “See RISC-V Run: Advanced”, will explore microarchitecture optimizations, advanced extensions, and cutting-edge implementations.

The book is licensed under CC BY 4.0, reflecting the author’s commitment to open knowledge sharing, consistent with the RISC-V philosophy.

January 2026

Bibliography and References

This book is based on publicly available specifications and documentation. All information is derived from open-source materials and official RISC-V specifications.

RISC-V Official Specifications

ISA Specifications

RISC-V Instruction Set Manual, Volume I: Unprivileged ISA
RISC-V International
https://github.com/riscv/riscv-isa-manual
Latest version: Ratified 2019, with ongoing updates
RISC-V Instruction Set Manual, Volume II: Privileged Architecture
RISC-V International
https://github.com/riscv/riscv-isa-manual
Latest version: Ratified 2021, with ongoing updates

Extension Specifications

RISC-V “V” Vector Extension
RISC-V International
https://github.com/riscv/riscv-v-spec
Version 1.0, Ratified 2021
RISC-V Bit Manipulation Extension
RISC-V International
https://github.com/riscv/riscv-bitmanip
Version 1.0, Ratified 2021
RISC-V Cryptography Extensions
RISC-V International
https://github.com/riscv/riscv-crypto
Version 1.0, Ratified 2021
RISC-V Hypervisor Extension
RISC-V International
Included in Privileged Architecture Specification

Platform Specifications

RISC-V Platform-Level Interrupt Controller (PLIC) Specification
RISC-V International
https://github.com/riscv/riscv-plic-spec
RISC-V Core-Local Interrupt Controller (CLIC) Specification
RISC-V International
https://github.com/riscv/riscv-fast-interrupt
RISC-V Supervisor Binary Interface (SBI) Specification
RISC-V International
https://github.com/riscv-non-isa/riscv-sbi-doc
Version 1.0, Ratified 2022
RISC-V ELF psABI Specification
RISC-V International
https://github.com/riscv-non-isa/riscv-elf-psabi-doc

RISC-V Software and Tools

RISC-V GNU Compiler Toolchain
https://github.com/riscv-collab/riscv-gnu-toolchain
RISC-V LLVM
https://github.com/llvm/llvm-project
QEMU RISC-V Emulator
https://www.qemu.org/docs/master/system/target-riscv.html
Spike RISC-V ISA Simulator
https://github.com/riscv-software-src/riscv-isa-sim
OpenSBI (Open Source Supervisor Binary Interface)
https://github.com/riscv-software-src/opensbi

Companion Projects

danieRTOS - A Minimal RISC-V RTOS for Learning
Danny Jiang
https://github.com/djiangtw/djiang-oss-public/tree/main/daniertos
A minimal RTOS implementation designed for learning RISC-V system programming. Lab examples in this book reference logic from this project.
Building danieRTOS - Technical Column Series
Danny Jiang
https://github.com/djiangtw/tech-column-public/tree/main/topics/building-daniertos
A technical article series documenting the danieRTOS development process, covering context switching, interrupt handling, the timer, the scheduler, and other core topics.

Classic Architecture Books

Sweetman, Dominic. See MIPS Run, Second Edition.
Morgan Kaufmann, 2006.
ISBN: 978-0120884216
Patterson, David A., and John L. Hennessy. Computer Organization and Design RISC-V Edition: The Hardware Software Interface.
Morgan Kaufmann, 2017.
ISBN: 978-0128122754
Patterson, David A., and Andrew Waterman. The RISC-V Reader: An Open Architecture Atlas.
Strawberry Canyon, 2017.
ISBN: 978-0999249109

ARM Architecture References

ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile
ARM Limited
https://developer.arm.com/documentation/
ARM Cortex-A Series Programmer’s Guide
ARM Limited
https://developer.arm.com/documentation/

Memory Model and Concurrency

RISC-V Memory Consistency Model
Included in RISC-V Unprivileged ISA Specification, Chapter 14
https://github.com/riscv/riscv-isa-manual
Waterman, Andrew, and Krste Asanović (Editors). The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.2.
RISC-V Foundation, 2017

Online Resources

RISC-V International Website
https://riscv.org/
RISC-V Technical Specifications
https://riscv.org/technical/specifications/
RISC-V GitHub Organization
https://github.com/riscv
RISC-V Software Collaboration
https://github.com/riscv-collab
RISC-V Wiki
https://wiki.riscv.org/

Academic Papers

Asanović, Krste, and David A. Patterson. “Instruction Sets Should Be Free: The Case for RISC-V.”
EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2014-146, 2014.
Waterman, Andrew. “Design of the RISC-V Instruction Set Architecture.”
PhD Thesis, University of California, Berkeley, 2016.

Notes

All RISC-V specifications are available under open licenses (Creative Commons or similar)
This book does not use any proprietary or confidential information
All code examples are original or based on publicly available documentation
Readers should consult the official RISC-V specifications for the most current information

Last Updated: January 2026 (v0p11 Enhancement)
