Chapter 12. Vector Processing & SIMD Comparison
Part VII — ISA Extensions
🎯 Learning Objectives
After reading this chapter, you will be able to:
- Understand SISD vs SIMD: Grasp the difference between scalar and vector operations
- Master VLA Core Concepts: Understand the value of Vector-Length Agnostic design
- Use the vsetvli Instruction: Perform Strip-mining (chunked processing)
- Write Vector Loops: Master the standard VLA loop structure
- Compare Different SIMD Architectures: Understand RISC-V V vs ARM SVE vs x86 AVX differences
💡 Scenario: The Flexible Noodle Cutter
Scene: Junior is staring at SIMD code on the screen, frustrated.
Junior: “Architect, this is painful. I wrote an optimized version for 128-bit hardware before. Now the company switched to a new chip that supports 512-bit. My original loop only cuts 4 floats at a time, but now I could cut 16. I have to rewrite the entire loop logic, including the ‘tail’ handling.”
Architect: “That’s the downside of Fixed-width SIMD. It’s like a fixed-size cookie cutter. Using a small cutter on a big dough is inefficient; switch to a bigger cutter, and your old recipe (code) doesn’t work anymore.”
Junior: “How does RISC-V solve this?”
Architect: “RISC-V’s V (Vector) Extension uses a design called VLA (Vector-Length Agnostic).
Imagine you have a ‘smart noodle cutting machine’. You don’t need to tell it ‘I want 4 pieces’ or ‘I want 16 pieces.’ You just say: ‘Here’s a big lump of dough (total length N), please cut it with your maximum capacity.’”
Junior: “Maximum capacity?”
Architect: “Right. The hardware responds: ‘Report: my blade can cut 8 pieces at a time (vl).’
Then you cut 8 pieces, push the dough forward, and ask again.
When you’re down to the last 3 pieces of dough, you don’t need to change cutters—the hardware automatically tells you: ‘This time I’ll only cut 3.’
Code written this way runs on 128-bit or 1024-bit machines without changing a single line, and each machine automatically uses its full vector width.”
Junior: “Wow, even the tail is handled automatically? Teach me this instruction!”
Architect: “This is the legendary artifact: vsetvli.”
Modern applications increasingly demand data-parallel processing. Image processing applies the same filter to millions of pixels. Machine learning performs matrix operations on thousands of elements. Scientific simulations compute physics equations across vast grids. These workloads share a common pattern: the same operation repeated on different data.
Traditional scalar processors handle one operation at a time. To process 1000 elements, they execute 1000 separate instructions. Vector processors, by contrast, operate on multiple elements simultaneously with a single instruction—Single Instruction, Multiple Data (SIMD). This can provide 4×, 8×, or even greater speedups for data-parallel code.
Every major architecture offers SIMD extensions: x86 has SSE and AVX, ARM has NEON and SVE. RISC-V’s answer is the V extension (Vector), ratified in 2021. But RISC-V takes a different approach from its predecessors. Instead of fixed-width vectors that become obsolete as hardware improves, RISC-V uses vector-length agnostic programming—code that adapts automatically to different hardware implementations.
This chapter explores the V extension’s design, compares it with ARM and x86 SIMD, and shows how to write efficient vector code. We’ll see why RISC-V’s approach offers better long-term scalability than traditional SIMD architectures.
12.1 Vector Extension Overview
The SIMD Evolution
SIMD extensions have evolved through multiple generations, each adding wider vectors:
- x86: MMX (64-bit) → SSE (128-bit) → AVX (256-bit) → AVX-512 (512-bit)
- ARM: NEON (128-bit) → SVE (128-2048 bits, scalable)
Each generation requires new instructions and software rewrites. Code optimized for 128-bit vectors doesn’t automatically benefit from 256-bit hardware. This creates a dilemma: should compilers target narrow vectors for compatibility or wide vectors for performance?
Vector-Length Agnostic Programming
RISC-V’s V extension solves this with vector-length agnostic (VLA) programming. Instead of specifying exact vector widths, programs specify operations on abstract vectors. The hardware determines the actual vector length at runtime based on its capabilities.
A program written for V extension runs on any implementation, from embedded processors with 128-bit vectors to supercomputers with 4096-bit vectors, automatically using the available width. This future-proofs software and simplifies compiler design.
Key Concepts
VLEN: Vector register length in bits, implementation-defined (a power of 2, at most 65536; the full V extension requires at least 128). A processor might have VLEN=256 (256-bit vectors) or VLEN=512 (512-bit vectors).
ELEN: Maximum element width in bits, implementation-defined (a power of 2; 64 for the full V extension, 32 on some embedded Zve configurations). Determines the largest element type (e.g., ELEN=64 supports 64-bit integers and doubles).
SEW: Selected element width in bits, chosen by software (8, 16, 32, or 64). Determines how many elements fit in a vector register.
LMUL: Vector register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8). Allows using multiple registers as a single logical vector for larger operations.
AVL: Application vector length, the number of elements the application wants to process.
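VL: Vector length, the number of elements actually processed by an instruction (VL ≤ AVL and VL ≤ VLMAX, where VLMAX = VLEN/SEW × LMUL).
The relationship is: VL = min(AVL, VLMAX) = min(AVL, VLEN/SEW × LMUL). The specification gives implementations some freedom when VLMAX < AVL < 2 × VLMAX, so portable code should always use the VL value that vsetvli returns rather than recomputing it.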
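The VL relationship can be checked on any host with a small scalar model. The sketch below is an illustration only: `model_vlmax` and `model_vl` are hypothetical helper names, LMUL is passed as a fraction so that 1/2, 1/4, and 1/8 can be expressed, and the model assumes the simple min rule.

```c
#include <stddef.h>

/* Scalar model of VLMAX = VLEN/SEW * LMUL.
 * LMUL is given as lmul_num / lmul_den (e.g. 1/8 -> num=1, den=8). */
size_t model_vlmax(size_t vlen_bits, size_t sew_bits,
                   size_t lmul_num, size_t lmul_den)
{
    return (vlen_bits / sew_bits) * lmul_num / lmul_den;
}

/* Scalar model of the VL that vsetvli would return, assuming the
 * simple rule VL = min(AVL, VLMAX). Real hardware may pick a smaller
 * value when VLMAX < AVL < 2 * VLMAX. */
size_t model_vl(size_t avl, size_t vlen_bits, size_t sew_bits,
                size_t lmul_num, size_t lmul_den)
{
    size_t vlmax = model_vlmax(vlen_bits, sew_bits, lmul_num, lmul_den);
    return avl < vlmax ? avl : vlmax;
}
```

For example, `model_vl(100, 256, 32, 1, 1)` yields 8, while a request for only 3 elements on the same configuration yields 3.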
Figure 12.1: Vector Register Organization
Vector Register File (32 registers: v0-v31)
Each register: VLEN bits (implementation-defined)
Element Width (SEW) - determines elements per register:
VLEN = 256 bits (example)
├─ SEW=8: 32 elements (bytes)
├─ SEW=16: 16 elements (halfwords)
├─ SEW=32: 8 elements (words)
└─ SEW=64: 4 elements (doublewords)
Register Grouping (LMUL) - use multiple registers as one:
├─ LMUL=1: 1 register (e.g., v0)
├─ LMUL=2: 2 registers (e.g., v0-v1)
├─ LMUL=4: 4 registers (e.g., v0-v3)
└─ LMUL=8: 8 registers (e.g., v0-v7)
A vector register can be interpreted with different element widths (SEW), and multiple consecutive registers can be grouped (LMUL) for larger operations.
12.2 Vector Register Organization
Vector Register File
The V extension adds 32 vector registers, v0 through v31. Each register is VLEN bits wide, where VLEN is implementation-defined. Unlike scalar registers which are always 32 or 64 bits, vector registers can be 128, 256, 512, or even larger.
Register v0 has a special role: it’s used as the mask register for predicated operations (more on this later).
Element Width and Capacity
A vector register holds multiple elements. The number depends on the selected element width (SEW):
Number of elements = VLEN / SEW
For VLEN=256:
- SEW=8 (byte): 32 elements
- SEW=16 (halfword): 16 elements
- SEW=32 (word): 8 elements
- SEW=64 (doubleword): 4 elements
Software selects SEW based on the data type being processed.
Register Grouping (LMUL)
Sometimes you need to process more elements than fit in one register. LMUL (register group multiplier) allows treating multiple consecutive registers as a single logical vector:
- LMUL=1: Use 1 register (default)
- LMUL=2: Use 2 consecutive registers (e.g., v0-v1)
- LMUL=4: Use 4 consecutive registers (e.g., v0-v3)
- LMUL=8: Use 8 consecutive registers (e.g., v0-v7)
With LMUL=2 and SEW=32 on VLEN=256, you get 16 elements (8 per register × 2 registers).
LMUL can also be fractional (1/2, 1/4, 1/8) to use only part of a register, leaving more registers available for other operations.
Register Alignment
When LMUL > 1, register numbers must be aligned:
- LMUL=2: Use v0, v2, v4, … (even registers)
- LMUL=4: Use v0, v4, v8, … (multiples of 4)
- LMUL=8: Use v0, v8, v16, v24 (multiples of 8)
This simplifies hardware implementation.
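The alignment rule is easy to express as a check. The following is a minimal sketch for integer LMUL values only; `group_is_aligned` and `groups_available` are hypothetical names used for illustration.

```c
#include <stdbool.h>

/* True when register number reg (0..31) may start a register group
 * for the given integer LMUL (1, 2, 4 or 8). */
bool group_is_aligned(unsigned reg, unsigned lmul)
{
    return reg < 32 && (reg % lmul) == 0;
}

/* Number of independent register groups left at a given integer LMUL. */
unsigned groups_available(unsigned lmul)
{
    return 32 / lmul;
}
```

With LMUL=8 only v0, v8, v16, and v24 pass the check, which is why `groups_available(8)` is 4.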
12.3 Vector Configuration
The vtype CSR
Vector operations are configured through the vtype CSR (vector type register), which specifies:
- SEW: Selected element width (8, 16, 32, or 64 bits)
- LMUL: Register group multiplier (1/8, 1/4, 1/2, 1, 2, 4, 8)
- vta: Vector tail agnostic (how to handle elements beyond VL)
- vma: Vector mask agnostic (how to handle masked-off elements)
The vsetvli Instruction
Before executing vector instructions, software must configure vtype and set VL using the vsetvli instruction (or vsetvl, which takes the vtype value from a register instead of an immediate):
vsetvli rd, rs1, vtypei: Set VL and vtype. rs1 contains AVL (requested vector length), vtypei encodes SEW and LMUL, rd receives the actual VL.
# Configure for 32-bit elements, LMUL=1
li a0, 100 # AVL = 100 elements to process
vsetvli t0, a0, e32, m1 # Set SEW=32, LMUL=1, VL = min(AVL, VLEN/32)
# t0 now contains actual VL
The hardware sets VL to the smaller of:
- AVL (what the application requested)
- VLEN/SEW × LMUL (what the hardware can handle)
If AVL=100 but the hardware can only process 8 elements at a time (VLEN=256, SEW=32, LMUL=1), then VL=8. The application must loop to process all 100 elements.
Vector-Length Agnostic Loop
Here’s the standard pattern for processing an array:
#include <riscv_vector.h>   // RVV C intrinsics (v1.0 naming with the __riscv_ prefix)
#include <stddef.h>

void vadd_vv(int *dst, int *src1, int *src2, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);                    // Set VL for remaining elements
        vint32m1_t v1 = __riscv_vle32_v_i32m1(&src1[i], vl); // Load src1[i:i+vl]
        vint32m1_t v2 = __riscv_vle32_v_i32m1(&src2[i], vl); // Load src2[i:i+vl]
        vint32m1_t v3 = __riscv_vadd_vv_i32m1(v1, v2, vl);   // v3 = v1 + v2
        __riscv_vse32_v_i32m1(&dst[i], v3, vl);              // Store dst[i:i+vl]
    }
}
This code works on any VLEN. On VLEN=128, it processes 4 elements per iteration. On VLEN=512, it processes 16 elements per iteration. No code changes needed.
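To see the "no code changes" claim without RISC-V hardware, the loop can be modeled in plain C with VLEN passed as a parameter. This is a sketch under stated assumptions: `model_vsetvl_e32m1` is a hypothetical stand-in for vsetvli at SEW=32, LMUL=1, and the inner loop plays the role of one vector instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvli with SEW=32, LMUL=1. */
size_t model_vsetvl_e32m1(size_t avl, size_t vlen_bits)
{
    size_t vlmax = vlen_bits / 32;
    return avl < vlmax ? avl : vlmax;
}

/* Adds src1 and src2 chunk by chunk, exactly like the VLA loop.
 * Returns the number of loop iterations that were needed. */
size_t model_vadd(int32_t *dst, const int32_t *src1, const int32_t *src2,
                  size_t n, size_t vlen_bits)
{
    size_t iterations = 0;
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = model_vsetvl_e32m1(n - i, vlen_bits);
        for (size_t j = 0; j < vl; j++)   /* one "vector instruction" */
            dst[i + j] = src1[i + j] + src2[i + j];
        iterations++;
    }
    return iterations;
}
```

For n = 100 the model needs 25 iterations at VLEN=128 and 7 at VLEN=512, and the results are identical in both cases.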
Encoding vtype
The vtypei immediate in vsetvli encodes SEW and LMUL:
vtypei[2:0] = LMUL encoding:
000 = LMUL=1, 001 = LMUL=2, 010 = LMUL=4, 011 = LMUL=8
100 = reserved, 101 = LMUL=1/8, 110 = LMUL=1/4, 111 = LMUL=1/2
vtypei[5:3] = SEW encoding:
000 = SEW=8, 001 = SEW=16, 010 = SEW=32, 011 = SEW=64
vtypei[6] = vta (tail agnostic)
vtypei[7] = vma (mask agnostic)
The assembler provides convenient mnemonics: e32, m1 means SEW=32, LMUL=1.
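The bit layout above can be packed by hand, which helps when reading disassembly or CSR dumps. The helper below is a minimal sketch; `encode_vtypei` is a hypothetical name, and its arguments are the raw 3-bit field codes rather than SEW or LMUL values.

```c
#include <stdint.h>

/* Packs the low 8 bits of vtype from the raw field codes:
 *   vlmul_code: bits [2:0]  (e.g. 0 = LMUL=1, 7 = LMUL=1/2)
 *   vsew_code:  bits [5:3]  (e.g. 2 = SEW=32)
 *   vta:        bit 6, vma: bit 7 */
uint8_t encode_vtypei(unsigned vlmul_code, unsigned vsew_code,
                      unsigned vta, unsigned vma)
{
    return (uint8_t)((vlmul_code & 0x7u)
                     | ((vsew_code & 0x7u) << 3)
                     | ((vta & 0x1u) << 6)
                     | ((vma & 0x1u) << 7));
}
```

Under this layout, `e32, m1, ta, ma` corresponds to `encode_vtypei(0, 2, 1, 1)`, which is 0xD0.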
12.4 Vector Arithmetic and Logic
Vector-Vector Operations
Vector arithmetic instructions operate on corresponding elements from two vector registers:
vadd.vv vd, vs2, vs1: vd[i] = vs2[i] + vs1[i] for i = 0 to VL-1
vsub.vv, vmul.vv, vdiv.vv: Subtraction, multiplication, division
vand.vv, vor.vv, vxor.vv: Bitwise AND, OR, XOR
# Vector addition: v3 = v1 + v2
vsetvli t0, a0, e32, m1
vle32.v v1, (a1) # Load first vector
vle32.v v2, (a2) # Load second vector
vadd.vv v3, v1, v2 # Add element-wise
vse32.v v3, (a3) # Store result
Vector-Scalar Operations
Often you need to add the same scalar to all vector elements. Vector-scalar instructions use a scalar register (x register) as the second operand:
vadd.vx vd, vs2, rs1: vd[i] = vs2[i] + rs1 for all i
# Add constant 10 to all elements
li a0, 10
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vadd.vx v2, v1, a0 # v2[i] = v1[i] + 10
vse32.v v2, (a3)
Vector-Immediate Operations
For small constants, vector-immediate instructions avoid loading into a scalar register:
vadd.vi vd, vs2, imm: vd[i] = vs2[i] + imm (imm is 5-bit signed)
# Increment all elements by 1
vadd.vi v2, v1, 1 # v2[i] = v1[i] + 1
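A 5-bit signed immediate covers only -16 to 15, so a code generator has to choose between the .vi and .vx forms. A minimal sketch of that decision follows; `fits_simm5` is a hypothetical helper name.

```c
#include <stdbool.h>

/* True when value can be encoded as a 5-bit signed immediate,
 * i.e. when a .vi form can be used instead of li + .vx. */
bool fits_simm5(long value)
{
    return value >= -16 && value <= 15;
}
```

Adding 10 can use vadd.vi directly, while adding 100 needs the constant in a scalar register and vadd.vx.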
Widening and Narrowing Operations
Widening operations produce results twice as wide as the inputs:
vwaddu.vv vd, vs2, vs1: Widening unsigned add (e.g., 32-bit inputs → 64-bit results)
vwadd.vv: Widening signed add
Narrowing operations reduce width:
vnsrl.wv vd, vs2, vs1: Narrowing shift right logical (e.g., 64-bit inputs → 32-bit results)
These are essential for avoiding overflow in accumulations or reducing precision after computation.
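The overflow problem that widening solves shows up clearly with 8-bit data. The scalar sketch below contrasts a same-width add with a widening add; `add_u8` and `wadd_u8` are hypothetical names that model vadd.vv and vwaddu.vv at SEW=8.

```c
#include <stddef.h>
#include <stdint.h>

/* Same-width add: the 8-bit result wraps modulo 256 (like vadd.vv at SEW=8). */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Widening add: 8-bit inputs, 16-bit results (like vwaddu.vv at SEW=8). */
void wadd_u8(uint16_t *dst, const uint8_t *a, const uint8_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
}
```

With inputs 200 and 100, the same-width add wraps to 44, while the widening add keeps the full result of 300.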
Fused Multiply-Add
Vector fused multiply-add computes (a × b) + c in one instruction:
vfmadd.vv vd, vs1, vs2: vd[i] = (vd[i] × vs1[i]) + vs2[i]
This is crucial for matrix multiplication and other linear algebra operations.
12.5 Vector Memory Operations
Unit-Stride Loads and Stores
The most common memory access pattern is unit-stride: consecutive elements in memory.
vle32.v vd, (rs1): Load VL elements of 32-bit width from address rs1
vse32.v vs3, (rs1): Store VL elements of 32-bit width to address rs1
# Load 32-bit integers from array
vsetvli t0, a0, e32, m1
vle32.v v1, (a1) # Load v1[0:VL-1] from memory[a1]
The number of bytes loaded is VL × SEW/8. For VL=8 and SEW=32, this loads 32 bytes.
Strided Loads and Stores
Strided access loads elements separated by a constant stride:
vlse32.v vd, (rs1), rs2: Load elements from rs1, rs1+rs2, rs1+2×rs2, …
vsse32.v vs3, (rs1), rs2: Store with stride
# Load every other element (stride = 8 bytes for 32-bit elements)
li t1, 8 # The byte stride must be in a scalar register
vlse32.v v1, (a1), t1 # Load a1[0], a1[2], a1[4], ...
This is useful for accessing matrix columns or interleaved data.
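Reading a matrix column is the classic case: in a row-major matrix, consecutive column entries are one row apart in memory. The scalar model below is a sketch of a strided 32-bit load; `strided_load_i32` is a hypothetical name, and the stride is in bytes, as it is for vlse32.v.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a strided load: dst[i] = 32-bit value at
 * base + i * stride_bytes, for i = 0 .. vl-1. */
void strided_load_i32(int32_t *dst, const void *base,
                      ptrdiff_t stride_bytes, size_t vl)
{
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&dst[i], p + (ptrdiff_t)i * stride_bytes, sizeof(int32_t));
}
```

For a 3×4 matrix of int32_t, loading column 1 means starting at `&m[0][1]` with a stride of 16 bytes (one full row).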
Indexed (Scatter/Gather) Loads and Stores
Indexed access uses a vector of indices to load/store non-contiguous elements. This is also called “gather” (for loads) and “scatter” (for stores).
vluxei32.v vd, (rs1), vs2: Load elements from rs1+vs2[i] for each i (unordered)
vsuxei32.v vs3, (rs1), vs2: Store with indices (unordered)
# Example: Gather operation
# Suppose we have an array a[] of 32-bit values and want to load a[1], a[3], a[5], a[2]
# The index vector holds byte offsets, so element indices must be scaled by 4
vle32.v v1, (a1) # Load element indices: v1 = [1, 3, 5, 2]
vsll.vi v1, v1, 2 # Scale to byte offsets: v1 = [4, 12, 20, 8]
vluxei32.v v2, (a2), v1 # Gather: v2[0]=a[1], v2[1]=a[3], v2[2]=a[5], v2[3]=a[2]
The index vector (v1 in the example) holds unsigned byte offsets, not element indices. For each element i, the instruction loads from address base + offset[i], so element indices must first be multiplied by the element size (4 bytes here). With v1 = [4, 12, 20, 8], the gather operation loads:
- v2[0] = memory[a2 + 4] (element at index 1)
- v2[1] = memory[a2 + 12] (element at index 3)
- v2[2] = memory[a2 + 20] (element at index 5)
- v2[3] = memory[a2 + 8] (element at index 2)
This is essential for sparse matrix operations, indirect addressing, and accessing non-contiguous data.
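Because the V specification treats the index vector of an indexed load as byte offsets, a gather over 32-bit elements has two steps: scale the element indices, then load. The scalar sketch below models both; `indices_to_offsets_i32` and `gather_i32` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Converts element indices to byte offsets for 32-bit elements
 * (what a left shift by 2 does on a real index vector). */
void indices_to_offsets_i32(uint32_t *offsets, const uint32_t *indices,
                            size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        offsets[i] = indices[i] << 2;
}

/* Scalar model of a gather: dst[i] = 32-bit value at base + byte_offsets[i]. */
void gather_i32(int32_t *dst, const void *base,
                const uint32_t *byte_offsets, size_t vl)
{
    const unsigned char *p = (const unsigned char *)base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&dst[i], p + byte_offsets[i], sizeof(int32_t));
}
```

Element indices [1, 3, 5, 2] become byte offsets [4, 12, 20, 8] before the gather runs.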
Segment Loads and Stores
Segment operations load/store groups of elements (like struct fields):
vlseg2e32.v vd, (rs1): Load 2-field segments (e.g., {x, y} pairs)
vsseg2e32.v vs3, (rs1): Store 2-field segments
// Load array of {x, y} pairs
struct point { int x, y; };
struct point points[100];
vlseg2e32.v v1, (a0) # v1 = all x values, v2 = all y values
A segment load converts array-of-structures (AoS) data in memory into structure-of-arrays (SoA) form in registers, and a segment store performs the reverse conversion.
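The effect of a 2-field segment load can be modeled as a deinterleave. The sketch below is illustrative only; `seg2_load_i32` is a hypothetical name, and the two output arrays stand in for the two destination registers.

```c
#include <stddef.h>
#include <stdint.h>

struct point { int32_t x, y; };

/* Scalar model of a 2-field segment load: field 0 of every struct goes
 * to xs (first destination register), field 1 to ys (the next register). */
void seg2_load_i32(int32_t *xs, int32_t *ys,
                   const struct point *pts, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        xs[i] = pts[i].x;
        ys[i] = pts[i].y;
    }
}
```

After the call, all x values are contiguous in one array and all y values in another, ready for ordinary element-wise vector arithmetic.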
Figure 12.2a: Unit-Stride Access
Unit-Stride (consecutive elements):
Memory: [0] [1] [2] [3] [4] [5] [6] [7]
↓ ↓ ↓ ↓
Vector: [0] [1] [2] [3]
Figure 12.2b: Strided Access
Strided (every 2nd element; stride = 2 elements, which is 8 bytes for 32-bit data):
Memory: [0] [1] [2] [3] [4] [5] [6] [7]
↓ ↓ ↓ ↓
Vector: [0] [2] [4] [6]
Figure 12.2c: Indexed (Gather) Access
graph TB
subgraph "Index Vector"
idx0["indices[0] = 1"]
idx1["indices[1] = 3"]
idx2["indices[2] = 5"]
idx3["indices[3] = 2"]
end
subgraph "Memory Array"
m0["mem[0] = 0"]
m1["mem[1] = 1"]
m2["mem[2] = 2"]
m3["mem[3] = 3"]
m4["mem[4] = 4"]
m5["mem[5] = 5"]
m6["mem[6] = 6"]
m7["mem[7] = 7"]
end
subgraph "Result Vector"
v0["vector[0] = 1"]
v1["vector[1] = 3"]
v2["vector[2] = 5"]
v3["vector[3] = 2"]
end
idx0 --> m1
m1 --> v0
idx1 --> m3
m3 --> v1
idx2 --> m5
m5 --> v2
idx3 --> m2
m2 --> v3
style m1 fill:#90EE90
style m2 fill:#FFB6C1
style m3 fill:#87CEEB
style m5 fill:#FFD700
Each index points to a memory location, and the value at that location is loaded into the corresponding vector position.
12.6 Vector Masking
Predicated Execution
Not all elements in a vector may need processing. Masking allows selectively enabling or disabling operations on individual elements.
The mask is stored in vector register v0, with one bit per element. If v0[i] = 1, element i is processed; if v0[i] = 0, element i is skipped (or handled according to vma setting).
Masked Operations
Most vector instructions can be masked by appending the v0.t operand (the vm bit in the encoding selects masked or unmasked execution):
vadd.vv vd, vs2, vs1, v0.t: Add only where v0[i] = 1
# Conditional add: dst[i] = (mask[i]) ? src1[i] + src2[i] : dst[i]
vsetvli t0, a4, e32, m1, ta, mu # a4 = element count (assumed); mu keeps masked-off elements
vlm.v v0, (a0) # Load mask bits into v0
vle32.v v1, (a1) # Load src1
vle32.v v2, (a2) # Load src2
vle32.v v3, (a3) # Load dst (for masked-off elements)
vadd.vv v3, v1, v2, v0.t # Add where mask is 1, keep v3 where mask is 0
vse32.v v3, (a3) # Store result
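The same conditional add can be modeled in scalar C to make the mask-undisturbed behavior explicit. This is a sketch under stated assumptions: `masked_add_mu` and `mask_bit` are hypothetical names, and the mask is stored one bit per element, least significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask bit i lives in bit (i % 8) of byte (i / 8), a packed
 * one-bit-per-element layout. */
static int mask_bit(const uint8_t *mask, size_t i)
{
    return (mask[i / 8] >> (i % 8)) & 1;
}

/* Scalar model of a masked add with the mask-undisturbed policy:
 * active elements get a[i] + b[i], masked-off elements keep dst[i]. */
void masked_add_mu(int32_t *dst, const int32_t *a, const int32_t *b,
                   const uint8_t *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        if (mask_bit(mask, i))
            dst[i] = a[i] + b[i];
        /* else: dst[i] is left unchanged */
    }
}
```

With a mask-agnostic policy, hardware would be allowed to overwrite the masked-off elements, so the model applies only when mu is selected.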
Comparison and Mask Generation
Comparison instructions generate masks:
vmseq.vv vd, vs2, vs1: vd[i] = (vs2[i] == vs1[i]) ? 1 : 0
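vmslt.vv, vmsle.vv, vmsgt.vx: Less than, less or equal, greater than (vmsgt is encoded only in .vx and .vi forms; assemblers accept vmsgt.vv as a pseudo-instruction that swaps the operands of vmslt.vv)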
# Find elements greater than 100
li a0, 100
vsetvli t0, a1, e32, m1
vle32.v v1, (a2)
vmsgt.vx v0, v1, a0 # v0[i] = (v1[i] > 100) ? 1 : 0
Mask Logical Operations
Masks can be combined with logical operations:
vmand.mm vd, vs2, vs1: Mask AND
vmor.mm, vmxor.mm, vmnand.mm: Mask OR, XOR, NAND
# Combine two conditions: (a > 100) AND (a < 200)
vmsgt.vx v1, v2, a0 # v1 = (v2 > 100)
vmslt.vx v3, v2, a1 # v3 = (v2 < 200)
vmand.mm v0, v1, v3 # v0 = v1 AND v3
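The range test above can be modeled with ordinary bit masks, which makes it easy to check a condition on a host machine before vectorizing it. The sketch below handles up to 64 elements; `mask_gt` and `mask_lt` are hypothetical names that model vmsgt.vx and vmslt.vx.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit i of the result is set when a[i] > x (vl must be <= 64). */
uint64_t mask_gt(const int32_t *a, size_t vl, int32_t x)
{
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (a[i] > x)
            m |= (uint64_t)1 << i;
    return m;
}

/* Bit i of the result is set when a[i] < x (vl must be <= 64). */
uint64_t mask_lt(const int32_t *a, size_t vl, int32_t x)
{
    uint64_t m = 0;
    for (size_t i = 0; i < vl; i++)
        if (a[i] < x)
            m |= (uint64_t)1 << i;
    return m;
}
```

The combined condition is then `mask_gt(a, vl, 100) & mask_lt(a, vl, 200)`, with the C `&` operator playing the role of vmand.mm.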
Use Cases
Masking is essential for:
- Conditional operations (if-then-else in vector code)
- Handling loop tails, although in RISC-V V the vl value from vsetvli usually covers this without a mask
- Sparse computations (skip zero elements)
- Implementing reductions with conditions
12.7 Vector Reductions
What is a Reduction?
A reduction combines all elements of a vector into a single scalar result. Common examples: sum all elements, find maximum, count non-zero elements.
Reduction Instructions
vredsum.vs vd, vs2, vs1: Sum all elements of vs2, add to vs1[0], store in vd[0]
vredmax.vs, vredmin.vs: Find maximum or minimum
vredand.vs, vredor.vs, vredxor.vs: Bitwise AND, OR, XOR of all elements
# Sum all elements of an array
vsetvli t0, a0, e32, m1
vmv.v.i v2, 0 # Initialize accumulator to 0
vle32.v v1, (a1) # Load vector
vredsum.vs v2, v1, v2 # v2[0] = sum(v1[0:VL-1]) + v2[0]
vmv.x.s a2, v2 # Move result to scalar register
For arrays larger than VL, loop and accumulate:
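#include <riscv_vector.h>   // RVV C intrinsics (v1.0 naming with the __riscv_ prefix)
#include <stddef.h>

int sum_array(int *arr, size_t n) {
    // The running sum lives in element 0 of a one-register vector
    vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, 1);
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);
        vint32m1_t v1 = __riscv_vle32_v_i32m1(&arr[i], vl);
        acc = __riscv_vredsum_vs_i32m1_i32m1(v1, acc, vl);  // acc[0] += sum(v1[0:vl])
    }
    return __riscv_vmv_x_s_i32m1_i32(acc);
}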
Masked Reductions
Reductions can be masked to sum only selected elements:
# Sum elements where mask is 1
vredsum.vs v2, v1, v2, v0.t
This is useful for conditional sums (e.g., sum all positive elements).
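A conditional sum combines three ideas from this chapter: strip-mining, mask generation, and masked reduction. The scalar model below is a sketch of that combination; `sum_positive` is a hypothetical name, and `vlmax` stands in for whatever chunk size the hardware would report.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a strip-mined masked reduction that sums only the
 * positive elements. vlmax must be at least 1. */
int32_t sum_positive(const int32_t *arr, size_t n, size_t vlmax)
{
    int32_t acc = 0;                 /* plays the role of v2[0] */
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = (n - i) < vlmax ? (n - i) : vlmax;   /* model of vsetvli */
        for (size_t j = 0; j < vl; j++) {
            if (arr[i + j] > 0)      /* mask bit from a greater-than compare */
                acc += arr[i + j];   /* contribution of an active element */
        }
    }
    return acc;
}
```

The result does not depend on `vlmax`, which mirrors how the vector version gives the same sum on any VLEN (for integer data, where the order of additions does not matter).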
12.8 Comparison with ARM NEON and x86 AVX
ARM NEON
ARM NEON provides 128-bit SIMD with 32 vector registers (v0-v31 in AArch64). Each register can hold:
- 16 × 8-bit elements
- 8 × 16-bit elements
- 4 × 32-bit elements
- 2 × 64-bit elements
NEON instructions specify the element width explicitly:
# ARM NEON: Add two vectors of 4 × 32-bit integers
ld1 {v0.4s}, [x0] // Load 4 × 32-bit
ld1 {v1.4s}, [x1]
add v2.4s, v0.4s, v1.4s // Add element-wise
st1 {v2.4s}, [x2]
Limitations of NEON:
- Fixed 128-bit width (no scalability)
- Code must be rewritten for wider vectors
- No predication (masking) in base NEON
ARM SVE (Scalable Vector Extension)
SVE addresses NEON’s limitations with scalable vectors (128-2048 bits). Like RISC-V V, SVE uses vector-length agnostic programming:
# ARM SVE: Vector add (works on any vector length)
ld1w z0.s, p0/z, [x0] // Load with predication
ld1w z1.s, p0/z, [x1]
add z2.s, z0.s, z1.s // Add
st1w z2.s, p0, [x2] // Store with predication
SVE and RISC-V V share similar philosophies: scalable vectors, predication, and VLA programming. However, SVE is more complex with more instruction variants and addressing modes.
x86 AVX
x86’s SIMD evolved through multiple generations:
- SSE: 128-bit (16 registers: xmm0-xmm15)
- AVX: 256-bit (16 registers: ymm0-ymm15)
- AVX-512: 512-bit (32 registers: zmm0-zmm31)
Each generation added new instructions:
# x86 AVX2: Add two vectors of 8 × 32-bit integers (256-bit integer operations such as vpaddd on ymm registers require AVX2)
vmovdqu ymm0, [rax] ; Load 256 bits
vmovdqu ymm1, [rbx]
vpaddd ymm2, ymm0, ymm1 ; Add 8 × 32-bit
vmovdqu [rcx], ymm2 ; Store
Limitations of x86 SIMD:
- Fixed widths (128, 256, 512 bits)
- Code must be rewritten for each generation
- AVX-512 has many variants (AVX-512F, AVX-512BW, AVX-512DQ, etc.)
- Complexity: thousands of SIMD instructions
RISC-V V Advantages
Compared to NEON and AVX, RISC-V V offers:
- Scalability: One codebase works on any VLEN (128 to 65536 bits)
- Simplicity: Fewer instruction variants, consistent naming
- Predication: Built-in masking for all operations
- Flexibility: Fractional LMUL, widening/narrowing operations
- Future-proof: No need to rewrite code for wider vectors
Trade-offs:
- RISC-V V is newer (less mature tooling and libraries)
- x86 AVX has extensive optimization for specific workloads
- ARM NEON is simpler for fixed-width use cases
Figure 12.3: SIMD Architecture Comparison
| Feature | x86 SSE/AVX | ARM NEON | ARM SVE | RISC-V V |
|---|---|---|---|---|
| Vector Width | Fixed: 128/256/512 bits | Fixed: 128 bits | Scalable: 128-2048 bits | Scalable: 128-65536 bits |
| Registers | 16 (SSE/AVX), 32 (AVX-512) | 32 | 32 | 32 |
| Scalability | No (fixed per generation) | No (fixed) | Yes (scalable) | Yes (scalable) |
| Code Portability | No (rewrite per generation) | No (fixed 128-bit; rewrite for wider vectors) | Yes (single codebase) | Yes (single codebase) |
| Predication | Partial (AVX-512 only) | No (base NEON) | Yes | Yes |
| Instruction Count | ~1000s (across generations) | ~200 | ~400 | ~300 |
| Complexity | High (many variants) | Low | Medium | Low |
| Introduced / Ratified | 1999 (SSE), 2011 (AVX), 2016 (AVX-512) | 2005 | 2016 | 2021 (ratified) |
| Key Advantage | Mature ecosystem | Simple, widely deployed | Scalable, predication | Scalable, simple, future-proof |
| Key Limitation | Fixed widths, complexity | Fixed 128-bit only | Complex instruction set | Newer, less mature tooling |
🛠️ Hands-on Lab: Lab 12.1 — Vector Addition
This lab demonstrates the classic VLA loop structure—the foundational pattern for RISC-V Vector programming.
Lab Objectives
- Write RISC-V Vector Assembly to implement C[i] = A[i] + B[i]
- Understand the meaning of vsetvli’s return value vl
- Compare the structure of Scalar Loop vs Vector Loop
Strip-mining Loop Structure
This is the core pattern of VLA programming:
while (n > 0) {
vl = vsetvli(n); // Ask hardware: how many can you handle?
load(vl elements); // Load vl elements
compute(); // Execute operation
store(vl elements); // Store vl elements
n -= vl; // Decrease remaining count
pointers += vl; // Advance pointers
}
Code
File 1: vector_add.S
# vector_add.S - Vector Addition (VLA version)
.section .text
.global vec_add
# void vec_add(int *a, int *b, int *c, int n)
# a0 = pointer to A
# a1 = pointer to B
# a2 = pointer to C
# a3 = n (element count)
vec_add:
# --- Strip-mining Loop ---
loop:
# 1. Set vector length
# vsetvli rd, rs1, vtype
# t0: hardware returns actual elements it can process (vl)
# a3: remaining elements (AVL)
# e32: element size 32-bit
# m1: LMUL=1 (use 1 vector register)
# ta, ma: Tail/Mask Agnostic
vsetvli t0, a3, e32, m1, ta, ma
# 2. Load data
vle32.v v0, (a0) # v0 = A[0:vl]
vle32.v v1, (a1) # v1 = B[0:vl]
# 3. Execute addition
vadd.vv v2, v0, v1 # v2 = v0 + v1
# 4. Write back data
vse32.v v2, (a2) # C[0:vl] = v2
# 5. Advance pointers (int32 = 4 bytes)
slli t1, t0, 2 # t1 = vl * 4
add a0, a0, t1 # A pointer advances
add a1, a1, t1 # B pointer advances
add a2, a2, t1 # C pointer advances
# 6. Update remaining count
sub a3, a3, t0 # n = n - vl
# 7. Continue loop
bnez a3, loop
ret
File 2: main.c
#include <stdio.h>
extern void vec_add(int *a, int *b, int *c, int n);
#define N 100 // Not a multiple of 8 or 16, so tail handling is exercised when VLEN >= 256
int main(void) {
int a[N], b[N], c[N];
// Initialize
for (int i = 0; i < N; i++) {
a[i] = i;
b[i] = 100;
c[i] = 0;
}
printf("Starting Vector Add...\n");
vec_add(a, b, c, N);
// Verify
int error = 0;
for (int i = 0; i < N; i++) {
if (c[i] != a[i] + 100) {
error++;
}
}
if (error == 0) {
printf("SUCCESS: All %d elements correct!\n", N);
} else {
printf("FAILED: %d errors\n", error);
}
return 0;
}
Compile and Run
# Compile (requires V extension support)
riscv64-unknown-elf-gcc -march=rv64gcv -o vec_add main.c vector_add.S
# Run on QEMU with V extension
qemu-riscv64 -cpu rv64,v=true vec_add
Expected Output:
Starting Vector Add...
SUCCESS: All 100 elements correct!
What You Just Did
- vsetvli: Asked hardware “how many elements can you process?” and got vl back
- Automatic Tail Handling: When N=100 and VLEN allows 8 elements per iteration, the last iteration automatically processes only 4 elements
- Portable Code: This same code runs on any VLEN (128-bit, 256-bit, 1024-bit) without modification
danieRTOS Reference: While danieRTOS doesn’t use vector operations directly, understanding VLA patterns helps when optimizing memory copy operations in the kernel.
⚠️ Common Pitfalls
Pitfall 1: Unnecessarily Handling the Tail
Error Scenario: Habituated to traditional SIMD, writing extra tail-handling loops.
Consequence: Wasted effort, and may introduce bugs.
// ❌ Wrong: No need to handle tail yourself
void vec_add_wrong(int *a, int *b, int *c, int n) {
int i;
// Vector part
for (i = 0; i + 4 <= n; i += 4) {
// vector_add_4(a+i, b+i, c+i);
}
// Tail part (this is redundant in VLA!)
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✅ Correct: vsetvli handles tail automatically
// See Assembly example above
Pitfall 2: Assuming Fixed VLEN
Error Scenario: Hardcoding assumptions like VLEN=256 or other specific values.
Consequence: Program behaves incorrectly or performs poorly on different hardware.
# ❌ Wrong: Assuming 8 elements per iteration
loop:
li t0, 8 # Hardcoded!
vsetvli zero, t0, e32, m1, ta, ma
...
# ✅ Correct: Let hardware decide
loop:
vsetvli t0, a3, e32, m1, ta, ma # a3 = remaining count
...
Pitfall 3: Forgetting LMUL’s Impact
Error Scenario: Not understanding LMUL (Vector Register Group Multiplier).
Explanation:
- LMUL=1: Use 1 vector register (v0-v31)
- LMUL=2: Use 2 registers as a group (v0-v1, v2-v3, …)
- LMUL=4/8: Larger groups
# LMUL=1: Can use v0-v31 (32 independent registers)
vsetvli t0, a3, e32, m1, ta, ma
# LMUL=2: Can use v0, v2, v4, ... (16 groups, 2 each)
vsetvli t0, a3, e32, m2, ta, ma
# Now v0 and v1 are the same group, cannot be used separately
# LMUL=8: Can use v0, v8, v16, v24 (4 groups, 8 each)
vsetvli t0, a3, e32, m8, ta, ma
💡 Tip: LMUL > 1 can process more elements but reduces available register count. For simple vector addition, LMUL=1 is usually sufficient.
Summary
The RISC-V Vector extension represents a modern approach to SIMD processing, learning from decades of experience with x86 and ARM SIMD architectures. Its vector-length agnostic design ensures that code written today will automatically benefit from wider vectors in future hardware.
Vector-length agnostic programming is the V extension’s defining feature. By abstracting away the physical vector width, RISC-V allows a single binary to run efficiently on implementations ranging from tiny embedded processors to supercomputers. This eliminates the need to maintain multiple code paths for different vector widths, simplifying both compiler and application development.
Vector configuration through the vsetvli and vsetvl instructions and the vtype CSR provides fine-grained control over element width (SEW), register grouping (LMUL), and vector length (VL). The hardware determines VL from the application’s request (AVL) and the implementation’s capabilities (VLEN), making it easy to write portable high-performance code.
Vector operations cover the full spectrum of data-parallel computation: arithmetic and logic operations with vector-vector, vector-scalar, and vector-immediate variants; widening and narrowing operations for precision management; and fused multiply-add for efficient linear algebra. The consistent instruction naming and behavior make the V extension easier to learn than x86’s sprawling SIMD instruction set.
Vector memory operations support diverse access patterns: unit-stride for contiguous data, strided for matrix columns and interleaved data, indexed for sparse matrices and indirect addressing, and segment operations for structure-of-arrays conversions. This flexibility enables efficient vectorization of a wide range of algorithms.
Vector masking provides predicated execution, allowing conditional operations on individual vector elements. Comparison instructions generate masks, mask logical operations combine conditions, and masked operations selectively process elements. This is essential for handling loop tails, implementing conditional logic in vector code, and optimizing sparse computations.
Vector reductions efficiently combine all elements of a vector into a scalar result, supporting operations like sum, maximum, minimum, and bitwise reductions. Masked reductions enable conditional aggregations, crucial for many algorithms.
Compared to ARM NEON and x86 AVX, RISC-V V offers superior scalability and simplicity. While NEON is limited to 128-bit vectors and AVX requires separate code for each generation (128, 256, 512 bits), RISC-V V code automatically adapts to any vector width. ARM SVE shares RISC-V’s scalable philosophy but with greater complexity. x86’s SIMD has evolved into thousands of instructions across multiple incompatible extensions, while RISC-V V maintains a clean, orthogonal design.
The V extension positions RISC-V well for future data-parallel workloads in machine learning, scientific computing, multimedia processing, and other domains where SIMD performance is critical.