Chapter 9: Embedded & RTOS Benchmarks
Part II: Tools
"In embedded systems, the worst case is the only case that matters." — Jack Ganssle
The "Average 1ms, But Sometimes 100ms" Disaster
"Average latency 1ms, fully meets spec."
That was the vendor's benchmark report. We were using this MCU for motor control, with a requirement to update PWM output every 1ms. Average 1ms? Perfect.
After the system went live, the motor started stuttering. Not every time—just "occasionally."
We spent three days debugging. Finally we discovered: behind that "average 1ms," there was a 0.1% chance of jumping to 50-100ms. In typical benchmark reports, these outliers get averaged away—invisible.
But for motor control at a 1ms update rate (1,000 updates per second), a 0.1% outlier rate means about one 50-100ms stall every second: a visible stutter.
This is the fundamental difference between embedded/RTOS benchmarking and GPOS benchmarking: we care about worst case, not average case.
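To see how a mean hides the tail, here is a minimal host-side sketch in C. The helper name `summarize_latency`, the struct, and the 2,000 µs deadline used in the usage note are illustrative assumptions, not part of the vendor's report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of a latency trace, all values in microseconds. */
typedef struct {
    uint32_t mean_us;      /* arithmetic mean (rounded down)   */
    uint32_t max_us;       /* largest sample seen              */
    size_t   over_budget;  /* samples above the given deadline */
} latency_summary_t;

/* Hypothetical helper: one pass over the samples. */
static latency_summary_t summarize_latency(const uint32_t *samples_us,
                                           size_t n, uint32_t deadline_us)
{
    assert(samples_us != NULL && n > 0);
    latency_summary_t s = {0, 0, 0};
    uint64_t sum = 0;  /* 64-bit so long traces cannot overflow */
    for (size_t i = 0; i < n; i++) {
        sum += samples_us[i];
        if (samples_us[i] > s.max_us) s.max_us = samples_us[i];
        if (samples_us[i] > deadline_us) s.over_budget++;
    }
    s.mean_us = (uint32_t)(sum / n);
    return s;
}
```

For a trace of 999 samples at 1,000 µs plus a single 100,000 µs outlier, the mean comes out at 1,099 µs, which still looks close to spec, while the maximum is 100,000 µs.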
GPOS vs RTOS vs Bare-metal
Let's clarify the differences between these three environments:
| Feature | GPOS | RTOS | Bare-metal |
|---|---|---|---|
| Examples | Linux, Windows, macOS | FreeRTOS, Zephyr, RT-Linux | Running directly on hardware |
| Scheduling | Time-slicing, variable priority | Fixed priority, preemptive | None (or super loop) |
| Memory | Virtual memory, paging | Usually flat memory | Flat memory |
| Interrupt latency | Not guaranteed (may be ms) | Guaranteed upper bound (usually μs) | Minimal (cycles) |
| Jitter | High (background processes) | Low (deterministic) | Lowest |
| Tool support | Rich (perf, VTune) | Medium (trace, SEGGER) | Basic (GPIO toggle) |
Why This Matters
On a GPOS, most applications can tolerate an operation that usually takes 1ms but occasionally takes 10ms.
On RTOS/bare-metal:
- Motor control: 100ms delay = motor loses control
- Automotive ABS: 10ms delay = brake failure
- Medical devices: delay = potentially fatal
RTOS benchmarks must report worst-case, not just average.
Time Measurement: What If There's No OS?
On a GPOS we use clock_gettime() or, on x86, the rdtsc instruction. On a bare-metal MCU neither is available, so we read the hardware counters directly.
ARM Cortex-M: DWT Cycle Counter
The Data Watchpoint and Trace (DWT) unit provides a cycle counter:
// Enable DWT cycle counter (need to enable trace first)
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
// Read cycle count
static inline uint32_t get_cycles(void) {
return DWT->CYCCNT;
}
// Usage
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
uint32_t elapsed = end - start; // cycles
Note: DWT->CYCCNT is 32 bits wide, so it wraps around quickly on fast MCUs (at 168MHz, roughly every 25.6 seconds).
RISC-V: mcycle/minstret CSRs
RISC-V has standard cycle and instruction counters:
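// RV32 version (matches the rv32imac targets used in this chapter).
// One csrr reads only 32 bits, so the 64-bit counter is split across
// mcycle/mcycleh. Re-read the high half until it is stable, otherwise a
// carry out of the low half between the two reads gives a torn value.
// On RV64 a single "csrr %0, mcycle" returns all 64 bits.
static inline uint64_t get_mcycle(void) {
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("csrr %0, mcycleh" : "=r"(hi));
        asm volatile ("csrr %0, mcycle"  : "=r"(lo));
        asm volatile ("csrr %0, mcycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
// Read retired-instruction counter (same scheme, minstret/minstreth)
static inline uint64_t get_minstret(void) {
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("csrr %0, minstreth" : "=r"(hi));
        asm volatile ("csrr %0, minstret"  : "=r"(lo));
        asm volatile ("csrr %0, minstreth" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
// Calculate CPI
uint64_t cycles_start = get_mcycle();
uint64_t instr_start = get_minstret();
my_function();
uint64_t cycles = get_mcycle() - cycles_start;
uint64_t instrs = get_minstret() - instr_start;
double cpi = (double)cycles / instrs;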
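The CPI calculation above uses `double`. Cores without an FPU (RV32IMAC, Cortex-M3) fall back to software floating point, and small C libraries often build `printf` without `%f`. A fixed-point variant is one way around both, sketched here with the hypothetical helper name `cpi_milli`.

```c
#include <stdint.h>

/* CPI scaled by 1000 (e.g. 1250 means 1.25 cycles per instruction),
 * rounded to nearest. Returns 0 when no instructions retired, to avoid
 * dividing by zero. Assumes cycles * 1000 fits in 64 bits. */
static inline uint32_t cpi_milli(uint64_t cycles, uint64_t instrs)
{
    if (instrs == 0) {
        return 0;
    }
    return (uint32_t)((cycles * 1000u + instrs / 2) / instrs);
}
```

The result can be printed with integer formatting only, for example as `cpi_milli(...) / 1000` and `cpi_milli(...) % 1000`.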
Running on QEMU
# ARM Cortex-M3 (lm3s6965evb)
qemu-system-arm -M lm3s6965evb -nographic -kernel firmware.elf
# RISC-V (sifive_e - FE310)
qemu-system-riscv32 -M sifive_e -nographic -kernel firmware.elf
QEMU's cycle counter is "functional," not cycle-accurate. Numbers can verify program logic but don't represent real hardware cycle counts.
Porting Open-Source Benchmarks to Bare-metal
Good news: most CPU/memory benchmarks port easily:
| Benchmark | Porting Difficulty | Dependencies | Notes |
|---|---|---|---|
| Dhrystone | Easy | libc only | Replace time() calls with a cycle counter |
| CoreMark | Easy | libc only | Official bare-metal support |
| Embench | Easy | None | Designed for embedded |
| Whetstone | Easy | libm | Needs floating-point support |
| STREAM | Medium | None | Needs enough memory |
| lmbench | Hard | POSIX | Core algorithms portable |
CoreMark Bare-metal Port
CoreMark officially supports bare-metal; just implement a few porting functions:
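// core_portme.c - ARM Cortex-M implementation
// 1. Timing start/end
// Raw DWT readings; kept 32-bit so the subtraction below wraps correctly
static uint32_t start_cycles, end_cycles;
void start_time(void) {
    start_cycles = DWT->CYCCNT;
}
void stop_time(void) {
    end_cycles = DWT->CYCCNT;
}
CORE_TICKS get_time(void) {
    // Elapsed ticks between start_time() and stop_time()
    return (CORE_TICKS)(end_cycles - start_cycles);
}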
### Compile and Run
```bash
# Cross-compile for ARM Cortex-M4
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 \
-DITERATIONS=10000 \
core_main.c core_list_join.c core_matrix.c \
core_state.c core_util.c core_portme.c \
-T linker.ld -o coremark_arm.elf
# Cross-compile for RISC-V (RV32IMAC)
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -O3 \
-DITERATIONS=10000 \
core_main.c core_list_join.c core_matrix.c \
core_state.c core_util.c core_portme.c \
-T linker.ld -o coremark_riscv.elf
# Run on QEMU ARM. lm3s6965evb models a Cortex-M3, so build the
# image with -mcpu=cortex-m3 for this machine
qemu-system-arm -M lm3s6965evb -nographic \
-semihosting -kernel coremark_arm.elf
# Run on QEMU RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
-kernel coremark_riscv.elf
```
Embench: Designed for Embedded
Embench is a modern, free and open-source embedded benchmark suite developed by a working group of academic and industry contributors under the FOSSi Foundation:
# Download
git clone https://github.com/embench/embench-iot.git
cd embench-iot
# Build ARM version
python3 build_all.py --arch arm --chip cortex-m4 \
--board qemu-arm
# Run (needs appropriate runner)
python3 benchmark_speed.py --target-module run_qemu
Embench includes 19 real-application kernels:
aha-mont64 Montgomery multiplication
crc32 CRC calculation
cubic Cubic equation solver
edn FIR filter
huffbench Huffman encoding
matmult-int Integer matrix multiply
md5sum MD5 hash
minver Matrix inversion
nbody N-body simulation
nettle-aes AES encryption
...
RTOS Benchmarks: Measuring the OS Itself
When using an RTOS, besides application performance, you need to measure OS overhead.
Context Switch Time
// FreeRTOS context switch benchmark
static TaskHandle_t task1, task2;
static volatile uint32_t switch_start, switch_end;
void Task1(void *pvParameters) {
    for (;;) {
        switch_start = get_cycles();
        xTaskNotifyGive(task2); // Wake Task2
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY); // Wait
    }
}
void Task2(void *pvParameters) {
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        switch_end = get_cycles();
        uint32_t elapsed = switch_end - switch_start;
        // Record or accumulate elapsed
        xTaskNotifyGive(task1);
    }
}
Typical results (depends on MCU and RTOS):
RTOS MCU Context Switch
─────────────────────────────────────────────────
FreeRTOS Cortex-M4@168MHz ~200 cycles
Zephyr Cortex-M4@168MHz ~300 cycles
RT-Thread Cortex-M4@168MHz ~250 cycles
Interrupt Latency
Time from interrupt trigger to ISR execution start:
// Timestamp taken just before the interrupt is raised;
// volatile because it is shared between main code and the ISR
static volatile uint32_t trigger_time;
// Set up GPIO interrupt (STM32)
void EXTI0_IRQHandler(void) {
    uint32_t entry_time = get_cycles(); // First line of ISR
    // Calculate latency
    uint32_t latency = entry_time - trigger_time;
    record_latency(latency);
    // Clear interrupt flag
    EXTI->PR = EXTI_PR_PR0;
}
// Trigger in main program
trigger_time = get_cycles();
// Trigger via software or external GPIO
EXTI->SWIER = EXTI_SWIER_SWIER0;
Important: Measure multiple times, report distribution!
Interrupt Latency Distribution (10000 samples):
Min: 12 cycles
Max: 89 cycles
Avg: 15 cycles
P99: 45 cycles
P99.9: 78 cycles
The design has to budget for the tail, not the 15-cycle average: P99.9 is 78 cycles and the observed maximum is 89 cycles.
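Percentiles like those in the distribution above can be computed offline from the recorded samples. This is a minimal sketch using the nearest-rank definition. The function names are hypothetical, and expressing the percentile in per-mille (990 for P99, 999 for P99.9) is a choice made here to avoid floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t, written to avoid overflow on subtraction. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort samples ascending, in place. */
static void sort_samples(uint32_t *v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_u32);
}

/* Nearest-rank percentile on already sorted data.
 * per_mille: 500 = median, 990 = P99, 999 = P99.9, 1000 = maximum. */
static uint32_t percentile_sorted(const uint32_t *sorted, size_t n,
                                  unsigned per_mille)
{
    assert(sorted != NULL && n > 0);
    assert(per_mille >= 1 && per_mille <= 1000);
    size_t rank = (n * per_mille + 999) / 1000;  /* ceil(n * p), 1-based */
    return sorted[rank - 1];
}
```

With 10,000 samples, P99.9 is the 9,990th smallest value, so only ten samples lie above it. That is why the maximum should be reported next to it.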
Semaphore/Mutex Overhead
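static SemaphoreHandle_t sem;
void measure_semaphore_overhead(void) {
    sem = xSemaphoreCreateBinary();
    configASSERT(sem != NULL);
    // Binary semaphores are created empty; give once so the first take succeeds
    xSemaphoreGive(sem);
    uint32_t total = 0, worst = 0;
    for (int i = 0; i < 10000; i++) {
        uint32_t start = get_cycles();
        xSemaphoreTake(sem, portMAX_DELAY);
        xSemaphoreGive(sem);
        uint32_t elapsed = get_cycles() - start;
        total += elapsed;
        if (elapsed > worst) worst = elapsed;
    }
    printf("Semaphore take+give: %lu cycles avg, %lu max\n",
           (unsigned long)(total / 10000), (unsigned long)worst);
}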
Determinism Measurement
A key RTOS characteristic is determinism. How do we quantify it?
Jitter Measurement
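#define SAMPLES 10000
// Signed: a wake-up can land before or after the reference point
static int32_t latencies[SAMPLES];
// Periodic task
void PeriodicTask(void *pvParameters) {
    TickType_t last_wake = xTaskGetTickCount();
    // The cycle counter and the tick counter do not start together;
    // capture their offset once so the two are comparable
    uint32_t origin = get_cycles() - (uint32_t)last_wake * CYCLES_PER_TICK;
    int idx = 0;
    for (;;) {
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(1));
        // last_wake now holds the tick at which this wake-up was due
        uint32_t expected = origin + (uint32_t)last_wake * CYCLES_PER_TICK;
        uint32_t actual = get_cycles();
        if (idx < SAMPLES) {
            latencies[idx++] = (int32_t)(actual - expected);
        }
    }
}
// Analyze jitter. Absolute values carry a constant offset (origin is
// captured mid-tick); the range max-min is the jitter itself
void analyze_jitter(void) {
    int32_t min = INT32_MAX, max = INT32_MIN;
    int64_t sum = 0;
    for (int i = 0; i < SAMPLES; i++) {
        if (latencies[i] < min) min = latencies[i];
        if (latencies[i] > max) max = latencies[i];
        sum += latencies[i];
    }
    printf("Jitter: min=%ld, max=%ld, avg=%ld, range=%ld\n",
           (long)min, (long)max, (long)(sum / SAMPLES), (long)(max - min));
}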
WCET Estimation
Worst-Case Execution Time (WCET) is critical for real-time system design:
#define WCET_SAMPLES 100000
uint32_t measure_wcet(void (*func)(void)) {
    uint32_t max_time = 0;
    for (int i = 0; i < WCET_SAMPLES; i++) {
        uint32_t start = get_cycles();
        func();
        uint32_t elapsed = get_cycles() - start;
        if (elapsed > max_time) {
            max_time = elapsed;
        }
    }
    return max_time;
}
Warning: Measured WCET is only the observed maximum; true WCET may be larger. Rigorous WCET analysis requires static analysis tools (like aiT, Bound-T).
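When only measurements are available, a common engineering practice is to inflate the observed maximum by a safety margin before comparing it with the deadline. The sketch below assumes a margin expressed in percent. The function names and any particular margin value are illustrative, and a margin is a judgment call, not a guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Observed maximum inflated by margin_pct percent, rounded up. */
static inline uint64_t wcet_with_margin(uint32_t observed_max_cycles,
                                        uint32_t margin_pct)
{
    uint64_t scaled = (uint64_t)observed_max_cycles * (100u + (uint64_t)margin_pct);
    return (scaled + 99u) / 100u;
}

/* True if the padded estimate still fits within the deadline. */
static inline bool fits_deadline(uint32_t observed_max_cycles,
                                 uint32_t margin_pct,
                                 uint64_t deadline_cycles)
{
    return wcet_with_margin(observed_max_cycles, margin_pct) <= deadline_cycles;
}
```

At 168MHz a 1ms deadline is 168,000 cycles. An observed maximum of 150,000 cycles fits on its own but fails the check with a 20% margin (180,000 cycles).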
Running on Simulators
QEMU + Semihosting
Semihosting lets bare-metal programs use host I/O:
// ARM semihosting
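static inline void semihosting_write(const char *s) {
    asm volatile (
        "mov r0, #0x04\n" // SYS_WRITE0
        "mov r1, %0\n"    // pointer to NUL-terminated string
        "bkpt #0xAB\n"    // semihosting call on Cortex-M
        :
        : "r"(s)
        : "r0", "r1", "memory" // "memory": the host reads the string buffer
    );
}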
# ARM
qemu-system-arm -M lm3s6965evb -nographic \
-semihosting-config enable=on,target=native \
-kernel firmware_arm.elf
# RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
-semihosting-config enable=on,target=native \
-kernel firmware_riscv.elf
Common Pitfalls
Pitfall 1: Only Reporting Averages
Bad: "Average latency 1ms"
Good: "Latency: avg=1ms, max=15ms, P99=3ms, P99.9=12ms"
Pitfall 2: Ignoring Interrupt Effects
During measurement, other interrupts can pollute results:
// Disable interrupts during measurement
__disable_irq();
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
__enable_irq();
But this doesn't represent reality. Real systems have interrupts—measure both "with interrupts" and "without interrupts" scenarios.
Pitfall 3: Simulator ≠ Real Hardware
QEMU cycle count: 1000 cycles
Real hardware: 3500 cycles
QEMU is a functional simulator, not cycle-accurate. Use it to verify program correctness, not for performance evaluation.
Pitfall 4: Cache Matters in Embedded Too
Many assume MCUs don't have cache. Wrong:
- Cortex-M7 has I-cache and D-cache
- Modern RISC-V MCUs may have cache
- Flash is slower than RAM: flash wait states and vendor prefetch/accelerator units make timing depend on where code runs from
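// Cortex-M7 cache control
SCB_EnableICache();
SCB_EnableDCache();
// Start each measurement from a cold cache. Clean before invalidating:
// a bare SCB_InvalidateDCache() on an enabled cache discards dirty lines
SCB_CleanInvalidateDCache();
SCB_InvalidateICache();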
Summary
Embedded/RTOS benchmarking differs fundamentally from GPOS:
Core Differences
- GPOS cares about average case
- RTOS/bare-metal cares about worst case
- Determinism matters more than throughput
Time Measurement
- ARM: DWT cycle counter, SysTick
- RISC-V: mcycle/minstret CSRs
- Handle overflow carefully
Portable Benchmarks
- CoreMark, Dhrystone, Embench: Easy to port
- STREAM: Needs enough memory
- lmbench: Core algorithms portable
RTOS Measurements
- Context switch time
- Interrupt latency (report distribution!)
- Semaphore/mutex overhead
- Jitter and WCET
Simulator Usage
- QEMU: Functional verification, not performance evaluation
- Renode: Better peripheral and RTOS support
- Simulators cannot measure power consumption