Chapter 9: Embedded & RTOS Benchmarks

Part II: Tools

"In embedded systems, the worst case is the only case that matters." — Jack Ganssle

The "Average 1ms, But Sometimes 100ms" Disaster

"Average latency 1ms, fully meets spec."

That was the vendor's benchmark report. We were using this MCU for motor control, with a requirement to update PWM output every 1ms. Average 1ms? Perfect.

After the system went live, the motor started stuttering. Not every time—just "occasionally."

We spent three days debugging. Finally we discovered: behind that "average 1ms," there was a 0.1% chance of jumping to 50-100ms. In typical benchmark reports, these outliers get averaged away—invisible.

But for motor control, 0.1% of 100ms delays = stuttering once per second.

This is the fundamental difference between embedded/RTOS benchmarking and GPOS benchmarking: we care about worst case, not average case.

GPOS vs RTOS vs Bare-metal

Let's clarify the differences between these three environments:

Feature	GPOS	RTOS	Bare-metal
Examples	Linux, Windows, macOS	FreeRTOS, Zephyr, RT-Linux	Running directly on hardware
Scheduling	Time-slicing, variable priority	Fixed priority, preemptive	None (or super loop)
Memory	Virtual memory, paging	Usually flat memory	Flat memory
Interrupt latency	Not guaranteed (may be ms)	Guaranteed upper bound (usually μs)	Minimal (cycles)
Jitter	High (background processes)	Low (deterministic)	Lowest
Tool support	Rich (perf, VTune)	Medium (trace, SEGGER)	Basic (GPIO toggle)

Why This Matters

On a GPOS, if an operation usually takes 1ms but occasionally takes 10ms, most applications can tolerate it.

On RTOS/bare-metal:

Motor control: 100ms delay = motor loses control
Automotive ABS: 10ms delay = brake failure
Medical devices: delay = potentially fatal

RTOS benchmarks must report worst-case, not just average.

Time Measurement: What If There's No OS?

On a GPOS, we read time with clock_gettime() or the rdtsc instruction. On bare metal there is no OS clock API, so we read hardware counters directly.

ARM Cortex-M: DWT Cycle Counter

The Data Watchpoint and Trace (DWT) unit provides a cycle counter:

// Enable DWT cycle counter (need to enable trace first)
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

// Read cycle count
static inline uint32_t get_cycles(void) {
    return DWT->CYCCNT;
}

// Usage
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
uint32_t elapsed = end - start;  // cycles

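Note: DWT->CYCCNT is a 32-bit counter, so it wraps around. At 168MHz that happens about every 25.6 seconds (2^32 / 168,000,000). Unsigned subtraction (end - start) still gives the right result across a single wrap, but any interval longer than one wrap period is measured incorrectly. Cortex-M0/M0+ cores have no DWT cycle counter; use SysTick or a hardware timer there.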
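
The wraparound behavior can be checked on a host machine before trusting it on target. The following is a minimal sketch, not part of any vendor library: cycles_elapsed, cycle_ext_t, cycle_ext_update and wrap_period_s are hypothetical helper names, and the 64-bit extension assumes it is called at least once per wrap period from a single context.

```c
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so the result is correct across
 * one wraparound as long as the true interval is shorter than 2^32 cycles. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}

/* Software extension of a 32-bit counter to 64 bits (hypothetical helper).
 * Must be updated at least once per wrap period; not interrupt-safe as written. */
typedef struct {
    uint32_t last;   /* previous raw counter value */
    uint64_t total;  /* accumulated cycles since initialization */
} cycle_ext_t;

static inline uint64_t cycle_ext_update(cycle_ext_t *c, uint32_t raw) {
    c->total += (uint32_t)(raw - c->last);  /* wrap-safe delta */
    c->last = raw;
    return c->total;
}

/* Whole seconds until a 32-bit cycle counter wraps; cpu_hz must be non-zero. */
static inline uint32_t wrap_period_s(uint32_t cpu_hz) {
    return (uint32_t)((UINT64_C(1) << 32) / cpu_hz);
}
```

On target, the raw value passed to cycle_ext_update would be DWT->CYCCNT, sampled from a periodic timer or the idle loop often enough that no wrap is missed.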

RISC-V: mcycle/minstret CSRs

RISC-V has standard cycle and instruction counters:

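// Read the 64-bit cycle counter from machine mode on RV32.
// rdcycle/rdinstret read the user-level cycle/instret CSRs and return only
// 32 bits per read on RV32, so read mcycleh:mcycle and re-read the high
// half until it is stable. (On RV64 a single "csrr %0, mcycle" is enough.)
static inline uint64_t get_mcycle(void) {
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("csrr %0, mcycleh" : "=r"(hi));
        asm volatile ("csrr %0, mcycle"  : "=r"(lo));
        asm volatile ("csrr %0, mcycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

// Same pattern for retired instructions (minstreth:minstret)
static inline uint64_t get_minstret(void) {
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("csrr %0, minstreth" : "=r"(hi));
        asm volatile ("csrr %0, minstret"  : "=r"(lo));
        asm volatile ("csrr %0, minstreth" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}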

// Calculate CPI
uint64_t cycles_start = get_mcycle();
uint64_t instr_start = get_minstret();

my_function();

uint64_t cycles = get_mcycle() - cycles_start;
uint64_t instrs = get_minstret() - instr_start;
double cpi = (double)cycles / instrs;
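
Many small MCUs have no FPU, so the double division above pulls in soft-float routines. A fixed-point alternative is sketched here; cpi_milli is a hypothetical helper name, and the scale factor of 1000 is an arbitrary choice for illustration.

```c
#include <stdint.h>

/* CPI scaled by 1000 (1500 means 1.5 cycles per instruction),
 * rounded to nearest. Returns 0 if no instructions were retired. */
static inline uint32_t cpi_milli(uint64_t cycles, uint64_t instrs) {
    if (instrs == 0) {
        return 0;
    }
    return (uint32_t)((cycles * 1000u + instrs / 2) / instrs);
}
```

The result can be printed as two integers (value / 1000 and value % 1000) without any floating-point support in printf.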

Running on QEMU

# ARM Cortex-M3 (lm3s6965evb)
qemu-system-arm -M lm3s6965evb -nographic -kernel firmware.elf

# RISC-V (sifive_e - FE310)
qemu-system-riscv32 -M sifive_e -nographic -kernel firmware.elf

QEMU's cycle counter is "functional," not cycle-accurate. Numbers can verify program logic but don't represent real hardware cycle counts.

Porting Open-Source Benchmarks to Bare-metal

Good news: most CPU/memory benchmarks port easily:

Benchmark	Porting Difficulty	Dependencies	Notes
Dhrystone	Easy	libc only	Replace time()/times() calls with a hardware timer
CoreMark	Easy	libc only	Official bare-metal support
Embench	Easy	None	Designed for embedded
Whetstone	Easy	libm	Needs floating-point support
STREAM	Medium	None	Needs enough memory
lmbench	Hard	POSIX	Core algorithms portable

CoreMark Bare-metal Port

CoreMark officially supports bare-metal; just implement a few porting functions:

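// core_portme.c - ARM Cortex-M implementation

// 1. Timing start/end
// Assumes CORE_TICKS is a 32-bit unsigned type, so the subtraction in
// get_time() stays correct across one DWT->CYCCNT wraparound.
static CORE_TICKS start_cycles, end_cycles;

void start_time(void) {
    start_cycles = DWT->CYCCNT;
}

void stop_time(void) {
    end_cycles = DWT->CYCCNT;
}

// Elapsed ticks between start_time() and stop_time()
CORE_TICKS get_time(void) {
    return end_cycles - start_cycles;
}

Compile and Run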

# Cross-compile for ARM Cortex-M4
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_arm.elf

# Cross-compile for RISC-V (RV32IMAC)
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_riscv.elf

# Run on QEMU ARM (lm3s6965evb emulates a Cortex-M3: build with -mcpu=cortex-m3 for this target)
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting -kernel coremark_arm.elf

# Run on QEMU RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -kernel coremark_riscv.elf

Embench: Designed for Embedded

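Embench is a modern open-source embedded benchmark suite developed by a working group from academia and industry, organized under the Free and Open Source Silicon (FOSSi) Foundation: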

# Download
git clone https://github.com/embench/embench-iot.git
cd embench-iot

# Build ARM version
python3 build_all.py --arch arm --chip cortex-m4 \
    --board qemu-arm

# Run (needs appropriate runner)
python3 benchmark_speed.py --target-module run_qemu

Embench includes 19 real-application kernels:

aha-mont64     Montgomery multiplication
crc32          CRC calculation
cubic          Cubic root solver
edn            FIR filter
huffbench      Huffman encoding
matmult-int    Integer matrix multiply
md5sum         MD5 hash
minver         Matrix inversion
nbody          N-body simulation
nettle-aes     AES encryption
...

RTOS Benchmarks: Measuring the OS Itself

When using an RTOS, besides application performance, you need to measure OS overhead.

Context Switch Time

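// FreeRTOS context switch benchmark
// Task2 must run at a higher priority than Task1 so that xTaskNotifyGive()
// preempts Task1 immediately. The measured span also includes the overhead
// of the notify API calls, not just the switch itself.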
static TaskHandle_t task1, task2;
static volatile uint32_t switch_start, switch_end;

void Task1(void *pvParameters) {
    for (;;) {
        switch_start = get_cycles();
        xTaskNotifyGive(task2);  // Wake Task2
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  // Wait
    }
}

void Task2(void *pvParameters) {
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        switch_end = get_cycles();

        uint32_t elapsed = switch_end - switch_start;
        // Record or accumulate elapsed

        xTaskNotifyGive(task1);
    }
}

Typical results (depends on MCU and RTOS):

RTOS            MCU              Context Switch
─────────────────────────────────────────────────
FreeRTOS        Cortex-M4@168MHz     ~200 cycles
Zephyr          Cortex-M4@168MHz     ~300 cycles
RT-Thread       Cortex-M4@168MHz     ~250 cycles
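
Cycle counts only become comparable with a deadline once they are converted to time at the actual core clock. A minimal sketch of that conversion follows; cycles_to_ns is a hypothetical helper, and the 168MHz figure is simply the clock used in the table above.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given core clock.
 * Uses a 64-bit intermediate to avoid overflow; truncates toward zero.
 * cpu_hz must be non-zero. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz) {
    return ((uint64_t)cycles * UINT64_C(1000000000)) / cpu_hz;
}
```

At 168MHz, 200 cycles comes out at 1190ns, roughly 1.2μs, which is small next to a 1ms control period but not next to a 10μs one.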

Interrupt Latency

Time from interrupt trigger to ISR execution start:

// Set up GPIO interrupt (STM32)
static volatile uint32_t trigger_time;  // written just before the trigger
void EXTI0_IRQHandler(void) {
    uint32_t entry_time = get_cycles();  // First line of ISR

    // Calculate latency
    uint32_t latency = entry_time - trigger_time;
    record_latency(latency);

    // Clear interrupt flag
    EXTI->PR = EXTI_PR_PR0;
}

// Trigger in main program
trigger_time = get_cycles();
// Trigger via software or external GPIO
EXTI->SWIER = EXTI_SWIER_SWIER0;

Important: Measure multiple times, report distribution!

Interrupt Latency Distribution (10000 samples):
  Min:    12 cycles
  Max:    89 cycles
  Avg:    15 cycles
  P99:    45 cycles
  P99.9:  78 cycles

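For hard real-time design, budget for the observed maximum (89 cycles) plus margin. P99.9 (78 cycles) shows how heavy the tail is, but a deadline sized to it would still be missed by about 1 in 1,000 interrupts.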
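
A distribution like the one above can be produced from the raw samples with a sort and a nearest-rank lookup. This is a sketch under assumptions: sort_samples and percentile_sorted are hypothetical names, the percentile is expressed in tenths of a percent (990 = P99, 999 = P99.9), and on a small MCU the samples would more likely be dumped to a host and analyzed there.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort(): ascending order of uint32_t values. */
static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort samples in place, ascending. Call once before percentile_sorted(). */
static void sort_samples(uint32_t *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], cmp_u32);
}

/* Nearest-rank percentile of already-sorted samples.
 * per_mille is in tenths of a percent: 990 = P99, 999 = P99.9, 1000 = max. */
static uint32_t percentile_sorted(const uint32_t *sorted, size_t n,
                                  uint32_t per_mille) {
    if (n == 0) {
        return 0;
    }
    size_t rank = (n * per_mille + 999) / 1000;  /* ceil(n * p / 1000) */
    if (rank == 0) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}
```

With 10000 samples, P99.9 is the 9990th smallest value, so only ten samples lie above it. That is why large sample counts are needed before tail percentiles mean anything.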

Semaphore/Mutex Overhead

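static SemaphoreHandle_t sem;

void measure_semaphore_overhead(void) {
    uint32_t total = 0, max = 0;

    // A binary semaphore is created empty; give it once so the first take succeeds
    sem = xSemaphoreCreateBinary();
    if (sem == NULL) return;
    xSemaphoreGive(sem);

    for (int i = 0; i < 10000; i++) {
        uint32_t start = get_cycles();
        xSemaphoreTake(sem, portMAX_DELAY);
        xSemaphoreGive(sem);
        uint32_t elapsed = get_cycles() - start;
        total += elapsed;
        if (elapsed > max) max = elapsed;
    }

    // Uncontended case only: no other task ever blocks on sem
    printf("Semaphore take+give: avg=%lu, max=%lu cycles\n",
           (unsigned long)(total / 10000), (unsigned long)max);
}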

Determinism Measurement

A key RTOS characteristic is determinism. How do we quantify it?

Jitter Measurement

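#define SAMPLES 10000
static uint32_t periods[SAMPLES];  // observed wake-to-wake periods, in cycles

// Periodic task (1ms period; requires configTICK_RATE_HZ >= 1000)
void PeriodicTask(void *pvParameters) {
    TickType_t last_wake = xTaskGetTickCount();
    uint32_t prev = get_cycles();
    int idx = 0;

    for (;;) {
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(1));

        // Measure against the previous wake-up rather than tick * CYCLES_PER_TICK:
        // the tick count and the cycle counter do not share a zero point.
        uint32_t now = get_cycles();
        if (idx < SAMPLES) {
            periods[idx++] = now - prev;
        }
        prev = now;
    }
}

// Analyze jitter (call after all SAMPLES periods have been recorded)
void analyze_jitter(void) {
    uint32_t min = UINT32_MAX, max = 0;
    uint64_t sum = 0;

    for (int i = 0; i < SAMPLES; i++) {
        if (periods[i] < min) min = periods[i];
        if (periods[i] > max) max = periods[i];
        sum += periods[i];
    }

    // range (max - min) is the peak-to-peak period jitter
    printf("Period (cycles): min=%lu, max=%lu, avg=%lu, range=%lu\n",
           (unsigned long)min, (unsigned long)max,
           (unsigned long)(sum / SAMPLES), (unsigned long)(max - min));
}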

WCET Estimation

Worst-Case Execution Time (WCET) is critical for real-time system design:

#define WCET_SAMPLES 100000

uint32_t measure_wcet(void (*func)(void)) {
    uint32_t max_time = 0;

    for (int i = 0; i < WCET_SAMPLES; i++) {
        uint32_t start = get_cycles();
        func();
        uint32_t elapsed = get_cycles() - start;

        if (elapsed > max_time) {
            max_time = elapsed;
        }
    }

    return max_time;
}

Warning: Measured WCET is only the observed maximum; true WCET may be larger. Rigorous WCET analysis requires static analysis tools (like aiT, Bound-T).
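
When only measurements are available, a common stopgap is to inflate the observed maximum by a safety margin before comparing it with the deadline. The sketch below shows that arithmetic; wcet_budget_cycles and meets_deadline are hypothetical names, and the margin percentage is an engineering assumption, not a proof of the true WCET.

```c
#include <stdbool.h>
#include <stdint.h>

/* Observed maximum inflated by margin_pct percent, rounded up, in cycles. */
static inline uint64_t wcet_budget_cycles(uint32_t observed_max,
                                          uint32_t margin_pct) {
    uint64_t scaled = (uint64_t)observed_max * (100u + (uint64_t)margin_pct);
    return (scaled + 99u) / 100u;
}

/* True if the inflated budget still fits within the deadline (in cycles). */
static inline bool meets_deadline(uint32_t observed_max, uint32_t margin_pct,
                                  uint32_t deadline_cycles) {
    return wcet_budget_cycles(observed_max, margin_pct) <= deadline_cycles;
}
```

For example, an observed maximum of 1000 cycles with a 20% margin gives a budget of 1200 cycles, which just fits a 1200-cycle deadline and fails an 1199-cycle one.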

Running on Simulators

QEMU + Semihosting

Semihosting lets bare-metal programs use host I/O:

// ARM semihosting
static inline void semihosting_write(const char *s) {
    asm volatile (
        "mov r0, #0x04\n"  // SYS_WRITE0
        "mov r1, %0\n"
        "bkpt #0xAB\n"
        :
        : "r"(s)
        : "r0", "r1", "memory"  // host reads the string from target memory
    );
}

# ARM
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_arm.elf

# RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_riscv.elf

Common Pitfalls

Pitfall 1: Only Reporting Averages

Bad:  "Average latency 1ms"
Good: "Latency: avg=1ms, max=15ms, P99=3ms, P99.9=12ms"
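
To see how an average hides the tail, consider one second of a 1kHz loop in which a single update takes 100ms. The sketch below uses hypothetical data and illustrative function names (mean_ms, max_ms, fill_one_second) and is meant to run on a host machine.

```c
#include <stddef.h>

/* Mean of n latency samples, in milliseconds (0 for an empty set). */
static double mean_ms(const double *samples, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    return n ? sum / (double)n : 0.0;
}

/* Largest of n latency samples, in milliseconds. */
static double max_ms(const double *samples, size_t n) {
    double max = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] > max) {
            max = samples[i];
        }
    }
    return max;
}

/* Hypothetical data: every update takes 1 ms except one 100 ms outlier. */
static void fill_one_second(double *samples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        samples[i] = 1.0;
    }
    if (n > 0) {
        samples[n / 2] = 100.0;
    }
}
```

With 1000 samples the mean is (999 × 1 + 100) / 1000 = 1.099ms, which still rounds to "about 1ms" in a report, while the maximum is 100ms.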

Pitfall 2: Ignoring Interrupt Effects

During measurement, other interrupts can pollute results:

// Disable interrupts during measurement, then restore the previous state
uint32_t primask = __get_PRIMASK();
__disable_irq();
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
__set_PRIMASK(primask);

But this doesn't represent reality. Real systems have interrupts—measure both "with interrupts" and "without interrupts" scenarios.

Pitfall 3: Simulator ≠ Real Hardware

QEMU cycle count:  1000 cycles
Real hardware:     3500 cycles

QEMU is a functional simulator, not cycle-accurate. Use it to verify program correctness, not for performance evaluation.

Pitfall 4: Cache Matters in Embedded Too

Many assume MCUs don't have cache. Wrong:

Cortex-M7 has I-cache and D-cache
Modern RISC-V MCUs may have cache
Even without a cache, flash wait states and prefetch buffers make code run at different speeds from flash and from RAM

// Cortex-M7 cache control
SCB_EnableICache();
SCB_EnableDCache();

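// Clean and invalidate before measurement so the run starts cold.
// A plain SCB_InvalidateDCache() would discard dirty lines not yet written back.
SCB_CleanInvalidateDCache();
SCB_InvalidateICache();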

Summary

Embedded/RTOS benchmarking differs fundamentally from GPOS:

Core Differences

GPOS cares about average case
RTOS/bare-metal cares about worst case
Determinism matters more than throughput

Time Measurement

ARM: DWT cycle counter, SysTick
RISC-V: mcycle/minstret CSRs
Handle overflow carefully

Portable Benchmarks

CoreMark, Dhrystone, Embench: Easy to port
STREAM: Needs enough memory
lmbench: Core algorithms portable

RTOS Measurements

Context switch time
Interrupt latency (report distribution!)
Semaphore/mutex overhead
Jitter and WCET

Simulator Usage

QEMU: Functional verification, not performance evaluation
Renode: Better peripheral and RTOS support
Simulators cannot measure power consumption

Performance and Benchmarking