Chapter 9: Embedded & RTOS Benchmarks

Part II: Tools


"In embedded systems, the worst case is the only case that matters." — Jack Ganssle

The "Average 1ms, But Sometimes 100ms" Disaster

"Average latency 1ms, fully meets spec."

That was the vendor's benchmark report. We were using this MCU for motor control, with a requirement to update PWM output every 1ms. Average 1ms? Perfect.

After the system went live, the motor started stuttering. Not every time—just "occasionally."

We spent three days debugging. Finally we discovered: behind that "average 1ms," there was a 0.1% chance of jumping to 50-100ms. In typical benchmark reports, these outliers get averaged away—invisible.

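But at a 1kHz update rate, 0.1% of cycles is one late update per second on average, and each late update stalls the PWM output for up to 100ms. The result: stuttering about once per second.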

This is the fundamental difference between embedded/RTOS benchmarking and GPOS benchmarking: we care about worst case, not average case.
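To see how an average can hide the tail, here is a minimal host-side sketch in plain C. The type and function names (`latency_summary_t`, `summarize_latency`) are made up for illustration and do not come from any vendor tool. Fed 999 samples of 1000µs and a single sample of 100000µs, it reports a mean of 1099µs, which still reads as "about 1ms", next to a maximum of 100ms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary type, for illustration only. */
typedef struct {
    uint32_t mean_us;
    uint32_t max_us;
} latency_summary_t;

/* Compute the mean and the maximum of n latency samples (microseconds). */
latency_summary_t summarize_latency(const uint32_t *samples_us, size_t n) {
    assert(samples_us != NULL && n > 0);

    uint64_t sum = 0;   /* 64-bit so the sum cannot overflow */
    uint32_t max = 0;

    for (size_t i = 0; i < n; i++) {
        sum += samples_us[i];
        if (samples_us[i] > max) {
            max = samples_us[i];
        }
    }

    latency_summary_t s = { (uint32_t)(sum / n), max };
    return s;
}
```

A report that prints only `mean_us` would pass the 1ms spec; printing `max_us` next to it exposes the problem immediately.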

GPOS vs RTOS vs Bare-metal

Let's clarify the differences between these three environments:

| Feature | GPOS | RTOS | Bare-metal |
|---|---|---|---|
| Examples | Linux, Windows, macOS | FreeRTOS, Zephyr, RT-Linux | Running directly on hardware |
| Scheduling | Time-slicing, variable priority | Fixed priority, preemptive | None (or super loop) |
| Memory | Virtual memory, paging | Usually flat memory | Flat memory |
| Interrupt latency | Not guaranteed (may be ms) | Guaranteed upper bound (usually μs) | Minimal (cycles) |
| Jitter | High (background processes) | Low (deterministic) | Lowest |
| Tool support | Rich (perf, VTune) | Medium (trace, SEGGER) | Basic (GPIO toggle) |

Why This Matters

On a GPOS, if an operation "usually" takes 1ms but "occasionally" takes 10ms, most applications can tolerate it.

On RTOS/bare-metal:

  • Motor control: 100ms delay = motor loses control
  • Automotive ABS: 10ms delay = brake failure
  • Medical devices: delay = potentially fatal

RTOS benchmarks must report worst-case, not just average.

Time Measurement: What If There's No OS?

On a GPOS, we use clock_gettime() or the x86 rdtsc instruction. On bare-metal there is no OS clock API, so we read the hardware counters directly.

ARM Cortex-M: DWT Cycle Counter

The Data Watchpoint and Trace (DWT) unit provides a cycle counter:

// Enable DWT cycle counter (need to enable trace first)
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CYCCNT = 0;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

// Read cycle count
static inline uint32_t get_cycles(void) {
    return DWT->CYCCNT;
}

// Usage
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
uint32_t elapsed = end - start;  // cycles

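Note: DWT->CYCCNT is 32-bit and wraps around at any clock speed; at 168MHz that happens about every 25.6 seconds. Because the subtraction is unsigned, end - start is still correct as long as the measured interval is shorter than one wrap period.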
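For intervals longer than one wrap period, one common approach is to extend the 32-bit reading to 64 bits in software. The sketch below is plain C that takes the raw counter value as a parameter, so it can be tried on a host. The names `cycle_ext_t`, `elapsed_cycles` and `extend_cycles` are hypothetical, and the scheme assumes `extend_cycles` is called at least once per wrap period (for example from a periodic tick).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state: last raw reading plus accumulated wraps. */
typedef struct {
    uint32_t last_raw;
    uint64_t high;      /* multiples of 2^32 accumulated so far */
} cycle_ext_t;

/* Elapsed cycles between two raw 32-bit readings. Unsigned subtraction is
 * modulo 2^32, so the result is correct across a single wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end) {
    return end - start;
}

/* Extend a raw 32-bit reading to 64 bits. A reading smaller than the
 * previous one means the hardware counter wrapped since the last call. */
uint64_t extend_cycles(cycle_ext_t *st, uint32_t raw) {
    assert(st != NULL);
    if (raw < st->last_raw) {
        st->high += (uint64_t)1 << 32;
    }
    st->last_raw = raw;
    return st->high + raw;
}
```

On the target, `raw` would be `DWT->CYCCNT`. If the helper is called from both an ISR and task code, the update of `st` needs to be protected against concurrent access, which this sketch does not do.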

RISC-V: mcycle/minstret CSRs

RISC-V has standard cycle and instruction counters:

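// Read the 64-bit cycle counter. On RV32 it is split into two 32-bit CSRs,
// so re-read the high half until it is unchanged across the low-half read.
static inline uint64_t get_mcycle(void) {
#if __riscv_xlen == 32
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("rdcycleh %0" : "=r"(hi));
        asm volatile ("rdcycle %0"  : "=r"(lo));
        asm volatile ("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
#else
    uint64_t cycle;
    asm volatile ("rdcycle %0" : "=r"(cycle));
    return cycle;
#endif
}

// Read the 64-bit retired-instruction counter (same split on RV32)
static inline uint64_t get_minstret(void) {
#if __riscv_xlen == 32
    uint32_t hi, lo, hi2;
    do {
        asm volatile ("rdinstreth %0" : "=r"(hi));
        asm volatile ("rdinstret %0"  : "=r"(lo));
        asm volatile ("rdinstreth %0" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
#else
    uint64_t instret;
    asm volatile ("rdinstret %0" : "=r"(instret));
    return instret;
#endif
}

// Calculate CPI
uint64_t cycles_start = get_mcycle();
uint64_t instr_start = get_minstret();

my_function();

uint64_t cycles = get_mcycle() - cycles_start;
uint64_t instrs = get_minstret() - instr_start;
double cpi = (double)cycles / instrs;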
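The CPI calculation above uses `double`, which pulls in software floating point on cores without an FPU (an RV32IMAC part has no F extension). A fixed-point variant is one way around that. The sketch below is plain C, and the name `cpi_milli` is hypothetical; it returns CPI scaled by 1000 and guards against a zero instruction count.

```c
#include <assert.h>
#include <stdint.h>

/* CPI scaled by 1000, e.g. 1250 means 1.250 cycles per instruction.
 * Returns 0 when no instructions retired, to avoid dividing by zero.
 * Assumes cycles * 1000 fits in 64 bits, which holds for any
 * realistic measurement window. */
uint32_t cpi_milli(uint64_t cycles, uint64_t instrs) {
    if (instrs == 0) {
        return 0;
    }
    return (uint32_t)((cycles * 1000u) / instrs);
}
```

With the `cycles` and `instrs` values from the snippet above, `cpi_milli(cycles, instrs)` can be printed as an integer, for example as `cpi / 1000` and `cpi % 1000`.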

Running on QEMU

# ARM Cortex-M3 (lm3s6965evb)
qemu-system-arm -M lm3s6965evb -nographic -kernel firmware.elf

# RISC-V (sifive_e - FE310)
qemu-system-riscv32 -M sifive_e -nographic -kernel firmware.elf

QEMU's cycle counter is "functional," not cycle-accurate. Numbers can verify program logic but don't represent real hardware cycle counts.

Porting Open-Source Benchmarks to Bare-metal

Good news: most CPU/memory benchmarks port easily:

| Benchmark | Porting Difficulty | Dependencies | Notes |
|---|---|---|---|
| Dhrystone | Easy | libc only | Need to remove time() calls |
| CoreMark | Easy | libc only | Official bare-metal support |
| Embench | Easy | None | Designed for embedded |
| Whetstone | Easy | libm | Needs floating-point support |
| STREAM | Medium | None | Needs enough memory |
| lmbench | Hard | POSIX | Core algorithms portable |

CoreMark Bare-metal Port

CoreMark officially supports bare-metal; just implement a few porting functions:

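// core_portme.c - ARM Cortex-M implementation

// Cycle counts captured by start_time()/stop_time()
static CORE_TICKS start_cycles, end_cycles;

// 1. Timing start/end
void start_time(void) {
    start_cycles = DWT->CYCCNT;
}

void stop_time(void) {
    end_cycles = DWT->CYCCNT;
}

// Elapsed ticks between start_time() and stop_time()
CORE_TICKS get_time(void) {
    return end_cycles - start_cycles;
}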

Compile and Run

# Cross-compile for ARM Cortex-M4
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_arm.elf

# Cross-compile for RISC-V (RV32IMAC)
riscv32-unknown-elf-gcc -march=rv32imac -mabi=ilp32 -O3 \
    -DITERATIONS=10000 \
    core_main.c core_list_join.c core_matrix.c \
    core_state.c core_util.c core_portme.c \
    -T linker.ld -o coremark_riscv.elf

# Run on QEMU ARM (lm3s6965evb models a Cortex-M3, so build with -mcpu=cortex-m3 for this target)
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting -kernel coremark_arm.elf

# Run on QEMU RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -kernel coremark_riscv.elf

Embench: Designed for Embedded

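Embench is a modern open-source embedded benchmark suite, developed by a working group of academic and industry contributors under the Free and Open Source Silicon (FOSSi) Foundation: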

# Download
git clone https://github.com/embench/embench-iot.git
cd embench-iot

# Build ARM version
python3 build_all.py --arch arm --chip cortex-m4 \
    --board qemu-arm

# Run (needs appropriate runner)
python3 benchmark_speed.py --target-module run_qemu

Embench includes 19 real-application kernels:

aha-mont64     Montgomery multiplication
crc32          CRC calculation
cubic          Cubic root solver
edn            FIR filter
huffbench      Huffman encoding
matmult-int    Integer matrix multiply
md5sum         MD5 hash
minver         Matrix inversion
nbody          N-body simulation
nettle-aes     AES encryption
...

RTOS Benchmarks: Measuring the OS Itself

When using an RTOS, besides application performance, you need to measure OS overhead.

Context Switch Time

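// FreeRTOS context switch benchmark.
// Create Task2 at a higher priority than Task1 so that xTaskNotifyGive()
// preempts immediately; the interval then covers the notify call plus one switch.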
static TaskHandle_t task1, task2;
static volatile uint32_t switch_start, switch_end;

void Task1(void *pvParameters) {
    for (;;) {
        switch_start = get_cycles();
        xTaskNotifyGive(task2);  // Wake Task2
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  // Wait
    }
}

void Task2(void *pvParameters) {
    for (;;) {
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        switch_end = get_cycles();

        uint32_t elapsed = switch_end - switch_start;
        // Record or accumulate elapsed

        xTaskNotifyGive(task1);
    }
}

Typical results (depends on MCU and RTOS):

RTOS            MCU              Context Switch
─────────────────────────────────────────────────
FreeRTOS        Cortex-M4@168MHz     ~200 cycles
Zephyr          Cortex-M4@168MHz     ~300 cycles
RT-Thread       Cortex-M4@168MHz     ~250 cycles
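
Cycle counts become comparable across parts only once they are converted to time at a known clock. The helper below is a small sketch in plain C (the name `cycles_to_ns` is hypothetical). Taking the approximate figures in the table at face value, 200 cycles at 168MHz is about 1190ns, or roughly 1.2µs.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given core clock.
 * The intermediate product is 64-bit: 2^32 cycles * 10^9 still fits. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz) {
    assert(cpu_hz > 0);
    return ((uint64_t)cycles * 1000000000ull) / cpu_hz;
}
```

The result truncates toward zero, so it slightly under-reports; for a worst-case budget it may be safer to round up instead.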

Interrupt Latency

Time from interrupt trigger to ISR execution start:

// Set up GPIO interrupt (STM32)
static volatile uint32_t trigger_time;  // written by main code, read by the ISR
void EXTI0_IRQHandler(void) {
    uint32_t entry_time = get_cycles();  // First line of ISR

    // Calculate latency
    uint32_t latency = entry_time - trigger_time;
    record_latency(latency);

    // Clear interrupt flag
    EXTI->PR = EXTI_PR_PR0;
}

// Trigger in main program
trigger_time = get_cycles();
// Trigger via software or external GPIO
EXTI->SWIER = EXTI_SWIER_SWIER0;

Important: Measure multiple times, report distribution!

Interrupt Latency Distribution (10000 samples):
  Min:    12 cycles
  Max:    89 cycles
  Avg:    15 cycles
  P99:    45 cycles
  P99.9:  78 cycles

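For a hard deadline, design around the observed maximum of 89 cycles plus a safety margin. The P99.9 of 78 cycles shows how far the tail sits from the 15-cycle average, and is the number to use where an occasional miss is tolerable.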
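
A distribution like the one above can be computed from the recorded samples with a sort and a nearest-rank lookup. The following is a sketch in plain C using only the standard library. The function names are hypothetical, and nearest-rank is just one of several percentile definitions. Percentiles are passed in permille so that P99.9 needs no floating point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t, ascending. */
static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place, ascending. */
void sort_samples(uint32_t *samples, size_t n) {
    qsort(samples, n, sizeof samples[0], cmp_u32);
}

/* Nearest-rank percentile on sorted data.
 * permille = 990 for P99, 999 for P99.9, 1000 for the maximum. */
uint32_t percentile_permille(const uint32_t *sorted, size_t n,
                             uint32_t permille) {
    assert(sorted != NULL && n > 0);
    assert(permille >= 1 && permille <= 1000);
    size_t rank = (n * permille + 999) / 1000;  /* ceil(n * p), 1-based */
    return sorted[rank - 1];
}
```

After `sort_samples(latencies, count)`, the minimum is `latencies[0]`, the maximum is `latencies[count - 1]`, and `percentile_permille(latencies, count, 999)` gives P99.9.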

Semaphore/Mutex Overhead

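static SemaphoreHandle_t sem;

void measure_semaphore_overhead(void) {
    uint32_t total = 0, max = 0;

    // Binary semaphores are created empty, so give once before the first take
    sem = xSemaphoreCreateBinary();
    configASSERT(sem != NULL);
    xSemaphoreGive(sem);

    for (int i = 0; i < 10000; i++) {
        uint32_t start = get_cycles();
        xSemaphoreTake(sem, portMAX_DELAY);
        xSemaphoreGive(sem);
        uint32_t elapsed = get_cycles() - start;
        total += elapsed;
        if (elapsed > max) max = elapsed;  // track the worst case, not just the mean
    }

    printf("Semaphore take+give: avg=%lu, max=%lu cycles\n",
           (unsigned long)(total / 10000), (unsigned long)max);
}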

Determinism Measurement

A key RTOS characteristic is determinism. How do we quantify it?

Jitter Measurement

#define SAMPLES 10000
static uint32_t latencies[SAMPLES];

// Periodic task
void PeriodicTask(void *pvParameters) {
    TickType_t last_wake = xTaskGetTickCount();
    int idx = 0;

    for (;;) {
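        // Ideal wake-up time in cycles. Assumes CYCCNT was zeroed when the
        // tick counter started; otherwise each sample also includes a constant
        // offset, which leaves the range (max - min) unchanged.
        uint32_t expected = last_wake * CYCLES_PER_TICK;
        uint32_t actual = get_cycles();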

        if (idx < SAMPLES) {
            latencies[idx++] = actual - expected;
        }

        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(1));
    }
}

// Analyze jitter
void analyze_jitter(void) {
    uint32_t min = UINT32_MAX, max = 0;
    uint64_t sum = 0;

    for (int i = 0; i < SAMPLES; i++) {
        if (latencies[i] < min) min = latencies[i];
        if (latencies[i] > max) max = latencies[i];
        sum += latencies[i];
    }

    printf("Jitter: min=%lu, max=%lu, avg=%lu, range=%lu\n",
           (unsigned long)min, (unsigned long)max,
           (unsigned long)(sum / SAMPLES), (unsigned long)(max - min));
}
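
The `latencies` array above costs about 40KB of RAM (10000 samples of 4 bytes each), which many MCUs cannot spare. One alternative is to keep a small fixed-width histogram and track the exact maximum separately. This is a sketch in plain C under that assumption; the type and function names are hypothetical, and the bucket count of 32 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIST_BUCKETS 32u

typedef struct {
    uint32_t bucket_width;          /* cycles covered by each bucket */
    uint32_t counts[HIST_BUCKETS];  /* last bucket also collects overflow */
    uint32_t max_seen;              /* exact worst case, kept separately */
} latency_hist_t;

/* Reset all counters and set the bucket width (must be non-zero). */
void hist_init(latency_hist_t *h, uint32_t bucket_width) {
    assert(h != NULL && bucket_width > 0);
    memset(h, 0, sizeof *h);
    h->bucket_width = bucket_width;
}

/* Record one sample: constant time, no allocation. */
void hist_add(latency_hist_t *h, uint32_t cycles) {
    uint32_t idx = cycles / h->bucket_width;
    if (idx >= HIST_BUCKETS) {
        idx = HIST_BUCKETS - 1u;    /* clamp outliers into the last bucket */
    }
    h->counts[idx]++;
    if (cycles > h->max_seen) {
        h->max_seen = cycles;
    }
}
```

The trade-off is resolution: percentiles can only be read to the nearest bucket, while `max_seen` stays exact.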

WCET Estimation

Worst-Case Execution Time (WCET) is critical for real-time system design:

#define WCET_SAMPLES 100000

uint32_t measure_wcet(void (*func)(void)) {
    uint32_t max_time = 0;

    for (int i = 0; i < WCET_SAMPLES; i++) {
        uint32_t start = get_cycles();
        func();
        uint32_t elapsed = get_cycles() - start;

        if (elapsed > max_time) {
            max_time = elapsed;
        }
    }

    return max_time;
}

Warning: Measured WCET is only the observed maximum; true WCET may be larger. Rigorous WCET analysis requires static analysis tools (like aiT, Bound-T).

Running on Simulators

QEMU + Semihosting

Semihosting lets bare-metal programs use host I/O:

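// ARM semihosting: print a NUL-terminated string on the host console
static inline void semihosting_write(const char *s) {
    asm volatile (
        "mov r0, #0x04\n"  // SYS_WRITE0
        "mov r1, %0\n"
        "bkpt #0xAB\n"
        :
        : "r"(s)
        : "r0", "r1", "memory"  // the host reads the string from memory
    );
}

Enable semihosting in QEMU when launching the firmware:

# ARM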
qemu-system-arm -M lm3s6965evb -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_arm.elf

# RISC-V
qemu-system-riscv32 -M sifive_e -nographic \
    -semihosting-config enable=on,target=native \
    -kernel firmware_riscv.elf

Common Pitfalls

Pitfall 1: Only Reporting Averages

Bad:  "Average latency 1ms"
Good: "Latency: avg=1ms, max=15ms, P99=3ms, P99.9=12ms"

Pitfall 2: Ignoring Interrupt Effects

During measurement, other interrupts can pollute results:

// Disable interrupts during measurement
__disable_irq();
uint32_t start = get_cycles();
my_function();
uint32_t end = get_cycles();
__enable_irq();

But this doesn't represent reality. Real systems have interrupts—measure both "with interrupts" and "without interrupts" scenarios.

Pitfall 3: Simulator ≠ Real Hardware

QEMU cycle count:  1000 cycles
Real hardware:     3500 cycles

QEMU is a functional simulator, not cycle-accurate. Use it to verify program correctness, not for performance evaluation.

Pitfall 4: Cache Matters in Embedded Too

Many assume MCUs don't have cache. Wrong:

  • Cortex-M7 has I-cache and D-cache
  • Modern RISC-V MCUs may have cache
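  • Flash wait states and prefetch buffers make code run at different speeds from flash than from RAM

// Cortex-M7 cache control (CMSIS)
SCB_EnableICache();
SCB_EnableDCache();

// Start each measurement from a cold cache. Use clean + invalidate for the
// D-cache: a plain invalidate discards dirty lines that were never written back.
SCB_InvalidateICache();
SCB_CleanInvalidateDCache();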

Summary

Embedded/RTOS benchmarking differs fundamentally from GPOS:

Core Differences

  • GPOS cares about average case
  • RTOS/bare-metal cares about worst case
  • Determinism matters more than throughput

Time Measurement

  • ARM: DWT cycle counter, SysTick
  • RISC-V: mcycle/minstret CSRs
  • Handle overflow carefully

Portable Benchmarks

  • CoreMark, Dhrystone, Embench: Easy to port
  • STREAM: Needs enough memory
  • lmbench: Core algorithms portable

RTOS Measurements

  • Context switch time
  • Interrupt latency (report distribution!)
  • Semaphore/mutex overhead
  • Jitter and WCET

Simulator Usage

  • QEMU: Functional verification, not performance evaluation
  • Renode: Better peripheral and RTOS support
  • Simulators cannot measure power consumption