Chapter 29: Edge AI Performance

Part VII: AI/HPC


"The future of AI is at the edge." — Pete Warden

When the Cloud Isn't an Option

Elena was developing a hearing aid that used AI to separate speech from background noise. The algorithm worked beautifully in the lab—running on a workstation with an RTX 4090.

Now she had to make it run on a device the size of a fingernail, powered by a battery smaller than a watch battery, with a tiny fraction of the workstation's computing power.

"This model needs 50 GFLOPS," her signal processing colleague said, reviewing the specs. "The chip provides 0.5 GFLOPS. That's a 100x gap."

"And it needs to run in real-time," Elena added. "15ms latency max, or users will notice the delay between lip movement and audio. Oh, and the battery needs to last 16 hours."

This is edge AI in a nutshell: the same intelligence that runs on data center GPUs, compressed into devices that run on milliwatts. The constraints seem impossible until you realize that millions of such devices ship every month.

This chapter explores performance analysis for edge AI—where the rules are different, the constraints are brutal, and traditional GPU-centric thinking will lead you astray.


A Different World of Constraints

Edge AI operates under constraints that would seem absurd to cloud engineers:

Resource   Cloud (A100 GPU)   Edge (Typical MCU)   Ratio
─────────────────────────────────────────────────────────────────
Memory     80 GB              256 KB - 1 MB        80,000 - 320,000x
Compute    312 TFLOPS         1 - 100 MOPS         3M - 300M x
Power      400 W              10 mW - 1 W          400 - 40,000x
Cost       $10,000+           $1 - $10             1,000 - 10,000x

Depending on the dimension, that's a gap of roughly three to eight orders of magnitude. Techniques that work in the cloud (larger batch sizes, more parameters, higher precision) are simply not available.
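To get a feel for the size of the gap, the ratios in the table can be converted to orders of magnitude. The snippet below is only a sketch of that arithmetic; the `ratios` dictionary restates the table's Ratio column, and the function name is chosen here for illustration.

```python
import math

# Low and high ends of the "Ratio" column in the table above.
ratios = {
    "memory": (80_000, 320_000),
    "compute": (3e6, 3e8),
    "power": (400, 40_000),
    "cost": (1_000, 10_000),
}

def orders_of_magnitude(ratio):
    """Return log10 of a ratio, i.e. how many powers of ten it spans."""
    return math.log10(ratio)

for name, (low, high) in ratios.items():
    print(f"{name:8s} {orders_of_magnitude(low):.1f} to "
          f"{orders_of_magnitude(high):.1f} orders of magnitude")
```

Compute shows the widest gap and power the narrowest, which is one reason edge optimization work tends to start with the arithmetic budget.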

The Four Constraints of Edge AI

Memory: Your model must fit. Not "mostly fit" or "fit with swapping." The entire model, plus activations, plus input/output buffers, must fit in available RAM and Flash. There's no cloud to offload to.

Power: Battery life matters more than speed. A model that runs 2x faster but consumes 3x the energy per inference is worse, not better, because energy per inference is what drains the battery. Thermal limits also cap sustained performance.

Latency: Real-time means real-time. A hearing aid with 100ms delay is unusable. An autonomous vehicle with 200ms perception delay is dangerous. You need both low latency and consistent latency.

Cost: When you're shipping millions of units, every cent matters. A $2 chip that needs a $3 NPU accelerator is twice as expensive as a $2.50 chip that doesn't.
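The memory, latency, and power constraints lend themselves to a simple pass/fail check early in a project. What follows is a minimal sketch under stated assumptions: the class names, field names, and every number are hypothetical, weights and code are assumed to live in Flash, and activations and I/O buffers are assumed to live in RAM. Cost is left out because it is a purchasing decision rather than a measurement.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class DeviceBudget:
    ram_bytes: int            # RAM available for activations and I/O buffers
    flash_bytes: int          # Flash available for weights and code
    deadline_ms: float        # worst-case latency the product tolerates
    energy_budget_mj: float   # energy allowed per inference

@dataclass
class ModelProfile:
    ram_needed: int
    flash_needed: int
    latencies_ms: list        # measured per-inference latencies
    active_power_mw: float    # average power while inferring

def check_constraints(model, budget):
    """Return a pass/fail verdict for memory, latency, and energy."""
    mean_latency = mean(model.latencies_ms)
    # mW x ms = microjoules, so divide by 1000 to get millijoules.
    energy_mj = model.active_power_mw * mean_latency / 1000.0
    return {
        "memory": (model.ram_needed <= budget.ram_bytes
                   and model.flash_needed <= budget.flash_bytes),
        # Judge latency by the worst case, not the average.
        "latency": max(model.latencies_ms) <= budget.deadline_ms,
        "energy": energy_mj <= budget.energy_budget_mj,
    }

KB = 1024
# Hypothetical device and model, for illustration only.
budget = DeviceBudget(ram_bytes=256 * KB, flash_bytes=1024 * KB,
                      deadline_ms=15.0, energy_budget_mj=0.5)
profile = ModelProfile(ram_needed=140 * KB, flash_needed=450 * KB,
                       latencies_ms=[11.8, 12.1, 12.0, 16.4, 11.9],
                       active_power_mw=20.0)
print(check_constraints(profile, budget))
```

In this made-up profile the average latency is comfortably under the deadline, yet a single 16.4 ms outlier fails the latency check. That is the sense in which latency must be consistent as well as low.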

TensorFlow Lite

TensorFlow Lite is Google's lightweight inference framework for mobile and embedded devices.

Model Conversion

import tensorflow as tf

# Load SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")

# Basic conversion
tflite_model = converter.convert()

# Save
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

Quantization Options

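import numpy as np
import tensorflow as tf

# The three options below are alternatives. Start from a fresh converter
# and apply exactly one of them; the settings are not meant to be combined.

# Option 1: dynamic range quantization (simplest)
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Option 2: full integer quantization (requires a representative dataset)
def representative_dataset():
    # Random data keeps the example short. For real calibration, yield
    # samples drawn from the model's actual input distribution.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Option 3: float16 quantization
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]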
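It helps to see what integer quantization does to a single value. TensorFlow Lite's int8 scheme represents a real number as scale * (q - zero_point). The sketch below reproduces that mapping in plain Python for one tensor; it is a simplified illustration, not the converter's actual calibration code, and the function names and the example range of -1.0 to 3.0 are assumptions made here.

```python
def quantization_params(min_val, max_val, qmin=-128, qmax=127):
    """Compute scale and zero point for affine int8 quantization."""
    # The range must include 0.0 so that zero is represented exactly.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to int8, clamping values outside the range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate real value: scale * (q - zero_point)."""
    return scale * (q - zero_point)

# Hypothetical activation range observed during calibration.
scale, zero_point = quantization_params(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:5.2f} -> q={q:4d} -> {dequantize(q, scale, zero_point):.4f}")
```

The round-trip error for in-range values is bounded by the scale, which is why the representative dataset matters: it determines the range, and therefore the scale, of every tensor.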

Running Inference

import numpy as np
import tensorflow as tf

# Load model
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Get input/output info
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

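# Set input. The shape and dtype must match input_details[0]; a fully
# integer-quantized model expects int8 input rather than float32.
input_data = np.random.randn(1, 224, 224, 3).astype(np.float32)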
interpreter.set_tensor(input_details[0]['index'], input_data)

# Execute
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])

TFLite Benchmark Tool

# Download benchmark tool
# https://www.tensorflow.org/lite/performance/measurement

# Run benchmark
./benchmark_model \
    --graph=model.tflite \
    --num_threads=4 \
    --warmup_runs=10 \
    --num_runs=100

# Example output:
# Inference (avg): 15.2 ms
# Inference (std): 1.3 ms

TensorFlow Lite Micro

TFLite Micro is an ultra-lightweight inference framework for microcontrollers.

Design Goals

TFLite Micro features:

1. Minimal binary size

Using TFLite Micro

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Model data (typically stored in Flash)
extern const unsigned char model_data[];

// Tensor arena: working memory for activations (size depends on the model)
constexpr int kTensorArenaSize = 10 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

void setup() {
    // Load the model and verify its schema version
    const tflite::Model* model = tflite::GetModel(model_data);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        return;  // Model was built for a different schema version
    }

    // Register only the operators this model uses (5 slots)
    static tflite::MicroMutableOpResolver<5> resolver;
    resolver.AddConv2D();
    resolver.AddMaxPool2D();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    resolver.AddReshape();

    // Create interpreter
    static tflite::MicroInterpreter interpreter(
        model, resolver, tensor_arena, kTensorArenaSize);

    // Allocate tensors from the arena; this fails if the arena is too small
    if (interpreter.AllocateTensors() != kTfLiteOk) {
        return;
    }

    // Get input tensor
    TfLiteTensor* input = interpreter.input(0);

    // Set input data...

    // Run inference
    if (interpreter.Invoke() != kTfLiteOk) {
        return;
    }

    // Get output
    TfLiteTensor* output = interpreter.output(0);
}
```

MLPerf Tiny

MLPerf Tiny is the MLCommons benchmark suite for machine learning inference on microcontroller-class devices.

Benchmark Content

MLPerf Tiny Benchmarks:

Benchmark        Model             Task              Input
─────────────────────────────────────────────────────────────
Visual Wake      MobileNet v1      Image class       96×96 RGB
                 (0.25)            (person detection)

Keyword Spot     DS-CNN            Speech recog      49×10 MFCC
                                   (keyword detect)

Anomaly Detect   FC AutoEncoder    Anomaly detect    5×128 log-mel
                                   (machine sound)   (640 features)

Image Class      ResNet v1         Image class       32×32 RGB
                                   (CIFAR-10)

Performance Metrics

MLPerf Tiny metrics:

1. Latency
   - Single inference time
   - Unit: milliseconds

2. Throughput
   - Inferences per second
   - Unit: inferences/second

3. Energy
   - Energy per inference
   - Unit: μJ/inference

4. Accuracy
   - Must meet target accuracy
   - Example: Visual Wake Words > 80%

Typical Results

MLPerf Tiny v1.0 example results:

Hardware                  VWW Latency    KWS Latency    Energy
───────────────────────────────────────────────────────────────
STM32L4R5 (Cortex-M4)     250 ms         50 ms          1.2 mJ
MAX78000 (dedicated NPU)  2.5 ms         0.5 ms         12 μJ
GAP9 (RISC-V + NPU)       5 ms           1 ms           25 μJ

In these figures, the dedicated NPU delivers roughly a 100x improvement in both latency and energy over the Cortex-M4.

Edge AI Performance Analysis

Measurement Methods

1. Latency measurement
   - Use high-precision timer
   - Multiple runs for average
   - Note warm-up effects

2. Power measurement
   - Hardware power meter
   - Current sensor
   - Software estimation (imprecise)

3. Memory measurement
   - Static analysis (model size)
   - Runtime monitoring (peak RAM)
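A current sensor produces a stream of samples, not an energy figure, so the samples have to be integrated over the duration of an inference. The sketch below shows one way to do that with the trapezoidal rule. The function name, the constant supply voltage, the 1 ms sample period, and the trace values are all assumptions made for illustration.

```python
def energy_from_current_mj(current_ma, supply_v, sample_period_s):
    """Integrate evenly spaced current samples into energy, in millijoules.

    mA x V = mW, and mW x s = mJ. Assumes a constant supply voltage.
    """
    energy = 0.0
    for a, b in zip(current_ma, current_ma[1:]):
        # Trapezoidal rule: average of two neighbouring samples times dt.
        energy += 0.5 * (a + b) * supply_v * sample_period_s
    return energy

# Hypothetical trace: idle at 2 mA, then a burst at 18 mA during inference.
trace_ma = [2.0, 2.0, 18.0, 18.0, 18.0, 2.0]
energy = energy_from_current_mj(trace_ma, supply_v=3.0, sample_period_s=0.001)
print(f"Energy over the trace: {energy:.3f} mJ")
```

The sample rate needs to be high enough to resolve the inference burst. If inference takes 2 ms and the sensor samples every 1 ms, the result is little better than a guess.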

Latency Analysis

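// Latency measurement on ARM Cortex-M (M3/M4/M7) with the DWT cycle counter.
// Also include your device's CMSIS header, which defines CoreDebug, DWT,
// and SystemCoreClock.
#include "arm_math.h"
#include "tensorflow/lite/micro/micro_interpreter.h"

float measure_invoke_latency_ms(tflite::MicroInterpreter& interpreter) {
    // Enable the trace block, then reset and start the cycle counter
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    // Measure
    uint32_t start_cycles = DWT->CYCCNT;
    interpreter.Invoke();
    uint32_t end_cycles = DWT->CYCCNT;

    // Unsigned subtraction stays correct across a single counter wrap
    uint32_t cycles = end_cycles - start_cycles;
    return (float)cycles / SystemCoreClock * 1000.0f;
}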
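The cycle counter is 32 bits wide, so it wraps around, and that limits how long an interval it can time. The following sketch models the arithmetic in Python to show why unsigned subtraction survives one wrap and how long the counter runs before wrapping. The helper names and the 120 MHz clock are assumptions for illustration.

```python
MASK_32 = 0xFFFFFFFF

def elapsed_cycles(start, end):
    """Elapsed cycles on a 32-bit free-running counter (one wrap at most)."""
    return (end - start) & MASK_32

def cycles_to_ms(cycles, core_clock_hz):
    """Convert a cycle count to milliseconds."""
    return cycles / core_clock_hz * 1000.0

def max_interval_s(core_clock_hz):
    """Longest interval measurable before the counter wraps twice."""
    return 2**32 / core_clock_hz

CLOCK_HZ = 120_000_000  # hypothetical 120 MHz core clock

# The counter wrapped between the two reads; the result is still correct.
cycles = elapsed_cycles(start=0xFFFFFF00, end=0x00000100)
print(cycles, "cycles =", cycles_to_ms(cycles, CLOCK_HZ), "ms")
print(f"Counter wraps every {max_interval_s(CLOCK_HZ):.1f} s")
```

For anything longer than the wrap period, a different time base is needed, for example a hardware timer with a prescaler.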

Optimization Strategies

Model Optimization

1. Quantization
   - INT8 quantization (4x smaller weights than FP32)
   - INT4 quantization (8x smaller weights than FP32)
   - Mixed precision

2. Pruning
   - Remove unimportant weights
   - Structured pruning maps better to hardware than unstructured pruning

3. Knowledge Distillation
   - Train a small student model to mimic a large teacher model
   - Preserve most of the accuracy while reducing size

4. Architecture Search
   - NAS to find optimal architecture
   - Optimize for target hardware
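The compression figures for quantization follow directly from the number of bits stored per weight. As a rough sketch, the helper below computes weight storage at different precisions. The 250,000-parameter model is hypothetical, and the calculation ignores the small overhead of storing scales and zero points.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights, rounded up to a whole byte."""
    return (num_params * bits_per_weight + 7) // 8

NUM_PARAMS = 250_000  # hypothetical model size

fp32_size = weight_bytes(NUM_PARAMS, 32)
for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    size = weight_bytes(NUM_PARAMS, bits)
    print(f"{name}: {size / 1024:.0f} KiB "
          f"({fp32_size / size:.0f}x smaller than FP32)")
```

Note that this covers Flash only. Activation memory in RAM shrinks with quantization too, but by how much depends on the model's layer shapes, not on its parameter count.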

Hardware Acceleration

Edge AI accelerators:

1. NPU (Neural Processing Unit)
   - Dedicated matrix operation units
   - Examples: Apple Neural Engine, Google Edge TPU

2. DSP (Digital Signal Processor)
   - Vector operations
   - Example: Qualcomm Hexagon

3. GPU (Mobile)
   - General parallel computing
   - Examples: Adreno, Mali

4. FPGA
   - Programmable hardware
   - Suitable for custom requirements

Kenji's Shipping Day

Six months after that first failed demo, Kenji's team shipped their product.

The final model was nothing like what they'd started with. The original MobileNetV2 had been replaced with a custom architecture—smaller, faster, and specifically designed for their hardware. They'd used knowledge distillation to train it, quantization to shrink it, and careful profiling to optimize every layer.

The result: 15 FPS inference on a $3 MCU, with 94% accuracy on their target task. Battery life: 18 months on a coin cell.

"The cloud version was 99% accurate," Kenji admitted. "We lost 5 percentage points. But we gained something more important: we can actually ship."

At the launch party, his manager asked what he'd learned.

"Edge AI isn't about making cloud AI smaller," Kenji said. "It's a different discipline entirely. Different constraints, different tools, different trade-offs. You can't just shrink a model and hope it works. You have to design for the edge from the beginning."

He paused. "Also, buy a good current meter. You'll need it."

Edge AI performance is where all the constraints collide: compute, memory, power, latency, accuracy, and cost. Mastering it requires understanding not just ML, but embedded systems, power electronics, and the art of making hard trade-offs. It's challenging—but when you ship a product that runs AI on a device that costs less than a cup of coffee, it's deeply satisfying.


Summary

Edge AI performance analysis means working within constraints that cloud deployments rarely face. The key points:

Key Frameworks and Benchmarks

  • TensorFlow Lite: Mobile device standard
  • TFLite Micro: Microcontrollers
  • MLPerf Tiny: Standard benchmark
  • Core ML / NNAPI: Platform-specific acceleration

Core Performance Metrics

  • Latency: Inference time
  • Energy: Energy consumption
  • Memory: RAM/Flash usage
  • Accuracy: Post-quantization accuracy

Main Optimization Strategies

  • Quantization (INT8, INT4)
  • Pruning
  • Knowledge distillation
  • Hardware acceleration

Practical Measurement Methods

  • Cycle counter (latency)
  • Current sensor (power)
  • Static analysis + runtime monitoring (memory)