Chapter 20: Compiler Size Optimization

Part VI: Embedded Constraints


"Premature optimization is the root of all evil, but so is premature pessimization." — Unknown

The Optimization Level Myth

Every C/C++ developer knows -O2 and -O3 make programs run faster. But in the embedded world, there's an often-overlooked friend: -Os and -Oz.

In the previous chapter's story, we used systematic analysis to identify the floating-point library as the "space killer." But that was just the beginning—true size optimization requires understanding what the compiler does behind the scenes.

This chapter answers a core question with experimental data: How much code size difference do different compiler options actually produce?


Optimization Level Comparison

GCC Optimization Level Definitions

Level       Goal               Characteristics
────────────────────────────────────────────────────────────
-O0         Debug              No optimization, max debuggability
-O1         Basic              Reduce code size, moderate optimization
-O2         Standard           Balance speed and size
-O3         Aggressive         Maximum speed, may increase code size
-Os         Size optimization  Based on -O2, but prefer smaller code
-Oz         Minimum size       Clang and GCC 12+, more aggressive than -Os
-Og         Debug optimization Suitable for use with debugger
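
One way to confirm which mode a file was really built in is to check the macros that GCC and Clang predefine: __OPTIMIZE__ is defined in all optimizing compilations, and __OPTIMIZE_SIZE__ is defined when the compiler is optimizing for size. The following is a minimal sketch; the function name build_opt_mode is made up for illustration.

```c
/* Hypothetical helper: reports how this translation unit was compiled,
 * based on macros predefined by GCC and Clang. */
const char *build_opt_mode(void)
{
#if defined(__OPTIMIZE_SIZE__)
    return "size";     /* e.g. -Os */
#elif defined(__OPTIMIZE__)
    return "speed";    /* e.g. -O1, -O2, -O3 */
#else
    return "none";     /* -O0 */
#endif
}
```

Printing this string once at boot is a cheap guard against a build system silently dropping the intended flag.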

Experiment: Compiling the Same Program

We use a typical embedded application (RTOS task management + UART driver) as our test baseline:

# Compile same source with different optimization levels
$ riscv64-unknown-elf-gcc -O0 -o test_O0.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O1 -o test_O1.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O2 -o test_O2.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O3 -o test_O3.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -Os -o test_Os.elf main.c drivers/*.c

Measurement results (.text section size):

Opt Level   .text (bytes)    vs -O0        vs -Os
────────────────────────────────────────────────────────
-O0           28,672         100.0%        +78.1%
-O1           18,432          64.3%        +14.5%
-O2           20,480          71.4%        +27.2%
-O3           24,576          85.7%        +52.7%
-Os           16,096          56.1%        baseline
-Oz*          14,848          51.8%        -7.8%

* Clang; GCC added -Oz only in version 12

Key observations:

  1. -O3 is larger than -O2: Aggressive inlining and loop unrolling increase code size
  2. -Os is ~21% smaller than -O2 (16,096 vs 20,480 bytes): Significant for memory-constrained systems
  3. -O0 is largest: With no optimization, every statement is translated literally and variables are reloaded from and stored back to the stack around each use
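
To see why -O3 output can grow, consider loop unrolling. The sketch below writes out by hand roughly what an unrolling pass does to a simple summation loop: same result, fewer branches, but several copies of the loop body. The function names are made up for illustration, and real compiler output will differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact form: one copy of the loop body. */
uint32_t sum_rolled(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Hand-unrolled by 4: roughly the shape an unrolling pass produces.
 * Fewer loop branches per element, but four copies of the body plus
 * a remainder loop, so more instructions end up in .text. */
uint32_t sum_unrolled4(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        sum += buf[i];
        sum += buf[i + 1];
        sum += buf[i + 2];
        sum += buf[i + 3];
    }
    for (; i < len; i++) {   /* leftover 0-3 elements */
        sum += buf[i];
    }
    return sum;
}
```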

Speed vs Size Trade-off

  Code Size
  ▲
  │                                   ● -O3 (fastest, but usually largest)
  │
  │   ● -O0 (slow and large, no optimization)
  │
  │                           ● -O2 (fast, balanced size)
  │
  │                   ● -O1 (basic optimization)
  │
  │                       ● -Os (small size, good speed)
  │               ● -Oz (smallest size, sacrifices speed)
  │
  └───────────────────────────────────────────────────────► Speed

Rules of thumb:

  • Development/debug: -Og
  • Release (speed priority): -O2 or -O3
  • Release (space priority): -Os or -Oz
  • Extremely constrained: -Oz + LTO + --gc-sections

Advanced Compiler Options

Basic optimization levels are just the beginning. Here are advanced size optimization techniques.

1. Dead Code Elimination

Combine -ffunction-sections and -fdata-sections at compile time with the linker's --gc-sections:

# Compile: place each function/data in its own section
$ gcc -ffunction-sections -fdata-sections -Os -c main.c -o main.o

# Link: remove unused sections
$ gcc -Wl,--gc-sections main.o -o firmware.elf

Effect example:

Options                           .text size
──────────────────────────────────────────────
Without gc-sections               24,576 bytes
With gc-sections                  18,432 bytes
Savings                           6,144 bytes (25%)

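This is especially effective with libraries such as newlib. The linker pulls in whole object files from a static library, so when the one function you call shares an object file with others, those come along too unless --gc-sections can discard their sections. Note that the library itself must have been compiled with -ffunction-sections -fdata-sections for this to work.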
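
As a sketch of what the linker sees, suppose a driver file contains two functions but the application calls only one. With -ffunction-sections, GCC places each function in its own section named after it (.text.uart_checksum and .text.uart_selftest here), so --gc-sections can drop the unreferenced one. Without the flag, both land in one .text section and are kept or dropped together. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Called from the application: its section stays in the image. */
uint8_t uart_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum ^= data[i];   /* simple XOR checksum */
    }
    return sum;
}

/* Never called by the application: with -ffunction-sections and
 * -Wl,--gc-sections the linker can discard .text.uart_selftest. */
int uart_selftest(void)
{
    static const uint8_t pattern[4] = {0x55, 0xAA, 0x0F, 0xF0};
    return uart_checksum(pattern, sizeof pattern) == 0x00 ? 0 : -1;
}
```

Passing -Wl,--print-gc-sections at link time makes GNU ld list the sections it removed, which is a quick way to confirm the flags are taking effect.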

2. Link-Time Optimization (LTO)

Link-time optimization (LTO) lets the compiler optimize across translation units at link time:

# Both compile and link need -flto
$ gcc -flto -Os -c main.c -o main.o
$ gcc -flto -Os -c uart.c -o uart.o
$ gcc -flto -Os main.o uart.o -o firmware.elf

LTO advantages:

  • Cross-file inlining decisions
  • More precise dead code elimination
  • Better constant propagation
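
The first and third points often work together. In the sketch below, the two halves would normally live in separate files (uart.c and main.c, marked by comments); they share one listing only so that the example compiles on its own. Without LTO, the compiler building main.c sees just a declaration of uart_divisor and must emit a call. With LTO, it can inline the function and fold the arithmetic into a constant. The names and the 16 MHz clock are assumptions for illustration.

```c
#include <stdint.h>

/* ---- uart.c ---- */
#define UART_CLOCK_HZ 16000000u   /* assumed peripheral clock */

uint32_t uart_divisor(uint32_t baud)
{
    /* 16x oversampling, rounded to nearest */
    return (UART_CLOCK_HZ + 8u * baud) / (16u * baud);
}

/* ---- main.c ---- */
uint32_t uart_divisor(uint32_t baud);   /* all that main.c sees without LTO */

uint32_t board_uart_divisor(void)
{
    /* Constant argument: with LTO this can reduce to "return 9". */
    return uart_divisor(115200u);
}
```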

LTO costs:

  • Significantly increased compile time (possibly 2-5x)
  • Debug information may be harder to trace
  • Some linker scripts need adjustment

Effect example:

Options             .text size    Compile time
───────────────────────────────────────────────
-Os                 16,096        1.2s
-Os -flto           14,336        3.8s
Savings             1,760 (11%)   +217%

3. Inlining Control

Inlining is the most critical speed vs size trade-off:

// Force inline: the body is copied into every call site (may increase code size)
static inline __attribute__((always_inline))
void critical_function(void) { ... }

// Prevent inline: one shared copy of the body, reached through a call
__attribute__((noinline))
void large_function(void) { ... }

When to force inline:

  • Very small helper functions (1-3 lines)
  • Functions on hot paths
  • When call overhead exceeds the function itself

When to prevent inline:

  • Large functions (over 20-30 lines)
  • Functions with multiple call sites
  • Error handling paths
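
Putting both lists together, a minimal sketch might force-inline a one-line helper while keeping a rarely executed error path out of line, so that its body exists once no matter how many call sites there are. The names (reg_set_bits, report_error, checked_div) are hypothetical.

```c
#include <stdint.h>

/* Tiny helper: the call overhead would exceed the body itself. */
static inline __attribute__((always_inline))
uint32_t reg_set_bits(uint32_t reg, uint32_t mask)
{
    return reg | mask;
}

static int last_error;   /* hypothetical error slot */

/* Cold error path: one shared copy instead of one per call site. */
__attribute__((noinline))
int report_error(int code)
{
    last_error = code;
    return -1;
}

int checked_div(int num, int den, int *out)
{
    if (den == 0) {
        return report_error(1);   /* rare: stays a plain call */
    }
    *out = num / den;
    return 0;
}
```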

Standard Library Selection

The standard C library is one of the biggest "hidden costs" in embedded systems.

newlib vs newlib-nano

Library           printf support       .text increase
───────────────────────────────────────────────────────
newlib            Full                 ~50-80 KB
newlib-nano       Basic (no float)     ~8-15 KB
Custom minimal    Integer only         ~1-2 KB

Using newlib-nano:

# GCC's --specs option
$ arm-none-eabi-gcc --specs=nano.specs -Os main.c -o firmware.elf

Verifying the effect:

$ size firmware_newlib.elf
   text    data     bss     dec     hex filename
  52480     256    4096   56832    de00 firmware_newlib.elf

$ size firmware_nano.elf
   text    data     bss     dec     hex filename
  12288     256    4096   16640    4100 firmware_nano.elf

Difference: 40,192 bytes (about 39 KB). That is a large share of a 64 KB or 128 KB flash.

Avoiding printf

If you only need to output integers or simple strings, consider a lightweight custom version:

// Full printf: ~15-50 KB
printf("Value: %d\n", value);

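// Custom lightweight version: ~200 bytes
// uart_puts() is assumed to come from the UART driver.
void print_int(const char* prefix, int value) {
    char buf[12];            // sign + 10 digits + NUL for a 32-bit int
    uart_puts(prefix);
    itoa(value, buf, 10);    // itoa is non-standard; check that your libc provides it
    uart_puts(buf);
    uart_puts("\n");
}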
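
Because itoa is not part of standard C, some C libraries omit it. In that case a small conversion routine is easy to write. The following is a minimal sketch; the name int_to_dec is made up, and it assumes int is at most 32 bits so that a 12-byte buffer suffices.

```c
#include <stddef.h>

/* Hypothetical replacement for the non-standard itoa(value, buf, 10).
 * buf must hold at least 12 bytes (sign + 10 digits + NUL for a
 * 32-bit int). Returns buf. */
char *int_to_dec(int value, char *buf)
{
    char tmp[11];
    size_t n = 0;
    /* Work on the magnitude as unsigned so INT_MIN does not overflow. */
    unsigned int mag = (value < 0) ? 0u - (unsigned int)value
                                   : (unsigned int)value;
    do {
        tmp[n++] = (char)('0' + mag % 10u);   /* least significant digit first */
        mag /= 10u;
    } while (mag != 0u);

    char *p = buf;
    if (value < 0) {
        *p++ = '-';
    }
    while (n > 0) {
        *p++ = tmp[--n];   /* reverse into the output buffer */
    }
    *p = '\0';
    return buf;
}
```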

Size Impact of Specific Features

Certain C/C++ features have significant impact on code size:

Floating-Point Operations

Feature                          Extra size on MCU without FPU
────────────────────────────────────────────────────────────────
float arithmetic                 +10-15 KB (software emulation)
double arithmetic                +25-50 KB (software emulation)
printf("%f")                     +15-25 KB (formatting)
math.h (sin, cos, etc.)          +10-30 KB (depends on usage)

Best practices:

  • Use fixed-point arithmetic whenever possible
  • If float is necessary, avoid double
  • Never use printf("%f") on MCUs without FPU
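
As a minimal sketch of the first recommendation, the helpers below implement Q16.16 fixed-point arithmetic (16 integer bits, 16 fractional bits) using only integer operations. The type and function names are made up for illustration, and overflow handling is omitted.

```c
#include <stdint.h>

typedef int32_t q16_16;              /* real value = raw / 65536 */
#define Q16_ONE ((q16_16)1 << 16)

/* Convert an integer to Q16.16 (x must fit in 16 signed bits). */
static inline q16_16 q16_from_int(int32_t x)
{
    return (q16_16)(x * Q16_ONE);
}

/* Multiply: widen to 64 bits, then drop the extra 16 fraction bits. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q16_ONE);
}

/* Divide: pre-scale the dividend so the quotient keeps 16 fraction bits.
 * b must not be zero. */
static inline q16_16 q16_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * Q16_ONE) / b);
}
```

For example, q16_mul(q16_from_int(3), Q16_ONE / 2) yields the raw value 98304, which represents 1.5. On 32-bit targets the 64-bit intermediates may themselves pull in small runtime helpers, so it is worth checking the map file after switching.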

C++ Features

Feature                          Typical size impact
────────────────────────────────────────────────────────────────
Virtual functions                One vtable per class (about one pointer per virtual function) + one pointer per object
RTTI (typeid, dynamic_cast)      +2-10 KB
Exception handling               +10-50 KB
STL containers                   Depends on usage, can be tens of KB

Recommended embedded C++ compile options:

$ g++ -fno-rtti -fno-exceptions -Os ...

Experiment: Complete Optimization Workflow

Let's demonstrate the optimization workflow with a complete example:

Initial state:

$ riscv64-unknown-elf-gcc -O2 main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  45056     512    8192   53760    d200 firmware.elf

Flash usage (.text): 44 KB, target: 32 KB.

Step 1: Switch to -Os

$ riscv64-unknown-elf-gcc -Os main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  36864     512    8192   45568    b200 firmware.elf

Saved: 8 KB (-18%). Still 4 KB over.

Step 2: Add gc-sections

$ riscv64-unknown-elf-gcc -Os -ffunction-sections -fdata-sections \
    -Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  30720     512    8192   39424    9a00 firmware.elf

Saved: 6 KB (-17%). Target achieved! But let's continue.

Step 3: Add LTO

$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
    -Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  28672     512    8192   37376    9200 firmware.elf

Saved: 2 KB (-7%).

Step 4: Use newlib-nano

$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
    -Wl,--gc-sections --specs=nano.specs main.c drivers/*.c -o firmware.elf
$ size firmware.elf
   text    data     bss     dec     hex filename
  18432     512    8192   27136    6a00 firmware.elf

Saved: 10 KB (-36%).

Optimization summary:

Phase                       .text size    Cumulative savings
───────────────────────────────────────────────────────────
Original (-O2)              45,056        baseline
Step 1: -Os                 36,864        -18%
Step 2: gc-sections         30,720        -32%
Step 3: LTO                 28,672        -36%
Step 4: newlib-nano         18,432        -59%

From 44 KB to 18 KB: 59% of flash space saved!
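
The percentages in the summary table follow from simple arithmetic on the .text sizes. As a sketch, the helper below (the name savings_percent is made up) computes the reduction from one size to another, rounded to the nearest whole percent. Applied to each step against the 45,056-byte baseline, it reproduces the cumulative column.

```c
#include <stdint.h>

/* Reduction from `before` to `after` in whole percent, rounded to
 * nearest. Assumes before > 0 and after <= before; saved * 100 must
 * fit in 32 bits, which holds for images under about 40 MB. */
uint32_t savings_percent(uint32_t before, uint32_t after)
{
    uint32_t saved = before - after;
    return (saved * 100u + before / 2u) / before;
}
```

For instance, savings_percent(45056, 18432) gives 59, and the per-step figure for newlib-nano is savings_percent(28672, 18432), which gives 36.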


Common Pitfalls

1. Over-Inlining

// If -O3 inlines this at 20 call sites, the 50-line body is emitted 20 times
static inline void update_display(int x, int y, int color) {
    // 50 lines of drawing logic
    ...
}

Solution: Use -Os or manually mark __attribute__((noinline)).

2. Forgetting gc-sections

# Wrong: added -ffunction-sections at compile, forgot --gc-sections at link
$ gcc -ffunction-sections -fdata-sections -c file.c
$ gcc file.o -o output   # Forgot -Wl,--gc-sections!

# Right: pass --gc-sections to the linker as well
$ gcc -Wl,--gc-sections file.o -o output

3. Debug Symbols Impact

Debug information lives in separate ELF sections (.debug_*) that are not loaded onto the target, so it doesn't increase Flash usage; it only makes the ELF file larger:

$ riscv64-unknown-elf-gcc -g -Os main.c -o debug.elf
$ riscv64-unknown-elf-gcc -Os main.c -o release.elf

$ ls -lh *.elf
-rwxr-xr-x 1 user user 245K debug.elf
-rwxr-xr-x 1 user user  35K release.elf

# But .text size is the same:
$ size debug.elf release.elf
   text    data     bss     dec     hex filename
  18432     512    8192   27136    6a00 debug.elf
  18432     512    8192   27136    6a00 release.elf

Summary

  • Optimization levels: -Os is usually the best choice for embedded, 15-25% smaller than -O2
  • Dead code elimination: -ffunction-sections -fdata-sections -Wl,--gc-sections saves 10-30%
  • LTO: Cross-compile-unit optimization, additional 5-15% savings (but increases compile time)
  • Standard library: newlib-nano can save 30-50 KB compared to newlib
  • Avoid where you can: software floating point on FPU-less MCUs, printf("%f"), C++ exceptions/RTTI
  • Optimization order: First analyze with tools (previous chapter), then choose appropriate compiler options (this chapter)