Chapter 20: Compiler Size Optimization
Part VI: Embedded Constraints
"Premature optimization is the root of all evil, but so is premature pessimization." — Unknown
The Optimization Level Myth
Every C/C++ developer knows -O2 and -O3 make programs run faster. But in the embedded world, there's an often-overlooked friend: -Os and -Oz.
In the previous chapter's story, we used systematic analysis to identify the floating-point library as the "space killer." But that was just the beginning—true size optimization requires understanding what the compiler does behind the scenes.
This chapter answers a core question with experimental data: How much code size difference do different compiler options actually produce?
Optimization Level Comparison
GCC Optimization Level Definitions
Level   Goal                 Characteristics
────────────────────────────────────────────────────────────
-O0     Debug                No optimization, max debuggability
-O1     Basic                Reduce code size, moderate optimization
-O2     Standard             Balance speed and size
-O3     Aggressive           Maximum speed, may increase code size
-Os     Size optimization    Based on -O2, but prefer smaller code
-Oz     Minimum size         Clang, and GCC 12 or later; more aggressive than -Os
-Og     Debug optimization   Suitable for use with debugger
Experiment: Compiling the Same Program
We use a typical embedded application (RTOS task management + UART driver) as our test baseline:
# Compile same source with different optimization levels
$ riscv64-unknown-elf-gcc -O0 -o test_O0.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O1 -o test_O1.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O2 -o test_O2.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -O3 -o test_O3.elf main.c drivers/*.c
$ riscv64-unknown-elf-gcc -Os -o test_Os.elf main.c drivers/*.c
Measurement results (.text section size):
Opt Level   .text (bytes)   vs -O0    vs -Os
────────────────────────────────────────────────────────
-O0         28,672          100.0%    +78.1%
-O1         18,432           64.3%    +14.5%
-O2         20,480           71.4%    +27.2%
-O3         24,576           85.7%    +52.7%
-Os         16,096           56.1%    baseline
-Oz*        14,848           51.8%    -7.8%
* Clang only
Key observations:
- -O3 is larger than -O2: Aggressive inlining and loop unrolling increase code size
- -Os is about 21% smaller than -O2 (16,096 vs 20,480 bytes): Significant for memory-constrained systems
- -O0 is largest: No optimization, each statement generates separate instructions
Speed vs Size Trade-off
Code Size
▲
│ ● -O0 (slow and large, no optimization)
│
│                                        ● -O3 (fastest, but largest of the optimized levels)
│
│                                ● -O2 (fast, balanced size)
│
│              ● -O1 (basic optimization)
│
│                      ● -Os (small size, good speed)
│        ● -Oz (smallest size, sacrifices speed)
│
└───────────────────────────────────────────────────────► Speed
Rules of thumb:
- Development/debug: -Og
- Release (speed priority): -O2 or -O3
- Release (space priority): -Os or -Oz
- Extremely constrained: -Oz + LTO + --gc-sections
Advanced Compiler Options
Basic optimization levels are just the beginning. Here are advanced size optimization techniques.
1. Dead Code Elimination
Use -ffunction-sections, -fdata-sections with linker's --gc-sections:
# Compile: place each function/data in its own section
$ gcc -ffunction-sections -fdata-sections -Os -c main.c -o main.o
# Link: remove unused sections
$ gcc -Wl,--gc-sections main.o -o firmware.elf
Effect example:
Options               .text size
──────────────────────────────────────────────
Without gc-sections   24,576 bytes
With gc-sections      18,432 bytes
Savings               6,144 bytes (25%)
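This is especially effective when using large libraries (like newlib). You might only call memcpy and strlen, but the linker works on whole sections: without per-function sections and --gc-sections, every unused function that shares a section with code you do use is linked in too.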
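To make the mechanism concrete, here is a minimal sketch of a source file in which one function is used and one is not. The file name and both function names are hypothetical, chosen only for illustration. With -ffunction-sections, GCC emits each function into a section named after it, which is what gives the linker something small enough to discard.

```c
/* sections_demo.c (hypothetical file name)
 *
 * Built with -ffunction-sections, each function below lands in its own
 * section (.text.scale_sample, .text.legacy_checksum) instead of a shared
 * .text. A link with -Wl,--gc-sections can then drop .text.legacy_checksum
 * if nothing reachable from the entry point references it.
 */
#include <stdint.h>

/* Assumed to be called by the application: its section is kept. */
int32_t scale_sample(int32_t raw)
{
    return raw * 2;
}

/* Assumed to be unused: a candidate for removal at link time. */
uint8_t legacy_checksum(const uint8_t *data, uint32_t len)
{
    uint8_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}
```

With GNU ld, adding -Wl,--print-gc-sections to the link command lists the sections that were discarded, which is a quick way to confirm that the unused function is really gone.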
2. Link-Time Optimization (LTO)
LTO lets the compiler optimize across translation units (compile units) at link time:
# Both compile and link need -flto
$ gcc -flto -Os -c main.c -o main.o
$ gcc -flto -Os -c uart.c -o uart.o
$ gcc -flto -Os main.o uart.o -o firmware.elf
LTO advantages:
- Cross-file inlining decisions
- More precise dead code elimination
- Better constant propagation
LTO costs:
- Significantly increased compile time (possibly 2-5x)
- Debug information may be harder to trace
- Some linker scripts need adjustment
Effect example:
Options      .text size     Compile time
───────────────────────────────────────────────
-Os          16,096         1.2s
-Os -flto    14,336         3.8s
Savings      1,760 (11%)    +217%
3. Inlining Control
Inlining is the most critical speed vs size trade-off:
// Force inline (may increase code size)
static inline __attribute__((always_inline))
void critical_function(void) { ... }
// Prevent inline (ensure minimum code size)
__attribute__((noinline))
void large_function(void) { ... }
When to force inline:
- Very small helper functions (1-3 lines)
- Functions on hot paths
- When call overhead exceeds the function itself
When to prevent inline:
- Large functions (over 20-30 lines)
- Functions with multiple call sites
- Error handling paths
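The guidelines above can be sketched in a single file. This is an illustration under assumptions, not code from the test project: reg_set_bit, report_fault, configure, and fault_count are hypothetical names, and the attributes are the GCC/Clang spellings used earlier in this section.

```c
#include <stdint.h>

/* Hypothetical fault counter, used only for illustration. */
static uint32_t fault_count;

/* Tiny hot-path helper: the body is cheaper than a call, so force inlining. */
static inline __attribute__((always_inline))
uint32_t reg_set_bit(uint32_t reg, unsigned bit)
{
    return reg | (UINT32_C(1) << bit);
}

/* Cold error path with several call sites: keep exactly one copy. */
static __attribute__((noinline))
void report_fault(uint32_t code)
{
    (void)code;          /* a real handler would log or store the code */
    fault_count++;
}

/* Both error branches call the same out-of-line report_fault body. */
uint32_t configure(uint32_t reg, unsigned bit)
{
    if (bit > 31u) {
        report_fault(1u);
        return reg;
    }
    if (reg == UINT32_MAX) {
        report_fault(2u);
        return reg;
    }
    return reg_set_bit(reg, bit);
}
```

The design choice is to spend code size only where it buys speed: the one-line helper disappears into its caller, while the error handler stays a single function no matter how many places report a fault.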
Standard Library Selection
The standard C library is one of the biggest "hidden costs" in embedded systems.
newlib vs newlib-nano
Library          printf support      .text increase
───────────────────────────────────────────────────────
newlib           Full                ~50-80 KB
newlib-nano      Basic (no float)    ~8-15 KB
Custom minimal   Integer only        ~1-2 KB
Using newlib-nano:
# GCC's --specs option
$ arm-none-eabi-gcc --specs=nano.specs -Os main.c -o firmware.elf
Verifying the effect:
$ size firmware_newlib.elf
text data bss dec hex filename
52480 256 4096 56832 de00 firmware_newlib.elf
$ size firmware_nano.elf
text data bss dec hex filename
12288 256 4096 16640 4100 firmware_nano.elf
Difference: roughly 40 KB of .text, which is huge on systems with 64 KB or 128 KB of flash.
Avoiding printf
If you only need to output integers or simple strings, consider a lightweight custom version:
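// Full printf: ~15-50 KB
printf("Value: %d\n", value);
// Custom lightweight version: ~200 bytes
// uart_puts() is assumed to be the project's own UART string output routine.
// itoa() is not standard C; newlib declares it in <stdlib.h>, other libcs may not.
void print_int(const char* prefix, int value) {
    uart_puts(prefix);
    char buf[12];  // sign + 10 digits + NUL covers a 32-bit int
    itoa(value, buf, 10);
    uart_puts(buf);
    uart_puts("\n");
}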
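Because itoa is not part of standard C, a toolchain may not provide it. The following is a minimal sketch of the same print_int idea with a hand-written conversion, assuming a 32-bit int. The uart_puts here is a hypothetical stub that appends to a RAM buffer (uart_log) so the sketch can run on a host; on a target it would write to the UART instead. The helper name int_to_dec is also an assumption.

```c
#include <stddef.h>

/* Hypothetical UART stub: collects output in RAM so the sketch runs on a host. */
static char uart_log[64];
static size_t uart_len;

static void uart_puts(const char *s)
{
    while (*s != '\0' && uart_len < sizeof uart_log - 1) {
        uart_log[uart_len++] = *s++;
    }
    uart_log[uart_len] = '\0';
}

/* Convert a signed int to decimal text. buf must hold at least 12 bytes
   for a 32-bit int: sign, 10 digits, terminator. */
static void int_to_dec(int value, char *buf)
{
    char tmp[11];
    size_t n = 0;
    /* Work in unsigned so that the most negative int does not overflow. */
    unsigned int u = (value < 0) ? 0u - (unsigned int)value
                                 : (unsigned int)value;

    do {                               /* digits come out least significant first */
        tmp[n++] = (char)('0' + u % 10u);
        u /= 10u;
    } while (u != 0u);

    if (value < 0) {
        *buf++ = '-';
    }
    while (n > 0) {                    /* reverse into the caller's buffer */
        *buf++ = tmp[--n];
    }
    *buf = '\0';
}

void print_int(const char *prefix, int value)
{
    char buf[12];
    uart_puts(prefix);
    int_to_dec(value, buf);
    uart_puts(buf);
    uart_puts("\n");
}
```

Calling print_int("Value: ", -42) leaves "Value: -42" followed by a newline in uart_log. No formatting engine is linked in, which is where the size saving over printf comes from.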
Size Impact of Specific Features
Certain C/C++ features have significant impact on code size:
Floating-Point Operations
Feature                    Extra size on MCU without FPU
────────────────────────────────────────────────────────────────
float arithmetic           +10-15 KB (software emulation)
double arithmetic          +25-50 KB (software emulation)
printf("%f")               +15-25 KB (formatting)
math.h (sin, cos, etc.)    +10-30 KB (depends on usage)
Best practices:
- Use fixed-point arithmetic whenever possible
- If float is necessary, avoid double
- Never use printf("%f") on MCUs without FPU
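As an illustration of the first recommendation, here is a minimal sketch of Q16.16 fixed-point arithmetic, where a 32-bit integer holds 16 integer bits and 16 fractional bits. The type and function names (q16_16, q16_from_int, q16_mul, q16_to_int) are hypothetical, and the sketch ignores overflow and rounding policy, which a real project would have to decide on.

```c
#include <stdint.h>

/* Q16.16 fixed point: value = raw / 65536. */
typedef int32_t q16_16;

#define Q16_ONE ((q16_16)1 << 16)

/* Integer to Q16.16 (caller must keep x within the 16-bit integer range). */
static inline q16_16 q16_from_int(int32_t x)
{
    return (q16_16)(x * Q16_ONE);
}

/* Q16.16 to integer, truncating the fraction toward zero. */
static inline int32_t q16_to_int(q16_16 x)
{
    return x / Q16_ONE;
}

/* Multiply through a 64-bit intermediate, then drop the extra 16 fraction bits. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q16_ONE);
}
```

For example, 1.5 is stored as 3 * Q16_ONE / 2 and 2.25 as 9 * Q16_ONE / 4; q16_mul of the two yields 27 * Q16_ONE / 8, which is 3.375. Only integer instructions are involved, so no floating-point emulation routines need to be linked.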
C++ Features
Feature                        Typical size impact
────────────────────────────────────────────────────────────────
Virtual functions              +8-16 bytes vtable per class
RTTI (typeid, dynamic_cast)    +2-10 KB
Exception handling             +10-50 KB
STL containers                 Depends on usage, can be tens of KB
Recommended embedded C++ compile options:
$ g++ -fno-rtti -fno-exceptions -Os ...
Experiment: Complete Optimization Workflow
Let's demonstrate the optimization workflow with a complete example:
Initial state:
$ riscv64-unknown-elf-gcc -O2 main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
45056 512 8192 53760 d200 firmware.elf
Flash usage: about 45 KB (.text alone is 45,056 bytes), target: 32 KB.
Step 1: Switch to -Os
$ riscv64-unknown-elf-gcc -Os main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
36864 512 8192 45568 b200 firmware.elf
Saved: 8.2 KB (-18%). Still 4.9 KB over.
Step 2: Add gc-sections
$ riscv64-unknown-elf-gcc -Os -ffunction-sections -fdata-sections \
-Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
30720 512 8192 39424 9a00 firmware.elf
Saved: 6.1 KB (-17%). Target achieved! But let's continue.
Step 3: Add LTO
$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
-Wl,--gc-sections main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
28672 512 8192 37376 9200 firmware.elf
Saved: 2 KB (-7%).
Step 4: Use newlib-nano
$ riscv64-unknown-elf-gcc -Os -flto -ffunction-sections -fdata-sections \
-Wl,--gc-sections --specs=nano.specs main.c drivers/*.c -o firmware.elf
$ size firmware.elf
text data bss dec hex filename
18432 512 8192 27136 6a00 firmware.elf
Saved: 10.2 KB (-36%).
Optimization summary:
Phase                  .text size    Cumulative savings
───────────────────────────────────────────────────────────
Original (-O2)         45,056        baseline
Step 1: -Os            36,864        -18%
Step 2: gc-sections    30,720        -32%
Step 3: LTO            28,672        -36%
Step 4: newlib-nano    18,432        -59%
From 45 KB to 18 KB: about 59% of flash space saved!
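The per-step percentages quoted in each step are relative to the previous build, while the summary table is relative to the original -O2 build, so the step figures do not simply add up to the cumulative one. A small sketch of both calculations, using the .text sizes from the table (the function and array names are hypothetical):

```c
#include <stdint.h>

/* Percentage saved going from `before` bytes to `after` bytes, rounded to
   the nearest whole percent. Assumes after <= before and before > 0. */
uint32_t percent_saved(uint32_t before, uint32_t after)
{
    return ((before - after) * 100u + before / 2u) / before;
}

/* .text sizes from the summary table, in build order:
   -O2, -Os, gc-sections, LTO, newlib-nano. */
static const uint32_t text_bytes[] = { 45056u, 36864u, 30720u, 28672u, 18432u };

/* Saving of step i relative to the build before it (i must be 1..4). */
uint32_t step_saving(unsigned i)
{
    return percent_saved(text_bytes[i - 1u], text_bytes[i]);
}

/* Saving of step i relative to the original -O2 build. */
uint32_t cumulative_saving(unsigned i)
{
    return percent_saved(text_bytes[0], text_bytes[i]);
}
```

With these inputs, step_saving(4) is 36 and cumulative_saving(4) is 59, matching the Step 4 figure and the last row of the table.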
Common Pitfalls
1. Over-Inlining
// With -O3 this function may be inlined at each of its 20 call sites: roughly 20x its original size
static inline void update_display(int x, int y, int color) {
// 50 lines of drawing logic
...
}
Solution: Use -Os or manually mark __attribute__((noinline)).
2. Forgetting gc-sections
# Wrong: added -ffunction-sections at compile, forgot --gc-sections at link
$ gcc -ffunction-sections -fdata-sections -c file.c
$ gcc file.o -o output # Forgot -Wl,--gc-sections!
3. Debug Symbols Impact
Debug symbols don't increase Flash usage, but make the ELF file larger:
$ riscv64-unknown-elf-gcc -g -Os main.c -o debug.elf
$ riscv64-unknown-elf-gcc -Os main.c -o release.elf
$ ls -lh *.elf
-rwxr-xr-x 1 user user 245K debug.elf
-rwxr-xr-x 1 user user 35K release.elf
# But .text size is the same:
$ size debug.elf release.elf
text data bss dec hex filename
18432 512 8192 27136 6a00 debug.elf
18432 512 8192 27136 6a00 release.elf
Summary
- Optimization levels: -Os is usually the best choice for embedded, 15-25% smaller than -O2
- Dead code elimination: -ffunction-sections -fdata-sections -Wl,--gc-sections saves 10-30%
- LTO: Cross-compile-unit optimization, additional 5-15% savings (but increases compile time)
- Standard library: newlib-nano can save 30-50 KB compared to newlib
- Avoid: Floating-point operations, printf("%f"), C++ exceptions/RTTI
- Optimization order: First analyze with tools (previous chapter), then choose appropriate compiler options (this chapter)