Understanding the Use of Volatile and Memory Barriers in Embedded C

Embedded C programming demands a level of precision that goes well beyond typical application development. Systems that control automotive engines, medical devices, or industrial robots must respond to external events within strict timing windows while managing hardware resources directly through memory-mapped I/O. In this environment the compiler’s optimisations, which are designed to produce faster and smaller code, can introduce subtle failures if their effects on hardware-visible memory operations are not properly controlled. Two fundamental tools for achieving that control are the volatile keyword and memory barriers (also called memory fences). Understanding when and how to apply them is essential for writing robust, predictable embedded firmware.

How the Compiler Optimizes Memory Accesses

To appreciate why volatile and memory barriers are needed, it helps to understand what compilers do when they optimise code. A C compiler, such as GCC or Clang, assumes that an ordinary variable changes only through the code it is compiling, as described by the C abstract machine. This assumption allows the compiler to:

Cache a variable's value in a CPU register across multiple reads, avoiding repeated memory accesses.
Reorder independent memory reads and writes to improve instruction pipelining.
Eliminate redundant writes if the compiler judges the value will be overwritten before being read again.
Merge consecutive reads into a single load if the variable appears not to change between them.

In a standard desktop program these optimisations are almost always safe because no other agent can change the variable behind the compiler's back. In an embedded system, however, a variable can be modified by hardware, by an interrupt service routine (ISR), or by a different CPU core. When that happens, the compiler’s optimised view of the world becomes dangerously wrong.
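The transformations listed above can be pictured with a minimal sketch. The variable and function names below are hypothetical, and the "as optimised" function only illustrates what a compiler is permitted to generate; actual output depends on the compiler and optimisation level.

```c
#include <stdint.h>

/* Hypothetical plain (non-volatile) variable standing in for a status word. */
static uint32_t status_word;

/* As written: two separate loads of status_word. */
uint32_t read_twice(void)
{
    uint32_t first  = status_word;
    uint32_t second = status_word;  /* may be merged with the first load */
    return first + second;
}

/* What an optimising compiler is allowed to generate instead: one load
 * whose result is kept in a register and reused. */
uint32_t read_twice_as_optimised(void)
{
    uint32_t cached = status_word;
    return cached + cached;
}

/* As written: two stores. As far as the compiler can tell the first one
 * is dead, because the value is overwritten before anything reads it. */
void write_twice(void)
{
    status_word = 1u;  /* may be removed entirely */
    status_word = 2u;
}
```

On a desktop the two read functions are indistinguishable. If status_word were a hardware register that changes between the two reads, or one where every write triggers an action, the optimised forms would behave differently from the source.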

The volatile Keyword: Halting Optimisation at the Variable Level

The volatile keyword is the simplest tool for telling the compiler that a variable is subject to change outside its control. Declaring a variable as volatile forces the compiler to treat every access, both reads and writes, as an observable side effect that must not be omitted, merged, or reordered with respect to other volatile accesses. Specifically, the C standard requires that, within a single thread of execution, volatile accesses are performed in the order they appear in the source code and that none is skipped. What counts as an access is implementation-defined, and the guarantee stops at the instruction the compiler emits: whether that load or store reaches a hardware register or is satisfied by a cache or write buffer depends on the memory attributes of the address.
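As a rough sketch of how this looks for a peripheral, the example below models a register block with volatile members. The uart_regs_t layout, the UART_TX_READY bit, and the fake_uart instance are all assumptions made for illustration; on real hardware the block would live at a fixed address taken from the device datasheet.

```c
#include <stdint.h>

/* Hypothetical register layout; volatile on each member makes every
 * access a real load or store in the generated code. */
typedef struct {
    volatile uint32_t STATUS;
    volatile uint32_t DATA;
} uart_regs_t;

#define UART_TX_READY (1u << 0)  /* hypothetical status bit */

/* Stand-in for the peripheral so the sketch runs on a host. On a target
 * this would be a cast of the datasheet address to uart_regs_t *. */
static uart_regs_t fake_uart;
#define UART (&fake_uart)

/* Returns 1 if the byte was written to the data register, 0 if the
 * transmitter was busy. STATUS is re-read on every call. */
int uart_try_send(uint8_t byte)
{
    if ((UART->STATUS & UART_TX_READY) == 0u) {
        return 0;
    }
    UART->DATA = byte;
    return 1;
}
```

Because STATUS is volatile, a caller that polls uart_try_send() in a loop gets a fresh read each time rather than a value cached from the first iteration.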

Common Use Cases for volatile

Memory-mapped hardware registers: A status register bit might be cleared by the hardware itself between a read and a write. Without volatile, the compiler could cache the read and return an outdated value.
Flags modified by an ISR: A shared variable set inside an interrupt handler and checked in the main loop must be declared volatile so that the main loop re-reads it each iteration instead of using a register-cached copy.
Data shared between two separate tasks on a single core: Even without an operating system, code paths that are logically concurrent (e.g., a background loop and a timer-triggered function) can require volatile to prevent optimisations that assume only one code path touches the variable.
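The ISR flag case can be sketched as follows. The names timer_isr and main_loop_poll are hypothetical, and how the handler is attached to the interrupt vector is target-specific and not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Set by the ISR, cleared by the main loop. volatile forces the main
 * loop to re-read it on every pass. */
static volatile bool tick_pending;

/* Touched only by the main loop, so it does not need volatile. */
static uint32_t ticks_handled;

/* Hypothetical timer interrupt handler. */
void timer_isr(void)
{
    tick_pending = true;
}

/* One pass of the main loop; returns true if a tick was handled. */
bool main_loop_poll(void)
{
    if (!tick_pending) {
        return false;
    }
    tick_pending = false;
    ticks_handled++;
    return true;
}

uint32_t ticks_handled_count(void)
{
    return ticks_handled;
}
```

Note that volatile gives no atomicity: if the interrupt fires again between the check and the clear, that second tick is lost. A counter or a brief interrupt-disabled section would be needed where every event must be counted.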

It is important to note that volatile only tells the compiler not to optimise the individual accesses. It does not provide any guarantee about the order of accesses relative to non-volatile variables, nor does it prevent the CPU core from reordering memory operations in a multi-core or multi-master bus scenario. That is where memory barriers come in.

Memory Barriers: Controlling Order and Visibility

A memory barrier is an instruction or a sequence of instructions that enforces a partial ordering on memory operations. In modern processors, memory accesses are often reordered by the CPU core itself to hide memory latency, and loads and stores can be reordered independently. On weakly ordered architectures such as ARM, this matters most on multi-core parts, for example a Cortex-A running symmetric multiprocessing, where barriers are essential to ensure that one core's writes become visible to another core in the order the programmer intended. Single-core Cortex-M devices fall under the same weakly ordered architectural model for normal memory, even though the simpler implementations rarely reorder in practice, so barriers still matter there when a DMA controller or another bus master observes memory.

Why Memory Barriers Are Needed Beyond volatile

Consider two tasks running on separate cores. Core A writes to a data buffer and then sets a flag variable (declared volatile) to signal that the data is ready. Core B reads the flag and then reads the data buffer. Without a memory barrier, the processor on Core A may make the store to the flag visible before the store to the data buffer, or Core B may speculatively load the data buffer before it sees the flag update. The volatile keyword alone cannot prevent such reordering at the hardware level because the CPU’s memory ordering rules apply to all loads and stores regardless of the volatile qualifier (which only affects the compiler, not the CPU). A memory barrier is inserted between the two writes on Core A and between the two reads on Core B to guarantee that the ordering is enforced at the processor level.

Types of Memory Barriers

Full memory barrier (e.g., ARM DMB SY or x86 MFENCE): Ensures that all memory accesses (loads and stores) before the barrier are observed before any memory accesses after the barrier. Often used to synchronise data shared between cores.
Load barrier (e.g., ARMv8 DMB LD or x86 LFENCE): Orders load instructions. All loads before the barrier complete before any loads after the barrier are performed; on ARMv8, DMB LD also orders earlier loads against later stores.
Store barrier (e.g., ARM DMB ST or x86 SFENCE): Orders only store instructions. All stores before the barrier are visible before any stores after the barrier.

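In addition to these ordering barriers, ARM architectures provide the data synchronization barrier (DSB), which waits until outstanding memory accesses and cache maintenance operations have completed, and the instruction synchronization barrier (ISB), which flushes the pipeline so that later instructions see the effect of context-changing operations such as system register writes. For the typical embedded developer concerned with shared data, the full memory barrier is the most common tool.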
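One way to keep these barrier flavours in a single place is a small set of macros selected per architecture. The sketch below is an assumption about how such a header might look for GCC or Clang; the macro names are hypothetical, and the 32-bit ARM branch deliberately uses DMB SY throughout because that option is accepted on every profile.

```c
/* Hypothetical barrier macros for GCC/Clang inline assembly. */
#if defined(__aarch64__)
#define BARRIER_BACKEND "aarch64 dmb"
#define BARRIER_FULL()  __asm__ volatile("dmb sy" ::: "memory")
#define BARRIER_LOAD()  __asm__ volatile("dmb ld" ::: "memory")
#define BARRIER_STORE() __asm__ volatile("dmb st" ::: "memory")
#elif defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* 32-bit ARM: DMB SY is the conservative choice for all three. */
#define BARRIER_BACKEND "arm dmb sy"
#define BARRIER_FULL()  __asm__ volatile("dmb sy" ::: "memory")
#define BARRIER_LOAD()  __asm__ volatile("dmb sy" ::: "memory")
#define BARRIER_STORE() __asm__ volatile("dmb sy" ::: "memory")
#elif defined(__x86_64__)
#define BARRIER_BACKEND "x86-64 fences"
#define BARRIER_FULL()  __asm__ volatile("mfence" ::: "memory")
#define BARRIER_LOAD()  __asm__ volatile("lfence" ::: "memory")
#define BARRIER_STORE() __asm__ volatile("sfence" ::: "memory")
#else
/* Fallback: let the compiler pick the instruction for a full fence. */
#define BARRIER_BACKEND "compiler builtin fence"
#define BARRIER_FULL()  __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define BARRIER_LOAD()  __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define BARRIER_STORE() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Reports which branch was compiled in, e.g. for a boot log. */
const char *barrier_backend(void)
{
    return BARRIER_BACKEND;
}
```

The "memory" clobber makes each macro a compiler barrier as well, so the compiler cannot move memory accesses across the emitted instruction.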

Combining volatile and Memory Barriers

The two mechanisms are complementary. volatile prevents the compiler from removing or reordering accesses to a specific variable. Memory barriers prevent the CPU from reordering accesses across the barrier. For a shared flag that guards a data structure, a typical pattern is:

Declare the flag as volatile to ensure the compiler always reads and writes it when asked.
Insert a full memory barrier after writing the data but before writing the flag on the producer side.
Insert a full memory barrier after reading the flag but before reading the data on the consumer side.

This pattern ensures that by the time the consumer sees the flag update, all the data writes that should be visible are indeed visible. Without the barrier, even with volatile, a weakly-ordered CPU could reorder the accesses and break the implied ordering contract.
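A minimal sketch of this pattern, assuming a GCC or Clang toolchain that provides the __atomic_thread_fence builtin as the full barrier. The function and variable names are hypothetical. A target-specific barrier instruction could be substituted, and a stronger one may be needed if the other observer is a DMA controller rather than a core.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: GCC/Clang builtin. It acts as a compiler barrier and emits
 * a hardware fence instruction on targets that need one. */
#define FULL_BARRIER() __atomic_thread_fence(__ATOMIC_SEQ_CST)

static uint32_t shared_data;      /* payload, a plain variable */
static volatile bool data_ready;  /* guard flag */

/* Producer: returns false if the previous value has not been consumed. */
bool producer_publish(uint32_t value)
{
    if (data_ready) {
        return false;
    }
    shared_data = value;  /* 1. write the data */
    FULL_BARRIER();       /* 2. data must be visible before the flag */
    data_ready = true;    /* 3. raise the flag */
    return true;
}

/* Consumer: returns true and fills *out if a value was available. */
bool consumer_try_read(uint32_t *out)
{
    if (!data_ready) {    /* 1. read the flag */
        return false;
    }
    FULL_BARRIER();       /* 2. flag read ordered before the data read */
    *out = shared_data;   /* 3. read the data */
    FULL_BARRIER();       /* data read finished before the slot is released */
    data_ready = false;
    return true;
}
```

The extra barrier before clearing the flag keeps the consumer's read of the payload ahead of the hand-back, so the producer cannot overwrite a value that is still being read.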

Compiler Barriers vs. Hardware Barriers

In many embedded projects, developers also use a technique called a compiler barrier. On GCC and Clang, the statement asm volatile("" ::: "memory") tells the compiler to treat that point as a full memory barrier for optimisation purposes: it prevents the compiler from reordering any memory accesses across it. However, this does not generate any hardware barrier instruction. If the CPU itself can reorder memory operations, a compiler barrier is insufficient. The correct approach is to use compiler builtins, platform-specific intrinsics, or inline assembly to emit the proper hardware barrier (e.g., __sync_synchronize() or __atomic_thread_fence(__ATOMIC_SEQ_CST) in GCC and Clang for a full hardware barrier, or the CMSIS __DMB() and __DSB() intrinsics on ARM Cortex-M).
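The difference can be sketched with a transmit buffer that is filled with plain stores and then handed to a DMA engine by writing a start register. Everything here is hypothetical: fake_dma_start stands in for a real register, and the right hardware barrier (DMB or DSB) plus any cache maintenance depends on the target.

```c
#include <stdint.h>
#include <string.h>

/* Compiler-only barrier: constrains the optimiser, emits no instruction. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Hardware barrier via a GCC/Clang builtin. It also acts as a compiler
 * barrier, so the two are not normally needed back to back. */
#define HARDWARE_BARRIER() __sync_synchronize()

static uint8_t tx_buffer[16];
static volatile uint32_t fake_dma_start;  /* stand-in for a DMA register */

void dma_send(const uint8_t *src, size_t len)
{
    if (len > sizeof tx_buffer) {
        len = sizeof tx_buffer;
    }
    memcpy(tx_buffer, src, len);  /* plain, non-volatile stores */

    /* Without a barrier the compiler may sink the buffer stores below the
     * volatile write, because volatile only orders volatile accesses.
     * COMPILER_BARRIER() alone would stop the compiler but not the CPU. */
    HARDWARE_BARRIER();

    fake_dma_start = (uint32_t)len;  /* trigger the transfer */
}
```

On a simple single-core part with no write buffer or cache between the CPU and the DMA engine, swapping HARDWARE_BARRIER() for COMPILER_BARRIER() may be enough, but that is a property of the specific device and should be checked against its documentation.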

Atomic Operations as an Alternative

Since the C11 standard, the language offers a standardised way to manage shared memory: atomic types and the stdatomic.h library. An atomic variable, when used with the appropriate memory ordering (e.g., memory_order_release and memory_order_acquire), provides atomicity together with the necessary ordering at both the compiler and the CPU level, so a separate volatile qualifier and hand-placed barriers are not needed for the shared variable. This approach is safer and more portable than manually inserting barriers, especially for simple flags and counters. Atomics are not a substitute for volatile when accessing memory-mapped hardware registers, however, and many legacy embedded codebases and bare-metal systems still rely on volatile plus explicit barriers because of toolchain limitations or the need to interface directly with hardware.
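A minimal sketch of the release/acquire approach for a flag guarding a payload, assuming a toolchain with stdatomic.h support. The function and variable names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t payload;           /* plain data guarded by the flag */
static atomic_bool payload_ready;  /* static storage, so starts as false */

void atomic_publish(uint32_t value)
{
    payload = value;
    /* Release: the payload store cannot move after this flag store. */
    atomic_store_explicit(&payload_ready, true, memory_order_release);
}

bool atomic_try_consume(uint32_t *out)
{
    /* Acquire: the payload load cannot move before this flag load. */
    if (!atomic_load_explicit(&payload_ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    /* Release again so the payload read finishes before the hand-back. */
    atomic_store_explicit(&payload_ready, false, memory_order_release);
    return true;
}
```

The compiler chooses the barrier instructions the target needs, which on some architectures means none at all for acquire and release, so this form is often cheaper than a pair of full barriers.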

Best Practices for Using volatile and Memory Barriers

Apply volatile only where necessary: Overuse can prevent valuable compiler optimisations. Use it for hardware registers and for variables shared between ISR and main code. For data shared between tasks or cores, consider using atomic operations or synchronisation primitives.
Always pair memory barriers with the shared data they protect: A barrier without a clear ordering contract is useless. Document the producer/consumer semantics so that future maintainers understand why the barrier is there.
Prefer platform-independent APIs when possible: If your toolchain supports C11 atomics, use them. If you must use intrinsics, isolate them in a small hardware abstraction layer to ease porting.
Test under worst-case conditions: Race conditions and memory ordering bugs often manifest only under heavy interrupt load or with caching enabled. Use stress tests that flush and invalidate caches to expose hidden ordering flaws.
Understand your target architecture’s memory model: On simple single-core microcontrollers (e.g., Cortex-M3/M4), accesses to device and strongly-ordered regions are ordered by default, and normal memory accesses are rarely reordered in practice. Parts with caches and write buffers (e.g., Cortex-M7) and multi-core Cortex-A processors require explicit barriers, and sometimes cache maintenance, for communication with other cores or bus masters. Study the ARM Architecture Reference Manual or the equivalent for your CPU.

Common Pitfalls and Misconceptions

Thinking volatile is a complete synchronisation solution: It is not. It only affects compiler optimisations, not CPU reordering. Always pair with memory barriers when multiple cores or bus masters are involved.
Inserting memory barriers but forgetting volatile: If the compiler doesn’t know a variable can change externally, it may optimise away the access entirely, and no amount of hardware barriers will help.
Using too many barriers: Excessive barriers degrade performance. Only place barriers exactly where ordering is required.
Assuming barriers are transitive across different communication paths: A barrier enforces ordering only from the perspective of the CPU that executes it. If two cores communicate through shared memory, both must follow the same barrier protocol: the producer orders its writes and the consumer orders its reads.

Real-World Example: Synchronising a Double-Buffer

Consider a system where a sensor driver writes data into a buffer, then toggles a ready flag. A main loop reads the flag and then reads the buffer. In a single-core system without caching, volatile on the flag and a compiler barrier might suffice because a simple core such as the Cortex-M3 performs stores to normal memory in program order in practice. But as soon as a DMA controller or a second core is introduced, both hardware barriers and proper memory attributes become essential. A robust implementation would:

Declare the flag as volatile.
Use a full hardware memory barrier after the buffer write and before the flag write (producer).
Use a full hardware memory barrier after the flag read and before the buffer read (consumer).
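Ensure the buffer memory region has suitable attributes in the MPU, if one is available: non-cacheable, or strongly-ordered or device type at some cost in performance, so that caching, speculative reads, or write buffering cannot hide the data from the other bus master.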
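The first three steps can be sketched as a two-buffer (ping-pong) scheme. The names, the SAMPLES size, and the use of the GCC/Clang __atomic_thread_fence builtin as the full barrier are assumptions for illustration. MPU configuration is device-specific and is not shown. The sketch also assumes the consumer finishes copying a buffer before the producer has published twice more.

```c
#include <stdint.h>
#include <string.h>

#define SAMPLES   4u
#define NO_BUFFER 0xFFu  /* marker: nothing published yet */

/* Assumption: GCC/Clang builtin used as the full hardware barrier. */
#define FULL_BARRIER() __atomic_thread_fence(__ATOMIC_SEQ_CST)

static uint16_t buffers[2][SAMPLES];
static volatile uint8_t ready_index = NO_BUFFER;  /* the volatile flag */
static uint8_t write_index;                       /* producer-private */

/* Producer: fill the inactive buffer, then publish its index. */
void sensor_publish(const uint16_t *samples)
{
    memcpy(buffers[write_index], samples, sizeof buffers[0]);
    FULL_BARRIER();             /* buffer write before flag write */
    ready_index = write_index;
    write_index ^= 1u;          /* next publish uses the other buffer */
}

/* Consumer: returns 1 and copies the latest buffer, or 0 if none yet. */
int main_loop_fetch(uint16_t *out)
{
    uint8_t idx = ready_index;  /* single read of the volatile flag */
    if (idx == NO_BUFFER) {
        return 0;
    }
    FULL_BARRIER();             /* flag read before buffer read */
    memcpy(out, buffers[idx], sizeof buffers[0]);
    return 1;
}
```

Publishing an index instead of a boolean lets the producer keep writing into the other buffer while the consumer copies the published one.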

Conclusion

Mastering the use of volatile and memory barriers is not optional for serious embedded development; it is a requirement for correctness in systems that interact with hardware, interrupts, or multiple processing elements. volatile protects against compiler optimisations that would otherwise skip or reorder accesses to variables that change unexpectedly. Memory barriers enforce ordering at the CPU or system level, ensuring that data written by one agent is visible to another in the intended sequence. By understanding the distinction and applying each tool appropriately, developers can eliminate a whole class of intermittent, hard-to-debug failures and produce firmware that behaves predictably under load.