How to Implement Over-the-air Updates in Embedded Operating Systems
Over-the-air (OTA) updates have become a fundamental capability for embedded systems operating in the field. Without the ability to update firmware remotely, devices are left vulnerable to security flaws, suffer from bugs that degrade performance, and lack the features that keep them competitive. For embedded operating systems—running on resource-constrained, often deeply integrated hardware—implementing OTA updates is both a technical challenge and a critical business requirement. This guide walks through the architecture, implementation steps, and best practices needed to build a reliable OTA update system for embedded devices.
What Are OTA Updates and Why Do They Matter?
OTA updates allow firmware, application software, configurations, and even the operating system itself to be updated over a wireless network—cellular, Wi‑Fi, Bluetooth, LoRaWAN, or satellite—without requiring physical access to the device. In industries like industrial IoT, automotive, medical devices, and smart home systems, devices are often deployed in remote or inaccessible locations. Sending a technician to physically reflash a unit is expensive, slow, and sometimes impossible. OTA updates solve that problem.
Beyond convenience, OTA updates are essential for:
- Security patching: Vulnerabilities in the operating system or application can be fixed promptly, reducing the window of exposure.
- Feature improvement: New capabilities can be added post‑deployment, extending the product lifecycle.
- Bug fixes: Issues that emerge only in production can be corrected without a costly recall.
- Compliance: Regulatory updates can be applied automatically to all devices in a fleet.
However, OTA updates also introduce risks. A failed update can “brick” a device, corrupt data, or open security holes. Therefore, a well‑designed OTA system must address reliability, security, and bandwidth constraints simultaneously.
Core Components of an OTA Update Architecture
An OTA system comprises several interacting components, each with specific responsibilities. Understanding these components is the first step toward a robust implementation.
The Bootloader
The bootloader is the very first code that runs when a device powers on. For OTA updates, the bootloader must support two essential functions:
- Update verification: It checks the integrity and authenticity of the new firmware before executing it.
- Fallback mechanism: If the new firmware fails to boot or is deemed invalid, the bootloader reverts to a known good version. Common designs include A/B (dual‑bank) slots, where the bootloader alternates between two copies, or a single slot with a recovery partition.
The Update Server
The server stores firmware images and their metadata (version, checksums, signatures), and orchestrates delivery to the fleet. It may also handle device registration, policy enforcement (e.g., staged rollouts), and reporting. Popular open‑source solutions include Eclipse hawkBit and Mender, while cloud platforms like AWS IoT Device Management offer managed OTA services.
The Update Client
Running on the embedded device, the client manages the communication with the server, downloads the update payload, verifies its authenticity, writes it to the appropriate storage location, and triggers the bootloader to apply the update. The client must operate robustly even under poor network conditions, power loss, or low battery.
Security Infrastructure
Security is non‑negotiable. At a minimum, OTA systems must implement:
- Code signing: Every firmware image is digitally signed using a private key, and the device verifies the signature using a pre‑installed public key.
- Encrypted transport: HTTPS or MQTT over TLS (MQTTS) protects the download channel from eavesdropping and tampering.
- Secure boot: The bootloader cryptographically verifies the firmware before execution, preventing unauthorised code from running.
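- Secure storage for keys: Private signing keys should stay on the signing infrastructure, held in hardware such as an HSM. On the device, the public verification key belongs in write‑protected or tamper‑resistant storage (for example, one‑time‑programmable memory or a TPM).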
Storage Management
Embedded devices have limited flash memory. The OTA system must efficiently manage the storage of the current firmware, the downloaded update, and backup copies. This often involves partitioning the flash into at least two banks (A/B) or using a dedicated recovery partition. Compression (e.g., using zlib or LZ4) and delta (differential) updates are used to reduce the size of payloads.
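As a rough illustration of the A/B partitioning described above, the following Python sketch checks a flash layout for alignment, overlap, and overflow before it is committed to a build. The sector size, partition names, offsets, and the 2 MiB flash size are all assumptions chosen for the example, not values from any particular device.

```python
from dataclasses import dataclass

SECTOR_SIZE = 4096  # assumed flash erase-sector size in bytes


@dataclass(frozen=True)
class Partition:
    name: str
    offset: int  # start address in flash, bytes
    size: int    # length in bytes

    @property
    def end(self) -> int:
        return self.offset + self.size


def validate_layout(partitions, flash_size):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    ordered = sorted(partitions, key=lambda p: p.offset)
    for p in ordered:
        if p.offset % SECTOR_SIZE or p.size % SECTOR_SIZE:
            problems.append(f"{p.name}: not aligned to {SECTOR_SIZE}-byte sectors")
        if p.end > flash_size:
            problems.append(f"{p.name}: extends past end of flash")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.offset < prev.end:
            problems.append(f"{cur.name}: overlaps {prev.name}")
    return problems


# Hypothetical 2 MiB layout: bootloader, one metadata sector, two equal slots.
FLASH_SIZE = 2 * 1024 * 1024
LAYOUT = [
    Partition("bootloader", 0x000000, 0x010000),
    Partition("metadata",   0x010000, 0x001000),
    Partition("slot_a",     0x011000, 0x0F0000),
    Partition("slot_b",     0x101000, 0x0F0000),
]

if __name__ == "__main__":
    print(validate_layout(LAYOUT, FLASH_SIZE) or "layout OK")
```

Running a check like this in the build pipeline catches a slot that no longer fits after a firmware grows, before the problem reaches a device.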
Steps to Implement OTA Updates in an Embedded OS
Implementing OTA updates requires a systematic approach that covers everything from bootloader design to fleet‑wide monitoring. Below are the critical steps, organised into practical phases.
1. Design the Bootloader for Update Management
The bootloader is the foundation of any OTA system. Its primary responsibilities are to decide which firmware image to run and to facilitate the update process.
- Choose between A/B updates and single‑slot with recovery. A/B (dual‑bank) is the gold standard: two copies of the firmware are stored; one is active, the other is updated. If the new image fails to boot, the bootloader automatically reverts to the older copy. Single‑slot designs are simpler and need less flash, but a failed update leaves no second full image to fall back on, so recovery depends on a separate recovery image or mode.
- Implement metadata tracking. The bootloader should maintain a metadata region (e.g., a reserved flash page) that stores the status of each slot: “active”, “pending update”, “failed”, “successful”. This metadata is updated by the client during the update flow.
- Add cryptographic verification. The bootloader must check the digital signature of the firmware image before booting. Verification can be done using public‑key cryptography (RSA, ECDSA) with a hash check (SHA‑256).
- Provide a fallback timer. After applying an update, the bootloader sets a “revert on failed boot” timer (e.g., 10 seconds). If the new firmware does not signal boot success within that window, the bootloader reverts to the previous slot.
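The slot‑selection logic in the points above can be modelled on a host before it is written for the target. The sketch below is a simplified Python model, not bootloader code: it uses a boot‑attempt counter in place of the wall‑clock fallback timer (a timer is awkward to simulate), and `signature_ok` is a hypothetical callable standing in for real signature verification. The names `BootMetadata`, `select_boot_slot`, `confirm_boot`, and the limit of three attempts are assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_BOOT_ATTEMPTS = 3  # assumed limit before a pending slot is marked failed


@dataclass
class SlotInfo:
    state: str = "successful"  # "successful", "pending update", or "failed"
    boot_attempts: int = 0


@dataclass
class BootMetadata:
    active: str = "A"
    slots: dict = field(
        default_factory=lambda: {"A": SlotInfo(), "B": SlotInfo()}
    )


def other(slot):
    return "B" if slot == "A" else "A"


def select_boot_slot(meta, signature_ok):
    """Pick the slot to boot, updating meta as a bootloader would persist it.

    signature_ok(slot) -> bool stands in for real signature verification.
    """
    candidate = meta.active
    info = meta.slots[candidate]
    if info.state == "pending update":
        # Unconfirmed image: count this attempt and give up after the limit.
        info.boot_attempts += 1
        if info.boot_attempts > MAX_BOOT_ATTEMPTS or not signature_ok(candidate):
            info.state = "failed"
    elif not signature_ok(candidate):
        info.state = "failed"
    if info.state == "failed":
        meta.active = other(candidate)  # fall back to the previous slot
    return meta.active


def confirm_boot(meta):
    """Called by the new firmware once it considers itself healthy."""
    info = meta.slots[meta.active]
    info.state = "successful"
    info.boot_attempts = 0
```

In this model the update client sets slot B to "pending update" and makes it active; if the new firmware never calls `confirm_boot`, the fourth boot attempt falls back to slot A. A real bootloader would also verify the fallback slot and would persist the metadata with a power‑safe write.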
2. Build a Scalable Update Server
The server manages the distribution of firmware to potentially thousands of devices. Key considerations:
- Firmware version management: Store all released versions with metadata (version string, release date, hardware compatibility, target OS).
- Rollout policies: Implement staged rollouts – for example, push updates to 5% of the fleet, then gradually increase if no issues are reported. The server can use device groups or fleets to manage this.
- Authentication and authorisation: Devices must authenticate (e.g., via X.509 certificates or pre‑shared keys) before they can request or download an update. This prevents unauthorised clients from draining bandwidth or accessing private firmware.
- Efficient delivery: Use CDNs or regional servers to reduce latency. Support resumable downloads (HTTP Range requests) so that devices can continue after a network drop.
- Error logging and analytics: Collect update‑attempt telemetry (success, failure reason, device ID) to identify problematic firmware versions or devices with connectivity issues.
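One common way to implement the percentage‑based rollout mentioned above is to hash each device ID into a stable bucket, so a device that is eligible at 5% stays eligible as the percentage grows. The following is a minimal Python sketch of that idea; the function names and the choice to mix the release string into the hash are assumptions, not the behaviour of any specific server product.

```python
import hashlib


def rollout_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in [0, 100) for a given release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(device_id: str, release: str, rollout_percent: int) -> bool:
    """True if this device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, release) < rollout_percent


if __name__ == "__main__":
    fleet = [f"dev-{i:04d}" for i in range(1000)]
    for percent in (5, 20, 50, 100):
        count = sum(is_eligible(d, "2.1.0", percent) for d in fleet)
        print(f"{percent:3d}% rollout -> {count} of {len(fleet)} devices")
```

Including the release string in the hash means a different subset of devices goes first for each release, so the same units are not always the early adopters.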
3. Develop the Update Client
The client runs on the embedded device and interacts with the server. Its design must account for the device’s limited memory, CPU, and power budget.
- Polling vs. push. Most embedded systems use periodic polling (e.g., every hour or day) to check for updates, because maintaining a persistent connection (for example, an always‑on MQTT session) drains the battery. The client sends the current firmware version to the server; the server responds with “no update” or a new firmware URL.
- Download and verification. The client downloads the firmware image over HTTPS and hashes it incrementally as chunks arrive, so the full payload never has to sit in RAM; the signature is then checked against the final digest. Each chunk is written directly to the inactive flash slot (B if A is active).
- Write integrity. After writing, the client validates the flash slot by reading back the image and recalculating the hash. Only then does it set the boot‑metadata slot to “pending update” and trigger a system reset.
- Handling interruptions. If power is lost during download or flash write, the client must resume from a checkpoint (if the server supports ranges) or restart the download. The bootloader will still boot the unchanged firmware because the metadata was not updated.
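The download, read‑back, and resume behaviour described above can be sketched in a few lines of Python. This is a host‑side model under stated assumptions: `fetch_range` is a hypothetical stand‑in for an HTTP Range request, a `bytearray` plays the role of the inactive flash slot, and the 1 KiB chunk size is arbitrary. Signature verification is left out; only the hash check is shown.

```python
import hashlib

CHUNK_SIZE = 1024  # assumed download/write granularity in bytes


def download_to_slot(fetch_range, total_size, slot, start_offset=0):
    """Stream an image into `slot`, a bytearray standing in for the inactive bank.

    fetch_range(offset, length) -> bytes is a hypothetical stand-in for an
    HTTP Range request. The returned offset doubles as the resume checkpoint
    when the transfer is interrupted.
    """
    offset = start_offset
    while offset < total_size:
        length = min(CHUNK_SIZE, total_size - offset)
        try:
            chunk = fetch_range(offset, length)
        except ConnectionError:
            return offset  # caller persists this checkpoint and retries later
        if not chunk:
            return offset  # server sent nothing; treat as an interruption
        slot[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return offset


def slot_matches(slot, total_size, expected_sha256):
    """Read the slot back in chunks and compare its hash to the expected value."""
    hasher = hashlib.sha256()
    for offset in range(0, total_size, CHUNK_SIZE):
        end = min(offset + CHUNK_SIZE, total_size)
        hasher.update(bytes(slot[offset:end]))
    return hasher.hexdigest() == expected_sha256
```

A caller would invoke `download_to_slot` again with the returned checkpoint after a dropped connection, and only mark the slot as pending once `slot_matches` returns true. Because the boot metadata is untouched until then, an interruption at any earlier point leaves the running firmware in place.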
4. Implement Robust Security
Security is a layered process. The OTA update pipeline is a prime attack vector; a compromised update could give an attacker full control over every device in the fleet.
- Use cryptographic signatures for every firmware image. Sign the image at build time with a hardware‑protected private key. The device’s bootloader and/or client verify the signature against a public key that is burned into the device at manufacturing (or securely provisioned later).
- Encrypt the update payload. Even though HTTPS secures the transport, encrypting the firmware image itself (e.g., with AES) adds another layer: if an attacker obtains the image from the server, they cannot reverse‑engineer it without the device‑specific key.
- Enforce secure boot. Ensure that the bootloader cryptographically verifies the active firmware at every power‑on, not just after an update. This prevents an attacker from permanently installing malicious code by flashing via a different interface (JTAG, UART).
- Revocation and key rotation. If a signing key is compromised, you must be able to revoke it. Devices should check a certificate revocation list (CRL) or use a chain of trust in which a root key endorses the current signing key, so that the signing key can be replaced without connectivity to a revocation service.
- Rate limiting and anomaly detection. The server should detect abnormal update‑request patterns (e.g., a single device requesting the same update hundreds of times) and throttle or block the device.
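For the rate‑limiting point above, a per‑device sliding window is one simple approach. The Python sketch below is illustrative only: the class name, the default of five requests per hour, and the in‑memory storage are assumptions, and a production server would normally keep this state in a shared store.

```python
from collections import defaultdict, deque


class UpdateRequestLimiter:
    """Sliding-window limiter: at most max_requests per device per window."""

    def __init__(self, max_requests=5, window_seconds=3600.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # device_id -> request timestamps

    def allow(self, device_id, now):
        """Return True if the request at time `now` (seconds) should be served."""
        history = self._history[device_id]
        while history and now - history[0] >= self.window_seconds:
            history.popleft()  # forget requests that have left the window
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True


if __name__ == "__main__":
    limiter = UpdateRequestLimiter(max_requests=3, window_seconds=60.0)
    for t in (0, 1, 2, 3, 61):
        print(t, limiter.allow("dev-0001", t))
```

Passing the timestamp in explicitly, instead of reading a clock inside `allow`, keeps the limiter easy to test. Rejected requests are not recorded, so a device that backs off regains access once its earlier requests age out.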
5. Test the OTA Update Process Thoroughly
Because OTA updates target deployed hardware, testing is paramount. Simulate every failure scenario you can imagine.
- Power loss at every stage: Cut power during download, during flash write, during bootloader verification, and after the new firmware starts. Ensure the device always boots into a good state.
- Network interruptions: Test with low bandwidth, high latency, packet loss, and sudden disconnects. Verify that the client can resume downloads or gracefully fall back.
- Corrupt firmware: Feed the client an image with a wrong signature, a bad checksum, or truncated data. The client must reject it and log the error without affecting the active firmware.
- Rollback scenarios: After a “successful” update, manually inject a bug that causes the new firmware to crash. Verify that the bootloader’s watchdog timer triggers a rollback to the previous slot.
- Fleet‑wide staging: Test with a small device group first. Monitor logs to ensure no regressions before pushing to the full fleet.
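Power‑loss testing of the kind listed above lends itself to automation: inject a cut at every possible point and assert that the device still boots a good image. The Python sketch below shows the pattern on a toy device model. The `Device` class, the `cut_after` parameter, and the assumption that old and new images have equal length are all simplifications for illustration; real tests would drive hardware or an emulator.

```python
import hashlib


class PowerLoss(Exception):
    """Raised by the harness to simulate power being cut."""


class Device:
    def __init__(self, firmware):
        # Two equal-size slots; slot A starts out holding the running image.
        self.slots = {"A": bytearray(firmware), "B": bytearray(len(firmware))}
        self.active = "A"

    def boot_image(self):
        return bytes(self.slots[self.active])


def apply_update(device, image, cut_after=None):
    """Write `image` to the inactive slot, verify it, then switch slots.

    cut_after is the number of bytes written before power is lost; a value
    equal to len(image) cuts power after the write but before the switch.
    """
    target = "B" if device.active == "A" else "A"
    for i, byte in enumerate(image):
        if cut_after is not None and i == cut_after:
            raise PowerLoss()
        device.slots[target][i] = byte
    if cut_after is not None and cut_after >= len(image):
        raise PowerLoss()
    written = hashlib.sha256(device.slots[target]).digest()
    if written != hashlib.sha256(image).digest():
        raise ValueError("read-back hash mismatch")
    device.active = target  # the single, final metadata write


def survives_power_loss_everywhere(old, new):
    """True if a cut at any byte offset still leaves the old image bootable."""
    for cut in range(len(new) + 1):
        device = Device(old)
        try:
            apply_update(device, new, cut_after=cut)
        except PowerLoss:
            pass
        if device.boot_image() != old:
            return False
    return True
```

The property being checked is that the slot switch is the last step and the only one that changes what boots. The same loop structure applies when the cut is produced by a relay on a test bench instead of an exception.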
Best Practices for Production OTA Systems
Beyond the basic implementation, the following practices help ensure your OTA system is reliable at scale.
Use A/B Updates with Atomic Switching
A/B (dual‑bank) updates are the most reliable approach for embedded devices that cannot tolerate long downtime. The update is applied to the inactive slot while the active slot continues running. Only after the new image is fully written and verified does the system swap slots and reboot. If the new image fails to boot, the bootloader immediately goes back to the old slot. Because the device keeps running during download and flash write, downtime is limited to the reboot itself.
Adopt Delta / Differential Updates
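Instead of sending a full firmware image every time, delta updates compute the binary difference between the current and new firmware and send only that patch. Tools like bsdiff, or the delta payloads used by Google’s update engine, can produce patches that are often 80–95% smaller than the full image when only a small part of the firmware changes. This reduces bandwidth costs, speeds up downloads, and lowers the risk of interruptions. A patch applies cleanly only to the exact base version it was computed against, so the server must know which version each device is running.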
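To make the idea concrete, here is a deliberately naive block‑level delta in Python. It is not bsdiff and will not match its compression: it only detects blocks that changed in place, so inserting bytes near the start of an image shifts every later block and defeats it. The 256‑byte block size and the function names are assumptions for the example.

```python
BLOCK_SIZE = 256  # assumed patch granularity in bytes


def make_delta(old: bytes, new: bytes):
    """Return (new_length, [(block_index, block_bytes), ...]) for changed blocks."""
    changed = []
    block_count = (len(new) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for index in range(block_count):
        start = index * BLOCK_SIZE
        block = new[start:start + BLOCK_SIZE]
        if old[start:start + BLOCK_SIZE] != block:
            changed.append((index, block))
    return len(new), changed


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new image from the old image plus the changed blocks."""
    new_length, changed = delta
    # Start from the old image, truncated or zero-padded to the new length.
    image = bytearray(old[:new_length].ljust(new_length, b"\x00"))
    for index, block in changed:
        start = index * BLOCK_SIZE
        image[start:start + len(block)] = block
    return bytes(image)


if __name__ == "__main__":
    old = bytes(64 * 1024)
    new = bytearray(old)
    new[1000] = 0xFF  # a one-byte change touches a single block
    length, blocks = make_delta(old, bytes(new))
    payload = sum(len(b) for _, b in blocks)
    print(f"{len(blocks)} changed block(s), {payload} payload bytes of {length}")
```

After applying a delta, the device should still hash the reconstructed image and verify its signature exactly as it would for a full download.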
Phase Rollouts and Monitor in Real Time
Never push an update to 100% of devices immediately. Roll out in phases (e.g., 5%, 20%, 50%, 100%) with a cooldown period between phases. During each phase, monitor key metrics: update success rate, boot success rate, crash reports, and connectivity changes. If a phase shows a spike in failures, halt the rollout and investigate before proceeding.
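The phase gate described above can be expressed as a small decision function. The sketch below is one possible shape in Python; the 2% failure threshold, the minimum of 50 reports, and the function name are illustrative assumptions that would need tuning against real fleet data.

```python
PHASES = [5, 20, 50, 100]  # rollout percentages, matching the example phases


def next_rollout_step(current, successes, failures,
                      max_failure_rate=0.02, min_reports=50):
    """Return (percent, decision) for the rollout after reviewing one phase.

    decision is "wait" (too little telemetry), "halt" (failure spike),
    "advance" (move to the next phase), or "done" (already at the last phase).
    """
    reports = successes + failures
    if reports < min_reports:
        return current, "wait"
    if failures / reports > max_failure_rate:
        return current, "halt"
    later = [p for p in PHASES if p > current]
    if not later:
        return current, "done"
    return later[0], "advance"


if __name__ == "__main__":
    print(next_rollout_step(5, successes=99, failures=1))
    print(next_rollout_step(5, successes=90, failures=10))
```

Keeping "wait" separate from "halt" matters operationally: the first means the cooldown has not yet produced enough data, the second means someone needs to investigate before anything else ships.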
Implement a Watchdog in the New Firmware
On the first boot of new firmware, the bootloader (or early startup code) should arm a watchdog timer that the new firmware must clear within a short window (e.g., 60 seconds). If the firmware hangs, crashes, or fails to clear the watchdog, the bootloader assumes it is broken and reverts. This mechanism catches latent bugs that only manifest after a few seconds of operation.
Provide a Safe “Factory Reset” Path
Even with perfect OTA design, devices can enter an unrecoverable state (e.g., corrupted bootloader region). A physical recovery mechanism – such as a button held during reset, a serial console, or a dedicated recovery image served over a secondary channel – should be documented for the rare cases where OTA recovery fails.
Log and Analyse Update Results
Every update attempt should generate logs on the device (if storage permits) and send result telemetry to the server. Logs should include: device ID, old version, new version, update start/end timestamps, download size, last‑seen network strength, and any error codes. Analysing this data helps you identify problematic firmware versions, network‑bandwidth bottlenecks, or hardware‑specific issues.
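A fixed, machine‑readable record makes this telemetry much easier to aggregate. The Python sketch below shows one possible shape for such a record, using the fields listed above; the field names, the use of Unix timestamps, and the dBm unit for signal strength are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class UpdateReport:
    device_id: str
    old_version: str
    new_version: str
    started_at: int             # Unix timestamp, seconds
    finished_at: int            # Unix timestamp, seconds
    download_bytes: int
    rssi_dbm: Optional[int]     # last-seen signal strength, if known
    error_code: Optional[str]   # None on success

    def to_json(self) -> str:
        # Sorted keys and compact separators keep the payload small and stable.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    report = UpdateReport(
        device_id="dev-0001",
        old_version="1.4.2",
        new_version="1.5.0",
        started_at=1700000000,
        finished_at=1700000093,
        download_bytes=412_672,
        rssi_dbm=-71,
        error_code=None,
    )
    print(report.to_json())
```

On a constrained device the same fields would more likely be packed into a compact binary format such as CBOR; JSON is used here only because it is easy to read.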
Conclusion
Implementing OTA updates in embedded operating systems is not a trivial task, but it is increasingly necessary for any product that expects to live in the field for more than a few months. The key is to treat the update system as a first‑class component of your device’s firmware, designed with the same rigour as the application logic. Invest in a secure bootloader, a scalable server, a resilient update client, and rigorous testing. When done correctly, OTA updates give you the power to fix, improve, and secure your devices throughout their entire lifecycle, saving costs and delighting users.