Root Cause Analysis for Cybersecurity Breaches in Industrial Control Systems
Understanding Root Cause Analysis in the ICS Context
Industrial Control Systems (ICS) form the backbone of critical infrastructure—power grids, water treatment plants, oil refineries, and chemical manufacturing facilities. As these environments embrace digital connectivity and Industrial Internet of Things (IIoT) devices, the attack surface expands dramatically. Unlike typical IT breaches, a compromise in an ICS can lead to physical damage, environmental disasters, or loss of life. Root Cause Analysis (RCA) in this domain is not merely a post-incident exercise; it is a proactive engineering discipline that digs past immediate symptoms to uncover the systemic weaknesses that allowed a breach to succeed.
RCA in ICS differs from IT security forensics because it must account for operational technology (OT) constraints: legacy protocols that lack encryption, real-time control loops that cannot tolerate latency, and safety lifecycles that can be disrupted by security patches. A thorough RCA bridges the gap between IT forensic methodology and OT engineering reality, helping organizations identify whether a breach stemmed from a technical flaw, a procedural gap, or a misalignment between safety and security priorities.
Common Root Causes of ICS Cybersecurity Breaches
While each incident is unique, patterns emerge across ICS breaches. Understanding these common root causes helps organizations focus their defensive resources.
Weak Passwords and Inadequate Authentication
Many ICS environments still rely on default credentials shared across multiple devices or on passwords hard-coded into programmable logic controllers (PLCs). The 2021 Colonial Pipeline ransomware attack, though primarily an IT compromise, began with a VPN account that lacked multi-factor authentication (MFA), and the company halted pipeline operations as a precaution. The incident showed how weak authentication on remote access tools can disrupt OT operations and open a path for lateral movement toward them. The root cause is often not just the password itself but the lack of a policy requiring strong, unique credentials and MFA for all ICS access points.
Unpatched Software and Firmware Vulnerabilities
Industrial systems frequently run on outdated operating systems like Windows 7 or XP, and patch cycles can span months or years because every update must be tested for compatibility with control applications. This creates a long window of exposure for known vulnerabilities. The Triton/Trisis attack on a Saudi petrochemical plant in 2017 targeted Schneider Electric’s Triconex safety controllers, a class of device that operators are often reluctant to modify for fear of disrupting safety functions. In cases like this, the root cause is a deployment process that prioritizes uptime over security hygiene without a compensating control strategy.
Lack of Network Segmentation Between IT and OT
Flat networks are among the most serious structural weaknesses in ICS environments. When corporate IT and control networks are not properly segmented via firewalls, DMZs, or unidirectional gateways (data diodes), a phishing email that compromises a business system can allow attackers to pivot into the control network. The 2015 Ukraine power grid attack succeeded partly because the attackers used the IT network to reach the ICS network, a direct result of insufficient segmentation. The root cause is often a topology design that treats the corporate network as trusted, ignoring the reality that attackers will exploit any available path.
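A segmentation finding can be checked against observed traffic during an RCA. The following is a minimal sketch, not a definitive tool: the address ranges, the flow tuples `(source, destination, port)`, and the function names are all hypothetical and would need to be replaced with the site's real zone plan and flow records.

```python
from ipaddress import ip_address, ip_network

# Hypothetical address plan; substitute the site's real zones.
IT_NET = ip_network("10.10.0.0/16")
OT_NET = ip_network("10.50.0.0/16")


def zone(addr: str) -> str:
    """Return the zone name for an address under the assumed plan."""
    ip = ip_address(addr)
    if ip in IT_NET:
        return "IT"
    if ip in OT_NET:
        return "OT"
    return "OTHER"


def direct_it_to_ot(flows):
    """Return flows that go straight from IT to OT, bypassing any DMZ."""
    return [f for f in flows if zone(f[0]) == "IT" and zone(f[1]) == "OT"]


# Illustrative flow records, not taken from any real incident.
flows = [
    ("10.10.4.21", "10.50.1.5", 502),   # corporate host to PLC (Modbus/TCP)
    ("10.10.4.21", "172.16.0.8", 443),  # corporate host to a DMZ service
    ("10.50.1.9", "10.50.1.5", 502),    # HMI to PLC inside OT
]
violations = direct_it_to_ot(flows)
```

Any entry in `violations` is a path that the segmentation design should have blocked or forced through a DMZ, which makes it a candidate branch in the root cause analysis.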
Insider Threats: Malicious and Accidental
Insider threats in ICS can range from a disgruntled engineer who reprograms a PLC to cause a malfunction, to a contractor who inadvertently connects a malware-infected laptop to the OT network. A 2019 study by the Ponemon Institute found that insiders are responsible for nearly 25% of ICS incidents. The root cause is frequently a combination of inadequate access controls, absence of behavior analytics, and a culture that prioritizes convenience over security. For example, shared service accounts and broad administrative privileges make it difficult to trace actions back to an individual.
Insufficient Monitoring and Detection Capabilities
Many ICS environments lack endpoint detection and response (EDR), network monitoring, or security information and event management (SIEM) systems that are tuned for OT protocols. Without visibility into control network traffic—such as Modbus, DNP3, or PROFINET—an attacker can move laterally for weeks or months before discovery. The 2017 NotPetya attack impacted numerous ICS organizations, but in many cases, the malware was already present on internal networks before the wiper component activated. The root cause was a monitoring gap that failed to detect the initial compromise. As SANS ICS surveys consistently show, organizations that invest in OT-specific monitoring detect breaches three times faster than those that rely on IT tools alone.
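To make the visibility gap concrete, here is a minimal sketch of OT-aware inspection: it parses the header of a Modbus/TCP frame and flags function codes that change device state. The function names are hypothetical, the sample frames are hand-built for illustration, and a production monitor would also handle TCP reassembly, exception responses, and other protocols.

```python
import struct

# Modbus function codes that modify device state.
WRITE_CODES = {
    5: "Write Single Coil",
    6: "Write Single Register",
    15: "Write Multiple Coils",
    16: "Write Multiple Registers",
}


def parse_modbus_tcp(frame: bytes) -> dict:
    """Parse the 7-byte MBAP header and the function code that follows it."""
    if len(frame) < 8:
        raise ValueError("frame too short for Modbus/TCP")
    transaction, protocol, length, unit = struct.unpack(">HHHB", frame[:7])
    if protocol != 0:
        raise ValueError("protocol identifier must be 0 for Modbus")
    return {"transaction": transaction, "unit": unit, "function": frame[7]}


def is_write(frame: bytes) -> bool:
    """True when the frame carries a state-changing function code."""
    return parse_modbus_tcp(frame)["function"] in WRITE_CODES


# Hand-built example frames (illustrative only).
write_frame = bytes.fromhex("0001000000060106001003e8")  # function 6
read_frame = bytes.fromhex("000200000006010300000002")   # function 3
```

Logging every frame for which `is_write` returns true, together with its source address, gives the RCA team a record of which hosts issued control commands and when.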
Methodologies for Conducting Root Cause Analysis in ICS
RCA is a structured process. While IT incident response frameworks provide a starting point, ICS-specific methodologies incorporate operational context. The following approaches are widely used in industrial settings.
The 5 Whys
Originally developed at Toyota, the 5 Whys technique is deceptively simple: ask "why" repeatedly until the underlying cause emerges. For example: Why did the safety system fail? Because outdated firmware allowed an attacker to bypass it. Why was the firmware outdated? Because the patch had not been tested for the specific safety function. Why was testing delayed? Because there was no automated test harness. Why was there no test harness? Because the budget prioritized new equipment over validation tools. This chain ends after four questions; five is a guideline, not a quota. The root cause turns out to be a resource allocation decision, not simply a missing patch. The 5 Whys works well for single-threaded incidents but can miss systemic interactions.
Fishbone (Ishikawa) Diagram
Also known as cause-and-effect analysis, the fishbone diagram organizes potential causes into categories such as People, Process, Technology, and Environment. For an ICS breach, categories might include: People (training, awareness, insider actions), Process (change management, patching cycles, incident response playbooks), Technology (authentication, segmentation, architecture design), and External Factors (vendor vulnerabilities, regulatory gaps). A team places each contributing factor on a branch and traces it back to the diagram’s spine, which leads to the effect under study (the breach). This method is ideal for complex incidents with multiple interlocking failures.
Fault Tree Analysis (FTA)
FTA is a top-down, deductive method often used in safety engineering but equally applicable to security breaches. Starting with the undesired event (the breach), analysts work backward using logic gates (AND, OR) to identify combinations of failures that could cause the event. FTA is particularly useful in ICS because it mirrors the safety analysis that engineers already perform. For instance, a breach might occur if (A) the firewall is misconfigured AND (B) the antivirus is outdated AND (C) the anomaly detection system is not monitored. This rigor helps prioritize corrective actions that break the most critical logic paths.
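The gate logic described above can be sketched in a few lines. This is an illustrative model, not a replacement for a dedicated FTA tool: the `Gate` class and function names are hypothetical, and the tree encodes only the three-condition example from the paragraph above. A cut set is a combination of basic events that is sufficient to cause the top event; a minimal cut set contains no smaller cut set.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Gate:
    kind: str                                     # "AND" or "OR"
    children: list = field(default_factory=list)  # Gate objects or event names


def cut_sets(node):
    """Return the cut sets (frozensets of basic events) for a node."""
    if isinstance(node, str):
        return {frozenset([node])}
    child_sets = [cut_sets(child) for child in node.children]
    if node.kind == "OR":
        # Any child's cut set is enough on its own.
        return set().union(*child_sets)
    if node.kind == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate kind: {node.kind}")


def minimal_cut_sets(node):
    """Drop every cut set that strictly contains another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


# The example from the text: all three conditions must hold.
breach = Gate("AND", [
    "firewall_misconfigured",
    "antivirus_outdated",
    "anomaly_detection_unmonitored",
])
paths = minimal_cut_sets(breach)
```

Here `paths` holds a single three-event cut set, so correcting any one of the three conditions breaks the path. In larger trees, events that appear in many minimal cut sets are the natural first targets for corrective action.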
The RCA Process in Five Phases
- Phase 1: Data Collection and Preservation — Forensic imaging of controllers, historians, engineering workstations, and network logs. In OT, this must be done carefully to avoid disrupting critical processes. Use read-only access where possible and consult with operations engineers before pulling power from devices.
- Phase 2: Event Timeline Reconstruction — Correlating logs from both IT and OT sources. ICS environments often have time synchronization issues (different devices using different NTP servers or none at all), so time normalization is critical. Tools like Wireshark with OT dissectors can help reconstruct packet sequences.
- Phase 3: Vulnerability Identification — Mapping the attack path to specific weaknesses. This includes not only technical vulnerabilities (CVEs) but also procedural gaps, such as lack of background checks for contractors or absence of a formal change review board.
- Phase 4: Root Cause Determination — Applying one or more methodologies (5 Whys, fishbone, FTA) to converge on the fundamental reason. Often, the root cause is a combination of a technical vulnerability and a process failure. For example, the network segmentation policy existed on paper but was never audited.
- Phase 5: Corrective Action Development and Verification — Implementing measures that address the root cause, not just the symptoms. Common actions include redesigning network architecture, hardening device configurations, implementing automated patch management sandboxes, and introducing OT-aware intrusion detection. Each action should be tested in a staging environment before deployment to production.
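The time normalization step in Phase 2 can be sketched as follows. This is a simplified illustration that assumes each device's clock offset from a reference clock has already been measured. The device names, offsets, timestamps, and event messages are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical measured offsets: device clock minus reference clock.
CLOCK_OFFSETS = {
    "hmi-01": timedelta(seconds=42),     # runs 42 s fast
    "plc-07": timedelta(seconds=-310),   # runs 5 min 10 s slow
    "fw-it": timedelta(0),               # synchronized to the reference
}


def normalize(events, offsets):
    """Shift each event to reference time and return them in order."""
    corrected = [
        (timestamp - offsets[source], source, message)
        for timestamp, source, message in events
    ]
    return sorted(corrected, key=lambda event: event[0])


def t(hour, minute, second):
    """Build a UTC timestamp on an arbitrary illustrative date."""
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Raw events as logged by each device's own clock.
events = [
    (t(2, 0, 50), "hmi-01", "setpoint changed"),
    (t(1, 54, 55), "plc-07", "logic download"),
    (t(2, 0, 0), "fw-it", "VPN session opened"),
]
timeline = normalize(events, CLOCK_OFFSETS)
```

Read by raw timestamps, the logic download appears to precede the VPN session by about five minutes. After correction, the order becomes VPN session, logic download, setpoint change, which is the kind of reordering that can change the conclusion of an RCA.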
Unique Challenges of RCA in ICS Environments
Conducting RCA in an industrial control system presents hurdles rarely encountered in IT security. Acknowledging these challenges upfront improves the quality of the analysis.
Legacy Technology and Proprietary Protocols
Many ICS devices have been in operation for 15–30 years, running firmware that cannot be patched and that may produce no logs at all. Proprietary protocols from vendors like Siemens, Rockwell, or ABB may not have native security features or standardized logging. Analysts often need deep engineering knowledge to interpret device behavior. Resources such as CISA’s ICS advisories (formerly published under ICS-CERT) provide guidance on known vulnerabilities, but the RCA team must also understand the operational context, such as which registers control a valve or which ladder logic commands a pump.
Safety Over Security Constraints
An RCA must never recommend a corrective action that violates safety protocols. For example, requiring a password change every 30 days may seem secure, but if an engineer is locked out during an emergency shutdown procedure, human life could be at risk. The root cause analysis process must involve safety engineers and reference standards such as ISA/IEC 62443 so that security measures remain compatible with functional safety.
Limited Forensic Capabilities
Unlike IT servers, many PLCs and RTUs do not have persistent storage for logs. Event data may be held in volatile memory that disappears on reboot. Forensic tools designed for ICS, such as those from Dragos or Nozomi Networks, can capture state information, but they are not universally deployed. As a result, RCA often relies on indirect evidence—operator interviews, shift logs, and historian data—which requires careful corroboration.
Regulatory and Compliance Pressures
Industries such as energy, water, and chemical manufacturing are subject to regulations and guidance (NERC CIP, the EU NIS Directive, NIST SP 800-82) that may mandate or shape specific RCA procedures. The analysis must produce a report that satisfies auditors without exposing sensitive vulnerabilities that could be exploited. Balancing transparency with confidentiality is a skill that RCA teams must develop.
Building an Effective RCA Program for ICS
RCA should not be a one-off exercise performed only after a breach; it should be integrated into the organization’s security governance. A mature program includes the following elements.
Pre-Incident Preparation
Before a breach occurs, define the RCA team: a mix of IT security professionals, OT engineers, control system operators, and management. Pre-authorize read-only access to key systems and establish a chain of custody for forensic evidence. Document the network architecture, asset inventory, and known dependencies. Organizations that have a baseline of normal operations can identify anomalies faster during an RCA.
Tool Selection and Integration
Invest in tools that provide visibility into OT environments. Network monitoring appliances that can parse Modbus, DNP3, and OPC-UA traffic are essential. Agentless or lightweight monitoring platforms built for OT and embedded devices (such as Microsoft Defender for IoT or Armis) can collect telemetry without destabilizing controllers. Centralized logging with time-synchronized data feeds allows correlation between IT and OT events. A good RCA should be able to answer: “How did the attacker first access the OT network, what commands were sent, and which assets were affected?”
Post-Incident Learning and Continuous Improvement
After an RCA report is published, track the implementation of corrective actions. Create a quarterly review that evaluates whether the actions have actually reduced the risk. For example, if the root cause was a lack of segmentation, verify that the new firewall rules are enforced and that no exceptions have been silently added. Share anonymized lessons across the industry through information-sharing groups like CISA’s Automated Indicator Sharing (AIS) for ICS, or the ISA’s Security Compliance Institute. This collective learning helps the entire sector raise its security baseline.
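The quarterly verification described above can be partly automated. The sketch below compares an approved rule set with the rules actually running on a firewall and reports anything that was added or never deployed. The `Rule` fields, the example rules, and the idea that both sets can be exported in this shape are assumptions. Real firewall exports differ by vendor and need parsing first.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    src: str
    dst: str
    port: int
    action: str


def rule_drift(approved, running):
    """Report rules that differ between the approved and running sets."""
    approved_set, running_set = set(approved), set(running)
    return {
        # On the device but never approved: possible silent exceptions.
        "unapproved": sorted(running_set - approved_set),
        # Approved but missing from the device: corrective action not applied.
        "not_deployed": sorted(approved_set - running_set),
    }


# Hypothetical rule sets for illustration.
approved = [
    Rule("dmz-historian", "ot-historian", 443, "allow"),
    Rule("any", "ot-net", 0, "deny"),
]
running = [
    Rule("dmz-historian", "ot-historian", 443, "allow"),
    Rule("any", "ot-net", 0, "deny"),
    Rule("it-eng-laptop", "plc-07", 502, "allow"),  # added outside change control
]
drift = rule_drift(approved, running)
```

A non-empty `unapproved` list is a prompt to ask who added the rule and under which change request, which feeds directly back into the process side of the RCA.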
Illustrative Case Study: Lessons from a Hypothetical ICS Breach
Note: The following example is constructed from common patterns observed by security researchers. It does not represent any specific incident but synthesizes typical root causes.
A mid-sized water utility experienced a breach that caused pumps to run at unsafe speeds, triggering emergency shutdowns. Initial symptoms pointed to a malicious payload in the HMI (Human Machine Interface) software. An RCA team used the fishbone method and identified three contributing factors: the HMI was running Windows 7 without current security updates; the remote access VPN used a single-factor credential shared among 12 operators; and network logs showed traffic from the corporate IT segment to the control network that the IT firewall had not flagged, because it monitored only north-south traffic and not east-west traffic. The root cause was determined to be the lack of segmentation combined with an insecure remote access policy. Corrective actions included deploying a DMZ with a one-way data diode, implementing MFA for all remote connections, and establishing an automated patch sandbox for HMI updates. The RCA also revealed that no procedure existed for reviewing remote access sessions, a process gap that was subsequently addressed.
This case highlights that the root cause was not a single vulnerability but a combination of technology gaps and process failures. By addressing both, the utility not only recovered from the incident but built a more resilient control environment.
Conclusion: Embedding RCA as a Continuous Process
Root Cause Analysis is not a post-mortem ritual; it is a strategic capability that turns incidents into learning opportunities. In industrial control systems, where the cost of failure includes physical damage and public safety risks, the ability to systematically uncover and eliminate root causes is indispensable. By adopting structured methodologies, respecting the unique constraints of OT environments, and building a dedicated program, organizations can move from reactive firefighting to proactive resilience. The ultimate goal is to make every breach—however painful—serve as a stepping stone toward a more secure and dependable industrial infrastructure.