Redundant Link Design for OT Systems: Principles, Practices, and Pitfalls
In industrial and critical environments, reliability is not just an aspiration; it shows up in design demands, regulatory requirements, and operational doctrine. In these contexts, system availability, Mean Time To Repair (MTTR), and Mean Time Between Failures (MTBF) are not buzzwords but measurable, audited metrics. Nowhere do these needs manifest more concretely than in the networking layer serving Operational Technology (OT) systems.
Historical Context: The Shift from Isolated to Connected OT Networks
Historically, OT networks ran as islands: isolated, proprietary bus systems (Modbus over RS-485, PROFIBUS, etc.), ring or line topologies with physical barriers to IT interaction. The rationale: minimize attack surfaces and maximize uptime through physical isolation. But requirements evolved. Efficiency and data-driven decisions required integration with corporate IT and cloud services. Under this pressure, Ethernet and TCP/IP became prevalent as the substrate for plant-floor connectivity. Enter the pressing debate: how do we ensure redundancy and reliability as we collapse formerly discrete layers?
Redundancy Basics: Why and What We’re Protecting Against
Redundancy is about minimizing single points of failure, but effective redundancy is a nuanced subject.
Common points-of-failure mitigated by redundant network links in OT systems include:
Physical cable or fiber cuts
Switch/router hardware faults
Power anomalies
Human error (misconfigurations, patching, etc.)
The question: Which risks warrant mitigation, at what cost, and using what technology?
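To make the cost/benefit question concrete, it helps to run the basic availability arithmetic. The Python sketch below uses illustrative MTBF/MTTR figures (not measurements from any real plant) and assumes the two links fail independently; that independence assumption is exactly what true physical diversity has to deliver.

```python
# Availability math for parallel (redundant) links, assuming
# independent failures. The MTBF/MTTR figures are illustrative.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def parallel(a1: float, a2: float) -> float:
    """Availability of two links where either one alone suffices."""
    return 1 - (1 - a1) * (1 - a2)

single = availability(mtbf_hours=8760, mttr_hours=8)    # ~99.909%
dual = parallel(single, single)                         # ~99.99992%
print(f"single link: {single:.6f}")
print(f"dual links : {dual:.6f}")
```

Under the independence assumption, two 99.9%-class links combine into roughly "six nines"; correlated failures (shared trays, shared power) erase that gain, which is why the physical-layer points later in this article matter.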
Protocols and Architectures for Redundant Networking
The selection of redundant networking topologies and protocols is not arbitrary; it is shaped by the nature of OT application requirements (latency, determinism, recovery time), legacy system constraints, and interoperability needs.
Layer 2: STP, RSTP, and Industrial Ring Protocols
Spanning Tree Protocol (STP, IEEE 802.1D): Dating to the 1980s, STP prevents loops by blocking redundant paths at Layer 2. Its drawback in OT? Convergence/recovery after a failure can take tens of seconds—disastrous in processes requiring sub-second recovery.
Rapid Spanning Tree Protocol (RSTP, IEEE 802.1w): An evolution of STP providing faster convergence (within a couple of seconds), but still inadequate for many deterministic industrial uses.
Media Redundancy Protocol (MRP, IEC 62439-2): Designed for industrial Ethernet rings, MRP delivers reconvergence times under 500ms, suitable for many plant control networks. Supported by most industrial Ethernet switches today.
PRP/HSR (Parallel Redundancy Protocol / High-availability Seamless Redundancy, IEC 62439-3): These make parallel use of two physically separate networks. Instead of rapid switchover, both links are always active. Zero-wait recovery is possible, but at the cost of doubled network infrastructure.
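To make "zero-wait recovery" concrete, here is a toy model of the PRP duplicate-discard idea: the sender puts every frame on both LANs with a sequence number, and the receiver delivers the first copy and drops the second. This is a sketch only; the field names are illustrative, and real IEC 62439-3 implementations use a defined redundancy trailer and a bounded drop window rather than an ever-growing set.

```python
# Toy model of PRP-style duplicate discard: each frame travels on
# both LAN A and LAN B; the receiver keeps the first arrival and
# silently drops the duplicate, so either LAN can fail with zero
# recovery time. Illustrative only (not the real trailer format).

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Frame:
    src: str        # sending node
    seq: int        # sequence number shared by both copies
    lan: str        # "A" or "B"
    payload: bytes

class DuplicateDiscard:
    def __init__(self) -> None:
        self.seen = set()

    def receive(self, frame: Frame) -> Optional[bytes]:
        key = (frame.src, frame.seq)
        if key in self.seen:
            return None              # duplicate from the other LAN: drop
        self.seen.add(key)
        return frame.payload         # first arrival wins, either LAN

rx = DuplicateDiscard()
for f in (Frame("plc1", 7, "A", b"telemetry"),
          Frame("plc1", 7, "B", b"telemetry")):
    print(f.lan, "->", rx.receive(f))   # A -> b'telemetry', B -> None
```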
Layer 3: Routing-Based Redundancy
Where possible, Layer 3-based designs enhance robustness:
Dynamic Routing Protocols (OSPF, EIGRP, IS-IS): Fast convergence, support for equal-cost multipath, and fine-grained route control. But their complexity may outstrip the skills of plant-maintenance teams.
VRRP, HSRP: Useful for gateway redundancy between IT and OT domains.
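The first-hop redundancy idea behind VRRP/HSRP can be modeled in a few lines: routers share one virtual gateway address, and the highest-priority router still advertising owns it. The sketch below is a toy election with illustrative names and priorities; real VRRP adds advertisement timers, preemption rules, and a full state machine.

```python
# Toy VRRP-style master election: the live router with the highest
# priority owns the virtual gateway IP. Names/priorities illustrative.

from typing import Optional

def elect_master(priorities: dict, alive: set) -> Optional[str]:
    """Pick the highest-priority router that is still advertising."""
    live = {name: p for name, p in priorities.items() if name in alive}
    return max(live, key=live.get) if live else None

routers = {"gw-a": 120, "gw-b": 100}            # gw-a is preferred master
print(elect_master(routers, {"gw-a", "gw-b"}))  # gw-a holds the virtual IP
print(elect_master(routers, {"gw-b"}))          # gw-a fails: gw-b takes over
```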
Annotation: The Misconception of “Redundant = Reliable”
Deploying redundant links alone will not guarantee reliability. Operator error, software bugs, and misconfiguration can all defeat the intended effect.
Classic OT Topologies: Line vs Ring vs Star and Their Fault Tolerance
Line (Bus) Topology
Simple “daisy-chained” switches or devices. One cut: everything beyond the fault goes dark. In modern environments, hybrid approaches sometimes use dual lines, but this is a hazardous compromise.
Ring Topology
For years, ring topologies have been the OT workhorse. MRP and proprietary ring protocols allow a broken link to be bypassed via the alternate path. Speed of reconvergence depends on protocol—and device firmware quality, which varies widely by vendor and model.
Star Topology
Centralized distribution layer (core switch), with each device having a dedicated link. Pros: easy fault domain isolation. Cons: introduces a potentially devastating single core switch as a single point of failure. Solution: dual-homed stars or full mesh topologies, but these come with cost and management trade-offs.
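These fault-tolerance claims are easy to check mechanically: model the topology as a graph and verify connectivity after every possible single-link cut. The sketch below does exactly that for a four-switch line versus ring (switch names are illustrative):

```python
# Single-link fault tolerance check: is the network still connected
# after removing any one edge? Pure-Python BFS, no dependencies.

def connected(nodes, edges):
    """BFS reachability over an undirected edge list."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == set(nodes)

def survives_any_single_cut(nodes, edges):
    """True if no single link failure partitions the network."""
    return all(connected(nodes, [e for e in edges if e != cut])
               for cut in edges)

nodes = {"s1", "s2", "s3", "s4"}
line = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
ring = line + [("s4", "s1")]
print("line survives any cut:", survives_any_single_cut(nodes, line))  # False
print("ring survives any cut:", survives_any_single_cut(nodes, ring))  # True
```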
Note: Don’t Underestimate Layer 1 Redundancy
True physical independence is vital. This means:
Diversified cable routes (not “diverse” if they are zip-tied to the same ladder)
Separate power feeds (ideally from different panels/substations)
Geographic dispersion of core nodes
Tales from the Trenches: The IT/OT Collaboration Problem
In brownfield settings, IT’s push for standardized protocols and cheaper COTS hardware often clashes with OT’s justified distrust of “untested” features and fear of production downtime. For instance:
Patching storms: Regular IT security patching windows may be unthinkable for continuous process networks, where a rebooted or upgraded switch can trigger an accidental STP topology shift mid-shift.
Visibility gaps: IT expects SNMP and NetFlow everywhere; OT operators may lack the culture—and sometimes, the tools—needed to investigate layer 2/3 faults.
Protocols like PRP/HSR can “just work” with OT tools, but the skillset to configure, monitor, and troubleshoot mixed IT/OT redundant topologies may be lacking.
Bridging the Divide: Robust Documentation and Testing
If you take away one practical lesson: actual failover drills matter far more than notional redundancy “on paper.” Walk through real link, node, and power failure scenarios. Measure, don’t assume, failover times. Document topologies clearly for both IT and OT teams—ideally with network diagrams and runbooks.
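As a concrete starting point for such a drill, here is a rough probe harness: ping a target on the far side of the redundant path while a link is pulled, and report the longest reply gap. It assumes a Linux host with the standard ping utility and an illustrative target address; ping's one-second timeout makes the resolution coarse, so verifying sub-500ms targets still calls for dedicated test equipment.

```python
# Rough failover-drill harness: probe TARGET at a fixed interval
# while a link/node/power failure is injected, then report the
# longest observed gap with no replies. Linux ping flags assumed;
# the target address is illustrative.

import subprocess
import time

TARGET, INTERVAL_S, DURATION_S = "10.0.0.50", 0.1, 60

def probe(host: str) -> bool:
    """One ICMP echo with a 1-second timeout."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

outage_start, worst_gap = None, 0.0
end = time.monotonic() + DURATION_S
while time.monotonic() < end:
    now = time.monotonic()
    if probe(TARGET):
        if outage_start is not None:            # service just recovered
            worst_gap = max(worst_gap, now - outage_start)
            outage_start = None
    elif outage_start is None:                  # outage just began
        outage_start = now
    time.sleep(INTERVAL_S)

print(f"worst observed outage: {worst_gap * 1000:.0f} ms")
```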
Practical Considerations: Design Patterns and Anti-Patterns
Good Design Practices
Prefer two physically independent paths end-to-end (core, distribution, and edge)
Use managed switches with proven, standards-based redundancy support (ideally with third-party validation)
Keep recovery times under documented business risk tolerances (< 500ms for process automation; tolerances can be far higher for simple metering/monitoring)
Document and regularly test failover procedures (including rollback steps for human error)
Common Pitfalls
Assuming vendor redundancy claims hold under real-world OT failure scenarios (vendors often test graceful shutdowns, not messy power loss)
“Redundant” cabling routed through the same cable tray
Forgetting management plane and control networks require their own redundancy
Over-complexity without staff training and runbooks
Secure Connectivity: The Elephant in the Room
Redundant physical links won’t help if an attacker or malware disables both. All network architectures must acknowledge:
Segmentation between IT and OT via firewalls or industrial DMZs (per ISA/IEC-62443 best practices)
Strong authentication and logged access to switch/router infrastructure
Routine configuration backups and secure storage (encrypted off-host; see the sketch after this list)
Physical security controls (locked panels, access controls in network rooms)
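To illustrate the configuration-backup point, here is a minimal sketch of an encrypted off-host backup loop. It assumes the third-party cryptography package; fetch_config() and the device names are placeholders for however configurations are actually exported (SCP, vendor API, etc.), and in production the key would come from a secrets vault rather than being generated inline.

```python
# Minimal sketch: export each device config and store it encrypted.
# fetch_config() is a placeholder; device names are illustrative.

from datetime import datetime, timezone
from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography

def fetch_config(device: str) -> bytes:
    """Placeholder: pull the running config via SCP/API in real use."""
    return f"! running-config for {device}\n".encode()

key = Fernet.generate_key()   # in practice: load from a secrets vault
vault = Fernet(key)

for device in ("core-sw-1", "ring-sw-7"):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    blob = vault.encrypt(fetch_config(device))
    Path(f"{device}-{stamp}.cfg.enc").write_bytes(blob)
    print(f"backed up {device}: {len(blob)} encrypted bytes")
```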
Conclusion: Simplicity, Drill, and Realism Trump Apparent Complexity
Redundant link design for OT systems is one of those classic “the devil is in the details” problems. Too many environments suffer from designs that look beautiful on a Visio diagram but fail basic real-world tests.
If you’re a CISO, IT/OT director, or engineer: favor approaches you can truly test and support, focus on real independence in physical paths and devices, and rehearse failures before your plant has to live through one. “Redundant” is not a checkbox; it’s a lived operational property. The best networks are those the whole team can operate and recover—no matter which link breaks next.