Building Fault-Tolerant Network Paths in OT

Discover essential strategies to build fault-tolerant, high-availability network paths in Operational Technology environments. Ensure reliability and safety with proven redundancy techniques.

Building Fault-Tolerant Network Paths in Operational Technology (OT) Environments

Industrial networks increasingly underpin critical infrastructure—from power generation and water treatment to manufacturing and energy distribution. Unlike typical enterprise IT environments, Operational Technology (OT) settings emphasize deterministic communication, high uptime, and rigorous safety. Fault tolerance is more than a desirable attribute; it’s quite literally a mandate for reliable and safe operations.

Yet, as more organizations converge IT and OT, the complexity of maintaining resilient, secure network paths grows. In this article, we’ll dissect what fault tolerance means in the context of OT, trace key enabling technologies, and offer concrete architectural suggestions for C-level stakeholders, network architects, and operators alike.

Understanding Fault Tolerance in OT Networks

Definition and Fundamentals

Fault tolerance describes the capacity of a system to continue functioning even in the event of partial failure. For OT, this usually centers on ensuring that network paths—between PLCs (Programmable Logic Controllers), SCADA masters, field devices, sensors, and HMIs—remain available despite failures in equipment, links, or intermediary nodes.

Why OT Requirements Diverge from IT

Determinism: Industrial protocols (e.g., PROFIBUS, Modbus, EtherNet/IP, PROFINET) often require predictable latency—packet delivery windows of tens of milliseconds are common, with jitter tightly controlled.
Redundancy at Multiple Layers: Plant safety and process continuity frequently require that the loss of a switch, link, or even a portion of the network has zero effect on ongoing operations.
Legacy Infrastructure: Decades-old proprietary networks, often not designed with modern reliability models, need to play alongside newer digital assets.

Historical Evolution of Fault Tolerance in the Industrial Space

Legacy Topologies: Ring, Bus, and Star

In the early days (pre-2000s), most fieldbus networks used bus or daisy-chained topologies. These layouts contained single points of failure: one broken cable could stop the whole line. Some systems employed electrical redundancy, but without true dynamic path rerouting.

The advent of ring topologies was a critical milestone. Here, if a link fails, traffic automatically reroutes the other way around the ring. Standards such as Rapid Spanning Tree Protocol (RSTP) and Media Redundancy Protocol (MRP, IEC 62439-2), common in modern industrial Ethernet, built on these designs.
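To make the difference between a bus and a ring concrete, here is a minimal sketch that removes each link in turn and checks whether every switch is still reachable. The switch names and the four-node layout are hypothetical, and the check covers physical connectivity only; it does not model how a ring protocol blocks one port to prevent loops.

```python
from collections import deque

def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over undirected links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def survives_any_single_link_failure(nodes, links):
    """True if every node stays reachable after removing any one link."""
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(nodes, remaining, nodes[0]) != set(nodes):
            return False
    return True

# Hypothetical four-switch line, first as a bus, then closed into a ring.
nodes = ["sw1", "sw2", "sw3", "sw4"]
bus = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw4")]
ring = bus + [("sw4", "sw1")]

print(survives_any_single_link_failure(nodes, bus))   # False
print(survives_any_single_link_failure(nodes, ring))  # True
```

Closing the bus with a single extra link is what removes the single point of failure; the ring protocol then decides which direction traffic actually takes.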

Parallel with IT: Spanning Tree, TRILL, and Beyond

Enterprises faced analogous problems: early Ethernet was vulnerable to loops, so the Spanning Tree Protocol (STP) was standardized in 1990 (IEEE 802.1D). The original protocol was slow, with convergence times of tens of seconds, so improvements such as RSTP (802.1w, 2001) and Multiple Spanning Tree Protocol (MSTP, 802.1s) followed. In industrial contexts, even RSTP's much faster recovery (often sub-second, depending on topology size and configuration) wasn't fast enough for some process demands, driving the OT sector to produce its own rapid-failover mechanisms.

The development of Parallel Redundancy Protocol (PRP, IEC 62439-3) and High-availability Seamless Redundancy (HSR, IEC 62439-3) in the late 2000s exemplifies the OT world’s frustration with “just good enough” reliability and convergence times seen in traditional IT. These approaches prioritize zero recovery time—data is always duplicated along diverse paths so failure is invisible to the application.

Architecting for Fault Tolerance

Network Design Patterns

Ring Topology with Protocol-Based Protection
- Rings with MRP or proprietary fast-recovery mechanisms can restore path availability within 200 ms, sometimes faster. The trade-off is complexity and, occasionally, vendor lock-in.
Dual-Homed Star Topologies
- Devices (or control servers) connect via two separate switches (or distribution layers), possibly with diverse uplinks. This offers path diversity and improved resilience when paired with a redundancy method.
Parallel/Hybrid Topologies with PRP or HSR
- Data packets are sent simultaneously over two distinct LANs (PRP) or along two ring directions (HSR). No traffic interruption occurs; the receiver discards duplicates on-the-fly.
Layered Defense: Combining L2, L3, and Above
- Modern designs increasingly blend L2 and L3 redundancy: redundant L2 rings feeding into L3 (routed) backbones. At higher layers, some plants overlay SD-WAN or VPN for secure remote operator access, with automatic failover.
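The duplicate-discard step used by PRP and HSR receivers can be sketched in a few lines. This is a simplified illustration, not the algorithm from the standard: the class name, the fixed-size window, and the (source, sequence) key are assumptions made for the example, and real implementations age entries by time and handle sequence-number wraparound.

```python
from collections import OrderedDict

class DuplicateDiscard:
    """Accept the first copy of each (source, sequence) pair, drop later copies."""

    def __init__(self, window=1024):
        self.window = window        # how many recent frames to remember (assumed value)
        self.seen = OrderedDict()   # insertion-ordered record of accepted frames

    def accept(self, source, seq):
        key = (source, seq)
        if key in self.seen:
            return False            # second copy, arrived via the other LAN or direction
        self.seen[key] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # forget the oldest entry
        return True

rx = DuplicateDiscard()
print(rx.accept("plc1", 41))  # True: first copy is delivered to the application
print(rx.accept("plc1", 41))  # False: the copy from the second path is discarded
```

Because the first copy is delivered as soon as it arrives, losing one path changes nothing for the application; only the discarded-duplicate count drops.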

Considerations by Layer

Layer 1 (Physical Layer)

Use physically separated cable routes where possible. Fiber is increasingly affordable and can mitigate EMI/EMC concerns.
Redundant power supplies, supervisory circuits, and environmental controls reduce domino failures.
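A quick way to audit physical separation is to list the segments each cable route passes through and look for overlap. The sketch below assumes you keep such a list per route; the segment names are invented for illustration.

```python
def shared_segments(route_a, route_b):
    """Return the physical segments that both cable routes pass through."""
    return sorted(set(route_a) & set(route_b))

# Hypothetical routes for a primary and a secondary link.
primary = ["cabinet-A", "tray-north", "duct-3", "control-room"]
secondary = ["cabinet-A", "tray-south", "duct-7", "control-room"]

print(shared_segments(primary, secondary))  # ['cabinet-A', 'control-room']
```

Any segment that appears in the result is a place where one physical event, such as a fire or a cut tray, can take out both paths at once.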

Layer 2 (Data Link Layer)

Select industrial switches supporting fast failover (MRP, proprietary rapid recovery, PRP/HSR).
Loop-free operation is typically mission-critical; misconfigurations at this layer can cascade.
Use VLANs for segmentation, but account for how topology-change notifications (TCNs) propagate in your design.

Layer 3 (Network Layer) and Above

OSPF/EIGRP (multi-area, fast hello/dead intervals) can provide sub-second convergence, but keep in mind that deterministic traffic may suffer during reconvergence windows.
Where process tolerance allows, IP routing adds resilience and helps with network segmentation for security.
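To judge whether a reconvergence window is acceptable, it helps to translate it into missed process cycles. The following is a back-of-the-envelope sketch; the function names and the example figures (an 800 ms outage, a 10 ms cycle, three tolerated misses) are assumptions chosen for illustration, not values from any standard.

```python
import math

def cycles_lost(outage_ms, cycle_ms):
    """Worst-case number of cyclic updates that fall inside a network outage."""
    return math.ceil(outage_ms / cycle_ms)

def watchdog_trips(outage_ms, cycle_ms, tolerated_missed_cycles):
    """True if the outage exceeds what the application tolerates."""
    return cycles_lost(outage_ms, cycle_ms) > tolerated_missed_cycles

print(cycles_lost(800, 10))        # 80
print(watchdog_trips(800, 10, 3))  # True
```

A sub-second routing convergence that looks excellent on paper can still cost dozens of cycles, which is why routed redundancy suits traffic that tolerates such gaps.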

Protocols in Play: Deep Dive

MRP (IEC 62439-2): Offers rapid ring recovery for rings of up to 50 switches and is typically used in field networks. Typical recovery time: under 200 ms.
PRP (IEC 62439-3): Allows zero-time recovery by sending frames over two independent LANs. End devices handle duplicate elimination, so the failure of one path is invisible to the application. Check carefully how devices that are not PRP-aware will attach and interoperate; they typically connect through a redundancy box (RedBox) or to only one of the two LANs.
HSR (IEC 62439-3): Optimized for ring/circular topologies. Every node forwards every frame, ensuring no single point of failure; well-suited for linear or circular process area runs. It is often used in electric utilities.
RSTP/MSTP: Adequate for less time-sensitive OT, but not real-time enough for safety or certain process operations.
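One way to use the figures above is to start from the outage the process can tolerate and filter the protocols against it. In this sketch the MRP, PRP, and HSR values follow the list above, while the RSTP value is a placeholder assumption that should be replaced with a figure measured on your own equipment; the table and function names are hypothetical.

```python
# Worst-case recovery times in milliseconds.
# RSTP is an assumed placeholder, not a measured or standardized figure.
WORST_CASE_RECOVERY_MS = {
    "PRP": 0,
    "HSR": 0,
    "MRP": 200,
    "RSTP": 1000,
}

def candidates(budget_ms, table=WORST_CASE_RECOVERY_MS):
    """Protocols whose worst-case recovery fits within the outage budget."""
    return sorted(name for name, ms in table.items() if ms <= budget_ms)

print(candidates(0))    # ['HSR', 'PRP']
print(candidates(250))  # ['HSR', 'MRP', 'PRP']
```

The result is only a shortlist; cost, topology constraints, and device support still decide the final choice.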

IT/OT Collaboration: The Practical Challenges

Districts, Domains, and Zones—IT’s Logical Segmentation vs OT’s Physical Realities

The IT world often approaches network separation through segmentation, usually with VLANs, subnets, and firewalls. In OT, these logical constructs need careful mapping onto physical realities (specific cable routes, plant boundaries, hazardous environment regulations).

OT teams rarely tolerate extended downtime for changes or troubleshooting: "maintenance windows" are measured in seconds or require months of coordination. Collaborative teams must therefore exercise extreme care and plan long lead times for disruptive reconfigurations.

Securing Redundancy—Risks of Overlapping Paths

Redundant paths that are not also secure effectively double an adversary’s options.
PRP and HSR overlays don't guarantee encryption or integrity on their own; layering VPNs or IPsec tunnels may be required, and the resulting latency overhead must be balanced against OT determinism requirements.
It is critical to coordinate IT and OT monitoring so both paths are visible to security operations; silent failovers can mask ongoing faults or active attacks unless alarms and logging are present on all redundant routes.
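A silent failover can be surfaced by comparing per-path receive counters between two polls. The sketch below assumes you already collect a frame counter per redundant LAN; how the counters are gathered, along with the LAN names and the status labels, is hypothetical.

```python
def lan_health(prev, curr):
    """Compare two snapshots of per-LAN receive counters.

    prev and curr map a LAN name to the frames received on it so far.
    A LAN whose counter did not advance is reported as stalled.
    """
    stalled = sorted(lan for lan in curr if curr[lan] <= prev.get(lan, 0))
    if not stalled:
        return "ok", stalled
    if len(stalled) < len(curr):
        return "degraded", stalled   # traffic still flows, but redundancy is gone
    return "down", stalled

print(lan_health({"A": 100, "B": 100}, {"A": 150, "B": 100}))  # ('degraded', ['B'])
```

The "degraded" state is the one that matters here: the process keeps running, so without an alarm nobody notices that the next failure will cause an outage.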

Testing, Validation, and Ongoing Maintenance

Lab Before Plant: Validating Under Load

Test failover and recovery under both normal and high network load. Some industrial switches exhibit drastically slower recovery when loaded with broadcast/multicast traffic.
Use industrial protocol simulators (e.g., PROFINET tester, Modbus tools) to verify application-level continuity across failovers.
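When a lab test captures the arrival times of cyclic frames at the receiver, the interruption seen by the application can be estimated from the largest gap. This is a minimal sketch with hypothetical function names; it assumes timestamps in milliseconds and a known nominal cycle time.

```python
def longest_gap_ms(arrival_times_ms):
    """Largest interval between consecutive received frames, in milliseconds."""
    if len(arrival_times_ms) < 2:
        raise ValueError("need at least two arrival timestamps")
    ordered = sorted(arrival_times_ms)
    return max(b - a for a, b in zip(ordered, ordered[1:]))

def outage_ms(arrival_times_ms, cycle_ms):
    """Gap beyond one normal cycle, i.e. the interruption the application saw."""
    return max(0, longest_gap_ms(arrival_times_ms) - cycle_ms)

# Hypothetical capture: a 10 ms cycle with a link pulled after the fourth frame.
arrivals = [0, 10, 20, 30, 180, 190]
print(longest_gap_ms(arrivals))  # 150
print(outage_ms(arrivals, 10))   # 140
```

Repeating the measurement under idle and loaded conditions gives a direct comparison against the recovery time quoted on the data sheet.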

Ongoing Monitoring and Lifecycle Maintenance

Monitor path availability, duplicate frame rates (for PRP/HSR), and link flapping.
Document and routinely verify contact information for incident response on all critical network elements.
Plan for rare but challenging issues—such as L2 mis-learning, address resolution failures, and non-deterministic device behavior after power cycling.
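Link flapping is easy to miss when each individual event recovers quickly. A sliding-window counter is one simple way to flag it; the class name, threshold, and window length below are assumptions to tune for your environment.

```python
from collections import deque

class FlapDetector:
    """Flag a link that changes state too often within a sliding time window."""

    def __init__(self, max_changes=4, window_s=60.0):
        self.max_changes = max_changes   # assumed threshold
        self.window_s = window_s         # assumed window length in seconds
        self.changes = deque()           # timestamps of recent state changes

    def record_change(self, timestamp_s):
        """Record an up/down transition; return True if the link is flapping."""
        self.changes.append(timestamp_s)
        # Drop transitions that have fallen out of the window.
        while self.changes and timestamp_s - self.changes[0] > self.window_s:
            self.changes.popleft()
        return len(self.changes) > self.max_changes

detector = FlapDetector(max_changes=2, window_s=10)
print([detector.record_change(t) for t in (0, 1, 2, 30)])  # [False, False, True, False]
```

Feeding the detector from switch link-state events turns a stream of individually harmless transitions into a single actionable alarm.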

Summary and Recommendations

Evaluate business/operational continuity objectives, then choose the simplest topology and protocol that meets them—in OT, complexity often backfires.
Use industrial-grade protocols (PRP/HSR/MRP) for real-time, sub-second recovery; reserve conventional IT approaches (RSTP, L3 redundancy) for peripheral or less time-sensitive process zones.
Bake in path diversity physically (different cable routes, disparate switches) and logically (VRFs, VLANs, firewall zones).
Build and maintain close coordination between IT and OT teams—architecture must align with process safety, not just connectivity.
Test, test, and test again. Don’t trust vendor data sheets or simulation alone—use real hardware and process loads.
