Scheduling Maintenance Windows in 24/7 Plants
Learn how to schedule maintenance windows in 24/7 industrial plants with best practices for network segmentation, redundancy, and human collaboration to minimize downtime.
For industrial and critical infrastructure operators, routine maintenance is non-optional. But in environments that run 24/7—think chemical production, power generation, semiconductor fabs—a traditional "Sunday maintenance window" is a luxury they are not afforded. This article unpacks the practical realities, risks, and technical considerations faced by CISOs, IT directors, network engineers, and operators responsible for the digital reliability of continuous-process plants.
Historical Overview: OT, IT, and the Maintenance Conundrum
Historically, Operational Technology (OT) networks operated on the premise of physical isolation—the so-called "air gap." Shifts in value and productivity over the past two decades have dissolved these boundaries, converging IT and OT, and setting the stage for new operational headaches.
Pre-2000s: Many SCADA and Distributed Control Systems (DCS) were designed for stand-alone, high availability. Patch cycles were infrequent; system downtime tolerances were loose by today’s standards.
2000s Onward: The proliferation of industrial Ethernet and IP networking (Modbus TCP, PROFINET, EtherNet/IP), virtualization, and the rise of remote monitoring introduced dependencies on general-purpose IT stacks with their own regular patch and update cycles.
Present: Regulatory frameworks (NERC CIP, IEC 62443, CFATS, etc.) and attack-surface realities mean that failing to update or patch is no longer viable—but neither is downtime.
The Realities of 24/7 Operation
Compared with transactional enterprise IT workloads, industrial control systems require the following properties:
High Availability (HA): Plants often achieve 99.99+% uptime via N+1 or 2N architectures, redundant networking, and clustered control systems.
Determinism: Predictable response times are essential. Even micro-outages from patching can induce system faults or production slowdowns.
Patch Aversion: Legacy devices have limited vendor support, and unvalidated firmware updates or configuration changes are a vector for both failure and attack.
Yet, regulatory and security realities insist systems must be updated against threats such as ransomware, supply chain compromise, or zero-day exploitation.
Network and Infrastructure Design to Enable Maintenance
Robust architectural design is the linchpin. Below are foundational elements proven to minimize downtime during necessary maintenance:
1. Physical and Logical Network Segmentation
By enforcing strict segmentation (VLANs, VRFs, or air gaps where justified), critical real-time systems (Level 1, per Purdue Model) are insulated from changes elsewhere. Modern best practice places zone and conduit firewalls at trust boundaries, using least-privilege allow rules.
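The least-privilege principle at a zone/conduit boundary can be sketched as a deny-by-default rule check. The zone names, subnets, and rule table below are invented for illustration, not a real plant configuration:

```python
# Hypothetical sketch: validating least-privilege allow rules at an OT
# zone/conduit firewall. Anything not matched by an explicit rule is denied.
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class AllowRule:
    src_zone: str   # e.g. "L3-ops" (names are illustrative)
    dst_zone: str   # e.g. "L2-control"
    dst_net: str    # destination subnet, CIDR notation
    port: int       # single allowed port
    proto: str      # "tcp" or "udp"

RULES = [
    AllowRule("L3-ops", "L2-control", "10.20.1.0/24", 502, "tcp"),    # Modbus TCP
    AllowRule("L3-ops", "L2-control", "10.20.1.0/24", 44818, "tcp"),  # EtherNet/IP
]

def is_allowed(src_zone, dst_zone, dst_ip, port, proto):
    """Deny-by-default: True only if an explicit rule matches."""
    dst = ip_address(dst_ip)
    return any(
        r.src_zone == src_zone and r.dst_zone == dst_zone
        and dst in ip_network(r.dst_net) and r.port == port and r.proto == proto
        for r in RULES
    )
```

The point of encoding rules this way is that the allowlist becomes testable: you can assert that SSH from the ops zone is blocked before the change window, not discover it after.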
2. Redundant Network Paths and Failover Protocols
Standard Spanning Tree Protocol (STP) dates to the 1980s, but modern industrial networks may depend on specialized rapid failover mechanisms like PRP (Parallel Redundancy Protocol) and HSR (High-availability Seamless Redundancy)—both standardized in IEC 62439-3. These are designed for “bumpless” failover, allowing you to upgrade one path while the other maintains session continuity.
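Before taking one PRP LAN out of service, a pre-flight check should confirm the other LAN is actually receiving frames from every supervised node. This is a minimal sketch of that check, not a real PRP management API; the node names and frame counters are invented:

```python
# Illustrative pre-flight check before servicing one leg of a PRP network:
# confirm the redundant LAN is carrying traffic from every supervised node.
def safe_to_service(path_counters, lan_to_service):
    """path_counters: {node: {"LAN_A": frames_seen, "LAN_B": frames_seen}}.
    True only if the *other* LAN has received frames from every node."""
    other = "LAN_B" if lan_to_service == "LAN_A" else "LAN_A"
    return all(counters.get(other, 0) > 0 for counters in path_counters.values())

counters = {
    "ied-01": {"LAN_A": 1200, "LAN_B": 1198},
    "ied-02": {"LAN_A": 980, "LAN_B": 0},  # LAN_B silent: NOT safe to drop LAN_A
}
```

A silent counter on the redundant path is exactly the failure you want to find before, not during, the upgrade.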
3. Out-of-Band Management (OOBM) Networks
Despite being IT 101, OOBM is often missing in brownfield plants. Reliable serial consoles (e.g., via terminal servers) and isolated jump hosts can mean the difference between a recoverable misconfiguration and a disastrous plant-wide outage during a change window.
4. Virtualization and Containerization
Selective virtualization (hypervisors purpose-designed for real time) or containerization of operator HMI and historian workloads allows “hot moves” and easier snapshot-based rollback, shrinking recovery times during failed patch events.
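The snapshot-based rollback pattern can be sketched as a small workflow: snapshot before patching, verify health after, revert on failure. The `Hypervisor` class here is a stand-in for whatever management API your platform exposes, not a real product interface:

```python
# Hypothetical sketch of snapshot-then-patch for a virtualized HMI workload.
# The Hypervisor class models only what the workflow needs.
class Hypervisor:
    def __init__(self):
        self.snapshots = {}
        self.state = {}  # vm name -> config/firmware state (illustrative)

    def snapshot(self, vm, name):
        self.snapshots[(vm, name)] = dict(self.state.get(vm, {}))

    def revert(self, vm, name):
        self.state[vm] = dict(self.snapshots[(vm, name)])

def patch_with_rollback(hv, vm, apply_patch, health_ok):
    """Snapshot, patch, validate; revert to the snapshot if validation fails."""
    hv.snapshot(vm, "pre-patch")
    apply_patch(hv.state.setdefault(vm, {}))
    if not health_ok(hv.state[vm]):
        hv.revert(vm, "pre-patch")  # shrink recovery time: restore known-good state
        return "rolled-back"
    return "patched"
```

The design choice worth copying is that rollback is decided by an explicit health check, not by an operator eyeballing the HMI at 3 a.m.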
Scheduling and Coordinating Maintenance Windows: Tactical Approaches
1. Aligning with Plant Operations: Not All Hours Are Equal
While true "downtime" doesn't exist, there are usually lower-risk production phases: for example, batch transitions, tank clean-in-place cycles, or scheduled shift changes. Working in close coordination with operations allows pairing maintenance with these natural lulls.
2. Staged, Rolling Maintenance
Implement a phased approach:
Device- or zone-based rotation: Update one side of a redundant pair at a time (A-side, then B-side), ensuring you *never* risk both "legs" of the pair simultaneously.
Test in dev, then limited pilot: A lab or digital twin that mirrors plant network topologies and asset firmware versions is invaluable, but nothing replaces piloting on a limited production segment.
Rollback readiness: Pre-position tested recovery images and validated configuration backups.
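The rotation rule above—never both legs of a redundant pair in the same wave—is simple enough to enforce mechanically. A minimal sketch, with invented switch names:

```python
# Sketch of device/zone rotation planning: split redundant pairs into two
# update waves so upgrading one wave always leaves the other leg in service.
def plan_waves(pairs):
    """pairs: list of (a_side, b_side) redundant assets.
    Returns two waves: all A-legs, then all B-legs."""
    wave_a = [a for a, _ in pairs]
    wave_b = [b for _, b in pairs]
    return [wave_a, wave_b]

def violates_redundancy(wave, pairs):
    """True if any redundant pair has both legs scheduled in the same wave."""
    return any(a in wave and b in wave for a, b in pairs)
```

Running `violates_redundancy` against every planned wave is a cheap pre-change gate for the change review board.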
3. IT/OT Collaboration Rituals
True coordination is a human, not just technical, discipline:
Establish documented “change review boards” (CRB) encompassing both IT and OT stakeholders.
Maintain detailed MOPs (Methods of Procedure) and backout plans—engineers should rehearse these, tabletop style, in cross-disciplinary teams.
Leverage change management systems with precise asset inventory integration. Don’t just track servers—include PLC firmware, switch versions, firewall rulesets.
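What "precise asset inventory integration" means in practice is that a change record captures before/after state for OT assets, not just servers. A sketch of such a record; the field names are assumptions, not a specific CMDB schema:

```python
# Illustrative change record tracking OT-specific asset state:
# PLC firmware, switch image, firewall ruleset revision.
from dataclasses import dataclass, field

@dataclass
class AssetState:
    asset_id: str
    kind: str     # "plc" | "switch" | "firewall" | "server"
    version: str  # firmware / image / ruleset revision

@dataclass
class ChangeRecord:
    ticket: str
    before: list = field(default_factory=list)
    after: list = field(default_factory=list)

    def drift(self):
        """Asset IDs whose version actually changed under this ticket."""
        prev = {a.asset_id: a.version for a in self.before}
        return [a.asset_id for a in self.after if prev.get(a.asset_id) != a.version]
```

Comparing `drift()` against the approved scope of the MOP flags any asset that changed when it shouldn't have.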
Security Considerations: Balancing Patch Urgency and Operational Risk
Security teams often push for “immediate patch” in response to high-profile CVEs. In 24/7 industrial settings, the operational risk can outweigh the security risk—especially when, as is common, the native vendor-provided patch process itself requires reboots or production stoppage.
Mitigation is about trade-offs:
Where patching isn't immediately feasible, deploy virtual patching at critical OT firewalls (IPS signatures, protocol whitelisting).
Increase endpoint monitoring (EDR, passive asset discovery) for known indicators of exploitation, especially on maintenance-deferred segments.
Track vendor advisories—some patches introduce their own instability, so be deliberate about risk/benefit analysis for each asset class.
Monitoring, Alerts, and Post-Change Validation
After the window, vigilance is key:
Leverage network taps or SPAN ports for real-time diagnostics—look for link flapping, loss of protocols such as GOOSE or DNP3, or abnormal latency between critical node pairs.
Monitor OT asset health using dedicated system diagnostics, not just generic up/down SNMP traps.
Document every change and anomaly, feeding future maintenance planning and informing OT Cybersecurity Incident Response Plans (CIRP).
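Two of the symptoms above—link flapping and abnormal latency between critical node pairs—reduce to simple checks over periodic samples. A minimal sketch; the thresholds are illustrative, not vendor guidance:

```python
# Post-window validation sketch: detect link flapping and abnormal latency
# from sampled data. Tune thresholds to your own baselines.
def link_flapping(up_samples, max_transitions=2):
    """up_samples: sequence of booleans (link up?) over the observation window.
    Flags a link whose state changed more than max_transitions times."""
    transitions = sum(1 for a, b in zip(up_samples, up_samples[1:]) if a != b)
    return transitions > max_transitions

def latency_abnormal(samples_ms, baseline_ms, factor=3.0):
    """Flags any round-trip sample exceeding the pre-change baseline by factor."""
    return max(samples_ms) > baseline_ms * factor
```

These would typically run against data pulled from taps/SPAN ports for hours after the window closes, not just in the first five minutes.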
Case Study Example: Rolling Switch Firmware Upgrade in a Power Plant
Consider a high-availability substation network deployed using PRP rings. A firmware flaw is discovered in the edge switches, leaving them vulnerable to a denial-of-service attack. A full plant shutdown is not possible. The process:
Segment the ring into halves; upgrade one segment after shifting all OT traffic to the redundant loop.
Monitor the health of process communications (e.g., GOOSE packet delivery) before, during, and after firmware load.
If severe anomaly is detected, invoke rollback script pre-loaded via OOBM port.
Complete the upgrade on both segments, then run a validation test of seamless failover between the rings to confirm both sides operate as expected before returning to standby.
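The steps above can be tied together as a hypothetical orchestration loop. The `check_goose`, `shift_traffic`, `upgrade`, and `rollback` hooks stand in for real tooling reached over the OOBM network:

```python
# Sketch of the case-study process: per-segment upgrade with automatic
# rollback if process communications (e.g., GOOSE delivery) degrade.
def rolling_upgrade(segments, shift_traffic, upgrade, check_goose, rollback):
    done = []
    for seg in segments:
        shift_traffic(away_from=seg)   # move OT traffic to the redundant loop
        upgrade(seg)                   # load new firmware on this segment
        if not check_goose(seg):       # process comms unhealthy after load?
            rollback(seg)              # invoke pre-loaded rollback via OOBM
            return ("aborted", done)
        done.append(seg)
    return ("complete", done)
```

Note that the loop stops at the first unhealthy segment: the already-upgraded half stays in place, and the plant never loses both legs at once.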
Lessons Learned and Common Pitfalls
Misunderstanding dependencies: Changing seemingly isolated systems (e.g., patching a domain controller) can break authentication to a safety-critical HMI.
Underestimating latency: Even momentary STP recalculation can cause cascading alarms or safety instrumented system (SIS) triggers.
Complacency after the window: Some latent defects only reveal themselves after hours or days—continuous validation is essential.
Conclusion: Toward Resilient, Secure, and Sustainable Maintenance Policy
There is no panacea; instead, you want robust network design, deliberate human process, and unflinching IT/OT collaboration. The core lesson is to never let operational “business as usual” become a substitute for well-maintained infrastructure. Plan well, communicate better, validate constantly.
Maintenance can and must fit in a 24/7 world—even if the window is only a few seconds, and the process takes days of preparation.
Further Reading and Standards References
IEC 62443: Security for industrial automation and control systems
IEC 62439-3: High availability automation networks—PRP/HSR
NIST SP 800-82: Guide to Industrial Control Systems (ICS) Security
ISA 101: HMI Design and Management
About the Author
The author has spent years working in plant networks, sometimes with a screwdriver in one hand and a CLI in the other, trying to fix what shouldn’t have broken during a maintenance window. If you want cheerful marketing, go elsewhere; if you want the real deal, bookmark this page.