Paper 2 Complete DCS Failure
Paper Presentation:
“COMPLETE AND SIMULTANEOUS DCS FAILURE
IN TWO 500MW UNITS”
[Diagram: DCS functions: Protections, Alarms, Controls, Analysis, Operations, Reporting, Monitoring]
INTRODUCTION INCIDENTS ANALYSIS LEARNINGS CONCLUSION
System Events
• In both units, all DCS processors rebooted simultaneously and lost all logic configurations.
• In both units, the boiler and TG tripped on hardwired backup protections outside DCS control.
• No indications or alarms were available for operation, and no SOE/event logs for troubleshooting.
System Restoration
• Workstations were inoperative after the incident until network-B was temporarily switched off.
• A full download was done in the DCS panels, taking 3-4 hours for complete restoration.
• The suspected switch was replaced, with proper IP address and port settings.
Work before 2nd incident
• A DCS vendor representative was at site to troubleshoot the 1st incident.
• While the uplink connection to another common Ethernet switch in network-B was being restored, a similar complete DCS failure occurred (one unit tripped at full load; the other unit's boiler tripped).
Activities after 2nd incident
• The uplink connection that caused the 2nd failure was removed.
• Operational actions and system restoration were done as in the 1st incident.
• Both units were kept under safe shutdown for further testing.
Work before 3rd failure (test)
• All Ethernet switch port settings and connections in the network were checked thoroughly.
• Double physical connections between the same set of switches were removed wherever found.
• On reconnecting the uplink that caused the 2nd failure, the complete DCS of both units failed again.
Activities after 3rd failure
• The net-B uplink connection that caused the 2nd and 3rd failures was taped and kept removed.
• System restoration was done as in the 1st and 2nd incidents, and both units were taken to full load.
• The threat still exists in network-A, but a solution has been pending from the supplier for 2½ months.
(Broadcast storm initiated by restart / uplink initialization of any one switch in the loop)
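The mechanism named above can be illustrated with a small simulation. This is a simplified sketch, not the plant's network: it assumes plain flooding (each switch forwards a broadcast frame out of every port except the one it arrived on), no spanning tree or storm control, no frame loss, and one hop per time step. The switch names and the function `frames_in_flight` are hypothetical.

```python
def frames_in_flight(links, source, hops):
    """Count the copies of one broadcast frame still travelling after each hop.

    links  : list of (switch_a, switch_b) pairs; repeated pairs model parallel links
    source : switch that first floods the frame
    hops   : number of forwarding steps to simulate
    """
    # ports[s] lists (link_id, neighbour) for every link attached to switch s
    ports = {}
    for link_id, (a, b) in enumerate(links):
        ports.setdefault(a, []).append((link_id, b))
        ports.setdefault(b, []).append((link_id, a))

    # The source floods the frame out of all of its ports.
    in_flight = list(ports.get(source, []))
    history = [len(in_flight)]

    for _ in range(hops):
        forwarded = []
        for arrived_on, switch in in_flight:
            # Flood to every port except the one the frame arrived on.
            for link_id, neighbour in ports[switch]:
                if link_id != arrived_on:
                    forwarded.append((link_id, neighbour))
        in_flight = forwarded
        history.append(len(in_flight))
    return history


tree = [("A", "B"), ("B", "C")]                 # no closed path
ring = tree + [("C", "A")]                      # three switches in a loop
ring_with_double_link = ring + [("A", "B")]     # loop plus a parallel link

print(frames_in_flight(tree, "A", 5))   # [1, 1, 0, 0, 0, 0]
print(frames_in_flight(ring, "A", 5))   # [2, 2, 2, 2, 2, 2]
print(frames_in_flight(ring_with_double_link, "A", 5))
```

In the tree the frame dies out after two hops. In the ring two copies circulate indefinitely, and with an additional parallel link the number of copies grows at every hop, which is the overload condition a broadcast storm produces.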
Backup copy of logics
• DCS processors must always keep a backup copy of logics in non-volatile flash memory, to avoid complete loss of logics and the long downtime of 3-4 hours for logic download and restoration.
Processor memory allocation
• The DCS processor functions of RTOS, I/O scans, logic execution and external communications must be protected by dedicated memory allocation.
• The communication handler utilities inside the processor firmware for each redundant network and for the hot backup link should also be completely independent.
Vendor solution tieup
• All DCS vendors must be contractually mandated to provide prompt support and permanent solutions (within one month) in the unlikely event of complete DCS control failure with potentially catastrophic consequences, throughout the plant lifespan.
Broadcast storms
• All sources of broadcast storms causing network overload must be identified and addressed through appropriate checks and balances (in architecture and configurations).
Parallel double connections
• Parallel double connections (port redundancy) between any adjacent pair of switches must be prevented by design.
Inadvertent ring or loop
• Multiple (> 2) switches must not form a closed physical path (ring or loop). Special attention is needed at the highest level of the network hierarchy, where multiple domains or units of the same DCS connect.
Firewall
• Hardware firewalls must be installed in the DCS network at appropriate interfaces and configured to prevent denial-of-service (DoS) attacks from outside the plant safety network. Typical firewall interfaces: offsite PLCs, PADO, intranet, ERP, etc.
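The two topology rules above (no parallel double connections, no closed path among switches) can be checked against a documented list of switch-to-switch links. The following is a minimal sketch under the assumption that such a link list is available as pairs of switch names; the names `SW1`..`SW4` and both functions are hypothetical and not taken from any vendor tool.

```python
from collections import Counter


def parallel_links(links):
    """Return the switch pairs that are connected by more than one link."""
    counts = Counter(frozenset(link) for link in links)
    return sorted(tuple(sorted(pair)) for pair, n in counts.items() if n > 1)


def loop_closing_links(links):
    """Return every link that closes a physical path (ring, loop or parallel link).

    Uses union-find: a link whose two ends are already connected through
    earlier links in the list closes a loop.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    closing = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            closing.append((a, b))
        else:
            parent[root_a] = root_b
    return closing


# Hypothetical link list: a three-switch ring plus a doubled connection.
links = [
    ("SW1", "SW2"),
    ("SW2", "SW3"),
    ("SW3", "SW1"),   # closes the ring SW1-SW2-SW3
    ("SW3", "SW4"),
    ("SW3", "SW4"),   # parallel double connection
]

print(parallel_links(links))       # [('SW3', 'SW4')]
print(loop_closing_links(links))   # [('SW3', 'SW1'), ('SW3', 'SW4')]
```

An empty result from `loop_closing_links` means the documented links form a loop-free (tree) topology. Which link is reported for a given loop depends on the order of the list, so the output identifies that a loop exists rather than which cable to remove.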
Boiler backup systems
• Failure of 2/3 MFT processors must initiate hardwired MFT and trip all firing equipment.
• DC/emergency scanner air system auto-start; closing of spray main block valves.
• Direct indication of furnace pressure and excess oxygen, independent of DCS.
Electrical backup systems
• The enabling signal from DCS for the UT-ST fast changeover electrical circuit must be kept available through a set-reset latch relay, to prevent loss of unit supply.
• Manual closing provision for the DG incomer breaker and critical auxiliary drives on the electrical module at the LT switchgear.
Emergency trip systems
• The backup protections for boiler and turbine must use redundant power sources, both independent of the normal power sources to the protection panel. Failure of both backup supplies must initiate machine trip through DCS. Desk EPBs must trip the machine through the backup path.
(Detailed list in technical paper)
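Two of the conditions above are simple Boolean rules and can be written out explicitly. This is an illustrative sketch of the logic only; in the plant these functions are realised in hardwired relays, and the function and parameter names here are hypothetical.

```python
def hardwired_mft_demand(proc1_failed, proc2_failed, proc3_failed):
    """2-out-of-3 vote: demand hardwired MFT when at least two MFT processors have failed."""
    failed = [bool(proc1_failed), bool(proc2_failed), bool(proc3_failed)]
    return sum(failed) >= 2


def backup_supply_trip_demand(supply_a_healthy, supply_b_healthy):
    """Demand a machine trip through DCS only when both backup supplies have failed."""
    return (not supply_a_healthy) and (not supply_b_healthy)


# One processor failed: no MFT demand. Two or three failed: MFT demand.
print(hardwired_mft_demand(True, False, False))   # False
print(hardwired_mft_demand(True, True, False))    # True
print(hardwired_mft_demand(True, True, True))     # True

# Loss of a single backup supply does not trip the machine.
print(backup_supply_trip_demand(True, False))     # False
print(backup_supply_trip_demand(False, False))    # True
```

A complete DCS failure of the kind described in the incidents corresponds to the last MFT case (all three processors failed), which is why the vote has to be evaluated outside the DCS.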
Need to test
• To ensure safety in control systems: “First introspect, then inspect, thus protect”.
• Complex network architecture makes DCS-based protection systems vulnerable.
Backup healthiness
• The healthiness of all backup systems (independent of DCS) must also be ensured periodically by the C&I maintenance (check) and operation (witness) departments during protection checking, typically after unit overhauls. A joint protocol is to be recorded.
CONCLUSION POINTS
THANK YOU