Paper 2: Complete DCS Failure

The document describes three complete failures of the distributed control system (DCS) that controls two 500MW power plant units. In each incident, a restart or connection change to an Ethernet switch caused a broadcast storm that overloaded the DCS processors, causing them to simultaneously reboot and lose their logic configurations. This resulted in the units tripping off-line due to loss of controls and protections. The root cause was identified as a design flaw in the DCS processors that caused a single network failure to disable both redundant networks and processors. Improvements to the DCS design are proposed to prevent single failures from causing complete loss of control.


INDIAN POWER STATIONS

POWER PLANT O&M CONFERENCE - 2012

Paper Presentation:
“COMPLETE AND SIMULTANEOUS DCS FAILURE
IN TWO 500MW UNITS”

K C Tripathy, DGM (C&I)


Anoop K, Dy Manager (C&I)
B K Rathore, Dy Manager (C&I)

ROLE OF DCS IN POWER PLANTS

(Diagram: DCS at the centre, linked to its roles – Protections, Controls, Alarms, Operations, Monitoring, Reporting, Analysis)

DCS ARCHITECTURE FOR CASE-STUDY


TOTAL: 214 processors, 46 workstations, 50 Ethernet switches

• Domain-1 (1st Unit): 103 processors, 21 workstations
• Domain-2 (2nd Unit): 103 processors, 21 workstations
• Domain-3 (Common plant): 8 processors, 4 workstations
• Network-A (Ethernet): 22 Layer-2 switches, 3 Layer-3 switches
• Network-B (Ethernet): 22 Layer-2 switches, 3 Layer-3 switches

Each functional group has 2 redundant processors & 2 redundant networks.

INCIDENT NO. 1 REPORT


(Both units initially @ full load)

Maintenance Work
• An Ethernet switch in the network-B common domain was found to be non-communicating.
• This switch (common to both units' DCS) was restarted as per site procedure.

System Events
• All DCS processors of both units simultaneously rebooted & lost all logic configurations.
• Both units' boiler and TG tripped on hardwired backup protections outside DCS control.
• No indications/alarms were available for operation. No SOE/event logs for troubleshooting.

Operational Actions
• Emergency auxiliary drives were started directly from LT switchgear.
• The DG incomer breaker was manually closed in one unit from the switchgear.
• ECW pumps (that tripped in one unit) were directly started from HT switchgear.

System Restoration
• Workstations were inoperative after the incident until network-B was temporarily switched off.
• A full logic download was done in the DCS panels, taking 3-4 hours for complete restoration.
• The suspect switch was replaced, with proper IP address and port settings.

PLANT SAFETY HAZARDS


BOILER (tripped via hardwired MFT to all firing equipment)
• Furnace pressure of one unit was found near the high trip limit after DCS restoration.
• Boiler drums were emptied; BCW pumps ran dry without any protection.
• ID-FD fans ran without any control or protection for over an hour.
• Safety relief valves floated for 15 minutes in both units, as the bypass was unavailable.
• POTENTIAL RISK: Furnace explosion / tube failures (if backups also fail)

TURBINE (tripped via hardwired inter-trip from MFT)
• Turbine of one unit crash-halted; hand-barring was not possible for 2 days.
• TG lub oil temperature control unavailable: low temperatures persisted.
• All TDBFPs ran without any protection; one suffered an exhaust diaphragm rupture.
• Hotwell of one unit emptied & CEPs ran dry without any protection.
• POTENTIAL RISK: Water ingress / bearing failures (if backups also fail)

GENERATOR (tripped via low forward power delayed backup)
• Seal oil system of one unit was out of service for a few minutes
  (due to UT-ST changeover failure & delay in DC SOP starting).
• Hydrogen leakage in one unit was observed on the TG floor.
• POTENTIAL RISK: Fire hazard

INCIDENTS NO. 2 & 3 REPORT

Work before 2nd incident
• The DCS vendor representative was at site to troubleshoot the 1st incident.
• While restoring an uplink connection to another common Ethernet switch in network-B, a similar
  complete DCS failure occurred (one unit tripped at full load; the other unit's boiler tripped).

Activities after 2nd incident
• The uplink connection that caused the 2nd failure was removed.
• Operational actions and system restoration were done similar to the 1st incident.
• Both units were kept under safe shutdown for further testing.

Work before 3rd failure (test)
• All Ethernet switch port settings and connections in the network were checked thoroughly.
• Double physical connections between the same pair of switches were removed, wherever found.
• On reconnecting the uplink that caused the 2nd failure, the complete DCS of both units failed again.

Activities after 3rd failure
• The net-B uplink connection that caused the 2nd and 3rd failures was taped and kept removed.
• System restoration was done similar to the 1st and 2nd incidents & both units were taken to full load.
• The threat still exists in network-A, BUT a solution has been pending from the supplier for 2½ months.

INITIATING CAUSE (DCS NETWORK)


(Figure: representative loop diagram for network-B, similar for redundant network-A. The broadcast storm is initiated by the restart / uplink initialization of any one switch in the loop.)

ROOT CAUSE (DCS PROCESSOR)


A single-network overload caused a working-memory overflow and an application crash of the
communication handler utility inside the DCS processor firmware (which, incidentally, is
common to both networks). All processors (active and redundant) abruptly rebooted at once.
Thus both the redundant-network and redundant-processor concepts of the DCS design were
defeated.

Since this particular version of the DCS processor does not keep a backup copy of the logics
in non-volatile flash, all logic programs were lost. This resulted in a huge downtime of
3-4 hours. (An illustrative sketch of the shared-handler weakness follows the memory
allocation table below.)

Indicative DCS processor volatile memory (RAM) allocation table:
• Real-time OS: BIOS, System
• Input/Output Scan (real-time data): AI, AO, DI, DO, Special
• Logic Execution (time classes): Fast, Medium, Slow
• External Communications: Net-A, Net-B, Backup link
(REBOOT! – on an abrupt processor reboot, the entire volatile memory above is lost.)
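To make the single point of failure concrete, here is a minimal illustrative Python sketch (not the vendor's firmware; class, queue and network names are hypothetical). It contrasts a communication handler shared by both redundant networks, whose unbounded buffer overflows during a storm on one network, with independent per-network handlers that bound their queues and simply drop excess broadcast traffic, leaving logic execution untouched.

```python
# Hypothetical sketch (not vendor firmware): why a communication handler shared
# by both redundant networks is a single point of failure during a broadcast
# storm, and how bounded per-network queues contain the overload.
from collections import deque

QUEUE_LIMIT = 1000          # assumed per-network buffer budget (frames)


class SharedHandler:
    """Single handler, unbounded buffer, serving Net-A and Net-B together."""
    def __init__(self):
        self.buffer = []                      # grows without limit

    def receive(self, network, frame):
        self.buffer.append((network, frame))  # a storm on ONE network fills it
        if len(self.buffer) > 10 * QUEUE_LIMIT:
            # models the working-memory overflow that rebooted the processor
            raise MemoryError("communication handler crash -> processor reboot")


class IsolatedHandler:
    """Independent, bounded queue per network; excess storm traffic is dropped."""
    def __init__(self):
        self.queues = {"Net-A": deque(maxlen=QUEUE_LIMIT),
                       "Net-B": deque(maxlen=QUEUE_LIMIT)}

    def receive(self, network, frame):
        self.queues[network].append(frame)    # oldest frames silently dropped


def broadcast_storm(handler, network="Net-B", frames=50_000):
    for i in range(frames):
        handler.receive(network, f"broadcast-{i}")


broadcast_storm(IsolatedHandler())            # logic execution would survive
try:
    broadcast_storm(SharedHandler())
except MemoryError as err:
    print(err)                                # both networks lost at once
```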

DESIGN CONSIDERATIONS - PROCESSOR


Network failure or overload
• Loss of ALL network communications or a complete communication overload (in any or both of
  the redundant networks) should NOT lead to complete loss of control in ANY functional group –
  in line with the Functional Grouping Concept of DCS.

Backup copy of logics
• DCS processors must always keep a backup copy of the logics in non-volatile flash memory, in
  order to avoid complete loss of logics and the huge downtime of 3-4 hours for logic download
  and restoration (see the sketch after this list).

Processor memory allocation
• DCS processor functions of RTOS, I/O scans, logic execution and external communications have
  to be protected by dedicated memory allocation.
• The communication handler utilities inside the processor firmware for each redundant network
  and for the hot backup link should also be completely independent.

DCS safety compliance
• Internationally accepted safety certifications and standards (TÜV, SIL3 – IEC 61508) must be
  mandatory for all DCS contract technical specifications.
• The DCS vendor is to certify total network safety compliance for its processors.

Vendor solution tie-up
• All DCS vendors must be contractually mandated to provide prompt support and permanent
  solutions (within one month) in the unlikely event of complete DCS control failure with
  potential catastrophic consequences, throughout the plant lifespan.
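As one way of reading the "backup copy of logics" consideration, the sketch below shows the general idea under simplifying assumptions: a local file stands in for on-board flash, and the logic image is stored together with a checksum so a rebooted controller can verify and restore it locally instead of waiting for a multi-hour full download. Function, file and functional-group names are illustrative, not a vendor API.

```python
# Hypothetical sketch of the "backup copy of logics" idea: every accepted logic
# download is also written to non-volatile storage with a checksum, so a
# rebooted controller restores locally instead of needing a full re-download.
import hashlib
import json
from pathlib import Path

FLASH_IMAGE = Path("logic_backup.json")    # stands in for on-board flash


def save_logic_to_flash(logic_blocks):
    """Persist an accepted logic download together with its checksum."""
    payload = json.dumps(logic_blocks, sort_keys=True)
    record = {"sha256": hashlib.sha256(payload.encode()).hexdigest(),
              "logic": logic_blocks}
    FLASH_IMAGE.write_text(json.dumps(record))


def restore_logic_after_reboot():
    """Return the saved logic if the flash image exists and is uncorrupted."""
    if not FLASH_IMAGE.exists():
        return None                        # no local copy: fall back to full download
    record = json.loads(FLASH_IMAGE.read_text())
    payload = json.dumps(record["logic"], sort_keys=True)
    if hashlib.sha256(payload.encode()).hexdigest() != record["sha256"]:
        return None                        # corrupted copy: fall back to full download
    return record["logic"]


save_logic_to_flash({"FG_ID_FAN": ["interlock", "trip"], "FG_BFP": ["trip"]})
print(restore_logic_after_reboot())        # available immediately after a reboot
```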

DESIGN CONSIDERATIONS - NETWORK

Broadcast storms
• All sources of broadcast storms causing network overload must be identified and addressed
  through appropriate checks and balances (in architecture and configurations).

Parallel double connections
• Parallel double connections (port redundancy) between any pair of adjacent switches must be
  prevented by design (see the audit sketch after this list).

Inadvertent ring or loop
• Multiple (> 2) switches must not form a closed physical path (ring or loop). Special
  attention is needed at the highest level of the network hierarchy, where multiple domains or
  units of the same DCS connect.

Firewall safety
• Hardware firewalls must be installed in the DCS network at appropriate interfaces and
  configured to prevent denial-of-service (DoS) attacks from outside the plant network.
  Typical firewall interfaces: for offsite PLCs, PADO, intranet, ERP etc.
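The parallel-connection and loop checks above lend themselves to a simple pre-commissioning audit of the planned switch interconnections. The sketch below is a hypothetical illustration (switch names are invented): it flags any duplicate link between the same pair of switches and any closed path through more than two switches.

```python
# Hypothetical topology audit sketch: given the planned switch-to-switch links,
# flag parallel double connections and any closed loop of more than two
# switches before the links are physically patched.
from collections import defaultdict


def audit_switch_links(links):
    """links: iterable of (switch_a, switch_b) uplink pairs."""
    seen, adjacency, parallel = set(), defaultdict(set), []
    for a, b in links:
        key = tuple(sorted((a, b)))
        if key in seen:
            parallel.append(key)           # port redundancy between same pair
        seen.add(key)
        adjacency[a].add(b)
        adjacency[b].add(a)

    # loop detection by depth-first search on the de-duplicated graph
    visited, loops = set(), False

    def dfs(node, parent):
        nonlocal loops
        visited.add(node)
        for nxt in adjacency[node]:
            if nxt == parent:
                continue
            if nxt in visited:
                loops = True               # closed physical path found
            else:
                dfs(nxt, node)

    for node in list(adjacency):
        if node not in visited:
            dfs(node, None)
    return parallel, loops


links = [("SW-U1-01", "SW-CMN-01"), ("SW-U2-01", "SW-CMN-01"),
         ("SW-CMN-01", "SW-CMN-02"), ("SW-CMN-02", "SW-U1-01")]  # forms a loop
print(audit_switch_links(links))           # -> ([], True)
```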

DESIGN CONSIDERATIONS - BACKUP


Design document
• A detailed backup design document (for all systems independent of DCS) that ensures basic
  minimum plant safety needs to be issued for every project.
• The list may include hardwired systems for automatic action & start/stop provisions and
  indications for manual intervention during DCS failure.

Boiler backup systems
• Failure of 2 out of 3 MFT processors must initiate hardwired MFT and trip all firing
  equipment (see the voting sketch after this list).
• DC/emergency scanner air system auto-start; closing of spray main block valves.
• Direct indication of furnace pressure and excess oxygen, independent of DCS.

TG backup systems
• Hardwired MFT, or control failure of both turbine protection functional groups (monitored
  through watchdog relays), must directly energize the turbine trip solenoids.
• Direct indication of turbine speed, vacuum, lub oil pressure & seal oil-H2 DP.
• Direct pressure-switch auto-start, manual start buttons and status for emergency drives.

Electrical backup systems
• The enabling signal from DCS for the UT-ST fast changeover electrical circuit must be kept
  available through a set-reset latch relay, to prevent loss of unit supply.
• Manual closing provision for the DG incomer breaker & critical auxiliary drives on the
  electrical module at LT switchgear.

Emergency trip systems
• The backup protections for boiler and turbine must use redundant power sources, both
  independent of the normal power sources to the protection panel. Failure of both backup
  supplies must initiate a machine trip through DCS. Desk EPBs must trip the machine through
  the backup path.

(Detailed list in technical paper)
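For the 2-out-of-3 MFT processor criterion in the boiler backup bullet, the voting itself is simple, as the hypothetical sketch below shows; in practice it would be realised in hardwired relay logic outside the DCS rather than in software.

```python
# Minimal sketch of 2-out-of-3 voting: a hardwired MFT is initiated only when
# at least two of the three MFT processor watchdog signals indicate failure.
def two_out_of_three_mft_trip(processor_failed):
    """processor_failed: three watchdog-derived 'failed' flags, one per MFT processor."""
    return sum(bool(flag) for flag in processor_failed) >= 2


print(two_out_of_three_mft_trip((True, False, False)))   # False – single failure tolerated
print(two_out_of_three_mft_trip((True, True, False)))    # True – initiate hardwired MFT
```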

PROJECT CONSIDERATIONS – CRASH TEST

Need to test
• To ensure safety in control systems: "First introspect, then inspect, thus protect."
• Complex network architecture makes DCS-based protection systems vulnerable.

Avalanche crash tests
• To be done from inside the DCS network, directly upon its processors.
• During Factory Acceptance Test (FAT) at the vendor's shop floor, by the Engineering Group.
• During Site Acceptance Test (SAT) at the project site, by the Commissioning Group.

Sample tests for FAT/SAT
• Creation of port redundancy or a loop of multiple Ethernet switches, followed by restart /
  uplink re-initialization of any one switch, in a fully networked system.
• Creation of a denial-of-service (DoS) attack from inside the DCS network.
• The tests may be carried out on either network first, then on both together (a pass/fail
  monitoring sketch follows this list).
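One possible pass/fail criterion for these crash tests is that every functional-group controller keeps answering a periodic health poll while the storm is in progress, i.e. logic execution and I/O scan never stop even if MMI traffic degrades. The sketch below assumes a simple UDP heartbeat exchange and invented test-bench controller addresses; the real diagnostic interface would be whatever the DCS vendor provides.

```python
# Hypothetical FAT/SAT monitor: during the avalanche test, every controller
# must keep answering the poll, or the test is marked FAIL.
import socket
import time

# Assumed test-bench addresses; "HEARTBEAT?"/"ALIVE" is an invented stand-in
# for the actual DCS diagnostic poll.
CONTROLLERS = {"FG_BOILER": ("192.0.2.10", 5007),
               "FG_TURBINE": ("192.0.2.11", 5007)}


def poll_heartbeats(timeout=0.5):
    """Return {controller: True/False} depending on whether it answered."""
    alive = {}
    for name, addr in CONTROLLERS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(b"HEARTBEAT?", addr)
                data, _ = sock.recvfrom(64)
                alive[name] = (data == b"ALIVE")
            except OSError:                # timeout or unreachable
                alive[name] = False
    return alive


# While the broadcast storm / DoS test runs elsewhere on the bench, every poll
# must still come back; a single missed controller means the test has failed.
for _ in range(3):
    status = poll_heartbeats()
    print("PASS" if all(status.values()) else "FAIL", status)
    time.sleep(1)
```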

PLANT O&M CONSIDERATIONS

Online DCS maintenance
• A comprehensive DCS online maintenance procedure is to be mandatorily submitted by the DCS
  vendor & approved by Engineering, and strictly followed at site.
• Typical procedures to be included: logic backups/changes, network settings, single-point
  failure & restoration, complete MMI restoration, complete DCS restoration.
  (Detailed list in technical paper)

Backup healthiness
• Healthiness of all backup systems (independent of DCS) is also to be ensured by the C&I
  (check) and Operation (witness) departments periodically during protection checking,
  typically after unit overhauls. A joint protocol is to be recorded.

Emergency operation
• Site-specific emergency handling operation procedures are to be prepared to handle a
  complete DCS failure situation, with assigned responsibilities.

CONCLUSION POINTS

• During the complete DCS failures, all major equipment in the power plant faced a high risk of catastrophic damage due to loss of all protections and controls.

• DCS processors must retain the primary control tasks of logic execution and I/O scan even under complete network failure or overload.

• DCS design contract specifications, vendor support tie-up, network crash tests, and site O&M procedures need to be strengthened further.

“DCS Processors must be pre-designed & crash-tested against all network failures & overloads to ensure human & plant safety”

THANK YOU
