Safety Engineering: The Task of Safety Engineers
Safety engineering is an applied science closely related to systems engineering and its subset,
System Safety Engineering. Safety engineering assures that a life-critical system behaves as
needed even when other components fail. In practical terms, "safety engineering"
refers to any act of accident prevention by a person qualified in the field. Safety
engineering is often reactive rather than proactive, responding to adverse events,
also described as "incidents," as reflected in accident statistics. This arises largely
because data on "near misses" are complex and difficult to collect and analyze.
Increasingly, the safety review is being recognized as an important risk
management tool. Failure to identify risks to safety, and the corresponding inability to address or "control"
these risks, can result in massive costs, both human and economic. The multidisciplinary nature of
safety engineering means that a very broad array of professionals is actively involved in accident
prevention or safety engineering.
The process
Ideally, safety engineers take an early design of a system, analyze it to find what faults
can occur, and then propose safety requirements in design specifications up front and
changes to existing systems to make the system safer. In an early design stage, often a
fail-safe system can be made acceptably safe with a few sensors and some software to
read them. Probabilistic fault-tolerant systems can often be made by using more, but
smaller and less-expensive pieces of equipment.
Far too often, rather than actually influencing the design, safety engineers are assigned
to prove that an existing, completed design is safe. If a safety engineer then discovers
significant safety problems late in the design process, correcting them can be very
expensive. This type of error has the potential to waste large sums of money.
The exception to this conventional approach is the way some large government
agencies approach safety engineering from a more proactive and proven process
perspective. This is known as System Safety. The System Safety philosophy, supported
by the System Safety Society and many other organizations, is to be applied to complex
and critical systems, such as commercial airliners, military aircraft, munitions and
complex weapon systems, spacecraft and space systems, rail and transportation
systems, air traffic control systems, and more complex and safety-critical industrial
systems. The proven System Safety methods and techniques are to prevent, eliminate
and control hazards and risks through designed influences by a collaboration of key
engineering disciplines and product teams. Software safety is a fast-growing field,
since the functionality of modern systems is increasingly placed under software control. The
whole concept of system safety and software safety, as a subset of systems
engineering, is to influence safety-critical systems designs by conducting several types
of hazard analyses to identify risks and to specify design safety features and
procedures to strategically mitigate risk to acceptable levels before the system is
certified.
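A common artifact of the hazard analyses described above is a hazard risk index that combines severity and likelihood to decide whether a hazard must be designed out before certification. The sketch below is a toy lookup in the spirit of such matrices; the category names, codes, and thresholds are illustrative assumptions, not the values of any particular standard.

```python
# Toy hazard-risk-index lookup. Severity categories, likelihood levels, and
# the HIGH/SERIOUS groupings are illustrative assumptions for this sketch.
SEVERITY = {"catastrophic": 1, "critical": 2, "marginal": 3, "negligible": 4}
LIKELIHOOD = {"frequent": "A", "probable": "B", "occasional": "C",
              "remote": "D", "improbable": "E"}

HIGH = {"1A", "1B", "1C", "2A", "2B", "3A"}       # unacceptable: design change
SERIOUS = {"1D", "2C", "3B", "3C"}                # mitigation required

def risk_index(severity: str, likelihood: str) -> tuple[str, str]:
    """Return a hazard risk code and the disposition it implies."""
    code = f"{SEVERITY[severity]}{LIKELIHOOD[likelihood]}"
    if code in HIGH:
        return code, "high - design change required"
    if code in SERIOUS:
        return code, "serious - mitigation required"
    return code, "acceptable with review"
```

In practice the disposition drives the process: a "high" index blocks certification until a design safety feature or procedure reduces the severity or likelihood.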
Additionally, failure mitigation can go beyond design recommendations, particularly in
the area of maintenance. There is an entire realm of safety and reliability engineering
known as "Reliability Centered Maintenance" (RCM), which is a discipline that is a direct
result of analyzing potential failures within a system and determining maintenance
actions that can mitigate the risk of failure. This methodology is used extensively on
aircraft and involves understanding the failure modes of the serviceable replaceable
assemblies in addition to the means to detect or predict an impending failure. Every
automobile owner is familiar with this concept when they take in their car to have the oil
changed or brakes checked. Even filling up one's car with gas is a simple example of a
failure mode (failure due to fuel starvation), a means of detection (fuel gauge), and a
maintenance action (filling the tank).
For large scale complex systems, hundreds if not thousands of maintenance actions
can result from the failure analysis. These maintenance actions are based on conditions
(for example, a gauge reading or a leaky valve), on hard time limits (for example, a
component known to fail after 100 hours of operation with 95 percent certainty), or
on inspection to determine the needed action (such as for metal fatigue). The Reliability
Centered Maintenance concept then analyzes each individual maintenance item for its
risk contribution to safety, mission, operational readiness, or cost to repair if a failure
does occur. Then the sum total of all the maintenance actions are bundled into
maintenance intervals so that maintenance is not occurring around the clock, but rather,
at regular intervals. This bundling process introduces further complexity, as it might
stretch some maintenance cycles, thereby increasing risk, but reduce others, thereby
potentially reducing risk, with the end result being a comprehensive maintenance
schedule, purpose-built to reduce operational risk and ensure acceptable levels of
operational readiness and availability.
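The bundling step can be sketched as a simple grouping problem: pull each action forward to the last scheduled check that does not exceed its individually derived due time, so nothing is done late. The action names, due times, and 400-hour check interval below are hypothetical.

```python
# Hypothetical maintenance actions with individually derived due intervals (hours).
actions = {
    "oil_analysis": 450,
    "brake_inspection": 520,
    "valve_replacement": 980,
    "fatigue_inspection": 1100,
}

def bundle(actions: dict[str, int], interval: int) -> dict[int, list[str]]:
    """Group actions into regular checks every `interval` hours.

    Each action is pulled forward to the last check at or before its due
    time, so maintenance is never performed late. Pulling an action forward
    shortens its cycle (reducing its risk at the cost of more frequent work).
    """
    schedule: dict[int, list[str]] = {}
    for name, due in actions.items():
        slot = (due // interval) * interval  # last check not exceeding `due`
        schedule.setdefault(slot, []).append(name)
    return schedule

# With 400-hour checks, the 520-hour brake inspection is pulled forward to
# the 400-hour check rather than stretched to 800 hours.
checks = bundle(actions, 400)
```

This mirrors the trade-off described above: bundling shortens some cycles and would stretch others unless, as here, the rule is conservative and only pulls work forward.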
Analysis techniques
The two most common fault modeling techniques are called "failure modes and effects
analysis" and "fault tree analysis." These techniques are just ways of finding problems
and of making plans to cope with failures, as in Probabilistic Risk Assessment (PRA or
PSA). One of the earliest complete studies using PRA techniques on a commercial
nuclear plant was the Reactor Safety Study (RSS), edited by Prof. Norman
Rasmussen.[3]
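The quantitative core of fault tree analysis can be sketched in a few lines: assuming independent basic events, an OR gate's output probability is one minus the product of the complements, and an AND gate's is the plain product. The event names and probabilities below are hypothetical.

```python
from math import prod

def or_gate(probs: list[float]) -> float:
    # The undesired event occurs if ANY input event occurs (independence assumed).
    return 1 - prod(1 - p for p in probs)

def and_gate(probs: list[float]) -> float:
    # The undesired event occurs only if ALL input events occur (independence assumed).
    return prod(probs)

# Hypothetical tree for "coolant pump fails to deliver":
# (power loss AND backup generator fails) OR (discharge valve stuck closed)
p_power_loss = 1e-3
p_backup_fail = 1e-2
p_valve_stuck = 1e-4

p_top = or_gate([and_gate([p_power_loss, p_backup_fail]), p_valve_stuck])
```

Real PRA studies add common-cause terms and uncertainty distributions, but the gate arithmetic is the same.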
Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one
life per 10⁹ hours of continuous operation is lost to failure. Most Western nuclear
reactors, medical equipment, and commercial aircraft are certified to this level. The cost
versus loss of lives has been considered appropriate at this level (by FAA for aircraft
under Federal Aviation Regulations).
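To see what the 10⁹-hour level means over a system's life, one can apply the standard constant-hazard (exponential) model; the 100,000-hour service life below is a hypothetical figure for illustration.

```python
from math import exp

def prob_failure(rate_per_hour: float, hours: float) -> float:
    # Constant-hazard model: P(failure by time t) = 1 - exp(-lambda * t)
    return 1 - exp(-rate_per_hour * hours)

# Certification target: fewer than one catastrophic failure per 1e9 hours.
rate = 1e-9

# Probability of a catastrophic failure over a hypothetical 100,000-hour
# airframe service life: roughly 1 in 10,000.
p_life = prob_failure(rate, 100_000)
```

For rates this small the result is essentially rate × hours, which is why the requirement is usually quoted directly as a per-hour rate.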
Preventing failure
Probabilistic fault tolerance: Adding redundancy to
equipment and systems
Once a failure mode is identified, it can usually be prevented entirely by adding extra
equipment to the system. For example, nuclear reactors contain dangerous radiation,
and nuclear reactions can generate so much heat that no material could contain them.
Therefore reactors have emergency core cooling systems to keep the temperature
down, shielding to contain the radiation, and engineered barriers (usually several,
nested, surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: Multiple organs,
multiple limbs, and so on.
For any given failure, a fail-over or redundancy can almost always be designed and
incorporated into a system.
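The payoff of redundancy is easy to quantify: if unit failures are independent, a system of n redundant units fails only when all n fail, so the system failure probability is the single-unit probability raised to the nth power. The per-demand probability used below is a hypothetical value.

```python
def redundant_failure_prob(p_unit: float, n: int) -> float:
    # With n independent redundant units, the system fails only if all n fail.
    return p_unit ** n

# Hypothetical per-demand failure probability of one unit.
p_single = 1e-3

p_one = redundant_failure_prob(p_single, 1)    # 1e-3
p_dual = redundant_failure_prob(p_single, 2)   # 1e-6
p_triple = redundant_failure_prob(p_single, 3) # 1e-9
```

The independence assumption is the weak point in practice: common-cause failures (shared power, shared software, shared environment) can make real redundant systems far less reliable than this arithmetic suggests.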
Containing failure
It is also common practice to plan for the failure of safety systems through containment
and isolation methods. The use of isolating valves, also known as a block-and-bleed
manifold, is very common in isolating pumps, tanks, and control valves that may fail or
need routine maintenance. In addition, nearly all tanks containing oil or other hazardous
chemicals are required to have containment barriers set up around them to contain 100
percent of the volume of the tank in the event of a catastrophic tank failure. Similarly,
long pipelines have remote-closing valves periodically installed in the line so that in the
event of failure, the entire pipeline is not lost. The goal of all such containment systems
is to provide means of limiting the damage done by a failure to a small localized area.
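The 100-percent containment rule for tanks reduces to a simple sizing check: the dike around the tank must hold at least the full tank volume. The tank and dike dimensions below are hypothetical.

```python
def containment_ok(tank_volume_m3: float, dike_area_m2: float,
                   dike_height_m: float, margin: float = 1.0) -> bool:
    # Secondary containment must hold at least 100% (margin = 1.0) of the
    # tank volume in the event of a catastrophic tank failure.
    capacity_m3 = dike_area_m2 * dike_height_m
    return capacity_m3 >= margin * tank_volume_m3

# A hypothetical 500 m^3 tank inside a 20 m x 30 m (600 m^2) dike needs
# walls at least 500 / 600 ~ 0.84 m high.
adequate = containment_ok(500, 600, 0.84)
```

Many jurisdictions require a margin above 100 percent (for rainfall or firefighting water); the `margin` parameter is included here to show where such a requirement would enter.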