
ALARP FOR

ENGINEERS:
A TECHNICAL
SAFETY GUIDE

Safety and Reliability Group

Improving the world through engineering


THE IMECHE
SAFETY AND
RELIABILITY
GROUP
The IMechE Safety and Reliability Group promotes the development of safety and reliability
requirements for products such as equipment, systems, or services.

What we do:
• Provide best practice guides and information to engineers associated with designing,
manufacturing, operating or maintaining products.
• Increase the public understanding of risk, while continuing to work on the assessment
of Hazardous events, dissemination of lessons learned and the promotion of risk
reduction strategies.
• Represent the Institution on the Hazards Forum which provides an interdisciplinary focus
for the study of disasters.
• Organise events, such as the annual ALARP Seminar.
• Support Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk
and Reliability.

We strive to produce up-to-date, effective communications and therefore welcome any comments on this document, or ideas for improving future revisions. Please send your contributions to: [email protected].

The SRG is a voluntary group, which meets quarterly. We welcome new members who have experience in, and a passion for, safety. Please send a CV to [email protected].

IMECHE SAFETY AND
RELIABILITY GROUP

Acknowledgements
This document has been developed by Working Group 2 of the IMechE Safety and Reliability
Group, with support from the other SRG working groups.

We would like to thank the many external reviewers from legal firms, industry, consultancies,
regulators, engineering institutions and societies, whose contributions have been invaluable.



CONTENTS

Glossary of Terms 06
Acronyms 08
1. Introduction and Scope 09
1.1 Why this Document is Needed 10
1.1.1 Process Safety 10
1.1.2 Systems Approach to Safety 10
1.1.3 Reasonable Foreseeability 10
1.1.4 Probabilistic Assessment 11
1.1.5 Evidential Admissibility 11
1.1.6 Risk Management Contiguity 11
2. UK Safety Legislation 12
2.1 Regulatory Control and Prosecution 14
2.2 Reasonable Foreseeability and the ‘Incidence of everyday life’ 16
2.3 Reasonable Practicability and Gross Disproportion 18
2.4 Risk Transfer and Risk Trade-offs 20
2.5 Target Safety Levels (TSLs) 21
2.6 The Legal Duty Holder 22
2.7 Reverse Burden of Proof 22
2.8 Retrospective Application of ALARP Principles 22
3. Overview of the Safety Risk Management Process 23
3.1 The Hazardous System 25
3.2 Proportionality 25
3.3 The Proportionality Matrix 27
3.4 Engineering Management Systems 27
4. Identification 30
4.1 The Hierarchy of Controls 30
4.2 Basic HAZID 31
4.3 Comprehensive HAZID 35
4.3.1 Hierarchical Guideword Expansion 35
4.3.2 Contextual Guideword Expansion 36
4.3.3 The Degree of Guideword Expansion 38
4.3.4 Guidewords for Identifying RRMs and Actions for Further Study 40
4.4 HAZID Conclusions 42
5. Evaluating Reasonably Foreseeable Consequences 43
5.1 Consequence Reduction Measures 44
6. Analysis 45
6.1 RRM Effectiveness Review 47
6.2 Human Error, Human Factors, and Ergonomics 49
6.3 Task Risk Analysis (TRA) and Procedure HAZOP 51
6.3.1 Task Risk Analysis 51
6.3.2 Procedure HAZOP 52
6.4 Systems Theoretic Process Analysis (STPA) 53
6.5 Failure Modes and Effects Analysis (FMEA/FMECA) 55



6.6 Fault Tree Analysis (FTA) 55
6.7 Event Tree Analysis (ETA) 55
6.8 Hazard and Operability Study (HAZOP) 55
6.9 Functional Safety Analysis (FSA) 55
6.10 Bow Ties 57
6.11 Layers of Protection Analysis (LOPA) 59
6.12 Risk Matrices and Quantitative Risk Assessment (QRA) 60
6.13 Tailor-made Matrices 60
6.14 Specialist Materials Studies 61
6.15 Conclusions 61
7. Lifecycle Criteria 62
7.1 Performance Standard Verification 63
8. Documenting and Communicating the Demonstration of ALARP 64
8.1 Building a WRA 64
8.2 Goal Structuring Notation (GSN) 70
8.3 The Safety Report/Case 71
9. Upkeep of ALARP Demonstration and Management of Change 73
9.1 Building a WRA 73
9.2 Creeping Change Monitoring and HAZID (CCHAZID) 74
9.3 Contingency Plans and Matrices 75
10. Common Pitfalls 76
11. References 78

Appendix A: Risk, Its Measurement and Prediction 79


A1 Legal Definition of Risk & Risk Assessment 79
A2 Risk Measurement (Robust Statistics) 80
A3 Risk Calculation (Bayesian Methods) 81
A4 Risk Prediction (Belief) and Randomness 82
Control and Randomness 82
Knowledge 83
A5 Cognitive Bias 87
A6 Risk Matrices 88
A7 Cost Benefit Analysis (CBA) 92
A8 Conclusions 94
Appendix B: Technical Errors in Risk Quantification Models (QRA) 95
B1 Error #1 Non-Representative Data 95
B2 Error #2 The Causal Fallacy 97
B3 Error #3 Omission 98
B4 Error #4 Null Hypothesis 98
B5 Error #5 Ludic Fallacy - Independence and Randomness 99
B6 Error #6 Illegitimate Transposition of the Conditional (Prosecutor’s Fallacy) 99
B7 Error #7 Unfalsifiable Chance Correlations 100
B8 Conclusions 101
Appendix C: Compliance Checklist 102



GLOSSARY
OF TERMS

Barrier: Another term for a Risk Reduction Measure (RRM), generally used in the context of Bow Ties.

Case Law: Interpretations of Statutory Law, made by the senior courts.

Cause: Anything whose absence may prevent a loss, e.g. errors in design, manufacture, control, human factors, ergonomics, management systems, equipment unreliability, RRM flaws, safety culture, conflicts of interest, cognitive bias. (NB. Terms such as root, direct and indirect Causes may be ambiguous, so they are avoided.)

Failure Mode: A type of failure, e.g. corrosion, fatigue, rupture, unrevealed control failure.

Good Practice: Standards, practices, methods, guidance, and procedures conforming to the law and the degree of skill and care, diligence, prudence and foresight which would reasonably and ordinarily be expected from a skilled and experienced person or body engaged in a similar type of undertaking under the same or similar circumstances.
NB. The HSE have also defined Good Practice as the generic term for those standards for controlling Risk which have been judged and recognised by HSE as satisfying the law when applied to a particular relevant case in an appropriate manner.

Gross Disproportion: An RRM that is not considered to be Reasonably Practicable, because the sacrifice (in time, trouble, or money) is disproportionate to the Risk being mitigated. See Section 2.3 for a full definition.

Hazard: A useful means of expressing a condition or material that presents the ‘possibility of danger’ in a specified situation, e.g. cooking oil fire in a busy kitchen, pressurised release of toxic material from a relief valve upwind of an office building, aircraft stalls whilst turning finals at one thousand feet. A Hazard definition is most useful when further contextualisation would not improve the analysis.
NB. Depending on the analytical objectives, the Hazard could be any of:
i. poor control panel layout
ii. operator presses wrong button
iii. reactor goes unstable
iv. reactor melts down
v. radiation reaches domestic housing
vi. people exposed to radiation or
vii. people die from radiation sickness.
Any of these could be described as a Hazard or a Cause, depending on the analytical context, so Hazards may be described as mutually exclusive Risk scenarios that generally represent several Causes. The differences between Hazards, Risks and Causes are not always clear.

Legal Duty Holder: The organisation or individual upon which the legal obligation lies.

Lifecycle Criteria: Safety Critical data that needs to be established at the design stage and maintained or updated throughout the lifecycle of the product.



Proportionality: The test of whether sufficient effort has gone into identifying, preventing and/or mitigating Risks, i.e. whether all that is Reasonably Practicable has been done. See Sections 3.2 and 3.3.

Qualitative Argument: Non-probabilistic evidence, based on the logical, technical and/or physical characteristics and responses of a System.

Quantitative Argument: An argument based on numerical probabilities and consequences.

Reasonable Foreseeability: A legal concept to define which Hazards fall within legislation. See Section 2.2.

Reasonably Practicable: See Section 2.3.

Regulator: The UK HSE, ONR, ORR, BSR, CAA, local authorities, and the police.

Risk: A combination of severity and likelihood, expressed in case law as ‘the possibility of danger’.

Risk Reduction Measure (RRM): Any measure that removes a Hazard, prevents harm to people, reduces its likelihood or mitigates the consequences.

Robust Statistics: Data that complies with the following five criteria:
i. Representative of the relevant population.
ii. Randomly sampled.
iii. Statistically Significant.
iv. Free of Unfalsifiable Chance Correlations.
v. Ergodic.

Safety Critical: Any component, function, activity, process, or procedure whose omission, failure or incorrect operation could increase Risks associated with the System.

Scenario: A specific set of conditions that may be physical, sociological and/or environmental, which could create a Hazardous situation.

Statutory Law: Laws made by the UK Parliament.

System: A set or group of interacting, interrelated or interdependent elements or parts that achieve a common safety objective, including hardware, software, human interactions, procedures, management systems, competency/training, emergency response and the operating environment (natural or imposed). It could differ for a permanent situation, scenario or activity.

Well-Reasoned Argument (WRA): A qualitative, rational explanation of how all Reasonably Foreseeable Hazards have been systematically identified and all Reasonably Practicable RRMs have been implemented (see Sections 8.1 and 8.2). It may be supported by quantitative arguments provided they are based on Robust Statistics, or modifications thereof, comply with the Royal Statistical Society Practitioner Guides Nos. 1 to 4, Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses, and are free of any of the errors in Appendices A and B of this document.



ACRONYMS

ACoP Approved Code of Practice


ALARP As Low as Reasonably Practicable
BSL Basic Safety Level
BSR Building Safety Regulator
CAA Civil Aviation Authority
CBA Cost Benefit Analysis
CFD Computational Fluid Dynamics
COMAH Control of Major Accident Hazards Regulations
COSHH Control of Substances Hazardous to Health
FMEA Failure Modes and Effects Analysis
FSA Functional Safety Analysis
FTA Fault Tree Analysis
GSN Goal Structuring Notation
HASAWA Health and Safety at Work Act
HAZID Hazard Identification Methodology
HAZOP Hazard and Operability Study
HE Human Error
HEP Human Error Probability
HF Human Factors
HFA Human Factors Analysis
HRA Human Reliability Analysis
HSE The UK Health & Safety Executive
LOPA Layers of Protection Analysis
MHSWR Management of Health and Safety at Work Regulations
ONR Office for Nuclear Regulation
ORR Office of Rail and Road
PDCA Plan Do Check Act
PDF Probability Distribution Function
PPE Personal Protective Equipment
PRA Probabilistic Risk Analysis
QRA Quantified Risk Analysis
RAM Risk Assessment Matrix
RRM Risk Reduction Measure
RSS Royal Statistical Society
SCR Safety Case Regulations
SFAIRP So Far as Is Reasonably Practicable
SIF Safety Instrumented Function
SIL Safety Integrity Level
STPA Systems Theoretic Process Analysis
STAMP Systems Theoretic Accident Model and Processes
TRA Task Risk Analysis
TSL Target Safety Level
UCC Unfalsifiable Chance Correlation
WRA Well-Reasoned Argument



1. INTRODUCTION
AND SCOPE

This document explains the UK legal obligation to reduce Risks So Far As Is Reasonably Practicable (SFAIRP), which is more commonly known as As Low As Reasonably Practicable (ALARP), in an engineering context. Nevertheless, the principles and methodologies discussed here could be regarded as sound practice, regardless of a country’s legislative regime.

The document applies to engineering decisions for all Risks, minor or major, throughout the lifecycle of any product, activity, process, or installation, including its design, construction, commissioning, operation, modification, integrity management, decommissioning and disposal.

NB. There are many examples of UK legislation which are prescriptive as to the safeguards which must be taken, regardless of the SFAIRP approach, for example the Pressure Systems Safety Regulations, the Water Supply (Water Fittings) Regulations and COSHH, to name only a few.

However, whilst recognising that certain industries have dedicated legislation and have developed specific guidance for their types of Hazards and Risk Reduction Measures (RRMs), this guidance is intended to establish the generic principles and encourage cross-pollination of best practice across all industries and products.

There have been several developments since much of the existing guidance was written, which have rendered some aspects misleading. These include changes in the law, legal precedent, evidential admissibility, lessons from accidents, new analytical techniques, and the discovery of fundamental errors in some existing methodologies. The need for a single source of updated guidance to help engineers comply with the law, improve safety, and avoid unnecessary costs and studies is therefore clear. The specific objectives of this document are to:
• Explain the risk management processes involved.
• Explain lifecycle considerations and lessons from process safety.
• Explain the Systems approach to risk assessment.
• Highlight some of the common misconceptions about legal obligations.
• Communicate legal precedent with respect to Reasonable Foreseeability.
• Explain the arguments for and against qualitative versus quantitative analytical techniques.
• Explain the criteria for probabilistic evidence to be admissible in a UK law court.
• Explain the common probabilistic error types.
• Describe acceptable methods of demonstrating gross disproportion.
• Describe how to build a Well-Reasoned Argument (WRA) to demonstrate legal compliance.
• Outline a risk management process which is logical, objective, contiguous and systematic.

The document has been reviewed by lawyers, regulators, and safety experts from various industries. It is primarily aimed at engineers, but may also be useful to managers, students, auditors, and accident investigators. Lawyers wanting to know more about how engineers can manage Risk may also find it informative.



1.1 Why this Document is Needed

The following text describes some of the relevant changes that may not have been incorporated into, or recognised by, existing guidance.

1.1.1. Process Safety

Process safety is a disciplined framework for managing the integrity of operating systems and processes that handle Hazardous substances. It relies on good design principles, engineering, operating and maintenance practices, and management of change procedures to ensure the prevention and mitigation of events throughout the lifecycle of the facility.

The legal requirement for companies to ensure that Risks are reduced SFAIRP under the HASAWA has existed in the UK since 1974. The recommendations of the Baker Report on the BP Texas City Refinery disaster in the US in 2005 (1) form the basis of much of what is generally considered today to be relevant Good Practice in Process Safety Management.

Although the concept of process safety has existed since the 1970s, it was greatly strengthened by the Baker Report. Whilst this was aimed at the process industries, its principles may be applied to most enterprises, and it is now becoming well established as Good Practice. It requires a set of criteria to be established and used to maintain the integrity of the plant throughout its lifecycle. Most of these criteria need to be developed at the design stage as part of the engineering process and then incorporated into such things as performance standards, operations procedures, maintenance and inspection routines, change control and decommissioning/disposal strategies. They are described here as the Lifecycle Criteria (Section 7).

1.1.2. Systems Approach to Safety

Since the Grenfell Tower disaster, there has been an increased emphasis on the Systems approach to demonstrate that all Reasonably Practicable Risk controls are in place. In this case a System is defined as a set or group of interacting, interrelated or interdependent elements or parts that achieve a common safety objective, including hardware, software, human interactions, procedures, training, emergency response and the operating environment (natural or imposed). System definition is therefore an early stage in the process.

1.1.3. Reasonable Foreseeability

Legal precedent has established Reasonable Foreseeability as an important criterion for determining whether the responsible party (the Legal Duty Holder) has met their obligations. However, what may be considered Reasonably Foreseeable is not straightforward and warrants further explanation.



1.1.4. Probabilistic Assessment

UK legislation does not require the quantification of Risks, and the methods of estimating or calculating them are prone to unacceptable errors and uncertainties. Errors of over nine orders of magnitude can result from simple mistakes (Appendix B6), yet there is no means of verifying or sense-checking these figures. Extreme caution should therefore be exercised if a risk assessment is to use any form of probabilistic argument.

Appendix A explains some of the fundamental statistical and mathematical requirements for probabilistic arguments and any form of cost benefit analysis. It also discusses the difficulty of satisfying these in an engineering context, plus some of the cognitive biases that affect any form of prediction, such as risk matrices. Appendix B references many of the errors that may occur in computerised risk models (QRA/PRA) and their data. Many of these cannot be overcome, and there are no reliable review methodologies to identify them.

The Offshore Safety Case Regulations of 1992 introduced the requirement to assess probability but subsequently removed it in 2005. Guidance such as Reducing Risks, Protecting People (2) was written during this period and has not been updated, leaving a legacy perception that risks must be quantified and should be Tolerable, which is incorrect (Section 2). Nevertheless, many HSE and industry guidance documents still promote the use of Risk quantification (QRA), risk matrices, risk profiling, risk targets and risk ranking, none of which are legal obligations. UK law requires that all Reasonably Practicable Risk controls be in place, no matter how large or small the Risk is. This requires a systematic process of hazard identification, prevention, and mitigation, which may need to comply with Good Practice, prescriptive legislation, standards, approved codes of practice and guidance, and be based on a WRA.

1.1.5. Evidential Admissibility

Following several miscarriages of justice due to errors in probabilistic evidence, much stricter controls have been applied to what is admissible in court (3). These limit probabilistic arguments to only the simplest of calculations applied to Robust Statistics. This has significant implications for predictive modelling, such as QRA.

Risk matrices are highly subjective and unlikely to constitute reliable evidence. They are commonly used to evaluate the level of assessment required, i.e. Proportionality. Section 3.2 provides an alternative to risk matrices, which is more objective and legally admissible.

1.1.6. Risk Management Contiguity

The strategy for managing any given Risk is largely a matter of professional judgement, but this can lead to disparate and incontiguous studies, which may not effectively communicate whether all Reasonably Practicable analytical efforts have been undertaken. This document therefore attempts to join the dots by guiding the reader through a more logical and sequential approach, which should improve safety and provide a more cost-effective means of legal compliance.



2. UK SAFETY
LEGISLATION

Key Messages

The UK legal regime for health and safety is based on the Health and Safety at Work Act (HASAWA), which places an obligation on the Legal Duty Holder to identify all Reasonably Foreseeable Hazards and ensure, So Far As Is Reasonably Practicable (SFAIRP), that neither employees nor the public are exposed to Risks to their health or safety. This is often expressed as an obligation to reduce Risks to As Low As Reasonably Practicable (ALARP).

These legal obligations give discretion to the Legal Duty Holders as to how they should best minimise Risks. They cannot be delegated or transferred, but this does not mean that subordinates or contractors are absolved of their responsibilities either (Section 2.6).

There are also prescriptive laws that apply to certain types of activity and industry. Furthermore, any standards, guidance, or Approved Codes of Practice may be regarded as Good Practice, which should be complied with unless a WRA has been provided to justify deviating from them.

The analysis must be Proportionate to the nature and complexity of the situation and the severity of
its effects if it is to identify all Reasonably Foreseeable Hazards.

The analysis must recognise the social and technical aspects of the System(s) involved, and this
may include materials, potential failure modes, activities, scenarios, escalations, human factors,
environmental aspects, and any other factor that might influence Risk (Section 6.0).

There is no legal requirement to quantify Risk, nor does a Risk figure demonstrate SFAIRP or ALARP.
‘Tolerable’ Risk is not a legal concept.

The legal obligations apply to the whole lifecycle of the System.

The argument for excluding a Risk Reduction Measure (RRM) on the grounds that it would be Grossly
Disproportionate would need to be convincing to a reasonable person (Section 2.3).
The inability of a Legal Duty Holder to pay for an RRM is not a justification.

The Reverse Burden of Proof puts the onus on the Legal Duty Holder to demonstrate that all
Reasonably Practicable RRMs have been implemented (Section 2.7). It is therefore prudent to record
all significant decisions with a WRA, regardless of whether a Safety Case is legally required.

Disclaimer: This Section is intended to provide engineers with a brief description of the key principles of UK goal-setting safety law, to assist with their day-to-day professional responsibilities. However, there may be exceptions, nuances, special requirements and prescriptive laws applying to specific industries, activities, or products that have not been covered. Some case law examples have been summarised to assist in this understanding, but these cannot constitute a rigorous explanation; only generalities that may require further research. Wherever uncertainty exists the reader is strongly advised to seek legal advice.



Whilst the HASAWA is a broad umbrella obligation, there may be some exemptions, such as the design of highways.

The HASAWA applies to:
• Products, installations, processes, and activities.
• The whole lifecycle, from the initial feasibility assessment, through design, construction, commissioning, operation, maintenance, modification, decommissioning and disposal.
• Hazard identification, prevention, mitigation (whether consequence or likelihood reduction) and emergency response.
• The System relating to each Hazard, e.g. any interacting, interrelated or interdependent elements or parts that achieve a common safety objective, including hardware, software, human interactions, procedures, management systems, competency/training and the operating environment (natural or imposed).

A safety analysis/risk assessment may need to address the problem at one or more levels, as follows:
• Hazards are regarded here as any features that could endanger people, such as toxic chemicals, flammable materials, suspended objects, and thunderstorms.
• Consequences relate to the nature and severity of health and safety effects and the number of people involved.
• Failure modes are types of loss of control, such as the failure of a vessel by rupture, corrosion, or fatigue.
• Causes are anything that, if they were absent, could prevent the loss, e.g. errors in design, manufacture, control, human factors, ergonomics, management systems, safety culture, conflicts of interest, cognitive bias. (NB. Terms such as root, direct and indirect Causes may be ambiguous, so they are avoided here.)
• Activities or Scenarios are specific activities or conditions, such as the removal of a safety item for maintenance.
• Escalations relate to the way Hazards develop, such as an electrical fire spreading throughout a building.
• RRMs relate to controls, mitigations, barriers, procedures and any other measures to prevent or reduce the risks of loss.

These categories are not universal, so they are not common to all accidents, which may develop in different ways. They should therefore only be used as a guide. The type and content of the analysis will depend upon the character and the Reasonably Foreseeable consequences of the System (Sections 3 and 6). It will need to determine whether the Hazard can be eliminated, contained, controlled, or mitigated and whether any RRMs have flaws or limitations that would limit their effectiveness (Section 6.1).

SFAIRP vs. ALARP

UK health and safety law requires the Legal Duty Holder to ensure, So Far As Is Reasonably Practicable (SFAIRP), that employees and anyone affected by what they do are not exposed to Risk to their safety, health or wellbeing. However, from 1992 to 2005, the Offshore Safety Case legislation required Risks to be reduced to As Low As Reasonably Practicable (ALARP), but this has now been removed because it led to an over-emphasis on Risk quantification. Whilst both terms are often argued to be the same, ALARP is an abstraction (because it refers to Risk) whilst SFAIRP is more material, referring to actual safety measures.

ALARP has become common parlance to describe SFAIRP (which may be because it simply rolls off the tongue more easily). Although this document refers to SFAIRP and ALARP, both should be interpreted as answering the same basic question: ‘Have all reasonably practicable RRMs been implemented, in the Systems we design and build, to ensure that people are safe from the Hazards to which they will be exposed?’



RRMs, which are sometimes referred to as barriers, are any measure that reduces the likelihood or consequence of an accident. They are not limited to hardware and may relate to software, human factors, management systems, training/competence, procedures, proximities to other equipment or Hazards, locality, and environment.

Because the regulations relate to all phases of the lifecycle, many of the Lifecycle Criteria (Section 7) will need to be established at the design stage and maintained or updated throughout the lifecycle. The concepts have been formally recognised by the HSE for the process industries since the Baker Report on the Texas City disaster (1), and the principles are now established Good Practice, so they should be equally applicable to other industries and products.

NB. Goal-setting legislation is supported by delegated legislation (normally Regulations), much of which may be prescriptive in nature. This legislation generally relates to specific industries or types of activity, mandating certain criteria, such as the need for protective equipment, or setting exposure limits for certain Hazard types, such as noise, chemicals, and radiation. However, these Regulations are beyond the scope of this document.

2.1 Regulatory Control and Prosecution

UK safety law is of two types: statutory legislation (made by Parliament) and case law¹, which is interpretations of statute made by judges, the latter of which is particularly important for goal-setting regulations. It also includes law made by judges themselves, such as, for example, the tort of negligence. In the UK legal system criminal prosecutions are brought by the state, primarily under the Health and Safety at Work etc. Act 1974 (HASAWA) and associated sets of regulations. Civil actions are brought by individuals for compensation due to injury or death caused by negligence (Duty of Care failings) or breach of statutory or case law. The standard of proof for both civil and criminal cases is based on the balance of probabilities, i.e. it is more likely than not to be true (rather than beyond reasonable doubt, as for other criminal cases). However, the burden of proof differs, because it lies with the Legal Duty Holder (Section 2.6) to demonstrate compliance for criminal cases, and with the claimant to prove negligence or breach for civil ones.

The duties set out in HASAWA include the following:

Section 2 General duties of employers to their employees: (1) It shall be the duty of every employer to ensure, so far as is reasonably practicable, the health, safety and welfare at work of all his employees.

Section 3 General duties of employers and self-employed to persons other than their employees: (1) It shall be the duty of every employer to conduct his undertaking in such a way as to ensure, so far as is reasonably practicable, that persons not in his employment who may be affected thereby are not thereby exposed to risks to their health or safety.

Section 6 General duties of manufacturers etc. as regards articles and substances for use at work: (1) It shall be the duty of any person who designs, manufactures, imports or supplies any article for use at work or any article of fairground equipment—to ensure, so far as is reasonably practicable, that the article is so designed and constructed that it will be safe and without risks to health at all times when it is being set, used, cleaned or maintained by a person at work;

1 Also known as Common Law, as opposed to Statutory Law, which is made by parliament.



The Management of Health and Safety at Work Regulations (MHSWR) include the following:

Section 3 Risk Assessment: (1) Every employer shall make a suitable and sufficient assessment of the risks to the health and safety of his employees to which they are exposed whilst they are at work; and the risks to the health and safety of persons not in his employment arising out of or in connection with the conduct by him of his undertaking, for the purpose of identifying the measures he needs to take.

The primary aims of this document are to explain what these terms mean in an engineering context.

Approved Codes of Practice (ACoPs) have a special legal status and provide practical examples of Good Practice that will be considered by the courts. They give advice on how to comply with the law by, for example, providing a guide to what is Reasonably Practicable and what is required for a Risk assessment to be ‘suitable and sufficient’. Guidance does not have the same legal status, but either may be considered Good Practice; therefore, if the Legal Duty Holder chooses to deviate from them, it will always be prudent to record a full explanation, i.e. a WRA.

The Regulators and their agencies, such as the Health & Safety Executive (HSE), Office for Nuclear Regulation (ONR), Office of Rail and Road (ORR), and the Civil Aviation Authority (CAA), oversee compliance and have a duty to accept safety cases/reports for certain ‘permissioned’ industries, such as nuclear, chemical, process and offshore oil and gas. However, the final arbiter of any safety case/report, or breach of legislation, is the judiciary, who will seek answers to three basic questions:
• Was the Risk Reasonably Foreseeable?
• Was the Risk anything more than what can truly be said to be part of the incidence of everyday life?
• Would it have been reasonably practicable to further reduce or eliminate the Risk?

If the answer is no to any of these questions the courts are likely to acquit the defendant. However, it is necessary to refer to case law to understand how the questions will be answered.


2.2 Reasonable Foreseeability and the ‘Incidence of Everyday Life’

Reasonable Foreseeability is an important concept to engineers because it determines whether a Hazard will need to be addressed.

The following case law examples are helpful to clarify these issues:

Relevant Case Law Examples on Foreseeability and Everyday Risk

Foreseeability:
McLean v Remploy, 1994.
The employer was found not liable in a case where a practical joke played by one employee resulted in injury to another, as it was deemed to be not Foreseeable.

Foreseeability with respect to standards, or generally available understanding of, the Risk:
Baker v Quantum Clothing Group, 2011.
The court found the employer not liable because, at the time of the employee’s hearing damage, there was no recognised guidance on acceptable noise levels in the workplace, so it was therefore deemed not Foreseeable.

Risk must be more than the everyday risk:
Dean & Chapter of Rochester Cathedral v Leonard Debell, 2016.
A tripping accident due to a piece of concrete protruding from under a traffic bollard, where the appeal court found the defendant not liable. The court concluded that a) it was an extremely small piece of concrete, and it was unlikely that a pedestrian would walk so close to the bollard, and b) the Risk had to amount to more than the “everyday risk” from normal blemishes or defects common to any road or path.

Regina v Porter, 2008.
The case involved the death of a three-year-old boy, who injured his head when jumping from steps in his school playground. Unfortunately, having been taken to hospital, he contracted MRSA and died. The judge stated, ‘Where the Risk can truly be said to be part of the incidence of everyday life, it is less likely that the injured person could be said to have been exposed to Risk by the conduct of the operations in question’.

Harm does not need to have occurred:
Regina v Board of Trustees of the Science Museum, 1993.
One of the museum’s cooling towers was found to contain the bacteria which cause legionnaires’ disease. No one had succumbed to that disease, but there was a Risk to health and safety. A Risk existed, which was regarded as ‘a possibility of danger’. The guilty verdict was upheld by the appeal court.

All appropriate means must be used to identify the risk:
Regina v Tangerine Confectionery Ltd and Veolia, 2011.
Tangerine involved a man crushed in a machine because he had not followed the company procedure. Veolia involved the death of a litter picker working on the side of a fast road. In both cases the argument that the incidents were not Foreseeable, as they were an incidence of everyday life, was rejected. The general duties under the HSWA “command an enquiry into the possibility of injury”; however, “They are not limited, in the risks to which they apply, to risks which are obvious. They impose, in effect, a duty on employers to think deliberately about things which are not obvious”.



However, there is little case law associated with major accidents, as they are thankfully rare, and any that have happened were almost inevitably Foreseeable.

The last example above, R v Tangerine & Veolia, is particularly relevant because it imposes a duty to undertake appropriate studies to identify the Hazards and the means by which they could be liberated, prevented or mitigated, i.e. to establish a detailed understanding of the technical and/or human factor issues. An alternative way of stating this could be: ‘The analysis must be Proportionate to the nature and complexity of the situation and the severity of its effects if it is to identify all Reasonably Foreseeable Hazards’.

The guidance to the Offshore Safety Case Regulations states that it is Foreseeable that a helicopter could crash into an offshore oil and gas installation, but not an airliner. This implies that the airliner scenario is not plausible, rather than not Foreseeable, as it has been foreseen. The premise here is that a reasonable person would agree that, although the airliner scenario is possible, the combination of conditions that could lead to it may be dismissed as not plausible or, in the accepted legal terminology, not Foreseeable. The rationale would appear to be that there are enough natural or man-made RRMs to make the collision implausible, e.g. aviation law, pilot competence, the remote vicinity of an offshore platform, radar and visual flight rules. It therefore follows that any event that is provided with enough suitable RRMs could be argued to be not Foreseeable. This conclusion is supported by case law, R v HTM (Section 2.3).

This raises the question of what it is that would be Reasonably Foreseeable: consequence, scenario, failure mode or Cause. Taking the Chernobyl disaster as an example:
• Was a nuclear meltdown Reasonably Foreseeable?
• Was a turbine rundown test a Reasonably Foreseeable accident scenario?
• Was the positive void coefficient a Reasonably Foreseeable failure mode?
• Were breaches of procedures, or the design of the core, Reasonably Foreseeable Causes?

Logically, any one of these questions is a valid determinant of whether the legal obligations have been met.

In this context, a reasonable definition of ‘not Foreseeable’ could be compliance with the following criteria:
• It is not required by any standard, guidance, code of practice or accepted Good Practice.
• Appropriate studies (such as HAZID, HAZOP, FMEA) would not identify the Hazard, consequence, scenario, failure mode or Cause.
• There is no history of this or similar events.
• It is not plausible that it will occur in future, where plausibility is justified either by a reasonable Qualitative Argument (e.g. it can be demonstrated that there are several RRMs in place which are sufficiently effective, diverse, reliable and redundant that total failure would not be plausible), or by a Quantitative Argument based on Robust Statistical evidence of integrity for that item, operating under those conditions.

If an event is Reasonably Foreseeable, then the Legal Duty Holder must demonstrate that all Risk controls have been implemented.



2.3 Reasonable Practicability and Gross Disproportion

The issue of defining what is, or is not, reasonably practicable is probably the most disputed aspect of this legislation, and this has led to many different approaches for demonstrating it, some of which would be unlikely to provide a sound defence in a law court. There is limited case law on reasonable practicability, and this is not without ambiguity.

Perhaps the most significant misinterpretation of Lord Asquith’s judgment is the assumption that Risks must be quantified. This may stem from his use of the term computation, but this was before the advent of commercial computing, so its meaning may have been misconstrued. This has been reinforced by documents such as the HSE’s ‘Reducing Risks, Protecting People’ (2), which discusses risk quantification, tolerable and broadly acceptable risks² and risk ranking, none of which are now legal requirements. They may not be feasible either, as there may not be any valid means of making such calculations, as demonstrated in Appendices A & B. There is also a wealth of other misleading legacy industry and Regulator guidance that refers to risk quantification. Most of these documents were published when there were legal requirements for risk quantification and have not been updated since those requirements were revoked in 2005 with the revised Offshore Safety Case Regulations.

Case Law Examples Relating to Reasonable Practicability

Definition:
Edwards v National Coal Board, 1949 is the widely accepted definition, where Lord Asquith stated:
“Reasonably practicable is a narrower term than ‘physically possible’ and implies that a computation must be made in which the quantum of risk is placed in one scale and the sacrifice involved in the measures necessary for averting the risk (whether in time, trouble or money) is placed in the other and that, if it be shown that there is a great disproportion between them – the risk being insignificant in relation to the sacrifice – the person upon whom the obligation is imposed discharges the onus which is upon him.”

The Legal Duty Holder’s ability to pay for risk reduction is irrelevant to the liability:
R v Howe and Son (Engineers) Ltd, 1999.
The Court of Appeal said: “The size of a company and its financial strength or weakness cannot affect the degree of care that is required in matters of safety. Otherwise, the employee of a small concern would be liable to find himself at greater risk than the employee of a large one.”

The relationship between Foreseeability and Reasonable Practicability:
R v HTM, 2006.
HTM claimed that two employees who had been trained, and who had been given warnings on the machinery, nevertheless broke the rules, and that it was not Foreseeable that the company could have done anything more to prevent their deaths. The company were found not liable, and this was upheld by the court of appeal, who made several points, including the following:
• It was correct that evidence of foreseeability should be allowed, because that evidence was potentially relevant to the issue of reasonable practicability.
• Foreseeability was merely a tool with which to assess the likelihood of a risk eventuating.
• A defendant to a charge under sections 2, 3 or 4 of the HSWA, when asking a jury to consider whether it had done everything that was reasonably practicable, could not be prevented from bringing evidence as to the likelihood of risk occurring.



R v HTM gives an insight into the relationship between Reasonable Practicability, Reasonable Foreseeability and likelihood. The court of appeal makes clear that Foreseeability is one test of Reasonable Practicability and that it is also a means of describing likelihood. The arguments for it were Qualitative, not Quantitative (which is the respective difference between likelihood and probability). The most reliable criteria may therefore be the types of evidence permissible in court and whether barristers would be prepared to use them.

Although Cost Benefit Analysis (CBA), which is the method recommended in R2P2, is ostensibly the most logical means of demonstrating Gross Disproportion, it may not be legally acceptable unless it is based on Robust Statistical data, which is rarely available for engineered Systems. The judiciary have recognised these problems, and the RSS guidance (3) sets out four basic criteria for such evidence:

1. Expert witnesses must have appropriate competence in statistics if they are to give probabilistic evidence. This must therefore be true for anyone providing such evidence.

2. Methods of modifying statistical base rates are limited to a form of deductive Bayesian inference, discouraging the use of mathematical formulae. The guides only refer to one stage of modification (from prior probabilities, known as base rates, to inferred posteriors), thereby indicating that multiple modifications, such as the complicated algorithms used in computer models, are too complex to be admissible.

3. Evidence must be relevant to the case (which is, in any case, a requirement of the courts). Statistical base rates must therefore be representative.

4. All assumptions must be stated, and independence must be demonstrated, never assumed.
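To illustrate the single stage of modification permitted under criterion 2, the sketch below works through one Bayesian update from a base rate (prior) to a posterior. The defect rate, sensitivity and false-alarm figures are hypothetical, chosen only to show the arithmetic:

```python
# Single-stage Bayesian update: prior base rate -> posterior probability.
# All figures are hypothetical, chosen only to illustrate the arithmetic.

base_rate = 0.01             # P(defect): prior from a representative population
p_alarm_given_defect = 0.90  # P(test alarms | defect), i.e. sensitivity
p_alarm_given_ok = 0.05      # P(test alarms | no defect), i.e. false-alarm rate

# Total probability of an alarm, across defective and sound items.
p_alarm = (p_alarm_given_defect * base_rate
           + p_alarm_given_ok * (1.0 - base_rate))

# Bayes' theorem: P(defect | alarm).
posterior = p_alarm_given_defect * base_rate / p_alarm
print(f"P(defect | alarm) = {posterior:.3f}")  # ~0.154
```

Note that quoting P(alarm | defect) = 0.9 as though it were P(defect | alarm) would overstate the evidence by a factor of about six in this example; this transposition of the conditional is Error #6 in Appendix B, and with more extreme base rates the error can reach many orders of magnitude.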
The limitations of CBA are covered in Appendix A7, which shows that Risks calculated by QRA have unacceptable errors and uncertainties that invalidate CBA. Nevertheless, there may be cases where CBA is valid, such as drugs trials based on large datasets. Appendices A & B therefore cover the feasibility and major pitfalls of using probabilistic data. For these reasons Gross Disproportion would almost inevitably need to be based on a conservative WRA that would be convincing to a reasonable person.
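For context, the ‘computation’ contemplated in Edwards v NCB weighs the sacrifice of an RRM against its Risk-reduction benefit. The sketch below shows only the arithmetic; the value of preventing a fatality, the Risk reduction achieved and the installation life are hypothetical assumptions, and for the reasons above such a calculation would normally need Robust Statistics behind it to carry any evidential weight:

```python
# Illustrative sacrifice-versus-benefit comparison (all figures hypothetical).

vpf = 2.0e6              # assumed value of preventing a fatality, GBP
risk_reduction = 1.0e-4  # assumed reduction in annual fatality probability
lifetime_years = 20      # assumed remaining life of the installation
rrm_cost = 15_000        # the sacrifice: cost of the candidate RRM, GBP

benefit = vpf * risk_reduction * lifetime_years  # GBP 4,000
ratio = rrm_cost / benefit                       # 3.75

# Whether a sacrifice a few times the benefit is 'grossly' disproportionate
# is a matter of judgement; no fixed multiplier is set in law.
print(f"Sacrifice/benefit ratio = {ratio:.2f}")
```

Even where the arithmetic is this simple, the inputs rarely are; hence the emphasis here on a conservative WRA.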
Given the absence of case law relevant to this, it is recommended that any demonstration of Gross Disproportion should abide by the following eight principles:

1. It should demonstrate that the analysis has been Proportionate (Section 3.1), i.e. undertaken in enough detail to understand the Hazards and any relevant failure modes, Causes, activities or scenarios that could lead to an accident and how they may be prevented or mitigated.

2. It should demonstrate a sufficient understanding of the technical and social issues associated with the Risk.

3. If appropriate, it should show that any relevant standards, approved codes of practice, guidance and relevant Good Practice are not applicable to the unique set of circumstances being considered.

2 The only known exception involving the use of ‘broadly acceptable’ comes from EU railways regulation:
COMMISSION IMPLEMENTING REGULATION (EU) No 402/2013 of 30 April 2013 on the common safety
method for risk evaluation and assessment and repealing Regulation (EC) No 352/2009, quote:
2.2.2. To focus the risk assessment efforts upon the most important risks, the hazards shall be classified
according to the estimated risk arising from them. Based on expert judgement, hazards associated with
a broadly acceptable risk need not be analysed further but shall be registered in the hazard record. Their
classification shall be justified in order to allow independent assessment by an assessment body.
2.2.3. As a criterion, risks resulting from hazards may be classified as broadly acceptable when the risk is so
small that it is not reasonable to implement any additional safety measure. The expert judgement shall
take into account that the contribution of all the broadly acceptable risks does not exceed a defined
proportion of the overall risk.
In practice 2.2.3 interprets ‘broadly acceptable’ in a similar manner to ALARP.



4. If the rejected RRMs would have reduced the potential consequences, those benefits should be stated in conservative terms, considering the number of people affected, their injuries, damage to their health and/or loss of life.

5. If the rejected RRMs would have reduced the probability of an event, this should be expressed in Qualitative terms.

6. It should consider any Risk transfers or trade-offs (Section 2.4).

7. It should consider the societal expectations for the situation being considered, bearing in mind that these may change with time. Society would clearly expect greater expenditure/sacrifice for fatalities than for minor injuries, and this may include an element of Risk aversion (where the acceptance of multiple fatalities is disproportionately less than that of single deaths).

8. The argument should be convincing to a reasonable person (or the majority of a jury), based on the balance of probabilities.

It should also be noted that if an RRM would have been Reasonably Practicable to install at the construction stage of a project, it would not be possible to argue that it is Grossly Disproportionate later because of the increased cost of retrospective implementation.

In conclusion, if an RRM cannot be shown to be Grossly Disproportionate based on these eight principles, then it must be implemented (and operations may need to be stopped until it is).

2.4 Risk Transfer and Risk Trade-offs

Some RRMs may cause Risk transfers between different groups, such as employees, customers, and the public. There are no fixed rules regarding any such compromises, and therefore a plausible, morally justifiable argument will need to be made, which would be acceptable to a reasonable person. Such arguments may be influenced by the relations between the Hazard and the populations involved, which could range from awareness, acceptance and responsibility to blame. For example, the workforce may be aware of, and have tacitly accepted, a degree of Risk which the public has not; societal expectations therefore allow their Risks to be ten times higher than those of the public (2). Reducing the Hazards for a high-Risk worker by slightly increasing the Risks to several members of the public may therefore be difficult to justify, even if the overall population Risk is reduced. Equally, protecting a car occupant/driver by putting innocent pedestrians at greater Risk could be unacceptable (an issue that has challenged the designers of autonomous vehicles). The degree to which blame, responsibility, acceptance and awareness should be accounted for may not be quantifiable, but the arguments used should nevertheless be included.

Equally, one Risk may be traded for another, e.g. a step may be replaced by a slope to reduce the tripping Hazard, but it could introduce a slipping Hazard in wet or icy conditions. It may not be possible to say which option has the greater probability or consequence, so a case should be made that acknowledges the differences and takes other factors into account, such as an individual’s right to accessibility.

Other Risk trade-offs can be necessary where two regulations create the potential for conflict, such as safety and environmental regulations, e.g. depressurising a gas installation may make it safer but discharge global-warming gases into the atmosphere. The legal principle here is to take the highest-level regulation as the priority, but this may not be clear, as both requirements may exist in the same regulation, e.g. COMAH.



However, the ‘Environmental permitting guidance: Core guidance’, relating to the Environmental Permitting (England and Wales) Regulations 2010, states:

‘A.1.14 Regulators should take those requirements into account when setting permit conditions, and both parties should in particular ensure that environmental permitting and Health and Safety requirements do not impose conflicting obligations.’

It would therefore be prudent for the Duty Holder to test the arguments with the appropriate Regulators before finalising any decision.

2.5 Target Safety Levels (TSLs)

In certain cases, Regulators, Legal Duty Holders, or standards may set numerical safety criteria, sometimes referred to as Target Safety Levels (TSLs) or Basic Safety Levels (BSLs). Although these are not strictly legal requirements and do not constitute a demonstration that all Reasonably Practicable Risk controls are in place, they could be regarded as Good Practice, provided they can be quantified with enough accuracy. There are three basic types:

1. Standards. Design standards that require Systems to cope with uncertain loads, such as natural events like storms or earthquakes that have recognised occurrence frequencies (e.g. offshore installations designed to survive the 100-year storm), or structural standards based on real-life failure data, which is built in as safety factors to cope with those unknown loads. Compliance with the standard is therefore part of the ALARP demonstration.

2. Safety Critical equipment reliabilities. These are control systems whose failure may directly lead to a loss. Typical TSLs range from 10⁻⁷ to 10⁻⁹ failures/operating hour (e.g. rail and aviation), so Robust Statistical evidence would be impractical because of the timeframe and the number of items that would need to be tested under real operating conditions. The requirements often use the words prediction and/or estimation, which implies a Qualitative Argument, possibly supported by some relevant statistics. For example, the new version of an aero engine might be able to call on reliability data from the previous version and demonstrate that the new one has better design and materials, and that a rigorous Risk management process has not been able to identify any new Hazards or failure modes. The probabilistic benefits would typically be unquantifiable, but the Qualitative Argument should be sufficient to convince an appropriate specialist that the TSLs would be achieved.

3. Tolerable Risks to individuals. These are typically found in non-binding guidance documents, and their calculation or estimation may be subject to errors of several orders of magnitude (Appendices A & B). Whilst Robust Statistics may be available at an industry or societal level, there are no sound mathematical methods of modifying them for uniquely engineered Systems. The only option may be to make a qualitative justification using a WRA, in a similar manner to type 2 above.

TSLs will therefore be justified either by reference to recognised Robust Statistics, such as storm frequencies, or by using a WRA. The latter option may be supported by Robust Statistics from similar equipment or operations, provided the differences are understood and systematically analysed. However, there is no legal requirement to calculate TSLs or BSLs, and they do not over-ride the obligation to demonstrate that all Reasonably Practicable Risk controls are in place. Care should be taken to ensure that any arguments or calculations comply with the RSS criteria from Section 2.3.
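A back-of-envelope check illustrates why the direct statistical demonstration described under type 2 is impractical. The sketch below uses the statistical ‘rule of three’ (zero failures observed over T unit-hours gives an approximate 95% upper confidence bound of 3/T on the failure rate); the TSL and fleet size are hypothetical:

```python
# Rule of three: with zero failures observed over T unit-hours, the 95%
# upper confidence bound on the failure rate is roughly 3 / T.

target_rate = 1e-9              # hypothetical TSL, failures per operating hour
hours_needed = 3 / target_rate  # failure-free unit-hours required: 3e9

fleet_size = 1000               # hypothetical number of units on continuous test
years = hours_needed / (fleet_size * 8760)
print(f"{hours_needed:.0e} unit-hours, i.e. about {years:.0f} years "
      f"with {fleet_size} units on continuous test")  # ~342 years
```

Similarly, the type 1 ‘100-year storm’ corresponds to a 1% annual exceedance probability, so over a 25-year design life the chance of experiencing at least one such storm is 1 − 0.99²⁵, roughly 22%.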



2.6 The Legal Duty Holder

The question of where the legal duty lies was given clarification in Regina v British Steel Plc, 1994, where a scaffold tower built by contractors collapsed and killed someone. British Steel accepted that the incident constituted a breach of HASAWA, but stated that the defence of reasonable practicability enabled them to submit a defence on the basis that the company had taken all reasonable care in delegating supervision to the contractor. The judge ruled that the defence of proper delegation did not arise, and convicted. This was upheld by the Court of Appeal.

In cases where multiple organisations are contracted to the same project, their individual liabilities may be complex, so it is strongly advised to seek legal advice.

2.7 Reverse Burden of Proof

The Reverse Burden of Proof set out in the HSWA means that, in a criminal prosecution, the prosecutor must show, beyond reasonable doubt, that the Defendant breached its duty by failing to ensure that no one had been exposed to Risk. The Defendant must then prove, on the balance of probabilities, that it had taken all reasonably practicable steps.

However, following the Enterprise and Regulatory Reform Act (ERRA) 2013, in civil cases it is for the claimant to demonstrate that the employer has failed in his obligations.

2.8 Retrospective Application of ALARP Principles

Where an existing product or installation has not previously been demonstrated to comply with these regulations, it may be necessary to undertake this retrospectively. There is no known provision or exception for this type of analysis, and it is questionable whether an RRM applied after commissioning could be argued to be not reasonably practicable, especially if it would have been reasonably practicable during the design stage (as stated in Section 2.3).



3. OVERVIEW OF THE
SAFETY RISK
MANAGEMENT PROCESS
Key Messages

The identification and analysis of safety Risks needs to be as objective, systematic, and scientific as
practicable, only using judgement where it can be shown to be accurate enough.

The principles involved in demonstrating that all Reasonably Practicable Risk controls are in place
are essentially the same for all industries, although the methodologies may vary.

A systematic brainstorming process is required for identifying Hazards.

The Proportionality principle is key to determining the proper level of analysis for any Hazard.

A structured and demonstrable safety management system may be necessary.

Risk management applies throughout the lifecycle of the subject matter, from design to disposal.

The basic elements of the Risk management process are summarised in Figure 1. Whilst the elements of Figure 1 are applicable across all industries, the process can be relatively simple and sequential, or complex and iterative, so the figure indicates potential feedback loops where one stage may reveal new information relevant to a previous one. However, the process will generally include the following elements:
• Determining Proportionality - For the analysis to be Proportionate, the worst Reasonably Foreseeable consequences need to be established, together with a broad understanding of the type of System and its complexity. The Proportionality Matrix (Section 3.3) is then used to establish the Risk analysis strategy.
• Identifying Hazards - This could be described as systematic brainstorming to establish the high level Hazards for further analysis. This stage may also identify Risk Reduction Measures (RRMs).
• Reducing consequences - This may be the simplest and most effective Risk reduction strategy if it is practicable. However, it may require physical effects modelling to understand the potential escalations and effects on people. If the consequences can be reduced significantly, or eliminated, there may be no need for further analysis.
• Reducing the likelihood - There are many methodologies for analysing the problem to identify further RRMs. The objective of these studies is generally to identify ways to prevent an incident by exposing inadequate or absent controls.
• Recording and communicating safety critical messages - It is important to document the demonstration of ALARP, together with safety goals and how they are to be achieved. This also ensures that these lessons are not lost, so that integrity can be maintained throughout the product's lifecycle.
• Maintaining ALARP throughout the Lifecycle - ALARP applies throughout the product lifecycle, so systems need to be in place to ensure that changes, such as modifications, wear and tear and aging, do not compromise integrity.
• Identifying RRMs - RRMs can be identified at almost any of the above stages, as this is the fundamental purpose of risk management.



[Figure: a flowchart of sequential stages with feedback loops, flanked by two continuous activities - Management Systems (policies, engineering procedures, change control, QA, audit and review, roles and responsibilities, competence requirements; Section 3) and Identify and Implement Risk Reduction Measures (RRMs):
1. Make initial assessment of worst Reasonably Foreseeable consequences to establish Proportionality - Proportionality Matrix (Section 3).
2. Identify Hazards - systematic brainstorming, hazardous materials, equipment and activities, Hierarchy of Controls, HAZID, good practice, standards, accident history (Section 4).
3. Reduce consequences - physical effects modelling, escalations and consequence mitigation (Section 5).
4. Reduce likelihood (prevention and mitigation) - analytical methodologies, causes of loss and RRM effectiveness (Section 6).
5. Record and communicate Safety Critical data and conclusions - Lifecycle Criteria, WRA, safety report/case (Sections 7 & 8).
6. Maintain ALARP throughout lifecycle - management of change, CCHAZID, contingency plans (Section 9).]

Figure 1: Risk Management in Engineering

Risk management in safety differs from business risk, because the legal obligations require a higher standard of evidence, and professional judgement may not be good enough (Appendix A).

Risk management can apply to:
• Products, installations, processes, and activities.
• The whole lifecycle, from the initial feasibility assessment, through design, construction, commissioning, operation, maintenance, modification, decommissioning, and disposal.
• Hazard, consequence or likelihood reduction, and emergency response.
• The System relating to each Hazard.

It must therefore be initiated early enough to enable all recommendations to be incorporated in the design, perhaps as early as the feasibility studies, and may need to be repeated as the design progresses. The deliverables include RRMs, Lifecycle Criteria and a demonstration that all Reasonably Practicable Risk controls are in place.



3.1 The Hazardous System

Each Hazard will relate to a System, which comprises anything that can trigger or influence the likelihood or consequences of that Hazard causing an accident over the lifecycle, including equipment, software, people, management systems, procedures, activities, and natural Causes. It may also be necessary to define the System limits, or the design envelope.

The System may need to be considered over some or all stages of its lifecycle, as follows:
• Construction/Manufacture
• Commissioning, set-up and/or trials
• Storage/Mothballing/Transportation
• Modes of Operation
• Maintenance and Inspection
• Modification
• Dismantlement and Disposal

The relevant stages could be incorporated into the hazard identification guidewords, or even have dedicated columns (see Table 2).

Example 3.1 System Definition - The Grenfell Tower Disaster

This accident illustrated how the System boundaries can be set too tightly, as the true System was much more than the cladding on the outside of the building that ultimately caught fire; it included its structure, its contents, escape routes, ventilation systems, the occupants, emergency procedures, drills, and the emergency services, all of which influenced either the likelihood or consequences of the outcome, so analysis of the cladding alone would be incomplete.

3.2 Proportionality

A key principle in risk analysis is that of Proportionality, which dictates the amount of effort justified for a particular Hazard. Unfortunately, there is little explanation of what this means or how it can be measured. Most guidance on this uses ambiguous wording, such as 'low risk', without any explanation of what this means. HSE COMAH guidance states 'The depth of the analysis in the operator's risk assessment should be proportionate to (a) the scale and nature of the major accident Hazards (MAHs) presented by the establishment and the installations and activities on it, and (b) the risks posed to neighbouring populations and the environment', i.e. the assessment has to be site specific. 'The depth of analysis that needs to be present depends on the level of risk predicted before the identified measures are applied.' However, Risk cannot be known until the analysis is complete, and even then, it may not be possible. Appendices A & B show that Risk calculations and predictions may be prone to elusive errors of thousands, millions or even billions of times, with no means of sense checking or validating them.

As stated in Section 2, the law courts are unlikely to accept any Risk prediction without robust evidence of its veracity, so an objective measure of Proportionality is necessary. This could include:
1. The worst Reasonably Foreseeable consequences.
2. The potential for RRMs.
3. The System complexity.



The worst Reasonably Foreseeable consequences are the most quantifiable of these criteria. Regardless of the Causes, there should be enough information to estimate this within an order of magnitude, which is good enough for these purposes (Section 5). Minor injuries or health effects would not normally justify extensive study unless they apply to a significant population. Multiple fatalities would almost certainly require comprehensive study.

The potential RRMs will also help to determine the depth of analysis required. For example, a meteorite strike could have serious consequences but there is little, if anything, that can be done about it. An autonomous vehicle may have lower consequences but much greater potential for Risk reduction and would therefore warrant substantial study work.

The System complexity could include novelty, Safety Critical functionality, interdependencies, interactions, close coupling, and sociotechnical aspects.

Any engineered System has a degree of novelty, which can introduce new or unknown Risks. Those parts of a System that are covered by design standards or codes may have less novelty and therefore require less analysis. Nevertheless, standards have limits, e.g. a relief valve complying with the standard may not be an effective RRM for all scenarios (e.g. overpressure and fire).

Safety Critical functions may range from simple to complex. Complex Systems may have feedback control loops (manual or automatic), Safety Critical monitoring requirements, multiple interfaces, critical proximities and/or multiple functions/options. Safety Critical software would normally be regarded as complex. The greater the complexity, coupled with Foreseeable Hazards, the more analysis will be required, and this may involve systematic methodologies to identify and evaluate all Safety Critical scenarios.

Interdependencies and interactions relate to non-linear systems where different elements influence others that may have different functions. This can be as complex as multi-faceted software systems or as simple as two independent functions that could be affected by a common maintenance procedure.

Close coupled Systems involve Safety Critical time constraints (4), so control functions may be more critical yet more limited. This can happen with reactive components (nuclear, chemical) or any processes where some stages are time dependent on previous stages, with little flexibility or potential for substitution under Hazardous conditions, e.g. as part of emergency response.

Sociotechnical Systems also increase complexity, introducing human factors, which can relate to ergonomics, safety culture, competence, procedures, and management systems. Systems with critical alarms, maintenance and inspection processes, monitoring requirements, approvals processes (such as permit to work) and human interfaces may require additional study, especially when there is potential for unsafe acts.



3.3 The Proportionality Matrix

Whilst the Foreseeable consequences may be quantified, potential RRMs and complexity are more subjective, albeit with reasonable descriptors to help make such judgements. Nevertheless, a straightforward matrix can be constructed to illustrate what would reasonably be expected for any given System, as shown in Matrix 1 below.

Typical Requirements for a Proportionality Matrix

Rows - Worst Reasonably Foreseeable Consequences:
1. Single injury or health effects.
2. Single fatality or chronic health effect, or multiple injuries.
3. Multiple fatalities or chronic health effects.

Columns - depth of assessment:
A. No assessment.
B. Apply good practice, guidance, standards and Hierarchy of Controls.
C. If consequences cannot be reduced, then undertake a basic HAZID and any studies arising from it.
D. If consequences cannot be reduced, then undertake a comprehensive HAZID and all reasonable methodologies, unless shown to have no further benefit or to be disproportionate.

KEY (each severity/assessment cell is colour-coded in the original): Proportionate; May be proportionate if the system has no foreseeable potential for further risk reduction and/or limited complexity; Not acceptable.

Matrix 1: The Proportionality Matrix
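To make use of the matrix auditable, the severity row, the chosen depth of assessment and the recorded basis for the rating can be captured as structured data. The following Python sketch is illustrative only: the field and class names (Severity, Assessment, ProportionalityRecord) are hypothetical renderings of Matrix 1, not part of this guide, and the example values are invented.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    """Rows of Matrix 1 (worst Reasonably Foreseeable consequences)."""
    SINGLE_INJURY = 1          # Single injury or health effects
    SINGLE_FATALITY = 2        # Single fatality/chronic health effect, or multiple injuries
    MULTIPLE_FATALITIES = 3    # Multiple fatalities or chronic health effects

class Assessment(Enum):
    """Columns of Matrix 1 (depth of analysis)."""
    NONE = "A"                 # No assessment
    GOOD_PRACTICE = "B"        # Good practice, guidance, standards, Hierarchy of Controls
    BASIC_HAZID = "C"          # Basic HAZID plus any studies arising from it
    COMPREHENSIVE_HAZID = "D"  # Comprehensive HAZID and all reasonable methodologies

@dataclass
class ProportionalityRecord:
    """One Hazard's rating, with the recorded basis needed for a defensible WRA."""
    hazard: str
    severity: Severity
    assessment: Assessment
    further_risk_reduction_foreseeable: bool
    limited_complexity: bool
    basis: str                 # Why this depth of analysis is judged Proportionate

# Illustrative example entry only.
record = ProportionalityRecord(
    hazard="Basement fire",
    severity=Severity.MULTIPLE_FATALITIES,
    assessment=Assessment.COMPREHENSIVE_HAZID,
    further_risk_reduction_foreseeable=True,
    limited_complexity=False,
    basis="High-rise residential building; multiple escalation routes foreseeable.",
)
```

Keeping such records, particularly for amber-zone judgements, preserves the audit trail the surrounding text recommends.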

NB. This bears similarity with Risk Matrices, which are discouraged for the reasons given in Section 6.12 and Appendix A6.

The Proportionality Matrix therefore provides the engineer with a more objective and legally defensible method for determining the depth of analysis than most guidance and/or risk matrices would. It would be prudent to record the basis for rating the Hazards, especially within the amber zones. This also provides a basis for a more objective WRA.

3.4 Engineering Management Systems

In order to demonstrate that all Reasonably Practicable Risk controls are in place, it may be necessary for the whole engineering process to be governed by an appropriate safety management system, which may comprise policies, quality assurance, audit, review, change control, procedures, roles, responsibilities, competence requirements, and processes for capturing lessons from experience and history. This is especially true for permissioned industries that are obliged to produce a safety report/case for acceptance by the Regulator.

One useful and well established framework for a management system is the Plan, Do, Check, Act (PDCA) principle, as shown in Figure 2, although other models may be as effective. This can be applied at many levels, to test the effectiveness of individual tasks through to the high level policies of an organisation. It also illustrates the principles behind Figure 1, showing how the feedback loops facilitate iterative problem solving.



The PDCA is most useful for determining how well the key elements have been integrated into the organisation, to ensure that it has the capabilities and processes to undertake risk management. The planning phase is about having clear policy statements, objectives and strategies. The safety policy should cover:
• The statement of intent and aims for health and safety.
• The high level roles, accountabilities and responsibilities, competencies and training for health and safety in the organisation.
• The general arrangements in place, to achieve these aims and to monitor whether they are successful.

PLAN - Establish the policies, objectives, strategies to achieve desired outcome.
DO - Implement the plan. Provide resources, processes and establish competencies.
CHECK - Study, analyse, QA, review in comparison with the plan.
ACT - Adjust process or change strategy to correct inconsistencies with the plan.

Figure 2: 'Plan-Do-Check-Act' Cycle

Safety objectives and strategies can be defined for each project, to ensure that those involved will understand what is to be achieved and how. They should also show how these integrate with the commercial and technical objectives.
Example 3.2 - Poor Communications in Design

A pipeline to an offshore gas platform was designed and built by a virtual team spread across different locations in Europe. During the commissioning the pipeline was pressure tested but pressure could not be achieved, indicating a leak. It was later found that one part of the team had decided to install an instrument fitting, but this was not communicated to other disciplines, so it was installed but not capped off during construction. There were no safety consequences in this case, but the costs ran into many millions by the time it was resolved.

The organisation, or project teams, will need to be arranged, resourced, and managed for effective communication between disciplines, departments, contractors, consultants and Regulator, so that all Safety Critical objectives are clear, relevant knowledge is disseminated, and decisions are taken by competent persons with appropriate roles, responsibilities and authorities. One means of achieving this is to have designated Technical Authorities for each engineering discipline, to make Safety Critical decisions and ensure compliance with the law, Good Practice, and technological developments. (NB. Section 2.3 gave case law, which established that the ability of the Legal Duty Holder to pay has no influence on the liability.) The UK Engineering Council 'Standard for Professional Engineering Competence (UK-SPEC 3rd Edition)' may be regarded as Good Practice as it defines the competency requirements relating to safe systems of work and safety in design. However, companies may have their own competency requirements and assessment processes, which could be a suitable substitute.

The Check stage is sometimes referred to as the analysis or study phase. Whilst this is shown in Figure 1 as the Analysis stage, it will also need to include reviews or audits of the process, its management and project deliverables. For SMEs or small projects, the reviews may be informal and based on the feedback loops shown in Figure 1. For larger enterprises and projects, such as petrochemical and nuclear, a comprehensive review and audit plan would be appropriate. Further useful information may come from monitoring recent global or industry developments on major accidents, standards, and Good Practice.

There will need to be an accessible Safety Information System to provide legal requirements, Good Practice, standards, codes of practice, guidance, information on lessons from previous designs and relevant incidents, plus project documentation, including policies, design briefs, specifications, and any other relevant technical information.

It will also be necessary to ensure that any Safety Critical decisions are taken by appropriately qualified persons, recorded, and communicated to other relevant parties in the team.



The processes should be designed to learn, so far as reasonably practicable, from experience, incidents,
accidents, and any available leading indicators
that could benefit the design. It may therefore
be necessary to have people with relevant
experience and knowledge throughout the
whole engineering process. This could include
experienced operators/users, maintenance and
inspection personnel and subject matter experts.
Any Safety Critical engineering activity will also
require a formalised Management of Change
(MoC) process, to ensure that changes cannot be
implemented without undergoing appropriate risk
assessment and approvals.
Finally, there need to be systems in place
to ensure that the learning points and
recommendations from the reviews and audits
are implemented.
There is a range of guidance on safety
management systems, of which HSG65 may be
the most well-known (5), but these apply PDCA
to the broader spectrum of technical, operational
and occupational aspects, the latter two of which
have less relevance here.



4. IDENTIFICATION

Key Messages

The purpose of this stage is to identify:


• The Hazards, their effects and, where practicable, their Causes.
• The System related to each Hazard, or all Hazards.
• Risk Reduction Measures.
• Requirements for further study.

Identification of Hazards requires a structured, systematic brainstorming process.

A HAZID will normally be required for any significant Hazard, and this should draw upon guidewords,
experience and past incidents and accidents.

The HAZID should consider the whole System relevant to the Hazard, which may include hardware,
software, competence, procedures, emergency response etc.

For each Hazard, the HAZID will either conclude that the Risks are ALARP or that further study is
required.

Identification comprises a systematic brainstorming exercise to identify the Reasonably Foreseeable Hazards and their consequences, in terms of health, injury, fatality and the numbers of people affected. The purpose is to establish whether all Reasonably Practicable RRMs are in place, or whether the problem warrants further analysis.

4.1 The Hierarchy of Controls

The minimum level of Risk management is to apply:
• Relevant Good Practice, which may include Approved Codes of Practice (ACoPs), standards, and guidance from the Regulator or industry, unless there is a sound case for not doing so.
• Any other relevant legislation.
• The Hierarchy of Controls, as shown in Figure 3.

The methods at the top of the hierarchy are preferred as they are more effective at eliminating, rather than reducing, the Risk. (NB. Figure 3 is a modified version of the standard hierarchy, for engineering purposes.)

Good Practice puts the emphasis on inherently safe solutions, such as elimination of the Hazard, or substituting it with less Hazardous materials or activities. The factors of safety built into the design should be conservative enough to deal with any potential load or stress that the system may experience. Effective isolation could be achieved by considering the proximity with people or providing protective RRMs. If the RRMs are completely effective no further analysis will be required, but if Reasonably Foreseeable Risks remain, then it is likely that the process will generate further study actions.

[Figure: the hierarchy runs from simple, effective solutions at the top to complex, less effective solutions at the bottom, which are likely to require further study, especially if consequences remain significant: elimination, substitution; factor of safety; remove people (simplify/automate/relocate); containment (isolate/protect); detection, control and recovery; awareness, competence; procedures, organisation; protective equipment (PPE); personnel escape; emergency response.]

Figure 3: The Hierarchy of Controls
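Because the hierarchy is an ordered preference, it can be expressed as a ranked list that a review works down, recording why each higher tier was rejected before accepting a lower one. A minimal Python sketch of that idea follows; the tier list paraphrases Figure 3, and the function name and rejection format are hypothetical, not prescribed by this guide.

```python
# Tiers of Figure 3, most effective first (paraphrased).
HIERARCHY = [
    "Elimination / substitution",
    "Factor of safety",
    "Remove people (simplify / automate / relocate)",
    "Containment (isolate / protect)",
    "Detection, control and recovery",
    "Awareness, competence",
    "Procedures, organisation",
    "Protective equipment (PPE)",
    "Personnel escape",
    "Emergency response",
]

def first_practicable_tier(rejections: dict[str, str]) -> str:
    """Return the highest tier without a recorded rejection.

    `rejections` maps a tier to the documented reason it is not
    Reasonably Practicable - the audit trail the guide calls for.
    """
    for tier in HIERARCHY:
        if tier not in rejections:
            return tier
    return HIERARCHY[-1]  # Lowest tier is the fallback.

# Example: elimination rejected, with a recorded justification.
print(first_practicable_tier({
    "Elimination / substitution": "Hazardous inventory essential to the process.",
}))
```

The point of the structure is not the lookup itself but that every step down the hierarchy carries a written justification.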



These decisions need to be taken throughout the engineering cycle, from initial feasibility studies through detailed design. As the System complexity and consequences become more severe, the efforts to reduce risks will need to be Proportionately greater. There will need to be an audit trail, which demonstrates that a systematic process has been used to identify, prevent and/or mitigate the Hazards. This can result in different strategies, as illustrated in Example 4.1.

However, different strategies may also apply at different levels. Whilst it may not be possible to make aviation inherently safe, this may be possible for some systems on an aircraft. Although the resulting strategies may look different, all significant Hazards will need to be identified and addressed, and the starting point for this is generally a HAZID study.

Example 4.1 - Typical Strategies Adopted by Different Industries

Aviation - emphasis on high integrity, fault tolerant systems with redundancy, safety factors and health monitoring. Focus on human error, human factors, training, and recovery from adverse situations. Highly regulated, with comprehensive certification processes and comprehensive accident investigation.

Buildings - emphasis on inherent safety, with non-flammable structures and furnishings, fire-fighting, escape to safe location.

Chemicals - emphasis on prevention of release of chemicals, shut down to minimise quantities lost and prevention of their ignition. Minimising explosion severity. Escape and evacuation of all personnel in vicinity.

Military - emphasis on inherent safety, especially with nuclear missiles, and prevention of misuse.

Nuclear - emphasis on returning system to a safe state, minimising need for manual intervention, high reliability, and redundant safety systems. High integrity containment of the hazard.

4.2 Basic HAZID

In many cases the Hazards and potential RRMs may not be as obvious, and a more systematic, formalised approach may be necessary, especially as the consequences become more serious. The HAZID (sometimes known as a Preliminary Hazard Analysis) provides assurance that:
• All Foreseeable Hazards have been identified.
• Where appropriate, Proportionate, and suitable further analysis has been recommended.
• A full audit trail of the risk management process is available.
• It may also be able to demonstrate that all appropriate RRMs have been identified and applied, including Good Practice, and therefore close out the risk assessment process.

The HAZID identifies and evaluates Hazards and RRMs, with a view to establishing whether the risks have been reduced to ALARP, or if further study work is required, as detailed in Sections 5 and 6.

It is a systematic brainstorming exercise, employing guidewords taken from industry or company standard lists. Alternatively, a generic set, such as that in Table 1, could be used, although it may be preferable to use this as a starting point for developing more bespoke guidewords. The categories in the table illustrate some different perspectives that may be taken to maximise ideas generation.

The level of detail should be Proportionate to the complexity of the System and severity of the Foreseeable consequences (Sections 3.2 and 3.3). It may also depend on the risk management strategy, and how much emphasis is to be placed on closing out Hazards in the HAZID, as opposed to using more advanced methodologies, such as those in Sections 5 and 6. Ideally, the HAZID would determine the follow-on studies, but many industries consider them to be Good Practice, so they are undertaken anyway.

Preparation will be key to the effectiveness of the brainstorming. The participants will need to be kept mentally stimulated throughout the process and provided with all the relevant technical detail as and when required. Too many guidewords could make the process tedious, whereas too few could miss Hazards. The facilitator will need to consider the categories in Table 1, research any relevant background data and delete any guidewords that are obviously irrelevant to the subject matter, but only when there is no doubt.



System Characteristics: Goals; Applications; Functions/Activities; Control Loops; Conflicting Objectives; Skills; Knowledge; Communications; Dependencies; Technological Limits; Resources; Complexities; Authorisations; Responsibilities; Interlocks/Permissives; Cost Constraints; Time Constraints/Urgency; Time Critical Sequences; Human Machine Interfaces; Shortcuts/Workarounds; Procedures; Monitoring; Proximities/Layout; Escalations; Recovery

Energy: Pressure; Temperature; Momentum; Radioactivity; UV/IR/Microwave; Height; Latent Heat; Static/Voltage; Tension/Stress; Chemicals/Materials; Fire/Explosion: Flammable - Solids - Liquids - Gas - Vapour - Mist - Dust

Cause of Death, Injury or Health Effect: Fire; Explosion; Chemical/pH; Toxicity/Smoke; Electric Shock; Heat Burns; Poisoning; Asphyxiation; Suffocation/Drowning; Crushing; Burying; Collision/Impact; Haemorrhage; Hypo/Hyperthermia; Noise; Carcinogens; Physical Strain/Injury; Trapping; Slips, Trips and Falls; Radiation; Anaesthetics; Viruses/Bacteria; Divers' Bends; Falling Object

Lifecycle: Design; Procure; Construct; Manufacture; Commission/Trials; Store; Transport; Start-up; Operate; Shut-Down/Idle; Maintain; Modify; Mothball; Decommission; Dispose

Failure Modes and Internal Causes: Catalyst; Flash Point; Ignition; Acidification; Pressurised Release; Dosage; Erosion; Joule-Thomson; Containment; Vibration/Resonance; Seizure; Overload; Velocity Change; Spillage/Vapour; Confined Space; Fracture; Ventilation; Brittle Fracture; Creep/Fatigue; Power Change/Surge; Solidifying; Blockage; Shock Waves; Contamination; Melting/Softening; Friction; Hardness; Energy Storage; Energy Transfer; Structural Failure; Stalling; Skidding; Runaway Reaction; Wear & Tear; Dropped Object; Turbulence; Waste; Workmanship; Loose Objects; Specification Validity; Poor Fit; Stability Loss; Partial Connection; Corrosion/Decay; Material Compatibility

External Causes: Wind; Flooding; Earthquake/Landslip; Tsunami; Vandalism/Terrorism; Freezing; Lightning; Accidental Damage; Foreign Object; Nearby/Other Activities

Control and/or Human Error (Task Specific): No Action; Wrong Action; Too Early/Too Late/Out of Sequence; Too Little/Unfinished/Too Much; Correct Action, Wrong: Item/Proximity/Set-up/Location/Configuration

Control/Human Factors: Instruction: Quality/Language/Compulsion; Knowledge: Data/Competence/Skills; Information: False/Irrelevant/Distractions/Overload/Boredom; Feedback: Weak/Slow/Sensor Failure; Layout: Simple/Intuitive/Understandable; Lighting; Heat; Timing/Urgency; Access; Physiology

Table 1: Generic HAZID Guidewords
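During preparation, the facilitator can hold the Table 1 categories as structured data and prune only the guidewords that are demonstrably irrelevant, keeping a record of what was removed and why. A minimal Python sketch under those assumptions; the names (GUIDEWORDS, prune) are hypothetical and the word list is an abbreviated extract, not the full table.

```python
# Abbreviated extract of the Table 1 categories (not the full list).
GUIDEWORDS = {
    "Energy": ["Pressure", "Temperature", "Momentum", "Radioactivity"],
    "External Causes": ["Wind", "Flooding", "Earthquake/Landslip", "Tsunami"],
    "Lifecycle": ["Design", "Construct", "Operate", "Maintain", "Dispose"],
}

def prune(guidewords, exclusions):
    """Remove guidewords only when recorded as clearly irrelevant.

    `exclusions` maps guideword -> written justification, preserving
    the audit trail; anything in doubt is kept for the team session.
    """
    kept = {cat: [w for w in words if w not in exclusions]
            for cat, words in guidewords.items()}
    return kept, exclusions

session_list, audit = prune(
    GUIDEWORDS,
    {"Tsunami": "Inland site, 150 m above sea level."},
)
```

Holding the justification alongside the deletion mirrors the guide's requirement that guidewords be removed only when there is no doubt.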



The HAZID will normally be a team event, comprising suitably qualified and/or experienced persons, such as users/operators, technical specialists, maintenance, and inspection personnel, working systematically through the guidewords, which may be pre-filled in the worksheets, placed on flipcharts, or displayed on the walls, for attendees to refer to as the exercise progresses. The relevant technical data should also be displayed for attendees to refer to at any time, e.g. properties and system characteristics from Table 1. The room layout, number of attendees, lighting, heating, timing of sessions, refreshments, and avoidance of distractions, such as calls, emails and other operational pressures, can help to improve the event effectiveness. It may also be necessary to hold the event away from the normal working environment and prohibit phones and laptops.

The HAZID team should normally comprise:
• Custodian (who holds overall responsibility for arranging the event and setting terms of reference; may also be the safety engineer).
• Chairman (3rd party, who prepares the event, runs it, and writes the report).
• Scribe (records proceedings).
• Discipline engineers (including safety and other relevant disciplines).
• Relevant specialists from the client, contractors and/or suppliers.
• Relevant construction/manufacture, operators, and maintenance representatives.
• Follow-up Co-ordinator (to ensure that actions are allocated correctly and expedited).

Although it is considered Good Practice for HAZIDs to be undertaken as one-off events by multi-disciplinary teams, recent developments, such as STPA, have challenged this, employing two dedicated engineers, working over longer periods, calling in relevant disciplines on an ad hoc basis. A detailed HAZID can be time consuming, and it may not be Reasonably Practicable to have so many key personnel involved for extended periods. Given that the team event is generally regarded as Good Practice, this approach may need to be justified and managed under clear rules and agreements to ensure that those disciplines will be involved when needed. However, its main advantage is that it removes time constraints, gives greater flexibility, and access to more disciplines and individuals, and even site visits.

The HAZID should begin with a team briefing that includes a full description of the System and any relevant technical data, such as operational and technical goals, design criteria, materials (primarily those items in properties and system characteristics from Table 1), together with the operating envelope and environment. This should be supported by information on past incidents, accidents, and any relevant precursors from similar operational experience. This stage can generate significant debate, and this should be encouraged, as it can generate many action points prior to the more formalised brainstorming phase.

Table 2 shows a typical worksheet structure, with the type of headings that could be employed.

Example 4.2 -
Brent Bravo Offshore Platform Fatalities

In 2003 two persons working inside a leg of the Brent Bravo platform in the North Sea were
engulfed by hydrocarbon gases from a leak. They were unaware of the anaesthetic effect of these
gases, and they collapsed and died. Hydrocarbons are highly flammable and potentially explosive,
so there were many RRMs to prevent ignition of the gas but, perhaps because of the severity of
these threats, no one had considered the anaesthetic Hazards. Nevertheless, the Hazard was well
understood by divers who work in the oil and gas industry. It is possible that a HAZID could have
identified the Hazard.



System Under Investigation

Columns: Guideword | Hazard | Worst Reasonably Foreseeable Consequences and Escalations | Causes | Safety Goals | RRMs | ALARP justification or action for further study | Action Party and date

Table 2: Possible Worksheet Structure for a HAZID
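For traceability, each worksheet row can be held as a record whose fields mirror the Table 2 headings. A minimal Python sketch; the class and field names are hypothetical renderings of those headings, and the example values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class HazidRow:
    """One row of a HAZID worksheet, mirroring the Table 2 headings."""
    guideword: str
    hazard: str
    worst_foreseeable_consequences: str              # Including escalations
    causes: list[str] = field(default_factory=list)  # Optional column
    safety_goals: str = ""
    rrms: list[str] = field(default_factory=list)
    alarp_justification_or_action: str = ""
    action_party_and_date: str = ""

# Illustrative entry only.
row = HazidRow(
    guideword="Fire",
    hazard="Fire in basement switchgear room",
    worst_foreseeable_consequences="Smoke spread to escape stairwell; multiple fatalities",
    rrms=["Fire and smoke detection", "Compartmentalisation, e.g. fire doors"],
    alarp_justification_or_action="Further study: smoke ingress modelling",
    action_party_and_date="Fire engineer, 2021-10-01",
)
```

A structured record of this kind also makes it simple to export the worksheet into a Hazard register later in the lifecycle.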

The development of guidewords and worksheet structure should therefore be part of the HAZID preparation, prior to the team discussion. The columns in use and their content will depend on the type of problems being solved.

Each Hazard is then explored in more detail, by populating the columns, to determine whether suitable RRMs can be identified, whether they will be effective, and whether they are all that is Reasonably Practicable. If not, it may be necessary to recommend further study.

The team should work in steps, through each guideword, developing the additional columns (as in Table 1).

Causes: This column is optional, as it may be too complex to identify all the potential Causes. If this is the case, then the strategy will either be to effectively mitigate the consequences or recommend further study. Nevertheless, it could draw on Table 1 above to help identify Causes.

Consequences and Escalations: It is generally within the capabilities of engineers to evaluate the Reasonably Foreseeable Consequences within an order of magnitude, which is accurate enough. However, this will require the team to identify any potential escalations that might occur and how these could increase the consequences, as illustrated by Example 4.3.

Example 4.3 - Escalations in Grenfell Tower

A relatively small fire, which should not have threatened anyone, escalated into a major catastrophe. The building had multiple escalation routes, as the initial fire, which should have been compartmentalised, was understood to pass through open windows, distorted window frames, kitchen vents and fire doors that either failed to shut or were later forced open by fire-fighters. All these problems could have been identified in a HAZID.

Safety Goals: It may also be beneficial to include a column to define the safety goals, as this will help in determining the effectiveness of RRMs, defining the Lifecycle Criteria (Section 7), facilitating Goal Structuring Notation (Section 8.2), and documenting the final demonstration that all reasonably practicable RRMs are in place.

RRMs: These may be identified from the Hierarchy of Controls (Figure 3) and the RRM guidewords (Table 4).

ALARP Justification/Further Study: If it is clear at this stage that there are no further RRMs that could be implemented, it may be possible to demonstrate that all Reasonably Practicable Risk controls are in place; otherwise it will be necessary to undertake further study work. The RRMs or follow-on studies should be proportionate to the consequences, so it could be prudent to include a column to categorise this, by referencing the Proportionality Matrix above (Matrix 1).

Action: The Action columns are used to ensure that RRMs are implemented, or further studies are carried out, and that a suitable party is given this responsibility.

It is important to create and maintain the right team dynamic, which is free of cognitive biases (Appendix A5). If the right attitudes and behaviours are not exhibited, it may be necessary to address these with individuals, or even remove them from the process.

If the subject matter is characterised by activities, it may be more appropriate to substitute or supplement the HAZID with a Task Risk Assessment (Section 6.3.1).



4.3 Comprehensive HAZID

Systems that have greater complexity and potential for multiple fatalities may require greater attention to detail in the HAZID, as indicated in the Proportionality Matrix (Section 3.3). The basic HAZID applies the guidewords to the subject in general, but the objective here is to expand the guidewords in a more systematic manner and address the subject bit by bit, representing each part in its most helpful form for Hazard assessment. Two methods of doing this are proposed, hierarchical or contextual, depending on the characteristics of the System being addressed.

4.3.1 Hierarchical Guideword Expansion

As for the basic HAZID, the relevant guidewords may be identified from an industry standard list, or from Table 1 above. These will then be expanded to reflect any factors that may lead to the problem (like working backwards through a fault tree). Example 4.4 shows how the corrosion guideword may be expanded to suit a specific System. Working with the relevant specialists, each Foreseeable form of corrosion is identified and then further expanded to reflect any relevant factors that may influence it, such as failure modes, equipment items, loads, properties, activities, or human errors/factors. (NB. In some cases it may be desirable to use Boolean notation, with AND & OR gates, but this was not considered to add sufficient value to this example.)

The greater this expansion, the closer the guidewords get to the potential Causes of accidents. Each of these may then be transferred to the HAZID worksheets for further analysis and identification of RRMs. Ideally, the expansion should be detailed enough for the HAZID to demonstrate that risks have been reduced to ALARP, although it may be preferable to action further studies, especially where the System is too complex for this approach.

Example 4.4 - Hierarchical Guideword Expansion

Corrosion
├─ Galvanic
├─ Surface Loss
├─ Intergranular
├─ Coating Failure
└─ Inspection

[In the original diagram each branch expands further into influencing factors such as water, oxygen, material, welding, heat treatment, temperature, UV, abrasion, stainless bolting and insulation failure, and, for inspection: visual, NDT, procedures, frequency, access and competence.]
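Hierarchical expansion is naturally represented as a tree whose leaves become worksheet entries. A minimal Python sketch using a nested dict and a walk that emits fully qualified guidewords; the structure shown is an illustrative fragment, not the full Example 4.4 tree.

```python
# Illustrative fragment of a hierarchical guideword tree.
TREE = {
    "Corrosion": {
        "Galvanic": {"Water": {}, "Oxygen": {}},
        "Inspection": {"Visual": {}, "NDT": {}, "Competence": {}},
    }
}

def leaves(tree, path=()):
    """Yield each root-to-leaf path as an expanded guideword."""
    for name, children in tree.items():
        new_path = path + (name,)
        if children:
            yield from leaves(children, new_path)
        else:
            yield " / ".join(new_path)

for guideword in leaves(TREE):
    print(guideword)   # e.g. "Corrosion / Galvanic / Water"
```

Each emitted leaf, such as "Corrosion / Galvanic / Water", can be transferred directly to the worksheet for RRM identification.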



4.3.2 Contextual Guideword Expansion

Another method of expanding the guidewords is to select the most beneficial context and divide this into appropriate sub-categories. Figure 4 suggests some different contexts, of which the most suitable are then used to capture the Hazard types, e.g. fire in a lift shaft, engine failure during take-off etc. The Hazard is therefore described in two parts, or more if appropriate.

This requires greater preparation for the HAZID, but it can yield savings, by enabling some Hazards to be closed out without the need for further study work.

Each one of these context categories can then be developed into guidewords, as shown in Example 4.5, for a building. Contexts should only be included if they could affect safety.

[Figure: contextualisation categories arranged around the Hazard context - Activity, Function, Item, Lifecycle stage, Location, Set-up Condition, Environment and Proximity.]

Figure 4: Some Hazard Contextualisation Categories

Example 4.5 - Developing Contextual Guidewords for a High Rise Building

Activity | Item | Location | Set-up/Condition | Proximity | Environment | Lifecycle stage | Function
Maintenance | Lift | Basement | N/A | Main road | Day/Night | N/A | Sleeping
Cleaning | Cladding | Kitchens | - | Park | Heatwave | - | Cooking
Removing refuse | - | Stairwell | - | Electricity substation | - | - | Storage

A context could be a combination of these columns, such as maintenance of lift at night. This could generate an impracticable number of Hazards, so a pragmatic approach will be needed. As stated in the definitions, a Hazard is a useful description, so this could mean identifying sufficient detail in the preparation to capture the full scope of hazardous conditions, whilst leaving other more detailed categories for the brainstorming. The most relevant category, or two, would therefore be selected upfront for defining Hazards, as shown in Example 4.6, leaving the other categories to be brainstormed in the team event.
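The scale of the combination problem, and the pragmatic pruning it forces, can be illustrated with a short Python sketch; the category values are taken from Example 4.5, but the filter rule is a hypothetical illustration, not a rule from this guide.

```python
from itertools import product

# Two upfront categories fixed for defining Hazards (per Example 4.6)...
activities = ["Maintenance", "Cleaning", "Removing refuse"]
locations = ["Basement", "Kitchens", "Stairwell"]
# ...other categories left for the team to brainstorm.
environments = ["Day", "Night", "Heatwave"]

all_combinations = list(product(activities, locations, environments))
print(len(all_combinations))  # 27 already, before adding more columns.

# Pragmatic pruning: keep only combinations judged capable of affecting
# safety (hypothetical rule - real filtering needs recorded justification).
candidates = [c for c in all_combinations
              if not (c[0] == "Cleaning" and c[2] == "Heatwave")]
```

Fixing only one or two categories upfront keeps the defined Hazard list manageable while the remaining columns are explored live in the team session.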



Example 4.6 - Identifying RRMs in a Building

Columns: General Guideword | Context (in this case by Location) | Guidewords from other Categories | RRM

Fire / Basement:
• General - Fire and smoke detection system; compartmentalisation, e.g. fire doors; passive fire protection on structural elements.
• Refuse - Prohibit storage of refuse in basement.
• Electrical Switchgear - Electrical isolation on fire or smoke detection; Halon substitute system, e.g. HFC.
• Fittings and furniture - Sprinkler system; non-flammable furniture.

Fire / Boiler room:
• General - Fire and smoke detection system; compartmentalisation, e.g. fire doors.
• Oil - Ventilation shutdown on fire/smoke detection.
• Clothes drying and lint - Sprinkler system; fire extinguishers.

This example is relatively straightforward and the RRMs are mostly Good Practice, with nothing particularly novel, so the HAZID should be able to close out most of the Hazards without further study. The most common guidewords for defining the Hazards may be activity or location/item, but this will depend on the type of System being engineered.

Example 4.7 illustrates how this might have applied to some of the UK's most significant major accidents in rail, aviation, shipping, buildings, chemicals and oil and gas. This shows that contextualisation gives more detail but does not identify the exact scenario or the reasons for the accidents, which would require further guideword expansion. However, its main advantage is that it should ensure that all possible Hazards are defined at a sufficiently high level that nothing will be overlooked.



Example 4.7 - How Guidewords Might Have Identified Some UK Related Major Accidents

Accident | General Guidewords | Contextual Guideword
Aberfan | Landslide | Near village & heavy rain (Proximity, Weather)
BG Rough | Corrosion or Fracture | Exchangers (Item)
Buncefield | Explosion | Tank overflow, no one present (Item, Proximity)
Clapham Junction | Collision | Signal failure (Function, Item)
Flixborough | Fracture | Flexible coupling (Item)
Grenfell Tower | Fire | Kitchen, cladding (Location, Item)
Herald of Free Enterprise | Sinking or loss of stability | Doors fail to shut, underway (Activity, Condition)
Hillsborough | Crushing | Loss of control on entry (Activity, Proximity)
Kegworth | Loss of Power | Single engine landing (Condition, Proximity)
Ladbroke Grove | Collision | Signal failure (Function, Item)
Nimrod | Explosion | Fuel tank overflow, refuelling (Location, Condition)
Piper Alpha | Fire and/or Explosion | Adjacent control room, maintenance (Activity, Location, Proximity)
Titanic | Collision | Iceberg not seen (Environment, Activity)

Contextual guideword categories are shown in brackets.

4.3.3 The Degree of Guideword Expansion

Whenever the Foreseeable consequences of a hazard are significant enough, and there is potential to identify more RRMs, then further study or guideword expansion may be justified. The level of analysis should be Proportionate to the consequences of each Hazard, some of which would be dealt with by a single guideword, as in a basic HAZID (e.g. electric shock or slips, trips, and falls, which would be unlikely to affect more than one person), whereas more guidewords may be necessary for fire, which may have the potential for multiple fatalities.

For some Hazards more advanced methodologies, such as FMEA, HAZOP or STPA, may be preferable, so it may be better to raise actions for them to be analysed that way. There is no point in working through the failure modes of every item of equipment in the HAZID if this could be done more effectively by a few specialists in an FMEA. However, the scope of the FMEA may not be as broad as the HAZID, so care should also be taken that scenarios are not overlooked if this course of action is taken too early.

Nevertheless, the HAZID cannot be expected to identify all Causes, which may be too numerous and subtle. Example 4.8 is a good illustration of these difficulties and why further study may be necessary. The guidewords in this case could have been 'explosion' and 'fuel tanks', but this is too superficial to identify such things as swarf from drilling and riveting causing short-circuits between aging high and low voltage cables, which would ignite fuel vapours (the industry subsequently changed the procedures for drilling and riveting on aircraft).



Example 4.8 - Boeing 747, TWA 800 Aircraft Crash

This aircraft exploded shortly after take-off with a loss of 230 lives. Although the main Cause has never been established for certain, the NTSB report concluded that the most likely scenario was that the flammable vapours in the fuel tank exploded, due to the following Causes:
1. High and low voltage cables run in the same bundles.
2. Cracked insulation on aging cables (this is thin to save weight on aircraft).
3. Build-up of lint around connectors.
4. Swarf causing short-circuiting between high voltage and low voltage cables.
5. Build-up of electrically conductive silver sulphide deposits in the fuel tank.
6. Hot air conditioning packs located under the fuel tanks.
7. A design philosophy that called for eliminating all potential sources of ignition from fuel tanks rather than assuring that vapours in the tank would never be flammable.

Nevertheless, although drilling may be an operational issue, the other six Causes are all engineering related, and the consequences were so severe that they would justify significant analytical effort.

There is a separate argument that some of these Causes were general industry-related problems, whilst others were unique to this design of aircraft, i.e. cables running in the same bundle and swarf may have been equally true for many other types of aircraft, whereas the remaining five Causes were not. From a legal perspective this raises the question of whether any of these could be regarded as an 'incidence of everyday life' (Section 2), because they are well established and tested practices, and should therefore be excluded from the HAZID. If this were to be the case, then it may be possible to filter out some of the contextual categories in Figure 4 as irrelevant to some Systems. In practice, such a concept may be difficult to argue, as is illustrated by Example 4.9.

Example 4.9 - Buncefield

In this incident a tank level alarm failed during filling and the tank overflowed into a bund, which was designed to take the tank capacity. However, the incident happened on a Sunday, when no one was around, so it went unnoticed, the filling continued, and the bund overflowed. A large gas cloud formed, resulting in an explosion that damaged the plant offices, local businesses, and residential properties. Fortunately, there were no fatalities.

The criminal prosecution resulted in approximately £10 million in fines and costs. Civil liabilities amounted to around £700 million.

It could be argued that the tank, its level alarm, and bund were all fairly industry standard designs and therefore constituted Good Practice. However, one Cause of the accident was the absence of personnel nearby (contextual category - Proximity), so the assumptions made in all other Systems may not have been valid in this case. The purpose of the HAZID is to identify such assumptions, omissions, and errors, so the granularity of the analysis may need to be greater than that which has become commonplace in industry.

Whilst a basic HAZID might employ around eight specialists for a day or two, a comprehensive one could take these people away from their jobs for significantly longer, which may not be practicable. Team brainstorming has come to be regarded as Good Practice, so any deviation from this may need to be justified. However, other methodologies, such as STPA (Section 6.4), use a core team of two persons who spend much longer periods analysing the Risks, possibly for weeks or months, liaising with specialists as and when required. This may be a reasonable strategy for a comprehensive HAZID, or at least for the preparation phase, in which case the HAZID team would work on pre-defined Hazards, enabling them to look at more detail, such as the Causes. A range of different approaches is therefore possible, but the most effective one should be adopted, regardless of assumed Good Practice.



The structure of the HAZID worksheet may also vary, depending on the approach, with extra columns for contextual guidewords, as in Table 3. However, there is no universal HAZID structure or set of guidewords, so it will always be the responsibility of the persons executing the HAZID to optimise it, and this may require some forethought and open-mindedness.

System Under Investigation

Columns: Hazard (General Guideword, Table 1; Pre-defined Contextual Guidewords) | Other Guidewords | Worst Reasonably Foreseeable Consequences and Escalations | Causes | RRMs | ALARP justification or action for further study | Action Party and date

Table 3: Possible Worksheet Structure for a Comprehensive HAZID

If the subject matter is characterised by activities, it may be more appropriate to substitute or supplement the HAZID with a Procedure HAZOP (Section 6.3.2) or STPA (Section 6.4).

4.3.4 Guidewords for Identifying RRMs and Actions for Further Study

Other than the RRMs from Good Practice and standards, it will be necessary to consider bespoke or novel measures. These may relate to prevention or mitigation, either reducing the likelihood or consequences of the loss. Table 4 categorises some of the potential types of Risk reduction to illustrate how guidewords could be developed for this purpose. The priorities should still be in line with the Hierarchy of Controls (Figure 3), but the assumption here is that inherent safety could not be achieved, so more avenues will need to be explored.

Procedural solutions are rarely reliable RRMs and may need to be backed up with appropriate training, warnings, monitoring, and a system of authorisations. Automation may not be able to deal with all situations, as it may need to be over-ridden with manual intervention.

These options may create a need for further study, as suggested in the last column. For example, placing a guard on a machine may be all that is Reasonably Practicable without significant effort or analysis, but if there are Reasonably Foreseeable situations where the guard would not be effective, such as during maintenance, then a more detailed causal analysis may be required.

An RRM may also create unexpected new Hazards or adversely affect other Risks, so RRMs should always be checked to make sure these have been identified and addressed.



Potential RRM Guidewords

Format: potentially controllable aspect of the Hazard - examples/guidewords - some possible further study methodologies.

Causes:
• Hazardous Material - Eliminate, substitute, reduce; toxic, flammable, reactive - Inventory review, specialist materials studies (6.15).
• Function and Complexity - Change, simplify, eliminate, automate; remove people - Technical re-evaluation.
• Passive Causes - Wear and tear, ageing, material deterioration, component, latent build-up of Risk - Technical review, STPA (6.4).
• Reliability - Redundancy, common cause failure, diversity, self-revealing failure, fail to safety, over-rides - RRM Effectiveness Review (6.1), FMEA (6.5), FSA (6.9), STPA (6.4).
• Control and Software Error - Decisions, communications, actions, feedback - STPA (6.4).
• Activities and Procedures - Interfaces, close coupling; proceduralising tasks that are Safety Critical - TRA/Procedure HAZID (6.3), STPA (6.4), Human Factors Analysis (6.2).
• System Configurations and Loads - Exceeding design envelope, future changes to operations or design - HAZOP (6.8), STPA (6.4).
• Natural, Environmental - Wind, icing, earthquake, lightning, wetting, tsunami - Environmental data and consequence analysis (3.0).
• Ergonomics - Layout, physical limits, access, control panel/room design - Ergonomics (6.2).

Prevention:
• Active Prevention - Health monitoring, relief valves - FMEA (6.5), FTA (6.6), FSA (6.9), STPA (6.4).
• Passive Protection - Guards, locks, isolations - Bespoke studies.
• Error Recovery (human error) - Control system design, built-in time to correct/recover - TRA/Procedure HAZID (6.3), STPA (6.4), HE/HFA (6.2).

Consequences:
• Initial Severity - Quantification of immediate health/mortality effects - Consequence analysis (3.0).
• Escalation - Fire, explosions, collapse - Consequence analysis (3.0).

Mitigations:
• Passive - Fire/blast walls, airbags - Bespoke studies.
• Active - Alarms, fire sprinklers, isolation/shut-down systems - Bespoke studies.
• RRM Effectiveness - Redundancy, common Cause failure, failsafe, diverse - RRM Effectiveness Review (6.1).

Emergency Response:
• Self-Rescue - Escape routes, lifejackets/boats - HFA (6.2) & bespoke studies.
• Organisation Response - ER Plan, rescue, first aid, fire-fighting, decontamination, identifying missing persons - TRA/Procedure HAZID (6.3), STPA (6.4).
• External Response - Communications, data, rescue, fire-fighting, evacuation of area.

Table 4: Potential RRM Guidewords



4.4 HAZID Conclusions

There is no right way to carry out a HAZID because every Hazard may present a different set of variables and conditions. Nevertheless, with good preparation it should be possible to develop a systematic, rigorous process that is both Proportionate and defensible. An experienced leader should then keep it moving at the right pace, maintain motivation and generate the right actions.

However, it should be recognised that experience has positive and negative aspects. The purpose is to find unknown unknowns and so, whilst experienced people may have extensive knowledge of the System and be able to identify historical problems, they may dismiss a unique situation as low Risk because they have not experienced it (Appendix A). The inexperienced person asking basic questions may therefore be the one to find the unexpected Hazards, or their Causes.

The first HAZID may take place early in a project, often at feasibility stage, but may need to be updated or repeated as the design progresses. New Hazards may arise, and others may be eliminated, so the timing will need to be early enough to implement changes, but not before all relevant design changes have been agreed. At certain stages throughout the lifecycle there may need to be Creeping Change HAZIDs (Section 9.2) or updates for any modifications.

The HAZID is not mandatory though, and some industries or disciplines may choose to move directly to other methodologies, but only if the case can be made that all Hazards will be identified. Many industries have standardised analytical tools that undertake detailed identification, such as HAZOP for process industries and STPA for aerospace, but HAZID has the potential to identify the widest range of Hazards and may also provide valuable input data for those other methodologies.

Addressing the consequences and potential escalations will often be the preferred Risk reduction strategy, and these can often be identified in the HAZID. Consequence reduction measures may also be the most cost effective. However, they may not always be feasible, such as with aircraft crashes, so the main effort would then be on identifying Causes and their prevention (Section 6).

However, causal factors may be numerous and elusive, as they may be a function of complex control systems, computer software, interactions, interdependencies, management systems, procedures, human factors, ergonomics, and weaknesses in RRMs. The Proportionate granularity for the HAZID, or any follow-on study, may only become clearer as the studies progress.

The HAZID can also contribute to, or even act as, a Hazard register, which may be a valuable data source throughout the lifecycle (Section 7) and for the safety report (Section 8).

The HAZID process can therefore demonstrate that all Reasonably Practicable Risk controls are in place if:
• all RRMs have been implemented (Section 2) and there is clearly no potential for further RRMs, or
• the residual consequences are sufficiently benign, time limited and affect few people, or
• no Foreseeable Risks remain.

If so, Sections 5 and 6 of this document may be skipped. If RRMs are considered to be Grossly Disproportionate, a WRA may have to be produced in compliance with Section 2.3 above.



5. EVALUATING REASONABLY
FORESEEABLE CONSEQUENCES

Key Messages

The purpose of this stage is to:


• Model Hazard severity and escalations, such as impact, fire, dispersion and explosion.
• Evaluate effects on health, injury, and mortality.
• Identify simple RRMs, to avoid the need for detailed causal analysis.

A conservative judgement of the Reasonably Foreseeable consequences, which may have been
made during the HAZID, might not always be sufficient to identify suitable RRMs.

This may be based on:


• Industry standard calculations or Hazard ranges.
• Computer simulations or Physical Effects Modelling.
• Empirical testing.
• Historical data, provided it is based on Robust Statistics.

Potential escalations of the Hazard will need to be considered.

RRMs to reduce the consequences are normally preferred.

The simplest means of mitigating a Risk may be to address the consequences, as the Causes may be manifold and relatively elusive. In some cases, where the Hazards are known, such as fire, this could be the starting point of the evaluation.

The consequences may have been evaluated conservatively using professional judgement during the HAZID process, but it may be necessary to undertake simulation or testing to understand their severity and whether they could escalate to something more significant.

The Reasonably Foreseeable consequences could be a function of the following factors:
• The nature of the Hazard or loss of control.
• The Failure Mode, which may influence the extent of the Hazard.
• Whether and how it could escalate to a larger or different Hazard.
• Its effects on health, injury, or mortality, including who will be affected and how many, e.g. constructors/manufacturers, operators/users, maintainers, other staff, customers, public and emergency responders.

The Reasonably Foreseeable consequences will not necessarily be the worst possible (Section 2.2), provided a WRA can be made to dismiss the latter on the grounds of sufficiently effective and reliable RRMs, or Robust Statistics (Appendix A2) showing that the likelihood of the conditions required for it to occur would not be plausible. The consequences may also be evaluated for each Foreseeable Failure Mode, if applicable, e.g. fires due to a small corrosion leak and a pipe rupture may have quite different consequences.

The validity of potential escalations may need to be checked with empirical testing, calculation, or simulation (typically Physical Effects Models, which may utilise Computational Fluid Dynamics). CFD models are popular for simulating fires, explosions, and the dispersion of toxic or flammable gases. Where escalations are Reasonably Foreseeable, the consequences could increase substantially, so consequence reduction measures should be considered to prevent or mitigate them.



The effects on health, whether by illness, injury, or death, are governed by the Control of Substances Hazardous to Health Regulations (COSHH), and Workplace Exposure Limits (WELs) are provided by HSE. Section 2.2 gives some relevant case law (Baker v Quantum Clothing Group, 2011) regarding the availability of such information and what the Duty Holder could be expected to know.

For consequences to people, it would be prudent to start from the maximum number that could be exposed, as this may be reduced later when considering the scenarios and activities that may be associated with the failure modes. The numbers affected may often be estimated using informed professional judgement, together with the simulations to determine the extent of the Hazard.

5.1 Consequence Reduction Measures

The consequence evaluation should generate ideas for consequence reduction, such as:
• Using less Hazardous materials.
• Building in greater factors of safety to make the consequences not Foreseeable.
• Layout and proximity of hazardous equipment and activities to prevent escalations and impact on people.
• Protection and orientation of escape facilities/routes relative to Hazards and their potential escalation.
• Access to emergency equipment or control points, e.g. equipment isolation and shutdown, fire-fighting.
• Prevention of escalation by passive means, such as containment/RRMs and/or improved survivability, e.g. fire and blast walls, passive fire protection and Personal Protective Equipment (PPE).
• Active safety systems, e.g. alarms, shut-down or fire-fighting.
• Removal or minimisation of exposed personnel, e.g. automation or remote control of Hazards and/or safety systems.
• Emergency response, rescue, and treatment of survivors.

Many of these RRMs are covered in Good Practice or standards for each industry or discipline, because they are typically generic solutions. However, they may not always be sufficient or appropriate. Individual cases may have different or additional Hazards than those assumed in Good Practice or standards, so it may be appropriate to carry out detailed analysis (Section 6).

If these measures eliminate or sufficiently reduce the consequences it may be possible to demonstrate that all reasonably practicable RRMs are in place without further analysis. However, if an RRM Effectiveness Review (Section 6.1) reveals weaknesses and the residual consequences are still significant, then further analysis may become appropriate, specifically to look at measures that reduce the likelihood of the events occurring.



6. ANALYSIS

Key Messages

The purpose of this stage is to identify and evaluate:


• The System’s susceptibility to the underlying causes of loss.
• RRMs that could either reduce the likelihood or consequence of events.
• Any vulnerabilities and limitations in those RRMs.

Analysis of the Hazards should be Proportionate to the Reasonably Foreseeable consequences,


the System characteristics, and the potential to provide or improve RRMs.

Weaknesses in RRMs need to be evaluated against the relevant safety criteria.

The analytical strategy needs to consider the available methodologies, their strengths and weaknesses, and optimise or modify them to suit the subject matter.

Assumptions and omissions should be eliminated wherever possible, or minimised and stated.

Predictive methods, such as QRA and risk matrices, are not recommended analytical techniques.

If the previous stages have not demonstrated that all Reasonably Practicable RRMs are in place using simple solutions, then it may be necessary to reduce the likelihood of events by identifying further Causes or flaws in RRMs. Accidents and their outcomes are either the result of component failure or of interactions between the elements of a System, so each of these may require a different analytical approach. History shows that virtually all accidents have multiple Causes, and typically involve sociotechnical Systems, which can be influenced by such things as actions, omissions, latent conditions, human factors, ergonomics, unclear objectives, errors in the design, software or procedures, environmental factors and any vulnerabilities or limitations in the RRMs. It is therefore rarely possible to attribute a risk to a single Cause. The analysis may therefore need to identify multiple Causes that could influence the likelihood of any Hazard becoming a loss.

The strategy for analysing the Causes may vary significantly. Some methodologies take a blanket approach, such as human factors checklists or management system reviews, to identify anything that might have a causal relationship. Alternatively, others will systematically assess all possible System conditions to identify Hazardous scenarios, such as HAZOP.

A common framework for analysis is to describe loss as a linear chain of events which are independent of each other, such as Event and Fault Tree Analyses, Functional Safety Assessments, Layers of Protection Analysis or Bow Ties. Whilst these may be effective for component failures, they may otherwise be an oversimplification, because each event may be influenced by the others or by separate common factors, for which more complex models are appropriate.

Recent developments, such as STPA, view the System as hierarchical control frameworks with feedback loops, which may be more effective at identifying interactions, especially in sociotechnical Systems. Any System has goals and constraints, which may need to be clearly identified. Processes such as Goal Structuring Notation may help to understand the objectives, define the constraints, and identify their weaknesses. The framework for analysing the Causes will therefore be a major influence on the effectiveness of any study.



The methodologies may be forward or backward looking, either starting with the Causes and understanding how these can develop into Failure Modes and beyond, or by starting at the consequences or Failure Modes and working backwards to identify their relevant Causes, scenarios, unsafe acts, human factors, and escalation mechanisms. There are practical limitations to either approach and it may be appropriate to supplement them with generic methods, such as ergonomics and human factors studies that are not related to particular Hazards.

For complex sociotechnical Systems there could be a multitude of human factors, unsafe acts and sub-System interactions that may Cause a Hazardous situation. This would suggest a forward-looking approach, such as Systems Theoretic Process Analysis (STPA), but this may only be practical if the Reasonably Foreseeable consequences are severe enough. For high Risk events which depend on software systems, such as autonomous vehicles, aircraft, and missile systems, such an approach may have been the only credible methodology, as it is highly unlikely that backward-looking ones would have identified the problem.

Working back along the causal chain may not identify all scenarios but, if effective RRMs can be found, there may be no need to do so. For some types of System, it may be the only practical approach. The process would be stopped when effective RRMs are found, or it becomes clear that the extra analytical effort would be disproportionate to the subject matter. The Buncefield incident (Example 4.9) was a good example of a System that lent itself to a backward-looking analysis, such as HAZOP and Functional Safety Analysis (FSA), as the failure mode was well understood but the RRMs preventing the loss of containment were unreliable and ineffective, especially when no personnel were present.

In some Systems there may be limited control over the Causes of accidents and the focus will need to be on mitigation of escalations and/or consequences by protective measures, rescue or recovery. The Grenfell disaster was one that would have benefitted from such an approach, as many people lost their lives from what started as a relatively random small fire that was beyond the control of the building designers. The strategy for compartmentalising fires failed because there were multiple escalation routes, as the initial flames were understood to pass through open windows, distorted window frames, kitchen vents and fire doors that either failed to shut or were later forced open by fire-fighters. The fire then developed due to the materials and design of the cladding.

Whenever a System contains a human element that is Safety Critical, either because human intervention or error could Cause an event or because the System relies on it to maintain safe operations, there could be reason to employ generic methodologies as well, such as human factors, procedural review, and management systems assessments, which are not scenario specific. The Flixborough and Piper Alpha disasters were examples of failures in management systems, in the management of change and the permitting of maintenance tasks respectively. Where the System integrity depends on management controls it may be preferable to undertake thorough reviews of these.



The methodologies are basically different methods of structuring a problem to optimise the identification of RRMs. For this reason, they should be regarded as flexible tools that can be modified for any given problem. It should also be noted that they can become counterproductive through standardisation. For example, whilst generic Bow Ties or Event Trees may be useful reference material, they may become accepted as solutions to justify safety, rather than opportunities to identify problems, thereby inhibiting analysis and the identification of further RRMs.

In many industries it may be Good Practice to follow certain methodologies, such as HAZOP in the process industries, but this does not necessarily satisfy the legal obligations, so further analysis may be required. Nevertheless, the final demonstration that all reasonably practicable RRMs have been selected should justify the methodologies chosen.

The methodologies should not be regarded as the end of the process, as their results may need to be part of a broader review to be sure that all relevant RRMs have been identified.

Some of the more common methodologies are described here, together with their main advantages and disadvantages. Some are considered to be Good Practice in certain industries or disciplines, but others should be evaluated to determine whether they can add value. In some cases, it may be appropriate to modify or combine the methodologies to suit the problem being analysed.

6.1 RRM Effectiveness Review

Any RRM may be susceptible to failure (which also constitutes a Hazard). Multiple RRMs may give the impression of effective Risk mitigation, although the reality may be much less so. The following general criteria are suggested for assessing single or multiple RRMs, although certain types may have more specific criteria:

1. Effectiveness
- Is the RRM effective for all Hazard characteristics?
- Does it prevent or mitigate the Hazard?
- Would it work under all Hazard conditions and System configurations?
2. Failure to safety
- Would the RRM fail in a manner that would prevent the Hazard?
3. Reliability
- Are the RRM's failure modes understood and minimised?
4. Independence/Redundancy
- Are multiple RRMs completely independent or do they have potential common Cause failures?
- Do different RRMs have the same weaknesses for certain Hazard characteristics, i.e. the Swiss Cheese holes line up, e.g. a common power supply, common maintenance/calibration routines or training omissions?
- Are there interdependencies between different functions in the RRMs?

Example 6.1 - Redundancy in the Fukushima nuclear disaster

This was a good example of common Cause failure, because three 100% cooling pumps that may have been assumed to be independent were simultaneously flooded in the tsunami.

5. Diversity
- Are redundant RRMs sufficiently diverse to ensure that no failure type would occur on both, e.g. similar components that fail for the same reasons or in the same timescale?
6. Self-Revealing Failures
- If any RRM fails, would it be apparent, or raise the alarm to operators?
7. Manual Intervention Requirements
- Systems that require some element of manual intervention to complete the safety function may be less reliable than automated ones.
8. Secondary Risks
- Does the RRM create additional Hazards or increase other existing Risks?

Example 6.2 - Secondary Risk

A gas turbine power generator suffered gas absorption into the lube oil, potentially lowering its flash point to dangerous levels. The lube oil had to be regularly sampled, and the samples were dumped into a pot that could hold up to five of them. Because the pot did not need to be emptied every time, it eventually overflowed, spraying oil onto the turbine, causing a catastrophic fire.



9. Over-rides
- If an RRM can be over-ridden for operational, maintenance or any other reasons this may lead to increased Risks. An over-ride could be as simple as an electrical isolation by switching, or as complex as process systems being taken out of service. Ideally, over-rides should be avoided, but they may be necessary if the whole system cannot be shut down, in which case additional controls will be necessary to minimise and manage the outage.
10. Deterioration/Creeping Change
- Will the RRM deteriorate or slowly change with time?

Example 6.3 - Accident Involving Over-rides

The Smiler is a steel roller coaster located at Alton Towers in the United Kingdom. It opened in 2013 as the world's first 14-inversion ride. It can handle up to 5 trains at a time, each with up to 16 passengers. A serious accident occurred in 2015 due to over-riding the control system.

The track is divided into blocks, with the intention that the control system will not permit two trains in any block at the same time. Following a fault status on the system the passengers were removed to test the track and bring it safely back into operation. However, a series of complicated fault conditions led to a stalled train remaining in one of the blocks, which was not noticed. The system had over-rides to allow the technicians to clear the block and this was used to allow operations to resume and the first occupied train to proceed, which then collided with the stalled train, causing serious injuries.

Ideally, the System should have been designed to be inherently safe, but this is not always possible. The net result was a complex control system that necessitated over-rides and manual intervention to keep it operational. This type of System is ideally suited to analysis by STPA (Section 6.4).

Example 6.4 - Bhopal

Bhopal was one of the worst industrial accidents ever, with thousands of deaths. However, it failed on nearly all the above criteria. There were many contributory factors (accident summaries are readily available), but the intention here is simply to relate these to the above criteria:
1. Effectiveness – The scrubber and the water curtain for neutralising the Methyl Isocyanate (MIC) were undersized. The water could not even reach the cloud.
2. Failure to safety – Despite several RRMs failing, none of them stopped the process.
3. Reliability – Some instruments failed or were wrongly calibrated. Equipment replaced with cheaper materials corroded, contributing iron, which made the reaction worse. The refrigeration unit was unreliable and was shut down. Frequent alarms meant that no one took notice.
4. Independence/Redundancy – Although the tank ullage, refrigeration, scrubber, flare, and water curtain appeared to offer multiple redundant RRMs, the common mode of failure was a backlog of maintenance.
5. Diversity – The RRMs were at least diverse, as they all had different functions and involved different equipment.
6. Self-Revealing Failures – The lack of a slip disk (spectacle blind) was not self-revealing.
7. Manual Intervention Requirements – Almost all the RRMs required some form of manual intervention to carry out their safety functions.
8. Secondary Risks – The tanks created latent Risks, as ullage was used for storage, permitting greater quantities of Hazardous material to be held with less safety buffer.
9. Over-rides – Whilst over-rides normally apply to active systems, there were both active and passive RRMs over-ridden in this case (refrigeration, scrubber, alarms, and tanks).
10. Deterioration/Creeping Change – The plant suffered from corrosion and wear and tear on equipment items, some of which were ultimately taken out of service.

This should help determine the optimum safety philosophy, i.e. how to deal with Hazards, their Causes and consequences, and demonstrate either that the Risk has been reduced to a level that is not Foreseeable (Section 2.2), or that further RRMs would be Grossly Disproportionate. There is no formula for this, but failure may be classified as not Foreseeable if enough RRMs sufficiently meet the criteria to make it evident that the likelihood of complete failure is so low that it may be dismissed.
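Where such a review is carried out, it may help to record the findings in a structured form. The sketch below is illustrative only, assuming a simple record per RRM against the ten criteria above; the field names and the example entry are hypothetical rather than a prescribed schema:

from dataclasses import dataclass, field

# Illustrative sketch only: the ten criteria from Section 6.1, with a
# hypothetical record structure for noting weaknesses against each.
CRITERIA = [
    "Effectiveness", "Failure to safety", "Reliability",
    "Independence/Redundancy", "Diversity", "Self-Revealing Failures",
    "Manual Intervention Requirements", "Secondary Risks",
    "Over-rides", "Deterioration/Creeping Change",
]

@dataclass
class RRMReview:
    rrm: str
    findings: dict = field(default_factory=dict)  # criterion -> noted weakness

    def weaknesses(self) -> list:
        # Return the criteria against which a weakness has been recorded.
        return [c for c in CRITERIA if self.findings.get(c)]

# Hypothetical example entry.
review = RRMReview("Relief valve", {"Reliability": "No inspection routine defined"})
print(review.weaknesses())  # ['Reliability']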



The final criterion, Secondary Risks, is important because RRMs often create additional unintentional Hazards or Risks, which need to be understood (Section 8.1, WRA Example 3).

6.2 Human Error, Human Factors, and Ergonomics

Human factors apply to the physical, cognitive, and organisational aspects of Risk, which can lead to human error, the latter of which should be the primary focus during the design phase.

Human error may be summarised into five categories:
• No action.
• Wrong action.
• Correct action, wrong item/location/condition/set-up/proximity.
• Too little/too much.
• Too early/too late/out of sequence.

All Safety Critical manual control functions or actions within the System should be examined against these criteria. For simple Systems this could be applied at the HAZID level by applying the relevant guidewords (Table 1, Section 4.2), or by TRA, Procedure HAZOP (Section 6.3), or STPA (Section 6.4).

The objectives during design should be to simplify control systems wherever practicable and minimise the amount of human interaction necessary. The optimum use of materials and technology can minimise control reliance, interventions, maintenance, and inspection, which will reduce both the potential for human error and the exposure of individuals to the Hazards. Automation may be attractive, but this can introduce other problems if it is too complex or cannot cater for all situations.

Alternatively, it may be necessary to design so that the Hazards, or their Causes, are removed prior to any manual intervention, such as by interlocks or physical barriers.

Human error is increasingly recognised as a symptom, not a Cause. These analyses establish what can go wrong, and therefore become the basis for asking why it would happen, which is human factors. This includes cognitive functions, such as attention, detection, perception, memory, judgement, and reasoning (including heuristics and biases), and decision making. It is often reviewed under the categories of skills, rules/procedures, and knowledge, or by Performance Influencing Factors (PIFs), for which the Health and Safety Executive lists the following categories:

Job factors:
• Clarity of signs, signals, instructions, and other information.
• Difficulty/complexity of task.
• Routine or unusual.
• Divided attention.
• Procedures inadequate or inappropriate.
• Preparation for task (e.g. permits, risk assessments, checking).
• Time available/required.
• Communication, with colleagues, supervision, contractor, other.

Technical (Ergonomics):
• Working environment (noise, heat, space, lighting, ventilation).
• Tools appropriate for task.
• System/equipment interface (labelling, alarms, error avoidance/tolerance).

Person factors:
• Physical capability and condition.
• Fatigue (acute from temporary situation, or chronic).
• Stress/morale.
• Work overload/underload.
• Competence to deal with circumstances.
• Motivation vs. other priorities.

Organisation factors:
• Work pressures, e.g. production vs. safety.
• Level and nature of supervision/leadership.
• Communication.
• Manning levels.
• Peer pressure.
• Clarity of roles and responsibilities.
• Consequences of failure to follow rules/procedures.



• Effectiveness of organisational learning (learning from experiences).
• Organisational or safety culture, e.g. everyone breaks the rules.

Whilst many of these are not engineering functions per se, they may be greatly influenced by the design of the System and therefore need to be considered at that stage. The HSE recommend the following stages for evaluation:
1. Consider the System Hazards.
2. Identify manual activities that affect these Hazards.
3. Outline the key steps in these activities.
4. Identify any potentially Safety Critical human errors in these steps.
5. Identify factors that make these failures more likely (e.g. PIFs).
6. Manage the failures using the hierarchy of controls (Section 4.1).
7. Manage error recovery.

Human error is a technical subject that must be analysed at a scenario level, but human factors include psychological aspects, which are more generic, and may involve several studies addressing the different types of PIF, see Table 5. Some useful guidewords relating to human factors were given in Table 1 and are repeated below to illustrate the main aspects that may need consideration during the engineering of a System:

Human Factors Guidewords from Table 1

Instruction: Quality/Language/Compulsion             Lighting
Knowledge: Data/Competence/Skills                    Heat
Information: False/Irrelevant/Distracting/Overload   Timing/Urgency
Feedback: Weak/Slow/Sensor Failure                   Access
Layout: Simple/Intuitive/Understandable              Physiology

Perhaps the most important of these is Layout, which could refer to the whole system or a control panel. If the System comprises a single lever to move one arm, it may be intuitive and understandable, but if there are ten levers and it is not obvious what each is connected to then it may be neither intuitive nor understandable (Example 6.5).

                 Human Error                  Human Factors
Discipline       Technical                    Technical and Psychological
Application      Scenario Specific            Generic
Methodologies    HAZID, TRA, HAZOP or STPA    HFA, Ergonomics, Control Room/Panel Studies

Table 5: Comparison of Human Error and Human Factors



Control room/panel design reviews are considered Good Practice wherever there is a human machine interface that may have Safety Critical aspects. If this cannot be achieved, then procedures and competency requirements should be designed to reduce the likelihood of human error.

Ergonomics reviews are a key part of the design process to assess the physiological aspects of changing the System status. Computer Aided Design (CAD) packages normally facilitate this.

Human Reliability Assessment (HRA) is a method of defining Human Error Probabilities (HEPs), which can only be determined from Robust Statistics or qualitative arguments. However, unique designs cannot have Robust Statistics, so this form of quantification is unreliable (Appendix A2). Qualitative arguments will address the PIFs, which can demonstrate whether all Reasonably Practicable measures have been taken. HEPs provide no such justification and their only benefit may be that they are easier to record than qualitative arguments, so they are not recommended.

Example 6.5 - The Piper Alpha Disaster

In 1988 a fire starting from an export pump led to the complete loss of an offshore oil and gas platform in the North Sea together with 167 lives. The pump, which was one of two, was under maintenance and the relief valve had been removed. Due to confusion at the Permit to Work handover, the operators were unclear whether the pump was serviceable and started it, causing a high pressure leak of flammable liquids. However, the physical appearance of the pump was unchanged because the relief valve was not on the skid, but high up in the vent pipework above. There are good reasons for locating it there, but had it been possible to place it on the skid it may have been obvious to the operators that the pump was not serviceable, and it is possible that the catastrophe might have been avoided. The layout was therefore not simple, intuitive, or easily understandable.

6.3 Task Risk Analysis (TRA) and Procedure HAZOP

Only a small percentage of major accidents involve a random failure of equipment (Appendix A4) and most relate to some form of procedural failure, i.e. human error. Where engineered Systems have Safety Critical tasks or procedures for their construction, operation, maintenance, remedial work/interventions, or decommissioning then some form of risk analysis will be required. This will need to be undertaken as early as practicable to ensure that any resulting design changes can be implemented where these are preferential to procedural changes. At its most basic, this may be a review by the relevant disciplines, such as engineers, operators, technicians, maintenance, inspection, and supervisory personnel.

6.3.1 Task Risk Analysis

A review by individuals may not be Proportionate if the Foreseeable consequences are significant, so a formal TRA will be more appropriate. The TRA is the HAZID technique (Section 4) applied to a specific activity or procedure. It works through each step (or objective) systematically, asking a series of questions, such as:
• What exactly is going to be done?
• What materials will be dealt with?
• What tools and equipment will be used?
• When will the job be done (daytime, night-time, time of year, etc.)?
• Where will the job be done (at height, in confined space, etc.)?
• How might the task affect people, activities or equipment close by?

This should establish what can go wrong, amendments to the procedure, and whether each step can be made inherently safe, or safer, by incorporating further RRMs into the design. Table 6 shows a typical worksheet for such an exercise.



Activity:

Step (or safety objective) | Guideword Questions | Hazard | Reasonably Foreseeable Consequences and Potential Escalations | Required Procedure Amendments or RRMs | ALARP? | Action Party | Due date

Table 6: Suggested TRA Worksheet

For close coupled Systems (Section 6.1) it may also be necessary to undertake a timeline analysis, which may be achieved by working backwards from the critical point (when the System loses control, becomes unsafe, escalates, or the outcome occurs), and assessing whether there is sufficient time to respond appropriately, including detection, alarm, processing the information to understand the problem and responding, for everyone involved. Once the required times have been established, an appropriate contingency should be added to allow for Reasonably Foreseeable delays, such as distractions, confusion, and erroneous decisions. If these times cannot be satisfied it may be necessary to rethink the design.
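The arithmetic of such a timeline check is simple and may be worth setting out explicitly. The following sketch is illustrative only: the response elements, durations, contingency factor and time available are all hypothetical, and real values would need to come from trials or representative data for the System in question:

# Illustrative sketch only: a response-time budget check for a timeline
# analysis. The element names, durations, contingency factor and time
# available are hypothetical placeholders.
REQUIRED_SECONDS = {
    "detection": 10,
    "alarm": 5,
    "understanding the problem": 30,
    "response action": 45,
}
CONTINGENCY_FACTOR = 1.5  # assumed allowance for Reasonably Foreseeable delays
TIME_AVAILABLE = 120      # seconds from initiation to the critical point

required = sum(REQUIRED_SECONDS.values()) * CONTINGENCY_FACTOR
if required > TIME_AVAILABLE:
    print(f"Required {required:.0f}s exceeds the {TIME_AVAILABLE}s available: rethink the design")
else:
    print(f"Margin of {TIME_AVAILABLE - required:.0f}s remains")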
6.3.2 Procedure HAZOP

For more complex Systems with serious consequences, it may be Proportionate to undertake a Procedure HAZOP, which is a more rigorous application of the TRA principles. The guidewords from the lower rows of Table 1, Section 4.2 are then applied to the steps, or nodes, of the activity, as shown in Table 7. Nodes may be used to avoid unnecessarily repetitious workload, where several steps to achieve the same objective(s) can be addressed simultaneously. The nodes should be pre-defined by an experienced practitioner, who understands both the methodology and the activity sufficiently.

However, the most systematic and rigorous means of analysing complex Systems and procedures may be STPA, which is covered in Section 6.4.

Purpose | No Action | More Action | Less Action | Wrong Action
Part of Action | Extra Action | Other Action | Out of Sequence | More Time
Less Time | More Information | Less Information | No Information | Wrong Information
Abnormal Conditions | Clarity | Training | Maintenance | Ergonomics
Failure prevents next step? | Self-Revealing Failure? | Common Mode Failure? | Effective? | Reliable?

Table 7: Procedure HAZOP Guideword Examples
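Applying a guideword set of this kind is mechanical enough to script as an aide-memoire. The sketch below is illustrative only, using hypothetical procedure steps and a subset of the Table 7 guidewords; each generated prompt is a question for the study team to answer, not an automated finding:

# Illustrative sketch only: hypothetical procedure steps and a subset of the
# Table 7 guidewords. Each printed line is a prompt for the study team.
steps = ["Isolate pump", "Drain line", "Remove relief valve"]
guidewords = ["No Action", "More Action", "Less Action", "Wrong Action",
              "Out of Sequence", "More Time", "Less Time", "Wrong Information"]

for step in steps:
    for guideword in guidewords:
        print(f"Step '{step}': what if '{guideword}'?")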



6.4 Systems Theoretic Process Analysis (STPA)

STPA (6) views the System as a set of functions, rather than equipment items. Many accident models, such as FTA, FMEA, ETA and Bow Ties, assume accidents to be caused by a linear succession of discrete, equipment failure-based events, which omits any feedback or interactional aspects. This limits the causal detail that is practical with these methodologies and omits human error and regulatory and management constraints on the System. STPA requires a different mindset, as it treats safety as a control problem in which accidents arise from complex processes that may operate concurrently and interact to create unsafe situations. It therefore has the benefit that human error and management systems can be addressed by a bottom-up approach, rather than by the more generic methodologies discussed elsewhere in this section.

All accidents involve lack of control over the System Hazards, so the control view of safety defines a safe System as one that enforces safety constraints on the behaviour of the System. STPA is the Hazard analysis arm of System Theoretic Accident Model and Processes (STAMP), which was created to find more Causes, including social, organisational, human error, design and requirements flaws, and dysfunctional interactions among non-failed components. It originated in the aerospace industries to review software systems, has become widely used in autonomous vehicles and rail, but is now finding wider applications, such as petrochemical and medical.

By breaking down any System functionality into decisions, communications, actions, and feedback, it facilitates their representation in schematic form, like an instrumented control loop, as shown in Figure 5. This enables the hardware and sociotechnical aspects of the System, such as equipment items, human error, management systems and regulatory controls, to be presented diagrammatically.

[Figure 5: Example STPA Control Structure - a hierarchy in which Management/Supervision (policy, rules, procedures) directs a Human Controller (with a mental model of beliefs), who directs a Programmed Controller (process model and control algorithm); signals/comms/actuators act on the controlled activity or process, which is also subject to external conditions and other Hazards, while monitoring/feedback for Hazards, failures and System status returns up the hierarchy.]

The structure may vary from a single control loop to multiple loops representing different equipment items, feedback, and control paths for various data/commands, as well as automated or manually controlled decisions and actions. The hierarchy can be extended up to regulatory and governmental influences if necessary.

The STPA process comprises four stages:
1. Define the purpose of the analysis.
2. Model the control structure.
3. Identify Unsafe Control Actions.
4. Identify the loss scenarios.

STPA has a wide range of applications and may be the most comprehensive risk management process currently available, as is evident from Table 4 (Section 4.3.4). Example 6.6 below shows how STPA can be applied to an exothermic reaction cooling system. The methodology is detailed in the STPA Handbook (7).



Example 6.6 - STPA for an Exothermic Reaction Cooling System

Losses:
L1: Loss of life/injury, L2: Asset damage, L3: Environmental, L4: Loss of production

Control Structure:
[Control structure diagram: Maintenance and a Human Controller sit above a Digital Control System, which switches Pump 1 and Pump 2 on/off and controls coolant pump flow. Temperature and flow transmitters (each 1oo3) provide feedback from the coolant circuit (coolant tank, Pump 1, Pump 2 and cooler) serving the vessel with the exothermic reaction. The transmitters are set to 1-out-of-3 voting for maintaining cooling, with the intent that inadvertent shutdown (all three failing) would be an extremely low probability.]

Unsafe Control Actions:

Not providing causes hazard: Controller does not provide coolant when temperature rises (H1).
Providing causes hazard: Controller shuts down coolant when temperature is in set limits (H2).
Too early/Too late/Out of order: N/A.
Stopped too soon/Applied too long: N/A.

Identify Loss Scenarios for H2:

Controller shuts down cooling when in range [H2]. > Controller believes all 3 sensors have failed. > All 3 transmitters out of range. > Range is calibrated over normal operation variance. > Transmitter limit = x mA. But how could this be exceeded? Both pumps flowing simultaneously, but the controller will not allow this. However, the Control Structure shows that maintenance could start both pumps. Why? Manually taking one pump offline would mean that the other pump needs to be started first, which would immediately cause shutdown.

Solution:
The exothermic reaction needs to be controllable with one pump or two. Recalibrate the flow transmitters to cope with both pumps. However, the effects of doing so would need to be checked to ensure that this does not introduce another problem to production or safety, e.g. a reduction in sensitivity affecting performance monitoring or detection of leaks.
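The third stage of STPA can be made systematic by crossing each control action with the four guideword columns used in the example above. The sketch below is illustrative only; the control actions are paraphrased from Example 6.6 and each combination is merely a prompt for the analyst to assess:

from itertools import product

# Illustrative sketch only: candidate Unsafe Control Actions are generated by
# crossing each control action with the four guideword columns. The control
# actions are paraphrased from Example 6.6.
control_actions = ["provide coolant", "shut down coolant",
                   "start Pump 1", "start Pump 2"]
guidewords = ["not providing causes hazard",
              "providing causes hazard",
              "too early/too late/out of order",
              "stopped too soon/applied too long"]

for action, guideword in product(control_actions, guidewords):
    print(f"Candidate UCA: '{action}' - {guideword}?")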



6.5 Failure Modes and Effects Analysis (FMEA/FMECA)

FMEA is a systematic analysis of component failures and their effects on the wider System. It is a starting point for reliability analysis but may be useful for a wide range of applications as a qualitative tool. FMECA broadens its application to criticality analysis. The process should be concurrent with design, being applied at System level in the early stages, progressing on to component level as these aspects are defined.

6.6 Fault Tree Analysis (FTA)

FTA is a deductive failure analysis using Boolean logic to understand how hardware/instrumented systems can fail at a functional level and thereby identify the best ways to reduce the Risk of an event happening. It is also used to debug software systems. It may be used together with FMEA.

FTA is particularly useful for finding common causes and single points of failure. It can therefore identify whether primary and backup systems are independent, for example. As with any model, it is only as good as the data that is input.

However, it may also be used as a method of brainstorming causal mechanisms by working backwards from the Hazardous situation to the scenarios and routes that may lead up to it.

FTA can suffer similar problems to ETA, especially for Systems involving manual input.

6.7 Event Tree Analysis (ETA)

These are useful for identifying the possible courses of development/escalation of a Hazardous situation with many potential outcomes. In the same way that FTA looks at the events leading up to a Hazardous condition, ETA plots the events thereafter. Although it was originally developed for probabilistic analysis, it may be used as a qualitative analytical tool or as a graphical method of presenting the full set of potential outcomes.

The main problems with ETA are that it can overlook sociological factors and indirect or root Causes, and that the ETA branches are not always independent. Although the ETA structure can stimulate deeper analysis, it may equally over-simplify Systems if it is not carefully thought through, see LOPA (Section 6.10). Extreme care should therefore be taken if probabilities are to be attributed to the event tree branches (Appendices A & B).

6.8 Hazard and Operability Study (HAZOP)

A HAZOP is a structured and systematic examination of complex hardware, normally used in the chemical/process industries. It seeks to identify all the Hazardous conditions relating to a group of equipment items. The principle is to divide the equipment up into suitable Nodes and apply guidewords, such as None, More, Less, Late and Early, to variables, such as temperature and pressure. This enables analysts to identify Hazardous equipment configurations and make any necessary changes during design. It is a powerful technique that has become firmly established in industries wherever Hazardous chemicals are processed. The same principles can also be applied to TRAs, as a Procedure HAZOP (Section 6.3.2).

6.9 Functional Safety Analysis (FSA)

FSAs are intended to determine the appropriate reliabilities for Safety Instrumented Functions (SIFs). The resulting Safety Integrity Levels (SILs) determine the Risk reduction, as described in BS EN 61508, which is supported by BS EN 61511 for the process industry, 61513 for nuclear, 62061 for machinery and 26262 for automotive. These may be regarded as Good Practice with certain caveats.

There are two assumptions that need to be considered, namely:
• that there is a tolerable risk target that needs to be met, and
• where relevant, that the unmitigated risk is either known or can be calculated.



The reference to a tolerable risk is not a demonstration that all reasonably practicable RRMs are in place per se, but the SIL category may be regarded as a demonstration that any higher category would be Grossly Disproportionate. It is unlikely that this could be calculated with sufficient accuracy, but it may be regarded as Good Practice. For the purposes of this section, SIFs can be categorised into three types:

Type 1. Control systems whose failure would be a Direct Cause of an accident.
Type 2. Systems to prevent excursions outside the safe operating envelope.
Type 3. Mitigation when a System has failed to an unsafe state, e.g. loss of containment.

The reliability of a Type 1 SIF is normally specified for a given type of SIF operating under specified conditions and is expressed as a Target Safety Level. However, it may not be practicable to verify the SIF reliability statistically (Section 2.5). Nevertheless, because failure of the SIF leads directly to a Reasonably Foreseeable loss of life, the TSL could be calculated on the basis of cost vs. benefit, provided a value can be attributed to the lives lost and a suitable Gross Disproportion factor can be established (Appendix A7). For more complex Systems where unexpected scenarios may be critical, or there is a potential for error in programmable controllers, STPA may provide a more effective analysis (Section 6.4).

Type 2 Systems are not a measure of Risk per se because an excursion is not necessarily a failure, e.g. failure of the SIF may lead to overspeed of a centrifuge but that does not necessarily equate to catastrophic failure, as there may be safety factors built into the design. If overspeed is a regular occurrence there may be reasonable statistical data for such excursions (whether System specific or from general industry experience), so this can be combined with the reliability of the SIF to provide a posterior excursion rate, but not a Risk. Because many excursions have no consequences, there are no universal TSLs, and SIL categorisation becomes a judgemental exercise.

For Type 3 Systems the Risk is the product of the unsafe failure probability, the SIF failure rate, and the Foreseeable fatalities, assuming this is the only mitigation RRM. As for Type 2 systems, the SIF normally protects against excursions, not necessarily a System failure, so any assumption that it is may be highly conservative, especially as there may be other Causes of failure not covered by the SIF, other safety systems, or even potential for further intervention (human or otherwise). Failure rates may therefore not be quantifiable with sufficient accuracy to determine the Risk, or a specific SIL rating (Appendices A & B). There may be many variables that influence the failure rate of a component or System and these may vary by many orders of magnitude in different applications, e.g. different environmental conditions. The assessment process involves several judgements, such as the probability of a gas cloud igniting, which depend almost entirely on the local environment in which they operate, and for which no reliable data exists to make those judgements. These variables cannot be random, nor can they always be taken into account, so failure rate statistics may underestimate or overestimate reality by errors much larger than the SIL categories.

SIL reliabilities increase by an order of magnitude for each category, but Appendix A4 and (9) show that the variability in risk assessments can be four orders of magnitude, i.e. 1,000 times greater than the SIL category's range. SIL categorisation therefore becomes a highly judgemental exercise. One solution to failure rate uncertainties and errors is to allocate a fixed frequency for each Hazard type, e.g. fires, explosions, toxic clouds, and these should relate to the same Systems that the SIF relates to, rather than components. The principle here is to take the best available data and, allowing for uncertainties, set a conservative failure frequency for the type of System, e.g. fires on a certain type of compressor occur once per thousand years. Although data for each type of System may be sparse, and may even be argued to be optimistic, because many of the failed Systems have SIFs that could prevent the initial failure escalating and therefore becoming part of the data, there may be a reasonable legal



defence to say that coarse industry wide data is the best that can be reasonably used, and that fixing this frequency at least prevents further unrealistic judgements.

6.10 Bow Ties

Bow Ties have become an increasingly popular means of graphically representing Hazards and their RRMs, as shown in Figure 6. They are based on the Swiss Cheese model introduced by James Reason (10) and shown in Figure 7. The cheeses represent the RRMs, which may reduce the likelihood or consequence of an event. The holes in the cheeses represent flaws or weaknesses in the RRMs, which can lead to an accident when they align. One problem with the Bow Tie is that it shows the cheeses but not the holes, which may create an unduly optimistic picture of Risk reduction.

[Figure 6: The Bow Tie - Hazards/Causes on the left lead through prevention/mitigation RRMs to the Top Event, and then through consequence mitigation, escalation prevention, emergency response and rescue & recovery RRMs to the Consequences on the right.]

[Figure 7: Swiss Cheese Model - a Hazard passes through aligned holes in successive slices (RRMs) to become a Loss.]

The typical Bow Tie structure shown in Figure 6 has several Hazards/Causes leading to a single Top Event, which is the point at which an undesirable state is reached, such as a component failure or a gas leak due to corrosion. There are then further RRMs depending on how the event develops, including detection, active and passive controls, escalations, and emergency response. (NB. In Bow Tie terminology holes in the cheeses are known as Escalation Factors, which is not the same as the conventional understanding of escalations.)

However, this model is rarely a true reflection of reality. Any System with complex interactions, interdependencies, or common mode failures cannot be accurately represented by a Bow Tie. An example of these limitations is given in Figure 8, with 8a showing a true schematic of a safety system in Boolean notation, whilst 8b gives the Bow Tie representation with the RRMs in series. The common mode failure mechanisms and interdependencies are therefore ignored, even though they may well be key risk generators. The linear Bow Tie notation may therefore give an extremely optimistic impression of risk reduction. In this example, it may therefore be more realistic to say that there are only two RRMs in parallel (manual and sensor detection), whose reliabilities are compromised by dependencies on further downstream elements.



In practice the RRMs in Bow Ties are known as barriers, which tends to give the impression that they are independent and effective. It is highly unusual for a System to have completely independent barriers, especially where there are sociotechnical or environmental factors, so it may be essential to supplement the Bow Tie with an RRM Effectiveness Review (Section 6.1). The answers to those questions may help determine whether the safety philosophy has critical flaws, or whether the barriers have sufficient integrity for an accident to be Unforeseeable (Section 2.2) and, if so, justify further RRMs as Grossly Disproportionate.

[Figure 8a: Fault Tree Representation - manual detection feeding a manual alarm and control action, in parallel with sensor detection feeding an auto alarm and process trip, combining into total shutdown and evacuation of people.]

[Figure 8b: Serial (Bow Tie) Representation - the same elements drawn as serial barriers (Sensor Detection, Auto Alarm, Process Trip, Total S/D and Manual Detection, Manual Alarm, Control Action, Execute People), with common cause failures spanning them.]

One of the main problems with Bow Ties is that anything that contributes towards risk reduction gets counted as a barrier, including quality assurance, competence and anything intended to enhance reliability, or which may be acting in parallel to another barrier. For example, a prison cell may have the walls, the ceiling, the door, its lock, the procedure for locking it, managed by a competent guard, plus the maintenance and inspection of the lock and its specification, all counted as different barriers, yet failure of any one of these could effectively mean that the whole System fails completely.

Unfortunately, industry has been unable to agree on a single unambiguous definition of what constitutes a barrier, which only facilitates this double-counting. There will always be a temptation to count the barriers in a Bow Tie, but this could be highly deceptive and result in an unduly optimistic impression of safety. Example 6.7 shows that there were a significant number of barriers in one of the world's worst disasters.

Bow Ties are not a Hazard identification process and they require the Causes to have been identified prior to their construction, but they may be a useful vehicle for concentrating the mind on safety issues and showing engineers and operators which elements are Safety Critical and why.

Bow Ties have proven to be a useful graphical means of communicating how simple serial Systems work, but for anything more complex, such as aircraft control systems, they may be too simple.
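The contrast between Figures 8a and 8b can be made concrete with a purely qualitative Boolean check for single points of failure, in the spirit of FTA (Section 6.6). The sketch below is illustrative only: the gate logic is an assumption loosely based on Figure 8a, not the actual system design:

# Illustrative sketch only: the gate logic below is an assumption loosely
# based on Figure 8a, not the actual system design. Total shutdown succeeds
# if either the manual path or the automatic path works, but both depend on
# a shared downstream shutdown element.
ELEMENTS = ["manual detection", "manual alarm", "control action",
            "sensor detection", "auto alarm", "process trip", "shutdown"]

def system_works(failed: str) -> bool:
    ok = {e: e != failed for e in ELEMENTS}
    manual_path = ok["manual detection"] and ok["manual alarm"] and ok["control action"]
    auto_path = ok["sensor detection"] and ok["auto alarm"] and ok["process trip"]
    return (manual_path or auto_path) and ok["shutdown"]

# A serial Bow Tie suggests seven independent barriers; the Boolean view
# shows that only the shared element defeats the whole function on its own.
single_points = [e for e in ELEMENTS if not system_works(e)]
print("Single points of failure:", single_points)  # ['shutdown']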



Example 6.7 - Bhopal's Theoretical RRMs in Bow Tie Format

This accident was also discussed in Example 6.4 (Section 6.1), where it was shown that most of the problems contravene the criteria involved in an RRM Effectiveness Review.

However, a Bow Tie would have presented an extremely optimistic view of the water washing Risks because it would inevitably show many barriers/RRMs without showing their effectiveness, common mode failures and any other factors discussed in Section 6.1.

[Bow Tie diagram: barriers on either side of the Top Event include Isolation Valve(s), Nitrogen Blanket, Instrumentation, Competence, Water Spray, Procedures, Tank Ullage, Evacuation, Slip Flange, Scrubber, Alarms and Flare.]

Nevertheless, a set of Bow Ties could provide a starting point for the RRM Effectiveness Review, if there is one for each identified scenario.

6.11 Layers of Protection Analysis (LOPA)

LOPA is effectively a Bow Tie representation of a System that attempts to quantify Risk. It is mainly used for instrumented Systems with multiple levels of redundancy. However, unless these Systems are truly serial, independent, and redundant it will suffer similar problems to Bow Ties, and the algorithms are liable to be oversimplifications, which underestimate the Risks.

LOPA is like QRA, only looking at one branch of an event tree at a time, which makes it susceptible to all the data and mathematical errors in Appendix B, so extreme caution is recommended.

The HSE Research Report 716 (11) raises several concerns with actual studies they had reviewed:
1. The omission of Foreseeable initiators.
2. Lack of justification for initiator frequencies.
3. Lack of conditional modification of standard initiator frequencies for local conditions.
4. The omission of critical component reliabilities.
5. Assumptions not justified.
6. HEPs taken from standards, such as BS EN 61511, without being adjusted for local conditions.
7. Erroneous modification of HEPs.
8. RRMs assumed to be independent when common Cause failure modes existed. (NB. This contravenes the RSS Guidance (3), which calls for independence to be demonstrated.)

As quantification of Risks is not a legal requirement and there are no absolute Risk criteria, its purpose may be limited to compliance with non-binding guidance or Target Safety Levels (Section 2.5), such as providing input to an FSA. It is nevertheless recommended to represent the functionality diagrammatically (as in Figure 8a) and analyse this with respect to the criteria given in the RRM Effectiveness Review (Section 6.1).



6.12 Risk Matrices and Quantitative Risk Assessment (QRA/PRA/PSA)

The use of probabilistic prediction methods such as risk matrices and QRA is not recommended, because:
• They do not fulfil any legal obligation (Section 2.3).
• They would not be admissible evidence in a court of law (3).
• They are mathematically unsound and based on beliefs (Appendices B2/5/7).
• They simplify nuanced relationships, thereby omitting key variables (Appendix B3).
• They may contain numerous assumptions (Appendix B5).
• They are subject to cognitive biases (Appendix A5).
• They are based on non-representative data (Appendix B1).
• The results cannot be verified.
• They cannot identify unknowns or enhance the understanding of a subject.
• Historical evidence shows that they can err by over a billion times and assessors can have inconsistencies of four orders of magnitude. Probabilistic misjudgements may have been a Cause in most major accidents (12).

Extreme caution is therefore advised for any form of probabilistic assessment unless it is based on Robust Statistics, sound mathematical processes, and the analysts fully understand the issues raised in Appendices A & B. Statistical analysis of this kind is highly specialised and even highly qualified personnel cannot be relied upon to identify the errors (Appendix B8).

Proportionality is better assessed using Matrix 1 (Section 3.2).

6.13 Tailor-made Matrices

Matrices can provide a systematic means of working through tasks and activities. This is especially useful when the engineering decisions need to consider factors in construction, operation, maintenance and/or decommissioning/disposal, e.g. what needs to be included to make a machine safe to continue running during certain maintenance tasks? In their simplest form they can be symmetrical Simultaneous Operations (SIMOPs) matrices that determine whether different tasks can be undertaken together (green), with specified constraints (amber) or never (red) (Matrix 2). Alternatively, they may plot steps in the construction/production schedule against potential activities. They can either be used to identify design constraints or as Lifecycle Criteria to guide operational, maintenance and other Reasonably Foreseeable activities at the design stage.

Symmetrical SIMOPs Matrix:
         | Task 1 | Task 2 | Task 3
Task 1   | N/A    | Green  | Amber
Task 2   | G      | N/A    | Red
Task 3   | A      | R      | N/A

Task Step/Activity Matrix:
           | Task 1 | Task 2 | Task 3 | Task 4
Activity A | N/A    | R      | G      | A
Activity B | R      | A      | N/A    | G
Activity C | G      | N/A    | R      | G

Matrix 2 a) & b): SIMOPs and Task Matrix Styles

The matrices should be designed to best fit potential Safety Critical conflicts.

Another application could be comparison of options where multiple Hazards exist, such as whether to modify an existing System or build a new one, an example of which is given in Matrix 3.

This can be a useful communication tool for the most difficult decisions, such as whether to modify an existing System or to replace it, which may be more expensive but not necessarily have fewer Risks throughout the lifecycle. Only Hazards where there may be differences between their consequence or likelihood should be considered. (NB. The term likelihood is used, as it is qualitative, as opposed to probability, which is quantitative and generally untenable.) The matrix then provides a simple overview of where the Risks differ, and which is the greatest (red), the least (amber) or non-existent/eliminated (green). This cannot replace the WRA, but it can help to present it in a more accessible manner.
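A symmetrical SIMOPs matrix such as Matrix 2 reduces to a simple pairwise lookup. The sketch below is illustrative only, using the hypothetical tasks of Matrix 2; a real implementation would record the specific constraints behind each amber cell rather than a bare colour:

# Illustrative sketch only: a symmetric SIMOPs lookup with hypothetical
# tasks, mirroring the Matrix 2 style (Green = together, Amber = with
# specified constraints, Red = never).
SIMOPS = {
    frozenset({"Task 1", "Task 2"}): "Green",
    frozenset({"Task 1", "Task 3"}): "Amber",
    frozenset({"Task 2", "Task 3"}): "Red",
}

def can_run_together(task_a: str, task_b: str) -> str:
    # Storing each pair as a frozenset makes the matrix symmetric by
    # construction, so the order of the two tasks cannot matter.
    return SIMOPS.get(frozenset({task_a, task_b}), "Not assessed")

print(can_run_together("Task 3", "Task 1"))  # Amber, regardless of order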

Comparing Options

Construction Operational Maintenance Decommissioning and


Design Hazards
Hazards Hazards Hazards Disposal Hazards

Hazard: A B C D E F G H J K

Modify
existing
System

Build
new
System

Matrix 3 Matrix for comparing options

6.14 Specialist Materials Studies

Failure of materials may be a significant Cause of accidents, which requires detailed analysis by specialists or rigorous testing in all relevant environments and load conditions, especially when new materials are introduced. Factors such as corrosion, fatigue, creep, flammability, toxicity, ultraviolet deterioration, impact, fretting, extreme temperature operation, and chemical reactions may need to be tested, sometimes in combinations.

6.15 Conclusions

Each of the above methodologies has strengths and weaknesses and may only work in certain contexts. The best approach to selecting the types and degrees of study required should be based on the Proportionality Matrix (Section 3.3) and the Potential RRM Categories (Table 5, Section 4.3), together with the pros and cons described in this section.

In some cases, the Causes are reasonably generic, at least at a high level, such as in fires in buildings. The HAZID may identify prevention measures, like the use of non-flammable materials, but the Causes may be too numerous to be identified, so significant effort may be required to identify and deliver effective mitigation RRMs to prevent escalations and serious consequences. These may be physical RRMs or single function instrumented systems, which explains the use of processes like FSA and FMEA. It will be critical to identify any weaknesses in RRMs, so an RRM Effectiveness Review may be necessary, especially for non-instrumented systems.

On the other hand, for some high consequence events there may be no practical mitigation RRMs, such as aircraft crashes, so much greater effort will be necessary to identify all Reasonably Foreseeable Causes. This may involve a combination of TRA, Procedure Reviews, HAZOP, FTA, FMEA/FMECA, FSA, HFA, STPA, or any other industry specific methodologies. Complex and/or sociotechnical Systems with internal and/or external interfaces or interdependencies and significant consequences may necessitate STPA.

Any methodology should be regarded as a flexible process that can be modified to suit the subject matter. The ideal approach may be a combination of different ones, employing the best aspects of each. This is especially true for HAZID, TRA and tailor-made matrices, which can be effective when designed to focus on recognisable characteristics of the problems to be solved.

Although this section is not a comprehensive list of methodologies, it is intended to illustrate the principles involved and how they may be applied. The preparation for any analysis may therefore be key to the success of the work. This may require experience and some imagination but, based on the principles set out, it should be possible to create bespoke methodologies that are both cost effective and produce better results.



7. LIFECYCLE
CRITERIA

Key Messages

Risk control throughout the lifecycle requires certain criteria to be established at the early stages of
feasibility and design and maintained throughout the lifecycle.

These criteria need to be formulated and communicated in a manner that ensures the necessary
actions will be taken throughout the lifecycle.

It will be necessary to demonstrate that any Safety Critical element would perform its function(s)
effectively.

This demonstration may be based on trials, standards, certification, engineering calculations, and/or
a Well-Reasoned Argument.

ALARP decisions must consider the whole lifecycle, including construction and manufacturing, operation, integrity (maintenance and inspection), decommissioning, disposal, and Reasonably Foreseeable modifications, change of use, environmental conditions, and any other factors. It will therefore be necessary to have certain information, specifications and criteria defined at the design stage and maintained or updated throughout the lifecycle as the Risks change (either through conscious decision, external environment influences, or degradation of the equipment).

Example 7.1 - Some notable accidents due to the failure to recognise the System limits
• Windscale, where a plant designed to produce Plutonium was used to produce Tritium.
• Space Shuttle Challenger, where the Solid Rocket Booster seals failed because they were operating below the design temperature.
• Santiago de Compostela, a train derailment due to exceeding the speed limit on a bend.
• Seveso disaster, where an abnormal operation caused a runaway exothermic reaction.
• Texas City, where an activity continued despite the failure of critical equipment.
• Chernobyl, where the reactor was taken into an unstable operating condition.

Depending on the complexity and consequences, the Lifecycle Criteria may include:
1. Hazard register (a listing of all Foreseeable Hazards).
2. Design and Operational envelope/System boundaries/permissible operations/functional capabilities (e.g. applications, physical, control, environmental).
3. Requirements for future Reasonably Foreseeable changes to the design or operation.
4. Safety Critical activities and procedures.
5. Safety philosophies and strategies.
6. Design life/life expectancy.
7. Monitoring requirements, manual and/or automated, to identify and warn of unsafe conditions.
8. Emergency response, rescue, and recovery plans/strategy.
9. Safety Critical element register (components and functions).
10. Performance Standards - Specifications and performance requirements for Safety Critical elements, which may include functionality, availability, reliability, survivability under hazardous conditions, independence or interdependence with other components or barriers, and any other relevant criteria, such as those identified in the RRM Effectiveness Review (Section 6.1). They may comprise passive, integrity issues (such as strength, corrosion resistance) and active, functional requirements (such as preventing overpressure, braking distance, performance required with redundant equipment failure).



11. Maintenance and inspection policies and routines.
12. Criteria for ongoing monitoring, such as Key Performance Indicators (KPIs) and leading indicators based on clear objectives.
13. Contingency plans for Reasonably Foreseeable failure modes of Barriers/RRMs.
14. Special requirements for decommissioning and disposal.

Example 7.2 - Grenfell Tower
One of the main causes of the Grenfell Tower disaster was the use of external cladding that had not been certified using accurate mock-ups of the design, or the conditions that it would face in a real fire. The reliance on certification of the materials under non-representative conditions was therefore a failure of risk management, which tragically led to the loss of 72 lives.

7.1 Performance Standard Verification

Once suitable Performance Standards have been established it will be necessary to demonstrate that they will be, or are being, met. Ideally, this demonstration will empirically test the System, or its components, under all Foreseeable conditions. However, this may not be Reasonably Practicable, or even possible, so suitable alternatives will need to be adopted, such as the testing of mock-ups in simulated conditions, tests during commissioning, use of certified equipment, compliance with standards, and/or engineering calculations, but only provided they are sufficiently representative of the Foreseeable conditions. Many of these choices may be considered to be good practice in certain industries, for certain components or systems, but this should not be assumed, and may need to be justified.

The verification process may reveal shortcomings, which require improvements to the specification. They may also affect the previous stages of identification and analysis, thus necessitating an iteration (Figure 1) or a change of strategy.

Further in-service verification may be necessary through maintenance and inspection of equipment, and monitoring of leading indicators.
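As a sketch of how Performance Standards and their verification evidence might be recorded for each Safety Critical element, the following could be used. The field names and the example entry are illustrative assumptions, not prescribed content:

```python
# Sketch of a Performance Standard record for a Safety Critical element.
from dataclasses import dataclass, field

@dataclass
class PerformanceStandard:
    element: str            # Safety Critical element (register, item 9 above)
    function: str           # what it must do (Performance Standard, item 10)
    criteria: list[str]     # e.g. functionality, availability, survivability
    verification: list[str] = field(default_factory=list)  # trials, standards,
                                                           # calculations, etc.

    def is_verified(self) -> bool:
        """A standard is only demonstrated once at least one piece of
        representative verification evidence has been recorded."""
        return len(self.verification) > 0

ps = PerformanceStandard(
    element="Emergency shutdown valve",
    function="Isolate inventory on demand",
    criteria=["closure time", "reliability", "survivability in fire"],
)
ps.verification.append("Commissioning test under simulated process conditions")
print(ps.is_verified())  # True
```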



8. DOCUMENTING AND
COMMUNICATING THE
DEMONSTRATION OF ALARP
Key Messages

An ALARP demonstration is not complete until the Hazards and their controls are effectively
communicated to all stakeholders, whether management, workforce, users and/or the public.
This may take many forms, including safety cases, Safety Critical records, training, warnings, and
competency requirements.

For some installations, buildings or products there may be a legal requirement to provide a formal
safety report/case, for acceptance by the Regulator.

A safety report/case should comprise a WRA for each System to communicate in a clear,
comprehensive, and defensible manner that the Risks will be ALARP throughout the lifecycle.
(NB. A safety case should also include certain operational and managerial issues which may
not be fully covered in this document.)

To be effective, the WRA may need to be developed throughout the risk management process, by
an accountable individual, who is supported by a team of the relevant disciplines. This may reveal
flaws or omissions in the analysis, which may need correction or further study work.

Regardless of the safety case obligations, appropriate WRAs and Safety Critical data should be
recorded and maintained throughout the lifecycle to satisfy the legal obligations.

Given the Reverse Burden of Proof (Section 2.7) it will be prudent to document all decisions that could influence the Risks. The documentation can take many forms, ranging from studies, reports, checklists (Appendix C), ALARP Worksheet (Table 7), through to full Safety Reports/Cases (Table 8). It should provide a fully auditable record of the risk management process for each Hazard, summarising the conclusions and how they were reached in an accessible style, i.e. demonstrate that every Foreseeable Hazard has been identified and subjected to suitable and sufficient analysis and that all reasonable measures to mitigate the Risks have been implemented. Some justification for the type and level of analysis adopted for each Hazard may also be necessary. Appendix C suggests a framework for a checklist (which is a summary of the key points throughout this document) and may act as a final check, or for legal compliance, audit, and review purposes.

Effective communication of safety messages may require diverse means, such as:
• Publicity campaigns.
• Localised warning signs.
• Training and competency.
• Procedures (applying to Safety Critical activities throughout the lifecycle).
• Hazard Register.
• Requirements for activity control and risk assessment, e.g. permit systems.
• Lifecycle Criteria included in relevant documentation, e.g. performance standards.
• Safety philosophies.
• WRAs for Systems.
• A safety report/case.

8.1 Building a WRA

The basis of any safety demonstration should be a WRA, which may be necessary for each System and Hazard. Its content should be as factual, qualitative, and concise as practicable, employing graphical, diagrammatic, and tabular representations wherever desirable. It should preferably be based on recognised methods such as Claims, Argument, Evidence and/or Goal Structuring Notation (Section 8.2).



Any argument will only be robust if it is structured in a logical form, which allows all stakeholders to understand it, critique it, and agree, or disagree. The Risk arguments should never be based on unverifiable or untenable numerical calculations.

The WRA would need to answer several basic questions, such as:
• Has a full technical understanding of the System been demonstrated, including all relevant variables and the relationships between them?
• What assumptions have been made and what is the basis for them?
• What are the Hazardous materials and adverse environmental influences?
• What are the potential Causes of failure?
• Can the System be configured in a way that creates danger?
• Has it been demonstrated that all Safety Critical System functions will operate as expected under all Reasonably Foreseeable configurations and environments, whether automated or manual?
• What is the potential for complex interactions, interdependencies, and human error, and are they understood?
• What are the Reasonably Foreseeable consequences?
• What has been done to mitigate these effects or reduce their likelihood?
• How effective are the RRMs and how might they become ineffective?
• What are the safety objectives, and have they been effectively recorded and communicated?
• If an RRM has been rejected, what is the justification, and does it fulfil the requirements stated in Section 2.3?

In its simplest form, the WRA for each Hazard may be captured in an ALARP proforma, as in Table 8, which is, in any case, a good communications tool for summarising the Systems and their RRMs. The proformas can be contained in a single document, or incorporated into the safety report/case, so that it is available for engineers, maintenance and inspection personnel, operators, and regulators. However, this may be inappropriate when there are complex arguments to be communicated, so a dedicated report may be necessary for each Hazard. Three examples of how the basics of the WRA are generated are given below to illustrate these principles. These are very brief summaries, but they illustrate the need for a rigorous identification and analysis of all the relevant Hazards and variables involved in the technical assessment.

Example 8.1 illustrates how several mutually exclusive RRMs can be evaluated to determine which one is optimal. It illustrates how the WRA is built, by breaking each Hazard down into its components and variables to identify the differences. A comparison of both the technical and cost differences can then be made to determine whether these would be Grossly Disproportionate to a reasonable person. Some technical studies may be necessary as back-up.

ALARP Proforma

System Definition
Context (e.g. operation, conditions, environment)

Identification:
- Hazards
- Failure Modes and Causes
- High Level Safety Objective
- Reasonably Foreseeable Consequences (by failure mode if appropriate)
- Relevant Good Practice, Standards, ACoPs and Guidance
- Assumptions and Uncertainties

Assessment:
- Options considered
- Detailed Safety Objective(s)
- Option rejections and reasons (e.g. Gross Disproportion, inapplicability)

Risk Reduction Measures:
- Hardware control measures
- Software controls (competence, procedural, management systems)
- Lifecycle Criteria (where appropriate)

Table 8: Suggested ALARP Communications Proforma



Example 8.2 is more focussed on providing a technical understanding of the System and evaluating the effectiveness of an RRM. In this case, it was shown to be ineffective and therefore rejected because it was considered unsuitable, which is a more tangible and convincing argument than that of Gross Disproportion.

The example illustrates how a systematic method of dealing with the Hazards, technicalities and RRMs can produce the basis for a WRA. The arguments were developed along the principles of GSN, developing and expanding each Hazard (which could alternatively be expressed as a goal) until all aspects had been closed out. Each argument was developed based on technical logic, trials, and references to learned articles, and supported with charts and graphical illustrations of fire effects.

It should be noted that the Hazards could have been expanded by location, but in this case they were generic enough for the analysis to be as brief as practicable. Any minor differences could have been dealt with as a footnote.

Example 6.4, Section 6.4 looks at a causal analysis using STPA to illustrate how a systematic process working backwards from the Hazard can identify a scenario that was not otherwise apparent. A technical solution (recalibration in this case) can normally be found, but this may be safety critical, so it may need to be recorded, possibly within the Lifecycle Criteria, to ensure future changes to the system do not contravene it.

The WRA would only need to describe the process and the identified scenarios, together with their solutions. If no solution could be found an argument should be made to show how its likelihood will be reduced (perhaps by training, communicating and minimising exposure) or its consequences mitigated. The ultimate decision could depend on societal expectations regarding its acceptability versus its societal benefits.

It should also be noted that technical solutions may often generate further, unexpected, Causes. It may therefore be necessary to demonstrate that the solution does not create any hidden scenarios or additional Hazards. (NB. HAZOP, FMEA and FTA could also have been used in this case. However, STPA is perhaps the most complete methodology, as the process involves clearer steps working from the high level Hazard back to the specific scenario.)

All the examples illustrate how a probabilistic approach would have been untenable because it would necessarily omit relationships between key variables that are too detailed for inclusion yet would affect the overall Risks substantially.



Example 8.1 - Elements of a WRA for selecting the optimum design for a train tunnel System

Problem: Identify which, if any, tunnel systems are not reasonably practicable.
The following is an outline of how to develop a WRA to select a tunnel system and demonstrate whether specific
designs would be grossly disproportionate.

Hazards:
H1. Derailment.
H2. Collision with oncoming train.
H3. Impact with object.
H4. Fire.

RRMs:
• Options of five tunnel systems, as shown in the diagram.
• Allowable train speeds.
• Train and track type.
• Frequency of trains.
• Tunnel firefighting systems.
• Rescue provision.
• Remote monitoring.
• Operational controls, e.g. opposing train limits.

Example variables:
• Train speed.
• Train design (safety standards).
• Size, straightness, and length of tunnel.
• Number and type of carriages.
• Track type.
• Diesel or electric and other flammable materials.
• Passengers or cargo.
• Train frequencies, especially passing.
• Debris potential.
• Fire-fighting, ventilation and escape.
• Rescue potential.

[Diagram: Tunnel System options - Single Double Track Tunnel; Single Double Track Tunnel plus service tunnel; Double Single Track Tunnel; Double Single Track Tunnel plus service tunnel; Double Single Track Tunnel plus service tunnel]

It is immediately apparent that there are many variables that can affect both the Causes and consequences of
accidents in the tunnel. Whilst there may be some statistical data, this would be too limited to reflect all these
variables.

The consequences for each Hazard could vary enormously depending on the tunnel system and many of the
variables listed. This may involve significant judgement, as there may not be enough evidence to model specific
crash scenarios, or their relationship with the above variables. The constraints on the system will be key to the
Risks, e.g. train speed limits, oncoming trains, train types and safety standards. These constraints may also be
different for each tunnel system.

The Causes could also be manifold, e.g. potential for debris, maintenance tasks, train speeds, flammable materials.
The analyses of these would need to be rigorous enough to demonstrate that all Reasonably Foreseeable Causes
had been identified.



All these factors would need to be considered for each tunnel system but only the differences would be relevant
to this decision. These could be presented in matrix or tabular form and ranked by consequence. There would
then need to be a discussion about the Causes to determine the Foreseeability of each event.

The final decision would need to compare the remaining differences in the Causes and consequences to the costs for each tunnel system. If the former are not significant but the costs are, then an argument can be made that the more expensive options do not provide sufficient benefits to warrant their adoption, i.e. such an argument would necessarily include Causes and consequences, and should make sense to a 'reasonable person', as well as satisfying the other criteria for Gross Disproportion (Section 2.3).

Example 8.2 - Problem: Identify and evaluate RRMs for fires and explosions on an offshore platform.

The following summarises the key arguments used in the WRA (some aspects, such as safe refuges, alternative
muster locations, lifeboats, and escape equipment have been omitted).

Hazards:
H1. Gas jet fire radiation affecting personnel at the time of ignition (escape to safe place).
H2. Gas jet fire radiation preventing rescue of injured parties.
H3. Gas jet fire escalating to other hydrocarbon equipment.
H4. Explosions due to delayed ignition or extinguishing and re-igniting gas cloud.
H5. Smoke impairment of muster locations.
H6. Liquid hydrocarbon fires in drip trays under pumps escalating to equipment above.
H7. Heli-fuel fire on helideck preventing rescue of passengers after crash on the helideck.
H8. Heli-fuel fire on helideck preventing rescue of passenger after spillage on the deck.

RRMs:
RRM A. Fire pump(s) with deluge and/or fire monitors for jet fires.
RRM B. Foam system for helideck.
RRM C. Fire extinguishers.
RRM D. Emergency Shut Down (ESD) and Depressurisation (EDP) systems.

Analysis of gas jet fire (H1, H2 & H3) mitigation:


RRM A Deluge:
Heat output for various fire sizes vs. latent heat of vaporisation shows that fires larger than an instrument fitting
will Cause complete vaporisation of pump output and generate superheated steam, which is fatal after one or two
breaths.

Radiation fall-off follows an inverse square law. There are three zones around the flame: i) not survivable; ii) survivable for a limited time; iii) not a threat. The survivable zone is a thin annulus, approximately 2m thick. Deluge has little effect on the radius or thickness of the survivable annulus and may fill it with superheated steam, thus increasing the Risks to individuals.
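(A generic point-source sketch of the inverse square relationship referred to here, with purely symbolic quantities rather than figures from the analysis:

$$ I(r) = \frac{\tau\,Q_r}{4\pi r^2} \quad\Rightarrow\quad r_i = \sqrt{\frac{\tau\,Q_r}{4\pi I_i}} $$

where I(r) is the radiant flux received at distance r from the flame, Q_r is the radiated heat output, τ is the atmospheric transmissivity, and r_i is the radius at which a tolerability threshold I_i is reached; the survivable annulus is then the band between the 'not survivable' and 'no threat' radii.)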

Deluge reduces visibility and creates slippery surfaces for escapees.



Water does not reach the heads for the first 30 seconds, which may be too late for escapees.

If rescue teams require deluge protection, then the victim would have succumbed before arrival.

It cannot prevent escalations, as water cannot penetrate the flame envelope and it cannot achieve 100% coverage of target equipment.

Testing causes corrosion of pressurised equipment, further increasing Risks and maintenance. It also causes significant maintenance and testing, which requires extra manning on the platform, thus increasing the number of people exposed to all offshore Risks.

It cannot extinguish a jet fire and would create potential for explosions if it did.

RRM B Foam systems are not applicable.

RRM C Fire extinguishers are not effective.

RRM D ESD and EDP were shown to be the most effective at preventing escalations and impairment of muster locations, and at extinguishing the fire safely, especially if depressurisation time can be shortened, e.g. more isolation valves, higher capacity vent or selective blowdown.

Analysis of explosions (H4):


RRM A deluge can increase explosion overpressures, except in certain configurations which were not applicable.
Deluge can also ignite the gas by static, so may increase Risks.

Analysis of liquid fires in drip trays (H6):

RRM C, fire extinguishers held by operators or maintenance teams working on the equipment, are effective at extinguishing these fires. Failing this, alarm, muster and shutdown will prevent injury before escalations occur.

Analysis of helideck fires (H7 & H8):

A one shot dedicated foam system would be as effective as a pumped system, have a quicker response time (critical) and create less maintenance and manning Risks.

Conclusions:
A pumped fire water system with deluge and monitors was rejected. Fire extinguishers were incorporated into procedures and the permit system for work on liquid systems. A one shot foam system was adopted for the helideck.

Although fire-fighting systems on offshore installations were considered to be Good Practice, they were shown to be not applicable in this case.



8.2 Goal Structuring Notation (GSN)

Goal Structuring Notation (GSN) is a graphical argumentation notation that represents the individual elements of any safety argument in terms of:
• Goals (claims).
• Strategies (argument).
• Solutions (evidence).
• Contexts (scope).
• Assumptions.
• Justifications (evidence).

This can be likened to the claims, argument, evidence approach discussed in Section 8.1. This is especially suitable for control Systems where options are binary, but the principles are sound for any demonstration that all reasonably practicable RRMs have been identified. The System must first be defined in terms of objectives, or goals, each of which must then be justified by structured arguments.

The development of a GSN should begin at the HAZID stage, which sets out the parent goals. This should be subject to regular review and critique throughout the engineering design and safety studies, which expand the parent goals into child ones, together with their strategies and solutions, until a satisfactory conclusion is drawn. A complete description of the process can be found in the 'Goal Structuring Notation Community Standard, Version 2' (13).

It may be impractical to contain the full arguments on the GSN diagram, so the boxes will need to reference separate text. Alternatively, the GSN may be a back-up document to a textual argument, used to check the integrity of the overall demonstration. This will depend on the characteristics of the subject matter and whether the GSN would be a better communication tool than a textual WRA.

Example 8.3 - GSN representation of Example 8.2, fire-fighting

Example 8.2 above could be communicated using GSN by defining the parent and child goals, which are then connected to the strategy (the argument) and solution (empirical tests in this case). By mapping them out in a diagram that illustrates each aspect by the shape of the text box, the case becomes more transparent, open to quality assurance and review.

Although the process would normally be used to justify a System's safety, in this case the argument is being made that an RRM is not effective and should be rejected based on a WRA.

[Diagram: parent goal 'Mitigate gas jet fire effects' in the context 'All Process Releases', decomposed into the child goals 'Prevent escalations to process equipment', 'Prevent structural collapse', 'Enable escape from area' and 'Enable rescue of survivors', supported by Arguments, Assumptions and Justifications, with 'Empirical Fire Tests' as the solution.]
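As a rough sketch of how GSN elements can be captured as structured data before being drawn (the node texts are taken from Example 8.3 above; the data model itself is an illustrative assumption, not part of the GSN standard):

```python
# Minimal GSN-as-data sketch: goals decompose into child goals, and are
# supported by contexts and solutions (evidence).
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str        # "Goal", "Strategy", "Solution", "Context", ...
    text: str
    children: list["Node"] = field(default_factory=list)

top = Node("Goal", "Mitigate gas jet fire effects", [
    Node("Context", "All Process Releases"),
    Node("Goal", "Prevent escalations to process equipment"),
    Node("Goal", "Prevent structural collapse"),
    Node("Goal", "Enable escape from area"),
    Node("Goal", "Enable rescue of survivors"),
    Node("Solution", "Empirical fire tests"),
])

def walk(node, depth=0):
    """Print the structure for review - a stand-in for rendering a diagram."""
    print("  " * depth + f"[{node.kind}] {node.text}")
    for child in node.children:
        walk(child, depth + 1)

walk(top)
```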



8.3 The Safety Report/Case

Some regulations, such as COMAH, mandate a fully documented Safety Report/Case, but wherever multiple fatalities are Reasonably Foreseeable it would be advisable (and even Reasonably Practicable) to produce a report in line with those principles. Table 9 suggests a structure and contents for this purpose.

It is therefore important that the safety report is not seen as an end, but as a means of communication with all personnel who have influence over, or are affected by, the Risks, right from the owners of the business down to the workforce, users and general public, as appropriate. The content should be accessible to its readership, which is listed in column 3, e.g. the Executive Summary should be no more than ten pages of non-technical language for owners, shareholders and senior management, followed by greater detail and content in user/operator language for the main sections, with the most technical data and arguments for engineers and the Regulator in the appendices. The case should also provide enough detail to help specify competency requirements and support training programmes for personnel who will operate, use, maintain, inspect, and respond to emergencies on the Systems. Not all the safety studies will need to be included, but the case should summarise them and provide the appropriate references.

The act of writing the safety report or ALARP demonstration is often an effective means of checking whether the analysis is robust, which is why it has feedback loops as shown in Figure 1. It may therefore be prudent to start structuring and developing the case right from the initial HAZID, using methods such as GSN to help structure a coherent and tenable WRA. The report should not be an add-on after the studies and Hazard analyses have been completed.

It is recommended that a competent Risk manager is appointed, taking accountability and ownership for the whole risk management process and safety case development (ALARP Demonstration - WRA). He or she should ensure that all the relevant inputs are received from specialists, engineers, consultants, operators, maintainers, inspectors, workforce, and emergency services throughout the risk management processes. This may even require careful structuring of consultations, team events or interviews with individuals to ensure the feedback is effective from all parties. Furthermore, there may be a need to consult with and/or inform members of the public who may be affected by the Risks.



Section Title: Executive Summary
Content: High level description of business, objectives, installation, activities, Hazards, management controls, policies, and worst Reasonably Foreseeable consequence(s).
Readership: Stakeholders and senior management.

Section Title: Installation Overview
Content: Installation description with plans and photographs/isometrics. General description of equipment, types of Hazardous material and activities, personnel at Risk.
Readership: Users, installation management, operators, maintainers, engineers.

Section Title: Management Systems
Content: Roles, responsibilities, and accountabilities. Activity authorisation/permits. Core competencies. Integrity assurance and monitoring. Management of Change. QA, audit and review policies.

Section Title: Hazardous Systems
Content: Each System to include, as appropriate: System definition and boundaries (see Section 4). Description, including locations/drawings/photographs. Hazardous materials, their effects, and inventories. Failure modes, their Causes and consequences. Escalation mechanisms and the conditions required. Safety Critical activities, procedures, and competencies. Integrity assurance, Safety Critical monitoring and maintenance. Risk Reduction Measures (RRMs) and Hazard response. Lifecycle Criteria, including Safety Critical elements and functions.

Section Title: Safety Philosophies
Content: Technical and managerial control philosophies relating to such things as Hazardous materials, prevention, detection, mitigation, emergency response, human factors, failure modes, escalations, consequence reduction and rescue and recovery.

Section Title: Emergency Response
Content: Facilities (e.g. muster points, fire and rescue) and procedures. External support and communications.

Section Title: Appendices
Content: Amendment process and record of changes. Dangerous substances register. Personnel at Risk, numbers and locations (on and off-site). Summaries of analyses and studies: Hazard identification and assessment methodologies and rationale. Escalation and consequence evaluation. Human factors/ergonomics. RRMs/Barrier selection and exclusion (e.g. ALARP Proforma). Emergency Response.
Readership: Technical information for the regulator and engineers.

Table 9: Suggested Structure and Content for a Major Accident Safety Case/Report



9. UPKEEP OF ALARP
DEMONSTRATION AND
MANAGEMENT OF CHANGE
Key Messages

A demonstration of ALARP is not a one-off exercise but continues throughout the lifecycle.

A formalised Management of Change (MoC) process may need to be established. This may also
need to be backed up by a formalised revalidation process where complex systems may change
for reasons beyond the control of the MoC, e.g. COMAH 5 year revalidation of Safety Report.

A formalised means of monitoring creeping change is recommended, whether by audit, review and/
or inspection or monitoring leading indicators.

The management system should specify accountabilities, roles and responsibilities and
competence for those undertaking this activity.

It is recommended that contingencies are set out for Reasonably Foreseeable outages and
changes of status, to avoid shutting down Systems or product recalls.

9.1 Reviewing the ALARP Demonstration

The Management of Health and Safety at Work Regulations 1999 state:

(3) Any assessment such as is referred to in paragraph (1) or (2) shall be reviewed by the employer or self-employed person who made it if
(a) there is reason to suspect that it is no longer valid; or
(b) there has been a significant change in the matters to which it relates; and where as a result of any such review changes to an assessment are required, the employer or self-employed person concerned shall make them.

As stated previously, SFAIRP/ALARP applies to the whole lifecycle and this means that any of the following changes during this timescale may require reassessment:
• Changes to the System, its use, or application; including changes to the design, engineering modifications or additions.
• New or increased Hazards.
• Increased Foreseeable consequences, e.g. changes to the exposed population.
• Learning from other accidents (whether company or external).
• New scientific knowledge about Hazards and their effects (see case law example in Section 2.2, Baker vs. Quantum Clothing Group).
• Changes to the law, guidance, ACoPs or Good Practice.
• Advances in technology, including cost reductions, which could further reduce the Risks.
• Failure or degradation of any Safety Critical element within the System.
• Ageing, wear and tear or general degradation of equipment.
• Changes to, or failure to meet, the Lifecycle Criteria.
• Changes to the operating environment, whether natural or imposed.
• Changes to emergency response provision (whether company or external).
• Significant changes to Safety Critical aspects of the management system, including competence, roles and responsibilities, and permit systems.

An MoC system may need to be put in place to identify any of the above changes and initiate the ALARP review. The appropriate response to change should be addressed immediately, which may be to shut down the installation or recall a product, or repair as soon as reasonably practicable, with or without



additional RRMs in the interim, or accept the Risks and continue. Nevertheless, whichever option is chosen it will be necessary to justify it, generally with a documented WRA.

9.2 Creeping Change Monitoring and HAZID (CCHAZID)

Creeping Change is the accumulation of small changes which often go unnoticed, but which can ultimately add up to a significant change. Because, by their nature, they are gradual, unseen, and not planned, creeping changes can be difficult to monitor. The status of any product or installation may change with time, whether due to wear and tear, corrosion, UV deterioration, changes to management, people, training schemes and numerous other Causes. These changes should be identified and managed, whether by audit, review, inspection and/or the monitoring of leading indicators. These will need to be defined at the engineering stage and implemented as part of the Lifecycle Criteria (Section 7).

A significant number of accidents happen due to these types of changes, as can be seen from Example 9.1:

Example 9.1 - Some major accidents due to creeping change

UK: Aberfan, Nimrod, Herald of Free Enterprise, Marchioness, Windscale, Kings Cross
Worldwide: Bhopal, Space Shuttle Challenger, Space Shuttle Columbia, Texas City, Lac Megantic, Richmond

The CCHAZID (14) is a variant of the standard HAZID technique (Section 4), with a similar structure and process but different guidewords, to help identify new or increased Risks occurring over time.

The primary guidewords in this case could relate to:
• A whole site, location, or organisation.
• A defined product.
• Activities, modules, or systems/functions.
• Items of equipment.
• RRMs.
• Safety critical equipment.

The secondary guidewords could relate to:
• Ageing (including degradation and obsolescence).
• New knowledge, technologies, standards, or legislation.
• Data acquisition, e.g. trends in leading indicators.
• Change of use, additional uses, process changes.
• Hazardous materials and environmental changes.
• Equipment or infrastructure changes (e.g. electrical, mechanical, instrumentation, structural and process).
• Proximity changes (equipment, activities).
• Management/ownership changes.
• Workforce change, loss of skills, changes to training, revised procedures.
• Operational Risk Assessments (ORAs) and Management of Changes (MoCs).
• Workforce, organisational and culture changes.
• Results from audits and reviews.

The team make-up in this case could include all relevant disciplines, preferably Technical Authorities (Section 3.3), as it is a broad scope. It is not a substitute for a normal HAZID, rather complementary to it.
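In practice the review prompts are generated by crossing the two guideword sets; a minimal sketch, using abbreviated guideword lists drawn from those above:

```python
# Sketch: a CCHAZID prompt list is essentially the cross-product of the
# primary and secondary guidewords (lists abbreviated for illustration).
from itertools import product

primary = ["Whole site", "Defined product", "Items of equipment", "RRMs"]
secondary = ["Ageing", "New knowledge or legislation", "Change of use",
             "Workforce or culture changes"]

for item, change in product(primary, secondary):
    print(f"Has '{change}' altered the Risks associated with '{item}'?")
```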



9.3 Contingency Plans and Matrices

In many cases equipment failures or other changes to status may invalidate the ALARP demonstration and any approved safety report, which would require the cessation of operations or the recall of products. Where such changes or failures are Reasonably Foreseeable it is recommended that measures are in place to best manage Risks in advance, to avoid complete shutdown or withdrawal of the product. Guidance for owners, operators, or users on how to respond may be included in the demonstration or safety report. This could take the form of a contingency matrix, as shown in Matrix 4. Green would show that an activity, operation or use of a piece of equipment would not be affected by the failure, or the Risk increase would be negligible. Amber would indicate that temporary measures would be necessary, as indicated by the rule reference in the cell. A red cell would prohibit the use of the equipment item, activity, or operation under the specified set of conditions.

Failed Safety Critical Component, Control, RRM or Safety System

            Activity #1   Activity #2   Operation #1   Operation #2   Equipment #1   Equipment #2
SCE #1      Green         G             Rule A         Rule B         Red            Rule C
SCE #2      R             R             G              G              G              R
SCE #3      Rule A        Rule C        Rule C & D     R              Rule B & C     R
SCE #4      R             R             Rule C         R              G              G

Matrix 4: Suggested Structure for a Contingency Matrix
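A sketch of how such a matrix could drive an automated status check follows. The cell values are the illustrative entries of Matrix 4, and the rule texts and safe-default behaviour for unlisted combinations are assumptions:

```python
# Contingency matrix lookup sketch (entries abbreviated from Matrix 4).
GREEN, RED = "G", "R"

CONTINGENCY = {
    ("SCE #1", "Activity #1"): GREEN,
    ("SCE #1", "Operation #1"): "Rule A",
    ("SCE #1", "Equipment #1"): RED,
    ("SCE #2", "Activity #1"): RED,
}

def response_to_failure(sce: str, activity: str) -> str:
    cell = CONTINGENCY.get((sce, activity))
    if cell == GREEN:
        return "Continue - negligible Risk increase"
    if cell == RED or cell is None:
        # Unlisted combinations default to the safe (prohibited) response.
        return "Prohibited - stop the activity/operation"
    return f"Permitted with temporary measures: {cell}"

print(response_to_failure("SCE #1", "Operation #1"))
# -> Permitted with temporary measures: Rule A
```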



10. COMMON
PITFALLS
The common pitfalls in risk management often relate to assumptions about predicting Risk and UK law, as follows:

Assumption 1 – Risks need to be tolerable.

There is no legal requirement in the UK for Risks to be tolerable, nor is there any relaxation of obligations if the Risks are Broadly Acceptable. The requirement to reduce Risks to ALARP is not linked to the level of Risk. A lot of regulator and industry guidance implies that tolerability is a key criterion, if not a legal requirement, thereby shifting the emphasis from qualitative to quantitative analysis. This has led to the development of many unsound predictive methodologies, which may distract the Legal Duty Holder from undertaking suitable and sufficient qualitative analysis to demonstrate that the risks are ALARP.

Assumption 2 – Risks need to be quantified.

Attributing a numerical value to a Risk cannot demonstrate that it is ALARP. Quantification may be appropriate for the purposes of justifying Risk Transfer or demonstrating Gross Disproportion, provided it is based on Robust Statistics, avoids the errors in Appendices A & B, and complies with the RSS Guidance (3).

Assumption 3 – Risks need to be ranked or profiled.

As for Assumption 2, the ranking of Risks does not satisfy any UK legislation, as all Risks must be reduced to ALARP. However, it is necessary for the analysis to be Proportionate (Sections 3.1 & 3.2), which can only be based on tangible, and preferably quantifiable, criteria.

Assumption 4 – Gross Disproportion requires CBA.

Gross Disproportion may be demonstrated qualitatively, and this may be the only effective means of doing so, unless there are Robust Statistics on which to base the CBA (Appendix A7).

Assumption 5 – That numerical Risk estimates can be meaningful.

Expert witnesses, consultants and professionals have made errors of many orders of magnitude when judging Risk (Appendix A4 and Example A4.3). The error potential is almost unlimited, as illustrated in many major accidents where unquantifiable or unexpected variables have turned out to be the most critical factors (Appendices A and B). Risk cannot be sense checked, unlike many scientific or engineering problems where errors may be obvious within a single order of magnitude.

Assumption 6 – Numerical Risk comparisons are more valid than absolute Risk estimates.

Errors in Risk calculations equally affect the comparisons between two Risks if the error is common to both. However, if the errors in each option are different it is possible that the comparison error is larger than either of the absolute ones.

Assumption 7 – The best RRM can be selected using quantitative methods.

As for Assumption 6, unless Robust Statistical data is available for each RRM, assumptions will need to be made to determine their Risks. This almost inevitably creates an invalid comparison, and the exercise becomes one of comparing assumptions, which may well be hidden. It is therefore better to make that logic transparent in a rational argument for each case.

Assumption 8 – Sensitivity checks can validate QRA models.

A sensitivity check changes one or more variables in the model or its input data to observe changes in the model's output. If this is to be used to validate the model, the assumption is that the user will have reference points to compare it with, but the models are inherently hypothetical so this would not be possible. It cannot reveal errors in the input data and is only useful as a QA method to identify obvious coding errors in the software, such as output decreasing when the input increases.

Assumption 9 – RRMs/barriers are independent.

Many accidents occur because RRMs that were assumed to be independent had common mode failures that were not foreseen (Sections 6.1, 6.10, 6.11 and Appendices A and B5).

Assumption 10 – Counting RRMs/barriers is a good indication of safety.

Because there is no universally agreed definition of what constitutes a barrier, Bow Ties often present different parts of a control measure as separate RRMs. However, separate RRMs are rarely, if ever, shown as one, therefore Bow Ties almost inevitably create an optimistic impression of safety if the RRMs are counted (Sections 6.10 and 6.11).

Other Pitfalls:

Some other common pitfalls, including some highlighted by the Health and Safety Executive (15), are:
• Carrying out a Risk assessment to attempt to justify a decision that has already been made.
• Using a generic assessment when a site-specific assessment is needed.
• Carrying out a QRA without first considering relevant Good Practice.
• Using inappropriate Good Practice.
• Making decisions based on individual Risk when societal Risk is the appropriate measure.
• Only considering the Risk from one activity.
• Assessing the Risk of one component, or element of, the System or installation (known as salami slicing).
• Not involving a team of people in the assessment or not including employees with practical knowledge of the process/activity being assessed.
• Over-use of consultants.
• Failure to identify all Hazards associated with a particular activity.
• Failure to fully consider all possible outcomes.
• Inappropriate use of data.
• Inappropriate definition of a representative sample of events.
• Inappropriate use of Risk criteria.
• No consideration of further measures that could be taken.
• Inappropriate use of cost benefit analysis.
• Using Reverse ALARP arguments (i.e. using cost benefit analysis to attempt to argue that it is acceptable to reduce existing safety standards).
• Not doing anything with the results of the assessment.
• Not linking Hazards with Risk controls.
• Not communicating Hazards and Safety Critical information to relevant parties.



11. REFERENCES

1. Baker, J. The Report of the BP U.S. Refineries Independent Safety Review Panel. 2007.
2. Health and Safety Executive. Reducing Risks, Protecting People (R2P2). 2001. ISBN 0 7176 2151 0.
3. Royal Statistical Society. Practitioner Guides Nos. 1 to 4, Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses. s.l. : Royal Statistical Society, 2009 to 2014.
4. Perrow, C. Normal Accidents: Living with High Risk Technologies. 1984. ISBN: 9780691004129.
5. Health and Safety Executive. HSG65, Managing for Health and Safety, 3rd Edition. [Online] 2013. https://www.hse.gov.uk/pubns/books/HSG65.htm.
6. Leveson, N. Engineering a Safer World: Systems Thinking Applied to Safety. 2011. ISBN: 9780471846802.
7. Leveson, N., Thomas, J. STPA Handbook. [Online] 2018. https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf.
8. Lauridsen, K., Kozine, I., Markert, F., Amendola, A., Christou, M., Fiori, M. Assessment of Uncertainties in Risk Analysis of Chemical Establishments. Roskilde : Riso National Laboratory, 2002. Riso-R-1344(EN).
9. Reason, J. Managing the Risks of Organizational Accidents. Aldershot, UK : Ashgate, 1997.
10. Health and Safety Executive. A Review of Layers of Protection Analysis (LOPA) analyses of overfill of fuel storage tanks, RR716. [Online] 2009. https://www.hse.gov.uk/research/rrhtm/rr716.htm.
11. Tinsley, C. H., Dillon, R. L., Madsen, P. M. How to Avoid Catastrophe. [Online] 2011. https://hbr.org/2011/04/how-to-avoid-catastrophe.
12. The Assurance Working Group. The GSN Community Standard. [Online] 2018. https://scsc.uk/r141B:1?t=1.
13. Goff, R. J. and Holroyd, J., UK Health and Safety Laboratory. Development of a Creeping Change HAZID Methodology. IChemE. [Online] 2017. https://www.icheme.org/media/11897/paper-61.pdf.
14. Health and Safety Executive. Good Practice and pitfalls in risk assessment, RR151. 2003.
15. —. HSG238, Out of control: Why control systems go wrong and how to prevent failure. 2003.
16. Confidential Enquiry into Sudden Death in Infancy (CESDI). Sudden Unexpected Deaths in Infancy. s.l. : BMJ.
17. Hill, R. Cot Death or Murder? - Weighing the Probabilities. s.l. : Salford University, 2002.
18. Derksen, T. The fabrication of facts: The lure of the incredible coincidence. s.l. : Neuroreport, 2009.
19. Kahneman, D. Thinking Fast and Slow. 2011.
20. Tetlock, P. E. and Gardner, D. Superforecasting: The Art and Science of Prediction. 2015.
21. Robson, D. The Intelligence Trap: Why Smart People Do Stupid Things and How to Make Wiser Decisions. s.l. : Hodder & Stoughton, 2019.
22. Cox Jr., L. A. What's Wrong with Risk Matrices? 2008.
23. Thomas, P., Bratvold, R. B., Bickel, J. E. The Risk of Using Risk Matrices. 2013.
24. Miller, K. Quantifying Risk and How It All Goes Wrong. s.l. : IChemE, 2018.
25. Health and Safety Executive. RR672, Offshore hydrocarbon releases 2001 to 2008. s.l. : HSE Books, 2008.
26. Aschwanden, C. You Can't Trust What You Read About Nutrition. FiveThirtyEight. [Online] 2016. FiveThirtyEight.com.
27. Reason, J. The Human Contribution.
28. Thomas, J. 2020 MIT STAMP Workshop. 2020.



APPENDIX A:
RISK, IT’S MEASUREMENT
AND PREDICTION
Key Messages

Safety Risks are either measurements, calculations, or beliefs.

Measured Risk is a statistic, an average, which may not be indicative of the Risks in a unique System
and may not provide any indication of why the Risk exists. Robust Statistics are only possible for
mass produced items tested under identical conditions for sufficient time.

Calculated Risks must be based on logic or Robust Statistics that are modified in accordance with
Bayesian theory. This is rarely possible for engineered or sociotechnical systems (Appendix B6).

Predicted Risk is an epistemic belief, a knowledge related opinion, but accident history shows that the unknowns can be as important as the knowns, if not more so. It may be undermined by multiple cognitive biases, conflicts of interest, errors due to poor understanding of probability theory and counter-intuitive mathematical relationships.

Predicted Risk is not real; it is an opinion, generally unique to an individual, which may be subject to almost unlimited errors that can be extraordinarily difficult to identify, even by experts.

Risk predictions for rare events cannot be sense checked or verified.

Risk assumes randomness, which is an absence of control, either due to inability, ignorance, or
choice. Ignorance cannot be quantified. Choice must be justified. Accidents are due to ineffective
or absent controls, which cannot be quantified but may be rectified.

The Risks associated with engineered and sociotechnical Systems are nuanced with potentially
chaotic aspects. Predictions, based on experience, appearance, or comparison with similar
Systems, may therefore be highly deceptive. ‘Expert judgement’ or ‘sound engineering judgement’
can only relate to Foreseeability, not to Risk quantification.

Risk prediction is irrational because of the following paradox:


• Without understanding the reasons for failure, prediction is simply guesswork.
• If the reasons are understood, then prediction has no purpose.

Accidents only happen because someone thinks the Risk is low, so dismissing low Risks is illogical.

Risk management requires Hazards to be identified, Causes to be understood and Risks to be


reduced. Risk calculation and prediction cannot contribute to any of these objectives.

This appendix deals with the legal definition of Risk and the feasibility of measuring, calculating, or predicting it for the purposes of demonstrating whether it has been reduced to ALARP.

A1 Legal Definition of Risk and Risk Assessment

The concept of risk was described in the case of Regina v Board of Trustees of the Science Museum, 1993 as 'a possibility of danger'.

ALARP for Engineers: A Technical Safety Guide 79


It is not specifically defined in the Health and Safety at Work etc. Act 1974, although extensive reference is made to it, mostly requiring 'so far as is reasonably practicable, the elimination or minimisation of any risks to health or safety'. However, risk is defined in the COMAH Regulations as 'the likelihood of a specific effect occurring within a specified period or in specified circumstances'. The Offshore Installations (Offshore Safety Directive) (Safety Case etc.) Regs 2015 do not define it but require 'all major accident risks have been evaluated, their likelihood and consequences assessed…and that suitable measures…have been, or will be, taken to control those risks'. The Nuclear Installations Act 1965 and Nuclear Installations (Dangerous Occurrences) Regulations 1965 do not define it or make specific provisions for its assessment. The Approved Code of Practice for the Management of Health and Safety at Work Act states 'a risk is the likelihood of potential harm'. Therefore, the legislation always associates Risk with likelihood, rather than probability, which is critical because likelihood is a qualitative term, whilst probability is quantitative.

Although some guidance regards terms such as high, medium, or low risk as qualitative, these are not recognised in law and they may be more accurately described as a relative form of quantification that lacks a reference, thereby having questionable meaning. A qualitative assessment is an understanding of the Hazard characteristics, the conditions under which it may be liberated, how it can develop and escalate and the Foreseeable consequences. A qualitative assessment of the reduction in likelihood would therefore demonstrate how the System qualities are changed by the RRMs. Whilst some of these aspects may be quantifiable, there is no legal obligation to do this.

The Management of Health and Safety at Work Regulations 1999 do not specifically define Risk Assessment, but they do say that 'every employer shall make a suitable and sufficient assessment of the risks' to all persons affected. A court case is normally conducted based on rational arguments, which must therefore be the preferred assessment approach, where feasible.

A2 Risk Measurement (Robust Statistics)

Risk is measured using statistics, which must be robust and compliant with the following criteria:

1. Representativeness
- The sampling criteria should reflect the purpose of the measurement, not just be similar in some respects, but genuinely reflective of all relevant variables and contexts, and be free from any non-relevant influences. For example, car accidents could be measured by type of vehicle, age and gender of driver, weather conditions, type of road and many other variables, depending on the objectives of the study.

2. Randomness
- The sampling should be random and unbiased, and
- There should be no known variables that could affect/control the probability unless it can be modified later to accurately account for them.

3. Statistical Significance
- The sample sizes must be large enough to give confidence in the results.

4. Freedom from Unfalsifiable Chance Correlations
- When a dataset is sub-divided into correlations these need to have logical justification. The greater the number of variables considered, the greater the likelihood that some will correlate by chance alone (Appendix B7, Error #7). The adage 'correlation does not equate to causation' illustrates the guiding principle and, unless a genuine causal relationship can be established, statistical data should not be used this way, or it should at least carry a clear warning.

5. Ergodicity
- If a coin flipped a thousand times gives the same number of heads as one thousand coins flipped once, then the data is said to



be ergodic. If a dataset can be divided into separate groups, it is not ergodic. An item of mass produced equipment may fail due to reasons inherent in its design or the way it is operated. Failures due to design could be ergodic if they are random and the user has no control over them. However, failures due to application or the way the equipment was operated may vary greatly, so it is not. If 99 users operate it in one application and one uses it in a situation where it fails then the data could give a 1% failure rate, which would be far too high for the first application and seriously underestimate the second. Non-ergodic data produces averages that greatly underestimate the effect of the most serious Causes, because they are diluted across a large population for which they do not apply. Without detailed causal information it may be impossible to tell whether the data is ergodic and therefore a realistic representation of the relevant Risk.

This limits Robust Statistics to subjects with a large amount of data, such as mass produced items, operating under identical conditions. Although major accidents of all types are frequent enough to be measured statistically on a global basis this will be of little relevance to measuring the Risks of a unique engineered System. It may be possible to measure failure rates for some components or RRMs, such as fire protection systems, but these could vary greatly for various reasons, such as design, application, operating conditions, natural or Hazardous environments, or maintenance. In practice, reliability data taken from public sources or manufacturers may not be collected for the same reasons as data from a controlled study and may therefore have limited value (Appendix B1).

Some variables have extensive datasets, such as weather, but users should note that historical data may not always be indicative of future performance (Example A2.1).

Example A2.1 – Weather Data Used for the Design of North Sea Oil and Gas Platforms
The design standards for these platforms require them to be built to resist the largest wave that would be expected to occur over a 100 year period. Unfortunately, global warming has increased this, and some platforms have had to be raised at huge expense.

Nevertheless, accident investigations rarely, if ever, state the Causes to have been random failures of equipment items, because there are always deeper reasons or the Cause was due to poor training, communications, ergonomics, design, management systems and so on. HSG238 (16) shows that most accidents involving some form of control error or loss were caused by poor design or specification, and none of these factors can be put down to probability. Measured Risk therefore has little application for engineered Systems.

A3 Risk Calculation (Bayesian Methods)

The Bayesian method converts background data (known as prior probabilities or base rates) into posterior probabilities reflective of a more specific condition, e.g. converting the cancer rates in the general population (the prior) to those for smokers only (the posterior). In this way it can convert a large dataset into sub-sets, but only if the conditional probabilities and populations of the subsets are known. Taking the car accident example above, the prior probability (the total accident numbers) can be granularized into contextual probabilities, such as car or road type, driver gender or age, as shown in Example A3.1:

Example A3.1 – Bayesian Calculation for the Probability of an Accident Given a Male Driver

Bayes Formula:

    Pr(Accident | Male) = Pr(Male | Accident) x Pr(Accident) / Pr(Male Driver)

Where Pr(Accident | Male) = the probability of an accident given that the driver is male.
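A minimal numerical sketch of this calculation (all figures below are invented purely to show the arithmetic, not real accident statistics):

```python
# Bayes: Pr(Accident | Male) = Pr(Male | Accident) * Pr(Accident) / Pr(Male)
# Hypothetical inputs for illustration only.
p_accident = 0.01            # prior: probability any driver has an accident
p_male_given_accident = 0.7  # proportion of accident-involved drivers who are male
p_male = 0.5                 # proportion of drivers who are male - the denominator
                             # that, as noted in the text, often has to be estimated

p_accident_given_male = p_male_given_accident * p_accident / p_male
print(f"Pr(Accident | Male) = {p_accident_given_male:.3f}")  # 0.014
```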



However, critical information is often missing (typically the denominator on the right hand side), so this must be estimated. This changes calculation into prediction. For these reasons, Bayesian methods have become associated with guesswork, but the counterargument is that they should at least be better than the prior probabilities. Although the errors in the above example may not be too large, as the number of male drivers can be guessed with reasonable accuracy, when dealing with major accident Risks they can be much greater. Example A4.3 in Appendix A4 below details a scientific study (9), which showed that judgements of these types of rare event regularly varied by four orders of magnitude. Even if Bayesian analysis could get 90% closer to the truth, this would still leave an expected error of 1,000 times. This has significant implications for Target Safety Levels (Section 2.5) and Cost Benefit Analysis (Appendix A7), where it is highly unlikely that all the relevant variables will be known.

Nevertheless, for those rare cases where all the relevant variables can be quantified using Robust Statistical data, the Bayesian method is a sound mathematical concept and should be legally admissible.

Risk calculation pitfalls are covered in Appendix B, which shows that the results are unverifiable, yet can contain errors exceeding many orders of magnitude, which may be unavoidable and remarkably difficult to detect.

A4 Risk Prediction (Belief) and Randomness

Given the above limitations, the only remaining option will be to predict the Risks, so the limitations of this judgemental approach need to be considered. This will depend on two key factors: control and knowledge.

Control and Randomness

Risk inherently assumes randomness, which may be described as something beyond our understanding or control. The toss of a coin may be considered a random event, although it is governed by the laws of physics, so it is theoretically calculable and repeatable and therefore not truly random. Nevertheless, because it is beyond human capability to control the outcome, it is regarded as uncontrollable and therefore random.

Hazards may therefore be random, controllable or some combination of both. For example, weather is random, but its effects may be controllable. Any Risk prediction therefore implies absent or limited control, whether due to inability, ignorance, or choice, as illustrated in Example A4.1.

Example A4.1 - Accidents Due to Inability, Ignorance or Choice

Inability:
Space Shuttle Columbia failed because a tile fell off at launch, damaging the wing leading edge, which meant that safe re-entry into earth's atmosphere was not possible. The problem had been foreseen, and all available technology was used to control the Risk, but it could not be eliminated.

Ignorance:
The early Comet airliners had failed due to square windows, which caused stress raisers that led to fatigue cracks. The problem was not foreseen, but is now well understood, and aircraft have round or oval windows.

Choice:
Ladbroke Grove rail disaster was a foreseeable accident that ultimately happened because of a conscious decision to omit the Train Protection System (which would have prevented the crash).

However, it may not be possible to calculate, measure or meaningfully predict the Risks caused by this lack of control. The Space Shuttle and Comet examples were obviously unquantifiable (as they were unique and unforeseen respectively), but the Ladbroke Grove decision (see Appendix A7) was based on non-representative data that was modified using estimates, with disastrous consequences.

Databases of reliability or frequency of failure also inherently assume randomness, e.g. Human Error Probabilities (Section 6.2), even though they may be influenced or controlled by known or knowable factors, such as ergonomics and training. If anything is to be regarded as a Risk, it would be necessary to justify why this is not controlled (whether due to inability, ignorance,



Estimating something that we either cannot control or understand, or have chosen not to control, and for which we have no statistical data, is clearly extremely challenging.

Knowledge

Risk is an epistemic concept, i.e. knowledge based. If we had complete knowledge, we would know the outcome and there would be no Risk. However, with partial knowledge the outcome is uncertain, and it becomes a Risk estimate. The Risk will therefore depend on the level of knowledge, as illustrated in Example A4.2.

Example A4.2 - Risk as an Epistemic Concept

What is the probability that a patient has disease X?
Doctor A does not know the patient, so she quotes the population statistics, which are 1/10,000.
Doctor B knows that the patient is male and in his 70s, so she quotes the statistics for that group, which are 1/1,000.
Doctor C has seen the patient's test results, which are positive, but the test is only 90% reliable, so she states 9/10.

Who is correct?
Answers:
1) They are all correct, according to the knowledge they hold.
2) They are all incorrect, because none of them checked with the haematologist who analysed the blood sample and confirmed that the patient definitely had disease X.

NB. Doctor C made another mistake that is easily missed. Her intuition said 9/10, based on the test reliability of 90%, but the true probability is about 1/1,111, because of the 10,000 people tested we would expect 10% (1,000) to be false positives and only 0.9 to be true positives. Risk evaluations are often counter-intuitive, so judgements can result in enormous errors and should never be trusted (see also Appendix B6 to show how large these errors can be).

Risk is therefore an opinion, based on the knowledge held by the person making the judgement, so it should always include a caveat stating what that knowledge is. There can be no right or wrong answer. Phrases such as 'sound engineering judgement' or 'expert judgement' are often used in this context, but their meaning implies:
• that the expert has acquired all reasonably practicable knowledge about the subject before making such a judgement, and
• that such knowledge would be sufficient to make a meaningful and worthwhile Risk prediction.

Neither of these may be true. In Example A4.2, the doctors were all given the same question, but their answers varied by four orders of magnitude, depending on their knowledge.
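Doctor C's mistake is a standard base-rate error, and the arithmetic is worth making explicit. A minimal sketch, using only the figures quoted in Example A4.2 (prevalence 1/10,000, test 90% reliable):

```python
# The base-rate calculation behind the NB to Example A4.2.

population = 10_000
prevalence = 1 / 10_000
sensitivity = 0.9          # diseased patients who test positive
false_positive_rate = 0.1  # healthy patients who test positive (90% reliable test)

true_positives = population * prevalence * sensitivity                 # ~0.9 people
false_positives = population * (1 - prevalence) * false_positive_rate  # ~1,000 people

pr_disease_given_positive = true_positives / (true_positives + false_positives)
print(f"Pr(disease | positive) = {pr_disease_given_positive:.6f}")
# ~0.0009, i.e. roughly 1/1,112 (the example rounds this to 1/1,111) - not 9/10
```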



Example A4.3 - Risk Estimation Variance and Errors

A controlled study by the Risø National Institute (9) employed six specialist consultancies to assess
the Risks on a chemical installation. Figure A1 shows the results, with three of the eleven equipment
items having more than four orders of magnitude Risk variation, which is 10 times larger than the
ALARP region for workers, and 100 times larger than that for the public (2).

[Figure A1: QRA Variation vs. ALARP Range. Variations between consultancies for 11 different systems, plotted as Risk/Year on a logarithmic scale from 10⁻² to 10⁻⁸ against the Worker Intolerable, Public Intolerable and Broadly Acceptable bands.]

It should be noted that this trial measured variation, not error (which could be larger). The report concluded that the differences were due to necessary assumptions, caused by 'analysts guessing'. The process is therefore random over a potential range of at least four orders of magnitude.

Because Risk predictions for rare events are unverifiable, the errors cannot be known, but they could be even larger.

The law courts provide the best evidence of how large the errors could be, because some gross miscarriages of justice have warranted academic challenge to the figures. The Sally Clark trial (17) (18) infamously had a critical expert witness error of one billion times (Appendix B6, Error #6) and the case of Lucia de Berk in the Dutch courts (19) erred by ten orders of magnitude. Both errors resulted in life sentences for the defendants, who were later released when this came to light. The Sally Clark case was probably the most researched example but nevertheless had an Unfalsifiable Chance Correlation (UCC) error (Appendix B7, Error #7) of one hundred times that went unnoticed even by the academics.

Predictions regarding engineered and sociotechnical systems may have similar variations, as illustrated in Example A4.3. In terms of probability, these Systems are nuanced, with potentially chaotic aspects, where a small change to a minor detail can often dramatically change the Risk picture. This is evident from most major accidents, which conform to the principle of multiple Causes and are typically dictated by several of the following characteristics:
• Complex interfaces and interdependencies between sub-Systems, e.g. equipment and activities, such as mechanical, electronic, software, management systems, procedures, permits, competence, maintenance, operations etc.
• External influences, such as political, financial, conflicts of interest etc.
• Quality of leadership, cultures, policies, and communications.
• Design error.
• Human factors.
• Environmental factors (natural or man-made) and deterioration.


Example A4.4 -
How Chaotic Variables Influenced the Probability of the Chernobyl Disaster

Cause #1 – The fuel rods had graphite tips, which made the reaction unstable at low levels of power.
Cause #2 – The operators did not understand this.
Cause #3 – The test procedures contained ambiguities and omissions.
Cause #4 – The test required standard operating criteria to be contravened.
Cause #5 – The test was rushed because of pressures to get it done.
Cause #6 – The run-down test was to prove that power to the pumps could be maintained during a reactor shut down, but this should have been proved by design and safe simulations during commissioning.

These are just a few of the many reasons that resulted in the disaster, but they illustrate how changing any one of them could have either prevented the accident from happening, or at least changed the probability enormously.

Any one of these factors can become dominant, or combinations of them could change the probability from negligible to highly likely, or even certain. This is too complex to model probabilistically or to make mental estimates of their likelihood. The Chernobyl disaster (Example A4.4) illustrates these principles quite well.

Chernobyl is one example of these sensitivities, but similar conclusions could be drawn for virtually all major accidents. The level of detail required to meet the knowledge requirements above would therefore be impracticable, making any meaningful predictions totally unrealistic.

This illustrates how the unknowns can become more important than the knowns and, even if they are known, it may be impossible to evaluate or quantify them in any meaningful way.

Engineers can sense check most engineering calculations within an order of magnitude (e.g. a pump or a bridge that is either ten times too large or too small would be apparent to a competent engineer). This may be the reason why they believe that they can do the same with Risk, failing to recognise the different nature and sensitivities inherent in the probabilistic domain (20), (21).

The sensitivities can become almost unlimited, as illustrated in Example A4.5 – Russian Roulette. This example is analogous to many industrial applications of ALARP legislation, because the Risks are epistemic and partly controllable, and may be reduced to ALARP, but no one could have sufficient knowledge to estimate them in any meaningful way.

The tossing of a coin, or the rolling of a die, are therefore deceptive examples of Risk because they have two unique characteristics: i) they are based on a logical argument (that of symmetry) and ii) they involve a complete inability to control the outcome. Engineering problems are inevitably much more complex, with multiple variables, which may often be controlled, if only partially, and there is no equivalent logical argument for their Risk, as there was for the coin. The Russian Roulette example shows how the introduction of more variables immediately makes the problem insoluble.

Accident inquiries rarely, if ever, state that the event was random, as there was inevitably good reason for its occurrence. Example A4.6 lists 30 well-known major accidents, of which only one could be considered entirely random: the Space Shuttle Columbia, which was doomed the moment the tile fell off and hit the wing. (NB. Some would argue that even the Shuttle accident was not random because there may have been more that NASA could have done to prevent it.) The other accidents were initiated by identifiable errors in design or operation, most of which could have been rectified with an effective Risk management process.



Example A4.5 - Russian Roulette

Russian Roulette illustrates some of the key points here because it has epistemic, controllable and random aspects, which are analogous to many industrial situations. The Risk of the gun firing is governed by three variables:
i) weight of the bullet
ii) friction of the cylinder
iii) orientation of the gun (which may either increase or decrease the Risk).

A simple logical argument might predict the Risk as 1/6 because there is one bullet in six chambers, but this is incorrect because the weight of the bullet tends to make the cylinder stop at the bottom of the spin. In Bayesian terms, the 1/6 is the prior probability, which ignores the weight effect, whereas the actual probability is known as the posterior. It may not be reasonably practicable to assess the effects of friction, so the posterior probability would be incalculable. If friction is the dominant factor, the weight of the bullet could be a minor influence only. On the other hand, a frictionless cylinder would make the weight of the bullet dominant, making it almost inconceivable that the loaded chamber would stop at the top. Depending on the orientation of the gun, the range of possible probabilities is therefore enormous and not something that anyone could judge, unless they had Robust Statistical data from trials under identical conditions. In the absence of these, the posterior probability cannot be known, and it would clearly be unacceptable to quote the prior. Any attempt to assess the posterior probability would be nothing more than guesswork, which would not be legally admissible evidence.

If the gun's orientation were known, it may be possible to say whether the probability is greater or less than 1/6, but nothing more. If the gun's orientation is fixed, then the Risks are uncontrollable and random, but with no known probability. If it is not fixed, and the operator can choose the orientation, then it is at least partially controllable, because holding the gun upright (with the barrel at the top of the spin) will reduce the probability, thereby achieving ALARP, even though the benefit cannot be quantified. The demonstration of ALARP can therefore only be based on a WRA.
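The sensitivity can be illustrated numerically. The following toy model is not from the guide: the exponential weighting and the bias parameter k are assumptions, chosen purely to show how wide the range of possible posteriors is.

```python
import math

# Toy model of the Russian Roulette example (illustrative only).
# The loaded chamber fires if it comes to rest at the firing position (the top).
# k lumps together bullet weight, cylinder friction and gun orientation:
#   k = 0 -> no bias, so the symmetry argument gives 1/6
#   k > 0 -> the heavy chamber tends to settle at the bottom, away from the top
#   k < 0 -> bias towards the top (e.g. the gun is held the other way up)

def firing_probability(k: float) -> float:
    angles = [2 * math.pi * i / 6 for i in range(6)]        # six chamber positions
    weights = [math.exp(-k * math.cos(a)) for a in angles]  # cos = 1 at the top
    return weights[0] / sum(weights)  # chance of resting at the firing position

for k in [0.0, 1.0, 3.0, 6.0, -3.0]:
    print(f"k = {k:+.1f}  ->  Pr(fires) = {firing_probability(k):.2e}")
```

Depending on an unmeasurable bias parameter, the firing probability here spans several orders of magnitude (from below 10⁻⁵ to nearly 0.7), which is the point of the example: without Robust Statistical data from identical trials, the posterior is unknowable.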

Example A4.6 - Randomness in 30 Well-Known Major Accidents

UK: Aberfan; Flixborough; Ladbroke Grove; BG Rough; Grenfell Tower; Nimrod; Buncefield; Piper Alpha; Herald of Free Enterprise; Clapham Junction; Hillsborough; Titanic; Comet Airliner; Kegworth; Windscale.

Worldwide: Bhopal; Guadalajara; Seveso; Boeing 737 Max 8; Space Shuttle Challenger; Longford; Chernobyl; Macondo; Space Shuttle Columbia; Feyzin; Mumbai High; Texas City; Three Mile Island; Fukushima; Pasadena.



Each event had elements that could be regarded as random, such as Grenfell Tower, which was initiated by an electrical fault in a refrigerator. This was not something that was within the control of the design or operation of the building, so it must be considered random from that perspective.

Virtually all major accidents happen because the Risk is underestimated at some point in the product lifecycle, i.e. a failure to recognise either the Hazard, a flaw in its controls, or the likelihood of failure (12).

The difference between the ability to judge plausibility (Foreseeability) and probability is recognised in the RSS legal guidance (3). Although an expert may be able to judge Foreseeability, probability is a different matter, e.g. a metallurgist may be able to say whether a fatigue crack is Foreseeable, but he/she does not have the competence to judge the probability and, in the absence of Robust Statistical data on that item operating under those precise conditions, any judgement would have no sound basis. Expertise has boundaries, which do not encompass probabilistic quantification.

The primary objective in risk management is to identify controls and determine whether they are effective, not to dismiss the Hazard as random. Prediction attempts to quantify unknowns, although identifying them should enable them to be addressed. If analysis concludes that they cannot be controlled, then that provides a WRA for demonstrating that all reasonable RRMs have been implemented. If the conclusion is that control would be too expensive, then Appendix A7 explains why prediction is not accurate enough to demonstrate Gross Disproportion and a WRA may be the only viable means of rejecting the RRM.

Prediction is therefore an epistemic concept, susceptible to virtually unlimited error and variation between individual estimators, and it cannot be sense checked or verified.

A5 Cognitive Bias

The above limitations are only made worse by the many cognitive biases that affect Risk estimation (21), some of which are:

1. Confirmation Bias (aka Wilful Blindness) (20)
- Probably the best known cognitive bias, as virtually everyone tends to select evidence that favours their own beliefs and disregard evidence to the contrary.

2. Availability Heuristic (aka Outcome Bias) (20)
- Personal experience dominates judgement, although it may differ from reality.

3. Anchoring (20)
- Where any initial suggestion of likelihood is liable to influence the final assessment. This can occur in QRA, where the results are compared to previous studies, even though those studies may be flawed and cannot be verified.

4. Bounded Rationality (20)
- This is where the argument is limited to only those factors that can be evaluated, ignoring nuanced, complex, or unquantifiable variables.

5. Question Substitution (20)
- When a question cannot be answered directly, people tend to substitute a similar but different question and answer that instead (20) (Appendix B6, Error #6).




6. Centring Bias (20)
- When asked to rate Risks on a range of 1 to 5, people are reluctant to predict extremes and a disproportionately large majority will rate in the range 2 to 4.

7. Bias Cascade and Group Think (22)
- The first person to express an opinion is likely to affect all subsequent opinions, because of a reluctance to challenge it, especially if the first estimate was made by someone with authority, experience, or qualifications.

8. Theory Induced Blindness (20)
- This is where a theory is learned and applied before it is fully understood. This creates a belief in the process, which is not easily overturned by revelations of its weakness. The individual uses confirmation bias to reinforce his or her own beliefs.

9. Earned Dogma (22)
- This is where status, qualifications or experience can lead an individual to believe so strongly in their own opinions that they will not listen to the views of others.

10. Circular Reasoning
- A logical fallacy in which the reasoner begins with what they are trying to end with. It can happen when a hypothesis is expressed in an algorithm that is then used to prove a concept, e.g. QRA. Although the result is unverifiable, the lack of transparency creates a belief that the hypothesis has been proven. Repetition only strengthens this belief.

11. Denigration of History
- A common term associated with gamblers, investors and decision makers who begin to believe that what has happened in the past to others will not happen to them.

12. Conflict of Interest
- Depending on who is doing the estimation or calculation, their personal motives may influence how they interpret data or estimate Risk.

13. Self-Justification (21)
- Once a person has made a judgement, they are generally reluctant to change their mind. Initial impressions, made before the details are appreciated, can therefore have undue effect on these judgements.

These biases can all adversely affect risk management, and most would be difficult, if not impossible, to identify in retrospect. Robson (22) describes the characteristics common to the great scientists, which would be equally desirable in risk management practitioners, namely 'Curiosity, Growth Mindset and Humility', because they tend to minimise these biases. However, people intuitively believe that they can assess Risk, probably because it has been vital for the survival of mankind. These skills are only suited to simple problems though, such as avoiding wild animals, and they do not translate to complex engineered Systems.

A6 Risk Matrices

For most risk assessments mathematical modelling is too complex and time consuming, so the use of Risk Matrices has become commonplace, as illustrated in Example A6.1. There is no standard matrix, but examples have been developed by the Regulator, engineering institutions and industry, each plotting consequences and likelihood on a matrix that can range from 3x3 up to 8x8.

Risk Matrices do not fulfil any specific legal obligation, primarily because they ask what the Risk is, rather than whether all reasonably practicable RRMs have been identified and implemented.

As discussed in Appendices A4 and A5, humans do not have the ability to assess engineering Risk, because it is nuanced, with the most relevant variables being subtle and detailed aspects of the System.



Example A6.1 - A Typical Safety Risk Matrix

The figure shows a commonly used layout for a safety risk matrix, with consequence columns (First Aid, Minor Injury, Chronic Injury, Fatality, Multiple Fatalities) and likelihood rows (Very likely, Likely, Possible, Unlikely, Very Unlikely). The cells are shaded from GREEN in the bottom-left corner, through an AMBER band across the middle, to RED in the top-right corner.

This will normally have a key, such as:
GREEN Acceptable,
AMBER Consider further Risk reduction,
RED Unacceptable.

NB. The terms on the likelihood axis are ambiguous, and 'possible' is commonly used in the middle, even though 'unlikely' cannot be less likely than 'possible'. The term 'likely' would imply greater than 50%, but one person might interpret this as the chance if a given action is taken, whereas another might consider it to be the chance over a year, across a location with hundreds of people working there. Furthermore, many matrices consider 'likely' minor injuries to be acceptable, which illustrates the significance of these ambiguities.

A major problem is that there are seven fundamentals that need to be established before a risk matrix can be designed or used, and it is common for none of these to be stated. These are as follows:
1. What is its purpose?
- Measuring tolerability of Risk?
- Ranking Risks for different items or activities?
- Determining Proportionality, i.e. the level of Risk assessment required?
2. What is it measuring?
- One Risk to an individual?
- One Risk to all people?
- All Risks to one individual?
- All Risks to all people?
3. What level of granularity is it focussing on?
- Average Risks for that industry?
- Specific Risks for that product, location, or plant?
- Average or specific Risks for a particular system?
- Average or specific Risks for a given function or component?
4. Which consequences does it refer to?
- The expected outcome?
- The worst Foreseeable outcome?
- The worst possible outcome?
5. What type of Risk metric is in use?
- Relative Risk? (If so, relative to what? High and low are meaningless without a reference.)
- Absolute Risk?
6. Which Risk is it measuring?
- The unmitigated Hazard Risks?
- The Top Event Risk (Failure Mode)?
- The outcome?
7. What timescale is the Risk measured over? (See the sketch below.)
- A single action?
- The whole activity?
- Per year (assuming the activity repeats continuously for a year)?
- Per year (for the average number of times that activity is likely to occur)?
- Over the product Lifecycle?
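Fundamental 7 is not a pedantic point: the same underlying probability can land at opposite ends of the likelihood axis depending on the timescale and exposure assumed. A minimal sketch, with all figures invented for illustration:

```python
# How one per-task probability spans a risk matrix's whole likelihood axis
# depending on the (often unstated) timescale. All figures are assumptions.

p_per_task = 1e-3            # could plausibly be rated 'unlikely' per action
workers, days = 200, 250     # assumed site-wide exposure
tasks_per_year = workers * days

pr_per_year = 1 - (1 - p_per_task) ** tasks_per_year
print(f"Per task:            {p_per_task:.0e}")
print(f"Per year, site-wide: {pr_per_year:.4f}")  # ~1.0, i.e. 'very likely'
```

The same Hazard could therefore sit in the 'very unlikely' or the 'very likely' row, depending on an unstated convention.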



Another problem with risk matrices is that the descriptors on the likelihood axis can be highly ambiguous. Terms such as remote, unlikely, likely, and highly likely have meanings that can vary greatly in different contexts. The terms have not been standardised and they are very rarely defined. It could be argued that even minor injuries that are 'likely' would be unacceptable. Risks are typically described in relative terms, such as low, medium, and high, which are meaningless without a reference level, e.g. driving a car could be high Risk compared to taking a walk, but low compared to climbing a mountain. However, such comparisons are only possible when Robust Statistics are available for all the things being compared, which cannot be the case for Risk assessments of unique activities, procedures, or equipment.

All these ambiguities can render the results meaningless. The matrices may also be used by managers, operators, engineers, and workers, each with differing objectives that may be influenced by factors such as attitudes to Risk, personal/commercial agendas, safety culture, budgets, external pressures, and any of the biases listed in Appendix A5 above.

The matrices are sometimes used to determine the acceptability of Risk, with green cells regarded as acceptable, amber worthy of consideration for Risk reduction, and red being unacceptable. The inherent assumption is that the matrix adds some value, which is greater than simply estimating whether something is acceptable or not. In practice, it simply splits the judgement into two parts, i.e. the consequence, which can normally be judged within acceptable limits, and the frequency, which cannot. The danger is that the assessor starts with a preconceived idea of the acceptability of a Risk and subconsciously 'reverse engineers' it into the desired cell, with all the cognitive biases discussed previously. It is not possible to scientifically prove the accuracy of Risk judgements for rare events, but the Risø institute study (9) showed that the variability alone can regularly exceed a factor of 10,000 times (Example A4.3), so the errors could be even larger. Risk simply cannot be judged in this way.

The reason for this inability is that Risks are nuanced, or even chaotic. It will be apparent from many of the accident examples given in this document that Risks are as much to do with subtle details as with big-picture aspects. Seemingly small errors in software, procedures, design, human factors etc. can have major effects on the Risks. Example A6.2 is a good illustration of this.

Example A6.2 - Large Changes of Risk Due to Nuances

After many years of safe service, an airliner was given different engines and an electronic system to make it handle like the previous version. Shortly after its introduction, two of the new aircraft crashed with total loss of life. Despite being largely the same aircraft, and almost identical in appearance and flying characteristics, the Risks were obviously vastly different. Any attempt to have judged or calculated those Risks would have been untenable.

This example illustrates the two principles discussed:
• Engineering Risk is nuanced and potentially chaotic, i.e. extremely sensitive to detailed aspects of the system.
• Similarity and outward appearance may be no indication of Risk.

A Risk thought to be low may be so rated because it has not been experienced, or because the Hazards, Causes or flaws in RRMs are not appreciated, so it may be much higher than estimated. Conversely, a Risk believed to be high may be so rated because of past accidents and a knowledge of causal mechanisms, but that would mean they have probably been addressed already in standards, Good Practice or by common sense, and it may therefore be much lower than estimated. Risk management seeks to identify unknowns so, on this basis, these would be expected to be where the Risks are believed to be low. So, although focussing efforts on what we perceive to be high Risks may be intuitive, it may be illogical, and it could lead to a situation where the largest Risks are deliberately excluded.



The other main application for Risk matrices is to rank Hazards for Proportionality, i.e. the degree of further analysis required. If Hazards lie in the red band, they will be dealt with first and subject to the greatest scrutiny; amber would be subject to less scrutiny, and green least of all, if any. The key assumption here is that the estimation will be based on an understanding of the nuances, but the same arguments apply as before, and the variation would be greater than the difference between the Risk bands. The resulting ranking would therefore be nothing more than random guesswork.

There is a paradox in ranking Risks for analysis: Risk cannot be determined unless the nuances are understood, but the purpose of the analysis is to identify those very nuances, i.e. the unknown unknowns. So, the notion that Risks can be estimated to determine what should be analysed is irrational and could lead to the dismissal of critical Risks. Conversely, if the nuances were already known and understood, they could simply be addressed, and there would be no point in assessing the Risks.

It is a truism that accidents only happen because someone thinks the Risk is low, and this illustrates one of the most fundamental flaws behind risk matrices. Nevertheless, guidance often fails to recognise this and promotes the use of risk matrices.

It has been shown that consequence and Foreseeability are the only variables that can be assessed with sufficient accuracy, and that is why they are the basis for the Proportionality matrix in Section 3.3, which does not contain a likelihood axis. The Proportionality matrix obviates the need for risk matrices and makes the risk management process more objective and systematic.

Example A6.3 - The Danger of Risk Prediction and Ranking

A COMAH registered site which processed hydrocarbons had a QRA that mapped the Risk contours across the plant. At the end of the site there was a bio-treatment building to purify water discharged to the water course to meet environmental standards. Apart from life buoys around the water tanks, the Risks were considered negligible, and the building fell outside the QRA Risk contours. After fifty years of operation without a serious incident on the site, the concrete bio-treatment building was destroyed by an explosion (luckily without any fatalities).

The prediction that the building was such low Risk assumed that it contained no hydrocarbon. However, corrosion in a separator weir plate allowed hydrocarbon to enter the treatment plant, which did not have intrinsically safe electrical equipment, and the explosion resulted.

A single example, such as this, does not prove that the prediction was wrong, but it illustrates how a ranking exercise can dismiss the very Hazards that need to be identified and understood. Hydrocarbon in the water system is a Foreseeable Hazard and should have been identified by a HAZID or HAZOP, but Risk prediction and ranking clearly enabled it to be dismissed.

Nevertheless, there are many other problems with risk matrices. One paper (23) found that some matrices shuffled or even completely reversed ranking orders if all the Hazards being compared were multiplied by a small constant, as the sketch below demonstrates. This is because the cell colours have generally been determined by judgement rather than scientific method, and their contours are not true representations of equivalent Risk.
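A minimal sketch of that effect, with an assumed two-band matrix (the band edges and the hazard figures are invented for illustration):

```python
# Toy demonstration of risk-matrix misranking (illustrative bands and hazards).

def band(value, edges):
    """Return 0, 1, 2... depending on which band the value falls in."""
    return sum(value >= e for e in edges)

FREQ_EDGES = [0.1, 1.0]   # events/year band boundaries (assumed)
CONS_EDGES = [1.0, 10.0]  # consequence band boundaries (assumed)

def matrix_rating(freq, cons):
    # Proxy for the cell's severity: further up and right means a higher rating.
    return band(freq, FREQ_EDGES) + band(cons, CONS_EDGES)

hazards = {"A": (0.09, 9.0),   # quantitative risk = 0.81
           "B": (0.11, 1.1)}   # quantitative risk ~ 0.12

for name, (f, c) in hazards.items():
    print(name, "risk =", round(f * c, 2), "matrix rating =", matrix_rating(f, c))
```

Here the matrix rates Hazard B above Hazard A, even though A's quantitative risk is nearly seven times larger; and scaling both frequencies by a factor of two changes the ordering again (the two become tied).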




Risk Matrices for Business Decisions

The reasons for using risk matrices for business decisions are quite different to their use in safety. For example, a decision whether to invest in Project A or Project B may involve much more tangible and quantifiable variables, such as exchange rates, materials and labour availability, interest rates and potential for change. Each of these may be estimated with reasonable accuracy compared to safety decisions, because they tend to be mid-range numbers, say between 10 and 90%. Uncertainties in this range are generally much more limited, whereas safety deals with exceedingly small numbers, which people are extremely poor at estimating.

In business, the matrix can be a useful means of communicating opinions as to which option may have the least Risk (both in terms of maximum loss and its likelihood). Unlike safety decisions, these are binary ones that have no legal obligation to be transparent or robust.

Risk matrices have been heavily criticised in scientific papers. Thomas (24) stated, 'our literature search found more than 100 papers in the OnePetro database that document the application of RMs in a risk-management context. However, we are not aware of any published empirical evidence showing that they actually help in managing risk or that they improve decision outcomes'. It went on to say 'A tool that produces arbitrary recommendations in an area as important as risk management in O&G should not be considered an industry best practice'. Cox (23) stated, 'Typical risk matrices can correctly and unambiguously compare only a small fraction (e.g., less than 10%) of randomly selected pairs of Hazards…. Risk matrices can mistakenly assign higher qualitative ratings to quantitatively smaller risks. For risks with negatively correlated frequencies and severities, they can be "worse than useless," leading to worse-than-random decisions'.

The matrices may therefore constitute little more than easily manipulated, ambiguous heuristics, based on intuition rather than logic, subject to multiple cognitive biases, with enormous error potential, and all applied to an undefined notion. In a legal context it is not unreasonable that any probabilistic argument should be transparent and reasoned deductively (plausibility) or inductively (based on Robust Statistics with no more than one level of Bayesian modification). Their use in risk management is therefore strongly discouraged and it is recommended that the Proportionality Matrix (Section 3.3) is used instead.

A7 Cost Benefit Analysis (CBA)

Cost Benefit Analysis (CBA) is calculated as follows:

Cost/Benefits = Cost / ((Risk reduction) × (Value of Preventing a Fatality))

where Risk = Probability × Number of Fatalities.

Gross Disproportion is demonstrated when Cost/Benefits > 1 × (a Gross Disproportion factor), where the factor could range from approximately 3 to 10 according to guidance on the HSE website, but there are no known legal precedents.

However, caution must be exercised to ensure that the quantification of Risks and costs is scientifically derived, that error potential and uncertainties are taken into consideration, and that the values placed on life, injury or health issues are in line with societal expectations.
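The arithmetic of the test above can be sketched in a few lines. This is an illustration only: every figure is invented, and the caveats about error and uncertainty that follow apply in full.

```python
# Minimal sketch of the CBA test described above (all figures invented).

cost = 2_000_000.0                       # cost of the proposed RRM (GBP)
risk_reduction = 1e-3 * 2                # (probability reduction/year) x (fatalities averted)
value_of_preventing_fatality = 2_000_000.0  # assumed societal valuation (GBP)
gross_disproportion_factor = 10.0        # upper end of the 3-10 range quoted above

benefits = risk_reduction * value_of_preventing_fatality  # per year, undiscounted
ratio = cost / benefits

print(f"Cost/Benefits = {ratio:,.0f}")   # 500 on these figures
if ratio > gross_disproportion_factor:
    print("Grossly disproportionate on these figures - RRM could be rejected")
else:
    print("Gross Disproportion not demonstrated - RRM should be implemented")
```

Note that an input error of two orders of magnitude in risk_reduction, well inside the variation reported in Example A4.3, reverses the verdict.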



CBA cannot demonstrate Gross Disproportion if the uncertainties are more than an order of magnitude, because reliance on the mean is mathematically unsound in such circumstances. The true moment of a probability distribution should use the integral of moments across it, i.e. ∫y·f·df, as shown in Graph A1, with the point weightings (y·f·δf) shown in Graph A2. With Robust Statistical data, y·f·δf peaks close to the mean but, as the uncertainty or errors increase (e.g. where the 90% confidence figure is more than an order of magnitude above the mean), the weighting y·f·δf continues to increase away from the mean, as shown in Graphs A3 and A4. This is because the frequency (f) increases more than the amplitude (y) decreases, thus giving the data points away from the mean greater weight than the mean itself. Therefore, once the errors exceed an order of magnitude, the cost/benefit ratio tends towards zero, making any proposed Risk mitigation mandatory (see the numerical sketch after Example A7). As explained above, almost any prediction or modelling of Risk involves errors and uncertainties greater than this, therefore proving that CBA based on anything other than Robust Statistics would be an unsound method of rejecting an RRM.

[Graphs A1 to A4: amplitude (y) and point weighting (y·f·δf) plotted against frequency (f); with Robust Statistical data the weighting peaks close to the mean (A1, A2), but with large uncertainties it keeps rising away from the mean (A3, A4).]

An alternative solution would be to argue that the Gross Disproportion factor would have to be many orders of magnitude before CBA could be reliably used to discount an RRM. However, by this time the arguments behind the analysis would be self-evident and the decision could be made solely based on a WRA, which would be legally admissible. Furthermore, the exclusion of prediction from an analysis incentivises the analyst to drill deeper into the arguments and provide a more complete justification for Gross Disproportion.

Example A7 - The Ladbroke Grove Train Accident

Possibly the only publicly documented example of CBA in decision making relates to the Ladbroke Grove rail disaster in 1999, where 31 people were killed and over 250 injured, following a Signal Passed At Danger (SPAD). Automatic Train Protection (ATP), which should have prevented the accident, had been rejected, partly because the CBA concluded that the cost/benefit ratio was too high at 3.1/1 (only just meeting the HSE's minimum Gross Disproportion factor of 3).

It was caused by a signal that had three known but unquantifiable problems: i) the signal was on a curve, making it difficult to determine which track it related to, ii) it was only 100m after a low bridge that obscured it until the train was very close, and iii) electrification of another track had also partly obscured the signal. None of these factors can be modelled verifiably and any attempt to quantify the Risks would be extraordinarily speculative. However, even if the Risks of each problem were quantifiable, there is no way of knowing their combined Risk, which could have been synergistic, perhaps factoring them up by millions or billions of times.

The CBA suffered three fundamental errors:
i) it failed to recognise that Risk is nuanced, making it extremely sensitive to subtle changes,
ii) the data (SPAD records) was non-representative and not statistically significant,
iii) known and controllable aspects of the Risk were not remedied.

Risk quantification must be based on a detailed understanding of the Causes, even though such an understanding would obviate the reasons to quantify the Risk.
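The behaviour of the mean under large uncertainty can be seen with a heavy-tailed distribution. This sketch is my illustration of the point above, assuming a lognormal spread on the estimate (the lognormal model is an assumption):

```python
import math

# How the mean of a lognormally distributed estimate runs away from the median
# as uncertainty grows (illustrative of the moment-weighting argument above).

median = 1.0  # central (P50) estimate, arbitrary units
for p90_over_p50 in [2, 10, 100, 1000]:
    sigma = math.log(p90_over_p50) / 1.2816  # z-score of the 90th percentile
    mean = median * math.exp(sigma**2 / 2)   # lognormal mean = exp(mu + sigma^2/2)
    print(f"P90/P50 = {p90_over_p50:>5}  ->  mean/median = {mean:12.1f}")
```

Once the 90% confidence figure sits an order of magnitude or more above the median, the mean is dominated by the tail and is no longer a meaningful single number to feed into a cost/benefit ratio.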



A8 Conclusions

The theory behind statistics and probability is complex and multifaceted, with many pitfalls that engineers and safety professionals may not be aware of or understand. Nevertheless, Risk prediction and Risk ranking are subjects that 'feel right' despite numerous logical arguments against them. The inability to verify predictions makes them unscientific, with no guarantee that they are anything better than guesswork, and quite possibly worse.

Although statistics are useful in research, where they can be used to guide scientists to the most likely candidate from a known group of variables, this is fundamentally different to risk management, where the statistics cannot measure the desired subject and necessarily relate to something else, which may not have any causal relationship. In science, statistics are therefore used to home in on the most promising variable, for subsequent scientific confirmation of a causal relationship, whereas in safety, prediction is used to assess a potentially unknown set of variables and draw a conclusion. This cannot be a substitute for rigorous and systematic identification and analytical methodologies.

The argument in favour of Risk prediction is that it at least forces people to think about the subject, which may reveal some aspects not previously realised. This may be partly true for simple activities, but there is no evidence that putting a likelihood on a Hazard will add value, despite significant evidence that it may act as a substitute for analysis and therefore reduce value.



APPENDIX B:
TECHNICAL ERRORS IN RISK
QUANTIFICATION MODELS
(QRA/PRA/PSA)
Key Messages

Risk models are necessarily based on unverifiable hypotheses and assumptions, so they constitute
beliefs, not science.

They are susceptible to numerous errors, caused by non-representative data, false assumptions,
unfalsifiable chance correlations, omitted variables and simplified or omitted interdependencies.
The resulting errors can exceed many orders of magnitude.

Equipment failure is often the primary variable in Risk models, but it is rarely the direct Cause of
accidents and, even when it is, those failures can often be traced back to deeper, resolvable reasons
that cannot be modelled or quantified.

Risk models cannot identify Hazards, failure modes, or Causes.

Probabilistic errors can be notoriously difficult to find, even by experts, so a sufficiently rigorous quality assurance process (one which interrogates the data collection, its interpretation, the algorithms, and the model's architecture) may not be a realistic proposition.

An important distinction that often goes unnoticed in probabilistic evaluation is the difference between error and uncertainty. Uncertainty relates to the shape and dimensions of the Probability Distribution Function (PDF) and it can normally be quantified, especially if the PDF complies with a mathematical form, such as the Normal/Gaussian (Bell Curve) distribution. This means that confidence bounds can be determined, e.g. the 95% confidence interval, which could be much larger than the 50% mean value, and therefore more conservative.

However, errors have no such limits and, for rare events such as accident frequencies, their magnitude can be many orders of magnitude, with no means of sense checking or verification. Errors should therefore never be classed as uncertainties.

The quantification of major accident Risks is inherently predictive, even though these models process large amounts of data (often quoted to three or four significant figures) through complex computer programmes. Their algorithms are simply opinions, embedded in code. Both the data and the models contain assumptions and/or predictions that cannot be verified. Common errors are (25):
1. Non-representative data.
2. Causal fallacy.
3. Omission.
4. Null Hypothesis.
5. Ludic Fallacy – Independence.
6. Illegitimate Transposition of the Conditional.
7. Unfalsifiable Chance Correlations.

The following sections explain these errors in more detail.

B1 Error #1
Non-Representative Data

All data should be based on Robust Statistics, so data on positive displacement pumps would be non-representative of centrifugal pumps. Furthermore, data on Systems including many items must not be used for similar Systems that have a different mix of components. Although these principles are well understood, difficulties may arise when analysing major accidents, because these are rare events which are too few and far between to be statistically significant.



Example B1 - Process Industry Loss of Containment Failure Mode Data

Major Accident Leak Failure Modes                      | Non-representative Failure Modes
(large leaks with small data quantities)               | (small leaks with large data quantities)
-------------------------------------------------------|------------------------------------------
Brittle Fracture (Longford)                            | Wear and tear on valve stem seals
Stress Corrosion Cracking (BG Rough)                   | Pitting corrosion
Significant impact or dropped object (Mumbai High)     | Flange gasket deterioration and poor bolt tightening
Overpressure rupture (Grangemouth, Ocean Odyssey)      | Poorly fitting instrument connection
Design calculation error (Flixborough, Macondo)        | Passing valve
Process upset (Texas City, Buncefield, Seveso, Bhopal) | Wear and tear on door seals
Isolation/reconnection error (Piper Alpha, Pasadena)   | Sampling

The only solution is to measure something more frequent that has a quantifiable relationship (random or Bayesian) with the rare event and calculate its frequency using this knowledge. It will be necessary to provide suitable evidence of any such relationship, whether by logical argument or by some form of statistical proof.

It is Good Practice to state the study objectives and define the relevant variables prior to the collection of any statistical data. However, when used for predicting rare events, such as major accidents, this will not be practicable because the data population is generally worldwide. In practice, it will be necessary to draw upon whatever industry data is available, even though it may have been collected for different objectives to the study purpose, e.g. reliability, rather than safety.

A good example of this error occurs in the process industries, where data is collected by size of leak, even though very few leaks lead to major accidents. However, the small leaks have different Failure Modes to large ones and may have little or no potential to Cause major accidents. In Example B1 the left-hand column is a list of these Failure Modes for some well-known major accidents, which are typically large, normally full bore, failures, greater than 10 kg/s. However, the right-hand column looks at the Causes of the more frequent, smaller leaks, commonly known as weeps and seeps. There are few, if any, major accidents that have been caused by these failure modes, yet they dominate the datasets (26), despite having no quantifiable random relationship with the rare events that they are used to predict.

Because the larger leaks are not statistically significant, it is normal practice to plot all leak data on a frequency vs. size graph and find the best fit curve, in the erroneous belief that this overcomes the paucity of large leak data. This is therefore an attempt to overcome contravention of the statistical significance criterion by contravening representativeness instead. The problem is that leak sizes are intuitively the right thing to measure, provided the Causes are ignored. Conversely, the idea that data on the common cold could be used to predict cancer rates would be quickly dismissed, because it is well-known that one is a virus, whilst the other is not. Statistical sampling must therefore be subject to rigorous scrutiny to ensure that it represents that which the study is attempting to calculate.



Another example of non-representative data occurs in aviation, because aircraft crashes are subject to intense scrutiny to identify the Causes, which then undergo mandatory correction in all other aircraft of that type. However, the data remains despite removal of the Cause. If the data were to be deleted though, the aircraft would have a perfect safety record, which would not be true. This raises the question of what the data is intended to reflect. Is it a measure of the general quality of the design and manufacture of the aircraft, in which case it has nothing to do with Causes, or is it something more specific and, if so, what?

This leads to a fundamental question about data, which is whether it is ergodic (see Appendix A2). Ergodic data also implies randomness and a lack of control. Failures due to design could be ergodic if they are random and the operator has no control over them. However, failures due to application or the way the equipment was operated may vary greatly, so they are not. Non-ergodic data produces averages that greatly underestimate the effect of the Causes, because they are diluted across a large population for which they do not apply. Considering that most accidents relate not to random failures of equipment but to operator control, the data used in QRA models is not ergodic and cannot reflect Causes. Without supplemental causal information it could be impossible to quantify this error.

B2 Error #2
The Causal Fallacy

The Causal Fallacy may be explained by the adage 'Correlation does not equate to causation'. The fact that two variables exhibit some statistical correlation is not evidence of a logical causal relationship. This is especially true when two finite variables are divided into categories and then compared, in which case correlations would be inevitable, albeit possibly meaningless. Some categories will inevitably correlate better than others, even when there is no real relationship.

A common correlation is to relate equipment types to failure rates. Whilst this may be logical where the failure modes are unique to the equipment types, this may not be the case and any resulting correlations could be down to chance alone. Failures could be due to external factors, which may have some correlation with equipment types, but whether this is weak, medium, or strong will not be proven by statistics. Whilst trials of this kind can be useful to prioritise further investigation (for the purposes of saving time and/or money) they do not prove anything per se. Unless the relationships between the variables are understood qualitatively, such correlations should not be used. It would also be reasonable to question why equipment data is any better than measuring other variables, such as plot areas, process inventories, or tonnage of steel, which may be no less accurate.

Another example of this is correlating human error with probabilities, known as Human Error Probabilities (HEPs). This is a case where the variables influencing such error are known but unquantifiable, e.g. ergonomics, lighting, colour coding, graphical user interfaces, panel layouts. The likelihood is not random because these variables are controllable. HEPs are therefore not relevant to the demonstration of ALARP and would not be legally admissible.

The Piper Alpha disaster provides a good example of how the equipment category for flanges can be deceptive, because the accident involved a flange that had been left with a loosely fitting blind. But attributing this incident to the flange is illogical, because the flange did not fail, nor was its existence attributable to the accident. The relief valve had to be removed for maintenance, so the existence of this flange was not the issue. Some acknowledged Causes of the accident were Permit to Work errors, competence and shift change handover; factors that are not unique to flanges, nor equally attributable to other flanges on the installation, because they would not be dismantled for the same reasons and some may not be opened in the lifetime of the installation. Therefore, the flange is a misleading correlation that gives no clue to the real Cause or probability of the release.



B3 Error #3 Omission

Omission of known variables can cause substantial error. The variables that affect Risk may be unknown or unquantifiable, yet they can be related to detail, not the broad picture or the outward appearance of the subject. The Russian Roulette example in Appendix A4 showed how key variables cannot be modelled yet have major influences on probabilities.

A documented example of this error occurred in the QRA that contributed to the Ladbroke Grove rail disaster described in Appendix A7. The QRA extrapolated data from minor incidents to major ones despite the possibility that some variables may change qualitatively for these situations. There may be non-linear relationships between variables such as train speed, passenger densities (e.g. standing passengers at rush hour), ergonomics (signal and oncoming train visibility), weather conditions and position of the sun, other lights, and crash resistance. Such relationships are nuanced and potentially chaotic. Reasonably Foreseeable combinations of these may therefore create disproportionately high-Risk situations (which is often the very reason major accidents occur), yet they would be virtually impossible to model. The three controllable factors relevant to this accident that were not modelled include:
1. the signal was on a curve, making it difficult to determine which track it related to,
2. it was only 100m after a low bridge that obscured it until the train was very close, and
3. electrification of another track had also partly obscured the signal.

The omission of these key variables from the QRA would therefore have rendered the analysis hopelessly optimistic, and not even an accurate representation of the average for UK train signals. Wherever a relevant variable is known about, but remains unquantifiable, the probabilistic analysis becomes unsound.

B4 Error #4 Null Hypothesis

A null hypothesis is the argument that two samples differ only by chance. It is not uncommon to take a given Robust Statistical sample and divide it into smaller categories, which will therefore have larger uncertainties. If the confidence intervals of smaller categories overlap there will be some chance that they are the same, and the greater the overlap the greater the confidence in the null hypothesis. A pure mathematical proof is complex for small sample sizes, especially when more than two samples are involved, but a graphical illustration is sufficient for this case.

For example, the reliability of electrical relays could be determined by testing several of them. However, if different makes, or types, of relay were involved, their samples would be smaller and their probability distribution functions could overlap. If so, there would be a finite probability that the small samples would falsely claim one type to be more reliable than another, or show a difference when there is none.

If the small sample sizes are less than ten, the confidence limits can be measured in orders of magnitude, so large differences would be needed before conclusions could be drawn. Figure B1 shows ten equally spaced items with normal distributions being ranked, with 95% confidence intervals equal to the frequency range. The probability of sample #1 coming lowest in the ranking is:

Pr(#1 is lowest) = Pr(#1 < #2, #1 < #3, … , #1 < #10), i.e. a function of all the pairwise probabilities Pr(#1 < #k)


[Figure B1: Illustration of Ranking Errors when Segregating Data. Ten items (#1 to #10) with overlapping normal distributions, equally spaced across a range of two orders of magnitude, each with a 95% confidence interval comparable to that range.]
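The ranking error illustrated by Figure B1 can be checked by simulation. The spacing and spread below are assumptions chosen to mirror the figure's description; the code is an illustration, not part of the guide.

```python
import random

# Monte Carlo check of Figure B1: ten items equally spaced (on a log scale) over
# two orders of magnitude, each measured with a wide, overlapping uncertainty.

random.seed(1)
true_logs = [i * 2.0 / 9 for i in range(10)]  # log10 of true values over 2 decades
sigma = 0.5                                    # 95% interval ~ +/-1 decade (assumed)

trials, lowest_correct = 100_000, 0
for _ in range(trials):
    measured = [mu + random.gauss(0.0, sigma) for mu in true_logs]
    if min(range(10), key=lambda i: measured[i]) == 0:  # is item #1 ranked lowest?
        lowest_correct += 1

print(f"Pr(item #1 correctly ranked lowest) = {lowest_correct / trials:.2f}")
```

On these assumptions, even the extreme item is misranked a substantial fraction of the time, and a fully correct ordering of all ten items is rarer still.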



A rough estimate of each term in the sequence they rarely are, so it will only misrepresent reality.
can be seen from the area of overlap between This is a form of circular reasoning, as discussed
distributions, which is clearly very high and in Appendix A5 above, in which a false premise
implies that the probability of correctly ordering creates beliefs that are counter to existing
any two items is small and for all ten items it is knowledge.
negligible. Each term in the sequence is smaller
than the previous, but there is a very high Time relationships can be a good example of
likelihood that the highest and lowest items will this, where a simple frequency of failure must
be incorrectly ranked. represent a time dependent function. Assume
a corrosion life of 10 years, ±1 year. One
item would be expected to fail after 10 years,
B5 Error #5 whilst 100 of them would only reduce this to 9
Ludic Fallacy - ndependence and years. Furthermore, failures get investigated,
Randomness with repairs and/or replacements before the
System is brought back into operation, typically
A key factor in any statistical analysis is the preventing any recurrence. The same would be
relationship between variables. When two dice true for fatigue failures and, to a lesser extent,
are thrown the outcome on each die is random overpressure, isolations, and other activities. It
and therefore independent of the other. The would therefore be illogical to assume that five
probability of throwing two sixes is therefore the identical process trains would create five times
product of the separate probabilities (= 1/36). the Risk of a single train. Most models count
However, independence is not true for modelling equipment, based on the false premise that the
Systems because: number of components has a linear relationship
1. they may be complex, multifunctional with with Risk.
many interdependencies.
Finally, there may be common mode failure
2. their failure modes may be time related.
mechanisms, which link the failures between
3. additional failures may be rectified after the supposedly independent RRMs. These may
first incident. have many common mode failures, as all items
4. there may be common mode failure may be exposed to the same permit system,
mechanisms. maintenance philosophy, safety culture, quality
All models processing multiple datasets must simulate the relationships between them, although these are unlikely to be known. The pragmatic solution is to assume randomness, i.e. no relationship, as this simplifies the mathematics to the product of the probabilities. However, if this is not valid it can lead to huge underestimations or overestimations of the Risk. Each dataset may be informative for one variable, but the model attempts to simulate the combined effects of all variables. There may be some understanding of these relationships, but mathematical limitations necessitate assumptions that ignore it and therefore detract value, even though logical argument could be more informative and save a great deal of simulation work. If the datasets were independent, then the model could add value, but for the types of problems under discussion they rarely are, so it will only misrepresent reality. This is a form of circular reasoning, as discussed in Appendix A5 above, in which a false premise creates beliefs that are counter to existing knowledge.

Time relationships are a good example of this, where a simple frequency of failure must represent a time dependent function. Assume a corrosion life of 10 years, ±1 year. One item would be expected to fail after 10 years, whilst 100 of them would only reduce this to about 9 years. Furthermore, failures get investigated, with repairs and/or replacements made before the System is brought back into operation, typically preventing any recurrence. The same would be true for fatigue failures and, to a lesser extent, overpressure, isolations and other activities. It would therefore be illogical to assume that five identical process trains would create five times the Risk of a single train, yet most models count equipment, based on the false premise that the number of components has a linear relationship with Risk.

Finally, there may be common mode failure mechanisms, which link the failures of supposedly independent RRMs. These can be numerous, as all items may be exposed to the same permit system, maintenance philosophy, safety culture, quality system, budgetary and resource problems, climatic exposure, and even one technician's incompetence applied to different Systems.

For these reasons, the oft-used assumption that the overall probability is simply the product of all the individual probabilities may range from pessimistic to highly optimistic in Risk terms. The RSS Guides specifically require independence of variables to be demonstrated, never assumed.
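As a rough numerical check on the corrosion example above (a minimal sketch; treating the ±1 year spread as roughly a 2-sigma band is an assumption made here, not stated in the text):

    import random

    trials = 10_000
    total = 0.0
    for _ in range(trials):
        # 100 identical items, corrosion life 10 years +/- 1 year
        # (taken here as sigma = 0.5 years).
        lives = [random.gauss(10.0, 0.5) for _ in range(100)]
        total += min(lives)  # time to FIRST failure in the population

    print(f"Mean time to first failure of 100 items: "
          f"{total / trials:.1f} years")
    # Roughly 8.7 years: a hundredfold increase in items brings the
    # first failure forward by about a year, not by a factor of 100.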
B6 Error #6
Illegitimate Transposition of the Conditional (Prosecutor's Fallacy)

The conditional probability of A, given that B is already satisfied, is expressed mathematically as Pr(A│B), where B is known as the conditional.



However, unless everyday problems are expressed in mathematical notation, it is not unusual for Pr(B│A) to be mistaken for Pr(A│B), which is known as the Illegitimate Transposition of the Conditional, or the Prosecutor's Fallacy.

For example, the probability of an accident given a Hazard, i.e. Pr(Accident│Hazard), is not the same as Pr(Hazard│Accident). However, the latter may be the only one that can be measured, because accident reports provide this information, whilst it is unusual to record Hazard frequencies. Bayesian theory can be used to relate the two, i.e.:

Pr(Accident│Hazard) = Pr(Hazard│Accident) × (The number of Accidents ÷ The number of Hazards)

The ratio of accidents to Hazards is known as the Likelihood Ratio, but this still requires the Hazard probability to be known, so there may be no credible solution. This is a common problem with Bayesian analysis, which is often fudged by simply guessing the denominator, on the basis that this at least gets closer to the true answer. However, if the original error is four orders of magnitude and Bayesian analysis reduces this to three, then the result is still one thousand times out. With rare events, such as major accidents, the ratio could be many orders of magnitude, giving unacceptable errors.
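As a hedged numerical illustration of the scale of the problem, the relation above can be applied directly; every figure below is hypothetical:

    # Hypothetical records for one Hazard type over a reporting period.
    n_hazards = 50_000   # times the Hazard arose (rarely recorded)
    n_accidents = 5      # accidents that actually resulted

    # Measurable from accident reports: suppose the Hazard was present
    # in 80% of investigated accidents.
    p_hazard_given_accident = 0.80

    # Relation from the text:
    # Pr(Accident|Hazard) = Pr(Hazard|Accident) x (accidents / Hazards)
    p_accident_given_hazard = p_hazard_given_accident * n_accidents / n_hazards

    print(f"Pr(Hazard|Accident) = {p_hazard_given_accident:.2f}")
    print(f"Pr(Accident|Hazard) = {p_accident_given_hazard:.1e}")  # 8.0e-05
    # Transposing the conditional here would overstate the Risk by a
    # factor of 10,000, and the ratio is only available at all because
    # n_hazards happened to be recorded.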

One of the most researched examples of this mistake was the trial of Sally Clark, for the murders of her two children, who were later concluded to have died of cot deaths. There was no evidence of foul play, except that an expert witness, Professor Sir Roy Meadow, stated that the probability of natural deaths was only 1 in 73,000,000 (17), (18). Ms. Clark was found guilty and given a life sentence.

In practice, he had stated the probability of two cot deaths given that she was innocent:

Pr(2 cot deaths│innocence) = 1:73,000,000

However, this is not the question that the trial sought to establish, which was, "What was the probability that she was innocent given two cot deaths?", to which the answer was:

Pr(innocence│2 cot deaths) = 15:1 (odds in favour of innocence)

The fact that the error, which was over a billion times (25), went unnoticed shows that the potential is almost unlimited, and that even expert judgement cannot be relied upon to notice it or to sense-check Risk figures. Even the appeal court judges failed to understand the difference when it was explained to them eighteen months later, stating that it was "a straight mathematical calculation to anyone who knew the birth-rate over England, Scotland and Wales". It was not until the case became a cause célèbre and academics wrote papers on it that the defence were able to get the judgement overturned on a second appeal; by which time she had spent three years in jail.

All data, together with its application, should therefore be thoroughly reviewed for possible Illegitimate Transposition of the Conditional errors.

B7 Error #7
Unfalsifiable Chance Correlations

Correlations between any two variables, e.g. fatigue and equipment type, should have proven causal relationships, free from Unfalsifiable Chance Correlations (UCCs), where multiple tests looking at different variables may eventually find correlations due to chance alone; e.g. testing at a P value of 0.05 (95% confidence) would create, on average, one false correlation in every 20 trials. A good explanation of UCCs is given in a study by Aschwanden (27), which demonstrated that if people cut the fat off their meat there is a 99.7% probability that they are atheists. The point was deliberately provocative, to illustrate that mathematical rigour alone is not enough: 3,200 equally absurd hypotheses were tested, most of which showed little or no correlation,
although several high correlations occurred purely by chance. It is therefore not good enough that mathematical protocol is followed; the background population behind any data will need to be checked, and until a correlation has been justified by a logical explanation it cannot be treated as proof.
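The multiple-testing arithmetic behind UCCs is easy to verify; a minimal sketch follows (assuming independent tests, itself an idealisation):

    # Chance of at least one spurious "hit" when testing true-null
    # hypotheses at p = 0.05, assuming the tests are independent.
    alpha = 0.05
    for n in (1, 20, 100, 3_200):
        p_any_false = 1 - (1 - alpha) ** n
        print(f"{n:>5} tests: P(>=1 false correlation) = {p_any_false:.3f}, "
              f"expected false correlations ~ {alpha * n:.0f}")

With thousands of hypotheses, scores of "significant" correlations are guaranteed by chance alone, which is the Aschwanden result in miniature.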
The Sally Clark case above was probably one of the most researched examples of statistical error, but it nevertheless contained a probable UCC error of one hundred times that went unnoticed even by the academics. The 1:73,000,000 figure came from the CESDI study (17), which was based on comprehensive questionnaires that asked the mothers of cot death babies many diverse questions, such as how the baby slept, the bed sheet materials, whether it had a nanny or had taken an international flight, the wealth and age of the parents, and so on. The three best correlations from these each roughly halved the frequency, which is not a particularly strong effect, yet they combined to change the base rate cot death figure of 1:850 to 1:8,500, which was then bizarrely squared to account for the two deaths (another unfalsifiable assumption, namely that there was no common Cause, genealogical or otherwise), giving 1:73,000,000. Whilst there is nothing to prove that the correlations and assumptions were false, there is equally nothing to justify using them as evidence. Nevertheless, this point went unnoticed, and a review of the available papers on the Clark case (25) could not find any that had identified this error.
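The arithmetic trail described above can be reproduced directly from the quoted figures:

    base = 1 / 850                 # quoted base rate for one cot death
    adjusted = base / 2 / 2 / 2    # three correlations, each ~halving it
    two_deaths = (1 / 8_500) ** 2  # study's rounded figure, then squared

    print(f"Adjusted single-death rate: 1 in {1 / adjusted:,.0f}")    # 1 in 6,800
    print(f"Squared for two deaths:     1 in {1 / two_deaths:,.0f}")  # 1 in 72,250,000

The squaring step is itself the unfalsifiable independence assumption criticised above: it presumes no common genetic or environmental Cause between the two deaths.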

B8 Conclusions

Probabilistic errors can be notoriously difficult to find, even by experts, so a sufficiently rigorous quality assurance process (one which interrogates the data collection, its interpretation, the algorithms and the model's architecture) may not be a realistic proposition.


APPENDIX C:
COMPLIANCE CHECKLIST

Framework for an ALARP Checklist

Hazard, Failure Mode or Cause:


Analysis performed: Yes, No or N/A | Justification/References

IDENTIFICATION STAGE:
The System is defined and understood
Formal identification/brainstorming exercise
Appropriate disciplines/personnel involved
Lessons from incidents, accidents, and precursors
Health, injury and mortality effects assessed
Lifecycle – construction/manufacture
Lifecycle – commissioning/set-up/trials
Lifecycle – storage/mothballing
Lifecycle – operations
Lifecycle – maintenance and inspection
Lifecycle – decommission, dismantle, dispose
Reliability issues
Control issues
Activities and procedures
Exceeding the design envelope
Natural environment
Deterioration, ageing

RRMs – Hierarchy of Controls and Table 3:


Elimination/Substitution
Remove people (simplify, automate, relocate)
Containment (isolate, protect)
Detection, control, and recovery
Awareness, competence
Procedures, organisation
Maintenance, inspection
Personal Protective Equipment (PPE)
Emergency response
Risk transfers/trade-offs
Good Practice, standards, ACoPs, and guidance
Deviations and Gross Disproportion justified
Prescriptive legislation
ALARP? (No reason for further consequence or escalation assessment and no residual Risk) If yes, go to Lifecycle Criteria below.



CONSEQUENCE STAGE:
Suitable evaluation of Foreseeable consequences
Suitable evaluation of Foreseeable escalations
Consequence reduction measures identified
Less Hazardous materials
Layout
Escape facilities/routes/protection
Access to emergency equipment or control points
Effective passive RRMs
Effective active RRMs
Automation, personnel removal/relocation
Emergency Response
ALARP? (No residual/significant consequences) If yes, go to Lifecycle Criteria below.

ANALYTICAL STAGE:
Procedural analysis
Human factors analysis
Software analysis
Functionality analysis
RRM analysis – Effectiveness
RRM analysis – Failure to Safety
RRM analysis – Reliability
RRM analysis – Independence/Redundancy
RRM analysis – Diversity
RRM analysis – Self-Revealing Failures

LIFECYCLE CRITERIA DEFINED:


Operational envelope
Safety Critical procedures
Safety philosophies and strategies
The means for warning of unsafe conditions
Safety Critical elements (SCEs) and functions
SCE specifications and performance requirements
Design life/life expectancy
Maintenance and inspection requirements
Criteria for ongoing monitoring
Contingency plans for Foreseeable failures
Foreseeable changes to the design or operation
Emergency response, rescue, and recovery plans
Requirements for decommissioning and disposal



DOCUMENTATION AND COMMUNICATION:
Awareness campaigns
Procedures
Training and competency
Safety objectives/philosophies
Lifecycle criteria embedded
ALARP Proforma/safety case



Institution of
Mechanical Engineers
1 Birdcage Walk
Westminster
London SW1H 9JJ

imeche.org
