1. Risk Governance
Enterprise Risk Management (ERM) enables Hexaware management to effectively govern and manage the enterprise's approach to risk and to create sustainable value for its stakeholders through its business objectives.
At Hexaware, ERM is built on three lines of defence as the first principle of the risk management framework, as described below.
At each line of defence, risk governance guidance is defined to support the ERM framework.
The first line of defence operates at the project level, where the Project Manager must understand the roles and responsibilities associated with project risks and treat those risks appropriately. The steps taken would include:
For more details on Project level risk management, please refer to Section 2.0 of this document.
The Account Service Delivery Manager (ASDM), Account Manager (AM) and the Delivery Head (DH)
form the risk management committee. This risk management committee is the first line of defence of the
risk governance framework. This committee is empowered with the responsibility and accountability to
effectively plan, build, run and monitor the project’s day-to-day risk environment. The committee
provides direction regarding risk response (i.e., treatment) for those risks that are outside of the Account /
Vertical risk tolerance.
The Project Manager should identify suitable mitigation / contingency plans and get them approved by the ASDM, who is part of the risk management committee at the project level. The Project Manager is responsible for ensuring that the control activities and other responses that treat the risk are enforced and monitored for compliance.
During monthly project reviews, the ASDM monitors the status of the risks and the remedial actions taken. This information is then collated with other risk reports for the second-line (Account-level executive risk committee) and/or third-line (Board risk committee) risk governance committees, who are charged with representing the enterprise's stakeholders in respect of risk issues.
The responsibilities of these second-line defence functions include participating in the Account / Vertical
risk committees, reviewing risk reports and validating compliance to the risk management framework
requirements, with the objective of ensuring that risks are actively and appropriately managed.
The second level of defence is responsible for the following pertaining to their account / vertical:
The second line of defence derives information from first-line management and independently assesses the risk information. Any value additions to the ERM framework are communicated to the Enterprise Risk Governance Committee, i.e. the third line of defence. The Enterprise Risk Governance Committee evaluates the reports from these multiple sources and determines the direction for the organization.
This committee has the responsibility and accountability to provide effective oversight of the enterprise’s
risk profile. This committee ensures that the enterprise’s executive management is effectively governing
and managing the enterprise’s risk environment.
The ERGC periodically reviews the second-line-of-defence activities and results, including the risk governance functions involved, to ensure that the ERM arrangements and structures are appropriate and are discharging their roles and responsibilities completely and accurately. It also verifies the results of external audits, if any.
The results of these independent reviews are communicated to the Executive Council, which ensures that appropriate action is taken to maintain and enhance ERM effectiveness.
Each business head manages the third-party risks of their respective business unit and reports to the Enterprise Risk Governance Committee (ERGC), which is chaired by the Chief Operating Officer. The ERGC meets quarterly to review the status. The COO reports to the Executive Council and the CEO on the status of the enterprise's risks.
One risk identification technique is Failure Modes & Effects Analysis (FMEA). In this technique, requirements are taken as the basis for further steps. For each requirement, possible reasons for failure are identified, and for each reason the associated effects on the whole system are determined. The analysis of failure modes and effects helps prevent failures or reduce their impact. Another technique used for risk identification is brainstorming.
Sometimes the cost of a solution may be higher than the actual loss caused by the problem; in such a scenario it is better to accept the loss than to implement the solution. The same applies to risk management. A cost-benefit analysis helps decide whether to take mitigation action, live with the risk, or determine to what extent mitigation should be carried out.
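As an illustration of such a cost-benefit comparison, the following sketch (in Python, with purely hypothetical figures and function names that are not part of PRIME or SmartBase) compares the expected loss of a risk with and without mitigation against the cost of the mitigation itself.

```python
# Hypothetical cost-benefit check for a single risk (illustrative values only).

def expected_loss(probability: float, impact_cost: float) -> float:
    """Expected monetary loss = probability of occurrence x cost if the risk occurs."""
    return probability * impact_cost

# Assumed figures for one risk: 30% chance of a 50,000 loss.
loss_without_mitigation = expected_loss(0.30, 50_000)

# Mitigation is assumed to cost 8,000 and to cut the probability to 5%.
mitigation_cost = 8_000
loss_with_mitigation = expected_loss(0.05, 50_000) + mitigation_cost

if loss_with_mitigation < loss_without_mitigation:
    print("Mitigate: expected cost", loss_with_mitigation, "<", loss_without_mitigation)
else:
    print("Live with the risk: mitigation is costlier than the expected loss")
```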
Identify the source of risks, i.e. the areas to focus upon to identify risks, e.g. the development environment, project constraints, etc. Project characteristics drive the identification of risk sources.
After identifying the source, identify the problem area within it, i.e. identify the category within the source; e.g. within Project constraints, 'Consultants' may be a category.
The next step is to identify the subcategories within the category; e.g. for Project constraints (source) and 'Consultants' (category), the subcategories could be Schedule, Facilities, etc.
Finally, questions are raised, and if there is a concern it is identified as a risk. For example, the question 'Are there areas where required technical skills are lacking?' is raised, and if the answer is 'Yes', it is identified as a risk.
A Risk Identification Checklist is provided in PRIME. It contains several questions for each source / category / subcategory. The questions may be answered to determine whether there is any risk; the response to each question may be Yes, No or NA (Not Applicable). These are only guidelines; along similar lines, you can identify more questions that may help in identifying risks. The risk identification checklist is to be used at project initiation to conduct a risk assessment, and a re-assessment is to be conducted once every six months. A typical three-tier structure is given below for reference:
RISK IDENTIFICATION
Source                 Category      Sub Category
Project constraints    Consultants   Schedule, Facilities
Context 1: A new employee who lacks functional knowledge will inject more defects.
Risk Statement 1: Lack of functional knowledge among new employees results in more defects.
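A minimal sketch of how such a three-tier checklist and the resulting risk identification could be represented is given below; the structure, question and answer are illustrative only and do not reproduce the actual PRIME checklist.

```python
# Illustrative three-tier risk identification checklist (not the actual PRIME checklist).
checklist = {
    "Project constraints": {                      # Source
        "Consultants": {                          # Category
            "Schedule": [                         # Subcategory
                "Are there areas where required technical skills are lacking?",
            ],
            "Facilities": [
                "Are the required facilities available on time?",
            ],
        },
    },
}

# Walk the checklist; any question answered "Yes" (i.e. a concern exists) becomes a risk.
answers = {"Are there areas where required technical skills are lacking?": "Yes"}

for source, categories in checklist.items():
    for category, subcategories in categories.items():
        for subcategory, questions in subcategories.items():
            for question in questions:
                if answers.get(question, "No") == "Yes":
                    print(f"Risk identified under {source} > {category} > {subcategory}: {question}")
```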
It is important to assess and highlight key application and infrastructure vulnerabilities. These can be removed or mitigated by tuning the design or reconfiguring the infrastructure. Alternatively, if vulnerabilities cannot be eliminated or lessened, they can be deferred, or at least monitored for occurrence.
The identified risks are recorded in the "Risks" module in SmartBase or in the Risk Management Form, an Excel file. There may be additional risks depending upon project characteristics; these need to be identified in addition to those identified through the checklist. There could also be unforeseen incidents that were not identified as risks but occurred and had a negative impact on project success. Such incidents are to be logged in the 'Critical Incidents' sheet of the Risk Management Form.
Any new risk identified as an outcome of reviews or usage of PPM should also be logged in the Risk Management Form / SmartBase > Risks module.
Projects should use the ‘Risks’ module in SmartBase for recording the identified risk, the relevant
risk parameter values, mitigation plans and contingency plans.
The quantification of the impact should consider effort loss or effort overrun, loss of billing, quantified opportunity loss, etc. The impact is measured on a scale of 1 to 10, with 10 being the maximum. A recommended scale for impact assessment is as follows:
Impact   Impact Range (SmartBase)
Low      1-4
Medium   5-7
High     8-10
In SmartBase, ordinal values are accepted for Probability and Impact, and the corresponding numeric values are used; for example, High is considered as 9.
The Risk Exposure (RE) is calculated as the product of Probability and Impact.
The Risk Threshold is defined at the organization level based on the organizational risk analysis and is set by default to 5.6 in the SmartBase PM Plan. The PM can override this value based on the risk tolerance of the project.
If the risk exposure value is greater than the risk threshold value, the method of risk management is set by default to Control in SmartBase and cannot be changed by the user.
If the method of risk management is chosen as Control or Avoidance, then a mitigation plan and a contingency plan for the risk are mandatory.
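The calculation described above can be sketched as follows. The function names are illustrative (not SmartBase code); it is assumed here that probability is expressed as a fraction between 0 and 1 so that the exposure stays comparable with the default threshold of 5.6, and only the High = 9 impact mapping is stated in the text, so the other ordinal values are assumptions.

```python
# Illustrative Risk Exposure calculation (not SmartBase code).
# Assumption: probability is a fraction in [0, 1]; impact is on the 1-10 scale above.

IMPACT_ORDINALS = {"Low": 3, "Medium": 6, "High": 9}   # assumed mapping; only High = 9 is stated
DEFAULT_RISK_THRESHOLD = 5.6                           # organizational default from the SmartBase PM Plan

def risk_exposure(probability: float, impact: int) -> float:
    """Risk Exposure (RE) = Probability x Impact."""
    return probability * impact

def default_method(re_value: float, threshold: float = DEFAULT_RISK_THRESHOLD) -> str:
    """If RE exceeds the threshold, the method of risk management defaults to Control."""
    return "Control" if re_value > threshold else "To be selected (Avoidance / Control / Acceptance)"

re_value = risk_exposure(probability=0.7, impact=IMPACT_ORDINALS["High"])   # 0.7 x 9 = 6.3
print(re_value, default_method(re_value))   # 6.3 Control -> mitigation and contingency plans mandatory
```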
Based on the priority of risks, a risk management method can be selected. The risk management methods are:
Avoidance: taking actions that will avoid the risk.
Control: taking active steps to minimize the risk.
Acceptance: not taking any action and accepting the risk when it occurs.
Probability ↓ / Impact →   Low          Medium       High
Most Likely                Control      Avoidance    Avoidance
Probable                   Control      Control      Avoidance
Occasional                 Acceptance   Control      Control
Unlikely                   Acceptance   Acceptance   Control
The above guideline, with magnitude values, is given below for reference.
An organizational guideline is available for the various categories of risks, covering probability and impact. These guidelines should be used while assessing risks. The organizational guideline on the exposure value used to decide on risk mitigation should also be used.
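The probability / impact guideline matrix above can be encoded as a simple lookup, as sketched below; the assumption that the three columns correspond to Low, Medium and High impact is an interpretation of the table rather than something stated explicitly.

```python
# Illustrative encoding of the probability/impact guideline matrix above.
# Assumption: the three columns correspond to Low, Medium and High impact.
METHOD_MATRIX = {
    "Most Likely": {"Low": "Control",    "Medium": "Avoidance",  "High": "Avoidance"},
    "Probable":    {"Low": "Control",    "Medium": "Control",    "High": "Avoidance"},
    "Occasional":  {"Low": "Acceptance", "Medium": "Control",    "High": "Control"},
    "Unlikely":    {"Low": "Acceptance", "Medium": "Acceptance", "High": "Control"},
}

def recommended_method(probability: str, impact: str) -> str:
    """Look up the suggested risk management method for a probability/impact pair."""
    return METHOD_MATRIX[probability][impact]

print(recommended_method("Probable", "High"))   # Avoidance
```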
The risk mitigation plan identifies how specific risks will be dealt with and the steps that are
required to carry them out. It gives team members a clear sense of the actions that they are
expected to take and provides management with an understanding of what actions are being taken
on their behalf to minimize project risks.
The risk management method (Avoidance, Control or Acceptance) should be decided for each identified risk. The method chosen will depend on the cost of the risk and the cost of the mitigation plans. The chosen method should be documented in the Risk Management Form.
For each identified risk under Avoidance, mitigation actions are determined and an owner is assigned who is primarily responsible for taking the mitigation action and monitoring the risk. The risk mitigation activity can be examined for the benefit it provides versus the cost expended.
Mitigation thresholds are event-driven milestones in the project cycle that define when mitigation actions should be started. Examples of mitigation thresholds are:
− Development/Test environment not set up by a given date
− Software/tool licenses not available
− Response from the customer not received by a certain date
− Start of the coding phase
− Estimated effort used for the project has reached 50%
− Work completion status
Mitigation thresholds should be defined for all risks for which mitigation planning is done.
Sometimes it is not possible to mitigate a risk; i.e., it is not possible to incur a cost ahead of an
uncertain event that will either reduce the likelihood of that event occurring or limit the loss
should the event occur. Where mitigation is not possible, a contingency plan can be employed.
Contingency refers to an organized and coordinated set of steps to be taken after the risk occurs.
A contingency plan is nothing more than a plan to solve a problem that may occur but has not
occurred yet.
Contingency actions are also planned for all risks under Avoidance and Control so that when a risk becomes a reality the PM can immediately initiate the action. The advantage of planning contingencies in advance is that adequate time is available to brainstorm and arrive at the best possible alternative.
Ownership of a risk should be given to a person who understands the situation and its impact well. It is also necessary to decide when contingency actions will be initiated.
Projects using SmartBase should record the mitigation / contingency plan for each identified risk in the "Risks" module in SmartBase.
When a risk occurs, its status should be changed to 'Occurred' and the risk occurrence details and actions taken should be recorded in SmartBase.
Risk assessment techniques can be classified in various ways to assist with understanding their relative strengths and weaknesses. Some of the techniques are described below, along with the nature of the assessment they provide and guidance on their applicability to certain situations.
Technique                Risk identification   Consequence   Probability   Level of risk   Risk evaluation
3 FMEA analysis          SA                    SA            SA            SA              SA
4 Event tree analysis    A                     SA            A             A               NA
5 Fault tree analysis    A                     NA            SA            A               A
6 Bow tie analysis       NA                    A             SA            SA              A
7 Bayesian               NA                    SA            NA            NA              NA
(SA = strongly applicable, A = applicable, NA = not applicable)
3.1 Brainstorming
3.1.1 Overview
Effective facilitation is very important in this technique and includes stimulation of the discussion at kick-
off, periodic prompting of the group into other relevant areas and capture of the issues arising from the
discussion.
3.1.2 Use
Brainstorming can be used in conjunction with other risk assessment methods described below or may
stand alone as a technique to encourage imaginative thinking at any stage of the risk management process
and any stage of the life cycle of a system. It may be used for high-level discussions where issues are
identified, for more detailed review or at a detailed level for problems.
3.1.3 Input
A team of people with knowledge of the organization, systems, processes or applications being
assessed.
3.1.4 Process
Brainstorming may be formal or informal. Formal brainstorming is more structured with participants
prepared in advance and the session has a defined purpose and outcome with a means of evaluating ideas
put forward. Informal brainstorming is less structured and often more ad-hoc.
In a formal process:
• the facilitator prepares thinking prompts and triggers appropriate to the context prior to
the session
3.1.5 Output
Outputs depend on the stage of the risk management process at which it is applied, for example at the
identification stage, outputs might be a list of risks and current controls.
Limitations include:
• participants may lack the skill and knowledge to be effective contributors;
• since it is relatively unstructured, it is difficult to demonstrate that the process has been
comprehensive, e.g. that all potential risks have been identified;
• there may be group dynamics where some people with valuable ideas stay
quiet while others dominate the discussion. This can be overcome by computer
brainstorming, using a chat forum or nominal group technique. Computer brainstorming
can be set up to be anonymous, thus avoiding personal and political issues which may
impede free flow of ideas. In nominal group technique ideas are submitted anonymously
to a moderator and are then discussed by the group.
3.2 Delphi technique
3.2.1 Overview
The Delphi technique is a procedure for obtaining a reliable consensus from a group of experts. Although the term is now often used broadly to mean any form of brainstorming, an essential feature of the Delphi technique, as originally formulated, was that experts expressed their opinions individually and anonymously while having access to the other experts' views as the process progressed.
3.2.2 Use
The Delphi technique can be applied at any stage of the risk management process or at any phase of a
system life cycle, wherever a consensus of views of experts is needed.
3.2.3 Input
3.2.4 Process
3.2.5 Output
Convergence toward consensus on the matter in hand.
Limitations include:
• it is labour intensive and time consuming;
• participants need to be able to express themselves clearly in writing.
3.3 Failure modes and effects analysis (FMEA)
3.3.1 Overview
Failure modes and effects analysis (FMEA) is a technique used to identify the ways in which components,
systems or processes can fail to fulfil their design intent.
FMEA identifies:
• all potential failure modes of the various parts of a system (a failure mode is what is
observed to fail or to perform incorrectly);
• the effects these failures may have on the system;
• the mechanisms of failure;
• how to avoid the failures, and/or mitigate the effects of the failures on the system.
FMECA extends an FMEA so that each fault mode identified is ranked according to its importance or
criticality. This criticality analysis is usually qualitative or semi-quantitative but may be quantified using
actual failure rates.
3.3.2 Use
There are several applications of FMEA: Design (or product) FMEA which is used for components and
products, System FMEA which is used for systems, Process FMEA which is used for manufacturing and
assembly processes, Service FMEA and Software FMEA.
FMEA may be applied during the design, manufacture or operation of a physical system.
To improve dependability, however, changes are usually more easily implemented at the design stage.
FMEA may also be applied to processes and procedures. For example, it is used to identify potential for
medical error in healthcare systems and failures in maintenance procedures.
FMEA can provide input to other analysis techniques, such as fault tree analysis, at either a qualitative or quantitative level.
3.3.3 Input
FMEA needs information about the elements of the system in sufficient detail for a meaningful analysis of the ways in which each element can fail. For a detailed Design FMEA the element may be at the detailed individual component level, while for a higher-level System FMEA, elements may be defined at a higher level.
3.3.4 Process
The FMEA process is as follows:
(The FMEA template includes columns such as Effect, Severity, Control and Detectability.)
1. Assemble a cross-functional team of people with diverse knowledge about the process, product or
service and customer needs. Functions often included are: design, quality, testing, reliability,
maintenance, purchasing (and suppliers), sales, and customer service.
2. Identify the scope of the FMEA. Is it for concept, system, design, process or service? What are
the boundaries? How detailed should we be? Use flowcharts to identify the scope and to make
sure every team member understands it in detail. (From here on, we’ll use the word “scope” to
mean the system, design, process or service that is the subject of your FMEA.)
4. Identify the functions of your scope. Ask, “What is the purpose of this system, design, process or
service? What do our customers expect it to do?” Usually you will break the scope into separate
subsystems, items, parts, assemblies or process steps and identify the function of each.
5. For each function, identify all the ways failure could happen. These are potential failure modes. If
necessary, go back and rewrite the function with more detail to be sure the failure modes show a
loss of that function.
6. For each failure mode, identify all the consequences on the system, related systems, process,
related processes, product, service, customer or regulations. These are potential effects of failure.
Ask, “What does the customer experience because of this failure? What happens when this failure
occurs?”
7. Determine how serious each effect is. This is the severity rating, or S. Severity is usually rated on
a scale from 1 to 10, where 1 is insignificant and 10 is catastrophic. If a failure mode has more
than one effect, write on the FMEA table only the highest severity rating for that failure mode.
8. For each failure mode, determine all the potential root causes. Use tools classified as cause analysis tools, as well as the best knowledge and experience of the team. List all possible causes for each failure mode on the FMEA form. The use of cause-and-effect (Ishikawa) diagrams and Pareto diagrams is recommended.
9. For each cause, determine the occurrence rating, or O. This rating estimates the probability of
failure occurring for that reason during the lifetime of your scope. Occurrence is usually rated on
a scale from 1 to 10, where 1 is extremely unlikely and 10 is inevitable. On the FMEA table, list
the occurrence rating for each cause.
10. For each cause, identify current process controls. These are tests, procedures or mechanisms that
you now have in place to keep failures from reaching the customer. These controls might prevent
the cause from happening, reduce the likelihood that it will happen or detect failure after the
cause has already happened but before the customer is affected.
11. For each control, determine the detection rating, or D. This rating estimates how well the controls can detect either the cause or its failure mode after they have happened but before the customer is affected. Detection is usually rated on a scale from 1 to 10 (refer to the detection rating table). On the FMEA table, list the detection rating for each cause.
12. Is this failure mode associated with a critical characteristic? (Critical characteristics are
measurements or indicators that reflect safety or compliance with government regulations and
need special controls.) Usually, critical characteristics have a severity of 9 or 10 and occurrence
and detection ratings above 3.
13. Calculate the risk priority number, or RPN, which equals S × O × D. Also calculate Criticality by
multiplying severity by occurrence, S × O. These numbers provide guidance for ranking potential
failures in the order they should be addressed.
14. Identify recommended actions and record them in the template along with the responsible person. These actions may be design or process changes intended to lower the RPN value. A new design change, process change or control will change the severity, the likelihood of occurrence or the detection efficiency; the values of Severity, Likelihood of Occurrence and Ability to Detect should therefore be updated based on the action items implemented. This needs to be updated in the green columns of the template.
The RPN value is then recalculated using the formula above and checked against the acceptable range. The acceptable range of RPN may vary from situation to situation. The team needs to decide whether the new RPN value is acceptable; otherwise, the steps from 2 to 15 should be repeated.
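A minimal sketch of the RPN and criticality arithmetic from the steps above is given below; the failure modes and ratings are hypothetical and the field names do not reproduce the FMEA template.

```python
# Illustrative RPN calculation for FMEA entries (hypothetical data, not the actual template).

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number = Severity x Occurrence x Detection (each rated 1-10)."""
    return severity * occurrence * detection

def criticality(severity: int, occurrence: int) -> int:
    """Criticality = Severity x Occurrence."""
    return severity * occurrence

failure_modes = [
    {"mode": "Wrong tax rate applied", "S": 8, "O": 4, "D": 6},
    {"mode": "Report generation slow", "S": 4, "O": 7, "D": 2},
]

# Rank failure modes by RPN to decide the order in which they should be addressed.
for fm in sorted(failure_modes, key=lambda f: rpn(f["S"], f["O"], f["D"]), reverse=True):
    print(fm["mode"], "RPN =", rpn(fm["S"], fm["O"], fm["D"]),
          "Criticality =", criticality(fm["S"], fm["O"]))
```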
3.3.5 Output
The primary output of FMEA is a list of failure modes, the failure mechanisms and effects for each
component or step of a system or process (which may include information on the likelihood of failure).
Information is also given on the causes of failure and the consequences to the system. The output from
FMECA includes a rating of importance based on the likelihood that the system will fail, the level of risk
resulting from the failure mode or a combination of the level of risk and detectability of the failure mode.
FMECA can give a quantitative output if suitable failure rate data and quantitative consequences are used.
Limitations include:
• they can only be used to identify single failure modes, not combinations of failure modes;
• unless adequately controlled and focused, the studies can be time consuming and costly;
• they can be difficult and tedious for complex multi-layered systems.
3.4 Event tree analysis (ETA)
3.4.1 Overview
ETA is a graphical technique for representing the mutually exclusive sequences of events following an
initiating event according to the functioning/not functioning of the various systems designed to mitigate
its consequences (see Figure). It can be applied both qualitatively and quantitatively.
The figure shows simple calculations for a sample event tree when the branches are fully independent. By fanning out like a tree, ETA can represent the aggravating or mitigating events in response to the initiating event, considering additional systems, functions or barriers.
3.4.2 Use
ETA can be used for modelling, calculating and ranking (from a risk point of view) different accident scenarios following the initiating event.
ETA can be used at any stage in the life cycle of a product or process. It may be used qualitatively to help
brainstorm potential scenarios and sequences of events following an initiating event and how outcomes
are affected by various treatments, barriers or controls intended to mitigate unwanted outcomes.
Quantitative analysis lends itself to assessing the acceptability of controls. It is most often used to model failures where there are multiple safeguards.
ETA can be used to model initiating events which might bring loss or gain. However, circumstances
where pathways to optimize gain are sought are more often modelled using a decision tree.
3.4.3 Input
Inputs include:
• a list of appropriate initiating events;
• information on treatments, barriers and controls, and their failure probabilities
• understanding of the processes whereby an initial failure escalates.
3.4.4 Process
An event tree starts by selecting an initiating event. This may be an incident such as a dust explosion or a
causal event such as a power failure. Functions or systems which are in place to mitigate outcomes are
then listed in sequence. For each function or system, a line is drawn to represent their success or failure.
A probability of failure can be assigned to each line, with this conditional probability estimated e.g. by
expert judgement or a fault tree analysis. In this way, different pathways from the initiating event are
modelled.
Note that the probabilities on the event tree are conditional probabilities, for example the probability of a
sprinkler functioning is not the probability obtained from tests under normal conditions, but the
probability of functioning under conditions of fire caused by an explosion.
Each path through the tree represents the probability that all the events in that path will occur. Therefore, given that the various events are independent, the frequency of the outcome is the product of the individual conditional probabilities and the frequency of the initiating event.
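For independent branches, this arithmetic can be illustrated as follows; the initiating event frequency and barrier failure probabilities below are hypothetical.

```python
# Illustrative event tree calculation (hypothetical values).
# Outcome frequency = initiating event frequency x product of the conditional
# branch probabilities along the path, assuming the branches are independent.

initiating_frequency = 0.02              # e.g. 0.02 initiating events per year (assumed)
path_branch_probabilities = [0.1, 0.3]   # e.g. sprinkler fails (0.1), alarm fails (0.3)

outcome_frequency = initiating_frequency
for p in path_branch_probabilities:
    outcome_frequency *= p

print(f"Outcome frequency: {outcome_frequency} per year")   # 0.0006 per year
```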
3.4.5 Output
Outputs from ETA include the following:
• qualitative descriptions of potential problems as combinations of events producing
various types of problems (range of outcomes) from initiating events;
• quantitative estimates of event frequencies or probabilities and relative importance of
various failure sequences and contributing events;
• lists of recommendations for reducing risks;
• quantitative evaluations of recommendation effectiveness.
Limitations include:
• to use ETA as part of a comprehensive assessment, all potential initiating events need to be
identified. This may be done by using another analysis method (e.g. HAZOP, PHA), however,
there is always a potential for missing some important initiating events
• with event trees, only success and failure states of a system are dealt with, and it is difficult to
incorporate delayed success or recovery events;
• any path is conditional on the events that occurred at previous branch points along the path. Many dependencies along the possible paths are therefore addressed. However, some dependencies, such as common components, utility systems and operators, may be overlooked if not handled carefully, which may lead to optimistic estimates of risk.
3.5 Fault tree analysis (FTA)
3.5.1 Overview
Fault tree analysis is a technique for identifying and analysing the factors that can contribute to a specified undesired event (the 'top event'). The factors identified in the tree can be events associated with component hardware failures, human errors or any other pertinent events which lead to the undesired event.
3.5.2 Use
A fault tree may be used qualitatively to identify potential causes and pathways to a failure (the top event)
or quantitatively to calculate the probability of the top event, given knowledge of the probabilities of
causal events. It may be used at the design stage of a system to identify potential causes of failure and
hence to select between different design options. It may be used at the operating phase to identify how
major failures can occur and the relative importance of different pathways to the head event.
A fault tree may also be used to analyse a failure which has occurred, to display diagrammatically how different events came together to cause the failure.
3.5.3 Inputs
For qualitative analysis, an understanding of the system and the causes of failure is required, as well as a
technical understanding of how the system can fail. Detailed diagrams are useful to aid the analysis.
For quantitative analysis, data on failure rates or the probability of being in a failed state for
all basic events in the fault tree are required.
3.5.4 Process
The steps for developing a fault tree are as follows:
• The top event to be analysed is defined. This may be a failure or may be a broader outcome of that failure. Where the outcome is analysed, the tree may contain a section relating to mitigation of the actual failure.
• Starting with the top event, the possible immediate causes or failure modes leading to the top
event are identified.
• Each of these causes/fault modes is analysed to identify how their failure could be caused.
• Stepwise identification of undesirable system operation is followed to successively lower system
levels until further analysis becomes unproductive. In a hardware system this may be the
component failure level. Events and causal factors at the lowest system level analysed are known
as base events.
• Where probabilities can be assigned to base events the probability of the top event may be
calculated. For quantification to be valid it must be able to be shown that, for each gate, all inputs
are both necessary and sufficient to produce the output event. If this is not the case, the fault tree
is not valid for probability analysis but may be a useful tool for displaying causal relationships.
As part of quantification the fault tree may need to be simplified using Boolean algebra to account for
duplicate failure modes.
As well as providing an estimate of the probability of the head event, minimal cut sets, which form
individual separate pathways to the head event, can be identified and their influence on the top event
calculated.
Except for simple fault trees, a software package is needed to properly handle the calculations when
repeated events are present at several places in the fault tree, and to calculate minimal cut sets. Software
tools help ensure consistency, correctness and verifiability.
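As a minimal illustration of the quantification step (not a substitute for a dedicated FTA package), the sketch below combines independent basic-event probabilities through AND and OR gates; the events and probabilities are hypothetical.

```python
# Illustrative fault tree quantification with independent basic events (hypothetical values).
from math import prod

def and_gate(probabilities):
    """All inputs must fail: P = product of the input probabilities (independent events)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Any input failing causes the output: P = 1 - product of (1 - p)."""
    result = 1.0
    for p in probabilities:
        result *= (1.0 - p)
    return 1.0 - result

# Hypothetical tree: the top event occurs if the primary supply fails AND
# (the backup fails OR the switch-over fails).
p_primary_fails = 0.01
p_backup_fails = 0.05
p_switchover_fails = 0.02

p_top = and_gate([p_primary_fails, or_gate([p_backup_fails, p_switchover_fails])])
print(f"Top event probability: {p_top:.6f}")   # about 0.000690
```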
3.5.5 Outputs
The outputs from fault tree analysis are as follows:
• a pictorial representation of how the top event can occur which shows interacting pathways where
two or more simultaneous events must occur;
• a list of minimal cut sets (individual pathways to failure) with (where data is available) the
probability that each will occur;
• the probability of the top event.
Strengths of FTA:
• It affords a disciplined approach which is highly systematic, but at the same time sufficiently
flexible to allow analysis of a variety of factors, including human interactions and physical
phenomena.
• The application of the "top-down" approach, implicit in the technique, focuses attention on those
effects of failure which are directly related to the top event.
• FTA is especially useful for analysing systems with many interfaces and interactions.
• The pictorial representation leads to an easy understanding of the system behavior and the factors
included, but as the trees are often large, processing of fault trees may require computer systems.
This feature enables more complex logical relationships to be included (e.g. NAND and NOR)
but also makes the verification of the fault tree difficult.
• Logic analysis of the fault trees and the identification of cut sets is useful in identifying simple
failure pathways in a very complex system, where a combination of events which lead to the top
event could be overlooked.
Limitations include:
• Uncertainties in the probabilities of base events are included in calculations of the probability of
the top event. This can result in high levels of uncertainty where base event failure probabilities
are not known accurately; however, a high degree of confidence is possible in a well understood
system.
• In some situations, causal events are not bound together, and it can be difficult to ascertain
whether all important pathways to the top event are included. For example, including all ignition
sources in an analysis of a fire as a top event. In this situation probability analysis is not possible.
• Fault tree is a static model; time interdependencies are not addressed.
• Fault trees can deal only with binary states (failed/not failed).
• While human error modes can be included in a qualitative fault tree, in general failures of degree
or quality which often characterize human error cannot easily be included;
• A fault tree does not enable domino effects or conditional failures to be included easily.
3.6 Bow tie analysis
3.6.1 Overview
Bow tie analysis is a simple diagrammatic way of describing and analysing the pathways of a risk from
causes to consequences. It can be a combination of the thinking of a fault tree analysing the cause of an
event (represented by the knot of a bow tie) and an event tree analysing the consequences. However, the
focus of the bow tie is on the barriers between the causes and the risk, and the risk and consequences.
Bow tie diagrams can be constructed starting from fault and event trees but are more often drawn directly
from a brainstorming session.
3.6.2 Use
Bow tie analysis is used to display a risk showing a range of possible causes and consequences. It is used
when the situation does not warrant the complexity of a full fault tree analysis or when the focus is more
on ensuring that there is a barrier or control for each failure pathway. It is useful where there are clear
independent pathways leading to failure.
Bow tie analysis is often easier to understand than fault and event trees, and hence can be a useful
communication tool where analysis is achieved using more complex techniques.
3.6.3 Input
An understanding is required of information on the causes and consequences of a risk and the barriers and
controls which may prevent, mitigate or stimulate it.
3.6.4 Process
The bow tie is drawn as follows:
o Recovery preparedness measures prevent the top event leading to the consequence
Step 7: For each recovery preparedness measure, identify escalation factors and controls
Step 8: For each Barrier, Recovery Preparedness Measure and Escalation Factor control, identify the Critical Controls
Some level of quantification of a bow tie diagram may be possible where pathways are independent, the
probability of a consequence or outcome is known, and a figure can be estimated for the effectiveness of a
control. However, in many situations, pathways and barriers are not independent and controls may be
procedural and hence the effectiveness unclear. Quantification is often more appropriately carried out
using FTA and ETA.
3.6.5 Output
The output is a simple diagram showing main risk pathways and the barriers in place to prevent or
mitigate the undesired consequences or stimulate and promote desired consequences.
3.7 Bayesian statistics and Bayes Nets
3.7.1 Overview
Bayes' theorem states that

P(Ei|A) = P(A|Ei) P(Ei) / Σi P(A|Ei) P(Ei)

where
the probability of X is denoted by P(X);
the probability of X on the condition that Y has occurred is denoted by P(X|Y); and
Ei is the ith event.
Bayesian statistics differs from classical statistics in that it does not assume that all distribution parameters are fixed; rather, parameters are treated as random variables. A Bayesian probability can be more easily understood if it is considered as a person's degree of belief in a certain event, as opposed to the classical interpretation, which is based upon physical evidence. As the Bayesian approach is based upon this subjective interpretation of probability, it provides a ready basis for decision thinking and for the development of Bayesian nets (also called Belief Nets, belief networks or Bayesian networks).
Bayes nets use a graphical model to represent a set of variables and their probabilistic relationships. The network is composed of nodes, which represent random variables, and arrows, which link a parent node to a child node (a parent node being a variable that directly influences another (child) variable).
3.7.2 Use
In recent years, the use of Bayes' theory and Bayes nets has become widespread, partly because of their intuitive appeal and partly because of the availability of software computing tools. Bayes nets have been used on a wide range of topics: medical diagnosis, image modelling, genetics, speech recognition, economics, space exploration and the powerful web search engines used today. They can be valuable in any area where there is a requirement to learn about unknown variables through the utilization of structural relationships and data. Bayes nets can be used to learn causal relationships, to give an understanding of a problem domain and to predict the consequences of intervention.
3.7.3 Input
The inputs are similar to the inputs for a Monte Carlo model. For a Bayes net, examples of the steps to be taken include the following:
• define system variables;
• define causal links between variables;
• specify conditional and prior probabilities;
• add evidence to net;
• perform belief updating;
• extract posterior beliefs.
3.7.4 Process
Bayes' theory can be applied in a wide variety of ways. This example considers the creation of a Bayes table where a medical test is used to determine whether a patient has a disease. The belief before taking the test is that 99 % of the population do not have the disease and 1 % have it, i.e. the prior information. The accuracy of the test is such that if the person has the disease, the test result is positive 98 % of the time. If the person does not have the disease, the test result is still positive 10 % of the time. The Bayes table provides the following information:
Using Bayes' rule, the product is determined by combining the prior and the likelihood. The posterior is found by dividing each product value by the product total. The output shows that a positive test result raises the probability of having the disease from the prior of 1 % to a posterior of about 9 %; in other words, even with a positive test result, having the disease remains unlikely.
Examining the equation (0,01 × 0,98)/((0,01 × 0,98) + (0,99 × 0,1)) shows that the 'no disease, positive result' value plays a major role in the posterior values.
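The table arithmetic above can be reproduced with a short sketch (decimal points are used in place of the comma notation):

```python
# Bayes table for the disease/test example above.
prior = {"disease": 0.01, "no disease": 0.99}
# Probability of a positive test result given each state (the likelihood).
likelihood_positive = {"disease": 0.98, "no disease": 0.10}

# Product = prior x likelihood; posterior = product / total of products.
product = {state: prior[state] * likelihood_positive[state] for state in prior}
total = sum(product.values())
posterior = {state: product[state] / total for state in product}

print(posterior)   # {'disease': ~0.09, 'no disease': ~0.91}
```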
With the conditional and prior probabilities defined within the following tables, and using the notation that Y indicates positive and N indicates negative, 'positive' could mean 'has the disease' as above, or Y could indicate High and N Low.
Table – Conditional probabilities for node C with node A and node B defined
Table Conditional probabilities for node D with node A and node C defined
Using Bayes' rule, the value P(D|A,C) P(C|A,B) P(A) P(B) is determined as shown below; the last column shows the normalized probabilities, which sum to 1, derived as in the previous example (results rounded).
Table – Posterior probability for nodes A and B with node D and node C defined
Table – Posterior probability for node A with node D and node C defined
This shows that the prior for P(A=N) has increased from 0,1 to a posterior of 0,12, which is only a small change. On the other hand, P(B=N|D=N,C=Y) has changed from 0,4 to 0,56, which is a more significant change.
3.7.5 Outputs
The Bayesian approach can be applied to the same extent as classical statistics with a wide range of
outputs, e.g. data analysis to derive point estimators and confidence intervals. Its recent popularity is in
relation to Bayes nets to derive posterior distributions. The graphical output provides an easily understood
model and the data can be readily modified to consider correlations and sensitivity of parameters.
1. Consultants (Risk Management Project) – Aggressive schedules on fixed budgets will almost certainly cause a schedule slip and a cost overrun. Appropriate staffing is incomplete early in the project. There is no time for needed training. The productivity rates needed to meet the schedule are not likely to be achieved. Overtime is perceived as a standard procedure to overcome schedule deficiencies. Lack of analysis time may result in an incomplete understanding of product functional requirements.
2. Requirements (Risk Technical Product) – Poorly defined user requirements almost certainly will
cause existing system requirements to be incomplete. Documentation does not adequately describe
the system components. Interface document is not approved. Domain experts are inaccessible and
unreliable. Detailed requirements must be derived from existing code. Some requirements are
unclear, such as the software reliability analysis and the acceptance criteria. Requirements may
change due to customer turnover.
3. Development Process (Risk Technical Process) – A poorly conceived development process is highly likely to cause implementation problems. A new methodology is being introduced from the company's software process improvement initiative. The internally imposed development process is new and unfamiliar. The Software Development Plan is inappropriately tailored for the size of the project.
Development tools are not fully integrated. Customer file formats and maintenance capabilities are
incompatible with the existing development environment.
4. Project Interfaces (Risk Management Project) – Dependence on external software delivery has a
very good chance of causing a schedule slip. Subcontractor technical performance is below
expectations. There is unproven hardware with a poor vendor track record. The subcontractor's commercial methodology conflicts with the customer's MIL-spec methodology. Customer action item response time is slow. The team is having difficulty keeping up with the changing / increasing demands of customers.
5. Management Process (Risk Management Process) – Poor planning is highly likely to cause an
increase in development risk. Management does not have a picture of how to manage object-oriented
(i.e. iterative) development. Project sizing is inaccurate. Roles and responsibilities are not well
understood. Assignment of system engineers is arbitrary. There is a lack of time and staff for
adequate internal review of products. No true reporting moves up through upper management.
Information appears to be filtered.
6. Development System (Risk Technical Process) – Inexperience with the development system will
probably cause lower productivity in the short term. Nearly all aspects of the development system are
new to the project team. The level of experience with the selected tool suite will place the entire team
on the learning curve. There is no integrated development environment for software, quality
assurance, configuration management, systems engineering, test and the program management office.
System administration support in tools, operating system, networking, recovery and backups is
lacking.
7. Design (Risk Technical Product) – Unproven design will likely cause system performance problems
and inability to meet performance commitments. The protocol suite has not been analyzed for
performance. Delayed inquiry and global query are potential performance problems. As the design
evolves, database response time may be hard to meet. Object-oriented runtime libraries are assumed
to be perfect. Building state and local backbones of sufficient bandwidth to support image data is questionable. The number of internal interfaces in the proposed design generates complexity that
must be managed. Progress toward meeting technical performance for the subsystem has not been
demonstrated.
8. Management Methods (Risk Management Process) – Lack of management controls will probably
cause an increase in project risk and a decrease in customer delight. Management controls of
requirements are not in place. Content and organization of monthly reports does not provide insight
into the status of project issues. Risks are poorly addressed and not mitigated. Quality control is a big factor in the project but has not been given high priority by the company (customer perspective).
SQA roles and responsibilities have expanded beyond original scope (company perspective).
9. Work Environment (Risk Technical Process) – The remote location of the project team, we believe, will make organizational support difficult and cause downtime. Information given to technical and
management people does not reach the project team. Information must be repeated many times.
Project status is not available through team meetings or distribution of status reports. Issues
forwarded to managers via the weekly status report are not consistently acted on. Lack of
communication between software development teams could cause integration problems.
10. Integration and Test (Risk Technical Product) – An optimistic integration schedule has a better than even chance of resulting in acceptance of an unreliable system. The integration schedule does not allow for the
complexity of the system. Efforts to develop tests have been underestimated. The source of data
needed to test has not been identified. Some requirements are not testable. Formal testing below the
system level is not required. There is limited time to conduct reliability testing.