
Martin Neil &

Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian
Networks

There is More to Assessing Risk


than Statistics

Copyright Martin Neil and Norman Fenton 2019


Learning Goals

• At the end of this course you should be able to:
  – Quantify and reason about risk
  – Use, in depth, decision support tools
  – Analyse and design probabilistic risk models for a wide range of application areas
  – Learn about prediction, diagnosis and learning
  – Understand and use probability theory and statistics
• These skills will be useful when:
  – Designing and building Bayesian AI and decision support systems
  – Anticipating and controlling risks in operations, finance, safety, engineering, medicine, law.....
  – Performing formal risk assessments of safety or mission critical systems and processes
  – Reasoning about risk and uncertainty in everyday situations
The Book
• Risk Assessment and Decision
Analysis with Bayesian Networks

Norman Fenton and Martin Neil


Queen Mary University of London,
UK and AgenaRisk, UK

CRC Press
ISBN: 9781138035119
2018

• Available on Amazon and from CRC


Press

Slide 3
2nd Edition Chapter 1:

Introduction

Slide 4
WINNER: INTERNATIONAL STATISTIC OF THE YEAR: 69

This is the annual number of Americans killed, on average, by lawnmowers - compared to two Americans killed annually, on average, by immigrant Jihadist terrorists.
The figure was highlighted in a viral tweet this year from Kim Kardashian in response to a migrant ban proposed by President Trump; it had originally appeared in a Richard Todd article for the Huffington Post.
Todd's statistics and Kardashian's tweet successfully highlighted the huge disparity between (i) the number of Americans killed each year (on average) by 'immigrant Islamic Jihadist terrorists' and (ii) the far higher average annual death tolls among those 'struck by lightning', killed by 'lawnmowers', and in particular 'shot by other Americans'.
Todd and Kardashian's use of these figures shows how everyone can deploy statistical evidence to inform debate and highlight misunderstandings of risk in people's lives.
Judging panel member Liberty Vittert said: 'Everyone on the panel was particularly taken by this statistic and its insight into risk - a key concept in both statistics and everyday life. When you consider that this figure was put into the public domain by Kim Kardashian, it becomes even more powerful because it shows anyone, statistician or not, can use statistics to illustrate an important point and illuminate the bigger picture.'
Tweet by Kim
Kardashian
that earned
"International
Statistic of the
Year" 2017

Note: Because of the particular 10-year period chosen (2007-2017) the terrorist attack
statistics do not include the almost 3000 deaths on 9/11 and also a number of other attacks
that were ultimately classified as terrorist attacks.
Slide 6
Significant dissenter was Nassim Nicholas Taleb – a well-known expert on risk and 'randomness'.

He exposed a fundamental problem with the statistic.
Slide 7
Slide 8

Taleb's Argument

• Key difference between risks that are systemic - and so can affect more than one person (such as a terrorist attack) and those that are not (such as using a lawnmower), which can be considered random.
• Compare the chance of a thousand New Yorkers dying from using lawnmowers vs dying from a terrorist attack.
• Failure to consider the range of factors that affect the true risk to particular individuals or groups.
• Other factors missing from the data and the discussion explain the number of terrorist deaths and need to be considered.
Comparing
the
probability
distributions
of number of
fatalities per
year

Slide 9
Causal
view of
lawnmower
versus
terrorist
attack
deaths

Slide 10
Cost-
benefit
trade-off
analysis
required for
informed
decision-
making

Slide 11
Is data alone enough to inform
decision making?
• What contextual and situational factors
cause the risk? How do these vary?
• What about novelty? The unknown is
dangerous (…but also an opportunity)
• Is the past a reliable predictor of the
future? If not what is being ignored?
• If we can determine the causal process
that generates the data we can potentially
control it.

Slide 12
There is more to assessing risk than statistics

• Medical
  – Lifestyle factors and symptoms
  – Treatment efficacy
• Legal
  – DNA match
  – Alibis
  – Witnesses
• Safety
  – Rare events
  – Choose between alternatives
• Financial
  – Liquidity risk
  – Operational risk
• Reliability
  – Complex software
  – Novel technology

• 'Gut-feel' decisions, or decisions made on the back of an envelope, are fundamentally inadequate.
• Need for scientific transparency and articulation of causal models.
• Need to understand strengths and limitations of statistics and data analysis.
• Focus on Bayesian and Causal Networks.
Slide 13
2nd Edition Chapter 2:

Debunking Bad Statistics

Slide 14
Predicting economic growth, the Normal
distribution and its limitations

[Chart: Growth rate in GDP (%) over time from first quarter 1993 to first quarter 2008]

• What are the chances that the next year's growth rate will lie between 1.5% and 3.5%? (stable economy)
• What are the chances that the growth will be less than 1.5% in each of the next three quarters?
• What are the chances that within a year there will be negative growth? (recession)
Slide 15
To answer these questions we need a model.......

Normal distribution of people's height

The distribution is symmetric around the midpoint, which is called the mean of the distribution. Because of this exactly half of the distribution is greater than the mean and half less than the mean.

The 'spread' of the data (which is the extent to which it varies from the mean) is captured by a single number called the standard deviation.

Approximately 20% of such a distribution lies between 178 and 190, so we can conclude that there is a 20% chance a randomly selected adult will be between 178 and 190 cm tall.

Less than 0.001% of the distribution lies beyond the mean plus four standard deviations. This means that there is less than a 1 in 100,000 chance of an adult being more than 224 cm tall.

The infinite tails of the normal distribution imply a non-zero (albeit very small) chance that an adult can be less than 0 cm tall, and a non-zero chance that an adult can be taller than the Eiffel tower. Slide 16
Normal model of GDP data

Histogram of annualised GDP growth rate from 1993-2008: mean of 2.96 and a standard deviation of 0.75.

What are the chances that the next quarter growth rate will be between 1.5% and 3.5%?
- Answer based on the model: approximately 72%.

What are the chances that the growth will be less than 1.5% in each of the next three quarters?
- Answer: about 0.0125%, which is one in eight thousand.

What are the chances that, within a year, there will be negative growth? (recession)
- Answer: about 0.0003%, which is less than one in thirty thousand.

Slide 17
What happened next?

[Chart: Growth rate in GDP (%) over time from 1993 to 2010]

Within less than a year the growth rate was below –5%. According to the model, a growth rate below –5% would happen considerably less frequently than the inverse of the life of the universe.

So what went wrong?

Clearly the Normal distribution was a hopelessly inadequate model, since it is inherently incapable of predicting such rare events.

Actual predictions made by financial institutions and regulators in the period running up to the credit crunch were especially optimistic because they based estimates of growth on the so-called 'Golden Period' of 1998-2007.
Slide 18
Lessons from history

[Charts: Growth rate in GDP (%) over time from 1956 to 2016, and histogram of annualised GDP growth rate from 1956-2010]

Conditions in 2008 were unlike any that had previously been seen. The standard statistical approaches inevitably fail in such cases.

Not only is the spread of the distribution much wider, but it is clearly not 'Normal' because it is not symmetric.

Slide 19
Patterns and Randomness – School League Tables

• Scores achieved (on objective quality criteria) by the set of state schools in one council district in the UK. Scores provide 'choice' for parents.
• School 38 achieved a significantly higher score than the next best school, and its score (175) is over 52% higher than the lowest ranked school, number 41 (score 115).
• Based on the impressive results of School 38, parents clamour to ensure that their child gets a place at this school.
• How would you feel if, instead of your child being awarded a place in School 38, he/she was being sent to school 41?

Schools League Table (Position: School Number, Score)

 1: 38, 175    26: 45, 144
 2: 43, 164    27: 46, 143
 3: 44, 163    28:  1, 142
 4: 25, 158    29: 18, 142
 5: 31, 158    30: 22, 141
 6: 47, 158    31: 26, 141
 7: 11, 155    32:  4, 140
 8: 23, 155    33: 14, 140
 9: 48, 155    34: 29, 140
10: 40, 153    35: 39, 139
11:  7, 151    36:  8, 138
12: 30, 151    37:  5, 136
13:  6, 150    38: 17, 136
14:  9, 149    39: 34, 136
15: 33, 149    40:  3, 134
16: 19, 148    41: 24, 133
17: 10, 147    42: 36, 131
18: 12, 147    43: 37, 131
19: 32, 147    44: 15, 130
20:  2, 146    45: 21, 130
21: 27, 146    46: 16, 128
22: 42, 146    47: 13, 120
23: 28, 145    48: 20, 116
24: 35, 145    49: 41, 115
25: 49, 145


I lied.....

• The numbers do not represent schools at all. They are simply the numbers used in the UK National Lottery (1 to 49).
• Each 'score' is the actual number of times that particular numbered ball had been drawn in the 1172 draws of the UK National Lottery that had taken place up to 17 March 2007.
• We are not suggesting that all schools league tables are purely random like this. But in any given year there will be random variation like that in the table above.
• Many critical decisions are made based on wrongly interpreting purely random results exactly like these, even though the randomness was entirely predictable.
Predictable vs Less
predictable risks
Slide 22
Non-mechanical risks are harder to deal with

Siegfried and Roy (pre 2003), illustrated by Amy Neil, aged 12

• Roy was mauled by one of his tigers, causing life-threatening injuries. The show was immediately and permanently closed.
  – Dismissal of 267 staff and performers
  – Loss of hundreds of millions of dollars at the Mirage Casino, 2003.
• Most serious real world 'risks' are not like lotteries.
  – The 'mechanical' uncertainty of the games played means this risk of ruin is easily controlled and avoided.
• Predictable risks Vs 'unpredictable' (Black Swan) risks

Slide 23
The Black Swan

• A ‘Black Swan’ event is a term used


to describe highly unpredictable
events.
• For centuries all swans were
assumed to be white because
nobody had yet seen a black swan;
the sighting of a black swan
reversed all previous beliefs about
the color of swans.
• For the Mirage Casino what
happened on 3 October 2003 was a
devastating black swan event.
Correlations and significance values

• Number between -1 and 1 that determines whether two paired sets of data are (linearly) related
• Confidence in the relationship is determined by sample size
• For a small data set, need values close to -1 or 1 to be statistically significant
• With a large enough data set, values close to 0 can be significant
Correlations and significance values

Temperature and fatal automobile crashes

Month        Average temperature (°F)   Total fatal crashes
January         17.0                        297
February        18.0                        280
March           29.0                        267
April           43.0                        350
May             55.0                        328
June            65.0                        386
July            70.0                        419
August          68.0                        410
September       59.0                        331
October         48.0                        356
November        37.0                        326
December        22.0                        311

Slide 26
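As a quick illustrative check (an addition, not part of the original slides), the correlation for the monthly data above can be computed directly in Python from the values shown in the table:

import numpy as np

# Average monthly temperature and total fatal crashes from the table above
temperature = [17.0, 18.0, 29.0, 43.0, 55.0, 65.0, 70.0, 68.0, 59.0, 48.0, 37.0, 22.0]
crashes = [297, 280, 267, 350, 328, 386, 419, 410, 331, 356, 326, 311]

# Pearson correlation coefficient between the two series
r = np.corrcoef(temperature, crashes)[0, 1]
print(f"Correlation between temperature and fatal crashes: {r:.2f}")

The result is a strong positive correlation, which sets up the causal discussion in Chapter 3: correlation alone says nothing about why the two series move together.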
Per capita cheese consumption correlates with the number of people who die tangled in their bedsheets.

Taken from http://tylervigen.com
p-values reward low variation more than magnitude of impact

• Which drug is effective for weight loss?
• For drug Precision the mean weight loss is 5 lbs and every one of the 100 subjects in the study loses between 4.5 lb and 5.5 lb.
• For drug Oomph the mean weight loss is 20 lbs and every one of the 100 subjects in the study loses between 10 lb and 30 lb.
• Classical statistical testing with p-values favours drug Precision
The Shotgun Fallacy

Given a large number of variables and a sufficiently large data set it is almost inevitable that a statistically significant correlation will be discovered between at least one pair of variables.

• Variable S is student exam score
• Correlation between H and S is 0.59 and statistically significant
• But the data is randomly generated and the labels are meaningless

Slide 29
Spurious relations?

Inappropriate causal link: Height → Intelligence (compare Temperature (T) → Number of accidents (N), N = 2.144 × T + 243.55).

Correct influential relationship through underlying common cause: Age → Height and Age → Intelligence.

Slide 30
The Danger of Regression: Looking backwards when you need to look forwards

Suppose that you are blowing up a large balloon. After each puff you measure the surface area and record it.

What will the surface area be on the 24th puff?

On the 50th puff?


Simpson’s
Paradox

Slide 32
Simpson’s paradox – can you trust
averages?
Fred Jane
Year 1 average 50 40

Year 2 average 70 62

Overall Average 60 51

So how is it possible that Jane got the prize for the


student with the best grade?
Slide 33
Simpson’s paradox explanation

Fred Jane
Year 1 total 350 (7 x 50) 80 (2 x 40)
Year 2 total 210 (3 x 70) 496 (8 x 62)
Overall total 560 576

Actual Overall average 56 57.6

Fred took 7 modules in Year 1 and 3 modules in Year 2

Jane took 2 modules in Year 1 and 8 modules in Year 2

Slide 34
Simpson’s paradox drug example
A new drug is being tested on a group of 800 people (400
men and 400 women) with a particular ailment.

Half of the people (randomly selected) are given the drug and
the other half are given a placebo.
Drug taken No Yes
Recovered
No 240 200
Yes 160 200

50% of those given the drug recover compared to 40%


given the placebo.

So the drug clearly has a positive effect. Or does it?


Slide 35
Simpson’s paradox stratified data

Sex Female Male


Drug taken No Yes No Yes
Recovered
No 210 80 30 120
Yes 90 20 70 180

For men: 70% (70 out of 100) taking the placebo recover, but only 60%
(180 out of 300) taking the drug recover.

For men, the recovery rate is better without the drug.

For women: 30% (90 out of 300) taking the placebo recover, but only 20%
(20 out of 100) taking the drug recover.

For women, the recovery rate is better without the drug.


Slide 36
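To see the reversal numerically, here is a minimal Python sketch (an addition, not part of the original slides) that reproduces the aggregate and stratified recovery rates from the two tables above:

# Counts from the stratified table: (sex, drug_taken) -> (recovered, not_recovered)
counts = {
    ("female", "no"): (90, 210),
    ("female", "yes"): (20, 80),
    ("male", "no"): (70, 30),
    ("male", "yes"): (180, 120),
}

def recovery_rate(pairs):
    recovered = sum(r for r, n in pairs)
    total = sum(r + n for r, n in pairs)
    return recovered / total

# Aggregated over both sexes the drug looks better (50% vs 40%)...
for drug in ("no", "yes"):
    rate = recovery_rate([counts[(sex, drug)] for sex in ("female", "male")])
    print(f"drug={drug}: overall recovery rate = {rate:.0%}")

# ...but within each sex the placebo group does better
for sex in ("female", "male"):
    for drug in ("no", "yes"):
        rate = recovery_rate([counts[(sex, drug)]])
        print(f"{sex}, drug={drug}: recovery rate = {rate:.0%}")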
Simpson’s paradox explained

• Because it is impossible to ‘control’ for every possible ‘variable’ the implications


for classical approaches to medical trials are devastating (and generally not
understood).
• The paradox is theoretically unavoidable given possibility of confounding
variables.
Slide 37
How we
measure risk
can
dramatically
change our
perception
of risk

Slide 38
Absolute Vs Relative Risk

• Relative risk is being reported, not absolute risk.
• We are really interested in knowing the actual chance of dying if we drink regularly Vs if we do not.
• Of 200,000 deaths 8 are from mouth cancer. This means 6 of those 8 drank wine regularly and two did not (tripled risk 6/2).
• But what is the actual chance of dying from mouth cancer if you drink regularly? 0.0012% to 0.002%.

Slide 41
2nd Edition Chapter 3:

The Need for Causal,


Explanatory Models in
Risk Assessment

Slide 42
Are you more likely to die in a car crash
when the weather is good or bad?

Temperature (T)

Number of accidents (N)

𝑁 = 2.144 × 𝑇 + 243.55

Inevitable temptation arising from such results is to infer causal links such
as, in this case, higher temperatures cause more fatalities. Slide 43
New research proves
that driving in winter is
Newspaper actually safer than at
headline any other time of the
year!

Slide 44
Assessing Risk of Road Fatalities: Causal model

We have information in a database about temperature, number of fatal crashes, and number of miles travelled. These are therefore often called 'objective' factors.

If we wish our model to include factors for which there is no readily available information in a database we may need to rely on expert judgement. Hence, these are often called 'subjective' factors.

Just because the driving speed information is not easily available does not mean it should be ignored.

The situation whereby a statistical model is based only on available data, rather than on reality, is called 'conditioning on the data'. This enhances convenience but at the cost of accuracy.

Slide 45
When ideology and
causation collide
• Causal explanations can be
determined by ideological position
• Correlation between prenatal care
provision and weight of new born
babies
• Mothers receiving low levels of
prenatal care were disproportionately
black and poor
• Media blamed social failure to
provide enough prenatal care to poor
black woman and called for
increased resources
When ideology and
causation collide
• Independent of race and income
smoking and alcohol abuse more
prevalent amongst mothers who did
not get prenatal support
• Additional factor not included –
personal responsibility.
• A deficit in personal responsibility
may explain failure to seek prenatal
care.
• A liberal perspective may view the
problem differently from a
conservative perspective because of
the difference in ‘values’
So what is risk and how to measure it?

• Risk Definition 1: “an event that can have negative consequences”


(Conversely an event that can have a positive impact is an opportunity)

• Risk Definition 2:

Do these definitions help with risk


assessment and decision-making?
Slide 48
Risk registers and heat maps

Slide 49
..But this does not tell us what we need to know
Armageddon risk: large meteor strikes the Earth

The ‘standard approach’ makes no sense at all


Slide 50
Risk using causal analysis

• A risk (and, similarly, an opportunity) is an event that can be characterised by a causal chain involving (at least):
  • The event itself
  • At least one consequence event that characterises the impact (so this will be something negative for a risk event and positive for an opportunity event)
  • One or more trigger (i.e. initiating) events
  • One or more control events which may stop the trigger event from causing the risk event (for risk), or impediment events (for opportunity)
  • One or more mitigating events which help avoid the consequence event (for risk), or impediment events (for opportunity)
Risk using causal analysis

Causal view of risk: a Trigger (Drive fast?) may cause the Risk event (Crash?), which may lead to the Consequence (Injury?). A Control (Speed warnings?) may stop the trigger causing the risk event, and a Mitigant (Seat belt?) helps avoid the negative consequence.

Causal view of opportunity: a Trigger (Drive fast?) may lead to the Opportunity event (Make meeting?), which may lead to the positive Consequence (Win contract?). Impediments (Crash?, Nerves?) may stop the opportunity event or the positive consequence.

Slide 52
Armageddon Bayesian Network

Slide 53
Lesson Summary

• Statistics and data alone are insufficient means to assess risks, make decisions, predict events and control events.
• Uncertainty is a function of known and unknown causal mechanisms.
• It is important to look at relationships between variables to see how risks emerge and could be prevented or controlled.
• Bayesian networks are the primary means to combine measurement of uncertainty with causality.
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian
Networks

Inevitability of Subjectivity and


Basics of Probability

Copyright Martin Neil and Norman Fenton 2019


Learning Goals

• At the end of this lecture you should:
  – Appreciate the contrast between subjective and frequentist probability
  – Understand the inevitability of subjectivity
  – Recognize and express problems as experiments involving events
  – Understand axioms and theorems of probability theory
• These skills will be useful to:
  – Express problems in a way that they can be subjected to a probabilistic treatment
  – Specify how events are related
  – Express uncertainty and relationships in a coherent way
2nd Edition Chapter 4:

Measuring Uncertainty: The


Inevitability of Subjectivity

Using Appendix A: The Basics of Counting

Slide 3
We want a unifying way to quantify for
diverse types of uncertain events
• Where we have a "good" understanding of the uncertainty
– "The next toss on a coin will be a head."
– “The next roll of a die will be a 6.”
• Where we have a “poor” understanding of the uncertainty
– “USA will win the next World Cup.”
– “My bank will suffer a major loss tomorrow.”
– “A hurricane will destroy the White House within the next 5 years.”
• Where there is incomplete information about event that already happened:
– “Oliver Cromwell spoke more than 3000 words on 23 April 1654.”
– “OJ Simpson murdered his wife.”
– “You (the reader) have an as yet undiagnosed form of cancer.”
• Even an “unknown” event:
– “My bank will be forced out of business in the next two years as a result
of a threat that we do not yet know about.”
The Frequentist measure of uncertainty

• The frequentist definition of the chance of an elementary event is the frequency with which that event would be observed over an infinite number of repeated experiments.
  – The chance of an event is simply the sum of the frequencies of the elementary outcomes of the event divided by the total number of elementary events
  – If an experiment has n equally likely elementary events then the chance of any event is m/n where m is the number of elementary events for the event
• Assumes repeatability of the experiment: The experiment is repeatable many times under identical conditions.
• Assumes independence of experiments: The outcome of one experiment does not influence the result of any subsequent experiment.
Experiments, Outcomes and Events

• Experiment: “Next toss of coin”


• Outcome: “What side coin lands”
• Elementary events: “H”, “T”
• Events: “H”, “T”, “H or T” (exhaustive event), “Neither H nor T” (empty
event)

• Experiment: “Put OJ Simpson on Trial”


• Outcome: “Did OJ murder his wife”
• Elementary events: “OJ Murdered wife”, “Person other than OJ
murdered wife”, “Wife not murdered”
• Events: “OJ Murdered wife or Person other than OJ murdered wife” etc.
Relationships Between Events

• The complement of an event:


– The complement of an event E is the collection of all elementary events that are not part of the
event E. The event “do not roll a 4” is the complement of the event “roll a 4.
• The union of two events:
– The union of two events is the collection of all elementary events that are in either of the events.
The event “roll a number bigger than 1” is the union of the two events “roll an even number’” and
“roll a number bigger than 2.”
• The intersection of two events:
– The intersection of two events is the collection of all elementary events that are in both of the
events. The event “roll an even number larger than 2” is the intersection of the two events “roll an
even number” and “roll a number bigger than 2.”
• Mutually exclusive events:
– The two events “roll a number bigger than 4” and “roll a number less than 3.”
• Exhaustive event:
– The union of the ALL possible elementary events of the experiment. Roll any number “1-6”
Repeated
Experiments

Slide 8
Joint Experiments

"Disease X" (yes, no) and "Test for disease X" (pos, neg):
  (yes, pos), (yes, neg), (no, pos), (no, neg)

Toss a coin and roll a die:
  (H,1), (H,2), (H,3), (H,4), (H,5), (H,6), (T,1), (T,2), (T,3), (T,4), (T,5), (T,6)

A joint experiment is simply the Cartesian product, written A × B, although curiously probability theorists tend to write it as (A, B).
Slide 9
Joint Events
and
Marginalization
Marginal event “Disease = yes”.
• This is the total number of people who have the disease
irrespective of the test results.
• 100 out of the 10,000

Marginal event “test = positive”.


• This is the total number of people who test positive
irrespective of whether they have the disease.
• 594 out of 10,000

Slide 10
Joint Events and Marginalization
using Balls in an Urn

Slide 11
Calculating number of events using Combinations

The number of combinations of n things taken r at a time:

  Comb(n, r) = n! / (r! (n − r)!)

The number of elementary events in the UK lottery is:

  Comb(49, 6) = 49! / (6! (49 − 6)!) = 13,983,816

The probability of the winning ticket in the UK lottery is:

  P(winning number) = 1 / 13,983,816

Slide 12
Calculating number of events using Permutations

The number of permutations of size r for n objects:

  Perm(n, r) = n! / (n − r)!

"Perfect" order sequence of playing cards:

  Perm(52, 52) = 52! / (52 − 52)! = 52!
  = 80,658,175,170,943,878,571,660,636,856,403,766,975,289,505,440,883,277,824,000,000,000,000
  ≈ 8.07 × 10^67

  P(order) = 1 / 52! ≈ 0
Slide 13
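For reference, both counts can be reproduced with Python's standard library; this is just a sanity-check sketch, not part of the original slides:

import math

# Number of ways to choose 6 balls from 49 (UK lottery)
lottery_combinations = math.comb(49, 6)
print(lottery_combinations)        # 13983816
print(1 / lottery_combinations)    # probability of the winning ticket

# Number of orderings of a 52-card deck: Perm(52, 52) = 52!
deck_orderings = math.factorial(52)
print(deck_orderings)              # ~8.07e67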
Determining
probability using
a repeated coin
tossing
experiment
But surely subjective measures
are irrational?
• Any subjective measure of uncertainty
cannot be validated
• Different experts will give different
subjective measurements
• They are therefore non-scientific and
objective measures are to be preferred

Slide 15
But can all uncertain problems
be expressed as frequency
problems?
• Consider issues even with coin tossing
– Is any coin perfectly fair? Throw
100,000 times and observe 50,001
heads.
– What happens if you observe 100
Heads in 100 tosses? Is assumption
of fairness even relevant?
• Any attempt to measure uncertainty
inevitably involves some subjective
judgement about
– The mechanism generating the
events
– The information available to you when
making your model and decisions
• How do you assess uncertainty when
frequentist assumptions don’t hold?

Slide 16
Combining Subjective and Objective
information

Casino 1- Honest Joe’s.

• You visit a reputable casino at midnight in a good neighbourhood in a


city you know well. When there you see various civic dignitaries
(judges etc.). You decide to play a dice game where you win if the die
comes up six.
• What is the probability of a six?

Casino 2 - Shady Sams.

• More than a few drinks later the Casino closes forcing you to gamble
elsewhere. You know the only place open is Shady Sam’s but you
have never been. The doormen give you a hard time, there are
prostitutes at the bar and hustlers all around. Yet you decide to play
the same dice game.
• What is the probability of a six?

Slide 17
Combining Subjective and Objective
information

Slide 18
Causal Revelation and Absence of Information

1. Modeller: My model contains two variables: 'lose job' causes 'cannot pay debt'. If a borrower loses employment they cannot pay the debt back 90% of the time.
2. Observer: But by losing income they can still pay debt 10% of the time. Why is that? This looks odd. How can they still have a chance of 10% of paying debt without a job?
3. Modeller: Because they could sell their house and can still pay.
4. Observer: OK, that isn't in the model, let's add that to the model (the model now has two causes for 'can pay debt': lose job and sell house)
5. Observer: But if the borrower loses their job but doesn't sell their house what's the chance of paying the debt?
6. Modeller: Answer – 5%.
7. Observer: How could someone still pay? There must be some other reason.
8. Modeller: Perhaps they could sell their grandmother into slavery?
9. Observer: OK, sounds a bit extreme but let's add that to the model. What's the chance of not paying debt now?
10. Modeller: If the borrower loses their job, doesn't sell the house and they don't sell their grandmother into slavery, then the chance is 1%.
11. Observer: But why 1%?
12. Modeller: Because they may rob a bank!
13. Observer: OK, let's add that to the model
...dialogue continues
Einstein said: 'God does not play dice with the universe'

• At some point the modeller reveals all possible causal mechanisms and achieves a zero probability of the borrower not paying their debt in the presence of all possible causes, thus rendering the model deterministic
• Our probabilities represent:
  • Causal mechanisms that are NOT in the model
  • Our lack of information about possible causes
• What is or isn't in the model depends on our cognitive revelation, imagination, experience and availability of information

The Subjectivist Viewpoint

• Probability is an expression of a rational agent's degrees of belief about uncertain propositions
• Rational agents may disagree. There is no "one correct probability"
• A rational agent will update and adapt their model and probabilities when new (relevant) information becomes available
• If she receives feedback her assessed probabilities will in the limit converge to observed frequencies
• With enough information, and the same assumptions, different observers will converge on the same probability
Frequentist versus Subjective Viewpoints

Frequentist:
• Can be legitimately applied only to repeatable problems
• Must be a unique number capturing an objective property in the real world
• Only applies to events generated by a random process
• Can never be considered for unique events

Subjective:
• Is an expression of a rational agent's degrees of belief about uncertain propositions
• Rational agents may disagree. There is no "one correct" measure
• Can be assigned to unique events

Slide 22
2nd Edition Chapter 5:

The Basics of Probability

Slide 23
• Event, 𝐸, is “next toss of a coin is a
head”.
• Probability of 𝐸 is 𝑃(𝐸)
Probability • Assignment of values depends on beliefs
and context:
Notation – Truly fair - 𝑃 𝐸 = 0.5
and – 100 tosses 20 heads - 𝑃 𝐸 = 0.2
– Inspected coin and sees it has two heads -
Examples 𝑃 𝐸 = 1.0
• Experiment has exhaustive events:
– {head, tail}, or
– {head, tail, side}
• Axiom 5.1 (Unit measure)The probability
of any event is a number between zero
Probability and one.

Axioms: 0 ≤ 𝑃(𝐸) ≤ 1

• Axiom 5.2 (Unit sum): The probability of


the exhaustive event is one (probability of
Unit measure, at least one event occurring is one).
unit sum 𝑃(𝐸1 ) + 𝑃(𝐸2 ) = 1
& • Axiom 5.3 (Addition rule): For mutually
Addition rule exclusive events, the probability of either
event happening is the sum of the
probabilities of the individual events.
𝑃(𝐸1 ∪ 𝐸2 ) = 𝑃(𝐸1 ) + 𝑃(𝐸2 )
• Theorem 5.1 Complement: The probability
of the complement of an event is equal to
Probability one minus the probability of the event. E.g.
with exhaustive events {𝐸1 , 𝐸2 }.
Theorems: 𝑃 𝐸1 = 1 − 𝑃 𝐸2

• Theorem 5.2 Addition law: For any two


Complement events, the probability of either event
happening (i.e., their union) is the sum of
& the probabilities of the two events minus
Addition rule the probability of both events happening
(i.e., the intersection).

𝑃 𝐸1 ∪ 𝐸2 = 𝑃 𝐸1 + 𝑃 𝐸2 − 𝑃(𝐸1 ∩ 𝐸2 )
Joint probability of independent events (Multiplication rule)

Observation 5.1 Probability of independent events: If E1 and E2 are independent events, then the probability that both E1 and E2 happen is equal to the probability of E1 times the probability of E2:

  P(E1 ∩ E2) = P(E1) × P(E2)

  P(E1 ∩ E2) = P(E1, E2)
Joint probability of dependent events (Multiplication rule using conditional probabilities)

Observation 5.2 Probability of dependent events: If E1 and E2 are dependent events, then the probability that both E1 and E2 happen is equal to the probability of E1 times the probability of E2 given E1:

  P(E1 ∩ E2) = P(E1) × P(E2 | E1)

  In P(E2 | E1) the '|' means 'E2 given E1'. This is called the conditional probability.

  The impact of one experiment affects the other!
Venn
diagram of
event
Spaces for
A and B

Slide 29
Simple Example – Six on Two Dice?

• Single 'fair' die throw event E, with outcomes i = 1,2,3,4,5,6
• Probability of each outcome: P(E = i) = 1/6
• Complement?

  P(E = 1) + P(E = ¬1) = 1/6 + Σ(i=2..6) P(E = i) = 1/6 + 5/6 = 1

• Two dice thrown in succession give rise to two events A, B where P(A = i) = 1/6 and P(B = i) = 1/6
• What is the probability of the joint event P(A = 6, B = 6)?

  P(A = 6, B = 6) = P(A = 6 ∩ B = 6) = (1/6)(1/6) = 1/36
Simple Example – Ace or Red card?

• Draw one card from a standard pack of 52. What is the


probability of drawing an ace or a red card, assuming that
all 52 cards are equally likely to be drawn?
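(For reference, applying the addition law, Theorem 5.2: P(Ace ∪ Red) = P(Ace) + P(Red) − P(Ace ∩ Red) = 4/52 + 26/52 − 2/52 = 28/52 = 7/13.)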
Exercise

What is the probability of a five or a six on any of two dice rolled together?

Simple event, one die: P(E1 = 5 ∪ E2 = 6)
• Define the event, E, and calculate its probability
• Events are independent, joint events are mutually exclusive
• Probability of 5 or 6 on one die?

  P(E) = P(E1 = 5 ∪ E2 = 6)
       = P(E1 = 5) + P(E2 = 6) − P(E1 = 5 ∩ E2 = 6)
       = 1/6 + 1/6 − 0
       = 1/3
Joint event, two dice: P(Dice1 = {5,6} ∪ Dice2 = {5,6})

Probability of 5 or 6 on one or both of two dice, denoted Dice A, Dice B:
  {5,6} on Dice A, {5,6} on Dice B
  {5,6} on Dice A, {1,2,3,4} on Dice B
  {1,2,3,4} on Dice A, {5,6} on Dice B

  P(X) = P(A = {5,6} ∩ B = {1,2,3,4}) = P(A = {5,6}) × P(B = {1,2,3,4}) = 2/6 × 4/6 = 8/36
  P(Y) = P(A = {5,6} ∩ B = {5,6}) = P(A = {5,6}) × P(B = {5,6}) = 2/6 × 2/6 = 4/36

  P(Dice1 = {5,6} ∪ Dice2 = {5,6}) = 2P(X) + P(Y) = 16/36 + 4/36 = 20/36 = 5/9
Cleaner notation and alternative strategy

Cleaner notation:

  P(A) = P(B) = P(5 ∪ 6) = P(5) + P(6) − P(5 ∩ 6) = 1/6 + 1/6 − 0 = 1/3

  P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/3 + 1/3 − (1/3 × 1/3) = 2/3 − 1/9 = 5/9

Alternative strategy: use the complement rule, where the complement event is {1,2,3,4} on Dice A and {1,2,3,4} on Dice B:

  P(A ∪ B) = 1 − P(¬A, ¬B) = 1 − (2/3 × 2/3) = 5/9
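A quick brute-force check of the 5/9 result (a sketch added here, not in the original slides): enumerate all 36 equally likely outcomes for the two dice and count those containing a 5 or a 6.

from itertools import product

# All 36 equally likely outcomes for two fair dice
outcomes = list(product(range(1, 7), repeat=2))

# Event: at least one die shows a 5 or a 6
favourable = [o for o in outcomes if 5 in o or 6 in o]

print(len(favourable), "/", len(outcomes))    # 20 / 36
print(len(favourable) / len(outcomes))        # 0.555... = 5/9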
Probability
distributions

Slide 36
Joint probability distribution
• The joint probability distribution of A with states {a1, a2, a3} and B
with states {b1, b2, b3, b4} is the probability distribution for the joint
event (A, B):

• Can call A and B variables with states rather than experiments and
events

Slide 37
(Random) Variable Types
• Types of variables (defined by state type):
– Discrete Labeled {Spurs, West Ham, Chelsea}
– Boolean {True, False}
– Continuous Value E.g. {…, -0.5, 0.1, 0.7, 0.7001, 0.8, 0.9501, ….}
– Integer Value E.g. {…, -1, 0, 1, 2, 3, 4, ….., n}
– Interval E.g. {…., ]0,1], ]1, 2], ….., ]n-1, n] }
• Infinite
– Continuous E.g. { ]-infinity, 0], ….,]0,1], ]1, 2], …..]0, +infinity] }

Slide 38
Joint probability
distribution an probability
of marginalized events

Slide 39
Marginalization

Slide 40
Dealing with more than two Variables

Joint distribution for 5 variables: P(A, B, C, D, E)

Marginal distribution (B, C): P(B, C) = Σ(A,D,E) P(A, B, C, D, E)

To calculate this we need lots of probability values: if each of the n variables has 10 states then we need 10^n probabilities! Slide 41
Fundamental Rule of conditional probability

Axiom 5.4 Probability of dependent events: If A and B are dependent events, then the probability that event 'B occurs given that A has already occurred' is:

  P(B | A) = P(A ∩ B) / P(A) = P(A, B) / P(A)

Theorem 5.3:

  P(A, B) = P(A ∩ B) = P(B | A) P(A)
Example: Dependent Events

• Combined bet: Spurs score a goal and then win the match
• Bookies offer odds: 2-1 for Spurs to score first and win (equivalent to 1/3)
• Event A is 'Spurs score goal', event B is 'Spurs win match'

  P(A) = 0.4
  P(B | A) = 0.7
  P(A, B) = P(A ∩ B) = P(B | A) P(A) = 0.7 × 0.4 = 0.28 < 0.333

• Interested in an additional event, C, Spurs score the second goal:

  P(A, B, C) = P(B ∩ C ∩ A) = P(A | C ∩ B) P(B ∩ C) = P(A | B ∩ C) P(C | B) P(B) = P(A | B, C) P(C | B) P(B)
The Chain Rule

We can therefore decompose any joint probability into a series of 'chained' conditional probability statements:
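In general, for any set of variables A1, A2, …, An:

  P(A1, A2, …, An) = P(A1 | A2, …, An) P(A2 | A3, …, An) … P(An−1 | An) P(An)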
Summary

  P(A | B) = P(A, B) / P(B)    Fundamental rule of probability

  P(A, B)                      Joint probability

  P(A | B)                     Conditional probability

  P(A)                         Marginal probability

Slide 45
Binomial Distribution Example

• A factory mass produces components with a failure rate, F, of 20% per year: P(F) = 0.2, P(not F) = 0.8
• A customer buys 5 components and needs to predict the number that will fail within one year of use
• Experiment with six outcomes: {0, 1, 2, 3, 4, 5} (assume components are independent)
• Joint failure events:

  P(not F) × P(not F) × P(not F) × P(not F) × P(not F) = 0.8^5 = 0.32768
  P(F) × P(F) × P(F) × P(F) × P(F) = 0.2^5 = 0.00032
  P(not F) × P(not F) × P(not F) × P(F) × P(F) = 0.8^3 × 0.2^2 = 0.02048

• How many ways can 2 components fail? Comb(5,2) = 5! / (2! (5 − 2)!) = 10
• The probability of exactly 2 failing is:

  P(2) = 0.02048 × 10 = 0.2048 Slide 46


Binomial Distribution

If X is the number of successes to occur in n repeated independent Bernoulli trials, each with probability of success p, then X is a binomial random variable with parameters n and p:

  P(X = x) = Comb(n, x) p^x (1 − p)^(n−x)

  Comb(n, x) = n! / (x! (n − x)!)

Example: n = 3, x = 1 and p = 1/3

  P(X = 1) = Comb(3, 1) × (1/3)^1 × (2/3)^2 = 3 × (1/3) × (4/9) = 4/9

Slide 47
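A minimal Python sketch (an addition, not from the slides) that evaluates the Binomial formula for both examples above:

import math

def binomial_pmf(x, n, p):
    # P(X = x) = Comb(n, x) * p^x * (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

print(binomial_pmf(2, 5, 0.2))    # 0.2048 (2 of 5 components fail)
print(binomial_pmf(1, 3, 1/3))    # 0.4444... = 4/9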
Monty Hall Game Show

We have three doors, behind which one has a valuable prize and two have something worthless. After the contestant chooses one of the three doors, Monty Hall (who knows which door has the prize behind it) always reveals a door (other than the one chosen) that has a worthless item behind it. The contestant can now choose to switch doors or stick to his or her original choice.

Slide 48
Monty Hall Game Show Answer

Initial choice Door 1:
  Prize behind Door 1 (either other door revealed) – win if Stick
  Prize behind Door 2 (Door 3 revealed) – win if Switch
  Prize behind Door 3 (Door 2 revealed) – win if Switch

Initial choice Door 2:
  Prize behind Door 1 (Door 3 revealed) – win if Switch
  Prize behind Door 2 (either other door revealed) – win if Stick
  Prize behind Door 3 (Door 1 revealed) – win if Switch

Initial choice Door 3:
  Prize behind Door 1 (Door 2 revealed) – win if Switch
  Prize behind Door 2 (Door 1 revealed) – win if Switch
  Prize behind Door 3 (either other door revealed) – win if Stick

Switching wins in 6 of the 9 equally likely combinations of prize location and initial choice, so the contestant should switch (probability 2/3 of winning by switching versus 1/3 by sticking). Slide 49
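The 2/3 advantage of switching can also be checked by simulation; below is a small Python sketch (an addition, not part of the original slides):

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randint(1, 3)
        choice = random.randint(1, 3)
        # Monty reveals a door that is neither the contestant's choice nor the prize
        revealed = next(d for d in (1, 2, 3) if d != choice and d != prize)
        if switch:
            # Switch to the remaining unopened door
            choice = next(d for d in (1, 2, 3) if d != choice and d != revealed)
        wins += (choice == prize)
    return wins / trials

print("Stick :", play(switch=False))    # ~1/3
print("Switch:", play(switch=True))     # ~2/3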


Lesson Summary

• All probabilities are conditioned on assumptions, so


all probability is subjective
• Axioms and Theorems of probability
• Conditional & unconditional probabilities
• Marginal, joint and conditional probabilities
• Binomial theorem
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian
Networks

Bayes Theorem and Conditional


Probability

Copyright Martin Neil and Norman Fenton 2019


Learning Goals

• At the end of this lecture you should:
  – Appreciate that all probabilities are conditional
  – Be able to derive and apply Bayes' Theorem
  – Understand probabilistic fallacies
  – Understand the concept of second order probability
  – Be able to perform complex calculations involving multiple causes and effects
• These skills will be useful to:
  – Understand core methods required in Bayesian Networks
  – Identify how causal links and the absence of information can introduce dependencies
  – Apply Bayes' theorem and the binomial distribution
2nd Edition Chapter 6:

Bayes’ Theorem and


Conditional Probability

Slide 3
All Probabilities are Conditional
The probability of every event is actually conditioned on K - the background
knowledge or context. So if A is event “roll a 4 on a die”
• P(A | K) = 1/6 where K is the assumption that the die is genuinely fair and
there can be no outcome other than 1, 2, 3, 4, 5, 6.
• P(A | K) = 1/6 where K is the assumption that there can be no outcome
other than 1, 2, 3, 4, 5, 6 and I have no reason to suspect that any one is
more likely than any other.
• P(A | K) = 1/8 where K is the assumption that the only outcomes are 1, 2, 3,
4, 5, 6, ‘lost’, ‘lands on edge’ and all are equally likely.
• P(A | K) = 1/10 where K is the assumption that the results of an experiment,
in which we rolled the die 200 times and observed the outcome “4” 20
times, is representative of the frequency of “4” that would be obtained in
any number of rolls.

Slide 4
Updating Beliefs when we Observe
Evidence

H E
(Hypothesis) (Evidence)

• We start with a hypothesis H, for which we have a belief P(H) called


our prior belief about H.
• Using evidence E about H we revise our belief about H in the light
of E. In other words we need to calculate P(H | E), which we call the
posterior belief about H.
• The probability of observing the evidence given the hypothesis is
P(E | H) which we call the likelihood.

Slide 5
Rev
Bayes

Slide 6
Derivation of Bayes Theorem

From the fundamental rule:

  P(H | E) = P(H, E) / P(E)

  P(E | H) = P(H, E) / P(H)

⇒ Bayes theorem:

  P(H | E) = P(E | H) P(H) / P(E)
Bayes Theorem: General form

• Boolean variables {H, not H}:

  P(H | E) = P(E | H) P(H) / P(E)

• Use marginalization to determine P(E):

  P(E) = P(E | H) × P(H) + P(E | not H) × P(not H)

• For multiple states in H, {h1, h2, …, hn}:

  P(hi | E) = P(E | hi) P(hi) / Σ(i=1..n) P(E | hi) P(hi)
Chest Clinic Example

• In a particular chest clinic 5% of all patients who have been to the clinic are ultimately diagnosed as having lung cancer (H).
  – P(H = true) = 0.05
• While 50% of patients are smokers (E).
  – P(E = true) = 0.50
• By considering the records of all patients previously diagnosed with lung cancer, we know that 80% were smokers.
  – P(E = true | H = true) = 0.80
• A new patient comes into the clinic. We discover this patient is a smoker.
• What is the probability that this patient will be diagnosed as having lung cancer?

• We want to update our belief in the hypothesis given the evidence: P(H | E) = ?
• We know from the fundamental rule that:

  P(H | E) = P(H, E) / P(E)

• Unfortunately we do not have P(H, E), but we do have P(E | H)
• And we know that:

  P(H | E) = P(E | H) P(H) / P(E)

• Hence we can calculate:

  P(H | E) = 0.8 × 0.05 / 0.5 = 0.08
Slide 9
• One in a thousand people has
a prevalence for a particular
heart disease.
• There is a test to detect this
disease. The test is 100%
accurate for people who have
Harvard the disease and is 95%
accurate for those who don’t
Medical School (this means that 5% of people
Question who do not have the disease
will be wrongly diagnosed as
having it).
• If a randomly selected person
tests positive what is the
probability that the person
actually has the disease?
Imagine 1,000
people

One has the


disease
But about 5% of the
remaining 999 people
without the disease test
positive.

That is about 49 or 50
people.
So about 1 out
of 50 who test
positive
actually have
the disease

That’s about 2%!


Harvard Medical School Question

• Given 1,000 people


• 1 has the disease
• 50 wrongly diagnosed as
having disease

• Half respondents said 95%


• Average answer was 56%

Slide 14
Alternative
event tree
representation

But this does not ‘scale up’!

Slide 15
Harvard Medical School using Bayes' Theorem

  H: {disease, ¬disease}, E: {positive, negative}
  P(H = disease) = 1/1000
  P(E = positive | H = disease) = 1.0
  P(E = positive | H = ¬disease) = 0.05

  P(E = positive) = P(E = positive | H = disease) P(H = disease)
                  + P(E = positive | H = ¬disease) P(H = ¬disease)

  P(H = disease | E = positive) = P(E = positive | H = disease) P(H = disease) / P(E = positive)

  = (1 × 1/1000) / (1 × 1/1000 + 0.05 × 999/1000) = 0.019627

Slide 16
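The same update takes only a couple of lines of code; this sketch (added here for illustration) reproduces both the Harvard Medical School answer and the chest clinic example:

def posterior(prior, likelihood_h, likelihood_not_h):
    # Bayes' theorem for a Boolean hypothesis H given evidence E
    evidence = likelihood_h * prior + likelihood_not_h * (1 - prior)
    return likelihood_h * prior / evidence

# Harvard Medical School: P(disease) = 1/1000, P(pos | disease) = 1.0, P(pos | no disease) = 0.05
print(posterior(0.001, 1.0, 0.05))    # ~0.0196

# Chest clinic: P(E) = 0.5 was given directly, so apply Bayes' theorem as stated on the slide
print(0.8 * 0.05 / 0.5)               # 0.08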
But the good news is you never have to do
those calculations manually

Slide 17
Marginalization Notation

• Each time we marginalize, the result is dependent on the current state of the joint probability model
• As we add new evidence this state changes, and Bayesian notation can make this change look confusing!

  P(B) = Σ_A P(B | A) P(A)

• We now know A = True and marginalize again (is the P(B) in each equation the same?):

  P(B) = Σ_A P(B | A = True) P(A = True)

• Clarify by showing the marginal is conditioned on the evidence:

  P(B | A = True) = Σ_A P(B | A = True) P(A = True)

• Are they the same? Yes! Use the fundamental rule:

  P(B | A = True) = Σ_A P(B, A = True) / P(A = True) = Σ_A P(B | A = True) P(A = True) / P(A = True) = P(B)

Slide 18
Bayes and the transposed conditional
fallacy
• If you know that ‘all horses are four legged animals’ can you conclude that ‘all four
legged animal are horses’?
• Yet people (including legal, medical and even statistical experts) routinely assume

P(H | E) = P(E | H)

– When interpreting results of diagnostic medical tests. “There is only a 1%


chance this test result will be positive if the patient does not have the diseases.
Since the test was positive we are 99% sure the patient has the disease.”
– When interpreting legal evidence (especially forensic evidence). The Prosecutor
and Defendant fallacies
– When using classical statistical hypothesis testing

Slide 19
Prosecutor’s and Defendant’s
Fallacies
• “Suppose a crime has been committed. DNA found at the scene matches the
defendant. It is of a type which is present in 1 in a 1000 people.”
• Prosecutor’s fallacy
– “There is a 1 in a 1000 chance that the defendant would have the DNA
match if he were innocent. Thus, there is a 99.9% chance that he is guilty.”
– This simply (and wrongly) assumes P(H | E) = P(E | H) and also ignores the
prior P(H)
• Defendant’s fallacy
– “This crime occurred in a city of 8,000,000 people. Hence, this blood type
would be found in approximately 8,000 people. The evidence has provided
a probability of 1 in 8,000 that the defendant is guilty and thus has no
relevance.”
– This provides a correct posterior P(H | E) assuming prior P(H) = 1 in
8,000,000 but ignores the change in the posterior from the prior.
Example: Legal Reasoning
– The Birmingham Six

• Six people were convicted of IRA bombing on the basis of


the following evidence:
• Each had traces of nitro-glycerine (NITRO) on their hands
• Expert scientific testimony claimed this gave overwhelming
support for claim that they had handled high explosives
(HE)
• Experts testimony coded as: 𝑃 𝐸 = 𝑁𝐼𝑇𝑅𝑂 𝐻 = 𝐻𝐸)=0.99
• And taken to mean:

𝑃 𝐺𝑢𝑖𝑙𝑡𝑦 𝐸 = 𝑁𝐼𝑇𝑅𝑂)=𝑃 𝐸 = 𝑁𝐼𝑇𝑅𝑂 𝐻 = 𝐻𝐸)=0.99

Slide 21
Example: Legal Reasoning – The Birmingham Six

• Subsequent investigation showed that nitro-glycerine traces could be deposited by many common materials, including playing cards. Roughly 50% of the population had such traces. Therefore assume P(E = NITRO | H = ¬HE) = 0.5
• Assume the prior is P(H = HE) = 0.05
• What is P(H = HE | E = NITRO)?

  P(H = HE | E = NITRO)
  = P(E = NITRO | H = HE) P(H = HE) / [ P(E = NITRO | H = HE) P(H = HE) + P(E = NITRO | H = ¬HE) P(H = ¬HE) ]
  = (0.99 × 0.05) / (0.99 × 0.05 + 0.5 × 0.95) = 0.0495 / (0.0495 + 0.4750) = 0.0943

• The prior is P(H = HE) = 0.05. The posterior is 0.0943.
• Is this enough to convict? Slide 22
The Likelihood Ratio

• We use Bayes' Theorem because we want to update our prior beliefs in a hypothesis when we observe evidence:
  – P(E | H) – the probability of evidence E if H is true
  – P(E | not H) – the probability of evidence E if H is false

  LR = P(E | H) / P(E | not H) – the Likelihood Ratio (LR)

• LR is often used as a measure of the 'probative value' of evidence:
  – LR > 1 – the evidence E is more likely under H than not H
  – LR < 1 – the evidence E is less likely under H than not H
  – LR = 1 – the evidence E is equally likely under H and not H
• Barry George case:
  – P(E = gunpowder from gun | guilty) = 0.01
  – P(E = gunpowder from gun | not guilty) = 0.01
  – LR = 1 Slide 23
Shady Sam’s or Honest Joe’s

Slide 24
Second Order Probability

• Let us assume someone has smuggled a die out of either Shady Sam's or Honest Joe's, but we do not know which casino it has come from. We wish to determine the source of the die from (a) a prior belief about where the die is from and (b) data gained from rolling the die a number of times.
• Assume: P(Joe) ≡ P(p = 1/6) = 0.7 and P(Sam) ≡ P(p = 1/12) = 0.3
• Data consists of one "6" and nineteen "not 6" results.
• The marginal probability of observing this data under these priors is 0.168756.

Slide 25
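A short Python sketch (added for illustration, using only the numbers stated above) computes the likelihood of the data under each casino's die, the marginal probability of the data, and the posterior probability that the die came from Honest Joe's:

import math

# Prior beliefs about where the die came from
p_joe, p_sam = 0.7, 0.3

def likelihood(p_six, sixes=1, rolls=20):
    # Binomial likelihood of observing `sixes` sixes in `rolls` rolls
    return math.comb(rolls, sixes) * p_six**sixes * (1 - p_six)**(rolls - sixes)

l_joe = likelihood(1/6)     # fair die from Honest Joe's
l_sam = likelihood(1/12)    # biased die from Shady Sam's

marginal = p_joe * l_joe + p_sam * l_sam
print(marginal)                    # ~0.1688
print(p_joe * l_joe / marginal)    # posterior P(Joe | data) ~ 0.43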
The null hypothesis H and p-values

• A p-value is the probability of observing the data assuming the


null hypothesis is true.
• Hypotheses for coin:
𝐻0 : 𝑝 = 0.5 null (fair)
𝐻1 : 𝑝 > 0.5 alternative (not fair)
• If the p-value is low, typically 0.01 (1%), statisticians regard this
as ‘highly significant’ (specifically, ‘significant at the 1% level’),
and would reject the null hypothesis.
• But all the data shows is that P(E | 𝐻0 ) = 0.01
• Many people wrongly assume that the result demonstrates that
the “probability of the null hypothesis is 0.01” i.e. P(𝐻0 | E) = 0.01
• But we know P(E | 𝐻0 ) is not P(𝐻0 | E)

Slide 26
Hypothesis test for coin

• We want to calculate the number of heads, X, from a sample of n tosses that would result in us rejecting the null hypothesis
• This is the 99th percentile of the distribution of X (the 99th percentile is equivalent to a p-value of 0.01)
• Using the Binomial distribution P(X | p = 0.5, n = 100), the cumulative probability reaches 0.99 at X = 63
• If you do 100 flips you would reject the null hypothesis that it is a fair coin once you had seen more than 62 heads
Slide 27
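The rejection threshold quoted above can be found by accumulating Binomial tail probabilities; a minimal Python sketch (not from the slides):

import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 100, 0.5
# Smallest number of heads k such that P(X >= k) <= 0.01 under the null hypothesis
for k in range(n + 1):
    upper_tail = sum(binomial_pmf(x, n, p) for x in range(k, n + 1))
    if upper_tail <= 0.01:
        print(k)    # 63: reject the 'fair coin' hypothesis once 63 or more heads are seen
        break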
Lindley’s Paradox

• Assume we have 999 ‘fair’ coins and one coin known to be biased toward
‘heads’. Select a coin randomly. 𝑃(𝐻0 ) = 0.999 , 𝑃(𝐻1 ) = 0.001
• Assume:
– 𝑃(𝐻0 : 𝑝 = 0.5)
– 𝑃(𝐻1 : 𝑝 = 0.9)
• As good experimenters we set a p-value of 0.01 in advance. Then we must
reject 𝐻0 if we see X = 63
• Use binomial to calculate probability of evidence X = 63

𝑃(𝑋 = 63 | 𝑝 = 0.5, 𝑛 = 100) = 0.0027

𝑃 𝑋 = 63 𝑝 = 0.9, 𝑛 = 100) = 0.000000000000448 ≅ 0

Slide 28
Lindley's Paradox

• Then by Bayes (and there are lots of extra parameters under each hypothesis):

  P(H0 | X = 63, n = 100)
  = P(X = 63 | p = 0.5, n = 100) P(H0: p = 0.5) / [ P(X = 63 | p = 0.5, n = 100) P(H0: p = 0.5) + P(X = 63 | p = 0.9, n = 100) P(H1: p = 0.9) ]
  = 0.0027 × 0.999 / (0.0027 × 0.999 + 0 × 0.001) ≈ 1 > 0.01

• With H1: p = 0.9 it is almost impossible to observe 63 heads in 100 flips.
• So we have a paradox. The null hypothesis is rejected, yet its posterior probability is (almost) one!

Slide 29
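A sketch of the same calculation (added here, not from the slides):

import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Priors: 999 fair coins, 1 biased coin
prior_h0, prior_h1 = 0.999, 0.001

# Likelihood of observing 63 heads in 100 flips under each hypothesis
l_h0 = binomial_pmf(63, 100, 0.5)
l_h1 = binomial_pmf(63, 100, 0.9)

posterior_h0 = l_h0 * prior_h0 / (l_h0 * prior_h0 + l_h1 * prior_h1)
print(l_h0)            # ~0.0027
print(l_h1)            # ~4.5e-13
print(posterior_h0)    # ~1.0: H0 is rejected by the p-value test yet is almost certainly true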
2nd Edition Chapter 7:

From Bayes’ Theorem to


Bayesian Networks

Slide 30
A simple risk assessment problem

Since it is important for Norman to arrive on time for work, a number of people (including Norman himself) are interested in the probability that he will be late. Since Norman usually travels to work by train, one of the possible causes for Norman being late is a train strike. Because it is quite natural to reason from cause to effect we examine the relationship between a possible train strike and Norman being late. This relationship is represented by the causal model shown below, where the edge connects two nodes representing the variables "Train strike" (T) and "Norman late" (N).

Slide 31
Simple Risk Assessment Problem - Probabilities

Joint Probability Distribution:

  P(T, N) = P(N | T) P(T)

Calculated Prior:

  P(N = True) = P(N = True | T = True) P(T = True) + P(N = True | T = False) P(T = False)
              = 0.8 × 0.1 + 0.1 × 0.9
              = 0.17

Calculated Posterior using Bayes':

  P(T = True | N = True) = P(N = True | T = True) P(T = True) / P(N = True) = (0.8 × 0.1) / 0.17 = 0.47059

  P(T = False | N = True) = P(N = True | T = False) P(T = False) / P(N = True) = (0.1 × 0.9) / 0.17 = 0.52941

Slide 32
Simple Risk
Assessment
Problem –
Automatic in
AgenaRisk

Slide 33
Accounting for Multiple Causes and Effects

  P(O, M, T, N) = P(M | O, T) P(O) P(N | T) P(T)

Slide 34
Calculations – Martin Late

  P(M = True) = Σ(O,T) P(M = True | O, T) P(O) P(T)

  = P(M = True | O = True, T = True) P(O = True) P(T = True)
  + P(M = True | O = True, T = False) P(O = True) P(T = False)
  + P(M = True | O = False, T = True) P(O = False) P(T = True)
  + P(M = True | O = False, T = False) P(O = False) P(T = False)

  = 0.8 × 0.4 × 0.1 + 0.6 × 0.4 × 0.9 + 0.6 × 0.6 × 0.1 + 0.3 × 0.6 × 0.9
  = 0.446

  P(M = False) = 1 − P(M = True) = 0.554

Slide 35
Calculations – Martin Is Late given Norman is Late

  P(M = True | N = True) = Σ(O,T) P(M = True | O, T) P(O) P(T | N = True)

  = P(M = True | O = True, T = True) P(O = True) P(T = True | N = True)
  + P(M = True | O = True, T = False) P(O = True) P(T = False | N = True)
  + P(M = True | O = False, T = True) P(O = False) P(T = True | N = True)
  + P(M = True | O = False, T = False) P(O = False) P(T = False | N = True)

  = 0.8 × 0.4 × 0.47059 + 0.6 × 0.4 × 0.52941 + 0.6 × 0.6 × 0.47059 + 0.3 × 0.6 × 0.52941
  = 0.542353

  P(M = False | N = True) = 1 − P(M = True | N = True) = 0.45765

Slide 36
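These hand calculations can be cross-checked by brute-force enumeration of the joint distribution; the following Python sketch (an addition, using the probability tables stated in the slides) reproduces P(M = True) = 0.446 and P(M = True | N = True) ≈ 0.5424:

from itertools import product

# Probability tables from the slides
p_T = {True: 0.1, False: 0.9}                  # P(Train strike)
p_O = {True: 0.4, False: 0.6}                  # P(O), the second parent of 'Martin late'
p_N_given_T = {True: 0.8, False: 0.1}          # P(Norman late = True | T)
p_M_given_OT = {(True, True): 0.8, (True, False): 0.6,
                (False, True): 0.6, (False, False): 0.3}    # P(Martin late = True | O, T)

def joint(m, n, o, t):
    pm = p_M_given_OT[(o, t)] if m else 1 - p_M_given_OT[(o, t)]
    pn = p_N_given_T[t] if n else 1 - p_N_given_T[t]
    return pm * pn * p_O[o] * p_T[t]

# Marginal P(M = True)
p_m = sum(joint(True, n, o, t) for n, o, t in product([True, False], repeat=3))
print(p_m)    # 0.446

# Conditional P(M = True | N = True)
p_m_and_n = sum(joint(True, True, o, t) for o, t in product([True, False], repeat=2))
p_n = sum(joint(m, True, o, t) for m, o, t in product([True, False], repeat=3))
print(p_m_and_n / p_n)    # ~0.5424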
Simple Risk
Assessment
Problem –
Automatic in
AgenaRisk

Slide 37
Let's add an edge!

Complex Case

Declare the conditional probability tables to be the same....

  P(N | T, O) = P(M | T, O)

Slide 38
Calculations – Complex Case

  P(N = True) = Σ(O,T) P(N = True | O, T) P(O) P(T)

  = P(N = True | O = True, T = True) P(O = True) P(T = True)
  + P(N = True | O = True, T = False) P(O = True) P(T = False)
  + P(N = True | O = False, T = True) P(O = False) P(T = True)
  + P(N = True | O = False, T = False) P(O = False) P(T = False)

  = 0.8 × 0.4 × 0.1 + 0.6 × 0.4 × 0.9 + 0.6 × 0.6 × 0.1 + 0.3 × 0.6 × 0.9
  = 0.446

This is the same result as P(M = True) because the probability tables P(N | T, O) = P(M | T, O), i.e. they are the same, remember!

Slide 39
Calculations – Complex Case

Before, we calculated the effect of evidence on a single conditioning variable, T, using Bayes':

  P(T | N = True) = P(N = True | T) P(T) / P(N = True)

Now we need to calculate the effect of evidence, from N, on two conditioning variables, (O, T), using the Fundamental Rule:

  P(O, T | N = True) = P(N = True, O, T) / P(N = True) = P(N = True | O, T) P(O) P(T) / P(N = True)
Slide 40
Calculations – 'Martin Is Late' given 'Norman is Late' in Complex Case

  P(O, T | N = True) = P(N = True | O, T) P(O) P(T) / P(N = True)

  P(O = True, T = True | N = True) = P(N = True | O = True, T = True) P(O = True) P(T = True) / P(N = True)

  P(O = True, T = False | N = True) = P(N = True | O = True, T = False) P(O = True) P(T = False) / P(N = True)

  P(O = False, T = True | N = True) = P(N = True | O = False, T = True) P(O = False) P(T = True) / P(N = True)

  P(O = False, T = False | N = True) = P(N = True | O = False, T = False) P(O = False) P(T = False) / P(N = True)

Slide 41
Calculations – 'Martin Is Late' given 'Norman is Late' in Complex Case

  P(O, T | N = True) = P(N = True | O, T) P(O) P(T) / P(N = True)

  P(O = True, T = True | N = True) = (0.8 × 0.4 × 0.1) / 0.446 = 0.0717

  P(O = True, T = False | N = True) = (0.6 × 0.4 × 0.9) / 0.446 = 0.4843

  P(O = False, T = True | N = True) = (0.6 × 0.6 × 0.1) / 0.446 = 0.0807

  P(O = False, T = False | N = True) = (0.3 × 0.6 × 0.9) / 0.446 = 0.3632

Note: Notice that the denominator in all cases is constant at 0.446 – it acts as a
normalization constant to ensure all probabilities sum to one. Slide 42
Calculations – 'Martin Is Late' given 'Norman is Late' in Complex Case

  P(M = True | N = True) = Σ(O,T) P(M = True | O, T) P(O, T | N = True)

  = P(M = True | O = True, T = True) P(O = True, T = True | N = True)
  + P(M = True | O = True, T = False) P(O = True, T = False | N = True)
  + P(M = True | O = False, T = True) P(O = False, T = True | N = True)
  + P(M = True | O = False, T = False) P(O = False, T = False | N = True)

  = 0.8 × 0.0717 + 0.6 × 0.4843 + 0.6 × 0.0807 + 0.3 × 0.3632
  = 0.50532

  P(M = False | N = True) = 1 − P(M = True | N = True) = 0.49462

Slide 43
Calculations – Complex Case Proof

First use the Fundamental Rule (for simplicity assume no evidence on N):

  P(M | O, T, N) = P(M, O, T, N) / P(O, T, N)
                 = P(M | O, T) P(N | O, T) P(O) P(T) / P(O, T, N)
                 = P(M | O, T) P(N | O, T) P(O) P(T) / Σ_M P(M | O, T) P(N | O, T) P(O) P(T)

Next calculate the marginal P(M), noting that Σ_{M,N,O,T} P(M | O, T) P(N | O, T) P(O) P(T) = 1:

  P(M) = Σ_{N,O,T} P(M | O, T) P(N | O, T) P(O) P(T) / Σ_{M,N,O,T} P(M | O, T) P(N | O, T) P(O) P(T)
       = Σ_{N,O,T} P(M | O, T) P(N | O, T) P(O) P(T)
       = Σ_{O,T} P(M | O, T) Σ_N P(N | O, T) P(O) P(T)
       = Σ_{O,T} P(M | O, T) Σ_N P(O, T | N) P(N)      (since P(N | O, T) P(O) P(T) = P(O, T | N) P(N) by the fundamental rule)
       = Σ_{O,T} P(M | O, T) P(O, T) = P(M) Slide 44
Lesson Summary

• All probabilities are conditioned on some subjective


assumptions, so all probability is subjective.
• Bayes’ theorem is simply a method that enables us to handle
conditional assumptions properly.
• In particular, Bayes’ enables us to revise and change our
predictions and diagnoses in light of new data and information
(which can be both objective and subjective).
• Bayes is scientific and rational since it forces our model to
“change its mind”.
• Manual calculations are tedious and error prone!
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

Introduction to Bayesian
Networks and AgenaRisk

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should:
– Have a better understanding of Bayesian
Networks
– Understand how to build a Bayesian Network in
Learning AgenaRisk
– Understand how to run calculations on a
Goals Bayesian Network and how to interpret the
results
• These skills will be useful to:
– Applying Bayesian Networks to real world
problems
2nd Edition Chapter 7:

From Bayes’ Theorem to


Bayesian Networks

Slide 3
Bayesian and Bayesian
Network Applications

• Intelligent search
• Collaborative filtering
• Recommendation engines
• Machine learning
• Expert systems
• Data mining
• Risk assessment
• Computer vision

Slide 4
Definition

• A set of variables and a set of directed edges


between variables
• Each variable has a set of mutually exclusive states
• The variables together with the directed edges form a
Directed Acyclic Graph (DAG)
• To each variable A with parents B1, B2, …, Bn, there is attached a conditional probability table P(A | B1, B2, …, Bn)
• To each variable A with no parents, there is attached a probability table P(A)

Slide 5
Simple Model

We can use basic probability theory and


Bayes to calculate everything we need to
know in this model

e.g. by marginalisation we find the prior


‘marginal’ value for Norman late

P(N = True) = P(N = True | T = True) P(T = True) + P(N = True | T = False) P(T = False)
            = 0.8 × 0.1 + 0.1 × 0.9
            = 0.17

And use Bayes to calculate the posterior probability of a Train strike if we observe Norman is late:

P(T = True | N = True) = P(N = True | T = True) P(T = True) / P(N = True)
                       = (0.8 × 0.1) / 0.17
                       = 0.47059
Slide 6
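A quick check of this two-node calculation in Python (nothing here beyond the numbers on the slide):

p_T = 0.1                              # prior P(T = True)
p_N_given_T = {True: 0.8, False: 0.1}  # P(N = True | T)

p_N = p_N_given_T[True] * p_T + p_N_given_T[False] * (1 - p_T)   # marginal, 0.17
p_T_given_N = p_N_given_T[True] * p_T / p_N                      # posterior, ~0.47059
print(p_N, p_T_given_N)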
But most models are not so simple
We can just about do the necessary
calculations by hand, e.g. calculating
marginal probability Martin is late:
P(M = True) = Σ_{O,T} P(M = True | O, T) P(O) P(T)
 = P(M = True | O = True, T = True) P(O = True) P(T = True)
 + P(M = True | O = True, T = False) P(O = True) P(T = False)
 + P(M = True | O = False, T = True) P(O = False) P(T = True)
 + P(M = True | O = False, T = False) P(O = False) P(T = False)
 = 0.8 × 0.4 × 0.1 + 0.6 × 0.4 × 0.9 + 0.6 × 0.6 × 0.1 + 0.3 × 0.6 × 0.9
 = 0.446

Revising the probability that Martin is late


once we know Norman is late:

P(M = True | N = True) = Σ_{O,T} P(M = True | O, T) P(O) P(T | N = True)
 = P(M = True | O = True, T = True) P(O = True) P(T = True | N = True)
 + P(M = True | O = True, T = False) P(O = True) P(T = False | N = True)
 + P(M = True | O = False, T = True) P(O = False) P(T = True | N = True)
 + P(M = True | O = False, T = False) P(O = False) P(T = False | N = True)
 = 0.8 × 0.4 × 0.47059 + 0.6 × 0.4 × 0.52941 + 0.6 × 0.6 × 0.47059 + 0.3 × 0.6 × 0.52941
 = 0.542353
Slide 7
Crucial independence assumptions
• We did NOT have to find the full joint probability distribution
𝑃 𝑀, 𝑁, 𝑂, 𝑇
– e.g. assumed M dependent only on O and T (and not on N), and that O and
T are independent of each other.
• These are called conditional independence assumptions
– If we were unable to make any such assumptions then the full joint
probability distribution of (𝑀, 𝑁, 𝑂, 𝑇) is (Chain Rule):
» P(M, N, O, T) = P(N | M, O, T) P(M | O, T) P(O | T) P(T)
– But as N depends only on T, P(N | M, O, T) = P(N | T), and because O is independent of T:
» P(O | T) = P(O)
– Hence, the full joint probability distribution simplifies to:
» P(M, N, O, T) = P(N | T) P(M | O, T) P(O) P(T)
– And the expressions on r.h.s correspond exactly to the 4 probability tables
we used

Slide 8
The crucial graphical feature of a BN

The crucial graphical feature of a BN is that it tells us


which variables are NOT linked, and hence it captures
our assumptions about which pairs of variables are not
directly dependent.

..instead of the ‘complete


So we can use this graph graph in which all nodes are
connected

Slide 9
BN for the Asia example model

𝑃(𝐴)
𝑃(𝑆)

𝑃(𝐶|𝑆)
𝑃(𝐵|𝑆)
𝑃(𝑇𝐵|𝐴)

𝑃(𝑇𝐵𝑜𝐶|𝑇𝐵, 𝐶)

𝑃(𝑋|𝑇𝐵𝑜𝐶)
𝑃(𝐷|𝑇𝐵𝑜𝐶, 𝐵)

Slide 10
Properties of BNs
• Computations on full joint probability distribution not
feasible for large problems.
• Exploit conditional independence assumptions to
reduce combinatorial explosion
• Relatively easier to elicit conditional probability tables
from experts than ask for joint probabilities
• Causal structure easier to understand than
mathematics
• Fast algorithms are available to compile and execute BNs (Pearl, and Lauritzen & Spiegelhalter)
• Evidence is propagated throughout BN by exploiting
Bayes Theorem
• No need to calculate by hand nor use standard
analytic formulations (e.g. conjugacy)
• Forecasts can be done with incomplete evidence
Slide 11
• https://www.agenarisk.com/installation
-upgrade-guide
Download • If you have any installation or setup issues
Link DO NOT contact AgenaRisk!
• Post to the forum….
AgenaRisk License
Instructions
• Floating License Server:
– Installed in ITL
– You can download it onto your laptop
or any other machines
– Runs on Windows/Mac/Linux
– Must have live internet connection
• Floating License Server address:
FLOAT-427DB0-29EAEF-82D719-82D4FA-D3B991-CE9D76
• RTFM – AgenaRisk user manual

Slide 13
Practical Session 1

Build simple model:

• Experiment with GUI tools (resize, appearance,


copy/paste etc)
• Run the model (with forward and backward inference
observations)
• Change the priors
Slide 14
Practical Session 2

Start AgenaRisk
Open Asia Model from \Model Library\Introductory\Asia\Asia.ast
Open AgenaRisk 10 User Manual
Explore different graph views
Examine Risk table view
Batched evidence
Soft evidence
Hide nodes
Add label, picture, edge annotations
Add notes
Perform sensitivity analysis
Slide 15
How to
build a BN

Slide 16
How to
Execute
and Query
a BN

Slide 17
Lesson Summary

• BNs are now widely applied in diverse fields


• Defined and outlined properties of BNs
• Tutorial on AgenaRisk
• Executed “Asia” medical expert system BN
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

Propagation and Reasoning in


Bayesian Networks

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should:
• Understand the Variable Elimination (VE)
algorithm
• Appreciate the Junction Tree (JT) algorithm
• Understand structural properties of
Bayesian Networks including d-connections
Learning • Comprehend ‘explaining away’,
counterfactual reasoning and interventions
Goals • These skills will be useful when:
• Implementing Bayesian Network algorithms
• Uncovering fallacies in reasoning
• Structuring Bayesian Networks
appropriately
2nd Edition Chapter 7:

From Bayes’ Theorem to


Bayesian Networks

Slide 3
Marginalization by Variable Elimination

P(A, T, L, E, X) = P(A) P(T | A) P(L) P(E | T, L) P(X | E)

P(T) = Σ_{A,L,E,X} P(A) P(T | A) P(L) P(E | T, L) P(X | E)

P(T) = Σ_A P(A) P(T | A) Σ_E Σ_L P(E | T, L) P(L) Σ_X P(X | E)

Since Σ_X P(X | E) = 1                                        (Box 7.4, page 166)

P(T) = Σ_A P(A) P(T | A) Σ_E Σ_L P(E | T, L) P(L)

Σ_L P(E | T, L) P(L) = P(E | T)

P(T) = Σ_A P(A) P(T | A) Σ_E P(E | T)

and since Σ_E P(E | T) = 1:

P(T) = Σ_A P(A) P(T | A)

Note that there is no unique elimination order when models get larger.
Slide 4
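To make the elimination order concrete, here is a small Python sketch of variable elimination for this factorisation. The CPT numbers are made up purely for illustration (they are not the book's values); the point is that eliminating X, L, E and A in turn gives the same marginal as brute-force enumeration:

# Variable elimination vs brute force for P(A)P(T|A)P(L)P(E|T,L)P(X|E), Boolean variables
from itertools import product

P_A = {1: 0.01, 0: 0.99}
P_T_A = {(1, 1): 0.05, (0, 1): 0.95, (1, 0): 0.01, (0, 0): 0.99}   # P(T | A), keyed (t, a)
P_L = {1: 0.1, 0: 0.9}
P_E_TL = {(1, t, l): (0.99 if (t or l) else 0.001) for t in (0, 1) for l in (0, 1)}
P_E_TL.update({(0, t, l): 1 - P_E_TL[(1, t, l)] for t in (0, 1) for l in (0, 1)})
P_X_E = {(1, 1): 0.98, (0, 1): 0.02, (1, 0): 0.05, (0, 0): 0.95}   # P(X | E), keyed (x, e)

def brute_force_P_T(t):
    return sum(P_A[a] * P_T_A[(t, a)] * P_L[l] * P_E_TL[(e, t, l)] * P_X_E[(x, e)]
               for a, l, e, x in product((0, 1), repeat=4))

def ve_P_T(t):
    # sum out X: sum_X P(X|E) = 1, so that factor vanishes
    f_E_T = {e: sum(P_E_TL[(e, t, l)] * P_L[l] for l in (0, 1)) for e in (0, 1)}  # sum out L
    sum_E = sum(f_E_T.values())                                                    # sum out E (= 1)
    return sum(P_A[a] * P_T_A[(t, a)] for a in (0, 1)) * sum_E                     # sum out A

print(brute_force_P_T(1), ve_P_T(1))   # both give the same marginal P(T = true)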
Structural Properties of BNs (d-connections)

Causal-Evidential Trail Common Cause Common Effect


(Serial connection) (Diverging connection) (Converging connection)

All Bayesian Networks can be composed from these atomic structures!

How evidence is communicated (‘propagated’) through model


Depends on these structures.

Slide 5
Causal (Evidential) Trail (Serial)

Evidence can also be ‘back propagated’


from node C to B if not blocked by A = a.

Slide 6
Causal Evidential Trail (Serial)

In a serial connection B and C are conditionally independent given A (or


equivalently B and C are d-separated given A).

i.e. changing B will not affect C nor A (which is instantiated)

Slide 7
Common Cause (Diverging)

Slide 8
Common Cause (Diverging)

In a diverging connection B and C are conditionally independent given A


(or equivalently B and C are d-separated given A).

i.e. changing B will not affect C nor A (which is instantiated)

Operates identically to serial connection! Slide 9


Common Effect (Converging)

Evidence can also be propagated


from node C to B if ‘opened’ by A = a

Slide 10
Common Effect (Converging)

In a converging connection B and C are conditionally dependent given A


(or equivalently B and C are d-connected given A).

i.e. changing B will affect C and A if A (or descendant) is instantiated with


hard or soft evidence.

Slide 11
Common Effect (Converging) Example
P(C | A, B)  (CPT, shown on the slide as joint probabilities):

                a1              a2
            b1      b2      b1      b2
    c1      .7      .5      .4      .8
    c2      .3      .5      .6      .2

Instantiate C = c1:                          then instantiate A = a2:

                a1              a2                           a1              a2
            b1      b2      b1      b2                   b1      b2      b1      b2
    c1      .7      .5      .4      .8           c1      0       0       .4      .8
    c2      0       0       0       0            c2      0       0       0       0

    P(A) = (a1: .5, a2: .5)                       P(A) = (a1: 0, a2: 1)
    P(B) = (b1: .46, b2: .54)                     P(B) = (b1: .33, b2: .67)   ← Changed!

Note: evidence instantiation is equivalent to multiplying 0 or 1 into the NPT entries.
Slide 12
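The 'explaining away' behaviour in this example can be checked with a few lines of Python. Uniform priors on A and B are assumed here, since the slide does not state them explicitly:

P_C1 = {('a1', 'b1'): 0.7, ('a1', 'b2'): 0.5, ('a2', 'b1'): 0.4, ('a2', 'b2'): 0.8}  # P(C=c1 | A, B)
P_A = {'a1': 0.5, 'a2': 0.5}
P_B = {'b1': 0.5, 'b2': 0.5}

def posterior_B(evidence_a=None):
    # P(B | C=c1 [, A=a]) obtained by multiplying in the evidence and normalising
    scores = {b: sum(P_C1[(a, b)] * P_A[a] * P_B[b]
                     for a in P_A if evidence_a in (None, a))
              for b in P_B}
    z = sum(scores.values())
    return {b: s / z for b, s in scores.items()}

print(posterior_B())       # {'b1': ~0.46, 'b2': ~0.54}  after observing C = c1
print(posterior_B('a2'))   # {'b1': ~0.33, 'b2': ~0.67}  also observing A = a2: B changes ("explaining away")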
Determining d-separation
Enter evidence
on A G is D-separated
from A

blocked

J updated by F.

J updated by L
BUT no evidence
opened on L or descendant
(there is none)

Slide 13
Evidence: B = b and M = m
• Construct junction tree from BN graph
• Reduce graph to junction tree containing
serial connections with no loops
• Each converging BN fragment joined
together into single cluster
• Diverging and serial BN fragments serially
connected to allow blocking evidence flows
Overview of • Propagate evidence through junction tree
• When evidence entered calculate changes
Junction Tree to BN locally
algorithm • Propagate impact of evidence globally
through junction tree
• Use message passing to update likelihoods
throughout BN
• Calculations done using:
• Bayes theorem
• Fundamental rule
• Marginalisation
From Bayesian Network to Junction Tree

P(T) = Σ_A P(A) P(T | A) Σ_E Σ_L P(E | T, L) P(L) Σ_X P(X | E)

Slide 15
From Bayesian Network to Junction Tree

Slide 16
Creating a Moral Graph

1. Moralise by adding edges between parent nodes


2. Remove all edge directions

[Figure: the original BN (G) alongside its moral graph (GM), with edges added between parents and all edge directions removed]


Slide 17
Triangulation

[Figure: triangulation of the moral graph by successively eliminating nodes; the resulting clusters include GEH, CEG, DEF, ACE, ABD, ADE and AE]
Slide 18
Cluster Identification

Step Node Selected Edges Added Clusters


1 H none GEH
2 G none CEG
3 F none DEF
4 C (A, E) ACE
5 B (A, D) ABD
6 D none ADE
7 E none AE
8 A none A

A different node elimination order can result in different clusters


Slide 19
Create Junction Tree

ABD Cluster/Clique AD Separator/Sepset

Slide 20
Two Step Propagation – Collection

[Figure: junction tree with clusters ABD, ADE, DEF, ACE, CEG, GEH and separators AD, DE, AE, CE, EG; Collect-Evidence(X) passes messages 1–5 from the leaf clusters towards the root]

Slide 21
Two Step Propagation – Distribution

[Figure: the same junction tree; Distribute-Evidence(X) passes messages 1–5 from the root back out to the leaf clusters]

Slide 22
Table Notation used during propagation

t(ABC) = P(B|C) P(C) P(A)                                               Multiplication

t(AB) = Σ_C t(ABC) = Σ_C t(BC) t(A) = t(A) Σ_C t(BC) = t(A) t(B)        Marginalization

P(A|B) = P(A, B, C) / P(B) = P(B|C) P(C) P(A) / P(B) = t(ABC) / t(B)    Fundamental rule / Bayes theorem

Slide 23
Propagation

[Figure: junction tree for the serial BN A → B → C, with clusters AB and BC joined by separator B]

Serial BN:        t(ABC) = P(C|B) P(B|A) P(A)
Junction tree:    t(ABC) = t(AB) t(BC) / t(B)

t(B) = Σ_C t(BC)          t(B) = Σ_A t(AB)

After evidence is entered (* indicates evidence instantiated in a cluster):

t*(B) = Σ_C t*(BC)

t*(B) = (t*(B)/t(B)) t(B) = (t*(B)/t(B)) Σ_A t(AB) = Σ_A (t*(B)/t(B)) t(AB) = Σ_A t*(AB)    – the ‘trick’!
Slide 24
Monty Hall Game Show – Simple
Solution
Monty Hall Game Show – More
Complex Solution
Simpson’s paradox drug example
A new drug is being tested on a group of 800 people (400
men and 400 women) with a particular ailment.

Half of the people (randomly selected) are given the drug


and the other half are given a placebo.
    Drug taken        No      Yes
    Recovered
      No              240     200
      Yes             160     200

50% of those given the drug recover compared to 40% given the placebo.

So the drug clearly has a positive effect. Or does it?


Slide 27
Simpson’s paradox stratified data

    Sex               Female          Male
    Drug taken        No      Yes     No      Yes
    Recovered
      No              210     80      30      120
      Yes             90      20      70      180

For men: 70% (70 out of 100) taking the placebo recover, but only 60%
(180 out of 300) taking the drug recover.

For men, the recovery rate is better without the drug.

For women: 30% (90 out of 300) taking the placebo recover, but only 20%
(20 out of 100) taking the drug recover.

For women, the recovery rate is better without the drug.

Slide 28
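A short Python check of the aggregated and stratified recovery rates – the paradox is just arithmetic on the two tables above:

data = {  # (sex, drug_taken) -> (recovered, not_recovered)
    ('F', 'No'): (90, 210), ('F', 'Yes'): (20, 80),
    ('M', 'No'): (70, 30),  ('M', 'Yes'): (180, 120),
}

def rate(keep):
    rec = sum(r for (sex, drug), (r, n) in data.items() if keep(sex, drug))
    tot = sum(r + n for (sex, drug), (r, n) in data.items() if keep(sex, drug))
    return rec / tot

print(rate(lambda s, d: d == 'Yes'), rate(lambda s, d: d == 'No'))                              # 0.50 vs 0.40 overall
print(rate(lambda s, d: s == 'M' and d == 'Yes'), rate(lambda s, d: s == 'M' and d == 'No'))    # 0.60 vs 0.70 for men
print(rate(lambda s, d: s == 'F' and d == 'Yes'), rate(lambda s, d: s == 'F' and d == 'No'))    # 0.20 vs 0.30 for women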
Simpson’s paradox explained

• Because it is impossible to ‘control’ for every possible ‘variable’ the implications


for classical approaches to medical trials are devastating (and generally not
understood).
• The paradox is theoretically unavoidable given possibility of confounding
variables.
Slide 29
Revisiting Simpson’s Paradox

Which model is correct? (c)

You cannot ignore Sex as a cause, though ideally want to design drug
trial so that drug is administered equally across all relevant strata,
including sex. Slide 30
Proposition 1: Correlation implies causation. No!

Refuting the
Assertion ‘If
There Is No
Correlation
Then There
Cannot be Proposition 2: Causation implies correlation. No!

Causation’

Assume c is a confounder and is unobserved, and the relationship b | a, c is deterministic!
Slide 31
Modelling Interventions

Causal model
Question: Is it beneficial to take the drug?

But now Sex distribution has changed!


Observation

Intervention breaks causal chain and


Surgery produces new model…

….man changes nature!

Intervention result differs from observation


Intervention result

Slide 32
Modelling Counterfactuals
Suppose we know that a person who took the drug did not recover. Without knowing the
sex of the person, what is the probability that the person would have recovered if they had
not taken the drug?

Original model in which we observe a person who took


the drug did not recover.

Counterfactual version of the model in which we


determine whether the person would have recovered if they
had not taken the drug.

Twin network solution – combines what did happen with


what might have happened.

Slide 33
Lesson Summary

• Atomic fragments called d-connections supporting multiple


causes and multiple consequences
• Support ‘explaining away’: an operation central to human
reasoning
• Modelled two paradoxes: Monty Hall and Simpson’s
• Described intervention and counterfactual reasoning
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

NPTs, Functions and Continuous


Distributions

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should:
• Understand role and use of functions and
expressions
• Determine which functions are appropriate for
different node types
Learning • Appreciate that real world applications involve
discrete and continuous variables
Goals • You should also understand the use of:
• Simulation of continuous variables by
approximation
• Statistical learning to infer parameters from data
• Inference using technique called dynamic
discretization
2nd Edition Chapter 9:

Building and Eliciting Node


Probability Tables

Slide 3
Growth in Probability Table Size

Suppose that we decided to change the states of the water level nodes to be:
very low, low, medium, high, very high. Then the NPT now has 5 × 5 × 4 = 100
cells. If we add another parent with 5 states then this jumps to 500 entries,
and if we increase the granularity of rainfall to, say, 10 states then the NPT has
1,250 cells.

There is clearly a problem of combinatorial growth.


Slide 4
Labelled Nodes

• Generally the most difficult to handle since few ‘non-manual’ tricks can be applied
• Only main non-manual tricks available are to use
• Comparative expressions
• Partitioned expressions

Slide 5
Using Comparative Expressions

• In some circumstance it may be sufficient to use a simple


comparative expression to define a full NPT
• The only possible injury Alison and Nicole can suffer is a damaged ligament.
• The only possible injury Peter can suffer is a broken bone.
• The only possible injury John can suffer is “other.”
• Node Injury is captured using the following nested logical expression:
▪ if (<condition>, “option1”, “option2”)
▪ A == B
▪ A || B
▪ A && B
• Example:
▪ if (student == “Alison” || student == “Nicole”,
“Damaged ligament”, if (student ==”Peter”, “Broken
bone”, “Other”))

Slide 6
Boolean Nodes

• Any state is the complement of the other and they are mutually exclusive
• OR function: if (A == “True” || B== “True”, “True”, “False”)
• AND function: if (A == “True” && B== “True”, “True”, “False”)

OR:
    Tuberculosis      True            False
    Cancer            True    False   True    False
    True              1       1       1       0
    False             0       0       0       1

AND:
    Tuberculosis      True            False
    Cancer            True    False   True    False
    True              1       0       0       0
    False             0       1       1       1
Slide 7
OR Example

Expression for System node:

Slide 8
OR Example – Assigned and Calculated Marginals

Slide 9
OR Example Diagnostic Inference

Slide 10
Naïve Bayes
Classifier
Model

Most popular Bayesian model used in spam filters


and many machine learning applications.
Slide 11
Naïve Bayes Classifier Model

Slide 12
AND Example

if (Power_1 == “True” && Power_2 == “True” && Power_3


== “True” && Power_4 == “True” , “True”, “False”)

Slide 13
The M from N operator

• Suppose the system actually fails only if at least 3 of the 4 power


supplies fail
• The NPT expression for node “System” is then:

mfromn(3, power1 == “True” , power2 ==


“True”,power3 == “True” , power4 == “True”)

• Note that if m = 1 then the M from N operator is the same as the OR


operator and if m = n then the M from N operator is the same as the
AND operator.

Slide 14
NoisyOR
Like the OR function but where there is uncertainty (e.g. even if all
the causal factors are “True” we cannot say with certainty that the
person will suffer a heart attack before 60).

If the effects of the parent


nodes on the child are
essentially independent
NoisyOR enables us to
quantify the impact of each
causal factor on the heart
attack node independently of
considering all of the
combinations of states of the
other parents.

Slide 15
NoisyOR

Y = NoisyOR(X1, v1, X2, v2, …, Xn, vn, l)

Equivalent to:

P(Y = true | X1 … Xn) = 1 − (1 − l) ∏_{Xi is true} (1 − vi)

where l is the leak value and vi is the weight (probability) associated with cause Xi.

Example: Y = NoisyOR(X1, 0.4, X2, 0.2, X3, 0.3, 0.1)

P(Y = true | X1 = true, X2 = true, X3 = true) = 1 − (1 − 0.1)(1 − 0.4)(1 − 0.2)(1 − 0.3) = 0.6976

P(Y = true | X1 = true, X2 = true, X3 = false) = 1 − (1 − 0.1)(1 − 0.4)(1 − 0.2) = 0.568

Slide 16
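A minimal Python implementation of the NoisyOR expression above, reproducing the two example values (the function name noisy_or is mine, not AgenaRisk's):

def noisy_or(true_cause_weights, leak):
    """true_cause_weights: the weights v_i of the causes that are currently True."""
    p_not = 1 - leak
    for v in true_cause_weights:
        p_not *= (1 - v)
    return 1 - p_not

# Example from the slide: Y = NoisyOR(X1, 0.4, X2, 0.2, X3, 0.3, 0.1)
print(noisy_or([0.4, 0.2, 0.3], leak=0.1))  # 0.6976  (X1, X2, X3 all true)
print(noisy_or([0.4, 0.2], leak=0.1))       # 0.568   (X3 false)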
NoisyOR Example

• Specify a probability (between 0 and 1) for each of the causal factors


that the consequence will be True if this particular cause is True.
• E.g, if we believe there is a 15% chance that being a smoker will cause a
heart attack before the age of 60 then the value associated with the cause
smoker will be 0.15.
• The leak value is the extent to which there are missing factors from the
model that can contribute to the consequence being true
• When the leak value is 0 and all the causal factor values are 1, NoisyOR
is same as OR function
• Assumptions:
• Negative values of causal parent variables have no effect on the child
variable
• Probability of each parent causal variable is independent

noisyor (Smoker, 0.15, Lack of exercise, 0.1, Poor diet,


0.1, Stress, 0.05, Parent had heart attack, 0.2, 0.1) Leak value

Slide 17
NoisyOR Example Calculation

Slide 18
Ranked Nodes and Functions

Common BN pattern involves nodes whose type is ‘ranked’ in an


ordinal sense

Little relevant data but experts can provide judgments


like:

• When X 1 and X 2 are both ‘very high’ the distribution of Y is heavily


skewed toward ‘very high’.
• When X 1 and X 2 are both ‘very low’ the distribution of Y is heavily
skewed toward ‘very low’.
• When X 1 is ‘very low’ and X 2 is ‘very high’ the distribution of Y is
centred below ‘medium’.
• When X 1 is ‘very high’ and X 2 is ‘very low’ the distribution of Y is
centred above ‘medium’.

Ranked nodes exploit the fact that there is essentially an underlying numerical scale

Slide 19
Truncated
Normal
(TNormal)
Distribution

Slide
20
Ranked nodes underlying (hidden)
scales
Ranked nodes with TNormal NPTs

Slide 22
Ranked Nodes - The Weighted Mean Function

Slide 23
2nd Edition Chapter 10:

Numeric Variables and


Continuous
Distribution Functions

Slide 24
Functions and Continuous Distributions

A probability distribution for an experiment is the


assignment of probability values to each of the possible
states (elementary events).

• What happens when the number of values is large or infinite?


• Use a probability density function (pdf) f(x) for variable X
• The cumulative density function (cdf) can be represented as an integral:

  F(x) = ∫_{−∞}^{x} f(u) du        with        ∫_{−∞}^{∞} f(x) dx = 1

• For a range of values X ∈ [a, b]:

  P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx

Slide 25
Approximating a continuous distribution using
discrete intervals

  Σ_{i=1}^{n} P(x_i) ≈ P(a < X < b) = ∫_{a}^{b} f(x) dx

• In the discrete case we sum probabilities over piecewise uniform discrete states X_i that together compose an interval X ∈ [a, b]:

  P(a_1 < X < a_6) = ∫_{a_1}^{a_6} f(x) dx ≈ Σ_i P(a_i < X_i < a_{i+1})

• The more intervals we use the better the approximation

Slide 26
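A small Python sketch of this idea, assuming a standard Normal density purely for illustration: each interval's probability is approximated as density-at-midpoint times width, and the approximation improves as the number of intervals grows:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def discretised_prob(a, b, n):
    # treat the density as uniform within each of the n sub-intervals (midpoint rule)
    width = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * width) * width for i in range(n))

a, b = -1.0, 2.0
exact = normal_cdf(b) - normal_cdf(a)
for n in (2, 5, 20, 100):
    print(n, discretised_prob(a, b, n), exact)   # approximation tends to the exact value as n grows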
Joint and conditional probabilities

• Joint: f(x1, …, xj, …, xk)

• Conditional (fundamental rule applied to the joint density): f(x1, …, xk) = f(x1 | x2, …, xk) f(x2, …, xk)

• Marginal: f(xj) = ∫ … ∫ f(x1, …, xj, …, xk) dx1 … dxk   (integrating over all xi with i ≠ j)

• Fundamental rule:  f(x | y) = f(x, y) / f(y)

• All of the discussions and properties about dependence and independence


apply here as they did in the discrete case.
Slide 27
Dynamic Discretization

• In AgenaRisk to use dynamic discretization you simply declare that a


numeric node is a simulation node.
• Approximate algorithm that dynamically determines a discretization
that maximises the information captured in a model whilst
minimising the number of states
• At each iteration test a candidate discretization and adopt the one
that minimises the relative entropy between the approximation and
the true function, using the KL distance:

  D(f || g) = ∫_S f(x) log( f(x) / g(x) ) dx

• Stopping rule for convergence

Slide 28
Dynamic Discretization
P(X = false) = 0.5,   P(X = true) = 0.5

P(Y | X) = Normal(μ1, σ1²) if X = false;  Normal(μ2, σ2²) if X = true

Analytical solution:  P(Y) = ½ N(Y | μ1, σ1²) + ½ N(Y | μ2, σ2²)
Slide 29
Dynamic Discretization – Car Costs Example

Slide 30
Dynamic Discretization – Car Costs Example

Features:
• Hybrid model with continuous
variables conditioned on discrete
• Use scenarios to evaluate options
• Statistical distributions are “mixed” by
partitioning functions acting like IF
statements

Slide 31
Parameter Learning using Dynamic Discretization

Exam Pass Rates

Slide 32
Maximum likelihood (frequentist) estimate

  μ̂_p = p̄ = (Σ_{i=1}^{n} p_i) / n = 0.664

  σ̂_p² = Σ_{i=1}^{n} (p_i − p̄)² / (n − 1) = 0.0391

  N(μ̂_p = 0.664, σ̂_p² = 0.0391)

Do these parameters have exactly these values?


Under this model many actual observations such as 40% are
unlikely!
Slide 33
Bayesian Statistical Learning

• Parameters unknown
• Data known
• Set sensible priors and likelihoods

𝜇~𝑈𝑛𝑖𝑓𝑜𝑟𝑚(0,1)
𝜎 2 ~𝑈𝑛𝑖𝑓𝑜𝑟𝑚(0,1)
𝑝~𝑇𝑁𝑜𝑟𝑚𝑎𝑙(𝜇, 𝜎 2 , 0,1)

• Learn from data


• Predict unknown school

P(p | μ, σ², data) Slide 34
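A simplified grid-approximation sketch of this learning step in Python. The pass-rate data below are hypothetical, and a plain Normal likelihood with a uniform prior over (μ, σ) is used in place of the TNormal model on the slide, so this only illustrates the idea (AgenaRisk does the real computation by dynamic discretization):

import math

data = [0.55, 0.70, 0.62, 0.75, 0.40, 0.68, 0.72, 0.66]     # hypothetical school pass rates

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

grid = [(mu / 100, sd / 100) for mu in range(1, 100) for sd in range(1, 60)]
weights = []
for mu, sd in grid:                                          # uniform priors => posterior ∝ likelihood
    weights.append(math.prod(normal_pdf(p, mu, sd) for p in data))
z = sum(weights)
post = [w / z for w in weights]

mean_mu = sum(w * mu for w, (mu, sd) in zip(post, grid))
mean_sd = sum(w * sd for w, (mu, sd) in zip(post, grid))
print(mean_mu, mean_sd)   # posterior means; the full uncertainty is retained in `post`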
Bayesian Result

Slide 35
Comparing Frequentist and Bayesian Results

Frequentist

Bayesian

Bayesian model better reflects uncertainty in the data

Slide 36
Second Order Probability

• Let us assume someone has smuggled a die out of either Shady Sam’s or Honest
Joe’s, but we do not know which casino it has come from. We wish to determine the
source of the die from (a) a prior belief about where the die is from and (b) data
gained from rolling the die a number of times.
• Assume:  P(Joe) ≡ P(p = 1/6) = 0.7,   P(Sam) ≡ P(p = 1/12) = 0.3
• Data consists of one “6” and nineteen “not 6” results.

[Figure: model results for the casino problem, comparing a fixed value for p with an unknown (second order) value for p]
Slide 37

Using Binomial distribution to solve


Casino problem

Slide 38
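The casino inference can be reproduced with Binomial likelihoods in a few lines of Python; the marginal likelihood it computes (~0.169) appears to be close to the value shown in the figure:

from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

prior = {'Joe (p=1/6)': 0.7, 'Sam (p=1/12)': 0.3}
p_six = {'Joe (p=1/6)': 1/6, 'Sam (p=1/12)': 1/12}
k, n = 1, 20                                     # one "6" and nineteen "not 6"

like = {h: binom_pmf(k, n, p_six[h]) for h in prior}
z = sum(prior[h] * like[h] for h in prior)       # marginal likelihood (~0.169)
posterior = {h: prior[h] * like[h] / z for h in prior}
print(posterior)   # Joe ~0.43, Sam ~0.57 - the data favour Shady Sam despite the prior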
Risk Aggregation
• Sum of a collection of financial assets or events, where
each asset or event is modelled as a variable:
• In cyber security we might estimate the number of
network breaches over a year and, for each breach,
have in mind the severity of loss (in terms of lost
availability, lost data or lost system integrity).
• In insurance we might have a portfolio of insurance
policies and expect a frequency of claims to be made
in each year, with an associated claim total.
• In operational risk we might be able to forecast the
frequency and severity of classes of events and then
wish to aggregate these into a total loss distribution
for all events (the so-called Loss Distribution Approach
(LDA).

Slide 39
Risk Aggregation Example

• Insurance company that has several different classes of risk


portfolios. Each of these has its own different loss distribution.
Suppose these distributions are:
S0, S1,…,Sn

• To compute the company’s total loss distribution, T, we simply


compute the sum:
𝑇 = 𝑆0 + 𝑆1 + ⋯ + 𝑆𝑛

• Where:
𝑆0 = 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙 0.2,100 , 𝑆1 = 𝑁𝑜𝑟𝑚𝑎𝑙 50,100 , 𝑆2 = 𝑈𝑛𝑖𝑓𝑜𝑟𝑚 0,50

𝑆3 = 𝑇𝑁𝑜𝑟𝑚𝑎𝑙 0,1000, 0,50 , 𝑆4 = 𝐺𝑎𝑚𝑚𝑎 1,20 , 𝑆5 = 𝑇𝑟𝑖𝑎𝑛𝑔𝑢𝑙𝑎𝑟(50,70,100)

Slide 40
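A Monte Carlo sketch in Python/NumPy of the aggregation T = S0 + … + S5. Note the assumptions: the second Normal parameter is read as a variance, Gamma(1, 20) is read as shape 1 and scale 20, and the TNormal is crudely approximated by clipping; AgenaRisk computes the sum by dynamic discretization rather than simulation:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
S0 = rng.binomial(100, 0.2, N)
S1 = rng.normal(50, np.sqrt(100), N)          # Normal(mean 50, variance 100)
S2 = rng.uniform(0, 50, N)
S3 = np.clip(rng.normal(0, np.sqrt(1000), N), 0, 50)   # rough stand-in for TNormal(0, 1000, 0, 50)
S4 = rng.gamma(1, 20, N)                      # assumed shape/scale parameterisation
S5 = rng.triangular(50, 70, 100, N)
T = S0 + S1 + S2 + S3 + S4 + S5
print(T.mean(), np.percentile(T, 99))         # mean and 99th percentile of the total loss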
Risk Aggregation

Slide 41
Compound Sum Analysis

• What happens when we have


possibly thousands of events that
we need to aggregate? What do we
do if the frequency of these events
is also unknown?
• Apply compound sum analysis tool
in AgenaRisk
• Example:
Consider where the Frequency of loss,
n, is a Poisson distribution with mean
100 loss events per year and a severity
distribution, S, which is Triangular with
parameters (5, 80, 100)

Slide 42
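A Monte Carlo sketch (Python/NumPy) of the compound sum: Poisson(100) loss events per year, each with a Triangular(5, 80, 100) severity. AgenaRisk's compound sum analysis tool does this by discretization; simulation is used here only to illustrate what is being aggregated:

import numpy as np

rng = np.random.default_rng(0)
N = 20_000
totals = np.empty(N)
for i in range(N):
    n_events = rng.poisson(100)                               # frequency of loss events in the year
    totals[i] = rng.triangular(5, 80, 100, n_events).sum()    # sum of the individual severities
print(totals.mean(), np.percentile(totals, 99.5))             # aggregate loss distribution summaries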
Lesson Summary

• Generally impractical to complete NPTs manually


• Can exploit available functions for different node types to
minimise elicitation of NPT values
• Labelled nodes: can use conditional functions
• Boolean nodes: can use Boolean functions, Noisy-OR etc.
• Ranked nodes: use the various weighted mean type functions
• Can approximate continuous functions using dynamic
discretization
• Bayesian Networks can be used for machine learning from
data
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

Defining the Structure of


Bayesian Networks

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should:
• Understand the difference between goal
oriented and causal reasoning
• Understand what idioms are and how they
work
• Understand how to compose smaller
Learning networks together to make large models
• These skills will be useful when:
Goals • Implementing large scale Bayesian
Networks
• Eliciting knowledge from subject matter
experts
• Thinking about causality
2nd Edition Chapter 8:

Defining the Structure of


Bayesian Networks

Slide 3
Use d-connections for eliciting knowledge?

Causal-Evidential Trail Common Cause Common Effect


(Serial connection) (Diverging connection) (Converging connection)

Using d-connections alone as basis for eliciting and representing knowledge


is insufficient in practice.

Textbook examples usually presented as a fait accompli

Slide 4
Do arc directions represent inference or causality?

Mr Holmes is working at his


office when he received a burglary earthquake
telephone call from Watson who
tells him that Holmes’ burglar
alarm has gone off. Convinced
that a burglar has broken into his
house, Holmes rushes into his car
and heads for home. On his way
he listens to the radio, and in the alarm sounds radio report
news it is reported that there has
been a small earthquake in the
area. Knowing that the
earthquake has a tendency to
turn the burglar alarm on he
returns to his work leaving his Inference direction Causal direction
neighbours the pleasures of the
noise.
Slide 5
Which directions should the arrows point?

• Mathematically there is no temperature


cold hand
reason to chose a) over b). outside

• If we want to reason causally we


must use a).
• After all what would the prior for
cold hands independent of
temperature mean? temperature
cold hand
outside
• Tempting for practitioners to use
b) because this matches the
inference direction. (a) (b)

Slide 6
Cause to effect and effect to cause equivalent

P(T, N) = P(N | T) P(T)        P(T, N) = P(T | N) P(N)

P(T, N = True) = P(N = True | T) P(T)        P(T, N = True) = P(T | N = True) P(N = True)

Train strike model in Chapter 7


Slide 7
• “The syntactical or structural form
peculiar to any language; the genius or
cast of a language.” Webster’s
Dictionary
• Generic types of reasoning pattern to
help:
• Decide which edge direction to choose;
• Determine how knowledge should be
Idioms as represented in the BN
Practical Solution • Developed and applied over many
different domains:
• Legal reasoning
• Medical systems
• Complex engineered systems
• Domain independent and general
purpose
The Idioms
• Cause-consequence idiom
• Measurement idiom
• Definitional/Synthesis idiom
• Induction idiom

An idiom instantiation is an idiom made concrete for a


particular problem, by using meaningful node labels.

Slide 9
Cause Consequence Idiom

• Predict outputs caused by


process transforming or
acting upon inputs
• Organised chronologically
(or at least
contemporaneously)
• The underlying causal
process is not represented
(it’s “in” the probability
table)
• Rules for determining
causality couched in
common sense
Slide 10
Cause Consequence Idiom Examples

Slide 11
Joining Cause Consequence Idioms

Here we are predicting the frequency


of software failures based on
knowledge about problem difficulty
and supplier quality. This process
involves a software supplier
producing a product. A good quality
supplier will be more likely to produce
a failure-free piece of software than a
poor quality supplier. However, the
more difficult the problem to be
solved the more likely it is that faults
may be introduced and the software
fail.

Slide 12
Risk as Cause-Consequence

[Figure: two cause–consequence fragments. Causal view of risk: nodes Drive fast?, Speed warnings?, Crash?, Seat belt?, Make meeting?, Injury?. Causal view of opportunity: nodes Drive fast?, Crash?, Make meeting?, Nerves?, Win contract?]

Slide 13
More Complex Example

Slide 14
Measurement Idiom

• Uncertainty about our own ability to


observe accurately (i.e. define the true
state of something)
• Edge directions are causal
• Actual value exists chronologically before
the assessed estimate is produced by the
measuring instrument
• The accuracy of the assessment process is
represented by some intervening causes
associated with the instrument.
• Same attribute – one estimate of the
other.
• Unobserved variable may be ‘latent’.
Slide 15
Measurement Idioms: Direct and Indirect

Case 1: When the inaccuracy is fixed and Case 2 (Indicators): Only indirect measures are
known because direct measurement is possible
possible

Slide 16
Measurement Idioms: Implicit and Explicit

Slide 17
Data mined
Bayesian Network
model

• Age—This is actually a classification


of the age of the patient (infant,
child, teenager, young adult, etc.)
• Brain scan result—This is a
classification of the types of physical
damage (such as different types of
hemorrhaging) indicated by
shadows on the scan.
• Outcome—This is classified as
death, injury, or life.
• Arterial pressure—The loss of
pressure may indicate internal
bleeding.
• Pupil dilation and response—This
indicates loss of consciousness or
coma state.
• Delay in arrival—Time between the
accident and the patient being
admitted to the hospital.
Common Sense
approach

• The actual injury is a latent variable and


unobservable.
• The doctor runs tests on arterial
pressure, brain scan, and pupil dilation
to determine what the unknown state
of the brain and what the actual injury
might be.
• These are all instantiations of the
special case of the measurement idiom
indicator nodes (where the accuracy is
implicit).
• The outcome influenced by four causes,
including treatment. So this is an
instantiation of the cause–consequence
idiom.
• The actual injury may be a
consequence of age since the severity
or type of injury suffered by the elderly
or infant may differ from that likely to
be suffered by others.
The Cola Test
Model four situations, each with different assumptions:

Assumption 1. Where a
single expert provides their
opinion

Assumption 2. Where a
single expert is allowed to
make three repeated
judgments on the same
product

Presume the product is disguised in some way Slide 20


The Cola Test

Model four situations, each


with different assumptions:

Assumption 3. Where
different independent
experts are used

Assumption 4. Where
different experts are used
but they suffer from the
same inaccuracy, that is,
they are dependent in some
significant and important
way

Presume the product is disguised in some way Slide 21


The Cola Test - Conclusion

If independent scientists use the same flawed procedure or same


data set containing errors as other scientists, then they will arrive
at conclusions that might confirm previous hypotheses.

The claim that such “independent confirmation” strengthens the


hypothesis is vacuous. Some skeptics in the global warming debate
have argued that climate scientists have fallen into this trap.

In general, your statistical alarm bells should ring whenever you


hear statements like “this hypothesis must be true because the
majority of scientists think so.”

Slide 22
Definitional/Synthesis Idiom

• Synthetic nodes represent definitional relations, which may be specified as deterministic


functions or axiomatic relations where we are completely certain of the deterministic
functional relationship between the concepts
• Case 1: Definitional Relationship between Variables

𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒
𝑣𝑒𝑙𝑜𝑐𝑖𝑡𝑦 =
𝑡𝑖𝑚𝑒
• Case 2: Hierarchical Definitions

Slide 23
Definitional/Synthesis Idiom -divorcing

• Case 3: Combining Different Nodes Together to Reduce Effects of Combinatorial Explosion


(“Divorcing”)

• Size of an NPT over m nodes, each with n states, grows as nᵐ


• Solution is to use binary factorization (“divorcing”) to reduce size of state space. This also
reduces cognitive overhead

Slide 24
• The induction idiom is simply a
model of statistical induction to
learn some parameter that
might then be used in some
other BN idiom.
• For example, we might learn the
accuracy of a thermometer and
then use the mean accuracy as a
probability value in some NPT
using the measurement idiom.
• None of the reasoning is
explicitly causal.
• The focus of Bayesian statistics is
induction and inference.

Induction Idiom
Induction Idiom: Model example
Asymmetry: Impossible paths
Person descends a staircase where there is a risk they will slip.
If they slip they fall either forwards or backwards:

Slips Yes No

Forwards 0.1 ??

Backward 0.9 ??

Solution: add explicit “N/A” states to nodes where there


are impossible state combinations:

Slide 27
Asymmetry and Event Trees
• But suppose problem extended:
▪ If the person falls forwards they will be startled (but otherwise unharmed).
▪ If the person falls backwards they might break their fall or not.
▪ If they break their fall they will be bruised, but if not they will suffer a head
injury.
• Can always simply use BN equivalent of event tree.
• Event tree is asymmetric – some future states are unreachable from past states:
Impossible paths!

[Figure: event tree ‘solution’. Slips? No → Outcome: OK. Slips? Yes → Falls Forward → Outcome: Startled; Falls Backward → Breaks Fall? Yes → Outcome: Bruised; Breaks Fall? No → Outcome: Head Injury]
Slide 28
Asymmetry: Bayesian Network Solutions

• BN solution 1:

The NPT needed for the node


Outcome is far from natural. It
has 72 entries and many
correspond to impossible state
combinations.

• BN solution 2:

Even allowing for NA states, this


BN has no single NPT with more
than 9 entries and has a much-
reduced total of 39 entries in all

Slide 29
Mountain pass problem
Mountain pass problem: We want to arrive at an appointment to visit a friend in
the next town. We can either take a car or go by train. The car journey can be
affected by bad weather, which might close the mountain pass through which
the car must travel. The only events that might affect the train journey are
whether the train is running on schedule or not; bad weather at the pass is
irrelevant to the train journey.

Event tree model is


asymmetric in the sense
that only some of the
variables are causally
dependent. The weather
and pass variables are
irrelevant conditional on
the train being taken and
the train running late is
irrelevant conditional on
the car being taken

Slide 30
Mountain pass: obvious BN solutions do not work

Not only does the make appointment While it is possible to ease the
node have many impossible states, but problems associated with the NPT for
the model fails to enforce the crucial make appointment by introducing
assumption that ‘take car’ and ‘take synthetic nodes as shown here the
train’ are mutually exclusive problem of mutual exclusivity is still
alternatives with corresponding not addressed
mutually exclusive paths.
Slide 31
Mountain pass: BN solution

We model the mutually exclusive


transport options as mutually exclusive
states within the same node (Mode of
transport).

Still not perfect as in the NPT we


are forced to declare and consider
all state combinations. This gives
rise to the unfortunate but
necessary duplication of
probabilities to ensure consistency.

Slide 32
Lesson Summary

• Idiom framework helps practitioners build BNs using expertise


and opinions.
• Reused experience from real applications to specify ‘divide
and conquer’ approach to manage complexity.
• Approach relies on using building blocks (idioms) that are
joined together to form the BN.
• Asymmetry problem
Martin Neil &
Norman Fenton

Risk Assessment and


Decision Analysis with
Bayesian Networks

Modeling Operational Risk

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should
understand:
• Complex systems require “safety”
assessment (or equivalent in non safety
critical domains E.g. Finance)
• Basic Concepts: hazards, accidents and
safety integrity
• Modelling techniques: fault tree and
Learning event trees
Goals • Using BNs for modelling financial,
engineering and cyber risks
• You should also understand the use of:
• Use of Boolean logic gates AND, OR
mFROMn
• Simulation of continuous variables by
approximation
• Combine these in hybrid models
2nd Edition Chapter 13:

Modelling Operational Risk

Slide 3
Risk and ‘Systems Perspective’

• Safety is the absence of accidents


– Safety can be ensured by avoiding hazards
– Accident may also be avoided by factors outside the system (luck!)
• Safety is a property of a system in its environment
• System perspective:
– Series of interconnected, interacting parts that together deliver some
desired outcome
– Machines, people, process, culture, and environment
– Need to frame the boundary of system under study
• Nature of hazards of system depends on use and motivation
– Many systems become unsafe if misused (accidents)
– Unsafe if subject to attack (mal intent)

Slide 4
Operational Risk Terminology
Role of Causation

Suppose a driver wishes to quantify the probability that she is involved in a head-on automobile collision in good weather conditions. She determines that a head-on collision must arise from the combination of a skid on a patch of oil, poor driver control and the presence of an oncoming vehicle, whose driver might not be able to take evasive action.

  P(oil patch) = 0.01
  P(skid | oil patch) = 0.1
  P(poor control | skid) = 0.1
  P(presence of vehicle) = 0.5
  P(no evasive action) = 0.5

The probability of the head-on collision is then simply the product of these probabilities:

  P(collision) = 0.01 × 0.1 × 0.1 × 0.5 × 0.5 = 0.000025

Slide 6
Swiss
Cheese
Model for
Rare
Catastrophic
Events

Slide 7
Swiss
Cheese
Model for
Rare
Catastrophic
Events

Slide 8
Bow Tie Model

Slide 9
Fault Tree Example as BN
if(Power_Failure=="True"||
Computer_Failure__OR=="True",
"True", "False") if(CPU=="True"||Digital_I_O=="True",
"True", "False")

mfromn(2,Power_Supply_1=="True",
Power_Supply_2=="True“
Power_Supply_3=="True")

Slide 10
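The gate logic above can be sanity-checked by simulation. The component failure probabilities in this Python sketch are made up (the slide does not give them); only the OR and 2-from-3 structure is taken from the model:

import random

def simulate(p_cpu=0.01, p_io=0.02, p_ps=0.05, trials=200_000):
    failures = 0
    for _ in range(trials):
        cpu = random.random() < p_cpu
        io = random.random() < p_io
        ps = [random.random() < p_ps for _ in range(3)]
        computer_failure = cpu or io                        # OR gate
        power_failure = sum(ps) >= 2                        # 2-from-3 gate (mfromn)
        failures += (power_failure or computer_failure)     # top-level OR gate
    return failures / trials

print(simulate())   # estimated probability of system failure under the assumed component rates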
Fault Tree Example as BN

Slide 11
Common Causes, Diversity and Resilience

Common cause

Design not diverse Slide 12


Event Tree for a
Gas Release Accident

Event Trees
Example: Derailment Events

Slide 14
Event Tree for Derailment Example

Note: Conditional independence:


Slide 15
P(collision) not dependent on P(collapse)
BN for Derailment Event Tree

Slide 16
‘Soft Systems’ Approach

• Alternative to causal analysis that analyses risk based on soft


factors relating to how the system is designed, manufactured,
or used
• Implicitly recognizes key properties of systems that can make
analysis of them difficult or intractable:
– Component parts are partially unknown or changeable.
– The parts are interdependent and the interdependencies might change
over time.
– Processes are elaborate and human oriented.
– Teamwork involving people and machines is needed for success,
rather than simply clockwork functioning.
– The presence of feedback.
• Use a BN to articulate a higher level, less granular analysis
less focused on components and interactions
Slide 17
‘Soft Systems’ Approach

• Alternative to causal analysis that analyses risk based on soft factors


relating to how the system is designed, manufactured, or used
• Implicitly recognizes key properties of systems that can make analysis of
them difficult or intractable:
– Component parts are partially unknown or changeable.
– The parts are interdependent and the interdependencies might change over
time.
– Processes are elaborate and human oriented.
– Teamwork involving people and machines is needed for success, rather than
simply clockwork functioning.
– The presence of feedback.
• Use a BN to articulate a higher level, less granular analysis less focused
on components and interactions
• So-called safety or risk argument

Slide 18
‘Soft Systems’ Approach Example

Slide 19
• What risks can occur?
• Can they occur in my process?
• How rare are they?
Operational • How reliable are our controls?
• How good is our internal and external data?
risk in • What is likely level of losses?
finance • What is worst case scenario?
• How can we improve?
• How much capital should we set aside?
Example financial
accident – rogue
trading

• Rogue trading - Unauthorised trading is


essentially trading outside limits
specified by the bank resulting in an
overexposure to market risk.
• Well known cases:
– Nick Leeson, Barings bank, 1995,
(£827m)
– Jérôme Kerviel, Société Générale,
2008 ($4.9billion)
– Kweku Adoboli , UBS, 2011
($2.3billion)
• The current financial crisis is a form of
‘systemic’ operational risk event

Slide 21
Resiliency Perspective

• Operational risk is faced by all organisations.


– Human error main cause of a catastrophic event
– ....but without latent weaknesses in organisation the event would
not reach catastrophic proportions.
• Operational Risk modelling cannot solely involve the
investigation of statistical phenomena
• Need dynamic (time dependent model) of risk events the
controls process and their interactions, including common
causes and resiliency assumptions

Slide 22
Rogue Trading

Process Controls:
1. Trade request • Front office control environment (the
2. Conduct Trade control environment affects the
probability of unauthorised trading)
3. Registration of Trade
• Back office reconciliation checks
4. Reconciliation check (performed per trade)
5. Settlement and Netting • Market positions and results
6. Entering of Trade on Trading monitoring, Value-At-Risk (VAR)
Books calculation (periodical)
• Audit checks (periodical but not as
often as the market checks)

Slide 23
Rogue Trading States

• “Authorized/OK” trade: An authorized trade permitted and


approved by management beforehand
• “Accidental”: An unauthorized trade that is accidental due
to mistakes by the trader
• “Illegal Fame” trade: An unauthorized trade with the intent
to further the trader’s career, status within the bank, or
size of bonuses.
• “Illegal Fraud” trade: An unauthorized trade with the sole
intent of benefiting the perpetrator.
• “Discovered”: A trade of any unauthorized category is
revealed by a control and is thus a discovered
unauthorized trade

Slide 24
Influencers on Controls

• Failed segregation of duties (between front and back


office)
• Fabricated input data (in the positions monitoring)
• Corrupted trade sampling (in the audit process)
• Alliances between staff, within and without, the institution

Slide 25
Loss Model

Slide 26
Slide 26
Loss Model

E - Events

C - Controls

O – Operational
Failures
D – Dependency
Factors

F – Failures

  P(E, C, O, F, D) = ∏_{t=1}^{T} ∏_{s=1}^{t} ∏_{j=1}^{m} ∏_{i=1}^{n} ∏_{k=1}^{o} P(E_t | E_{t−1}, C_t) P(C_t | O_{C_t}) P(O_j | F_{O_j}, D_{O_j}) P(D_k | O_{C_{t−s}}) P(F_i) P(C_0)
Slide 27
Conditional Probability Tables

  P(E_t | E_{t−1}, C_t)

  P(C_1 = fail | O_1, O_2) = 1 if O_1 ∪ O_2 = fail, 0 otherwise

Slide 28
Executing the Loss Model

Slide 29
Scenarios and Stress Testing

• Explicitly model scenarios that lead to historical or hypothetical


future losses
– Model unexpected Vs expected losses
– Use “mixture” of internal and external loss data

[Figure: loss distribution P(Loss) against Loss ($), marking the median loss, the 99th percentile (1 in 100 year loss), and the regions of expected and unexpected losses]
Slide 30


Scenarios and Stress Testing

• Consider a scenario comprising of two events: market crash, 𝑀, and a rogue trading,
𝑅, event.
• Each of these is judged to have a percentage probability of occurring of 10% and 3%
respectively.
• For each of these we have discrete Boolean nodes in our BN and associated with
each state must assign a loss distribution, 𝐿𝑀 and 𝐿𝑅 .
• For 𝐿𝑀 this might be:
𝑃(𝐿𝑀 |𝑀 = 𝐹𝑎𝑙𝑠𝑒)~𝑁(10,100)
𝑃(𝐿𝑀 |𝑀 = 𝑇𝑟𝑢𝑒)~𝑁(500,10000)
• For 𝐿𝑅 this might be:
𝑃(𝐿𝑅 |𝑅 = 𝐹𝑎𝑙𝑠𝑒)~0
𝑃(𝐿𝑅 |𝑅 = 𝑇𝑟𝑢𝑒)~𝑁(250,10000)

• Interested in setting aside capital to cover our risk (a regulatory requirement)
• Compute total losses:
  L = L_R + L_M
Slide 31
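A Monte Carlo sketch (Python/NumPy) of this two-event scenario; the 99.5th percentiles it produces are close to the $667m, $347m and $687m figures on the next slide (the book obtains them by dynamic discretization rather than simulation, and the second Normal parameter is read as a variance):

import numpy as np

rng = np.random.default_rng(0)
N = 500_000
M = rng.random(N) < 0.10                  # market crash event
R = rng.random(N) < 0.03                  # rogue trading event
L_M = np.where(M, rng.normal(500, np.sqrt(10000), N), rng.normal(10, np.sqrt(100), N))
L_R = np.where(R, rng.normal(250, np.sqrt(10000), N), 0.0)
total = L_M + L_R
print(np.percentile(L_M, 99.5), np.percentile(L_R, 99.5), np.percentile(total, 99.5))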
Scenarios and Stress Testing

• The 99.5% value at risk is the 99.5th percentile:
  • For the market crash event it is $667m
  • For the rogue trading event it is $347m
  • For the total loss it is $687m

Slide 32
Stress Testing – Tail Analysis

• A bank needs to predict the future costs of some loan based on the yearly interest rate charged by another bank to provide this loan.
• However, if interest rates go beyond the stress level, is the loan affordable, and might the bank default?
• Assumptions:
  – The capital sum, which we will treat as a constant, is $100m
  – The stress interest rate is X ≥ 4
  – Total interest charged over 10 time periods is Y = 100(1 + X)^10 − 100
  – Historical interest rates follow a heavy-tailed distribution (overleaf)

Slide 33
Historical interest rates

X ~ LogNormal(μ = 0.05, σ = 0.25)


Slide 34
Conditioning on the Tail

f ( X  4, Y )
f (Y | X  4 ) =  dX ...looks tricky
X f ( X  4)

Using dynamic discretization and declare discrete variable

1 if X  4
P ( Z = true) = 
0 if X  4

Result is: E (Y | X  4) = 60
Slide 35
Cyber Security
Modelling
• Cyber security analysis involves the
modelling of vulnerabilities in an
organisation’s information infrastructure
that might be maliciously attacked,
compromised and exploited by external
or internal agents
• Features:
– Physical assets (networks,
servers, mobile phones, etc.),
people
– Processes and procedures that
might contain security ‘holes’,
bugs, unpatched/un-updated
systems
– Other features that might present
themselves as vulnerabilities

Slide 36

This Photo by Unknown Author is licensed under CC BY-SA


BN Cyber Modelling

• Improve technical and economic decision support, by:


– Allowing simulation of future attacks and defences
– Supporting diagnosis of unknowns for knowns
– Providing means for cost–benefit analysis
– Aligning notations with system of systems descriptions
– Defining set of attackers, vulnerabilities, control and their
– Properties and relationships
– Recognising trust boundaries and scope
• Support improved modelling, by:
– Providing reusable template models
– Enabling specialization from taxonomies
– Providing means for refinement of abstract or concrete
– entities
– Handling of recursion (BNs link to other BNs in a ‘chain’)

Slide 37
Cyber Terminology

• Entity – A logical or physical part of the system required for normal


service (e.g. organisation, process, actor, system, subsystem,
function)
• Attack – An attempt to misuse an entity by exploiting an entity’s
vulnerability
• Capability – The wherewithal to carry out an attack on a specific
entity’s vulnerability
• Vulnerability – A weakness in an entity that, when exploited by an
attacker, leads to a compromise
• Control – A function designed to deter, prevent or protect against
attack or mitigate a consequence of an attack (negates vulnerability)
• Compromise – An entity whose actions/use is controlled by an
attacker that acts as a pre-condition enabling attacks on other entities
or leads to a loss
• Breach – Actual loss of availability, confidentiality or integrity (i.e.
compromise at systems boundary)
Slide 38
Cyber
Kill-Chain
(Graph)
Lesson Summary

• Presented operational risk for multiple domains: financial,


engineered systems and cyber
• Key to success is systems perspective
• Role of prevention and detection is clear
• Catastrophic failures and the Swiss Cheese model
• BNs can replace a plethora of techniques
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

Building and Using Bayesian


Networks for Legal Reasoning

Copyright Martin Neil and Norman Fenton 2019


• At the end of this lecture you should
understand:
• Use of statistics and probability in the
law
• Application of Bayes in forensic and
legal reasoning
Learning
Goals
• That Bayesian reasoning can help experts formulate accurate and informative opinions
• How it can help the court in determining the admissibility of evidence
• How BNs can provide a rigorous method for combining such evidence
2nd Edition Chapter 15:

The Role of Bayes in Forensic


and Legal Evidence
Presentation

Slide 3
In 1964, Malcolm and A maths instructor assigned
Janet Collins were the following approximate
convicted of mugging estimates to the probability of
the particular characteristics:
People v and robbing an elderly
woman in an alleyway in • Yellow car: 1/10
Collins Los Angeles. The victim
had described her
• Man with moustache: 1/4
• Woman with ponytail: 1/10
(1964–68) assailant as a young
blonde woman, and • Woman with blonde hair:
another witness saw a 1/3
blonde woman with a
• Black man with beard: 1/10
ponytail run out of the
alley and jump into a • Interracial couple in car:
waiting yellow car driven 1/1000
by a black man with a
moustache and beard.
Use product rule to get 1 in
The Collinses were a
12 million chance that the
local couple who
couple were innocent
“matched” these various
characteristics.

Slide 4
Revising beliefs when you get forensic ‘match’
evidence

• Fred is one of a number of men who were at the scene of the


crime. The (prior) probability he committed the crime is the
same probability as the other men.
• We discover the criminal’s shoe size was 13 – a size found
nationally in about only 1 in a 100 men. Fred is size 13.
• Clearly our belief in Fred’s innocence decreases. But what is
the probability now?
• Are these statements correct/ equivalent?
• The probability of finding this evidence (matching shoe size) given
the defendant is innocent is 1 in 100
• Prosecution claim that the probability the defendant is innocent
given this evidence is 1 in 100
Fred has size 13
Fred has size 13

Imagine 1,000
other people
also at scene
Fred has size 13

About 10
out of the
1,000 people
have size 13
Fred is one of
11 with size 13

So there is a
10/11 chance
that Fred is NOT
guilty

That’s very
different from
the prosecution
claim of 1%
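The same argument as a short Bayes calculation in Python: a prior of 1/1001 for Fred (one of 1,001 people at the scene) and a 1-in-100 chance of a size-13 match for anyone other than the criminal:

prior_fred = 1 / 1001                 # Fred plus 1,000 other people at the scene
p_match_given_guilty = 1.0            # the criminal certainly matches the size-13 evidence
p_match_given_innocent = 0.01         # 1 in 100 men have size 13

posterior_fred = (p_match_given_guilty * prior_fred) / (
    p_match_given_guilty * prior_fred + p_match_given_innocent * (1 - prior_fred))
print(posterior_fred)   # ~0.09, i.e. about a 10/11 chance that Fred is NOT guilty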
Prosecutor’s and Defendant’s
Fallacies
• “Suppose a crime has been committed. DNA found at the scene matches the
defendant. It is of a type which is present in 1 in a 1000 people.”
• Prosecutor’s fallacy
– “There is a 1 in a 1000 chance that the defendant would have the DNA
match if he were innocent. Thus, there is a 99.9% chance that he is guilty.”
– This simply (and wrongly) assumes P(H | E) = P(E | H) and also ignores the
prior P(H)
• Defendant’s fallacy
– “This crime occurred in a city of 8,000,000 people. Hence, this blood type
would be found in approximately 8,000 people. The evidence has provided
a probability of 1 in 8,000 that the defendant is guilty and thus has no
relevance.”
– This provides a correct posterior P(H | E) assuming prior P(H) = 1 in
8,000,000 but ignores the change in the posterior from the prior.
What have the law and Bayes’ in common?

• The central idea of Bayes’ theorem – that we start with a


prior belief about the probability of an unknown
hypothesis and revise our belief about it once we see
evidence – is also the central concept of the law.
• Proper use of Bayesian reasoning has the potential to
improve dramatically the efficiency, transparency and
fairness of the criminal justice system.
• Bayesian reasoning can help
– The expert witness in formulating accurate and informative opinions
– Identify which cases should and should not be pursued
– Lawyers explain, and jurors to evaluate, the weight of evidence during a
trial.

Slide 11
Example: Legal Reasoning
– The Birmingham Six

• Six people were convicted of IRA bombing on the basis of


the following evidence:
• Each had traces of nitro-glycerine (NITRO) on their hands
• Expert scientific testimony claimed this gave overwhelming
support for claim that they had handled high explosives
(HE)
• Experts' testimony coded as: P(E = NITRO | H = HE) = 0.99
• And taken to mean:

  P(Guilty | E = NITRO) = P(E = NITRO | H = HE) = 0.99

Slide 12
Example: Legal Reasoning
– The Birmingham Six
• Subsequent investigation showed that Nitro-glycerine traces
could be deposited by many common materials, including
playing cards. Roughly 50% of the population had such
traces. Therefore assume P(E = NITRO | H = ¬HE) = 0.5
• Assume the prior is P(H = HE) = 0.05
• What is P(H = HE | E = NITRO)?

  P(H = HE | E = NITRO)
    = P(E = NITRO | H = HE) P(H = HE) / [ P(E = NITRO | H = HE) P(H = HE) + P(E = NITRO | H = ¬HE) P(H = ¬HE) ]
    = (0.99 × 0.05) / (0.99 × 0.05 + 0.5 × 0.95)
    = 0.0495 / (0.0495 + 0.4750)
    = 0.0943

• Prior is P(H = HE) = 0.05. Posterior is 0.0943
• Is this enough to convict?
Slide 13
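A quick check of this posterior in Python:

prior = 0.05
p_e_given_h = 0.99
p_e_given_not_h = 0.5
posterior = p_e_given_h * prior / (p_e_given_h * prior + p_e_given_not_h * (1 - prior))
print(posterior)   # ~0.0943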
The Likelihood Ratio

• We use Bayes' Theorem because we want to update our prior beliefs in a hypothesis when we observe evidence:
  – P(E | H) – the probability of evidence E if H is true
  – P(E | not H) – the probability of evidence E if H is false

  LR = P(E | H) / P(E | not H)   – the Likelihood Ratio (LR)

• LR is often used as a measure of ‘probative value’ of evidence:
  – LR > 1 – the evidence E is more likely under H than not H
  – LR < 1 – the evidence E is less likely under H than not H
  – LR = 1 – the evidence E is equally likely under H and not H
• Barry George case:
  – P(E = gunpowder residue from gun | guilty) = 0.01
  – P(E = gunpowder residue from gun | not guilty) = 0.01
  – LR = 1
Slide 14
Fundamentals

The hypothesis H that the defendant is or is not guilty is often


referred to by lawyers as the ultimate hypothesis. In general a
legal case may consistent of many additional hypotheses.

The direction of the causal structure makes sense


here because the defendant’s guilt (innocence)
increases (decreases) the probability of finding
incriminating evidence.
Slide 15
DNA Evidence

Assume island of 10,000 people


Random Match Probability (RMP)

Slide 16
DNA Profiles

• DNA is formed from four chemical “bases” that bind together in pairs, called
“base pairs.” Each person’s DNA contains millions of such base pairs.
• A person’s DNA profile is determined by analyzing just a small number of
regions of the DNA, known as loci or markers. At each locus there are two
alleles, one inherited from mother and one from the father:

• Each genotype is shared by a proportion of the population


• UK, the DNA-17 system (17 loci) is used

Slide 17
DNA Profiles

• The probability an arbitrarily selected unrelated person has the same


profile as Fred (on these four loci) is:

0.09 × 0.14 × 0.07 × 0.05 = 0.0000441 = 0.00441%


• This is the Random Match Probability (RMP)
• With a 17-loci system, if we have a person's full 17-loci profile then the RMP for a DNA profile is incredibly low (assuming each locus probability is 10%): 10⁻¹⁷
• The number of people on the planet is about 8 billion, which suggests a full DNA profile is effectively unique.
Slide 18
DNA Profiles - Issues

• The more closely two people are related, the more likely they are to
share genotypes.
• Getting a ‘reference’ sample from suspect is usually straightforward
but ‘questioned’ sample from crime scene can be more difficult
because it may be small or degraded (so-called ‘low template’ DNA
samples)
• How much of the original DNA sample can be determined from the
low template DNA sample?
– Suppose only two loci identifiable for Fred, D21 and vW1 with reference probabilities 0.09 and
0.14
» Fred’s RMP will be 0.09 × 0.14 = 0.0126 = 1.26%
» A DNA match is only approx. 1 in 100!
• Even more problematic when we have mixed samples from more than
one person
• Also judgements about the peaks are made by forensic specialists
and algorithms
• Idea of a match is actually quite complex! Slide 19
DNA Profiles - Issues

Even the case of a “single


piece of DNA evidence” could
more realistically be modelled
as a BN model

Slide 20
The Case of R v Sally Clark (1998–2003)

In 1999, Sally Clark was convicted of the murder of her two young
children who had died one year apart. The prosecution case relied
partly on flawed statistical evidence presented by paediatrician
Professor Sir Roy Meadow to counter the hypothesis that the children
had died as a result of Sudden Infant Death Syndrome (SIDS) rather
than murder. He asserted that there was a 1 in 73 million probability
of two SIDS deaths in the same family.

• Assumes two deaths would be independent events, and hence that the
  assumed probability of 1/8500 for a single SIDS death could be
  multiplied by 1/8500.
• This (very low) probability is assumed to be equivalent to the
  probability of Sally Clark’s innocence (prosecutor’s fallacy).
• The (prior) probability of a SIDS death was considered in isolation,
  without comparing it with the (prior) probability of the proposed
  alternative, namely of a child being murdered by a parent.

Slide 21
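
A minimal Python sketch of why even a correct 1-in-73-million figure would not be the probability of innocence: the proper Bayesian comparison is between the prior probabilities of the two competing explanations for the same evidence (two unexplained infant deaths in one family), i.e. double SIDS versus double murder. The double-murder prior below is purely illustrative and is not a figure from the case.

p_double_sids   = 1 / 73_000_000        # prosecution's (flawed) figure from the trial
p_double_murder = 1 / 2_000_000_000     # illustrative prior for the alternative explanation

# Given that two unexplained deaths occurred, posterior probability of the SIDS explanation
p_sids_given_deaths = p_double_sids / (p_double_sids + p_double_murder)
print(p_sids_given_deaths)              # ~0.97, nothing like 1 in 73 million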
Sally Clark Model – Original Trial

Slide 22
Sally Clark Model – Re-Trial

Signs of Disease for


each child – “Yes”

Slide 23
2nd Edition Chapter 16:

Building and Using Bayesian


Networks for Legal Reasoning

Slide 24
Legal Arguments

• Practical legal arguments normally involve multiple pieces


of evidence
• Uncertainty unavoidable
• Inference about unknown event (guilt or innocence)
• Complex causal dependencies
• Use BN fragments to model these arguments and their
dependencies
• Based on a very small number of special cases of the
idioms

Slide 25
.. this is a typical real legal BN
Evidence idiom

• Corroboration pattern - E1 and E2 that both support one


side of the argument
• Conflict pattern - E1 and E2 with one supporting the
prosecution and the other supporting the defence
Slide 27
Example of Evidence Idiom: Regina Vs Adams

In the case of R v Adams (convicted of rape) a DNA match was the only
prosecution evidence against the defendant, but it was also the only
evidence that was presented in probabilistic terms in the original
trial, even though there was actually great uncertainty about the
value (the match probability was disputed, but it was accepted to be
between 1 in 2 million and 1 in 200 million).

This had a powerful impact on the jury. The other two pieces of
evidence favoured the defendant – failure of the victim to identify
Adams and an unchallenged alibi.

In the Appeal the defence argued that it was wrong to consider the
impact of the DNA probabilistic evidence alone without combining it
with the other evidence.
Example Evidence Idiom

Slide 29
Example Evidence Idiom

DNA Evidence alone Other evidence included

Slide 30
Evidence Accuracy Idiom

Slide 31
Evidence Accuracy Idiom DNA Example

• Let’s look at some DNA evidence assumptions


– The blood tested really was that found at the scene of the crime.
– None of the samples became contaminated at any time
– The DNA testing is perfect, in particular there is no possibility of wrongly
finding a match (note, this is very different to the assumption inherent in
the random match probability)
– The person presenting the DNA evidence in court does so in a completely
truthful and accurate way.
• If any of the above is uncertain (which may be the case
even for DNA evidence) then the reported evidence of a
DNA blood match cannot simply be accepted as true or
false unconditionally.
• Need to condition evidence on its accuracy.

Slide 32
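
A minimal Python sketch (with assumed illustrative numbers) of what conditioning DNA evidence on its accuracy does: once the chance of contamination, mislabelling or misreporting is non-zero, the effective likelihood ratio of a reported match is driven by the error probability, not by the tiny random match probability.

rmp         = 1e-9     # P(true match | not the source) -- random match probability
p_accurate  = 0.999    # P(evidence chain is accurate) -- assumed
p_false_rep = 0.001    # P(match reported | chain inaccurate) -- assumed

def p_match_reported(is_source):
    # Reported match is conditioned on the accuracy of the evidence chain
    p_true_match = 1.0 if is_source else rmp
    return p_accurate * p_true_match + (1 - p_accurate) * p_false_rep

lr = p_match_reported(True) / p_match_reported(False)
print(lr)    # ~1e6 here, far smaller than the naive 1/rmp = 1e9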
Evidence Accuracy Idiom DNA Example

Slide 33
Idioms to deal with Motive and Opportunity

• Motive: There is a widespread acceptance within the


police and legal community that a crime normally requires
a motive (this covers the notions of ‘intention’ and
‘premeditation’).
• Opportunity: When lawyers refer to ‘opportunity’ for a
crime they actually mean a necessary requirement for the
defendant’s guilt.

Slide 34
Idiom for Dependency Between
Different Types of Evidence

• In the case of a hypothesis with multiple pieces of


evidence we have so far assumed that the pieces of
evidence were independent (conditional on H).
• Confirmation bias:
– Two experts determining whether there is a forensic match. It has been
shown that the second expert’s conclusion will be biased if he/she knows
the conclusion of the first expert.
– Expert examining poor quality photograph of number plate and being
shown the suspect’s actual number plate
• Information dependence between different sources of
evidence
– Witness statements
– Surveillance camera evidence

Slide 35
Camera Dependence

Suppose, for example, that the two pieces of evidence for ‘defendant present at scene’ were images from two
video cameras. If the cameras were of the same make and were pointing at the same spot then there is clear
dependency between the two pieces of evidence: if we know that one of the cameras captures an image of a
person matching the defendant, there is clearly a very high chance that the same will be true of the other
camera, irrespective of whether the defendant really was or was not present. Conversely, if one of the
cameras does not capture such an image, there is clearly a very high chance that the same will be true of the
other camera, irrespective of whether the defendant really was or was not present.

Slide 36
Alibi evidence

• Alibi evidence is simply evidence that directly contradicts


a prosecution hypothesis.
• For example an eyewitness statement contradicting the
hypothesis that the defendant was present at the scene of
the crime, normally by asserting that the defendant was in
a different location.

H2 influences A1 Slide 37
Explaining Away Idiom
• The explaining away idiom is a simple common consequence structure
• Problem of two separate mutually exclusive causal pathways

Enforces mutual exclusivity
Slide 38
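
A minimal Python sketch of the explaining away effect, computed by brute-force enumeration. The two causes here are merely competing rather than strictly mutually exclusive (the idiom's constraint node is omitted), and all numbers are illustrative, but the qualitative behaviour is the same: observing the evidence raises belief in both causes, and then learning that one cause is true pushes belief in the other back down.

from itertools import product

p_a, p_b = 0.1, 0.1                          # priors for the two causes

def p_e_given(a, b):                         # assumed CPT for the common consequence
    return 0.9 if (a or b) else 0.01

def posterior_b(evidence_a=None):
    # P(B | E [, A]) by summing the joint over the unobserved variables
    num = den = 0.0
    for a, b in product([True, False], repeat=2):
        if evidence_a is not None and a != evidence_a:
            continue
        joint = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b) * p_e_given(a, b)
        den += joint
        if b:
            num += joint
    return num / den

print(posterior_b())                   # P(B | E) ~ 0.50, up from the 0.1 prior
print(posterior_b(evidence_a=True))    # P(B | E, A) ~ 0.10 -- A 'explains away' E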
Agatha Christie’s play
Witness for the Prosecution
Tyrone Power, Marlene Dietrich, Charles Laughton
Academy Award Best Picture Winner 1957 Slide 39
Incriminating Evidence

Leonard Vole is charged with murdering a rich elderly lady, Miss


French. He had befriended her, and visited her regularly at her
home, including the night of her death. Miss French had recently
changed her will, leaving Vole all her money. She died from a blow
to the back of the head.

There were various pieces of incriminating evidence: Vole was poor


and looking for work; he had visited a travel agent to enquire about
luxury cruises soon after Miss French had changed her will; the
maid claimed that Vole was with Miss French shortly before she
was killed; the murderer did not force entry into the house; Vole had
blood stains on his cuffs that matched Miss French’s blood type.

Slide 40
Romaine, the Loyal Wife

There were also several pieces of exonerating evidence: the maid admitted that
she disliked Vole; the maid was previously the sole beneficiary of Miss French’s
will; Vole’s blood type was the same as Miss French’s, and thus also matched
the blood found on his cuffs; Vole claimed that he had cut his wrist slicing ham;
Vole had a scar on his wrist to back this claim.

There was one other critical piece of defence evidence: Vole’s wife, Romaine,
was to testify that Vole had returned home at 9.30pm. This would place him far
away from the crime scene at the time of Miss French’s death. Slide 41
Romaine – Witness for the Prosecution!

Then during the trial Romaine was called as a witness for the
prosecution. Dramatically, she changed her story and testified
that Vole had returned home at 10.10pm, with blood on his
cuffs, and had proclaimed: ‘I’ve killed her’!

Slide 42
Scandal! Communist Overseas Lover

Just as the case looked hopeless for Vole, a mystery woman supplied the defence
lawyer with a bundle of letters. Allegedly these were written by Romaine to her
overseas lover (who was a communist!). In one letter she planned to fabricate her
testimony in order to incriminate Vole, and rejoin her lover.

This new evidence had a powerful impact on the judge and jury. The key witness
for the prosecution was discredited, and Vole was acquitted.
Slide 43
The dénouement.....

After the court case, Romaine revealed to the defence lawyer that she had
forged the letters herself. There was no lover overseas.

She reasoned that the jury would have dismissed a simple alibi from a
devoted wife; instead, they could be swung by the striking discredit of the
prosecution’s key witness!

Slide 44
Agatha Christie’s play
Witness for the Prosecution

Slide 45
Agatha Christie’s play
Witness for the Prosecution

Slide 46
Will Bayes be Accepted in the Law?
Lesson Summary

• BNs provide a natural way to reason about legal evidence


because it easily exposes, and hence helps avoid, many common
fallacies in legal reasoning
• Any serious attempt to use Bayes in the law requires BN
arguments, which can be built up using a small set of BN idioms
• Opportunity to rigorously and accurately combine the impact of
multiple types of evidence in a case.
• Demonstrated that many complex cases can be represented in
rudimentary BN models
Martin Neil &
Norman Fenton

Risk Assessment and Decision


Analysis with Bayesian Networks

System Reliability Modeling

Copyright Martin Neil and Norman Fenton 2019


• Methods for predicting system reliability
covering hardware, software and complex
systems
• Predict system reliability from component
reliabilities and knowledge of their
interactions and data gathered during test
Learning and/or operation
Goals • Types of problem covered:
• Probability of failure on demand (discrete
use)
• Time to failure of (continuous use) system
• Dynamic Fault Trees
• Software Defect Prediction
2nd Edition Chapter 14:

System Reliability Modeling

Slide 3
Estimating Reliability

• Assume a discrete use system (e.g. Missile launch)


• Interested in the probability of failure on demand
(𝑝𝑓𝑑)
• Estimate the 𝑝𝑓𝑑 from some test data gathered from
trials or field use.
• Use observed number of failures, 𝑓 , from an
observed number 𝑛 trials (demands on the system)
to estimate 𝑝𝑓𝑑.

Slide 4
Discrete Reliability Modeling

• Model assumptions:
𝑝𝑓𝑑~𝐵𝑒𝑡𝑎(𝛼, 𝛽, 0,1)
𝑓~𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝𝑓𝑑)
𝑛~𝑈𝑛𝑖𝑓𝑜𝑟𝑚(1,10000)

• Reliability target is 0.01:


IF(pfd < 0.01,"True","False")
• Structure:

Slide 5
Beta Priors for 𝑝𝑓𝑑

𝑝𝑓𝑑~𝐵𝑒𝑡𝑎(𝛼, 𝛽, 0,1)

𝛼 is the number of
“past” successes

𝛽 is the number of
“past” failures

Reflects confidence in
people and processes

Slide 6
Discrete Reliability Example

• Tests, 𝑛 = 225, zero failures observed, 𝑓 = 0


• What is 𝑝𝑓𝑑 and is the target requirement met?

Slide 7
Confidence in Target
Reliability target (𝑝𝑓𝑑 < 0.01): True 90%, False 10%

𝑃(𝑝𝑓𝑑 < 0.01) = 0.9


Slide 8
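
A minimal scipy sketch reproducing the calculation behind the 90% figure above, assuming a uniform Beta(1, 1) prior on pfd and the standard Beta–Binomial conjugate update (here the first parameter counts observed failures, since pfd is the probability of failure on demand).

from scipy.stats import beta

alpha0, beta0 = 1, 1            # uniform prior on pfd -- assumption
n, f = 225, 0                   # 225 demands, zero failures (as on the slide)

alpha_post = alpha0 + f         # posterior is Beta(1, 226)
beta_post  = beta0 + n - f

print(beta.cdf(0.01, alpha_post, beta_post))   # P(pfd < 0.01) ~ 0.90
print(beta.mean(alpha_post, beta_post))        # posterior mean pfd ~ 0.0044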
Continuous use system

• Various reliability measures:


– Probability of failure within specified time
– Mean Time to Failure (MTTF)
– Mean Time between Failures (MTBF)
• Focus on learning TTF from test or field data
• Determine the reliability, 𝑃(𝑆 = 𝑓𝑎𝑖𝑙), for a given
operating time target, 𝑡, in terms of a time to failure
distribution (TTF), 𝜏𝑆:

      𝑃(𝑆 = 𝑓𝑎𝑖𝑙) = 𝑃(𝜏𝑆 ≤ 𝑡) = ∫₀ᵗ 𝑓𝜏𝑆(𝑢) 𝑑𝑢

Slide 9
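
A minimal Python sketch of the reliability formula above for an exponential TTF distribution, evaluating P(τS ≤ t) both by numerically integrating the density and in closed form. The failure rate and target time are illustrative.

import numpy as np
from scipy.integrate import quad

lam, t = 1e-3, 100.0                        # illustrative failure rate and target time

pdf = lambda u: lam * np.exp(-lam * u)      # f_tau(u) for an exponential TTF
p_fail_numeric, _ = quad(pdf, 0, t)         # integral of the density from 0 to t
p_fail_closed = 1 - np.exp(-lam * t)        # closed form for the exponential case

print(p_fail_numeric, p_fail_closed)        # both ~ 0.095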
Censoring

• Right Censoring – The system under study has survived longer than
  the time available to observe it. Therefore our data is not a
  specific TTF value but an observation that 𝑃(𝜏𝑆 > 𝑡) = 1; that is,
  the actual failure time is missing.
• Type II Censoring – The system has failed at some point in time
  before we inspect or observe it, thus all we know is that
  𝑃(𝜏𝑆 < 𝑡) = 1
• Mixed Censoring – we only know the interval during which the system
  failed. This is simply an observation that 𝑃(𝑡1 ≤ 𝜏𝑆 ≤ 𝑡2) = 1,
  where 𝑡1 < 𝑡2

Slide 10
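
A minimal Python sketch of how each kind of censored observation enters the likelihood, assuming an exponential TTF with rate lambda: exact failures contribute the density, right-censored observations contribute the survival probability P(τS > t), and interval observations contribute P(t1 < τS < t2). The data below are illustrative.

import numpy as np

def log_likelihood(lam, failures, right_censored, intervals):
    # failures: exact failure times; right_censored: times survived without failure;
    # intervals: (t1, t2) pairs within which the failure is known to have occurred
    ll  = np.sum(np.log(lam) - lam * np.asarray(failures))        # density terms
    ll += np.sum(-lam * np.asarray(right_censored))               # P(tau > t) terms
    for t1, t2 in intervals:                                      # P(t1 < tau < t2) terms
        ll += np.log(np.exp(-lam * t1) - np.exp(-lam * t2))
    return ll

print(log_likelihood(0.01, failures=[80, 120], right_censored=[200], intervals=[(50, 150)]))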
Challenging TTF estimation problem
with imperfect data

• Estimate TTF for a single system class


• Handle censored data for one class of system
• Estimate TTF for the super-class of all systems
• Predict TTF and reliability for future, new, system

Solution
• Model TTF for each system from failure data
• Handle observations as censored (or not)
• Use hierarchical model with hyper-parameters to mix
super-class and individual systems

Slide 11
Hierarchical TTF model:

  𝛼 ~ 𝑇𝑟𝑖𝑎𝑛𝑔𝑢𝑙𝑎𝑟(0, 1, 10)        log10(𝛽) ~ 𝑇𝑟𝑖𝑎𝑛𝑔𝑢𝑙𝑎𝑟(−6, −3, −1)

  {𝜆𝑖}, 𝑖 = 1, . . . , 5 ~ 𝐺𝑎𝑚𝑚𝑎(𝛼, 𝛽)

  𝑡𝑖𝑗 ~ exp(𝜆𝑖), 𝑗 = 1, . . . , 𝑛𝑖, 𝑖 = 1, . . . , 5

  Reliability target nodes: IF(ttf > 100,"True","False") and
  IF(ttf > 200,"True","False")

Slide 12
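
A minimal NumPy Monte Carlo sketch of the hierarchical model above, sampling forward from the hyperpriors. Treating β as the Gamma scale parameter is an assumption (the shape/scale versus shape/rate convention is not stated here), and the 100-time-unit target is evaluated simply as the fraction of sampled TTFs exceeding it.

import numpy as np

rng = np.random.default_rng(42)
n_samples, n_systems = 10_000, 5

alpha      = rng.triangular(0, 1, 10, size=n_samples)             # shape hyperprior
beta_scale = 10 ** rng.triangular(-6, -3, -1, size=n_samples)     # log10(beta) hyperprior

# One failure rate per system, drawn from the shared Gamma hyper-distribution
lam = rng.gamma(shape=alpha[:, None], scale=beta_scale[:, None],
                size=(n_samples, n_systems))
lam = np.maximum(lam, 1e-30)              # guard against numerical underflow to zero

# Times to failure are exponential with each system's sampled rate
ttf = rng.exponential(scale=1 / lam)

print(np.mean(ttf > 100))                 # prior probability that a TTF exceeds 100 time units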
Dynamic Fault Trees

• Fault Tree Gates


AND gate
  Output fails only if ALL input components fail

OR gate
  Output fails if one or more input components fail

Cold Standby (CSP) gate
  The spare components never fail (the hazard rate is zero) when in
  standby mode

Warm Standby (WSP) gate
  The hazard rate of the spare components is less in standby mode
  than in active mode

Priority AND (PAND) gate
  The output will fail if all of its input components fail in a
  predefined order (left to right)

Slide 13
Dynamic Fault Trees

• TTF of Fault Tree Gates

AND gate:
  𝜏𝐴𝑁𝐷 = max𝑖 𝜏𝑖

OR gate:
  𝜏𝑂𝑅 = min𝑖 𝜏𝑖

CSP gate:
  𝜏𝐶𝑆𝑃 = 𝜏𝑚𝑎𝑖𝑛 + 𝜏𝑆𝑃1 + ⋯ + 𝜏𝑆𝑃𝑛

WSP gate:
  𝜏𝑊𝑆𝑃 = 𝜏main                 if 𝜏spare^sb < 𝜏main
  𝜏𝑊𝑆𝑃 = 𝜏main + 𝜏spare^act    if 𝜏spare^sb > 𝜏main

PAND gate:
  𝜏𝑃𝐴𝑁𝐷 = 𝜏2   if 𝜏1 < 𝜏2
  𝜏𝑃𝐴𝑁𝐷 = ∞    otherwise

Slide 14
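
A minimal Python sketch of the gate relations above, written as functions that take sampled failure times as inputs and so can be used directly inside a Monte Carlo simulation of a dynamic fault tree. The function names are mine.

import math

def and_gate(*taus):                 # fails only when the last input fails
    return max(taus)

def or_gate(*taus):                  # fails as soon as any input fails
    return min(taus)

def csp_gate(tau_main, *tau_spares): # cold spares cannot fail while in standby
    return tau_main + sum(tau_spares)

def wsp_gate(tau_main, tau_spare_sb, tau_spare_act):
    # If the spare fails (at its standby rate) before the main unit, the gate fails
    # with the main unit; otherwise the spare takes over and fails at its active rate
    return tau_main if tau_spare_sb < tau_main else tau_main + tau_spare_act

def pand_gate(tau_1, tau_2):         # fails only if the inputs fail in order (1 before 2)
    return tau_2 if tau_1 < tau_2 else math.inf

print(pand_gate(3.0, 5.0), pand_gate(5.0, 3.0))   # 5.0 inf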
HCAS system example
IF(𝜏𝐶𝑃𝑈𝑇 < 100, "Fail", "On")

𝜏𝐶𝑃𝑈 = 𝜏𝑃              if 𝜏𝐵^sb < 𝜏𝑃
𝜏𝐶𝑃𝑈 = 𝜏𝑃 + 𝜏𝐵^act     if 𝜏𝐵^sb > 𝜏𝑃

𝜏𝑇 = min(𝜏𝐶𝑆, 𝜏𝑆𝑆)

𝜏𝐶𝑃𝑈𝑇 = min(𝜏𝐶𝑃𝑈, 𝜏𝑇)

Slide 15
HCAS example solution

𝜏𝐶𝑆 ~ 𝑒𝑥𝑝(10^-6)          𝜏𝑆𝑆 ~ 𝑒𝑥𝑝(10^-6)
𝜏𝑃 ~ 𝑒𝑥𝑝(2 × 10^-6)
𝜏𝐵^standby ~ 𝑒𝑥𝑝(10^-6)    𝜏𝐵^active ~ 𝑒𝑥𝑝(2 × 10^-6)

𝑀𝑇𝑇𝐹 = 351

Slide 16
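
A minimal NumPy Monte Carlo sketch of the HCAS structure, combining the exponential priors above through the OR and warm-standby relations. Time units follow whatever units the failure rates are expressed in; no particular MTTF value is asserted for the simulation output.

import numpy as np

rng = np.random.default_rng(0)
N = 200_000

tau_cs    = rng.exponential(1 / 1e-6, N)      # cruise sensor
tau_ss    = rng.exponential(1 / 1e-6, N)      # speed sensor
tau_p     = rng.exponential(1 / 2e-6, N)      # primary processor
tau_b_sb  = rng.exponential(1 / 1e-6, N)      # backup processor, standby rate
tau_b_act = rng.exponential(1 / 2e-6, N)      # backup processor, active rate

tau_t    = np.minimum(tau_cs, tau_ss)                             # sensor OR gate
tau_cpu  = np.where(tau_b_sb < tau_p, tau_p, tau_p + tau_b_act)   # warm standby gate
tau_cput = np.minimum(tau_cpu, tau_t)                             # top-level OR gate

print(tau_cput.mean())          # Monte Carlo MTTF estimate (in the same units as the rates)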
Putting it all together….

Slide 17
Software Defect Prediction

• Predict quality of software before it is used


• Goals:
– predicting the number of defects in the system
– estimating the reliability of the system in terms of time to failure
– understanding the impact of design and testing processes on
defect counts and failure densities
• Number of defects found in operation depends on:
– The amount of operational usage. If you do not use the system you
will find no defects irrespective of the number there
– The number of defects found and fixed is dependent on the number
introduced
– The better the design the fewer the defects, and the less complex
the problem the fewer the defects
– The amount of testing effort

Slide 18
Simplified Defects Model

Slide 19
Node Probability Tables

Node Name                     NPT
Defects found in operation    Binomial(n, p) where n = ‘residual defects’ and p = ‘operational usage’
Residual defects              Defects inserted – Defects found (and fixed) in testing
Defects found in testing      Binomial(n, p) where n = ‘defects inserted’ and p = ‘testing quality’
Defects inserted              Truncated Normal on range 0 to 500 with mean complexity × (1 − design) × 90 and variance 300

Slide 20
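
A minimal forward-simulation sketch of the NPTs above. The qualitative scales (complexity, design, testing quality, operational usage) are mapped onto illustrative values in [0, 1]; those mappings are assumptions, not part of the model.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
N = 50_000

complexity, design, testing_quality, operational_usage = 0.8, 0.4, 0.5, 0.7   # assumed

# Defects inserted: Truncated Normal on [0, 500]
mean, var = complexity * (1 - design) * 90, 300
sd = np.sqrt(var)
a, b = (0 - mean) / sd, (500 - mean) / sd
inserted = truncnorm.rvs(a, b, loc=mean, scale=sd, size=N, random_state=rng)
inserted = inserted.round().astype(int)

found_in_testing   = rng.binomial(inserted, testing_quality)     # Binomial(inserted, testing quality)
residual           = inserted - found_in_testing                 # arithmetic node
found_in_operation = rng.binomial(residual, operational_usage)   # Binomial(residual, operational usage)

print(inserted.mean(), found_in_testing.mean(), found_in_operation.mean())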
Prior model

Shows the marginal distributions of the simple model before any evidence
has been entered. So this represents our uncertainty before we enter any
specific information about this product.

Slide 21
Scenario 1
Zero defects found and fixed in testing AND problem complexity is ‘High’

Slide 22
Scenario 2
Operational usage is “Very High”. Replicates the apparently counter-
intuitive empirical observations whereby a product with no defects found in
testing has a high number of defects post-release.

Slide 23
Scenario 3

Test quality was “Very High”…..

Then we completely revise our beliefs. We are now fairly


certain that the product will be fault free in operation. Slide 24
Modelling a Software Life-Cycle using
Risk Objects

Slide 25
Lesson Summary

• Introduced the concept of reliability modeling


• Learn the reliability parameters for a system or component
from test and operational data using a number of Bayesian
learning models
• Model discrete and continuous use systems
• Predict and estimate defect counts in software products
• Can build large scale models from risk objects
