2MLIntrodpart 2
39
Dilemma
This person dropped their
ticket in the hallway.
Do you call out
“Excuse me, ma’am!”
or
“Excuse me, sir!”
You have to make a
guess.
40
Bayesian Inference
• Bayesian inference is a way to capture
common sense.
• It helps you use what you know to make
better guesses.
41
Conditional probabilities
P(A | B) is the probability of A, given B.
“If I know B is the case, what is the probability that A is also the case?”
P(A | B) is not the same as P(B | A).
42
Joint probabilities
● P(A and B), also written P(A,B), P(A with B), or P(A ∩ B)
● P(A,B) = P(A) * P(B|A)
● P(A,B,C) = P(A) * P(B|A) * P(C|A and B)
● P(B,A) = P(B) * P(A|B)
e.g. What is the probability that a person is both a woman and has short hair?
P(woman with short hair)
= P(woman) * P(short hair | woman)
= 0.5 * 0.5 = 0.25
● P(¬A | B) = 1 − P(A | B)
43
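The chain-rule arithmetic above can be checked with a minimal Python sketch (not part of the original slides; the numbers are the illustrative ones from this slide):

```python
# Minimal sketch of the chain rule for joint probabilities,
# using the illustrative numbers from the slide.
p_woman = 0.5                 # P(woman)
p_short_given_woman = 0.5     # P(short hair | woman)

# P(woman, short hair) = P(woman) * P(short hair | woman)
p_joint = p_woman * p_short_given_woman
print(p_joint)                  # 0.25

# Complement rule for conditional probabilities:
# P(not short hair | woman) = 1 - P(short hair | woman)
print(1 - p_short_given_woman)  # 0.5
```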
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Sunny) = ?
Answer:
44
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Sunny) = ?
Answer:
● P(D2=Sunny)=
P(D2=Sunny and D1=Sunny) +
P(D2=Sunny and D1=Rainy)
= P(D2=Sunny | D1=Sunny) * P(D1=Sunny) +
P(D2=Sunny | D1=Rainy)* P(D1=Rainy)
=0.8 * 0.9+ 0.6 * 0.1
=0.78
45
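A minimal Python sketch (not from the slides) of the same total-probability calculation:

```python
# Sketch: marginal probability of Day 2 being sunny via total probability.
p_d1 = {"Sunny": 0.9, "Rainy": 0.1}      # P(D1)
p_d2_sunny_given_d1 = {                  # P(D2=Sunny | D1)
    "Sunny": 0.8,
    "Rainy": 0.6,
}

# P(D2=Sunny) = sum over D1 of P(D2=Sunny | D1) * P(D1)
p_d2_sunny = sum(p_d2_sunny_given_d1[d1] * p_d1[d1] for d1 in p_d1)
print(p_d2_sunny)   # 0.78
```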
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Rainy) = ?
Answer:
46
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Rainy) = ?
Answer:
P(D2=Rainy) = P(D2=Rainy | D1=Sunny) * P(D1=Sunny) + P(D2=Rainy | D1=Rainy) * P(D1=Rainy)
= 0.2 * 0.9 + 0.4 * 0.1
= 0.22
47
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 3 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D3=Sunny) = ?
Answer: reuse the previous solution with D3 in place of D2 and D2 in place of D1 (the same transition probabilities apply between Day 2 and Day 3):
P(D3=Sunny) = 0.8 * P(D2=Sunny) + 0.6 * P(D2=Rainy) = 0.8 * 0.78 + 0.6 * 0.22 = 0.756
48
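A small Python sketch (assuming, as the hint implies, that the same transition probabilities apply between every pair of consecutive days) that iterates the calculation out to Day 3:

```python
# Sketch: iterate the sunny/rainy update to get Day 3 (and any later day).
P_SUNNY_AFTER_SUNNY = 0.8   # P(next=Sunny | today=Sunny)
P_SUNNY_AFTER_RAINY = 0.6   # P(next=Sunny | today=Rainy)

def next_day_sunny(p_sunny_today):
    """Total probability: P(next=Sunny) from P(today=Sunny)."""
    p_rainy_today = 1.0 - p_sunny_today
    return (P_SUNNY_AFTER_SUNNY * p_sunny_today
            + P_SUNNY_AFTER_RAINY * p_rainy_today)

p = 0.9                      # P(D1=Sunny)
for day in range(2, 4):      # compute Day 2 and Day 3
    p = next_day_sunny(p)
    print(f"P(D{day}=Sunny) = {p:.3f}")
# P(D2=Sunny) = 0.780
# P(D3=Sunny) = 0.756
```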
Theorem of Total Probability
• If events A1, …, An are mutually exclusive with Σ(i=1..n) P(Ai) = 1, then
P(B) = Σ(i=1..n) P(B | Ai) * P(Ai)
49
50
Bayes’ Theorem
51
52
Bayes Rule example
• There is a specific type of cancer that occurs in 1% of the population. The probability of a
test coming out positive given that one has cancer is 0.9, and the probability of this test
coming out negative given that one doesn't have cancer is 0.2. What is the probability
that a person has this cancer given that they just received a positive test?
• Answer:
Given: P(C) = 0.01, P(+ve | C) = 0.9, P(-ve | ¬C) = 0.2
Derived: P(¬C) = 0.99, P(-ve | C) = 0.1, P(+ve | ¬C) = 0.8
Find: P(C | +ve) = ?
53
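A minimal Python sketch (not from the slides) that finishes the calculation with Bayes' rule; the derived quantities follow from the given ones by the complement rule:

```python
# Sketch: Bayes' rule for the cancer-test example on this slide.
p_c = 0.01                      # P(C): prior probability of cancer
p_pos_given_c = 0.9             # P(+ve | C)
p_neg_given_not_c = 0.2         # P(-ve | not C)

p_not_c = 1 - p_c                           # 0.99
p_pos_given_not_c = 1 - p_neg_given_not_c   # P(+ve | not C) = 0.8

# P(+ve) by total probability, then Bayes' rule
p_pos = p_pos_given_c * p_c + p_pos_given_not_c * p_not_c
p_c_given_pos = p_pos_given_c * p_c / p_pos
print(round(p_c_given_pos, 4))  # ~0.0112
```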
Example 2
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = 0.008,        P(¬cancer) = 0.992
P(+ | cancer) = 0.98,     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03,    P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer) * P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) * P(¬cancer) / P(+)
54
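A small Python sketch (not from the slides) that evaluates the two Bayes-rule expressions above and normalises them by P(+):

```python
# Sketch: the second cancer example, computed with Bayes' rule.
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalised posteriors (numerators of Bayes' rule)
num_cancer = p_pos_given_cancer * p_cancer              # 0.00784
num_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# P(+) is the sum of the numerators, so normalising gives the posteriors
p_pos = num_cancer + num_not_cancer
print(round(num_cancer / p_pos, 3))      # P(cancer | +)     ~0.21
print(round(num_not_cancer / p_pos, 3))  # P(not cancer | +) ~0.79
```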
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured.
55
56
Approaches in Dealing with
Uncertainty
Quantitative (numerically oriented) approaches:
• Bayes' rule
• Certainty factors
• Dempster–Shafer theory
• Fuzzy sets
Symbolic approaches:
• Non-monotonic reasoning
• Cohen's Theory of Endorsements
• Fox's semantic systems
57
58
Naïve Bayesian Classifier
● Let D = the training set of tuples, where each tuple is represented by
an n-dimensional vector X = (x1, x2, x3, …, xn)
59
Bayesian Theorem
• Given training data D, the posterior probability of a
hypothesis h, P(h|D), follows from Bayes' theorem:
P(h | D) = P(D | h) * P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
h_MAP = argmax over h in H of P(h | D) = argmax over h in H of P(D | h) * P(h)
• Practical difficulty: requires initial knowledge of
many probabilities, and significant computational
cost
60
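As an illustration only (the hypothesis space and numbers below are made up, mirroring the earlier cancer-test example), the MAP hypothesis can be picked out like this:

```python
# Sketch: the MAP hypothesis is the one maximising P(D | h) * P(h).
# The two hypotheses and their numbers are illustrative, not from the slides
# (they mirror the cancer-test example: D = a positive test result).
prior = {"cancer": 0.008, "no cancer": 0.992}      # P(h)
likelihood = {"cancer": 0.98, "no cancer": 0.03}   # P(D | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_map)   # "no cancer" -- its product 0.0298 beats 0.0078
```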
Estimating a-posteriori
probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
61
P(h | D) = P(D | h) * P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability:
h_MAP = argmax over h in H of P(h | D)
Comments:
Computationally intensive
Provides a standard for judging the performance
of learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
62
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
63
Bayesian classification
• The classification problem may be
formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
• E.g. P(class=N |
outlook=sunny,windy=true,…)
64
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
• If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
• Computationally easy in both cases
65
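A rough Python sketch of the two per-attribute estimates described above; the helper names and toy numbers are illustrative, not from the slides:

```python
import math

# Sketch: the two per-attribute likelihood estimates used by naive Bayes.

def categorical_likelihood(values_in_class, x):
    """P(x_i | C) as the relative frequency of x among class-C samples."""
    return values_in_class.count(x) / len(values_in_class)

def gaussian_likelihood(values_in_class, x):
    """P(x_i | C) from a Gaussian fitted to the class-C samples."""
    mu = sum(values_in_class) / len(values_in_class)
    var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy usage (made-up numbers):
print(categorical_likelihood(["sunny", "rain", "rain"], "rain"))  # 2/3
print(gaussian_likelihood([20.0, 22.0, 24.0], 21.0))              # density value
```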
[Figure: likelihood tables]
66
[Table: the 14-example weather training data with columns Day, Outlook, Temperature, Humidity, Wind, Play ball]
67
68
weather example: classifying X
• An unseen sample X = <rain, hot, high, weak>
• P(X|yes)·P(yes) =
P(rain|yes)·P(hot|yes)·P(high|yes)·P(weak|yes)·P(yes)
= 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|no)·P(no) =
P(rain|no)·P(hot|no)·P(high|no)·P(weak|no)·P(no)
= 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
• Since 0.018286 > 0.010582, X is classified as Play ball = no
69
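A minimal Python sketch (not from the slides) that reproduces the two scores using the likelihoods and priors quoted above:

```python
# Sketch: naive Bayes scores for X = <rain, hot, high, weak>,
# using the likelihoods and priors quoted on the slide.
likelihoods = {
    "yes": {"rain": 3/9, "hot": 2/9, "high": 3/9, "weak": 6/9},
    "no":  {"rain": 2/5, "hot": 2/5, "high": 4/5, "weak": 2/5},
}
priors = {"yes": 9/14, "no": 5/14}

x = ["rain", "hot", "high", "weak"]
scores = {}
for c in priors:
    score = priors[c]
    for value in x:
        score *= likelihoods[c][value]
    scores[c] = score

print(scores)                       # {'yes': ~0.0106, 'no': ~0.0183}
print(max(scores, key=scores.get))  # 'no'
```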
70
Probability distributions
Height of adults in cm
[Figure: histogram of the height distribution, with a probability between 0.01 and 0.28 for each bin: <150, 150–160, 160–170, 170–180, 180–190, 190–200, 200–210, >210]
71
Source: https://www.nature.com/articles/s41598-018-33413-y
72
Laplacian correction/estimator
• What if we encounter probability values of zero?
• It will result in probability product coming out to be zero.
• There is a simple trick to avoid this problem. We can assume
that our training database, D, is so large that adding one to
each count that we need would only make a negligible
difference in the estimated probability value, yet would
conveniently avoid the case of probability values of zero.
• This technique for probability estimation is known as the
Laplacian correction or Laplace estimator, named after
Pierre Laplace, a French mathematician who lived from 1749
to 1827.
• If we have, say, q counts to each of which we add one, then
we must remember to add q to the corresponding
denominator used in the probability calculation.
73
Laplacian correction/estimator
• Suppose for some training database D with 1000 tuples, we
have this situation:
– 0 tuples with income = low
– 990 tuples with income = medium
– 10 tuples with income = high
• Then the probabilities of these events are:
0/1000 = 0          990/1000 = 0.990          10/1000 = 0.010
• Using the Laplacian correction, we pretend that we have one
more tuple for each income-value pair:
1/1003 ≈ 0.001      991/1003 ≈ 0.988          11/1003 ≈ 0.011
• Hence the corrected probability estimates are close to their
uncorrected counterparts, yet the zero probability value is
avoided.
74
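A small Python sketch (not from the slides) reproducing this example with and without the Laplacian correction:

```python
# Sketch: Laplacian correction for the income counts on this slide.
counts = {"low": 0, "medium": 990, "high": 10}   # q = 3 income values
total = sum(counts.values())                     # 1000
q = len(counts)

plain = {v: c / total for v, c in counts.items()}
smoothed = {v: (c + 1) / (total + q) for v, c in counts.items()}

print(plain)     # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(smoothed)  # {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}
```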
Advantages of Bayesian Learning
• Unlike many other models of supervised learning, the
naive Bayes classifier can handle missing data where
not all features are observed; the agent conditions on the
features that are observed.
• Naive Bayes is optimal when the conditional independence
assumption holds.
• A naive Bayes model gives a direct way to assess the
weights and allows for missing data.
• It does, however, assume that the Xi are
independent given Y, which may not hold in practice.
• A linear regression model trained, for example, with
gradient descent can take dependencies into account,
but does not handle missing data.
75
76
Applications of Bayes Classifier
77
78
Probability of a bigram w1 and w2
79
● Bigram example
● Trigram example
80
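The worked bigram/trigram figures from these slides are not reproduced here; as a rough sketch, bigram probabilities are commonly estimated from corpus counts as P(w2 | w1) = count(w1 w2) / count(w1). The tiny corpus below is made up for illustration:

```python
from collections import Counter

# Rough sketch: estimating bigram probabilities P(w2 | w1) from counts,
# P(w2 | w1) = count(w1 w2) / count(w1).  The tiny corpus is made up.
corpus = "the cat sat on the mat and the cat slept".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigram[(w1, w2)] / unigram[w1]

print(p_bigram("the", "cat"))   # 2/3: "the" appears 3 times, "the cat" twice
```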
Predicting words based on Shakespeare
● What is the probability of seeing "the", given that we've just seen "of"?
81
82
Few more applications
● It is widely used in spam filtering (identifying spam e-mail) and other text-classification tasks
83
84
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
– Decision trees, that reason on one attribute at a
time, considering the most important attributes first
85
86
Bayesian Belief Network
• A directed acyclic probabilistic graphical model that captures
dependence among the attributes
• Bayesian Net
– Nodes: Variable/Attributes/Class
– Directed edges: Causality/dependency
– Absence of edge: independence
– Network structure: domain knowledge
– Joint probabilities: from data
87
Example
• Find P(R|S) = ?      (ans = P(R) = 0.01)
• Find P(R|H,S) = ?    (0.0142)
• Find P(R|H) = ?      (0.97)
• Sunny and Raise are evidence variables; Happy is the query variable.
• Hint: apply Bayes' rule
88
Example of Bayesian Belief Network
[Figure: example Bayesian belief network; the per-node conditional probability tables contain 1, 1, 4, 2, and 2 entries]
89
Bayesian Reasoning
• Given the evidence of who has or has not called, we would
like to estimate the probability of a burglary.
• In probabilistic inference, the output is not a single number
for each query variable; rather, it is a joint probability
distribution over the query variables.
• We call this the posterior distribution of one or more query
variables, given the evidence.
• It is the probability distribution of one or more query
variables given the values of evidence variables:
90
Bayesian Reasoning
• We call this the posterior distribution of one or more query
variables Qi, given the evidence variables Ei:
P(Q1, Q2, … | E1=e1, E2=e2, …, En=en)
• Another question that can be answered is: out of all possible
values for all the query variables, which combination of
values has the highest probability?
argmax over q1, q2, … of P(Q1=q1, Q2=q2, … | E1=e1, E2=e2, …)
91
Bayesian Reasoning
• One great thing about Bayes nets is that we are not
restricted to reasoning in only one direction: we can reason in
the causal direction or against the causal flow (diagnostically).
92
Bayesian Inferencing
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?
93
CPT
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?
Answer:
= P(J|A) · P(M|A) · P(A|~B,~E) · P(~B) · P(~E)
= 0.90 * 0.70 * 0.001 * 0.999 * 0.998
≈ 0.00063
94
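A minimal Python sketch (not from the slides) that multiplies out the CPT entries quoted above:

```python
# Sketch: P(J, M, A, ~B, ~E) from the burglary-network CPT entries on the slide.
p_not_b = 0.999                 # P(~B)
p_not_e = 0.998                 # P(~E)
p_a_given_not_b_not_e = 0.001   # P(A | ~B, ~E)
p_j_given_a = 0.90              # P(J | A)
p_m_given_a = 0.70              # P(M | A)

p = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(p, 5))              # 0.00063 (exactly 0.000628...)
```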
Advantage of Bayesian networks
● P(B,E,A,J,M) = P(B) * P(E) * P(A|B,E) * P(J|A) * P(M|A)
● B and E have no incoming arrows, so they each have an
unconditional probability distribution, P(B) and P(E).
● A has two incoming arrows, so its probability is
conditioned on B and E, giving us P(A|B,E).
● J and M are both conditioned on A, giving us P(J|A) and
P(M|A).
● So the definition of the joint distribution for this setup,
P(B,E,A,J,M), is based on the factors above, and gives us
one really BIG advantage.
● A full joint distribution over any five binary
random variables requires 2^5 − 1 = 31 probability values,
while our Bayes network only requires 10 probability
values.
95
96
Representing the full joint distribution
source: Russell & Norvig book, page 513
97
98
Advantage of Bayesian networks
99
100
Example
101
102
Training Bayesian Belief Networks
• Once the network topology and the observable variables are
known, a gradient descent strategy can be used to learn the
CPT entries.
103
Clustering
[Figure: data points plotted along the axes income, education, and age]
104
Clustering
[Figure: the same income / education / age data, now grouped into clusters]
105
Clustering
• Clustering is also called "grouping"
• Intuitively, you want to assign the same label to data
points that are close to each other
• Thus, clustering algorithms rely on a distance metric
between data points
106
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
107
Issues
• Is the desired number of clusters given?
• Finding the "best" clusters
• Are the clusters semantically meaningful?
108
Cluster Analysis
• Finding groups of objects such that
– the objects in a group will be similar (or related) to
one another and
– different from (or unrelated to) the objects in other
groups
109
Example
• Suppose we have 4 types of medicines and each has two attributes
(pH and weight index). Our goal is to group these objects into K=2
groups of medicine.
[Table/plot: the four medicines A–D plotted by their two attribute values; C = (4, 3), D = (5, 4)]
110
Another Example: Text
• Each document is a vector
– e.g., <100110...> contains words 1,4,5,...
• Clusters contain “similar” documents
• Useful for understanding, searching
documents
[Figure: document clusters labelled sports, international, news, and business]
111
• Information retrieval
• text mining
• Web analysis
• marketing
• medical diagnostic
112
K-means Algorithm
• Given the cluster number K, the K-means algorithm
is carried out in three steps after initialisation:
Initialisation: set seed points (randomly selected as
means of clusters)
1)Assign each object to the cluster of the nearest seed
point measured with a specific distance metric
2)Compute new seed points as the centroids of the
clusters of the current partition (the centroid is the
centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no new
assignments (i.e., the membership of each cluster no longer
changes) — see the sketch below
113
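A minimal Python sketch of this loop (a simple illustration assuming Euclidean distance, not a reference implementation); the usage example applies it to the class-problem data on the next slide:

```python
# Minimal K-means sketch (Euclidean distance), following the three steps above.
def kmeans(points, seeds):
    centers = [list(s) for s in seeds]
    while True:
        # 1) assign each point to the cluster of the nearest center
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # 2) recompute centers as the centroids of the current clusters
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # 3) stop when assignments (hence centers) no longer change
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

# Usage on the class-problem data, seeded with A and C:
data = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)]   # A, B, C, D, E
centers, clusters = kmeans(data, [(1, 1), (0, 2)])
print(centers)   # roughly [[0.67, 1.0], [2.5, 4.5]]
```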
Class Problem
Training example    x1    x2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
• Let k=2, meaning we are interested in two clusters
• Let A and C be randomly selected as the initial means of the 2
clusters.
114
Class Problem
Mean/Center    Distance from center 1    Distance from center 2
Class Problem
Mean/Center    Distance from center 1    Distance from center 2
116
Class Problem
Mean/Center    Distance from center 1 {1, 0.5}    Distance from center 2 {1.7, 3.7}
117
Class Problem
Mean/Center    Distance from center 1 {0.7, 1}    Distance from center 2 {2.5, 4.5}
A
B
C
D
E
118
Exercise 1
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?
[Table/plot: the medicine data set; B = (2, 1), C = (4, 3), D = (5, 4)]
119
Exercise 2: K means
120
Drawback of K-means
• K-means fails for data that are not linearly separable.
• The K-means algorithm is sensitive to outliers,
– since an object with an extremely large value may substantially
distort the distribution of the data.
• There are other limitations as well – in particular, a need to reduce the
cost of repeatedly calculating distances to centroids.
• K-Medoids: instead of taking the mean value of the objects
in a cluster as a reference point, a medoid can be used,
– which is the most centrally located object in a cluster.
• Another solution is the more advanced kernel K-means clustering
algorithm, which is based on the idea of projecting the data onto a high-
dimensional kernel space and then performing K-means
clustering there.
121
Summary
• The K-means algorithm is a simple yet popular
method for clustering analysis
• Its performance is determined by the initialisation and
by the choice of an appropriate distance measure
• There are several variants of K-means to
overcome its weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering
analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of
clusters
122