
Foundations of Machine Learning

Module 4:
Part A: Probability Basics

Sudeshna Sarkar
IIT Kharagpur
• Probability is the study of randomness and
uncertainty.
• A random experiment is a process whose
outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Tossing a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
Sample Space
The sample space is the set of all possible outcomes.

Simple Events
The individual outcomes are called simple events.

Event
An event is any collection of one or more simple events.
Sample Space
• Sample space Ω : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin
three times.
Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}
– If the experiment is the number of customers that
arrive at a service desk during a fixed time period, the
sample space is the set of nonnegative integers:
Ω = {0, 1, 2, 3, …}
Events
• Events are subsets of the sample space
o A= {the outcome that the die is even} ={2,4,6}
o B = {exactly two tosses come out tails} = {htt, tht, tth}
o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A Probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
– For any event A, 0 ≤ P(A) ≤ 1.
– P(Ω) = 1 and P(∅) = 0
– If A1, A2, …, An is a partition of A, then
P(A) = P(A1) + P(A2) + … + P(An)
Properties of Probability
• For any event A, P(Aᶜ) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• For three events A, B, and C,
P(A ∪ B ∪ C) =
P(A) + P(B) + P(C)
− P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
+ P(A ∩ B ∩ C)
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event a could
be defined as:

P(a) = lim (n → ∞) N(a) / n

where N(a) is the number of times event a happens in n trials.
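A minimal simulation sketch of this frequency definition (in Python; not part of the original slides), estimating P(even) for a fair six-sided die by its relative frequency:

```python
import random

# Estimate P(even) for a fair six-sided die by the relative frequency N(a)/n.
n = 100_000
count_even = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)

print(count_even / n)  # approaches the true value 3/6 = 0.5 as n grows
```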
Random Variable
• A random variable is a function defined on the
sample space Ω
– maps the outcome of a random event into real
scalar values

[Diagram: an outcome w ∈ Ω is mapped to the real number X(w)]
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice

• X is a RV with arity k if it can take on exactly one
value out of k values,
– e.g., the possible values that X can take on are
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Probability of Discrete RV
• Probability mass function (pmf): P(X = xᵢ)
• Simple facts about the pmf
– Σᵢ P(X = xᵢ) = 1
– P(X = xᵢ ∧ X = xⱼ) = 0 if i ≠ j
– P(X = xᵢ ∨ X = xⱼ) = P(X = xᵢ) + P(X = xⱼ) if i ≠ j
– P(X = x₁ ∨ X = x₂ ∨ … ∨ X = xₖ) = 1
Common Distributions
• Uniform X ~ U[1, …, N]
– X takes values 1, 2, …, N
– P(X = i) = 1/N
– E.g. picking balls of different colors from a box
• Binomial X ~ Bin(n, p)
– X takes values 0, 1, …, n
– P(X = i) = C(n, i) pⁱ (1 − p)ⁿ⁻ⁱ
– E.g. the number of heads in n coin flips (see the sketch below)
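A small Python sketch (illustrative, not from the slides) of the binomial pmf using math.comb; the parameters n = 10, p = 0.5 are example values:

```python
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Bin(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

# e.g., probability of exactly 7 heads in 10 fair coin flips
print(binomial_pmf(7, 10, 0.5))                          # ~0.117
# the pmf sums to 1 over i = 0, ..., n
print(sum(binomial_pmf(i, 10, 0.5) for i in range(11)))  # 1.0
```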
Joint Distribution
• Given two discrete RVs X and Y, their joint
distribution is the distribution of X and Y together
– e.g., you and your friend each toss a coin 10 times;
P(You get 5 heads AND your friend gets 7 heads)
• Σₓ Σ_y P(X = x ∧ Y = y) = 1
– e.g., summing over i = 0..10 and j = 0..10:
Σᵢ Σⱼ P(You get i heads AND your friend gets j heads) = 1
(see the sketch below)
Conditional Probability
• P(X = x | Y = y) is the probability of X = x, given
the occurrence of Y = y
– E.g. you get 0 heads, given that your friend gets 3
heads
• P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y)
Law of Total Probability
• Given two discrete RVs X and Y, which take values in
{x₁, …, xₘ} and {y₁, …, yₙ}, we have

P(X = xᵢ) = Σⱼ P(X = xᵢ ∧ Y = yⱼ)
          = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
Marginalization

P(X = xᵢ) = Σⱼ P(X = xᵢ ∧ Y = yⱼ)
  (marginal probability = sum of joint probabilities)
          = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
  (conditional probability × marginal probability; see the sketch below)
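A minimal sketch of marginalization and the law of total probability on a made-up joint table (the numbers are hypothetical):

```python
# Hypothetical joint distribution P(X = x, Y = y); the values sum to 1.
joint = {
    ('sunny', 'hot'): 0.30, ('sunny', 'cool'): 0.10,
    ('rainy', 'hot'): 0.15, ('rainy', 'cool'): 0.45,
}
xs = {x for x, _ in joint}
ys = {y for _, y in joint}

# Marginal: P(X = x) = sum_j P(X = x, Y = y_j)
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}

# The same via conditionals: P(X = x) = sum_j P(X = x | Y = y_j) P(Y = y_j)
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
cond = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_x_total = {x: sum(cond[(x, y)] * p_y[y] for y in ys) for x in xs}

print(sorted(p_x.items()))        # [('rainy', 0.6), ('sunny', 0.4)]
print(sorted(p_x_total.items()))  # identical, as the derivation above shows
```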
Bayes Rule
• X and Y are discrete RVs…

P X  x  Y  y
P X  x Y  y 
P Y  y

 
P Y  y j X  xi P  X  xi 
 
P X  xi Y  y j 
 P Y  y
k j 
X  xk P  X  xk 
Independent RVs

• X and Y are independent means that X = x
does not affect the probability of Y = y
• Definition: X and Y are independent iff
– P(X ∧ Y) = P(X) P(Y)
– P(X = x ∧ Y = y) = P(X = x) P(Y = y)
More on Independence

P X  x  Y  y   P X  x P Y  y 

P X  x Y  y  P X  x P Y  y X  x  P Y  y 

• E.g. no matter how many heads you get, your


friend will not be affected, and vice versa
Conditionally Independent RVs
• Intuition: X and Y are conditionally
independent given Z means that once Z is
known, the value of X does not add any
additional information about Y
• Definition: X and Y are conditionally
independent given Z iff

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 
More on Conditional Independence

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 

P  X  x Y  y, Z  z   P  X  x Z  z 

P  Y  y X  x, Z  z   P  Y  y Z  z 
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function 𝑓(𝑥) that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf
– f(x) ≥ 0, ∀x
– ∫₋∞^∞ f(x) dx = 1
– f(x) ≤ 1? Not necessarily: a density can exceed 1, as long as it integrates to 1.
• Actual probabilities are obtained by taking
the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs
– F_X(v) = Σ_{vᵢ ≤ v} P(X = vᵢ)
• Continuous RVs
– F_X(v) = ∫₋∞^v f(x) dx
– d/dx F_X(x) = f(x)
(a numerical sketch follows)
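A numerical sketch (assuming an exponential density as the example, which is not from the slides) checking that integrating the pdf gives the CDF:

```python
import math

lam = 2.0
f = lambda x: lam * math.exp(-lam * x)  # pdf of Exp(lambda) for x >= 0

def cdf_numeric(v, steps=100_000):
    """F_X(v) = integral of f(x) dx from 0 to v, by the midpoint rule."""
    h = v / steps
    return sum(f((k + 0.5) * h) for k in range(steps)) * h

v = 1.5
print(cdf_numeric(v))          # ~0.9502
print(1 - math.exp(-lam * v))  # closed-form CDF of the exponential: 0.9502...
```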
Common Distributions

• Normal X ~ N(μ, σ²)
– f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  −∞ < x < ∞
– E.g. the height of the entire population
[Plot: the standard normal density f(x) for x from −5 to 5, peaking at about 0.4 at x = 0]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
• f_X(x₁, …, x_d) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is the mean vector and Σ is the covariance matrix.
(A sketch follows.)
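A small NumPy sketch evaluating this density directly from the formula (the mean vector and covariance matrix below are arbitrary example values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density at x, following the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.2, -0.1]), mu, Sigma))
```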
Mean and Variance
• Mean (expectation): μ = E[X]
– Discrete RVs: E[X] = Σᵢ vᵢ P(X = vᵢ)
– Continuous RVs: E[X] = ∫₋∞^∞ x f(x) dx

• Variance: V(X) = E[(X − μ)²]
– Discrete RVs: V(X) = Σᵢ (vᵢ − μ)² P(X = vᵢ)
– Continuous RVs: V(X) = ∫₋∞^∞ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples x₁, …, x_N from a distribution,
we can estimate the mean of the distribution
by the sample mean:

μ̂ = (1/N) Σₙ xₙ
Variance Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the variance of the
distribution by:

σ̂² = (1/N) Σₙ (xₙ − μ̂)²

(a sketch follows)
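A quick sketch of these two estimators (with an assumed Gaussian source, so the true mean and variance are known):

```python
import random

# Draw N samples from a distribution with known mean 10 and variance 4.
N = 100_000
samples = [random.gauss(10.0, 2.0) for _ in range(N)]

mean_hat = sum(samples) / N
var_hat = sum((x - mean_hat) ** 2 for x in samples) / N  # 1/N, as on the slide

print(mean_hat, var_hat)  # close to 10 and 4
```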
Thank You
Foundations of Machine Learning

Module 4:
Part B: Bayesian Learning

Sudeshna Sarkar
IIT Kharagpur
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• based on a prior estimate of its probability
• and new relevant evidence
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Bayes Theorem

Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+ | cancer) = 0.98     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03    P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)

(a worked computation follows)
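The same computation carried out numerically (a short Python sketch using the numbers above):

```python
# Posterior probability of cancer given a positive test result.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# P(+) by the law of total probability.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)      # ~0.21
print(1 - p_cancer_given_pos)  # P(not cancer | +) ~0.79, so h_MAP = "no cancer"
```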
Maximum A Posteriori (MAP) Hypothesis
P(h | D) = P(D | h) P(h) / P(D)
The Goal of Bayesian Learning: the most probable hypothesis
given the training data (Maximum A Posteriori hypothesis)

h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
Maximum Likelihood (ML) Hypothesis
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)

• If every hypothesis in H is equally probable a priori,
we only need to consider the likelihood of the data D
given h, P(D|h). Then, h_MAP becomes the Maximum
Likelihood hypothesis,
h_ML = argmax_{h∈H} P(D | h)
MAP Learner
For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability
h_MAP = argmax_{h∈H} P(h | D)

Comments:
Computationally intensive
Provides a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xᵢ, dᵢ) are assumed to have Normally
distributed noise eᵢ with zero mean and variance σ², added
to the true target value f(xᵢ), i.e.
dᵢ = f(xᵢ) + eᵢ, so dᵢ ~ N(f(xᵢ), σ²)
Assume that eᵢ is drawn independently for each xᵢ.
Compute ML Hypo

h_ML = argmax_{h∈H} p(D | h)
     = argmax_{h∈H} ∏ᵢ₌₁ᵐ (1 / √(2πσ²)) exp(−(dᵢ − h(xᵢ))² / (2σ²))
     = argmax_{h∈H} Σᵢ₌₁ᵐ [ −½ ln(2πσ²) − (dᵢ − h(xᵢ))² / (2σ²) ]
     = argmin_{h∈H} Σᵢ₌₁ᵐ (dᵢ − h(xᵢ))²

(a sketch follows)
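A NumPy sketch of this equivalence for a linear hypothesis class (the target function, noise level, and data are made up for illustration): under Gaussian noise, the ML hypothesis is exactly the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations d_i = f(x_i) + e_i with f(x) = 3x + 1 and Gaussian noise.
x = np.linspace(0.0, 1.0, 50)
d = 3 * x + 1 + rng.normal(0.0, 0.2, size=x.shape)

# For linear hypotheses h(x) = w*x + b, minimizing sum (d_i - h(x_i))^2
# gives the ML hypothesis; solve it with ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)

print(w, b)  # close to the true slope 3 and intercept 1
```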
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• ℎ𝑀𝐴𝑃 (𝑥) is not the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:
argmax_{vⱼ∈V} Σ_{hᵢ∈H} P(vⱼ | hᵢ) P(hᵢ | D)
where V is the set of all the values a classification can take and vⱼ is one
possible such classification.
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

Σ_{hᵢ∈H} P(+ | hᵢ) P(hᵢ | D) = .4
Σ_{hᵢ∈H} P(− | hᵢ) P(hᵢ | D) = .6

so the Bayes optimal classification of x is −. (A sketch follows.)
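The same vote computed in a short Python sketch (hypothesis names and values taken from the example above):

```python
# Posteriors over hypotheses and each hypothesis's prediction for x.
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}

# Bayes optimal: argmax_v sum_h P(v | h) P(h | D); here P(v | h) is 0 or 1.
scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ('+', '-')}

print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-', even though h_MAP = h1 predicts '+'
```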
Why “Optimal”?
• Optimal in the sense that no other classifier
using the same H and prior knowledge can
outperform it on average

Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier Gibbs algorithm,
defined as follows:
1. Choose a hypothesis h randomly according to
P(h|D), the posterior probability
distribution over H.
2. Use it to classify the new instance.
Error for Gibbs Algorithm
• Surprising fact: Assume the expected value is
taken over target concepts drawn at random,
according to the prior probability distribution
assumed by the learner, then (Haussler et al.
1994)

E_f[error_{X,f}(GibbsClassifier)] ≤ 2 · E_f[error_{X,f}(BayesOptimal)],

where f denotes a target function and X denotes the instance space.
Thank You
Foundations of Machine Learning

Module 4:
Part C: Naïve Bayes

Sudeshna Sarkar
IIT Kharagpur
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X₁, …, Xₙ | Y) P(Y)
Difficulty: learning the joint probability P(X₁, …, Xₙ | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent given Y!
P(X₁, X₂, …, Xₙ | Y) = P(X₁ | X₂, …, Xₙ, Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂ | Y) ⋯ P(Xₙ | Y)
Naïve Bayes
Bayes rule:
P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) P(X₁, …, Xₙ | Y = yₖ) / Σⱼ P(Y = yⱼ) P(X₁, …, Xₙ | Y = yⱼ)

Assuming conditional independence among the Xᵢ's:
P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) ∏ᵢ P(Xᵢ | Y = yₖ) / Σⱼ P(Y = yⱼ) ∏ᵢ P(Xᵢ | Y = yⱼ)

So, the classification rule for Xnew = <X₁, …, Xₙ> is:
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
Naïve Bayes Algorithm – discrete Xi

• Train Naïve Bayes (examples)
for each* value yₖ
estimate πₖ ≡ P(Y = yₖ)
for each* value xᵢⱼ of each attribute Xᵢ
estimate θᵢⱼₖ ≡ P(Xᵢ = xᵢⱼ | Y = yₖ)

• Classify (Xnew)
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
* probabilities must sum to 1, so need estimate only n-1 parameters...


Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs):

π̂ₖ = P̂(Y = yₖ) = #D{Y = yₖ} / |D|
θ̂ᵢⱼₖ = P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}

where #D{c} is the number of items in set D for which condition c holds (e.g., Y = yₖ).
Example
• Example: Play Tennis

Example
Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables estimated in the learning phase
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule
P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
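The two scores can be reproduced with a few lines of Python (a check of the arithmetic above, using the table values):

```python
# Unnormalized MAP scores for x' = (Sunny, Cool, High, Strong).
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(score_yes, 4), round(score_no, 4))  # 0.0053 0.0206 -> predict "No"
```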

Estimating Parameters: Y, Xi discrete-valued

If unlucky, our MLE estimate for P(Xᵢ | Y) may be zero.

MAP estimates: add a small number of "imaginary" examples to each count
before normalizing. This is the only difference from the MLE, and it keeps
every estimate nonzero.
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent

• We can use Naïve Bayes in many cases anyway


– often the right classification, even when not the right
probability
Gaussian Naïve Bayes (continuous X)
• Algorithm: Continuous-valued Features
– Conditional probability often modeled with the normal
distribution

Sometimes assume the variance
– is independent of Y (i.e., σᵢ),
– or independent of Xᵢ (i.e., σₖ),
– or both (i.e., σ)
Gaussian Naïve Bayes Algorithm – continuous Xi
(but still discrete Y)
• Train Naïve Bayes (examples)
for each value yₖ
estimate* πₖ ≡ P(Y = yₖ)
for each attribute Xᵢ estimate the
class conditional mean μᵢₖ and variance σᵢₖ

• Classify (Xnew)
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
where P(Xᵢ_new | Y = yₖ) is the Gaussian density N(μᵢₖ, σᵢₖ²)
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (j indexes training examples, i indexes features,
k indexes classes, and δ(z) = 1 if z is true, else 0):

μ̂ᵢₖ = ( Σⱼ Xᵢʲ δ(Yʲ = yₖ) ) / ( Σⱼ δ(Yʲ = yₖ) )
σ̂ᵢₖ² = ( Σⱼ (Xᵢʲ − μ̂ᵢₖ)² δ(Yʲ = yₖ) ) / ( Σⱼ δ(Yʲ = yₖ) )
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
μ̂ = (1/N) Σₙ xₙ,   σ̂² = (1/N) Σₙ (xₙ − μ̂)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09

– Learning Phase: output two Gaussian models for P(temp|C)
P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²))
P̂(x | No)  = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²))

(a sketch follows)
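A sketch of the resulting classifier for a new temperature reading (the class priors 9/14 and 5/14 come from the 9 Yes and 5 No training values; the test value x = 22.0 is made up):

```python
import math

def gauss(x, mu, sigma):
    """Normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p_yes, p_no = 9 / 14, 5 / 14   # class priors from the sample counts
x = 22.0                       # a new temperature reading

score_yes = gauss(x, 21.64, 2.35) * p_yes
score_no = gauss(x, 23.88, 7.09) * p_no

print('Yes' if score_yes > score_no else 'No', score_yes, score_no)
```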
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Thank You
Foundations of Machine Learning

Module 4:
Part D: Bayesian Networks

Sudeshna Sarkar
IIT Kharagpur
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
– But all variables are rarely completely independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Example network: nodes Accident, Late wakeup, Rainy day, Traffic Jam,
Meeting postponed, Late for Work, and Late for meeting, connected by
directed edges from causes to effects]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief
Networks
[Diagram: the same example network (Accident, Late wakeup, Rainy day,
Traffic Jam, Meeting postponed, Late for Work, Late for meeting)]

A conditional probability table associated with each node specifies the
conditional distribution of the variable given its immediate parents in
the graph.

Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents.
Inference in Bayesian Networks
• Computes posterior probabilities given evidence about
some nodes
• Exploits probabilistic independence for efficient
computation.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is known to
be NP-hard.
• In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods were shown to be useful.
• Efficient algorithms leverage the structure of the graph
Applications of Bayesian Networks
• Diagnosis: P(cause|symptom) = ?
• Prediction: P(symptom|cause) = ?
• Classification: P(class|data)
• Decision-making (given a cost function)

[Diagram: small cause → symptom networks illustrating reasoning from
symptom to cause (diagnosis) and from cause to symptom (prediction)]
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations

In general,
p(X₁, X₂, …, X_N) = ∏ᵢ p(Xᵢ | parents(Xᵢ))

i.e., the full joint distribution is approximated by the
graph-structured factorization above.

• Two components to a Bayesian network
– The graph structure (conditional independence
assumptions)
– The numerical probabilities (for each variable given its
parents)
(a worked sketch of the factorization follows)
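A minimal sketch of both components for a small hypothetical chain Rain → Traffic Jam → Late for Work (the CPT numbers are invented for illustration):

```python
# Hypothetical CPTs (the "numerical probabilities" component).
p_rain = {True: 0.2, False: 0.8}
p_jam_given_rain = {True: {True: 0.7, False: 0.3},   # P(Jam | Rain)
                    False: {True: 0.2, False: 0.8}}
p_late_given_jam = {True: {True: 0.6, False: 0.4},   # P(Late | Jam)
                    False: {True: 0.1, False: 0.9}}

def joint(rain, jam, late):
    """Graph structure gives p(Rain, Jam, Late) = p(Rain) p(Jam|Rain) p(Late|Jam)."""
    return p_rain[rain] * p_jam_given_rain[rain][jam] * p_late_given_jam[jam][late]

print(joint(True, True, True))  # 0.2 * 0.7 * 0.6 = 0.084
print(sum(joint(r, j, l)        # the factorization defines a valid joint: sums to 1
          for r in (True, False) for j in (True, False) for l in (True, False)))
```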
Examples
• Marginal independence (A  B  C, no edges):
p(A,B,C) = p(A) p(B) p(C)

• Conditionally independent effects (A → B, A → C; e.g., A: D, B: S1, C: S2):
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A

• Independent causes (A → C ← B; e.g., A: Traffic, B: Late wakeup, C: Late):
p(A,B,C) = p(C|A,B) p(A) p(B)
"Explaining away"

• Markov dependence (A → B → C):
p(A,B,C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
[Diagram: class node C with feature nodes Y1, Y2, Y3, …, Yn as its children]
Hidden Markov Model (HMM)
[Diagram: hidden state chain S1 → S2 → S3 → … → Sn, with each St
emitting an observed variable Yt]
Assumptions:
1. hidden state sequence is Markov
2. observation Yt is conditionally independent of all other
variables given St

Widely used in sequence learning, e.g., speech recognition and POS
tagging.
Inference is linear in n
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
– estimate the conditional probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data.
– Similar to learning the weights for the hidden units of
a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance.
– Use a heuristic search or constraint-based technique to
search through potential structures.
Thank You
