
Machine learning models and theories
Content

1. Bayesian Belief Network
2. Neyman–Pearson lemma
3. Error bounds for normal densities
4. Expectation maximization

Reference: Michael Mitzenmacher and Eli Upfal, Probability and Computing.
Bayesian Belief Network
Characteristics

Common names:
• Bayesian network
• Bayes network
• Belief network
• Bayes(ian) model
• Probabilistic directed acyclic graphical model
Bayesian Belief Network

• An important aspect of a Bayesian network is that it represents the relationships
between variables in terms of probabilities.
• To represent these probabilistic relations among different variables/parameters/
hypotheses, a BBN uses a directed acyclic graph (DAG), i.e. a graph with no cycles.
• Each node is associated with a probability function that takes, as input, a
particular set of values for the node's parent variables, and gives (as output) the
probability (or probability distribution, if applicable) of the variable
represented by the node. A minimal sketch of such a table is shown below.
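As a minimal sketch (not from the slides), a node's conditional probability table (CPT)
can be stored as a mapping from the parents' values to a distribution over the node's own
values; the node name GrassWet and all numbers below are illustrative assumptions.

# Hypothetical CPT for a node "GrassWet" with parents (Sprinkler, Rain).
grass_wet_cpt = {
    # (sprinkler, rain): {grass_wet_value: probability}
    (False, False): {True: 0.00, False: 1.00},
    (False, True):  {True: 0.80, False: 0.20},
    (True,  False): {True: 0.90, False: 0.10},
    (True,  True):  {True: 0.99, False: 0.01},
}

def node_probability(cpt, parent_values, value):
    """Return P(node = value | parents = parent_values)."""
    return cpt[parent_values][value]

print(node_probability(grass_wet_cpt, (True, True), True))  # 0.99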
Where can you use it in your research?

• In problems where decisions must be made under uncertainty and reported as part of
your results.
• For quantifying the uncertainty in the behaviour of a proposed model.
• Similar ideas may be applied to undirected, and possibly cyclic, graphs,
such as Markov networks.
Mathematical background of BBN

• Interpretation of P(a): the probability (chance) that event a occurs.
• Interpretation of P(a | b): the probability that event a occurs given that event b
has occurred.
• Example: P(Rain | last day was sunny) is the probability of rain today given that
yesterday was sunny.

What conclusions can be extracted from the given graph?

• Wet grass can be caused by either the Sprinkler or Rain.
• The Sprinkler state can be influenced by Rain.
• Rain has no parent, so its value cannot be decided from any other node.
• This can be represented as:
• P(Grass wet | Rain) = some value
• P(Grass wet | Sprinkler) = some value
• P(Rain | Grass wet) = a decision (inference) problem
• P(Rain | Sprinkler) = a decision (inference) problem
Joint probability density function
• Suppose that there are two events which could cause grass to be wet: either the sprinkler is
on or it's raining. Also, suppose that the rain has a direct effect on the use of the sprinkler
(namely that when it rains, the sprinkler is usually not turned on). Then the situation can
be modeled with a Bayesian network. All three variables have two possible values,
T (for true) and F (for false); see the table below.
Test case   Sprinkler   Rain   Grass wet   Decision
    1           0         0        0          0
    2           0         0        1          0
    3           0         1        0          0
    4           0         1        1          1
    5           1         0        0          0
    6           1         0        1          1
    7           1         1        0          1
    8           1         1        1          1
Joint probability density function (adopted from the
chain rule of probability)
• P(A | B) = P(A, B) / P(B)
• P(A | B) · P(B) = P(B, A) = P(A, B)
• P(A, B, C) = P(A | B, C) · P(B | C) · P(C)
• Read right to left: first C occurs with probability P(C), then B occurs given C, and
only then A occurs given both B and C.
So how to solve this?

Query 1: What is the probability that Rain occurs, the Sprinkler is on, and the grass is
wet?

[Figure: tree with Rain occurring at the root, branching to Sprinkler on / Sprinkler off,
then to Grass wet / Grass not wet]
Solution
• Probability of Rain: P(Rain) = 0.2
• Probability that the Sprinkler is on given Rain: P(Sprinkler on | Rain) = 0.01
• Probability of wet grass given Rain and the Sprinkler on: P(Grass wet | Rain, Sprinkler on) = 0.99

Ans = 0.2 × 0.01 × 0.99 = 0.00198

A minimal sketch of this computation is given below.
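The following is a minimal sketch (values taken from the solution above) of the
chain-rule computation P(Rain, Sprinkler on, Grass wet) =
P(Rain) · P(Sprinkler on | Rain) · P(Grass wet | Rain, Sprinkler on).

p_rain = 0.2                       # P(Rain)
p_sprinkler_given_rain = 0.01      # P(Sprinkler on | Rain)
p_wet_given_rain_sprinkler = 0.99  # P(Grass wet | Rain, Sprinkler on)

p_joint = p_rain * p_sprinkler_given_rain * p_wet_given_rain_sprinkler
print(p_joint)  # ≈ 0.00198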


Another example
Query: What is the probability that the alarm has sounded but neither a
burglary nor an earthquake has occurred, and both Ali and Veli call?

P(¬B, ¬E, A, VC, AC) = P(¬B) · P(¬E) · P(A | ¬B, ¬E) · P(VC | A) · P(AC | A)
                     = 0.999 × 0.998 × 0.001 × 0.90 × 0.70 ≈ 0.00062

Follow-up query: P(B | AC) = ?
Likelihood ratio test (LR test)

• In statistics, a likelihood ratio test (LR test) is a statistical test used for comparing the
goodness of fit of two statistical models — a null model against an alternative model.
The test is based on the likelihood ratio, which expresses how many times more likely
the data are under one model than the other. This likelihood ratio, or equivalently its
logarithm, can then be used to compute a p-value, or compared to a critical value to
decide whether or not to reject the null model.
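
As a minimal sketch (my own illustration, not from the slides), the following tests whether
a coin is fair (null model) against a model whose bias is fitted from the data (alternative
model); the counts of 62 heads in 100 flips are an assumed example.

from scipy.stats import binom, chi2

heads, flips = 62, 100

# Log-likelihood under the null model (p = 0.5) and under the alternative
# model (p = maximum-likelihood estimate heads/flips).
ll_null = binom.logpmf(heads, flips, 0.5)
ll_alt = binom.logpmf(heads, flips, heads / flips)

# Test statistic: 2 * log(likelihood ratio).  By Wilks' theorem it is
# approximately chi-squared with 1 degree of freedom under the null model.
lr_stat = 2 * (ll_alt - ll_null)
p_value = chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)  # reject the null model if p_value is below the chosen level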
Discrete probability distribution

Number rolled   1     2     3     4     5     6
Probability     1/6   1/6   1/6   1/6   1/6   1/6

P(getting an even number) = 1/6 + 1/6 + 1/6 = 1/2
P(getting an even number greater than 2) = 1/6 + 1/6 = 1/3
P(y | x) means x is the limiting (conditioning) event and y is the desirable event.
Neyman–Pearson lemma

Objective: deciding between two simple hypotheses θ0 and θ1 using the
likelihood-ratio test with threshold η; the lemma states that this test is the most
powerful test at its significance level. A statement of the test follows.
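The statement below is the standard form of the likelihood-ratio test referred to by the
lemma (added for completeness; the notation L, η and α is assumed, not taken from the
slide).

\[
\Lambda(x) \;=\; \frac{L(\theta_0 \mid x)}{L(\theta_1 \mid x)},
\qquad \text{reject } H_0 \text{ in favour of } H_1 \text{ when } \Lambda(x) \le \eta,
\]
where the threshold \(\eta\) is chosen so that \(P(\Lambda(X) \le \eta \mid \theta_0) = \alpha\),
the desired significance level; the lemma says this test has the highest power among all
tests of level \(\alpha\).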
Normal density function

1: It is a continuous density function.
2: A random variable with a Gaussian distribution is said to be normally distributed and is
called a normal deviate.
3: The probability density of the normal distribution is
   f(x) = (1 / (σ √(2π))) · exp( −(x − μ)² / (2σ²) ),
   where μ is the mean and σ² is the variance.
Expectation in the normal density function

• If X is a random variable whose cumulative distribution function admits a
density f(x), then the expected value is defined as the following Lebesgue
integral:
• E[X] = ∫ x f(x) dx, taken over the whole real line.
Concept of decision making in Bayesian decision theory

• Suppose that we want to classify two kinds of fish: (A) sea bass and (B) salmon.
We have a fish with some properties resembling class w1 (sea bass) and some
resembling class w2 (salmon), and we do not know whether it is actually a sea bass
or a salmon.

• So how do we take the correct decision in this kind of situation?
Simple phenomenon of estimation

• Calculate P(w1 | x) and P(w2 | x) from previously observed samples of fish.
• Decision rule: if P(w1 | x) > P(w2 | x) then choose w1, else choose w2.

• Let us generalise this case, where x is the observed condition (feature value) and ω is the
correct class/decision.
• We note first that the (joint) probability density of finding a pattern that is
in category ωj and has feature value x can be written two ways: p(ωj, x) =
P(ωj | x) p(x) = p(x | ωj) P(ωj). Rearranging these leads us to the answer to our
question, which is called Bayes' formula:
   P(ωj | x) = p(x | ωj) P(ωj) / p(x),  where p(x) = Σj p(x | ωj) P(ωj).
A minimal sketch of this decision rule is given below.
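The following sketch (my own illustration, not the slides' code) applies the two-class rule
with 1-D Gaussian class-conditional densities; the priors, means and spreads are assumed
numbers, not values from the slides.

from scipy.stats import norm

prior = {"w1": 0.6, "w2": 0.4}                 # P(wj)
likelihood = {"w1": norm(loc=4.0, scale=1.0),  # p(x | w1), e.g. a length feature
              "w2": norm(loc=6.0, scale=1.5)}  # p(x | w2)

def posterior(x, w):
    """P(wj | x) by Bayes' formula; p(x) is the sum over both classes."""
    evidence = sum(likelihood[c].pdf(x) * prior[c] for c in prior)
    return likelihood[w].pdf(x) * prior[w] / evidence

def decide(x):
    return "w1" if posterior(x, "w1") > posterior(x, "w2") else "w2"

print(decide(4.5), decide(6.5))  # -> w1 w2 for these illustrative parameters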
To estimate it correctly, we have to find the error value for both w1 and w2 in continuous
space.
What is probabilistic error estimation with respect to x?

LOC     Function point   Adjacent feature   Effort (ground truth of decision)
1000    103.7            52                 5500
1700    90.50            17                 2000
3310    103.87           91                 10000
3500    144.90           103                9000

The error function (also called the cost function or objective function) is the analyst's
responsibility to develop, for example:
Error function = (Oi − Ei | feature value x)
A minimal sketch of evaluating such an error function is given below.
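The following sketch (my own illustration) evaluates a simple per-row error Oi − Ei over
the table above; the estimator estimate_effort is a hypothetical placeholder, since the
slide only gives the ground-truth effort.

rows = [
    # (LOC, function_point, adjacent_feature, observed_effort)
    (1000, 103.7, 52, 5500),
    (1700, 90.50, 17, 2000),
    (3310, 103.87, 91, 10000),
    (3500, 144.90, 103, 9000),
]

def estimate_effort(loc):
    """Hypothetical estimator: effort proportional to lines of code."""
    return 2.8 * loc

for loc, fp, adj, observed in rows:
    error = observed - estimate_effort(loc)   # Oi - Ei
    print(f"LOC={loc:5d}  observed={observed:6d}  error={error:8.1f}")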
So the selection criterion will be

Select the region that shows the minimum error among the given decision regions.
So the problem is again the same, "How to find min[P(w1 | x), P(w2 | x)]", and can we get a
specific bound for this problem?

Different methods are available in statistics to solve this optimization problem.
Traditionally it is solved by global minimisation or dynamic programming, but here we use
the Chernoff bound to estimate it.
Chernoff bound

• Lemma: for a, b ≥ 0 and 0 ≤ β ≤ 1,  min[a, b] ≤ a^β · b^(1−β).

Let us discuss it. There are two cases: either a is greater than b, or b is greater than a.
Assume a > b; then we only have to prove

   b ≤ a^β · b^(1−β)
   1 ≤ a^β · b^(−β)
   1 ≤ (a / b)^β,

which holds because a/b > 1 and β ≥ 0.
Bhattacharyya bound (an extension of the Chernoff bound which is slightly less tight and
is obtained for β = 1/2)
Error probabilities in probabilistic estimations (confusion matrix)

• We can obtain additional insight into the operation of a general classifier
— Bayes or otherwise — if we consider the sources of its error.
• Consider first the two-category case, and suppose the dichotomizer has
divided the space into two regions R1 and R2 in a possibly non-optimal
way.
• There are two ways in which a classification error can occur; either an
observation x falls in R2 and the true state of nature is ω1, or x falls in R1
and the true state of nature is ω2. Since these events are mutually exclusive
and exhaustive, the probability of error is
   P(error) = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
            = ∫R2 p(x | ω1) P(ω1) dx + ∫R1 p(x | ω2) P(ω2) dx.
A numerical sketch of this error for two 1-D Gaussian classes is given below.
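The following sketch (my own illustration; all numbers are assumed) evaluates
P(error) = ∫R2 p(x|ω1)P(ω1) dx + ∫R1 p(x|ω2)P(ω2) dx for two 1-D Gaussian classes and a
decision threshold t, with R1 = {x < t} and R2 = {x ≥ t}.

from scipy.stats import norm

p1, p2 = 0.5, 0.5                        # priors P(w1), P(w2)
c1, c2 = norm(0.0, 1.0), norm(2.0, 1.0)  # class-conditional densities p(x|w1), p(x|w2)
t = 1.0                                  # decision boundary between R1 and R2

# Mass of class 1 falling in R2 plus mass of class 2 falling in R1.
p_error = c1.sf(t) * p1 + c2.cdf(t) * p2
print(p_error)  # ≈ 0.1587 for these illustrative parameters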
Error Bounds for Normal Densities

1: Chernoff bound
2: Bhattacharyya bound
The concept of these bounds is taken directly from the moment generating function of a
random variable X.
Chernoff Bound

• Suppose X1, ..., Xn are independent random variables taking values in {0, 1}. Let X
denote their sum and let μ = E[X] denote the sum's expected value. Then for any δ > 0

   P(X ≥ (1 + δ)μ) ≤ ( e^δ / (1 + δ)^(1+δ) )^μ.

A similar proof strategy can be used to show that, for 0 < δ < 1,

   P(X ≤ (1 − δ)μ) ≤ ( e^(−δ) / (1 − δ)^(1−δ) )^μ ≤ e^(−μδ²/2).
Probabilistic details of the moment generating function

The moment generating function of a random variable X is M_X(t) = E[e^(tX)], and its n-th
derivative at t = 0 gives the n-th moment E[X^n]:
For n = 1: expectation (first-order moment), E[X].
For n = 2: second-order moment, E[X²].
For n = 3: third-order moment, E[X³].

Consider a geometric random variable X with parameter p, i.e. Pr(X = n) = (1 − p)^(n−1) p
for n = 1, 2, ...; its moment generating function is M_X(t) = p e^t / (1 − (1 − p) e^t) for
t < −ln(1 − p).
Chernoff bounds
Problem 1:

• Consider a biased coin with probability p = 1/3 of landing heads and probability 2/3 of
landing tails. Suppose the coin is flipped some number n of times, and let Xi be a
random variable denoting the ith flip, where Xi = 1 means heads, and Xi = 0 means
tails. Use the Chernoff bound to determine a value for n so that the probability that
more than half of the coin flips come out heads is less than 0.001.
Solution
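One way to work this out (a sketch using the simplified upper-tail bound
P(X ≥ (1 + δ)μ) ≤ e^(−μδ²/3) for 0 < δ ≤ 1; the intended solution on the slide may use a
different form of the bound): here μ = E[X] = n/3, and "more than half heads" means
X > n/2 = (1 + 1/2)(n/3), so δ = 1/2. The bound gives
P(X > n/2) ≤ e^(−(n/3)(1/2)²/3) = e^(−n/36),
and requiring e^(−n/36) < 0.001 gives n > 36 ln(1000) ≈ 248.7, so n ≥ 249 flips suffice.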
Expectation–maximization (EM) algorithm

• In statistics, an expectation–maximization (EM) algorithm is an iterative method to


find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in
statistical models, where the model depends on unobserved latent variables.
• The EM iteration alternates between performing an expectation (E) step, which creates
a function for the expectation of the log-likelihood evaluated using the current estimate
for the parameters, and a maximization (M) step, which computes parameters
maximizing the expected log-likelihood found on the E step.
Objective of EM algorithm
• Probabilistic models, such as hidden Markov models or Bayesian networks,
are commonly used to model biological data. Much of their popularity can be
attributed to the existence of efficient and robust procedures for learning
parameters from observations.
• Often, however, the only data available for training a probabilistic model are
incomplete. Missing values can occur, for example, in medical diagnosis,
where patient histories generally include results from a limited battery of tests.
• Alternatively, in gene expression clustering, incomplete data arise from the
intentional omission of gene-to-cluster assignments in the probabilistic model.
The expectation maximization algorithm enables parameter estimation in
probabilistic models with incomplete data.
A coin-flipping experiment
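A minimal sketch (my own illustration, not the slides' code) of EM for the classic two-coin
experiment: each data point is the number of heads in a set of 10 flips made with one of
two coins of unknown biases, and the coin identities are the hidden variables; the data and
initial guesses below are assumed for illustration.

import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])   # heads observed in five sets of 10 flips
n = 10

theta_a, theta_b = 0.6, 0.5         # initial guesses for the two coin biases

for _ in range(20):
    # E-step: posterior probability that each set came from coin A,
    # assuming a uniform prior over the two coins.
    like_a = binom.pmf(heads, n, theta_a)
    like_b = binom.pmf(heads, n, theta_b)
    resp_a = like_a / (like_a + like_b)
    resp_b = 1.0 - resp_a

    # M-step: re-estimate each bias from the expected heads and total
    # flips attributed to that coin.
    theta_a = np.sum(resp_a * heads) / np.sum(resp_a * n)
    theta_b = np.sum(resp_b * heads) / np.sum(resp_b * n)

print(theta_a, theta_b)  # converges to roughly 0.80 and 0.52 for this data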
