Unit 5 - ML
Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian
distributions, and each of these distributions represents a cluster. Hence, a Gaussian Mixture
Model tends to group the data points belonging to a single distribution together.
Let’s say we have three Gaussian distributions – GD1, GD2, and GD3. These have a certain
mean (μ1, μ2, μ3) and variance (σ1, σ2, σ3) value respectively. For a given set of data points,
our GMM would identify the probability of each data point belonging to each of these
distributions.
Gaussian Mixture Models are probabilistic models and use the soft clustering approach for
distributing the points in different clusters.
Suppose we have three clusters denoted by three colours – blue, green, and cyan – and
consider a data point, highlighted in red, that lies deep inside the blue cluster. The probability
of this point being a part of the blue cluster is 1, while the probability of it being a part of the
green or cyan clusters is 0.
Now, consider another point, somewhere in between the blue and cyan clusters. The
probability that this point is a part of the green cluster is 0, while the probabilities that it
belongs to the blue and cyan clusters are 0.2 and 0.8 respectively.
Gaussian Mixture Models use the soft clustering technique for assigning data points to
Gaussian distributions. I’m sure you’re wondering what these distributions are so let me
explain that in the next section.
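The soft-assignment behaviour described above can be sketched with scikit-learn's GaussianMixture (assuming scikit-learn is available; the three synthetic blobs are invented for illustration):

```python
# Sketch of soft clustering with a Gaussian Mixture Model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs, one per Gaussian component
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(data)

# Soft assignment: a probability for each point under each component
probs = gmm.predict_proba(data)
print(probs[0])          # one entry close to 1 for a point deep inside a blob
print(probs.sum(axis=1)) # each row sums to 1
```

Unlike k-means' hard labels, `predict_proba` returns the per-cluster probabilities discussed above (e.g. 0.2 and 0.8 for a point between two clusters).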
The Gaussian Distribution
The Gaussian distribution (or normal distribution) has a bell-shaped curve, with the data
points symmetrically distributed around the mean value. Gaussian distributions can differ in
their mean (μ) and variance (σ²); remember that the higher the σ value, the greater the
spread.
In a one-dimensional space, the probability density function of a Gaussian distribution is given
by:

f(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
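As a minimal sketch, the one-dimensional Gaussian density can be evaluated directly from its definition (the helper name is illustrative):

```python
# Evaluate the 1-D Gaussian density f(x | mu, sigma^2).
import math

def gaussian_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) at x."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at the mean and is symmetric around it
print(gaussian_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989, the standard normal peak
```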
In the case of two variables, instead of a 2D bell-shaped curve, we will have a 3D bell curve as
shown below:
The probability density function would be given by:

f(x | μ, Σ) = (1 / √((2π)² |Σ|)) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

where x is the input vector, μ is the 2D mean vector, and Σ is the 2×2 covariance matrix. The
covariance would now define the shape of this curve. We can generalize the same for d-
dimensions.
Thus, this multivariate Gaussian model would have x and μ as vectors of length d, and Σ would
be a d x d covariance matrix.
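A sketch of this d-dimensional density in NumPy (the function name and the 2-D test values are illustrative assumptions):

```python
# d-dimensional Gaussian density, with x and mu length-d vectors
# and cov a d x d covariance matrix.
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

x = np.array([0.0, 0.0])
mu = np.array([0.0, 0.0])
cov = np.eye(2)
print(multivariate_gaussian_pdf(x, mu, cov))  # 1/(2π) ≈ 0.1592 at the mean
```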
Hence, for a dataset with d features, we would have a mixture of k Gaussian distributions
(where k is the number of clusters), each with its own mean vector and covariance
matrix.
The mean and variance values are determined using a technique called Expectation-
Maximization (EM).
When fitting a mixture model, we never observe which distribution each point came from;
these unobserved cluster assignments are called latent variables. It is difficult to determine
the right model parameters directly because of these missing variables.
Since we do not have the values for the latent variables, Expectation-Maximization tries to
use the existing data to determine the optimum values for these variables and then finds
the model parameters. Based on these model parameters, we go back and update the values
for the latent variable, and so on.
• E-step: In this step, the available data is used to estimate (guess) the values of the
missing variables
• M-step: Based on the estimated values generated in the E-step, the complete data is
used to update the parameters
Expectation-Maximization is the base of many algorithms, including Gaussian Mixture
Models.
Let’s understand this using another example. I want you to visualize the idea in your mind as
you read along. This will help you better understand what we’re talking about.
Let’s say we want to fit k clusters. This means that there are k Gaussian distributions, with
mean and covariance values μ1, μ2, …, μk and Σ1, Σ2, …, Σk.
Additionally, each distribution has a parameter that defines the fraction of points it accounts
for; in other words, the density (mixing proportion) of distribution i is represented by πi.
Now, we need to find the values for these parameters to define the Gaussian distributions.
We already decided the number of clusters, and randomly assigned the values for the mean,
covariance, and density. Next, we’ll perform the E-step and the M-step!
E-step:
For each point xi, calculate the responsibility ric – the probability that it belongs to
cluster/distribution c = 1, 2, …, k:

ric = πc N(xi | μc, Σc) / ∑j=1..k πj N(xi | μj, Σj)

This value is high when the point is well explained by distribution c, and low otherwise.
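A hedged sketch of the E-step for a one-dimensional mixture (the parameter values are illustrative, not fitted):

```python
# Compute E-step responsibilities for a 1-D Gaussian mixture.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def e_step(x, pi, mu, sigma):
    """Responsibility of each of the k components for point x."""
    weighted = [p * gaussian_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)]
    total = sum(weighted)
    return [w / total for w in weighted]

# A point at 0 is far more likely under the component centred at 0
r = e_step(0.0, pi=[0.5, 0.5], mu=[0.0, 5.0], sigma=[1.0, 1.0])
print(r)  # first responsibility close to 1; responsibilities sum to 1
```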
M-step:
Post the E-step, we go back and update the Π, μ and Σ values. These are updated in the
following manner:
1. The new density (mixing proportion) is the ratio of the effective number of points assigned
to the cluster to the total number of points:

πc = (∑i ric) / N

2. The mean and the covariance matrix are updated as responsibility-weighted averages, so a
data point that has a higher probability of being a part of that distribution contributes a
larger portion:

μc = (∑i ric xi) / (∑i ric)
Σc = (∑i ric (xi − μc)(xi − μc)ᵀ) / (∑i ric)
Based on the updated values generated from this step, we calculate the new probabilities for
each data point and update the values iteratively. This process is repeated in order to
maximize the log-likelihood function. Effectively, we can say that k-means considers only the
mean when updating a centroid, while GMM takes into account the mean as well as the
variance of the data!
REINFORCEMENT LEARNING
o Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and seeing the
results of those actions. For each good action, the agent gets positive feedback, and
for each bad action, the agent gets negative feedback or a penalty.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal
is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of
an agent in reinforcement learning is to improve the performance by getting the
maximum positive rewards.
o The agent learns by a process of hit and trial, and based on this experience, it
learns to perform the task in a better way. Hence, we can say that "Reinforcement
learning is a type of machine learning method where an intelligent agent (computer
program) interacts with the environment and learns to act within that environment."
Terms used in Reinforcement Learning
o Agent(): An entity that can perceive/explore the environment and act upon it.
o Environment(): A situation in which an agent is present or surrounded by. In RL, we
assume the stochastic environment, which means it is random in nature.
o Action(): Actions are the moves taken by an agent within the environment.
o State(): State is a situation returned by the environment after each action taken by
the agent.
o Reward(): A feedback returned to the agent from the environment to evaluate the
action of the agent.
o Policy(): Policy is a strategy applied by the agent for the next action based on the
current state.
o Value(): The expected long-term return with the discount factor, as opposed to the
short-term reward.
o Q-value(): It is mostly similar to the value, but it takes one additional parameter as a
current action (a).
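As a rough sketch of these terms in action, here is tabular Q-learning on a toy five-state chain environment. The environment, the reward of 1 at the rightmost state, and all hyperparameters are invented for illustration (a fairly high exploration rate keeps the toy example fast):

```python
# Toy Q-learning: agent, state, action, reward, and Q-value on a 1-D chain.
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
alpha, gamma, eps = 0.1, 0.9, 0.3

def step(state, action):
    """Environment dynamics: move along the chain, reward 1 at the goal."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

Q = [[0.0, 0.0] for _ in range(N_STATES)]
random.seed(0)
for _ in range(500):                      # episodes of hit-and-trial learning
    s, done = 0, False
    while not done:
        # Policy: epsilon-greedy over the current Q-values
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r, done = step(s, a)
        # Q-update: move Q(s, a) toward reward plus discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES - 1)]
print(policy)  # greedy action per non-terminal state; learning favours "right"
```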
1. Positive –
Positive reinforcement occurs when an event, occurring because of a particular
behavior, increases the strength and frequency of that behavior. In other
words, it has a positive effect on behavior.
Advantages:
• Maximizes performance
• Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states,
which can diminish the results.
2. Negative –
Negative reinforcement is the strengthening of a behavior because a
negative condition is stopped or avoided.
Advantages:
• Increases behavior
• Provides defiance of a minimum standard of performance
A drawback is that it provides only enough to meet the minimum behavior.
Practical applications of reinforcement learning include sequential decision-making problems
such as game-playing and robotics.
BAYESIAN NETWORKS
• "A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
• It is also called a Bayes network, belief network, decision network, or Bayesian model.
• Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
A Bayesian network can be used for building models from data and expert opinions, and it
consists of two parts:
o Causal component (the graph structure)
o Actual numbers (the conditional probabilities)
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent causal relationships or conditional probabilities
between random variables. These directed links connect pairs of nodes in the graph.
A link means that one node directly influences the other; if there is no directed link,
the nodes are independent of each other.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)),
which determines the effect of the parents on that node.
• Consider an alarm ‘A’ – a node installed in the house of a person ‘gfg’ – which
rings upon two events, burglary ‘B’ and fire ‘F’; these are the parent nodes of
the alarm node. The alarm node, in turn, is the parent of two person nodes,
‘P1’ and ‘P2’.
• Upon a burglary or a fire, ‘P1’ and ‘P2’ call the person ‘gfg’. But there are a few
caveats in this case: sometimes ‘P1’ may forget to call ‘gfg’, even after hearing
the alarm, as he has a tendency to forget things quickly. Similarly, ‘P2’
sometimes fails to call ‘gfg’, as he can only hear the alarm from a certain
distance.
Q) Find the probability that ‘P1’ is true (P1 has called ‘gfg’) and ‘P2’ is true (P2 has called ‘gfg’)
when the alarm ‘A’ rang, but neither a burglary ‘B’ nor a fire ‘F’ occurred.
=> P ( P1, P2, A, ~B, ~F) [ where- P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’
events]
[ Note: The values mentioned below are neither calculated nor computed; they are given
(observed) values. ]
Burglary ‘B’ –
• P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
• P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –
• P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
• P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
• The alarm ‘A’ node can be ‘true’ or ‘false’ ( i.e may have rung or may not have
rung). It has two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or
‘false’ (i.e may have occurred or may not have occurred) depending upon
different conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
• The person ‘P1’ node can be ‘true’ or ‘false’ (i.e may have called the person
‘gfg’ or not) . It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’
(i.e may have rung or may not have rung ,upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
• The person ‘P2’ node can be ‘true’ or false’ (i.e may have called the person ‘gfg’
or not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e
may have rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
Solution: Using the probability tables above –
With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of ‘P1’, which we
find with regard to its parent node, the alarm ‘A’. To get the probability of ‘P2’, we likewise
find it with regard to its parent node, the alarm ‘A’.
We find the probability of the alarm ‘A’ node with regard to ‘~B’ and ‘~F’, since burglary ‘B’
and fire ‘F’ are the parent nodes of alarm ‘A’.
From these tables, we can deduce –
P(P1, P2, A, ~B, ~F)
= P(P1 | A) × P(P2 | A) × P(A | ~B, ~F) × P(~B) × P(~F)
= 0.95 × 0.80 × 0.001 × 0.999 × 0.998
≈ 0.00076
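The same calculation can be reproduced in code, with the CPTs stored as plain dictionaries (the values are taken directly from the tables in this example):

```python
# Joint probability P(P1, P2, A, ~B, ~F) from the Bayesian network's CPTs.
p_b = {True: 0.001, False: 0.999}                 # P(Burglary)
p_f = {True: 0.002, False: 0.998}                 # P(Fire)
p_a = {(True, True): 0.95, (True, False): 0.94,   # P(A=T | B, F)
       (False, True): 0.29, (False, False): 0.001}
p_p1 = {True: 0.95, False: 0.05}                  # P(P1=T | A)
p_p2 = {True: 0.80, False: 0.01}                  # P(P2=T | A)

# Chain rule over the network: each node conditioned on its parents
joint = p_p1[True] * p_p2[True] * p_a[(False, False)] * p_b[False] * p_f[False]
print(joint)  # ≈ 0.000758
```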