Unit III ML Notes - R22 Regulation
Machine Learning (Jawaharlal Nehru Technological University, Hyderabad)
Machine Learning UNIT -III Notes

Decision Tree in Machine Learning


A decision tree is a type of supervised learning algorithm that is
commonly used in machine learning to model and predict
outcomes based on input data. It is a tree-like structure in which
each internal node tests an attribute, each branch corresponds
to an attribute value, and each leaf node represents the final
decision or prediction. Decision trees can be used to solve
both regression and classification problems.
Decision Tree Terminologies
There are specialized terms associated with decision trees that
denote various components and facets of the tree structure and
the decision-making procedure:
 Root Node: The topmost node of the tree; it represents the
first feature or choice from which the tree branches.
 Internal Nodes (Decision Nodes): Nodes whose tests are
based on the values of particular attributes. These nodes
have branches that lead to other nodes.
 Leaf Nodes (Terminal Nodes): The ends of the branches,
where the final decisions or predictions are made. Leaf
nodes have no further branches.
 Branches (Edges): Links between nodes that show how
decisions are made in response to particular conditions.
 Splitting: The process of dividing a node into two or
more sub-nodes based on a decision criterion. It involves
selecting a feature and a threshold to create subsets of
data.
 Parent Node: A node that is split into child nodes. The
original node from which a split originates.
 Child Node: Nodes created as a result of a split from a
parent node.
 Decision Criterion: The rule or condition used to
determine how the data should be split at a decision
node. It involves comparing feature values against a
threshold.
 Pruning: The process of removing branches or nodes
from a decision tree to improve its generalisation and
prevent overfitting.
Understanding these terminologies is crucial for interpreting and
working with decision trees in machine learning applications.
How Decision Tree is formed?
The process of forming a decision tree involves recursively
partitioning the data based on the values of different attributes.
The algorithm selects the best attribute to split the data at each
internal node, based on certain criteria such as information gain
or Gini impurity. This splitting process continues until a stopping
criterion is met, such as reaching a maximum depth or having a
minimum number of instances in a leaf node.
Why Decision Tree?
Decision trees are widely used in machine learning for a number
of reasons:
 Decision trees are versatile at modelling intricate
decision-making processes because of their
interpretability and flexibility.
 Their hierarchical structure makes it possible to portray
complex choice scenarios that take into account a variety
of causes and outcomes.
 Because they provide comprehensible insights into the
decision logic, decision trees are especially helpful for
classification and regression tasks.
 They are proficient with both numerical and categorical
data, and they can easily adapt to a variety of datasets
thanks to their autonomous feature selection capability.
 Decision trees also provide simple visualization, which
helps to comprehend and elucidate the underlying
decision processes in a model.
Decision Tree Approach
Decision tree uses the tree representation to solve the problem
in which each leaf node corresponds to a class label and
attributes are represented on the internal node of the tree. We
can represent any boolean function on discrete attributes using
the decision tree.

Constructing Decision Trees


(Figure: an example decision tree with decision nodes and leaf nodes,
where D1 to D6 denote the data subsets produced by the splits.)
Learning with Trees – Decision Trees – Constructing Decision Trees
Step 1: Given the data set, choose the target attribute.
Step 2: Calculate the information gain with entropy.
Step 3: With P positive and N negative examples in the data set,

IG = I(P, N) = −(P / (P + N)) · log₂(P / (P + N)) − (N / (P + N)) · log₂(N / (P + N))

For each attribute A, find the entropy E(A), where

E(A) = ∑ᵢ ((Pᵢ + Nᵢ) / (P + N)) · I(Pᵢ, Nᵢ)

(the entropy of an attribute is the probability-weighted sum of the
information of the subsets it produces).

Step 4: Calculate Gain(A) = IG − E(A); the attribute with the highest gain
is chosen for the split.

CART (Classification and Regression Trees) for Decision Tree
CART is a predictive algorithm used in Machine learning and it
explains how the target variable’s values can be predicted based
on other matters. It is a decision tree where each fork is split into
a predictor variable and each node has a prediction for the
target variable at the end.
The term CART serves as a generic term for the following
categories of decision trees:

 Classification Trees: The tree is used to determine which
“class” the target variable is most likely to fall into when
the target is categorical.
 Regression trees: These are used to predict a
continuous variable’s value.
In the decision tree, nodes are split into sub-nodes based on a
threshold value of an attribute. The root node is taken as the
training set and is split into two by considering the best attribute
and threshold value. Further, the subsets are also split using the
same logic. This continues till the last pure sub-set is found in
the tree or the maximum number of leaves possible in that
growing tree.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree
algorithm that is used for both classification and regression
tasks. It is a supervised learning algorithm that learns from
labelled data to predict unseen data.
 Tree structure: CART builds a tree-like structure
consisting of nodes and branches. The nodes represent
different decision points, and the branches represent the
possible outcomes of those decisions. The leaf nodes in
the tree contain a predicted class label or value for the
target variable.
 Splitting criteria: CART uses a greedy approach to split
the data at each node. It evaluates all possible splits and
selects the one that best reduces the impurity of the
resulting subsets. For classification tasks, CART uses Gini
impurity as the splitting criterion. The lower the Gini
impurity, the purer the subset is. For regression
tasks, CART uses residual (variance) reduction as the splitting
criterion: the greater the reduction in squared error, the better
the fit of the model to the data.
 Pruning: To prevent overfitting of the data, pruning is a
technique used to remove the nodes that contribute little
to the model accuracy. Cost complexity pruning and
information gain pruning are two popular pruning
techniques. Cost complexity pruning involves calculating
the cost of each node and removing nodes that have a
negative cost. Information gain pruning involves
calculating the information gain of each node and
removing nodes that have a low information gain.
How does the CART algorithm work?
The CART algorithm works via the following process:
 The best-split point of each input is obtained.

 Based on the best-split points of each input in Step 1, the
new “best” split point is identified.

 Split the chosen input according to the “best” split point.


 Continue splitting until a stopping rule is satisfied or no
further desirable splitting is available.

The CART algorithm uses Gini impurity to split the dataset into a
decision tree. It does so by searching for the best homogeneity
of the sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
The Gini index is a metric for classification tasks in CART. It is
based on the sum of squared probabilities of each class, and it
measures the probability of a specific element being wrongly
classified when chosen randomly; it is a variation of the Gini
coefficient. It works on categorical variables, produces outcomes
of either “success” or “failure”, and hence conducts binary
splitting only.
The degree of the Gini index varies from 0 to 1,
 Where 0 indicates that all the elements belong to a
single class (only one class exists there).
 Gini index close to 1 means a high level of impurity,
where each class contains a very small fraction of
elements, and
 A value of 1-1/n occurs when the elements are uniformly
distributed into n classes and each class has an equal
probability of 1/n. For example, with two classes, the Gini
impurity is 1 – 1/2 = 0.5.
Mathematically, we can write Gini Impurity as follows:
Gini = 1 − ∑_{i=1}^{n} (p_i)²

where p_i is the probability of an object being classified to a
particular class.
In conclusion, Gini impurity is the probability of misclassification,
assuming independent selection of the element and its class
based on the class probabilities.
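As a small illustration (a minimal sketch, not from the original notes), the Gini impurity of a list of class labels can be computed directly from the formula above:

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions in `labels`.
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure node
print(gini_impurity(["yes", "no", "yes", "no"]))    # 0.5 -> maximum impurity for two classes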
CART for Classification
A classification tree is an algorithm where the target variable is
categorical. The algorithm is then used to identify the “Class”
within which the target variable is most likely to fall.
Classification trees are used when the dataset needs to be split
into classes that belong to the response variable (like yes or no).
CART for classification is a decision tree learning algorithm that
creates a tree-like structure to predict class labels. The tree
consists of nodes, which represent different decision points, and
branches, which represent the possible results of those decisions.
Predicted class labels are present at each leaf node of the tree.
How Does CART for Classification Work?
CART for classification works by recursively splitting the training
data into smaller and smaller subsets based on certain criteria.
The goal is to split the data in a way that minimizes the impurity
within each subset. Impurity is a measure of how mixed up the
data is in a particular subset. For classification tasks, CART uses
Gini impurity
 Gini Impurity- Gini impurity measures the probability of
misclassifying a random instance from a subset labeled
according to the majority class. Lower Gini impurity
means more purity of the subset.
 Splitting Criteria- The CART algorithm evaluates all
potential splits at every node and chooses the one that
best decreases the Gini impurity of the resultant subsets.
This process continues until a stopping criterion is
reached, like a maximum tree depth or a minimum
number of instances in a leaf node.
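As a concrete, hedged example: scikit-learn's DecisionTreeClassifier implements an optimised version of CART, so a minimal sketch (the dataset and parameter values below are only illustrative) looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="gini" and binary splits mirror the CART formulation described above;
# max_depth and min_samples_leaf act as the stopping criteria.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))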
CART for Regression
A Regression tree is an algorithm where the target variable is
continuous and the tree is used to predict its value. Regression
trees are used when the response variable is continuous. For
example, if the response variable is the temperature of the day.
CART for regression is a decision tree learning method that
creates a tree-like structure to predict continuous target
variables. The tree consists of nodes that represent different
decision points and branches that represent the possible
outcomes of those decisions. Predicted values for the target
variable are stored in each leaf node of the tree.

How Does CART Work for Regression?


Regression CART works by splitting the training data recursively
into smaller subsets based on specific criteria. The objective is to
split the data in a way that minimizes the residual (squared)
error within each subset.
 Residual Reduction- Residual reduction is a measure of
how much the average squared difference between the
predicted values and the actual values of the target
variable decreases when the subset is split. The greater
the residual reduction, the better the model fits the data.
 Splitting Criteria- CART evaluates every possible split
at each node and selects the one that results in the
greatest reduction of residual error in the resulting
subsets. This process is repeated until a stopping
criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
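A corresponding sketch for regression, again using scikit-learn's CART-based implementation (the synthetic data and hyperparameters are illustrative assumptions): DecisionTreeRegressor chooses the split that most reduces the squared error, and each leaf predicts the mean of its training targets.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: a noisy sine wave.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# criterion="squared_error" selects the split with the greatest residual reduction
# (older scikit-learn versions call this criterion "mse").
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, min_samples_leaf=10)
reg.fit(X, y)
print(reg.predict([[1.5], [4.0]]))  # piecewise-constant predictions from the leaf means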
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d − 1) + Node(2^d) + ... + Node(2^(d+1) − 2) = 2 − 2^(d+1)
        endtree = 1
    else
        do i = 2^d − 1, 2^d, ..., 2^(d+1) − 2
            if Node(i) > −1
                Split tree
            else
                Node(2i+1) = −1
                Node(2i+2) = −1
            end if
        end do
    end if
    d = d + 1
end while

CART model representation


CART models are formed by picking input variables and
evaluating split points on those variables until an appropriate
tree is produced.
Steps to create a Decision Tree using the CART algorithm:
 Greedy algorithm: The input space is divided using a
greedy method known as recursive binary splitting. This is
a numerical procedure in which the values are lined up and
several candidate split points are tried and assessed using
a cost function.
 Stopping Criterion: As it works its way down the tree
with the training data, the recursive binary splitting
method described above must know when to stop
splitting. The most frequent halting method is to utilize a
minimum amount of training data allocated to every leaf
node. If the count is smaller than the specified threshold,
the split is rejected and also the node is considered the
last leaf node.
 Tree pruning: A decision tree’s complexity is defined as
the number of splits in the tree. Trees with fewer branches
are recommended as they are simple to grasp and less
prone to overfitting the data. Working through each leaf
node in the tree and evaluating the effect of deleting it
using a hold-out test set is the quickest and simplest
pruning approach.
 Data preparation for the CART: No special data
preparation is required for the CART algorithm.
POPULAR CART-BASED ALGORITHMS:
 CART (Classification and Regression Trees): The
original algorithm that uses binary splits to build decision
trees.
 C4.5 and C5.0: Related decision-tree algorithms
(successors of ID3) that allow for multiway splits and
handle categorical variables more effectively.
 Random Forests: Ensemble methods that use multiple
decision trees (often CART) to improve predictive
performance and reduce overfitting.
 Gradient Boosting Machines (GBM): Boosting
algorithms that also use decision trees (often CART) as
base learners, sequentially improving model
performance.
Advantages of CART
 Results are simple to interpret.
 Classification and regression trees are Nonparametric
and Nonlinear.
 Classification and regression trees implicitly perform
feature selection.
 Outliers have little meaningful effect on CART.
 It requires minimal supervision and produces easy-to-
understand models.
Limitations of CART
 Overfitting.
 High variance and low bias.
 The tree structure may be unstable: small changes in the
data can produce a very different tree.
Applications of the CART algorithm
 For quick Data insights.
 In Blood Donors Classification.
 For environmental and ecological data.
 In the financial sectors.

What is Ensemble Learning with examples?

 Ensemble learning is a machine learning technique that


combines the predictions from multiple individual models
to obtain a better predictive performance than any single
model. The basic idea behind ensemble learning is to
leverage the wisdom of the crowd by aggregating the
predictions of multiple models, each of which may have
its own strengths and weaknesses. This can lead to
improved performance and generalization.
 Ensemble learning can be thought of as compensating
for weak learning algorithms: an ensemble is computationally
more expensive than a single model, but it is often more
effective than a single non-ensemble model that has
undergone a lot of training. In this article, we will
have a comprehensive overview of the importance of
ensemble learning and how it works, different types of
ensemble classifiers, advanced ensemble learning
techniques, and some algorithms (such as random
forest, xgboost) for better clarification of the common
ensemble classifiers and finally their uses in the technical
world.
 Several individual base models (experts) are fitted to
learn from the same data and produce an aggregation of
output based on which a final decision is taken. These
base models can be machine learning algorithms such as
decision trees (mostly used), linear models, support
vector machines (SVM), neural networks, or any other
model that is capable of making predictions.
 Most commonly used ensembles include techniques such
as Bagging- used to generate Random Forest algorithms
and Boosting- to generate algorithms such
as Adaboost, Xgboost etc.
Ensemble Learning Techniques
 Gradient Boosting Machines (GBM): Gradient
Boosting is a popular ensemble learning technique that
sequentially builds a group of decision trees and corrects
the residual errors made by previous trees, enhancing its
predictive accuracy. It trains each new weak learner to fit
the residuals of the previous ensemble’s predictions thus
making it less sensitive to individual data points or
outliers in the data.
 Extreme Gradient Boosting (XGBoost): XGBoost
features tree pruning, regularization, and parallel
processing, which makes it a preferred choice for data
scientists seeking robust and accurate predictive models.

 CatBoost: It is designed to handle categorical features
natively, which eliminates the need for extensive
pre-processing. CatBoost is known for its high predictive
accuracy, fast training, and automatic handling of
overfitting.
 Stacking: It combines the output of multiple base
models by training a combiner(an algorithm that takes
predictions of base models) and generate more accurate
prediction. Stacking allows for more flexibility in
combining diverse models, and the combiner can be any
machine learning algorithm.
 Random Subspace Method (Random Subspace
Ensembles): It is an ensemble learning approach that
improves the predictive accuracy by training base
models on random subsets of input features. It mitigates
overfitting and improves the generalization by
introducing diversity in the model space.
 Random Forest Variants: They introduce variations in
tree construction, feature selection, or model
optimization to enhance performance.
Selecting the right advanced ensemble technique depends on
the nature of the data, the specific problem trying to be solved,
and the computational resources available. It often requires
experimentation and changes to achieve the best results.
Algorithm based on Bagging and Boosting
Bagging Algorithm
Bagging is a supervised learning technique that can be used for
both regression and classification tasks. Here is an overview of
the steps in the Bagging classifier algorithm:
 Bootstrap Sampling: Creates ‘N’ bootstrap samples by
randomly drawing rows from the original training data
with replacement, so each sample may contain duplicate
rows and omit others. This step ensures that the base
models are trained on diverse subsets of the data.
 Base Model Training: For each bootstrapped sample,
train a base model independently on that subset of data.
These weak models are trained in parallel to increase
computational efficiency and reduce time consumption.
 Prediction Aggregation: To make a prediction on
testing data combine the predictions of all base models.
For classification tasks, it can include majority voting or
weighted majority while for regression, it involves
averaging the predictions.


 Out-of-Bag (OOB) Evaluation: Some samples are


excluded from the training subset of particular base
models during the bootstrapping method. These “out-of-
bag” samples can be used to estimate the model’s
performance without the need for cross-validation.
 Final Prediction: After aggregating the predictions from
all the base models, Bagging produces a final prediction
for each instance.
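To make the bagging steps concrete, here is a minimal sketch using scikit-learn's BaggingClassifier (the dataset is synthetic and the settings are illustrative; recent scikit-learn versions take the base model via the estimator argument, older ones via base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a bootstrap sample; the out-of-bag samples give a
# built-in estimate of generalisation performance.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
bag.fit(X_train, y_train)
print("OOB score:", bag.oob_score_)
print("Test accuracy:", bag.score(X_test, y_test))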
Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak
learners to create a strong learner. The ensemble of weak
models are trained in series such that each model that comes
next, tries to correct errors of the previous model until the entire
training dataset is predicted correctly. One of the most well-
known boosting algorithms is AdaBoost (Adaptive Boosting).
Here are few popular boosting algorithm frameworks:
 AdaBoost (Adaptive Boosting): AdaBoost assigns
different weights to data points, focusing on challenging
examples in each iteration. It combines weighted weak
classifiers to make predictions.
 Gradient Boosting: Gradient Boosting, including
algorithms like Gradient Boosting Machines (GBM),
XGBoost, and LightGBM, optimizes a loss function by
training a sequence of weak learners to minimize the
residuals between predictions and actual values,
producing strong predictive models.
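The two boosting flavours above can be sketched with scikit-learn as follows (synthetic data and illustrative settings; XGBoost and LightGBM expose very similar interfaces):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# AdaBoost: re-weights the training points so later learners focus on hard examples.
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gradient boosting: each new tree fits the residual errors of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=1).fit(X_train, y_train)

print("AdaBoost accuracy:", ada.score(X_test, y_test))
print("GBM accuracy:", gbm.score(X_test, y_test))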
Uses of Ensemble Learning
Ensemble learning is a versatile approach that can be applied to
a wide range of machine learning problems such as:-
 Classification and Regression: Ensemble techniques are
applied to classification and regression problems in
various domains, including finance, healthcare,
marketing, and more.
 Anomaly Detection: Ensembles can be used to detect
anomalies in datasets by combining multiple anomaly
detection algorithms, thus making it more robust.
 Portfolio Optimization: Ensembles can be employed to
optimize investment portfolios by collecting predictions
from various models to make better investment
decisions.
 Customer Churn Prediction: In business and
marketing analytics, by combining the results of various
models capturing different aspects of customer


behaviour, ensembles can be used to predict customer churn.
 Medical Diagnostics: In healthcare, ensembles can be
used to make more accurate predictions of diseases
based on various medical data sources and diagnostic
models.
 Credit Scoring: Ensembles can be used to improve the
accuracy of credit scoring models by combining the
outputs of various credit risk assessment models.
 Climate Prediction: Ensembles of climate models help
in making more accurate and reliable predictions for
weather forecasting, climate change projections, and
related environmental studies.
 Time Series Forecasting: Ensemble learning combines
multiple time series forecasting models to enhance
accuracy and reliability, adapting to changing temporal
patterns.
What is Statistics?
Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data. It encompasses a wide range
of techniques for summarizing data, making inferences, and
drawing conclusions.
Statistical methods help quantify uncertainty and variability in
data, allowing researchers and analysts to make data-driven
decisions with confidence.
Types of Statistics
There are commonly two types of statistics, which are
discussed below:
• Descriptive Statistics: "Descriptive Statistics " helps us
simplify and organize big chunks of data. This makes large
amounts of data easier to understand.
• Inferential Statistics: "Inferential Statistics " is a little
different. It uses smaller data to draw conclusions about a
larger group. It helps us predict and draw conclusions about
a population.

• Descriptive Statistics: "Descriptive Statistics" helps us


simplify and organize big chunks of data. This makes large
amounts of data easier to understand.
• Mean is calculated by summing all values present in the
sample and dividing by the total number of values present in
the sample.
Mean(µ) = Sum of Values / Number of Values



Median is the middle value of a sample when arranged from
lowest to highest or highest to lowest; in order to find the
median, the data must be sorted.
• For an odd number of data points: Median = value at position (n + 1)/2.
• For an even number of data points: Median = average of the value at
position n/2 and the next value.
• Mode is the most frequently occurring value in the dataset.

Measures of Dispersion

• Range: The difference between the maximum and minimum values.

• Variance: The average squared deviation from the mean, representing data spread.

• Standard Deviation: The square root of variance, indicating data spread relative to the mean.

• Interquartile Range: The range between the first and third quartiles, measuring data spread
around the median.

Measures of Shape

• Skewness: Indicates data asymmetry.

• Kurtosis: Measures the peakedness of the data distribution.
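All of the descriptive measures listed above can be computed with NumPy and SciPy; the following is a small sketch on a made-up sample (the data values are purely illustrative, and the keepdims argument assumes a recent SciPy version):

import numpy as np
from scipy import stats

data = np.array([12, 15, 15, 18, 21, 22, 22, 22, 25, 30])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=False).mode)
print("Range:", np.ptp(data))                    # max - min
print("Variance:", np.var(data, ddof=1))         # sample variance
print("Std deviation:", np.std(data, ddof=1))
print("IQR:", stats.iqr(data))                   # Q3 - Q1
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))         # excess kurtosis (0 for a normal distribution)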


Inferential Statistics

Inferential statistics is used to make predictions by taking any group of data in which you are
interested. It can be defined as a random sample of data taken from a population to describe and
make inferences about the population. Any group of data that includes all the data you are
interested in is known as population. It basically allows you to make predictions by taking a small
sample instead of working on the whole population.

Population and Sample

• Population: The entire group being studied.

• Sample: A subset of the population used for analysis.

Estimation

• Point Estimation: Provides a single value estimate of a population parameter.

• Interval Estimation: Offers a range of values (confidence interval) within which the
parameter likely lies.

• Confidence Intervals: Indicate the reliability of an estimate.

Hypothesis Testing


• Null and Alternative Hypotheses: The null hypothesis assumes no effect or relationship, while
the alternative suggests otherwise.

• Type I and Type II Errors: Type I error is rejecting a true null hypothesis, while Type II is failing
to reject a false null hypothesis

• p-Values: Measure the probability of obtaining the observed results under the null
hypothesis.

• t-Tests and z-Tests: Compare means to assess statistical significance.

ANOVA (Analysis of Variance):

Compares means across multiple groups to determine if they differ significantly.

Chi-Square Tests:

Assess the association between categorical variables.

Correlation and Regression:

Understanding relationships between variables is critical in machine learning.

Correlation

• Pearson Correlation Coefficient: Measures linear relationship strength between two
variables.

• Spearman Rank Correlation: Assesses the strength and direction of the monotonic
relationship between variables.

Regression Analysis

• Simple Linear Regression: Models the relationship between two variables.

• Multiple Linear Regression: Extends to multiple predictors.

• Assumptions of Linear Regression: Linearity, independence, homoscedasticity, normality.

• Interpretation of Regression Coefficients: Explains predictor influence on the response
variable.

• Model Evaluation Metrics: R-squared, Adjusted R-squared, RMSE.

Bayesian Statistics

Bayesian statistics incorporate prior knowledge with current evidence to update beliefs.


Bayes' Theorem is a fundamental concept in probability theory that relates conditional probabilities.
It is named after the Reverend Thomas Bayes, who first introduced the theorem. Bayes' Theorem is a
mathematical formula that provides a way to update probabilities based on new evidence. The
formula is as follows:

P(A∣B) = P(B∣A) · P(A) / P(B), where

• P(A∣B): The probability of event A given that event B has occurred (posterior probability).

• P(B∣A): The probability of event B given that event A has occurred (likelihood).

• P(A): The probability of event A occurring (prior probability).

• P(B): The probability of event B occurring.
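A small numeric illustration of the formula (the probabilities below are made up purely for illustration, e.g. a hypothetical disease-screening test):

# Hypothetical screening-test example; numbers are illustrative only.
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): likelihood (test sensitivity)
p_pos_given_healthy = 0.05  # P(B|not A): false-positive rate

# P(B): total probability of a positive test.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161: the posterior is much lower than the sensitivity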

Applications of Statistics in Machine Learning

Statistics is a key component of machine learning, with broad applicability in various fields.

• Feature engineering relies heavily on statistics to convert geometric features into meaningful
predictors for machine learning algorithms.

• In image processing tasks like object recognition and segmentation, statistics accurately
reflect the shape and structure of objects in images.

• Anomaly detection and quality control benefit from statistics by identifying deviations from
norms, aiding in the detection of defects in manufacturing processes.

• Environmental observation and geospatial mapping leverage statistical analysis to monitor


land cover patterns and ecological trends effectively.

Gaussian Mixture Models

Normal or Gaussian Distribution

In real life, many datasets can be modeled by Gaussian


Distribution (Univariate or Multivariate). So it is quite natural and
intuitive to assume that the clusters come from different
Gaussian Distributions. In other words, the model tries to represent
the dataset as a mixture of several Gaussian Distributions. This is
the core idea of this model.
In one dimension the probability density function of a Gaussian
Distribution is given by


f(x) = (1 / (σ√(2π))) · exp(−(x − µ)² / (2σ²))

where µ and σ² are respectively the mean and variance of the
distribution. For a multivariate (let us say d-variate) Gaussian
Distribution, the probability density function is given by

f(x) = (1 / ((2π)^(d/2) · |∑|^(1/2))) · exp(−(1/2) (x − µ)ᵀ ∑⁻¹ (x − µ))

Here µ is a d-dimensional vector denoting the mean of the distribution
and ∑ is the d × d covariance matrix.

In probability theory and statistics, a normal distribution or Gaussian
distribution is a type of continuous probability distribution for a real-valued random
variable. The general form of its probability density function is the univariate
formula given above.
A random variable with a Gaussian distribution is said to be normally distributed,


and is called a normal deviate.
Normal distributions are important in statistics and are often used in
the natural and social sciences to represent real-valued random variables whose
distributions are not known. Their importance is partly due to the central limit
theorem. It states that, under some conditions, the average of many samples
(observations) of a random variable with finite mean and variance is itself a random
variable whose distribution converges to a normal distribution as the number of
samples increases. Therefore, physical quantities that are expected to be the sum of
many independent processes, such as measurement errors, often have distributions
that are nearly normal.

Moreover, Gaussian distributions have some unique properties that are valuable in
analytic studies. For instance, any linear combination of a fixed collection of
independent normal deviates is a normal deviate. Many results and methods, such
as propagation of uncertainty and least squares[5] parameter fitting, can be derived
analytically in explicit form when the relevant variables are normally distributed.

A normal distribution is sometimes informally called a bell curve. However, many


other distributions are bell-shaped (such as the Cauchy, Student's t,
and logistic distributions).

The univariate probability distribution is generalized for vectors in the multivariate


normal distribution and for matrices in the matrix normal distribution.
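Returning to Gaussian Mixture Models: scikit-learn's GaussianMixture fits such a mixture using the EM algorithm described later in this unit. Below is a minimal sketch on synthetic 1-D data (the cluster locations and sizes are illustrative assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic 1-D clusters drawn from different normal distributions.
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)]).reshape(-1, 1)

# Fit a two-component Gaussian mixture (parameters are estimated by EM internally).
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print("Means:", gmm.means_.ravel())          # close to 0 and 5
print("Variances:", gmm.covariances_.ravel())
print("Mixing weights:", gmm.weights_)       # close to 0.6 and 0.4
print("Cluster of x = 4.8:", gmm.predict([[4.8]]))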


In real-world machine learning applications, it is common to have


many relevant features, but only a subset of them may be
observable. When dealing with variables that are sometimes
observable and sometimes not, it is indeed possible to utilize the
instances when that variable is visible or observed in order to
learn and make predictions for the instances where it is not
observable. This approach is often referred to as handling
missing data. By using the available instances where the variable
is observable, machine learning algorithms can learn patterns
and relationships from the observed data. These learned
patterns can then be used to predict the values of the variable in
instances where it is missing or not observable.
The expectation-Maximization algorithm can be used to handle
situations where variables are partially observable. When certain
variables are observable, we can use those instances to learn
and estimate their values. Then, we can predict the values of
these variables in instances when it is not observable.
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative
optimization method that combines
different unsupervised machine learning algorithms to find
maximum likelihood or maximum posterior estimates of
parameters in statistical models that involve unobserved latent
variables. The EM algorithm is commonly used for latent variable
models and can handle missing data. It consists of an estimation
step (E-step) and a maximization step (M-step), forming an
iterative process to improve model fit.


 In the E-step, the algorithm computes the expectation of
the log-likelihood with respect to the latent variables,
using the current parameter estimates.

 In the M step, the algorithm determines the parameters


that maximize the expected log-likelihood obtained in the
E step, and corresponding model parameters are
updated based on the estimated latent variables.

Key Terms in Expectation-Maximization (EM) Algorithm


Some of the most commonly used key terms in the Expectation-
Maximization (EM) Algorithm are as follows:
 Latent Variables: Latent variables are unobserved
variables in statistical models that can only be inferred
indirectly through their effects on observable variables.
They cannot be directly measured but can be detected
by their impact on the observable variables.
 Likelihood: It is the probability of observing the given
data given the parameters of the model. In the EM
algorithm, the goal is to find the parameters that
maximize the likelihood.
 Log-Likelihood: It is the logarithm of the likelihood
function, which measures the goodness of fit between
the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
 Maximum Likelihood Estimation (MLE): MLE is a
method to estimate the parameters of a statistical model
by finding the parameter values that maximize the
likelihood function, which measures how well the model
explains the observed data.
 Posterior Probability: In the context of Bayesian
inference, the EM algorithm can be extended to estimate

the maximum a posteriori (MAP) estimates, where the


posterior probability of the parameters is calculated
based on the prior distribution and the likelihood
function.
 Expectation (E) Step: The E-step of the EM algorithm
computes the expected value or posterior probability of
the latent variables given the observed data and current
parameter estimates. It involves calculating the
probabilities of each latent variable for each data point.
 Maximization (M) Step: The M-step of the EM
algorithm updates the parameter estimates by
maximizing the expected log-likelihood obtained from
the E-step. It involves finding the parameter values that
optimize the likelihood function, typically through
numerical optimization methods.
 Convergence: Convergence refers to the condition
when the EM algorithm has reached a stable solution. It
is typically determined by checking if the change in the
log-likelihood or the parameter estimates falls below a
predefined threshold.
How Expectation-Maximization (EM) Algorithm Works:
The essence of the Expectation-Maximization algorithm is to use
the available observed data of the dataset to estimate the
missing data and then use that data to update the values of the
parameters. Let us understand the EM algorithm in detail.


EM Algorithm Flowchart

1. Initialization:
 Initially, a set of initial values of the parameters
are considered. A set of incomplete observed
data is given to the system with the assumption
that the observed data comes from a specific
model.
2. E-Step (Expectation Step): In this step, we use the
observed data in order to estimate or guess the values of
the missing or incomplete data. It is basically used to
update the variables.
 Compute the posterior probability or
responsibility of each latent variable given the
observed data and current parameter estimates.
 Estimate the missing or incomplete data values
using the current parameter estimates.
 Compute the log-likelihood of the observed data
based on the current parameter estimates and
estimated missing data.
3. M-step (Maximization Step): In this step, we use the
complete data generated in the preceding “Expectation”
– step in order to update the values of the parameters. It
is basically used to update the hypothesis.


 Update the parameters of the model by


maximizing the expected complete data log-
likelihood obtained from the E-step.
 This typically involves solving optimization
problems to find the parameter values that
maximize the log-likelihood.
 The specific optimization technique used
depends on the nature of the problem and the
model being used.
4. Convergence: In this step, it is checked whether the
values are converging or not, if yes, then stop otherwise
repeat step-2 and step-3 i.e. “Expectation” – step and
“Maximization” – step until the convergence occurs.
 Check for convergence by comparing the change
in log-likelihood or the parameter values between
iterations.
 If the change is below a predefined threshold,
stop and consider the algorithm converged.
 Otherwise, go back to the E-step and repeat the
process until convergence is achieved.
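The loop above can be written compactly for a mixture of two 1-D Gaussians. The following is a minimal sketch, not a production implementation: the initialisation, the fixed number of iterations, and the synthetic data are all illustrative assumptions.

import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=50):
    # Rough initial guesses for the means, standard deviations and mixing weights.
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = np.vstack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)

        # M-step: re-estimate the parameters from the responsibilities.
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.7, 200)])
print(em_two_gaussians(x))  # means converge near -2 and 3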
Applications of the EM algorithm
 It can be used to fill in the missing data in a sample.
 It can be used as the basis of unsupervised learning of
clusters.
 It can be used for the purpose of estimating the
parameters of the Hidden Markov Model (HMM).
 It can be used for discovering the values of latent
variables.
Advantages of EM algorithm
 It is always guaranteed that likelihood will increase with
each iteration.
 The E-step and M-step are often pretty easy for many
problems in terms of implementation.
 Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm
 It has slow convergence.
 It makes convergence to the local optima only.
 It requires both the probabilities, forward and backward
(numerical optimization requires only forward
probability).
What is the K-Nearest Neighbors
Algorithm?


KNN is one of the most basic yet essential classification


algorithms in machine learning. It belongs to the supervised
learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-
parametric, meaning it does not make any underlying
assumptions about the distribution of data (as opposed to other
algorithms such as GMM, which assume a Gaussian
distribution of the given data). We are given some prior data
(also called training data), which classifies coordinates into
groups identified by an attribute.

Why do we need a KNN algorithm?


The K-NN algorithm is a versatile and widely used machine learning
algorithm that is primarily used for its simplicity and ease of
implementation. It does not require any assumptions about the
underlying data distribution. It can also handle both numerical
and categorical data, making it a flexible choice for various types
of datasets in classification and regression tasks. It is a non-
parametric method that makes predictions based on the
similarity of data points in a given dataset. K-NN is less sensitive
to outliers compared to other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to
a given data point based on a distance metric, such as Euclidean
distance. The class or value of the data point is then determined
by the majority vote or average of the K neighbors. This


approach allows the algorithm to adapt to different patterns and


make predictions based on the local structure of the data.
Distance Metrics Used in KNN Algorithm
As we know that the KNN algorithm helps us identify the nearest
points or the groups for a query point. But to determine the
closest groups or the nearest points for a query point we need
some metric. For this purpose, we use below distance metrics:
Euclidean Distance
This is nothing but the cartesian distance between the two points
which are in the plane/hyperplane. Euclidean distance can also
be visualized as the length of the straight line that joins the two
points which are into consideration. This metric helps us
calculate the net displacement done between the two states of
an object.
distance(x, X_i) = √( ∑_{j=1}^{d} (x_j – X_ij)² )
Manhattan Distance
Manhattan Distance metric is generally used when we are
interested in the total distance traveled by the object instead of
the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-
dimensions.
d(x, y) = ∑_{i=1}^{n} ∣x_i − y_i∣
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan
distance, are special cases of the Minkowski distance .
d(x, y) = ( ∑_{i=1}^{n} ∣x_i − y_i∣^p )^(1/p)
From the formula above we can say that when p = 2 then it is
the same as the formula for the Euclidean distance and when p
= 1 then we obtain the formula for the Manhattan distance.
The above-discussed metrics are most common while dealing
with a Machine Learning problem but there are other distance
metrics as well like Hamming Distance which come in handy
while dealing with problems that require overlapping
comparisons between two vectors whose contents can be
Boolean as well as string values.
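The three metrics can be computed directly; a short sketch (the vectors are chosen only for illustration):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
print(euclidean(x, y))       # 5.0
print(manhattan(x, y))       # 7.0
print(minkowski(x, y, p=2))  # 5.0 -> reduces to the Euclidean distance
print(minkowski(x, y, p=1))  # 7.0 -> reduces to the Manhattan distance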

How to choose the value of k for KNN


Algorithm?
The value of k is very crucial in the KNN algorithm to define the
number of neighbors in the algorithm. The value of k in the k-
nearest neighbors (k-NN) algorithm should be chosen based on
the input data. If the input data has more outliers or noise, a
higher value of k would be better. It is recommended to choose

an odd value for k to avoid ties in classification. Cross-


validation methods can help in selecting the best k value for the
given dataset.
Algorithm for K-NN
DistanceToNN = sorted distances of the first k training examples from the test example
Values = target values of those k examples
for i = k+1 to number of training records:
    Dist = distance(test example, i-th example)
    if Dist < the largest distance in DistanceToNN:
        remove that farthest example from DistanceToNN and Values
        insert the new example into DistanceToNN and Values in sorted order
return average of Values
A fit using K-NN is usually more reasonable than 1-NN, because K-NN is
affected much less by noise when the dataset is large.
In the K-NN algorithm we can see jumps in the prediction values for a
small (unit) change in the input. The reason for this is the change in
the set of neighbors. To handle this situation, we can use weighting of
the neighbors in the algorithm: if the distance from a neighbor is high,
we want less effect from that neighbor; if the distance is low, that
neighbor should be more influential than the others.

Workings of KNN algorithm


The K-Nearest Neighbors (KNN) algorithm operates on the
principle of similarity, where it predicts the label or value of a
new data point by considering the labels or values of its K
nearest neighbors in the training dataset.


Step-by-Step explanation of how KNN works is discussed below:


Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that needs
to be considered while making prediction.
Step 2: Calculating distance
 To measure the similarity between target and training
data points, Euclidean distance is used. Distance is
calculated between each of the data points in the
dataset and target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the
target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression
 In the classification problem, the class labels of K-nearest
neighbors are determined by performing majority voting.
The class with the most occurrences among the
neighbors becomes the predicted class for the target
data point.
 In the regression problem, the predicted value is calculated
by taking the average of the target values of the K nearest
neighbors. The calculated average value becomes the
predicted output for the target data point.
Let X be the training dataset with n data points, where each data
point is represented by a d-dimensional feature vector X_i, and let Y
be the corresponding labels or values for each data point in X.
Given a new data point x, the algorithm calculates the distance
between x and each data point X_i in X using a distance metric,
such as Euclidean distance:

distance(x, X_i) = √( ∑_{j=1}^{d} (x_j – X_ij)² )
The algorithm selects the K data points from X that have the
shortest distances to x. For classification tasks, the algorithm


assigns the label y that is most frequent among the K nearest


neighbors to x. For regression tasks, the algorithm calculates the
average or weighted average of the values y of the K nearest
neighbors and assigns it as the predicted value for x.
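In practice the whole procedure is available in scikit-learn's KNeighborsClassifier; the following is a minimal sketch (dataset and parameter values are illustrative) that also uses the distance weighting mentioned earlier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# k = 5 neighbours, Euclidean distance (p=2); weights="distance" gives closer
# neighbours more influence on the vote.
knn = KNeighborsClassifier(n_neighbors=5, p=2, weights="distance")
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
print("Prediction for one sample:", knn.predict(X_test[:1]))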
Advantages of the KNN Algorithm
 Easy to implement as the complexity of the algorithm
is not that high.
 Adapts Easily – As per the working of the KNN algorithm
it stores all the data in memory storage and hence
whenever a new example or data point is added then the
algorithm adjusts itself as per that new example and has
its contribution to the future predictions as well.
 Few Hyperparameters – The only parameters which
are required in the training of a KNN algorithm are the
value of k and the choice of the distance metric which we
would like to choose from our evaluation metric.
Disadvantages of the KNN Algorithm
 Does not scale – The KNN algorithm is often called a Lazy
Algorithm: it stores the whole training set and defers all
computation to prediction time, so it needs a lot of computing
power as well as data storage. This makes the algorithm both
time-consuming and resource-exhausting.
 Curse of Dimensionality – Owing to the peaking phenomenon,
the KNN algorithm is affected by the curse of dimensionality,
which means it has a hard time classifying data points properly
when the dimensionality is too high.
 Prone to Overfitting – As the algorithm is affected due to
the curse of dimensionality it is prone to the problem of
overfitting as well. Hence generally feature selection as well
as dimensionality reduction techniques are applied to deal
with this problem.
