ML 1-6

Module # 1

Introduction to Machine
Learning
Introduction
 To solve a problem on a computer, we need an algorithm.
 An algorithm is a sequence of instructions that should be
carried out to transform the input to output.
 For example, one can devise an algorithm for sorting.
 For some tasks, however, we do not have an algorithm—for
example, to tell spam emails from legitimate emails.
 What we lack in knowledge, we make up for in data.
 We can easily compile thousands of example messages, some of which we
know to be spam, and learn from them what constitutes spam.
Machine Learning
Definition
 Machine learning is programming computers to optimize
a performance criterion using example data or past
experience.

 We have a model defined up to some parameters, and


learning is the execution of a computer program to
optimize the parameters of the model using the training
data or past experience.

 Machine learning is turning data into information.


Key Terminologies

• Feature / Attributes
• Classification
• Classes
• Training Set
• Test Set (80:20 rule)
• Knowledge Representation
Key Task of Machine Learning
 Classification
 In classification, our job is to predict what class an instance of data
should fall into.

 Regression
 Regression is the prediction of a numeric value.

 Clustering
 In unsupervised learning, there’s no label or target value given for the
data.
 A task where we group similar items together is known as clustering.
Types of Machine Learning
Machine Learning Classification

 Supervised Learning
 Regression: Linear, Multivariate
 Classification: Logistic, Trees, KNN, Naïve Bayes, SVM, Neural Network
 Unsupervised Learning
 Clustering: K-means, PCA
 Association: Apriori
 Reinforcement Learning
 Semi-Supervised Learning
Fundamental Issues in Machine Learning
 Specialized Learning versus General Learning
 For most input spaces of interest (not just language) the training data can never specify a
behavior for every possible input.
 Any learning system must consider some input-output relationships more likely than others
and this preference is called “bias”.
 The poverty of the stimulus, and the need for bias, is sometimes called the “no free lunch
theorem” — to generalize you must have a bias.
 Bayesians versus Frequentists
 In classical physics the world behaves deterministically — given the current state all future
states are determined.
 Given physical determinism, probability is used to represent uncertain beliefs about the
world.
 The interpretation of a probability as a degree of belief is fundamentally Bayesian.
 In quantum physics the objective world behaves randomly.
 The interpretation of probability as part of objective reality is fundamentally frequentist.
Issues in Machine Learning
 Which algorithm should be selected?
 How much training data is sufficient?
 When, and in what manner, should prior knowledge held by the learner
guide the process of generalization from examples?
 What is the best strategy for choosing a useful next training
experience, and how does the choice of this strategy alter the
complexity of the learning problem?
 What is the best way to reduce the learning task to one or more
function approximation problems?
 How can the learner automatically alter its representation to improve
its ability to represent and learn the target function?
Applications of Machine Learning
 Automating Employee Access Control
 Protecting Animals
 Predicting Emergency Room Wait Times
 Identifying Heart Failure
 Predicting Strokes and Seizures
 Predicting Hospital Readmissions
 Stop Malware
 Understand Legalese
 Improve Cybersecurity
 Get Ready For Smart Cars
{Likewise, other applications can be listed.}
How to choose the right Algorithm
 First, you need to consider your goal. What are you trying to get out of
this? (Do you want a probability that it might rain tomorrow, or do you
want to find groups of voters with similar interests?) What data do you
have or can you collect?
 If you’re trying to predict or forecast a target value, then you need to
look into supervised learning. If not, then unsupervised learning is the
place you want to be.
 If you’ve chosen supervised learning, what’s your target value?
 Is it a discrete value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black?
If so, then you want to look into classification.
 If the target value can take on a number of values, say any value from
0.00 to 100.00, or -999 to 999, or -∞ to +∞, then you need to look into
regression.
Contd…
 If you’re not trying to predict a target value, then you need to look
into unsupervised learning.

 Are you trying to fit your data into some discrete groups? If so and
that’s all you need, you should look into clustering.

 Do you need to have some numerical estimate of how strong the fit
is into each group? If you answer yes, then you probably should look
into a density estimation algorithm.

 You should spend some time getting to know your data, and the
more you know about it, the better you’ll be able to build a
successful application.
Steps in developing a machine learning
application
 Collect data
 Prepare the input data
 Analyse the input data
 Train the algorithm
 Test the algorithm
 Use it
Training Error and Testing Error
 There are two important concepts used in machine learning: the
training error and the test error.

 Training Error: We get this by calculating the classification error of a
model on the same data the model was trained on.

 Test Error: We get this by using two completely disjoint datasets:


one to train the model and the other to calculate the classification
error. Both datasets need to have values for y. The first dataset is
called training data and the second, test data.
Training Error and Testing Error
 Key takeaways

 In machine learning, training a predictive model means finding a


function which maps a set of values x to a value y.
 We can calculate how well a predictive model is doing by comparing
the predicted values with the true values for y.
 If we apply the model to the data it was trained on, we are
calculating the training error.
 If we calculate the error on data which was unknown in the training
phase, we are calculating the test error.
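
A minimal sketch of the two errors described above, using scikit-learn; the
synthetic dataset and the choice of a decision tree are illustrative, not
prescribed by the slides.

```python
# Training error vs. test error on disjoint datasets (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # the 80:20 split mentioned earlier

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)  # error on familiar data
test_error = 1 - model.score(X_test, y_test)     # error on unseen data
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```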
Generalization Error
 The term ‘generalization’ refers to a model’s ability to adapt and
react appropriately to previously unseen, fresh data chosen from
the same distribution as the model’s initial input.

 In other words, generalization assesses a model’s ability to


process new data and generate accurate predictions after being
trained on a training set.

 For supervised learning applications in machine learning and


statistical learning theory, generalization error (also known as
the out-of-sample error or the risk) is a measure of how
accurately an algorithm is able to predict outcome values for
previously unseen data.
Generalization Error

 A model’s ability to generalize is critical to its success.


 Over-training on training data will prevent a model from
generalizing.
 In such cases, when new data is supplied, it will make
inaccurate predictions.
 Even if the model is capable of making accurate
predictions based on the training data set, it will be
rendered ineffective.
Training Error vs Generalization Error
 Training Error: We get this by calculating the classification error of a
model on the same data the model was trained on.
 Generalization error: A measure of how accurately an algorithm is
able to predict outcome values for previously unseen data.
 Problematically, the generalization error cannot be calculated
exactly. In practice we estimate it using data that were not part of the
training set, but this is only an estimate, not the true value.
Overfitting and Underfitting
 When we talk about a machine learning model, we are really talking
about how well it performs; its accuracy is measured through its
prediction errors.
 Let us consider that we are designing a machine learning model.
 A model is said to be a good machine learning model if it generalizes
any new input data from the problem domain in a proper way.
 This helps us to make predictions about the future data, that the
data model has never seen.
 Now, suppose we want to check how well our machine learning
model learns and generalizes to new data.
 For that, we look at overfitting and underfitting, which are the major
causes of poor performance in machine learning algorithms.
Bias and Variance
 Bias: Assumptions made by a model to make a function easier to
learn. It is actually the error rate of the training data. When the
error rate has a high value, we call it High Bias and when the error
rate has a low value, we call it low Bias.
 Variance: The error rate of the testing data is called variance. When
the error rate has a high value, we call it High variance and when
the error rate has a low value, we call it Low variance.

Bias and Variance
 {Bias refers to the gap between the value predicted by the model and the actual value}
 Low Bias: Suggests less assumptions about the form of the target
function. {Less Gap between PV & AV}
 High-Bias: Suggests more assumptions about the form of the target
function. {More Gap between PV & AV}

 {Variance refers to how scattered the predicted values are in
relation to each other}
 Low Variance: Suggests small changes to the estimate of the target
function with changes to the training dataset. {less Scattered}
 High Variance: Suggests large changes to the estimate of the target
function with changes to the training dataset. {More Scattered}
Underfitting
 A statistical model or a machine learning algorithm is said to have
underfitting when it cannot capture the underlying trend of the data,
i.e., it performs poorly even on the training data, and therefore also
on the testing data.

 Underfitting destroys the accuracy of our machine learning model.


 Its occurrence simply means that our model or the algorithm does
not fit the data well enough.

 An underfitted model has high bias and low variance.


Underfitting
 It usually happens when we have too little data to build an accurate
model, or when we try to fit a linear model to non-linear data.
 In such cases, the rules of the machine learning model are too simple
to capture such data, and therefore the model will probably make a
lot of wrong predictions.
 Underfitting can be avoided by using more data and also reducing the
features by feature selection.
Underfitting
 Reasons for Underfitting:
 High bias and low variance
 The size of the training dataset used is not enough.
 The model is too simple.
 Training data is not cleaned and also contains noise in it.
 Techniques to reduce underfitting:
 Increase model complexity
 Increase the number of features, performing feature engineering
 Remove noise from the data.
 Increase the number of epochs or increase the duration of training to
get better results.
Overfitting
 A statistical model is said to be overfitted when the model does not
make accurate predictions on testing data.
 When a model gets trained on too much data, it starts learning from
the noise and inaccurate entries in our data set.
 Testing on the test data then results in high variance.
 The model does not categorize the data correctly, because it has
captured too many details and too much noise.
 The causes of overfitting are the non-parametric and non-linear
methods because these types of machine learning algorithms have
more freedom in building the model based on the dataset and
therefore they can really build unrealistic models.
Overfitting
 A solution to avoid overfitting is using a linear algorithm if we have
linear data or using the parameters like the maximal depth if we are
using decision trees.

 Very good training accuracy but very poor validation/testing accuracy.

 The overfitted model has low bias and high variance.


Overfitting
 Reasons for Overfitting are as follows:
 High variance and low bias
 The model is too complex
 The size of the training data
 Techniques to reduce overfitting:
 Increase training data.
 Reduce model complexity.
 Early stopping during the training phase (have an eye over the loss over
the training period as soon as loss begins to increase stop training).
 Ridge Regularization and Lasso Regularization
 Use dropout for neural networks to tackle overfitting.
Overfitting and Underfitting
Overfitting and Underfitting
 Use these steps to determine if your machine learning
model, deep learning model or neural network is
currently underfit or overfit.

 Ensure that you are using validation loss next to training


loss in the training phase.
 When your validation loss is decreasing, the model is still
underfit.
 When your validation loss is increasing, the model is
overfit.
 When your validation loss levels off (stays roughly constant), the model is
either well fit or stuck in a local minimum.
Bias and Variance Tradeoff
 The goal of any supervised machine learning algorithm is to achieve
low bias and low variance. In turn the algorithm should achieve good
prediction performance.
 You can see a general trend in the examples above:
 Linear machine learning algorithms often have a high bias but a low
variance.
 Nonlinear machine learning algorithms often have a low bias but a
high variance.
 The parameterization of machine learning algorithms is often a
battle to balance out bias and variance.
Bias and Variance Tradeoff
 Below are two examples of configuring the bias-variance
trade-off for specific algorithms:
 The k-nearest neighbours algorithm has low bias and high
variance, but the trade-off can be changed by increasing the
value of k which increases the number of neighbours that
contribute to the prediction and in turn increases the bias of
the model.
 The support vector machine algorithm has low bias and high
variance, but the trade-off can be changed by increasing the C
parameter that influences the number of violations of the
margin allowed in the training data which increases the bias
but decreases the variance.
Bias and Variance Tradeoff
Bias and Variance Tradeoff
 There is no escaping the relationship between bias and variance in
machine learning.
 Increasing the bias will decrease the variance.
 Increasing the variance will decrease the bias.
 There is a trade-off at play between these two concerns, and the
algorithms you choose and the way you choose to configure them
find different balances in this trade-off for your problem.
 In reality, we cannot calculate the real bias and variance error terms
because we do not know the actual underlying target function.
 Nevertheless, as a framework, bias and variance provide the tools to
understand the behaviour of machine learning algorithms in the
pursuit of predictive performance.
Module # 2

Learning with
Regression and Trees
Learning with Regression
Regression
 If two variables are closely related we may be interested in
estimating (predicting) the value of one variable given the value of
another.
 For example, if advertising and sales are correlated, we can find the
expected amount of sales for a given advertising expenditure, or the
required amount of expenditure for attaining a given amount of
sales.
 Similarly, if we know that the yield of rice and rainfall are closely
related we may find out the amount of rain required to achieve a
certain production figure.
 Regression analysis reveals the average relationship between two
variables, and this makes estimation or prediction possible.
 The dictionary meaning of the term 'regression' is the act of returning
or going back.
Regression Contd…
 The variable which is used to predict the variable of interest is called
the independent variable or explanatory variable, and the variable we
are trying to predict is called the dependent variable or explained
variable.
 The independent variable is denoted by X and the dependent variable
by Y.
 The analysis used is called the simple linear regression analysis-
simple because there is only one predictor or independent variable,
and linear because of the assumed linear relationship between the
dependent and independent variables.
Regression Contd…
Simple Regression Equation of Y on X
 The regression equation of Y on X is expressed as follows:
 Y=aX + b
 It may be noted that in this equation 'Y' is the dependent variable, i.e.,
its value depends on X. 'X' is the independent variable, i.e., we can take
a given value of X and compute the value of Y.
 'b' is the "Y-intercept" because its value is the point at which the
regression line crosses the Y-axis, that is, the vertical axis.
 'a' is the slope of the line. It represents the change in the Y variable
for a unit change in the X variable.
 'a' and 'b' in the equation are called numerical constants because for
any given straight line, their values do not change.
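
A minimal sketch of fitting Y = aX + b by least squares with NumPy; the
advertising/sales numbers are made up purely for illustration.

```python
# Simple linear regression of Y on X: estimate slope 'a' and intercept 'b'.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)   # e.g. advertising expenditure
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])      # e.g. observed sales

a, b = np.polyfit(X, Y, deg=1)               # degree-1 fit returns (slope, intercept)
print(f"Y = {a:.3f} * X + {b:.3f}")

x_new = 6.0
print("predicted Y for X = 6:", a * x_new + b)
```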
Sums # 1 and # 2 (worked example slides; figures not reproduced)
Multiple Linear Regression
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 Multiple regression is an extension of linear (OLS) regression, which
uses just one explanatory variable.
Multiple vs Multivariate Linear Regression
Sums - Multiple Linear Regression (worked example slides; figures not reproduced)
Sums - Multivariate Linear Regression (worked example slides; figures not reproduced)
Logistic Regression
 This type of statistical model (also known as logit model) is often
used for classification and predictive analytics.
 Logistic regression estimates the probability of an event occurring,
such as voted or didn’t vote, based on a given dataset of
independent variables.
 Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
Logistic Regression Contd…
Sums - Logistic Regression (worked example slides; figures not reproduced)
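
A minimal sketch of logistic regression with scikit-learn, assuming a toy
"voted / didn't vote" outcome predicted from age; the data is invented for
illustration only.

```python
# Logistic regression outputs a probability bounded between 0 and 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[21], [25], [30], [35], [42], [50], [58], [63]])  # age (illustrative)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])                          # voted or not

clf = LogisticRegression().fit(X, y)

print(clf.predict_proba([[40]])[0, 1])   # estimated P(voted) for a 40-year-old
print(clf.predict([[40]]))               # predicted class label (0 or 1)
```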
Learning with Trees
Decision Tree
 Decision Tree is the most powerful and popular tool for classification
and prediction.
 A Decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class
label.
Strengths and Weaknesses
 The strengths of decision tree methods are:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for
prediction or classification.
 The weaknesses of decision tree methods :
 Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
 Decision tree can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive since many candidate sub-trees
must be formed and compared.
Entropy in Information Theory

 Information entropy measures the amount of impurity in a
set of features.
 The entropy H of a set of probabilities pi is:
  H = - Σi pi log2(pi)
Contd…
Information Gain
 One important idea is to work out how much the entropy
of the whole training set would decrease if we choose
each particular feature for the next classification step.
This is known as information gain.
 Information gain is defined as the entropy of the whole
set minus the entropy when a particular feature is chosen.

 The ID3 algorithm computes this information gain for each


feature and chooses the one that produces the highest
value.
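
A minimal sketch (not the full ID3 algorithm) of the entropy and
information-gain calculations described above, i.e. Gain(S, F) = H(S) minus
the weighted entropy of the subsets obtained by splitting on feature F; the
toy "windy"/"play" data is illustrative.

```python
# Entropy and information gain with base-2 logarithms.
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting
    on the given feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: how much does knowing 'windy' reduce uncertainty about 'play'?
windy = ['no', 'no', 'yes', 'yes', 'no', 'yes']
play  = ['yes', 'yes', 'no', 'no', 'yes', 'yes']
print(information_gain(windy, play))
```

ID3 would compute this gain for every candidate feature and split on the one
with the highest value, as stated above.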
Sum – ID3 (worked example slides; figures not reproduced)
Sum – Gini (worked example slides; figures not reproduced)
Sum – CART (worked example slides; figures not reproduced)
Performance Metrics
Performance Metrics for Regression
 There are three error metrics that are commonly used for evaluating
and reporting the performance of a regression model; they are:

 Mean Squared Error (MSE).


 Root Mean Squared Error (RMSE).
 Mean Absolute Error (MAE)
MAE, MAPE, MSE, RMSE and R2

 The Mean absolute error (MAE) represents the average of the


absolute difference between the actual and predicted values in the
dataset. It measures the average of the residuals in the dataset.
MAE, MAPE, MSE, RMSE and R2

 The mean absolute percentage error (MAPE), also known as mean


absolute percentage deviation (MAPD), is a measure of prediction
accuracy of a forecasting method in statistics. It usually expresses
the accuracy as a ratio defined by the formula:
MAE, MAPE, MSE, RMSE and R2

 Mean Squared Error (MSE) represents the average of the squared


difference between the original and predicted values in the data set.
It measures the variance of the residuals.

 Root Mean Squared Error (RMSE) is the square root of Mean Squared
error. It measures the standard deviation of residuals
MAE, MAPE, MSE, RMSE and R2

 The coefficient of determination or R-squared represents the


proportion of the variance in the dependent variable which is
explained by the linear regression model. It is a scale-free score i.e.
irrespective of the values being small or large, the value of R square
will be less than one.
MAE, MAPE, MSE, RMSE and R2
 Mean Squared Error (MSE) and Root Mean Squared Error penalize
large prediction errors more heavily than Mean Absolute Error (MAE) does.
However, RMSE is more widely used than MSE for comparing the performance of a
regression model with other models, as it has the same units
as the dependent variable (Y-axis).
 MSE is a differentiable function, which makes it easy to perform
mathematical operations on, in comparison to a non-differentiable
function like MAE. Therefore, in many models, RMSE is used as the
default metric for the loss function despite being harder to
interpret than MAE.
 MAE is more robust to data with outliers.
 The lower value of MAE, MSE, and RMSE implies higher accuracy of a
regression model. However, a higher value of R square is considered
desirable.
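
A minimal sketch computing MAE, MAPE, MSE, RMSE and R² with NumPy; the
actual/predicted values are illustrative only.

```python
# Regression error metrics from residuals (y_true - y_pred).
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.6])

residuals = y_true - y_pred
mae  = np.mean(np.abs(residuals))                       # average absolute residual
mape = np.mean(np.abs(residuals / y_true)) * 100        # as a percentage
mse  = np.mean(residuals ** 2)                          # variance of residuals
rmse = np.sqrt(mse)                                     # same units as Y
r2   = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f} MAPE={mape:.2f}% MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```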
Performance Metrics for Classification
 Classification is a type of supervised machine learning problem where
the goal is to predict, for one or more observations, the category or
class they belong to.
 An important element of any machine learning workflow is the
evaluation of the performance of the model. This is the process
where we use the trained model to make predictions on previously
unseen, labelled data. In the case of classification, we then evaluate
how many of these predictions the model got right.
 In real-world classification problems, it is usually impossible for a
model to be 100% correct. When evaluating a model it is, therefore,
useful to know, not only how wrong the model was, but in which way
the model was wrong.
Performance Metrics for Classification
 7 Metrics to Measure Classification Performance

 Accuracy
 Confusion Matrix
 Precision
 Recall
 F1 score
 AUC/ROC
 Kappa
Performance Metrics for Classification
 Accuracy
 The overall accuracy of a model is simply the number of correct predictions
divided by the total number of predictions. An accuracy score will give a
value between 0 and 1, a value of 1 would indicate a perfect model.

 This metric should rarely be used in isolation, as on imbalanced data, where


one class is much larger than another, the accuracy can be highly
misleading.
 Consider a cancer-detection example: imagine we have a dataset where only
1% of the samples are cancerous. A classifier that simply predicts all
outcomes as benign would achieve an accuracy score of 99%. However, this
model would, in fact, be useless and dangerous as it would never detect a
cancerous observation.
Performance Metrics for Classification
 Confusion Matrix
 A confusion matrix is an extremely useful tool to observe in which
way the model is wrong (or right!). It is a matrix that compares the
number of predictions for each class that are correct and those that
are incorrect.
 In a confusion matrix, there are 4 numbers to pay attention to.
 True positives: The number of positive observations the model
correctly predicted as positive.
 False-positive: The number of negative observations the model
incorrectly predicted as positive.
 True negative: The number of negative observations the model
correctly predicted as negative.
 False-negative: The number of positive observations the model
incorrectly predicted as negative.
Performance Metrics for Classification
 The image below shows a confusion
matrix for a classifier. Using this we can
understand the following:
 The model correctly predicted 3,383
negative samples but incorrectly
predicted 46 as positive.
 The model correctly predicted 962
positive observations but incorrectly
predicted 89 as negative.
 We can see from this confusion matrix
that the data sample is imbalanced, with
the negative class having a higher volume
of observations.
Performance Metrics for Classification
 Precision
 Precision measures how good the model is at correctly identifying
the positive class. In other words, out of all predictions for the
positive class, how many were actually correct? Using this
metric alone to optimise a model, we would be minimising false
positives. This might be desirable for a fraud detection example,
but would be less useful for diagnosing cancer, as we would have little
understanding of the positive observations that are missed.
Performance Metrics for Classification
 Recall
 Recall tells us how good the model is at correctly predicting all the
positive observations in the dataset. However, it does not include
information about the false positives, so it would be more useful in the
cancer example.

 Usually, precision and recall are observed together by constructing a


precision-recall curve. This can help to visualise the trade-offs
between the two metrics at different thresholds.
Performance Metrics for Classification
 The F1 score is the harmonic mean of precision and recall. The F1
score will give a number between 0 and 1. If the F1 score is 1.0 this
indicates perfect precision and recall. If the F1 score is 0 this means
that either the precision or the recall is 0.
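
A minimal sketch of accuracy, the confusion matrix, precision, recall and F1
with scikit-learn; the true/predicted labels below are illustrative only.

```python
# Classification metrics on a toy set of binary labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))  # [[TN FP] [FN TP]]
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1 score :", f1_score(y_true, y_pred))            # harmonic mean of the two
```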
Module # 3

Ensemble Learning
K-Fold Validation
K-Fold Cross Validation
 Machine learning model performance assessment is much like assessing
exam scores.

 What is the accuracy and performance of the model?

 Accuracy is just a number: the fraction of the available records for
which the predictions made by the model are correct.
 So we need to train the model across different combinations of data.
K-Fold
 In each iteration (fold), training and testing are performed exactly once
during this entire process.
 It helps us to avoid overfitting.
 As we know, when a model is trained using all of the data in a single
shot, it can appear to give the best accuracy.
 To resist this, k-fold cross-validation helps us to build a model that
generalizes.
 To achieve this with K-Fold Cross Validation, we split the data set
into three sets: Training, Testing, and Validation.
K-Fold
 In K-Fold Validation, the train and test data sets support model building
and hyperparameter assessment.

 The model is validated multiple times based on the value assigned as a
parameter, which is called K and should be an INTEGER.

 Put simply, based on the K value, the data set is divided, and
training/testing is conducted in sequence K times.
K-Fold Validation
K-Fold Validation
 The general process of k-fold cross-validation for evaluating a model’s
performance is:

 The whole dataset is randomly split into independent k-folds without


replacement.
 k-1 folds are used for the model training and one fold is used for
performance evaluation.
 This procedure is repeated k times (iterations) so that we obtain k
number of performance estimates (e.g. MSE) for each iteration.
 Then we get the mean of k number of performance estimates (e.g.
MSE).
K-Fold Validation
K-Fold Validation
 Remark 1: The splitting process is done without replacement. So, each
observation will be used for training and validation exactly once.
 Remark 2: Good standard values for k in k-fold cross-validation are 5
and 10. However, the value of k depends on the size of the dataset.
For small datasets, we can use higher values for k. However, larger
values of k will also increase the runtime of the cross-validation
algorithm and the computational cost.
 Remark 3: When k=5, 20% of the data set is held back for testing each time.
When k=10, 10% of the data set is held back for testing each time, and so on…
 Remark 4: A special case of k-fold cross-validation is the Leave-one-
out cross-validation (LOOCV) method in which we set k=n (number of
observations in the dataset). Only one training sample is used for
testing during each iteration. This method is very useful when working
with very small datasets.
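
A minimal sketch of 5-fold cross-validation with scikit-learn; the iris
dataset and logistic-regression estimator are illustrative choices.

```python
# k-fold cross-validation: k performance estimates and their mean.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # split without replacement

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())   # mean of the k performance estimates
```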
Ensemble Learning
Ensemble Learning
 An ensemble is a machine learning model that combines the predictions
from two or more models.
 The models that contribute to the ensemble, referred to as ensemble
members,
 May be the same type or different types and may or may not be trained on
the same training data.

Definition :
 Ensemble learning is the process by which multiple models, such as
classifiers or experts, are strategically generated and combined to solve
a particular computational intelligence problem.
 Ensemble learning is primarily used to improve the (classification,
prediction, function approximation, etc.) performance of a model, or
reduce the likelihood of an unfortunate selection of a poor one.
 Imagine the fable of the blind men and the elephant. Each of the blind men
had his own description of the elephant. Even though each description was
true, it would have been better for them to come together and discuss their
understanding before coming to a final conclusion. This story perfectly
describes the ensemble learning method.
Ensemble learning Types / Ways to Combine Classifiers
Ensemble learning Types
Parallel Ensemble Learning(Bagging)
 Bagging is a machine learning ensemble meta-algorithm intended to improve the stability
and accuracy of machine learning algorithms used for classification and regression.
It additionally helps to avoid over-fitting.
 Bagging methods are parallel ensemble methods, where the base learners are generated in parallel.
 Algorithms: Random Forest, Bagged Decision Trees, Extra Trees
Parallel Ensemble Learning(Bagging)
 “Standard” bagging: each of the T subsamples has size n and is created with
replacement.

 “Sub-bagging”: create T subsamples of size α only (α < n).


Bagging
 A Bagging classifier is an ensemble meta-estimator that fits base
classifiers each on random subsets of the original dataset and then
aggregate their individual predictions (either by voting or by
averaging) to form a final prediction.
 Such a meta-estimator can typically be used as a way to reduce the
variance of a black-box estimator (e.g., a decision tree), by
introducing randomization into its construction procedure and then
making an ensemble out of it.
 Each base classifier is trained in parallel with a training set which is
generated by randomly drawing, with replacement, N examples(or
data) from the original training dataset, where N is the size of the
original training set.
Bagging
 The training set for each of the base classifiers is independent of each
other.
 Many of the original data may be repeated in the resulting training set
while others may be left out.
 Bagging reduces overfitting (variance) by averaging or voting, however,
this leads to an increase in bias, which is compensated by the
reduction in variance though.
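
A minimal sketch of a bagging ensemble of decision trees with scikit-learn;
the synthetic dataset and the parameter values are illustrative.

```python
# Bagging: many trees, each trained on a bootstrap sample, combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # the black-box base estimator
    n_estimators=50,            # 50 independently trained trees
    bootstrap=True,             # draw training samples with replacement
    random_state=0,
).fit(X_train, y_train)

print("bagging accuracy:", bag.score(X_test, y_test))
```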
Bagging
Random Forest
 Random Forest is a popular machine learning algorithm that belongs to
the supervised learning technique.
 It can be used for both Classification and Regression problems in ML.
 As the name suggests, "Random Forest is a classifier that contains a
number of decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that
dataset."
 Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of these
predictions, predicts the final output.
 The greater number of trees in the forest leads to higher accuracy
and prevents the problem of overfitting.
Random Forest
Random Forest
 Why use Random Forest?

 Below are some points that explain why we should use the Random
Forest algorithm:
 It takes less training time as compared to other algorithms.
 It predicts output with high accuracy, even for the large dataset it runs
efficiently.
 It can also maintain accuracy when a large proportion of data is
missing.
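
A minimal sketch of a random forest classifier with scikit-learn; the dataset
and hyperparameters are illustrative.

```python
# Random forest: bagged decision trees with random feature subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # more trees generally improves accuracy (at some cost)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X_train, y_train)

print("random forest accuracy:", forest.score(X_test, y_test))
```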
Sequential Ensemble learning (Boosting)
 Boosting is a machine learning ensemble meta-algorithm for principally
reducing bias, and also variance, in supervised learning, and a
family of machine learning algorithms that convert weak learners to
strong ones.
 Boosting methods are sequential ensemble methods, where the base learners
are generated sequentially.
 Examples: AdaBoost, Stochastic Gradient Boosting
Boosting
 Boosting is an ensemble modelling, technique that attempts to build a
strong classifier from the number of weak classifiers.
 It is done by building a model by using weak models in series.
 Firstly, a model is built from the training data.
 Then the second model is built which tries to correct the errors
present in the first model.
 This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum
number of models are added.
Boosting
Gradient Boosting
 Gradient Boosting is a popular boosting algorithm. In gradient
boosting, each predictor corrects its predecessor’s error.

 In contrast to Adaboost, the weights of the training instances are not


tweaked, instead, each predictor is trained using the residual errors of
predecessor as labels.

 There is a technique called the Gradient Boosted Trees whose base


learner is CART (Classification and Regression Trees).
XGBoost
 XGBoost is an implementation of Gradient Boosted decision trees.
 In this algorithm, decision trees are created in sequential form.
 Weights play an important role in XGBoost.
 Weights are assigned to all the independent variables which are then
fed into the decision tree which predicts results.
 The weight of variables predicted wrong by the tree is increased and
these variables are then fed to the second decision tree.
 These individual classifiers/predictors then ensemble to give a strong
and more precise model.
 It can work on regression, classification, ranking, and user-defined
prediction problems.
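
A minimal sketch of sequential boosting using scikit-learn's
GradientBoostingClassifier (XGBoost exposes a similar, optimized API); the
dataset and parameter values are illustrative.

```python
# Gradient boosting: small trees added one after another, each correcting
# the residual errors of its predecessors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,    # number of sequentially added trees
    learning_rate=0.1,   # how strongly each tree corrects its predecessor
    max_depth=3,         # each base learner is a small CART tree
    random_state=0,
).fit(X_train, y_train)

print("gradient boosting accuracy:", gbm.score(X_test, y_test))
```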
Bagging vs Boosting
1. Bagging: The simplest way of combining predictions that belong to the same type.
   Boosting: A way of combining predictions that belong to different types.
2. Bagging: Aims to decrease variance, not bias.
   Boosting: Aims to decrease bias, not variance.
3. Bagging: Each model receives equal weight.
   Boosting: Models are weighted according to their performance.
4. Bagging: Each model is built independently.
   Boosting: New models are influenced by the performance of previously built models.
5. Bagging: Different training data subsets are selected using row sampling with replacement and random sampling from the entire training dataset.
   Boosting: Every new subset contains the elements that were misclassified by previous models.
6. Bagging: Tries to solve the over-fitting problem.
   Boosting: Tries to reduce bias.
7. Bagging: If the classifier is unstable (high variance), then apply bagging.
   Boosting: If the classifier is stable and simple (high bias), then apply boosting.
8. Bagging: Base classifiers are trained in parallel.
   Boosting: Base classifiers are trained sequentially.
9. Bagging: Example: the Random Forest model uses bagging.
   Boosting: Example: AdaBoost uses boosting.
Stacking & Blending
 Stacking is a way of combining multiple models that introduces the
concept of a meta learner. It is less widely used than bagging and boosting.
Unlike bagging and boosting, stacking may be (and normally is) used to
combine models of different types.
The procedure is as follows:
1. Split the training set into two disjoint sets.
2. Train several base learners on the first part.
3. Test the base learners on the second part.
4. Using the predictions from step 3 as the inputs, and the correct responses
as the outputs, train a higher-level learner.
 Example: Voting Classifier
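
A minimal sketch of stacking with scikit-learn: two base learners of
different types plus a logistic-regression meta learner; the dataset and the
particular estimators are illustrative choices.

```python
# Stacking: base-learner predictions become inputs to a meta learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the meta learner
    cv=5,  # meta-learner inputs come from cross-validated base predictions
).fit(X, y)

print("stacking accuracy on training data:", stack.score(X, y))
```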

 Blending is a technique where we do a weighted averaging of the final results.


Decision Stump
 A decision stump is a machine learning model consisting of a one-level
decision tree.
 That is, it is a decision tree with one internal node (the root) which is
immediately connected to the terminal nodes (its leaves).
 A decision stump makes a prediction based on the value of just a single
input feature.
 Sometimes they are also called 1-rules.
Chapter # 4

Support Vector Machine


Introduction

 Basic idea of support vector machines:

 – Optimal hyperplane for linearly separable patterns

 – Extend to patterns that are not linearly separable by


transformations of original data to map into new space – the Kernel
function

 - Use of quadratic optimization problem to avoid ‘local minimum’


issues
Contd…

 Support vectors are the data points that lie closest to the
decision surface (or hyperplane)

 • They are the data points most difficult to classify

 • They have direct bearing on the optimum location of the


decision surface
Contd…
Contd…
Maximum Margin Separators
Contd…

 In SVMs, the decision boundary has the special property that it is


as far away as possible from both the positive and the negative
examples.
 The distance of the decision boundary to the nearest example is
called the margin.
 Since SVMs maximize this margin, it is often called a Large Margin
Classifier.
 The SVM will separate the negative and positive examples by a
large margin.
 Data is linearly separable when a straight line can separate the
positive and negative examples
Quadratic Programming Problem
Constrained Optimization
 Constrained optimization is the process of optimizing an objective
function with respect to some variables in the presence of
constraints on those variables.
 The objective function is either
 a cost function or energy function which is to be minimized, or
 a reward function or utility function, which is to be maximized.
 Constraints can be either
 hard constraints which set conditions for the variables that are
required to be satisfied, or
 soft constraints which have some variable values that are penalized
in the objective function if the conditions on the variables are not
satisfied.
Contd…
Linear and Non-linear Classification
 Now looking back at what we’ve derived, it is clear that we are
only using w.x+b. This is simply only a linear equation. That means
SVM works best when you can classify the data linearly!

 That is another really huge limitation! However, the authors
found a workaround for this, and that's the kernel trick.

 In simplistic terms:
 The kernel implicitly maps the non-linear data points into a
higher-dimensional space where they become linearly separable, so that the
SVM can split the two classes.
Contd…
Kernels
 There are several kernel functions used for SVMs. Some of the
popular ones are:
 Gaussian Radial Basis Function (RBF):
  K(x, x') = exp(-γ ||x - x'||²)
  where γ > 0.
  A special case is γ = 1/(2σ²).

 Gaussian Kernel:
  K(x, x') = exp(-||x - x'||² / (2σ²))
Kernels
 Polynomial Kernel:
  K(x, x') = (x · x' + c)^d

 Sigmoid kernel:
  K(x, x') = tanh(α x · x' + c)
Support Vector Regression
 Support Vector Regression is a supervised learning algorithm that is
used to predict continuous values.
 Support Vector Regression uses the same principle as SVMs.
 The basic idea behind SVR is to find the best-fit line.
 In SVR, the best-fit line is the hyperplane that contains the maximum
number of points (within the ε-threshold).
Support Vector Regression
 Unlike other Regression models that try to minimize the error
between the real and predicted value, the SVR tries to fit the best
line within a threshold value.
 The threshold value is the distance between the hyperplane and
boundary line.
 The fit time complexity of SVR is more than quadratic in the
number of samples, which makes it hard to scale to datasets with
more than a couple of tens of thousands of samples.
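
A minimal sketch of support vector regression with scikit-learn; the noisy
sine data and the hyperparameters (C, epsilon) are illustrative.

```python
# SVR fits a function within an epsilon-wide tube around the data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)   # noisy target values

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon sets the threshold tube
svr.fit(X, y)

print("predicted value at x = 2.5:", svr.predict([[2.5]])[0])
```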
Support Vector Regression
Multiclass Classification
 In its most basic type, SVM doesn’t support multiclass classification.
 For multiclass classification, the same principle is utilized after breaking
down the multi-classification problem into smaller subproblems, all of
which are binary classification problems.

 The popular methods which are used to perform multi-classification on


the problem statements using SVM are as follows:

 👉 One vs One (OVO) approach

 👉 One vs All (OVA) approach

 👉 Directed Acyclic Graph (DAG) approach


One vs One (OVO) approach
 This technique breaks down our multiclass classification problem
into subproblems which are binary classification problems.
 So, with this strategy, we get one binary classifier per pair of
classes.
 For the final prediction on any input, we use majority voting, with the
distance from the margin as the confidence criterion.

 👉 The major problem with this approach is that we have to train


too many SVMs.
One vs One (OVO) approach
 Let’s have an example of 3 class
classification problem: Green, Red,
and Blue.

 In the One-to-One approach, we try to


find the hyperplane that separates
between every two classes, neglecting
the points of the third class.
 For example, here the Red-Blue line tries
to maximize the separation only
between blue and red points, while it
has nothing to do with the green
points.
One vs All (OVA)
 In this technique, if we have N class problem, then we learn N
SVMs:
 SVM number -1 learns “class_output = 1” vs “class_output ≠ 1″
 SVM number -2 learns “class_output = 2” vs “class_output ≠ 2″
 :
 SVM number -N learns “class_output = N” vs “class_output ≠ N”

 Then to predict the output for a new input, just predict with each of
the built SVMs and then find which one puts the prediction the
farthest into the positive region (this behaves as a confidence criterion
for a particular SVM).
One vs All (OVA)
 In the One vs All approach, we try to find a hyperplane to
separate the classes. This means the separation takes all points
into account and then divides them into two groups in which there
is a group for the one class points and the other group for all other
points.
 For example, here, the green line tries to maximize the gap
between the green points and all other points at once.
One vs All (OVA)
 There are some challenges to train these N SVMs, which are:

 1. Too much Computation: To implement the OVA strategy, we


require more training points which increases our computation.

 2. The problem becomes unbalanced: Suppose you are working on
the MNIST dataset, in which there are 10 classes from 0 to 9 and
we have 1000 points per class. Then, for any one of the SVMs having
two classes, one class will have 9000 points and the other will have
only 1000 data points, so our problem becomes unbalanced.
Multiclass Classification
 NOTE: A single SVM does binary classification and can differentiate
between two classes. So according to the two above approaches,
to classify the data points from L classes data set:

 👉 In the One vs All approach, the classifier can use L SVMs.

 👉 In the One vs One approach, the classifier can use L(L-1)/2


SVMs.
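
A minimal sketch of multiclass SVM with scikit-learn: SVC trains One-vs-One
classifiers internally, and OneVsRestClassifier wraps an SVC for the
One-vs-All strategy; the iris dataset is an illustrative choice.

```python
# Multiclass SVM via the OvO and OvA strategies described above.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, so OvO trains 3*(3-1)/2 = 3 SVMs

ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # one SVM per class

print("OvO prediction:", ovo.predict(X[:1]))
print("OvA prediction:", ova.predict(X[:1]))
```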
Directed Acyclic Graph (DAG)
 This approach is more hierarchical in nature and it tries to address
the problems of the One vs One and One vs All approaches.

 👉 This is a graphical approach in which we group the classes based on


some logical grouping.

 👉 Benefits: Benefits of this approach includes a fewer number of SVM


trains with respect to the OVA approach and it reduces the diversity
from the majority class which is a problem of the OVA approach.

 👉 Problem: If the dataset itself is given in the form of
different groups (e.g., the CIFAR-10 image classification dataset), then we
can directly apply this approach. But if the groups are not given, then
the problem with this approach is finding a logical grouping in
the dataset, i.e., we have to manually pick the logical grouping.
Chapter # 5

Learning with Clustering


Introduction

 Clustering is similar to classification in that data are grouped.


 However, unlike classification, the groups are not predefined.
 Instead, the grouping is accomplished by finding similarities
between data according to characteristics found in the actual data.
 The groups are called clusters.
 Many definitions for clusters have been proposed:
 • Set of like elements. Elements from different clusters are not
alike.
 • The distance between points in a cluster is less than the distance
between a point in the cluster and any point outside it.
Introduction

 Example:
 An international online catalog company wishes to group its
customers based on common features.
 Company management does not have any predefined labels for these
groups.
 Based on the outcome of the grouping, they will target marketing
and advertising campaigns to the different groups.
 The information they have about the customers includes income,
age, number of children, marital status, location of house and
education among others.
Introduction

 Based on the attributes different clusters can be obtained.


Challenges in Clustering

 Outlier handling is difficult.

 Dynamic data in the database implies that cluster membership may


change over time.

 Interpreting the semantic meaning of each cluster may be difficult.

 There is no one correct answer to a clustering problem.

 Another related issue is what data should be used for clustering.


Categories of Clustering Algorithms
Categories of Clustering Algorithms
 Clustering algorithms themselves may be viewed as hierarchical or
partitional.
 With hierarchical clustering, a nested set of clusters is created.
 Each level in the hierarchy has a separate set of clusters.
 At the lowest level, each item is in its own unique cluster.
 At the highest level, all items belong to the same cluster.
 With hierarchical clustering, the desired number of clusters is not input.
 With partitional clustering, the algorithm creates only one set of clusters.
 These approaches use the desired number of clusters to drive how the final
set is created.
 Traditional clustering algorithms tend to be targeted to small numeric
databases that fit into memory.
Categories of Clustering Algorithms

 There are, however, more recent clustering algorithms that look at


categorical data and are targeted to larger, perhaps dynamic,
databases.
 Algorithms targeted to larger databases may adapt to memory
constraints by either sampling the database or using data structures,
which can be compressed or pruned to fit into memory regardless of
the size of the database.
Categories of Clustering Algorithms

 Another way clustering algorithms can categorised is,

 Density-based
 Distribution-based
 Centroid-based
 Hierarchical-based
Density-based

 In density-based clustering, data is grouped by areas of high


concentrations of data points surrounded by areas of low
concentrations of data points.
 Basically the algorithm finds the places that are dense with data
points and calls those clusters.
 The great thing about this is that the clusters can be any shape. You
aren't constrained to expected conditions.
 The clustering algorithms under this type don't try to assign outliers
to clusters, so they get ignored.
Distribution-based

 With a distribution-based clustering approach, all of the data points


are considered parts of a cluster based on the probability that they
belong to a given cluster.
 It works like this: there is a center-point, and as the distance of a
data point from the center increases, the probability of it being a
part of that cluster decreases.
 If you aren't sure of how the distribution in your data might be, you
should consider a different type of algorithm.
Centroid-based

 Centroid-based clustering is the one you probably hear about the


most. It's a little sensitive to the initial parameters you give it, but
it's fast and efficient.
 These types of algorithms separate data points based on multiple
centroids in the data.
 Each data point is assigned to a cluster based on its squared
distance from the centroid.
 This is the most commonly used type of clustering.
Hierarchical-based

 Hierarchical-based clustering is typically used on hierarchical data,


like you would get from a company database or taxonomies.
 It builds a tree of clusters so everything is organized from the top-
down.
 This is more restrictive than the other clustering types, but it's
perfect for specific kinds of data sets.
Major Clustering Algorithms

 K-means clustering algorithm


 DBSCAN clustering algorithm
 Gaussian Mixture Model algorithm
 BIRCH algorithm
 Affinity Propagation clustering algorithm
 Mean-Shift clustering algorithm
 OPTICS algorithm
 Agglomerative Hierarchy clustering algorithm
Similarity and Distance Measures
 There are many desirable properties for the clusters created by a
solution to a specific clustering problem.
 The most important one is that a tuple within one cluster is more
like tuples within that cluster than it is similar to tuples outside it.
 As with classification, we assume the definition of a similarity
measure, sim(ti , tl ), defined between any two tuples, ti, tl є D.
 This provides a stricter, alternative clustering definition.
 Unless otherwise stated, we use the first definition rather than the
second.
 Keep in mind that the similarity relationship stated within the
second definition is a desirable, although not always obtainable,
property.
Similarity and Distance Measures
Similarity and Distance Measures
Similarity and Distance Measures
 Given clusters Ki and KJ , there are several standard alternatives to
calculate the distance between clusters.
 A representative list is :
 • Single link: Smallest distance between an element in one cluster and an
element in the other.
 • Complete link: Largest distance between an element in one cluster and
an element in the other.
 • Average: Average distance between an element in one cluster and an
element in the other.
 • Centroid: If clusters have a representative centroid, then the centroid
distance is defined as the distance between the centroids.
 • Medoid: Using a medoid to represent each cluster, the distance between
the clusters is the distance between their medoids.
Agglomerative Clustering Algorithm
 As mentioned earlier, hierarchical clustering algorithms actually creates sets of
clusters.
 Hierarchical algorithms differ in how the sets are created.
 A tree data structure, called a dendrogram, can be used to illustrate the
hierarchical clustering technique and the sets of different clusters.
 The root in a dendrogram tree contains one cluster where all elements are
together.
 The leaves in the dendrogram each consist of a single element cluster.
 Internal nodes in the dendrogram represent new clusters formed by merging the
clusters that appear as its children in the tree.
 Each level in the tree is associated with the distance measure that was used to
merge the clusters.
 All clusters created at a particular level were combined because the children
clusters had a distance between them less than the distance value associated with
this level in the tree.
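
A minimal sketch of agglomerative (hierarchical) clustering: SciPy's linkage
builds the dendrogram structure and fcluster cuts it into a chosen number of
clusters; the points and the single-link method are illustrative.

```python
# Agglomerative clustering: merge clusters bottom-up, then cut the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 9]])

Z = linkage(points, method="single")             # single link: smallest distance
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```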
Sum on Agglomerative Clustering (worked example slides; figures not reproduced)
Divisive Algorithm

 With divisive clustering, all items are initially placed in one cluster
and clusters are repeatedly split in two until all items are in their
own cluster.
 The idea is to split up clusters where some elements are not
sufficiently close to other elements.
K-Means Clustering

 K- means is an iterative clustering algorithm in which items are


moved among sets of clusters until the desired set is reached.
 As such, it may be viewed as a type of squared error algorithm,
although the convergence criteria need not be defined based on the
squared error.
 A high degree of similarity among elements within a cluster is obtained,
while a high degree of dissimilarity among elements in different
clusters is achieved simultaneously.
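
A minimal sketch of k-means with scikit-learn; the synthetic blob data and
the choice of k = 3 are illustrative.

```python
# K-means: iteratively assign points to the nearest centroid and recompute centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster labels :", km.labels_[:10])
print("cluster centres:", km.cluster_centers_)
```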
K-Means Clustering
Graph Based Clustering: Clustering with
minimal spanning tree
 MST for Clustering involves two steps

 Build an MST for the given datapoints (Usually Kruskal)


 Then remove all the edges with the highest weight, until the desired
number of clusters are formed.
Graph Based Clustering: Clustering with
minimal spanning tree
Model Based Clustering: Expectation
Maximization Algorithm
 Expectation-Maximization algorithm can be used for the latent
variables (variables that are not directly observable and are actually
inferred from the values of the other observed variables) in order to
predict their values with the condition that the general form of
probability distribution governing those latent variables is known to
us.

 This algorithm is actually at the base of many unsupervised


clustering algorithms in the field of machine learning.
Model Based Clustering: Expectation
Maximization Algorithm
 Algorithm:
 Given a set of incomplete data, consider a set of starting
parameters.
 Expectation step (E – step): Using the observed available data of
the dataset, estimate (guess) the values of the missing data.
 Maximization step (M – step): Complete data generated after the
expectation (E) step is used in order to update the parameters.
 Repeat step 2 and step 3 until convergence.
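
A minimal sketch of model-based clustering with a Gaussian mixture, which
scikit-learn fits using the EM algorithm described above; the data is
illustrative.

```python
# Gaussian mixture fit by EM: alternating E (responsibilities) and M (parameter updates) steps.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM runs until convergence
print("hard cluster assignments:", gmm.predict(X[:5]))
print("soft (posterior) probabilities:\n", gmm.predict_proba(X[:2]))
```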
Model Based Clustering: Expectation
Maximization Algorithm
 Usage of EM algorithm –
 It can be used to fill the missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
 It can be used for discovering the values of latent variables.
 Advantages of EM algorithm –
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often pretty easy for many problems in terms of implementation.
 Solutions to the M-steps often exist in the closed form.
 Disadvantages of EM algorithm –
 It has slow convergence.
 It may converge to a local optimum only.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
Density Based Clustering: DBSCAN
 Broadly, all clustering methods use the same approach, i.e., first we calculate similarities and then we use them to cluster the data points into groups or batches.
 Here we will focus on the Density-based spatial clustering of applications with
noise (DBSCAN) clustering method.
 Why do we need a Density-Based clustering algorithm like DBSCAN when we
already have K-means clustering?
 K-Means clustering may cluster loosely related observations together.
 Every observation becomes a part of some cluster eventually, even if the
observations are scattered far away in the vector space.
 Since clusters depend on the mean value of cluster elements, each data point
plays a role in forming the clusters.
 A slight change in the data points might affect the clustering outcome.
 This is usually not a big problem unless we come across oddly shaped data.
 These problems are greatly reduced in DBSCAN due to the way its clusters are formed.
Density Based Clustering: DBSCAN
 Another challenge with K-means is that you need to specify the number of clusters ("k") in order to use it.
 Much of the time, we won't know what a reasonable k value is a priori.
 What's nice about DBSCAN is that you don't have to specify the number of clusters to use it.
 All you need is a function to calculate the distance between values and some guidance for what amount of distance is considered "close".
 DBSCAN also produces more reasonable results than K-means across a variety of different distributions.
Density Based Clustering: DBSCAN
 The DBSCAN algorithm uses two parameters:
 minPts: The minimum number of points (a threshold) clustered together
for a region to be considered dense.
 eps (ε): A distance measure that will be used to locate the points in the
neighborhood of any point.
 These parameters can be understood if we explore two concepts called
Density Reachability and Density Connectivity.
 Reachability in terms of density establishes a point to be reachable from
another if it lies within a particular distance (eps) from it.
 Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster. For example, points p and q could be connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.
Density Based Clustering: DBSCAN
 There are three types of points after the DBSCAN clustering is complete:
 Core — a point that has at least minPts points within distance ε from itself.
 Border — a point that has at least one Core point within distance ε.
 Noise — a point that is neither a Core nor a Border, and has fewer than minPts points within distance ε from itself.
Density Based Clustering: DBSCAN
 Algorithmic steps for DBSCAN clustering:
 The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).
 If there are at least minPts points within a radius of ε of that point, then we consider all these points to be part of the same cluster.
 The clusters are then expanded by recursively repeating the neighborhood calculation for each neighboring point.
Density Based Clustering: DBSCAN
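 A minimal sketch using scikit-learn's DBSCAN implementation (assumed to be installed); the values of eps and min_samples (minPts) below are hypothetical and should be tuned for real data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)

print(db.labels_)               # -1 marks noise points
print(db.core_sample_indices_)  # indices of the core points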
Chapter # 6
Dimensionality Reduction
Introduction
 The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
 A dataset often contains a huge number of input features, which makes the predictive modeling task more complicated.
 Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
 Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."
Introduction
 These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.
 They are commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc.
 They can also be used for data visualization, noise reduction, cluster analysis, etc.
Curse of Dimensionality
 Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality.
 If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex.
 As the number of features increases, the number of samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases.
 If a machine learning model is trained on high-dimensional data, it tends to become overfitted and to perform poorly.
 Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
Curse of Dimensionality
 A short list of other reasons we want to simplify our data includes the following:
 ■ Making the dataset easier to use
 ■ Reducing the computational cost of many algorithms
 ■ Removing noise
 ■ Making the results easier to understand
Benefits of applying Dimensionality
Reduction
 Some benefits of applying dimensionality reduction techniques to a given dataset are given below:
 By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
 Less computation/training time is required for reduced dimensions of features.
 Reduced dimensions of the features of the dataset help in visualizing the data quickly.
 It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality
Reduction
 There are also some disadvantages of applying dimensionality reduction, which are given below:
 Some data may be lost due to dimensionality reduction.
 In the PCA technique, the number of principal components to retain is sometimes not known in advance.
Approaches of Dimensionality Reduction
Dimensionality Reduction Techniques
 Feature Selection
 1. Filter Methods
 2. Wrapper Methods
 3. Embedded Methods
 Feature Extraction
 1. Principal Component Analysis
 2. Linear Discriminant Analysis
 3. Kernel PCA
 4. Quadratic Discriminant Analysis
Feature Selection
 Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset to build a model of high accuracy.
 In other words, it is a way of selecting the optimal features from the input dataset.
 Three methods are used for feature selection:
 1. Filter Methods
 2. Wrapper Methods
 3. Embedded Methods
Filter Methods
 In this method, the dataset is filtered, and a subset that contains only the relevant features is taken (a code sketch follows this list). Some common techniques of the filter method are:
 Correlation
 Chi-Square Test
 ANOVA
 Information Gain, etc.
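 A minimal sketch of a filter method (chi-square scores), assuming scikit-learn; the Iris dataset and k = 2 are illustrative choices only:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score of every feature
print(X_new.shape)        # (150, 2)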
Wrapper Methods
 The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation.
 In this method, a subset of features is fed to the ML model and its performance is evaluated.
 The performance decides whether to add or remove those features to increase the accuracy of the model.
 This method is more accurate than the filter method but more complex and computationally expensive.
 Some common techniques of wrapper methods are listed below, followed by a short code sketch:
 Forward Selection
 Backward Selection
 Bi-directional Elimination
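 A minimal sketch of a wrapper method (forward selection), assuming scikit-learn >= 0.24 for SequentialFeatureSelector; the KNN model and Iris data are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=3)            # the model used for evaluation
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward")  # add features one at a time
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the selected features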
Embedded Methods
 Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature.
 Some common techniques of embedded methods are listed below, followed by a short code sketch:
 LASSO
 Elastic Net
 Ridge Regression, etc.
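 A minimal sketch of an embedded method, assuming scikit-learn: LASSO drives some coefficients to zero during training, and SelectFromModel keeps only the surviving features (the diabetes dataset and alpha value are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print(lasso.coef_)                  # zero coefficients = discarded features
print(selector.transform(X).shape)  # data restricted to the selected features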
Feature Extraction
 Feature extraction is the process of transforming a space containing many dimensions into a space with fewer dimensions.
 This approach is useful when we want to keep the whole information but use fewer resources while processing the information.
 Some common feature extraction techniques are:
 1. Principal Component Analysis
 2. Linear Discriminant Analysis
 3. Kernel PCA
 4. Quadratic Discriminant Analysis
Principal Component Analysis
 Principal Component Analysis is a statistical process that converts
the observations of correlated features into a set of linearly
uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal
Components.
 It is one of the popular tools that is used for exploratory data
analysis and predictive modeling.
 PCA works by considering the variance of each attribute, because an attribute with high variance carries most of the information and tends to show a good split between the classes; keeping only the high-variance directions reduces the dimensionality.
 Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
Principal Component Analysis
 Consider for a moment the mass of data in the figure.
 If I asked you to draw a line covering the data points, what's the longest possible line you could draw?
 We've drawn a few choices.
 Line B is the longest of these three lines.
 In PCA, we rotate the axes of the data.
 The rotation is determined by the data itself.
 The first axis is rotated to cover the largest variation in the data: line B in the figure.
 The largest variation is the data telling us what's most important.
Principal Component Analysis
 After choosing the axis covering the most variability, we choose the
next axis, which has the second most variability, provided it’s
perpendicular to the first axis.
 The real term used is orthogonal.
 On this two-dimensional plot, perpendicular and orthogonal are the
same.
 In the figure, line C would be our second axis.
 With PCA, we're rotating the axes so that they're lined up with the most important directions from the data's perspective, as the code sketch below illustrates.
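 A minimal sketch of PCA with scikit-learn (assumed installed); the synthetic data below is hypothetical:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]    # make feature 1 depend on feature 0

pca = PCA(n_components=2)                # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)     # share of variance along each new axis
print(X_reduced.shape)                   # (100, 2)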
Linear Discriminant Analysis
 Linear Discriminant Analysis, also called Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique that is commonly used for supervised classification problems.
 It is used for modelling differences in groups, i.e. separating two or more classes.
 It is used to project features from a higher-dimensional space into a lower-dimensional space.
Linear Discriminant Analysis
 For example, suppose we have two classes and we need to separate them efficiently.
 Classes can have multiple features.
 Using only a single feature to classify them may result in some overlapping, as shown in the figure below.
 So, we keep on increasing the number of features for proper classification.
Linear Discriminant Analysis
 Example:
 Suppose we have two sets of data points belonging to two different
classes that we want to classify.
 As shown in the given 2D graph, when the data points are plotted on
the 2D plane, there’s no straight line that can separate the two
classes of the data points completely.
 Hence, in this case, LDA (Linear Discriminant Analysis) is used which
reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
Linear Discriminant Analysis
 Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and projects the data onto this new axis in a way that maximizes the separation of the two categories, hence reducing the 2D graph to a 1D graph.
 Two criteria are used by LDA to create the new axis:
 Maximize the distance between the means of the two classes.
 Minimize the variation within each class.
Linear Discriminant Analysis
 In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it maximizes the distance between the means of the two classes and minimizes the variation within each class.
 In simple terms, this newly generated axis increases the separation between the data points of the two classes.
 After generating this new axis using the above-mentioned criteria, all the data points of the classes are plotted on this new axis, as shown in the figure given below. A code sketch of the same projection also follows.
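 A minimal sketch of this projection using scikit-learn's LinearDiscriminantAnalysis (assumed installed); the two synthetic Gaussian classes are hypothetical:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 0
X1 = rng.normal(loc=[4, 4], scale=1.0, size=(50, 2))   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)       # at most (classes - 1) axes
X_1d = lda.fit_transform(X, y)                         # 2D data reduced to 1D

print(X_1d.shape)   # (100, 1)
print(lda.coef_)    # weight vector of the learned discriminant direction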
Drawbacks of Linear Discriminant
Analysis
 LDA is specifically used to solve supervised classification problems with two or more classes, which is not directly possible using standard logistic regression in machine learning.
 However, LDA fails in some cases, for example where the means of the class distributions are shared.
 In such a case, LDA fails to create a new axis that makes the classes linearly separable.
 To overcome such problems, we use non-linear discriminant analysis in machine learning.
Singular Value Decomposition
 Singular-Value Decomposition, written as SVD in short form, is also one of the popular dimensionality reduction techniques.
 It is a matrix-factorization method from linear algebra, and it is widely used in different applications such as feature selection, visualization, noise reduction, and many more. A short code sketch is given below.
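 A minimal sketch of a truncated SVD with NumPy; the matrix and the number of retained singular values (k = 2) are hypothetical:

import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of A

print(np.linalg.norm(A - A_k))                     # reconstruction error
print((U[:, :k] * s[:k]).shape)                    # reduced (6, 2) representation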