Lecture 5: Supervised Learning
Prof. Dr. Md. Rakib Hassan
Dept. of Computer Science and Mathematics,
Bangladesh Agricultural University.
Email: rakib@bau.edu.bd
Supervised Learning
❖ A supervised learning algorithm takes a known set of
input data (the training set) and known responses to
the data (output) and trains a model to generate
reasonable predictions for the response to new input
data.
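For illustration (not part of the original slides), here is a minimal sketch of that workflow in Python with scikit-learn; the synthetic data and the choice of model are arbitrary:

    # Supervised learning workflow: train on known (input, response) pairs,
    # then predict responses for new inputs.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Known inputs (X) and known responses (y) form the training set.
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)  # train the model
    print(model.predict(X_new[:5]))   # predictions for new input data
    print(model.score(X_new, y_new))  # accuracy on held-out data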
Supervised Learning Techniques
❖ Classification:
❑ It predicts discrete responses—for example, whether an
email is genuine or spam, or whether a tumor is small,
medium, or large.
❑ Classification models are trained to classify data into
categories.
❑ Applications include medical imaging, speech recognition,
and credit scoring.
❖ Regression:
❑ Predicts continuous responses—for example, changes in
temperature or fluctuations in electricity demand.
❑ Applications include forecasting stock prices, handwriting
recognition, and acoustic signal processing.
Selecting the Right Algorithm
❖ Speed of training
❖ Memory usage
❖ Predictive accuracy on new data
❖ Transparency or interpretability (how easily you can
understand the reasons an algorithm makes its
predictions)
Binary vs. Multiclass Classification
❖ Binary classification problem:
❑ Each training or test item (instance) can belong to only one of
two classes, for example, deciding whether an email is genuine or
spam.
❖ Multiclass classification problem:
❑ Each instance can belong to one of more than two classes, for
example, classifying an image as a dog, a cat, or another animal.
❖ A multiclass classification problem is generally more
challenging because it requires a more complex
model.
Common Classification Algorithms
❖ Logistic Regression
❑ How It Works
o Fits a model that can predict the probability of a binary response
belonging to one class or the other. Because of its simplicity,
logistic regression is commonly used as a starting point for binary
classification problems.
❑ Best Used...
o When data can be clearly separated by a single, linear boundary
o As a baseline for evaluating more complex classification methods
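A minimal scikit-learn sketch (illustrative only; the one-feature toy data is invented for this example):

    # Logistic regression: predicts the probability that an instance
    # belongs to one of two classes.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one feature
    y = np.array([0, 0, 0, 1, 1, 1])  # two linearly separable classes

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[2.0]]))  # P(class 0), P(class 1) at x = 2.0
    print(clf.predict([[2.0]]))        # hard class label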
Common Classification Algorithms
❖ k-Nearest Neighbor (kNN)
❑ How It Works
o kNN categorizes objects based on the classes of their nearest
neighbors in the dataset, assuming that objects near each other
are similar. Distance metrics, such as Euclidean, city block,
cosine, and Chebyshev, are used to find the nearest neighbors.
❑ Best Used...
o When you need a simple algorithm to establish benchmark
learning rules
o When memory usage of the trained model is a lesser concern
o When prediction speed of the trained model is a lesser concern
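An illustrative scikit-learn sketch (the dataset, k, and metric are arbitrary choices):

    # kNN: classify a point by majority vote among its k nearest
    # training points under a chosen distance metric.
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # metric may be 'euclidean', 'manhattan' (city block), 'chebyshev', ...
    knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X, y)
    print(knn.predict(X[:3]))  # predicted classes for three query points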
Common Classification Algorithms
❖ Support Vector Machine (SVM)
❑ How It Works
o Classifies data by finding the linear decision boundary (hyperplane) that
separates all data points of one class from those of the other class.
o The best hyperplane for an SVM is the one with the largest margin
between the two classes, when the data is linearly separable.
o If the data is not linearly separable, a loss function is used to penalize
points on the wrong side of the hyperplane.
o SVMs sometimes use a kernel transformation to map data that is not
linearly separable into higher dimensions, where a linear decision
boundary can be found.
❑ Best Used...
o For data that has exactly two classes (you can also use it for multiclass
classification with a technique called error-correcting output codes)
o For high-dimensional, nonlinearly separable data
o When you need a classifier that’s simple, easy to interpret, and accurate
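A short scikit-learn sketch contrasting a linear SVM with a kernel SVM on data that is not linearly separable (toy data; parameters arbitrary):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Two interleaving half-moons: no single line separates the classes.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)
    kernel_svm = SVC(kernel='rbf', C=1.0).fit(X, y)  # kernel transformation
    print('linear:', linear_svm.score(X, y))  # underfits the curved boundary
    print('rbf:', kernel_svm.score(X, y))     # separates it in kernel space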
Common Classification Algorithms
❖ Neural Network
❑ How It Works
o Inspired by the human brain, a neural network consists of highly
connected networks of neurons that relate the inputs to the
desired outputs.
o The network is trained by iteratively modifying the strengths of
the connections so that given inputs map to the correct response.
❑ Best Used...
o For modeling highly nonlinear systems
o When data is available incrementally and you wish to constantly
update the model
o When there could be unexpected changes in your input data
o When model interpretability is not a key concern
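A minimal sketch using scikit-learn's multilayer perceptron (the layer sizes and data are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=1)

    # Two hidden layers of connected neurons; training iteratively adjusts
    # the connection weights so inputs map to the correct response.
    nn = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000,
                       random_state=1).fit(X, y)
    print(nn.score(X, y))           # training accuracy
    nn.partial_fit(X[:10], y[:10])  # incremental update as new data arrives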
Common Classification Algorithms
❖ Naïve Bayes
❑ How It Works
o A naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
o It classifies new data based on the highest probability of its
belonging to a particular class.
❑ Best Used...
o For a small dataset containing many parameters
o When you need a classifier that’s easy to interpret
o When the model will encounter scenarios that weren’t in the
training data, as is the case with many financial and medical
applications
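A minimal scikit-learn sketch using the Gaussian variant (the dataset is an arbitrary choice):

    # Gaussian naive Bayes: features are assumed conditionally independent
    # given the class; prediction picks the most probable class.
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    nb = GaussianNB().fit(X, y)
    print(nb.predict(X[:2]))        # most probable class for each sample
    print(nb.predict_proba(X[:2]))  # class membership probabilities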
Common Classification Algorithms
❖ Discriminant Analysis
❑ How It Works
o Discriminant analysis classifies data by finding linear combinations
of features.
o Discriminant analysis assumes that different classes generate data
based on Gaussian distributions.
o Training a discriminant analysis model involves finding the
parameters for a Gaussian distribution for each class. The
distribution parameters are used to calculate boundaries, which
can be linear or quadratic functions. These boundaries are used to
determine the class of new data.
❑ Best Used...
o When you need a simple model that is easy to interpret
o When memory usage during training is a concern
o When you need a model that is fast to predict
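An illustrative scikit-learn sketch of both the linear and quadratic variants (dataset arbitrary):

    # Fit one Gaussian per class; boundaries are linear if classes share a
    # covariance matrix (LDA) and quadratic if each has its own (QDA).
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

    X, y = load_iris(return_X_y=True)
    lda = LinearDiscriminantAnalysis().fit(X, y)     # linear boundaries
    qda = QuadraticDiscriminantAnalysis().fit(X, y)  # quadratic boundaries
    print(lda.score(X, y), qda.score(X, y))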
Common Classification Algorithms
❖ Decision Tree
❑ How It Works
o A decision tree lets you predict responses to data by following the
decisions in the tree from the root (beginning) down to a leaf
node.
o A tree consists of branching conditions where the value of a
predictor is compared to a trained weight. The number of
branches and the values of weights are determined in the training
process. Additional modification, or pruning, may be used to
simplify the model.
❑ Best Used...
o When you need an algorithm that is easy to interpret and fast to
fit
o To minimize memory usage
o When high predictive accuracy is not a requirement
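A minimal scikit-learn sketch; limiting max_depth plays the role of pruning here (dataset and depth are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    # Predictions follow branching comparisons from the root to a leaf.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree))    # the learned branching conditions
    print(tree.predict(X[:3]))  # predicted classes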
Common Classification Algorithms
❖ Bagged and Boosted Decision Trees
❑ How They Work
o In these ensemble methods, several “weaker” decision trees are
combined into a “stronger” ensemble.
o A bagged decision tree consists of trees that are trained
independently on data that is bootstrapped from the input data.
o Boosting involves creating a strong learner by iteratively adding
“weak” learners and adjusting the weight of each weak learner to
focus on misclassified examples.
❑ Best Used...
o When predictors are categorical (discrete) or behave nonlinearly
o When the time taken to train a model is less of a concern
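An illustrative scikit-learn sketch, using a random forest as the bagged ensemble and AdaBoost as the boosted one (both choices and all parameters are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

    X, y = make_classification(n_samples=400, n_features=8, random_state=0)

    # Bagging: trees trained independently on bootstrap samples.
    bagged = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # Boosting: weak trees added iteratively, reweighting mistakes.
    boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(bagged.score(X, y), boosted.score(X, y))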
Common Regression Algorithms
❖ Linear Regression
❑ How It Works
o Linear regression is a statistical modeling technique used to
describe a continuous response variable as a linear function of one
or more predictor variables. Because linear regression models are
simple to interpret and easy to train, they are often the first model
to be fitted to a new dataset.
❑ Best Used...
o When you need an algorithm that is easy to interpret and fast to
fit
o As a baseline for evaluating other, more complex, regression
models
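A minimal scikit-learn sketch with invented data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one predictor variable
    y = np.array([2.1, 3.9, 6.2, 7.8])          # roughly y = 2x

    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)  # learned slope and intercept
    print(reg.predict([[5.0]]))       # prediction for a new input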
Common Regression Algorithms
❖ Nonlinear Regression
❑ How It Works
o Nonlinear regression is a statistical modeling technique that helps
describe nonlinear relationships in experimental data.
o Nonlinear regression models are generally assumed to be
parametric, where the model is described as a nonlinear equation.
❑ Best Used...
o When data has strong nonlinear trends and cannot be easily
transformed into a linear space
o For fitting custom models to data
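An illustrative sketch with SciPy's curve_fit; the exponential-decay model is an arbitrary choice of nonlinear equation, and the noisy data is synthetic:

    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, a, b):  # the assumed parametric nonlinear equation
        return a * np.exp(-b * x)

    x = np.linspace(0, 4, 50)
    y = model(x, 2.5, 1.3) + 0.05 * np.random.default_rng(0).normal(size=x.size)

    params, _ = curve_fit(model, x, y)  # estimate a and b from the data
    print(params)                       # close to [2.5, 1.3]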
Common Regression Algorithms
❖ Gaussian Process Regression Model
❑ How It Works
o Gaussian process regression (GPR) models are nonparametric
models that are used for predicting the value of a continuous
response variable. They are widely used in the field of spatial
analysis for interpolation in the presence of uncertainty. GPR is
also referred to as Kriging.
❑ Best Used...
o For interpolating spatial data, such as hydrogeological data for the
distribution of ground water
o As a surrogate model to facilitate optimization of complex designs
such as automotive engines
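A minimal scikit-learn sketch; the kernel and observation points are arbitrary:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    X = np.array([[0.0], [1.0], [3.0], [4.0]])  # sparse observations
    y = np.sin(X).ravel()

    gpr = GaussianProcessRegressor(kernel=RBF()).fit(X, y)
    # Interpolate between observations and report predictive uncertainty.
    mean, std = gpr.predict([[2.0]], return_std=True)
    print(mean, std)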
Common Regression Algorithms
❖ SVM Regression
❑ How It Works
o SVM regression algorithms work like SVM classification
algorithms, but are modified to be able to predict a continuous
response. Instead of finding a hyperplane that separates data,
SVM regression algorithms find a model that deviates from the
measured data by a value no greater than a small amount, with
parameter values that are as small as possible (to minimize
sensitivity to error).
❑ Best Used...
o For high-dimensional data (where there will be many predictor
variables)
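A short scikit-learn sketch (toy data; kernel and epsilon are arbitrary):

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

    # epsilon sets the maximum tolerated deviation from the measured data.
    svr = SVR(kernel='rbf', epsilon=0.1).fit(X, y)
    print(svr.predict([[2.5]]))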
Common Regression Algorithms
❖ Generalized Linear Model
❑ How It Works
o A generalized linear model is a special case of nonlinear models
that uses linear methods. It involves fitting a linear combination of
the inputs to a nonlinear function (the link function) of the
outputs.
❑ Best Used...
o When the response variables have non-normal distributions, such
as a response variable that is always expected to be positive
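An illustrative sketch using Poisson regression, a GLM with a log link suited to a response that must always be positive (PoissonRegressor assumes scikit-learn 0.23 or later; the count data is invented):

    import numpy as np
    from sklearn.linear_model import PoissonRegressor

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
    y = np.array([2, 6, 14, 40, 90])  # counts growing roughly exponentially

    # A linear combination of inputs, linked to the response via exp().
    glm = PoissonRegressor().fit(X, y)
    print(glm.predict([[6.0]]))  # predictions are always positive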
Common Regression Algorithms
❖ Regression Tree
❑ How It Works
o Decision trees for regression are similar to decision trees for
classification, but they are modified to be able to predict
continuous responses.
❑ Best Used...
o When predictors are categorical (discrete) or behave nonlinearly
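A minimal scikit-learn sketch with invented steplike data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
    y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])  # steplike response

    # Each leaf stores a continuous value (the mean response of its samples).
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    print(tree.predict([[2.5], [11.5]]))  # roughly 1.0 and 5.0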
Improving Models
❖ Improving a model means
increasing its accuracy and
predictive power and
preventing overfitting
(when the model cannot
distinguish between data
and noise).
❖ Model improvement
involves feature engineering
(feature selection and
transformation) and
hyperparameter tuning.
Feature Selection
❖ Identifying the most relevant features, or variables,
that provide the best predictive power in modeling
your data. This could mean adding variables to the
model or removing variables that do not improve
model performance.
❖ It’s especially useful when you’re dealing with high-
dimensional data or when your dataset contains a
large number of features and a limited number of
observations.
❖ Reducing features also saves storage and
computation time and makes your results easier to
understand.
Feature Selection Techniques
❖ Stepwise regression:
❑ Sequentially adding or removing features until there is no
improvement in prediction accuracy.
❖ Sequential feature selection:
❑ Iteratively adding or removing predictor variables and
evaluating the effect of each change on the performance of
the model.
❖ Regularization:
❑ Using shrinkage estimators to remove redundant features by
reducing their weights (coefficients) to zero.
❖ Neighborhood component analysis (NCA):
❑ Finding the weight each feature has in predicting the output,
so that features with lower weights can be discarded.
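An illustrative scikit-learn sketch of two of these techniques, sequential feature selection and regularization via the lasso (dataset and settings arbitrary; SequentialFeatureSelector assumes scikit-learn 0.24 or later):

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import Lasso, LinearRegression

    X, y = load_diabetes(return_X_y=True)

    # Sequential selection: greedily add predictors, keeping those that
    # improve cross-validated performance.
    sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4)
    sfs.fit(X, y)
    print(sfs.get_support())  # mask of the selected features

    # Regularization: the lasso shrinks redundant coefficients to zero.
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(lasso.coef_)  # zero weights mark discarded features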
Feature Transformation
❖ Turning existing features into new features using
techniques such as principal component analysis,
nonnegative matrix factorization, and factor analysis.
❖ Feature transformation is a form of dimensionality
reduction.
❖ As discussed earlier, the three most commonly used
dimensionality reduction techniques are:
❑ Principal component analysis (PCA)
❑ Nonnegative matrix factorization
❑ Factor analysis
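A minimal PCA sketch in scikit-learn (the dataset and component count are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2)  # replace 4 features with 2 new components
    X_new = pca.fit_transform(X)
    print(X_new.shape)                    # (150, 2)
    print(pca.explained_variance_ratio_)  # variance retained per component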
Hyperparameter Tuning
❖ The process of identifying the set of hyperparameter values that
provides the best model. Hyperparameters control how a machine
learning algorithm fits the model to the data.
❖ Hyperparameter tuning is an iterative process. You begin by
setting values based on a “best guess” of the outcome. The goal is
to find the “best possible” values, those that yield the best
model.
❖ As you adjust values and model performance begins to improve,
you see which settings are effective and which still require
tuning.
Hyperparameter Tuning Methods
❖ Three common hyperparameter tuning methods are:
❑ Bayesian optimization
❑ Grid search
❑ Gradient-based optimization
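An illustrative grid search sketch in scikit-learn (the model and grid are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    # Try every combination in the grid; keep the best cross-validated one.
    grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
    search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
    print(search.best_params_)  # best-performing hyperparameter settings
    print(search.best_score_)   # their cross-validated accuracy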