
UNIT – II

1]. SUPERVISED LEARNING

• Supervised learning is a type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, the machine predicts the output.
• Labelled data means that the input data is already tagged with the correct output.
• Supervised learning is a process of providing input data as well as the correct output data to the machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y): y = f(x).
Working of Supervised Learning Algorithm :
• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data.
• Once the training process is completed, the model is tested on test data (a held-out subset of the dataset, not used during training), and then it predicts the output.

Training Data :
• Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and polygons.
• The first step is to train the model on each shape:
➢ If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
➢ If the given shape has three sides, then it will be labelled as a triangle.
➢ If the given shape has six equal sides, then it will be labelled as a hexagon.
Steps of Supervised Learning Algorithm :
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training set, a validation set, and a test set.
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset.
• Evaluate the accuracy of the model using the test set.
• If the model predicts the correct outputs, then the model is accurate.
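A minimal scikit-learn sketch of this workflow (the Iris dataset and decision-tree classifier below are illustrative choices, not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Gather labelled data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Choose an algorithm and train it on the labelled training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 3. Evaluate on the held-out test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```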

Types of supervised learning algorithm :


Type 1 – Regression :
• Regression algorithms are used if there is a relationship between the
input variable and the output variable.
• It is used for the prediction of continuous & real variables.
Example:
1. Weather forecasting, Market Trends
2. Temperature, Age, Salary, Price, etc.
Regression Analysis Algorithm:
1. Linear Regression
2. Logistic Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
Type 2 – Classification :
• Classification algorithms are used when the output variable is categorical, i.e., it takes discrete class labels such as Yes/No, Male/Female, True/False, etc.

Classification Analysis Algorithm:


1. Random Forest
2. Decision Trees
3. Logistic Regression
4. Support vector Machines

Example:
1. Determining whether an email is spam or not
2. Predicting whether something is harmful or not
3. Predicting rainfall tomorrow or no rainfall
Such two-class outcomes can be modelled with binary classification, e.g. logistic regression.
Types of Classifier :
The algorithm which implements the classification on a dataset is known as a
classifier.
There are two types of Classifications:

1. Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier (the outcomes are divided into a Positive and a Negative class).
- Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
2. Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
- Examples: classification of types of crops, classification of types of music.

Advantages:
1. With the help of supervised learning, the model can predict the output on the basis of prior experience.
2. In supervised learning, we have an exact idea about the classes of objects.
3. Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.

Disadvantages:
1. Supervised learning models are not well suited to handling very complex tasks.
2. Supervised learning cannot predict the correct output if the test data differs substantially from the training data.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough prior knowledge about the classes of objects.
Regression vs Classification :
2]. Linear Regression
Definition
Linear Regression is a supervised learning algorithm used for predictive analysis.
It models the linear relationship between a dependent variable (output) and one
or more independent variables (input).

Mathematical Representation :

y = a0 + a1·x + ε

• y = Dependent (target) variable
• x = Independent (predictor) variable
• a0 = Intercept
• a1 = Slope (coefficient)
• ε = Random error (residual)
Types of Linear Regression :
• Simple Linear Regression – one independent variable
• Multiple Linear Regression – two or more independent variables

Regression Line Types


• Positive Linear Relationship: As x increases, y increases.
• Negative Linear Relationship: As x increases, y decreases.

Goal: Find the Best Fit Line


A best fit line minimizes the error between predicted and actual values.
Cost Function – Mean Squared Error (MSE)

MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²

• yᵢ = actual value
• ŷᵢ = predicted value
• N = number of observations
Residual: the difference between the actual and predicted value.
Smaller residuals mean better model accuracy.

Gradient Descent (Optimization)

Used to minimize the MSE by adjusting a0 and a1 through iterations.

Update Rule:

aj := aj − η · ∂J/∂aj

Where:
• η = learning rate
• ∂J/∂aj = partial derivative of the cost function with respect to parameter aj
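A minimal NumPy sketch of this update rule for simple linear regression (the toy data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Illustrative toy data (roughly y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

a0, a1 = 0.0, 0.0   # intercept and slope, initialised at zero
eta = 0.01          # learning rate
N = len(x)

for _ in range(2000):
    error = (a0 + a1 * x) - y
    # Partial derivatives of the MSE cost with respect to a0 and a1
    grad_a0 = (2.0 / N) * error.sum()
    grad_a1 = (2.0 / N) * (error * x).sum()
    a0 -= eta * grad_a0
    a1 -= eta * grad_a1

print(f"a0={a0:.3f}, a1={a1:.3f}, MSE={((a0 + a1 * x - y) ** 2).mean():.4f}")
```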


Model Performance: Goodness of Fit
R-squared ( R² )
• Measures the strength of the relationship between the independent and dependent variables.
• Ranges from 0 to 1 (or 0% to 100%).

R² = 1 − (SSres / SStot)

• SSres : sum of squares of residuals, Σ(yᵢ − ŷᵢ)²
• SStot : total sum of squares, Σ(yᵢ − ȳ)²

Assumptions of Linear Regression


• Linearity – the relationship between the features and the target must be linear.
• No or little multicollinearity – independent variables shouldn't be highly correlated with each other.
• Homoscedasticity – equal variance of errors across all levels of the input.
• Normal distribution of errors – residuals should follow a normal distribution.
• No autocorrelation – error terms should not be dependent on each other.
3]. Least Squares Method – CP4252 Machine Learning
• The Least Squares Method is a statistical technique used to determine the
best-fitting curve or line for a set of data points by minimizing the sum of
the squared residuals (differences between actual and predicted values).
• It is commonly used in regression analysis and curve fitting.

To minimize:

S = Σᵢ (yᵢ − ŷᵢ)²

Where:
• yᵢ : actual observed value
• ŷᵢ = f(xᵢ) : predicted value from the fitted line
• (yᵢ − ŷᵢ) : residual (error)

Types of Least Squares Problems


• Ordinary (Linear) – the fitted function is linear in its parameters; used in linear regression.
• Nonlinear – the fitted function is nonlinear in its parameters; solved iteratively.

Least Squares Formula
Let the best-fit line be:

y = a + b·x   … (1)

To determine a (intercept) and b (slope), solve the normal equations:

Σy = n·a + b·Σx
Σxy = a·Σx + b·Σx²   … (2)

Worked-out Example
Given data:
x y
8 4

3 12

2 1

10 12

11 9

3 4

6 9

5 6

6 1

8 14
Step 1: Calculate sums (n = 10)

Σx = 62, Σy = 72, Σxy = 503, Σx² = 468

Step 2: Solve normal equations

72 = 10a + 62b
503 = 62a + 468b

Solving gives:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (5030 − 4464) / (4680 − 3844) ≈ 0.677
a = (Σy − b·Σx) / n = (72 − 0.677 × 62) / 10 ≈ 3.00

Final equation:

ŷ = 3.00 + 0.677x

Residuals (Errors)
Example residual: for the point (8, 4), ŷ = 3.00 + 0.677 × 8 ≈ 8.42, so the residual is 4 − 8.42 ≈ −4.42.

Sum of squared residuals: Σ(yᵢ − ŷᵢ)² ≈ 159.3

Smaller residuals → Better fit.
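A quick NumPy check of the worked example above (np.polyfit(x, y, 1) would return the same slope and intercept):

```python
import numpy as np

# The (x, y) pairs from the worked example
x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8], dtype=float)
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14], dtype=float)

n = len(x)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = (y.sum() - b * x.sum()) / n

residuals = y - (a + b * x)
print(f"a = {a:.3f}, b = {b:.3f}")            # intercept and slope
print(f"SSE = {(residuals ** 2).sum():.1f}")  # sum of squared residuals
```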


Graphical Interpretation
• Data points plotted on X-Y axis
• Regression line y=a+bx drawn
• Vertical lines from points to line represent residuals

Limitations
• Assumes zero or negligible error in independent variable.
• Not suitable if independent variable has significant measurement error.
• May require hypothesis testing and confidence intervals when dealing
with real-world data.

4]. Underfitting, Overfitting & Cross-Validation

Model Fit
In machine learning, a model fit refers to how well the algorithm captures the
patterns in the training data and how well it generalizes to unseen data.
Underfitting
Underfitting occurs when the model is too simple to capture the underlying
patterns in the data.

Symptoms:
• Poor performance on training and testing data
• High bias, low variance

Reasons:
1. Insufficient training time
2. Model too simple (e.g., using linear model for nonlinear data)
3. Incomplete or irrelevant features
4. Noisy or insufficient data

Fixes:
• Use a more complex model
• Add more features or data
• Clean the dataset
• Train for more epochs

Overfitting

Overfitting occurs when the model learns not only the patterns but also the
noise and outliers in the training data, leading to poor performance on new
(test) data.
Symptoms:
• Excellent accuracy on training data

• Poor accuracy on testing data


• High variance, low bias
Reasons:
1. Model too complex
2. Too many features
3. Training for too long
4. Insufficient or noisy data

Fixes:
• Simplify the model
• Regularization (L1 - Lasso, L2 - Ridge)
• Early stopping
• Dropout (in neural networks)
• Add more training data

Good Fit
• The model achieves low error on both training and testing datasets.
• This represents the ideal balance between bias and variance.

Bias–Variance Trade-off :
Fit Type Bias Variance

Underfitting High Low

Good Fit Low Low

Overfitting Low High


Cross-Validation

Cross-validation is a model validation technique to evaluate how well a model


generalizes to an independent dataset. It helps prevent overfitting and supports
model selection.

How It Works:
• The dataset is split into:
o Training set
o Validation set
o Testing set
Most common: k-Fold Cross-Validation
• Split data into k subsets
• Train on k-1 folds, test on the remaining
• Repeat k times, each fold being the test set once
• Final score = average of all iterations
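A minimal scikit-learn sketch of k-fold cross-validation (the Iris dataset and logistic-regression model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, test on the remaining fold, repeated 5 times
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("mean score :", scores.mean())
```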

Why Use Cross-Validation?


• Helps choose the best hyperparameters
• Detects overfitting before final testing
• Measures generalization ability

Early Stopping (in Cross-Validation)


• Monitor validation loss during training.
• Stop training just before validation loss increases (indicating overfitting).

5]. Lasso Regression


Lasso Regression is a regularization technique used in regression problems to
prevent overfitting and enhance model simplicity.
• Lasso = Least Absolute Shrinkage and Selection Operator
• It is particularly effective when:
o The model has many features
o There is multicollinearity
o Feature selection is desired
Purpose
• Prevent overfitting
• Improve generalization to unseen data
• Automatically select features by shrinking some coefficients to zero

Mathematical Formula
The Lasso cost function is:

J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|

Where:
• yᵢ : actual value
• ŷᵢ : predicted value
• βⱼ : model coefficients
• λ : regularization parameter
• Σⱼ |βⱼ| : L1 penalty term
How It Works :
• λ = 0 : equivalent to plain linear regression (no penalty)
• Small λ : minor shrinkage, mild regularization
• Large λ : more coefficients shrink exactly to zero
• λ → ∞ : all coefficients are driven to zero and the model collapses to the intercept alone
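A hedged scikit-learn sketch of Lasso's feature-selection effect (the synthetic dataset and alpha value are illustrative; alpha plays the role of λ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Many coefficients are shrunk exactly to zero -> implicit feature selection
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])
```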

L1 Regularization (Lasso) vs L2 (Ridge) :

• Penalty: Lasso adds λ Σ|βⱼ| (L1); Ridge adds λ Σ βⱼ² (L2).
• Feature selection: Lasso – yes (eliminates irrelevant variables); Ridge – no (retains all variables).
• Sparsity: Lasso – high (some coefficients become exactly 0); Ridge – low (all coefficients shrink but stay non-zero).
• Interpretability: Lasso – high; Ridge – moderate.

Benefits of Lasso Regression


• Automatically performs feature selection
• Prevents overfitting
• Useful for high-dimensional datasets
• Simplifies the model
Bias–Variance Trade-Off
As λ increases Effect
Bias increases Less flexible model

Variance decreases Less overfitting

Use Cases
• Predictive modeling with many features
• Models requiring interpretability
• Use in compressed sensing, bioinformatics, text classification

Key Takeaways
• Lasso Regression = Linear Regression + L1 penalty
• Controls model complexity by shrinking coefficients
• Promotes sparse models (better interpretability)
• Helps avoid overfitting, especially in high-dimensional spaces

6]. Classification

Classification is a type of supervised learning where the task is to assign input


data to one of N discrete classes based on past training data.
Each input example belongs to exactly one class, and the classifier must learn
how to correctly predict that class.
Concepts
• The output space is discrete.
• Every example is assigned only one label.
• All possible output labels must be known in advance.

Real-World Example: Coin Classification (Vending Machine) :

Problem: A vending machine needs to classify coins to accept valid currency.


• Features measured:
o Diameter
o Weight
o Shape (coded as 1 = circle, 2 = hexagon, etc.)
Input Vector = [diameter, weight, shape]
Important Note: Feature selection is critical — too few may underfit, too many
may cause overfitting and increase training complexity (curse of
dimensionality).

Challenge: Novelty Detection


What happens if the machine is given a foreign coin (not part of training)?
• Classifier might assign it to the closest known class, even though it's
wrong.
• This scenario is called novelty detection (i.e., identifying that an input is
outside known classes).
For simplicity, this classifier assumes only known inputs will be received.
Feature Engineering: Good vs Useless Features
• Good features: e.g., diameter and colour can distinguish between NZ
coins.
• Useless features: e.g., shape if all NZ coins are circular — adds no value.
Choosing the right set of features is a crucial part of designing a classification
model.

Decision Boundaries
• Classifiers learn decision boundaries that separate the input space into
different class regions.

Example:
• 2D plot of inputs: Different regions represent different coin classes.
• Boundaries can be:
o Linear (e.g., straight lines — simple but less accurate)
o Non-linear (e.g., curves — more complex, better separation)

Visual :

• Left image: straight-line decision boundaries (simple, fast, may misclassify)
• Right image: non-linear decision boundaries (complex, but more accurate classification)
7]. Logistic Regression

Logistic Regression is a supervised learning algorithm used for classification,


not regression.
It is primarily used for binary classification problems where the output is either
0 or 1, True or False, Yes or No, etc.

Why Not Use Linear Regression :


Linear regression predicts continuous values. But classification requires
categorical outputs (like 0/1). So, logistic regression uses a sigmoid function to
map any input to a value between 0 and 1 (interpreted as probability).

Mathematical Formula
Hypothesis Function:

hθ(x) = σ(θᵀx) = 1 / (1 + e^(−θᵀx))

• θ : vector of weights
• x : input features
• Output hθ(x) is a value in the range (0, 1)
• Prediction is made as:
o If hθ(x) ≥ 0.5 → class = 1
o Else → class = 0
Sigmoid Function (Logistic Function)

σ(z) = 1 / (1 + e^(−z))

• S-shaped curve
• Squashes any real-valued input into (0, 1)
• Interpreted as a probability

Cost Function (Log Loss)

J(θ) = −(1/m) Σᵢ [ yᵢ · log hθ(xᵢ) + (1 − yᵢ) · log(1 − hθ(xᵢ)) ]

• Used instead of MSE to ensure the cost function is convex
• Optimized using Gradient Descent

Gradient Descent Update Rule

θ := θ − η · ∇J(θ)

• η : learning rate
• ∇J(θ) : gradient of the cost function
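A short scikit-learn sketch of binary classification with logistic regression (the breast-cancer dataset is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # binary labels: 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("P(class = 1) for first test sample:", clf.predict_proba(X_test[:1])[0, 1])
```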
Applications of Logistic Regression :

• Email filtering – spam or not spam
• Finance – fraud detection
• Medical – disease present (Yes/No)
• HR – employee attrition (Stay/Leave)

Advantages
• Simple to implement and interpret
• Probabilistic output
• Works well for linearly separable data

Limitations
• Can’t handle non-linear relationships directly
• Prone to underfitting if the relationship is complex
• Assumes independence of features

Extension: Multiclass Logistic Regression :


For more than 2 classes, we use:
• One-vs-All (OvA) or
• Softmax Regression (Multinomial Logistic Regression)
8]. Gradient-Based Linear Models

1. A Model in This Family
1. Uses a linear predictor z = θᵀx.
2. Passes z through an (optional) link / activation g to obtain the output ŷ = g(z).
3. Learns θ by minimising a differentiable loss J(θ) with an iterative gradient method.

Typical gradient optimisers: Batch GD, Stochastic GD (SGD), Mini-batch GD, Adam, RMSProp.
2. Common Members of the Family

• Regression – Linear regression: identity link g(z) = z; squared-error loss ½(ŷ − y)².
• Binary classification – Logistic regression: sigmoid g(z) = 1/(1 + e^(−z)); log loss.
• Count data – Poisson regression: exponential link g(z) = e^z; Poisson negative log-likelihood.
• Large margin – Linear SVM: raw score z (sign taken at prediction time); hinge loss max(0, 1 − y·z).
• Streaming – Adaline / LMS: identity output; squared-error loss updated per sample.

These are often grouped under the statistical umbrella of Generalised Linear Models (GLMs).

3. Optimisation by Gradient Descent

3.1 Batch Gradient Descent

θ := θ − η · ∇J(θ), where the gradient is computed over the entire training set at each step.

3.2 SGD / Mini-batch

Same rule, but the update is made after each example or small batch → faster convergence on large data and helps escape shallow local minima.
4. Regularisation in Gradient Linear Models
• L2 / Ridge: adds λ Σⱼ θⱼ² to the loss — shrinks weights smoothly (keeps all features).
• L1 / Lasso: adds λ Σⱼ |θⱼ| — drives some weights exactly to zero (feature selection).
• Elastic-Net: adds λ[ α Σⱼ |θⱼ| + (1 − α) Σⱼ θⱼ² ] — combines both effects.

5. Bias–Variance Insight
• A badly tuned learning rate or too few iterations → under-fitting (high bias).
• Too many iterations with no regularisation → over-fitting (high variance).
• Early stopping acts like an implicit L2 penalty.

6. Quick Worked Example (Binary Classification) :

1. Initialise θ (e.g., small random values).
2. Repeat until convergence:
o Compute the logits z(i) = θᵀx(i).
o Apply the sigmoid to get ŷ(i) = σ(z(i)).
o Compute the gradient ∇J(θ) = (1/m) Σᵢ (ŷ(i) − y(i)) · x(i) and update θ := θ − η · ∇J(θ).
3. Threshold: ŷ ≥ 0.5 → class 1, else class 0.
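A from-scratch NumPy sketch of these steps using mini-batch SGD (the synthetic two-cluster data, learning rate, batch size, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative toy data: two Gaussian clusters, labels 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X = np.hstack([np.ones((len(y), 1)), X])         # prepend a bias column

theta = np.zeros(X.shape[1])
eta, batch_size = 0.1, 10

for epoch in range(200):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = order[start:start + batch_size]
        y_hat = sigmoid(X[b] @ theta)            # logits -> probabilities
        grad = X[b].T @ (y_hat - y[b]) / len(b)  # gradient of the log loss
        theta -= eta * grad                      # mini-batch SGD update

pred = (sigmoid(X @ theta) >= 0.5).astype(int)   # threshold at 0.5
print("training accuracy:", (pred == y).mean())
```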


9]. Support Vector Machine (SVM)
A Support Vector Machine is a powerful supervised machine learning algorithm
used for both classification and regression, but primarily for binary
classification.
It works by finding the best hyperplane that separates data into distinct classes
with the maximum margin.

Core Idea
SVM tries to:
• Maximize the margin (distance) between the classes.
• Identify the support vectors: the closest data points to the separating
hyperplane.

Mathematical Formulation
For linearly separable data:
Given data (xᵢ, yᵢ) where yᵢ ∈ {−1, +1} :
The decision boundary (hyperplane) :

wᵀx + b = 0

Optimization Objective:

minimize (1/2) ‖w‖²

Subject to:

yᵢ (wᵀxᵢ + b) ≥ 1 for all i
Soft Margin SVM (Non-separable data)

Adds slack variables ξᵢ to allow some misclassification:

minimize (1/2) ‖w‖² + C Σᵢ ξᵢ

Subject to:

yᵢ (wᵀxᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

• C : regularization parameter controlling the trade-off between margin size and classification error
Kernel Trick (for Non-linear SVM)
SVM can classify non-linearly separable data by using kernel functions to map the data into a higher-dimensional space.

Decision Function
After training:

f(x) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )

Where:
• αᵢ : learned support vector weights
• K : kernel function
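A hedged scikit-learn sketch of a soft-margin SVM with an RBF kernel (the half-moons dataset and the C / gamma values are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling + RBF-kernel SVM; C controls the margin/error trade-off
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

print("support vectors per class:", clf.named_steps["svc"].n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```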
Applications of SVM
• Email filtering – spam vs non-spam
• Bioinformatics – disease classification
• Image processing – face recognition
• Text mining – sentiment analysis

Advantages
• Effective in high-dimensional spaces
• Works well when margin is clear
• Can handle non-linear boundaries using kernels
• Uses only support vectors (efficient)

Disadvantages
• Not suitable for very large datasets
• Doesn’t perform well with noisy data
• Requires feature scaling
• Choosing the right kernel is crucial
10]. Kernel Methods

A kernel method is a type of algorithm that relies on similarity measures


(kernels) between data points. It maps input data into a higher-dimensional
feature space without explicitly computing the transformation — using a trick
called the kernel trick.

Why Use Kernel Methods

• To handle non-linear classification problems.
• To avoid explicitly computing the high-dimensional mapping ϕ(x).
• To work with inner products in feature space:

K(x, x′) = ϕ(x)ᵀ ϕ(x′)
Common Kernel Functions

• Linear: K(x, x′) = xᵀx′
• Polynomial: K(x, x′) = (xᵀx′ + c)^d
• RBF (Gaussian): K(x, x′) = exp( −‖x − x′‖² / (2σ²) )
• Sigmoid: K(x, x′) = tanh( γ xᵀx′ + c )
Kernel Trick Example in SVM
In non-linear SVM, we use kernel functions to transform the feature
space so that classes become linearly separable.
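A small NumPy sketch comparing an explicitly computed RBF kernel matrix with scikit-learn's rbf_kernel (the data and gamma value are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Five random 2-D points (illustrative)
X = np.random.default_rng(0).normal(size=(5, 2))
gamma = 0.5

# Explicit computation: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# The same kernel matrix via scikit-learn
K_sklearn = rbf_kernel(X, X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))   # True
```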

Advantages
• Handles complex, non-linear relationships
• No need to compute high-dimensional features explicitly
• Used in many algorithms: SVM, PCA (Kernel PCA), Ridge, etc.

Disadvantages
• Kernel choice is crucial
• Can be computationally expensive on large datasets

11]. Instance-Based Methods

An instance-based learning algorithm doesn't explicitly build a model. Instead,


it:
• Stores training instances.
• Uses them at prediction time to compute outputs based on similarity.
Core Idea
“Memorize the data, then compare new inputs to the stored instances.”

How It Works
• Stores all training pairs (xᵢ, yᵢ)
• For a new point x, predicts based on its similarity to the stored xᵢ

Popular Instance-Based Algorithms

• k-NN – predicts based on the majority vote (classification) or average (regression) of the k nearest neighbors.
• Case-Based Reasoning – uses past cases to solve new ones.
• Locally Weighted Regression – fits a regression locally around the query point.

Distance Metrics Used

• Euclidean Distance: d(x, x′) = √( Σⱼ (xⱼ − x′ⱼ)² )
• Manhattan Distance, Cosine Similarity, etc.
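A minimal k-NN sketch with scikit-learn (the Iris dataset and k = 5 are illustrative choices); fit() here merely stores the data, which is why this is called lazy learning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Lazy learner: distances to the stored training points are computed at predict time
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```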


Advantages
• Simple to implement
• No training time (lazy learning)
• Good for multi-modal distributions

Disadvantages
• Slow prediction (stores all data)
• Sensitive to irrelevant features
• Needs distance metric tuning

12]. Ensemble Methods – CP4252 Machine Learning

Ensemble methods combine the predictions of multiple models (learners) to


produce better performance than any single model alone.
“Wisdom of the crowd” applied to machine learning.

Goals of Ensemble Learning


• Improve accuracy
• Reduce variance (overfitting)
• Reduce bias (underfitting)
• Increase robustness
Types of Ensemble Learning

1. Bagging (Bootstrap Aggregating)


• Trains multiple independent models on random subsets of data (with
replacement).
• Final prediction:
o Classification: Majority vote
o Regression: Average

Example: Random Forest


• Ensemble of Decision Trees using bagging
• Reduces variance
• Great for high-variance models like decision trees

2. Boosting
• Trains models sequentially
• Each model tries to correct the errors of the previous one
• Focuses more on difficult examples
Examples:
• AdaBoost (Adaptive Boosting): Assigns weights to samples
• Gradient Boosting: Models residuals using gradient descent
• XGBoost: Optimized version of gradient boosting (fast & accurate)
Boosting reduces bias and can create powerful models
3. Stacking (Stacked Generalization)
• Combines predictions of multiple different models (base
learners)
• A meta-model learns how to combine base model outputs
More flexible and can capture diverse learning patterns
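A hedged scikit-learn sketch comparing the three ensemble styles on one dataset (the breast-cancer data and the specific estimators/hyperparameters are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "bagging (Random Forest)": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=200, random_state=0),
    "stacking (RF + SVM -> LogReg)": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("svc", SVC(probability=True, random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```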

Why Ensemble Works


Because different models make different errors:
• Averaging predictions reduces variance
• Correcting mistakes (in boosting) reduces bias
• Combining different learners improves generalization

Visual Intuition
Imagine:
• Model A: Accuracy = 70%
• Model B: Accuracy = 72%
• Model C: Accuracy = 71%
Ensemble of A + B + C → Accuracy ≈ 78–85%
Comparison Table :
• Bagging – models trained independently; goal: reduce variance; aggregation: voting / averaging; key use: Random Forest.
• Boosting – models trained sequentially; goal: reduce bias; aggregation: weighted vote / sum; key use: AdaBoost, XGBoost.
• Stacking – diverse models trained in parallel; goal: improve accuracy; aggregation: meta-learner (an ML model); key use: ensemble combinations.

Advantages
• Higher accuracy than individual models
• Combats overfitting and underfitting
• Handles both classification and regression

Disadvantages
• Computationally expensive
• Harder to interpret than single models
• May overfit if not tuned properly (especially boosting)

Real-World Applications
• Fraud detection
• Credit scoring
• Spam detection
• Stock price prediction
• Healthcare diagnostics
13]. Random Forest
Random Forest is an ensemble learning algorithm based on the principle of
bagging. It builds multiple decision trees and merges them to get a more
accurate and stable prediction.
It is one of the most powerful and widely used supervised learning algorithms.

Key Components
1. Base Learners: Decision Trees
2. Bagging (Bootstrap Aggregation): Each tree is trained on a random
subset of data
3. Random Feature Selection: At each split in a tree, only a random subset
of features is considered

How It Works
1. Create multiple random subsets of the original dataset (with
replacement).
2. Train a decision tree on each subset.
3. For prediction:
o Classification: Take a majority vote across all trees.
o Regression: Take the average of all tree outputs.
Algorithm Steps
Training:
• For each tree:
o Sample data points randomly with replacement.
o At each node, choose the best split among a random subset of
features.
o Grow the tree fully (or up to max depth).
Prediction:
• Classification: ŷ = majority vote of { T₁(x), T₂(x), …, T_B(x) }
• Regression: ŷ = (1/B) Σ_b T_b(x), the average of all tree outputs
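A minimal Random Forest sketch with scikit-learn (the wine dataset and hyperparameters are illustrative choices):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 100 trees, each grown on a bootstrap sample; a random subset of features is tried at every split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
print("feature importances:", rf.feature_importances_.round(3))
```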

Advantages
• High accuracy and generalization
• Handles both categorical and numerical data
• Robust to overfitting
• Handles missing values
• Works well even without feature scaling
Disadvantages
• Slower and more memory-intensive than a single tree
• Less interpretable than a single decision tree
• Not ideal for real-time predictions in constrained systems

Use Cases
• Fraud detection
• Stock market prediction
• Medical diagnosis
• Loan default prediction

14]. Evaluation of Classification Algorithms – (not included in these notes)
