Cp4252 ML Unit-II
1]. SUPERVISED LEARNING
Training Data :
• Suppose we have a dataset of different types of shapes, which includes
squares, rectangles, triangles, and polygons.
• Now the first step is to train the model for each shape:
➢ If the given shape has four sides, and all the sides are equal, then
it will be labelled as a square.
➢ If the given shape has three sides, then it will be labelled as a
triangle.
➢ If the given shape has six equal sides, then it will be labelled as a
hexagon.
Steps of Supervised Learning Algorithm :
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training set, a test set, and a validation set.
• Determine the input features of the training dataset, which should carry
enough information for the model to accurately predict the output.
• Determine the suitable algorithm for the model, such as support vector
machine, decision tree, etc.
• Execute the algorithm on the training dataset.
• Evaluate the accuracy of the model by providing the test set.
• If the model predicts the correct output, then the model is accurate (see the
code sketch below).
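A minimal Python sketch of this workflow, assuming scikit-learn is available and using its built-in Iris dataset purely as an illustrative stand-in for labelled training data:
```python
# Sketch of the supervised-learning workflow (dataset and model are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Gather labelled data (features X, labels y)
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets (a validation set could be split off similarly)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Choose a suitable algorithm and train it on the training set
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# 4. Evaluate accuracy on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```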
Example:
1. Determining whether an email is spam or not
2. The outcome can be a binary classification (e.g., via logistic regression)
3. Determining whether something is harmful or not
4. Rainfall tomorrow or no rainfall
Types of Classifier :
The algorithm which implements the classification on a dataset is known as a
classifier.
There are two types of classifications:
• Binary Classifier: the classification problem has only two possible outcomes
(e.g., spam / not spam).
• Multi-class Classifier: the classification problem has more than two possible
outcomes (e.g., classifying the shapes square, triangle, hexagon).
Advantages:
1. With the help of supervised learning, the model can predict the output
on the basis of prior experience.
2. In supervised learning, we can have an exact idea about the classes of
objects.
3. Supervised learning models help us to solve various real-world problems
such as fraud detection, spam filtering, etc.
Disadvantages:
1. Supervised learning models are not suitable for handling complex tasks.
2. Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
3. Training requires a lot of computation time.
4. In supervised learning, we need enough knowledge about the classes of
objects.
Regression vs Classification :
• Regression predicts a continuous numerical output (e.g., house price, tomorrow's temperature).
• Classification predicts a discrete class label (e.g., spam / not spam).
Both are forms of supervised learning; they differ only in the type of output variable.
2]. Linear Regression
Definition
Linear Regression is a supervised learning algorithm used for predictive analysis.
It models the linear relationship between a dependent variable (output) and one
or more independent variables (input).
Mathematical Representation :
The fitted line is ŷi = a + b·xi, and the overall error of the fit is measured by the
mean squared error:
MSE = (1/N) Σ (yi − ŷi)²
Where:
• yi = actual value
• ŷi = predicted value
• N = number of observations
Residual: The difference between the actual and predicted value, ei = yi − ŷi.
Smaller residuals mean better model accuracy.
Update Rule (when fitting by gradient descent):
a := a − η·∂MSE/∂a,  b := b − η·∂MSE/∂b
Where:
• η = learning rate
To minimize the sum of squared residuals:
S = Σ (yi − ŷi)²    (1)
Where:
• yi : actual observed value
• ŷi = f(xi) = a + b·xi : predicted value from the fitted line
• (yi − ŷi) : residual (error)
To determine a (intercept) and b (slope), solve the normal equations:
Σ yi = N·a + b·Σ xi
Σ xi yi = a·Σ xi + b·Σ xi²    (2)
Worked-out Example
Given data:
x y
8 4
3 12
2 1
10 12
11 9
3 4
6 9
5 6
6 1
8 14
Step 1: Calculate sums
N = 10, Σx = 62, Σy = 72, Σxy = 503, Σx² = 468
Step 2: Solve the normal equations
Solving gives:
• b = (NΣxy − Σx·Σy) / (NΣx² − (Σx)²) = (5030 − 4464) / (4680 − 3844) = 566 / 836 ≈ 0.677
• a = (Σy − b·Σx) / N = (72 − 0.677 × 62) / 10 ≈ 3.00
Final equation:
ŷ ≈ 3.00 + 0.68x
Residuals (Errors)
Example residual: for the first data point (x = 8, y = 4), the prediction is
ŷ = 3.00 + 0.677 × 8 ≈ 8.42, so the residual is 4 − 8.42 ≈ −4.42.
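As a quick check of the arithmetic above, a plain NumPy sketch that refits the same data:
```python
# Sketch: verify the worked example's least-squares fit with NumPy.
import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8], dtype=float)
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14], dtype=float)

n = len(x)
# Normal equations for simple linear regression y = a + b*x
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
a = (np.sum(y) - b * np.sum(x)) / n

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")   # about 0.677 and 3.00
print("residuals:", y - (a + b * x))
```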
Limitations
• Assumes zero or negligible error in the independent variable.
• Not suitable if the independent variable has significant measurement error.
• May require hypothesis testing and confidence intervals when dealing
with real-world data.
Model Fit
In machine learning, a model fit refers to how well the algorithm captures the
patterns in the training data and how well it generalizes to unseen data.
Underfitting
Underfitting occurs when the model is too simple to capture the underlying
patterns in the data.
Symptoms:
• Poor performance on training and testing data
• High bias, low variance
Reasons:
1. Insufficient training time
2. Model too simple (e.g., using a linear model for nonlinear data)
3. Incomplete or irrelevant features
4. Noisy or insufficient data
Fixes:
• Use a more complex model
• Add more features or data
• Clean the dataset
• Train for more epochs
Overfitting
Overfitting occurs when the model learns not only the patterns but also the
noise and outliers in the training data, leading to poor performance on new
(test) data.
Symptoms:
• Excellent accuracy on training data, but poor performance on new (test) data
• Low bias, high variance
Fixes:
• Simplify the model
• Regularization (L1 - Lasso, L2 - Ridge)
• Early stopping
• Dropout (in neural networks)
• Add more training data
Good Fit
• The model achieves low error on both training and testing datasets.
• This represents the ideal balance between bias and variance.
Bias–Variance Trade-off :
Fit Type        Bias    Variance
Underfitting    High    Low
Good fit        Low     Low
Overfitting     Low     High
Cross-Validation :
Cross-validation is a technique for checking how well a model generalizes to
unseen data by testing it on portions of the data it was not trained on.
How It Works:
• The dataset is split into:
o Training set
o Validation set
o Testing set
Most common: k-Fold Cross-Validation
• Split data into k subsets
• Train on k-1 folds, test on the remaining
• Repeat k times, each fold being the test set once
• Final score = average of all iterations
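A minimal k-fold cross-validation sketch, assuming scikit-learn and using a decision tree on the Iris data purely as placeholders:
```python
# Sketch: 5-fold cross-validation (the model and dataset are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Split into k=5 folds; train on 4 folds, test on the remaining fold, repeat 5 times.
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("final score (average):", scores.mean())
```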
Lasso Regression (L1 Regularization) :
Mathematical Formula
The Lasso cost function is:
J(β) = Σ (yi − ŷi)² + λ Σ | βj |
Where:
• yi : actual value
• ŷi : predicted value
• βj : model coefficients
• λ : regularization parameter
• Σ | βj | : L1 penalty term
How It Works :
Parameter λ      Effect on Model
λ = 0            Equivalent to ordinary linear regression (no penalty)
λ increasing     Coefficients shrink; some become exactly zero (feature selection)
λ very large     Almost all coefficients become zero; the model underfits
Use Cases
• Predictive modeling with many features
• Models requiring interpretability
• Use in compressed sensing, bioinformatics, text classification
Key Takeaways
• Lasso Regression = Linear Regression + L1 penalty
• Controls model complexity by shrinking coefficients
• Promotes sparse models (better interpretability)
• Helps avoid overfitting, especially in high-dimensional spaces
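A brief sketch with scikit-learn's Lasso (synthetic data, illustrative only) showing how the L1 penalty drives some coefficients to exactly zero:
```python
# Sketch: Lasso shrinks coefficients and zeroes out uninformative features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                               # 10 features
true_coef = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0])   # only 2 are relevant
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1)                 # alpha plays the role of lambda
lasso.fit(X, y)
print("estimated coefficients:", np.round(lasso.coef_, 2))
# Most coefficients come out exactly 0 -> a sparse, more interpretable model.
```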
6]. Classification
Decision Boundaries
• Classifiers learn decision boundaries that separate the input space into
different class regions.
Example:
• 2D plot of inputs: Different regions represent different coin classes.
• Boundaries can be:
o Linear (e.g., straight lines — simple but less accurate)
o Non-linear (e.g., curves — more complex, better separation)
Visual: (figure contrasting a linear and a non-linear decision boundary between the coin classes)
Logistic Regression :
Mathematical Formula
Hypothesis Function:
hθ(x) = 1 / (1 + e^(−θᵀx))
Where:
• θ : vector of weights
• x : input features
• Output hθ(x) is a value in the range (0, 1)
• Prediction is made as:
o If hθ(x) ≥ 0.5 → class = 1
o Else → class = 0
Sigmoid Function (Logistic Function)
σ(z) = 1 / (1 + e^(−z))
• S-curve shape
• Squashes any real-valued input into [0, 1]
• Interpreted as a probability
The weights are learned by gradient descent: θ := θ − η·∇J(θ), where
• η : learning rate
• ∇J(θ) : gradient of the cost function
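A tiny NumPy sketch of the sigmoid and the 0.5 decision threshold (the weights and input below are made up for illustration):
```python
# Sketch: logistic-regression hypothesis = sigmoid of a linear score.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # illustrative weights
x = np.array([1.0, 0.3, 0.8])        # illustrative input (first entry = bias term)

h = sigmoid(theta @ x)               # probability of class 1
print("h_theta(x) =", round(h, 3), "-> class", int(h >= 0.5))
```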
Applications of Logistic Regression :
• Spam filtering (spam vs not spam)
• Fraud detection
• Medical diagnosis (disease present / absent)
Advantages
• Simple to implement and interpret
• Probabilistic output
• Works well for linearly separable data
Limitations
• Can’t handle non-linear relationships directly
• Prone to underfitting if the relationship is complex
• Assumes independence of features
The Linear Model Family :
All of these models use the same linear predictor z = θᵀx; they differ in the
output function and loss suited to the task:
Task                     Model
Regression               Linear regression
Binary classification    Logistic regression
Count data               Poisson regression
Streaming / online       Adaline / LMS
Bias–Variance Insight
• High learning rate / too few iterations → under-fitting (high bias).
• Too many iterations with no regularisation → over-fitting (high variance).
• Early stopping acts like an implicit L2 penalty.
Gradient-descent training (each iteration):
o Compute the logits z(i) = θᵀx(i).
o Apply the sigmoid to get ŷ(i) = σ(z(i)).
o Compute the gradient ∇J(θ) = (1/m) Σ (ŷ(i) − y(i))·x(i) and update θ := θ − η·∇J(θ).
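A compact NumPy sketch of this training loop (synthetic data, fixed learning rate; all names are illustrative):
```python
# Sketch: batch gradient descent for logistic regression.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                 # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(2)
eta = 0.1                                     # learning rate
for _ in range(500):
    z = X @ theta                             # 1. compute logits
    y_hat = sigmoid(z)                        # 2. apply sigmoid
    grad = X.T @ (y_hat - y) / len(y)         # 3. gradient of the log-loss
    theta -= eta * grad                       # 4. parameter update

print("learned weights:", theta)
```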
Support Vector Machine (SVM) :
Core Idea
SVM tries to:
• Maximize the margin (distance) between the classes.
• Identify the support vectors: the data points closest to the separating
hyperplane.
Mathematical Formulation
For linearly separable data:
Given data (xi, yi) where yi ∈ {−1, +1}:
The decision boundary (hyperplane):
w·x + b = 0
Optimization Objective:
minimize (1/2)‖w‖²
Subject to:
yi(w·xi + b) ≥ 1 for all i
Decision Function
After training, a new point x is classified as:
f(x) = sign( Σ αi yi K(xi, x) + b )
Where:
• αi : learned support vector weights
• K : kernel function
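A short scikit-learn sketch (illustrative data) showing a linear SVM and its support vectors:
```python
# Sketch: linear SVM; the support vectors are the points nearest the hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)),   # class -1 cluster
               rng.normal(loc=+2, size=(20, 2))])  # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```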
Applications of SVM
Domain             Use Case
Email filtering    Spam vs non-spam
Disadvantages
• Not suitable for very large datasets
• Doesn’t perform well with noisy data
• Requires feature scaling
• Choosing the right kernel is crucial
10]. Kernel Methods
Common kernel functions:
• Polynomial: K(x, z) = (xᵀz + c)^d
• RBF (Gaussian): K(x, z) = exp(−‖x − z‖² / (2σ²))
• Sigmoid: K(x, z) = tanh(a·xᵀz + c)
Kernel Trick Example in SVM
In non-linear SVM, we use kernel functions to transform the feature
space so that classes become linearly separable.
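A small NumPy check of the kernel trick: for the degree-2 polynomial kernel K(x, z) = (xᵀz)², evaluating the kernel directly gives the same number as the dot product in the explicit expanded feature space φ(x) = (x1², √2·x1x2, x2²) (the input values below are made up):
```python
# Sketch: kernel value == inner product in the expanded feature space.
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D inputs
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z) ** 2          # kernel trick: never builds phi explicitly
explicit_value = phi(x) @ phi(z)     # same number via the expanded features

print(kernel_value, explicit_value)  # both 16.0
```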
Advantages
• Handles complex, non-linear relationships
• No need to compute high-dimensional features explicitly
• Used in many algorithms: SVM, PCA (Kernel PCA), Ridge, etc.
Disadvantages
• Kernel choice is crucial
• Can be computationally expensive on large datasets
Instance-Based Learning: k-Nearest Neighbours (k-NN)
How It Works
• Stores all training data (xi, yi)
• For a new point x, predicts based on its similarity (distance) to the stored xi
Disadvantages
• Slow prediction (stores all data)
• Sensitive to irrelevant features
• Needs distance metric tuning
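A minimal k-NN sketch with scikit-learn (Iris data used only as a placeholder):
```python
# Sketch: k-nearest neighbours -- no real training; prediction scans the stored data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 neighbours, Euclidean distance
knn.fit(X_train, y_train)                   # just stores the training data
print("test accuracy:", knn.score(X_test, y_test))
```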
Ensemble Methods :
1. Bagging (Bootstrap Aggregating)
• Trains models in parallel, each on a random bootstrap sample of the data
(sampled with replacement)
• Combines their outputs by majority vote (classification) or averaging (regression)
• Reduces variance (Random Forest is the best-known example)
2. Boosting
• Trains models sequentially
• Each model tries to correct the errors of the previous one
• Focuses more on difficult examples
Examples:
• AdaBoost (Adaptive Boosting): Assigns weights to samples
• Gradient Boosting: Models residuals using gradient descent
• XGBoost: Optimized version of gradient boosting (fast & accurate)
Boosting reduces bias and can create powerful models
3. Stacking (Stacked Generalization)
• Combines predictions of multiple different models (base
learners)
• A meta-model learns how to combine base model outputs
More flexible and can capture diverse learning patterns
Visual Intuition
Imagine:
• Model A: Accuracy = 70%
• Model B: Accuracy = 72%
• Model C: Accuracy = 71%
Ensemble of A + B + C → Accuracy ≈ 78–85%
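An illustrative scikit-learn sketch of combining three different base models with majority voting (dataset and models are placeholders; the exact accuracy gain will vary):
```python
# Sketch: a simple voting ensemble of three different classifiers.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(estimators=[
    ("A", LogisticRegression(max_iter=5000)),
    ("B", DecisionTreeClassifier(random_state=0)),
    ("C", KNeighborsClassifier()),
], voting="hard")                      # majority vote of A, B, C

ensemble.fit(X_train, y_train)
print("ensemble test accuracy:", ensemble.score(X_test, y_test))
```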
Comparison Table :
Method     Models Trained            Goal                       Output Aggregation        Key Use
Bagging    In parallel               Reduce variance            Majority vote / average   Random Forest
Boosting   Sequentially              Reduce bias                Weighted combination      AdaBoost, Gradient Boosting, XGBoost
Stacking   In parallel + meta-model  Combine diverse learners   Meta-model prediction     Blending different model types
Advantages
• Higher accuracy than individual models
• Combats overfitting and underfitting
• Handles both classification and regression
Disadvantages
• Computationally expensive
• Harder to interpret than single models
• May overfit if not tuned properly (especially boosting)
Real-World Applications
• Fraud detection
• Credit scoring
• Spam detection
• Stock price prediction
• Healthcare diagnostics
13]. Random Forest
Random Forest is an ensemble learning algorithm based on the principle of
bagging. It builds multiple decision trees and merges them to get a more
accurate and stable prediction.
It is one of the most powerful and widely used supervised learning algorithms.
Key Components
1. Base Learners: Decision Trees
2. Bagging (Bootstrap Aggregation): Each tree is trained on a random
subset of data
3. Random Feature Selection: At each split in a tree, only a random subset
of features is considered
How It Works
1. Create multiple random subsets of the original dataset (with
replacement).
2. Train a decision tree on each subset.
3. For prediction:
o Classification: Take a majority vote across all trees.
o Regression: Take the average of all tree outputs.
Algorithm Steps
Training:
• For each tree:
o Sample data points randomly with replacement.
o At each node, choose the best split among a random subset of
features.
o Grow the tree fully (or up to max depth).
Prediction:
• Classification: final output = majority vote of the individual tree predictions.
• Regression: final output = average of the individual tree outputs.
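A brief Random Forest sketch with scikit-learn (dataset and hyper-parameters are illustrative):
```python
# Sketch: Random Forest = bagging of decision trees + random feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))   # majority vote of the trees
```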
Advantages
• High accuracy and generalization
• Handles both categorical and numerical data
• Robust to overfitting
• Handles missing values
• Works well even without feature scaling
Disadvantages
• Slower and more memory-intensive than a single tree
• Less interpretable than a single decision tree
• Not ideal for real-time predictions in constrained systems
Use Cases
• Fraud detection
• Stock market prediction
• Medical diagnosis
• Loan default prediction
14].