
MOOCS SEMINAR REPORT

ON
MACHINE LEARNING SPECIALIZATION
(CSE V Semester MOOC Seminar) 2025-2026

Submitted To: Submitted By:


Mr. Samir Rana Mr. Gaurav Singh
(CC-CSE-A2-V-SEM) Roll No. 2218768

CSE-A2-V-Sem
Session- 2025-2026

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GRAPHIC ERA HILL UNIVERSITY, DEHRADUN


CERTIFICATE
(from Internal Co-ordinator of MOOC i.e. Class Coordinator)

Certified that Mr. Gaurav Singh (Roll No. 2218768) has completed the MOOC Seminar on the topic "Machine Learning Specialization" (Modules 1 & 2) from Coursera in fulfilment of the CSE V Semester MOOC Seminar requirement at Graphic Era Hill University, Dehradun. To the best of my knowledge, the student has successfully completed this course.

DATE: 20/01/2025
(Mr. Samir Rana)
Class Coordinator
CC-CSE-A2-V-Sem
CSE Department
GEHU, DEHRADUN
CERTIFICATE

Module 1:
https://coursera.org/verify/6ZP2QHZOUEYM

Module 2: https://coursera.org/verify/JMMG3Q1DJETI
ACKNOWLEDGEMENT

I wish to thank my parents for their continuing support and encouragement, and for providing me with the opportunity to reach this far in my studies.
I also thank my Class Coordinator, Mr. Samir Rana, who helped me understand this course. Last but not least, I am greatly indebted to all others who directly or indirectly helped me during this course.

Mr. Gaurav Singh


Roll No.- 2218768
CSE-A2-V-Sem
Session: 2025-2026
GEHU, Dehradun
TABLE OF CONTENTS

Module 1: Supervised Machine Learning: Regression & Classification

o Introduction to Machine Learning and its Applications.


o Supervised & Unsupervised Learning.
o Linear Regression
 Cost function
 Gradient Descent
o Multiple Linear Regression
 Feature Scaling and Feature Engineering
o Logistic Regression
 Sigmoid activation function
 Cost function for Logistic Regression
o Overfitting
 Cost function with regularization

Module 2: Advanced Learning Algorithms

o Neural Network
 Building a Neural Network
 Forward Propagation in Neural Network
o Activation Function
 Activation Function (ReLU)
 Softmax for Multiclass Classification
 Additional Layers
o Model Selection
o Bias-Variance Tradeoff
o Decision Trees
 Information Gain
 Random Forest
Module 1
Introduction to Machine Learning and its Applications

Machine Learning is a branch of Artificial Intelligence (AI) that enables systems to


learn from data and make decisions or predictions without being explicitly programmed.
In simple terms, ML models identify patterns in historical data to predict future
outcomes or categorize new data.
Imagine a machine learning model used to predict house prices:
 Input: Features like house size, number of bedrooms, location, etc.
 Output: Predicted house price.
 Goal: Train a model using past sales data to accurately predict prices for new
houses.
ML models such as linear regression, decision trees, or neural networks can be used to
solve this problem by learning patterns in the dataset.

Key Applications of Machine Learning:

1. Healthcare:
o Disease diagnosis (e.g., identifying cancer in medical images).
o Predicting patient outcomes (e.g., risk of heart attacks based on patient
history).
2. Finance:
o Fraud detection (e.g., identifying unusual credit card transactions).
o Loan approval based on creditworthiness.
3. Retail:
o Recommendation systems (e.g., suggesting products on e-commerce
platforms).
o Demand forecasting for inventory management.
4. Transportation:
o Self-driving cars (e.g., identifying pedestrians and road signs).
o Traffic prediction (e.g., using GPS data to estimate travel time).
5. Natural Language Processing (NLP):
o Sentiment analysis (e.g., classifying customer reviews as positive or
negative).
o Chatbots and virtual assistants (e.g., Siri, Alexa).
6. Image and Speech Recognition:
o Face recognition for security.
o Voice-controlled devices.
Supervised & Unsupervised Learning

Supervised Learning:
In supervised learning, the model is trained on labeled data, where the input data is
paired with the correct output (target). The goal is to learn a mapping from inputs to
outputs and make predictions for new inputs.

Examples of Supervised Learning:


1. Regression: Regression involves predicting a continuous numeric value based on
input features. It aims to model the relationship between the input variables (x)
and the output variable (y).
o Example: Predicting house prices.
 Input: Features like house size (x1) and number of bedrooms (x2).
 Output: Predicted price (y).
 Equation: h(x)=θ0+θ1x1+θ2x2
2. Classification: Classification involves predicting a discrete category or class label
based on input features. It is used for problems where the output is categorical.
o Example: Spam detection.
 Input: Email text or metadata.
 Output: Spam (1) or Not Spam (0).

Unsupervised Learning:
In unsupervised learning, the model is trained on data without labeled outputs. The goal
is to discover hidden patterns, groupings, or structures within the data.

Examples of Unsupervised Learning:


1. Clustering: Clustering is an unsupervised learning technique that groups data
points into clusters based on their similarity or distance from each other. Unlike
supervised learning, clustering does not use labeled data; instead, it identifies
inherent patterns or groupings in the data.
o Example: Customer segmentation in marketing.
 Input: Customer purchase history.
 Output: Groups like "High spenders" or "Frequent buyers."
2. Dimensionality Reduction: Reduce the number of features while retaining
important information.
o Example: Principal Component Analysis (PCA) for visualizing high-
dimensional data.
Linear Regression

Linear regression predicts a continuous output y based on input x, and we denote


the predicted value as y_hat. The goal is to model the relationship between x (input
features) and y (output) using a linear equation.
Key Concepts
1. Hypothesis (Model)
The linear regression hypothesis predicts y as:
y_hat =w0+w1x.
Where:
 y_hat: Predicted value (hypothesis).
 w0: Intercept (bias term).
 w1: Slope (weight for x).
 x: Input feature.

Example of Linear Regression

Problem Statement
You are given data about the size of houses (in square feet) and their corresponding
prices (in $1000s). Your task is to find a linear relationship between house size and
price, and predict the price for a new house of size 1800 sq. ft.

Dataset
x=[1000,1500,2000,2500,3000] (House size in sq. ft)
y=[200,250,300,350,400] (Price in $1000s)

Step 1: Assume a Linear Relationship


The linear regression equation is:
y^=w0+w1x

Step 2: Fit the Line Using the Dataset


Using statistical methods or computational tools (e.g., least squares), the best-fit
line for the given data is found to have:
w0=100 and w1=0.1
So, the equation becomes:
y^=100+0.1x

Step 3: Predict the Price for 1800 sq. ft


Substitute x=1800 into the equation:
y^=100+0.1(1800)
y^=100+180=280
Final Answer
The predicted price for a house of size 1800 sq. ft is: $280,000
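The worked example above can be reproduced in a few lines of Python. The sketch below is illustrative only (it assumes NumPy is available and uses an ordinary least-squares fit rather than the course's gradient descent):

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)  # house size in sq. ft
y = np.array([200, 250, 300, 350, 400], dtype=float)       # price in $1000s

# Fit y_hat = w0 + w1*x by ordinary least squares.
A = np.column_stack([np.ones_like(x), x])
(w0, w1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(w0, w1)            # approximately 100.0 and 0.1 for this data
print(w0 + w1 * 1800)    # approximately 280.0, i.e. $280,000
```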

Cost Function:
A cost function is a mathematical function used to measure the error or discrepancy
between the predicted values (y^) and the actual values (y) in a regression model. It
quantifies how well the model's predictions align with the actual data.

Why is the Cost Function Used?


1. Evaluate Model Performance:
The cost function provides a single scalar value that indicates how well the model
fits the training data. Smaller values indicate a better fit.
2. Optimize Model Parameters:
The cost function is minimized during training to find the best parameters
(w0,w1,…,wn ) for the model.

Cost Function in Linear Regression


In linear regression, the most commonly used cost function is the Mean Squared Error
(MSE):
J(w) = 1/(2m) ∑ i=1 to m (y^(i) − y(i))^2
Where:
 J(w): Cost function value for given parameters w.
 m: Number of training examples.
 y^(i) = w0 + w1x(i): Predicted value for the i-th training example.
 y(i): Actual value for the i-th training example.
How is the Cost Function Used in Linear Regression?
1. Prediction (y^):
For each input x(i) , compute the predicted value y^(i) using the model:
y^(i)=w0+w1x(i)
2. Calculate the Error:
Compute the difference between y^(i) and y(i):
e(i) = y^(i) − y(i)
3. Compute the Cost:
Use the squared errors for all examples to calculate J(w):
J(w) = 1/(2m) ∑ i=1 to m (e(i))^2
4. Optimize the Parameters (w0, w1 ):
Use optimization techniques like Gradient Descent to iteratively adjust the
parameters w0, w1 to minimize J(w).
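As an illustration of steps 1-3, the MSE cost can be computed directly with NumPy (a minimal sketch; the function and variable names are chosen for this report, not taken from the course):

```python
import numpy as np

def compute_cost(x, y, w0, w1):
    """MSE cost J(w) = 1/(2m) * sum over i of (y_hat(i) - y(i))^2."""
    m = len(y)
    y_hat = w0 + w1 * x                     # step 1: predictions
    errors = y_hat - y                      # step 2: errors
    return np.sum(errors ** 2) / (2 * m)    # step 3: cost

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200, 250, 300, 350, 400], dtype=float)
print(compute_cost(x, y, 100, 0.1))   # 0.0 -- this line fits the data exactly
print(compute_cost(x, y, 50, 0.1))    # a larger cost for worse parameters
```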

Gradient Descent:
Gradient Descent is an optimization algorithm used to minimize the cost function by
iteratively updating the model parameters (e.g., w0,w1 ) in the direction that reduces the
cost function value.
The cost function J(w) represents the error of the model for given parameters w0,w1 .
Gradient Descent adjusts the parameters by calculating the slope (gradient) of J(w) and
moving in the opposite direction of the gradient to minimize the cost.

Gradient Descent Algorithm:


For each parameter wj (j=0,1,…,n), the update rule is:
wj := wj − α * (∂J(w)/∂wj)
Where:
 α: Learning rate (controls step size).
 ∂J(w)/∂wj: Gradient of the cost function with respect to wj.
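The update rule above can be sketched as a short NumPy loop. This is illustrative only: the learning rate and iteration count are arbitrary, and because the feature values here are in the thousands a very small learning rate is needed, which is one motivation for the feature scaling discussed later.

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-7, iters=1000):
    """Batch gradient descent for the model y_hat = w0 + w1*x."""
    m = len(y)
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        y_hat = w0 + w1 * x
        error = y_hat - y
        dw0 = np.sum(error) / m          # partial derivative of J(w) w.r.t. w0
        dw1 = np.sum(error * x) / m      # partial derivative of J(w) w.r.t. w1
        w0 -= alpha * dw0                # move against the gradient
        w1 -= alpha * dw1
    return w0, w1

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200, 250, 300, 350, 400], dtype=float)
print(gradient_descent(x, y))   # slowly approaches the least-squares fit
```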
Multiple Linear Regression

Multiple Linear Regression is an extension of simple linear regression that models the
relationship between a dependent variable (y) and multiple independent variables
(x1,x2,…,xn).
The equation for multiple linear regression is:
y^=w0+w1x1+w2x2+⋯+wnxn
Where:
 y^: Predicted value.
 w0: Intercept (constant term).
 w1,w2,…,wn: Coefficients for each independent variable.
 x1,x2,…,xn: Independent variables.

Example: Predicting House Prices

Problem Statement:
A real estate agent wants to predict house prices (y) based on:
 x1: Size of the house (in square feet).
 x2: Number of bedrooms.
 x3: Distance from the city center (in miles).
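Assuming scikit-learn is available, a minimal sketch of this problem looks as follows; the data values below are invented purely for illustration and are not from the course:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (sq. ft), number of bedrooms, distance from city center (miles)
X = np.array([
    [1400, 3, 5.0],
    [1800, 4, 3.0],
    [2400, 4, 8.0],
    [1100, 2, 2.0],
    [3000, 5, 10.0],
], dtype=float)
y = np.array([240, 320, 330, 230, 380], dtype=float)  # hypothetical prices in $1000s

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # w0 and (w1, w2, w3)
print(model.predict([[2000, 3, 4.0]]))   # predicted price for a new house
```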

Feature scaling:
Feature Scaling refers to the process of standardizing or normalizing the range of
independent variables (features) to ensure all features contribute equally to the model.
Why is Feature Scaling Needed?
1. Avoid Dominance of Large-Scale Features
2. Improve Gradient Descent Convergence

Methods of Feature Scaling:


i. Min-Max Scaling (Normalization): Rescales features to a fixed range, typically [0,1]:
x_scaled = (x − x_min) / (x_max − x_min)

ii. Z-Score Scaling (Standardization): Centers data around zero with unit variance:
x_scaled = (x − μ) / σ
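Both methods are one-liners in NumPy. A minimal sketch on the house-size feature used earlier:

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)  # house sizes

# Min-max scaling: rescale to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: zero mean, unit variance
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # [0.   0.25 0.5  0.75 1.  ]
print(x_zscore)   # values centered around 0
```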

Feature Engineering:
Feature Engineering is the process of creating, transforming, or selecting features to
improve the predictive performance of a machine learning model.

Why is Feature Engineering Important?


1. Improves Model Performance
2. Reduces Model Complexity
3. Makes Models More Interpretable

Steps in Feature Engineering:


1. Feature Creation: Combines or derives new features from existing ones.
o Interaction Terms:
Example: If x1 is income and x2 is spending, create x3=x1×x2
o Date-Time Features:
Extract components like day, month, or hour from timestamps.
2. Feature Selection: Identifies the most important features using statistical methods
or algorithms.
o Techniques:
 Correlation analysis.
 Recursive Feature Elimination (RFE).
 Feature importance from tree-based models.
3. Handling Missing Data:
o Replace missing values with the mean, median, or mode.
o Use advanced techniques like K-Nearest Neighbors imputation.
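A minimal pandas sketch of steps 1 and 3 above; the column names and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "income":    [45000, 60000, None, 82000],
    "spending":  [12000, 20000, 15000, 30000],
    "timestamp": pd.to_datetime(["2025-01-05 09:30", "2025-01-06 18:10",
                                 "2025-01-07 14:45", "2025-01-08 08:00"]),
})

# 3. Handling missing data: fill the missing income with the column median
#    (done first here so the interaction term below has no gaps).
df["income"] = df["income"].fillna(df["income"].median())

# 1. Feature creation: an interaction term and date-time components
df["income_x_spending"] = df["income"] * df["spending"]
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour

print(df)
```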
Logistic Regression

Logistic Regression is a statistical model used primarily for binary classification tasks.
It is an extension of linear regression but is designed to predict the probability that a
given input belongs to a particular class (usually labeled as 0 or 1).
Key Concepts of Logistic Regression:
1. Binary Outcome: Logistic regression is used when the outcome variable is
categorical, specifically when it has two possible outcomes, typically labeled as 0
and 1.
2. Sigmoid Function: The core of logistic regression is the sigmoid function, also
known as the logistic function, which converts any real-valued number into a
value between 0 and 1. This output can then be interpreted as a probability.
The formula for the sigmoid function is:
σ(z) = 1 / (1 + e^(−z))
Where z is the linear combination of the input features, represented as:
z = w^T x + b
o w is a vector of weights (coefficients).
o x is a vector of input features (independent variables).
o b is the bias term (intercept).
The sigmoid function will output a value between 0 and 1, which can be
interpreted as the probability of the sample belonging to class 1.

3. Hypothesis (Model): In logistic regression, the hypothesis h(x) is given by the


sigmoid function applied to the linear combination of features and weights:
h(x) = σ(w^T x + b)
This gives the probability that the input vector x belongs to class 1 (versus class
0).
4. Threshold for Classification: To make a decision about which class an input
belongs to, a threshold is applied to the output probability:
o If h(x)≥0.5, classify the sample as class 1.
o If h(x)<0.5, classify the sample as class 0.
The threshold can be adjusted depending on the application and desired trade-off
between precision and recall.

Cost Function for Logistic Regression:


The cost function, also known as the loss function, for logistic regression is the log loss
or binary cross-entropy. It measures how well the predicted probabilities match the
actual outcomes. The formula for binary cross-entropy is:

J = −(1/N) ∑ i=1 to N [ yi · log(p(yi)) + (1 − yi) · log(1 − p(yi)) ]
Where:
N is the number of training examples.
yi is the actual label for the ith training example.
p(yi) is the predicted probability for the ith training example.
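A minimal NumPy sketch of the sigmoid, the 0.5 threshold, and the binary cross-entropy loss; the weights and feature values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p):
    """Log loss averaged over the N training examples."""
    eps = 1e-12                      # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy example: weights, bias, and two feature vectors (illustrative values)
w = np.array([1.5, -0.5])
b = 0.2
X = np.array([[2.0, 1.0], [0.5, 3.0]])
y = np.array([1, 0])

p = sigmoid(X @ w + b)               # predicted probability of class 1
print(p)
print(binary_cross_entropy(y, p))
print((p >= 0.5).astype(int))        # apply the 0.5 decision threshold
```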
Overfitting

Overfitting occurs when a machine learning model learns not only the underlying
patterns in the data but also the noise and random fluctuations. As a result, the model
performs very well on the training data but poorly on unseen data (test set), because it
has essentially memorized the training examples instead of generalizing to new data.
Symptoms of Overfitting:
 High training accuracy but low test accuracy.
 The model performs well on the training data but fails to generalize to new,
unseen data.

Why does Overfitting Happen?


 Complex Models: When the model has too many parameters relative to the
number of training examples, it has the capacity to memorize the data rather than
generalizing.
 Insufficient Training Data: With fewer data points, the model has less information
to work with, making it prone to overfitting the noise.
 Training too long: Allowing the model to learn for too many iterations or epochs
can lead to overfitting, as it starts memorizing the noise in the training data.

Cost Function with Regularization:


Regularization techniques add a penalty term to the cost function to prevent overfitting
by discouraging the model from fitting the data too closely. This term effectively
reduces the magnitude of the model parameters (weights), leading to simpler models
that generalize better.

Types of Regularization:
1. L2 Regularization (Ridge Regression):
o L2 regularization adds the sum of the squared values of the weights to the
cost function, discouraging large weights.
o The regularized cost function is:

J(w,b) = −(1/m) ∑ i=1 to m [ y(i) log(hw,b(x(i))) + (1−y(i)) log(1−hw,b(x(i))) ] + λ ∑ j=1 to n (wj)^2
Where:
 λ is the regularization parameter (also known as the regularization
strength).
 ∑ j=1 to n (wj)^2 is the sum of the squared values of the weights (for
all features except the bias term).
 The regularization term helps to shrink the weights, making the
model simpler.
2. L1 Regularization (Lasso Regression):
o L1 regularization adds the sum of the absolute values of the weights to the
cost function, which can encourage sparsity (some weights become zero).
o The regularized cost function is:
J(w,b) = −(1/m) ∑ i=1 to m [ y(i) log(hw,b(x(i))) + (1−y(i)) log(1−hw,b(x(i))) ] + λ ∑ j=1 to n |wj|
Where:
 λ is the regularization parameter (which controls the penalty term).
 ∑ j=1 to n |wj| is the sum of the absolute values of the weights.
 L1 regularization can drive some weights to exactly zero, effectively
performing feature selection.
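A minimal sketch of adding an L1 or L2 penalty to the binary cross-entropy; note that some texts scale the penalty by 1/(2m), so the exact constant here is a convention choice for illustration, not the definitive formula:

```python
import numpy as np

def regularized_log_loss(y, p, w, lam, kind="l2"):
    """Binary cross-entropy plus an L1 or L2 penalty on the weights (bias excluded)."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    data_term = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    if kind == "l2":
        penalty = lam * np.sum(w ** 2)           # Ridge-style penalty
    else:
        penalty = lam * np.sum(np.abs(w))        # Lasso-style penalty
    return data_term + penalty

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([2.0, -1.5, 0.3])
print(regularized_log_loss(y, p, w, lam=0.1, kind="l2"))
print(regularized_log_loss(y, p, w, lam=0.1, kind="l1"))
```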
Module 2

Neural Network

A Neural Network is a computational model inspired by the way biological neural


networks (like the human brain) process information. It is widely used in machine
learning for tasks such as classification, regression, image recognition, and more.
A basic neural network consists of layers of interconnected neurons. These neurons
process input data, apply transformations, and pass the result through activation
functions to produce the output.
Basic Neuron Model (Artificial Neuron):
An artificial neuron is a simplified version of a biological neuron. It receives input,
processes that input, and produces an output. The components of an artificial neuron
are:
1. Inputs (x1,x2,…,xn): These are the features or signals that the neuron receives
from the previous layer or from the external environment (e.g., pixel values in an
image).
2. Weights (w1,w2,…,wn): Each input is associated with a weight. The weight
controls the importance of the corresponding input. In a learning process, the
weights are adjusted to minimize the error in the network's predictions.
3. Bias (b): The bias term is an additional parameter added to the weighted sum of
inputs. It allows the neuron to shift its activation function, which helps the
network learn patterns more effectively. The bias is typically a constant value that
is learned during the training process.
4. Summation (Weighted Sum): The neuron computes a weighted sum of all its
inputs:
z=w1x1+w2x2+⋯+wnxn+b

Structure of an Artificial Neuron


Building a Neural Network:
Here’s a simple guide to building a neural network for binary classification:
1. Initialize Weights and Biases: Start with small random values for weights and
biases.
2. Forward Propagation: The main task in building a neural network is forward
propagation, where data flows through the layers of the network to generate
predictions.
3. Activation Functions: After computing the weighted sum of inputs, the result is
passed through an activation function to introduce non-linearity and enable the
network to model complex relationships.
4. Cost Function: The cost function (like binary cross-entropy for binary
classification) is used to quantify the difference between the network's predictions
and the actual labels. The network’s objective is to minimize this cost.
5. Optimization (Gradient Descent): Once the cost is calculated, backpropagation is
used to compute the gradients, and the weights and biases are updated to
minimize the cost function using an optimization algorithm like gradient descent.
6. Training: This process of forward propagation, cost computation,
backpropagation, and optimization is repeated iteratively through many epochs
until the model converges.

Forward Propagation in Neural Network:


Forward propagation is the process by which input data is passed through the neural
network to produce an output. Here's how it works step-by-step:
1. Input Data: The input features x (such as a vector of pixel values for an image)
are fed into the input layer of the neural network.
2. Weighted Sum Calculation: In each layer, a weighted sum of the inputs is
computed. For the j-th neuron in a layer:
zj=∑i=1 to n wji*xi+bj
Where:
o xi are the input features.
o wji are the weights connecting input neurons to the j-th neuron in the
current layer.
o bj is the bias for the j-th neuron.
3. Apply Activation Function: After calculating the weighted sum zj, the result is
passed through an activation function σ(zj), which introduces non-linearity:
aj=σ(zj)
Where σ(zj) could be:
o Sigmoid: σ(zj)=1/(1+e−zj)
o ReLU: σ(zj)=max(0,zj)
4. Pass to Next Layer: The output aj from each neuron in a layer becomes the input
for the neurons in the next layer. This process is repeated through all the hidden
layers.
5. Final Output: The last layer (output layer) produces the final output prediction. In
a binary classification, this could be a probability value between 0 and 1,
representing the likelihood of the positive class. For multi-class classification, the
output could be a vector of probabilities (e.g., using the softmax function).
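A minimal NumPy sketch of forward propagation through one hidden layer with ReLU and a sigmoid output neuron; the layer sizes and random weights are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass: hidden layer with ReLU, then a sigmoid output layer."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1        # weighted sums for the hidden layer
    a1 = relu(z1)           # hidden-layer activations
    z2 = W2 @ a1 + b2       # weighted sum for the output neuron
    a2 = sigmoid(z2)        # probability of the positive class
    return a2

# Tiny network: 3 inputs -> 4 hidden units -> 1 output (random illustrative weights)
rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), np.zeros(4),
          rng.normal(size=(1, 4)), np.zeros(1))
x = np.array([0.5, -1.2, 3.0])
print(forward(x, params))   # a single probability between 0 and 1
```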
Activation Function

Activation functions are crucial components of neural networks that introduce non-
linearity into the network. Without activation functions, a neural network would simply
be a linear model, regardless of how many layers it has. This would limit its ability to
capture complex patterns and relationships in the data.

Why Do We Need Activation Functions?


 Introduce Non-Linearity:
 Neural networks aim to model complex patterns. Without an activation function,
the network would be a series of linear transformations, which is not sufficient to
capture complex data patterns. By introducing non-linearity, activation functions
allow the network to learn from the errors and adjust the weights during training,
enabling it to approximate complex functions.
 Ability to Learn Complex Patterns:
 In a multi-layer neural network, the combination of linear transformations
(without activation functions) will result in a linear mapping between input and
output. An activation function enables the network to capture non-linear
relationships between inputs and outputs, making it possible to model more
complex tasks like image recognition, speech processing, etc.
 Modeling Decision Boundaries:
 The decision boundaries in a classification problem become non-linear after
applying activation functions. This is especially useful when trying to classify
data that cannot be separated by a straight line (e.g., XOR problem).
 Enabling Backpropagation:
 Activation functions are necessary for backpropagation. During training, the
gradients (used to update weights) depend on the activation function. For
example, if the activation function has a derivative (like sigmoid or ReLU), we
can compute the gradient and update weights.
 Control Output Range:
 Activation functions can constrain the output range of a neuron, which can be
beneficial in certain cases. For example, the sigmoid function outputs values
between 0 and 1, which is ideal for binary classification, while softmax outputs
probabilities that sum to 1 for multiclass classification.
Rectified Linear Unit(ReLU):
The ReLU function is the most commonly used activation function for hidden layers in
a neural network. It’s simple and effective in avoiding the vanishing gradient problem,
which makes training deep neural networks much easier.
ReLU Function:
ReLU(x)=max(0,x)
 If x≥0, the output is x.
 If x<0, the output is 0.
ReLU is non-linear, which allows the network to learn more complex patterns. It also
has the advantage of being computationally efficient because it only involves a
comparison operation.
Benefits of ReLU:
 Simple and Fast:
 ReLU is computationally efficient as it involves only a comparison operation.
 This simplicity makes ReLU faster than sigmoid or tanh.
 Sparse Activation:
 ReLU outputs zero for negative inputs, which leads to sparse activations. Sparse
activations mean that only a subset of neurons are active for a given input,
improving model efficiency and reducing overfitting.
 Avoids Vanishing Gradient:
 Unlike sigmoid and tanh, the derivative of ReLU is constant (1) for positive
inputs, which helps prevent gradients from becoming very small during
backpropagation.
 Improves Training Convergence:
 ReLU helps networks converge faster during training because it does not saturate
for large positive values of x, unlike sigmoid and tanh.

However, there are some potential downsides, such as the dying ReLU problem where
neurons can get stuck and stop learning if they enter a region where they output zeros
constantly. Variants like Leaky ReLU or Parametric ReLU can be used to address this.
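A minimal NumPy sketch of ReLU and the Leaky ReLU variant mentioned above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: keeps a small slope for negative inputs, addressing 'dying ReLU'."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))         # [0.  0.  0.  2.]
print(leaky_relu(z))   # [-0.03  -0.005  0.  2.]
```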
Softmax for Multiclass Classification:
The Softmax function converts the outputs of the network into a set of probabilities,
where each probability corresponds to a class label.
Given the raw output z=[z1,z2,…,zk] (logits) from the output layer, the Softmax
function is defined as:
Softmax(zi) = e^(zi) / ∑ j=1 to k e^(zj)
Where:
 zi is the raw score (logit) for class i,
 k is the number of classes,
 The denominator is the sum of exponentials of all the logits, ensuring that the
output values sum to 1 and can be interpreted as probabilities.

Key Features of Softmax:


 The output is a vector of probabilities, with each element representing the
likelihood of the corresponding class.
 The sum of all output probabilities is 1, making it a valid probability distribution.
 The class with the highest probability is typically the predicted class.
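A minimal NumPy sketch of the softmax function (using the common max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax(z):
    """Convert raw logits into a probability distribution over classes."""
    z = z - np.max(z)          # subtract the max logit for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for 3 classes
probs = softmax(logits)
print(probs)               # roughly [0.659 0.242 0.099]
print(probs.sum())         # 1.0
print(np.argmax(probs))    # index of the predicted class -> 0
```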

Additional Layers:
1. Convolutional Layers (Conv Layers)
 Purpose: Extract features from input data, especially in image processing tasks.
 How It Works:
o Applies convolution operations between input data and a set of learnable
filters (kernels).
o Captures local patterns, such as edges, textures, or specific shapes, in
images.
 Key Hyperparameters:
o Filter size
o Number of filters
o Stride and padding
 Applications: Computer vision (image classification, object detection,
segmentation).

2. Pooling Layers
 Purpose: Reduce the spatial dimensions (height and width) of the feature maps to
decrease computational complexity and focus on dominant features.
 Types:
o Max Pooling: Takes the maximum value in a region.
o Average Pooling: Computes the average value in a region.
o Global Pooling: Applies pooling across the entire feature map.
 Applications: Used in convolutional neural networks (CNNs) to downsample
data.

3. Dropout Layers
 Purpose: Prevent overfitting by randomly "dropping out" a fraction of neurons
during training.
 How It Works:
o Sets a random subset of activations to zero at each training iteration.
o Forces the network to learn more robust features by not relying on any
specific neuron.
 Key Parameter: Dropout rate (e.g., 0.5 means 50% of neurons are dropped).
 Applications: General-purpose, effective in fully connected and convolutional
layers.

4. Batch Normalization Layers


 Purpose: Normalize the input of each layer to have a mean of 0 and variance of 1,
stabilizing and accelerating training.
 How It Works:
o Normalizes activations across the batch.
o Includes learnable parameters (γ,β) to maintain the layer's capacity to
represent features.
 Benefits:
o Reduces sensitivity to weight initialization.
o Mitigates the vanishing or exploding gradient problem.
 Applications: Common in deep learning architectures, especially deep CNNs and
RNNs.
Model Selection

Model selection is the process of choosing the best model among a set of candidate
models to solve a particular problem. It involves evaluating various models based on
their performance on unseen data to ensure the chosen model generalizes well.

Why Model Selection is Important?


1. Avoid Overfitting/Underfitting:
o Overfitting: Model too complex; performs well on training data but poorly
on test data.
o Underfitting: Model too simple; fails to capture the underlying patterns.
2. Optimize Performance:
o Ensure the model achieves a good balance between bias and variance.
3. Efficient Resource Utilization:
o Avoid wasting computational resources on models that do not perform
well.

Techniques for Model Selection:


1. Cross-Validation
 Purpose: Evaluate model performance using limited data without overfitting.
 Methods:
o k-Fold Cross-Validation:
 Split the data into k subsets (folds).
 Train on k−1 folds and validate on the remaining fold.
 Repeat k times and average the results.
o Leave-One-Out Cross-Validation (LOOCV):
 Train on all data except one instance and validate on that instance.
Repeat for all instances.
 Advantage: Provides robust performance estimates.
 Disadvantage: Computationally expensive for large datasets.
2. Train-Test Split
 Purpose: Evaluate a model's performance on unseen data.
 How It Works:
o Split the dataset into training and test sets (e.g., 80-20 split).
o Train the model on the training set and evaluate on the test set.
 Limitation: Single split may not represent the data distribution well.
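Assuming scikit-learn is available, a minimal sketch of comparing two candidate models with 5-fold cross-validation and then checking the chosen model on a held-out test set; the dataset and candidate models are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = [
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
]
for name, model in candidates:
    scores = cross_val_score(model, X_train, y_train, cv=5)   # k = 5 folds
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

# Final check of the selected model on the held-out test set
best = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))
```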
Bias & Variance Tradeoff

The bias-variance tradeoff is a fundamental concept in machine learning and statistics


that describes the balance between two sources of error in a predictive model: bias and
variance.

1. Bias
 Definition: Bias is the error introduced by approximating a real-world problem
(which may be complex) by a simplified model. It reflects how far off the model's
predictions are from the actual target values on average.
 High Bias: Models with high bias are overly simplistic and may not capture the
underlying patterns of the data (underfitting).
o Example: Using a linear model to fit non-linear data.

2. Variance
 Definition: Variance is the error caused by the model's sensitivity to small
fluctuations in the training data. It reflects how much the model's predictions
would change if trained on different data sets.
 High Variance: Models with high variance are overly complex and overly tuned
to the training data (overfitting).
o Example: Using a high-degree polynomial to fit noisy data.

Tradeoff
 A model with too much bias will miss the relevant patterns in the data (underfit),
while a model with too much variance will model the random noise in the data
rather than the underlying pattern (overfit).
 The goal is to find a balance between bias and variance that minimizes the total
error (sum of bias squared, variance, and irreducible error).
Error metrics for skewed datasets:
When working with skewed datasets (where one class or outcome is significantly more
frequent than others), standard error metrics like accuracy can be misleading. Instead,
you should use error metrics that account for the class imbalance and better reflect the
model's performance on minority classes. Below are the most commonly used metrics
for skewed datasets, along with their applications.

1. Precision
 Definition: The proportion of true positive predictions out of all positive
predictions: Precision = True Positives (TP) / (TP + False Positives (FP))
 When to Use:
o When false positives are more costly than false negatives.
o Example: In spam detection, you want fewer legitimate emails incorrectly
classified as spam.

2. Recall (Sensitivity or True Positive Rate)


 Definition: The proportion of actual positives that are correctly identified:
Recall = TP / (TP + False Negatives (FN))
 When to Use:
o When false negatives are more costly than false positives.
o Example: In medical diagnosis, failing to detect a disease (FN) is critical.

3. F1 Score
 Definition: The harmonic mean of precision and recall, balancing the two:
F1 Score = 2 · (Precision · Recall) / (Precision + Recall)
 When to Use:
o When you need a balance between precision and recall.
o Example: In fraud detection, you need to detect fraud while minimizing
false alarms.
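A minimal scikit-learn sketch showing why accuracy is misleading on a skewed dataset while precision, recall, and F1 expose the problem; the labels below are synthetic:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Skewed synthetic labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A model that always predicts the majority (negative) class
y_pred = np.zeros(100, dtype=int)

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.90, yet useless
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("f1 score :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```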

Precision-Recall Tradeoff:
The precision-recall tradeoff reflects the balance between precision and recall, two key
metrics in evaluating classification models, especially for imbalanced datasets. Tuning
this tradeoff is important because improving one often comes at the expense of the
other.
The Tradeoff:
 High Precision, Low Recall:
o The model is very selective, predicting "positive" only when it is very
confident.
o Results in fewer false positives but may miss many true positives (high
false negatives).
o Example: In spam detection, predicting only obvious spam emails
(precision-focused).
 High Recall, Low Precision:
o The model predicts "positive" more liberally.
o Captures most true positives but may incorrectly classify many negatives
as positives (high false positives).
o Example: In disease screening, ensuring that almost all cases of a disease
are flagged (recall-focused).
Decision Trees

A decision tree is a popular supervised learning algorithm used for both classification
and regression tasks. It works by recursively splitting the dataset into subsets based on
feature values, creating a tree-like structure to make predictions.

Key Concepts of Decision Trees:


1. Root Node:
o The topmost node in the tree.
o Represents the entire dataset and the first decision point.
2. Internal Nodes:
o Represent a decision based on a feature.
o Each node splits the data into subsets based on a condition (e.g., x > 5).
3. Leaf Nodes:
o Terminal nodes that represent the output (class or value) of the decision
tree.
4. Branches:
o Connections between nodes that represent the decision path.

Step-by-Step Working:
1. Start at the Root Node:
o The tree starts with the entire dataset at the root node.
o The algorithm determines the best feature and corresponding split point to
partition the data.
2. Select the Best Split:
o For each feature, the algorithm evaluates all possible split points to
minimize the impurity in the resulting subsets.
o Metrics like Gini Impurity, Entropy, or Variance Reduction (for regression)
are used to determine the split quality.
3. Partition the Data:
o Based on the chosen split, the dataset is divided into two or more subsets
(child nodes).
4. Repeat Recursively:
o The algorithm repeats the splitting process for each subset (child node)
until:
 A predefined stopping criterion is met (e.g., maximum depth,
minimum samples per leaf).
 The node becomes pure (all samples belong to a single class).
 No further splits provide significant impurity reduction.
5. Stop at Leaf Nodes:
o Leaf nodes represent the final predictions:
 In classification: The majority class of samples in the node.
 In regression: The average value of samples in the node.

Entropy:
Entropy measures the uncertainty or impurity in a dataset. It quantifies how mixed the
classes are in a subset of data.

Gini Impurity:
Gini impurity, like entropy, measures how mixed the classes are at a node, and both are used in the Decision Tree algorithm to choose which feature to split on. They differ in how they are computed: Gini impurity is calculated as 1 − ∑ (pi)^2 over the classes, whereas entropy uses logarithms.
Information Gain

Information Gain is a key concept in decision trees, used to measure the effectiveness of
an attribute in classifying a dataset. It quantifies the reduction in entropy or impurity
after a dataset is split on an attribute.
To calculate information gain in a decision tree, follow these steps:
1. Calculate the Entropy of the Parent Node:
 Compute the entropy of the parent node using the formula:
Entropy = −∑ i=1 to c pi · log2(pi)
 Where pi is the proportion of instances belonging to class i, and c is the
number of classes.
2. Split the Data:
 Split the dataset into subsets based on the values of a selected attribute
(feature).
3. Calculate the Entropy of Child Nodes:
 For each subset (child node), calculate its entropy using the same formula
as step 1.
4. Calculate the Weighted Average Entropy of Child Nodes:
 Calculate the weighted average entropy of the child nodes using the
formula: Weighted Average Entropy = ∑ j=1 to m (Nj / N) × Entropy(j)
 Where Nj is the number of instances in the jth child node, N is the total
number of instances, and m is the number of child nodes.
5. Calculate Information Gain:
 Information Gain is the difference between the entropy of the parent node
and the weighted average entropy of the child nodes:
Information Gain = Entropy(Parent) − Weighted Average Entropy(Children)
6. Select the Attribute with the Highest Information Gain:
 Choose the attribute (feature) that yields the highest information gain as the
splitting criterion for the current node in the decision tree.
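A minimal NumPy sketch of steps 1-5: computing entropy and the information gain of a candidate split; the toy labels are illustrative:

```python
import numpy as np

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy split: 10 samples, 5 of each class, split into two pure children
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
children = [np.array([0, 0, 0, 0, 0]), np.array([1, 1, 1, 1, 1])]
print(entropy(parent))                     # 1.0 bit
print(information_gain(parent, children))  # 1.0 -- a perfect split
```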
Random Forest

Random Forest is an ensemble method that builds multiple decision trees (a "forest")
during training. Each tree in the forest is trained on a random subset of the data and a
random subset of features, which helps reduce overfitting and improves generalization.
 Ensemble Learning: Combines multiple models to improve overall performance.
 Bagging (Bootstrap Aggregating): A key technique in Random Forest where each
tree is trained on a different bootstrap sample (random sampling with
replacement).

How Does Random Forest Work?


1. Bootstrap Sampling:
o Randomly selects subsets of the training dataset (with replacement).
o Each subset is used to train a single decision tree.
2. Feature Randomization:
o At each split in a tree, a random subset of features is considered rather than
all features.
o This introduces diversity in the trees, reducing the correlation among them.
3. Tree Construction:
o Each tree is grown to its maximum depth (not pruned), but overfitting is
mitigated due to averaging across many trees.
4. Aggregation of Results:
o For Classification:
 Uses majority voting across all trees.
 The class with the highest votes is the predicted output.
o For Regression:
 Averages the predictions of all trees.

Advantages:
1. Handles Overfitting:
o By averaging the predictions of multiple trees, it reduces the variance of
the model.
2. Robust to Noise:
o Reduces the impact of noisy data and outliers.
3. Works Well with Imbalanced Datasets:
o Captures minority class patterns more effectively than single decision trees.
4. Handles Missing Data:
o Can use surrogate splits to handle missing values.
Disadvantages
1. Computationally Intensive:
o Training many trees can be slow, especially on large datasets.
2. Less Interpretability:
o While decision trees are easy to interpret, the ensemble of many trees in a
Random Forest is not.
3. Memory Usage:
o Requires more memory to store multiple trees.
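Assuming scikit-learn is available, a minimal sketch of training a Random Forest classifier; the dataset and hyperparameter values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)          # each tree is trained on a bootstrap sample
print("test accuracy:", forest.score(X_test, y_test))
```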
