Sourav Moocs A2 65
ON
MACHINE LEARNING SPECIALIZATION
(CSE V Semester MOOC Seminar) 2025-2026
CSE-A2-V-Sem
Session- 2025-2026
DATE: 20/01/2025
(Mr. Samir Rana)
Class Coordinator
CC-CSE-A2-V-Sem
CSE Department
GEHU, DEHRADUN
CERTIFICATE
Module 1:
https://coursera.org/verify/6ZP2QHZOUEYM
Module 2: https://coursera.org/verify/JMMG3Q1DJETI
ACKNOWLEDGEMENT
o Neural Network
Building a Neural Network
Forward Propagation in Neural Network
o Activation Function
Activation Function (ReLU)
Softmax for Multiclass Classification
Additional Layers
o Model Selection
o Bias-Variance Tradeoff
o Decision Trees
Information Gain
Random Forest
Module 1
Introduction to Machine Learning and its Applications
1. Healthcare:
o Disease diagnosis (e.g., identifying cancer in medical images).
o Predicting patient outcomes (e.g., risk of heart attacks based on patient
history).
2. Finance:
o Fraud detection (e.g., identifying unusual credit card transactions).
o Loan approval based on creditworthiness.
3. Retail:
o Recommendation systems (e.g., suggesting products on e-commerce
platforms).
o Demand forecasting for inventory management.
4. Transportation:
o Self-driving cars (e.g., identifying pedestrians and road signs).
o Traffic prediction (e.g., using GPS data to estimate travel time).
5. Natural Language Processing (NLP):
o Sentiment analysis (e.g., classifying customer reviews as positive or
negative).
o Chatbots and virtual assistants (e.g., Siri, Alexa).
6. Image and Speech Recognition:
o Face recognition for security.
o Voice-controlled devices.
Supervised & Unsupervised Learning
Supervised Learning:
In supervised learning, the model is trained on labeled data, where the input data is
paired with the correct output (target). The goal is to learn a mapping from inputs to
outputs and make predictions for new inputs.
Unsupervised Learning:
In unsupervised learning, the model is trained on data without labeled outputs. The goal
is to discover hidden patterns, groupings, or structures within the data.
Problem Statement
You are given data about the size of houses (in square feet) and their corresponding
prices (in $1000s). Your task is to find a linear relationship between house size and
price, and predict the price for a new house of size 1800 sq. ft.
Dataset
x=[1000,1500,2000,2500,3000] (House size in sq. ft)
y=[200,250,300,350,400] (Price in $1000s)
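As a quick illustration (not part of the original assignment), the line can be fit with NumPy's least-squares polyfit and used to predict the price for a 1800 sq. ft house:

import numpy as np

# House sizes (sq. ft) and prices (in $1000s) from the dataset above
x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200, 250, 300, 350, 400], dtype=float)

# Least-squares fit of y_hat = w1*x + w0
w1, w0 = np.polyfit(x, y, deg=1)

print(f"w1 = {w1:.4f}, w0 = {w0:.2f}")
print(f"Predicted price for 1800 sq. ft: {w1 * 1800 + w0:.1f} (in $1000s)")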
Cost Function:
A cost function is a mathematical function used to measure the error or discrepancy
between the predicted values (y^) and the actual values (y) in a regression model. It
quantifies how well the model's predictions align with the actual data.
Gradient Descent:
Gradient Descent is an optimization algorithm used to minimize the cost function by
iteratively updating the model parameters (e.g., w0,w1 ) in the direction that reduces the
cost function value.
The cost function J(w) represents the error of the model for given parameters w0,w1 .
Gradient Descent adjusts the parameters by calculating the slope (gradient) of J(w) and
moving in the opposite direction of the gradient to minimize the cost.
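A minimal sketch of batch gradient descent for the house-price example; the learning rate and iteration count are assumed values, and the feature is z-score scaled so the updates converge quickly:

import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200, 250, 300, 350, 400], dtype=float)
x_scaled = (x - x.mean()) / x.std()   # scale the feature for stable convergence

w0, w1 = 0.0, 0.0    # parameters (intercept, slope) in the scaled feature space
alpha = 0.1          # learning rate (assumed value)

for _ in range(1000):
    error = (w0 + w1 * x_scaled) - y          # y_hat - y
    # Gradients of the cost J = (1/(2m)) * sum(error^2) with respect to w0, w1
    w0 -= alpha * error.mean()
    w1 -= alpha * (error * x_scaled).mean()

print(f"Learned parameters (scaled feature): w0 = {w0:.2f}, w1 = {w1:.2f}")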
Multiple Linear Regression is an extension of simple linear regression that models the
relationship between a dependent variable (y) and multiple independent variables
(x1,x2,…,xn).
The equation for multiple linear regression is:
y^=w0+w1x1+w2x2+⋯+wnxn
Where:
y^: Predicted value.
w0: Intercept (constant term).
w1,w2,…,wn: Coefficients for each independent variable.
x1,x2,…,xn: Independent variables.
Problem Statement:
A real estate agent wants to predict house prices(y) based on:
x1: Size of the house (in square feet).
x2: Number of bedrooms.
x3: Distance from the city center (in miles).
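A hedged sketch of how such a model could be fit with scikit-learn; the training rows and the query house below are made-up illustrative values, not data from the course:

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: x1 = size (sq. ft), x2 = bedrooms, x3 = distance from city center (miles)
X = np.array([[1400, 3, 5.0],
              [1600, 3, 3.5],
              [2000, 4, 8.0],
              [2400, 4, 2.0],
              [3000, 5, 6.5]])
y = np.array([245, 290, 330, 410, 480])   # prices in $1000s (illustrative)

model = LinearRegression().fit(X, y)
print("Intercept w0:", model.intercept_)
print("Coefficients w1, w2, w3:", model.coef_)
print("Prediction for [1800, 3, 4.0]:", model.predict([[1800, 3, 4.0]]))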
Feature scaling:
Feature Scaling refers to the process of standardizing or normalizing the range of
independent variables (features) to ensure all features contribute equally to the model.
Why is Feature Scaling Needed?
1. Avoid Dominance of Large-Scale Features
2. Improve Gradient Descent Convergence
ii. Z-Score Scaling (Standardization): Centers data around zero with a unit variance:
xscaled = (x − μ) / σ
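A small sketch of z-score scaling applied to the house-size feature from the earlier dataset:

import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)

# Z-score scaling: subtract the mean, divide by the standard deviation
mu, sigma = x.mean(), x.std()
x_scaled = (x - mu) / sigma
print(x_scaled)   # centered around zero with unit variance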
Feature Engineering:
Feature Engineering is the process of creating, transforming, or selecting features to
improve the predictive performance of a machine learning model.
Logistic Regression is a statistical model used primarily for binary classification tasks.
It is an extension of linear regression but is designed to predict the probability that a
given input belongs to a particular class (usually labeled as 0 or 1).
Key Concepts of Logistic Regression:
1. Binary Outcome: Logistic regression is used when the outcome variable is
categorical, specifically when it has two possible outcomes, typically labeled as 0
and 1.
2. Sigmoid Function: The core of logistic regression is the sigmoid function, also
known as the logistic function, which converts any real-valued number into a
value between 0 and 1. This output can then be interpreted as a probability.
The formula for the sigmoid function is:
σ(z) = 1 / (1 + e^(−z))
Where z is the linear combination of the input features, represented as:
z = w^T⋅x + b
o w is a vector of weights (coefficients).
o x is a vector of input features (independent variables).
o b is the bias term (intercept).
The sigmoid function will output a value between 0 and 1, which can be
interpreted as the probability of the sample belonging to class 1
The cost function for logistic regression is the log loss (binary cross-entropy):
J = −(1/N) ∑ i=1 to N [ yi⋅log(p(yi)) + (1−yi)⋅log(1−p(yi)) ]
Where:
N is the number of training examples.
yi is the actual value for ith training example.
p(yi) is the predicted value for ith training example.
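A minimal sketch of the sigmoid computation; the weights w, bias b, and input x below are assumed illustrative values:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])   # weights (illustrative)
b = 0.1                     # bias (illustrative)
x = np.array([1.5, 2.0])    # input features (illustrative)

z = np.dot(w, x) + b        # linear combination w^T x + b
p = sigmoid(z)              # interpreted as P(y = 1 | x)
print(f"P(y=1 | x) = {p:.3f}")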
Overfitting
Overfitting occurs when a machine learning model learns not only the underlying
patterns in the data but also the noise and random fluctuations. As a result, the model
performs very well on the training data but poorly on unseen data (test set), because it
has essentially memorized the training examples instead of generalizing to new data.
Symptoms of Overfitting:
High training accuracy but low test accuracy.
The model performs well on the training data but fails to generalize to new,
unseen data.
Types of Regularization:
1. L2 Regularization (Ridge Regression):
o L2 regularization adds the sum of the squared values of the weights to the
cost function, discouraging large weights.
o The regularized cost function is:
J(w,b) = −(1/m) ∑ i=1 to m [ y(i)⋅log(hw,b(x(i))) + (1−y(i))⋅log(1−hw,b(x(i))) ]
+ λ ∑ j=1 to n (wj)^2
Where:
λ is the regularization parameter (also known as the regularization
strength).
∑ j=1 to n (wj)^2 is the sum of the squared values of the weights (for
all features except the bias term).
The regularization term helps to shrink the weights, making the
model simpler.
2. L1 Regularization (Lasso Regression):
o L1 regularization adds the sum of the absolute values of the weights to the
cost function, which can encourage sparsity (some weights become zero).
o The regularized cost function is:
J(w,b) = −(1/m) ∑ i=1 to m [ y(i)⋅log(hw,b(x(i))) + (1−y(i))⋅log(1−hw,b(x(i))) ] + λ ∑ j=1 to n |wj|
Where:
λ is the regularization parameter (which controls the penalty term).
∑ j=1 to n |wj| is the sum of the absolute values of the weights.
L1 regularization can drive some weights to exactly zero, effectively
performing feature selection.
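A hedged sketch showing both penalties with scikit-learn's LogisticRegression; the data is synthetic and generated only for illustration, and C is the inverse of the regularization strength λ:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# L2 (ridge-style) penalty shrinks the weights toward zero
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1 (lasso-style) penalty can drive some weights to exactly zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print("L2 weights:", l2_model.coef_)
print("L1 weights:", l1_model.coef_)   # expect some entries at or near zero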
Module 2
Neural Network
Activation functions are crucial components of neural networks that introduce non-
linearity into the network. Without activation functions, a neural network would simply
be a linear model, regardless of how many layers it has. This would limit its ability to
capture complex patterns and relationships in the data.
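A minimal sketch of forward propagation through one hidden layer with ReLU and a softmax output; all weights, biases, and the input below are assumed illustrative values:

import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0, z)

def softmax(z):
    # Softmax: converts a vector of scores into class probabilities
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])                 # input features (illustrative)
W1 = np.array([[0.2, -0.5, 0.1],
               [0.7,  0.3, -0.2]])             # hidden layer: 2 units, 3 inputs
b1 = np.array([0.1, -0.1])
W2 = np.array([[1.0, -1.0],
               [0.5,  0.5],
               [-0.3, 0.8]])                   # output layer: 3 classes, 2 inputs
b2 = np.array([0.0, 0.1, -0.1])

a1 = relu(W1 @ x + b1)            # hidden activations
probs = softmax(W2 @ a1 + b2)     # class probabilities for the multiclass output
print(probs, probs.sum())         # probabilities sum to 1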
Additional Layers:
1. Convolutional Layers (Conv Layers)
Purpose: Extract features from input data, especially in image processing tasks.
How It Works:
o Applies convolution operations between input data and a set of learnable
filters (kernels).
o Captures local patterns, such as edges, textures, or specific shapes, in
images.
Key Hyperparameters:
o Filter size
o Number of filters
o Stride and padding
Applications: Computer vision (image classification, object detection,
segmentation).
2. Pooling Layers
Purpose: Reduce the spatial dimensions (height and width) of the feature maps to
decrease computational complexity and focus on dominant features.
Types:
o Max Pooling: Takes the maximum value in a region.
o Average Pooling: Computes the average value in a region.
o Global Pooling: Applies pooling across the entire feature map.
Applications: Used in convolutional neural networks (CNNs) to downsample
data.
3. Dropout Layers
Purpose: Prevent overfitting by randomly "dropping out" a fraction of neurons
during training.
How It Works:
o Sets a random subset of activations to zero at each training iteration.
o Forces the network to learn more robust features by not relying on any
specific neuron.
Key Parameter: Dropout rate (e.g., 0.5 means 50% of neurons are dropped).
Applications: General-purpose, effective in fully connected and convolutional
layers.
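A hedged sketch of a small Keras model that stacks the three layer types described above; the input shape, filter counts, and dropout rate are assumed values, not from the course:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                                       # e.g. grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu", padding="same"),  # convolutional layer
    layers.MaxPooling2D(pool_size=2),                                     # max pooling (downsampling)
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling2D(),                                      # global pooling
    layers.Dropout(0.5),                                                  # drop 50% of activations in training
    layers.Dense(10, activation="softmax"),                               # 10-class output
])
model.summary()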
Model selection is the process of choosing the best model among a set of candidate
models to solve a particular problem. It involves evaluating various models based on
their performance on unseen data to ensure the chosen model generalizes well.
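A minimal sketch of model selection via 5-fold cross-validation in scikit-learn; the dataset and the two candidate models are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative); each candidate model is scored on held-out folds
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, "mean CV accuracy:", round(scores.mean(), 3))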
1. Bias
Definition: Bias is the error introduced by approximating a real-world problem
(which may be complex) by a simplified model. It reflects how far off the model's
predictions are from the actual target values on average.
High Bias: Models with high bias are overly simplistic and may not capture the
underlying patterns of the data (underfitting).
o Example: Using a linear model to fit non-linear data.
2. Variance
Definition: Variance is the error caused by the model's sensitivity to small
fluctuations in the training data. It reflects how much the model's predictions
would change if trained on different data sets.
High Variance: Models with high variance are overly complex and overly tuned
to the training data (overfitting).
o Example: Using a high-degree polynomial to fit noisy data.
Tradeoff
A model with too much bias will miss the relevant patterns in the data (underfit),
while a model with too much variance will model the random noise in the data
rather than the underlying pattern (overfit).
The goal is to find a balance between bias and variance that minimizes the total
error (sum of bias squared, variance, and irreducible error).
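A hedged sketch of the tradeoff in code: polynomials of increasing degree are fit to synthetic noisy data, where a low degree tends to underfit (high bias) and a very high degree tends to overfit (high variance):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy non-linear data (illustrative): y = sin(x) + noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")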
Error metrics for skewed datasets:
When working with skewed datasets (where one class or outcome is significantly more
frequent than others), standard error metrics like accuracy can be misleading. Instead,
you should use error metrics that account for the class imbalance and better reflect the
model's performance on minority classes. Below are the most commonly used metrics
for skewed datasets, along with their applications.
1. Precision
Definition: The proportion of true positive predictions out of all positive
predictions. Precision = TP / (TP + FP)
When to Use:
o When false positives are more costly than false negatives.
o Example: In spam detection, you want fewer legitimate emails incorrectly
classified as spam.
2. Recall
Definition: The proportion of true positive predictions out of all actual
positives. Recall = TP / (TP + FN)
When to Use:
o When false negatives are more costly than false positives.
o Example: In disease screening, missing an actual case is worse than raising a
false alarm.
3. F1 Score
Definition: The harmonic mean of precision and recall, balancing the two.
F1 Score = 2 ⋅ (Precision ⋅ Recall) / (Precision + Recall)
When to Use:
o When you need a balance between precision and recall.
o Example: In fraud detection, you need to detect fraud while minimizing
false alarms.
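A minimal sketch computing these metrics on a made-up skewed example (90% negatives), where accuracy looks good even though many positives are missed:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels: 90 negatives, 10 positives; the model misses half the positives
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 5 + [1] * 5   # 2 false positives, 5 false negatives

print("Accuracy :", accuracy_score(y_true, y_pred))    # high despite the misses
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))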
Precision-Recall Tradeoff:
The precision-recall tradeoff reflects the balance between precision and recall, two key
metrics in evaluating classification models, especially for imbalanced datasets. Tuning
this tradeoff is important because improving one often comes at the expense of the
other.
The Tradeoff:
High Precision, Low Recall:
o The model is very selective, predicting "positive" only when it is very
confident.
o Results in fewer false positives but may miss many true positives (high
false negatives).
o Example: In spam detection, predicting only obvious spam emails
(precision-focused).
High Recall, Low Precision:
o The model predicts "positive" more liberally.
o Captures most true positives but may incorrectly classify many negatives
as positives (high false positives).
o Example: In disease screening, ensuring that almost all cases of a disease
are flagged (recall-focused).
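A hedged sketch of the tradeoff in practice: moving the decision threshold of a classifier shifts the balance between precision and recall (the dataset and thresholds below are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Imbalanced synthetic data (about 10% positives), scored with predicted probabilities
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, y_pred):.2f}  "
          f"recall={recall_score(y, y_pred):.2f}")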
Decision Trees
A decision tree is a popular supervised learning algorithm used for both classification
and regression tasks. It works by recursively splitting the dataset into subsets based on
feature values, creating a tree-like structure to make predictions.
Step-by-Step Working:
1. Start at the Root Node:
o The tree starts with the entire dataset at the root node.
o The algorithm determines the best feature and corresponding split point to
partition the data.
2. Select the Best Split:
o For each feature, the algorithm evaluates all possible split points to
minimize the impurity in the resulting subsets.
o Metrics like Gini Impurity, Entropy, or Variance Reduction (for regression)
are used to determine the split quality.
3. Partition the Data:
o Based on the chosen split, the dataset is divided into two or more subsets
(child nodes).
4. Repeat Recursively:
o The algorithm repeats the splitting process for each subset (child node)
until:
A predefined stopping criterion is met (e.g., maximum depth,
minimum samples per leaf).
The node becomes pure (all samples belong to a single class).
No further splits provide significant impurity reduction.
5. Stop at Leaf Nodes:
o Leaf nodes represent the final predictions:
In classification: The majority class of samples in the node.
In regression: The average value of samples in the node.
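A minimal sketch of training and inspecting a small decision tree with scikit-learn; the Iris dataset and the depth limit are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a depth-limited tree using entropy as the split criterion and print its rules
iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))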
Entropy:
Entropy measures the uncertainty or impurity in a dataset. It quantifies how mixed the
classes are in a subset of data.
Gini Impurity:
Gini impurity, like entropy, measures how impure (mixed) a node is, and both are used
as splitting criteria when building a decision tree. They differ in how they are
computed: Gini impurity is one minus the sum of squared class proportions, whereas
entropy uses the logarithm of the class proportions.
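A small sketch comparing the two impurity measures on a pure node and a perfectly mixed binary node:

import numpy as np

def entropy(p):
    # Entropy of a class-probability vector: -sum(p_i * log2(p_i))
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

def gini(p):
    # Gini impurity of a class-probability vector: 1 - sum(p_i^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()

print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node: both measures are 0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # maximally mixed: entropy 1.0, Gini 0.5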
Information Gain
Information Gain is a key concept in decision trees, used to measure the effectiveness of
an attribute in classifying a dataset. It quantifies the reduction in entropy or impurity
after a dataset is split on an attribute.
To calculate information gain in a decision tree, follow these steps:
1. Calculate the Entropy of the Parent Node:
Compute the entropy of the parent node using the formula:
Entropy = −∑ i=1 to c pi⋅log2(pi)
Where pi is the proportion of instances belonging to class i, and c is the
number of classes.
2. Split the Data:
Split the dataset into subsets based on the values of a selected attribute
(feature).
3. Calculate the Entropy of Child Nodes:
For each subset (child node), calculate its entropy using the same formula
as step 1.
4. Calculate the Weighted Average Entropy of Child Nodes:
Calculate the weighted average entropy of the child nodes using the
formula: Weighted Average Entropy = ∑ j=1 to m (Nj/N) × Entropy(j)
Where Nj is the number of instances in the jth child node, N is the total
number of instances, and m is the number of child nodes.
5. Calculate Information Gain:
Information Gain is the difference between the entropy of the parent node
and the weighted average entropy of the child nodes:
Information Gain = Entropy(Parent) − Weighted Average Entropy(Children)
6. Select the Attribute with the Highest Information Gain:
Choose the attribute (feature) that yields the highest information gain as the
splitting criterion for the current node in the decision tree.
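A minimal sketch of steps 1-5 for a made-up binary split of ten labels:

import numpy as np

def entropy(labels):
    # Entropy of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    # Entropy(parent) minus the weighted average entropy of the child nodes
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # parent node (illustrative labels)
left   = [1, 1, 1, 1, 0]                  # child node after the split
right  = [1, 0, 0, 0, 0]                  # child node after the split
print(f"Information gain of this split: {information_gain(parent, [left, right]):.3f}")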
Random Forest
Random Forest is an ensemble method that builds multiple decision trees (a "forest")
during training. Each tree in the forest is trained on a random subset of the data and a
random subset of features, which helps reduce overfitting and improves generalization.
Ensemble Learning: Combines multiple models to improve overall performance.
Bagging (Bootstrap Aggregating): A key technique in Random Forest where each
tree is trained on a different bootstrap sample (random sampling with
replacement).
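A hedged sketch of a Random Forest in scikit-learn; the synthetic dataset and hyperparameter values are illustrative, chosen only to show bagging and random feature selection in code:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative); each tree sees a bootstrap sample and random features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # train each tree on a bootstrap sample (bagging)
    random_state=0,
).fit(X_tr, y_tr)

print("Test accuracy:", accuracy_score(y_te, forest.predict(X_te)))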
Advantages:
1. Handles Overfitting:
o By averaging the predictions of multiple trees, it reduces the variance of
the model.
2. Robust to Noise:
o Reduces the impact of noisy data and outliers.
3. Works Well with Imbalanced Datasets:
o Captures minority class patterns more effectively than single decision trees.
4. Handles Missing Data:
o Can use surrogate splits to handle missing values.
Disadvantages
1. Computationally Intensive:
o Training many trees can be slow, especially on large datasets.
2. Less Interpretability:
o While decision trees are easy to interpret, the ensemble of many trees in a
Random Forest is not.
3. Memory Usage:
o Requires more memory to store multiple trees.