
Module B Handbook

Minor In AI
March 2, 2025

Introduction to Supervised Learning


Supervised learning is a type of machine learning where a model is trained on a labeled dataset, meaning
each training example is paired with the correct output. The goal is for the model to learn a mapping from
inputs to outputs so it can make accurate predictions on new, unseen data.

Let's dive into the key concepts of supervised learning:

1 Linear Regression
Linear Regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (features). It assumes a linear relationship between the variables.

1.1 Types of Linear Regression


• Simple Linear Regression – One independent variable.

• Multiple Linear Regression – Multiple independent variables.

1.2 Equation of Linear Regression


For Simple Linear Regression:

y = mx + c (1)

where:

• y = Dependent variable

• x = Independent variable

• m = Slope of the line (coefficient)

• c = Intercept

For Multiple Linear Regression:

y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n    (2)

where:

• b_0 is the intercept

• b_1, b_2, \dots, b_n are the coefficients of the independent variables


1.3 Objective Function


The goal is to minimize the Mean Squared Error (MSE):
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (3)

where:

• y_i is the actual value

• ŷ_i is the predicted value

• n is the number of observations
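As a quick numerical illustration, the MSE can be computed in a couple of lines of NumPy (a minimal sketch; y_true and y_pred are placeholder arrays):

```python
import numpy as np

# Actual and predicted values (placeholder data for illustration only)
y_true = np.array([3.0, 5.0, 7.1, 9.2])
y_pred = np.array([2.8, 5.3, 6.9, 9.5])

# MSE = (1/n) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.4f}")
```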

1.4 Finding the Best Fit Line


1.4.1 Gradient Descent

b_j = b_j - \alpha \frac{\partial}{\partial b_j} J(b)    (4)
where:

• α is the learning rate

• J(b) is the cost function (MSE)

1.5 Assumptions of Linear Regression


• Linearity – The relationship between the independent and dependent variables is linear.

• Independence – Observations are independent of each other.

• Homoscedasticity – Constant variance of errors.

• Normality of Residuals – Residuals should be normally distributed.

• No Multicollinearity – Independent variables should not be highly correlated.

2 Polynomial Regression
2.1 Introduction
Polynomial Regression is an extension of Linear Regression in which the relationship between the variables is modeled as a polynomial of degree n. It helps in capturing non-linear relationships.

2.2 Equation of Polynomial Regression


y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n    (5)
where:

• x^n represents the polynomial terms

• b_0, b_1, \dots, b_n are the coefficients

2.3 Why Use Polynomial Regression?


• When data is non-linear and a straight line does not fit well.

• It allows flexibility in curve fitting by increasing the degree of the polynomial.

2.4 Transforming Data for Polynomial Regression


Since Polynomial Regression is still a form of linear regression (in terms of coefficients),
we transform the features:

• Convert x into x, x^2, x^3, \dots

• Then apply Linear Regression on the transformed data (see the sketch below).
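As an illustration, this transform-then-fit approach can be sketched with scikit-learn's PolynomialFeatures and LinearRegression (the synthetic data, degree, and noise level below are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y = 1 + 2x - 3x^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1 + 2 * X[:, 0] - 3 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=100)

# Expand x into [x, x^2], then fit ordinary linear regression on the expanded features
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)

print(model.named_steps["linearregression"].coef_)       # approximately [2, -3]
print(model.named_steps["linearregression"].intercept_)  # approximately 1
```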

2.5 Choosing the Degree of the Polynomial


• Degree = 1 → Linear Regression

• Degree = 2 → Quadratic Regression

• Degree = 3 → Cubic Regression, etc.

Higher degrees may lead to overfitting.

2.6 Overfitting vs. Underfitting


• Underfitting: Model is too simple (high bias, low variance).

• Overfitting: Model is too complex (low bias, high variance).

• Choose an optimal degree using cross-validation (a short sketch follows below).
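One possible sketch of degree selection: score a few candidate degrees with k-fold cross-validation and keep the one with the lowest validation error (the synthetic data and the candidate range 1–5 are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 1 + 2 * X[:, 0] - 3 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=100)

# Evaluate degrees 1..5 with 5-fold CV; scoring is negative MSE (higher is better)
for degree in range(1, 6):
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree={degree}: CV MSE = {-scores.mean():.3f}")
```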

2.7 Comparison: Linear vs. Polynomial Regression

Feature            | Linear Regression | Polynomial Regression
-------------------|-------------------|------------------------
Relationship       | Linear            | Non-linear
Model Complexity   | Simple            | Complex
Overfitting Risk   | Low               | High for large degrees
Interpretability   | Easy              | Harder for high degrees

Table 1: Comparison of Linear and Polynomial Regression



3 Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the cost function in
machine learning and deep learning models. It iteratively adjusts model parameters to
find the best fit for the data.

3.1 Why Gradient Descent?


• For simple models, we can use Ordinary Least Squares (OLS) to find the best-fit parameters directly.

• However, for complex models (high-dimensional data, deep learning), OLS is computationally expensive, so we use Gradient Descent.

3.2 How Gradient Descent Works?


1. Initialize parameters (θ) randomly or with zeros.

2. Compute the cost function J(θ).

3. Compute the gradient (derivative) of the cost function.

4. Update parameters using:

   \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

   where:

   • \alpha is the learning rate

   • \frac{\partial}{\partial \theta_j} J(\theta) is the gradient of the cost function

5. Repeat until convergence (cost function stops changing significantly).

3.3 Cost Function for Linear Regression


J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2

where:

• h_\theta(x) = \theta_0 + \theta_1 x (hypothesis function)

• m = number of training examples

3.4 Gradient Descent Update Rule


For Linear Regression, the gradient descent update rule is:

\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i) \, x_{i,j}

where x_{i,j} is the j-th feature of the i-th example (with x_{i,0} = 1 for the intercept term).
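A minimal NumPy sketch of batch gradient descent for simple linear regression, following the cost function and update rule above (the synthetic data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

# Synthetic data roughly following y = 4 + 3x
rng = np.random.default_rng(42)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(0, 0.3, size=100)

theta0, theta1 = 0.0, 0.0   # initialize parameters
alpha = 0.1                 # learning rate

for _ in range(2000):
    h = theta0 + theta1 * x          # hypothesis h_theta(x)
    error = h - y
    # Gradients of J(theta) = (1/2m) * sum((h - y)^2)
    grad0 = error.mean()
    grad1 = (error * x).mean()
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should approach roughly 4 and 3
```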

3.5 Types of Gradient Descent


• Batch Gradient Descent (BGD): Uses all training examples in each iteration.

• Stochastic Gradient Descent (SGD): Updates parameters after each training example.

• Mini-Batch Gradient Descent: Uses a small batch of data in each iteration.

3.6 Challenges in Gradient Descent


• Choosing the Learning Rate (α): If too high, the algorithm diverges; if too low, convergence is slow.

• Local Minima and Saddle Points: Momentum-based optimizers like Adam can help.

• Feature Scaling: Gradient Descent converges faster when features are standardized.

4 Regularization
Regularization is a technique used to prevent overfitting in machine learning models
by adding a penalty to large coefficients.

4.1 Why Regularization?


• In high-dimensional models, some features may capture noise rather than actual patterns.

• Regularization shrinks coefficients, ensuring the model generalizes well to unseen data.

4.2 Types of Regularization


4.2.1 L1 Regularization (Lasso Regression)
L1 Regularization adds the absolute value of coefficients as a penalty:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^{n} |\theta_j|

• Encourages sparsity (some coefficients become exactly zero).

• Useful for feature selection.



4.2.2 L2 Regularization (Ridge Regression)


L2 Regularization adds the square of coefficients as a penalty:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^{n} \theta_j^2

• Does not eliminate coefficients but shrinks them towards zero.


• Useful when all features contribute but need regularization.

4.2.3 Elastic Net (Combination of L1 and L2)


Elastic Net combines L1 (Lasso) and L2 (Ridge) regularization:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 + \lambda_1 \sum_{j=1}^{n} |\theta_j| + \lambda_2 \sum_{j=1}^{n} \theta_j^2

• Helps when highly correlated features exist.

4.3 Choosing Between Ridge, Lasso, and Elastic Net

Method       | Effect on Coefficients                   | Use Case
-------------|------------------------------------------|--------------------------------
Lasso (L1)   | Some coefficients become exactly zero    | Feature selection
Ridge (L2)   | Shrinks coefficients, none become zero   | When all features are important
Elastic Net  | Mix of Ridge and Lasso                   | When features are correlated

Table 2: Comparison of Regularization Techniques
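All three penalties are available as estimators in scikit-learn; the sketch below compares how many coefficients each one drives to zero (the dataset and the alpha values, which play the role of λ, are placeholder choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.datasets import make_regression

# Synthetic regression problem with only a few informative features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for name, model in [("Lasso (L1)", Lasso(alpha=1.0)),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Elastic Net", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    model.fit(X, y)
    n_zero = np.sum(np.isclose(model.coef_, 0.0))
    print(f"{name}: {n_zero} of {len(model.coef_)} coefficients are (near) zero")
```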

5 Classification
Classification is a supervised learning task where the goal is to assign a given input into
one of several predefined categories. The model learns from labeled training data and
predicts the category for unseen data.

5.1 Key Steps in Classification


1. Data Preprocessing
• Handling Missing Data
• Feature Scaling
• Encoding Categorical Variables
• Feature Selection
2. Splitting the Dataset (Train/Test Split)
3. Training the Model
4. Making Predictions
5. Evaluating the Model (a minimal sklearn sketch of these steps follows below)
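A minimal scikit-learn sketch of these five steps (the Iris dataset and the logistic-regression classifier are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data and preprocess (feature scaling)
X, y = load_iris(return_X_y=True)

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Train the model
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Make predictions and 5. evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```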

5.2 Performance Evaluation Metrics


5.2.1 Confusion Matrix

Actual \ Predicted | Positive (1)        | Negative (0)
-------------------|---------------------|---------------------
Positive (1)       | True Positive (TP)  | False Negative (FN)
Negative (0)       | False Positive (FP) | True Negative (TN)

Table 3: Confusion Matrix

5.2.2 Accuracy
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (6)

5.2.3 Precision, Recall, and F1-Score

Precision = \frac{TP}{TP + FP}    (7)

Recall = \frac{TP}{TP + FN}    (8)

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (9)
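These metrics can be computed from predicted labels with scikit-learn's metrics module (a minimal sketch on placeholder label arrays):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Placeholder ground-truth labels and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))          # rows: actual, columns: predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```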

6 Support Vector Machines (SVM)


Support Vector Machines (SVM) is a supervised learning algorithm used for both
classification and regression tasks. It is particularly effective in high-dimensional
spaces and works well when the number of features is greater than the number of samples.

6.1 Why Use SVM?


• Works well for small datasets.

• Effective for high-dimensional data.

• Can model non-linear relationships using kernel functions.

• Finds an optimal decision boundary by maximizing the margin between classes.

6.2 Understanding the Hyperplane in SVM


A hyperplane is a decision boundary that separates different classes.

• For 2D Data → The hyperplane is a straight line.

• For 3D Data → The hyperplane is a plane.

• For higher dimensions → The hyperplane is a mathematical construct.

Goal of SVM: Find the hyperplane that maximizes the margin (distance between
the closest points from both classes, called support vectors).

6.3 Mathematical Formulation of SVM


Given a dataset with features X and labels y, where y_i ∈ {−1, 1}, SVM finds a hyperplane defined by:

f(x) = w^T x + b

where:
• w is the weight vector.
• b is the bias term.
• x is the input vector.

The objective of SVM is to maximize the margin while ensuring correct classification:

y_i (w^T x_i + b) ≥ 1, ∀i

6.4 Optimization Problem


\min_{w,b} \; \frac{1}{2} \|w\|^2

subject to:

y_i (w^T x_i + b) ≥ 1, ∀i

This quadratic optimization problem ensures that the margin is maximized while keeping misclassifications to a minimum.

6.5 Types of SVM


6.5.1 Hard Margin SVM (No Misclassification Allowed)
• Assumes data is perfectly linearly separable.
• Finds a hyperplane with maximum margin.
• Not robust to noise or outliers.

6.5.2 Soft Margin SVM (Allows Some Misclassification)


• Used when data is not perfectly separable.
• Introduces a regularization parameter C to balance margin width and misclassification.
• Higher C → Less tolerance for misclassification (more overfitting).
• Lower C → More tolerance for misclassification (better generalization).

Objective function with slack variables ξ:

\min_{w,b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

subject to:

y_i (w^T x_i + b) ≥ 1 − \xi_i, ∀i

where ξ_i are slack variables that allow misclassification.

6.6 SVM with Non-Linearly Separable Data (Kernel Trick)


If data is not linearly separable, SVM uses the Kernel Trick to transform the input
space into a higher-dimensional space where it becomes separable.

6.6.1 Common Kernel Functions

Kernel                              | Formula                                  | Use Case
------------------------------------|------------------------------------------|----------------------------------
Linear Kernel                       | K(x_i, x_j) = x_i^T x_j                  | Linearly separable data
Polynomial Kernel                   | K(x_i, x_j) = (x_i^T x_j + c)^d          | Captures curved relationships
Radial Basis Function (RBF) Kernel  | K(x_i, x_j) = exp(−γ ||x_i − x_j||^2)    | Handles non-linear relationships
Sigmoid Kernel                      | K(x_i, x_j) = tanh(α x_i^T x_j + c)      | Similar to neural networks

RBF Kernel is the most commonly used because it can model complex
decision boundaries.
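A minimal scikit-learn sketch of an SVM with the RBF kernel on a toy non-linear dataset (the dataset and the C and gamma values are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF kernel; C balances margin width vs. misclassification, gamma controls kernel width
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Support vectors per class:", clf.n_support_)
```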

6.7 Regularization in SVM (C Parameter)


• The C parameter controls the trade-off between margin width and misclassification errors.

• High C → Tries to classify all points correctly (can overfit).

• Low C → Allows some misclassifications but generalizes better.

6.8 Advantages and Disadvantages of SVM


6.8.1 Advantages
• Works well with high-dimensional data.

• Effective when the number of features > the number of samples.

• Uses kernels to model non-linear relationships.

6.8.2 Disadvantages
• Computationally expensive for large datasets.

• Performance depends on kernel choice and hyperparameter tuning.

• Sensitive to imbalanced data (use class_weight='balanced').


7 Logistic Regression
Logistic regression is used for binary classification problems. It models the probability
that a given input x belongs to class 1.

Model Equation
The model estimates the probability using the sigmoid function:
P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = w^T x + b

Loss Function (Binary Cross-Entropy)


L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
where:
• y^{(i)} is the true label
• ŷ^{(i)} is the predicted probability
• N is the number of training examples
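A minimal NumPy sketch of the sigmoid and the binary cross-entropy loss defined above (the weights, bias, and examples are placeholder values):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-12):
    # L = -(1/N) * sum[y log(y_hat) + (1 - y) log(1 - y_hat)]
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Placeholder weights, bias, and a small batch of examples
w, b = np.array([0.5, -0.25]), 0.1
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1, 0, 1])

y_prob = sigmoid(X @ w + b)   # P(y = 1 | x)
print("Predicted probabilities:", y_prob)
print("BCE loss:", bce_loss(y, y_prob))
```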

8 Principal Component Analysis (PCA)


PCA is a dimensionality reduction technique that projects data onto directions (principal
components) that maximize variance.

Steps
1. Center the data:

   X_{centered} = X − \bar{X}

2. Compute the covariance matrix:

   C = \frac{1}{n} X_{centered}^T X_{centered}

3. Compute the eigenvectors and eigenvalues of C.

4. Select the top k eigenvectors to form the projection matrix W_k.

5. Project the data onto the lower-dimensional space (a minimal NumPy sketch follows these steps):

   Z = X_{centered} W_k
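The five steps map almost line-for-line onto NumPy; a minimal sketch (the data matrix and k are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # placeholder data: 200 samples, 5 features
k = 2                           # number of principal components to keep

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix C = (1/n) X_c^T X_c
C = (X_centered.T @ X_centered) / X.shape[0]

# 3. Eigen-decomposition (eigh works for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Top-k eigenvectors (eigh returns ascending eigenvalues, so reverse and take the first k)
W_k = eigvecs[:, ::-1][:, :k]

# 5. Project the data
Z = X_centered @ W_k
print(Z.shape)  # (200, 2)
```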

9 PageRank Algorithm
PageRank measures the importance of a node (web page) in a directed graph based on
incoming links.

PageRank Formula

PR(A) = (1 − d) + d \sum_{i=1}^{n} \frac{PR(B_i)}{L(B_i)}
where:

• PR(A) is the PageRank of page A

• d is the damping factor (typically 0.85)

• B_i are the pages linking to A

• L(B_i) is the number of outbound links from page B_i
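A minimal sketch of iterative PageRank on a small made-up directed graph, applying the formula above to every node until the ranks stabilize:

```python
# Toy directed graph: each key links to the pages in its list (hypothetical example)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                         # damping factor
pages = list(links)
pr = {p: 1.0 for p in pages}     # initial ranks

for _ in range(50):              # iterate until (approximately) converged
    new_pr = {}
    for page in pages:
        # Sum PR(B_i) / L(B_i) over all pages B_i that link to this page
        incoming = sum(pr[src] / len(links[src]) for src in pages if page in links[src])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)  # C should rank highest: it receives the most incoming links
```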
