Supervised Learning
● Supervised Learning Definition: Learning based on labeled training data.
● Role of Labeled Data: Acts as experience, prior knowledge, or belief.
● The process is similar to a teacher supervising a student.
● Training data serves as the teacher guiding the learning process.
Classification Learning Steps
● Problem Identification: Define a well-formed problem with clear goals and long-term benefits.
● Identification of Required Data: Select relevant datasets that represent the problem accurately.
● Data Pre-processing: Clean and transform raw data for analysis.
● Definition of Training Data Set: Choose appropriate input and output data for training.
● Algorithm Selection: Pick the best learning algorithm based on the problem.
● Training: Run the selected algorithm on the training set for fine-tuning.
● Evaluation with Test Data: Measure performance and refine the model if needed.
Supervised Learning Algorithms
● k-Nearest Neighbour (kNN)
● Decision tree
● Random forest
● Support Vector Machine (SVM)
● Naïve Bayes classifier
K-Nearest Neighbour
● A simple yet powerful classification algorithm.
● Similar entities tend to stay close, like neighbors in a locality.
● Unknown data is classified based on similar training data points.
● The label of an unknown element is determined by its nearest neighbors.
● Uses similarity to make predictions, just like people with similar mindsets group together.
Student Data Set: Consists of 15 students with scores in Aptitude and Communication (scale of 10).
Classifications:

● Leader: High aptitude & communication skills.


● Speaker: High communication, low aptitude.
● Intel: High aptitude, low communication.

Training & Test Data:


● A portion of labeled data is retained as test data to evaluate the model.
● Remaining data is used for training the model.

Performance Evaluation:

● Accuracy is determined by comparing predicted vs. actual class labels.


● Example: The record of Josh is used as test data.
Data set
kNN Algorithm in Use:

● Class label of test data is determined by neighboring training data points.

Challenges in kNN:

1. Defining Similarity: How do we measure similarity between data points?


2. Number of Neighbors: How many similar points should be considered?

Solution - Euclidean Distance:

● The most common method used by kNN to measure similarity between two data points.
● Value of 'k': Determines the number of neighbors to consider in kNN.
● User-Defined Parameter: The value of 'k' is set by the user.
● Example (k = 3): The three nearest neighbors are considered, and the majority class is assigned.
● Example (k = 1): Only the closest neighbor's class label is assigned to the test data.
● Impact of 'k': Affects classification accuracy and decision-making in kNN.
Determine the class label for Josh using the kNN algorithm.

k = 1:

● Closest neighbor: Gouri (distance = 1.118).


● Gouri's class: Intel → Josh is classified as Intel.

k = 3:

● Nearest neighbors: Gouri (1.118), Susant (1.414), Bobby (1.5).


● Gouri (Intel), Bobby (Intel), Susant (Leader).
● Majority class: Intel → Josh is classified as Intel.

The process can be applied for any value of k using majority voting.
Choosing K
Choosing k is challenging due to the following reasons:

● Large k (e.g., total training records): Majority class dominates, ignoring nearest neighbors.
● Small k (e.g., k = 1): Risk of assigning class of an outlier or noisy data point.

Best k value lies between these extremes.


Common strategies to determine k:

● Square root rule: Set k = √(number of training records).


● Experimentation: Test multiple k values on different datasets and choose the best.
● Weighted voting: Use a larger k but give more weight to closer neighbors.
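As a hedged illustration of these strategies, the short Python sketch below (assuming scikit-learn and NumPy are available, with a made-up 15-record data set standing in for the student scores) applies the square-root rule, tries several k values with cross-validation, and fits a distance-weighted model:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Made-up 15-record, 3-class data set standing in for the student scores.
X, y = make_classification(n_samples=15, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Square-root rule: start with k = sqrt(number of training records).
k_start = int(np.sqrt(len(X)))                      # sqrt(15) -> 3

# Experimentation: try several k values with cross-validation and keep the best.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=3)
search.fit(X, y)

# Weighted voting: use a larger k but give closer neighbours more weight.
weighted_knn = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X, y)

print(k_start, search.best_params_)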

Why is kNN called a Lazy Learner?

● Unlike eager learners, kNN skips abstraction and generalization steps.


● Stores training data and directly applies the nearest neighbor approach.
● No real learning occurs, making it a lazy learner.
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. the number of nearest neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
    Find the closest ‘k’ training data points, i.e. the training data points whose distances from the test data point are the least.
    If k = 1
        Then assign the class label of that single closest training data point to the test data point
    Else
        Assign the class label that is predominantly present among the ‘k’ closest training data points to the test data point
End do
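The pseudocode above can be turned into a small, runnable Python function; the sketch below (the function name knn_classify and the NumPy-based distance computation are illustrative choices, not part of the original listing) follows the same steps:

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, test_X, k):
    """Assign a class label to each test point by majority vote of its k nearest training points."""
    predictions = []
    for point in test_X:                                        # Do for all test data points
        # Distance (Euclidean) of the test data point from the different training data points
        distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))
        # Indices of the closest 'k' training data points
        nearest = np.argsort(distances)[:k]
        if k == 1:
            predictions.append(train_y[nearest[0]])             # class of the single closest point
        else:
            # Predominant (majority) class label among the k neighbours
            counts = Counter(train_y[i] for i in nearest)
            predictions.append(counts.most_common(1)[0][0])
    return predictions

For example, with the training features and labels stored as NumPy arrays, knn_classify(train_X, train_y, test_X, k=3) would return the majority-vote label for each row of test_X.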
Strengths and Weaknesses of kNN
Strengths of kNN

● Simple and easy to understand.


● Effective for certain applications, e.g., recommender systems.
● Fast training phase (almost no training time required).

Weaknesses of kNN

● No actual learning—relies entirely on training data.


● Performance depends on training data quality; poor data leads to poor classification.
● Slow classification process due to lack of pre-trained models.
● High computational space required for storing training data.

Applications of kNN

● Recommender Systems: Suggests items based on user preferences (e.g., past purchases, browsing history).
● Information Retrieval: Used for concept search—finding similar documents or content.
Decision Tree
Decision Tree Learning

● Widely Adopted: Common algorithm for classification tasks.


● Tree Structure: Builds a model in the form of a tree.
● Multi-Dimensional Analysis: Can handle multiple classes and variables.
● Fast Execution: Efficient with quick performance.
● Interpretability: Easy to understand and interpret rules.

How Decision Trees Work

● Model Creation: Based on past data (feature vectors), the tree predicts values of the output variable.
● Nodes: Each decision node represents a feature of the input vector.
● Edges: Connect a node to its children, each edge representing a possible value of that feature.
● Leaf Nodes: Terminate the tree and represent the possible values of the output variable.
● Path Followed: Classification is determined by following the path from the root to a leaf node, guided by the input variable values.
● Each leaf node assigns a classification. The first node is called the ‘Root’ node, the internal decision nodes are ‘Branch’ nodes, and the terminal nodes are ‘Leaf’ nodes. In the example figure, ‘A’ is the root node, ‘B’ is a branch node, and ‘T’ & ‘F’ are leaf nodes.
Decision Tree Construction

● Recursive Partitioning: Decision trees are built using this method, splitting data into subsets based on feature values.
● Root Node: The entire dataset is initially considered the root.
● Feature Selection: The feature that predicts the target class most strongly is selected.
● Splitting: Data is partitioned based on the chosen feature, forming branches.
● Continued Splitting: The process continues, selecting the best feature to split each node, until a stopping criterion is met.

Stopping Criteria

1. Same Class: All or most examples at a node belong to the same class.
2. No Features Left: All features have been used for partitioning.
3. Predefined Limit: The tree grows to a specified threshold.

Example (GTS Recruitment)

● Context: GTS hires B.Tech. students from an engineering college.
● Data: Interview evaluation results of 18 shortlisted students are available.
● Goal: Chandra, a student with a high CGPA, wants to know whether he will be offered a job.
Decision Tree Example for Chandra

● Path: Based on Chandra's attributes (CGPA = High, Communication = Bad, Aptitude = High, Programming Skills = Bad), the
decision tree predicts he will get the job offer.

Decision Tree Algorithms

● Popular Algorithms:
○ C5.0, CART (Classification and Regression Tree), CHAID, ID3.
● Feature Selection Challenge:
○ The algorithm selects the feature to split on by ensuring partitions are as pure as possible.
○ Entropy: Measures the impurity of an attribute, used in algorithms like ID3 and C5.0.
○ Information Gain: Calculated by measuring the decrease in entropy after splitting data on a feature. The goal is to find
the feature with the highest information gain, creating the most homogeneous branches.
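For concreteness, a minimal sketch of how entropy and information gain could be computed for a candidate split is given below; the tiny 'offers'/'aptitude' lists are illustrative values, not the actual GTS interview data:

import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels: -sum(p * log2(p)) over the classes present."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total) for count in Counter(labels).values())

def information_gain(labels, feature_values):
    """Decrease in entropy obtained by splitting the labels on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Illustrative check: does splitting on Aptitude separate the job-offer labels well?
offers   = ["TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"]
aptitude = ["High", "High", "High", "Low",  "Low",   "Low"]
print(information_gain(offers, aptitude))   # 1.0 here, since the partitions are perfectly pure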

Building the Decision Tree

● First Node: The first split is based on Aptitude. If Aptitude = Low, the job offer is FALSE.
● Aptitude = High:
○ If Communication = Good, the job offer is TRUE.
○ If Communication = Bad, the job offer is TRUE if CGPA = High.
Avoiding Overfitting in Decision Trees – Pruning
● Issue: Without stopping criteria, the decision tree may keep growing indefinitely, leading to overfitting.
● Solution: Pruning reduces tree size, making the model more generalized and better at classifying unseen data.

Types of Pruning

1. Pre-pruning (Early Stopping)


○ Stops tree growth before it reaches full depth.
○ Limits the number of decision nodes.
○ Pros: Reduces overfitting and computational cost.
○ Cons: May ignore important patterns by stopping too early.
2. Post-pruning (Prune After Growth)
○ Grows the full tree first, then removes some branches based on pruning criteria (e.g., error rates).
○ Pros: More accurate classification as all data is considered.
○ Cons: Higher computational cost.
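A hedged sketch of both pruning styles, assuming scikit-learn's DecisionTreeClassifier and the built-in iris data as a stand-in; max_depth and min_samples_leaf illustrate pre-pruning, while ccp_alpha (cost-complexity pruning) illustrates post-pruning:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # stand-in data set

# Pre-pruning: stop growth early by limiting depth and leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then prune branches via cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())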
Strengths of Decision Trees
● Produces simple, interpretable rules.
● Works well for most classification problems.
● Handles both numerical and categorical variables.
● Effective with both small and large datasets.
● Helps identify important features for classification.
● No Need for Feature Scaling
● Handles Non-linear Relationship

Weaknesses of Decision Trees


● Biased toward features with more possible values.
● Overfits or underfits easily.
● Struggles with multi-class classification when training data is limited.
● Can be computationally expensive for deep trees.
Applications of Decision Trees

● Structured Data Handling: Works well with datasets having a finite list of attributes, where each instance has a value for
each attribute (e.g., 'High' for CGPA).
● Efficient with Discrete Values:
○ Performs well when attributes have a small number of distinct values (e.g., 'High', 'Medium', 'Low').
○ Can be extended to handle real-valued attributes (e.g., temperature as a floating point value).
● Binary & Multi-Class Classification:
○ Handles Boolean classification (e.g., Communication = ‘Good’ or ‘Bad’).
○ Supports multi-class classification (e.g., CGPA = ‘High’, ‘Medium’, or ‘Low’).
● No Infinite Loops:
○ The decision-making process must move step-by-step from root to decision node.
○ Prevents cases where an algorithm could enter an infinite loop and fail to produce a result.
Random Forest Algorithm
The random forest algorithm works as follows:
1. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N) features at random out of the N features. Also, the observations or data instances should be picked randomly.
2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
3. Keep splitting the nodes into child nodes till the tree is grown to the maximum possible extent.
4. Select a different subset of the training data ‘with replacement’ to train another decision tree, following steps 1 to 3. Repeat this to build and train ‘n’ decision trees.
5. The final class assignment is done on the basis of the majority votes from the ‘n’ trees.
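A minimal sketch of the same idea using scikit-learn's RandomForestClassifier on stand-in data; n_estimators plays the role of 'n' (the number of trees) and max_features the role of 'm' (the random feature subset per split):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)   # stand-in data

forest = RandomForestClassifier(
    n_estimators=100,       # build and train 'n' = 100 decision trees
    max_features="sqrt",    # random subset of m ≈ sqrt(N) features considered at each split
    bootstrap=True,         # each tree trains on a sample drawn 'with replacement'
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # final class assignment = majority vote of the 100 trees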
Support Vector Machine
● SVM (Support Vector Machine) is a model used for both linear classification and regression.
● It is based on the concept of a hyperplane, which serves as a decision boundary between data points in a multi-dimensional
feature space.
● The output prediction of an SVM falls into one of two predefined classes present in the training data.
● The SVM algorithm constructs an N-dimensional hyperplane, which helps in classifying future data instances into one of the
two possible output classes.
● SVM builds a model to distinguish data instances belonging to different classes.
● When data instances are linearly separable, they can be divided by a straight line in a two-dimensional space.
● In a multi-dimensional feature space, this straight line extends to form a hyperplane that separates different classes.
● The SVM model represents input instances as points in the feature space, ensuring a clear gap between different classes.
● The goal of SVM analysis is to find an optimal hyperplane that effectively separates the data instances based on their classes.
● New instances are mapped into the same space and classified based on which side of the hyperplane they fall on.
● The SVM algorithm identifies a surface (hyperplane) in the feature space that separates the data instances during training.
● Since multiple such hyperplanes can exist, a key challenge in SVM is to find the optimal hyperplane that gives the best classification.
● Data points farther from the hyperplane indicate higher confidence in correct classification; when new test data is added, its position relative to the hyperplane determines its assigned class.
● The distance between the hyperplane and the nearest data points is called the margin.
Scenario 1
● There are three hyperplanes: A, B, and C in the given scenario.
● The task is to identify the best hyperplane that effectively separates the two
classes (triangles and circles).
● Hyperplane 'A' is the most effective in segregating the two classes
correctly.
Scenario 2
● There are three hyperplanes: A, B, and C in the given scenario.
● The goal is to identify the best hyperplane for classifying triangles and circles.
● The correct hyperplane is determined by maximizing the margin, which is the distance between the nearest data
points of both classes and the hyperplane.
● Margin refers to the distance between the hyperplane and the closest data points from each class.
● In Figure b, hyperplane A has a higher margin compared to B and C, making it the best choice.
● A higher margin ensures robustness, reducing the risk of misclassification.
● Lower margin hyperplanes (like B and C) are more prone to misclassification errors.
Scenario 3

● Hyperplane B has a higher margin than A, which may seem like a better
choice.
● However, SVM prioritizes accurate classification before maximizing the
margin.
● Hyperplane B has a classification error, meaning it misclassifies some data
points.
● Hyperplane A classifies all data instances correctly, making it the correct
choice
Scenario 4
● Figure a shows that a straight line cannot distinctly separate the two classes because of an outlier.
● One triangle lies in the circle’s territory, making it an outlier for its class.
● Another triangle at the opposite end is also an outlier for the triangle class.
● SVM has the capability to ignore outliers and still find the optimal hyperplane with the maximum margin.
● In Figure b, hyperplane A is chosen as it has the maximum margin and effectively handles outliers.
● Therefore, SVM is robust to outliers, ensuring accurate classification even with outliers in the data.
● So, by summarizing the observations from the different scenarios, we can say that
● The hyperplane should segregate the data instances belonging to the two classes in
the best possible way.
● It should maximize the distances between the nearest data points of both the
classes, i.e. maximize the margin.
● If there is a need to prioritize between higher margin and lesser misclassification,
the hyperplane should try to reduce misclassifications.
● Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset
consisting of input feature vectors X and their corresponding class labels Y.
● The equation of the linear hyperplane can be written as w·x + b = 0, where w is the normal vector to the hyperplane (the direction perpendicular to it) and b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector.
● For a linearly separable dataset, the goal is to find the hyperplane that maximizes the margin between the two classes while ensuring that all data points are correctly classified. This leads to the optimization problem: minimize (1/2)‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for every training instance (xᵢ, yᵢ).
● Using Lagrange multipliers, these constraints are incorporated into a single Lagrangian function, converting the problem into an unconstrained optimization problem.

Converting to a Dual Problem for Efficient Solving

● The Lagrangian formulation helps transform the problem into a dual form, which is easier to solve using techniques like the
Quadratic Programming (QP) approach.
● The dual problem depends only on the dot product of data points, making it computationally efficient.

Support Vector Identification

● The Lagrange multipliers determine the support vectors, which are the key data points that define the decision boundary.
To solve the minimization problem, we take the partial derivatives of the Lagrangian with respect to w and b and set them to zero.
Choose an appropriate kernel:

● Linear Kernel: Use when data is linearly separable.


● Polynomial Kernel: Use for polynomial decision boundaries.
● Radial Basis Function (RBF) Kernel: Works well with non-linearly separable data
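A brief sketch of the three kernel choices with scikit-learn's SVC, fitted here on a made-up, non-linearly separable data set; which kernel is appropriate in practice depends on the shape of the class boundary:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)    # non-linearly separable stand-in data

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)             # for linearly separable data
poly_svm   = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)     # for polynomial decision boundaries
rbf_svm    = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y) # for non-linearly separable data

for name, model in [("linear", linear_svm), ("poly", poly_svm), ("rbf", rbf_svm)]:
    print(name, model.score(X, y))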
Strengths of SVM

● Effective Margin Separation: Works well when there is a clear margin between classes.
● High-Dimensional Efficiency: Performs well in high-dimensional spaces.
● Handles Small Datasets: Requires only a few support vectors for classification.
● Non-Linear Decision Boundaries: Uses the kernel trick for complex decision boundaries.
● Good Generalization: Classifies unseen data effectively.
● Versatility: Applicable to both classification and regression tasks.
Weaknesses of SVM
● Not suitable for large datasets: Computationally expensive for large datasets.
● Struggles with noisy data: Performs poorly when target classes overlap.
● Underperformance with high-dimensional data: Struggles when the number of features exceeds the number of training
samples.
● Memory-intensive: Requires storing a large kernel matrix, making it resource-heavy.
● Sensitive to parameter selection: Performance depends on the right choice of kernel and regularization parameters.
● Limited multi-class support: Primarily designed for binary classification; multi-class problems require additional techniques.
● Slow for high-feature datasets: Becomes inefficient when dealing with many features.

Applications of SVM
● Most effective for binary classification problems.
● Used in bioinformatics for detecting cancer and genetic disorders.
● Applied in facial recognition by classifying images into face and non-face components.
● Has various other applications in pattern recognition and machine learning.
Naive Bayes Theorem
Bayes' theorem states that P(c|d) = P(d|c) · P(c) / P(d), where:
● P(c|d) is the posterior probability: the probability of the class (hypothesis) c given the observed data (evidence) d.
● P(d|c) is the likelihood: the probability of the evidence d given that the hypothesis c is true.
● P(c) is the prior probability: the probability of the hypothesis before observing the evidence.
● P(d) is the marginal probability: the probability of the evidence.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome:
● Feature independence: This means that when we are trying to classify something, we assume
that each feature (or piece of information) in the data does not affect any other feature.
● Continuous features are normally distributed: If a feature is continuous, then it is assumed to
be normally distributed within each class.
● Features are equally important: All features are assumed to contribute equally to the prediction
of the class label.
● No missing data: The data should not contain any missing values.
1. Construct a frequency table. The posterior probability can be easily derived by constructing a frequency table for each attribute against the target. For example, the frequency of the Weather Condition variable taking the value ‘Sunny’ when the target ‘Won match’ is ‘Yes’ is 3/(3+4+2) = 3/9.
2. To predict whether the team will win given Weather Condition (a₁) = Rainy, Wins in last three matches (a₂) = 2 wins, Humidity (a₃) = Normal, and Win toss (a₄) = True, we compute the class-conditional probabilities for ‘Yes’ and ‘No’ from the table; for these conditions, ‘Yes’ turns out to be the more probable class.
3. By normalizing the above two probabilities, we can ensure that the sum of these two probabilities is 1.
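A small worked sketch of steps 1-3 follows; the priors and likelihoods below are made-up frequency-table values (the original match data set is not reproduced here), so the numbers are illustrative only:

import math

# Hypothetical priors and per-attribute likelihoods read off frequency tables.
prior = {"Yes": 9 / 14, "No": 5 / 14}
likelihood = {
    "Yes": {"Weather=Rainy": 2 / 9, "LastThree=2wins": 4 / 9, "Humidity=Normal": 6 / 9, "WinToss=True": 5 / 9},
    "No":  {"Weather=Rainy": 3 / 5, "LastThree=2wins": 1 / 5, "Humidity=Normal": 1 / 5, "WinToss=True": 2 / 5},
}

# Step 2: multiply the likelihoods of the observed conditions with the prior of each class.
score = {cls: prior[cls] * math.prod(likelihood[cls].values()) for cls in prior}

# Step 3: normalise the two scores so they sum to 1; the larger posterior is the prediction.
total = sum(score.values())
posterior = {cls: s / total for cls, s in score.items()}
print(posterior)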
Regression
Real Estate Price Prediction Using Regression

● Problem Overview:
○ Real estate price prediction is a supervised learning problem, specifically solved using regression.
○ New City has seen a rapid increase in commercial activities, leading to a housing demand boom.
○ Karen, a real estate business owner, initially determined property prices based on intuition and experience.
○ With business growth, personal interactions became unmanageable, and her assistant struggled with price estimations.
● Solution:
○ Karen's friend, Frank, a data scientist, proposed a machine learning-based solution.
○ He built a regression model to predict property prices based on factors like:
■ Area (sq. m.)
■ Location
■ Floor number
■ Number of years since purchase
■ Available amenities
● Understanding Regression:
○ Regression is used to predict numerical values by finding relationships between variables.
○ Dependent Variable (Y): The target value to be predicted (e.g., real estate price).
○ Independent Variables (X): Predictors influencing the dependent variable (e.g., area, location, floor).
○ The goal of regression is to find a function Y = f(X) that best explains the relationship between predictors and the target value.
● Conclusion:
○ Regression models, like the one Frank built, can solve various numerical prediction problems beyond real estate pricing.
Regression Algorithms

● Simple linear regression


● Multiple linear regression
● Polynomial regression
● Multivariate adaptive regression splines
● Logistic regression
● Maximum likelihood estimation (least squares)
Simple Linear Regression
Definition:

● Simplest regression model involving only one predictor


variable.
● Assumes a linear relationship between the dependent and
predictor variable.

Application in Karen’s Problem:

● Dependent Variable (Y): Price of the Property.


● Predictor Variable (X): Area of the Property (in sq. m.).

The model takes the form Y = a + bX, where ‘a’ and ‘b’ are the intercept and slope of the straight line, respectively.
(Figure slides: illustrations of a positive slope, a negative slope, and the error of a fitted line.)
Hypothesis:

● A college professor believes higher internal marks lead to higher external marks.
● A random sample of 15 students was selected for analysis.

Scatter Plot Analysis:

● X-axis: Internal marks (independent variable).


● Y-axis: External marks (dependent variable).
● A regression line is drawn to represent the relationship.

Understanding the Regression Line:

● The regression line does not perfectly predict data but approximates the relationship.
● Some predictions are higher or lower than actual values.

Residual Error:

● Definition: The difference between actual and predicted values.


● Represented as the vertical distance between points and the regression line.
Finding the Best-Fit Line:

● Ordinary Least Squares (OLS):


○ A technique used to minimize residual error (ε).
○ Finds the best values for a (intercept) and b (slope).
○ Minimizes the Sum of Squared Errors (SSE).
○ It is observed that the SSE is least when b takes the value b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)².

The corresponding value of ‘a’, calculated using the above value of ‘b’, is a = Ȳ − b·X̄.
OLS algorithm

Step 1: Calculate the mean of X and Y

Step 2: Calculate the errors of X and Y

Step 3: Get the product

Step 4: Get the summation of the products

Step 5: Square the difference of X

Step 6: Get the sum of the squared difference

Step 7: Divide output of step 4 by output of step 6 to calculate ‘b’

Step 8: Calculate ‘a’ using the value of ‘b’
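The eight steps can be written out as a short NumPy sketch; the internal (X) and external (Y) marks below are hypothetical values, not the professor's actual sample:

import numpy as np

# Hypothetical internal (X) and external (Y) marks for a few students.
X = np.array([22, 25, 17, 24, 16, 29, 20, 23, 19, 18], dtype=float)
Y = np.array([39, 38, 30, 40, 31, 48, 36, 39, 33, 32], dtype=float)

x_mean, y_mean = X.mean(), Y.mean()          # Step 1: means of X and Y
x_dev, y_dev = X - x_mean, Y - y_mean        # Step 2: deviations (errors) of X and Y
products = x_dev * y_dev                     # Step 3: product of the deviations
sum_products = products.sum()                # Step 4: summation of the products
x_dev_sq = x_dev ** 2                        # Step 5: squared deviations of X
sum_x_dev_sq = x_dev_sq.sum()                # Step 6: sum of the squared deviations
b = sum_products / sum_x_dev_sq              # Step 7: slope b
a = y_mean - b * x_mean                      # Step 8: intercept a
print(a, b)                                  # fitted line: Y = a + b*X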


Multiple Regression
Definition:

● A regression model involving two or more independent variables (predictors).


● Extends simple linear regression by considering multiple factors affecting the dependent variable.

Example (Karen’s Problem):

● Dependent variable: Price of a property ($).


● Independent variables: Area (sq. m.), location, floor, years since purchase, amenities.

General Multiple Regression Equation:

● With two predictors: Y = a + b₁X₁ + b₂X₂
● With ‘n’ predictors: Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ
● Parameters:
○ a = Intercept (where the plane crosses the Y-axis).
○ b₁, b₂, ..., bₙ = Partial regression coefficients.
○ b₁ represents the change in mean response for a unit change in X₁, keeping X₂ constant.
○ b₂ represents the change in mean response for a unit change in X₂, keeping X₁ constant
Interpretation:

● Represents a plane in three-dimensional space for two predictors.


● For n predictors, it extends into higher dimensions.

Types of Multiple Regression:

● Polynomial Regression: Fits a non-linear relationship using higher-degree terms.


● Curvilinear Regression: Captures curved trends in data.
Multicollinearity
● Multicollinearity occurs when there is a strong correlation between the independent variables in a regression
model.
● It does not only involve the relationship between the dependent and independent variables, but also the
correlation among the independent variables themselves.
● While multicollinearity can still lead to good predictions in multiple regression, it makes it difficult to determine
the individual effect of each independent variable on the dependent variable.
● The presence of multicollinearity increases the standard errors of the coefficients.
● By inflating the standard errors, multicollinearity can cause variables to appear statistically insignificant, even
when they should be significant with lower standard errors.
● The Variance Inflation Factor (VIF) is used to assess the extent of multicollinearity. A VIF value of 1 means no
correlation, and higher values indicate greater correlation.
● The assumption of no perfect collinearity implies:
● No exact linear relationship among independent variables.
● Independent variables, excluding the intercept term, cannot be constants.
● There must be variation in the independent variables (X’s).
● More variation in independent variables typically leads to better OLS (Ordinary Least Squares) estimates,
helping to more accurately identify the impact of each independent variable on the dependent variable.
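As a hedged illustration, the Variance Inflation Factor can be computed with statsmodels (assuming it is installed); the design matrix below is random stand-in data in which 'rooms' is deliberately correlated with 'area':

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
area  = rng.normal(100, 20, 50)
rooms = area / 25 + rng.normal(0, 0.5, 50)     # deliberately correlated with 'area'
age   = rng.normal(10, 3, 50)
X = sm.add_constant(pd.DataFrame({"area": area, "rooms": rooms, "age": age}))

# VIF for each predictor (the constant column is skipped); VIF near 1 means little collinearity.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))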
Heteroskedasticity refers to the changing variance of the error term in a regression model.

When the variance of the error term is not constant across observations, it leads to erroneous predictions.

For accurate predictions in a regression equation, the error term should be:

● Independent.
● Identically distributed (iid).
● Normally distributed.

Formally, the assumptions are var(uᵢ | X) = σ² (a constant) and cov(uᵢ, uⱼ | X) = 0 for i ≠ j, where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents the error terms, and ‘X’ represents the independent variables.
Bias and Variance in regression models are similar to accuracy and prediction:

● Accuracy: How close the estimation is to the actual value.


● Prediction: Continuous estimation of future values.

High bias = low accuracy: Predictions are not close to the real value.
High variance = low prediction: Predictions are scattered.

Low bias = high accuracy: Predictions are close to the real value.

Low variance = high prediction: Predictions are close to each other.

Ideal situation: A model with:

● High accuracy (low bias).


● High prediction (low variance).
● Low overall error (low bias and low variance).

Increasing variance (low prediction) leads to more scattered data points, reducing accuracy.
Increasing bias (low accuracy) increases the error between predicted and observed values.

Balancing bias and variance is crucial for a good regression model.


Linear regression assumptions:

● Number of observations (n) should be greater than the number of parameters (k), i.e., n > k.
● When n > k, least squares estimates have low variance and perform well on test data.

If n is not much larger than k, high variability in the least squares fit can cause overfitting, leading to poor predictions.
If k > n, linear regression becomes unusable, leading to infinite variance.

Improving accuracy in linear regression can be achieved using:

1. Shrinkage Approach.
2. Subset Selection.
3. Dimensionality (Variable) Reduction.
Shrinkage (Regularization) Approach:
● Shrinks estimated coefficients to reduce variance at the cost of a small increase in bias.
● This reduction in variance improves model accuracy.
Irrelevant Variables: Some variables in a multiple regression model may not be related to the response and add unnecessary complexity.
Shrinkage Technique:
● Fits a model with all predictors but shrinks the coefficients towards zero, compared to least squares estimates.
● This reduces the overall variance and may result in some coefficients being exactly zero, thus performing variable selection.
Two common shrinkage techniques:
1. Ridge Regression (L2 Regularization).
2. Lasso Regression (L1 Regularization).
Ridge Regression:
● Adds a penalty equivalent to the square of the coefficients' magnitude.
● Minimization objective: LS Objective + α × (sum of square of coefficients).
● Works well when k > n (more predictors than observations), trading a small bias increase for a large variance decrease.
● Includes all predictors in the final model, which may complicate interpretation when k is large.
● Best used when many predictors influence the response, with coefficients of roughly equal size.
Lasso Regression:

● Adds a penalty equivalent to the absolute value of the coefficients' magnitude.


● Minimization objective: LS Objective + α × (absolute value of coefficients).
● Forces some coefficients to zero, yielding a sparse model (only a subset of predictors remain).
● More interpretable and simpler than ridge regression.
● Performs better when a small number of predictors have substantial coefficients, and others have coefficients near zero.
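A minimal sketch of the two penalties with scikit-learn, on standardised stand-in data; alpha corresponds to the regularisation strength α in the objectives above:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)
X = StandardScaler().fit_transform(X)        # shrinkage penalties assume comparable feature scales

ridge = Ridge(alpha=1.0).fit(X, y)           # L2 penalty: shrinks coefficients, none exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)           # L1 penalty: drives some coefficients exactly to zero

print((ridge.coef_ == 0).sum(), (lasso.coef_ == 0).sum())   # lasso typically reports several zeros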

Subset Selection:

● Identify a subset of predictors assumed to be related to the response and fit a model using OLS on this reduced subset.

Two Methods for Subset Selection:

1. Best Subset Selection.


2. Stepwise Subset Selection.
Best Subset Selection:
● Fits a separate least squares regression for each possible subset of the k predictors.
● Cannot be applied with a very large number of predictors (k) due to computational complexity.
● Considers all possible (2^k) models containing subsets of the k predictors.
Stepwise Subset Selection:
● A more computationally efficient alternative to best subset selection.
● Can be done using two approaches:
1. Forward Stepwise Selection (0 to k).
2. Backward Stepwise Selection (k to 0).
Forward Stepwise Selection:
● Starts with a model containing no predictors.
● Predictors are added one by one to the model, selecting the variable (X) that provides the highest improvement in
fit at each step.
● Continues adding predictors until all k predictors are included.
Backward Stepwise Selection:
● Starts with a model containing all k predictors.
● Iteratively removes the least useful predictor one by one, refining the model.
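As a rough sketch of both directions, scikit-learn's SequentialFeatureSelector (a greedy forward/backward selector, used here as a stand-in for the classical stepwise procedure) can be applied to made-up regression data:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)

# Forward: start with no predictors and add, one by one, the variable improving the fit most.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward").fit(X, y)

# Backward: start with all predictors and iteratively remove the least useful one.
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward").fit(X, y)

print(forward.get_support(), backward.get_support())   # boolean masks of the selected predictors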
Dimensionality Reduction (Variable Reduction):

● Unlike subset selection and shrinkage, dimensionality reduction transforms the predictors (X) rather than
selecting or shrinking them.
● The model is then built using the transformed variables after dimensionality reduction.
● The goal is to reduce the number of variables in the model.

Principal Component Analysis (PCA):

● One of the most important techniques for dimensionality (variable) reduction.


● PCA transforms the original predictors into a smaller set of uncorrelated variables (principal components),
which capture the most significant variance in the data.
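A brief sketch of PCA-based variable reduction followed by regression on the retained components (principal component regression), assuming scikit-learn and stand-in data; n_components=0.95 keeps the components explaining about 95% of the variance:

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=15, n_informative=5, noise=5, random_state=0)

# Standardise, keep the principal components explaining ~95% of the variance, regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression()).fit(X, y)

print(pcr.named_steps["pca"].n_components_, pcr.score(X, y))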
