Supervised Learning
● Supervised Learning Definition:
Learning based on labeled
training data.
● Role of Labeled Data: Acts as
experience, prior knowledge, or
belief.
● The process is similar to a
teacher supervising a student.
● Training data serves as the
teacher guiding the learning
process.
Classification Learning Steps
● Problem Identification: Define a well-formed
problem with clear goals and long-term benefits.
● Identification of Required Data: Select relevant
datasets that represent the problem accurately.
● Data Pre-processing: Clean and transform raw data
for analysis.
● Definition of Training Data Set: Choose appropriate
input and output data for training.
● Algorithm Selection: Pick the best learning
algorithm based on the problem.
● Training: Run the selected algorithm on the training
set for fine-tuning.
● Evaluation with Test Data: Measure performance
and refine the model if needed (an end-to-end code sketch follows this list).
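The steps above map onto a typical supervised-learning workflow in code. A minimal sketch, assuming scikit-learn and a synthetic stand-in data set (the data, model choice, and split ratio are illustrative, not prescribed by the slides):

```python
# End-to-end sketch: data -> pre-processing -> training -> evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the "identification of required data" step.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Definition of the training (and test) data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-processing + algorithm selection, chained into one model.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Training.
model.fit(X_train, y_train)

# Evaluation with test data.
print("test accuracy:", model.score(X_test, y_test))
```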
Supervised Learning Algorithms
● k-Nearest Neighbour (kNN)
● Decision tree
● Random forest
● Support Vector Machine (SVM)
● Naïve Bayes classifier
K-Nearest Neighbour
● A simple yet powerful classification algorithm.
● Similar entities tend to stay close, like neighbors in a locality.
● Unknown data is classified based on similar training data points.
● The label of an unknown element is determined by its nearest neighbors.
● Uses similarity to make predictions, just like people with similar mindsets group together.
Student Data Set: Consists of 15 students with scores in Aptitude and Communication (scale of 10).
Challenges in kNN:
● Euclidean distance: The most common method used by kNN to measure similarity between two data points.
● Value of 'k': Determines the number of neighbors to consider in kNN.
● User-Defined Parameter: The value of 'k' is set by the user.
● Example (k = 3): The three nearest neighbors are considered, and the majority class is assigned.
● Example (k = 1): Only the closest neighbor's class label is assigned to the test data.
● Impact of 'k': Affects classification accuracy and decision-making in kNN.
Determine the class label for Josh using the kNN algorithm: with k = 1, Josh receives the class of his single nearest neighbour; with k = 3, he receives the majority class among his three nearest neighbours. The same process can be applied for any value of k using majority voting, as sketched below.
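A minimal sketch of this distance-plus-majority-vote procedure, assuming a small two-feature data set in the spirit of the Aptitude/Communication scores above (the points and class labels below are invented placeholders, not the actual student table):

```python
from collections import Counter
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (Aptitude, Communication) scores on a scale of 10, with class labels.
X = [(8, 7), (7, 8), (9, 6), (3, 4), (2, 3), (4, 2)]
y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X, y, query=(6, 7), k=3))  # -> 'A' (majority of the 3 nearest)
print(knn_predict(X, y, query=(6, 7), k=1))  # -> class of the single nearest point
```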
Choosing K
Choosing k is challenging due to the following reasons:
● Large k (e.g., total training records): Majority class dominates, ignoring nearest neighbors.
● Small k (e.g., k = 1): Risk of assigning class of an outlier or noisy data point.
Weaknesses of kNN
Applications of kNN
● Recommender Systems: Suggests items based on user preferences (e.g., past purchases, browsing history).
● Information Retrieval: Used for concept search—finding similar documents or content.
Decision Tree
Decision Tree Learning
● Model Creation: Based on past data (feature vectors), predicts output variable values.
● Nodes: Each decision node represents a feature from the input vector.
● Edges: Connect nodes to children, each representing possible feature values.
● Leaf Nodes: Terminate the tree, representing possible values for the output variable.
● Path Followed: Classification is determined by following the path from the root to a leaf node based on input variable values
● Each leaf node assigns a classification. The first node of the tree is called the ‘Root’ node, internal nodes reached by branches are called ‘Branch’ (decision) nodes, and terminal nodes are called ‘Leaf’ nodes. In the example figure, ‘A’ is the root node, ‘B’ is a branch node, and ‘T’ & ‘F’ are leaf nodes.
Decision Tree Construction
● Recursive Partitioning: Decision trees are built using this method, splitting data into subsets based on feature values.
● Root Node: The entire dataset is initially considered the root.
● Feature Selection: The feature that predicts the target class most strongly is selected.
● Splitting: Data is partitioned based on the chosen feature, forming branches.
● Continued Splitting: The process continues, selecting the best feature to split each node, until a stopping criterion is met.
Stopping Criteria
1. Same Class: All or most examples at a node belong to the same class.
2. No Features Left: All features have been used for partitioning.
3. Predefined Limit: The tree grows to a specified threshold.
● Path: Based on Chandra's attributes (CGPA = High, Communication = Bad, Aptitude = High, Programming Skills = Bad), the
decision tree predicts he will get the job offer.
● Popular Algorithms:
○ C5.0, CART (Classification and Regression Tree), CHAID, ID3.
● Feature Selection Challenge:
○ The algorithm selects the feature to split on by ensuring partitions are as pure as possible.
○ Entropy: Measures the impurity of an attribute, used in algorithms like ID3 and C5.0.
○ Information Gain: Calculated by measuring the decrease in entropy after splitting data on a feature. The goal is to find
the feature with the highest information gain, creating the most homogeneous branches (see the sketch after this list).
● First Node: The first split is based on Aptitude. If Aptitude = Low, the job offer is FALSE.
● Aptitude = High:
○ If Communication = Good, the job offer is TRUE.
○ If Communication = Bad, the job offer is TRUE if CGPA = High.
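To make the entropy and information-gain calculation concrete, here is a small sketch that scores a categorical split (the four rows below are a made-up toy set, not the job-offer table from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (impurity measure)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, feature_index):
    """Decrease in entropy after splitting on the feature at `feature_index`."""
    parent = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    weighted_children = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return parent - weighted_children

# Hypothetical rows: (Aptitude, Communication) -> job-offer label.
rows = [("High", "Good"), ("High", "Bad"), ("Low", "Good"), ("Low", "Bad")]
labels = ["True", "True", "False", "False"]

print(information_gain(rows, labels, feature_index=0))  # split on Aptitude -> 1.0
print(information_gain(rows, labels, feature_index=1))  # split on Communication -> 0.0
```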
Avoiding Overfitting in Decision Trees – Pruning
● Issue: Without stopping criteria, the decision tree may keep growing indefinitely, leading to overfitting.
● Solution: Pruning reduces tree size, making the model more generalized and better at classifying unseen data.
Types of Pruning
● Pre-pruning: stop growing the tree early, before it fully fits the training data, based on a stopping criterion.
● Post-pruning: grow the full tree first, then remove (prune) branches that contribute little to classification accuracy.
Problems Suited to Decision Tree Learning
● Structured Data Handling: Works well with datasets having a finite list of attributes, where each instance has a value for
each attribute (e.g., 'High' for CGPA).
● Efficient with Discrete Values:
○ Performs well when attributes have a small number of distinct values (e.g., 'High', 'Medium', 'Low').
○ Can be extended to handle real-valued attributes (e.g., temperature as a floating point value).
● Binary & Multi-Class Classification:
○ Handles Boolean classification (e.g., Communication = ‘Good’ or ‘Bad’).
○ Supports multi-class classification (e.g., CGPA = ‘High’, ‘Medium’, or ‘Low’).
● No Infinite Loops:
○ The decision-making process must move step-by-step from root to decision node.
○ Prevents cases where an algorithm could enter an infinite loop and fail to produce a result.
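As a rough illustration of fitting a decision tree to this kind of structured, discrete-valued data with scikit-learn (the attribute names echo the running example, but the rows, encoding, and depth limit are illustrative assumptions):

```python
# Sketch using scikit-learn; assumes scikit-learn and pandas are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical candidate records with discrete attribute values.
data = pd.DataFrame({
    "CGPA": ["High", "High", "Low", "Medium", "High", "Low"],
    "Communication": ["Good", "Bad", "Good", "Bad", "Bad", "Bad"],
    "Aptitude": ["High", "High", "Low", "High", "High", "Low"],
    "JobOffer": ["True", "True", "False", "True", "True", "False"],
})

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(data[["CGPA", "Communication", "Aptitude"]])
y = data["JobOffer"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned root-to-leaf decision rules.
print(export_text(tree, feature_names=list(X.columns)))
```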
Random Forest Algorithm
The random forest algorithm works as follows:
1. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N) features at random out of the N features. Also pick the observations or data instances randomly.
2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
3. Keep splitting the nodes into child nodes until the tree is grown to the maximum possible extent.
4. Select a different subset of the training data ‘with replacement’ to train another decision tree, following steps 1 to 3. Repeat this to build and train ‘n’ decision trees.
5. Final class assignment is done on the basis of the majority votes from the ‘n’ trees.
A rough code sketch of this procedure follows.
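A minimal sketch of the same procedure using scikit-learn's RandomForestClassifier, assuming scikit-learn is available; n_estimators plays the role of the ‘n’ trees and max_features the random feature subset ‘m’ (the synthetic data is a stand-in):

```python
# Sketch: bagged trees with random feature subsets, majority-vote prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; any labelled feature matrix would do.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # 'n' decision trees, each trained on a bootstrap sample
    max_features="sqrt",   # random subset of 'm' features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))  # majority vote over the trees
```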
Support Vector Machine
● SVM (Support Vector Machine) is a model used for both linear classification and regression.
● It is based on the concept of a hyperplane, which serves as a decision boundary between data points in a multi-dimensional
feature space.
● The output prediction of an SVM falls into one of two predefined classes present in the training data.
● The SVM algorithm constructs an N-dimensional hyperplane, which helps in classifying future data instances into one of the
two possible output classes.
● SVM builds a model to distinguish data instances belonging to different classes.
● When data instances are linearly separable, they can be divided by a straight line in a two-dimensional space.
● In a multi-dimensional feature space, this straight line extends to form a hyperplane that separates different classes.
● The SVM model represents input instances as points in the feature space, ensuring a clear gap between different classes.
● The goal of SVM analysis is to find an optimal hyperplane that effectively separates the data instances based on their classes.
● New instances are mapped into the same space and classified based on which side of the hyperplane they fall on.
● The SVM algorithm identifies a surface (hyperplane) in the feature space to separate data instances during training.
● Since multiple hyperplanes can exist, a key challenge in SVM
is to find the optimal hyperplane for the best classification.
● SVM aims to find an optimal hyperplane that effectively
separates data instances based on their classes.
● New instances are mapped into the same feature space and
classified based on which side of the hyperplane they fall
on.
● Data points farther from the hyperplane indicate a higher
confidence in correct classification.
● When new test data is added, its position relative to the
hyperplane determines its assigned class.
● The distance between the hyperplane and data points is
called the margin.
Scenario 1
● There are three hyperplanes: A, B, and C in the given scenario.
● The task is to identify the best hyperplane that effectively separates the two
classes (triangles and circles).
● Hyperplane 'A' is the most effective in segregating the two classes
correctly.
Scenario 2
● There are three hyperplanes: A, B, and C in the given scenario.
● The goal is to identify the best hyperplane for classifying triangles and circles.
● The correct hyperplane is determined by maximizing the margin, which is the distance between the nearest data
points of both classes and the hyperplane.
● Margin refers to the distance between the hyperplane and the closest data points from each class.
● In Figure b, hyperplane A has a higher margin compared to B and C, making it the best choice.
● A higher margin ensures robustness, reducing the risk of misclassification.
● Lower margin hyperplanes (like B and C) are more prone to misclassification errors.
Scenario 3
● Hyperplane B has a higher margin than A, which may seem like a better
choice.
● However, SVM prioritizes accurate classification before maximizing the
margin.
● Hyperplane B has a classification error, meaning it misclassifies some data
points.
● Hyperplane A classifies all data instances correctly, making it the correct
choice
Scenario 4
● Figure a shows that a straight line cannot distinctly separate the two classes because of an outlier.
● One triangle lies in the circle’s territory, making it an outlier for its class.
● Another triangle at the opposite end is also an outlier for the triangle class.
● SVM has the capability to ignore outliers and still find the optimal hyperplane with the maximum margin.
● In Figure b, hyperplane A is chosen as it has the maximum margin and effectively handles outliers.
● Therefore, SVM is robust to outliers, ensuring accurate classification even with outliers in the data.
● Summarizing the observations from the different scenarios:
● The hyperplane should segregate the data instances belonging to the two classes in
the best possible way.
● It should maximize the distances between the nearest data points of both the
classes, i.e. maximize the margin.
● If there is a need to prioritize between higher margin and lesser misclassification,
the hyperplane should try to reduce misclassifications.
● Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset
consisting of input feature vectors X and their corresponding class labels Y.
● The equation of the linear hyperplane can be written as w·x + b = 0.
● Where: w is the normal vector to the hyperplane (the direction perpendicular to it), and b is the offset or bias term, representing the distance of the hyperplane from the origin along the normal vector.
● For a linearly separable dataset, the goal is to find the hyperplane that maximizes the margin between the two classes while ensuring that all data points are correctly classified. This leads to the following optimization problem: minimize (1/2)·‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ).
● Using Lagrange multipliers, these constraints are incorporated into a single Lagrangian function, converting it into an
unconstrained optimization problem.
● The Lagrangian formulation helps transform the problem into a dual form, which is easier to solve using techniques like the
Quadratic Programming (QP) approach.
● The dual problem depends only on the dot product of data points, making it computationally efficient.
● The Lagrange multipliers determine the support vectors, which are the key data points that define the decision boundary.
To solve this minimization problem, we take the partial derivatives of the Lagrangian with respect to w and b and set them to zero, as sketched below.
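A compact sketch of that derivation in standard notation (this is the usual hard-margin formulation implied by the constraints above; the αᵢ ≥ 0 are the Lagrange multipliers):

```latex
\begin{aligned}
&\text{Primal: } \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2
  \quad \text{s.t. } y_i\bigl(w^\top x_i + b\bigr) \ge 1 \\
&\text{Lagrangian: } L(w,b,\alpha) = \tfrac{1}{2}\lVert w\rVert^2
  - \sum_i \alpha_i \bigl[y_i\bigl(w^\top x_i + b\bigr) - 1\bigr] \\
&\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
 \qquad
 \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
&\text{Dual: } \max_{\alpha \ge 0}\ \sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
  \quad \text{s.t. } \sum_i \alpha_i y_i = 0
\end{aligned}
```

Training points with αᵢ > 0 are the support vectors; they alone determine w and b, consistent with the point above that the dual depends only on dot products of data points.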
Choose an appropriate kernel: common choices are the linear, polynomial, radial basis function (RBF), and sigmoid kernels; the kernel maps the data into a space where the classes become (approximately) linearly separable.
Strengths of SVM
● Effective Margin Separation: Works well when there is a clear margin between classes.
● High-Dimensional Efficiency: Performs well in high-dimensional spaces.
● Handles Small Datasets: Requires only a few support vectors for classification.
● Non-Linear Decision Boundaries: Uses the kernel trick for complex decision boundaries (see the sketch after this list).
● Good Generalization: Classifies unseen data effectively.
● Versatility: Applicable to both classification and regression tasks.
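A small sketch of the kernel trick in practice with scikit-learn's SVC on synthetic, non-linearly-separable data (the data set, C, and gamma values are illustrative choices, not part of the slides):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric classes: not separable by any straight line in 2-D.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # near chance
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # much higher
print("support vectors per class:", rbf_svm.n_support_)
```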
Weaknesses of SVM
● Not suitable for large datasets: Computationally expensive for large datasets.
● Struggles with noisy data: Performs poorly when target classes overlap.
● Underperformance with high-dimensional data: Struggles when the number of features exceeds the number of training
samples.
● Memory-intensive: Requires storing a large kernel matrix, making it resource-heavy.
● Sensitive to parameter selection: Performance depends on the right choice of kernel and regularization parameters.
● Limited multi-class support: Primarily designed for binary classification; multi-class problems require additional techniques.
● Slow for high-feature datasets: Becomes inefficient when dealing with many features.
Applications of SVM
● Most effective for binary classification problems.
● Used in bioinformatics for detecting cancer and genetic disorders.
● Applied in facial recognition by classifying images into face and non-face components.
● Has various other applications in pattern recognition and machine learning.
Naive Bayes Theorem
● Bayes' theorem: P(c|d) = P(d|c) · P(c) / P(d), where c is a class (hypothesis) and d is the observed data (evidence).
● P(c|d) is the posterior probability: the probability of class c given the observed data d.
● P(d|c) is the likelihood: the probability of observing the data d given that class c is true.
● P(c) is the prior probability: the probability of class c before observing the evidence.
● P(d) is the marginal probability: the probability of the evidence.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome:
● Feature independence: This means that when we are trying to classify something, we assume
that each feature (or piece of information) in the data does not affect any other feature.
● Continuous features are normally distributed: If a feature is continuous, then it is assumed to
be normally distributed within each class.
● Features are equally important: All features are assumed to contribute equally to the prediction
of the class label.
● No missing data: The data should not contain any missing values.
1. Construct a frequency table. The posterior probability can be derived by constructing a frequency table for each attribute against the target. For example, the frequency of the Weather Condition variable taking the value ‘Sunny’ when the target Won match is ‘Yes’ is 3/(3+4+2) = 3/9.
2. To predict whether the team will win for the given conditions Weather Condition (a1) = Rainy, Wins in last three matches (a2) = 2 wins, Humidity (a3) = Normal, and Win toss (a4) = True, we look up the corresponding conditional probabilities for ‘Yes’ (and for ‘No’) in the table above and multiply them, together with the prior, for each class.
3. By normalizing the above two probabilities, we ensure that they sum to 1; the class with the larger posterior is predicted (a short code sketch of this calculation follows).
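A minimal sketch of this calculation; the frequency-table probabilities below are placeholder values for illustration (only the 3/9 ‘Sunny’ given ‘Yes’ figure comes from the example above):

```python
# Naive Bayes by hand: multiply the prior by per-attribute likelihoods, then normalize.

def naive_bayes_posteriors(priors, likelihoods, evidence):
    """Return normalized posteriors P(class | evidence) for each class.

    priors:      {class: P(class)}
    likelihoods: {class: {attribute: {value: P(value | class)}}}
    evidence:    {attribute: observed value}
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for attribute, value in evidence.items():
            score *= likelihoods[cls][attribute][value]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

# Placeholder probabilities (illustrative only).
priors = {"Yes": 9 / 14, "No": 5 / 14}
likelihoods = {
    "Yes": {"Weather": {"Sunny": 3 / 9, "Rainy": 2 / 9},
            "WinToss": {"True": 5 / 9, "False": 4 / 9}},
    "No":  {"Weather": {"Sunny": 2 / 5, "Rainy": 2 / 5},
            "WinToss": {"True": 2 / 5, "False": 3 / 5}},
}

evidence = {"Weather": "Rainy", "WinToss": "True"}
print(naive_bayes_posteriors(priors, likelihoods, evidence))  # sums to 1
```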
Regression
Real Estate Price Prediction Using Regression
● Problem Overview:
○ Real estate price prediction is a supervised learning problem, specifically solved using regression.
○ New City has seen a rapid increase in commercial activities, leading to a housing demand boom.
○ Karen, a real estate business owner, initially determined property prices based on intuition and experience.
○ With business growth, personal interactions became unmanageable, and her assistant struggled with price estimations.
● Solution:
○ Karen's friend, Frank, a data scientist, proposed a machine learning-based solution.
○ He built a regression model to predict property prices based on factors like:
■ Area (sq. m.)
■ Location
■ Floor number
■ Number of years since purchase
■ Available amenities
● Understanding Regression:
○ Regression is used to predict numerical values by finding relationships between variables.
○ Dependent Variable (Y): The target value to be predicted (e.g., real estate price).
○ Independent Variables (X): Predictors influencing the dependent variable (e.g., area, location, floor).
○ The goal of regression is to find a function Y = f(X) that best explains the relationship between predictors and the target value.
● Conclusion:
○ Regression models, like the one Frank built, can solve various numerical prediction problems beyond real estate pricing.
Regression Algorithms
In simple linear regression, the relationship is modelled by the straight line Y = a + bX, where ‘a’ and ‘b’ are the intercept and slope of the straight line, respectively.
The regression line may have a positive slope (Y increases as X increases) or a negative slope (Y decreases as X increases); the vertical distance between an observed point and the fitted line is the error.
Hypothesis:
● A college professor believes higher internal marks lead to higher external marks.
● A random sample of 15 students was selected for analysis.
● The regression line does not perfectly predict data but approximates the relationship.
● Some predictions are higher or lower than actual values.
Residual Error: the difference between an actual observed value and the value predicted by the regression line.
OLS algorithm: ordinary least squares chooses ‘a’ and ‘b’ to minimize the sum of squared residuals; the slope is b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², and the intercept corresponding to this value of ‘b’ is a = ȳ − b·x̄. A minimal code sketch of this computation follows.
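A minimal sketch of the OLS computation in NumPy (the internal/external marks are invented placeholders, not the professor's sample of 15 students):

```python
import numpy as np

# Hypothetical internal (x) and external (y) marks for a handful of students.
x = np.array([18, 22, 25, 30, 35, 40], dtype=float)
y = np.array([40, 45, 55, 60, 70, 78], dtype=float)

# Ordinary least squares estimates of slope b and intercept a.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

predictions = a + b * x
residuals = y - predictions            # residual error for each observation

print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")
print("sum of squared residuals:", np.sum(residuals ** 2))
```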
When the variance of the error term is not constant across observations (heteroscedasticity), it leads to erroneous predictions.
For accurate predictions in a regression equation, the error term should be:
● Independent.
● Identically distributed (iid).
● Normally distributed.
Formally, var(uᵢ) = σ² (constant variance), cov(uᵢ, uⱼ) = 0 for i ≠ j (errors are uncorrelated with each other), and cov(uᵢ, Xᵢ) = 0 (errors are uncorrelated with the predictors), where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents the error terms, and ‘X’ represents the independent variables.
Bias and variance in regression models are analogous to accuracy and precision:
High bias = low accuracy: predictions are not close to the real value.
High variance = low precision: predictions are scattered.
Low bias = high accuracy: predictions are close to the real value.
Low variance = high precision: predictions are tightly clustered.
Increasing variance (lower precision) leads to more scattered predictions, while increasing bias (lower accuracy) increases the error between predicted and observed values.
● Number of observations (n) should be greater than the number of parameters (k), i.e., n > k.
● When n > k, least squares estimates have low variance and perform well on test data.
If n is not much larger than k, high variability in the least squares fit can cause overfitting, leading to poor predictions.
If k > n, linear regression becomes unusable, leading to infinite variance.
Three common approaches to deal with this problem are:
1. Shrinkage Approach.
2. Subset Selection.
3. Dimensionality (Variable) Reduction.
Shrinkage (Regularization) Approach:
● Shrinks estimated coefficients to reduce variance at the cost of a small increase in bias.
● This reduction in variance improves model accuracy.
Irrelevant Variables: Some variables in a multiple regression model may not be related to the response and add unnecessary complexity.
Shrinkage Technique:
● Fits a model with all predictors but shrinks the coefficients towards zero, compared to least squares estimates.
● This reduces the overall variance and may result in some coefficients being exactly zero, thus performing variable selection.
Two common shrinkage techniques:
1. Ridge Regression (L2 Regularization).
2. Lasso Regression (L1 Regularization).
Ridge Regression:
● Adds a penalty equivalent to the square of the coefficients' magnitude.
● Minimization objective: LS Objective + α × (sum of square of coefficients).
● Works well when k > n (more predictors than observations), trading a small bias increase for a large variance decrease.
● Includes all predictors in the final model, which may complicate interpretation when k is large.
● Best used when many predictors influence the response, with coefficients of roughly equal size.
Lasso Regression:
● Adds a penalty equivalent to the absolute value of the coefficients' magnitude.
● Minimization objective: LS Objective + α × (sum of absolute values of coefficients).
● Can shrink some coefficients exactly to zero, so it performs variable selection and produces sparser, easier-to-interpret models.
● Best used when only a few predictors have a substantial effect on the response.
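A brief sketch contrasting the two penalties with scikit-learn on synthetic data (the α values and data dimensions are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data where only a few of the 50 predictors are truly informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty: can set coefficients exactly to zero

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    coefs = model.coef_
    print(f"{name:5s} zero coefficients: {np.sum(np.isclose(coefs, 0.0))} / {coefs.size}")
```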
Subset Selection:
● Identify a subset of predictors assumed to be related to the response and fit a model using OLS on this reduced subset.
Dimensionality (Variable) Reduction:
● Unlike subset selection and shrinkage, dimensionality reduction transforms the predictors (X) rather than selecting or shrinking them.
● The model is then built using the transformed variables after dimensionality reduction.
● The goal is to reduce the number of variables in the model.