
Unit II

Supervised Learning - I

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Supervised Learning
• Classification and Regression algorithms are Supervised Learning algorithms.
• Both types of algorithms are used for prediction in machine learning and operate on labelled
datasets.
• Regression algorithms are used to predict continuous values such as price, income, age, etc.
• Classification algorithms are used to predict or classify discrete values such as True or False, Male
or Female, Spam or Not Spam, etc.
• Types of Regression Algorithms:
• Simple Linear Regression
• Multiple Linear Regression
• Types of Classification Algorithms:
• Decision Trees
• Naïve Bayes
• K Nearest Neighbors
• Logistic Regression
• Multinomial Logistic Regression
• SVM
Regression Models

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Regression
• Regression in machine learning consists of mathematical methods that allow data scientists to
predict a continuous outcome (y) based on the value of one or more predictor variables (X).
• Regression analysis is a form of predictive modeling technique which investigates the relationship
between a dependent and independent variable.
• The aim of Regression is to find a function of X to predict y.
y=f(X)
• X: independent variable, also known as the predictor variable, regressor, explanatory
variable, or input variable
• y: dependent variable, also known as the response variable, regressand, predicted
variable, or output variable

• Types of Regression Models:


• Simple Linear Regression: one dependent variable and one independent variable
• Multiple Linear Regression: one dependent variable and multiple independent variables
Regression
• Using regression, a function is fit to the available data and used to predict the outcome for future
or hold-out data points.
• Fitting regression functions helps in interpolation and extrapolation.
• Interpolation - estimate missing data within the data range
• Extrapolation - estimate future data out of the data range

(Figures: interpolation within the data range; extrapolation beyond the data range)
Simple Linear Regression - Least Square Regression Line
• Least Squares method is used to determine the best-fitting line for the given data by reducing the
sum of the squares of the vertical deviations from each data point to the line.
• If a point lies exactly on the fitted line, its vertical deviation is 0.
• Linear regression determines the straight line, known as the least-squares regression line or LSRL.
• Suppose y is a dependent variable and X is an independent variable; then the population
regression line is given by the equation:

y = β0 + β1X

• When a random sample of observations is given, the fitted regression line is expressed as:

ŷ = B0 + B1X

• Where B0 is a constant (the intercept)
• B1 is the regression coefficient (the slope)
• X is the independent variable
• ŷ is known as the predicted value of the dependent variable.
Simple Linear Regression - Implementation
• Obtain the input dataset and identify the independent and dependent features.
• Fit the regression line of the form:

ŷ = B0 + B1X

• Compute B0 and B1:

B1 = Σ(Xi − X̄)(yi − ȳ) / Σ(Xi − X̄)²,   B0 = ȳ − B1X̄

• Using the B0 and B1 values, compute ŷ for each data point.
• Calculate the evaluation metrics to find whether the line obtained is best fit or not.
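
A minimal sketch of these steps in Python (the small dataset and variable names are made up for illustration):

import numpy as np

# Illustrative data (hypothetical): hours studied (X) vs. exam score (y)
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 65, 70, 78], dtype=float)

# Compute B1 and B0 from the least-squares formulas
x_mean, y_mean = X.mean(), y.mean()
B1 = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
B0 = y_mean - B1 * x_mean

# Predicted values on the fitted line
y_hat = B0 + B1 * X
print(f"B0 = {B0:.3f}, B1 = {B1:.3f}")
print("Predictions:", np.round(y_hat, 2))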
Regression - Performance Metrics
• The performance of a Regression model is reported as errors in the prediction.
• Following are the popular metrics that are used to evaluate the performance of Regression models.
• Mean Absolute Error: MAE is one of the simplest metrics; it measures the absolute difference between
actual and predicted values, where absolute means the difference is taken as positive.

MAE = (1/N) Σ |y − y'|

• y is the actual outcome, y' is the predicted outcome, and N is the total number of data points.
• Mean Squared Error: MSE measures the average of the squared differences between the predicted values and the actual
values given by the model.

MSE = (1/N) Σ (y − y')²
• R² Score: R², also known as the Coefficient of Determination, is another popular metric used for
Regression model evaluation.
• The R-squared metric enables us to compare the model with a constant baseline to determine the performance of the
model.
• To select the constant baseline, we take the mean of the data and draw the line at the mean:

R² = 1 − (MSE of the model / MSE of the baseline) = 1 − Σ(y − y')² / Σ(y − ȳ)²
Regression - Performance Metrics

• Adjusted R2:
• Adjusted R2, as the name suggests, is the improved version of R2 error.
• R² has the limitation that its score improves as more terms are added, even when the model is not actually
improving, which can mislead data scientists.
• To overcome this issue, adjusted R² is used, which is always lower than or equal to R². It adjusts for the
number of predictors and increases only when there is a real improvement.
• We can calculate the adjusted R² as follows:

Ra² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

• n is the number of observations
• k denotes the number of independent variables
• Ra² denotes the adjusted R²
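
A small sketch of the four metrics above in Python; the helper name regression_metrics and the example values are illustrative, and y_hat is assumed to come from a fitted model such as the one in the earlier example:

import numpy as np

def regression_metrics(y, y_hat, k):
    """Compute MAE, MSE, R2 and adjusted R2 for a model with k independent variables."""
    n = len(y)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares (constant-mean baseline)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return mae, mse, r2, adj_r2

y = np.array([52, 58, 65, 70, 78], dtype=float)
y_hat = np.array([51.8, 58.2, 64.6, 71.0, 77.4])
print(regression_metrics(y, y_hat, k=1))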
Multiple Linear Regression
• Multiple regression, also known as multiple linear regression (MLR), is a statistical technique that
uses two or more explanatory variables to predict the outcome of a response variable.
• In other words, it can explain the relationship between multiple independent variables against one
dependent variable.
• Multiple linear regression formula
• Here’s the formula for multiple linear regression, which produces a more specific calculation:
y = β0 + β1 X1 + β2 X2 + ... + βp Xp
• The variables in this equation are:
• y is the predicted or expected value of the dependent variable.
• X1, X2, ..., Xp are the independent or predictor variables.
• β0 is the value of y when all the independent variables are equal to zero.
• β1 , β2, βp are the estimated regression coefficients.
• Each regression coefficient represents the change in y relative to a one-unit change in the
respective independent variable.
Multiple Linear Regression - Implementation
• Obtain the input dataset and identify independent and dependent features.
• Fit the regression line of the form (shown here for two predictors):

ŷ = B0 + B1X1 + B2X2

• Calculate the means X̄1, X̄2, ȳ and the regression sums:

Σx1² = ΣX1² − (ΣX1)²/n,   Σx2² = ΣX2² − (ΣX2)²/n,   Σx1x2 = ΣX1X2 − (ΣX1)(ΣX2)/n,
Σx1y = ΣX1y − (ΣX1)(Σy)/n,   Σx2y = ΣX2y − (ΣX2)(Σy)/n

• Compute the coefficients from the regression sums:

B1 = (Σx2² · Σx1y − Σx1x2 · Σx2y) / (Σx1² · Σx2² − (Σx1x2)²)
B2 = (Σx1² · Σx2y − Σx1x2 · Σx1y) / (Σx1² · Σx2² − (Σx1x2)²)
B0 = ȳ − B1X̄1 − B2X̄2
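
A sketch of multiple linear regression with two illustrative predictors, solving the least-squares problem directly with NumPy rather than with the hand-calculation sums above; the data are hypothetical:

import numpy as np

# Hypothetical data: two predictors X1, X2 and one response y
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([6.1, 7.9, 12.2, 13.8, 18.0])

# Design matrix with a column of ones for the intercept B0
X = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the least-squares problem X @ beta ~= y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
B0, B1, B2 = beta
print(f"y_hat = {B0:.3f} + {B1:.3f}*X1 + {B2:.3f}*X2")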
Gradient Descent
• The basic principle of gradient descent is to choose the step size (also called the learning rate)
appropriately so that we can get close to the exact solution.
• Gradient descent stops when the step size becomes very close to zero, or when the maximum number of steps we
want to perform has been reached.
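
A minimal sketch of gradient descent applied to the MSE cost of simple linear regression, illustrating the role of the learning rate and the two stopping conditions described above; the data, learning rate and tolerance are illustrative:

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 58, 65, 70, 78], dtype=float)

B0, B1 = 0.0, 0.0      # initial parameters
lr = 0.01              # learning rate (step size)
max_steps = 10000      # maximum number of steps
tol = 1e-8             # stop when the update becomes very close to zero

for step in range(max_steps):
    y_hat = B0 + B1 * X
    # Gradients of the MSE cost with respect to B0 and B1
    grad_B0 = -2 * np.mean(y - y_hat)
    grad_B1 = -2 * np.mean((y - y_hat) * X)
    step_B0, step_B1 = lr * grad_B0, lr * grad_B1
    B0, B1 = B0 - step_B0, B1 - step_B1
    if max(abs(step_B0), abs(step_B1)) < tol:   # step size close to zero
        break

print(f"B0 = {B0:.3f}, B1 = {B1:.3f} after {step + 1} steps")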
Classification

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Classification
• Classification is the process of predicting the class of given data points. Classes are sometimes
called targets, labels or categories.
• Classification predictive modeling is the task of approximating a mapping function (f) from input
variables (X) to discrete output variables (y).
• There are two types of learners in classification: lazy learners and eager learners.
• Lazy learners store the training data and wait until testing data appears. When it does,
classification is conducted based on the most related stored training data. Compared to eager
learners, lazy learners spend less training time but more time in predicting.
• Eager learners construct a classification model based on the given training data before receiving
data for classification. The model must commit to a single hypothesis that covers the entire
instance space. Because of this, eager learners take a long time for training and less time for
predicting.
Decision Trees – ID3, CART

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Decision Trees
• A decision tree is a non-parametric supervised learning algorithm for classification and regression
tasks.
• It breaks down a dataset into smaller and smaller subsets while at the same time, an associated
decision tree is incrementally developed.
• It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf
nodes.
• The final result is a tree with decision nodes and leaf nodes.
• Decision Trees are the foundation for many classical machine learning algorithms like Random
Forests, Bagging, and Boosting.
• Types of Decision Trees:
• CART (Classification and Regression Trees) → uses the Gini Index (classification) as its metric.
• ID3 (Iterative Dichotomiser 3) → uses the Entropy function and Information Gain as its metrics.
Decision Trees
• Decision Tree Terminologies:
• Root Node
• Decision Nodes
• Leaf Nodes
• Sub-Tree
• Pruning
• Parent and Child node
Decision Tree – ID3
• ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step.
• The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches with no backtracking.
• A greedy algorithm, as the name suggests, always makes the choice that seems to be the best at
that moment.
• Steps of ID3 Algorithm:
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into subsets
using the feature for which the Information Gain is maximum.
3. Make a decision node using the feature with the maximum Information gain.
4. If all rows belong to the same class, make the current node as a leaf node with the class as
its label.
5. Repeat for the remaining features until we run out of all features, or the decision tree has all
leaf nodes.
Decision Tree – ID3
• Entropy is a measure of disorder in a dataset.
• A dataset with high entropy is a dataset where the data points are evenly distributed across the
different categories.
• A dataset with low entropy is a dataset where the data points are concentrated in one or a few
categories.

Entropy(S) = − Σ pi log2(pi), where pi is the proportion of data points belonging to class i
• Information Gain calculates the reduction in entropy and measures how well a given feature
separates or classifies the target classes:

IG(S, A) = Entropy(S) − Σv (|Sv| / |S|) · Entropy(Sv)

• The feature with the highest Information Gain is selected as the best one.
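
A small sketch of both quantities in Python; the helper names and the tiny weather-style dataset are made up for illustration:

import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Reduction in entropy obtained by splitting on one feature."""
    total, n, remainder = entropy(labels), len(labels), 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical example: Outlook feature vs. Play label
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print(f"Entropy(Play)      = {entropy(play):.3f}")
print(f"IG(Play | Outlook) = {information_gain(outlook, play):.3f}")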
Decision Tree – ID3
• Pros:
• Builds the tree quickly
• Builds short trees
• Uses the whole dataset to build the decision tree
• Generates multi-branch trees

• Cons:
• Handles only categorical data
• Cannot perform pruning
• Favors attributes with many values
• Does not handle imbalanced or missing data values
Decision Tree – CART
• CART is an alternative decision tree building algorithm.
• It can handle both classification and regression tasks.
• It generates binary trees. This algorithm uses a metric called the Gini index to create decision
points for classification tasks.
• The Gini index is a metric for classification tasks in CART. It is based on the sum of squared probabilities of each
class. We can formulate it as illustrated below.
Gini Index(Class) = 1 − Σ (Pi)² for i = 1 to number of classes
• Gini Index of a feature = weighted sum of the Gini Index of the classes in that feature:

Gini Index(feature) = Σv (|Sv| / |S|) · Gini Index(Sv)
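
The same calculation with the Gini index, as a small sketch; the helper names and data are illustrative (a full CART implementation would additionally restrict itself to binary partitions):

import numpy as np
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 - sum(p_i^2)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_for_feature(feature_values, labels):
    """Weighted sum of the Gini index of the subsets induced by a feature."""
    n, weighted = len(labels), 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        weighted += len(subset) / n * gini(subset)
    return weighted

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print(f"Gini(Play)           = {gini(play):.3f}")
print(f"Gini(Play | Outlook) = {gini_for_feature(outlook, play):.3f}")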
Decision Tree – CART
• Process:
1. Start: Begin with the full dataset.
2. Split: Find the best feature and split point to divide the data into two subsets.
3. Create Node: Make a node in the tree based on the chosen split, i.e., the split with the lowest Gini index.
4. Partition Data: Split the dataset into two subsets based on the chosen split.
• There are 2ⁿ − 2 possible ways to form two (ordered, non-empty) partitions of the data,
• where n is the number of categories in the attribute.
5. Recursive Splitting: Repeat steps 2-4 for each subset until stopping criteria are met.
6. Stopping Criteria: Stop growing the tree when certain conditions are met, e.g., when all instances belong
to the same class.
7. Pruning (optional): Trim the tree to avoid overfitting.
8. Output: Use the resulting tree for making predictions on new data.
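
As a usage sketch (assuming scikit-learn is available), a CART-style tree can be grown with DecisionTreeClassifier and the Gini criterion; the Iris dataset and the max_depth pre-pruning value are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gini-based binary tree; max_depth acts as a simple pre-pruning rule
tree = DecisionTreeClassifier(criterion="gini", max_depth=3)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))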
Decision Tree – CART
• Pros:
• Handles both categorical and numerical values
• Handles Outliers
• Generates only binary trees

• Cons:
• Produces unstable decision trees
• Has a preference towards multivalued attributes
Naïve Bayes Algorithm

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Naïve Bayes Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for
solving classification problems.
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in
building fast machine learning models that can make quick predictions.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.


P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Naïve Bayes Algorithm
• Process:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Use Bayes theorem to calculate the posterior probability for a new instance.

4. Normalize the obtained Naïve Bayes probability values.


5. The value with highest normalized probability will be the result of classifier.
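
A small sketch of this process for a single categorical feature; the training data and the new instance are made up for illustration:

from collections import Counter

# Hypothetical training data: (Weather, Play)
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"),
        ("Sunny", "Yes"), ("Rain", "Yes")]

weather = "Sunny"                        # new instance to classify
classes = {label for _, label in data}
n = len(data)

posteriors = {}
for c in classes:
    rows_c = [w for w, label in data if label == c]
    prior = len(rows_c) / n                            # P(class)
    likelihood = rows_c.count(weather) / len(rows_c)   # P(weather | class)
    posteriors[c] = prior * likelihood                 # proportional to P(class | weather)

# Normalize so the values sum to 1, then pick the class with the highest probability
total = sum(posteriors.values())
posteriors = {c: p / total for c, p in posteriors.items()}
print(posteriors, "->", max(posteriors, key=posteriors.get))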
Naïve Bayes Algorithm
• Pros:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary as well as Multi-class Classifications.
• Efficient with Large Datasets

• Cons:
• Assumption of Independence
• Limited Expressiveness
• Requires Sufficient Training Data
• Cannot capture the correlation among features
KNN Algorithm

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
KNN Algorithm
• The abbreviation KNN stands for “K-Nearest Neighbor”. It is a supervised machine learning algorithm.
• The algorithm can be used to solve both classification and regression problem statements.
• It works as a voting system: the majority class label among a new data point's 'k' (where k is an integer) nearest
neighbours in the feature space determines its class label.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
• At the training phase the KNN algorithm just stores the dataset, and when it gets new data, it classifies that
data into the category that is most similar to the new data.
KNN Algorithm - Process
1. Select the number K of the neighbors
2. Choose an appropriate distance measure and compute the distances from the new data point to the training points.
3. Take the K nearest neighbors as per the calculated distance.
4. Among these k neighbors, count the number of the data points in each category.
5. Assign the new data point to the category for which the number of neighbours is maximum.
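
A minimal sketch of this voting procedure using Euclidean distance; the helper name knn_predict, the points and the value of k are illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))   # Euclidean distances
    nearest = np.argsort(distances)[:k]                           # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 5]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.0, 1.5]), k=3))   # expected: A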
Types of Distance Metrics
a. Euclidean distance: the square root of the sum of the squared differences between two points.

d(x, y) = √( Σ (xi − yi)² )

b. Manhattan distance: the sum of the absolute values of the differences between two points.

d(x, y) = Σ |xi − yi|
Types of Distance Metrics

c. Minkowski distance: a generalised distance measure between two points. Based on the formula below, it reduces
to the Manhattan distance (when p = 1) and the Euclidean distance (when p = 2).

d(x, y) = ( Σ |xi − yi|^p )^(1/p)

d. Hamming Distance: used for categorical variables. This metric counts the positions at which two categorical
vectors differ (i.e., whether the values are the same or not).
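
The four metrics as small NumPy functions (a sketch; p is the Minkowski order parameter mentioned above):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    return np.sum(a != b)   # number of positions where the categorical values differ

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=3))
print(hamming(np.array(["red", "small"]), np.array(["red", "large"])))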
K-NN Algorithm
• Pros:
• k-NN is easy to understand and implement.
• No Training Phase is needed.
• Can handle categorical and numerical data.
• It has versatile distance measures.
• Incremental learning

• Cons:
• Slow and inefficient for large datasets, especially as the size of the training set grows.
• Choosing right value of ‘k’
• Imbalanced Data
Logistic Regression

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Logistic Regression
• It is used for predicting the categorical dependent variable using a given set of independent variables.
• Instead of giving a categorical or discrete value, it gives the probabilistic values that lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
Logistic Regression
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms
a curve like the "S" form.
• The S-form curve is called the Sigmoid function or the logistic function:

σ(z) = 1 / (1 + e^(−z))

Assumptions:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.

Cost Function:
• The cost function for logistic regression is called log loss; it is derived from the maximum likelihood
estimation method:

J = −(1/N) Σ [ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]
Logistic Regression
Process using Gradient Descent:
1. Initialize parameters - Intercept and coefficient(s)
2. Substitute the values into the sigmoid function and obtain the predictions:

ŷ = σ(B0 + B1X)

3. Compute the cost function (log loss):

J = −(1/N) Σ [ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]

4. Compute the gradients:

∂J/∂B0 = (1/N) Σ (ŷ − y),   ∂J/∂B1 = (1/N) Σ (ŷ − y)·X

5. Update the intercept and coefficients using the partial derivatives:

B0 = B0 − α·∂J/∂B0,   B1 = B1 − α·∂J/∂B1

6. Repeat steps 2 through 5 until optimal values of the parameters are obtained.
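
A minimal sketch of this loop for a single feature, using the sigmoid and log-loss definitions from the previous slides; the data, learning rate and number of iterations are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: one feature, binary labels
X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   1,   1  ])

b0, b1 = 0.0, 0.0          # step 1: initialize intercept and coefficient
lr = 0.1

for _ in range(5000):
    y_hat = sigmoid(b0 + b1 * X)                                       # step 2: predictions
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # step 3: log loss
    grad_b0 = np.mean(y_hat - y)                                       # step 4: gradients
    grad_b1 = np.mean((y_hat - y) * X)
    b0 -= lr * grad_b0                                                 # step 5: update
    b1 -= lr * grad_b1

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, final loss = {loss:.4f}")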
Types of Logistic Regression
• Binomial: the dependent variable has only two possible categories (e.g., 0 or 1, Spam or Not Spam).
• Multinomial: the dependent variable has three or more unordered categories.
• Ordinal: the dependent variable has three or more ordered categories.
Multinomial Logistic Regression

Assumptions:
• The Dependent variable should be either nominal or ordinal variable.
• Set of one or more Independent variables can be continuous, ordinal or nominal.
• The observations must be independent, and the categories of the dependent variable must be mutually exclusive and exhaustive.
• No Multicollinearity between Independent variables.
• There should be no Outliers in the data points.
Multinomial Logistic Regression
Process using Gradient Descent:
1. Initialize parameters - Intercept and coefficient(s)
2. Compute Logits: Calculate the logits for all samples in the batch for each class using the current model parameters (weights
and biases).

z_k = W_k · X + b_k   (one logit per class k)

3. Compute Softmax Probabilities: Apply the softmax function to the logits to obtain the predicted probabilities for each class
for all samples in the batch.

P(class k) = e^(z_k) / Σ_j e^(z_j)

4. Compute Loss: Calculate the cross-entropy loss between the predicted probabilities and the true class labels for all samples in
the batch.

Loss = −(1/N) Σ_samples Σ_k y_k · log(P(class k))

5. Compute Gradients: Compute the gradients of the loss function with respect to the model parameters (weights and biases)
using the entire batch of samples. This involves taking the derivative of the loss function with respect to each parameter.

∂Loss/∂W_k = (1/N) Σ_samples (P(class k) − y_k) · X,   ∂Loss/∂b_k = (1/N) Σ_samples (P(class k) − y_k)

6. Update Parameters: Update the model parameters (weights and biases) using the gradients and the chosen optimization
algorithm (e.g., gradient descent, SGD, Adam).
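
A small sketch of the batch update above: logits, softmax probabilities, cross-entropy loss and one gradient step per iteration; the shapes, data and learning rate are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 6 samples, 2 features, 3 classes
X = rng.normal(size=(6, 2))
y = np.array([0, 1, 2, 0, 1, 2])       # true class labels
Y = np.eye(3)[y]                       # one-hot encoding

W = np.zeros((2, 3))                   # weights (one column per class)
b = np.zeros(3)                        # biases
lr = 0.5

for _ in range(200):
    logits = X @ W + b                                        # step 2: logits
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    probs = exp / exp.sum(axis=1, keepdims=True)              # step 3: probabilities
    loss = -np.mean(np.sum(Y * np.log(probs), axis=1))        # step 4: cross-entropy
    grad_W = X.T @ (probs - Y) / len(X)                       # step 5: gradients
    grad_b = np.mean(probs - Y, axis=0)
    W -= lr * grad_W                                          # step 6: update
    b -= lr * grad_b

print(f"final loss = {loss:.4f}")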
Support Vector Machine
• Step 1: SVM algorithm predicts the classes. One of the classes is identified as 1 while the other is identified as -1.
• Step 2: SVM classifier uses a loss function known as the hinge loss function to find the maximum margin.
• Step 3: There is a trade-off between maximizing margin and the loss generated if the margin is maximized to a very
large extent. A regularization parameter is used to handle this.
• Step 4: weights are optimized by calculating the gradients.
• Step 5: When there is no classification error, the gradients are computed from the regularization term alone; when a
misclassification happens, the loss function term also contributes to the gradients.
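
A minimal sketch of the update rule in steps 4-5, using the hinge loss with an L2 regularization term; labels are coded as +1/-1 and the data, learning rate and regularization parameter are illustrative:

import numpy as np

# Hypothetical linearly separable data; labels are +1 / -1
X = np.array([[2, 3], [3, 3], [1, 1], [6, 6], [7, 8], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

w = np.zeros(2)
b = 0.0
lr, lam = 0.01, 0.01        # learning rate and regularization parameter

for _ in range(1000):
    for xi, yi in zip(X, y):
        margin = yi * (np.dot(w, xi) + b)
        if margin >= 1:
            # correctly classified outside the margin: only the regularization term
            w -= lr * (2 * lam * w)
        else:
            # misclassified or inside the margin: the hinge loss term also contributes
            w -= lr * (2 * lam * w - yi * xi)
            b -= lr * (-yi)

print("w =", np.round(w, 3), "b =", round(b, 3))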
Support Vector Machine

Dr. Y. Krishna Bhargavi


Associate Professor
Department of CSE
GRIET
Support Vector Machine
• Support Vector Machines use the concept of margins to make predictions.
• The goal of the SVM algorithm is to create a hyperplane in an N-dimensional space that divides the data
points belonging to different classes.
• The hyperplane that provides the maximum margin between the two classes is chosen.
• These margins are calculated using data points known as Support Vectors.
• Support Vectors are those data points that are nearest to the hyperplane and help in orienting it.
Support Vector Machine
Types of Margins:
Hard Margin: Hard Margin refers to that kind of decision boundary that makes sure that all the data points
are classified correctly.
While this leads to the SVM classifier not causing any error, it can also cause the margins to
shrink thus making the whole purpose of running an SVM algorithm futile.
Soft Margin: A regularization parameter is also added to the loss function in the SVM classification
algorithm.
This combination of the loss function with the regularization parameter allows the user to
maximize the margins at the cost of misclassification.
However, this misclassification needs to be kept in check, which introduces another
hyperparameter that needs to be tuned.
Support Vector Machine
• The data points or vectors that are closest to the hyperplane and that affect its position are termed Support
Vectors. Since these vectors support the hyperplane, they are called support vectors.
Types of SVM:
• Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be classified into
two classes by using a single straight line, then such data is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset cannot
be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called
a Non-linear SVM classifier.
Support Vector Machine
Linear SVM
• To compute the decision boundary, the Lagrangian function is used to obtain the w and b values:

L(w, b, α) = ½ ||w||² − Σ αi [ yi (w · xi + b) − 1 ]

Steps:
• Compute the partial derivatives of L with respect to w and b and set them to zero:

∂L/∂w = 0  ⇒  w = Σ αi yi xi,      ∂L/∂b = 0  ⇒  Σ αi yi = 0

• Solve the dual maximization problem:

max over α of  Σ αi − ½ Σi Σj αi αj yi yj (xi · xj),   subject to αi ≥ 0 and Σ αi yi = 0

• Compute the partial derivatives of the dual problem Lagrangian to obtain the αi values, and from them w.
• Compute the margin:

margin = 2 / ||w||

• Compute the bias using any of the support vectors:

b = yk − w · xk   for a support vector (xk, yk)
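
In practice the dual problem is solved numerically by a library optimizer; a short usage sketch with scikit-learn's SVC (assuming scikit-learn is available), which exposes the learned w, b and the support vectors. The data and the large C value (to approximate a hard margin) are illustrative:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [3, 3], [1, 1], [6, 6], [7, 8], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin width =", 2 / np.linalg.norm(w))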


Non Linear SVM
• Nonlinear SVM (Support Vector Machine) is necessary when the data cannot be effectively separated by a
linear decision boundary in the original feature space.
• Nonlinear SVM addresses this limitation by utilizing kernel functions to map the data into a
higher-dimensional space where linear separation becomes possible.
• Kernel Function transforms the training set of data so that a non-linear decision surface is able to transform
to a linear equation in a higher number of dimension spaces.
Support Vector Machine
Non Linear SVM

• ϕ(X) is the transformed/mapped feature space.


• Issues in feature transformation:
• It is difficult to specify the explicit form of ϕ(X).
• If the mapped feature space is a quadratic or higher-degree polynomial space, then computing the dot
product in that space is time consuming and computationally expensive.
• To solve this, the Kernel trick is used. Instead of computing ϕ(Xi) · ϕ(Xj) explicitly, the kernel function gives the
value of the similarity directly.
• For example, if the kernel is to be used as a quadratic polynomial of degree 2, then

K(Xi, Xj) = (Xi · Xj)²

• Then the feature transformation with the kernel, for a 2-dimensional input X = (x1, x2), results in:

ϕ(X) = (x1², √2·x1·x2, x2²),  so that  ϕ(Xi) · ϕ(Xj) = (Xi · Xj)²
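
A quick numeric check of the kernel trick for the degree-2 example above: evaluating K(Xi, Xj) = (Xi · Xj)² directly gives the same value as explicitly mapping both points with ϕ and taking the dot product in the higher-dimensional space (the two points are made up):

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def quad_kernel(x, z):
    """Degree-2 polynomial kernel (x . z)^2, computed without any explicit mapping."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(quad_kernel(x, z))         # 121.0
print(np.dot(phi(x), phi(z)))    # 121.0 as well: same similarity, no explicit mapping needed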


Non Linear SVM – Types of Kernels

Linear Kernel: K(x, y) = x · y

Polynomial Kernel: K(x, y) = (x · y + c)^d

Gaussian Kernel: K(x, y) = exp(−||x − y||² / (2σ²))

Exponential Kernel: K(x, y) = exp(−||x − y|| / (2σ²))

Laplacian Kernel: K(x, y) = exp(−||x − y|| / σ)

Hyperbolic (Sigmoid) Kernel: K(x, y) = tanh(κ·(x · y) + c)
Support Vector Machine
Advantages of SVM:
• Effective in high dimensional cases
• It is memory efficient, as it uses a subset of the training points in the decision function, called support vectors
• Different kernel functions can be specified for the decision function, and it is possible to specify custom
kernels
THANK YOU
