Supervised Learning Algorithms
By: Dr Sonali Vyas
UPES
Introduction
• Supervised learning models are trained using a labeled dataset. Once
training is done, the model is tested on sample test data to check
whether it predicts the correct output.
• Supervised learning is the process of providing input data as well as
the correct output data to a machine learning model. The aim of a
supervised learning algorithm is to find a mapping function from the
input variable (x) to the output variable (y).
• In the real world, supervised learning is used for risk assessment,
image classification, fraud detection, spam filtering, etc.
Regression
Regression algorithms are used when there is a relationship between the
input variable and the output variable. They are used for the prediction of
continuous variables, as in weather forecasting, market trend analysis,
etc.
Regression Analysis in Machine learning
• Regression analysis is a statistical method to model the relationship
between a dependent (target) variable and one or more independent
(predictor) variables.
• It predicts continuous/real values such as temperature, age, salary, price,
etc.
• Regression fits a line or curve through the datapoints on the
target-predictor graph in such a way that the vertical distance between the
datapoints and the regression line is minimized.
• Example:
• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:
• Dependent Variable: The main factor in Regression analysis which we want to predict
or understand is called the dependent variable. It is also called target variable.
• Independent Variable: The factors which affect the dependent variable, or which are
used to predict its values, are called independent variables, also known as
predictors.
• Outliers: An outlier is an observation with either a very low or a very high value
in comparison to the other observed values. An outlier may hamper the results, so it
should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each
other, the condition is called multicollinearity. It should not be present in the
dataset, because it creates problems when ranking the most influential
variables.
• Underfitting and Overfitting: If our algorithm works well on the training dataset but
not on the test dataset, the problem is called overfitting. And if our algorithm
does not perform well even on the training dataset, the problem is called
underfitting. (A sketch contrasting the two follows below.)
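As a rough illustration only, the following Python sketch (assuming scikit-learn is installed; the noisy quadratic data are synthetic) fits polynomials of three different degrees and compares train and test scores. The low-degree model underfits, while the very high-degree model overfits:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic noisy quadratic data, invented for illustration
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = 1 + 2 * X.ravel() ** 2 + rng.normal(0, 1, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:2d}: train R2 = {model.score(X_tr, y_tr):.2f}, "
          f"test R2 = {model.score(X_te, y_te):.2f}")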
Linear Regression
• Linear regression is a statistical method that is used for predictive analysis.
• Linear regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
• It assumes a linear relationship between a dependent variable (Y-axis) and one or
more independent variables (X-axis). For a single predictor, the relationship is
represented by the equation: Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
USE CASE:
• In a case study evaluating student performance, analysts used simple
linear regression to examine the relationship between study hours and
exam scores. By collecting data on the number of hours students
studied and their corresponding exam results, the analysts developed a
model that revealed a clear correlation: for each additional hour spent
studying, students' exam scores increased by an average of 5 points. This
case highlights the utility of simple linear regression in understanding and
improving academic performance.
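A minimal sketch of this use case, assuming scikit-learn is available; the hours and scores below are invented for illustration and are not data from the case study:

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])  # study hours (feature)
scores = np.array([52, 57, 61, 68, 72, 77])       # exam scores (target), made up

model = LinearRegression()
model.fit(hours, scores)

print(f"intercept (β0): {model.intercept_:.2f}")
print(f"slope (β1): {model.coef_[0]:.2f}")        # points gained per extra study hour
print(f"predicted score for 7 hours: {model.predict([[7]])[0]:.1f}")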
• Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression. The goal of the algorithm is to find the best-fit equation that
can predict the values based on the independent variables.
The equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
• Y is the dependent variable
• X1, X2, …, Xn are the independent variables
• β0 is the intercept
• β1, β2, …, βn are the slopes
USE CASE
• Agricultural Yield Prediction: Farmers can use MLR to estimate
crop yields based on several variables like rainfall, temperature, soil
quality, and fertilizer usage. This information helps in planning
agricultural practices for optimal productivity.
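A minimal multiple-linear-regression sketch for the crop-yield example, again using scikit-learn; the feature values and yields are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: rainfall (mm), temperature (°C), fertilizer (kg/ha) — all made up
X = np.array([
    [450, 24, 80],
    [520, 26, 90],
    [390, 22, 70],
    [610, 27, 110],
    [480, 25, 85],
])
y = np.array([3.1, 3.6, 2.7, 4.2, 3.3])  # yield in tonnes/ha, made up

model = LinearRegression().fit(X, y)
print("intercept (β0):", model.intercept_)
print("slopes (β1..β3):", model.coef_)   # one coefficient per feature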
• When working with linear regression, our main goal is to find the best-fit
line, which means the error between the predicted values and the actual values
should be minimized. The best-fit line will have the least error.
• Different values for the weights or coefficients of the line (a0, a1)
give different regression lines, so we need to calculate the best
values for a0 and a1 to find the best-fit line. To do this, we use a
cost function.
• For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
• For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))², summed over i = 1 to N

Where,
• N = Total number of observations
• Yi = Actual value
• (a1xi + a0) = Predicted value
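A small sketch computing this MSE cost for candidate coefficients a0 and a1; the x and y values are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])  # actual values, made up

def mse(a0, a1, x, y):
    predicted = a1 * x + a0          # predicted value a1*xi + a0 for each xi
    return np.mean((y - predicted) ** 2)

print(mse(1.0, 2.0, x, y))  # cost of the candidate line y = 1 + 2x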
Residuals: The distance between an actual value and the corresponding predicted value is
called the residual. If the observed points are far from the regression line, the residuals
will be high, and so the cost function will be high. If the scatter points are close to the
regression line, the residuals will be small, and hence the cost function will be small.
• Gradient Descent: A linear regression model can be trained using the
optimization algorithm Gradient Descent by iteratively modifying the model's
parameters to reduce the mean squared error (MSE) of the model on a training
dataset. The idea is to start with random values for a0 and a1 and then iteratively
update them until the cost reaches its minimum.
Finding the coefficients of a linear equation that best fits the training data is the
objective of linear regression. The coefficients are updated by moving in the direction of
the negative gradient of the Mean Squared Error with respect to the coefficients. If α is
the learning rate, the updates for the intercept and the coefficient of x are:

a0 = a0 − α · ∂MSE/∂a0
a1 = a1 − α · ∂MSE/∂a1
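A minimal gradient-descent sketch implementing the update rules above for simple linear regression; the data are made up, and the learning rate and iteration count are arbitrary choices:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])  # made-up training data

a0, a1 = 0.0, 0.0   # start from arbitrary (here zero) coefficients
alpha = 0.01        # learning rate α
for _ in range(5000):
    error = (a1 * x + a0) - y
    grad_a0 = 2 * np.mean(error)       # ∂MSE/∂a0
    grad_a1 = 2 * np.mean(error * x)   # ∂MSE/∂a1
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

print(f"intercept: {a0:.3f}, slope: {a1:.3f}")  # approaches the least-squares fit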
R-squared Method
• R-squared is a statistical method that determines the goodness of fit.
• It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
• A high value of R-squared indicates a small difference between the
predicted values and the actual values, and hence represents a good model.
• It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
• It can be calculated from the below formula:

R² = 1 − (Sum of squared residuals / Total sum of squares)
   = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²

where Ŷi is the predicted value and Ȳ is the mean of the actual values.
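A short sketch computing R-squared directly from this formula; the actual and predicted values are illustrative:

import numpy as np

y_actual = np.array([3.0, 5.1, 6.9, 9.2, 10.8])
y_pred = np.array([3.1, 5.0, 7.0, 9.0, 10.9])  # made-up model outputs

ss_res = np.sum((y_actual - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")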