Linear Regression


Linear Regression

Regression:
 It is one of the most important statistical and machine learning tools.

 It is a parametric technique that allows us to make decisions based on data.

 It allows us to make predictions from data by learning the relationship between the input and output variables.

 The output variable, which depends on the input variables, is a continuous-valued real number.

 Regression helps us to understand how the value of the output variable changes with respect to changes in the input variables.

 Regression techniques are used for predicting prices, economic trends, and other continuous quantities.
Simple Linear Regression:
 It is the simplest form of linear regression, used when there is a single input variable for the output (target) variable.
 The input variable helps in predicting the value of the output variable.
 The input variable is referred to as X.
 The output or target variable is the variable that we want to predict, referred to as y.
Simple Linear Regression:
 ß0, called the intercept, shows the point where the estimated regression line crosses the y-axis.

 ß1 determines the slope of the estimated regression line.

 The random error describes the random component of the linear relationship between the independent and dependent variables.

 The true regression model is usually never known.

 The value of the random error term corresponding to the observed data points remains unknown.

 A regression model can be estimated by calculating the parameters of the model for an observed dataset.
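Putting these terms together (a reconstruction in the slides' notation, since the slide's equation is not reproduced in this text), the simple linear regression model is:

y = ß0 + ß1·X + e

where e is the random error term.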
Simple Linear Regression:
 The main aim of regression is to estimate the parameters ß0 and ß1 from the sample.

 Once we find the optimum values for these two parameters, a line of best fit can be used to find the values of Y given the values of X.


 We fit a line to capture the relationship between the input and output variables.

 The line is then used to predict the output for unseen inputs.


Simple Linear Regression:
 The ß0 and ß1 values are estimated using the Ordinary Least Squares (OLS) method.

 The main goal is to bring the distance from the data points (black dots) to the fitted line (red line) as close to zero as possible.
 This is done by minimizing the squared distances between the actual and predicted outcomes.

 The difference between the actual and predicted value is called the residual (e).

 It can be negative or positive depending on whether the model overpredicted or underpredicted the outcome.
 To calculate the net error, simply adding all the residuals can lead to cancellation of positive and negative terms and an underestimate of the net effect.


Simple Linear Regression:
 To avoid this, we take the sum of the squared error terms, which is called the residual sum of squares (RSS).
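With n observed data points and e(i) = y(i) − ŷ(i) denoting the ith residual, the residual sum of squares (a standard definition consistent with the slide, not reproduced from it) is:

RSS = e(1)² + e(2)² + … + e(n)² = Σ ( y(i) − ŷ(i) )²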
Simple Linear Regression:
 The ordinary least squares (OLS) method minimizes the residual sum of squares (RSS).

 Its objective is to fit a regression line that minimizes the distance from the observed values to the predicted values on the regression line.


Different Kinds Of Relationship:
 Positive Relationship: When the regression line between two variables has an upward slope, the variables are said to be positively correlated.
 If we increase the value of x (the independent variable), then we will see an increase in the dependent variable.

 Negative Relationship: When the regression line between two variables has a downward slope, the variables are said to be in a negative relationship.
 If we increase the value of the independent variable (x), we will see a decrease in the dependent variable (y).
Different Kinds Of Relationship:
 No Relationship: If the best-fit line is flat, then we can say that there is no relationship between the variables.
 The dependent variable won’t change by increasing or decreasing the independent variable.
Linear Regression Relationship:
 Covariance: This parameter tells us the direction of the relationship between x and y.

 It does not tell us how strong that positive or negative relationship is.

 If the covariance value is negative, then as the independent variable increases, the dependent variable decreases.

 Correlation: It is a statistical measure that tells us both the direction and the strength of the relationship.
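As a quick illustration (not part of the original slides; the data here is hypothetical), covariance and correlation can be computed with NumPy:

import numpy as np

# Hypothetical data: x increases while y tends to decrease
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.5, 7.0, 6.5, 4.0])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance: its sign gives the direction
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation: direction and strength, in [-1, 1]

print(cov_xy, corr_xy)             # negative covariance, correlation close to -1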
Applications:
 Predicting advertising expenses.
 Medical diagnosis.
 Agricultural research.
Advantages and Disadvantages:
Advantages:
It performs well for linearly separable data.
It is easy to implement and interpret, and training is fast.

Disadvantages:
It assumes a linear relationship between the independent and dependent variables.
It is prone to noise and overfitting.
Regression:
 A regression problem is one where the output variable is a continuous value, such as "salary" or "weight".
 Linear regression is a statistical method for finding the relationship between the independent and dependent variables.
Regression:
 Regression is a technique where correct (labelled) data is given and we need to find the relationship within the data.

 In a regression problem, the model always predicts a real or continuous value as the output.

 This example predicts the salary (dependent variable y) of a person based on the given independent variable (x).


Regression:
First, we need to identify the independent variable (the values used to predict the dependent variable) and the dependent variable (the value to be predicted) from the dataset.
We need to fit those variables in the linear regression cost function.
The cost function is used to measure the performance of the machine learning model for the given data.


Regression:
A regression plot is plotted, and when a new value (year) comes in, the salary of the person can be predicted with the help of the regression model.
Only one independent variable is taken, so this is also called linear regression with one variable, or univariate linear regression.
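As an illustrative sketch (not from the original slides; the data here is hypothetical), a univariate regression of salary on years of experience can be fitted with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience (x) and salary (y)
x = np.array([[1.0], [2.0], [3.0], [5.0], [7.0], [10.0]])
y = np.array([40000, 48000, 55000, 72000, 90000, 120000])

model = LinearRegression()
model.fit(x, y)                       # learn theta0 (intercept_) and theta1 (coef_)

print(model.intercept_, model.coef_)  # fitted parameters
print(model.predict([[6.0]]))         # predicted salary for 6 years of experience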


Cost Function Of Linear Regression:
 The hypothesis for this simple linear regression is a linear function of the input; the cost function measures how well it fits the data.

 'x' is used to denote the input variable (years of experience).

 Inputs are also called input features.

 'y' is used to denote the "output" or target variable.

 y is nothing but the target variable (salary).

 When the target variable we are trying to predict is continuous, the learning problem is called a regression problem.
Cost Function Of Linear Regression:
 Theta0 and theta1 are the parameters of the model.

 x is the independent variable.

 The theta0 and theta1 values must be chosen such that h(x) is close to y.

 The linear regression algorithm aims to solve a minimization problem.

 The difference between h(x) and y should be small.

 We use the notation (x(i), y(i)) to denote the ith training example.

 We sum over the training set, i = 1 to m (training examples), the squared difference between the prediction of the hypothesis and the actual output.


Cost Function Of Linear Regression:
The accuracy of the hypothesis function can be measured by using the cost function.
It takes an average of the differences between the results of the hypothesis on the x inputs and the actual outputs y.

Cost Function Of Linear Regression:
 To break it apart, it is ½ · x̄, where x̄ is the mean of the squares of h_theta(x(i)) − y(i).

 h_theta(x(i)) − y(i) is nothing but the difference between the predicted value and the actual value.

 This function is called the "Mean Squared Error".

 This is the cost function.


Cost Function Of Linear Regression:
First, assign some random values to theta0 and theta1 and then find h(x).
Cost Function Of Linear Regression:
 In the above plot, the curved graph is drawn flat.

 The 3D surface is plotted as a 2D contour plot.

 We have to find the minimum value of J(theta0, theta1), which lies at the smallest oval (the global optimum).

 From the contour plot, some method like the OLS method is used to find the minimum of J(theta0, theta1).
 The corresponding values of theta0 and theta1 are taken for h(x).

 The regression line is then plotted on the data; this is how the cost function is used.
Ordinary Least Squares(OLS):
 We need to find the best-fit line for the dataset.

 In order to find the best-fit line, we use the OLS method:

 y = mx + b

 m – slope.

 x – independent variable.

 b – intercept.

 The OLS method is used to find the best-fit slope and intercept:
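The slide's formulas are not reproduced in this text; in the standard closed form (with x̄ and ȳ denoting the means of x and y), OLS gives:

m = Σ (x(i) − x̄)(y(i) − ȳ) / Σ (x(i) − x̄)²

b = ȳ − m·x̄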


Ordinary Least Squares(OLS):
 So our regression values are m (theta1) = 9449.96232 and b (theta0) = 25792.2002, giving the regression line ŷ = 9449.96232·X + 25792.2002.
Gradient Descent:
 It is an efficient method to find the minimum value of J(theta0, theta1).

 This method is not only used in linear regression but is also employed in other machine learning algorithms.
 First, the process starts with some random values of theta0 and theta1, and the values of theta0 and theta1 are then changed to reduce J(theta0, theta1).

 This step is done repeatedly until we end up at the minimum value.

 If we start at a point, the gradient descent algorithm will take small steps in order to find the local minimum.
 This is an important property of gradient descent.

 If we start at a different point, it may find a different local minimum.


Gradient Descent:
Following is the update equation of the gradient descent algorithm:
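In the same notation (a reconstruction of the standard update rule, since the slide's equation is not reproduced in this text):

repeat until convergence:
    theta_j := theta_j − alpha · ∂/∂theta_j J(theta0, theta1)    (for j = 0 and j = 1, updated simultaneously)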
Gradient Descent:
 j = 0, 1 -> It denotes the feature index number.

 alpha -> Learning rate.

 The term after alpha -> the partial derivative of J with respect to theta_j.

 At each iteration, one should simultaneously update all the parameters theta_0, theta_1, ..., theta_n.

 The parameters must be updated simultaneously in order to get a correct implementation of gradient descent.
Gradient Descent:
• Consider the partial derivative term with respect to theta1.

• The derivative of J(theta1) is nothing but the slope of the cost curve at the point theta1.

• This derivative term is used to find the slope with respect to theta_j (j = 0, 1).


Gradient Descent:
If the learning rate is too small, then gradient descent will take small steps and it will take more time to find the minimum value.
If the learning rate is too large, it will take huge steps; if the value is near the minimum but the learning rate is too high, it may fail to converge or even diverge.
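A minimal sketch of batch gradient descent for univariate linear regression (not from the original slides; the data and learning rate here are hypothetical choices):

import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1])

theta0, theta1 = 0.0, 0.0   # start from arbitrary values
alpha = 0.01                # learning rate
m = len(x)                  # number of training examples

for _ in range(5000):
    h = theta0 + theta1 * x                  # hypothesis h(x)
    grad0 = (1 / m) * np.sum(h - y)          # dJ/dtheta0
    grad1 = (1 / m) * np.sum((h - y) * x)    # dJ/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # should approach the OLS line for this data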
Multiple Linear Regression:
 It uses several explanatory variables in order to predict the outcome of a response variable.

 The main aim of the multiple linear regression model is to model the relationship between the independent variables and the response variable.

 Multiple linear regression is an extension of OLS regression because it involves more than one explanatory variable.

 It is used in econometrics and financial inference.


Multiple Linear Regression:
Why Multiple Linear Regression (MLR)?
 This type of algorithm is useful in situations where the number of variables is small.
 This algorithm is used for finding the correlation between the dependent and independent variables.
Multiple Linear Regression(MLR):
 y = m1·x + m2·z + c

 y is the dependent variable, that is, the variable that needs to be predicted.

 x is the first independent variable. It is the first input.

 m1 is the slope of x. It lets us know the angle of the line with respect to x.

 z is the second independent variable. It is the second input.

 m2 is the slope of z. It lets us know the angle of the line with respect to z.

 c is the intercept. A constant that gives the value of y when x and z are both 0.
Steps Of Multivariate Regression Analysis:
 Feature Selection:

 It is an important step in multivariate regression.

 This step is essential in order to pick important features for model building.

 Normalizing Features:

 The features should be scaled as it maintains general distribution and ratios in the data.

 Loss Function Selection and Hypothesis:

 The loss function measures the error of the predictions.

 The hypothesis is the predicted value computed from the features.


Steps Of Multivariate Regression Analysis:
 Minimize the loss function:

 The loss function should be minimized using a loss minimization algorithm on the dataset.

 Gradient descent is one of the commonly used algorithms for loss minimization.

 Test the Hypothesis function:

 The hypothesis function should be checked on test data to verify the values it predicts.


Cost Function:
 It is nothing but the sum of the squares of the differences between the predicted values and the actual values, divided by twice the length of the dataset.
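A minimal NumPy sketch of this cost function (illustrative only; the function and variable names are hypothetical, not from the slides):

import numpy as np

def cost(y_pred, y_actual):
    m = len(y_actual)                                    # length of the dataset
    return np.sum((y_pred - y_actual) ** 2) / (2 * m)    # sum of squared differences over 2m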


Multiple Linear Regression (MLR):

 It is a form of linear regression used when there are two or more predictor or independent variables.


 It includes some additional predictors.
Multiple Linear Regression (MLR):
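The equation discussed below is not reproduced in this text; in its standard form it reads:

y = ß0 + ß1·x1 + ß2·x2 + … + ßp·xp + e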
 The above equation is an extension of the simple linear regression equation.

 Here, each input has a corresponding slope coefficient (ß).

 ß0 is the intercept constant and is the value of y in the absence of all predictors (when all x terms are zero).
 As the number of features grows, the complexity of the model increases.

 It becomes more difficult to visualize our data.

 As there are more parameters in these models, we should be more careful while working with them.
 If we add more terms, the fit to the data will improve.

 This is dangerous because it can lead to a model that fits the data but doesn't mean anything useful.
Example:
 The advertising dataset consists of the sales of a product in 200 different markets.
 It contains advertising budgets for three different media: TV, radio and newspaper.
 The dataset is used to predict the amount of sales (dependent variable) based on the TV, radio and newspaper advertising budgets (independent variables).
 The formula is:
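(The slide's formula is not reproduced in this text; for this dataset it takes the standard form:)

sales = ß0 + ß1·TV + ß2·radio + ß3·newspaper + e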
Example:
 The ß values are found by minimizing the error function, fitting the best line or hyperplane (depending on the number of input variables).
 Load The Data and Describe the Data:
 Import the required libraries:

 import pandas as pd
 import numpy as np
 import seaborn as sns
 import matplotlib.pyplot as plt
 from sklearn.model_selection import train_test_split
 from sklearn.linear_model import LinearRegression
 from sklearn import metrics
 from sklearn.metrics import r2_score
 import statsmodels.api as sm
Example:
 Load the Dataset:
 df = pd.read_csv("Advertising.csv")

 Understand the Dataset and Describe it:


 df.head()
Example:
 Drop the first column 'Unnamed: 0' since we don't need it:
 df = df.drop(['Unnamed: 0'], axis=1)
 df.info()
Example:
 The dataset contains four columns, 200 records and no missing values.
 Visualize the relationship between the independent variables and the target variable.

 sns.pairplot(df)
Example:
 The relationship between TV and sales is very strong.

 There is some trend between radio and sales, while the relationship between newspaper and sales is almost non-existent.
 It can be verified numerically through a correlation map.

 mask = np.tril(df.corr())

 sns.heatmap(df.corr(), fmt='.1g', annot=True, cmap='cool', mask=mask)


Output:
Example:
 The strongest positive correlation happens between sales and TV.

 The correlation between sales and newspaper is close to 0.

 Select Features and Target Variable:

 Divide the variables into two sets: the dependent (or target) variable "y" and the independent (or feature) variables "X".

 X = df.drop(['sales'], axis=1)

 y = df['sales']
Example:
 Split the Dataset:

 For understanding the model performance, the dataset is divided into a training set and a testing set.

 By splitting the dataset into two separate sets, we can train the model using one set and test the performance of the model on unseen data using the other one.

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

 The dataset is split into 70% train and 30% test.

 The random_state parameter is used for initializing the internal random number generator.

 If the random state is set to 0, we can compare the output over multiple runs of the code using the same parameter.
Example:
 print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

 From the output, we can observe the following:

 2 datasets of 140 records each, one with the 3 independent variables and one with the target variable.

 These will be used for training and producing the linear regression model.

 2 datasets of 60 records each, one with the 3 independent variables and one with the target variable, which will be used for testing the performance of the linear regression model.
Build Model:
 mlr = LinearRegression()

 Train the Model:


 The training data is fitted to the model; this is the training part of the modelling process.
 After it is trained, the model can be used to make predictions.

 mlr.fit(X_train, y_train)

 mlr.intercept_
Example:
 Print the values of the coefficients ß:
 coeff_df = pd.DataFrame(mlr.coef_, X.columns, columns=['Coefficient'])
 coeff_df
Example:
 The sales value can be estimated based on different budget values of "TV", "radio" and "newspaper".
Example:
 For example, if we set a budget value of 50 for TV, 30 for radio and 10 for newspaper, the estimated value of "sales" will be:

 example = [50, 30, 10]

 output = mlr.intercept_ + sum(example*mlr.coef_)

 output
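Equivalently (a usage note, not part of the original slides), the trained model's predict method gives the same estimate:

 output = mlr.predict([[50, 30, 10]])  # same value as the manual calculation above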
Test Model:

 A test dataset is a dataset that is independent of the training dataset.

 This test dataset is unseen data for your model, which gives a better view of its ability to generalize:

 y_pred = mlr.predict(X_test)
Evaluate Performance:
 The quality of the model is estimated by how well the predictions match up against the actual values of the testing dataset:

 print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
 print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
 print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
 print('R Squared Score is:', r2_score(y_test, y_pred))
Advantages and Disadvantages:
 This type of algorithm helps us to find the relationship between the various variables present in the dataset.
 It helps us in understanding the relation between the independent and the dependent variables.
Disadvantages:
 They are a bit complex and require a high level of mathematical calculation.

 They are not easy to interpret.

 The loss and error outputs produced may not be identical.

 They are not suitable for small datasets; they can be applied only to larger datasets.
Limitations:
 Mismeasurement: Factors might not be measured correctly.

 For example, aptitude is difficult to measure and there are well-known problems with IQ tests.

 As a result, regression using IQ might not properly control for aptitude.

 Too limited a focus:

 A regression coefficient provides information only about how small changes in one variable relate to changes in another variable.

 For example, it will show how a small change in education affects earnings, but it will not allow the researcher to generalize about the effect of large changes.


Multiple Linear Regression:
 The simple linear regression function allows us to make predictions about one variable based on the information that is available about another variable.

 The simple linear regression algorithm can only be used when one has two continuous variables – an independent variable and a dependent variable.


 The independent variable is the parameter that is used to calculate the dependent variable.

 A multiple linear regression model can be extended to several explanatory variables.


Multiple Linear Regression:
 There is a linear relationship between the dependent variable and the independent variable.

 The independent variables are not highly correlated with each other.

 Yi observations are selected independently and randomly from the population.

 Residuals should be normally distributed with a mean of 0 and a constant variance of sigma squared.

 The coefficient of determination (R-squared) is a statistical metric used to measure how much of the variation in the outcome can be explained by the variation in the independent variables.
Multiple Linear Regression:
 R^2 itself cannot be used to identify which predictors should be included in the model and which should be excluded.
 The R^2 value can only vary between 0 and 1.

 The value 0 indicates that the outcome cannot be predicted by any of the independent variables.

 The value 1 indicates that the outcome can be predicted without error from the independent variables.
 When we interpret the results of multiple regression, each beta coefficient is interpreted while holding all other variables constant.

 The output from a multiple regression can be displayed horizontally as an equation or vertically in table form.


How to Use Multiple Linear Regression:
 An analyst wants to know how the movement of the market affects the price of ExxonMobil (XOM).
 The linear equation will have the value of the S&P 500 index as the independent variable (predictor) and the price of XOM as the dependent variable.
 There are various factors that affect the outcome of an event.

 The price movement of ExxonMobil does not depend on just the performance of the overall market.
How to Use Multiple Linear Regression?
 There are other predictors, such as the price of oil, interest rates, and the price movement of oil futures, that can affect the price of XOM.

 They also affect the stock prices of other oil companies.

 In order to understand the relationship when two or more variables are present, multiple linear regression is used.
How to Use Multiple Linear Regression?
 Multiple Linear Regression (MLR) is used to establish a mathematical relationship between several random variables.

 This algorithm examines how multiple independent variables are related to one dependent variable.
 Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction of the level of effect they have on the outcome variable.
 The model creates a relationship in the form of a straight line that best approximates all the individual data points.


How to Use Multiple Linear Regression?
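(The MLR equation referred to below is not reproduced in this text; in its standard form it reads:)

Yi = B0 + B1·Xi1 + B2·Xi2 + B3·Xi3 + B4·Xi4 + e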
 When we see the Multiple Linear Regression (MLR) equation above, we can see that:
 Yi = dependent variable – the price of XOM.
 Xi1 = interest rates.
 Xi2 = oil price.
 Xi3 = value of the S&P 500 index.
 Xi4 = price of oil futures.
 B0 = y-intercept at time 0.
 B1 = regression coefficient that measures the unit change in the dependent variable when Xi1 changes – the change in the XOM price when interest rates change.
How to Use Multiple Linear Regression?
 B2 = coefficient value that measures a unit change in the dependent variable when Xi2 changes – the change in the XOM price when oil prices change.

 The least squares estimates B0, B1, B2, …, Bp are usually computed by statistical software.

 Many different variables can be included in a regression model.

 Each independent variable is differentiated with a number – 1, 2, 3, 4, …, p.

 The multiple regression model allows an analyst to predict an outcome based on the information provided on multiple explanatory variables.


How to Use Multiple Linear Regression?
 The model is not perfectly accurate, as each data point can differ slightly from the outcome predicted by the model.
 The residual error, e, is the difference between the actual outcome and the predicted outcome.

 It is included in the model to account for such slight variations.

 If the prices of the other variables are held constant, then the price of XOM will increase by 7.8% if the price of oil in the markets increases by 1%.

 The model also shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates.
How to Use Multiple Linear Regression?
 R^2 indicates that 86.5% of the variation in the stock price of ExxonMobil can be explained by changes in the interest rate, oil price, oil futures and the S&P 500 index.
Difference Between Linear and Multiple Regression:
 The Ordinary Least Squares (OLS) method compares the response of a dependent variable with respect to a change in some explanatory variables.

 A dependent variable is rarely explained by only one variable.

 In that case, an analyst uses multiple regression.

 It attempts to explain a dependent variable using more than one independent variable.

 Multiple regressions can be linear and nonlinear.

 These regression algorithms are based on the assumption that there is a linear relationship between the dependent and the independent variables.
Difference Between Linear and Multiple Regression:
 It is also based on the assumption that there is no correlation between the independent variables.
What makes Multiple Regression Multiple?
 A multiple regression considers the effect of more than one explanatory variable on some outcome of interest.
 It evaluates the relative effect of these independent variables on the dependent variable while holding the other variables in the model constant.


Advantages Of Multiple Regression Over Simple OLS Regression:
 A dependent variable is rarely explained by only one variable.

 Multiple linear regression attempts to explain a dependent variable by more than one independent variable.

 The model assumes that there are no major correlations between the independent variables.

 Multiple regression models are complex.

 They become even more complex when more variables are included in the model or when the size of the data grows.

 To run multiple regression, we need to use specialized software or functions within programs like Excel.
How We Can Make Multiple Regressions To Be Linear:
 The multiple linear regression model calculates the best-fit line.

 It minimizes the variance of each of the variables included as it relates to the dependent variable.
 As it fits a line, it is considered a linear model.

 There are other non-linear regression models that involve multiple variables, such as logistic regression, quadratic regression and probit models.


Application Of Multiple Linear Regression Models In Finance:
Any econometric model that looks at more than one variable is considered a multiple regression model.
Factor models compare two or more factors to analyze the relationships between the variables and the resulting performance.
Limitations:
 Omitted Variables:

 It is necessary to have a good theoretical model to suggest variables that explain the dependent variable.

 Various factors should be considered to explain the dependent variable when dealing with a two-variable regression.

 Reverse Causality:

 Many theoretical models predict bidirectional causality – a dependent variable can cause changes in one or more explanatory variables.

 For instance, higher earnings may enable people to invest more in their education, which in turn raises their earnings.
