Linear Regression

 It is one of the most important tools in statistics and machine learning.

 It is a parametric technique that allows us to make decisions based on data.

 It allows us to make predictions from data by learning the relationship between input and output variables.

 The output variable, which depends on the input variables, is a continuous real-valued number.

 Regression helps us understand how the value of the output variable changes as the input variable changes.

 Regression techniques are used to predict prices, economic trends and other varying quantities.
Simple Linear Regression:
 It is the simplest form of linear regression, used when there is a single input variable (feature) for the output variable (target).
 The input variable, referred to as X, is used to predict the value of the output variable.
 The output or target variable (y) is the variable that we want to predict.
 The model is y = β0 + β1x + ε, where ε is the random error term.

 β0, called the intercept, is the point where the estimated regression line crosses the y-axis.

 β1 is the slope of the estimated regression line.

 The random error term captures the part of the dependent variable that the linear relationship with the independent variable does not explain.

 The true regression model is usually unknown.

 The values of the random error term for the observed data points also remain unknown.

 The regression model can be estimated by calculating its parameters from an observed data set.

 The main aim of regression is to estimate the parameters β0 and β1 from the sample.

 Once we find the optimal values of these two parameters, the line of best fit can be used to find the value of Y for a given value of X.

 In other words, we fit a line to capture the relationship between the input and output variables.

 The line is then used to predict the output for unseen inputs.

 The values of β0 and β1 are estimated using Ordinary Least Squares (OLS).

 The goal is to bring the distances from the data points (black dots) to the fitted line (red line) as close to zero as possible.

 This is done by minimizing the squared distances between the actual and predicted outcomes.

 The difference between the actual and the predicted value is called the residual (e).

 It can be negative or positive, depending on whether the model overpredicted or underpredicted the outcome.
 If we simply add up all the residuals to get the net error, positive and negative terms cancel and the net effect is understated.

 To avoid this, we take the sum of the squares of the error terms, which is called the residual sum of squares (RSS).
 The ordinary least squares (OLS) method minimizes the residual sum of squares (RSS).

 Its objective is to fit the regression line that minimizes the sum of squared distances from the observed values to the predicted values (the regression line).
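The OLS idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper name fit_simple_ols and the five data points are made up for the example and do not come from the slides. It uses the standard closed-form estimates for the slope and intercept, then shows that the plain sum of residuals cancels out while the RSS does not.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0, beta1) that minimise the residual sum of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Made-up illustrative data (an assumption, not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta0, beta1 = fit_simple_ols(x, y)
residuals = y - (beta0 + beta1 * x)   # actual minus predicted
rss = np.sum(residuals ** 2)          # residual sum of squares

print("beta0:", beta0, "beta1:", beta1)
print("sum of residuals:", residuals.sum())  # positive and negative terms cancel
print("RSS:", rss)
```

The sum of residuals comes out at essentially zero even though the fit is not perfect, which is why the squared residuals are used as the error measure.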

Different Kinds Of Relationship:
 Positive Relationship: When the regression line between two variables has an upward slope, the variables move in the same direction and are said to be positively correlated.
 If we increase the value of x (the independent variable), we will see an increase in the dependent variable.

 Negative Relationship: When the regression line between two variables has a downward slope, the variables move in opposite directions and are said to have a negative relationship.
 If we increase the value of the independent variable (x), we will see a decrease in the dependent variable.

 No Relationship: If the best-fit line is flat, we can say that there is no linear relationship between the variables.
 The dependent variable does not change when the independent variable increases or decreases.
Linear Regression Relationship:
 Covariance: This parameter tells us the direction of the relationship between x and y.

 It does not tell us how strong the positive or negative relationship is.

 If the covariance is negative, the dependent variable decreases as the independent variable increases.

 Correlation: It is a statistical measure that tells us both the direction and the strength of the relationship.
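The difference between covariance and correlation can be illustrated with a small NumPy sketch. The data values are made up for the example. Rescaling y changes the covariance but leaves the correlation untouched, which is why only the correlation is read as a measure of strength.

```python
import numpy as np

# Made-up data in which y falls as x rises (an assumption for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.1, 6.2, 3.9, 2.0])

cov_xy = np.cov(x, y)[0, 1]         # sign gives the direction only
corr_xy = np.corrcoef(x, y)[0, 1]   # between -1 and 1: direction and strength

# Express y in different units (multiply by 100) and recompute both.
y_scaled = 100 * y
cov_scaled = np.cov(x, y_scaled)[0, 1]
corr_scaled = np.corrcoef(x, y_scaled)[0, 1]

print("covariance:", cov_xy, "after rescaling:", cov_scaled)
print("correlation:", corr_xy, "after rescaling:", corr_scaled)
```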
Applications Of Linear Regression:
 Predicting advertising expenses.
 Medical diagnosis.
 Agricultural research.
Advantages and Disadvantages:
Advantages: It performs well when the relationship in the data is linear.
It is easy to implement and interpret, and training is fast.

Disadvantages: It assumes a linear relationship between the independent and dependent variables.
It is prone to noise and overfitting.
 A regression problem is one where the output variable is a continuous value, such as “salary”.

 Linear regression is a statistical method for finding the relationship between the independent and dependent variables.
 Regression is a supervised technique: labelled data is given and we need to find the relationship within it.

 In a regression problem, the model always predicts a real or continuous value as the output.

 The example here predicts the salary of a person (dependent variable y) from the given independent variable (x).

First, we need to identify the independent variable (the values used to predict) and the dependent variable (the value to be predicted) in the dataset.
We then fit those variables using the linear regression cost function.

The cost function measures the performance of the machine learning model on the given data.

A regression line is plotted, and when a new value (years of experience) comes in, the salary of the person can be predicted with the help of the regression line.
Because only one independent variable is used, this is also called linear regression with one variable, or univariate linear regression.

Cost Function Of Linear Regression:
 The linear function h(x) = theta0 + theta1·x is the hypothesis for this simple linear regression; the cost function measures how well it fits the data.

 ‘x’ denotes the input variable (years of experience).

 Input variables are also called input features.

 y denotes the “output” or target variable.

 y is the variable being predicted (salary).

 When the target variable we are trying to predict is continuous, the learning problem is called a regression problem.
 Theta0 and theta1 are the parameters of the model.

 x is the independent variable.

 The values of theta0 and theta1 must be chosen so that h(x) is close to y.

 The linear regression algorithm therefore solves a minimization problem.

 The difference between h(x) and y should be small.

 The notation (x(i), y(i)) denotes the ith training example.

 The quantity to minimize is the sum over the training set, i = 1 to m (the number of training examples), of the squared difference between the prediction of the hypothesis h(x(i)) and the actual value y(i).
The accuracy of the hypothesis function can be measured using the cost function.

It takes an average of the squared differences between the results of the hypothesis on the inputs x and the actual outputs y.

 Written out, the cost function is J(theta0, theta1) = (1/2m) · Σ (h(x(i)) − y(i))², summed over i = 1 to m.

 In other words, it is 1/2 times the mean of the squares of h(x(i)) − y(i).

 h(x(i)) − y(i) is the difference between the predicted value and the actual value.

 This function is called the “Mean Squared Error” (here halved, which simplifies the derivative).

 This is the cost function.
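As a minimal sketch, the cost function can be written directly from its definition. The function names hypothesis and cost and the three training points are illustrative assumptions, not part of the slides.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Half the mean of the squared differences between h(x) and y."""
    m = len(x)                                   # number of training examples
    errors = hypothesis(theta0, theta1, x) - y   # predicted minus actual
    return np.sum(errors ** 2) / (2 * m)

# Made-up training set lying exactly on the line y = x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

perfect_fit_cost = cost(0.0, 1.0, x, y)   # parameters that match the data
poor_fit_cost = cost(0.0, 0.5, x, y)      # slope too small

print(perfect_fit_cost, poor_fit_cost)
```

The cost is zero for the parameters that reproduce the data exactly and grows as the line moves away from the points.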

First, assign some random values to theta0 and theta1, and then compute h(x).
 In the above plot, the curved surface is drawn flat as a contour plot.

 That is, the 3D surface is plotted in 2D.

 We have to find the minimum value of J(theta0, theta1), which lies in the smallest oval (the global optimum).

 From the contour plot, a method such as OLS is used to find the minimum of J(theta0, theta1).

 The corresponding values of theta0 and theta1 are used in h(x).

 The regression line with those values is plotted through the data; it is the line that minimizes the cost function.
Ordinary Least Squares(OLS):
 We need to find the best-fit line for the dataset.

 To find the best-fit line, we use the OLS method with the line equation:

 y = mx + b

 m: slope.

 x: independent variable.

 b: intercept.

 The OLS method gives the slope and intercept of the best-fit line: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄.

 For this dataset, the regression values are m (theta1) = 9449.96232 and b (theta0) = 25792.2002, so the regression line is ŷ = 9449.96232x + 25792.2002.
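Once the line is known, a prediction is a single multiplication and addition. The sketch below plugs the slope and intercept quoted above into a small helper; the function name predict_salary and the input of 5 years are illustrative assumptions.

```python
M = 9449.96232    # slope (theta1) quoted in the text
B = 25792.2002    # intercept (theta0) quoted in the text

def predict_salary(years_of_experience):
    """Evaluate the fitted line y = M * x + B."""
    return M * years_of_experience + B

# Hypothetical input: a person with 5 years of experience.
predicted = predict_salary(5)
print(predicted)
```

With zero years of experience the prediction equals the intercept, and each extra year adds the slope to the predicted salary.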
Gradient Descent:
 It is an efficient algorithm for finding the minimum value of J(theta0, theta1).

 This method is used not only in linear regression but also in other machine learning algorithms.

 The process starts with some random values of theta0 and theta1, which are then changed step by step to reduce J(theta0, theta1).

 This step is repeated until we reach the minimum value.

 From a given starting point, the gradient descent algorithm takes small steps towards a local minimum.
 An important property of gradient descent is that, if we start at a different point, it may end up at a different local minimum.

The gradient descent update is:
 theta_j := theta_j − alpha · ∂J(theta0, theta1)/∂theta_j, for j = 0 and j = 1.
 j = 0, 1 is the feature index number.

 alpha is the learning rate.

 The term after alpha is the partial derivative of J with respect to theta_j.

 At each iteration, all the parameters theta_1, theta_2, … should be updated simultaneously.

 The simultaneous update is required for a correct implementation of gradient descent.
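The update rule and the simultaneous update can be sketched as follows. The function name gradient_descent, the starting point of zero, the learning rate of 0.05 and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0          # arbitrary starting point
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y
        # Partial derivatives of J = (1/2m) * sum(errors ** 2).
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        # Simultaneous update: both gradients were computed from the old
        # parameter values before either parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Synthetic, noise-free data generated from y = 3 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)
```

Updating theta0 first and then using the new theta0 to compute the gradient for theta1 would be a different algorithm, which is why the two assignments happen together.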
• Consider the partial derivative term for theta1.

• The derivative of J(theta1) is the slope of the cost curve at the point theta1.

• In general, the derivative term gives the slope with respect to theta_j (j = 0, 1).
If the learning rate is too small, gradient descent takes very small steps and needs more time to find the minimum value.
If the learning rate is too large, it takes huge steps; even when the current value is near the minimum, it can overshoot, fail to converge or even diverge.
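The effect of the learning rate can be seen on a deliberately simple cost, J(theta) = theta², whose minimum is at 0. This toy cost, the function name run_gradient_descent and the three learning rates are assumptions chosen only to make the behaviour visible.

```python
def run_gradient_descent(alpha, steps=50, theta=1.0):
    """Minimise J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

too_small = run_gradient_descent(alpha=0.001)  # still close to the start
reasonable = run_gradient_descent(alpha=0.1)   # close to the minimum at 0
too_large = run_gradient_descent(alpha=1.1)    # overshoots further every step

print(too_small, reasonable, too_large)
```

After the same 50 steps, the small rate has barely moved, the moderate rate has essentially reached the minimum, and the large rate has moved far away from it.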
Multiple Linear Regression:
 It uses several explanatory variables to predict the outcome of a response variable.

 The main aim of the multiple linear regression (MLR) model is to model the linear relationship between the explanatory (independent) variables and the response variable.

 Multiple linear regression is an extension of simple OLS regression because it involves more than one explanatory variable.

 It is used in econometrics and financial inference.

Why Multiple Linear Regression (MLR):
 This type of algorithm is useful when the number of variables is small.
 It is used to find the correlation between the dependent and independent variables.
Multiple Linear Regression(MLR):
 y = m1·x + m2·z + c

 y is the dependent variable, that is, the variable that needs to be predicted.

 x is the first independent variable. It is the first input.

 m1 is the slope with respect to x. It tells us how steeply y changes along x.

 z is the second independent variable. It is the second input.

 m2 is the slope with respect to z. It tells us how steeply y changes along z.
 c is the intercept: a constant that gives the value of y when x and z are 0.
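A two-input model of this form can be fitted with a least-squares solve. The sketch below uses NumPy's np.linalg.lstsq on synthetic, noise-free data generated from y = 2x + 3z + 5; the data and the chosen coefficients are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (made up): y is built exactly from y = 2x + 3z + 5.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2.0 * x + 3.0 * z + 5.0

# Design matrix: one column per input plus a column of ones,
# so that the intercept c is estimated along with m1 and m2.
A = np.column_stack([x, z, np.ones_like(x)])
(m1, m2, c), *_ = np.linalg.lstsq(A, y, rcond=None)

print("m1:", m1, "m2:", m2, "c:", c)
```

Because the data contain no noise, the recovered slopes and intercept match the values used to generate y.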
Steps Of Multivariate Regression Analysis:
 Feature Selection:

 It is an important step in multivariate regression.

 It picks the features that matter for model building.

 Normalizing Features:

 The features should be scaled to a common range; scaling preserves the general distribution and ratios in the data.

 Loss Function Selection and Hypothesis:

 The loss function measures the error of the predictions.

 The hypothesis is the value predicted from the features.

 Minimize the Loss Function:

 The loss function should be minimized by running a loss minimization algorithm on the dataset.

 Gradient descent is one of the commonly used algorithms for loss minimization.

 Test the Hypothesis Function:

 The hypothesis function should be checked on how well it predicts values.

Cost Function:
 It is the sum of the squares of the differences between the predicted values and the actual values, divided by twice the length of the dataset.
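The steps above (normalize, define the hypothesis and cost, minimize, test) can be sketched end to end with NumPy. All function names here (standardize, cost, fit, predict), the learning rate, the iteration count and the synthetic data are illustrative assumptions, and standardization is just one common way of scaling features.

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit standard deviation."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std

def cost(theta, Xb, y):
    """Sum of squared errors divided by twice the number of rows."""
    errors = Xb @ theta - y
    return (errors @ errors) / (2 * len(y))

def fit(X, y, alpha=0.1, iterations=2000):
    """Minimise the cost with batch gradient descent."""
    Xs, mean, std = standardize(X)
    Xb = np.column_stack([np.ones(len(Xs)), Xs])   # leading 1s for the intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        gradient = Xb.T @ (Xb @ theta - y) / len(y)
        theta = theta - alpha * gradient            # updates all parameters at once
    return theta, mean, std

def predict(X, theta, mean, std):
    """Hypothesis: apply the same scaling, then the linear model."""
    Xs = (X - mean) / std
    return theta[0] + Xs @ theta[1:]

# Synthetic, noise-free data with two features (made-up coefficients).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 2))
y = 4.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1]

theta, mean, std = fit(X, y)
predictions = predict(X, theta, mean, std)
print("largest prediction error:", np.max(np.abs(predictions - y)))
```

In practice the final check would be run on data that was not used for fitting; here the same data is reused only to keep the sketch short.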

Multiple Linear Regression (MLR):

 It is a form of linear regression used when there are two or more predictor (independent) variables.

 It adds further predictors to the simple linear regression model:
 y = β0 + β1x1 + β2x2 + … + βpxp + ε
 The above equation is an extension of the simple linear regression one.

 Here, each input has a corresponding slope coefficient (β).

 β0 is the intercept constant; it is the value of y in the absence of all predictors (when all x terms are zero).
 As the number of features grows, the complexity of the model increases.

 It becomes more difficult to visualize the data.

 As there are more parameters in these models, we should be more careful while working with them.

 Adding more terms will improve the fit to the training data.

 This is dangerous because it can lead to a model that fits the data but does not mean anything useful.
 The advertising dataset consists of the sales of a product in 200 different markets.
 It contains the advertising budgets for three different media: TV, radio and newspaper.
 The dataset is used to predict the amount of sales (dependent variable) from the TV, radio and newspaper advertising budgets (independent variables).
 The formula is: sales = β0 + β1·TV + β2·radio + β3·newspaper.
 The β values are chosen to minimize the error function and fit the best line or hyperplane (depending on the number of input variables).
 Load the Data and Describe the Data:
 Import the required libraries:

 import pandas as pd
 import numpy as np
 import seaborn as sns
 import matplotlib.pyplot as plt
 from sklearn.model_selection import train_test_split
 from sklearn.linear_model import LinearRegression
 from sklearn import metrics
 from sklearn.metrics import r2_score
 import statsmodels.api as sm
 Load the Dataset:
 df = pd.read_csv("Advertising.csv")

 Understand the Dataset and Describe it:

 df.head()
 Drop the first, unnamed column since we do not need it:
 df = df.drop(['Unnamed: 0'], axis=1)
 The dataset contains four columns and 200 rows, with no missing values.
 Visualize the relationship between independent and target variables.

 sns.pairplot(df)
 The relationship between TV and sales is very strong.

 There is some trend between radio and sales, while the relationship between newspaper and sales is weak.

 This can be verified numerically through a correlation map.

 mask = np.tril(df.corr())

 sns.heatmap(df.corr(), fmt='.1g', annot=True, cmap='cool', mask=mask)

 The strongest positive correlation is between sales and TV.

 The correlation between sales and newspaper is close to 0.

 Select Features and Target Variable:

 Divide the variables into two sets: the dependent (or target) variable “y” and the independent (or feature) variables “X”.

 X = df.drop(['sales'], axis=1)

 y = df['sales']
 Split the Dataset:

 To assess model performance, the dataset is divided into a training set and a testing set.

 By splitting the dataset into two separate sets, we can train the model on one set and test its performance on the other, unseen set.

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

 The dataset is split into 70% train and 30% test.

 The random_state parameter seeds the internal random number generator.

 Because random_state is fixed at 0, the same split is produced on every run, so outputs can be compared across runs.
 print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

 From the output, we can observe the following:

 2 datasets of 140 rows each, one with the 3 independent variables and one with the target variable; they are used for training and producing the linear regression model.

 2 datasets of 60 rows each, one with the 3 independent variables and one with the target variable; they are used for testing the performance of the linear regression model.
Build Model:
 mlr = LinearRegression()

 Train the Model:

 The training data is fitted to the model; this is the training part of the modelling process.
 After it is trained, the model can be used to make predictions.

 mlr.fit(X_train, y_train)

 Print the intercept β0:

 mlr.intercept_
 Print the values of the coefficients β:
 coeff_df = pd.DataFrame(mlr.coef_, X.columns, columns=['Coefficient'])
 coeff_df
 The sales value can be estimated for different budget values of “TV”, “radio” and “newspaper”.

 For example, for a budget of 50 for TV, 30 for radio and 10 for newspaper, the estimated value of “sales” will be:

 example = [50, 30, 10]

 output = mlr.intercept_ + sum(example * mlr.coef_)

 output
Test Model:

 A test dataset is a dataset that is independent of the training dataset.

 It is unseen data for the model, which gives a better view of the model's ability to generalize:

 y_pred = mlr.predict(X_test)
Evaluate Performance:
 The quality of the model is estimated by how well the predictions match the actual values of the testing dataset:

 print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
 print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
 print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
 print('R Squared Score is:', r2_score(y_test, y_pred))
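To make clear what the four reported numbers measure, the sketch below computes them by hand with NumPy on a tiny made-up example. The helper name regression_metrics and the sample values are assumptions; the scikit-learn functions used above remain the practical choice.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # mean absolute error
    mse = np.mean(errors ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                              # root mean squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # share of variation explained
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

scores = regression_metrics(y_true, y_pred)
print(scores)
```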
Advantages and Disadvantages:
 This type of algorithm helps us find the relationships between the various variables present in the dataset.
 It helps us understand the relation between the independent and the dependent variables.
 It is somewhat complex and requires a high level of mathematical calculation.

 It is not easy to interpret.

 Its output contains some loss and error, so the predictions are not identical to the actual values.

 It is not well suited to small datasets; it works better on larger datasets.
 Mismeasurement: Factors might not be measured correctly.

 For example, aptitude is difficult to measure and there are well-known problems with IQ tests.

 As a result, a regression using IQ might not properly control for aptitude.

 Too limited a focus:

 A regression coefficient provides information only about how small changes in one variable relate to changes in another variable.

 For example, it will show how a small change in education affects earnings, but it will not allow the researcher to generalize about the effect of large changes.

Multiple Linear Regression:
 A simple linear regression function allows us to make predictions about one variable based on the information that is available about another variable.

 Simple linear regression can only be used when one has two continuous variables: an independent variable and a dependent variable.

 The independent variable is the variable that is used to calculate the dependent variable.

 A multiple linear regression model extends this to several explanatory variables.

 The multiple regression model is based on the following assumptions:
 There is a linear relationship between the dependent variable and the independent variables.

 The independent variables are not too highly correlated with each other.

 The yi observations are selected independently and randomly from the population.

 Residuals should be normally distributed with a mean of 0 and a variance of σ².
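One simple way to check the assumption that the independent variables are not highly correlated is to look at their pairwise correlations. The sketch below is an illustration only: the helper name highly_correlated_pairs, the threshold of 0.8 and the synthetic columns a, b and c are all assumptions, and a pairwise check is a rough screen rather than a complete diagnostic.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.8):
    """Return (name_i, name_j, correlation) for pairs above the threshold."""
    corr = np.corrcoef(X, rowvar=False)   # columns of X are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], corr[i, j]))
    return pairs

# Synthetic predictors: c is almost a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

X = np.column_stack([a, b, c])
pairs = highly_correlated_pairs(X, ["a", "b", "c"])
print(pairs)
```

A flagged pair suggests that the two predictors carry nearly the same information, so their individual coefficients would be hard to interpret.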

 The coefficient of determination (R-squared) is a statistical metric that measures how much of the variation in the outcome can be explained by the variation in the independent variables.
 R^2 by itself cannot be used to identify which predictors should be included in the model and which should be excluded.
 For a least-squares fit with an intercept, evaluated on the data it was fitted to, R^2 lies between 0 and 1.

 The value 0 indicates that the outcome cannot be predicted by any of the independent variables.

 The value 1 indicates that the outcome can be predicted without error from the independent variables.

 When we interpret the results of a multiple regression, each beta coefficient is valid while holding all other variables constant.

 The output from a multiple regression can be displayed horizontally as an equation or vertically in table form.

How to Use Multiple Linear Regression:
 An analyst may want to know how the movement of the market affects the price of ExxonMobil (XOM).
 In this case, the linear equation will have the value of the S&P 500 index as the independent variable, or predictor, and the price of XOM as the dependent variable.

 In reality, several factors affect the outcome of an event.

 The price movement of ExxonMobil, for example, depends on more than just the performance of the overall market.
 Other predictors, such as the price of oil, interest rates and the price movement of oil futures, can affect the price of XOM.

 They also affect the stock prices of other oil companies.

 To understand a relationship in which more than two variables are present, multiple linear regression is used.
 Multiple Linear Regression (MLR) is used to determine a mathematical relationship between several random variables.

 It examines how multiple independent variables are related to one dependent variable.

 Once each of the independent factors has been determined to predict the dependent variable, the information on the multiple variables can be used to create an accurate prediction of the level of effect they have on the outcome variable.
 The model creates a relationship in the form of a straight line that best approximates all the individual data points.

 The MLR equation for this example is yi = β0 + β1xi1 + β2xi2 + … + βpxip + ε, where:
 yi = dependent variable: the price of XOM.
 xi1 = interest rates.
 xi2 = oil price.
 xi3 = value of the S&P 500 index.
 xi4 = price of oil futures.
 β0 = y-intercept at time 0.
 β1 = regression coefficient that measures a unit change in the dependent variable when xi1 changes: the change in XOM price when interest rates change.
 β2 = coefficient that measures a unit change in the dependent variable when xi2 changes: the change in XOM price when oil prices change.

 The least-squares estimates β0, β1, β2, …, βp are usually computed by statistical software.

 Many different variables can be included in a regression model.

 Each independent variable is distinguished by a number: 1, 2, 3, 4, …, p.

 The multiple regression model allows an analyst to predict an outcome based on the information provided by multiple explanatory variables.

 The model is not perfectly accurate, as each data point can differ slightly from the outcome predicted by the model.
 The residual value, e, is the difference between the actual outcome and the predicted outcome.

 It is included in the model to account for such slight variations.

 Suppose the fitted model shows that, if the other variables are held constant, the price of XOM will increase by 7.8% if the price of oil in the markets increases by 1%.

 The model also shows that the price of XOM will decrease by 1.5% following a 1% rise in interest rates.
 An R^2 of 86.5% indicates that 86.5% of the variation in the stock price of ExxonMobil can be explained by changes in the interest rate, oil price, oil futures and the S&P 500 index.
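Reading the two coefficients quoted above as a linear model gives a quick way to combine effects. The sketch below is a worked illustration only: it uses just the 7.8% and 1.5% figures from the text, assumes all other variables are held constant, and the function name predicted_xom_change and the scenario of a 2% oil rise with a 1% rate rise are hypothetical. It should only be trusted for small changes.

```python
OIL_EFFECT = 7.8     # % change in XOM per +1% in the oil price (from the text)
RATE_EFFECT = -1.5   # % change in XOM per +1% in interest rates (from the text)

def predicted_xom_change(oil_change_pct, rate_change_pct):
    """Combine the two effects, holding all other variables constant."""
    return OIL_EFFECT * oil_change_pct + RATE_EFFECT * rate_change_pct

# Hypothetical scenario: oil rises by 2% while interest rates rise by 1%.
change = predicted_xom_change(oil_change_pct=2.0, rate_change_pct=1.0)
print(change)
```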
Difference Between Linear and Multiple Regression:
 The ordinary least squares (OLS) method compares the response of a dependent variable to a change in some explanatory variables.

 However, a dependent variable is rarely explained by only one variable.

 In this case, an analyst uses multiple regression.

 It attempts to explain a dependent variable using more than one independent variable.

 Multiple regressions can be linear or nonlinear.

 Multiple linear regressions are based on the assumption that there is a linear relationship between the dependent and the independent variables.
 They also assume that there is no major correlation between the independent variables.
What makes Multiple Regression Multiple?
 A multiple regression considers the effect of more than one explanatory variable on some outcome of interest.
 It evaluates the relative effect of these explanatory variables on the dependent variable while holding all the other variables in the model constant.

Advantages Of Multiple Regression Over Simple OLS
 A dependent variable is rarely explained by only one variable.

 Multiple linear regression attempts to explain a dependent variable using more than one independent variable.

 The model assumes that there are no major correlations between the independent variables.

 Multiple regression models are complex.

 They become even more complex when more variables are included in the model or when the amount of data grows.

 To run a multiple regression, we need specialized statistical software or functions.

How We Can Make Multiple Regressions To Be Linear:
 The multiple linear regression model calculates the line of best fit.

 It minimizes the variance of each of the included variables as it relates to the dependent variable.

 Because it fits a line, it is considered a linear model.

 There are also non-linear regression models involving multiple variables, such as logistic regression, quadratic regression and probit models.

Application Of Multiple Linear Regression Models In Finance:
Any econometric model that looks at more than one variable may be considered a multiple regression.
Factor models compare two or more factors to analyze the relationships between the variables and the resulting performance.
 Omitted Variables:

 It is necessary to have a good theoretical model to suggest variables that explain the dependent variable.

 Various factors should be considered to explain the dependent variable while dealing with two-variable


 Reverse Causality:

 Many theoretical models predict bidirectional causality – a dependent variable can cause changes in one or

more explanatory variables.

 For instance , higher earnings may enable people to invest more in their education which in turn raises

their earnings.

