
Regression Analysis

by
Hrishikesh Khaladkar
Department of Mathematics
Fergusson College

May 25, 2018

Introduction

Regression analysis is used to model the relationship between a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable (or "criterion variable") changes when any one of the independent variables is varied while the other independent variables are held fixed.
Linear regression is the next step up after correlation.
It is widely used for prediction and forecasting.

Example

Suppose your manager asked you to predict annual sales. There can be a hundred factors (drivers) that affect sales. In this case, sales is your dependent variable, and the factors affecting sales are the independent variables. Regression analysis helps you solve this problem by answering the following questions:
Which of the drivers have a significant impact on sales?
Which is the most important driver of sales?
How do the drivers interact with each other?
What would the annual sales be next year?

General Information

Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data.
Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
The earliest form of regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun.
The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon.

Simple Linear Regression Model


The equation that describes how y is related to x is called the regression model:
y = β0 + β1x + ε where
β0, β1 are the parameters of the model
ε : the error term.
The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.
The Simple Linear Regression Equation is given by
E(y) = β0 + β1x
Its graph is a straight line, where
β0 : y-intercept of the regression line
β1 : slope of the regression line
E(y) : the mean or expected value of y for a given value of x.

Estimated Line of Regression


The estimated line of regression is ŷ = b0 + b1x where
b0 : y-intercept of the estimated regression line
b1 : slope of the estimated regression line
ŷ : the estimated value of y for a given x using the regression line.
Note that b0, b1 provide the estimates for β0, β1 of the regression model.

Possible Regression Lines

[Figure omitted.]

Estimation Process

[Figure omitted.]

Method of Least Squares

Consider a set of observations (x1, y1), (x2, y2), ..., (xn, yn) from which we estimate the regression line ŷ = b0 + b1x. The LEAST SQUARES CRITERION chooses b0, b1 to minimize
Σ(yi − ŷi)² where
yi : observed value of the dependent variable for the ith observation
ŷi : estimated value of the dependent variable for the ith observation

Estimates of the Regression Coefficients

Consider the estimated regression line ŷ = b0 + b1x. The estimates b0, b1 are given by
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1x̄
where the sums run over i = 1, ..., n and
xi : value of the independent variable for the ith observation
yi : value of the dependent variable for the ith observation
x̄ : mean value of the independent variable
ȳ : mean value of the dependent variable
n : total number of observations
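As a check on these formulas, here is a minimal R sketch with made-up data; the manual estimates should match the coefficients returned by lm().

    # Least squares estimates computed from the formulas above,
    # then checked against R's built-in lm(). Data are illustrative.
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)
    c(b0 = b0, b1 = b1)

    # lm() returns the same estimates.
    coef(lm(y ~ x))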


Assumptions in the Model


Consider the regression model y = β0 + β1x + ε.
The following assumptions need to be validated before building a regression model:
1) There needs to be a linear relationship between the two variables. You can plot the dependent variable against the independent variable and visually inspect the scatterplot to check for linearity, as in the sketch below.
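A minimal sketch of that visual check, reusing the x and y vectors from the earlier example:

    # Scatterplot of y against x with the fitted line overlaid;
    # a roughly linear point cloud supports the linearity assumption.
    plot(x, y, xlab = "Independent variable", ylab = "Dependent variable")
    abline(lm(y ~ x), col = "blue")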

Assumptions in the Model

2) There should be no significant outliers. An outlier is an observed data point whose dependent variable value is very different from the value predicted by the regression equation. As such, an outlier will be a point on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual. One common screen flags standardized residuals beyond ±3, as in the sketch below.
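A minimal sketch of that screen, reusing the running x and y example; the ±3 cut-off is a common rule of thumb, not a universal standard.

    # Flag potential outliers via standardized residuals.
    fit <- lm(y ~ x)
    std_res <- rstandard(fit)
    which(abs(std_res) > 3)  # indices of points with unusually large residuals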

Assumptions in the Model (Regarding the error terms)

3) Autocorrelation of errors: the values of ε are independent.
Implication: The value of ε for a particular set of values of the independent variables is not related to the value of ε for any other set of values. In R, use the Durbin-Watson test (see the sketch below):
H0 : ρ = 0
H1 : ρ > 0
If the p-value is less than 0.05, the null hypothesis is rejected, and hence the autocorrelation between the errors is greater than 0. On the other hand, if the p-value is greater than 0.05, there is insufficient evidence to reject H0, and hence it can be assumed that the error terms are independent.
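A minimal sketch, assuming the lmtest package is installed (car::durbinWatsonTest is an alternative):

    # Durbin-Watson test for autocorrelation of the residuals.
    library(lmtest)
    fit <- lm(y ~ x)
    dwtest(fit)  # p < 0.05 suggests positively autocorrelated errors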

Assumptions in the Model (Regarding the error terms)

4) The variance of ε, denoted by σ², is the same for all values of x.
Implication: The variance of y about the regression line equals σ² and is the same for all values of x. This is called homoscedasticity; when it fails, the errors are heteroscedastic. (In R, use residual plots, as in the sketch below.)
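A minimal residual-plot sketch for the running example:

    # Residuals vs fitted values: a constant vertical spread with no
    # funnel shape is consistent with homoscedasticity.
    fit <- lm(y ~ x)
    plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)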

Assumptions in the Model (Regarding the error terms)

5) The error term ε is a normally distributed random variable.
Implication: Because y is a linear function of ε, y is also a normally distributed random variable. (In R, use a normal Q-Q plot, a histogram with a superimposed normal curve, or shapiro.test; see the sketch below.)
H0 : The errors are normally distributed
H1 : The errors are not normally distributed
If the p-value is less than 0.05, the null hypothesis is rejected, and hence the errors in the population are taken to be non-normal. On the other hand, if the p-value is greater than 0.05, there is insufficient evidence to reject H0, and hence it can be assumed that the errors in the population are normally distributed.
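A minimal sketch of both checks on the running example:

    # Normal Q-Q plot and Shapiro-Wilk test on the residuals.
    fit <- lm(y ~ x)
    qqnorm(resid(fit))        # points close to the line suggest normality
    qqline(resid(fit))
    shapiro.test(resid(fit))  # p > 0.05: no evidence against normality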

Coefficient of Determination (Inferential Statistics)

The coefficient of determination provides a measure of goodness of fit for the estimated line of regression.
Sum of Squares due to Error: SSE = Σ(yi − ŷi)²
Sum of Squares due to Regression: SSR = Σ(ŷi − ȳ)²
Total Sum of Squares: SST = SSR + SSE
R² = SSR / SST
Note:
The fit is perfect when SSR = SST.
A poorer fit results in a larger value of SSE; the poorest fit occurs when SSR = 0 and SSE = SST.
The value of R² lies between 0 and 1.
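A minimal sketch computing these quantities by hand for the running example and checking them against summary():

    # SSE, SSR, SST and R² from their definitions.
    fit <- lm(y ~ x)
    sse <- sum((y - fitted(fit))^2)
    ssr <- sum((fitted(fit) - mean(y))^2)
    sst <- ssr + sse
    c(R2_manual = ssr / sst, R2_lm = summary(fit)$r.squared)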

Coefficient of Determination (Inferential Statistics)

[Figure omitted.]

Multiple Linear Regression Model

The equation that describes how y is related to x1, x2, ..., xp is called the regression model:
y = β0 + β1x1 + β2x2 + ... + βpxp + ε where
β0, β1, ..., βp are the parameters of the model
ε : the error term.
The error term accounts for the variability in y that cannot be explained by the linear effect of the p independent variables.
The Multiple Linear Regression Equation is given by
E(y) = β0 + β1x1 + β2x2 + ... + βpxp
Its graph is a hyperplane in Rᵖ.

Multiple Linear Regression Equation (Regression Plane)


The estimated regression equation is ŷ = b0 + b1x1 + b2x2 + ... + bpxp where
ŷ : the estimated value of y for given x1, x2, ..., xp using the regression plane.
Note that b0, b1, b2, ..., bp provide the estimates for β0, β1, β2, ..., βp of the regression model.
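A minimal fitting sketch; the data frame and the column names (sales, price, adverts) are invented for illustration:

    # Multiple linear regression with two predictors.
    df <- data.frame(
      sales   = c(10, 14, 19, 24, 30, 33),
      price   = c(5.0, 4.8, 4.5, 4.2, 4.0, 3.7),
      adverts = c(1, 2, 3, 4, 5, 6)
    )
    fit <- lm(sales ~ price + adverts, data = df)
    coef(fit)  # b0, b1, b2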

Estimation Process for Multiple Regression

[Figure omitted.]

Method of Least Squares

As in simple linear regression, the method of least squares is used to calculate the estimates of the coefficients in the model. The LEAST SQUARES CRITERION involves
min Σ(yi − ŷi)² where
yi : observed value of the dependent variable for the ith observation
ŷi : estimated value of the dependent variable for the ith observation

Adjusted Multiple Coefficient of Determination (Inferential Statistics)

The coefficient of determination provides a measure of goodness of fit for the estimated regression plane.
Sum of Squares due to Error: SSE = Σ(yi − ŷi)²
Sum of Squares due to Regression: SSR = Σ(ŷi − ȳ)²
Total Sum of Squares: SST = SSR + SSE
R² = SSR / SST
Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1)
where n denotes the number of observations
p denotes the number of independent variables
Note that Ra² ≤ R².
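A minimal sketch, continuing the invented sales example above:

    # Adjusted R² from the formula, checked against summary().
    n   <- nrow(df)
    p   <- 2  # number of independent variables in the model
    r2  <- summary(fit)$r.squared
    adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
    c(adj_manual = adj, adj_lm = summary(fit)$adj.r.squared)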

Other Criteria (Inferential Statistics)

Akaike's Information Criterion (AIC):
AIC = n log(SSE/n) + 2p
Bayesian Information Criterion (BIC):
BIC = n log(SSE/n) + p log n
Mallows' Cp:
Cp = SSE/MSE − n + 2p
where SSE is the error sum of squares of the candidate model and MSE is the mean squared error of the full model. In all of the above formulas, p is the number of features in the model being tested and n is the sample size; smaller values indicate a better trade-off between fit and complexity.
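In R, AIC and BIC are available as built-in functions. Note they are computed from the full log-likelihood, so the constants differ from the n log(SSE/n) form above, but model rankings on the same data agree:

    # Information criteria for the fitted model.
    AIC(fit)
    BIC(fit)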

Assumptions regarding the Model

1) Continuous dependent variable: your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").

Assumptions regarding the Model

2) You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable); examples of continuous and ordinal variables are given above.
Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth.

Assumptions regarding the Model

3) There needs to be a linear relationship between
the dependent variable and each of your independent variables, and
the dependent variable and the independent variables collectively.
Use scatter plots and partial regression plots to check this.
4) Multicollinearity: your data must not show multicollinearity, which occurs when two or more independent variables are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. (In R, calculate the Variance Inflation Factor, also called the VIF; details and a sketch follow.)

Variance Inflation Factor (VIF)


Correlation matrix: when computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients should be clearly smaller than 1; coefficients close to 1 indicate multicollinearity.
Tolerance: the tolerance measures the influence of one independent variable on all the other independent variables; it is calculated with an initial linear regression analysis, regressing that variable on the others. Tolerance is defined as T = 1 − R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
Variance Inflation Factor (VIF): the variance inflation factor is defined as VIF = 1/T. Correspondingly, VIF > 10 is an indication that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity in the sample.
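A minimal sketch, assuming the car package is installed and using the multiple regression fit from the earlier sales example:

    # Variance inflation factors for each predictor.
    library(car)
    vif(fit)  # values above 10 indicate possible multicollinearity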

Assumptions in the Model (Regarding the error terms)

5) You should have independence of observations (i.e., independence of residuals). (In R, use the Durbin-Watson test as described earlier.)
6) Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. (In R, use residual plots.)
7) You need to check that the residuals (errors) are approximately normally distributed. (In R, use the normal Q-Q plot or shapiro.test mentioned earlier.)

Assumptions in the Model (Regarding the error terms)

8) There should be no significant outliers, high leverage points, or highly influential points. Outliers, leverage points, and influential points are different terms for observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These classifications reflect the different impact such points have on the regression line, and an observation can be classified as more than one type of unusual point. However, all of these points can have a very negative effect on the regression equation used to predict the value of the dependent variable from the independent variables. A sketch of the standard R diagnostics follows.
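A minimal sketch on the multiple regression fit; the quoted cut-offs are common rules of thumb rather than hard rules.

    # Leverage, influence, and outlier diagnostics.
    hatvalues(fit)       # leverage; unusually large values flag high-leverage points
    cooks.distance(fit)  # influence; values near or above 1 merit inspection
    rstandard(fit)       # standardized residuals; |value| > 3 flags outliers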
