Chapter 3 - Classical Simple Linear Regression


Simple Linear Regression

 Regression is probably the single most important
tool at the econometrician's disposal.
 Regression analysis is concerned with the study
of the dependence of one variable (the
dependent variable) on one or more other
variable(s) (the explanatory variable(s)), with a
view to estimating and/or predicting the
(population) mean or average value of the
former in terms of the known or fixed (in
repeated sampling) values of the latter.
Classical Linear Regression Model

 In econometrics, there is often a need to
examine the relationship between two or more
financial variables.
 The relationship between variables can be
explored by:
a. Correlation Analysis
b. Building a linear regression model
Correlation Analysis
 The correlation between two variables
measures the degree of linear association
between them.
 Correlation analysis is a group of statistical
techniques used to measure the strength of the
relationship (correlation) between two variables.
 Once a linear relationship is established,
knowledge of the independent variable(s) can
be used to forecast the dependent variable.
THE SIMPLE REGRESSION MODEL

 Regression analysis is concerned with
describing and evaluating the relationship
between a given variable (the explained or
dependent variable) and one or more other
variables (the explanatory or independent
variables).
 In statistical modelling, regression analysis is a
statistical process for estimating the
relationships among variables.
 The explained variable is denoted by y and the
explanatory variable by x.
 Regression is an attempt to explain the
variation in a dependent variable using the
variation in independent variables.
 Regression is thus often interpreted as
describing causation, although regression alone
cannot establish it.
 If the independent variable(s) sufficiently
explain the variation in the dependent
variable, the model can be used for
prediction.
Types of Regression Models

 Models with one explanatory variable are called
simple regression models; models with two or
more explanatory variables are called multiple
regression models.
 Each type may be either linear or non-linear.


Y = a + bX + u
Where:
 Y is the dependent variable
 X is the independent variable (the variable
that drives the dependent variable, i.e. the
level of activity)
 a is the intercept of the trend line on the Y axis
(i.e. the fixed component or starting point)
 b is the gradient (slope) of the trend line
 u is the stochastic error (disturbance) term
 Y - yield
 X - fertilizer
 The agricultural researcher is interested in the
effect of fertilizer on yield, holding other
factors fixed.
 The error term u contains factors such as land
quality, rainfall, and so on.
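As an illustration, the model can be simulated and fitted in
Python. This is a minimal sketch with made-up fertilizer/yield
numbers (hypothetical, not data from this chapter); np.polyfit
performs an ordinary least squares fit of a straight line:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" model: yield = 2.0 + 0.5 * fertilizer + u
fertilizer = np.linspace(0, 10, 50)          # X: amount of fertilizer
u = rng.normal(0.0, 1.0, fertilizer.size)    # stochastic disturbance term
crop_yield = 2.0 + 0.5 * fertilizer + u      # Y: observed yield

# OLS fit of a straight line; polyfit returns (slope, intercept)
b, a = np.polyfit(fertilizer, crop_yield, 1)
print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}")
# The estimates should be close to the true a = 2.0 and b = 0.5;
# they differ only because of the disturbance term u.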
2-5. The Significance of the Stochastic
Disturbance Term

Why not include as many variables as possible in
the model? (i.e. the reasons for using the
disturbance term ui):
+ Vagueness of theory
+ Unavailability of data
+ Core variables vs. peripheral variables
+ Intrinsic randomness in human behaviour
+ Poor proxy variables
+ Principle of parsimony
+ Wrong functional form
 To review the analysis of the relationships
between two variables, consider the following
example.
 The values of a and b can be calculated using
the following formulae:

b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
a = ȳ − b·x̄
Example
Question
 Calculate the correlation coefficient (r) for Tom's
T-shirts using the following formula:

r = [Σxy − (Σx·Σy)/n] / √{[Σx² − (Σx)²/n] · [Σy² − (Σy)²/n]}
Example
 Consider the following example
Serial n   Weight Y (kg)   Age X (years)   xy     x²     y²
1          12              7               84     49     144
2          8               6               48     36     64
3          12              8               96     64     144
4          10              5               50     25     100
5          11              6               66     36     121
6          13              9               117    81     169
Total      Σy = 66         Σx = 41         Σxy = 461   Σx² = 291   Σy² = 742
 Using the information in the table above,
calculate the following:
a) 
b) 
c) Correlation coefficient (r)
d) Fit the regression line
Required
Calculate the following:

a) 
b) 
c) R-squared and comment on the goodness of fit of
the model to the data
d) Fit the regression model and interpret the results
(a worked sketch follows below)
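A minimal Python sketch for the weight/age example, computing r,
R-squared, and the fitted line directly from the column totals in
the table above (so the arithmetic can be checked by hand):

import math

# Column totals from the weight/age table (n = 6 observations)
n = 6
sum_x, sum_y = 41, 66            # Σx (age), Σy (weight)
sum_xy = 461                     # Σxy
sum_x2, sum_y2 = 291, 742        # Σx², Σy²

# Corrected sums of squares and cross-products
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n

# Correlation coefficient, R-squared, and regression line y = a + b·x
r = sxy / math.sqrt(sxx * syy)
b = sxy / sxx
a = sum_y / n - b * sum_x / n

print(f"r = {r:.2f}, R² = {r * r:.2f}")   # r ≈ 0.76, R² ≈ 0.58
print(f"y = {a:.2f} + {b:.2f}x")          # y ≈ 4.69 + 0.92x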
Sources of Errors in Regression

 The question which remains unanswered is:
why should we add an error term? What are
the sources of the error term u in the
equation?
 Technically, u is known as the stochastic
disturbance or stochastic error term.
 It is a surrogate or proxy for all the omitted or
neglected variables that may affect Y but are
not (or cannot be) included in the regression
model.
Sources of Errors in Regression
1. Unpredictable element of randomness in
human response
 Suppose y = consumption expenditure of a
household and x = disposable income of the
household. There is an unpredictable element
of randomness in each household's
consumption.
 A household does not behave like a machine;
its expenditure pattern fluctuates.
2. Effect of Omitted Variables

 In our example, x is not the only variable
influencing y.
 Family size, the tastes of the family, spending
habits, etc. can all affect the variable y.
 The term u is a catch-all for the effects of all
such variables, some of which may not be
quantifiable and some of which may not even
be identifiable.
3. Measurement Error in y
 This refers to measurement error in the
household consumption figure.
 The argument is that we cannot measure it
accurately.
 For now, let us assume that there is
measurement error in y but not in x.
 Vagueness of theory - The theory, if any,
determining the behaviour of Y may be, and
often is, incomplete.
 Unavailability of data - It is a common
experience in empirical analysis that the data we
would ideally like to have often are not available.
 Wrong functional form - Even if we have
theoretically correct variables explaining a
phenomenon and even if we can obtain data on
these variables, very often we do not know the
form of the functional relationship between the
regressand and the regressors.
The Gauss-Markov Theorem
 Under the classical assumptions, OLS is the Best
Linear Unbiased Estimator (BLUE) of the
population parameters.
 Best = smallest variance among linear unbiased
estimators.
 It is reassuring to know that, under these
assumptions, you cannot find a better linear
unbiased estimator than OLS.
 If one or several of these assumptions fail, OLS
is no longer BLUE.
Regression Line

 We also need to draw a fitted regression line
that best fits the collection of (x, y) data points.
 A better procedure is to find the best straight line
using a criterion that minimises the sum of
squared distances (errors) from the points to the
line, as measured in the Y direction.
 It is possible to use the general equation for a
straight line to get the line that best ‘fits’ the data.
 The researcher would then be seeking to find the
values of the parameters or coefficients, α and β,
which would place the line as close as possible to
all of the data points taken together.
 The most common method used to fit a line to
the data is known as Ordinary Least Squares
(OLS).
 Ordinary Least Squares (OLS) or linear least
squares is a method for estimating the unknown
parameters in a linear regression model, with
the goal of minimizing the sum of the squares of
the differences between the observed responses
in the given dataset and those predicted by a
linear function of a set of explanatory variables.
 This line is known as the least squares line or
fitted regression line.
DERIVING THE ORDINARY LEAST SQUARES ESTIMATES

 Linear regression, also known as linear least
squares, computes the line that best fits the
observations.
 The method of least squares requires that we
choose as estimates of α and β the values that
make the sum of squared residuals as small as
possible.
 Thus the predictions must be based on
estimated parameter values, and testing is
based on estimated values in relation to
hypothesized population values.
Example
 In other words, for the given value of x for
observation t, ŷt is the value of y which the
model would have predicted.
 Note that a hat (ˆ) over a variable or
parameter is used to denote a value estimated
by a model.
 Finally, let ût denote the residual, which is
the difference between the actual value of y
and the value fitted by the model for this data
point, i.e. ût = yt − ŷt.
 The distance of a data point from the fitted
line is its residual: the difference between the
actual value of the dependent variable and the
predicted value.
 The method minimises collectively the vertical
distances from the data points to the fitted
line (y − ŷ).
 To guess is cheap. To guess wrongly is
expensive - Chinese proverb
 The reason that the sum of the squared
distances is minimised, rather than simply a
sum of ût made as close to zero as possible, is
that some points will lie above the line while
others lie below it.
 When a sum that is to be made as close to zero
as possible is formed, the points above the line
count as positive values while those below
count as negatives, and so they cancel out.
 Indeed, any fitted line that goes through the
mean of the observations (i.e. (x̄, ȳ)) would
set the sum of the ût to zero.
 So minimising the Sum of Squared Errors (SSE)
means choosing the line that minimises:

SSE = Σût² = Σ(yt − ŷt)²
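Setting the derivatives of this sum with respect to the estimated
intercept and slope to zero gives the familiar OLS estimators (a
standard result, stated here for reference):

β̂ = Σ(xt − x̄)(yt − ȳ) / Σ(xt − x̄)²,   α̂ = ȳ − β̂·x̄

The second equation confirms that the fitted line passes through
the point of means (x̄, ȳ).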
Explained and Unexplained Variation

 We need to define a line that passes through
the point determined by the mean x-value and
the mean y-value.
 The variation in the dependent (y) variable can
be partitioned as follows:
 Total variation in the dependent (y) variable is
the sum of:
 the variation in the dependent (y) variable explained by
the independent (x) variable, and
 the variation in the dependent (y) variable NOT
explained by the independent (x) variable (the residual).
Total Variation

[Figure: scatter of $ spent on health care (Y) against
income (X), with the fitted line Y = a + bx. The total
deviation of a point around Ȳ is split into the deviation
explained by the regression (Ŷ − Ȳ) and the deviation
unexplained by the regression (Y − Ŷ).]
Explaining Variation

SST = SSR + SSE

(Total deviation = Explained deviation + Unexplained deviation)
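In symbols, writing ŷt for the fitted value and ȳ for the sample
mean, the standard definitions are:

SST = Σ(yt − ȳ)²,   SSR = Σ(ŷt − ȳ)²,   SSE = Σ(yt − ŷt)²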
Example
Suppose Mr Kuwaza observes the selling price
and sales volume of milk for 10 randomly
selected weeks. The data he has collected are
presented in Table 2.1 overleaf.

Required: calculate
a) SSR
b) SST
c) SSE
(a worked sketch follows after the tables below)
Table 2.1
Week   Weekly Sales (1000s of gallons)   Selling Price ($)
1 10 1.30
2 6 2.00
3 5 1.70
4 12 1.50
5 10 1.60
6 15 1.20
7 5 1.60
8 12 1.40
9 17 1.00
10 20 1.10
Y      X       XY      X²      Y²
10 1.30 13.0 1.69 100
6 2.00 12.0 4.00 36
5 1.70 8.5 2.89 25
12 1.50 18.0 2.25 144
10 1.60 16.0 2.56 100
15 1.20 18.0 1.44 225
5 1.60 8.0 2.56 25
12 1.40 16.8 1.96 144
17 1.00 17.0 1.00 289
20 1.10 22.0 1.21 400
Totals: 112 14.40 149.3 21.56 1488
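A minimal Python sketch for the Required items, using the raw data
from Table 2.1 (the column totals above serve as a hand check):

# Data from Table 2.1: weekly milk sales (Y, 1000s of gallons)
# and selling price (X, $) for 10 weeks
y = [10, 6, 5, 12, 10, 15, 5, 12, 17, 20]
x = [1.30, 2.00, 1.70, 1.50, 1.60, 1.20, 1.60, 1.40, 1.00, 1.10]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b = sxy / sxx
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                        # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

print(f"y = {a:.2f} + {b:.2f}x")
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
# Expected from the totals above: SST = 233.6, SSR ≈ 174.2, SSE ≈ 59.4,
# so SSR + SSE = SST (up to rounding).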
 The line that best fits a collection of X-Y data
points is the line that minimises the sum of
squared distances from the points to the line.
 This is known as the least squares line or
fitted regression equation.
 The fitted line will be of the form: ŷ = a + bx
Calculation of Residuals (SST)
Residuals with Predicted Data
Testing Validity of the Model
 In simple linear regression, the validity of the
model is tested using the Coefficient of
Determination.
 A high Coefficient of Determination indicates
that the model fits the data well.
 We can also check the validity of the model by
checking whether SSR > SSE, i.e. whether the
explained variation exceeds the unexplained
variation.
Coefficient of Determination
 The coefficient of determination measures the
percentage of variability in Y that can be
explained through knowledge of the variability
in the independent variable X:

R² = SSR / SST = 1 − SSE / SST

 The more of the variance in Y you can explain, the
more powerful your model.
 Calculate the Coefficient of Determination using
the previous example.
Solution
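Using the SSR and SST values computed from Table 2.1 in the
sketch above:

R² = SSR / SST ≈ 174.2 / 233.6 ≈ 0.75

so roughly 75% of the week-to-week variation in milk sales is
explained by the selling price, which suggests a reasonably
good fit.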
