Chapter 3
A Brief Overview of the Classical Linear Regression Model
Sisay D. (PhD)

Lecture Plan
1. Regression, correlation & causation
2. Use of regression analysis
3. Variables in regression models
4. Types of regression analysis
5. Best-fitting straight line (OLS)
6. The assumptions underlying the classical linear regression model
7. Quality of straight-line fit
8. Exercises
Regression analysis implies (but does not prove) causality between the dependent variable Y and the independent variable X. Correlation analysis, however, implies no causality or dependency; it refers simply to the type and degree of association between two variables. For example, X and Y may be highly correlated because of another variable that strongly affects both.

In correlation analysis both Y and X are treated as random variables, with no distinction between dependent and independent. Regression, by contrast, is the estimation or prediction of the average value of a dependent variable on the basis of the fixed values of other variables: Y is the dependent variable and X is the independent variable (there may be many independent variables).
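The confounding point above can be illustrated with a short simulation (a sketch with made-up data; all names are illustrative): a hidden variable Z drives both X and Y, so X and Y come out strongly correlated even though neither causes the other.

```python
# Sketch: high correlation without causation. Z drives both X and Y;
# X and Y end up strongly correlated although neither affects the other.
import random

random.seed(42)
z = [random.gauss(0, 1) for _ in range(1000)]
x = [zi + random.gauss(0, 0.3) for zi in z]   # X depends on Z, not on Y
y = [zi + random.gauss(0, 0.3) for zi in z]   # Y depends on Z, not on X

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r = pearson(x, y)
print(round(r, 2))  # close to +0.9: strong correlation, no causal link
```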
Dr Sisay Debebe 1
11/1/2024
Regression analysis is the statistical analysis that formulates an algebraic relationship between two or more variables in order to estimate the parameters associated with some variables (the independent variables), given that the value of another variable (the dependent variable) is known. A regression model, or equation, is the mathematical equation used to predict or forecast the value of the dependent variable from the known independent variables.

Causation comes from theory rather than statistics; thus, regression does not necessarily imply causation. Correlation measures the strength of linear association between variables. In regression we have a stochastic dependent variable and non-stochastic (fixed) independent variables, while in correlation all the variables involved are stochastic.
Pearson correlation

[Figure: Examples of different values for linear correlation — (a) a perfect negative correlation, −1.00; (b) no linear trend, 0.00; (c) a strong positive relationship, approximately +0.90; (d) a relatively weak negative correlation, approximately −0.40.]

2. Use of regression analysis
To characterize the relationship (association or perhaps causality) between dependent and independent variables by determining the extent, direction and strength of the relationship.

To describe quantitatively or qualitatively the relationship between different observed variables.

To determine which of several independent variables are important, and which are not, for describing or predicting a dependent variable.

To determine the best mathematical model (or most parsimonious model) for describing the relationship between a dependent and one or more independent variables.

To obtain a quantitative formula or equation to describe the dependent variable as a function of the independent variables.

To assess the interactive effects of two or more independent variables with regard to a dependent variable.
What are some of the considerations we should make when choosing a statistical analysis? Some of the major considerations are:

❖ The purpose of the investigation,
❖ The mathematical characteristics of the variables,
❖ The statistical assumptions made about the variables,
❖ How the data were collected (sampling procedures).

What are some of the considerations or assumptions that lead to a choice of analysis? These are considerations made about the variables (continuous or discrete) that determine what type of model to apply to the data, and about how the data were collected (sampling method); the use of sample characteristics to estimate population characteristics such as the mean, variance, covariance, proportion, etc. depends on the sampling method used.
Dr Sisay Debebe 3
11/1/2024
• We have some intuition that the beta on this fund is positive, and we therefore want to find out whether there appears to be a relationship between x and y given the data that we have. The first stage would be to form a scatter plot of the two variables.
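The first-stage scatter plot described above might be produced as follows (a sketch with made-up return data; the variable values and file name are illustrative):

```python
# Sketch: scatter plot of the fund's excess return (y) against the
# market's excess return (x), as a first look at their relationship.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

x = [0.5, -1.2, 2.1, 0.8, -0.3, 1.5, -2.0, 0.2]   # market excess return, %
y = [0.9, -1.0, 2.5, 1.1, -0.1, 1.8, -2.3, 0.4]   # fund excess return, %

plt.scatter(x, y)
plt.xlabel("Excess return on market portfolio, x")
plt.ylabel("Excess return on fund, y")
plt.title("Scatter plot of y against x")
plt.savefig("scatter_xy.png")
```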
Explain the variables involved in a regression model

What is the significance of the stochastic term?
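As a compact reference for these two questions, the two-variable regression model with its stochastic term is conventionally written as:

```latex
\[
Y_i = \beta_0 + \beta_1 X_i + u_i , \qquad i = 1,\dots,n
\]
% Y_i            : dependent (response) variable -- stochastic
% X_i            : independent (explanatory) variable -- fixed in repeated samples
% \beta_0,\beta_1: intercept and slope parameters to be estimated
% u_i            : stochastic disturbance (error) term
```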
What are the justifications for the inclusion of the disturbance term in a regression model?

(1) Effect of variables omitted from the model: due to a variety of reasons, some variables (or factors) that affect the dependent variable may be left out of the model; their combined influence is captured by the disturbance term.

(2) The randomness of human behaviour: there is an unpredictable element of randomness in human responses.
Note: The main focus of this lecture is on Ordinary Least Squares (OLS), which is one of the most widely used methods of estimating the parameters of a regression model.
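For the two-variable model, OLS chooses the parameter estimates that minimize the sum of squared residuals; the resulting estimators, in the form used in Exercise I below, are:

```latex
\[
\min_{\hat\beta_0,\hat\beta_1}\ \sum_{t=1}^{n} \hat u_t^{\,2}
  = \sum_{t=1}^{n}\bigl(Y_t - \hat\beta_0 - \hat\beta_1 X_t\bigr)^2
\]
\[
\hat\beta_1 = \frac{\sum_t X_t Y_t - \bar Y \sum_t X_t}{\sum_t X_t^2 - \bar X \sum_t X_t},
\qquad
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
\]
```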
Ordinary Least Squares (OLS) Method…
2. Continuity
The dependent variable is a continuous random variable, whereas the values of the independent variables are fixed; they can take continuous or discrete values. Caution must be taken that if the dependent variable is not continuous, then other types of regression models, such as Probit, Logit, Tobit, etc., should be used instead.

3. Independence
The observations of the explanatory variables, X's, are statistically independent of each other. Mathematically, this means the covariance between any two observations is zero: the X observations are independent of each other, denoted Cov(Xi, Xj) = 0. However, it is not unusual for there to be some association between independent variables.
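A simple diagnostic for this assumption is to compute the sample covariance between pairs of explanatory variables (a sketch with made-up data; the variable interpretations are illustrative):

```python
# Sketch: checking the independence assumption Cov(X_i, X_j) = 0
# between two explanatory variables.
def covariance(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

x1 = [2, 4, 6, 8, 10]        # e.g. years of schooling (illustrative)
x2 = [1, 3, 2, 5, 4]         # e.g. years of experience (illustrative)

print(covariance(x1, x2))    # far from 0 -> some association between X's
```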
8. Non-Endogeneity
None of the independent variables should be correlated with the error term. A departure from this assumption (Cov(Xi, ui) ≠ 0), known as the endogeneity problem, occurs where irrelevant variables or lagged dependent variable(s) are introduced as independent variable(s) in a model. This leads to high standard errors and inefficient parameter estimates.

9. Linearity
The mean value of the response variable (Y) is a straight-line function of the independent variables, X's. Mathematically, the relation between the dependent variable and an independent variable is denoted E(Y|X) = β0 + β1X. A violation of this assumption may indicate that there is a non-linear relationship between the response and explanatory variables; thus, the linear regression model may not be applicable to, or fit, the data under consideration.
In fact, the closer the observations are to the fitted (estimated) regression line, the higher the share of the variation in the dependent variable explained by the estimated regression equation. Hence, the OLS method estimates the parameters by minimizing the ESS; equivalently, the OLS estimates minimize the mean squared error (MSE).

The total variation in the dependent variable, Y, is equal to the explained variation in the dependent variable plus the residual variation. Mathematically, TSS = RSS + ESS, where the difference between the Total Sum of Squares (TSS) and the Regression Sum of Squares (RSS) is the Error Sum of Squares (ESS): ESS = TSS − RSS.
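Written out, with this lecture's convention that RSS is the regression (explained) sum of squares and ESS the error (residual) sum of squares (note that some textbooks swap these two names), the decomposition is:

```latex
\[
\underbrace{\sum_t \bigl(Y_t - \bar Y\bigr)^2}_{TSS}
= \underbrace{\sum_t \bigl(\hat Y_t - \bar Y\bigr)^2}_{RSS}
+ \underbrace{\sum_t \bigl(Y_t - \hat Y_t\bigr)^2}_{ESS}
\]
\[
ESS = TSS - RSS,
\qquad
R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}
\]
```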
• Interpretation of R2
• Suppose R2 = 90%. This means that the regression line gives a good fit to the observed data, since the line explains 90% of the total variation of the Y values around their mean. The remaining 10% of the total variation in Y is unaccounted for by the regression line and is attributed to the factors captured by the disturbance term.
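The computation of R2 can be sketched as follows (illustrative data, not from the lecture; the sums follow this lecture's TSS/RSS/ESS naming):

```python
# Sketch: fitting a simple OLS line and computing R^2 from the
# TSS = RSS + ESS decomposition.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# OLS slope and intercept
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * x for x in xs]
tss = sum((y - mean_y) ** 2 for y in ys)            # total sum of squares
rss = sum((f - mean_y) ** 2 for f in fitted)        # regression (explained) SS
ess = sum((y - f) ** 2 for y, f in zip(ys, fitted)) # error (residual) SS

r2 = rss / tss
print(round(r2, 4))  # high R^2: the line explains ~99.8% of the variation
```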
Exercise I
Mean of adverts: $\bar X = 8.1$; mean of sales: $\bar Y = 22.7$.

Using the formulae to get the parameters, we get $\hat\beta_1$ first, since $\hat\beta_0$ includes $\hat\beta_1$:

\[
\hat\beta_1 = \frac{\sum X_t Y_t - \bar Y \sum X_t}{\sum X_t^2 - \bar X \sum X_t}
= \frac{1972 - 22.7\,(81)}{717 - 8.1\,(81)}
= \frac{133.3}{60.9}
= 2.189
\]

Getting $\hat\beta_0$:

\[
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
= 22.7 - 2.189\,(8.1)
= 22.7 - 17.731
= 4.969 \quad \text{(intercept)}
\]

Therefore the estimated regression line is:

\[
\hat Y_t = 4.969 + 2.189\, X_t
\]
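As a quick check, the hand calculation above can be reproduced from the slide's summary statistics (a sketch using only the totals shown on the slide; the small difference in the intercept comes from rounding β1 to 2.189 before the hand computation):

```python
# Reproducing Exercise I from the slide's summary statistics.
n = 10
sum_x = 81.0            # sum of X_t (adverts); mean 8.1
mean_x, mean_y = 8.1, 22.7
sum_xy = 1972.0         # sum of X_t * Y_t
sum_x2 = 717.0          # sum of X_t squared

beta1 = (sum_xy - mean_y * sum_x) / (sum_x2 - mean_x * sum_x)  # = 133.3 / 60.9
beta0 = mean_y - beta1 * mean_x

print(round(beta1, 3))  # 2.189
print(round(beta0, 3))  # ~4.97 (the slide gets 4.969 after rounding beta1 first)
```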
Use the above Exercise I dataset and run the regression in statistical software.

[Regression output residue: Source | SS | df | MS; Number of obs = 10]