Theme 3 Multivariate Regression Model
[Multivariate Model - Matrix Form - OLS Assumptions - Omitted Variable Bias - Redundant
Variables - Goodness of Fit - Adjusted R^2 - The Coefficient of Correlation - Practical
Applications]
Ø In regression analysis, our objective is not only to obtain the OLS estimators but also
to draw inferences about their true values in the population: we would like to know how
close the estimators are to their population counterparts.
Ø In this example, the coefficient of interest is β1, the ceteris paribus effect of
expenditure (expend) on average score (Av score). However, the model omits
variables that could also explain why students get higher scores, such as average
family income (Avginc). In a univariate OLS regression, family income is
absorbed into the error term u. We can later include other variables as well, such as
teacher quality and school size.
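Ø Let x2 be the omitted variable and x1 the included regressor. If corr(x1, x2) and
corr(x2, y) have the same sign, the bias is positive and the effect of x1 is overestimated
(e.g. family wealth, which is positively related to both expenditure and scores).
Ø If corr(x1, x2) and corr(x2, y) have opposite signs, the bias is negative and the effect
of x1 is underestimated. Note that the poverty rate is negatively related to both
expenditure and scores, so the two correlations share a sign and it falls under the first
case.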
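The sign rule can be made precise with the standard two-regressor result. As a sketch, assume the true model is y = β0 + β1 x1 + β2 x2 + u and that x2 is omitted. The symbols below are notation introduced here for illustration, not taken from the notes: \tilde{\beta}_1 is the slope from the short regression of y on x1 alone, and \tilde{\delta}_1 is the slope from regressing x2 on x1.

```latex
\mathrm{E}(\tilde{\beta}_1) = \beta_1 + \beta_2\,\tilde{\delta}_1,
\qquad
\mathrm{Bias}(\tilde{\beta}_1) = \beta_2\,\tilde{\delta}_1
```

Because \tilde{\delta}_1 has the same sign as corr(x1, x2), the bias is positive when β2 and corr(x1, x2) share a sign and negative when they differ.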
The problem arises when a significant variable that belongs to the true population
model is omitted, which leaves the model underspecified. Example 3.2.2:
Suppose that at the elementary school level, the average score for students on the
standardized exam (y) is determined by expenditure per student (x1) and the poverty rate (x2):
y = β0 + β1 x1 + β2 x2 + u
Ø Let's assume that we can only get information on the percentage rate of passing grade
for students and on expenditure per student, and that we have no information on the poverty rate.
Ø There is already ample evidence that children living in poverty tend to get lower
scores, which implies a negative correlation corr(y, x2) < 0. There may also be a
negative correlation corr(x1, x2) < 0 between average expenditure per student and the
poverty rate.
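A small simulation can illustrate what omitting the poverty rate does to the estimated effect of expenditure. This is a minimal sketch on synthetic data: the coefficient values, the sample size, and the names `expend`, `povrate`, `score` and `ols` are all assumptions made for illustration, not estimates from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (all numbers are illustrative).
expend = rng.normal(size=n)                    # x1: expenditure per student
povrate = -0.5 * expend + rng.normal(size=n)   # x2: built so that corr(x1, x2) < 0
u = rng.normal(size=n)
beta0, beta1, beta2 = 1.0, 2.0, -3.0           # beta2 < 0: poverty lowers scores
score = beta0 + beta1 * expend + beta2 * povrate + u


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_long = ols(score, expend, povrate)   # correctly specified model
b_short = ols(score, expend)           # poverty rate omitted

# Slope from regressing the omitted variable x2 on the included x1.
delta1 = ols(povrate, expend)[1]

bias_observed = b_short[1] - b_long[1]
bias_formula = b_long[2] * delta1      # sample analogue of beta2 * delta1

print("long-regression slope on expend: ", b_long[1])
print("short-regression slope on expend:", b_short[1])
print("observed bias vs formula:", bias_observed, bias_formula)
```

Under these assumptions both relationships are negative, so the short regression overstates the effect of expenditure, and the gap between the two slopes matches the product of the estimated β2 and the auxiliary slope.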
$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, 2, \dots, n$
In matrix form, $Y = Xb + u$, where:
Ø Y = an n x 1 vector of observations on the explained variable.
Ø X = an n x k matrix of observations on the explanatory variables (its first column is a
column of ones for the intercept β1).
Ø b = a k x 1 vector of parameters to be estimated.
Ø u = an n x 1 vector of errors.
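With the model written as Y = Xb + u, the OLS estimator solves the normal equations (X'X) b = X'Y. The sketch below shows this with NumPy on synthetic data; the parameter values in `b_true` and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3   # k counts the intercept, as in the model above

# Synthetic data: the first column of ones carries the intercept beta_1.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
b_true = np.array([1.0, 0.5, -2.0])   # illustrative parameter values
u = rng.normal(scale=0.1, size=n)
Y = X @ b_true + u

# Normal equations: (X'X) b = X'Y. Solving the system is numerically
# preferable to forming the inverse of X'X explicitly.
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# OLS residuals are orthogonal to every column of X.
residuals = Y - X @ b_hat
print(b_hat)
print(X.T @ residuals)
```

The second printed vector should be numerically zero, which is the sample counterpart of the orthogonality between regressors and residuals.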
3.4 Goodness of Fit:
Ø Goodness of fit is measured by R^2: we find out how "well" the sample regression line
fits the data by estimating the coefficient of determination R^2.
Ø TSS = ESS + RSS; dividing both sides by TSS gives
Ø 1 = ESS/TSS + RSS/TSS
R^2 = ESS/TSS = 1 – RSS/TSS
Where:
Ø ESS: explained sum of squares.
Ø RSS: residual sum of squares.
Ø TSS: total sum of squares.
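The decomposition can be checked numerically. The following sketch fits a simple regression with an intercept and computes the three sums of squares; the five data points and the helper name `r_squared` are made up for illustration.

```python
import numpy as np


def r_squared(y, y_hat):
    """Return (TSS, ESS, RSS, R^2) for fitted values from an OLS fit with intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    tss = np.sum((y - y.mean()) ** 2)       # total variation in y
    ess = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the fit
    rss = np.sum((y - y_hat) ** 2)          # unexplained (residual) variation
    return tss, ess, rss, 1.0 - rss / tss


# Made-up sample, used only to exercise the function.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

tss, ess, rss, r2 = r_squared(y, X @ b)
print(tss, ess + rss, r2)
```

The identity TSS = ESS + RSS relies on the regression including an intercept; without one, the two ways of writing R^2 need not agree.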
Ø Recall that R^2 never decreases as more variables are added to the model. The
adjusted R^2 takes the number of variables in the model into account and may decrease.
Ø It's easy to see that the adjusted R^2 is just 1 – (1 – R^2)(n – 1)/(n – k – 1), where k
here counts the regressors excluding the intercept; most packages will give you both R^2
and adj-R^2.
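$$\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}$$
Here SSR is the sum of squared residuals (RSS above) and SST is the total sum of squares (TSS above).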
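As a quick illustration of the penalty for extra regressors, the sketch below evaluates the adjusted R^2 as 1 – (1 – R^2)(n – 1)/(n – k – 1), with k counting regressors other than the intercept. The function name `adjusted_r2` and the R^2 values are hypothetical, chosen only to show that a small rise in R^2 can coincide with a fall in the adjusted measure.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k regressors (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 for positive residual degrees of freedom")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical comparison: two extra regressors raise R^2 from 0.50 to 0.51.
small_model = adjusted_r2(0.50, n=60, k=1)
large_model = adjusted_r2(0.51, n=60, k=3)
print(small_model, large_model)
```

In this made-up case the larger model has the lower adjusted R^2, because the gain in fit does not offset the two degrees of freedom it uses.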
Ø The correlation coefficient is a statistical measure that quantifies the strength and
direction of the linear relationship between two continuous variables. It is typically
denoted by the symbol "r" and takes values between -1 and 1; its sign gives the
direction of the relationship.
Ø The absolute value of the correlation coefficient (|r|) indicates the strength of the
relationship between the variables. The closer |r| is to 1, the stronger the linear
relationship; if |r| is close to 0, the linear relationship is weak or absent.
Ø To test significance, the null hypothesis (H0) typically assumes no correlation (r = 0),
while the alternative hypothesis (Ha) states that a correlation exists (r ≠ 0). A t-test
can be used to assess the significance of the correlation coefficient: under H0 the test
statistic follows a t-distribution, from which the p-value is calculated.
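The test described above is commonly carried out with the statistic t = r·sqrt((n – 2)/(1 – r^2)), compared against a t-distribution with n – 2 degrees of freedom. The notes do not state this formula, so treat the sketch below as one standard way to do it; the helper names and the six data points are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 points")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_t_stat(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom.

    Undefined when |r| == 1 (division by zero).
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up sample, used only to exercise the functions.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
r = pearson_r(x, y)
t = corr_t_stat(r, len(x))
print(r, t)
```

The p-value then comes from the t-distribution with n – 2 degrees of freedom; statistical packages usually report it directly alongside r.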
Ø A laboratory collected data on the cost of materials used for testing necessary
products over a one-year period. They want to know whether the costs of materials A, B and
C have a significant effect on the overall cost of testing. Examine the following tables
and answer the questions below. P value: 0.043
C) Using a significance level of 10%, analyse the global significance of the model.
Answer: If the p-value of the overall test is below the 10% significance level (the reported
p-value is 0.043 < 0.10), we reject H0 that all slope coefficients are jointly zero and conclude
that there is a relation between the costs of materials A, B and C and testing costs. Note that a
significance level of 10% is relatively high and increases the chance of a Type I error
(incorrectly rejecting a true null hypothesis).
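For reference, global significance is usually assessed with the overall F-test. The expression below is the standard form in terms of R^2, with k slope regressors and n observations. It is given as a sketch, since the exercise tables with the actual values are not reproduced in these notes.

```latex
H_0:\ \beta_A = \beta_B = \beta_C = 0,
\qquad
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} \sim F_{k,\; n-k-1} \ \text{under } H_0
```

The coefficient labels β_A, β_B and β_C are shorthand introduced here for the slopes on the three material costs; H0 is rejected when the p-value of F falls below the chosen significance level.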
D) Which of the three coefficients can be considered as the most efficient? Why?
The most efficient coefficient is the one with the lowest standard error and the highest
significance, which is cost component C, as it has a p-value of 0.03.
E) Which regressor(s) should we keep in our equation? Why?
We should keep the regressors that are statistically significant. An insignificant regressor
may simply have a high standard error, and retaining redundant regressors inflates the
variance of the other estimates, which harms efficiency and can lead us to misjudge the
significance of the variables used. Dropping a regressor that is actually relevant, on the
other hand, would compromise unbiasedness.
Ø We use monthly data on the excess return of two industry portfolios (consumer goods
and hi-tech) compiled by French. We regress the excess returns of the two
industries on the excess market return based on a value-weighted average of all
NYSE, AMEX, and NASDAQ firms (all returns are measured in percentage terms).
Using data from January 2000 to December 2004 (n=60), we obtain the following
estimates for the consumer goods portfolio (p-values in parentheses):
Ø We briefly investigate one version of multi-factor models using the so-called Fama-
French benchmark factors SMB (small minus big) and HML (high minus low) to test
whether excess returns depend on factors other than the market return. The factor
SMB measures the difference in returns of portfolios of small and large stocks and is
intended to measure the so-called size effect. HML measures the difference between
value stocks (having a high book value relative to their market value) and growth
stocks (with a low book-market ratio).
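The multi-factor regression described above is typically written as the market model extended by the two extra factors. The equation below is a sketch of that specification; the coefficient symbols α_i, β_i, s_i and h_i are notation assumed here, not taken from the notes.

```latex
R_{it} - R_{ft} = \alpha_i + \beta_i\,(R_{mt} - R_{ft}) + s_i\,\mathrm{SMB}_t + h_i\,\mathrm{HML}_t + \varepsilon_{it}
```

Here R_{it} – R_{ft} is the excess return of industry portfolio i and R_{mt} – R_{ft} is the excess market return. A significant s_i points to a size effect and a significant h_i to a value premium.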
Ø Consumer goods portfolio
The coefficients 0.624 and 1.74 indicate that a change in the (excess) market return by one
percentage point implies a change in the expected excess return by 0.624 percentage points
and 1.74 percentage points, respectively. In other words, the hi-tech portfolio has much
higher market risk than the consumer goods portfolio. The beta-factor remains significant in
both industries and changes only slightly compared to the market model estimates. However,
the results indicate a significant return premium for holding value stocks in the consumer
goods industry. For the hi-tech portfolio we find support for a size-effect. Overall, the results
can be viewed as supporting multi-factor models.