
LINEAR REGRESSION MODEL

Pham Nhu Man [email protected]


DEPENDENT AND INDEPENDENT VARIABLE

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 𝜷𝑖𝑛𝑐𝑜𝑚𝑒
Dependent variable Independent variable

▪ The dependent variable (𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘, 𝑌) is the variable whose value we are
interested in; it lies on the left-hand side of the function.
▪ The independent variable (𝑖𝑛𝑐𝑜𝑚𝑒, 𝑋) is the variable that has an impact on
the dependent variable; it lies on the right-hand side of the function.

𝑌 = 𝜷𝑋

PRACTICE
• Identify the dependent variable and independent variable

(1) wage (US$/hour) and years of schooling (years).

(2) rice yield (tons/hectare) and the amount of fertilizer (1000kg/hectare)

(3) the number of air conditioners sold per month and the price of an air
conditioner (mil. VND/air conditioner)
TWO-VARIABLE: WHY DO WE NEED TO ESTIMATE 𝛽?

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 𝜷𝑖𝑛𝑐𝑜𝑚𝑒
Dependent variable Independent variable

▪ WE NEED TO ESTIMATE 𝜷 to find out that when income increases by 1 mil. VND,
expenditure on drinking changes by 𝛽.
▪ But how can we estimate 𝛽?
TWO-VARIABLE: ESTIMATING 𝛽 USING RSTUDIO

▪ What is the intercept?
▪ “16.007”: What does it mean?
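A minimal sketch of this estimation step in R (the data frame dat and its
columns expdrink and income are illustrative names, not from the original
slides):

# Fit expdrink = b0 + b1*income by OLS; dat is a hypothetical data frame
model = lm(expdrink ~ income, data = dat)
# The Coefficients table reports the intercept (83.93 on the following
# slides) and the slope on income (16.007)
summary(model)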
THE COEFFICIENTS 𝛽
▪ 𝛽: regression/estimated coefficients
▪ 𝛽 shows the effect of the independent variable on the dependent variable.
▪ 𝛽: how much the dependent variable changes when the independent variable
increases/decreases by 1 unit.

TWO-VARIABLE: WHAT DOES 𝛽 MEAN?
We’ve estimated the above function. We have the regression function as

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 83.93 + 16.007𝑖𝑛𝑐𝑜𝑚𝑒


𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 𝛽0 + 𝛽1 𝑖𝑛𝑐𝑜𝑚𝑒

▪ 83.93 is the intercept (𝜷𝟎 )


▪ The intercept indicates that the mean value of 𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 83.93 when 𝑖𝑛𝑐𝑜𝑚𝑒 = 0.
▪ When income is equal to 0, the average expenditure on drinking is about 84
thousand VND per month.

TWO-VARIABLE: WHAT DOES 𝛽 MEAN?
We’ve estimated the above function. We have the regression function as

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 83.93 + 16.007𝑖𝑛𝑐𝑜𝑚𝑒


𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 𝛽0 + 𝛽1 𝑖𝑛𝑐𝑜𝑚𝑒

▪ 16.007 is the slope coefficient (𝜷₁), i.e. the slope of the regression line
relative to the horizontal (x) axis.
▪ The slope coefficient indicates that when income increases by 1 million VND/month,
the average expenditure on drinking increases by about 16 thousand VND per month.
TWO-VARIABLE: WHAT DOES 𝛽 MEAN?
We’ve estimated the above function. We have the regression function as

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 83.93 + 16.007𝑖𝑛𝑐𝑜𝑚𝑒

▪ Predict the average expenditure on drinking:

▪ When income is 10 mil. VND per month, the average expenditure on drinking is
83.93 + 16.007 × 10 ≈ 244 thousand VND per month.
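The same prediction in R, reusing the illustrative model object from the
earlier sketch:

# 83.93 + 16.007 * 10 = 244.0 (thousand VND/month)
predict(model, newdata = data.frame(income = 10))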
TWO-VARIABLE: THE MEANING OF 𝛽 IN THE GRAPH

[Figure: regression line 𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 83.93 + 16.007·𝑖𝑛𝑐𝑜𝑚𝑒 in the
(income, expdrink) plane; the line crosses the vertical axis at the intercept
𝛽₀ = 83.93 and has slope 𝛽₁ = 16.007.]

▪ If 𝛽 changes (negative/positive), the blue line in the graph will change.
▪ The 𝛽 coefficients determine the location of the regression line.
REGRESSION…
▪ evaluates the relationship between the dependent variable and one or more
independent variable(s).
▪ is the most important technique in econometrics.

TWO-VARIABLE REGRESSION MODEL

Theory
• Economic theories
• Logical thinking

Causal relationship
• Firm’s labor demand depends on salary levels
• Labor productivity relates to the number of years of experience of employees
• Household’s expenditure relates to their income

Regression model
• Expenditure is the dependent variable
• Income is the independent variable
THEORY AND PRACTICE: REGRESSION MODEL
▪ In economic theory, when income increases, expenditure also increases.
▪ In practice, we need a concrete numerical answer rather than a theoretical
statement.

➔ A particular dataset ➔ a specific regression model ➔ ESTIMATE 𝛽.
TWO-VARIABLE (SIMPLE) LINEAR REGRESSION MODEL

Economic theories / logical thinking ➔ Causal relationship ➔ Regression function

𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘 = 𝛽 𝑖𝑛𝑐𝑜𝑚𝑒
(𝑒𝑥𝑝𝑑𝑟𝑖𝑛𝑘: dependent variable; 𝛽: regression coefficient; 𝑖𝑛𝑐𝑜𝑚𝑒: independent variable)

➔ A two-variable (simple) regression function has one independent variable.
SUMMARY - REGRESSION ANALYSIS
▪ Regression analysis examines the cause-and-effect relationships of economic
theory through quantitative models linking
▪ a dependent variable
▪ and one or more independent variables.
➔ Regression models estimate 𝛽 coefficients, which indicate the impact of the
independent variables on the dependent variable.
THE PURPOSE OF TODAY’S LECTURE
▪ The purpose of today’s lecture is to show you how to

(1) estimate 𝛽 using the Ordinary Least Squares (OLS) method

(2) interpret the economic meaning of 𝛽

(3) test hypotheses about 𝛽
▪ The multiple regression model allows us to
▪ estimate the partial effect of each independent variable on the dependent
variable, holding the other variables unchanged,
▪ and improve the quality of the regression model.
▪ The general form of the LRM model is:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + ⋯ + 𝛽𝑘 𝑥𝑘𝑖 + 𝑒𝑖
where 𝑖 indicates the observation.
▪ Or, as written in short form:
𝑌 = 𝑋𝛽 + 𝑒
▪ 𝑌 is the regressand, or dependent/explained variable

▪ 𝑋 is a vector of regressors, or independent/explanatory variables

▪ 𝑒 is an error term/residual.
▪ OLS requires linearity in the coefficients.
▪ The linear regressions (linear in the coefficients):

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋₁ᵢ + 𝑒ᵢ
ln𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋₁ᵢ² + 𝑒ᵢ

▪ The nonlinear regression (nonlinear in the coefficients):

𝑌ᵢ = 𝛽₀ + 𝛽₁²𝑋₁ᵢ + 𝑒ᵢ
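In R, a model that is nonlinear in the variables but linear in the
coefficients can still be fitted with lm(); a sketch, with d, y and x as
illustrative names:

m1 = lm(y ~ x, data = d)            # Y = b0 + b1*X + e
m2 = lm(log(y) ~ I(x^2), data = d)  # ln(Y) = b0 + b1*X^2 + e
# Y = b0 + b1^2*X + e is nonlinear in the coefficient b1, so lm() does not
# apply; it needs a nonlinear estimator such as nls():
m3 = nls(y ~ b0 + b1^2 * x, data = d, start = list(b0 = 0, b1 = 1))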
BEST FIT LINE
▪ Most basic regressions do this: they try to find a line of BEST FIT.
➔ What does BEST FIT mean?
➔ A line of best fit is a straight line that minimizes the distance between the
actual and estimated 𝑌 values.

[Figure: scatter of the data with the fitted line 𝑌̂ᵢ = 𝛽̂₀ + 𝛽̂₁𝑋ᵢ; the line
crosses the vertical axis at 𝛽̂₀.]
ORDINARY LEAST SQUARES (OLS) METHOD

➔ The method of ordinary least squares (OLS) minimizes the sum of the squared
“distances” (residuals):

∑ᵢ₌₁ⁿ 𝑒ᵢ² = 𝑓(𝛽̂₀, 𝛽̂₁) ⇒ Min
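To make “minimizes the sum of the squared distances” concrete, a sketch that
minimizes the sum of squared residuals numerically and recovers
(approximately) the lm() coefficients; dat, expdrink and income are the same
illustrative names as before:

# Sum of squared residuals as a function of b = c(b0, b1)
ssr = function(b, y, x) sum((y - b[1] - b[2] * x)^2)
# Numerical minimization; compare with coef(lm(expdrink ~ income, data = dat))
optim(c(0, 0), ssr, y = dat$expdrink, x = dat$income)$par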
RESIDUALS
▪ Why do we need the residual?
▪ Omitted independent variables
▪ Measurement error
▪ Effects of random externalities
OPTIONAL

▪ The residual:

𝑒 = 𝑌 − 𝑋𝛽̂

▪ The sum of squared residuals:

𝑒′𝑒 = (𝑌 − 𝑋𝛽̂)′(𝑌 − 𝑋𝛽̂)

▪ To minimize 𝑒′𝑒 we need to find 𝛽̂ such that

∂(𝑒′𝑒)/∂𝛽̂ = −2𝑋′𝑌 + 2𝑋′𝑋𝛽̂ = 0

▪ Solving this first-order condition gives

𝛽̂ = (𝑋′𝑋)⁻¹𝑋′𝑌

▪ This is the closed-form solution for the coefficients.


OPTIONAL

▪ Under homoskedasticity, the variance-covariance matrix (VCV) of the coefficients is

𝑉𝐶𝑉 = 𝜎²(𝑋′𝑋)⁻¹

▪ where 𝜎² can be estimated by the mean squared error

𝑠² = 𝜎̂² = 𝑒′𝑒 / (𝑁 − 𝑘)

where 𝑁 is the number of observations and 𝑘 the number of coefficients.

▪ The standard errors of the OLS estimates are calculated as

𝑠𝑒 = √diag(𝑉𝐶𝑉)
OPTIONAL

▪ The t-statistics are

𝑡 = 𝑏 ÷ 𝑠𝑒

where ÷ indicates element-wise division of vectors.
▪ Each element 𝑗 of the p-value vector is given by

𝑝ⱼ = 2(1 − 𝐹(|𝑡ⱼ|, 𝑑𝑓))

where 𝑑𝑓 = 𝑁 − 𝑘 is the degrees of freedom and 𝐹 is the Student-t cumulative
distribution function. Note that 𝑁 is the number of observations (# rows of 𝑋).
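The matrix formulas above translate almost line by line into R; a
self-contained sketch on simulated data (all object names are illustrative):

set.seed(1)
n = 200
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)                    # true b0 = 1, b1 = 2

X  = cbind(1, x)                            # regressors with intercept column
b  = solve(t(X) %*% X) %*% t(X) %*% y       # closed form: (X'X)^-1 X'Y
e  = y - X %*% b                            # residuals
k  = ncol(X)
s2 = sum(e^2) / (n - k)                     # estimate of sigma^2
vcv  = s2 * solve(t(X) %*% X)               # VCV of the coefficients
se   = sqrt(diag(vcv))                      # standard errors
tval = b / se                               # t-statistics
pval = 2 * (1 - pt(abs(tval), df = n - k))  # two-sided p-values
cbind(b, se, tval, pval)                    # compare with summary(lm(y ~ x))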
OPTIONAL

▪ 𝑅², the coefficient of determination, is an overall measure of the goodness of fit
of the estimated regression line.
▪ It gives the percentage of the total variation in the dependent variable that is
explained by the regressors.
▪ It is a value between 0 (no fit) and 1 (perfect fit).
▪ Let: Explained Sum of Squares 𝐸𝑆𝑆 = ∑(𝑌̂ − 𝑌̄)²
▪ Residual Sum of Squares 𝑅𝑆𝑆 = ∑𝑒²
▪ Total Sum of Squares 𝑇𝑆𝑆 = ∑(𝑌 − 𝑌̄)²
▪ Then: 𝑅² = 𝐸𝑆𝑆/𝑇𝑆𝑆 = 1 − 𝑅𝑆𝑆/𝑇𝑆𝑆
▪ A higher 𝑅² indicates a better fit.
▪ When 𝑅² = 1, 𝑅𝑆𝑆 = 0 and ∑𝑒² = 0.
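A short R check of these identities, continuing the simulated example from the
previous sketch:

fit  = lm(y ~ x)
ybar = mean(y)
ESS  = sum((fitted(fit) - ybar)^2)  # explained sum of squares
RSS  = sum(residuals(fit)^2)        # residual sum of squares
TSS  = sum((y - ybar)^2)            # total sum of squares
ESS / TSS                           # R^2
1 - RSS / TSS                       # the same value
summary(fit)$r.squared              # matches lm's reported R^2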
OPTIONAL

▪ 𝑁 is the number of observations

▪ 𝑘 is the number of estimated coefficients

▪ 𝑑𝑓 for 𝑅𝑆𝑆 = 𝑁 − 𝑘
OPTIONAL

▪ 𝑅² increases when more regressors are added.

▪ Sometimes researchers play the game of “maximizing” 𝑅² (some think that the higher
the 𝑅², the better the model. BUT THIS IS NOT NECESSARILY TRUE!)
▪ To avoid this temptation, 𝑅² should take into account the number of regressors.

▪ Such an 𝑅² is called an adjusted 𝑅², denoted 𝑅̄² (R-bar squared), and is computed
from the (unadjusted) 𝑅² as follows:

𝑅̄² = 1 − (1 − 𝑅²) · (𝑛 − 1)/(𝑛 − 𝑘)
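The adjustment in R, using the illustrative fit object from the previous
sketch (n = 200 observations, k = 2 coefficients):

r2 = summary(fit)$r.squared
1 - (1 - r2) * (200 - 1) / (200 - 2)  # adjusted R^2 by the formula above
summary(fit)$adj.r.squared            # matches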
A survey of 20,306 individuals in the U.S.

male      1 = male; 2 = female
age       age (years)
wage      wage (US$/hour)
tenure    # years working for current employer
union     1 = union member, 0 otherwise
edu       years of schooling (years)
married   1 = married or living together with a partner, 0 otherwise

Data file: lrm.xlsx.


▪ For dummy variables, the mean and sd do not make a lot of sense.
▪ We present the frequency of each outcome instead (see the sketch below).
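For example, assuming the dataset is loaded as Z (the name used in the
histogram commands below):

table(Z$male)                 # counts of each outcome
table(Z$union)
prop.table(table(Z$married))  # shares instead of counts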
▪ Command:
hist(Z$wage, main = "Histogram of wage", xlab = "Wage", col = "yellow",
     breaks = 100, freq = TRUE)
▪ Now limit the range of the X axis to (0, 100):
hist(Z$wage, main = "Histogram of wage", xlab = "Wage", col = "yellow",
     breaks = 1000, xlim = c(0, 100))
▪ To run a linear regression model, use:
lm(formula = <formula>, data = <dataset>)

▪ Mind that the data is not the first argument, so if you use lm() in a pipe,
you have to write: model = lm(<formula>, data = .)
▪ The formula is: dependent variable ~ independent variable
▪ Try: model = lm(wage ~ edu, data = Z)
▪ Regression and display of the result in one line of code (see the sketch below):
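One way to do this, wrapping lm() in summary(); the exact regressor set here
is an assumption based on the variables interpreted on the following slides:

summary(lm(wage ~ edu + union + male + married + age + tenure, data = Z))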
Section 1: a summary of the distribution of residuals from the regression model.

▪ Recall that a residual is the difference between the observed value and the
value predicted by the regression model.
▪ The minimum residual is −31.5, the median residual is −2.96, and the maximum
residual is 462.39.
Section 2: the estimated coefficients of the regression model.

▪ One more year of schooling results in a wage increase of US$2/hour, ceteris
paribus.
▪ Union members earn US$0.4/hour more than non-members, holding other variables
constant.
▪ Men earn US$5.8/hour more than women, holding other variables constant.
Standard error of the estimates
▪ The standard error of a coefficient is a measure of the uncertainty in our
estimate of that coefficient.
t-value

▪ The t-value shows the importance of a variable in the model.
▪ The t-value tests the null hypothesis that the coefficient equals 0. To reject
this at 95% confidence (in a large sample), you need a |t-value| greater
than 1.96.
Significance of the indep. variable
Pr(>|t|)
▪ The p-value corresponds to the t-statistic:
▪ If the p-value of the t-stat is below the significance level 𝛼 (e.g. 0.1, i.e.
10%), the independent variable is said to be statistically significant.
Assessing Model Fit
▪ This section displays various numbers that
help us assess how well the regression model
fits our dataset.

Section 3: How well the regression


model fits our dataset
Residual standard error
▪ Residual standard error: the average
distance that the observed values fall from
the regression line. The smaller the value,
the better the regression model is able to fit
the data.
▪ Residual standard error is a measure of the
quality of a linear regression fit.
Multiple R-squared

▪ Multiple R-Squared (coefficient of determination):


the proportion of the variance in the dependent
variable that can be explained by the independent
variables.
▪ This value ranges from 0 to 1. The closer it is to 1,
the better the predictor variables are able to
predict the value of the response variable.
[Figure: total variance of 𝑌 decomposed into explained and unexplained variance.]

▪ R-squared shows the proportion of the variance of 𝑌 explained by 𝑋.
Adjusted R-squared

▪ Adjusted R-squared: adjusted for the number of


predictors in the model.
▪ 𝐴𝑑𝑗 𝑅2 < 𝑅2 .
▪ When the number of variables is small and the
number of cases is very large, then 𝐴𝑑𝑗 𝑅 2 is close to
𝑅2 . This provides a more honest association
between 𝑋 and 𝑌 .
F-statistic

▪ F-statistic: This indicates whether the


regression model provides a better fit
to the data than a model that contains
no independent variables.
p-value corresponding to the F-statistic

▪ If the p-value of the F-statistic is less than the chosen significance level,
then the regression model fits the data better than a model with no regressors.
▪ A p-value < 𝛼 = 0.1 (10%) indicates that the regressors are actually useful
for predicting the value of the dependent variable.
▪ If we remove the outliers of wage (wage > 100), the result will be:
▪ One more year of schooling results in an average increase in wage of
US$1.8/hour.
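A sketch of this re-estimation in R, with the same assumed regressor set as
before:

# Drop wage outliers (wage > 100) and re-estimate
model_trim = lm(wage ~ edu + union + male + married + age + tenure,
                data = subset(Z, wage <= 100))
summary(model_trim)  # the edu coefficient falls to about 1.8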
Testing individual coefficient: t test
Testing multiple coefficients: F test
TESTING INDIVIDUAL COEFFICIENT: t TEST
▪ To test the following hypothesis:
• 𝐻₀: 𝛽ₖ = 0
• 𝐻₁: 𝛽ₖ ≠ 0
▪ Calculate the statistic below and use the 𝑡 table to obtain the critical 𝑡
value with 𝑛 − 𝑘 degrees of freedom for a given level of significance 𝛼
(10%, 5%, or 1%):

𝑡 = 𝑏ₖ / 𝑠𝑒(𝑏ₖ)

▪ If |𝑡| is greater than the critical 𝑡 value, we can reject 𝐻₀.
TESTING INDIVIDUAL COEFFICIENT: t TEST
▪ Step 1: Form hypotheses
• 𝐻₀: 𝛽ₘ = 0
• 𝐻₁: 𝛽ₘ ≠ 0
▪ Step 2: Determine the confidence interval, the critical value 𝑡*(𝛼/2, 𝑛 − 𝑘),
the region of rejection, and the region of acceptance.
▪ Step 3: Calculate the test statistic

𝑡 = 𝛽̂ₘ / 𝑠𝑒(𝛽̂ₘ)

▪ Step 4: Decide
▪ If |𝑡| > 𝑡*(𝛼/2, 𝑛 − 𝑘), reject 𝐻₀ at the significance level 𝛼
▪ If p-value < 𝛼, reject 𝐻₀ at the significance level 𝛼

[Figure: two-sided 𝑡 distribution with the region of acceptance in the middle
and regions of rejection in both tails; each tail has area p-value/2.]

➔ The hypothesis that schooling years have no impact on wage is rejected at 10%
(and even at 1%).
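The four steps can be reproduced by hand from the lm() output; a sketch using
the illustrative model_trim object, with edu as the example coefficient:

ct   = coef(summary(model_trim))  # Estimate, Std. Error, t value, Pr(>|t|)
tval = ct["edu", "Estimate"] / ct["edu", "Std. Error"]  # Step 3
df   = df.residual(model_trim)    # n - k
qt(1 - 0.10 / 2, df)              # Step 2: critical value at alpha = 10%
2 * (1 - pt(abs(tval), df))       # Step 4: p-value, compare with alpha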
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST
▪ Step 1: Form hypotheses
• 𝐻₀: 𝛽ₘ₊₁ = 𝛽ₘ₊₂ = ⋯ = 𝛽ₖ = 0
• 𝐻₁: at least one 𝛽 is different from 0
▪ Step 2: Calculate the test statistic (𝐹)

𝐹𝑐 = [(𝑅𝑆𝑆_R − 𝑅𝑆𝑆_U)/(𝑑𝑓_R − 𝑑𝑓_U)] / (𝑅𝑆𝑆_U/𝑑𝑓_U)
𝑑𝑓_U = 𝑛 − 𝑘
𝑑𝑓_R = 𝑛 − 𝑚
𝑑𝑓_R − 𝑑𝑓_U = 𝑘 − 𝑚

▪ Step 3: Determine the critical value

𝐹*(𝑘 − 𝑚, 𝑛 − 𝑘; 𝛼)
• (𝑘 − 𝑚) degrees of freedom for the numerator
• (𝑛 − 𝑘) degrees of freedom for the denominator
▪ Step 4: Decide
• If 𝐹𝑐 > 𝐹*, or
• p-value = 𝑃(𝐹 > 𝐹𝑐) < 𝛼

⇒ reject 𝐻₀ at the significance level 𝛼

➔ The hypothesis that the coefficients on male, married and age are equal to
zero simultaneously is rejected at 1%.
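In R this F test can be run by comparing the restricted and unrestricted
models with anova(); a sketch with the same assumed regressor set:

unrestricted = lm(wage ~ edu + union + male + married + age + tenure, data = Z)
restricted   = lm(wage ~ edu + union + tenure, data = Z)  # drops male, married, age
anova(restricted, unrestricted)  # F statistic and p-value for H0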
F TEST FOR OVERALL SIGNIFICANCE
… is the F test of the null hypothesis that all coefficients are equal to zero
simultaneously.

➔ The hypothesis that all coefficients are equal to zero simultaneously is
rejected at 1%.
▪ A1: Linear relationship
▪ A2: No perfect collinearity
▪ A3: Residual has zero mean
▪ A4: Homoskedasticity
▪ A5: Exogeneity of regressors
▪ A6: Normality of residuals
▪ A7: Correctly specified
▪ A1: Linear relationship
𝑌 = 𝑋𝛽 + 𝑒

▪ A3: Zero mean of the residual
𝐸(𝑒|𝑋) = 0
This also implies 𝐸(𝑌|𝑋) = 𝑋𝛽

▪ A2: No perfect collinearity (𝑋 is of full rank, i.e. the columns of 𝑋 are
linearly independent)
▪ Note: we can estimate a model with imperfect collinearity, but this results in
multicollinearity.
▪ A4 implies that the error term has a constant variance
▪ As a result, the VCV of the error term is

𝐸(𝑒𝑒′|𝑋) = 𝜎²𝐼

▪ This assumption is used to calculate the VCV of the coefficients (which is used
to calculate the standard errors).
▪ Violation of this assumption results in heteroskedasticity, which gives biased
estimates of the standard errors.
OPTIONAL

▪ A5 (exogeneity) says that 𝑋 is unrelated to 𝑒

▪ Violation of this assumption results in the problem of endogeneity, which
produces biased estimates of the coefficients.
▪ On the basis of assumptions A1 to A5 (A6 is not necessary),
the OLS method gives best linear unbiased estimators (BLUE):
• (1) The estimators are linear functions of the dependent variable 𝑌.
• (2) The estimators are unbiased: in repeated applications of the
method, the estimators equal their true values on average.
• (3) In the class of linear unbiased estimators, OLS estimators have minimum
variance; i.e., they are efficient, or the “best” estimators.
A6
▪ A6 says that
𝑒 ~ 𝑁(0, 𝜎²)
▪ This is not required for estimation, but for hypothesis testing.
▪ If the residuals follow a normal distribution, then the t-statistics of the
coefficients follow the Student-t distribution.
▪ Otherwise, the t-statistics do not really follow the Student-t distribution.
▪ But this should not be a problem with a large sample.
A7
▪ A7 says that there is
▪ no specification bias
▪ no specification error
▪ We also assume that 𝑁 > 𝑘.
