
Topic 2

Linear Regression Model

Pham Nhu Man


[email protected]

OUTLINE
• Variables
• Regression analysis
• Regression function
• Linear Regression Model (LRM)
• Regression coefficients
• Ordinary Least Squares (OLS) method
• The estimation procedure
• Estimation of the LRM with R
• Summary statistics
• Regression
• Interpreting regression results
• Hypothesis testing: T-test vs. F-test
• Assumptions of classical LRM
THE IDEA BEHIND REGRESSION

» We want to relate two different variables – how does one affect the other?
» In particular, we want to know how much $y$ changes when $x$ increases or decreases by 1 unit.
» In doing so, we need the function
$$y = \beta x$$
This function tells us that when $x$ increases by 1 unit, $y$ changes by $\beta$.
» How?
• By producing a line of best fit
» What does best fit mean?

REGRESSION ANALYSIS

» Imagine we have a set of observational data


» The most basic regression does this: try to get a line of best fit for these points.
» The method of Ordinary Least Squares (OLS) minimizes the sum of the squared "distances".

REGRESSION FUNCTION
[Figure: scatter plot of observations with the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$; each residual $e_i$ is the vertical distance between the observed point $(x_i, y_i)$ and the line.]
THE LINEAR REGRESSION MODEL (LRM)
» The general form of the LRM is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + e_i$$

where 𝑖 indicates the observation.

» Or, as written in short form:


𝑌 = 𝑋𝛽 + 𝑒

» 𝑌 is the regressand, or dependent/explained variable

» 𝑋 is a vector of regressors, or independent/explanatory variables

» 𝑒 is an error term/residual.
REGRESSION COEFFICIENTS
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + e_i$$
» 𝛽0 is the intercept/constant
» 𝛽1 to 𝛽𝑘 are the slope coefficients
» In general, the $\beta$ are the regression coefficients or regression parameters. THEY ARE WHAT WE NEED TO ESTIMATE!
» Each slope coefficient indicates the (partial) rate of change in the mean value of $Y$ for a unit change in the value of a regressor, ceteris paribus.
» Roughly speaking: $\beta_1$ tells us that when $x_1$ increases by one unit, $y$ changes by $\beta_1$, other things (all other X) unchanged.
METHOD OF ORDINARY LEAST SQUARES

» The method of Ordinary Least Squares (OLS) searches for the coefficients that minimize the residual sum of squares (RSS):
$$RSS = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

» We need a data set of 𝑌 and 𝑋 to find 𝛽.


» Finding 𝛽 is an optimization problem.
THE ESTIMATION PROCEDURE: COEFFICIENTS
» The residual
$$e = Y - X\hat{\beta}$$
» The sum of squares
$$e'e = (Y - X\hat{\beta})'(Y - X\hat{\beta})$$
» To minimize $e'e$ we need to find $\hat{\beta}$ such that the first order condition holds:
$$\frac{\partial e'e}{\partial \hat{\beta}} = -2X'Y + 2X'X\hat{\beta} = 0$$
» Solving the first order condition gives
$$\hat{\beta} = (X'X)^{-1}X'Y$$
» This is the closed-form solution for the coefficients.
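A minimal sketch of this closed-form solution in R, using simulated data (the variable names and data-generating values are illustrative, not from the slides); the manual estimate is compared with lm():

# Sketch: OLS coefficients via the closed-form solution (X'X)^{-1} X'Y
set.seed(1)
N <- 200
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(N)    # illustrative true model

X <- cbind(1, x1, x2)                       # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

beta_hat                                    # manual OLS estimates
coef(lm(y ~ x1 + x2))                       # should match lm()'s estimates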


ESTIMATION OF THE LRM WITH R
DATA SAMPLE
Survey population: students in Ho Chi Minh City who consume instant noodles in 2023
Sample size: 900 – 1000
Variables:
» noodle instant noodle consumption (packs/month)
» income disposable income (mil. VND/month)
» price price of instant noodle (1,000 VND/pack)
» distance distance from home to the frequently visited instant noodle store (100 m)
» gender 1 = male; 2 = female
» family
» region hometown, "HCMC" = HCMC, "ocities" = from other big cities, "rural" = from rural area
» ocities a dummy variable, takes 1 if region = "ocities"
» rural a dummy variable, takes 1 if region = “rural“
» rent expenses for accommodation, rent (million VND/month)
» housemates number of people currently living in the same house (person)

Data file: noodle13.csv


IMPORTING DATA
DESCRIBING THE IMPORTED DATA
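The screenshots for these two slides are not reproduced; a minimal sketch of the likely commands, assuming noodle13.csv sits in the working directory:

# Sketch: import the data file named above and inspect its structure
data <- read.csv("noodle13.csv")
str(data)    # variable names and types
head(data)   # first few rows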
SUMMARY STATISTICS: CONTINUOUS VARIABLES

» For continuous variables, we present the mean, S.D., min, and max, which show the distribution of the variables.
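A sketch of one way to produce such a table in R, assuming the continuous variables named in the data description:

# Sketch: mean, S.D., min, and max for the continuous variables
cont_vars <- c("noodle", "income", "price", "distance", "rent", "housemates")
t(sapply(data[cont_vars], function(v)
  c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))))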
SUMMARY STATISTICS: DUMMY, CATEGORY VARIABLES

» For dummy and category variables, the mean and S.D. do not make a lot of sense.
» We present the frequency of each outcome instead.
SUMMARY STATISTICS: DUMMY, CATEGORY VARIABLES

» For dummy and category variables, we also present the frequency and the percentage of each outcome.
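A sketch of the frequency and percentage tables in R, using the region variable as an example:

# Sketch: counts and percentages for a categorical variable
table(data$region)                               # frequency of each outcome
round(100 * prop.table(table(data$region)), 1)   # percentage of each outcome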
MORE DETAILED DESCRIPTION: HISTOGRAM

» Command:
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, freq = TRUE)
MORE DETAILED DESCRIPTION: HISTOGRAM
» Now limit the range of the X axis to (0, 45):
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, xlim = c(0,45), freq = TRUE)
BIVARIATE ANALYSIS: SCATTER PLOT
» Command:
plot(data$income, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Disposable income (mil.VND/month)")
BIVARIATE ANALYSIS: SCATTER PLOT
plot(data$price, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Price of noodle (1000VND/pack)")
SCATTER PLOT
plot(data$distance, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Distance from home to the frequently visited instant noodle store (100m)", ylim = c(0,100))
BIVARIATE ANALYSIS: COMPARING NOODLE BETWEEN GROUPS
LINEAR REGRESSION MODEL: RESULTS

When the disposable income increases by one million VND per month, instant noodle consumption increases by 0.75 pack/month, other things equal.
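The regression output itself is a screenshot that is not reproduced here; a hedged sketch of the estimation call that would produce such results (the exact regressor set is an assumption):

# Sketch: estimate the LRM; the regressor set is assumed, not confirmed
model <- lm(noodle ~ income + price + distance + gender, data = data)
summary(model)   # coefficients, standard errors, t-values, p-values, R-squared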
LINEAR REGRESSION MODEL: RESULTS
» You can display the result with one line of code.

When the price of instant noodle increases by one thousand VND per pack, instant noodle consumption decreases by 0.58 pack/month.
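A sketch of the one-line version, under the same assumed specification:

# Sketch: estimate and display the results in a single call
summary(lm(noodle ~ income + price + distance + gender, data = data))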
LRM WITHOUT OUTLIERS
LRM WITHOUT OUTLIERS
» One line of code:
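A sketch of re-estimating without outliers; the cutoff of 45 packs/month is an assumption taken from the histogram's x-axis limit above:

# Sketch: drop extreme noodle values and re-estimate in one line
summary(lm(noodle ~ income + price + distance + gender, data = subset(data, noodle <= 45)))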
STANDARD ERRORS

» Under homoskedasticity, the variance-covariance matrix (VCV) of the coefficients is
$$VCV = \sigma^2 (X'X)^{-1}$$
» where $\sigma^2$ can be estimated by the mean squared error
$$s^2 = \hat{\sigma}^2 = \frac{e'e}{N - k}$$
where $N$ is the number of observations and $k$ the number of coefficients.

» The standard errors of the OLS estimates are calculated by
$$se = \sqrt{\mathrm{diag}(VCV)}$$
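A sketch of these formulas in R, continuing the simulated example from the closed-form section above (X, y, and beta_hat are reused from that sketch):

# Sketch: standard errors from the closed-form VCV
e   <- y - X %*% beta_hat                             # residuals
s2  <- as.numeric(t(e) %*% e) / (nrow(X) - ncol(X))   # sigma^2 estimate, e'e / (N - k)
VCV <- s2 * solve(t(X) %*% X)                         # variance-covariance matrix
se  <- sqrt(diag(VCV))                                # standard errors
se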
T-STATISTICS AND P-VALUES

» The t-statistics are
$$t = b \div se$$
where $\div$ indicates element-wise division of vectors.
» Each element $j$ of the p-value vector is given by
$$p_j = 2\left(1 - F(|t_j|, df)\right)$$
where $df = N - k$ is the degrees of freedom and $F$ the Student's t cumulative distribution function. Note that $N$ is the number of observations (# rows of $X$) and $k$ the number of coefficients.
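Continuing the same simulated example, a sketch of these two formulas in R:

# Sketch: t-statistics and two-sided p-values
t_stats <- beta_hat / se                   # element-wise division
df      <- nrow(X) - ncol(X)               # N - k degrees of freedom
p_vals  <- 2 * (1 - pt(abs(t_stats), df))  # pt() is the Student-t CDF
cbind(beta_hat, se, t_stats, p_vals)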
GOODNESS OF FIT: R2
» 𝑅2 , the coefficient of determination, is an overall measure of goodness of fit of the estimated regression
line.
» Gives the percentage of the total variation in the dependent variable that is explained by the regressors.
» It is a value between 0 (no fit) and 1 (perfect fit).
» Let:
Explained Sum of Squares $ESS = \sum (\hat{Y} - \bar{Y})^2$
Residual Sum of Squares $RSS = \sum e^2$
Total Sum of Squares $TSS = \sum (Y - \bar{Y})^2$
» Then:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
» Higher 𝑅2 indicates better fit.
» When $R^2 = 1$, $RSS = \sum e^2 = 0$.
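A sketch of computing $R^2$ by hand, continuing the simulated matrix example:

# Sketch: R-squared from its two equivalent forms
y_hat <- X %*% beta_hat
TSS <- sum((y - mean(y))^2)
RSS <- sum((y - y_hat)^2)
ESS <- sum((y_hat - mean(y))^2)
c(R2 = ESS / TSS, R2_check = 1 - RSS / TSS)   # the two forms agree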
DEGREE OF FREEDOM 𝑑𝑓
» 𝑁 is number of observations
» 𝑘 is number of estimated coefficients
» 𝑑𝑓 for 𝑅𝑆𝑆 = 𝑁 − 𝑘
GOODNESS OF FIT: R SQUARED ADJUSTED

» $R^2$ never decreases when more regressors are added.
» Sometimes researchers play the game of "maximizing" $R^2$ (some think that the higher the $R^2$, the better the model. BUT THIS IS NOT NECESSARILY TRUE!)
» To avoid this temptation, $R^2$ should take into account the number of regressors.
» Such an $R^2$ is called the adjusted $R^2$, denoted $\bar{R}^2$ (R-bar squared), and is computed from the (unadjusted) $R^2$ as follows:
$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k}$$
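A sketch of the adjustment, reusing RSS and TSS from the $R^2$ example:

# Sketch: adjusted R-squared from the unadjusted R-squared
n <- nrow(X); k <- ncol(X)
R2     <- 1 - RSS / TSS
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - k)
R2_adj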
HYPOTHESIS TESTING

▪ Testing individual coefficient: T-test


▪ Testing multiple coefficients: F-test

TESTING INDIVIDUAL COEFFICIENT: T TEST

» To test the following hypothesis:
• $H_0: \beta_k = 0$
• $H_1: \beta_k \neq 0$
» Calculate the following statistic and use the $t$ table to obtain the critical $t$ value with $n - k$ degrees of freedom for a given level of significance ($\alpha$, e.g. 10%, 5%, or 1%):
$$t = \frac{b_k}{se(b_k)}$$
» If $|t|$ is greater than the critical $t$ value, we can reject $H_0$.
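A sketch of this decision rule in R, continuing the simulated example:

# Sketch: two-sided t-test decision for each coefficient
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df)   # critical value with N - k df
abs(t_stats) > t_crit             # TRUE means reject H0: beta_j = 0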
TESTING INDIVIDUAL COEFFICIENT: T TEST

» Step 1: Form hypotheses
• $H_0: \beta_m = 0$
• $H_1: \beta_m \neq 0$
» Step 2: Determine the confidence interval, critical values, region of rejection, and region of acceptance:
$$t^*_{\alpha/2,\, n-k}$$
» Step 3: Calculate the test statistic
$$t_{tt} = \frac{\hat{\beta}_m}{s_{\hat{\beta}_m}}$$
» Step 4: Decide
TESTING INDIVIDUAL COEFFICIENT: T TEST
» If $|t_{tt}| > t^*_{\alpha/2,\, n-k}$: reject $H_0$ at the significance level $\alpha$
» If $P_{value} < \alpha$: reject $H_0$ at the significance level $\alpha$

[Figure: Student-t density with the region of acceptance in the middle and a region of rejection in each tail; the shaded area beyond the observed statistic in one tail is $P$-value$/2$.]

TESTING INDIVIDUAL COEFFICIENT: T TEST

The null hypothesis that income has no impact on noodle is rejected at the 10% significance level (and even at 1%).
TESTING INDIVIDUAL COEFFICIENT: T TEST

The null hypothesis that distance has no impact on noodle is not rejected at the 10% significance level.
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

» Step 1: Form hypotheses
• $H_0: \beta_{m+1} = \beta_{m+2} = \dots = \beta_k = 0$
• $H_1$: at least one $\beta$ is different from 0
» Step 2: Calculate the test statistic ($F$)
$$F_c = \frac{(RSS_R - RSS_U)/(df_R - df_U)}{RSS_U / df_U}$$
where $df_U = n - k$, $df_R = n - m$, and $df_R - df_U = k - m$.
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

» Step 3: Determine the critical value
$$F^* = F_{k-m,\, n-k}(\alpha)$$
• $(k - m)$ degrees of freedom for the numerator
• $(n - k)$ degrees of freedom for the denominator
» Step 4: Decide
• If $F_{tt} > F^*$, or
• $P_{value} = P(F > F_{tt}) < \alpha$
=> Reject 𝐻0 at the significance level of 𝛼
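A sketch of this F-test in R via a nested-model comparison; the two specifications are assumptions matching the income-and-gender example below:

# Sketch: joint F-test by comparing restricted and unrestricted models
unrestricted <- lm(noodle ~ income + price + distance + gender, data = data)
restricted   <- lm(noodle ~ price + distance, data = data)   # drops income, gender
anova(restricted, unrestricted)   # reports the F statistic and its p-value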
FISHER DISTRIBUTION
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

The hypothesis that the coefficients on income and gender are simultaneously equal to zero is rejected at the 10% level (and even at 1%).
F TEST FOR OVERALL SIGNIFICANCE

The F-statistic reported in the regression output is the F-test of the null hypothesis that all slope coefficients are simultaneously equal to zero.
F TEST FOR OVERALL SIGNIFICANCE

The hypothesis that all slope coefficients are simultaneously equal to zero is rejected at the 10% level (and even at 1%).
ASSUMPTIONS OF THE CLASSICAL LRM
▪ A1: Linear relationship
▪ A2: No perfect collinearity
▪ A3: Residual has zero mean
▪ A4: Homoskedasticity
▪ A5: Exogeneity of regressors
▪ A6: Normality of residuals

MAINTAINED ASSUMPTIONS
» A1: Linear relationship
𝑌 = 𝑋𝛽 + 𝑒

» A3: Zero mean of the residual


$$E(e|X) = 0$$
This also implies $E(Y|X) = X\beta$.

» A2: No perfect collinearity ($X$ has full rank, i.e. the columns of $X$ are linearly independent)
» Note: we can still estimate a model with imperfect collinearity, but strong correlation among regressors causes the problem of multicollinearity.
HOMOSKEDASTICITY
» A4 implies that the error term has a constant variance
» As a result, the VCV of the error term is

$$E(ee'|X) = \sigma^2 I$$

» This assumption is used to calculate the VCV of coefficients (which is used to calculate standard errors).
» Violation of this assumption results in heteroskedasticity, which gives biased estimates of the standard errors.
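As an aside, a sketch of one common remedy in R, assuming the 'sandwich' and 'lmtest' packages are installed and 'model' is the fitted LRM assumed earlier:

# Sketch: heteroskedasticity-robust (HC1) standard errors
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))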
EXOGENEITY OF REGRESSORS
» A5 requires that the regressors $X$ are unrelated to the error term $e$.
» Violation of this assumption results in the problem of endogeneity, which produces biased estimated coefficients.
GAUSS – MARKOV THEOREM

» On the basis of assumptions A1 to A5 (A6 is not necessary), the OLS method gives best linear
unbiased estimators (BLUE):
• (1) Estimators are linear functions of the dependent variable Y.
• (2) The estimators are unbiased; in repeated applications of the method, the estimators on average equal their true values.
• (3) In the class of linear estimators, OLS estimators have minimum variance; i.e., they are
efficient, or the “best” estimators.
NORMALITY OF RESIDUAL [A6]
» A6 says that
$$e \sim N(0, \sigma^2)$$
» This is not required for estimating the model, but for hypothesis testing.
» If the residuals follow a normal distribution, then the t-ratios of the coefficients follow the Student-t distribution.
» Otherwise, the t-ratios do not exactly follow the Student-t distribution.
» But this should not be a problem with a large sample.
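A sketch of a quick visual check of residual normality in R, for a fitted model such as the one assumed earlier:

# Sketch: inspect the distribution of the residuals
res <- resid(model)
hist(res, breaks = 30, main = "Histogram of residuals")
qqnorm(res); qqline(res)   # points near the line suggest approximate normality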
