Linear Regression Model: Topic 2
OUTLINE
• Variables
• Regression analysis
• Regression function
• Linear Regression Model (LRM)
• Regression coefficients
• Ordinary Least Squares (OLS) method
• The estimation procedure
• Estimation of the LRM with R
• Summary statistics
• Regression
• Interpreting regression results
• Hypothesis testing: T-test vs. F-test
• Assumptions of classical LRM
THE IDEA BEHIND REGRESSION
» We want to relate two different variables: how does one affect the other?
» In particular, we want to know how much 𝑦 changes when 𝑥 increases or decreases by 1 unit.
» In doing so, we use the function
𝑦 = 𝛽𝑥
which tells us that when 𝑥 increases by 1 unit, 𝑦 changes by 𝛽 units.
» How?
• By producing a line of best fit
» What does best fit mean?
REGRESSION ANALYSIS
» The method of Ordinary Least Squares (OLS) minimizes the sum of the squared "distances".
REGRESSION FUNCTION
[Figure: scatter plot of observations (x1, y1), ..., (x4, y4) with the fitted line ŷ = β̂0 + β̂1·x; the vertical distances e1, ..., e4 from each point to the line are the residuals.]
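To make the picture concrete, here is a minimal R sketch (the simulated data and variable names are illustrative, not from the course dataset) that draws a scatter plot, the OLS line of best fit, and the residuals as vertical segments:

# Simulated data: a linear relationship plus noise (illustrative only)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)                         # OLS line of best fit
plot(x, y, pch = 16)
abline(fit, col = "red")                 # fitted line: y-hat = b0 + b1*x
segments(x, fitted(fit), x, y, lty = 2)  # residuals: vertical "distances"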
THE LINEAR REGRESSION MODEL (LRM)
» The general form of the LRM is:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + ⋯ + 𝛽𝑘 𝑥𝑘𝑖 + 𝑒𝑖
» 𝑒𝑖 is the error term; its sample counterpart is the residual.
REGRESSION COEFFICIENTS
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + ⋯ + 𝛽𝑘 𝑥𝑘𝑖 + 𝑒𝑖
» 𝛽0 is the intercept/constant
» 𝛽1 to 𝛽𝑘 are the slope coefficients
» In general, the 𝛽's are the regression coefficients or regression parameters. THEY ARE WHAT WE NEED TO ESTIMATE!
» Each slope coefficient indicates the (partial) rate of change in the mean value of 𝑌 for a unit change in the value of a regressor, ceteris paribus.
» Roughly speaking: 𝛽1 tells us that when 𝑥1 increases by one unit, 𝑦 changes by 𝛽1 units, other things (all other 𝑥's) unchanged.
METHOD OF ORDINARY LEAST SQUARES
» The method of Ordinary Least Squares (OLS) searches for the coefficients that minimize the residual sum of squares (RSS):
𝑅𝑆𝑆 = Σ𝑒𝑖² = Σ(𝑦𝑖 − β̂0 − β̂1𝑥1𝑖 − ⋯ − β̂𝑘𝑥𝑘𝑖)²
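In matrix form the minimizer is β̂ = (X′X)⁻¹X′y. A minimal R sketch with simulated data (all names illustrative) that computes it by hand and checks it against lm():

set.seed(2)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                        # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # coefficients minimizing RSS
cbind(beta_hat, coef(lm(y ~ x1 + x2)))        # the two columns match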
DATA SAMPLE
Survey subjects: students in Ho Chi Minh City who consume instant noodles in 2023
Sample size: 900 – 1000
Variables:
» noodle: noodle consumption (packs/month)
» income: disposable income (mil.VND/month)
» price: price of noodle (1000VND/pack)
» distance: distance from home to the frequently visited instant noodle store (100m)
» gender: 1 = male; 2 = female
» family
» region: hometown; "HCMC" = HCMC, "ocities" = from other big cities, "rural" = from rural areas
» ocities: a dummy variable, takes 1 if region = "ocities"
» rural: a dummy variable, takes 1 if region = "rural"
» rent: expenses for accommodation, rent (million VND/month)
» housemates: number of people currently living in the same house (person)
» For continuous variables, we present the mean, standard deviation, minimum, and maximum, which summarize the distribution of each variable.
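A minimal sketch of how such a table can be produced in R, assuming the survey is loaded in a data frame named data (as in the commands on the following slides):

vars <- c("noodle", "income", "price", "distance")   # continuous variables
t(sapply(data[vars], function(v)
  c(mean = mean(v, na.rm = TRUE), sd = sd(v, na.rm = TRUE),
    min = min(v, na.rm = TRUE),  max = max(v, na.rm = TRUE))))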
SUMMARY STATISTICS: DUMMY, CATEGORY VARIABLES
» Command:
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, freq = TRUE)
MORE DETAILED DESCRIPTION: HISTOGRAM
» Now limit the range of X axis to (0,45):
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, xlim = c(0,45), freq = TRUE)
BIVARIATE ANALYSIS: SCATTER PLOT
» Command:
» plot(data$income, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Disposable income (mil.VND/month)")
BIVARIATE ANALYSIS: SCATTER PLOT
plot(data$price, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Price of noodle (1000VND/pack)")
SCATTER PLOT
plot(data$distance, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Distance from home to the frequently visited instant noodle store (100m)", ylim = c(0,100))
BIVARIATE ANALYSIS: COMPARING NOODLE BETWEEN GROUPS
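A minimal sketch of such a between-group comparison in R, assuming the same data frame; the choice of gender as the grouping variable is illustrative:

# Compare noodle consumption between gender groups (1 = male, 2 = female)
boxplot(noodle ~ gender, data = data, xlab = "Gender", ylab = "Noodle consumption (packs/month)")
tapply(data$noodle, data$gender, mean, na.rm = TRUE)   # group means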
LINEAR REGRESSION MODEL: RESULTS
𝑝𝑗 = 2[1 − 𝐹(|𝑡𝑗|, 𝑑𝑓)]
where 𝑑𝑓 = 𝑁 − 𝑘 is the degrees of freedom and 𝐹(·, 𝑑𝑓) is the Student's t cumulative distribution function. Note that 𝑁 is the number of observations (# rows of 𝑋) and 𝑘 is the number of estimated coefficients.
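A sketch of estimating the LRM in R and reproducing the reported p-values from the formula above; the regressor set is illustrative:

fit <- lm(noodle ~ income + price + distance + gender, data = data)
summary(fit)                       # estimates, standard errors, t values, p-values

# Manual check of the p-values: p = 2 * (1 - F(|t|, df))
tstat <- coef(summary(fit))[, "t value"]
df    <- fit$df.residual           # N - k
2 * (1 - pt(abs(tstat), df))       # matches the Pr(>|t|) column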
GOODNESS OF FIT: R2
» 𝑅2 , the coefficient of determination, is an overall measure of goodness of fit of the estimated regression
line.
» Gives the percentage of the total variation in the dependent variable that is explained by the regressors.
» It is a value between 0 (no fit) and 1 (perfect fit).
» Let: Explained Sum of Squares 𝐸𝑆𝑆 = Σ(Ŷ − Ȳ)²
Residual Sum of Squares 𝑅𝑆𝑆 = Σ𝑒²
Total Sum of Squares 𝑇𝑆𝑆 = Σ(𝑌 − Ȳ)²
» Then: 𝑅² = 𝐸𝑆𝑆/𝑇𝑆𝑆 = 1 − 𝑅𝑆𝑆/𝑇𝑆𝑆
» Higher 𝑅2 indicates better fit.
» When 𝑅² = 1, 𝑅𝑆𝑆 = Σ𝑒² = 0: the fitted line passes through every data point.
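These quantities are easy to verify in R for a fitted model fit (continuing the hypothetical regression above):

y   <- model.response(model.frame(fit))
ESS <- sum((fitted(fit) - mean(y))^2)   # explained sum of squares
RSS <- sum(resid(fit)^2)                # residual sum of squares
TSS <- sum((y - mean(y))^2)             # total sum of squares
ESS / TSS                               # equals 1 - RSS/TSS
summary(fit)$r.squared                  # same value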
DEGREES OF FREEDOM 𝑑𝑓
» 𝑁 is number of observations
» 𝑘 is number of estimated coefficients
» 𝑑𝑓 for 𝑅𝑆𝑆 = 𝑁 − 𝑘
GOODNESS OF FIT: R SQUARED ADJUSTED
» Adjusted 𝑅² corrects 𝑅² for the degrees of freedom:
R̄² = 1 − [𝑅𝑆𝑆/(𝑁 − 𝑘)] / [𝑇𝑆𝑆/(𝑁 − 1)]
» Unlike 𝑅², it can decrease when a regressor with little explanatory power is added.
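Continuing the sketch above, adjusted 𝑅² can be reproduced by hand:

N <- nobs(fit); k <- length(coef(fit))
1 - (RSS / (N - k)) / (TSS / (N - 1))   # adjusted R-squared
summary(fit)$adj.r.squared              # same value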
TESTING INDIVIDUAL COEFFICIENT: T TEST
» The reported p-value is for a two-sided test; for a one-sided t-test, divide it by 2 (p-value/2).
» The hypothesis that the coefficients on income and gender are equal to zero is rejected at the 10% level (and even at the 1% level).
F TEST FOR OVERALL SIGNIFICANCE
» The F-statistic reported in the regression output is the test of the null hypothesis that all slope coefficients are equal to zero simultaneously.
F TEST FOR OVERALL SIGNIFICANCE
The hypothesis that all slope coefficients are equal to zero simultaneously is rejected at the 10% level (and even at the 1% level).
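The same test can be run explicitly in R by comparing the full model with an intercept-only model (names as in the earlier sketch):

restricted <- lm(noodle ~ 1, data = data)   # all slope coefficients set to zero
anova(restricted, fit)                      # F test for overall significance
# The F statistic and p-value match the last line of summary(fit)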
ASSUMPTIONS OF THE CLASSICAL LRM
▪ A1: Linear relationship
▪ A2: No perfect collinearity
▪ A3: Residual has zero mean
▪ A4: Homoskedasticity
▪ A5: Exogeneity of regressors
▪ A6: Normality of residuals
MAINTAINED ASSUMPTIONS
» A1: Linear relationship
𝑌 = 𝑋𝛽 + 𝑒
» A2: No perfect collinearity (𝑋 is full rank or the columns of 𝑋 are linearly independent)
» Note: a model with imperfect (high but not perfect) collinearity can still be estimated, but it suffers from multicollinearity.
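Perfect collinearity can be seen directly with the region dummies of the noodle data: adding a dummy for every category alongside the intercept makes the columns of 𝑋 linearly dependent (the dummy variable trap). A sketch, where the hcmc dummy is constructed here for illustration:

data$hcmc <- as.numeric(data$region == "HCMC")    # third dummy: ocities + rural + hcmc = 1
lm(noodle ~ ocities + rural + hcmc, data = data)  # R drops hcmc (coefficient reported as NA)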
HOMOSKEDASTICITY
» A4 implies that the error term has a constant variance
» As a result, the variance-covariance matrix (VCV) of the error term is
𝐸(𝑒𝑒′ | 𝑋) = 𝜎²𝐼
» This assumption is used to calculate the VCV of coefficients (which is used to calculate standard errors).
» Violation of this assumption results in heteroskedasticity, which gives biased estimates of the standard errors.
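A quick visual check of A4 for a fitted model fit (a sketch, not a formal test): under homoskedasticity the residuals show a roughly constant spread across fitted values.

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # the spread around this line should look roughly constant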
EXOGENEITY OF REGRESSORS
» This implies that [A5] 𝑋 is unrelated to 𝑒, i.e., 𝐶𝑜𝑣(𝑋, 𝑒) = 0.
» Violation of this assumption results in the problem of endogeneity, which produces biased coefficient estimates.
GAUSS – MARKOV THEOREM
» On the basis of assumptions A1 to A5 (A6 is not necessary), the OLS method gives best linear
unbiased estimators (BLUE):
• (1) The estimators are linear functions of the dependent variable 𝑌.
• (2) The estimators are unbiased: in repeated applications of the method, they are on average equal to their true values.
• (3) In the class of linear unbiased estimators, OLS estimators have minimum variance; i.e., they are efficient, or the "best" estimators.
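Property (2) can be illustrated by simulation: draw many samples from a known model and average the OLS slope estimates (the true slope is 0.5 here; all names are illustrative):

set.seed(3)
slopes <- replicate(2000, {
  x <- rnorm(50)
  y <- 1 + 0.5 * x + rnorm(50)   # true slope is 0.5
  coef(lm(y ~ x))[2]
})
mean(slopes)                     # close to 0.5: unbiased in repeated samples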
NORMALITY OF RESIDUAL [A6]
» A6 says that
𝑒 ~ 𝑁(0, 𝜎²)
» This is not required for estimation, but for hypothesis testing.
» If the errors are normally distributed, the t-ratios of the coefficients follow the Student's t distribution, so the t-tests above are exact.
» Otherwise, the Student's t distribution is only an approximation.
» But this is not a problem in large samples, where the approximation becomes accurate (by the central limit theorem).
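A common visual check of A6 for a fitted model fit (a sketch, not a formal test):

hist(resid(fit), breaks = 30, main = "Histogram of residuals", xlab = "Residuals")
qqnorm(resid(fit)); qqline(resid(fit))   # points close to the line suggest normality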