
Topic 2

Linear Regression Model

Pham Nhu Man


[email protected]

OUTLINE
• Variables
• Regression analysis
• Regression function
• Linear Regression Model (LRM)
• Regression coefficients
• Ordinary Least Squares (OLS) method
• The estimation procedure
• Estimation of the LRM with R
• Summary statistics
• Regression
• Interpreting regression results
• Hypothesis testing: T-test vs. F-test
• Assumptions of classical LRM
THE IDEA BEHIND REGRESSION

» We want to relate two different variables – how does one affect the other?
» In particular, we want to know how much $y$ changes when $x$ increases or decreases by 1 unit.
» In doing so, we need the function
$$y = \beta x$$
This function tells us that when $x$ increases by 1 unit, $y$ changes by $\beta$.
» How?
• By producing a line of best fit
» What does best fit mean?

REGRESSION ANALYSIS

» Imagine we have a set of observational data


» The most basic regression does this: try to get a line of best fit for these points.
» The method of Ordinary Least Squares (OLS) minimizes the sum of the squared "distances".

REGRESSION FUNCTION
[Figure: scatter plot of observations with the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$; each residual $e_i$ is the vertical distance between the observed point $(x_i, y_i)$ and the line.]
THE LINEAR REGRESSION MODEL (LRM)
» The general form of the LRM is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + e_i$$

where 𝑖 indicates the observation.

» Or, as written in short form:


𝑌 = 𝑋𝛽 + 𝑒

» 𝑌 is the regressand, or dependent/explained variable

» 𝑋 is a vector of regressors, or independent/explanatory variables

» 𝑒 is an error term/residual.
REGRESSION COEFFICIENTS
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + e_i$$
» 𝛽0 is the intercept/constant
» 𝛽1 to 𝛽𝑘 are the slope coefficients
» In general, the $\beta$ are the regression coefficients or regression parameters. THEY ARE WHAT WE NEED TO ESTIMATE!
» Each slope coefficient indicates the (partial) rate of change in the mean value of $Y$ for a unit change in the value of a regressor, ceteris paribus.
» Roughly speaking: $\beta_1$ tells us that when $x_1$ increases by one unit, $y$ changes by $\beta_1$, other things (all other X) unchanged.
METHOD OF ORDINARY LEAST SQUARES

» The method of Ordinary Least Squares (OLS) searches for the coefficients that minimize the residual sum of squares (RSS):
$$RSS = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

» We need a data set of 𝑌 and 𝑋 to find 𝛽.


» Finding 𝛽 is an optimization problem.
THE ESTIMATION PROCEDURE: COEFFICIENTS
» The residual
$$e = Y - X\hat{\beta}$$
» The sum of squares
$$e'e = (Y - X\hat{\beta})'(Y - X\hat{\beta})$$
» To minimize $e'e$ we need to find $\hat{\beta}$ such that the first order condition holds:
$$\frac{\partial e'e}{\partial \hat{\beta}} = -2X'Y + 2X'X\hat{\beta} = 0$$
» Solving the first order condition gives
$$\hat{\beta} = (X'X)^{-1}X'Y$$
» This is the closed-form solution for the coefficients.
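A minimal sketch of this closed-form solution in R, using simulated data (the variable names and data-generating values are illustrative, not from the slides); the manual estimate is compared with lm():

# Sketch: OLS coefficients via the closed-form solution (X'X)^{-1} X'Y
set.seed(1)
N <- 200
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(N)    # illustrative true model

X <- cbind(1, x1, x2)                       # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

beta_hat                                    # manual OLS estimates
coef(lm(y ~ x1 + x2))                       # should match lm()'s estimates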


ESTIMATION OF THE LRM WITH R
DATA SAMPLE
Survey population: students in Ho Chi Minh City who consume instant noodles in 2023
Sample size: 900 – 1000
Variables:
» noodle instant noodle consumption (packs/month)
» income disposable income (mil. VND/month)
» price price of instant noodle (1,000 VND/pack)
» distance distance from home to the frequently visited instant noodle store (100 m)
» gender 1 = male; 2 = female
» family
» region hometown, "HCMC" = HCMC, "ocities" = from other big cities, "rural" = from rural area
» ocities a dummy variable, takes 1 if region = "ocities"
» rural a dummy variable, takes 1 if region = “rural“
» rent expenses for accommodation, rent (million VND/month)
» housemates number of people currently living in the same house (person)

Data file: noodle13.csv


IMPORTING DATA
DESCRIBING THE IMPORTED DATA
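The screenshots for these two slides are not reproduced; a minimal sketch of the likely commands, assuming noodle13.csv sits in the working directory:

# Sketch: import the data file named above and inspect its structure
data <- read.csv("noodle13.csv")
str(data)    # variable names and types
head(data)   # first few rows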
SUMMARY STATISTICS: CONTINUOUS VARIABLES

» For continuous variables, we present the mean, S.D., min, and max, which show the distribution of the variables.
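A sketch of one way to produce such a table in R, assuming the continuous variables named in the data description:

# Sketch: mean, S.D., min, and max for the continuous variables
cont_vars <- c("noodle", "income", "price", "distance", "rent", "housemates")
t(sapply(data[cont_vars], function(v)
  c(mean = mean(v), sd = sd(v), min = min(v), max = max(v))))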
SUMMARY STATISTICS: DUMMY, CATEGORY VARIABLES

» For dummy and category variables, the mean and S.D. do not make a lot of sense.
» We present the frequency of each outcome instead.
SUMMARY STATISTICS: DUMMY, CATEGORY VARIABLES

» For dummy and category variables, we also present the frequency and the percentage of each outcome.
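A sketch of the frequency and percentage tables in R, using the region variable as an example:

# Sketch: counts and percentages for a categorical variable
table(data$region)                               # frequency of each outcome
round(100 * prop.table(table(data$region)), 1)   # percentage of each outcome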
MORE DETAILED DESCRIPTION: HISTOGRAM

» Command:
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, freq = TRUE)
MORE DETAILED DESCRIPTION: HISTOGRAM
» Now limit the range of the X axis to (0, 45):
hist(data$noodle, main = "Histogram of noodle", xlab = "Noodle consumption (packs/month)", col = "yellow", breaks = 100, xlim = c(0,45), freq = TRUE)
BIVARIATE ANALYSIS: SCATTER PLOT
» Command:
plot(data$income, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Disposable income (mil.VND/month)")
BIVARIATE ANALYSIS: SCATTER PLOT
plot(data$price, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Price of noodle (1000VND/pack)")
SCATTER PLOT
plot(data$distance, data$noodle, ylab = "Number of noodle (packs/month)", xlab = "Distance from home to the frequently visited instant noodle store (100m)", ylim = c(0,100))
BIVARIATE ANALYSIS: COMPARING NOODLE BETWEEN GROUPS
LINEAR REGRESSION MODEL: RESULTS

When the disposable income increases by one million VND per month, instant noodle consumption increases by 0.75 pack/month, other things equal.
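The regression output itself is a screenshot that is not reproduced here; a hedged sketch of the estimation call that would produce such results (the exact regressor set is an assumption):

# Sketch: estimate the LRM; the regressor set is assumed, not confirmed
model <- lm(noodle ~ income + price + distance + gender, data = data)
summary(model)   # coefficients, standard errors, t-values, p-values, R-squared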
LINEAR REGRESSION MODEL: RESULTS
» You can display the result with one line of code.

When the price of instant noodle increases by one thousand VND per pack, instant noodle consumption decreases by 0.58 pack/month.
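A sketch of the one-line version, under the same assumed specification:

# Sketch: estimate and display the results in a single call
summary(lm(noodle ~ income + price + distance + gender, data = data))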
LRM WITHOUT OUTLIERS
LRM WITHOUT OUTLIERS
» One line of code:
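A sketch of re-estimating without outliers; the cutoff of 45 packs/month is an assumption taken from the histogram's x-axis limit above:

# Sketch: drop extreme noodle values and re-estimate in one line
summary(lm(noodle ~ income + price + distance + gender, data = subset(data, noodle <= 45)))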
STANDARD ERRORS

» Under homoskedasticity, the variance-covariance matrix (VCV) of the coefficients is
$$VCV = \sigma^2 (X'X)^{-1}$$
» where $\sigma^2$ can be estimated by the mean squared error
$$s^2 = \hat{\sigma}^2 = \frac{e'e}{N - k}$$
where $N$ is the number of observations and $k$ the number of coefficients.

» The standard errors of the OLS estimates are calculated by
$$se = \sqrt{\mathrm{diag}(VCV)}$$
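A sketch of these formulas in R, continuing the simulated example from the closed-form section above (X, y, and beta_hat are reused from that sketch):

# Sketch: standard errors from the closed-form VCV
e   <- y - X %*% beta_hat                             # residuals
s2  <- as.numeric(t(e) %*% e) / (nrow(X) - ncol(X))   # sigma^2 estimate, e'e / (N - k)
VCV <- s2 * solve(t(X) %*% X)                         # variance-covariance matrix
se  <- sqrt(diag(VCV))                                # standard errors
se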
T-STATISTICS AND P-VALUES

» The t-statistics are
$$t = b \div se$$
where $\div$ indicates element-wise division of vectors.
» Each element $j$ of the p-value vector is given by
$$p_j = 2\left(1 - F(|t_j|, df)\right)$$
where $df = N - k$ is the degrees of freedom and $F$ the Student's t cumulative distribution function. Note that $N$ is the number of observations (# rows of $X$) and $k$ the number of coefficients.
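Continuing the same simulated example, a sketch of these two formulas in R:

# Sketch: t-statistics and two-sided p-values
t_stats <- beta_hat / se                   # element-wise division
df      <- nrow(X) - ncol(X)               # N - k degrees of freedom
p_vals  <- 2 * (1 - pt(abs(t_stats), df))  # pt() is the Student-t CDF
cbind(beta_hat, se, t_stats, p_vals)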
GOODNESS OF FIT: R2
» 𝑅2 , the coefficient of determination, is an overall measure of goodness of fit of the estimated regression
line.
» Gives the percentage of the total variation in the dependent variable that is explained by the regressors.
» It is a value between 0 (no fit) and 1 (perfect fit).
» Let:
Explained Sum of Squares $ESS = \sum (\hat{Y} - \bar{Y})^2$
Residual Sum of Squares $RSS = \sum e^2$
Total Sum of Squares $TSS = \sum (Y - \bar{Y})^2$
» Then:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
» Higher 𝑅2 indicates better fit.
» When $R^2 = 1$, $RSS = \sum e^2 = 0$.
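A sketch of computing $R^2$ by hand, continuing the simulated matrix example:

# Sketch: R-squared from its two equivalent forms
y_hat <- X %*% beta_hat
TSS <- sum((y - mean(y))^2)
RSS <- sum((y - y_hat)^2)
ESS <- sum((y_hat - mean(y))^2)
c(R2 = ESS / TSS, R2_check = 1 - RSS / TSS)   # the two forms agree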
DEGREE OF FREEDOM 𝑑𝑓
» 𝑁 is number of observations
» 𝑘 is number of estimated coefficients
» 𝑑𝑓 for 𝑅𝑆𝑆 = 𝑁 − 𝑘
GOODNESS OF FIT: R SQUARED ADJUSTED

» $R^2$ never decreases when more regressors are added.
» Sometimes researchers play the game of "maximizing" $R^2$ (some think that the higher the $R^2$, the better the model. BUT THIS IS NOT NECESSARILY TRUE!)
» To avoid this temptation, $R^2$ should take into account the number of regressors.
» Such an $R^2$ is called the adjusted $R^2$, denoted $\bar{R}^2$ (R-bar squared), and is computed from the (unadjusted) $R^2$ as follows:
$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k}$$
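A sketch of the adjustment, reusing RSS and TSS from the $R^2$ example:

# Sketch: adjusted R-squared from the unadjusted R-squared
n <- nrow(X); k <- ncol(X)
R2     <- 1 - RSS / TSS
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - k)
R2_adj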
HYPOTHESIS TESTING

▪ Testing individual coefficient: T-test


▪ Testing multiple coefficients: F-test

TESTING INDIVIDUAL COEFFICIENT: T TEST

» To test the following hypothesis:
• $H_0: \beta_k = 0$
• $H_1: \beta_k \neq 0$
» Calculate the following statistic and use the $t$ table to obtain the critical $t$ value with $n - k$ degrees of freedom for a given level of significance ($\alpha$, e.g. 10%, 5%, or 1%):
$$t = \frac{b_k}{se(b_k)}$$
» If $|t|$ is greater than the critical $t$ value, we can reject $H_0$.
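A sketch of this decision rule in R, continuing the simulated example:

# Sketch: two-sided t-test decision for each coefficient
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df)   # critical value with N - k df
abs(t_stats) > t_crit             # TRUE means reject H0: beta_j = 0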
TESTING INDIVIDUAL COEFFICIENT: T TEST

» Step 1: Form hypotheses
• $H_0: \beta_m = 0$
• $H_1: \beta_m \neq 0$
» Step 2: Determine the confidence interval, critical values, region of rejection, and region of acceptance:
$$t^*_{\alpha/2,\, n-k}$$
» Step 3: Calculate the test statistic
$$t_{tt} = \frac{\hat{\beta}_m}{s_{\hat{\beta}_m}}$$
» Step 4: Decide
TESTING INDIVIDUAL COEFFICIENT: T TEST
» If $|t_{tt}| > t^*_{\alpha/2,\, n-k}$: reject $H_0$ at the significance level $\alpha$
» If $P_{value} < \alpha$: reject $H_0$ at the significance level $\alpha$

[Figure: Student-t density with the region of acceptance in the middle and a region of rejection in each tail; the shaded area beyond the observed statistic in one tail is $P$-value$/2$.]

TESTING INDIVIDUAL COEFFICIENT: T TEST

The null hypothesis that income has no impact on noodle is rejected at the 10% significance level (and even at 1%).
TESTING INDIVIDUAL COEFFICIENT: T TEST

The null hypothesis that distance has no impact on noodle is not rejected at the 10% significance level.
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

» Step 1: Form hypotheses
• $H_0: \beta_{m+1} = \beta_{m+2} = \dots = \beta_k = 0$
• $H_1$: at least one $\beta$ is different from 0
» Step 2: Calculate the test statistic ($F$)
$$F_c = \frac{(RSS_R - RSS_U)/(df_R - df_U)}{RSS_U / df_U}$$
where $df_U = n - k$, $df_R = n - m$, and $df_R - df_U = k - m$.
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

» Step 3: Determine the critical value
$$F^* = F_{k-m,\, n-k}(\alpha)$$
• $(k - m)$ degrees of freedom for the numerator
• $(n - k)$ degrees of freedom for the denominator
» Step 4: Decide
• If $F_{tt} > F^*$, or
• $P_{value} = P(F > F_{tt}) < \alpha$
=> Reject 𝐻0 at the significance level of 𝛼
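A sketch of this F-test in R via a nested-model comparison; the two specifications are assumptions matching the income-and-gender example below:

# Sketch: joint F-test by comparing restricted and unrestricted models
unrestricted <- lm(noodle ~ income + price + distance + gender, data = data)
restricted   <- lm(noodle ~ price + distance, data = data)   # drops income, gender
anova(restricted, unrestricted)   # reports the F statistic and its p-value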
FISHER DISTRIBUTION
TESTING MULTIPLE COEFFICIENTS: 𝐹 TEST

The hypothesis that the coefficients on income and gender are simultaneously equal to zero is rejected at the 10% level (and even at 1%).
F TEST FOR OVERALL SIGNIFICANCE

The F-statistic reported in the regression output is the F-test of the null hypothesis that all slope coefficients are simultaneously equal to zero.
F TEST FOR OVERALL SIGNIFICANCE

The hypothesis that all slope coefficients are simultaneously equal to zero is rejected at the 10% level (and even at 1%).
ASSUMPTIONS OF THE CLASSICAL LRM
▪ A1: Linear relationship
▪ A2: No perfect collinearity
▪ A3: Residual has zero mean
▪ A4: Homoskedasticity
▪ A5: Exogeneity of regressors
▪ A6: Normality of residuals

MAINTAINED ASSUMPTIONS
» A1: Linear relationship
𝑌 = 𝑋𝛽 + 𝑒

» A3: Zero mean of the residual


$$E(e|X) = 0$$
This also implies $E(Y|X) = X\beta$.

» A2: No perfect collinearity ($X$ has full rank, i.e. the columns of $X$ are linearly independent)
» Note: we can still estimate a model with imperfect collinearity, but strong correlation among regressors causes the problem of multicollinearity.
HOMOSKEDASTICITY
» A4 implies that the error term has a constant variance
» As a result, the VCV of the error term is

$$E(ee'|X) = \sigma^2 I$$

» This assumption is used to calculate the VCV of coefficients (which is used to calculate standard errors).
» Violation of this assumption results in heteroskedasticity, which gives biased estimates of the standard errors.
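As an aside, a sketch of one common remedy in R, assuming the 'sandwich' and 'lmtest' packages are installed and 'model' is the fitted LRM assumed earlier:

# Sketch: heteroskedasticity-robust (HC1) standard errors
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))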
EXOGENEITY OF REGRESSORS
» A5 requires that the regressors $X$ are unrelated to the error term $e$.
» Violation of this assumption results in the problem of endogeneity, which produces biased estimated coefficients.
GAUSS – MARKOV THEOREM

» On the basis of assumptions A1 to A5 (A6 is not necessary), the OLS method gives best linear
unbiased estimators (BLUE):
• (1) Estimators are linear functions of the dependent variable Y.
• (2) The estimators are unbiased; in repeated applications of the method, the estimators on average equal their true values.
• (3) In the class of linear estimators, OLS estimators have minimum variance; i.e., they are
efficient, or the “best” estimators.
NORMALITY OF RESIDUAL [A6]
» A6 says that
$$e \sim N(0, \sigma^2)$$
» This is not required for estimating the model, but for hypothesis testing.
» If the residuals follow a normal distribution, then the t-ratios of the coefficients follow the Student-t distribution.
» Otherwise, the t-ratios do not exactly follow the Student-t distribution.
» But this should not be a problem with a large sample.
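A sketch of a quick visual check of residual normality in R, for a fitted model such as the one assumed earlier:

# Sketch: inspect the distribution of the residuals
res <- resid(model)
hist(res, breaks = 30, main = "Histogram of residuals")
qqnorm(res); qqline(res)   # points near the line suggest approximate normality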
