Lecture 2-3
Simple (single) linear regression model: 1 dependent variable and 1 independent variable
Y = b1 + b2*X + u
For comparison, a model with two independent variables (multiple regression):
Y = b1 + b2*X1 + b3*X2 + u
- Cross-sectional data: data collected at one point in time for many subjects or locations, e.g.
the GPA of the students in our class in semester 3 (45 students = 45 observations), or the GDP of
the 63 provinces of VN in 2021 → observations are independent (assumed in this course)
- Time series data: data collected for one subject (location) over time, e.g. the GDP of VN from
2000 to 2021 (22 obs) → observations are dependent (time series data analysis: chapters 21, 22)
- Combining cross-sectional and time series data → panel data (panel data analysis)
Note: b1 does not always have an economic meaning. That is why we do not focus on interpreting b1
in the model.
Example: Consider Salary on Experience (years of working): Salary = b1 + b2*Exp + u. In this case b1
is the starting salary (the average salary when Exp = 0). Another example: cost on quantity produced
(b1: fixed cost).
In another case, consider the price of a house on its area: b1 does not make sense because no
house has 0 area (no need to explain b1).
b2: slope of the line, i.e. the rate of change of the average of Y with respect to X: when X changes
by 1 unit, the average of Y changes by b2 units.
The difference between an observation Yi and its conditional mean E(Y|Xi) is the error term ui.
Yi^ = b^1 + b^2*Xi (3), in which Yi^ is the estimator of E(Y|Xi), and b^1 and b^2 are the estimators
of b1 and b2. The difference between Yi and Yi^ is called the residual, denoted ui^ →
Yi - Yi^ = ui^ → Yi = b^1 + b^2*Xi + ui^ (4)
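RSS = Σ(ui^)² = Σ(Yi - b^1 - b^2*Xi)²
Choosing b^1 and b^2 to minimize RSS (ordinary least squares) gives:
b^2 = Σ(xi*yi) / Σ(xi²), b^1 = Ybar - b^2*Xbar
In which: xi = Xi - Xbar and yi = Yi - Ybar are deviations from the sample means.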
Example: Consider the dependence of Weight on Height. The population is the students in
class EPMP 6. Take a sample of 6 students; the data are collected in the Excel file.
Using the linear model, the population regression function is: Wi = β1 + β2*Hi + ui (1)
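Student  W    H     w      h         wh        h²
1        75   180   13.5   8.666667  117       75.11111
2        61   174   -0.5   2.666667  -1.33333  7.111111
3        78   175   16.5   3.666667  60.5      13.44444
4        57   169   -4.5   -2.33333  10.5      5.444444
5        51   170   -10.5  -1.33333  14        1.777778
6        47   160   -14.5  -11.3333  164.3333  128.4444
Sum      369  1028  0      0         365       231.3333
In which: w = W - Wbar and h = H - Hbar, with Wbar = 61.5 and Hbar = 171.3333
beta^2 = Σ(wh) / Σ(h²) = 365 / 231.3333 = 1.57781
beta^1 = Wbar - beta^2*Hbar = 61.5 - 1.57781*171.3333 = -208.831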
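The hand calculation in the table can be checked in R. The following is a minimal sketch, not the lecture's own script: the variable names (W, H, w, h, beta1_hat, beta2_hat) are chosen here for illustration, and the data are typed in from the table instead of being read from the Excel file.

```r
# Weight (W) and Height (H) of the 6 sampled students, copied from the table
W <- c(75, 61, 78, 57, 51, 47)
H <- c(180, 174, 175, 169, 170, 160)

# Deviations from the sample means
w <- W - mean(W)
h <- H - mean(H)

# OLS estimates in deviation form
beta2_hat <- sum(w * h) / sum(h^2)
beta1_hat <- mean(W) - beta2_hat * mean(H)

# Cross-check against R's built-in least-squares fit
fit <- lm(W ~ H)

print(c(beta1_hat = beta1_hat, beta2_hat = beta2_hat))
print(coef(fit))
```

Both printouts should agree with the values of beta^1 and beta^2 reported under the table.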
Lecture 3
The coefficient of determination R2 measures the goodness of fit of the model (how well the
model fits the data): R2 tells us what percentage of the variation of Y can be explained by the
model (that is, by the independent variable).
Y = Y^ + u^ → Y - Ybar = (Y^ - Ybar) + u^
Squaring both sides and summing over all observations (the cross term is zero under OLS) gives:
TSS = RSS + ESS
TSS: Total Sum of Squares: total variation of Y, Σ(Yi - Ybar)²
RSS: Residual Sum of Squares: variation due to the error, Σ(ui^)²
ESS: Explained Sum of Squares: variation due to the model, Σ(Yi^ - Ybar)²
Define: R2 = ESS/TSS: the percentage of the variation of Y explained by the model.
In R-studio: Multiple R-squared: R2 = 0.7239: 72.39% of the variation of Weight can be
explained by the model with Height.
Rule of thumb for reading R2:
80% or more: very good
From 60% to under 80%: good
From 50% to under 60%: average
Under 50%: weak
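The decomposition TSS = RSS + ESS and the value of R2 for the Weight on Height example can be reproduced as follows. This is a sketch under the assumption that the six observations from the earlier table are typed in by hand; the object names are illustrative.

```r
# Weight and Height of the 6 sampled students (from the earlier table)
W <- c(75, 61, 78, 57, 51, 47)
H <- c(180, 174, 175, 169, 170, 160)
fit <- lm(W ~ H)

TSS <- sum((W - mean(W))^2)            # total variation of W
RSS <- sum(resid(fit)^2)               # variation left in the residuals
ESS <- sum((fitted(fit) - mean(W))^2)  # variation explained by the model

R2 <- ESS / TSS
print(c(TSS = TSS, RSS = RSS, ESS = ESS, R2 = R2))

# The same R2 is stored in the model summary (shown as "Multiple R-squared")
print(summary(fit)$r.squared)
```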
Example: Use an external data set (Excel data file), load it into R-studio, estimate the model,
explain the meaning of the coefficients and R2, and find RSS, TSS, ESS.
The data set is Table 2_6: data on the dependence of hourly wages (Wages) on years of
education (Years).
Model: Wagesi = β1 + β2*Yearsi + ui (1)
Hypothesis (expectation): positive relationship between Wages and Years of education, β2 > 0
Load the data into R and estimate using the command "lm".
Comment: The relationship between Wages and Years is strong and positive.
Calculate the correlation:
cor(Years,wages)
[1] 0.9527809
Estimate model (1) using the lm command:
lm(wages~Years)
Find RSS:
From the R output: Residual standard error: 0.9387
The formula: Residual standard error: σ^ = sqrt(RSS/(n-k)) = sqrt(RSS/(n-2)) = 0.9387
RSS = 0.9387² * (n-2) = 0.9387² * (13-2) = 9.6927
R2 = ESS/TSS = 1 - RSS/TSS = 0.9078 → TSS = RSS/(1 - R2) = 9.6927/0.0922 = 105.13 → ESS = TSS - RSS = 95.43
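The arithmetic for recovering RSS, TSS and ESS can be scripted. The sketch below takes only the three figures quoted from the R output (residual standard error 0.9387, n = 13, R2 = 0.9078) as inputs; the variable names are illustrative and the wage data themselves are not needed.

```r
sigma_hat <- 0.9387   # residual standard error reported in the lm summary
n <- 13               # number of observations
k <- 2                # number of estimated coefficients (intercept and slope)
R2 <- 0.9078          # Multiple R-squared reported by R

RSS <- sigma_hat^2 * (n - k)   # from sigma_hat = sqrt(RSS / (n - k))
TSS <- RSS / (1 - R2)          # from R2 = 1 - RSS / TSS
ESS <- TSS - RSS

print(round(c(RSS = RSS, TSS = TSS, ESS = ESS), 2))
```

Because the inputs are rounded to four digits, the results are approximate; computing sum(resid(fit)^2) on the fitted model gives RSS without rounding error.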
Precision of the estimators: to measure how precise the estimators β1^ and β2^ are, we use their
standard errors. From the results, we have se(β1^) = 0.8746, se(β2^) = 0.06958
Var(β1^) = (se(β1^))²; Var(β2^) = (se(β2^))²
Precision of the (estimated) model: σ^
The variance of u is assumed to be σ²; we estimate σ² by σ^², using the formula: σ^² = RSS/(n-k)
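The same quantities can be read off a fitted model in R. Since the wage data are not reproduced in these notes, the sketch below uses the six Weight and Height observations from the earlier example; the object names are illustrative.

```r
# Weight and Height of the 6 sampled students (from the earlier table)
W <- c(75, 61, 78, 57, 51, 47)
H <- c(180, 174, 175, 169, 170, 160)
fit <- lm(W ~ H)

n <- length(W)
k <- 2                                   # intercept and slope
RSS <- sum(resid(fit)^2)
sigma2_hat <- RSS / (n - k)              # estimate of the error variance
sigma_hat <- sqrt(sigma2_hat)            # "Residual standard error" in the summary

# Standard errors of the coefficients and the implied variances
se <- summary(fit)$coefficients[, "Std. Error"]
var_hat <- se^2

# The slope's standard error by hand: sigma_hat / sqrt(sum of squared deviations of H)
se_b2_manual <- sigma_hat / sqrt(sum((H - mean(H))^2))

print(c(sigma_hat = sigma_hat, se_b2_manual = se_b2_manual))
print(rbind(se = se, var = var_hat))
```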
To compare the precision of two models, we use the Coefficient of Variation (CV):
CV = (σ^ / Ybar) * 100%. The smaller the CV, the more precise the model.
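Assuming the usual definition CV = (σ^ / Ybar) * 100%, a small helper can compute it for any fitted lm model. The function name cv_percent is hypothetical (it is not part of R), and the Weight and Height sample from the earlier example stands in for real data.

```r
# Hypothetical helper: coefficient of variation of a fitted lm model, in percent.
# Assumes CV = 100 * (residual standard error) / (mean of the dependent variable).
cv_percent <- function(fit) {
  y <- model.response(model.frame(fit))
  100 * summary(fit)$sigma / mean(y)
}

# Weight and Height of the 6 sampled students (from the earlier table)
W <- c(75, 61, 78, 57, 51, 47)
H <- c(180, 174, 175, 169, 170, 160)
fit <- lm(W ~ H)

cv <- cv_percent(fit)
print(cv)
```

To compare two models, call cv_percent on each fitted object; under the assumed definition, the model with the smaller value is the more precise one.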
Exercises:
1. Redo the estimation of coefficients for the simple model using R-studio
2. A training manager wondered whether the length of time his trainees revised for an
examination had any effect on the marks they scored in the examination. Before the exam,
he asked a random sample of 10 of them to estimate honestly how long, to the nearest hour,
they had spent revising. After the examination he investigated the relationship between the
two variables. The sample data provided observations on the following:
Yi = exam mark for individual i (%)
Xi = revision time by individual i (hours)
The manager believes that the relationship between revision hours and exam mark takes
the form:
Yi = β1 + β2Xi + ui
where β1 and β2 are unknown parameters. The stochastic disturbance terms, ui, are
assumed to be normally and independently distributed with zero mean and constant
variance σ².
Preliminary analysis of the sample data produces the following sample information:
n=10
Lower case letters indicate that the variables are measured as deviations from their
respective sample means, i.e. xi = Xi - Xbar.
Use the above sample information to answer all the following questions. Show
explicitly all calculations.
a) Calculate estimates for the unknown coefficients, β^1 and β^2. [6 marks]
b) Present the resulting regression equation and interpret the meaning of the
estimated coefficients in this problem. [6 marks]
install.packages("MASS")
library(MASS)