Lecture 12 Regression Edited
Regression
✤ What if AUB was interested in predicting the starting salaries of
psychology students based on their final grades in Psych 284?
✤ This would only be relevant given a significant correlation between
the variables.

Grades    Annual Salary (USD)
51        13,508
84        26,156
81        26,000
63        17,156
74        22,256
88        29,600
53        15,680
79        22,940
71        19,544
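
Whether the correlation is significant can be checked before any model is fitted. The following is a minimal R sketch, assuming the ten grade/salary pairs used in the lecture (the 86 / 29,000 pair appears only in the fuller data table later in the lecture). The vector names `grades` and `salary` are illustrative and not taken from the slides.

```r
# Ten grade/salary pairs from the lecture's data
# (variable names are illustrative, not from the lecture).
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

# Pearson correlation with a two-tailed test of H0: rho = 0
ct <- cor.test(grades, salary)
r <- unname(ct$estimate)  # sample correlation coefficient
p <- ct$p.value           # p-value of the test

print(round(r, 3))
print(p < .05)            # TRUE means the correlation is significant at .05
```

If the test is significant, it makes sense to go on and predict salary from grade.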
Linear Model
[Figure: Linear Model. Annual Salary (USD) plotted against Final Grades, with the residuals marked.]
Line of Best Fit
The line of best fit is the line that minimizes the residuals.
$\hat{y}_i = b_0 + b_1(x_i)$
$\hat{y}_i$ is the predicted value of Y.
We can use R to find $b_0$ and $b_1$. For these data:
$\hat{y}_i = -7634.4 + 408.5(x_i)$
[Figure: line of best fit drawn on the plot of Annual Salary (USD) against Final Grades]
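
As a rough illustration of where the intercept and slope come from, the sketch below computes them twice: once from the standard closed-form least-squares expressions (these formulas are not shown on the slides and are included here as an assumption about the by-hand route) and once with `lm()`. The vector names are illustrative.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

# Closed-form least-squares estimates (standard formulas, not from the slides)
b1 <- sum((grades - mean(grades)) * (salary - mean(salary))) /
  sum((grades - mean(grades))^2)          # slope
b0 <- mean(salary) - b1 * mean(grades)    # intercept

# The same estimates from R's linear model function
fit <- lm(salary ~ grades)

print(round(c(b0 = b0, b1 = b1), 1))
print(round(coef(fit), 1))
```

Both routes should agree with the rounded equation given in the lecture, with an intercept near -7634.4 and a slope near 408.5.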
Regression Equation
$\hat{y}_i = -7634.4 + 408.5(x_i)$

Based on the data, what would we predict is the salary for someone who
completed Psyc 284 with:
a 51?   $\hat{y}_i = -7634.4 + 408.5(51) = \$13{,}199.10/\text{yr}$
an 88?  $\hat{y}_i = -7634.4 + 408.5(88) = \$28{,}313.60/\text{yr}$

Grades    Annual Salary (USD)
51        13,508
84        26,156
81        26,000
63        17,156
74        22,256
88        29,600
53        15,680
79        22,940
86        29,000
71        19,544
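
The two predictions can be reproduced by plugging the grades into the regression equation. This is a small sketch using the rounded coefficients from the lecture; `predict_salary` is a hypothetical helper name introduced only for this example.

```r
# Coefficients as rounded in the lecture's regression equation
b0 <- -7634.4
b1 <- 408.5

# Hypothetical helper: predicted starting salary (USD/yr) for a final grade
predict_salary <- function(grade) b0 + b1 * grade

print(predict_salary(c(51, 88)))
```

Because the coefficients are rounded, these values can differ by a few dollars from what `predict()` would return on the unrounded fitted model.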
Regression Equation
$\hat{y}_i = b_0 + b_1(x_i)$
$\hat{y}_i$: predicted value of Y
$e_i$: residual, the difference between predicted Y and obtained Y
[Figure: Relationship between Psyc 284 Final Grades and Starting Salary, plotted against Final Grades]
A regression equation produces a line that minimizes the sum of the
squared residuals, $\sum e_i^2$.
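
To make the idea of minimizing the residuals concrete, the sketch below compares the squared residuals of the fitted line with those of an arbitrarily chosen alternative line. The data vectors and the alternative slope (fitted slope plus 10) are illustrative assumptions. Note that R's `resid()` returns obtained Y minus predicted Y.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

fit <- lm(salary ~ grades)
e <- resid(fit)        # residuals: obtained Y minus predicted Y
ss_fit <- sum(e^2)     # sum of squared residuals for the fitted line

# An arbitrary alternative line with a steeper slope (illustrative only)
b <- coef(fit)
e_other <- salary - (b[1] + (b[2] + 10) * grades)
ss_other <- sum(e_other^2)

print(c(fitted = ss_fit, other = ss_other))
```

Any line other than the least-squares line gives a larger sum of squared residuals on the same data.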
Regression Analysis
We need some way of testing how well the model fits the observed
data.
How?
Regression Analysis
Measure how much better our regression model is than a model based
on the mean.
The mean represents the best guess for Y when we have no other
information.
[Figure: Annual Salary (USD) against Final Grades, with the mean marked. Average Salary = $22,184]
Total squared deviations between data and mean
SST
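
A short sketch of the mean-based model, assuming the ten salaries from the lecture's data: the mean is the prediction for everyone, and SST is the total of the squared deviations from it. The names `y_bar` and `ss_t` are illustrative.

```r
# Ten starting salaries from the lecture's data (name is illustrative)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

y_bar <- mean(salary)             # best guess for Y with no other information
ss_t <- sum((salary - y_bar)^2)   # total sum of squares (SST)

print(y_bar)
print(ss_t)
```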
MODEL SUM OF SQUARES
A number that represents systematic variation
Total squared deviations between regression line and mean
[Figure: Relationship between Psyc 284 Final Grades and Starting Salary. Annual Salary (USD) against Final Grades.]
SSM
RESIDUAL SUM OF SQUARES
A number that represents unsystematic variation
Total squared deviations between data and regression line
[Figure: Relationship between Psyc 284 Final Grades and Starting Salary. Annual Salary (USD) against Final Grades.]
SSR
TOTAL VARIANCE IN Y, THE OUTCOME VARIABLE (SST)
=
SYSTEMATIC VARIANCE ACCOUNTED FOR BY THE REGRESSION MODEL (SSM)
+
UNSYSTEMATIC VARIANCE NOT ACCOUNTED FOR BY THE MODEL (SSR)

$\mathrm{SST} = \mathrm{SSM} + \mathrm{SSR}$
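
The partition of the total variation can be checked numerically. This sketch, assuming the lecture's ten data points and illustrative variable names, computes the three sums of squares from their definitions and confirms that the model and residual parts add up to the total.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

fit <- lm(salary ~ grades)
y_hat <- fitted(fit)                    # predicted salaries

ss_t <- sum((salary - mean(salary))^2)  # data vs. mean
ss_m <- sum((y_hat - mean(salary))^2)   # regression line vs. mean
ss_r <- sum((salary - y_hat)^2)         # data vs. regression line

print(c(SST = ss_t, SSM = ss_m, SSR = ss_r))
print(isTRUE(all.equal(ss_t, ss_m + ss_r)))
```

The ratio `ss_m / ss_t` is the proportion of variability in salary accounted for by the model.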
HYPOTHESIS TESTING
1) Overall regression model fits the data better than a model based on the
mean (F-test).
2) Individual predictors are significantly related to the outcome variable
(t-test).
HYPOTHESIS TESTING
If b1 ≠ 0, then SSM will be larger than 0, implying that a portion of the total
variation in the outcome is systematically explained by applying a
regression equation.
EXPLAINING VARIATION
b1 will rarely be exactly 0. We need to figure out whether the amount of
systematic, explained variation is substantially greater than the leftover,
unsystematic, unexplained variation.
$F = \dfrac{\text{Systematic Variance}}{\text{Unsystematic Variance}}$
MEAN SQUARES
Because Sum of Squares are affected by the sample size, we use the
“average” sum of squares, called mean squares.
$\mathrm{MSM} = \dfrac{\mathrm{SSM}}{df_M}$, where $df_M = k$ = # of predictors
$\mathrm{MSR} = \dfrac{\mathrm{SSR}}{df_R}$, where $df_R = n - k - 1$
$F = \dfrac{\mathrm{MSM}}{\mathrm{MSR}}$
F test
If H0 is true, the regression model does not account for significantly more systematic
variation than unsystematic variation. Thus, we would expect MSM to be no larger than MSR,
and F less than or equal to 1.
If H0 is false, the regression model does account for significantly more systematic variation
than unsystematic variation. Thus, we would expect MSM to be significantly larger than MSR,
and F greater than 1.
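With a single predictor:
$\mathrm{MSM} = \dfrac{\mathrm{SSM}}{df_M}$, where $df_M = 1$
$\mathrm{MSR} = \dfrac{\mathrm{SSR}}{df_R}$, where $df_R = n - 2$
$F = \dfrac{\mathrm{MSM}}{\mathrm{MSR}}$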
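
The mean squares and the F ratio can be worked through by hand in R. This is a sketch under the assumption that the lecture's ten data points are used; all variable names are illustrative.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

fit <- lm(salary ~ grades)
ss_m <- sum((fitted(fit) - mean(salary))^2)  # model sum of squares
ss_r <- sum(resid(fit)^2)                    # residual sum of squares

n <- length(salary)   # sample size
k <- 1                # number of predictors
df_m <- k             # model degrees of freedom
df_r <- n - k - 1     # residual degrees of freedom

ms_m <- ss_m / df_m   # mean square for the model
ms_r <- ss_r / df_r   # mean square for the residuals
f_value <- ms_m / ms_r

print(c(df_m = df_m, df_r = df_r, F = round(f_value, 1)))
```

The result should match the F(1, 8) value reported for these data in the lecture.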
F distribution
The F distribution is a family of distributions based on the degrees of
freedom associated with MSM and MSR (and therefore the sample size).
If the null hypothesis is true, what is the probability of obtaining our F value?
If it is less than alpha (.05), I would infer that the null hypothesis is not true.
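
The probability in question is the upper-tail area of the F distribution. As a sketch, taking the F value and degrees of freedom reported in the lecture for these data, F(1, 8) = 146.9, `pf()` gives the p-value and `qf()` gives the critical value at alpha = .05.

```r
# F statistic and degrees of freedom as reported in the lecture
f_value <- 146.9
df_m <- 1
df_r <- 8

# Probability of an F at least this large if the null hypothesis is true
p_value <- pf(f_value, df1 = df_m, df2 = df_r, lower.tail = FALSE)

# Critical F for alpha = .05 with the same degrees of freedom
f_crit <- qf(0.95, df1 = df_m, df2 = df_r)

print(p_value < .05)   # TRUE means we reject the null hypothesis
print(round(f_crit, 2))
```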
Assessing the importance of the predictor variable
It is not enough to establish that our regression model results in a better fit
than a model based on the mean. We also need to evaluate the importance of
the actual predictor.
$t = \dfrac{b_1 - 0}{SE_{b_1}}$, where 0 is the value of $b_1$ under H0
The standard error of b1 is a measure of the variability among all possible
b1 values that could be obtained if we collected all possible samples of X and Y.
Our t test compares the obtained t against a critical t associated with a given
alpha (typically .05).
Assuming the null hypothesis is true, what is the probability of obtaining our t
value? If it is less than alpha (.05), I would infer that the null hypothesis is not true.
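
The t test for the slope can be sketched by hand as follows. The standard-error expression used here, the square root of MSR divided by the sum of squared deviations of X, is the standard one for simple regression and is an addition not shown on the slides. Variable names are illustrative.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256,
            29600, 15680, 22940, 29000, 19544)

fit <- lm(salary ~ grades)
n <- length(salary)
b1 <- coef(fit)[["grades"]]              # estimated slope

# Standard error of b1 (standard formula, not taken from the slides)
ms_r <- sum(resid(fit)^2) / (n - 2)
se_b1 <- sqrt(ms_r / sum((grades - mean(grades))^2))

t_value <- (b1 - 0) / se_b1              # 0 is the value of b1 under H0
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)  # two-tailed
t_crit <- qt(0.975, df = n - 2)          # critical t for alpha = .05

print(round(c(t = t_value, t_crit = t_crit), 2))
print(p_value < .05)
```

With one predictor, the square of this t value equals the F ratio from the overall test, so the two tests lead to the same decision.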
Regression in R - Function and Output
lm(Outcome ~ Predictor)
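
A concrete version of this call might look like the sketch below, assuming the lecture's ten data points are stored in a data frame. The names `dat`, `Grade`, and `Salary` are illustrative and not from the slides.

```r
# Ten grade/salary pairs from the lecture's data (names are illustrative)
dat <- data.frame(
  Grade  = c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71),
  Salary = c(13508, 26156, 26000, 17156, 22256,
             29600, 15680, 22940, 29000, 19544)
)

fit <- lm(Salary ~ Grade, data = dat)   # Outcome ~ Predictor
s <- summary(fit)

print(s$coefficients)   # estimates, standard errors, t values, p-values
print(s$r.squared)      # proportion of variability accounted for
print(s$fstatistic)     # F value with its numerator and denominator df
```

The coefficient table supplies the t test for the predictor, while `r.squared` and `fstatistic` supply the overall model fit.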
Reporting and Interpreting Regression Output
Based on these data, we reject the null hypothesis. Psyc
284 grades significantly predict one’s starting salary,
t(8)=12.12, p<.001, two-tailed. 94.8% of the variability in
starting salary is accounted for by Psyc 284 grade. The
regression model fits the data well overall, F(1,8)=146.9,
p<.001.
Assumptions of a hypothesis test for a simple linear regression