Lecture 12 Regression Edited

The document discusses using simple linear regression to model the relationship between psychology students' final grades in Psych 284 and their starting salaries. It finds a significant positive correlation between the two variables and calculates a regression equation to predict starting salary based on final grade. The regression equation and a graph of the best fit line are presented to demonstrate how final grades can be used to predict starting salary.


Regression

✤ Correlational analyses quantify the relationship between two or more naturally occurring variables.

✤ Simple linear regression analyses produce a hypothetical model of the relationship between two variables that allows us to predict the value of one variable based on the other.
Regression

✤ What if AUB was interested in predicting the starting salaries of psychology students based on their final grades in Psych 284?

✤ This would only be relevant given a significant correlation between the variables: r = .97, p < .05, right-tailed (an R check of this correlation follows the table).

Grades   Annual Salary (USD)
51       13,508
84       26,156
81       26,000
63       17,156
74       22,256
88       29,600
53       15,680
79       22,940
86       29,000
71       19,544
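A minimal R sketch of checking this correlation using the ten data points in the table (the variable names grades and salary are ours, not from the slides):

grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256, 29600, 15680, 22940, 29000, 19544)
# Pearson correlation with a right-tailed (positive) alternative, matching the slide's r = .97, p < .05
cor.test(grades, salary, alternative = "greater")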
Linear Model

Regression analyses give us a way to model the relationship between two or more variables using a straight line.

[Figure: Scatterplot of the relationship between Psyc 284 final grades (x-axis, 50–80) and starting annual salary in USD (y-axis, 15,000–30,000).]
Linear Model

No line can fit all the data perfectly. The deviation between a line and a data point is called a residual.

[Figure: The same scatterplot of final grades vs. annual salary (USD).]
Linear Model

[Figure: The same scatterplot, with the vertical distances (residuals) between each data point and the line marked.]
Line of Best Fit…

…is the line that minimizes the total squared residuals.

ŷi = b0 + b1(xi)

There are formulas for obtaining a slope and intercept for a given set of data, or we can use R (a sketch of the formulas follows).

[Figure: Scatterplot of final grades vs. annual salary (USD) with the line of best fit drawn through the points.]
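As a rough sketch of the "formulas" route (our own code, assuming the grades and salary vectors entered earlier), the least-squares slope and intercept can be computed directly:

# b1 = sum of cross-products / sum of squared deviations of x;  b0 = mean(y) - b1 * mean(x)
b1 <- sum((grades - mean(grades)) * (salary - mean(salary))) / sum((grades - mean(grades))^2)
b0 <- mean(salary) - b1 * mean(grades)
c(intercept = b0, slope = b1)   # should agree with coef(lm(salary ~ grades))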
Line of Best Fit

Regression equation:

ŷi = b0 + b1(xi)

✤ ŷi is the predicted value of Y.
✤ b0 is the Y-intercept: the predicted value of y when x is 0.
✤ b1 is the slope: the change in the predicted value of y for every 1-unit change in x.

[Figure: Scatterplot of final grades vs. annual salary (USD) with the line of best fit.]
Regression Equation

ŷi = b0 + b1(xi)

In this example, the regression equation is:

ŷi = -7634.4 + 408.5(xi)

[Figure: Scatterplot with the fitted regression line.]
Regression Equation

ŷi = -7634.4 + 408.5(xi)

You can use this regression line to predict the salary of someone based on their Psych 284 final grade.
Regression Equation

ŷi = -7634.4 + 408.5(xi)

Based on the data, what would we predict is the salary for someone who completed Psyc 284 with:

a 51?  ŷi = -7634.4 + 408.5(51) = $13,199.10/yr
an 88? ŷi = -7634.4 + 408.5(88) = $28,313.60/yr

(An R sketch of these predictions follows the table.)

Grades   Annual Salary (USD)
51       13,508
84       26,156
81       26,000
63       17,156
74       22,256
88       29,600
53       15,680
79       22,940
86       29,000
71       19,544
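A small sketch of making these predictions in R, using the coefficients reported on the slide (the helper name predict_salary is hypothetical, and the grades and salary vectors are assumed from the earlier sketch):

predict_salary <- function(grade) -7634.4 + 408.5 * grade   # the slide's regression equation
predict_salary(c(51, 88))                                   # predicted salaries for grades of 51 and 88

# Equivalently, fit the model and let R do the prediction:
model <- lm(salary ~ grades)
predict(model, newdata = data.frame(grades = c(51, 88)))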
Regression Equation

Predicted value of Y:
ŷi = b0 + b1(xi)

Obtained value of Y:
yi = b0 + b1(xi) + ei

The residual, ei, is the difference between the predicted Y and the obtained Y. A regression equation produces a line that minimizes the ei (in the least-squares sense).

[Figure: Scatterplot with the fitted line; the vertical gaps between the data points and the line are the residuals.]
Regression Analysis

The regression line is only a model based on the data.

This model might not reflect reality.

We need some way of testing how well the model fits the observed
data.

How?
Regression Analysis

Measure how much better our regression model is than a model based on the mean.

The mean represents the best guess for Y when we have no other information.

✤ If grades are a significant predictor of salary, then we would expect our regression line to fit the data better than the mean line.

✤ If grades are NOT a significant predictor of salary, then modeling the relationship using a regression line will not be an improvement over the mean (a quick R sketch of this comparison follows).
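A minimal sketch of this comparison in R (assuming the grades and salary vectors from earlier): fit an intercept-only model, which simply predicts the mean of salary, and compare it with the regression model.

mean_model <- lm(salary ~ 1)         # intercept-only: predicts mean(salary) for everyone
reg_model  <- lm(salary ~ grades)    # simple linear regression
anova(mean_model, reg_model)         # F test: does the regression improve on the mean model?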
SUM OF SQUARES

Comparing the residuals of two competing models:
1. the mean
2. the regression equation
TOTAL SUM OF SQUARES (SST)

A number that represents the total variation in Y irrespective of X (i.e., as if b1 = 0): the total squared deviations between the data and the mean (average salary = $22,184).

[Figure: Scatterplot with a horizontal line at the mean salary of $22,184; SST is the sum of squared vertical distances from each data point to that line.]
MODEL SUM OF SQUARES (SSM)

A number that represents systematic variation: the total squared deviations between the regression line and the mean.

[Figure: Scatterplot showing both the regression line and the mean line; SSM is the sum of squared vertical distances between the two lines at each data point.]
RESIDUAL SUM OF SQUARES (SSR)

A number that represents unsystematic variation: the total squared deviations between the data and the regression line.

[Figure: Scatterplot with the regression line; SSR is the sum of squared vertical distances from each data point to the line.]
SST = SSM + SSR

Total variance in Y, the outcome variable (SST)
  = systematic variance accounted for by the regression model (SSM)
  + unsystematic variance not accounted for by the model (SSR)
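A sketch of computing the three sums of squares directly in R (the object names are ours; assumes the grades and salary vectors entered earlier):

model <- lm(salary ~ grades)
sst <- sum((salary - mean(salary))^2)          # total: data vs. mean
ssm <- sum((fitted(model) - mean(salary))^2)   # model: regression line vs. mean
ssr <- sum((salary - fitted(model))^2)         # residual: data vs. regression line
all.equal(sst, ssm + ssr)                      # TRUE, up to floating-point rounding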
HYPOTHESIS TESTING

✤ Null hypothesis H0: there is no relationship between the two variables; b1 = 0.

✤ Alternative hypothesis H1: there is a significant relationship between the two variables; b1 ≠ 0.

We test whether:
1) the overall regression model fits the data better than a model based on the mean (F-test), and
2) the individual predictor is significantly related to the outcome variable (t-test).
HYPOTHESIS TESTING

✤ If b1 = 0, then SSM will also be close to 0: applying a regression equation does not systematically explain much of the total variation in the outcome.

✤ If b1 ≠ 0, then SSM will be larger than 0, implying that a portion of the total variation in the outcome is systematically explained by applying a regression equation.
EXPLAINING VARIATION

b1 will rarely be exactly 0. We need to figure out whether the amount of systematic, explained variation is substantially greater than the leftover, unsystematic, unexplained variation.

In other words, is SSM larger than SSR?

To test whether the regression model accounts for more systematic variation (SSM) than unsystematic, error variance (SSR), we use an F test.

F = systematic variance / unsystematic variance
MEAN SQUARES

Because sums of squares are affected by the sample size, we use the "average" sum of squares, called mean squares.

MSM = SSM / dfM,   where dfM = k = number of predictors
MSR = SSR / dfR,   where dfR = n - k - 1

F = MSM / MSR
F test
If H0 is true, the regression model does not account for significantly more systematic
variation than unsystematic variation. Thus, we would expect MSM to be no larger than MSR,
and F less than or equal to 1.
If H0 is false, the regression model does account for significantly more systematic variation
than unsystematic variation. Thus, we would expect MSM to be significantly larger than MSR,
and F greater than 1.
For simple linear regression (one predictor):

MSM = SSM / dfM,   dfM = 1
MSR = SSR / dfR,   dfR = n - 2

F = MSM / MSR
F distribution

The F distribution is a family of distributions based on the degrees of freedom associated with MSM and MSR (and therefore the sample size).

Our F test compares the obtained F against the critical F associated with a given alpha (typically .05).

If the null hypothesis is true, what is the probability of obtaining our F value? If it is less than alpha (.05), I would infer that the null hypothesis is not true.
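A sketch of the F calculation by hand (assuming the ssm and ssr values computed in the earlier sketch, with one predictor and n = 10 observations):

n   <- length(salary)
msm <- ssm / 1                                   # dfM = 1 predictor
msr <- ssr / (n - 2)                             # dfR = n - 2
f   <- msm / msr
f                                                # obtained F
qf(0.95, df1 = 1, df2 = n - 2)                   # critical F at alpha = .05
pf(f, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # p-value for the obtained F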
Assessing the importance of the predictor variable

It is not enough to establish that our regression model results in a better fit than a model based on the mean. We also need to evaluate the importance of the actual predictor.

If the predictor is not related to the outcome, we would expect b1 = 0.

Again, b1 is rarely exactly equal to zero. We need a way to figure out whether b1 is significantly different from 0. For this, we use a t-test: is the coefficient (the 408.5 in ŷi = -7634.4 + 408.5(xi)) significantly different from zero?

We convert b1 into a t-value, and that t-value is assessed against a t-distribution:

t = (b1 - 0) / SEb1       (0 is the value of b1 under H0)

The standard error of b1 is a measure of the variability among all the possible values of b1 that could be obtained if we collected all possible samples of X and Y. If those b1 values are similar, the SE will be small.
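A sketch of pulling b1 and its standard error out of the fitted model and forming the t-value by hand (assumes the model object fitted earlier; the coefficient table columns are standard lm output):

coefs <- coef(summary(model))                    # Estimate, Std. Error, t value, Pr(>|t|)
b1    <- coefs["grades", "Estimate"]
se_b1 <- coefs["grades", "Std. Error"]
t_val <- (b1 - 0) / se_b1                        # b1 under H0 is 0
t_val
2 * pt(abs(t_val), df = length(salary) - 2, lower.tail = FALSE)   # two-tailed p-value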


T distribution

The t distribution is a family of distributions based on the degrees of freedom. In simple linear regression analyses, the degrees of freedom associated with the t-test are N - 2.

Our t test compares the obtained t against the critical t associated with a given alpha (typically .05).

Assuming the null hypothesis is true, what is the probability of obtaining our t value? If it is less than alpha (.05), I would infer that the null hypothesis is not true.
Regression in R - Function and Output

lm(Outcome ~ Predictor)
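A minimal worked example with the slide's data (the variable names grades, salary, and model are ours):

grades <- c(51, 84, 81, 63, 74, 88, 53, 79, 86, 71)
salary <- c(13508, 26156, 26000, 17156, 22256, 29600, 15680, 22940, 29000, 19544)
model  <- lm(salary ~ grades)   # Outcome ~ Predictor
summary(model)                  # slope and intercept, their t tests, R-squared, and the overall F test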
Reporting and Interpreting Regression Output

Based on these data, we reject the null hypothesis. Psyc 284 grades significantly predict one's starting salary, t(8) = 12.12, p < .001, two-tailed. 94.8% of the variability in starting salary is accounted for by Psyc 284 grade. The regression model fits the data well overall, F(1, 8) = 146.9, p < .001.
Assumptions of a hypothesis test for a simple linear regression

✤ Outcome data are independent.
✤ The relationship between the variables is linear.
✤ The outcome variable is continuous.
✤ Residuals are normally distributed.
✤ Equality of variance (homoscedasticity).

A sketch of checking these assumptions in R follows.
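A brief sketch using base R's built-in regression diagnostics (assuming the model object fitted above):

par(mfrow = c(2, 2))   # show the four diagnostic plots together
plot(model)            # residuals vs. fitted (linearity, equal variance), normal Q-Q (normality of residuals), etc.
par(mfrow = c(1, 1))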
