Simple and Multiple Regression Analysis
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: diagram of Y with the predictors X1, X2, and X3]
P1 = P2 = P3 = ... = Pk
If so, by how much?
And how strong is the connection/relationship between the Xs and Y?
What % of the differences/variations in Y values (e.g., income) among study subjects can be explained by (or attributed to) differences in X values (e.g., years of education, years of experience, etc.)?
[Figure: diagram of Y overlapping X1, X2, and X3]
$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k + \varepsilon$
Family Number    Actual # of Credit Cards ($y_i$)
1                4
2                6
3                6
4                7
5                8
6                7
7                8
8                10
Total            $\sum y_i = 56$

Estimate? $\bar{y} = \sum y_i / n = 56 / 8 = 7$

QUESTION: Can we determine how much error in estimation we are committing by using $\bar{y} = 7$ as our estimate for each of these households?
* This example was adapted from Hair, Black, Babin, Anderson, & Tatham (2006). Multivariate Data Analysis, 6th ed., Prentice Hall.
Family    Actual # of Credit    Estimate for # of            Error in Estimation
Number    Cards ($y_i$)         Credit Cards ($\bar{y}$)     ($y_i - \bar{y}$)
1         4                     7                            -3
2         6                     7                            -1
3         6                     7                            -1
4         7                     7                             0
5         8                     7                            +1
6         7                     7                             0
7         8                     7                            +1
8         10                    7                            +3

$\sum y_i = 56$, $\bar{y} = 56 / 8 = 7$
[Figure: Graphic Representation. Actual number of credit cards (Y) for families F1-F8 plotted against the baseline estimate $\bar{y} = 7$; the vertical gap between each actual value and the estimate is the estimation error.]
Can we determine the total estimation error for all 8 families?
Total estimation error: $\sum_i (y_i - \bar{y}) = (-3) + (-1) + (-1) + 0 + 1 + 0 + 1 + 3 = 0$
Solution?
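Family    Actual # of Credit    Estimate for # of            Error in Estimation    Errors Squared
Number    Cards ($y_i$)         Credit Cards ($\bar{y}$)     ($y_i - \bar{y}$)      ($(y_i - \bar{y})^2$)
1         4                     7                            -3                     9
2         6                     7                            -1                     1
3         6                     7                            -1                     1
4         7                     7                             0                     0
5         8                     7                            +1                     1
6         7                     7                             0                     0
7         8                     7                            +1                     1
8         10                    7                            +3                     9

$\sum y_i = 56$, $\bar{y} = 56 / 8 = 7$
$\sum_i (y_i - \bar{y}) = 0$
$\sum_i (y_i - \bar{y})^2 = 22$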
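The baseline arithmetic can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the eight credit card counts are the ones implied by the listed errors around the mean of 7, and the variable names are illustrative.

```python
# Number of credit cards held by the 8 families (implied by the errors around the mean of 7)
cards = [4, 6, 6, 7, 8, 7, 8, 10]

baseline = sum(cards) / len(cards)        # y-bar: 56 / 8
errors = [y - baseline for y in cards]    # y_i - y-bar for each family
total_error = sum(errors)                 # positive and negative errors cancel out
sst = sum(e ** 2 for e in errors)         # squaring removes the signs before summing

print(baseline, total_error, sst)
```

Because the raw errors always cancel around the mean, the squared errors are summed instead; that sum is the total sum of squares used in the rest of the example.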
Family    Actual # of Credit    Family Size
Number    Cards (y)             (x)
1         4                     2
2         6                     2
3         6                     4
4         7                     4
5         8                     5
6         7                     5
7         8                     6
8         10                    6
[Figure: Scatter plot of # of credit cards (Y) against family size (X) for families F1-F8, with the original (baseline) estimate $\bar{y} = 7$ drawn as a horizontal line; F1, for example, sits at x = 2, y = 4.]
[Figure: The same scatter plot with candidate lines $\hat{y} = a_2 + b_2 x$ and $\hat{y} = a_3 + b_3 x$ alongside the baseline $\hat{y} = a + 0x = \bar{y}$. The regression line (line of best fit) is the new, improved location for the credit card estimates.]
[Figure: The regression line $\hat{y} = a + bx$ compared with the original (baseline) estimate $\bar{y}$; the estimation error for each family is now $(y - \hat{y})$ rather than $(y - \bar{y})$.]
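$\hat{y} = a + bx$
$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
$a = \bar{y} - b\bar{x}$
Let's use the above formulas to compute the values of a and b for the regression line in our example.
We will need: $\bar{y}$, $\bar{x}$, $\sum (x - \bar{x})(y - \bar{y})$, and $\sum (x - \bar{x})^2$.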
Family    y     x (Family Size)    $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})(y - \bar{y})$    $(x - \bar{x})^2$
1         4     2                  -2.25            -3               6.75                            5.0625
2         6     2                  -2.25            -1               2.25                            5.0625
3         6     4                  -.25             -1               .25                             .0625
4         7     4                  -.25              0               0                               .0625
5         8     5                  .75               1               .75                             .5625
6         7     5                  .75               0               0                               .5625
7         8     6                  1.75              1               1.75                            3.0625
8         10    6                  1.75              3               5.25                            3.0625
Total     56    34                                                   17                              17.5

$\bar{y} = 56 / 8 = 7$, $\bar{x} = 34 / 8 = 4.25$
$\sum (x - \bar{x})(y - \bar{y}) = 17$, $\sum (x - \bar{x})^2 = 17.5$
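$\hat{y} = a + bx$
$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{17}{17.5} = .971$
$a = \bar{y} - b\bar{x} = 7 - (.971)(4.25) = 2.87$
a = 2.87 (Y-intercept)
b = .97 (regression coefficient)
$\hat{y} = 2.87 + .97x$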
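The slope and intercept formulas translate directly into code. The sketch below is illustrative only (variable names such as `sxy` and `sxx` are not from the slides); it uses the family sizes and credit card counts implied by the deviations in the worked table.

```python
# Family size (x) and number of credit cards (y) for the 8 families
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(cards)
x_bar = sum(family_size) / n   # 34 / 8
y_bar = sum(cards) / n         # 56 / 8

# Sum of cross-products and sum of squared x deviations
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(family_size, cards))
sxx = sum((x - x_bar) ** 2 for x in family_size)

b = sxy / sxx                  # regression coefficient (slope)
a = y_bar - b * x_bar          # Y-intercept

print(round(a, 2), round(b, 2))
```

Carrying full precision for b before computing a avoids the small rounding drift that comes from plugging the rounded .97 back into the intercept formula.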
[Figure: Scatter plot of # of credit cards against family size showing the new, improved estimates on the regression line $\hat{y} = 2.87 + .97x$ alongside the original (baseline) estimate $\bar{y}$.]
Family    Actual # of Credit    Family Size    Regression Estimate    Error (Residual)    Errors Squared
Number    Cards (y)             (x)            ($\hat{y}$)            ($y - \hat{y}$)     ($(y - \hat{y})^2$)
1         4                     2              4.81                   -.81                .66
2         6                     2              4.81                   1.19                1.42
3         6                     4              6.76                   -.76                .58
4         7                     4              6.76                   .24                 .06
5         8                     5              7.73                   .27                 .07
6         7                     5              7.73                   -.73                .53
7         8                     6              8.7                    -.7                 .49
8         10                    6              8.7                    1.3                 1.69

$\sum (y - \hat{y})^2 = 5.486$
SSE = Sum of Squares Error (SS Residual)
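[Figure: Diagram of Y overlapping X1; the part of Y outside the overlap is SSE = 5.5 and the overlap is SSR = 16.5.]
Note: When dealing with only two variables (a single X and Y):
$r = \sqrt{R^2} = \sqrt{\dfrac{16.514}{22}} = \sqrt{.75} = .866$
r is the Pearson correlation of Y with X1 (NOT controlling for any other variable).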
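The relationship between SSE, SSR, SST, R-squared, and r can be sketched in Python as follows. This is an illustrative check under the same reconstructed data (family sizes and credit card counts implied by the tables); the variable names are assumptions, not part of the slides.

```python
import math

family_size = [2, 2, 4, 4, 5, 5, 6, 6]
cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(cards)
x_bar = sum(family_size) / n
y_bar = sum(cards) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(family_size, cards))
     / sum((x - x_bar) ** 2 for x in family_size))
a = y_bar - b * x_bar

predicted = [a + b * x for x in family_size]               # regression estimates
sse = sum((y - p) ** 2 for y, p in zip(cards, predicted))  # unexplained (residual)
sst = sum((y - y_bar) ** 2 for y in cards)                 # total (baseline error)
ssr = sst - sse                                            # explained by regression

r_squared = ssr / sst
# With a single predictor, r is the square root of R-squared, signed like the slope
r = math.copysign(math.sqrt(r_squared), b)

print(round(sse, 3), round(ssr, 3), round(r_squared, 2), round(r, 3))
```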
[Figure: Regression line $\hat{y} = 2.87 + .97x$ plotted over families F1-F8 (# of credit cards against family size). For F1, the original baseline error $(y - \bar{y})$ is split into the part explained by the regression model, $(\hat{y} - \bar{y})$, and the new error (unexplained/residual), $(y - \hat{y})$.]
Family    Actual # of Credit    Family Size    Family Income
Number    Cards ($y_i$)         ($x_1$)        ($x_2$, $000)
1         4                     2              14
2         6                     2              16
3         6                     4              14
4         7                     4              17
5         8                     5              18
6         7                     5              21
7         8                     6              17
8         10                    6              25
MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = a + b_1 x_1 + b_2 x_2$
Formulas are available for computing the values of a, $b_1$, and $b_2$.
[Figure: Three-dimensional plot of Y = # of credit cards against X1 = family size and X2 = family income, showing the actual values and the regression estimates.]
Family    Actual # of Credit    Family Size    Family Income    Regression Estimate    Error (Residual)    Errors Squared
Number    Cards (y)             ($x_1$)        ($x_2$, $000)    ($\hat{y}$)            ($y - \hat{y}$)     ($(y - \hat{y})^2$)
1         4                     2              14               4.77                   -.77                .59
2         6                     2              16               5.20                   .80                 .64
3         6                     4              14               6.03                   -.03                .00
4         7                     4              17               6.68                   .32                 .10
5         8                     5              18               7.53                   .47                 .22
6         7                     5              21               8.18                   -1.18               1.39
7         8                     6              17               7.95                   .05                 .00
8         10                    6              25               9.67                   .33                 .11

$\sum (y - \hat{y})^2 = 3.05$
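The slides do not print the closed-form formulas for a, b1, and b2, so the sketch below swaps in NumPy's general least-squares solver (`numpy.linalg.lstsq`) to fit the two-predictor model. The variable names are illustrative, and the data are the family sizes, incomes, and credit card counts from the example.

```python
import numpy as np

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)   # x1
income = np.array([14, 16, 14, 17, 18, 21, 17, 25], dtype=float)  # x2, in $000
cards = np.array([4, 6, 6, 7, 8, 7, 8, 10], dtype=float)        # y

# Design matrix: a column of ones for the intercept, then x1 and x2
X = np.column_stack([np.ones_like(family_size), family_size, income])

# Least-squares solution for [a, b1, b2]
coef, *_ = np.linalg.lstsq(X, cards, rcond=None)
a, b1, b2 = coef

predicted = X @ coef                   # regression estimates
residuals = cards - predicted          # y - y-hat
sse = float(np.sum(residuals ** 2))    # sum of squared residuals

print(np.round(coef, 3), np.round(predicted, 2), round(sse, 2))
```

The fitted estimates should agree with the regression estimates in the table to within rounding, and the sum of squared residuals should come out close to the 3.05 reported there.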
Y-intercept a
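[Figure: Diagram of Y (# of credit cards) overlapping X1 (family size) and X2 (family income). Area a is explained by X1 only, b by X2 only, c by both X1 and X2, and d is left unexplained.]
SST = a + b + c + d = 22
SSE = d = 3.05
Y on X1 = family size alone: $\hat{y} = 2.87 + .97 X_1$, SSR = a + c = 16.5
$r^2 = R^2 = (a + c) / (a + b + c + d) = 16.5 / 22 = 0.75$
Y on X2 = family income alone: $\hat{y} = -0.063 + .398 X_2$, SSR = c + b = 15.12
Pearson/simple correlation of Y with X2 (not controlling for X1): $r_{yx_2} = \sqrt{\dfrac{b + c}{a + b + c + d}} = \sqrt{\dfrac{15.12}{22}} = 0.829$
Y on X1 and X2 together: SSR = a + b + c = 18.95, SST = a + b + c + d = 22
$R^2$ = ? Graphically = ?
NOTE: c is explained by both X1 and X2.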
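The individual areas of the diagram can be backed out from the three SSR values and SST. The following is a small illustrative calculation using the rounded figures quoted above; the names `area_a` through `area_d` are labels chosen for this sketch.

```python
sst = 22.0        # a + b + c + d
ssr_x1 = 16.5     # a + c: Y regressed on family size alone
ssr_x2 = 15.12    # b + c: Y regressed on family income alone
ssr_both = 18.95  # a + b + c: Y regressed on both predictors

area_a = ssr_both - ssr_x2            # explained by X1 only
area_b = ssr_both - ssr_x1            # explained by X2 only
area_c = ssr_x1 + ssr_x2 - ssr_both   # explained by both X1 and X2 (overlap)
area_d = sst - ssr_both               # unexplained (SSE)

r_squared = ssr_both / sst            # share of variation explained by X1 and X2 together

print(round(area_a, 2), round(area_b, 2), round(area_c, 2),
      round(area_d, 2), round(r_squared, 3))
```

Because the shared area c is counted only once, the multiple R-squared is smaller than the sum of the two simple r-squared values.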
Remember: for the model with both family size and family income, the errors squared sum to $\sum (y - \hat{y})^2 = 3.05$.
We are interested in understanding the role that the following demographics (age, educ, sibs,
agewed), as well as respondent income (rincmdol), job satisfaction (satjob_2),
and marriage satisfaction (hapmar_2), play in determining/predicting one's
general happiness (happy_2).
We also wish to know which of the above variables is the strongest predictor of
general happiness (Standardized Reg. Coefficients).
Use the gss_2 data file and conduct the appropriate analysis.
NOTE:
satjob_2 is coded as:
1 = Very Dissatisfied
2 = A Little Dissatisfied
3 = Pretty Satisfied
4 = Very Satisfied
Independent = 3
Other = 4
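Standardized regression coefficients put every predictor on the same scale so their strengths can be compared. Since the gss_2 data file is not reproduced here, the sketch below illustrates the idea on the credit card example instead: it z-scores Y and both predictors and refits the model, which is one common way to obtain standardized coefficients. The helper `zscore` and the variable names are hypothetical.

```python
import numpy as np

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)
income = np.array([14, 16, 14, 17, 18, 21, 17, 25], dtype=float)
cards = np.array([4, 6, 6, 7, 8, 7, 8, 10], dtype=float)

def zscore(values):
    """Rescale to mean 0 and standard deviation 1 (sample SD)."""
    return (values - values.mean()) / values.std(ddof=1)

# With every variable standardized, the intercept is zero and can be omitted
Z = np.column_stack([zscore(family_size), zscore(income)])
betas, *_ = np.linalg.lstsq(Z, zscore(cards), rcond=None)
beta_size, beta_income = betas

# The predictor with the larger absolute standardized coefficient is the stronger one
print(round(float(beta_size), 3), round(float(beta_income), 3))
```

The same comparison applies to the happiness model: the variable with the largest absolute standardized coefficient in the output is the strongest predictor.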
Meaning?
EXAMPLE 1:
Meaning?
Among people of the same gender, every
additional year of education results in an
average additional income of $1,000.
Males make, on average, $800 more in
comparison with females who have the
same number of years of education.
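The two interpretations above correspond to a model with one continuous predictor and one dummy variable. The sketch below assumes the coefficients implied by the text ($1,000 per year of education, $800 for males); the intercept is not given in the source, so the value used here is purely hypothetical, as is the function name.

```python
def predicted_income(years_education, is_male, intercept=10_000):
    """Predicted income from education and a gender dummy (male = 1, female = 0).

    The intercept default is a made-up placeholder; only the two slopes
    come from the interpretation in the text.
    """
    return intercept + 1_000 * years_education + 800 * is_male

# Same gender, one extra year of education: difference equals the education coefficient
education_effect = predicted_income(13, 0) - predicted_income(12, 0)

# Same education, male versus female: difference equals the dummy coefficient
gender_gap = predicted_income(12, 1) - predicted_income(12, 0)

print(education_effect, gender_gap)
```

Whatever the intercept is, it cancels out of both differences, which is why each coefficient can be read as an effect "holding the other variable constant".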
Assignment 5
Data file Salary.sav contains information about 474 employees hired by a Midwestern bank
between 1969 and 1971 (NOTE: Due to SPSS site license restrictions, this hyperlink will
not work if you are off campus). Of the 474 employees, 258 were men, 216 women, 370
white, and 104 non-white. The bank was subsequently involved in EEOC litigation; the
bank was accused of gender and race discrimination in its hiring and compensation
practices. The two issues that were of particular interest in the litigation were alleged
gender and racial inequalities not only in the bank's beginning salaries (variable salbeg),
but also in its later salaries (variable salnow).
1. Print, examine, and interpret correlation coefficients between beginning salary
(salbeg) and age in years (age), education in years (edlevel), employment category or job
classification level, rated from 1=lowest to 8=highest (jobcat), and work experience in
months (work).
2. Conduct the appropriate analysis to see: (a) What role did each of the variables age,
education (edlevel), employment category (jobcat), and work experience (work) play,
holding all other variables constant, in determining the bank's beginning salaries? For
example, what was the differential pay for one additional year of education among new
hires who otherwise had the same age, employment category, and work experience? (b)
Which of the above demographic characteristics had the strongest influence on beginning
pay? How can you tell? (c) What percent of the differences in employees' beginning
salaries can be explained by/attributed to differences in all of the above characteristics?
3. Now conduct the appropriate analysis to indicate, holding all other variables
constant, what role gender (sex, male=0, female=1) played in determining beginning
salaries at the bank. That is, what was the differential beginning pay between male and
female employees who otherwise had the same age, education, employment category, and
work experience? Does this evidence support the charges of gender discrimination in the
bank's practices regarding initial compensation?
4. During litigation, it was charged that the bank's unfair compensation practices had
continued beyond its initial salary decisions. That is, the prosecution claimed that, with
time, the beginning salary disparities between men and women not only did not shrink but
widened further. Conduct the appropriate analysis to indicate (a) everything else being
equal, what role gender played in determining employees' later salaries at the bank
(salnow). That is, what was the average differential pay between male and female
employees who otherwise had the same age, education, employment category, work
experience, and job seniority (variable time represents seniority in terms of number of
months employed at the bank)? (b) Compare the later pay disparities you have just
identified with the beginning pay disparities you found in question 3 above to explain
whether the evidence supports the prosecution's charges of continued gender discrimination
beyond initial salary decisions, resulting in widening disparities in later pay.
NOTE: For each question, provide thorough explanations on corresponding pages and
parts of your printout.
QUESTIONS OR COMMENTS?