Correlation and Chi-Square Test - LDR 280
Widely-Used Statistical
Methods
REVIEW OF FUNDAMENTALS
When testing hypotheses,
all statistical methods will
always be testing the null.
Null Hypothesis?
No difference/no relationship
$Y = a + b_1 x_1$
Compare Groups
Between Proportions (e.g., Chi-Square Test, $\chi^2$)
$H_0$:
$P_1 = P_2 = P_3 = \dots = P_k$
$\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
DEPENDENT
NOMINAL/CATEGORICAL
NOMINAL
METRIC
* Chi-Square
* T-Test
* Analysis of Variance
* Discriminant Analysis
* Logit Regression
Two indices that measure the linear relationship between
continuous/metric variables are:
a. Covariance
b. Correlation Coefficient (Pearson Correlation)
Covariance
Covariance is a measure of the linear association between
two metric variables (i.e., ordered metric, interval, or ratio
variables).
Covariance (for a sample) is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ (for samples)
Positive values indicate a positive relationship.
Negative values indicate a negative (inverse) relationship.
x = Driving Distance (yards), y = Average 18-Hole Score

  x_i      y_i    (x_i - x̄)    (y_i - ȳ)    (x_i - x̄)(y_i - ȳ)
 277.6     69       10.65        -1.0           -10.65
 259.5     71       -7.45         1.0            -7.45
 269.1     70        2.15         0               0
 267.0     70        0.05         0               0
 255.6     71      -11.35         1.0           -11.35
 272.9     69        5.95        -1.0            -5.95
                                        Total   -35.40
n = 6
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
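The covariance arithmetic in the golf example can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `sample_covariance` is illustrative and not part of the course materials.

```python
from statistics import mean

# Golf data from the slide: x = driving distance (yards), y = average 18-hole score
x = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
y = [69, 71, 70, 70, 71, 69]

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return cross_products / (len(xs) - 1)

s_xy = sample_covariance(x, y)
print(round(s_xy, 2))  # -7.08
```

The negative sign matches the slide: longer drives go with lower scores.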
Covariance
What can we say about the relationship between the two variables?
The relationship is negative/inverse.
That is, the longer a golfer's driving distance is, the lower (better)
his/her score is likely to be.
How strong is the relationship between x and y?
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
Covariance
It means:
If driving distance (x) were measured in feet, rather than yards,
even though it is the same relationship (using the same data),
the covariance $s_{xy}$ would have been much larger (in absolute value). WHY?
Because the x-values would be much larger, and thus the $(x_i - \bar{x})$
values would be much larger, which, in turn, would make
$\sum (x_i - \bar{x})(y_i - \bar{y})$ much larger.
SOLUTION: Correlation Coefficient comes to the rescue!
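To see the unit problem concretely, the sketch below recomputes the covariance for the same golf data after converting x from yards to feet (1 yard = 3 feet). The helper `sample_covariance` is an illustrative name, not something from the course files.

```python
from statistics import mean

def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-products of deviations over n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (len(xs) - 1)

x_yards = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
y = [69, 71, 70, 70, 71, 69]
x_feet = [3 * v for v in x_yards]  # same distances, different unit

cov_yards = sample_covariance(x_yards, y)
cov_feet = sample_covariance(x_feet, y)
print(round(cov_yards, 2), round(cov_feet, 2))  # -7.08 -21.24
```

The relationship is identical, yet the covariance triples just because the unit changed, which is why a unit-free index is needed.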
Correlation Coefficient
The correlation coefficient $r_{xy}$ (Pearson/simple correlation)
is a measure of linear association between two variables.
It may or may not represent causation.
The correlation coefficient $r_{xy}$ (for sample data) is
computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ (for samples)
Correlation Coefficient = r
Francis Galton
(English researcher,
inventor of
fingerprinting, and
cousin of Charles
Darwin)
To examine:
a. Whether a relationship exists between two metric
variables
e.g., income and education, or
workload and job satisfaction and
b. What the nature and strength of that relationship
may be.
Range of Values for r?
r-values closer to -1 or +1 indicate stronger linear relationships.
r-values closer to zero indicate a weaker relationship.
NOTE: Once $r_{xy}$ is calculated, we need to see whether it is
statistically significant (if using sample data).
Null Hypothesis when using r?
$H_0: \rho = 0$ (the population correlation is zero)
There is no relationship between the two variables.
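One common way to test this null hypothesis is the t statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom. The slides do not say which test SPSS output is being read from, so treat the following as a hedged sketch; it uses the golf example's values (r = -0.9631, n = 6), and the critical value 2.776 is the standard two-tailed t-table entry for alpha = 0.05 and df = 4.

```python
import math

def t_statistic_for_r(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = -0.9631, 6          # correlation and sample size from the golf example
t = t_statistic_for_r(r, n)

T_CRIT = 2.776             # two-tailed critical t for alpha = 0.05, df = 4 (t table)
reject_null = abs(t) > T_CRIT
print(round(t, 2), reject_null)  # -7.16 True
```

Because |t| exceeds the critical value, the null of no relationship would be rejected at the 5% level for these data.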
x = Driving Distance (yards), y = Average 18-Hole Score

  x_i      y_i    (x_i - x̄)    (y_i - ȳ)    (x_i - x̄)(y_i - ȳ)
 277.6     69       10.65        -1.0           -10.65
 259.5     71       -7.45         1.0            -7.45
 269.1     70        2.15         0               0
 267.0     70        0.05         0               0
 255.6     71      -11.35         1.0           -11.35
 272.9     69        5.95        -1.0            -5.95
                                        Total   -35.40
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(0.8944)} = -0.9631$
Conclusion?
Not only is the relationship negative, but also extremely
strong!
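The value of r for the golf data can be reproduced step by step. This is a minimal sketch with the standard library only; `statistics.stdev` returns the sample standard deviation (n - 1 in the denominator), which is what the formula for samples calls for.

```python
from statistics import mean, stdev

x = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # driving distance (yards)
y = [69, 71, 70, 70, 71, 69]                    # average 18-hole score

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

s_x, s_y = stdev(x), stdev(y)   # sample standard deviations
r_xy = s_xy / (s_x * s_y)       # covariance rescaled to lie between -1 and +1

print(round(s_x, 4), round(s_y, 4), round(r_xy, 4))  # 8.2192 0.8944 -0.9631
```

Multiplying x by 3 (yards to feet) would multiply both s_xy and s_x by 3, so r would not change.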
$r_{xy} = \frac{s_{xy}}{(s_x)(s_y)} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

   X      Y     (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)   (X - X̄)²   (Y - Ȳ)²
   4     12       -3        -4           12              9          16
   6     19       -1         3           -3              1           9
   9     14        2        -2           -4              4           4
   .      .        .         .            .              .           .
   .      .        .         .            .              .           .
 X̄ = 7  Ȳ = 16                      Σ(X - X̄)(Y - Ȳ)   Σ(X - X̄)²   Σ(Y - Ȳ)²
Correlation Coefficient?
a) GMAT score and 1st-year GPA are positively related, so that as
values of one variable increase, values of the other also tend to
increase, and
b) $r^2 = (0.48)^2 = 0.23$, i.e., 23% of the variations/differences in
students' GPAs are explained by (or can be attributed to) variations/
differences in their GMAT scores.
Let's now practice in SPSS
Menu Bar: Analyze, Correlate, Bivariate, Pearson
EXAMPLE: Using data in SPSS File Salary.sav,
we wish to see if beginning salary is related to seniority,
age, work experience, and education.
$Y = a + b_1 x_1$
Compare Groups
Between Proportions (e.g., Chi-Square Test, $\chi^2$)
$H_0$:
$P_1 = P_2 = P_3 = \dots = P_k$
$\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
QUESTION: Logically, what would be the first thing you would do?
             Male        Female      TOTAL
Smoker       O11 = 15    O12 = 25     40
Nonsmoker    O21 = 5     O22 = 55     60
TOTAL        20          80           n = 100
Hint:
What % of all the subjects are smokers/non-smokers?
             Male                  Female                 TOTAL
Smoker       O11 = 15, E11 = 8     O12 = 25, E12 = 32      40
Nonsmoker    O21 = 5,  E21 = 12    O22 = 55, E22 = 48      60
TOTAL        20                    80                      n = 100
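The expected frequencies follow from the hint: if smoking is unrelated to gender, the overall smoker share (40 of 100) should apply to both males and females. A short sketch of that rule, expected = (row total x column total) / n, using the observed counts from the table:

```python
# Observed counts: rows = Smoker, Nonsmoker; columns = Male, Female
observed = [[15, 25],
            [5, 55]]

row_totals = [sum(row) for row in observed]        # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]  # [20, 80]
n = sum(row_totals)                                # 100

# Expected count under "no relationship": row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)  # [[8.0, 32.0], [12.0, 48.0]]
```

For example, 40% of the 20 males gives E11 = 8, compared with the 15 male smokers actually observed.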
Positive and negative values of $(O_{ij} - E_{ij})$ (RESIDUALS) for different cells
will cancel out.
Solution?
Square each $(O_{ij} - E_{ij})$ and then sum them up, i.e., compute $\sum (O_{ij} - E_{ij})^2$.
Solution?
Divide each $(O_{ij} - E_{ij})^2$ value by its corresponding $E_{ij}$ value before
summing them up across all cells.
That is, compute an index for average discrepancy per subject.
$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
$\chi^2 = \frac{(15 - 8)^2}{8} + \frac{(25 - 32)^2}{32} + \frac{(5 - 12)^2}{12} + \frac{(55 - 48)^2}{48} = 12.76$
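The chi-square arithmetic can be verified cell by cell. A minimal sketch using the observed and expected counts from the slides:

```python
# Rows = Smoker, Nonsmoker; columns = Male, Female
observed = [[15, 25], [5, 55]]
expected = [[8, 32], [12, 48]]

# Squared residual divided by the expected count, for each cell
contributions = [
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
]
chi_square = sum(contributions)
print(round(chi_square, 2))  # 12.76
```

Every residual here is +7 or -7, so the cells differ only through their expected counts: the male smoker cell (E = 8) contributes the most.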
$df = (r - 1)(c - 1)$
$df = (2 - 1)(2 - 1) = 1$
where r and c are the numbers of rows and columns of the contingency
table.
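With df = 1, the computed statistic can be compared against the chi-square table. The sketch below is an illustration, not the course's procedure: 3.841 is the standard critical value for alpha = 0.05 and df = 1, and the p-value shortcut via `math.erfc` is valid only when df = 1.

```python
import math

rows, cols = 2, 2
df = (rows - 1) * (cols - 1)

chi_square = 12.76   # value computed for the smoking-by-gender table
CHI2_CRIT = 3.841    # chi-square table: alpha = 0.05, df = 1

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(df, chi_square > CHI2_CRIT, p_value < 0.001)  # 1 True True
```

Since 12.76 is well above 3.841, the null hypothesis of equal smoking proportions for males and females would be rejected.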
             Male             Female           TOTAL
Smoker       O11 = 15         O12 = 25          40
Nonsmoker    O21 = 5          O22 = 55          60
TOTAL        20               80                n = 100
% Smokers    15 / 20 = 75%    25 / 80 = 31%
$\chi^2 / N$
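A significant chi-square only says that a relationship exists; the column percentages show what it looks like. The sketch below recomputes them and also a strength index based on chi-square divided by N. Reading the slide's "chi-square / N" as the phi coefficient, $\phi = \sqrt{\chi^2 / N}$, is an assumption, since the slide text is incomplete.

```python
import math

# Observed counts from the slide, grouped by column (gender)
observed = {
    "Male": {"Smoker": 15, "Nonsmoker": 5},
    "Female": {"Smoker": 25, "Nonsmoker": 55},
}

# Percentage of smokers within each gender
smoker_pct = {
    sex: 100 * counts["Smoker"] / sum(counts.values())
    for sex, counts in observed.items()
}
print(smoker_pct)  # {'Male': 75.0, 'Female': 31.25}

chi_square, n = 12.76, 100
phi = math.sqrt(chi_square / n)  # assumed reading of the slide's chi-square / N
print(round(phi, 2))  # 0.36
```

So males in this sample are far more likely to smoke (75% vs. about 31%), which is the "how" behind the significant test.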
For larger tables (df > 1), eliminate small cells by combining
their corresponding categories in a meaningful way.
That is, recode the variable that is causing small cells into a
new variable with fewer categories and then use this new
variable to redo the Chi-Square test.
Correlation Coefficient:
To understand the practical meaning of r, we can square it.
What would $r^2$ mean/represent?
How is it calculated?
$r^2$ = (Covariation of X and Y together) / (Total variation of X & Y combined)
How do we measure/quantify variations?
$r^2 = \frac{[\sum (x - \bar{x})(y - \bar{y}) / (n - 1)]^2}{[\sum (x - \bar{x})^2 / (n - 1)][\sum (y - \bar{y})^2 / (n - 1)]} = \frac{[\sum (x - \bar{x})(y - \bar{y})]^2}{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}$
$r^2$ always represents a %.
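As an illustration of the sums-of-squares form of $r^2$, the sketch below applies it to the golf data from the covariance example (the partial X/Y table on the next slide does not list all of its rows, so it cannot be used for a full computation). The variable names are illustrative.

```python
from statistics import mean

x = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # driving distance (yards)
y = [69, 71, 70, 70, 71, 69]                    # average 18-hole score

x_bar, y_bar = mean(x), mean(y)
sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # covariation
sum_xx = sum((a - x_bar) ** 2 for a in x)                      # variation in x
sum_yy = sum((b - y_bar) ** 2 for b in y)                      # variation in y

r_squared = sum_xy ** 2 / (sum_xx * sum_yy)
print(f"{r_squared:.1%}")  # 92.8%
```

The (n - 1) terms cancel, so only the three sums are needed; here about 93% of the variation in scores is associated with variation in driving distance.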
$r^2 = \frac{[\sum (x - \bar{x})(y - \bar{y})]^2}{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}$

   X      Y     (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)   (X - X̄)²   (Y - Ȳ)²
   4     12       -3        -4           12              9          16
   6     19       -1         3           -3              1           9
   9     14        2        -2           -4              4           4
   .      .        .         .            .              .           .
   .      .        .         .            .              .           .
   .      .        .         .            .              .           .
 X̄ = 7  Ȳ = 16                      Σ(X - X̄)(Y - Ȳ)   Σ(X - X̄)²   Σ(Y - Ȳ)²

Correlation Coefficient?
$Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
Compare Groups
Proportions (e.g., Chi-Square Test, $\chi^2$)
Means (e.g., Analysis of Variance)
QUESTION: Logically, what would be the first thing you would do?
             Male        Female      TOTAL
Smoker       O11 = 15    O12 = 25     40
Nonsmoker    O21 = 5     O22 = 55     60
TOTAL        20          80           n = 100

             Male                  Female                 TOTAL
Smoker       O11 = 15, E11 = 8     O12 = 25, E12 = 32      40
Nonsmoker    O21 = 5,  E21 = 12    O22 = 55, E22 = 48      60
TOTAL        20                    80                      n = 100
So, the key to answering our original question lies in the size of the
discrepancies between observed and expected frequencies.
What is, then, the next logical step?
Positive and negative values of $(O_{ij} - E_{ij})$ (RESIDUALS) for different cells will
cancel out.
Solution?
Square each $(O_{ij} - E_{ij})$ and then sum them up, i.e., compute $\sum (O_{ij} - E_{ij})^2$.
Solution?
Divide each $(O_{ij} - E_{ij})^2$ value by its corresponding $E_{ij}$ value before
summing them up across all cells.
That is, compute the total discrepancy per subject index.
$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
$\chi^2 = \frac{(15 - 8)^2}{8} + \frac{(25 - 32)^2}{32} + \frac{(5 - 12)^2}{12} + \frac{(55 - 48)^2}{48} = 12.76$
$df = (r - 1)(c - 1)$
$df = (2 - 1)(2 - 1) = 1$
where r and c are the numbers of rows and columns of the contingency
table.
             Male             Female           TOTAL
Smoker       O11 = 15         O12 = 25          40
Nonsmoker    O21 = 5          O22 = 55          60
TOTAL        20               80                n = 100
% Smokers    15 / 20 = 75%    25 / 80 = 31%
$\chi^2 / N$
For larger tables (df > 1), eliminate small cells by combining
their corresponding categories in a meaningful way.
Assignment #3
NOTE:
If you examine the value labels for the variable daysofwk, you will see that it is coded
as 1=Sunday, 2=Monday, 3=Tuesday, 4=Wednesday, 5=Thursday, 6=Friday, and
7=Saturday. Therefore, for part (b), you will need to create a new variable, i.e., recode
daysofwk into a new dichotomous variable (say, deathday) that represents death
on Mondays, Tuesdays, and Wednesdays vs. Fridays, Saturdays, and Sundays.
Notice that the subjects who died on Thursdays should not be included in the analysis
(i.e., should not be represented in either of the two categories of days represented by the
new variable). Also, make sure you properly define the attributes (e.g., label, value
labels, etc.) of this new variable (i.e., deathday).
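The recode logic can be sketched outside SPSS to make the mapping explicit. The assignment itself is to be done in SPSS; the Python below only illustrates the rule. The codes 1 and 2 for the new variable and the function name `recode_deathday` are hypothetical choices, and `None` stands in for a system-missing value.

```python
def recode_deathday(daysofwk):
    """Map daysofwk (1=Sunday ... 7=Saturday) to the dichotomous deathday.

    Returns 1 for Monday-Wednesday, 2 for Friday-Sunday, and None for
    Thursday so that those subjects drop out of the analysis.
    """
    if daysofwk in (2, 3, 4):      # Monday, Tuesday, Wednesday
        return 1
    if daysofwk in (6, 7, 1):      # Friday, Saturday, Sunday
        return 2
    return None                    # Thursday (5): excluded

recoded = [recode_deathday(day) for day in range(1, 8)]
print(recoded)  # [2, 1, 1, 1, None, 2, 2]
```

Whatever codes are chosen, the value labels of the new variable should state which group each code represents.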
REMINDERS:
For each analysis, include the Notes in the printout. Also, edit the first page of your
first analysis output to include your name. Make sure that on your printout you
explain your findings and conclusions. Be specific as to what parts of the output you
have used, and how you have used them, to reach your conclusions.
Make sure that you tell the whole story and that your explanations of the findings are
complete. For example, it is not enough to say that there is a significant relationship
between characteristic A and characteristic B. You have to go on to indicate how the
two characteristics are related and what that relationship really means.
HYPOTHESIS TESTING
QUESTIONS OR COMMENTS?