Chapters 14 and 15


Chapter 14

Analysis of Variance
Responsible for Section 14.1 only: pp. 520–538

Analysis of Variance
ANOVA stands for ANalysis Of VAriance
Analysis of variance is a technique that allows us to
compare two or more populations of interval data.

Analysis of variance is:
• an extremely powerful and widely used procedure.
• a procedure which determines whether differences exist between population means.
• a procedure which works by analyzing sample variance.

One-Way ANOVA
One-way ANOVA allows us to test simultaneously whether two or more population means are equal.

H0: µ1 = µ2 = µ3
HA: At least two means differ

ANOVA Assumptions
• All populations are normally distributed
• The population variances are equal
• ANOVA tests assume that variances can be pooled
• The observations are independent

One-Way Analysis of Variance
Independent samples are drawn from k populations:

Note: These populations are referred to as treatments.


It is not a requirement that n1 = n2 = … = nk.

One Way Analysis of Variance
New Terminology:

x is the response variable, and its values are responses.

xij refers to the ith observation in the jth sample.
E.g. x35 is the third observation of the fifth sample.

The grand mean, x̿ (x-double-bar), is the mean of all the observations, i.e.:

x̿ = (Σj Σi xij) / n,  where n = n1 + n2 + … + nk

One Way Analysis of Variance
More New Terminology:

The criterion by which the populations are classified is called a factor.

Each population is a factor level.

Example 14.1
In the last decade stockbrokers have drastically changed the
way they do business. It is now easier and cheaper to invest
in the stock market than ever before.

What are the effects of these changes?

To help answer this question, a financial analyst randomly sampled 366 American households and asked each to report the age of the head of the household and the proportion of their financial assets that are invested in the stock market.

Example 14.1
The age categories are
Young (Under 35)
Early middle-age (35 to 49)
Late middle-age (50 to 65)
Senior (Over 65)
The analyst was particularly interested in determining
whether the ownership of stocks varied by age. Xm14-01

Do these data allow the analyst to determine that there are differences in stock ownership between the four age groups?

Example 14.1 Terminology

Percentage of total assets invested in the stock market is the response variable; the actual percentages are the responses in this example.

The criterion by which the populations are classified is called a factor. The age category is the factor we're interested in. This is the only factor under consideration (hence the term "one-way" analysis of variance).

Each population is a factor level. In this example, there are four factor levels: Young, Early middle age, Late middle age, and Senior.
Example 14.1

Young Early Middle Age Late Middle Age Senior

24.8 28.9 81.5 66.8

35.5 7.3 0.0 77.4

68.7 61.8 61.3 32.9

42.2 53.6 0.0 74.0

⋮ ⋮ ⋮ ⋮

Example 14.1 IDENTIFY

The null hypothesis in this case is:

H0: µ1 = µ2 = µ3 = µ4

i.e. there are no differences between population means.

Our alternative hypothesis becomes:

H1: at least two means differ

OK. Now we need some test statistics…

Test Statistic
Since whether µ1 = µ2 = µ3 = µ4 holds is of interest to us, a statistic that measures the proximity of the sample means to each other would also be of interest.

Such a statistic exists, and is called the between-treatments variation. It is denoted SST, short for "sum of squares for treatments". It is calculated as:

SST = Σj nj (x̄j – x̿)²   (summed across the k treatments, where x̿ is the grand mean)

A large SST indicates large variation between sample means, which supports H1.
Test Statistic
When we performed the equal-variances test to determine whether two means differed (Chapter 13) we used

t = (x̄1 – x̄2) / sqrt[ sp² (1/n1 + 1/n2) ]   where   sp² = [(n1 – 1)s1² + (n2 – 1)s2²] / (n1 + n2 – 2)

The numerator measures the difference between sample means and the denominator measures the variation in the samples.

Test Statistic
SST gave us the between-treatments variation. A second statistic, SSE (Sum of Squares for Error), measures the within-treatments variation.

SSE is given by:

SSE = Σj Σi (xij – x̄j)²   or, equivalently:   SSE = (n1 – 1)s1² + (n2 – 1)s2² + … + (nk – 1)sk²

In the second formulation, it is easier to see that it provides a measure of the amount of variation we can expect from the random variable we've observed.

Example 14.1 COMPUTE

Since SST measures the variation among the sample means, if it were the case that

x̄1 = x̄2 = x̄3 = x̄4

then SST = 0 and our null hypothesis, H0: µ1 = µ2 = µ3 = µ4, would be supported.

More generally, a small value of SST supports the null hypothesis. A large value of SST supports the alternative hypothesis. The question is, how large is "large enough"?
Example 14.1 COMPUTE

The following sample statistics and grand mean were computed:

x̄1 = 44.40
x̄2 = 52.47
x̄3 = 51.14
x̄4 = 51.84
x̿ = 50.18

Example 14.1 COMPUTE

Hence, the between-treatments variation (sum of squares for treatments) is

SST = 84(x̄1 – x̿)² + 131(x̄2 – x̿)² + 93(x̄3 – x̿)² + 58(x̄4 – x̿)²
    = 84(44.40 – 50.18)² + 131(52.47 – 50.18)² + 93(51.14 – 50.18)² + 58(51.84 – 50.18)²
    = 3,741.4

Is SST = 3,741.4 "large enough"?
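For readers who want to check the arithmetic programmatically, here is a minimal Python sketch of the SST calculation (not part of the textbook, which works in Excel); the group sizes and rounded sample means are the ones shown above:

```python
# Sketch only: SST = sum over the k treatments of n_j * (xbar_j - grand_mean)^2
n = [84, 131, 93, 58]                    # group sizes for the four age categories
xbar = [44.40, 52.47, 51.14, 51.84]      # sample means from the previous slide

grand_mean = sum(nj * xj for nj, xj in zip(n, xbar)) / sum(n)   # about 50.18
SST = sum(nj * (xj - grand_mean) ** 2 for nj, xj in zip(n, xbar))
print(round(grand_mean, 2), round(SST, 1))   # about 3,739; the slide's 3,741.4 differs only through rounding of the means
```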

Example 14.1 COMPUTE

We calculate the sample variances as:

s1² = 386.55,  s2² = 469.44,  s3² = 471.82,  s4² = 444.79

and from these, calculate the within-treatments variation (sum of squares for error) as:

SSE = (n1 – 1)s1² + (n2 – 1)s2² + (n3 – 1)s3² + (n4 – 1)s4²
    = (84 – 1)(386.55) + (131 – 1)(469.44) + (93 – 1)(471.82) + (58 – 1)(444.79)
    = 161,871.0
We still need a couple more quantities in order to relate SST
and SSE together in a meaningful way…
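A matching sketch for SSE, again only to check the arithmetic, using the sample variances listed above:

```python
# Sketch only: SSE = sum over the k treatments of (n_j - 1) * s_j^2
n = [84, 131, 93, 58]
s2 = [386.55, 469.44, 471.82, 444.79]    # sample variances from this slide

SSE = sum((nj - 1) * sj2 for nj, sj2 in zip(n, s2))
print(round(SSE, 1))                     # about 161,871, as above
```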
MST and MSE
• Dividing the sum of squares for treatments by its degrees of freedom (k – 1) gives us the variance explained
• Dividing the sum of squares for error by the remaining (residual) degrees of freedom (n – k) gives us the variance unexplained
• Taking the ratio of these two variances gives us an F-statistic

Mean Squares
The mean square for treatments (MST) is given by:

MST = SST / (k – 1)

The mean square for error (MSE) is given by:

MSE = SSE / (n – k)

And the test statistic:

F = MST / MSE

is F-distributed with k–1 and n–k degrees of freedom.

Aha! We must be close…

Example 14.1 COMPUTE

We can calculate the mean square for treatments and mean square for error as:

MST = SST / (k – 1) = 3,741.4 / 3 = 1,247.12

MSE = SSE / (n – k) = 161,871.0 / 362 = 447.16

Giving us our F-statistic of:

F = MST / MSE = 1,247.12 / 447.16 = 2.79

Does F = 2.79 fall into a rejection region or not? What is the p-value?
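One way to answer both questions is to compute the critical value and the p-value directly. The sketch below uses scipy (our choice; the textbook does this with Excel) together with the SST and SSE values from the previous slides:

```python
from scipy import stats

k, n = 4, 366                       # number of treatments and total sample size
SST, SSE = 3741.4, 161871.0         # from the previous slides

MST = SST / (k - 1)                 # about 1,247.1
MSE = SSE / (n - k)                 # about 447.2
F = MST / MSE                       # about 2.79

p_value = stats.f.sf(F, k - 1, n - k)            # P(F > 2.79) with 3 and 362 degrees of freedom
critical = stats.f.ppf(1 - 0.05, k - 1, n - k)   # boundary of the rejection region at alpha = .05
print(round(F, 2), round(p_value, 4), round(critical, 2))   # p-value is about .04, matching the .0405 on the interpretation slide
```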
Example 14.1 INTERPRET

Since the purpose of calculating the F-statistic is to determine whether the value of SST is large enough to reject the null hypothesis, if SST is large, F will be large.

p-value = P(F > Fstat)

Example 14.1 COMPUTE

Using Excel:
Click Data, Data Analysis, Anova: Single Factor
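For readers working outside Excel, scipy's f_oneway performs the same one-way ANOVA on the raw columns. The four lists below are only the first few observations shown earlier for Xm14-01, so they are placeholders; substitute the full columns:

```python
from scipy import stats

# Placeholders: only the first four observations of each column of Xm14-01 are shown on the data slide
young  = [24.8, 35.5, 68.7, 42.2]
early  = [28.9, 7.3, 61.8, 53.6]
late   = [81.5, 0.0, 61.3, 0.0]
senior = [66.8, 77.4, 32.9, 74.0]

# One-way ANOVA on the raw responses, the programmatic analogue of Excel's Anova: Single Factor
F, p = stats.f_oneway(young, early, late, senior)
print(F, p)
```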

Example 14.1 INTERPRET

Since the p-value is .0405, which is small, we reject the null hypothesis (H0: µ1 = µ2 = µ3 = µ4) in favor of the alternative hypothesis (H1: at least two population means differ).

That is, there is enough evidence to infer that the mean percentages of assets invested in the stock market differ between the four age categories.

ANOVA Table
The results of analysis of variance are usually reported in an
ANOVA table…
Source of Variation   degrees of freedom   Sum of Squares   Mean Square
Treatments            k – 1                SST              MST = SST/(k – 1)
Error                 n – k                SSE              MSE = SSE/(n – k)
Total                 n – 1                SS(Total)

F-stat = MST/MSE

ANOVA and t-tests of 2 means
Why do we need the analysis of variance? Why not test every pair of means? For example, say k = 6. There are C(6,2) = 6(5)/2 = 15 different pairs of means.
1&2 1&3 1&4 1&5 1&6
2&3 2&4 2&5 2&6
3&4 3&5 3&6
4&5 4&6
5&6

If we test each pair with α = .05 we increase the probability of making a Type I error. If there are no differences, then the probability of making at least one Type I error is 1 – (.95)^15 = 1 – .463 = .537.
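The .537 figure is easy to reproduce (treating the 15 tests as independent, as the slide does):

```python
# Probability of at least one Type I error across 15 tests at alpha = .05 (treated as independent)
print(1 - 0.95 ** 15)   # about 0.537
```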

Checking the Required Conditions
The F-test of the analysis of variance requires that the
random variable be normally distributed with equal
variances. The normality requirement is easily checked
graphically by producing the histograms for each sample.
(To see histograms click Example 14.1 Histograms)

The equality of variances is examined by printing the sample standard deviations or variances. The similarity of sample variances allows us to assume that the population variances are equal.
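If you prefer a numerical check to a visual one, scipy offers standard tests for both conditions. This is only a sketch (Shapiro-Wilk for normality and Levene's test for equal variances, neither of which the slides themselves use), with placeholder data standing in for the four columns of Xm14-01:

```python
from scipy import stats

# Placeholder data standing in for the four columns of Xm14-01
groups = [
    [24.8, 35.5, 68.7, 42.2],   # Young
    [28.9, 7.3, 61.8, 53.6],    # Early middle age
    [81.5, 0.0, 61.3, 0.0],     # Late middle age
    [66.8, 77.4, 32.9, 74.0],   # Senior
]

# Normality check for each sample (the slides do this graphically, with histograms)
for g in groups:
    w, p = stats.shapiro(g)
    print(p)

# Levene's test for equality of variances; a large p-value is consistent with equal variances
stat, p = stats.levene(*groups)
print(p)
```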

Violation of the Required Conditions
If the data are not normally distributed we can replace the
one-way analysis of variance with its nonparametric
counterpart, which is the Kruskal-Wallis test. (See Section
19.3.)

If the population variances are unequal, we can use several methods to correct the problem.

However, these corrective measures are beyond the level of this book.

Identifying Factors
Factors that Identify the One-Way Analysis of Variance:
• Problem objective: compare two or more populations
• Data type: interval

Chapter 15

Chi-Squared Tests
Responsible for Sections 15.1–15.3 only.

A Common Theme…
What to do?                                  Data Type?   Number of Categories?   Statistical Technique
Describe a population                        Nominal      Two or more             χ² goodness-of-fit test
Compare two populations                      Nominal      Two or more             χ² test of a contingency table
Compare two or more populations              Nominal      --                      χ² test of a contingency table
Analyze relationship between two variables   Nominal      --                      χ² test of a contingency table

One data type… …Two techniques

Two Techniques…
The first is a goodness-of-fit test applied to data produced by a multinomial experiment (a generalization of a binomial experiment); it is used to describe one population of data.

The second uses data arranged in a contingency table to determine whether two classifications of a population of nominal data are statistically independent; this test can also be interpreted as a comparison of two or more populations.

In both cases, we use the chi-squared (χ²) distribution.

The Multinomial Experiment…
Unlike a binomial experiment which only has two possible
outcomes (e.g. heads or tails), a multinomial experiment:

• Consists of a fixed number, n, of trials.


• Each trial can have one of k outcomes, called cells.
• Each probability pi remains constant.
• Our usual notion of probabilities holds, namely:
p1 + p2 + … + pk = 1, and
• Each trial is independent of the other trials.
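For intuition, a multinomial experiment is easy to simulate with numpy; the sketch below uses n = 200 trials and the three market-share probabilities that appear in Example 15.1 on the following slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# One multinomial experiment: n = 200 trials, k = 3 cells with fixed probabilities that sum to 1
counts = rng.multinomial(200, [0.45, 0.40, 0.15])
print(counts, counts.sum())   # three cell counts adding up to 200
```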

Chi-squared Goodness-of-Fit Test…
We test whether there is sufficient evidence to reject a
specified set of values for pi.

To illustrate, our null hypothesis is:

H0: p1 = a1, p2 = a2, …, pk = ak

where a1, a2, …, ak are the values we want to test.

Our research hypothesis is:


H1: At least one pi is not equal to its specified value

Example 15.1
Two companies, A and B, have recently conducted
aggressive advertising campaigns to maintain and possibly
increase their respective shares of the market for fabric
softener. These two companies enjoy a dominant position in
the market. Before the advertising campaigns began, the
market share of company A was 45%, whereas company B
had 40% of the market. Other competitors accounted for the
remaining 15%.

Example 15.1
To determine whether these market shares changed after the
advertising campaigns, a marketing analyst solicited the
preferences of a random sample of 200 customers of fabric
softener. Of the 200 customers, 102 indicated a preference
for company A's product, 82 preferred company B's fabric
softener, and the remaining 16 preferred the products of one
of the competitors. Can the analyst infer at the 5%
significance level that customer preferences have changed
from their levels before the advertising campaigns were
launched?

Example 15.1…
We compare market share before and after an advertising
campaign to see if there is a difference (i.e. if the advertising
was effective in improving market share). We hypothesize
values for the parameters equal to the before-market share.
That is,
H0: p1 = .45, p2 = .40, p3 = .15

The alternative hypothesis is a denial of the null. That is,

H1: At least one pi is not equal to its specified value

Example 15.1…
Test Statistic
If the null hypothesis is true, we would expect the number of
customers selecting brand A, brand B, and other to be 200 times the
proportions specified under the null hypothesis. That is,
e1 = 200(.45) = 90
e2 = 200(.40) = 80
e3 = 200(.15) = 30
In general, the expected frequency for each cell is given by
ei = npi

This expression is derived from the formula for the expected value of a
binomial random variable, introduced in Section 7.4.

Example 15.1…
If the expected frequencies and the observed frequencies are quite
different, we would conclude that the null hypothesis is false, and we
would reject it.

However, if the expected and observed frequencies are similar, we would not reject the null hypothesis.

The test statistic measures the similarity of the expected and observed
frequencies.

Chi-squared Goodness-of-Fit Test…
Our chi-squared goodness-of-fit test statistic is given by:

χ² = Σ (fi – ei)² / ei

where fi is the observed frequency and ei is the expected frequency of cell i.

Note: this statistic is approximately chi-squared with k–1 degrees of freedom provided the sample size is large. The rejection region is:

χ² > χ²α, k–1

Example 15.1… COMPUTE

In order to calculate our test statistic, we lay out the data in a tabular fashion for easier calculation by hand:

Company   Observed Frequency fi   Expected Frequency ei   Delta (fi – ei)   Summation Component (fi – ei)²/ei
A         102                     90                      12                1.60
B         82                      80                      2                 0.05
Others    16                      30                      -14               6.53
Total     200                     200                                       8.18

Check that the observed and expected totals are equal.
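The same calculation can be done in one call with scipy's chisquare, which implements exactly this statistic; a minimal sketch using the frequencies above:

```python
from scipy import stats

observed = [102, 82, 16]    # sample preferences: company A, company B, others
expected = [90, 80, 30]     # 200 * (.45, .40, .15) under H0

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p, 4))   # statistic about 8.18; p-value about .017, below alpha = .05
```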

Example 15.1… INTERPRET

Our rejection region is:

χ² > χ²α, k–1 = χ².05, 2 = 5.99

Since our test statistic is 8.18, which is greater than our critical value for chi-squared, we reject H0 in favor of H1. That is,

"There is sufficient evidence to infer that the proportions have changed since the advertising campaigns were implemented."

Required Conditions…
In order to use this technique, the sample size must be large
enough so that the expected value for each cell is 5 or more.
(i.e. n x pi ≥ 5)

If the expected frequency is less than five, combine it with other cells to satisfy the condition.

Identifying Factors…
Factors that Identify the Chi-Squared Goodness-of-Fit Test:
• Problem objective: describe a population
• Data type: nominal
• Number of categories: two or more

Expected cell frequencies: ei = (n)(pi)

Chi-squared Test of a Contingency Table
The Chi-squared test of a contingency table is used to:
• determine whether there is enough evidence to infer
that two nominal variables are related, and
• to infer that differences exist among two or more
populations of nominal variables.

In order to use these techniques, we need to classify the data according to two different criteria.

Independence Test

The chi-square independence test can be used to test the independence of two variables.

H0: There is no relationship between two variables.
H1: There is a relationship between two variables.

If the null hypothesis is rejected, there is some relationship between the variables.

Chi-Square Independence Test
In order to test the null hypothesis, you must compute the
expected frequencies, assuming the null hypothesis is true.

When data are arranged in table form for the independence test, the table is called a contingency table.

Contingency Table
          Column 1   Column 2   Column 3
Row 1     C1,1       C1,2       C1,3
Row 2     C2,1       C2,2       C2,3

Each block, or cell, is denoted Ci,j (i = row number, j = column number).

The degrees of freedom for any contingency table are
d.f. = (rows – 1)(columns – 1) = (R – 1)(C – 1).

Independence Test Value
The formula for the test value for the independence test is
the same as the one for the goodness-of-fit test.

With d.f. = (R – 1)(C – 1).

Required Condition – Rule of Five…
In a contingency table where one or more cells have
expected values of less than 5, we need to combine rows or
columns to satisfy the rule of five.

Note: by doing this, the degrees of freedom must be changed as well.

Example 15.2
The MBA program was experiencing problems scheduling its courses. The demand for the program's optional courses and majors was quite variable from one year to the next.

In desperation the dean of the business school turned to a statistics professor for assistance.

The statistics professor believed that the problem may be the variability in the academic background of the students and that the undergraduate degree affects the choice of major.

Example 15.2
As a start he took a random sample of last year's MBA
students and recorded the undergraduate degree and the
major selected in the graduate program.

The undergraduate degrees were BA, BEng, BBA, and several others.

There are three possible majors for the MBA students: accounting, finance, and marketing. Can the statistician conclude that the undergraduate degree affects the choice of major?

Example 15.2
Xm15-02

The data are stored in two columns. The first column consists of the integers 1, 2, 3, and 4 representing the undergraduate degree, where
1 = BA
2 = BEng
3 = BBA
4 = other

The second column lists the MBA major where

1= Accounting
2 = Finance
3 = Marketing
Example 15.2 IDENTIFY

The problem objective is to determine whether two variables (undergraduate degree and MBA major) are related. Both variables are nominal. Thus, the technique to use is the chi-squared test of a contingency table. The alternative hypothesis specifies what we test. That is,

H1: The two variables are dependent

The null hypothesis is a denial of the alternative hypothesis.

H0: The two variables are independent.


Test Statistic
The test statistic is the same as the one used to test proportions in the goodness-of-fit test. That is, the test statistic is

χ² = Σ (fi – ei)² / ei

Note, however, that there is a major difference between the two applications. In this one the null hypothesis does not specify the proportions pi, from which we compute the expected values ei, which we need to calculate the χ² test statistic. That is, we cannot use

ei = npi

because we don't know the pi (they are not specified by the null hypothesis). It is necessary to estimate the pi from the data.
Example 15.2
The first step is to count the number of students in each of
the 12 combinations. The result is called a cross-
classification table.

Example 15.2

                   MBA Major
Undergrad Degree   Accounting   Finance   Marketing   Total
BA                 31           13        16          60
BEng               8            16        7           31
BBA                12           10        17          39
Other              10           5         7           22
Total              61           44        47          152

Example 15.2
If the null hypothesis is true (Remember we always start with this
assumption.) and the two nominal variables are independent, then, for
example

P(BA and Accounting) = [P(BA)] [P(Accounting)]

Since we don't know the values of P(BA) or P(Accounting), we need to use the data to estimate the probabilities.

Test Statistic
There are 152 students, of whom 61 have chosen accounting as their MBA major. Thus, we estimate the probability of accounting as

P(Accounting) = 61/152 = .401

Similarly,

P(BA) = 60/152 = .395

Example 15.2…
If the null hypothesis is true,

P(BA and Accounting) = (60/152)(61/152)

Now that we have the probability, we can calculate the expected value. That is,

E(BA and Accounting) = 152(60/152)(61/152) = (60)(61)/152 = 24.08

We can do the same for the other 11 cells.
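Doing "the same for the other 11 cells" amounts to computing (row total)(column total)/n for every cell; a short numpy sketch using the cross-classification table above:

```python
import numpy as np

# Cross-classification table: rows BA, BEng, BBA, Other; columns Accounting, Finance, Marketing
observed = np.array([[31, 13, 16],
                     [ 8, 16,  7],
                     [12, 10, 17],
                     [10,  5,  7]])

row_totals = observed.sum(axis=1)    # 60, 31, 39, 22
col_totals = observed.sum(axis=0)    # 61, 44, 47
n = observed.sum()                   # 152

expected = np.outer(row_totals, col_totals) / n   # e_ij = (row total)(column total)/n
print(np.round(expected, 2))                      # e.g. the BA/Accounting cell is about 24.08
```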

Example 15.2 COMPUTE

We can now compare observed with expected frequencies (expected frequencies in parentheses)…

Undergrad Degree   Accounting   Finance      Marketing
BA                 31 (24.08)   13 (17.37)   16 (18.55)
BEng               8 (12.44)    16 (8.97)    7 (9.59)
BBA                12 (15.65)   10 (11.29)   17 (12.06)
Other              10 (8.83)    5 (6.37)     7 (6.80)

and calculate our test statistic:

χ² = Σ (fij – eij)² / eij = 14.70
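scipy's chi2_contingency wraps the whole procedure (expected frequencies, test statistic, degrees of freedom, p-value); a sketch using the observed table above:

```python
from scipy import stats
import numpy as np

observed = np.array([[31, 13, 16],
                     [ 8, 16,  7],
                     [12, 10, 17],
                     [10,  5,  7]])

chi2, p, dof, expected = stats.chi2_contingency(observed)   # Yates correction only affects 2x2 tables
print(round(chi2, 2), dof, round(p, 4))   # about 14.70 with 6 d.f.; p-value about .023, matching the .0227 on the interpretation slide
```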

Example 15.2… COMPUTE

Using Excel: Click Add-Ins, Data Analysis Plus, Contingency Table [if the table has already been prepared] or Contingency Table (Raw Data) [if the table has not been completed]

Example 15.2… INTERPRET

The p-value is .0227. There is enough evidence to infer that the MBA major and the undergraduate degree are related.

We can also interpret the results of this test in two other ways:

1. There is enough evidence to infer that there are differences in MBA major between the four undergraduate categories.

2. There is enough evidence to infer that there are differences in undergraduate degree between the majors.

Identifying Factors…
Factors that identify the Chi-squared test of a contingency table:
• Problem objectives: compare two or more populations, or analyze the relationship between two variables
• Data type: nominal

Table 15.1 Statistical Techniques for Nominal Data
Problem Objective                                Categories    Statistical Technique
Describe a population                            2             z-test of p or the chi-squared goodness-of-fit test
Describe a population                            More than 2   Chi-squared goodness-of-fit test
Compare two populations                          2             z-test of p1 – p2 or chi-squared test of a contingency table
Compare two populations                          More than 2   Chi-squared test of a contingency table
Compare more than two populations                2 or more     Chi-squared test of a contingency table
Analyze the relationship between two variables   2 or more     Chi-squared test of a contingency table