
Chapter 4: The Chi-Square Distributions

4.1. Introduction
Dear learners, in this chapter you will be introduced to the use of statistical techniques to make inferences about
populations from sample data. This chapter develops two statistical techniques that involve nominal data.
The first is a goodness-of-fit test applied to data produced by a multinomial experiment, a generalization of a binomial
experiment. The second uses data arranged in a table (called a contingency table) to determine whether two
classifications of a population of nominal data are statistically independent; this test can also be interpreted as a
comparison of two or more populations.
Chapter objectives
After accomplishment of this chapter, you should be able to:
 Define Chi-square distribution.
 Identify the characteristics of the chi-square distribution.
 Conduct a test of hypothesis comparing an observed set of frequencies to an expected distribution.
 Conduct a test of hypothesis to determine whether two classification criteria are related.
Chi-square (χ²) Distribution: Areas of Application
Overview
Dear learner, this section will introduce you to the goodness-of-fit test applied to data produced by a multinomial
experiment, a generalization of a binomial experiment. In general, the goodness of fit test can be used with any
hypothesized probability distribution.
At the end of this section, you should be able to:
 Define Chi-square distribution.
 Identify characteristics of Chi-square distribution.
 Perform a goodness of fit test for multinomial population.
 Conduct Chi-square Test of Independence.
Self Test Exercise-1:
1. Define Chi-square distribution.
2. What are the characteristics of Chi-square distribution?
4.1.1 Chi-square (χ²) Distribution
The chi-square distribution is a continuous, positively skewed probability distribution ranging from 0 to +∞. As
shown in the chart below, like the Student t distribution, its shape depends on its number of degrees of freedom
(df).

The mean and variance of a chi-square random variable are E(χ²) = df and V(χ²) = 2df, respectively.
The value of χ² with df degrees of freedom such that the area to its right under the chi-squared curve equals α is
denoted χ²α,df. We cannot use ─χ²α,df to represent the point such that the area to its left is α (as we did with
the standard normal and Student t values) because χ² is always greater than 0. To represent left-tail critical values,
we note that if the area to the left of a point is α, the area to its right must be 1 ─ α, because the entire area
under the chi-squared curve (as under any continuous distribution) must equal 1. Thus, χ²1─α,df denotes the point
such that the area to its left is α.
The critical values of the chi-squared distribution for a given number of degrees of freedom can be obtained from the
chi-squared distribution table. For instance, to find the point in a chi-squared distribution with 8 degrees of freedom
such that the area to its right is 0.05, locate 0.05 across the top and 8 degrees of freedom in the left column. The
intersection of the row and column contains the number we seek, i.e., χ²0.05,8 = 15.5. To find the point in the same
distribution such that the area to its left is 0.05, find the point such that the area to its right is 1 ─ 0.05 = 0.95.
Locate χ²0.95 across the top row and 8 degrees of freedom down the left column; thus, χ²0.95,8 = 2.73.
For more than 100 degrees of freedom, the chi-squared distribution can be approximated by a normal
distribution with μ = df and σ = √(2df).
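As an illustrative sketch (not part of the original text), the table lookups above can be reproduced with SciPy's chi-squared distribution, assuming `scipy` is installed:

```python
from scipy.stats import chi2

# Right-tail critical value: area 0.05 to the right, df = 8 (chi2.ppf is the inverse CDF)
upper = chi2.ppf(1 - 0.05, df=8)
# Left-tail critical value: area 0.05 to the left, df = 8
lower = chi2.ppf(0.05, df=8)

print(round(upper, 2), round(lower, 2))  # 15.51 2.73, matching the table values
```

Note that software gives the critical values to more decimal places than the printed table (15.507 rather than 15.5).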
Self Test Exercise-2:
What is the basis for performing a goodness of fit test for a multinomial population?
4.1.2 Goodness of Fit Test: A Multinomial Population
Chi-square is a statistical test commonly used to compare observed data with the data we would expect to obtain
according to a specific hypothesis. For example, if, according to Mendel's laws, you expected 10 of 20
offspring from a cross to be male and the actual observed number was 8 males, then you might want to know about the
"goodness of fit" between the observed and expected counts. Were the deviations (differences between observed and
expected) the result of chance, or were they due to other factors? How much deviation can occur before you, the
investigator, must conclude that something other than chance is at work, causing the observed to differ from the
expected? The chi-square test always tests what scientists call the null hypothesis, which states that there is no
significant difference between the expected and observed results.
In this section, we consider the case in which each element of a population is assigned to one and only one of several
classes or categories. Such a population is a multinomial population. The multinomial distribution can be thought of
as an extension of the binomial distribution to the case of three or more categories of outcomes. On each trial of a
multinomial experiment, one and only one of the outcomes occurs. Each trial of the experiment is assumed
independent, and the probabilities of the outcomes remain the same for each trial.
The assumptions for the multinomial experiment parallel those for the binomial experiment with the exception that the
multinomial has three or more outcomes per trial.
As an example, consider the market share study being conducted by Addis Marketing Research. Over the past year,
market shares stabilized at 30% for company A, 50% for company B, and 20% for company C. Recently company C
developed a “new and improved” product to replace its current entry in the market. Company C retained Addis
Marketing Research to determine whether the new product will alter market shares.
In this case, the population of interest is a multinomial population; each customer is classified as buying from
company A, company B, or company C. Thus, we have a multinomial population with three outcomes. Let us use the
following notation for the proportions.
pA = market share for company A
pB = market share for company B
pC = market share for company C
Addis Marketing Research will conduct a sample survey and compute the proportion preferring each company’s
product. A hypothesis test will then be conducted to see whether the new product caused a change in market shares.
Assuming that company C’s new product will not alter the market shares, the null and alternative hypotheses are stated
as follows.
H0:pA = 0.30, pB = 0.50, and pC = 0.20
H1: The population proportions are not pA = 0.30, pB = 0.50, and pC = 0.20
If the sample results lead to the rejection of H0, Addis Marketing Research will have evidence that the introduction of
the new product affects market shares.

Let us assume that the market research firm has used a consumer panel of 200 customers for the study. Each individual
was asked to specify a purchase preference among the three alternatives: company A’s product, company B’s product,
and company C’s new product. The 200 responses are summarized here.
Observed Frequency
Company A’s product Company B’s product Company C’s new product
48 98 54
We now can perform a goodness of fit test that will determine whether the sample of 200 customer purchase
preferences is consistent with the null hypothesis. The goodness of fit test is based on a comparison of the sample of
observed results with the expected results under the assumption that the null hypothesis is true. Hence, the next step is
to compute expected purchase preferences for the 200 customers under the assumption that pA = 0.30, pB = 0.50, and
pC = 0.20. Doing so provides the expected results.
Thus, we see that the expected frequency for each category is found by multiplying the sample size of 200 by the
hypothesized proportion for the category.
Expected Frequency
Company A’s product Company B’s product Company C’s new product
200(0.30) = 60 200(0.50) =100 200(0.20) = 40

The goodness of fit test now focuses on the differences between the observed frequencies and the expected frequencies.
Large differences between observed and expected frequencies cast doubt on the assumption that the hypothesized
proportions or market shares are correct. Whether the differences between the observed and expected frequencies are
“large” or “small” is a question answered with the aid of the following test statistic.
Test Statistic for Goodness of Fit
χ² = ∑ᵢ₌₁ᵏ (fi ─ ei)² / ei
Where: fi= observed frequency for category i
ei= expected frequency for category i
k = the number of categories
Note: The test statistic has a chi-square distribution with k ─1 degrees of freedom provided that the expected
frequencies are 5 or more for all categories.
Let us continue with the Addis Market Research example and use the sample data to test the hypothesis that the
multinomial population retains the proportions pA = 0.30, pB = 0.50, and pC = 0.20. We will use an α= 0.05 level of
significance. We proceed by using the observed and expected frequencies to compute the value of the test statistic.
With the expected frequencies all 5 or more, the computation of the chi-square test statistic is shown in table below.
Category    Hypothesized proportion    Observed frequency (fi)    Expected frequency (ei)    fi ─ ei    (fi ─ ei)²    (fi ─ ei)²/ei
Company A   0.30                       48                         60                         ─12        144           2.40
Company B   0.50                       98                         100                        ─2         4             0.04
Company C   0.20                       54                         40                         14         196           4.90
Total       1.00                       200                        200                                                 χ² = 7.34
We will reject the null hypothesis if the differences between the observed and expected frequencies are large. Large
differences between the observed and expected frequencies will result in a large value for the test statistic. The test for
goodness of fit is always a one-tailed test with the rejection occurring in the upper tail of the chi-square distribution.
We can use the upper tail area for the test statistic and the p-value approach to determine whether the null hypothesis
can be rejected. With k ─ 1 = 3 ─ 1 = 2 degrees of freedom, the chi-square table provides the following:
Area in upper tail 0.10 0.05 0.025 0.01 0.005
𝝌2 value (2 df) 4.605 5.991 7.378 9.210 10.597
𝜒2 = 7.34

The test statistic χ² = 7.34 is between 5.991 and 7.378. Thus, the corresponding upper tail area or p-value must be
between 0.05 and 0.025. With p-value ≤ α = 0.05, we reject H0 and conclude that the introduction of the new product by
company C will alter the current market share structure. Using a computer (Excel or Minitab), the exact p-value
corresponding to χ² = 7.34 is 0.0255.
Instead of using the p-value, we could use the critical value approach to draw the same conclusion. With α = 0.05 and
2 degrees of freedom, the critical value for the test statistic is χ²0.05,2 = 5.991. The upper tail rejection rule becomes:
Reject H0 if χ² ≥ 5.991
With 7.34 > 5.991, we reject H0. The p-value approach and critical value approach provide the same hypothesis-testing
conclusion.
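The calculation above can be sketched in a few lines of Python; this is an illustration added here, assuming SciPy is available, and is not part of the original example:

```python
from scipy.stats import chisquare

observed = [48, 98, 54]                            # sample purchase preferences
expected = [200 * p for p in (0.30, 0.50, 0.20)]   # [60, 100, 40] under H0

# chisquare computes sum((f_obs - f_exp)^2 / f_exp) and the upper-tail p-value
result = chisquare(f_obs=observed, f_exp=expected)
print(round(result.statistic, 2))   # 7.34
print(round(result.pvalue, 4))      # 0.0255, the software p-value cited in the text
```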
Although no further conclusions can be made as a result of the test, we can compare the observed and expected
frequencies informally to obtain an idea of how the market share structure may change. Considering company C, we
find that the observed frequency of 54 is larger than the expected frequency of 40. Because the expected frequency
was based on current market shares, the larger observed frequency suggests that the new product will have a positive
effect on company C’s market share. Comparisons of the observed and expected frequencies for the other two
companies indicate that company C’s gain in market share will hurt company A more than company B.
Interpretation: There is sufficient evidence at the 5% significance level to infer that the proportions have changed
since the new product was introduced. If the sampling was conducted properly, we can be quite confident in our
conclusion.
Required Condition
The actual sampling distribution of the test statistic defined previously is discrete, but it can be approximated by the
chi-squared distribution provided that the sample size is large. This requirement is similar to the one we imposed when
we used the normal approximation to the binomial in the sampling distribution of a proportion. In that approximation
we needed np and n(1 ─p) to be 5 or more. A similar rule is imposed for the chi-squared test statistic. It is called the
rule of five, which states that the sample size must be large enough so that the expected value for each cell must be 5
or more. Where necessary, cells should be combined to satisfy this condition.
Self Test Exercise-3:
How do you conduct a chi-square test of independence?
4.1.3 Chi-square Test of Independence (Contingency Table)
Another important application of the chi-square distribution involves using sample data to test for the independence of
two variables. We introduce another chi-squared test, which is designed to satisfy two different problem objectives.
The chi-squared test of a contingency table is used to determine whether there is enough evidence to infer that two
nominal variables are related and to infer that differences exist between two or more populations of nominal variables.
Let us illustrate the test of independence by considering the study conducted by the Dashen Brewery. Dashen Brewery
manufactures and distributes three types of beer: light, regular, and dark. In an analysis of the market segments for the
three beers, the firm’s market research group raised the question of whether preferences for the three beers differ
among male and female beer drinkers. If beer preference is independent of the gender of the beer drinker, one
advertising campaign will be initiated for all of Dashen’s beers. However, if beer preference depends on the gender of
the beer drinker, the firm will tailor its promotions to different target markets.
A test of independence addresses the question of whether the beer preference (light, regular, or dark) is independent of
the gender of the beer drinker (male, female). The hypotheses for this test of independence are:
H0: Beer preference is independent of the gender of the beer drinker
H1: Beer preference is not independent of the gender of the beer drinker
Data in the following table can be used to describe the situation being studied.
Sample results for beer preferences of male and female beer drinkers (observed frequencies)
Beer Preference
Gender Light Regular Dark Total
Male 20 40 20 80
Female 30 30 10 70
Total 50 70 30 150

After identification of the population as all male and female beer drinkers, a sample can be selected and each
individual asked to state his or her preference for the three Dashen beers. Every individual in the sample will be
classified in one of the six cells in the table. Because we have listed all possible combinations of beer preference and
gender, or, in other words, all possible contingencies, the above table is called a contingency table. The test of
independence uses the contingency table format and for that reason is sometimes referred to as a contingency table
test.
Suppose a simple random sample of 150 beer drinkers is selected. After tasting each beer, the individuals in the
sample are asked to state their preference or first choice. The cross-tabulation in the above table summarizes the
responses for the study. As we see, the data for the test of independence are collected in terms of counts or frequencies
for each cell or category. Of the 150 individuals in the sample, 20 were men who favored light beer, 40 were men who
favored regular beer, 20 were men who favored dark beer, and so on.
The data in the above table are the observed frequencies for the six classes or categories. If we can determine the
expected frequencies under the assumption of independence between beer preference and gender of the beer drinker,
we can use the chi-square distribution to determine whether there is a significant difference between observed and
expected frequencies.
In Unit 2 we introduced independent events and showed that if two events A and B are independent, the joint
probability P(A and B) is equal to the product of P(A) and P(B).
That is, P(A and B) = P(A) X P(B)
The events in this example are the values each of the two nominal variables can assume. Unfortunately, we do not
have the probabilities of A and B. However, these probabilities can be estimated from the given data.
Expected frequencies for the cells of the contingency table are based on the following rationale. First, we assume that
the null hypothesis of independence between beer preference and gender of the beer drinker is true. Then we note that
in the entire sample of 150 beer drinkers, a total of 50 prefer light beer, 70 prefer regular beer, and 30 prefer dark beer.
In terms of probability we conclude that 50/150 = 1⁄3 of the beer drinkers prefer light beer, 70⁄150 = 7⁄15 prefer regular
beer, and 30⁄150 = 1⁄5 prefer dark beer. If the independence assumption is valid, we argue that these fractions must be
applicable to both male and female beer drinkers. Thus, under the assumption of independence, we would expect the
sample of 80 male beer drinkers to show that (1⁄3)80 = 26.67 prefer light beer, ( 7⁄15)80 = 37.33 prefer regular beer,
and (1⁄5)80 = 16 prefer dark beer. Application of the same probabilities to the 70 female beer drinkers provides the
expected frequencies shown in Table below.
EXPECTED FREQUENCIES IF BEER PREFERENCE IS INDEPENDENT OF THE GENDER OF THE BEER DRINKER
Beer Preference
Gender Light Regular Dark Total
Male 26.67 37.33 16 80
Female 23.33 32.67 14 70
Total 50 70 30 150
Let eij denote the expected frequency for the contingency table category in row i and column
j. With this notation, let us reconsider the expected frequency calculation for males (row i = 1) who prefer regular beer
(column j = 2), that is, expected frequency e12. Following the preceding argument for the computation of expected
frequencies, we can show that
e12 = (70/150) × 80 = (80 × 70)/150 = 37.33
Note that 80 in the expression is the total number of males (row 1 total), 70 is the total number of individuals
preferring regular beer (column 2 total), and 150 is the total sample size.
Hence, we see that
e12 = (Row 1 total)(Column 2 total) / Sample size
Generalization of the expression shows that the following formula provides the expected frequencies for a contingency
table in the test of independence.
eij = (Row i total)(Column j total) / Sample size

Using the formula for male beer drinkers who prefer dark beer, we find an expected frequency of e13 = (80)(30)/150 =
16, as shown in the above table.
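The row-total-times-column-total rule can be sketched with NumPy's outer product. This is an illustrative addition (the variable names are our own, not from the text):

```python
import numpy as np

observed = np.array([[20, 40, 20],    # male:   light, regular, dark
                     [30, 30, 10]])   # female: light, regular, dark

row_totals = observed.sum(axis=1)     # [80, 70]
col_totals = observed.sum(axis=0)     # [50, 70, 30]
n = observed.sum()                    # 150

# e_ij = (row i total)(column j total) / sample size, for every cell at once
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 2))
# [[26.67 37.33 16.  ]
#  [23.33 32.67 14.  ]]
```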
The test procedure for comparing the observed frequencies with the expected frequencies is similar to the goodness of
fit calculations made in the preceding section. Specifically, the χ² value based on the observed and expected frequencies
is computed as follows.
χ² = ∑ᵢⱼ (fij ─ eij)² / eij
Where: fij= observed frequency for contingency table category in row i and column j
eij= expected frequency for contingency table category in row i and column j based on
the assumption of independence
Note: With r rows and c columns in the contingency table, the test statistic has a chi-square distribution with (r ─ 1)(c ─ 1)
degrees of freedom provided that the expected frequencies are five or more for all categories.
Gender    Beer preference    Observed frequency (fij)    Expected frequency (eij)    fij ─ eij    (fij ─ eij)²    (fij ─ eij)²/eij
Male      Light              20                          26.67                       ─6.67        44.44           1.67
Male      Regular            40                          37.33                       2.67         7.44            0.19
Male      Dark               20                          16.00                       4.00         16.00           1.00
Female    Light              30                          23.33                       6.67         44.44           1.90
Female    Regular            30                          32.67                       ─2.67        7.44            0.22
Female    Dark               10                          14.00                       ─4.00        16.00           1.14
Total                        150                                                                                  χ² = 6.12

We can also calculate the value of the test statistic:


χ² = ∑ᵢⱼ (fij ─ eij)²/eij
= (20 ─ 26.67)²/26.67 + (40 ─ 37.33)²/37.33 + (20 ─ 16)²/16 + (30 ─ 23.33)²/23.33 + (30 ─ 32.67)²/32.67 + (10 ─ 14)²/14
= 1.67 + 0.19 + 1 + 1.9 + 0.22 + 1.14 = 6.12
To determine the rejection region we must know the number of degrees of freedom associated with the chi-squared
statistic. The number of degrees of freedom for a contingency table with r rows and c columns is (r ─ 1)(c ─ 1) = df .
For this example, the number of degrees of freedom is df = (2 ─ 1)(3─1) = 2.
If we employ a 5% significance level, the rejection region is χ² > χ²α,df = χ²0.05,2 = 5.991.
Because χ² = 6.12 > 5.991, we reject the null hypothesis and conclude that beer preference is not independent of the gender of
the beer drinker.
The p-value of the test statistic is 0.0469; this value cannot be read from the chi-square table and must be computed with statistical software.
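SciPy bundles the entire procedure, including the software-computed p-value just mentioned, in `scipy.stats.chi2_contingency`. The following sketch (an illustrative addition, assuming SciPy is available) reproduces the beer preference example:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 40, 20],    # male:   light, regular, dark
                  [30, 30, 10]])   # female: light, regular, dark

# Returns the test statistic, p-value, degrees of freedom, and expected frequencies
stat, pvalue, dof, expected = chi2_contingency(table)
print(round(stat, 2), round(pvalue, 4), dof)  # 6.12 0.0469 2
```

(SciPy applies Yates' continuity correction only to 2×2 tables, so none is applied here.)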
4.2. Goodness of Fit Test: Poisson and Normal Distributions
Overview
Dear learner, in Section 4.1.2 you were introduced to the goodness of fit test for a multinomial population. In this section, we
illustrate the goodness of fit test procedure for cases in which the population is hypothesized to have a Poisson or a
normal distribution. As we shall see, the goodness of fit test and the use of the chi-square distribution for the test
follow the same general procedure used for the goodness of fit test in Section 4.1.2.
Section objectives:
At the end of this section, you should be able to:
 Perform goodness of fit test for Poisson distribution.
 Conduct goodness of fit test for a normal distribution
Self Test Exercise-4:
How do you conduct a goodness of fit test for a Poisson distribution?

4.2.1. Poisson Distribution
Let us illustrate the goodness of fit test for the case in which the hypothesized population distribution is a Poisson
distribution. As an example, consider the arrival of customers at a bank. Because of some recent staffing problems, the
bank's managers asked a local consulting firm to assist with the scheduling of clerks for the checkout lanes. After
reviewing the checkout lane operation, the consulting firm will make a recommendation for a clerk-scheduling
procedure. The procedure, based on a mathematical analysis of waiting lines, is applicable only if the number of
customers arriving during a specified time period follows the Poisson distribution. Therefore, before the scheduling
process is implemented, data on customer arrivals must be collected and a statistical test conducted to see whether an
assumption of a Poisson distribution for arrivals is reasonable.
We define the arrivals at the bank in terms of the number of customers entering the bank during 5-minute intervals.
Hence, the following null and alternative hypotheses are appropriate for this example.
H0: The number of customers entering the bank during 5-minute intervals has a Poisson probability distribution
H1: The number of customers entering the bank during 5-minute intervals does not have a Poisson probability
distribution
To test the assumption of a Poisson distribution for the number of arrivals during weekday morning hours, a bank
employee randomly selects a sample of 128 five-minute intervals during weekday mornings over a three-week period.
For each 5-minute interval in the sample, the bank employee records the number of customer arrivals. In summarizing
the data, the employee determines the number of 5-minute intervals having no arrivals, the number of 5-minute
intervals having one arrival, the number of 5-minute intervals having two arrivals, and so on. These data are
summarized in Table below.
Number of Customers Arriving (x)    0    1    2    3    4    5    6    7    8    9    Total
Observed Frequency                  2    8    10   12   18   22   22   16   12   6    128
The above Table gives the observed frequencies for the 10 categories. We now want to use a goodness of fit test to
determine whether the sample of 128 time periods supports the hypothesized Poisson distribution. To conduct the
goodness of fit test, we need to consider the expected frequency for each of the 10 categories under the assumption
that the Poisson distribution of arrivals is true. That is, we need to compute the expected number of time periods in
which no customers, one customer, two customers, and so on would arrive if, in fact, the customer arrivals follow a
Poisson distribution.
The Poisson probability function is f(x) = μ^x e^(−μ) / x!
In this function, μ represents the mean or expected number of customers arriving per 5-minute period, x is the random
variable indicating the number of customers arriving during a 5-minute period, and f (x) is the probability that x
customers will arrive in a 5-minute interval.
Before we use the above equation to compute Poisson probabilities, we must obtain an estimate of μ, the mean number
of customer arrivals during a 5-minute time period. The sample mean for the data in the above Table provides this
estimate. With no customers arriving in two 5-minute time periods, one customer arriving in eight 5-minute time
periods, and so on, the total number of customers who arrived during the sample of 128, 5-minute time periods is
given by 0(2) + 1(8) + 2(10) + . . . + 9(6) = 640. The 640 customer arrivals over the sample of 128 periods provide a
mean arrival rate of μ = 640/128 = 5 customers per 5-minute period. With this value for the mean of the Poisson
distribution, an estimate of the Poisson probability function for the bank is f(x) = 5^x e^(−5) / x!
This probability function can be evaluated for different values of x to determine the probability associated with each
category of arrivals. For example, the probability of zero customers arriving during a 5-minute interval is f (0) =
0.0067, the probability of one customer arriving during a 5-minute interval is f (1) =0.0337, and so on. As we saw in
Section 4.1.2, the expected frequencies for the categories are found by multiplying the probabilities by the sample size.
For example, the expected number of periods with zero arrivals is given by (0.0067)(128) =0.86, the expected number
of periods with one arrival is given by (0.0337)(128) = 4.31, and so on. The summarized data are provided below in
table.

EXPECTED FREQUENCY OF BANK’S CUSTOMER ARRIVALS, ASSUMING A POISSON DISTRIBUTION WITH μ = 5
Number of Customers Arriving (x)    Poisson Probability f(x)    Expected Number of 5-Minute Time Periods with x Arrivals, 128 f(x)
0 0.0067 0.86
1 0.0337 4.31
2 0.0842 10.78
3 0.1404 17.97
4 0.1755 22.46
5 0.1755 22.46
6 0.1462 18.71
7 0.1044 13.36
8 0.0653 8.36
9 0.0363 4.65
10 or more 0.0318 4.07
Total 128
Before we make the usual chi-square calculations to compare the observed and expected frequencies, note that in the
above Table, four of the categories have an expected frequency less than five. This condition violates the requirements
for use of the chi-square distribution. However, expected category frequencies less than five cause no difficulty,
because adjacent categories can be combined to satisfy the "at least five" expected frequency requirement. In
particular, we will combine 0 and 1 into a single category and then combine 9 with "10 or more" into another single
category. Thus, the rule of a minimum expected frequency of five in each category is satisfied. The following Table
shows the observed and expected frequencies after combining categories.
Number of Customers Arriving (x) Observed frequency (fi) Expected frequency (ei)
0 or 1 10 5.17
2 10 10.78
3 12 17.97
4 18 22.46
5 22 22.46
6 22 18.71
7 16 13.36
8 12 8.36
9 or more 6 8.72
As in Section 4.1.2, the goodness of fit test focuses on the differences between observed and expected frequencies,
fi ─ ei. Thus, we will use the observed and expected frequencies shown in the above table to compute the chi-square
test statistic.
χ² = ∑ᵢ₌₁ᵏ (fi ─ ei)²/ei
= (10 ─ 5.17)²/5.17 + (10 ─ 10.78)²/10.78 + (12 ─ 17.97)²/17.97 + (18 ─ 22.46)²/22.46 + (22 ─ 22.46)²/22.46 + (22 ─ 18.71)²/18.71 + (16 ─ 13.36)²/13.36 + (12 ─ 8.36)²/8.36 + (6 ─ 8.72)²/8.72
= 4.5 + 0.06 + 1.98 + 0.89 + 0.01 + 0.58 + 0.52 + 1.59 + 0.85 = 10.96
The value of the test statistic is 𝜒2 = 10.96.
In general, the chi-square distribution for a goodness of fit test has k ─ p ─ 1 degrees of freedom, where k is the
number of categories and p is the number of population parameters estimated from the sample data. For the Poisson
distribution goodness of fit test, the above Table shows k = 9 categories. Because the sample data were used to
estimate the mean of the Poisson distribution, p = 1. Thus, there are k ─ p ─ 1 = k ─ 2 degrees of freedom. With k = 9,
we have 9 ─ 2 = 7 degrees of freedom.
Suppose we test the null hypothesis that the probability distribution for the customer arrivals is a Poisson distribution
with a 0.05 level of significance. Using the 0.05 level of significance, from Chi-square table, the critical value of χ2
with 7 degrees of freedom is 14.067.
The decision rule is: reject H0 if χ² > 14.067; otherwise, do not reject H0.
From the above computation, since χ² = 10.96 < 14.067, the decision is not to reject H0. There is insufficient evidence
to conclude that the arrivals per 5-minute interval do not fit a Poisson distribution. Hence, the assumption of a Poisson
probability distribution for weekday morning customer arrivals cannot be rejected. As a result, the bank's management
may proceed with the consulting firm's scheduling procedure for weekday mornings.
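The entire Poisson goodness of fit computation can be sketched in Python. This is an illustrative addition assuming NumPy and SciPy; the category grouping follows the combined table above:

```python
import numpy as np
from scipy.stats import poisson, chi2

counts = np.arange(10)                                       # 0..9 customers per interval
observed_raw = np.array([2, 8, 10, 12, 18, 22, 22, 16, 12, 6])
n = observed_raw.sum()                                       # 128 intervals
mu = (counts * observed_raw).sum() / n                       # 640 / 128 = 5

# Expected frequencies, combining {0,1} and {9 or more} to satisfy the rule of five
probs = np.array([poisson.pmf(0, mu) + poisson.pmf(1, mu)]
                 + [poisson.pmf(k, mu) for k in range(2, 9)]
                 + [poisson.sf(8, mu)])                      # P(X >= 9)
expected = n * probs
observed = np.array([10, 10, 12, 18, 22, 22, 16, 12, 6])

stat = ((observed - expected) ** 2 / expected).sum()
crit = chi2.ppf(0.95, df=len(observed) - 1 - 1)              # k - p - 1 = 7 df (mu estimated)
print(round(stat, 2), round(crit, 3))                        # 10.96 14.067
```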
Self Test Exercise-5:
How do you conduct a goodness of fit test for a normal distribution?

4.2.2. Normal Distribution


The goodness of fit test for a normal distribution is also based on the use of the chi-square distribution. It is similar to
the procedure we discussed for the Poisson distribution. In particular, observed frequencies for several categories of
sample data are compared to expected frequencies under the assumption that the population has a normal distribution.
Because the normal distribution is continuous, we must modify the way the categories are defined and how the
expected frequencies are computed.
Let us demonstrate the goodness of fit test for a normal distribution by considering the job applicant test data for
Chemline, Inc., as listed below. Chemline hires approximately 400 new employees annually for its four plants located
throughout the United States. The personnel director asks whether a normal distribution applies for the population of
test scores. If such a distribution can be used, the distribution would be helpful in evaluating specific test scores; that
is, scores in the upper 20%, lower 40%, and so on, could be identified quickly. Hence, we want to test the null
hypothesis that the population of test scores has a normal distribution.
71, 66, 61, 65, 54, 93, 60, 86, 70, 70, 73, 73, 55, 63, 56, 62, 76, 54, 82, 79, 76, 68, 53, 58, 85, 80, 56, 61, 61, 64, 65,
62, 90, 69, 76, 79, 77, 54, 64, 74, 65, 65, 61, 56, 63, 80, 56, 71, 79, 84.
Let us first use the above data to develop estimates of the mean and standard deviation of the normal distribution that
will be considered in the null hypothesis. We use the sample mean x̄ and the sample standard deviation s as point
estimators of the mean and standard deviation of the normal distribution. The calculations follow.
x̄ = Σxᵢ/n = 3421/50 = 68.42
s = √( Σ(xᵢ ─ x̄)²/(n ─ 1) ) = √(5310.0369/49) = 10.41
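The point estimates can be checked with NumPy (an illustrative addition; `ddof=1` gives the sample standard deviation with the n ─ 1 divisor):

```python
import numpy as np

scores = np.array([71, 66, 61, 65, 54, 93, 60, 86, 70, 70, 73, 73, 55, 63, 56,
                   62, 76, 54, 82, 79, 76, 68, 53, 58, 85, 80, 56, 61, 61, 64,
                   65, 62, 90, 69, 76, 79, 77, 54, 64, 74, 65, 65, 61, 56, 63,
                   80, 56, 71, 79, 84])

mean = scores.mean()          # 3421 / 50 = 68.42
s = scores.std(ddof=1)        # sample standard deviation, about 10.41
print(round(mean, 2), round(s, 2))
```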
Using these values, we state the following hypotheses about the distribution of the job applicant
test scores.

H0: The population of test scores has a normal distribution with mean 68.42 and standard
deviation 10.41
H1: The population of test scores does not have a normal distribution with mean 68.42 and
standard deviation 10.41
Now let us consider a way of defining the categories for a goodness of fit test involving normal distribution. For the
discrete probability distribution in the Poisson distribution test, the categories were readily defined in terms of the
number of customers arriving, such as 0, 1, 2, and so on. However, with the continuous normal probability
distribution, we must use a different procedure for defining the categories. We need to define the categories in terms of
intervals of test scores.
Recall the rule of thumb for an expected frequency of at least five in each interval or category. We define the
categories of test scores such that the expected frequencies will be at least five for each category. With a sample size of 50, one way of establishing categories is to divide the normal distribution into 10 equal-probability intervals. With a
sample size of 50, we would expect five outcomes in each interval or category, and the rule of thumb for expected
frequencies would be satisfied.
Let us look more closely at the procedure for calculating the category boundaries. When the normal probability
distribution is assumed, the standard normal probability tables can be used to determine these boundaries. First
consider the test score cutting off the lowest 10% of the test scores. From the standard normal table, we find that the z value for this

test score is ─1.28. Therefore, the test score of x = 68.42 ─ 1.28(10.41) = 55.10 provides this cutoff value for the
lowest 10% of the scores. For the lowest 20%, we find z = ─ 0.84, and thus x = 68.42 ─ 0.84(10.41) = 59.68. Working
through the normal distribution in that way provides the following test score values.
Percentage Z-value Test Score
10% ─ 1.28 68.42 ─ 1.28(10.41) = 55.10
20% ─ 0.84 68.42 ─ 0.84(10.41) = 59.68
30% ─ 0.52 68.42 ─0.52(10.41) = 63.01
40% ─ 0.25 68.42 ─0.25(10.41) = 65.82
50% 0.00 68.42 +0(10.41) = 68.42
60% + 0.25 68.42 +0.25(10.41) = 71.02
70% + 0.52 68.42 +0.52(10.41) = 73.83
80% + 0.84 68.42 +0.84(10.41) = 77.16
90% +1.28 68.42 + 1.28(10.41) = 81.74
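These boundary calculations can be reproduced in code. The short sketch below is an illustration, not part of the original example; it assumes the scipy library is available and uses the exact inverse normal CDF, so the cutoffs differ slightly from the table, which rounds z to two decimals.

```python
# Compute the ten equal-probability interval boundaries for a normal
# distribution with the estimated mean 68.42 and standard deviation 10.41.
from scipy.stats import norm

mean, sd = 68.42, 10.41   # point estimates from the Chemline sample
for pct in range(10, 100, 10):
    z = norm.ppf(pct / 100)   # z value cutting off the lowest pct% of scores
    cutoff = mean + z * sd    # corresponding test-score boundary
    print(f"{pct}%: z = {z:+.2f}, score = {cutoff:.2f}")
```

For instance, the exact 10% cutoff is about 55.08 versus 55.10 in the table, because the table uses z = −1.28 rather than −1.2816.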
With the categories or intervals of test scores now defined and with the known expected frequency of five per
category, we can return to the above sample data and determine the observed frequencies for the categories. Doing so
provides the results in the following Table.
Test Score Interval Observed Frequency ( fi) Expected Frequency (ei)
Less than 55.10 5 5
55.10 to 59.68 5 5
59.68 to 63.01 9 5
63.01 to 65.82 6 5
65.82 to 68.42 2 5
68.42 to 71.02 5 5
71.02 to 73.83 2 5
73.83 to 77.16 5 5
77.16 to 81.74 5 5
81.74 and over 6 5
Total 50 50

With the results in the above table, the goodness of fit calculations proceed exactly as before. Namely, we compare the observed and expected results by computing a χ² value. Thus,

χ² = Σ (fᵢ − eᵢ)²/eᵢ
   = (5−5)²/5 + (5−5)²/5 + (9−5)²/5 + (6−5)²/5 + (2−5)²/5 + (5−5)²/5 + (2−5)²/5 + (5−5)²/5 + (5−5)²/5 + (6−5)²/5
   = 0 + 0 + 3.2 + 0.2 + 1.8 + 0 + 1.8 + 0 + 0 + 0.2 = 7.2
We see that the value of the test statistic is χ² = 7.2.
To determine whether the computed value of χ2 = 7.2 is large enough to reject H0, we need to refer to the appropriate
chi-square distribution tables. Using the rule for computing the number of degrees of freedom for the goodness of fit
test, we have k ─p ─ 1 =10 ─ 2 ─ 1 = 7 degrees of freedom based on k = 10 categories and p = 2 parameters (mean
and standard deviation) estimated from the sample data.
Suppose that we test the null hypothesis that the distribution of the test scores is normal, using a 0.10 level of significance. To test this hypothesis, we need the critical value from the chi-square table: χ² = 12.017 at α = 0.10 with 7 degrees of freedom.
The decision rule is: Reject H0 if χ² > 12.017; otherwise do not reject H0.
Because the computed χ² = 7.2 is less than the critical value of 12.017, the hypothesis that the probability distribution for the Chemline job applicant test scores is a normal distribution cannot be rejected. Therefore, the normal distribution may be applied to assist in the interpretation of test scores.
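The whole test can be verified in a few lines of code. The sketch below is illustrative (it assumes numpy and scipy are installed, neither of which the text requires); it bins the 50 scores at the interval boundaries from the table and recomputes the statistic and critical value.

```python
import numpy as np
from scipy.stats import chi2

scores = [71, 66, 61, 65, 54, 93, 60, 86, 70, 70, 73, 73, 55, 63, 56, 62,
          76, 54, 82, 79, 76, 68, 53, 58, 85, 80, 56, 61, 61, 64, 65, 62,
          90, 69, 76, 79, 77, 54, 64, 74, 65, 65, 61, 56, 63, 80, 56, 71,
          79, 84]
bounds = [55.10, 59.68, 63.01, 65.82, 68.42, 71.02, 73.83, 77.16, 81.74]

# Observed counts in the ten categories; expected is 5 in each.
observed, _ = np.histogram(scores, bins=[-np.inf] + bounds + [np.inf])
expected = np.full(10, 5.0)
chi_sq = ((observed - expected) ** 2 / expected).sum()
critical = chi2.ppf(0.90, df=10 - 2 - 1)   # alpha = 0.10, k - p - 1 = 7 df
print(observed.tolist(), round(chi_sq, 1), round(critical, 3))
# chi-square = 7.2 < 12.017, so H0 is not rejected
```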

Summary
The chi-square distribution is a continuous, positively skewed probability distribution ranging between 0 and +∞.
The characteristics of the chi-square distribution are:
 The value of chi-square is never negative.
 The chi-square distribution is positively skewed.
 There is a family of chi-square distributions.
Each time the degrees of freedom change, a new distribution is formed. As the degrees of freedom increase, the
distribution approaches the normal distribution. A goodness-of-fit test will show whether an observed set of
frequencies could have come from a hypothesized population distribution.
The goodness of fit test now focuses on the differences between the observed frequencies and the expected frequencies.
Large differences between observed and expected frequencies cast doubt on the assumption that the hypothesized
proportions or market shares are correct. Whether the differences between the observed and expected frequencies are
“large” or “small” is a question answered with the aid of the following test statistic.
Test statistic for goodness of fit
χ² = Σ (fᵢ − eᵢ)²/eᵢ
Where: fᵢ = observed frequency for category i
eᵢ = expected frequency for category i
k = the number of categories
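As a minimal illustration (an addition, not part of the original text), the summary formula translates directly into code, and scipy's `chisquare` function computes the same statistic together with a p-value:

```python
from scipy.stats import chisquare

def gof_statistic(f, e):
    """Chi-square goodness of fit statistic: sum of (f_i - e_i)^2 / e_i."""
    return sum((fi - ei) ** 2 / ei for fi, ei in zip(f, e))

# Frequencies from the Chemline worked example above.
observed = [5, 5, 9, 6, 2, 5, 2, 5, 5, 6]
expected = [5] * 10
print(round(gof_statistic(observed, expected), 1))        # 7.2
print(round(chisquare(observed, expected).statistic, 1))  # 7.2
```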
Chapter Review Questions
1. The human resources director at Georgetown Paper, Inc. is concerned about absenteeism among hourly workers.
She decides to sample the records to determine whether absenteeism is distributed evenly throughout the six-day workweek. The null hypothesis to be tested is: Absenteeism is distributed evenly throughout the week. The
sample results are:
Day Number Absent Day Number Absent
Monday 12 Thursday 10
Tuesday 9 Friday 9
Wednesday 11 Saturday 9

Use the .01 significance level and the five-step hypothesis testing procedure.
a. What are the numbers 12, 9, 11, 10, 9, and 9 called?
b. How many categories (cells) are there?
c. What is the expected frequency for each day?
d. How many degrees of freedom are there?
e. What is the chi-square critical value at the 1 percent significance level?
f. Compute the x2 test statistic.
g. What is the decision regarding the null hypothesis?
h. Specifically, what does this indicate to the human resources director?
2. The American Accounting Association classifies accounts receivable as "current," "late," and “not collectible."
Industry figures show that 60 percent of accounts receivable are current, 30 percent are late, and 10 percent are not collectible. Massa and Barr, attorneys in Greenville, Ohio, have 500 accounts receivable: 320 are current,
120 are late, and 60 are not collectible. Are these numbers in agreement with the industry distribution? Use the
.05 significance level.
Answer for self-test exercise
1.
a. Observed frequencies.
b. Six (six days of the week).
c. 10. Total observed frequencies / 6 = 60/6 =10.
d. 5; k - 1 = 6 - 1 = 5.
e. 15.086 (from the chi-square table in Appendix 8).
f. χ² = Σ[(fᵢ − eᵢ)²/eᵢ] = (12 − 10)²/10 + … + (9 − 10)²/10 = 0.8
g. We do not reject H0.
h. Absenteeism is distributed evenly throughout the week. The observed differences are due to sampling error.
2. H0: PC = .60, PL = .30, and PU = .10.
H1: The distribution is not as stated above.
Reject H0 if χ² > 5.991.
Category fi ei (fi − ei)²/ei
Current 320 300 1.33
Late 120 150 6.00
Not collectible 60 50 2.00
Total 500 500 9.33

Reject H0. The accounts receivable data do not reflect the industry distribution.
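The arithmetic in answer 2 can be double-checked in code; this is a sketch that assumes scipy is available:

```python
from scipy.stats import chi2, chisquare

observed = [320, 120, 60]
expected = [0.60 * 500, 0.30 * 500, 0.10 * 500]   # 300, 150, 50
stat = chisquare(observed, expected).statistic
critical = chi2.ppf(0.95, df=3 - 1)               # alpha = .05, k - 1 = 2 df
print(round(stat, 2), round(critical, 3))         # 9.33 and 5.991: reject H0
```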
Assignment
1. The publisher of a sports magazine plans to offer new subscribers one of three gifts: a sweatshirt with the logo
of their favorite team, a coffee cup with the logo of their favorite team, or a pair of earrings also with the logo
of their favorite team. In a sample of 500 new subscribers, the number selecting each gift is reported below. At
the .05 significance level, is there a preference for the gifts or should we conclude that the gifts are equally
well liked?
Gift Frequency
sweatshirts 183
Coffee cup 175
Earrings 142
2. In the early 2000s the Deep Down Mining Company implemented new safety guidelines. Prior to these new
guidelines, management expected no accidents in 40 percent of the months, one accident in 30 percent of the
months, two accidents in 20 percent of the months, and three accidents in 10 percent of the months. Over the
last 10 years, or 120 months, there have been 46 months in which there were no accidents, 40 months in which
there was one accident, 22 months in which there were two accidents, and 12 months in which there were 3
accidents. At the .05 significance level can the management at Deep Down conclude that there has been a
change in the monthly accident distribution?
3. Is it proper to respond with an email after a job interview thanking the prospective employer for the interview?
This question was asked of a sample of 200 human resource professionals and 250 technical personnel who
were a part of the interview process. The results are reported below. At the .01 significance level is it
reasonable to conclude that human resource and technical personnel differ on whether an email response is
appropriate?
Email response Human resource Technical
Very appropriate 35 98
Somewhat appropriate 95 114
Somewhat inappropriate 40 22
Very inappropriate 30 16
Total 200 250

Chapter 5: Analysis of Variance (ANOVA)
5. Introduction
Dear learners, in this Unit you will continue your discussion of hypothesis testing. Recall that in Unit six you
examined the general theory of hypothesis testing. You described the case where a large sample was selected from the
population. You used the z distribution (the standard normal distribution) to determine whether it was reasonable to
conclude that the population mean was equal to a specified value. You tested whether two population means are the
same. You also conducted both one- and two-sample tests for population proportions, again using the standard normal
distribution as the distribution of the test statistic. You described methods for conducting tests of means where the
populations were assumed normal but the samples were small (contained fewer than 30 observations). In that case, the
t distribution was used as the distribution of the test statistic. In this Unit, you expand your study of hypothesis tests. You describe a test for variances and then a test that simultaneously compares several means to determine whether they came from populations with equal means.
Unit objectives
After accomplishment of this Unit, you should be able to:
 List the characteristics of the F distribution.
 Conduct a test of hypothesis to determine whether the variances of two populations are equal.
 Discuss the general idea of analysis of variance.
 Organize data into a one-way ANOVA table.
 Conduct a test of hypothesis among three or more treatment means.
 Develop confidence intervals for the difference in treatment means.
5.1 Areas of application
Overview
Dear learner, this section will introduce you to the characteristics of the F distribution, used to test whether two samples are from populations having equal variances, and to the steps of the hypothesis-testing procedure for the F distribution.
Section objectives:
At the end of this section, you should be able to:
 List the characteristics of the F distribution.
 Identify the steps of the hypothesis-testing procedure for the F distribution.
Self Test Exercise- 1:
1. What are the characteristics of the F distribution?
2. What are the steps of the hypothesis-testing procedure for the F distribution?
5.1.1 The F Distribution
The analysis of variance is a procedure that tests to determine whether differences exist between two or more
population means. The name of the technique derives from the way in which the calculations are performed; that is,
the technique analyzes the variance of the data to determine whether we can infer that the population means differ.
The probability distribution used in this Unit is the F distribution. It was named to honor Sir Ronald Fisher, one of the
founders of modern-day statistics. This probability distribution is used as the distribution of the test statistic for several
situations. It is used to test whether two samples are from populations having equal variances, and it is also applied
when we want to compare several population means simultaneously. The simultaneous comparison of several
population means is called analysis of variance (ANOVA). In both of these situations, the populations must follow a
normal distribution, and the data must be at least interval-scale.
What are the characteristics of the F distribution?
i. There is a "family" of F distributions. A particular member of the family is determined by two parameters: the
degrees of freedom in the numerator and the degrees of freedom in the denominator.
ii. The F distribution is continuous. This means that it can assume an infinite number of values between zero and
positive infinity.
iii. The F distribution cannot be negative. The smallest value F can assume is 0.

iv. It is positively skewed. The long tail of the distribution is to the right-hand side. As the number of degrees of freedom increases in both the numerator and denominator, the distribution approaches a normal distribution.
v. It is asymptotic. As the values of F increase, the F curve approaches the horizontal axis but never touches it. This is similar to the behavior of the normal distribution.
Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one normal population equals the variance of
another normal population. The following examples will show the use of the test:
 The mean rate of return on two types of common stock may be the same, but there may be more variation in
the rate of return in one than the other. A sample of 10 Internet stocks and 10 utility stocks shows the same
mean rate of return, but there is likely more variation in the Internet stocks.
 A study by the marketing department for a large newspaper found that men and women spent about the same
amount of time per day reading the paper. However, the same report indicated there was nearly twice as much variation in time spent per day among the men as among the women.
The F distribution provides a means for conducting a test regarding the variances of two normal populations.
Regardless of whether we want to determine if one population has more variation than another population or validate
an assumption for a statistical test, we first state the null hypothesis. The null hypothesis could be that the variance of
one normal population, σ₁², equals the variance of the other normal population, σ₂². The alternate hypothesis is that the variances differ. In this instance the null hypothesis and the alternate hypothesis are H0: σ₁² = σ₂² and H1: σ₁² ≠ σ₂².
To conduct the test, we select a random sample of n1 observations from one population, and a sample of n2
observations from the second population. The test statistic is defined as follows.
F = s₁²/s₂²
The terms s₁² and s₂² are the respective sample variances. The test statistic follows the F distribution with n1 − 1 and n2 − 1
degrees of freedom. In order to reduce the size of the table of critical values, the larger sample variance is placed in
the numerator; hence, the tabled F ratio is always larger than 1. Thus, the right-tail critical value is the only one
required. The critical value of F for a two-tailed test is found by dividing the significance level in half (α/2) and then
referring to the appropriate degrees of freedom from F distribution table.
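The same critical value can be obtained without the printed table. The sketch below is illustrative (it assumes scipy; the sample sizes match the bus-route example that follows):

```python
from scipy.stats import f

# Two-tailed test at significance level alpha: the right-tail critical
# value uses alpha/2 with n1 - 1 and n2 - 1 degrees of freedom.
alpha, n1, n2 = 0.10, 7, 8
critical = f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)
print(round(critical, 2))   # 3.87
```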

Example: Anbessa city bus offers transport services from Adama city to Bole airport in Addis. Ato Alemu, the president of Anbessa city bus, is considering two routes. One is via Selam bus and the other via Higer bus. He wants to
study the time it takes to drive to the airport using each route and then compare the results. He collected the following
sample data, which is reported in minutes. Using the 0.10 significance level, is there a difference in the variation in the
driving times for the two routes?
Selam bus 52 67 56 45 70 54 64
Higer bus 59 60 61 51 56 63 57 65
Solution:
The mean driving times along the two routes are nearly the same. The mean time is 58.29 minutes for the Selam bus
and 59 minutes along the Higer bus route. However, in evaluating travel times, Mr. Alemu is also concerned about the
variation in the travel times. The first step is to compute the two sample variances. We'll use the following formula to
compute the sample mean & standard deviations. To obtain the sample variances, we square the standard deviations.

SELAM BUS

X̄₁ = Σxᵢ/n = 408/7 = 58.29 and s₁ = √[Σ(xᵢ − X̄₁)²/(n − 1)] = √(485.43/6) = 8.9947

HIGER BUS

X̄₂ = Σxᵢ/n = 472/8 = 59 and s₂ = √[Σ(xᵢ − X̄₂)²/(n − 1)] = √(134/7) = 4.3753
There is more variation, as measured by the standard deviation, in the Selam bus route than in the Higer bus route.
This is somewhat consistent with his knowledge of the two routes; the Selam bus route contains more stoplights,
whereas Higer bus is a limited-access interstate highway. However, the Higer bus route is several miles longer. It is
important that the service offered be both timely and consistent, so he decides to conduct a statistical test to determine
whether there really is a difference in the variation of the two routes. The usual five-step hypothesis-testing procedure
will be employed.
Step 1: We begin by stating the null hypothesis and the alternate hypothesis. The test is two-tailed because we are
looking for a difference in the variation of the two routes. We are not trying to show that one route has more variation
than the other.
H0: σ₁² = σ₂² and H1: σ₁² ≠ σ₂².
Step 2: We selected the 0.10 significance level.
Step 3: The appropriate test statistic follows the F distribution.
Step 4: The critical value is obtained from F-distribution table. Because we are conducting a two-tailed test, the tabled
significance level is 0.05, found by α/2 = 0.10/2 = 0.05. There are n1- 1 = 7 - 1 = 6 degrees of freedom in the
numerator, and n2-1 = 8 - 1 = 7 degrees of freedom in the denominator. To find the critical value, move horizontally
across the top portion of the F table for the 0.05 significance level to 6 degrees of freedom in the numerator. Then
move down that column to the critical value opposite 7 degrees of freedom in the denominator. The critical value is
3.87. Thus, the decision rule is: Reject the null hypothesis if the ratio of the sample variances exceeds 3.87.
Step 5: The final step is to take the ratio of the two sample variances, determine the value of the test statistic, and
make a decision regarding the null hypothesis. Note that the above formula refers to the sample variances, but we
calculated the sample standard deviations. We need to square the standard deviations to determine the variances.
F = s₁²/s₂² = (8.9947)²/(4.3753)² = 4.23
The decision is to reject the null hypothesis, because the computed F value (4.23) is larger than the critical value
(3.87). We conclude that there is a difference in the variation of the travel times along the two routes.
As noted, the usual practice is to determine the F ratio by putting the larger of the two sample variances in the
numerator. This will force the F ratio to be at least 1. This allows us to always use the right tail of the F distribution,
thus avoiding the need for more extensive F tables.
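The five steps above can be replayed in code. This sketch (an illustration assuming the scipy library; the standard-library statistics module computes the sample variances) reproduces the decision:

```python
import statistics
from scipy.stats import f

selam = [52, 67, 56, 45, 70, 54, 64]
higer = [59, 60, 61, 51, 56, 63, 57, 65]

s1_sq = statistics.variance(selam)   # larger sample variance goes on top
s2_sq = statistics.variance(higer)
F_ratio = s1_sq / s2_sq
critical = f.ppf(1 - 0.10 / 2, len(selam) - 1, len(higer) - 1)
print(round(F_ratio, 2), round(critical, 2))   # 4.23 > 3.87: reject H0
```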

5.2: Analysis of Variance


Overview
Dear learner, this section will introduce you to the conceptual overview of the F distribution, used to test whether two samples are from populations having equal variances and also applied when you want to compare several
population means simultaneously. In both of these situations, the populations must follow a normal distribution, and
the data must be at least interval-scale.
Section objectives:
At the end of this section, you should be able to:
 Discuss the general idea of analysis of variance.
 Identify the assumptions required to use analysis of variance.
 Conduct a test of hypothesis among three or more treatment means.
 Develop confidence intervals for the difference in treatment means.

Self Test Exercise-2:


1. What is analysis of variance?
2. What are the assumptions required to use analysis of variance?

5.2.1 Analysis of variance-conceptual overview
The analysis of variance (ANOVA) is a procedure that tests to determine whether differences exist between two or
more population means. The name of the technique derives from the way in which the calculations are performed; that
is, the technique analyzes the variance of the data to determine whether we can infer that the population means differ.
In this section, we describe the procedure to apply when the samples are independently drawn. The technique is called
the one-way analysis of variance.
If the means of populations are equal, we would expect the sample means to be close together. In fact, the closer the
sample means are to one another, the more evidence we have for the conclusion that the population means are equal.
Alternatively, the more the sample means differ, the more evidence we have for the conclusion that the population
means are not equal. In other words, if the variability among the sample means is “small,” it supports H0; if the
variability among the sample means is “large,” it supports H1.

Assumptions for Analysis of Variance


Another use of the F distribution is the analysis of variance (ANOVA) technique in which we compare three or more
population means to determine whether they could be equal.
Three assumptions are required to use analysis of variance.
a. For each population, the response variable is normally distributed.
b. The variance of the response variable, denoted σ2, is the same for all of the populations.
c. The observations must be independent.
As an example of an experimental statistical study, let us consider the problem facing Chemitech, Inc. Chemitech
developed a new filtration system for municipal water supplies. The components for the new filtration system will be
purchased from several suppliers, and Chemitech will assemble the components at its plant. The industrial
engineering group is responsible for determining the best assembly method for the new filtration system. After
considering a variety of possible approaches, the group narrows the alternatives to three: method A, method B, and
method C. These methods differ in the sequence of steps used to assemble the system. Managers at Chemitech want
to determine which assembly method can produce the greatest number of filtration systems per week.

In the Chemitech experiment, assembly method is the independent variable or factor. Because three assembly methods
correspond to this factor, we say that three treatments are associated with this experiment; each treatment corresponds
to one of the three assembly methods. The Chemitech problem is an example of a single-factor experiment (one way
ANOVA); it involves one qualitative factor (method of assembly). More complex experiments may consist of multiple
factors; some factors may be qualitative and others may be quantitative.
The three assembly methods or treatments define the three populations of interest for the Chemitech experiment. One
population is all Chemitech employees who use assembly method A, another is those who use method B, and the third
is those who use method C. Note that for each population the dependent or response variable is the number of
filtration systems assembled per week, and the primary statistical objective of the experiment is to determine whether
the mean number of units produced per week is the same for all three populations (methods).

Suppose a random sample of three employees is selected from all assembly workers at the Chemitech production
facility. In experimental design terminology, the three randomly selected workers are the experimental units. The
experimental design that we will use for the Chemitech problem is called a completely randomized design.
Randomization is the process of assigning the treatments to the experimental units at random. This type of design
requires that each of the three assembly methods or treatments be assigned randomly to one of the experimental units
or workers. For example, method A might be randomly assigned to the second worker, method B to the first worker,
and method C to the third worker.
Note that this experiment would result in only one measurement or number of units assembled for each treatment. To
obtain additional data for each assembly method, we must repeat or replicate the basic experimental process. Suppose,
for example, that instead of selecting just three workers at random we selected 15 workers and then randomly assigned

each of the three treatments to 5 of the workers. Because each method of assembly is assigned to 5 workers, we say
that five replicates have been obtained. The process of replication is another important principle of experimental
design.

Once we are satisfied with the experimental design, we proceed by collecting and analyzing the data. In the Chemitech case, the employees would be instructed in how to perform the assembly method assigned to them and then would
begin assembling the new filtration systems using that method. After this assignment and training, the number of units
assembled by each employee during one week is shown in the table below. The sample means, sample variances, and
sample standard deviations for each assembly method are also provided.

Method
A B C
58 58 48
64 69 57
55 71 59
66 64 47
67 68 49
Sample mean 62 66 52
Sample variance 27.5 26.5 31
Sample standard deviation 5.2445 5.148 5.568
Although we will never know the actual values of μ1, μ2, and μ3, we want to use the sample means to test the following
hypotheses.
H0: μ1 = μ2 = μ3 and H1: Not all population means are equal
As we will demonstrate shortly, analysis of variance (ANOVA) is the statistical procedure used to determine whether
the observed differences in the three sample means are large enough to reject H0. If H0 is rejected, we cannot conclude
that all population means are different. Rejecting H0 means that at least two population means have different values.
We assume that a simple random sample of size nj has been selected from each of the k populations or treatments. For
the resulting sample data, let
xᵢⱼ = value of observation i for treatment j
nⱼ = number of observations for treatment j
X̄ⱼ = sample mean for treatment j
sⱼ² = sample variance for treatment j
sⱼ = sample standard deviation for treatment j

The formulas for the sample mean and sample variance for treatment j are as follows.
X̄ⱼ = Σᵢ xᵢⱼ / nⱼ and sⱼ² = Σᵢ (xᵢⱼ − X̄ⱼ)² / (nⱼ − 1)
The overall sample mean, denoted by θ, is the sum of all the observations divided by the total number of observations. That is,

θ = (Σⱼ Σᵢ xᵢⱼ)/nT, where nT = n1 + n2 + … + nk

If the size of each sample is equal, the above equation for the overall sample mean reduces to θ = (Σⱼ X̄ⱼ)/k
Where: k = number of treatments (in this case, the 3 assembly methods)

In other words, whenever the sample sizes are the same, the overall sample mean is just the

average of the k sample means. Because each sample in the Chemitech experiment consists of n = 5 observations with
k = 3 treatments, the overall sample mean can be computed as follows.

θ = (62 + 66 + 52)/3 = 60
If the null hypothesis is true ( μ1 = μ2 = μ3 = μ), the overall sample mean of 60 is the best estimate of the population
mean μ.

5.2.2 Between-Treatments Estimate of Population Variance


An estimate of the variance of the sampling distribution of X̄, denoted σ²_X̄, is provided by the variance of the three sample means.

s²_X̄ = [(62 − 60)² + (66 − 60)² + (52 − 60)²]/(3 − 1) = 52

Because σ²_X̄ = σ²/n, solving for σ² gives σ² = n σ²_X̄.
Hence, estimate of σ² = n (estimate of σ²_X̄) = n s²_X̄ = 5(52) = 260
The result, n s²_X̄ = 260, is referred to as the between-treatments estimate of σ².
The between-treatments estimate of σ2 is based on the assumption that the null hypothesis is true and the sample sizes
are equal. This estimate of 𝜎 2 is called the mean square due to treatments and is denoted MSTR. The general formula
for computing MSTR is
MSTR = Σⱼ nⱼ (X̄ⱼ − θ)² / (k − 1)
The numerator in the above equation is called the sum of squares due to treatments and is denoted SSTR. The
denominator, k ─ 1, represents the degrees of freedom associated with SSTR.
If H0 is true, MSTR provides an unbiased estimate of σ2. However, if the means of the k populations are not equal,
MSTR is not an unbiased estimate of σ2; in fact, in that case, MSTR will tend to overestimate σ2.
For the Chemitech data, we obtain the following results.
SSTR = Σⱼ nⱼ (X̄ⱼ − θ)² = 5(62 − 60)² + 5(66 − 60)² + 5(52 − 60)² = 520
MSTR = SSTR/(k − 1) = 520/2 = 260
5.2.3 Within-Treatments Estimate of Population Variance
When a simple random sample is selected from each population, each of the sample variances provides an unbiased
estimate of σ2. Hence, we can combine or pool the individual estimates of σ2 into one overall estimate. The estimate of
σ2 obtained in this way is called the pooled or within-treatments estimate of σ2. Because each sample variance provides
an estimate of σ2 based only on the variation within each sample, the within-treatments estimate of σ2 is not affected by
whether the population means are equal.

When the sample sizes are equal, the within-treatments estimate of σ2 can be obtained by computing the average of the
individual sample variances. This estimate of σ2 is called the mean
square due to error and is denoted MSE. The general formula for computing MSE is
MSE = Σⱼ (nⱼ − 1) sⱼ² / (nT − k)
The numerator in the above equation is called the sum of squares due to error and is denoted
SSE. The denominator of MSE is referred to as the degrees of freedom associated with SSE.
Hence, the formula for MSE can also be stated as follows.
MSE = SSE/(nT − k)
Note that MSE is based on the variation within each of the treatments; it is not influenced by
whether the null hypothesis is true. Thus, MSE always provides an unbiased estimate of σ2.
For the Chemitech data we obtain the following results.

SSE = Σⱼ (nⱼ − 1) sⱼ² = (5 − 1)27.5 + (5 − 1)26.5 + (5 − 1)31 = 340
MSE = SSE/(nT − k) = 340/(15 − 3) = 28.33
If the null hypothesis is true, MSTR and MSE provide two independent, unbiased estimates of σ2. We know that for
normal populations, the sampling distribution of the ratio of two independent estimates of σ2 follows an F distribution.
Hence, if the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of MSTR/MSE
is an F distribution with numerator degrees of freedom equal to k ─ 1 and denominator degrees of freedom equal to
nT─ k. In other words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been selected
from this F distribution.
However, if the null hypothesis is false, the value of MSTR/MSE will be inflated because MSTR overestimates σ2.
Hence, we will reject H0 if the resulting value of MSTR/MSE appears to be too large to have been selected from an F
distribution with k ─ 1 numerator degrees of freedom and nT─ k denominator degrees of freedom. Because the
decision to reject H0 is based on the value of MSTR/MSE, the test statistic used to test for the equality of k population
means is as follows.
F = MSTR/MSE
The test statistic follows an F distribution with k ─ 1 degrees of freedom in the numerator and nT─ k degrees of
freedom in the denominator.
Let us return to the Chemitech experiment and use a level of significance α = 0.05 to conduct the hypothesis test. The
value of the test statistic is
F = MSTR/MSE = 260/28.33 = 9.18
As with other hypothesis testing procedures, the critical value approach can be used. With α = 0.05, the critical F value
occurs with an area of 0.05 in the upper tail of an F distribution with 2 and 12 degrees of freedom. From the F
distribution table, we find F0.05(2, 12) = 3.89. Hence, the appropriate upper tail rejection rule for the Chemitech
experiment is
Reject H0 if F ≥ 3.89
With F = 9.18 > 3.89, we reject H0 and conclude that the means of the three populations are not equal. The analysis of
variance thus supports the conclusion that the population mean number of units produced per week is not the same for
the three assembly methods.
ANOVA Table
The results of the preceding calculations can be displayed conveniently in a table referred to as the analysis of variance
table. The general form of the ANOVA table for a completely randomized design is shown in below. The Table is the
corresponding ANOVA table for the Chemitech experiment. The sum of squares associated with the source of
variation referred to as “Total” is called the total sum of squares (SST). Note that the results for the Chemitech
experiment suggest that SST = SSTR + SSE, and that the degrees of freedom associated with this total sum of squares
is the sum of the degrees of freedom associated with the sum of squares due to treatments and the sum of squares due
to error.
We point out that SST divided by its degrees of freedom nT─ 1 is nothing more than the overall sample variance that
would be obtained if we treated the entire set of 15 observations as one data set. With the entire data set as one sample,
the formula for computing the total sum of squares (SST) is
SST = ΣΣ(xij − x̄)²
where the double sum runs over treatments j = 1, …, k and observations i = 1, …, nj, and x̄ is the overall sample mean.
It can be shown that the results we observed for the analysis of variance table for the Chemitech experiment also apply
to other problems. That is, SST = SSTR + SSE
In other words, SST can be partitioned into two sums of squares: the sum of squares due to treatments and the sum of
squares due to error. Note also that the degrees of freedom corresponding to SST, nT─ 1, can be partitioned into the
degrees of freedom corresponding to SSTR, k ─ 1, and the degrees of freedom corresponding to SSE, nT─ k. The
analysis of variance can be viewed as the process of partitioning the total sum of squares and the degrees of freedom
into their corresponding sources: treatments and error. Dividing the sum of squares by the appropriate degrees of
freedom provides the variance estimates, the F value, and the p-value used to test the hypothesis of equal population
means.
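This partition can be checked numerically. The sketch below (plain Python; the variable names are our own) verifies with the Chemitech values that the treatment and error sums of squares, and their degrees of freedom, add up to the totals:

```python
# Verify the ANOVA partition SST = SSTR + SSE and the matching degrees of
# freedom for the Chemitech experiment (values as quoted in this chapter).
SSTR, SSE, SST = 520, 340, 860
n_T, k = 15, 3

assert SST == SSTR + SSE                  # 860 = 520 + 340
assert (n_T - 1) == (k - 1) + (n_T - k)   # 14 = 2 + 12
print("partition verified")
```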
ANOVA TABLE FOR A COMPLETELY RANDOMIZED DESIGN
Source of variation   Sum of squares   Degrees of freedom   Mean square           F          p-value
Treatments            SSTR             k − 1                MSTR = SSTR/(k − 1)   MSTR/MSE
Error                 SSE              nT − k               MSE = SSE/(nT − k)
Total                 SST              nT − 1

ANALYSIS OF VARIANCE TABLE FOR THE CHEMITECH EXPERIMENT
Source of variation   Sum of squares   Degrees of freedom   Mean square   F      p-value*
Treatments            520              2                    260           9.18   0.004
Error                 340              12                   28.33
Total                 860              14

*Excel or another statistical software package can be used to show that the exact p-value is 0.004. Thus, with p-value ≤ α =
0.05, H0 is rejected.
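The whole ANOVA table can be rebuilt from the summary statistics given earlier in the chapter. The following sketch (plain Python; the variable names are our own, and SSTR = 520 is taken as given from the table) reproduces MSTR, MSE, and the F statistic:

```python
# Rebuild the Chemitech one-way ANOVA from the summary statistics in the text.
k = 3                      # number of treatments (assembly methods)
n_j = [5, 5, 5]            # observations per treatment
s2_j = [27.5, 26.5, 31.0]  # sample variances per treatment
n_T = sum(n_j)             # total number of observations = 15
SSTR = 520.0               # sum of squares due to treatments (given in the table)

SSE = sum((n - 1) * s2 for n, s2 in zip(n_j, s2_j))  # = 340
MSTR = SSTR / (k - 1)                                # = 260
MSE = SSE / (n_T - k)                                # = 28.33...
F = MSTR / MSE                                       # ≈ 9.18

print(SSE, MSTR, round(MSE, 2), round(F, 2))  # 340.0 260.0 28.33 9.18
```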
Summary
The analysis of variance (ANOVA) is a procedure that tests to determine whether differences exist between two or
more population means. The probability distribution used in this Unit is the F distribution. It was named to honor Sir
Ronald Fisher, one of the founders of modern-day statistics. This probability distribution is used as the distribution of
the test statistic for several situations. It is used to test whether two samples are from populations having equal
variances, and it is also applied when we want to compare several population means simultaneously. The simultaneous
comparison of several population means is called analysis of variance (ANOVA). The characteristics of the F
distribution are: there is a "family" of F distributions; the F distribution is continuous; it cannot be negative; it is
positively skewed; and it is asymptotic.
Regardless of whether we want to determine if one population has more variation than another population or validate
an assumption for a statistical test, we first state the null hypothesis. The null hypothesis could be that the variance of
one normal population, σ₁², equals the variance of the other normal population, σ₂². The alternate hypothesis is that the
variances differ. In this instance the null hypothesis and the alternate hypothesis are H0: σ₁² = σ₂² and H1: σ₁² ≠ σ₂².
To conduct the test, we select a random sample of n1 observations from one population, and a sample of n2
observations from the second population. The test statistic is defined as follows.
F = s₁²/s₂²
The terms s₁² and s₂² are the respective sample variances. The test statistic follows the F distribution with n₁ − 1 and
n₂ − 1 degrees of freedom.

Three assumptions are required to use analysis of variance: for each population, the response variable is normally
distributed; the variance of the response variable, denoted σ², is the same for all of the populations; and the
observations must be independent.

Analysis of variance (ANOVA) is the statistical procedure used to determine whether the observed differences in the
sample means are large enough to reject H0. If H0 is rejected, we cannot conclude that all population means are
different. Rejecting H0 means that at least two population means have different values.

Self Test Exercise-
1. Steele Electric Products, Inc. assembles electrical components for cell phones. For the last 10 days Mark Nagy
has averaged 9 rejects, with a standard deviation of 2 rejects per day. Debbie Richmond averaged 8.5 rejects,
with a standard deviation of 1.5 rejects, over the same period. At the .05 significance level, can we conclude
that there is more variation in the number of rejects per day attributed to Mark?
Answer for Self-test exercise
1. Let Mark's assemblies be population 1. The hypotheses are H0: σ₁² ≤ σ₂² and H1: σ₁² > σ₂², with df₁ = 10 − 1 = 9
and df₂ = 10 − 1 = 9. H0 is rejected if F > 3.18.
F = 2.0²/1.5² = 1.78
Since 1.78 < 3.18, H0 is not rejected. The variation is the same for both employees.
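The arithmetic in this answer is easy to check in code. A minimal sketch follows (plain Python; the critical value 3.18 is read from the F table, as in the text):

```python
# F test for equality of two variances (Mark vs. Debbie, self-test exercise 1).
s1 = 2.0          # Mark's standard deviation (population 1, the larger variance)
s2 = 1.5          # Debbie's standard deviation
F = s1**2 / s2**2  # test statistic = 4 / 2.25 ≈ 1.78
critical = 3.18    # F(0.05; 9, 9) from the F table
reject = F > critical
print(round(F, 2), reject)  # 1.78 False
```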
Assignment
1. What are the characteristics of the F distribution?
2. A real estate agent in the coastal area of Georgia wants to compare the variation in the selling price of homes
on the oceanfront with those one to three blocks from the ocean. A sample of 21 oceanfront homes sold within
the last year revealed the standard deviation of the selling prices was Birr 45,600. A sample of 18 homes, also
sold within the last year, that were one to three blocks from the ocean revealed that the standard deviation was
Birr 21,330. At the .01 significance level, can we conclude that there is more variation in the selling prices of
the oceanfront homes?
3. In an ANOVA table MSE was equal to 10. Random samples of six were selected from each of four
populations, where the sum of squares total was 250.
a. Set up the null hypothesis and the alternate hypothesis.
b. What is the decision rule? Use the .05 significance level.
c. Complete the ANOVA table. What is the value of F?
d. What is your decision regarding the null hypothesis?
4. A consumer organization wants to know whether there is a difference in the price of a particular toy at three
different types of stores. The price of the toy was checked in a sample of five discount stores, five variety
stores, and five department stores. The results are shown below. Use the .05 significance level.
Discount Variety Department
Birr 12 Birr 15 Birr 19
13 17 17
14 14 16
12 18 20
15 17 19

Chapter 6: CORRELATION AND REGRESSION ANALYSIS
6. Introduction
Dear learners, in this Unit you will develop numerical measures to express the relationship between two variables. Is
the relationship strong or weak, is it direct or inverse? In addition you develop an equation to express the relationship
between variables. This will allow us to estimate one variable on the basis of another.
Unit objectives
After accomplishment of this Unit, you should be able to:
 Understand and interpret the terms dependent variable and independent variable.
 Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error
of estimate.
 Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero.
 Calculate the least squares regression line.
 Construct and interpret confidence and prediction intervals for the dependent variable.
6.1: Linear correlation
Overview
Dear learner, in this section you will examine the meaning and purpose of correlation analysis, which studies the
relationship between two variables. Correlation analysis involves various methods and techniques used for studying
and measuring the extent of the relationship between the two variables. "Two variables are said to be in correlation
if the change in one of the variables results in a change in the other variable".
Section objectives:
At the end of this section, you should be able to:
 Understand and interpret the terms dependent variable and independent variable.
 Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error
of estimate.
 Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero.

Self Test Exercise-1:


1. What is the difference between dependent and independent variable?
2. What is correlation analysis?

6.1.1 Correlation Analysis


So far, we have confined our discussion to distributions involving only one variable. Sometimes, in practical
applications, we might come across certain sets of data where each item of the set comprises the values of two
or more variables.
Suppose we have a set of 30 students in a class and we want to measure the heights and weights of all the students. We
observe that each individual (unit) of the set assumes two values – one relating to the height and the other to the
weight. Such a distribution in which each individual or unit of the set is made up of two values is called a bivariate
distribution. The following examples will illustrate clearly the meaning of bivariate distribution.
i. In a class of 60 students the series of marks obtained in two subjects by all of them.
ii. The series of sales revenue and advertising expenditure of two companies in a particular year.
iii. The series of ages of husbands and wives in a sample of selected married couples.
Thus in a bivariate distribution, we are given a set of pairs of observations, wherein each pair represents the values of
two variables. In a bivariate distribution, we are interested in finding a relationship (if it exists) between the two
variables under study.
The concept of 'correlation' is a statistical tool which studies the relationship between two variables, and correlation
analysis involves various methods and techniques used for studying and measuring the extent of the relationship
between the two variables. "Two variables are said to be in correlation if the change in one of the variables results in
a change in the other variable".
6.1.2: Types of Correlation
There are two important types of correlation. They are (1) Positive and Negative correlation and, (2) Linear and Non –
Linear correlation.
Positive and Negative Correlation
If the values of the two variables deviate in the same direction i.e. if an increase (or decrease) in the values of one
variable results, on an average, in a corresponding increase (or decrease) in the values of the other variable the
correlation is said to be positive.
Some examples of series of positive correlation are:
a. Heights and weights; c. Price and supply of commodities;
b. Household income and expenditure; d. Amount of rainfall and yield of crops.
Correlation between two variables is said to be negative or inverse if the variables deviate in opposite
directions. That is, if an increase (or decrease) in the values of one variable results, on an average, in a
corresponding decrease (or increase) in the values of the other variable.
Some examples of series of negative correlation are:
a. Volume and pressure of perfect gas; c. Price and demand of goods and
b. Net income and operating expense, d. Temperature and altitude
Graphs of Positive and Negative correlation:
Suppose we are given sets of data relating to heights and weights of students in a class. They can be
plotted on the coordinate plane using x –axis to represent heights and y – axis to represent weights. The
different graphs shown below illustrate the different types of
correlations.

[Scatter diagrams:] perfect positive correlation (r = 1); strong positive correlation (r = 0.80); zero correlation (r = 0);
perfect negative correlation (r = −1); moderate negative correlation (r = −0.43); strong correlation with an outlier (r = 0.71).

Note:
i. If the points are very close to each other, a fairly good amount of correlation can be expected
between the two variables. On the other hand if they are widely scattered a poor correlation can
be expected between them.

ii. If the points are scattered and reveal no upward or downward trend, we say the variables are
uncorrelated.
iii. If there is an upward trend rising from the lower left hand corner and going upward to the upper
right hand corner, the correlation obtained from the graph is said to be positive. Also, if there is a
downward trend from the upper left hand corner the correlation obtained is said to be negative.
iv. The graphs shown above are generally termed as scatter diagrams.
Linear and Non – Linear Correlation
The correlation between two variables is said to be linear if a unit change in one variable results in a
constant change in the other variable over the entire range of values. For example, consider
the following data.
X 2 4 6 8 10
Y 7 13 19 25 31
Thus, for a unit change in the value of x, there is a constant change in the corresponding values of y and
the above data can be expressed by the relation y = 3x +1
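A quick check in Python confirms that every pair in the table satisfies y = 3x + 1:

```python
# Verify that each (x, y) pair from the table lies on the line y = 3x + 1.
x = [2, 4, 6, 8, 10]
y = [7, 13, 19, 25, 31]
assert all(yi == 3 * xi + 1 for xi, yi in zip(x, y))
print("all pairs satisfy y = 3x + 1")
```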
In general two variables x and y are said to be linearly related, if there exists a relationship of the form y
= a + bx.
Where y = dependent variable, x = independent variable, and a (the y-intercept) and b (the slope of the line)
are real numbers. This is nothing but a straight line when plotted on a graph with different values of x and y
for constant values of a and b. Such relations generally occur in physical sciences but are rarely encountered
in economic and social sciences.
Dependent variable - the variable that is being predicted or estimated (explanatory variable).
Independent variable - A variable that provides the basis for estimation (predictor variable).

The relationship between two variables is said to be non – linear if corresponding to a unit change in one
variable, the other variable does not change at a constant rate but changes at a fluctuating rate. In such
cases, if the data is plotted on a graph sheet we will not get a straight line. For example, one may
have a relation of the form y = a + bx + cx², or a more general polynomial.
Coefficient of Correlation (r)
One of the most widely used statistics is the coefficient of correlation‘r’, which measures the degree of
association between the two values of related variables given in the data set. In other words, the
coefficient of correlation describes the strength of the relationship between two sets of interval-scaled or
ratio-scaled variables. Designated r, it is often referred to as Pearson's r and as the Pearson product-moment
correlation coefficient. It takes values from +1 to −1. If two sets of data have r = +1, they are said to be
perfectly positively correlated; if r = −1, they are said to be perfectly negatively correlated; and
if r = 0, they are uncorrelated.
The coefficient of correlation r is given by the formula
r = (nΣxy − ΣxΣy) / [√(nΣx² − (Σx)²) √(nΣy² − (Σy)²)]
Example: A study was conducted to determine whether there is a relationship between the number of sales
calls made in a month and the number of copiers sold that month. The sales manager selects a random
sample of 10 representatives and determines the number of sales calls each representative made last month
and the number of copiers sold. The sample information is shown in the table below. Compute the coefficient
of correlation.
Sales representative   Number of sales calls (x)   Number of copiers sold (y)
1                      20                          30
2                      40                          60
3                      20                          40
4                      30                          60
5                      10                          30
6                      10                          40
7                      20                          40
8                      20                          50
9                      20                          30
10                     30                          70
Solution:
 x     y     x²      y²       xy
 20    30    400     900      600
 40    60    1,600   3,600    2,400
 20    40    400     1,600    800
 30    60    900     3,600    1,800
 10    30    100     900      300
 10    40    100     1,600    400
 20    40    400     1,600    800
 20    50    400     2,500    1,000
 20    30    400     900      600
 30    70    900     4,900    2,100
Σ 220   450   5,600   22,100   10,800

r = (nΣxy − ΣxΣy) / [√(nΣx² − (Σx)²) √(nΣy² − (Σy)²)]
= [(10)(10,800) − (220)(450)] / [√((10)(5,600) − (220)²) √((10)(22,100) − (450)²)]
= 9,000 / [(87.17798)(136.0147)] = 0.759

How do we interpret a correlation of 0.759? First, it is positive, so we see there is a direct relationship between
the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1, so we conclude
that the association is strong. To put it another way, an increase in calls will likely lead to more sales.
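The hand computation can be confirmed with a short Python sketch that applies the same formula to the sales-call data (standard library only; the variable names are our own):

```python
import math

# Sales calls (x) and copiers sold (y) for the 10 representatives.
x = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
n = len(x)

sx, sy = sum(x), sum(y)                 # Σx = 220, Σy = 450
sxx = sum(v * v for v in x)             # Σx² = 5,600
syy = sum(v * v for v in y)             # Σy² = 22,100
sxy = sum(a * b for a, b in zip(x, y))  # Σxy = 10,800

# Pearson product-moment correlation coefficient.
r = (n * sxy - sx * sy) / (math.sqrt(n * sxx - sx**2) * math.sqrt(n * syy - sy**2))
print(round(r, 3))  # 0.759
```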

Rank Correlation
The product-moment correlation coefficient is used to measure the strength of the linear association
between two variables, i.e. how close the points on a scatter graph lie to a straight line. It is most
appropriate when the points on a scatter graph have an elliptical pattern. The product-moment correlation
coefficient is less appropriate when the points on a scatter graph seem to follow a curve or when there are
outliers (or anomalous values) on the graph.
Data which are arranged in numerical order, usually from largest to smallest, and numbered 1, 2, 3, … are
said to be in ranks or ranked data. These ranks prove useful at certain times when two or more values
of one variable are the same. The coefficient of correlation for such data is given by the Spearman
rank difference correlation coefficient, denoted by R. In order to calculate R, we arrange the data in
ranks and compute the difference in rank, d, for each pair. The following example will explain the usefulness
of R. R is given by the formula
R = 1 − 6(Σd²) / (n(n² − 1))
Where, d = difference between ranks and n = total number of observations.
Example: The following data show the annual income per head of population, x (in Birr), and the
infant mortality, y (per thousand live births), for a sample of 11 countries.
Country A B C D E F G H I K L
X 130 5950 560 2010 1870 170 390 580 820 6620 3800
Y 150 43 121 53 41 169 143 59 75 20 39
The relationship between the two variables does not, however, appear to be linear – it is more curved (as a
scatter diagram would show). Calculating the product-moment correlation coefficient for these data is therefore
not really appropriate (as this examines how well the data fit a straight line).
Country    A    B     C    D     E     F    G    H    I    K     L     Total
x         130  5950  560  2010  1870  170  390  580  820  6620  3800
y         150   43   121   53    41   169  143   59   75    20    39
Rank x      1   10     4    8     7     2    3    5    6    11     9
Rank y     10    4     8    5     3    11    9    6    7     1     2
d          −9    6    −4    3     4    −9   −6   −1   −1    10     7
d²         81   36    16    9    16    81   36    1    1   100    49    426
Note: In ranking the numbers above, we have used rank 1 to denote the smallest number in each row and
rank 11 to represent the largest number. Some people would use rank 1 to denote the largest number and
rank 11 to represent the smallest. It does not matter which way you do the ranking as long as you rank in
the same way for both rows.
R = 1 − 6(Σd²)/(n(n² − 1)) = 1 − 6(426)/(11(11² − 1)) = 1 − 2556/1320 = −0.936
Interpretation: This value (−0.936) represents a strong negative rank correlation between income and
infant mortality, i.e. infant mortality tends to fall as income per head of population increases.
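The ranking and the Spearman formula can also be sketched in Python. Because these data contain no tied values, a simple ordinal ranking suffices (the variable names are our own):

```python
# Income per head (x, Birr) and infant mortality (y) for the 11 countries.
x = [130, 5950, 560, 2010, 1870, 170, 390, 580, 820, 6620, 3800]
y = [150, 43, 121, 53, 41, 169, 143, 59, 75, 20, 39]

def ranks(values):
    """Rank 1 = smallest value (valid here because there are no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # Σd² = 426
n = len(x)
R = 1 - 6 * d2 / (n * (n**2 - 1))               # Spearman rank correlation
print(d2, round(R, 3))  # 426 -0.936
```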
6.2 Simple Linear Regression
Overview
Dear learner, in this section you will develop a mathematical equation that will allow us to estimate the
value of one variable based on the value of another. This is called regression analysis. We will (1)
determine the equation of the line that best fits the data, (2) use the equation to estimate the value of one
variable based on another, (3) measure the error in our estimate, and (4) establish confidence and
prediction intervals for our estimate.
Section objectives:
At the end of this section, you should be able to:
 Calculate the least squares regression line.
 Construct confidence and prediction intervals for the dependent variable.
 Interpret confidence and prediction intervals for the dependent variable.
Self Test Exercise-2:
1. What do you know about least squares regression?
2. What is a regression analysis?
3. What is the term confidence interval?

Managerial decisions often are based on the relationship between two or more variables. For example,
after considering the relationship between advertising expenditures and sales, a marketing manager might
attempt to predict sales for a given level of advertising expenditures. In another case, a public utility might
use the relationship between the daily high temperature and the demand for electricity to predict electricity
usage on the basis of next month's anticipated daily high temperatures. Sometimes a manager will rely on
intuition to judge how two variables are related. However, if data can be obtained, a statistical procedure
called regression analysis can be used to develop an equation showing how the variables are
related.
From this scenario, we can understand that regression analysis is concerned with how the values of one
variable depend on the corresponding values of a second variable. This can be summarized by an equation
that enables us to predict or estimate the values of one variable given values of the other variable. In
contrast to correlation problems, which involve measuring only the strength of a relationship, regression
problems are concerned with the form or nature of a relationship and enable us to estimate the value of one
variable using the values of the other variable through an equation.
For instance, questions such as the following are correlation problems, because they ask only about the extent
or strength of a relationship:
i. To what extent is metal pitting related to pollution?
ii. How strong is the link between inflation and employment rates?
A question that asks us to predict or estimate one variable from another is, by contrast, a regression problem.
In regression terminology, the variable being predicted is called the dependent variable. The variable or
variables being used to predict the value of the dependent variable are called the independent
(explanatory) variables. For example, in analyzing the effect of advertising expenditures on sales, a
marketing manager's desire to predict sales would suggest making sales the dependent variable.
Advertising expenditure would be the independent variable used to help predict sales. In statistical
notation, y denotes the dependent variable and x denotes the independent variable.
In this section, we consider the simplest type of regression analysis, involving one independent variable
and one dependent variable, in which the relationship between the variables is approximated by a straight
line. It is called simple linear regression. Regression analysis involving two or more independent
variables is called multiple regression analysis.
Simple Linear Regression Model

HabeshaShiro is a chain of Ethiopian-food restaurants located in a five-state area. Habesha's most
successful locations are near college campuses. The managers believe that sales for these restaurants
(denoted by y) are related positively to the size of the student population (denoted by x); that is, restaurants
near campuses with a large student population tend to generate more sales than those located near
campuses with a small student population. Using regression analysis, we can develop an equation showing
how the dependent variable y is related to the independent variable x.
Regression Model and Regression Equation
In the HabeshaShiro restaurant example, the population consists of all the Habesha's restaurants. For
every restaurant in the population, there is a value of x (student population) and a corresponding value of y
(sales). The equation that describes how y is related to x and an error term is called the regression model,
as given below.
y = 𝛽0 + 𝛽1 𝑥 + 𝜀
β0 and β1 are referred to as the parameters of the model, and ε (the Greek letter epsilon) is a random
variable referred to as the error term. The error term accounts for the variability in y that cannot be
explained by the linear relationship between x and y.
The population of all Habesha's restaurants can also be viewed as a collection of subpopulations, one for
each distinct value of x. For example, one subpopulation consists of all Habesha's restaurants located near
college campuses with 8000 students; another of those near campuses with 9000 students; and so on. Each
subpopulation has a corresponding distribution of y values, and each such distribution has its own mean or
expected value. The equation that describes how the expected value of y, denoted E(y), is related to x is
called the regression equation, as shown below.
E(y) = 𝛽0 + 𝛽1 𝑥
The graph of the simple linear regression equation is a straight line; 𝛽0 is the y-intercept of the regression
line, 𝛽1 is the slope, and E(y) is the mean or expected value of y for a given value of x.
Estimated Regression Equation
If the values of the population parameters β0 and β1 were known, we could use the above equation to
compute the mean value of y for a given value of x. In practice, the parameter values are not known and
must be estimated using sample data. Sample statistics (denoted b0 and b1) are computed as estimates of
the population parameters β0 and β1. Substituting the values of the sample statistics b0 and b1 for
β0 and β1 in the regression equation, we obtain the estimated regression equation. The estimated
regression equation for simple linear regression follows.

ŷ = b0+ b1x
The graph of the estimated simple linear regression equation is called the estimated regression line; b0 is
the y-intercept and b1 is the slope. Below, we show how the least squares method can be used
to compute the values of b0 and b1 in the estimated regression equation. In general, ŷ is the point estimator
of E(y), the mean value of y for a given value of x.
The least squares method is a procedure for using sample data to find the estimated regression equation.
To illustrate the least squares method, suppose data were collected from a sample of 10 HabeshaShiro
restaurants located near college campuses. For the ith observation or restaurant in the sample, xi is the size of
the student population (in thousands) and yi is the sales (in thousands of dollars). The values of xi and yi for
the 10 restaurants in the sample are summarized in the table below.
Restaurant                     1    2    3    4    5    6    7    8    9   10
Student population (xi)        2    6    8    8   12   16   20   20   22   26
Sales (yi)                    58  105   88  118  117  137  157  169  149  202
We therefore choose the simple linear regression model to represent the relationship between sales and
student population. Given that choice, our next task is to use the sample data in the table above to determine
the values of b0 and b1 in the estimated simple linear regression equation. For the ith restaurant, the
estimated regression equation provides ŷi = b0 + b1xi.
Where,
ŷi = estimated value of sales ($1000s) for the ith restaurant
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = size of the student population (1000s) for the ith restaurant
The least squares method uses the sample data to provide the values of b0 and b1 that minimize the sum of
the squares of the deviations between the observed values of the dependent variable yi and the estimated
values of the dependent variable ŷi. Differential calculus can be used to show that the values of b0 and
b1 that minimize this sum of squared deviations can be found by using the following equations:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   and   b0 = ȳ − b1x̄

where, xi = value of the independent variable for the ith observation

yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations
Some of the calculations necessary to develop the least squares estimated regression equation for
HabeshaShiro are shown in the table below. With the sample of 10 restaurants, we have n = 10 observations.
Because the equations above require x̄ and ȳ, we begin the calculations by computing them:
x̄ = Σxi/n = 140/10 = 14   and   ȳ = Σyi/n = 1300/10 = 130

Restaurant i    xi    yi    xi − x̄    yi − ȳ    (xi − x̄)(yi − ȳ)    (xi − x̄)²
1                2    58    −12       −72       864                 144
2                6   105     −8       −25       200                  64
3                8    88     −6       −42       252                  36
4                8   118     −6       −12        72                  36
5               12   117     −2       −13        26                   4
6               16   137      2         7        14                   4
7               20   157      6        27       162                  36
8               20   169      6        39       234                  36
9               22   149      8        19       152                  64
10              26   202     12        72       864                 144
Totals    Σxi = 140   Σyi = 1300                Σ = 2840            Σ = 568

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 2840/568 = 5   and   b0 = ȳ − b1x̄ = 130 − 5(14) = 60
Thus, the estimated regression equation is ŷ = 60 + 5x
The slope of the estimated regression equation (b1 = 5) is positive, implying that as student population
increases, sales increase.
If we believe the least squares estimated regression equation adequately describes the relationship
between x and y, it would seem reasonable to use it to predict the value of y for a given value of x. For
example, if we wanted to predict sales for a restaurant to be located near a campus with 16,000 students,
we would compute
ŷ = 60 + 5(16) = 140
Because sales are measured in thousands of dollars, the predicted sales are $140,000.
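The least squares estimates can be reproduced with a short Python sketch over the restaurant data (standard library only; the variable names are our own):

```python
# Student population (1000s) and sales ($1000s) for the 10 restaurants.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
n = len(x)

x_bar = sum(x) / n   # 14.0
y_bar = sum(y) / n   # 130.0

# Least squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)   # 2840 / 568 = 5.0
b0 = y_bar - b1 * x_bar                   # 130 - 5(14) = 60.0

# Predicted sales for a campus with 16,000 students (x = 16, in thousands).
y_hat_16 = b0 + b1 * 16                   # 140 -> $140,000
print(b1, b0, y_hat_16)  # 5.0 60.0 140.0
```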
Coefficient of Determination
For the HabeshaShiro example, we developed the estimated regression equation ŷ = 60 + 5x to
approximate the linear relationship between the size of the student population x and sales y. A question
now is: How well does the estimated regression equation fit the data? In this section, we show that the
coefficient of determination provides a measure of the goodness of fit for the estimated regression
equation.
For the ith observation, the difference between the observed value of the dependent variable, yi, and the
estimated value of the dependent variable, ŷi, is called the ith residual. The ith residual represents the
error in using ŷi to estimate yi. Thus, for the ith observation, the residual is yi − ŷi. The sum of the squares
of these residuals, which is the quantity minimized by the least squares method, is known as the sum of
squares due to error, denoted SSE.
SSE = Σ(yi − ŷi)²
The value of SSE is a measure of the error in using the estimated regression equation to estimate the
values of the dependent variable in the sample. For instance, for Habesha restaurant 1 the values of the
independent and dependent variables are x1 = 2 and y1 = 58. Using the estimated regression equation,
we find that the estimated value of sales for restaurant 1 is ŷ1 = 60 + 5(2) = 70. Thus, the error in using
ŷ1 to estimate y1 for restaurant 1 is y1 − ŷ1 = 58 − 70 = −12. The squared error is (−12)² = 144.
After computing and squaring the residuals for each restaurant in the sample in the same way, we
sum them to obtain SSE = 1530. Thus, SSE = 1530 measures the error in using the estimated regression
equation ŷ = 60 + 5x to predict sales.
Now suppose we are asked to develop an estimate of sales without knowledge of the size of
the student population. Without knowledge of any related variables, we would use the sample mean as
an estimate of sales at any given restaurant. We can compute the sum of squared deviations
obtained by using the sample mean ȳ = 130 to estimate the value of sales for each restaurant in the
sample. For the ith restaurant in the sample, the difference yi − ȳ provides a measure of the error
involved in using ȳ to estimate sales. The corresponding sum of squares, called the total sum of squares,
is denoted SST.
SST = Σ(yi − ȳ)²
Thus, the total sum of squares for the Habesha restaurants is SST = 15,730.
The arithmetic difference between total sum of squares and sum of squares dueto erroris called sum
of squares due to regression (SSR), i.e SSR = SST ─ SSE
Thus, SSR = 15,750 ─ 1530 = 14,200
The ratio SSR/SST, which will take values between zero and one, is used to evaluate the goodness of
fit for the estimated regression equation. This ratio is called the coefficient of determination and is
denoted by r².
Therefore, r² = SSR/SST = 14,200/15,730 = 0.9027
When we express the coefficient of determination as a percentage, r² can be interpreted as the
percentage of the total sum of squares that can be explained by using the estimated
regression equation. For the Habesha restaurant, we can conclude that 90.27% of the total sum of squares can be
explained by using the estimated regression equation ŷ = 60 + 5x to predict sales. In other words,
90.27% of the variability in sales can be explained by the linear relationship between the size of the
student population and sales. We should be pleased to find such a good fit for the estimated regression
equation.
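The SSE, SST, SSR, and r² computations above can be sketched in Python. The five (x, y) pairs below are hypothetical stand-ins (the chapter's full restaurant sample is not reproduced here), so the totals will not match SSE = 1530 or SST = 15,730; only the fitted equation ŷ = 60 + 5x and restaurant 1's values (x = 2, y = 58) come from the text.

```python
def goodness_of_fit(x, y, b0, b1):
    """Compute SSE, SST, SSR and r^2 for the fitted line y-hat = b0 + b1*x."""
    y_hat = [b0 + b1 * xi for xi in x]                      # predicted values
    y_bar = sum(y) / len(y)                                 # sample mean of y
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # sum of squares due to error
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    ssr = sst - sse                                         # sum of squares due to regression
    return sse, sst, ssr, ssr / sst                         # r^2 = SSR / SST

# Hypothetical student populations (1000s) and quarterly sales (1000s Birr);
# restaurant 1 (x = 2, y = 58) is the only pair taken from the text.
x = [2, 6, 8, 12, 20]
y = [58, 105, 88, 117, 157]
sse, sst, ssr, r2 = goodness_of_fit(x, y, 60, 5)
print(sse, sst, ssr, round(r2, 4))
```

Note that the first residual reproduces the text's worked step: ŷ₁ = 60 + 5(2) = 70, so y₁ − ŷ₁ = −12 and its square is 144.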
Summary
Correlation is a statistical tool which studies the relationship between two variables, and correlation
analysis involves the various methods and techniques used for studying and measuring the extent of the
relationship between the two variables. "Two variables are said to be in correlation if the change in
one of the variables results in a change in the other variable." There are two important classifications of
correlation: (1) positive and negative correlation, and (2) linear and non-linear
correlation. If the values of the two variables deviate in the same direction, i.e. if an increase (or
decrease) in the values of one variable results, on average, in a corresponding increase (or decrease)
in the values of the other variable, the correlation is said to be positive. Correlation between two
variables is said to be negative or inverse if the variables deviate in opposite directions; that is, if an
increase (or decrease) in the values of one variable results, on average, in a corresponding decrease
(or increase) in the values of the other variable.
The correlation between two variables is said to be linear if a change of one unit in one variable
results in a corresponding change in the other variable over the entire range of values. In general, two
variables x and y are said to be linearly related if there exists a relationship of the form y = a + bx,
where y = dependent variable, x = independent variable, a = y-intercept and b = slope of the line
(defined as the rise or drop), and a and b are real numbers. The relationship between two variables is said to be
non-linear if, corresponding to a unit change in one variable, the other variable does not change at a
constant rate but changes at a fluctuating rate. In such cases, if the data are plotted on a graph sheet we
will not get a straight-line curve.
One of the most widely used statistics is the coefficient of correlation 'r', which measures the degree
of association between the values of two related variables given in the data set. In other words, the
coefficient of correlation describes the strength of the relationship between two sets of interval-scaled
or ratio-scaled variables. Designated r, it is often referred to as Pearson's r and as the Pearson
product-moment correlation coefficient. It takes values from +1 to −1. If two sets of data have r = +1,
they are said to be perfectly positively correlated; if r = −1 they are said to be perfectly negatively
correlated; and if r = 0 they are uncorrelated.
The coefficient of correlation r is given by the formula

r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²))
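As a sketch, the formula above can be computed directly from the raw sums. The two small data sets below are hypothetical, chosen only to show the two perfect-correlation endpoints, r = +1 and r = −1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation via the raw-sums formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))   # sum of cross-products
    sxx = sum(a * a for a in x)              # sum of squared x values
    syy = sum(b * b for b in y)              # sum of squared y values
    num = n * sxy - sx * sy
    den = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return num / den

# Hypothetical data: y doubles x (perfect positive), then y reversed (perfect negative)
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 4))   # perfectly positive
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 4))   # perfectly negative
```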
Data which are arranged in numerical order, usually from largest to smallest, and numbered 1, 2, 3, ...
are said to be in ranks or ranked data. These ranks prove useful at certain times when two or more
values of one variable are the same. The coefficient of correlation for such data is given by the
Spearman rank difference correlation coefficient and is denoted by R. In order to calculate R, we
arrange the data in ranks, computing the difference in rank, d, for each pair. R is given by the formula

R = 1 − 6(Σd²) / (n(n² − 1))

Where d = difference between ranks and n = total number of observations
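A minimal sketch of the rank-difference computation, assuming no tied ranks (the formula above does not adjust for ties); the score lists are hypothetical.

```python
def spearman_R(x, y):
    """Spearman rank correlation: R = 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties assumed."""
    def ranks(values):
        # Rank 1 goes to the largest value, matching "largest to smallest".
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores: same ordering in both lists, so every d = 0 and R = 1
print(spearman_R([90, 80, 70, 60], [85, 75, 65, 55]))
```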
Regression analysis is concerned with how the values of one variable depend on the corresponding
values of a second variable. In regression terminology, the variable being predicted is called the
dependent variable. The variable or variables being used to predict the value of the dependent
variable are called the independent (explanatory) variables. The equation that describes how y is
related to x and an error term is called the regression model, as given below.

y = β₀ + β₁x + ε

β₀ and β₁ are referred to as the parameters of the model, and ε (the Greek letter epsilon) is a random
variable referred to as the error term. The error term accounts for the variability in y that cannot be
explained by the linear relationship between x and y.
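Although this section does not restate them, the least squares estimates of the slope and intercept have the standard closed forms b₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b₀ = ȳ − b₁x̄. A short sketch with hypothetical data:

```python
def least_squares(x, y):
    """Least squares estimates (b0, b1) for the simple linear model y-hat = b0 + b1*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    b0 = sy / n - b1 * sx / n                        # intercept: y-bar - b1 * x-bar
    return b0, b1

# Hypothetical (x, y) observations, not taken from the chapter's examples
b0, b1 = least_squares([1, 2, 3, 4], [63, 72, 77, 88])
print(round(b0, 2), round(b1, 2))
```

The estimated equation ŷ = b₀ + b₁x then minimizes SSE over all choices of intercept and slope, which is exactly the property used when SSE was introduced earlier in the section.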
Self-test exercises
1. What is correlation?
2. What is the difference between dependent and independent variable in regression
terminology?
Answer for self-test exercise
1. Correlation is a statistical tool which studies the relationship between two variables and
Correlation Analysis involves various methods and techniques used for studying and
measuring the extent of the relationship between the two variables.
2. In regression terminology, the variable being predicted is called the dependent variable. The
variable or variables being used to predict the value of the dependent variable are called the
independent (explanatory) variables.
Assignment
1. A regional commuter airline selected a random sample of 25 flights and found that the
correlation between the number of passengers and the total weight, in pounds, of luggage
stored in the luggage compartment is 0.94. Using the .05 significance level, can we conclude
that there is a positive association between the two variables?
2. An Environmental Protection Agency study of 12 automobiles revealed a correlation of
0.47 between the engine size and emissions. At the .01 significance level, can we conclude
that there is a positive association between these variables? What is the p-value? Interpret.
3. Below is information on the price per share and the dividend for a sample of 15 companies.

Company   Price per share (Birr)   Dividend (Birr)
1         500                      55
2         450                      45
3         650                      65
4         240                      25
5         150                      17
6         650                      60
7         650                      60
8         856                      85
9         520                      45
10        365                      36
11        275                      25
12        960                      100
13        500                      49
14        980                      89
15        450                      42
a. Determine the regression equation for predicting selling price based on the annual dividend.
Interpret the slope value.
b. Determine the coefficient of determination. Interpret its value.
c. Determine the coefficient of correlation. Can you conclude that it is greater than 0 using
the .05 significance level?