
14 Test of Significance

In this chapter we put together the few skills and techniques we have learnt so far into a formal method for testing hypotheses, which is the classical way of doing statistical inference.

Assume that as a new employee at the national tax department you propose a new tax code that you claim is revenue-neutral (i.e. tax revenue will remain the same) but if adopted will save on administration costs. To show this you collect a sample of 100 forms in the treasury department and, applying your new tax rule, find that the sample average came to 2190 Baht with a sample standard deviation of 7200 Baht. The question then is whether this drop in tax revenue of 2190 Baht really represents a drop in overall tax revenue or can be attributed to chance error.
We shall break the hypothesis testing process into three essential steps,
namely, 1) setting up the hypothesis, 2) calculating the test statistic,
and 3) concluding.

14.1 Setting up the hypothesis

We set up the null and alternative hypotheses as follows:

H0: Average = 0
H1: Average > 0

The null hypothesis H0 expresses the idea that an observed difference is merely due to chance. That is, for our example, the actual change in revenue collected is zero: there is no change. Usually the null is something that the researcher intends to disprove. The alternative hypothesis H1 then is what the researcher sets out to prove given evidence from his or her sample. So, somewhat counter-intuitively, the null hypothesis is the "alternative" explanation for the findings, in terms of chance error.

14.2 The test statistic

A test statistic is used to measure the difference between the data (from
the sample) and what is expected as stated in the null hypothesis.

TS = (observed − expected) / SE

Looking at the numerator, we observe that if the observed value is very different from the expected value, the difference will be large and so will the test statistic. More specifically, the test statistic tells us how many standard errors away an observed value is from its expected value, and a large test statistic therefore means that the sample evidence is very different from what is expected under the null. This suggests that one should not accept the null hypothesis. The next section formalizes this using the p-value and the "critical value" methods.
For our example, the standard error is SD/√n = 7200/√100 = 720 Baht, so the test statistic is:

TS = (2190 − 0) / 720 ≈ 3
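To make the arithmetic concrete, here is a minimal Python sketch of the calculation above (the code and variable names are ours, not part of the original example):

    # Sketch (ours) of the tax-revenue example, numbers from the text.
    import math

    n = 100           # sample size
    avg = 2190        # sample average change, in Baht
    sd = 7200         # sample standard deviation, in Baht
    expected = 0      # value of the average under the null hypothesis

    se = sd / math.sqrt(n)        # standard error: 7200/10 = 720
    ts = (avg - expected) / se    # test statistic
    print(se, round(ts, 2))       # 720.0 3.04, i.e. TS is about 3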

14.3 How to conclude

We will now learn three ways to conclude, namely, the p-value method, the critical value method, and constructing confidence intervals. All three methods are related.
Before we continue, firstly, observe from the alternative hypothesis that this is a one-sided test, i.e. we are testing whether the average is larger than that stated in the null hypothesis. Incidentally, a 2-sided test can be performed when the researcher wishes to know whether the value of the alternative is larger or smaller than that stated in the null, formally stated as H1: Average ≠ 0. Secondly, it is a good time to set the level of significance for the test, α, at, say, 5%.
The “p-value” or observed significance level is the chance of getting a test
statistic as extreme as, or more extreme than, the observed one. The
p-value is computed on the basis that the null hypothesis is correct.
Informally, the p-value is a measure of the evidence against the null.
Hence, the smaller the p-value is, the stronger the evidence against the
null.

[Margin figure: the normal curve with the area to the right of the test statistic shaded; here the p-value is about 1 in 1,000.]

So, a small p-value, more concretely one less than the stipulated significance level α, leads us to reject the null hypothesis.
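As an illustration, the p-value calculation can be sketched in Python using scipy (our choice of library; the text itself reads the normal table):

    # Sketch (ours): one-sided p-value under the normal approximation.
    from scipy.stats import norm

    ts = 3.04
    p_value = norm.sf(ts)         # area to the right of the test statistic
    print(round(p_value, 4))      # about 0.0012, i.e. roughly 1 in 1,000

    alpha = 0.05
    print("reject the null" if p_value < alpha else "do not reject the null")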
An alternative method, popularized in many textbooks, is to compare
the test statistic itself with a “critical value.” As mentioned earlier, a
large test statistic suggests a noticeable difference between the sample
statistic and the expected value. Here, we must take note of whether
a 1-sided or 2-sided test is being performed. For a one-sided test, the
critical value is found by placing all of α on one side of the distribution, in this case, say, 5%, giving us a z-value or more precisely a "critical value" of 1.645. Notice that we have assumed a normal distribution and a .05 significance level.
Since the test statistic, which was calculated to be approximately 3, is larger than the critical value of 1.645 in absolute terms, we reject the null. In other words, the difference between the sample evidence and the expected value, in terms of the standard error, is significantly large. Hence statisticians like to say that the difference is statistically "significant" (in this case, at the 5% level).
Note that if one were conducting a 2-sided test, then the critical value
would be 1.96. As a rule of thumb, when looking for the critical value
in a 2-sided test as opposed to a 1-sided test, put only half of the
significance level on each side of the distribution. So, at the .05 signifi-
cance level, place 2.5% on either side of the normal distribution and
determine the z-value.
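The critical values quoted above can likewise be recovered programmatically; a sketch (ours), again assuming a normal distribution:

    # Sketch (ours): normal critical values at the .05 significance level.
    from scipy.stats import norm

    alpha = 0.05
    cv_one_sided = norm.ppf(1 - alpha)      # 1.645: all of alpha on one side
    cv_two_sided = norm.ppf(1 - alpha / 2)  # 1.960: alpha/2 on each side

    ts = 3.04                               # from the tax example
    print(abs(ts) > cv_one_sided)           # True: reject at the 5% level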
A third method used when making a decision in hypothesis testing is to construct confidence intervals. Given a sample statistic and, say, a 5% significance level, construct a 90% confidence interval for the population parameter for a 1-sided test, or construct a 95% confidence interval for a 2-sided test. Lastly, if the value expected under the null hypothesis does not fall inside the confidence interval, reject the null.
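A sketch of the confidence-interval method applied to the tax example, under the same normal assumption (code ours):

    # Sketch (ours): a 90% confidence interval for the one-sided test at 5%.
    from scipy.stats import norm

    avg, se, alpha = 2190, 720, 0.05
    z = norm.ppf(1 - alpha)                   # 1.645
    lower, upper = avg - z * se, avg + z * se
    print(round(lower), round(upper))         # about (1006, 3374)
    # The hypothesized value 0 falls outside the interval: reject the null.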
Again we stress that the 3 methods are intricately related. The choice
of the level of significance is left to the researcher, but it has been a
long tradition in economics to use the .05 significance level.

14.4 The t-test

The normal distribution is not the only possible sampling distribution. As mentioned earlier, the central limit theorem states that as the sample size n increases, the distribution of the sample average approaches the normal distribution with mean μ and variance σ²/n, irrespective of the shape of the original distribution. But if the sample size is small and/or the standard deviation of the population is unknown and has to be estimated by the unbiased sample estimate, then the t-distribution or Student's distribution is appropriate.*
As shown in Figure 14.1, the t-distribution is shaped like the normal distribution but has fatter tails on both ends. As with the normal distribution, we might be interested in the area under the curve. Recall that for the normal distribution, we need two parameters, the mean μ and the SD or variance. For the t-distribution, only one parameter, the degrees of freedom, is required. The degrees of freedom for our purposes, which we will denote df, is the sample size less one, or (n − 1). Note that the larger the sample size, the more the t-distribution looks like the normal curve.
* The derivation of the t-distribution was first published in 1908 by William Sealy Gosset,
while he worked at a Guinness Brewery in Dublin. He was prohibited from publishing
under his own name, so the paper was written under the pseudonym Student. The
t-test and the associated theory became well-known through the work of R.A. Fisher,
who called the distribution “Student’s distribution”.

Figure 14.1: Student’s t-distribution

To find probabilities, we can use the Student's t-table shown in the appendix, which lists the critical values for the t-test. The first column shows the degrees of freedom, and the top row corresponds to the area to the right of the critical t-values. For example, for 45 degrees of freedom at the 0.05 level, the corresponding critical value would be 1.676.*

* If, say, your df is not shown in the table, use the nearest one.
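For readers who prefer software to tables, critical t-values can be looked up as in the following sketch (ours); note that a table lacking a row for df = 45 reports the nearest row instead:

    # Sketch (ours): critical t-value with area alpha to its right.
    from scipy.stats import t

    df = 45                       # degrees of freedom = n - 1
    alpha = 0.05
    cv = t.ppf(1 - alpha, df)
    print(round(cv, 3))           # 1.679 (a nearest-row table may show 1.676)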

14.5 Types of error

Recall that the task of statistical inference is to make some statement about a population parameter given a sample statistic. In hypothesis testing, the researcher sets up a null hypothesis without really knowing whether it is correct or wrong. Of course, he or she will determine a value from learned judgement and/or past research, if any. Be that as it may, there is always the possibility of error in hypothesis testing, which cannot be avoided.
An analogy of a household smoke detector is useful to make things clear. An alarm could be very sensitive and might go off even if there isn't a fire; this is known as a Type-I error. That is, if we set the null as "no fire" but end up rejecting it when it is in fact true, then we have a Type-I error. Conversely, there could be a fire but our alarm may fail to detect it. This means the null, which remains "no fire" (note that this null is false), is not rejected in the event when there truly is a fire! This is what statisticians call a Type-II error.

Table 14.1: Types of error

                          NO FIRE (H0)     FIRE (H1)
    No Alarm (Accept H0)  No error         Type-II error
    Alarm (Reject H0)     Type-I error     No error

Note that to reduce Type-I error for the example of the fire alarm, we
could simply remove the batteries. Then the alarm will never go off
and this increases Type-II error. In effect, there is a trade-off between
Type-I and Type-II error. A researcher can decide in terms of cost which
error is less desirable when designing a test.
To summarize, Type-I error is the probability of rejecting the null when it is in fact true, i.e. a false alarm. Since 1 − α measures our confidence that any alarm bells we hear are genuine, α then is the probability of Type-I error. On the other hand, the chance of making a Type-II error, that is, when the alternative of "fire" is in fact true but the alarm does not go off, i.e. what we call a "false negative," is usually termed β. That is, β is the probability of not rejecting a hypothesis when it is in fact false. Obviously we would like a test to reject a false null most of the time. In fact this is known as the power of the test, which is 1 − β. Below is a diagram that shows Type-I and Type-II errors.

Figure 14.2: Type-I and Type-II error
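To see that α really is the chance of a Type-I error, consider the following small simulation sketch (ours, with assumed parameters): samples are repeatedly drawn under a true null, and the proportion of false alarms is counted.

    # Sketch (ours): simulating the Type-I error rate of a one-sided test.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha, cv, n, trials = 0.05, 1.645, 100, 10_000
    false_alarms = 0
    for _ in range(trials):
        sample = rng.normal(loc=0.0, scale=1.0, size=n)  # the null is true
        ts = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))
        if ts > cv:                                      # a false alarm
            false_alarms += 1
    print(false_alarms / trials)                         # close to alpha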


16 The Test for the Difference in Means
So far we have treated a single population only. We now turn to discuss a situation dealing with two supposedly different and independent populations. More specifically, a researcher might be interested to know whether the means of two populations are equal or not.
Imagine that a statistics course was given separately to 2 groups of students, say, section A with 49 students and section B with 36 students.* All students sat the same exam. Section A scored an average of 25 points with a standard deviation of 2, while section B scored an average of 23 with standard deviation 3. The question then is, are students in section A smarter than students in section B? Put differently, we wish to know whether the difference in means of 2 points is significant, statistically speaking.†
We follow the procedure we learnt in the previous chapter; that is, 1) we set up the null and alternative hypotheses, 2) calculate the test statistic, and 3) conclude at some level of significance, α. Here we could use the normal distribution or the t-distribution depending on the sample size and whether we are bootstrapping.
The null and alternative hypotheses would be set up as follows:‡

H0: μ1 − μ2 = 0
H1: μ1 − μ2 > 0

Note that this is framed as a one-sided test, as the alternative hypothesis shows.
The test statistic is:

TS = (observed difference − expected difference) / SE of difference

To find the standard error of the difference between two independent samples, one must find a way to combine the two spreads. Adding the two SEs, however, will not do. We have to resort to the law of variance, which states that Var(A+B) = Var(A) + Var(B) for independent A and B. So the trick is to square the individual sample SEs and add them to get a kind of accumulated variance for the two samples, then take the square root to revert to an SE, but this time it will be the SE for the combined samples, which is what we want. Mathematically,
* Note that the sample size of each group need not be the same.
† It is important to see that we are asking a statistical question; of course a score of 25 is not the same as a score of 23. But what we are saying is that the difference of 2 could just be a chance difference and not a difference due to other factors, specifically smartness in this case.
‡ We could have stated the null as μ1 = μ2, which technically is the same, but the way we state it in the text reflects that we are doing a "difference" in means test.

SE of difference = √(SE1² + SE2²)

Hence for our example, SE1 = 2/√49 ≈ 0.29 and SE2 = 3/√36 = 0.5, so the SE of the difference is √(0.29² + 0.5²) ≈ 0.58, and we get

TS = ((25 − 23) − 0) / 0.58 ≈ 3.4
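A minimal Python sketch of this two-sample calculation, with the numbers from the example (code ours):

    # Sketch (ours) of the two-section exam example.
    import math

    n1, avg1, sd1 = 49, 25, 2    # section A
    n2, avg2, sd2 = 36, 23, 3    # section B

    se1 = sd1 / math.sqrt(n1)               # about 0.29
    se2 = sd2 / math.sqrt(n2)               # 0.5
    se_diff = math.sqrt(se1**2 + se2**2)    # about 0.58
    ts = ((avg1 - avg2) - 0) / se_diff
    print(round(se_diff, 2), round(ts, 2))  # 0.58 3.47 (the text, rounding
                                            # the SE to 0.58 first, gets 3.4)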

The next question is to ask whether this test statistic follows a normal or t-distribution. The degrees of freedom is (n1 − 1) + (n2 − 1), since each sample loses one degree of freedom. For our example, this is pretty large, so the normal can be used. But for the purist, the t-distribution is appropriate.
To conclude, any of the 3 methods discussed in the previous chapter can be used, i.e. 1) the p-value method, 2) the "critical value" method, or 3) using "confidence intervals". This is left to the reader as an exercise.
17 The Chi-square Test

The χ²-test, pronounced "kai-square test," was invented by the prominent statistician Karl Pearson in 1900. We have dealt so far with tests that involve, say, drawing from a 1-0 count box. In such a case we have seen that a z-test or t-test was appropriate. We shall now turn to the χ²-test, which is used when we wish to make inferences when more than 2 categories are considered. More specifically, in this chapter we will examine two uses of the χ²-test, namely, 1) the goodness-of-fit test, and 2) the test for independence.

17.1 Goodness-of-fit test

The so-called "goodness-of-fit" test derives from the question of whether data fit some predetermined model or hypothesis. Someone might be interested to know whether a die is fair. Each throw can be classified into one of 6 categories, {1,2,3,4,5,6}. Assuming we tossed a die 60 times, we would expect about 10 of each outcome if the die is fair. Let us assume the outcomes shown in the table below:

Table 17.1: Outcome of 60 throws of a die

    outcome   observed frequency   expected frequency
    1         4                    10
    2         6                    10
    3         17                   10
    4         16                   10
    5         8                    10
    6         9                    10
    sum       60                   60

Clearly, the observed frequency and expected frequency differ. There seem to be too few 1's and too many 3's and 4's. The differences between the observed 5's and 6's and their expected values could be due to chance. A researcher could conduct 6 separate t-tests, but would still be undecided about the die, as different conclusions are found for different outcomes. The χ²-test helps overcome this by producing a single test statistic that captures all cases simultaneously. It is called a χ²-test because the test statistic follows a χ²-distribution.
The χ²-distribution, like the t-distribution, is characterized by one parameter, its degrees of freedom. The χ²-distribution is not symmetric but skewed to the right for smaller degrees of freedom. As the degrees of freedom increase, the distribution approaches the normal, and in fact the two distributions are indeed related.
The χ²-table in the appendix can be read like the Student's t-table. The first column represents the degrees of freedom, while the top row corresponds to the area under the curve to the right of the χ²-values shown below.

Figure 17.1: χ²-distributions (df = 5 and df = 10)

For example, at df = 4 and the .05 significance level, the χ²-critical value is 9.488.
We follow the same procedure as before. The null and alternative hypotheses are:

H0: The die is fair
H1: The die is biased

Secondly, the test statistic is calculated by:

TS = Σ (observed − expected)² / expected        (17.1)

Note that a large χ²-statistic means that the observed and expected frequencies are far apart, suggesting a bad fit. More precisely, at some level of significance α, one can then use the χ²-table to find the corresponding p-value or simply compare the test statistic to the critical value read off the table. If the p-value is less than α, or the test statistic is greater than the critical value, then the null hypothesis is rejected.
For our example of tossing a die 60 times, we could pick α = 0.05. We calculated the χ²-test statistic to be 14.2. The degrees of freedom is not the sample size less one, as with the t-test, but rather the number of categories less one, i.e. 6 − 1 = 5. From the table, we find that the corresponding area to the right of 14.2 is between 5% (the area to the right of 11.07) and 1% (the area to the right of 15.09). In other words, the p-value is between 1 and 5%, which is less than our stipulated α. The conclusion is the same if we compare the test statistic with the critical value at α = 0.05 and 5 degrees of freedom; our test statistic is larger than the critical value, 14.2 > 11.07, leading us to reject the null hypothesis of a fair die.*

* As a rule of thumb, the χ²-test should be used when the expected frequency of each line in the table is 5 or more.
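For reference, the same goodness-of-fit test can be run with scipy (our illustration, not part of the text):

    # Sketch (ours): the die example via scipy's goodness-of-fit test.
    from scipy.stats import chisquare

    observed = [4, 6, 17, 16, 8, 9]
    expected = [10, 10, 10, 10, 10, 10]
    stat, p_value = chisquare(observed, f_exp=expected)
    print(round(stat, 1), round(p_value, 3))   # 14.2 and about 0.014
    # p < 0.05, so reject the null that the die is fair.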

17.2 Test for independence

In the earlier part of the book, when studying probability, recall that we tested for the independence of two variables using conditional probability. More specifically, two events A and B are said to be independent if P(A|B) = P(A) or P(B|A) = P(B). The χ²-test is the statistical equivalent for testing the independence of two variables given observations on them.
As before, a discussion around an example should be useful. From a survey of 2,237 people, their handedness and sex are summarized in the following table.

Table 17.2: Handedness and sex of 2,237 people surveyed

                   Male   Female
    Right-handed   934    1070
    Left-handed    113    92
    Ambidextrous   20     8

The question is, are gender and handedness independent? There can be various reasons why a researcher might be interested in such a question. For example, in neurophysiology, it could be hypothesized that women use the left side of their brain (i.e. their rational faculty) relatively more than men do. This could explain why women are more rational than men. Sociologists, on the other hand, argue that women are under greater pressure to follow social norms than men. The alternative explanation, which corresponds to the null hypothesis, is that handedness is distributed the same for men and women in the population, and any difference in the sample data is due to mere chance. Be that as it may, the χ²-test can bring some light to such questions.
The null and alternative hypotheses are:

H0: Handedness and gender are independent
H1: Handedness and gender are not independent

The test statistic is the same as in equation 17.1. The following table shows the expected frequencies; how we got these will be explained later.

                   Observed         Expected
                   Male   Female    Male   Female
    Right-handed   934    1070      956    1048
    Left-handed    113    92        98     107
    Ambidextrous   20     8         13     15

Applying 17.1, the χ²-test statistic is computed as follows:

χ² = (934 − 956)²/956 + (1070 − 1048)²/1048 + (113 − 98)²/98
     + (92 − 107)²/107 + (20 − 13)²/13 + (8 − 15)²/15 ≈ 12

The degrees of freedom for this problem is (3 − 1) × (2 − 1) = 2, which corresponds to the fact that there are 3 outcomes of handedness and 2 possible outcomes of gender. The following table shows the differences between the observed and expected values, which are nothing more than the deviations.

                   Male   Female   Sum
    Right-handed   -22    22       0
    Left-handed    15     -15      0
    Ambidextrous   7      -7       0
    Sum            0      0        0

The bottom row and right-most column show that the vertical and horizontal sums of the deviations, respectively, add to zero. This means that we need to know only 2 deviations, and the others can be found automatically; hence the degrees of freedom is 2. In sum, when testing independence in an m × n table with no other constraints on the probabilities, there are (m − 1) × (n − 1) degrees of freedom.
Assuming again α = 0.05 for our example, the p-value is the area to the right of χ² = 12 at 2 degrees of freedom. From the table, the p-value is less than 1%, which is less than the significance level of 5%; hence we reject the null. In the same vein, the critical value at 2 degrees of freedom and α = 0.05 is 5.99. Because |TS| > |critical value|, we reject the null in favor of the alternative that handedness and gender are not independent.
Lastly, we explain how the expected frequencies are found.

                   Male   Female   Ratio (%)   Male   Female
    Right-handed   934    1070     89.6        956    1048
    Left-handed    113    92       9.1         98     107
    Ambidextrous   20     8        1.3         13     15
    Sum            1067   1170     100         1067   1170

First, irrespective of gender, we calculate the ratios of right-handed, left-handed, and ambidextrous people in the sample. For example, the ratio of right-handed persons is (934 + 1070)/2237 = 89.6%, and so on. If handedness and gender were independent, then we can assume that the same ratios apply to men and women separately; that is, the number of right-handed men in the sample would be 89.6% of 1067 = 956, and so on until the table is complete.
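As a cross-check, scipy's chi2_contingency carries out the whole procedure, expected frequencies included (our illustration; up to rounding it reproduces the numbers above):

    # Sketch (ours): the handedness example via scipy's independence test.
    from scipy.stats import chi2_contingency

    observed = [[934, 1070],   # right-handed
                [113,   92],   # left-handed
                [ 20,    8]]   # ambidextrous
    stat, p_value, df, expected = chi2_contingency(observed)
    print(round(stat, 1), df, round(p_value, 4))  # 11.8, 2, about 0.0027
    print(expected.round(0))   # matches the expected-frequency table above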
