
Small sample tests

Sampling problems discussed so far dealt with means and proportions.
Evaluation of their sampling errors was based on the normal distribution.
In the case of the mean, the sampling distribution was normal because the
variable was distributed normally in the population or because the Central
Limit Theorem ensured normality for large samples. In the case of
proportions, the normal distribution was used as an approximation for the
underlying binomial distribution. In each case, we required a large sample
(n≥30). When samples are small, n<30, when the population is normally
distributed, and when the population variance has to be estimated from
sample data, the distribution of the sample mean is no longer normal. A
small-sample distribution, known as the t-distribution, has to be used in
this case. When samples are small and the distribution of the variable in
the population is not normal, there is no readily available sampling
distribution. When dealing with proportions coming from small samples,
it is necessary to use the exact binomial distribution.

6.1 The t-distribution

Assume that the variable X is distributed normally in the population with
mean µ and variance σ², i.e. $X \sim N(\mu, \sigma^2)$. If σ² is known, then the sample
mean is normally distributed, and we have no problem. However, in
almost all cases we do not in fact know the population variance, σ², and
must estimate it. We have seen that the estimator

$$\hat{S}^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$$

is an unbiased estimator of σ². We let $\hat{S} = \sqrt{\hat{S}^2}$. However, when we
replace σ with Ŝ in the usual formula for Z, we get:

$$t = \frac{\bar{X} - \mu}{\hat{S}/\sqrt{n}}$$

This does not have a normal distribution. It can be shown that this
statistic, the t-statistic, has the t-distribution with n-1 degrees of freedom.

For large n, the t-distribution resembles the standard normal distribution,
but we are interested here in small samples. The formula for the
t-distribution is quite complicated, and depends on the number of degrees
of freedom. However, it is symmetric about 0, so the same useful
shortcuts, such as P(t>-a)=P(t<a), can be used as for the standard normal.
It can be shown that E(t)=0 and Var(t)=k/(k-2), where k is the number of
degrees of freedom, so in this case Var(t)=(n-1)/(n-3).

Tables of the cumulative t-distribution for different numbers of degrees of
freedom are available. There is also a t-distribution function in Excel: for
x>0 and k degrees of freedom, the function TDIST(x,k,1) will return
P(t>x), while the function TDIST(x,k,2) will return the 2-tailed
probability, P(t>x OR t<-x). There is also a function TINV(p,k) which will
return the critical value XC for a 2-tailed t-distribution with k degrees of
freedom, such that P(|t|>XC)=p.
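For readers working in Python rather than Excel, the same lookups can be done with scipy.stats.t. The following is a minimal sketch (variable names are our own) reproducing the TDIST and TINV calls above:

```python
# A minimal sketch using scipy.stats.t in place of the Excel functions.
from scipy.stats import t

k = 5          # degrees of freedom
x = 2.5706     # point of interest

one_tail = t.sf(x, k)            # P(t > x), like TDIST(x, k, 1)
two_tail = 2 * t.sf(x, k)        # P(t > x or t < -x), like TDIST(x, k, 2)
x_crit = t.ppf(1 - 0.05 / 2, k)  # 2-tailed critical value, like TINV(0.05, k)

print(one_tail, two_tail, x_crit)  # ~0.025, ~0.05, ~2.5706
```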

The distribution of X in the population has to be normal for the t-statistic
to have the t-distribution. However, the t-distribution is quite robust, and
small deviations from normality in the population will not invalidate it.

Tables of the t-distribution

The t-distribution will depend on degrees of freedom. Typically, a table
of the t-distribution will give the critical values corresponding to different
probability levels for a 1-tailed test. (For a 2-tailed test, you must halve
the probability level, since you are considering that probability in each
‘tail’.) Part of a typical table by degrees of freedom (k) and probability (α)
is shown below.

k/α    …    .05      .025     .01     …
1
2
3
4           2.1318   2.7764   3.7469
5           2.0150   2.5706   3.3649

For example, P(t≥2.5706) = 0.025 for the t-distribution with 5 degrees of
freedom. We write 2.5706 = tα=.025,k=5, or t.025,5 = 2.5706.

Uses of the t-distribution

As the t-distribution is a sampling distribution, it can be used to construct
confidence intervals for the population mean µ and to test hypotheses.

Confidence interval
If a random sample of size n comes from a normal population with mean
µ and variance σ² (both µ and σ² being unknown), we can state

$$P\left[ -t_{\alpha/2,n-1} < \frac{\bar{X} - \mu}{\hat{S}/\sqrt{n}} < t_{\alpha/2,n-1} \right] = 1-\alpha,$$

since there is a probability α/2 that the t-statistic will be higher than the
α/2 critical value, and another α/2 that it will be below minus that value.
This expression can be re-arranged to give the (1-α) confidence interval for
µ:

$$P\left[ \bar{X} - t_{\alpha/2,n-1}\,\hat{S}/\sqrt{n} < \mu < \bar{X} + t_{\alpha/2,n-1}\,\hat{S}/\sqrt{n} \right] = 1-\alpha.$$

For example, if we want a 95% confidence interval, then we choose α=.05.

Compare this with the 95% confidence interval in the large-sample case,
when we were assuming a known σ. There we had

$$P\left[ \bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n} \right] = 0.95.$$

Here, 1.96 is the critical value of the standard normal distribution, such
that P(Z>1.96) = 0.025 (since this is a 2-tailed interval). Thus, the α/2
critical value of the standard normal distribution is replaced with the α/2
critical value of the t-distribution, and the population standard deviation σ
is replaced by an unbiased estimate of the standard deviation, Ŝ. In each
case, the confidence interval is measured in standard errors of the sample
mean: in the case of the known S.D., SE(X̄)=σ/√n; in the case of the
unknown S.D., it is Ŝ/√n.

Example

A random sample of 16 households is taken from a large block of flats,
and shows that mean household expenditure on food is £42 per week, with
a standard deviation of £10. Assuming that household expenditure on food
is normally distributed, find the 95% confidence interval for the
population mean.

As $S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$, we have
$\hat{S}^2 = S^2 \cdot n/(n-1) = (16/15) \cdot 10^2 = 106.67$, so $\hat{S} = 10.33$.

From tables, t.025,15 = 2.1314. Thus the confidence interval is

µ = 42 ± 2.1314(10.33/√16) = 42 ± 5.50,

or 36.5 < µ < 47.5. The confidence interval is quite wide, since n is small
and Ŝ is quite large.
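As a quick check, this interval can be reproduced in Python with scipy (variable names are our own; the conversion from the divisor-n standard deviation follows the formula in the example):

```python
# Reproducing the 95% confidence interval from the example above.
from math import sqrt
from scipy.stats import t

n, x_bar, s = 16, 42.0, 10.0         # sample size, mean, divisor-n s.d.
s_hat = sqrt(s**2 * n / (n - 1))     # unbiased estimate, ~10.33

t_crit = t.ppf(1 - 0.05 / 2, n - 1)  # t_{.025,15} ~ 2.1314
half_width = t_crit * s_hat / sqrt(n)

print(x_bar - half_width, x_bar + half_width)  # ~ (36.5, 47.5)
```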

Test of hypothesis

The procedure for testing a hypothesis is similar to that used for large
samples, i.e. based on the normal distribution, but instead of using the z-
statistic, we now use the t-statistic.

Procedure: Set up the null hypothesis, H0: µ=µ0 (say), and the alternative
hypothesis, H1: µ≠µ0. Choose the significance level α at which H0 is to be
tested. The test statistic is

$$t = \frac{\bar{X} - \mu_0}{\hat{S}/\sqrt{n}}.$$

The critical value of t is tα/2,n-1, as this is a 2-tailed test, and is found
from tables. The decision rule is to reject H0 if |t| > tα/2,n-1, and accept
H0 otherwise. If the alternative hypothesis were H1: µ>µ0 or H1: µ<µ0,
then we would use a 1-tailed test, with the critical value being tα,n-1.

Note again that our decision rule is based on measuring how many
standard errors the sample mean is from the hypothesised population
mean.
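The procedure can be sketched in Python, reusing the food-expenditure figures from the earlier example with a hypothetical µ0 = 45 (our own choice, purely for illustration):

```python
# One-sample t-test sketch; mu0 = 45 is a hypothetical value of our own.
from math import sqrt
from scipy.stats import t

n, x_bar, s_hat = 16, 42.0, 10.33   # from the earlier example
mu0, alpha = 45.0, 0.05             # hypothesised mean (illustrative)

t_stat = (x_bar - mu0) / (s_hat / sqrt(n))  # ~ -1.16
t_crit = t.ppf(1 - alpha / 2, n - 1)        # t_{.025,15} ~ 2.1314

print(abs(t_stat) > t_crit)  # False: H0 cannot be rejected here
```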

Difference between two sample means

We may also wish to compare the means of two populations.

If two small random samples are taken from two normal populations with
the same variance, it can be shown that the statistic:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\hat{S}_p \sqrt{(1/n_1) + (1/n_2)}}$$

has the t-distribution with (n1+n2-2) degrees of freedom, where

$$\hat{S}_p^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$$

is a pooled estimate of the common population variance, and n1 and n2 are
the sample sizes. When the variables X1 and X2 are not normally
distributed, or when the population variances are not equal, the test
(sometimes called the Student t-test) is not strictly valid. However, the
t-distribution is quite robust, so small deviations from normality or small
differences in the variances can be ignored in practice.

Very often, our null hypothesis will be that the two population means are
equal, so that µ1-µ2 in the above formula will be equal to 0.

Example

Continuing the last example, suppose that a random sample of 12
households taken from another large block of flats showed an average
household food expenditure of £36 per week with a standard deviation of
£9 per week. Assuming that household expenditure on food is normally
distributed in each block, and that the population variances are equal, test
the hypothesis that the two population means are the same.

H0:µ1-µ2=0 H1:µ1≠µ2. Assume α=0.05.

We first calculate the estimated population variance,

$$\hat{S}_p^2 = \left[12(9^2) + 16(10^2)\right]/(12+16-2) = 98.92, \text{ hence } \hat{S}_p = 9.95.$$

The test statistic is then

$$t = \frac{(42-36)-0}{9.95\sqrt{(1/16)+(1/12)}} = 1.58.$$

The critical value of t obtained from tables is t.025,26 = 2.0555 (there are
16+12-2 = 26 d.f.).

As 1.58 < 2.0555, H0 cannot be rejected at the 5% level of significance.
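The same calculation can be checked with scipy's pooled two-sample test. A minimal sketch (note that scipy expects the unbiased, divisor-(n-1) standard deviations, so we convert from the divisor-n figures used above):

```python
# Checking the pooled two-sample t-test with scipy.
from math import sqrt
from scipy.stats import ttest_ind_from_stats

n1, x1, s1 = 16, 42.0, 10.0   # block 1: size, mean, divisor-n s.d.
n2, x2, s2 = 12, 36.0, 9.0    # block 2

# scipy expects divisor-(n-1) standard deviations, so convert first.
s1_hat = sqrt(s1**2 * n1 / (n1 - 1))
s2_hat = sqrt(s2**2 * n2 / (n2 - 1))

t_stat, p_value = ttest_ind_from_stats(x1, s1_hat, n1, x2, s2_hat, n2,
                                       equal_var=True)
print(t_stat, p_value)        # t ~ 1.58, p > 0.05: H0 not rejected
```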

The t-statistic is also crucial in regression analysis, as the difference
between an estimated regression parameter and the population parameter,
divided by its standard error, has the t-distribution. We therefore use
t-statistics to test hypotheses about regression parameters, for example the
hypothesis that the parameter is equal to zero (i.e. no relationship
between the variables).

6.2 The χ2 distribution


The χ2 distribution has many applications; it can be used to test
hypotheses about population variances, and about the distribution of two
or more populations amongst different categories. (For example, are the
distributions of different ethnic groups amongst different classes of job
the same?) It also appears in many contexts in regression analysis. We
introduce it initially in terms of population variances.

When a random sample of size n is taken from a population in which a
variable X follows the normal distribution, it can be shown that the
statistic

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2},$$

where σ² is the population variance, has the χ² distribution with (n-1)
degrees of freedom. The distribution depends on the number of degrees of
freedom; it has a complicated formula and is positively skewed.

The variable χ² lies between zero and ∞, with E(χ²) = n-1 and
Var(χ²) = 2(n-1). As n increases, the distribution slowly approaches the
normal distribution. When n≥100, the approximation is quite close. A
typical χ² distribution is shown below:

[Figure: the density f(χ²) of a typical χ² distribution, positively skewed,
with its mean n-1 marked on the horizontal axis.]

Tables show the area (α) under the χ² curve to the right of a particular
value of χ² for a given number of degrees of freedom, k. For example, the
entry in the table for k=4 and α=0.95 is 0.7107. This means that
P(χ² > 0.7107) = 0.95 for 4 degrees of freedom. The entry for k=4 and
α=0.05 is 9.488, so P(χ² > 9.488) = 0.05.

Confidence interval for the population variance (σ2)

As the statistic χ² = nS²/σ² has the χ² distribution with n-1 d.f., we can
write

$$P\left[ \chi^2_{.975,n-1} < \frac{nS^2}{\sigma^2} < \chi^2_{.025,n-1} \right] = 0.95.$$

Rearranging, we get a 95% confidence interval:

$$P\left[ \frac{nS^2}{\chi^2_{.025,n-1}} < \sigma^2 < \frac{nS^2}{\chi^2_{.975,n-1}} \right] = 0.95.$$

(Note that inverting the inequality swaps the bounds: the larger quantile,
χ²_{.025,n-1}, gives the lower limit.)

Similarly, we may test hypotheses about σ2 using the χ2 statistic.
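A sketch of the variance interval in Python, using illustrative numbers of our own (n = 16 and a divisor-n sample variance of 100); chi2.ppf returns the quantile with the stated probability below it, which corresponds to the document's right-tail subscripts as noted in the comments:

```python
# 95% confidence interval for sigma^2; n and s2 are illustrative values.
from scipy.stats import chi2

n, s2 = 16, 100.0   # sample size and divisor-n sample variance (our own)
alpha = 0.05

lo_q = chi2.ppf(alpha / 2, n - 1)      # chi^2_{.975,n-1}: 0.975 to its right
hi_q = chi2.ppf(1 - alpha / 2, n - 1)  # chi^2_{.025,n-1}: 0.025 to its right

# The larger quantile gives the lower limit, as noted above.
print(n * s2 / hi_q, n * s2 / lo_q)
```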

6.3 The F-distribution

The F-distribution can be used to test equality of two population
variances. It also occurs frequently in regression analysis. It is used to test
whether a set of regression results as a whole is significant, and it can be
used to test whether a more complicated model is to be preferred to a
simpler model. We introduce it in terms of population variances.

If samples of size n1 and n2 respectively are taken from two normal
populations with variances σ1² and σ2², it can be shown that the statistic

$$F = \frac{\hat{S}_1^2 / \sigma_1^2}{\hat{S}_2^2 / \sigma_2^2}$$

has the F-distribution with k1 = n1-1 and k2 = n2-1 d.f., where $\hat{S}_1^2$ and
$\hat{S}_2^2$ are the unbiased estimates of the population variances, that is,

$$\hat{S}_1^2 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}$$

and similarly for $\hat{S}_2^2$.

The F-distribution has a complicated formula, and depends on two
degrees of freedom, k1 and k2. It is positively skewed, taking values
between 0 and ∞. It can be shown that E(F) = k2/(k2-2) for k2 > 2.
Tables

The F-tables show critical values of F corresponding to different values
of α (tail probabilities) and different combinations of degrees of freedom,
k1 = n1-1 in the numerator and k2 = n2-1 in the denominator. The table
entry will show Fk1,k2,α s.t. P(F > Fk1,k2,α) = α. An extract from a table
is shown below:

k2 \ k1          1      2      3      4
1    α=.05
     α=.025
2    α=.05                           19.25
     α=.025                          39.25
3    α=.05
     α=.025

For example, P(F > F4,2,.05) = .05 with F4,2,.05 = 19.25, and
P(F > F4,2,.025) = .025 with F4,2,.025 = 39.25.

That is, if our F distribution has 4 d.f. in the numerator and 2 in the
denominator, then the 95% critical value is 19.25, and the 97.5% critical
value is 39.25. Note that only the upper-tailed values of the F-distribution
are tabulated. This is because it is always possible to place the larger
value of $\hat{S}^2/\sigma^2$ in the numerator of the F ratio, so that the observed
values of F will always fall in the right-hand tail.
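The tabulated values can be reproduced in Python; f.ppf(p, k1, k2) returns the quantile with probability p below it, so the α = .05 entry corresponds to p = 0.95. A minimal check:

```python
# Reproducing the tabulated F critical values for k1 = 4, k2 = 2.
from scipy.stats import f

print(f.ppf(0.95, 4, 2))   # ~19.25, the alpha = .05 entry
print(f.ppf(0.975, 4, 2))  # ~39.25, the alpha = .025 entry
```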

Test of hypothesis

Set up hypotheses: H0: σ1² = σ2², H1: σ1² ≠ σ2².

Select α = 0.025 (say, to get a significance level of 5% for a 2-tailed test).
The test statistic is

$$F = \frac{\hat{S}_1^2 / \sigma_1^2}{\hat{S}_2^2 / \sigma_2^2} = \frac{\hat{S}_1^2}{\hat{S}_2^2}, \quad \text{since under } H_0,\ \sigma_1^2 = \sigma_2^2.$$

Convention: The larger estimate of the common population variance is
placed in the numerator of the F-ratio, so if Ŝ2² > Ŝ1², we let F = Ŝ2²/Ŝ1²
in order to ensure that F falls in the upper tail of the F-distribution.

The critical value of F is obtained by looking at the F-table with k1 d.f.
along the top of the table (horizontal) and k2 d.f. down the left-hand side
of the table (vertical). Look up the appropriate box, and select the value
of F for the appropriate value of α in that box.

Decision rule: For a one-tail test, if F > Fk1,k2,α, H0 can be rejected at the
α level of significance. For a 2-tailed test (usually the case), H0 can only
be rejected at the 2α level of significance; e.g. if we want a 5% level of
significance we must take α = .025.

Example

We want to test whether male and female students have different
variances in their test scores on a certain course. The 25 male students
have a sample variance of Sm² = 225, and the 30 female students have a
sample variance of Sf² = 121. Test the hypothesis that the variances are
equal at the 5% level of significance, using a 2-tailed test.

Our null hypothesis is H0: σm2 = σf2.

First of all, we must calculate Ŝ1² and Ŝ2².

We have that S1² = 225, so Ŝ1² = S1²·n1/(n1-1) = 225·25/24 = 234.4, and
Ŝ2² = S2²·n2/(n2-1) = 121·30/29 = 125.2. So the test statistic is

F = 234.4/125.2 = 1.872. There are 25-1 = 24 d.f. in the numerator and
30-1 = 29 d.f. in the denominator. We use the FINV function in Excel,
where FINV(p,k1,k2) gives the value F* such that P(F > F*) = p, where
the F-distribution has k1 and k2 d.f. Since we want a 2-tailed test, we
take p = 0.025: FINV(0.025,24,29) = 2.154. As 1.872 < 2.154, we cannot
reject H0, so we do not have sufficient evidence to conclude that male
and female students have different variances.
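For comparison, a Python sketch of the same test (variable names are our own; f.ppf(1-p, k1, k2) plays the role of FINV(p, k1, k2)):

```python
# Variance-ratio test sketch using the figures from the example.
from scipy.stats import f

n_m, s2_m = 25, 225.0   # male students: size and divisor-n variance
n_f, s2_f = 30, 121.0   # female students

s2_m_hat = s2_m * n_m / (n_m - 1)   # ~234.4
s2_f_hat = s2_f * n_f / (n_f - 1)   # ~125.2

F = s2_m_hat / s2_f_hat                      # larger estimate on top, ~1.87
F_crit = f.ppf(1 - 0.025, n_m - 1, n_f - 1)  # ~2.154, like FINV(0.025,24,29)

print(F > F_crit)  # False: H0 cannot be rejected
```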

(NB: it seems here that we have taken the whole class as our sample, so
what is the difference between the sample variance and the ‘population’
variance? In this case, we would be taking our ‘population’ to be male
and female students in general, or hypothetical future students on the
course. Of course, we would need to consider carefully whether it is
legitimate to extrapolate from our sample, this year’s class, to the general
case. This is a common problem in statistical and regression analysis: we
might have quite a limited sample, and whether we can extrapolate to
future cases, or to, say, different countries or different circumstances, is
often quite uncertain.)
